## Definition
**K-fold cross-validation** partitions the data into $k$ folds of (approximately) equal size. For each fold $i = 1, \dots, k$, train on the other $k-1$ folds and validate on fold $i$. The average of the $k$ validation scores serves as an estimate of generalisation performance.
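Written as a formula, with $D_i$ denoting fold $i$, $f_{-i}$ the model fit on every fold except $D_i$, and $s_i$ its validation score (symbols introduced here for illustration):

$$
s_i = \operatorname{score}(f_{-i}, D_i), \qquad \widehat{\mathrm{CV}}_k = \frac{1}{k} \sum_{i=1}^{k} s_i
$$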
## Procedure
```
shuffle data
partition into k folds D_1, ..., D_k
for i = 1 to k:
    train_set ← ∪_{j ≠ i} D_j
    val_set ← D_i
    fit model on train_set
    record score_i ← score(model, val_set)
return mean(score_1, ..., score_k), std(score_1, ..., score_k)
```
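The standard deviation across folds shows how much the score varies from fold to fold and gives a rough sense of the *uncertainty* in the mean, which is useful when comparing close model candidates. Treat it as a heuristic: fold scores are not independent because their training sets overlap, so $\text{std}/\sqrt{k}$ tends to understate the true standard error.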
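The procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`) and the toy mean-predictor model are assumptions made for the example, and libraries such as scikit-learn provide a ready-made `KFold` splitter.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding gives fold sizes that differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Return (mean, std) of the k validation scores."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Training set: every fold except fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy model (an assumption for illustration): always predict the training mean.
def fit_mean(xs, ys):
    return statistics.mean(ys)


def neg_mse(model, xs, ys):
    # Higher is better, so return the negative mean squared error.
    return -statistics.mean((y - model) ** 2 for y in ys)


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
mean_score, std_score = cross_validate(xs, ys, fit_mean, neg_mse, k=5)
print(f"score = {mean_score:.2f} ± {std_score:.2f}")
```

Passing `fit` and `score` as functions keeps the splitting logic independent of any particular model.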
## Choosing $k$
- **$k = 5$:** the common workhorse. Cheap, low variance, ~80/20 train/val per fold.
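- **$k = 10$:** the textbook recommendation; each model trains on 90% of the data, so bias is slightly lower, at twice the cost of $k = 5$.
- **$k = n$ (LOOCV):** nearly unbiased, since each model trains on $n - 1$ samples; expensive ($n$ fits), and the estimate can have high variance because the $n$ training sets are almost identical.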
- **$k = 2$:** minimal — equivalent to two evaluations of a 50/50 split.
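In practice, $k = 5$ or $k = 10$ is the usual recommendation: empirical comparisons and standard textbooks tend to find these values a good compromise between bias, variance and compute, though the best choice depends on dataset size and model.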
## Stratified K-Fold
For classification, stratify each fold to preserve the class distribution of the full dataset. Critical for imbalanced classes: a random split could create folds that don't contain the rare class at all.
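One simple way to stratify is to deal each class's samples round-robin across the folds. The sketch below assumes hashable, sortable labels and uses a hypothetical helper name (`stratified_folds`); scikit-learn's `StratifiedKFold` is the usual library counterpart.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal this class round-robin, continuing where the previous class
        # stopped so that overall fold sizes stay balanced.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds


labels = ["neg"] * 45 + ["pos"] * 5  # toy data: 10% positive class
folds = stratified_folds(labels, k=5)
for i, fold in enumerate(folds):
    print(i, Counter(labels[j] for j in fold))
```

If a class has fewer than $k$ samples, some folds will still lack it; stratification cannot fix that, only a smaller $k$ or more data can.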
## Repeated K-Fold
Run $k$-fold CV multiple times with different random partitions; average the results. Reduces variance from a particular partition choice. Common in academic papers reporting close comparisons.
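A minimal sketch of the repetition loop, assuming a caller-supplied `evaluate(train_idx, val_idx)` function that fits a model and returns a score. The function name, the toy data and the mean-predictor "model" are all illustrative assumptions.

```python
import random
import statistics


def repeated_k_fold(n, k, repeats, evaluate, seed=0):
    """Run k-fold `repeats` times with fresh shuffles; return all k * repeats scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a new random partition on every repeat
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            train_idx = [j for f in range(k) if f != i for j in folds[f]]
            scores.append(evaluate(train_idx, folds[i]))
    return scores


# Hypothetical data and evaluator, for illustration only: the "model" is the
# training mean and the score is the negative mean squared error.
data_rng = random.Random(1)
ys = [data_rng.gauss(0.0, 1.0) for _ in range(30)]


def evaluate(train_idx, val_idx):
    mu = statistics.mean(ys[j] for j in train_idx)
    return -statistics.mean((ys[j] - mu) ** 2 for j in val_idx)


scores = repeated_k_fold(len(ys), k=5, repeats=10, evaluate=evaluate)
print(len(scores), statistics.mean(scores))
```

Averaging over all $k \times \text{repeats}$ scores smooths out the luck of any single partition, at a proportional increase in compute.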
## When K-Fold Is Wrong
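- **Time-series data:** random partitioning lets the model train on observations that come after the ones it is validated on. Use forward-chaining CV instead, where each validation block follows its training data in time.
- **Grouped data** (multiple rows per user): rows from the same group in both training and validation folds leak information. Use Group K-Fold, which keeps each group within a single fold.
- **Very large datasets:** a single hold-out is typically enough; k-fold overhead doesn't pay.
- **Extremely small datasets:** each fold holds only a handful of samples, so per-fold scores are noisy and k-fold's variance can dominate; LOOCV may be necessary.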
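The first two alternatives can be sketched as follows. Both functions are illustrative assumptions rather than a fixed recipe: the forward-chaining block size and the greedy group balancing are just simple choices, and scikit-learn offers `TimeSeriesSplit` and `GroupKFold` for real use.

```python
from collections import defaultdict


def forward_chaining_splits(n, n_splits):
    """Yield (train, val) index lists where validation always follows training.

    Assumes samples are already sorted by time. The block size
    n // (n_splits + 1) is one simple choice; leftover samples go to the
    initial training block.
    """
    block = n // (n_splits + 1)
    first_train_end = n - n_splits * block
    for s in range(n_splits):
        train_end = first_train_end + s * block
        yield list(range(train_end)), list(range(train_end, train_end + block))


def group_folds(groups, k):
    """Assign whole groups to folds so that no group is split across folds."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    for g in sorted(members, key=lambda g: -len(members[g])):
        min(folds, key=len).extend(members[g])
    return folds


for train, val in forward_chaining_splits(n=10, n_splits=3):
    print("train", train, "val", val)

users = ["a", "a", "b", "b", "b", "c", "d", "d", "e"]  # toy group labels
print(group_folds(users, k=3))
```

In the forward-chaining splits the training window grows with each step, so later folds see more history than earlier ones.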
## Cost in Practice
K-fold costs $k$ training runs per configuration evaluated. For models that take hours to train, $k = 5$ combined with even a modest hyperparameter search can mean days of compute. Compromises:
- **Single hold-out** with a large enough validation set.
- **Approximate k-fold** with reduced model capacity during validation, full training for the final model.
- **Bayesian optimisation** for hyperparameter search, which picks each new configuration based on previous results and so typically needs fewer k-fold evaluations than an exhaustive grid.
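A back-of-the-envelope estimate makes the cost concrete. The function name and the numbers are illustrative assumptions; the estimate treats every fit as equally expensive and sequential, which slightly overstates the fold fits since each trains on only $(k-1)/k$ of the data.

```python
def cv_cost_hours(hours_per_fit, k, n_configs=1, repeats=1, final_refit=True):
    """Total sequential training time for a k-fold search, in hours."""
    fits = k * repeats * n_configs + (1 if final_refit else 0)
    return fits * hours_per_fit


# Illustrative numbers only: a 3-hour model, 5 folds, 8 hyperparameter settings.
print(cv_cost_hours(3, k=5))               # → 18 (one configuration)
print(cv_cost_hours(3, k=5, n_configs=8))  # → 123 (small search, about 5 days)
```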
## Related
- [[Cross-Validation]]
- [[Train-Validation-Test Split]]
- [[Overfitting and Underfitting]]