K-Fold Cross-Validation - Albert Masoliver's learning site

## Definition **K-fold cross-validation** partitions data into $k$ equal-size folds. For each fold $i = 1, \dots, k$: train on the other $k-1$ folds and validate on fold $i$. Average the $k$ validation scores for an estimate of generalisation performance. ## Procedure ``` shuffle data partition into k folds D_1, ..., D_k for i = 1 to k: train_set ← ∪_{j ≠ i} D_j val_set ← D_i fit model on train_set record score_i ← score(model, val_set) return mean(score_1, ..., score_k), std(score_1, ..., score_k) ``` The standard deviation across folds estimates the *uncertainty* of the mean — invaluable when comparing close model candidates. ## Choosing $k$ - **$k = 5$:** the common workhorse. Cheap, low variance, ~80/20 train/val per fold. - **$k = 10$:** the textbook recommendation; slightly tighter estimates. - **$k = n$ (LOOCV):** asymptotically unbiased; expensive and surprisingly high variance. - **$k = 2$:** minimal — equivalent to two evaluations of a 50/50 split. The literature shows $k = 5$ or $k = 10$ consistently strikes the best bias-variance balance. ## Stratified K-Fold For classification, stratify each fold to preserve the class distribution of the full dataset. Critical for imbalanced classes: a random split could create folds that don't contain the rare class at all. ## Repeated K-Fold Run $k$-fold CV multiple times with different random partitions; average the results. Reduces variance from a particular partition choice. Common in academic papers reporting close comparisons. ## When K-Fold Is Wrong - **Time-series data** — random partitioning leaks the future. Use forward-chaining CV instead. - **Grouped data** (multiple rows per user) — leakage between folds. Use Group K-Fold. - **Very large datasets** — a single hold-out is typically enough; k-fold overhead doesn't pay. - **Extremely small datasets** — LOOCV may be necessary; k-fold's variance can dominate. ## Cost in Practice $k$ training runs. For models that take hours to train, $k = 5$ means days of compute. Compromises: - **Single hold-out** with a large enough validation set. - **Approximate k-fold** with reduced model capacity during validation, full training for the final model. - **Bayesian optimisation** for hyperparameter search, which adapts the number of evaluations. ## Related - [[Cross-Validation]] - [[Train-Validation-Test Split]] - [[Overfitting and Underfitting]]