Cross-Validation - Albert Masoliver's learning site

## Definition

**Cross-validation (CV)** is a family of techniques that estimate a model's generalisation performance by repeatedly splitting the data into train/validation portions and averaging the results. It is the standard approach when a single hold-out split would waste too much data.

## Why Over a Single Split

A single 80/20 split estimates performance from only 20% of the data, so the estimate has high variance. Cross-validation uses every example for validation *at some point*, while never validating a model on examples it was trained on. The result is a lower-variance estimate from the same data budget.

## Variants

### Hold-Out

Single split. Simple, low cost, high variance. Reasonable for very large datasets where 20% is still huge.

### [[K-Fold Cross-Validation]]

Partition the data into $k$ folds. Train on $k-1$ folds, validate on the remaining one. Repeat $k$ times so that each fold serves as the validation set once, then average the $k$ scores. The standard choice is $k = 5$ or $k = 10$.

### Leave-One-Out (LOOCV)

Special case with $k = n$. Each example is its own validation fold. Almost unbiased but expensive ($n$ training runs), and the estimate can have surprisingly high variance because the $n$ training sets overlap almost entirely.

### Stratified k-Fold

For classification, ensure each fold preserves the class proportions of the full dataset. Always prefer this for imbalanced classes.

### Time-Series CV

Each fold uses past data for training and future data for validation. Critical for sequential / temporal data, where random splitting leaks the future into training.

### Group k-Fold

Each *group* (user, patient, session) appears in exactly one fold. Prevents within-group leakage.

## What CV Estimates

CV gives an estimate of the **expected performance** of the modelling procedure (training algorithm + hyperparameters), not of a particular trained model. The model you ultimately deploy is trained on all the data, so its actual performance can differ slightly from the CV estimate. Because each fold's model sees only $(k-1)/k$ of the data, the CV estimate tends to be slightly pessimistic for that final model.

## Cost

CV multiplies training time by roughly the number of folds. For expensive models (deep networks), full $k$-fold CV may be impractical; use a fixed validation split instead.

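
The k-fold procedure from the Variants section, and the "one training run per fold" cost noted above, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a library API: `k_fold_indices`, `cv_score`, the toy mean-predictor, and the shuffle seed are all hypothetical choices made for this example.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold CV over n examples."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # random partition; seed is an assumption
    # Fold sizes differ by at most one when k does not divide n.
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size


def cv_score(fit, score, X, y, k=5, seed=0):
    """Average validation score of a modelling procedure over k folds."""
    scores = []
    for train, val in k_fold_indices(len(X), k, seed):
        # One training run per fold: this is where the k-fold cost comes from.
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
    return mean(scores), scores


# Toy procedure (hypothetical): predict the training-set mean of y,
# and score with mean absolute error on the validation fold.
def fit_mean(X_train, y_train):
    return mean(y_train)


def mae(model, X_val, y_val):
    return mean(abs(v - model) for v in y_val)


X = list(range(20))
y = [2.0 * x + 1.0 for x in X]
avg_mae, fold_maes = cv_score(fit_mean, mae, X, y, k=5)
print(len(fold_maes), round(avg_mae, 3))
```

Calling `cv_score(..., k=len(X))` gives LOOCV, since every fold then holds exactly one example. Random shuffling as done here is not appropriate for temporal or grouped data; those variants need their own splitting rule.
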
## Common Pitfalls

- **Leaky preprocessing.** Fit the scaler/encoder on the *training fold only*, not the entire dataset.
- **Hyperparameter selection bias.** Picking the best CV score over many hyperparameter settings gives an optimistic estimate. Use nested CV or a separate test set.
- **Random splitting on grouped data.** See Group k-Fold.

## Related

- [[K-Fold Cross-Validation]]
- [[Train-Validation-Test Split]]
- [[Overfitting and Underfitting]]
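
As an illustration of the leaky-preprocessing pitfall listed under Common Pitfalls, the sketch below standardises one feature for a single fold. The helper names (`fit_scaler`, `transform`), the data values, and the fold indices are hypothetical and chosen only to make the effect visible; the point is where the statistics are computed, not the scaler itself.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn standardisation parameters from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero spread
    return mu, sigma


def transform(values, scaler):
    """Apply previously learned parameters without refitting."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


feature = [1.0, 2.0, 3.0, 4.0, 100.0]   # last value is an outlier
train_idx, val_idx = [0, 1, 2, 3], [4]  # one hypothetical fold

train = [feature[i] for i in train_idx]
val = [feature[i] for i in val_idx]

# Correct: statistics come from the training fold only,
# then the same parameters are applied to the validation fold.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
val_scaled = transform(val, scaler)

# Leaky: statistics computed on all data, so the validation outlier
# has already shaped the scaling applied to the training fold.
leaky_scaler = fit_scaler(feature)

print("train-only mean:", scaler[0])
print("leaky mean:", leaky_scaler[0])
```

In a real CV loop the scaler would be refit inside every fold, so that each validation fold stays unseen until scoring.
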