Train-Validation-Test Split - Albert Masoliver's learning site

## Definition

The **train / validation / test split** partitions data into three disjoint sets, each with a distinct role: training learns parameters; validation tunes hyperparameters and selects models; test gives a single final estimate of generalisation. It is the cornerstone of good hygiene in supervised ML.

## The Three Roles

| Set | Typical share | Role | Used during |
|---|---|---|---|
| **Train** | ~60–80% | Fit model parameters | Training |
| **Validation** | ~10–20% | Tune hyperparameters, select models | Model development |
| **Test** | ~10–20% | Estimate generalisation error | Final evaluation only |

## Why Three, Not Two

A two-way split (train/test) is sufficient if you train *one* model with *no* tuning. The moment you compare two hyperparameter settings or two model families, choosing the one that does best on "test" *uses* the test set as part of the search, which leaks information and inflates the apparent performance. The third set is the firewall: tune on validation, measure on test *once*.

## Practical Rules

- **Test set is sacred.** Do not look at it. Do not use it to inform any decision. Run on it once, at the end, after all design choices are frozen.
- **No leakage.** Transformations whose parameters are learned from data (normalisation, encoding) must be fit on train only and then *applied* to validation and test, never fit on the full dataset.
- **Distributional realism.** Train, validation, and test should resemble the deployment distribution. If deployment is on next month's data, validate on this month's data and train on prior months.
- **Stratification.** For classification, stratify splits to preserve class proportions.

## Common Mistakes

- **Time leakage.** Randomly splitting a time series lets information from the future leak into training. Split chronologically instead.
- **Group leakage.** The same user appearing in both train and test inflates apparent generalisation. Split by *group* (user, patient, customer), not by row.
- **Repeated test set use.** "I only checked test a few times": each peek is information, and the test-set guarantee evaporates.

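
The stratification, no-leakage, and group-splitting rules above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a reference implementation: the function names (`stratified_three_way_split`, `group_three_way_split`, `fit_standardiser`, `apply_standardiser`), the 70/15/15 fractions, and the toy data are all assumptions made for the example. In practice, libraries such as scikit-learn provide equivalents (for instance `train_test_split` with its `stratify` argument, and `GroupShuffleSplit`).

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_three_way_split(labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Return (train_idx, val_idx, test_idx), splitting each class separately
    so that class proportions are preserved in all three sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_frac)
        n_test = round(len(idxs) * test_frac)
        val.extend(idxs[:n_val])
        test.extend(idxs[n_val:n_val + n_test])
        train.extend(idxs[n_val + n_test:])
    return sorted(train), sorted(val), sorted(test)


def group_three_way_split(groups, val_frac=0.15, test_frac=0.15, seed=0):
    """Split by group id (user, patient, ...) so that no group spans two sets."""
    rng = random.Random(seed)
    unique = sorted(set(groups))
    rng.shuffle(unique)
    n_val = round(len(unique) * val_frac)
    n_test = round(len(unique) * test_frac)
    val_groups = set(unique[:n_val])
    test_groups = set(unique[n_val:n_val + n_test])

    train, val, test = [], [], []
    for i, g in enumerate(groups):
        if g in val_groups:
            val.append(i)
        elif g in test_groups:
            test.append(i)
        else:
            train.append(i)
    return train, val, test


def fit_standardiser(train_values):
    """Learn mean and standard deviation from the TRAINING values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero
    return mu, sigma


def apply_standardiser(values, mu, sigma):
    """Apply statistics learned on train to any split."""
    return [(v - mu) / sigma for v in values]


# Toy data (assumed): 100 rows, one numeric feature, 80/20 class imbalance,
# and 20 hypothetical users with 5 rows each.
features = [float(i) for i in range(100)]
labels = [0] * 80 + [1] * 20
user_ids = [i // 5 for i in range(100)]

train_idx, val_idx, test_idx = stratified_three_way_split(labels)

# Fit on train, then apply the same statistics to validation and test.
mu, sigma = fit_standardiser([features[i] for i in train_idx])
x_train = apply_standardiser([features[i] for i in train_idx], mu, sigma)
x_val = apply_standardiser([features[i] for i in val_idx], mu, sigma)
x_test = apply_standardiser([features[i] for i in test_idx], mu, sigma)

# Group-wise alternative: every user's rows land in exactly one set.
g_train, g_val, g_test = group_three_way_split(user_ids)
```

The scaler statistics come from `train_idx` alone, so validation and test rows never influence the preprocessing. The two split functions answer different questions: use the stratified one when rows are independent and classes are imbalanced, and the group-wise one when several rows belong to the same entity. The test indices would be touched once, after every design choice is frozen.
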
## Variants

- **[[K-Fold Cross-Validation]]**: uses every example for both training and validation in turn; for hyperparameter tuning when data is scarce.
- **Nested cross-validation**: outer loop for test, inner loop for validation. The gold standard but expensive.
- **Time-series cross-validation**: fold structure respects temporal order.

## Default Splits

- **Large data (≥100k):** ~80/10/10.
- **Medium data:** ~70/15/15.
- **Small data:** k-fold CV instead of a single split.

## Related

- [[Cross-Validation]]
- [[K-Fold Cross-Validation]]
- [[Overfitting and Underfitting]]