## Definition
The **train / validation / test split** partitions data into three disjoint sets, each with a distinct role: the training set fits model parameters; the validation set tunes hyperparameters and selects models; the test set gives a single final estimate of generalisation. It is the cornerstone of evaluation hygiene in supervised ML.
## The Three Roles
| Set | Role | Used during |
|---|---|---|
| **Train** (~60–80%) | Fit model parameters | Training |
| **Validation** (~10–20%) | Tune hyperparameters, select models | Model development |
| **Test** (~10–20%) | Estimate generalisation error | Final evaluation only |
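
A minimal sketch of the three-way partition, using only the Python standard library. The function name `three_way_split` and the 70/15/15 defaults are assumptions for illustration; a library helper such as scikit-learn's `train_test_split` (called twice) is a common alternative.

```python
import random

def three_way_split(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Return disjoint (train, val, test) lists of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything left over
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
```

Splitting indices rather than rows keeps the sketch independent of how the data is stored; the same index lists can then select rows from any array or table.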
## Why Three, Not Two
A two-way split (train/test) is sufficient if you train *one* model with *no* tuning. The moment you compare two hyperparameter settings or two model families, choosing the one that does best on "test" *uses* the test set as part of the search — leaking information and inflating the apparent performance.
The third set is the firewall: tune on validation, measure on test *once*.
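
The workflow can be sketched as follows. This is a toy illustration, not a recommended model: the 1-D k-nearest-neighbour classifier, the synthetic data, and the candidate values of `k` are all assumptions chosen to keep the example self-contained.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0

def accuracy(train, eval_rows, k):
    """Fraction of eval_rows classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Synthetic data (an assumption): two overlapping 1-D Gaussian classes.
rng = random.Random(0)
rows = [(rng.gauss(label * 2.0, 1.0), label) for label in (0, 1) for _ in range(150)]
rng.shuffle(rows)
train, val, test = rows[:200], rows[200:250], rows[250:]

# Model selection touches only the validation set.
candidate_ks = [1, 3, 5, 9, 15]  # odd values avoid tied votes
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)  # first maximum wins on ties

# The test set is evaluated exactly once, after best_k is frozen.
test_accuracy = accuracy(train, test, best_k)
```

The validation score of `best_k` is optimistically biased because it was the maximum over several candidates; `test_accuracy` is the number to report.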
## Practical Rules
- **Test set is sacred.** Do not look at it. Do not use it to inform any decision. Run on it once, at the end, after all design choices are frozen.
- **No leakage.** Preprocessing that learns statistics from data (normalisation, encoding) must be fit on train only and then *applied* to validation/test.
- **Distributional realism.** Train, validation, and test should each resemble the deployment distribution. If the model will be deployed on next month's data, validate on this month's and train on prior months.
- **Stratification.** For classification, stratify splits to preserve class proportions.
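
The leakage and stratification rules above can be sketched with the standard library alone. The helper names (`stratified_split`, `fit_scaler`, `apply_scaler`) and the toy data are assumptions for illustration; for brevity the split here is two-way, and the same idea extends to three sets.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split row indices so each class keeps roughly its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

def fit_scaler(values):
    """Learn standardisation statistics from the training values only."""
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # guard against a constant feature
    return mean(values), sigma

def apply_scaler(values, scaler):
    """Standardise values with statistics learned elsewhere (no refitting)."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Toy data (an assumption): 90 negatives, 10 positives, one numeric feature.
labels = [0] * 90 + [1] * 10
feature = [float(i) for i in range(100)]

train_idx, test_idx = stratified_split(labels, test_frac=0.2)
scaler = fit_scaler([feature[i] for i in train_idx])        # fit on train
train_scaled = apply_scaler([feature[i] for i in train_idx], scaler)
test_scaled = apply_scaler([feature[i] for i in test_idx], scaler)  # apply only
```

The test features are transformed with the training mean and standard deviation, so their own mean will generally not be exactly zero; that is expected and is what deployment will look like.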
## Common Mistakes
- **Time leakage.** Randomly splitting time-series data lets information from the future leak into training.
- **Group leakage.** The same user appearing in both train and test inflates apparent generalisation. Split by *group* (user, patient, customer), not by row.
- **Repeated test set use.** "I only checked test a few times" is still leakage: each peek feeds information back into your decisions, and the test score stops being an unbiased estimate of generalisation.
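
A minimal sketch of splits that avoid the group and time leakage described above. The record layout (`"user"` and `"month"` keys) and the function names are hypothetical; scikit-learn's `GroupKFold` and `GroupShuffleSplit` cover the grouped case in practice.

```python
import random

def group_split(rows, group_key, test_frac=0.2, seed=0):
    """Assign whole groups to train or test so no group spans both."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

def time_split(rows, time_key, val_start, test_start):
    """Train on the past, validate on the next period, test on the latest."""
    train = [r for r in rows if r[time_key] < val_start]
    val = [r for r in rows if val_start <= r[time_key] < test_start]
    test = [r for r in rows if r[time_key] >= test_start]
    return train, val, test

# Hypothetical records: 10 users, one observation per user per month index.
rows = [{"user": u, "month": m} for u in range(10) for m in range(12)]

g_train, g_test = group_split(rows, "user")
t_train, t_val, t_test = time_split(rows, "month", val_start=8, test_start=10)
```

Note that `test_frac` in `group_split` is a fraction of groups, not rows, so the row-level proportions can drift when group sizes are uneven.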
## Variants
- **[[K-Fold Cross-Validation]]** — uses every example for both training and validation in turn; for hyperparameter tuning when data is scarce.
- **Nested cross-validation** — outer loop for test, inner loop for validation. The gold standard but expensive.
- **Time-series cross-validation** — fold structure respects temporal order.
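
The fold structures behind two of these variants can be sketched as index generators. The function names and the expanding-window scheme are assumptions for illustration (scikit-learn provides `KFold` and `TimeSeriesSplit` for real use), and `k_fold_indices` assumes the rows have already been shuffled.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def expanding_window_folds(n_rows, n_folds, val_size):
    """Yield time-ordered folds: train on all rows before the validation block."""
    first_val_start = n_rows - n_folds * val_size
    if first_val_start <= 0:
        raise ValueError("not enough rows for the requested folds")
    for i in range(n_folds):
        val_start = first_val_start + i * val_size
        yield list(range(val_start)), list(range(val_start, val_start + val_size))

kfolds = list(k_fold_indices(10, k=3))
ts_folds = list(expanding_window_folds(12, n_folds=3, val_size=2))
```

Nested cross-validation would call `k_fold_indices` again on each outer `train_idx`, using the inner folds for tuning and the outer validation block as that round's test set.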
## Default Splits
- **Large data (≥100k examples):** ~80/10/10.
- **Medium data:** ~70/15/15.
- **Small data:** k-fold CV instead of a single split.
## Related
- [[Cross-Validation]]
- [[K-Fold Cross-Validation]]
- [[Overfitting and Underfitting]]