## Definition
**Generalisation** is the ability of a model to perform well on data it hasn't seen during training. It is the whole point of supervised learning: low training error is necessary but not sufficient, and generalisation is the goal.
## The Generalisation Gap
$$
\text{gap} = R(\hat f) - \hat R(\hat f)
$$
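where $\hat f$ is the learned model, $R$ is the true (population) risk and $\hat R$ is the empirical (training) risk. Only $\hat R$ can be measured directly on the training set; $R$, and therefore the gap, must be bounded theoretically or estimated from held-out data.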
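As a rough illustration of the definition, the gap can be estimated by substituting a held-out error for $R$. The sketch below is a toy setup, not a recommended recipe: the synthetic data (a threshold rule with 20% label noise), the 1-nearest-neighbour model, and all function names are assumptions made for the example.

```python
import random

def make_data(n, rng, noise=0.2):
    """Draw x ~ U(0, 1) with label 1[x > 0.5], flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def error_rate(train, data):
    """Fraction of points in `data` misclassified by 1-NN fitted on `train`."""
    wrong = sum(predict_1nn(train, x) != y for x, y in data)
    return wrong / len(data)

rng = random.Random(0)
train = make_data(200, rng)
held_out = make_data(2000, rng)  # never used for fitting

train_risk = error_rate(train, train)        # empirical risk, R-hat
held_out_risk = error_rate(train, held_out)  # estimate of the true risk R
gap_estimate = held_out_risk - train_risk

print(f"training risk: {train_risk:.3f}")
print(f"held-out risk: {held_out_risk:.3f}")
print(f"estimated gap: {gap_estimate:.3f}")
```

Because 1-NN memorises the training set (including its noisy labels), the training risk is zero while the held-out risk is not, so the training risk alone says nothing about generalisation here.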
## Theoretical Bounds
Classical learning theory (PAC, VC) bounds the gap by quantities like:
$$
R(\hat f) \leq \hat R(\hat f) + O\left(\sqrt{\frac{d_{VC}}{n}}\right)
$$
which holds with high probability over the draw of the $n$ training samples, where $d_{VC}$ is the VC dimension of the hypothesis class. The takeaway: the gap shrinks with more data and grows with hypothesis-class complexity.
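To get a feel for the rate, the snippet below evaluates only the $\sqrt{d_{VC}/n}$ term, with the constants and logarithmic factors hidden by the $O(\cdot)$ dropped. The function name `complexity_term` and the chosen values of $d_{VC}$ and $n$ are illustrative assumptions, so the numbers show scaling only and are not actual bounds.

```python
import math

def complexity_term(d_vc, n):
    """sqrt(d_vc / n): the rate inside the O(.) term, constants and log factors omitted."""
    return math.sqrt(d_vc / n)

# Fixed complexity, growing sample size: the term shrinks like 1 / sqrt(n).
for n in (100, 10_000, 1_000_000):
    print(f"d_vc=10,  n={n:>9}: {complexity_term(10, n):.4f}")

# Fixed sample size, growing complexity: the term grows like sqrt(d_vc).
for d_vc in (10, 100, 1_000):
    print(f"d_vc={d_vc:<4} n=10000: {complexity_term(d_vc, 10_000):.4f}")
```

Quadrupling $n$ only halves the term, which is one reason such bounds need very large samples before they say anything useful.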
## Practical Estimation
These bounds are usually too loose to be informative for real models. In practice we estimate generalisation empirically:
1. **Hold out** a test set never seen during training or model selection.
2. **Cross-validate** ([[K-Fold Cross-Validation]]) when data is scarce.
3. **Compare** training and validation performance throughout training; a widening difference between the two signals overfitting.
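The cross-validation step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the helper names (`k_fold_indices`, `fit_line`, `cross_val_mse`), the one-dimensional least-squares model, and the synthetic data are all hypothetical and chosen only to keep the example self-contained. In a real project a library routine would normally replace the hand-written splitter.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k nearly equal, disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(points):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def mse(model, points):
    """Mean squared error of the line `model` = (a, b) on `points`."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

def cross_val_mse(points, k, rng):
    """Fit on k-1 folds, score on the remaining fold, and average over all k folds."""
    folds = k_fold_indices(len(points), k, rng)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_line([points[j] for j in train_idx])
        scores.append(mse(model, [points[j] for j in val_idx]))
    return sum(scores) / k, scores

# Synthetic data: y = 2x + 1 plus Gaussian noise with standard deviation 0.5.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)))

mean_mse, fold_mses = cross_val_mse(data, k=5, rng=rng)
print(f"5-fold CV MSE: {mean_mse:.3f}")
```

Every point is used for validation exactly once, so the averaged score uses all the data while never scoring a model on points it was fitted to. Any final test set should still be kept separate from this loop.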
## Conditions for Generalisation
A model generalises only when:
- **Train and test distributions match** (i.i.d. assumption). Distribution shift breaks this.
- **The hypothesis class** can express the true relationship.
- **The training data** is sufficient to identify the right hypothesis.
- **The optimiser** finds a hypothesis close to the best in the class.
## Distribution Shift — The Quiet Killer
Models trained on one distribution often fail on slightly different deployment data:
- **Covariate shift** — $P(X)$ changes; $P(Y \mid X)$ unchanged.
- **Concept drift** — $P(Y \mid X)$ changes (user preferences evolve).
- **Domain shift** — different sources (lab vs production data).
Monitoring deployment performance and retraining are the operational answer.
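A toy illustration of covariate shift, under assumptions made purely for the example: the true relationship is $y = x^2$ plus small noise (so $P(Y \mid X)$ never changes), a straight line is fitted on $x \in [0, 1]$, and the same fitted line is then scored on $x \in [2, 3]$. The function names and ranges are hypothetical.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def mse(model, points):
    """Mean squared error of the line `model` = (a, b) on `points`."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

def sample(lo, hi, n, rng):
    """Draw x ~ U(lo, hi) and y = x^2 + noise; only P(X) depends on (lo, hi)."""
    out = []
    for _ in range(n):
        x = rng.uniform(lo, hi)
        out.append((x, x * x + rng.gauss(0, 0.05)))
    return out

rng = random.Random(0)
model = fit_line(sample(0.0, 1.0, 500, rng))          # training distribution

in_dist_mse = mse(model, sample(0.0, 1.0, 500, rng))  # same P(X) as training
shifted_mse = mse(model, sample(2.0, 3.0, 500, rng))  # shifted P(X)

print(f"in-distribution MSE: {in_dist_mse:.4f}")
print(f"shifted MSE:         {shifted_mse:.4f}")
```

The line is a reasonable local approximation on the training range, so a held-out set drawn from that range looks fine. The error only becomes visible once inputs move outside it, which is why held-out evaluation alone cannot replace monitoring in deployment.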
## Double Descent
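Modern overparameterised models often *do* generalise despite having more parameters than training examples. In the "double descent" phenomenon, test error rises as model capacity approaches the interpolation threshold (the point at which the model can just fit the training data exactly), then falls again as capacity grows further. Classical bounds such as the VC bound above become vacuous in this regime, yet in practice overfitting can be benign under the right conditions. The mechanisms are still being studied; proposed explanations include the implicit regularisation of SGD and neural tangent kernel analyses.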
## Related
- [[Bias-Variance Tradeoff]]
- [[Overfitting and Underfitting]]
- [[Hypothesis Space]]
- [[Cross-Validation]]