Generalization - Albert Masoliver's learning site

## Definition

**Generalisation** is the ability of a model to perform well on data it has not seen during training. This is the whole point of supervised learning: low training error is necessary but not sufficient; generalisation is the goal.

## The Generalisation Gap

$$
\text{gap} = R(\hat f) - \hat R(\hat f)
$$

where $R$ is the true (population) risk, $\hat R$ is the empirical (training) risk, and $\hat f$ is the fitted model. Only $\hat R$ can be measured directly, so the gap has to be controlled by other means: bounded in theory or estimated on held-out data.

## Theoretical Bounds

Classical learning theory (PAC, VC) bounds the gap, with high probability, by quantities like:

$$
R(\hat f) \leq \hat R(\hat f) + O\left(\sqrt{\frac{d_{VC}}{n}}\right)
$$

with $n$ samples and $d_{VC}$ the VC dimension of the hypothesis class. The takeaway: the gap shrinks with more data and grows with hypothesis-class complexity.

## Practical Estimation

These bounds are usually too loose to be informative. In practice we estimate generalisation empirically:

1. **Hold out** a test set never seen during training or model selection.
2. **Cross-validate** ([[K-Fold Cross-Validation]]) when data is scarce.
3. **Compare** training and validation performance throughout training; a widening difference signals overfitting.

## Conditions for Generalisation

A model generalises only when:

- **Train and test distributions match** (the i.i.d. assumption). Distribution shift breaks this.
- **The hypothesis class** can express the true relationship.
- **The training data** is sufficient to identify the right hypothesis.
- **The optimiser** finds a hypothesis close to the best in the class.

## Distribution Shift: The Quiet Killer

Models trained on one distribution often fail on slightly different deployment data:

- **Covariate shift**: $P(X)$ changes; $P(Y \mid X)$ is unchanged.
- **Concept drift**: $P(Y \mid X)$ changes (for example, user preferences evolve).
- **Domain shift**: the data comes from a different source (lab vs production data).

Monitoring deployment performance and retraining are the operational answer.

## Double Descent

Modern overparameterised models often *do* generalise despite having more parameters than examples. Test error can fall a second time once model capacity passes the point where the training data is fitted exactly, which is the "double descent" phenomenon.
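
A minimal sketch of the overparameterised regime, under assumptions that are not from this note: a synthetic linear task with isotropic Gaussian inputs, 20 parameters, 5 noisy examples, and plain gradient descent started from zero (which converges to the minimum-norm interpolating solution). It does not reproduce the full double descent curve; it only illustrates that fitting the training data exactly need not be catastrophic. All names and constants are illustrative.

```python
import random


def min_norm_fit(X, y, steps=5000):
    """Gradient descent on 0.5 * ||Xw - y||^2, started from w = 0.

    Started at zero, the iterates stay in the row space of X, so for an
    underdetermined system they approach the minimum-norm interpolant.
    """
    p = len(X[0])
    # 1 / ||X||_F^2 never exceeds 1 / lambda_max(X^T X), so the iteration is stable.
    lr = 1.0 / sum(v * v for row in X for v in row)
    w = [0.0] * p
    for _ in range(steps):
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) - yi for row, yi in zip(X, y)
        ]
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(residuals, X))
            w[j] -= lr * grad_j
    return w


rng = random.Random(0)
n, p, noise_sd = 5, 20, 0.1  # four times more parameters than examples

w_true = [rng.gauss(0, 1) for _ in range(p)]
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(a * b for a, b in zip(row, w_true)) + rng.gauss(0, noise_sd) for row in X]

w_hat = min_norm_fit(X, y)

# Empirical (training) risk: essentially zero, noise included.
train_mse = sum(
    (sum(a * b for a, b in zip(row, w_hat)) - yi) ** 2 for row, yi in zip(X, y)
) / n

# For x ~ N(0, I), the excess population risk of a linear predictor w
# is ||w - w_true||^2, so it can be computed without a test sample.
excess_fit = sum((a - b) ** 2 for a, b in zip(w_hat, w_true))
excess_zero = sum(b * b for b in w_true)  # baseline: always predict 0

print(f"train MSE               : {train_mse:.2e}")
print(f"excess risk, interpolant: {excess_fit:.3f}")
print(f"excess risk, zero model : {excess_zero:.3f}")
```

With only 5 examples the interpolant is still far from `w_true`, but it beats the trivial predictor even though it fits the label noise exactly.
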
The classical bound predicts disaster in this regime; in practice, overfitting is often benign under the right conditions. The mechanisms are still being studied (candidate explanations include the implicit regularisation of SGD and the neural tangent kernel).

## Related

- [[Bias-Variance Tradeoff]]
- [[Overfitting and Underfitting]]
- [[Hypothesis Space]]
- [[Cross-Validation]]
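
## Worked Example: Estimating the Gap

The hold-out procedure from Practical Estimation can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended pipeline: the task (a noisy threshold on one feature), the 1-nearest-neighbour model, and the 300/100 split are all hypothetical choices made so that the gap is easy to see. The training error stands in for $\hat R(\hat f)$ and the held-out error is an estimate of $R(\hat f)$.

```python
import random

rng = random.Random(0)


def sample(n, flip=0.2):
    """Hypothetical task: label is 1 when x > 0.5, flipped with probability `flip`."""
    out = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < flip:
            y = 1 - y
        out.append((x, y))
    return out


def nearest_label(x, train):
    """Label of the training point closest to x (1-nearest-neighbour)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def error_rate(data, train):
    """Fraction of points in `data` misclassified by 1-NN fitted on `train`."""
    wrong = sum(nearest_label(x, train) != y for x, y in data)
    return wrong / len(data)


data = sample(400)
train, test = data[:300], data[300:]  # the test split is never used for fitting

train_error = error_rate(train, train)  # empirical risk: 1-NN memorises its training set
test_error = error_rate(test, train)    # hold-out estimate of the population risk
gap = test_error - train_error

print(f"train error: {train_error:.3f}")
print(f"test error : {test_error:.3f}")
print(f"gap        : {gap:.3f}")
```

Because 1-NN reproduces every training label, including the flipped ones, its training error says nothing about performance on new data; only the held-out error does.
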