## Definition
**Regularisation** is any modification to a learning algorithm intended to reduce its generalisation error without necessarily reducing its training error. It is the standard tool against [[Overfitting and Underfitting|overfitting]].
## The Canonical Form
Add a penalty term to the loss:
$$
\hat \theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(f_\theta(x_i), y_i) + \lambda \, \Omega(\theta)
$$
- $\lambda > 0$ — regularisation strength.
- $\Omega$ — penalty function on parameters.
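
The objective above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear model $f_\theta(x) = x^\top \theta$ and squared-error loss; the names `regularised_objective` and `l2_penalty` are hypothetical and not taken from any library.

```python
import numpy as np

def regularised_objective(theta, X, y, lam, penalty):
    """Mean squared-error loss plus lam * penalty(theta)."""
    residuals = X @ theta - y
    data_loss = np.mean(residuals ** 2)      # (1/n) * sum_i L(f_theta(x_i), y_i)
    return data_loss + lam * penalty(theta)  # + lambda * Omega(theta)

def l2_penalty(theta):
    """Omega(theta) = squared L2 norm (assumed choice for this sketch)."""
    return float(np.sum(theta ** 2))

# Tiny hand-made dataset that theta fits exactly, so the data loss is zero.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])

unpenalised = regularised_objective(theta, X, y, lam=0.0, penalty=l2_penalty)
penalised = regularised_objective(theta, X, y, lam=0.1, penalty=l2_penalty)
```

Because `theta` interpolates the data here, the only contribution to `penalised` is $\lambda \, \Omega(\theta) = 0.1 \times 5$.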
## Common Penalties
- **[[L2 Regularization]] (Ridge)** — $\Omega(\theta) = \|\theta\|_2^2$. Shrinks all coefficients toward zero without eliminating them.
- **[[L1 Regularization]] (Lasso)** — $\Omega(\theta) = \|\theta\|_1$. Drives many coefficients to *exactly* zero — implicit feature selection.
- **[[Elastic Net]]** — combination of L1 and L2.
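
The difference between shrinking and zeroing is easiest to see in one dimension. The sketch below assumes a simplified data term $\tfrac{1}{2}(\theta - z)^2$ per coefficient, where $z$ is the unpenalised estimate (this corresponds to an orthonormal design, which is an assumption of the example, not of the note). The function names are hypothetical, and the Elastic Net is written as $\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$.

```python
import numpy as np

def ridge_shrink(z, lam):
    # argmin_theta 0.5*(theta - z)**2 + lam*theta**2
    return z / (1.0 + 2.0 * lam)

def lasso_shrink(z, lam):
    # argmin_theta 0.5*(theta - z)**2 + lam*|theta|  (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def elastic_net_shrink(z, lam1, lam2):
    # argmin_theta 0.5*(theta - z)**2 + lam1*|theta| + lam2*theta**2
    return lasso_shrink(z, lam1) / (1.0 + 2.0 * lam2)

z = np.array([3.0, -0.4, 0.2, -2.0])   # unpenalised coefficients (made up)
ridge = ridge_shrink(z, 0.5)           # every entry scaled down, none zero
lasso = lasso_shrink(z, 0.5)           # entries with |z| <= 0.5 become zero
enet = elastic_net_shrink(z, 0.5, 0.5)
```

Ridge multiplies every coefficient by the same factor, so small coefficients stay small but non-zero. Lasso subtracts a fixed amount from each magnitude and clips at zero, which is where the exact zeros come from.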
## Other Regularisation Mechanisms
Penalty terms are only one family. Others:
- **Early stopping.** Halt training before convergence; the partial fit acts as implicit regularisation.
- **Data augmentation.** Train on label-preserving transformed copies of the training examples; this enlarges the effective training set and encodes known invariances.
- **[[Dropout]].** Randomly mask neurons during training; prevents co-adaptation.
- **Weight constraints.** Hard-bound parameter norms.
- **Bayesian priors.** A prior distribution on the parameters penalises implausible values and so acts as a regulariser.
- **Ensembling.** Multiple models trained differently; averaging reduces variance.
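
Two of the mechanisms above, dropout and early stopping, can be sketched compactly. This is an illustrative sketch, not a reference implementation: it assumes the common "inverted" dropout variant (survivors are rescaled during training), a linear model trained by plain gradient descent, and a patience-based stopping rule. All names, the synthetic data, and the hyperparameter values are assumptions.

```python
import numpy as np

def inverted_dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations                      # no masking at evaluation time
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

def fit_with_early_stopping(X_tr, y_tr, X_val, y_val,
                            lr=0.01, max_steps=5000, patience=20):
    """Gradient descent on squared error; stop once validation loss stalls."""
    theta = np.zeros(X_tr.shape[1])
    best_theta, best_val, stale, steps_run = theta.copy(), np.inf, 0, 0
    for _ in range(max_steps):
        steps_run += 1
        grad = 2.0 * X_tr.T @ (X_tr @ theta - y_tr) / len(y_tr)
        theta = theta - lr * grad
        val_loss = np.mean((X_val @ theta - y_val) ** 2)
        if val_loss < best_val:                 # improvement: remember this fit
            best_theta, best_val, stale = theta.copy(), val_loss, 0
        else:                                   # no improvement: count it
            stale += 1
            if stale >= patience:
                break
    return best_theta, best_val, steps_run

rng = np.random.default_rng(0)                  # synthetic data, illustrative only
X = rng.normal(size=(60, 20))
y = X[:, 0] + rng.normal(scale=1.0, size=60)
X_tr, y_tr, X_val, y_val = X[:30], y[:30], X[30:], y[30:]

theta, val_loss, steps_run = fit_with_early_stopping(X_tr, y_tr, X_val, y_val)
dropped = inverted_dropout(np.ones(1000), drop_prob=0.5, rng=rng)
```

The early-stopping routine returns the parameters from the step with the lowest validation loss rather than the final ones, which is one common design choice among several.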
## The Bias-Variance Lens
Regularisation typically increases bias and decreases variance, which makes its strength the control knob of the [[Bias-Variance Tradeoff]]. Too much regularisation → underfitting; too little → overfitting. The right level is the one that minimises error on held-out validation data, not on the training set.
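
One side of this knob can be observed directly: as the penalty grows, the fit becomes stiffer. The sketch below assumes ridge regression with the closed-form minimiser of $\frac{1}{n}\|X\theta - y\|_2^2 + \lambda \|\theta\|_2^2$ on synthetic data; `ridge_fit` and the data are hypothetical. It tracks training error and coefficient norm only, so it shows the loss of flexibility, not a full bias-variance decomposition.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimiser of (1/n)*||X theta - y||^2 + lam*||theta||_2^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)          # synthetic data, illustrative only
X = rng.normal(size=(40, 15))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=40)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
train_mse, coef_norm = [], []
for lam in lambdas:
    theta = ridge_fit(X, y, lam)
    train_mse.append(np.mean((X @ theta - y) ** 2))   # never decreases with lam
    coef_norm.append(np.linalg.norm(theta))           # never increases with lam
```

Training error alone always prefers the smallest penalty, which is why the choice has to be made on held-out data.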
## Choosing $\lambda$
- **Grid search** over a log-scale (e.g., $10^{-4}, 10^{-3}, \dots, 10^2$).
- **Cross-validation** to score each $\lambda$.
- **Bayesian optimisation** for expensive models.
- **Early stopping** as an alternative to explicit $\lambda$ tuning: the number of training steps, chosen on validation data, takes the place of $\lambda$.
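
The first two items are usually combined: score every value on a logarithmic grid by cross-validation and keep the best. The sketch below is one way to do this by hand, assuming ridge regression, 5-fold cross-validation, and synthetic data. The helper names `ridge_fit` and `cv_score` are hypothetical; in practice a library routine would normally replace them.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimiser of (1/n)*||X theta - y||^2 + lam*||theta||_2^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def cv_score(X, y, lam, n_folds=5, seed=0):
    """Mean validation MSE of ridge with strength lam over n_folds folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        theta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ theta - y[val]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)              # synthetic data, illustrative only
X = rng.normal(size=(60, 30))
theta_true = np.zeros(30)
theta_true[:5] = 1.0
y = X @ theta_true + rng.normal(scale=1.0, size=60)

lambdas = np.logspace(-4, 2, 7)             # 1e-4, 1e-3, ..., 1e2
scores = [cv_score(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(scores))]
```

Using the same fold split (fixed `seed`) for every $\lambda$ keeps the comparison between grid points fair.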
## Mathematical Connection to Priors
Bayesian view: regularisation is a prior on parameters.
- L2 ↔ Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$.
- L1 ↔ Laplace prior $\theta \sim \text{Laplace}(0, b)$.
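Maximum-a-posteriori (MAP) estimation with these priors and a Gaussian likelihood (i.e. squared-error loss) recovers Ridge / Lasso exactly, with $\lambda$ set by the ratio of the noise variance to the prior scale. With other likelihoods the same priors give the L2- or L1-penalised version of the corresponding loss.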
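
A short derivation makes the correspondence concrete. It assumes additive Gaussian noise, $y_i = f_\theta(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$; the symbol $\sigma_\varepsilon$ is introduced here and is not part of the note's notation.

$$
\begin{aligned}
\hat\theta_{\text{MAP}}
&= \arg\max_\theta \; \log p(y \mid X, \theta) + \log p(\theta) \\
&= \arg\min_\theta \; \frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^n \big(y_i - f_\theta(x_i)\big)^2 + \frac{1}{2\sigma^2} \|\theta\|_2^2
&& \text{(Gaussian prior)} \\
&= \arg\min_\theta \; \frac{1}{n} \sum_{i=1}^n \big(y_i - f_\theta(x_i)\big)^2 + \frac{\sigma_\varepsilon^2}{n \sigma^2} \|\theta\|_2^2 .
\end{aligned}
$$

This matches the canonical form with $\lambda = \sigma_\varepsilon^2 / (n \sigma^2)$. Replacing the Gaussian prior with the Laplace prior, whose negative log-density is $\|\theta\|_1 / b$ up to a constant, gives the same objective with penalty $\|\theta\|_1$ and $\lambda = 2\sigma_\varepsilon^2 / (n b)$. In both cases a tighter prior (smaller $\sigma$ or $b$) means stronger regularisation.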
## Why Implicit Regularisation Matters in Deep Learning
Modern overparameterised networks have *more* parameters than training examples, enough to fit the training set exactly, yet they still generalise. Hypothesised mechanisms:
- **SGD's implicit bias** toward flat minima.
- **Architectural inductive bias** (convolutional structure, attention patterns).
- **Early stopping** during training.
The interplay of these "implicit" regularisers is an active research area.
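
A much simpler linear setting gives some intuition for what an implicit bias of the optimiser looks like. In underdetermined least squares, infinitely many parameter vectors fit the data exactly, yet gradient descent started from zero converges to the one with the smallest L2 norm, with no penalty term in the loss. The sketch below illustrates that linear fact only; it is not a claim about how deep networks behave, and the data and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                          # fewer data points than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

theta = np.zeros(d)                    # zero initialisation matters here
lr = 0.01
for _ in range(20000):
    # Plain gradient step on the unpenalised mean squared error.
    theta -= lr * 2.0 * X.T @ (X @ theta - y) / n

min_norm = np.linalg.pinv(X) @ y       # minimum L2-norm interpolating solution
gap = np.linalg.norm(theta - min_norm)
train_mse = np.mean((X @ theta - y) ** 2)
```

The iterates never leave the row space of `X`, which is why the solution reached is the minimum-norm one; a different initialisation would generally lead to a different interpolating solution.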
## Related
- [[L1 Regularization]]
- [[L2 Regularization]]
- [[Elastic Net]]
- [[Overfitting and Underfitting]]
- [[Bias-Variance Tradeoff]]
- [[Dropout]]