Regularization - Albert Masoliver's learning site

## Definition

**Regularisation** is any modification to a learning algorithm intended to reduce its generalisation error without necessarily reducing its training error. It is the standard tool against [[Overfitting and Underfitting|overfitting]].

## The Canonical Form

Add a penalty term to the loss:

$$
\hat\theta = \arg\min_\theta \left[ \frac{1}{n} \sum_{i=1}^n L(f_\theta(x_i), y_i) + \lambda \, \Omega(\theta) \right]
$$

- $\lambda > 0$: regularisation strength.
- $\Omega$: penalty function on the parameters.

## Common Penalties

- **[[L2 Regularization]] (Ridge)**: $\Omega(\theta) = \|\theta\|_2^2$. Shrinks all coefficients toward zero without eliminating them.
- **[[L1 Regularization]] (Lasso)**: $\Omega(\theta) = \|\theta\|_1$. Drives many coefficients to *exactly* zero, which acts as implicit feature selection.
- **[[Elastic Net]]**: a weighted combination of the L1 and L2 penalties.

## Other Regularisation Mechanisms

Penalty terms are only one family. Others:

- **Early stopping.** Halt training before convergence; the partial fit acts as implicit regularisation.
- **Data augmentation.** Train on synthetic variants of the training examples, which reduces overfitting to the original samples.
- **[[Dropout]].** Randomly mask neurons during training; this prevents co-adaptation.
- **Weight constraints.** Impose hard bounds on parameter norms.
- **Bayesian priors.** Implicit regularisation through prior distributions on parameters.
- **Ensembling.** Train multiple models differently; averaging their predictions reduces variance.

## The Bias-Variance Lens

Regularisation typically increases bias and decreases variance: it is the knob of the [[Bias-Variance Tradeoff]]. Too much regularisation → underfitting; too little → overfitting. The right level is chosen by performance on a validation set.

## Choosing $\lambda$

- **Grid search** over a log scale (e.g., $10^{-4}, 10^{-3}, \dots, 10^2$).
- **Cross-validation** to score each $\lambda$.
- **Bayesian optimisation** for expensive models.
- **Early stopping** as an alternative to explicit $\lambda$ tuning.
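
The grid-search and cross-validation steps for choosing $\lambda$ can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes squared-error loss with the L2 penalty, so the canonical objective has the closed-form minimiser $(X^\top X + n\lambda I)\,\theta = X^\top y$. The helper names (`ridge_fit`, `cv_mse`, `select_lambda`) and the synthetic data are hypothetical and exist only for this example.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimise (1/n) * ||y - X @ theta||^2 + lam * ||theta||^2 in closed form."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X + n * lam * I) theta = X^T y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


def cv_mse(X, y, lam, k=5, seed=0):
    """Mean validation MSE of ridge regression over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ theta - y[val]) ** 2))
    return float(np.mean(errors))


def select_lambda(X, y, lambdas, k=5):
    """Score every candidate lambda by cross-validation and return the best one."""
    scores = [cv_mse(X, y, lam, k) for lam in lambdas]
    best = int(np.argmin(scores))
    return lambdas[best], scores


# Synthetic data (an assumption for illustration): few informative features,
# many noisy ones, and not many samples, so some regularisation should help.
rng = np.random.default_rng(42)
n, d = 60, 30
X = rng.normal(size=(n, d))
theta_true = np.zeros(d)
theta_true[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ theta_true + rng.normal(scale=1.0, size=n)

# Log-scale grid 1e-4, 1e-3, ..., 1e2, as in the list above.
lambdas = [10.0 ** p for p in range(-4, 3)]
best_lam, scores = select_lambda(X, y, lambdas)

for lam, score in zip(lambdas, scores):
    print(f"lambda={lam:g}  cv_mse={score:.3f}")
print("selected lambda:", best_lam)
```

The same folds are reused for every $\lambda$ (fixed `seed`), so the scores differ only because of the penalty strength and not because of the split. For Lasso or Elastic Net there is no closed form, and an iterative solver would replace `ridge_fit`.
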

## Mathematical Connection to Priors

Bayesian view: regularisation is a prior on parameters.

- L2 ↔ Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$.
- L1 ↔ Laplace prior $\theta \sim \text{Laplace}(0, b)$.

Maximum-a-posteriori (MAP) estimation adds the negative log-prior to the negative log-likelihood, and for these priors that term is an L2 or L1 penalty. With a Gaussian likelihood (squared-error loss), MAP recovers Ridge / Lasso exactly, with $\lambda$ determined by the noise variance and the prior scale ($\sigma^2$ or $b$).

## Why Implicit Regularisation Matters in Deep Learning

Modern overparameterised networks have *more* parameters than data points yet still generalise. Hypothesised mechanisms:

- **SGD's implicit bias** toward flat minima.
- **Architectural inductive bias** (convolutional structure, attention patterns).
- **Early stopping** during training.

The interplay of these "implicit" regularisers is an active research area.

## Related

- [[L1 Regularization]]
- [[L2 Regularization]]
- [[Elastic Net]]
- [[Overfitting and Underfitting]]
- [[Bias-Variance Tradeoff]]
- [[Dropout]]