L2 Regularization - Albert Masoliver's learning site

## Definition

**L2 regularisation** (also: **Ridge regression**, *Tikhonov regularisation*, *weight decay*) adds the squared L2 norm of the parameters to the loss function:

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \|\theta\|_2^2 = \mathcal{L}_{\text{data}}(\theta) + \lambda \sum_j \theta_j^2
$$

The penalty shrinks all coefficients toward zero in proportion to their magnitude, but never exactly to zero. It is the standard form of regularisation when you have no reason to prefer L1.

## Why It Helps

- **Reduces variance** at the cost of a small increase in bias (the [[Bias-Variance Tradeoff]]).
- **Stabilises** ill-conditioned problems. For linear regression with the sum-of-squares loss $\|y - X\theta\|_2^2$, when $X^\top X$ is near-singular, adding $\lambda I$ with $\lambda > 0$ makes it invertible:

  $$
  \hat\theta = (X^\top X + \lambda I)^{-1} X^\top y
  $$

  That's the closed-form Ridge solution.
- **Discourages large weights**, which often correspond to fitting noise.

## Behavior with Correlated Features

L2 *distributes* coefficient mass among correlated features. Unlike L1 (which tends to keep one of them, more or less arbitrarily, and zero out the rest), Ridge gives each correlated feature a moderate coefficient. This often makes it the more stable choice when features are highly correlated.

## Hyperparameters

- $\lambda$: strength of regularisation. Higher $\lambda$ → stronger shrinkage → more bias, less variance.
- **Cross-validate** to choose it. `RidgeCV` in scikit-learn does this automatically (scikit-learn calls $\lambda$ `alpha`).

## In Deep Learning: "Weight Decay"

For plain SGD with learning rate $\eta$, adding L2 to the loss is mathematically equivalent to multiplying the weights by a constant factor slightly below 1 at each step, hence the term "weight decay". With the penalty $\lambda \|\theta\|_2^2$ defined above, the factor is $(1 - 2\lambda\eta)$; under the common $\tfrac{\lambda}{2}\|\theta\|_2^2$ convention it is $(1 - \lambda\eta)$. Weight decay is standard in nearly all deep learning training. The equivalence does not hold for adaptive optimisers such as Adam, so modern optimisers like AdamW decouple weight decay from the gradient update, which empirically performs better than coupling it.

## Bayesian Interpretation

L2 corresponds to a zero-mean **Gaussian prior** on parameters:

$$
p(\theta) \propto \exp\left(-\frac{\|\theta\|_2^2}{2\sigma^2}\right)
$$

with $\lambda \propto 1/\sigma^2$.
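
The proportionality can be made concrete with a short derivation sketch. It assumes a linear model with i.i.d. Gaussian noise of variance $\sigma_n^2$ (the symbol $\sigma_n$ is introduced here for illustration and does not appear elsewhere in this note) and the sum-of-squares data loss:

$$
\begin{aligned}
-\log p(\theta \mid X, y) &= -\log p(y \mid X, \theta) - \log p(\theta) + \text{const} \\
&= \frac{1}{2\sigma_n^2}\|y - X\theta\|_2^2 + \frac{1}{2\sigma^2}\|\theta\|_2^2 + \text{const}
\end{aligned}
$$

Multiplying by $2\sigma_n^2$ leaves the minimiser unchanged and gives $\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2$ with $\lambda = \sigma_n^2 / \sigma^2$ under these assumptions: a tighter prior (smaller $\sigma$) means stronger regularisation.
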
Maximum-a-posteriori estimation with a Gaussian prior (and a Gaussian likelihood) recovers Ridge exactly.

## Trade-offs vs L1

See [[L1 Regularization]] for a side-by-side comparison. Key differences:

- **L2 shrinks all coefficients; L1 zeroes many.**
- **For linear regression with $\lambda > 0$, L2 has a unique closed-form solution; L1 has no closed form in general, and its solution may not be unique.**
- **The L2 penalty is differentiable everywhere; the L1 penalty is not differentiable at zero.**

## When to Combine

[[Elastic Net]] blends both penalties: L2's stability + L1's sparsity. It often outperforms either alone when features are both numerous and correlated.

## Related

- [[L1 Regularization]]
- [[Elastic Net]]
- [[Regularization]]
- [[Ridge Regression]]
- [[Bias-Variance Tradeoff]]
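
## Worked Example

A minimal sketch of two claims from this note, in plain Python with no third-party packages. The function names (`ridge_two_features`, `sgd_step_l2_in_loss`, `sgd_step_weight_decay`) and the toy data are hypothetical, chosen only for illustration. It assumes the sum-of-squares data loss, so the closed form is $(X^\top X + \lambda I)^{-1} X^\top y$, and it uses the penalty $\lambda \theta^2$ exactly as defined in the Definition section, which makes the SGD decay factor $(1 - 2\lambda\eta)$.

```python
def ridge_two_features(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for exactly two features.

    X is a list of (x0, x1) rows and y a list of targets. The 2x2 system
    is inverted by hand so the sketch needs no third-party packages.
    """
    a = sum(x0 * x0 for x0, _ in X) + lam          # (X^T X)[0][0] + lam
    b = sum(x0 * x1 for x0, x1 in X)               # off-diagonal entry
    d = sum(x1 * x1 for _, x1 in X) + lam          # (X^T X)[1][1] + lam
    g0 = sum(x0 * t for (x0, _), t in zip(X, y))   # (X^T y)[0]
    g1 = sum(x1 * t for (_, x1), t in zip(X, y))   # (X^T y)[1]
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X + lam * I is singular; no unique solution")
    return (d * g0 - b * g1) / det, (a * g1 - b * g0) / det


def sgd_step_l2_in_loss(theta, grad_data, lam, eta):
    """One SGD step on L_data + lam * theta**2 (penalty gradient: 2 * lam * theta)."""
    return theta - eta * (grad_data + 2 * lam * theta)


def sgd_step_weight_decay(theta, grad_data, lam, eta):
    """The same step written as a multiplicative decay plus the data-gradient step."""
    return (1 - 2 * eta * lam) * theta - eta * grad_data


# Toy data: two identical (perfectly correlated) features, target y = 2 * x.
xs = [1.0, 2.0, 3.0]
X = [(x, x) for x in xs]
y = [2.0 * x for x in xs]

if __name__ == "__main__":
    try:
        ridge_two_features(X, y, lam=0.0)
    except ValueError as err:
        print("lam = 0.0:", err)
    print("lam = 0.1:", ridge_two_features(X, y, lam=0.1))
    print("L2 in loss:  ", sgd_step_l2_in_loss(0.5, 0.3, lam=0.01, eta=0.1))
    print("weight decay:", sgd_step_weight_decay(0.5, 0.3, lam=0.01, eta=0.1))
```

With $\lambda = 0$ the duplicated feature makes $X^\top X$ singular, so there is no unique least-squares solution. With $\lambda = 0.1$ the system becomes solvable and the two coefficients come out equal, each just under 1, so together they account for the true slope of 2. The two SGD step functions agree up to floating-point rounding.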