## Definition
**L2 regularisation** (also: **Ridge regression**, *Tikhonov regularisation*, *weight decay*) adds the squared L2 norm of the parameters to the loss function:
$
\mathcal{L}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \|\theta\|_2^2 = \mathcal{L}_{\text{data}}(\theta) + \lambda \sum_j \theta_j^2
$
The penalty shrinks every coefficient toward zero in proportion to its magnitude, but in general never sets one exactly to zero. It is the standard form of regularisation when you have no reason to prefer L1.
## Why It Helps
- **Reduces variance** at the cost of a small increase in bias — the [[Bias-Variance Tradeoff]].
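- **Stabilises** ill-conditioned problems. For linear regression, when $X^\top X$ is singular or near-singular, adding $\lambda I$ with $\lambda > 0$ makes it invertible and better conditioned:
$
\hat\theta = (X^\top X + \lambda I)^{-1} X^\top y
$
This is the closed-form Ridge solution for the squared-error data loss $\mathcal{L}_{\text{data}}(\theta) = \|y - X\theta\|_2^2$.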
- **Discourages large weights**, which often correspond to fitting noise.
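
The closed-form solution and its stabilising effect can be sketched with NumPy. This is a minimal illustration, not a production implementation: the helper name `ridge_closed_form`, the synthetic data, and the choice $\lambda = 1$ are assumptions made for the example, and no intercept is fitted.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta (hypothetical helper)."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
# The second column is almost a copy of the first, so X^T X is near-singular.
x2 = x1 + 1e-6 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)

cond_plain = np.linalg.cond(X.T @ X)                    # very large
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))  # far smaller
theta = ridge_closed_form(X, y, lam=1.0)
print(cond_plain, cond_ridge, theta)
```

Comparing the two condition numbers shows how the added $\lambda I$ term turns a numerically fragile system into a well-behaved one.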
## Behavior with Correlated Features
L2 *distributes* coefficient mass among correlated features. Unlike L1, which tends to keep one feature from a correlated group and zero the rest somewhat arbitrarily, Ridge gives each correlated feature a moderate coefficient. This often makes it the more stable choice when features are highly correlated.
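
An extreme case makes the splitting behaviour visible: two identical feature columns. The sketch below assumes the squared-error Ridge objective with no intercept; the data, the true slope of 3, and $\lambda = 1$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([x, x])               # two perfectly correlated features
y = 3.0 * x + 0.1 * rng.normal(size=200)  # true combined effect is 3

lam = 1.0
# Ridge solution: (X^T X + lam * I) theta = X^T y
theta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(theta)  # two near-equal coefficients whose sum is close to 3
```

Without the penalty, $X^\top X$ is singular here and any split of the total effect between the two columns fits equally well; the L2 term selects the even split because it has the smallest squared norm.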
## Hyperparameters
- $\lambda$ — strength of regularisation. Higher $\lambda$ → stronger shrinkage → more bias, less variance.
- **Cross-validate** to choose. `RidgeCV` in scikit-learn does this automatically over a grid of candidate values (scikit-learn names the parameter `alpha` rather than $\lambda$).
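
A minimal sketch of choosing the strength with `RidgeCV`. The synthetic data and the candidate grid are assumptions for illustration; a sensible grid depends on the scale of the features and the noise level.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([2.0, -1.0, 0.0, 0.5, 0.0])  # made-up coefficients
y = X @ true_theta + 0.5 * rng.normal(size=100)

# Candidate strengths on a log scale; scikit-learn calls the strength `alpha`.
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas).fit(X, y)

print(model.alpha_)  # the strength selected by cross-validation
print(model.coef_)   # coefficients refit with that strength
```

With the default `cv=None`, `RidgeCV` uses an efficient form of leave-one-out cross-validation; passing an integer to `cv` switches to k-fold.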
## In Deep Learning: "Weight Decay"
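With plain SGD and learning rate $\eta$, adding the penalty $\lambda \|\theta\|_2^2$ to the loss is mathematically equivalent to multiplying the weights by $(1 - 2\lambda\eta)$ at each step, in addition to the usual data-loss gradient update, hence the term "weight decay". (The factor is $(1 - \lambda\eta)$ under the common convention that writes the penalty as $\frac{\lambda}{2}\|\theta\|_2^2$.) Weight decay is standard in nearly all deep learning training. The equivalence does not hold for adaptive optimisers such as Adam, where the penalty gradient is rescaled along with the rest of the update. AdamW instead decouples weight decay from the gradient-based update, which empirically performs better than coupling it.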
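
The SGD equivalence can be checked numerically in a few lines. This sketch assumes the penalty $\lambda \|\theta\|_2^2$ from the Definition, whose gradient is $2\lambda\theta$, so the shrink factor comes out as $(1 - 2\eta\lambda)$; the $(1 - \eta\lambda)$ form corresponds to a penalty written as $\frac{\lambda}{2}\|\theta\|_2^2$. The data loss, data, and step sizes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
theta = rng.normal(size=4)
eta, lam = 0.01, 0.1  # learning rate and regularisation strength

def data_grad(theta):
    # Gradient of the data loss ||X theta - y||^2 / n
    return 2.0 * X.T @ (X @ theta - y) / len(y)

# (a) One gradient step on the penalised loss: data loss + lam * ||theta||^2
step_penalised = theta - eta * (data_grad(theta) + 2.0 * lam * theta)

# (b) Shrink the weights, then subtract the data-loss gradient
#     (evaluated at the same, unshrunk theta)
step_decay = (1.0 - 2.0 * eta * lam) * theta - eta * data_grad(theta)

print(np.allclose(step_penalised, step_decay))  # prints True
```

The two updates are the same expression rearranged, which is why the identity holds for any differentiable data loss under plain SGD.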
## Bayesian Interpretation
L2 corresponds to a **Gaussian prior** on parameters:
$
p(\theta) \propto \exp\left(-\frac{\|\theta\|_2^2}{2\sigma^2}\right)
$
with $\lambda \propto 1/\sigma^2$. For linear regression with Gaussian noise, maximum-a-posteriori estimation under this prior recovers Ridge exactly.
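
A short derivation sketch of that correspondence, assuming a linear model $y = X\theta + \varepsilon$ with i.i.d. Gaussian noise of variance $\sigma_n^2$ (the symbol $\sigma_n$ is introduced here for illustration). The MAP estimate maximises the log-likelihood plus the log-prior:
$
\hat\theta_{\text{MAP}} = \arg\max_\theta \left[ \log p(y \mid X, \theta) + \log p(\theta) \right] = \arg\min_\theta \left[ \frac{1}{2\sigma_n^2} \|y - X\theta\|_2^2 + \frac{1}{2\sigma^2} \|\theta\|_2^2 \right]
$
Multiplying the objective by $2\sigma_n^2$ does not change the minimiser and gives the Ridge objective:
$
\hat\theta_{\text{MAP}} = \arg\min_\theta \left[ \|y - X\theta\|_2^2 + \frac{\sigma_n^2}{\sigma^2} \|\theta\|_2^2 \right]
$
Under these assumptions $\lambda = \sigma_n^2 / \sigma^2$: a tighter prior (smaller $\sigma$) or noisier data (larger $\sigma_n$) both mean stronger regularisation.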
## Trade-offs vs L1
See [[L1 Regularization]] for a side-by-side comparison. Key differences:
- **L2 shrinks all coefficients; L1 zeroes many.**
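- **For linear regression with $\lambda > 0$, L2 (Ridge) has a unique closed-form solution; L1 (Lasso) has no closed form in general and its solution may not be unique.**
- **The L2 penalty is differentiable everywhere; the L1 penalty is not differentiable at zero.**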
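
The shrink-versus-zero difference is easiest to see in the special case of an orthonormal design ($X^\top X = I$), where both problems separate per coordinate. The sketch below assumes the objectives $\|y - X\theta\|_2^2 + \lambda \|\theta\|_2^2$ and $\|y - X\theta\|_2^2 + \lambda \|\theta\|_1$; the function names and the example least-squares coefficients are hypothetical.

```python
import numpy as np

def ridge_orthonormal(theta_ols, lam):
    # Per-coordinate minimiser of (t - theta_ols)^2 + lam * t^2
    return theta_ols / (1.0 + lam)

def lasso_orthonormal(theta_ols, lam):
    # Per-coordinate minimiser of (t - theta_ols)^2 + lam * |t|:
    # soft-thresholding at lam / 2
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam / 2.0, 0.0)

theta_ols = np.array([3.0, -1.5, 0.4, -0.2])  # made-up unpenalised estimates
lam = 1.0
ridge = ridge_orthonormal(theta_ols, lam)
lasso = lasso_orthonormal(theta_ols, lam)
print(ridge)  # every entry halved, none exactly zero
print(lasso)  # the two small entries become exactly zero
```

Ridge rescales every coefficient by the same factor $1/(1+\lambda)$, while Lasso subtracts a fixed amount and clips at zero, which is where its sparsity comes from.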
## When to Combine
[[Elastic Net]] blends both penalties: L2's stability + L1's sparsity. Often outperforms either alone when features are both numerous and correlated.
## Related
- [[L1 Regularization]]
- [[Elastic Net]]
- [[Regularization]]
- [[Ridge Regression]]
- [[Bias-Variance Tradeoff]]