## Definition
The **bias-variance tradeoff** decomposes a model's expected prediction error into three components (*bias*, *variance*, and *irreducible noise*) and exposes the central tension of supervised learning: reducing bias often increases variance, and vice versa.
## Decomposition
For squared-error loss on a target $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, and a model $\hat f$ learnt from a training set $D$, the expected error at a fixed input $x$, taken over both $D$ and the noise $\epsilon$, is:
$$
\mathbb{E}_{D,\epsilon}[(y - \hat f(x))^2] = \underbrace{(\mathbb{E}_D[\hat f(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat f(x) - \mathbb{E}_D[\hat f(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}
$$
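The identity above can be derived in a few lines. The sketch below assumes (as is standard, though not stated explicitly here) that the noise $\epsilon$ at the test point is independent of the training set $D$; the shorthand $\bar f(x)$, $a$ and $b$ is introduced only for this sketch.

$$
\begin{aligned}
\bar f(x) &:= \mathbb{E}_D[\hat f(x)], \qquad a := f(x) - \bar f(x), \qquad b := \bar f(x) - \hat f(x) \\
y - \hat f(x) &= a + b + \epsilon \\
\mathbb{E}_{D,\epsilon}[(y - \hat f(x))^2] &= a^2 + \mathbb{E}_D[b^2] + \mathbb{E}[\epsilon^2] + 2a\,\mathbb{E}_D[b] + 2a\,\mathbb{E}[\epsilon] + 2\,\mathbb{E}_{D,\epsilon}[b\,\epsilon] \\
&= \underbrace{a^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[b^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}
\end{aligned}
$$

The three cross terms vanish because $\mathbb{E}_D[b] = 0$ by the definition of $\bar f(x)$, $\mathbb{E}[\epsilon] = 0$, and $\mathbb{E}_{D,\epsilon}[b\,\epsilon] = \mathbb{E}_D[b]\,\mathbb{E}[\epsilon] = 0$ under the independence assumption.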
## What Each Term Means
- **Bias** — how far off the *average* prediction is from the true value. High bias = systematic error. Reflects an inability of the model class to capture the truth.
- **Variance** — how much predictions fluctuate across different training sets. High variance = sensitivity to particular data.
- **Noise** — irreducible: even the perfect model can't predict $\epsilon$.
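The three terms can be estimated numerically by simulating many training sets. The following is a minimal sketch under stated assumptions, not a reference implementation: the ground truth `f`, the noise level, the polynomial model and the helper name `decompose` are all hypothetical choices made for illustration, and the training inputs are held fixed so that training sets differ only in their noise.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def f(x):
    """Hypothetical ground-truth function for the simulation."""
    return np.sin(2 * np.pi * x)

def decompose(degree, n_train=30, n_sets=500, sigma=0.3, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise and measured total
    error for a polynomial fit, averaged over a grid of test inputs."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed design (an assumption)
    x_test = np.linspace(0.05, 0.95, 50)

    # One row of predictions per simulated training set D.
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        y_train = f(x_train) + rng.normal(0.0, sigma, n_train)
        coefs = P.polyfit(x_train, y_train, degree)
        preds[i] = P.polyval(x_test, coefs)

    mean_pred = preds.mean(axis=0)                 # estimate of E_D[f_hat(x)]
    bias2 = np.mean((mean_pred - f(x_test)) ** 2)  # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))          # spread across training sets
    noise = sigma ** 2                             # known here because we simulate

    # Measure the total error directly against fresh noisy targets.
    y_test = f(x_test) + rng.normal(0.0, sigma, preds.shape)
    total = np.mean((y_test - preds) ** 2)
    return bias2, variance, noise, total

bias2, variance, noise, total = decompose(degree=3)
print(f"bias^2={bias2:.4f}  variance={variance:.4f}  noise={noise:.4f}")
print(f"sum of terms={bias2 + variance + noise:.4f}  measured total={total:.4f}")
```

Up to Monte Carlo error, the sum of the three estimated terms should match the directly measured error, which is a useful sanity check on the decomposition.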
## The Tradeoff
Increasing model complexity (richer [[Hypothesis Space]]):
- **Decreases bias** — the model can express more.
- **Increases variance** — the model fits idiosyncrasies of the particular training set.
The *total* error is therefore typically U-shaped: it reaches its minimum at an intermediate complexity, where neither term dominates.
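```
Error
│
│ '-._                  _.-'   Total error (U-shaped)
│     '-._          _.-'
│         '-.____.-'
│
│ '-._                  _.-'   Variance (rises with complexity)
│     '-._          _.-'
│         '-._  _.-'
│         _.-'  '-._
│     _.-'          '-._
│ _.-'                  '-._   Bias² (falls with complexity)
└──────────────────────────── Model complexity
```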
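The shape of these curves can be reproduced in a small simulation by sweeping a complexity knob. The sketch below uses polynomial degree as a stand-in for model complexity; the target function, noise level, sample sizes and the helper name `bias_variance` are assumptions made for illustration, and the exact numbers depend on them.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def bias_variance(degree, n_train=30, n_sets=300, sigma=0.3, seed=0):
    """Estimate bias^2, variance and expected error for one polynomial degree."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)       # hypothetical ground truth
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed design (an assumption)
    x_test = np.linspace(0.05, 0.95, 50)

    # Each row of y_train is one simulated training set.
    y_train = f(x_train) + rng.normal(0.0, sigma, (n_sets, n_train))
    # polyfit accepts a 2-D y of shape (n_train, n_sets) and fits each column.
    coefs = P.polyfit(x_train, y_train.T, degree)  # shape (degree + 1, n_sets)
    preds = P.polyval(x_test, coefs)               # shape (n_sets, n_test)

    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance, bias2 + variance + sigma ** 2

degrees = [1, 3, 5, 9]
results = {d: bias_variance(d) for d in degrees}
for d, (b, v, t) in results.items():
    print(f"degree={d}: bias^2={b:.4f}  variance={v:.4f}  expected error={t:.4f}")

best = min(results, key=lambda d: results[d][2])
print("lowest expected error at degree", best)
```

In this toy setup the estimated bias shrinks and the estimated variance grows as the degree increases, so the lowest expected error falls at an intermediate degree rather than at either extreme.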
## High Bias vs High Variance — Symptoms
| Symptom | Likely cause |
| -------------------------------- | ---------------------- |
| Bad on training, bad on test | High bias (underfitting) |
| Excellent on training, bad on test | High variance (overfitting) |
| Adding data helps | Variance problem |
| Adding features helps | Bias problem |
| Regularisation helps | Variance problem |
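The first two rows of the table can be turned into a rough diagnostic by comparing training and test error. The sketch below is illustrative only: `diagnose` and its thresholds are hypothetical rules of thumb, and it compares against a known noise variance, which is available in a simulation but usually not in practice.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def f(x):
    """Hypothetical ground-truth function."""
    return np.sin(2 * np.pi * x)

def train_test_mse(degree, n_train=15, n_test=200, n_repeats=200, sigma=0.3, seed=0):
    """Average training and test MSE of a polynomial fit over repeated simulations."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, n_test)
    train_err, test_err = [], []
    for _ in range(n_repeats):
        y_train = f(x_train) + rng.normal(0.0, sigma, n_train)
        y_test = f(x_test) + rng.normal(0.0, sigma, n_test)
        coefs = P.polyfit(x_train, y_train, degree)
        train_err.append(np.mean((y_train - P.polyval(x_train, coefs)) ** 2))
        test_err.append(np.mean((y_test - P.polyval(x_test, coefs)) ** 2))
    return float(np.mean(train_err)), float(np.mean(test_err))

def diagnose(train_mse, test_mse, noise_var):
    """Hypothetical rule of thumb; the factors 1.5 and 2.0 are arbitrary."""
    if train_mse > 1.5 * noise_var:
        return "high bias (underfitting)"     # bad on training, bad on test
    if test_mse > 2.0 * train_mse:
        return "high variance (overfitting)"  # good on training, bad on test
    return "no clear problem"

for degree in (1, 3, 10):
    tr, te = train_test_mse(degree)
    print(f"degree={degree}: train={tr:.3f}  test={te:.3f}  ->  {diagnose(tr, te, 0.3 ** 2)}")
```

On real data the noise floor is unknown, so the same comparison is usually made against a baseline model or a target error level instead.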
## Mitigations
- **High bias:** richer model, better features, more flexible function family.
- **High variance:** more data, regularisation ([[L2 Regularization]]), simpler model, ensembling ([[Bagging]] — averages reduce variance directly).
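As an illustration of the regularisation entry, the sketch below fits a deliberately flexible polynomial with an L2 penalty and estimates how bias and variance move as the penalty grows. It uses the closed-form ridge solution in plain NumPy; the target function, the penalty values, the choice to penalise the intercept as well, and the helper name `ridge_bias_variance` are simplifying assumptions.

```python
import numpy as np

def ridge_bias_variance(lam, degree=9, n_train=30, n_sets=300, sigma=0.3, seed=0):
    """Estimate bias^2 and variance of ridge-regularised polynomial regression."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)       # hypothetical ground truth
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed design (an assumption)
    x_test = np.linspace(0.05, 0.95, 50)

    # Design matrices with columns 1, x, ..., x^degree.
    X = np.vander(x_train, degree + 1, increasing=True)
    X_test = np.vander(x_test, degree + 1, increasing=True)

    # Each column of Y is one simulated training set.
    Y = f(x_train)[:, None] + rng.normal(0.0, sigma, (n_train, n_sets))

    # Closed-form ridge for all training sets at once:
    # W = (X^T X + lam * I)^{-1} X^T Y  (the intercept is penalised too).
    W = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ Y)
    preds = (X_test @ W).T                    # shape (n_sets, n_test)

    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

lambdas = [1e-6, 1e-3, 1.0]
ridge_results = {lam: ridge_bias_variance(lam) for lam in lambdas}
for lam, (b, v) in ridge_results.items():
    print(f"lambda={lam:g}: bias^2={b:.4f}  variance={v:.4f}")
```

A stronger penalty lowers the variance of the fit at the cost of extra bias, so in practice the penalty is chosen by validation rather than pushed as high as possible.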
## Modern Caveats
In overparameterised deep networks, the classical U-curve need not hold. In the "**double descent**" phenomenon, test error peaks near the interpolation threshold (where the model can just fit the training set perfectly) and *can decrease again* as model size grows further. The decomposition itself still holds as an identity; what fails is the assumption that variance keeps rising with model size, so the classical tradeoff curve must be supplemented to explain this regime.
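Double descent can also be seen in a far simpler setting than a deep network. The sketch below is a toy linear model, not a neural network: it fits minimum-norm least squares using only the first `p` of `D` Gaussian features and reports the expected test error as `p` passes the number of training points `n`. The data-generating process, the sizes and the helper name `min_norm_risk` are assumptions chosen for illustration.

```python
import numpy as np

def min_norm_risk(p, n=40, D=100, sigma=0.5, n_trials=50, seed=0):
    """Average test risk of min-norm least squares on the first p of D features."""
    rng = np.random.default_rng(seed)
    beta = np.ones(D) / np.sqrt(D)  # hypothetical true weights with unit norm
    risks = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, D))
        y = X @ beta + rng.normal(0.0, sigma, n)
        # lstsq returns the least-squares fit for p < n and the
        # minimum-norm interpolating solution for p > n.
        w_p, *_ = np.linalg.lstsq(X[:, :p], y, rcond=None)
        w = np.zeros(D)
        w[:p] = w_p                 # unused features get weight zero
        # For test inputs x ~ N(0, I), the expected squared error of the
        # fitted model is exactly ||w - beta||^2 + sigma^2.
        risks.append(np.sum((w - beta) ** 2) + sigma ** 2)
    return float(np.mean(risks))

n_train = 40
sizes = [5, 20, 40, 60, 100]
risks = {p: min_norm_risk(p, n=n_train) for p in sizes}
for p, r in risks.items():
    marker = "  <- interpolation threshold (p = n)" if p == n_train else ""
    print(f"p={p:3d}: test risk={r:.3f}{marker}")
```

In this toy model the risk spikes when `p` equals `n` and falls again as more features are added, which mirrors the qualitative shape described above without claiming anything about how deep networks achieve it.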
## Related
- [[Overfitting and Underfitting]]
- [[Hypothesis Space]]
- [[Regularization]]
- [[Generalization]]
- [[Bagging]]