## Definition

The **bias-variance tradeoff** decomposes a model's expected prediction error into three components (squared *bias*, *variance*, and *irreducible noise*) and exposes the central tension of supervised learning: reducing bias often increases variance, and vice versa.

## Decomposition

For squared-error loss on a target $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, and a model $\hat f$ learnt from a training set $D$:

$$
\mathbb{E}_{D,\epsilon}[(y - \hat f(x))^2] = \underbrace{(\mathbb{E}_D[\hat f(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat f(x) - \mathbb{E}_D[\hat f(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}
$$

The expectation on the left is taken over both the training set $D$ and the noise $\epsilon$ at the test point $x$.

## What Each Term Means

- **Bias:** how far the *average* prediction (over training sets) is from the true value $f(x)$. High bias means systematic error and reflects an inability of the model class to capture the truth.
- **Variance:** how much predictions fluctuate across different training sets. High variance means sensitivity to the particular data the model happened to see.
- **Noise:** irreducible. Even a perfect model ($\hat f = f$) cannot predict $\epsilon$.

## The Tradeoff

Increasing model complexity (a richer [[Hypothesis Space]]):

- **Decreases bias:** the model can express more.
- **Increases variance:** the model fits idiosyncrasies of the particular training set.

The *total* error has a minimum somewhere in the middle.

```
Error
  │
  │   Total error (U-shaped)
  │       \_____/
  │                ___ Variance
  │              /
  │            /
  │  ____ Bias²
  └──────────────── Model complexity
```

## High Bias vs High Variance: Symptoms

| Symptom                            | Likely cause                |
| ---------------------------------- | --------------------------- |
| Bad on training, bad on test       | High bias (underfitting)    |
| Excellent on training, bad on test | High variance (overfitting) |
| Adding data helps                  | Variance problem            |
| Adding features helps              | Bias problem                |
| Regularisation helps               | Variance problem            |

## Mitigations

- **High bias:** richer model, better features, more flexible function family.
- **High variance:** more data, regularisation ([[L2 Regularization]]), simpler model, ensembling ([[Bagging]], since averaging reduces variance directly).

## Modern Caveats

In overparameterised deep networks, the classical U-curve does not always hold. In the "**double descent**" phenomenon, test error *can decrease again* past the interpolation threshold (the point where the model can fit the training set perfectly) as model size grows further. The decomposition itself remains valid; what changes is how bias and variance behave as model size increases. This is one of the most important empirical findings of the deep learning era, and the classical tradeoff curve must be supplemented to explain it.

## Related

- [[Overfitting and Underfitting]]
- [[Hypothesis Space]]
- [[Regularization]]
- [[Generalization]]
- [[Bagging]]
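
## Worked Example: Estimating Bias² and Variance

The decomposition can be checked numerically. The sketch below is one possible illustration, not a canonical recipe. It assumes a hypothetical true function $f(x) = \sin(2\pi x)$, Gaussian noise with $\sigma = 0.3$, and polynomial least-squares fits as the model family. As a simplification, the training inputs are held fixed and only the noise is redrawn for each simulated training set. The function name `estimate_bias_variance` and all parameter values are illustrative choices.

```python
import numpy as np


def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def estimate_bias_variance(degree, n_train=30, n_datasets=500, sigma=0.3, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise, and total squared error
    for a polynomial of the given degree, averaged over a grid of test points."""
    rng = np.random.default_rng(seed)
    # Assumption: fixed training inputs; only the noise differs between datasets.
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)

    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0.0, sigma, n_train)
        model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[i] = model(x_test)

    mean_pred = preds.mean(axis=0)                       # estimate of E_D[f_hat(x)]
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    noise = sigma ** 2

    # Direct estimate of the left-hand side using fresh noisy targets.
    y_new = true_f(x_test) + rng.normal(0.0, sigma, preds.shape)
    mse = np.mean((y_new - preds) ** 2)
    return bias_sq, variance, noise, mse


if __name__ == "__main__":
    for degree in (1, 3, 9):
        b, v, n, mse = estimate_bias_variance(degree)
        print(f"degree={degree}: bias^2={b:.4f} variance={v:.4f} "
              f"noise={n:.4f} sum={b + v + n:.4f} mse={mse:.4f}")
```

Under these assumptions, the low-degree fit should show the larger bias² and the high-degree fit the larger variance, and the sum of the three terms should land close to the directly measured squared error. The agreement is only approximate because both sides are Monte Carlo estimates.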