AdaBoost - Albert Masoliver's learning site

## Definition **AdaBoost** (Adaptive Boosting) is the original boosting algorithm (Freund & Schapire, 1995, Gödel Prize 2003). Trains a sequence of weak learners, each on a reweighted version of the training data emphasising previous errors. ## Algorithm ``` initialise sample weights w_i = 1/n for all i for t = 1 to T: train weak learner h_t on the weighted data compute weighted error ε_t = Σ_i w_i · [h_t(x_i) ≠ y_i] / Σ_i w_i if ε_t > 0.5: stop (weak learner is worse than chance) compute learner weight α_t = 0.5 · log((1 - ε_t) / ε_t) update sample weights: w_i ← w_i · exp(-α_t · y_i · h_t(x_i)) normalise w_i so they sum to 1 final prediction: sign(Σ_t α_t · h_t(x)) ``` ## Three Key Properties 1. **Weak learners' weights $\alpha_t$ depend on accuracy.** Better learners get more influence in the final vote. 2. **Misclassified examples get reweighted up.** The next learner focuses on what previous ones missed. 3. **Correctly-classified examples get reweighted down.** Confident regions get less attention. ## Theoretical Guarantee If each weak learner has accuracy > 50% (better than random), the training error of the ensemble decreases *exponentially* with $T$. A surprising and influential result — boosting can drive training error to zero with simple components. ## Why It Doesn't Always Overfit Despite increasing capacity with each round, AdaBoost often *continues to generalise* after training error reaches zero. The intuition: increasing margins on correctly-classified examples continues to improve test performance. ## Common Weak Learner: Decision Stumps A *decision stump* is a one-level decision tree. AdaBoost with stumps: - Each stump tests one feature against one threshold. - Hundreds of stumps combined yield a complex non-linear classifier. - Implicit feature selection via which features get used. ## Strengths - **Simple to implement.** - **Provably reduces training error.** - **Often generalises well** despite intuition about overfitting. - **Feature importance** falls out naturally. ## Weaknesses - **Sensitive to noisy data and outliers.** Misclassified noisy examples get reweighted up; AdaBoost spends capacity chasing noise. - **Less performant than gradient boosting** on most modern benchmarks. ## Historical Significance AdaBoost was the first practical boosting algorithm. It launched the entire boosting research programme that ultimately produced [[Gradient Boosting Machines]] and [[XGBoost]] — which now dominate the field. ## When to Use In 2026, AdaBoost is largely a pedagogical tool — gradient boosting variants outperform it on virtually all real problems. Worth knowing for theoretical grounding; rarely worth choosing in production. ## Related - [[Boosting]] - [[Gradient Boosting Machines]] - [[XGBoost]] - [[Decision Trees]]