Gradient Boosting Machines - Albert Masoliver's learning site

## Definition **Gradient Boosting Machines (GBM)** generalise boosting to arbitrary differentiable loss functions by treating boosting as **gradient descent in function space**. Each new weak learner fits the *negative gradient* of the loss with respect to the current ensemble's predictions. Introduced by Friedman (2001). ## The Mathematical Insight For a model $F(x)$ minimising loss $\mathcal{L}(y, F(x))$: $ F_t(x) = F_{t-1}(x) - \eta \cdot h_t(x) $ where $h_t$ is a weak learner fitted to approximate $\partial \mathcal{L} / \partial F$ — i.e., the *residual* between current prediction and truth. ## Algorithm ``` F_0(x) ← constant prediction (e.g., mean of y for regression) for t = 1 to T: compute "pseudo-residuals" r_i = -∂L(y_i, F_{t-1}(x_i)) / ∂F_{t-1}(x_i) train weak learner h_t to predict r_i from x_i γ_t ← line search: optimal step size F_t(x) ← F_{t-1}(x) + η · γ_t · h_t(x) ``` - $\eta$ — *shrinkage* / learning rate. Small (0.01-0.1) for better generalisation. - $T$ — number of rounds. ## Squared Error Special Case For MSE loss, the pseudo-residuals reduce to $y_i - F_{t-1}(x_i)$ — the actual residuals. Each tree predicts what the previous trees got wrong. ## Modern Variants ### [[XGBoost]] (Chen & Guestrin, 2016) Engineering tour-de-force: regularised objective, fast histogram-based split finding, sparsity-aware, parallelisable. Dominated Kaggle for years. ### LightGBM (Microsoft, 2017) Leaf-wise growth (vs level-wise), gradient-based one-side sampling. Faster training than XGBoost on large data. ### CatBoost (Yandex, 2017) Symmetric trees, ordered boosting (reduces target leakage in categorical encoding), excellent categorical handling out of the box. ## Hyperparameters That Matter - **Learning rate** ($\eta$, 0.01-0.3): small + many rounds usually beats large + few. - **Number of rounds**: use early stopping on validation. - **Max depth** of trees (3-10): controls per-round complexity. - **Subsample** (0.5-1.0): row sampling, like bagging. - **Colsample** (0.5-1.0): feature sampling per tree or per split. - **Regularisation** (L1, L2 on leaf values). ## Strengths - **State-of-the-art on tabular data** — dominant 2014-present. - **Handles mixed feature types**, missing values. - **Built-in feature importance.** - **Probabilistic output via cross-entropy loss.** ## Weaknesses - **Tuning matters more** than for Random Forest. - **Sequential** — slower training than RF. - **Overfits without early stopping.** - **No native uncertainty** — for that use natural gradient boosting or quantile regression variants. ## When NOT to Use - Unstructured data (images, audio, raw text) — deep networks win. - Streaming data with high churn — retraining cost matters. ## Related - [[XGBoost]] - [[Boosting]] - [[Random Forest]] - [[Decision Trees]] - [[Gradient Descent]]