## Definition
**Gradient Boosting Machines (GBM)** generalise boosting to arbitrary differentiable loss functions by treating boosting as **gradient descent in function space**. Each new weak learner fits the *negative gradient* of the loss with respect to the current ensemble's predictions. Introduced by Friedman (2001).
## The Mathematical Insight
For a model $F(x)$ minimising loss $\mathcal{L}(y, F(x))$:
$
F_t(x) = F_{t-1}(x) - \eta \cdot h_t(x)
$
where $h_t$ is a weak learner fitted to approximate $\partial \mathcal{L} / \partial F$ — i.e., the *residual* between current prediction and truth.
## Algorithm
```
F_0(x) ← constant prediction (e.g., mean of y for regression)
for t = 1 to T:
compute "pseudo-residuals" r_i = -∂L(y_i, F_{t-1}(x_i)) / ∂F_{t-1}(x_i)
train weak learner h_t to predict r_i from x_i
γ_t ← line search: optimal step size
F_t(x) ← F_{t-1}(x) + η · γ_t · h_t(x)
```
- $\eta$ — *shrinkage* / learning rate. Small (0.01-0.1) for better generalisation.
- $T$ — number of rounds.
## Squared Error Special Case
For MSE loss, the pseudo-residuals reduce to $y_i - F_{t-1}(x_i)$ — the actual residuals. Each tree predicts what the previous trees got wrong.
## Modern Variants
### [[XGBoost]] (Chen & Guestrin, 2016)
Engineering tour-de-force: regularised objective, fast histogram-based split finding, sparsity-aware, parallelisable. Dominated Kaggle for years.
### LightGBM (Microsoft, 2017)
Leaf-wise growth (vs level-wise), gradient-based one-side sampling. Faster training than XGBoost on large data.
### CatBoost (Yandex, 2017)
Symmetric trees, ordered boosting (reduces target leakage in categorical encoding), excellent categorical handling out of the box.
## Hyperparameters That Matter
- **Learning rate** ($\eta$, 0.01-0.3): small + many rounds usually beats large + few.
- **Number of rounds**: use early stopping on validation.
- **Max depth** of trees (3-10): controls per-round complexity.
- **Subsample** (0.5-1.0): row sampling, like bagging.
- **Colsample** (0.5-1.0): feature sampling per tree or per split.
- **Regularisation** (L1, L2 on leaf values).
## Strengths
- **State-of-the-art on tabular data** — dominant 2014-present.
- **Handles mixed feature types**, missing values.
- **Built-in feature importance.**
- **Probabilistic output via cross-entropy loss.**
## Weaknesses
- **Tuning matters more** than for Random Forest.
- **Sequential** — slower training than RF.
- **Overfits without early stopping.**
- **No native uncertainty** — for that use natural gradient boosting or quantile regression variants.
## When NOT to Use
- Unstructured data (images, audio, raw text) — deep networks win.
- Streaming data with high churn — retraining cost matters.
## Related
- [[XGBoost]]
- [[Boosting]]
- [[Random Forest]]
- [[Decision Trees]]
- [[Gradient Descent]]