## Definition
**Stacking** (also: stacked generalisation, Wolpert 1992) combines multiple diverse base models by training a *meta-learner* on their predictions. Unlike [[Bagging]] (same model, different data) and [[Boosting]] (sequential weak learners), stacking blends *different model families* through learned weights.
## Architecture
```
Level-0 (base learners): M_1, M_2, ..., M_k
↓
Their predictions form the input to:
↓
Level-1 (meta-learner): M_meta
↓
Final prediction
```
## Training Procedure (with CV to Avoid Leakage)
```
for each fold in K-fold CV:
train base learners M_1...M_k on training portion
predict on held-out portion → out-of-fold (OOF) predictions
collect OOF predictions for all examples → meta-features
train meta-learner M_meta on (OOF predictions, true labels)
retrain each base learner on the full training set
```
At inference: base learners predict, meta-learner combines.
## Why Cross-Validation Matters
If base learners predict on data they were trained on, the meta-learner overfits to optimistic base predictions. Cross-validated OOF predictions are the **only** correct way to generate meta-features — a subtle but critical step.
## Choosing Base Learners
Diversity beats individual strength:
- Combine **different model families** (linear, tree-based, kernel-based, neural).
- Combine **same family with different hyperparameters** (shallow + deep trees).
- Goal: each base learner makes *different mistakes*.
If two base learners always agree, stacking can't blend them productively.
## Choosing the Meta-Learner
Usually simple:
- **Logistic regression** for classification.
- **Linear regression** (often ridge) for regression.
A simple meta-learner reduces overfitting on the limited meta-features. Complex meta-learners (another GBM) can overfit the small set of base-learner predictions.
## When Stacking Wins
- Kaggle and similar competitions where the last 0.5% of accuracy matters.
- Problems where no single model dominates — diverse strengths exist to be combined.
## When It Doesn't Win
- Production systems where simplicity and maintainability matter — one tuned XGBoost usually beats a stacked ensemble's operational overhead.
- When base learners aren't diverse — adding more correlated models doesn't help.
## Blending (Simpler Variant)
Use a single hold-out set instead of CV for OOF predictions. Faster, simpler, smaller effective training set for the meta-learner. Common in time-constrained competition settings.
## Related
- [[Bagging]]
- [[Boosting]]
- [[Random Forest]]
- [[XGBoost]]