Stacking - Albert Masoliver's learning site

## Definition **Stacking** (also: stacked generalisation, Wolpert 1992) combines multiple diverse base models by training a *meta-learner* on their predictions. Unlike [[Bagging]] (same model, different data) and [[Boosting]] (sequential weak learners), stacking blends *different model families* through learned weights. ## Architecture ``` Level-0 (base learners): M_1, M_2, ..., M_k ↓ Their predictions form the input to: ↓ Level-1 (meta-learner): M_meta ↓ Final prediction ``` ## Training Procedure (with CV to Avoid Leakage) ``` for each fold in K-fold CV: train base learners M_1...M_k on training portion predict on held-out portion → out-of-fold (OOF) predictions collect OOF predictions for all examples → meta-features train meta-learner M_meta on (OOF predictions, true labels) retrain each base learner on the full training set ``` At inference: base learners predict, meta-learner combines. ## Why Cross-Validation Matters If base learners predict on data they were trained on, the meta-learner overfits to optimistic base predictions. Cross-validated OOF predictions are the **only** correct way to generate meta-features — a subtle but critical step. ## Choosing Base Learners Diversity beats individual strength: - Combine **different model families** (linear, tree-based, kernel-based, neural). - Combine **same family with different hyperparameters** (shallow + deep trees). - Goal: each base learner makes *different mistakes*. If two base learners always agree, stacking can't blend them productively. ## Choosing the Meta-Learner Usually simple: - **Logistic regression** for classification. - **Linear regression** (often ridge) for regression. A simple meta-learner reduces overfitting on the limited meta-features. Complex meta-learners (another GBM) can overfit the small set of base-learner predictions. ## When Stacking Wins - Kaggle and similar competitions where the last 0.5% of accuracy matters. - Problems where no single model dominates — diverse strengths exist to be combined. ## When It Doesn't Win - Production systems where simplicity and maintainability matter — one tuned XGBoost usually beats a stacked ensemble's operational overhead. - When base learners aren't diverse — adding more correlated models doesn't help. ## Blending (Simpler Variant) Use a single hold-out set instead of CV for OOF predictions. Faster, simpler, smaller effective training set for the meta-learner. Common in time-constrained competition settings. ## Related - [[Bagging]] - [[Boosting]] - [[Random Forest]] - [[XGBoost]]