Decision Trees - Albert Masoliver's learning site

## Definition A **decision tree** is a hierarchical model that makes predictions by recursively partitioning the feature space using threshold rules. Each internal node tests a feature; each leaf yields a prediction (class for classification; value for regression). ## Construction (CART Algorithm) Build top-down recursively: ``` function build_tree(D, depth): if stopping criterion met: return Leaf(majority class or mean value of D) find best split (feature j, threshold t) that maximises purity gain D_left ← {(x, y) ∈ D : x_j ≤ t} D_right ← {(x, y) ∈ D : x_j > t} return Node(j, t, build_tree(D_left, depth+1), build_tree(D_right, depth+1)) ``` ## Splitting Criteria ### Classification - **Gini impurity:** $1 - \sum_c p_c^2$. Probability of misclassifying a random sample. - **Entropy:** $-\sum_c p_c \log p_c$. Information theory measure. Both give similar trees in practice; Gini is faster to compute. ### Regression Variance reduction (SSE minimisation). See [[Decision Trees for Regression]]. ## Stopping Criteria Without limits, a tree grows until every leaf is pure — *guaranteed overfitting*. Common constraints: - **Max depth.** - **Min samples per leaf.** - **Min samples to split.** - **Min impurity decrease per split.** - **Pruning** after construction (cost-complexity pruning). ## Strengths - **Interpretable.** The rules are inspectable; flowchart-style explanations. - **No feature scaling needed.** Splits are based on order, not magnitude. - **Handles mixed feature types.** Numeric and categorical (after encoding). - **Handles missing values** naturally (with appropriate libraries). - **Captures non-linear patterns** and interactions automatically. ## Weaknesses - **High variance** — small data changes cause very different trees. Ensembling fixes this. - **Greedy construction** — locally optimal splits may not yield globally optimal trees. - **Axis-aligned splits** — can't represent diagonal decision boundaries cleanly. - **Bias toward features with many possible splits** (high cardinality) — use information-gain-ratio or feature selection. ## Ensembles Fix the Weaknesses Single decision trees are rarely competitive. Bagging gives [[Random Forest]]; boosting gives [[Gradient Boosting Machines]] / [[XGBoost]] / LightGBM / CatBoost. Tree ensembles dominate tabular ML in 2026. ## Notable Algorithms - **CART** (Classification and Regression Trees) — Breiman et al., 1984. The standard. - **ID3 / C4.5 / C5.0** — Quinlan's algorithms, used entropy-based splits. - **CHAID** — chi-squared based splits; allows multi-way splits. ## Related - [[Decision Trees for Regression]] - [[Random Forest]] - [[Gradient Boosting Machines]] - [[XGBoost]]