## Definition
A **decision tree** is a hierarchical model that makes predictions by recursively partitioning the feature space using threshold rules. Each internal node tests a feature; each leaf yields a prediction (class for classification; value for regression).
## Construction (CART Algorithm)
Build top-down recursively:
```
function build_tree(D, depth):
if stopping criterion met: return Leaf(majority class or mean value of D)
find best split (feature j, threshold t) that maximises purity gain
D_left ← {(x, y) ∈ D : x_j ≤ t}
D_right ← {(x, y) ∈ D : x_j > t}
return Node(j, t, build_tree(D_left, depth+1), build_tree(D_right, depth+1))
```
## Splitting Criteria
### Classification
- **Gini impurity:** $1 - \sum_c p_c^2$. Probability of misclassifying a random sample.
- **Entropy:** $-\sum_c p_c \log p_c$. Information theory measure.
Both give similar trees in practice; Gini is faster to compute.
### Regression
Variance reduction (SSE minimisation). See [[Decision Trees for Regression]].
## Stopping Criteria
Without limits, a tree grows until every leaf is pure — *guaranteed overfitting*. Common constraints:
- **Max depth.**
- **Min samples per leaf.**
- **Min samples to split.**
- **Min impurity decrease per split.**
- **Pruning** after construction (cost-complexity pruning).
## Strengths
- **Interpretable.** The rules are inspectable; flowchart-style explanations.
- **No feature scaling needed.** Splits are based on order, not magnitude.
- **Handles mixed feature types.** Numeric and categorical (after encoding).
- **Handles missing values** naturally (with appropriate libraries).
- **Captures non-linear patterns** and interactions automatically.
## Weaknesses
- **High variance** — small data changes cause very different trees. Ensembling fixes this.
- **Greedy construction** — locally optimal splits may not yield globally optimal trees.
- **Axis-aligned splits** — can't represent diagonal decision boundaries cleanly.
- **Bias toward features with many possible splits** (high cardinality) — use information-gain-ratio or feature selection.
## Ensembles Fix the Weaknesses
Single decision trees are rarely competitive. Bagging gives [[Random Forest]]; boosting gives [[Gradient Boosting Machines]] / [[XGBoost]] / LightGBM / CatBoost. Tree ensembles dominate tabular ML in 2026.
## Notable Algorithms
- **CART** (Classification and Regression Trees) — Breiman et al., 1984. The standard.
- **ID3 / C4.5 / C5.0** — Quinlan's algorithms, used entropy-based splits.
- **CHAID** — chi-squared based splits; allows multi-way splits.
## Related
- [[Decision Trees for Regression]]
- [[Random Forest]]
- [[Gradient Boosting Machines]]
- [[XGBoost]]