---

## Introduction

A **Decision Tree** is a method that uses a tree-like structure to make predictions or classifications. In a **regression** context, we aim to predict a continuous target variable \( y \) based on one or more input features \( x \). Decision Trees split the data into smaller subsets according to thresholds on these features, creating "branches" that ultimately lead to **leaves** (terminal nodes). Each leaf then provides a numeric prediction (often the mean of the target values in that leaf).

### Key Ideas

1. **Recursive Partitioning**: We begin with the entire training dataset and systematically divide it into subsets based on feature thresholds that reduce the overall prediction error.
2. **Leaf Value**: Each leaf in a **regression tree** contains a numeric value representing the target variable, typically computed as the **average** (or sometimes median) of the training samples that fall into that leaf.
3. **Interpretability**: Decision Trees are often praised for their interpretability. You can follow a path from the root to a leaf by applying simple "if–then" rules on the features.

---

## Mathematical Details

### Splitting Criterion

Unlike classification trees (which often use metrics such as Gini impurity or entropy), **regression trees** use a different objective to find the "best" split at each node. Commonly, we choose the split that **minimizes the Sum of Squared Errors (SSE)** within the two resulting subsets.

- Suppose we have a dataset \( \mathcal{D} \) with \( N \) samples, \(\{(x_i, y_i)\}_{i=1}^N\).
- A split is defined by choosing a feature \( j \) and a threshold \( t \) such that we partition \(\mathcal{D}\) into:
\[
\mathcal{D}_\text{left} = \{(x_i, y_i) \mid x_{i,j} \le t\}, \quad \mathcal{D}_\text{right} = \{(x_i, y_i) \mid x_{i,j} > t\}.
\]
- The **prediction** at each leaf is typically the average of the \(y\)-values of the samples in that leaf. For a leaf with index set \(\mathcal{I}\) (i.e., the set of samples that fall into that leaf), the predicted value is:
\[
\hat{y} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} y_i.
\]
- We choose the split that **minimizes** the sum of squared errors (SSE) over the two child nodes:
\[
\text{SSE}_{\text{split}} = \sum_{(x_i, y_i) \in \mathcal{D}_\text{left}} \left(y_i - \hat{y}_{\text{left}}\right)^2 + \sum_{(x_i, y_i) \in \mathcal{D}_\text{right}} \left(y_i - \hat{y}_{\text{right}}\right)^2,
\]
where
\[
\hat{y}_{\text{left}} = \frac{1}{|\mathcal{D}_\text{left}|} \sum_{(x_i,y_i)\in \mathcal{D}_\text{left}} y_i
\quad \text{and} \quad
\hat{y}_{\text{right}} = \frac{1}{|\mathcal{D}_\text{right}|} \sum_{(x_i,y_i)\in \mathcal{D}_\text{right}} y_i.
\]

### Stopping Criteria and Pruning

A purely greedy approach may keep splitting until each leaf contains only a handful of samples, leading to **overfitting**. To avoid this, we often apply:

1. **Stopping rules**: For example,
   - Stop if the node contains fewer than a certain number of samples.
   - Stop if the SSE reduction from a further split falls below a threshold.
   - Set a maximum depth for the tree.
2. **Pruning**: Grow a large tree, then **prune** it by removing splits that provide only marginal error reduction (e.g., using a validation set or a cost-complexity approach).

### Prediction

To predict a value for a new point \(x_\text{new}\):

1. Start at the **root** of the tree.
2. Traverse down the tree according to the split decisions (i.e., compare \( x_{\text{new},j} \) with the threshold \( t \) at each internal node).
3. Once a **leaf** is reached, output the numeric value stored in that leaf.
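The split search and prediction rule above fit in a few lines of code. The following is a minimal sketch for a single numeric feature and a depth-one tree (a "stump"); the `best_split` and `predict` helpers and the small area/price dataset are hypothetical illustrations, not part of the original material. A full tree would apply the same greedy search recursively to each child node until a stopping rule triggers.

```python
def sse(values):
    """Sum of squared errors of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Greedy SSE-minimizing split for one numeric feature.

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        # Candidate threshold: midpoint between consecutive sorted x-values.
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        if not left or not right:
            continue  # degenerate split (possible with duplicate x-values)
        total = sse(left) + sse(right)  # SSE_split = SSE_left + SSE_right
        if best is None or total < best[0]:
            best = (total, t, sum(left) / len(left), sum(right) / len(right))
    _, t, left_mean, right_mean = best
    return t, left_mean, right_mean


def predict(x_new, threshold, left_mean, right_mean):
    """Depth-one tree: route the point left or right, return the leaf mean."""
    return left_mean if x_new <= threshold else right_mean


# Hypothetical data: area (m²) and price (thousands of euros).
areas = [120, 150, 170, 190, 210, 250]
prices = [240, 260, 280, 350, 370, 400]

t, y_left, y_right = best_split(areas, prices)
print(f"split: area <= {t} | left leaf = {y_left:.1f}, right leaf = {y_right:.1f}")
print(f"prediction for area = 160: {predict(160, t, y_left, y_right):.1f}")
```

Repeating this search on each resulting subset, until a stopping rule or pruning step intervenes, yields the full recursive-partitioning procedure described above.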
### Example

Let's consider a simple dataset of **house prices**:

- Input feature: \( \text{area} \) (m²).
- Target: \( \text{price} \) (in thousands of euros).

Imagine the tree has a single split at \(\text{area} \le 184.5\):

- Houses with \(\text{area} \le 184.5\) go **left**.
  - The leaf value (prediction) is the mean of those houses' prices (e.g., 268.107).
- Houses with \(\text{area} > 184.5\) go **right**.
  - The leaf value is the mean of those houses' prices (e.g., 374.308).

The result is a **piecewise constant** function of the input (\(\text{area}\)), effectively dividing houses into two price categories.

---

## Advantages and Disadvantages

### Advantages

- **Interpretability**: Each internal node corresponds to a simple condition (e.g., \( x_j \le t \)).
- **No Need for Feature Scaling**: Trees work fine with raw numeric or even categorical data.
- **Handles Non-linearities**: By splitting the data multiple times, the final model can capture non-linear relationships.

### Disadvantages

- **Overfitting**: A tree can grow too large and fit noise in the training data.
- **High Variance**: Small changes in the training set can result in a very different tree structure.
- **Usually Outperformed by Ensemble Methods**: Methods like Random Forests or Gradient Boosting often yield better accuracy and stability.

---

## Summary

Decision Trees for regression offer a direct, interpretable approach: each split greedily reduces the prediction error, and each leaf stores a numeric estimate of the target variable. While easy to understand and implement, they can overfit if not properly controlled or pruned. In practice, they are often used within **ensemble methods**, but they remain a foundational technique in **Machine Learning** for a variety of regression tasks.

---

**[[Back to Regression|Regression]]**
**[[Back to ML Overview|ML Overview]]**

---

## References