---
## Introduction
A **Decision Tree** is a method that uses a tree-like structure to make predictions or classifications. In a **regression** context, we aim to predict a continuous target variable \( y \) based on one or more input features \( x \). Decision Trees split the data into smaller subsets according to certain thresholds on these features, creating "branches" that ultimately lead to **leaves** (terminal nodes). Each leaf then provides a numeric prediction (often the mean of the target values in that leaf).
### Key Ideas
1. **Recursive Partitioning**: We begin with the entire training dataset and systematically divide it into subsets based on feature thresholds that reduce the overall prediction error.
2. **Leaf Value**: Each leaf in a **regression tree** contains a numeric value representing the target variable, typically computed as the **average** (or sometimes median) of the training samples that fall into that leaf.
3. **Interpretability**: Decision Trees are often praised for their interpretability. You can follow a path from the root to a leaf by applying simple “if–then” rules on the features.
----
## Mathematical Details
### Splitting Criterion
Unlike classification trees (which often use metrics like Gini impurity or entropy), **regression trees** use a different objective to find the “best” split at each node. Commonly, we try to **minimize the Sum of Squared Errors (SSE)** within each node’s subsets.
- Suppose we have a dataset \( \mathcal{D} \) with \( N \) samples, \(\{(x_i, y_i)\}_{i=1}^N\).
- A split is defined by choosing a feature \( j \) and a threshold \( t \) such that we partition \(\mathcal{D}\) into:
\[
\mathcal{D}_\text{left} = \{(x_i, y_i) \mid x_{i,j} \le t\}, \quad
\mathcal{D}_\text{right} = \{(x_i, y_i) \mid x_{i,j} > t\}.
\]
- The **prediction** at each node (leaf) is typically the average of \(y\)-values of samples in that leaf. For a leaf with indices \(\mathcal{I}\) (i.e., the set of samples that fall into that leaf), the predicted value is:
\[
\hat{y} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} y_i.
\]
- We choose the split that **minimizes** the sum of squared errors (SSE) in the two child nodes:
\[
\text{SSE}_{\text{split}} = \sum_{(x_i, y_i) \in \mathcal{D}_\text{left}} \left(y_i - \hat{y}_{\text{left}}\right)^2
+ \sum_{(x_i, y_i) \in \mathcal{D}_\text{right}} \left(y_i - \hat{y}_{\text{right}}\right)^2.
\]
where
\[
\hat{y}_{\text{left}} = \frac{1}{|\mathcal{D}_\text{left}|} \sum_{(x_i,y_i)\in \mathcal{D}_\text{left}} y_i
\quad \text{and} \quad
\hat{y}_{\text{right}} = \frac{1}{|\mathcal{D}_\text{right}|} \sum_{(x_i,y_i)\in \mathcal{D}_\text{right}} y_i.
\]
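As a minimal sketch of this criterion (plain NumPy; the helper names `sse` and `best_split` are our own, not part of any library), one can enumerate candidate thresholds for each feature and keep the split with the smallest combined SSE:

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its own mean (0 for an empty array)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Exhaustive search over features j and thresholds t for the split that
    minimizes SSE(left) + SSE(right). Returns (j, t, sse_of_split)."""
    best_j, best_t, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= t
            total = sse(y[mask]) + sse(y[~mask])
            if total < best_sse:
                best_j, best_t, best_sse = j, t, total
    return best_j, best_t, best_sse
```

This brute-force scan over all features and candidate thresholds is conceptually what CART-style implementations do, typically with sorting and incremental sums to avoid recomputing the errors from scratch at every threshold.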
### Stopping Criteria and Pruning
A purely greedy approach may keep splitting until each leaf has a very small number of samples, leading to **overfitting**. To avoid this, we often apply:
1. **Stopping rules** (see the sketch after this list): For example,
   - Stop if the node contains fewer than a certain number of samples.
   - Stop if further splitting reduces the SSE by less than a threshold.
   - Set a maximum depth for the tree.
2. **Pruning**: Grow a large tree, then **prune** it by removing splits that provide only a marginal reduction in error (e.g., using a validation set or a cost-complexity approach).
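A compact recursive builder that applies these stopping rules might look like the sketch below. It reuses the `sse` and `best_split` helpers from the previous sketch; the parameter names and the dict-based node layout are our own choices, not a standard API, and pruning is omitted:

```python
def build_tree(X, y, depth=0, max_depth=3, min_samples=5, min_sse_gain=1e-7):
    """Grow a regression tree recursively, stopping on the rules listed above."""
    # Stopping rules: maximum depth reached or too few samples in the node.
    if depth >= max_depth or y.size < min_samples:
        return {"leaf": True, "value": float(y.mean())}
    j, t, split_sse = best_split(X, y)
    # Stop if no valid split exists or the SSE reduction would be negligible.
    if j is None or sse(y) - split_sse < min_sse_gain:
        return {"leaf": True, "value": float(y.mean())}
    mask = X[:, j] <= t
    return {"leaf": False, "feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1,
                               max_depth, min_samples, min_sse_gain),
            "right": build_tree(X[~mask], y[~mask], depth + 1,
                                max_depth, min_samples, min_sse_gain)}
```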
### Prediction
To predict a value for a new point \(x_\text{new}\):
1. Start at the **root** of the tree.
2. Traverse down the tree according to the split decisions (i.e., compare \( x_{\text{new},j} \) with the threshold \( t \)).
3. Once a **leaf** is reached, output the numeric value stored in that leaf.
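With the dict-based node layout assumed in the sketches above, this traversal is a short loop:

```python
def predict_one(tree, x):
    """Follow the split decisions from the root until a leaf is reached."""
    node = tree
    while not node["leaf"]:
        # Go left or right depending on how x compares with the node's threshold.
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]
```

Applying `predict_one` to every row of a feature matrix yields the piecewise constant function discussed in the example below.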
### Example
Let’s consider a simple dataset of **house prices**:
- Input feature: \( \text{area} \) (m²).
- Target: \( \text{price} \) (in thousands of euros).
Imagine the tree has a single split at \(\text{area} \le 184.5\).
- Houses in \(\text{area} \le 184.5\) go **left**.
- Leaf value (prediction) is the mean of those houses’ prices (e.g., 268.107).
- Houses in \(\text{area} > 184.5\) go **right**.
- Leaf value is the mean of those houses’ prices (e.g., 374.308).
The result is a **piecewise constant** function of the input (\(\text{area}\)), effectively dividing houses into two price categories.
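To see this behaviour in practice, here is a small sketch using scikit-learn's `DecisionTreeRegressor`. The area/price numbers below are made up for illustration and will not reproduce the 184.5 threshold or the leaf means quoted above; `max_depth=1` forces exactly one split, so the fitted model is a two-level piecewise constant function:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: area in m^2, price in thousands of euros (made-up values).
area = np.array([[90.0], [120.0], [150.0], [180.0], [200.0], [250.0], [300.0]])
price = np.array([220.0, 250.0, 270.0, 290.0, 340.0, 380.0, 410.0])

# A depth-1 tree ("stump") has a single split and two leaves.
stump = DecisionTreeRegressor(max_depth=1).fit(area, price)
print("threshold:", stump.tree_.threshold[0])              # learned split on area
print("predictions:", stump.predict([[160.0], [260.0]]))   # one constant per side
```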
----
## Advantages and Disadvantages
### Advantages
- **Interpretability**: Each internal node corresponds to a simple condition (e.g., \( x_j \le t\)).
- **No Need for Feature Scaling**: Trees work fine with raw numeric or even categorical data.
- **Handles Non-linearities**: By splitting the data multiple times, the final model can capture non-linear relationships.
### Disadvantages
- **Overfitting**: A tree can become too large and fit noise in the training data.
- **High Variance**: Small changes in the training set can result in a very different structure.
- **Usually Outperformed by Ensemble Methods**: Methods like Random Forests or Gradient Boosting often yield better accuracy and stability.
----
## Summary
Decision Trees for regression offer a direct, interpretable approach: each split reduces the prediction error in a greedy manner, and each leaf stores a numeric estimate of the target variable. While easy to understand and implement, they can suffer from overfitting if not properly controlled or pruned. In practice, they are often used within **ensemble methods**, but they remain a foundational technique in **Machine Learning** for handling a variety of regression tasks.
----
**[[Back to Regression|Regression]]**
**[[Back to ML Overview|ML Overview]]**