## Definition
**Feature scaling** transforms numeric features to a comparable scale. It is necessary for any algorithm whose behaviour depends on feature magnitude: gradient-based optimisers, distance-based methods, and regularised linear models.
## Why It Matters
- **Gradient descent.** Features with very different scales produce elongated loss surfaces; gradient steps zigzag across the narrow direction and convergence slows.
- **Distance-based models** ([[kNN]], k-means, SVM with RBF kernel). A feature with range 0-1000 dominates one with range 0-1 in the distance.
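- **Regularisation.** L1/L2 penalties act on coefficient magnitudes, and a coefficient's magnitude depends on the units of its feature: a feature measured on a small scale needs a larger coefficient for the same effect and is penalised more heavily. Without scaling, the penalty falls unevenly across features.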
Algorithms unaffected by scaling: tree-based models (decision trees, random forest, gradient boosting). Their splits are based on order, not magnitude.
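
The distance-based point can be illustrated with a small sketch. The four rows below are made-up values (not from any real dataset), with one column spanning hundreds and the other spanning 0 to 1. On the raw data the nearest neighbour of row 0 is decided almost entirely by the large column; after z-scoring each column, the ordering flips.

```python
import numpy as np

# Hypothetical data: column 0 spans hundreds, column 1 spans 0-1.
X = np.array([
    [100.0, 0.1],
    [110.0, 0.9],   # near row 0 on the large feature, far on the small one
    [400.0, 0.1],   # far from row 0 on the large feature, identical on the small one
    [900.0, 0.5],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from row 0: the large-range feature dominates.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Z-score each column, then measure again.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print(f"raw:    d(0,1)={d_raw_01:.2f}  d(0,2)={d_raw_02:.2f}")
print(f"scaled: d(0,1)={d_std_01:.2f}  d(0,2)={d_std_02:.2f}")
```

Neither ordering is "correct" in itself; the sketch only shows that, without scaling, the choice of units silently decides which feature the distance listens to.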
## Three Standard Methods
### Standardisation (Z-score normalisation)
$$
x' = \frac{x - \mu}{\sigma}
$$
Mean 0, unit variance. The default choice for most cases. Less sensitive to outliers than min-max, but not robust: extreme values still shift both $\mu$ and $\sigma$.
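
As a worked instance of the formula, here is a minimal NumPy sketch on a small made-up sample chosen so that the mean is 5 and the standard deviation is 2. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical sample: mean 5, population standard deviation 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
sigma = x.std()          # population std (ddof=0)
z = (x - mu) / sigma     # x' = (x - mu) / sigma

print(mu, sigma)
print(z)
```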
### Min-Max Normalisation
$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$
Maps the fitted data to $[0, 1]$; later values outside the fitted range fall outside that interval. Useful when you need bounded inputs (image pixel intensities, neural network inputs). Sensitive to outliers.
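
The outlier sensitivity is easy to see on a toy example. The sketch below uses a hypothetical helper, `min_max`, and made-up values: adding a single extreme value squeezes the other five points into a narrow band near zero.

```python
import numpy as np

def min_max(x):
    """Hypothetical helper: min-max scale a 1-D array to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # made-up values
x_out = np.append(x, 1000.0)                   # same values plus one outlier

scaled = min_max(x)          # evenly spread over [0, 1]
scaled_out = min_max(x_out)  # first five values crowded near 0

print(scaled)
print(scaled_out)
```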
### Robust Scaling
$$
x' = \frac{x - \text{median}}{\text{IQR}}
$$
Centres on median; scales by interquartile range. Robust to outliers. Right choice for heavy-tailed distributions.
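
A small sketch of the formula on made-up values with one outlier, compared against z-scoring the same data. It assumes the IQR is taken between the 25th and 75th percentiles, which matches the default `quantile_range` of scikit-learn's `RobustScaler`; the scikit-learn call is included only as a cross-check of the manual computation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values with a single outlier.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

# Manual robust scaling: centre on the median, divide by the IQR.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# Z-scores of the same data, for comparison.
z = (x - x.mean()) / x.std()

# Cross-check against scikit-learn (expects a 2-D array).
robust_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()

# Spread of the five non-outlier points under each scaling.
bulk_spread_robust = robust[:5].max() - robust[:5].min()
bulk_spread_z = z[:5].max() - z[:5].min()

print(bulk_spread_robust, bulk_spread_z)
```

Under robust scaling the bulk of the data keeps a usable spread, while under z-scoring the outlier inflates $\sigma$ and compresses the other points.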
## Critical Hygiene
**Fit the scaler on training data only.** Apply (transform) to validation and test.
If you fit the scaler on the full dataset, statistics from validation/test leak into the training process — inflating apparent performance and inviting silent failure in production.
In scikit-learn:
```python
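from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data, then transform it
X_val_scaled = scaler.transform(X_val)          # transform only, using training statistics
X_test_scaled = scaler.transform(X_test)        # transform only, using training statistics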
```
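
To make the point concrete, the following sketch uses synthetic data (random values, not a real dataset) and checks that the fitted scaler holds training-set statistics only. A consequence worth expecting: the transformed test set will not have exactly zero mean, and that is correct behaviour, not a bug.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit on the training split only.
scaler = StandardScaler().fit(X_train)
train_only_mean = X_train.mean(axis=0)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("scaler.mean_     :", scaler.mean_)
print("train-only mean  :", train_only_mean)
print("scaled test mean :", X_test_scaled.mean(axis=0))  # close to 0, not exactly 0
```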
## When Not to Scale
- **Tree-based models.** Decision splits are scale-invariant; scaling is harmless but pointless.
- **One-hot encoded features.** Already in $\{0, 1\}$; further scaling adds nothing.
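- **Sparse features.** Centring (subtracting the mean) turns zeros into non-zeros and densifies sparse matrices; if you must scale, use `StandardScaler(with_mean=False)` to scale without centring.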
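
For the sparse case, a minimal sketch on a made-up SciPy CSR matrix shows that scaling without centring keeps the zeros in place. `MaxAbsScaler` is included as an alternative that this note does not otherwise cover; it divides each column by its maximum absolute value and likewise preserves sparsity.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical sparse matrix with 4 non-zero entries.
X = sparse.csr_matrix(np.array([
    [0.0, 0.0, 3.0],
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]))

# Scale to unit variance without centring: zeros stay zeros.
X_var = StandardScaler(with_mean=False).fit_transform(X)

# Alternative (not from this note): divide each column by its max absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X)

print(X.nnz, X_var.nnz, X_maxabs.nnz)
```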
## Pipelines
Modern ML frameworks let you bundle scaling with the model as a single pipeline:
```python
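from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipeline.fit(X_train, y_train)  # scaler is fitted on X_train only
pipeline.predict(X_test)        # scaling applied automatically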
```
This is the right pattern in production: the scaler ships with the model.
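
One way to picture "the scaler ships with the model" is a serialisation round trip. The sketch below is an illustration on synthetic data, using the standard-library `pickle` module as one possible persistence mechanism (an assumption, since the note does not name one). The restored pipeline carries the fitted scaler statistics and reproduces the original predictions on raw, unscaled inputs.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately mismatched feature scales.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2)) * np.array([1000.0, 1.0])
y_train = (X_train[:, 0] / 1000.0 + X_train[:, 1] > 0).astype(int)
X_new = rng.normal(size=(5, 2)) * np.array([1000.0, 1.0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Serialise and restore the whole pipeline as one object.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline scales raw inputs with the training statistics.
same = np.array_equal(pipeline.predict(X_new), restored.predict(X_new))
fitted_mean = restored.named_steps["scaler"].mean_

print(same, fitted_mean)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.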
## Related
- [[Feature Engineering]]
- [[L2 Regularization]]
- [[Gradient Descent]]
- [[kNN]]