Feature Scaling - Albert Masoliver's learning site

## Definition

**Feature scaling** transforms numeric features to a comparable scale. It is necessary for any algorithm whose behaviour depends on feature magnitude: gradient-based optimisers, distance-based methods, and regularised linear models.

## Why It Matters

- **Gradient descent.** Features with very different scales produce elongated loss surfaces, so gradient steps zigzag and convergence slows.
- **Distance-based models** ([[kNN]], k-means, SVM with RBF kernel). A feature with range 0-1000 dominates one with range 0-1 in the distance.
- **Regularisation.** L1/L2 penalties treat every coefficient alike, but a coefficient's size depends on its feature's units, so unscaled features are penalised unevenly.

Algorithms unaffected by scaling: tree-based models (decision trees, random forest, gradient boosting). Their splits are based on order, not magnitude.

## Three Standard Methods

### Standardisation (Z-score normalisation)

$$ x' = \frac{x - \mu}{\sigma} $$

Mean 0, unit variance. The default choice for most cases. Less sensitive to outliers than min-max, although the mean and standard deviation are still pulled by extreme values.

### Min-Max Normalisation

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

Maps to $[0, 1]$. Useful when you need bounded inputs (image pixel intensities, neural network inputs). Sensitive to outliers, because a single extreme value sets $x_{\min}$ or $x_{\max}$.

### Robust Scaling

$$ x' = \frac{x - \text{median}}{\text{IQR}} $$

Centres on the median; scales by the interquartile range. Robust to outliers. The right choice for heavy-tailed distributions.

## Critical Hygiene

**Fit the scaler on training data only.** Apply (transform) to validation and test. If you fit the scaler on the full dataset, statistics from validation/test leak into the training process, inflating apparent performance and inviting silent failure in production.
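To make the three formulas and the train-only rule concrete, here is a minimal sketch in plain Python (standard library only, so the arithmetic stays visible). The toy values and the helper names `standardise`, `min_max` and `robust` are illustrative assumptions, not part of any library; the quartiles use the inclusive (linear interpolation) convention.

```python
import statistics

# Hypothetical toy feature: the training split contains one outlier (100.0).
x_train = [1.0, 2.0, 3.0, 4.0, 100.0]
x_test = [2.0, 5.0]

# Every statistic below comes from the training split only.
mu = statistics.fmean(x_train)
sigma = statistics.pstdev(x_train)  # population standard deviation
x_min, x_max = min(x_train), max(x_train)
median = statistics.median(x_train)
q1, _, q3 = statistics.quantiles(x_train, n=4, method="inclusive")
iqr = q3 - q1


def standardise(xs):
    """Z-score: (x - mu) / sigma."""
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Min-max: (x - x_min) / (x_max - x_min)."""
    return [(x - x_min) / (x_max - x_min) for x in xs]


def robust(xs):
    """Robust: (x - median) / IQR."""
    return [(x - median) / iqr for x in xs]


# Transform both splits with the *training* statistics.
scalers = (standardise, min_max, robust)
train_scaled = {f.__name__: f(x_train) for f in scalers}
test_scaled = {f.__name__: f(x_test) for f in scalers}

for name, values in train_scaled.items():
    print(name, [round(v, 3) for v in values])
```

With this toy data the outlier squeezes the min-max values of the other four points close to 0, while robust scaling keeps them spread between -1 and 0.5. Note also that a held-out value can fall outside $[0, 1]$ after min-max scaling if it lies beyond the training minimum or maximum.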
In scikit-learn:

```python
from sklearn.preprocessing import StandardScaler

# X_train, X_val and X_test are assumed to be existing feature arrays.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_val_scaled = scaler.transform(X_val)          # transform only
X_test_scaled = scaler.transform(X_test)        # transform only
```

## When Not to Scale

- **Tree-based models.** Decision splits are scale-invariant; scaling is harmless but pointless.
- **One-hot encoded features.** Already in $\{0, 1\}$; further scaling adds nothing.
- **Sparse features.** Centring turns zeros into non-zeros and so densifies sparse matrices (use `StandardScaler(with_mean=False)` if you must scale).

## Pipelines

Modern ML frameworks let you bundle scaling with the model as a single pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)  # scaler is fitted on X_train only
pipeline.predict(X_test)        # scaling applied automatically
```

This is the right pattern in production: the scaler ships with the model, and it is refitted on each training fold during cross-validation, which keeps the train-only rule intact.

## Related

- [[Feature Engineering]]
- [[L2 Regularization]]
- [[Gradient Descent]]
- [[kNN]]