Feature Selection - Albert Masoliver's learning site

## Definition

**Feature selection** reduces a dataset to a subset of relevant features. It can improve model performance (less noise), interpretability (fewer variables to reason about), and efficiency (less computation, less storage).

## Three Families

### Filter Methods

Score each feature *independently* of any model. Cheap; usable as preprocessing.

- **Variance threshold.** Drop features with near-zero variance.
- **Correlation.** Drop features highly correlated with each other; keep one representative.
- **Mutual information.** Measure dependence, including non-linear dependence, between feature and target.
- **Chi-squared test.** For categorical features against a categorical target.
- **ANOVA F-test.** For numeric features against a categorical target.

Strength: fast, model-agnostic. Weakness: misses *feature interactions*, where two features are individually weak but jointly strong.

### Wrapper Methods

Use a model's actual performance to select features. Search over feature subsets.

- **Recursive Feature Elimination (RFE).** Train model; drop the least important feature; retrain; repeat.
- **Forward selection.** Start with none; add the feature that most improves performance; repeat.
- **Backward elimination.** Start with all; drop the least useful; repeat.

Strength: model-aware. Weakness: expensive (with $n$ features, RFE needs up to $n$ training runs, and greedy forward or backward search needs on the order of $n^2$); risk of overfitting the selection itself, so use cross-validation.

### Embedded Methods

Selection happens during model training.

- **[[L1 Regularization]] (Lasso).** Drives many coefficients to exactly zero, which acts as implicit feature selection.
- **Tree-based feature importance.** Importance scores from random forests or gradient boosting.
- **Permutation importance.** Shuffle each feature; measure performance drop. Strictly a post-hoc check on an already trained model rather than part of training, but it shares the one-training-run cost.

Strength: cheap (one training run); reflects feature importance for *this* model. Weakness: model-specific; impurity-based tree importances are biased toward high-cardinality features.

## When to Reach For It

- **Many irrelevant features**: noise dilutes the signal.
- **Interpretability constraints**: regulators or stakeholders demand a small, defensible feature set.
- **Inference-time cost**: fewer features mean faster predictions and smaller models.
- **High-dimensional, low-sample datasets**: feature selection can outperform regularisation alone.

## Pitfalls

- **Selection bias.** If features are selected using the test set, or the full dataset before splitting, performance estimates are inflated. Do selection *inside* cross-validation.
- **Confusing importance with causation.** A feature important to the model is correlated with the target; that's not the same as causing the target.
- **Stability.** Different runs may select different features; ensembling selections (or using stability selection) helps.

## Related

- [[Feature Engineering]]
- [[Dimensionality Reduction]]
- [[L1 Regularization]]
- [[Curse of Dimensionality]]
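
## Example: A Minimal Filter Pass

The first two filter methods listed under Filter Methods (variance threshold, then a correlation filter) can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function names, the toy data, and the thresholds `min_variance=1e-8` and `max_abs_corr=0.95` are all assumptions chosen for the example, and "keep one representative" is interpreted here as "keep whichever correlated feature appears first".

```python
from math import sqrt


def variance(column):
    """Population variance of a list of numbers."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)


def pearson(a, b):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return cov / norm if norm > 0 else 0.0


def filter_features(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return names of features that survive both filters.

    columns: dict mapping feature name -> list of numeric values.
    Thresholds are illustrative assumptions, not recommended defaults.
    """
    # Step 1: variance threshold. Drop (near-)constant features.
    kept = [name for name, col in columns.items() if variance(col) > min_variance]

    # Step 2: correlation filter. A feature is kept only if it is not
    # highly correlated with any feature already selected.
    selected = []
    for name in kept:
        if all(
            abs(pearson(columns[name], columns[other])) < max_abs_corr
            for other in selected
        ):
            selected.append(name)
    return selected


# Hypothetical toy data: two near-duplicate features, one constant, one distinct.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "shoe_size": [40.0, 38.0, 44.0, 41.0, 43.0],
}
selected = filter_features(data)
print(selected)  # ['height_cm', 'shoe_size']
```

Neither step looks at the target, so this pass alone does not leak label information. Any filter that does score features against the target (mutual information, chi-squared, ANOVA) should be fitted on training folds only, as noted under Pitfalls.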