## Definition
**Feature selection** reduces a dataset to a subset of relevant features. Done well, it can improve model performance (less noise), interpretability (fewer variables to reason about), and efficiency (less computation, less storage).
## Three Families
### Filter Methods
Score each feature *independently* of any model. Cheap; usable as preprocessing.
- **Variance threshold.** Drop features with near-zero variance.
- **Correlation.** Drop features highly correlated with each other; keep one representative.
- **Mutual information.** Measure dependence between feature and target, including non-linear relationships that correlation misses.
- **Chi-squared test.** For categorical features against a categorical target.
- **ANOVA F-test.** For numeric features against a categorical target.
Strength: fast, model-agnostic. Weakness: misses *feature interactions* — two features individually weak but jointly strong.
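
A minimal sketch of a filter pass and of the interaction weakness, in plain Python. It drops near-constant features, then scores the rest by absolute Pearson correlation with the target, used here as a simple univariate stand-in for the target-based scores listed above. The helper names (`pearson`, `filter_scores`), the `min_variance` cutoff, and the toy columns are illustrative assumptions, not any library's API.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_scores(columns, target, min_variance=1e-8):
    """Drop near-constant features, then score the rest by |correlation with target|."""
    scores = {}
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # variance threshold: a constant column cannot help
        scores[name] = abs(pearson(values, target))
    return scores

# Hypothetical toy data: the target is the XOR of two binary features.
columns = {
    "constant":   [1, 1, 1, 1, 1, 1, 1, 1],
    "noisy_copy": [0, 1, 1, 0, 0, 1, 1, 1],  # target with the last value flipped
    "xor_a":      [0, 0, 1, 1, 0, 0, 1, 1],
    "xor_b":      [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [a ^ b for a, b in zip(columns["xor_a"], columns["xor_b"])]

scores = filter_scores(columns, target)
print(scores)
```

Here `constant` is removed by the variance threshold and `noisy_copy` scores well, but `xor_a` and `xor_b` each score 0 even though together they determine the target exactly. A univariate filter would discard both.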
### Wrapper Methods
Use a model's actual performance to select features. Search over feature subsets.
- **Recursive Feature Elimination (RFE).** Train model; drop the least important feature; retrain; repeat.
- **Forward selection.** Start with none; add the feature that most improves performance; repeat.
- **Backward elimination.** Start with all; drop the least useful; repeat.
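Strength: model-aware. Weakness: expensive. With $n$ features, RFE needs on the order of $n$ training runs, and greedy forward or backward search needs $O(n^2)$, each multiplied by the number of cross-validation folds. The search can also overfit the selection itself, so score candidate subsets with cross-validation.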
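
Forward selection can be sketched as below. The model is a 1-nearest-neighbour classifier scored by leave-one-out accuracy, chosen only because it fits in a few lines; any model and any cross-validation scheme could take its place. Function names and the toy data are assumptions for illustration.

```python
def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the features in feature_idx."""
    correct = 0
    for i, row in enumerate(rows):
        def sq_dist(j):
            return sum((row[f] - rows[j][f]) ** 2 for f in feature_idx)
        nearest = min((j for j in range(len(rows)) if j != i), key=sq_dist)
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, names):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(names)))
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break  # no candidate improves on the current subset
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return [names[f] for f in selected], best

# Hypothetical toy data: signal_a separates most points, signal_b resolves
# the two ambiguous ones in the middle, noise is unrelated to the label.
names = ["signal_a", "signal_b", "noise"]
rows = [
    (0.00, 0.50, 0.30),
    (0.10, 0.44, 0.80),
    (0.25, 0.56, 0.10),
    (0.50, 0.05, 0.60),
    (0.65, 0.95, 0.31),
    (0.90, 0.51, 0.81),
    (1.00, 0.46, 0.11),
    (1.15, 0.57, 0.61),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

chosen, score = forward_selection(rows, labels, names)
print(chosen, score)
```

On this data the search keeps `signal_a` and `signal_b` and rejects `noise`. It evaluates 3 + 2 + 1 candidate subsets for three features, which is where the quadratic cost of greedy search comes from.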
### Embedded Methods
Selection happens during model training.
- **[[L1 Regularization]] (Lasso).** Drives many coefficients to exactly zero — implicit feature selection.
- **Tree-based feature importance.** Importance scores from random forests or gradient boosting.
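- **Permutation importance.** Shuffle each feature; measure performance drop. Strictly a post-hoc, model-agnostic check on an already-trained model (ideally scored on held-out data) rather than part of training, but often used alongside the scores above.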
Strength: cheap (one training run); reflects feature importance for *this* model. Weakness: model-specific; impurity-based tree importances are biased toward high-cardinality features.
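
To illustrate how L1 regularisation zeroes coefficients, here is a small coordinate-descent sketch for the objective $\frac{1}{2n}\lVert y - Xw \rVert^2 + \alpha \lVert w \rVert_1$. It assumes no intercept (features and target already centred), and the function names, `alpha` value, and toy data are illustrative choices. It is not meant to mirror how any particular library implements Lasso.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything within [-alpha, alpha] becomes exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0:
                w[j] = 0.0  # an all-zero column gets no weight
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Hypothetical toy data: three orthogonal features, true weights (2.0, 0.1, 0.0).
feature_names = ["x1", "x2", "x3"]
X = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
y = [2.0 * a + 0.1 * b for a, b, _ in X]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [name for name, w in zip(feature_names, weights) if w != 0.0]
print(weights, selected)
```

With `alpha=0.5` the weak feature `x2` and the irrelevant `x3` both get a weight of exactly zero, while `x1` is kept but shrunk from 2.0 to about 1.5. The penalty selects features and biases the surviving coefficients at the same time.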
## When to Reach For It
- **Many irrelevant features** — noise dilutes the signal.
- **Interpretability constraints** — regulators or stakeholders demand a small, defensible feature set.
- **Inference-time cost** — fewer features = faster predictions and smaller models.
- **High-dimensional, low-sample** datasets: with far more features than samples, explicit selection can outperform regularisation alone, though how much depends on the data.
## Pitfalls
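- **Selection bias.** If features are selected using data that is later used for evaluation (the test set, or the full dataset before cross-validation splits it), reported performance is inflated. Do selection *inside* each cross-validation fold, on the training portion only.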
- **Confusing importance with causation.** A feature important to the model is predictive of the target in this dataset; that is not the same as causing the target.
- **Stability.** Different runs, resamples, or random seeds may select different features, especially when features are correlated; ensembling selections (or using stability selection) helps.
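
The selection-bias point can be made concrete with a sketch in which the selection step is rerun inside every fold and only ever sees that fold's training rows. The selector (largest gap between class means) and the nearest-centroid model are deliberately simple stand-ins, and all names and data are hypothetical. The sketch assumes binary labels 0/1 and that every training fold contains both classes.

```python
from statistics import mean

def k_fold_splits(n_rows, k):
    """Yield (train_idx, test_idx) pairs; row i is held out in fold i % k."""
    for fold in range(k):
        test_idx = [i for i in range(n_rows) if i % k == fold]
        train_idx = [i for i in range(n_rows) if i % k != fold]
        yield train_idx, test_idx

def class_means(rows, labels, feature):
    """Mean of one feature for each class label."""
    return {
        c: mean(r[feature] for r, lab in zip(rows, labels) if lab == c)
        for c in set(labels)
    }

def select_top_feature(rows, labels):
    """Filter step: the feature whose two class means are furthest apart."""
    def gap(f):
        m = class_means(rows, labels, f)
        return abs(m[1] - m[0])
    return max(range(len(rows[0])), key=gap)

def cv_accuracy_with_selection(rows, labels, k):
    """Cross-validated accuracy where selection is part of each fold's training."""
    fold_scores = []
    for train_idx, test_idx in k_fold_splits(len(rows), k):
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Selection and fitting both use the training rows only.
        feature = select_top_feature(train_rows, train_labels)
        centroids = class_means(train_rows, train_labels, feature)
        hits = 0
        for i in test_idx:
            value = rows[i][feature]
            predicted = min(centroids, key=lambda c: abs(value - centroids[c]))
            hits += predicted == labels[i]
        fold_scores.append(hits / len(test_idx))
    return mean(fold_scores)

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
rows = [(0.0, 0.5), (0.1, 0.2), (0.2, 0.9), (1.0, 0.4), (0.9, 0.8), (1.1, 0.3)]
labels = [0, 0, 0, 1, 1, 1]

accuracy = cv_accuracy_with_selection(rows, labels, k=3)
print(accuracy)
```

The biased alternative would call `select_top_feature(rows, labels)` once on all rows before the loop, letting the held-out rows influence which feature is kept. Recording the feature chosen in each fold also gives a rough view of selection stability.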
## Related
- [[Feature Engineering]]
- [[Dimensionality Reduction]]
- [[L1 Regularization]]
- [[Curse of Dimensionality]]