Lasso Regression - Albert Masoliver's learning site

## Definition

**Lasso regression** is linear regression with an [[L1 Regularization]] penalty:

$$
\hat w = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1
$$

The L1 penalty drives many coefficients to **exactly zero**, performing implicit [[Feature Selection]]. Introduced by Tibshirani (1996).

## Why Coefficients Hit Zero

The L1 constraint region $\|w\|_1 \leq t$ is a diamond with corners on the coordinate axes. The optimum of the constrained problem often lies at a corner or edge, where some coefficients are exactly zero.

By contrast, Ridge's L2 constraint region is a ball with no corners: solutions shrink toward zero but generally do not reach it exactly.

## No Closed-Form Solution

The L1 norm is non-differentiable at zero, so in general there is no closed-form solution like Ridge's. Optimisation uses:

- **Coordinate descent.** Update one coefficient at a time via the soft-threshold operator. Used by scikit-learn.
- **LARS** (Least Angle Regression). Computes the entire regularisation path efficiently.
- **Proximal gradient methods.**

## Properties

- **Sparse solutions.** Many coefficients are exactly zero.
- **Embedded feature selection.** Train once; selection comes for free.
- **Sensitive to correlated features.** Lasso tends to pick one of a correlated pair, more or less arbitrarily, and zero the other. Which one it picks is unstable when features are highly correlated.
- **At most $\min(n, p)$ non-zero coefficients** (where $n$ = samples, $p$ = features). This is a hard limit that binds when $p > n$.

## When Lasso Wins

- **High-dimensional sparse problems.** Many features; few are relevant. Genomics, text features.
- **Interpretability requirements.** A model using 10 of 1000 features is far easier to communicate than a Ridge model with small coefficients on all 1000.
- **Implicit feature selection.** Avoids a separate selection step.

## When Lasso Loses

- **Many correlated features.** Lasso arbitrarily picks one; Ridge spreads coefficient mass across the group. [[Elastic Net]] is the fix.
- **More than $n$ truly relevant features.** Lasso's hard cap on non-zero coefficients matters.
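
## Example: Coordinate Descent with Soft-Thresholding

The coordinate descent update mentioned under "No Closed-Form Solution" can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names `soft_threshold` and `lasso_coordinate_descent`, the synthetic data, and the choice `lam=20.0` are all assumptions made for the example. Because the objective above uses $\|Xw - y\|_2^2$ without a factor of $\tfrac12$, the per-coordinate threshold works out to $\lambda / 2$.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a toward zero by t; returns zero whenever |a| <= t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise ||Xw - y||_2^2 + lam * ||w||_1 by cyclic coordinate descent.

    Illustrative sketch: fixed number of sweeps, no convergence check.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)      # X_j^T X_j for every column j
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue               # an all-zero column carries no information
            # Add feature j's contribution back to get the partial residual.
            residual += X[:, j] * w[j]
            rho = X[:, j] @ residual
            # Exact minimiser of the one-dimensional subproblem in w_j.
            w[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * w[j]
    return w

# Synthetic data (hypothetical): only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_w = np.zeros(10)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.standard_normal(100)

w_hat = lasso_coordinate_descent(X, y, lam=20.0)
print(np.round(w_hat, 3))
```

On data like this, the coefficients of the irrelevant features would typically come out as exactly zero, while the relevant ones are recovered with a small shrinkage toward zero. Choosing $\lambda \geq 2 \max_j |X_j^\top y|$ sets every coefficient to zero, which is where a regularisation path starts.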
## Regularisation Path

A standard analysis: plot coefficient values as $\lambda$ varies from large to small. At large $\lambda$ every coefficient is zero; as $\lambda$ decreases, coefficients enter the model roughly in order of importance, which gives a visual indication of feature priority.

## Practical Notes

- **Always standardise features.** The L1 penalty acts on raw coefficient magnitudes, so a feature measured on a larger scale is penalised less.
- **Use `LassoCV`** for automatic $\lambda$ selection via cross-validation.
- **Inspect the path** to understand which features matter and at what regularisation strength.

## Related

- [[Linear Regression]]
- [[L1 Regularization]]
- [[Ridge Regression]]
- [[Elastic Net]]
- [[Feature Selection]]
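
## Example: LassoCV and the Regularisation Path

The practical notes above (standardise, use `LassoCV`, inspect the path) can be combined in a short scikit-learn sketch. The synthetic data and variable names here are assumptions for illustration only. Note that scikit-learn's `alpha` is not the same number as the $\lambda$ in the definition: scikit-learn minimises $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \|w\|_1$, so $\alpha = \lambda / (2n)$ for the same solution.

```python
import numpy as np
from sklearn.linear_model import LassoCV, lasso_path
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 20 features on very different scales,
# of which only the first 3 influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
scales = rng.uniform(0.5, 5.0, n_features)
X = rng.standard_normal((n_samples, n_features)) * scales
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.5 * rng.standard_normal(n_samples)

# Standardise inside the pipeline so each CV fold is scaled on its own training split.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
lasso = model.named_steps["lassocv"]
print("alpha chosen by cross-validation:", lasso.alpha_)
print("indices of non-zero coefficients:", np.flatnonzero(lasso.coef_))

# Regularisation path on standardised features and a centred target
# (lasso_path does not fit an intercept).
X_std = StandardScaler().fit_transform(X)
alphas, coefs, _ = lasso_path(X_std, y - y.mean())

# coefs has shape (n_features, n_alphas); alphas run from large to small.
n_active = (coefs != 0).sum(axis=0)
print("active features at the strongest penalty:", n_active[0])
print("active features at the weakest penalty:", n_active[-1])
```

Plotting each row of `coefs` against `alphas` (for example on a log scale) gives the path plot described under "Regularisation Path".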