L1 Regularization - Albert Masoliver's learning site

## Definition

**L1 regularisation** (also: **Lasso**, *least absolute shrinkage and selection operator*) adds the L1 norm of the parameters to the loss function:

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \|\theta\|_1 = \mathcal{L}_{\text{data}}(\theta) + \lambda \sum_j |\theta_j|
$$

Defining feature: it *drives many coefficients to exactly zero*, which acts as implicit [[Feature Selection]].

## Why Coefficients Hit Zero

The L1 penalty has a kink (non-differentiability) at zero. At that kink the subgradient of $\lambda |\theta_j|$ is the whole interval $[-\lambda, \lambda]$, so for a convex data loss a coefficient stays at exactly zero whenever the data-loss gradient for that coordinate has magnitude at most $\lambda$.

Geometrically, the L1 constraint region $\|\theta\|_1 \leq t$ is a diamond whose corners lie on the coordinate axes. The optimum of the constrained problem often lands at a corner, setting some coefficients to zero exactly.

Contrast with L2: the L2 constraint region is a ball with no corners, so solutions shrink toward zero but generally do not reach it exactly.

## Use Cases

- **High-dimensional sparse regression.** Genomics, text features with thousands of dimensions.
- **Implicit feature selection** without a separate selection step.
- **Interpretable models.** A linear model with most coefficients zero is far easier to communicate.

## Optimization

L1 is non-differentiable at zero, so standard gradient descent doesn't directly apply. Use:

- **Coordinate descent.** Update one coefficient at a time analytically (the soft-threshold operator). Scikit-learn's `Lasso` uses this.
- **Proximal gradient methods.** Alternate a gradient step with a soft-thresholding step.
- **ADMM** for distributed settings.

## Hyperparameters

- $\lambda$: controls sparsity. Higher $\lambda$ → more coefficients driven to zero. Scikit-learn calls this parameter `alpha`.
- **Cross-validate** to choose $\lambda$. Scikit-learn's `LassoCV` does this automatically.

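
The soft-threshold update and the effect of $\lambda$ on sparsity can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: it assumes a squared-error data loss scaled as $\frac{1}{2n}\|y - X\theta\|_2^2$, and the function names (`soft_threshold`, `lasso_coordinate_descent`, `make_sparse_data`, `count_zeros`) and the synthetic data are hypothetical, chosen only for this example.

```python
import random


def soft_threshold(z, lam):
    """Soft-threshold operator: shrink z toward zero by lam, clipping at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimise (1/(2n)) * ||y - X theta||^2 + lam * ||theta||_1.

    X is a list of rows, y a list of targets. Pure-Python sketch,
    written for readability rather than speed.
    """
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    residual = list(y)  # equals y - X theta, since theta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero feature keeps a zero coefficient
            # Correlation of feature j with the partial residual
            # (the residual with feature j's own contribution added back).
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * theta[j]) for i in range(n)
            ) / n
            new_theta_j = soft_threshold(rho, lam) / col_sq[j]
            delta = new_theta_j - theta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                theta[j] = new_theta_j
    return theta


def make_sparse_data(n=60, d=8, seed=0):
    """Synthetic regression data where only the first two features matter."""
    rng = random.Random(seed)
    true_theta = [3.0, -2.0] + [0.0] * (d - 2)
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [
        sum(x * t for x, t in zip(row, true_theta)) + rng.gauss(0, 0.1)
        for row in X
    ]
    return X, y


def count_zeros(theta):
    """Number of coefficients that are exactly zero."""
    return sum(1 for t in theta if t == 0.0)


if __name__ == "__main__":
    X, y = make_sparse_data()
    for lam in (0.0, 0.1, 1.0, 10.0):
        theta = lasso_coordinate_descent(X, y, lam)
        rounded = [round(t, 2) for t in theta]
        print(f"lambda={lam:<5} zeros={count_zeros(theta)} theta={rounded}")
```

Running the script prints the coefficient vector for several values of $\lambda$; once $\lambda$ exceeds every feature's correlation with the target, all coefficients are exactly zero. Scikit-learn's `Lasso` uses the same $\frac{1}{2n}$ scaling, with $\lambda$ exposed as `alpha`.
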
## Trade-offs vs L2

| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Feature selection | Yes (sparse) | No (shrinkage only) |
| Solution uniqueness | Sometimes not unique | Always unique (for $\lambda > 0$) |
| Behaviour with correlated features | Picks one arbitrarily | Distributes coefficient among them |
| Differentiability | Not at zero | Everywhere |
| Closed-form solution | No (in general) | Yes (for linear regression) |

For correlated features, [[Elastic Net]] combines the L1 and L2 penalties, keeping sparsity while spreading weight across correlated groups.

## Statistical Theory

Under certain conditions (notably the "irrepresentable condition", which limits how strongly irrelevant features correlate with the relevant ones), Lasso recovers the true support of a sparse linear model with high probability. The result is foundational for high-dimensional statistics.

## Bayesian Interpretation

L1 corresponds to MAP estimation under an independent **Laplace prior** on each parameter:

$$
p(\theta_j) \propto \exp\left(-\frac{|\theta_j|}{b}\right)
$$

The negative log-prior is $\frac{1}{b} \sum_j |\theta_j|$, which is the L1 penalty with $\lambda$ inversely proportional to $b$. The sharp peak at zero (compared to a Gaussian) encourages sparsity.

## Related

- [[L2 Regularization]]
- [[Elastic Net]]
- [[Regularization]]
- [[Feature Selection]]
- [[Lasso Regression]]