## Definition
**L1 regularisation** adds the L1 norm of the parameters, scaled by a penalty weight $\lambda \geq 0$, to the loss function. Applied to least-squares regression it is known as the **Lasso** (*least absolute shrinkage and selection operator*):
$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \|\theta\|_1 = \mathcal{L}_{\text{data}}(\theta) + \lambda \sum_j |\theta_j|
$$
Defining feature: for large enough $\lambda$ it *drives some coefficients to exactly zero*, which acts as implicit [[Feature Selection]].
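As a minimal sketch of the formula above, assuming a halved mean-squared-error data loss (the definition itself does not fix $\mathcal{L}_{\text{data}}$) and a hypothetical helper name `l1_penalized_loss`:

```python
import numpy as np

def l1_penalized_loss(theta, X, y, lam):
    """Data loss plus lam * ||theta||_1 (hypothetical helper; halved MSE is an assumed data loss)."""
    residual = X @ theta - y
    data_loss = 0.5 * np.mean(residual ** 2)   # assumed choice of L_data
    penalty = lam * np.sum(np.abs(theta))      # lambda * sum_j |theta_j|
    return data_loss + penalty

X = np.eye(3)
y = np.array([1.0, -2.0, 0.0])
theta = np.array([1.0, -2.0, 0.0])

# theta fits y perfectly here, so only the penalty term contributes.
loss = l1_penalized_loss(theta, X, y, lam=0.5)
```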
## Why Coefficients Hit Zero
The L1 penalty has a kink (a point of non-differentiability) at zero. Geometrically, the L1 constraint region $\|\theta\|_1 \leq t$ is a diamond (a cross-polytope in higher dimensions) whose corners lie on the coordinate axes. The level sets of the data loss typically first touch this region at a corner or edge, where one or more coefficients are exactly zero.
Contrast with L2: the L2 constraint region is a ball with no corners, so solutions shrink toward zero but in general do not reach it exactly.
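The contrast can be made concrete in one dimension, where both penalised problems have closed-form minimisers. A sketch, assuming the scalar objectives $\tfrac{1}{2}(\theta - z)^2 + \lambda|\theta|$ and $\tfrac{1}{2}(\theta - z)^2 + \tfrac{\lambda}{2}\theta^2$; the function names are illustrative:

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*|theta|: shrink by lam, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + 0.5*lam*theta**2: rescale, never exactly zero for z != 0."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
lam = 0.5

l1_solution = soft_threshold(z, lam)  # entries with |z| <= lam become exactly zero
l2_solution = ridge_shrink(z, lam)    # every entry is scaled by 1 / (1 + lam)
```

Inputs smaller in magnitude than $\lambda$ are set to exactly zero by the L1 solution, while the L2 solution only rescales them.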
## Use Cases
- **High-dimensional sparse regression.** Genomics, text features with thousands of dimensions.
- **Implicit feature selection** without a separate selection step.
- **Interpretable models.** A linear model with most coefficients zero is far easier to communicate.
## Optimization
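The L1 penalty is non-differentiable at zero, so plain gradient descent does not directly apply, and subgradient methods converge slowly and rarely produce exact zeros. Common alternatives:
- **Coordinate descent.** Update one coefficient at a time in closed form via the soft-thresholding operator. Scikit-learn's `Lasso` uses this.
- **Proximal gradient methods** (e.g. ISTA, FISTA). Alternate a gradient step on the data loss with a soft-thresholding step.
- **ADMM** (alternating direction method of multipliers) for distributed settings.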
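The proximal gradient idea can be sketched in a few lines of NumPy. This is an illustrative ISTA loop, not any library's implementation; it assumes the least-squares objective $\frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$, a fixed step size, and synthetic data:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/(2n))*||y - X @ theta||^2 + lam*||theta||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n is the Lipschitz constant of the gradient.
    step = n / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n                           # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)    # proximal (shrinkage) step
    return theta

# Synthetic sparse problem (assumed for illustration): 3 relevant features out of 10.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_theta = np.zeros(10)
true_theta[:3] = [3.0, -2.0, 1.5]
y = X @ true_theta + 0.1 * rng.standard_normal(200)

theta_hat = lasso_ista(X, y, lam=0.1)
```

The soft-thresholding step is what produces exact zeros; a plain gradient or subgradient update would leave small non-zero values instead.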
## Hyperparameters
- $\lambda$: controls sparsity. A larger $\lambda$ generally drives more coefficients to zero; above a data-dependent threshold, all of them are zero.
- **Cross-validate** to choose $\lambda$. scikit-learn's `LassoCV` does this automatically over a grid of candidate values.
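A short usage sketch with scikit-learn's `LassoCV`, on synthetic data assumed for illustration. Note that scikit-learn calls the penalty weight `alpha` rather than $\lambda$:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (assumed for illustration): 3 informative features out of 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
coef = np.zeros(20)
coef[:3] = [3.0, -2.0, 1.5]
y = X @ coef + 0.5 * rng.standard_normal(200)

# 5-fold cross-validation over an automatically generated grid of alpha values.
model = LassoCV(cv=5, random_state=0).fit(X, y)

print("selected alpha:", model.alpha_)
print("non-zero coefficients:", np.count_nonzero(model.coef_))
```

Cross-validation optimises predictive error, so the selected value may keep a few irrelevant features; a larger penalty than the cross-validated one can be preferable when sparsity matters more than prediction.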
## Trade-offs vs L2
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Feature selection | Yes (sparse) | No (shrinkage only) |
| Solution uniqueness | Can be non-unique (e.g. duplicated features) | Unique for $\lambda > 0$ |
| Behaviour with correlated features | Tends to keep one and zero the rest | Distributes coefficient among them |
| Differentiability | Not at zero | Everywhere |
| Closed-form solution | No (in general) | Yes (for least squares) |
For correlated features, [[Elastic Net]] combines the L1 and L2 penalties, keeping sparsity while tending to select or drop correlated features as a group.
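For reference, one common way to write the Elastic Net objective in the notation used above. The two weights $\lambda_1, \lambda_2$ are notation assumed here; libraries often parameterise it differently (for example, an overall strength plus a mixing ratio):

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2
$$

Setting $\lambda_2 = 0$ recovers L1 regularisation, and $\lambda_1 = 0$ recovers L2.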
## Statistical Theory
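Under certain conditions on the design matrix (notably the "irrepresentability condition", which limits how strongly irrelevant features correlate with relevant ones), Lasso recovers the true support of a sparse linear model with high probability. Results of this kind are foundational for high-dimensional statistics.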
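One common form of the condition (stated here from the standard literature rather than from this note) requires $\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1} \operatorname{sign}(\theta_S)\|_\infty < 1$, where $S$ is the true support. A minimal sketch that evaluates this quantity for a given design; the helper name and the toy designs are illustrative assumptions:

```python
import numpy as np

def irrepresentable_value(X, support, signs):
    """Return ||X_c^T X_s (X_s^T X_s)^{-1} sign(theta_s)||_inf for the given support (hypothetical helper)."""
    support = np.asarray(support)
    off_support = np.setdiff1d(np.arange(X.shape[1]), support)
    X_s, X_c = X[:, support], X[:, off_support]
    w = np.linalg.solve(X_s.T @ X_s, np.asarray(signs, dtype=float))
    return np.max(np.abs(X_c.T @ X_s @ w))

rng = np.random.default_rng(0)

# Orthonormal columns: irrelevant features are uncorrelated with the support.
Q, _ = np.linalg.qr(rng.standard_normal((50, 4)))
good = irrepresentable_value(Q, support=[0, 1], signs=[1, 1])

# Make an irrelevant column the sum of the two relevant ones: the condition fails.
X_bad = Q.copy()
X_bad[:, 2] = Q[:, 0] + Q[:, 1]
bad = irrepresentable_value(X_bad, support=[0, 1], signs=[1, 1])
```

Values below 1 are consistent with the condition; in the second design the quantity exceeds 1, so support recovery is not guaranteed by this theory.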
## Bayesian Interpretation
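L1 regularisation corresponds to maximum a posteriori (MAP) estimation under an independent zero-mean **Laplace prior** on each parameter:
$$
p(\theta_j) \propto \exp\left(-\frac{|\theta_j|}{b}\right)
$$
The sharp peak at zero (compared with the Gaussian prior behind L2) is what makes the MAP estimate sparse.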
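Spelling out the correspondence, assuming $\mathcal{L}_{\text{data}}$ is the negative log-likelihood of the data $\mathcal{D}$ and the parameters have independent Laplace priors with a shared scale $b$:

$$
\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\mathcal{D} \mid \theta) \prod_j \frac{1}{2b} \exp\left(-\frac{|\theta_j|}{b}\right) = \arg\min_\theta \; \left[ -\log p(\mathcal{D} \mid \theta) + \frac{1}{b} \sum_j |\theta_j| \right]
$$

Under these assumptions $\lambda = 1/b$: a tighter prior (smaller $b$) means a stronger penalty.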
## Related
- [[L2 Regularization]]
- [[Elastic Net]]
- [[Regularization]]
- [[Feature Selection]]
- [[Lasso Regression]]