## Definition
**Quasi-Newton methods** approximate the Hessian (or its inverse) used in [[Newton's Method|Newton's method]] by accumulating gradient information across iterations — avoiding the cost of computing and inverting the true Hessian.
## The Idea
Maintain an approximate Hessian inverse $B_k$. After each step, update $B_k$ using the observed change in gradient — the **secant condition**:
$
B_{k+1} (\nabla f(x_{k+1}) - \nabla f(x_k)) = x_{k+1} - x_k
$
Different update formulas give different methods.
## BFGS
The most popular quasi-Newton update (Broyden, Fletcher, Goldfarb, Shanno, ~1970). Symmetric, positive-definite-preserving rank-2 update.
Storage: $O(n^2)$ for the matrix; updates: $O(n^2)$ per iteration. Practical for problems with hundreds to thousands of variables.
## L-BFGS (Limited-memory BFGS)
Stores only the last $m$ pairs of $(s_k, y_k)$ vectors (where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$); reconstructs the implicit Hessian action via a recursive formula.
- Storage: $O(mn)$ — typically $m = 5$-$20$.
- Per-iteration cost: $O(mn)$.
- Scales to millions of variables.
The workhorse for medium-to-large smooth optimisation problems where Hessian information helps but full Newton is infeasible.
## Convergence
Quasi-Newton methods converge **superlinearly** near the optimum — slower than Newton's quadratic but faster than gradient descent's linear. In practice, similar number of iterations to Newton with much lower per-iteration cost.
## Strengths
- **No Hessian needed.** Just gradients.
- **Approximate second-order information** without the cost.
- **Robust** across many problems with default hyperparameters.
- **Scales** (L-BFGS) to large problems.
## Weaknesses
- **Stochastic variants are tricky.** L-BFGS doesn't naturally play well with mini-batch noise — gradient differences from random batches are noisy.
- **Memory** still meaningful at very large scales.
## Where Quasi-Newton Is Used
- **Logistic regression, MaxEnt models** at moderate scale.
- **Smooth convex optimisation** generally.
- **scikit-learn's `LogisticRegression`** uses L-BFGS by default for many losses.
- **Scipy's `minimize`** uses L-BFGS-B as a strong general default.
## Why Not Deep Learning?
- Loss landscapes are non-convex and stochastic — quasi-Newton's assumptions break.
- Adam and SGD handle the noise better.
- Approximate Hessian methods for deep learning (K-FAC, Shampoo) are active research but not yet default.
## Related
- [[Newton's Method]]
- [[Gradient Descent]]
- [[Conjugate Gradient]]