Quasi-Newton Methods - Albert Masoliver's learning site

## Definition **Quasi-Newton methods** approximate the Hessian (or its inverse) used in [[Newton's Method|Newton's method]] by accumulating gradient information across iterations — avoiding the cost of computing and inverting the true Hessian. ## The Idea Maintain an approximate Hessian inverse $B_k$. After each step, update $B_k$ using the observed change in gradient — the **secant condition**: $ B_{k+1} (\nabla f(x_{k+1}) - \nabla f(x_k)) = x_{k+1} - x_k $ Different update formulas give different methods. ## BFGS The most popular quasi-Newton update (Broyden, Fletcher, Goldfarb, Shanno, ~1970). Symmetric, positive-definite-preserving rank-2 update. Storage: $O(n^2)$ for the matrix; updates: $O(n^2)$ per iteration. Practical for problems with hundreds to thousands of variables. ## L-BFGS (Limited-memory BFGS) Stores only the last $m$ pairs of $(s_k, y_k)$ vectors (where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$); reconstructs the implicit Hessian action via a recursive formula. - Storage: $O(mn)$ — typically $m = 5$-$20$. - Per-iteration cost: $O(mn)$. - Scales to millions of variables. The workhorse for medium-to-large smooth optimisation problems where Hessian information helps but full Newton is infeasible. ## Convergence Quasi-Newton methods converge **superlinearly** near the optimum — slower than Newton's quadratic but faster than gradient descent's linear. In practice, similar number of iterations to Newton with much lower per-iteration cost. ## Strengths - **No Hessian needed.** Just gradients. - **Approximate second-order information** without the cost. - **Robust** across many problems with default hyperparameters. - **Scales** (L-BFGS) to large problems. ## Weaknesses - **Stochastic variants are tricky.** L-BFGS doesn't naturally play well with mini-batch noise — gradient differences from random batches are noisy. - **Memory** still meaningful at very large scales. ## Where Quasi-Newton Is Used - **Logistic regression, MaxEnt models** at moderate scale. - **Smooth convex optimisation** generally. - **scikit-learn's `LogisticRegression`** uses L-BFGS by default for many losses. - **Scipy's `minimize`** uses L-BFGS-B as a strong general default. ## Why Not Deep Learning? - Loss landscapes are non-convex and stochastic — quasi-Newton's assumptions break. - Adam and SGD handle the noise better. - Approximate Hessian methods for deep learning (K-FAC, Shampoo) are active research but not yet default. ## Related - [[Newton's Method]] - [[Gradient Descent]] - [[Conjugate Gradient]]