Newton's Method - Albert Masoliver's learning site

## Definition **Newton's method** is a second-order iterative optimiser that uses both the gradient and the Hessian (second derivatives) to find local minima. Converges quadratically near the optimum — much faster than [[Gradient Descent]]'s linear convergence — but at higher per-iteration cost. ## Update Rule For a twice-differentiable function $f(x)$: $ x_{k+1} = x_k - \eta_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) $ - $\nabla^2 f(x_k)$ — Hessian matrix at $x_k$. - $\eta_k$ — step size (often 1; line search optional). The Hessian's inverse rescales the gradient: steps are larger in flat directions, smaller in steep ones. Locally fits a quadratic to $f$ and jumps to its minimum. ## Convergence Near a strict local minimum where the Hessian is positive definite: $ \|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2 $ **Quadratic convergence** — the error squares each iteration. Reaches machine precision in ~5-10 iterations from a good starting point. Far from the optimum, Newton can diverge or oscillate. Globalisation strategies (line search, trust region) help. ## Strengths - **Quadratic convergence near optimum** — far faster than first-order methods. - **Affine invariance** — performance doesn't depend on scaling of inputs. - **Exact for quadratic objectives** — converges in one step. ## Weaknesses - **Hessian cost.** Computing and storing the Hessian is $O(n^2)$ space; inverting is $O(n^3)$ time. For high-dimensional problems (neural networks with millions of parameters), entirely impractical. - **Hessian may not be positive definite.** Saddle points and non-convex regions can give bad steps. - **No global convergence guarantee** without modifications. ## Variants ### Damped Newton Add line search: $x_{k+1} = x_k - \eta_k [\nabla^2 f]^{-1} \nabla f$ with $\eta_k$ chosen by backtracking. Trades off speed for safety. ### Modified Newton When Hessian is indefinite, modify it (add multiple of identity, or use only positive eigenvalues) to ensure a descent direction. ### [[Quasi-Newton Methods]] Approximate the Hessian (or its inverse) from gradient differences across iterations. BFGS, L-BFGS — the standard practical compromise. Much cheaper than full Newton; nearly as fast. ## Where Newton Is Used - **Convex optimisation** with moderate dimensions (~100s to ~10000s of variables). - **Logistic regression** training (Newton-IRLS). - **Statistical estimation** (MLE for models with rich structure). - Generally *not* deep learning — dimensions too high. ## Stochastic Variants Sub-sampled Newton, K-FAC, Shampoo — attempt to bring Newton-like preconditioning to deep learning by approximating the Hessian over batches. Active research area; not yet displacing Adam as the default. ## Related - [[Gradient Descent]] - [[Quasi-Newton Methods]] - [[Convex Optimization]]