## Definition
**Newton's method** is a second-order iterative optimiser that uses both the gradient and the Hessian (second derivatives) to find local minima. Converges quadratically near the optimum — much faster than [[Gradient Descent]]'s linear convergence — but at higher per-iteration cost.
## Update Rule
For a twice-differentiable function $f(x)$:
$
x_{k+1} = x_k - \eta_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)
$
- $\nabla^2 f(x_k)$ — Hessian matrix at $x_k$.
- $\eta_k$ — step size (often 1; line search optional).
The Hessian's inverse rescales the gradient: steps are larger in flat directions, smaller in steep ones. Locally fits a quadratic to $f$ and jumps to its minimum.
## Convergence
Near a strict local minimum where the Hessian is positive definite:
$
\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2
$
**Quadratic convergence** — the error squares each iteration. Reaches machine precision in ~5-10 iterations from a good starting point.
Far from the optimum, Newton can diverge or oscillate. Globalisation strategies (line search, trust region) help.
## Strengths
- **Quadratic convergence near optimum** — far faster than first-order methods.
- **Affine invariance** — performance doesn't depend on scaling of inputs.
- **Exact for quadratic objectives** — converges in one step.
## Weaknesses
- **Hessian cost.** Computing and storing the Hessian is $O(n^2)$ space; inverting is $O(n^3)$ time. For high-dimensional problems (neural networks with millions of parameters), entirely impractical.
- **Hessian may not be positive definite.** Saddle points and non-convex regions can give bad steps.
- **No global convergence guarantee** without modifications.
## Variants
### Damped Newton
Add line search: $x_{k+1} = x_k - \eta_k [\nabla^2 f]^{-1} \nabla f$ with $\eta_k$ chosen by backtracking. Trades off speed for safety.
### Modified Newton
When Hessian is indefinite, modify it (add multiple of identity, or use only positive eigenvalues) to ensure a descent direction.
### [[Quasi-Newton Methods]]
Approximate the Hessian (or its inverse) from gradient differences across iterations. BFGS, L-BFGS — the standard practical compromise. Much cheaper than full Newton; nearly as fast.
## Where Newton Is Used
- **Convex optimisation** with moderate dimensions (~100s to ~10000s of variables).
- **Logistic regression** training (Newton-IRLS).
- **Statistical estimation** (MLE for models with rich structure).
- Generally *not* deep learning — dimensions too high.
## Stochastic Variants
Sub-sampled Newton, K-FAC, Shampoo — attempt to bring Newton-like preconditioning to deep learning by approximating the Hessian over batches. Active research area; not yet displacing Adam as the default.
## Related
- [[Gradient Descent]]
- [[Quasi-Newton Methods]]
- [[Convex Optimization]]