Conjugate Gradient - Albert Masoliver's learning site

## Definition The **Conjugate Gradient (CG)** method is an iterative algorithm for solving large symmetric positive-definite linear systems $Ax = b$, and more generally minimising quadratic functions. Reaches the exact solution in at most $n$ iterations (for an $n$-dimensional problem) using only matrix-vector products — no matrix factorisation needed. ## The Setup For minimising $\frac{1}{2} x^\top A x - b^\top x$ where $A$ is symmetric positive definite: ``` x_0 ← initial guess r_0 = b - A x_0 (initial residual) p_0 = r_0 (initial search direction) for k = 0, 1, ...: α_k = r_k^T r_k / (p_k^T A p_k) x_{k+1} = x_k + α_k p_k r_{k+1} = r_k - α_k A p_k if ||r_{k+1}|| < tolerance: stop β_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k) p_{k+1} = r_{k+1} + β_k p_k ``` ## What "Conjugate" Means Successive search directions $p_k$ are **A-conjugate**: $p_i^\top A p_j = 0$ for $i \neq j$. This is the key — each step makes progress in a direction orthogonal (under the $A$-inner-product) to all previous steps. ## Convergence - **Finite termination:** at most $n$ iterations for $n$-dimensional problem. - **In practice:** convergence depends on the spectrum of $A$. If $A$ has $k$ distinct eigenvalues, CG converges in $k$ iterations. - **Preconditioning** (replace $A$ with $M^{-1} A$ for some easy-to-invert $M$) dramatically accelerates convergence. ## Why It Matters Many problems reduce to large symmetric positive-definite linear systems: - **Finite-element analysis** in engineering. - **Image deblurring** (Tikhonov regularisation). - **Least-squares regression** at scale (normal equations). - **PageRank** and other graph algorithms. - **Hessian-free optimisation** in deep learning — uses CG to approximately invert the Hessian for Newton-like updates without storing it. ## Non-linear Conjugate Gradient Extends to non-quadratic objectives: - Compute gradient $\nabla f(x_k)$. - Use a Fletcher-Reeves, Polak-Ribière, or Hestenes-Stiefel formula to update $\beta_k$. - Line search for the step size. Less popular than [[Quasi-Newton Methods|L-BFGS]] for general non-linear optimisation but useful when memory matters. ## Strengths - **Memory-efficient.** Only needs vectors, not matrices. - **Matrix-free.** Only needs to compute $Ax$, not store $A$. Critical when $A$ is implicit (e.g., a Hessian-vector product in a neural network). - **Theoretical guarantees** for quadratic problems. ## Weaknesses - **Symmetric positive definite only.** For general systems, use GMRES or other Krylov methods. - **Preconditioning is crucial** in practice. - **Numerical stability** issues with floating-point. ## Related - [[Newton's Method]] - [[Quasi-Newton Methods]] - [[Gradient Descent]]