Local vs Global Optimum - Albert Masoliver's learning site

## Definition - **Global optimum** $x^*$ is the *best* feasible point: $f(x^*) \leq f(x)$ for all feasible $x$. - **Local optimum** $\hat x$ is the best within a *neighbourhood*: $f(\hat x) \leq f(x)$ for all feasible $x$ near $\hat x$. Every global optimum is also local; the reverse is not true except for convex problems. ## Why It Matters Most practical algorithms (gradient descent, Newton, hill climbing) find *local* optima. Whether that's good enough depends entirely on the problem: - **Convex problems:** every local = global. No worry. - **Non-convex problems:** local optima may be far from global. The whole game of non-convex optimisation is escaping bad locals. ## In Deep Learning Neural network loss surfaces are wildly non-convex. By classical theory, gradient descent should be hopelessly trapped in bad local minima. Yet empirically: - **Most local minima generalise well** in modern overparameterised networks. - **Saddle points are far more prevalent** in high-dimensional non-convex landscapes than local minima. - **The local minima found by SGD** tend to be in *flat regions* — which correlate with good generalisation. So deep learning works despite non-convexity, not because of clever local-minima avoidance. ## Saddle Points Points where the gradient is zero but the Hessian has both positive and negative eigenvalues. Not minima but stationary. In high dimensions, saddle points vastly outnumber local minima. Plain gradient descent can stall at them; **momentum** and **stochastic noise** help escape. ## Strategies to Find Better Optima For non-convex problems: - **Random restarts.** Run local search from many different starting points; keep the best. - **Simulated annealing.** Allow uphill moves with decreasing probability — see [[Simulated Annealing]]. - **Genetic algorithms.** Population-based search with crossover and mutation — see [[Genetic Algorithms]]. - **Convex relaxation.** Replace the non-convex problem with a convex approximation; solve; use as initialisation. - **Branch-and-bound.** Exact for some problem classes; partitions the feasible region and prunes. - **Lagrangian dual** — sometimes provides a tight bound and global optimum. ## When Global Is Required Some problems genuinely need global optima: - **Worst-case analysis** (adversarial robustness). - **Certified safety / correctness.** - **Some scientific applications** (protein folding, low-energy configurations). Otherwise, "good local" is often fine — and often the only achievable target in reasonable compute. ## Related - [[Convex vs Non-Convex Optimization]] - [[Optimization Problem]] - [[Gradient Descent]] - [[Simulated Annealing]]