Gradient Descent - Albert Masoliver's learning site

## Definition **Gradient descent** is the workhorse iterative optimiser for differentiable loss functions. At each step, update the parameters by moving in the direction of the negative gradient: $ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t) $ - $\eta > 0$ — *learning rate* (step size). - $\nabla_\theta \mathcal{L}$ — gradient of the loss with respect to parameters. ## Why It Works The gradient $\nabla \mathcal{L}$ is the direction of steepest *ascent*; the negative gradient is steepest *descent*. For sufficiently small $\eta$ on a smooth loss, each step decreases $\mathcal{L}$. For convex losses, gradient descent provably converges to the global minimum. ## Three Common Variants ### Batch Gradient Descent Compute the gradient over the *entire* training set per step: $ \nabla \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla L(f_\theta(x_i), y_i) $ Exact gradient; slow per step; deterministic. ### [[Stochastic Gradient Descent]] (SGD) Compute the gradient from *one* example per step. Very noisy but very fast. ### Mini-batch Gradient Descent Compute the gradient over a small batch (typically 32–512). The practical default. Balances batch's stability with SGD's speed. ## Learning Rate — The Critical Knob - **Too small** — slow convergence; may get trapped in local minima or plateaus. - **Too large** — divergence or oscillation. - **Just right** — fast, stable descent. Modern practice: - **Learning rate schedules** (step, cosine, warmup-decay). - **Adaptive optimisers** (Adam, RMSProp) — adjust per-parameter step sizes. See [[Learning Rate Schedules]] and [[Optimizers SGD Adam]]. ## Convergence - **Convex loss + smooth + small $\eta$:** guaranteed convergence to global minimum. - **Non-convex loss:** convergence to *some* stationary point (often a local minimum or saddle point). For deep networks, the local minima found tend to generalise well in practice — a mysterious empirical fact. ## Common Pitfalls - **Saddle points** in high-dim spaces are far more common than local minima. Gradient descent can stall at them; momentum and noise help escape. - **Plateaus** — flat regions where gradients are tiny. Adaptive optimisers (Adam, RMSProp) handle them better than plain SGD. - **Exploding / vanishing gradients** in deep networks. Solutions: better initialisation, batch normalisation, residual connections. - **Bad scaling** of features — see [[Feature Scaling]]. ## Beyond First-Order Gradient descent uses only first derivatives. Second-order methods (Newton, Quasi-Newton) use the Hessian; they converge faster per step but are too expensive per step for deep learning (millions of parameters). First-order methods with smart adaptation dominate the field. ## Related - [[Stochastic Gradient Descent]] - [[Optimizers SGD Adam]] - [[Loss Functions]] - [[Backpropagation]] - [[10 - Optimization Notes Hub]]