## Definition
**Gradient descent** is the workhorse iterative optimiser for differentiable loss functions. At each step, update the parameters by moving in the direction of the negative gradient:
$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)
$
- $\eta > 0$ — *learning rate* (step size).
- $\nabla_\theta \mathcal{L}$ — gradient of the loss with respect to parameters.
## Why It Works
The gradient $\nabla \mathcal{L}$ is the direction of steepest *ascent*; the negative gradient is steepest *descent*. For sufficiently small $\eta$ on a smooth loss, each step decreases $\mathcal{L}$. For convex losses, gradient descent provably converges to the global minimum.
## Three Common Variants
### Batch Gradient Descent
Compute the gradient over the *entire* training set per step:
$
\nabla \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla L(f_\theta(x_i), y_i)
$
Exact gradient; slow per step; deterministic.
### [[Stochastic Gradient Descent]] (SGD)
Compute the gradient from *one* example per step. Very noisy but very fast.
### Mini-batch Gradient Descent
Compute the gradient over a small batch (typically 32–512). The practical default. Balances batch's stability with SGD's speed.
## Learning Rate — The Critical Knob
- **Too small** — slow convergence; may get trapped in local minima or plateaus.
- **Too large** — divergence or oscillation.
- **Just right** — fast, stable descent.
Modern practice:
- **Learning rate schedules** (step, cosine, warmup-decay).
- **Adaptive optimisers** (Adam, RMSProp) — adjust per-parameter step sizes.
See [[Learning Rate Schedules]] and [[Optimizers SGD Adam]].
## Convergence
- **Convex loss + smooth + small $\eta$:** guaranteed convergence to global minimum.
- **Non-convex loss:** convergence to *some* stationary point (often a local minimum or saddle point). For deep networks, the local minima found tend to generalise well in practice — a mysterious empirical fact.
## Common Pitfalls
- **Saddle points** in high-dim spaces are far more common than local minima. Gradient descent can stall at them; momentum and noise help escape.
- **Plateaus** — flat regions where gradients are tiny. Adaptive optimisers (Adam, RMSProp) handle them better than plain SGD.
- **Exploding / vanishing gradients** in deep networks. Solutions: better initialisation, batch normalisation, residual connections.
- **Bad scaling** of features — see [[Feature Scaling]].
## Beyond First-Order
Gradient descent uses only first derivatives. Second-order methods (Newton, Quasi-Newton) use the Hessian; they converge faster per step but are too expensive per step for deep learning (millions of parameters). First-order methods with smart adaptation dominate the field.
## Related
- [[Stochastic Gradient Descent]]
- [[Optimizers SGD Adam]]
- [[Loss Functions]]
- [[Backpropagation]]
- [[10 - Optimization Notes Hub]]