## Definition
**Optimisers** are the algorithms that update model parameters from gradients during neural network training. The two dominant families are **SGD with momentum** and the **adaptive** family (RMSProp, Adam, AdamW).
## SGD with Momentum
Build velocity that accumulates past gradients:
$
v_{t+1} = \mu v_t + g_t
$
$
\theta_{t+1} = \theta_t - \eta v_{t+1}
$
with $\mu \in [0.9, 0.99]$. Smooths updates; accelerates in consistent directions.
**When SGD wins:** computer vision benchmarks (ImageNet, CIFAR), large-batch training. Often the highest-quality final model with sufficient tuning.
## RMSProp
Per-parameter adaptive learning rate. Maintain a running average of squared gradients:
$
s_t = \beta s_{t-1} + (1-\beta) g_t^2
$
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \epsilon}} g_t
$
Effectively divides the learning rate by the gradient magnitude — parameters with large gradients get smaller steps, parameters with small gradients get larger steps.
## Adam (Kingma & Ba, 2015)
Combines momentum + RMSProp. Standard for most deep learning in 2026.
$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(first moment, momentum)}
$
$
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(second moment, squared gradient)}
$
Bias-corrected:
$
\hat m_t = m_t / (1 - \beta_1^t), \quad \hat v_t = v_t / (1 - \beta_2^t)
$
Update:
$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t} + \epsilon} \hat m_t
$
Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
## AdamW
Adam with **decoupled weight decay** (Loshchilov & Hutter, 2017). The original Adam applied L2 weight decay through the gradient, which interacts badly with adaptive learning rates. AdamW applies weight decay directly:
$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t \right)
$
**The default optimiser for modern deep learning.** Used by virtually every LLM and large vision model.
## Why Adam Wins By Default
- **Adaptive learning rates** — different parameters need different scales; Adam handles this automatically.
- **Less tuning** than SGD — works well with default hyperparameters across many problems.
- **Fast initial convergence.**
## Why SGD Sometimes Wins Finals
- **Better generalisation** on some benchmarks — flatter minima.
- **More predictable** behaviour with proper learning-rate schedules.
- **Cheaper memory** (no second moment to store).
The "Adam learns faster, SGD generalises better" pattern is real but not universal.
## Lion, Sophia, and 2024+ Variants
Newer optimisers (Lion, Sophia, AdaFactor) emerged for very large-scale training, offering memory advantages or marginal performance gains. AdamW remains the safest default in 2026.
## Related
- [[Gradient Descent]]
- [[Stochastic Gradient Descent]]
- [[Learning Rate Schedules]]
- [[Backpropagation]]