Optimizers SGD Adam - Albert Masoliver's learning site

## Definition

**Optimisers** are the algorithms that update model parameters from gradients during neural network training. The two dominant families are **SGD with momentum** and the **adaptive** family (RMSProp, Adam, AdamW).

## SGD with Momentum

Build a velocity that accumulates past gradients:

$$ v_{t+1} = \mu v_t + g_t $$

$$ \theta_{t+1} = \theta_t - \eta v_{t+1} $$

with $\mu$ typically in $[0.9, 0.99]$. Here $g_t$ is the gradient at step $t$ and $\eta$ is the learning rate. Momentum smooths the updates and accelerates progress along directions where successive gradients agree.

**When SGD wins:** computer vision benchmarks (ImageNet, CIFAR), large-batch training. With sufficient tuning it often yields the highest-quality final model.

## RMSProp

Per-parameter adaptive learning rate. Maintain a running average of squared gradients:

$$ s_t = \beta s_{t-1} + (1-\beta) g_t^2 $$

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \epsilon}} g_t $$

This effectively divides the learning rate by the recent gradient magnitude (the root of the running mean square): parameters with large gradients get smaller steps, parameters with small gradients get larger steps.

## Adam (Kingma & Ba, 2015)

Combines momentum with RMSProp-style scaling. Standard for most deep learning in 2026.

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(first moment, momentum)} $$

$$ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(second moment, squared gradient)} $$

Bias-corrected:

$$ \hat m_t = m_t / (1 - \beta_1^t), \quad \hat v_t = v_t / (1 - \beta_2^t) $$

Update:

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t} + \epsilon} \hat m_t $$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Note that $v_t$ here is the second-moment estimate, not the velocity from the momentum section.

## AdamW

Adam with **decoupled weight decay** (Loshchilov & Hutter, 2017). With plain Adam, weight decay is usually implemented as an L2 penalty added to the gradient, so the decay term is rescaled by the adaptive denominator $\sqrt{\hat v_t} + \epsilon$ and interacts badly with the adaptive learning rates. AdamW applies weight decay directly to the parameters:

$$ \theta_{t+1} = \theta_t - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t \right) $$

where $\lambda$ is the weight-decay coefficient.

**The default optimiser for modern deep learning.** Used by virtually every LLM and large vision model.

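
The update rules for SGD with momentum, RMSProp, Adam, and AdamW can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not any library's implementation: the function names (`sgd_momentum_step`, `rmsprop_step`, `adam_step`, `toy_grad`), the list-of-floats parameter representation, the default learning rates, and the toy quadratic loss are all hypothetical choices made for this example. AdamW is expressed here as `adam_step` with a non-zero `weight_decay`.

```python
import math


def sgd_momentum_step(theta, grad, velocity, lr=0.001, mu=0.9):
    """v <- mu * v + g, then theta <- theta - lr * v."""
    velocity = [mu * v + g for v, g in zip(velocity, grad)]
    theta = [p - lr * v for p, v in zip(theta, velocity)]
    return theta, velocity


def rmsprop_step(theta, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """s <- beta * s + (1 - beta) * g^2, then scale the step by 1 / sqrt(s + eps)."""
    s = [beta * si + (1 - beta) * g * g for si, g in zip(s, grad)]
    theta = [p - lr / math.sqrt(si + eps) * g
             for p, si, g in zip(theta, s, grad)]
    return theta, s


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step; t is the 1-based step count.

    weight_decay > 0 adds the decoupled AdamW term lr * weight_decay * theta.
    """
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * p)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


def toy_grad(theta):
    """Gradient of the hypothetical loss 0.5 * (100 * x^2 + y^2)."""
    return [100.0 * theta[0], theta[1]]


if __name__ == "__main__":
    # Run Adam on the toy loss, starting from (1, 1) with zeroed moments.
    theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, 201):
        theta, m, v = adam_step(theta, toy_grad(theta), m, v, t, lr=0.05)
    print("Adam after 200 steps:", theta)
```

On the first step ($t = 1$) the bias-corrected moments reduce to $\hat m_1 = g_1$ and $\hat v_1 = g_1^2$, so Adam moves every coordinate by roughly $\eta$ regardless of gradient scale. On the same toy loss, a plain SGD step moves the steep coordinate 100 times further than the shallow one.
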
## Why Adam Wins By Default

- **Adaptive learning rates**: different parameters need different scales; Adam handles this automatically.
- **Less tuning** than SGD: works well with default hyperparameters across many problems.
- **Fast initial convergence.**

## Why SGD Sometimes Wins Finals

- **Better generalisation** on some benchmarks, often attributed to convergence to flatter minima.
- **More predictable** behaviour with proper learning-rate schedules.
- **Cheaper memory**: SGD with momentum stores one state buffer per parameter, whereas Adam stores two (no second moment to keep).

The "Adam learns faster, SGD generalises better" pattern is real but not universal.

## Lion, Sophia, and 2024+ Variants

Newer optimisers such as Lion and Sophia, alongside the older memory-efficient Adafactor, target very large-scale training, offering memory advantages or marginal performance gains. AdamW remains the safest default in 2026.

## Related

- [[Gradient Descent]]
- [[Stochastic Gradient Descent]]
- [[Learning Rate Schedules]]
- [[Backpropagation]]