Learning Rate Schedules - Albert Masoliver's learning site

## Definition

A **learning rate schedule** is a rule for changing the learning rate $\eta$ during training. A constant learning rate is rarely optimal: too high → can't converge precisely; too low → slow start. Schedules adapt $\eta$ over time.

## Common Schedules

### Step Decay

Multiply $\eta$ by a factor (e.g., 0.1) at fixed milestones (e.g., epochs 30, 60, 90). Simple, predictable. Classic for image classification.

### Exponential Decay

$\eta_t = \eta_0 \cdot \gamma^t$ for $\gamma \in (0, 1)$. Smooth continuous decrease.

### Cosine Annealing (Loshchilov & Hutter, 2017)

$$
\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos(\pi t / T)\right)
$$

Smooth half-cosine from $\eta_{\max}$ (at $t = 0$) down to $\eta_{\min}$ (at $t = T$) over $T$ steps. Empirically strong; the default for many transformer training recipes.

### Warmup + Cosine

Start at $\eta = 0$, linearly ramp up to $\eta_{\max}$ over the first few thousand steps, then cosine-anneal to $\eta_{\min}$. **The standard recipe for modern LLM training.** Warmup prevents large early updates that can destabilise training before the model has settled.

### Cyclical Learning Rates (Smith, 2017)

Periodic up-and-down between $\eta_{\min}$ and $\eta_{\max}$. Can help escape saddle points and poor local minima; can find good learning rates without grid search.

### One-Cycle (Smith, 2018)

Warmup, then cosine down, then a final very low phase. Compact recipe for fast training.

### Reduce on Plateau

Monitor validation loss; reduce $\eta$ by a factor when it plateaus. Adaptive; widely used.

## Why Warmup Matters

At the start of training, modern transformers tend to have very large gradient norms in some layers and unstable activation statistics. Updating aggressively from step 0 can push them into a bad regime they can't recover from. A linear warmup of even 500-1000 steps is often enough to prevent this.

## Why Cosine Decay Beats Step Decay

Continuous decay gives the optimiser the freedom to settle gradually. Step decay creates abrupt regime changes that can spike validation loss.

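
The schedule formulas above can be sketched as plain Python functions of the step (or epoch) index. This is a minimal illustration, not a reference implementation: the function names and every default value (`eta0`, `eta_max`, `eta_min`, `gamma`, `warmup_steps`) are assumptions chosen for the example, and holding $\eta$ at $\eta_{\min}$ once $t > T$ is also an assumption, since the note does not say what happens after the schedule ends.

```python
import math


def step_decay(epoch, eta0=0.1, factor=0.1, milestones=(30, 60, 90)):
    """Multiply eta0 by `factor` once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return eta0 * factor ** passed


def exponential_decay(t, eta0=0.1, gamma=0.99):
    """eta_t = eta0 * gamma^t, with gamma in (0, 1)."""
    return eta0 * gamma ** t


def cosine_annealing(t, T, eta_max=1e-3, eta_min=1e-5):
    """Half-cosine from eta_max (t = 0) down to eta_min (t = T)."""
    t = min(t, T)  # assumption: stay at eta_min after the schedule ends
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))


def warmup_cosine(t, T, warmup_steps=1000, eta_max=1e-3, eta_min=1e-5):
    """Linear ramp from 0 to eta_max, then cosine decay over the remaining steps."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    # Shift the clock so the cosine phase starts exactly at eta_max.
    return cosine_annealing(t - warmup_steps, T - warmup_steps, eta_max, eta_min)
```

Evaluating `step_decay` and `warmup_cosine` over the same range and plotting both makes the contrast in this section visible: the first drops in discrete jumps at each milestone, while the second changes continuously after the ramp.
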
## Learning Rate vs Batch Size Scaling

The **linear scaling rule** (Goyal et al., 2017): if you double the batch size, double the learning rate (up to a point). For very large batches the rule breaks down, and layer-wise adaptive optimisers such as LARS and LAMB are used instead.

## How to Choose

- **CNN classification:** step decay or cosine. Try both.
- **Transformers / LLM pretraining:** warmup + cosine. Industry standard.
- **Fine-tuning** a pretrained model: smaller peak $\eta$ + warmup + cosine.
- **Unknown problem:** start with cosine; tune $\eta_{\max}$ first.

## LR Range Test

A quick technique: run a few hundred steps with $\eta$ increasing exponentially, then plot loss vs $\eta$. The region where the loss falls fastest, just before it starts to diverge, indicates a good $\eta_{\max}$.

## Related

- [[Optimizers SGD Adam]]
- [[Gradient Descent]]
- [[Stochastic Gradient Descent]]