## Definition
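A **learning rate schedule** is a rule for changing the learning rate $\eta$ during training. A constant learning rate is rarely optimal: a value that is too high keeps the optimiser from settling precisely into a minimum, while a value that is too low makes early progress slow. Schedules vary $\eta$ over time to get fast progress early and fine-grained convergence late.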
## Common Schedules
### Step Decay
Multiply $\eta$ by a factor (e.g., 0.1) at fixed milestones (e.g., epoch 30, 60, 90). Simple, predictable. Classic for image classification.
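A minimal sketch of step decay in plain Python. The function name `step_decay_lr` and its default values are illustrative assumptions taken from the example numbers above, not a library API.

```python
from bisect import bisect_right


def step_decay_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Return base_lr multiplied by gamma once per milestone already reached."""
    # bisect_right counts how many milestones are <= epoch
    # (milestones must be sorted in increasing order).
    num_drops = bisect_right(milestones, epoch)
    return base_lr * gamma ** num_drops


if __name__ == "__main__":
    for epoch in (0, 29, 30, 60, 90):
        print(epoch, step_decay_lr(epoch))
```

The rate stays flat between milestones and drops by a factor of `gamma` at each one.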
### Exponential Decay
$\eta_t = \eta_0 \cdot \gamma^t$ for $\gamma \in (0, 1)$. Smooth continuous decrease.
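A short sketch of the formula above, assuming $t$ counts optimiser steps. The helper `gamma_for_half_life` is a hypothetical convenience: it picks $\gamma$ so that $\eta$ halves every given number of steps, which is often easier to reason about than $\gamma$ itself.

```python
def exponential_decay_lr(step, base_lr=0.1, gamma=0.99):
    """eta_t = eta_0 * gamma ** t, with 0 < gamma < 1."""
    return base_lr * gamma ** step


def gamma_for_half_life(half_life_steps):
    """Return the gamma for which the learning rate halves every half_life_steps."""
    return 0.5 ** (1.0 / half_life_steps)


if __name__ == "__main__":
    gamma = gamma_for_half_life(1000)
    for step in (0, 1000, 2000):
        print(step, exponential_decay_lr(step, base_lr=0.1, gamma=gamma))
```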
### Cosine Annealing (Loshchilov & Hutter, 2017)
$$
\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)
$$
Smooth half-cosine from $\eta_{\max}$ down to $\eta_{\min}$ over $T$ steps. Empirically strong; the default for many transformer training recipes.
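The cosine formula translates almost directly into code. This is a sketch with assumed names (`cosine_annealing_lr`, `total_steps` for $T$). Clamping steps beyond $T$ to $\eta_{\min}$ is a choice made here, not part of the formula.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Half-cosine decay from lr_max at step 0 to lr_min at step total_steps."""
    # Clamp so that steps past the horizon stay at lr_min instead of rising again.
    t = min(step, total_steps)
    cosine = 1.0 + math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_annealing_lr(step, total_steps=100))
```

The decay is slow near the start and the end and fastest around $t = T/2$, where $\eta$ is halfway between $\eta_{\max}$ and $\eta_{\min}$.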
### Warmup + Cosine
Start at $\eta = 0$, linearly ramp up to $\eta_{\max}$ over the first few thousand steps, then cosine-anneal to $\eta_{\min}$. **The standard recipe for modern LLM training.**
Warmup prevents large early updates that can destabilise training before the model has settled.
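A sketch of the combined schedule, under the assumption that the cosine phase spans the steps remaining after warmup. The function name and default values are illustrative only, not taken from any particular training recipe.

```python
import math


def warmup_cosine_lr(step, warmup_steps, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup from 0 to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, lr_max at step == warmup_steps.
        return lr_max * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1000, 5500, 10000):
        print(step, warmup_cosine_lr(step, warmup_steps=1000, total_steps=10000))
```

In a training loop the returned value would be written into the optimiser's learning rate before each update.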
### Cyclical Learning Rates (Smith, 2017)
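Periodic up-and-down between $\eta_{\min}$ and $\eta_{\max}$. The periodic increases may help the optimiser move past saddle points and poor local minima, and cycling over a range reduces the need to find a single best fixed $\eta$ by grid search.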
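A sketch of the simplest cyclical variant, a triangular wave. The names `triangular_clr` and `step_size` (the number of steps in each half-cycle) are assumptions for illustration.

```python
def triangular_clr(step, step_size, lr_min=1e-4, lr_max=1e-2):
    """Triangular wave: lr_min -> lr_max -> lr_min every 2 * step_size steps."""
    cycle = step // (2 * step_size)             # index of the current cycle
    x = abs(step / step_size - 2 * cycle - 1)   # 1 at cycle edges, 0 at the peak
    return lr_min + (lr_max - lr_min) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    for step in (0, 50, 100, 150, 200, 300):
        print(step, triangular_clr(step, step_size=100))
```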
### One-Cycle (Smith, 2018)
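A single cycle: ramp $\eta$ up to a peak, anneal it back down (linearly in Smith's description, with a cosine in several library implementations), then finish with a short phase at a much lower $\eta$. Compact recipe for fast training.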
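A sketch of a three-phase one-cycle schedule. For simplicity it uses linear segments throughout; a cosine could be swapped in for the down phase. All parameter names and defaults (`div_factor`, `final_div_factor`, `pct_up`, `pct_down`) are assumptions for this example rather than a reference implementation.

```python
def one_cycle_lr(step, total_steps, lr_max=1e-2, div_factor=25.0,
                 final_div_factor=1e4, pct_up=0.45, pct_down=0.45):
    """Piecewise-linear one-cycle: start -> peak -> start -> very low final value."""
    lr_start = lr_max / div_factor
    lr_final = lr_start / final_div_factor
    up_end = round(pct_up * total_steps)                  # last step of the ramp-up
    down_end = round((pct_up + pct_down) * total_steps)   # last step of the ramp-down

    if step <= up_end:
        frac = step / max(1, up_end)
        return lr_start + frac * (lr_max - lr_start)
    if step <= down_end:
        frac = (step - up_end) / max(1, down_end - up_end)
        return lr_max + frac * (lr_start - lr_max)
    # Final phase: decay from lr_start to lr_final, then hold.
    frac = min(1.0, (step - down_end) / max(1, total_steps - down_end))
    return lr_start + frac * (lr_final - lr_start)


if __name__ == "__main__":
    for step in (0, 450, 900, 1000):
        print(step, one_cycle_lr(step, total_steps=1000))
```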
### Reduce on Plateau
Monitor validation loss and reduce $\eta$ by a factor when the loss has stopped improving for a set number of epochs (the patience). Adaptive; widely used.
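A minimal sketch of plateau-based reduction. The class name, the `patience` semantics (reduce after more than `patience` consecutive epochs without a new best loss), and the defaults are assumptions made for this example. Real implementations usually add an improvement threshold and a cooldown period.

```python
class ReduceOnPlateau:
    """Cut the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=3, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # consecutive epochs without a new best

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    scheduler = ReduceOnPlateau(lr=0.1, factor=0.5, patience=2)
    for loss in (1.0, 0.9, 0.9, 0.9, 0.9):
        print(loss, scheduler.step(loss))
```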
## Why Warmup Matters
Early in training, gradients can be large and noisy, and adaptive optimisers such as Adam have seen too few steps for their running statistics to be reliable. Updating aggressively from step 0 can push the model into a bad regime it cannot recover from. A linear warmup of even 500 to 1000 steps is often enough to avoid this.
## Why Cosine Decay Beats Step Decay
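Continuous decay gives the optimiser the freedom to settle gradually. Step decay creates abrupt regime changes that show up as sudden jumps in the loss curve, and its milestones are extra hyperparameters to tune. The advantage is empirical rather than guaranteed, so it is worth checking on the task at hand.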
## Learning Rate vs Batch Size Scaling
The **linear scaling rule** (Goyal et al., 2017): when the batch size is multiplied by $k$, multiply the learning rate by $k$ as well (up to a point). For very large batches the rule breaks down, and layer-wise adaptive optimisers such as LARS and LAMB are used instead.
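The linear scaling rule is a one-line computation. The sketch below uses a hypothetical helper `scale_lr` and illustrative reference values (a learning rate of 0.1 tuned at batch size 256).

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size


if __name__ == "__main__":
    # Illustrative: a rate tuned at batch size 256, rescaled for larger batches.
    for batch_size in (256, 512, 1024, 8192):
        print(batch_size, scale_lr(0.1, 256, batch_size))
```

The scaled value is a starting point for tuning, since the rule only holds up to a point.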
## How to Choose
- **CNN classification:** step decay or cosine. Try both.
- **Transformers / LLM pretraining:** warmup + cosine. Industry standard.
- **Fine-tuning** a pretrained model: smaller peak $\eta$ + warmup + cosine.
- **Unknown problem:** start with cosine; tune $\eta_{\max}$ first.
## LR Range Test
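A quick technique: run a few hundred steps with $\eta$ increasing exponentially and plot loss against $\eta$ on a log axis. The region where the loss falls fastest indicates a good $\eta_{\max}$; choose a value safely below the point where the loss starts to rise.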
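A sketch of the two pieces of a range test: generating the exponentially spaced learning rates, and reading a suggestion off the recorded losses. Both function names are hypothetical, and the loss values in the usage example are made up for illustration. Real loss curves are noisy and usually need smoothing before taking slopes.

```python
import math


def lr_range(lr_start, lr_end, num_steps):
    """Return num_steps learning rates spaced evenly on a log scale."""
    ratio = (lr_end / lr_start) ** (1.0 / (num_steps - 1))
    return [lr_start * ratio ** i for i in range(num_steps)]


def suggest_lr(lrs, losses):
    """Return the learning rate at the start of the interval with the steepest loss drop.

    The slope is measured per decade of learning rate (loss vs log10(lr)).
    """
    best_index, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:
            best_index, best_slope = i, slope
    if best_index is None:
        raise ValueError("loss never decreased during the range test")
    return lrs[best_index - 1]


if __name__ == "__main__":
    lrs = lr_range(1e-5, 1.0, 6)
    losses = [2.30, 2.28, 2.10, 1.40, 1.60, 5.00]  # made-up values for illustration
    print(suggest_lr(lrs, losses))
```

Returning the start of the steepest interval is a deliberately conservative choice, since it stays below the rate at which the loss begins to diverge.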
## Related
- [[Optimizers SGD Adam]]
- [[Gradient Descent]]
- [[Stochastic Gradient Descent]]