Scaling Laws - Albert Masoliver's learning site

## Definition

**Scaling laws** are empirical regularities relating an LLM's loss (and capabilities) to three quantities: **parameters (N)**, **training tokens (D)**, and **compute (C)**. They were first systematically characterised by Kaplan et al. (OpenAI, 2020) and refined by Hoffmann et al. (DeepMind, 2022, the "Chinchilla" paper).

## The Loss-Compute Relationship

Empirically, test loss falls as a power law in each of N, D, and C, provided the other quantities are not the bottleneck:

$$
L(N, D) \approx L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta}
$$

Here $L_\infty$ is the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are constants fitted per model family. The implication: bigger models trained on more data with more compute get predictably better.

## Chinchilla Result (2022)

Hoffmann et al. showed that earlier large models (GPT-3, Gopher) were **undertrained**: they had too many parameters for the number of tokens they saw. The compute-optimal ratio was roughly:

> ~20 tokens per parameter.

So a 70B model should see ~1.4T training tokens for compute-optimal training. GPT-3, by contrast, trained 175B parameters on only about 300B tokens, fewer than 2 tokens per parameter. This re-anchored the field.

## Why It Matters

Scaling laws turned LLM development from a research lottery into an engineering discipline: given a compute budget, you can predict the compute-optimal split between parameters and data.

## Limits and Caveats (2026 view)

- Power-law extrapolations have visibly bent in recent years; pure scaling is no longer the only frontier.
- Data quality and curation matter more than the original laws assumed.
- Test-time compute (extended thinking, see [[Extended Thinking]]) opens a separate scaling axis that the original laws didn't model.
- Mixture-of-Experts complicates the simple parameter count: *active* parameters differ from *total* parameters.

## Related

- [[Large Language Model]]
- [[Foundation Model]]
- [[Pretraining]]
- [[Extended Thinking]]
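
## Worked Example

The ~20 tokens-per-parameter rule and the parametric loss form above can be turned into a small calculator. This is a minimal sketch, not the method from either paper. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which this note does not state, and the function names are hypothetical. The loss constants are passed in explicitly because they are specific to each model family.

```python
import math

TOKENS_PER_PARAM = 20.0       # Chinchilla rule of thumb quoted in this note
FLOPS_PER_PARAM_TOKEN = 6.0   # ASSUMPTION: C ~= 6 * N * D (not stated in this note)


def optimal_tokens(n_params: float) -> float:
    """Training tokens suggested by the ~20 tokens/parameter rule."""
    return TOKENS_PER_PARAM * n_params


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens).

    Solves C = 6 * N * D together with D = 20 * N, i.e. N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, optimal_tokens(n_params)


def parametric_loss(n: float, d: float, l_inf: float,
                    a: float, b: float, alpha: float, beta: float) -> float:
    """L(N, D) ~= L_inf + A / N^alpha + B / D^beta, with caller-supplied constants."""
    return l_inf + a / n ** alpha + b / d ** beta


if __name__ == "__main__":
    # A 70B-parameter model under the 20:1 rule.
    print(f"tokens for 70B params: {optimal_tokens(70e9):.3g}")

    # Going the other way: start from a FLOP budget.
    n, d = compute_optimal_split(5.88e23)
    print(f"params: {n:.3g}, tokens: {d:.3g}")
```

Under these assumptions a 70B model maps to about 1.4T tokens, and a budget of roughly 5.88e23 FLOPs maps back to about 70B parameters. The 20:1 ratio is itself an approximation, so treat the outputs as order-of-magnitude guidance rather than exact targets.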