## Definition
**Scaling laws** are empirical regularities relating an LLM's loss (and capabilities) to three quantities: **parameters (N)**, **training tokens (D)**, and **compute (C)**. They were first characterised systematically by Kaplan et al. (OpenAI, 2020) and later refined by Hoffmann et al. (DeepMind, 2022), the "Chinchilla" paper.
## The Loss-Compute Relationship
Empirically, test loss falls as a power law in each of N, D, and C — provided the others are not the bottleneck:
$$
L(N, D) \approx L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta}
$$
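Here $L_\infty$ is the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are positive constants fitted per model family. The implication: bigger models trained on more data, with correspondingly more compute, get better in a predictable way.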
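As a minimal sketch, the parametric form above can be evaluated directly. The constants below are placeholders chosen for illustration only, not fitted values from either paper, and the function name `parametric_loss` is hypothetical.

```python
def parametric_loss(n_params, n_tokens,
                    l_inf=1.7, a=400.0, b=400.0, alpha=0.34, beta=0.28):
    """Evaluate L(N, D) = L_inf + A / N**alpha + B / D**beta.

    All default constants are illustrative placeholders (assumptions),
    not values fitted to any real model family.
    """
    return l_inf + a / n_params**alpha + b / n_tokens**beta


# Growing N and D together lowers the predicted loss toward l_inf.
for n_params, n_tokens in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    loss = parametric_loss(n_params, n_tokens)
    print(f"N={n_params:.0e}, D={n_tokens:.0e}, loss={loss:.3f}")
```

Holding one argument fixed and growing the other shows the bottleneck effect: the loss flattens out at the floor set by the term that is not being scaled.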
## Chinchilla Result (2022)
Hoffmann et al. showed that earlier large models (GPT-3, Gopher) were **undertrained**. The compute-optimal ratio was roughly:
> ~20 tokens per parameter.
So a 70B-parameter model should see ~1.4T training tokens for compute-optimal training, whereas GPT-3 (175B parameters) was trained on only ~300B tokens. This pushed the field toward smaller models trained on far more data.
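The arithmetic behind the rule of thumb is short enough to sketch. The helper name `compute_optimal_tokens` is hypothetical, and the 175B parameter count for GPT-3 is taken from outside this note; the ratio of 20 is the rough figure quoted above, not an exact constant.

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla rule of thumb, not an exact constant


def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Return the approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param


# A 70B-parameter model under the ~20 tokens/parameter rule.
chinchilla_params = 70_000_000_000
print(compute_optimal_tokens(chinchilla_params))  # 1400000000000, i.e. ~1.4T tokens

# For comparison: GPT-3 (175B parameters, ~300B tokens) sits far below 20.
gpt3_tokens_per_param = 300_000_000_000 / 175_000_000_000
print(f"GPT-3 tokens per parameter: {gpt3_tokens_per_param:.2f}")
```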
## Why It Matters
Scaling laws turned LLM development from a research lottery into an engineering discipline — given a compute budget, you can predict the optimal parameter / data split.
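A hedged sketch of that budgeting step follows. It relies on two assumptions that are not stated in this note: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6ND), and a fixed ratio of ~20 tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20      # assumption: rough Chinchilla ratio


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a training FLOP budget into (parameters, tokens).

    Solves C = 6 * N * D with D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A budget of 5.88e23 FLOPs is what 6 * 70e9 * 1.4e12 works out to.
n_params, n_tokens = compute_optimal_split(5.88e23)
print(f"params ~ {n_params:.2e}, tokens ~ {n_tokens:.2e}")
```

With that budget the split comes out at roughly 70B parameters and 1.4T tokens. Real planning would fit the constants to the model family rather than hard-coding them.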
## Limits and Caveats (2026 view)
- Power-law extrapolations have visibly bent in recent years; pure scaling is no longer the only frontier.
- Data quality and curation matter more than the original laws assumed.
- Test-time compute (extended thinking, see [[Extended Thinking]]) opens a separate scaling axis that the original laws didn't model.
- Mixture-of-Experts complicates the simple parameter count — *active* parameters differ from *total* parameters.
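The Mixture-of-Experts caveat above can be made concrete with a toy parameter count. Everything here is a simplifying assumption: `MoEConfig` is a hypothetical type rather than a real library class, the numbers are invented, and real architectures differ in which layers are shared.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical, simplified MoE description (not a real library type)."""
    shared_params: int      # attention, embeddings, etc. used by every token
    params_per_expert: int  # parameters in one expert, summed over layers
    n_experts: int          # experts available to the router
    top_k: int              # experts actually used per token

    def total_params(self) -> int:
        """All parameters stored in the model."""
        return self.shared_params + self.n_experts * self.params_per_expert

    def active_params(self) -> int:
        """Parameters that take part in processing a single token."""
        return self.shared_params + self.top_k * self.params_per_expert


# Invented numbers: 8 experts, 2 of them routed to per token.
cfg = MoEConfig(shared_params=2_000_000_000,
                params_per_expert=1_000_000_000,
                n_experts=8, top_k=2)
print(cfg.total_params(), cfg.active_params())
```

In this toy case the model stores 10B parameters but uses only 4B per token, so it is unclear which number should play the role of N in a dense-model scaling law.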
## Related
- [[Large Language Model]]
- [[Foundation Model]]
- [[Pretraining]]
- [[Extended Thinking]]