## Definition
**Pretraining** is the first and most expensive training phase of an LLM. The model learns next-token prediction on a very large, broadly curated text corpus (billions to trillions of tokens) with no task-specific labels. The result is a *base model* with general-purpose capabilities that subsequent phases sculpt.
## The Objective
For decoder-only LLMs:
$$
\mathcal{L}_{\text{pretrain}} = -\sum_t \log P(x_t \mid x_{<t})
$$
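Each training sequence contributes the negative log-likelihood of every token $x_t$ conditioned on its predecessors $x_{<t}$; in practice the sum is usually averaged over tokens. The objective is self-supervised: the targets are the text's own next tokens, so no human labels are needed.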
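To make the objective concrete, here is a minimal sketch in plain Python that computes the summed negative log-likelihood for one short sequence. The helper names (`log_softmax`, `next_token_nll`, `uniform_model`) and the toy vocabulary are illustrative assumptions, not part of any library, and the sketch scores only tokens that have at least one predecessor (no beginning-of-sequence token is assumed).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_nll(logits_per_position, token_ids):
    """Sum of -log P(x_t | x_<t) over one sequence.

    logits_per_position[i] holds the scores over the vocabulary produced
    after reading token_ids[:i + 1], so it is scored against token_ids[i + 1].
    """
    total = 0.0
    for i in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[i])
        total -= log_probs[token_ids[i + 1]]
    return total

def uniform_model(prefix, vocab_size):
    """Hypothetical stand-in for a network: the same score for every token."""
    return [0.0] * vocab_size

vocab_size = 4
tokens = [2, 0, 3, 1]
logits = [uniform_model(tokens[: i + 1], vocab_size) for i in range(len(tokens))]
loss = next_token_nll(logits, tokens)

# A model that knows nothing pays log(vocab_size) per predicted token.
print(loss, 3 * math.log(vocab_size))
```

The shift by one position (`logits_per_position[i]` against `token_ids[i + 1]`) is the whole trick: the text supplies its own labels. A real training loop would replace `uniform_model` with a neural network and minimise this quantity by gradient descent.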
## Why It Works
The objective is deceptively simple, but it forces the model to internalise:
- Syntax and morphology of every language in the corpus.
- World knowledge that appears in text.
- Basic reasoning patterns that humans express in prose.
- The structure of code, math, and many other formal domains.
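Capabilities not directly optimised for, such as translation, summarisation, and instruction-following, *emerge* because they're implicit in the prediction task: correctly continuing a document that contains a translation or a summary requires producing one.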
## Data Mixture
A modern pretraining corpus typically contains:
- Web text (Common Crawl, filtered)
- Books and academic papers
- Code (GitHub, etc.)
- Wikipedia and curated encyclopedic sources
- Math and scientific text
- Some synthetic data (model-generated and curated)
The mixture is arguably *the* most consequential choice in pretraining, yet most labs publish very little about theirs.
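One simple way to apply a mixture is to draw each training document's source at random according to per-source weights. The sketch below illustrates that idea only; the source names and weights are hypothetical, chosen for illustration, and do not describe any lab's actual recipe.

```python
import random
from collections import Counter

# Hypothetical sampling weights, chosen only for illustration.
mixture = {
    "web": 0.60,
    "code": 0.15,
    "books_and_papers": 0.12,
    "encyclopedic": 0.05,
    "math_and_science": 0.05,
    "synthetic": 0.03,
}

def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(mixture, 10_000)
counts = Counter(draws)
for name in mixture:
    observed = counts[name] / len(draws)
    print(f"{name:18s} target={mixture[name]:.2f} observed={observed:.3f}")
```

Because the weights are independent of how much raw data each source holds, a small, high-quality source can be upsampled (seen several times) while a huge, noisy one is downsampled.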
## Compute Profile
Pretraining a frontier model in 2026 takes weeks to months on thousands of accelerators. Costs run into the tens or hundreds of millions of dollars. This is why pretraining a foundation model is the domain of a handful of labs.
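For a rough sense of where the "weeks to months" figure comes from, a common rule of thumb for dense transformers puts training compute at about 6 FLOPs per parameter per token. The sketch below applies it; every number (model size, token count, chip count, peak throughput, utilization) is a hypothetical placeholder rather than a figure for any real model or accelerator.

```python
SECONDS_PER_DAY = 86_400

def training_days(params, tokens, num_chips, peak_flops_per_chip, utilization):
    """Back-of-envelope wall-clock estimate for one pretraining run."""
    # Rule-of-thumb cost: ~6 FLOPs per parameter per training token.
    total_flops = 6 * params * tokens
    # Sustained cluster throughput: peak rate scaled by achieved utilization.
    sustained_flops_per_second = num_chips * peak_flops_per_chip * utilization
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY

# All values below are illustrative assumptions.
days = training_days(
    params=70e9,               # 70B parameters
    tokens=2e12,               # 2T training tokens
    num_chips=2048,
    peak_flops_per_chip=1e15,  # 1 PFLOP/s peak per accelerator
    utilization=0.4,           # fraction of peak actually sustained
)
print(f"{days:.1f} days")
```

The estimate ignores restarts, evaluation, and failed runs, so real schedules are longer. It does show why scaling parameters and tokens together quickly pushes a run from days into months.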
## Base Model vs Instruction-Tuned Model
The output of pretraining is a **base model**: capable, but not yet trained to follow instructions. It becomes a *useful chatbot or assistant* only after the [[Fine-Tuning]] phase (supervised fine-tuning, or SFT, followed by [[RLHF]] or [[Constitutional AI]]).
## Related
- [[Large Language Model]]
- [[Foundation Model]]
- [[Scaling Laws]]
- [[Fine-Tuning]]
- [[RLHF]]