Pretraining - Albert Masoliver's learning site

## Definition

**Pretraining** is the first and most expensive training phase of an LLM. The model learns next-token prediction on a very large, broadly-curated text corpus (billions to trillions of tokens) with no task-specific labels. The result is a *base model* with general-purpose capabilities that subsequent phases sculpt.

## The Objective

For decoder-only LLMs:

$$
\mathcal{L}_{\text{pretrain}} = -\sum_t \log P(x_t \mid x_{<t})
$$

Each training example contributes the negative log-likelihood of every token conditional on its predecessors. Self-supervised: no human labels needed beyond the existing text.

## Why It Works

The objective is deceptively simple, but it forces the model to internalise:

- Syntax and morphology of every language in the corpus.
- World knowledge that appears in text.
- Basic reasoning patterns that humans express in prose.
- The structure of code, math, and many other formal domains.

Capabilities not directly optimised for (translation, summarisation, instruction-following) *emerge* because they're implicit in the prediction task.

## Data Mixture

A modern pretraining corpus typically contains:

- Web text (Common Crawl, filtered)
- Books and academic papers
- Code (GitHub, etc.)
- Wikipedia and curated encyclopedic sources
- Math and scientific text
- Some synthetic data (model-generated and curated)

The mixture is *the* most consequential choice in pretraining, and most labs publish very little about theirs.

## Compute Profile

Pretraining a frontier model in 2026 takes weeks to months on thousands of accelerators. Costs run into the tens or hundreds of millions of dollars. This is why pretraining a foundation model is the domain of a handful of labs.

## Base Model vs Instruction-Tuned Model

The output of pretraining is a **base model**: capable but not aligned to follow instructions. The model becomes a *useful chatbot or assistant* only after the [[Fine-Tuning]] phase (SFT, then [[RLHF]] or [[Constitutional AI]]).

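
To make the pretraining objective from "The Objective" concrete, here is a minimal sketch of the summed negative log-likelihood for one token sequence. It uses only the Python standard library. The helper names (`log_softmax`, `pretraining_loss`) and the toy data are illustrative assumptions, not part of any particular training stack. The sketch also assumes the first token is given as context (in practice often a beginning-of-sequence token), so the sum runs over the tokens that follow it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def pretraining_loss(logits_per_position, token_ids):
    """Return the sum over t of -log P(x_t | x_<t) for one sequence.

    Assumption: logits_per_position[t] holds the model's scores over the
    vocabulary after reading token_ids[0..t], so it is scored against
    token_ids[t + 1]. The first token is treated as given context.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        total -= log_probs[token_ids[t + 1]]
    return total


# Toy example: vocabulary of 4 tokens, sequence of 4 token ids.
vocab_size = 4
tokens = [0, 2, 1, 3]

# A "model" that knows nothing assigns equal scores to every token,
# so each of the 3 predicted tokens costs log(vocab_size).
uniform_logits = [[0.0] * vocab_size for _ in tokens]
uniform_loss = pretraining_loss(uniform_logits, tokens)

# A "model" that puts a high score on the correct next token.
confident_logits = [[0.0] * vocab_size for _ in tokens]
for t in range(len(tokens) - 1):
    confident_logits[t][tokens[t + 1]] = 5.0
confident_loss = pretraining_loss(confident_logits, tokens)

print(f"uniform loss:   {uniform_loss:.4f}")
print(f"confident loss: {confident_loss:.4f}")
```

The uniform predictor pays `log(4)` per predicted token, while the predictor that favours the true next token pays much less. Training pushes the model from the first regime toward the second across the whole corpus. Real training code typically averages this quantity over tokens and batches rather than summing it.
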
## Related

- [[Large Language Model]]
- [[Foundation Model]]
- [[Scaling Laws]]
- [[Fine-Tuning]]
- [[RLHF]]