## Definition

An **autoregressive language model** is a model trained to predict the next token in a sequence using only the tokens that precede it. At each step $t$, the model estimates:

$$
P(x_t \mid x_1, x_2, \ldots, x_{t-1})
$$

Because generation proceeds left-to-right, one token at a time, the model can produce open-ended outputs of arbitrary length (bounded in practice by its context window). This is the defining property of generative AI. GPT-2, GPT-3, GPT-4, Llama, Mistral, and Claude are all autoregressive.

Autoregressive models are sometimes called *causal language models*.

## Training: Self-Supervision

A single sentence provides many training samples automatically. Given "I love street food.", the model sees six (context, target-token) pairs, counting the end-of-sequence marker as the final target, with no human labelling required. This self-supervised objective is what allowed language models to scale to trillions of training tokens and become [[Large Language Model]]s.

Formally, the training loss is the average negative log-likelihood across all tokens:

$$
\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})
$$

where $T$ is the number of tokens in the sequence and $x_{<t}$ is shorthand for $x_1, \ldots, x_{t-1}$.

## Inference: Prefill and Decode

Transformer-based autoregressive models execute inference in two distinct phases (Huyen, 2024):

**Prefill**: the prompt tokens are processed in parallel, producing the key-value cache needed to generate the first output token. This step is highly parallelisable.

**Decode**: tokens are generated sequentially, one per forward pass. Each new token is appended to the context and conditions the next prediction. This sequential bottleneck is the primary driver of output latency: a 100-token response at 10 ms/token takes at least one second (100 × 10 ms).

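
The decode loop described above can be illustrated with a minimal sketch. It assumes a toy lookup table, `toy_next_token`, in place of a trained model, and a hypothetical helper `min_decode_latency_s` for the latency arithmetic; neither name comes from a real library. A real model would return a probability distribution over the vocabulary and reuse a key-value cache instead of re-reading the whole context.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a trained model: maps a context to one next token.
NextTokenFn = Callable[[Sequence[str]], str]


def toy_next_token(context: Sequence[str]) -> str:
    """Toy lookup keyed on the last token only (illustration, not a real LM)."""
    table = {
        "<BOS>": "I",
        "I": "love",
        "love": "street",
        "street": "food",
        "food": ".",
        ".": "<EOS>",
    }
    return table.get(context[-1], "<EOS>")


def generate(
    prompt: List[str], next_token: NextTokenFn, max_new_tokens: int = 16
) -> List[str]:
    """Greedy autoregressive generation: one new token per loop iteration."""
    # Prefill analogue: the whole prompt is available to the model at once.
    context = list(prompt)
    output: List[str] = []
    # Decode: each step depends on the token produced by the previous step,
    # so the steps cannot run in parallel.
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == "<EOS>":
            break
        output.append(token)
        context.append(token)  # the new token conditions the next prediction
    return output


def min_decode_latency_s(num_tokens: int, ms_per_token: float) -> float:
    """Lower bound on decode time when tokens are produced strictly in sequence."""
    return num_tokens * ms_per_token / 1000.0


generated = generate(["<BOS>", "I"], toy_next_token)
latency = min_decode_latency_s(100, 10)
```

Here `generated` holds the tokens produced after the prompt, and `latency` reproduces the 100-token, 10 ms/token estimate from the text. The loop stops either at the end-of-sequence marker or at `max_new_tokens`, whichever comes first.
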
## Contrast with Masked Language Models

| | Autoregressive | Masked (e.g., BERT) |
|---|---|---|
| Prediction direction | Left-to-right only | Both directions (bidirectional) |
| Training signal | Next token | Randomly masked tokens |
| Primary use | Open-ended generation | Classification, embeddings |
| Context at inference | Preceding tokens only | Full sequence |

Masked models see both preceding and following context during training, making them powerful for understanding tasks (sentiment analysis, NER, text classification) but poorly suited to open-ended generation. See [[Masked Language Model]].

## Vocabulary and Tokens

The model operates over a fixed vocabulary: the set of all tokens it can produce. GPT-4's vocabulary has 100,256 entries; Mixtral 8x7B has 32,000. Vocabulary size influences the model's expressiveness, its compute and memory cost, and how compactly different languages are tokenised (see [[Tokenization]]).

The number of key-value vectors that must be kept for the attended tokens grows with sequence length, which is a fundamental reason why extending context length for autoregressive models is computationally expensive. The [[KV Cache]] is the mechanism that avoids recomputing these vectors at each decode step.

## Why Completion Is Powerful

Many tasks can be reframed as completion. Translation ("How are you in French is …"), summarisation ("The key points of the following article are …"), and classification ("Is this email spam? Answer yes or no: …") all reduce to next-token prediction. This generality, combined with scale, is why autoregressive models became [[Foundation Model]]s.

## Related

- [[Large Language Model]]
- [[Masked Language Model]]
- [[Foundation Model]]
- [[Tokenization]]
- [[Token]]
- [[KV Cache]]
- [[Pretraining]]
- [[Sampling]]

## Sources

- [[AI Engineering - Chip Huyen]]