Perplexity - Albert Masoliver's learning site

## Definition

**Perplexity** (PPL) is an information-theoretic metric that quantifies how uncertain a language model is when predicting the next token in a sequence. It is the exponential of [[Cross-Entropy Loss|cross entropy]]:

$$
\text{PPL}(P, Q) = 2^{H(P, Q)}
$$

where $H(P, Q)$ is the cross entropy, measured in bits, of the model's learned distribution $Q$ with respect to the true data distribution $P$. When cross entropy is measured in nats (natural logarithm), as in PyTorch and TensorFlow, the base becomes $e$:

$$
\text{PPL}(P, Q) = e^{H(P, Q)}
$$

A lower perplexity indicates the model is less uncertain and predicts tokens more accurately.

## Intuition

If a model has perplexity 4, it faces the same uncertainty as choosing uniformly among 4 equally likely options at each prediction step. A language model with a vocabulary of tens of thousands of tokens achieving a perplexity of 3–10 is considered excellent: it has learned to concentrate probability mass on a tiny fraction of possible continuations.

Perplexity measures how closely the model approximates the distribution of its training data. A perfect model would achieve a cross entropy equal to the intrinsic entropy $H(P)$ of the language itself, and therefore a perplexity of $2^{H(P)}$.

## Factors That Affect Perplexity

- **Data structure.** More structured data (e.g., HTML, code) is more predictable and yields lower perplexity than informal prose.
- **Vocabulary size.** Larger vocabularies increase the number of candidates at each step and typically raise perplexity. For the same text, character-level perplexity is lower than word-level perplexity because there are far fewer characters than words to choose from.
- **Context length.** The more prior tokens the model can attend to, the lower its uncertainty. Perplexity generally decreases as context length grows, up to the model's maximum context window.

## Relationship to BPC and BPB

Cross entropy, perplexity, bits-per-character (BPC), and bits-per-byte (BPB) are all variations of the same underlying measurement and are mutually convertible.
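
The conversions between these quantities can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the token probabilities, and the token, character, and byte counts are all hypothetical, and cross entropy is assumed to be the average negative log-probability the model assigns to the observed tokens.

```python
import math


def cross_entropy_nats(token_probs):
    """Average negative log-probability (nats per token) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def nats_to_bits(h_nats):
    """Convert an entropy value from nats to bits."""
    return h_nats / math.log(2)


def perplexity(h_nats):
    """Perplexity is the exponential of cross entropy measured in nats."""
    return math.exp(h_nats)


def bits_per_unit(bits_per_token, n_tokens, n_units):
    """Spread the total bits over another unit: characters for BPC, bytes for BPB."""
    return bits_per_token * n_tokens / n_units


def compressed_fraction(bpb):
    """Fraction of the original size needed if each 8-bit byte costs `bpb` bits."""
    return bpb / 8


# Hypothetical example: the model assigns probability 0.25 to each of 10 observed tokens.
h = cross_entropy_nats([0.25] * 10)
print(round(perplexity(h), 6))    # equivalent to choosing among 4 equally likely options
print(round(nats_to_bits(h), 6))  # bits per token

# Assumed counts: the 10 tokens cover 20 single-byte characters.
print(bits_per_unit(nats_to_bits(h), n_tokens=10, n_units=20))
print(compressed_fraction(3.43))
```

The same `bits_per_unit` helper yields BPC or BPB depending on whether `n_units` counts characters or bytes, which is why the two coincide only for single-byte encodings.
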
BPB is the most standardised for cross-model comparison because it is independent of the tokenisation scheme. If the BPB is 3.43, the model needs 3.43 bits to represent each 8-bit byte, so it can compress the original text to less than half its original size ($3.43 / 8 \approx 0.43$).

## Use Cases in AI Engineering

1. **Proxy for downstream capability.** Models with lower perplexity on held-out data tend to perform better on downstream tasks. Larger GPT-2 models consistently achieved lower perplexity and higher task accuracy (OpenAI, 2018).
2. **Data contamination detection.** A suspiciously low perplexity on a benchmark suggests the model saw that benchmark during training, which undermines trust in the evaluation. See [[Data Contamination in Benchmarks]].
3. **Training data deduplication.** New data is added to a corpus only when the model's perplexity on it is high, since low perplexity suggests the content is already well represented.
4. **Anomaly detection.** Gibberish or highly unusual text produces very high perplexity, which makes perplexity useful as an input filter.

## Limitations

Perplexity is a poor proxy for the capability of post-trained models. After supervised fine-tuning ([[Fine-Tuning]]) or [[RLHF]], a model's perplexity on raw text typically increases because the model has learned to follow instructions rather than to predict unconstrained text. This effect is sometimes called entropy collapse. Similarly, quantisation can change perplexity in unexpected ways.

## Related

- [[Cross-Entropy Loss]]
- [[Large Language Model]]
- [[Fine-Tuning]]
- [[RLHF]]
- [[Hallucination]]
- [[Data Contamination in Benchmarks]]

## Sources

- [[AI Engineering - Chip Huyen]]