Data Contamination in Benchmarks - Albert Masoliver's learning site

## Definition

**Data contamination** (also called data leakage or training on the test set) occurs when a model is evaluated on data that appeared in its training corpus, allowing the model to achieve high benchmark scores by memorisation rather than generalisation. A contaminated model may appear to perform well on a benchmark while being no more useful than a model that has not learned the underlying skill.

Rylan Schaeffer's 2023 satirical paper "Pretraining on the Test Set Is All You Need" demonstrated the issue vividly: a one-million-parameter model trained exclusively on benchmark data achieved near-perfect scores, outperforming much larger models on every included benchmark.

## How Contamination Happens

- **Web-scraped training data.** Models trained on internet-scale corpora inadvertently include publicly available benchmark questions and answers. Any benchmark published before a model's training cutoff is likely in its training data. This is a primary reason benchmarks become saturated so quickly.
- **Indirect contamination.** Training data and benchmark questions may share a common upstream source (e.g., both drawn from the same math textbook) without any direct overlap.
- **Intentional post-training.** A developer may first select the best model using clean benchmarks, then continue training that model on benchmark data before release to maximise real-world performance. The released model is contaminated but may still be the right product decision.

## Detection Methods

Two approaches are used to detect whether a model has seen a specific evaluation sample during training:

**N-gram overlap.** If a sequence of N tokens (typically 13+) from a benchmark example appears verbatim in the training corpus, the example is considered dirty. This approach is precise but requires access to the full training data and is computationally expensive.

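
The n-gram check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the function names (`ngrams`, `build_corpus_index`, `is_dirty`) are hypothetical, tokens are taken to be lowercased whitespace-separated words rather than the model's real tokenizer output, and the corpus is assumed to fit in memory, which a real training corpus would not.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of all contiguous n-token windows in `tokens`."""
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(corpus_docs: Iterable[str], n: int = 13) -> Set[NGram]:
    """Collect every n-gram that occurs anywhere in the training corpus.

    Assumption: lowercased whitespace splitting stands in for real tokenization.
    """
    index: Set[NGram] = set()
    for doc in corpus_docs:
        index |= ngrams(doc.lower().split(), n)
    return index


def is_dirty(example: str, corpus_index: Set[NGram], n: int = 13) -> bool:
    """Flag a benchmark example if any of its n-grams appears in the corpus."""
    return not ngrams(example.lower().split(), n).isdisjoint(corpus_index)


# Toy usage with n=4 so the example stays short; the text above suggests n >= 13.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
index = build_corpus_index(corpus, n=4)
print(is_dirty("Quick brown fox jumps", index, n=4))
print(is_dirty("a fox jumps quickly", index, n=4))
```

The index holds every n-gram in the corpus, which is why the approach is expensive at scale; large-scale checks would likely need hashing or approximate membership structures instead of a plain set.
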
**[[Perplexity]] analysis.** Contaminated data produces suspiciously low perplexity because the model has effectively memorised it. This approach is cheaper and does not require training data access, but it is less precise: low perplexity can also indicate that the data is simply easy or highly structured.

## Handling Contamination

- **Clean-split reporting.** Model developers should disclose what fraction of a benchmark's data was in their training corpus and report performance on both the full benchmark and the clean (uncontaminated) subset. OpenAI's GPT-3 report identified 13 benchmarks with at least 40% training overlap.
- **Private hold-out sets.** Public benchmarks should maintain a private hold-out portion, releasing only a reference tool for blind evaluation to prevent future contamination.
- **Outlier detection on leaderboards.** Tracking the standard deviation of model scores on a benchmark can reveal models with unusually high or consistent scores that may indicate contamination.
- **[[Perplexity]]-guided deduplication.** When adding new data to a training corpus, include only samples on which the model's perplexity is high, excluding likely memorised content.

## Implications for Model Selection

A model that scores very high on a public benchmark but was trained after the benchmark was published should be treated with suspicion. Public benchmarks are useful for filtering out clearly bad models, but they cannot be trusted to identify the best model for a specific application. This motivates the use of private evaluation pipelines and custom benchmarks as part of [[Evaluation-Driven Development]].

## Related

- [[Perplexity]]
- [[Evaluation-Driven Development]]
- [[LLM as a Judge]]
- [[Large Language Model]]
- [[Pretraining]]

## Sources

- [[AI Engineering - Chip Huyen]]