Eval-Set Sizing Heuristic - Albert Masoliver's learning site

## Definition

The **eval-set sizing heuristic** is a rule of thumb, attributed to OpenAI, for estimating the minimum number of labeled evaluation examples needed to reliably detect a given performance difference between two AI system configurations: for every 3× decrease in score gap, the sample size must increase by approximately 10×.

## The 3×/10× Scale

| Score gap to detect | Approximate minimum examples |
|---|---|
| 30% difference | ~10 |
| 10% difference | ~100 |
| 3% difference | ~1,000 |

The practical implication is immediate: a "10% improvement" measured on 20 examples cannot be distinguished from statistical noise, because a 10% gap needs roughly 100 examples and 20 falls well below that detection threshold.

Chip Huyen states the rule directly in *AI Engineering* (O'Reilly, 2025): "for every 3× decrease in score difference, the number of samples needed increases 10×" (ch. 4, p. 46).

## Why This Matters

Skipping eval-set sizing produces one of the most common failure modes in AI development: a prompt tweak or model swap appears to improve quality, the team ships it, and the improvement evaporates, or even reverses, in production. The heuristic makes the noise floor explicit before the experiment runs.

This connects directly to the practice of [[Evaluation-Driven Development]]: defining evaluation criteria upfront includes defining the minimum sample size required to trust the results of those evaluations.

## Relationship to Bootstrap Stability

Sample size and pipeline stability are orthogonal concerns. A large enough set satisfies the sizing heuristic; [[Eval Pipeline Bootstrapping]] then tests whether that set produces consistent scores across resamples. Both checks are necessary: a 200-example set can still be untrustworthy if the scoring rubric or judge is noisy.

## Practical Application

1. Decide the smallest performance delta worth detecting (e.g., 5%).
2. Read off the approximate sample size from the 3×/10× scale (between 100 and 1,000 for a 5% target).
3. Curate that many labeled examples before running any comparison.
4. Report improvements only when they exceed the detection threshold at the chosen sample size.

## Related

- [[Evaluation-Driven Development]]
- [[Eval Pipeline Bootstrapping]]
- [[LLM as a Judge]]
- [[AI Judge Biases]]
- [[Cross-Validation]]

## Sources

- [[Modern AI Software Engineering - The Orchestrators Playbook]]
- [[AI Engineering - Chip Huyen]]
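
As a rough illustration of the 3×/10× scale and the steps under Practical Application, the sketch below interpolates between the table rows with a power law. This is an assumption made for illustration, not something the sources spell out: the exponent `log(10) / log(3)` follows from "3× smaller gap, 10× more examples", and the curve is anchored at the table's 30% → ~10 row. The names `min_examples` and `gap_is_detectable` are hypothetical helpers.

```python
import math

# Exponent implied by "3x smaller gap -> 10x more examples": log(10) / log(3), about 2.1.
SCALE_EXPONENT = math.log(10) / math.log(3)

# Anchor point taken from the table: a 30% gap needs roughly 10 examples.
ANCHOR_GAP = 0.30
ANCHOR_EXAMPLES = 10


def min_examples(gap: float) -> int:
    """Approximate minimum eval-set size for detecting `gap` (a fraction, e.g. 0.05).

    Assumption: a smooth power-law interpolation of the 3x/10x rule of thumb.
    """
    if not 0 < gap <= 1:
        raise ValueError("gap must be a fraction in (0, 1]")
    estimate = ANCHOR_EXAMPLES * (ANCHOR_GAP / gap) ** SCALE_EXPONENT
    return max(1, round(estimate))


def gap_is_detectable(observed_gap: float, n_examples: int) -> bool:
    """Return True if `n_examples` meets the heuristic's minimum for `observed_gap`."""
    if observed_gap == 0:
        return False
    return n_examples >= min_examples(abs(observed_gap))


if __name__ == "__main__":
    for gap in (0.30, 0.10, 0.05, 0.03):
        print(f"gap {gap:.0%}: about {min_examples(gap)} examples")
    # The "10% improvement on 20 examples" case from the text.
    print(gap_is_detectable(0.10, 20))
```

Under this interpolation a 5% target lands a little above 400 examples, inside the 100 to 1,000 range read off the table. The 3% row comes out somewhat above the table's rounded ~1,000, because 10% to 3% is slightly more than a 3× step; treat both figures as order-of-magnitude guides rather than exact thresholds.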