## Definition
The **eval-set sizing heuristic** is a rule of thumb, attributed to OpenAI, for determining the minimum number of labeled evaluation examples required to reliably detect a given performance difference between two AI system configurations: for every 3× decrease in score gap, the sample size must increase by approximately 10×.
## The 3×/10× Scale
| Score gap to detect | Approximate minimum examples |
|---|---|
| 30% difference | ~10 |
| 10% difference | ~100 |
| 3% difference | ~1,000 |
The practical implication is immediate: a "10% improvement" measured on 20 examples cannot be distinguished from statistical noise. By the scale above, a 10% gap needs roughly 100 examples, so at 20 examples the difference falls below the detection threshold.
Chip Huyen states the rule directly in *AI Engineering* (O'Reilly, 2025): "for every 3× decrease in score difference, the number of samples needed increases 10×" (ch. 4, p. 46).
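
The table only lists three gaps, so intermediate targets need interpolation. The sketch below is one way to do that, assuming the rule is read as a power law anchored at the table's middle row (10% gap, ~100 examples). The function name `min_examples` and the choice of anchor are illustrative assumptions, not something the source specifies. Because 30%, 10%, and 3% are rounded steps rather than exact 3× steps, the computed values will not match every table row exactly.

```python
import math

# Assumed anchor: the table's middle row (10% gap -> ~100 examples).
ANCHOR_GAP = 0.10
ANCHOR_N = 100
# "3x smaller gap -> 10x more examples" implies n scales as gap ** -(log 10 / log 3).
EXPONENT = math.log(10) / math.log(3)  # roughly 2.1


def min_examples(gap: float) -> int:
    """Approximate eval-set size needed to detect `gap` (a fraction, e.g. 0.05)."""
    if not 0 < gap <= 1:
        raise ValueError("gap must be a fraction in (0, 1]")
    return round(ANCHOR_N * (ANCHOR_GAP / gap) ** EXPONENT)


if __name__ == "__main__":
    for gap in (0.30, 0.10, 0.05, 0.03):
        print(f"{gap:.0%} gap -> ~{min_examples(gap)} examples")
```

Treat the output as an order-of-magnitude guide, in keeping with the rule-of-thumb nature of the heuristic, rather than as a formal power calculation.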
## Why This Matters
Skipping eval-set sizing produces one of the most common failure modes in AI development: a prompt tweak or model swap appears to improve quality, the team ships it, and the improvement evaporates — or reverses — in production. The heuristic makes the noise floor explicit before the experiment runs.
This connects directly to the practice of [[Evaluation-Driven Development]]: defining evaluation criteria upfront includes defining the minimum sample size required to trust the results of those evaluations.
## Relationship to Bootstrap Stability
Sample size and pipeline stability are orthogonal concerns. A set that meets the sizing heuristic is large enough in principle; [[Eval Pipeline Bootstrapping]] then tests whether that set actually produces consistent scores across resamples. Both checks are necessary: even a 200-example set can be untrustworthy if the scoring rubric or judge is noisy.
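
A minimal sketch of the resampling side of that check, assuming per-example scores are already available as a list of floats. The helper names `bootstrap_means` and `percentile_interval` and the sample data are hypothetical. This only measures how much the mean moves when the same fixed scores are resampled; detecting judge noise would additionally require re-scoring the examples, which is not shown here.

```python
import random
import statistics


def bootstrap_means(scores, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample (drawn with replacement, same size as `scores`)."""
    rng = random.Random(seed)
    n = len(scores)
    return [statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval over the resampled means."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


if __name__ == "__main__":
    # Hypothetical data: 200 pass/fail scores with a 70% pass rate.
    scores = [1.0] * 140 + [0.0] * 60
    low, high = percentile_interval(bootstrap_means(scores))
    print(f"mean={statistics.fmean(scores):.3f}, interval=({low:.3f}, {high:.3f})")
```

If the interval is wider than the gap being tested, the set is not yet stable enough to support the comparison, whatever its size.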
## Practical Application
1. Decide the smallest performance delta worth detecting (e.g., 5%).
2. Read off the approximate sample size from the 3×/10× scale. A 5% target falls between the 10% and 3% rows, so it needs between 100 and 1,000 examples; interpolating the scale puts it at roughly 400.
3. Curate that many labeled examples before running any comparison.
4. Report improvements only when they exceed the detection threshold at the chosen sample size.
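
The final step above can be sketched as a simple gate. This assumes the same power-law reading of the 3×/10× scale, anchored at (10% gap, 100 examples), inverted to give the smallest detectable gap for a given set size. The names `detectable_gap` and `should_report` are hypothetical, and scores are assumed to be fractions between 0 and 1.

```python
import math

# From "3x smaller gap -> 10x more examples": n scales as gap ** -EXPONENT.
EXPONENT = math.log(10) / math.log(3)


def detectable_gap(n_examples: int) -> float:
    """Approximate smallest gap the heuristic treats as detectable with n examples.

    Assumed anchor: a 10% gap needs ~100 examples.
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return 0.10 * (100 / n_examples) ** (1 / EXPONENT)


def should_report(baseline: float, candidate: float, n_examples: int) -> bool:
    """True only when the observed difference meets the detection threshold."""
    return abs(candidate - baseline) >= detectable_gap(n_examples)


if __name__ == "__main__":
    # A 10-point gain on 20 examples is below the threshold; on 100 it meets it.
    print(should_report(0.60, 0.70, n_examples=20))
    print(should_report(0.60, 0.70, n_examples=100))
```

A gate like this is deliberately conservative: it says nothing about whether a difference is real, only whether the set is large enough for the heuristic to take it seriously.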
## Related
- [[Evaluation-Driven Development]]
- [[Eval Pipeline Bootstrapping]]
- [[LLM as a Judge]]
- [[AI Judge Biases]]
- [[Cross-Validation]]
## Sources
- [[Modern AI Software Engineering - The Orchestrators Playbook]]
- [[AI Engineering - Chip Huyen]]