Eval Pipeline Bootstrapping - Albert Masoliver's learning site

## Definition

**Eval pipeline bootstrapping** is the practice of testing whether an evaluation pipeline produces stable scores by repeatedly resampling the evaluation set and re-scoring each subsample. If the pipeline is trustworthy, the scores across bootstrap iterations should be close; a wide spread signals that the pipeline, not the model, is the source of variance.

## The Stability Test

The procedure:

1. Take an existing labeled evaluation set.
2. Draw multiple random subsamples with replacement (bootstrap samples).
3. Run the full evaluation pipeline on each subsample independently.
4. Compare the resulting scores across samples.

A spread such as 90% accuracy on one bootstrap but 70% on another indicates an untrustworthy pipeline, as Huyen states in *AI Engineering* (O'Reilly, 2025, ch. 4, p. 45). The cause may be too few examples, an inconsistent judge, an under-specified rubric, or non-deterministic sampling in the scoring model.

## Distinction from Model Evaluation

Bootstrapping is often associated with model evaluation (estimating variance of a trained model's performance across data splits). In the eval-pipeline context the target of analysis is different: you are testing the *measurement instrument*, not the model being measured. A pipeline that scores inconsistently cannot produce meaningful model comparisons, regardless of how large the underlying model is or how carefully it was trained.

## Relationship to Eval-Set Sizing

Pipeline stability and sample size address different failure modes. The [[Eval-Set Sizing Heuristic]] determines the minimum number of examples needed to detect a given score gap. Bootstrapping determines whether the scoring mechanism itself is reliable at any sample size. Both checks should be completed before a pipeline is used to gate production changes.
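
The four-step stability test described earlier can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `bootstrap_scores`, `summarize`, and `accuracy` are hypothetical names, and `accuracy` stands in for whatever the full evaluation pipeline computes (in practice it would call the judge or metric under test). The example data is synthetic.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical example type: (model_output, reference_label).
Example = Tuple[str, str]


def accuracy(batch: Sequence[Example]) -> float:
    """Stand-in for the real pipeline: fraction of outputs matching the label."""
    return sum(1 for output, label in batch if output == label) / len(batch)


def bootstrap_scores(
    examples: Sequence[Example],
    score_fn: Callable[[Sequence[Example]], float],
    n_iterations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Score n_iterations resamples, each drawn with replacement at the original size."""
    rng = random.Random(seed)  # fixed seed so the check itself is reproducible
    n = len(examples)
    scores = []
    for _ in range(n_iterations):
        sample = rng.choices(examples, k=n)  # sampling with replacement
        scores.append(score_fn(sample))
    return scores


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Condense the bootstrap scores into a few spread statistics."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


if __name__ == "__main__":
    # Synthetic evaluation set: 80 correct outputs and 20 incorrect ones.
    eval_set: List[Example] = [("a", "a")] * 80 + [("a", "b")] * 20
    print(summarize(bootstrap_scores(eval_set, accuracy)))
```

With a deterministic `score_fn` like the one above, the spread reflects only resampling variance, which points to set size or slice composition. If the scoring function calls a non-deterministic judge, the same loop also picks up judge noise, so comparing the two cases helps separate the causes.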

## What to Fix When Bootstrapping Reveals Instability

| Symptom | Likely cause | Fix |
|---|---|---|
| Wide score spread, small set | Too few examples | Increase the set size (see sizing heuristic) |
| Wide spread, large set | Noisy judge / rubric | Pin judge temperature to 0; tighten rubric with examples |
| Wide spread, deterministic judge | Non-representative slices | Stratify sampling; add data from failing slices |

## Evaluating the Evaluator

Bootstrapping is one part of a broader discipline of evaluating the evaluation pipeline. Related checks include:

- Running the same pipeline twice on identical inputs and confirming identical scores (reproducibility).
- Logging all variables that could change between runs: evaluation data version, rubric version, judge model and prompt, sampling configuration.
- Calibrating the judge against human labels to confirm it correlates with ground truth.

The goal is to ensure that a regression signal genuinely reflects model degradation, not measurement noise.

## Related

- [[Evaluation-Driven Development]]
- [[Eval-Set Sizing Heuristic]]
- [[LLM as a Judge]]
- [[AI Judge Biases]]

## Sources

- [[Modern AI Software Engineering - The Orchestrators Playbook]]
- [[AI Engineering - Chip Huyen]]