Evaluation-Driven Development - Albert Masoliver's learning site

## Definition

**Evaluation-driven development** (EDD) is an AI engineering practice in which evaluation criteria, scoring rubrics, and test data are defined before an application is built, analogous to test-driven development (TDD) in software engineering. The guiding principle is that an AI application that cannot be evaluated should not be deployed, because an unmeasured system provides no basis for improvement and poses undetectable risk.

Chip Huyen coined the term in *AI Engineering* (O'Reilly, 2024), observing that many teams deploy AI applications without knowing whether they work, an outcome she considers worse than never deploying at all, because the application still incurs maintenance costs while hiding its failures.

## Why It Matters

Without pre-defined evaluation criteria:

- Developers iterate without signal, guessing whether changes are improvements.
- Failures in production are invisible until they cause business damage.
- It becomes impossible to tie AI system quality to business outcomes (conversion rates, support ticket deflection, error rates).

The investment in evaluation pays for itself: a reliable evaluation pipeline reduces risk, enables faster iteration, and provides the annotated data needed for [[Fine-Tuning]] later.

## The Evaluation Pipeline (Four Steps)

### Step 1: Evaluate All Components

Each system component (retrieval, generation, ranking, etc.) and each conversation turn should be evaluated independently, not only end-to-end. Component-level evaluation pinpoints failure sources; task-level evaluation confirms whether the user's goal was ultimately achieved.

### Step 2: Create an Evaluation Guideline

Define what a good response is (and is not) for the application. This includes:

- **Evaluation criteria**: the quality dimensions that matter (e.g., relevance, factual consistency, safety). LangChain's 2023 report found teams used 2.3 criteria on average.
- **Scoring rubric**: for each criterion, a precise definition of each score level, with labelled examples. Rubrics must be unambiguous enough for a human, or an [[LLM as a Judge]], to apply consistently.
- **Business metric mapping**: translate evaluation metrics into business impact (e.g., "factual consistency 90% → 50% of support tickets automated").
- **Usefulness threshold**: the minimum score an application must achieve to be worth deploying.

### Step 3: Define Evaluation Methods and Data

Select evaluation methods matched to each criterion:

- Exact or functional methods for structured or code outputs.
- Semantic similarity or lexical metrics for text tasks with reference data.
- AI judges or human annotators for subjective or open-ended criteria.

Curate annotated evaluation data sliced by user tier, input length, topic, and other axes relevant to the application. Use bootstrapped sub-samples to verify the evaluation set is large enough to produce stable results. OpenAI's rule of thumb: detecting a 10% difference requires ~100 examples; detecting a 1% difference requires ~10,000.

### Step 4: Evaluate the Evaluation Pipeline

Before trusting the pipeline, verify it:

- Do higher-quality responses consistently receive higher scores?
- Does the pipeline produce stable results across repeated runs and different data slices?
- Are metrics correlated with business outcomes?
- Are the cost and latency of evaluation acceptable?

## Iteration and Experiment Tracking

Evaluation criteria must evolve as user behaviour and application requirements change. However, frequent changes to the evaluation pipeline undermine longitudinal comparability. Log all variables (evaluation data version, scoring rubric version, judge model and prompt) to distinguish changes in application quality from changes in the evaluation methodology itself.

## Relationship to Test-Driven Development

EDD shares TDD's core discipline: define the success criteria first, then build.
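
To make the "criteria first" discipline concrete, here is a minimal Python sketch of a usefulness threshold written as a gate before any application code exists. Every name in it (`EvalExample`, `EVAL_SET`, `USEFULNESS_THRESHOLD`, `exact_match_rate`, `passes_gate`, `stub_app`) is hypothetical, the four examples and the 0.75 threshold are made-up placeholders, and exact match stands in for whatever method suits the real criterion.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalExample:
    """One annotated evaluation example (hypothetical schema)."""
    prompt: str
    expected: str


# Illustrative evaluation set and threshold, fixed before the app is built.
EVAL_SET = [
    EvalExample("What is 2 + 2?", "4"),
    EvalExample("Capital of France?", "Paris"),
    EvalExample("Opposite of hot?", "cold"),
    EvalExample("Largest planet?", "Jupiter"),
]
USEFULNESS_THRESHOLD = 0.75  # assumed minimum score worth deploying


def exact_match_rate(app: Callable[[str], str], examples: list[EvalExample]) -> float:
    """Fraction of examples whose output equals the reference,
    ignoring case and surrounding whitespace."""
    if not examples:
        raise ValueError("evaluation set must not be empty")
    hits = sum(
        app(ex.prompt).strip().lower() == ex.expected.strip().lower()
        for ex in examples
    )
    return hits / len(examples)


def passes_gate(
    app: Callable[[str], str],
    examples: list[EvalExample] = EVAL_SET,
    threshold: float = USEFULNESS_THRESHOLD,
) -> bool:
    """True when the application meets the usefulness threshold."""
    return exact_match_rate(app, examples) >= threshold


def stub_app(prompt: str) -> str:
    """Placeholder application that answers nothing, so the gate starts out failing."""
    return ""
```

As with a TDD test that fails until the feature exists, `passes_gate(stub_app)` returns `False`; the application is then iterated on until the gate passes, while the evaluation set and threshold stay fixed.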
The analogy extends to data: evaluation annotations gathered for EDD can later be repurposed as instruction-tuning data for fine-tuning the model, compounding the investment.

## Related

- [[LLM as a Judge]]
- [[Comparative Evaluation]]
- [[Eval-Set Sizing Heuristic]]: names the minimum sample size needed to detect a given score gap (Steps 3–4 of EDD)
- [[Eval Pipeline Bootstrapping]]: the procedure for confirming the evaluation pipeline itself produces stable scores (Step 4 of EDD)
- [[Fine-Tuning]]
- [[Hallucination]]
- [[Data Contamination in Benchmarks]]
- [[Confusion Matrix]]

## Sources

- [[AI Engineering - Chip Huyen]]