Instruction Dataset Design - Albert Masoliver's learning site

## Definition

**Instruction dataset design** is the discipline of constructing training datasets for supervised finetuning, i.e., collections of (instruction, response) pairs, to teach a model desired behaviors. Quality, coverage, and quantity are the three governing criteria; getting any one of them wrong degrades the finetuned model.

## Three Governing Criteria

### Quality

High-quality data is relevant, aligned with task requirements, consistent across annotators, correctly formatted, sufficiently unique, and compliant with privacy/legal policies. A small amount of high-quality data routinely outperforms a large amount of noisy data: the Yi model authors found 10K carefully crafted instructions superior to hundreds of thousands of noisy ones (Young et al., 2024). LIMA (Zhou et al., 2023) showed that a 65B model finetuned on 1,000 curated examples produced responses equivalent to or preferred over GPT-4's in 43% of cases.

Key quality rules:

- Remove extraneous formatting (HTML tags, trailing whitespace, inconsistent casing). Databricks found that removing Markdown/HTML tokens boosted accuracy by 20% while cutting token lengths by 60%.
- Ensure factual consistency, especially for domain-specific or safety-critical annotations.
- Deduplicate to avoid skewing the distribution and wasting compute.

### Coverage (Diversity)

Data must cover the range of inputs the deployed model will encounter. Diversity axes include topics, languages, output formats, instruction lengths, response lengths, and turn structure (single-turn vs. multi-turn). The "Scaling Instruction-Finetuned Language Models" paper (Chung et al., 2022) showed that performance increased significantly as the finetuning task count grew from 9 to 282 tasks and then plateaued, suggesting that diversity matters more than raw quantity up to a threshold.

Different training phases require different domain mixes.
For Llama 3 SFT, ~42% of data is math, reasoning, and code, far above their proportion on the internet, because high-quality math/code data disproportionately boosts reasoning capabilities.

### Quantity

More data generally helps, but with diminishing returns. Rules of thumb:

- For PEFT methods (e.g., LoRA): strong performance can emerge from a few hundred to a few thousand examples.
- For full finetuning: typically requires tens of thousands to millions of examples.
- Stronger base models need fewer finetuning examples; with sufficient data (550K+ examples), model choice matters less.

A practical approach: start with 50–100 well-crafted examples to verify that finetuning improves the model at all. Then plot performance vs. dataset size at 25%/50%/100% subsets to estimate the value of additional data: a curve that is still rising at 100% suggests more data will help, while a flat curve suggests it will not.

## Data Sources and Acquisition

Priority order (highest relevance to lowest): own-application user data → curated public datasets → manually annotated data → AI-synthesized data. A data flywheel, which leverages user interactions to continuously improve the model, is the strongest long-term moat.

Annotation guidelines are as important as the annotations themselves. Guidelines must specify what a good response looks like, how to handle edge cases, and how to resolve annotator disagreements. These guidelines double as evaluation guidelines.

## Synthesis Techniques

See [[Data Synthesis for AI]] for the full treatment. The most relevant patterns for instruction data:

- **Reverse instruction**: start with high-quality long-form content (books, Wikipedia) and use AI to generate the prompts that would elicit that content. Because the responses are existing human-written text, this avoids hallucinations in responses.
- **Self-instruct**: start with a seed of diverse examples; use a model to generate new (instruction, response) pairs that resemble the seeds. Used to create Alpaca (52K examples from 175 seed examples).
- **AI response generation**: human-written instructions, model-generated responses (scalable, but risks style-over-substance imitation).
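
The reverse-instruction pattern above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `build_reverse_instruction_pairs`, `stub_generate_instruction`, and the `min_chars` threshold are hypothetical names and values chosen for this example, and the stub stands in for whatever model call would actually write the instruction.

```python
from typing import Callable, Dict, List


def build_reverse_instruction_pairs(
    passages: List[str],
    generate_instruction: Callable[[str], str],
    min_chars: int = 200,  # assumed cutoff for "long-form"; tune per corpus
) -> List[Dict[str, str]]:
    """Turn existing passages into (instruction, response) pairs."""
    pairs: List[Dict[str, str]] = []
    seen = set()
    for passage in passages:
        text = passage.strip()
        # Skip fragments too short to serve as a long-form response.
        if len(text) < min_chars:
            continue
        # Skip exact duplicates so repeated passages do not skew the data.
        if text in seen:
            continue
        seen.add(text)
        instruction = generate_instruction(text).strip()
        if not instruction:
            continue
        # The response is the original text, unchanged: only the
        # instruction is machine-generated.
        pairs.append({"instruction": instruction, "response": text})
    return pairs


def stub_generate_instruction(passage: str) -> str:
    """Hypothetical placeholder for a model call; uses the first sentence."""
    first_sentence = passage.split(".")[0]
    return f"Write a detailed passage that begins with: {first_sentence}."


if __name__ == "__main__":
    demo = ["Instruction tuning teaches a model to follow requests. " * 5]
    for pair in build_reverse_instruction_pairs(demo, stub_generate_instruction):
        print(pair["instruction"])
```

In practice, `generate_instruction` would wrap a call to a language model. Passing it in as a plain callable keeps the pairing logic testable without any model access.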
## Format Considerations

Each model expects data in its own chat template (tokenizer-specific). Mismatched templates cause silent bugs.

During finetuning, instructions typically do not need task descriptions or few-shot examples, because the model learns from the training examples directly. After finetuning, prompts should exactly match the finetuning format (e.g., same delimiters, no extra prefixes).

Set the **prompt loss weight** to around 10%: the model should learn mostly from the response tokens, not the instruction tokens, since at inference time only responses are generated.

## Related

- [[Fine-Tuning]]
- [[Parameter-Efficient Finetuning]]
- [[Data Synthesis for AI]]
- [[Model Distillation]]
- [[Retrieval-Augmented Generation]]

## Sources

- [[AI Engineering - Chip Huyen]]