## Definition
**Instruction dataset design** is the discipline of constructing training datasets for supervised finetuning — collections of (instruction, response) pairs — to teach a model desired behaviors. Quality, coverage, and quantity are the three governing criteria; getting even one wrong causes proportional degradation in the finetuned model.
## Three Governing Criteria
### Quality
High-quality data is relevant, aligned with task requirements, consistent across annotators, correctly formatted, sufficiently unique, and compliant with privacy/legal policies. A small amount of high-quality data routinely outperforms a large amount of noisy data: Yi model authors found 10K carefully crafted instructions superior to hundreds of thousands of noisy ones (Young et al., 2024). LIMA (Zhou et al., 2023) showed a 65B model finetuned on 1,000 curated examples matched GPT-4 on 43% of cases.
Key quality rules:
- Remove extraneous formatting (HTML tags, trailing whitespace, inconsistent casing). Databricks found removing Markdown/HTML tokens boosted accuracy 20% while cutting token lengths 60%.
- Ensure factual consistency, especially for domain-specific or safety-critical annotations.
- Deduplicate to avoid skewing the distribution and wasting compute.
### Coverage (Diversity)
Data must cover the range of inputs the deployed model will encounter. Diversity axes include topics, languages, output formats, instruction lengths, response lengths, and turn structure (single-turn vs multi-turn).
The "Scaling Instruction-Finetuned Language Models" paper (Chung et al., 2022) showed performance increased significantly as finetuning task count grew from 9 to 282 tasks, then plateaued — suggesting diversity matters more than raw quantity up to a threshold.
Different training phases require different domain mixes. For Llama 3 SFT, ~42% of data is math, reasoning, and code — far above their proportion on the internet — because high-quality math/code data disproportionately boosts reasoning capabilities.
### Quantity
More data generally helps, but with diminishing returns. Rules of thumb:
- For PEFT methods (e.g., LoRA): strong performance can emerge from a few hundred to a few thousand examples.
- For full finetuning: typically requires tens of thousands to millions of examples.
- Stronger base models need fewer finetuning examples; with sufficient data (550K+ examples) model choice matters less.
A practical approach: start with 50–100 well-crafted examples to verify finetuning improves the model at all. Plot performance vs. dataset size at 25%/50%/100% subsets to estimate the value of additional data.
## Data Sources and Acquisition
Priority order (highest relevance to lowest): own-application user data → curated public datasets → manually annotated → AI-synthesized. A data flywheel — leveraging user interactions to continuously improve the model — is the strongest long-term moat.
Annotation guidelines are as important as the annotations themselves. Guidelines must specify what a good response looks like, how to handle edge cases, and how to resolve annotator disagreements. These guidelines double as evaluation guidelines.
## Synthesis Techniques
See [[Data Synthesis for AI]] for the full treatment. The most relevant patterns for instruction data:
- **Reverse instruction** — start with high-quality long-form content (books, Wikipedia), use AI to generate the prompts that would elicit that content. Avoids hallucinations in responses.
- **Self-instruct** — start with a seed of diverse examples; use a model to generate new (instruction, response) pairs that resemble the seeds. Used to create Alpaca (52K examples from 175 seed examples).
- **AI response generation** — human-written instructions, model-generated responses (scalable but risks style-over-substance imitation).
## Format Considerations
Each model expects data in its own chat template (tokenizer-specific). Mismatched templates cause silent bugs. During finetuning, instructions typically do not need task descriptions or few-shot examples — the model learns from the training examples directly. After finetuning, prompts should exactly match the finetuning format (e.g., same delimiters, no extra prefixes).
Set the **prompt loss weight** to around 10%: the model should learn mostly from the response tokens, not the instruction tokens, since at inference time only responses are generated.
## Related
- [[Fine-Tuning]]
- [[Parameter-Efficient Finetuning]]
- [[Data Synthesis for AI]]
- [[Model Distillation]]
- [[Retrieval-Augmented Generation]]
## Sources
- [[AI Engineering - Chip Huyen]]