Model Distillation - Albert Masoliver's learning site

## Definition

**Model distillation** (also called knowledge distillation) is a training approach in which a **student** model, usually smaller, is trained to mimic the outputs of a larger, more capable **teacher** model, compressing the teacher's knowledge into a cheaper, faster deployment artifact with minimal performance degradation (Hinton et al., 2015).

## Core Idea

Instead of training the student from scratch on hard labels (e.g., one-hot class assignments), the student learns from the teacher's output distribution: soft labels that carry richer information about which outputs the teacher considers plausible and which it dismisses. This richer signal lets the student learn more efficiently than from human-annotated labels alone.

In the LLM era, distillation is most commonly implemented as **data-level distillation**: the teacher generates (instruction, response) pairs, which are then used to train the student via supervised finetuning. The student is optimised to reproduce the teacher's generation style and reasoning patterns.

## Notable Examples

- **DistilBERT** (Sanh et al., 2019): distilled from BERT; 40% smaller, 60% faster, and retains 97% of BERT's language-understanding performance on GLUE.
- **Alpaca** (Taori et al., 2023): Llama-7B finetuned on 52K examples generated by text-davinci-003 (175B). The 7B student behaves similarly to the 175B teacher at 4% of its size.
- **BuzzFeed model**: Flan-T5 finetuned with LoRA on text-davinci-003-generated data; reduced inference cost by 80%.

## Student Can Exceed Teacher

Distillation does not require the student to be smaller than the teacher. When a large student is trained on data generated by a smaller but instruction-following teacher, the student can surpass the teacher in quality. NVIDIA's Nemotron-4-340B-Instruct (2024) was trained on data generated by Mixtral-8×7B-Instruct (about 47B total parameters rather than 8 × 7B = 56B, because only the feed-forward experts are replicated), yet outperformed the teacher on multiple benchmarks.
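
The data-level distillation workflow described under Core Idea, which is also how teacher-generated training sets like the one above are produced, can be sketched roughly as follows. This is a minimal illustration, not any project's actual pipeline: `build_distillation_set`, `non_empty`, `to_jsonl`, and `toy_teacher` are hypothetical names, the toy teacher stands in for a real call to a teacher model, and the filter is only a placeholder for a real quality check.

```python
import json
from typing import Callable, Iterable


def non_empty(instruction: str, response: str) -> bool:
    """Placeholder quality check (assumption): keep only non-blank responses."""
    return bool(response.strip())


def build_distillation_set(
    instructions: Iterable[str],
    teacher: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool] = non_empty,
) -> list[dict[str, str]]:
    """Collect (instruction, response) pairs from a teacher for student finetuning."""
    records = []
    for instruction in instructions:
        # In practice this would be a call to the teacher model.
        response = teacher(instruction)
        if is_acceptable(instruction, response):
            records.append({"instruction": instruction, "response": response})
    return records


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialise records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def toy_teacher(instruction: str) -> str:
    """Hypothetical stand-in for a teacher model: answers one canned prompt."""
    canned = {
        "Define distillation.": "Training a student model to mimic a teacher model.",
    }
    return canned.get(instruction, "")


records = build_distillation_set(
    ["Define distillation.", "Unanswerable prompt"],
    toy_teacher,
)
print(to_jsonl(records))
```

Here the second prompt gets an empty response from the toy teacher and is dropped by the filter, so only one pair reaches the output. The resulting records would then be handed to whatever supervised finetuning setup trains the student.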

The Llama 3 paper cautions that training indiscriminately on self-generated data does not improve a model and can degrade it. However, by verifying the quality of synthetic data and training only on verified examples, continual self-improvement is possible.

## Distillation vs General Data Synthesis

All distillation involves synthetic data, but not all use of synthetic data is distillation. Distillation specifically implies that the teacher's performance is the gold standard and the student aims to reach it. Reverse instruction (generating instructions for existing high-quality human content) is not distillation because the responses are human-authored.

## Licensing Constraint

Many commercial model licenses explicitly prohibit using their outputs to train competing models. OpenAI, Anthropic, and others include such clauses. Always verify license terms before using a model's outputs as training data.

## Relationship to PEFT

Synthetic instruction data (from distillation) is commonly paired with adapter-based PEFT methods like LoRA. The combination of a small adapter and synthetic data allows resource-constrained practitioners to distill large-model capabilities into small, efficient, deployable models on a single GPU.

## Related

- [[Fine-Tuning]]
- [[Parameter-Efficient Finetuning]]
- [[LoRA]]
- [[Data Synthesis for AI]]
- [[Instruction Dataset Design]]
- [[Foundation Model]]

## Sources

- [[AI Engineering - Chip Huyen]]