## Definition
A **draft model** is a small, fast autoregressive model used in [[Speculative Decoding]] to propose candidate token continuations that are then verified by a larger target model. It is the "proposer" in the propose-then-verify scheme; it does not need to match the target's quality, only its tokenizer and approximate next-token preferences.
## Requirements
- **Same tokenizer** as the target. The verification step compares probabilities on identical token IDs, so vocabulary alignment is non-negotiable.
- **Cheap to evaluate** relative to the target. The wall-clock savings come from running γ serial draft steps in less time than one parallel target verification.
- **Distributional similarity** to the target on the workload. The closer the draft and target agree, the higher the [[Acceptance Rate]].
## Common Choices
- A smaller checkpoint of the same model family (e.g. Llama-7B drafting for Llama-70B).
- A distilled student trained to mimic the target.
- A pruned or quantised version of the target itself.
In the [[Self-Drafting]] alternative (e.g. Medusa) there is no separate draft model — extra heads attached to the target play the same structural role.
## Related
- [[Speculative Decoding]]
- [[Acceptance Rate]]
- [[Self-Drafting]]
- [[Foundation Model]]
## Sources
- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]