Draft Model - Albert Masoliver's learning site

## Definition

A **draft model** is a small, fast autoregressive model used in [[Speculative Decoding]] to propose candidate token continuations that are then verified by a larger target model. It is the "proposer" in the propose-then-verify scheme; it does not need to match the target's quality, only its tokenizer and approximate next-token preferences.

## Requirements

- **Same tokenizer** as the target. The verification step compares probabilities on identical token IDs, so vocabulary alignment is non-negotiable.
- **Cheap to evaluate** relative to the target. The wall-clock savings come from γ serial draft steps plus one parallel target verification costing less than the serial target steps they replace.
- **Distributional similarity** to the target on the workload. The closer the draft and target agree, the higher the [[Acceptance Rate]].

## Common Choices

- A smaller checkpoint of the same model family (e.g. Llama-7B drafting for Llama-70B).
- A distilled student trained to mimic the target.
- A pruned or quantised version of the target itself.

In the [[Self-Drafting]] alternative (e.g. Medusa) there is no separate draft model; extra heads attached to the target play the same structural role.

## Related

- [[Speculative Decoding]]
- [[Acceptance Rate]]
- [[Self-Drafting]]
- [[Foundation Model]]

## Sources

- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]
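
As a rough illustration of how the cost and similarity requirements above interact, the sketch below follows the analysis in Leviathan et al. It makes a simplifying assumption: each draft token is accepted independently with the same probability α. The function names and the `cost_ratio` parameter (time of one draft step divided by time of one target step) are hypothetical, chosen for this example only.

```python
def acceptance_rate(p_target, q_draft):
    """Per-token acceptance probability: sum over the vocabulary of min(p(x), q(x)).

    Both inputs are next-token distributions over the SAME token IDs,
    which is why the draft must share the target's tokenizer.
    """
    if len(p_target) != len(q_draft):
        raise ValueError("draft and target must share one vocabulary")
    return sum(min(p, q) for p, q in zip(p_target, q_draft))


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target verification pass.

    Assumes i.i.d. acceptance with rate alpha and gamma drafted tokens.
    """
    if alpha >= 1.0:
        return gamma + 1  # every draft token accepted, plus one from the target
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-clock improvement over plain target decoding.

    One iteration costs gamma draft steps plus one target pass,
    i.e. (gamma * cost_ratio + 1) target-step units.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# Toy three-token vocabulary (made-up numbers, for illustration only).
alpha = acceptance_rate([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
print(alpha, expected_speedup(alpha, gamma=4, cost_ratio=0.05))
```

Under these assumptions, a draft that is cheap but poorly aligned (low α) and a draft that is well aligned but expensive (high `cost_ratio`) can both yield a speedup below 1, which is why the requirements have to hold together.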