Self-Drafting - Albert Masoliver's learning site

## Definition

**Self-drafting** is the branch of [[Speculative Decoding]] in which the target model itself produces the draft, eliminating the need for a separate [[Draft Model]]. Cheap auxiliary predictors (typically lightweight heads or shallow auxiliary layers) are attached to the target and trained to forecast tokens several positions ahead in parallel.

## How It Differs from External-Draft

| | External-draft (Leviathan et al., Chen et al.) | Self-drafting (Medusa, EAGLE) |
|---|---|---|
| Second model? | Yes, a smaller LLM | No, heads on the target |
| Memory cost | Two models hosted | One model + small heads |
| Tokenizer alignment | Required | Automatic |
| Quality cost | None | Requires training or fine-tuning of the heads |

## Verification

Self-drafting methods usually generate a *tree of candidate continuations* (top-k predictions from each head combined combinatorially) and verify the entire tree in one target forward pass using a tailored attention mask, in which each candidate attends only to itself and its ancestors in the tree. The longest accepted path starting at the root becomes the emitted continuation; it may end before reaching a leaf.

## Notable Methods

- **Medusa** (Cai et al. 2024): adds K feed-forward heads predicting offsets +1..+K, verifies via tree attention.
- **EAGLE**: a lightweight autoregressive draft layer over the target's hidden states.
- **Lookahead Decoding**: training-free variant using Jacobi iteration over guessed n-grams.

## Trade-offs

Self-drafting avoids the operational burden of hosting two models, but the draft quality is limited by what cheap heads can learn from frozen target representations. External-draft methods retain a higher quality ceiling because the draft is an actual autoregressive LM; self-drafting trades quality for simplicity.

## Related

- [[Speculative Decoding]]
- [[Draft Model]]
- [[Acceptance Rate]]

## Sources

- [[Medusa (Cai et al.)]]
- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
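
## Example: Tree Verification Sketch

The verification step described above can be illustrated with a toy sketch in plain Python. It is not Medusa's or EAGLE's actual implementation: the function names (`build_tree`, `tree_attention_mask`, `longest_accepted_path`) are hypothetical, the acceptance rule is assumed to be greedy exact match against the target's top token, and the target's single forward pass is simulated by a dictionary mapping each tree node to the token the target would pick next.

```python
def build_tree(head_topk):
    """Combine per-head top-k candidates into a candidate tree.

    head_topk[d] holds the candidate tokens from the head for offset d+1.
    Each node is (token, parent_index); parent -1 is the existing context.
    """
    nodes, frontier = [], [-1]
    for candidates in head_topk:
        next_frontier = []
        for parent in frontier:
            for tok in candidates:
                nodes.append((tok, parent))
                next_frontier.append(len(nodes) - 1)
        frontier = next_frontier
    return nodes


def tree_attention_mask(nodes):
    """mask[i][j] is True when node i may attend to node j,
    i.e. when j is node i itself or one of its ancestors."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j][1]
    return mask


def longest_accepted_path(nodes, target_choice):
    """Walk down from the root, accepting a child only when its token
    equals the target's own choice at the current node (assumed greedy rule).

    target_choice maps a node index (-1 for the root) to the token the
    target predicts next; it stands in for the output of one forward pass.
    """
    accepted, current = [], -1
    while True:
        wanted = target_choice.get(current)
        match = next(
            (i for i, (tok, parent) in enumerate(nodes)
             if parent == current and tok == wanted),
            None,
        )
        if match is None:
            return accepted
        accepted.append(nodes[match][0])
        current = match


# Two heads with top-2 candidates each give a tree of 2 + 4 = 6 nodes.
nodes = build_tree([["the", "a"], ["cat", "dog"]])
mask = tree_attention_mask(nodes)
# Simulated target predictions: "a" after the context, "dog" after "a".
accepted = longest_accepted_path(nodes, {-1: "a", 1: "dog", 5: "sat"})
print(accepted)  # ['a', 'dog']
```

In a real system the mask would be passed to the attention layers so that all six candidates are scored in one forward pass, and the acceptance rule may be probabilistic rather than exact match.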