## Definition
**Self-drafting** is the branch of [[Speculative Decoding]] in which the target model itself produces the draft, eliminating the need for a separate [[Draft Model]]. Most variants attach cheap auxiliary predictors (lightweight heads or a shallow extra layer) to the target and train them to forecast tokens several positions ahead; training-free variants instead reuse the target's own forward passes to guess future tokens.
## How It Differs from External-Draft
| | External-draft (Leviathan, Chen) | Self-drafting (Medusa, EAGLE) |
|---|---|---|
| Second model? | Yes — a smaller LLM | No — heads on the target |
| Memory cost | 2 models hosted | One model + small heads |
| Tokenizer alignment | Required | Automatic |
| Training cost | None | Heads must be trained or fine-tuned |
## Verification
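Self-drafting methods usually generate a *tree of candidate continuations* (the top-k predictions of each head, combined across heads) and verify the entire tree in one target forward pass. A tailored attention mask lets each candidate token attend only to its own ancestors in the tree, so sibling branches do not interfere with one another. The longest accepted path starting at the root becomes the emitted continuation; it does not have to reach a leaf.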
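
The verification step can be sketched in a few lines. This is a minimal illustration, assuming greedy acceptance (a drafted token is accepted only if it equals the target's argmax at its parent node) and an unpruned Cartesian-product tree. The function names and the `target_next` list, which stands in for the target's per-node predictions from the single verification pass, are hypothetical and not taken from any library.

```python
def build_tree(topk_per_head):
    """Expand per-head candidate lists into a tree.

    Node 0 is the root (the last committed token). Returns parallel lists:
    tokens[i] is the drafted token at node i, parents[i] its parent index.
    """
    tokens, parents = [None], [-1]
    frontier = [0]
    for candidates in topk_per_head:
        next_frontier = []
        for parent in frontier:
            for tok in candidates:
                tokens.append(tok)
                parents.append(parent)
                next_frontier.append(len(tokens) - 1)
        frontier = next_frontier
    return tokens, parents


def tree_attention_mask(parents):
    """mask[i][j] is True iff node i may attend to node j (itself or an ancestor).

    A real implementation would also let every node attend to the committed prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def longest_accepted_path(tokens, parents, target_next):
    """Walk down from the root, following the child that matches the target.

    target_next[i] is the token the target predicts immediately after node i.
    """
    children = {}
    for node, parent in enumerate(parents):
        children.setdefault(parent, []).append(node)
    accepted, node = [], 0
    while True:
        match = next(
            (c for c in children.get(node, []) if tokens[c] == target_next[node]),
            None,
        )
        if match is None:
            return accepted
        accepted.append(tokens[match])
        node = match


# Two heads, top-2 each: 1 root + 2 + 4 = 7 nodes.
tokens, parents = build_tree([[5, 7], [3, 9]])
mask = tree_attention_mask(parents)
target_next = [7, 0, 9, 0, 0, 0, 1]  # made-up target predictions per node
print(longest_accepted_path(tokens, parents, target_next))  # [7, 9]
```

Because candidates under one parent are distinct, at most one child can match the target's greedy choice, so a single walk from the root is enough to find the longest accepted path.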
## Notable Methods
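- **Medusa** (Cai et al. 2024): adds K feed-forward heads on the target's final hidden state. With the original LM head predicting offset +1, head k predicts offset +(k+1); the candidates are verified via tree attention.
- **EAGLE** (Li et al. 2024): a lightweight autoregressive draft layer over the target's hidden states, so drafting happens at the feature level and reuses the target's own LM head.
- **Lookahead Decoding** (Fu et al. 2024): a training-free variant that uses Jacobi iteration to generate and verify guessed n-grams.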
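
To make the head-based approach concrete, here is a rough sketch of Medusa-style drafting: each head is a small residual feed-forward block followed by its own vocabulary projection, and all heads read the same hidden state in parallel. The weights are random and untrained, the dimensions are toy values, and every name (`make_head`, `draft_candidates`, and so on) is hypothetical; it only illustrates the data flow, not any released implementation.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def make_head(hidden, vocab, rng):
    """One draft head: a hidden->hidden block plus a hidden->vocab projection."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(vocab)]
    return W1, W2


def head_logits(head, h):
    """Residual feed-forward block, then project to vocabulary logits."""
    W1, W2 = head
    z = [hi + silu(a) for hi, a in zip(h, matvec(W1, h))]
    return matvec(W2, z)


def top_k(logits, k):
    """Indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def draft_candidates(heads, h, k):
    """One top-k token list per head, i.e. per lookahead offset."""
    return [top_k(head_logits(head, h), k) for head in heads]


rng = random.Random(0)
heads = [make_head(hidden=8, vocab=20, rng=rng) for _ in range(3)]
h = [rng.gauss(0, 1) for _ in range(8)]  # stand-in for the target's last hidden state
candidates = draft_candidates(heads, h, k=2)  # 3 lists of 2 token ids each
```

The per-head lists returned by `draft_candidates` are what a tree-based verifier would then expand into candidate continuations.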
## Trade-offs
Self-drafting avoids the operational burden of hosting two models, but its draft quality, as measured by the [[Acceptance Rate]], is limited by what cheap heads can learn from the target's (often frozen) hidden representations. External-draft methods have a higher ceiling on draft quality because the draft is a full autoregressive LM; self-drafting trades some of that quality for simplicity.
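
One way to put a number on draft quality is the expected number of tokens emitted per target forward pass. Under the simplifying assumption from the analysis in Leviathan et al. that each drafted token is accepted independently with probability α, a linear draft of γ tokens yields (1 − α^(γ+1)) / (1 − α) tokens per pass on average. The helper below is a hypothetical illustration of that formula; tree-shaped drafts do not follow it exactly.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass for a linear draft of length gamma.

    Assumes i.i.d. per-token acceptance with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the target's own next token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Compare a weaker and a stronger draft at the same draft length.
for alpha in (0.5, 0.8):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 4))
```

With γ = 4, raising α from 0.5 to 0.8 lifts the expected yield from roughly 1.94 to roughly 3.36 tokens per pass, which is why a modest drop in acceptance rate can cost a sizeable share of the speedup.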
## Related
- [[Speculative Decoding]]
- [[Draft Model]]
- [[Acceptance Rate]]
## Sources
- [[Medusa (Cai et al.)]]
- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]