## 1. Paper Identity
### Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
### arXiv preprint, January 2024
### *Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads*
## 2. Core Contribution
### Eliminates the separate draft model that classical speculative decoding requires
### Adds extra lightweight *decoding heads* to the target model itself, each predicting one token further into the future
### Verification uses tree-based attention over candidate continuations drawn from the heads' top-k predictions
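The tree-based attention named above can be illustrated with a small sketch. It assumes the candidate tree is the full Cartesian product of each head's top-k tokens, which is a simplification (a pruned tree is equally possible), and that each node may attend only to itself and its ancestors. The function names `candidate_paths` and `tree_attention_mask` are hypothetical and do not come from the Medusa codebase.

```python
from itertools import product

def candidate_paths(topk_per_head):
    """List every node of the candidate tree as a token path from the root.

    topk_per_head[d] is the list of top-k tokens proposed by the head at depth d.
    The empty path () stands for the root, i.e. the token already fixed by the
    base model's own LM head (an assumption of this sketch).
    """
    paths = [()]
    for depth in range(1, len(topk_per_head) + 1):
        # Every combination of the first `depth` heads is one node at that depth.
        paths.extend(product(*topk_per_head[:depth]))
    return paths

def tree_attention_mask(paths):
    """mask[i][j] is True when node i may attend to node j.

    Node j is visible to node i only if j is i itself or an ancestor of i,
    which is the same as j's path being a prefix of i's path.
    """
    return [[q == p[:len(q)] for q in paths] for p in paths]

# Illustrative tokens only: two heads with top-2 candidates each.
paths = candidate_paths([["a", "b"], ["c", "d"]])
mask = tree_attention_mask(paths)
print(len(paths))  # 7 nodes: the root, 2 at depth 1, 4 at depth 2
```

Because sibling branches never see each other, all candidate continuations can be scored in one forward pass as if each had been run separately.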
## 3. Method
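### Augments the base LLM with several lightweight feed-forward "Medusa heads" that all read the base model's last hidden state (the base stays frozen only in Medusa-1)
### The original LM head predicts the next token (position t+1); the k-th Medusa head predicts the token at position t+k+1, so the heads look 2, 3, 4, … tokens ahead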
### At decoding time, top-k predictions from each head are combined into a *tree of candidate continuations*
### A single forward pass of the base model with a custom tree attention mask verifies the whole tree at once; the longest accepted path advances the sequence
### Two training variants: **Medusa-1** (heads only, base frozen) and **Medusa-2** (joint fine-tuning)
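The verification step described above can be sketched as follows. This is a simplified illustration that assumes greedy decoding with exact-match acceptance; other acceptance rules are possible and are not shown. The dictionary `base_next` is a hypothetical stand-in for the single tree-masked forward pass: it maps a candidate path to the token the base model would emit right after it.

```python
def longest_accepted_path(paths, base_next):
    """Pick the deepest candidate path whose every token matches the base model.

    paths: candidate token paths (tuples) measured from the root of the tree.
    base_next: hypothetical stand-in for the verification pass; maps a path to
        the token the base model would greedily emit right after that path.
    Returns the accepted path and the base model's own token that follows it.
    """
    best = ()
    for path in paths:
        # A path is accepted only if each of its tokens is exactly what the
        # base model would have produced after the preceding prefix.
        ok = all(base_next.get(path[:i]) == path[i] for i in range(len(path)))
        if ok and len(path) > len(best):
            best = path
    # The base model's prediction after the accepted path is already known,
    # so each step yields at least one token even if every guess is rejected.
    bonus = base_next.get(best)
    return best, bonus

# Illustrative values only.
paths = [(), ("a",), ("b",), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
base_next = {(): "a", ("a",): "d", ("a", "d"): "e"}
print(longest_accepted_path(paths, base_next))  # (('a', 'd'), 'e')
```

In this toy run the sequence advances by three tokens in one step: the two accepted guesses plus the base model's own follow-up token.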
## 4. Key Results
### Reported speedups of 2.2×–3.6× across Vicuna-7B/13B/33B without compromising generation quality
### Medusa-2 (joint training) adds a further 0.3×–0.7× of speedup over Medusa-1
### Simpler to deploy than draft-model speculative decoding: no second model to host, no tokenizer alignment
## 5. Lineage / Why It Matters
### Sits in the *self-drafting* branch of the speculative decoding family (vs. external-draft methods like Leviathan et al. and Chen et al.)
### Influenced EAGLE, Hydra and ReDrafter, and is supported by production serving frameworks (e.g. TGI's Medusa mode)
### Demonstrates that the draft–verify pattern works without a second model when the base model is augmented with cheap parallel predictors
## 6. Limitations
### Requires modifying / fine-tuning the target model; not a drop-in for closed weights
### Tree depth and width are hyperparameters that must be tuned per workload
### Quality of the heads' predictions degrades quickly past offset +3 or +4
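The effect of degrading head accuracy on tree depth can be made concrete with a rough calculation. The sketch below assumes a single top-1 chain rather than a full tree, and the acceptance probabilities are made-up illustrative numbers, not figures from the paper. The helper `expected_tokens_per_step` is hypothetical.

```python
def expected_tokens_per_step(head_accuracies):
    """Expected tokens emitted per decoding step for a single top-1 chain.

    head_accuracies[k] is the assumed probability that head k's top-1 guess is
    accepted, given that the guesses of all earlier heads were accepted.
    """
    expected = 1.0  # the base model's own next token is always produced
    survive = 1.0   # probability that the chain is still intact at this depth
    for p in head_accuracies:
        survive *= p
        expected += survive
    return expected

# Made-up accuracies that fall off with depth, for illustration only.
print(expected_tokens_per_step([0.6, 0.4, 0.2, 0.1]))
```

With these assumed numbers the result is about 1.89 tokens per step, and the last two heads together contribute only about 0.05 of that, which is why adding depth beyond the first few heads tends to give diminishing returns.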
## 7. Source
- https://arxiv.org/abs/2401.10774
- Accessed: 2026-05-23