Medusa (Cai et al.) - Albert Masoliver's learning site

## 1. Paper Identity ### Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao ### arXiv preprint, January 2024 ### *Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads* ## 2. Core Contribution ### Eliminates the separate draft model that classical speculative decoding requires ### Adds extra lightweight *decoding heads* to the target model itself, each predicting one token further into the future ### Verification uses tree-based attention over candidate continuations drawn from the heads' top-k predictions ## 3. Method ### Augments a frozen LLM with several feed-forward "Medusa heads" that share the base model's last hidden state ### Each head predicts the token at offset +1, +2, +3, … from the current position ### At decoding time, top-k predictions from each head are combined into a *tree of candidate continuations* ### A single forward pass of the base model with a custom tree attention mask verifies the whole tree at once; the longest accepted path advances the sequence ### Two training variants: **Medusa-1** (heads only, base frozen) and **Medusa-2** (joint fine-tuning) ## 4. Key Results ### 2.2×–3.6× speedup across Vicuna-7B/13B/33B with no quality loss ### Medusa-2 (joint training) outperforms Medusa-1 by 0.3×–0.7× speedup ### Simpler to deploy than draft-model speculative decoding: no second model to host, no tokenizer alignment ## 5. Lineage / Why It Matters ### Sits in the *self-drafting* branch of the speculative decoding family (vs. external-draft methods like Leviathan et al. and Chen et al.) ### Influenced EAGLE, Hydra, ReDrafter and several production frameworks (TGI Medusa mode) ### Demonstrates that the draft–verify pattern works without a second model when the base model is augmented with cheap parallel predictors ## 6. Limitations ### Requires modifying / fine-tuning the target model; not a drop-in for closed weights ### Tree depth and width are hyperparameters that must be tuned per workload ### Quality of the heads' predictions degrades quickly past offset +3 or +4 ## 7. Source - https://arxiv.org/abs/2401.10774 - Accessed: 2026-05-23