Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)

## 1. Paper Identity
### Yaniv Leviathan, Matan Kalman, Yossi Matias
### ICML 2023 (Oral)
### *Fast Inference from Transformers via Speculative Decoding*

## 2. Core Contribution
### Introduces speculative decoding: an exact-distribution sampling algorithm that decodes K tokens from an autoregressive Transformer in fewer than K serial model calls
### Achieves a 2×–3× wall-clock speedup on T5-XXL with no architectural changes and no retraining
### Outputs are mathematically identical in distribution to standard decoding

## 3. Method
### A small, fast *draft model* (M_q) autoregressively generates a candidate continuation of γ tokens
### The large *target model* (M_p) scores all γ+1 positions in a single parallel forward pass
### A *modified rejection sampling* step accepts a prefix of the draft and resamples the first rejected token from a corrected distribution, preserving p exactly
### Per-step expected acceptance length depends on the agreement between M_q and M_p (denoted α)

## 4. Key Results
### 2.0×–3.4× speedup on T5-XXL across summarisation and translation tasks
### Empirically, larger γ helps until the cost of running the draft model dominates
### Acceptance rate α is the dominant factor in observed speedup

## 5. Lineage / Why It Matters
### Concurrent with Chen et al. 2023 (DeepMind) which arrived at essentially the same algorithm
### Spawned a family of follow-ups: Medusa (multi-head draft), EAGLE, Lookahead Decoding, staged speculative decoding
### Production-deployed in Google's serving stack and broadly adopted across LLM inference frameworks (vLLM, TGI, TensorRT-LLM)

## 6. Limitations
### Speedup depends on how easily the target's distribution is approximated by a smaller model; uniformly hard tasks benefit less
### Requires a compatible draft model that shares (at least) the tokenizer
### Additional VRAM for hosting two models simultaneously

## 7. Source
- https://arxiv.org/abs/2211.17192
- Accessed: 2026-05-23
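## 8. Illustrative Sketch

The accept/reject step summarised under Method can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the draft tokens and both models' next-token distributions have already been computed and are passed in as plain Python lists. The names `speculative_step` and `expected_tokens_per_step` are hypothetical. The second function evaluates the paper's closed-form expectation (1 − α^(γ+1)) / (1 − α), which holds under the simplifying assumption that each draft token is accepted independently with probability α.

```python
import random
from typing import List, Sequence


def speculative_step(
    draft_tokens: Sequence[int],
    q_dists: Sequence[Sequence[float]],
    p_dists: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """One verification step of speculative decoding (illustrative sketch).

    draft_tokens: the gamma tokens proposed by the draft model M_q.
    q_dists[i]:   M_q's distribution from which draft_tokens[i] was sampled.
    p_dists[i]:   M_p's distribution at the same position; p_dists has
                  gamma + 1 entries because the target also scores the
                  position after the last draft token.
    Returns between 1 and gamma + 1 tokens distributed as if sampled from M_p.
    """
    gamma = len(draft_tokens)
    if len(q_dists) != gamma or len(p_dists) != gamma + 1:
        raise ValueError("expected gamma draft distributions and gamma + 1 target distributions")

    out: List[int] = []
    for i, token in enumerate(draft_tokens):
        p_tok = p_dists[i][token]
        q_tok = q_dists[i][token]  # > 0 because the token was sampled from q
        # Accept the draft token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p_tok / q_tok):
            out.append(token)
            continue
        # First rejection: resample from the corrected distribution
        # proportional to max(0, p - q), then stop. random.choices accepts
        # unnormalised weights, so no explicit normalisation is needed.
        residual = [max(0.0, p - q) for p, q in zip(p_dists[i], q_dists[i])]
        out.append(rng.choices(range(len(residual)), weights=residual, k=1)[0])
        return out

    # Every draft token was accepted: take one extra token from the target's
    # distribution at position gamma + 1.
    last = p_dists[gamma]
    out.append(rng.choices(range(len(last)), weights=last, k=1)[0])
    return out


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model call for acceptance rate alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

For example, `expected_tokens_per_step(0.5, 1)` evaluates to 1.5: with one draft token accepted half the time, each target call yields one and a half tokens on average. In a real system the lists would be replaced by logits tensors from the two models, and the target distributions would come from a single batched forward pass.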