## 1. Paper Identity
### Yaniv Leviathan, Matan Kalman, Yossi Matias
### ICML 2023 (Oral)
### *Fast Inference from Transformers via Speculative Decoding*
## 2. Core Contribution
### Introduces speculative decoding: an exact-distribution sampling algorithm that decodes K tokens from an autoregressive Transformer in fewer than K serial calls to the large target model
### Achieves a 2×–3× wall-clock speedup on T5-XXL with no architectural changes and no retraining
### Outputs are mathematically identical in distribution to standard decoding
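The exactness claim can be checked for a single position with a short derivation. This is a sketch under the paper's setup: q(x) is the draft model's next-token distribution, p(x) is the target's, a drafted token x ~ q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalised residual p'(x). The symbol β for the per-position acceptance probability follows the paper's notation.

$$
\begin{aligned}
\beta &= \sum_{x} q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right) = \sum_{x} \min\big(p(x), q(x)\big), \\
p'(x) &= \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\, p(x') - q(x')\big)} = \frac{p(x) - \min\big(p(x), q(x)\big)}{1 - \beta}, \\
P(\text{output} = x) &= \underbrace{\min\big(p(x), q(x)\big)}_{\text{drafted and accepted}} + \underbrace{(1 - \beta)\, p'(x)}_{\text{rejected, then resampled}} = p(x).
\end{aligned}
$$

When p = q the residual is empty and p' is undefined, but in that case β = 1 and the rejection branch is never taken.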
## 3. Method
### A small, fast *draft model* (M_q) autoregressively generates a candidate continuation of γ tokens
### The large *target model* (M_p) scores all γ+1 positions in a single parallel forward pass
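### A *modified rejection sampling* step accepts a prefix of the draft and resamples the first rejected token from the corrected distribution norm(max(0, p − q)); if every draft token is accepted, one extra token is sampled from M_p, so each step emits between 1 and γ+1 tokens while preserving p exactly
### The expected number of tokens emitted per step, (1 − α^(γ+1)) / (1 − α), depends on the acceptance rate α, which measures how closely M_q approximates M_p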
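The accept/reject pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `speculative_step` and its argument layout are hypothetical, probabilities are plain lists rather than model outputs, and the target distributions are assumed to have already been computed for all γ+1 positions.

```python
import random


def speculative_step(p_probs, q_probs, draft_tokens, rng):
    """One accept/reject pass over a drafted block (illustrative sketch).

    p_probs: gamma + 1 target distributions (lists of floats), one per drafted
             position plus one for the position after the full draft.
    q_probs: gamma draft distributions the draft tokens were sampled from.
    draft_tokens: gamma token ids, with draft_tokens[i] sampled from q_probs[i].
    Returns the tokens emitted this step (between 1 and gamma + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q).
        # random.choices normalises the weights, so no explicit division.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: take one extra token from the target.
    last = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

In a real decoder the rows of `p_probs` would come from one batched forward pass of the target model over the drafted continuation, which is where the saving in serial calls comes from.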
## 4. Key Results
### 2.0×–3.4× speedup on T5-XXL across summarisation (CNN/DailyMail) and translation (WMT English-to-German) tasks
### Empirically, larger γ helps until the cost of running the draft model dominates
### Acceptance rate α is the dominant factor in observed speedup
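The trade-off between γ, α, and draft cost can be made concrete with the paper's closed-form estimates, which assume a constant, independent acceptance rate at every position. The sketch below is illustrative: the function names are hypothetical, `c` is the cost of one draft-model call relative to one target-model call, and the α and c values in the usage note are made-up inputs, not measurements from the paper.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per step: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        # Limit of the geometric sum when every draft token is accepted.
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def walltime_speedup(alpha, gamma, c):
    """Expected wall-clock improvement over plain decoding.

    Each step costs gamma draft calls plus one target call, i.e. gamma * c + 1
    in units of a single target-model call.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1)


def best_gamma(alpha, c, max_gamma=32):
    """Brute-force the draft length that maximises the estimated speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: walltime_speedup(alpha, g, c))
```

For example, `best_gamma(0.9, 0.01)` returns a larger draft length than `best_gamma(0.5, 0.01)`, and `walltime_speedup(0.5, 5, 0.5)` falls below 1, which is the regime where an expensive draft model makes decoding slower rather than faster.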
## 5. Lineage / Why It Matters
### Concurrent with Chen et al. 2023 (DeepMind), which arrived at essentially the same algorithm
### Spawned a family of follow-ups: Medusa (multi-head draft), EAGLE, Lookahead Decoding, staged speculative decoding
### Production-deployed in Google's serving stack and broadly adopted across LLM inference frameworks (vLLM, TGI, TensorRT-LLM)
## 6. Limitations
### Speedup depends on how well a smaller model can approximate the target's distribution; uniformly hard tasks benefit less
### Requires a compatible draft model that shares (at least) the tokenizer
### Additional VRAM for hosting two models simultaneously
## 7. Source
- https://arxiv.org/abs/2211.17192
- Accessed: 2026-05-23