## Definition
**Speculative decoding** is an inference-time technique that accelerates autoregressive sampling from a large language model by scoring several candidate tokens in parallel while preserving the exact output distribution. A small *draft model* proposes a short continuation, and the large *target model* verifies the entire proposal in a single forward pass.
## Why It Works
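Autoregressive decoding is bottlenecked by serial latency, not by FLOPs: each token requires a fresh forward pass, and at small batch sizes that pass is dominated by reading the model weights from memory rather than by arithmetic. The target model can therefore score many positions in parallel almost as cheaply as one. Speculative decoding exploits this asymmetry: a fast draft generates K candidate tokens serially at a small fraction of the target's cost, and the target then scores all K+1 positions in one pass (the K drafted tokens plus the position after them). Each target call emits between 1 and K+1 tokens; the more often the draft and target agree, the closer it gets to K+1.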
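
The draft-then-verify loop can be sketched for the simplest case, greedy decoding, where "agreement" just means the draft picked the same token the target would have. This is an illustrative toy, not a real model: `target_next` and `draft_next` are hypothetical stand-ins, and the target's single batched pass over K+1 positions is simulated with a list comprehension and counted as one call.

```python
def target_next(seq):
    # Hypothetical "large" model: a deterministic toy rule over the prefix.
    return (seq[-1] * 3 + len(seq)) % 7

def draft_next(seq):
    # Hypothetical "small" model: matches the target except when the
    # prefix length is a multiple of 4, where it deliberately guesses wrong.
    guess = target_next(seq)
    return (guess + 1) % 7 if len(seq) % 4 == 0 else guess

def greedy_decode(prompt, n_new):
    # Baseline: one target call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_greedy_decode(prompt, n_new, k=3):
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_calls = 0
    while len(seq) < goal:
        # 1) The draft proposes k tokens serially.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2) One (simulated) target pass scores all k+1 positions.
        target_calls += 1
        targets = [target_next(seq + proposal[:i]) for i in range(k + 1)]
        # 3) Keep the longest agreeing prefix, then append the target's own
        #    token at the first mismatch (or the extra position if none).
        n_ok = 0
        while n_ok < k and proposal[n_ok] == targets[n_ok]:
            n_ok += 1
        seq.extend(proposal[:n_ok])
        seq.append(targets[n_ok])
    return seq[:goal], target_calls
```

Under these assumptions the speculative loop returns exactly the sequence that `greedy_decode` produces, while `target_calls` is at most, and usually well below, the number of generated tokens.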
## The Verification Step
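The accepted prefix length is determined by [[Modified Rejection Sampling]]: each drafted token is accepted with a probability based on the ratio of its target probability to its draft probability, and the first rejected token is resampled from a corrected distribution. This guarantees that the resulting samples are drawn from the target distribution, up to floating-point error. There is no algorithmic approximation: outputs are statistically identical to standard sampling from the target.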
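
A minimal sketch of the acceptance rule described in Leviathan et al. and Chen et al., assuming the per-position probability vectors are already available as NumPy arrays. The function name `verify_draft` and the array layout are assumptions made for illustration: `q` holds the draft distributions for the K drafted positions and `p` holds the target distributions for all K+1 positions.

```python
import numpy as np

def verify_draft(draft_tokens, q, p, rng):
    """Return the tokens emitted for one target call.

    draft_tokens: K token ids, each sampled from the matching row of q
    q: array of shape (K, V), draft probabilities per position
    p: array of shape (K+1, V), target probabilities per position
    rng: a numpy.random.Generator
    """
    emitted = []
    for i, x in enumerate(draft_tokens):
        # Accept drafted token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[i, x] / q[i, x]):
            emitted.append(int(x))
            continue
        # On rejection, resample from the normalized residual max(0, p - q)
        # and stop: later drafted tokens were conditioned on the rejected one.
        residual = np.maximum(p[i] - q[i], 0.0)
        residual /= residual.sum()
        emitted.append(int(rng.choice(len(residual), p=residual)))
        return emitted
    # Every drafted token was accepted: sample one extra token from the
    # target distribution at position K+1.
    emitted.append(int(rng.choice(p.shape[1], p=p[-1])))
    return emitted
```

Each call emits between 1 and K+1 tokens. The residual step is what makes the scheme exact: the accepted draft mass and the resampled mass together sum to the target distribution at every position.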
## Variants
Two branches have emerged:
- **External-draft**: a separate smaller model acts as the [[Draft Model]] (Leviathan et al. 2023, Chen et al. 2023).
- **Self-drafting**: the target model is augmented with cheap parallel predictors so no second model is needed ([[Self-Drafting]], e.g. Medusa).
## Cost and Speedup
The expected number of tokens emitted per target call is governed by the [[Acceptance Rate]] α, the probability that a drafted token is accepted. Reported wall-clock speedups are typically in the 2×–3× range (e.g. Leviathan et al. 2023); the ceiling depends on draft–target alignment, on the draft length K, and on the cost ratio between the two models.
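
The relationship can be made concrete with the closed form from Leviathan et al., under their simplifying assumption that each drafted token is accepted independently with probability α. In the sketch below, `k` is the draft length and `c` is the cost of one draft step relative to one target pass; the function names are illustrative.

```python
def expected_tokens_per_call(alpha, k):
    # Expected tokens emitted per target call: (1 - alpha^(k+1)) / (1 - alpha).
    if alpha == 1.0:
        return float(k + 1)  # limit case: every drafted token is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, c):
    # One speculative step costs k draft passes plus one target pass,
    # measured in units of a single target pass.
    return expected_tokens_per_call(alpha, k) / (k * c + 1)
```

For example, with α = 0.8 and K = 4 the model predicts about 3.36 tokens per target call, and with a draft that costs 5% of the target (`c = 0.05`) a speedup of about 2.8×. Raising K helps only while the extra accepted tokens outweigh the added draft cost.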
## Related
- [[Draft Model]]
- [[Acceptance Rate]]
- [[Modified Rejection Sampling]]
- [[Self-Drafting]]
- [[Sampling]]
## Sources
- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]
- [[Medusa (Cai et al.)]]