## Definition

**Speculative decoding** is an inference-time technique that accelerates autoregressive sampling from a large language model by emitting several tokens per target-model forward pass while preserving the exact output distribution. A small *draft model* proposes a short continuation, and the large *target model* verifies the entire proposal in a single forward pass.

## Why It Works

Autoregressive decoding is bottlenecked by serial latency, not by FLOPs: each token requires a fresh forward pass, and at small batch sizes that pass is usually limited by loading the model weights from memory rather than by arithmetic. As a result, the target model can score many positions in parallel almost as cheaply as one. Speculative decoding exploits this asymmetry: a fast draft generates K candidate tokens serially at a small fraction of the target's per-token cost, and the target scores all K+1 positions (the K drafted tokens plus one extra) concurrently. Each target call then emits between 1 and K+1 tokens, and the more often the target accepts the draft's proposals, the closer it gets to K+1.

## The Verification Step

The accepted prefix length is determined by [[Modified Rejection Sampling]]: each drafted token is accepted with a probability that depends on the ratio of target to draft probability, and at the first rejection a replacement token is sampled from a corrected distribution and the rest of the draft is discarded. This guarantees that the resulting samples are drawn from the target distribution, up to floating-point numerics. There is no approximation: outputs are statistically identical to standard sampling from the target.

## Variants

Two main branches have emerged:

- **External-draft**: a separate smaller model acts as the [[Draft Model]] (Leviathan et al. 2023, Chen et al. 2023).
- **Self-drafting**: the target model is augmented with cheap parallel predictors so no second model is needed ([[Self-Drafting]], e.g. Medusa).

## Cost and Speedup

The expected number of tokens emitted per target call is governed by the [[Acceptance Rate]] α and the draft length K. Reported wall-clock speedups are typically in the 2×–3× range; the ceiling depends on draft–target alignment and on the cost ratio between the two models.

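
The acceptance rule and the tokens-per-call estimate can be made concrete with a small sketch. It is illustrative only and rests on stated assumptions: the function names `verify_draft` and `expected_tokens_per_call` are hypothetical, distributions are plain Python lists over a toy vocabulary, and the yield formula (1 − α^(K+1)) / (1 − α) holds only under the simplifying assumption that each drafted token is accepted independently with the same probability α. Here `p` denotes a target distribution and `q` a draft distribution.

```python
import random


def expected_tokens_per_call(alpha: float, k: int) -> float:
    """Expected tokens emitted per target call for draft length k.

    Assumes (simplification) that every drafted token is accepted
    independently with the same probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # limit of the geometric sum
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Modified rejection sampling over one drafted block (toy version).

    draft_tokens: K token ids sampled from the draft model
    q_probs:      K draft distributions, one per drafted position
    p_probs:      K+1 target distributions (one extra for the bonus token)
    Returns the emitted tokens: the accepted prefix plus one final token.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q),
        # then discard the remainder of the draft.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every drafted token was accepted: sample one bonus token from the
    # target distribution at position K.
    bonus = p_probs[-1]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted
```

For instance, `expected_tokens_per_call(0.8, 4)` evaluates to about 3.36, so under these assumptions a draft length of 4 with α = 0.8 yields a little over three tokens per target call. A real implementation would operate on batched logit tensors and reuse the key-value cache; the list-based version above only shows the control flow.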
## Related

- [[Draft Model]]
- [[Acceptance Rate]]
- [[Modified Rejection Sampling]]
- [[Self-Drafting]]
- [[Sampling]]

## Sources

- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]
- [[Medusa (Cai et al.)]]