## Definition
**Modified rejection sampling** is the verification step at the heart of [[Speculative Decoding]]. It decides which tokens proposed by the [[Draft Model]] to accept, and how to resample the first rejected one, in a way that guarantees the final samples come from the target distribution exactly.
## The Rule
For each draft token $x$ at a position with draft probability $q(x)$ and target probability $p(x)$:
$
\text{accept with probability } \min\left(1, \frac{p(x)}{q(x)}\right)
$
If accepted, advance to the next draft position. If rejected, sample one replacement token from the **residual distribution**:
$
p'(x) = \frac{\max(0,\ p(x) - q(x))}{\sum_{x'} \max(0,\ p(x') - q(x'))}
$
Then stop and emit the prefix of accepted tokens plus this resample.
## Why It Preserves the Target Distribution
The marginal probability of emitting any token $x$ under the combined accept-or-resample scheme reduces algebraically to $p(x)$. The proof is one line of probability and is the reason speculative decoding is *exact*, not an approximation — within floating-point numerics, outputs are indistinguishable from sampling directly from the target.
## Practical Notes
- Implemented in fp32 (or with careful fp16 handling) to avoid the floor in the residual becoming negative due to numerical drift.
- The same identity underlies tree-based variants ([[Self-Drafting]] methods like Medusa) where multiple candidate continuations are verified in parallel against a shared target call.
## Related
- [[Speculative Decoding]]
- [[Draft Model]]
- [[Acceptance Rate]]
- [[Sampling]]
## Sources
- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]