Modified Rejection Sampling - Albert Masoliver's learning site

## Definition **Modified rejection sampling** is the verification step at the heart of [[Speculative Decoding]]. It decides which tokens proposed by the [[Draft Model]] to accept, and how to resample the first rejected one, in a way that guarantees the final samples come from the target distribution exactly. ## The Rule For each draft token $x$ at a position with draft probability $q(x)$ and target probability $p(x)$: $ \text{accept with probability } \min\left(1, \frac{p(x)}{q(x)}\right) $ If accepted, advance to the next draft position. If rejected, sample one replacement token from the **residual distribution**: $ p'(x) = \frac{\max(0,\ p(x) - q(x))}{\sum_{x'} \max(0,\ p(x') - q(x'))} $ Then stop and emit the prefix of accepted tokens plus this resample. ## Why It Preserves the Target Distribution The marginal probability of emitting any token $x$ under the combined accept-or-resample scheme reduces algebraically to $p(x)$. The proof is one line of probability and is the reason speculative decoding is *exact*, not an approximation — within floating-point numerics, outputs are indistinguishable from sampling directly from the target. ## Practical Notes - Implemented in fp32 (or with careful fp16 handling) to avoid the floor in the residual becoming negative due to numerical drift. - The same identity underlies tree-based variants ([[Self-Drafting]] methods like Medusa) where multiple candidate continuations are verified in parallel against a shared target call. ## Related - [[Speculative Decoding]] - [[Draft Model]] - [[Acceptance Rate]] - [[Sampling]] ## Sources - [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]] - [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]