Acceptance Rate - Albert Masoliver's learning site

## Definition

The **acceptance rate** α in [[Speculative Decoding]] is the expected fraction of draft tokens that survive the target model's verification step. It is the single dominant factor in the observed wall-clock speedup: higher α means more tokens emitted per target forward pass.

## Expected Tokens per Step

For a draft of length γ tokens, and assuming each draft token is accepted independently with probability α, the expected number of tokens emitted per target call is:

$$ E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha} $$

This is the geometric sum 1 + α + α² + … + α^γ: the k-th draft token contributes only if all k tokens up to it were accepted, and the leading 1 is the token the target model always supplies itself. When α → 1, the expression approaches γ + 1 (all γ draft tokens plus one extra token sampled from the target model). When α is low, the expression collapses toward 1, and the technique degenerates to standard decoding plus draft-model overhead.

## What Determines α

- **Draft–target alignment**: distilled drafts of the same family give the highest α.
- **Task entropy**: low-entropy contexts (boilerplate, repetitive patterns) yield high α; creative generation with high [[Temperature]] yields lower α.
- **Position in the draft**: α typically decays for later positions, because draft errors compound.

## Tuning Implication

The optimal γ (draft length) is the one that maximises (expected tokens) / (draft cost + target cost). Most production systems land at γ ∈ {3, 4, 5}; beyond that, the marginal acceptance contribution no longer covers the extra draft latency.

## Related

- [[Speculative Decoding]]
- [[Draft Model]]
- [[Modified Rejection Sampling]]
- [[Temperature]]

## Sources

- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]
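
## Worked Example

The expected-tokens formula and the γ tuning rule in this note can be checked numerically. The sketch below is illustrative only: the function names (`expected_tokens`, `tokens_per_cost`, `best_gamma`) and the cost model are assumptions, not taken from the cited papers. It treats α as constant across draft positions, normalises one target forward pass to cost 1, and charges a hypothetical `draft_cost` per draft token.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target call: (1 - alpha^(gamma+1)) / (1 - alpha).

    Assumes every draft token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the formula as alpha -> 1: all gamma drafts plus one extra token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def tokens_per_cost(alpha: float, gamma: int, draft_cost: float) -> float:
    """Expected tokens per unit of cost under a toy cost model (assumption).

    One target pass costs 1; each of the gamma draft tokens costs draft_cost.
    """
    return expected_tokens(alpha, gamma) / (gamma * draft_cost + 1.0)


def best_gamma(alpha: float, draft_cost: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximises tokens_per_cost."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: tokens_per_cost(alpha, g, draft_cost),
    )


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.8, 0.9):
        g = best_gamma(alpha, draft_cost=0.1)
        print(f"alpha={alpha}: best gamma={g}, "
              f"expected tokens={expected_tokens(alpha, g):.3f}")
```

Under this toy model, α = 0.7 with a draft token costing 10% of a target pass gives an optimum of γ = 4. Real systems have additional overheads (batching, memory bandwidth, position-dependent α), so the numbers are a rough guide rather than a prediction.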