## Definition
The **acceptance rate** α in [[Speculative Decoding]] is the expected probability that a draft token survives the target model's verification step. It is the dominant factor in the observed wall-clock speedup: higher α means more tokens emitted per target forward pass.
## Expected Tokens per Step
For a draft of γ tokens, and assuming each draft token is accepted independently with probability α, the expected number of tokens emitted per target call is:
$$
E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}
$$
As α → 1, the expression approaches γ + 1 (all γ draft tokens plus one extra token sampled from the target model). As α → 0, it collapses toward 1: the technique degenerates to standard decoding plus draft-model overhead.
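The closed form can be checked numerically. The following is a minimal sketch, assuming α is constant across draft positions; the function name `expected_tokens` is illustrative and not taken from any library.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each draft token is accepted independently with the same
    probability alpha (a simplification; see the caveats on position decay).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the closed form as alpha -> 1: every draft token is kept,
        # plus one extra token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.3, 0.6, 0.8, 0.95):
        print(f"alpha={alpha:.2f}  E[tokens]={expected_tokens(alpha, 4):.3f}")
```

For α = 0.8 and γ = 4 this evaluates to roughly 3.36 tokens per target call, compared with the ceiling of γ + 1 = 5.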
## What Determines α
- **Draft–target alignment**: distilled drafts of the same family give the highest α.
- **Task entropy**: low-entropy contexts (boilerplate, repetitive patterns) yield high α; creative generation with high [[Temperature]] yields lower α.
- **Position in the draft**: α typically decays for later positions, because draft errors compound.
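To make the alignment and entropy factors concrete: under the accept-with-probability min(1, p/q) rule used in [[Modified Rejection Sampling]], the chance that a single draft token is accepted in a given context equals the overlap Σ min(p(x), q(x)) between the target distribution p and the draft distribution q. The sketch below computes that overlap for toy distributions; the function name and the example vectors are made up for illustration.

```python
def acceptance_probability(p: list[float], q: list[float]) -> float:
    """Probability that a token drawn from the draft distribution q is
    accepted against the target distribution p.

    Derivation: sum_x q(x) * min(1, p(x) / q(x)) = sum_x min(p(x), q(x)).
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same vocabulary")
    return sum(min(pi, qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]            # toy target distribution (hypothetical)
    aligned_draft = [0.6, 0.3, 0.1]     # draft that mostly agrees with the target
    misaligned_draft = [0.1, 0.2, 0.7]  # draft that prefers a different token

    print(acceptance_probability(target, aligned_draft))
    print(acceptance_probability(target, misaligned_draft))
```

With these toy numbers the aligned draft is accepted about 90% of the time and the misaligned one about 40%. Averaging this per-context quantity over the contexts seen in a workload gives an estimate of α.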
## Tuning Implication
The optimal draft length γ is the one that maximises expected tokens per step divided by per-step cost (γ draft forward passes plus one target forward pass). Most production systems land at γ ∈ {3, 4, 5}; beyond that, one more draft position adds only about α^γ expected tokens, which no longer covers the extra draft latency.
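The trade-off can be explored with a small search over γ. This is a rough sketch under stated assumptions: α is constant across positions, one target pass verifies all γ draft tokens at the cost of a single forward pass, and `c` (a hypothetical parameter) is the cost of one draft forward pass relative to one target forward pass. Real systems would measure these costs rather than assume them.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens per target call under a constant acceptance rate."""
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected tokens per unit of cost, relative to plain decoding.

    Cost is measured in target forward passes: gamma draft passes at
    relative cost c each, plus one target pass for verification.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length in [1, max_gamma] with the highest modelled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        g = best_gamma(alpha, c=0.2)
        print(f"alpha={alpha:.1f}  best gamma={g}  speedup={speedup(alpha, g, 0.2):.2f}x")
```

Under this model, α = 0.7 with c = 0.2 gives an optimum at γ = 3, and the optimum moves to longer drafts as α rises or as the draft model gets cheaper.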
## Related
- [[Speculative Decoding]]
- [[Draft Model]]
- [[Modified Rejection Sampling]]
- [[Temperature]]
## Sources
- [[Fast Inference from Transformers via Speculative Decoding (Leviathan et al.)]]
- [[Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al.)]]