Decoding Strategy - Albert Masoliver's learning site

## Definition

A **decoding strategy** is the method that converts the model's next-token probability distribution into an actual chosen [[Token]]. The model gives you a distribution; the decoding strategy decides what to do with it, and it is your main lever for trading determinism against creativity.

## The main strategies

| Strategy | How it picks | Character |
| --- | --- | --- |
| **Greedy** (the temperature 0 limit) | always the single highest-probability token | deterministic, repetitive |
| **Pure sampling** | draw from the full distribution | diverse, can wander |
| **Top-k** | sample only from the k most likely tokens | bounded randomness |
| **Top-p / nucleus** | sample from the smallest set of most likely tokens whose cumulative probability is ≥ p | adapts to how peaked the distribution is |
| **Beam search** | keep the b highest-scoring partial sequences at each step | good for short, "correct-answer" outputs |

[[Temperature]] is the companion dial: it sharpens (low) or flattens (high) the distribution *before* the strategy samples from it. See [[Sampling]] and [[Logprobs]] for what is being sampled.

## The determinism-creativity lever

This is the practitioner's daily decision:

- **Low temperature / greedy** for code you'll diff, structured extraction, or anything where you want the same answer every time.
- **Higher temperature / nucleus** when you want the model to surface options: brainstorming, drafting, generating alternatives.

Match the strategy to the task, not to a default.

## "Deterministic" has limits

Even greedy decoding with a fixed seed is **not guaranteed to be byte-for-byte reproducible** across runs. Floating-point non-associativity, GPU kernel scheduling, and batching mean the same input can produce slightly different logits from run to run or on different hardware, and when two top tokens are nearly tied that small difference can flip the choice. Plan evals and diffs around *approximate* stability, not exact reproduction.

## Related

- [[Sampling]]
- [[Temperature]]
- [[Logprobs]]
- [[Large Language Model]]
- [[Token]]
- [[Test-Time Compute]]
- [[Hands-On Large Language Models - Alammar, Grootendorst]]
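
## Sketch: the strategies in code

The table of strategies above can be made concrete with a few lines of Python. This is a minimal sketch, not how any particular library implements decoding: it works on a plain list of logits for a single step, the function names (`softmax`, `greedy`, `top_k_filter`, `top_p_filter`, `sample`) and the example logits are made up for illustration, and beam search is left out because it tracks whole sequences, not one distribution.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature.

    temperature must be > 0 here; greedy decoding is the limit as it
    approaches 0 and is handled separately by `greedy`.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Return the index of the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to probs (pure sampling if unfiltered)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
rng = random.Random(0)

print("greedy:", greedy(probs))
print("pure sampling:", sample(probs, rng))
print("top-k (k=2):", sample(top_k_filter(probs, 2), rng))
print("top-p (p=0.8):", sample(top_p_filter(probs, 0.8), rng))
print("sharper (T=0.5):", softmax(logits, temperature=0.5))
print("flatter (T=2.0):", softmax(logits, temperature=2.0))
```

Temperature is applied inside `softmax`, before any filtering, which mirrors the order described above: reshape the distribution first, then let the strategy pick from it.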