## Definition

**Sampling** (or **decoding**) is the runtime process of selecting the next token from the probability distribution the LLM produces. It is the reason the same model can give different responses to the same prompt, and it is where many practical knobs live.

## The Distribution

At each step $t$, the model outputs a distribution over the vocabulary $V$:

$$
P(x_t = v \mid x_{<t}) \quad \forall v \in V
$$

Sampling chooses one $v$ from this distribution per step. The choice rule is the **decoding strategy**.

## Common Strategies

### Greedy decoding

Pick the highest-probability token at every step. Deterministic given the prompt. Tends to produce repetitive, sometimes degenerate output.

### Beam search

Maintain $k$ candidate sequences ("beams"); at each step, extend each beam with its most likely next tokens and keep the best $k$ partial sequences by joint probability. Better than greedy for tasks with a single correct answer (e.g., translation), but discouraged for open-ended generation, where it tends to produce bland, generic output.

### Top-k sampling

Restrict to the $k$ most likely tokens; renormalise; sample. Cuts off the long tail of unlikely tokens.

### Top-p (nucleus) sampling

Restrict to the smallest set of tokens whose cumulative probability is $\geq p$ (e.g., 0.9); renormalise; sample. Adapts the cutoff to the shape of each step's distribution: a peaked distribution keeps few tokens, a flat one keeps many.

### Temperature

Rescales the logits before softmax by dividing them by a temperature $T$ (see [[Temperature]]). Lower → more deterministic; higher → more diverse.

## Stochasticity and Reproducibility

Most production LLMs sample by default, so identical prompts can produce different responses. For reproducibility:

- Set temperature to 0 (greedy).
- Or fix a seed (where the API supports it).

Even greedy decoding isn't perfectly deterministic across different batch sizes or hardware: floating-point addition is not associative, so a different reduction order can shift the logits slightly and flip a near-tie between tokens.

## Typical Defaults

For chat assistants: temperature ~0.7–1.0, top-p ~0.9, and top-k either disabled or set to 50.
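
The strategies above can be combined into a single sampling step. What follows is a minimal sketch in plain Python, assuming the logits arrive as a list of floats. The function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample_next`) are illustrative and not taken from any library, and the order of operations (temperature, then top-k, then top-p) is one common convention that is assumed here rather than required.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    keep = set(keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalise(probs, kept)


def sample_next(logits, temperature=0.8, top_k=None, top_p=0.9, rng=random):
    """Choose one token index. Assumed order: temperature, top-k, top-p."""
    if temperature == 0:
        # Temperature 0 is treated as greedy decoding (argmax over logits).
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # toy four-token vocabulary
    print(sample_next(logits, temperature=0))  # greedy: prints 0
    seeded = random.Random(42)  # a fixed seed makes the draws repeatable
    print([sample_next(logits, rng=seeded) for _ in range(10)])
```

Passing a seeded `random.Random` instance as `rng` plays the role of the seed option mentioned under reproducibility, and `temperature=0` short-circuits to greedy decoding instead of dividing by zero.
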
For code generation: a lower temperature (~0.2) when correctness matters; a higher one when exploring alternatives.

## Related

- [[Temperature]]
- [[Large Language Model]]
- [[Token]]
- [[Hallucination]]