## Definition
**Sampling** (or **decoding**) is the inference-time process of selecting the next token from the probability distribution an LLM produces. It is why the same model can return different responses to the same prompt, and it is where many practical knobs (temperature, top-k, top-p) live.
## The Distribution
At each step $t$, the model outputs a distribution over the vocabulary $V$, conditioned on the preceding tokens $x_{<t}$:
$$
P(x_t = v \mid x_{<t}) \quad \forall v \in V
$$
Sampling chooses one $v$ from this distribution per step. The choice rule is the **decoding strategy**.
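
As a rough illustration of the distribution above, the sketch below turns a made-up vector of logits into probabilities with a softmax and draws one token from it. The four-word vocabulary, the logit values, and the helper name `softmax` are assumptions for the example, not something a particular model exposes.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits for one decoding step.
vocab = ["the", "a", "cat", "dog"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# Draw one token according to the distribution (seeded for repeatability).
rng = random.Random(0)
next_token = rng.choices(vocab, weights=probs, k=1)[0]
```

Every strategy below is a different rule for getting from `probs` to `next_token`.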
## Common Strategies
### Greedy decoding
Pick the highest-probability token at every step. Deterministic given the prompt. Tends to produce repetitive, sometimes degenerate output (for example, loops that repeat the same phrase).
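
A minimal sketch of greedy decoding, assuming a hypothetical `toy_next_token_probs` function that stands in for the model and returns a `{token: probability}` dict for a given prefix. The lookup table and the `<eos>` end marker are invented for the example.

```python
def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a model: next-token probabilities for a prefix."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
        ("the", "cat"): {"sat": 0.7, "<eos>": 0.3},
    }
    # Any prefix not in the table simply ends the sequence.
    return table.get(tuple(prefix), {"<eos>": 1.0})

def greedy_decode(next_probs, max_steps=10, eos="<eos>"):
    tokens = []
    for _ in range(max_steps):
        probs = next_probs(tokens)
        token = max(probs, key=probs.get)  # argmax over the vocabulary
        if token == eos:
            break
        tokens.append(token)
    return tokens

result = greedy_decode(toy_next_token_probs)
```

Because there is no randomness, calling `greedy_decode` again on the same toy model returns the same sequence.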
### Beam search
Maintain $k$ candidate sequences ("beams"). At each step, extend each beam with its most likely next tokens and keep the $k$ partial sequences with the highest joint probability. Better than greedy for tasks with a single correct answer (such as translation), but discouraged for open-ended generation, where it tends to produce bland, generic output.
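
The sketch below is one simplified way to write beam search, scoring sequences by summed log-probability. The toy model `toy_next_token_probs`, its table, and the `<eos>` marker are assumptions for illustration. Real implementations usually add length normalisation and smarter handling of finished hypotheses, which are omitted here.

```python
import math

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a model: next-token probabilities for a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

def beam_search(next_probs, beam_width=2, max_steps=5, eos="<eos>"):
    beams = [(0.0, [])]  # each beam is (log-probability, tokens so far)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            for token, p in next_probs(tokens).items():
                if p <= 0.0:
                    continue
                new_score = score + math.log(p)  # joint probability in log space
                if token == eos:
                    finished.append((new_score, tokens))
                else:
                    candidates.append((new_score, tokens + [token]))
        if not candidates:
            break
        # Keep only the best `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    pool = finished or beams  # fall back to unfinished beams if none ended
    return max(pool, key=lambda c: c[0])[1]

wide = beam_search(toy_next_token_probs, beam_width=2)
narrow = beam_search(toy_next_token_probs, beam_width=1)
```

In this toy table, a beam width of 1 behaves like greedy decoding and commits to `"a"` (probability 0.6), ending with a sequence of joint probability 0.3. A width of 2 keeps `"b"` alive long enough to find `["b", "z"]` at 0.36.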
### Top-k sampling
Restrict to the top $k$ most likely tokens; renormalise; sample. Cuts off the long tail of unlikely tokens.
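
A minimal sketch of the three steps (restrict, renormalise, sample), assuming the distribution is available as a plain `{token: probability}` dict. The helper name `top_k_filter` and the example probabilities are made up.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

# Hypothetical next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
filtered = top_k_filter(probs, k=2)

rng = random.Random(0)
token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```

With `k=2`, only `cat` and `dog` survive, with renormalised probabilities 0.625 and 0.375; the tail tokens can no longer be drawn.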
### Top-p (nucleus) sampling
Restrict to the smallest set of tokens whose cumulative probability is $\geq p$ (e.g., 0.9); renormalise; sample. Adapts the cutoff to the shape of each step's distribution.
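
A minimal sketch of the nucleus cutoff, again assuming a `{token: probability}` dict. The helper name `top_p_filter` and both example distributions are invented to show how the number of kept tokens changes with the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability is >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}  # renormalised

# Hypothetical distributions: one sharply peaked, one nearly flat.
peaked = {"cat": 0.92, "dog": 0.05, "fish": 0.02, "xylophone": 0.01}
flat = {"cat": 0.3, "dog": 0.3, "fish": 0.25, "xylophone": 0.15}

nucleus_peaked = top_p_filter(peaked, p=0.9)
nucleus_flat = top_p_filter(flat, p=0.9)
```

For the peaked distribution a single token already covers 0.9, so the nucleus has one entry; for the flat one all four tokens are needed. A fixed top-k could not adapt this way.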
### Temperature
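Divides the logits by a temperature $T$ before the softmax; see [[Temperature]]. It is not a selection rule on its own: it reshapes the distribution that the sampling strategies above draw from. Lower $T$ → more deterministic; higher $T$ → more diverse.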
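
A small sketch of temperature scaling, assuming the common formulation in which each logit is divided by the temperature before the softmax. The function name and the logit values are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (treat 0 as greedy argmax)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
cold = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
```

The top token's probability is highest in `cold` and lowest in `hot`: low temperatures sharpen the distribution and high temperatures flatten it.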
## Stochasticity and Reproducibility
Most production LLMs sample by default, so identical prompts can produce different responses. For reproducibility:
- Set temperature to 0 (greedy).
- Or fix a seed (where the API supports it).
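Even greedy decoding isn't perfectly deterministic across different batch sizes or hardware: floating-point addition is not associative, so a different reduction order can shift the logits slightly and flip a near-tie between the top tokens.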
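
Both points can be illustrated with the standard library alone. The sketch below assumes a local `random.Random` generator as a stand-in for an API's seed parameter; the helper `sample_tokens` and the distribution are hypothetical.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n tokens from a fixed distribution using a seeded generator."""
    rng = random.Random(seed)
    tokens = list(probs)
    weights = list(probs.values())
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(n)]

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}

# Same seed, same sequence of draws.
run_a = sample_tokens(probs, 5, seed=42)
run_b = sample_tokens(probs, 5, seed=42)

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = left != right
```

`run_a` and `run_b` are identical because the seed is fixed, while `order_matters` is `True`: the two sums differ in the last bit. The same effect, at a much larger scale, is what lets batch size or hardware change a near-tie between logits.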
## Typical Defaults
For chat assistants: temperature ~0.7–1.0, top-p ~0.9, and top-k either disabled or set to around 50. For code generation: a lower temperature (~0.2) when correctness matters, and a higher one when exploring alternatives.
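
One way these knobs might be combined is sketched below: apply temperature, then top-k, then top-p, then sample. The preset names, the `sample_next` function, and the order of the filters are assumptions for illustration. Real libraries differ in ordering and parameter names, and the `code` preset only sets temperature because that is the only value given above.

```python
import math
import random

# Illustrative presets based on the ranges above; the names are hypothetical.
PRESETS = {
    "chat": {"temperature": 0.8, "top_p": 0.9, "top_k": 50},
    "code": {"temperature": 0.2, "top_p": None, "top_k": None},
}

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token from a {token: logit} dict using the given knobs."""
    rng = rng or random.Random()
    if temperature == 0:
        return max(logits, key=logits.get)  # temperature 0 means greedy
    # Temperature-scaled softmax (max subtracted for stability).
    m = max(logits.values())
    weights = {t: math.exp((x - m) / temperature) for t, x in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((t, w / total) for t, w in weights.items()),
                    key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
        kept_mass = sum(p for _, p in ranked)
        ranked = [(t, p / kept_mass) for t, p in ranked]  # renormalise
    if top_p is not None:
        kept, cumulative = [], 0.0
        for t, p in ranked:
            kept.append((t, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [t for t, _ in ranked]
    probs = [p for _, p in ranked]
    # random.choices normalises the weights, so no final renormalisation is needed.
    return rng.choices(tokens, weights=probs, k=1)[0]

logits = {"a": 5.0, "b": 1.0, "c": 0.0}  # hypothetical logits
chat_token = sample_next(logits, rng=random.Random(0), **PRESETS["chat"])
greedy_token = sample_next(logits, temperature=0)
```

Treating temperature 0 as a special case avoids dividing by zero and matches the convention that temperature 0 means greedy decoding.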
## Related
- [[Temperature]]
- [[Large Language Model]]
- [[Token]]
- [[Hallucination]]