## Definition **Self-consistency** is a [[Test-Time Compute]] technique: instead of trusting a single greedy answer, you sample several independent solutions to the same problem and keep the one the model agrees on most — typically a majority vote over the final answers. It trades inference compute for reliability. ## How it works 1. Take a problem and a [[Chain-of-Thought]] prompt. 2. Sample *N* completions at non-zero temperature, so each reasons along a different path. 3. Extract the final answer from each. 4. Return the **most frequent** final answer. ```text Q ──▶ sample × N ──▶ [7, 7, 7, 12, 7] ──▶ majority = 7 ``` The insight is that there are many flawed reasoning paths but they tend to disagree with each other, while correct paths tend to converge on the same answer. Marginalising over reasoning chains and voting on the *answer* filters out idiosyncratic mistakes. ## It works — and the gains scale with N Empirically, self-consistency lifts accuracy on reasoning benchmarks substantially over single-sample chain-of-thought. The effect grows with the sample count: drawing on the order of 32 samples for a benchmark like MMLU buys a real and measurable accuracy bump, as Huyen notes in *[[AI Engineering - Chip Huyen]]*. The cost is linear in *N* — ten votes is roughly ten times the spend — so it is a dial you turn up only when correctness matters more than latency or money. ## The most likely is not always the best The deeper lesson self-consistency teaches: **the single most probable continuation is not always the right one.** Greedy decoding commits to the locally likeliest token at every step and can march confidently down a wrong path. Sampling-and-voting acknowledges that the model's *best* answer is sometimes buried in its distribution, not at its mode. This is the same intuition that motivates spending more test-time compute generally. ## Relation to other test-time methods Self-consistency is the simplest member of a family. [[Tree of Thought]] also explores multiple paths but adds branching and per-node evaluation rather than a flat vote at the end. [[Reflection]] revisits a *single* answer to critique and improve it, rather than sampling many. All three spend extra inference compute to raise quality — they differ in how they structure that spend: vote, search, or revise. ## Related - [[Chain-of-Thought]] - [[Tree of Thought]] - [[Reflection]] - [[Test-Time Compute]] - [[AI Engineering - Chip Huyen]]