KV Cache - Albert Masoliver's learning site

## Definition

The **KV cache** stores the key and value vectors computed by the [[Attention Mechanism]] for every token already processed, so that each new token generated during autoregressive decoding does not have to recompute those projections for the entire history. It is arguably the single most important optimization in LLM serving.

## Why it exists

A [[Transformer Architecture]] generates text one token at a time, and every new token attends to all previous tokens. Without a cache, generating token *n* would recompute the keys and values for tokens *1…n-1* from scratch, so the redundant work adds up quadratically over the sequence. By caching the K and V projections, each step reuses prior computation and only computes the new token's contribution.

## Prefill vs decode

Inference splits into two phases with very different performance profiles:

| Phase | What it does | Parallelism |
| --- | --- | --- |
| **Prefill** | Processes the whole prompt at once and fills the cache | Fully parallel across prompt tokens (fast) |
| **Decode** | Emits output tokens one by one | Sequential (the bottleneck) |

Prefill is a single forward pass dominated by large matrix multiplies; decode is an inherently serial loop, because each new token depends on the one before it. This asymmetry is why output generation feels slower than prompt ingestion.

## The benchmark

Alammar (in *Hands-On Large Language Models*) measures the effect directly: generating 100 tokens drops from roughly **21.8 s** without the cache to about **4.5 s** with it. That is nearly a 5× speedup, just from not throwing away work.

```
no cache : ~21.8 s / 100 tokens
KV cache : ~4.5 s  / 100 tokens
```

## Practical consequences

- **Long context costs latency and money.** The cache grows linearly with sequence length, so a large [[Context Window]] inflates memory and slows each decode step, since every step has to read the whole cache. This is a real driver of [[Inference Latency]].
- **Output streams for a reason.** Because decode runs one token at a time, tokens become ready sequentially, so streaming them to the user is natural, not a UX trick.
- **Caching the prompt pays twice.** Reusing prefill state across requests is exactly what [[Prompt Caching]] exploits.

## Related

- [[Transformer Architecture]]
- [[Attention Mechanism]]
- [[Context Window]]
- [[Inference Latency]]
- [[Prompt Caching]]
- [[Token]]
- [[Hands-On Large Language Models - Alammar, Grootendorst]]
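
As a rough illustration of the idea under "Why it exists", the sketch below compares decoding with and without a cache for a toy single-head, single-layer attention step. Everything in it is an assumption made for illustration: the helper names (`matvec`, `attend`, `decode_no_cache`, `decode_with_cache`), the random weights, and the tiny dimension `d = 4`. It is not how any particular library implements its cache; it only shows that caching K and V gives the same outputs with far fewer projections.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out


def decode_no_cache(tokens, Wq, Wk, Wv):
    """At every step, re-project K and V for the whole history."""
    outputs, projections = [], 0
    for n in range(1, len(tokens) + 1):
        history = tokens[:n]
        keys = [matvec(Wk, x) for x in history]
        values = [matvec(Wv, x) for x in history]
        projections += 2 * n  # n key projections + n value projections
        outputs.append(attend(matvec(Wq, history[-1]), keys, values))
    return outputs, projections


def decode_with_cache(tokens, Wq, Wk, Wv):
    """At every step, project only the new token and append it to the cache."""
    k_cache, v_cache = [], []
    outputs, projections = [], 0
    for x in tokens:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        projections += 2  # one key projection + one value projection
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, projections


# Hypothetical toy setup: random weights and 8 random "token embeddings".
random.seed(0)
d = 4


def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(8)]

slow, slow_proj = decode_no_cache(tokens, Wq, Wk, Wv)
fast, fast_proj = decode_with_cache(tokens, Wq, Wk, Wv)

same = all(
    abs(a - b) < 1e-12 for ra, rb in zip(slow, fast) for a, b in zip(ra, rb)
)
print(same, slow_proj, fast_proj)  # True 72 16
```

For 8 tokens the uncached loop performs 72 K/V projections against 16 with the cache, and the gap widens quadratically as the sequence grows. Note that the attention step itself still reads every cached entry, which is why long contexts slow each decode step even with the cache in place.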