Prefill-Decode Disaggregation - Albert Masoliver's learning site

## Definition

**Prefill-decode disaggregation** is an LLM inference-service optimisation that assigns the prefill phase and the decode phase to separate hardware instances (e.g., separate GPUs or GPU groups) instead of running both on the same machine. Because the two phases have fundamentally different computational profiles, co-locating them causes resource contention that degrades both TTFT and TPOT simultaneously.

## Why the Phases Conflict

LLM inference decomposes into two steps with opposing bottleneck characteristics:

| Phase | Bottleneck | Nature |
|---|---|---|
| **Prefill** | Compute-bound | Processes all input tokens in one large parallel matrix multiply |
| **Decode** | Memory-bandwidth-bound | Emits one token per step, mostly loading model weights into GPU cores |

When a new prompt arrives, its prefill job competes for GPU compute with ongoing decode jobs on the same machine. A single large prefill can drain the computational budget from decode steps in progress, spiking TPOT for existing users while the new user waits for TTFT. Disaggregation eliminates this contention by routing each phase to dedicated instances.

## Architecture

```
User request
     │
     ▼
Prefill instance(s) ──── KV state ────► Decode instance(s)
  (compute-bound)     (transferred via    (bandwidth-bound)
                       NVLink / RDMA)
```

After the prefill instance fills the initial [[KV Cache]], it transfers the key-value state to the decode pool. Research (DistServe, Zhong et al. 2024; Inference Without Interference, Hu et al. 2024) shows that communication overhead over high-bandwidth interconnects such as NVLink is not a significant bottleneck in practice.

## Tuning the Ratio

The ratio of prefill instances to decode instances is workload-dependent:

- Long input sequences with TTFT priority → more prefill instances (e.g., 2:1 to 4:1).
- Short inputs with TPOT priority → fewer prefill instances (e.g., 1:2 to 1:1).
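One rough way to reason about the ratio is a back-of-the-envelope capacity estimate: compare the token load each phase sees with the throughput one instance can sustain. The sketch below is illustrative only. The `Workload` class, the `instances_needed` function, the `headroom` parameter, and every throughput number are assumptions made for this example, not measurements or part of any serving framework. A real deployment would also have to account for batching, queueing, and SLO percentiles.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description (all fields are assumptions)."""
    requests_per_s: float      # request arrival rate
    mean_input_tokens: float   # prompt length, handled by prefill
    mean_output_tokens: float  # generated length, handled by decode


def instances_needed(w, prefill_tok_per_s, decode_tok_per_s, headroom=0.5):
    """Estimate (prefill_instances, decode_instances) for a workload.

    prefill_tok_per_s / decode_tok_per_s: assumed sustained throughput of
    a single instance for each phase.
    headroom: fraction of that throughput we allow ourselves to use, so
    that latency stays low under bursts (an assumed safety margin).
    """
    prefill_load = w.requests_per_s * w.mean_input_tokens   # tokens/s to prefill
    decode_load = w.requests_per_s * w.mean_output_tokens   # tokens/s to decode
    n_prefill = max(1, math.ceil(prefill_load / (prefill_tok_per_s * headroom)))
    n_decode = max(1, math.ceil(decode_load / (decode_tok_per_s * headroom)))
    return n_prefill, n_decode


# Made-up per-instance throughputs, used for both scenarios.
PREFILL_TOK_PER_S = 20_000
DECODE_TOK_PER_S = 2_000

# Long prompts, short answers: prefill dominates.
long_input = Workload(requests_per_s=10, mean_input_tokens=4000, mean_output_tokens=100)
print(instances_needed(long_input, PREFILL_TOK_PER_S, DECODE_TOK_PER_S))   # (4, 1)

# Short prompts, longer answers: decode dominates.
short_input = Workload(requests_per_s=10, mean_input_tokens=500, mean_output_tokens=200)
print(instances_needed(short_input, PREFILL_TOK_PER_S, DECODE_TOK_PER_S))  # (1, 2)
```

With these made-up numbers the long-input workload lands at 4:1 and the short-input workload at 1:2, which matches the direction of the guidance above. The estimate is only a starting point; the final ratio should come from measuring TTFT and TPOT under real traffic.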
## Relationship to Goodput

Disaggregation directly improves [[Inference Goodput]] by allowing more requests to satisfy their SLO simultaneously: TTFT and TPOT can be optimised on independent hardware budgets rather than traded off against each other on shared resources.

## Related

- [[KV Cache]]
- [[Inference Latency]]
- [[Inference Goodput]]
- [[Continuous Batching]]

## Sources

- [[AI Engineering - Chip Huyen]]