Continuous Batching - Albert Masoliver's learning site

## Definition

**Continuous batching** (also called *in-flight batching*) is an LLM inference-serving technique that returns each request's response as soon as it finishes generating, without waiting for the rest of the batch, and immediately slots a new request into the freed position. It eliminates the head-of-line blocking inherent in static and dynamic batching, where a short request must idle until the longest request in the same batch completes.

## The Batching Spectrum

There are three common strategies for grouping concurrent inference requests:

| Strategy | When batch executes | Drawback |
|---|---|---|
| **Static batching** | After a fixed number of requests arrive | First request waits until batch is full |
| **Dynamic batching** | After N requests or a time window, whichever comes first | Under-full batches waste compute |
| **Continuous batching** | Continuously; completed responses leave and new requests enter immediately | Slightly more complex to implement |

Chip Huyen (2024) offers an analogy: static batching is a bus that waits until every seat is filled; dynamic batching is a bus that departs on schedule, full or not; continuous batching picks up the next passenger as soon as one gets off.

## Why It Matters for LLMs

In static or dynamic batching, a batch runs until its longest sequence finishes, and shorter sequences are padded to that length. Because LLM output lengths vary enormously, a single long-running request stalls every shorter response in the batch, adding unnecessary latency.

Continuous batching was introduced at scale in the **Orca** paper (Yu et al., 2022) as *iteration-level scheduling*: the scheduler revisits the batch after every decode iteration (one new token per sequence) instead of once per batch, so short sequences complete and return without delay. Orca pairs this with *selective batching*, which batches only the operations that can be combined across sequences of different lengths.

## Interaction with PagedAttention

**vLLM** popularised continuous batching alongside **PagedAttention**, which manages the [[KV Cache]] as fixed-size, non-contiguous memory pages.
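The paging idea can be sketched with a toy block allocator. This is a minimal illustration under stated assumptions, not vLLM's implementation: the class `ToyPagedKVAllocator`, its method names, and the block counts are hypothetical choices made for this example, and it tracks only block IDs rather than real tensors.

```python
class ToyPagedKVAllocator:
    """Hypothetical toy allocator: tracks KV-cache block IDs, not real tensors."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                  # tokens stored per block
        self.free_blocks = list(range(num_blocks))    # pool of unused block IDs
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> block IDs
        self.lengths: dict[str, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve space for one more token; return False if no block is free."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none yet
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return every block of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    alloc = ToyPagedKVAllocator(num_blocks=4, block_size=2)
    for _ in range(3):
        alloc.append_token("a")      # 3 tokens occupy 2 blocks
    for _ in range(4):
        alloc.append_token("b")      # 4 tokens occupy 2 blocks
    print(alloc.append_token("c"))   # pool is exhausted at this point
    alloc.release("a")               # "a" finishes; its blocks become reusable
    print(alloc.append_token("c"))   # "c" can now be admitted
    print(alloc.block_tables)
```

A sequence's blocks need not be adjacent in the pool, and a finished sequence hands its blocks straight back to the free list, which is what lets a newly admitted request start without waiting for a large contiguous region.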
Because continuous batching frees slots dynamically, memory must be allocated and released at fine granularity; paged memory management makes this practical without significant fragmentation.

## Effect on Metrics

Continuous batching primarily improves user-observed **TPOT** (time per output token) for short requests that would otherwise be blocked behind longer ones. It also raises overall throughput by keeping GPU utilisation high without inflating per-request latency. The trade-off is implementation complexity and the need for careful memory management (see [[KV Cache]]).

## Related

- [[KV Cache]]
- [[Inference Latency]]
- [[Inference Goodput]]
- [[Prefill-Decode Disaggregation]]

## Sources

- [[AI Engineering - Chip Huyen]]