## Definition
**Continuous batching** (also called *in-flight batching*) is an LLM inference-serving technique that returns each request's response as soon as it finishes generating, without waiting for the rest of the batch, and immediately slots a new request into the freed position. It eliminates the head-of-line blocking inherent in static and dynamic batching, where a short request must idle until the longest request in the same batch completes.
## The Batching Spectrum
Three techniques exist for grouping concurrent inference requests:
| Strategy | When batch executes | Drawback |
|---|---|---|
| **Static batching** | After a fixed number of requests arrive | First request waits until batch is full |
| **Dynamic batching** | After N requests or a time window, whichever comes first | Under-full batches waste compute |
| **Continuous batching** | Continuously; completed responses leave and new requests enter immediately | Slightly more complex to implement |

The analogy from Chip Huyen (2024): static batching is a bus that waits until every seat is filled; dynamic batching is a bus that departs when it is full or at a scheduled time, whichever comes first; continuous batching is a bus that picks up the next passenger as soon as one gets off.
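The dispatch rules for the first two strategies in the table can be sketched as one predicate. This is a minimal illustration, not the API of any serving framework: the function name `should_dispatch` and the parameters `max_batch_size` and `max_wait_s` are hypothetical.

```python
import time


def should_dispatch(queue_len, oldest_arrival, max_batch_size, max_wait_s, now=None):
    """Dynamic batching rule: run when the batch is full or the window expires.

    All names here are illustrative assumptions. Timestamps are in seconds.
    """
    if queue_len == 0:
        return False  # nothing to run
    now = time.monotonic() if now is None else now
    batch_is_full = queue_len >= max_batch_size
    window_expired = (now - oldest_arrival) >= max_wait_s
    return batch_is_full or window_expired
```

Under these assumptions, static batching is the special case `max_wait_s = float("inf")`, where only the size condition can fire. Continuous batching has no such predicate, because admission happens after every decode step rather than once per batch.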
## Why It Matters for LLMs
In static or dynamic batching, a batch runs until its longest sequence finishes, and shorter sequences are typically padded to that length. Because LLM output lengths vary enormously, a single long-running request stalls every shorter response in the batch, adding unnecessary latency. Continuous batching, introduced in the **Orca** paper (Yu et al., 2022), schedules work one decode iteration at a time rather than one whole request at a time, and selectively batches only the operations that do not cause one request to hold up another. Short sequences can therefore complete and return without delay.
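The effect can be illustrated with a toy simulation that counts decode steps. It is a sketch under simplifying assumptions: every decode step costs one time unit regardless of batch composition, prefill is ignored, all requests are already queued, and the function names are hypothetical rather than taken from Orca or any framework.

```python
from collections import deque


def static_batching(output_lens, batch_size):
    """Return the step at which each request's response is returned.

    Requests run in arrival order in fixed batches; every response in a
    batch is returned only when the longest request in that batch finishes.
    """
    finish = []
    clock = 0
    for start in range(0, len(output_lens), batch_size):
        batch = output_lens[start:start + batch_size]
        clock += max(batch)  # the batch runs until its longest member ends
        finish.extend([clock] * len(batch))
    return finish


def continuous_batching(output_lens, batch_size):
    """Same question, but free slots are refilled before every decode step."""
    waiting = deque(enumerate(output_lens))
    running = {}  # request id -> tokens still to generate
    finish = [0] * len(output_lens)
    clock = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        clock += 1  # one decode step for every running request
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                finish[rid] = clock  # response leaves immediately
                del running[rid]
    return finish


lens = [2, 8, 3, 1]
print(static_batching(lens, batch_size=2))      # [8, 8, 11, 11]
print(continuous_batching(lens, batch_size=2))  # [2, 8, 5, 6]
```

In this example the long request (8 tokens) finishes at step 8 either way, while the short requests return at steps 2, 5, and 6 instead of 8, 11, and 11.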
## Interaction with PagedAttention
**vLLM** popularised continuous batching alongside **PagedAttention**, which stores the [[KV Cache]] in fixed-size blocks (pages) that need not be contiguous in memory. Because continuous batching frees slots dynamically, memory must be allocated and released at fine granularity; paged memory management makes this practical while keeping fragmentation low.
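A toy block allocator shows why paging fits this allocation pattern. This is a simplified sketch, not vLLM's implementation or API: the class `BlockAllocator`, its methods, and the block-table layout are assumptions made for illustration, and no actual tensors are stored.

```python
class BlockAllocator:
    """Toy pool of fixed-size KV-cache blocks (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size          # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                # request id -> physical block ids
        self.num_tokens = {}                  # request id -> tokens stored so far

    def append_token(self, rid):
        """Reserve space for one more token; return False if the pool is empty."""
        table = self.block_tables.setdefault(rid, [])
        count = self.num_tokens.get(rid, 0)
        # A new block is needed only when all blocks held so far are full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.num_tokens[rid] = count + 1
        return True

    def release(self, rid):
        """Return every block of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(rid, []))
        self.num_tokens.pop(rid, None)
```

In this sketch a request never reserves more than one partially filled block, and a finished request's blocks become reusable by any other request as soon as `release` is called, regardless of where they sit in the pool.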
## Effect on Metrics
Continuous batching primarily improves end-to-end latency, and with it the user-perceived **TPOT** (time per output token), for short requests that would otherwise be blocked behind longer ones. It also improves overall throughput by keeping GPU utilisation high without inflating per-request latency. The trade-off is implementation complexity and the need for careful memory management (see [[KV Cache]]).
## Related
- [[KV Cache]]
- [[Inference Latency]]
- [[Inference Goodput]]
- [[Prefill-Decode Disaggregation]]
## Sources
- [[AI Engineering - Chip Huyen]]