Inference Goodput - Albert Masoliver's learning site

## Definition

**Inference goodput** is the number of requests per second (or per minute) that an LLM inference service completes *while satisfying its Service Level Objective (SLO)*. A raw throughput figure counts all completed requests; goodput counts only those that meet the latency requirements the user actually cares about.

The concept is adapted from networking, where goodput measures useful data transferred (excluding retransmissions and overhead), as distinct from raw bandwidth.

## Motivation: Throughput Hides Bad Latency

An inference service optimised purely for throughput may process many tokens per second in aggregate while individual requests still see an unacceptable TTFT (time to first token) or TPOT (time per output token).

For example, a service handling 100 requests per minute where only 30 meet both the target TTFT and the target TPOT has a goodput of 30 RPM, even though its raw throughput is 100 RPM. The remaining 70 are completed too late to be useful.

$$
\text{goodput} = \frac{\text{number of completed requests that satisfy the SLO}}{\text{length of the measurement window}}
$$

## SLO Composition

A typical SLO for a conversational application might specify:

- **TTFT** ≤ 200 ms
- **TPOT** ≤ 100 ms

Both conditions must hold for a request to count toward goodput. The thresholds are application-specific: a streaming chat UI requires a low TTFT, whereas a batch document-summarisation pipeline may tolerate a TTFT of several seconds.

## Relationship to Other Metrics

| Metric | What it measures | Limitation |
|---|---|---|
| Throughput (tokens/s or RPS) | Raw rate of completion | Ignores whether latency meets user needs |
| [[Inference Latency]] (TTFT, TPOT) | Per-request speed | Doesn't reflect how many requests succeed |
| **Goodput** | Rate of SLO-compliant completions | Combines both; the operationally relevant signal |

## Practical Implications

- Goodput is the correct metric to maximise when tuning [[Continuous Batching]] or [[Prefill-Decode Disaggregation]], since the point of those techniques is to serve more requests within their latency targets, not merely to raise raw throughput.
- Scaling out the replica count (more GPU copies of the model) is the most direct lever to increase goodput when compute is the bottleneck.
- Aggressive batching may raise raw throughput but lower goodput: larger batches tend to increase TPOT, and short requests can end up queued behind long ones.

## Related

- [[Inference Latency]]
- [[Continuous Batching]]
- [[Prefill-Decode Disaggregation]]
- [[KV Cache]]

## Sources

- [[AI Engineering - Chip Huyen]]
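
## Worked Example

The definition and the 100-request example in this note can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the `RequestRecord` schema, the `SLO` class, the `throughput_and_goodput` helper, and the latency values are all hypothetical names and synthetic data chosen for the sketch. The thresholds are the 200 ms TTFT and 100 ms TPOT from the SLO example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Latency measurements for one completed request (hypothetical schema)."""
    ttft_ms: float  # time to first token, in milliseconds
    tpot_ms: float  # mean time per output token, in milliseconds


@dataclass(frozen=True)
class SLO:
    """Latency thresholds; defaults follow the example SLO in this note."""
    max_ttft_ms: float = 200.0
    max_tpot_ms: float = 100.0

    def is_met(self, r: RequestRecord) -> bool:
        # Both conditions must hold for the request to count toward goodput.
        return r.ttft_ms <= self.max_ttft_ms and r.tpot_ms <= self.max_tpot_ms


def throughput_and_goodput(records, slo, window_minutes):
    """Return (throughput, goodput) in requests per minute over the window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    completed = len(records)
    compliant = sum(1 for r in records if slo.is_met(r))
    return completed / window_minutes, compliant / window_minutes


# Synthetic data: 100 requests completed in one minute, 30 of which meet both targets.
records = (
    [RequestRecord(ttft_ms=150.0, tpot_ms=80.0)] * 30     # meets both targets
    + [RequestRecord(ttft_ms=450.0, tpot_ms=80.0)] * 40   # TTFT too high
    + [RequestRecord(ttft_ms=150.0, tpot_ms=140.0)] * 30  # TPOT too high
)

throughput_rpm, goodput_rpm = throughput_and_goodput(records, SLO(), window_minutes=1.0)
print(f"throughput = {throughput_rpm:.0f} RPM, goodput = {goodput_rpm:.0f} RPM")
# prints: throughput = 100 RPM, goodput = 30 RPM
```

Keeping the SLO check in its own method makes it easy to swap in application-specific thresholds, for example a much looser TTFT for a batch summarisation pipeline, without touching the counting logic.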