Inference Latency - Albert Masoliver's learning site

## Definition

**Inference latency** is the time a model takes to respond, and it is not one number. It decomposes into two regimes with different physics: **TTFT** (time to first token) and **TPOT** (time per output token).

## TTFT: the prefill pass

Time to first token measures how long it takes until the model emits its first [[Token]]. It is dominated by the *prefill* pass, in which the model reads the entire prompt in one parallel sweep to build its internal state. TTFT grows with input length: a 50K-token prompt prefills slower than a 2K one. [[Prompt Caching]] attacks TTFT directly by skipping prefill for an unchanged prefix.

## TPOT: the decode loop

Time per output token measures the steady-state speed of generation *after* the first token. Output is produced one token at a time, each step conditioned on all prior tokens, so decode is inherently sequential, unlike prefill. TPOT is roughly constant per token, so total latency is approximately TTFT plus the output length times TPOT:

```
total_latency ≈ TTFT + output_tokens × TPOT
```

## Which metric you care about

| Workload | Dominant metric | Why |
|---|---|---|
| Streaming chat UI | TTFT | Users feel the wait before the first word |
| Coding agent step | TTFT + total | Long prompts, must finish to act |
| Batch / offline | Total time | Nobody watches; throughput rules |

A streaming UI lives or dies on TTFT: a snappy first token feels responsive even if the full answer is slow. A batch pipeline does not care about TTFT at all; it cares about total wall-clock time and throughput.

## Output length drives both axes

The single biggest lever is how much the model *writes*. Output length multiplies TPOT into latency and multiplies directly into cost ([[Per-Token Pricing]]), since output tokens typically cost several times more than input tokens. This is the practical case for [[Structured Outputs]]: emitting values instead of prose cuts output length, which cuts both latency and the bill at once.
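To make the effect of output length concrete, here is a minimal sketch of the latency formula and a simple per-token cost model. The function names, the TTFT and TPOT values, the token counts, and the prices are all hypothetical numbers chosen for illustration; they are not measurements from any real model or provider.

```python
def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency in seconds: TTFT + output_tokens * TPOT."""
    return ttft_s + output_tokens * tpot_s


def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Cost of one request, with prices quoted per million tokens."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    TTFT_S = 0.5          # seconds until the first token
    TPOT_S = 0.02         # seconds per output token
    INPUT_TOKENS = 2_000
    IN_PRICE = 3.0        # assumed price per million input tokens
    OUT_PRICE = 15.0      # assumed price per million output tokens

    # Same prompt, two answer styles: a prose answer vs. a compact structured one.
    for label, out_tokens in [("prose", 400), ("structured", 40)]:
        latency = total_latency(TTFT_S, TPOT_S, out_tokens)
        cost = request_cost(INPUT_TOKENS, out_tokens, IN_PRICE, OUT_PRICE)
        print(f"{label:>10}: {out_tokens:4d} output tokens, "
              f"{latency:.2f} s, cost {cost:.4f}")
```

Under these assumed numbers, TTFT is identical for both requests, so the whole gap in latency comes from the `output_tokens * tpot_s` term, and the gap in cost comes from the output side of the price.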
## Why it's fast at all: the KV cache

Sequential decode would be hopeless if each new token reprocessed the whole sequence from scratch. The [[KV Cache]] stores the attention keys and values for tokens already processed, so each decode step only computes the new token against cached state. The KV cache is the reason TPOT stays small and roughly constant instead of paying for a full re-read of the sequence at every step. Attention still reads the cached entries, so TPOT creeps up on very long contexts, but far more slowly than recomputation would.

## Orchestrator takeaways

- Shorten outputs before shortening inputs: output is the costlier axis.
- Cache stable prefixes (system prompt, spec) to crush TTFT on repeated calls.
- Choose the metric your UX actually exposes; optimizing total time for a streaming UI is misdirected effort.

## Related

- [[Token]]
- [[KV Cache]]
- [[Prompt Caching]]
- [[Per-Token Pricing]]
- [[Structured Outputs]]