## Definition
**Inference latency** is the time a model takes to respond, and it is not one number. It decomposes into two phases with different bottlenecks, each with its own metric: **TTFT** (time to first token) and **TPOT** (time per output token).
## TTFT: the prefill pass
Time to first token measures how long until the model emits its first [[Token]]. This is dominated by the *prefill* pass — the model reading the entire prompt in one parallel sweep to build its internal state. TTFT scales with input length: a 50K-token prompt prefills slower than a 2K one. [[Prompt Caching]] attacks TTFT directly by skipping prefill for an unchanged prefix.
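To make the effect of prefix caching concrete, here is a minimal sketch that models TTFT as a fixed overhead plus time proportional to the uncached prompt tokens. The linear model, the function name `estimate_ttft`, and the default values for `prefill_tokens_per_s` and `overhead_s` are all assumptions chosen for illustration, not measurements of any real system. Real caches also pay a small cost to load the cached prefix.

```python
def estimate_ttft(prompt_tokens: int,
                  cached_prefix_tokens: int = 0,
                  prefill_tokens_per_s: float = 10_000.0,  # assumed prefill speed
                  overhead_s: float = 0.1) -> float:       # assumed fixed overhead
    """Toy TTFT model: fixed overhead + prefill time for the uncached tokens."""
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_prefix_tokens
    return overhead_s + uncached / prefill_tokens_per_s


short = estimate_ttft(2_000)                                    # 2K prompt, cold
long_cold = estimate_ttft(50_000)                               # 50K prompt, cold
long_warm = estimate_ttft(50_000, cached_prefix_tokens=48_000)  # 48K prefix cached
print(f"{short:.2f}s  {long_cold:.2f}s  {long_warm:.2f}s")
# prints "0.30s  5.10s  0.30s"
```

Under these assumed numbers, a 50K-token prompt with a 48K cached prefix behaves like a 2K-token prompt as far as TTFT is concerned.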
## TPOT: the decode loop
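Time per output token measures the steady-state speed of generation *after* the first token. Output is produced one token at a time, each step conditioned on all prior tokens, so decode is inherently sequential, unlike prefill. TPOT is roughly constant per token for a given context size, so total generation time is approximately `TTFT + (output_tokens × TPOT)`. Strictly, the first token is already covered by TTFT, so the multiplier is `output_tokens - 1`; the difference is negligible for all but very short outputs.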
```
total_latency ≈ TTFT + output_tokens × TPOT
```
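The two metrics can be recovered from per-token arrival times of a streamed response. The sketch below assumes you have recorded a request start time and one timestamp per token (for example with `time.perf_counter()`). The helper name `split_latency` and the sample timestamps are hypothetical.

```python
from typing import Sequence, Tuple


def split_latency(request_start: float,
                  token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (ttft, mean_tpot) from a start time and per-token arrival times."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0  # no decode gaps to average
    # Decode time is spread over the gaps between consecutive tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical timestamps in seconds: first token at 0.8 s, then one every 0.05 s.
arrivals = [0.8 + 0.05 * i for i in range(5)]
ttft, tpot = split_latency(0.0, arrivals)

# The first token is already counted in TTFT, so only n - 1 decode gaps remain.
total = arrivals[-1] - 0.0
reconstructed = ttft + (len(arrivals) - 1) * tpot
print(f"TTFT={ttft:.2f}s  TPOT={tpot:.3f}s  total={total:.2f}s")
```

Averaging over the gaps gives a mean TPOT; for tail behavior you would keep the individual gaps and look at percentiles instead.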
## Which metric you care about
| Workload | Dominant metric | Why |
|---|---|---|
| Streaming chat UI | TTFT | Users feel the wait before the first word |
| Coding agent step | TTFT + total | Long prompts, must finish to act |
| Batch / offline | Total time | Nobody watches; throughput rules |
A streaming UI lives or dies on TTFT — a snappy first token feels responsive even if the full answer is slow. A batch pipeline does not care about TTFT at all; it cares about total wall-clock and throughput.
## Output length drives both axes
The single biggest lever is how much the model *writes*. Output length multiplies TPOT into latency and multiplies directly into cost ([[Per-Token Pricing]]) — output tokens typically cost several times more than input tokens. This is the practical case for [[Structured Outputs]]: emitting values instead of prose cuts output length, which cuts both latency and bill at once.
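A quick back-of-the-envelope comparison shows how output length moves both numbers at once. This is a sketch under stated assumptions: the function name `request_estimate`, the timings, and the per-million-token prices are made up for illustration and do not reflect any provider's actual rates.

```python
from typing import Tuple


def request_estimate(input_tokens: int, output_tokens: int,
                     ttft_s: float, tpot_s: float,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float) -> Tuple[float, float]:
    """Return (latency in seconds, cost in currency units) for one request."""
    latency_s = ttft_s + output_tokens * tpot_s
    cost = (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
    return latency_s, cost


# Assumed values: same 2K-token prompt, output priced at 5x input.
common = dict(input_tokens=2_000, ttft_s=0.5, tpot_s=0.02,
              input_price_per_mtok=1.0, output_price_per_mtok=5.0)

prose = request_estimate(output_tokens=400, **common)      # free-form answer
structured = request_estimate(output_tokens=40, **common)  # compact values only

print(f"prose:      {prose[0]:.1f}s, cost {prose[1]:.4f}")
print(f"structured: {structured[0]:.1f}s, cost {structured[1]:.4f}")
```

With these assumed inputs, cutting the output from 400 to 40 tokens drops latency from about 8.5 s to about 1.3 s and cost from 0.0040 to 0.0022, while the input side is unchanged.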
## Why it's fast at all: the KV cache
Sequential decode would be hopeless if each new token re-read the whole sequence from scratch. The [[KV Cache]] stores the attention keys and values for tokens already processed, so each decode step only computes the new token against cached state. The KV cache is the reason TPOT stays small: each step still attends over the whole cache, so per-token cost grows slowly with context length, but nothing already processed is recomputed.
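The saving can be illustrated by counting how many per-token key/value computations are needed with and without a cache. This is a simplified counting sketch, not a model of real hardware: the function name `kv_projections` is hypothetical, the count is per layer, and it ignores the cost of attention reading the cache, which still grows with sequence length.

```python
def kv_projections(prompt_len: int, output_tokens: int, use_cache: bool) -> int:
    """Count per-token key/value computations needed to emit output_tokens."""
    if prompt_len < 1 or output_tokens < 1:
        raise ValueError("prompt_len and output_tokens must be >= 1")
    if use_cache:
        # Prefill computes K/V for the prompt once; every later forward pass
        # adds K/V for just the one most recently generated token.
        return prompt_len + (output_tokens - 1)
    # Without a cache, forward pass k re-processes the prompt plus the
    # k tokens generated so far (k = 0 for the first pass).
    return sum(prompt_len + k for k in range(output_tokens))


cached = kv_projections(1_000, 100, use_cache=True)
uncached = kv_projections(1_000, 100, use_cache=False)
print(cached, uncached)  # prints "1099 104950"
```

In this toy count the uncached total grows quadratically with output length, while the cached total grows linearly.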
## Orchestrator takeaways
- Shorten outputs before shortening inputs — output is the costlier axis.
- Cache stable prefixes (system prompt, spec) to crush TTFT on repeated calls.
- Choose the metric your UX actually exposes; optimizing total time for a streaming UI is misdirected effort.
## Related
- [[Token]]
- [[KV Cache]]
- [[Prompt Caching]]
- [[Per-Token Pricing]]
- [[Structured Outputs]]