Prompt Caching as Pricing Lever - Albert Masoliver's learning site

## Definition **Prompt caching as a pricing lever** is the use of server-side prompt-prefix caching to dramatically reduce the cost of repeated large contexts in a [[Consumption-Based Pricing]] regime. Providers price cache hits at a fraction of standard input (Anthropic: 0.10× — an 85% reduction; OpenAI GPT-5.5: cached input at $0.50/M vs $5/M standard, a 90% reduction) while charging a modest premium on cache writes (1.25× at Anthropic). For workloads with reused system prompts, long instructions, or shared agent context, caching can compress the effective per-token cost by an order of magnitude. ## When Caching Pays Off Caching is a write-once, read-many optimisation with a short TTL (5 minutes is the common default). The break-even analysis depends on hit count: - One write at 1.25× plus $N$ hits at 0.10× equals $N$ uncached reads at 1.00× when $N \approx 2$. Two hits within the TTL window already pays back the write premium. - For agent loops doing tens of iterations against the same scaffold, the saving converges toward the 85–90% theoretical maximum. ## What to Cache Patterns that compound to large reused prefixes: - System prompts and tool descriptions (the "scaffold"). - Few-shot examples that don't vary across calls. - Retrieved documents in a [[Retrieval-Augmented Generation]] flow when the same retrieval result feeds several follow-up turns. - The conversation history during an agent loop (the agent's accumulated context). The cacheable prefix must be **contiguous from the start** of the prompt — fragmenting the prefix breaks cache hits. ## Pricing Trap: Fast Mode A practical edge case worth knowing: Anthropic's Fast Mode (premium-capacity tier) re-prices the entire conversation context at Fast Mode rates if you switch to it mid-conversation. A cached prefix from standard mode becomes uncached input at Fast Mode rates, which can produce surprising bill shocks. Cache-pricing levers and capacity-pricing levers do not compose cleanly. ## Implication for Engineering In a metered-pricing world, caching architecture is a first-class concern, not an afterthought. It shifts the design centre: prompts and tool descriptions are designed to be cacheable, retrieval results are restructured to maximise prefix hits, and agent loops are structured so that growing context is contiguous with the cached prefix rather than interleaved with novel inputs. ## Related - [[Per-Token Pricing]] - [[Consumption-Based Pricing]] - [[Agentic Workload Cost Explosion]] ## Sources - [[Anthropic 2026 Pricing Shift (Kingy AI)]] - [[AI is Getting Expensive (The Register)]]