Prompt Caching - Albert Masoliver's learning site

## Definition **Prompt caching** is an Anthropic API feature that lets you mark a prefix of a prompt as cacheable. Subsequent requests sharing that prefix hit the cache and pay ~10% of the prefix cost. ## Why It Dominates Agentic Cost The prefix in agentic work is almost always **stable**: - `AGENTS.md` rarely changes between turns. - `CLAUDE.md` rarely changes. - Skill definitions don't change mid-session. - The current spec doesn't change. The per-turn delta is small. A 60%+ cache hit rate on a CI reviewer cuts per-PR cost dramatically. ## Two Modes 1. **Automatic caching.** Single `cache_control` at the top level; the system applies the breakpoint to the last cacheable block and moves it forward as conversations grow. 2. **Explicit breakpoints.** Place `cache_control` directly on individual content blocks for fine-grained control. ## Configuration ```python response = client.messages.create( model="claude-sonnet-4-6", system=[{ "type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"} }], messages=[{"role": "user", "content": "..."}], ) # response.usage.cache_read_input_tokens tells you the hit size. ``` ## Technical Details - **Minimum cacheable length.** 1,024 tokens for Sonnet; 4,096 for Opus / Haiku 4.5. Below the floor, caching costs more than it saves. - **TTL.** 5 minutes by default, refreshed on each hit. `"ttl": "1h"` extends to one hour. - **Breakpoints.** Up to 4 explicit per request; automatic caching uses one slot. ## Pricing (relative to base input) | Action | Multiplier | | -------------------- | ---------- | | 5-min cache write | 1.25× | | 1-hour cache write | 2.0× | | Cache read | 0.1× | ## What to Cache, What Not To - **Cache:** large, stable content — `AGENTS.md`, full specs, long skill definitions. - **Don't cache:** small or volatile content — the current diff, user input, today's date. ## Related - [[Token]] - [[Spec-as-Prompt Pattern]] - [[Headless Agent in CI]] - [[Model Selection Strategy]]