Per-Token Pricing - Albert Masoliver's learning site

## Definition

**Per-token pricing** is the dominant unit of [[Consumption-Based Pricing]] for LLM APIs: customers pay separately for input tokens (the prompt sent to the model) and output tokens (the tokens the model generates). Output is typically priced several times higher than input, reflecting both the greater compute cost of autoregressive generation and the inelastic demand for the model's response.

## Canonical Rate Card (Mid-2026)

| Provider / Model | Input ($ / M tokens) | Output ($ / M tokens) | Output/Input ratio |
|---|---|---|---|
| Claude Opus 4.6 | 5 | 25 | 5× |
| Claude Sonnet 4.6 | 3 | 15 | 5× |
| Claude Haiku 4.5 | 1 | 5 | 5× |
| OpenAI GPT-5.5 | 5 | 30 | 6× |

Most frontier providers cluster around a 5–6× output premium and offer a Haiku-class tier roughly 5× cheaper than the flagship.

## Pricing Modifiers

Per-token pricing rarely exists in isolation. Common modifiers stack on the base rate:

- **Cached input**: large reused prompt prefixes can be cached server-side; cache hits cost roughly 0.1× the standard input rate at both Anthropic and OpenAI (for example, GPT-5.5 cached input at $0.50/M against $5/M uncached).
- **Cache writes**: writing to cache costs slightly more than uncached input (Anthropic: 1.25×).
- **Batch API**: asynchronous processing, typically at 0.5× standard rates (Anthropic, OpenAI, and Google all offer ~50% off).
- **Capacity premium**: priority compute at 5–10× standard rates (Anthropic Fast Mode at 6×).

## Why Output is More Expensive

Generating output requires one full forward pass per token, because each new token depends on the tokens before it (sequential, autoregressive generation). Input tokens, by contrast, can be processed in parallel in a single forward pass. The price ratio reflects this asymmetry: serving N output tokens requires N sequential forward passes, while N input tokens require one. See [[Sampling]] and (when written) speculative-decoding atoms for how providers reduce this asymmetry.

## Practical Implications

A customer's effective cost-per-task is dominated by output length and by how much input can be cached.
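
To make the cost-per-task claim concrete, here is a minimal sketch of a per-request cost estimate. The base rates and multipliers are the ones quoted in this note; the model keys, the `task_cost` function, and its parameter names are hypothetical labels for illustration, not real API identifiers. How the modifiers combine (the batch discount applied on top of cache pricing) is also an assumption, so check the provider's current pricing page before relying on it.

```python
# Base rates in USD per million tokens: (input, output), copied from this note's rate card.
# The dictionary keys are illustrative labels, not official API model IDs.
RATES = {
    "claude-opus-4.6": (5.0, 25.0),
    "claude-sonnet-4.6": (3.0, 15.0),
    "claude-haiku-4.5": (1.0, 5.0),
    "gpt-5.5": (5.0, 30.0),
}

CACHE_READ_MULT = 0.1    # cache hit, relative to the standard input rate
CACHE_WRITE_MULT = 1.25  # cache write, relative to the standard input rate (Anthropic figure)
BATCH_MULT = 0.5         # batch discount; applying it to the whole request is an assumption


def task_cost(model, input_tokens, output_tokens,
              cached_tokens=0, cache_write_tokens=0, batch=False):
    """Estimate the USD cost of one request.

    cached_tokens and cache_write_tokens are subsets of input_tokens:
    the part read from cache and the part written to cache, respectively.
    """
    if cached_tokens + cache_write_tokens > input_tokens:
        raise ValueError("cached + cache-write tokens exceed total input tokens")

    in_rate, out_rate = RATES[model]
    plain_tokens = input_tokens - cached_tokens - cache_write_tokens

    # Weight each class of input token by its multiplier, then apply the base rate.
    weighted_input = (plain_tokens
                      + cached_tokens * CACHE_READ_MULT
                      + cache_write_tokens * CACHE_WRITE_MULT)
    total = (weighted_input * in_rate + output_tokens * out_rate) / 1_000_000

    return total * BATCH_MULT if batch else total


# A 20,000-token prompt with a 1,000-token answer, under different levers.
print(f"{task_cost('claude-sonnet-4.6', 20_000, 1_000):.4f}")                        # 0.0750
print(f"{task_cost('claude-sonnet-4.6', 20_000, 1_000, cached_tokens=18_000):.4f}")  # 0.0264
print(f"{task_cost('claude-haiku-4.5', 20_000, 1_000):.4f}")                         # 0.0250
```

In this example, caching 90% of the prompt cuts the Sonnet request from about $0.075 to about $0.026, and switching to the Haiku tier without any caching lands at about $0.025.
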
Engineering levers in priority order: model selection (Haiku vs Opus), aggressive prompt caching, output-length discipline, and batch processing for non-interactive workloads.

## Related

- [[Consumption-Based Pricing]]
- [[Prompt Caching as Pricing Lever]]
- [[Sampling]]

## Sources

- [[Anthropic 2026 Pricing Shift (Kingy AI)]]
- [[AI is Getting Expensive (The Register)]]