# Module 1 — Foundations & The "Thinking" Economy > *"The model is not your colleague. It is a priced reasoning engine you rent > by the token. Treat it accordingly."* --- ## Learning objectives By the end of this module you will be able to: 1. Explain — in terms a teammate will accept — what a token, a parameter, and "the context window" actually are, and why they each cost you something. 2. Choose between **Think**, **Think Harder**, and **Ultrathink** reasoning budgets based on the task, not the vibe. 3. Pick a model (Haiku / Sonnet / Opus) for a given role using a defensible rubric instead of intuition. 4. Diagnose and recover from context-window degradation — the "lobotomy" effect — before it ships a bug. --- ## 1.1 Why this module exists Engineers who treat the model as a magic box pay for it three ways: in latency, in dollars, and — most expensively — in subtle wrongness that ships because nobody ran the numbers on whether the cheap fast model could *actually* hold the whole problem in its head. This module gives you the mechanical vocabulary for everything that follows. Specs (Module 2), agents (Module 6), and memory (Module 5) are all strategies for spending a fixed reasoning budget well. You can't optimize what you can't measure, and you can't measure what you can't name. --- ## 1.2 LLMs for code: the part that actually matters ### Tokens are the unit of everything A **token** is the model's atomic chunk of text. For English code, the rule of thumb is roughly **~4 characters per token**, but punctuation, identifier style, and language all shift this. The practical implications: - A 500-line TypeScript file is ~3,000–5,000 tokens. - A typical "read this repo and propose a refactor" prompt easily reaches 50k–200k tokens. - Pricing, context limits, and latency are all measured in tokens, not lines or files. You can inspect tokenization for any string with the Anthropic SDK: ```python import anthropic client = anthropic.Anthropic() response = client.messages.count_tokens( model="claude-sonnet-4-6", messages=[{"role": "user", "content": "def factorial(n): return 1 if n<=1 else n*factorial(n-1)"}] ) print(response.input_tokens) # ~22 tokens ``` Run this on a representative file from your codebase before you guess what a "summarize this module" prompt is going to cost. ### Parameters and what they actually buy you A **parameter** is a learned weight inside the model. More parameters means more capacity for nuance — at the cost of latency and dollars per token. You will never tune them; you only choose between pre-trained models that have different counts. For the working orchestrator, you only need three pieces of intuition: | Property | Smaller model (Haiku-class) | Larger model (Opus-class) | |---------------------------|-----------------------------|---------------------------| | Token throughput | ~3–5× faster | Slowest | | Cost per million tokens | ~1× | ~10–15× | | Multi-step reasoning | Brittle past 2–3 hops | Holds 6–10+ hops | | Long-context recall | Degrades faster | Holds more of a 200k+ window | | Tool-use planning | Adequate | Strong | ### "Predictive logic" — what the model is actually doing The model is a conditional probability machine over the next token, given everything that came before. Two consequences that matter every single day: 1. **Order matters.** "Here is the spec. Here is the existing code. Now implement X." produces different output than "Now implement X. Here is the spec. Here is the existing code." Put the *intent* near the end. 2. **Recency wins.** Tokens at the end of the context window have more influence on the next output than tokens at the start. This is why long-running sessions drift — the original instructions get drowned out by their own intermediate results. --- ## 1.3 Reasoning budget management Modern frontier models expose a **reasoning budget** — a configurable amount of internal computation the model is allowed to spend *before* it produces a visible answer. In Claude Code, this surfaces as keywords you type in your prompt: | Keyword | Approx. internal tokens | Use when… | |-----------------|-------------------------|------------------------------------------------| | *(none)* | minimal | Mechanical edits, single-file changes, asking for syntax. | | `think` | ~4k | Multi-file refactors, choosing between two clear options. | | `think hard` | ~10k | Architectural choices, debugging non-obvious failures. | | `think harder` | ~32k | Cross-cutting changes, security review, performance regressions. | | `ultrathink` | up to ~64k | True architecture, "is this approach even right?", weighing 3+ options. | ### Heuristic: spend reasoning where you can't easily *un-spend* a mistake Reasoning tokens are cheaper than the wrong commit. Use them when: - The action is **hard to reverse** (migrations, schema changes, public APIs). - The decision **constrains future work** (framework choice, dependency). - You need the model to **weigh trade-offs**, not execute a plan. Don't use them for: - Renaming a variable across one file. - Writing a regex when you already know the pattern. - Translating a function from Python to Go *line by line*. > **Pitfall:** Engineers new to reasoning budgets tend to either always-use or > never-use them. The cost of *always* using `ultrathink` is real: slower > turns, more dollars, and — counterintuitively — *worse* output on simple > tasks, because the model second-guesses itself and rewrites correct code. ### Reading the reasoning trace When you invoke a thinking mode, the model emits an internal trace before its visible answer. Read it. It's the cheapest debugging tool you have: - If the trace **misunderstood your goal**, your prompt is the bug. Fix the prompt, not the answer. - If the trace **chose an option you don't like**, you can interrupt and steer before the model commits. - If the trace is **looping or oscillating**, the problem is under-specified. Stop and give it a constraint that breaks the symmetry. --- ## 1.4 Model selection strategy Most teams pick a model once and use it for everything. This is a mistake on the same order as using a single language for every layer of the stack. ### The three-role rubric Think of your model fleet as having three roles: #### Sonnet — the **Builder** Default for *implementation*. Strong code generation, fast enough to keep a flow state, cheap enough to run all day. When in doubt, this is the model. **Pick Sonnet when:** - You have a clear spec and just need code. - The change touches one to ~five files. - You want streaming output you can interrupt. #### Opus — the **Architect** Reserve for *decisions*. Slower, more expensive, dramatically better at long-range reasoning, cross-file synthesis, and "should we even do this" questions. **Pick Opus when:** - The task asks "what's the right approach?" not "implement this approach." - You're reviewing a PR that touches >10 files or crosses module boundaries. - You're writing a spec (Module 2) that other agents will execute against. - A wrong answer here costs days, not minutes. #### Haiku — the **Scout** The cheap, fast model. Built for **exploration** and **filtering**. **Pick Haiku when:** - You need to search a large corpus for relevance (then hand the hits to Sonnet/Opus). - You're running parallel agents on tasks where individual quality matters less than throughput. - You're building a hook or sub-agent that runs *frequently* and *cheaply* (linters, classifiers, routers). ### Configuring per-task models in Claude Code ```jsonc // .claude/settings.json { "model": "claude-sonnet-4-6", // default "subagents": { "architect": "claude-opus-4-7", // big decisions "scout": "claude-haiku-4-5-20251001" // fan-out search } } ``` Modules 4 and 6 will show you how to wire these to specific agent definitions so the right model picks itself up for the right job. ### The 90/9/1 budget heuristic A working starting point for a typical week of feature work: - **~90%** of token spend → Sonnet (the actual coding). - **~9%** → Opus (planning, review, gnarly bugs). - **~1%** → Haiku (scouting, hooks, classification). If your Opus share creeps past ~20%, you are probably *over-thinking* and under-shipping. If it's below ~3%, you are probably *under-thinking* and shipping bugs that better planning would have caught. --- ## 1.5 The context window challenge ### What "context window" really means Every model has a maximum number of tokens it can attend to in a single turn — typically 200k for frontier models in 2026, with 1M-token tiers available for select workloads. This is not RAM. It's **active attention surface**. Two things degrade as the window fills: 1. **Recall accuracy.** Information dropped into the middle of a 180k-token prompt is *less likely* to be used than the same information at the end. This is the "lost in the middle" effect, and it is real. 2. **Instruction adherence.** Early instructions ("always use TypeScript strict mode") get overwritten by recent intermediate output ("here's the `any` I added to fix a type error"). ### The "lobotomy" effect from compaction When the conversation approaches the limit, the harness summarizes earlier turns to make room. What survives is a *paraphrase* of what happened, not the original. Three failure modes follow: - **Decision drift.** A trade-off you debated three turns ago gets re-debated, because the rationale didn't survive compaction. - **Convention loss.** "We agreed to use `snake_case` for these tables" becomes "we discussed naming." - **Phantom progress.** The summary says "implemented feature X" but the code on disk is incomplete because the model now believes its own summary. Module 5 is dedicated to fixing this. For now, the survival rules: 1. **Anchor durable decisions outside the conversation.** A `DECISIONS.md` file the agent re-reads is immune to compaction. 2. **Restart sooner than you think.** A fresh session with a tight prompt and the right files in context outperforms a 4-hour-old session with summary damage. 3. **Watch the compaction warnings.** When the harness tells you it's about to summarize, that's a signal to checkpoint your state to disk, not to keep pushing. ### Practical: keep the working set small The cheapest context optimization is **not loading what you don't need**. ```bash # Bad: blanket "read everything" claude "look at the project and refactor the auth module" # Better: scope it claude "refactor src/auth/**/*.ts. The relevant test files are in test/auth/. Do not read other modules unless I tell you to." ``` A model with 30k tokens of relevant context and 0k of noise outperforms one with 30k of relevant context and 100k of "just in case." Every irrelevant token is a vote against the right answer. --- ## Lab 1 — Calibrate your fleet **Goal:** Build personal intuition for the cost/quality trade-off across models and reasoning budgets. **Time:** ~45 minutes. 1. Pick a real, modestly-sized refactor from your backlog (something like "extract this 200-line function into three smaller ones, with tests"). 2. Run it **three times**, fresh session each, with no other changes: - `claude --model claude-haiku-4-5-20251001 "<task>"` - `claude --model claude-sonnet-4-6 "<task>"` - `claude --model claude-opus-4-7 "think hard. <task>"` 3. For each run, record: wall-clock time, total tokens, whether the tests pass, and a 1–5 subjective quality score for the *code structure*. 4. Now run a fourth pass: `claude --model claude-sonnet-4-6 "ultrathink. <task>"`. Compare against the Opus run. **What to look for:** - Haiku will usually be fast and *almost* right — the gap is exactly where reasoning budget would have helped. - "Sonnet + ultrathink" often matches or beats "Opus + no thinking" at a fraction of the cost. This is the most important data point in the lab. - The Opus run's *plan* is usually better than its *code*. Note this — it's the basis for Module 6's "Architect-plans, Sonnet-implements" pattern. Keep this table. You will refer back to it when you design your agent fleet in Module 6. --- ## Common pitfalls - **Treating "context window" as memory.** It is attention surface for *one turn*. Anything you want to persist must live in a file or memory store (Module 5). - **Picking the biggest model "to be safe."** Opus on a Haiku-shaped task is slower, more expensive, and often *worse* (over-explains, second-guesses). - **Burning `ultrathink` on prompts the model doesn't have enough context to reason about.** More thinking does not manufacture missing information. Feed it the file first. - **Ignoring the reasoning trace.** The trace is free. Skipping it because "the code looks right" is how subtle bugs ship. --- ## Summary - LLMs are priced reasoning engines measured in tokens. Order and recency shape their output. - Reasoning budgets buy you better thinking on hard problems and waste money on easy ones. Match the budget to the reversibility of the action. - Sonnet builds, Opus decides, Haiku scouts. A working fleet uses all three. - Context windows are not memory. Treat them as a working set you actively curate. --- ## Further reading - Anthropic — *Token counting and pricing* (`docs.anthropic.com/en/api/messages-count-tokens`) - Anthropic — *Extended thinking* documentation - *"Lost in the Middle"* (Liu et al.) — empirical demonstration of context position effects. **Next:** [Module 2 — Advanced Spec-Driven Development](02-spec-driven-development.md)