07-human-in-the-loop - Albert Masoliver's learning site

# Module 7 — The Human-in-the-Loop & Strategic ROI

> *"Tony Stark didn't fly the suit because Jarvis couldn't. He flew it
> because someone has to be accountable for what the suit does."*

---

## Learning objectives

By the end of this module you will be able to:

1. Articulate the **orchestrator role** — what only a human can do — in language that survives a stakeholder conversation.
2. Apply concrete **cost-optimization techniques** (model swapping, prompt caching, hook-based filtering) to cut spend by 80–95% against a naïve setup without losing quality.
3. Define **velocity and ROI metrics** that capture agentic leverage — and know which "metrics" are vanity.
4. Recognize the failure modes of *over-automation* and design the human step back in where it belongs.

---

## 7.1 The orchestrator role — what's your job now?

The most common existential question from engineers in 2026: *if the agents write the code, what do I do?*

The answer is not "less of the same job." It's "a different job, with the same accountability."

### Five things only humans do well (today)

1. **Goal setting.** Translating a business need ("reduce churn") into a technical objective ("ship passwordless login by Q3") is an act of judgment, not generation. Agents propose; humans pick.
2. **Risk acceptance.** "We're going to ship this with a known limitation because the alternative is missing the launch." That sentence has a *name* on it. The agent's name doesn't have legal standing.
3. **Cross-context judgment.** "Marketing is mid-campaign, so even though this is the right refactor, we delay." The agent doesn't know about marketing.
4. **Adversarial reasoning under ambiguity.** When the spec, the data, and the stakeholder all disagree, deciding which to believe is a human call. Agents will pick whichever the prompt biased them toward.
5. **Final approval.** Pressing merge. Hitting deploy. Saying "ship it." This is not a technical act; it is a *commitment* on behalf of an organization.

### The Tony Stark / Jarvis model

A useful mental model:

- **Jarvis** (the agents) handles the throughput — reading, writing, testing, reviewing, suggesting. Jarvis can run at any hour, in parallel, without fatigue.
- **Tony** (you) handles the direction — the goal, the trade-offs, the approval, the call when something doesn't smell right.

This is not a hierarchy of intelligence. It's a partition of *accountability*. Jarvis cannot accept risk on Tony's behalf. The moment Tony tries to outsource that — by approving without reviewing, or shipping what the agents say is "ready" without judgment — the whole arrangement breaks.

### Your weekly shape, post-orchestration

If you're doing this right, a week looks roughly like:

| Activity                                      | Share | Notes                                              |
|-----------------------------------------------|-------|----------------------------------------------------|
| Writing specs and acceptance criteria         | ~25%  | The most leveraged thing you do.                   |
| Reviewing agent output (code, plans, reviews) | ~30%  | Including the *reviewer agent's* reviews.          |
| Tuning the system (agents, hooks, memory)     | ~10%  | Less than you'd think; mostly Mondays.             |
| Strategic / cross-team conversations          | ~20%  | More than you'd think; this is where you live now. |
| Direct hands-on coding                        | ~15%  | The hard, novel, or critical bits. Not gone.       |

The numbers vary. The shape doesn't.

> **Pitfall:** the engineers most at risk in this transition are not the
> ones who can't code. They're the ones whose job *was* code review and code
> review *only*. If your value was "I read the diff carefully," the agents
> read it more carefully than you do. The escape is moving up the stack —
> into specs, architecture, accountability — not out of it.

---

## 7.2 Cost optimization

Agentic workflows can be cheap or ruinously expensive depending on a few load-bearing choices. The good news: the optimizations don't trade off quality. They mostly amount to *not paying for waste*.

### The 90% number

A well-tuned setup commonly cuts token spend by 80–95% versus a naïve "Opus for everything" workflow. Three levers do most of the work:

1. **Right-sized model selection** (~50% savings).
2. **Prompt caching** (~40% additional, on cache hits).
3. **Hook-based filtering** of cheap, repetitive work to deterministic code (~10% additional).

### Lever 1 — Strategic model swapping

The Module 1 fleet, applied with discipline:

- **Default to Sonnet.** Most coding tasks belong here.
- **Use Opus only where reversibility is low.** Architecture, schema changes, security review. ≤10% of token spend.
- **Push routine classification, search, and routing to Haiku.** This includes the *first pass* of a reviewer pipeline, where Haiku surfaces candidates and Sonnet does the real check.

A concrete example from a real review pipeline:

```
Naïve:    every PR reviewed by Opus
          → ~120k tokens × $ × every PR

Better:   Haiku classifies (security-relevant? schema? otherwise?)
          ├─ Otherwise       → Sonnet review
          └─ Schema/security → Opus review
          → ~15k tokens of Haiku × every PR
            + ~120k Opus × ~10% of PRs
            + ~80k Sonnet × ~90% of PRs

Net:      ~70% cost reduction, same or better catch rate.
```

These numbers are workload-dependent. The principle is universal: *classify cheap, decide expensive*.

### Lever 2 — Prompt caching

Anthropic's prompt caching lets you mark a prefix of your prompt as cacheable. Subsequent requests with the same prefix, made while the cache entry is still live, hit the cache and pay ~10% of the normal input price for that prefix. For agentic work this is huge, because **the prefix is almost always stable**:

- `AGENTS.md` rarely changes between turns.
- `CLAUDE.md` rarely changes.
- The skill definitions don't change mid-session.
- The current spec doesn't change.
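
To put rough numbers on why a stable prefix matters, here is a minimal sketch of the arithmetic, not a pricing calculator. It takes the ~10% cache-read figure from above and assumes the first request pays a cache-write premium, set here to 1.25× the normal input price; that multiplier, the function names, and the parameters are all illustrative assumptions to check against current pricing.

```python
def cached_prefix_cost(requests: int, read_mult: float = 0.10,
                       write_mult: float = 1.25) -> float:
    """Cost of sending one stable prefix `requests` times.

    The unit is "one uncached send of the prefix". The first request
    writes the cache (assumed premium: write_mult); every later request
    is assumed to arrive while the entry is still live and pays read_mult.
    """
    if requests < 1:
        raise ValueError("requests must be >= 1")
    return write_mult + (requests - 1) * read_mult


def prefix_savings(requests: int, read_mult: float = 0.10,
                   write_mult: float = 1.25) -> float:
    """Fraction of prefix spend saved versus never caching (negative = loss)."""
    cached = cached_prefix_cost(requests, read_mult, write_mult)
    return 1.0 - cached / requests


if __name__ == "__main__":
    for n in (1, 2, 20):
        print(f"{n:>2} requests: {prefix_savings(n):+.1%} of prefix spend saved")
```

Under these assumptions a prefix that is sent only once costs about 25% more than not caching at all, while one reused across twenty requests saves roughly 84% of the prefix spend. That is the reason the stability of the prefix, not its size alone, decides whether caching pays off.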

A typical Python example using the official SDK:

```python
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stable prefix: project instructions that rarely change between turns.
system_prompt = "\n\n".join(
    Path(name).read_text(encoding="utf-8") for name in ("AGENTS.md", "CLAUDE.md")
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Refactor src/payments.ts ..."}],
)

# response.usage.cache_creation_input_tokens / cache_read_input_tokens
# tell you whether you hit.
```

In Claude Code, caching is on by default for stable session prefixes; you mostly need to be aware of it when *building your own tooling*. For headless CI reviews (Module 6), structuring the prompt so the spec and `AGENTS.md` are at the top — and the per-PR diff is at the end — makes the difference between paying full price per PR and paying ~10% for the shared prefix.

> **What to cache:** large, stable content (`AGENTS.md`, full specs,
> long skill definitions).
> **What not to cache:** small or volatile content (the current diff,
> user input, today's date). Cacheable prefixes have a minimum length;
> below it the request is processed without caching, and content that
> changes every request pays for cache writes it never reads back.

### Lever 3 — Hook-based filtering

Every cycle a hook handles is a cycle the model didn't pay for. Common wins:

- Linters, formatters, type checkers run by hooks (Module 4) — the agent stops trying to "remember" to run them.
- Secret scanning, dangerous-command refusal — refused at the hook, agent never burns tokens reasoning about whether to do them.
- Auto-format on file write — the agent doesn't need to know your Prettier config.

The pattern: **deterministic problems get deterministic solutions**. Don't spend reasoning tokens on regex.

### Lever 4 — Smaller, fresher sessions

A 6-hour session has paid for the same context six times in compactions. Two cheaper alternatives:

- **End sessions at task boundaries.** Each new task is a fresh, cheap context.
- **Externalize state to memory** (Module 5) so a fresh session catches up in a single load instead of being mid-conversation.

### Measuring spend honestly

Your CLI's `/cost` and your provider's usage dashboard are the only honest sources. Two metrics worth tracking weekly:

- **Tokens per merged PR.** Watch the trend, not the absolute number. It should fall over time as your `AGENTS.md`, skills, and pipelines mature. If it's rising, something is mis-tuned (probably caching).
- **Cache hit rate.** For agentic CLIs that surface it, you want ≥60% on steady-state work. Below 30% means your prefix is unstable or you're starting too many sessions.

---

## 7.3 Measuring velocity

Old velocity metrics (story points, PRs per week, lines changed) measure the wrong thing twice. The first time, because they always did. The second time, because agents can trivially run them up.

### Metrics that survive agentic workflows

| Metric                                  | Why it survives |
|-----------------------------------------|-----------------|
| **Lead time for changes (spec → prod)** | Whether you wrote the code or an agent did, the clock starts when someone wants something and stops when it ships. |
| **Change failure rate**                 | Bugs that reach production. Agents that ship sloppy work show up here. |
| **Mean time to restore (MTTR)**         | When something breaks, how fast do you fix it? Agents help here too. |
| **Cost per shipped change**             | Token spend per merged PR. New, but the right shape. |
| **Specs / decisions per engineer**      | The new throughput metric. You produce *direction*; the agents produce *code*. |

The first three are from DORA and predate AI; they remain the best macro indicators of a healthy delivery pipeline. The last two are new and specific to the agentic shape.

### Vanity metrics to ignore

- **Lines of code (any direction).** Agents will happily produce 10× the lines for the same job, or refactor 10× more aggressively. Neither tells you anything.
- **PR count.** An agent can split work into arbitrarily many small PRs.
- **AI-assisted code percentage.** Almost all of it is AI-assisted now. The question is *quality*, not *origin*.
- **"Time saved" estimates.** Universally inflated and impossible to baseline. Use lead-time instead.

### A working scorecard

A monthly dashboard worth maintaining:

```
Modern AI Software Engineering — Team scorecard, May 2026
─────────────────────────────────────────────────────────
Lead time for changes (median):      2.1 days      (Apr: 2.8)          ↓
Change failure rate:                 7.4 %         (Apr: 8.1)          ↓
MTTR:                                49 min        (Apr: 62)           ↓
Cost per merged PR (tokens / $):     71k / $0.34   (Apr: 88k / $0.45)  ↓
PRs reviewed by agents (1st pass):   92 %          (Apr: 78%)          ↑
Cache hit rate:                      67 %          (Apr: 51%)          ↑
Specs/decisions authored:            34            (Apr: 22)           ↑
```

Trends matter more than absolutes. Pick three of these, publish them, and let the team optimize against them.

---

## 7.4 When to *remove* an agent

A discipline that's underdiscussed: **decommissioning automation that isn't earning its place.**

Symptoms an agent has overstayed its welcome:

- **Humans always override its verdict.** A critic that's overruled 80% of the time is a tax, not a filter. Retire or retune.
- **It never finds anything.** A security auditor that has passed every PR for two months is either perfect or asleep. Test it; bet on the latter.
- **It introduces a new failure mode.** A release manager that occasionally publishes broken versions is worse than no release manager. Roll back the automation; fix the spec.
- **It's a single point of dependency on a brittle prompt.** When the underlying model changes, brittle prompts break. If the agent's behavior changes meaningfully when you switch models, that's a tell.

The cure is usually not "delete the agent." It's "tighten the spec the agent works against," then re-evaluate. Module 2 again, every time.

---

## 7.5 The path forward

Three propositions to take into the rest of your career:

### 1. Specs are the durable artifact

The code will be rewritten. The framework will go out of style. The agents will be replaced. The *specifications* — the contracts you wrote about what the system must do — are the longest-lived thing your team produces. Invest in them like infrastructure, because they are.

### 2. Orchestration is a skill, not a phase

Reading this course will not make you an orchestrator. Running the labs will help. The thing that actually does it is the next six months of deliberate practice — agents catching your blind spots, you catching theirs, both of you improving the system together.

### 3. The human still ships it

Every diff you merge has your name on it. Not the agent's. Not the provider's. Yours. The agents are extraordinary force multipliers, and the multiplication is on you. Wield them accordingly.

---

## Lab 7 — Define your scorecard, then defend it

**Goal:** turn this course's principles into a measurable program you can sustain.

**Time:** ~90 minutes, then continuous.

1. **Pick three metrics from §7.3** that fit your team's reality. Resist picking five.
2. **Define how each is measured.** Where does the data come from? Who pulls it? What's the cadence (weekly is usually right)?
3. **Write a one-page memo** to your manager / team / yourself explaining:
   - The current baseline.
   - The target three months out.
   - The mechanisms (agent rollout, spec discipline, cost levers) you expect to move the numbers.
4. **Calendar a review** for one month from now. Honestly evaluate what moved, what didn't, and which of the mechanisms you guessed wrong about.
5. **Adjust.** Drop a metric that didn't matter. Add a mechanism that did.

**What to look for:** the first month, the numbers will move because the *observation* changed behavior. The second and third months, the numbers either keep moving or they don't. That's where the actual signal is.

---

## Common pitfalls

- **Confusing token cost with total cost.** Token cost is real. Engineering time is realer.
  Optimize tokens *after* you've optimized the workflow.
- **Bragging about lines of code (in any direction).** Don't.
- **Treating agentic ROI as a one-time investment.** Specs decay, agents drift, prompts go stale. The maintenance is the work.
- **Approving without reading.** The single biggest risk of agentic workflows. Set a personal rule: any merge that affects production gets five minutes of your eyes on the diff. *Every* time.

---

## Summary

- The orchestrator's job is goals, risk acceptance, judgment, and final approval. Not less work; different work.
- Cost optimization is mostly: right model, prompt caching, hooks, fresh sessions. Expect 80–95% savings against a naïve setup, at equal or better quality.
- Track lead time, change failure rate, MTTR, cost per merged PR, and the number of specs/decisions authored. Ignore the rest.
- Retire automation that isn't earning its place. Tighten the spec first; delete the agent second.
- The agents make the code. You make the decision to ship it. That has not changed.

---

## Further reading

- *Accelerate* (Forsgren, Humble, Kim) — origin of the DORA metrics; still the best treatment of velocity that holds up.
- Anthropic — *Prompt caching* documentation and pricing.
- Andrej Karpathy's writing on "vibe coding" and human-in-the-loop — useful counterpoints to over-automation enthusiasm.

---

## Where to go from here

You've reached the end of the course. The next move is *practice*. Pick one module's lab, do it this week, then the next module's the week after. By the time you've worked through all seven, you'll have a configured, governed, instrumented agentic workflow — and the framework to keep it sharp as the tools (inevitably) change underneath you.

The agents will keep getting better. So should you.