06-meta-agent-factory - Albert Masoliver's learning site

# Module 6 — The Meta-Agent Factory & Verification Frontier > *"One agent is a tool. Two agents reviewing each other are a process. The > orchestrator's job is to design the process so the bugs find each other > before the humans do."* --- ## Learning objectives By the end of this module you will be able to: 1. Design **role-based specialized agents** — Architect, Refactor, Debug, Security Auditor, Release Manager — each with the right prompt, model, and tool surface. 2. Tune agents for **deterministic execution** vs **creative exploration** and know which jobs need which. 3. Compose agents into **sequential pipelines**, **parallel fan-outs** (via git worktrees), and **reviewer loops** that catch issues the builder missed. 4. Apply the **"agents verifying agents"** philosophy: agents check each other's work against the spec, the regression suite, and even the rendered UI. 5. Wire headless agents into **CI/CD** as quality gates before code reaches a human reviewer. --- ## 6.1 Why specialize? A single general-purpose agent that does everything is the LLM equivalent of a 10,000-line `main.py`. It works, sort of, but: - It can't be tuned for the job at hand — defensive on architecture, laser-focused on bug fixes. - It can't review itself with any independence — the same prompt produced the code; that prompt won't catch its own bias. - It can't be parallelized — every task waits for the same head. Specialization fixes all three. A roster of focused agents lets you match the right tool to the right task, run independent passes for verification, and parallelize across tasks that don't depend on each other. This is the Module 1 model-selection rubric, extended one layer up: not just *which model*, but *which role*. --- ## 6.2 Role-based agent design Every specialized agent has the same four ingredients: 1. **A system prompt** — its identity and operating principles. 2. **A model** — sized to the task. 3. **A tool surface** — the smallest set of tools that lets it do its job. 4. **An invocation surface** — when (and how) other agents or humans summon it. In Claude Code, agents live in `.claude/agents/<name>.md`: ```markdown --- name: architect description: | Use for cross-module changes, public API design, schema decisions, and any "what's the right approach?" question. Reads, plans, and proposes. Never edits code directly. model: claude-opus-4-7 tools: Read, Grep, Glob, Bash(git log:*), Bash(git diff:*), WebSearch --- You are the Architect. Your job is to *decide*, not to execute. When invoked: 1. Read enough of the codebase to ground your proposal in reality, not in generalities. 2. Produce a plan with: (a) the chosen approach, (b) at least one alternative you rejected and why, (c) explicit risks, (d) files that will be touched, (e) acceptance criteria. 3. If the question is under-specified, return clarifying questions — do not invent answers. You do not edit code. You do not run mutating commands. Your output is a markdown plan that another agent (or a human) will execute. When ending: explicitly hand off ("ready for builder") or stop ("needs human decision on X"). ``` ### A working roster Five agents will cover most teams' needs. You will not build them all on day one; build them as you need them. | Agent | Model | Tools | Primary job | |-------------------|-----------|------------------------------------|--------------------------------------| | **Architect** | Opus | Read-only + Web | Plans, decisions, public-API design | | **Builder** | Sonnet | Read, Write, Edit, Bash(test/build)| Implements an approved plan | | **Refactor** | Sonnet | Read, Write, Edit, LSP | Mechanical, structure-preserving change | | **Debugger** | Sonnet (think hard) | Read, Bash(test:*, log:*), git history | Reproduces, isolates, proposes fixes | | **Security Auditor** | Opus | Read-only, semgrep, dependency CLIs | Reviews diffs for security regressions | | **Reviewer** | Sonnet (or Opus for high-stakes) | Read, Bash(test:*) | Reviews diff against the spec | | **Release Manager** | Sonnet | Read, Write, Bash(git, npm version, gh) | Cuts releases via the release skill | ### How specialization changes your prompts Without specialization: > *"Refactor `src/payments.ts`. Make sure it's well-tested. Also check > for security issues. Then write a changelog entry."* With specialization: > *"Invoke Architect: plan the refactor of `src/payments.ts` to split > external-API and persistence concerns. Then invoke Builder with the > plan. Then invoke Security Auditor on the diff."* The second prompt is shorter, more reviewable, and far less likely to silently drop one of the goals. ### The "tools, not personality" principle Specialization is **about constrained tool surfaces**, not about cute personas. Don't write *"You are a senior engineer with 20 years of experience…"* — the model is what it is. Write *"You may use Read and Grep. You may not call Write or Bash. Your output must be a plan."* The constraints do the work. --- ## 6.3 Deterministic vs creative agents ### Two failure modes from the same setting - **A creative agent on a deterministic job.** A `release-manager` that improvises about which files to update produces broken releases. - **A deterministic agent on a creative job.** An `architect` that won't consider alternatives produces local optima dressed up as plans. Tune accordingly: | Property | Deterministic agent | Creative agent | |---------------------|----------------------------------|-----------------------------------| | Model | Sonnet or Haiku | Opus | | Reasoning budget | none / `think` (low) | `think harder` / `ultrathink` | | Temperature * | low (0.0–0.2) | higher (0.7–1.0) | | Prompt style | Numbered, "if X then Y" steps | Open: "weigh the alternatives" | | Output format | Strict JSON / fixed template | Markdown with sections | | Tools | Few, narrow | Read-heavy, web-search OK | *Temperature is the SDK knob; CLIs typically expose it indirectly via mode or per-agent config. ### Examples A **release-manager** agent should be deterministic: ```markdown --- name: release-manager model: claude-sonnet-4-6 tools: Read, Edit, Bash(npm version:*), Bash(git tag:*), Bash(git commit:*) --- Execute the release checklist exactly: 1. Verify clean working tree. If dirty: STOP, report. 2. Compute next version per semver: `npm version <major|minor|patch>`. 3. Update `CHANGELOG.md` using the existing template. 4. Commit with message `chore(release): vX.Y.Z`. 5. Tag `vX.Y.Z`. 6. STOP. Do not push. Report what was done. Deviation from these steps is a bug. If you cannot complete a step, report and stop. Do not improvise. ``` An **architect** agent should be more open: ```markdown --- name: architect model: claude-opus-4-7 tools: Read, Grep, Glob, WebSearch --- ultrathink. When invoked, weigh at least two distinct approaches and explain the trade-offs concretely (latency, complexity, blast radius, team familiarity). Recommend one. Acknowledge what would change your mind. Your plan is a document, not a command. Make it good enough that the team would approve it in review without you in the room. ``` > **Heuristic:** the cost of an agent making a wrong creative call is high > *once* (one bad architectural decision). The cost of a deterministic > agent making a wrong call is high *every time it runs*. Tune for the > downside. --- ## 6.4 Agent routing strategies How do multiple agents *interact*? Three patterns dominate. ### Sequential pipelines: Plan → Implement → Verify The default. Each phase hands off to the next. ``` [Human] │ "Add passwordless login per openspec/2026-05-add-passwordless-login" ▼ [Architect] → produces plan.md │ ▼ [Builder] → implements per plan, runs tests │ ▼ [Reviewer] → diffs vs spec, flags gaps │ ▼ [Human] → approve / push back ``` In Claude Code, the parent agent can delegate via sub-agent calls: ``` You: Implement openspec/changes/2026-05-add-passwordless-login. Coordinator (you, in the CLI): → Use Agent(architect) to produce a plan. → Read the plan. → Use Agent(builder) with the plan and the spec. → Use Agent(reviewer) on the resulting diff. → Surface the reviewer's verdict to me. ``` The pipeline pays for itself the first time the reviewer catches the builder leaving a debug `console.log` in production code. ### Parallel agents via git worktrees When tasks are independent, run them in parallel — and use git worktrees to keep them out of each other's way. ```bash # Create worktrees for three parallel experiments git worktree add ../app-experiment-a -b experiment/a git worktree add ../app-experiment-b -b experiment/b git worktree add ../app-experiment-c -b experiment/c # Launch one agent per worktree (use separate panes / tabs / screens) ( cd ../app-experiment-a && claude "implement approach A from plan.md" ) & ( cd ../app-experiment-b && claude "implement approach B from plan.md" ) & ( cd ../app-experiment-c && claude "implement approach C from plan.md" ) & wait ``` Three benefits: 1. **Isolation.** Each agent has its own working tree; no file conflicts. 2. **A/B testing of approaches.** Compare three implementations against the same spec. 3. **Time leverage.** Three 20-minute runs in 20 minutes total, not 60. When to use: - True architectural alternatives where comparison is the point. - A spec where you genuinely don't know which approach is best. - High-stakes changes where redundancy is cheap insurance. When *not* to use: - Sequential dependencies (front-end needs the back-end shape first). - Tasks where the right answer is obvious (you're just burning tokens). - Anything touching shared external state (databases, queues) — the isolation is at the filesystem level, not the database level. ### Reviewer loops and the Builder–Critic pattern The single highest-value composition: a **Critic** agent reviews the **Builder**'s diff and can demand revisions before the work reaches you. ``` [Builder] → diff v1 → [Critic] │ ├── pass → [Human] └── fail → back to [Builder] with reasons ``` The Critic agent's job is to be **rigorous and slightly adversarial**. Its prompt: ```markdown --- name: critic model: claude-sonnet-4-6 tools: Read, Bash(git diff:*), Bash(npm test:*), Bash(npm run typecheck) --- You are reviewing a diff against a spec. Be rigorous. For each acceptance criterion in the spec: - Locate the code that satisfies it. - Identify at least one input where the code might fail. - Either confirm satisfied (with file:line evidence) or report unsatisfied. Also check for: - Tests added for each AC. Missing tests = unsatisfied. - Type errors, lint warnings, console.log left behind, TODOs. - Spec drift: code does something the spec doesn't authorize. Output is a markdown report ending with: PASS or NEEDS_REVISION. Be conservative — when in doubt, NEEDS_REVISION. The human will ultimately decide; your job is to surface, not to gatekeep. ``` Two failure modes to guard against: - **Sycophantic critic.** A Critic prompted vaguely produces "looks good to me." Always demand evidence (file:line, test names) for any PASS. - **Infinite loop.** Builder produces v2, Critic flags new issues, repeat. Cap revisions at 3 and require human input on the 4th cycle. The agents agreeing for the wrong reasons is worse than them disagreeing. ### Consensus and quorum patterns For high-stakes calls, run **N independent reviewers** and require consensus. This catches single-agent bias: ``` [Builder] → diff → [Reviewer-A (Sonnet)] → vote → [Reviewer-B (Opus, ultrathink)] → vote → [Security Auditor (Opus)] → vote If 2+ pass → [Human] If <2 pass → back to [Builder] with merged feedback ``` This is overkill for most changes. Reach for it on: schema migrations, authentication, billing logic, anything you'd page someone for at 3am. --- ## 6.5 Agents verifying agents — the philosophy The deeper move underneath every pattern above: **the same intelligence that wrote the code can verify it — if it's reading it cold, against an independent contract.** Three properties make this work in practice: 1. **Independence.** The verifier must not have seen the builder's intermediate reasoning. Pass it the diff and the spec, nothing else. In Claude Code, this is a fresh sub-agent invocation, not a continuation. 2. **A contract.** Without the spec (Module 2), the verifier has nothing to verify against; it falls back to taste, which is what produced the code in the first place. 3. **Asymmetric prompting.** The builder's prompt is "make this work." The verifier's prompt is "find where this fails." Different prompts bias toward different outputs. ### Validating specs and detecting drift The simplest verifier doesn't even read the code — it reads the spec and the test output, and asks: *does the test prove this AC?* ``` For each AC in the spec: - Locate the test that claims to cover it. - Read the test code. - Confirm: does this test actually fail when the AC is violated? - If the test is a tautology (e.g., asserts the function returns what the function returned), report it as INSUFFICIENT. ``` This catches the most common silent failure mode in AI-generated code: tests that pass because they don't actually test anything. ### Automated regression detection The Critic agent's most valuable side effect: it can run the existing test suite before and after the change and report any new failures. A pre- commit hook (Module 4) can enforce this: ```bash # .claude/hooks/pre-commit-regression-check.sh set -e BASELINE=$(git stash create) # snapshot current git stash apply npm test --silent && BASELINE_GREEN=1 # before change git stash drop # Run again with current working tree npm test --silent && CURRENT_GREEN=1 if [ "$BASELINE_GREEN" = "1" ] && [ -z "$CURRENT_GREEN" ]; then echo "REGRESSION: tests passed before this change, fail after." >&2 exit 2 fi ``` ### Visual UI verification For frontend changes, the verifier should *see the rendered output*. Modern agents can drive Chrome (via Playwright or a dedicated MCP server) and return screenshots: ```markdown --- name: ui-verifier model: claude-sonnet-4-6 tools: Read, Bash(npx playwright:*), chrome-mcp --- For each UI change in the diff: 1. Identify the affected route(s). 2. Use chrome-mcp to load each route in three viewports (mobile 375px, tablet 768px, desktop 1440px). 3. Capture screenshots. 4. Inspect for: clipped text, overflow, missing fonts, broken contrast, console errors. 5. Report any issues with screenshot reference. ``` A failing screenshot is more convincing than a passing unit test. ### AI-assisted TDD A useful loop: 1. Human writes the spec (Module 2). 2. **Test-writer agent** produces failing tests from the spec. 3. Human reviews the tests (this is the cheap, important review). 4. **Builder agent** implements until the tests pass. 5. **Critic agent** verifies the tests actually exercise the AC. The order matters. Tests written *after* the code tend to match the code, not the spec. Tests written *first*, from the spec, drive the implementation toward the contract. --- ## 6.6 CI/CD integration: headless agents as quality gates The same agents you use interactively can run in CI in **headless mode**. In Claude Code: ```bash claude -p "$(cat .github/prompts/pr-review.md)" \ --output-format json \ --max-turns 10 \ > pr-review.json ``` The `-p` flag (print mode) runs non-interactively and exits when done. `--output-format json` gives you machine-readable output. `--max-turns` caps the agent's wandering. ### A GitHub Actions example ```yaml # .github/workflows/agentic-review.yml name: Agentic review on: pull_request: types: [opened, synchronize] jobs: review: runs-on: ubuntu-latest permissions: pull-requests: write contents: read steps: - uses: actions/checkout@v4 with: { fetch-depth: 0 } - name: Install Claude Code run: npm install -g @anthropic-ai/claude-code - name: Run reviewer agent env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} run: | git fetch origin "${{ github.base_ref }}" claude -p "@.claude/agents/critic.md The diff is git diff origin/${{ github.base_ref }}...HEAD. The spec is openspec/changes/$(git log -1 --pretty=%s | grep -oP 'changes/\K[^ ]+')/proposal.md. Output a markdown review ending with PASS or NEEDS_REVISION." \ --max-turns 6 \ --output-format text > review.md - name: Post review as PR comment uses: peter-evans/create-or-update-comment@v4 with: issue-number: ${{ github.event.pull_request.number }} body-file: review.md - name: Fail if reviewer says NEEDS_REVISION run: | if grep -q "NEEDS_REVISION" review.md; then echo "::error::Agentic reviewer requested revisions" exit 1 fi ``` What this buys you: - Every PR gets a structured review against its own spec, posted as a comment. - The reviewer can block merges automatically when issues are found. - Humans review the *review*, not the raw code, on the first pass. ### Cost shape and rate limits Headless agents in CI run on every PR push. Three controls keep the bill sane: 1. **Use Sonnet, not Opus**, for routine reviews. Reserve Opus for labeled-as-critical PRs. 2. **Cap `--max-turns`.** 6–10 is usually right; more usually means the agent is thrashing. 3. **Cache the spec, not the code.** Anthropic's prompt caching (Module 7) is dramatic here — the spec and `AGENTS.md` rarely change between PRs, so a cache hit on the first ~10k tokens of every review saves real money. --- ## 6.7 The frontier: verifier independence and adversarial agents Two emerging patterns worth your attention. ### Independence as a property to engineer The more independent your verifier is from your builder, the more informative its verdicts. Independence axes: - **Different model family.** Sonnet builder reviewed by Opus verifier. - **Different reasoning trace.** Verifier never sees the builder's plan. - **Different inputs.** Verifier gets only diff + spec, not the conversation that produced them. - **Different tools.** Verifier can run tests; builder might not have. The cheapest move is the "different reasoning trace" axis: always run the verifier as a fresh sub-agent invocation, not a continuation of the builder's session. ### Adversarial agents For security-sensitive code, an explicitly adversarial agent — prompted to *find ways to break* the implementation — can surface bugs no collaborative review would: ```markdown --- name: red-team model: claude-opus-4-7 tools: Read, Bash(curl:*), Bash(npm test:*) --- ultrathink. You are a security researcher with a grudge. The code in this diff is going to production. Find at least three concrete attacks: each must be a specific input and the resulting incorrect behavior. Cite line numbers. Do not propose fixes — your job is to break. After three, also consider: - What about side channels (timing, error messages)? - What does this trust that it shouldn't? - What can an authenticated user do that they shouldn't? ``` Use sparingly. Adversarial agents are expensive (Opus + ultrathink) and their output is unpleasant to read. They earn their keep on auth, billing, data export, and any code that runs unauthenticated. --- ## Lab 6 — Stand up a three-agent pipeline **Goal:** ship a feature through a Plan → Build → Review pipeline of specialized agents, with the reviewer's verdict gating completion. **Time:** ~2 hours. 1. **Define three agents** in `.claude/agents/`: - `architect.md` — Opus, read-only, produces plan. - `builder.md` — Sonnet, read/write/bash(test), implements plan. - `critic.md` — Sonnet, read + bash(test), reviews diff against spec. 2. **Pick a real change** with a spec (use the OpenSpec proposal from Lab 2 if you did it). 3. **Run the pipeline manually** by invoking each sub-agent in turn from a coordinator session. Capture the artifacts (plan, diff, review). 4. **Iterate.** When the critic returns NEEDS_REVISION, re-run the builder with the critic's feedback as additional context. Cap at 3 loops. 5. **Promote to CI.** Add a minimal GitHub Action that runs the critic on every PR and posts a comment. Use the workflow in §6.6 as a starting point. **What to look for:** the *quality* of the critic's feedback. If it's generic, your spec is too vague (Module 2 problem) or its prompt is too permissive. Strengthen one of those, not the critic itself. --- ## Common pitfalls - **Agents that drift.** A "builder" that starts giving architectural opinions is mis-prompted. Tighten the tool surface and rewrite the system prompt to make creative output unrewarded. - **Critics that say PASS too easily.** Demand evidence. A PASS without a file:line citation is just vibes. - **Pipelines that never converge.** Build → Review → Build → Review with no progress means the spec is wrong, not the agents. Stop and rewrite the spec. - **Treating sub-agents as cheap.** A 5-agent pipeline costs roughly 5× the tokens of a single agent. Reserve the complex pipelines for the changes whose blast radius justifies them. - **Skipping the human at the end.** The agents catch most of the bugs. *Most* is not all. The human approval step is non-negotiable for anything that ships. --- ## Summary - Specialized agents are constrained tool surfaces with focused prompts. Personalities are decoration; constraints do the work. - Tune the reasoning budget and the prompt to the job — deterministic for execution, creative for decisions. - Sequential pipelines (Plan → Build → Review), parallel fan-outs (via worktrees), and reviewer loops cover most orchestration patterns. - The "agents verifying agents" philosophy works because the verifier is *independent* and has a *contract*. Without both, you have ceremony, not verification. - Headless agents in CI are how this scales — every PR gets a structured review against its own spec. --- ## Further reading - *Claude Code — Sub-agents* documentation. - *Git worktrees* — the official Git documentation; short and worth re-reading. - *Builder–Critic* and *Generator–Discriminator* patterns in the broader multi-agent literature. - Anthropic — *Prompt caching* (becomes load-bearing in Module 7). **Next:** [Module 7 — The Human-in-the-Loop & Strategic ROI](07-human-in-the-loop.md)