# Module 6 — The Meta-Agent Factory & Verification Frontier
> *"One agent is a tool. Two agents reviewing each other are a process. The
> orchestrator's job is to design the process so the bugs find each other
> before the humans do."*
---
## Learning objectives
By the end of this module you will be able to:
1. Design **role-based specialized agents** — Architect, Refactor, Debug,
Security Auditor, Release Manager — each with the right prompt, model,
and tool surface.
2. Tune agents for **deterministic execution** vs **creative exploration**
and know which jobs need which.
3. Compose agents into **sequential pipelines**, **parallel fan-outs** (via
git worktrees), and **reviewer loops** that catch issues the builder
missed.
4. Apply the **"agents verifying agents"** philosophy: agents check each
other's work against the spec, the regression suite, and even the
rendered UI.
5. Wire headless agents into **CI/CD** as quality gates before code reaches
a human reviewer.
---
## 6.1 Why specialize?
A single general-purpose agent that does everything is the LLM equivalent of
a 10,000-line `main.py`. It works, sort of, but:
- It can't be tuned for the job at hand — defensive on architecture,
laser-focused on bug fixes.
- It can't review itself with any independence — the same prompt produced
the code; that prompt won't catch its own bias.
- It can't be parallelized — every task waits for the same head.
Specialization fixes all three. A roster of focused agents lets you match
the right tool to the right task, run independent passes for verification,
and parallelize across tasks that don't depend on each other.
This is the Module 1 model-selection rubric, extended one layer up: not just
*which model*, but *which role*.
---
## 6.2 Role-based agent design
Every specialized agent has the same four ingredients:
1. **A system prompt** — its identity and operating principles.
2. **A model** — sized to the task.
3. **A tool surface** — the smallest set of tools that lets it do its job.
4. **An invocation surface** — when (and how) other agents or humans
summon it.
In Claude Code, agents live in `.claude/agents/<name>.md`:
```markdown
---
name: architect
description: |
Use for cross-module changes, public API design, schema decisions, and
any "what's the right approach?" question. Reads, plans, and proposes.
Never edits code directly.
model: claude-opus-4-7
tools: Read, Grep, Glob, Bash(git log:*), Bash(git diff:*), WebSearch
---
You are the Architect. Your job is to *decide*, not to execute.
When invoked:
1. Read enough of the codebase to ground your proposal in reality, not
in generalities.
2. Produce a plan with: (a) the chosen approach, (b) at least one
alternative you rejected and why, (c) explicit risks, (d) files that
will be touched, (e) acceptance criteria.
3. If the question is under-specified, return clarifying questions —
do not invent answers.
You do not edit code. You do not run mutating commands. Your output is
a markdown plan that another agent (or a human) will execute.
When ending: explicitly hand off ("ready for builder") or stop
("needs human decision on X").
```
### A working roster
Five agents will cover most teams' needs. You will not build them all on
day one; build them as you need them.
| Agent | Model | Tools | Primary job |
|-------------------|-----------|------------------------------------|--------------------------------------|
| **Architect** | Opus | Read-only + Web | Plans, decisions, public-API design |
| **Builder** | Sonnet | Read, Write, Edit, Bash(test/build)| Implements an approved plan |
| **Refactor** | Sonnet | Read, Write, Edit, LSP | Mechanical, structure-preserving change |
| **Debugger** | Sonnet (think hard) | Read, Bash(test:*, log:*), git history | Reproduces, isolates, proposes fixes |
| **Security Auditor** | Opus | Read-only, semgrep, dependency CLIs | Reviews diffs for security regressions |
| **Reviewer** | Sonnet (or Opus for high-stakes) | Read, Bash(test:*) | Reviews diff against the spec |
| **Release Manager** | Sonnet | Read, Write, Bash(git, npm version, gh) | Cuts releases via the release skill |
### How specialization changes your prompts
Without specialization:
> *"Refactor `src/payments.ts`. Make sure it's well-tested. Also check
> for security issues. Then write a changelog entry."*
With specialization:
> *"Invoke Architect: plan the refactor of `src/payments.ts` to split
> external-API and persistence concerns. Then invoke Builder with the
> plan. Then invoke Security Auditor on the diff."*
The second prompt is shorter, more reviewable, and far less likely to
silently drop one of the goals.
### The "tools, not personality" principle
Specialization is **about constrained tool surfaces**, not about cute
personas. Don't write *"You are a senior engineer with 20 years of
experience…"* — the model is what it is. Write *"You may use Read and Grep.
You may not call Write or Bash. Your output must be a plan."* The
constraints do the work.
---
## 6.3 Deterministic vs creative agents
### Two failure modes from the same setting
- **A creative agent on a deterministic job.** A `release-manager` that
improvises about which files to update produces broken releases.
- **A deterministic agent on a creative job.** An `architect` that won't
consider alternatives produces local optima dressed up as plans.
Tune accordingly:
| Property | Deterministic agent | Creative agent |
|---------------------|----------------------------------|-----------------------------------|
| Model | Sonnet or Haiku | Opus |
| Reasoning budget | none / `think` (low) | `think harder` / `ultrathink` |
| Temperature * | low (0.0–0.2) | higher (0.7–1.0) |
| Prompt style | Numbered, "if X then Y" steps | Open: "weigh the alternatives" |
| Output format | Strict JSON / fixed template | Markdown with sections |
| Tools | Few, narrow | Read-heavy, web-search OK |
*Temperature is the SDK knob; CLIs typically expose it indirectly via mode
or per-agent config.
### Examples
A **release-manager** agent should be deterministic:
```markdown
---
name: release-manager
model: claude-sonnet-4-6
tools: Read, Edit, Bash(npm version:*), Bash(git tag:*), Bash(git commit:*)
---
Execute the release checklist exactly:
1. Verify clean working tree. If dirty: STOP, report.
2. Compute next version per semver: `npm version <major|minor|patch>`.
3. Update `CHANGELOG.md` using the existing template.
4. Commit with message `chore(release): vX.Y.Z`.
5. Tag `vX.Y.Z`.
6. STOP. Do not push. Report what was done.
Deviation from these steps is a bug. If you cannot complete a step,
report and stop. Do not improvise.
```
An **architect** agent should be more open:
```markdown
---
name: architect
model: claude-opus-4-7
tools: Read, Grep, Glob, WebSearch
---
ultrathink. When invoked, weigh at least two distinct approaches and
explain the trade-offs concretely (latency, complexity, blast radius,
team familiarity). Recommend one. Acknowledge what would change your
mind.
Your plan is a document, not a command. Make it good enough that the
team would approve it in review without you in the room.
```
> **Heuristic:** the cost of an agent making a wrong creative call is high
> *once* (one bad architectural decision). The cost of a deterministic
> agent making a wrong call is high *every time it runs*. Tune for the
> downside.
---
## 6.4 Agent routing strategies
How do multiple agents *interact*? Three patterns dominate.
### Sequential pipelines: Plan → Implement → Verify
The default. Each phase hands off to the next.
```
[Human]
│ "Add passwordless login per openspec/2026-05-add-passwordless-login"
▼
[Architect] → produces plan.md
│
▼
[Builder] → implements per plan, runs tests
│
▼
[Reviewer] → diffs vs spec, flags gaps
│
▼
[Human] → approve / push back
```
In Claude Code, the parent agent can delegate via sub-agent calls:
```
You: Implement openspec/changes/2026-05-add-passwordless-login.
Coordinator (you, in the CLI):
→ Use Agent(architect) to produce a plan.
→ Read the plan.
→ Use Agent(builder) with the plan and the spec.
→ Use Agent(reviewer) on the resulting diff.
→ Surface the reviewer's verdict to me.
```
The pipeline pays for itself the first time the reviewer catches the
builder leaving a debug `console.log` in production code.
### Parallel agents via git worktrees
When tasks are independent, run them in parallel — and use git worktrees
to keep them out of each other's way.
```bash
# Create worktrees for three parallel experiments
git worktree add ../app-experiment-a -b experiment/a
git worktree add ../app-experiment-b -b experiment/b
git worktree add ../app-experiment-c -b experiment/c
# Launch one agent per worktree (use separate panes / tabs / screens)
( cd ../app-experiment-a && claude "implement approach A from plan.md" ) &
( cd ../app-experiment-b && claude "implement approach B from plan.md" ) &
( cd ../app-experiment-c && claude "implement approach C from plan.md" ) &
wait
```
Three benefits:
1. **Isolation.** Each agent has its own working tree; no file conflicts.
2. **A/B testing of approaches.** Compare three implementations against
the same spec.
3. **Time leverage.** Three 20-minute runs in 20 minutes total, not 60.
When to use:
- True architectural alternatives where comparison is the point.
- A spec where you genuinely don't know which approach is best.
- High-stakes changes where redundancy is cheap insurance.
When *not* to use:
- Sequential dependencies (front-end needs the back-end shape first).
- Tasks where the right answer is obvious (you're just burning tokens).
- Anything touching shared external state (databases, queues) — the
isolation is at the filesystem level, not the database level.
### Reviewer loops and the Builder–Critic pattern
The single highest-value composition: a **Critic** agent reviews the
**Builder**'s diff and can demand revisions before the work reaches you.
```
[Builder] → diff v1 → [Critic]
│
├── pass → [Human]
└── fail → back to [Builder] with reasons
```
The Critic agent's job is to be **rigorous and slightly adversarial**. Its
prompt:
```markdown
---
name: critic
model: claude-sonnet-4-6
tools: Read, Bash(git diff:*), Bash(npm test:*), Bash(npm run typecheck)
---
You are reviewing a diff against a spec. Be rigorous.
For each acceptance criterion in the spec:
- Locate the code that satisfies it.
- Identify at least one input where the code might fail.
- Either confirm satisfied (with file:line evidence) or report unsatisfied.
Also check for:
- Tests added for each AC. Missing tests = unsatisfied.
- Type errors, lint warnings, console.log left behind, TODOs.
- Spec drift: code does something the spec doesn't authorize.
Output is a markdown report ending with: PASS or NEEDS_REVISION.
Be conservative — when in doubt, NEEDS_REVISION. The human will
ultimately decide; your job is to surface, not to gatekeep.
```
Two failure modes to guard against:
- **Sycophantic critic.** A Critic prompted vaguely produces "looks good
to me." Always demand evidence (file:line, test names) for any PASS.
- **Infinite loop.** Builder produces v2, Critic flags new issues, repeat.
Cap revisions at 3 and require human input on the 4th cycle. The
agents agreeing for the wrong reasons is worse than them disagreeing.
### Consensus and quorum patterns
For high-stakes calls, run **N independent reviewers** and require
consensus. This catches single-agent bias:
```
[Builder] → diff → [Reviewer-A (Sonnet)] → vote
→ [Reviewer-B (Opus, ultrathink)] → vote
→ [Security Auditor (Opus)] → vote
If 2+ pass → [Human]
If <2 pass → back to [Builder] with merged feedback
```
This is overkill for most changes. Reach for it on: schema migrations,
authentication, billing logic, anything you'd page someone for at 3am.
---
## 6.5 Agents verifying agents — the philosophy
The deeper move underneath every pattern above: **the same intelligence
that wrote the code can verify it — if it's reading it cold, against an
independent contract.**
Three properties make this work in practice:
1. **Independence.** The verifier must not have seen the builder's
intermediate reasoning. Pass it the diff and the spec, nothing else.
In Claude Code, this is a fresh sub-agent invocation, not a
continuation.
2. **A contract.** Without the spec (Module 2), the verifier has nothing
to verify against; it falls back to taste, which is what produced the
code in the first place.
3. **Asymmetric prompting.** The builder's prompt is "make this work."
The verifier's prompt is "find where this fails." Different prompts
bias toward different outputs.
### Validating specs and detecting drift
The simplest verifier doesn't even read the code — it reads the spec and
the test output, and asks: *does the test prove this AC?*
```
For each AC in the spec:
- Locate the test that claims to cover it.
- Read the test code.
- Confirm: does this test actually fail when the AC is violated?
- If the test is a tautology (e.g., asserts the function returns what
the function returned), report it as INSUFFICIENT.
```
This catches the most common silent failure mode in AI-generated code:
tests that pass because they don't actually test anything.
### Automated regression detection
The Critic agent's most valuable side effect: it can run the existing test
suite before and after the change and report any new failures. A pre-
commit hook (Module 4) can enforce this:
```bash
# .claude/hooks/pre-commit-regression-check.sh
set -e
BASELINE=$(git stash create) # snapshot current
git stash apply
npm test --silent && BASELINE_GREEN=1 # before change
git stash drop
# Run again with current working tree
npm test --silent && CURRENT_GREEN=1
if [ "$BASELINE_GREEN" = "1" ] && [ -z "$CURRENT_GREEN" ]; then
echo "REGRESSION: tests passed before this change, fail after." >&2
exit 2
fi
```
### Visual UI verification
For frontend changes, the verifier should *see the rendered output*. Modern
agents can drive Chrome (via Playwright or a dedicated MCP server) and
return screenshots:
```markdown
---
name: ui-verifier
model: claude-sonnet-4-6
tools: Read, Bash(npx playwright:*), chrome-mcp
---
For each UI change in the diff:
1. Identify the affected route(s).
2. Use chrome-mcp to load each route in three viewports
(mobile 375px, tablet 768px, desktop 1440px).
3. Capture screenshots.
4. Inspect for: clipped text, overflow, missing fonts, broken
contrast, console errors.
5. Report any issues with screenshot reference.
```
A failing screenshot is more convincing than a passing unit test.
### AI-assisted TDD
A useful loop:
1. Human writes the spec (Module 2).
2. **Test-writer agent** produces failing tests from the spec.
3. Human reviews the tests (this is the cheap, important review).
4. **Builder agent** implements until the tests pass.
5. **Critic agent** verifies the tests actually exercise the AC.
The order matters. Tests written *after* the code tend to match the code,
not the spec. Tests written *first*, from the spec, drive the
implementation toward the contract.
---
## 6.6 CI/CD integration: headless agents as quality gates
The same agents you use interactively can run in CI in **headless mode**.
In Claude Code:
```bash
claude -p "$(cat .github/prompts/pr-review.md)" \
--output-format json \
--max-turns 10 \
> pr-review.json
```
The `-p` flag (print mode) runs non-interactively and exits when done.
`--output-format json` gives you machine-readable output. `--max-turns`
caps the agent's wandering.
### A GitHub Actions example
```yaml
# .github/workflows/agentic-review.yml
name: Agentic review
on:
pull_request:
types: [opened, synchronize]
jobs:
review:
runs-on: ubuntu-latest
permissions:
pull-requests: write
contents: read
steps:
- uses: actions/checkout@v4
with: { fetch-depth: 0 }
- name: Install Claude Code
run: npm install -g @anthropic-ai/claude-code
- name: Run reviewer agent
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
git fetch origin "${{ github.base_ref }}"
claude -p "@.claude/agents/critic.md
The diff is git diff origin/${{ github.base_ref }}...HEAD.
The spec is openspec/changes/$(git log -1 --pretty=%s | grep -oP 'changes/\K[^ ]+')/proposal.md.
Output a markdown review ending with PASS or NEEDS_REVISION." \
--max-turns 6 \
--output-format text > review.md
- name: Post review as PR comment
uses: peter-evans/create-or-update-comment@v4
with:
issue-number: ${{ github.event.pull_request.number }}
body-file: review.md
- name: Fail if reviewer says NEEDS_REVISION
run: |
if grep -q "NEEDS_REVISION" review.md; then
echo "::error::Agentic reviewer requested revisions"
exit 1
fi
```
What this buys you:
- Every PR gets a structured review against its own spec, posted as a
comment.
- The reviewer can block merges automatically when issues are found.
- Humans review the *review*, not the raw code, on the first pass.
### Cost shape and rate limits
Headless agents in CI run on every PR push. Three controls keep the bill
sane:
1. **Use Sonnet, not Opus**, for routine reviews. Reserve Opus for
labeled-as-critical PRs.
2. **Cap `--max-turns`.** 6–10 is usually right; more usually means the
agent is thrashing.
3. **Cache the spec, not the code.** Anthropic's prompt caching (Module
7) is dramatic here — the spec and `AGENTS.md` rarely change between
PRs, so a cache hit on the first ~10k tokens of every review saves
real money.
---
## 6.7 The frontier: verifier independence and adversarial agents
Two emerging patterns worth your attention.
### Independence as a property to engineer
The more independent your verifier is from your builder, the more
informative its verdicts. Independence axes:
- **Different model family.** Sonnet builder reviewed by Opus verifier.
- **Different reasoning trace.** Verifier never sees the builder's plan.
- **Different inputs.** Verifier gets only diff + spec, not the
conversation that produced them.
- **Different tools.** Verifier can run tests; builder might not have.
The cheapest move is the "different reasoning trace" axis: always run the
verifier as a fresh sub-agent invocation, not a continuation of the
builder's session.
### Adversarial agents
For security-sensitive code, an explicitly adversarial agent — prompted to
*find ways to break* the implementation — can surface bugs no
collaborative review would:
```markdown
---
name: red-team
model: claude-opus-4-7
tools: Read, Bash(curl:*), Bash(npm test:*)
---
ultrathink. You are a security researcher with a grudge. The code in
this diff is going to production. Find at least three concrete attacks:
each must be a specific input and the resulting incorrect behavior.
Cite line numbers. Do not propose fixes — your job is to break.
After three, also consider:
- What about side channels (timing, error messages)?
- What does this trust that it shouldn't?
- What can an authenticated user do that they shouldn't?
```
Use sparingly. Adversarial agents are expensive (Opus + ultrathink) and
their output is unpleasant to read. They earn their keep on auth, billing,
data export, and any code that runs unauthenticated.
---
## Lab 6 — Stand up a three-agent pipeline
**Goal:** ship a feature through a Plan → Build → Review pipeline of
specialized agents, with the reviewer's verdict gating completion.
**Time:** ~2 hours.
1. **Define three agents** in `.claude/agents/`:
- `architect.md` — Opus, read-only, produces plan.
- `builder.md` — Sonnet, read/write/bash(test), implements plan.
- `critic.md` — Sonnet, read + bash(test), reviews diff against spec.
2. **Pick a real change** with a spec (use the OpenSpec proposal from
Lab 2 if you did it).
3. **Run the pipeline manually** by invoking each sub-agent in turn from
a coordinator session. Capture the artifacts (plan, diff, review).
4. **Iterate.** When the critic returns NEEDS_REVISION, re-run the
builder with the critic's feedback as additional context. Cap at 3
loops.
5. **Promote to CI.** Add a minimal GitHub Action that runs the critic on
every PR and posts a comment. Use the workflow in §6.6 as a starting
point.
**What to look for:** the *quality* of the critic's feedback. If it's
generic, your spec is too vague (Module 2 problem) or its prompt is too
permissive. Strengthen one of those, not the critic itself.
---
## Common pitfalls
- **Agents that drift.** A "builder" that starts giving architectural
opinions is mis-prompted. Tighten the tool surface and rewrite the
system prompt to make creative output unrewarded.
- **Critics that say PASS too easily.** Demand evidence. A PASS without a
file:line citation is just vibes.
- **Pipelines that never converge.** Build → Review → Build → Review with
no progress means the spec is wrong, not the agents. Stop and rewrite
the spec.
- **Treating sub-agents as cheap.** A 5-agent pipeline costs roughly 5×
the tokens of a single agent. Reserve the complex pipelines for the
changes whose blast radius justifies them.
- **Skipping the human at the end.** The agents catch most of the bugs.
*Most* is not all. The human approval step is non-negotiable for
anything that ships.
---
## Summary
- Specialized agents are constrained tool surfaces with focused prompts.
Personalities are decoration; constraints do the work.
- Tune the reasoning budget and the prompt to the job — deterministic for
execution, creative for decisions.
- Sequential pipelines (Plan → Build → Review), parallel fan-outs (via
worktrees), and reviewer loops cover most orchestration patterns.
- The "agents verifying agents" philosophy works because the verifier is
*independent* and has a *contract*. Without both, you have ceremony,
not verification.
- Headless agents in CI are how this scales — every PR gets a structured
review against its own spec.
---
## Further reading
- *Claude Code — Sub-agents* documentation.
- *Git worktrees* — the official Git documentation; short and worth
re-reading.
- *Builder–Critic* and *Generator–Discriminator* patterns in the broader
multi-agent literature.
- Anthropic — *Prompt caching* (becomes load-bearing in Module 7).
**Next:** [Module 7 — The Human-in-the-Loop & Strategic ROI](07-human-in-the-loop.md)