### Why You'd Even Build a Multi-Agent System

Most agentic systems are still single-agent. One model, one context window, one tool-using loop. They work. Teams reach for multi-agent setups when the work outgrows what fits in a single context: a research task spanning twelve API endpoints, a code review that touches fifty files, a migration plan covering three independent codebases. The pattern that's emerged across Anthropic, Cognition, OpenAI, Microsoft, and LangChain handles these by splitting work between a lead [[AI Agent]] that owns the user's full request and ephemeral subagents that each take one slice, run in isolation, and report back with a paragraph. The architecture has converged enough that [[Multi-Agent AI Systems in 2026 (FlowHunt)]] documents it as the default; peer-collaborating "GroupChat" designs, dominant in early 2024, have mostly disappeared from new builds. If you're building serious agentic software in 2026, this is what production looks like.

### Where This Shows Up

Three patterns cover almost all real deployments.

The first is parallel research. The orchestrator receives a question with several independent sub-questions, spawns one subagent per sub-question, and merges the results. Anthropic's Research feature is the canonical case: a Lead Researcher splits the query, dispatches workers, and a separate CitationAgent post-processes for attribution. Anthropic's internal evaluation reported the multi-agent setup outperforming a single-agent baseline by over 90% on complex tasks.

The second is asynchronous code work. Cognition's Managed Devins use a fan-out pattern: a coordinating Devin breaks the work into pieces, spawns managed Devins each with their own virtual machine, and merges the results. Each managed Devin is a full Devin internally (its own browser, terminal, and editor), but the orchestrator never sees that complexity. Only the final summary crosses the boundary.

The third is opportunistic isolation.
A long-running agent on a complex task spawns subagents not because the work is genuinely parallel, but to keep its own context clean. A subagent investigates a tangent, returns a paragraph, and the orchestrator continues with its main thread uncluttered. Without this, long agent sessions become unsustainable: the orchestrator's [[Context Window]] fills with tool-call noise and reasoning quality drops.

If your workload doesn't look like one of those three shapes, you probably don't need multi-agent yet. Single-agent costs a quarter of the tokens and is easier to debug.

### Implementation Patterns

The skeleton is straightforward:

1. Orchestrator receives the task, plans, decides what to do itself vs what to delegate.
2. For each delegated task, it constructs a system prompt and task description and spawns a subagent.
3. Subagents run their own [[Agentic Loop]] in [[Subagent Context Isolation]]. They may parallelise or run serially.
4. Each subagent returns one [[Compressed Summary Return]].
5. The orchestrator integrates summaries and continues.

Three implementation details account for most of the production complexity.

**Task description quality.** The biggest variable in a multi-agent system is how clearly the orchestrator writes the subagent's task. Vague task in → vague summary out. A precise task ("find three peer-reviewed sources from the last 18 months that report on X, return title, authors, and the one-sentence finding from each") gets a usable result. Anthropic's own teams have written that prompt phrasing was the difference between efficient research and wasted spend. Treat the task description as the contract between two agents.

**Spawning policy.** When does the orchestrator delegate vs handle inline? Anthropic's deployed Research system uses a rough scale: one agent for simple fact-finding, two to four subagents for direct comparisons, ten or more for multi-faceted research.
The scale isn't a rule; it's an admission that the right cutoff depends on the task. Build the policy as a small set of heuristics the orchestrator can apply, not as a hard threshold.

**Model routing.** The orchestrator runs the most capable model (Opus, GPT-5.5, Gemini Pro) because it carries the most context and makes the most consequential decisions. Subagents run cheaper models matched to their task: Haiku for linting and simple lookup, Sonnet for coding, Opus only when a subagent's task genuinely needs frontier reasoning. Cost discipline here matters as much as the pattern itself; see [[Model Selection Strategy]].

### Trade-offs in Production

The pattern's biggest cost is tokens. Multi-agent systems use roughly 15× the tokens of a single chat interaction, by FlowHunt's industry measurement, and Anthropic reports the same number. That 15× isn't evenly spread; it concentrates in tasks where subagents do extensive tool-calling internally. A budget that assumed 2024 chat-level consumption won't survive a multi-agent rollout. Plan for it.

The second cost is determinism. Summaries are lossy by design. When the orchestrator gets back a wrong answer, you can't easily tell *why* it's wrong without re-running the subagent's work. Two mitigations help in practice: have subagents cite their sources or attach minimal structured evidence to summaries, and log every subagent's full trace out-of-band. The orchestrator doesn't see the trace, but you can replay it during debugging.

The third is failure handling. A subagent that hallucinates, hits a tool error, or loops will produce a confident but wrong summary. The orchestrator has no easy way to spot this from the summary alone. Options: ask the subagent to score its own confidence (directional, not reliable), use a verifier subagent on critical outputs, or have the orchestrator double-check by re-issuing the same task with different model routing. None are clean. Pick the one that matches your tolerance for cost vs error.
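
The skeleton and the mitigations above (model routing, evidence attached to summaries, traces logged out-of-band) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's API: `call_model` is a stub standing in for a real model-plus-tools loop, and `MODEL_FOR_TASK`, the tier labels, `TRACE_LOG`, `Summary`, and `needs_verification` are all hypothetical names invented for the example.

```python
import asyncio
import json
from dataclasses import dataclass, field

# Hypothetical routing table: task kind -> model tier (labels are placeholders).
MODEL_FOR_TASK = {
    "lookup": "small-model",
    "coding": "mid-model",
    "reasoning": "frontier-model",
}

# Stand-in for an out-of-band trace store (a real system would use a log sink).
TRACE_LOG: list[str] = []


@dataclass
class Summary:
    """The only thing that crosses back to the orchestrator."""
    task: str
    model: str
    text: str
    evidence: list[str] = field(default_factory=list)


async def call_model(model: str, prompt: str):
    """Stub for a real agentic loop; returns (summary text, evidence, full trace)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    trace = [{"step": "tool_call", "model": model, "prompt": prompt}]
    return f"[{model}] findings for: {prompt}", [f"source-for:{prompt}"], trace


async def run_subagent(task: str, kind: str) -> Summary:
    model = MODEL_FOR_TASK[kind]
    text, evidence, trace = await call_model(model, task)
    # The full trace is logged out-of-band; the orchestrator never sees it.
    TRACE_LOG.append(json.dumps({"task": task, "trace": trace}))
    return Summary(task, model, text, evidence)


async def orchestrate(tasks: list[tuple[str, str]]) -> list[Summary]:
    # Fan out one subagent per (task, kind) pair and wait for all summaries.
    # asyncio.gather returns results in input order, not completion order.
    return await asyncio.gather(*(run_subagent(t, k) for t, k in tasks))


def needs_verification(summary: Summary) -> bool:
    # Assumed heuristic: a summary with no attached evidence gets a second look.
    return not summary.evidence


if __name__ == "__main__":
    results = asyncio.run(orchestrate([
        ("compare pricing of A and B", "lookup"),
        ("patch the failing test", "coding"),
    ]))
    for s in results:
        print(s.model, "|", s.text, "| verify:", needs_verification(s))
```

The design choice worth noting is the split between `Summary` and `TRACE_LOG`: the orchestrator's context only ever receives the compact summary, while the replayable detail lives elsewhere for debugging. A verifier subagent or a re-issue on a different model tier would hook in wherever `needs_verification` returns true.
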

The thing that gets simpler is parallelism. Subagents are stateless by construction. Spawning ten in parallel is as easy as spawning one, and the orchestrator integrates results in whatever order they return.

### Tooling and Ecosystem

The pattern shows up under different names in different frameworks, but the shape is the same. Microsoft's Agent Framework (the AutoGen successor) calls it orchestrator-worker. LangChain and LangGraph implement it as a supervisor graph node delegating to worker nodes. Anthropic's *Building Effective AI Agents* describes the architecture under "Orchestrator-workers." Cognition's Devin documentation uses "Managed Devins" for the worker tier. Whichever framework matches your stack will do; the abstractions are largely interchangeable.

Two protocols are coalescing around the pattern. [[Model Context Protocol]] standardises how each agent, orchestrator or subagent, connects to tools and data. A2A (Agent2Agent), now governed under the Linux Foundation, standardises agent-to-agent communication for the rare case when subagents do need lateral channels. MCP handles the vertical axis; A2A handles the horizontal one. Most current deployments only use MCP, since pure orchestrator-subagent has no peer-to-peer channel. A2A becomes relevant when you compose orchestrators across organisational boundaries.

The observability layer is the part the ecosystem hasn't solved well. Logging subagent traces is straightforward. Reconstructing the *causal chain* (why the orchestrator chose to spawn this subagent given that summary) is still mostly bespoke. If you're building a system like this, invest in a tracing layer early. The failure modes are subtle, and you can't debug them from chat logs.

### In Practice

Start single-agent.
Move to orchestrator-subagent when one of three signals fires: your single agent's context is filling with tool output, you're doing work that genuinely parallelises, or you need the failure isolation that [[Ephemeral Subagent]]s give you.

When you do move, copy Anthropic's structure rather than inventing one. A lead orchestrator on your most capable model, subagents on cheaper tiers, a dedicated subagent for any cross-cutting concern (citations, validation, summarisation). Pay disproportionate attention to the task descriptions the orchestrator writes; that's where the wins and losses live. Budget for roughly 15× the tokens of a plain chat interaction on the equivalent task (about four times a single-agent run, going by the quarter-of-the-tokens figure above), and instrument heavily.

The reason this pattern won is unglamorous: it matches how production engineers think about distributed systems. One process owns the state, workers do bounded units of work, results flow back as messages. That intuition is doing a lot of work here. It's why the same architecture keeps converging across labs that didn't coordinate.

## References

- [[Multi-Agent AI Systems in 2026 (FlowHunt)]]: https://www.flowhunt.io/blog/multi-agent-ai-system/
- [[The Architecture of Scale - Anthropic Sub-Agents (Oswal)]]: https://medium.com/codetodeploy/the-architecture-of-scale-a-deep-dive-into-anthropics-sub-agents-6c4faae1abda