05-memory-orchestration - Albert Masoliver's learning site

# Module 5 — Memory Orchestration & Context Engineering

> *"Context is what the model sees this turn. Memory is what survives the
> next compaction, the next session, the next teammate. They are different
> problems and they need different tools."*

---

## Learning objectives

By the end of this module you will be able to:

1. Distinguish **context** (per-turn attention surface) from **memory** (cross-session persistence) and choose the right strategy for each.
2. Build a layered memory stack — file-based notes, auto-memory, structured stores like Engram — that survives compaction and onboards new sessions in seconds.
3. Apply **context-engineering** patterns: hierarchical compression, scoped memories, temporal awareness, just-in-time retrieval.
4. Diagnose and recover from memory failures: stale facts, duplicates, contradictions, "phantom progress" from compaction.

---

## 5.1 The amnesia problem

An agent's "memory" within a session feels real until the moment it breaks. Common breakage moments:

- A multi-hour session hits the compaction threshold. You return from lunch and the agent has forgotten the convention you agreed to in turn 12.
- You start a new session on Tuesday. The agent has no idea you spent Monday debugging this exact issue, so it confidently re-proposes the same broken solution.
- A teammate picks up your branch. They get a fresh agent that knows nothing about the trade-offs you weighed.

These aren't model failures. They're the absence of a memory layer. Building one is a first-class engineering activity, not a luxury.

---

## 5.2 Context vs memory

### Context — the attention surface for one turn

Everything the model considers when generating its next response:

- The system prompt.
- The `AGENTS.md` and `CLAUDE.md` files loaded at session start.
- The conversation history (compacted if necessary).
- The files explicitly read or attached.
- Tool outputs from this turn.

Context is **ephemeral**.
The token at position N in the window has no guaranteed influence over the token at position N+10,000.

### Memory — what survives

Memory is anything **outside the conversation** that an agent can re-load:

- Files in the repo (`DECISIONS.md`, `CHANGELOG.md`, code comments).
- Per-project memory stores managed by the harness.
- External knowledge bases reached via MCP.
- Structured stores like **Engram** that record decisions, entities, and relationships across sessions.

The crucial property is **re-loadable**. If a future agent in a future session can't deterministically retrieve it, it isn't memory — it's just hope.

### Why the distinction matters

Conflating them produces two failure modes:

- **Treating context as memory.** "I told you yesterday we decided on PostgreSQL." Yesterday's session is gone. The model has no way to know.
- **Treating memory as context.** Loading a 100-page memory store into every session as context wastes tokens and degrades quality ("lost-in-the-middle" again).

The right shape: **memory is a corpus you retrieve from**; **context is the small, freshly-curated subset of memory you load right now**.

---

## 5.3 Layered memory architecture

A working memory stack has three layers. Each has a different write discipline and a different retrieval pattern.

### Layer 1 — Source-of-truth files (in the repo)

The slowest-changing, most durable layer. Lives in the repo, reviewed in PRs, survives team turnover. Typical files:

```
docs/
├── ARCHITECTURE.md   # high-level system shape
├── DECISIONS.md      # architecture decision records (ADRs)
├── CONVENTIONS.md    # detailed coding conventions
└── GLOSSARY.md       # domain terms with definitions
```

Rules of thumb:

- Write to these files when a decision is **final** and **load-bearing for future agents**.
- Each entry has a date and a "why." Decisions without context get re-litigated.
- Prefer ADR format for `DECISIONS.md`:

```markdown
## ADR-0042 — Use PostgreSQL row-level security for tenant isolation

**Date:** 2026-04-19
**Status:** Accepted

### Context
We considered (a) separate databases per tenant, (b) schema-per-tenant,
(c) row-level security (RLS). We have ~3k tenants today and project
~50k in three years. Per-database overhead at that scale dominates the
cost model.

### Decision
We use Postgres RLS keyed on `tenant_id`. All application connections
run with the `app_user` role; queries SET LOCAL `app.tenant_id` at the
start of each transaction.

### Consequences
- Bench shows ~3% query overhead vs no RLS. Acceptable.
- ALL new tables containing tenant-scoped data MUST add a `tenant_id`
  column and an RLS policy. See `db/migrations/_template_rls.sql`.
- The `pg_dump` flow needs to be re-tested for cross-tenant leakage at
  every major Postgres upgrade.
```

An agent reading this in a fresh session knows what to do *and what to guard against* — the consequences section is where the real value lives.

### Layer 2 — Harness-managed memory (per-project, per-user)

Modern agentic CLIs ship a file-based memory store the agent itself can write to between turns. In Claude Code, it lives under `~/.claude/projects/<project>/memory/` and is loaded automatically.

This layer captures things the agent learns *during* sessions:

- The user's role, preferences, expertise.
- Feedback patterns ("don't summarize unprompted").
- Project facts that aren't yet stable enough for `docs/`.
- References to external systems ("incidents are tracked in PagerDuty").

The discipline that makes this work:

- **Write incrementally, one fact per file.** Atomicity makes contradiction visible.
- **Use frontmatter for type and discoverability.** The harness uses descriptions to surface only relevant memories per session.
- **Update or delete stale entries actively.** A memory that says "we use MySQL" when the code uses Postgres is a hallucination factory.
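
The three discipline rules can be sketched in a few lines of code. This is a minimal sketch, assuming memories are plain Markdown files in one directory, using the `name` / `description` / `metadata.type` frontmatter fields from the sample memory file in this section; `write_memory`, `forget`, `read_frontmatter` and `memory_index` are hypothetical helpers, not Claude Code's implementation. Keying each file on its `name` means an update overwrites the old fact instead of appending a contradictory duplicate, and the index reads only descriptions, the cheap view a harness would scan before loading any bodies.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical helpers sketching "one fact per file" with frontmatter.
# The directory layout and field names mirror the sample memory file;
# they are assumptions, not Claude Code's actual implementation.

SLUG = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def write_memory(root: Path, name: str, description: str, mem_type: str, body: str) -> Path:
    """Create or replace the one file that holds a single fact."""
    if not SLUG.match(name):
        raise ValueError(f"memory name must be a lowercase slug: {name!r}")
    if "\n" in description:
        raise ValueError("description must fit on one line")
    path = root / f"{name}.md"
    text = (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "metadata:\n"
        f"  type: {mem_type}\n"
        "---\n\n"
        f"{body.strip()}\n"
    )
    # Same name -> same file: an update overwrites the stale version
    # instead of leaving two contradictory copies side by side.
    path.write_text(text, encoding="utf-8")
    return path


def forget(root: Path, name: str) -> bool:
    """Delete a stale memory; returns False if there was nothing to delete."""
    path = root / f"{name}.md"
    if path.exists():
        path.unlink()
        return True
    return False


def read_frontmatter(path: Path) -> dict[str, str]:
    """Parse top-level 'key: value' lines between the opening and closing '---'.
    Indented (nested) keys such as metadata.type are skipped in this sketch."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0] != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line == "---":
            break
        key, sep, value = line.partition(":")
        if sep and not line.startswith(" "):
            fields[key.strip()] = value.strip()
    return fields


def memory_index(root: Path) -> dict[str, str]:
    """Name -> description for every memory: the cheap view scanned before loading bodies."""
    return {p.stem: read_frontmatter(p).get("description", "") for p in sorted(root.glob("*.md"))}


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    write_memory(root, "db-engine", "Which database the project uses.", "project", "We use MySQL.")
    # The fact changes later: rewrite the same file rather than adding a second one.
    write_memory(root, "db-engine", "Which database the project uses.", "project", "We use Postgres.")
    print(sorted(p.name for p in root.glob("*.md")))  # ['db-engine.md']
    print(memory_index(root))  # {'db-engine': 'Which database the project uses.'}
```

The slug check on `name` is a deliberate guard: it keeps every fact addressable by one stable filename, which is what makes "update, don't duplicate" enforceable.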

A sample memory file (typical structure):

```markdown
---
name: tenant-isolation
description: Decision and operational rules for tenant isolation; canonical reference is ADR-0042.
metadata:
  type: project
---

Tenant isolation is enforced via Postgres row-level security. All
tenant-scoped tables must include `tenant_id` and an RLS policy.

**Why:** We picked RLS over separate DBs at ~3k tenants; see
docs/DECISIONS.md ADR-0042 for the full reasoning.

**How to apply:** When generating new tables or migrations touching
tenant-scoped data, add the RLS policy template from
db/migrations/_template_rls.sql. Don't propose schema-per-tenant
alternatives without re-checking the ADR.
```

Note the explicit `Why` and `How to apply` — those are what let the model *generalize* the memory to new situations instead of repeating it verbatim.

### Layer 3 — Structured external stores (Engram and friends)

When a project gets big — many engineers, many specs, many decisions — file notes hit a ceiling. A **structured memory store** like Engram is the next step. It records entities (specs, decisions, incidents, people) and the relationships between them, and exposes a query interface (usually MCP).

What you get over plain files:

- **Cross-reference.** "Show me every decision touching the auth module made since the last incident."
- **Decay.** Entries can be marked stale, deprecated, or superseded — and the store enforces it.
- **Multi-agent.** The reviewer agent and the builder agent see the *same* memory store, so they start from the same facts.

A typical Engram-style call from an agent:

```
engram.search({ type: "decision", about: "auth.session_token", since: "2026-01-01" })
// → [
//     { id: "ADR-0038", decided: "tokens are JWT with 1h expiry" },
//     { id: "ADR-0041", decided: "refresh tokens stored hashed in DB" }
//   ]
```

Whether you need this layer depends on team size and project complexity. A two-person side project does not.
A 30-engineer platform team running a multi-year program absolutely does — without it, every new engineer's first month is a rediscovery exercise.

---

## 5.4 Context engineering — choosing what enters the window

### Hierarchical compression

Not every memory needs to enter context as its full text. Compress on the way in:

- **Title only** for irrelevant items in a search result.
- **Title + summary** for plausibly relevant items.
- **Full text** only for items the agent decided to use.

In practice this looks like a retrieval pipeline:

1. Scout agent (Haiku) reads memory titles + descriptions, returns ~10 IDs.
2. Builder agent (Sonnet) loads the full text of the top ~3.
3. Architect agent (Opus) is only invoked if those ~3 disagree or are insufficient.

This is *exactly* the model-selection pyramid from Module 1, applied to memory.

### Memory scopes — user vs project vs session

Three scopes, three retention policies:

| Scope   | Where it lives                                | Decay rate        | Example                                                    |
|---------|-----------------------------------------------|-------------------|------------------------------------------------------------|
| User    | `~/.claude/memory/` (per developer)           | Slow              | "Prefers terse responses."                                 |
| Project | `~/.claude/projects/<id>/memory/` or `docs/`  | Medium            | "We use RLS for tenant isolation."                         |
| Session | Conversation history                          | Fast (compaction) | "We've decided this turn to call the variable `tenantId`." |

Two failure patterns:

- **Promoting session facts to project memory too eagerly.** A decision made in turn 3 might be reversed in turn 9. Wait until the change ships or is otherwise *stable* before writing it to a layer with slow decay.
- **Letting project facts pollute user memory.** "User prefers PostgreSQL" is not a user preference; it's a project fact. Write it to the project scope.

### Temporal context — the date matters

Two patterns:

1. **Anchor absolute dates in memory.** "We'll freeze on Thursday" rots in 24 hours. "We'll freeze on 2026-03-05" doesn't.
2. **Tell the model the current date.** Modern harnesses do this automatically. If yours doesn't, include it in `AGENTS.md` or in the first message of the session. The model's training cutoff is months behind; without a date anchor, it will reason as if it were earlier than it is.

### Just-in-time retrieval, not just-in-case loading

The temptation: "let me load all the docs at the start of the session so the agent has everything." The cost: context bloat and lost-in-the-middle degradation.

The better pattern:

- **At session start, load only standing rules** (`AGENTS.md`, `DECISIONS.md` index).
- **Let the agent retrieve specifics** as it decides it needs them — via MCP, file reads, or memory queries.
- **Re-curate the window** when the task changes mid-session ("the architecture phase is done; drop those files, load the implementation ones").

### Anti-patterns to avoid

- **The "kitchen sink" `AGENTS.md`.** An `AGENTS.md` with 2,000 lines of standing rules forces compaction as soon as anything else enters the window. Keep it under ~300 lines and link out to detail files.
- **Memory as changelog.** Writing "today I did X" memories that no future session will ever query. Use git log; that's its job.
- **Re-summarizing the summary.** When compaction happens, the existing summary is *the input*. Re-summarizing degrades it further. After two compactions, restart the session.

---

## 5.5 Engram — a closer look

Engram is one of several emerging structured-memory tools designed for agentic workflows. It deserves a closer look because its model is broadly applicable even if you choose a different implementation.

### Core entities

- **Decision** — an immutable record of a choice. Has a status (`proposed`, `accepted`, `superseded`).
- **Entity** — a noun in the project (a service, a model, a customer segment, a person).
- **Event** — something that happened (deploy, incident, spec change).
- **Reference** — a pointer to a canonical source (file path, URL, ticket).
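
The four entity types above can be sketched as a small in-memory data model. This is a hypothetical illustration, not Engram's schema or API: the class names, the `accept(..., supersedes=...)` method and the `search` filter (modeled loosely on the Engram-style call in 5.3) are assumptions. Decisions are frozen dataclasses, so the recorded choice never changes. A status change produces an updated copy, and a superseded decision keeps a pointer to its successor. Only decisions get a store here, since that is where the status lifecycle lives.

```python
from dataclasses import dataclass, field, replace
from datetime import date
from enum import Enum
from typing import Optional

# Hypothetical data model for the four entity types. Names, fields,
# and methods are illustrative assumptions, not Engram's schema or API.


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    SUPERSEDED = "superseded"


@dataclass(frozen=True)
class Decision:
    id: str
    about: str                 # entity the choice concerns, e.g. "auth.session_token"
    decided: str               # the choice itself; never edited once recorded
    on: date
    status: Status = Status.PROPOSED
    superseded_by: Optional[str] = None


@dataclass(frozen=True)
class Entity:
    id: str
    kind: str                          # service, model, customer segment, person, ...
    depends_on: tuple[str, ...] = ()   # ids of other entities


@dataclass(frozen=True)
class Event:
    id: str
    kind: str                          # deploy, incident, spec change, ...
    on: date
    affects: tuple[str, ...] = ()      # ids of affected entities


@dataclass(frozen=True)
class Reference:
    id: str
    target: str                        # file path, URL, or ticket


@dataclass
class DecisionLog:
    decisions: dict[str, Decision] = field(default_factory=dict)

    def record(self, decision: Decision) -> None:
        self.decisions[decision.id] = decision

    def accept(self, new_id: str, supersedes: Optional[str] = None) -> None:
        """Accept a decision and optionally retire the one it replaces.
        Records are frozen, so status changes store updated copies."""
        self.decisions[new_id] = replace(self.decisions[new_id], status=Status.ACCEPTED)
        if supersedes is not None:
            self.decisions[supersedes] = replace(
                self.decisions[supersedes], status=Status.SUPERSEDED, superseded_by=new_id
            )

    def search(self, about: str, since: date, include_inactive: bool = False) -> list[Decision]:
        """Accepted decisions about `about` made on or after `since`, oldest first.
        include_inactive=True also returns proposed and superseded ones (the history)."""
        hits = [
            d for d in self.decisions.values()
            if d.about == about
            and d.on >= since
            and (include_inactive or d.status is Status.ACCEPTED)
        ]
        return sorted(hits, key=lambda d: d.on)


if __name__ == "__main__":
    log = DecisionLog()
    # Hypothetical decisions; ids and contents are made up for illustration.
    log.record(Decision("D-1", "auth.session_token", "JWT with 1h expiry", date(2026, 1, 10)))
    log.accept("D-1")
    log.record(Decision("D-2", "auth.session_token", "opaque tokens, 15m expiry", date(2026, 3, 2)))
    log.accept("D-2", supersedes="D-1")
    print([d.id for d in log.search("auth.session_token", date(2026, 1, 1))])  # ['D-2']
    print(log.decisions["D-1"].superseded_by)  # D-2
```

Keeping superseded records instead of deleting them is what lets a query return the history of a decision, not just the latest answer.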

Relationships connect them: decisions *supersede* other decisions; events *affect* entities; entities *depend on* entities.

### Why this shape works

The schema mirrors what good engineering teams already do informally. ADRs *are* decisions. PagerDuty incidents *are* events. The CMDB is a graph of entities. Engram simply gives the agent a single interface to all of it.

### Sample workflow

1. **Session starts.** Agent calls `engram.context_for(repo: "billing")` and gets back the 8 most-relevant active decisions and 3 recent events.
2. **Mid-session.** Agent proposes a change that would supersede ADR-0038. It calls `engram.record_decision_draft(...)` so the proposal itself becomes a queryable entity.
3. **PR merges.** A hook calls `engram.accept_decision(ADR-0042)`, marking it accepted and ADR-0038 superseded.
4. **Next month, new engineer arrives.** Their first prompt: `"explain the current state of tenant isolation"`. Engram returns ADR-0042 (accepted) with a link back to ADR-0038 (superseded), giving them the *history* of the decision, not just the latest answer.

### Where Engram-like stores struggle

- **Free-form prose.** They are not designed for "the long discussion we had about whether to do X." Keep that in `docs/` and reference it.
- **Ambiguous scope.** If a decision applies to *this service* vs *the whole org*, the entity model has to express that. Bad scoping makes decisions appear to apply where they don't, or not apply where they do.

---

## 5.6 Auto-memory — letting the agent write its own notes

Most modern harnesses include an **auto-memory** system: the agent observes the conversation and writes durable memories on its own initiative. Used well, this is a force multiplier. Used badly, it's an entropy generator.

### When auto-memory helps

- User explicitly asks: "remember that we always use the `slim` Docker image."
- User corrects the agent's approach: "no, don't mock the database — we tried that and got burned." That correction should become a memory.
- A non-obvious decision is reached after deliberation. Write it down.

### When auto-memory hurts

- Writing memories for every passing detail. Memory files balloon.
- Writing contradictory memories without reconciling them.
- Writing memories that paraphrase the code. The code is its own memory.

The discipline:

- **Reviewable.** Every auto-memory write should be visible to you. Most harnesses surface them in the session — don't hide them.
- **Editable.** Memory is a directory of files; treat them as code. Periodically audit and prune.
- **Decay-aware.** When a memory is contradicted, *delete or update*. Never let two contradictory memories coexist.

---

## Lab 5 — Build a three-layer memory stack

**Goal:** install a memory architecture that demonstrably improves a second-session experience.

**Time:** ~75 minutes.

1. **Layer 1.** In a real repo, create `docs/DECISIONS.md` and seed it with two real ADRs (write them from existing decisions you can articulate). Add a one-line pointer to `AGENTS.md`.
2. **Layer 2.** Have a working session about a non-trivial change. At three moments — when a convention is agreed, when a non-obvious trade-off is resolved, when a stale assumption is corrected — ask the agent to write a memory file. Review and edit each one for clarity.
3. **Layer 3 (optional).** Install an Engram MCP server (or any structured store of your choice) and record one decision through it. Note the difference in retrieval ergonomics vs the files.
4. **The test.** Tomorrow, in a brand-new session and a brand-new branch, start with: *"I want to make a change in the area we discussed yesterday. Recap your understanding."* Score the recap on a 1–5 scale for accuracy and completeness.

**What to look for:** a well-built stack produces a recap that surprises you with how much it remembers — including the *why* behind decisions, not just the *what*. If the recap is vague or wrong, your layer-1 or layer-2 write discipline needs work; do not blame the model.
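
Before running the test in step 4, a quick lint pass over the layer-2 files can catch write-discipline gaps that would otherwise surface as a vague recap. A minimal sketch, assuming memory files follow the sample structure from 5.3 (frontmatter with `name` and `description`, then `**Why:**` and `**How to apply:**` lines in the body); `lint_memory` and `lint_dir` are hypothetical helpers, not a harness feature, and the checks are deliberately shallow string tests.

```python
import tempfile
from pathlib import Path

# Hypothetical lint for layer-2 memory files; the required fields and
# body markers mirror the sample memory file and are assumptions.
REQUIRED_FIELDS = ("name", "description")
REQUIRED_MARKERS = ("**Why:**", "**How to apply:**")


def lint_memory(text: str) -> list[str]:
    """Return the write-discipline problems found in one memory file's text."""
    lines = text.splitlines()
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return ["missing frontmatter block"]
    end = lines.index("---", 1)  # closing frontmatter delimiter
    # Top-level keys only; indented lines belong to nested mappings.
    keys = {line.partition(":")[0].strip() for line in lines[1:end] if not line.startswith(" ")}
    problems = [f"frontmatter lacks '{f}'" for f in REQUIRED_FIELDS if f not in keys]
    body = "\n".join(lines[end + 1:])
    problems += [f"body lacks {m}" for m in REQUIRED_MARKERS if m not in body]
    return problems


def lint_dir(root: Path) -> dict[str, list[str]]:
    """Lint every *.md file in root; only files with problems are reported."""
    report = {}
    for path in sorted(root.glob("*.md")):
        problems = lint_memory(path.read_text(encoding="utf-8"))
        if problems:
            report[path.name] = problems
    return report


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    (root / "tenant-isolation.md").write_text(
        "---\nname: tenant-isolation\ndescription: RLS rules\n---\n\n"
        "Tenant isolation uses RLS.\n\n**Why:** see ADR-0042.\n\n**How to apply:** add the policy.\n",
        encoding="utf-8",
    )
    (root / "db.md").write_text("---\nname: db\n---\n\nWe use MySQL.\n", encoding="utf-8")
    for name, problems in lint_dir(root).items():
        print(name, problems)  # only db.md is reported
```

A file flagged for a missing `**Why:**` is exactly the kind of memory the agent can quote but not generalize from.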

---

## Common pitfalls

- **Trusting compaction.** The summary is a paraphrase; load-bearing content has to live outside the conversation.
- **Memory drift without auditing.** Files marked "we use X" while the code uses Y, contradicted by another file saying "we use Z." Audit monthly.
- **Loading everything every time.** Memory is for *retrieval*, not for *constant presence in context*. Curate per session.
- **No `Why:` in your memories.** Without the why, the agent can't generalize; it can only quote. Edge cases will defeat it.

---

## Summary

- Context and memory are different concepts and need different tools.
- Build memory in layers: source-of-truth files, harness-managed memory, structured external stores.
- Engineer context actively — compress, scope, anchor in time, retrieve just-in-time.
- Auto-memory is powerful when reviewed and pruned, and corrosive when ignored.

---

## Further reading

- *Engram* — project documentation.
- Anthropic's *auto-memory* documentation and the patterns in the public `claude-code` examples.
- *Architecture Decision Records* (Michael Nygard) — the original ADR template; still the right shape.

**Next:** [Module 6 — The Meta-Agent Factory & Verification Frontier](06-meta-agent-factory.md)