07-human-in-the-loop-labs - Albert Masoliver's learning site

# Labs — Module 7: The Human-in-the-Loop & Strategic ROI > Four short labs that turn the orchestrator role from a metaphor > into a measurable practice. Each fits in a single focused block; > the whole set is ~70 minutes. The follow-through afterwards is the > work. | Lab | Title | Time | Maps to | |-----|--------------------------------------------------|--------|--------------------------------------| | 7.1 | Bucket last week against the orchestrator shape | 15 min | §7.1 Your weekly shape | | 7.2 | One cost lever, measured before and after | 20 min | §7.2 Cost optimization | | 7.3 | Stand up a two-metric scorecard | 20 min | §7.3 Measuring velocity | | 7.4 | Decommission one component | 15 min | §7.4 When to remove an agent | --- ## Lab 7.1 — Bucket last week against the orchestrator shape ### Objective Compare your actual week against the §7.1 shape. Name one shift you'll make. ### Time 15 minutes. ### Real-world scenarios - **A — IC engineer at mid-size SaaS.** Hands-on too high; specs too low. Shift: "this week, one OpenSpec proposal before code." - **B — Tech lead on platform.** Tuning eating 25%+; firefighting prompts. Shift: "promote three `AGENTS.md` rules to hooks." - **C — Architect at large enterprise.** Meetings eating 40%+ mis-labeled as strategic. Shift: collapse two recurring meetings to async. ### Setup Calendar + ticket history + recent PRs visible. ### Steps 1. Bucket last week's working hours into: Specs / Reviews / Tuning / Strategic / Hands-on / Meetings. Approximations are fine. 2. Tabulate `Category | Hours | % | §7.1 target % | Delta` in `labs/notes/7.1-week.md`. 3. Write one observable, dated shift: *"This week I'll do X by Friday."* ### Deliverable The week table + the one-shift commitment in `labs/notes/7.1-week.md`. ### Success criteria - All hours accounted for; no "miscellaneous" >10%. - The shift is *observable* — a teammate could verify on Friday. ### Reflection - Which category surprised you most — over or under? ### Stretch - Repeat next week. The shift should show in the numbers, or it wasn't real. --- ## Lab 7.2 — One cost lever, measured before and after ### Objective Apply a single cost lever to one real workflow and measure the delta. Don't claim 90%; report what *you* got. ### Time 20 minutes. ### Real-world scenarios - **A — CI-heavy team.** Lever: prompt caching on reviewer prefix. Expected delta: 40–50% per-PR. - **B — Local-session-heavy.** Lever: bounded sessions (end at task boundaries). Expected: ~30% daily cost. - **C — Mixed workload.** Lever: route one Opus usage to Sonnet where it doesn't earn Opus. Expected: 20–40% on that path. ### Setup Pull a 5-sample baseline before changing anything (5 PRs or 5 sessions), record cost. ### Steps 1. Apply the lever (one config change, not a rewrite). 2. Run 3–5 more samples on the same workflow. 3. Compute before/after deltas. Compare against expected. ### Deliverable `labs/notes/7.2-cost.md` with baseline, the lever, after-numbers, the % delta, and one observed quality regression (if any). ### Success criteria - Real measurements, before and after, on the *same* workflow type. - Honesty about regressions: if quality dropped, name it. ### Reflection - Was the lever you expected to pay most actually the biggest? Or did something else surprise you? ### Stretch - Add a recurring cost snapshot (Stop hook → CSV append) so the data accumulates without effort. --- ## Lab 7.3 — Stand up a two-metric scorecard ### Objective Pick two metrics, define how each is measured, capture baselines, publish. Skip target-setting depth; that's the follow-through. ### Time 20 minutes. ### Real-world scenarios - **A — 5-person product team.** Metrics: lead time, cost per merged PR. - **B — 15-person platform team.** Metrics: MTTR, PRs auto-reviewed on first pass. - **C — 50-engineer org.** Metrics: change failure rate, specs authored per engineer per month. ### Setup Open `labs/notes/7.3-scorecard.md`. ### Steps 1. Pick two metrics from §7.3 that fit your team. Skip vanity (lines of code, PR count). 2. For each: define data source, query/script, owner, cadence (weekly). 3. Capture today's baseline value for each. No guesses. ### Deliverable `labs/notes/7.3-scorecard.md` published somewhere persistent (wiki, Slack canvas, or repo `dashboard.md`). ### Success criteria - Each metric has an *operational definition* — a teammate could compute the same number. - Baselines are *measured*, not estimated. ### Reflection - Which metric might *worsen* short-term as you adopt agentic workflows? Plan for it. ### Stretch - Add one *leading* metric — something that predicts movement in the two lagging ones. --- ## Lab 7.4 — Decommission one component ### Objective Practice the under-discussed discipline: *remove* automation that isn't earning. Today, retire one thing. ### Time 15 minutes. ### Real-world scenarios - **A — Early stage (4–6 weeks in).** Many speculative components. Easy retirement candidates. - **B — Mature (3–6 months in).** Specific friction points known. Tune most; retire the worst offenders. - **C — Year-old platform.** Several prompts now obsolete because the underlying model improved. Retire eagerly. ### Setup Open `labs/notes/7.4-decommission.md` with columns: `Component | What it does | Last earned its keep | Verdict`. ### Steps 1. Inventory: list every sub-agent, skill, MCP server, hook, and `AGENTS.md` rule you've installed. 2. Per row, name a concrete event in the last month where it earned its keep. If you can't, mark RETIRE. 3. Execute **one** retirement: delete the file or disable the config. No commented-out leftovers. ### Deliverable The audit table + the retirement commit + `labs/notes/7.4-decommission.md` with the retired component and the *reason*. ### Success criteria - At least one component is *actually deleted* from the repo. Not flagged, not deprecated — gone. - The reason is data ("hasn't fired usefully in 6 weeks"), not vibes. ### Reflection - Was the retirement painful? Sunk cost, identity, or actual loss? ### Stretch - Calendar the next decommission audit for 90 days from today. --- ## Wrap-up: end of the course In `labs/notes/`: `7.1` through `7.4`. In your repo / process: - A published two-metric scorecard with baselines. - A measurably cheaper workflow on one lever. - Fewer, sharper automations than before this lab. You now have, across all seven modules' labs: - A **configured, governed agentic CLI** (Modules 3–4). - A **memory stack** that survives sessions and teammates (Module 5). - A **specialized agent fleet** with verifier pipelines (Module 6). - A **defensible orchestrator practice** with numbers behind it (Module 7). The agents will keep getting better. So should the system you wrap around them. --- ## Where to go next - Repeat the labs in a different repo. Different shape, different gaps. - Teach one lab to a teammate. - When the tooling shifts under the instructions, open a PR. Living documents stay alive only when their readers tend them.