AI Application Architecture - Albert Masoliver's learning site

## Definition

**AI application architecture** is the layered, progressively enriched structure that most production foundation-model applications converge on. Rather than specifying a fixed blueprint, the pattern is *additive*: start from the minimum viable pipeline and introduce each component only when the cost of not having it becomes concrete. Chip Huyen (2024) validates this five-step progression across multiple companies.

## The Five Steps

**Step 1: Enhance context.** The application retrieves relevant data (documents, tables, images) and augments the model's prompt with it before generation. This is the foundation-model equivalent of feature engineering and is universally necessary. It underpins [[Retrieval-Augmented Generation]] and all tool-use patterns.

**Step 2: Add guardrails.** Input guardrails screen queries for sensitive data (PII leaking to external APIs, prompt injection) and out-of-scope intent. Output guardrails catch format failures, hallucinations, toxicity, and brand risk. Failures trigger retry logic (parallel redundant calls to manage latency) or human hand-off. See [[Prompt Injection]] for the threat model.

**Step 3: Add model router and gateway.** As pipelines grow to involve multiple models, a router directs each query to the appropriate model or specialised pipeline (intent classifier → billing specialist, troubleshooter, FAQ redirect). A [[Model Gateway]] provides the unified infrastructure layer for access control, cost governance, and fallback.

**Step 4: Reduce latency with caches.** Beyond [[KV Cache]] and [[Prompt Caching]] (handled inside the model API), the application layer adds *exact caching* (deterministic lookup of prior query-response pairs) and optionally *semantic caching* (vector-similarity lookup). Semantic caching has a higher hit rate but introduces error risk if the similarity threshold is poorly tuned.

**Step 5: Add agent patterns.** Complex tasks require loops, branching, parallel execution, and write actions.
The response can feed back into the same pipeline (triggering another retrieval, a tool call, or a planning step) before being returned to the user. Write actions (composing emails, placing orders, executing code) vastly expand capability but also expand the attack surface and failure surface.

## Monitoring and Observability

Every component adds failure modes. Observability must be designed in from the start, not retrofitted. Key operational metrics:

- **MTTD** (mean time to detection): how quickly failures are caught.
- **MTTR** (mean time to response): how quickly they are resolved.
- **CFR** (change failure rate): fraction of deployments that cause regressions.

Traces link every step from raw query to final response, making it possible to pinpoint exactly where a pipeline fails: retrieval, model, scoring, or guardrail.

## Design Principles

- Separation is fluid: guardrails can live inside the inference service, inside the gateway, or as standalone components.
- Each added component multiplies the failure surface; add only when the benefit is concrete.
- Aim for parallelism in latency-sensitive pipelines: routing and PII removal can run simultaneously.
- The gateway is both an observability chokepoint and a cost-control lever.

## Related

- [[Three-Layer AI Stack]]
- [[Retrieval-Augmented Generation]]
- [[Prompt Injection]]
- [[KV Cache]]
- [[Prompt Caching]]
- [[Model Gateway]]
- [[Agentic Loop]]
- [[Orchestrator-Subagent Pattern]]

## Sources

- [[AI Engineering - Chip Huyen]]
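
## Example Sketch

To make Steps 1 through 4 concrete, here is a minimal sketch of how context enhancement, guardrails, routing, and an exact cache might compose in one request path. Everything in it is an assumption made for illustration: the names (`Pipeline`, `retrieve`, `route`, `input_guardrail`, `output_guardrail`), the keyword-based retrieval and routing, the e-mail-only PII mask, and the stubbed model callables are hypothetical stand-ins, not an API from the source. For simplicity the retry runs sequentially, whereas the note above mentions parallel redundant calls.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical, deliberately simple PII pattern: e-mail addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def input_guardrail(query: str) -> str:
    """Step 2 (input): mask e-mail addresses before the query leaves the app."""
    return EMAIL_RE.sub("[EMAIL]", query)


def output_guardrail(response: str) -> bool:
    """Step 2 (output): toy check that only rejects empty responses."""
    return bool(response.strip())


def retrieve(query: str, docs: Dict[str, str]) -> List[str]:
    """Step 1: naive keyword lookup standing in for a real retriever."""
    words = set(re.findall(r"\w+", query.lower()))
    return [text for key, text in docs.items() if key in words]


def route(query: str) -> str:
    """Step 3: keyword rules standing in for an intent classifier."""
    q = query.lower()
    if "invoice" in q or "refund" in q:
        return "billing"
    if "error" in q or "crash" in q:
        return "troubleshooting"
    return "faq"


@dataclass
class Pipeline:
    models: Dict[str, Callable[[str], str]]  # route name -> stubbed model call
    docs: Dict[str, str]                     # keyword -> document text
    cache: Dict[str, str] = field(default_factory=dict)
    max_retries: int = 2

    def answer(self, query: str) -> str:
        safe = input_guardrail(query)
        if safe in self.cache:               # Step 4: exact cache hit
            return self.cache[safe]
        prompt = "\n".join(retrieve(safe, self.docs) + [safe])
        model = self.models[route(safe)]
        for _ in range(self.max_retries):    # sequential retry on guardrail failure
            response = model(prompt)
            if output_guardrail(response):
                self.cache[safe] = response  # cache only responses that passed
                return response
        return "ESCALATE_TO_HUMAN"           # hand-off sentinel (hypothetical)
```

The cache is keyed on the masked query, so raw PII never becomes a cache key, and only responses that pass the output guardrail are stored. A semantic cache would replace the dictionary lookup with a vector-similarity search plus a tuned threshold, which is where the error risk noted in Step 4 comes in.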