## Definition
**AI application architecture** is the layered, progressively enriched structure that most production foundation-model applications converge on. Rather than specifying a fixed blueprint, the pattern is *additive*: start from the minimum viable pipeline and introduce each component only when the cost of not having it becomes concrete. Chip Huyen (2024) presents this five-step progression as a pattern observed across multiple companies.
## The Five Steps
**Step 1 — Enhance context.** The application retrieves relevant data (documents, tables, images) and augments the model's prompt with it before generation. This is the foundation-model equivalent of feature engineering and is universally necessary. It underpins [[Retrieval-Augmented Generation]] and all tool-use patterns.
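A minimal sketch of context enhancement, assuming a toy keyword-overlap retriever in place of a real search index or vector store. The names `score`, `retrieve`, `build_prompt`, and `DOCS` are hypothetical and exist only for illustration.

```python
def score(query: str, doc: str) -> int:
    """Count distinct query terms that also appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents (ties keep input order)."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Augment the user's question with the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Use the context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical document store.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original order number.",
]

if __name__ == "__main__":
    print(build_prompt("how are refunds processed", DOCS))
```

The prompt that reaches the model contains only the two refund-related documents; the unrelated one is left out. A production retriever would likely use term weighting or embeddings rather than raw overlap.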
**Step 2 — Add guardrails.** Input guardrails screen queries for sensitive data (PII leaking to external APIs, prompt injection) and out-of-scope intent. Output guardrails catch format failures, hallucinations, toxicity, and brand risk. Failures trigger retry logic (parallel redundant calls to manage latency) or human hand-off. See [[Prompt Injection]] for the threat model.
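The two guardrail positions can be sketched as follows. This is an illustration under stated assumptions: the input guardrail only masks email addresses with a simple regex, the output guardrail only checks that the response is a JSON object with an `answer` key, and retries run sequentially rather than in parallel. `mask_pii`, `is_valid_output`, and `guarded_call` are hypothetical names, and `model` stands for any callable that takes a prompt and returns text.

```python
import json
import re

# Simplified email pattern; real PII detection covers far more cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_pii(text: str) -> str:
    """Input guardrail: mask email addresses before the query leaves the app."""
    return EMAIL.sub("[EMAIL]", text)


def is_valid_output(raw: str) -> bool:
    """Output guardrail: accept only a JSON object that has an 'answer' key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "answer" in data


def guarded_call(query: str, model, max_attempts: int = 3):
    """Mask the input, call the model, and retry on malformed output.

    Returns the answer, or None so the caller can hand off to a human.
    """
    safe_query = mask_pii(query)
    for _ in range(max_attempts):
        raw = model(safe_query)
        if is_valid_output(raw):
            return json.loads(raw)["answer"]
    return None
```

Returning `None` instead of raising keeps the hand-off decision with the caller, which is one possible design; an exception or a fallback message would work equally well.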
**Step 3 — Add model router and gateway.** As pipelines grow to involve multiple models, a router directs each query to the appropriate model or specialised pipeline (intent classifier → billing specialist, troubleshooter, FAQ redirect). A [[Model Gateway]] provides the unified infrastructure layer for access control, cost governance, and fallback.
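As a rough sketch of this step, the block below pairs a keyword-based intent classifier with a gateway-style fallback. Real routers usually rely on a trained classifier or a small model, so the keyword lists, the `ROUTES` table, and the function names here are all assumptions made for illustration.

```python
from typing import Callable

Handler = Callable[[str], str]


def classify_intent(query: str) -> str:
    """Toy intent classifier based on keyword matching."""
    q = query.lower()
    if any(w in q for w in ("invoice", "refund", "charge")):
        return "billing"
    if any(w in q for w in ("error", "crash", "not working")):
        return "troubleshooting"
    return "faq"


# Each route stands in for a specialised model or pipeline.
ROUTES: dict[str, Handler] = {
    "billing": lambda q: f"[billing specialist] {q}",
    "troubleshooting": lambda q: f"[troubleshooter] {q}",
    "faq": lambda q: f"[FAQ redirect] {q}",
}


def route(query: str) -> str:
    """Send the query to the handler chosen by the intent classifier."""
    return ROUTES[classify_intent(query)](query)


def call_with_fallback(providers: list[Handler], prompt: str) -> str:
    """Gateway-style fallback: try providers in order, return the first success."""
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:  # a real gateway would catch narrower errors
            last_error = err
    raise RuntimeError("all providers failed") from last_error
```

For example, `route("I was charged twice")` reaches the billing handler, and `call_with_fallback([primary, backup], prompt)` returns the backup's response when the primary raises.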
**Step 4 — Reduce latency with caches.** Beyond [[KV Cache]] and [[Prompt Caching]] (handled inside the model API), the application layer adds *exact caching* (deterministic lookup of prior query-response pairs) and optionally *semantic caching* (vector-similarity lookup). Semantic caching has a higher hit rate but introduces error risk if the similarity threshold is poorly tuned.
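The trade-off between the two application-level caches can be illustrated with a small two-tier cache. This sketch assumes a toy bag-of-words vector in place of a real embedding model and a linear scan in place of a vector index; `embed`, `cosine`, and `TwoTierCache` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.exact: dict[str, str] = {}
        self.entries: list[tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.exact[query.strip().lower()] = response
        self.entries.append((embed(query), response))

    def get(self, query: str) -> tuple:
        """Return (response, tier), where tier is 'exact', 'semantic', or 'miss'."""
        key = query.strip().lower()
        if key in self.exact:  # exact tier: deterministic lookup
            return self.exact[key], "exact"
        q = embed(query)  # semantic tier: nearest stored query
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"
```

With "how do I reset my password" cached, the query "how do I reset my router" scores about 0.83 under this toy similarity. A threshold of 0.9 treats it as a miss, while a threshold of 0.8 returns the password answer for a router question, which is the kind of error a poorly tuned threshold introduces.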
**Step 5 — Add agent patterns.** Complex tasks require loops, branching, parallel execution, and write actions. The response can feed back into the same pipeline — triggering another retrieval, a tool call, or a planning step — before being returned to the user. Write actions (composing emails, placing orders, executing code) vastly expand capability but also expand the attack surface and failure surface.
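A minimal sketch of the feedback loop, assuming a scripted `policy` function in place of a model and a hypothetical `TOOLS` registry in which each tool is flagged as read-only or write. The step budget and the write-approval flag are illustrative safeguards, not a prescribed design.

```python
def run_agent(task, policy, tools, max_steps=5, allow_writes=False):
    """Loop: ask the policy for the next action until it returns a final answer."""
    history = [("task", task)]
    for _ in range(max_steps):
        action, arg = policy(history)
        if action == "final":
            return arg
        tool = tools[action]
        if tool["writes"] and not allow_writes:
            # Write actions are blocked unless explicitly approved.
            history.append((action, "BLOCKED: write action needs approval"))
            continue
        history.append((action, tool["fn"](arg)))
    return "stopped: step budget exhausted"


# Hypothetical tools: one read action and one write action.
outbox: list[str] = []


def send_email(body: str) -> str:
    outbox.append(body)
    return "sent"


TOOLS = {
    "search": {"writes": False, "fn": lambda q: "order 17: shipped"},
    "send_email": {"writes": True, "fn": send_email},
}


def scripted_policy(history):
    """Stand-in for a model: search first, then email, then finish."""
    done = {name for name, _ in history}
    if "search" not in done:
        return "search", "order 17 status"
    if "send_email" not in done:
        return "send_email", "Your order has shipped."
    return "final", f"done after {len(history) - 1} tool steps"
```

Calling `run_agent("notify customer", scripted_policy, TOOLS)` leaves `outbox` empty because the write is blocked; passing `allow_writes=True` lets the email through. The step budget bounds the loop if the policy never produces a final answer.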
## Monitoring and Observability
Every component adds failure modes. Observability must be designed in from the start, not retrofitted. Key operational metrics:
- **MTTD** (mean time to detection): how quickly failures are caught.
- **MTTR** (mean time to response): how quickly they are resolved once detected.
- **CFR** (change failure rate): fraction of changes or deployments that result in failures requiring a fix or rollback.
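The three metrics above reduce to simple arithmetic over incident and deployment records. The sketch below assumes timestamps in minutes and measures MTTR from detection to resolution; the `Incident` record and the function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    occurred_min: float  # when the failure started
    detected_min: float  # when it was noticed
    resolved_min: float  # when it was fixed


def mttd(incidents: list[Incident]) -> float:
    """Mean time from occurrence to detection."""
    return mean([i.detected_min - i.occurred_min for i in incidents])


def mttr(incidents: list[Incident]) -> float:
    """Mean time from detection to resolution (an assumed convention)."""
    return mean([i.resolved_min - i.detected_min for i in incidents])


def cfr(failed_deployments: int, total_deployments: int) -> float:
    """Fraction of deployments that caused a failure."""
    return failed_deployments / total_deployments


# Illustrative numbers, not real measurements.
INCIDENTS = [Incident(0, 10, 40), Incident(100, 130, 150)]
```

With these illustrative records, `mttd(INCIDENTS)` is 20 minutes, `mttr(INCIDENTS)` is 25 minutes, and `cfr(2, 8)` is 0.25.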
Traces link every step from raw query to final response, making it possible to pinpoint exactly where a pipeline fails — retrieval, model, scoring, or guardrail.
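One way to picture a trace is as an ordered list of spans, one per pipeline step, each recording success and duration. The `Trace` class below is a hypothetical, in-memory sketch; production systems would typically export spans to a tracing backend instead.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects one span per pipeline step for a single query."""

    def __init__(self, query: str):
        self.query = query
        self.spans: list[dict] = []

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        span = {"step": name, "ok": True, "error": None}
        try:
            yield span
        except Exception as err:
            span["ok"] = False
            span["error"] = repr(err)
            raise  # the failure still propagates to the caller
        finally:
            span["ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(span)

    def first_failure(self):
        """Name of the first failed step, or None if every step succeeded."""
        return next((s["step"] for s in self.spans if not s["ok"]), None)


def run_pipeline(query: str) -> Trace:
    """Simulated pipeline in which the model step times out."""
    trace = Trace(query)
    try:
        with trace.step("retrieval"):
            pass  # retrieval succeeds
        with trace.step("model"):
            raise TimeoutError("model timed out")
        with trace.step("guardrail"):
            pass  # never reached
    except TimeoutError:
        pass
    return trace
```

Here `run_pipeline("hello").first_failure()` reports the model step, and the recorded spans show that retrieval completed while the guardrail step never ran.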
## Design Principles
- Separation is fluid: guardrails can live inside the inference service, inside the gateway, or as standalone components.
- Each added component enlarges the failure surface, so add one only when its benefit is concrete.
- Aim for parallelism in latency-sensitive pipelines: routing and PII removal can run simultaneously.
- The gateway is both an observability chokepoint and a cost-control lever.
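The parallelism principle above can be sketched with `asyncio.gather`, which runs independent steps concurrently. The sleeps simulate latency, and `remove_pii`, `pick_route`, and `preprocess` are hypothetical names. The sketch assumes the router runs in-house, so it is acceptable for it to see the unmasked query.

```python
import asyncio
import time


async def remove_pii(query: str) -> str:
    await asyncio.sleep(0.1)  # simulated latency of a PII scrubber
    return query.replace("555-0100", "[PHONE]")


async def pick_route(query: str) -> str:
    await asyncio.sleep(0.1)  # simulated latency of an intent classifier
    return "billing" if "invoice" in query.lower() else "faq"


async def preprocess(query: str) -> tuple[str, str]:
    """Run PII removal and routing concurrently instead of back to back."""
    clean, route = await asyncio.gather(remove_pii(query), pick_route(query))
    return clean, route


def timed_preprocess(query: str):
    """Return the preprocessing result together with elapsed seconds."""
    start = time.perf_counter()
    result = asyncio.run(preprocess(query))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print(timed_preprocess("Invoice help, call 555-0100"))
```

Because the two steps do not depend on each other, the elapsed time should be close to the slower of the two rather than their sum.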
## Related
- [[Three-Layer AI Stack]]
- [[Retrieval-Augmented Generation]]
- [[Prompt Injection]]
- [[KV Cache]]
- [[Prompt Caching]]
- [[Model Gateway]]
- [[Agentic Loop]]
- [[Orchestrator-Subagent Pattern]]
## Sources
- [[AI Engineering - Chip Huyen]]