Retrieval-Augmented Generation - Albert Masoliver's learning site

## Definition

**Retrieval-Augmented Generation (RAG)** is the pattern of conditioning an LLM's response on documents retrieved at query time from an external knowledge base, typically a [[Vector Database]] of embeddings. The pattern was introduced by Lewis et al. (2020); see [[Retrieval-Augmented Generation (Lewis et al.)]].

## Why RAG

- **Knowledge that changes faster than training.** A model's pretraining cutoff lags its deployment by months or more; RAG injects fresh content per query.
- **Knowledge that doesn't belong in weights.** Internal docs, customer data, and regulatory filings are too private or too specific to bake into a model.
- **Mitigates [[Hallucination]].** Generation is grounded in retrieved text, so the model can cite sources.
- **Cheaper than fine-tuning.** Update the index, not the model.

## The Canonical Pipeline

```
User query
    │
    ▼
[Query embedding] ──→ [Vector DB] ──→ Top-k chunks
                                           │
                                           ▼
                      [Prompt template with chunks + query]
                                           │
                                           ▼
                                         [LLM]
                                           │
                                           ▼
              Generated response (optionally with citations)
```

## Indexing Side

1. **Chunk** documents (typically 200–1000 tokens, sometimes with overlap).
2. **Embed** each chunk via an embedding model; see [[Embedding]].
3. **Store** in a vector database with metadata for filtering; see [[Vector Database]].

## Retrieval Side

1. **Embed the query** in the same vector space.
2. **Search** for nearest neighbours (cosine or dot product).
3. **Optionally rerank** the top-N with a cross-encoder for higher precision.
4. **Filter** by metadata (date, source, tenant).

## Generation Side

1. **Compose a prompt** with the retrieved chunks as context plus the user query.
2. **Generate** the response.
3. **Cite** the sources (chunk IDs, page numbers, URLs) the response drew on.

## Common Pitfalls

- **Wrong chunking.** Too large → noisy context; too small → loses meaning across chunks.
- **Stale index.** Documents updated but not re-indexed; the LLM cites stale info.
- **Embedding-model mismatch.** Index built with one model, queries embedded with another.
- **No reranking.** Top-k by vector similarity isn't always relevance-ordered; a reranker helps.
- **No citations enforced.** The model claims to draw on retrieved docs but doesn't: a [[Hallucination]] dressed in RAG clothing.

## Evolution Toward Agentic Retrieval

In modern agentic systems, retrieval is increasingly **invoked as a tool** rather than as a preprocessing step. The agent decides *when* to retrieve, *what* to retrieve, and how to iterate; see [[Tool Use]].

## Related

- [[Vector Database]]
- [[Embedding-Based Retrieval]]
- [[Semantic Search]]
- [[Embedding]]
- [[Hallucination]]
- [[Retrieval-Augmented Generation (Lewis et al.)]]
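
## Minimal Pipeline Sketch

The indexing, retrieval, and prompt-composition steps described above can be sketched in a few lines of standard-library Python. This is an illustrative toy, not a reference implementation: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, chunk sizes are counted in words rather than tokens, and every function name here (`chunk`, `embed`, `build_index`, `retrieve`, `compose_prompt`) is hypothetical. The sketch applies the metadata filter before ranking, which is one possible ordering; reranking and the LLM call itself are omitted.

```python
import math
import re
from collections import Counter


def chunk(text, size=8, overlap=2):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    words = text.split()
    step = size - overlap  # assumes size > overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(docs):
    """Indexing side: chunk, embed, and store each chunk with metadata."""
    index = []
    for source, text in docs:
        for n, piece in enumerate(chunk(text)):
            index.append({"id": f"{source}#{n}", "source": source,
                          "text": piece, "vector": embed(piece)})
    return index


def retrieve(index, query, k=2, source=None):
    """Retrieval side: embed the query with the same function, filter, rank."""
    q = embed(query)
    candidates = [r for r in index if source is None or r["source"] == source]
    ranked = sorted(candidates, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:k]


def compose_prompt(query, chunks):
    """Generation side: put retrieved chunks, tagged by ID, ahead of the query."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return ("Answer using only the context below and cite chunk IDs in brackets.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


docs = [
    ("handbook", "Refunds are issued within 14 days of purchase when the item is unused."),
    ("faq", "Shipping takes three to five business days for domestic orders."),
]
index = build_index(docs)
query = "How many days do refunds take?"
top = retrieve(index, query, k=1)
prompt = compose_prompt(query, top)
print(prompt)
```

With this toy data the highest-scoring chunk is `handbook#0`, and its ID appears in the prompt so that a model can be asked to cite it. Because the query and the chunks go through the same `embed` function, the sketch also avoids the embedding-model mismatch pitfall by construction.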