Embedding-Based Retrieval - Albert Masoliver's learning site

## Definition

**Embedding-based retrieval** is the mechanism that turns text (or images, or other inputs) into dense vectors and uses vector similarity, typically cosine or dot product, to identify the most semantically relevant items from a large index. It is the engineering core of [[Semantic Search]] and [[Retrieval-Augmented Generation]].

## The Loop

1. **Index time.** Compute and store embeddings for every chunk in the corpus.
2. **Query time.** Embed the query with the *same* embedding model.
3. **Search.** Find the top-$k$ nearest neighbours in the index.
4. **Optionally rerank** with a cross-encoder.

## Why "Same Model" Matters

A query embedded with model A and an index built with model B live in different geometric spaces. Even when the dimensions happen to match, the cosine of their vectors is mathematically defined but semantically meaningless. **Re-index whenever you change the embedding model.**

## Symmetric vs Asymmetric Embeddings

- **Symmetric:** queries and documents are encoded by the same model and pooled the same way. Used for clustering and deduplication.
- **Asymmetric:** queries and documents use different encoders (or different pooling). Often better for RAG because query distribution ≠ document distribution.

Modern embedding models (Voyage, OpenAI, Cohere, BGE) come in both flavours; some accept a prefix like `"query: ..."` vs `"passage: ..."` to switch modes.

## Chunking Decisions

The chunk is the unit of retrieval, so chunk size and boundaries affect everything downstream:

- **Too large.** Retrieved chunks contain mostly irrelevant content; the LLM has to filter noise.
- **Too small.** Meaning fragments across chunks; the right answer requires multiple hits stitched together.
- **Overlap.** A small overlap (e.g., 10–20%) hedges against boundary cases.
- **Structural chunking.** Respect natural boundaries (sections, functions, paragraphs) over fixed token counts.

The right chunk size is domain-specific. Measure on representative queries.

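
As a rough illustration of the overlap idea from the chunking notes above, here is a minimal sketch of fixed-size chunking with a sliding window. The function name `chunk_tokens`, its parameters, and the use of whitespace splitting as a stand-in tokenizer are all assumptions made for this example; a real pipeline would use the embedding model's own tokenizer and would probably prefer structural boundaries where they exist.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=30):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration only; not taken from any library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace splitting stands in for a real tokenizer (an assumption).
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With `chunk_size=4` and `overlap=1`, each window repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap. The right values remain something to measure on representative queries rather than to read off a sketch like this.
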
## Quality Levers

- **Better embedding model.** Usually the single biggest lever.
- **Better chunking.** Often underweighted; can improve retrieval quality substantially.
- **Reranker.** Run a cross-encoder over the top candidates (e.g., the top 50).
- **Hybrid with BM25.** See [[Semantic Search]].
- **Query rewriting.** Have an LLM expand or rephrase the query before embedding (HyDE, multi-query retrieval).

## Evaluation

Retrieval can be evaluated independently of the generator:

- **Hit Rate@k:** does the right document appear in the top-$k$?
- **MRR** (Mean Reciprocal Rank): how high in the top-$k$ does the first relevant document rank?
- **nDCG:** a rank-aware score that supports graded relevance.

Frameworks such as RAGAS and Tonic Validate provide retrieval metrics of this kind for production RAG.

## Related

- [[Embedding]]
- [[Semantic Search]]
- [[Vector Database]]
- [[Retrieval-Augmented Generation]]
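
To make the metrics from the Evaluation section concrete, the following is a small sketch of Hit Rate@k, MRR, and nDCG@k in plain Python. The function names, the input shapes (a ranked list of document ids per query, plus a set of relevant ids or a dict of graded gains), and the choice of linear gain for nDCG are assumptions made for this example, not the definitions used by any particular framework. It also assumes each ranked list contains no duplicate ids.

```python
import math


def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant id appears in the top-k results, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids, k):
    """Return 1/rank of the first relevant id within the top-k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gains, k):
    """nDCG@k with linear gain; `gains` maps doc id -> graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(runs, k=5):
    """Average hit rate and MRR over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("runs must contain at least one query")
    n = len(runs)
    return {
        "hit_rate": sum(hit_rate_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel, k) for r, rel in runs) / n,
    }


# Two toy queries: the first finds its answer at rank 2, the second misses.
runs = [(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"q"})]
print(evaluate(runs, k=3))
```

Because these functions only need ranked ids and relevance labels, they can be run against the retriever alone, without involving the generator.
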