## Definition
**Embedding-based retrieval** is the mechanism that turns text (or images, or other inputs) into dense vectors and uses vector similarity, typically cosine or dot product, to identify the most semantically relevant items in a large index. It is the engineering core of [[Semantic Search]] and [[Retrieval-Augmented Generation]].
## The Loop
1. **Index time.** Compute and store embeddings for every chunk in the corpus.
2. **Query time.** Embed the query with the *same* embedding model.
3. **Search.** Find the top-$k$ nearest neighbours in the index.
4. **Optionally rerank** with a cross-encoder.
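
The loop above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `ToyEmbedder` is a hypothetical bag-of-words stand-in for a real embedding model, and the search is brute force rather than an approximate nearest-neighbour index.

```python
import math


class ToyEmbedder:
    """Hypothetical stand-in for a real embedding model:
    bag-of-words counts over a fixed vocabulary, L2-normalised."""

    def __init__(self, corpus):
        words = sorted({w for text in corpus for w in text.lower().split()})
        self.vocab = {w: i for i, w in enumerate(words)}

    def embed(self, text):
        vec = [0.0] * len(self.vocab)
        for w in text.lower().split():
            if w in self.vocab:  # out-of-vocabulary words are ignored
                vec[self.vocab[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]


def build_index(model, chunks):
    # 1. Index time: embed and store every chunk.
    return [(chunk, model.embed(chunk)) for chunk in chunks]


def search(model, index, query, k=2):
    # 2. Query time: embed the query with the same model.
    q = model.embed(query)
    # 3. Search: vectors are unit length, so the dot product is the cosine.
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "cats purr when they are content",
    "the stock market fell sharply today",
    "dogs bark at strangers",
]
model = ToyEmbedder(chunks)
index = build_index(model, chunks)
results = search(model, index, "why do cats purr", k=2)
print(results[0][1])
```

Step 4 (reranking) would take `results` as its candidate list and rescore them with a more expensive model.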
## Why "Same Model" Matters
A query embedded with model A and an index built with model B live in different geometric spaces. Even when the dimensions match, the cosine similarity between their vectors is mathematically defined but semantically meaningless. **Re-index whenever you change the embedding model.**
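
A toy demonstration of the point, under a deliberately simple assumption: the two hypothetical "models" below carry identical information, and model B merely rotates the axes by one position. Each model is self-consistent, yet mixing them yields a score unrelated to meaning.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def model_a(features):
    # Hypothetical model A: uses the feature axes as given.
    return list(features)


def model_b(features):
    # Hypothetical model B: same information, axes shifted by one position.
    return features[1:] + features[:1]


doc = [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0]
query = [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.5]

same_space_a = cosine(model_a(query), model_a(doc))  # query and index both from A
same_space_b = cosine(model_b(query), model_b(doc))  # query and index both from B
cross_space = cosine(model_b(query), model_a(doc))   # query from B, index from A

print(round(same_space_a, 3), round(same_space_b, 3), round(cross_space, 3))
```

Real models differ by far more than an axis shift, so the cross-space score is at least this unreliable in practice.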
## Symmetric vs Asymmetric Embeddings
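- **Symmetric**: queries and documents are encoded by the same model and pooled the same way. Suited to clustering and deduplication, where both sides are the same kind of text.
- **Asymmetric**: queries and documents are encoded differently, either by separate encoders trained jointly into one shared space or by one encoder switched through pooling or a prefix. Often better for RAG because the query distribution differs from the document distribution (short questions vs. longer passages).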
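Modern embedding model families (Voyage, OpenAI, Cohere, BGE) differ in how they handle this: some accept a prefix like `"query: ..."` vs `"passage: ..."` or an input-type parameter to switch modes, while others use a single encoding for both sides. Check the model's documentation before assuming either.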
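
For prefix-switched models, the mode selection is plain string formatting applied before the text reaches the encoder. A minimal sketch follows; `format_for_embedding` is a hypothetical helper, and the prefix strings are the illustrative ones from the text, so the exact strings (if any) should be taken from the specific model's documentation.

```python
# Illustrative prefixes only; real models may use different strings or none.
PREFIXES = {"query": "query: ", "passage": "passage: "}


def format_for_embedding(text, role):
    """Prepend the mode prefix for the given role ('query' or 'passage')."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return PREFIXES[role] + text.strip()


index_input = format_for_embedding("Cats purr when content.", "passage")
query_input = format_for_embedding("  why do cats purr?  ", "query")
print(index_input)
print(query_input)
```

The same formatting must be applied consistently: passages get their prefix at index time, queries get theirs at query time.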
## Chunking Decisions
The chunk is the unit of retrieval. Its size and boundaries affect everything downstream:
- **Too large.** Retrieved chunks contain mostly irrelevant content; the LLM has to filter noise.
- **Too small.** Meaning fragments across chunks; the right answer requires multiple hits stitched together.
- **Overlap.** A small overlap (e.g., 10–20%) hedges against boundary cases.
- **Structural chunking.** Respect natural boundaries (sections, functions, paragraphs) over fixed token counts.
The right chunk size is domain-specific. Measure on representative queries.
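
Two of the strategies above, sketched under simple assumptions: tokens are whitespace-separated words, and paragraphs are separated by blank lines. Both function names are hypothetical; a real pipeline would use the embedding model's tokenizer and a format-aware splitter.

```python
def chunk_fixed(tokens, size, overlap):
    """Sliding window of `size` tokens; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def chunk_by_paragraph(text, max_words):
    """Structural chunking: pack whole paragraphs greedily up to `max_words`.
    A paragraph is never split, so an oversized one becomes its own chunk."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


fixed = chunk_fixed(list(range(10)), size=4, overlap=1)
structural = chunk_by_paragraph("a b c\n\nd e\n\nf g h i", max_words=5)
print(fixed)
print(structural)
```

Comparing retrieval metrics across a few `size` and `overlap` settings on representative queries is one way to pick values for a given domain.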
## Quality Levers
- **Better embedding model.** The single biggest lever, usually.
- **Better chunking.** Often underweighted; can shift retrieval quality substantially.
- **Reranker.** Rescore the top candidates (e.g., the top 50) with a cross-encoder.
- **Hybrid with BM25.** See [[Semantic Search]].
- **Query rewriting.** Have an LLM expand or rephrase the query before embedding (HyDE, multi-query retrieval).
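
The reranker lever follows a two-stage shape: a cheap scorer shortlists candidates, then a more precise scorer reorders the shortlist. In the sketch below both scorers are toy stand-ins (term counts for embedding similarity, query-term coverage for a cross-encoder), and all function names are hypothetical.

```python
def retrieve_then_rerank(query, docs, first_stage, reranker, n_candidates=50, k=5):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    shortlist = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)
    shortlist = shortlist[:n_candidates]
    return sorted(shortlist, key=lambda d: reranker(query, d), reverse=True)[:k]


def term_count(query, doc):
    # Toy first stage: total occurrences of query words in the document.
    words = doc.lower().split()
    return sum(words.count(w) for w in query.lower().split())


def term_coverage(query, doc):
    # Toy reranker: fraction of distinct query words the document contains.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)


docs = [
    "apple apple apple stock price",
    "recipe for a classic apple pie",
    "pie chart tutorial",
    "weather report",
]
top = retrieve_then_rerank(
    "apple pie recipe", docs, term_count, term_coverage, n_candidates=3, k=2
)
print(top)
```

The first stage ties the stock-price and recipe documents; the second stage separates them, which is the effect a cross-encoder is meant to have on near-miss candidates.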
## Evaluation
Retrieval is evaluable independently of the generator:
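- **Hit Rate@k**: does the right document appear in the top-$k$?
- **MRR** (Mean Reciprocal Rank): how high does the first relevant document rank? Averages $1/\text{rank}$ over queries.
- **nDCG** (normalised Discounted Cumulative Gain): rewards placing more relevant documents higher; supports graded relevance.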
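Frameworks like RAGAS and Tonic Validate operationalise retrieval and answer-quality metrics for production RAG. RAGatouille, by contrast, is a retrieval library (ColBERT-style late interaction) rather than an evaluation framework.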
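
The three metrics are small enough to compute by hand on a labelled query set. A minimal sketch, assuming each query comes with a ranked list of document ids plus either a set of relevant ids or a dict of graded gains. The nDCG here uses linear gain; some implementations use $2^{rel} - 1$ instead.

```python
import math


def hit_rate_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        1 for ranking, rel in zip(rankings, relevant)
        if any(doc in rel for doc in ranking[:k])
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, relevant):
    """Mean of 1/rank of the first relevant doc (0 when none is retrieved)."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


def ndcg_at_k(ranking, gains, k):
    """nDCG for one query; `gains` maps doc id -> graded relevance."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(i + 1)
        for i, doc in enumerate(ranking[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1"}, {"d5"}]
hit = hit_rate_at_k(rankings, relevant, k=3)   # query 1 hits, query 2 misses
mrr = mean_reciprocal_rank(rankings, relevant)  # (1/2 + 0) / 2
ndcg = ndcg_at_k(["d3", "d1", "d7"], {"d1": 3, "d7": 1}, k=3)
print(hit, mrr, round(ndcg, 3))
```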
## Related
- [[Embedding]]
- [[Semantic Search]]
- [[Vector Database]]
- [[Retrieval-Augmented Generation]]