Semantic Search - Albert Masoliver's learning site

## Definition

**Semantic search** retrieves documents by *meaning* rather than by exact lexical overlap, using vector embeddings of queries and documents in a shared space. The same query, phrased differently, should return the same relevant documents.

## Vs Lexical Search (BM25)

| Property | Lexical (BM25) | Semantic (dense) |
| --- | --- | --- |
| Matches exact terms | Strong | Weak |
| Matches paraphrases / synonyms | Weak | Strong |
| Handles typos | Weak | Stronger |
| Cross-lingual | None natively | Yes (with multilingual embeddings) |
| Explainable | Yes (matched terms) | Mostly opaque |
| Index/query cost | Cheap | More expensive |
| Cold start (no training) | Works | Requires embedding model |

Neither is universally better. The dominant production pattern is **hybrid**.

## Hybrid Search

Run both lexical and semantic search, then merge:

- **Reciprocal Rank Fusion (RRF)**: combine rank positions, not raw scores. Robust and simple; the workhorse.
- **Learned linear combination**: weighted sum of lexical and semantic scores.
- **Cross-encoder reranking**: top-N from both methods passed to a cross-encoder that scores each (query, doc) pair jointly.

In practice, hybrid + cross-encoder reranker is the strongest off-the-shelf retrieval setup as of 2026.

## Why It Mattered

Pre-2020, search was overwhelmingly lexical. Semantic search opened:

- **"What's the doc about user authentication that doesn't use the word 'authentication'?"** Semantic finds it; lexical doesn't.
- **Cross-lingual search.** Embed in 100+ languages; match queries across them.
- **Question-answering at scale.** Retrieve the few passages that *answer* a question, not the many that *mention* its keywords.

## Failure Modes

- **Domain mismatch.** A general-purpose embedding model misses domain jargon. Mitigate with domain-specific embeddings or fine-tuning.
- **Spurious geometric closeness.** Two unrelated documents can land near each other in embedding space. Mitigate with reranking.
- **No exact-match fallback.** A user searching for a specific product code wants the product code, not "similar things." Hybrid restores this.

## Connection to RAG

Semantic search is the *retrieval* step of [[Retrieval-Augmented Generation]]. The vector DB ([[Vector Database]]) is the storage layer. The generation layer is the LLM.

## Related

- [[Embedding]]
- [[Vector Database]]
- [[Retrieval-Augmented Generation]]
- [[Embedding-Based Retrieval]]
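
## Worked Example: Reciprocal Rank Fusion

The RRF merge described under Hybrid Search can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `reciprocal_rank_fusion`, the document IDs, and the smoothing constant `k=60` (a commonly used default, not something these notes specify) are all assumptions made for the example. Each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in, so only rank positions matter and the raw BM25 and embedding scores never need to be put on a common scale.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default for this sketch.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a lexical (BM25) and a semantic retriever.
lexical = ["sku-4471", "doc-auth", "doc-login"]
semantic = ["doc-login", "doc-auth", "doc-sso"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)  # ['doc-login', 'doc-auth', 'sku-4471', 'doc-sso']
```

Documents returned by both retrievers (`doc-login`, `doc-auth`) rise above documents that only one retriever found, while the exact-match hit `sku-4471` still survives in the fused list. A cross-encoder reranker, if used, would then rescore the top of this fused list.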