Term-Based Retrieval - Albert Masoliver's learning site

## Definition

**Term-based retrieval** (also called lexical retrieval) ranks documents by matching the literal terms in a query to the terms in indexed documents, using statistical weighting rather than semantic meaning. It is the backbone of classical search engines and a fast, reliable baseline for [[Retrieval-Augmented Generation]] systems.

## TF-IDF

The core weighting scheme behind most term-based systems combines two signals:

- **Term Frequency (TF)**: how many times a term $t$ appears in document $D$, written $f(t, D)$. The assumption is that more occurrences signal higher relevance.
- **Inverse Document Frequency (IDF)**: how rare the term is across the whole corpus. If $N$ is the total number of documents and $C(t)$ is the number of documents that contain $t$:

$$
\text{IDF}(t) = \log \frac{N}{C(t)}
$$

Common stop words like "the" appear in almost every document, so $C(t)$ is close to $N$ and their IDF is close to zero; they contribute little to the score.

The naive TF-IDF score of document $D$ for query $Q$ with terms $t_1, \ldots, t_q$ is:

$$
\text{Score}(D, Q) = \sum_{i=1}^{q} \text{IDF}(t_i) \times f(t_i, D)
$$

## BM25

**Okapi BM25** (Best Matching 25, Robertson et al., 1980s) is the industry-standard refinement of TF-IDF. Its key improvement is normalising term frequency by document length: longer documents naturally contain a term more often, so raw TF overcounts their relevance. BM25 and its variants (BM25+, BM25F) remain formidable baselines that modern embedding-based systems must beat to justify their added cost.

## Inverted Index

Term-based retrieval is fast because documents are pre-indexed in an **inverted index**: a dictionary mapping each term to the list of documents that contain it, along with stored term frequencies. Given a query, the retriever looks up each query term directly rather than scanning every document.

```
Term     | Doc count | (doc_id, TF) pairs
---------|-----------|--------------------
"banana" | 2         | (10, 3), (5, 2)
"model"  | 17        | (1, 5), (10, 1), ...
```

Elasticsearch (first released in 2010), built on Apache Lucene, is among the most widely used open-source implementations.

## Tokenisation Choices

Before indexing, text is broken into terms (tokenisation). Common steps:

- Lowercase all characters.
- Remove punctuation and stop words.
- Handle n-grams: the bigram "hot dog" can be treated as a single term to preserve compound meaning.

Classical NLP libraries (NLTK, spaCy, Stanford CoreNLP) provide tools for these steps.

## Strengths and Limitations

| Aspect | Verdict |
|---|---|
| Speed | Much faster than [[Embedding-Based Retrieval]]: term lookup beats nearest-neighbour search |
| Baseline quality | Strong out of the box; no training required |
| Exact-match keywords | Excellent for product codes, error codes, and proper nouns |
| Semantic understanding | None: querying "transformer architecture" may return results about the electrical device |
| Tuning surface | Limited; vocabulary and weighting are the main levers |

Because term-based retrieval operates lexically, it fails when the query and document use different words for the same concept ("sofa" vs "couch"). This limitation motivates [[Hybrid Search]] and [[Embedding-Based Retrieval]].

## Related

- [[Retrieval-Augmented Generation]]
- [[Embedding-Based Retrieval]]
- [[Hybrid Search]]
- [[Semantic Search]]
- [[Vector Database]]

## Sources

- [[AI Engineering - Chip Huyen]]
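
## Worked Example

The inverted index, tokenisation steps, and naive TF-IDF score described in this note can be tied together in a few lines of Python. This is only a minimal sketch, not how Lucene or Elasticsearch work internally. The three-document corpus, the tiny stop-word list, and the helper names `tokenize`, `build_index`, and `tfidf_scores` are all assumptions made up for illustration; none of them come from a library.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def tfidf_scores(query, index, n_docs):
    """Naive TF-IDF: sum of IDF(t) * f(t, D) over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term is absent from the corpus
        idf = math.log(n_docs / len(postings))  # IDF(t) = log(N / C(t))
        for doc_id, tf in postings.items():
            scores[doc_id] += idf * tf
    return dict(scores)


# Toy corpus (hypothetical documents).
docs = {
    1: "The banana is a fruit.",
    2: "Banana bread: a banana recipe.",
    3: "The model is trained on text.",
}

index = build_index(docs)
scores = tfidf_scores("banana recipe", index, len(docs))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # document 2 ranks above document 1
```

Only documents that appear in the postings of a query term are ever touched, which is why document 3 gets no score at all. Document 2 outranks document 1 because it contains "banana" twice and is the only document containing the rarer term "recipe".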