## Definition
**Term-based retrieval** (also called lexical retrieval) ranks documents by matching the literal terms in a query to the terms in indexed documents, using statistical weighting rather than semantic meaning. It is the backbone of classical search engines and a fast, reliable baseline for [[Retrieval-Augmented Generation]] systems.
## TF-IDF
The core weighting scheme behind most term-based systems combines two signals:
- **Term Frequency (TF)** — how many times a term $t$ appears in document $D$, written $f(t, D)$. The assumption: more occurrences signal higher relevance.
- **Inverse Document Frequency (IDF)** — how rare the term is across the whole corpus. If $N$ is the total number of documents and $C(t)$ is the count that contain $t$:
$$
\text{IDF}(t) = \log \frac{N}{C(t)}
$$
Common stop words like "the" have high $C(t)$ and therefore low IDF, so they contribute little.
The naive TF-IDF score of document $D$ for query $Q$ with terms $t_1, \ldots, t_q$ is:
$$
\text{Score}(D, Q) = \sum_{i=1}^{q} \text{IDF}(t_i) \times f(t_i, D)
$$
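The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: documents are assumed to be pre-tokenised lists of terms, and the function names `idf` and `tfidf_score`, the toy corpus, and the choice to return 0 for terms that appear in no document are all assumptions made for the example.

```python
import math
from collections import Counter

def idf(term, docs):
    """IDF(t) = log(N / C(t)), with N = len(docs) and C(t) = docs containing t."""
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: unseen terms contribute nothing
    return math.log(len(docs) / n_containing)

def tfidf_score(query_terms, doc, docs):
    """Score(D, Q) = sum over query terms of IDF(t) * f(t, D)."""
    tf = Counter(doc)  # f(t, D) for every term in the document
    return sum(idf(t, docs) * tf[t] for t in query_terms)

# Toy corpus (hypothetical): each document is already split into terms.
docs = [
    ["the", "banana", "is", "a", "banana"],
    ["the", "model", "is", "large"],
    ["the", "banana", "model"],
]
query = ["the", "banana"]
scores = [tfidf_score(query, d, docs) for d in docs]
```

Here "the" occurs in every document, so its IDF is $\log(3/3) = 0$ and only "banana" affects the ranking; the first document, which contains it twice, scores highest.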
## BM25
**Okapi BM25** (Best Matching 25), developed by Robertson and colleagues and introduced with the Okapi system in the 1990s, is the industry-standard refinement of TF-IDF. It makes two changes to raw term frequency. First, it normalises by document length: longer documents naturally contain a term more often, so raw TF overcounts their relevance. Second, it saturates term frequency, so each additional occurrence of a term adds less to the score than the one before. BM25 and its variants (BM25+, BM25F) remain formidable baselines that modern embedding-based systems must beat to justify their added cost.
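As a rough sketch of how the length normalisation works, one common formulation scores each query term as $\text{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\,(1 - b + b\,|D| / \text{avgdl})}$, where $|D|$ is the document length and $\text{avgdl}$ the average length in the corpus. The note itself does not give a formula, so everything below is an assumption: the function name `bm25_score`, the smoothed IDF variant (the one used by Lucene), and the defaults $k_1 = 1.2$, $b = 0.75$.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """BM25 score of `doc` for `query_terms`; `docs` is the non-empty corpus.

    k1 controls how quickly term frequency saturates;
    b controls how strongly document length is penalised (0 = not at all).
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    tf = Counter(doc)
    # Length normalisation: > 1 for longer-than-average documents.
    norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in docs if t in d)  # C(t)
        # Smoothed IDF variant (assumption), always non-negative.
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
    return score

# Hypothetical corpus: both documents contain "banana" once,
# but the second is padded with unrelated terms.
short_doc = ["banana", "bread"]
long_doc = ["banana"] + ["filler"] * 20
docs = [short_doc, long_doc, ["model"]]
short_score = bm25_score(["banana"], short_doc, docs)
long_score = bm25_score(["banana"], long_doc, docs)
```

With equal term frequency, the shorter document scores higher, which is the behaviour raw TF-IDF lacks.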
## Inverted Index
Term-based retrieval is fast because documents are pre-indexed in an **inverted index**: a dictionary mapping each term to the list of documents that contain it, along with stored term frequencies. Given a query, the retriever performs a direct lookup rather than scanning every document.
```
Term | Doc count | (doc_id, TF) pairs
---------|-----------|--------------------
"banana" | 2 | (10, 3), (5, 2)
"model" | 17 | (1, 5), (10, 1), ...
```
Elasticsearch (first released in 2010), built on the open-source Apache Lucene library, is one of the most widely used implementations.
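The structure in the table above can be sketched as a plain dictionary. This is a toy in-memory version under stated assumptions: documents arrive as a mapping from document id to a list of terms, and the names `build_inverted_index` and `candidates` are hypothetical. Real engines add compression, on-disk segments, and positional data.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index[term].append((doc_id, tf))
    return index

def candidates(index, query_terms):
    """Ids of documents containing at least one query term, found by lookup."""
    return {doc_id for t in query_terms for doc_id, _ in index.get(t, [])}

# Hypothetical corpus chosen to mirror the "banana" row of the table.
docs = {
    10: ["banana", "banana", "banana", "model"],
    5: ["banana", "split", "banana"],
    1: ["model"] * 5,
}
index = build_inverted_index(docs)
banana_postings = index["banana"]        # [(10, 3), (5, 2)]
banana_doc_count = len(banana_postings)  # 2
hits = candidates(index, ["banana"])     # {10, 5}
```

Only the postings lists for the query terms are touched, so document 1 is never examined for the query "banana".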
## Tokenisation Choices
Before indexing, text is broken into terms (tokenisation). Common steps:
- Lowercase all characters.
- Remove punctuation and stop words.
- Handle n-grams: the bigram "hot dog" can be treated as a single term to preserve compound meaning.
Classical NLP libraries (NLTK, spaCy, Stanford CoreNLP) provide ready-made tokenisers and related preprocessing utilities, so these steps rarely need to be written by hand.
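The steps above can be sketched with the standard library alone. The stop-word list, the `COMPOUNDS` set, and the function name `tokenise` are illustrative assumptions; a real pipeline would more likely use one of the libraries mentioned above with a fuller stop-word list.

```python
import re

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # tiny illustrative list
COMPOUNDS = {("hot", "dog")}  # bigrams to keep as a single term (assumption)

def tokenise(text, compounds=COMPOUNDS):
    """Lowercase, drop punctuation and stop words, then merge known bigrams."""
    # Keeping only runs of letters and digits removes punctuation implicitly.
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]
    terms = []
    i = 0
    while i < len(words):
        # Simplification: bigrams are matched after stop-word removal.
        if i + 1 < len(words) and (words[i], words[i + 1]) in compounds:
            terms.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms

terms = tokenise("The hot dog is a classic, to many!")
# terms == ["hot dog", "classic", "many"]
```

The same function must be applied to both documents at index time and queries at search time; otherwise the terms will not match.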
## Strengths and Limitations
| Aspect | Verdict |
|---|---|
| Speed | Typically much faster than [[Embedding-Based Retrieval]]: an inverted-index lookup is cheaper than nearest-neighbour search over vectors |
| Baseline quality | Strong out of the box; no training required |
| Exact-match keywords | Excellent — product codes, error codes, proper nouns |
| Semantic understanding | None — querying "transformer architecture" may return results about the electrical device |
| Tuning surface | Limited; vocabulary and weighting are the main levers |
Because term-based retrieval operates lexically, it fails when the query and document use different words for the same concept ("sofa" vs "couch"). This limitation motivates [[Hybrid Search]] and [[Embedding-Based Retrieval]].
## Related
- [[Retrieval-Augmented Generation]]
- [[Embedding-Based Retrieval]]
- [[Hybrid Search]]
- [[Semantic Search]]
- [[Vector Database]]
## Sources
- [[AI Engineering - Chip Huyen]]