Finetuning vs RAG - Albert Masoliver's learning site

## Definition

**Finetuning vs RAG** is a decision framework for choosing between adapting a model's weights (finetuning) and giving a model access to external retrieved knowledge ([[Retrieval-Augmented Generation|RAG]]) to improve its performance. The key diagnostic is whether a model's failures are **information-based** or **behavior-based**.

## The Diagnostic Question

> "Is the model failing because it lacks information, or because it behaves incorrectly?"

**Information-based failures**: outputs are factually wrong or outdated.

- The model doesn't have private/internal knowledge.
- The model's training cut-off predates relevant events.

RAG is the appropriate remedy.

**Behavior-based failures**: the model has the right knowledge, but:

- Outputs are in the wrong format (e.g., invalid HTML, non-standard SQL dialect).
- Outputs lack the required style, depth, or structure.
- The model ignores specific instructions or produces unsafe responses.

Finetuning is the appropriate remedy.

The empirical summary: **finetuning is for form; RAG is for facts.**

## Empirical Evidence

Ovadia et al. (2024, "Fine-Tuning or Retrieval?") showed that for current-event QA tasks, RAG with the base model outperformed both RAG with finetuned models and finetuning alone. Finetuning can enhance one task while degrading the base model's general capabilities, a phenomenon sometimes called an alignment tax.

Combining the two helps only some of the time: on MMLU benchmarks, RAG with a finetuned model improved performance over RAG alone 43% of the time and was no better the remaining 57% of the time, suggesting limited compounding benefit in most cases.

## Recommended Workflow

1. **Prompt engineering**: exhaust prompt-based improvements first (Chapter 5/6 techniques). Many apparent finetuning needs dissolve with better prompts.
2. **Add examples** to the prompt (1–50 few-shot examples).
3. **RAG with simple retrieval** (term-based, e.g., BM25) if information gaps exist.
4. **Advanced RAG** (embedding-based, hybrid search) for persistent information failures.
5. **Finetuning** for persistent behavioral failures (format, style, safety, domain-specific syntax).
6. **RAG + finetuning** together if both failure types persist.

Evaluation criteria and an evaluation pipeline should be established before any of these steps. Evaluation then runs at every iteration, not only at the beginning.

## Why Not Always Finetune

Finetuning incurs high upfront costs: annotated data (slow and expensive to collect), ML talent to run training, infrastructure to serve the finetuned model, and ongoing maintenance as base models evolve. Starting with finetuning before exhausting prompting is usually a mistake; anecdotally, many "finetuning is needed" conclusions dissolve once prompting experiments are properly designed and executed.

Prompt caching (repetitive prompt segments cached for reuse) also reduces the token-efficiency benefit of finetuning, though finetuning still wins when the number of examples exceeds context limits.

## Related

- [[Fine-Tuning]]
- [[Retrieval-Augmented Generation]]
- [[In-Context Learning]]
- [[Prompt Engineering]]
- [[Hallucination]]

## Sources

- [[AI Engineering - Chip Huyen]]
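
## Example Sketch

The diagnostic question and recommended workflow above can be sketched as a small decision function. This is a minimal illustration and not part of the source: the names `FailureType` and `recommend_next_step`, the two boolean flags, and the returned strings are all hypothetical, and it assumes each failure found during evaluation has already been labeled by hand as information-based or behavior-based.

```python
from enum import Enum
from typing import Iterable


class FailureType(Enum):
    """Hypothetical labels for failures found during evaluation."""
    INFORMATION = "information"  # facts are wrong, outdated, or missing
    BEHAVIOR = "behavior"        # format, style, instruction-following, safety


def recommend_next_step(
    failures: Iterable[FailureType],
    prompting_exhausted: bool,
    simple_rag_tried: bool = False,
) -> str:
    """Map labeled failures to the next workflow step (illustrative only)."""
    kinds = set(failures)
    if not kinds:
        return "no change needed"
    # Steps 1-2: prompt engineering and few-shot examples come first.
    if not prompting_exhausted:
        return "prompt engineering and few-shot examples"
    # Step 6: both failure types persist.
    if kinds == {FailureType.INFORMATION, FailureType.BEHAVIOR}:
        return "RAG + finetuning"
    # Step 5: behavior-based failures call for finetuning.
    if kinds == {FailureType.BEHAVIOR}:
        return "finetuning"
    # Steps 3-4: information-based failures call for RAG, simple retrieval first.
    if simple_rag_tried:
        return "advanced RAG (embedding-based or hybrid search)"
    return "simple RAG (term-based, e.g. BM25)"


# Outdated answers after prompting has been exhausted point to retrieval.
print(recommend_next_step([FailureType.INFORMATION], prompting_exhausted=True))
# prints: simple RAG (term-based, e.g. BM25)
```

The function deliberately returns only the next step rather than a full plan, since the workflow expects a fresh evaluation after each change.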