## Definition
**RLHF** (Reinforcement Learning from Human Feedback) is a post-training technique that uses human preference data to shape an LLM's behaviour. It was introduced as a practical recipe by Christiano et al. (2017), applied to summarisation by Stiennon et al. (2020), operationalised at scale by OpenAI's InstructGPT (2022), and has since been adopted across the field.
## The Three Phases
1. **Supervised Fine-Tuning (SFT).** Train the base model on high-quality demonstration data — humans writing ideal responses.
2. **Reward Model (RM) training.** Collect (prompt, response A, response B) triples in which humans pick the better response. Train a model that assigns each response a scalar score, so that the preferred response scores higher than the rejected one.
3. **RL fine-tuning (PPO).** Use Proximal Policy Optimisation: sample responses, score them with the RM, update the policy to favour high-scoring responses. A KL-divergence penalty keeps the policy close to the SFT model.
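
Phases 2 and 3 each reduce to a small formula, sketched below in plain Python. This is a minimal illustration and not any lab's implementation: the function names, the `beta` value, and the use of summed per-token log-probability differences as a KL estimate are assumptions made for the example. The reward model loss is the standard pairwise (Bradley-Terry) form, and the shaped reward is the RM score minus a KL-style penalty against the SFT reference.

```python
import math


def pairwise_rm_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(score_chosen - score_rejected)).

    Low when the reward model scores the human-preferred response higher.
    """
    margin = score_chosen - score_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """RM score minus a KL-style penalty for one sampled response.

    logp_policy / logp_ref: per-token log-probabilities of the sampled tokens
    under the current policy and under the frozen SFT reference model.
    beta: penalty strength (hypothetical value, tuned in practice).
    """
    # Sample-based estimate of the divergence: sum of per-token log-ratios.
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


# The RM is penalised more when it ranks the pair the wrong way round.
print(pairwise_rm_loss(2.0, 0.0) < pairwise_rm_loss(0.0, 2.0))

# A policy that drifts from the reference earns less than the raw RM score.
print(kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

Practical PPO implementations commonly apply the penalty per token and add value-function and clipping terms; the sequence-level version here only shows how the KL term trades off against the RM score.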
## Why It Worked
Pretrained base models are good at completion but bad at following instructions and refusing harmful requests. RLHF taught them to do both *without* a flood of labelled examples for every behaviour: judging which of two responses is better is faster and cheaper for a labeller than writing a gold answer, so preference data scales better.
## Failure Modes
- **Reward hacking.** The RM is only a learned proxy for human judgement, so the policy can find responses that score highly on the RM but that humans actually dislike (sycophancy, hedging, refusing too aggressively).
- **Mode collapse.** Outputs become repetitive and stylistically narrow.
- **Misalignment with the actual user.** The RM reflects the *labellers'* preferences, not necessarily the end user's.
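
One rough way to watch for mode collapse is to track the diversity of sampled outputs over training. The sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a batch of samples. The function name and the whitespace tokenisation are assumptions made for illustration; a real pipeline would likely use the model's tokenizer and compare against the SFT baseline.

```python
def distinct_n(samples, n=2):
    """Fraction of n-grams across all samples that are unique.

    Values near 1.0 mean varied outputs; values falling toward 0.0 suggest
    the policy is repeating the same phrases.
    """
    ngrams = []
    for text in samples:
        tokens = text.split()  # naive whitespace tokenisation (assumption)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


varied = ["the cat sat down", "a dog ran off"]
collapsed = ["I am happy to help", "I am happy to help"]
print(distinct_n(varied), distinct_n(collapsed))
```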
## Successors and Alternatives
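- **DPO (Direct Preference Optimisation).** Skips the explicit reward model and the RL loop. Its loss is derived from the closed-form solution of the KL-constrained reward objective, so the policy trains directly on preference pairs with a classification-style loss. Simpler to run and competitive with PPO-based RLHF in reported results.
- **[[Constitutional AI]].** Replaces human labellers with AI-generated feedback guided by a written constitution (RLAIF).
- **KTO**, **IPO**, **ORPO.** Further variations on the preference loss: KTO learns from unpaired good/bad labels instead of pairs, IPO adds regularisation against overfitting to the preference data, and ORPO drops the reference model by folding the preference term into SFT.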
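
For a single preference pair, the DPO objective can be sketched as follows. The inputs are summed log-probabilities of each full response under the policy being trained and under a frozen reference model (typically the SFT model). The function and argument names and the `beta` default are assumptions for illustration, not a library API.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a response under the
    trained policy (logp_*) or the frozen reference model (ref_logp_*).
    beta scales how strongly the policy is tied to the reference.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# With the policy equal to the reference, the margin is zero.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
print(baseline, improved)
```

The loss has the same shape as the pairwise reward model loss in classic RLHF, with the scaled log-ratios playing the role of reward scores. This is why no separate RM is needed.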
## Why It Matters for the Orchestrator
The alignment of a frontier model — what it will and won't do, its tone, its refusal patterns — is downstream of the RLHF / RLAIF data the lab collected. Two models with identical architectures can behave very differently because of post-training.
## Related
- [[Fine-Tuning]]
- [[Constitutional AI]]
- [[Alignment]]
- [[Large Language Model]]