RLHF - Albert Masoliver's learning site

## Definition

**RLHF** (Reinforcement Learning from Human Feedback) is the post-training technique that uses human preference data to shape an LLM's behaviour. It was introduced as a practical recipe by Christiano et al. (2017) and Stiennon et al. (2020), then operationalised at scale by OpenAI's InstructGPT (2022) and adopted across the field.

## The Three Phases

1. **Supervised Fine-Tuning (SFT).** Train the base model on high-quality demonstration data: ideal responses written by humans.
2. **Reward Model (RM) training.** Collect (prompt, response A, response B) triples in which humans pick the better response. Train a model that assigns each response a scalar score, so that the preferred response scores higher.
3. **RL fine-tuning (PPO).** Use Proximal Policy Optimisation: sample responses from the policy, score them with the RM, and update the policy to favour high-scoring responses. A KL-divergence penalty keeps the policy close to the SFT model.

## Why It Worked

Pretrained base models are good at completion but bad at following instructions and refusing harmful requests. RLHF taught them to do both *without* a flood of labelled examples for every behaviour. Preferences scale better than gold answers, because picking the better of two responses is cheaper for a labeller than writing an ideal one.

## Failure Modes

- **Reward hacking.** The policy finds responses that score highly with the RM but that humans actually dislike (sycophancy, hedging, refusing too aggressively).
- **Mode collapse.** Outputs become repetitive and stylistically narrow.
- **Misalignment with the actual user.** The RM reflects the *labellers'* preferences, not necessarily the end user's.

## Successors and Alternatives

- **DPO (Direct Preference Optimisation)**: skips the explicit reward model and trains directly on preference data with a closed-form objective. Simpler and competitive.
- **[[Constitutional AI]]**: replaces human preference labels with AI-generated feedback guided by a written constitution (RLAIF).
- **KTO**, **IPO**, **ORPO**: algorithmic refinements that further simplify the loss.

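
The objectives named in this note can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not any lab's actual implementation: it works on single scalar values rather than batches of token log-probabilities, the function names (`reward_model_loss`, `kl_shaped_reward`, `dpo_loss`) and the default `beta=0.1` are hypothetical, and the KL term is approximated by a single-sample log-ratio. It shows the pairwise loss commonly used to train the reward model, the KL-penalised reward used during RL fine-tuning, and the DPO loss that replaces both.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for reward model training.

    The loss is small when the human-preferred response gets the higher score.
    """
    return -log_sigmoid(score_chosen - score_rejected)


def kl_shaped_reward(rm_score: float, logp_policy: float,
                     logp_ref: float, beta: float = 0.1) -> float:
    """Reward seen by the RL step: RM score minus a drift penalty.

    logp_policy and logp_ref are the log-probabilities of the sampled
    response under the current policy and the frozen SFT model. Their
    difference is a single-sample estimate of the KL divergence.
    """
    return rm_score - beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, with no explicit reward model.

    The policy's log-ratio against the reference model plays the role
    of an implicit reward for each response.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # RM ranks the preferred response higher, so the loss is below log(2).
    print(reward_model_loss(2.0, 0.0))
    # The policy has drifted from the SFT model, so the reward is reduced.
    print(kl_shaped_reward(1.0, logp_policy=-3.0, logp_ref=-5.0))
    # The policy already favours the chosen response relative to the reference.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Note the structural similarity between `reward_model_loss` and `dpo_loss`: both are a negative log-sigmoid of a margin between the chosen and rejected response. In DPO the margin comes from policy and reference log-probabilities instead of a separately trained scorer, which is why the reward model can be dropped.
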
## Why It Matters for the Orchestrator

The alignment of a frontier model (what it will and won't do, its tone, its refusal patterns) is downstream of the RLHF / RLAIF data the lab collected. Two models with identical architectures can behave very differently because of post-training.

## Related

- [[Fine-Tuning]]
- [[Constitutional AI]]
- [[Alignment]]
- [[Large Language Model]]