Constitutional AI - Albert Masoliver's learning site

## Definition

**Constitutional AI (CAI)** is an alignment method developed at Anthropic (Bai et al., 2022) that replaces human feedback labels for harmlessness with AI feedback guided by a written **constitution**: a small set of natural-language principles. Its reinforcement stage is known as **RLAIF** (Reinforcement Learning from AI Feedback), and the two names are sometimes used interchangeably.

## The Two-Stage Method

1. **Supervised stage.** The model is prompted to **critique and revise** its own responses according to constitutional principles. The revised outputs become the SFT dataset.
2. **Reinforcement stage.** A feedback model compares pairs of responses against the constitution, producing preference data. A reward model is trained on these AI preferences, and PPO then fine-tunes the assistant against that reward model.

## The Constitution

A list of natural-language guidelines. Anthropic's published principles draw on:

- The UN Universal Declaration of Human Rights.
- Anti-deception and honesty clauses.
- Avoiding gratuitous harm.
- Respecting human autonomy.

The constitution is *public*; see Anthropic's *"Claude's Constitution"* publication.

## Why It Matters

- **Scalability.** Removes the bottleneck of human labelling for every harmful-behaviour edge case.
- **Debuggability.** If you disagree with a behaviour, change the principle and retrain. Compare with [[RLHF]], where the "rules" live implicitly in a large pool of preference labels.
- **Transparency.** Publishing the principles makes the alignment target *inspectable* in a way RLHF preference data is not.

## Limitations

- The constitution is itself authored by humans: values aren't escaped, just refactored.
- AI feedback can inherit biases from the model providing it.
- Principles interact in non-obvious ways; debugging is still hard.

## Lineage

- Bai et al., *Constitutional AI: Harmlessness from AI Feedback* (2022).
- DeepMind's *Sparrow* used a similar rules-from-text approach.
- Anthropic's later work refines CAI with finer-grained harm categories and contextual application.

## Underpins

The safety posture of the Claude family of models.
Every Claude release ships with a constitution Anthropic stands behind publicly.

## Related

- [[RLHF]]
- [[Alignment]]
- [[Fine-Tuning]]
- [[Constitutional AI (Bai et al.)]]
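
As a rough illustration of the two-stage method described earlier on this page, here is a minimal sketch of the data each stage produces. It assumes a "model" is just a function from a prompt string to a reply string. The names `critique_and_revise`, `label_preference` and `toy_model`, the prompt templates, and the two-line `CONSTITUTION` are all hypothetical, made up for this note; they are not Anthropic's prompts or code. The actual training steps (SFT, reward-model fitting, PPO) are left out.

```python
import random
from typing import Callable, Dict, List

# Assumption: a model is any callable mapping a prompt string to a reply string.
Model = Callable[[str], str]

# Hypothetical two-principle constitution, for illustration only.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest and least deceptive.",
]


def critique_and_revise(
    model: Model, prompt: str, principles: List[str], rng: random.Random
) -> Dict[str, str]:
    """Supervised stage: draft, critique against one sampled principle, revise."""
    draft = model(prompt)
    principle = rng.choice(principles)
    critique = model(
        f"Critique the response using this principle: {principle}\n"
        f"Prompt: {prompt}\nResponse: {draft}"
    )
    revision = model(
        "Rewrite the response to address the critique.\n"
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pair is one row of the SFT dataset.
    return {"prompt": prompt, "completion": revision}


def label_preference(
    feedback_model: Model,
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
) -> Dict[str, str]:
    """Reinforcement stage: ask a feedback model which response fits the principle."""
    verdict = feedback_model(
        f"Compare the responses using this principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    a_wins = verdict.strip().upper().startswith("A")
    chosen, rejected = (
        (response_a, response_b) if a_wins else (response_b, response_a)
    )
    # One (chosen, rejected) row of the preference data a reward model is trained on.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_model(text: str) -> str:
    """Deterministic stand-in for a language model, so the sketch runs offline."""
    if text.startswith("Critique"):
        return "The draft gives risky instructions."
    if text.startswith("Rewrite"):
        return "I can't help with that, but here is a safer option."
    if text.startswith("Compare"):
        return "B"
    return "Sure, here is how."


question = "How do I pick a lock?"
sft_example = critique_and_revise(toy_model, question, CONSTITUTION, random.Random(0))
pref_example = label_preference(
    toy_model, question, "Sure, here is how.", sft_example["completion"], CONSTITUTION[0]
)
```

Swapping `toy_model` for a real model call would leave the rest unchanged. The point of the sketch is that the only human-written input is the list of principles: both the SFT rows and the preference rows are produced by model calls.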