## Definition
**Constitutional AI (CAI)** is an alignment method developed at Anthropic (Bai et al., 2022) that replaces human harmlessness labels with AI feedback guided by a written **constitution**: a small set of natural-language principles. Its reinforcement stage is a form of **RLAIF** (Reinforcement Learning from AI Feedback), and the two terms are sometimes used interchangeably.
## The Two-Stage Method
1. **Supervised stage.** The model is prompted to **critique and revise** its own responses according to constitutional principles. The revised outputs become the SFT dataset.
2. **Reinforcement stage.** A feedback model is shown pairs of responses and asked which better satisfies a constitutional principle, producing preference data. A reward (preference) model is trained on these AI preferences, and PPO then fine-tunes the assistant against it.
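
The two stages above can be sketched roughly as follows. This is a minimal illustration, not the implementation from Bai et al.: the `Model` type, the prompt wording, `toy_model`, and both function names are assumptions made for the example, and a real pipeline would call an actual language model in place of the stub.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(prompt: str, principles: List[str], model: Model) -> str:
    """Supervised stage: draft a response, then critique and revise it
    once per principle. The final revision is the SFT target."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def label_preference(prompt: str, a: str, b: str, principle: str,
                     judge: Model) -> Dict[str, str]:
    """Reinforcement stage: ask a feedback model which response better
    follows the principle and record the result as a preference pair."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    ).strip().upper()
    chosen, rejected = (a, b) if verdict.startswith("A") else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_model(text: str) -> str:
    """Stand-in for a real LLM call (assumption: canned replies only)."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The response could cause harm."
    if "Answer A or B" in text:
        return "B"
    return "Sure, here is how to do it."


question = "How do I pick a lock?"
draft = toy_model(question)
revised = critique_and_revise(question, ["Avoid gratuitous harm."], toy_model)
pair = label_preference(question, draft, revised, "Avoid gratuitous harm.", toy_model)
```

Here `(question, revised)` would become one supervised example, and `pair` one row of preference data for the reward model. The hard A/B verdict is a simplification for the sketch; Bai et al. describe using the feedback model's probabilities over the two options as the training target.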
## The Constitution
A list of natural-language guidelines. Anthropic's published principles draw on:
- The UN Universal Declaration of Human Rights.
- Anti-deception and honesty clauses.
- Avoiding gratuitous harm.
- Respecting human autonomy.
The constitution is *public* — see Anthropic's *"Claude's Constitution"* publication.
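
Because the constitution is plain text, it can be held as ordinary data. The sketch below assumes a paired critique-request/revision-request layout, loosely following how Bai et al. (2022) phrase their supervised-stage principles. The `Principle` class, its field names, `sample_principle`, and the example wording are all hypothetical and are not quoted from any published constitution.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Principle:
    name: str              # hypothetical short label for bookkeeping
    critique_request: str  # instruction used when critiquing a response
    revision_request: str  # instruction used when rewriting the response


# Illustrative wording only; not taken from Anthropic's published text.
CONSTITUTION: List[Principle] = [
    Principle(
        name="honesty",
        critique_request="Identify any ways the response is deceptive.",
        revision_request="Rewrite the response to remove any deception.",
    ),
    Principle(
        name="harm",
        critique_request="Identify any gratuitously harmful content.",
        revision_request="Rewrite the response to remove harmful content.",
    ),
    Principle(
        name="autonomy",
        critique_request="Identify any ways the response disrespects the user's autonomy.",
        revision_request="Rewrite the response to respect the user's autonomy.",
    ),
]


def sample_principle(constitution: List[Principle], rng: random.Random) -> Principle:
    """Draw one principle at random, e.g. for a single critique-revision step."""
    return rng.choice(constitution)


picked = sample_principle(CONSTITUTION, random.Random(0))
```

Keeping principles as data rather than burying them in prompt templates is one way to make the "change the principle, retrain" workflow concrete: editing the list changes what the feedback step asks for without touching the pipeline code.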
## Why It Matters
- **Scalability.** Removes the bottleneck of human labelling for every harmful-behaviour edge case.
- **Debuggability.** Disagree with a behaviour, change the principle, retrain. Compare with [[RLHF]], where the "rules" live implicitly in millions of preference labels.
- **Transparency.** Publishing the principles makes the alignment target *inspectable* in a way RLHF preference data is not.
## Limitations
- The constitution is itself authored by humans — values aren't escaped, just refactored.
- AI feedback can inherit biases from the model providing it.
- Principles interact in non-obvious ways; debugging is still hard.
## Lineage
- Bai et al., *Constitutional AI* (2022).
- DeepMind's *Sparrow* (Glaese et al., 2022) also used natural-language rules, though with human raters judging rule violations rather than AI feedback.
- Anthropic's later work refines CAI with finer-grained harm categories and contextual application.
## Underpins
The safety posture of the Claude family of models. Anthropic has published the constitution used to train Claude and stands behind it publicly.
## Related
- [[RLHF]]
- [[Alignment]]
- [[Fine-Tuning]]
- [[Constitutional AI (Bai et al.)]]