## Definition
**Alignment** is the field and practice of ensuring an AI system's behaviour matches its operator's and users' intent. For LLMs specifically: techniques that move a base model toward being **helpful, harmless, and honest** (the "HHH" framing).
## The HHH Framing
- **Helpful** — provides genuine value on the user's request.
- **Harmless** — refuses to assist with clearly harmful tasks; doesn't generate disallowed content.
- **Honest** — doesn't deceive; expresses uncertainty when warranted (see [[Hallucination]]).
These three pull against each other regularly — a more cautious model is less helpful; a more helpful model risks honesty by inventing answers. Alignment work is largely about navigating these tensions.
## Techniques
- **Supervised fine-tuning** on curated examples.
- **[[RLHF]]** — preference data shapes refusals and helpfulness.
- **[[Constitutional AI]]** — principle-based AI feedback.
- **System prompts and tool-use restrictions** at deployment time.
- **Adversarial training** — generate jailbreak attempts and train against them.
## Outer vs Inner Alignment
- **Outer alignment.** Are we training the model toward the right objective?
- **Inner alignment.** Is the trained model actually optimising for that objective internally, or for a proxy?
Modern LLM alignment is mostly outer-alignment work. Inner alignment remains an open research problem.
## Why It Matters for the Orchestrator
- Different frontier models are aligned differently. Their refusal patterns, hedging behaviour, and willingness to express uncertainty vary.
- Choosing a model is partly choosing an alignment posture.
- Your `AGENTS.md` and system prompts add *another* layer of alignment on top of the model's training. Use deliberately.
## Threats to Alignment in Deployment
- **[[Prompt Injection]]** — malicious inputs override the operator's intent.
- **Jailbreaks** — adversarially-crafted prompts bypass safety training.
- **Reward hacking** — the model optimises a proxy of the operator's goals.
## Related
- [[RLHF]]
- [[Constitutional AI]]
- [[Hallucination]]
- [[Prompt Injection]]
- [[Building Effective AI Agents]]