Alignment - Albert Masoliver's learning site

## Definition **Alignment** is the field and practice of ensuring an AI system's behaviour matches its operator's and users' intent. For LLMs specifically: techniques that move a base model toward being **helpful, harmless, and honest** (the "HHH" framing). ## The HHH Framing - **Helpful** — provides genuine value on the user's request. - **Harmless** — refuses to assist with clearly harmful tasks; doesn't generate disallowed content. - **Honest** — doesn't deceive; expresses uncertainty when warranted (see [[Hallucination]]). These three pull against each other regularly — a more cautious model is less helpful; a more helpful model risks honesty by inventing answers. Alignment work is largely about navigating these tensions. ## Techniques - **Supervised fine-tuning** on curated examples. - **[[RLHF]]** — preference data shapes refusals and helpfulness. - **[[Constitutional AI]]** — principle-based AI feedback. - **System prompts and tool-use restrictions** at deployment time. - **Adversarial training** — generate jailbreak attempts and train against them. ## Outer vs Inner Alignment - **Outer alignment.** Are we training the model toward the right objective? - **Inner alignment.** Is the trained model actually optimising for that objective internally, or for a proxy? Modern LLM alignment is mostly outer-alignment work. Inner alignment remains an open research problem. ## Why It Matters for the Orchestrator - Different frontier models are aligned differently. Their refusal patterns, hedging behaviour, and willingness to express uncertainty vary. - Choosing a model is partly choosing an alignment posture. - Your `AGENTS.md` and system prompts add *another* layer of alignment on top of the model's training. Use deliberately. ## Threats to Alignment in Deployment - **[[Prompt Injection]]** — malicious inputs override the operator's intent. - **Jailbreaks** — adversarially-crafted prompts bypass safety training. - **Reward hacking** — the model optimises a proxy of the operator's goals. ## Related - [[RLHF]] - [[Constitutional AI]] - [[Hallucination]] - [[Prompt Injection]] - [[Building Effective AI Agents]]