## Definition
**Policy gradient** methods learn a policy $\pi_\theta(a \mid s)$, parameterised by $\theta$, directly by performing gradient ascent on the expected return. They are the main alternative to value-based methods such as [[Q-Learning]].
## The Policy Gradient Theorem
The gradient of expected return $J(\theta) = \mathbb{E}_\pi[\sum_t \gamma^t r_t]$ is:
$$
\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^\pi(s_t, a_t) \right]
$$
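The expectation is over the states and actions visited under $\pi_\theta$; in trajectory form the term is summed over time steps $t$. The identity needs no derivative of the environment dynamics, only of $\log \pi_\theta$, and it is the foundation of the policy-gradient algorithms described here.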
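As a sanity check of the identity, here is a minimal sketch for the simplest possible case: a one-step problem with a softmax policy over three actions, where the exact gradient is available in closed form. The function names, the logits, and the `q_values` are illustrative assumptions, not part of any library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def exact_gradient(logits, q_values):
    """Exact gradient of J = sum_a pi(a) Q(a) w.r.t. the logits."""
    probs = softmax(logits)
    j = sum(p * q for p, q in zip(probs, q_values))
    return [p * (q - j) for p, q in zip(probs, q_values)]


def score_function_estimate(logits, q_values, n_samples, rng):
    """Monte Carlo average of grad log pi(a) * Q(a) with a sampled from pi."""
    probs = softmax(logits)
    actions = list(range(len(logits)))
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(actions, weights=probs)[0]
        score = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += score[i] * q_values[a] / n_samples
    return grad


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.2, -0.1, 0.5]      # hypothetical policy parameters
    q_values = [1.0, 0.0, 2.0]     # hypothetical action values
    print("exact   :", exact_gradient(logits, q_values))
    print("sampled :", score_function_estimate(logits, q_values, 20000, rng))
```

With enough samples the two printed vectors should agree closely, which is all the theorem claims in this one-step setting.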
## REINFORCE (Vanilla Policy Gradient)
The simplest policy-gradient algorithm (Williams, 1992):
```
for each episode:
    sample trajectory τ = (s_0, a_0, r_0, ..., s_T) following π_θ
    compute returns G_t = Σ_{k≥t} γ^{k-t} r_k
    update θ ← θ + α · Σ_t ∇_θ log π_θ(a_t | s_t) · G_t
```
REINFORCE uses the sampled return $G_t$ as an unbiased Monte Carlo estimate of $Q^\pi(s_t, a_t)$. The algorithm is simple, but its gradient estimates have high variance.
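The pseudocode above can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not a reference implementation: the environment is a hypothetical toy in which the state is simply the time step, action 1 pays +1 and action 0 pays -1, and the policy is a tabular softmax. `HORIZON`, `step`, and `reinforce` are names invented for this example.

```python
import math
import random

HORIZON = 3     # hypothetical toy task: episodes last exactly 3 steps
N_ACTIONS = 2   # action 1 pays +1, action 0 pays -1


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Toy dynamics: the state is the time index and always advances by one."""
    reward = 1.0 if action == 1 else -1.0
    next_state = state + 1
    return next_state, reward, next_state == HORIZON


def sample_episode(theta, rng):
    """Roll out one trajectory of (state, action, reward) under the policy."""
    state, done, trajectory = 0, False, []
    while not done:
        probs = softmax(theta[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def discounted_returns(rewards, gamma):
    """G_t = sum over k >= t of gamma^(k-t) * r_k, computed backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def reinforce(n_episodes=500, alpha=0.1, gamma=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(HORIZON)]  # tabular logits
    for _ in range(n_episodes):
        trajectory = sample_episode(theta, rng)
        returns = discounted_returns([r for _, _, r in trajectory], gamma)
        # Accumulate sum_t grad log pi(a_t | s_t) * G_t, then take one step.
        grad = [[0.0] * N_ACTIONS for _ in range(HORIZON)]
        for (s, a, _), g in zip(trajectory, returns):
            probs = softmax(theta[s])
            for i in range(N_ACTIONS):
                grad[s][i] += ((1.0 if i == a else 0.0) - probs[i]) * g
        for s in range(HORIZON):
            for i in range(N_ACTIONS):
                theta[s][i] += alpha * grad[s][i]
    return theta


if __name__ == "__main__":
    theta = reinforce()
    for s in range(HORIZON):
        print(s, softmax(theta[s]))
```

The toy task is deliberately easy so that the policy reliably drifts towards action 1 in every state; on realistic tasks the same loop typically needs many more episodes and a baseline to cope with the variance.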
## Reducing Variance: Baselines
Subtract a baseline $b(s)$ from the return:
$$
\nabla_\theta J(\theta) \approx \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]
$$
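The baseline does not bias the gradient, because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] \, b(s) = b(s) \, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$ for any $b$ that does not depend on the action, but a well-chosen baseline can substantially reduce variance. Using $b(s) = V^\pi(s)$ turns the weight into an estimate of the **advantage** $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which leads to [[Actor-Critic]] methods.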
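The effect of a baseline can be checked exactly in a small case. The sketch below assumes a one-step problem with a softmax policy and enumerates all actions to compute the mean and variance of the single-sample estimator $\nabla_\theta \log \pi_\theta(a) \cdot (Q(a) - b)$; the function `estimator_moments` and the numbers are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_moments(logits, q_values, baseline):
    """Exact per-component mean and variance of grad log pi(a) * (Q(a) - b)
    when a is drawn from the softmax policy, obtained by enumeration."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a, p_a in enumerate(probs):
        for i in range(n):
            score = (1.0 if i == a else 0.0) - probs[i]
            sample = score * (q_values[a] - baseline)
            mean[i] += p_a * sample
            second_moment[i] += p_a * sample * sample
    variance = [s - m * m for s, m in zip(second_moment, mean)]
    return mean, variance


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]           # hypothetical uniform policy
    q_values = [10.0, 11.0, 12.0]      # hypothetical action values
    value = sum(p * q for p, q in zip(softmax(logits), q_values))
    print("no baseline :", estimator_moments(logits, q_values, 0.0))
    print("b = V       :", estimator_moments(logits, q_values, value))
```

Both calls should report the same mean, while the variance with $b = V$ is much smaller, because the large common offset in the action values no longer multiplies the score.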
## When To Prefer Policy Gradient Over Q-Learning
- **Continuous or large action spaces** — no max over actions needed.
- **Stochastic policies** are natural (exploration baked in).
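- **Constrained action spaces** can be handled by the policy parameterisation itself, for example by squashing outputs into a bounded range or masking invalid actions.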
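To illustrate the continuous-action case from the list above, here is a minimal sketch of a one-dimensional Gaussian policy. It assumes the policy is parameterised by a mean and a log standard deviation (a common but not universal choice); the function names are hypothetical. The update needs only a sample and its score, with no maximisation over actions.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Draw a continuous action from N(mean, exp(log_std)^2)."""
    return rng.gauss(mean, math.exp(log_std))


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a 1-D Gaussian policy at the given action."""
    z = (action - mean) / math.exp(log_std)
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)


def gaussian_score(action, mean, log_std):
    """Gradient of the log-density w.r.t. (mean, log_std)."""
    std = math.exp(log_std)
    z = (action - mean) / std
    return z / std, z * z - 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    mean, log_std = 0.3, -0.5          # hypothetical policy parameters
    action = sample_action(mean, log_std, rng)
    print("action:", action)
    print("score :", gaussian_score(action, mean, log_std))
```

Multiplying the returned score by a return or advantage estimate gives one sample of the policy gradient for these two parameters.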
## When To Prefer Q-Learning
- **Discrete actions** with reasonable cardinality.
- **Off-policy learning** from a replay buffer is needed.
- **Sample efficiency** is critical (Q-learning is often more sample-efficient in simple settings, partly because it can reuse past transitions).
## Modern Policy-Gradient Algorithms
- **TRPO** (Trust Region Policy Optimisation) — constrains policy updates to a trust region.
- **PPO** (Proximal Policy Optimisation, Schulman et al. 2017) — clipped surrogate objective; the workhorse for many production RL systems. Used by OpenAI to train GPT-style models via [[RLHF]].
- **A2C / A3C** — synchronous / asynchronous actor-critic.
- **SAC** (Soft Actor-Critic) — maximum-entropy RL for continuous control.
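The clipped surrogate objective mentioned for PPO can be sketched in a few lines. This is an illustration of the objective only, not a full PPO implementation: it assumes the probability ratios $\pi_\theta(a \mid s) / \pi_{\theta_\text{old}}(a \mid s)$ and advantage estimates are already computed, and the function name and the default `clip_eps=0.2` are assumptions made for this example.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    ratios:     new-policy / old-policy probability ratios, one per sample
    advantages: advantage estimates, one per sample
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum makes the objective a pessimistic bound, so
        # moving the ratio far outside the clip range earns no extra credit.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # A large ratio with a positive advantage is capped at (1 + eps) * A.
    print(ppo_clip_objective([1.5], [1.0]))
    # A small ratio with a negative advantage is capped at (1 - eps) * A.
    print(ppo_clip_objective([0.5], [-1.0]))
```

The objective is maximised by gradient ascent on $\theta$; the clipping removes the incentive to move the new policy far from the old one within a single update.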
## Connection to RLHF
In LLM alignment via [[RLHF]], the "RL" step is essentially policy gradient: the policy is the language model, the reward comes from a learned reward model, and PPO is the optimisation algorithm. Viewed at the response level, the problem is mathematically simpler than control RL, with a single action (the response) and no environment dynamics, but the policy-gradient framework still underlies it.
## Related
- [[Actor-Critic]]
- [[Q-Learning]]
- [[Reinforcement Learning]]
- [[RLHF]]
- [[Gradient Descent]]