## Definition

**Policy gradient** methods learn a policy $\pi_\theta(a \mid s)$, parameterised by $\theta$, directly by performing gradient ascent on the expected return. They are the main alternative to value-based methods such as [[Q-Learning]].

## The Policy Gradient Theorem

The gradient of the expected return $J(\theta) = \mathbb{E}_\pi[\sum_t \gamma^t r_t]$ is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^\pi(s_t, a_t) \right]
$$

Strictly, each term also carries a factor $\gamma^t$ under the discounted objective; most implementations drop it. This result is the foundation of policy-gradient algorithms.

## REINFORCE (Vanilla Policy Gradient)

The simplest policy-gradient algorithm (Williams, 1992):

```
for each episode:
    sample trajectory τ = (s_0, a_0, r_0, ..., s_T) following π_θ
    for each step t, compute the return G_t = Σ_{k≥t} γ^{k-t} r_k
    update θ ← θ + α · Σ_t ∇_θ log π_θ(a_t | s_t) · G_t
```

REINFORCE uses the sampled return $G_t$ as an unbiased Monte Carlo estimate of $Q^\pi(s_t, a_t)$. The estimator is simple but has high variance.

## Reducing Variance: Baselines

Subtract a baseline $b(s)$ from the return:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_\pi \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]
$$

The baseline does not bias the gradient, because it does not depend on the action: $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)] = b(s) \, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$. A well-chosen baseline can reduce variance substantially. Using $b(s) = V^\pi(s)$ gives the **advantage** $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which leads to [[Actor-Critic]] methods.

## When To Prefer Policy Gradient Over Q-Learning

- **Continuous or large action spaces**: no max over actions is needed.
- **Stochastic policies** are natural, so exploration is built into the policy.
- **Constrained action spaces** can often be handled in the policy parameterisation itself, for example by squashing or masking the policy's outputs.

## When To Prefer Q-Learning

- **Discrete actions** with reasonable cardinality.
- **Off-policy learning** from a replay buffer is needed.
- **Sample efficiency** is critical (Q-learning is often more sample-efficient in simple settings).

## Modern Policy-Gradient Algorithms

- **TRPO** (Trust Region Policy Optimisation): constrains policy updates to a trust region.
- **PPO** (Proximal Policy Optimisation, Schulman et al. 2017): clipped surrogate objective; a widely used default in applied RL. Used by OpenAI to fine-tune GPT-style models via [[RLHF]].
- **A2C / A3C**: synchronous / asynchronous actor-critic.
- **SAC** (Soft Actor-Critic): maximum-entropy RL for continuous control.

## Connection to RLHF

In LLM alignment via [[RLHF]], the "RL" step is essentially policy gradient: the policy is the language model, the reward comes from a learned reward model, and PPO is the usual algorithm. The setting is mathematically simpler than control RL, since in the simplest view the whole response is a single action and there are no environment dynamics, but the same policy-gradient framework underlies it.

## Related

- [[Actor-Critic]]
- [[Q-Learning]]
- [[Reinforcement Learning]]
- [[RLHF]]
- [[Gradient Descent]]
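
## Worked Example: REINFORCE with a Baseline

The REINFORCE update and the baseline trick described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the chain environment, the function names (`discounted_returns`, `run_episode`, `reinforce`), the hyperparameters, and the running-mean baseline standing in for $V^\pi(s)$ are all hypothetical choices made for this example.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k>=t} gamma^(k-t) r_k, computed in one backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

def run_episode(theta, rng, n_states, max_steps):
    """Hypothetical chain: action 1 moves right, action 0 moves left
    (reflecting at state 0); reward 1 on reaching the right end."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(max_steps):
        a = rng.choice(2, p=softmax(theta[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if r == 1.0:  # goal reached, episode ends
            break
    return states, actions, rewards

def reinforce(n_states=4, episodes=5000, alpha=0.1, gamma=0.9,
              max_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, 2))   # tabular softmax policy logits
    baseline = np.zeros(n_states)     # assumption: running mean of G_t per state
    for _ in range(episodes):
        states, actions, rewards = run_episode(theta, rng, n_states, max_steps)
        returns = discounted_returns(rewards, gamma)
        grad = np.zeros_like(theta)
        for s, a, g in zip(states, actions, returns):
            # For a softmax policy, grad of log pi(a|s) w.r.t. the logits
            # of state s is one_hot(a) - pi(.|s).
            grad_log_pi = -softmax(theta[s])
            grad_log_pi[a] += 1.0
            grad[s] += grad_log_pi * (g - baseline[s])
        theta += alpha * grad         # gradient ascent step
        # Update the baseline only after the gradient has been computed,
        # so it never depends on actions from the episode it is applied to.
        for s, g in zip(states, returns):
            baseline[s] += 0.05 * (g - baseline[s])
    return theta

if __name__ == "__main__":
    theta = reinforce()
    for s, row in enumerate(theta[:-1]):
        print(f"state {s}: P(right) = {softmax(row)[1]:.3f}")
```

The last row of `theta` belongs to the terminal state and is never updated. Setting `baseline` to zero throughout recovers plain REINFORCE, which makes it easy to compare the variance of the two estimators on the same toy problem.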