Actor-Critic - Albert Masoliver's learning site

## Definition

**Actor-Critic** methods combine the policy-gradient (actor) and value-function (critic) approaches. The actor decides actions; the critic evaluates them. The hybrid typically yields lower variance than pure policy gradient and often faster convergence than pure value-based methods.

## Architecture

Two parameterised functions, often sharing a backbone neural network:

- **Actor**: $\pi_\theta(a \mid s)$, which outputs a distribution over actions.
- **Critic**: $V_\phi(s)$ (or $Q_\phi(s, a)$), which estimates value.

## Update Rules

The critic estimates the value of the current state. The actor uses the critic's estimate as a baseline:

**Critic update (regression):**

$$ \phi \leftarrow \phi - \beta \cdot \nabla_\phi (V_\phi(s) - G_t)^2 $$

where $G_t$ is the TD target $r + \gamma V_\phi(s')$, treated as a constant when taking the gradient.

**Actor update (policy gradient with baseline):**

$$ \theta \leftarrow \theta + \alpha \cdot \nabla_\theta \log \pi_\theta(a \mid s) \cdot A(s, a) $$

with **advantage** $A(s, a) = Q(s, a) - V(s) \approx r + \gamma V_\phi(s') - V_\phi(s)$ (the TD error).

## Why It Helps

Pure policy gradient (REINFORCE) uses noisy Monte Carlo returns as targets, which gives high-variance gradient estimates. Replacing returns with bootstrapped TD targets from the critic reduces variance substantially. The trade-off is some bias, because the critic's value estimates are imperfect.

## Major Variants

### A2C (Advantage Actor-Critic)

Synchronous parallel actor-critic. Multiple workers gather experience in parallel; updates are aggregated synchronously.

### A3C (Asynchronous Advantage Actor-Critic)

Each worker updates a shared parameter set asynchronously. Once popular but largely superseded by A2C and PPO.

### PPO (Proximal Policy Optimisation)

Actor-critic with a clipped surrogate objective that prevents catastrophic policy updates. The de facto standard for many modern RL applications, including [[RLHF]] for LLM alignment.

### DDPG / TD3 / SAC

Off-policy actor-critic methods for continuous action spaces. DDPG and TD3 learn a deterministic policy; SAC (Soft Actor-Critic) learns a stochastic policy under a maximum-entropy objective.

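
## Example: One-Step Update

The critic and actor updates from the Update Rules section can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes a tabular critic (`V`, a list indexed by state) and a softmax actor over per-state action preferences (`theta`), neither of which is specified in this note. The names `softmax`, `actor_critic_step`, and `train_bandit`, as well as the one-state, two-action task, are hypothetical and exist only for illustration.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha=0.05, beta=0.1, gamma=0.99):
    """Apply one one-step actor-critic update in place; return the TD error."""
    # TD target r + gamma * V(s'); no bootstrapping from a terminal state.
    target = r + (0.0 if done else gamma * V[s_next])
    # TD error, used as the advantage estimate A(s, a).
    td_error = target - V[s]

    # Critic: the gradient of (V(s) - target)^2 w.r.t. V(s), with the
    # target held constant, is 2 * (V(s) - target).
    V[s] -= beta * 2.0 * (V[s] - target)

    # Actor: for a softmax policy (an assumption of this sketch),
    # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * grad_log_pi * td_error
    return td_error


def train_bandit(episodes=2000, seed=0):
    """Hypothetical toy task: one state, action 1 pays reward 1, action 0 pays 0."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0]]   # actor preferences for the single state
    V = [0.0]              # critic value for the single state
    for _ in range(episodes):
        probs = softmax(theta[0])
        a = 1 if rng.random() < probs[1] else 0
        r = 1.0 if a == 1 else 0.0
        # Every episode ends after a single action.
        actor_critic_step(theta, V, s=0, a=a, r=r, s_next=0, done=True)
    return theta, V
```

Calling `theta, V = train_bandit()` and then inspecting `softmax(theta[0])` should show most of the probability mass on action 1, since that action always has a positive TD error while the critic's estimate is below 1. The default step sizes follow the hyperparameter guidance in this note (actor rate smaller than critic rate) but are otherwise arbitrary.
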
## Hyperparameters

- $\alpha$: actor learning rate (typically smaller than the critic's).
- $\beta$: critic learning rate.
- Discount $\gamma$.
- Entropy bonus weight: encourages exploration via maximum-entropy regularisation.

## When to Use

- Continuous action spaces.
- Sample-efficient training relative to pure policy gradient.
- Standard for most modern RL applications.
- LLM alignment ([[RLHF]] uses PPO, which is actor-critic).

## Related

- [[Policy Gradient]]
- [[Q-Learning]]
- [[Reinforcement Learning]]
- [[Markov Decision Process]]
- [[RLHF]]