## Definition
**Actor-Critic** methods combine the policy-gradient (actor) and value-function (critic) approaches. The actor selects actions; the critic evaluates them. The hybrid typically yields lower-variance gradient estimates than pure policy gradient and, in many problems, faster convergence than pure value-based methods.
## Architecture
Two parameterised functions, often sharing a backbone neural network:
- **Actor**: $\pi_\theta(a \mid s)$ — outputs a distribution over actions.
- **Critic**: $V_\phi(s)$ (or $Q_\phi(s, a)$) — estimates value.
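A minimal sketch of the shared-backbone layout, assuming a discrete action space and a single `tanh` hidden layer. The names `make_params` and `forward`, the layer sizes, and the initialisation range are illustrative choices, not part of any particular library.

```python
import math
import random


def make_params(n_features, n_hidden, n_actions, seed=0):
    """Create small random weights for a shared backbone and two heads."""
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "backbone": init(n_hidden, n_features),   # shared by actor and critic
        "actor_head": init(n_actions, n_hidden),  # produces action logits
        "critic_head": init(1, n_hidden),         # produces a scalar V(s)
    }


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(params, state):
    """Return (pi(.|s), V(s)) computed from one shared hidden representation."""
    hidden = [math.tanh(h) for h in matvec(params["backbone"], state)]
    action_probs = softmax(matvec(params["actor_head"], hidden))
    value = matvec(params["critic_head"], hidden)[0]
    return action_probs, value


params = make_params(n_features=4, n_hidden=8, n_actions=3)
probs, value = forward(params, [0.1, -0.2, 0.3, 0.0])
```

Sharing the backbone means both heads reuse the same features; the cost is that actor and critic gradients flow into the same weights and can interfere with each other.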
## Update Rules
The critic estimates the value of the state. The actor uses the critic's estimate as a baseline:
**Critic update (regression):**
$$
\phi \leftarrow \phi - \beta \cdot \nabla_\phi (V_\phi(s) - G_t)^2
$$
where $G_t$ is the TD target $r + \gamma V_\phi(s')$, treated as a constant when differentiating (a semi-gradient update).
**Actor update (policy gradient with baseline):**
$$
\theta \leftarrow \theta + \alpha \cdot \nabla_\theta \log \pi_\theta(a \mid s) \cdot A(s, a)
$$
with **advantage** $A(s, a) = Q(s, a) - V(s) \approx r + \gamma V_\phi(s') - V_\phi(s)$ (the TD error).
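The two updates above can be sketched for a single transition $(s, a, r, s')$. This assumes a tabular setting: `theta[s][a]` holds softmax action preferences and `v[s]` holds state values. The function name `actor_critic_step`, the `done` flag, and the default step sizes are illustrative assumptions.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """One-step actor-critic update; modifies theta and v in place."""
    # TD target, held constant while differentiating (semi-gradient).
    target = r if done else r + gamma * v[s_next]
    td_error = target - v[s]  # used as the advantage estimate A(s, a)

    # Action probabilities under the current (pre-update) policy.
    probs = softmax(theta[s])

    # Critic: d/dv[s] of (v[s] - target)^2 is 2 * (v[s] - target) = -2 * td_error,
    # so a descent step with rate beta adds 2 * beta * td_error.
    v[s] += beta * 2.0 * td_error

    # Actor: for a tabular softmax policy,
    # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * grad_log * td_error

    return td_error


theta = [[0.0, 0.0], [0.0, 0.0]]  # two states, two actions
v = [0.0, 0.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive TD error raises the preference for the action taken and lowers the others; a negative one does the opposite.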
## Why It Helps
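Pure policy gradient (REINFORCE) uses Monte Carlo returns as targets. These sum many random rewards, so they are noisy and the gradient estimate has high variance. Replacing returns with bootstrapped TD targets from the critic reduces variance substantially. The trade-off is bias: the critic's value estimates are imperfect, and bootstrapping carries their errors into the actor's gradient.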
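A small simulation can illustrate the variance gap. It assumes a toy process with independent zero-mean Gaussian rewards and a hypothetical critic that already returns the correct next-state value (zero here), so the TD target carries no bias in this setup. The function name `sample_targets` and all parameter values are assumptions for the demonstration.

```python
import random
import statistics


def sample_targets(n_episodes=2000, horizon=20, gamma=0.99,
                   reward_std=1.0, seed=0):
    """Return (variance of Monte Carlo targets, variance of one-step TD targets)."""
    rng = random.Random(seed)
    v_next = 0.0  # assumed critic estimate for the next state (exact in this toy case)
    mc_targets, td_targets = [], []
    for _ in range(n_episodes):
        rewards = [rng.gauss(0.0, reward_std) for _ in range(horizon)]
        # Monte Carlo target: discounted sum of every reward in the episode.
        mc_targets.append(sum(gamma ** k * r for k, r in enumerate(rewards)))
        # TD target: one sampled reward plus the bootstrapped estimate.
        td_targets.append(rewards[0] + gamma * v_next)
    return statistics.variance(mc_targets), statistics.variance(td_targets)


var_mc, var_td = sample_targets()
print(f"MC target variance: {var_mc:.2f}, TD target variance: {var_td:.2f}")
```

The Monte Carlo target accumulates noise from all 20 rewards, while the TD target depends on a single sampled reward. If `v_next` were wrong, the TD target would keep its low variance but become biased.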
## Major Variants
### A2C (Advantage Actor-Critic)
Synchronous parallel actor-critic. Multiple workers gather experience in parallel; updates are aggregated synchronously.
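One way to picture the synchronous aggregation is to average per-worker gradients and then apply a single update. This is a sketch under that assumption; real implementations often batch the workers' experience instead, which amounts to the same averaged gradient. The names `aggregate_gradients` and `synchronous_update` are hypothetical.

```python
def aggregate_gradients(worker_grads):
    """Average gradient vectors element-wise across workers."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]


def synchronous_update(params, worker_grads, lr):
    """Apply one gradient-ascent step once every worker has reported."""
    avg = aggregate_gradients(worker_grads)
    return [p + lr * g for p, g in zip(params, avg)]


# Two workers, two parameters.
new_params = synchronous_update([0.0, 1.0], [[1.0, 2.0], [3.0, 4.0]], lr=0.1)
```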
### A3C (Asynchronous Advantage Actor-Critic)
Each worker updates a shared parameter set asynchronously. Once popular but largely superseded by A2C and PPO.
### PPO (Proximal Policy Optimisation)
Actor-critic with a clipped surrogate objective that prevents catastrophic policy updates. The de facto standard for many modern RL applications, including [[RLHF]] for LLM alignment.
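The clipped surrogate objective can be sketched as below. For each sample it forms the probability ratio between the new and old policy, clips it to $[1 - \epsilon, 1 + \epsilon]$, and takes the smaller of the clipped and unclipped terms. The function name `ppo_clip_objective` is illustrative, and `clip_eps=0.2` is a commonly used value assumed here, not one taken from the text above.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximised)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # The min removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with positive advantage: the gain is capped at 1.2 * advantage.
capped_gain = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 0.5 with negative advantage: the objective is held at 0.8 * advantage.
capped_loss = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
```

Once the ratio leaves the clip range in the direction the advantage favours, the objective becomes flat and the gradient for that sample vanishes, which is what limits the size of each policy update.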
### DDPG / TD3 / SAC
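Actor-critic methods for continuous action spaces, all trained off-policy with a $Q_\phi(s, a)$ critic. DDPG and TD3 learn a deterministic actor; SAC (Soft Actor-Critic) learns a stochastic actor under a maximum-entropy objective.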
## Hyperparameters
- $\alpha$ — actor learning rate (typically smaller than critic's).
- $\beta$ — critic learning rate.
- Discount $\gamma$.
- Entropy bonus weight — encourages exploration via maximum-entropy regularisation.
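The entropy bonus can be sketched for a discrete policy as follows: the policy's entropy is added to the actor objective, scaled by the entropy weight. The function names and the default `entropy_weight=0.01` are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss(log_prob_action, advantage, probs, entropy_weight=0.01):
    """Negative actor objective for one sample, to be minimised."""
    objective = log_prob_action * advantage + entropy_weight * entropy(probs)
    return -objective


uniform = [0.5, 0.5]
peaked = [0.99, 0.01]
# With identical log-probability and advantage, the more uniform policy has lower loss.
loss_uniform = actor_loss(math.log(0.5), 1.0, uniform)
loss_peaked = actor_loss(math.log(0.5), 1.0, peaked)
```

A larger weight keeps the policy closer to uniform for longer, which favours exploration at the cost of slower commitment to high-advantage actions.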
## When to Use
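- Continuous action spaces, where taking a maximum over $Q(s, a)$ is impractical.
- When better sample efficiency than pure policy gradient is needed.
- As a default starting point: widely used modern RL algorithms such as PPO and SAC are actor-critic.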
- LLM alignment ([[RLHF]] uses PPO, which is actor-critic).
## Related
- [[Policy Gradient]]
- [[Q-Learning]]
- [[Reinforcement Learning]]
- [[Markov Decision Process]]
- [[RLHF]]