## Definition **Policy iteration** alternates two steps until convergence: (1) **evaluate** the current policy by computing its value function, (2) **improve** the policy by acting greedily with respect to the evaluation. A classical dynamic-programming algorithm for solving a known [[Markov Decision Process]]. ## Algorithm ``` initialise π arbitrarily repeat: # Policy Evaluation repeat until ||V_new - V|| < ε: for each state s: V_new(s) ← R(s, π(s)) + γ · Σ_s' P(s' | s, π(s)) · V(s') V ← V_new # Policy Improvement π_new(s) ← argmax_a [R(s, a) + γ · Σ_s' P(s' | s, a) · V(s')] if π_new = π: return π # policy unchanged → optimal π ← π_new ``` ## Why It Converges The **policy improvement theorem** guarantees that the new policy is *at least as good* as the old one — and strictly better unless already optimal. Combined with the finite number of deterministic policies in finite MDPs, convergence in finite steps is guaranteed. In practice, policy iteration converges in surprisingly few outer iterations — often a handful — even for large MDPs. ## Modified Policy Iteration Full policy evaluation each iteration is wasteful. **Modified policy iteration** does only a few evaluation sweeps (sometimes just one) between improvements. The result still converges and is often faster overall. In the limit, modified policy iteration with one sweep equals [[Value Iteration]] — they are two ends of a spectrum. ## Comparison to Value Iteration | Property | Value Iteration | Policy Iteration | |---|---|---| | Iterations to converge | Many | Few | | Cost per iteration | Cheap | Expensive (full evaluation) | | Modifiable | Yes (asynchronous, partial) | Yes (modified PI) | | Best when | Many states, simple policy | Few states, complex value computation | ## Limitations - **Known dynamics required.** In real RL, $P$ and $R$ are unknown — use sampled methods. - **Tabular representation.** Doesn't scale to continuous or very large state spaces without function approximation. ## Generalised Policy Iteration (GPI) A theoretical umbrella: any algorithm that alternates partial policy evaluation with partial policy improvement converges (under reasonable conditions). [[Q-Learning]], [[SARSA]], actor-critic methods, value iteration, and policy iteration are all GPI variants. Almost every RL algorithm fits the GPI framework — a useful unifying lens. ## Related - [[Markov Decision Process]] - [[Value Iteration]] - [[Q-Learning]] - [[Policy Gradient]] - [[Reinforcement Learning]]