Value Iteration - Albert Masoliver's learning site

## Definition **Value iteration** is a dynamic programming algorithm for solving a known [[Markov Decision Process]]. It iteratively applies the **Bellman optimality operator** to the value function until convergence, then extracts the optimal policy. ## Algorithm ``` initialise V(s) = 0 for all s repeat until ||V_new - V|| < ε: for each state s: V_new(s) ← max_a [R(s, a) + γ · Σ_s' P(s' | s, a) · V(s')] V ← V_new extract policy: π(s) ← argmax_a [R(s, a) + γ · Σ_s' P(s' | s, a) · V(s')] ``` ## Why It Works The Bellman optimality operator is a **contraction mapping** in $\gamma$-weighted max norm: $ \|T V - T V'\|_\infty \leq \gamma \|V - V'\|_\infty $ By Banach's fixed-point theorem, repeated application converges geometrically to the unique fixed point — $V^*$. ## Convergence Rate After $k$ iterations: $ \|V_k - V^*\|_\infty \leq \frac{\gamma^k}{1 - \gamma} \|V_0 - V^*\|_\infty $ So convergence is **geometric in $\gamma^k$**: fewer iterations needed when $\gamma$ is smaller. For $\gamma = 0.99$, expect ~hundreds of iterations to converge to high precision. ## Strengths - **Conceptually simple.** - **Convergence guaranteed** for $\gamma < 1$. - **Optimal solution** (not just locally optimal). ## Limitations - **Requires known $P$ and $R$.** In real RL settings these are unknown — use [[Q-Learning]] or model-based methods. - **Curse of dimensionality.** Per iteration, scans all $|S| \cdot |A|$ state-action pairs and sums over all $|S|$ next states. Doesn't scale beyond a few million states. - **Tabular representation.** For continuous or very large state spaces, function approximation is needed (and convergence guarantees weaken). ## Comparison to Policy Iteration | Property | Value Iteration | Policy Iteration | |---|---|---| | Inner loop | None (one Bellman update) | Policy evaluation (multiple Bellman updates) | | Outer loop | Many iterations | Few iterations | | Each iteration | Cheap | Expensive (policy evaluation is itself iterative) | | Total to converge | Often more iterations, faster per step | Fewer iterations, more per step | In practice, **modified policy iteration** (a small number of policy-evaluation steps between improvements) often wins. ## Asynchronous Variants Updates don't have to be in parallel. Asynchronous value iteration updates states one at a time — converges as long as every state is updated infinitely often. Enables incremental and sample-based variants. ## Related - [[Markov Decision Process]] - [[Policy Iteration]] - [[Q-Learning]] - [[Dynamic Programming]]