Bellman Equation - Albert Masoliver's learning site

## Definition The **Bellman equation** is the recursive relation that characterises the optimal value function in sequential decision problems. The cornerstone of [[Dynamic Programming]], [[Reinforcement Learning]], and optimal control theory. Named after Richard Bellman (1953). ## The Principle of Optimality > *An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.* In other words: the tail of an optimal solution is itself optimal. This is what makes recursive decomposition work. ## Bellman Equation for Value (Optimal) For a [[Markov Decision Process]] with discount $\gamma$: $ V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right] $ The value of being in state $s$ = the best action's immediate reward plus discounted expected value of the next state. ## Bellman Equation for Q (Action-Value) $ Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a') $ The value of taking action $a$ in state $s$ = immediate reward + discounted expected best-next value. ## Bellman Expectation Equation (For a Given Policy $\pi$) $ V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right] $ Same form, but with $\pi$-weighted action averaging instead of a max. ## Why It's Foundational The Bellman equation reduces an infinite-horizon optimisation to a fixed-point problem: $ V^* = T V^* $ where $T$ is the Bellman optimality operator. Solving the equation = finding the fixed point. The contraction property of $T$ (with discount $\gamma < 1$) guarantees uniqueness and gives convergence rates. ## Algorithms That Solve It Most of [[Reinforcement Learning]] can be cast as approximately solving the Bellman equation: - **[[Value Iteration]].** Repeated application: $V_{k+1} = T V_k$. - **[[Policy Iteration]].** Alternates policy evaluation (solve Bellman expectation for the current $\pi$) and policy improvement. - **[[Q-Learning]].** Stochastic gradient-style updates toward satisfying the Bellman optimality equation. - **TD learning.** General class of algorithms using bootstrapped Bellman estimates. ## Function Approximation When the state space is too large for tabular methods, replace exact values with parameterised approximations: $ V_\theta(s) \approx V^*(s) $ Train $\theta$ to minimise the **Bellman error**: $|V_\theta(s) - T V_\theta(s)|$. This is the foundation of deep RL (DQN minimises a Bellman-error-like loss). ## The Curse of Dimensionality The Bellman equation requires summing over all next states $s'$. For continuous or very large state spaces, this sum is intractable. Mitigations: - **Sampling.** Replace exact expectation with Monte Carlo estimates. - **Function approximation.** Represent value with parameterised functions. ## Beyond RL The Bellman equation appears wherever sequential decisions matter: - **Optimal control.** Hamilton-Jacobi-Bellman PDE (continuous-time analog). - **Finance.** Optimal stopping, option pricing. - **Inventory management.** Multi-period stocking decisions. - **Operations research.** Multi-stage resource allocation. ## Related - [[Markov Decision Process]] - [[Value Iteration]] - [[Policy Iteration]] - [[Q-Learning]] - [[Dynamic Programming]] - [[Reinforcement Learning]]