Markov Decision Process - Albert Masoliver's learning site

## Definition A **Markov Decision Process (MDP)** is the mathematical framework for sequential decision-making under uncertainty. The foundation of [[Reinforcement Learning]]. ## Formal Definition An MDP is the tuple $\langle S, A, P, R, \gamma \rangle$: - $S$ — set of states. - $A$ — set of actions. - $P(s' \mid s, a)$ — transition probability. - $R(s, a)$ or $R(s, a, s')$ — reward function. - $\gamma \in [0, 1)$ — discount factor. ## The Markov Property > The future is independent of the past given the present. Formally: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$. This is the assumption that makes MDPs tractable. ## Goal: Find an Optimal Policy A **policy** $\pi: S \to A$ (or $\pi(a \mid s)$ for stochastic) maps states to actions. The objective is to find a policy maximising expected discounted cumulative reward: $ V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s \right] $ This is the **state-value function** under policy $\pi$. ## Action-Value Function $ Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right] $ Q answers: "What's the expected return if I take action $a$ in state $s$ and then follow $\pi$?" ## Bellman Equations $V^\pi$ satisfies the recursion: $ V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right] $ For the optimal policy $\pi^*$: $ V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right] $ This is the **Bellman optimality equation**. Most RL algorithms exploit it. ## Solving MDPs When the dynamics ($P$, $R$) are known: - [[Value Iteration]] — iterate Bellman until convergence. - [[Policy Iteration]] — alternate evaluation and improvement. - Linear programming. When dynamics are unknown (the RL regime): - Model-free: [[Q-Learning]], [[SARSA]], [[Policy Gradient]], etc. - Model-based: learn $P, R$ from interaction, then plan. ## Discount Factor $\gamma$ - $\gamma \to 0$: short-sighted; only immediate rewards matter. - $\gamma \to 1$: long-sighted; future rewards matter almost as much as now. - Common values: 0.9, 0.99, 0.999. The discount also serves a mathematical purpose: ensures convergence of infinite-horizon sums even without terminal states. ## Extensions - **POMDP** (Partially Observable MDP) — agent observes only a noisy function of the state. - **Continuous state/action spaces** — function approximation needed (deep RL). - **Multi-agent MDPs** — multiple decision makers. ## Related - [[Reinforcement Learning]] - [[Value Iteration]] - [[Policy Iteration]] - [[Q-Learning]] - [[Hidden Markov Model]]