## Definition
A **Markov Decision Process (MDP)** is the mathematical framework for sequential decision-making under uncertainty. The foundation of [[Reinforcement Learning]].
## Formal Definition
An MDP is the tuple $\langle S, A, P, R, \gamma \rangle$:
- $S$ — set of states.
- $A$ — set of actions.
- $P(s' \mid s, a)$ — transition probability.
- $R(s, a)$ or $R(s, a, s')$ — reward function.
- $\gamma \in [0, 1)$ — discount factor.
## The Markov Property
> The future is independent of the past given the present.
Formally: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$.
This is the assumption that makes MDPs tractable.
## Goal: Find an Optimal Policy
A **policy** $\pi: S \to A$ (or $\pi(a \mid s)$ for stochastic) maps states to actions. The objective is to find a policy maximising expected discounted cumulative reward:
$
V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s \right]
$
This is the **state-value function** under policy $\pi$.
## Action-Value Function
$
Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right]
$
Q answers: "What's the expected return if I take action $a$ in state $s$ and then follow $\pi$?"
## Bellman Equations
$V^\pi$ satisfies the recursion:
$
V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right]
$
For the optimal policy $\pi^*$:
$
V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]
$
This is the **Bellman optimality equation**. Most RL algorithms exploit it.
## Solving MDPs
When the dynamics ($P$, $R$) are known:
- [[Value Iteration]] — iterate Bellman until convergence.
- [[Policy Iteration]] — alternate evaluation and improvement.
- Linear programming.
When dynamics are unknown (the RL regime):
- Model-free: [[Q-Learning]], [[SARSA]], [[Policy Gradient]], etc.
- Model-based: learn $P, R$ from interaction, then plan.
## Discount Factor $\gamma$
- $\gamma \to 0$: short-sighted; only immediate rewards matter.
- $\gamma \to 1$: long-sighted; future rewards matter almost as much as now.
- Common values: 0.9, 0.99, 0.999.
The discount also serves a mathematical purpose: ensures convergence of infinite-horizon sums even without terminal states.
## Extensions
- **POMDP** (Partially Observable MDP) — agent observes only a noisy function of the state.
- **Continuous state/action spaces** — function approximation needed (deep RL).
- **Multi-agent MDPs** — multiple decision makers.
## Related
- [[Reinforcement Learning]]
- [[Value Iteration]]
- [[Policy Iteration]]
- [[Q-Learning]]
- [[Hidden Markov Model]]