## Definition
**Q-learning** (Watkins, 1989) is a *model-free*, *off-policy* reinforcement learning algorithm. It learns the action-value function $Q^*(s, a)$ — the expected return from taking action $a$ in state $s$ and following the optimal policy thereafter — directly from environment interaction, without needing the transition dynamics.
## Update Rule
After observing transition $(s, a, r, s')$:
$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$
- $\alpha$ — learning rate.
- $\gamma$ — discount factor.
- The bracketed term is the **TD error** — the difference between the bootstrapped target and the current estimate.
## Off-Policy Property
Q-learning learns about the *greedy* policy ($\max_{a'}$) while *behaving* via a different exploration policy (commonly $\epsilon$-greedy). This is the "off-policy" property: the policy being learned $\neq$ the policy generating data.
Contrast with [[SARSA]], which is *on-policy* — it learns about the policy actually being followed.
## Behavior Policy: $\epsilon$-Greedy
To balance [[Exploration vs Exploitation]]:
```
with probability ε: pick a random action
with probability 1 - ε: pick a = argmax Q(s, a)
```
Decay $\epsilon$ from 1.0 toward a small value (0.05) over training.
## Convergence
Under conditions:
- All state-action pairs visited infinitely often.
- Learning rate decreased appropriately (Robbins-Monro: $\sum \alpha_t = \infty$, $\sum \alpha_t^2 < \infty$).
- Bounded rewards.
Tabular Q-learning is guaranteed to converge to $Q^*$ (Watkins & Dayan, 1992).
## Deep Q-Networks (DQN)
For large or continuous state spaces, replace the Q-table with a neural network $Q_\theta(s, a)$. DQN (Mnih et al., 2013, 2015) introduced two stabilisation tricks:
- **Experience replay.** Store transitions in a buffer; sample mini-batches randomly to break temporal correlations.
- **Target network.** Compute the TD target $r + \gamma \max_{a'} Q_{\theta^-}(s', a')$ using a separate, periodically-updated target network parameters $\theta^-$.
DQN played Atari games at human level from raw pixels — a landmark result.
## Limitations
- **Discrete actions** — naive Q-learning requires a max over actions; impractical for continuous action spaces. Use deterministic policy gradients (DDPG, TD3, SAC) instead.
- **Bootstrapping + function approximation + off-policy** — the "deadly triad". Can diverge. DQN's tricks mitigate but don't eliminate.
- **Sample inefficient.** Many environment interactions needed.
## When to Use
- Discrete action spaces.
- Environments where samples are cheap (simulators).
- Baseline RL algorithm; well-understood.
## Related
- [[Markov Decision Process]]
- [[SARSA]]
- [[Policy Gradient]]
- [[Exploration vs Exploitation]]
- [[Value Iteration]]