Q-Learning - Albert Masoliver's learning site

## Definition **Q-learning** (Watkins, 1989) is a *model-free*, *off-policy* reinforcement learning algorithm. It learns the action-value function $Q^*(s, a)$ — the expected return from taking action $a$ in state $s$ and following the optimal policy thereafter — directly from environment interaction, without needing the transition dynamics. ## Update Rule After observing transition $(s, a, r, s')$: $ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $ - $\alpha$ — learning rate. - $\gamma$ — discount factor. - The bracketed term is the **TD error** — the difference between the bootstrapped target and the current estimate. ## Off-Policy Property Q-learning learns about the *greedy* policy ($\max_{a'}$) while *behaving* via a different exploration policy (commonly $\epsilon$-greedy). This is the "off-policy" property: the policy being learned $\neq$ the policy generating data. Contrast with [[SARSA]], which is *on-policy* — it learns about the policy actually being followed. ## Behavior Policy: $\epsilon$-Greedy To balance [[Exploration vs Exploitation]]: ``` with probability ε: pick a random action with probability 1 - ε: pick a = argmax Q(s, a) ``` Decay $\epsilon$ from 1.0 toward a small value (0.05) over training. ## Convergence Under conditions: - All state-action pairs visited infinitely often. - Learning rate decreased appropriately (Robbins-Monro: $\sum \alpha_t = \infty$, $\sum \alpha_t^2 < \infty$). - Bounded rewards. Tabular Q-learning is guaranteed to converge to $Q^*$ (Watkins & Dayan, 1992). ## Deep Q-Networks (DQN) For large or continuous state spaces, replace the Q-table with a neural network $Q_\theta(s, a)$. DQN (Mnih et al., 2013, 2015) introduced two stabilisation tricks: - **Experience replay.** Store transitions in a buffer; sample mini-batches randomly to break temporal correlations. - **Target network.** Compute the TD target $r + \gamma \max_{a'} Q_{\theta^-}(s', a')$ using a separate, periodically-updated target network parameters $\theta^-$. DQN played Atari games at human level from raw pixels — a landmark result. ## Limitations - **Discrete actions** — naive Q-learning requires a max over actions; impractical for continuous action spaces. Use deterministic policy gradients (DDPG, TD3, SAC) instead. - **Bootstrapping + function approximation + off-policy** — the "deadly triad". Can diverge. DQN's tricks mitigate but don't eliminate. - **Sample inefficient.** Many environment interactions needed. ## When to Use - Discrete action spaces. - Environments where samples are cheap (simulators). - Baseline RL algorithm; well-understood. ## Related - [[Markov Decision Process]] - [[SARSA]] - [[Policy Gradient]] - [[Exploration vs Exploitation]] - [[Value Iteration]]