Reinforcement Learning - Albert Masoliver's learning site

## Definition

**Reinforcement Learning (RL)** is the paradigm where an agent learns a *policy* — a mapping from states to actions — by interacting with an environment that gives reward signals. Unlike [[Supervised Learning]], there are no labelled examples; only delayed rewards from the agent's own choices.

## The RL Setup

An RL problem is the tuple $\langle S, A, P, R, \gamma \rangle$:

- $S$ — state space.
- $A$ — action space.
- $P(s' \mid s, a)$ — transition dynamics.
- $R(s, a)$ — reward function.
- $\gamma \in [0, 1)$ — discount factor for future rewards.

This is a [[Markov Decision Process]]. The agent's objective: find a policy $\pi: S \to A$ maximising expected discounted cumulative reward.

## The Exploration-Exploitation Dilemma

At every step, the agent must choose between:

- **Exploitation** — take the action currently believed best.
- **Exploration** — try an action whose value is uncertain.

Pure exploitation locks in suboptimal policies; pure exploration never accumulates reward. See [[Exploration vs Exploitation]] and [[Multi-Armed Bandit]] for the classical framing.

## Main Algorithmic Families

### Value-based

Learn the *value* of states or state-action pairs; derive policy by acting greedily. Examples: [[Q-Learning]], [[SARSA]], Deep Q-Networks (DQN).

### Policy-based

Learn the policy directly, parameterised. Examples: [[Policy Gradient]], REINFORCE.

### Actor-Critic

Combine: a *policy* network (actor) and a *value* network (critic). Critic estimates future return; actor updates toward the critic's signal. Examples: A2C, A3C, PPO, SAC. See [[Actor-Critic]].

### Model-based

Learn $P$ and $R$ from interaction, then plan within the learnt model. Sample-efficient when the model is accurate.

## Why RL Is Hard

- **Sparse rewards.** Maybe reward only at game end; credit assignment over many steps.
- **High variance.** Stochastic policies + stochastic environments → noisy gradient signals.
- **Sample inefficiency.** Many algorithms need millions of environment interactions.
- **Stability.** Training can diverge if hyperparameters are off.
- **Reward hacking.** Agents exploit reward-function loopholes rather than the intended task.

## Major Successes

- **TD-Gammon** (1992) — backgammon at world-class level via TD learning.
- **DQN** (2013-2015) — Atari games from raw pixels with no game-specific knowledge.
- **AlphaGo / AlphaZero / MuZero** — superhuman play in Go, chess, shogi.
- **RLHF** — fine-tunes LLMs from human preference feedback. See [[RLHF]].
- **Robotic manipulation** in simulation transferred to real world.

## RL vs LLM RLHF

The "RL" in RLHF for LLMs is somewhat unusual: a single action (a complete response), reward from a human-preference model, no environment dynamics. Practically a contextual bandit problem more than a full MDP — but the policy-gradient mathematics applies.

## Related

- [[Markov Decision Process]]
- [[Q-Learning]]
- [[Policy Gradient]]
- [[Actor-Critic]]
- [[Multi-Armed Bandit]]
- [[RLHF]]
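
## Example: Tabular Q-Learning Sketch

To make the value-based family and the exploration-exploitation dilemma above concrete, here is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. Everything about the environment is an assumption made up for illustration: a 5-state chain with deterministic transitions and a single sparse reward at the right end. The names `step`, `epsilon_greedy`, `q_learning` and the hyperparameter values (`alpha`, `gamma`, `epsilon`, `episodes`) are hypothetical choices, not taken from any particular library.

```python
import random

# Hypothetical 5-state chain MDP, invented for illustration only.
# States 0..4; action 0 moves left, action 1 moves right.
# Reaching state 4 yields reward 1 and ends the episode (a sparse reward).
N_STATES = 5
ACTIONS = (0, 1)
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic transition P and reward R for the toy chain."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    done = next_state == GOAL
    return next_state, reward, done


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the current estimate."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties randomly so the untrained agent does not always pick action 0.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a table Q[s][a] from interaction with the toy chain."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        done = False
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Bootstrapped target: no future value beyond a terminal state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Derive the policy by acting greedily on the learnt values.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Under these assumed settings the greedy policy should choose "right" (action 1) in every non-terminal state, and the learnt values should shrink by roughly a factor of $\gamma$ per step away from the goal, which is the discounting from the setup section at work. Setting `epsilon=0` while keeping random tie-breaking still solves this toy chain, but in larger problems a purely greedy agent can lock in a suboptimal policy, as described above.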