## Definition
**Reinforcement Learning (RL)** is the paradigm in which an agent learns a *policy*, a mapping from states to actions, by interacting with an environment that returns reward signals. Unlike [[Supervised Learning]], there are no labelled examples: the only feedback is reward, often delayed, that results from the agent's own choices.
## The RL Setup
An RL problem is the tuple $\langle S, A, P, R, \gamma \rangle$:
- $S$ — state space.
- $A$ — action space.
- $P(s' \mid s, a)$ — transition dynamics.
- $R(s, a)$ — reward function.
- $\gamma \in [0, 1)$ — discount factor for future rewards.
This is a [[Markov Decision Process]]. The agent's objective: find a policy $\pi: S \to A$ maximising expected discounted cumulative reward.
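The quantity being maximised can be made concrete for a single finite episode. The sketch below computes $\sum_t \gamma^t r_t$ from a list of rewards; the function name `discounted_return` is illustrative, and a real objective would average this over many episodes sampled under the policy.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Because $\gamma < 1$, rewards far in the future contribute less, which also keeps the sum bounded in infinite-horizon problems with bounded rewards.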
## The Exploration-Exploitation Dilemma
At every step, the agent must choose between:
- **Exploitation** — take the action currently believed best.
- **Exploration** — try an action whose value is uncertain.
Pure exploitation locks in suboptimal policies; pure exploration never accumulates reward. See [[Exploration vs Exploitation]] and [[Multi-Armed Bandit]] for the classical framing.
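One common way to balance the two is an $\epsilon$-greedy rule: explore uniformly at random with probability $\epsilon$, otherwise exploit the current best estimate. The following is a minimal sketch on a Bernoulli bandit; `epsilon_greedy_bandit`, its parameters, and the Bernoulli reward model are assumptions made for illustration, not a prescribed method.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return (estimates, counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running mean reward per arm
    counts = [0] * n_arms       # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: uniform random arm
        else:
            # exploit: arm with the highest estimate (ties go to the lowest index)
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: new_mean = old_mean + (x - old_mean) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


if __name__ == "__main__":
    estimates, counts = epsilon_greedy_bandit([0.2, 0.8], epsilon=0.1, steps=5000)
    print(estimates, counts)
```

Setting `epsilon=0` gives pure exploitation and `epsilon=1` gives pure exploration, the two failure modes described above.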
## Main Algorithmic Families
### Value-based
Learn the *value* of states or state-action pairs; derive policy by acting greedily. Examples: [[Q-Learning]], [[SARSA]], Deep Q-Networks (DQN).
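As a rough illustration of the value-based idea, here is the standard tabular Q-learning update together with a greedy action choice. The dictionary-of-lists table layout and the function names are assumptions made for this sketch.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value.

    q maps each state to a list of action values (an assumed layout).
    """
    # Bootstrap from the best next action unless the episode has ended.
    best_next = 0.0 if done else max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


def greedy_action(q, state):
    """Derive the policy by acting greedily with respect to q."""
    values = q[state]
    return max(range(len(values)), key=lambda a: values[a])


if __name__ == "__main__":
    q = {0: [0.0, 0.0], 1: [1.0, 2.0]}
    q_learning_update(q, state=0, action=1, reward=1.0, next_state=1,
                      done=False, alpha=0.5, gamma=0.5)
    print(q[0], greedy_action(q, 0))
```

SARSA differs only in the target: it bootstraps from the action actually taken in `next_state` rather than from the maximum.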
### Policy-based
Learn a parameterised policy directly, without deriving it from a value function. Examples: [[Policy Gradient]], REINFORCE.
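A minimal sketch of a REINFORCE-style step, assuming a stateless softmax policy over a handful of actions (a bandit-sized simplification; real policies condition on state and are usually neural networks). It uses the identity $\nabla_{\theta_a} \log \pi(a') = \mathbb{1}[a = a'] - \pi(a)$ for softmax preferences. The names `softmax` and `reinforce_update` are illustrative.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(prefs, action, ret, lr=0.1):
    """Return new preferences after one step: theta += lr * G * grad log pi(action)."""
    probs = softmax(prefs)
    return [
        p + lr * ret * ((1.0 if a == action else 0.0) - probs[a])
        for a, p in enumerate(prefs)
    ]


if __name__ == "__main__":
    # Action 0 was sampled and produced return 1.0, so its preference rises.
    print(reinforce_update([0.0, 0.0], action=0, ret=1.0, lr=1.0))
```

A positive return raises the probability of the sampled action and a negative return lowers it; no value table is consulted when choosing actions.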
### Actor-Critic
Combine the two: a *policy* network (actor) and a *value* network (critic). The critic estimates future return, and the actor is updated in the direction the critic's estimate favours. Examples: A2C, A3C, PPO, SAC. See [[Actor-Critic]].
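The interplay can be sketched with tables standing in for the two networks. In this one-step version the critic's TD error serves as the actor's learning signal; the tabular layout, the function names, and the two separate step sizes are assumptions for illustration and are not how A2C, PPO, or SAC are implemented in detail.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(values, prefs, state, action, reward, next_state, done,
                      gamma=0.99, alpha_v=0.1, alpha_pi=0.1):
    """One-step tabular actor-critic update, applied in place; returns the TD error.

    values: dict state -> estimated return (the critic).
    prefs:  dict state -> list of action preferences (the actor).
    """
    next_value = 0.0 if done else values[next_state]
    td_error = reward + gamma * next_value - values[state]
    # Critic: move V(state) toward the bootstrapped target.
    values[state] += alpha_v * td_error
    # Actor: softmax policy-gradient step, weighted by the critic's TD error.
    probs = softmax(prefs[state])
    prefs[state] = [
        p + alpha_pi * td_error * ((1.0 if a == action else 0.0) - probs[a])
        for a, p in enumerate(prefs[state])
    ]
    return td_error


if __name__ == "__main__":
    values = {0: 0.0, 1: 1.0}
    prefs = {0: [0.0, 0.0]}
    delta = actor_critic_step(values, prefs, 0, 1, 1.0, 1, False,
                              gamma=0.5, alpha_v=0.5, alpha_pi=1.0)
    print(delta, values[0], prefs[0])
```

Replacing the raw return with the TD error is what reduces variance relative to plain REINFORCE, at the cost of bias from the critic's imperfect estimates.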
### Model-based
Learn $P$ and $R$ from interaction, then plan within the learnt model. Sample-efficient when the model is accurate.
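A small tabular sketch of that loop, assuming discrete states and actions: estimate $P$ and $R$ from observed transition counts, then plan with value iteration inside the estimated model. The helper names and the choice to ignore unvisited state-action pairs are assumptions of this example, not a standard API.

```python
from collections import defaultdict


def fit_model(transitions):
    """Estimate P and R from (state, action, reward, next_state) samples."""
    next_counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s2 in transitions:
        next_counts[(s, a)][s2] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    # Empirical transition probabilities and mean rewards per (state, action).
    P = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
         for sa, nxt in next_counts.items()}
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return P, R


def value_iteration(P, R, n_states, n_actions, gamma=0.9, sweeps=100):
    """Plan in the learnt model; unvisited (state, action) pairs are skipped."""
    V = [0.0] * n_states
    for _ in range(sweeps):
        for s in range(n_states):
            q_values = [
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in range(n_actions) if (s, a) in P
            ]
            if q_values:  # states with no data keep value 0.0
                V[s] = max(q_values)
    return V


if __name__ == "__main__":
    data = [(0, 0, 0.0, 1), (0, 0, 2.0, 1), (0, 1, 0.0, 1)]
    P, R = fit_model(data)
    print(value_iteration(P, R, n_states=2, n_actions=2))
```

Planning reuses every stored transition many times, which is where the sample efficiency comes from; if the estimated `P` or `R` is wrong, the plan inherits those errors.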
## Why RL Is Hard
- **Sparse rewards.** Reward may arrive only at the end of a game, so credit must be assigned across many earlier steps.
- **High variance.** Stochastic policies + stochastic environments → noisy gradient signals.
- **Sample inefficiency.** Many algorithms need millions of environment interactions.
- **Stability.** Training can diverge when hyperparameters such as the learning rate are poorly chosen.
- **Reward hacking.** Agents exploit reward-function loopholes rather than the intended task.
## Major Successes
- **TD-Gammon** (1992) — backgammon at world-class level via TD learning.
- **DQN** (2013-2015) — Atari games from raw pixels with no game-specific knowledge.
- **AlphaGo / AlphaZero / MuZero** — superhuman play in Go, chess, shogi.
- **RLHF** — fine-tunes LLMs from human preference feedback. See [[RLHF]].
- **Robotic manipulation** policies trained in simulation and transferred to the real world.
## RL vs LLM RLHF
The "RL" in RLHF for LLMs is somewhat unusual: the agent takes a single action (a complete response), the reward comes from a model of human preferences, and there are no environment dynamics. In practice this is closer to a contextual bandit problem than a full MDP, but the policy-gradient mathematics still applies.
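Under that single-step reading, and leaving out the KL-style regularisation that practical systems typically add, the objective and its policy gradient can be written as below. Here $x$ is a prompt drawn from a dataset $\mathcal{D}$, $y$ is a complete response, $\pi_\theta$ is the LLM policy, and $r$ is the learned reward model; this notation is illustrative rather than taken from a specific source.

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]
$$

There is no transition term $P$ and no discounting over steps, which is what makes the contextual-bandit description fit.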
## Related
- [[Markov Decision Process]]
- [[Q-Learning]]
- [[Policy Gradient]]
- [[Actor-Critic]]
- [[Multi-Armed Bandit]]
- [[RLHF]]