## Definition
The **exploration-exploitation tradeoff** is the central tension in reinforcement learning and online decision-making. An agent must repeatedly choose between:
- **Exploit:** take the action believed best given current knowledge.
- **Explore:** try less-known actions to refine knowledge.
Pure exploitation locks in suboptimal policies. Pure exploration never accumulates reward. The right balance is the whole game.
## Why It's Hard
In supervised learning, you have a fixed dataset and an objective — straightforward. In sequential decision-making, the data you collect *depends on what you do*. Bad exploration → bad data → bad future decisions, compounding.
## Strategies
### $\epsilon$-Greedy
With probability $\epsilon$, pick a random action; otherwise pick the best-known. Simple. Decay $\epsilon$ over time as confidence grows.
- **Pro:** trivial to implement; universal.
- **Con:** explores uniformly — doesn't focus on the most uncertain actions.
### Boltzmann (Softmax) Exploration
Pick action with probability proportional to $\exp(Q(s, a) / \tau)$. Temperature $\tau$ controls exploration: high → uniform; low → greedy.
### Upper Confidence Bound (UCB)
Pick the action maximising $\hat\mu_a + c \sqrt{\ln t / n_a}$. Adds an exploration bonus inversely proportional to how often the action has been tried.
- **Principle:** *Optimism in the face of uncertainty.* Treat under-explored actions as if they might be the best.
- Logarithmic regret (provably near-optimal).
### Thompson Sampling
Maintain a posterior over each action's value. Sample from each posterior; pick the highest sampled value. Naturally explores actions with high uncertainty.
- Bayesian, elegant.
- Often empirically the strongest.
### Curiosity-Driven Exploration
Intrinsic reward for visiting novel states. Used in deep RL when extrinsic rewards are sparse. Variants: count-based curiosity, prediction-error curiosity, random network distillation.
### Optimistic Initialisation
Initialise value estimates very high. Early exploration is forced because every action looks promising until tried.
## In Different Settings
### Bandits
The cleanest exploration-exploitation problem. See [[Multi-Armed Bandit]].
### Tabular RL
Standard methods: $\epsilon$-greedy with $\epsilon$ decay; UCB on state-action counts.
### Deep RL
Harder because state-action visit counts aren't directly available. Solutions: Boltzmann exploration with entropy bonus (PPO, SAC), curiosity-driven methods, parameter-space noise.
### Hard-Exploration Environments
Where rewards are extremely sparse (Montezuma's Revenge — Atari). Standard methods fail; specialised techniques needed: hierarchical exploration, intrinsic motivation, behaviour cloning from demonstrations.
## Theoretical Lower Bound
For $K$-armed bandits, no algorithm can achieve regret better than $\Omega(\sqrt{KT})$. UCB and Thompson achieve this bound (up to constants).
## The Orchestrator's Question
When deploying an RL agent, ask: how much regret accumulates during exploration? In simulations, exploration is free. In production (medical, financial, robotic), exploration *costs*. Conservative exploration via SARSA or off-policy learning from logged data are then preferred.
## Related
- [[Multi-Armed Bandit]]
- [[Q-Learning]]
- [[SARSA]]
- [[Reinforcement Learning]]