Exploration vs Exploitation - Albert Masoliver's learning site

## Definition The **exploration-exploitation tradeoff** is the central tension in reinforcement learning and online decision-making. An agent must repeatedly choose between: - **Exploit:** take the action believed best given current knowledge. - **Explore:** try less-known actions to refine knowledge. Pure exploitation locks in suboptimal policies. Pure exploration never accumulates reward. The right balance is the whole game. ## Why It's Hard In supervised learning, you have a fixed dataset and an objective — straightforward. In sequential decision-making, the data you collect *depends on what you do*. Bad exploration → bad data → bad future decisions, compounding. ## Strategies ### $\epsilon$-Greedy With probability $\epsilon$, pick a random action; otherwise pick the best-known. Simple. Decay $\epsilon$ over time as confidence grows. - **Pro:** trivial to implement; universal. - **Con:** explores uniformly — doesn't focus on the most uncertain actions. ### Boltzmann (Softmax) Exploration Pick action with probability proportional to $\exp(Q(s, a) / \tau)$. Temperature $\tau$ controls exploration: high → uniform; low → greedy. ### Upper Confidence Bound (UCB) Pick the action maximising $\hat\mu_a + c \sqrt{\ln t / n_a}$. Adds an exploration bonus inversely proportional to how often the action has been tried. - **Principle:** *Optimism in the face of uncertainty.* Treat under-explored actions as if they might be the best. - Logarithmic regret (provably near-optimal). ### Thompson Sampling Maintain a posterior over each action's value. Sample from each posterior; pick the highest sampled value. Naturally explores actions with high uncertainty. - Bayesian, elegant. - Often empirically the strongest. ### Curiosity-Driven Exploration Intrinsic reward for visiting novel states. Used in deep RL when extrinsic rewards are sparse. Variants: count-based curiosity, prediction-error curiosity, random network distillation. ### Optimistic Initialisation Initialise value estimates very high. Early exploration is forced because every action looks promising until tried. ## In Different Settings ### Bandits The cleanest exploration-exploitation problem. See [[Multi-Armed Bandit]]. ### Tabular RL Standard methods: $\epsilon$-greedy with $\epsilon$ decay; UCB on state-action counts. ### Deep RL Harder because state-action visit counts aren't directly available. Solutions: Boltzmann exploration with entropy bonus (PPO, SAC), curiosity-driven methods, parameter-space noise. ### Hard-Exploration Environments Where rewards are extremely sparse (Montezuma's Revenge — Atari). Standard methods fail; specialised techniques needed: hierarchical exploration, intrinsic motivation, behaviour cloning from demonstrations. ## Theoretical Lower Bound For $K$-armed bandits, no algorithm can achieve regret better than $\Omega(\sqrt{KT})$. UCB and Thompson achieve this bound (up to constants). ## The Orchestrator's Question When deploying an RL agent, ask: how much regret accumulates during exploration? In simulations, exploration is free. In production (medical, financial, robotic), exploration *costs*. Conservative exploration via SARSA or off-policy learning from logged data are then preferred. ## Related - [[Multi-Armed Bandit]] - [[Q-Learning]] - [[SARSA]] - [[Reinforcement Learning]]