## Definition
The **multi-armed bandit** is the simplest RL problem: $K$ arms (actions), each with an unknown reward distribution. The agent repeatedly picks an arm and observes a reward, with the goal of maximising cumulative reward over time. No states; just actions.
## Why "Bandit"
The name comes from a "one-armed bandit" — slot machine. The multi-armed bandit imagines a gambler facing $K$ slot machines with unknown payoffs.
## The Core Tension
[[Exploration vs Exploitation]] in pure form:
- **Exploit:** pull the arm with the highest empirical mean.
- **Explore:** pull other arms to learn their values.
Pure exploitation locks in early estimates; pure exploration never converges. The art is the balance.
## Regret
Performance measured by **regret**: cumulative reward gap vs always pulling the best arm:
$
R_T = T \cdot \mu^* - \mathbb{E}\left[\sum_{t=1}^T r_t\right]
$
where $\mu^*$ is the best arm's mean. Goal: minimise regret.
## Algorithms
### Epsilon-Greedy
With probability $\epsilon$, explore; otherwise exploit.
- Simple, well-known.
- Linear regret unless $\epsilon \to 0$ over time.
### Upper Confidence Bound (UCB)
Pick arm $\arg\max_a [\hat\mu_a + c \sqrt{\ln t / n_a}]$ — empirical mean plus an exploration bonus that shrinks as $n_a$ grows.
- Logarithmic regret (provably optimal up to constants).
- Foundational principle: **optimism in the face of uncertainty.**
### Thompson Sampling
Maintain a posterior over each arm's reward distribution. Sample from each posterior; pick the arm whose sample is highest. Bayesian, elegantly handles uncertainty, often empirically best.
### Gradient Bandit
Maintain preference scores $H_a$; turn into probabilities via softmax; update via gradient on reward.
## Contextual Bandits
A natural extension: at each step, the agent observes a *context* $x$ before choosing an arm. Rewards depend on both arm and context.
- Bridges bandits and supervised learning.
- Algorithms: LinUCB, Thompson sampling for contextual bandits, neural bandits.
- Workhorses for **recommendation systems** and **online ad serving** in production.
## Connection to Full RL
A bandit is a one-state MDP with rewards depending on action alone. Adding state-dependent rewards and multi-step credit gives a full MDP. The exploration techniques developed for bandits (UCB, Thompson) carry over and inspire methods used in deep RL.
## When Bandits Are the Right Frame
- **Recommendation, ad selection, A/B testing** — pick from $K$ options; observe reward; repeat.
- **Hyperparameter tuning** — each hyperparameter setting is an arm; reward is validation performance.
- **Clinical trial design** — adaptive allocation to better-performing treatments.
## Related
- [[Exploration vs Exploitation]]
- [[Reinforcement Learning]]
- [[Q-Learning]]
- [[Monte Carlo Tree Search]]