## Definition The **multi-armed bandit** is the simplest RL problem: $K$ arms (actions), each with an unknown reward distribution. The agent repeatedly picks an arm and observes a reward, with the goal of maximising cumulative reward over time. No states; just actions. ## Why "Bandit" The name comes from a "one-armed bandit" — slot machine. The multi-armed bandit imagines a gambler facing $K$ slot machines with unknown payoffs. ## The Core Tension [[Exploration vs Exploitation]] in pure form: - **Exploit:** pull the arm with the highest empirical mean. - **Explore:** pull other arms to learn their values. Pure exploitation locks in early estimates; pure exploration never converges. The art is the balance. ## Regret Performance measured by **regret**: cumulative reward gap vs always pulling the best arm: $ R_T = T \cdot \mu^* - \mathbb{E}\left[\sum_{t=1}^T r_t\right] $ where $\mu^*$ is the best arm's mean. Goal: minimise regret. ## Algorithms ### Epsilon-Greedy With probability $\epsilon$, explore; otherwise exploit. - Simple, well-known. - Linear regret unless $\epsilon \to 0$ over time. ### Upper Confidence Bound (UCB) Pick arm $\arg\max_a [\hat\mu_a + c \sqrt{\ln t / n_a}]$ — empirical mean plus an exploration bonus that shrinks as $n_a$ grows. - Logarithmic regret (provably optimal up to constants). - Foundational principle: **optimism in the face of uncertainty.** ### Thompson Sampling Maintain a posterior over each arm's reward distribution. Sample from each posterior; pick the arm whose sample is highest. Bayesian, elegantly handles uncertainty, often empirically best. ### Gradient Bandit Maintain preference scores $H_a$; turn into probabilities via softmax; update via gradient on reward. ## Contextual Bandits A natural extension: at each step, the agent observes a *context* $x$ before choosing an arm. Rewards depend on both arm and context. - Bridges bandits and supervised learning. - Algorithms: LinUCB, Thompson sampling for contextual bandits, neural bandits. - Workhorses for **recommendation systems** and **online ad serving** in production. ## Connection to Full RL A bandit is a one-state MDP with rewards depending on action alone. Adding state-dependent rewards and multi-step credit gives a full MDP. The exploration techniques developed for bandits (UCB, Thompson) carry over and inspire methods used in deep RL. ## When Bandits Are the Right Frame - **Recommendation, ad selection, A/B testing** — pick from $K$ options; observe reward; repeat. - **Hyperparameter tuning** — each hyperparameter setting is an arm; reward is validation performance. - **Clinical trial design** — adaptive allocation to better-performing treatments. ## Related - [[Exploration vs Exploitation]] - [[Reinforcement Learning]] - [[Q-Learning]] - [[Monte Carlo Tree Search]]