Multi-Armed Bandits
Ten slot machines. Each pays out at a different unknown rate. You have a thousand pulls. Explore an unknown lever, or exploit the best one so far? The simplest setting where exploration vs exploitation actually matters — and the testbed for nearly every algorithm that handles it.
A multi-armed bandit is an MDP without states — only actions ("arms"), each with an unknown reward distribution.
The challenge is the exploration–exploitation tradeoff: pulling the apparent-best arm collects reward but might miss a better one; trying random arms gathers information but wastes pulls. Three classic strategies (sketched in code after this list), and below they race head-to-head:
→ ε-greedy — most pulls go to the apparent-best arm; a fraction ε of them go to a random arm instead. Simple, surprisingly hard to beat.
→ UCB ("upper confidence bound") — pull the arm with the highest estimated mean plus a confidence bonus that shrinks each time that arm is pulled. Optimism in the face of uncertainty.
→ Thompson sampling — keep a Beta posterior per arm; sample one value from each; pull whichever sample is highest. Bayesian and provably efficient.
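A compact sketch of all three, assuming Bernoulli arms that pay 1 with some hidden probability; the arm count, seed, and helper names here are illustrative, and each function returns its per-pull rewards so the regret formula further down can score them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 Bernoulli arms; arm k pays 1 with unknown probability p[k].
p = rng.uniform(0.1, 0.9, size=10)
K = len(p)

def pull(a):
    """One Bernoulli reward from arm a."""
    return float(rng.random() < p[a])

def eps_greedy(T=1000, eps=0.1):
    counts, means, rewards = np.zeros(K), np.zeros(K), []
    for t in range(T):
        a = rng.integers(K) if rng.random() < eps else int(np.argmax(means))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
        rewards.append(r)
    return rewards

def ucb1(T=1000):
    counts, means, rewards = np.zeros(K), np.zeros(K), []
    for t in range(T):
        if t < K:
            a = t                                  # pull every arm once first
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(means + bonus))      # mean + shrinking confidence bonus
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        rewards.append(r)
    return rewards

def thompson(T=1000):
    alpha, beta, rewards = np.ones(K), np.ones(K), []  # Beta(1,1) prior per arm
    for t in range(T):
        a = int(np.argmax(rng.beta(alpha, beta)))  # one posterior sample per arm
        r = pull(a)
        alpha[a] += r                              # success bumps alpha
        beta[a] += 1 - r                           # failure bumps beta
        rewards.append(r)
    return rewards
```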
Bandits are everywhere in production ML. A/B testing — choosing which design to show users — is a 2-armed bandit. News feeds, ad placement, recommendation rankings, drug-trial allocation, model-routing — all bandit problems. Every "should I show this user version A or B?" decision is, formally, a bandit.
The bandit is also the gentlest entry into RL. No state, no transitions, no Bellman equation — just "which arm should I pull?" Master it and the broader RL machinery makes more sense.
- Hit Run · race: all three strategies pull arms simultaneously on the same problem. Watch their cumulative-regret curves diverge — UCB and Thompson typically pull ahead of ε-greedy.
- Switch to UCB · confidence bonus: each arm shows a vertical "confidence bonus" segment above its estimated mean. The bonus shrinks as the arm gets pulled — that's exploration converting into knowledge.
- Switch to Thompson · Beta posteriors: each arm shows its Beta posterior bell curve. Wide bell = uncertain → more likely to produce the top sample → more likely to be pulled. The bell narrows around the true mean as evidence accumulates.
- Crank ε to 1.0 — pure random pulls. Regret grows linearly. Drop it to 0 — pure greed; sometimes lucky, usually locked onto a sub-optimal arm. Both extremes are reproduced in the quick check below.
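Outside the widget, the same extremes fall out of the eps_greedy sketch above (same hypothetical setup; regret computed inline):

```python
# ε = 1.0: every pull is random, so regret grows roughly linearly with T.
for T in (250, 500, 1000):
    rewards = eps_greedy(T=T, eps=1.0)
    print(f"T={T:4d}  regret={T * p.max() - sum(rewards):.1f}")

# ε = 0.0: pure greed locks onto whichever arm happened to look best early.
rewards = eps_greedy(T=1000, eps=0.0)
print(f"greedy  regret={1000 * p.max() - sum(rewards):.1f}")
```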
regret(T) = T·μ* − Σₜ rₜ
"after T pulls: how much reward you'd have if you'd always pulled the best arm, minus the reward you actually got. Lower = better."
T = total pulls so far · μ* = mu-star = true mean of the best arm · rₜ = reward you got at pull t · Σₜ = sum across pulls t = 1…T
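As code, it's one line; applied to the sketches above (function names carried over from that block):

```python
def regret(rewards, p):
    """T·μ* minus the reward actually collected over T pulls."""
    return len(rewards) * p.max() - sum(rewards)

for name, run in [("ε-greedy", eps_greedy), ("UCB1", ucb1), ("Thompson", thompson)]:
    print(f"{name:12s} regret after 1000 pulls: {regret(run(T=1000), p):.1f}")
```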
Microsoft, Booking.com, Spotify, and most large product orgs use bandit-based experimentation to allocate traffic. Unlike a fixed-split test, bandit allocation shifts traffic away from a losing variant instead of paying the regret of running it for the full test duration.
Yahoo News' 2010 paper (Li et al., in the reading list below) popularized bandits for article recommendation. Reddit's "r/all" ranking, Hacker News' front page, and most personalization layers since use bandit variants to balance "popular" vs "let's see if this new thing pops."
Bandit-style allocation of patients to treatment arms — assign more people to the apparently-better arm as evidence accumulates. Reduces the ethical cost of randomly assigning patients to a likely-worse treatment.
Google Ads' "smart bidding" and Meta's ads platform pick which creative to show next using contextual bandits. Each impression is a pull; click-through rate is the reward; the user-and-page is the context.
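A minimal sketch of the contextual version (LinUCB, from the Li et al. paper in the reading list), assuming a linear reward model over d-dimensional context features; class and parameter names are illustrative:

```python
import numpy as np

class LinUCBArm:
    """One arm's ridge-regression state for disjoint-model LinUCB."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)       # Gram matrix, ridge-regularized by the identity
        self.b = np.zeros(d)     # sum of reward-weighted contexts
        self.alpha = alpha       # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # per-arm coefficient estimate
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

# Each impression: score every creative against the context x, show the max, observe the click.
# arms = [LinUCBArm(d=8) for _ in range(n_creatives)]
# a = max(range(len(arms)), key=lambda i: arms[i].ucb(x))
# arms[a].update(x, clicked)
```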
- Bandit Algorithms (book) · Lattimore & Szepesvári · The reference textbook. Free PDF online; covers UCB, Thompson sampling, and contextual and adversarial bandits in depth.
- A Contextual-Bandit Approach to Personalized News Article Recommendation (paper) · Li et al. (2010) · The Yahoo News paper that brought bandits into web personalization. Foundational for every modern recommender's exploration layer.
- The Multi-Armed Bandit Problem and Its Solutions (essay) · Lilian Weng · Patient walkthrough of ε-greedy, UCB, and Thompson sampling, with code. Pairs nicely with this page.
- Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems (paper) · Bubeck & Cesa-Bianchi · The theoretical foundation; where the regret bounds come from.