jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXIX

Multi-Armed Bandits

Ten slot machines. Each pays out at a different unknown rate. You have a thousand pulls. Explore an unknown lever, or exploit the best one so far? The simplest setting where exploration vs exploitation actually matters — and the testbed for nearly every algorithm that handles it.

The concept

A multi-armed bandit is an MDP without states — only actions ("arms"), each with an unknown reward distribution.

The challenge is the exploration–exploitation tradeoff: pulling the apparent-best arm collects reward but might miss a better one; trying random arms gathers info but wastes pulls. Three classic strategies, and below they race head-to-head:

ε-greedy — most pulls go to the apparent-best arm; ε of them go to a random arm. Simple, surprisingly hard to beat.
UCB ("upper confidence bound") — pull the arm with the highest mean plus a confidence bonus that shrinks as you learn it. Optimism in the face of uncertainty.
Thompson sampling — keep a Beta posterior per arm; sample one value from each; pull whichever sample is highest. Bayesian and provably efficient.
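The three selection rules fit in a few lines each. A minimal pure-Python sketch on a 3-armed Bernoulli bandit (the arm probabilities here are made up for illustration; the demo uses 10 arms):

```python
import math
import random

random.seed(0)

probs = [0.2, 0.5, 0.7]   # hypothetical true payout rates, unknown to the agent
pulls = [0, 0, 0]         # times each arm has been pulled
wins  = [0, 0, 0]         # successes observed per arm

def pull(arm):
    """Bernoulli reward from the chosen arm; update the arm's counts."""
    r = int(random.random() < probs[arm])
    pulls[arm] += 1
    wins[arm] += r
    return r

def eps_greedy(eps=0.1):
    # explore with probability ε, otherwise take the apparent-best arm
    if random.random() < eps:
        return random.randrange(len(probs))
    return max(range(len(probs)),
               key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)

def ucb(t):
    # estimated mean + confidence bonus; the bonus shrinks as pulls[a] grows
    def score(a):
        if pulls[a] == 0:
            return float("inf")   # unpulled arms go first
        return wins[a] / pulls[a] + math.sqrt(2 * math.log(t) / pulls[a])
    return max(range(len(probs)), key=score)

def thompson():
    # sample one value from each arm's Beta(1 + wins, 1 + losses) posterior
    return max(range(len(probs)),
               key=lambda a: random.betavariate(1 + wins[a], 1 + pulls[a] - wins[a]))
```

Each function returns an arm index to feed into pull(); racing them fairly means keeping a separate copy of the pulls/wins state per strategy.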

Why ML cares

Bandits are everywhere in production ML. A/B testing — choosing which design to show users — is a 2-armed bandit. News feeds, ad placement, recommendation rankings, drug-trial allocation, model-routing — all bandit problems. Every "should I show this user version A or B?" decision is, formally, a bandit.

The bandit is also the gentlest entry into RL. No state, no transitions, no Bellman equation — just "which arm should I pull?" Master it and the broader RL machinery makes more sense.

Try this
  1. Hit Run race. All three strategies pull arms simultaneously on the same problem. Watch their cumulative-regret curves diverge — UCB and Thompson typically pull ahead of ε-greedy.
  2. Switch to UCB · confidence bonus: each arm shows a vertical "confidence bonus" segment above its estimated mean. The bonus shrinks as the arm gets pulled — that's exploration converting into knowledge.
  3. Switch to Thompson · Beta posteriors: each arm shows its Beta posterior bell curve. Wide bell = uncertain → likely to be sampled → likely to be pulled. The bell narrows around the true mean as evidence accumulates.
  4. Crank ε to 1.0 — pure random pulls. Regret grows linearly. Drop to 0 — pure greed; sometimes lucky, usually locked onto a sub-optimal arm.
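Step 4's ε sweep can be reproduced offline. A sketch with made-up arm means, scoring each run by cumulative regret (best-arm reward you could have had, minus reward actually collected):

```python
import random

def run(eps, probs, T=1000, seed=1):
    """ε-greedy for T pulls on Bernoulli arms; returns cumulative regret."""
    rng = random.Random(seed)
    n = len(probs)
    pulls, wins = [0] * n, [0.0] * n
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            arm = rng.randrange(n)          # explore
        else:                               # exploit (unpulled arms score 1.0)
            arm = max(range(n),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)
        r = float(rng.random() < probs[arm])
        pulls[arm] += 1
        wins[arm] += r
        total += r
    return T * max(probs) - total

probs = [0.1, 0.3, 0.5, 0.7]                # hypothetical true means
for eps in (0.0, 0.1, 1.0):
    print(eps, run(eps, probs))
```

ε = 1.0 pays the gap between the average arm and the best arm on every pull, so its regret grows linearly, matching the straight line in the demo.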
· Three strategies pull from the same 10 arms in parallel. Each arm column shows: estimated mean (tick), true mean (dashed line), pull count (bar height), confidence bonus (segment above the tick), Beta posterior (bell curve). Below: cumulative regret — three curves, head-to-head.
CUMULATIVE REGRET
regret(T) = T · μ* − Σₜ rₜ
"After T pulls: how much reward you'd have if you'd always pulled the best arm, minus the reward you actually got. Lower = better."
T = total pulls so far · μ* ("mu-star") = true mean of the best arm · rₜ = reward received at pull t · Σₜ = sum across pulls
ε-greedy (green) · UCB (accent) · Thompson (ink) · dashed = true mean of arm · tinted column = best arm
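Plugging hypothetical numbers into the regret formula makes it concrete:

```python
mu_star = 0.7                               # hypothetical true mean of the best arm
rewards = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]    # made-up per-pull rewards, sum = 4
T = len(rewards)

regret = T * mu_star - sum(rewards)         # 10 * 0.7 - 4
print(regret)                               # 3.0
```

Perfect play would have earned 7 expected reward over those 10 pulls; this run earned 4, so 3 reward was left on the table.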
Where you've seen this · 04 examples
A/B testing infrastructure

Microsoft, Booking.com, Spotify, and most large product orgs use bandit-based experimentation to allocate traffic. Unlike a fixed 50/50 split, bandit allocation shifts traffic away from a losing variant as evidence accumulates, cutting the regret of serving the "loser" for the test's full duration.

News and content ranking

Yahoo!'s 2010 contextual-bandit paper on front-page news recommendation popularized bandits for article ranking. Reddit's "r/all" ranking, Hacker News' front page, and most personalization layers since use bandit variants to balance "popular" vs "let's see if this new thing pops."

Adaptive clinical trials

Bandit-style allocation of patients to treatment arms — assign more people to the apparently-better arm as evidence accumulates. Reduces the ethical cost of randomly assigning patients to a likely-worse treatment.

Online ads & bidding

Google Ads' "smart bidding" and Meta's ads platform pick which creative to show next using contextual bandits. Each impression is a pull; click-through rate is the reward; the user and page together are the context.
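The simplest way to make a bandit "contextual" is to run one independent bandit per context bucket. A sketch with entirely made-up segments and click-through rates (real systems use richer features and models like LinUCB, not a lookup table):

```python
import random

rng = random.Random(0)

# Hypothetical setup: two user segments, three ad creatives,
# with invented click-through rates per (context, creative).
ctr = {"mobile":  [0.02, 0.10, 0.03],
       "desktop": [0.12, 0.02, 0.04]}

# [pulls, clicks] per creative, kept separately for each context
stats = {c: [[0, 0] for _ in range(3)] for c in ctr}

def choose(context, eps=0.1):
    # per-context ε-greedy: the context selects which bandit's counts to use
    arms = stats[context]
    if rng.random() < eps:
        return rng.randrange(3)
    return max(range(3),
               key=lambda a: arms[a][1] / arms[a][0] if arms[a][0] else 1.0)

def show(context):
    """One impression: choose a creative, observe a click, update counts."""
    arm = choose(context)
    clicked = rng.random() < ctr[context][arm]
    stats[context][arm][0] += 1
    stats[context][arm][1] += int(clicked)
    return arm
```

After enough impressions, each segment converges on its own best creative, which a single context-free bandit could never do.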

Further reading