jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXIX

Multi-Armed Bandits

Ten slot machines. Each pays out at a different unknown rate. You have a thousand pulls. Explore an unknown lever, or exploit the best one so far? The simplest setting where exploration vs exploitation actually matters — and the testbed for nearly every algorithm that handles it.

The concept

A multi-armed bandit is an MDP without states — only actions ("arms"), each with an unknown reward distribution.

The challenge is the exploration–exploitation tradeoff: pulling the apparent-best arm collects reward but might miss a better one; trying random arms gathers info but wastes pulls. Three classic strategies, and below they race head-to-head:

ε-greedy — most pulls go to the apparent-best arm; ε of them go to a random arm. Simple, surprisingly hard to beat.
UCB ("upper confidence bound") — pull the arm with the highest mean plus a confidence bonus that shrinks as you learn it. Optimism in the face of uncertainty.
Thompson sampling — keep a Beta posterior per arm; sample one value from each; pull whichever sample is highest. Bayesian and provably efficient.
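The three selection rules fit in a few lines each. A minimal pure-Python sketch on a 3-armed Bernoulli bandit (the arm probabilities here are made up for illustration; the demo uses 10 arms):

```python
import math
import random

random.seed(0)

probs = [0.2, 0.5, 0.7]   # hypothetical true payout rates, unknown to the agent
pulls = [0, 0, 0]         # times each arm has been pulled
wins  = [0, 0, 0]         # successes observed per arm

def pull(arm):
    """Bernoulli reward from the chosen arm; update the arm's counts."""
    r = int(random.random() < probs[arm])
    pulls[arm] += 1
    wins[arm] += r
    return r

def eps_greedy(eps=0.1):
    # explore with probability ε, otherwise take the apparent-best arm
    if random.random() < eps:
        return random.randrange(len(probs))
    return max(range(len(probs)),
               key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)

def ucb(t):
    # estimated mean + confidence bonus; the bonus shrinks as pulls[a] grows
    def score(a):
        if pulls[a] == 0:
            return float("inf")   # unpulled arms go first
        return wins[a] / pulls[a] + math.sqrt(2 * math.log(t) / pulls[a])
    return max(range(len(probs)), key=score)

def thompson():
    # sample one value from each arm's Beta(1 + wins, 1 + losses) posterior
    return max(range(len(probs)),
               key=lambda a: random.betavariate(1 + wins[a], 1 + pulls[a] - wins[a]))
```

Each function returns an arm index to feed into pull(); racing them fairly means keeping a separate copy of the pulls/wins state per strategy.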

Why ML cares

Bandits are everywhere in production ML. A/B testing — choosing which design to show users — is a 2-armed bandit. News feeds, ad placement, recommendation rankings, drug-trial allocation, model-routing — all bandit problems. Every "should I show this user version A or B?" decision is, formally, a bandit.

The bandit is also the gentlest entry into RL. No state, no transitions, no Bellman equation — just "which arm should I pull?" Master it and the broader RL machinery makes more sense.

Try this
  1. Hit Run race. All three strategies pull arms simultaneously on the same problem. Watch their cumulative-regret curves diverge — UCB and Thompson typically pull ahead of ε-greedy.
  2. Switch to UCB · confidence bonus: each arm shows a vertical "confidence bonus" segment above its estimated mean. The bonus shrinks as the arm gets pulled — that's exploration converting into knowledge.
  3. Switch to Thompson · Beta posteriors: each arm shows its Beta posterior bell curve. Wide bell = uncertain → likely to be sampled → likely to be pulled. The bell narrows around the true mean as evidence accumulates.
  4. Crank ε to 1.0 — pure random pulls. Regret grows linearly. Drop to 0 — pure greed; sometimes lucky, usually locked onto a sub-optimal arm.
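Step 4's ε sweep can be reproduced offline. A sketch with made-up arm means, scoring each run by cumulative regret (best-arm reward you could have had, minus reward actually collected):

```python
import random

def run(eps, probs, T=1000, seed=1):
    """ε-greedy for T pulls on Bernoulli arms; returns cumulative regret."""
    rng = random.Random(seed)
    n = len(probs)
    pulls, wins = [0] * n, [0.0] * n
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            arm = rng.randrange(n)          # explore
        else:                               # exploit (unpulled arms score 1.0)
            arm = max(range(n),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)
        r = float(rng.random() < probs[arm])
        pulls[arm] += 1
        wins[arm] += r
        total += r
    return T * max(probs) - total

probs = [0.1, 0.3, 0.5, 0.7]                # hypothetical true means
for eps in (0.0, 0.1, 1.0):
    print(eps, run(eps, probs))
```

ε = 1.0 pays the gap between the average arm and the best arm on every pull, so its regret grows linearly, matching the straight line in the demo.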
· Three strategies pull from the same 10 arms in parallel. Each arm column shows: estimated mean (tick), true mean (dashed line), pull count (bar height), confidence bonus (segment above the tick), Beta posterior (bell curve). Below: cumulative regret — three curves, head-to-head.
CUMULATIVE REGRET
regret(T) = T · μ* − Σₜ rₜ
"After T pulls: how much reward you'd have if you'd always pulled the best arm, minus the reward you actually got. Lower = better."
T = total pulls so far · μ* ("mu-star") = true mean of the best arm · rₜ = reward received at pull t · Σₜ = sum across pulls
ε-greedy (green) · UCB (accent) · Thompson (ink) · dashed = true mean of arm · tinted column = best arm
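Plugging hypothetical numbers into the regret formula makes it concrete:

```python
mu_star = 0.7                               # hypothetical true mean of the best arm
rewards = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]    # made-up per-pull rewards, sum = 4
T = len(rewards)

regret = T * mu_star - sum(rewards)         # 10 * 0.7 - 4
print(regret)                               # 3.0
```

Perfect play would have earned 7 expected reward over those 10 pulls; this run earned 4, so 3 reward was left on the table.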
Where you've seen this · 04 examples
A/B testing infrastructure

Microsoft, Booking.com, Spotify, and most large product orgs use bandit-based experimentation to allocate traffic. Unlike a fixed 50/50 split, bandit allocation shifts traffic away from a losing variant as evidence accumulates, cutting the regret of serving the "loser" for the test's full duration.

News and content ranking

Yahoo!'s 2010 contextual-bandit paper on front-page news recommendation popularized bandits for article ranking. Reddit's "r/all" ranking, Hacker News' front page, and most personalization layers since use bandit variants to balance "popular" vs "let's see if this new thing pops."

Adaptive clinical trials

Bandit-style allocation of patients to treatment arms — assign more people to the apparently-better arm as evidence accumulates. Reduces the ethical cost of randomly assigning patients to a likely-worse treatment.

Online ads & bidding

Google Ads' "smart bidding" and Meta's ads platform pick which creative to show next using contextual bandits. Each impression is a pull; click-through rate is the reward; the user and page together are the context.
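The simplest way to make a bandit "contextual" is to run one independent bandit per context bucket. A sketch with entirely made-up segments and click-through rates (real systems use richer features and models like LinUCB, not a lookup table):

```python
import random

rng = random.Random(0)

# Hypothetical setup: two user segments, three ad creatives,
# with invented click-through rates per (context, creative).
ctr = {"mobile":  [0.02, 0.10, 0.03],
       "desktop": [0.12, 0.02, 0.04]}

# [pulls, clicks] per creative, kept separately for each context
stats = {c: [[0, 0] for _ in range(3)] for c in ctr}

def choose(context, eps=0.1):
    # per-context ε-greedy: the context selects which bandit's counts to use
    arms = stats[context]
    if rng.random() < eps:
        return rng.randrange(3)
    return max(range(3),
               key=lambda a: arms[a][1] / arms[a][0] if arms[a][0] else 1.0)

def show(context):
    """One impression: choose a creative, observe a click, update counts."""
    arm = choose(context)
    clicked = rng.random() < ctr[context][arm]
    stats[context][arm][0] += 1
    stats[context][arm][1] += int(clicked)
    return arm
```

After enough impressions, each segment converges on its own best creative, which a single context-free bandit could never do.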

Further reading