Q-Learning
An agent stumbles around a grid. Every move updates one cell of a giant table. No model of the environment, no plan — just trial, error, and bookkeeping. Eventually the table tells the agent what to do everywhere.
Q-learning learns the value of every (state, action) pair — the table Q(s, a) — without knowing the transition probabilities or rewards in advance.
Each step: the agent picks an action (mostly greedy with respect to current Q, sometimes random — that's ε-greedy exploration), observes the reward and next state, then updates one cell: Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, ·) − Q(s, a)].
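In code, one step is only a few lines. A minimal Python sketch; the grid size, action count, and parameter values are illustrative assumptions, not taken from the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 25, 4          # e.g. a 5x5 grid with 4 moves (illustrative assumption)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))  # the whole "giant table", initialized to zero

def choose_action(state):
    """ε-greedy: usually exploit the best known action, sometimes explore at random."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))  # explore: uniform random action
    return int(np.argmax(Q[state]))          # exploit: best action under current Q

def q_update(state, action, reward, next_state):
    """Nudge one cell of the table toward r + γ · max_a′ Q(s′, a′)."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```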
That single update rule, applied millions of times, converges to the optimal Q, provided every (state, action) pair keeps being visited and the learning rate is tapered appropriately. The optimal policy is then "pick the action with the highest Q at every state." No symbolic reasoning required: just arithmetic and patience.
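Wrapped in an episode loop, the two functions above are the whole algorithm. A sketch assuming a hypothetical environment with `reset()` and `step(action)` methods, the latter returning the next state, the reward, and a done flag:

```python
def train(env, n_episodes=5000):
    """Run Q-learning to (approximate) convergence, then read off the greedy policy."""
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)  # hypothetical env interface
            q_update(state, action, reward, next_state)
            state = next_state
    # Terminal states are never acted from, so their Q-rows stay zero and the
    # bootstrap term correctly vanishes at episode ends.
    return np.argmax(Q, axis=1)  # the learned policy: best action in every state
```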
Q-learning is the foundational model-free RL algorithm — and the direct ancestor of DeepMind's DQN, the network that learned to play 49 Atari games from raw pixels in 2015.
Every modern deep-RL method (PPO, A3C, SAC, AlphaZero) descends from this idea: estimate value functions from experience, use them to drive better policy. Even RLHF for language models is, under the hood, Q-learning's spiritual cousin.
- Hit Train. Watch the agent (the dot) explore the grid; the Q-table on the right fills in cell by cell as it learns which moves earn reward.
- Drag ε to 0 (pure greed): the agent stops exploring and may settle for a suboptimal policy. Drag it to 1 (pure random): learning is slow but eventually correct, because Q-learning is off-policy. In practice, agents often start exploratory and decay ε over time; see the sketch after this list.
- Reset and watch the agent's first 20 episodes. Initially it wanders aimlessly; after enough updates, it walks straight to the goal.
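Between those extremes, a common trick (not exposed in the demo's single slider) is to decay ε over training, exploring heavily at first and exploiting later. A hypothetical linear schedule:

```python
def epsilon_schedule(episode, eps_start=1.0, eps_end=0.05, decay_episodes=2000):
    """Linear ε decay from eps_start down to eps_end (all values illustrative)."""
    fraction = min(episode / decay_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```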
Q(s,a) ← Q(s,a) + α · [ r + γ · max_{a′} Q(s′, a′) − Q(s,a) ]
old estimate ← old estimate + (learning rate) × [ reward + γ × best next Q − old estimate ]
α = alpha (learning rate, 0–1) · γ = gamma (discount factor) · ε = epsilon (random-action rate) · max_{a′} = best Q over the next state's actions
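Plugging in made-up numbers makes the update concrete. Assume α = 0.1, γ = 0.9, a current estimate Q(s,a) = 0.5, an observed reward r = 1, and a best next-state value of 2.0 (all illustrative):

```python
alpha, gamma = 0.1, 0.9                    # illustrative values
q_old, reward, best_next_q = 0.5, 1.0, 2.0

td_target = reward + gamma * best_next_q   # 1.0 + 0.9 * 2.0 = 2.8
td_error = td_target - q_old               # 2.8 - 0.5 = 2.3
q_new = q_old + alpha * td_error           # 0.5 + 0.1 * 2.3 = 0.73
print(round(q_new, 2))                     # 0.73
```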
The 2015 Nature paper whose agent learned to play 49 Atari games from raw pixels: a convolutional Q-network, trained with the same update rule you see above, plus experience replay and a target network. The first time deep learning beat humans on a non-trivial RL benchmark.
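As a rough sketch of how the tabular rule becomes a network loss (assumed PyTorch; `q_net`, `target_net`, and the batch layout are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: states (B, obs_dim), actions (B,) int64, rewards (B,),
    # next states (B, obs_dim), done flags (B,) as 0./1. floats (all assumed shapes)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for the actions taken
    with torch.no_grad():
        # TD target r + γ · max_a′ Q_target(s′, a′); bootstrap term vanishes at terminal states
        target = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = F.smooth_l1_loss(q_sa, target)  # Huber loss, standing in for the paper's error clipping
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```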
YouTube's "what to watch next" recommendations reportedly use Q-learning variants to balance long-term engagement, treating each session as a state and each candidate video as an action. Netflix and TikTok are said to use similar approaches.
Real-time auction systems use Q-learning to decide bid prices, with the user, page, and time of day forming the state. Google Ads, Meta's ad system, and most large DSPs reportedly run Q-style updates at billions-of-events scale.
Battle-grid AI in strategy games, NPC behavior in MMOs, drone racing: these are often Q-learning or its modern descendants, typically pre-trained offline and then frozen at runtime.
- Human-level control through deep reinforcement learning · Mnih et al. (2015) · paper · The DQN Nature paper. Established the recipe (Q-learning + deep nets + experience replay + target networks) that is still the foundation of value-based deep RL.
- Reinforcement Learning: An Introduction, Ch. 6 · Sutton & Barto · textbook · The temporal-difference learning chapter. Q-learning is presented as a special case; SARSA is its on-policy sibling.
- Spinning Up in Deep RL · OpenAI · course · Practical, code-first introduction to modern deep-RL algorithms, connecting the core ideas behind tabular Q-learning to their full deep variants.
- Deep Reinforcement Learning: Pong from Pixels · Andrej Karpathy · essay · A from-scratch policy-gradient agent that learns Pong. Pairs nicely with this page: the same RL frame, different algorithm.