Q-Learning
An agent stumbles around a grid. Every move updates one cell of a giant table. No model of the environment, no plan — just trial, error, and bookkeeping. Eventually the table tells the agent what to do everywhere.
Q-learning learns the value of every (state, action) pair — the table Q(s, a) — without knowing the transition probabilities or rewards in advance.
Each step: the agent picks an action (mostly greedy with respect to current Q, sometimes random — that's ε-greedy exploration), observes the reward and next state, then updates one cell: Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, ·) − Q(s, a)].
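In code, one step is only a few lines. A minimal Python sketch; the grid size, action count, and parameter values are illustrative assumptions, not taken from the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 25, 4          # e.g. a 5x5 grid with 4 moves (illustrative assumption)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))  # the whole "giant table", initialized to zero

def choose_action(state):
    """ε-greedy: usually exploit the best known action, sometimes explore at random."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))  # explore: uniform random action
    return int(np.argmax(Q[state]))          # exploit: best action under current Q

def q_update(state, action, reward, next_state):
    """Nudge one cell of the table toward r + γ · max_a′ Q(s′, a′)."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```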
That single update rule, applied millions of times, converges to the optimal Q, provided every (state, action) pair keeps being visited and the learning rate is tapered appropriately. The optimal policy is then "pick the action with the highest Q at every state." No symbolic reasoning required: just arithmetic and patience.
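Wrapped in an episode loop, the two functions above are the whole algorithm. A sketch assuming a hypothetical environment with `reset()` and `step(action)` methods, the latter returning the next state, the reward, and a done flag:

```python
def train(env, n_episodes=5000):
    """Run Q-learning to (approximate) convergence, then read off the greedy policy."""
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)  # hypothetical env interface
            q_update(state, action, reward, next_state)
            state = next_state
    # Terminal states are never acted from, so their Q-rows stay zero and the
    # bootstrap term correctly vanishes at episode ends.
    return np.argmax(Q, axis=1)  # the learned policy: best action in every state
```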
Q-learning is the foundational model-free RL algorithm — and the direct ancestor of DeepMind's DQN, the network that learned to play 49 Atari games from raw pixels in 2015.
Every modern deep-RL method (PPO, A3C, SAC, AlphaZero) descends from this idea: estimate value functions from experience, use them to drive better policy. Even RLHF for language models is, under the hood, Q-learning's spiritual cousin.
- Hit Train. Watch the agent (the dot) explore the grid; the Q-table on the right fills in cell by cell as it learns which moves earn reward.
- Drag ε to 0 (pure greed): the agent stops exploring and may settle for a suboptimal policy. Drag it to 1 (pure random): learning is slow but eventually correct, because Q-learning is off-policy. In practice, agents often start exploratory and decay ε over time; see the sketch after this list.
- Reset and watch the agent's first 20 episodes. Initially it wanders aimlessly; after enough updates, it walks straight to the goal.
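Between those extremes, a common trick (not exposed in the demo's single slider) is to decay ε over training, exploring heavily at first and exploiting later. A hypothetical linear schedule:

```python
def epsilon_schedule(episode, eps_start=1.0, eps_end=0.05, decay_episodes=2000):
    """Linear ε decay from eps_start down to eps_end (all values illustrative)."""
    fraction = min(episode / decay_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```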
Q(s,a) ← Q(s,a) + α · [ r + γ · max_{a′} Q(s′, a′) − Q(s,a) ]
old estimate ← old estimate + (learning rate) × [ reward + γ × best next Q − old estimate ]
α = alpha (learning rate, 0–1) · γ = gamma (discount factor) · ε = epsilon (random-action rate) · max_{a′} = best Q over the next state's actions
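Plugging in made-up numbers makes the update concrete. Assume α = 0.1, γ = 0.9, a current estimate Q(s,a) = 0.5, an observed reward r = 1, and a best next-state value of 2.0 (all illustrative):

```python
alpha, gamma = 0.1, 0.9                    # illustrative values
q_old, reward, best_next_q = 0.5, 1.0, 2.0

td_target = reward + gamma * best_next_q   # 1.0 + 0.9 * 2.0 = 2.8
td_error = td_target - q_old               # 2.8 - 0.5 = 2.3
q_new = q_old + alpha * td_error           # 0.5 + 0.1 * 2.3 = 0.73
print(round(q_new, 2))                     # 0.73
```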
The 2015 Nature paper whose agent learned to play 49 Atari games from raw pixels: a convolutional Q-network, trained with the same update rule you see above, plus experience replay and a target network. The first time deep learning beat humans on a non-trivial RL benchmark.
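As a rough sketch of how the tabular rule becomes a network loss (assumed PyTorch; `q_net`, `target_net`, and the batch layout are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: states (B, obs_dim), actions (B,) int64, rewards (B,),
    # next states (B, obs_dim), done flags (B,) as 0./1. floats (all assumed shapes)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for the actions taken
    with torch.no_grad():
        # TD target r + γ · max_a′ Q_target(s′, a′); bootstrap term vanishes at terminal states
        target = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = F.smooth_l1_loss(q_sa, target)  # Huber loss, standing in for the paper's error clipping
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```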
YouTube's "what to watch next" recommendations reportedly use Q-learning variants to balance long-term engagement, treating each session as a state and each candidate video as an action. Netflix and TikTok are said to use similar approaches.
Real-time auction systems use Q-learning to decide bid prices, with the user, page, and time of day forming the state. Google Ads, Meta's ad system, and most large DSPs reportedly run Q-style updates at billions-of-events scale.
Battle-grid AI in strategy games, NPC behavior in MMOs, drone racing: these are often Q-learning or its modern descendants, typically pre-trained offline and then frozen at runtime.
- Human-level control through deep reinforcement learning · Mnih et al. (2015) · paper · The DQN Nature paper. Established the recipe (Q-learning + deep nets + experience replay + target networks) that is still the foundation of value-based deep RL.
- Reinforcement Learning: An Introduction, Ch. 6 · Sutton & Barto · textbook · The temporal-difference learning chapter. Q-learning is presented as a special case; SARSA is its on-policy sibling.
- Spinning Up in Deep RL · OpenAI · course · Practical, code-first introduction to modern deep-RL algorithms, connecting the core ideas behind tabular Q-learning to their full deep variants.
- Deep Reinforcement Learning: Pong from Pixels · Andrej Karpathy · essay · A from-scratch policy-gradient agent that learns Pong. Pairs nicely with this page: the same RL frame, different algorithm.