jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXVI

Markov Decision Processes

An agent in a grid. A handful of states. Actions, transitions, and a reward at the end. The minimal stage on which every reinforcement-learning algorithm performs — and the one you have to draw on a whiteboard before any of them make sense.

The concept

A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ): states, actions, a transition function, a reward function, and a discount.

The agent observes a state, picks an action, transitions to a new state with some probability, and receives a reward. The "Markov" property means the next state depends only on the current state and action — not the full history.
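To make the tuple concrete, here is a minimal Python sketch of a gridworld like the one in the demo. The 4×4 size, the goal cell, the +1 reward, and the 80/10/10 slip probabilities are illustrative assumptions, not values read off the widget:

```python
# Hypothetical 4x4 gridworld MDP: states are (row, col) cells, actions are
# compass moves, and the agent slips sideways 10% of the time per side.
SIZE = 4
GOAL = (0, 3)          # assumed goal cell; stepping onto it pays +1
GAMMA = 0.9            # discount factor

STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def move(state, delta):
    """Apply a move, staying on the grid (bumping the edge keeps you in place)."""
    r, c = state[0] + delta[0], state[1] + delta[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

def transitions(state, action):
    """P(s'|s,a) as (next_state, probability) pairs:
    80% intended direction, 10% slip to each perpendicular side."""
    if state == GOAL:                        # goal is absorbing
        return [(state, 1.0)]
    dr, dc = ACTIONS[action]
    return [(move(state, (dr, dc)), 0.8),    # intended direction
            (move(state, (dc, dr)), 0.1),    # slip one way
            (move(state, (-dc, -dr)), 0.1)]  # slip the other way

def reward(state, action, next_state):
    """R(s,a,s'): +1 for stepping onto the goal, 0 otherwise."""
    return 1.0 if next_state == GOAL and state != GOAL else 0.0
```

Representing P as a function that returns (next state, probability) pairs keeps the value-update code later in this page short.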

Solve the MDP by computing the value of each state: the expected total discounted reward you'll collect starting from there. Once you know the values, picking the best action at each state (the policy) is trivial: choose whichever action maximizes the immediate reward plus the discounted value of wherever it's likely to land.
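Reading the policy off a value table is a few lines. A sketch that reuses the hypothetical transitions, reward, and GAMMA defined above; note it maximizes the full one-step lookahead, not just V of the most likely next cell:

```python
def greedy_policy(V):
    """At each state, pick the action that maximizes immediate reward
    plus GAMMA times the expected value of the landing cell."""
    policy = {}
    for s in STATES:
        def q(a, s=s):  # expected one-step return of taking action a in state s
            return sum(p * (reward(s, a, s2) + GAMMA * V[s2])
                       for s2, p in transitions(s, a))
        policy[s] = max(ACTIONS, key=q)
    return policy
```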

Why ML cares

Every reinforcement-learning algorithm — Q-learning, policy gradients, AlphaGo, ChatGPT-style RLHF — solves some MDP. The grid below is a toy, but the structure is identical: states, actions, rewards, transitions. Understanding the MDP makes everything that follows make sense.

MDPs also appear far outside RL. Inventory control, dialog systems, click-through optimization, healthcare treatment planning, robot navigation — anywhere a sequence of decisions affects future states is, mathematically, an MDP.

Try this
  1. Hit Step (one Bellman backup) or Auto-step. Watch the value of each cell update as information from the goal "ripples" outward — high-value states are the ones close to the reward.
  2. Drag the discount γ slider toward 0. Faraway rewards stop mattering — the agent becomes myopic, and only cells adjacent to the goal hold meaningful value. Drag it to 1; distant goals propagate fully across the grid. (The sketch after this list quantifies the effect.)
  3. Click any cell to make it a wall (or remove one). The values rearrange around the obstacle as a new optimal path emerges.
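A quick way to see why step 2 produces myopia: a reward k steps away contributes at most γ^k of its face value to the current state. A tiny standalone computation:

```python
# Discounted worth today of a reward that is k steps away, for a few discounts.
for gamma in (0.99, 0.9, 0.5, 0.1):
    print(f"gamma={gamma}:", [round(gamma ** k, 3) for k in range(1, 7)])
```

At γ = 0.9 a goal six steps away still contributes roughly half its face value; at γ = 0.1 it is effectively invisible.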
Each cell shows its value V(s) — the expected total reward from that state. Thin arrows are transitions (where each action might land); the thicker arrow + accent border marks the optimal action. Cells that just changed value pulse — the wave moving outward from the goal is information about future reward.
BELLMAN UPDATE
V(s) ← max_a [ R(s,a) + γ Σ_{s′} P(s′|s,a) · V(s′) ]
value of this state = best action's ( immediate reward + γ × probability-weighted average of the values of where you might land )
s = state · a = action · R = reward · P = transition prob · γ = discount · Σ = sum over next states · V(s′) = value of next state
thick arrow = optimal action (intended direction) · thin arrow = stochastic slip (10% each side) · warm fill = high V · pulse = value just changed this sweep
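The Step button performs one synchronous sweep of exactly this update over every cell. Here is a sketch of that sweep, and of running it to convergence (value iteration), again using the hypothetical gridworld definitions from earlier:

```python
def bellman_sweep(V):
    """One synchronous Bellman backup over every state (what a single 'Step' does)."""
    return {s: max(sum(p * (reward(s, a, s2) + GAMMA * V[s2])
                       for s2, p in transitions(s, a))
                   for a in ACTIONS)
            for s in STATES}

def value_iteration(tol=1e-6):
    """Sweep until values stop changing: the ripple from the goal has settled."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = bellman_sweep(V)
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V
```

Running V = value_iteration() and then greedy_policy(V) should mirror what the demo shows: cells near the goal settle first, values ripple outward sweep by sweep, and the greedy arrows point up the value gradient toward the goal.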
Where you've seen this · 4 examples
Robot navigation

Roomba-style robots, warehouse pickers, and autonomous mobile robots (AMRs) all model their environment as a grid (or a continuous variant) and solve it as an MDP. Walls become forbidden states; charging docks become high-value goals.

Dialog systems

Customer-service bots cast each conversation turn as a state. Actions are responses; rewards are completed transactions or positive surveys. Modern RLHF training is, formally, an MDP — turns are states, the LLM is the agent.

Inventory and supply chain

How much to reorder at each warehouse, when, given uncertain demand? An MDP. Walmart and Amazon's order-management systems are full of MDP-shaped decision rules.

Healthcare treatment policies

Sequencing chemotherapy regimens, ICU ventilator settings, sepsis interventions — all MDP problems. The policy is "what to do next given the patient's current state." A growing area of medical ML research.

Further reading