jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXVI

Markov Decision Processes

An agent in a grid. A handful of states. Actions, transitions, and a reward at the end. The minimal stage on which every reinforcement-learning algorithm performs — and the one you have to draw on a whiteboard before any of them make sense.

The concept

A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ): states, actions, a transition function, a reward function, and a discount.

The agent observes a state, picks an action, transitions to a new state with some probability, and receives a reward. The "Markov" property means the next state depends only on the current state and action — not the full history.
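To make the tuple concrete, here is a minimal Python sketch of a gridworld like the one in the demo. The 4×4 size, the goal cell, the +1 reward, and the 80/10/10 slip probabilities are illustrative assumptions, not values read off the widget:

```python
# Hypothetical 4x4 gridworld MDP: states are (row, col) cells, actions are
# compass moves, and the agent slips sideways 10% of the time per side.
SIZE = 4
GOAL = (0, 3)          # assumed goal cell; stepping onto it pays +1
GAMMA = 0.9            # discount factor

STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def move(state, delta):
    """Apply a move, staying on the grid (bumping the edge keeps you in place)."""
    r, c = state[0] + delta[0], state[1] + delta[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

def transitions(state, action):
    """P(s'|s,a) as (next_state, probability) pairs:
    80% intended direction, 10% slip to each perpendicular side."""
    if state == GOAL:                        # goal is absorbing
        return [(state, 1.0)]
    dr, dc = ACTIONS[action]
    return [(move(state, (dr, dc)), 0.8),    # intended direction
            (move(state, (dc, dr)), 0.1),    # slip one way
            (move(state, (-dc, -dr)), 0.1)]  # slip the other way

def reward(state, action, next_state):
    """R(s,a,s'): +1 for stepping onto the goal, 0 otherwise."""
    return 1.0 if next_state == GOAL and state != GOAL else 0.0
```

Representing P as a function that returns (next state, probability) pairs keeps the value-update code later in this page short.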

Solve the MDP by computing the value of each state: the expected total discounted reward you'll collect starting from there. Once you know the values, picking the best action at each state (the policy) is trivial: choose whichever action maximizes the immediate reward plus the discounted value of wherever it's likely to land.
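Reading the policy off a value table is a few lines. A sketch that reuses the hypothetical transitions, reward, and GAMMA defined above; note it maximizes the full one-step lookahead, not just V of the most likely next cell:

```python
def greedy_policy(V):
    """At each state, pick the action that maximizes immediate reward
    plus GAMMA times the expected value of the landing cell."""
    policy = {}
    for s in STATES:
        def q(a, s=s):  # expected one-step return of taking action a in state s
            return sum(p * (reward(s, a, s2) + GAMMA * V[s2])
                       for s2, p in transitions(s, a))
        policy[s] = max(ACTIONS, key=q)
    return policy
```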

Why ML cares

Every reinforcement-learning algorithm — Q-learning, policy gradients, AlphaGo, ChatGPT-style RLHF — solves some MDP. The grid below is a toy, but the structure is identical: states, actions, rewards, transitions. Understanding the MDP makes everything that follows make sense.

MDPs also appear far outside RL. Inventory control, dialog systems, click-through optimization, healthcare treatment planning, robot navigation — anywhere a sequence of decisions affects future states is, mathematically, an MDP.

Try this
  1. Hit Step (one Bellman backup) or Auto-step. Watch the value of each cell update as information from the goal "ripples" outward — high-value states are the ones close to the reward.
  2. Drag the discount γ slider toward 0. Faraway rewards stop mattering — the agent becomes myopic, and only cells adjacent to the goal hold meaningful value. Drag it to 1; distant goals propagate fully across the grid. (The sketch after this list quantifies the effect.)
  3. Click any cell to make it a wall (or remove one). The values rearrange around the obstacle as a new optimal path emerges.
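A quick way to see why step 2 produces myopia: a reward k steps away contributes at most γ^k of its face value to the current state. A tiny standalone computation:

```python
# Discounted worth today of a reward that is k steps away, for a few discounts.
for gamma in (0.99, 0.9, 0.5, 0.1):
    print(f"gamma={gamma}:", [round(gamma ** k, 3) for k in range(1, 7)])
```

At γ = 0.9 a goal six steps away still contributes roughly half its face value; at γ = 0.1 it is effectively invisible.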
Each cell shows its value V(s) — the expected total reward from that state. Thin arrows are transitions (where each action might land); the thicker arrow + accent border marks the optimal action. Cells that just changed value pulse — the wave moving outward from the goal is information about future reward.
BELLMAN UPDATE
V(s) ← max_a [ R(s,a) + γ Σ_{s′} P(s′|s,a) · V(s′) ]
value of this state = best action's ( immediate reward + γ × probability-weighted average of the values of where you might land )
s = state · a = action · R = reward · P = transition prob · γ = discount · Σ = sum over next states · V(s′) = value of next state
thick arrow = optimal action (intended direction) · thin arrow = stochastic slip (10% each side) · warm fill = high V · pulse = value just changed this sweep
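The Step button performs one synchronous sweep of exactly this update over every cell. Here is a sketch of that sweep, and of running it to convergence (value iteration), again using the hypothetical gridworld definitions from earlier:

```python
def bellman_sweep(V):
    """One synchronous Bellman backup over every state (what a single 'Step' does)."""
    return {s: max(sum(p * (reward(s, a, s2) + GAMMA * V[s2])
                       for s2, p in transitions(s, a))
                   for a in ACTIONS)
            for s in STATES}

def value_iteration(tol=1e-6):
    """Sweep until values stop changing: the ripple from the goal has settled."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = bellman_sweep(V)
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V
```

Running V = value_iteration() and then greedy_policy(V) should mirror what the demo shows: cells near the goal settle first, values ripple outward sweep by sweep, and the greedy arrows point up the value gradient toward the goal.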
Where you've seen this · 4 examples
Robot navigation

Roomba-style robots, warehouse pickers, and autonomous mobile robots (AMRs) all model their environment as a grid (or a continuous variant) and solve it as an MDP. Walls become forbidden states; charging docks become high-value goals.

Dialog systems

Customer-service bots cast each conversation turn as a state. Actions are responses; rewards are completed transactions or positive surveys. Modern RLHF training is, formally, an MDP — turns are states, the LLM is the agent.

Inventory and supply chain

How much to reorder at each warehouse, when, given uncertain demand? An MDP. Walmart and Amazon's order-management systems are full of MDP-shaped decision rules.

Healthcare treatment policies

Sequencing chemotherapy regimens, ICU ventilator settings, sepsis interventions — all MDP problems. The policy is "what to do next given the patient's current state." A growing area of medical ML research.

Further reading