Markov Decision Processes
An agent in a grid. A handful of states. Actions, transitions, and a reward at the end. The minimal stage on which every reinforcement-learning algorithm performs — and the one you have to draw on a whiteboard before any of them make sense.
A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ): states, actions, a transition function, a reward function, and a discount.
The agent observes a state, picks an action, transitions to a new state with some probability, and receives a reward. The "Markov" property means the next state depends only on the current state and action — not the full history.
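The (S, A, P, R, γ) tuple fits in a few plain data structures. A minimal sketch of a hypothetical two-state MDP (the state names, probabilities, and rewards here are illustrative, not from the demo above):

```python
import random

# A toy two-state MDP as plain data structures. Names and numbers are
# illustrative assumptions, not taken from the grid demo.
states = ["cold", "hot"]
actions = ["wait", "heat"]

# P[s][a] maps each possible next state to its probability;
# R[s][a] is the immediate reward for taking action a in state s.
P = {
    "cold": {"wait": {"cold": 1.0},
             "heat": {"hot": 0.9, "cold": 0.1}},
    "hot":  {"wait": {"hot": 0.8, "cold": 0.2},
             "heat": {"hot": 1.0}},
}
R = {
    "cold": {"wait": 0.0, "heat": -1.0},   # heating costs energy
    "hot":  {"wait": 1.0, "heat": -1.0},
}
gamma = 0.9

def step(s, a):
    """Sample one transition. The Markov property: only (s, a) is needed."""
    nexts = list(P[s][a])
    probs = [P[s][a][n] for n in nexts]
    s_next = random.choices(nexts, weights=probs)[0]
    return s_next, R[s][a]
```

Note that `step` takes only the current state and action; no history is threaded through anywhere, which is exactly the Markov property in code.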
Solve the MDP by computing the value of each state — the expected discounted total reward you'll collect starting from there. Once you know the values, picking the best action at each state ("the policy") is a one-step lookahead: pick whichever action maximizes the immediate reward plus the discounted expected value of the next state.
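That one-step lookahead can be sketched directly. A hypothetical 4-state corridor (states 0–3, goal at state 3, deterministic left/right moves) where the values V are assumed already computed for γ = 0.9:

```python
# Greedy policy extraction on a hypothetical 4-state corridor.
# V is assumed to be already converged; these numbers are illustrative.
gamma = 0.9
V = {0: 0.729, 1: 0.81, 2: 0.9, 3: 1.0}

def next_state(s, a):
    return max(0, min(3, s + a))        # a is -1 (left) or +1 (right)

def reward(s, a):
    return 1.0 if s != 3 and next_state(s, a) == 3 else 0.0

def greedy_action(s):
    # argmax over actions of: immediate reward + gamma * value of next state
    return max((-1, +1), key=lambda a: reward(s, a) + gamma * V[next_state(s, a)])

policy = {s: greedy_action(s) for s in range(3)}
# Every non-terminal state points right, toward the goal at state 3.
```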
Every reinforcement-learning algorithm — Q-learning, policy gradients, AlphaGo, ChatGPT-style RLHF — solves some MDP. The grid below is a toy, but the structure is identical: states, actions, rewards, transitions. Understanding the MDP makes everything that follows make sense.
MDPs also appear far outside RL. Inventory control, dialog systems, click-through optimization, healthcare treatment planning, robot navigation — anywhere a sequence of decisions affects future states is, mathematically, an MDP.
- Hit Step (one Bellman backup) or Auto-step. Watch the value of each cell update as information from the goal "ripples" outward — high-value states are the ones close to the reward.
- Drag the discount γ slider toward 0. Faraway rewards stop mattering — the agent becomes myopic, and only cells adjacent to the goal hold meaningful value. Drag it toward 1, and value from distant goals propagates nearly undiminished across the grid.
- Click any cell to make it a wall (or remove one). The values rearrange around the obstacle as a new optimal path emerges.
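The γ-slider behavior comes from geometric discounting: a reward collected k steps in the future is worth γ^k of its face value today. A tiny numeric sketch:

```python
# Why low gamma makes the agent myopic: a reward k steps away is
# worth gamma**k of its face value from the current state.
def discounted(reward, steps, gamma):
    return gamma ** steps * reward

# The same reward of 1.0, sitting 5 moves away on the grid:
myopic     = discounted(1.0, 5, 0.1)    # 0.1**5  = 0.00001, effectively invisible
farsighted = discounted(1.0, 5, 0.99)   # 0.99**5 is about 0.951, nearly full value
```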
V(s) ← max_a [ R(s,a) + γ Σ_s′ P(s′|s,a) · V(s′) ]
value of here = best action's ( immediate reward + γ × average, over where you might land, of that cell's value )
s = state · a = action · R = reward · P = transition prob · γ = discount · Σ = sum over next states · V(s′) = value of next state
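The update rule above translates almost line-for-line into a value-iteration loop. A minimal sketch on a 4×4 grid; the goal position, wall cells, reward of 1, and stopping tolerance are illustrative assumptions, not the demo's exact configuration:

```python
# Value iteration on a 4x4 grid world (layout and constants are assumptions).
# Moves are deterministic; bumping a wall or the border leaves you in place.
N, GOAL, WALLS = 4, (3, 3), {(1, 1), (2, 1)}
gamma, theta = 0.9, 1e-6
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Next state and reward for taking move a from cell s."""
    if s == GOAL:
        return s, 0.0                        # terminal: absorbing, no reward
    nxt = (s[0] + a[0], s[1] + a[1])
    if nxt in WALLS or not (0 <= nxt[0] < N and 0 <= nxt[1] < N):
        nxt = s                              # blocked: stay where you are
    return nxt, (1.0 if nxt == GOAL else 0.0)

V = {(r, c): 0.0 for r in range(N) for c in range(N) if (r, c) not in WALLS}
while True:                                  # sweep Bellman backups until stable
    delta = 0.0
    for s in V:
        best = max(r + gamma * V[nxt] for nxt, r in (step(s, a) for a in moves))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break
```

After convergence the values decay geometrically with distance from the goal — the same "ripple" the Step button animates in the demo.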
Roomba-style robots, warehouse pickers, and AMRs all model their environment as a grid (or a continuous variant) and solve it as an MDP. Walls become forbidden states; charging docks become high-value goals.
Customer-service bots cast each conversation turn as a state. Actions are responses; rewards are completed transactions or positive surveys. Modern RLHF training is, formally, an MDP — turns are states, the LLM is the agent.
How much to reorder at each warehouse, when, given uncertain demand? An MDP. Walmart and Amazon's order-management systems are full of MDP-shaped decision rules.
Sequencing chemotherapy regimens, ICU ventilator settings, sepsis interventions — all MDP problems. The policy is "what to do next given the patient's current state." A growing area of medical ML research.
- Reinforcement Learning: An Introduction — Sutton & Barto (textbook, free online) · The canonical RL textbook. Chapter 3 is the MDP chapter; everything else builds on it.
- UCL Course on RL — David Silver, DeepMind (lectures) · The lecture series that introduced a generation of AI researchers to MDPs. The Markov property and the Bellman equation are covered in the first lecture.
- Gymnasium — Farama Foundation (library) · The standard Python interface for RL environments. Every environment exposes the MDP structure: step(), reset(), observations, rewards.
- Artificial Intelligence: Foundations of Computational Agents — Poole & Mackworth (free book) · The MDP / value-iteration treatment for a broader AI audience. Excellent worked examples on grid worlds.