jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXVIII

Policy Gradients

Q-learning learns values, then derives a policy. Policy-gradient methods skip the middleman and learn the policy directly, nudging the probability of each action up or down in proportion to the reward it brings.

The concept

A policy is a recipe: given the state I'm in, what's the probability of each action? A policy gradient nudges those probabilities toward whichever ones earned reward.

The rule, in one line: roll out the policy, see what reward you got, then push up the log-probability of the actions you took — by an amount proportional to the reward. Good actions get more likely. Bad actions get less likely. That's it.
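Here is that rule as code: a minimal sketch of one REINFORCE update for a tabular softmax policy like the gridworld demo below uses. This assumes numpy; the names (logits, alpha) and the step size are illustrative, not the demo's actual source.

    import numpy as np

    # Tabular softmax policy: theta is one logit per (state, action).
    n_states, n_actions = 9, 4                 # 9 cells; moves: up/right/down/left
    logits = np.zeros((n_states, n_actions))
    alpha = 0.1                                # step size (illustrative)

    def action_probs(s):
        """pi(.|s): softmax over state s's row of logits."""
        z = np.exp(logits[s] - logits[s].max())
        return z / z.sum()

    def reinforce_update(rollout, R):
        """Push up log pi(a|s) for every (state, action) the rollout took,
        scaled by its total reward R. For a softmax,
        d log pi(a|s) / d logits[s] = one_hot(a) - pi(.|s)."""
        for s, a in rollout:
            grad_log_pi = -action_probs(s)
            grad_log_pi[a] += 1.0              # one_hot(a) - pi(.|s)
            logits[s] += alpha * R * grad_log_pi

One call per rollout: when R is positive, every action the rollout took gains probability mass; when R is negative, it loses mass. That's all the bar-tilting below amounts to.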

Below: a 3×3 gridworld where the agent picks one of four moves at each cell. The bars above each cell are the agent's current action probabilities. Watch them tilt toward the goal as rollouts come in.

Why ML cares

Policy gradients are the engine behind most modern deep-RL systems that don't fit Q-learning's box. PPO is the workhorse of RLHF for ChatGPT and Claude; A3C powered DeepMind's Atari agents; REINFORCE is the original they all descend from.

They're also the natural method when actions are continuous (robot joint torques, portfolio weights) or when the policy is a neural net — both cases where Q-learning's "argmax over actions" becomes intractable.
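For the continuous case, a sketch under the same assumptions: the policy is a Gaussian whose mean μ is the learned parameter, and acting is sampling rather than an argmax. The action shape and step size are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.zeros(2)     # mean of a 2-D continuous action, e.g. a torque pair
    sigma = 1.0          # fixed exploration noise
    alpha = 0.05         # step size (illustrative)

    def sample_action():
        """No argmax over actions: acting is just a ~ N(mu, sigma^2 I)."""
        return rng.normal(mu, sigma)

    def gaussian_pg_update(a, R):
        """The gradient wrt mu of log N(a; mu, sigma^2) is (a - mu) / sigma^2,
        so mu drifts toward samples that earned high reward."""
        global mu
        mu = mu + alpha * R * (a - mu) / sigma**2

This is the same μ-drift the demo's Gaussian mode animates.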

Try this
  1. Hit Train. Watch the probability bars over each cell tilt toward whichever action points to the goal. The arrows on the grid grow thicker as the policy gets more confident.
  2. Switch to Trajectories · arrows by reward. Each rollout's sampled actions appear as arrows colored by outcome — green if the rollout ended at the goal, red if it fell in the pit. Switch to Continuous · Gaussian policy for the same idea in a 2D action space — μ drifts toward whichever samples earned high reward.
  3. Move the goal: click any empty cell to make it the new goal. The policy starts re-learning from where it was — non-stationary RL in miniature.
A grid of states. Above each cell, four stacked bars show π(↑), π(→), π(↓), π(←): the agent's current probability of each move. The faint trail is the most recent rollout; the dot is the agent. Each rollout collects (state, action, reward), then the bars wiggle toward whatever moves the high-reward rollouts used.
REINFORCE GRADIENT
∇θ J(θ) = 𝔼[ ∇θ log πθ(a|s) · R ]
"to improve the policy π with parameters θ: nudge each action's log-probability up — by exactly the reward R it earned"
∇ = gradient (which way to nudge θ) · θ = theta (policy params) · π = pi (policy) · 𝔼 = average over rollouts · log = natural log · R = reward this rollout earned
dot = agent · tall bar / thick arrow = high π(a|s) · green arrow = rollout reached goal (push up) · red arrow = rollout fell in pit (push down)
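Where the log comes from, for the curious: one application of the likelihood-ratio trick, with τ standing for a whole rollout.

    \nabla_\theta J(\theta)
      = \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
      = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \,\right]

Because πθ(τ) factors into environment dynamics times a product of per-step πθ(a|s) terms, and the dynamics don't depend on θ, the log splits into a sum of per-step log πθ(a|s) terms: exactly the per-action nudges above.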
Where you've seen this · 4 examples
RLHF for LLMs

ChatGPT was aligned with PPO (a policy-gradient variant), and Claude and Gemini use similar RLHF pipelines. The LM is the policy; preferences from human raters train a reward model that defines the reward. The same equation as on this page, just at billions of parameters.
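What the "proximal" in PPO adds to the page's equation is a clip on how far one update can move the policy. A minimal sketch of the clipped surrogate, assuming numpy arrays of per-action log-probs and advantages; in RLHF the advantage comes from a reward model scoring the LM's outputs.

    import numpy as np

    def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
        """PPO's clipped surrogate (to be maximized): a policy-gradient step
        that refuses to drift far from the policy that collected the data."""
        ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        return np.minimum(ratio * advantage, clipped * advantage).mean()

Drop the clip and this reduces to the vanilla policy-gradient objective above.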

Robot locomotion

Modern legged-locomotion controllers, from Cassie's running to recent quadruped and humanoid parkour, are typically policy-gradient policies trained in simulation and deployed on hardware. Continuous action spaces (joint torques) are policy-gradient territory.

AlphaGo and AlphaStar

DeepMind's flagship game-playing systems lean on policy-gradient methods, usually combined with value learning. AlphaGo's policy network was refined by REINFORCE-style self-play; AlphaStar beat top StarCraft II players in 2019 with a deep network trained on a policy-gradient objective.

Trading and portfolio optimization

Some quantitative funds use policy gradients to learn position-sizing rules from market data. The action space is continuous (how much of each asset to hold); the reward is risk-adjusted return.

Further reading