jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXVIII

Policy Gradients

Q-learning learns values, then derives a policy. Policy-gradient methods skip the middleman and learn the policy directly, nudging the probability of each action up or down in proportion to the reward it brings.

The concept

A policy is a recipe: given the state I'm in, what's the probability of each action? A policy gradient nudges those probabilities toward whichever ones earned reward.

The rule, in one line: roll out the policy, see what reward you got, then push up the log-probability of the actions you took — by an amount proportional to the reward. Good actions get more likely. Bad actions get less likely. That's it.
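Here is that rule as code: a minimal sketch of one REINFORCE update for a tabular softmax policy like the gridworld demo below uses. This assumes numpy; the names (logits, alpha) and the step size are illustrative, not the demo's actual source.

    import numpy as np

    # Tabular softmax policy: theta is one logit per (state, action).
    n_states, n_actions = 9, 4                 # 9 cells; moves: up/right/down/left
    logits = np.zeros((n_states, n_actions))
    alpha = 0.1                                # step size (illustrative)

    def action_probs(s):
        """pi(.|s): softmax over state s's row of logits."""
        z = np.exp(logits[s] - logits[s].max())
        return z / z.sum()

    def reinforce_update(rollout, R):
        """Push up log pi(a|s) for every (state, action) the rollout took,
        scaled by its total reward R. For a softmax,
        d log pi(a|s) / d logits[s] = one_hot(a) - pi(.|s)."""
        for s, a in rollout:
            grad_log_pi = -action_probs(s)
            grad_log_pi[a] += 1.0              # one_hot(a) - pi(.|s)
            logits[s] += alpha * R * grad_log_pi

One call per rollout: when R is positive, every action the rollout took gains probability mass; when R is negative, it loses mass. That's all the bar-tilting below amounts to.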

Below: a 3×3 gridworld where the agent picks one of four moves at each cell. The bars above each cell are the agent's current action probabilities. Watch them tilt toward the goal as rollouts come in.

Why ML cares

Policy gradients are the engine behind most modern deep-RL systems that don't fit Q-learning's box. PPO is the workhorse of RLHF for ChatGPT and Claude; A3C powered DeepMind's Atari agents; REINFORCE is the original they all descend from.

They're also the natural method when actions are continuous (robot joint torques, portfolio weights) or when the policy is a neural net — both cases where Q-learning's "argmax over actions" becomes intractable.
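For the continuous case, a sketch under the same assumptions: the policy is a Gaussian whose mean μ is the learned parameter, and acting is sampling rather than an argmax. The action shape and step size are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.zeros(2)     # mean of a 2-D continuous action, e.g. a torque pair
    sigma = 1.0          # fixed exploration noise
    alpha = 0.05         # step size (illustrative)

    def sample_action():
        """No argmax over actions: acting is just a ~ N(mu, sigma^2 I)."""
        return rng.normal(mu, sigma)

    def gaussian_pg_update(a, R):
        """The gradient wrt mu of log N(a; mu, sigma^2) is (a - mu) / sigma^2,
        so mu drifts toward samples that earned high reward."""
        global mu
        mu = mu + alpha * R * (a - mu) / sigma**2

This is the same μ-drift the demo's Gaussian mode animates.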

Try this
  1. Hit Train. Watch the probability bars over each cell tilt toward whichever action points to the goal. The arrows on the grid grow thicker as the policy gets more confident.
  2. Switch to Trajectories · arrows by reward. Each rollout's sampled actions appear as arrows colored by outcome — green if the rollout ended at the goal, red if it fell in the pit. Switch to Continuous · Gaussian policy for the same idea in a 2D action space — μ drifts toward whichever samples earned high reward.
  3. Move the goal: click any empty cell to make it the new goal. The policy starts re-learning from where it was — non-stationary RL in miniature.
A grid of states. Above each cell, four stacked bars show π(↑), π(→), π(↓), π(←): the agent's current probability of each move. The faint trail is the most recent rollout; the dot is the agent. Each rollout collects (state, action, reward), then the bars wiggle toward whatever moves the high-reward rollouts used.
REINFORCE GRADIENT
∇θ J(θ) = 𝔼[ ∇θ log πθ(a|s) · R ]
"to improve the policy π with parameters θ: nudge each action's log-probability up — by exactly the reward R it earned"
∇ = gradient (which way to nudge θ) · θ = theta (policy params) · π = pi (policy) · 𝔼 = average over rollouts · log = natural log · R = reward this rollout earned
dot = agent · tall bar / thick arrow = high π(a|s) · green arrow = rollout reached goal (push up) · red arrow = rollout fell in pit (push down)
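Where the log comes from, for the curious: one application of the likelihood-ratio trick, with τ standing for a whole rollout.

    \nabla_\theta J(\theta)
      = \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
      = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \,\right]

Because πθ(τ) factors into environment dynamics times a product of per-step πθ(a|s) terms, and the dynamics don't depend on θ, the log splits into a sum of per-step log πθ(a|s) terms: exactly the per-action nudges above.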
Where you've seen this · 4 examples
RLHF for LLMs

ChatGPT was aligned with PPO (a policy-gradient variant), and Claude and Gemini use similar RLHF pipelines. The LM is the policy; preferences from human raters train a reward model that defines the reward. The same equation as on this page, just at billions of parameters.
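What the "proximal" in PPO adds to the page's equation is a clip on how far one update can move the policy. A minimal sketch of the clipped surrogate, assuming numpy arrays of per-action log-probs and advantages; in RLHF the advantage comes from a reward model scoring the LM's outputs.

    import numpy as np

    def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
        """PPO's clipped surrogate (to be maximized): a policy-gradient step
        that refuses to drift far from the policy that collected the data."""
        ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        return np.minimum(ratio * advantage, clipped * advantage).mean()

Drop the clip and this reduces to the vanilla policy-gradient objective above.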

Robot locomotion

Modern legged-locomotion controllers, from Cassie's running to recent quadruped and humanoid parkour, are typically policy-gradient policies trained in simulation and deployed on hardware. Continuous action spaces (joint torques) are policy-gradient territory.

AlphaGo and AlphaStar

DeepMind's flagship game-playing systems lean on policy-gradient methods, usually combined with value learning. AlphaGo's policy network was refined by REINFORCE-style self-play; AlphaStar beat top StarCraft II players in 2019 with a deep network trained on a policy-gradient objective.

Trading and portfolio optimization

Some quantitative funds use policy gradients to learn position-sizing rules from market data. The action space is continuous (how much of each asset to hold); the reward is risk-adjusted return.

Further reading