Policy Gradients
Q-learning learns values, then derives a policy. Policy-gradient methods skip the middleman and learn the policy directly — nudging the probability of each action up or down in proportion to the reward it brings.
A policy is a recipe: given the state I'm in, what's the probability of each action? A policy gradient nudges those probabilities toward whichever ones earned reward.
The rule, in one line: roll out the policy, see what reward you got, then push up the log-probability of the actions you took — by an amount proportional to the reward. Good actions get more likely. Bad actions get less likely. That's it.
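The rule fits in a few lines of NumPy. Here is a minimal sketch on a toy two-armed bandit (a hypothetical setup, simpler than the gridworld demo below): a softmax policy over two actions, updated by REINFORCE — push up the log-probability of the sampled action, scaled by its reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit: action 1 pays reward 1.0, action 0 pays nothing.
def reward(action):
    return 1.0 if action == 1 else 0.0

theta = np.zeros(2)   # one logit per action (the policy parameters)
alpha = 0.5           # learning rate

for _ in range(200):
    probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(2, p=probs)                    # roll out: sample an action
    R = reward(a)                                 # see what reward you got
    # Gradient of log-softmax at the sampled action: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * R * grad_log_pi              # nudge by reward

probs = np.exp(theta) / np.exp(theta).sum()
print(probs)   # nearly all mass on the rewarded action
```

Because unrewarded actions contribute nothing (R = 0), every update pushes probability mass toward action 1, and the policy collapses onto it.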
Below: a 3×3 gridworld where the agent picks one of four moves at each cell. The bars above each cell are the agent's current action probabilities. Watch them tilt toward the goal as rollouts come in.
Policy gradients are the engine behind most modern deep-RL systems that don't fit Q-learning's box. PPO is the workhorse of RLHF for ChatGPT and Claude; A3C powered DeepMind's Atari work; REINFORCE is the original.
They're also the natural method when actions are continuous (robot joint torques, portfolio weights) or when the policy is a neural net — both cases where Q-learning's "argmax over actions" becomes intractable.
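The continuous case uses the same update on the parameters of a distribution instead of a table of logits. A minimal sketch, assuming a made-up 1-D task whose reward peaks at action 2.0: a Gaussian policy whose mean μ is nudged by the score function (a − μ)/σ², with a running-average reward baseline (a standard variance-reduction trick) subtracted from R.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D continuous task: reward is highest at action a = 2.0.
def reward(a):
    return -(a - 2.0) ** 2

mu, sigma = 0.0, 1.0   # Gaussian policy N(mu, sigma); only mu is learned here
alpha = 0.05           # learning rate
baseline = 0.0         # running average of reward, for variance reduction
history = []

for _ in range(3000):
    a = rng.normal(mu, sigma)          # sample an action from the policy
    R = reward(a)
    baseline += 0.01 * (R - baseline)  # slow-moving reward average
    # Score function: d/d_mu log N(a | mu, sigma) = (a - mu) / sigma^2
    mu += alpha * (R - baseline) * (a - mu) / sigma**2
    history.append(mu)

print(np.mean(history[-1000:]))   # mu has drifted close to 2.0
```

There is no argmax anywhere — the policy just samples, and μ drifts toward whichever samples earned high reward, exactly as in the continuous demo below.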
- Hit Train. Watch the probability bars over each cell tilt toward whichever action points to the goal. The arrows on the grid grow thicker as the policy gets more confident.
- Switch to Trajectories · arrows by reward. Each rollout's sampled actions appear as arrows colored by outcome — green if the rollout ended at the goal, red if it fell in the pit. Switch to Continuous · Gaussian policy for the same idea in a 2D action space — μ drifts toward whichever samples earned high reward.
- Move the goal: click any empty cell to make it the new goal. The policy starts re-learning from where it was — non-stationary RL in miniature.
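The demo can be approximated in code: a tabular softmax policy over the 3×3 grid, trained with REINFORCE on full rollouts. The specific layout here — start at (0,0), goal at (2,2), reward 1 for reaching the goal within 20 steps, 0 otherwise — is an assumption for illustration, not necessarily the demo's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 3                        # 3x3 gridworld
GOAL = (2, 2)                # assumed goal cell (the demo lets you move it)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
theta = np.zeros((N, N, 4))  # one logit per (cell, action)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rollout(max_steps=20):
    """Sample a trajectory from the current policy; reward 1 if goal reached."""
    s = (0, 0)
    path = []
    for _ in range(max_steps):
        probs = softmax(theta[s])
        a = rng.choice(4, p=probs)
        path.append((s, a))
        dr, dc = ACTIONS[a]
        s = (min(max(s[0] + dr, 0), N - 1), min(max(s[1] + dc, 0), N - 1))
        if s == GOAL:
            return path, 1.0
    return path, 0.0

alpha = 0.5
for _ in range(500):
    path, R = rollout()
    for s, a in path:        # REINFORCE: push up log-prob of the actions taken
        grad = -softmax(theta[s])
        grad[a] += 1.0
        theta[s] += alpha * R * grad

print(softmax(theta[0, 0]))  # probability mass tilts toward the goal
```

This is the picture the demo animates: failed rollouts (R = 0) change nothing, successful ones reinforce every action along their path, and the per-cell probability bars tilt toward the goal.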
∇ J(θ) = 𝔼[ ∇ log πθ(a|s) · R ]
"to improve the policy π with parameters θ: nudge each action's log-probability up in proportion to the reward R it earned"
∇ = gradient (which way to nudge θ) · θ = theta (policy params) · π = pi (policy) · 𝔼 = average over rollouts · log = natural log · R = reward this rollout earned
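The identity can be sanity-checked numerically. For a softmax policy over three actions with fixed per-action rewards (a toy setup invented for this check), the score-function form Σₐ π(a) · ∇log π(a) · R(a) should match the finite-difference gradient of J(θ) = 𝔼[R]:

```python
import numpy as np

R = np.array([1.0, 3.0, 2.0])       # fixed reward for each of 3 actions
theta = np.array([0.1, -0.2, 0.3])  # policy logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def J(th):
    """Expected reward under the softmax policy."""
    return softmax(th) @ R

# Policy-gradient identity, summed exactly over actions:
# grad J = sum_a pi(a) * grad log pi(a) * R(a), where
# grad log pi(a) = one_hot(a) - pi  for a softmax policy.
p = softmax(theta)
pg = sum(p[a] * (np.eye(3)[a] - p) * R[a] for a in range(3))

# Finite-difference gradient of J for comparison
eps = 1e-6
fd = np.array([(J(theta + eps * np.eye(3)[i]) - J(theta - eps * np.eye(3)[i]))
               / (2 * eps) for i in range(3)])

print(np.allclose(pg, fd, atol=1e-5))  # → True
```

In practice the expectation over actions is estimated by sampling rollouts rather than summed exactly, which is why policy-gradient training is noisy.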
ChatGPT, Claude, and Gemini all rely on policy-gradient methods — typically PPO or a variant — in their alignment pipelines. The LM is the policy; preferences from human raters define the reward. The same equation as on this page, just at billions of parameters.
Learned legged-locomotion controllers — Cassie's running gait is a headline example — are policy-gradient policies trained in simulation and deployed on hardware. Continuous action spaces (joint torques) are policy-gradient territory.
DeepMind's flagship RL systems lean heavily on policy-gradient methods (often combined with value learning). AlphaStar beat top StarCraft II players in 2019; its policy was a deep network trained with a policy-gradient objective.
Quantitative funds use policy gradients to learn position-sizing rules from market data. The action space is continuous (how much of each asset to hold); the reward is risk-adjusted return.
- Proximal Policy Optimization paper Schulman et al. (2017) · The PPO paper. The default policy-gradient algorithm in modern practice — used in RLHF, robotics, game-playing.
- Policy Gradient Methods for Reinforcement Learning with Function Approximation paper Sutton, McAllester, Singh, Mansour (1999) · The policy-gradient theorem paper. Compact reference covering the theorem and its variants.
- Deep RL: Pong from Pixels essay Andrej Karpathy · 130-line policy-gradient agent that learns Pong. The clearest "implement REINFORCE in Python" walkthrough on the internet.
- Illustrating RLHF essay Lambert, Castricato et al. (Hugging Face) · Walks through how PPO is plugged into LLM alignment, with diagrams. Bridges this page to modern LLM training.