A Race of Optimizers
Same hill. Same start. Four different rules for descending it. Each carries a different memory of where it has been — and pays for it differently.
An optimizer is a rule for converting gradients into weight updates. The four here differ in how much memory they keep of past gradients.
SGD keeps none: each step uses only the current gradient. Momentum remembers a running velocity, accumulating consistent directions. RMSprop remembers per-coordinate gradient magnitudes, shrinking steps where the signal is loud and enlarging them where it is quiet. Adam combines momentum's velocity with RMSprop's per-coordinate scaling and adds bias correction for the zero-initialized averages.
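The four rules fit in a few lines each. A minimal sketch in NumPy (function names and hyperparameter values are illustrative, not any library's API):

```python
import numpy as np

def sgd(w, g, lr=0.1):
    # No memory: step along the current gradient only.
    return w - lr * g

def momentum_step(w, v, g, lr=0.1, beta=0.9):
    # Memory of velocity: consistent directions accumulate.
    v = beta * v + g
    return w - lr * v, v

def rmsprop_step(w, s, g, lr=0.01, beta=0.9, eps=1e-8):
    # Memory of per-coordinate magnitude: loud coordinates get smaller steps.
    s = beta * s + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, m, v, t, g, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Both memories, plus bias correction for the zero-initialized averages.
    t += 1
    m = b1 * m + (1 - b1) * g          # momentum-style first moment
    v = b2 * v + (1 - b2) * g**2       # RMSprop-style second moment
    m_hat = m / (1 - b1**t)            # bias correction: early averages
    v_hat = v / (1 - b2**t)            # start at zero and need rescaling
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v, t
```

Note the shape of the state each rule carries between steps: nothing, one buffer, one buffer, or two buffers plus a step counter. That state is exactly the "memory" the race makes visible.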
None is universally best. The right choice depends on the loss-landscape geometry — and watching them race on the same surface is the fastest way to develop intuition for which one breaks where.
Adam and its variants (AdamW, Lion, LAMB, Adafactor) are the workhorses of every transformer training run today. SGD with momentum still wins on convolutional networks (and is what trained ResNet, the backbone of computer vision through 2020).
The choice often shows up as a percentage point of accuracy or a 2× speedup in convergence. On a billion-dollar pre-training run, that's the difference between two weeks and a month — and between top-of-leaderboard and second place.
- Pick Narrow ravine and hit Race. Vanilla SGD bounces across the ravine walls; Momentum slides down the long axis; Adam barely notices the ravine at all.
- Switch to Saddle. Plain SGD sits on the saddle for ages — gradient is nearly zero. Momentum eventually rolls off; Adam breaks free almost instantly.
- On Bowl, momentum methods overshoot the minimum and orbit before settling. Plain SGD walks straight in. The simplest case is the one where memory hurts.
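The ravine result is easy to reproduce numerically. A sketch on an assumed toy surface, f(x, y) = ½(x² + 100y²), rather than the demo's exact one: SGD's step size is capped by the steep axis, so it crawls along the shallow one, while momentum builds velocity down the valley floor.

```python
import numpy as np

def grad(w):
    # f(x, y) = 0.5 * (x**2 + 100 * y**2): a narrow ravine,
    # shallow along x, steep along y.
    return np.array([w[0], 100.0 * w[1]])

start = np.array([-2.0, 1.0])
lr = 0.015  # near the stability limit set by the steep y axis

# Plain SGD: y bounces across the walls and dies out,
# but x contracts by only 1.5% per step.
w_sgd = start.copy()
for _ in range(100):
    w_sgd -= lr * grad(w_sgd)

# Momentum: the velocity buffer accumulates the consistent
# x-direction signal, so progress along the valley compounds.
w_mom, vel = start.copy(), np.zeros(2)
for _ in range(100):
    vel = 0.9 * vel + grad(w_mom)
    w_mom -= lr * vel

print(np.linalg.norm(w_sgd), np.linalg.norm(w_mom))
```

After 100 steps, momentum ends up far closer to the minimum than SGD, whose remaining error is almost entirely along the shallow x axis. Adam is omitted for brevity; its per-coordinate scaling handles the two axes with effectively separate step sizes.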
GPT-3, GPT-4, Gemini, Claude — all trained with AdamW, the weight-decay variant of Adam. The 2014 Adam paper is one of the most-cited results in modern ML, with a reach few methodological papers ever achieve.
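"Weight-decay variant" is a one-line change. If L2 regularization is folded into the gradient, Adam's adaptive denominator rescales the decay per coordinate; AdamW applies the decay to the weights directly, outside the adaptive step. A sketch of that difference, with illustrative hyperparameters:

```python
import numpy as np

def adamw_step(w, m, v, t, g, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    # Decoupled weight decay: the wd * w term is added outside the
    # adaptive update, so it is NOT rescaled by sqrt(v_hat). Folding
    # it into g instead would give plain Adam with L2 regularization.
    t += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v, t
```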
ResNet and the ImageNet leaderboard were, for years, dominated by SGD with Nesterov momentum. The intuition: a well-tuned SGD tends to find flatter minima that generalize better than the sharper ones Adam often lands in.
Atari-playing agents (DQN), policy gradient methods, and many of DeepMind's classic RL systems used RMSprop. Reward signals in RL are noisy and unevenly scaled — exactly the regime RMSprop's per-coordinate scaling helps with.
Lion (Google, 2023) keeps only the sign of a momentum-style update, which lets it drop Adam's second-moment buffer and roughly halve optimizer memory; Adafactor reduces memory by factorizing the second moment. Both build on Adam's structure but adapt it for billion-parameter training where every byte counts.
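Lion's trick fits in a few lines. A paraphrase of the published update with illustrative hyperparameters (see the paper for the constants actually used): the update direction is a pure sign vector, so the only state left is a single momentum buffer.

```python
import numpy as np

def lion_step(w, m, g, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    # Direction: sign of an interpolation between the momentum
    # buffer and the fresh gradient. Every coordinate moves by
    # exactly lr, regardless of gradient magnitude.
    update = np.sign(b1 * m + (1 - b1) * g)
    w = w - lr * (update + wd * w)
    # Single state buffer, updated with a second beta.
    m = b2 * m + (1 - b2) * g
    return w, m
```

Compare the state against Adam: one buffer instead of two, and no bias correction, which is where the memory saving comes from.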
- Adam: A Method for Stochastic Optimization · paper · Kingma & Ba (2014) · The original Adam paper. Compact, derives the update rule from a clear motivation, and includes a convergence analysis. Required reading.
- Why Momentum Really Works · interactive essay · Gabriel Goh (Distill) · The clearest visual explanation of momentum on the internet. Pairs with the ravine race above to make the "memory of velocity" concept concrete.
- An overview of gradient descent optimization algorithms · survey · Sebastian Ruder · A patient, math-heavy comparison of all the optimizers above plus several niche ones. The best single page if you want to actually understand the differences.
- Symbolic Discovery of Optimization Algorithms (Lion) · paper · Chen et al. (2023) · The story of how Google researchers used program search to discover a better optimizer than Adam, and what they found. A glimpse of where the field is going.