A Race of Optimizers
Same hill. Same start. Four different rules for descending it. Each carries a different memory of where it has been — and pays for it differently.
An optimizer is a rule for converting gradients into weight updates. The four here differ in how much memory they keep of past gradients.
SGD keeps none: each step uses only the current gradient. Momentum remembers a running velocity, accumulating consistent directions. RMSprop remembers per-coordinate gradient magnitudes, shrinking steps where the signal is loud and enlarging them where it is quiet. Adam combines momentum's velocity with RMSprop's per-coordinate scaling and adds bias correction for the zero-initialized averages.
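The four rules fit in a few lines each. A minimal sketch in NumPy (function names and hyperparameter values are illustrative, not any library's API):

```python
import numpy as np

def sgd(w, g, lr=0.1):
    # No memory: step along the current gradient only.
    return w - lr * g

def momentum_step(w, v, g, lr=0.1, beta=0.9):
    # Memory of velocity: consistent directions accumulate.
    v = beta * v + g
    return w - lr * v, v

def rmsprop_step(w, s, g, lr=0.01, beta=0.9, eps=1e-8):
    # Memory of per-coordinate magnitude: loud coordinates get smaller steps.
    s = beta * s + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, m, v, t, g, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Both memories, plus bias correction for the zero-initialized averages.
    t += 1
    m = b1 * m + (1 - b1) * g          # momentum-style first moment
    v = b2 * v + (1 - b2) * g**2       # RMSprop-style second moment
    m_hat = m / (1 - b1**t)            # bias correction: early averages
    v_hat = v / (1 - b2**t)            # start at zero and need rescaling
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v, t
```

Note the shape of the state each rule carries between steps: nothing, one buffer, one buffer, or two buffers plus a step counter. That state is exactly the "memory" the race makes visible.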
None is universally best. The right choice depends on the loss-landscape geometry — and watching them race on the same surface is the fastest way to develop intuition for which one breaks where.
Adam and its variants (AdamW, Lion, LAMB, Adafactor) are the workhorses of every transformer training run today. SGD with momentum still wins on convolutional networks (and is what trained ResNet, the backbone of computer vision through 2020).
The choice often shows up as a percentage point of accuracy or a 2× speedup in convergence. On a billion-dollar pre-training run, that's the difference between two weeks and a month — and between top-of-leaderboard and second place.
- Pick Narrow ravine and hit Race. Vanilla SGD bounces across the ravine walls; Momentum slides down the long axis; Adam barely notices the ravine at all.
- Switch to Saddle. Plain SGD sits on the saddle for ages — gradient is nearly zero. Momentum eventually rolls off; Adam breaks free almost instantly.
- On Bowl, momentum methods overshoot the minimum and orbit before settling. Plain SGD walks straight in. The simplest case is the one where memory hurts.
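The ravine result is easy to reproduce numerically. A sketch on an assumed toy surface, f(x, y) = ½(x² + 100y²), rather than the demo's exact one: SGD's step size is capped by the steep axis, so it crawls along the shallow one, while momentum builds velocity down the valley floor.

```python
import numpy as np

def grad(w):
    # f(x, y) = 0.5 * (x**2 + 100 * y**2): a narrow ravine,
    # shallow along x, steep along y.
    return np.array([w[0], 100.0 * w[1]])

start = np.array([-2.0, 1.0])
lr = 0.015  # near the stability limit set by the steep y axis

# Plain SGD: y bounces across the walls and dies out,
# but x contracts by only 1.5% per step.
w_sgd = start.copy()
for _ in range(100):
    w_sgd -= lr * grad(w_sgd)

# Momentum: the velocity buffer accumulates the consistent
# x-direction signal, so progress along the valley compounds.
w_mom, vel = start.copy(), np.zeros(2)
for _ in range(100):
    vel = 0.9 * vel + grad(w_mom)
    w_mom -= lr * vel

print(np.linalg.norm(w_sgd), np.linalg.norm(w_mom))
```

After 100 steps, momentum ends up far closer to the minimum than SGD, whose remaining error is almost entirely along the shallow x axis. Adam is omitted for brevity; its per-coordinate scaling handles the two axes with effectively separate step sizes.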
GPT-3, GPT-4, Gemini, Claude — all trained with AdamW, the weight-decay variant of Adam. The 2014 Adam paper is one of the most-cited results in modern ML, with a reach few methodological papers ever achieve.
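"Weight-decay variant" is a one-line change. If L2 regularization is folded into the gradient, Adam's adaptive denominator rescales the decay per coordinate; AdamW applies the decay to the weights directly, outside the adaptive step. A sketch of that difference, with illustrative hyperparameters:

```python
import numpy as np

def adamw_step(w, m, v, t, g, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    # Decoupled weight decay: the wd * w term is added outside the
    # adaptive update, so it is NOT rescaled by sqrt(v_hat). Folding
    # it into g instead would give plain Adam with L2 regularization.
    t += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v, t
```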
ResNet and the ImageNet leaderboard were, for years, dominated by SGD with Nesterov momentum. The intuition: a well-tuned SGD tends to find flatter minima that generalize better than the sharper ones Adam often lands in.
Atari-playing agents (DQN), policy gradient methods, and many of DeepMind's classic RL systems used RMSprop. Reward signals in RL are noisy and unevenly scaled — exactly the regime RMSprop's per-coordinate scaling helps with.
Lion (Google, 2023) keeps only the sign of a momentum-style update, which lets it drop Adam's second-moment buffer and roughly halve optimizer memory; Adafactor reduces memory by factorizing the second moment. Both build on Adam's structure but adapt it for billion-parameter training where every byte counts.
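Lion's trick fits in a few lines. A paraphrase of the published update with illustrative hyperparameters (see the paper for the constants actually used): the update direction is a pure sign vector, so the only state left is a single momentum buffer.

```python
import numpy as np

def lion_step(w, m, g, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    # Direction: sign of an interpolation between the momentum
    # buffer and the fresh gradient. Every coordinate moves by
    # exactly lr, regardless of gradient magnitude.
    update = np.sign(b1 * m + (1 - b1) * g)
    w = w - lr * (update + wd * w)
    # Single state buffer, updated with a second beta.
    m = b2 * m + (1 - b2) * g
    return w, m
```

Compare the state against Adam: one buffer instead of two, and no bias correction, which is where the memory saving comes from.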
- Adam: A Method for Stochastic Optimization · paper · Kingma & Ba (2014) · The original Adam paper. Compact, derives the update rule from a clear motivation, and includes a convergence analysis. Required reading.
- Why Momentum Really Works · interactive essay · Gabriel Goh (Distill) · The clearest visual explanation of momentum on the internet. Pairs with the ravine race above to make the "memory of velocity" concept concrete.
- An overview of gradient descent optimization algorithms · survey · Sebastian Ruder · A patient, math-heavy comparison of all the optimizers above plus several niche ones. The best single page if you want to actually understand the differences.
- Symbolic Discovery of Optimization Algorithms (Lion) · paper · Chen et al. (2023) · The story of how Google researchers used program search to discover a better optimizer than Adam, and what they found. A glimpse of where the field is going.