Gradients & Derivatives
A derivative is a slope: how steeply a curve rises at one spot. A gradient generalizes the idea to surfaces, as a vector pointing in the direction of steepest ascent. Nearly every modern learning algorithm is built on this.
The number f′(x) tells you how fast f changes when you nudge x. Positive means rising; negative means falling; zero means flat. At a smooth minimum or maximum, the slope is exactly zero.
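A quick numerical check makes the sign story concrete. This is a minimal sketch in plain Python (the cubic and the step size h are illustrative choices): estimate f′(x) with a central difference and watch the slope vanish at the curve's peak and valley.

```python
def f(x):
    return x**3 - 3*x  # local max at x = -1, local min at x = +1

def derivative(f, x, h=1e-5):
    # Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"f'({x:+.1f}) ~ {derivative(f, x):+9.4f}")

# Positive slope where the curve rises, negative where it falls,
# and ~0 at the peak (x = -1) and the valley (x = +1).
```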
For a function of two variables f(x, y), the gradient ∇f = (∂f/∂x, ∂f/∂y) is a vector in the plane. It points in the direction the surface rises fastest, and its length tells you how steep that climb is.
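The same finite-difference trick, applied one coordinate at a time, gives the gradient. A small sketch (the elliptical bowl f(x, y) = x² + 2y² is an illustrative choice): nudge x, then y, and assemble the two partials into a vector.

```python
import math

def f(x, y):
    return x**2 + 2*y**2  # an elliptical bowl

def gradient(f, x, y, h=1e-5):
    # One partial derivative per coordinate.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

gx, gy = gradient(f, 1.0, 1.0)
print(f"grad f(1, 1)   ~ ({gx:.3f}, {gy:.3f})")      # exact: (2, 4)
print(f"steepest slope ~ {math.hypot(gx, gy):.3f}")  # |grad f| = sqrt(20)
```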
Training a neural network is following a gradient. Walk the layers in reverse, multiplying derivatives via the chain rule (that's backpropagation), and you have ∇L, the gradient of the loss with respect to every weight. Take a small step in the opposite direction. Repeat a few million times.
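Here is that loop in miniature: a sketch of backpropagation through a one-hidden-unit "network" in plain Python. The data point, initial weights, and learning rate are all made up for illustration.

```python
import math

# Tiny "network": y_hat = w2 * tanh(w1 * x), squared-error loss.
x, y = 1.5, 0.6          # one made-up training example
w1, w2 = 0.1, 0.1        # initial weights
lr = 0.1                 # step size

for _ in range(200):
    h = math.tanh(w1 * x)          # forward pass, layer 1
    y_hat = w2 * h                 # forward pass, layer 2
    loss = (y_hat - y) ** 2

    # Backward pass: chain rule, layer by layer in reverse.
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * h
    dL_dh = dL_dyhat * w2
    dL_dw1 = dL_dh * (1 - h**2) * x   # tanh'(u) = 1 - tanh(u)^2

    # Step opposite the gradient.
    w1 -= lr * dL_dw1
    w2 -= lr * dL_dw2

print(f"loss after training: {loss:.6f}")
```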
Every modern optimizer — SGD, Adam, RMSProp, Adafactor — is a different recipe for "use the gradient cleverly." Understanding what a gradient is geometrically is the difference between training models and running them.
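For a taste of those recipes, compare plain SGD with SGD-plus-momentum on a one-dimensional quadratic. The hyperparameters are illustrative; momentum's real payoff comes on ill-conditioned losses where plain SGD zigzags.

```python
# Two of the simplest recipes, on a 1-D quadratic loss L(w) = w**2.
def grad(w):
    return 2 * w

lr, beta = 0.1, 0.9

# Vanilla SGD: step straight down the gradient.
w = 5.0
for _ in range(50):
    w -= lr * grad(w)
print(f"SGD:      w = {w:.6f}")

# SGD with momentum: accumulate a velocity. On this easy bowl it mostly
# adds oscillation, but it shines when the valley is long and narrow.
w, v = 5.0, 0.0
for _ in range(50):
    v = beta * v - lr * grad(w)
    w += v
print(f"Momentum: w = {w:.6f}")
```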
- In 1D mode, hit Play and watch the orange tangent rotate as it sweeps the curve. At each peak and valley, the line goes flat: that's f′(x) = 0.
- Switch to 2D mode, pick the Saddle surface, and drag the dot toward the center. The gradient arrow shrinks to nothing; that flat spot is a saddle point, a critical point that is neither a min nor a max.
- Try the Banana surface. The gradient arrow always points perpendicular to the level curves; drift sideways along a level curve and the height barely changes. The sketch after this list checks that numerically.
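Here is that check as a sketch: evaluate the banana (Rosenbrock) function, step once along the gradient and once along the level curve, and compare the height changes. The probe point and step size are arbitrary choices.

```python
import math

def banana(x, y):
    # Rosenbrock's "banana" function, the classic curved valley.
    return (1 - x)**2 + 100 * (y - x**2)**2

def grad(f, x, y, h=1e-6):
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

x, y, step = -0.5, 0.5, 1e-4
gx, gy = grad(banana, x, y)
n = math.hypot(gx, gy)
ux, uy = gx / n, gy / n          # unit vector along the gradient (uphill)
tx, ty = -uy, ux                 # unit vector along the level curve

up   = banana(x + step * ux, y + step * uy) - banana(x, y)
side = banana(x + step * tx, y + step * ty) - banana(x, y)
print(f"height change stepping uphill:   {up:+.2e}")   # ~ |grad| * step
print(f"height change stepping sideways: {side:+.2e}") # ~ 0 (second order)
```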
Each weight in a 100-billion-parameter LLM has a partial derivative ∂L/∂w. The training loop is one giant gradient calculation per batch, executed in reverse via backpropagation. Without this idea there is no GPT, no Gemini, no Claude.
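In practice nobody writes those partials by hand; an autodiff framework records the forward pass and runs the chain rule in reverse. A minimal sketch, assuming PyTorch is installed (the linear model and random batch are toy stand-ins for an LLM):

```python
import torch

# Toy model and batch; real LLMs differ only in scale.
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)       # a batch of 8 fake examples
y = torch.randn(8, 1)

loss = ((model(x) - y) ** 2).mean()
loss.backward()             # backpropagation: fills .grad for every weight

for name, p in model.named_parameters():
    print(name, "dL/dw shape:", p.grad.shape)  # one partial per entry
```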
Light bouncing off a curved surface depends on the surface normal, and for a heightfield the normal comes straight from the gradient: n ∝ (−∂h/∂x, −∂h/∂y, 1). Every shaded pixel in a video game is, somewhere in the math, a gradient.
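A sketch of that pipeline in a dozen lines: take an illustrative heightfield, get the normal from its gradient, and shade a point Lambertian-style. The bump function and light direction are made up.

```python
import math

def height(x, y):
    # Illustrative heightfield: a gentle bump.
    return math.exp(-(x**2 + y**2))

def normal(x, y, h=1e-5):
    # Heightfield normal: n ~ (-dh/dx, -dh/dy, 1), then normalize.
    dhdx = (height(x + h, y) - height(x - h, y)) / (2 * h)
    dhdy = (height(x, y + h) - height(x, y - h)) / (2 * h)
    n = (-dhdx, -dhdy, 1.0)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

light = (0.0, 0.0, 1.0)  # light shining straight down
nx, ny, nz = normal(0.5, 0.0)
# Lambertian shading: brightness = max(0, n . light)
brightness = max(0.0, nx * light[0] + ny * light[1] + nz * light[2])
print(f"brightness at (0.5, 0): {brightness:.3f}")
```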
The "edges" in a photo are places where pixel intensity changes fastest — large gradient magnitude. The Sobel and Canny edge filters are gradient operators in disguise; every Photoshop filter that "sharpens" or "embosses" is too.
To fool a classifier into mistaking a panda for a gibbon, attackers compute the gradient of the loss with respect to the input pixels and step in the direction that increases it: gradient ascent on the pixels instead of descent on the weights.
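A sketch of that attack, the fast gradient sign method (FGSM), on a toy logistic classifier in numpy rather than a real vision model. The weights, input, and epsilon are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.0      # toy linear classifier on 16 "pixels"
x = rng.normal(size=16)              # one input image, flattened
y = 1.0                              # its true label

def loss_and_grad_x(x):
    # Logistic loss and its gradient with respect to the *input*.
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    loss = -np.log(p) if y == 1.0 else -np.log(1 - p)
    dloss_dx = (p - y) * w           # chain rule through the sigmoid
    return loss, dloss_dx

eps = 0.1
loss0, g = loss_and_grad_x(x)
x_adv = x + eps * np.sign(g)         # FGSM: one signed step uphill in loss
loss1, _ = loss_and_grad_x(x_adv)
print(f"loss before: {loss0:.4f}, after attack: {loss1:.4f}")
```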
- Essence of Calculus (video series), Grant Sanderson (3Blue1Brown) · The animation companion to this entry; episodes 2 and 3 are exactly the tangent-line and chain-rule story.
- CS231n: Optimization & Gradient Descent (course notes), Andrej Karpathy · Stanford notes that move from "what's a gradient" to "training a neural net" in twelve clear pages.
- Deep Learning, Chapter 4: Numerical Computation (textbook), Goodfellow, Bengio, and Courville · The chapter that bridges single-variable calculus and the gradient-based optimization used in practice.
- Why Momentum Really Works (interactive essay), Gabriel Goh, Distill · A beautiful interactive on what happens after gradient descent: momentum, oscillation, and conditioning, with figures you can drag.