jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. III

Gradients & Derivatives

A derivative is a slope: how steeply a curve rises at one spot. A gradient generalizes the idea to surfaces — a vector pointing in the direction of steepest ascent. Every learning algorithm is built on this.

The concept

A derivative is the slope of a curve at one point. A gradient is the same idea on a surface — a vector that points uphill.

The number f ′(x) tells you how fast f changes when you nudge x. Positive means rising; negative means falling; zero means flat. At a minimum or a maximum, the slope is exactly zero.
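A quick way to see those three cases is to estimate the slope numerically with a centered difference. A small Python sketch (the curve f(x) = x³ − 3x is just an example, not anything this page uses):

  # Estimate f'(x) with a centered difference and read off the sign.
  def f(x):
      return x**3 - 3*x              # example curve with a peak and a valley

  def slope(f, x, h=1e-5):
      return (f(x + h) - f(x - h)) / (2 * h)

  print(slope(f, -2.0))   # ~  9.0 -> rising
  print(slope(f,  0.0))   # ~ -3.0 -> falling
  print(slope(f,  1.0))   # ~  0.0 -> flat: x = 1 sits at the bottom of the valley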

For a function of two variables f(x, y), the gradient ∇f = (∂f/∂x, ∂f/∂y) is a vector in the plane. It points in the direction the surface rises fastest, and its length tells you how steep that climb is.
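The same finite-difference trick gives you the gradient of a surface. A sketch with NumPy (the bowl f(x, y) = x² + y² is just an example):

  import numpy as np

  def f(x, y):
      return x**2 + y**2                # a bowl with its minimum at the origin

  def gradient(f, x, y, h=1e-5):
      dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
      dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
      return np.array([dfdx, dfdy])

  g = gradient(f, 1.0, 2.0)
  print(g)                   # ~ [2. 4.] -> points uphill, straight away from the minimum
  print(np.linalg.norm(g))   # ~ 4.47    -> how steep that climb is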

Why ML cares

Training a neural network is following a gradient. Walk back through every layer, multiplying local derivatives together with the chain rule (that's backpropagation), and you have ∇L, the gradient of the loss. Take a small step in the opposite direction. Repeat a few million times.
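Stripped of the network, that loop is tiny. A minimal sketch (the toy loss L(w) = (w − 3)² and the step size 0.1 are arbitrary choices):

  # Gradient descent on L(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
  w, lr = 0.0, 0.1
  for step in range(50):
      grad = 2 * (w - 3)     # dL/dw, written out by hand here
      w -= lr * grad         # small step in the opposite direction of the gradient
  print(w)                   # ~ 3.0, the minimizer

In a real network the only hard part is the grad line, and backpropagation exists to compute it for every weight at once.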

Every modern optimizer — SGD, Adam, RMSProp, Adafactor — is a different recipe for "use the gradient cleverly." Understanding what a gradient is geometrically is the difference between training models and running them.
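To make "use the gradient cleverly" concrete, here are the two most common recipes as plain update rules (a sketch; the hyperparameters are the usual defaults, not anything specific to this page):

  import math

  def sgd_step(w, grad, lr=0.01):
      return w - lr * grad                      # trust the raw gradient

  def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
      m = b1 * m + (1 - b1) * grad              # running average of gradients
      v = b2 * v + (1 - b2) * grad**2           # running average of squared gradients
      m_hat = m / (1 - b1**t)                   # bias correction for the first steps
      v_hat = v / (1 - b2**t)
      return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

Both consume the same ∇L; they differ only in how far to trust it in each direction.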

Try this
  1. In 1D mode, hit Play and watch the orange tangent rotate as it sweeps the curve. At each peak and valley, the line goes flat — that's f ′(x) = 0.
  2. Switch to 2D mode, pick the Saddle surface, and drag the dot toward the center. The gradient arrow shrinks to nothing — that flat point is a critical point that's neither a min nor a max.
  3. Try the Banana surface. The gradient arrow always points perpendicular to the level curves: drift sideways along one and the height stays the same. Both surfaces are written out in the sketch just below this list.
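For reference, here are both surfaces with their gradients written out by hand (a sketch, assuming the Banana surface is the Rosenbrock function, the usual bearer of that nickname):

  import numpy as np

  def saddle_grad(x, y):
      # f(x, y) = x^2 - y^2: the gradient vanishes at the origin,
      # yet the origin is neither a minimum nor a maximum.
      return np.array([2 * x, -2 * y])

  def banana_grad(x, y, a=1.0, b=100.0):
      # Rosenbrock: f(x, y) = (a - x)^2 + b * (y - x^2)^2
      dfdx = -2 * (a - x) - 4 * b * x * (y - x**2)
      dfdy = 2 * b * (y - x**2)
      return np.array([dfdx, dfdy])

  print(saddle_grad(0.0, 0.0))   # [0. 0.]   -> flat, but a saddle
  print(banana_grad(0.0, 1.0))   # [-2. 200.] -> perpendicular to the level curve through (0, 1)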
Where you've seen this
Training every neural network you've used

Each weight in a 100-billion-parameter LLM has a partial derivative ∂L/∂w. The training loop is one giant gradient calculation per batch, executed in reverse via backpropagation. Without this idea there is no GPT, no Gemini, no Claude.
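A one-weight version of that, assuming PyTorch is available (the numbers are arbitrary; the point is that backward() fills in ∂L/∂w for you):

  import torch

  w = torch.tensor(2.0, requires_grad=True)   # one weight standing in for billions
  x, y = torch.tensor(3.0), torch.tensor(7.0)

  loss = (w * x - y) ** 2    # L(w) = (w*x - y)^2 for this single training pair
  loss.backward()            # backpropagation: computes dL/dw
  print(w.grad)              # tensor(-6.), i.e. 2 * (w*x - y) * x

Scale w up to a hundred billion entries spread across thousands of matrices and you have one LLM training step.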

Computer graphics and ray tracing

Light bouncing off a curved surface depends on the surface normal, which points perpendicular to the surface and is built from the gradient of the height function. Every shaded pixel in a video game is, somewhere in the math, a gradient.
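For a heightfield z = h(x, y), the unnormalized normal is (−∂h/∂x, −∂h/∂y, 1). A sketch (the particular h is just an example):

  import numpy as np

  def h(x, y):
      return np.sin(x) * np.cos(y)        # example height function

  def normal(x, y, eps=1e-5):
      dhdx = (h(x + eps, y) - h(x - eps, y)) / (2 * eps)
      dhdy = (h(x, y + eps) - h(x, y - eps)) / (2 * eps)
      n = np.array([-dhdx, -dhdy, 1.0])   # gradient of h, flipped and lifted into 3D
      return n / np.linalg.norm(n)        # unit normal, the vector shading actually uses

  print(normal(0.3, 0.8))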

Image edge detection

The "edges" in a photo are places where pixel intensity changes fastest — large gradient magnitude. The Sobel and Canny edge filters are gradient operators in disguise; every Photoshop filter that "sharpens" or "embosses" is too.
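A sketch of that idea: the Sobel kernels estimate ∂I/∂x and ∂I/∂y, and the edge map is the gradient magnitude (assumes NumPy and SciPy; the image is a synthetic bright square):

  import numpy as np
  from scipy.signal import convolve2d

  img = np.zeros((32, 32))
  img[8:24, 8:24] = 1.0                    # bright square on a dark background

  kx = np.array([[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]])              # Sobel kernel for the horizontal derivative
  ky = kx.T                                # its transpose gives the vertical one

  gx = convolve2d(img, kx, mode="same")    # ~ dI/dx
  gy = convolve2d(img, ky, mode="same")    # ~ dI/dy
  edges = np.hypot(gx, gy)                 # gradient magnitude: large only along the edges

  print(edges.max(), edges[16, 16])        # strong response on the border, zero in the flat interior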

Adversarial examples

To fool a classifier into mistaking a panda for a gibbon, attackers compute the gradient of the loss with respect to the input pixels and step in that direction. Gradient descent on the pixels, not the weights.
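The one-step version of that attack (the fast gradient sign method) fits in a few lines. A sketch on a toy linear classifier, with the input gradient written out by hand (real attacks get the gradient with respect to the pixels by backpropagating through the whole network):

  import numpy as np

  rng = np.random.default_rng(0)
  w = rng.normal(size=100)           # weights of a toy linear classifier
  x = rng.normal(size=100)           # the "image" we want to perturb
  y = 1.0                            # its true label

  def loss(x):
      p = 1 / (1 + np.exp(-w @ x))   # sigmoid prediction for the true class
      return -np.log(p)              # cross-entropy for label y = 1

  p = 1 / (1 + np.exp(-w @ x))
  grad_x = (p - y) * w               # dL/dx: gradient with respect to the INPUT, not the weights

  eps = 0.1
  x_adv = x + eps * np.sign(grad_x)  # step uphill on the loss, coordinate by coordinate
  print(loss(x), loss(x_adv))        # the perturbed input has a higher loss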

Further reading