Backpropagation
The signal flows forward to make a prediction. The error flows backward to fix the weights. Chain rule, applied layer by layer — the engine that taught a generation of networks how to learn.
Backpropagation is how a neural network figures out which weight to blame for an error — and by how much.
The forward pass produces a prediction; comparing it to the truth gives a loss. To reduce the loss, you need ∂L/∂w for every weight in the network. Backprop computes all of those at once, by applying the chain rule from the output backward toward the input.
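As a concrete sketch (values and variable names are made up): the chain rule applied by hand to a single sigmoid neuron with a squared-error loss, multiplying local derivatives from the loss back toward the weight.

```python
import numpy as np

# One sigmoid neuron with a squared-error loss; x, y_true, w, b are made up.
x, y_true = 2.0, 1.0
w, b = 0.5, 0.1

# Forward pass
z = w * x + b
y = 1.0 / (1.0 + np.exp(-z))
L = 0.5 * (y - y_true) ** 2

# Backward pass: multiply local derivatives from the loss back to the weight
dL_dy = y - y_true          # dL/dy
dy_dz = y * (1.0 - y)       # sigmoid'(z)
dz_dw = x                   # d(w*x + b)/dw
dL_dw = dL_dy * dy_dz * dz_dw
dL_db = dL_dy * dy_dz       # d(w*x + b)/db = 1
print(dL_dw, dL_db)
```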
The trick: each layer's gradient is the upstream gradient times the layer's local derivative. So gradients can be computed recursively in one backward sweep, at a cost comparable to the forward pass, with no symbolic differentiation needed.
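A minimal sketch of that recursion, assuming a toy two-layer NumPy network rather than any real framework: each layer caches what it saw on the forward pass, then turns the upstream gradient into gradients for its own weights plus a gradient to hand to the layer before it.

```python
import numpy as np

class Linear:
    # Illustrative layer, not a framework.
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                         # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, upstream):          # upstream = dL/d(output)
        self.dW = self.x.T @ upstream      # gradient for this layer's weights
        self.db = upstream.sum(axis=0)
        return upstream @ self.W.T         # dL/d(input): the previous layer's upstream

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, upstream):
        return upstream * self.mask

rng = np.random.default_rng(0)
layers = [Linear(2, 3, rng), ReLU(), Linear(3, 1, rng)]
x = rng.normal(size=(4, 2))
y = rng.normal(size=(4, 1))

out = x
for layer in layers:                       # forward sweep
    out = layer.forward(out)

grad = out - y                             # dL/d(prediction) for L = 0.5 * sum((out - y)^2)
for layer in reversed(layers):             # backward sweep: one pass, every gradient
    grad = layer.backward(grad)
```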
Backprop is the algorithm that made deep learning practical. Without it, training a network with billions of parameters would require billions of finite-difference probes per step — computationally infeasible. With it, training is just two passes per step: forward, then back.
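To make the cost gap concrete, here is an illustration on a toy linear model with synthetic data: a finite-difference gradient needs one extra forward pass per parameter, while backprop delivers every gradient from a single backward pass.

```python
import numpy as np

def loss(w, x, y):
    # Toy linear model with a squared-error loss; all data here is synthetic.
    return 0.5 * np.sum((x @ w - y) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 1000))    # 8 samples, 1000 parameters
y = rng.normal(size=8)
w = rng.normal(size=1000)

# Finite differences: one perturbed forward pass per parameter (1000 of them).
eps = 1e-6
base = loss(w, x, y)
fd_grad = np.zeros_like(w)
for i in range(len(w)):
    w_plus = w.copy()
    w_plus[i] += eps
    fd_grad[i] = (loss(w_plus, x, y) - base) / eps

# Backprop (written out by hand for this model): one backward pass, all 1000 gradients.
bp_grad = x.T @ (x @ w - y)

print(np.max(np.abs(fd_grad - bp_grad)))   # agreement up to finite-difference error
```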
Every modern training framework — PyTorch, JAX, TensorFlow — is essentially a tool for building computation graphs and running backprop on them automatically. The technical name is automatic differentiation; backprop is its workhorse mode.
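A sketch of what that looks like in PyTorch (shapes and data are arbitrary): operations on tensors that require gradients record a computation graph, and .backward() runs reverse-mode autodiff over it.

```python
import torch

x = torch.randn(4, 3)
y = torch.randn(4, 1)
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

pred = x @ W + b                       # forward pass builds the computation graph
loss = ((pred - y) ** 2).mean()
loss.backward()                        # backward pass fills .grad on every leaf tensor

print(W.grad.shape, b.grad.shape)      # gradients have the same shapes as W and b
```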
- Press Forward → Backward. Watch the forward pass light up neurons left-to-right, then the backward pass light up gradient halos right-to-left.
- Click Train continuously. The decision boundary on the bottom canvas reshapes as the network learns. Watch the loss curve drop in the rail.
- Click Next sample a few times. Each sample produces a different gradient — different blame patterns light up different connections orange (increase) or green (decrease).
Every weight in every neural network — Gemini, GPT, Stable Diffusion, AlphaFold — was nudged into place by some variant of backpropagation. The 1986 paper that popularized it is one of the most-cited results in machine learning.
Whenever you call loss.backward() in PyTorch, you're invoking reverse-mode automatic differentiation — the algorithmic generalization of backprop. JAX's grad() is the same idea, packaged differently.
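The JAX version of the same toy problem, again with arbitrary shapes and data: grad() transforms a loss function into a function that returns its gradients, using the same reverse-mode machinery.

```python
import jax
import jax.numpy as jnp

def loss_fn(W, b, x, y):
    # Same toy linear model; data and shapes are arbitrary.
    pred = x @ W + b
    return jnp.mean((pred - y) ** 2)

key = jax.random.PRNGKey(0)
kx, ky, kw = jax.random.split(key, 3)
x = jax.random.normal(kx, (4, 3))
y = jax.random.normal(ky, (4, 1))
W = jax.random.normal(kw, (3, 1))
b = jnp.zeros((1,))

grad_fn = jax.grad(loss_fn, argnums=(0, 1))   # differentiate with respect to W and b
dW, db = grad_fn(W, b, x, y)
print(dW.shape, db.shape)
```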
Backprop computes ∂loss/∂input as a side-effect of computing ∂loss/∂weights. Adversarial-example attacks, saliency maps, and many interpretability methods all exploit this input gradient.
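A sketch of that trick in PyTorch (the model, input, and target are placeholders): mark the input as requiring a gradient, and the same backward pass fills in x.grad alongside the weight gradients.

```python
import torch
import torch.nn.functional as F

# Placeholder model and data, just to show the input gradient.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)
x = torch.randn(1, 10, requires_grad=True)   # ask for dloss/dinput too
target = torch.tensor([[1.0]])

loss = F.mse_loss(model(x), target)
loss.backward()                              # fills x.grad alongside the weight grads

saliency = x.grad.abs()                      # per-feature sensitivity (saliency map)
x_adv = x + 0.01 * x.grad.sign()             # one FGSM-style adversarial step
```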
Backprop also powers gradient-based learning in physical-system simulators, differentiable rendering for graphics, and even some classical optimization problems reformulated as differentiable programs.
- Learning representations by back-propagating errors paper (1986) Rumelhart, Hinton, Williams · The paper that brought backprop into mainstream ML and revived the field after a decade-long winter. Worth reading for the historical voice alone.
- The spelled-out intro to neural networks and backpropagation video Andrej Karpathy · A two-hour from-scratch build of micrograd — Python autodiff in ~100 lines. The clearest explanation of backprop currently available.
- Calculus on Computational Graphs: Backpropagation essay Christopher Olah · A superb visual derivation of backprop, from a graph-theoretic angle that explains why reverse-mode autodiff scales the way it does.
- PyTorch autograd reference · The mechanics of how a modern framework actually does backprop on dynamic computation graphs. Read this once and you'll never look at loss.backward() the same way.