Backpropagation
The signal flows forward to make a prediction. The error flows backward to fix the weights. Chain rule, applied layer by layer — the engine that taught a generation of networks how to learn.
Backpropagation is how a neural network figures out which weight to blame for an error — and by how much.
The forward pass produces a prediction; comparing it to the truth gives a loss. To reduce the loss, you need ∂L/∂w for every weight in the network. Backprop computes all of those at once, by applying the chain rule from the output backward toward the input.
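As a concrete sketch (values and variable names are made up): the chain rule applied by hand to a single sigmoid neuron with a squared-error loss, multiplying local derivatives from the loss back toward the weight.

```python
import numpy as np

# One sigmoid neuron with a squared-error loss; x, y_true, w, b are made up.
x, y_true = 2.0, 1.0
w, b = 0.5, 0.1

# Forward pass
z = w * x + b
y = 1.0 / (1.0 + np.exp(-z))
L = 0.5 * (y - y_true) ** 2

# Backward pass: multiply local derivatives from the loss back to the weight
dL_dy = y - y_true          # dL/dy
dy_dz = y * (1.0 - y)       # sigmoid'(z)
dz_dw = x                   # d(w*x + b)/dw
dL_dw = dL_dy * dy_dz * dz_dw
dL_db = dL_dy * dy_dz       # d(w*x + b)/db = 1
print(dL_dw, dL_db)
```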
The trick: each layer's gradient is the upstream gradient times the layer's local derivative. So gradients can be computed recursively in one backward sweep, at a cost comparable to the forward pass, with no symbolic differentiation needed.
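A minimal sketch of that recursion, assuming a toy two-layer NumPy network rather than any real framework: each layer caches what it saw on the forward pass, then turns the upstream gradient into gradients for its own weights plus a gradient to hand to the layer before it.

```python
import numpy as np

class Linear:
    # Illustrative layer, not a framework.
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                         # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, upstream):          # upstream = dL/d(output)
        self.dW = self.x.T @ upstream      # gradient for this layer's weights
        self.db = upstream.sum(axis=0)
        return upstream @ self.W.T         # dL/d(input): the previous layer's upstream

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, upstream):
        return upstream * self.mask

rng = np.random.default_rng(0)
layers = [Linear(2, 3, rng), ReLU(), Linear(3, 1, rng)]
x = rng.normal(size=(4, 2))
y = rng.normal(size=(4, 1))

out = x
for layer in layers:                       # forward sweep
    out = layer.forward(out)

grad = out - y                             # dL/d(prediction) for L = 0.5 * sum((out - y)^2)
for layer in reversed(layers):             # backward sweep: one pass, every gradient
    grad = layer.backward(grad)
```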
Backprop is the algorithm that made deep learning practical. Without it, training a network with billions of parameters would require billions of finite-difference probes per step — computationally infeasible. With it, training is just two passes per step: forward, then back.
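To make the cost gap concrete, here is an illustration on a toy linear model with synthetic data: a finite-difference gradient needs one extra forward pass per parameter, while backprop delivers every gradient from a single backward pass.

```python
import numpy as np

def loss(w, x, y):
    # Toy linear model with a squared-error loss; all data here is synthetic.
    return 0.5 * np.sum((x @ w - y) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 1000))    # 8 samples, 1000 parameters
y = rng.normal(size=8)
w = rng.normal(size=1000)

# Finite differences: one perturbed forward pass per parameter (1000 of them).
eps = 1e-6
base = loss(w, x, y)
fd_grad = np.zeros_like(w)
for i in range(len(w)):
    w_plus = w.copy()
    w_plus[i] += eps
    fd_grad[i] = (loss(w_plus, x, y) - base) / eps

# Backprop (written out by hand for this model): one backward pass, all 1000 gradients.
bp_grad = x.T @ (x @ w - y)

print(np.max(np.abs(fd_grad - bp_grad)))   # agreement up to finite-difference error
```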
Every modern training framework — PyTorch, JAX, TensorFlow — is essentially a tool for building computation graphs and running backprop on them automatically. The technical name is automatic differentiation; backprop is its workhorse mode.
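A sketch of what that looks like in PyTorch (shapes and data are arbitrary): operations on tensors that require gradients record a computation graph, and .backward() runs reverse-mode autodiff over it.

```python
import torch

x = torch.randn(4, 3)
y = torch.randn(4, 1)
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

pred = x @ W + b                       # forward pass builds the computation graph
loss = ((pred - y) ** 2).mean()
loss.backward()                        # backward pass fills .grad on every leaf tensor

print(W.grad.shape, b.grad.shape)      # gradients have the same shapes as W and b
```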
- Press Forward → Backward. Watch the forward pass light up neurons left-to-right, then the backward pass light up gradient halos right-to-left.
- Click Train continuously. The decision boundary on the bottom canvas reshapes as the network learns. Watch the loss curve drop in the rail.
- Click Next sample a few times. Each sample produces a different gradient — different blame patterns light up different connections orange (increase) or green (decrease).
Every weight in every neural network — Gemini, GPT, Stable Diffusion, AlphaFold — was nudged into place by some variant of backpropagation. The 1986 paper that popularized it is one of the most-cited results in machine learning.
Whenever you call loss.backward() in PyTorch, you're invoking reverse-mode automatic differentiation — the algorithmic generalization of backprop. JAX's grad() is the same idea, packaged differently.
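The JAX version of the same toy problem, again with arbitrary shapes and data: grad() transforms a loss function into a function that returns its gradients, using the same reverse-mode machinery.

```python
import jax
import jax.numpy as jnp

def loss_fn(W, b, x, y):
    # Same toy linear model; data and shapes are arbitrary.
    pred = x @ W + b
    return jnp.mean((pred - y) ** 2)

key = jax.random.PRNGKey(0)
kx, ky, kw = jax.random.split(key, 3)
x = jax.random.normal(kx, (4, 3))
y = jax.random.normal(ky, (4, 1))
W = jax.random.normal(kw, (3, 1))
b = jnp.zeros((1,))

grad_fn = jax.grad(loss_fn, argnums=(0, 1))   # differentiate with respect to W and b
dW, db = grad_fn(W, b, x, y)
print(dW.shape, db.shape)
```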
Backprop computes ∂loss/∂input as a side-effect of computing ∂loss/∂weights. Adversarial-example attacks, saliency maps, and many interpretability methods all exploit this input gradient.
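A sketch of that trick in PyTorch (the model, input, and target are placeholders): mark the input as requiring a gradient, and the same backward pass fills in x.grad alongside the weight gradients.

```python
import torch
import torch.nn.functional as F

# Placeholder model and data, just to show the input gradient.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)
x = torch.randn(1, 10, requires_grad=True)   # ask for dloss/dinput too
target = torch.tensor([[1.0]])

loss = F.mse_loss(model(x), target)
loss.backward()                              # fills x.grad alongside the weight grads

saliency = x.grad.abs()                      # per-feature sensitivity (saliency map)
x_adv = x + 0.01 * x.grad.sign()             # one FGSM-style adversarial step
```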
Backprop also powers gradient-based learning in physical-system simulators, differentiable rendering for graphics, and even some classical optimization problems reformulated as differentiable programs.
- Learning representations by back-propagating errors paper (1986) Rumelhart, Hinton, Williams · The paper that brought backprop into mainstream ML and revived the field after a decade-long winter. Worth reading for the historical voice alone.
- The spelled-out intro to neural networks and backpropagation video Andrej Karpathy · A two-hour from-scratch build of micrograd — Python autodiff in ~100 lines. The clearest explanation of backprop currently available.
- Calculus on Computational Graphs: Backpropagation essay Christopher Olah · A superb visual derivation of backprop, from a graph-theoretic angle that explains why reverse-mode autodiff scales the way it does.
- PyTorch autograd reference · The mechanics of how a modern framework actually does backprop on dynamic computation graphs. Read this once and you'll never look at loss.backward() the same way.