jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XI

Backpropagation

The signal flows forward to make a prediction. The error flows backward to fix the weights. Chain rule, applied layer by layer — the engine that taught a generation of networks how to learn.

The concept

Backpropagation is how a neural network figures out which weight to blame for an error — and by how much.

The forward pass produces a prediction; comparing it to the truth gives a loss. To reduce the loss, you need ∂L/∂w for every weight in the network. Backprop computes all of those at once, by applying the chain rule from the output backward toward the input.

The trick: each layer's gradient is the upstream gradient times the layer's local derivative. So gradients can be computed recursively in one backward sweep, at a cost proportional to the forward pass, with no symbolic differentiation needed.
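
To make the backward sweep concrete, here is a minimal numpy sketch of one forward/backward step on a tiny two-layer network. The layer sizes, tanh activation, and squared-error loss are illustrative assumptions, not the network in the visualization above.

  import numpy as np

  # Tiny network: x -> hidden (tanh) -> scalar output, squared-error loss.
  rng = np.random.default_rng(0)
  x = rng.normal(size=(2,))            # one input sample
  y = np.array([1.0])                  # its target
  W1 = rng.normal(size=(3, 2)); b1 = np.zeros(3)
  W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)

  # Forward pass: compute and cache every intermediate value.
  z1 = W1 @ x + b1
  h = np.tanh(z1)
  z2 = W2 @ h + b2
  loss = 0.5 * np.sum((z2 - y) ** 2)

  # Backward pass: each gradient = upstream gradient * local derivative (chain rule).
  dz2 = z2 - y                         # dL/dz2
  dW2 = np.outer(dz2, h)               # dL/dW2
  db2 = dz2
  dh = W2.T @ dz2                      # send the gradient back to the hidden layer
  dz1 = dh * (1 - h ** 2)              # tanh'(z1) = 1 - tanh(z1)^2
  dW1 = np.outer(dz1, x)
  db1 = dz1

  # One gradient-descent step on every weight, using the gradients just computed.
  lr = 0.1
  W1 -= lr * dW1; b1 -= lr * db1
  W2 -= lr * dW2; b2 -= lr * db2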

Why ML cares

Backprop is the algorithm that made deep learning practical. Without it, training a network with billions of parameters would require billions of finite-difference probes per step, one extra forward pass per parameter, which is computationally infeasible. With it, training is just two passes per step: forward, then back.

Every modern training framework — PyTorch, JAX, TensorFlow — is essentially a tool for building computation graphs and running backprop on them automatically. The technical name is automatic differentiation; backprop is its workhorse mode.
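
As a concrete example, here is a minimal PyTorch sketch: the forward pass records the computation graph, and one call to loss.backward() runs backprop over it. The toy model, data shapes, and learning rate are illustrative assumptions.

  import torch

  # Toy data and a tiny model; shapes and sizes are illustrative only.
  x = torch.randn(8, 2)
  y = torch.randn(8, 1)
  model = torch.nn.Sequential(
      torch.nn.Linear(2, 16),
      torch.nn.Tanh(),
      torch.nn.Linear(16, 1),
  )

  pred = model(x)                                  # forward pass builds the graph
  loss = torch.nn.functional.mse_loss(pred, y)
  loss.backward()                                  # backward pass fills p.grad for every parameter

  with torch.no_grad():                            # one manual gradient-descent step
      for p in model.parameters():
          p -= 0.1 * p.grad
          p.grad = None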

Try this
  1. Press Forward → Backward. Watch the forward pass light up neurons left-to-right, then the backward pass light up gradient halos right-to-left.
  2. Click Train continuously. The decision boundary on the bottom canvas reshapes as the network learns. Watch the loss curve drop in the rail.
  3. Click Next sample a few times. Each sample produces a different gradient — different blame patterns light up different connections orange (increase) or green (decrease).
Decision boundary, evolving
Where you've seen this
Training every modern ML model

Every weight in every neural network — Gemini, GPT, Stable Diffusion, AlphaFold — was nudged into place by some variant of backpropagation. The 1986 paper that popularized it, by Rumelhart, Hinton, and Williams, is one of the most-cited results in machine learning.

Automatic differentiation in PyTorch & JAX

Whenever you call loss.backward() in PyTorch, you're invoking reverse-mode automatic differentiation — the algorithmic generalization of backprop. JAX's grad() is the same idea, packaged differently.
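
A minimal JAX sketch of the same idea, assuming an illustrative linear model and squared-error loss: you write the forward computation as a pure function and ask jax.grad for its derivative.

  import jax
  import jax.numpy as jnp

  def loss_fn(w, x, y):
      pred = x @ w                     # forward computation as a pure function
      return jnp.mean((pred - y) ** 2)

  key = jax.random.PRNGKey(0)
  x = jax.random.normal(key, (8, 2))
  y = jnp.ones((8,))
  w = jnp.zeros((2,))

  grads = jax.grad(loss_fn)(w, x, y)   # reverse-mode AD, w.r.t. the first argument
  w = w - 0.1 * grads                  # one gradient step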

Adversarial examples & interpretability

Backprop computes ∂loss/∂input as a side-effect of computing ∂loss/∂weights. Adversarial-example attacks, saliency maps, and many interpretability methods all exploit this gradient-of-input.
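
A minimal PyTorch sketch of that gradient-of-input. The tiny model, data, and step size are illustrative assumptions; the sign-of-gradient nudge at the end is the basic move behind many adversarial-example attacks.

  import torch

  model = torch.nn.Linear(2, 1)
  x = torch.randn(1, 2, requires_grad=True)    # track gradients for the input itself
  y = torch.zeros(1, 1)

  loss = torch.nn.functional.mse_loss(model(x), y)
  loss.backward()

  print(x.grad)                                # dLoss/dInput: the raw signal behind saliency maps
  x_adv = (x + 0.01 * x.grad.sign()).detach()  # nudge the input to *increase* the loss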

Beyond neural networks

Backprop also powers gradient-based learning in physical-system simulators, differentiable rendering for graphics, and even some classical optimization problems reformulated as differentiable programs.
