Forward Propagation
A neural network is a pipeline. Each layer multiplies the previous activations by a weight matrix, adds a bias, and bends the result through a nonlinearity. Watch the signal travel.
A forward pass is the act of running input through a neural network to produce an output, layer by layer.
Each layer does three things: (1) multiply the previous activations by a weight matrix, (2) add a bias vector, (3) bend the result through a nonlinearity. The output of layer ℓ becomes the input of layer ℓ+1.
The math is one line: a⁽ˡ⁾ = ƒ(W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾). Stack three of these and you have the network on the right — input → hidden 1 → hidden 2 → output, with tanh between hidden layers and softmax at the top.
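If the one-line equation feels abstract, here is a minimal NumPy sketch of that stack: two tanh hidden layers and a softmax output, matching the shape of the network described above. The layer sizes and random weights are placeholders for illustration, not the trained weights behind the canvas.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, params):
    """One forward pass: multiply, add bias, bend, repeat."""
    W1, b1, W2, b2, W3, b3 = params
    z1 = W1 @ x + b1         # pre-activation of hidden layer 1 (weighted sum)
    a1 = np.tanh(z1)         # post-activation (after the bend)
    z2 = W2 @ a1 + b2        # pre-activation of hidden layer 2
    a2 = np.tanh(z2)
    z3 = W3 @ a2 + b3        # pre-activation of the output layer
    return softmax(z3)       # class probabilities

# Placeholder sizes: 2 inputs -> 4 hidden -> 4 hidden -> 2 outputs
rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 2)), np.zeros(4),
          rng.normal(size=(4, 4)), np.zeros(4),
          rng.normal(size=(2, 4)), np.zeros(2))

print(forward(np.array([0.5, -1.0]), params))  # probabilities summing to 1
```

Calling `forward` on any 2-element input returns a probability vector, which is exactly what the softmax column at the top of the diagram shows.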
Forward propagation is what every deployed neural network does at inference time — every Gemini response, every Stable Diffusion image, every Tesla camera frame is the result of running a forward pass on a trained network. Training is forward propagation followed by backpropagation; inference is just forward.
The same forward pass also defines what the network can express: a stack of linear layers with bends in between can approximate any continuous function on a compact domain, given enough neurons (the universal approximation theorem). The bends are what make this possible; without them, any stack of linear layers collapses into a single linear map, no matter how deep.
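One quick way to see the collapse: drop the nonlinearity and two consecutive matrix multiplications become a single one. A small NumPy check, with sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
x = rng.normal(size=2)

# Two "layers" with no nonlinearity between them...
two_linear_layers = W2 @ (W1 @ x)
# ...compute exactly one linear layer with weights W2 @ W1.
one_linear_layer = (W2 @ W1) @ x

print(np.allclose(two_linear_layers, one_linear_layer))  # True
```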
- Hit Replay flow. Watch the signal hop column by column: pre-activations (the weighted sums) light up first, then post-activations (after the bend).
- Drag x₁ and x₂. The network is fixed; only the inputs change. Watch how a small shift in one number ripples differently through every neuron.
- Try the four corner presets. Each one is a different region of the input space, and the softmax outputs change smoothly as you move between them — a learned 2D classifier in action.
The "forward pass" is what every deployed neural network does at inference. Every prediction your iPhone makes, every Gemini response, every YouTube recommendation — a forward pass through trained weights.
Convolutional networks process images by stacking these forward passes — but the matrices are organized as filters that slide over patches of the image. The basic recipe (multiply, bias, bend, repeat) is unchanged.
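As a rough sketch of how the recipe carries over, here is one filter sliding over a single-channel image with stride 1 and no padding; the filter values and image are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide one filter over a single-channel image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias   # multiply + bias
    return np.maximum(out, 0.0)                          # bend (ReLU)

image = np.random.default_rng(2).normal(size=(5, 5))
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])   # a made-up vertical-edge filter
print(conv2d(image, kernel).shape)   # (3, 3) feature map
```

Real convolutional layers run many such filters over many channels at once, but each output value is still a weighted sum plus a bias, passed through a bend.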
Each transformer layer is a forward pass with a special structure: queries, keys, and values are all computed by linear layers, then combined with attention weights. Stack on the order of a hundred of these and you have a model like GPT-4.
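A stripped-down sketch of that structure, assuming a single attention head with no masking and random weights standing in for learned ones:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: three linear layers, then a weighted average."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every token pair
    weights = softmax(scores, axis=-1)          # attention weights
    return weights @ V                          # each output mixes all values

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))                     # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8)
```

A real transformer layer adds multiple heads, residual connections, layer normalization, and a feed-forward block on top of this, but it is all still one forward pass.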
Multi-layer perceptrons remain the workhorse for tabular data — credit risk, ad-CTR prediction, propensity models. Each prediction is a small forward pass through a few densely connected layers.
- But what is a neural network? (video) · Grant Sanderson, 3Blue1Brown · The series that uses the same column-of-neurons diagram you see on this page; episodes 1 and 2 cover forward propagation in detail.
- Neural Networks and Deep Learning, Ch. 1 (free book) · Michael Nielsen · A genuinely beautiful exposition that derives the forward pass equation from MNIST classification and ends with working Python code.
- TensorFlow Playground (interactive) · Smilkov et al. · A more elaborate version of the canvas above, with multiple datasets, configurable depth, and live training. An excellent next step.
- Universal Approximation Theorem (reference) · Wikipedia · The theorem that says the kind of network you see on this page can approximate any continuous function, given enough neurons. Why the bends matter.