Build a Network
Stack layers, pick a nonlinearity, point it at a dataset, and press play. The decision boundary forms in real time as the network learns.
A multi-layer perceptron (MLP) is a stack of linear layers separated by nonlinearities. Add depth and width, and it becomes a universal function approximator.
Each hidden layer takes the previous activations, applies a weight matrix and bias, and bends the result through an activation function. Chain enough of these together and any continuous decision boundary becomes expressible.
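In code, that's one matrix multiply, one bias add, and one elementwise squash per layer. Here's a minimal NumPy sketch of the forward pass (the layer sizes, tanh, and random init are illustrative choices, not anything the demo above fixes):

```python
import numpy as np

def forward(x, params):
    """Forward pass through an MLP: tanh(W h + b) at each hidden layer."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)          # linear map, then bend through the nonlinearity
    W_out, b_out = params[-1]
    return W_out @ h + b_out            # final layer stays linear (raw logit)

rng = np.random.default_rng(0)
sizes = [2, 8, 8, 1]                    # 2-D input -> two hidden layers of 8 -> 1 logit
params = [(rng.normal(0, 0.5, (m, n)), np.zeros(m))
          for n, m in zip(sizes, sizes[1:])]

print(forward(np.array([0.3, -1.2]), params))   # a single scalar logit
```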
The interesting question is no longer "can it fit this?" — almost always yes — but "how few neurons and layers does it take to fit it cleanly?" Watching a small network learn a hard pattern is the fastest way to develop intuition for that.
The MLP is the simplest deep network and the workhorse for tabular data — risk scoring, click-through prediction, recommender re-ranking. It's also the backbone of every transformer's feed-forward block, every autoencoder's encoder/decoder, every diffusion model's denoiser.
The universal approximation theorem (Cybenko 1989, Hornik 1991) proved that one hidden layer with enough neurons can approximate any continuous function. Modern practice uses deeper, narrower networks because depth makes the function easier to learn from limited data — even though shallow alternatives exist in principle.
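You can watch the theorem earn its keep in a few lines. The sketch below (PyTorch; the target function, widths, and training schedule are all illustrative choices) fits sin(x) with a single tanh hidden layer and prints how the error shrinks as that layer widens:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3.14, 3.14, 256).unsqueeze(1)
y = torch.sin(x)

for width in (2, 8, 64):                # widen the single hidden layer
    net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    print(f"width {width:3d}: final MSE {loss.item():.5f}")
```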
- Pick the spiral dataset and start with one layer of 1 neuron. It can't curve enough — accuracy plateaus around chance. Add a layer; bump neurons; watch the boundary spiral.
- Switch activation to ReLU on the spiral. Notice the boundary becomes piecewise-linear — sharp angular regions instead of smooth curves. That's the kink in ReLU showing through.
- Try XOR with two hidden layers of 4 neurons. Watch the boundary fold itself into a clean cross in just a few seconds: the canonical "deep beats shallow" example (a code sketch of this setup follows the list).
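Here is that XOR experiment as a PyTorch sketch (two hidden layers of 4 tanh units; the optimizer, learning rate, and step count are illustrative choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
# The four XOR points: label 1 when exactly one coordinate is positive.
X = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

net = nn.Sequential(
    nn.Linear(2, 4), nn.Tanh(),   # first hidden layer of 4
    nn.Linear(4, 4), nn.Tanh(),   # second hidden layer of 4
    nn.Linear(4, 1),              # output logit
)
opt = torch.optim.Adam(net.parameters(), lr=0.05)

for step in range(500):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(net(X), y)
    loss.backward()
    opt.step()

print((net(X).sigmoid() > 0.5).float().flatten())   # should match y: 0, 1, 1, 0
```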
Credit risk, click-through prediction, propensity models: wherever the data is rows in a database, an MLP (or its gradient-boosted-tree cousin) is the workhorse model, and it often beats more elaborate architectures in this regime.
Each transformer layer alternates self-attention with a 2-layer MLP, often called the feed-forward network (FFN). Most of the parameters in models like GPT and Gemini live in those MLPs; they do the heavy lifting of feature transformation.
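A sketch of what that FFN block typically looks like (the 4x expansion and GELU are common conventions rather than universal; `d_model` and the shapes below are illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """The 2-layer MLP inside a transformer block: expand, squash, project back."""
    def __init__(self, d_model: int = 512, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),   # widen
            nn.GELU(),                                 # nonlinearity
            nn.Linear(expansion * d_model, d_model),   # project back to model width
        )

    def forward(self, x):
        return self.net(x)   # applied independently at every token position

ffn = FeedForward()
tokens = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
print(ffn(tokens).shape)           # torch.Size([2, 16, 512])
```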
Autoencoders, VAEs, and diffusion-model U-Nets all stack MLPs (or their convolutional variants) in encoder/decoder pairs. Same recipe, different scaffolding.
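As one concrete instance of that recipe, a minimal MLP autoencoder (the 784-dimensional input and bottleneck size are illustrative; a real model tunes both to the data):

```python
import torch
import torch.nn as nn

class MLPAutoencoder(nn.Module):
    """Symmetric MLP pair: compress to a bottleneck, then reconstruct."""
    def __init__(self, dim: int = 784, hidden: int = 128, latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent),          # bottleneck code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),             # reconstruction
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MLPAutoencoder()
x = torch.randn(32, 784)                        # a batch of flattened 28x28 images
recon = model(x)
print(nn.functional.mse_loss(recon, x).item()) # reconstruction loss to minimize
```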
Surrogate models for fluid dynamics and weather forecasting use MLPs to approximate expensive simulators, and protein structure prediction leans on the same building block: AlphaFold's MSA features pass through MLP transition layers inside its Evoformer blocks.
- TensorFlow Playground (interactive) · Smilkov et al. · The most polished version of the toy above. Same datasets, more architectural knobs, configurable feature crossings.
- A Visual Proof That Neural Nets Can Compute Any Function (free book) · Michael Nielsen · A patient construction showing how a single hidden layer can carve any shape, building up from a single sigmoid step. The clearest universal-approximation argument.
- In Search of the Real Inductive Bias (paper) · Neyshabur, Tomioka, Srebro (2015) · Why depth wins over width: the implicit regularization that makes deeper networks generalize better than universal approximation alone would predict.
- Deep Learning, Chapter 6 (textbook) · Goodfellow, Bengio, Courville · The textbook chapter on feedforward networks. The bridge between forward propagation and the architectures coming next in this series.