jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XIV

Build a Network

Stack layers, pick a nonlinearity, point it at a dataset, and press play. The decision boundary forms in real time as the network learns.

The concept

A multi-layer perceptron (MLP) is a stack of linear layers separated by nonlinearities. Add depth and width, and it becomes a universal function approximator.

Each hidden layer takes the previous activations, applies a weight matrix and bias, and bends the result through an activation function. Chain enough of these together and any continuous decision boundary becomes expressible.
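That chain of affine maps and nonlinearities is short enough to sketch directly. Below is a minimal NumPy forward pass, assuming tanh hidden layers and a linear output layer; the layer sizes and function names are illustrative, not the demo's actual code.

```python
import numpy as np

def mlp_forward(x, weights, biases, act=np.tanh):
    """Forward pass: alternate affine maps with a nonlinearity.

    The final layer is left linear so the caller can attach whatever
    output (sigmoid, softmax, identity) the task needs.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = act(h @ W + b)                   # affine map, then bend it
    return h @ weights[-1] + biases[-1]      # linear output layer

# A 2 -> 4 -> 4 -> 1 network with random weights.
rng = np.random.default_rng(0)
sizes = [2, 4, 4, 1]
Ws = [rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

out = mlp_forward(np.array([[0.5, -1.0]]), Ws, bs)
print(out.shape)  # (1, 1)
```

The whole network is just that loop; everything the visualization shows is a consequence of how training reshapes `Ws` and `bs`.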

The interesting question is no longer "can it fit this?" — almost always yes — but "how few neurons and layers does it take to fit it cleanly?" Watching a small network learn a hard pattern is the fastest way to develop intuition for that.

Why ML cares

The MLP is the simplest deep network and the workhorse for tabular data — risk scoring, click-through prediction, recommender re-ranking. It's also the backbone of every transformer's feed-forward block, every autoencoder's encoder/decoder, every diffusion model's denoiser.

The universal approximation theorem (Cybenko 1989, Hornik 1991) proved that one hidden layer with enough neurons can approximate any continuous function. Modern practice uses deeper, narrower networks because depth makes the function easier to learn from limited data — even though shallow alternatives exist in principle.
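The theorem is constructive enough to demo without any training: a one-hidden-layer ReLU network can be wired by hand to interpolate a target function. The sketch below builds such a network for sin on [0, 2π] by placing one hidden unit per knot and choosing output weights from the slope changes; all names here are illustrative.

```python
import numpy as np

# Knots of a piecewise-linear interpolant of sin on [0, 2*pi].
k = 50
knots = np.linspace(0.0, 2 * np.pi, k)
vals = np.sin(knots)

# Slope on each segment, and the slope *change* at each interior knot.
slopes = np.diff(vals) / np.diff(knots)
coefs = np.concatenate([[slopes[0]], np.diff(slopes)])  # output weights

def net(x):
    # One hidden ReLU layer: unit i activates as relu(x - knots[i]),
    # so each unit contributes a slope change at its knot.
    hidden = np.maximum(0.0, x[:, None] - knots[:-1][None, :])
    return vals[0] + hidden @ coefs

xs = np.linspace(0.0, 2 * np.pi, 1000)
err = np.max(np.abs(net(xs) - np.sin(xs)))
```

With 49 hidden units the worst-case error is already a few thousandths; halving the knot spacing roughly quarters it. Training finds weights like these automatically instead of by construction.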

Try this
  1. Pick the spiral dataset and start with one layer of 1 neuron. It can't curve enough — accuracy plateaus around chance. Add a layer; bump neurons; watch the boundary spiral.
  2. Switch activation to ReLU on the spiral. Notice the boundary becomes piecewise-linear — sharp angular regions instead of smooth curves. That's the kink in ReLU showing through.
  3. Try XOR with two hidden layers of 4 neurons. Watch the boundary fold itself into a clean cross in just a few seconds — the canonical "deep beats shallow" example.
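Step 3 can be reproduced outside the demo in a few lines. The sketch below trains a tiny tanh MLP on XOR with plain gradient descent; for brevity it uses one hidden layer (already enough for XOR), and since tiny networks sometimes get stuck in bad inits, it retries a few random seeds. Hyperparameters and names are illustrative assumptions.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def train_xor(seed, hidden=4, lr=0.5, epochs=5000):
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                # hidden layer
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))    # sigmoid output
        # Gradient of mean binary cross-entropy w.r.t. the logit is (p - y)/N.
        dlogit = (p - y) / len(X)
        dW2 = h.T @ dlogit; db2 = dlogit.sum(0)
        dh = dlogit @ W2.T * (1 - h**2)         # tanh'(z) = 1 - tanh(z)^2
        dW1 = X.T @ dh; db1 = dh.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return p

# Tiny nets occasionally get stuck, so restart from a few random inits.
for seed in range(5):
    p = train_xor(seed)
    if np.all((p > 0.5) == (y > 0.5)):
        break
```

The whole loop runs in well under a second, which is why the demo's boundary folds into its cross almost immediately.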
The shaded heatmap shows the network's output over the input plane: oxblood for class 1, ink for class 0, with intensity tracking confidence. Click anywhere to trace one point's forward pass through the layers.
Where you've seen this · 4 examples
Tabular ML in production

Credit risk, click-through prediction, propensity models: wherever the data is rows in a database, an MLP (or its perennial rival, the gradient-boosted tree) is the workhorse model, and it often beats more elaborate architectures in this regime.

Inside every transformer

Each transformer layer alternates self-attention with a 2-layer MLP (often called the FFN). Most of the parameters in models like GPT and Gemini live in those MLPs; they do the heavy lifting of feature transformation.
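That FFN is just the expand-bend-project pattern from this page at a larger scale. A minimal NumPy sketch, assuming the common tanh approximation of GELU and the conventional 4x expansion; the dimensions and initialization are illustrative.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Transformer feed-forward block: expand, nonlinearity, project back."""
    h = x @ W1 + b1
    # GELU via the common tanh approximation.
    g = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return g @ W2 + b2

d_model, d_ff = 8, 32          # conventional ratio: d_ff = 4 * d_model
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
out = ffn(rng.standard_normal((5, d_model)), W1, np.zeros(d_ff), W2, np.zeros(d_model))
print(out.shape)  # (5, 8)
```

Note the block maps each token position independently and returns to `d_model`, so it can be stacked with attention via residual connections.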

Encoders, decoders, denoisers

Autoencoders, VAEs, and diffusion-model U-Nets all stack MLPs (or their convolutional variants) in encoder/decoder pairs. Same recipe, different scaffolding.

Physics & scientific ML

Surrogate models for fluid dynamics, weather forecasting, and protein structure prediction all use MLPs to approximate expensive simulators. AlphaFold's Evoformer, for instance, interleaves attention with MLP transition blocks.

Further reading