Build a Network
Stack layers, pick a nonlinearity, point it at a dataset, and press play. The decision boundary forms in real time as the network learns.
A multi-layer perceptron (MLP) is a stack of linear layers separated by nonlinearities. Add depth and width, and it becomes a universal function approximator.
Each hidden layer takes the previous activations, applies a weight matrix and bias, and bends the result through an activation function. Chain enough of these together and any continuous decision boundary becomes expressible.
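In code, that's one matrix multiply, one bias add, and one elementwise squash per layer. Here's a minimal NumPy sketch of the forward pass (the layer sizes, tanh, and random init are illustrative choices, not anything the demo above fixes):

```python
import numpy as np

def forward(x, params):
    """Forward pass through an MLP: tanh(W h + b) at each hidden layer."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)          # linear map, then bend through the nonlinearity
    W_out, b_out = params[-1]
    return W_out @ h + b_out            # final layer stays linear (raw logit)

rng = np.random.default_rng(0)
sizes = [2, 8, 8, 1]                    # 2-D input -> two hidden layers of 8 -> 1 logit
params = [(rng.normal(0, 0.5, (m, n)), np.zeros(m))
          for n, m in zip(sizes, sizes[1:])]

print(forward(np.array([0.3, -1.2]), params))   # a single scalar logit
```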
The interesting question is no longer "can it fit this?" — almost always yes — but "how few neurons and layers does it take to fit it cleanly?" Watching a small network learn a hard pattern is the fastest way to develop intuition for that.
The MLP is the simplest deep network and the workhorse for tabular data — risk scoring, click-through prediction, recommender re-ranking. It's also the backbone of every transformer's feed-forward block, every autoencoder's encoder/decoder, every diffusion model's denoiser.
The universal approximation theorem (Cybenko 1989, Hornik 1991) proved that one hidden layer with enough neurons can approximate any continuous function. Modern practice uses deeper, narrower networks because depth makes the function easier to learn from limited data — even though shallow alternatives exist in principle.
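You can watch the theorem earn its keep in a few lines. The sketch below (PyTorch; the target function, widths, and training schedule are all illustrative choices) fits sin(x) with a single tanh hidden layer and prints how the error shrinks as that layer widens:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3.14, 3.14, 256).unsqueeze(1)
y = torch.sin(x)

for width in (2, 8, 64):                # widen the single hidden layer
    net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    print(f"width {width:3d}: final MSE {loss.item():.5f}")
```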
- Pick the spiral dataset and start with one layer of 1 neuron. It can't curve enough — accuracy plateaus around chance. Add a layer; bump neurons; watch the boundary spiral.
- Switch activation to ReLU on the spiral. Notice the boundary becomes piecewise-linear — sharp angular regions instead of smooth curves. That's the kink in ReLU showing through.
- Try XOR with two hidden layers of 4 neurons. Watch the boundary fold itself into a clean cross in just a few seconds: the canonical "deep beats shallow" example (a code sketch of this setup follows the list).
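Here is that XOR experiment as a PyTorch sketch (two hidden layers of 4 tanh units; the optimizer, learning rate, and step count are illustrative choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
# The four XOR points: label 1 when exactly one coordinate is positive.
X = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

net = nn.Sequential(
    nn.Linear(2, 4), nn.Tanh(),   # first hidden layer of 4
    nn.Linear(4, 4), nn.Tanh(),   # second hidden layer of 4
    nn.Linear(4, 1),              # output logit
)
opt = torch.optim.Adam(net.parameters(), lr=0.05)

for step in range(500):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(net(X), y)
    loss.backward()
    opt.step()

print((net(X).sigmoid() > 0.5).float().flatten())   # should match y: 0, 1, 1, 0
```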
Credit risk, click-through prediction, propensity models: wherever the data is rows in a database, an MLP (or its gradient-boosted-tree cousin) is the workhorse model, and it often beats more elaborate architectures in this regime.
Each transformer layer alternates self-attention with a 2-layer MLP, often called the feed-forward network (FFN). Most of the parameters in models like GPT and Gemini live in those MLPs; they do the heavy lifting of feature transformation.
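A sketch of what that FFN block typically looks like (the 4x expansion and GELU are common conventions rather than universal; `d_model` and the shapes below are illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """The 2-layer MLP inside a transformer block: expand, squash, project back."""
    def __init__(self, d_model: int = 512, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),   # widen
            nn.GELU(),                                 # nonlinearity
            nn.Linear(expansion * d_model, d_model),   # project back to model width
        )

    def forward(self, x):
        return self.net(x)   # applied independently at every token position

ffn = FeedForward()
tokens = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
print(ffn(tokens).shape)           # torch.Size([2, 16, 512])
```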
Autoencoders, VAEs, and diffusion-model U-Nets all stack MLPs (or their convolutional variants) in encoder/decoder pairs. Same recipe, different scaffolding.
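As one concrete instance of that recipe, a minimal MLP autoencoder (the 784-dimensional input and bottleneck size are illustrative; a real model tunes both to the data):

```python
import torch
import torch.nn as nn

class MLPAutoencoder(nn.Module):
    """Symmetric MLP pair: compress to a bottleneck, then reconstruct."""
    def __init__(self, dim: int = 784, hidden: int = 128, latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent),          # bottleneck code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),             # reconstruction
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MLPAutoencoder()
x = torch.randn(32, 784)                        # a batch of flattened 28x28 images
recon = model(x)
print(nn.functional.mse_loss(recon, x).item()) # reconstruction loss to minimize
```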
Surrogate models for fluid dynamics and weather forecasting use MLPs to approximate expensive simulators, and protein structure prediction leans on the same building block: AlphaFold's MSA features pass through MLP transition layers inside its Evoformer blocks.
- TensorFlow Playground (interactive) · Smilkov et al. · The most polished version of the toy above. Same datasets, more architectural knobs, configurable feature crossings.
- A Visual Proof That Neural Nets Can Compute Any Function (free book) · Michael Nielsen · A patient construction showing how a single hidden layer can carve any shape, building up from a single sigmoid step. The clearest universal-approximation argument.
- In Search of the Real Inductive Bias (paper) · Neyshabur, Tomioka, Srebro (2015) · Why depth wins over width: the implicit regularization that makes deeper networks generalize better than universal approximation alone would predict.
- Deep Learning, Chapter 6 (textbook) · Goodfellow, Bengio, Courville · The textbook chapter on feedforward networks. The bridge between forward propagation and the architectures coming next in this series.