jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XIX

Holding it Together

Two simple tricks that make deep nets trainable: dropout randomly silences neurons during training, and batch normalization standardizes activations so each layer sees clean inputs. Both look like accidents that turned out to matter.

The concept

Deep networks overfit and drift. Dropout and batch norm are two cheap surgical fixes that quietly hold them together.

Dropout: at training time, randomly zero out a fraction (say 30%) of neurons in each layer. This forces the network to spread responsibility across many units instead of relying on any single one. At test time all neurons are active, and outputs are scaled so their expected value matches training (most frameworks use the equivalent "inverted" variant, which rescales the surviving units during training instead).
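A minimal numpy sketch of the training-time mask, using the common "inverted" formulation that rescales survivors during training so test time is a no-op. Names and shapes here are illustrative, not any framework's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training=True):
    """Inverted dropout: zero each unit with prob p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                         # test time: identity, nothing to undo
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)          # rescale so the expected output equals x

h = np.ones((4, 8))                      # stand-in hidden activations
print(dropout(h, p=0.5))                 # entries are exactly 0.0 or 2.0
```

Because the rescaling happens during training, inference needs no special case, which is why frameworks prefer this variant.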

Batch norm: at every layer, subtract the mini-batch mean and divide by the standard deviation, then learn an affine rescale. Keeps activations from drifting to extreme values as the network trains.
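The same recipe in numpy: per-feature statistics over the mini-batch, then a learned affine rescale. `gamma` and `beta` stand in for the learned parameters; this is an illustrative sketch, not a framework implementation (it omits the running statistics used at inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply a learned affine."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta             # learnable scale and shift

# Activations that have drifted to mean ~5, std ~3:
x = np.random.default_rng(1).normal(5.0, 3.0, size=(64, 10))
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per feature
```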

Why ML cares

These two tricks turned the dial from "deep nets don't really train well" to "deep nets are routine." The 2014–2016 explosion of computer-vision results — VGG, GoogLeNet, ResNet — was as much about these regularizers as about the architectures themselves.

Today, dropout still appears in many transformers; batch norm has been largely replaced by layer norm in language models (different statistics, same idea). Both are part of every working deep network's plumbing.
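The "different statistics, same idea" point can be made concrete: batch norm normalizes each feature over the batch axis, layer norm normalizes each sample over its feature axis. A minimal numpy comparison (illustrative only, learned affine parameters omitted):

```python
import numpy as np

def batch_norm_stats(x, eps=1e-5):
    """Batch norm: normalize each feature over the batch (axis 0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm_stats(x, eps=1e-5):
    """Layer norm: normalize each sample over its features (axis -1)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(2).normal(2.0, 5.0, size=(4, 6))
print(batch_norm_stats(x).mean(axis=0).round(6))   # each column ≈ 0
print(layer_norm_stats(x).mean(axis=1).round(6))   # each row ≈ 0
```

Layer norm's statistics don't depend on the batch at all, which is why it works at batch size 1 — the regime language-model inference lives in.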

Try this
  1. Drag the dropout rate from 0 to 0.7. Watch random units fade to zero on each forward pass — and the test-time outputs stay almost identical thanks to scaling.
  2. Toggle BatchNorm off on the comparison plot. Without it, deeper layers' activations drift to extreme values; with it, every layer's distribution stays roughly standard.
  3. Push dropout near 0.9 — almost every unit silenced; the surviving ones get scaled up so much they look hot. Then drop to 0 — full network back. Anywhere in between is the regularization regime.
· Top: a 4-layer network's hidden activations on a single forward pass; diagonal stripes mark dropout-silenced neurons. Bottom: activation distribution per layer. Side panel: the two regularizer blocks and what's inside each.
Where you've seen this · 4 examples
AlexNet to today

AlexNet (2012) used dropout aggressively in its fully connected layers, and its ImageNet win was the landmark demonstration that large CNNs could generalize. Almost every CNN since has carried the trick.

ResNet and BatchNorm

ResNet's 152-layer networks would not train without BatchNorm. The combination of skip connections and BN is what took depth from "hard" to "routine," and it remains the signature pairing of the deep-learning era.

Transformers and LayerNorm

Every transformer block applies LayerNorm: before each sublayer in modern pre-norm designs, after it in the original post-norm ones. GPT, BERT, Gemini: all of them. Without it the residual stream drifts and training stalls within a few hundred steps.

A/B-testing for deep learning

"Adding dropout p=0.1 helped on this task" / "BN actually hurt on this task" are still routine empirical findings. Both tools require tuning and can backfire — they're not free.

Further reading