jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. IX

A Field Guide to Activations

A neural network is a stack of linear layers. Without nonlinearity between them, the whole thing collapses to a single matrix. The activation is what gives a network its bend.

The concept

An activation function is the nonlinear bend applied to each neuron's output — without it, a deep network is no more powerful than a single matrix.

The math is brutally simple: chain N matrix multiplications and you still have just one matrix (call it M = M_N · M_{N-1} · … · M_1). Insert a nonlinear function between layers and that collapse is no longer possible — the network can now express curves, decision boundaries, and shapes that no single linear map can.
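A quick numerical check makes the collapse concrete. A minimal NumPy sketch, with made-up 4×4 weight matrices:

  import numpy as np

  rng = np.random.default_rng(0)
  W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
  x = rng.standard_normal(4)

  deep = W3 @ (W2 @ (W1 @ x))      # three "layers", applied one at a time
  M = W3 @ W2 @ W1                 # ...collapse into a single matrix
  print(np.allclose(deep, M @ x))  # True

No matter how many purely linear layers you stack, the composite is exactly one matrix applied once.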

Five workhorses dominate the literature: sigmoid, tanh, ReLU, leaky ReLU, and GELU. Each is a different choice about how steeply to bend, where to bend, and what happens on the negative side.
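For reference, here is one plain-NumPy rendering of all five. A sketch, not a library: the 0.01 leak and the tanh-based GELU approximation are common choices, not the only ones.

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def tanh(x):
      return np.tanh(x)

  def relu(x):
      return np.maximum(0.0, x)

  def leaky_relu(x, alpha=0.01):   # alpha: slope on the negative side
      return np.where(x > 0, x, alpha * x)

  def gelu(x):                     # tanh approximation of x * Phi(x)
      return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))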

Why ML cares

The choice of activation has decided whole eras of deep learning. The 1980s ran on sigmoid and stalled — gradients vanished as networks deepened. The 2012 AlexNet revolution was, in part, a switch to ReLU. The transformer era (GPT, BERT, Gemini) runs on GELU and SwiGLU.

The derivative matters as much as the function itself. Where the derivative goes flat, gradients during backprop go to zero — and learning stops. Sigmoid's flat tails are why you don't see it in modern deep networks.
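A back-of-envelope sketch shows how fast this bites. Sigmoid's derivative never exceeds 0.25, and the chain rule multiplies one such factor per layer:

  from math import exp

  def dsigmoid(x):
      s = 1.0 / (1.0 + exp(-x))
      return s * (1.0 - s)          # peaks at 0.25, at x = 0

  # Best case for a sigmoid stack: every pre-activation sits at 0,
  # so every chain-rule factor is the maximum 0.25.
  for depth in (5, 10, 20):
      print(depth, dsigmoid(0.0) ** depth)
  # 5  0.0009765625
  # 10 9.5367431640625e-07
  # 20 9.094947017729282e-13

Twenty sigmoid layers shrink the gradient by twelve orders of magnitude before the weights even enter the picture.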

Try this
  1. Toggle Show derivatives. Look at sigmoid's: it's a tiny bump that vanishes outside roughly (−4, 4). That's the vanishing gradient problem.
  2. Compare ReLU and leaky ReLU derivatives. The leaky version keeps a tiny slope on the negative side — preventing "dead neurons" whose gradient is permanently zero.
  3. Hover the chart at x = 1. ReLU and leaky ReLU agree at exactly 1; sigmoid sits at ~0.73; GELU lands at ~0.84, between the two. Each makes a different commitment about how to handle "moderately positive" signals (the sketch after this list checks these numbers).
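To verify those values away from the chart, a minimal sketch using the exact erf form of GELU:

  from math import erf, exp, sqrt

  x = 1.0
  print(max(0.0, x))                         # ReLU: 1.0
  print(x if x > 0 else 0.01 * x)            # leaky ReLU: 1.0, agrees for x > 0
  print(1.0 / (1.0 + exp(-x)))               # sigmoid: ~0.731
  print(0.5 * x * (1.0 + erf(x / sqrt(2))))  # GELU: ~0.841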
Where you've seen this · 4 examples
AlexNet, 2012 — the ReLU moment

The ImageNet winner that kicked off the deep-learning era used ReLU instead of tanh in its hidden layers. The paper attributes much of the speedup — and much of the depth that became practical — to that single change.

GELU in transformers

BERT, GPT-2/3/4, and most modern LLMs use GELU inside their feed-forward blocks. Its smoothness and slight negative-side curvature seem to help with the optimization landscape — though "why exactly" remains an open question.
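GELU is x · Φ(x), where Φ is the standard normal CDF; in practice many implementations ship a tanh approximation instead. A sketch comparing the two (the sample points are arbitrary):

  from math import erf, sqrt, tanh, pi

  def gelu_exact(x):                # x * Phi(x), Phi = standard normal CDF
      return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

  def gelu_tanh(x):                 # widely used tanh approximation
      return 0.5 * x * (1.0 + tanh(sqrt(2.0 / pi) * (x + 0.044715 * x**3)))

  for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
      print(x, round(gelu_exact(x), 5), round(gelu_tanh(x), 5))

The small negative outputs around x ≈ −0.5 are the "slight negative-side curvature" mentioned above; ReLU would clamp them to zero.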

Sigmoid in binary classifiers

The output layer of nearly every neural binary classifier — fraud detector, click-through model, medical risk score — is a sigmoid. It maps any real-valued score to a probability in (0, 1). Hidden layers moved on; output layers didn't.
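The whole pattern fits in a few lines (the 2.3 logit is an invented score, not from any real model):

  from math import exp

  def sigmoid(z):
      return 1.0 / (1.0 + exp(-z))

  logit = 2.3                  # raw score from the final linear layer
  p = sigmoid(logit)           # squashed into (0, 1)
  print(p)                     # ~0.909, readable as a 90.9% probability

Paired with a cross-entropy loss, the gradient with respect to the logit works out to p − y, so the flat tails don't stall learning at the output the way they do in hidden layers.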

Tanh in old-school RNNs

LSTMs and vanilla RNNs used tanh on the hidden state because zero-centered outputs kept the recurrence stable. Recurrence has since been mostly displaced by attention, but the underlying intuition — "bend, but symmetrically" — explains why tanh held that job for so long.
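A toy scalar recurrence (weights invented for illustration) shows why the squashing matters:

  from math import tanh

  w, u = 1.5, 0.8              # recurrent weight > 1, constant input
  h = 0.0
  for _ in range(50):
      h = tanh(w * h + u)      # every step is squashed back into (-1, 1)
  print(h)                     # bounded, no matter how many steps

  h = 0.0
  for _ in range(50):
      h = w * h + u            # no squashing: grows like 1.5**t
  print(h)                     # ~1e9 and climbing

tanh's zero-centered range also keeps the hidden state from drifting in one direction the way sigmoid's strictly positive (0, 1) output would.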

Further reading