jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXIV

The Transformer

Stack attention blocks. Add skip connections, layer norm, multi-head, and an MLP. The 2017 architecture that ate everything — language, code, images, audio, proteins.

The concept

A transformer is a stack of identical blocks. Each block does two things: attention (let tokens look at each other) and a feed-forward MLP (transform features per token).

Around each sub-block: a residual connection (add the input back) and a LayerNorm. The residual stream carries information forward; each block writes its update into it. Multi-head attention runs several attention operations in parallel, each with its own learned Q/K/V projections.
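
In code, one block is only a few lines. The sketch below is a minimal PyTorch rendering, not the visual's own implementation: it uses the pre-LayerNorm arrangement common in modern models (the 2017 paper put the norm after the residual add), and the widths (d_model=512, 8 heads, 2048-wide MLP) are illustrative defaults.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: multi-head attention + MLP, each wrapped in a residual
    connection and a LayerNorm (pre-LN arrangement; sizes are illustrative)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                     # x: (batch, seq, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)      # tokens look at each other
        x = x + attn_out                      # attention writes to the residual stream
        x = x + self.mlp(self.ln2(x))         # the FFN writes its update
        return x
```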

That's the whole architecture. Stack 12 of these blocks (BERT-base), 96 (GPT-3), or reportedly well over a hundred (GPT-4, whose depth is undisclosed), train on a lot of text, and you get a language model.
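
To make the stacking concrete, here is a hedged skeleton that reuses the TransformerBlock sketch above; the vocabulary size, depth, and width are placeholders, and positional encoding is left out for brevity.

```python
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Hypothetical skeleton: embeddings, a stack of identical blocks,
    and an output projection. Positional encoding omitted for brevity."""
    def __init__(self, vocab_size=50_000, n_layers=12, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model) for _ in range(n_layers)   # sketch above
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):             # token_ids: (batch, seq)
        x = self.embed(token_ids)
        for block in self.blocks:             # the residual stream flows through every block
            x = block(x)
        return self.head(self.ln_f(x))        # logits over the vocabulary
```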

Why ML cares

Transformers are the architecture behind GPT-4, Claude, Gemini, Llama, AlphaFold 2, DALL·E, Whisper, ViT — essentially every state-of-the-art model since 2018. The 2017 paper Attention Is All You Need is one of the most consequential machine-learning publications ever written.

The reason for the dominance: transformers parallelize cleanly across sequence positions (unlike RNNs), scale predictably (loss falls as a power law in training compute), and transfer well across tasks. No other architecture in the zoo offers all three.
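
For the scaling claim, the empirical curves have roughly the power-law shape sketched below; the constants here are invented for illustration, not fitted values.

```python
# Illustrative shape of a compute scaling law: loss ~ (C_c / C) ** alpha.
# The constants are made up for the example; real values come from fitting
# curves to many training runs.
def predicted_loss(compute_flops, C_c=3e8, alpha=0.05):
    return (C_c / compute_flops) ** alpha

for c in (1e18, 1e20, 1e22, 1e24):            # training compute in FLOPs
    print(f"{c:.0e} FLOPs -> loss ~ {predicted_loss(c):.3f}")
```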

Try this
  1. Open One block · step through (or click the canvas to replay). Watch the residual stream pick up two updates: attention writes once, then the FFN writes once. The little dot riding down the dashed line is what every layer would carry through a real model.
  2. Open Heads at one layer, then drag the heads slider from 1 to 4. With one head, attention has a single pattern; with four heads you can see each head specialize — previous-token, first-token, content match, local window.
  3. Open Across layers · one query. Click the canvas to cycle the query word and scrub the layer slider. Early layers tend to look locally; later ones reach further. The "thought process" varies with content and depth. (A code sketch after this list shows how to pull the same per-layer, per-head maps from a real model.)
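
To poke at the same patterns outside the visual, here is a hedged sketch using the Hugging Face transformers library; the model choice ("gpt2"), the prompt, and the layer/head/query indices are all arbitrary choices for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tok("The animal did not cross the street because it was tired",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, each shaped (batch, n_heads, seq, seq).
layer, head, query_pos = 5, 3, -1             # which layer, head, and query token to inspect
weights = out.attentions[layer][0, head, query_pos]
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for t, w in zip(tokens, weights.tolist()):
    print(f"{t:>12s}  {w:.2f}")
```
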
Where you've seen this · 4 examples
Every language model

GPT-2/3/4, Gemini, Claude, Llama, Mistral, Qwen — every modern LLM is a stack of transformer blocks (often 100+). The differences are size, training data, and minor architectural tweaks.

Vision Transformers (ViT)

Cut an image into 16×16 patches, embed each patch as a token, run a transformer. Took over the ImageNet leaderboards around 2021. Now standard for image models, often outperforming CNNs at scale.
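
A minimal sketch of that patch-to-token step, using the common ViT-Base numbers (224×224 input, 16×16 patches, 768-dim tokens) purely as an example:

```python
import torch
import torch.nn as nn

patch_size, d_model = 16, 768
# A strided convolution applies one linear projection per patch.
to_patches = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # (batch, channels, H, W)
x = to_patches(image)                         # (1, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)              # (1, 196, 768): 196 patch tokens
# From here the token sequence goes through ordinary transformer blocks.
```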

AlphaFold 2 & ESMFold

Protein structure prediction. AlphaFold 2's Evoformer and structure modules are built on attention, and ESMFold runs a transformer protein language model. AlphaFold 2 is the 2020 DeepMind result that solved a 50-year-old problem.

Whisper, Gemini Live, voice agents

Speech recognition and synthesis transitioned from RNN/LSTM stacks to encoder–decoder transformers around 2020. Whisper is the open-source reference; Gemini Live and similar voice agents are the productized versions.

Further reading