jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXIV

The Transformer

Stack attention blocks. Add skip connections, layer norm, multi-head, and an MLP. The 2017 architecture that ate everything — language, code, images, audio, proteins.

The concept

A transformer is a stack of identical blocks. Each block does two things: attention (let tokens look at each other) and a feed-forward MLP (transform features per token).

Around each sub-block: a residual connection (add the input back) and a LayerNorm. The residual stream carries information forward; each block writes its update into it. Multi-head attention runs several attention operations in parallel, each with its own learned Q/K/V projections.
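
In code, one block is only a few lines. The sketch below is a minimal PyTorch rendering, not the visual's own implementation: it uses the pre-LayerNorm arrangement common in modern models (the 2017 paper put the norm after the residual add), and the widths (d_model=512, 8 heads, 2048-wide MLP) are illustrative defaults.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: multi-head attention + MLP, each wrapped in a residual
    connection and a LayerNorm (pre-LN arrangement; sizes are illustrative)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                     # x: (batch, seq, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)      # tokens look at each other
        x = x + attn_out                      # attention writes to the residual stream
        x = x + self.mlp(self.ln2(x))         # the FFN writes its update
        return x
```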

That's the whole architecture. Stack 12 of these blocks (BERT-base), 96 (GPT-3), or reportedly well over a hundred (GPT-4, whose depth is undisclosed), train on a lot of text, and you get a language model.
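
To make the stacking concrete, here is a hedged skeleton that reuses the TransformerBlock sketch above; the vocabulary size, depth, and width are placeholders, and positional encoding is left out for brevity.

```python
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Hypothetical skeleton: embeddings, a stack of identical blocks,
    and an output projection. Positional encoding omitted for brevity."""
    def __init__(self, vocab_size=50_000, n_layers=12, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model) for _ in range(n_layers)   # sketch above
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):             # token_ids: (batch, seq)
        x = self.embed(token_ids)
        for block in self.blocks:             # the residual stream flows through every block
            x = block(x)
        return self.head(self.ln_f(x))        # logits over the vocabulary
```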

Why ML cares

Transformers are the architecture behind GPT-4, Claude, Gemini, Llama, AlphaFold 2, DALL·E, Whisper, ViT — essentially every state-of-the-art model since 2018. The 2017 paper Attention Is All You Need is one of the most consequential machine-learning publications ever written.

The reason for the dominance: transformers parallelize cleanly across sequence positions (unlike RNNs), scale predictably (loss falls as a power law in training compute), and transfer well across tasks. No other architecture in the zoo offers all three.
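
For the scaling claim, the empirical curves have roughly the power-law shape sketched below; the constants here are invented for illustration, not fitted values.

```python
# Illustrative shape of a compute scaling law: loss ~ (C_c / C) ** alpha.
# The constants are made up for the example; real values come from fitting
# curves to many training runs.
def predicted_loss(compute_flops, C_c=3e8, alpha=0.05):
    return (C_c / compute_flops) ** alpha

for c in (1e18, 1e20, 1e22, 1e24):            # training compute in FLOPs
    print(f"{c:.0e} FLOPs -> loss ~ {predicted_loss(c):.3f}")
```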

Try this
  1. Open One block · step through (or click the canvas to replay). Watch the residual stream pick up two updates: attention writes once, then the FFN writes once. The little dot riding down the dashed line is what every layer would carry through a real model.
  2. Open Heads at one layer, then drag the heads slider from 1 to 4. With one head, attention has a single pattern; with four heads you can see each head specialize — previous-token, first-token, content match, local window.
  3. Open Across layers · one query. Click the canvas to cycle the query word and scrub the layer slider. Early layers tend to look locally; later ones reach further. The "thought process" varies with content and depth. (A code sketch after this list shows how to pull the same per-layer, per-head maps from a real model.)
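
To poke at the same patterns outside the visual, here is a hedged sketch using the Hugging Face transformers library; the model choice ("gpt2"), the prompt, and the layer/head/query indices are all arbitrary choices for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tok("The animal did not cross the street because it was tired",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, each shaped (batch, n_heads, seq, seq).
layer, head, query_pos = 5, 3, -1             # which layer, head, and query token to inspect
weights = out.attentions[layer][0, head, query_pos]
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for t, w in zip(tokens, weights.tolist()):
    print(f"{t:>12s}  {w:.2f}")
```
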
Where you've seen this · 4 examples
Every language model

GPT-2/3/4, Gemini, Claude, Llama, Mistral, Qwen — every modern LLM is a stack of transformer blocks (often 100+). The differences are size, training data, and minor architectural tweaks.

Vision Transformers (ViT)

Cut an image into 16×16 patches, embed each patch as a token, run a transformer. Took over the ImageNet leaderboards around 2021. Now standard for image models, often outperforming CNNs at scale.
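
A minimal sketch of that patch-to-token step, using the common ViT-Base numbers (224×224 input, 16×16 patches, 768-dim tokens) purely as an example:

```python
import torch
import torch.nn as nn

patch_size, d_model = 16, 768
# A strided convolution applies one linear projection per patch.
to_patches = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # (batch, channels, H, W)
x = to_patches(image)                         # (1, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)              # (1, 196, 768): 196 patch tokens
# From here the token sequence goes through ordinary transformer blocks.
```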

AlphaFold 2 & ESMFold

Protein structure prediction. AlphaFold 2's Evoformer and structure modules are built on attention, and ESMFold runs a transformer protein language model. AlphaFold 2 is the 2020 DeepMind result that solved a 50-year-old problem.

Whisper, Gemini Live, voice agents

Speech recognition and synthesis transitioned from RNN/LSTM stacks to encoder–decoder transformers around 2020. Whisper is the open-source reference; Gemini Live and similar voice agents are the productized versions.

Further reading