The Transformer
Stack attention blocks. Add skip connections, layer norm, multi-head, and an MLP. The 2017 architecture that ate everything — language, code, images, audio, proteins.
A transformer is a stack of identical blocks. Each block does two things: attention (let tokens look at each other) and a feed-forward MLP (transform features per token).
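The two per-block operations can be sketched in a few lines of NumPy. This is a minimal, hypothetical sketch (single head, no masking, no learned biases in attention), not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each output row is a weighted mix of
    # value rows, with weights given by query-key similarity.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq, seq) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v                   # (seq, d_v) mixed values

def ffn(x, w1, b1, w2, b2):
    # Position-wise MLP: the same weights applied to every token independently.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2
```

Attention is the only place tokens exchange information; the FFN never looks across positions, which is why the two stages divide the labor so cleanly.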
Around each sub-block: a residual connection (add the input back) and a LayerNorm. The residual stream carries information forward; each block writes its update into it. Multi-head attention runs several attention operations in parallel, each with its own learned Q/K/V projections.
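Putting residuals, LayerNorm, and multiple heads together gives one full block. The sketch below uses the pre-LN arrangement (normalize, transform, add back), which GPT-2 and most later models adopted; the original 2017 paper normalized after the residual add. Weight names (`wq`, `w1`, etc.) and the omission of LayerNorm's learned scale/shift are simplifications for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    seq, d = x.shape
    d_h = d // n_heads

    def split(w):
        # Project, then split the model dimension into independent heads.
        return (x @ w).reshape(seq, n_heads, d_h).transpose(1, 0, 2)

    q, k, v = split(wq), split(wk), split(wv)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h))  # (heads, seq, seq)
    out = (att @ v).transpose(1, 0, 2).reshape(seq, d)      # concatenate heads
    return out @ wo

def block(x, p, n_heads=4):
    # Each sub-block reads a normalized copy of the residual stream
    # and adds its update back in.
    x = x + multi_head_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"], n_heads)
    x = x + np.maximum(0.0, layer_norm(x) @ p["w1"]) @ p["w2"]
    return x
```

Note that `block` only ever *adds* to `x`: the residual stream enters and leaves unchanged except for the two writes, which is exactly what the one-block visualization shows.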
That's the whole architecture. Stack 12 of these blocks (BERT-base), 96 (GPT-3), or reportedly more still (GPT-4's depth is undisclosed), train on a lot of text, and you get a language model.
Transformers are the architecture behind GPT-4, Claude, Gemini, Llama, AlphaFold 2, DALL·E, Whisper, ViT — essentially every state-of-the-art model since 2018. The 2017 paper Attention Is All You Need is one of the most consequential machine-learning publications ever written.
The reason for the dominance: transformers parallelize cleanly across sequence positions (unlike RNNs), scale predictably (loss falls as a power law in compute), and transfer well across tasks. No other architecture in the zoo has all three properties at once.
- Open One block · step through (or click the canvas to replay). Watch the residual stream pick up two updates: attention writes once, then the FFN writes once. The little dot riding down the dashed line is what every layer would carry through a real model.
- Open Heads at one layer, then drag the heads slider from 1 to 4. With one head, attention has a single pattern; with four heads you can see each head specialize — previous-token, first-token, content match, local window.
- Open Across layers · one query. Click the canvas to cycle the query word and scrub the layer slider. Early layers tend to look locally; later ones reach further. The "thought process" varies with content and depth.
GPT-2/3/4, Gemini, Claude, Llama, Mistral, Qwen — every modern LLM is a stack of transformer blocks (often 100+). The differences are size, training data, and minor architectural tweaks.
Cut an image into 16×16 patches, embed each patch as a token, run a transformer. The Vision Transformer (ViT) matched and then surpassed the best CNNs on ImageNet around 2021. Now standard for image models, often outperforming CNNs at scale.
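The "cut into patches" step is just a reshape. A minimal sketch of ViT-style tokenization, assuming a 224×224 RGB image and the standard 16×16 patch size (in a real ViT, a learned linear projection then maps each flattened patch to the model dimension):

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into a sequence of flattened patches,
    # one "token" per 16x16 patch.
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch]               # drop any ragged edge
    x = x.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                    # group by patch position
    return x.reshape(gh * gw, patch * patch * C)      # (num_tokens, patch_dim)

img = np.random.rand(224, 224, 3)
tokens = patchify(img)   # 14 * 14 = 196 tokens, each of dimension 16*16*3 = 768
```

From the transformer's point of view there is no difference between these 196 patch tokens and 196 word tokens, which is why the architecture transferred to vision essentially unchanged.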
Protein structure prediction. The Evoformer and structure modules are transformer-based. The 2020 DeepMind result that solved a 50-year-old problem.
Speech recognition and synthesis transitioned from RNN/LSTM to encoder–decoder transformers around 2020. Whisper is the open-source reference; Gemini Live is the productized version.
- Attention Is All You Need paper Vaswani et al. (2017) · The transformer paper. The single most consequential ML paper of the 2010s.
- The Annotated Transformer walkthrough Sasha Rush · Line-by-line PyTorch implementation of the original transformer paper, with prose explaining each block.
- Let's build GPT: from scratch, in code video Andrej Karpathy · A two-hour live coding of nanoGPT. Best end-to-end "build a transformer from scratch" lesson available.
- Transformer Circuits Thread research Anthropic · Mechanistic interpretability — deciphering what each attention head and each MLP neuron actually computes inside a trained transformer.