What the model looks at
A neural network reading "the bat hit the ball" needs to know that bat here means baseball, not the animal. Attention is the mechanism that lets each token gather context from the others — a learned, soft, content-addressed lookup.
Attention lets each position in a sequence look at every other position and choose what to absorb.
Each token gets three roles, one learned vector each: a query (what am I looking for?), a key (what do I represent?), and a value (what would I contribute?). To compute attention for one query: take its dot product with every key to get a similarity score for each, then softmax turns those scores into a list of probabilities that sum to 1. The output is a weighted sum of the values, with those probabilities as the weights.
Written in math: Attention(Q, K, V) = softmax(QKᵀ / √d) V, where d is the dimension of the key vectors. Dot products grow with d, so dividing by √d keeps the scores in a range where the softmax doesn't saturate. One operation. One equation. The mechanism that ate machine learning.
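Here is that equation as a minimal NumPy sketch; the function name and the toy shapes are illustrative choices, not anything from the text above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q Kᵀ / √d) V for one attention head.

    Q: (n_queries, d)  -- one query vector per position
    K: (n_keys, d)     -- one key vector per position
    V: (n_keys, d_v)   -- one value vector per position
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # similarity of every query with every key
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # each output row is a blend of the values

# Toy usage: 3 tokens with 4-dimensional vectors (shapes are arbitrary).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-mixed vector per token
```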
Self-attention is the engine of every transformer — and transformers are now the default architecture for language (GPT, Gemini, Claude), code (Copilot), images (ViT, DiT), audio (Whisper), proteins (AlphaFold), and even reinforcement-learning agents.
Before attention, sequence models had to compress an entire sentence into a single bottleneck vector. Attention lets every output position freely query the whole input — context is a database lookup, not a memory squeeze. That's why long documents finally became tractable.
- Click bank in "river bank flooded." The bars show bank attending strongly to river and flooded — context disambiguates word sense.
- Switch to All pairs · matrix. Now you see every word's attention pattern at once. Click any row to make that word the query. Function words (the, a) light up rows roughly uniformly; content words make sharper, more selective rows.
- Switch to The output · weighted blend. The bar shows what bank actually becomes after attention: a literal mix of the words it attended to. Each segment's width is its softmax weight; the sketch after this list spells out that arithmetic.
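To make the weighted-blend view concrete, here is a hand-worked toy version of that bar. The 2-dimensional value vectors and the attention weights are made-up numbers for illustration, not values from the demo.

```python
import numpy as np

# Made-up value vectors for "river", "bank", "flooded" (2-D, hand-picked).
values = {
    "river":   np.array([1.0, 0.0]),
    "bank":    np.array([0.2, 0.2]),
    "flooded": np.array([0.0, 1.0]),
}

# Suppose softmax gave bank's query these weights (note they sum to 1).
weights = {"river": 0.45, "bank": 0.10, "flooded": 0.45}

# bank's output is literally the weighted mix of the value vectors.
output = sum(w * values[tok] for tok, w in weights.items())
print(output)  # [0.47 0.47] -- pulled toward river and flooded
```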
GPT-4, Claude, Gemini, Llama: each is a stack of transformer blocks built around self-attention. The "context window" you read about (128k tokens, 1M tokens) is exactly how many tokens each query can attend to.
AlphaFold 2 used attention over residue-pair embeddings to predict 3D structure from amino-acid sequences. The 2020 CASP14 result went on to earn its creators a share of the 2024 Nobel Prize in Chemistry; attention was the architectural backbone.
Modern diffusion models like DALL·E 3 and Imagen use attention layers inside their denoiser networks, and diffusion transformers (DiT) drop the older U-Net convolutions entirely. Same recipe; different domain.
YouTube's recommendation model uses attention over your watch history to score new candidate videos. Same Q/K/V structure; just videos instead of tokens.
- Attention Is All You Need · paper · Vaswani et al. (2017) · The transformer paper, the one that reorganized AI for the next decade. Read it once even if you don't follow every detail.
- The Illustrated Transformer · essay · Jay Alammar · A widely shared visual explanation of attention and transformers. The diagrams that taught a generation.
- A Mathematical Framework for Transformer Circuits · essay · Anthropic Interpretability · The deep-mechanism view of attention: how QK and OV circuits implement specific behaviors. Required reading if you want to understand what attention learns.
- Attention? Attention! · essay · Lilian Weng · A historical and technical survey of attention variants, from Bahdanau et al.'s 2014 RNN attention to modern multi-head transformer attention.