What the model looks at
A neural network reading "the bat hit the ball" needs to know that bat here means baseball, not the animal. Attention is the mechanism that lets each token gather context from the others — a learned, soft, content-addressed lookup.
Attention lets each position in a sequence look at every other position and choose what to absorb.
Each token gets three roles, one learned vector each: a query (what am I looking for?), a key (what do I represent?), and a value (what would I contribute?). To compute attention for one query: take its dot product with every key to get a similarity score for each, then softmax turns those scores into a list of probabilities that sum to 1. The output is a weighted sum of the values, with those probabilities as the weights.
Written in math: Attention(Q, K, V) = softmax(QKᵀ / √d) V, where d is the dimension of the key vectors. Dot products grow with d, so dividing by √d keeps the scores in a range where the softmax doesn't saturate. One operation. One equation. The mechanism that ate machine learning.
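Here is that equation as a minimal NumPy sketch; the function name and the toy shapes are illustrative choices, not anything from the text above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q Kᵀ / √d) V for one attention head.

    Q: (n_queries, d)  -- one query vector per position
    K: (n_keys, d)     -- one key vector per position
    V: (n_keys, d_v)   -- one value vector per position
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # similarity of every query with every key
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # each output row is a blend of the values

# Toy usage: 3 tokens with 4-dimensional vectors (shapes are arbitrary).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-mixed vector per token
```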
Self-attention is the engine of every transformer — and transformers are now the default architecture for language (GPT, Gemini, Claude), code (Copilot), images (ViT, DiT), audio (Whisper), proteins (AlphaFold), and even reinforcement-learning agents.
Before attention, sequence models had to compress an entire sentence into a single bottleneck vector. Attention lets every output position freely query the whole input — context is a database lookup, not a memory squeeze. That's why long documents finally became tractable.
- Click bank in "river bank flooded." The bars show bank attending strongly to river and flooded — context disambiguates word sense.
- Switch to All pairs · matrix. Now you see every word's attention pattern at once. Click any row to make that word the query. Function words (the, a) light up rows roughly uniformly; content words make sharper, more selective rows.
- Switch to The output · weighted blend. The bar shows what bank actually becomes after attention: a literal mix of the words it attended to. Each segment's width is its softmax weight; the sketch after this list spells out that arithmetic.
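To make the weighted-blend view concrete, here is a hand-worked toy version of that bar. The 2-dimensional value vectors and the attention weights are made-up numbers for illustration, not values from the demo.

```python
import numpy as np

# Made-up value vectors for "river", "bank", "flooded" (2-D, hand-picked).
values = {
    "river":   np.array([1.0, 0.0]),
    "bank":    np.array([0.2, 0.2]),
    "flooded": np.array([0.0, 1.0]),
}

# Suppose softmax gave bank's query these weights (note they sum to 1).
weights = {"river": 0.45, "bank": 0.10, "flooded": 0.45}

# bank's output is literally the weighted mix of the value vectors.
output = sum(w * values[tok] for tok, w in weights.items())
print(output)  # [0.47 0.47] -- pulled toward river and flooded
```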
GPT-4, Claude, Gemini, Llama: each is a stack of transformer blocks built around self-attention. The "context window" you read about (128k tokens, 1M tokens) is exactly how many tokens each query can attend to.
AlphaFold 2 used attention over residue-pair embeddings to predict 3D structure from amino-acid sequences. The 2020 CASP14 result went on to earn its creators a share of the 2024 Nobel Prize in Chemistry; attention was the architectural backbone.
Modern diffusion models like DALL·E 3 and Imagen use attention layers inside their denoiser networks, and diffusion transformers (DiT) drop the older U-Net convolutions entirely. Same recipe; different domain.
YouTube's recommendation model uses attention over your watch history to score new candidate videos. Same Q/K/V structure; just videos instead of tokens.
- Attention Is All You Need · paper · Vaswani et al. (2017) · The transformer paper, the one that reorganized AI for the next decade. Read it once even if you don't follow every detail.
- The Illustrated Transformer · essay · Jay Alammar · A widely shared visual explanation of attention and transformers. The diagrams that taught a generation.
- A Mathematical Framework for Transformer Circuits · essay · Anthropic Interpretability · The deep-mechanism view of attention: how QK and OV circuits implement specific behaviors. Required reading if you want to understand what attention learns.
- Attention? Attention! · essay · Lilian Weng · A historical and technical survey of attention variants, from Bahdanau et al.'s 2014 RNN attention to modern multi-head transformer attention.