jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXIII

What the model looks at

A neural network reading "the bat hit the ball" needs to know that bat here means baseball, not the animal. Attention is the mechanism that lets each token gather context from the others — a learned, soft, content-addressed lookup.

The concept

Attention lets each position in a sequence look at every other position and choose what to absorb.

Each token gets three roles, one learned vector each: a query (what am I looking for?), a key (what do I represent?), and a value (what would I contribute?). To compute attention for one query: take its dot product with every key — that's a similarity score — then softmax turns the scores into a list of probabilities that sum to 1. The output is those probabilities used as weights on the values.
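In code, that single-query computation is three lines of linear algebra. A minimal NumPy sketch — the vectors here are random stand-ins for what a trained model would learn:

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy setup: three tokens, each with a key and a value of dimension 4.
rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 4))    # "what do I represent?" — one per token
values = rng.normal(size=(3, 4))  # "what would I contribute?" — one per token
query = rng.normal(size=4)        # "what am I looking for?" — for one token

scores = keys @ query    # dot-product similarity with every key
weights = softmax(scores)  # probabilities over the 3 tokens, sum to 1
output = weights @ values  # the weighted blend of values
```

The output lives in the same space as the values: it is literally a convex combination of them, with the softmax deciding the mixing proportions.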

Written in math: Attention(Q, K, V) = softmax(QKᵀ / √d) V. The √d keeps dot products from growing with dimension — without it, large scores saturate the softmax and flatten its gradients. One operation. One equation. The mechanism that ate machine learning.
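The batched equation maps directly onto a few lines of NumPy. A minimal sketch, with random matrices standing in for learned projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # one blended vector per query

n, d = 5, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(Q, K, V)  # shape (5, 8): every position re-described in context
```

Every row of the softmax matrix sums to 1, so each output row is a weighted average of value rows — the "database lookup" in five lines.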

Why ML cares

Self-attention is the engine of every transformer — and transformers are now the default architecture for language (GPT, Gemini, Claude), code (Copilot), images (ViT, DiT), audio (Whisper), proteins (AlphaFold), and even reinforcement-learning agents.

Before attention, sequence models had to compress an entire sentence into a single bottleneck vector. Attention lets every output position freely query the whole input — context is a database lookup, not a memory squeeze. That's why long documents finally became tractable.

Try this
  1. Click bank in "river bank flooded." The bars show bank attending strongly to river and flooded — context disambiguates word sense.
  2. Switch to All pairs · matrix. Now you see every word's attention pattern at once. Click any row to make that word the query. Function words (the, a) light up rows roughly uniformly; content words make sharper, more selective rows.
  3. Switch to The output · weighted blend. The bar shows what bank actually becomes after attention: a literal mix of the words it attended to. Each segment's width is its softmax weight.
The query word "looks at" the others through learned similarity. Strong matches get high weights; weak matches fade. The query then gets re-described as a weighted blend of everything it attended to.
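The weighted-blend view from step 3 can be reproduced offline. A toy sketch for "river bank flooded" — the 2-d vectors are hand-picked for illustration (real models learn these directions), and one-hot values make the mixing weights easy to read off:

```python
import numpy as np

tokens = ["river", "bank", "flooded"]
K = np.array([[1.0, 0.2],   # key for "river"
              [0.9, 0.4],   # key for "bank"
              [0.8, 0.1]])  # key for "flooded"
V = np.eye(3)               # one-hot values: the output IS the weight vector
q = np.array([1.0, 0.3])    # query vector for "bank"

scores = K @ q / np.sqrt(2)            # scaled similarities
w = np.exp(scores - scores.max())
w /= w.sum()                           # softmax weights
blend = w @ V                          # what "bank" becomes after attention

for t, wi in zip(tokens, w):
    print(f"{t:>8}: {wi:.2f}")
```

Each segment of the bar in the visualization corresponds to one of these weights; its width is exactly the softmax probability.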
Where you've seen this · 4 examples
Every LLM you've used

GPT-4, Claude, Gemini, Llama — every one is a stack of self-attention layers. The "context window" you read about (128k tokens, 1M tokens) is exactly how many tokens each query can attend to.

AlphaFold's protein folding

AlphaFold 2 used attention over residue-pair embeddings to predict 3D structure from amino-acid sequences. The 2020 breakthrough led to a share of the 2024 Nobel Prize in Chemistry; attention was the architectural backbone.

Image generation (DiT)

Modern diffusion models increasingly swap the older convolutional U-Net denoiser for a stack of attention blocks — the diffusion transformer (DiT) architecture behind Stable Diffusion 3 and OpenAI's Sora. Same recipe; different domain.

Recommender systems

Large-scale recommenders score candidate items by attending over your interaction history — Alibaba's Deep Interest Network is the canonical published example. Same Q/K/V structure; just items instead of tokens.

Further reading