jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XV

Convolutional
Networks

Slide a small kernel over an image and a single feature map appears. Stack feature maps, downsample, repeat — and the network discovers edges, textures, parts, then objects.

The concept

A convolution is a small weight matrix (a kernel) that slides across the image, computing a weighted sum at every position.
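The sliding weighted sum can be written out directly. A minimal numpy sketch (the name `conv2d` is illustrative; note that what deep-learning libraries call "convolution" is technically cross-correlation — the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    compute a weighted sum at every position — one feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1., 0.],
                   [0., -1.]])       # tiny 2×2 diagonal-difference kernel
print(conv2d(image, kernel))         # a 3×3 feature map
```

A 4×4 input with a 2×2 kernel yields a 3×3 output: one value per valid kernel position.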

The same kernel is applied everywhere — that's the key insight. A 3×3 horizontal-edge detector trained on the top-left of an image works equally well at the bottom-right. Translation equivariance falls out of the architecture, not the training data: shift the input and the feature map shifts with it, and pooling then adds a measure of invariance on top.
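Weight sharing is easy to verify: shift the input, and the same kernel produces the same response, just shifted. A small sketch using scipy's `correlate2d` (which is what deep-learning "convolution" actually computes):

```python
import numpy as np
from scipy.signal import correlate2d

edge_h = np.array([[ 1.,  1.,  1.],
                   [ 0.,  0.,  0.],
                   [-1., -1., -1.]])   # a horizontal-edge kernel

img = np.zeros((8, 8))
img[2, :] = 1.0                        # bright horizontal line near the top

shifted = np.roll(img, 3, axis=0)      # the same line, 3 rows lower

out  = correlate2d(img, edge_h, mode='valid')
out2 = correlate2d(shifted, edge_h, mode='valid')

# Same kernel, same response — just translated 3 rows down.
assert np.allclose(out2[3:], out[:-3])
```

The kernel never "knows" where the line is; the architecture guarantees the detector works at every position.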

Stack convolutions, intersperse with ReLU and pooling, and the network discovers a hierarchy: pixel-level edges → textures → object parts → whole objects. Every layer's filters are learned, not hand-coded.
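The stacking recipe fits in a few lines. A sketch with random (untrained) kernels, standing in for the learned filters, showing how conv → ReLU → 2×2 max-pool repeatedly shrinks the spatial grid:

```python
import numpy as np
from scipy.signal import correlate2d

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    h, w = x.shape
    h, w = h - h % 2, w - w % 2          # drop any odd edge row/column
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((10, 10))
k1 = rng.standard_normal((3, 3))         # in a real CNN these weights are learned
k2 = rng.standard_normal((3, 3))

x = maxpool2x2(relu(correlate2d(img, k1, mode='valid')))   # 10 → 8 → 4
x = maxpool2x2(relu(correlate2d(x,  k2, mode='valid')))    # 4 → 2 → 1
print(x.shape)   # (1, 1)
```

Each stage sees a larger effective patch of the original image — which is why early layers find edges and later layers can respond to whole object parts.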

Why ML cares

From AlexNet (2012) through ResNet, EfficientNet, and ConvNeXt, CNNs dominated computer vision for over a decade. Today's vision-transformer alternatives become competitive only at scale — and even then, the pure-convolution ConvNeXt v2 stays surprisingly close.

Beyond images, the same idea — a small filter sliding over a structured input — powers speech models (1D conv on audio), graph nets, and even tokenizers in some LLMs. Wherever the data has local structure, convolutions exploit it for free.
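The 1D case is the same weighted sum, one dimension lower. A toy sketch (the `conv1d` name is illustrative, not a library API) of a moving-average filter sliding over an audio-like waveform:

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Slide a 1-D filter over a signal — the same weighted-sum
    idea used in speech front-ends."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i+k], kernel)
                     for i in range(0, len(signal) - k + 1, stride)])

t = np.linspace(0, 1, 16)
signal = np.sin(2 * np.pi * 2 * t)         # toy "audio" waveform
smooth = conv1d(signal, np.ones(4) / 4)    # 4-tap moving-average filter
print(smooth.shape)                        # (13,)
```

Swap the averaging kernel for learned weights and stride over raw samples, and you have the front end of a speech model.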

Try this
  1. Pick the face input and the vertical edges filter. The output highlights the eyes and mouth — vertical transitions between dark and light.
  2. Switch to blur. The image goes soft. Try sharpen on the same input — edges pop. These are the same operations Photoshop's filters apply.
  3. Compare edge_h on the cross vs diag images. A purely horizontal-edge kernel barely fires on a diagonal — the filter is direction-selective.
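The kernels in the widget have standard textbook counterparts — the exact weights used in the demo may differ, but these produce the same qualitative effects, including the direction selectivity in step 3:

```python
import numpy as np
from scipy.signal import correlate2d

edge_v = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])           # Sobel vertical-edge detector
edge_h = edge_v.T                            # its horizontal counterpart
blur    = np.ones((3, 3)) / 9.0              # box blur
sharpen = np.array([[ 0., -1.,  0.],
                    [-1.,  5., -1.],
                    [ 0., -1.,  0.]])

img = np.zeros((6, 6))
img[:, 3:] = 1.0                             # a vertical dark→light edge

print(correlate2d(img, edge_v, mode='valid'))  # fires where the window straddles the edge
print(correlate2d(img, edge_h, mode='valid'))  # all zeros — wrong direction, no response
```

The horizontal kernel sums to zero down every column, so a purely vertical edge produces exactly nothing — the direction selectivity of step 3, in four lines.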
· Left: input image. Middle: result of one convolution + ReLU. Right: 2×2 max-pool. The yellow box on the input is the kernel's current sliding-window position; the lit cell on the output is the value just computed.
CNN architecture · the typical recipe
Where you've seen this · 4 examples
AlexNet, 2012

The model that won the ImageNet challenge by roughly ten percentage points and convinced the field that deep learning was the future. Eight layers, ReLU, dropout, two GPUs — primitive by today's standards but a step-function in performance.

Phone cameras and Photos

Every "scene mode," "portrait blur," and face-finding box on your phone is a convolutional network. So is the auto-tagging in Apple Photos, Google Photos, and every album-cleanup app.

Medical imaging

CNNs detect tumors in mammograms, classify diabetic retinopathy from retinal scans, and segment organs in CT volumes — often matching or exceeding specialist radiologists on benchmark tasks.

Self-driving perception

Lane detection, sign reading, pedestrian segmentation, depth estimation from camera feeds — all CNNs. Tesla's Autopilot stack famously runs a thicket of them.

Further reading