jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XV

Convolutional
Networks

Slide a small kernel over an image and a single feature map appears. Stack feature maps, downsample, repeat — and the network discovers edges, textures, parts, then objects.

The concept

A convolution is a small weight matrix (a kernel) that slides across the image, computing a weighted sum at every position.
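The sliding weighted sum can be written out directly. A minimal numpy sketch (the name `conv2d` is illustrative; note that what deep-learning libraries call "convolution" is technically cross-correlation — the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    compute a weighted sum at every position — one feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1., 0.],
                   [0., -1.]])       # tiny 2×2 diagonal-difference kernel
print(conv2d(image, kernel))         # a 3×3 feature map
```

A 4×4 input with a 2×2 kernel yields a 3×3 output: one value per valid kernel position.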

The same kernel is applied everywhere — that's the key insight. A 3×3 horizontal-edge detector trained on the top-left of an image works equally well at the bottom-right. Translation equivariance falls out of the architecture, not the training data: shift the input and the feature map shifts with it, and pooling then adds a measure of invariance on top.
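Weight sharing is easy to verify: shift the input, and the same kernel produces the same response, just shifted. A small sketch using scipy's `correlate2d` (which is what deep-learning "convolution" actually computes):

```python
import numpy as np
from scipy.signal import correlate2d

edge_h = np.array([[ 1.,  1.,  1.],
                   [ 0.,  0.,  0.],
                   [-1., -1., -1.]])   # a horizontal-edge kernel

img = np.zeros((8, 8))
img[2, :] = 1.0                        # bright horizontal line near the top

shifted = np.roll(img, 3, axis=0)      # the same line, 3 rows lower

out  = correlate2d(img, edge_h, mode='valid')
out2 = correlate2d(shifted, edge_h, mode='valid')

# Same kernel, same response — just translated 3 rows down.
assert np.allclose(out2[3:], out[:-3])
```

The kernel never "knows" where the line is; the architecture guarantees the detector works at every position.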

Stack convolutions, intersperse with ReLU and pooling, and the network discovers a hierarchy: pixel-level edges → textures → object parts → whole objects. Every layer's filters are learned, not hand-coded.
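The stacking recipe fits in a few lines. A sketch with random (untrained) kernels, standing in for the learned filters, showing how conv → ReLU → 2×2 max-pool repeatedly shrinks the spatial grid:

```python
import numpy as np
from scipy.signal import correlate2d

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    h, w = x.shape
    h, w = h - h % 2, w - w % 2          # drop any odd edge row/column
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((10, 10))
k1 = rng.standard_normal((3, 3))         # in a real CNN these weights are learned
k2 = rng.standard_normal((3, 3))

x = maxpool2x2(relu(correlate2d(img, k1, mode='valid')))   # 10 → 8 → 4
x = maxpool2x2(relu(correlate2d(x,  k2, mode='valid')))    # 4 → 2 → 1
print(x.shape)   # (1, 1)
```

Each stage sees a larger effective patch of the original image — which is why early layers find edges and later layers can respond to whole object parts.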

Why ML cares

From AlexNet (2012) through ResNet, EfficientNet, and ConvNeXt, CNNs dominated computer vision for over a decade. Today's vision-transformer alternatives become competitive only at scale — and even then, the pure-convolution ConvNeXt v2 stays surprisingly close.

Beyond images, the same idea — a small filter sliding over a structured input — powers speech models (1D conv on audio), graph nets, and even tokenizers in some LLMs. Wherever the data has local structure, convolutions exploit it for free.
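The 1D case is the same weighted sum, one dimension lower. A toy sketch (the `conv1d` name is illustrative, not a library API) of a moving-average filter sliding over an audio-like waveform:

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Slide a 1-D filter over a signal — the same weighted-sum
    idea used in speech front-ends."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i+k], kernel)
                     for i in range(0, len(signal) - k + 1, stride)])

t = np.linspace(0, 1, 16)
signal = np.sin(2 * np.pi * 2 * t)         # toy "audio" waveform
smooth = conv1d(signal, np.ones(4) / 4)    # 4-tap moving-average filter
print(smooth.shape)                        # (13,)
```

Swap the averaging kernel for learned weights and stride over raw samples, and you have the front end of a speech model.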

Try this
  1. Pick the face input and the vertical edges filter. The output highlights the eyes and mouth — vertical transitions between dark and light.
  2. Switch to blur. The image goes soft. Try sharpen on the same input — edges pop. These are the same operations Photoshop's filters apply.
  3. Compare edge_h on the cross vs diag images. A purely horizontal-edge kernel barely fires on a diagonal — the filter is direction-selective.
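The kernels in the widget have standard textbook counterparts — the exact weights used in the demo may differ, but these produce the same qualitative effects, including the direction selectivity in step 3:

```python
import numpy as np
from scipy.signal import correlate2d

edge_v = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])           # Sobel vertical-edge detector
edge_h = edge_v.T                            # its horizontal counterpart
blur    = np.ones((3, 3)) / 9.0              # box blur
sharpen = np.array([[ 0., -1.,  0.],
                    [-1.,  5., -1.],
                    [ 0., -1.,  0.]])

img = np.zeros((6, 6))
img[:, 3:] = 1.0                             # a vertical dark→light edge

print(correlate2d(img, edge_v, mode='valid'))  # fires where the window straddles the edge
print(correlate2d(img, edge_h, mode='valid'))  # all zeros — wrong direction, no response
```

The horizontal kernel sums to zero down every column, so a purely vertical edge produces exactly nothing — the direction selectivity of step 3, in four lines.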
· Left: input image. Middle: result of one convolution + ReLU. Right: 2×2 max-pool. The yellow box on the input is the kernel's current sliding-window position; the lit cell on the output is the value just computed.
CNN architecture · the typical recipe
Where you've seen this · 4 examples
AlexNet, 2012

The model that won the ImageNet challenge by roughly ten percentage points and convinced the field that deep learning was the future. Eight layers, ReLU, dropout, two GPUs — primitive by today's standards but a step-function in performance.

Phone cameras and Photos

Every "scene mode," "portrait blur," and face-finding box on your phone is a convolutional network. So is the auto-tagging in Apple Photos, Google Photos, and every album-cleanup app.

Medical imaging

CNNs detect tumors in mammograms, classify diabetic retinopathy from retinal scans, and segment organs in CT volumes — often matching or exceeding specialist radiologists on benchmark tasks.

Self-driving perception

Lane detection, sign reading, pedestrian segmentation, depth estimation from camera feeds — all CNNs. Tesla's Autopilot stack famously runs a thicket of them.

Further reading