Convolutional Networks
Slide a small kernel over an image and a single feature map appears. Stack feature maps, downsample, repeat — and the network discovers edges, textures, parts, then objects.
A convolution is a small weight matrix (a kernel) that slides across the image, computing a weighted sum at every position.
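The weighted-sum-at-every-position idea fits in a few lines of plain NumPy. This is an illustrative sketch, not a library implementation; the name `conv2d` and the identity-kernel example are mine (strictly speaking it computes cross-correlation, the convention deep-learning frameworks also use):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, taking a weighted sum at each position ("valid" mode)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0]], dtype=float)  # identity kernel: copies the center pixel
result = conv2d(image, kernel)
print(result)  # the 2x2 interior of the image: [[5, 6], [9, 10]]
```

With a 3×3 kernel and no padding, a 4×4 input shrinks to 2×2; real networks usually pad the borders to keep the spatial size.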
The same kernel is applied everywhere: that's the key insight. A 3×3 horizontal-edge detector trained on the top-left of an image works equally well at the bottom-right. Translation equivariance falls out of the architecture, not the training data: shift the input and the feature map shifts with it (pooling then adds a degree of invariance).
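Weight sharing can be checked directly: shift the input and the filter's response shifts by the same amount. A minimal sketch in plain NumPy (the `conv2d` helper and `edge_h` kernel here are illustrative, not from any library):

```python
import numpy as np

def conv2d(image, kernel):
    # Valid-mode sliding weighted sum.
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Horizontal-edge kernel: responds to bright-above / dark-below transitions.
edge_h = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

img = np.zeros((8, 8))
img[2, :] = 1.0                      # a bright horizontal line at row 2
shifted = np.roll(img, 3, axis=0)    # the same line, 3 rows lower

# Shifting the input shifts the response identically: translation equivariance.
same = np.allclose(conv2d(shifted, edge_h),
                   np.roll(conv2d(img, edge_h), 3, axis=0))
print(same)  # True
```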
Stack convolutions, intersperse with ReLU and pooling, and the network discovers a hierarchy: pixel-level edges → textures → object parts → whole objects. Every layer's filters are learned, not hand-coded.
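One conv → ReLU → pool stage can be sketched in NumPy to show how each stage transforms the feature map. The helper names (`conv2d`, `relu`, `max_pool`) are mine, chosen to mirror the standard building blocks:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid-mode sliding weighted sum.
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    # Elementwise nonlinearity: keep positive responses, zero the rest.
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    # Downsample by taking the max over non-overlapping size x size windows.
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

edge_h = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

img = np.random.default_rng(0).random((16, 16))
feat = max_pool(relu(conv2d(img, edge_h)))   # one conv -> ReLU -> pool stage
print(img.shape, "->", feat.shape)           # (16, 16) -> (7, 7)
```

Each repetition shrinks the spatial map and widens the effective receptive field, which is what lets later layers respond to parts and objects rather than pixels.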
From AlexNet (2012) through ResNet, EfficientNet, and ConvNeXt, CNNs dominated computer vision for over a decade. Today's vision-transformer alternatives pull ahead only at scale, and even then ConvNeXt v2, built purely from convolutions, stays surprisingly close.
Beyond images, the same idea — a small filter sliding over a structured input — powers speech models (1D conv on audio), graph nets, and even tokenizers in some LLMs. Wherever the data has local structure, convolutions exploit it for free.
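The 1D case is the same sliding-filter idea on a waveform. A toy sketch with NumPy's built-in `np.convolve` (the step signal and difference filter are illustrative choices):

```python
import numpy as np

# A two-tap difference filter fires wherever the signal jumps.
signal = np.concatenate([np.zeros(50), np.ones(50)])  # a step at sample 50
kernel = np.array([1.0, -1.0])                        # first-difference filter
response = np.convolve(signal, kernel, mode="valid")
print(int(np.argmax(response)))  # 49: the largest response sits right at the step
```

Speech models stack many such 1D filters (with learned weights) over raw audio or spectrogram frames, exactly as 2D filters stack over pixels.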
- Pick the face input and the vertical edges filter. The output highlights the eyes and mouth — vertical transitions between dark and light.
- Switch to blur. The image goes soft. Try sharpen on the same input — edges pop. These are the same operations Photoshop's filters apply.
- Compare edge_h on the cross vs diag images. A purely horizontal-edge kernel barely fires on a diagonal — the filter is direction-selective.
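The direction-selectivity in the last bullet can be verified numerically. A minimal sketch in plain NumPy, where `edge_h` is my stand-in for the demo's horizontal-edge kernel and the bar/diagonal images stand in for the cross and diag inputs:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid-mode sliding weighted sum.
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge_h = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

horiz = np.zeros((7, 7))
horiz[3, :] = 1.0        # a horizontal bar: exactly what edge_h looks for
diag = np.eye(7)         # a diagonal line: wrong orientation for edge_h

h_peak = np.abs(conv2d(horiz, edge_h)).max()
d_peak = np.abs(conv2d(diag, edge_h)).max()
print(h_peak, d_peak)    # 3.0 1.0: the horizontal bar drives a 3x stronger response
```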
The model that cut ImageNet's top-5 error by roughly 10 percentage points over the runner-up and convinced the field that deep learning was the future. Eight layers, ReLU, dropout, two GPUs: primitive by today's standards, but a step-function in performance.
Every "scene mode," "portrait blur," and face-finding box on your phone is a convolutional network. So is the auto-tagging in Apple Photos, Google Photos, and every album-cleanup app.
CNNs detect tumors in mammograms, classify diabetic retinopathy from retinal scans, and segment organs in CT volumes, often matching or exceeding specialist radiologists on benchmark tasks.
Lane detection, sign reading, pedestrian segmentation, depth estimation from camera feeds — all CNNs. Tesla's Autopilot stack famously runs a thicket of them.
- CS231n — Convolutional Neural Networks course notes Karpathy/Li · The reference notes for understanding CNNs end-to-end. Covers the math, the architectural choices, and the practical tricks.
- Feature Visualization interactive essay Olah, Mordvintsev, Schubert (Distill) · What the filters in a trained CNN actually look at, layer by layer. The clearest visual evidence of the hierarchy described above.
- Deep Residual Learning for Image Recognition paper He et al. (2016) · ResNet. The skip-connection idea that made networks of 100+ layers trainable. Won ImageNet 2015 by a large margin.
- A ConvNet for the 2020s paper Liu et al. (2022) · ConvNeXt. The paper that showed pure CNNs (when modernized) can still match transformers on vision benchmarks. The conv idea isn't dead — it's just stylish.