An Image Classifier, end to end
Pixels in, label out. Follow the signal through conv blocks, pooling, a flatten, and a final softmax — the whole assembly that turns a 32×32 grid of brightnesses into "this is a circle, 87% sure."
An image classifier is a pipeline: image → feature maps → pooled features → flattened vector → class probabilities.
Each conv block applies a few learned filters and a ReLU; pooling shrinks the spatial dimensions while keeping the strongest signals. After two or three blocks, the original 32×32 image becomes an 8×8 stack of high-level feature maps.
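One conv block can be sketched in a few lines of NumPy. This is a toy illustration, not the page's actual demo: the 3×3 edge filter and the random input are made up, and a real network learns its filter values during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of one single-channel image with one filter."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Zero out negative responses."""
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling with stride 2 (assumes even spatial dims)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.random.rand(32, 32)           # toy 32x32 grayscale input
edge = np.array([[1., 0., -1.]] * 3)   # hypothetical vertical-edge filter
fmap = maxpool2(relu(conv2d(img, edge)))
print(fmap.shape)                      # (15, 15): 32 -> 30 after a 3x3 valid conv, then halved
```

Stacking two or three of these blocks (each with several filters instead of one) is what shrinks the 32×32 input down to the small stack of feature maps described above.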
Those feature maps get flattened into a long vector and fed to a small MLP that produces logits — one per class. Softmax turns logits into probabilities. The class with the highest probability is the prediction.
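The classification head can be sketched the same way. The sizes here are illustrative (a 4×8×8 feature stack, 3 classes) and the weights are random rather than trained, but the flatten → MLP → softmax flow is the one described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Convert logits to probabilities; subtract the max for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy head: a 4x8x8 stack of feature maps -> 3 classes (made-up sizes).
features = rng.random((4, 8, 8))
x = features.ravel()                       # flatten: 256-dim vector
W1, b1 = rng.standard_normal((64, 256)) * 0.05, np.zeros(64)
W2, b2 = rng.standard_normal((3, 64)) * 0.05, np.zeros(3)

hidden = np.maximum(W1 @ x + b1, 0.0)      # hidden layer with ReLU
logits = W2 @ hidden + b2                  # one logit per class
probs = softmax(logits)
pred = int(np.argmax(probs))               # predicted class index
```

Note that softmax only rescales the logits into a distribution that sums to 1; the argmax, and hence the prediction, is unchanged by it.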
This is the simplest end-to-end architecture in computer vision. AlexNet, VGG, ResNet, and EfficientNet are all scaled-up versions of it: more layers, more channels, plus additions like batch norm and residual connections.
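Of those additions, the residual connection is the simplest to state: instead of replacing its input, a block adds its output to it. A one-line sketch (the transform here is a made-up stand-in for a block's conv layers):

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: the block computes f(x) but outputs x + f(x)."""
    return x + f(x)

x = np.ones(4)
y = residual_block(x, lambda v: 0.5 * v)  # toy transform standing in for conv layers
print(y)  # [1.5 1.5 1.5 1.5]
```

Because the identity path is always present, gradients can flow straight through even very deep stacks, which is what lets ResNets scale to hundreds of layers.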
Even today, with vision transformers and diffusion models in the headlines, the convolutional classifier is the default reach for any "does this image contain X?" task that doesn't justify a billion-parameter model.
- Pick the circle input. Hit Replay and watch the signal propagate stage by stage. The probabilities at the end commit to one class.
- Try the cross input — the early conv outputs look very different, but the same architecture handles it. The same learned filters and weights generalize across inputs.
- Switch inputs at the end of a replay. The pipeline runs again with the new image; only the output probabilities change in real time.
Apple Photos' "people," "places," and "categories" features run essentially this pipeline (much deeper, on millions of training images) on every photo in your library — locally on the device.
Apps like Pl@ntNet point a phone camera at a leaf and identify the species or a disease. The model inside is a CNN classifier trained on a few hundred thousand expert-labeled photos.
Manufacturing lines use vision classifiers to spot defects in microchips, glass bottles, paint finishes — anywhere a camera can see and a label exists. Faster and more consistent than human inspectors.
"Click on all the squares with traffic lights" trains exactly this kind of network in the background. Every solve adds another labeled image to a vast corpus that quietly powers Google's vision systems.
- CS231n — Convolutional Neural Networks for Visual Recognition (Karpathy / Li / Yeung) · The Stanford course that teaches everything on this page in detail. Free notes online.
- CNN Explainer (Wang et al., Polo Club) · A web-based interactive that lets you click through every operation in a real CNN, tracing values from input to softmax. The natural next step after this page.
- Very Deep Convolutional Networks (VGG) (Simonyan & Zisserman, 2014) · The classic VGG paper. Pure conv→ReLU→pool stacks; one of the cleanest realizations of the recipe on this page.
- fast.ai — Practical Deep Learning (Jeremy Howard & Rachel Thomas) · Top-down practical course that gets you training real classifiers on real datasets in the first lesson.