An Image Classifier, end to end
Pixels in, label out. Follow the signal through conv blocks, pooling, a flatten, and a final softmax — the whole assembly that turns a 32×32 grid of brightnesses into "this is a circle, 87% sure."
An image classifier is a pipeline: image → feature maps → pooled features → flattened vector → class probabilities.
Each conv block applies a few learned filters and a ReLU; pooling shrinks the spatial dimensions while keeping the strongest signals. After two or three blocks, the original 32×32 image becomes an 8×8 stack of high-level feature maps.
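One conv block can be sketched in a few lines of NumPy. This is a toy illustration, not the page's actual demo: the 3×3 edge filter and the random input are made up, and a real network learns its filter values during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of one single-channel image with one filter."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Zero out negative responses."""
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling with stride 2 (assumes even spatial dims)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.random.rand(32, 32)           # toy 32x32 grayscale input
edge = np.array([[1., 0., -1.]] * 3)   # hypothetical vertical-edge filter
fmap = maxpool2(relu(conv2d(img, edge)))
print(fmap.shape)                      # (15, 15): 32 -> 30 after a 3x3 valid conv, then halved
```

Stacking two or three of these blocks (each with several filters instead of one) is what shrinks the 32×32 input down to the small stack of feature maps described above.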
Those feature maps get flattened into a long vector and fed to a small MLP that produces logits — one per class. Softmax turns logits into probabilities. The class with the highest probability is the prediction.
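The classification head can be sketched the same way. The sizes here are illustrative (a 4×8×8 feature stack, 3 classes) and the weights are random rather than trained, but the flatten → MLP → softmax flow is the one described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Convert logits to probabilities; subtract the max for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy head: a 4x8x8 stack of feature maps -> 3 classes (made-up sizes).
features = rng.random((4, 8, 8))
x = features.ravel()                       # flatten: 256-dim vector
W1, b1 = rng.standard_normal((64, 256)) * 0.05, np.zeros(64)
W2, b2 = rng.standard_normal((3, 64)) * 0.05, np.zeros(3)

hidden = np.maximum(W1 @ x + b1, 0.0)      # hidden layer with ReLU
logits = W2 @ hidden + b2                  # one logit per class
probs = softmax(logits)
pred = int(np.argmax(probs))               # predicted class index
```

Note that softmax only rescales the logits into a distribution that sums to 1; the argmax, and hence the prediction, is unchanged by it.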
This is the simplest end-to-end architecture in computer vision. AlexNet, VGG, ResNet, and EfficientNet are all scaled-up versions of it: more layers, more channels, plus additions like batch norm and residual connections.
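Of those additions, the residual connection is the simplest to state: instead of replacing its input, a block adds its output to it. A one-line sketch (the transform here is a made-up stand-in for a block's conv layers):

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: the block computes f(x) but outputs x + f(x)."""
    return x + f(x)

x = np.ones(4)
y = residual_block(x, lambda v: 0.5 * v)  # toy transform standing in for conv layers
print(y)  # [1.5 1.5 1.5 1.5]
```

Because the identity path is always present, gradients can flow straight through even very deep stacks, which is what lets ResNets scale to hundreds of layers.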
Even today, with vision transformers and diffusion models in the headlines, the convolutional classifier is the default reach for any "does this image contain X?" task that doesn't justify a billion-parameter model.
- Pick the circle input. Hit Replay and watch the signal propagate stage by stage. The probabilities at the end commit to one class.
- Try the cross input — the early conv outputs look very different, but the same architecture handles it. The same learned filters and weights generalize across inputs.
- Switch inputs at the end of a replay. The pipeline runs again with the new image; only the output probabilities change in real time.
Apple Photos' "people," "places," and "categories" features run essentially this pipeline (much deeper, on millions of training images) on every photo in your library — locally on the device.
Apps like Pl@ntNet point a phone camera at a leaf and identify the species or a disease. The model inside is a CNN classifier trained on a few hundred thousand expert-labeled photos.
Manufacturing lines use vision classifiers to spot defects in microchips, glass bottles, paint finishes — anywhere a camera can see and a label exists. Faster and more consistent than human inspectors.
"Click on all the squares with traffic lights" trains exactly this kind of network in the background. Every solve adds another labeled image to a vast corpus that quietly powers Google's vision systems.
- CS231n — Convolutional Neural Networks for Visual Recognition (Karpathy / Li / Yeung) · The Stanford course that teaches everything on this page in detail. Free notes online.
- CNN Explainer (Wang et al., Polo Club) · A web-based interactive that lets you click through every operation in a real CNN, tracing values from input to softmax. The natural next step after this page.
- Very Deep Convolutional Networks (VGG) (Simonyan & Zisserman, 2014) · The classic VGG paper. Pure conv→ReLU→pool stacks; one of the cleanest realizations of the recipe on this page.
- fast.ai — Practical Deep Learning (Jeremy Howard & Rachel Thomas) · Top-down practical course that gets you training real classifiers on real datasets in the first lesson.