Loss Landscapes
Training is descent through a high-dimensional terrain. We can't draw the real surface — millions of weights — but a 2D slice teaches the intuition. Every minimum is a hypothesis the network might settle into.
A loss landscape is the surface defined by the loss function as the network's weights vary. Training is the act of finding a low point on it.
For a 2-weight model, the landscape is a surface over a 2D plane of weights — like the terrain you see on the right. For a real network with millions of weights, the landscape lives over millions of dimensions; we visualize 2D slices through it.
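To make the definition concrete, here is a minimal NumPy sketch of the 2-weight case: a toy linear model y ≈ w1·x1 + w2·x2 with a few made-up data points (everything here is invented purely for illustration). Evaluating the mean-squared-error loss on a grid of (w1, w2) values gives exactly the surface a contour plot or 3D terrain would draw.

```python
import numpy as np

# Toy data for a 2-weight linear model: y ≈ w1*x1 + w2*x2
# (all values invented purely for illustration).
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.5, 1.5], [1.5, 1.0]])
y = np.array([3.0, 2.0, 2.5, 2.0])

def loss(w1, w2):
    """Mean-squared error of the 2-weight model at (w1, w2)."""
    pred = X @ np.array([w1, w2])
    return np.mean((pred - y) ** 2)

# Evaluating the loss on a grid of weight values is the loss landscape
# for this model: the surface a contour plot or 3D terrain would render.
w1s = np.linspace(-2.0, 4.0, 101)
w2s = np.linspace(-2.0, 4.0, 101)
Z = np.array([[loss(a, b) for a in w1s] for b in w2s])

print("lowest loss found on the grid:", Z.min())
```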
The geometry decides how easy training is. A clean bowl is solvable from anywhere. A long ravine forces zig-zags. Saddles stall progress. Rough terrain has many shallow minima, each a different network behavior.
Almost every standard tool of modern deep-learning optimization — momentum, Adam, learning-rate schedules, batch normalization — exists to work around some pathology of the loss landscape: ravines, plateaus, saddles, ill-conditioning. Knowing the shape tells you which fix to reach for.
In high dimensions, the geometry gets weird and counter-intuitive. Empirical work (Dauphin et al., Goodfellow et al.) suggests local minima are rare; saddle points vastly outnumber them and are where networks actually slow down. The pictures here are a 2D peek at that far-higher-dimensional reality.
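To see the saddle problem in miniature, here is a toy sketch (my own example, not taken from the cited papers) of plain gradient descent on f(x, y) = x² − y², which has a saddle at the origin: the gradient shrinks toward zero long before the iterate has escaped along the descending direction.

```python
import numpy as np

def f(w):
    # Classic saddle: curves up along x, down along y, saddle point at (0, 0).
    x, y = w
    return x**2 - y**2

def grad(w):
    x, y = w
    return np.array([2 * x, -2 * y])

# Start near the saddle, with only a tiny offset along the escape direction.
w = np.array([1.0, 1e-6])
lr = 0.1
for step in range(61):
    if step % 10 == 0:
        print(f"step {step:3d}  |grad| = {np.linalg.norm(grad(w)):.1e}  f = {f(w):+.4f}")
    w = w - lr * grad(w)
# |grad| decays toward zero over the first few dozen steps and stays tiny for
# a while before the y-component grows enough to escape: the saddle stall.
```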
- Pick Convex bowl and hit Descend. The marble walks straight to the minimum. Crank η way up — watch it overshoot and oscillate, then diverge.
- Switch to Narrow ravine. Vanilla gradient descent zigzags miserably down the long axis. This is the pathology that momentum exists to solve (see the sketch after this list).
- Try Two minima. Click on different starting points around the map. Where you start determines which basin you fall into — initialization matters.
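Here is a rough NumPy sketch of the first two experiments, using hand-picked toy quadratics rather than the actual surfaces in the demo: gradient descent with and without heavy-ball momentum on a narrow ravine, plus a too-large learning rate on a convex bowl.

```python
import numpy as np

def descend(grad, w0, lr, steps, beta=0.0):
    """Plain gradient descent (beta=0) or heavy-ball momentum (beta>0)."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

# Narrow ravine: loss = 0.5*(x^2 + 100*y^2), steep across, shallow along.
ravine_grad = lambda w: np.array([w[0], 100.0 * w[1]])
start = [10.0, 1.0]
print("vanilla GD :", descend(ravine_grad, start, lr=0.019, steps=200))
print("momentum   :", descend(ravine_grad, start, lr=0.019, steps=200, beta=0.9))

# Convex bowl: loss = 0.5*(x^2 + y^2).  Any lr above 2/curvature = 2 makes
# each step overshoot the minimum by more than it gained, so iterates diverge.
bowl_grad = lambda w: np.asarray(w, dtype=float)
print("bowl, lr=2.5:", descend(bowl_grad, [1.0, 1.0], lr=2.5, steps=20))
```

With these numbers, plain descent is still measurably off along the ravine's shallow axis after 200 steps while momentum lands orders of magnitude closer, and on the bowl the oversized learning rate makes the iterates blow up.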
Two networks with the same architecture but different starting weights can converge to wildly different solutions — different basins of the loss landscape. Initialization schemes (Xavier, He) are essentially "drop the marble in a good neighborhood."
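The scaling rules behind those schemes are short enough to sketch. Below is a NumPy version of the common fan-in/fan-out formulations; framework defaults differ in small details, so treat this as an approximation of the idea rather than any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Glorot/Xavier: keep activation variance roughly constant layer to layer."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    """He/Kaiming: the same idea, corrected for ReLU zeroing half the activations."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W1 = xavier_uniform(784, 256)   # e.g. a tanh layer
W2 = he_normal(256, 10)         # e.g. a ReLU layer
print(W1.std(), W2.std())
```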
2018 research (Garipov et al.) showed that distinct minima of large networks are often connected by low-loss paths — the landscape isn't isolated valleys but a network of them. This is unique to high dimensions; in 2D the picture lies.
Sharp minima (steep walls) tend to generalize worse than flat ones (gentle slopes). Methods like SAM (sharpness-aware minimization) explicitly bias gradient descent toward flatter basins — and consistently improve test accuracy.
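The core SAM update is a two-step gradient computation; here is a stripped-down sketch on a toy one-dimensional loss (the real method operates per minibatch inside a standard optimizer, with details this omits).

```python
import numpy as np

def sam_step(w, grad_fn, lr, rho):
    """One sharpness-aware step: climb to the (approximately) worst nearby
    point within radius rho, then update the original weights using the
    gradient measured there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent direction, length rho
    g_sharp = grad_fn(w + eps)                    # gradient at the perturbed point
    return w - lr * g_sharp

# Toy 1-D example: a sharp quadratic valley, loss = 25 * x^2.
grad_fn = lambda w: 50.0 * w
w = np.array([1.0])
for _ in range(50):
    w = sam_step(w, grad_fn, lr=0.01, rho=0.05)
print(w)  # hovers in a small neighborhood of the minimum instead of
          # settling exactly at the bottom of the sharp valley
```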
Hao Li and colleagues' paper Visualizing the Loss Landscape of Neural Nets (2018) rendered these 2D slices for ResNets with and without skip connections. The skip-connection version is dramatically smoother — a striking illustration of why ResNets train so much better than plain deep CNNs.
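The slicing recipe behind plots like that one can be sketched in a few lines: pick two random directions in weight space and evaluate the loss on a grid around the trained weights. This is a simplification; the paper also normalizes the directions filter by filter, which is skipped here, and the stand-in quadratic "network" below is purely illustrative.

```python
import numpy as np

def loss_slice(loss_fn, w_star, extent=1.0, resolution=25, seed=0):
    """Evaluate loss_fn on a 2D slice through w_star along two random directions.

    loss_fn: maps a flat weight vector to a scalar loss.
    w_star:  trained weights (flat vector) at the center of the plot.
    """
    rng = np.random.default_rng(seed)
    d1 = rng.normal(size=w_star.shape)
    d2 = rng.normal(size=w_star.shape)
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)

    alphas = np.linspace(-extent, extent, resolution)
    betas = np.linspace(-extent, extent, resolution)
    grid = np.empty((resolution, resolution))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = loss_fn(w_star + a * d1 + b * d2)
    return alphas, betas, grid   # ready for plt.contourf(alphas, betas, grid.T)

# Stand-in "network": a quadratic loss over 50 weights, minimized at zero.
H = np.diag(np.linspace(0.1, 10.0, 50))
loss_fn = lambda w: 0.5 * w @ H @ w
alphas, betas, grid = loss_slice(loss_fn, np.zeros(50))
print(grid.min(), grid.max())
```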
- Why Momentum Really Works (interactive essay, Gabriel Goh, Distill) · The same kind of contour-and-marble visualization, used to explain exactly why momentum solves the ravine pathology you're seeing on this page.
- Visualizing the Loss Landscape of Neural Nets (paper, Li, Xu, Taylor, Studer, Goldstein, 2018) · The paper with the famous "ResNet vs no skip connections" landscape comparison. Long but image-heavy and worth the click.
- Identifying and attacking the saddle point problem (paper, Dauphin et al., 2014) · The argument that saddles, not local minima, are what slow down training in high dimensions. The result that reframed how the field thinks about non-convexity.
- losslandscape.com (gallery, Javier Ideami) · 3D-rendered loss landscapes, animated. The visceral version of the contour-map view above.