jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XII

Loss Landscapes

Training is descent through a high-dimensional terrain. We can't draw the real surface — millions of weights — but a 2D slice teaches the intuition. Every minimum is a hypothesis the network might settle into.

The concept

A loss landscape is the surface defined by the loss function as the network's weights vary. Training is the act of finding a low point on it.

For a 2-weight model, the landscape is a 2D surface — like the terrain you see on the right. For a real network with millions of weights, it's a surface in millions of dimensions; we visualize 2D slices through that.
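To make that slicing concrete, here's a minimal numpy sketch: pick two random directions in weight space and evaluate the loss on the plane they span. The loss function and sizes below are toy stand-ins, not any real network's surface.

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(w):
        # toy quadratic stand-in for a real network's training loss
        return 0.5 * np.sum(np.arange(1, w.size + 1) * w**2)

    n = 50                      # pretend the network has 50 weights
    w_star = np.zeros(n)        # the trained weights (here, the exact minimum)
    d1, d2 = rng.normal(size=(2, n))
    d1 /= np.linalg.norm(d1)    # unit norm so the two axes are comparable
    d2 /= np.linalg.norm(d2)

    alphas = np.linspace(-1.0, 1.0, 41)
    surface = np.array([[loss(w_star + a * d1 + b * d2) for b in alphas]
                        for a in alphas])
    # contour-plot `surface` and you get the kind of picture shown here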

The geometry decides how easy training is. A clean bowl is solvable from anywhere. A long ravine forces zig-zags. Saddles stall progress. Rough terrain has many shallow minima, each a different network behavior.

Why ML cares

Almost every trick in modern deep-learning optimization (momentum, Adam, learning-rate schedules, batch normalization) is a fix for some pathology of the loss landscape: ravines, plateaus, saddles, ill-conditioning. Knowing the shape tells you which fix to reach for.

In high dimensions, the geometry gets weird and counter-intuitive. Empirical work (Dauphin et al., Goodfellow et al.) suggests that bad local minima are rare; saddle points vastly outnumber minima and are where networks actually slow down. The pictures here are a 2D peek at that million-dimensional reality.

Try this
  1. Pick Convex bowl and hit Descend. The marble walks straight to the minimum. Crank η way up — watch it overshoot and oscillate, then diverge.
  2. Switch to Narrow ravine. Vanilla gradient descent zigzags miserably down the long axis. This is the pathology that momentum exists to solve; the sketch after this list reproduces the overshoot, the zig-zag, and the momentum fix in code.
  3. Try Two minima. Click on different starting points around the map. Where you start determines which basin you fall into — initialization matters.
· Contour lines mark equal-loss elevations. The marble follows the path of steepest descent; the green arrow shows the direction of its next step, scaled by the step size.
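If you'd rather poke at the numbers than the marble, here is a rough numpy sketch of steps 1 and 2. The quadratic below is an assumed stand-in for the demo's ravine, with curvature 10 along one axis and 0.1 along the other.

    import numpy as np

    def grad(w):
        # gradient of the toy ravine 5*w0^2 + 0.05*w1^2
        return np.array([10.0 * w[0], 0.1 * w[1]])

    def descend(eta, beta=0.0, steps=100):
        w = np.array([1.0, 1.0])
        v = np.zeros(2)
        for _ in range(steps):
            v = beta * v + grad(w)    # beta = 0 is vanilla gradient descent
            w = w - eta * v
        return w

    for eta, beta in [(0.15, 0.0), (0.25, 0.0), (0.15, 0.9)]:
        print(f"eta={eta} beta={beta} |w|={np.linalg.norm(descend(eta, beta)):.3g}")
    # eta=0.15 vanilla: bounces across the steep walls while the shallow axis crawls
    # eta=0.25 vanilla: overshoots the steep axis every step and blows up
    # eta=0.15 + momentum: reaches the minimum along both axes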
Where you've seen this · 4 examples
Why deep networks need good initialization

Two networks with the same architecture but different starting weights can converge to wildly different solutions — different basins of the loss landscape. Initialization schemes (Xavier, He) are essentially "drop the marble in a good neighborhood."
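As a hedged sketch of what those schemes actually do (plain numpy, framework-agnostic): both draw random weights, then scale the variance to the layer's fan-in so signals neither explode nor die as depth grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def xavier(n_in, n_out):
        # Glorot/Xavier: variance 2/(n_in + n_out), suited to tanh-like units
        return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), (n_out, n_in))

    def he(n_in, n_out):
        # He: variance 2/n_in, compensating for ReLU zeroing half its inputs
        return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))

    x = rng.normal(size=512)
    for _ in range(20):                       # 20 ReLU layers deep
        x = np.maximum(0.0, he(512, 512) @ x)
    print(np.var(x))   # stays O(1) instead of vanishing or exploding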

Mode connectivity

2018 research (Garipov et al.) showed that distinct minima of large networks are often connected by low-loss paths — the landscape isn't isolated valleys but a network of them. This is unique to high dimensions; in 2D the picture lies.
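A hedged sketch of the measurement itself, run on a toy 2D double well where the straight line must cross a barrier. (Garipov et al. actually optimize a curve's control point on a real network; this skips that step.)

    import numpy as np

    def path_loss(loss, w_a, w_b, w_mid=None, ts=np.linspace(0, 1, 101)):
        if w_mid is None:   # straight-line interpolation between two minima
            pts = [(1 - t) * w_a + t * w_b for t in ts]
        else:               # quadratic Bezier curve bent through w_mid
            pts = [(1 - t)**2 * w_a + 2 * t * (1 - t) * w_mid + t**2 * w_b
                   for t in ts]
        return np.array([loss(w) for w in pts])

    double_well = lambda w: (w[0]**2 - 1.0)**2 + w[1]**2   # minima at (±1, 0)
    w_a, w_b = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
    print(path_loss(double_well, w_a, w_b).max())   # barrier of height 1.0
    # in 2D no detour helps; in millions of dimensions a bent path often does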

Generalization and flatness

Sharp minima (steep walls) tend to generalize worse than flat ones (gentle slopes). Methods like SAM (sharpness-aware minimization) explicitly bias gradient descent toward flatter basins — and consistently improve test accuracy.
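The core update is simple enough to sketch in a few lines, assuming some gradient oracle grad_fn (real SAM wraps a framework optimizer rather than raw numpy):

    import numpy as np

    def sam_step(w, grad_fn, eta=0.1, rho=0.05):
        g = grad_fn(w)
        # climb to the (approximate) worst point within a radius-rho ball
        eps = rho * g / (np.linalg.norm(g) + 1e-12)
        # then descend using the gradient measured up there, not at w:
        # sharp basins punish this detour, flat basins barely notice it
        return w - eta * grad_fn(w + eps)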

Loss landscapes as art

Hao Li's 2017 paper Visualizing the Loss Landscape of Neural Nets rendered 2D slices of ResNet loss surfaces, with and without skip connections. The skip-connection version is dramatically smoother, a striking illustration of why ResNets train so much better than plain deep CNNs.
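The paper's key trick is filter-wise normalization: scale each filter of the random direction to match the norm of the corresponding trained filter, so slices of differently-scaled networks are comparable. A rough sketch under assumed 2D shapes (real conv filters are 4D):

    import numpy as np

    def filter_normalize(direction, weights):
        # one row per filter; rescale each row of the random direction
        # to the norm of the matching filter in the trained weights
        d = direction.copy()
        for i in range(d.shape[0]):
            d[i] *= np.linalg.norm(weights[i]) / (np.linalg.norm(d[i]) + 1e-12)
        return d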

Further reading