jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXXII

Bias-Variance Tradeoff

A linear fit on quadratic data misses the curve. A 12th-degree polynomial fits every point exactly — and bends wildly between them. Somewhere in the middle, just enough flexibility, sits the model you want.

The concept

Total prediction error decomposes into bias² + variance + irreducible noise.

Bias: how far is the model's average prediction from the truth, across many imagined training datasets? A linear model on curved data has high bias — it can't capture the curve, no matter which dataset you train it on. Underfitting.

Variance: how much do the model's predictions change when the training set changes? A 12th-degree polynomial has high variance — different training samples produce wildly different curves. Overfitting.

The sweet spot is whatever model complexity minimizes the sum. For a fixed dataset size, that's a specific degree, depth, or capacity.
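A minimal numpy sketch of that sweet spot, under assumed settings (true signal sin(πx), 30 training points, noise σ = 0.3, none of which are prescribed by this page): fit each polynomial degree to many resampled training sets and see which degree minimizes the average test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=30, sigma=0.3):
    # Noisy samples of an assumed true signal f(x) = sin(pi * x)
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, sigma, n)
    return x, y

x_test = np.linspace(-1, 1, 200)
f_test = np.sin(np.pi * x_test)

# Average squared error against the true signal, per degree,
# over many independently resampled training sets
degrees = list(range(1, 16))
avg_err = []
for d in degrees:
    errs = []
    for _ in range(100):
        x, y = sample_dataset()
        coeffs = np.polyfit(x, y, d)
        pred = np.polyval(coeffs, x_test)
        errs.append(np.mean((pred - f_test) ** 2))
    avg_err.append(np.mean(errs))

best = min(zip(degrees, avg_err), key=lambda t: t[1])[0]
print(f"degree minimizing average test error: {best}")
```

Degree 1 underfits (high bias) and degree 15 overfits (high variance); the minimizer lands in between, exactly the U-shape plotted below.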

Why ML cares

Bias-variance is the conceptual frame for nearly every model-selection decision: "should I use a deeper network, more features, more regularization?" Each choice slides the dial.

It's also the lens through which the double-descent phenomenon was discovered — overparameterized neural networks (more parameters than data points) sometimes show lower test error than smaller models, breaking the classical U-shape. The story is more nuanced than the textbook curve, but the textbook curve is still the right starting point.

Try this
  1. Drag the polynomial degree slider from 1 to 15. The dashed vertical line in the bottom plot moves with you — bias² (ink) shrinks, variance (oxblood) grows. Watch the connector lines between the spread of fits (top) and the variance reading (bottom).
  2. Click Resample several times. Watch how degree-1 fits stay similar while degree-15 fits look completely different — that bundle of wiggly curves is variance.
  3. Toggle Show noise. The data splits into the smooth true signal sin(πx) and the vertical sticks of added noise. The σ² readout in the gloss tells you the irreducible piece — no model, however clever, can beat it.
The decomposition, in plain English
Error(x) = bias²(x) + variance(x) + σ²
  • bias² · how far the model's average prediction (across many imagined training datasets) is from the truth. Squared so positive.
  • variance · how much the model's prediction wobbles when you swap in a different training set of the same size.
  • σ² (sigma-squared) · irreducible noise in the data — the labels themselves are noisy. No model can beat this; it sets the error floor.
  • Error(x) · expected squared error at a point x, averaged over all the training datasets you might have drawn.
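The identity can be checked numerically. A Monte Carlo sketch, with assumed choices (degree-3 fit, evaluation point x = 0.5, σ = 0.3): refit on thousands of fresh training sets, then compare bias² + variance + σ² against the directly measured expected squared error.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3              # noise level, so irreducible error is sigma**2
x0 = 0.5                 # evaluate the decomposition at this single point
f = lambda x: np.sin(np.pi * x)

def fit_once(degree=3, n=30):
    # Draw a fresh training set and return this model's prediction at x0
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_once() for _ in range(2000)])

bias_sq = (preds.mean() - f(x0)) ** 2   # squared gap of the average prediction
variance = preds.var()                  # wobble across training sets

# Directly measured expected squared error against fresh noisy labels at x0
noisy_targets = f(x0) + rng.normal(0, sigma, 2000)
mse = np.mean((preds - noisy_targets) ** 2)

print(bias_sq + variance + sigma**2, mse)  # the two should roughly agree
```

The two printed numbers differ only by Monte Carlo noise, which is the decomposition doing its job.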
Top: a polynomial of the chosen degree fit to noisy samples of f(x) = sin(πx). The 10 faded curves are fits to different random training sets — their spread is the variance, made visible. Bottom: bias², variance, and total error vs polynomial degree, with a dashed vertical line at your current degree. Connector lines link the spread above to the variance reading below.
Where you've seen this · 4 examples
Choosing a model

"Linear regression vs gradient-boosted trees vs deep neural net" is a bias-variance question. Linear regression is high-bias; deep networks are flexible (low-bias) but variance-prone unless regularized.

Cross-validation

Held-out error approximates the bias² + variance + noise sum on unseen data. The standard model-selection workflow, trying many model variants and keeping whichever scores best on held-out data, is this curve made empirical.
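A sketch of that workflow in plain numpy, with assumed settings (one fixed dataset of 60 noisy samples of sin(πx), 5 folds): k-fold cross-validation scores each polynomial degree and picks the one with the lowest held-out error, without ever seeing the true signal.

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed noisy dataset -- in practice this is all you have
n = 60
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)

def cv_mse(degree, k=5):
    # k-fold cross-validation: average squared error on held-out folds
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[test])
        errs.append(np.mean((pred - y[test]) ** 2))
    return np.mean(errs)

scores = {d: cv_mse(d) for d in range(1, 16)}
best = min(scores, key=scores.get)
print(f"degree chosen by 5-fold CV: {best}")
```

The CV score traces the same U-shape as the true error curve, which is why the procedure works: the held-out folds stand in for the unseen data the decomposition averages over.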

Ensemble methods

Random forests average many high-variance trees to drive variance down without raising bias. Bagging is, formally, a variance-reduction technique.

Double descent in deep learning

Empirical work (Belkin et al., 2019) showed that classical bias-variance curves break for very large neural networks. Generalization sometimes improves past the interpolation threshold — a still-debated frontier.

Further reading