jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXXII

Bias-Variance Tradeoff

A linear fit on quadratic data misses the curve. A 12th-degree polynomial fits every point exactly — and bends wildly between them. Somewhere in the middle, just enough flexibility, sits the model you want.

The concept

Total prediction error decomposes into bias² + variance + irreducible noise.

Bias: how far is the model's average prediction from the truth, across many imagined training datasets? A linear model on curved data has high bias — it can't capture the curve, no matter which dataset you train it on. Underfitting.

Variance: how much do the model's predictions change when the training set changes? A 12th-degree polynomial has high variance — different training samples produce wildly different curves. Overfitting.

The sweet spot is whatever model complexity minimizes the sum. For a fixed dataset size, that's a specific degree, depth, or capacity.
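A minimal numpy sketch of that sweet spot, under assumed settings (true signal sin(πx), 30 training points, noise σ = 0.3, none of which are prescribed by this page): fit each polynomial degree to many resampled training sets and see which degree minimizes the average test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=30, sigma=0.3):
    # Noisy samples of an assumed true signal f(x) = sin(pi * x)
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, sigma, n)
    return x, y

x_test = np.linspace(-1, 1, 200)
f_test = np.sin(np.pi * x_test)

# Average squared error against the true signal, per degree,
# over many independently resampled training sets
degrees = list(range(1, 16))
avg_err = []
for d in degrees:
    errs = []
    for _ in range(100):
        x, y = sample_dataset()
        coeffs = np.polyfit(x, y, d)
        pred = np.polyval(coeffs, x_test)
        errs.append(np.mean((pred - f_test) ** 2))
    avg_err.append(np.mean(errs))

best = min(zip(degrees, avg_err), key=lambda t: t[1])[0]
print(f"degree minimizing average test error: {best}")
```

Degree 1 underfits (high bias) and degree 15 overfits (high variance); the minimizer lands in between, exactly the U-shape plotted below.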

Why ML cares

Bias-variance is the conceptual frame for nearly every model-selection decision: "should I use a deeper network, more features, more regularization?" Each choice slides the dial.

It's also the lens through which the double-descent phenomenon was discovered — overparameterized neural networks (more parameters than data points) sometimes show lower test error than smaller models, breaking the classical U-shape. The story is more nuanced than the textbook curve, but the textbook curve is still the right starting point.

Try this
  1. Drag the polynomial degree slider from 1 to 15. The dashed vertical line in the bottom plot moves with you — bias² (ink) shrinks, variance (oxblood) grows. Watch the connector lines between the spread of fits (top) and the variance reading (bottom).
  2. Click Resample several times. Watch how degree-1 fits stay similar while degree-15 fits look completely different — that bundle of wiggly curves is variance.
  3. Toggle Show noise. The data splits into the smooth true signal sin(πx) and the vertical sticks of added noise. The σ² readout in the gloss tells you the irreducible piece — no model, however clever, can beat it.
The decomposition, in plain English
Error(x) = bias²(x) + variance(x) + σ²
  • bias² · how far the model's average prediction (across many imagined training datasets) is from the truth. Squared so positive.
  • variance · how much the model's prediction wobbles when you swap in a different training set of the same size.
  • σ² (sigma-squared) · irreducible noise in the data — the labels themselves are noisy. No model can beat this; it sets the error floor.
  • Error(x) · expected squared error at a point x, averaged over all the training datasets you might have drawn.
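The identity can be checked numerically. A Monte Carlo sketch, with assumed choices (degree-3 fit, evaluation point x = 0.5, σ = 0.3): refit on thousands of fresh training sets, then compare bias² + variance + σ² against the directly measured expected squared error.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3              # noise level, so irreducible error is sigma**2
x0 = 0.5                 # evaluate the decomposition at this single point
f = lambda x: np.sin(np.pi * x)

def fit_once(degree=3, n=30):
    # Draw a fresh training set and return this model's prediction at x0
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_once() for _ in range(2000)])

bias_sq = (preds.mean() - f(x0)) ** 2   # squared gap of the average prediction
variance = preds.var()                  # wobble across training sets

# Directly measured expected squared error against fresh noisy labels at x0
noisy_targets = f(x0) + rng.normal(0, sigma, 2000)
mse = np.mean((preds - noisy_targets) ** 2)

print(bias_sq + variance + sigma**2, mse)  # the two should roughly agree
```

The two printed numbers differ only by Monte Carlo noise, which is the decomposition doing its job.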
Top: a polynomial of the chosen degree fit to noisy samples of f(x) = sin(πx). The 10 faded curves are fits to different random training sets — their spread is the variance, made visible. Bottom: bias², variance, and total error vs polynomial degree, with a dashed vertical line at your current degree. Connector lines link the spread above to the variance reading below.
Where you've seen this · 4 examples
Choosing a model

"Linear regression vs gradient-boosted trees vs deep neural net" is a bias-variance question. Linear regression is high-bias; deep networks are flexible (low-bias) but variance-prone unless regularized.

Cross-validation

Held-out error approximates the bias² + variance + noise sum on unseen data. The standard model-selection workflow, trying many model variants and keeping whichever scores best on held-out data, is this curve made empirical.
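A sketch of that workflow in plain numpy, with assumed settings (one fixed dataset of 60 noisy samples of sin(πx), 5 folds): k-fold cross-validation scores each polynomial degree and picks the one with the lowest held-out error, without ever seeing the true signal.

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed noisy dataset -- in practice this is all you have
n = 60
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)

def cv_mse(degree, k=5):
    # k-fold cross-validation: average squared error on held-out folds
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[test])
        errs.append(np.mean((pred - y[test]) ** 2))
    return np.mean(errs)

scores = {d: cv_mse(d) for d in range(1, 16)}
best = min(scores, key=scores.get)
print(f"degree chosen by 5-fold CV: {best}")
```

The CV score traces the same U-shape as the true error curve, which is why the procedure works: the held-out folds stand in for the unseen data the decomposition averages over.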

Ensemble methods

Random forests average many high-variance trees to drive variance down without raising bias. Bagging is, formally, a variance-reduction technique.

Double descent in deep learning

Empirical work (Belkin et al., 2019) showed that classical bias-variance curves break for very large neural networks. Generalization sometimes improves past the interpolation threshold — a still-debated frontier.

Further reading