jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXXI

Hyperparameter
Tuning

Every model has knobs — learning rate, weight decay, layer count, dropout. Grid search probes them on a regular lattice; random search throws darts; Bayesian optimization learns where to throw next. Three different theories of how to comb a knob space.

The concept

Hyperparameter tuning is the search for the knob settings that maximize validation accuracy — a black-box optimization in a few-to-many dimensions.

Grid search: try every combination of a regular lattice. Simple; wasteful; embarrassingly parallel.
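
A minimal sketch of what grid search does, assuming a hypothetical score_model function that stands in for a full training run:

  import itertools

  def score_model(lr, wd):
      # placeholder: a real trial would train a model with these knobs
      # and return its validation accuracy
      return 1.0 - (lr - 3e-4) ** 2 - (wd - 0.01) ** 2

  lrs = [1e-4, 3e-4, 1e-3, 3e-3]
  wds = [0.0, 1e-3, 1e-2, 1e-1]

  best_theta, best_score = None, float("-inf")
  for lr, wd in itertools.product(lrs, wds):   # every lattice point: 4 x 4 = 16 trials
      s = score_model(lr, wd)
      if s > best_score:
          best_theta, best_score = (lr, wd), s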

Random search: sample knob settings at random. Counter-intuitively, it often beats grid search: Bergstra & Bengio (2012) showed that a grid of n points tries only a few distinct values along each axis, while n random samples try n distinct values, so along the few dimensions that actually matter random search has a much better chance of finding good values.
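
The same budget spent by random search, under the same hypothetical score_model; the learning rate and weight decay are sampled log-uniformly because good values span orders of magnitude:

  import random

  def score_model(lr, wd):
      # placeholder for a real training run returning validation accuracy
      return 1.0 - (lr - 3e-4) ** 2 - (wd - 0.01) ** 2

  best_theta, best_score = None, float("-inf")
  for _ in range(16):                              # same 16-trial budget as the grid
      lr = 10 ** random.uniform(-4, -2.5)          # log-uniform learning rate
      wd = 10 ** random.uniform(-4, -1)            # log-uniform weight decay
      s = score_model(lr, wd)
      if s > best_score:
          best_theta, best_score = (lr, wd), s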

Bayesian optimization: build a model of the score landscape (typically a Gaussian process) and pick the next point that maximizes "expected improvement" given the model.
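
One way to sketch that loop, using scikit-learn's Gaussian process regressor; the objective, candidate pool, and trial counts are illustrative stand-ins, and a real system would tune the kernel and optimize the acquisition more carefully:

  import numpy as np
  from scipy.stats import norm
  from sklearn.gaussian_process import GaussianProcessRegressor

  def score_model(theta):                      # hypothetical black-box f(θ)
      return float(-np.sum((theta - 0.6) ** 2))

  rng = np.random.default_rng(0)
  X = rng.uniform(0, 1, size=(4, 2))           # a few random trials to seed the model
  y = np.array([score_model(x) for x in X])

  for _ in range(12):                          # remaining trial budget
      gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
      cand = rng.uniform(0, 1, size=(512, 2))  # random pool of candidate settings
      mu, sigma = gp.predict(cand, return_std=True)
      z = (mu - y.max()) / np.maximum(sigma, 1e-9)
      ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
      x_next = cand[np.argmax(ei)]             # most promising knob setting
      X = np.vstack([X, x_next])
      y = np.append(y, score_model(x_next))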

Why ML cares

The difference between a 90%-accuracy model and a 96% one is often hyperparameter tuning. Tuning is what separates published-paper results from production-quality ones.

For deep learning specifically, training a model is expensive: each evaluation in this search costs hours of GPU time. Adaptive methods, from Bayesian optimization (Vizier) to bandit-based early stopping (Hyperband) and hybrids of the two (BOHB), find good settings with far fewer full trials than grid or random search, which is why they are the default in modern AutoML systems.

Try this
  1. Click ▷ Run search. Three panels race side-by-side on the same hidden landscape: grid spreads probes on a lattice, random scatters them, Bayesian starts random then clusters near the best regions found so far.
  2. Watch the budget bar under each panel fill as trials are spent. The bottom plot shows "best score so far" against GPU minutes — the curve that reaches a high score with the fewest minutes wins.
  3. Switch the Score landscape to Bumpy. Multiple peaks penalize grid (its fixed spacing can step right over a narrow peak) and reward Bayesian (it tracks promising regions).

Symbol gloss
  • trial · one full training run with one specific hyperparameter setting (e.g., learning rate = 3e-4, weight decay = 0.01).
  • budget · total GPU-time you can afford. Each trial draws from this pool.
  • θ (theta) · a vector of hyperparameters; each axis of the 2D landscape is one knob.
  • f(θ) · the (unknown) score of training with hyperparameters θ — the heatmap above is what you're trying to climb.
  • EI (expected improvement) · the heuristic Bayesian optimization uses: for each candidate θ, estimate how much better than the current best it might be, then pick the θ that maximizes that estimate (closed form sketched below).
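
Under a Gaussian model of the landscape, EI has a closed form; a tiny sketch with made-up numbers standing in for the model's posterior mean, posterior standard deviation, and the best score so far:

  from scipy.stats import norm

  mu, sigma, f_best = 0.82, 0.05, 0.80     # illustrative posterior mean/std and best score so far
  z = (mu - f_best) / sigma
  ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected gain over f_best

At each step the Bayesian panel probes the candidate θ with the largest EI.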
· Top: three panels race side-by-side on the same hidden score landscape (oxblood = high score, cream = low). Each panel's probes are dots; the encircled point is the best-so-far. The budget bar under each panel shows GPU-time consumed. Bottom: best score vs GPU minutes — the curve that climbs fastest wins.

Where you've seen this · 04 examples
Google Vizier

Google's internal HPO service tunes thousands of production models — search ranking, ad bidding, YouTube recommendations. Bayesian optimization at fleet scale.

AutoML systems

Managed AutoML services (Google Cloud AutoML, Azure Automated ML, AWS SageMaker Autopilot) wrap HPO behind a "give me a good model" API. The orchestration is mostly hyperparameter tuning + architecture search.

Drug discovery and materials

Bayesian optimization is widely used to choose which experiments to run next in chemistry and materials science — each "trial" is a wet-lab synthesis costing thousands of dollars.

Compiler autotuning

Optimizing kernels for GPU performance (TVM, Triton, XLA) means searching huge spaces of tile sizes, unroll factors, and compilation flags. Modern autotuners drive that search with learned cost models and adaptive sampling, the same ideas behind Bayesian-style HPO.

Further reading