jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXXI

Hyperparameter
Tuning

Every model has knobs — learning rate, weight decay, layer count, dropout. Grid search probes them on a regular lattice; random search throws darts; Bayesian optimization learns where to throw next. Three different theories of how to comb a knob space.

The concept

Hyperparameter tuning is the search for the knob settings that maximize validation accuracy — a black-box optimization in a few-to-many dimensions.

Grid search: try every combination of a regular lattice. Simple; wasteful; embarrassingly parallel.
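
A minimal sketch of what grid search does, assuming a hypothetical score_model function that stands in for a full training run:

  import itertools

  def score_model(lr, wd):
      # placeholder: a real trial would train a model with these knobs
      # and return its validation accuracy
      return 1.0 - (lr - 3e-4) ** 2 - (wd - 0.01) ** 2

  lrs = [1e-4, 3e-4, 1e-3, 3e-3]
  wds = [0.0, 1e-3, 1e-2, 1e-1]

  best_theta, best_score = None, float("-inf")
  for lr, wd in itertools.product(lrs, wds):   # every lattice point: 4 x 4 = 16 trials
      s = score_model(lr, wd)
      if s > best_score:
          best_theta, best_score = (lr, wd), s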

Random search: sample knob settings at random. Counter-intuitively, it often beats grid search: Bergstra & Bengio (2012) showed that a grid of n points tries only a few distinct values along each axis, while n random samples try n distinct values, so along the few dimensions that actually matter random search has a much better chance of finding good values.
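
The same budget spent by random search, under the same hypothetical score_model; the learning rate and weight decay are sampled log-uniformly because good values span orders of magnitude:

  import random

  def score_model(lr, wd):
      # placeholder for a real training run returning validation accuracy
      return 1.0 - (lr - 3e-4) ** 2 - (wd - 0.01) ** 2

  best_theta, best_score = None, float("-inf")
  for _ in range(16):                              # same 16-trial budget as the grid
      lr = 10 ** random.uniform(-4, -2.5)          # log-uniform learning rate
      wd = 10 ** random.uniform(-4, -1)            # log-uniform weight decay
      s = score_model(lr, wd)
      if s > best_score:
          best_theta, best_score = (lr, wd), s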

Bayesian optimization: build a model of the score landscape (typically a Gaussian process) and pick the next point that maximizes "expected improvement" given the model.
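
One way to sketch that loop, using scikit-learn's Gaussian process regressor; the objective, candidate pool, and trial counts are illustrative stand-ins, and a real system would tune the kernel and optimize the acquisition more carefully:

  import numpy as np
  from scipy.stats import norm
  from sklearn.gaussian_process import GaussianProcessRegressor

  def score_model(theta):                      # hypothetical black-box f(θ)
      return float(-np.sum((theta - 0.6) ** 2))

  rng = np.random.default_rng(0)
  X = rng.uniform(0, 1, size=(4, 2))           # a few random trials to seed the model
  y = np.array([score_model(x) for x in X])

  for _ in range(12):                          # remaining trial budget
      gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
      cand = rng.uniform(0, 1, size=(512, 2))  # random pool of candidate settings
      mu, sigma = gp.predict(cand, return_std=True)
      z = (mu - y.max()) / np.maximum(sigma, 1e-9)
      ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
      x_next = cand[np.argmax(ei)]             # most promising knob setting
      X = np.vstack([X, x_next])
      y = np.append(y, score_model(x_next))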

Why ML cares

The difference between a 90%-accuracy model and a 96% one is often hyperparameter tuning. Tuning is what separates published-paper results from production-quality ones.

For deep learning specifically, training a model is expensive: each evaluation in this search costs hours of GPU time. Adaptive methods, from Bayesian optimization (Vizier) to bandit-based early stopping (Hyperband) and hybrids of the two (BOHB), find good settings with far fewer full trials than grid or random search, which is why they are the default in modern AutoML systems.

Try this
  1. Click ▷ Run search. Three panels race side-by-side on the same hidden landscape: grid spreads probes on a lattice, random scatters them, Bayesian starts random then clusters near the best regions found so far.
  2. Watch the budget bar under each panel fill as trials are spent. The bottom plot shows "best score so far" against GPU minutes — the curve that reaches a high score with the fewest minutes wins.
  3. Switch the Score landscape to Bumpy. Multiple peaks penalize grid (its fixed spacing can step right over a narrow peak) and reward Bayesian (it tracks promising regions).

Symbol gloss
  • trial · one full training run with one specific hyperparameter setting (e.g., learning rate = 3e-4, weight decay = 0.01).
  • budget · total GPU-time you can afford. Each trial draws from this pool.
  • θ (theta) · a vector of hyperparameters; each axis of the 2D landscape is one knob.
  • f(θ) · the (unknown) score of training with hyperparameters θ — the heatmap above is what you're trying to climb.
  • EI (expected improvement) · the heuristic Bayesian optimization uses: for each candidate θ, estimate how much better than the current best it might be, then pick the θ that maximizes that estimate (closed form sketched below).
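
Under a Gaussian model of the landscape, EI has a closed form; a tiny sketch with made-up numbers standing in for the model's posterior mean, posterior standard deviation, and the best score so far:

  from scipy.stats import norm

  mu, sigma, f_best = 0.82, 0.05, 0.80     # illustrative posterior mean/std and best score so far
  z = (mu - f_best) / sigma
  ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected gain over f_best

At each step the Bayesian panel probes the candidate θ with the largest EI.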
· Top: three panels race side-by-side on the same hidden score landscape (oxblood = high score, cream = low). Each panel's probes are dots; the encircled point is the best-so-far. The budget bar under each panel shows GPU-time consumed. Bottom: best score vs GPU minutes — the curve that climbs fastest wins.

Where you've seen this · 04 examples
Google Vizier

Google's internal HPO service tunes thousands of production models — search ranking, ad bidding, YouTube recommendations. Bayesian optimization at fleet scale.

AutoML systems

Managed AutoML services (Google Cloud AutoML, Azure Automated ML, AWS SageMaker Autopilot) wrap HPO behind a "give me a good model" API. The orchestration is mostly hyperparameter tuning + architecture search.

Drug discovery and materials

Bayesian optimization is widely used to choose which experiments to run next in chemistry and materials science — each "trial" is a wet-lab synthesis costing thousands of dollars.

Compiler autotuning

Optimizing kernels for GPU performance (TVM, Triton, XLA) means searching huge spaces of tile sizes, unroll factors, and compilation flags. Modern autotuners drive that search with learned cost models and adaptive sampling, the same ideas behind Bayesian-style HPO.

Further reading