Hyperparameter Tuning
Every model has knobs — learning rate, weight decay, layer count, dropout. Grid search probes them on a regular lattice; random search throws darts; Bayesian optimization learns where to throw next. Three different theories of how to comb a knob space.
Hyperparameter tuning is the search for the knob settings that maximize validation accuracy — a black-box optimization in a few-to-many dimensions.
Grid search: try every combination of a regular lattice. Simple; wasteful; embarrassingly parallel.
Random search: sample knob settings at random. Counter-intuitively often beats grid search — Bergstra & Bengio (2012) showed that for any "important dimension" random search has a much better chance of finding good values along it.
Bayesian optimization: build a model of the score landscape (typically a Gaussian process) and pick the next point that maximizes "expected improvement" given the model.
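A sketch of that loop under illustrative assumptions: fit a Gaussian process to the trials seen so far, score a set of candidates by expected improvement, run the most promising one, repeat. The one-dimensional knob (think log10 of the learning rate), the Matern kernel, and score() are choices made for this example, not anything the demo prescribes.

```python
# Minimal Bayesian-optimization sketch using a Gaussian process + expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def score(theta):
    # Hypothetical expensive black box f(theta), e.g. validation accuracy vs. log10(lr).
    return np.exp(-(theta + 2.5) ** 2) + 0.3 * np.exp(-(theta + 0.5) ** 2 / 0.1)

def expected_improvement(mu, sigma, best):
    # EI(theta) = E[max(0, f(theta) - best)] under the GP posterior N(mu, sigma^2).
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-4, 0, size=(3, 1))     # a few random warm-up trials
y = score(X).ravel()

candidates = np.linspace(-4, 0, 200).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):                      # 10 more trials, each chosen by EI
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    theta_next = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, theta_next])
    y = np.append(y, score(theta_next))

print("best theta:", X[np.argmax(y)].item(), "best score:", y.max())
```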
The difference between a 90%-accuracy model and a 96% one is often hyperparameter tuning. Tuning is what separates published-paper results from production-quality ones.
For deep learning specifically, each evaluation in this search is a full training run costing hours of GPU time. Model-based and early-stopping methods (Bayesian optimization, Hyperband, BOHB) and services such as Google Vizier find good settings with far fewer trials than grid or random search, which is why they are the default in modern AutoML systems.
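The budget-saving idea behind Hyperband and BOHB is successive halving: give many configurations a small slice of budget, keep the best fraction, and let the survivors train longer. A minimal sketch, assuming a hypothetical train_for(config, epochs) that trains a configuration from scratch for the given number of epochs and returns a validation score (real implementations checkpoint and resume instead of retraining):

```python
# Successive halving over 27 random configs: 27 -> 9 -> 3 -> 1 survivors,
# with each surviving round getting 3x the training budget of the last.
import random

def train_for(config, epochs):
    # Hypothetical stand-in: better (lr, wd) settings improve faster; Gaussian noise
    # stands in for run-to-run variance of a real training job.
    lr, wd = config
    quality = 1.0 - abs(lr - 3e-3) * 100 - abs(wd - 1e-2) * 10
    return quality * (1 - 0.5 ** (epochs / 4)) + random.gauss(0, 0.01)

random.seed(0)
configs = [(10.0 ** random.uniform(-4, -1), 10.0 ** random.uniform(-4, -1))
           for _ in range(27)]

epochs = 1
while len(configs) > 1:
    # Give every surviving config the same small budget, then keep the top third.
    scored = sorted(configs, key=lambda c: train_for(c, epochs), reverse=True)
    configs = scored[: max(1, len(scored) // 3)]
    epochs *= 3

print("winner:", configs[0])
```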
- Click ▷ Run search. Three panels race side-by-side on the same hidden landscape: grid spreads probes on a lattice, random scatters them, Bayesian starts random then clusters near the best regions found so far.
- Watch the budget bar under each panel fill as trials are spent. The bottom plot shows "best score so far" against GPU minutes — the curve that reaches a high score with the fewest minutes wins.
- Switch the Score landscape to Bumpy. Multiple peaks penalize grid (it doesn't notice the bumps) and reward Bayesian (it tracks promising regions).
- trial · one full training run with one specific hyperparameter setting (e.g., learning rate = 3e-4, weight decay = 0.01).
- budget · total GPU-time you can afford. Each trial draws from this pool.
- θ (theta) · a vector of hyperparameters; each axis of the 2D landscape is one knob.
- f(θ) · the (unknown) score of training with hyperparameters θ — the heatmap above is what you're trying to climb.
- EI (expected improvement) · the acquisition heuristic Bayesian optimization uses: for each candidate θ, estimate how much better than the current best it is expected to be under the model, then run the θ that maximizes that estimate (one closed form is given just below).
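For reference, the standard closed form of EI under a Gaussian-process posterior with mean μ(θ) and standard deviation σ(θ), writing f(θ*) for the best score observed so far and Φ, φ for the standard normal CDF and PDF:

```latex
\mathrm{EI}(\theta)
  = \mathbb{E}\left[\max\bigl(0,\; f(\theta) - f(\theta^\ast)\bigr)\right]
  = \bigl(\mu(\theta) - f(\theta^\ast)\bigr)\,\Phi(z) + \sigma(\theta)\,\phi(z),
\qquad
z = \frac{\mu(\theta) - f(\theta^\ast)}{\sigma(\theta)}
```

The first term favors candidates whose predicted mean already beats the best so far; the second favors uncertainty, which is what keeps the search exploring rather than re-sampling the current peak.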
Google's internal HPO service tunes thousands of production models — search ranking, ad bidding, YouTube recommendations. Bayesian optimization at fleet scale.
Cloud AutoML (Google, Azure, AWS) wraps HPO behind a "give me a good model" API. The orchestration is mostly hyperparameter tuning + architecture search.
Bayesian optimization is widely used to choose which experiments to run next in chemistry and materials science — each "trial" is a wet-lab synthesis costing thousands of dollars.
Optimizing kernels for GPU performance (TVM, Triton, XLA) involves searching huge spaces of schedules and compilation flags. Modern auto-tuners attack this with model-guided search in the same spirit as HPO, using learned cost models and sometimes Bayesian optimization to find good configurations.
- Random Search for Hyper-Parameter Optimization (paper) · Bergstra & Bengio (2012) · The paper that argued (with empirical evidence) that random search beats grid search in high dimensions. A small-but-foundational result.
- Exploring Bayesian Optimization (interactive essay) · Agnihotri & Batra, Distill · The clearest visual explanation of how Gaussian processes drive Bayesian optimization, with sliders.
- Hyperband (paper) · Li et al. (2018) · The "smart early stopping" method that became the default in deep-learning HPO. Combines well with Bayesian search; standard now.
- Optuna (library) · The most popular Python HPO library. Tree-structured Parzen Estimator + Hyperband + visualization, all in a few lines of code (a minimal usage sketch follows).
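As a taste of how little code that takes, a minimal Optuna sketch; the objective() below is a toy stand-in for a real training run, and the TPE sampler plus Hyperband pruner are just one reasonable configuration:

```python
# Minimal Optuna usage: a hypothetical objective standing in for a real training run.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)              # log-uniform learning rate
    wd = trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True)
    # Stand-in for "train the model and return validation accuracy":
    # a toy score that peaks near lr = 3e-3, wd = 1e-2.
    return -((lr - 3e-3) ** 2 + (wd - 1e-2) ** 2)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),   # Tree-structured Parzen Estimator
    pruner=optuna.pruners.HyperbandPruner(),      # only prunes if the objective reports
)                                                 # intermediate values via trial.report()
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```

study.best_params then holds the winning knob settings for the toy objective above.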