jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. VI

Probability
Distributions

A distribution assigns probability to outcomes — heads or tails, photon arrivals, tomorrow's temperature. Tweak the parameters and watch the shape morph. Then sample from it and see the histogram converge.

The concept

A probability distribution is a function that assigns a number — a probability — to each possible outcome of a random process.

For continuous outcomes (height, temperature) it's a density — a curve f(x) whose area over each interval is the chance the value falls in that interval. For discrete outcomes (coin flip, count) it's a mass function — a probability P(k) for each outcome, with the P(k) summing to 1.
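Both facts are easy to check numerically. A minimal sketch using only the standard library: the area under the normal density over [−1, 1], and a fair-coin binomial mass function whose values sum to 1.

```python
import math

# Density of the standard normal: probability lives in areas, not point values.
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# P(-1 < X < 1): area under the curve over the interval, via a midpoint Riemann sum.
steps = 10_000
width = 2.0 / steps
area = sum(normal_pdf(-1.0 + (i + 0.5) * width) * width for i in range(steps))
print(round(area, 3))  # 0.683, the "68% within one sigma" rule

# Mass function of a fair coin flipped 3 times: the P(k) values sum to 1.
pmf = {k: math.comb(3, k) * 0.5 ** 3 for k in range(4)}
print(sum(pmf.values()))  # 1.0
```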

Most distributions are described by one or two parameters: the Gaussian's mean and standard deviation, the Bernoulli's success probability p, the Poisson's rate λ. Change the parameters and the curve breathes.

Reading the Greek. μ ("mu") is the mean — the center of the distribution. σ ("sigma") is the standard deviation — the spread. λ ("lambda") is a rate — events per unit time. π ("pi") is the constant 3.14159…; e is Euler's number, 2.71828…. k is a discrete count; n is the number of trials.

Why ML cares

Every probabilistic model in machine learning is, under the hood, a parametrized distribution that the algorithm fits to data. Linear regression assumes Gaussian noise. Logistic regression outputs a Bernoulli. Variational autoencoders decode samples from a Gaussian prior. Diffusion models reverse a Gaussian noising process.
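One concrete reading of "logistic regression outputs a Bernoulli": the sigmoid output is itself the Bernoulli parameter p, and predicting a label is drawing from Bernoulli(p). A tiny sketch, where the weight, bias, and input are invented for illustration:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A pretend "trained" logistic model: w, b, and x are made-up values.
w, b = 1.5, -0.5
x = 2.0
p = sigmoid(w * x + b)  # the model's output IS a Bernoulli parameter
print(round(p, 3))      # 0.924

# Sampling predicted labels = drawing from Bernoulli(p).
labels = [1 if random.random() < p else 0 for _ in range(10_000)]
print(round(sum(labels) / len(labels), 2))  # empirical rate, close to p
```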

The Central Limit Theorem says: sum enough independent random things, each with finite variance, and you get a Gaussian, regardless of where you started. That single fact is why "errors are noisy" almost always means "errors are normally distributed" in practice.
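You can watch this happen with nothing but uniform draws. Summing 30 of them already gives a near-perfect bell; the 30 and the 20,000 repetitions below are arbitrary choices:

```python
import random
import statistics

random.seed(1)

# Sum 30 independent Uniform(0, 1) draws, many times. Each ingredient is flat,
# but the sums pile up into a Gaussian shape.
sums = [sum(random.random() for _ in range(30)) for _ in range(20_000)]

# Theory: mean = 30 * 0.5 = 15, stdev = sqrt(30 * 1/12) ≈ 1.58.
print(round(statistics.mean(sums), 1))   # about 15.0
print(round(statistics.stdev(sums), 1))  # about 1.6
```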

Try this
  1. Open Gaussian. The shaded band is μ ± σ — that's variance made visible. Drag σ wider and the band breathes with it.
  2. Hit +1000. Samples drop in over a second instead of all at once; watch the inset trace pin the running mean to μ. That's the law of large numbers as a heartbeat.
  3. Switch to Binomial. Notice the lollipop stems with gaps between integers — discrete distributions live only on whole numbers, and the visual finally says so.
· Legend: continuous distributions render as a filled curve; discrete distributions as lollipop stems on integer outcomes.
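The running-mean trace from step 2 is easy to reproduce offline. A sketch of the law of large numbers with stdlib `random` (μ = 0, σ = 1 picked arbitrarily):

```python
import random

random.seed(42)
mu, sigma = 0.0, 1.0
samples = [random.gauss(mu, sigma) for _ in range(10_000)]

# The running mean pins itself to mu as n grows: the law of large numbers.
for n in (10, 100, 1_000, 10_000):
    print(n, round(sum(samples[:n]) / n, 3))
```

The early means wander; by n = 10,000 the estimate sits within a few hundredths of μ, which is exactly what the inset trace animates.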
Where you've seen this · 4 examples
A/B tests at every tech company

When Google or Meta rolls out a feature to 1% of users, conversion counts in each bucket follow a binomial. The "is the new design better?" question is answered by comparing two binomials — same math you can play with above by sliding p.
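Here's that comparison as a sketch, with invented numbers: 20,000 users per bucket, control converting at 10%, variant at 11%, and a pooled two-proportion z-score to ask whether the gap exceeds binomial noise.

```python
import math
import random

random.seed(7)

# Hypothetical experiment; every number here is illustrative.
n = 20_000
control = sum(random.random() < 0.10 for _ in range(n))  # Binomial(n, 0.10) draw
variant = sum(random.random() < 0.11 for _ in range(n))  # Binomial(n, 0.11) draw

# Pooled two-proportion z-score: observed rate difference in units of
# its standard error under "no difference".
p_pool = (control + variant) / (2 * n)
se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
z = (variant / n - control / n) / se
print(round(control / n, 3), round(variant / n, 3), round(z, 1))
```

A z-score around 2 or above is the usual "the new design is probably better" threshold; with a true 1-point lift and this sample size, the test typically clears it.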

Web servers and queue capacity

Requests per second to a server are modeled as Poisson; wait times between them are exponential. AWS auto-scaling rules, Twilio rate limits, and SLO budgets are all built on these two distributions.
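The two distributions are two views of the same process: exponential gaps between arrivals imply Poisson counts per window. A minimal queue simulation (the 50 requests/second rate is invented):

```python
import random
import statistics

random.seed(3)
rate = 50.0  # requests per second, an illustrative figure

# Exponential gaps between arrivals...
gaps = [random.expovariate(rate) for _ in range(200_000)]
print(round(statistics.mean(gaps) * 1000, 1))  # mean gap: about 1000/rate = 20 ms

# ...imply Poisson counts per 1-second window.
counts, t, window, c = [], 0.0, 1.0, 0
for g in gaps:
    t += g
    if t < window:
        c += 1
    else:
        counts.append(c)
        window += 1.0
        c = 1
print(round(statistics.mean(counts), 1))      # about rate = 50
print(round(statistics.variance(counts), 1))  # Poisson signature: variance tracks the mean
```

That variance-equals-mean signature is what capacity planning leans on: a server sized for exactly the average rate is overloaded about half the time.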

Diffusion image models

Stable Diffusion, Imagen, and DALL·E start from pure Gaussian noise and denoise in steps. The forward process is "add a little Gaussian"; the reverse process — what a neural net learns — is "subtract the right Gaussian."
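The forward half fits in a few lines. A one-dimensional sketch with a flat noise schedule (the β values and step count are illustrative, not any particular paper's schedule):

```python
import math
import random
import statistics

random.seed(0)

def forward(x0, betas):
    # Each step keeps sqrt(1 - beta) of the signal and mixes in
    # sqrt(beta) worth of fresh Gaussian noise.
    x = x0
    for beta in betas:
        x = math.sqrt(1 - beta) * x + math.sqrt(beta) * random.gauss(0, 1)
    return x

betas = [0.02] * 500  # a flat, made-up schedule
xs = [forward(1.0, betas) for _ in range(5_000)]

# The starting value 1.0 is washed out; what remains is standard Gaussian.
print(round(statistics.mean(xs), 1))   # about 0
print(round(statistics.stdev(xs), 1))  # about 1
```

That end state is the whole trick: because every run lands on the same known distribution, a network can be trained to walk the steps backward from pure noise.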

Insurance and risk

Catastrophe pricing models treat claim frequencies as Poisson and claim sizes as log-normal or exponential. Every "1-in-100-year flood" calculation is a tail probability of a fitted distribution.
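Concretely, the "1-in-100-year" loss is the level exceeded with probability 1/100 per year, i.e. the 99th percentile of the fitted distribution. A toy sketch with exponential severities (the mean of 10 and the units are invented):

```python
import math
import random

random.seed(11)

# Toy severity model: annual losses are exponential with mean 10.
# Distribution choice, mean, and units are all illustrative.
mean_loss = 10.0
years = 100_000
losses = sorted(random.expovariate(1 / mean_loss) for _ in range(years))

# The 1-in-100-year loss: exceeded with probability 1/100 per year.
one_in_100 = losses[int(0.99 * years)]
print(round(one_in_100, 1))

# For an exponential this tail quantile is exact: mean * ln(100).
print(round(mean_loss * math.log(100), 1))  # 46.1
```

Swap the exponential for a log-normal and the same percentile computation gives a much heavier tail, which is why the severity distribution choice dominates catastrophe pricing.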

Further reading