Probability Distributions
A distribution assigns probability to outcomes — heads or tails, photon arrivals, tomorrow's temperature. Tweak the parameters and watch the shape morph. Then sample from it and see the histogram converge.
A probability distribution is a function that assigns a number — a probability — to each possible outcome of a random process.
For continuous outcomes (height, temperature) it's a density — a curve f(x) where the area over any interval is the chance the value falls there. For discrete outcomes (coin flip, count) it's a mass function — a probability P(k) for each outcome, and the P(k) sum to 1.
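To see the density/mass split in code, here's a minimal check with scipy.stats (the library listed in the resources below); the standard Gaussian and the 10-flip binomial are just illustrative choices:

```python
from scipy import stats

# Continuous: probability is area under the density, read off via the CDF.
gauss = stats.norm(loc=0, scale=1)           # standard Gaussian
print(gauss.cdf(1) - gauss.cdf(-1))          # P(-1 < X < 1) ≈ 0.683

# Discrete: probability is a mass at each integer, and the masses sum to 1.
flips = stats.binom(n=10, p=0.5)
print(sum(flips.pmf(k) for k in range(11)))  # 1.0 (up to float error)
```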
Most common distributions are pinned down by one or two parameters: the Gaussian's mean and standard deviation, the Bernoulli's success probability p, the Poisson's rate λ. Change the parameters and the curve breathes.
Reading the Greek. μ ("mu") is the mean — the center of the distribution. σ ("sigma") is the standard deviation — the spread. λ ("lambda") is a rate — events per unit time. π ("pi") is the constant 3.14159…; e is Euler's number, 2.71828…. k is a discrete count; n is the number of trials.
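One way to keep the Greek straight is to see how it maps onto scipy.stats parameter names; the values below are arbitrary, and note that scipy calls the Poisson rate `mu` rather than λ:

```python
from scipy import stats

mu, sigma, lam, p = 0.0, 1.0, 4.0, 0.3    # example values only

gauss = stats.norm(loc=mu, scale=sigma)   # μ is `loc`, σ is `scale`
counts = stats.poisson(mu=lam)            # λ is (confusingly) named `mu` here
flip = stats.bernoulli(p)                 # success probability p

print(gauss.mean(), gauss.std())          # 0.0, 1.0
print(counts.mean(), counts.var())        # a Poisson has mean = variance = λ
print(flip.mean())                        # 0.3
```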
Every probabilistic model in machine learning is, under the hood, a parametrized distribution that the algorithm fits to data. Linear regression assumes Gaussian noise. Logistic regression outputs a Bernoulli. Variational autoencoders decode samples from a Gaussian prior. Diffusion models reverse a Gaussian noising process.
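To make one item on that list concrete: a logistic-regression head literally outputs the parameter of a Bernoulli. The weights below are random stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in features and weights (a real model would learn w and b).
x, w, b = rng.normal(size=3), rng.normal(size=3), 0.0

p = 1 / (1 + np.exp(-(x @ w + b)))   # sigmoid squashes the score into (0, 1)
label = rng.random() < p             # the output IS a Bernoulli: sample it
print(p, label)
```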
The Central Limit Theorem says: sum enough independent random things and you get a Gaussian, regardless of where you started. That single fact is why "errors are noisy" almost always means "errors are normally distributed" in practice.
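You can check the theorem on something decidedly non-Gaussian; the uniform draws and n = 30 here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each term is flat (uniform on [0, 1]), but sums of 30 come out bell-shaped.
sums = rng.uniform(0, 1, size=(100_000, 30)).sum(axis=1)

# CLT prediction: mean n/2 = 15, std sqrt(n/12) ≈ 1.58.
print(sums.mean(), sums.std())
```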
- Open Gaussian. The shaded band is μ ± σ — spread made visible as one standard deviation either side of the mean. Drag σ wider and the band breathes with it.
- Hit +1000. Samples drop in over a second instead of all at once; watch the inset trace pin the running mean to μ. That's the law of large numbers as a heartbeat (sketched in code just after this list).
- Switch to Binomial. Notice the lollipop stems with gaps between integers — discrete distributions live only on whole numbers, and the visual finally says so.
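Here's the running-mean convergence from the +1000 demo in a few lines of NumPy; μ = 0, σ = 1, and the sample count mirror the demo but are otherwise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0

samples = rng.normal(mu, sigma, size=1000)
running_mean = np.cumsum(samples) / np.arange(1, 1001)

# The running mean pins itself to mu as n grows (law of large numbers).
print(running_mean[[9, 99, 999]])   # after 10, 100, 1000 samples
```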
When Google or Meta rolls out a feature to 1% of users, conversion counts in each bucket follow a binomial. The "is the new design better?" question is answered by comparing two binomials — same math you can play with above by sliding p.
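A sketch of that comparison using a classic two-proportion z-test — one standard choice, not necessarily what any given company runs. The bucket sizes and conversion counts are invented:

```python
import numpy as np
from scipy import stats

# Invented bucket results: conversions out of users shown each variant.
conv_a, n_a = 480, 10_000   # control
conv_b, n_b = 540, 10_000   # new design

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))   # two-sided
print(z, p_value)                     # z ≈ 1.93 → p ≈ 0.054
```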
Requests-per-second to a server are modeled as Poisson; wait-times between them are exponential. AWS auto-scaling rules, Twilio rate limits, and SLO budgets are all built on these two distributions.
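A quick way to see the duality (the 50 requests-per-second rate is made up): exponential gaps between events produce Poisson counts per window, and a Poisson's mean equals its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 50.0   # assumed requests per second

# Exponential waits between requests...
gaps = rng.exponential(1 / rate, size=1_000_000)
arrivals = np.cumsum(gaps)

# ...give Poisson counts per one-second window.
counts = np.histogram(arrivals, bins=np.arange(0, arrivals[-1]))[0]
print(counts.mean(), counts.var())   # both ≈ 50, the Poisson signature
```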
Stable Diffusion, Imagen, and DALL·E start from pure Gaussian noise and denoise in steps. The forward process is "add a little Gaussian"; the reverse process — what a neural net learns — is "subtract the right Gaussian."
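Here's a toy version of that forward process on a made-up 1-D signal, with a constant β (real schedules vary β per step):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(-1, 1, size=64)   # stand-in "image": 64 toy pixels
beta = 0.02                       # assumed per-step noise amount

# DDPM-style forward step: shrink the signal, add a little Gaussian.
for _ in range(1000):
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)

# After enough steps, x is indistinguishable from pure Gaussian noise.
print(x.mean(), x.std())   # ≈ 0 and ≈ 1
```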
Catastrophe pricing models claim frequencies as Poisson and claim sizes as log-normal or exponential. Every "1-in-100-year flood" calculation is a tail probability of a fitted distribution.
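What a "1-in-100-year" number looks like in code, with invented log-normal parameters standing in for a real fit:

```python
from scipy import stats

# Invented log-normal "fit" for annual maximum flood levels.
flood = stats.lognorm(s=0.5, scale=3.0)

# The 1-in-100-year level: exceeded with probability 1/100 in a year.
level = flood.isf(1 / 100)      # inverse survival function
print(level)                    # ≈ 9.6, in whatever units the fit uses

# And the reverse question: how rare is a given level?
print(1 / flood.sf(level))      # ≈ 100 years
```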
- Seeing Theory (interactive) · Daniel Kunin (Brown) · A visual introduction to probability that makes the case for distributions through animation. Chapter 3 on distributions is the natural sequel to this page.
- The Central Limit Theorem (reference) · Wikipedia · The single result that explains why Gaussians appear everywhere. The entry's "illustration" section has good intuition pumps.
- 3Blue1Brown — But what is the Central Limit Theorem? (video) · Grant Sanderson · A visual proof sketch of why sums of random variables become Gaussian. Pairs naturally with the binomial → Gaussian convergence above.
- scipy.stats — distributions in Python (reference) · 100+ distributions with consistent pdf, cdf, sf, rvs, fit APIs. The reference everyone reaches for.