jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. VII

Bayes' Theorem

You start with a guess about the world (the prior). Then you see evidence. Bayes' rule tells you exactly how much to update — no more, no less. The whole of probabilistic reasoning rests on this one equation.

The concept

Bayes' theorem tells you how to update your belief about an event A after observing evidence B: P(A | B) = P(B | A) · P(A) / P(B).

Reading the notation. P(A) means "the probability of A." The vertical bar | reads "given that" — so P(A | B) is "the probability of A, given that B happened." A and B stand for events, claims, or hypotheses ("has the disease"; "test came back positive"). Bayes' rule swaps the order of the conditional: it lets you compute P(A | B) when what you actually know is P(B | A).

Read left to right: the chance of A given B equals the chance of B given A, re-weighted by how plausible A was to begin with (the prior), and normalized by how often B occurs at all (the evidence).
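That re-weighting reads directly as code. A minimal sketch in Python, expanding P(B) with the law of total probability so only the three quantities you typically know are needed:

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is expanded via the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * (1 - P(A)).
    """
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b
```

For example, `posterior(0.95, 0.01, 0.05)` returns about 0.161, the medical-test posterior discussed below.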

The most counter-intuitive consequence is the base-rate fallacy: even a 95%-accurate test (95% sensitivity and 95% specificity) for a disease with 1% prevalence produces only ~16% probability of disease given a positive result. Most positives are false positives — simply because healthy people vastly outnumber sick ones.
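Plugging the numbers in makes the fallacy concrete. A quick check in Python, reading "95% accurate" as 95% sensitivity and a 5% false-positive rate:

```python
# 1% prevalence; the test catches 95% of sick people but also
# flags 5% of healthy people.
prior = 0.01
p_pos_given_disease = 0.95   # true-positive rate (sensitivity)
p_pos_given_healthy = 0.05   # false-positive rate (1 - specificity)

# P(positive) via the law of total probability.
p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

p_disease_given_pos = p_pos_given_disease * prior / p_pos
print(round(p_disease_given_pos, 3))  # → 0.161
```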

Why ML cares

Every probabilistic classifier, from the spam filter that built Gmail to the variational autoencoder behind image generation, is doing some form of Bayesian update. Naive Bayes classifiers compute exactly the equation above for every word in an email; modern Bayesian deep learning generalizes the same idea to billions of parameters.
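A word-level naive Bayes classifier is small enough to sketch in full. The word counts below are invented for illustration; a real filter would learn them from a corpus of labeled emails:

```python
from math import exp, log

# Toy counts (invented) from 100 spam and 100 ham training emails.
spam_counts = {"viagra": 40, "meeting": 2, "free": 55}
ham_counts  = {"viagra": 1, "meeting": 60, "free": 10}
n_spam = n_ham = 100

def p_spam(words, prior_spam=0.5, alpha=1):
    """Naive Bayes: multiply per-word likelihoods (in log space),
    with Laplace smoothing so unseen words don't zero everything out."""
    log_spam = log(prior_spam)
    log_ham = log(1 - prior_spam)
    for w in words:
        log_spam += log((spam_counts.get(w, 0) + alpha) / (n_spam + 2 * alpha))
        log_ham  += log((ham_counts.get(w, 0) + alpha) / (n_ham + 2 * alpha))
    # Normalize: P(spam | words) = e^ls / (e^ls + e^lh)
    return 1 / (1 + exp(log_ham - log_spam))
```

`p_spam(["free", "viagra"])` comes out near 1, while `p_spam(["meeting"])` is near 0 — each word nudges the posterior exactly as Bayes' rule prescribes.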

Bayes' rule is also the engine of scientific reasoning under uncertainty: experiments are evidence, and how much your beliefs change after seeing them is exactly the posterior. It is the math behind A/B test analysis, particle-physics discoveries, and clinical drug approvals.

Try this
  1. Open Medical test. Switch to Population view. Out of 100,000 people the test flags ~5,900 (950 true positives plus 4,950 false positives) — but only ~950 actually have the disease. P(disease | positive) ≈ 950 / 5,900 ≈ 16%. The base-rate fallacy made visible.
  2. In Tree mode, watch the posterior as a giant collapsing fraction: numerator is the highlighted A∩B leaf; denominator is A∩B + ¬A∩B. The whole rule is a ratio of two leaves.
  3. Drag Prior up to 0.5. In Population view the bright dots overwhelm the faint ones — posterior leaps to ~95%. The test didn't change; the prior did.
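The three steps above can be replayed numerically. A small sketch that sweeps the prior while holding the test fixed (95% sensitivity, 5% false-positive rate, matching the scenario above):

```python
def p_disease_given_pos(prior, sens=0.95, fpr=0.05):
    """Posterior P(disease | positive) for a given prevalence."""
    p_pos = sens * prior + fpr * (1 - prior)  # P(positive), total probability
    return sens * prior / p_pos

for prior in (0.01, 0.1, 0.5):
    print(f"prior={prior:>4}  posterior={p_disease_given_pos(prior):.2f}")
# prior=0.01  posterior=0.16
# prior= 0.1  posterior=0.68
# prior= 0.5  posterior=0.95
```

The test's error rates never change; only the prior moves, and the posterior leaps from 16% to 95%.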
Where you've seen this · 4 examples
Gmail's first spam filter

Paul Graham's 2002 essay "A Plan for Spam" — the proof-of-concept for Bayesian spam filtering — kicked off an industry. The math is exactly what's on this page: each word's P(spam | word) combined into a single posterior.

Medical screening interpretation

Doctors and statisticians teach the base-rate fallacy specifically because intuition gets it so wrong. A 95%-accurate test for a disease with 1% prevalence produces a 16% posterior — exactly what the medical scenario shows above. This single fact reshapes how screening programs are designed.

Particle physics discovery

The Higgs boson announcement in 2012 rested on the same ingredients: observe an excess, then ask how likely that excess is under "Higgs" vs "background only". The famous five-sigma bar is a threshold on the likelihood side of the equation — the chance of seeing so large an excess if only background were present, roughly P(evidence | ¬Higgs) — rather than the posterior itself.
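For scale, five sigma converts to a tail probability with nothing but the standard library — the chance of an excess at least that large arising from background alone:

```python
from statistics import NormalDist

# One-sided tail probability of a 5-sigma fluctuation under the
# background-only hypothesis.
p = 1 - NormalDist().cdf(5)
print(f"{p:.2e}")  # → 2.87e-07, about one chance in 3.5 million
```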

A/B testing

Modern experimentation platforms (Google Optimize, Eppo, Statsig) increasingly report posteriors instead of p-values. Bayesian A/B testing answers "what is the probability B is better than A?" — exactly the kind of conditional that the posterior on this page computes.
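That conditional can be estimated with the standard library alone. The conversion counts below are invented for illustration; assuming a uniform Beta(1, 1) prior, each variant's conversion rate has a Beta posterior, and Monte Carlo sampling estimates P(B > A):

```python
import random

random.seed(0)

# Toy data (invented): A converted 120/1000 visitors, B converted 140/1000.
conv_a, n_a = 120, 1000
conv_b, n_b = 140, 1000

# With a Beta(1, 1) prior, the posterior over a conversion rate is
# Beta(conversions + 1, misses + 1). Sample both posteriors and count
# how often B's rate exceeds A's.
draws = 100_000
wins = sum(
    random.betavariate(conv_b + 1, n_b - conv_b + 1)
    > random.betavariate(conv_a + 1, n_a - conv_a + 1)
    for _ in range(draws)
)
print(f"P(B beats A) ≈ {wins / draws:.2f}")
```

With these counts the estimate lands around 0.9 — a direct, readable answer where a p-value would only say "reject or not".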

Further reading