A Continent of Latent Space
A plain autoencoder learns one point per training example. A variational autoencoder learns a whole region per example — and because the regions overlap and fill the space, you can sample anywhere and decode to a plausible new output.
A variational autoencoder (VAE) maps each input to a distribution over latent space — typically a Gaussian with a learned mean and variance — instead of a single point.
The encoder outputs (μ, σ) for each input. We sample z ~ N(μ, σ²) from that distribution, decode it, and add a regularization term that pulls every encoder output toward the standard normal N(0, I). The result: latent space fills smoothly, with no holes.
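As a concrete anchor, here is a minimal PyTorch sketch of that training step. The MLP sizes, the MSE reconstruction term, and the module names (`VAE`, `mu_head`, `logvar_head`) are illustrative choices, not anything specified above.

```python
# Minimal VAE training step (PyTorch). Shapes and module names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)       # μ
        self.logvar_head = nn.Linear(hidden, z_dim)   # log σ²
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                    # ε ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps        # z = μ + σ·ε
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")     # reconstruction term
    # KL( N(μ, σ²) || N(0, I) ): the pull toward the standard normal, in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```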
Once trained, you can ignore the encoder entirely and just sample z ~ N(0, I), decode, and get plausible new examples. Generation comes for free.
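Generation then reduces to a couple of lines, reusing the illustrative `VAE` from the previous sketch:

```python
# Generation after training: ignore the encoder, sample the prior, decode.
model = VAE()            # pretend it has been trained
z = torch.randn(8, 16)   # z ~ N(0, I): 8 draws in the 16-dim latent space
samples = model.dec(z)   # each row decodes to a plausible new example
```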
Stable Diffusion's "VAE" is literally this — a 2014-vintage VAE that compresses 512×512 RGB images to 64×64×4 latents before the diffusion process runs. Without it, the diffusion model would have to learn to denoise raw pixels (a much harder optimization).
VAEs also pioneered the reparameterization trick: sample ε ~ N(0,I) first, then compute z = μ + σ · ε. This lets gradients flow through the random sampling — a technique now everywhere in modern probabilistic ML.
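Here is the trick in isolation, as a small PyTorch sketch: the only randomness is in ε, so autograd can push gradients back into μ and σ.

```python
# Reparameterization in isolation: gradients reach μ and σ because
# the randomness lives entirely in ε.
import torch

mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)

eps = torch.randn(4)                    # ε ~ N(0, I), no gradient needed
z = mu + torch.exp(log_sigma) * eps     # z = μ + σ·ε, differentiable in μ and σ
loss = (z ** 2).sum()                   # any downstream loss
loss.backward()

print(mu.grad, log_sigma.grad)          # both populated: gradients flowed
```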
- Switch to KL · two distributions. You see two ellipses: the encoder's q(z|x) = N(μ, σ²) and the prior N(0, I). Hit Improve KL and the encoder's ellipse gets pulled toward the prior. That tug is the KL regularizer making latent space tile cleanly; the closed form is in the sketch after this list.
- Switch to Reparameterization. Watch ε get sampled from the prior, scaled by σ, translated by μ, landing at z. Sampling becomes a deterministic function of an external random ε, so gradients flow.
- Switch to Walk the manifold. Pick two endpoints; the slider interpolates along the latent line (sketched in code after this list); the decoder paints each step. Smooth interpolation means every point in between is plausible.
- Hit Sample N(0, I) on the manifold view. Every random draw decodes to something plausible; that's what the VAE bought you.
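For readers who want the formulas behind the KL and manifold views, here is a small sketch: the closed-form KL to N(0, I) and a straight-line latent interpolation. Function and variable names are illustrative.

```python
# What the KL view animates (per-example divergence to the prior) and what the
# manifold walk does (straight-line interpolation between two latents).
import torch

def kl_to_standard_normal(mu, sigma):
    # KL( N(μ, σ²) || N(0, I) ) = 0.5 * Σ (σ² + μ² − 1 − log σ²)
    return 0.5 * torch.sum(sigma**2 + mu**2 - 1.0 - torch.log(sigma**2), dim=-1)

def interpolate(z_a, z_b, steps=8):
    # points along the segment from z_a to z_b; decode each to "paint" a step
    ts = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)
    return (1 - ts) * z_a + ts * z_b

mu, sigma = torch.tensor([1.0, -0.5]), torch.tensor([0.5, 2.0])
print(kl_to_standard_normal(mu, sigma))   # shrinks toward 0 as (μ, σ) → (0, 1)
```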
Stable Diffusion's first stage uses a VAE to compress 512×512 RGB images into 64×64×4 latents; the diffusion process runs entirely in this latent space, speeding up training and inference by roughly 10×.
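A quick back-of-the-envelope on those shapes (the ~10× figure above is the quoted end-to-end speedup; the raw size reduction is larger):

```python
# How much smaller the latent is than the pixel grid.
pixels  = 512 * 512 * 3   # 786,432 values per image
latents = 64 * 64 * 4     #  16,384 values per latent
print(pixels / latents)   # 48.0 — each denoising step touches ~48x fewer values
```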
VAEs trained on SMILES strings learn a continuous latent space of valid molecules. Optimize a property (binding affinity, solubility) directly in latent space, then decode — generating novel candidate drugs.
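A heavily hedged sketch of what "optimize in latent space" can look like: `property_model` (a differentiable predictor of the target property) and `decode_to_smiles` are hypothetical stand-ins, not components named here.

```python
# Gradient ascent on a property score, performed directly on the latent vector.
# `property_model` and `decode_to_smiles` are hypothetical placeholders.
import torch

def optimize_latent(z_init, property_model, steps=100, lr=0.05):
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = property_model(z)        # differentiable property predictor
        (-score).sum().backward()        # ascend the property score
        opt.step()
    return z.detach()

# candidate = decode_to_smiles(optimize_latent(encode(seed_molecule), property_model))
```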
Variational TTS systems (VITS, NaturalSpeech) use VAE-style encoders to capture speaker identity, then condition decoding on text. The smooth latent space allows voice morphing between speakers.
Compute the likelihood that a sample's encoding came from N(0, I). Far-from-typical examples get low likelihood and high reconstruction error, which gives a more principled anomaly score than a plain AE provides.
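One possible scoring function, assuming the illustrative `VAE` from the earlier sketch; combining the prior log-likelihood of μ with the reconstruction error is one reasonable choice, not the only one.

```python
# Anomaly score: how unlikely the encoding is under the prior, plus how badly
# the decoder reconstructs. Higher means more anomalous.
import torch
import torch.nn.functional as F

def anomaly_score(model, x):
    x_hat, mu, logvar = model(x)
    prior = torch.distributions.Normal(0.0, 1.0)
    log_p = prior.log_prob(mu).sum(dim=-1)                    # likelihood under N(0, I)
    recon = F.mse_loss(x_hat, x, reduction="none").sum(dim=-1)
    return recon - log_p
```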
- Auto-Encoding Variational Bayes paper Kingma & Welling (2014) · The original VAE paper. Establishes the variational lower bound, the reparameterization trick, and the framework that VAEs run on.
- From Autoencoder to Beta-VAE essay Lilian Weng · A patient walkthrough of the math, the variants (β-VAE, VQ-VAE, IWAE), and why each one was introduced.
- An Introduction to Variational Autoencoders survey Kingma & Welling (2019) · A book-length treatment from the inventors. The canonical reference if you're going deep.
- Neural Discrete Representation Learning (VQ-VAE) paper van den Oord et al. (2017) · The discrete-latent variant. Underlies most modern image-token systems (DALL·E, Parti, MaskGIT).