A Continent of Latent Space
A plain autoencoder learns one point per training example. A variational autoencoder learns a whole region per example — and because the regions overlap and fill the space, you can sample anywhere and decode to a plausible new output.
A variational autoencoder (VAE) maps each input to a distribution over latent space — typically a Gaussian with a learned mean and variance — instead of a single point.
The encoder outputs (μ, σ) for each input. We sample z ~ N(μ, σ²) from that distribution, decode it, and add a regularization term that pulls every encoder output toward the standard normal N(0, I). The result: latent space fills smoothly, with no holes.
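As a concrete anchor, here is a minimal PyTorch sketch of that training step. The MLP sizes, the MSE reconstruction term, and the module names (`VAE`, `mu_head`, `logvar_head`) are illustrative choices, not anything specified above.

```python
# Minimal VAE training step (PyTorch). Shapes and module names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)       # μ
        self.logvar_head = nn.Linear(hidden, z_dim)   # log σ²
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                    # ε ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps        # z = μ + σ·ε
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")     # reconstruction term
    # KL( N(μ, σ²) || N(0, I) ): the pull toward the standard normal, in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```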
Once trained, you can ignore the encoder entirely and just sample z ~ N(0, I), decode, and get plausible new examples. Generation comes for free.
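Generation then reduces to a couple of lines, reusing the illustrative `VAE` from the previous sketch:

```python
# Generation after training: ignore the encoder, sample the prior, decode.
model = VAE()            # pretend it has been trained
z = torch.randn(8, 16)   # z ~ N(0, I): 8 draws in the 16-dim latent space
samples = model.dec(z)   # each row decodes to a plausible new example
```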
Stable Diffusion's "VAE" is literally this — a 2014-vintage VAE that compresses 512×512 RGB images to 64×64×4 latents before the diffusion process runs. Without it, the diffusion model would have to learn to denoise raw pixels (a much harder optimization).
VAEs also pioneered the reparameterization trick: sample ε ~ N(0,I) first, then compute z = μ + σ · ε. This lets gradients flow through the random sampling — a technique now everywhere in modern probabilistic ML.
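Here is the trick in isolation, as a small PyTorch sketch: the only randomness is in ε, so autograd can push gradients back into μ and σ.

```python
# Reparameterization in isolation: gradients reach μ and σ because
# the randomness lives entirely in ε.
import torch

mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)

eps = torch.randn(4)                    # ε ~ N(0, I), no gradient needed
z = mu + torch.exp(log_sigma) * eps     # z = μ + σ·ε, differentiable in μ and σ
loss = (z ** 2).sum()                   # any downstream loss
loss.backward()

print(mu.grad, log_sigma.grad)          # both populated: gradients flowed
```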
- Switch to KL · two distributions. You see two ellipses: the encoder's q(z|x) = N(μ, σ²) and the prior N(0, I). Hit Improve KL and the encoder's ellipse gets pulled toward the prior. That tug is the KL regularizer making latent space tile cleanly; the closed form is in the sketch after this list.
- Switch to Reparameterization. Watch ε get sampled from the prior, scaled by σ, translated by μ, landing at z. Sampling becomes a deterministic function of an external random ε, so gradients flow.
- Switch to Walk the manifold. Pick two endpoints; the slider interpolates along the latent line (sketched in code after this list); the decoder paints each step. Smooth interpolation means every point in between is plausible.
- Hit Sample N(0, I) on the manifold view. Every random draw decodes to something plausible; that's what the VAE bought you.
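For readers who want the formulas behind the KL and manifold views, here is a small sketch: the closed-form KL to N(0, I) and a straight-line latent interpolation. Function and variable names are illustrative.

```python
# What the KL view animates (per-example divergence to the prior) and what the
# manifold walk does (straight-line interpolation between two latents).
import torch

def kl_to_standard_normal(mu, sigma):
    # KL( N(μ, σ²) || N(0, I) ) = 0.5 * Σ (σ² + μ² − 1 − log σ²)
    return 0.5 * torch.sum(sigma**2 + mu**2 - 1.0 - torch.log(sigma**2), dim=-1)

def interpolate(z_a, z_b, steps=8):
    # points along the segment from z_a to z_b; decode each to "paint" a step
    ts = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)
    return (1 - ts) * z_a + ts * z_b

mu, sigma = torch.tensor([1.0, -0.5]), torch.tensor([0.5, 2.0])
print(kl_to_standard_normal(mu, sigma))   # shrinks toward 0 as (μ, σ) → (0, 1)
```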
Stable Diffusion's first stage uses a VAE to compress 512×512 RGB images into 64×64×4 latents; the diffusion process runs entirely in this latent space, speeding up training and inference by roughly 10×.
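A quick back-of-the-envelope on those shapes (the ~10× figure above is the quoted end-to-end speedup; the raw size reduction is larger):

```python
# How much smaller the latent is than the pixel grid.
pixels  = 512 * 512 * 3   # 786,432 values per image
latents = 64 * 64 * 4     #  16,384 values per latent
print(pixels / latents)   # 48.0 — each denoising step touches ~48x fewer values
```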
VAEs trained on SMILES strings learn a continuous latent space of valid molecules. Optimize a property (binding affinity, solubility) directly in latent space, then decode — generating novel candidate drugs.
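A heavily hedged sketch of what "optimize in latent space" can look like: `property_model` (a differentiable predictor of the target property) and `decode_to_smiles` are hypothetical stand-ins, not components named here.

```python
# Gradient ascent on a property score, performed directly on the latent vector.
# `property_model` and `decode_to_smiles` are hypothetical placeholders.
import torch

def optimize_latent(z_init, property_model, steps=100, lr=0.05):
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = property_model(z)        # differentiable property predictor
        (-score).sum().backward()        # ascend the property score
        opt.step()
    return z.detach()

# candidate = decode_to_smiles(optimize_latent(encode(seed_molecule), property_model))
```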
Variational TTS systems (VITS, NaturalSpeech) use VAE-style encoders to capture speaker identity, then condition decoding on text. The smooth latent space allows voice morphing between speakers.
Compute the likelihood that a sample's encoding came from N(0, I). Far-from-typical examples get low likelihood and high reconstruction error, which gives a more principled anomaly score than a plain AE provides.
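One possible scoring function, assuming the illustrative `VAE` from the earlier sketch; combining the prior log-likelihood of μ with the reconstruction error is one reasonable choice, not the only one.

```python
# Anomaly score: how unlikely the encoding is under the prior, plus how badly
# the decoder reconstructs. Higher means more anomalous.
import torch
import torch.nn.functional as F

def anomaly_score(model, x):
    x_hat, mu, logvar = model(x)
    prior = torch.distributions.Normal(0.0, 1.0)
    log_p = prior.log_prob(mu).sum(dim=-1)                    # likelihood under N(0, I)
    recon = F.mse_loss(x_hat, x, reduction="none").sum(dim=-1)
    return recon - log_p
```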
- Auto-Encoding Variational Bayes paper Kingma & Welling (2014) · The original VAE paper. Establishes the variational lower bound, the reparameterization trick, and the framework that VAEs run on.
- From Autoencoder to Beta-VAE essay Lilian Weng · A patient walkthrough of the math, the variants (β-VAE, VQ-VAE, IWAE), and why each one was introduced.
- An Introduction to Variational Autoencoders survey Kingma & Welling (2019) · A book-length treatment from the inventors. The canonical reference if you're going deep.
- Neural Discrete Representation Learning (VQ-VAE) paper van den Oord et al. (2017) · The discrete-latent variant. Underlies most modern image-token systems (DALL·E, Parti, MaskGIT).