jthomas.site// notebook · v.4.2026
Machine Learning, Visualized · Vol. XXI

A Continent of
Latent Space

A plain autoencoder learns one point per training example. A variational autoencoder learns a whole region per example — and because the regions overlap and fill the space, you can sample anywhere and decode to a plausible new output.

The concept

A variational autoencoder (VAE) maps each input to a distribution over latent space — typically a Gaussian with a learned mean and variance — instead of a single point.

The encoder outputs (μ, σ) for each input. We sample z ~ N(μ, σ²) from that distribution, decode it, and add a regularization term (a KL divergence) that pulls every encoder distribution toward the standard normal N(0, I). The result: latent space fills smoothly, with no holes.
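That KL regularizer has a closed form when both distributions are diagonal Gaussians. A minimal numpy sketch (the function name is mine, not from any library):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.
    Closed form for diagonal Gaussians:
    -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Matching the prior exactly costs nothing...
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))            # 0.0
# ...and drifting away from it is penalized.
print(kl_to_standard_normal(np.array([0.5, -1.0]), np.zeros(2)))  # 0.625
```

The full VAE loss adds this penalty to the reconstruction error, which is what keeps every encoder output near the origin instead of scattering into disconnected islands.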

Once trained, you can ignore the encoder entirely and just sample z ~ N(0, I), decode, and get plausible new examples. Generation comes for free.

Why ML cares

Stable Diffusion's "VAE" is literally this — a 2014-vintage VAE that compresses 512×512 RGB images to 64×64×4 latents before the diffusion process runs. Without it, the diffusion model would have to learn to denoise raw pixels (a much harder optimization).

VAEs also pioneered the reparameterization trick: sample ε ~ N(0,I) first, then compute z = μ + σ · ε. This lets gradients flow through the random sampling — a technique now everywhere in modern probabilistic ML.
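The trick is small enough to fit in a few lines. A sketch in numpy (function name is mine; a real implementation would use a framework's autodiff, but the arithmetic is identical):

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, sigma, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, I).
    The randomness is isolated in eps, so z is a deterministic (and hence
    differentiable) function of the encoder outputs mu and sigma."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

# Many draws recover the intended distribution.
zs = np.array([reparameterize(2.0, 0.5, rng) for _ in range(50_000)])
print(zs.mean(), zs.std())  # approximately 2.0 and 0.5
```

Sampling z directly would make the loss a random function of (μ, σ) with no usable gradient; routing the randomness through ε leaves a plain affine function for backprop to differentiate.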

Try this
  1. Switch to KL · two distributions. You see two ellipses: the encoder's q(z|x) = N(μ, σ²) and the prior N(0, I). Hit Improve KL and the encoder ellipse gets pulled toward the prior. That tug is the KL regularizer making latent space tile cleanly.
  2. Switch to Reparameterization. Watch ε get sampled from the prior, scaled by σ, translated by μ — landing at z. Sampling becomes a deterministic function of an outside random ε, so gradients flow.
  3. Switch to Walk the manifold. Pick two endpoints; the slider interpolates the latent line; the decoder paints each step. Smooth interpolation means every point in between is plausible.
  4. Hit Sample N(0, I) on the manifold view. Every random draw decodes to something — that's what the VAE bought you.
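The manifold walk in step 3 is just linear interpolation between two latent codes. A sketch under that assumption (the helper name is mine):

```python
import numpy as np

def latent_walk(z_a, z_b, steps=8):
    """Evenly spaced points on the straight line from z_a to z_b.
    Decoding each row paints one frame of the interpolation."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - ts) * z_a + ts * z_b

path = latent_walk(np.array([-1.0, 0.0]), np.array([1.0, 2.0]), steps=5)
print(path.shape)  # (5, 2): endpoints plus three intermediate codes
```

In higher-dimensional latent spaces, spherical interpolation is sometimes preferred, since most of a standard Gaussian's mass sits on a shell and a straight chord cuts through a low-density interior; for a 2D demo the straight line is fine.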
Where you've seen this · 4 examples
Stable Diffusion's image VAE

Stable Diffusion's first stage compresses 512×512 RGB → 64×64×4 latents using a VAE; the diffusion process operates entirely in this latent space. Speeds up training and inference by ~10×.

Drug molecule generation

VAEs trained on SMILES strings learn a continuous latent space of valid molecules. Optimize a property (binding affinity, solubility) directly in latent space, then decode — generating novel candidate drugs.

Voice and speech synthesis

Variational TTS systems (VITS, NaturalSpeech) use VAE-style encoders to capture speaker identity, then condition decoding on text. The smooth latent space allows voice morphing between speakers.

Anomaly detection (probabilistic)

Compute the likelihood that a sample's encoding came from the prior N(0, I). Atypical examples score low under the prior and reconstruct poorly — a more principled anomaly score than a plain autoencoder provides.
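Scoring the encoding against the prior is a one-line density computation. A minimal sketch, assuming we score the encoder mean against N(0, I) (the function name is mine; in practice this would be combined with the reconstruction error):

```python
import numpy as np

def prior_log_density(z):
    """Log-density of a latent code under the standard normal prior N(0, I):
    -0.5 * (d * log(2*pi) + ||z||^2) for a d-dimensional code."""
    z = np.asarray(z)
    d = z.shape[-1]
    return -0.5 * (d * np.log(2.0 * np.pi) + np.sum(z**2, axis=-1))

typical = prior_log_density([0.1, -0.2])   # near the origin: high density
outlier = prior_log_density([4.0, -5.0])   # far from the prior: low density
print(typical > outlier)  # True
```

Thresholding this score (or a weighted sum of it and the reconstruction error) gives the anomaly detector.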

Further reading