## Definition
A **Variational Autoencoder (VAE)** (Kingma & Welling, 2013) is a probabilistic [[Autoencoder]] whose latent space is *structured* for generation. Unlike in a vanilla autoencoder, latent codes sampled from the prior decode to meaningful new outputs.
## Architecture
- **Encoder** $q_\phi(z \mid x)$ — outputs the *parameters* of a distribution over $z$ (typically mean and variance of a Gaussian).
- **Decoder** $p_\theta(x \mid z)$ — reconstructs $x$ from sampled $z$.
- **Prior** $p(z)$ — standard Gaussian by default.
Generation: sample $z \sim p(z)$, then $x \sim p_\theta(x \mid z)$.
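The generation procedure can be sketched in a few lines. This is only an illustration, assuming a hypothetical toy decoder (a fixed linear map `W`, `b` with Gaussian output noise) in place of a trained network; the names `decode_mean` and `generate` are made up for this sketch.

```python
import random

# Hypothetical toy decoder parameters (2-D latent -> 3-D data).
# A real VAE would use a trained neural network here.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
b = [0.0, 0.0, 1.0]

def decode_mean(z):
    """Mean of p_theta(x | z) for the toy linear decoder."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(W, b)]

def generate(rng, latent_dim=2, obs_std=1.0):
    """Ancestral sampling: z ~ p(z) = N(0, I), then x ~ p_theta(x | z)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    mean = decode_mean(z)
    x = [rng.gauss(m, obs_std) for m in mean]
    return z, x

rng = random.Random(0)
z, x = generate(rng)
```

In practice, many implementations skip the final sampling step and return the decoder mean directly, since the added output noise rarely improves the look of the sample.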
## The ELBO (Evidence Lower BOund)
VAEs maximise a tractable lower bound on the data log-likelihood:
$$
\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\text{KL}}(q_\phi(z \mid x) \| p(z))
$$
Two terms:
- **Reconstruction term:** the expected log-likelihood of the input under the decoder, with $z$ drawn from the encoder's distribution. It rewards accurate reconstruction.
- **KL regularisation:** penalises the encoder's distribution for drifting away from the prior, which keeps the latent space well-shaped for sampling from $p(z)$.
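Both terms can be computed directly in a common special case. The sketch below assumes a diagonal Gaussian encoder parameterised by a mean and a log-variance, a standard Gaussian prior, and a unit-variance Gaussian decoder; none of these choices is fixed by the ELBO itself. Under these assumptions the KL term has a closed form, and the reconstruction term reduces to a squared error plus a constant. The function names are illustrative.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_mean):
    """log N(x; x_mean, I), i.e. an assumed unit-variance Gaussian decoder."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, x_mean))
    return -0.5 * sq_err - 0.5 * d * math.log(2.0 * math.pi)

def elbo_estimate(x, x_mean, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term.

    x_mean is the decoder mean for one z sampled from q_phi(z | x);
    mu and log_var are the encoder outputs for the same x.
    """
    reconstruction = gaussian_log_likelihood(x, x_mean)
    kl = kl_diag_gaussian_to_standard_normal(mu, log_var)
    return reconstruction - kl
```

Training maximises this quantity, or equivalently minimises its negative, averaged over a batch. When the encoder output equals the prior (`mu = 0`, `log_var = 0`), the KL term is exactly zero.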
## The Reparameterisation Trick
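Drawing $z \sim q_\phi(z \mid x)$ directly is not differentiable with respect to $\phi$. To backpropagate through the sampling step, rewrite the sample as: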
$$
z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$
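Now $z$ is a deterministic function of the input, the encoder parameters, and the noise $\epsilon$. All randomness sits in $\epsilon$, which does not depend on $\phi$, so gradients flow through $\mu$ and $\sigma$ with standard backpropagation.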
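A small numerical check makes the idea concrete. The sketch below uses a scalar latent and the toy objective $\mathbb{E}[z^2]$, both chosen only for illustration, whose exact gradients are $2\mu$ and $2\sigma$. The gradients are derived by hand instead of by an autodiff library, and `reparam_gradients` is a hypothetical helper name.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo pathwise gradients of E[z^2] w.r.t. mu and sigma.

    With z = mu + sigma * eps, dz/dmu = 1 and dz/dsigma = eps, so
    d(z^2)/dmu = 2 * z and d(z^2)/dsigma = 2 * z * eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise, independent of the parameters
        z = mu + sigma * eps        # deterministic given (mu, sigma, eps)
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
```

With enough samples, `g_mu` and `g_sigma` should land close to the analytic values $2\mu = 3.0$ and $2\sigma = 1.0$. An autodiff framework performs the same chain-rule computation automatically once $z$ is written in this form.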
## What VAE Does Well
- **Sampling new data.** Generate by sampling $z$ then decoding.
- **Smooth latent space.** Interpolating between two latent codes produces a smooth path through data space.
- **Disentanglement.** Variants (β-VAE, FactorVAE) encourage individual latent dimensions to capture independent factors of variation.
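Latent interpolation, mentioned in the list above, is simple to sketch. The helper below (`interpolate_latents`, a hypothetical name) builds a straight-line path between two latent codes; each point on the path would then be passed through the decoder. Linear interpolation is an assumption here, and other schemes such as spherical interpolation are also used.

```python
def interpolate_latents(z_a, z_b, n_steps=5):
    """Linearly interpolate between two latent codes, endpoints included."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    path = []
    for i in range(n_steps):
        t = i / (n_steps - 1)   # t runs from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

# Two made-up 2-D latent codes; in practice these come from encoding two inputs.
path = interpolate_latents([0.0, 2.0], [4.0, -2.0], n_steps=5)
```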
## What VAE Does Less Well
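- **Sharpness of samples.** VAE-generated images are often blurry: a pixel-wise reconstruction loss (MSE / BCE) is minimised by averaging over the plausible outputs for a given latent code.
- **Mode coverage vs. sharpness (compared with GANs).** VAEs typically cover more modes; GANs typically produce sharper but less diverse samples. The classic trade-off.
- **Likelihood evaluation.** The ELBO is only a *lower bound* on the true log-likelihood. The gap equals $D_{\text{KL}}(q_\phi(z \mid x) \| p_\theta(z \mid x))$ and can be large when the encoder approximates the true posterior poorly.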
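The averaging effect behind blurry samples shows up even for a single pixel. The toy example below assumes a hypothetical pixel that is black in half of the plausible outputs and white in the other half, then searches a small grid for the constant prediction with the lowest MSE.

```python
def mse(prediction, targets):
    """Mean squared error of one constant prediction against many targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Hypothetical pixel: black (0.0) in half the plausible outputs,
# white (1.0) in the other half.
targets = [0.0, 1.0] * 50

candidates = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
best = min(candidates, key=lambda p: mse(p, targets))
```

The minimiser is the mean of the targets, a grey value of 0.5 that never occurs in the data. Applied to every pixel of an image, the same effect produces blur.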
## Modern Status
VAEs were among the leading deep generative models from roughly 2014 to 2018. Subsequently:
- **GANs** produced sharper images.
- **Diffusion models** (2020+) produced sharper *and* more diverse images while being easier to train than GANs.
- **VAEs remain useful** for latent-space modelling. For example, latent diffusion (the approach behind Stable Diffusion) uses a VAE to compress images and then runs diffusion in the resulting latent space.
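To give a rough sense of the compression involved, the arithmetic below assumes the commonly cited Stable Diffusion v1 configuration: 512×512 RGB images, a spatial downsampling factor of 8, and 4 latent channels. These figures are not stated in this note and should be checked against the specific model; the helper names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent tensor shape (channels, height, width) under the assumed config."""
    return (latent_channels, height // downsample, width // downsample)

def num_elements(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

image = (3, 512, 512)                  # assumed RGB input
latent = latent_shape(512, 512)        # (4, 64, 64) under the assumptions above
ratio = num_elements(image) / num_elements(latent)
```

Under these assumptions the diffusion model operates on 48 times fewer values than the raw image contains, which accounts for much of the compute saving.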
## Conditional VAE
Condition both encoder and decoder on some side information $y$:
$$
q_\phi(z \mid x, y), \quad p_\theta(x \mid z, y)
$$
This enables controlled generation: fix $y$ (e.g. "a digit of class 7"), sample $z$ from the prior, and decode.
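One simple way to implement the conditioning is to concatenate a one-hot encoding of $y$ onto the inputs of both networks. This is an assumption made for the sketch below, since other mechanisms such as learned label embeddings are also common, and the function names are hypothetical.

```python
def one_hot(label, num_classes):
    """One-hot encode an integer class label."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def encoder_input(x, label, num_classes):
    """Input to q_phi(z | x, y): the data concatenated with the one-hot label."""
    return list(x) + one_hot(label, num_classes)

def decoder_input(z, label, num_classes):
    """Input to p_theta(x | z, y): the latent concatenated with the one-hot label."""
    return list(z) + one_hot(label, num_classes)

# Conditional generation: z would be sampled from the prior; here it is fixed
# to made-up values, and the label requests class 7.
dec_in = decoder_input([0.3, -1.2], label=7, num_classes=10)
```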
## Related
- [[Autoencoder]]
- [[Generative Adversarial Network]]
- [[Diffusion Model]]
- [[Self-Supervised Learning]]