Variational Autoencoder - Albert Masoliver's learning site

## Definition

A **Variational Autoencoder (VAE)** (Kingma & Welling, 2013) is a probabilistic [[Autoencoder]] whose latent space is *structured* for generation. Unlike a vanilla autoencoder, whose latent space has no prescribed distribution, a VAE is trained so that decoding samples drawn from the prior produces meaningful new outputs.

## Architecture

- **Encoder** $q_\phi(z \mid x)$: outputs the *parameters* of a distribution over $z$ (typically the mean and variance of a Gaussian).
- **Decoder** $p_\theta(x \mid z)$: reconstructs $x$ from a sampled $z$.
- **Prior** $p(z)$: a standard Gaussian by default.

Generation: sample $z \sim p(z)$, then $x \sim p_\theta(x \mid z)$.

## The ELBO (Evidence Lower BOund)

VAEs maximise a tractable lower bound on the data log-likelihood:

$$
\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\text{KL}}(q_\phi(z \mid x) \| p(z))
$$

Two terms:

- **Reconstruction term:** the decoder should reconstruct the input from samples of the encoder's distribution.
- **KL regularisation:** the encoder's output should stay close to the prior. This keeps the latent space well-shaped for sampling.

## The Reparameterisation Trick

Sampling $z \sim q_\phi(z \mid x)$ directly is not differentiable with respect to $\phi$. To backpropagate through the sampling step, write:

$$
z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

Now $z$ is a deterministic function of the input and the noise. All randomness sits in $\epsilon$, which does not depend on $\phi$, so gradients flow through $\mu$ and $\sigma$ with ordinary backpropagation.

## What VAE Does Well

- **Sampling new data.** Generate by sampling $z$ from the prior, then decoding.
- **Smooth latent space.** Interpolating between two latent codes produces a smooth path through data space.
- **Disentanglement.** Variants (β-VAE, FactorVAE) encourage individual latent dimensions to capture independent factors of variation.

## What VAE Does Less Well

- **Sharpness of samples.** VAE-generated images are notoriously blurry: a pixel-wise reconstruction loss (MSE / BCE) rewards the average over plausible outputs.
- **Mode coverage** vs **GANs.** A VAE typically covers more modes; a GAN typically produces sharper but less diverse samples. The classic trade-off.
- **Likelihood evaluation.** The ELBO is only a *lower bound* on the true log-likelihood. The gap equals $D_{\text{KL}}(q_\phi(z \mid x) \| p_\theta(z \mid x))$ and can be large when the encoder approximates the true posterior poorly.

## Modern Status

VAEs were among the dominant generative models from roughly 2014 to 2018. Subsequently:

- **GANs** produced sharper images.
- **Diffusion models** (2020+) produced sharper *and* more diverse images while being easier to train than GANs.
- **VAEs remain useful** for latent-space modelling. For example, latent diffusion (the Stable Diffusion architecture) uses a VAE to compress images before applying diffusion in latent space.

## Conditional VAE

Condition both encoder and decoder on some side information $y$:

$$
q_\phi(z \mid x, y), \quad p_\theta(x \mid z, y)
$$

This enables controlled generation, e.g. "generate a digit of class 7".

## Related

- [[Autoencoder]]
- [[Generative Adversarial Network]]
- [[Diffusion Model]]
- [[Self-Supervised Learning]]
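
## Example: Evaluating the ELBO

The ELBO and the reparameterisation trick described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a training implementation: the encoder is assumed to output a diagonal Gaussian (as `mu` and `log_var`), the prior is the standard Gaussian, and the decoder is assumed to be Bernoulli. Under those assumptions the KL term has the closed form $\tfrac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. All function names are hypothetical, and `toy_decoder` is a stand-in for a learned network. There is no automatic differentiation here, so the code only evaluates the objective.

```python
import math
import random


def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(log_var / 2)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, probs, tiny=1e-7):
    """log p(x | z) for binary x, given per-dimension Bernoulli means."""
    total = 0.0
    for xi, p in zip(x, probs):
        p = min(max(p, tiny), 1.0 - tiny)  # clamp to avoid log(0)
        total += xi * math.log(p) + (1.0 - xi) * math.log(1.0 - p)
    return total


def toy_decoder(z):
    """Hypothetical stand-in for p_theta(x | z): a sigmoid of each latent."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


def elbo_estimate(x, mu, log_var, decoder, rng, n_samples=1):
    """Monte Carlo reconstruction term minus the analytic KL term."""
    recon = 0.0
    for _ in range(n_samples):
        z = reparameterise(mu, log_var, rng)
        recon += bernoulli_log_likelihood(x, decoder(z))
    return recon / n_samples - kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
x = [1.0, 0.0]                            # one binary data point
mu, log_var = [0.5, -0.5], [-1.0, -1.0]   # assumed encoder outputs for x

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0: encoder equals the prior
print(elbo_estimate(x, mu, log_var, toy_decoder, rng, n_samples=100))
```

In a real model, `mu` and `log_var` would come from the encoder network and the negative ELBO would be minimised with a framework that supports automatic differentiation. Predicting the log-variance rather than the variance is a common choice because it keeps $\sigma$ positive without a constraint.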