Diffusion Model - Albert Masoliver's learning site

## Definition

A **diffusion model** generates data by reversing a gradual noising process. During training, the model learns to remove a small amount of noise from a noised version of real data; at inference, it starts from pure noise and denoises iteratively to produce a new sample. Diffusion is the dominant approach for image generation and, increasingly, for video generation.

## The Two Processes

### Forward (noising)

Take a real sample $x_0$ and progressively add Gaussian noise. The noised sample at step $t$ has a closed form:

$$ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $$

Here $\bar\alpha_t$ is the fraction of signal variance remaining at step $t$; it falls from near 1 toward 0 as $t$ grows. After enough steps $T$, $x_T$ is essentially pure noise.

### Reverse (denoising)

Train a neural network, typically a U-Net or a Diffusion Transformer (DiT), to predict the noise $\epsilon$ (or, alternatively, the clean sample $x_0$) from $x_t$ and the step $t$. At inference, run the reverse process from $x_T \sim \mathcal{N}(0, I)$ back to $x_0$.

## Why It Replaced GANs

- **Stable training.** There are no adversarial dynamics, and training is far less prone to mode collapse.
- **High diversity.** Samples cover the data distribution broadly, where GANs tend to drop modes.
- **High fidelity at scale.** Matches and exceeds GAN quality on most benchmarks.
- **Conditioning is easy.** Text-conditioned generation (Stable Diffusion, DALL-E 2, Imagen) plugs cleanly into the framework.

## Latent Diffusion

Rather than diffusing in pixel space, train a VAE to compress images into a compact latent representation, run diffusion in that latent space, and decode the result back to pixels. This is what made high-resolution image generation tractable: Stable Diffusion uses this approach, and many later image models followed.

## Classifier-Free Guidance

A trick that controls the strength of conditioning at inference: run the model both conditioned (on the prompt) and unconditioned, then extrapolate from the unconditioned prediction toward the conditioned one, $\hat\epsilon = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$, where $w$ is the guidance scale. Higher guidance gives tighter prompt adherence; too high a value gives over-saturated or unnatural outputs.

## Sampling Schedules

Classical DDPM uses 1000 denoising steps.
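
To make the 1000-step figure concrete, here is a minimal sketch of the forward (noising) formula given earlier, using only the Python standard library. It assumes a linear noise schedule with endpoints 1e-4 and 0.02, which are commonly cited DDPM defaults but are not specified in this note; the function names (`linear_beta_schedule`, `cumulative_alpha_bar`, `q_sample`) are hypothetical and chosen for illustration.

```python
import math
import random


def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced.

    The endpoint values are an assumption (commonly cited DDPM defaults).
    """
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from x_0 in closed form; t is a 1-based step index.

    Returns the noised sample and the noise that was added, since the
    noise is what the denoising network is usually trained to predict.
    """
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


betas = linear_beta_schedule()
alpha_bars = cumulative_alpha_bar(betas)

rng = random.Random(0)          # fixed seed so runs are repeatable
x0 = [1.0, -1.0, 0.5, 0.0]      # toy "data" vector standing in for an image

x_mid, _ = q_sample(x0, 500, alpha_bars, rng)    # partially noised
x_T, _ = q_sample(x0, 1000, alpha_bars, rng)     # almost pure noise

# Weight on the original signal at the final step: sqrt(alpha_bar_T)
signal_T = math.sqrt(alpha_bars[-1])
print(f"signal weight at t=T: {signal_T:.4f}")
```

Under this assumed schedule, the signal weight $\sqrt{\bar\alpha_T}$ comes out well below 0.01, which is why $x_T$ can be treated as a draw from $\mathcal{N}(0, I)$ when sampling starts.
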
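Modern samplers such as DDIM, DPM-Solver, Euler, and UniPC produce comparable quality in 10–50 steps. (Flow matching is often listed alongside them, but it is a training formulation rather than a sampler; its models are sampled with ODE solvers such as Euler.) Sampler choice is now a non-trivial knob that trades speed against quality.

## Beyond Images

- **Video diffusion**: temporal extension of image diffusion; examples include Sora, Veo, and Runway.
- **Audio diffusion**: generation in waveform or spectrogram space.
- **Molecule and protein design**: diffusion in domain-specific representation spaces.

## Related

- [[Generative AI]]
- [[Multimodal Model]]
- [[Foundation Model]]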