## Definition
A **diffusion model** generates data by reversing a gradual noising process. During training, the model learns to remove a small amount of noise from a noised version of real data; at inference, it starts from pure noise and denoises iteratively to produce a new sample. Diffusion is the dominant approach for image generation and, increasingly, video generation.
## The Two Processes
### Forward (noising)
Take a real sample $x_0$ and progressively add Gaussian noise:
$$
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$
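Here $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative fraction of signal retained under a noise schedule $\beta_1, \dots, \beta_T$. After enough steps $T$, $\bar\alpha_T \approx 0$ and $x_T$ is essentially pure noise.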
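The closed form above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule from `1e-4` to `0.02` over 1000 steps is an assumed (though common) choice, and the names `make_alpha_bar` and `q_sample` are made up for this example.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from x_0 in one shot; t is a 0-based index (alpha_bar[t] is step t + 1)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = np.ones((4, 4))                              # stand-in for a real image
x_early, _ = q_sample(x0, 10, alpha_bar, rng)     # still mostly signal
x_late, _ = q_sample(x0, 999, alpha_bar, rng)     # almost entirely noise
```

Because $x_t$ depends only on $x_0$ and a single noise draw, training can sample any timestep directly without simulating the steps before it.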
### Reverse (denoising)
Train a neural network — typically a U-Net or Diffusion Transformer (DiT) — to predict the noise (or the clean sample) at each step. At inference, run the reverse process from $x_T \sim \mathcal{N}(0, I)$ back to $x_0$.
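The reverse loop can be sketched with the ancestral DDPM update, under some loudly stated assumptions: the schedule is the same assumed linear one, the step variance is set to $\beta_t$ (one of several valid choices), and the trained network is replaced by a hypothetical `oracle_eps` that already knows the single target point. That stand-in exists only so the loop runs without any training.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral DDPM sampling: start from noise, denoise one step at a time."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)                       # model's noise estimate
        # Posterior mean: subtract the scaled noise estimate, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise on every step except the last (assumed variance: beta_t).
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

betas = np.linspace(1e-4, 0.02, 1000)                   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
target = np.array([2.0, -1.0])                          # toy "dataset" of one point

def oracle_eps(x, t):
    # Hypothetical stand-in for a trained network: the exact noise if x_0 == target.
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

sample = ddpm_sample(oracle_eps, target.shape, betas, np.random.default_rng(0))
```

With a perfect noise estimate the loop lands on the target point; a real network only approximates that estimate, which is why samples vary.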
## Why It Replaced GANs
- **Stable training.** No adversarial dynamics; no mode collapse.
- **High diversity.** Covers the data distribution more fully, rather than collapsing onto a few modes.
- **High fidelity at scale.** Matches or exceeds GAN sample quality on standard image-generation benchmarks.
- **Conditioning is easy.** Text-conditioned generation (Stable Diffusion, DALL-E 2, Imagen) plugs cleanly into the framework.
## Latent Diffusion
Rather than diffusing in pixel space, train a VAE to compress images into a small latent, then diffuse in latent space. This is what made high-resolution image generation tractable — Stable Diffusion uses this approach, and most modern image models followed.
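A quick back-of-the-envelope calculation shows why this helps. The numbers below assume the configuration commonly cited for Stable Diffusion v1 (8x spatial downsampling, 4 latent channels); other models use different factors, and `latent_shape` is a made-up helper.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the VAE latent for an RGB image (both factors are assumptions)."""
    return (latent_channels, height // downsample, width // downsample)

pixels = 3 * 512 * 512                  # values in a 512x512 RGB image
c, h, w = latent_shape(512, 512)
latents = c * h * w                     # values the diffusion model actually sees
print(latent_shape(512, 512), pixels // latents)   # (4, 64, 64) 48
```

Under these assumptions the denoiser works on roughly 48 times fewer values per step than it would in pixel space.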
## Classifier-Free Guidance
A trick that controls the strength of conditioning at inference: run the model both conditioned (on the prompt) and unconditioned, then extrapolate from the unconditioned prediction toward the conditioned one. Higher guidance → tighter prompt adherence; too high → over-saturated or unnatural outputs.
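The extrapolation is a single line of arithmetic. The sketch below uses the convention in which a scale of 1 reproduces the plain conditional prediction; some papers parameterise the weight differently, and `cfg_combine` is an illustrative name rather than a library function.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])       # toy prediction without the prompt
eps_cond = np.array([1.0, 1.0])         # toy prediction with the prompt

no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)    # equals eps_uncond
plain_cond = cfg_combine(eps_uncond, eps_cond, 1.0)     # equals eps_cond
strong = cfg_combine(eps_uncond, eps_cond, 7.5)         # pushed past eps_cond
```

Each guided step needs two forward passes, one with the prompt and one without, so guidance roughly doubles the per-step compute unless the two are batched together.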
## Sampling Schedules
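Classical DDPM uses 1000 denoising steps. Modern samplers such as DDIM, DPM-Solver, Euler, and UniPC produce comparable quality in 10–50 steps. Flow matching, often listed alongside them, is a training objective rather than a sampler; models trained with it are typically sampled with ODE solvers such as Euler. Sampler choice and step count are now non-trivial knobs.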
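As an illustration of how a fast sampler skips steps, here is a minimal sketch of deterministic DDIM (eta = 0) visiting 20 of the 1000 training timesteps. The linear schedule, the evenly spaced timestep subset, and the `oracle_eps` stand-in for a trained network are all assumptions made so the example runs on its own.

```python
import numpy as np

def ddim_sample(eps_model, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM sampling (eta = 0) on a subsampled set of timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for i, t in enumerate(timesteps):
        eps_hat = eps_model(x, t)
        # Clean sample implied by the current noise estimate.
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        # alpha_bar at the next, less noisy timestep; 1.0 means fully clean.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        # Jump straight to that timestep, reusing the same noise direction.
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed schedule
target = np.array([2.0, -1.0])                                # toy "dataset" of one point

def oracle_eps(x, t):
    # Hypothetical stand-in for a trained network: the exact noise if x_0 == target.
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

fast_sample = ddim_sample(oracle_eps, target.shape, alpha_bar, 20, np.random.default_rng(0))
```

The model is called 20 times instead of 1000. With a real network, the quality at a given step count depends on the sampler and the timestep spacing, which is what makes these choices worth tuning.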
## Beyond Images
- **Video diffusion** — temporal extension; Sora, Veo, Runway.
- **Audio diffusion** — waveform or spectrogram-space generation.
- **Molecule and protein design** — diffusion over domain-specific representations, such as atomic coordinates or learned latents.
## Related
- [[Generative AI]]
- [[Multimodal Model]]
- [[Foundation Model]]