## Definition
**Expectation-Maximization (EM)** is a general iterative algorithm for finding maximum-likelihood parameter estimates in models with *latent (unobserved) variables*. Classic applications: [[Gaussian Mixture Model]], [[Hidden Markov Model]], missing data imputation, mixture of experts.
## The Setup
Observed data $X$, latent variables $Z$, parameters $\theta$. We want $\theta$ that maximises:
$
\theta^* = \arg\max_\theta \log P(X \mid \theta) = \arg\max_\theta \log \sum_Z P(X, Z \mid \theta)
$
The log-of-sum is intractable in general. EM finesses this by iteratively constructing easier sub-problems.
## The Two Steps
### E-step
Given current $\theta^{(t)}$, compute the *posterior over latent variables*:
$
Q(Z) := P(Z \mid X, \theta^{(t)})
$
### M-step
Maximise the *expected complete-data log-likelihood* over $\theta$:
$
\theta^{(t+1)} = \arg\max_\theta \, \mathbb{E}_{Z \sim Q}[\log P(X, Z \mid \theta)]
$
The maximisation is usually tractable in closed form for nice model families.
## Why It Works
Each iteration **monotonically increases** (or leaves unchanged) the data log-likelihood $\log P(X \mid \theta)$. This is the key theoretical property. Convergence to a local maximum is guaranteed.
EM is a coordinate-ascent on a lower bound (the ELBO — Evidence Lower BOund) of the log-likelihood. Variational inference generalises this view.
## Examples
### GMM
- E-step: compute responsibilities $\gamma_{ic}$ — probability that point $i$ belongs to component $c$.
- M-step: update means, covariances, weights using responsibility-weighted statistics.
### HMM (Baum-Welch)
- E-step: forward-backward computes posterior probabilities of state sequences.
- M-step: update transition and emission probabilities.
### Missing Data
- E-step: impute expected values for missing entries given current model.
- M-step: refit model treating imputed values as observed.
## Properties
- **Sensitive to initialisation.** Different starting $\theta$ → different local optima. Run from multiple seeds.
- **Converges slowly** — linear convergence, vs Newton's quadratic. But each step is cheap.
- **No guaranteed global optimum.** Combine with simulated annealing or many restarts for hard problems.
## Generalisations
- **Variational EM.** Replace exact $Q$ with a tractable variational distribution. Allows handling models where exact E-step is intractable.
- **Stochastic EM.** Update on mini-batches.
- **Generalised EM (GEM).** M-step doesn't fully maximise; just improves. Still monotonic.
## When to Use
- Models with latent variables and tractable posterior $P(Z \mid X, \theta)$.
- Clustering with soft assignments.
- Sequence models with hidden states.
- Missing data.
## Related
- [[Gaussian Mixture Model]]
- [[Hidden Markov Model]]
- [[Bayesian Network]]
- [[Unsupervised Learning]]