Expectation-Maximization - Albert Masoliver's learning site

## Definition **Expectation-Maximization (EM)** is a general iterative algorithm for finding maximum-likelihood parameter estimates in models with *latent (unobserved) variables*. Classic applications: [[Gaussian Mixture Model]], [[Hidden Markov Model]], missing data imputation, mixture of experts. ## The Setup Observed data $X$, latent variables $Z$, parameters $\theta$. We want $\theta$ that maximises: $ \theta^* = \arg\max_\theta \log P(X \mid \theta) = \arg\max_\theta \log \sum_Z P(X, Z \mid \theta) $ The log-of-sum is intractable in general. EM finesses this by iteratively constructing easier sub-problems. ## The Two Steps ### E-step Given current $\theta^{(t)}$, compute the *posterior over latent variables*: $ Q(Z) := P(Z \mid X, \theta^{(t)}) $ ### M-step Maximise the *expected complete-data log-likelihood* over $\theta$: $ \theta^{(t+1)} = \arg\max_\theta \, \mathbb{E}_{Z \sim Q}[\log P(X, Z \mid \theta)] $ The maximisation is usually tractable in closed form for nice model families. ## Why It Works Each iteration **monotonically increases** (or leaves unchanged) the data log-likelihood $\log P(X \mid \theta)$. This is the key theoretical property. Convergence to a local maximum is guaranteed. EM is a coordinate-ascent on a lower bound (the ELBO — Evidence Lower BOund) of the log-likelihood. Variational inference generalises this view. ## Examples ### GMM - E-step: compute responsibilities $\gamma_{ic}$ — probability that point $i$ belongs to component $c$. - M-step: update means, covariances, weights using responsibility-weighted statistics. ### HMM (Baum-Welch) - E-step: forward-backward computes posterior probabilities of state sequences. - M-step: update transition and emission probabilities. ### Missing Data - E-step: impute expected values for missing entries given current model. - M-step: refit model treating imputed values as observed. ## Properties - **Sensitive to initialisation.** Different starting $\theta$ → different local optima. Run from multiple seeds. - **Converges slowly** — linear convergence, vs Newton's quadratic. But each step is cheap. - **No guaranteed global optimum.** Combine with simulated annealing or many restarts for hard problems. ## Generalisations - **Variational EM.** Replace exact $Q$ with a tractable variational distribution. Allows handling models where exact E-step is intractable. - **Stochastic EM.** Update on mini-batches. - **Generalised EM (GEM).** M-step doesn't fully maximise; just improves. Still monotonic. ## When to Use - Models with latent variables and tractable posterior $P(Z \mid X, \theta)$. - Clustering with soft assignments. - Sequence models with hidden states. - Missing data. ## Related - [[Gaussian Mixture Model]] - [[Hidden Markov Model]] - [[Bayesian Network]] - [[Unsupervised Learning]]