Principal Component Analysis - Albert Masoliver's learning site

## Definition

**Principal Component Analysis (PCA)** is the canonical linear [[Dimensionality Reduction]] technique. It finds orthogonal directions ("principal components") that capture maximum variance in the data, then projects the data onto the top $k$ of them.

## Mathematical Formulation

Given a centred data matrix $X \in \mathbb{R}^{n \times d}$ (every column has zero mean):

1. Compute the covariance matrix $C = \frac{1}{n-1} X^\top X$.
2. Eigendecompose $C = V \Lambda V^\top$.
3. Sort the eigenvalues $\lambda_i$ in decreasing order.
4. Take the top $k$ eigenvectors as the columns of $W \in \mathbb{R}^{d \times k}$.
5. Project: $Z = X W$, with $Z \in \mathbb{R}^{n \times k}$.

Equivalently, use the SVD of $X$: $X = U \Sigma V^\top$. The columns of $V$ are the principal components, and the eigenvalues follow from the singular values as $\lambda_i = \sigma_i^2 / (n-1)$.

## What It Optimises

PCA finds the $k$-dimensional subspace that minimises the squared reconstruction error:

$$
\min_W \| X - X W W^\top \|_F^2 \quad \text{subject to} \quad W^\top W = I
$$

Equivalently, it maximises the variance of the projected data.

## Explained Variance

The $i$-th eigenvalue $\lambda_i$ equals the variance captured by the $i$-th component. The **explained variance ratio** of that component is $\lambda_i / \sum_j \lambda_j$. Plotting cumulative explained variance against $k$ is the standard way to choose how many components to keep; a common target is 90-99% of the total variance.

## Properties

- **Linear.** Cannot capture non-linear manifolds.
- **Variance-based.** PCA treats high variance as signal, so a feature measured on a large scale can dominate the components even when its variance is mostly *noise*. Standardise features first unless they already share a common scale ([[Feature Scaling]]).
- **Unsupervised.** Doesn't use labels. PCA can drop directions that *separate classes* if those directions have low variance.
- **Reversible (with information loss).** The reconstruction $\hat X = Z W^\top$ approximates $X$.

## Variants

- **Kernel PCA.** Applies PCA in a feature space induced by a kernel, which extends it to non-linear data.
- **Sparse PCA.** Encourages sparse components for interpretability.
- **Incremental PCA.** Processes data in mini-batches; useful when the full dataset doesn't fit in memory.
- **Robust PCA.** Decomposes the data into low-rank + sparse components; useful with outliers.

## Common Uses

- **Visualisation** (project to 2-3 dimensions).
- **Compression** (store $k$-dimensional codes).
- **Preprocessing** for downstream models.
- **Noise removal** (keep top components only).
- **Eigenfaces** (principal components of face images, a historically important computer vision technique).

## When PCA Fails

- **Strongly non-linear data.** Manifolds become tangled under a linear projection; consider t-SNE / UMAP.
- **Labels matter.** PCA is unsupervised; LDA or other supervised methods may serve better.
- **Unscaled features.** Failing to standardise gives the largest-scale feature undue weight.

## Related

- [[Dimensionality Reduction]]
- [[t-SNE]]
- [[UMAP]]
- [[Feature Scaling]]
- [[Embedding]]
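
## Worked Example

The steps from the Mathematical Formulation section can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `pca_eig` and `pca_svd` and the synthetic dataset are assumptions made up for this example. It runs both the covariance route and the SVD route, then computes the explained variance ratio and the reconstruction error.

```python
import numpy as np


def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                          # centre the data
    n = Xc.shape[0]
    C = Xc.T @ Xc / (n - 1)                # step 1: covariance matrix, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(C)   # step 2: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # step 3: sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                     # step 4: top-k eigenvectors as columns
    Z = Xc @ W                             # step 5: project, shape (n, k)
    return Z, W, eigvals, mean


def pca_svd(X, k):
    """Same result via the SVD of the centred data (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                           # rows of Vt are the principal components
    eigvals = S**2 / (Xc.shape[0] - 1)     # lambda_i = sigma_i^2 / (n - 1)
    return Xc @ W, W, eigvals


# Synthetic data (an assumption for illustration): 200 samples, 5 features,
# generated from 2 latent directions plus a little isotropic noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

Z, W, eigvals, mean = pca_eig(X, k=2)
Z_svd, W_svd, eigvals_svd = pca_svd(X, k=2)

# Explained variance ratio and its cumulative sum, used to choose k.
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Reconstruction: add the mean back because Z was computed from centred data.
X_hat = Z @ W.T + mean
recon_error = np.linalg.norm(X - X_hat) ** 2

print("explained variance ratio:", np.round(explained_ratio, 4))
print("cumulative:", np.round(cumulative, 4))
print("squared reconstruction error:", recon_error)
```

The two routes agree only up to the sign of each component, since an eigenvector and its negation are equally valid. The squared reconstruction error equals $(n-1)$ times the sum of the discarded eigenvalues, which is why keeping the largest eigenvalues minimises it.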