UMAP - Albert Masoliver's learning site

## Definition

**UMAP** (Uniform Manifold Approximation and Projection) is a non-linear dimensionality-reduction technique introduced by McInnes, Healy & Melville (2018). Its theory draws on Riemannian geometry and algebraic topology; in practice it is usually faster than [[t-SNE]], tends to preserve more global structure, and often gives visualisations that are easier to interpret.

## How It Works (Intuitively)

1. **Build a fuzzy simplicial set** representing the manifold in high-dim space: essentially a weighted neighbourhood graph with adaptive, per-point bandwidths.
2. **Construct a similar fuzzy graph in low-dim space.**
3. **Optimise the low-dim graph** to match the high-dim graph by minimising the cross-entropy between the two sets of edge weights.

The mathematical machinery (fuzzy topology, simplicial sets) is sophisticated; the practical result is an embedding that is fast to compute and usually easy to read.

## Key Hyperparameters

- **`n_neighbors`** (typical: 5-50, default 15). Controls the balance between local and global structure. Smaller values emphasise tight clusters; larger values reveal broader topology.
- **`min_dist`** (typical: 0.0-0.99, default 0.1). Minimum allowed distance between points in the low-dim representation. Smaller values produce tighter clusters; larger values spread points out.
- **`metric`**: distance metric in the original space. Defaults to Euclidean; supports cosine, Hamming, etc.

## Advantages Over t-SNE

| Property | t-SNE | UMAP |
|---|---|---|
| Speed | Slow on >10k points | Scales to millions |
| Global structure | Distorted | Better preserved |
| Embedding new points | Requires re-running | Supports transforming new points (no re-fit) |
| Multiple components | Yes (2D/3D) | Yes (arbitrary $k$) |
| Theoretical grounding | Heuristic | Riemannian / algebraic topology |

## Preserved Properties

- **Local neighbourhoods** (like t-SNE).
- **Some global structure**: clusters and the gross distances between them are more interpretable than in t-SNE.
- **Topology of connected components.**

## Cautions

- **Hyperparameter sensitivity.** Different `n_neighbors` and `min_dist` values produce different embeddings. Best practice: try a few combinations and compare the results.
- **Embedded distances are not faithful.** Like t-SNE, UMAP does not preserve the original distances; absolute distances in the embedded space are not meaningful.
- **Stochastic.** Different random seeds give different layouts (though typically more stable than t-SNE).

## Common Uses

- **Visualisation** of high-dimensional data (genomics, embeddings, document collections).
- **Pre-processing** for downstream models: *unlike* t-SNE, UMAP embeddings are often usable as features.
- **Clustering**: UMAP + DBSCAN/HDBSCAN is a popular combination.
- **Embedding inspection**: visualise sentence embeddings, image embeddings, etc.

## Implementations

- **Python:** `umap-learn` (reference).
- **GPU:** RAPIDS cuML's UMAP.
- **Related methods:** PaCMAP (similar idea, sometimes faster), TriMap.

## Related

- [[Dimensionality Reduction]]
- [[t-SNE]]
- [[Principal Component Analysis]]
- [[Embedding]]
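
## Worked Sketch

The three steps under "How It Works" can be made concrete with a small pure-Python sketch. It is a simplified illustration, not the `umap-learn` implementation: the function names (`membership_weights`, `fuzzy_union`, `low_dim_weight`, `cross_entropy`) are made up for this note, the bandwidth search interval is an assumption, and the low-dim curve uses `a = b = 1` for simplicity, whereas the library fits those two constants from `min_dist`.

```python
import math

def membership_weights(dists, n_iter=64):
    """Step 1: fuzzy edge weights from one point to its k nearest neighbours."""
    rho = min(dists)                  # distance to the closest neighbour
    target = math.log2(len(dists))    # bandwidth is tuned so weights sum to log2(k)
    lo, hi = 1e-9, 1e3                # assumed search interval for sigma
    for _ in range(n_iter):           # bisection: the sum grows with sigma
        sigma = (lo + hi) / 2
        total = sum(math.exp(-(d - rho) / sigma) for d in dists)
        if total > target:
            hi = sigma
        else:
            lo = sigma
    return [math.exp(-(d - rho) / sigma) for d in dists]

def fuzzy_union(w_ij, w_ji):
    """Symmetrise the two directed weights of an edge (probabilistic OR)."""
    return w_ij + w_ji - w_ij * w_ji

def low_dim_weight(dist, a=1.0, b=1.0):
    """Step 2: edge weight implied by a distance in the embedding (a, b simplified)."""
    return 1.0 / (1.0 + a * dist ** (2 * b))

def cross_entropy(high, low, eps=1e-12):
    """Step 3: fuzzy-set cross-entropy between matching edge weights."""
    total = 0.0
    for v, w in zip(high, low):
        v = min(max(v, eps), 1 - eps)   # clip to keep the logs finite
        w = min(max(w, eps), 1 - eps)
        total += v * math.log(v / w) + (1 - v) * math.log((1 - v) / (1 - w))
    return total

# Toy data: distances from one point to its 4 nearest neighbours.
weights = membership_weights([0.5, 1.0, 1.5, 3.0])
# Two candidate layouts: one keeps near neighbours near, one reverses them.
close = [low_dim_weight(d) for d in (0.1, 0.8, 1.2, 3.0)]
far = [low_dim_weight(d) for d in (3.0, 1.2, 0.8, 0.1)]
print(cross_entropy(weights, close), cross_entropy(weights, far))
```

With these toy numbers, the layout that keeps near neighbours near gives the lower cross-entropy, which is the quantity the optimiser pushes down. In `umap-learn` all of this sits behind a call such as `umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)`.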