## Definition
**UMAP** (Uniform Manifold Approximation and Projection) is a non-linear dimensionality-reduction technique introduced by McInnes, Healy & Melville (2018). It is theoretically grounded in Riemannian geometry and algebraic topology. In practice it is usually faster than [[t-SNE]] and tends to preserve more global structure, which often makes its visualisations easier to interpret.
## How It Works (Intuitively)
1. **Build a fuzzy simplicial set** representing the manifold in high-dim space — essentially a weighted neighbourhood graph with adaptive bandwidths.
2. **Construct a similar fuzzy graph in low-dim space.**
3. **Optimise the low-dim graph** to match the high-dim graph by minimising cross-entropy.

The mathematical machinery (fuzzy topology, simplicial sets) is sophisticated, but the practical procedure is simple: build a weighted nearest-neighbour graph, then lay it out in low dimensions by stochastic gradient descent.
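
The three steps can be illustrated with a minimal pure-Python sketch. It is not the reference implementation: the function names are hypothetical, the bandwidth search is simplified, and the `a`, `b` values in `low_dim_weight` are illustrative placeholders rather than fitted constants. It assumes the usual formulation in which each point's weights decay exponentially beyond its nearest-neighbour distance and are calibrated to sum to `log2(k)`.

```python
import math


def membership_weights(knn_dists, n_iter=64, tol=1e-6):
    """Directed fuzzy weights from one point to its k nearest neighbours.

    knn_dists: ascending distances to the k >= 2 nearest neighbours
    (the point itself is excluded).
    """
    k = len(knn_dists)
    rho = knn_dists[0]            # distance to the closest neighbour
    target = math.log2(k)         # weights are calibrated to sum to log2(k)

    def weights(sigma):
        return [math.exp(-max(d - rho, 0.0) / sigma) for d in knn_dists]

    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):       # binary search for the adaptive bandwidth
        total = sum(weights(sigma))
        if abs(total - target) < tol:
            break
        if total > target:        # weights too large: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2
        else:                     # weights too small: grow sigma
            lo = sigma
            sigma = sigma * 2 if math.isinf(hi) else (lo + hi) / 2
    return weights(sigma)


def fuzzy_union(w_ij, w_ji):
    """Symmetrise two directed weights (probabilistic t-conorm)."""
    return w_ij + w_ji - w_ij * w_ji


def low_dim_weight(dist, a=1.0, b=1.0):
    """Edge weight in the embedding; a and b are illustrative values."""
    return 1.0 / (1.0 + a * dist ** (2 * b))


def fuzzy_cross_entropy(high, low, eps=1e-12):
    """Cross-entropy between matching high-dim and low-dim edge weights."""
    total = 0.0
    for v, w in zip(high, low):
        w = min(max(w, eps), 1.0 - eps)   # keep the logarithms finite
        if v > 0.0:
            total += v * math.log(v / w)
        if v < 1.0:
            total += (1.0 - v) * math.log((1.0 - v) / (1.0 - w))
    return total


# Step 1 for a single point with four neighbours.
w = membership_weights([0.5, 1.0, 1.5, 2.0])
print(w)
```

The nearest neighbour always receives weight 1, which is what gives every point at least one strong connection in the graph. Step 3 then moves the low-dimensional points so that `fuzzy_cross_entropy` between the two sets of edge weights decreases.
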
## Key Hyperparameters
- **`n_neighbors`** (typical: 5-50, default 15). Controls the balance between local and global structure. Smaller values emphasise tight clusters; larger values reveal broader topology.
- **`min_dist`** (typical: 0.0-0.99, default 0.1). Minimum allowed distance between points in the low-dim representation. Smaller values produce tighter clusters; larger values spread points out.
- **`metric`** — distance metric in the original space. Defaults to Euclidean; supports cosine, hamming, etc.
## Advantages Over t-SNE
| Property | t-SNE | UMAP |
|---|---|---|
| Speed | Slows markedly beyond tens of thousands of points | Scales to millions of points |
| Global structure | Distorted | Better preserved |
| Embedding new points | Requires re-running (no built-in out-of-sample mapping) | New points can be projected into an existing embedding without re-fitting |
| Output dimensionality | Typically 2D/3D | Arbitrary $k$ |
| Theoretical grounding | Probabilistic (KL divergence between neighbour distributions) | Riemannian geometry / algebraic topology |
## Preserved Properties
- **Local neighbourhoods** (like t-SNE).
- **Some global structure** — clusters and the gross distances between them are more interpretable than in t-SNE.
- **Topology of connected components.**
## Cautions
- **Hyperparameter sensitivity.** Different `n_neighbors` and `min_dist` values produce different embeddings. Best practice: try several combinations and trust only the structure that persists across them.
- **Distances are not preserved.** Like t-SNE, UMAP distorts metric properties; absolute distances and apparent cluster sizes in the embedded space should not be read quantitatively.
- **Stochastic.** Different random seeds give different layouts (though typically more stable than t-SNE).
## Common Uses
- **Visualisation** of high-dimensional data (genomics, embeddings, document collections).
- **Pre-processing** for downstream models — *unlike* t-SNE, UMAP embeddings are often usable as features.
- **Clustering** — UMAP + DBSCAN/HDBSCAN is a popular combination.
- **Embedding inspection** — visualise sentence embeddings, image embeddings, etc.
## Implementations
- **Python:** `umap-learn` (reference).
- **GPU:** RAPIDS cuML's UMAP.
- **Related methods (not UMAP implementations):** PaCMAP (similar goal with a pair-based objective, sometimes faster) and TriMap (triplet-based objective).
## Related
- [[Dimensionality Reduction]]
- [[t-SNE]]
- [[Principal Component Analysis]]
- [[Embedding]]