Fast and accurate manifold learning using approximate nearest neighbor search
FOTI, GIOVANNI
2023/2024
Abstract
Dataset embedding and visualization are crucial tasks in machine learning, as they allow us to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both the local and global structure of the data. Its computational cost is composed of the cost of constructing the k-nearest-neighbors graph and the cost of the layout optimization phase based on Stochastic Gradient Descent. This work explores how approximate k-nearest-neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. Using approximate k-nearest-neighbor search algorithms such as HNSW, we achieved up to a 9x speedup in computing UMAP embeddings while preserving their quality for both visualization and downstream machine learning tasks.
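The abstract describes replacing UMAP's exact k-nearest-neighbor graph construction with an approximate index such as HNSW. The following is a minimal sketch of that step using the hnswlib library; the toy dataset and the parameter values (M, ef_construction, ef, k) are illustrative assumptions, not the configuration used in the thesis.

```python
# Minimal sketch (assumed libraries: hnswlib, numpy): building an approximate
# k-nearest-neighbor graph with HNSW, the step that replaces UMAP's exact
# kNN construction. All parameter values below are illustrative.
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
data = rng.random((10_000, 50)).astype(np.float32)   # toy dataset
n, dim = data.shape
k = 15                                               # UMAP's n_neighbors

# Build the HNSW index; M and ef_construction trade build time for recall.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

# ef controls the query-time accuracy/speed trade-off (must be >= k).
index.set_ef(max(k, 50))

# Query each point for k+1 neighbors: the closest hit is (usually) the point itself.
knn_indices, knn_sq_dists = index.knn_query(data, k=k + 1)
knn_indices = knn_indices[:, 1:]
knn_dists = np.sqrt(knn_sq_dists[:, 1:])             # 'l2' space returns squared distances

# knn_indices / knn_dists now approximate the kNN graph consumed by UMAP's
# fuzzy-simplicial-set construction and SGD layout optimization.
print(knn_indices.shape, knn_dists.shape)
```

Recent versions of umap-learn can accept a precomputed neighbor graph of this form (e.g., via a precomputed_knn argument), which is one way the reported speedup could be realized; the exact interface depends on the library version.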
File | Size | Format
---|---|---
Foti_Giovanni.pdf (open access) | 26.21 MB | Adobe PDF
https://hdl.handle.net/20.500.12608/78054