Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.

Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.

Fast and accurate manifold learning using approximate nearest neighbor search

FOTI, GIOVANNI
2023/2024

Abstract

Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.
2023
Fast and accurate manifold learning using approximate nearest neighbor search
Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.
data visualization
machine learning
similarity search
manifold learning
File in questo prodotto:
File Dimensione Formato  
Foti_Giovanni.pdf

accesso aperto

Dimensione 26.21 MB
Formato Adobe PDF
26.21 MB Adobe PDF Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/78054