Fast and accurate manifold learning using approximate nearest neighbor search

Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.

Fast and accurate manifold learning using approximate nearest neighbor search

FOTI, GIOVANNI

2023/2024

Abstract

Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.

Scheda

Scheda DC

	Facoltà/Dipartimento
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Corso di studio
	
				COMPUTER ENGINEERING Laurea Magistrale (D.M. 270/2004)
			
	Anno Accademico
	
				2023
			
	Titolo inglese
	
				Fast and accurate manifold learning using approximate nearest neighbor search
			
	Abstract in italiano
	
				Dataset embedding and visualization are crucial tasks in machine learning as they allow to gather information about the data that can be leveraged to design downstream systems. UMAP is a commonly used dimensionality reduction algorithm because of its ability to accurately represent both local and global structures of data. Its computational cost is composed by the cost for the construction of the k nearest neighbors graph and the cost for the layout optimization phase using Stochastic Gradient Descent. This work explores how using approximate k nearest neighbor search indexes can speed up the algorithm while preserving the quality of the embedding. We were able to achieve up to a 9x speedup in UMAP embeddings using approximate k-nearest neighbor search algorithms like HNSW, while preserving the quality of the embeddings for both visualization and downstream machine learning tasks.
			
	Parola chiave
	
				data visualization
machine learning
similarity search
manifold learning
			
	Relatore
	
				CECCARELLO, MATTEO
			
	Appare nelle tipologie:
	
				Lauree magistrali

File in questo prodotto:

File	Dimensione	Formato
Foti_Giovanni.pdf accesso aperto Dimensione 26.21 MB Formato Adobe PDF Visualizza/Apri	26.21 MB	Adobe PDF	Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/78054