Clustering for Large-Scale High-Dimensional Data Visualization

Many dimensionality reduction techniques, whether they preserve global or local structure, deliver impressive visualization quality on real-world datasets; t-SNE, UMAP, and TriMap are popular choices. The main challenge remains running time, especially on large, high-dimensional data. One potential solution is to run the embedding on a sample of the original data. Although this may not yield results as accurate as those obtained from the full dataset, it significantly reduces computation time. Center-based clustering is a fundamental primitive in data analysis that makes it possible to identify key landmark points representative of the entire dataset. This thesis introduces a technique that combines UMAP with clustering, with a focus on preserving global structure. We propose four metrics, each providing a different perspective on applying UMAP over k centers. Our results are supported by a series of experiments on both real-world and synthetic datasets containing up to 15 million points. These experiments demonstrate that our algorithm produces higher-quality solutions than the standard UMAP method.
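The landmark idea described above can be illustrated with a small sketch. The abstract does not say which center-based clustering the thesis uses, so the code below assumes the classic greedy farthest-first traversal for the k-center objective as a stand-in. The function name `greedy_k_center`, the toy data, and all parameter values are hypothetical.

```python
import math
import random


def greedy_k_center(points, k, first=0):
    """Farthest-first traversal: pick k landmark indices from `points`.

    Classic greedy 2-approximation for the k-center objective. This is an
    illustrative assumption, not necessarily the clustering used in the thesis.
    Returns the chosen indices and the resulting clustering radius.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [first]
    # dist[i] = distance from point i to its nearest chosen center so far
    dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
    return centers, max(dist)


# Hypothetical toy data: three well-separated blobs of 50 points in 5 dimensions.
rng = random.Random(0)
blobs = [(0.0,) * 5, (10.0,) * 5, (-10.0,) * 5]
data = [tuple(c + rng.gauss(0, 0.5) for c in blob) for blob in blobs for _ in range(50)]

landmarks, radius = greedy_k_center(data, k=3)
```

The returned indices could then serve as the reduced input to an embedding method such as UMAP. This pure-Python loop is meant for illustration only; data at the scale mentioned in the abstract would call for a vectorized or parallel implementation.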

TAHAN, PARIA

2023/2024

Faculty/Department: Dipartimento di Ingegneria dell'Informazione - DEI
Degree program: COMPUTER ENGINEERING, Laurea Magistrale (D.M. 270/2004)
Academic year: 2023
English title: Clustering for Large-Scale High-Dimensional Data Visualization
Keywords: Dimension Reduction; Clustering; UMAP; High-Dimensional; Visualization
Supervisor: CECCARELLO, MATTEO
Appears in categories: Master's theses (Lauree magistrali)

Files in this item:

File	Size	Format
MsC_Thesis_Paria_Tahan_a.pdf (open access)	1.35 MB	Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/73732