Clustering for Large-Scale High-Dimensional Data Visualization

TAHAN, PARIA
2023/2024

Abstract

Many dimensionality reduction techniques, whether they preserve global or local structure, achieve impressive visualization quality on real-world datasets; t-SNE, UMAP, and TriMap are popular choices. The main challenge remains running time, especially on large, high-dimensional data. One potential solution is to run the embedding on a sampled subset of the original data. Although the result may be less accurate than an embedding of the full dataset, this approach significantly reduces computation time. Center-based clustering, a fundamental primitive in data analysis, makes it possible to identify landmark points that are representative of the entire dataset. This thesis introduces a technique that combines UMAP with clustering and focuses on preserving global structure. We propose four metrics, each providing a different perspective on applying UMAP over k centers. Our results are supported by a series of experiments on both real-world and synthetic datasets containing up to 15 million points. These experiments demonstrate that our algorithm produces higher-quality solutions than the standard UMAP method.
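As an illustration of how center-based clustering can yield landmark points, the following is a minimal sketch of greedy farthest-first selection, a well-known heuristic for the k-center problem. It is not taken from the thesis: the abstract does not say which clustering algorithm is used, and the function name `greedy_k_center`, its parameters, and the toy data are assumptions made for this example. The selected centers could then be passed to an embedding method such as UMAP in place of the full dataset.

```python
import math
import random


def greedy_k_center(points, k, seed=0):
    """Pick k landmark indices by farthest-first traversal.

    Illustrative only (hypothetical helper, not the thesis's algorithm).
    Returns (center_indices, radius), where radius is the largest distance
    from any point to its nearest chosen center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # arbitrary starting center
    centers = [first]
    # dist_to_center[i] = distance from point i to its nearest chosen center
    dist_to_center = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=dist_to_center.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist_to_center[i]:
                dist_to_center[i] = d
    return centers, max(dist_to_center)


# Toy data (made up for this example): three well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
landmark_ids, radius = greedy_k_center(toy_points, k=3)
landmarks = [toy_points[i] for i in landmark_ids]
```

Each iteration scans all points once, so selecting k centers costs on the order of n·k distance computations, which is the kind of cost that makes landmark selection attractive before a more expensive embedding step.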
Dimension Reduction
Clustering
UMAP
High-Dimensional
Visualization
Files in this item:
File: MsC_Thesis_Paria_Tahan_a.pdf
Access: open access
Size: 1.35 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/73732