Clustering for Large-Scale High-Dimensional Data Visualization
TAHAN, PARIA
2023/2024
Abstract
Many dimensionality reduction techniques, whether they preserve global or local structure, demonstrate impressive visualization performance on real-world datasets. Techniques such as t-SNE, UMAP, and TriMap are popular choices. However, the main challenge remains the running time, especially when working with large, high-dimensional data. One potential solution is to run the embedding process on a sample of the original data. While this approach might not yield results as accurate as those obtained from the full dataset, it significantly reduces computation time. Center-based clustering is a fundamental primitive in data analysis that makes it possible to identify key landmark points representative of the entire dataset. This thesis introduces a technique that combines UMAP with clustering and focuses on global structure preservation. We propose four metrics, each providing a different perspective on applying UMAP over k centers. Our results are supported by a series of experiments on both real-world and synthetic datasets containing up to 15 million points. These experiments demonstrate that our algorithm produces higher-quality solutions than the standard UMAP method.
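The landmark-selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which center-based clustering algorithm the thesis uses, so the greedy farthest-first traversal below is only an assumption chosen for illustration, and the function name `greedy_k_center` is hypothetical. The selected landmarks could then be passed to any UMAP implementation in place of the full dataset.

```python
import math
import random


def greedy_k_center(points, k, seed=0):
    """Farthest-first traversal: return the indices of k landmark points.

    Illustrative assumption only; not necessarily the thesis's algorithm.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]  # arbitrary first center
    # dist_to_center[i]: distance from point i to its nearest chosen center
    dist_to_center = [math.dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=dist_to_center.__getitem__)
        centers.append(nxt)
        # Update each point's distance to its nearest center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist_to_center[i]:
                dist_to_center[i] = d
    return centers


# Toy data: three well-separated groups of 50 two-dimensional points.
rng = random.Random(1)
points = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in [(0, 0), (100, 0), (0, 100)]
    for _ in range(50)
]
landmarks = [points[i] for i in greedy_k_center(points, k=3)]
```

Each iteration scans every point once, so the sketch costs on the order of n·k distance computations, which is far fewer than the work needed to embed all n points directly.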
File | Size | Format | Access
---|---|---|---
MsC_Thesis_Paria_Tahan_a.pdf | 1.35 MB | Adobe PDF | Open access
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/73732