Adaptive Density-Aware Sampling For High-Dimensional Datasets

In the era of big data, the efficient management and analysis of large and highly non-uniform datasets have become critical challenges in machine learning and data science. This thesis addresses the problem of data reduction by proposing a novel sampling framework designed to preserve the structural properties of the feature space while reducing redundancy.

The work introduces the Cluster-Based Density (CBD) sampling framework, a method that leverages density-based clustering, specifically HDBSCAN, to guide the sampling process. Unlike traditional approaches that focus primarily on statistical representativeness, the proposed method explicitly accounts for the spatial distribution of data points. By selecting fewer samples from dense regions and more from sparse areas, the framework aims to maintain a balanced and informative representation of the dataset.

The effectiveness of the approach is evaluated through its application to a real-world dataset provided by an industrial partner. The results demonstrate that the CBD method significantly reduces dataset size while preserving predictive performance, particularly in complex and heterogeneous feature spaces where standard sampling techniques tend to fail.

Furthermore, the thesis extends the framework to dynamic environments by introducing an adaptive procedure capable of incrementally integrating new observations. This extension allows the model to update its internal structure over time, ensuring that the sampled representation remains consistent with the evolving data distribution.

The results highlight the effectiveness and scalability of the proposed method, making it a practical solution for data reduction in both static and dynamic machine learning scenarios.
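The allocation idea in the abstract (fewer samples from dense regions, more from sparse ones) can be illustrated with a minimal sketch. It is not the thesis's actual CBD procedure, which this record does not specify. It assumes cluster labels have already been computed by HDBSCAN (for example with scikit-learn's `sklearn.cluster.HDBSCAN`, where `-1` marks noise). The function name `density_aware_sample`, the `budget` parameter, the use of cluster size as a stand-in for density, and the square-root weighting are all illustrative assumptions.

```python
import math
import random
from collections import defaultdict


def density_aware_sample(labels, budget, noise_label=-1, seed=0):
    """Return row indices, sampling large clusters at a lower rate.

    labels : one cluster label per row (hypothetically from HDBSCAN).
    budget : approximate number of rows to keep from clustered points.
    """
    rng = random.Random(seed)

    # Group row indices by cluster label.
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)

    # Assumption: noise points lie in sparse regions, so all are kept.
    kept = list(members.pop(noise_label, []))

    # Assumption: weight each cluster by sqrt(size), so the sampling
    # rate (quota / size) shrinks as the cluster grows.
    weights = {lab: math.sqrt(len(idxs)) for lab, idxs in members.items()}
    total = sum(weights.values())

    for lab, idxs in members.items():
        quota = max(1, round(budget * weights[lab] / total))
        quota = min(quota, len(idxs))  # never ask for more rows than exist
        kept.extend(rng.sample(idxs, quota))

    return sorted(kept)


# Toy labels: one large cluster, one small cluster, a few noise points.
labels = [0] * 900 + [1] * 100 + [-1] * 5
picked = density_aware_sample(labels, budget=100)

# Cluster 0 keeps 75 of 900 rows (about 8%); cluster 1 keeps 25 of 100 (25%).
per_cluster = defaultdict(int)
for i in picked:
    per_cluster[labels[i]] += 1
print(dict(per_cluster))
```

In this toy run the larger cluster contributes more rows in absolute terms but is sampled at a much lower rate, which is the qualitative behaviour the abstract describes. A real implementation would likely use a genuine density estimate rather than cluster size.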

MARZOLA, DEVIS

2025/2026

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				DATA SCIENCE  Laurea Magistrale (D.M. 270/2004)
			
	Academic year
	
				2025
			
	English title
	
				Adaptive Density-Aware Sampling For High-Dimensional Datasets
			
	Keywords
	
				Sampling
High-Dim. Dataset
Density
			
	Supervisor
	
				SUSTO, GIAN ANTONIO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Adaptive_Density_Aware_Sampling_For_High_Dimensional_Datasets.pdf (open access)	14.07 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108233