Adaptive Density-Aware Sampling For High-Dimensional Datasets

MARZOLA, DEVIS
2025/2026

Abstract

In the era of big data, the efficient management and analysis of large and highly non-uniform datasets have become critical challenges in machine learning and data science. This thesis addresses the problem of data reduction by proposing a novel sampling framework designed to preserve the structural properties of the feature space while reducing redundancy. The work introduces the Cluster-Based Density (CBD) sampling framework, a method that leverages density-based clustering, specifically HDBSCAN, to guide the sampling process. Unlike traditional approaches that focus primarily on statistical representativeness, the proposed method explicitly accounts for the spatial distribution of data points. By selecting fewer samples from dense regions and more from sparse areas, the framework aims to maintain a balanced and informative representation of the dataset. The effectiveness of the approach is evaluated through its application to a real-world dataset provided by an industrial partner. The results demonstrate that the CBD method significantly reduces dataset size while preserving predictive performance, particularly in complex and heterogeneous feature spaces where standard sampling techniques tend to fail. Furthermore, the thesis extends the framework to dynamic environments by introducing an adaptive procedure capable of incrementally integrating new observations. This extension allows the model to update its internal structure over time, ensuring that the sampled representation remains consistent with the evolving data distribution. The results highlight the effectiveness and scalability of the proposed method, making it a practical solution for data reduction in both static and dynamic machine learning scenarios.
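The sampling idea described in the abstract (keep fewer points from dense regions and more from sparse ones, guided by cluster labels) can be illustrated with a minimal sketch. This is not the thesis's CBD implementation. It assumes cluster labels are already available, for example from HDBSCAN with its convention of -1 for noise. The function names `cluster_spread` and `density_aware_sample`, the `budget` parameter, the size-over-spread density proxy, and the choice to keep every noise point are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def cluster_spread(members):
    """Mean Euclidean distance of the members to their centroid."""
    dim = len(members[0])
    centroid = [sum(p[i] for p in members) / len(members) for i in range(dim)]
    return sum(math.dist(p, centroid) for p in members) / len(members)


def density_aware_sample(points, labels, budget, seed=0):
    """Return the sorted indices of the retained points.

    points: sequence of equal-length coordinate tuples.
    labels: one cluster label per point; -1 marks noise (HDBSCAN convention).
    budget: approximate number of clustered points to retain.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    # Assumption: noise points lie in sparse regions, so all of them are kept.
    kept = list(groups.pop(-1, []))

    # Assumed density proxy: cluster size / spread. A sampling rate
    # proportional to 1 / density gives each cluster a quota that is
    # proportional to its spread rather than to its size.
    spreads = {
        label: cluster_spread([points[i] for i in idxs]) + 1e-12
        for label, idxs in groups.items()
    }
    total_spread = sum(spreads.values())

    for label, idxs in groups.items():
        quota = round(budget * spreads[label] / total_spread)
        quota = max(1, min(len(idxs), quota))  # at least one, at most all
        kept.extend(rng.sample(idxs, quota))
    return sorted(kept)


# Toy data: a tight cluster of 200 points and a wide cluster of 40 points.
gen = random.Random(42)
dense = [(gen.gauss(0, 0.1), gen.gauss(0, 0.1)) for _ in range(200)]
sparse = [(gen.gauss(5, 2.0), gen.gauss(5, 2.0)) for _ in range(40)]
points = dense + sparse
labels = [0] * 200 + [1] * 40

kept = density_aware_sample(points, labels, budget=40)
dense_rate = sum(1 for i in kept if i < 200) / 200
sparse_rate = sum(1 for i in kept if i >= 200) / 40
print(f"kept {len(kept)} points; dense rate {dense_rate:.3f}, sparse rate {sparse_rate:.3f}")
```

In this sketch the tight cluster is thinned heavily while the wide cluster is kept almost in full, which mirrors the qualitative behaviour the abstract describes. In practice the labels would come from a density-based clusterer rather than being set by hand, and the quota rule would be replaced by the one defined in the thesis.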
2025
Sampling
High-Dim. Dataset
Density
Files in this item:
File: Adaptive_Density_Aware_Sampling_For_High_Dimensional_Datasets.pdf
Access: open access
Size: 14.07 MB
Format: Adobe PDF

The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108233