Reproducing and Extending the Sample-and-Clean Framework: From Relational Data to Knowledge Graphs
FINCATO, SAVERIO
2024/2025
Abstract
Obtaining timely, high-quality answers to aggregate queries remains a significant challenge in modern Big Data scenarios, primarily because of the complexity of processing and cleaning vast, noisy datasets. To address this challenge, the Sample-and-Clean framework (Wang et al., SIGMOD 2014) proposed a hybrid approach to sampling-based approximate query processing (SAQP) that combines sampling with partial data cleaning to achieve fast and accurate aggregate query processing on dirty data. In this thesis, we first reproduce the original experiments of the Sample-and-Clean framework. Our reproduction covers both synthetic and real datasets, namely TPC-H and DBpedia, follows the same evaluation methodology as the original study, and provides insights into the framework’s consistency across different environments. We then extend Sample-and-Clean to knowledge graphs. By adapting the framework to graph-structured data, we investigate whether the same trade-offs between query accuracy, cleaning cost, and efficiency hold in graph-based query processing. Our experiments show that the framework remains consistent in this new domain, offering improvements in both accuracy and efficiency over traditional full-cleaning or no-cleaning approaches. This thesis contributes both a reproducibility validation of a key framework used as a baseline for approximate query processing and an exploration of its applicability beyond relational data. By reporting its effectiveness on knowledge graphs, we highlight the framework’s potential for broader adoption in graph-based data systems.

| File | Access | Size | Format |
|---|---|---|---|
| Fincato_Saverio.pdf | Open access | 7.32 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/99603