Reproducing and Extending the Sample-and-Clean Framework: From Relational Data to Knowledge Graphs

FINCATO, SAVERIO
2024/2025

Abstract

Obtaining timely, high-quality answers to aggregate queries remains a significant challenge in modern Big Data scenarios, primarily because processing and cleaning vast, noisy datasets is complex and costly. To address this challenge, the Sample-and-Clean framework (Wang et al., SIGMOD 2014) proposed a hybrid approach to sampling-based approximate query processing (SAQP) that combines sampling with partial data cleaning to answer aggregate queries on dirty data quickly and accurately. In this thesis, we first reproduce the original experiments of the Sample-and-Clean framework. Our reproduction covers both synthetic and real datasets, namely TPC-H and DBpedia, follows the same evaluation methodology as the original study, and provides insights into the framework’s consistency across different environments. We then extend Sample-and-Clean to knowledge graphs. By adapting the framework to graph-structured data, we investigate whether the same trade-offs between query accuracy, cleaning cost, and efficiency hold in graph-based query processing. Our experiments show that the framework behaves consistently in this new domain, improving on both accuracy and efficiency relative to traditional full-cleaning or no-cleaning approaches. This thesis contributes both a reproducibility validation of a key framework used as a baseline for approximate query processing and an exploration of its applicability beyond relational data. By reporting its effectiveness on knowledge graphs, we highlight the framework’s potential for broader adoption in graph-based data systems.
2024
Database
Knowledge Graphs
Query
SAQP
Files in this item:
File: Fincato_Saverio.pdf (open access)
Size: 7.32 MB
Format: Adobe PDF

The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/99603