Reproducing and Extending the Sample-and-Clean Framework: From Relational Data to Knowledge Graphs

Obtaining timely, high-quality answers to aggregate queries remains a significant challenge in modern Big Data scenarios, primarily due to the complexity of processing and cleaning vast, noisy datasets. To address this challenge, the Sample-and-Clean framework (Wang et al., SIGMOD 2014) proposed a hybrid approach to sampling-based approximate query processing (SAQP) that combines sampling with partial data cleaning to answer aggregate queries on dirty data quickly and accurately. In this thesis, we first reproduce the original experiments of the Sample-and-Clean framework. Our reproduction covers both synthetic and real datasets, namely TPC-H and DBpedia, follows the same evaluation methodology as the original study, and provides insights into the framework’s consistency across different environments. We then extend Sample-and-Clean to knowledge graphs. By adapting the framework to graph-structured data, we investigate whether the same trade-offs between query accuracy, cleaning cost, and efficiency hold in graph-based query processing. Our experiments demonstrate that the framework remains consistent in this new domain, offering improvements in both accuracy and efficiency over traditional full-cleaning or no-cleaning approaches. This thesis contributes both a reproducibility validation of a key framework used as a baseline for approximate query processing and an exploration of its applicability beyond relational data. By reporting its effectiveness on knowledge graphs, we highlight the framework’s potential for broader adoption in graph-based data systems.
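
The idea of combining sampling with partial cleaning can be illustrated with a small sketch. It is not taken from the thesis or from the original paper's code: the toy data, the `clean` function, and the name `sample_and_clean_avg` are all hypothetical, and the confidence interval uses a plain normal approximation. The sketch cleans only a uniform random sample and estimates an AVG aggregate from it, which may help show why a small cleaned sample can beat querying the full dirty data.

```python
import math
import random
import statistics


def sample_and_clean_avg(dirty_rows, clean_fn, sample_size, rng):
    """Estimate AVG over the clean data by cleaning only a uniform random sample."""
    sample = rng.sample(dirty_rows, sample_size)  # uniform, without replacement
    cleaned = [clean_fn(row) for row in sample]   # only sample_size rows are cleaned
    mean = statistics.fmean(cleaned)
    # 95% confidence half-width, normal approximation (central limit theorem).
    half_width = 1.96 * statistics.stdev(cleaned) / math.sqrt(sample_size)
    return mean, half_width


# Hypothetical toy data: true values lie in [90, 110]; about 20% of the
# rows are corrupted by a factor of 10 (for example, a unit error).
rng = random.Random(0)
truth = {i: 90.0 + (i % 21) for i in range(10_000)}
dirty = [(i, v * 10.0 if rng.random() < 0.2 else v) for i, v in truth.items()]


def clean(row):
    # Stand-in for an expensive cleaning step; here it looks up the true value.
    key, _dirty_value = row
    return truth[key]


true_avg = statistics.fmean(truth.values())          # answer with full cleaning
dirty_avg = statistics.fmean(v for _, v in dirty)    # answer with no cleaning
estimate, half_width = sample_and_clean_avg(dirty, clean, 500, rng)

print(f"full cleaning: {true_avg:.2f}")
print(f"no cleaning:   {dirty_avg:.2f}")
print(f"sample+clean:  {estimate:.2f} +/- {half_width:.2f}")
```

In this toy setting only 500 of 10,000 rows are cleaned, yet the estimate should land far closer to the fully cleaned answer than the uncleaned average does. Real cleaning steps and graph-structured data would need a more careful treatment than this sketch provides.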

FINCATO, SAVERIO

2024/2025

Record

	Faculty/Department
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Degree programme
	
				COMPUTER ENGINEERING, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Reproducing and Extending the Sample-and-Clean Framework: From Relational Data to Knowledge Graphs
			
	Keywords
	
				Database
Knowledge Graphs
Query
SAQP
			
	Supervisor
	
				MARCHESIN, STEFANO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Fincato_Saverio.pdf (open access)	7.32 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/99603