Towards data-based search engines for RDF graphs: a reproducibility study

Thanks to its flexibility and versatility, RDF is one of the most widely used formats for sharing data and knowledge on the Web. Many RDF datasets and knowledge repositories are now available in the scientific and political domains and can be easily accessed through numerous open data portals. However, these datasets cannot be fully exploited because there are no advanced search engines that let users retrieve the datasets that best suit their needs. Such search engines address the ad hoc RDF dataset retrieval task: answering a user's keyword query with a ranked list of 10 datasets ordered by relevance. Current systems are not very advanced and rely mainly on dataset metadata, which may be incomplete or unavailable, rather than on dataset content. ACORDAR is the first open test collection created to evaluate systems developed for this task. It can boost the development and improvement of these systems and encourage a shift from metadata-based to content-based search. The main focus of this thesis is a reproducibility study on the ACORDAR collection. We assess how good, useful, and well suited the collection is for the ad hoc RDF dataset retrieval task by reproducing the baseline systems developed by the ACORDAR creators and by discussing the reproducibility problems encountered while developing the reproduced systems.
