Domain-Specific Cross-Lingual RAG Applications: Information Retrieval Fine Tuning with Synthetic Data Generation

In Retrieval-Augmented Generation (RAG) systems, high-performance retrieval algorithms that match queries to documents effectively are essential for answering user queries, especially complex queries that span multiple documents. This thesis proposes a novel framework that uses the generative capabilities of large language models to produce domain-specific synthetic query-document pairs in a cross-lingual setting, where queries are created in multiple languages (including both high- and low-resource languages) and documents are primarily in Italian. The generated synthetic data is used to fine-tune transformer-based dense sentence encoders, enabling more effective information retrieval. The proposed approach introduces a query generation pipeline to enhance retrieval efficiency while preserving semantic integrity. Through comprehensive experiments, the thesis shows that fine-tuning smaller, task-specific models on high-quality synthetic data can outperform state-of-the-art retrieval solutions in both accuracy and computational efficiency (up to +7.5% in retrieval MAP@10 and +4.4% in question answering accuracy). This work highlights the potential of cross-lingual synthetic data generation as a cost-effective and scalable solution for improving domain-specific information retrieval in RAG applications, especially in scenarios involving multilingual queries and limited annotated data. The approach not only improves efficiency but also addresses critical security concerns: because sensitive data can be evaluated locally on company machines, the risks associated with data leaks are significantly mitigated, which improves compliance with data protection regulations.
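
For readers unfamiliar with the MAP@10 metric quoted above, the following is a minimal sketch of how mean average precision at a cutoff k is commonly computed. It is not taken from the thesis: the function names, the toy document IDs, and the choice to normalise by min(k, number of relevant documents) are assumptions made for illustration, and the thesis may use a different normalisation or evaluation library.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """AP@k for one query: average of the precision values at each rank
    (within the top k) where a relevant document is retrieved."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Assumes ranked_ids contains no duplicate document IDs.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalise by min(k, |relevant|); other conventions exist.
    return precision_sum / min(k, len(relevant_ids))


def map_at_k(rankings: Dict[str, List[str]], relevance: Dict[str, Set[str]], k: int = 10) -> float:
    """MAP@k: mean of AP@k over all queries in `rankings`."""
    scores = [
        average_precision_at_k(ranked, relevance.get(query, set()), k)
        for query, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries with retriever output and gold labels.
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevance = {"q1": {"d1", "d7"}, "q2": {"d2"}}
print(round(map_at_k(rankings, relevance, k=10), 4))
```

In this toy case the first query places its two relevant documents at ranks 2 and 3 and the second query places its single relevant document at rank 1, so a gain in MAP@10 reflects relevant documents moving closer to the top of the ranked list.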

BARBIERO, LORENZO

2024/2025

	Faculty/Department
	
				Dipartimento di Fisica e Astronomia "Galileo Galilei" - DFA
			
	Degree programme
	
				PHYSICS OF DATA, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2024
			
	Keywords
	
				InformationRetrieval
				RAG
				LLM
			
	Supervisor
	
				PAZZINI, JACOPO
			
	Appears in collections:
	
				Master's theses (Lauree magistrali)

Files in this item:

File	Size	Format
Barbiero_Lorenzo.pdf (open access)	2.15 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84545