Domain-Specific Cross-Lingual RAG Applications: Information Retrieval Fine Tuning with Synthetic Data Generation
BARBIERO, LORENZO
2024/2025
Abstract
In Retrieval-Augmented Generation (RAG) systems, high-performance retrieval algorithms for query-document matching are essential for answering user queries, especially complex queries that span multiple documents. This thesis proposes a novel framework that uses the generative capabilities of large language models to produce domain-specific, synthetic query-document pairs in a cross-lingual setting, where queries are written in multiple languages (both high- and low-resource) and documents are primarily in Italian. The synthetic data is used to fine-tune transformer-based dense sentence encoders, enabling more effective information retrieval. The proposed approach introduces a query generation pipeline to enhance retrieval efficiency while preserving semantic integrity. Comprehensive experiments show that fine-tuning smaller, task-specific models on high-quality synthetic data can outperform state-of-the-art retrieval solutions in both accuracy and computational efficiency (up to +7.5% in retrieval MAP@10 and +4.4% in question answering accuracy). This work highlights the potential of cross-lingual synthetic data generation as a cost-effective and scalable way to improve domain-specific information retrieval in RAG applications, especially in scenarios involving multilingual queries and limited annotated data. The approach also addresses security concerns: because sensitive data can be evaluated locally on company machines, the risk of data leaks is reduced, which supports compliance with data protection regulations.
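For readers unfamiliar with the retrieval metric quoted in the abstract, the following is a minimal sketch of how MAP@10 is commonly computed. It is not taken from the thesis: the function names are hypothetical, each ranking is assumed to contain no duplicate document IDs, and the normalisation by `min(len(relevant_ids), k)` is one common convention among several, so the thesis may use a different variant.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant_ids: Set[str],
                           k: int = 10) -> float:
    """Average precision over the top-k retrieved documents for one query.

    Assumption: normalise by min(number of relevant docs, k); other
    definitions divide by the number of relevant docs actually retrieved.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / min(len(relevant_ids), k)


def mean_average_precision_at_k(rankings: Sequence[Sequence[str]],
                                relevants: Sequence[Set[str]],
                                k: int = 10) -> float:
    """Mean of the per-query average precision values (MAP@k)."""
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data only: two queries with made-up document IDs.
example_rankings = [["d3", "d1", "d7"], ["d2", "d5"]]
example_relevants = [{"d1", "d7"}, {"d2"}]
example_map = mean_average_precision_at_k(example_rankings, example_relevants)
```

In this toy example the first query scores (1/2 + 2/3) / 2 = 7/12 and the second scores 1, so `example_map` is 19/24. A gain of "+7.5% in MAP@10" would be a difference in this kind of averaged score between two retrievers evaluated on the same queries.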
File | Size | Format
---|---|---
Barbiero_Lorenzo.pdf (open access) | 2.15 MB | Adobe PDF
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/84545