Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained and Fine-tuned Approach
RINALDI, NICOLÒ
2023/2024
Abstract
This thesis explores the application of multilingual embedding models to semantic search for Italian, a critical step toward integrating these technologies into Retrieval-Augmented Generation (RAG) frameworks. The work leverages state-of-the-art pre-trained and fine-tuned neural models to address the challenges of document retrieval in both symmetric and asymmetric contexts. Using a variety of datasets, including translated corpora for validation, the study evaluates models such as LaBSE, multilingual-e5-large, and bge-m3 for their ability to generate meaningful embeddings and improve retrieval performance. Performance in the asymmetric setting is assessed using nDCG@10. The fine-tuning phase, in which each pre-trained model is extended with an adapter on top of the query embedding, demonstrates the adaptability of two of these models to Italian-language tasks. Statistical significance was assessed with the Wilcoxon signed-rank test, yielding p-values below 0.001 for multilingual-e5-large and bge-m3, both of which outperform their adapter-free counterparts. One of our models, multilingual-e5-large with the linear adapter, achieved results superior to proprietary solutions such as OpenAI's text-embedding-3-small, with significance confirmed by the same test (p < 0.05). Additionally, our best-performing model reduces document retrieval latency by an order of magnitude compared with OpenAI's model. Furthermore, the training process is cost-effective, and the model's lightweight design enables it to run on local hardware.
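The adapter approach summarized above keeps each pre-trained model frozen and learns only a small transformation on the query side, so document embeddings can be computed and indexed once. A minimal NumPy sketch of this idea — all dimensions, weight values, and function names here are illustrative, not the thesis's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pre-trained embeddings (dim 8 here; real models
# like multilingual-e5-large use much larger dimensions).
doc_embs = rng.normal(size=(5, 8))   # pre-computed corpus document embeddings
query_emb = rng.normal(size=8)       # query embedding from the frozen model

# Hypothetical learned linear adapter, applied only to the query embedding.
W = np.eye(8) + 0.01 * rng.normal(size=(8, 8))
b = np.zeros(8)
adapted_query = W @ query_emb + b

def cosine_scores(q, docs):
    """Cosine similarity between one query vector and each document row."""
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ q

# Rank documents by similarity to the adapted query (best match first).
ranking = np.argsort(-cosine_scores(adapted_query, doc_embs))
print(ranking)
```

Because only the query passes through the adapter, the document index built from the frozen model never needs re-embedding — consistent with the abstract's claims about cost-effective training and operation on local hardware.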
File | Size | Format | Access
---|---|---|---
Nicolò Rinaldi - Exploring Multilingual Embeddings for Italian Semantic Search A Pretrained and Fine tuned Approach.pdf | 1.61 MB | Adobe PDF | open access
The text of this website © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/80899