Large language models in molecular genetics: retrieval augmented generation to facilitate genetic counselling
GREGORI, ANDREA
2023/2024
Abstract
This thesis proposes a tool based on a large language model (LLM) to help biologists at R&I Genetics during the report-writing phase. By retrieving relevant articles and exploiting the capabilities of Llama 2, the project aims to support biologists in genetic analysis. Developed during a curricular internship at R&I Genetics, the system addresses one of the challenges of genomic diagnostics for rare diseases: when drafting a report, biologists need to search for and read dozens of scientific articles to find evidence in support of an argument. The tool uses Llama 2, the open large language model created by Meta AI, together with the retrieval augmented generation (RAG) framework to access and analyse biomedical articles from PubMed, providing biologists with the answers they are looking for. The thesis gives an overview of R&I Genetics and its workflow, emphasising the crucial role of finding useful scientific evidence to support genetic analysis. Especially in the biomedical field, a tool capable of reading and analysing dozens of scientific articles in a few seconds is a valuable resource for navigating a large body of literature efficiently. The primary objectives of the thesis include examining the history and evolution of LLMs, exploring techniques for expanding LLMs' knowledge, testing different information retrieval and prompting techniques, and evaluating the effectiveness of the developed tool. Ethical considerations and potential directions for future work are also discussed.

File | Size | Format |
---|---|---|
Gregori_Andrea.pdf (embargoed until 03/07/2025) | 2.39 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 License.
https://hdl.handle.net/20.500.12608/65944