AI-based Generation of Descriptions for Protein Signatures: Fine-Tuning of Large Language Models and Comparative Analysis
KRALEVSKA, ANGELA
2023/2024
Abstract
One of the most important aspects of biological research is the functional classification and annotation of protein sequences. InterPro is a database that categorizes protein sequence patterns, known as “signatures”, and provides models for automatically classifying new protein sequences. Moreover, most InterPro entries are associated with detailed functional information, standardized through Gene Ontology (GO) terms. While GO terms have proven very useful for machine-oriented tasks, researchers who work with biological data also need concise, human-readable descriptions of the biological entities (proteins, families, domains, sites) under study. The purpose of this project is to automate the generation of short functional descriptions for different types of biological entities using natural language generation models. In particular, we focused on signatures corresponding to intrinsically disordered regions, i.e. protein fragments that lack a fixed three-dimensional structure and are poorly characterized functionally. To achieve this goal, datasets of varying complexity were constructed from InterPro entries. The simpler datasets contain only GO terms as input, while the more complex datasets combine GO terms with literature-derived functional descriptions, as reported in the UniProtKB database, of the proteins associated with a given InterPro signature, together with the corresponding source organism. We fine-tuned the T5, GPT-2, and BioGPT large language models; notably, BioGPT is pre-trained on large-scale biomedical literature. Our findings show that the fine-tuned T5 model significantly outperforms the pre-trained T5 model, and that it is also superior to the fine-tuned GPT-2 and BioGPT models. The best results were achieved when the model was fine-tuned on the most complex dataset type.
The best model was further tested on a novel task: generating descriptions for intrinsically disordered regions that were not part of the training data, highlighting its adaptability but also revealing areas for improvement.
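To illustrate the three dataset complexity levels described above, the following is a minimal sketch of how a model input string could be assembled from InterPro-derived annotations. The field labels and formatting here are illustrative assumptions, not the thesis's actual schema.

```python
def build_input(go_terms, uniprot_descriptions=None, organism=None):
    """Assemble a model input string from InterPro-derived annotations.

    Mirrors the three complexity levels described in the abstract:
      1. GO terms only (simplest datasets)
      2. GO terms + UniProtKB functional descriptions
      3. GO terms + descriptions + source organism (most complex datasets)
    """
    parts = ["GO terms: " + "; ".join(go_terms)]
    if uniprot_descriptions:
        parts.append("Function: " + " ".join(uniprot_descriptions))
    if organism:
        parts.append("Organism: " + organism)
    return " | ".join(parts)


# Simplest dataset: GO terms only
simple = build_input(["DNA binding", "regulation of transcription"])

# Most complex dataset: GO terms, UniProtKB description, and organism
complex_input = build_input(
    ["DNA binding"],
    uniprot_descriptions=["Binds double-stranded DNA."],
    organism="Homo sapiens",
)
```

Each resulting string would serve as the input sequence for fine-tuning a sequence-to-sequence model such as T5, with the target being the curated InterPro entry description.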
https://hdl.handle.net/20.500.12608/80893