Enhancing protein function prediction through statistical association and taxonomic enrichment analysis

PERTILE, ANDREA VALENTINA
2023/2024

Abstract

Automatic protein function prediction has grown rapidly in recent years, driven by the increasing availability of protein sequences. Without computational methods, protein function assignment would rely solely on labour-intensive wet-lab experiments. This study presents a comprehensive data analysis of the UniProt GAF (GO Annotation File) and pursues two primary directions. The first is a detailed statistical analysis of the associations between annotated Gene Ontology (GO) terms at the protein level. The second investigates the relationships between protein functions and taxonomic groups. Through this dual approach, the study aims to refine protein function prediction and contribute to more accurate and reliable annotations. All analyses are conducted twice: once on annotations supported by experimental evidence codes only, and once on annotations supported by either experimental or computational evidence codes, excluding ND (No biological Data available), NAS (Non-traceable Author Statement) and IEA (Inferred from Electronic Annotation) annotations. The data analysis yields a taxonomic filter, multiple pairs of associated GO terms, each with an association score, and candidate terms for propagating predictions downward in the ontology. These results are then integrated into the predictions made by a machine learning model.
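The two annotation sets described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the column positions follow the public GAF 2.x layout (protein ID in column 2, GO ID in column 5, evidence code in column 7, taxon in column 13), while the `EXPERIMENTAL` set, the reading of the second set as "every code except ND, NAS and IEA", and all function names and sample rows are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple

# Assumption: the classic GO experimental evidence codes; the thesis may use a different set.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
# Codes the abstract says are excluded from the broader analysis.
EXCLUDED = {"ND", "NAS", "IEA"}


class Annotation(NamedTuple):
    protein: str
    go_id: str
    evidence: str
    taxon: str


def parse_gaf(lines: Iterable[str]) -> Iterator[Annotation]:
    """Yield one Annotation per data row of a GAF 2.x file."""
    for line in lines:
        # Header and comment lines start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # skip malformed rows
        yield Annotation(cols[1], cols[4], cols[6], cols[12])


def split_by_evidence(
    annotations: Iterable[Annotation],
) -> Tuple[List[Annotation], List[Annotation]]:
    """Return (experimental-only set, broader set without ND/NAS/IEA)."""
    experimental, extended = [], []
    for ann in annotations:
        if ann.evidence in EXPERIMENTAL:
            experimental.append(ann)
        if ann.evidence not in EXCLUDED:
            extended.append(ann)
    return experimental, extended


def make_row(protein: str, go_id: str, evidence: str, taxon: str) -> str:
    """Build a hypothetical 17-column GAF row for the demo below."""
    cols = ["UniProtKB", protein, "SYM", "enables", go_id, "PMID:0",
            evidence, "", "F", "", "", "protein", taxon,
            "20240101", "UniProt", "", ""]
    return "\t".join(cols)


# Made-up sample rows, not real annotations.
sample = [
    "!gaf-version: 2.2",
    make_row("P00001", "GO:0003677", "IDA", "taxon:9606"),
    make_row("P00002", "GO:0005524", "ISS", "taxon:10090"),
    make_row("P00003", "GO:0016301", "IEA", "taxon:9606"),
]

experimental, extended = split_by_evidence(parse_gaf(sample))
print(len(experimental), len(extended))
```

For a real file, `parse_gaf` could be fed an open text handle instead of the in-memory `sample` list; keeping the taxon column alongside each GO term is what would allow the same parsed rows to serve both the association and the taxonomic analyses.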
2023
Keywords: function prediction; taxonomy; association; gene ontology
Files in this item:
File: Pertile_AndreaValentina.pdf (restricted access)
Size: 11.69 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/68386