Development of a new tool for functional investigation of microbial communities: empowering the analysis through machine learning

FRAULINI, SOFIA
2022/2023

Abstract

Associating genetic factors with phenotypic traits is one of the main current challenges in biology and is crucial for understanding microbial communities. In recent years, extensive databases of phenotypic traits for many microbes have been developed. Among these, FAPROTAX aims to assign microbial taxa to 92 functional classes. In this thesis project, supervised machine learning algorithms were applied to data retrieved from FAPROTAX to classify organisms into functional groups and to identify genetic markers of phenotypes. We developed a tool that combines different machine learning models (logistic regression, random forest and support vector machines) to infer microbial functional traits from vectors of gene annotations. The strategy yielded 86 models whose classification performance ranged from moderate to high, depending on the phenotype considered. Tests on independent datasets showed robust performance on both complete and fragmented genomes, and the tool was successfully applied to metagenomic samples from anaerobic digestion environments. Finally, a case study on acetoclastic methanogenesis demonstrated that applying the same machine learning approach to modified versions of the training dataset makes it possible to update the tool or to add models for further functions of interest, extending its applications in the field.
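As a rough illustration of the idea of inferring a trait from vectors of gene annotations with the three model families named in the abstract, the following minimal Python sketch may help. It is not the thesis tool: the gene identifiers, genome names and labels are made-up toy values, the presence/absence encoding is an assumption, and combining the models by majority vote through scikit-learn's `VotingClassifier` is also an assumption, since the abstract does not state how the models are combined.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy, made-up input: genome name -> set of gene annotation IDs (hypothetical).
genomes = {
    "genome_a": {"gene1", "gene2", "gene3"},
    "genome_b": {"gene1", "gene3"},
    "genome_c": {"gene1", "gene4"},
    "genome_d": {"gene2", "gene4"},
    "genome_e": {"gene3", "gene4"},
    "genome_f": {"gene2"},
}
# Toy labels for one functional trait: 1 = trait present, 0 = trait absent.
labels = {
    "genome_a": 1, "genome_b": 1, "genome_c": 1,
    "genome_d": 0, "genome_e": 0, "genome_f": 0,
}

# Fixed, sorted vocabulary so every genome maps to a vector of the same length.
vocabulary = sorted(set().union(*genomes.values()))


def to_vector(annotations):
    """Encode a set of annotations as a presence/absence (1/0) vector."""
    return [1 if gene in annotations else 0 for gene in vocabulary]


names = sorted(genomes)
X = [to_vector(genomes[name]) for name in names]
y = [labels[name] for name in names]

# Assumption: the three model families are combined by hard (majority) voting.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(kernel="linear")),
    ],
    voting="hard",
)
ensemble.fit(X, y)

# Predict the trait for a new, equally hypothetical genome.
new_genome = {"gene1", "gene2"}
prediction = ensemble.predict([to_vector(new_genome)])
print(prediction)
```

In this sketch one such ensemble corresponds to a single phenotype; covering many functional classes would mean training one ensemble per class on its own label vector.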
2022
Machine learning
Metagenomics
Function association
FAPROTAX database
Files in this item:
File: Fraulini_Sofia.pdf
Access: restricted
Size: 25.93 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/51276