Development of a new tool for functional investigation of microbial communities: empowering the analysis through machine learning
FRAULINI, SOFIA
2022/2023
Abstract
Associating genetic factors with phenotypic traits is currently one of the main challenges in biology and is crucial for understanding microbial communities. In recent years, extensive databases storing phenotypic traits for a wide range of microbes have been developed. Among these, FAPROTAX assigns microbial taxa to 92 functional classes. In this thesis project, supervised machine learning algorithms were applied to data retrieved from FAPROTAX to classify organisms into functional groups and to identify genetic markers of phenotypes. We developed a tool that combines three machine learning models (logistic regression, random forest and support vector machines) to infer microbial functional traits from vectors of gene annotations. This strategy yielded 86 models with moderate to high classification efficiency, depending on the phenotype considered. Tests on independent datasets showed robust performance on both complete and fragmented genomes, and the tool was successfully applied to metagenomic samples from anaerobic digestion environments. Finally, a case study on acetoclastic methanogenesis demonstrated that applying the same machine learning approach to modified versions of the training dataset makes it possible to update the tool or to integrate it with models for other functions of interest, extending its applications in the field.

| File | Size | Format |
|---|---|---|
| Fraulini_Sofia.pdf (restricted access) | 25.93 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/51276