Improving Multi-Class Metagenomic Contig Classification with Graph-Based Learning
ZARESHAHRAKI, SHABNAM
2025/2026
Abstract
Metagenomic assembly and classification are essential for uncovering the taxonomic structure of microbial communities, yet existing approaches often struggle with fragmented short-read assemblies and ambiguous contig assignments. This thesis investigates a graph-based learning framework for metagenomic contig classification, introducing a Graph Neural Network (GNN) that incorporates both sequence-derived features and assembly graph topology. By leveraging contig connectivity, the model learns contextual relationships that traditional sequence-based classifiers cannot capture. Synthetic benchmark datasets were constructed to mimic realistic microbial communities under two ecological conditions, GENERIC (balanced composition) and FILTERED (enriched in mobile genetic elements), for both short- and long-read sequencing modalities. The GNN was evaluated against the state-of-the-art 4CAC baseline across these scenarios. Results show consistent improvements in macro-F1, with the most pronounced gains (up to 20%) for minority classes such as viruses and plasmids in short-read data. For long-read assemblies, where contigs exhibit higher contiguity, the GNN achieved performance comparable to 4CAC. Validation on a real-world PacBio HiFi gut metagenome further confirmed the model’s robustness to empirical biological noise. While the framework proved robust and generalizable, training and graph construction were extremely memory-intensive, requiring over 200 GB of RAM per large-scale experiment. Future work should focus on memory-efficient graph sampling, hybrid long–short read integration, and broader validation on diverse environmental samples. Overall, this study demonstrates that topology-aware learning provides a scalable and biologically meaningful extension to existing metagenomic classification methods. The source code for this project is available at https://github.com/shbnmzr/thesis-project

| File | Size | Format | |
|---|---|---|---|
| Zareshahraki_Shabnam.pdf (open access) | 1.58 MB | Adobe PDF | View/Open |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/106282