Deep learning approaches for the identification of erroneous regions in metagenome-assembled genomes

CHILOIRO, MARCO
2024/2025

Abstract

Metagenomics has enabled the study of microbial communities at an unprecedented scale. The introduction of metagenome-assembled genomes (MAGs, i.e. genomes for which no isolation procedure has ever been carried out) has permitted the study of yet-to-be-characterized species, expanding our knowledge of underexplored environments. Still, high sample complexity and species-specific variability increase the chances of chimeric or erroneous assembly reconstructions, jeopardizing the study of highly rich microbiome samples such as those derived from soil. In this thesis work, we leveraged the collection of over 1.5 million microbial genomes hosted at the University of Trento to develop an algorithm able to identify erroneous junctions between genomic regions introduced by the assembly procedure. First, we built a database of 76 species, extracting for each species 50 genomes of nearly perfect completeness and no contamination. After simulating Illumina short reads at 100X coverage on these genomes, we ran the assembly procedure, for each species, on 50 pairs of genomes from that species, in order to increase the chances of obtaining chimeric assemblies. Putative chimeric assemblies were then assigned a degree of erroneousness based on a function of the depth of coverage that penalizes any region whose coverage does not match the theoretical coverage after re-mapping the simulated reads against each contig: briefly, if a region is not fully covered by reads from either of the two genomes, it is considered erroneously assembled. Next, two different deep learning models were fitted on these data to predict the presence of misassembled regions in contigs, with the aim of developing a post-assembly quality control tool that improves the quality of binning results.
All models considered generalize reasonably well (AUC ~ 0.7/0.8) on chimeric contigs assembled from the same genomes used to generate the training-set contigs, and slightly worse (AUC ~ 0.6/0.7) on contigs that come from genomes not used for training. Although it still requires optimization, the presented model is capable of learning DNA-related properties of the genomic sequences in order to distinguish between correctly and erroneously assembled genomic regions in bacteria.
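
The coverage-based labelling rule summarized in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the exact penalty function is not given here, so the function name `label_misassembled`, the fixed window size, and the `min_frac` threshold are all assumptions introduced purely for illustration. The sketch takes per-base depth from re-mapping each parent genome's simulated reads onto a contig and flags a window when neither genome reaches a chosen fraction of the theoretical coverage.

```python
import numpy as np

def label_misassembled(depth_a, depth_b, expected=100.0, min_frac=0.5, window=4):
    """Flag contig windows that neither parent genome covers adequately.

    depth_a, depth_b: per-base read depth on the same contig, obtained by
    re-mapping the reads simulated from genome A and genome B respectively.
    expected: theoretical coverage (100X in the abstract).
    min_frac, window: hypothetical parameters, not taken from the thesis.
    Returns one boolean per full window; True = putatively misassembled.
    """
    depth_a = np.asarray(depth_a, dtype=float)
    depth_b = np.asarray(depth_b, dtype=float)
    if depth_a.shape != depth_b.shape:
        raise ValueError("depth arrays must describe the same contig")

    # Drop the trailing partial window, then average depth per window.
    n_windows = depth_a.size // window
    trimmed = n_windows * window
    mean_a = depth_a[:trimmed].reshape(n_windows, window).mean(axis=1)
    mean_b = depth_b[:trimmed].reshape(n_windows, window).mean(axis=1)

    # A window is suspicious only if BOTH genomes fall below the threshold.
    threshold = min_frac * expected
    return (mean_a < threshold) & (mean_b < threshold)

# Toy contig: first window supported by genome A, last by genome B,
# middle window poorly supported by both.
depth_a = [100, 100, 100, 100, 20, 20, 20, 20, 0, 0, 0, 0]
depth_b = [0, 0, 0, 0, 30, 30, 30, 30, 100, 100, 100, 100]
labels = label_misassembled(depth_a, depth_b)
print(labels.tolist())  # [False, True, False]
```

A graded "degree of erroneousness", as described in the abstract, could replace the boolean output with a continuous penalty on the deviation from the expected depth; the binary version is used here only to keep the example short.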
2024
Metagenomics
Metagenomic assembly
Deep learning
CNN
Transformers
Files in this item:
File: Chiloiro_Marco.pdf (open access)
Size: 1.72 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84547