Deep learning approaches for the identification of erroneous regions in metagenome-assembled genomes

Metagenomics has enabled the study of microbial communities at an unprecedented scale. The introduction of metagenome-assembled genomes (MAGs, i.e. genomes of organisms for which no isolation procedure has ever been carried out) has made it possible to study yet-to-be-characterized species, expanding our knowledge of underexplored environments. Still, high sample complexity and species-specific variability increase the chances of chimeric or erroneous assembly reconstructions, jeopardizing the study of highly rich microbiome samples such as those derived from soil.

In this thesis, we leveraged the collection of over 1.5 million microbial genomes hosted at the University of Trento to develop an algorithm that identifies erroneous junctions between genomic regions introduced by the assembly procedure. First, we built a database of 76 species, extracting 50 genomes from each, all with near-perfect completeness and no contamination. After simulating Illumina short reads at 100X coverage on these genomes, we ran the assembly on 50 pairs of same-species genomes for each species, to increase the chances of obtaining chimeric assemblies. Each putative chimeric assembly was then assigned a degree of erroneousness by a function of the depth of coverage that penalizes any region whose coverage, after re-mapping the simulated reads against each contig, does not match the theoretical coverage: briefly, a region that is not fully covered by reads from either of the two genomes is considered erroneously assembled.

Next, two different deep learning models were fitted on these data to predict the presence of misassembled regions in contigs, with the aim of developing a post-assembly quality control tool that improves the quality of binning results.

All considered models generalize quite well (AUC ~ 0.7/0.8) to chimeric assembly contigs created from the same genomes used to generate the training set contigs, and slightly worse (AUC ~ 0.6/0.7) to contigs from genomes not used for training. Although it still needs to be optimized, the presented model is capable of learning DNA-related properties of genomic sequences in order to distinguish between correctly and erroneously assembled genomic regions in bacteria.
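
The coverage-based labeling rule summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual scoring function, so everything here is an assumption made for illustration: the function name `window_error_scores`, the fixed window size, the 50% depth tolerance, and the idea that each simulated read can be traced back to its source genome so that two per-base depth tracks are available for every contig. The sketch scores each window by how far it is from being fully covered by reads from at least one of the two genomes.

```python
from typing import List, Sequence


def window_error_scores(
    depth_a: Sequence[float],
    depth_b: Sequence[float],
    expected_depth: float = 100.0,  # matches the 100X simulated coverage
    tolerance: float = 0.5,         # assumed: depth >= 50% of expected counts as covered
    window: int = 4,                # assumed window size, kept tiny for illustration
) -> List[float]:
    """Return one score per window: 0.0 means fully covered by one genome,
    larger values mean neither genome's reads cover the whole window."""
    if len(depth_a) != len(depth_b):
        raise ValueError("depth tracks must have the same length")
    threshold = expected_depth * (1.0 - tolerance)
    scores = []
    for start in range(0, len(depth_a), window):
        a = depth_a[start:start + window]
        b = depth_b[start:start + window]
        # Fraction of positions in the window covered by each genome's reads.
        frac_a = sum(d >= threshold for d in a) / len(a)
        frac_b = sum(d >= threshold for d in b) / len(b)
        # A window is fine if either genome covers it completely.
        scores.append(1.0 - max(frac_a, frac_b))
    return scores


# Toy contig: genome A's reads cover the first half, genome B's the second half.
depth_a = [100, 100, 100, 100, 100, 100, 0, 0, 0, 0, 0, 0]
depth_b = [0, 0, 0, 0, 0, 0, 100, 100, 100, 100, 100, 100]
scores = window_error_scores(depth_a, depth_b)
labels = [int(s > 0.0) for s in scores]  # binary label: 1 = putatively misassembled
print(scores)  # [0.0, 0.5, 0.0]
print(labels)  # [0, 1, 0]
```

In this toy case only the middle window, which spans the junction between the two source genomes, receives a non-zero score. A real pipeline would derive the depth tracks from the re-mapped reads and would likely use a smoother penalty than this thresholded fraction.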

CHILOIRO, MARCO

2024/2025

	Faculty/Department
	
				Dipartimento di Fisica e Astronomia "Galileo Galilei" - DFA
			
	Degree programme
	
				PHYSICS OF DATA, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Deep learning approaches for the identification of erroneous regions in metagenome-assembled genomes
			
	Keywords
	
				Metagenomics
Metagenomic assembly
Deep learning
CNN
Transformers
			
	Supervisor
	
				BAIESI, MARCO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Chiloiro_Marco.pdf (open access)	1.72 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84547