A Study on the Effects of Using Sampling for Metagenomic Comparison

Nowadays, ecological sciences depend heavily on genetic studies. Among these, the analysis of environmental genetic material (i.e., metagenomics) is becoming increasingly popular for inferring essential information about microbial life and its interaction with ecosystems. An interesting application of metagenomics in this field is metagenomic comparison, that is, the assessment of biotic dissimilarity between microbial environments. Current technologies allow us to produce terabytes of metagenomic data with little effort. Consequently, the analysis of datasets of such size requires a large amount of computational resources. This has led to the development and application of several dimensionality-reduction strategies, which are now being exploited for metagenomic comparison too. In this thesis, we analyse three different dimensionality-reduction methods to assess their impact relative to reference-based methods. Our results show that sketching on distinct k-mers, as implemented in the tool SimkaMin, has almost no impact on either abundance-based or presence-absence-based comparison for sketch sizes larger than 10^5 distinct k-mers. On smaller sketches, the quality of the results decreases. Under SPRISS’ sampling scheme, in which reads are selected uniformly at random with replacement, the abundance-based Bray-Curtis dissimilarity showed no significant variation at moderate sampling rates (e.g., above 2%) and a marked decline in quality at lower sampling rates. When the k-mers used are too short, 12 bp for instance, this sampling scheme seems to drastically improve the dissimilarity measures. For the presence-absence Jaccard distance, instead, SPRISS’ subsampling scheme improves the correlation between reference-based and compositional methods at moderate sampling rates. Lastly, the comparison of approximate sets of frequent k-mers, as output by SPRISS, shows lower correlation with reference-based dissimilarities, except on very short k-mers.
Overall, our study suggests that rare k-mers are of two types: weakly informative ones and noise. Their impact is imperceptible on abundance-based dissimilarity, whereas the noisy ones negatively affect the quality of the Jaccard index, which indeed benefits from moderate subsampling.
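
To make the quantities named in the abstract concrete, the following is a minimal Python sketch of k-mer counting, the abundance-based Bray-Curtis dissimilarity, the presence-absence Jaccard distance, and uniform read sampling with replacement. It uses the standard textbook definitions of the two measures; all function names and the toy reads are illustrative assumptions and do not reproduce the code or interface of SimkaMin or SPRISS, nor necessarily the exact setup of the thesis (for example, canonical k-mers are not handled here).

```python
import random
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Abundance-based dissimilarity: 1 - 2 * sum(min counts) / (total_a + total_b)."""
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def jaccard_distance(a, b):
    """Presence-absence distance: 1 - |A & B| / |A | B| over distinct k-mers."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def sample_reads(reads, rate, rng):
    """Draw round(rate * len(reads)) reads uniformly at random, with replacement."""
    m = max(1, round(rate * len(reads)))
    return rng.choices(reads, k=m)


# Toy data (hypothetical reads, chosen only for illustration).
sample_a = ["ACGTAC", "GTACGT"]
sample_b = ["ACGTTT", "TTTACG"]

counts_a = kmer_counts(sample_a, k=3)
counts_b = kmer_counts(sample_b, k=3)
full_bc = bray_curtis(counts_a, counts_b)
full_jd = jaccard_distance(counts_a, counts_b)

# Subsample each dataset at a 50% rate and recompute the dissimilarity.
rng = random.Random(0)
sub_a = kmer_counts(sample_reads(sample_a, 0.5, rng), k=3)
sub_b = kmer_counts(sample_reads(sample_b, 0.5, rng), k=3)
sub_bc = bray_curtis(sub_a, sub_b)
```

Comparing `full_bc` with `sub_bc` over many pairs of datasets and sampling rates is the kind of experiment the abstract summarises, although the thesis evaluates the measures against reference-based dissimilarities rather than against the unsampled values alone.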

GALLINA, GIORGIO

2022/2023

Record

	Faculty/Department
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Degree programme
	
				COMPUTER ENGINEERING, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2022
			
	English title
	
				A Study on the Effects of Using Sampling for Metagenomic Comparison
			
	Keywords
	
				Metagenomics
Comparison
Sampling
Reference-based
k-mer
			
	Supervisor
	
				PIZZI, CINZIA
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Access	Size	Format
Gallina_Giorgio.pdf	open access	3.33 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/46146