A Study on the Effects of Using Sampling for Metagenomic Comparison

GALLINA, GIORGIO
2022/2023

Abstract

Ecological sciences now depend heavily on genetic studies. Among these, the analysis of environmental genetic material (i.e., metagenomics) is becoming increasingly popular for inferring essential information about microbial life and its interaction with ecosystems. An interesting application of metagenomics in this field is metagenomic comparison, that is, the assessment of biotic dissimilarity between microbial environments. Current technologies allow us to produce terabytes of metagenomic data with little effort, and analysing datasets of this size requires a large amount of computational resources. This has led to the development and application of several dimensionality reduction strategies, which are now being exploited for metagenomic comparison as well. In this thesis, we analyse three different dimensionality reduction methods to assess their impact relative to reference-based methods. Our results show that sketching on distinct k-mers, as implemented in the tool SimkaMin, has almost no impact on either abundance-based or presence-absence-based comparison for sketch sizes larger than 10^5 distinct k-mers. On smaller sketches, the quality of the results decreases. With SPRISS' sampling scheme, in which reads are selected uniformly at random with replacement, the abundance-based Bray-Curtis dissimilarity showed no significant variation at moderate sampling rates (e.g., above 2%) and a marked decline in quality at lower sampling rates. When the k-mers used are too short, 12 bp for instance, this sampling scheme seems to drastically improve the dissimilarity measures. For the presence-absence Jaccard distance, instead, SPRISS' subsampling scheme improves the correlation between reference-based and composition-based methods at moderate sampling rates. Lastly, comparing the approximate sets of frequent k-mers output by SPRISS yields lower correlation with reference-based dissimilarities, except on very short k-mers.
Overall, our study suggests that rare k-mers are of two kinds: weakly informative ones and noise. Their impact on abundance-based dissimilarity is imperceptible, whereas the noisy ones negatively affect the quality of the Jaccard index, which indeed benefits from moderate subsampling.
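As an illustration of the two measures named in the abstract, the following is a minimal Python sketch, not the implementation used in the thesis, SimkaMin or SPRISS. It assumes the standard definitions of Bray-Curtis dissimilarity on k-mer counts and Jaccard distance on k-mer presence-absence, and it models read sampling as uniform draws with replacement. All function names, the `rate` parameter and the rule that the sample size is `rate` times the number of reads are hypothetical choices made for this example.

```python
import random
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads (no canonical form)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Abundance-based Bray-Curtis dissimilarity between two count profiles."""
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def jaccard_distance(a, b):
    """Presence-absence Jaccard distance between two k-mer sets."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def sample_reads(reads, rate, seed=0):
    """Draw reads uniformly at random with replacement.

    Assumption: the sample size is rate * len(reads), at least one read.
    """
    rng = random.Random(seed)
    n = max(1, round(rate * len(reads)))
    return rng.choices(reads, k=n)


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    sample_a = ["ACGTACGT", "TTGACCAA", "ACGTTTGA"]
    sample_b = ["ACGTACGA", "GGGACCAA", "TTGATTGA"]
    full_a, full_b = kmer_counts(sample_a, 4), kmer_counts(sample_b, 4)
    sub_a = kmer_counts(sample_reads(sample_a, 0.67), 4)
    sub_b = kmer_counts(sample_reads(sample_b, 0.67), 4)
    print("full:      ", bray_curtis(full_a, full_b), jaccard_distance(full_a, full_b))
    print("subsampled:", bray_curtis(sub_a, sub_b), jaccard_distance(sub_a, sub_b))
```

Comparing the dissimilarities computed on the full read sets with those computed on the subsampled ones mirrors, at toy scale, the kind of comparison the abstract describes. Real metagenomic data would require a dedicated k-mer counter rather than an in-memory `Counter`.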
2022
Metagenomics
Comparison
Sampling
Reference-based
k-mer
Files in this item:
Gallina_Giorgio.pdf (open access, 3.33 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/46146