Compression of Sequencing Data for Phylogeny Reconstruction

NICETTO, ANDREA
2024/2025

Abstract

Next-generation sequencing technologies have revolutionized genomics by producing volumes of data that strain conventional methods of storage and analysis. This thesis addresses two key issues: efficient compression of sequencing data and reconstruction of phylogenies using alignment-free approaches. It introduces a compression strategy based on compacted de Bruijn graphs, which reduces redundancy in k-mer sets by optimizing path covers of the graph. The proposed USTAR method exploits the connectivity of these graphs to significantly reduce storage requirements while preserving the information in the data. Building on this foundation, the study employs three alignment-free phylogenetic reconstruction tools: phyBWT2, Mash and SANS serif. These tools avoid the computationally expensive sequence alignment step by relying on techniques such as the extended Burrows-Wheeler transform and MinHash sketching, accelerating tree inference without compromising accuracy. Extensive experiments on real and simulated datasets show that USTAR compression combined with alignment-free reconstruction reduces both storage space and execution time while preserving the integrity of the sequencing information and the quality of the phylogenetic analyses.
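As a rough illustration of the MinHash sketching mentioned in the abstract, the following Python sketch estimates the Jaccard similarity of two k-mer sets from bottom-s sketches and converts it into a Mash-style distance, D = -(1/k) ln(2j / (1 + j)). It is not taken from the thesis or from the Mash code base: the function names, the use of SHA-1 as the hash, the omission of canonical (reverse-complement) k-mers, and the small default values of k and the sketch size are all assumptions made for the example.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, size=64):
    """Bottom-s MinHash sketch: the `size` smallest k-mer hash values.

    SHA-1 (truncated to 64 bits) is a stand-in hash chosen for determinism;
    it is an assumption of this example, not what Mash itself uses.
    """
    hashes = sorted(
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    )
    return hashes[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both input sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=5):
    """Mash-style distance from a Jaccard estimate j and k-mer length k."""
    if j <= 0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

A pairwise distance matrix computed this way could then be passed to a distance-based tree-building method such as neighbor joining; whether and how the thesis does so is not stated in the abstract.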
2024
Phylogeny
Compression
Sequencing data
k-mer
Files in this item:
File: Nicetto_Andrea.pdf
Access: open access
Size: 4.6 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/83215