Compression of Sequencing Data for Phylogeny Reconstruction
NICETTO, ANDREA
2024/2025
Abstract
Next-generation sequencing technologies have revolutionized genomics by producing volumes of data that challenge conventional methods of storage and analysis. This thesis addresses two key issues: efficient compression of sequencing data and reconstruction of phylogenies using alignment-free approaches. A new compression strategy based on compacted de Bruijn graphs is introduced, which minimizes data redundancy by optimizing path covers over k-mer sets. The proposed USTAR method exploits the inherent connectivity of these graphs to significantly reduce storage requirements while preserving the information in the data. Building on this foundation, the study employs three alignment-free phylogenetic reconstruction tools: phyBWT2, Mash, and SANS serif. These tools avoid computationally expensive sequence alignment by relying on methods such as the extended Burrows-Wheeler transform and MinHash sketching, accelerating tree inference without compromising accuracy. Extensive experiments on real and simulated datasets show that USTAR compression, combined with alignment-free reconstruction, significantly improves computational efficiency by reducing storage space and execution time while preserving the integrity of the sequencing information and the quality of the phylogenetic analyses.

File | Size | Format
---|---|---
Nicetto_Andrea.pdf (open access) | 4.6 MB | Adobe PDF
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/83215