Exploring how k-mer based compression speeds up many bioinformatics applications
ORSOLON, LUDOVICO
2025/2026
Abstract
This thesis focuses on USTAR, a k-mer-based tool that compresses large k-mer sets using de Bruijn graph representations. By exploiting the structural properties of these graphs, USTAR generates compact representations of k-mer collections that are substantially smaller than the original datasets. The main objective of this work is to assess whether USTAR-based compression can be effectively integrated into bioinformatics pipelines that rely on k-mer-based tools. In particular, the thesis investigates to what extent compressed datasets can be used as input for existing tools without compromising their correctness, while potentially improving computational performance and reducing storage requirements. To this end, several widely used k-mer-based tools are evaluated across multiple datasets. For each tool, execution time, output characteristics, and storage usage are compared between runs on the original datasets and runs on datasets compressed with USTAR. This comparison provides a quantitative evaluation of the benefits of USTAR in terms of computational efficiency and disk space utilization. Beyond the potential advantages, the thesis also aims to identify the limitations and drawbacks introduced by dataset compression, with particular attention to how compression affects tool accuracy, result consistency, and overall pipeline behavior. The ultimate objective is to determine under which conditions, and for which classes of applications, USTAR-based compression is a practical and beneficial choice in real-world bioinformatics workflows. Many of the scripts used in this thesis, along with detailed explanations, are available on the dedicated GitHub page at github.com/OrsolonLudovico/MasterThesis.

| File | Size | Format |
|---|---|---|
| Orsolon_Ludovico.pdf (open access) | 2.07 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/106278