Exploring how k-mer based compression speeds up many bioinformatics applications

ORSOLON, LUDOVICO

2025/2026

Abstract

This thesis focuses on USTAR, a tool designed to compress large k-mer sets using de Bruijn graph representations. By exploiting the structural properties of these graphs, USTAR generates compact representations of k-mer collections that are substantially smaller than the original datasets. The main objective of this work is to assess whether USTAR-based compression can be effectively integrated into bioinformatics pipelines that rely on k-mer based tools. In particular, the thesis investigates to what extent compressed datasets can be used as input for existing tools without compromising their correctness, while potentially improving computational performance and reducing storage requirements. To this end, several widely used k-mer based tools are evaluated across multiple datasets. For each tool, execution time, output characteristics, and storage usage are compared between runs on the original datasets and runs on datasets compressed with USTAR. This comparison allows a quantitative evaluation of the benefits of USTAR in terms of computational efficiency and disk space utilization. Beyond the potential advantages, the thesis also aims to identify limitations and drawbacks introduced by dataset compression. Particular attention is given to how compression affects tool accuracy, result consistency, and overall pipeline behavior. The ultimate objective is to determine under which conditions, and for which classes of applications, USTAR-based compression is a practical and beneficial choice in real-world bioinformatics workflows. Many of the scripts used in this thesis, along with detailed explanations, are available on the dedicated GitHub page at github.com/OrsolonLudovico/MasterThesis.
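To illustrate why a de Bruijn graph view can shrink a k-mer set, the following is a minimal sketch and not USTAR's actual algorithm. It assumes k >= 2, forward-strand k-mers only (no reverse complements, no counts), and uses hypothetical helper names (`kmers`, `greedy_path_cover`). K-mers that overlap by k-1 characters are greedily chained into longer strings, so that each k-mer is spelled exactly once and shared overlaps are stored only once.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_path_cover(kmer_set, k):
    """Greedily chain k-mers that overlap by k-1 characters.

    Illustrative assumption: forward strand only, k >= 2.
    Every input k-mer appears in exactly one output string.
    """
    remaining = set(kmer_set)
    strings = []
    while remaining:
        # Start a new path from the smallest unused k-mer (deterministic choice).
        path = min(remaining)
        remaining.remove(path)

        # Extend to the right while an unused successor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                nxt = path[-(k - 1):] + base
                if nxt in remaining:
                    remaining.remove(nxt)
                    path += base
                    extended = True
                    break

        # Extend to the left while an unused predecessor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                prev = base + path[:k - 1]
                if prev in remaining:
                    remaining.remove(prev)
                    path = base + path
                    extended = True
                    break

        strings.append(path)
    return strings


k = 4
original = kmers("ACGTACGGT", k)
compressed = greedy_path_cover(original, k)

# Characters needed to list every k-mer separately vs. the chained strings.
plain_chars = sum(len(x) for x in original)
compact_chars = sum(len(s) for s in compressed)
print(plain_chars, compact_chars, compressed)
```

The chained strings spell the same k-mer set as the input, which is the property a k-mer based tool would rely on when reading the compact representation instead of the original data.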

	Faculty/Department
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Degree programme
	
				COMPUTER ENGINEERING, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2025
			
	Keywords
	
				bioinformatics
				k-mer
				USTAR
			
	Supervisor
	
				COMIN, MATTEO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Orsolon_Ludovico.pdf (open access)	2.07 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/106278