The introduction of new sequencing techniques has led to an exponential increase in the volume of available genomic data, which has made more efficient methodologies for managing and analyzing this information necessary. Sketching techniques were introduced to address this need: they provide a compact, approximate representation of large amounts of data while preserving its essential features. This thesis describes the main sketching algorithms and then analyzes the MinHash method in detail. MinHash is particularly effective at estimating the similarity between datasets and is used in several areas of genomic analysis. Some practical use cases are also explored, highlighting the gains in efficiency in both space and computation time.
Application of MinHash in Genomic Data Processing
GUAN, BEINI
2023/2024
File | Size | Format |
---|---|---|
Guan_Beini.pdf (restricted access) | 2.93 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/76482