Deep Learning For Genomic Sequences

GUARNIERI, LUCA

2023/2024

Abstract

Thanks to advanced techniques such as Next Generation Sequencing (NGS), the time and cost of DNA sequencing keep falling. These techniques offer high throughput, speed and scalability, so the amount of available DNA sequence data is greater than ever before. Decoding and understanding what these sequences encode, however, urgently requires new technologies that can keep pace with data production and capture the contextual information of genes scattered over long stretches of DNA.

Artificial Intelligence and Deep Learning promise great advances in the interpretation and classification of genomic sequences. Such models can learn to recognize significant genomic sequences and patterns without expensive, time-consuming and complicated wet-lab experiments. Moreover, they have been shown to do so even when trained on limited data.

This study describes the Transformer, the state-of-the-art deep-learning model architecture, and how it works. It then presents two applications of the Transformer to this biological problem: Nucleotide Transformers and Gena-LM, both of which implement advanced foundational DNA language models capable of high performance on numerous sequence prediction tasks. These works are described and compared.

Lastly, both models are tested: each is fine-tuned, and its performance is assessed on different datasets.

All results and the fine-tuned models can be found on the author's HuggingFace page: https://huggingface.co/LiukG

Record

	Faculty/Department
	
				Department of Information Engineering (Dipartimento di Ingegneria dell'Informazione, DEI)
			
	Degree programme
	
				Biomedical Engineering, first-level degree (Laurea di Primo Livello, D.M. 270/2004)
			
	Academic year
	
				2023
			
	English title
	
				Deep Learning For Genomic Sequences
			
	Keywords
	
				Deep Learning
Genome
Genomic sequences
			
	Supervisor
	
				SALES, GABRIELE
			
	Appears in categories:
	
				Bachelor's degree theses

Files in this item:

File	Size	Format
Guarnieri_Luca.pdf (open access)	1.34 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/62530