Deep Learning For Genomic Sequences

GUARNIERI, LUCA
2023/2024

Abstract

Thanks to advanced techniques such as Next Generation Sequencing (NGS), the time and cost of DNA sequencing are steadily decreasing. These techniques offer high throughput, speed and scalability, so the amount of available DNA sequence data is greater than ever before. Nevertheless, when it comes to decoding and understanding what is encoded within this large number of sequences, there is an urgent need for new technologies that can keep up with data production and capture the contextual information of genes scattered over long stretches of DNA. In the field of Genomics, Artificial Intelligence and Deep Learning promise great advances in the interpretation and classification of genomic sequences. Such models can learn and recognize significant genomic sequences and patterns without the need for expensive, time-consuming and complicated wet-lab experiments. Moreover, they have been shown to do so even when trained on limited data. This study describes the state-of-the-art deep-learning architecture, namely the Transformer, and how it works. Two examples of its application to biological problems are then presented: Nucleotide Transformers and Gena-LM, both implementing advanced foundational DNA language models capable of high performance in numerous sequence prediction tasks. These works are described and compared. Lastly, the aforementioned models are tested: each is fine-tuned and its performance is assessed on different datasets. All the results and the fine-tuned models can be found on the author's HuggingFace page: https://huggingface.co/LiukG
2023
Keywords: Deep Learning; Genome; Genomic sequences
Files in this item:
Guarnieri_Luca.pdf (open access, 1.34 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/62530