Large Language Models for the study of Bacteriophages

BENATTI, LORENZO

2024/2025

Abstract

Bacteriophages represent the most abundant biological entities on Earth, playing pivotal roles in shaping bacterial communities across diverse ecosystems. Their influence extends to critical areas such as human gut microbiota, food safety and agricultural productivity. A fundamental prerequisite for harnessing their potential lies in identifying the specific bacterial hosts they infect. Traditional experimental methods for determining phage-host interactions, while reliable, remain labor-intensive and costly, as they require screening individual phages against extensive panels of bacterial strains. To accelerate this process, computational approaches have been developed to predict putative phage-host interactions in silico. The rapid expansion of genomic databases, fueled by advances in sequencing technologies, has enabled machine learning (ML) models to leverage growing datasets of known phage-host pairs. Among these, protein language models (PLMs) have emerged as powerful tools for biological sequence analysis, demonstrating exceptional performance in tasks such as structure prediction and function annotation. However, their application to phage-host interaction prediction remains underexplored. In this study, I employ the PLM ESM2 to generate proteome-level embeddings for experimentally validated phage-bacteria pairs. These embeddings are then used to train a neural network model designed to predict phage-host interactions, providing a scalable and efficient alternative to traditional screening methods.
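The pipeline described above (per-proteome embeddings for a phage and a candidate host, fed to a trainable classifier) can be illustrated with a minimal sketch. The abstract does not say how protein embeddings are aggregated or what network architecture is used, so everything here is an assumption made for illustration: mean pooling over proteins, concatenation of the phage and host vectors, and a single-layer logistic model standing in for the neural network. The 2-dimensional toy vectors replace real ESM2 embeddings, and all function names are hypothetical.

```python
import math

def mean_pool(protein_vectors):
    """Average per-protein vectors into one proteome-level vector (assumed pooling)."""
    n = len(protein_vectors)
    dim = len(protein_vectors[0])
    return [sum(v[i] for v in protein_vectors) / n for i in range(dim)]

def pair_features(phage_proteins, host_proteins):
    """Concatenate pooled phage and host vectors into one feature vector (assumed)."""
    return mean_pool(phage_proteins) + mean_pool(host_proteins)

def predict(weights, bias, features):
    """Single-layer model: sigmoid of a weighted sum, read as P(interaction)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, dim, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Synthetic stand-ins for per-protein embeddings (not real ESM2 output).
phage_a = [[1.0, 0.0], [0.8, 0.2]]
phage_b = [[0.0, 1.0], [0.2, 0.8]]
host_x = [[1.0, 1.0], [0.9, 0.9]]
host_y = [[-1.0, -1.0], [-0.9, -1.1]]

# Toy labels: host_x is infected by both phages, host_y by neither.
examples = [
    (pair_features(phage_a, host_x), 1),
    (pair_features(phage_b, host_x), 1),
    (pair_features(phage_a, host_y), 0),
    (pair_features(phage_b, host_y), 0),
]

weights, bias = train(examples, dim=4)
score_pos = predict(weights, bias, pair_features(phage_a, host_x))
score_neg = predict(weights, bias, pair_features(phage_a, host_y))
print(f"phage_a vs host_x: {score_pos:.3f}")
print(f"phage_a vs host_y: {score_neg:.3f}")
```

In a real setting the toy vectors would be replaced by per-protein embeddings from a protein language model, and the linear scorer by a deeper network. A plain concatenation with a linear model cannot capture phage-specific host ranges, which is one reason a nonlinear model is the natural choice for this task.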

Faculty/Department: Dipartimento di Ingegneria dell'Informazione (Department of Information Engineering) - DEI
Degree programme: Bioengineering, Master's degree (D.M. 270/2004)
Academic year: 2024
English title: Large Language Models for the study of Bacteriophages
Keywords: LLM, Bacteriophages, phylogeny, Virus
Supervisor: BELLATO, MASSIMO
Co-supervisors: SALES, GABRIELE; DI CAMILLO, BARBARA
Appears in collections: Master's theses

Files in this item:

File	Size	Format
Benatti_Lorenzo.pdf (open access)	3.42 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84350