Large Language Models for the study of Bacteriophages

BENATTI, LORENZO
2024/2025

Abstract

Bacteriophages represent the most abundant biological entities on Earth, playing pivotal roles in shaping bacterial communities across diverse ecosystems. Their influence extends to critical areas such as human gut microbiota, food safety and agricultural productivity. A fundamental prerequisite for harnessing their potential lies in identifying the specific bacterial hosts they infect. Traditional experimental methods for determining phage-host interactions, while reliable, remain labor-intensive and costly, as they require screening individual phages against extensive panels of bacterial strains. To accelerate this process, computational approaches have been developed to predict putative phage-host interactions in silico. The rapid expansion of genomic databases, fueled by advances in sequencing technologies, has enabled machine learning (ML) models to leverage growing datasets of known phage-host pairs. Among these, protein language models (PLMs) have emerged as powerful tools for biological sequence analysis, demonstrating exceptional performance in tasks such as structure prediction and function annotation. However, their application to phage-host interaction prediction remains underexplored. In this study, I employ the PLM ESM2 to generate proteome-level embeddings for experimentally validated phage-bacteria pairs. These embeddings are then used to train a neural network model designed to predict phage-host interactions, providing a scalable and efficient alternative to traditional screening methods.
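The pipeline outlined in the abstract (per-protein PLM embeddings pooled to proteome level, then a neural network scoring each phage-bacterium pair) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the pooling strategy, the network architecture, or the embedding size, so mean pooling, a single hidden layer, the tiny `EMBED_DIM`, and the names `proteome_embedding`, `pair_features`, and `interaction_score` are hypothetical. Random vectors stand in for real ESM2 output, and the weights are untrained.

```python
import math
import random

EMBED_DIM = 8   # hypothetical toy size; real PLM embeddings are much wider
HIDDEN = 4      # hypothetical hidden-layer width


def proteome_embedding(protein_embeddings):
    """Mean-pool per-protein vectors into one proteome-level vector (assumed pooling)."""
    if not protein_embeddings:
        raise ValueError("need at least one protein embedding")
    n = len(protein_embeddings)
    return [sum(column) / n for column in zip(*protein_embeddings)]


def pair_features(phage_proteins, host_proteins):
    """Concatenate the phage and host proteome vectors into one feature vector."""
    return proteome_embedding(phage_proteins) + proteome_embedding(host_proteins)


def interaction_score(features, w_hidden, b_hidden, w_out, b_out):
    """One ReLU hidden layer followed by a sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(f * w for f, w in zip(features, unit_weights)) + bias)
        for unit_weights, bias in zip(w_hidden, b_hidden)
    ]
    logit = sum(h * w for h, w in zip(hidden, w_out)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)

# Stand-ins for per-protein embeddings: 5 phage proteins, 40 host proteins.
phage = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(5)]
host = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(40)]
features = pair_features(phage, host)

# Untrained random weights; a real model would learn these from validated pairs.
w_hidden = [[rng.gauss(0, 0.1) for _ in range(2 * EMBED_DIM)] for _ in range(HIDDEN)]
b_hidden = [0.0] * HIDDEN
w_out = [rng.gauss(0, 0.1) for _ in range(HIDDEN)]
b_out = 0.0

score = interaction_score(features, w_hidden, b_hidden, w_out, b_out)
print(f"feature length: {len(features)}, interaction score: {score:.3f}")
```

Mean pooling is used here only because it yields a fixed-length vector regardless of how many proteins a genome encodes; other aggregation schemes would fit the same interface.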
Keywords: LLM; Bacteriophages; phylogeny; Virus
File: Benatti_Lorenzo.pdf (open access, Adobe PDF, 3.42 MB)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84350