Large Language Models for the study of Bacteriophages
BENATTI, LORENZO
2024/2025
Abstract
Bacteriophages represent the most abundant biological entities on Earth, playing pivotal roles in shaping bacterial communities across diverse ecosystems. Their influence extends to critical areas such as human gut microbiota, food safety and agricultural productivity. A fundamental prerequisite for harnessing their potential lies in identifying the specific bacterial hosts they infect. Traditional experimental methods for determining phage-host interactions, while reliable, remain labor-intensive and costly, as they require screening individual phages against extensive panels of bacterial strains. To accelerate this process, computational approaches have been developed to predict putative phage-host interactions in silico. The rapid expansion of genomic databases, fueled by advances in sequencing technologies, has enabled machine learning (ML) models to leverage growing datasets of known phage-host pairs. Among these, protein language models (PLMs) have emerged as powerful tools for biological sequence analysis, demonstrating exceptional performance in tasks such as structure prediction and function annotation. However, their application to phage-host interaction prediction remains underexplored. In this study, I employ the PLM ESM2 to generate proteome-level embeddings for experimentally validated phage-bacteria pairs. These embeddings are then used to train a neural network model designed to predict phage-host interactions, providing a scalable and efficient alternative to traditional screening methods.

File | Size | Format |
---|---|---|
Benatti_Lorenzo.pdf (open access) | 3.42 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/84350