Whole-Genome Phage Lifestyle Classification Using Evo 2 DNA Foundation Model Embeddings

STELLA, FRANCESCO
2025/2026

Abstract

The escalation of antimicrobial resistance represents a critical threat to global health, with projections suggesting 10 million annual deaths by 2050. While bacteriophages offer a promising therapeutic alternative, their clinical integration is hindered by the necessity of accurately classifying their lifestyles as virulent or temperate. Precise identification is vital, as temperate phages carry the risk of horizontal gene transfer and the mobilization of resistance genes. Traditional computational approaches often rely on restricted context windows or isolated genomic fragments, thereby losing the long-range structural dependencies and non-coding signals that define a phage's fundamental biological strategy. This thesis introduces a whole-genome classification framework leveraging Evo 2, a 7-billion-parameter DNA foundation model. By utilizing the sub-quadratic StripedHyena 2 architecture, the model extends the context window to capture dependencies across millions of base pairs at single-nucleotide resolution. High-dimensional latent embeddings were extracted from the model's intermediate layers and used to train supervised classifiers on a curated dataset of 2,176 bacteriophage genomes. Experimental results demonstrate that the proposed Whole-Genome Summation (WGS) approach significantly outperforms existing state-of-the-art tools. A Gradient Boosting classifier trained on WGS-32 embeddings achieved a Matthews Correlation Coefficient (MCC) of 0.92, a substantial improvement over the 0.53 MCC recorded by the DeepPL model. Furthermore, this research reveals that non-coding regions and hypothetical proteins contain robust, lifestyle-specific signals, with the non-coding pipeline alone achieving an MCC of 0.80. These findings suggest that lifestyle signatures are deeply embedded within underexplored areas of the genome. 
By providing an annotation-independent, high-accuracy solution, this framework facilitates the rapid and safe selection of virulent phages for clinical, agricultural, and food safety applications.
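The abstract names two quantitative ingredients: a whole-genome summation of embeddings and evaluation by the Matthews Correlation Coefficient. The sketch below illustrates both in plain Python under stated assumptions. It assumes each genome has already been split into windows with one embedding vector per window, and that pooling is an element-wise sum. The abstract does not spell out these details, so `sum_pool`, `mcc`, and the label encoding (1 = temperate, 0 = virulent) are hypothetical illustrations and not the thesis implementation.

```python
from math import sqrt


def sum_pool(window_embeddings):
    """Element-wise sum of per-window vectors into one genome-level vector.

    Assumption: every window yields a vector of the same dimension.
    """
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    dim = len(window_embeddings[0])
    pooled = [0.0] * dim
    for vec in window_embeddings:
        if len(vec) != dim:
            raise ValueError("all window embeddings must share one dimension")
        for i, value in enumerate(vec):
            pooled[i] += value
    return pooled


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels.

    Assumed encoding: 1 = temperate, 0 = virulent.
    Returns 0.0 when the denominator is zero, a common convention.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy inputs, invented for illustration only.
genome_vector = sum_pool([[1.0, 2.0], [3.0, 4.0]])
score = mcc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0])
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy. In practice the pooled vectors would feed a supervised classifier such as the Gradient Boosting model mentioned in the abstract.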
2025
Evo 2
Large Language Model
Phage Lifestyle
DNA Embeddings
Files in this item:

File: Stella_Francesco.pdf (embargoed until 22/04/2029)
Size: 2.22 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108020