Shotgun metagenomic sequencing is a fundamental tool for bioprospecting and for monitoring bioprocesses in industrial biotechnology. However, multiclass metagenomic classification suffers from limitations intrinsic to the classification strategy adopted. For mobile genetic elements (plasmids and viruses), classification is hindered by the high background noise generated by bacterial chromosomal fragments. Current analytical tools present a methodological trade-off: topological architectures based on assembly graphs lack sensitivity, while semantic Deep Learning models tend to generate high false positive rates because they lack spatial context. This study presents a hybrid computational model designed to overcome this dichotomy by integrating continuous confidence vectors extracted from a convolutional neural network into the relational infrastructure of the graph, improving overall metagenomic classification. To counteract the systematic loss of topologically isolated plasmid contigs, a conditional recovery heuristic, named Plasmid Rescue, was implemented. The architecture was validated on controlled synthetic datasets and on a real environmental sample, comparing predictions against a ground truth based on a consensus of mutually orthogonal tools. Quantitative analysis shows that the hybrid strategy shifts the predictive balance toward sensitivity. On real data, at the cost of a tolerable reduction in precision, plasmid-class recall rose from 11.48% in the baseline topological model to 59.02%. This doubled the plasmid-specific F1 score while maintaining an overall classification accuracy above 80%. The study also documents and analyzes the model's limitations in viral identification, explaining how this discrepancy largely stems from the difficulty current validation systems have in handling the biological interference of prophage fragments.
The orthogonal integration between semantic inference and topological continuity proves to be an essential structural requirement for the accurate profiling of complex microbial communities. The proposed architecture provides a robust analytical tool that is independent of sample composition, applicable to the study of population dynamics and the pre-fractionation of metagenomic data. The accurate isolation of the mobilome paves the way for direct industrial applications, facilitating the bioprospecting of novel enzymatic pathways and the tracking of antimicrobial resistance (AMR) determinants.
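The abstract does not specify how the CNN confidence vectors enter the assembly graph, nor the exact condition used by Plasmid Rescue. Purely as an illustration of the general idea, the sketch below assumes a simple neighbour-averaging scheme and a fixed confidence threshold for isolated contigs. The function names, the `alpha` and `rescue_threshold` parameters, the chromosome fallback, and the toy scores are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Class order used for every confidence vector in this sketch (an assumption).
CLASSES = ("chromosome", "plasmid", "virus")


def propagate(cnn_scores, edges, alpha=0.5, n_iter=10):
    """Blend each contig's CNN scores with the mean scores of its graph neighbours.

    cnn_scores: dict mapping contig id -> list of per-class confidences.
    edges: iterable of (contig, contig) pairs from the assembly graph.
    alpha: hypothetical weight given to the contig's own CNN scores.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    scores = {contig: list(probs) for contig, probs in cnn_scores.items()}
    for _ in range(n_iter):
        updated = {}
        for contig, prior in cnn_scores.items():
            nbrs = neighbours.get(contig, ())
            if not nbrs:
                # Isolated contigs receive no topological information.
                updated[contig] = list(prior)
                continue
            mean = [
                sum(scores[n][k] for n in nbrs) / len(nbrs)
                for k in range(len(CLASSES))
            ]
            updated[contig] = [
                alpha * prior[k] + (1 - alpha) * mean[k]
                for k in range(len(CLASSES))
            ]
        scores = updated
    return scores, neighbours


def classify(cnn_scores, edges, rescue_threshold=0.9, fallback="chromosome"):
    """Label contigs; isolated ones are 'rescued' as plasmid only if the CNN is confident."""
    scores, neighbours = propagate(cnn_scores, edges)
    plasmid_idx = CLASSES.index("plasmid")
    labels = {}
    for contig, probs in scores.items():
        if not neighbours.get(contig):
            # Hypothetical rescue rule: trust the CNN alone above a threshold,
            # otherwise fall back to the majority (chromosomal) class.
            confident = cnn_scores[contig][plasmid_idx] >= rescue_threshold
            labels[contig] = "plasmid" if confident else fallback
        else:
            best = max(range(len(CLASSES)), key=probs.__getitem__)
            labels[contig] = CLASSES[best]
    return labels


# Toy input: three connected contigs and two topologically isolated ones.
demo_scores = {
    "c1": [0.80, 0.10, 0.10],
    "c2": [0.40, 0.50, 0.10],  # weak plasmid call, chromosomal neighbours
    "c3": [0.90, 0.05, 0.05],
    "p1": [0.03, 0.95, 0.02],  # isolated, confident plasmid call
    "x1": [0.30, 0.60, 0.10],  # isolated, low-confidence plasmid call
}
demo_edges = [("c1", "c2"), ("c2", "c3")]
demo_labels = classify(demo_scores, demo_edges)
print(demo_labels)
```

In this toy case the graph context overrides the weak plasmid call on `c2`, while the isolated contig `p1` is recovered as a plasmid and `x1` is not. Lowering the hypothetical `rescue_threshold` would recover more isolated contigs at the expense of precision, which is the kind of trade-off the abstract reports.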
Development of a hybrid approach based on Deep Learning and assembly graphs for multi-class metagenomic classification
MARONGIU, MICHELE
2025/2026
| File | Size | Format |
|---|---|---|
| Marongiu_Michele.pdf (restricted access) | 2.71 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/105730