Integrating Foundation Models and Feature-Based Neural Networks for DNA Barcoding Classification

CARRARO, EDDIE
2025/2026

Abstract

The rapid decline of global biodiversity has renewed attention on the effectiveness of current conservation practices. A major obstacle in this context is the so-called "taxonomic impediment": the limited availability of expert taxonomists and of comprehensive, high-quality taxonomic reference data. At the same time, recent advances in high-throughput sequencing have enabled the widespread use of short genomic regions, commonly referred to as DNA barcodes, for reliable species identification. The ability to generate large volumes of sequence data across diverse taxa has in turn increased the demand for accurate and scalable automated sequence analysis methods. In this work, we present a new framework for species- and genus-level identification from DNA barcode sequences that leverages an ensemble of deep neural network models. We explore and compare multiple strategies for transforming nucleotide sequences into representations compatible with deep learning architectures, including, among others, novel chaos game-derived mapping approaches. Additionally, we investigate the use of currently available foundation models for DNA barcoding analysis, evaluating both off-the-shelf configurations and fine-tuned variants, and systematically comparing their performance against models trained from scratch. By integrating models trained on heterogeneous representations, the resulting ensemble captures complementary information and delivers performance that matches or surpasses existing state-of-the-art approaches on both synthetic benchmarks and real biological datasets.
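
As an illustration of what a chaos game-derived representation can look like, the sketch below implements the classical chaos game representation (CGR) and its k-mer frequency grid. It is not the novel mapping proposed in the thesis, whose details are not given in this record. The corner assignment in `CORNERS`, the function names `cgr_points` and `fcgr`, and the choice to skip ambiguous symbols are all assumptions made for this example.

```python
# Classical chaos game representation (CGR) of a nucleotide sequence.
# Assumption: this corner layout is one common convention; others exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner assigned to the current nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Simplification: ambiguous symbols such as N are skipped.
            continue
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def fcgr(sequence, k=3):
    """Count CGR points on a 2^k x 2^k grid. After at least k steps, the
    cell a point falls in is determined by the last k nucleotides, so the
    grid acts as a k-mer frequency image usable as network input."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    # The first k-1 points still depend on the arbitrary starting position.
    for px, py in cgr_points(sequence)[k - 1:]:
        col = min(int(px * size), size - 1)
        row = min(int(py * size), size - 1)
        grid[row][col] += 1
    return grid
```

A grid produced this way has a fixed shape regardless of sequence length, which is one reason such mappings are convenient inputs for convolutional models.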
2025
Keywords: Deep Learning; DNA Barcoding; Foundation Models
Files in this item:
Carraro_Eddie.pdf (Adobe PDF, 940.55 kB), under embargo until 02/10/2027

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/106271