Integrating Foundation Models and Feature-Based Neural Networks for DNA Barcoding Classification
CARRARO, EDDIE
2025/2026
Abstract
The rapid decline of global biodiversity has renewed attention on the effectiveness of current conservation practices. A major obstacle in this context is the so-called "taxonomic impediment": the limited availability of expert taxonomists and of comprehensive, high-quality taxonomic reference data. At the same time, recent advances in high-throughput sequencing have enabled the widespread use of short genomic regions, commonly referred to as DNA barcodes, for reliable species identification. The ability to generate large volumes of sequence data across diverse taxa has consequently increased the demand for accurate and scalable automated sequence analysis methods. In this work, we present a new framework for species- and genus-level identification from DNA barcode sequences that leverages an ensemble of deep neural network models. We explore and compare multiple strategies for transforming nucleotide sequences into representations compatible with deep learning architectures, including novel chaos game-derived mapping approaches. Additionally, we investigate the use of currently available foundation models for DNA barcoding analysis, evaluating both off-the-shelf configurations and fine-tuned variants and systematically comparing their performance against models trained from scratch. By integrating models trained on heterogeneous representations, the resulting ensemble captures complementary information and delivers performance that matches or surpasses existing state-of-the-art approaches on both synthetic benchmarks and real biological datasets.

| File | Size | Format |
|---|---|---|
| Carraro_Eddie.pdf (under embargo until 02/10/2027) | 940.55 kB | Adobe PDF |
https://hdl.handle.net/20.500.12608/106271