Benchmarking Strategies for Retrotransposons Expression Analysis at Family and Locus-Level Resolution

MOHAMED, MOHAMED KHALED MOHAMED HOSNY ELSAYED
2024/2025

Abstract

Retrotransposons, a major class of transposable elements (TEs) that replicate via an RNA intermediate, make up a substantial fraction of the human genome and influence gene regulation, genome stability, evolution, and disease. Quantifying their expression is difficult because high sequence similarity among copies leads to pervasive read multi-mapping. Several tools have been developed for this purpose, yet no method is universally optimal. We systematically benchmarked purpose-built TE-expression tools (SalmonTE, Telescope, TEspeX) alongside the general-purpose transcript quantifier Salmon, and evaluated the different quantification strategies that can be implemented with these tools. Using simulated RNA-seq datasets with known ground truth, we assessed performance at the locus and family levels. For locus-level quantification, Salmon-based methods (Salmon, SalmonTE) achieved the highest overall accuracy and computational efficiency, but only when genes and retrotransposons were quantified jointly; joint modeling reduced spurious retrotransposon signal. For family-level quantification, aggregating locus-level estimates to the family level produced more reliable results than mapping reads to retrotransposon family consensus sequences. These results define best practices for retrotransposon expression analysis and provide a practical framework for selecting computational strategies that balance precision, robustness, and speed.
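The family-level strategy favored above, aggregating locus-level estimates rather than mapping to consensus sequences, can be illustrated with a minimal Python sketch. This is not the thesis's pipeline: the function name `aggregate_to_family`, the locus identifiers, the locus-to-family mapping, and the counts are all hypothetical, and the sketch assumes locus-level estimates are already available as a name-to-count dictionary (for example, parsed from a quantifier's output).

```python
from collections import defaultdict


def aggregate_to_family(locus_counts, locus_to_family):
    """Sum locus-level expression estimates into family-level totals.

    Features without a family annotation (e.g. genes that were quantified
    jointly with retrotransposons) are skipped.
    """
    family_totals = defaultdict(float)
    for locus, count in locus_counts.items():
        family = locus_to_family.get(locus)
        if family is None:
            continue  # not a retrotransposon locus; ignore for family totals
        family_totals[family] += count
    return dict(family_totals)


# Hypothetical locus-level estimates from a joint gene + retrotransposon run.
locus_counts = {
    "L1HS_dup1": 12.5,
    "L1HS_dup2": 7.5,
    "AluY_dup1": 30.0,
    "GENE_A": 100.0,  # a gene, present only because quantification was joint
}

# Hypothetical annotation mapping each retrotransposon locus to its family.
locus_to_family = {
    "L1HS_dup1": "L1HS",
    "L1HS_dup2": "L1HS",
    "AluY_dup1": "AluY",
}

family_counts = aggregate_to_family(locus_counts, locus_to_family)
print(family_counts)
```

In this sketch the gene entry is kept in the input but excluded from the family totals, mirroring the idea that genes are modeled jointly to absorb reads that would otherwise be misassigned, without being reported as retrotransposon expression.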
2024
Retrotransposons
Transposable element
Benchmark
Files in this item:
Mohamed_MohamedKhaledMohamedHosnyElsayed.pdf (restricted access; 1.45 MB; Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/91413