Benchmarking Strategies for Retrotransposons Expression Analysis at Family and Locus-Level Resolution
MOHAMED, MOHAMED KHALED MOHAMED HOSNY ELSAYED
2024/2025
Abstract
Retrotransposons, a major class of transposable elements (TEs) that replicate via an RNA intermediate, make up a substantial fraction of the human genome and influence gene regulation, genome stability, evolution, and disease. Quantifying their expression is difficult because high sequence similarity leads to pervasive read multi-mapping. Several tools have been developed for this purpose, yet no method is universally optimal. We systematically benchmarked purpose-built TE-expression tools (SalmonTE, Telescope, TEspeX) alongside the general-purpose transcript quantifier Salmon and evaluated different quantification strategies that can be implemented using these tools. Using simulated RNA-seq datasets with known ground truth, we assessed performance at the locus and family levels. For locus-level quantification, Salmon-based methods (Salmon, SalmonTE) achieved the highest overall accuracy and computational efficiency, but only when genes and retrotransposons were quantified jointly; joint modeling reduced spurious retrotransposon signal. For family-level quantification, aggregating locus-level estimates to the family level produced more reliable results than mapping reads to retrotransposon family consensus sequences. These results define best practices for retrotransposon expression analysis and provide a practical framework for selecting computational strategies that balance precision, robustness, and speed.

| File | Size | Format |
|---|---|---|
| Mohamed_MohamedKhaledMohamedHosnyElsayed.pdf (restricted access) | 1.45 MB | Adobe PDF |
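
The family-level strategy favored in the abstract (aggregating locus-level estimates rather than mapping to consensus sequences) can be illustrated with a minimal sketch. The abstract does not state how the aggregation is performed, so summing per-locus estimates is an assumption here, and the locus identifiers, the `locus_to_family` annotation, and the function name `aggregate_to_family` are all hypothetical.

```python
from collections import defaultdict


def aggregate_to_family(locus_counts, locus_to_family):
    """Sum locus-level estimates into family-level totals.

    Assumption: aggregation is a plain sum over loci of the same family.
    Features without a family annotation (e.g. genes quantified jointly
    with retrotransposons) are left out of the family totals.
    """
    family_counts = defaultdict(float)
    for locus, count in locus_counts.items():
        family = locus_to_family.get(locus)
        if family is None:
            continue  # not an annotated retrotransposon locus
        family_counts[family] += count
    return dict(family_counts)


# Hypothetical locus-level estimates from a joint gene + TE quantification.
locus_counts = {
    "L1HS_chr1_100": 12.5,
    "L1HS_chr5_200": 7.5,
    "AluY_chr2_300": 30.0,
    "GENE_A": 100.0,
}

# Hypothetical locus-to-family annotation.
locus_to_family = {
    "L1HS_chr1_100": "L1HS",
    "L1HS_chr5_200": "L1HS",
    "AluY_chr2_300": "AluY",
}

family_counts = aggregate_to_family(locus_counts, locus_to_family)
print(family_counts)
```

In this sketch the gene entry takes part in the locus-level table but is excluded from the family totals, mirroring the idea that genes and retrotransposons are quantified together and only the retrotransposon loci are rolled up afterwards.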
The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/91413