Benchmarking Strategies for Retrotransposons Expression Analysis at Family and Locus-Level Resolution

MOHAMED, MOHAMED KHALED MOHAMED HOSNY ELSAYED

2024/2025

Abstract

Retrotransposons, a major class of transposable elements (TEs) that replicate via an RNA intermediate, make up a substantial fraction of the human genome and influence gene regulation, genome stability, evolution, and disease. Quantifying their expression is difficult because high sequence similarity leads to pervasive read multi-mapping. Several tools have been developed for this purpose, yet no method is universally optimal. We systematically benchmarked purpose-built TE-expression tools (SalmonTE, Telescope, TEspeX) alongside the general-purpose transcript quantifier Salmon and evaluated the different quantification strategies that can be implemented with these tools. Using simulated RNA-seq datasets with known ground truth, we assessed performance at the locus and family levels. For locus-level quantification, Salmon-based methods (Salmon, SalmonTE) achieved the highest overall accuracy and computational efficiency, but only when genes and retrotransposons were quantified jointly; joint modeling reduced spurious retrotransposon signal. For family-level quantification, aggregating locus-level estimates to the family level produced more reliable results than mapping reads to retrotransposon family consensus sequences. These results define best practices for retrotransposon expression analysis and provide a practical framework for selecting computational strategies that balance precision, robustness, and speed.
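
The family-level strategy the abstract favours, summing locus-level estimates into family totals, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `aggregate_to_family`, the locus identifiers, the counts, and the locus-to-family mapping are all hypothetical and serve only to show the aggregation step.

```python
from collections import defaultdict


def aggregate_to_family(locus_counts, locus_to_family):
    """Sum locus-level expression estimates into per-family totals.

    Both arguments are plain dicts; their layout is an assumption made
    for this sketch, not a format produced by any specific tool.
    """
    family_totals = defaultdict(float)
    for locus_id, count in locus_counts.items():
        # Loci absent from the annotation are skipped rather than guessed.
        family = locus_to_family.get(locus_id)
        if family is not None:
            family_totals[family] += count
    return dict(family_totals)


# Hypothetical locus IDs and estimated read counts, for illustration only.
locus_counts = {
    "L1HS_chr1_100": 12.5,
    "L1HS_chr5_200": 7.5,
    "AluYa5_chr2_300": 30.0,
}
locus_to_family = {
    "L1HS_chr1_100": "L1HS",
    "L1HS_chr5_200": "L1HS",
    "AluYa5_chr2_300": "AluYa5",
}

family_counts = aggregate_to_family(locus_counts, locus_to_family)
print(family_counts)
```

In this sketch, reads are assigned to individual loci first and combined only afterwards, in contrast to mapping reads directly to one consensus sequence per family.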

Record

Faculty/Department: Dipartimento di Biologia - DiBio
Degree programme: MOLECULAR BIOLOGY, Laurea Magistrale (master's degree, D.M. 270/2004)
Academic year: 2024
English title: Benchmarking Strategies for Retrotransposons Expression Analysis at Family and Locus-Level Resolution
Keywords: Retrotransposons; Transposable element; Benchmark
Supervisor: SALES, GABRIELE
Appears in collections: Lauree magistrali (master's theses)

Files in this item:

File	Size	Format
Mohamed_MohamedKhaledMohamedHosnyElsayed.pdf (restricted access)	1.45 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/91413