RNA sequencing (RNA-seq) is a technique used to study the transcriptome (defined as the sum of all mRNA molecules expressed by genes) within cells, tissues or organisms. It shows which genes are more or less expressed when two (or more) conditions are compared, and it provides a set of tools to retrieve biological information about those genes (e.g. their involvement in pathways or processes). A plethora of tools is available to perform such tasks. To compare some of them, a dataset available on NCBI was used. In this dataset, RNA was extracted from the digestive gland of Ruditapes philippinarum exposed to heat stress (30°C for 30 days) and compared with controls maintained at 20°C. Sequencing data, stored on NCBI in SRA format, were downloaded with the SRA Toolkit and converted to FASTQ format. Quality control was performed with FastQC and reads were trimmed with fastp. STAR was used to generate genome indexes from the reference genome and its annotation in GTF format, and then to map the reads. featureCounts was used to quantify the reads aligned to genomic features. This thesis investigated the impact of different RNA-seq approaches on the number of differentially expressed genes (DEGs) and on the resulting biological interpretation. The first aspect was the effect of two alternative STAR settings, which differ in their output filtering options and alignment strategies; the two settings showed some variation in mapping rates. The second aspect was a comparison of two approaches to data normalization: the log2(CPM + c) transformation and the RUVSeq method. The log2(CPM + c) transformation was applied through iDEP (integrated Differential Expression and Pathway analysis), a web application. RUV (Removal of Unwanted Variation) uses factor analysis to remove unwanted variation; in particular, RUVSeq was employed with negative control samples. This helped to eliminate systematic artefacts and to improve the detection of true biological signals. After normalization, differential expression analysis was conducted with DESeq2 for both methods. Gene Set Enrichment Analysis (GSEA) was used for pathway analysis in iDEP, and GSEA was also run after RUVSeq using the clusterProfiler package. The two methods, iDEP and RUVSeq+DESeq2, differed in the number of differentially expressed genes (785 with RUVSeq+DESeq2 and 659 with iDEP) and in the functional results obtained: only 19 pathways were shared between the two methods out of a total of 111. The main biological processes highlighted were cilium motility and structure, sperm motility and structure, and alteration of non-canonical Wnt signalling, while mitochondrial processes were enhanced by heat stress. Such comparative analyses are useful for revealing differences in performance among RNA-seq pipelines, which may lead to slight differences in the results obtained.
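The log2(CPM + c) transformation mentioned above can be illustrated with a minimal sketch. It assumes a small genes x samples table of raw counts and a pseudocount c = 4; the abstract does not state the value of c used, and the toy counts and the function name `log2_cpm` are hypothetical. It is not iDEP's own code, only the arithmetic behind the formula.

```python
import math


def log2_cpm(counts, pseudocount):
    """Return log2(CPM + c) for a genes x samples table of raw read counts.

    CPM (counts per million) rescales each sample by its library size,
    i.e. the total number of counted reads in that sample.
    """
    n_samples = len(counts[0])
    # Library size of each sample = column sum over all genes.
    library_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [
            math.log2(row[j] * 1e6 / library_sizes[j] + pseudocount)
            for j in range(n_samples)
        ]
        for row in counts
    ]


# Hypothetical toy data: 3 genes (rows) x 2 samples (columns).
# The second sample was sequenced twice as deeply as the first.
toy_counts = [
    [10, 20],
    [90, 180],
    [0, 0],
]

# c = 4 is an illustrative assumption, not a value taken from the thesis.
transformed = log2_cpm(toy_counts, pseudocount=4.0)
for gene_values in transformed:
    print([round(v, 3) for v in gene_values])
```

Because CPM divides by library size, the two samples end up with the same values despite the twofold difference in depth, and the pseudocount keeps genes with zero counts finite on the log scale.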
RNA sequencing is a technique used to study the transcriptome (defined as the sum of all mRNA molecules expressed by genes) within cells, tissues or organisms. It provides information on which genes are more or less expressed when two (or more) conditions are compared, and it offers a set of tools to obtain biological information about those genes (for example, their involvement in biological pathways). A multitude of tools is available to carry out these tasks. To perform this comparative analysis, a dataset from NCBI was used. In this case, RNA was extracted from the digestive gland of several specimens of the Manila clam Ruditapes philippinarum exposed to a heat stress of 30°C for 30 days, while control animals were kept at 20°C. The sequencing data, archived on NCBI in SRA format, were downloaded with the SRA Toolkit and the sequences were then converted to FASTQ format. Quality control of the sequences was carried out with FastQC and trimming with fastp. The genome index was built with STAR from a reference genome and its annotation in GTF format, and was then used to align the reads to the genome and quantify them. This thesis examined the impact of different RNA-seq approaches on the number of differentially expressed genes and on the resulting biological interpretation. The first aspect concerned the effects of two different STAR settings for mapping; the two settings showed some variation in mapping rates. The second aspect was a comparison of two approaches to data normalization: the log2(CPM + c) transformation and the RUVSeq method. The log2(CPM + c) transformation was applied through a web application, iDEP ("integrated Differential Expression and Pathway analysis"). RUV ("Removal of Unwanted Variation") uses factor analysis to remove unwanted variation; in particular, RUVSeq was used with negative control samples. This helped to eliminate systematic errors and to improve the detection of true biological signals. After normalization, differential expression analysis was carried out with DESeq2 for both methods. GSEA ("Gene Set Enrichment Analysis") was used for pathway analysis in iDEP, and was also run after RUVSeq using the clusterProfiler package. The two methods, iDEP and RUVSeq+DESeq2, differ in the number of differentially expressed genes (785 with RUVSeq+DESeq2 and 659 with iDEP) and in the functional results obtained. Only 19 pathways were shared between the two methods out of a total of 111. The main biological processes highlighted were cilium motility and structure, sperm motility and structure, alteration of Wnt signalling pathways, and mitochondrial activity. This comparative analysis was useful for revealing differences among RNA-seq pipelines, which can lead to slight differences in the results obtained.
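To put the reported pathway overlap in perspective, the short sketch below computes the shared fraction from the figures in the abstract (19 shared pathways out of 111). It assumes that 111 is the size of the union of the two result sets, which the abstract does not state explicitly; under that assumption the ratio is a Jaccard index. The pathway names in the toy example are hypothetical placeholders, not results from the thesis.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical pathway labels, for illustration only.
idep_pathways = {"cilium movement", "sperm flagellum", "Wnt signalling"}
ruvseq_pathways = {"cilium movement", "mitochondrial respiration"}

# 1 shared pathway out of 4 distinct pathways in total.
toy_overlap = jaccard_index(idep_pathways, ruvseq_pathways)

# Figures reported in the abstract, assuming 111 is the union of both sets.
reported_overlap = 19 / 111

print(f"toy overlap:      {toy_overlap:.1%}")
print(f"reported overlap: {reported_overlap:.1%}")
```

Under the stated assumption, roughly one pathway in six was recovered by both pipelines, even though the numbers of differentially expressed genes (785 and 659) were of similar magnitude.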
Comparative analysis of RNA-seq protocols: a case study on heat stress response in Ruditapes philippinarum
QUARZAGO, MIRIAM
2023/2024
File | Size | Format |
---|---|---|
QUARZAGO_MIRIAM.pdf (open access) | 1.67 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/74763