A Comparative Study of Batch Effect Correction Methods for scRNA Data

This thesis analyzes how different machine-learning-based methods reduce or eliminate the batch effects present among scRNA datasets from different sources. Batch effects are systematic differences between datasets that arise from non-biological variation, such as differences in experimental conditions, protocols, or sequencing platforms, and they can obscure true biological signals. In our work, we considered a series of pairs of single-cell datasets, each pair derived from the same tissue. We first demonstrated that a straightforward dimensionality reduction technique such as t-SNE exhibits batch effects when datasets from different sources are combined. We then investigated the extent to which batch effects can be reduced by computing a t-SNE embedding on one dataset alone and subsequently projecting the paired dataset into that embedding. This approach was examined across various tissue types. It does not correct batch effects in the data itself, only on the projection. For dimensionality reduction methods, we therefore also assess the correction visually, by checking how the various cell types appear in the projection. The accuracy of the approach is measured with a k-NN classifier applied to the resulting embeddings and compared with the other approaches considered in the thesis. We also applied this methodology, proposed by Policar et al. \cite{polivcar2023embedding}, with UMAP. We then compared it with other prominent methods to evaluate how well each reduces batch effects. Ideally, batch effects should also be eliminated when the data are represented in two dimensions. Another technique we explored is PCA followed by the Harmony algorithm. We also considered deep-learning-based methods as well as a foundation model trained on datasets from multiple species.
The methods are evaluated by the extent to which the batch effect is mitigated, either through correction of the data or as seen in the reduced-dimensional visualization. With the dimensionality reduction methods, batch effects are corrected on the 2D visual plane and not in the data itself, whereas deep learning models correct batch effects in the data itself, and the correction is then reflected in the projection. We compare all results on this lower-dimensional plane. The techniques range from conventional dimensionality reduction methods such as t-SNE and UMAP, used as described above, to state-of-the-art deep learning models. A distinction must therefore be made: some methods correct batch effects only on the projection, while others correct them in the data and, as a consequence, on the projection as well. The core objective is to evaluate how effectively these techniques remove batch effects while preserving the essential signal characteristics of the different cell types.
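
The reference-then-project idea and its k-NN evaluation can be illustrated with a small sketch. This is not the thesis implementation: the actual method optimizes the projected positions under the t-SNE (or UMAP) objective, whereas the sketch below, as a stated simplification, only places each cell of the second dataset at the median 2D position of its nearest reference cells in expression space. The function names, the toy expression values, and the 2D reference coordinates are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter
from statistics import median


def nearest(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def project_into_reference(ref_expr, ref_embed, query_expr, k=3):
    """Simplified projection (assumption, not the full method): place each
    query cell at the coordinate-wise median of the 2D positions of its k
    nearest reference cells in expression space. The reference embedding
    itself is never changed."""
    placed = []
    for cell in query_expr:
        idx = nearest(cell, ref_expr, k)
        placed.append(tuple(median(ref_embed[i][d] for i in idx)
                            for d in range(2)))
    return placed


def knn_accuracy(ref_embed, ref_labels, query_embed, query_labels, k=3):
    """Fraction of query cells whose true cell type equals the majority
    label of their k nearest reference cells in the 2D embedding."""
    hits = 0
    for point, truth in zip(query_embed, query_labels):
        votes = Counter(ref_labels[i] for i in nearest(point, ref_embed, k))
        hits += votes.most_common(1)[0][0] == truth
    return hits / len(query_labels)


# Toy reference dataset: two cell types, three "genes" per cell.
ref_expr = [(0, 0, 0), (0, 1, 0), (1, 0, 0),
            (10, 10, 10), (10, 11, 10), (11, 10, 10)]
ref_labels = ["A", "A", "A", "B", "B", "B"]
# Hypothetical 2D embedding of the reference cells (e.g. from t-SNE).
ref_embed = [(-5, 0), (-5, 1), (-4, 0), (5, 0), (5, 1), (4, 0)]

# Second dataset: same cell types, shifted by a constant "batch" offset.
query_expr = [(2, 2, 2), (2, 3, 2), (12, 12, 12), (12, 13, 12)]
query_labels = ["A", "A", "B", "B"]

query_embed = project_into_reference(ref_expr, ref_embed, query_expr)
acc = knn_accuracy(ref_embed, ref_labels, query_embed, query_labels)
print(query_embed, acc)
```

Because the second dataset is positioned relative to the reference cells rather than embedded jointly, a constant offset between batches does not create separate clusters in this toy case; the k-NN accuracy then quantifies how well cell types from the two sources coincide on the 2D plane.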

MANTHARA, CHRISTY JO

2024/2025

Record

	Faculty/Department
	
				Department of Information Engineering (Dipartimento di Ingegneria dell'Informazione, DEI)
			
	Degree programme
	
				COMPUTER ENGINEERING, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				A Comparative Study of Batch Effect Correction Methods for scRNA Data
			
	Keywords
	
				Bioinformatics
Batch Effects
scRNA
foundational models
tsne
			
	Supervisor
	
				DI CAMILLO, BARBARA
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Manthara_ChristyJo.pdf (open access)	11.09 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/94138