A Comparative Study of Batch Effect Correction Methods for scRNA Data

MANTHARA, CHRISTY JO
2024/2025

Abstract

This thesis analyzes how different machine learning based methods affect, and to what extent they eliminate, the batch effects present among scRNA datasets from different sources. A batch effect is a systematic difference between datasets that arises from non-biological variation, such as differences in experimental conditions, protocols, or sequencing platforms, and that can obscure true biological signals. We consider a series of pairs of single-cell datasets, each pair derived from the same tissue. We first show that a straightforward dimensionality reduction technique such as t-SNE exhibits batch effects when datasets from different sources are combined. We then investigate how far batch effects can be reduced by computing a t-SNE embedding on one dataset alone and subsequently projecting the paired dataset into that embedding, following the methodology of Policar et al.\cite{polivcar2023embedding}. This approach is examined across various tissue types. It does not correct batch effects in the data itself, only in the projection; ideally, batch effects should also be eliminated when the data is represented in two dimensions. We therefore also check the correction visually, through the way the various cell types are laid out in the projection. The accuracy of the approach is measured with a k-NN classifier on the resulting visualizations and compared with that of the other approaches considered in the thesis. We also apply the same methodology with UMAP. Finally, we compare this approach with other prominent methods for reducing batch effects: PCA followed by the Harmony algorithm, deep learning based methods, and a foundation model trained on datasets from multiple species.
The methods are evaluated by the extent to which the batch effect is mitigated, either through correction in the data or through correction of its visualization in reduced dimensions. With the dimensionality reduction methods, batch effects are corrected on the 2D projection and not in the data itself, whereas the deep learning models correct batch effects in the data itself, and the correction is then reflected in the projection. We compare all results on this lower-dimensional plane. The techniques range from conventional dimensionality reduction methods such as t-SNE and UMAP, used as described above, to state-of-the-art deep learning models. The core objective is to evaluate how effectively these techniques remove batch effects while preserving the essential signal characteristics of the different cell types.
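The k-NN evaluation on the 2D visualization can be illustrated with a short Python sketch. It is not the thesis implementation: the function names, the choice of k = 3, the Euclidean distance, the cell type labels, and the toy coordinates are all assumptions made for illustration. The idea is that cells of the reference dataset, with known cell types, act as the training set, and each cell of the paired dataset projected into the same embedding is classified by its nearest reference neighbours. A high accuracy suggests that matching cell types from the two sources land close together in the projection.

```python
import math
from collections import Counter


def knn_predict(ref_points, ref_labels, query, k=3):
    """Label one 2D query point by majority vote among its k nearest reference points."""
    # Pair every reference point's Euclidean distance to the query with its label.
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(ref_points, ref_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(ref_points, ref_labels, query_points, query_labels, k=3):
    """Fraction of projected cells whose predicted cell type matches the annotated one."""
    correct = sum(
        knn_predict(ref_points, ref_labels, q, k) == y
        for q, y in zip(query_points, query_labels)
    )
    return correct / len(query_points)


# Toy 2D coordinates and labels, invented for illustration only (hypothetical data).
reference_embedding = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),   # reference cells of type "alpha"
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),   # reference cells of type "beta"
]
reference_cell_types = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]

# Cells of the paired dataset after projection into the reference embedding.
projected_embedding = [(0.1, 0.1), (5.1, 5.0), (0.3, 0.2), (4.9, 4.8)]
projected_cell_types = ["alpha", "beta", "alpha", "beta"]

accuracy = knn_accuracy(
    reference_embedding, reference_cell_types,
    projected_embedding, projected_cell_types,
)
print(f"k-NN accuracy on the projection: {accuracy:.2f}")
```

In practice the coordinates would come from the t-SNE or UMAP embedding of the reference dataset and the projection of the paired dataset into it, and a library classifier would typically replace the hand-written one.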
2024
Bioinformatics
Batch Effects
scRNA
foundational models
tsne
File: Manthara_ChristyJo.pdf (open access, Adobe PDF, 11.09 MB)

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/94138