Deep Embedding Models for Oncology Gene Expression: Investigating Single-cell Foundation Models for Bulk RNA-seq Classification

Foundation models trained on large-scale single-cell RNA sequencing (scRNA-seq) have recently emerged as powerful tools for extracting transferable representations of gene expression. An open question is whether these models, originally designed for single-cell data, can also improve analyses of bulk RNA sequencing (bulk RNA-seq), which remains predominant in clinical and cohort oncology studies despite its lower resolution. This thesis evaluates five representative models: tGPT, GenePT, scFoundation, UCE, and scVI. The evaluation uses 40 bulk RNA-seq datasets from cBioPortal, covering different tumor types and spanning three binary classification tasks: tumor vs. healthy, papillary vs. non-papillary (in bladder cancer), and primary vs. metastatic. Logistic regression was applied both to raw features and to embeddings derived from the models, all of which were trained on scRNA-seq data. Performance was assessed on bulk RNA-seq datasets through within-dataset validation and cross-dataset transfer (12 pairs), using accuracy, AUC, and AUCPR. The strongest evidence came from the cross-dataset evaluations, where training and testing on independent studies exposed the limits of raw features, which often collapsed to near-random performance. In contrast, embeddings from tGPT and GenePT transferred robustly, with median AUC values frequently exceeding 0.96, whereas UCE yielded moderate improvements and both scFoundation and scVI generally underperformed. Overall, this work shows that single-cell foundation models can be successfully applied to bulk RNA-seq, improving robustness and reproducibility across independent cohorts. These findings highlight their potential for more reliable patient stratification in oncology and point toward future directions such as fine-tuning on bulk data and multi-omics integration.
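The three metrics named in the abstract can be illustrated with a small, dependency-free sketch. This is not the thesis code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and average precision is used here as one common estimator of AUCPR (the thesis may compute it differently).

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label.

    The 0.5 threshold is an assumption, not a value taken from the thesis.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision@k over the ranks k of the positives.

    This simple version assumes the scores are distinct (no tie handling).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive sample")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Hypothetical scores from a classifier trained on one cohort and applied to another.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
toy_aucpr = average_precision(toy_labels, toy_scores)
```

In a cross-dataset setting, the scores would come from a model fitted on one study and applied unchanged to an independent one; an AUC near 0.5 then corresponds to the near-random behaviour the abstract reports for raw features.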

TASSOTTI, CARLOTTA

2024/2025

Record


	Faculty/Department
				Dipartimento di Ingegneria dell'Informazione - DEI
	Degree programme
				Bioengineering, Master's degree (D.M. 270/2004)
	Academic year
				2024
	English title
				Deep Embedding Models for Oncology Gene Expression: Investigating Single-cell Foundation Models for Bulk RNA-seq Classification
	Keywords
				Foundation models
				Single-cell RNA-seq
				Bulk RNA-seq
				Gene embeddings
				Tumor classification
	Supervisor
				DI CAMILLO, BARBARA
	Appears in collections:
				Master's theses

Files in this item:

File	Size	Format
Tassotti_Carlotta.pdf (open access)	5.06 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/94423