Deep Embedding Models for Oncology Gene Expression: Investigating Single-cell Foundation Models for Bulk RNA-seq Classification
TASSOTTI, CARLOTTA
2024/2025
Abstract
Foundation models trained on large-scale single-cell RNA sequencing (scRNA-seq) have recently emerged as powerful tools for extracting transferable representations of gene expression. An open question is whether these models, originally designed for single-cell data, can also improve analyses of bulk RNA sequencing (bulk RNA-seq), which remains predominant in clinical and cohort oncology studies despite its lower resolution. This thesis evaluates five representative models: tGPT, GenePT, scFoundation, UCE, and scVI. The evaluation uses 40 bulk RNA-seq datasets from cBioPortal, covering different tumor types and spanning three binary classification tasks: tumor vs. healthy, papillary vs. non-papillary (in bladder cancer), and primary vs. metastatic. Logistic regression was applied both to raw features and to embeddings derived from the models, all of which were trained on scRNA-seq data. Performance on the bulk RNA-seq datasets was assessed through within-dataset validation and cross-dataset transfer (12 pairs), using accuracy, AUC, and AUCPR. The strongest evidence came from the cross-dataset evaluations, where training and testing on independent studies exposed the limits of raw features, which often collapsed to near-random performance. In contrast, embeddings from tGPT and GenePT demonstrated robust transferability, with median AUC values frequently exceeding 0.96, whereas UCE yielded moderate improvements and both scFoundation and scVI generally underperformed. Overall, this work shows that single-cell foundation models can be successfully applied to bulk RNA-seq, improving robustness and reproducibility across independent cohorts. These findings highlight their potential for more reliable patient stratification in oncology and point toward future directions such as fine-tuning on bulk data and multi-omics integration.

| File | Size | Format |
|---|---|---|
| Tassotti_Carlotta.pdf (open access) | 5.06 MB | Adobe PDF |
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/94423