Deep Embedding Models for Oncology Gene Expression: Investigating Single-cell Foundation Models for Bulk RNA-seq Classification

TASSOTTI, CARLOTTA
2024/2025

Abstract

Foundation models trained on large-scale single-cell RNA sequencing (scRNA-seq) data have recently emerged as powerful tools for extracting transferable representations of gene expression. An open question is whether these models, originally designed for single-cell data, can also improve analyses of bulk RNA sequencing (bulk RNA-seq), which remains predominant in clinical and cohort oncology studies despite its lower resolution. This thesis evaluates five representative models: tGPT, GenePT, scFoundation, UCE, and scVI. The evaluation uses 40 bulk RNA-seq datasets from cBioPortal, covering different tumor types and spanning three binary classification tasks: tumor vs. healthy, papillary vs. non-papillary (in bladder cancer), and primary vs. metastatic. Logistic regression was applied both to raw features and to embeddings derived from the models trained on scRNA-seq data. Performance was assessed on the bulk RNA-seq datasets through within-dataset validation and cross-dataset transfer (12 pairs), using accuracy, AUC, and AUCPR. The strongest evidence came from the cross-dataset evaluations, where training and testing on independent studies exposed the limits of raw features, which often collapsed to near-random performance. In contrast, embeddings from tGPT and GenePT transferred robustly, with median AUC values frequently exceeding 0.96, whereas UCE yielded moderate improvements and both scFoundation and scVI generally underperformed. Overall, this work shows that single-cell foundation models can be successfully applied to bulk RNA-seq, improving robustness and reproducibility across independent cohorts. These findings highlight their potential for more reliable patient stratification in oncology and point toward future directions such as fine-tuning on bulk data and multi-omics integration.
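The cross-dataset transfer protocol described above (fit a logistic regression on one cohort, score it on an independent one) can be illustrated with a minimal sketch. This is not the thesis code: the two cohorts are synthetic, the helper names (`make_cohort`, `fit_logreg`, `roc_auc`) are hypothetical, the logistic regression is a plain gradient-descent implementation written with the Python standard library only, and AUCPR is omitted for brevity. In practice the feature rows would be either raw expression values or model-derived embeddings.

```python
import math
import random


def make_cohort(n, seed):
    """Synthetic stand-in for one study: 2 features, binary label (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        # Feature 0 carries the class signal; feature 1 is pure noise.
        X.append([rng.gauss(2.0 * label, 0.5), rng.gauss(0.0, 0.5)])
        y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def fit_logreg(X, y, lr=0.1, epochs=500):
    """Unregularized logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        probs = predict_proba(X, w, b)
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi, pi in zip(X, y, probs):
            err = pi - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Cross-dataset transfer: train on cohort A, evaluate on independent cohort B.
X_train, y_train = make_cohort(100, seed=0)
X_test, y_test = make_cohort(100, seed=1)

w, b = fit_logreg(X_train, y_train)
scores = predict_proba(X_test, w, b)

acc = sum((s >= 0.5) == (t == 1) for s, t in zip(scores, y_test)) / len(y_test)
auc = roc_auc(y_test, scores)
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```

The key design point is that no sample from the test cohort is seen during fitting, so the reported metrics reflect transfer between studies rather than within-study fit. Within-dataset validation would instead split a single cohort into training and held-out folds.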
2024
Foundation models
Single-cell RNA-seq
Bulk RNA-seq
Gene embeddings
Tumor classification
Files in this item:
Tassotti_Carlotta.pdf (open access, 5.06 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/94423