Effects of misclustering in multi-sample, multi-group scRNA-seq studies: a stability-based approach

DOLFI, GABRIELE
2025/2026

Abstract

In the past decade, scRNA-sequencing has become an established technology for exploring gene expression dynamics at the cellular level in both biology and medicine. In recent years, decreasing costs have enabled more complex experimental designs that involve multiple individuals, often grouped by a clinical outcome of interest. In these multi-sample, multi-group studies, a frequent scientific question is whether gene expression differs significantly between experimental conditions within each identified cell type. However, because cell-type membership is latent and estimated by a clustering algorithm, the type-I error rate of inference procedures can be inflated by the circularity of the inference (the double-dipping phenomenon). This thesis investigates double dipping in the pseudo-bulk model through simulations and real data analysis. In particular, differential misclustering and the failure to propagate uncertainty are identified as the primary sources of FDR inflation. To address this problem, a stability approach is proposed in which cells are weighted by their observation-level clustering stability before aggregation. The use of stability for estimating the number of clusters and for sub-clustering is also investigated.
2025
scRNA-seq
Double dipping
Clustering
Pseudo-bulk
Stability
Files in this item:
File: Dolfi_Gabriele.pdf
Access: Restricted access
Size: 18.02 MB
Format: Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/105774