Effects of misclustering in multi-sample, multi-group scRNA-seq studies: a stability-based approach
DOLFI, GABRIELE
2025/2026
Abstract
In the past decade, single-cell RNA sequencing (scRNA-seq) has become an established technology for exploring gene expression dynamics at the cellular level in both biology and medicine. In recent years, decreasing costs have enabled more complex experimental designs that involve multiple individuals, often grouped by a clinical outcome of interest. In these multi-sample, multi-group studies, a frequent scientific question is whether gene expression differs significantly between experimental conditions within each identified cell type. However, because cell-type membership is latent and estimated by a clustering algorithm, the type-I error rate of inference procedures can be inflated by the circularity of the inference (the double-dipping phenomenon). This thesis investigates double-dipping in the pseudo-bulk model through simulations and real data analysis. In particular, differential misclustering and the failure to propagate uncertainty are identified as the primary sources of FDR inflation. To address this problem, a stability-based approach is proposed in which cells are weighted by their observation-level clustering stability before aggregation. The use of stability for estimating the number of clusters and for sub-clustering is also investigated.

| File | Access | Size | Format |
|---|---|---|---|
| Dolfi_Gabriele.pdf | Restricted access | 18.02 MB | Adobe PDF |
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/105774