Leveraging Generalised Matrix Factorisation for Proteomics Data Simulation
DE CORSO, LUCA
2025/2026
Abstract
Single-cell proteomics (SCP) data generated by mass spectrometry (MS) are characterized by complex experimental designs, strong correlation structures, and pervasive missing values arising from both technical and biological mechanisms. These characteristics pose significant challenges for statistical modeling and make it difficult to generate realistic synthetic data. As a result, high-quality simulated datasets for benchmarking current and novel SCP data analysis workflows are lacking. In this contribution, we develop a model-based factorization approach to simulate new datasets that preserve the main empirical properties of real SCP data. The model consists of two components: a logistic regression component that models the presence or absence of each value, and a Gaussian component that models the observed log2 MS intensities. The two specifications provide complementary information, capturing both structured patterns of missing values and variability in MS intensity. The quality of the simulated data is evaluated through a benchmarking procedure based on repeated cell-wise train/test splits and replicate simulations. Feature- and cell-level summary metrics are compared between simulated and reference data using the Kolmogorov–Smirnov statistic and the 2-Wasserstein distance. The results show that the proposed framework reproduces key distributional properties and correlation structures of the reference data. It therefore provides a promising basis for developing a flexible and principled tool for generating realistic SCP data.

| File | Size | Format |
|---|---|---|
| DeCorso_Luca.pdf (restricted access) | 26.43 MB | Adobe PDF |
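
The two-part structure and the comparison metrics named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the additive cell and feature effects, all parameter values, and the function names (`simulate_cell`, `detection_rate`, `ks_statistic`, `wasserstein2`) are assumptions chosen for illustration, and the "reference" data here are themselves drawn from the same toy generator rather than from real SCP measurements.

```python
import math
import random
from bisect import bisect_right


def simulate_cell(feature_effects, cell_effect, rng, mu=10.0, sigma=1.5):
    """Simulate one cell; None marks a missing value (toy two-part model)."""
    values = []
    for f in feature_effects:
        # Presence/absence part: logistic link on an assumed additive predictor.
        p_present = 1.0 / (1.0 + math.exp(-(cell_effect + f)))
        if rng.random() < p_present:
            # Intensity part: Gaussian draw on the log2 scale (assumed mean/sd).
            values.append(rng.gauss(mu + f, sigma))
        else:
            values.append(None)
    return values


def detection_rate(cell):
    """Cell-level summary metric: fraction of observed (non-missing) features."""
    return sum(v is not None for v in cell) / len(cell)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


def wasserstein2(x, y):
    """2-Wasserstein distance between equal-size 1-D samples (sorted pairing)."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(x), sorted(y))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(x))


rng = random.Random(0)
feature_effects = [rng.gauss(0.0, 1.0) for _ in range(50)]


def simulate_rates(n_cells):
    """Detection rates for n_cells cells with random cell effects."""
    return [
        detection_rate(simulate_cell(feature_effects, rng.gauss(0.0, 0.5), rng))
        for _ in range(n_cells)
    ]


reference_rates = simulate_rates(100)  # stand-in for held-out reference cells
simulated_rates = simulate_rates(100)  # stand-in for one replicate simulation
print("KS:", ks_statistic(reference_rates, simulated_rates))
print("W2:", wasserstein2(reference_rates, simulated_rates))
```

Smaller values of either distance indicate that the simulated summary metric is distributed more like the reference one; the same comparison could be repeated for feature-level metrics and across train/test splits.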
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/105873