Leveraging Generalised Matrix Factorisation for Proteomics Data Simulation

DE CORSO, LUCA
2025/2026

Abstract

Single-cell proteomics (SCP) data generated by mass spectrometry (MS) are characterized by complex experimental designs, strong correlation structures, and pervasive missing values arising from both technical and biological mechanisms. These characteristics make statistical modeling challenging and realistic synthetic data difficult to generate. As a result, high-quality simulated datasets for benchmarking current and novel SCP data analysis workflows are lacking. In this contribution, we develop a model-based factorization approach to simulate new datasets that preserve the main empirical properties of real SCP data. The model consists of two components: a logistic regression component that models the presence or absence of each value, and a Gaussian component that models the observed log2 MS intensities. The two components provide complementary information, capturing both the structured patterns of missing values and the variability in MS intensity. The quality of the simulated data is evaluated through a benchmarking procedure based on repeated cell-wise train/test splits and replicate simulations. Feature-level and cell-level summary metrics are compared between simulated and reference data using the Kolmogorov–Smirnov statistic and the 2-Wasserstein distance. The results show that the proposed framework reproduces key distributional properties and correlation structures of the reference data. It therefore provides a promising basis for developing a flexible and principled tool for generating realistic SCP data.
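The two-component structure described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the parameters are drawn at random rather than estimated from a reference dataset, the cell scores are assumed to be shared by both components, and every name (`simulate`, `detection_rates`, `ks_statistic`, `wasserstein2`) is hypothetical. The sketch draws presence/absence from a logistic model on a low-rank predictor, draws log2 intensities from a Gaussian model for the observed entries only, and then compares a feature-level summary between two datasets using the Kolmogorov–Smirnov statistic and the 2-Wasserstein distance.

```python
import math
import random
from bisect import bisect_right


def simulate(n_cells, n_features, rank=2, sigma=0.5, seed=0):
    """Return a cells x features table of log2 intensities; None marks a missing value."""
    rng = random.Random(seed)

    def factors(n):
        # n x rank matrix of standard-normal entries (illustrative, not fitted)
        return [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(n)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    cell_scores = factors(n_cells)     # shared by both components (assumption)
    load_detect = factors(n_features)  # loadings of the logistic component
    load_intens = factors(n_features)  # loadings of the Gaussian component
    base_detect = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    base_intens = [rng.gauss(10.0, 2.0) for _ in range(n_features)]

    table = []
    for u in cell_scores:
        row = []
        for j in range(n_features):
            # Logistic component: probability that feature j is observed in this cell.
            eta = base_detect[j] + dot(u, load_detect[j])
            p_obs = 1.0 / (1.0 + math.exp(-eta))
            if rng.random() < p_obs:
                # Gaussian component: log2 intensity, drawn only for observed entries.
                mu = base_intens[j] + dot(u, load_intens[j])
                row.append(rng.gauss(mu, sigma))
            else:
                row.append(None)
        table.append(row)
    return table


def detection_rates(table):
    """Feature-level summary: fraction of cells in which each feature is observed."""
    n_cells = len(table)
    return [sum(row[j] is not None for row in table) / n_cells
            for j in range(len(table[0]))]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def wasserstein2(a, b):
    """2-Wasserstein distance between two one-dimensional samples of equal size."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles samples of equal size")
    squared = sum((x - y) ** 2 for x, y in zip(sorted(a), sorted(b)))
    return math.sqrt(squared / len(a))


if __name__ == "__main__":
    reference = simulate(50, 30, seed=1)  # stand-in for a real reference dataset
    simulated = simulate(50, 30, seed=2)
    ref_rates, sim_rates = detection_rates(reference), detection_rates(simulated)
    print("KS:", ks_statistic(ref_rates, sim_rates))
    print("W2:", wasserstein2(ref_rates, sim_rates))
```

In the workflow the abstract describes, the reference summaries would come from held-out cells of a real dataset rather than from a second random draw, and the comparison would be repeated over train/test splits and replicate simulations.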
2025
Proteomics
Simulation
Factorisation
Files in this item:
File: DeCorso_Luca.pdf
Access: restricted
Size: 26.43 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/105873