Leveraging Generalised Matrix Factorisation for Proteomics Data Simulation

DE CORSO, LUCA

2025/2026

Abstract

Single-cell proteomics (SCP) data generated by mass spectrometry (MS) are characterized by complex experimental designs, strong correlation structures, and pervasive missing values arising from both technical and biological mechanisms. These characteristics pose significant challenges for statistical modeling and make it difficult to generate realistic synthetic data. As a result, high-quality simulated datasets for benchmarking current and novel SCP data analysis workflows are lacking. In this contribution, we develop a model-based factorization approach to simulate new datasets that preserve the main empirical properties of real SCP data. The model consists of two components: a logistic regression component that models presence/absence, and a Gaussian component that models the observed log2 MS intensities. The two components provide complementary information, capturing both structured patterns of missing values and variability in MS intensity. The quality of the simulated data is evaluated through a benchmarking procedure based on repeated cell-wise train/test splits and replicate simulations. Feature- and cell-level summary metrics are compared between simulated and reference data using the Kolmogorov–Smirnov statistic and the 2-Wasserstein distance. The results show that the proposed framework reproduces key distributional properties and correlation structures of the reference data. It therefore provides a promising basis for developing a flexible and principled tool for generating realistic SCP data.
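The two-component idea and the comparison metrics described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the shared low-rank cell scores, the parameter values, the function names (`simulate_hurdle`, `ks_statistic`, `wasserstein2`), and the use of a second simulated matrix as a stand-in for real reference data are all assumptions made for illustration.

```python
import numpy as np

def simulate_hurdle(n_cells=200, n_features=50, rank=2, sigma=0.5, seed=0):
    """Simulate a cells x features matrix of log2 intensities (NaN = missing).

    All parameter values here are hypothetical, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    # Assumed low-rank structure: cell scores shared by both components,
    # with separate feature loadings for detection and intensity.
    U = rng.normal(size=(n_cells, rank))
    V_det = rng.normal(scale=0.8, size=(n_features, rank))
    V_int = rng.normal(scale=0.8, size=(n_features, rank))
    b_det = rng.normal(loc=0.5, size=n_features)   # intercepts, logit scale
    b_int = rng.normal(loc=10.0, size=n_features)  # intercepts, log2 scale
    # Logistic component: probability that each entry is observed.
    p_obs = 1.0 / (1.0 + np.exp(-(b_det + U @ V_det.T)))
    observed = rng.random((n_cells, n_features)) < p_obs
    # Gaussian component: log2 intensity, kept only where observed.
    noise = rng.normal(scale=sigma, size=(n_cells, n_features))
    intensity = b_int + U @ V_int.T + noise
    return np.where(observed, intensity, np.nan)

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))

def wasserstein2(x, y, n_quantiles=200):
    """Approximate 1-D 2-Wasserstein distance from matched empirical quantiles."""
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    diff = np.quantile(x, q) - np.quantile(y, q)
    return float(np.sqrt(np.mean(diff ** 2)))

# Stand-in "reference" and "simulated" datasets (both synthetic here).
reference = simulate_hurdle(seed=1)
simulated = simulate_hurdle(seed=2)

# Feature-level summary metric: fraction of missing values per feature.
miss_ref = np.isnan(reference).mean(axis=0)
miss_sim = np.isnan(simulated).mean(axis=0)

print("KS statistic:", ks_statistic(miss_ref, miss_sim))
print("2-Wasserstein:", wasserstein2(miss_ref, miss_sim))
```

The same two functions could be applied to other feature- or cell-level summaries, such as per-feature mean intensity (`np.nanmean(reference, axis=0)`), and repeated over train/test splits and replicate simulations.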

Record

	Faculty/Department
	
				Dipartimento di Scienze Statistiche
			
	Degree programme
	
				SCIENZE STATISTICHE Laurea Magistrale (D.M. 270/2004)
			
	Academic year
	
				2025
			
	English title
	
				Leveraging Generalised Matrix Factorisation for Proteomics Data Simulation
			
	Keywords
	
				Proteomics
Simulation
Factorisation
			
	Supervisor
	
				RISSO, DAVIDE
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Access	Size	Format
DeCorso_Luca.pdf	Restricted access	26.43 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/105873