Evaluating the Utility of Synthetic Data generated using Bayesian Networks: an Amyotrophic Lateral Sclerosis Application

The growth of statistical and computational methods in scientific research has highlighted the importance of data-sharing mechanisms that balance analytical utility with strong protection of sensitive information. In the clinical domain, ethical considerations and privacy laws often limit access to individual-level data. Bayesian Networks (BNs) provide a structured probabilistic framework for modeling conditional dependencies among variables, which makes it possible to generate synthetic data that approximate the joint distribution of the observed data. This thesis explores the application of BNs to synthetic data generation and systematically assesses the generated data in terms of statistical utility and empirical privacy risk. The evaluation uses a real-world clinical dataset of patients with amyotrophic lateral sclerosis (ALS) as a case study. The dataset is characterized by heterogeneous variables and a limited sample size, and thus reflects typical challenges in clinical data analysis. The BNs are learned on discretized variables, which facilitates the estimation of conditional probability tables (CPTs) and allows the BNs to model dependencies effectively. The generated data are then analyzed both in discrete form and after empirical reconstruction into continuous form. Utility is assessed using univariate, multivariate, and predictive metrics, while privacy is evaluated through empirical measures of record-level similarity and dataset distinguishability. The main methodological finding is that the same synthetic dataset can receive materially different utility and privacy assessments depending on whether it is evaluated in the discretized space or, after empirical de-discretization, in the continuous space. The results show that BN-based synthetic data can preserve substantial statistical structure while mitigating direct disclosure risk. The findings also highlight the crucial role of data representation in both utility and privacy evaluations. The thesis therefore contributes both to the technical assessment of BN-based synthetic data generation and to the broader methodological question of how synthetic data should be evaluated in clinical research.

SVETOVA, KRISTINA

2025/2026

	Faculty/Department
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Degree programme
	
				COMPUTER ENGINEERING, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2025
			
	Keywords
	
				Synthetic Data
				Bayesian Networks
				Utility Evaluation
				ALS
			
	Supervisor
	
				DI CAMILLO, BARBARA
			
	Co-supervisor
	
				TAVAZZI, ERICA
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Svetova_Kristina.pdf (embargoed until 12/04/2029)	12.89 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/106600