Evaluating the Utility of Synthetic Data generated using Bayesian Networks: an Amyotrophic Lateral Sclerosis Application

SVETOVA, KRISTINA
2025/2026

Abstract

The growth of statistical and computational methods in scientific research has highlighted the importance of data sharing mechanisms that balance analytical utility with strong protection of sensitive information. In the clinical domain, ethical considerations and privacy laws often limit access to individual-level data. Bayesian Networks (BNs) provide a structured probabilistic framework for modeling conditional dependencies among variables, thereby enabling the generation of synthetic data that approximate the joint distribution of the observed data. This thesis explores their application to synthetic data generation and systematically assesses the generated data in terms of statistical utility and empirical privacy risk. The evaluation uses a real-world clinical dataset of patients with amyotrophic lateral sclerosis (ALS) as a case study; the dataset is characterized by heterogeneous variables and a limited sample size, reflecting typical challenges in clinical data analysis. Synthetic data are generated using BNs learned on discretized variables; discretization facilitates the estimation of conditional probability tables (CPTs) and allows the BNs to model dependencies effectively. The generated data are then analyzed both in discrete form and after empirical reconstruction into continuous form. Utility is assessed using univariate, multivariate, and predictive metrics, while privacy is evaluated through empirical measures of record-level similarity and dataset distinguishability. The main methodological finding is that the same synthetic dataset can receive materially different utility and privacy assessments depending on whether it is evaluated in discretized space or after empirical de-discretization into continuous space. The results demonstrate that BN-based synthetic data can preserve significant statistical structure while mitigating direct disclosure risk.
Importantly, the findings highlight the crucial role of data representation in both utility and privacy evaluations. The thesis therefore contributes both to the technical assessment of BN-based synthetic data generation and to the broader methodological question of how synthetic data should be evaluated in clinical research.
2025
Synthetic Data
Bayesian Networks
Utility Evaluation
ALS
Files in this item:
File: Svetova_Kristina.pdf (embargoed until 12/04/2029)
Size: 12.89 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/106600