Federated synthetic data generation with Autoregressive Neural Networks

ASHOURI KAFSHGAR, ELHAM
2024/2025

Abstract

Machine learning (ML) models offer significant potential across sensitive domains, yet their adoption is hindered by privacy concerns related to (1) the use of personal data for training and (2) the exchange of data between organizations such as healthcare providers and financial institutions. Both scenarios risk breaches of privacy and disclosure of sensitive information. In this work, we address these challenges by leveraging federated learning and synthetic data generation to mitigate privacy risks while preserving data utility. Unlike previous studies that focus primarily on image data, we concentrate on tabular data, which is particularly relevant for sensitive domains. Specifically, we adapt a state-of-the-art autoregressive neural network–based synthetic data generator (TabularARGN) to a federated learning environment. We design and implement a framework that integrates this generator with the Flower federated learning framework and propose a novel approach for model aggregation in the context of tabular data generation. We conduct an extensive evaluation of our approach on multiple tabular datasets to assess fidelity, utility, and privacy trade-offs, and we analyze how factors such as data heterogeneity and the number of clients influence the quality of the generated synthetic data. Our findings show that the federated approach achieves performance comparable to centralized synthetic data generation in many cases while significantly enhancing privacy. However, we also observe that strong imbalances among clients can affect the fidelity and utility of the generated data, underscoring the importance of tailored aggregation strategies.
Federated Learning
Synthetic Data
Autoregressive Model
Neural Networks
File in this item: AshouriKafshgar_Elham.pdf (open access, 5.8 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/99552