Federated synthetic data generation with Autoregressive Neural Networks

Machine learning (ML) models offer significant potential across sensitive domains, yet their adoption is hindered by privacy concerns related to (1) the use of personal data for training and (2) the exchange of data between organizations such as healthcare providers and financial institutions. Both scenarios risk breaches of privacy and disclosure of sensitive information. In this work, we address these challenges by leveraging federated learning and synthetic data generation to mitigate privacy risks while preserving data utility. Unlike previous studies that focus primarily on image data, we concentrate on tabular data, which is particularly relevant for sensitive domains. Specifically, we adapt a state-of-the-art autoregressive neural network–based synthetic data generator (TabularARGN) to a federated learning environment. We design and implement a framework that integrates this generator with the Flower federated learning framework and propose a novel approach for model aggregation in the context of tabular data generation. We conduct an extensive evaluation of our approach on multiple tabular datasets to assess fidelity, utility, and privacy trade-offs, and we analyze how factors such as data heterogeneity and the number of clients influence the quality of the generated synthetic data. Our findings show that the federated approach achieves performance comparable to centralized synthetic data generation in many cases while significantly enhancing privacy. However, we also observe that strong imbalances among clients can affect the fidelity and utility of the generated data, underscoring the importance of tailored aggregation strategies.
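To make the aggregation step and the effect of client imbalance concrete, the following is a minimal sketch of standard sample-size-weighted federated averaging (FedAvg). It is not the aggregation approach proposed in this work, and it does not use the Flower or TabularARGN APIs. The function name `weighted_average` and the representation of each client model as a flat list of floats are assumptions made purely for illustration.

```python
from typing import List, Sequence


def weighted_average(
    client_params: Sequence[Sequence[float]],
    num_examples: Sequence[int],
) -> List[float]:
    """Average client parameter vectors, weighting each by its local sample count.

    Hypothetical helper: each client model is assumed to be a flat list of floats.
    """
    if not client_params or len(client_params) != len(num_examples):
        raise ValueError("need one sample count per client")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    dim = len(client_params[0])
    if any(len(p) != dim for p in client_params):
        raise ValueError("all clients must share the same parameter shape")
    # Each coordinate of the global model is the sample-weighted mean
    # of that coordinate across clients.
    return [
        sum(p[i] * n for p, n in zip(client_params, num_examples)) / total
        for i in range(dim)
    ]


# Two clients with very different local models (illustrative values only).
client_a = [1.0, 0.0]
client_b = [0.0, 1.0]

balanced = weighted_average([client_a, client_b], [100, 100])
imbalanced = weighted_average([client_a, client_b], [900, 100])
print(balanced, imbalanced)
```

With equal sample counts the global parameters sit midway between the two clients. With a 900/100 split they move most of the way toward the larger client, which is one simple way in which strong imbalances among clients can shape the aggregated generator.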