Physics-Informed Machine Learning for High-Fidelity Synthetic Data Generation

NINNI, DANIELE
2022/2023

Abstract

Interest in synthetic data, that is, artificially generated data with the same statistical properties as real-world data, has grown rapidly in recent years. This growth can be attributed both to the increasing demand for large amounts of data to train AI/ML models and to the recent development of effective methods for generating high-quality synthetic data. Generative AI models, for example, have demonstrated excellent capabilities in synthesizing complex datasets. However, many of the processes of interest are rare events or edge cases, so the real data available to train generative models is often insufficient, which limits their applicability. Furthermore, for processes involving dynamical systems, generative models often fail to capture the laws governing the dynamics, resulting in low-fidelity synthetic data. A possible strategy to overcome these limitations is a physics-informed approach, which incorporates knowledge of the governing physical laws into the generative model. This thesis explores one such approach for generating high-fidelity synthetic data using physics-informed ML. Specifically, it uses the SINDy Autoencoder network introduced by Champion et al. as a synthetic data generator and benchmarks it against a commercial tool developed by Clearbox AI, a synthetic data provider. The generative models are tested on two datasets produced by nonlinear dynamical systems: a simulated dataset with dynamics defined by the Lorenz system and a real dataset acquired on a full-scale F-16 aircraft. The results show that the explored approach is a promising solution for generating high-fidelity synthetic data. However, the training procedure is significantly complicated by the presence of multiple competing loss terms. Moreover, the effectiveness of the approach appears to depend strongly on the dataset in use and on the complexity of the corresponding dynamical system.
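
As a rough illustration of what "incorporating the governing laws" can mean for the Lorenz benchmark mentioned in the abstract, the sketch below simulates a Lorenz trajectory and recovers its equations by sparse regression over a library of candidate terms, which is the basic SINDy idea. It is not the SINDy Autoencoder studied in the thesis: there is no encoder or decoder, the regression acts directly on the full state, and the derivatives are taken from the known model instead of being estimated from data. The parameter values (sigma = 10, rho = 28, beta = 8/3), the initial condition, the threshold, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system (classic parameter values, assumed here)."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def simulate(x0, dt=0.01, steps=5000):
    """Integrate the Lorenz system with a fixed-step fourth-order Runge-Kutta scheme."""
    traj = np.empty((steps + 1, 3))
    traj[0] = x0
    for k in range(steps):
        s = traj[k]
        k1 = lorenz(s)
        k2 = lorenz(s + 0.5 * dt * k1)
        k3 = lorenz(s + 0.5 * dt * k2)
        k4 = lorenz(s + dt * k3)
        traj[k + 1] = s + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return traj

def library(X):
    """Candidate terms, in order: 1, x, y, z, x^2, xy, xz, y^2, yz, z^2."""
    x, y, z = X.T
    return np.column_stack(
        [np.ones(len(X)), x, y, z, x * x, x * y, x * z, y * y, y * z, z * z]
    )

def stlsq(theta, dX, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for j in range(dX.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                xi[keep, j] = np.linalg.lstsq(theta[:, keep], dX[:, j], rcond=None)[0]
    return xi

# Simulated "dataset": one trajectory on the attractor.
X = simulate(np.array([-8.0, 7.0, 27.0]))
# Exact derivatives from the known model (a simplification; real data would need estimates).
dX = np.array([lorenz(s) for s in X])
# Rows of Xi follow the library order; columns are dx/dt, dy/dt, dz/dt.
Xi = stlsq(library(X), dX)
print(np.round(Xi, 3))
```

Each column of `Xi` holds the coefficients of one state derivative, so a sparse `Xi` corresponds to a compact set of governing equations. With noisy measurements or estimated derivatives, the threshold would need tuning and the recovery would be less clean.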
2022
Keywords: physics-informed, machine learning, synthetic data
Files in this item:
File: ninni_daniele.pdf
Access: open access
Size: 23.48 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/47364