Clustering Patients using Longitudinal Data

GOLMOHAMMADI, PARGOL
2022/2023

Abstract

Managing longitudinal datasets in clinical research, particularly when data are missing, is a complex and multifaceted task that requires careful deliberation. Longitudinal datasets are valuable for evaluating disease development and treatment success because they record information at multiple time points. Missing data, however, can arise from a range of factors, including patient dropout, irregular follow-up, and technical errors. To address this problem, researchers often use statistical methods such as imputation, which we used in this work. We worked with longitudinal height and weight data from 3897 patients aged 0 to 24 years, with a missing-data ratio of around 35%. Because we wanted to compute the patients' BMIs and cluster them, we first replaced the missing values using several imputation approaches and, based on the results obtained, chose the Mean Expected Growth approach before calculating the BMIs. The choice of clustering method depends on the nature and distribution of the data and on the problem definition and requirements of the project. In this research, the Gaussian Mixture Model (GMM) was selected as the clustering algorithm because the data followed a Gaussian distribution. The objective was to understand the dynamic changes in patient clusters, using a novel forgetting factor approach for longitudinal data to identify age-adjusted BMI growth trajectories. A forgetting factor is a technique from time-series analysis and forecasting that assigns previous observations weights that decrease exponentially with time and analyzes the effect of those observations on future outcomes. Because our dataset had a high percentage of missing data, we clustered the data in two different ways.
In the first scenario, we separated out the records with no missing data, clustered them, and treated the result as a gold standard. In the second scenario, we imputed the missing data and clustered the entire dataset. By focusing on early-life factors such as gestational smoking, lactation, and pre-gestational and gestational BMI control, our findings contribute additional evidence to the WHO guidance on high-BMI risks and interventions (World Health Organization, 2016).
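
As a rough illustration of the forgetting factor idea described above, the following Python sketch computes BMI per visit and then an exponentially weighted mean in which recent visits count more. It is a generic sketch, not the formulation used in the thesis: the function names, the value `lam=0.8`, and the sample visits are assumptions made only for this example.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def forgetting_weights(n, lam=0.8):
    """Normalized weights for n >= 1 observations ordered oldest to newest.

    The newest observation gets raw weight 1, the one before it lam,
    then lam**2, and so on; the weights are scaled to sum to 1.
    lam is a hypothetical forgetting factor in (0, 1].
    """
    raw = [lam ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bmi(measurements, lam=0.8):
    """Exponentially weighted mean BMI over (weight_kg, height_m) visits."""
    values = [bmi(w, h) for w, h in measurements]
    weights = forgetting_weights(len(values), lam)
    return sum(v * w for v, w in zip(values, weights))


# Made-up visits for one patient, oldest first: (weight in kg, height in m).
visits = [(20.0, 1.10), (30.0, 1.30), (45.0, 1.50)]
print(round(weighted_bmi(visits), 2))
```

With `lam=1` every visit is weighted equally and the result is the plain mean; smaller values of `lam` discount older visits more strongly.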
machine learning
clustering
data analysis
longitudinal data
Files in this item:
File: Golmohammadi_Pargol.pdf
Access: embargoed until 23/10/2024
Size: 5.5 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/55983