Prognostic Prediction Machine Learning Models for Elderly Pneumonia Patients
PALMERI, CLAUDIO
2024/2025
Abstract
Integrating datasets that have different predictors but the same target variable can improve the performance and reliability of machine learning models trained on them. This work applied this concept to healthcare data, specifically pneumonia in elderly patients in the Veneto region, by combining the dataset of notification forms from the regional surveillance system for invasive bacterial diseases (IBD) with the hospital discharge records (HDR) dataset. Because the datasets are pseudonymized, matching records were identified with deterministic techniques that exploit the features the two datasets share. Prognostic models were developed and evaluated independently for the IBD dataset, the HDR dataset, and the combined dataset. The models were implemented using machine learning techniques such as support vector machines, random decision forests, and artificial neural networks. The best model trained on the combined dataset outperformed the models trained on only one of the two datasets. More precisely, the area under the ROC curve on the test set was 0.685 for the best model trained only on the IBD dataset, 0.793 for the best model trained only on the HDR dataset, and 0.861 for the best model trained on the combined dataset. All of these models, especially the one trained on the combined dataset, could be used to identify higher-risk patients so that hospital personnel could improve those patients' likelihood of survival by monitoring them more closely.

| File | Access | Size | Format |
|---|---|---|---|
| Palmeri_Claudio.pdf | Open access | 3.15 MB | Adobe PDF |
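The abstract says that records from the two pseudonymized datasets were matched deterministically on shared features, but it does not list those features or the exact rules. The following is only a minimal sketch of the general idea, not the thesis's actual procedure: the field names (`birth_year`, `sex`, `admission_date`, `pathogen`, `length_of_stay`), the sample records, and the function `deterministic_link` are all hypothetical and were chosen purely for illustration.

```python
from collections import defaultdict


def deterministic_link(left, right, keys):
    """Pair records whose values agree exactly on every field in `keys`.

    A key is linked only if it identifies exactly one record on each side,
    so ambiguous candidates are left unmatched rather than guessed.
    """
    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            values = tuple(rec.get(k) for k in keys)
            if None not in values:  # skip records missing a key field
                buckets[values].append(rec)
        return buckets

    left_idx, right_idx = index(left), index(right)
    linked = []
    for values, left_recs in left_idx.items():
        right_recs = right_idx.get(values, [])
        if len(left_recs) == 1 and len(right_recs) == 1:
            # Merge the two records into one row of the combined dataset.
            linked.append({**right_recs[0], **left_recs[0]})
    return linked


# Hypothetical shared fields and toy records (not taken from the thesis).
SHARED = ("birth_year", "sex", "admission_date")

ibd = [
    {"birth_year": 1940, "sex": "F", "admission_date": "2019-01-12",
     "pathogen": "S. pneumoniae"},
    {"birth_year": 1938, "sex": "M", "admission_date": "2019-02-03",
     "pathogen": "H. influenzae"},
]
hdr = [
    {"birth_year": 1940, "sex": "F", "admission_date": "2019-01-12",
     "length_of_stay": 9},
    {"birth_year": 1951, "sex": "M", "admission_date": "2019-02-03",
     "length_of_stay": 4},
]

combined = deterministic_link(ibd, hdr, SHARED)
print(combined)
```

In this sketch, a linked row carries the predictors of both sources, which is what allows a model trained on the combined dataset to use more information than a model trained on either source alone. Requiring a unique match on both sides is one conservative design choice; a real linkage would also need to handle typos, date tolerances, and repeated admissions.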
The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/84789