Prognostic Prediction Machine Learning Models for Elderly Pneumonia Patients


PALMERI, CLAUDIO

2024/2025

Abstract

Integrating datasets that have different predictors but the same target variable can improve the performance and reliability of machine learning models trained on them. This work applied this idea to healthcare data, specifically pneumonia in elderly patients in the Veneto region, by combining the dataset of notification forms from the regional surveillance system for invasive bacterial diseases (IBD) with the hospital discharge records (HDR) dataset. Because both datasets are pseudonymized, deterministic techniques were used to identify matching records, exploiting the features that the two datasets share. Prognostic prediction models were developed and evaluated independently for the IBD, HDR, and combined datasets. The models were implemented using machine learning techniques such as support vector machines, random decision forests, and artificial neural networks. The best model trained on the combined dataset outperformed the models trained on only one of the two datasets. More precisely, on the test set the best model trained only on the IBD dataset had an area under the ROC curve (AUC) of 0.685, the best model trained only on the HDR dataset had an AUC of 0.793, and the best model trained on the combined dataset had an AUC of 0.861. All of these models, especially the one trained on the combined dataset, could be used to identify higher-risk patients so that hospital staff can monitor them more closely and thereby improve their likelihood of survival.
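As an illustration of the deterministic matching mentioned in the abstract, the following is a minimal sketch of exact-key record linkage between two pseudonymized tables. It is not the method used in the thesis: the shared field names (`birth_year`, `sex`, `admission_date`, `hospital_code`), the example records, and the rule of keeping only unambiguous one-to-one matches are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical shared fields; the abstract does not say which features
# the IBD and HDR datasets actually have in common.
SHARED_KEYS = ("birth_year", "sex", "admission_date", "hospital_code")


def link_key(record):
    """Build the exact-match key from the shared fields of one record."""
    return tuple(record[k] for k in SHARED_KEYS)


def deterministic_link(ibd_records, hdr_records):
    """Pair records whose shared fields agree exactly and unambiguously."""
    # Index the HDR records by their key.
    hdr_index = defaultdict(list)
    for rec in hdr_records:
        hdr_index[link_key(rec)].append(rec)

    # Count how often each key occurs on the IBD side.
    ibd_counts = defaultdict(int)
    for rec in ibd_records:
        ibd_counts[link_key(rec)] += 1

    linked = []
    for rec in ibd_records:
        key = link_key(rec)
        candidates = hdr_index.get(key, [])
        # Assumption: keep a pair only when the key is unique on both sides.
        if len(candidates) == 1 and ibd_counts[key] == 1:
            # Merge the two records; the shared fields are identical by construction.
            linked.append({**candidates[0], **rec})
    return linked


# Made-up example records, for illustration only.
ibd = [
    {"birth_year": 1940, "sex": "F", "admission_date": "2019-01-12",
     "hospital_code": "H01", "pathogen": "S. pneumoniae"},
    {"birth_year": 1935, "sex": "M", "admission_date": "2019-02-03",
     "hospital_code": "H02", "pathogen": "H. influenzae"},
]
hdr = [
    {"birth_year": 1940, "sex": "F", "admission_date": "2019-01-12",
     "hospital_code": "H01", "length_of_stay": 9},
    {"birth_year": 1950, "sex": "M", "admission_date": "2019-03-20",
     "hospital_code": "H02", "length_of_stay": 4},
]

linked = deterministic_link(ibd, hdr)
```

In this sketch only the first IBD record finds a partner, so `linked` holds a single merged record carrying predictors from both sources. Discarding ambiguous keys trades coverage for precision; a real linkage would need to weigh that choice against the data at hand.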

Record

Faculty/Department: Dipartimento di Matematica "Tullio Levi-Civita" - DM
Degree programme: DATA SCIENCE Laurea Magistrale (D.M. 270/2004)
Academic year: 2024
Keywords: AI; Supervised learning; Deep Learning; Clinical Data; Imbalanced Data
Supervisor: NAVARIN, NICOLO'
Appears in types: Master's degrees (Lauree magistrali)

Files in this item:

File	Size	Format
Palmeri_Claudio.pdf (open access)	3.15 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84789