Statistical Learning Methods for Psychiatric Disorder Classification Using Resting-State Electroencephalography Recordings
VINCENZI, MARGHERITA
2025/2026
Abstract
This thesis applies supervised learning to electroencephalography (EEG) data, using Random Forest models to classify major psychiatric disorders. It reworks an approach proposed in the literature to distinguish healthy subjects from patients with a clinical diagnosis, using a real dataset of 945 subjects. The dataset includes sociodemographic and clinical covariates, as well as variables derived from EEG recordings via the Fast Fourier Transform. These variables are measures of spectral power and phase coherence, computed across the main frequency bands of brain activity. The main objective is to evaluate how reducing the dimensionality of the EEG-derived variables with Principal Component Analysis affects the performance of Random Forest classifiers. The principal components are used, together with the sociodemographic variables, to train models that classify subjects by clinical condition. The models are trained under different configurations, distinguished by parameter type (spectral power or phase coherence) and reference frequency band. The best configurations are selected by 5-fold cross-validation, with the Area Under the Curve as the evaluation metric. The results highlight limited stability of the estimates, attributable to the small sample sizes of several diagnostic categories. Models built using only sociodemographic covariates are also compared with models based exclusively on EEG-derived variables. This comparison shows that the EEG-derived variables contribute only marginally to classification performance, whereas predictive ability is driven largely by the sociodemographic variables.

| File | Size | Format |
|---|---|---|
| Vincenzi_Margherita.pdf (open access) | 1.71 MB | Adobe PDF |
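
The workflow summarized in the abstract (Principal Component Analysis on the EEG-derived variables, a Random Forest trained on the components together with sociodemographic variables, and selection by 5-fold cross-validated Area Under the Curve) can be illustrated with a minimal sketch. It is not the thesis code: the use of scikit-learn, the synthetic data, the binary healthy/patient label, the column layout, and the candidate numbers of components are all assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the thesis dataset); all sizes are arbitrary.
n_subjects, n_eeg, n_demo = 200, 60, 3
X_eeg = rng.normal(size=(n_subjects, n_eeg))    # e.g. band-power features
X_demo = rng.normal(size=(n_subjects, n_demo))  # e.g. sociodemographic covariates
y = rng.integers(0, 2, size=n_subjects)         # hypothetical 0 = healthy, 1 = patient
X = np.hstack([X_eeg, X_demo])

eeg_cols = list(range(n_eeg))
demo_cols = list(range(n_eeg, n_eeg + n_demo))


def make_model(n_components):
    """PCA on the EEG columns only, covariates passed through, then a Random Forest."""
    eeg_pca = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
    ])
    features = ColumnTransformer([
        ("eeg", eeg_pca, eeg_cols),
        ("demo", "passthrough", demo_cols),
    ])
    return Pipeline([
        ("features", features),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ])


# 5-fold stratified cross-validation with AUC as the score.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_by_k = {}
for k in (5, 10, 20):  # hypothetical candidate numbers of components
    scores = cross_val_score(make_model(k), X, y, cv=cv, scoring="roc_auc")
    auc_by_k[k] = float(scores.mean())

best_k = max(auc_by_k, key=auc_by_k.get)
print(auc_by_k, "best number of components:", best_k)
```

Placing the scaler and PCA inside the pipeline means they are refit on the training portion of each fold, so the held-out fold does not influence the components. Whether the thesis follows this exact design is not stated in the abstract. On the random labels used here the AUC should hover around 0.5.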
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/106088