Introduction: The use of Large Language Models (LLMs) in emergency departments (EDs) is growing rapidly, but evidence from peripheral Italian settings remains limited. In general surgery, abdominal pain is a common reason for ED attendance, marked by heterogeneous etiologies and the need for time-sensitive decisions: correctly identifying surgical cases, selecting the appropriate care pathway, and assessing whether the patient should be centralized (transferred) to a higher-complexity center. This study assesses how an LLM can support decision-making for patients with abdominal pain in a peripheral ED by comparing the model's recommendations with actual clinical outcomes.

Study Objectives: To evaluate the diagnostic accuracy and predictive ability of an LLM (ChatGPT-4o) in supporting surgical decisions for adult patients with abdominal pain in a peripheral ED, with particular attention to appropriateness of disposition (discharge, admission to surgery, or admission to internal medicine), indication for surgery, need for centralization, and need for intensive care admission.

Materials and Methods: This single-center retrospective observational study included 352 adults who presented to the ED of M.O.A. Locatelli Hospital (Piario, Bergamo) from 31 December 2024 to 31 March 2025 with abdominal pain at triage. For each case, the model, prompted to act as a general surgeon, provided: disposition (discharge/admission to surgery/admission to internal medicine), indication for surgical treatment, need for centralization and for intensive care, 30-day mortality, and the type of procedure. Performance was estimated using confusion matrices, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy (with 95% CI), and Cohen's kappa; analyses were performed in R 4.5.1.

Results: Overall accuracy was 45.5% (95% CI 40.2–50.8) with a kappa of 0.195, indicating slight agreement. For admissions to surgery, sensitivity was 77.3% and specificity 85.7%, with a PPV of 43.6% and an NPV of 96.4%. For admissions to internal medicine, sensitivity was 70.0%, specificity 56.6%, and PPV 8.9%. For discharges, sensitivity was 38.9% and specificity 93.8%, with a PPV of 96.6% and an NPV of 25.4%. For the individual clinical outcomes: surgical indication, sensitivity 59% and specificity 92% (PPV 27%, NPV 98%); intensive care, sensitivity 100% and specificity 97% (PPV 9%, NPV 100%); 30-day mortality, sensitivity 100% and specificity 93% (PPV 13%, NPV 100%); centralization, sensitivity 0% and specificity 98% (PPV 0%, NPV 98%).

Conclusions: In the peripheral setting analyzed, the LLM takes a cautious stance: it rules out critical events such as intensive care admission and mortality very well (NPV 100% for both) but tends to overestimate admissions and surgical indications. Performance on centralization is inadequate (sensitivity 0%). The model is not suitable for standalone use; we suggest employing it as an initial filter under clinical supervision. Calibration to the local context, multimodal integration, and prospective impact evaluations are desirable.
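The metrics named in Materials and Methods can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the author's R analysis; the ten toy cases, the label names, and the helper functions are all hypothetical and carry no study data. It shows how one-vs-rest sensitivity, specificity, PPV, and NPV for each disposition class, together with overall accuracy and Cohen's kappa, follow from a single three-class confusion matrix.

```python
from collections import Counter

LABELS = ["discharge", "surgery", "medicine"]  # hypothetical label names

def confusion_counts(actual, predicted, labels):
    """Return counts[a][p]: number of cases with actual label a and predicted label p."""
    pairs = Counter(zip(actual, predicted))
    return {a: {p: pairs[(a, p)] for p in labels} for a in labels}

def safe_div(num, den):
    """Divide, returning NaN when the denominator is zero."""
    return num / den if den else float("nan")

def one_vs_rest(counts, positive):
    """Binary metrics for one class treated as 'positive' against all the others."""
    labels = list(counts)
    total = sum(counts[a][p] for a in labels for p in labels)
    tp = counts[positive][positive]
    fn = sum(counts[positive].values()) - tp
    fp = sum(counts[a][positive] for a in labels) - tp
    tn = total - tp - fn - fp
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
    }

def accuracy_and_kappa(counts):
    """Overall accuracy and Cohen's kappa (chance-corrected agreement)."""
    labels = list(counts)
    total = sum(counts[a][p] for a in labels for p in labels)
    observed = sum(counts[l][l] for l in labels) / total
    # Agreement expected by chance from the row (actual) and column (predicted) totals.
    expected = sum(
        sum(counts[l].values()) * sum(counts[a][l] for a in labels)
        for l in labels
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)

# Toy data only (NOT the study's cases): actual outcome vs. model recommendation.
actual = ["discharge"] * 6 + ["surgery"] * 2 + ["medicine"] * 2
predicted = ["discharge", "discharge", "discharge", "surgery", "medicine",
             "medicine", "surgery", "surgery", "medicine", "discharge"]

counts = confusion_counts(actual, predicted, LABELS)
accuracy, kappa = accuracy_and_kappa(counts)
per_class = {label: one_vs_rest(counts, label) for label in LABELS}

print(f"accuracy={accuracy:.3f} kappa={kappa:.3f}")
for label, metrics in per_class.items():
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

In this toy set the model sends some truly dischargeable patients to a ward, so the discharge class has low sensitivity but a higher PPV, which is the same qualitative pattern the abstract describes as a cautious stance.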
Can Large Language Models provide General Surgery recommendations in the Emergency Department? An evaluation of diagnostic accuracy and early decision support in a peripheral hospital.
PERONIO, LUCIA
2024/2025
| File | Access | Size | Format |
|---|---|---|---|
| Peronio_Lucia.pdf | Restricted access | 845.08 kB | Adobe PDF |
https://hdl.handle.net/20.500.12608/93212