Artificial intelligence and communication with patients with eosinophilic esophagitis: a comparative study
MASSARWA, FADI
2025/2026
Abstract
Introduction: Free-tier consumer large language models (LLMs) such as ChatGPT, Claude, and Gemini are increasingly used by patients to obtain information about their conditions. Eosinophilic esophagitis is a chronic immune-mediated esophageal disease whose management requires close treatment adherence and high-quality patient-physician communication. To date, no systematic study has compared the performance of the leading free-tier consumer LLMs on the priority questions of patients with eosinophilic esophagitis while integrating both expert and patient perspectives.

Study Objective: To benchmark the quality of the responses provided by ChatGPT, Claude, and Gemini to the 51 expert-prioritized questions of the Eosinophilic Esophagitis Question Prompt List recently published by Achalu and colleagues (J Clin Gastroenterol 2025), through blinded expert evaluation by an international panel of gastroenterologists and blinded patient evaluation via a paper-based questionnaire administered in the outpatient clinic.

Materials and methods: Two-phase prospective observational study, prospectively registered on the Open Science Framework (osf.io/436zx). The 51 Question Prompt List questions were submitted three times to each model via its free-tier web interface, each preceded by a standardized contextualizing prompt, generating 459 responses (51 questions × 3 models × 3 repetitions). One repetition per question-model combination was randomly selected for rating, yielding 153 responses to be evaluated. Phase 1 involved blinded expert rating on six 1-to-5 Likert scales (accuracy, completeness, safety, clarity, empathy, willingness to send to a patient) and five binary error categories. Phase 2 involved blinded patient evaluation of a prioritized subset of four questions, through a paper-based questionnaire with Latin-square randomization, on three Likert scales (quality, understandability, empathy), plus a forced-choice preference and a Turing-style authorship guess. Statistical analysis was conducted in R 4.5.0 using cumulative link mixed models for Phase 1 and linear mixed-effects models for Phase 2.

Results: The preliminary analysis is based on 218 expert ratings from three raters and on 323 patient-level ratings from 27 patients. ChatGPT showed the highest prevalence of hallucinations (24.3%) and guideline non-alignment (24.3%); Claude showed the safest profile (hallucinations 12.3%, safety omissions 2.8%); Gemini showed an intermediate profile. No model reached an expert mean of 4 out of 5 on the willingness-to-send-to-patient dimension. In the patient evaluation, Gemini was rated significantly more empathic than ChatGPT and Claude (3.96 vs 3.42 and 3.45, respectively; p < 0.001 after Bonferroni correction) and was preferred on the safety-critical question about cancer and mortality risk (65% preference). In the Turing-style authorship guess, Gemini responses were attributed to a physician in 43% of cases, compared with 34% for ChatGPT and 36% for Claude.

Conclusions: The three free-tier consumer LLMs show differentiated performance profiles on eosinophilic esophagitis: ChatGPT is characterized by higher error rates, Claude by the safest profile, and Gemini by a communication style that better matches patient preferences, particularly for safety-critical and emotionally charged topics. However, experts currently consider none of the three models suitable for direct, unmodified delivery to patients. These preliminary findings, to be confirmed upon completion of the international rating panel and patient recruitment, emphasize the importance of clinical mediation when these tools are used for patient education in eosinophilic esophagitis.
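The response counts given in the Materials and methods (459 generated, 153 selected for rating) follow directly from the design, and a short Python sketch may make the arithmetic concrete. Everything below is illustrative: the function names, the fixed seed, and the cyclic Latin-square construction are assumptions, since the abstract does not say how the random selection or the Latin-square ordering was actually implemented.

```python
import itertools
import random

MODELS = ["ChatGPT", "Claude", "Gemini"]
N_QUESTIONS = 51      # Question Prompt List items
N_REPETITIONS = 3     # submissions per question-model pair


def enumerate_responses():
    """All (question, model, repetition) triples implied by the design."""
    return list(itertools.product(range(1, N_QUESTIONS + 1),
                                  MODELS,
                                  range(1, N_REPETITIONS + 1)))


def select_for_rating(seed=0):
    """Pick one repetition at random for every question-model pair.

    The seed is a hypothetical choice made here for reproducibility only.
    """
    rng = random.Random(seed)
    return {(q, m): rng.randint(1, N_REPETITIONS)
            for q in range(1, N_QUESTIONS + 1)
            for m in MODELS}


def cyclic_latin_square(items):
    """Cyclic Latin square: row i is the item list rotated left by i places.

    Assumed construction; each item appears once per row and once per column.
    """
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(len(enumerate_responses()), "responses generated")
    print(len(select_for_rating()), "responses selected for rating")
    for row in cyclic_latin_square(MODELS):
        print(row)
```

Under these assumptions, each row of the square could serve as the presentation order of the three models for one questionnaire version, so that every model appears in every position equally often across versions.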
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/108920