Strengths and limitations of ChatGPT in the assessment of permanent biological impairment: an accuracy study on 150 real cases
AMICO, IRENE
2023/2024
Abstract
This experimental study on 150 real cases assessed the ability of a generative language model (ChatGPT EDU – GPT-5 Pro) to estimate permanent impairment of psycho-physical integrity in a civil law context, comparing its evaluations with those of two forensic physicians with different levels of experience (senior and junior). The aim was not to replace professional judgment but to verify the feasibility of simulating, with sufficient reliability, a standardized assessment pathway based on tabular criteria and a transparent clinical and medico-legal rationale. The results show very high agreement between the two human evaluators (CCC 0.9917 across the entire sample), with virtually no bias and narrow limits of agreement; this confirms that, given the same method, the "physiological" variability of expert judgment remains limited. Compared with the senior evaluator, the AI shows substantial overall concordance (CCC 0.9707) but systematic underestimation at low and medium percentages (bias ≈ −3 percentage points; limits of agreement ≈ −20 to +10). Performance varies markedly by severity class: insufficient for minor impairments (CCC 0.5737, with pronounced underestimation and dispersion) and good for major impairments (CCC 0.9538, with mild-to-moderate underestimation and acceptable limits of agreement). The analysis clarifies the reasons for these patterns. The model performs better when the picture is driven by well-described, measurable primary lesions (range of motion in degrees, extent of scarring, prosthetic classes), which allow a direct choice of the tabular entry and assignment of a value within its range. It struggles, however, when it must combine multiple outcomes (especially mild ones), when indicators are only qualitative ("mild," "modest"), or when it must move from the impairment of a single domain to the person's overall functional compromise; in these situations it tends to select the minimum value or the most conservative class. The result is underestimation, particularly in minor and multi-injury cases. Overall, the study provides encouraging yet not definitive evidence: an LLM, appropriately instructed with guidance documents and standardized interaction protocols, can reproduce tabular reasoning with good fidelity and can support consistency of assessment in well-structured scenarios; however, it is not yet reliable in cases with multiple, interdependent impairments, where clinical-functional synthesis and the overall evaluation of several impairments remain central.
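The abstract's headline statistics are Lin's concordance correlation coefficient (CCC) and Bland-Altman bias with 95% limits of agreement (LoA). As a minimal illustrative sketch of how such figures are typically computed from paired ratings, the Python snippet below applies the standard formulas to synthetic data; the generated values only mimic the reported pattern (underestimation of about 3 percentage points) and are not the study's actual cases or analysis code.

```python
import numpy as np

def lins_ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between paired ratings."""
    mx, my = x.mean(), y.mean()
    # Population (biased) variances and covariance, as in Lin (1989).
    sx2 = np.mean((x - mx) ** 2)
    sy2 = np.mean((y - my) ** 2)
    sxy = np.mean((x - mx) * (y - my))
    return 2 * sxy / (sx2 + sy2 + (mx - my) ** 2)

def bland_altman(x: np.ndarray, y: np.ndarray):
    """Bias and 95% limits of agreement for the differences y - x."""
    d = y - x
    bias = d.mean()
    sd = d.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired impairment percentages: senior expert vs. AI.
rng = np.random.default_rng(0)
senior = rng.uniform(1, 60, size=150)
ai = senior - 3 + rng.normal(0, 5, size=150)  # systematic underestimation

print(f"CCC  = {lins_ccc(senior, ai):.4f}")
bias, (lo, hi) = bland_altman(senior, ai)
print(f"bias = {bias:+.2f}, LoA = [{lo:+.2f}, {hi:+.2f}]")
```

A per-class breakdown, as in the study, would simply apply the same two functions separately to the minor-impairment and major-impairment subsets of the paired ratings.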
| File | Size | Format |
|---|---|---|
| Amico_Irene.pdf (restricted access) | 1.29 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/96689