As artificial intelligence (AI) systems increasingly integrate into everyday life, large language model (LLM)-based chatbots, such as ChatGPT, are now used by millions worldwide, often harnessing their potential without a clear understanding of their inherent limitations and psychological implications. In particular, this work addresses sycophancy, i.e., these models’ tendency to flatter and agree with users’ statements, sometimes even factually wrong ones. From a Human-AI interaction perspective, such behaviour poses significant cognitive and social risks, reinforcing user misconceptions (confirmation biases, general misinformation), uncritical trust, anthropomorphism attributions, and the development or exacerbation of psychiatric disorders. In order to mitigate this phenomenon, I propose a novel prompt-based approach, the Augmented Knowledge Prompt (Aug), which instructs LLMs to retrieve and report relevant model’s internal contextual facts before producing a final answer, thereby increasing factual grounding and transparency. Specifically, by internal I refer to the knowledge already encoded within the model, without relying on any external sources (e.g., online searches). To validate it, I conducted an experimental comparison on four different LLMs – two open-source (Mistral and Llama models) and two closed models (the SOTA Gemini-2.5Flash model and Gemini-2.0-Flash-Lite) – using a widely known dataset of variably framed open-ended questions. Specifically, the Augmented Knowledge Prompt was tested against LLMs’ baseline condition (Default), a literature-validated prompt-based solution (NonSyc), and a combined condition (NonSycAug) as well. Performance was evaluated via teacher-model assessment using accuracy as the principal outcome. The results show that open-source models were generally more susceptible to question framing than closed-source ones, while also being less affected by prompt manipulations. Notably, all models but one (Mistral) were more robust to how questions were formulated under the NonSycAug prompt condition, while both closed-source models’ accuracy was better under the augmented conditions (Aug and NonSycAug). Lastly, the Gemini-2.5Flash model benefited the most from system prompt design with respect to its baseline accuracy. Overall, the study contributes empirical evidence and methodological tools for evaluating and mitigating sycophancy in LLMs, supporting the development of safer and more reliable Human–AI interactions by encouraging an interdisciplinary effort.

As artificial intelligence (AI) systems increasingly integrate into everyday life, large language model (LLM)-based chatbots, such as ChatGPT, are now used by millions worldwide, often harnessing their potential without a clear understanding of their inherent limitations and psychological implications. In particular, this work addresses sycophancy, i.e., these models’ tendency to flatter and agree with users’ statements, sometimes even factually wrong ones. From a Human-AI interaction perspective, such behaviour poses significant cognitive and social risks, reinforcing user misconceptions (confirmation biases, general misinformation), uncritical trust, anthropomorphism attributions, and the development or exacerbation of psychiatric disorders. In order to mitigate this phenomenon, I propose a novel prompt-based approach, the Augmented Knowledge Prompt (Aug), which instructs LLMs to retrieve and report relevant model’s internal contextual facts before producing a final answer, thereby increasing factual grounding and transparency. Specifically, by internal I refer to the knowledge already encoded within the model, without relying on any external sources (e.g., online searches). To validate it, I conducted an experimental comparison on four different LLMs – two open-source (Mistral and Llama models) and two closed models (the SOTA Gemini-2.5Flash model and Gemini-2.0-Flash-Lite) – using a widely known dataset of variably framed open-ended questions. Specifically, the Augmented Knowledge Prompt was tested against LLMs’ baseline condition (Default), a literature-validated prompt-based solution (NonSyc), and a combined condition (NonSycAug) as well. Performance was evaluated via teacher-model assessment using accuracy as the principal outcome. The results show that open-source models were generally more susceptible to question framing than closed-source ones, while also being less affected by prompt manipulations. Notably, all models but one (Mistral) were more robust to how questions were formulated under the NonSycAug prompt condition, while both closed-source models’ accuracy was better under the augmented conditions (Aug and NonSycAug). Lastly, the Gemini-2.5Flash model benefited the most from system prompt design with respect to its baseline accuracy. Overall, the study contributes empirical evidence and methodological tools for evaluating and mitigating sycophancy in LLMs, supporting the development of safer and more reliable Human–AI interactions by encouraging an interdisciplinary effort.

Mitigating Sycophancy in Large Language Models through Prompt Design: A Transparent Knowledge-Augmented Approach Based on Internal Retrieval

LA ROSA, ANDREA
2024/2025

Abstract

As artificial intelligence (AI) systems increasingly integrate into everyday life, large language model (LLM)-based chatbots, such as ChatGPT, are now used by millions worldwide, often harnessing their potential without a clear understanding of their inherent limitations and psychological implications. In particular, this work addresses sycophancy, i.e., these models’ tendency to flatter and agree with users’ statements, sometimes even factually wrong ones. From a Human-AI interaction perspective, such behaviour poses significant cognitive and social risks, reinforcing user misconceptions (confirmation biases, general misinformation), uncritical trust, anthropomorphism attributions, and the development or exacerbation of psychiatric disorders. In order to mitigate this phenomenon, I propose a novel prompt-based approach, the Augmented Knowledge Prompt (Aug), which instructs LLMs to retrieve and report relevant model’s internal contextual facts before producing a final answer, thereby increasing factual grounding and transparency. Specifically, by internal I refer to the knowledge already encoded within the model, without relying on any external sources (e.g., online searches). To validate it, I conducted an experimental comparison on four different LLMs – two open-source (Mistral and Llama models) and two closed models (the SOTA Gemini-2.5Flash model and Gemini-2.0-Flash-Lite) – using a widely known dataset of variably framed open-ended questions. Specifically, the Augmented Knowledge Prompt was tested against LLMs’ baseline condition (Default), a literature-validated prompt-based solution (NonSyc), and a combined condition (NonSycAug) as well. Performance was evaluated via teacher-model assessment using accuracy as the principal outcome. The results show that open-source models were generally more susceptible to question framing than closed-source ones, while also being less affected by prompt manipulations. Notably, all models but one (Mistral) were more robust to how questions were formulated under the NonSycAug prompt condition, while both closed-source models’ accuracy was better under the augmented conditions (Aug and NonSycAug). Lastly, the Gemini-2.5Flash model benefited the most from system prompt design with respect to its baseline accuracy. Overall, the study contributes empirical evidence and methodological tools for evaluating and mitigating sycophancy in LLMs, supporting the development of safer and more reliable Human–AI interactions by encouraging an interdisciplinary effort.
2024
Mitigating Sycophancy in Large Language Models through Prompt Design: A Transparent Knowledge-Augmented Approach Based on Internal Retrieval
As artificial intelligence (AI) systems increasingly integrate into everyday life, large language model (LLM)-based chatbots, such as ChatGPT, are now used by millions worldwide, often harnessing their potential without a clear understanding of their inherent limitations and psychological implications. In particular, this work addresses sycophancy, i.e., these models’ tendency to flatter and agree with users’ statements, sometimes even factually wrong ones. From a Human-AI interaction perspective, such behaviour poses significant cognitive and social risks, reinforcing user misconceptions (confirmation biases, general misinformation), uncritical trust, anthropomorphism attributions, and the development or exacerbation of psychiatric disorders. In order to mitigate this phenomenon, I propose a novel prompt-based approach, the Augmented Knowledge Prompt (Aug), which instructs LLMs to retrieve and report relevant model’s internal contextual facts before producing a final answer, thereby increasing factual grounding and transparency. Specifically, by internal I refer to the knowledge already encoded within the model, without relying on any external sources (e.g., online searches). To validate it, I conducted an experimental comparison on four different LLMs – two open-source (Mistral and Llama models) and two closed models (the SOTA Gemini-2.5Flash model and Gemini-2.0-Flash-Lite) – using a widely known dataset of variably framed open-ended questions. Specifically, the Augmented Knowledge Prompt was tested against LLMs’ baseline condition (Default), a literature-validated prompt-based solution (NonSyc), and a combined condition (NonSycAug) as well. Performance was evaluated via teacher-model assessment using accuracy as the principal outcome. The results show that open-source models were generally more susceptible to question framing than closed-source ones, while also being less affected by prompt manipulations. Notably, all models but one (Mistral) were more robust to how questions were formulated under the NonSycAug prompt condition, while both closed-source models’ accuracy was better under the augmented conditions (Aug and NonSycAug). Lastly, the Gemini-2.5Flash model benefited the most from system prompt design with respect to its baseline accuracy. Overall, the study contributes empirical evidence and methodological tools for evaluating and mitigating sycophancy in LLMs, supporting the development of safer and more reliable Human–AI interactions by encouraging an interdisciplinary effort.
AI
LLM
Empirical
File in questo prodotto:
File Dimensione Formato  
La Rosa Andrea_Tesi magistrale.pdf

Accesso riservato

Dimensione 1.84 MB
Formato Adobe PDF
1.84 MB Adobe PDF

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/100208