Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.
Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.
Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
COLLU, MATTEO GIOELE
2023/2024
Abstract
Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.File | Dimensione | Formato | |
---|---|---|---|
Collu_Matteo_Gioele.pdf
accesso aperto
Dimensione
9.99 MB
Formato
Adobe PDF
|
9.99 MB | Adobe PDF | Visualizza/Apri |
The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License
https://hdl.handle.net/20.500.12608/68881