Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.

Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

COLLU, MATTEO GIOELE
2023/2024

Abstract

Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.
2023
Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.
Large Language Model
Adversarial Attack
Jailbreaking
Prompt Engineering
Cybersecurity
File in questo prodotto:
File Dimensione Formato  
Collu_Matteo_Gioele.pdf

accesso aperto

Dimensione 9.99 MB
Formato Adobe PDF
9.99 MB Adobe PDF Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/68881