Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

COLLU, MATTEO GIOELE

2023/2024

Abstract

Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.

Scheda

Scheda DC

	Facoltà/Dipartimento
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Corso di studio
	
				COMPUTER SCIENCE Laurea Magistrale (D.M. 270/2004)
			
	Anno Accademico
	
				2023
			
	Titolo inglese
	
				Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
			
	Abstract in italiano
	
				Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this thesis, we show how to bypass these measures for ChatGPT, GPT-3.5-turbo, Bard, and Gemini-1.5-flash (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This thesis shows that by using adversarial personas, one can overcome safety mechanisms set out by LLMs' developers. We also introduce several ways of activating such adversarial personas, which show that the considered chatbots are vulnerable to this kind of attack. With our attack, we managed to get illicit information and dangerous content for 38 out of 40 different scenarios in GPT-3.5-turbo, and 40 out of 40 in Gemini-1.5-flash. On the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks. Our best defense increased GPT-3.5-turbo's robustness, defusing 106 out of 114 working jailbreaking prompts.
			
	Parola chiave
	
				Large Language Model
Adversarial Attack
Jailbreaking
Prompt Engineering
Cybersecurity
			
	Relatore
	
				CONTI, MAURO
			
	Appare nelle tipologie:
	
				Lauree magistrali

File in questo prodotto:

File	Dimensione	Formato
Collu_Matteo_Gioele.pdf accesso aperto Dimensione 9.99 MB Formato Adobe PDF Visualizza/Apri	9.99 MB	Adobe PDF	Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/68881