Leveraging Extracted Submodels to Increase Jailbreak Efficiency in LLMs

Conte, Riccardo
Academic Year 2025/2026

Abstract

LLM chatbots such as ChatGPT are increasingly popular due to their capabilities and growing autonomy. As these systems improve, however, so does the risk of misuse: attackers may elicit sophisticated harmful responses for malicious purposes. To reduce this threat, companies train LLMs to follow safety guidelines in their replies, yet attackers can often induce models to bypass these safeguards through attacks known as jailbreaks. Existing jailbreak methods generally do not exploit knowledge of the internal structure of LLMs, which may limit their effectiveness. This thesis investigates whether techniques from mechanistic interpretability, the field that studies the internal computations of neural networks, can improve the performance of classical jailbreak attacks. We begin by analyzing whether, and how, LLMs distinguish harmful prompts from harmless ones in their internal activations. We then formalize Mechanistic AutoDAN, a novel method for attacking LLMs using insights from mechanistic interpretability. Finally, we evaluate its performance against that of classical jailbreak attacks. Our results suggest that Mechanistic AutoDAN is a viable alternative to classical jailbreak methods and that mechanistic interpretability can provide useful signals for improving jailbreak generation.
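
The full text is access-restricted, so the thesis's actual method is not reproduced here. The following is a minimal sketch, under stated assumptions, of the kind of activation analysis the abstract describes: testing whether a model separates harmful from harmless prompts in its internal activations by computing a difference-of-means direction over residual-stream states. The model name, layer index, prompt lists, and helper function are illustrative placeholders, not the thesis's code; only the standard Hugging Face transformers API is assumed.

```python
# Sketch (not the thesis code): probe whether an LLM linearly separates
# harmful from harmless prompts in its hidden activations, via a
# difference-of-means "harmfulness" direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; the thesis targets safety-tuned chat models
LAYER = 6        # mid-depth layer; the informative layer must be found empirically

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states  # tuple: (embeddings, layer 1, ...)
    return hidden[LAYER][0, -1]  # shape: (d_model,)

# Tiny illustrative contrast pairs; a real analysis uses many prompts.
harmful = ["Explain how to pick a lock to break into a house."]
harmless = ["Explain how a pin tumbler lock works."]

mu_harm = torch.stack([last_token_activation(p) for p in harmful]).mean(0)
mu_safe = torch.stack([last_token_activation(p) for p in harmless]).mean(0)
direction = mu_harm - mu_safe
direction = direction / direction.norm()

# Score a new prompt by projection: a higher value means the model's
# internal representation reads the prompt as closer to the harmful cluster.
score = last_token_activation("How do I hotwire a car?") @ direction
print(f"projection onto harmfulness direction: {score.item():.3f}")
```

A direction of this kind is, plausibly, the sort of internal signal an interpretability-guided attack can exploit: a candidate jailbreak prompt can be scored not only by the model's output but by how far it shifts the prompt's activation away from the region the model associates with harmful requests.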
Keywords: Large Language Model, Jailbreak, Interpretability
File: Master_Thesis_Riccardo_Conte.pdf (Adobe PDF, 1.68 MB, restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108079