Leveraging Extracted Submodels to Increase Jailbreak Efficiency in LLMs

Conte, Riccardo
Academic Year 2025/2026

Abstract

LLM chatbots such as ChatGPT are increasingly popular due to their capabilities and growing autonomy. As these systems improve, however, so does the risk of misuse: attackers may elicit sophisticated harmful responses for malicious purposes. To reduce this threat, companies train LLMs to follow safety guidelines in their replies, yet attackers can often induce models to bypass these safeguards through attacks known as jailbreaks. Existing jailbreak methods generally do not exploit knowledge of the internal structure of LLMs, which may limit their effectiveness. This thesis investigates whether techniques from mechanistic interpretability, the field that studies the internal computations of neural networks, can improve the performance of classical jailbreak attacks. We begin by analyzing whether, and how, LLMs distinguish harmful prompts from harmless ones in their internal activations. We then formalize Mechanistic AutoDAN, a novel method for attacking LLMs using insights from mechanistic interpretability. Finally, we evaluate its performance against that of classical jailbreak attacks. Our results suggest that Mechanistic AutoDAN is a viable alternative to classical jailbreak methods and that mechanistic interpretability can provide useful signals for improving jailbreak generation.
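
The full text is access-restricted, so the thesis's actual method is not reproduced here. The following is a minimal sketch, under stated assumptions, of the kind of activation analysis the abstract describes: testing whether a model separates harmful from harmless prompts in its internal activations by computing a difference-of-means direction over residual-stream states. The model name, layer index, prompt lists, and helper function are illustrative placeholders, not the thesis's code; only the standard Hugging Face transformers API is assumed.

```python
# Sketch (not the thesis code): probe whether an LLM linearly separates
# harmful from harmless prompts in its hidden activations, via a
# difference-of-means "harmfulness" direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; the thesis targets safety-tuned chat models
LAYER = 6        # mid-depth layer; the informative layer must be found empirically

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states  # tuple: (embeddings, layer 1, ...)
    return hidden[LAYER][0, -1]  # shape: (d_model,)

# Tiny illustrative contrast pairs; a real analysis uses many prompts.
harmful = ["Explain how to pick a lock to break into a house."]
harmless = ["Explain how a pin tumbler lock works."]

mu_harm = torch.stack([last_token_activation(p) for p in harmful]).mean(0)
mu_safe = torch.stack([last_token_activation(p) for p in harmless]).mean(0)
direction = mu_harm - mu_safe
direction = direction / direction.norm()

# Score a new prompt by projection: a higher value means the model's
# internal representation reads the prompt as closer to the harmful cluster.
score = last_token_activation("How do I hotwire a car?") @ direction
print(f"projection onto harmfulness direction: {score.item():.3f}")
```

A direction of this kind is, plausibly, the sort of internal signal an interpretability-guided attack can exploit: a candidate jailbreak prompt can be scored not only by the model's output but by how far it shifts the prompt's activation away from the region the model associates with harmful requests.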
Keywords: Large Language Model, Jailbreak, Interpretability
File: Master_Thesis_Riccardo_Conte.pdf (Adobe PDF, 1.68 MB, restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108079