Synthetic Data Generation For Email Phishing Detection

CARIPOTI, EUGENIO

2024/2025

Abstract

Phishing attacks pose a significant cybersecurity challenge, and recent technological advances have introduced new complexities. The rise of Large Language Models (LLMs) has enabled attackers to generate more sophisticated, contextually nuanced phishing messages, increasing the potential effectiveness of such attacks. Researchers have started to use AI-based phishing detection systems, which learn from past data to identify and mitigate these attacks. However, acquiring suitable datasets for training these models remains challenging, for three reasons. First, due to privacy concerns, companies do not release internal emails, so public datasets include only a limited range of legitimate emails (ham), often from a few industries or outdated sources. Second, phishing tactics constantly evolve, so current datasets may fail to represent novel, sophisticated attack strategies and social engineering tactics. Finally, most datasets are mainly in English, limiting their applicability to real-world settings, where multilingual communications are predominant. These challenges hinder the development of robust, generalizable phishing detection models that can be applied in real-world scenarios. This thesis aims to fill the gap with FORGE, a framework for creating synthetic emails (subjects and bodies) that leverages the human-like text generation capabilities of LLMs. This approach has real-world applications for both benign and malicious purposes: it can enhance existing anti-phishing systems, but attackers can also use it to scale up phishing email generation. FORGE is then used to generate a fully synthetic dataset representing corporate communications from the USA, Italy, and the UK. The dataset includes 5185 phishing emails categorized into three attack types (malware installation, credential harvesting, and Business Email Compromise (BEC)), alongside 5575 legitimate emails.
The FORGE framework is evaluated by training state-of-the-art classifiers on the synthetic dataset; these classifiers generalize well to real-world datasets, achieving an F1-score of up to 0.89. The generated dataset is further evaluated by fine-tuning BERT-based models and comparing their performance against a real-world dataset, which indicates that the proposed synthetic data generation approach holds potential for further application and investigation in anti-phishing research.
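
For readers who want to relate the figures quoted in the abstract, the following minimal Python sketch computes the class balance of the synthetic dataset from the two stated counts and shows the standard F1-score formula. The function names and the confusion counts in the example are illustrative assumptions; they are not taken from the thesis and do not reproduce its results.

```python
# Email counts as stated in the abstract.
N_PHISHING = 5185
N_LEGITIMATE = 5575


def class_share(n_class: int, n_total: int) -> float:
    """Fraction of the dataset that belongs to one class."""
    return n_class / n_total


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


total = N_PHISHING + N_LEGITIMATE
phishing_share = class_share(N_PHISHING, total)

# Hypothetical confusion counts for illustration only (not thesis results).
example_f1 = f1_score(tp=90, fp=20, fn=10)

print(f"total emails: {total}")
print(f"phishing share: {phishing_share:.3f}")
print(f"example F1: {example_f1:.3f}")
```

The two classes are close to balanced, so an F1-score on this dataset is not dominated by one class.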

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				CYBERSECURITY Master's Degree (D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Synthetic Data Generation For Email Phishing Detection
			
	Keywords
	
				NLP
				Phishing Detection
				Synthetic Data
				LLM
			
	Supervisor
	
				PAJOLA, LUCA
			
	Appears in types:
	
				Master's theses

Files in this item:

File	Size	Format
Caripoti_Eugenio.pdf (restricted access)	34.89 MB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84772