
Synthetic Data Generation For Email Phishing Detection

CARIPOTI, EUGENIO
2024/2025

Abstract

Phishing attacks pose a significant cybersecurity challenge, and recent technological advances have introduced new complexities. The rise of Large Language Models (LLMs) has enabled attackers to generate more sophisticated, contextually nuanced phishing messages, increasing the potential effectiveness of such attacks. Researchers have started to use AI-based phishing detection systems, which learn from past data to identify and mitigate these attacks. However, acquiring suitable datasets for training these models remains challenging, and existing datasets have several limitations. First, due to privacy concerns, companies do not release internal emails, so public datasets include only a limited range of legitimate emails (ham), often from a few industries or outdated sources. Second, phishing tactics constantly evolve, so current datasets may fail to represent novel, sophisticated attack strategies and social engineering tactics. Finally, most datasets are mainly in English, limiting their applicability to real-world cases, where multilingual communications are predominant. These challenges hinder the development of robust and generalizable phishing detection models that can be applied in real-world scenarios. This thesis aims to fill the gap with FORGE, a framework for creating synthetic emails (subjects and bodies) that leverages the human-like text generation capabilities of LLMs. This approach has real-world applications for both benign and malicious purposes: it can enhance existing anti-phishing systems, but attackers can also apply it to scale up phishing email generation. FORGE is then used to generate a full synthetic dataset representing corporate communications from the USA, Italy, and the UK. The dataset includes 5185 phishing emails categorized into three attack types (malware installation, credential harvesting, and Business Email Compromise (BEC)), alongside 5575 legitimate emails.
Finally, the FORGE framework is evaluated by training state-of-the-art classifiers on the synthetic dataset; these classifiers generalize well to real-world datasets, achieving an F1-score of up to 0.89. The generated dataset is also evaluated by fine-tuning BERT-based models and comparing their performance against a real-world dataset, indicating that the proposed synthetic data generation approach holds potential for further application and investigation in anti-phishing research.
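As a quick illustration of the figures quoted in the abstract, the following minimal sketch computes the class balance of the synthetic dataset from the two reported counts and shows how an F1-score is derived from a confusion matrix. The function names and the confusion counts passed to `f1_score` are hypothetical and chosen only for illustration; they are not results or code from the thesis.

```python
# Email counts as reported in the abstract.
PHISHING_EMAILS = 5185
LEGITIMATE_EMAILS = 5575


def class_shares(phishing: int, legitimate: int) -> tuple[int, float, float]:
    """Return (total, phishing share, legitimate share) for a two-class dataset."""
    total = phishing + legitimate
    return total, phishing / total, legitimate / total


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 definition: 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    total, phish_share, ham_share = class_shares(PHISHING_EMAILS, LEGITIMATE_EMAILS)
    print(f"total={total} phishing={phish_share:.1%} legitimate={ham_share:.1%}")

    # Hypothetical confusion counts, for illustration only (not thesis results).
    print(f"F1={f1_score(tp=80, fp=10, fn=30):.2f}")
```

With the reported counts, the dataset totals 10760 emails and is close to balanced between the two classes, which is one reason F1 and accuracy would tend to tell a similar story on the synthetic data itself; on real-world test sets with skewed class ratios, F1 is usually the more informative of the two.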
2024
NLP
Phishing Detection
Synthetic Data
LLM
Files in this item:
File: Caripoti_Eugenio.pdf
Access: restricted
Size: 34.89 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84772