From Email Mining to Unsupervised Classification: An NLP Approach

Natural Language Processing (NLP) techniques for document classification have advanced significantly, particularly unsupervised and zero-shot methods, thanks to the development of large language models (LLMs). Supervised approaches depend heavily on labeled data, which is labor-intensive and difficult to obtain, whereas unsupervised and zero-shot methods exploit intrinsic characteristics of the data and pre-trained capabilities. This thesis provides a comprehensive overview of unsupervised and few-shot methods for Italian email classification. It presents a complete text processing and classification pipeline comprising automated email data extraction; preprocessing with text de-duplication; document representation via embedding techniques, ranging from traditional TF-IDF to transformer-based methods such as Sentence-BERT; clustering and topic modeling; and zero- and few-shot classification with several open- and closed-source state-of-the-art LLMs under different prompting strategies. Results show that Non-negative Matrix Factorization (NMF) with KL divergence outperforms the clustering methods, reaching an Adjusted Rand Index (ARI) of 0.287, which supports its use as a first approach and as a benchmark. For classification, state-of-the-art LLMs, in particular the open-source Llama 3.1 405B model in a one-shot setting, achieved 86.5% accuracy, surpassing results reported in the existing literature on Italian email classification, such as the ALICE system (76%). Smaller open-source models such as Qwen 2.5 14B were also competitive (75.1%), indicating that high-performing small-scale deployments are feasible.
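
For readers unfamiliar with the Adjusted Rand Index cited in the abstract, the following is a minimal sketch of the standard pair-counting formula, which scores the agreement between a predicted grouping and reference labels while correcting for chance. It is an illustration only and is not taken from the thesis; the function name `adjusted_rand_index` and the toy labels are assumptions introduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same items.

    Hypothetical helper for illustration; not the thesis's implementation.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions trivially agree.
        return 1.0

    # Contingency counts: items sharing each (true label, predicted label) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs grouped together in both, in the reference, in the prediction.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # upper bound of the index

    if max_index == expected:
        # Degenerate case: both partitions are a single cluster or all singletons.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

The index is 1 when the two groupings match up to a renaming of cluster labels, is close to 0 for random assignments, and can be negative when agreement is worse than chance, which is why a value such as 0.287 is read as moderate agreement above chance.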


ORTEGA DOMINGUEZ, ESTEBAN

2024/2025



Faculty/Department: Dipartimento di Matematica "Tullio Levi-Civita" - DM
Degree programme: DATA SCIENCE, Laurea Magistrale (master's degree) (D.M. 270/2004)
Academic year: 2024
English title: From Email Mining to Unsupervised Classification: An NLP Approach
Keywords: NLP; UNSUPERVISED; LLMS
Supervisor: SUSTO, GIAN ANTONIO
Appears in types: Lauree magistrali (master's theses)

Files in this item:

File	Size	Format
Esteban Ortega DS thesis final draft.pdf (restricted access)	4.24 MB	Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84788