From Email Mining to Unsupervised Classification: An NLP Approach

ORTEGA DOMINGUEZ, ESTEBAN
2024/2025

Abstract

Natural Language Processing (NLP) techniques for document classification have advanced significantly, especially unsupervised and zero-shot methods, thanks to the development of large language models (LLMs). While supervised approaches depend heavily on labeled data, which is labor-intensive and difficult to obtain, unsupervised and zero-shot methods leverage intrinsic characteristics of the data as well as pre-trained capabilities. This thesis provides a comprehensive overview of unsupervised and few-shot methods for Italian email classification. A complete text processing and classification pipeline is presented, comprising: automated email data extraction; preprocessing with text de-duplication; document representation via embedding techniques, ranging from traditional TF-IDF to transformer-based methods such as Sentence-BERT; the application of clustering and topic modeling methods; and, finally, zero-shot and few-shot classification with several open- and closed-source state-of-the-art LLMs under different prompting strategies. Results show that Non-negative Matrix Factorization (NMF) with KL divergence outperforms the clustering methods, reaching an Adjusted Rand Index (ARI) of 0.287, which supports its use as a first approach and as a benchmark. For classification, state-of-the-art LLMs, in particular the open-source Llama 3.1 405B model in one-shot scenarios, achieved an accuracy of 86.5%, surpassing results previously reported for Italian email classification, such as the ALICE system (76%). Smaller open-source models such as Qwen 2.5 14B also achieved competitive results (75.1%), highlighting the feasibility of high-performance small-scale deployments.
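For readers unfamiliar with the Adjusted Rand Index cited in the abstract, the following is a minimal sketch of how the metric can be computed from two label assignments, using only the Python standard library. The function name and the toy labels are hypothetical and chosen for illustration; this is not the thesis's evaluation code, and the thesis may well have used a library implementation instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: how many items fall in (true class, predicted cluster).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs grouped together in both partitions.
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    # Expected value of `index` under random labelings with the same group sizes.
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2

    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in a single group
    return (index - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy data: reference categories vs. cluster assignments.
    reference = ["billing", "billing", "support", "sales"]
    clusters = [0, 0, 1, 1]
    print(round(adjusted_rand_index(reference, clusters), 3))
```

The score is 1.0 when the two partitions are identical up to a renaming of the groups, close to 0 for agreement at chance level, and negative when agreement is worse than chance, which gives some context for the reported value of 0.287.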
2024
Keywords: NLP; unsupervised; LLMs
Files in this item:
File: Esteban Ortega DS thesis final draft.pdf
Access: restricted
Size: 4.24 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84788