From Email Mining to Unsupervised Classification: An NLP Approach
ORTEGA DOMINGUEZ, ESTEBAN
2024/2025
Abstract
Natural Language Processing (NLP) techniques for document classification have advanced significantly, particularly unsupervised and zero-shot methods, thanks to the development of large language models (LLMs). Supervised approaches depend heavily on labeled data, which is labor-intensive and difficult to obtain, whereas unsupervised and zero-shot methods leverage intrinsic characteristics of the data and the capabilities of pre-trained models. This thesis provides a comprehensive overview of unsupervised and few-shot methods for Italian email classification. It presents a complete text processing and classification pipeline comprising: automated email data extraction; preprocessing with text de-duplication; document representation via embedding techniques, ranging from traditional TF-IDF to transformer-based methods such as Sentence-BERT; clustering and topic modeling; and zero- and few-shot classification with several open- and closed-source state-of-the-art LLMs under different prompting strategies. Results show that Non-negative Matrix Factorization (NMF) with KL divergence outperforms the clustering methods, reaching an Adjusted Rand Index (ARI) of 0.287, which supports its use as a first approach and as a benchmark. For classification, state-of-the-art LLMs, in particular the open-source Llama 3.1 405B model in a one-shot setting, achieved an accuracy of 86.5%, surpassing prior results in the literature on Italian email classification such as the ALICE system (76%). Smaller open-source models such as Qwen 2.5 14B also produced competitive results (75.1%), highlighting the feasibility of high-performance small-scale deployments.

File | Size | Format
---|---|---
Esteban Ortega DS thesis final draft.pdf (restricted access) | 4.24 MB | Adobe PDF
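The abstract reports clustering quality as an Adjusted Rand Index (ARI). For readers unfamiliar with the metric, the following is a minimal sketch of how ARI can be computed from two label assignments using only the Python standard library. It is not taken from the thesis, whose full text is restricted; the function name `adjusted_rand_index` and the example labels are hypothetical and chosen only for illustration. ARI is 1.0 for partitions that agree up to a renaming of clusters, is close to 0 for random assignments, and can be negative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Illustrative helper (hypothetical name), not the thesis implementation.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Contingency counts: items sharing a (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs grouped together in each view.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # upper bound on agreement

    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


# Made-up example: four emails with annotated categories and cluster ids.
true_labels = ["billing", "billing", "support", "support"]
cluster_ids = [1, 1, 0, 0]
score = adjusted_rand_index(true_labels, cluster_ids)  # same grouping, renamed
```

In practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used; the sketch only makes the chance correction explicit.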
The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/84788