Truth in Numbers, Lies in Words: Identifying Mafia-Infiltrated Firms through Natural Language Processing and Supervised Machine Learning

LOPARCO, EMMANUEL
2024/2025

Abstract

This thesis addresses two research questions: whether mafia-infiltrated firms exhibit distinctive linguistic patterns in their management reports (RQ1), and whether such differences can be used to detect infiltration through machine learning (RQ2). Using a dataset of management reports from Italian infiltrated firms and matched legitimate competitors, the study applies Natural Language Processing (NLP) techniques to analyse narrative disclosure behaviour. The empirical strategy is twofold. First, in relation to RQ1, an inferential analysis identifies systematic stylistic differences: infiltrated firms produce longer and lexically richer narratives in the Performance, R&D, and Future sections, while providing shorter and simpler disclosures in the Risks section, an asymmetric pattern consistent with selective transparency. Second, in relation to RQ2, a predictive analysis evaluates the detection value of these textual features. A TF-IDF-based XGBoost classifier achieves the best performance (F₁ score 0.71; AUC 0.89), outperforming SBERT embeddings. Overall, the findings show that narrative corporate disclosures contain measurable signals of criminal infiltration and demonstrate that textual analysis can complement traditional accounting indicators in supporting risk identification and enforcement.
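
To make the two reported metrics concrete, the following minimal sketch computes an F₁ score and an AUC by hand. It rests on assumptions: the labels and scores are invented toy values rather than the thesis data, the 0.5 decision threshold is arbitrary, and the function names `f1_score` and `roc_auc` are hypothetical helpers written for this illustration. The abstract does not state how the thesis computed its metrics.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 1 = infiltrated firm, 0 = legitimate firm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]  # hypothetical classifier outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]     # arbitrary 0.5 threshold

print(f"F1  = {f1_score(y_true, y_pred):.2f}")
print(f"AUC = {roc_auc(y_true, scores):.2f}")
```

The sketch shows why the two figures can diverge: F₁ depends on a single decision threshold and on the balance of false positives and false negatives, whereas AUC summarises how well the scores rank infiltrated firms above legitimate ones across all thresholds.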
2024
Mafia
Machine learning
Text analysis
Financial Statement

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/101309