Truth in Numbers, Lies in Words: Identifying Mafia-Infiltrated Firms through Natural Language Processing and Supervised Machine Learning
LOPARCO, EMMANUEL
2024/2025
Abstract
This thesis addresses two research questions: whether mafia-infiltrated firms exhibit distinctive linguistic patterns in their management reports (RQ1), and whether such differences can be used to detect infiltration through machine learning (RQ2). Using a dataset of management reports from Italian infiltrated firms and matched legitimate competitors, the study applies Natural Language Processing (NLP) techniques to analyse narrative disclosure behaviour. The empirical strategy is twofold. First, in relation to RQ1, an inferential analysis identifies systematic stylistic differences: infiltrated firms produce longer and lexically richer narratives in the Performance, R&D, and Future sections, while providing shorter and simpler disclosures in the Risks section, an asymmetric pattern consistent with selective transparency. Second, in relation to RQ2, a predictive analysis evaluates the detection value of these textual features. A TF-IDF-based XGBoost classifier achieves the best performance (F₁ score 0.71; AUC 0.89), outperforming SBERT embeddings. Overall, the findings show that narrative corporate disclosures contain measurable signals of criminal infiltration and demonstrate that textual analysis can complement traditional accounting indicators in supporting risk identification and enforcement.

| File | Access | Size | Format |
|---|---|---|---|
| Loparco_Emmanuel.pdf | Open access | 1.92 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/101309