Truth in Numbers, Lies in Words: Identifying Mafia-Infiltrated Firms through Natural Language Processing and Supervised Machine Learning

This thesis addresses two research questions: whether mafia-infiltrated firms exhibit distinctive linguistic patterns in their management reports (RQ1), and whether such differences can be used to detect infiltration through machine learning (RQ2). Using a dataset of management reports from Italian infiltrated firms and matched legitimate competitors, the study applies Natural Language Processing (NLP) techniques to analyse narrative disclosure behaviour. The empirical strategy is twofold. First, in relation to RQ1, an inferential analysis identifies systematic stylistic differences: infiltrated firms produce longer and lexically richer narratives in the Performance, R&D, and Future sections, while providing shorter and simpler disclosures in the Risks section, an asymmetric pattern consistent with selective transparency. Second, in relation to RQ2, a predictive analysis evaluates the detection value of these textual features. A TF-IDF-based XGBoost classifier achieves the best performance (F₁ score 0.71; AUC 0.89), outperforming SBERT embeddings. Overall, the findings show that narrative corporate disclosures contain measurable signals of criminal infiltration and demonstrate that textual analysis can complement traditional accounting indicators in supporting risk identification and enforcement.
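As a rough illustration of the kinds of textual features the abstract refers to, the sketch below computes narrative length, a type-token ratio, and TF-IDF weights in plain Python. It is not the thesis's pipeline. The type-token ratio is an assumed proxy for lexical richness, the function names and toy sentences are hypothetical, and the XGBoost classifier that would be trained on such weights is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def style_features(text):
    """Return token count and type-token ratio (assumed lexical-richness proxy)."""
    tokens = tokenize(text)
    n = len(tokens)
    ttr = len(set(tokens)) / n if n else 0.0
    return {"n_tokens": n, "type_token_ratio": ttr}


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy sentences, not taken from any real management report.
docs = [
    "the company expects strong growth and new investment",
    "the company faces credit risk and liquidity risk",
]
features = [style_features(d) for d in docs]
weights = tfidf(docs)
```

Terms shared by every document (here "the", "company", "and") receive a weight of zero under this unsmoothed formula, so only distinctive vocabulary would reach a downstream classifier. Library implementations usually apply smoothing and normalisation, so their values would differ.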


LOPARCO, EMMANUEL

2024/2025



	Faculty/Department
	
				Dipartimento di Scienze Economiche e Aziendali "Marco Fanno" - DSEA
			
	Degree programme
	
				APPLIED ECONOMICS, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Truth in Numbers, Lies in Words: Identifying Mafia-Infiltrated Firms through Natural Language Processing and Supervised Machine Learning
			
	Keywords
	
				Mafia
				Machine learning
				Text analysis
				Financial Statement
			
	Supervisor
	
				AMBROSINI, FRANCESCO
			
	Appears in types:
	
				Master's theses

Files in this item:

File	Access	Size	Format
Loparco_Emmanuel.pdf	Open access	1.92 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/101309