Retrieval-Augmented Generation for Legal Document Analysis: A Case Study on Statuti delle Imprese.

The digitalization of corporate governance increasingly requires automated tools to extract critical legal parameters from corporate bylaws (Statuti delle Imprese). However, the semantic density of legal texts, coupled with strict data sovereignty mandates and GDPR compliance requirements, presents significant challenges for both traditional algorithmic extraction and cloud-based Large Language Models (LLMs). This thesis addresses these limitations by presenting an on-premises Retrieval-Augmented Generation (RAG) prototype optimized for standard enterprise infrastructure (CPU-only execution, 16 GB RAM).

The proposed system features a dual-mode architecture: an automated batch-processing pipeline for JSON entity extraction and a conversational interface for querying documents locally. For structure-aware segmentation of the Statuti documents before database storage, we adopted a hierarchical chunking strategy for document ingestion, hereafter referred to as Custom Structural Chunking. This approach uses regular expressions to isolate legal articles as discrete semantic units and falls back to recursive fixed-size segmentation for unformatted text, so that context is preserved. The system's architectural viability was assessed through an ablation study comparing the proposed chunking strategy with standard fixed-size and embedding-based segmentation methods, complemented by a comparative evaluation of selected open-source Small Language Models (SLMs) as the RAG generative engine.

In our experiments, the best-performing setup combined Gemma 3 (4B) for both generation and extraction with the Custom Structural Chunking strategy for document ingestion. On the evaluation set of novel statutes, this configuration achieved a Context Recall of 95.4% and a Context Precision of 87.8%, translating to a natural-language QA accuracy of 91.7% and an end-to-end structured JSON extraction accuracy of 87.5%.

While the approach remains sensitive to non-standard document layouts, the results demonstrate empirically that optimized SLMs can execute complex, privacy-compliant RAG tasks on standard enterprise hardware. Ultimately, this work establishes a functional, cost-effective framework for integrating generative language models into sensitive enterprise workflows while keeping proprietary data fully under the organization's control.
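The chunking idea described in the abstract (regular expressions that isolate legal articles, with a fixed-size fallback for unformatted text) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the thesis implementation: the heading pattern (`Art. N` / `Articolo N`), the function names, and the `max_chars` and `overlap` defaults are hypothetical, and a plain sliding-window splitter stands in for the recursive fixed-size fallback the abstract mentions.

```python
import re

# Hypothetical heading pattern: matches lines starting with "Art. 5" or "Articolo 12".
# The thesis does not publish its regular expressions; this one is an assumption.
ARTICLE_HEADING = re.compile(r"^(?:Art\.|Articolo)\s*\d+", re.IGNORECASE | re.MULTILINE)


def fixed_size_chunks(text, max_chars=1000, overlap=100):
    """Fallback: cut text into overlapping windows of at most max_chars characters."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    text = text.strip()
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def structural_chunks(text, max_chars=1000, overlap=100):
    """Split on article headings; fall back to fixed-size windows otherwise."""
    starts = [m.start() for m in ARTICLE_HEADING.finditer(text)]
    if not starts:
        # No recognizable article structure: use the fallback on the whole text.
        return fixed_size_chunks(text, max_chars, overlap)

    chunks = []
    # Text before the first heading (a preamble) is chunked with the fallback.
    preamble = text[:starts[0]]
    if preamble.strip():
        chunks.extend(fixed_size_chunks(preamble, max_chars, overlap))

    # Each article runs from its heading to the start of the next heading.
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        article = text[begin:end].strip()
        if len(article) <= max_chars:
            chunks.append(article)
        else:
            # Oversized articles are subdivided with the same fallback.
            chunks.extend(fixed_size_chunks(article, max_chars, overlap))
    return chunks
```

Calling `structural_chunks(statute_text)` on a statute with recognizable headings yields one chunk per article (plus any preamble), while text without headings is handled entirely by the fallback. Keeping each article whole where possible is what lets a retriever return a complete legal provision rather than a fragment.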

ALI, AMJAD

2025/2026

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				DATA SCIENCE, Master's Degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic Year
	
				2025
			
	English title
	
				Retrieval-Augmented Generation for Legal Document Analysis: A Case Study on Statuti delle Imprese.
			
	Keywords
	
				RAG
Legal NLP
Corporate Statuti
Open-Source AI
			
	Supervisor
	
				ERSEGHE, TOMASO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Ali_Amjad.pdf (restricted access)	2.61 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108223