Retrieval-Augmented Generation for Legal Document Analysis: A Case Study on Statuti delle Imprese.

ALI, AMJAD
2025/2026

Abstract

The digitalization of corporate governance increasingly requires automated tools that extract critical legal parameters from corporate bylaws (Statuti delle Imprese). However, the semantic density of legal texts, combined with strict data sovereignty mandates and GDPR compliance, poses significant challenges for both traditional algorithmic extraction and cloud-based Large Language Models (LLMs). This thesis addresses these limitations by presenting an on-premises Retrieval-Augmented Generation (RAG) prototype optimized for standard enterprise infrastructure (CPU-only execution, 16 GB RAM). The system has two modes of operation: an automated batch-processing pipeline that extracts entities as JSON, and a conversational interface for querying documents locally. For document ingestion, we adopted a hierarchical, structure-aware chunking strategy (hereafter Custom Structural Chunking). It uses regular expressions to treat each legal article as a discrete semantic unit and falls back to recursive fixed-size segmentation for unformatted text, so that context is preserved in both cases. The architecture was assessed through an ablation study comparing the proposed chunking strategy with standard fixed-size and embedding-based segmentation, complemented by a comparative evaluation of selected open-source Small Language Models (SLMs) as the generative engine of the RAG pipeline. In our experiments, the best-performing setup combined Gemma 3 (4B), used for both generation and extraction, with Custom Structural Chunking. On the evaluation set of previously unseen statutes, this configuration achieved a Context Recall of 95.4% and a Context Precision of 87.8%, which corresponded to a natural-language QA accuracy of 91.7% and an end-to-end structured JSON extraction accuracy of 87.5%.
Although the approach remains sensitive to non-standard document layouts, the results show empirically that optimized SLMs can execute complex, privacy-compliant RAG tasks on standard enterprise hardware. This work therefore provides a functional, cost-effective framework for integrating generative language models into sensitive enterprise workflows while keeping proprietary data fully under the organization's control.
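The chunking idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the heading pattern `ARTICLE_HEADING`, the function names `split_fixed` and `structural_chunks`, the 800-character limit, and the choice to re-split oversized articles are all assumptions made here for illustration.

```python
import re

# Hypothetical heading pattern (an assumption, not taken from the thesis):
# matches lines that start with "Art. 1", "ART. 3 -", "Articolo 12", and so on.
ARTICLE_HEADING = re.compile(
    r"^[ \t]*(?:art\.?|articolo)\s*\d+\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_fixed(text, max_chars=800, separators=("\n\n", "\n", ". ", " ")):
    """Recursive fixed-size fallback: try coarse separators first, then finer ones."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for sep in separators:
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) < 2:
            continue  # this separator does not occur; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing parts into the current chunk
            else:
                if current:
                    chunks.extend(split_fixed(current, max_chars, separators))
                current = part
        if current:
            chunks.extend(split_fixed(current, max_chars, separators))
        return chunks
    # No separator left: cut at hard character offsets.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def structural_chunks(text, max_chars=800):
    """Split on article headings; use the fixed-size fallback where none are found."""
    matches = list(ARTICLE_HEADING.finditer(text))
    if not matches:
        return [{"article": None, "text": c} for c in split_fixed(text, max_chars)]

    # Text before the first heading (title, preamble) has no article label.
    chunks = [
        {"article": None, "text": c}
        for c in split_fixed(text[:matches[0].start()], max_chars)
    ]
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        heading = match.group(0).strip()
        # Assumption: an article longer than max_chars is split further,
        # and every piece keeps the heading as metadata.
        for piece in split_fixed(text[match.start():end], max_chars):
            chunks.append({"article": heading, "text": piece})
    return chunks
```

Keeping the heading as metadata on every piece is one way to preserve the article context when a long article has to be divided. A pattern this simple would also fire on a cross-reference that happens to begin a line (for example "Art. 5 del codice civile"), which is one plausible source of the layout sensitivity the abstract mentions.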
2025
RAG
Legal NLP
Corporate Statuti
Open-Source AI
Files in this item:
File: Ali_Amjad.pdf
Access: Restricted
Size: 2.61 MB
Format: Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108223