Implementing a Named Entity Recognition pipeline for Public Procurement contracts in the Municipality of Padova


CHACON MEJIA, JOSÉ JAVIER

2023/2024

Abstract

Public procurement contracts are essential documents in public administration, and they must be processed accurately and efficiently to ensure transparency and compliance in administrative processes. This thesis presents a Named Entity Recognition (NER) pipeline that automates the extraction of critical information from these contracts for the Municipality of Padova. The pipeline combines recent Natural Language Processing (NLP) techniques, such as the GLiNER architecture, with rule-based methods, and shows promising results on such documents.

The pipeline uses fine-tuned GLiNER models to accurately extract stakeholder entities, such as contractor details, signing authorities, and authorized officers. The thesis also proposes a sequential entity extraction strategy for GLiNER models: in contracts involving multiple contractors, contractor names are extracted first and the remaining entities afterwards.

Experimental results show that GLiNER models, particularly with the proposed sequential entity extraction strategy, compare favorably with a baseline built on the spaCy library, especially in domain-specific contexts. The study also addresses the variability in contract structure by introducing text segmentation and normalization techniques. The insights from this work pave the way for extending the pipeline to additional document types and for enhancing digital governance through metadata generation and the integration of NLP applications, such as chatbots, into public administration procedures.

Record

Faculty/Department	Dipartimento di Matematica "Tullio Levi-Civita" - DM
Degree programme	Data Science, Master's degree (Laurea Magistrale, D.M. 270/2004)
Academic year	2023
English title	Implementing a Named Entity Recognition pipeline for Public Procurement contracts in the Municipality of Padova
Keywords	Data Science; NER; NLP; Deep Learning; Transformers
Supervisor	SPERDUTI, ALESSANDRO
Appears in collections	Master's theses (Lauree magistrali)

Files in this item:

File	Size	Format
Chacon_Thesis.pdf (open access)	4.53 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/80885