Implementing a Named Entity Recognition pipeline for Public Procurement contracts in the Municipality of Padova

CHACON MEJIA, JOSÉ JAVIER
2023/2024

Abstract

Public procurement contracts are essential documents in public administration, requiring accurate and efficient processing to ensure transparency and compliance in governmental administrative processes. This thesis presents a Named Entity Recognition (NER) pipeline designed to automate the extraction of critical information from these contracts within the context of the Municipality of Padova. By integrating recent Natural Language Processing (NLP) techniques, such as the GLiNER architecture, with rule-based methods, the proposed solution demonstrates promising results in handling such documents. The pipeline leverages fine-tuned GLiNER models to accurately extract stakeholder entities, such as contractor details, signing authorities, and authorized officers. Additionally, a sequential entity extraction strategy is proposed for GLiNER models, prioritizing the extraction of contractor names before other entities in cases involving multiple contractors. Experimental results highlight the effectiveness of GLiNER models, particularly the proposed sequential entity extraction strategy, compared to a baseline approach using the spaCy library, especially in domain-specific contexts. The study also addresses the challenges posed by the variability in contract structures and introduces text segmentation and normalization techniques to overcome them. The insights from this work pave the way for expanding the pipeline to handle additional document types and enhancing digital governance through the generation of metadata and the integration of NLP applications like chatbots for public administration procedures.
2023
Data Science
NER
NLP
Deep Learning
Transformers
Files in this item:
File: Chacon_Thesis.pdf (open access)
Size: 4.53 MB
Format: Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/80885