Implementing a Named Entity Recognition pipeline for Public Procurement contracts in the Municipality of Padova
CHACON MEJIA, JOSÉ JAVIER
2023/2024
Abstract
Public procurement contracts are essential documents in public administration, and processing them accurately and efficiently is necessary for transparency and compliance in governmental administrative processes. This thesis presents a Named Entity Recognition (NER) pipeline that automates the extraction of critical information from these contracts for the Municipality of Padova. The pipeline combines recent Natural Language Processing (NLP) techniques, such as the GLiNER architecture, with rule-based methods, and shows promising results on such documents. Fine-tuned GLiNER models extract stakeholder entities such as contractor details, signing authorities, and authorized officers. A sequential entity extraction strategy is also proposed for GLiNER models: in contracts involving multiple contractors, contractor names are extracted first and the remaining entities afterwards. Experimental results highlight the effectiveness of GLiNER models, and of the sequential strategy in particular, relative to a baseline built on the spaCy library, especially in domain-specific contexts. The study also addresses the variability in contract structures by introducing text segmentation and normalization techniques. These insights pave the way for extending the pipeline to additional document types and for enhancing digital governance through metadata generation and the integration of NLP applications, such as chatbots, for public administration procedures.

File | Size | Format |
---|---|---|
Chacon_Thesis.pdf (open access) | 4.53 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/80885