Data extraction applied to domain-specific technical documentation.

GALEVSKA, STEFANIJA
2022/2023

Abstract

In response to major shifts in the power industry, Baker Hughes aims to predict and assess the lifespan of equipment parts. To this end, the company has invested in an automated process for tracking equipment and its parts, with the goal of updating clients proactively and avoiding unnecessary outages. This master's thesis is part of the company's larger initiative to digitize decades of Service Shop Reports. The ultimate goal is a tool that tracks the historical flow of parts, supporting the calculation of their residual useful life and informing maintenance predictions, substitutions, and future client offers. The dataset is drawn from three distinct data sources. It is cleaned and preprocessed, with Optical Character Recognition (OCR) used to include aged and handwritten documents. A reference dataset is curated with high-quality labels for over 4,652 Service Shop Reports. The labeled data is the basis for evaluating two methods: spaCy, a natural language processing library, and Long Short-Term Memory (LSTM) models. The spaCy method applies rule-based matching to identify part numbers, while the LSTM models, which vary in architecture, use deep learning for binary classification. spaCy struggles on precision and F1 score, whereas the LSTM models achieve competitive accuracy and a better balance between precision and recall. Taking into account the challenges of diverse data sources and OCR-related noise, the evaluation supports the suitability of the LSTM models for part number extraction.
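As a rough illustration of the rule-based approach and the metrics named in the abstract, the sketch below matches part numbers with a fixed pattern and scores the result with precision, recall, and F1. It uses Python's standard `re` module rather than spaCy, and the part-number format, the sample report line, and the function names are all assumptions made for this example; none of them comes from the thesis.

```python
import re

# Hypothetical part-number format (an assumption, not taken from the thesis):
# three uppercase letters followed by six digits, e.g. "ABC123456".
PART_NUMBER = re.compile(r"\b[A-Z]{3}\d{6}\b")


def extract_part_numbers(text: str) -> set[str]:
    """Return the set of tokens in `text` that match the assumed pattern."""
    return set(PART_NUMBER.findall(text))


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Compute precision, recall and F1 of predicted labels against gold labels."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up report line: one true part number, one look-alike serial number,
# and one part number whose digit "0" was read by OCR as the letter "O".
report = "Replaced nozzle ABC123456, serial XYZ000111, and bucket DEF45O789."
gold = {"ABC123456", "DEF450789"}  # hand-labeled true part numbers

predicted = extract_part_numbers(report)
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case the look-alike serial number is a false positive and the OCR-damaged part number is missed, so precision and recall both fall short of 1. A fixed rule cannot tell apart strings that share the same shape, nor recover from character-level OCR errors, which is the kind of weakness a learned classifier is meant to reduce.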
Keywords: data science, NLP, classification model
Files in this item:
Stefanija_thesis4 (1).pdf (restricted access, 674.03 kB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/61383