Data extraction applied to domain specific technical documentation.
GALEVSKA, STEFANIJA
2022/2023
Abstract
In response to transformative shifts in the power industry, Baker Hughes aims to predict and assess the lifespan of equipment parts. To this end, the company has invested in an automated process that tracks equipment and its parts, with the goal of updating clients proactively and avoiding unnecessary outages. This master's thesis is part of the company's larger initiative to digitize decades of Service Shop Reports. The ultimate goal is a comprehensive tool that tracks the historical flow of parts, supporting the calculation of their residual useful life and providing insights into maintenance predictions, substitutions, and future client offers. Three distinct data sources contribute to a diverse dataset. The dataset is cleaned and preprocessed, using Optical Character Recognition (OCR) so that aged and handwritten documents can be included. The research curates a reference dataset with high-quality labels for over 4,652 Service Shop Reports. The labeled data forms the basis for evaluating two primary methods: spaCy, a natural language processing library, and Long Short-Term Memory (LSTM) models. The spaCy method applies rule-based matching to identify part numbers, while the LSTM models, which vary in architecture, use deep learning for binary classification. The spaCy method struggles on precision and F1 score, whereas the LSTM models achieve competitive accuracy and a better balance between precision and recall. The evaluation, which takes into account the challenges of diverse datasets and the nuances of OCR, indicates that the LSTM models are better suited to the task of part number extraction.

File | Size | Format |
---|---|---|
Stefanija_thesis4 (1).pdf (restricted access) | 674.03 kB | Adobe PDF |
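
The abstract compares the extraction methods by precision, recall, and F1 score. As a rough illustration of how a rule-based extractor can be scored against labeled part numbers, the sketch below uses a plain regular expression in place of spaCy's matcher. The part-number pattern, the sample report text, the labels, and the function names are all hypothetical; the thesis does not state its actual rules or data.

```python
import re

# Hypothetical part-number shape: three digits, one capital letter, four digits.
# This pattern is an assumption made for illustration, not the thesis's rule set.
PART_NUMBER = re.compile(r"\b\d{3}[A-Z]\d{4}\b")


def extract_part_numbers(text):
    """Return the set of substrings that match the illustrative pattern."""
    return set(PART_NUMBER.findall(text))


def precision_recall_f1(predicted, expected):
    """Compute set-based precision, recall and F1 for one document."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented report text and labels; the last token is a deliberate false positive.
report = "Replaced bucket 314B7162 and nozzle 112A3345; order ref 999Z0001."
labels = {"314B7162", "112A3345"}

found = extract_part_numbers(report)
p, r, f1 = precision_recall_f1(found, labels)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this invented case the rule finds every labeled part number but also flags an order reference, so recall is perfect while precision drops. That is the kind of imbalance the abstract attributes to the rule-based approach.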
https://hdl.handle.net/20.500.12608/61383