Data extraction applied to domain specific technical documentation.

GALEVSKA, STEFANIJA

2022/2023

Abstract

In response to transformative shifts in the power industry, Baker Hughes aims to predict and assess the lifespan of equipment parts. To this end, the company has invested in an automated process that tracks equipment and its parts, with the goal of updating clients proactively and avoiding unnecessary outages. This master's thesis is part of the company's larger initiative to digitize decades of Service Shop Reports. The ultimate goal is a comprehensive tool that tracks the historical flow of parts, supporting the calculation of their residual useful life and informing maintenance predictions, substitutions, and future client offers. Three distinct data sources contribute to a diverse dataset, which is cleaned and preprocessed with Optical Character Recognition (OCR) so that aged and handwritten documents can be included. The research curates a reference dataset with high-quality labels for over 4,652 Service Shop Reports. The labeled data forms the basis for evaluating two primary methods: spaCy, a natural language processing library, and Long Short-Term Memory (LSTM) models. The spaCy method applies rule-based matching to identify part numbers, while the LSTM models, which vary in architecture, use deep learning for binary classification. The spaCy approach struggles on precision and F1 score, whereas the LSTM models achieve competitive accuracy and a better balance between precision and recall. Taking into account the challenges of diverse datasets and the peculiarities of OCR output, the evaluation indicates that the LSTM model is the more suitable of the two for part number extraction.
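The comparison above rests on precision, recall, and F1 for a binary "is this token a part number?" decision. The following is a minimal sketch of how such scores could be computed for a rule-based matcher. It uses Python's standard `re` module as a stand-in for spaCy's matcher, and the part-number pattern, the tokens, and the gold labels are all invented for illustration; none of them come from the thesis.

```python
import re

# Hypothetical part-number shape (an assumption, not taken from the thesis):
# three uppercase letters followed by five digits, e.g. "ABC12345".
PART_NUMBER = re.compile(r"^[A-Z]{3}\d{5}$")


def rule_based_labels(tokens):
    """Label each token 1 if it matches the hypothetical pattern, else 0."""
    return [1 if PART_NUMBER.match(tok) else 0 for tok in tokens]


def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented tokens and gold labels, for illustration only.
tokens = ["Replaced", "nozzle", "ABC12345", "ref", "XYZ00001", "DWG99999", "abc12345"]
gold = [0, 0, 1, 0, 1, 0, 1]

predicted = rule_based_labels(tokens)
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy input the rigid rule flags a drawing reference that merely looks like a part number (a false positive) and misses a lowercase variant such as OCR might produce (a false negative), which is the kind of trade-off that precision and recall make visible.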

Faculty/Department	Dipartimento di Matematica "Tullio Levi-Civita" - DM
Degree programme	DATA SCIENCE, Master's degree (Laurea Magistrale, D.M. 270/2004)
Academic year	2022
English title	Data extraction applied to domain specific technical documentation.
Keywords	data science; NLP; classification model
Supervisor	ERSEGHE, TOMASO
Appears in collections	Master's theses (Lauree magistrali)

Files in this item:

File	Size	Format
Stefanija_thesis4 (1).pdf (restricted access)	674.03 kB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/61383