Table Detection and Table Structure Recognition from PDF documents

This thesis develops an automated table extraction tool for internal auditors at the United Nations World Food Programme (WFP), who need to extract and verify data from unstructured documents, such as PDFs and images, that contain tables. The pipeline has four stages: document classification, table detection, table structure recognition, and table reconstruction in a digital format. The document classification stage uses NLP to filter out irrelevant documents. The table extraction stage (detection and structure recognition) uses two state-of-the-art models, Microsoft’s Table Transformer (TATR) and TableNet, which produce bounding boxes and masks, respectively, for tables and their structures. TATR, pre-trained on the PubTables-1M and FinTabNet.c datasets, performs well at table detection but has difficulty localizing tables in scanned FLA and Amendment documents. Fine-tuning refines its structure recognition, which covers five object classes. TableNet, tested with several encoder architectures, performs best with DenseNet-121, reaching an F1-score of 83.6% on table detection. Shortcomings in its output masks call for post-processing. The table reconstruction stage combines the output of the table extraction stage with an OCR tool to generate a CSV representation of the tables. The models are evaluated on common metrics such as AP and F1-score, and the thesis discusses the remaining challenges and directions for improvement. It also proposes integrating the table extraction tool into a comprehensive database, in line with the vision of continuous auditing. Beyond solving a specific problem faced by auditors at UN WFP, the work has implications for the broader field of document processing and audit methodologies.
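The table reconstruction stage mentioned in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that structure recognition yields row and column bounding boxes as (x0, y0, x1, y1) tuples and that an OCR tool yields words with boxes in the same coordinate system. The function names and data layout are illustrative assumptions, not the thesis implementation.

```python
import csv
import io

# Hypothetical box layout: (x0, y0, x1, y1) in page pixel coordinates.
Box = tuple[float, float, float, float]


def center(box: Box) -> tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def find_index(value: float, spans: list[tuple[float, float]]):
    """Return the index of the first (lo, hi) span containing value, else None."""
    for i, (lo, hi) in enumerate(spans):
        if lo <= value <= hi:
            return i
    return None


def reconstruct_table(rows: list[Box], cols: list[Box],
                      words: list[tuple[str, Box]]) -> list[list[str]]:
    """Assign each OCR word to the cell whose row and column contain its center."""
    rows = sorted(rows, key=lambda b: b[1])  # top to bottom
    cols = sorted(cols, key=lambda b: b[0])  # left to right
    row_spans = [(b[1], b[3]) for b in rows]
    col_spans = [(b[0], b[2]) for b in cols]
    cells = [[[] for _ in cols] for _ in rows]
    for text, box in words:
        cx, cy = center(box)
        r = find_index(cy, row_spans)
        c = find_index(cx, col_spans)
        if r is not None and c is not None:  # words outside the grid are skipped
            cells[r][c].append((cx, text))
    # Assumption: each cell holds a single text line, so ordering by x suffices.
    return [[" ".join(t for _, t in sorted(cell)) for cell in row] for row in cells]


def to_csv(grid: list[list[str]]) -> str:
    """Serialize the reconstructed grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()
```

With two rows, two columns, and a handful of word boxes, `to_csv(reconstruct_table(rows, cols, words))` returns one CSV line per detected row. Real model output would also need handling for spanning cells and multi-line cell text, which this sketch leaves out.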

XXX, MOHAMMAD HUZAIFA FAZAL

2022/2023

	Faculty/Department
	
				Department of Mathematics "Tullio Levi-Civita" (DM)
			
	Degree programme
	
				Data Science, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2022
			
	English title
	
				Table Detection and Table Structure Recognition from PDF documents
			
	Keywords
	
				machine learning
				data science
				deep learning
				table detection
				table extraction
			
	Supervisor
	
				ERSEGHE, TOMASO
			
	Appears in the following categories:
	
				Master's theses

Files in this item:

File	Size	Format
Thesis_Mohammad_Huzaifa_Fazal_2041507_PDF-A.pdf (restricted access)	5.04 MB	Adobe PDF

The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/61400