Table Detection and Table Structure Recognition from PDF documents
XXX, MOHAMMAD HUZAIFA FAZAL
2022/2023
Abstract
This thesis develops an automated table extraction tool for internal auditors at the United Nations World Food Programme (WFP), who need to extract and verify data from unstructured documents containing tables, such as PDFs and images. The pipeline has four stages: document classification, table detection, table structure recognition, and table reconstruction in a digital format. The document classification stage uses NLP to filter out irrelevant documents. The table extraction stage (detection and structure recognition) uses two state-of-the-art models, Microsoft's Table Transformer (TATR) and TableNet, which produce bounding boxes and masks, respectively, for tables and their structures. TATR, pre-trained on the PubTables-1M and FinTabNet.c datasets, excels at table detection but struggles to localize tables in scanned FLA and Amendment documents. Fine-tuning TATR for structure recognition focuses on five key object classes. TableNet was tested with several encoder architectures and performs best with DenseNet-121, reaching an F1-score of 83.6% on table detection. Problems in its output masks motivate a discussion of post-processing. The table reconstruction stage combines the output of the table extraction stage with an OCR tool to generate a CSV representation of each table. The thesis evaluates the models on common metrics such as AP and F1-score, and discusses the remaining challenges and directions for improvement. It also proposes integrating the table extraction tool into a comprehensive database, in line with the vision of continuous auditing. Beyond addressing a specific need of auditors at the WFP, the work points to broader applications in document processing and audit methodology.

File | Size | Format
---|---|---
Thesis_Mohammad_Huzaifa_Fazal_2041507_PDF-A.pdf (restricted access) | 5.04 MB | Adobe PDF
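The table reconstruction stage mentioned in the abstract (structure output plus OCR text turned into CSV) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the structure recognition step yields row and column bounding boxes as `(x0, y0, x1, y1)` tuples and that an OCR tool yields `(text, box)` pairs. The function names `rebuild_table` and `to_csv` and the sample coordinates are hypothetical. Each word is assigned to the cell whose row and column contain the word's center point.

```python
import csv
import io


def center(box):
    """Center point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def index_of(value, spans):
    """Index of the first (lo, hi) span containing value, or None."""
    for i, (lo, hi) in enumerate(spans):
        if lo <= value <= hi:
            return i
    return None


def rebuild_table(row_boxes, col_boxes, words):
    """Place OCR words into a row/column grid (hypothetical helper).

    row_boxes, col_boxes: lists of (x0, y0, x1, y1) boxes.
    words: list of (text, box) pairs from an OCR tool.
    """
    rows = sorted(row_boxes, key=lambda b: b[1])  # top to bottom
    cols = sorted(col_boxes, key=lambda b: b[0])  # left to right
    row_spans = [(b[1], b[3]) for b in rows]
    col_spans = [(b[0], b[2]) for b in cols]
    grid = [[[] for _ in cols] for _ in rows]

    # Simplified reading order: by vertical center, then horizontal center.
    ordered = sorted(words, key=lambda w: (center(w[1])[1], center(w[1])[0]))
    for text, box in ordered:
        cx, cy = center(box)
        r = index_of(cy, row_spans)
        c = index_of(cx, col_spans)
        if r is not None and c is not None:  # drop words outside the grid
            grid[r][c].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]


def to_csv(table):
    """Serialize a list of rows to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


# Made-up sample data: a 2x2 table with one two-word header cell.
sample_rows = [(0, 0, 200, 20), (0, 20, 200, 40)]
sample_cols = [(0, 0, 100, 40), (100, 0, 200, 40)]
sample_words = [
    ("Item", (5, 5, 40, 15)),
    ("Unit", (105, 5, 130, 15)),
    ("price", (135, 5, 170, 15)),
    ("Rice", (5, 25, 40, 35)),
    ("12.50", (105, 25, 150, 35)),
]
print(to_csv(rebuild_table(sample_rows, sample_cols, sample_words)))
```

Center-point assignment is only one possible design choice. Spanning cells, skewed scans, and multi-line cell text would need additional handling, such as overlap-based matching.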
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/61400