Extracting Data from Banking Documents using Computer Vision and Natural Language Processing Tools
VAROTTO, DAVIDE
2024/2025
Abstract
This master’s thesis addresses the extraction of text from banking documents using Computer Vision, followed by the classification of dates in the extracted text using Natural Language Processing (NLP) techniques. Accurately identifying and classifying dates in financial records, statements, and other banking documents is essential for tasks such as auditing, compliance, and data analysis. The research uses Optical Character Recognition (OCR) to extract data from scanned or digitally captured documents, addressing challenges related to image quality, document layout, text orientation, and varying date formats, any of which may compromise the final result; NLP models are then used to classify the extracted dates. The thesis investigates several NLP approaches, including pre-trained models, models trained from scratch, and deep learning models, and evaluates how accurately each classifies dates. The study also examines the impact of different data preprocessing techniques and feature engineering methods on the date classification results. The outcomes offer insights for developing efficient and reliable systems for date extraction and classification in the banking domain, contributing to improved document processing and decision-making in the financial industry. The findings have also been used to build an automated tool for a multinational corporation. The tool extracts contract signing dates from a large collection of bank documents, reducing the need for human involvement in the process.

File | Access | Size | Format |
---|---|---|---|
TesiMagistraleDavideVarotto.pdf | Open access | 4.49 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/84792