Extracting Data from Banking Documents using Computer Vision and Natural Language Processing Tools

VAROTTO, DAVIDE
2024/2025

Abstract

This master’s thesis addresses the extraction of text from banking documents using Computer Vision, followed by the classification of the dates found in the extracted text using Natural Language Processing (NLP) techniques. Accurately identifying and classifying dates in financial records, statements, and other banking documents is essential for tasks such as auditing, compliance, and data analysis. The research uses Optical Character Recognition (OCR) to extract text from scanned or digitally captured documents, addressing challenges that can compromise the final result: image quality, document layout, text orientation, and varying date formats. NLP models are then used to classify the extracted dates. The thesis investigates several NLP approaches, including pre-trained models, models trained from scratch, and deep learning models, and evaluates how accurately each classifies dates. It also examines the impact of different data preprocessing techniques and feature engineering methods on the classification results. The findings offer insights for developing efficient and reliable date extraction and classification systems in the banking domain, contributing to improved document processing and decision-making in the financial industry. The findings have also been used to build an automated tool for a multinational corporation. The tool extracts contract signing dates from a large collection of bank documents, reducing the need for human involvement in the process.
2024
Machine Learning
Computer Vision
Natural Language
Files in this item:
File: TesiMagistraleDavideVarotto.pdf
Access: open access
Size: 4.49 MB
Format: Adobe PDF

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84792