Extracting Data from Banking Documents using Computer Vision and Natural Language Processing Tools

This master’s thesis addresses the extraction of text from banking documents using computer vision, followed by the classification of dates in the extracted text using Natural Language Processing (NLP) techniques. Accurately identifying and classifying dates in financial records, statements, and other banking documents is essential for tasks such as auditing, compliance, and data analysis. The research uses Optical Character Recognition (OCR) to extract text from scanned or digitally captured documents and addresses challenges that can compromise the final result, including image quality, document layout, text orientation, and varying date formats. NLP models are then used to classify the extracted dates. The thesis compares several NLP approaches, including pre-trained models, models trained from scratch, and deep learning models, and evaluates how accurately each classifies dates. It also examines the impact of different data preprocessing and feature engineering techniques on the classification results. The outcomes offer guidance for building efficient and reliable date extraction and classification systems in the banking domain, supporting improved document processing and decision-making in the financial industry. The findings were also used to build an automated tool for a multinational corporation: the tool extracts contract signing dates from a large collection of bank documents, reducing the need for manual work.
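
As a rough illustration of the date-extraction step described above, the following Python sketch finds date candidates in already-OCRed text, normalizes two numeric formats, and attaches a context window to each candidate. It is not the thesis's implementation: the patterns, the cue words, and the names `find_date_candidates` and `label_candidate` are hypothetical, day-first ordering is an assumption, and the keyword rule only stands in for the NLP classifiers the thesis evaluates.

```python
import re
from datetime import date

# Hypothetical patterns for two numeric layouts. Day-first order is an
# assumption; month names, two-digit years, and other locales are not covered.
DATE_PATTERNS = [
    (re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b"), "dmy"),  # 03/11/2021
    (re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})\b"), "ymd"),          # 2021-11-03
]

# Hypothetical cue words; a stand-in for a trained classifier.
SIGNING_CUES = ("signed", "signature", "executed")


def find_date_candidates(text, window=30):
    """Return valid dates found in OCR text, each with surrounding context."""
    candidates = []
    for pattern, order in DATE_PATTERNS:
        for match in pattern.finditer(text):
            parts = dict(zip(order, map(int, match.groups())))
            try:
                parsed = date(parts["y"], parts["m"], parts["d"])
            except ValueError:
                continue  # impossible date, e.g. an OCR error or a reference number
            start, end = match.span()
            candidates.append({
                "date": parsed,
                "start": start,
                "context": text[max(0, start - window):end + window],
            })
    return sorted(candidates, key=lambda c: c["start"])


def label_candidate(context):
    """Toy rule: flag a date as a signing date if a cue word is nearby."""
    lowered = context.lower()
    return "signing" if any(cue in lowered for cue in SIGNING_CUES) else "other"


sample = "Contract signed in Padova on 03/11/2021. Statement period starts 2021-10-01."
for cand in find_date_candidates(sample):
    print(cand["date"].isoformat(), label_candidate(cand["context"]))
```

Keeping a context window with each candidate is the design choice that matters here: a learned classifier would replace `label_candidate` and take the same context string as input.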

VAROTTO, DAVIDE

2024/2025

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				DATA SCIENCE, Master's degree (D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Extracting Data from Banking Documents using Computer Vision and Natural Language Processing Tools
			
	Keywords
	
				Machine Learning
Computer Vision
Natural Language
			
	Supervisor
	
				TESTOLIN, ALBERTO
			
	Appears in collections:
	
				Master's degree theses

Files in this item:

File	Size	Format
TesiMagistraleDavideVarotto.pdf (open access)	4.49 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84792