Key Information Extraction from Visually-Rich Documents
VIRGINIO, GIACOMO
2023/2024
Abstract
Key Information Extraction (KIE) from Visually-Rich Documents is a critical challenge due to the diverse formats, intricate layouts, and the integration of both textual and non-textual elements. This thesis explores recent open-source methods for extracting key information. The study presents the progression of Information Extraction (IE) approaches, from traditional rule-based systems to the latest deep learning models. GLiNER, LayoutLM, LiLT, and Donut are evaluated on a specific KIE task over two private datasets containing complex documents in the Italian language, and then compared to a third-party service (GPT-4o) and to human performance in order to gather insights for performing KIE in an industrial setting. The research compares the accuracy, inference time, and cost of these models. It reveals that while all models achieve a certain degree of accuracy on some fields, others are much more difficult to extract. Donut in particular excels in the task, showing robust performance in extracting all fields and beating the results obtained by humans. Besides being the most accurate model, it is also the cheapest, owing to its end-to-end architecture, which does not require running an OCR engine. Being a flexible, generative model, it also needs less data preprocessing; however, it requires extensive training to reach peak accuracy and cannot be run as a zero-shot or few-shot model. The research also lays the groundwork for further exploration into more sophisticated models and techniques that can handle even more complex document structures in the future.

File | Size | Format
---|---|---
Tesi Virginio.pdf (restricted access) | 2.16 MB | Adobe PDF
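The abstract notes that Donut is an end-to-end generative model: rather than pairing an OCR engine with a tagger, it decodes a token sequence in which each extracted field is wrapped in special tags (e.g. `<s_field>value</s_field>`, the convention used by the Donut authors). As a minimal illustrative sketch (the field names below are invented for the example, not taken from the thesis's datasets), such a sequence can be converted into a Python dictionary of key-value pairs:

```python
import re

def token_sequence_to_dict(seq: str) -> dict:
    """Convert a flat Donut-style tagged sequence into a dict.

    Handles non-nested fields only, e.g.
    '<s_total>42,00</s_total><s_date>01/02/2023</s_date>'
    becomes {'total': '42,00', 'date': '01/02/2023'}.
    """
    fields = {}
    # Match <s_name>...</s_name> pairs; \1 ensures the closing tag
    # carries the same field name as the opening tag.
    for m in re.finditer(r"<s_(\w+)>(.*?)</s_\1>", seq):
        fields[m.group(1)] = m.group(2).strip()
    return fields

# Hypothetical decoder output for an invoice-like document:
example = "<s_total>42,00</s_total><s_date>01/02/2023</s_date>"
print(token_sequence_to_dict(example))
```

In practice this post-processing is handled by the model's tokenizer utilities (and nested fields require a recursive variant), but the sketch shows why no separate OCR step is needed: the structured output is read directly off the generated sequence.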
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/71038