Multimodal Analysis of Video Content through Speech Recognition, Language Models, and Computer Vision
ILIC, NEMANJA
2024/2025
Abstract
This thesis describes the development of an application that analyzes video data by combining speech recognition, large language models, and computer vision. The intended use case is the analysis of Italian knowledge-transfer meetings, where multiple speakers interact using technical language. The system was designed to make these recordings easier to use, both by generating structured reports and by creating resources that can later be explored through a chatbot. The system starts from raw video, extracting the audio and processing it with WhisperX to obtain both a transcription and speaker diarization. These outputs are refined with carefully designed GPT prompts, which improve transcription quality and correct speaker information. A custom model inspired by transformer architectures such as BERT was then developed to identify sentences that refer to visual content. When such references are detected, frames are extracted from the video and analyzed with a ResNet model to remove redundancies and retain only unique images. Finally, GPT is applied again to link the visual information back to the transcript. To evaluate the system, datasets of transcriptions and images were annotated, enabling comparisons between base outputs, GPT-enhanced results, and manually corrected ground truth. The integrated application shows how speech processing, text refinement, and image analysis can be combined into a single pipeline, highlighting both the potential of this approach and areas where further improvement is possible.
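The abstract outlines the pipeline but reproduces no implementation, so the two sketches below only illustrate what its first stage (WhisperX transcription with diarization) and its frame-deduplication stage could look like. They are minimal sketches under stated assumptions: the open-source `whisperx` and `torchvision` packages, a generic pretrained ResNet-50 used as a feature extractor with a cosine-similarity threshold, and placeholder names such as `meeting.wav` and `HF_TOKEN`. None of these details are taken from the thesis itself, and the exact models, thresholds, and prompts used there may differ.

```python
# Sketch of the first pipeline stage: transcription + speaker diarization with WhisperX.
# Assumes the audio track has already been extracted from the video (e.g. with ffmpeg)
# and that the `whisperx` package is installed; exact API details vary by version.
import whisperx

device = "cuda"             # or "cpu"
audio_path = "meeting.wav"  # hypothetical audio track extracted from the meeting video

# 1. Transcribe the Italian audio
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_path)
result = model.transcribe(audio, batch_size=16, language="it")

# 2. Align words to timestamps
align_model, metadata = whisperx.load_align_model(language_code="it", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Assign speaker labels via diarization
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Each segment now carries text, timestamps, and a speaker label,
# ready for the GPT-based refinement step described in the abstract.
for seg in result["segments"]:
    print(seg.get("speaker"), seg["text"])
```

Later in the pipeline, extracted frames are filtered with a ResNet model so that only visually distinct images are kept. The abstract does not specify the ResNet variant or the similarity criterion, so the following sketch assumes a torchvision ResNet-50 whose classification head is removed and a cosine-similarity threshold; both choices are illustrative.

```python
# Sketch of ResNet-based frame deduplication: keep a frame only if its feature
# vector is not too similar to that of any frame already kept.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()   # drop the classifier head, keep 2048-d features
resnet.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized feature vector for one frame."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return torch.nn.functional.normalize(resnet(img), dim=1)

def deduplicate(frame_paths: list[str], threshold: float = 0.95) -> list[str]:
    """Filter a chronologically ordered list of frame paths down to unique images."""
    kept_paths, kept_feats = [], []
    for path in frame_paths:
        feat = embed(path)
        # Cosine similarity reduces to a dot product because the features are normalized.
        if all(float(feat @ prev.T) < threshold for prev in kept_feats):
            kept_paths.append(path)
            kept_feats.append(feat)
    return kept_paths
```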
| File | Size | Format |
|---|---|---|
| Ilic_Nemanja.pdf (restricted access) | 3.25 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/102114