Multimodal Analysis of Video Content through Speech Recognition, Language Models, and Computer Vision

ILIC, NEMANJA
2024/2025

Abstract

This thesis describes the development of an application that analyzes video data by combining speech recognition, large language models, and computer vision. The intended use case is the analysis of Italian knowledge transfer meetings, where multiple speakers interact using technical language. The system was designed to make these recordings easier to use, both by generating structured reports and by creating resources that can later be explored through a chatbot. The pipeline starts from raw video, extracting the audio and processing it with WhisperX to obtain both a transcription and speaker diarization. These outputs are refined with carefully designed GPT prompts, which improve transcription quality and correct speaker information. A custom model, inspired by transformer architectures such as BERT, then identifies sentences that refer to visual content. When such references are detected, frames are extracted from the video and analyzed with a ResNet model to remove redundancies and retain only unique images. Finally, GPT is applied again to link the visual information back to the transcript. To evaluate the system, datasets of transcriptions and images were annotated, enabling comparisons between base outputs, GPT-enhanced results, and manually corrected ground truth. The integrated application shows how speech processing, text refinement, and image analysis can be combined into a single pipeline, highlighting both the potential of this approach and areas where further improvement is possible.
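
The first stage of the pipeline, transcription and speaker diarization with WhisperX, can be illustrated with a minimal sketch. This is not the thesis's actual code: the model size, batch size, and token handling are illustrative placeholders, and the DiarizationPipeline entry point varies slightly across whisperx versions.

    # Minimal sketch of the WhisperX stage (illustrative settings, not the thesis's code).
    # Assumes the whisperx package, ffmpeg on PATH, and a HuggingFace token ("HF_TOKEN")
    # with access to the pyannote diarization models.
    import whisperx

    device = "cuda"  # or "cpu"
    audio = whisperx.load_audio("meeting.mp4")  # ffmpeg extracts the audio track from the video

    # 1. Transcribe the Italian speech in batches.
    model = whisperx.load_model("large-v2", device, compute_type="float16", language="it")
    result = model.transcribe(audio, batch_size=16)

    # 2. Force-align words so speaker labels can later be attached precisely.
    align_model, metadata = whisperx.load_align_model(language_code="it", device=device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 3. Diarize and merge speaker labels into the transcript.
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
    result = whisperx.assign_word_speakers(diarize_model(audio), result)

    for segment in result["segments"]:
        print(segment.get("speaker", "UNKNOWN"), segment["text"])

WhisperX couples a batched Whisper backend with forced phoneme alignment, which is what makes accurate per-word speaker assignment possible in the first place.
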
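The GPT refinement step can be sketched with the OpenAI Python client. The prompt wording, model name, and per-segment granularity below are assumptions for illustration; the abstract mentions carefully designed prompts but does not reproduce them.

    # Sketch of GPT-based transcript refinement (illustrative prompt and model name).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def refine_segment(text: str, speaker: str) -> str:
        """Ask the model to fix ASR errors while preserving technical terms."""
        response = client.chat.completions.create(
            model="gpt-4o",  # assumption; the abstract does not name a specific model
            messages=[
                {"role": "system", "content": (
                    "You correct Italian meeting transcripts. Fix recognition errors, "
                    "punctuation, and casing. Keep technical jargon unchanged. "
                    "Do not add or remove content.")},
                {"role": "user", "content": f"[{speaker}] {text}"},
            ],
            temperature=0,
        )
        return response.choices[0].message.content

Running each diarized segment through such a function yields the GPT-enhanced transcript that the evaluation compares against the base output and the manually corrected ground truth.
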
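The custom visual-reference detector is described only as BERT-inspired, so the sketch below shows the generic shape of such a sentence classifier using HuggingFace transformers. The checkpoint path is hypothetical, and treating label 1 as "refers to visual content" is an assumed convention; an Italian base model such as dbmdz/bert-base-italian-cased would be one plausible starting point for fine-tuning.

    # Sketch: flag sentences that refer to visual content with a BERT-style classifier.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    CHECKPOINT = "path/to/visual-ref-classifier"  # hypothetical fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
    model.eval()

    def refers_to_visual_content(sentence: str) -> bool:
        """True when the sentence likely points at something shown on screen."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return logits.argmax(dim=-1).item() == 1  # assumed: label 1 = visual reference

    print(refers_to_visual_content("Come vedete in questo diagramma, il flusso parte qui."))

Sentences flagged by this classifier determine the timestamps at which frames are pulled from the video in the next stage.
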
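Frame deduplication with ResNet features can likewise be sketched as embedding-plus-threshold filtering. The choice of ResNet-50 and the 0.9 cosine-similarity threshold are assumptions, since the abstract names only "a ResNet model".

    # Sketch: keep only visually unique frames near each detected visual reference.
    import cv2
    import torch
    from PIL import Image
    from torchvision.models import resnet50, ResNet50_Weights

    weights = ResNet50_Weights.DEFAULT
    model = resnet50(weights=weights)
    model.fc = torch.nn.Identity()  # use the 2048-d pooled features as an embedding
    model.eval()
    preprocess = weights.transforms()

    def embed(frame_bgr) -> torch.Tensor:
        image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            return model(preprocess(image).unsqueeze(0)).squeeze(0)

    def unique_frames(video_path: str, timestamps: list[float], threshold: float = 0.9):
        """Grab a frame at each timestamp; drop frames too similar to ones already kept."""
        cap = cv2.VideoCapture(video_path)
        kept, embeddings = [], []
        for t in timestamps:
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
            ok, frame = cap.read()
            if not ok:
                continue
            e = embed(frame)
            if all(torch.nn.functional.cosine_similarity(e, prev, dim=0) < threshold
                   for prev in embeddings):
                kept.append((t, frame))
                embeddings.append(e)
        cap.release()
        return kept

The surviving frames are what the final GPT pass links back to the transcript when assembling the structured report.
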
Keywords: Video analysis, Multimodal process, Generative AI

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102114