Multimodal Analysis of Video Content through Speech Recognition, Language Models, and Computer Vision
ILIC, NEMANJA
2024/2025
Abstract
This thesis describes the development of an application that analyzes video data by combining speech recognition, large language models, and computer vision. The intended use case is the analysis of Italian knowledge-transfer meetings, where multiple speakers interact using technical language. The system was designed to make these recordings easier to use, both by generating structured reports and by creating resources that can later be explored through a chatbot. The system starts from raw video, extracting the audio and processing it with WhisperX to obtain both a transcription and speaker diarization. These outputs are refined with carefully designed GPT prompts, which improve transcription quality and correct speaker information. A custom model inspired by transformer architectures such as BERT was then developed to identify sentences that refer to visual content. When such references are detected, frames are extracted from the video and analyzed with a ResNet model to remove redundancies and retain only unique images. Finally, GPT is applied again to link the visual information back to the transcript. To evaluate the system, datasets of transcriptions and images were annotated, enabling comparisons between base outputs, GPT-enhanced results, and manually corrected ground truth. The integrated application shows how speech processing, text refinement, and image analysis can be combined into a single pipeline, highlighting both the potential of this approach and areas where further improvement is possible.
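The abstract outlines the pipeline but reproduces no implementation, so the two sketches below only illustrate what its first stage (WhisperX transcription with diarization) and its frame-deduplication stage could look like. They are minimal sketches under stated assumptions: the open-source `whisperx` and `torchvision` packages, a generic pretrained ResNet-50 used as a feature extractor with a cosine-similarity threshold, and placeholder names such as `meeting.wav` and `HF_TOKEN`. None of these details are taken from the thesis itself, and the exact models, thresholds, and prompts used there may differ.

```python
# Sketch of the first pipeline stage: transcription + speaker diarization with WhisperX.
# Assumes the audio track has already been extracted from the video (e.g. with ffmpeg)
# and that the `whisperx` package is installed; exact API details vary by version.
import whisperx

device = "cuda"             # or "cpu"
audio_path = "meeting.wav"  # hypothetical audio track extracted from the meeting video

# 1. Transcribe the Italian audio
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_path)
result = model.transcribe(audio, batch_size=16, language="it")

# 2. Align words to timestamps
align_model, metadata = whisperx.load_align_model(language_code="it", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Assign speaker labels via diarization
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Each segment now carries text, timestamps, and a speaker label,
# ready for the GPT-based refinement step described in the abstract.
for seg in result["segments"]:
    print(seg.get("speaker"), seg["text"])
```

Later in the pipeline, extracted frames are filtered with a ResNet model so that only visually distinct images are kept. The abstract does not specify the ResNet variant or the similarity criterion, so the following sketch assumes a torchvision ResNet-50 whose classification head is removed and a cosine-similarity threshold; both choices are illustrative.

```python
# Sketch of ResNet-based frame deduplication: keep a frame only if its feature
# vector is not too similar to that of any frame already kept.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()   # drop the classifier head, keep 2048-d features
resnet.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized feature vector for one frame."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return torch.nn.functional.normalize(resnet(img), dim=1)

def deduplicate(frame_paths: list[str], threshold: float = 0.95) -> list[str]:
    """Filter a chronologically ordered list of frame paths down to unique images."""
    kept_paths, kept_feats = [], []
    for path in frame_paths:
        feat = embed(path)
        # Cosine similarity reduces to a dot product because the features are normalized.
        if all(float(feat @ prev.T) < threshold for prev in kept_feats):
            kept_paths.append(path)
            kept_feats.append(feat)
    return kept_paths
```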
| File | Size | Format |
|---|---|---|
| Ilic_Nemanja.pdf (restricted access) | 3.25 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/102114