
Multi-Agent Human-AI Framework for Transcriptions

AKBULUT, ULASCAN
2025/2026

Abstract

This work introduces a multi-agent system designed to produce accurate video transcriptions by combining several LLMs with human feedback. The pipeline begins by generating multiple transcriptions using different speech recognition models, including Whisper, Vosk, and Facebook-MMS. A separate LLM then compares these transcriptions and highlights their differences. The system selects the most accurate n-version through a majority voting mechanism. The chosen transcriptions are presented to users through an interactive application, where they can be reviewed and manually refined. Once the transcription is finalized, a retrieval-augmented generation (RAG) model analyzes the transcript and provides content-based recommendations to the user.
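The majority voting step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `normalize` and `majority_vote`, the sample transcripts, and the tie-breaking rule are all hypothetical, and the sketch assumes each model's output has already been split into segments that cover the same audio spans.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not split votes."""
    return " ".join(text.lower().split())


def majority_vote(candidates: dict[str, list[str]]) -> list[str]:
    """Pick, for each segment, the text that most models agree on.

    `candidates` maps a model name to its list of segment transcriptions.
    Assumption (not from the thesis): all lists have the same length and
    segment i covers the same audio span for every model.
    """
    lengths = {len(segments) for segments in candidates.values()}
    if len(lengths) != 1:
        raise ValueError("all models must provide the same number of segments")

    result = []
    for versions in zip(*candidates.values()):
        counts = Counter(normalize(v) for v in versions)
        # Among equally common texts, most_common returns the one seen first,
        # so a tie falls back to the model listed first in `candidates`.
        winner, _ = counts.most_common(1)[0]
        result.append(winner)
    return result


# Hypothetical outputs from three recognizers for the same two segments.
transcripts = {
    "whisper": ["The quick brown fox", "jumps over the lazy dog"],
    "vosk": ["the quick brown fox", "jumps over the lazy dock"],
    "mms": ["the quick brown box", "jumps over the lazy dog"],
}
consensus = majority_vote(transcripts)
```

Voting on whole segments keeps the sketch short; a real system would likely need word-level alignment between transcripts before voting, and the segments where models disagree are natural candidates to flag for human review.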
2025
Keywords: LLMs, Transcriptions, Whisper
Files in this item:
AkbulutUlascan2106046_MasterThesis.pdf (open access, 2.98 MB, Adobe PDF)

The text of this website is © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108222