Multi-Agent Human-AI Framework for Transcriptions
AKBULUT, ULASCAN
2025/2026
Abstract
This work introduces a multi-agent system designed to produce accurate video transcriptions by combining several LLMs with human feedback. The pipeline begins by generating multiple transcriptions with different speech recognition models, including Whisper, Vosk, and Facebook-MMS. A separate LLM then compares these transcriptions and highlights their differences. The system selects the n most accurate versions through a majority voting mechanism. The selected transcriptions are presented to users in an interactive application, where they can be reviewed and manually refined. Once the transcript is finalized, a retrieval-augmented generation (RAG) model analyzes it and provides content-based recommendations to the user.

| File | Access | Size | Format |
|---|---|---|---|
| AkbulutUlascan2106046_MasterThesis.pdf | Open access | 2.98 MB | Adobe PDF |
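The selection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that "majority voting" can be approximated by ranking each candidate transcription by its average word-level agreement with the others, and the function names `agreement` and `select_top_n`, the model keys, and the sample sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def agreement(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] between two transcriptions."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def select_top_n(candidates: dict[str, str], n: int = 2) -> list[str]:
    """Return the n model names whose output agrees most with the others.

    Assumption: consensus ranking stands in for the majority voting
    mechanism mentioned in the abstract.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate transcriptions")
    scores = {}
    for name, text in candidates.items():
        others = [t for other, t in candidates.items() if other != name]
        # Mean agreement of this candidate with every other candidate.
        scores[name] = sum(agreement(text, t) for t in others) / len(others)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical outputs from three speech recognition models.
candidates = {
    "whisper": "the quick brown fox jumps over the lazy dog",
    "vosk": "the quick brown fox jumps over the lazy dog",
    "mms": "a quick round fox jumped over a hazy log",
}
top = select_top_n(candidates, n=2)
print(top)
```

In this sketch the two transcriptions that agree with each other outrank the outlier, so they would be the ones passed on for human review.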
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/108222