From Voice to Sanction: A Comparative Study of On-Device AI Models
CALABRESE, ALBERTO
2024/2025
Abstract
The digitalization of public services presents a critical architectural decision for developers: whether to leverage powerful, cloud-based AI services or to prioritize the privacy, resilience, and cost-effectiveness of on-device models. This master's thesis investigates this trade-off within the practical context of developing Concilia EVO, a voice-powered Android application designed to enable law enforcement officers to issue sanctions using natural language commands. The core of this work is an end-to-end, dual-pipeline system addressing both Speech-to-Text (STT) and Text-to-Form (TTF) tasks. To answer its central research hypotheses, a rigorous comparative benchmark was conducted. For the STT task, fourteen state-of-the-art models were evaluated on both public and custom-generated synthetic datasets. For the TTF task, a comprehensive suite of Large Language Models (LLMs) and Small Language Models (SLMs) was tested on its ability to generate structured JSON output from unstructured text. The methodology involved evaluating models in a controlled desktop environment based on a multi-dimensional framework of metrics, including accuracy, reliability, and efficiency. The results empirically validate the hypotheses: cloud-based models unequivocally demonstrate superior accuracy and reliability in both STT and TTF tasks. However, the study also identifies a subset of specialized local models that achieve a performance level deemed "good enough" for practical application, offering a viable alternative when offline functionality is non-negotiable. Based on these findings, this thesis proposes a hybrid architecture as the optimal solution for this high-stakes domain.

| File | Size | Format |
|---|---|---|
| Master_Thesis_Alberto_Calabrese.pdf (restricted access) | 16.34 MB | Adobe PDF |
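The abstract mentions accuracy as a core metric for the STT benchmark. A standard accuracy measure for speech-to-text systems is Word Error Rate (WER), the word-level edit distance between a model's transcript and the reference, normalized by reference length. The following sketch is illustrative only and is not taken from the thesis itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (Levenshtein over words).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a transcript with one substituted word out of four yields a WER of 0.25. A lower WER indicates a more accurate STT model.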
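For the TTF task, the abstract describes measuring models on their ability to produce structured JSON from unstructured text. One natural reliability check is whether a model's raw output parses as JSON and contains all fields the target form requires. The field names below are hypothetical placeholders, not the schema used in the thesis:

```python
import json

# Hypothetical sanction-form schema; the thesis's actual fields are not specified here.
REQUIRED_FIELDS = {"violation_code", "location", "plate"}

def is_valid_ttf_output(raw: str) -> bool:
    """Reliability check: output must parse as a JSON object with all required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()
```

Aggregating this boolean over a test set gives a simple reliability score: the fraction of model outputs that are directly usable to populate the form without manual repair.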
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/102101