From Voice to Sanction: A Comparative Study of On-Device AI Models
CALABRESE, ALBERTO
2024/2025
Abstract
The digitalization of public services presents a critical architectural decision for developers: whether to leverage powerful, cloud-based AI services or to prioritize the privacy, resilience, and cost-effectiveness of on-device models. This master's thesis investigates this trade-off within the practical context of developing Concilia EVO, a voice-powered Android application designed to enable law enforcement officers to issue sanctions using natural language commands. The core of this work is an end-to-end, dual-pipeline system addressing both Speech-to-Text (STT) and Text-to-Form (TTF) tasks. To answer its central research hypotheses, a rigorous comparative benchmark was conducted. For the STT task, fourteen state-of-the-art models were evaluated on both public and custom-generated synthetic datasets. For the TTF task, a comprehensive suite of Large Language Models (LLMs) and Small Language Models (SLMs) was tested on its ability to generate structured JSON output from unstructured text. The methodology involved evaluating models in a controlled desktop environment based on a multi-dimensional framework of metrics, including accuracy, reliability, and efficiency. The results empirically validate the hypotheses: cloud-based models unequivocally demonstrate superior accuracy and reliability in both STT and TTF tasks. However, the study also identifies a subset of specialized local models that achieve a performance level deemed "good enough" for practical application, offering a viable alternative when offline functionality is non-negotiable. Based on these findings, this thesis proposes a hybrid architecture as the optimal solution for this high-stakes domain.

| File | Size | Format |
|---|---|---|
| Master_Thesis_Alberto_Calabrese.pdf (restricted access) | 16.34 MB | Adobe PDF |
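The abstract mentions accuracy as a core metric for the STT benchmark. A standard accuracy measure for speech-to-text systems is Word Error Rate (WER), the word-level edit distance between a model's transcript and the reference, normalized by reference length. The following sketch is illustrative only and is not taken from the thesis itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (Levenshtein over words).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a transcript with one substituted word out of four yields a WER of 0.25. A lower WER indicates a more accurate STT model.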
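For the TTF task, the abstract describes measuring models on their ability to produce structured JSON from unstructured text. One natural reliability check is whether a model's raw output parses as JSON and contains all fields the target form requires. The field names below are hypothetical placeholders, not the schema used in the thesis:

```python
import json

# Hypothetical sanction-form schema; the thesis's actual fields are not specified here.
REQUIRED_FIELDS = {"violation_code", "location", "plate"}

def is_valid_ttf_output(raw: str) -> bool:
    """Reliability check: output must parse as a JSON object with all required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()
```

Aggregating this boolean over a test set gives a simple reliability score: the fraction of model outputs that are directly usable to populate the form without manual repair.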
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/102101