Efficient On-Premise AI: Benchmarking Quantized SLMs on CPU-only Infrastructure
GASTALDON, SIMONE
2025/2026
Abstract
The integration of Generative AI into Intelligent Process Automation is currently hindered by two structural barriers: the economic sustainability of cloud-based solutions and the compliance risks involved in transferring sensitive data to third-party providers. This thesis investigates the technical and economic feasibility of a paradigm shift towards "on-premise" inference, using Small Language Models (SLMs) executed exclusively on standard CPU infrastructure, without dedicated GPU accelerators. Through a real-world case study at Data4Prime S.r.l., this work analyzes the performance of quantized models in two distinct application scenarios: code generation, and knowledge retrieval over proprietary technical documentation via a Retrieval-Augmented Generation (RAG) architecture. The research aims to provide a critical assessment of the trade-offs involved in deploying local AI strategies on standard hardware.
| File | Size | Format |
|---|---|---|
| Master thesis SG.pdf (restricted access) | 1.88 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/108226