Efficient On-Premise AI: Benchmarking Quantized SLMs on CPU-only Infrastructure
GASTALDON, SIMONE
2025/2026
Abstract
The integration of Generative AI into Intelligent Process Automation is currently hindered by two structural barriers: the economic sustainability of cloud-based solutions and the compliance risks involved in transferring sensitive data to third-party providers. This thesis investigates the technical and economic feasibility of a paradigm shift towards "on-premise" inference, using Small Language Models (SLMs) executed exclusively on standard CPU infrastructure, without dedicated GPU accelerators. Through a real-world case study at Data4Prime S.r.l., this work analyzes the performance of quantized models in two distinct application scenarios: code generation, and knowledge retrieval over proprietary technical documentation via a Retrieval-Augmented Generation (RAG) architecture. The research aims to provide a critical assessment of the trade-offs involved in deploying local AI strategies on standard hardware.
| File | Size | Format |
|---|---|---|
| Master thesis SG.pdf (restricted access) | 1.88 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/108226