Building Secure Conversational Agents: Architectural Choices, Performance Evaluation, and Threat Modeling in RAG

CALIGIURI, GIORGIO
2025/2026

Abstract

This thesis presents the design, implementation, and cybersecurity evaluation of a secure, on-premise Retrieval-Augmented Generation (RAG) conversational agent tailored for a museum environment. To ensure data sovereignty and operate within a strict 36 GB VRAM hardware constraint, the open-weight Qwen3.5-35B-A3B model was selected. The system optimizes latency by decoupling offline document ingestion from real-time generation and employs a hybrid search pipeline that combines Maximal Marginal Relevance (MMR) with BM25, refined by a cross-encoder reranker. This architecture achieves a 72.23% retrieval success rate while maintaining an operational latency of 1.4 seconds per query. Crucially, deploying RAG architectures shifts traditional security defenses to a novel "semantic perimeter". This study conducts a threat modeling assessment, evaluating inference-phase and data-phase attack vectors such as micro-scale data poisoning (PoisonedRAG), automated jailbreaking (GPTFUZZER), and denial-of-service exploits. To secure the infrastructure, the research proposes multi-layered countermeasures, including data provenance, retrieval-native access controls, and a Dual LLM pattern.
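As a rough illustration of the hybrid search idea named in the abstract, the sketch below combines a BM25 ranking with an MMR selection on a toy corpus. It is not the thesis implementation: the bag-of-words cosine similarity stands in for dense embeddings, the reciprocal rank fusion step is an assumed way of merging the two rankings (the abstract does not say how they are combined), and the corpus, function names, and parameter values (`k1`, `b`, `lam`, the constant 60) are hypothetical. The cross-encoder reranking stage is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(t) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def cosine(a, b):
    """Bag-of-words cosine similarity (assumed stand-in for embeddings)."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query, docs, k=3, lam=0.7):
    """Greedy MMR: trade query relevance against redundancy with picks so far."""
    relevance = [cosine(query, d) for d in docs]
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


def hybrid_retrieve(query, docs, k=3):
    """Merge the BM25 and MMR rankings with reciprocal rank fusion (an assumption)."""
    bm25 = bm25_scores(query, docs)
    bm25_top = sorted(range(len(docs)), key=lambda i: bm25[i], reverse=True)[:k]
    mmr_top = mmr_select(query, docs, k)
    fused = Counter()
    for ranking in (bm25_top, mmr_top):
        for rank, i in enumerate(ranking):
            fused[i] += 1.0 / (60 + rank + 1)
    ordered = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [i for i, _ in ordered][:k]


# Hypothetical toy corpus, for illustration only.
docs = [
    "the museum opens at nine in the morning",
    "the museum opens at nine every morning",
    "the cafe serves coffee and cake",
    "tickets for the museum cost ten euros",
]
query = "when does the museum open in the morning"
ranked = hybrid_retrieve(query, docs, k=3)
print(ranked)
```

In a pipeline like the one described, the fused candidates would then be passed to a cross-encoder that scores each query-passage pair jointly and reorders them before generation.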
2025
Generative AI
Knowledge Retrieval
RAG Architecture
Large Language Model
LLM Security
Files in this item:
File: Caligiuri Giorgio master thesis.pdf
Access: Restricted
Size: 835.31 kB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108077