Development and Deployment of a Generative ChatBot on AWS: Architectural Design and Performance Analysis

BERBERI, JOI
2024/2025

Abstract

This thesis presents the design and evaluation of an internal conversational assistant based on the Retrieval-Augmented Generation (RAG) paradigm. The system is developed for a data and analytics team within a large enterprise, with the goal of providing fast, reliable access to technical and operational documentation through natural-language interaction. The assistant combines a REST API backend implemented in Python with a cloud-hosted large language model (LLM) and an internal knowledge base exported from existing documentation platforms and stored as Markdown files.

The architecture follows a modular, serverless approach. The cloud LLM, accessed via a managed API, is used both in a pure zero-shot configuration and in RAG settings where it is grounded in semantically retrieved document chunks. A vector-based retrieval layer, built on top of document embeddings, enables semantic search over the internal corpus. Chat sessions and messages are persisted in a NoSQL datastore with time-to-live (TTL) policies, and the backend exposes dedicated endpoints for conversation management and administration.

From a methodological perspective, the work defines and compares multiple chat modes: zero-shot prompting, retrieval-only RAG, and few-shot prompting combined with RAG. Functional tests and illustrative dialogues are used to analyse the behaviour of each mode, while quantitative measurements focus on latency and retrieval statistics. Where available, human feedback is used to assess the perceived usefulness, relevance, and clarity of the assistant's answers.

The results show that grounding the LLM in the internal knowledge base markedly improves the factuality and usefulness of responses compared to the zero-shot configuration, especially for questions about internal tools, processes, and runbooks. Few-shot prompting additionally yields more consistent style and structure in the answers, at the cost of slightly higher response times.

The thesis concludes by discussing the limitations of the current prototype, including the dependency on cloud connectivity, coverage gaps in the knowledge base, and the limited amount of human evaluation, and outlines directions for future development, such as richer logging, systematic user studies, and tighter integration with existing enterprise tooling.
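The abstract names the building blocks without showing code, so the following is a minimal illustrative sketch of how a vector-based retrieval layer over Markdown chunks and the three chat modes could be wired together. Every identifier here (Chunk, top_k, build_prompt, FEW_SHOT_EXAMPLES) is a hypothetical reconstruction, not code from the thesis, and chunk embeddings are assumed to be precomputed by a separate embedding model.

```python
# Minimal illustrative sketch of the retrieval layer and the three chat
# modes described in the abstract. Nothing here is taken from the thesis:
# all names are hypothetical, and embeddings are assumed precomputed.
from dataclasses import dataclass
import numpy as np


@dataclass
class Chunk:
    source: str          # originating Markdown file in the knowledge base
    text: str            # chunk contents
    vector: np.ndarray   # precomputed embedding of `text`


def top_k(query_vec: np.ndarray, chunks: list[Chunk], k: int = 4) -> list[Chunk]:
    """Semantic search: return the k chunks most cosine-similar to the query."""
    mat = np.stack([c.vector for c in chunks])
    sims = mat @ query_vec / (
        np.linalg.norm(mat, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]


FEW_SHOT_EXAMPLES = "Q: ...\nA: ..."  # placeholder for curated demonstrations


def build_prompt(mode: str, question: str, context: list[Chunk]) -> str:
    """Assemble the LLM prompt for the three chat modes compared in the thesis."""
    if mode == "zero_shot":
        return question  # model answers from parametric knowledge only
    grounding = "\n\n".join(c.text for c in context)
    if mode == "rag":
        return f"Answer using only this documentation:\n{grounding}\n\nQ: {question}"
    if mode == "few_shot_rag":
        return (f"{FEW_SHOT_EXAMPLES}\n\n"
                f"Answer using only this documentation:\n{grounding}\n\nQ: {question}")
    raise ValueError(f"unknown chat mode: {mode}")
```

A sketch like this also makes the reported trade-off plausible: the grounding text is what pins answers about internal tools and runbooks to the documentation, while the few-shot preamble standardises answer style at the price of a longer prompt, and hence slightly higher latency.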
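The abstract further states that chat sessions and messages are persisted in a NoSQL datastore with TTL policies. Given the AWS deployment in the title, a natural reading is Amazon DynamoDB, whose TTL mechanism deletes items once a designated numeric attribute (epoch seconds) lies in the past. A minimal sketch under that assumption, with a hypothetical ChatSessions table and attribute layout:

```python
# Minimal sketch of chat persistence with a TTL, assuming Amazon DynamoDB
# (the abstract says only "NoSQL datastore with TTL policies"). The table
# name "ChatSessions" and the attribute layout are hypothetical.
import time
import uuid

import boto3

TTL_SECONDS = 7 * 24 * 3600  # e.g. expire conversations after one week
table = boto3.resource("dynamodb").Table("ChatSessions")


def save_message(session_id: str, role: str, text: str) -> None:
    """Persist one chat message; DynamoDB removes it once `expires_at` passes."""
    table.put_item(
        Item={
            "session_id": session_id,                      # partition key
            "message_id": str(uuid.uuid4()),               # sort key
            "role": role,                                  # "user" or "assistant"
            "text": text,
            "expires_at": int(time.time()) + TTL_SECONDS,  # TTL attribute (epoch seconds)
        }
    )
```

Expiry via TTL keeps the conversation store self-cleaning without a scheduled purge job, which fits the serverless approach the abstract describes.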
Keywords: ChatBot, Large Language Model, Amazon Web Services
File: Maste_Thesis_JoiBerberi.pdf (Adobe PDF, 967.95 kB, restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102098