Utility-based Source Attribution in Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems were introduced to address the knowledge limitations of Large Language Models (LLMs) and their tendency to hallucinate. By leveraging relevant external documents, RAG grounds generation more firmly and gives the model access to specific knowledge. Although RAG systems are widely used, they still face limitations, such as the inclusion of irrelevant documents in the context and the potential for LLMs to misuse the retrieved information. This work explores strategies to enhance RAG systems, with a particular focus on utility-based attribution techniques. Specifically, we investigate the feasibility and effectiveness of adapting Shapley-based attribution to identify influential retrieved documents in RAG. We compare exact Shapley values with more computationally tractable approximations and with some existing attribution methods for LLMs. This work aims to: (1) systematically apply established attribution principles to the RAG document-level setting; (2) quantify how well SHAP approximations can mirror exact attributions while minimizing costly LLM interactions; and (3) evaluate their practical explainability in identifying critical documents, especially under complex inter-document relationships such as redundancy, complementarity, and synergy.
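To make the idea of document-level Shapley attribution concrete, the following is a minimal sketch and not the thesis's implementation. It computes exact Shapley values over a set of retrieved documents for a given utility function. The function names `exact_shapley` and `toy_utility`, and the utility itself, are assumptions made for illustration: in a real RAG pipeline the utility would come from querying the LLM with each subset of documents, whereas here it is a hand-written stand-in in which documents 0 and 1 are redundant and documents 2 and 3 are synergistic.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n_docs, utility):
    """Exact Shapley value of each document index under `utility`.

    `utility` maps a frozenset of document indices to a float.
    This needs 2**n_docs utility evaluations, so it is only
    feasible for a small number of retrieved documents.
    """
    players = range(n_docs)
    # Evaluate every coalition once and cache the result.
    cache = {
        frozenset(subset): utility(frozenset(subset))
        for size in range(n_docs + 1)
        for subset in combinations(players, size)
    }
    values = []
    for i in players:
        others = [j for j in players if j != i]
        phi = 0.0
        for size in range(n_docs):
            # Shapley weight for coalitions of this size that exclude i.
            weight = (
                factorial(size) * factorial(n_docs - size - 1) / factorial(n_docs)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of document i to coalition s.
                phi += weight * (cache[s | {i}] - cache[s])
        values.append(phi)
    return values


def toy_utility(docs):
    """Hypothetical utility standing in for an LLM-based score."""
    score = 0.0
    # Redundancy: document 0 or document 1 alone is enough for this part.
    if 0 in docs or 1 in docs:
        score += 0.6
    # Synergy: documents 2 and 3 only help when both are present.
    if 2 in docs and 3 in docs:
        score += 0.4
    return score


shapley_values = exact_shapley(4, toy_utility)
print(shapley_values)
```

Under this toy utility the redundant pair splits its 0.6 equally and the synergistic pair splits its 0.4 equally, so the values are approximately 0.3, 0.3, 0.2, and 0.2, and they sum to the utility of the full document set. The exponential number of utility evaluations is what motivates the cheaper approximations the abstract refers to.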

FUGAGNOLI, GABRIELE

2024/2025

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				Data Science, Laurea Magistrale (Master's degree, D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Utility-based Source Attribution in Retrieval-Augmented Generation
			
	Keywords
	
				Large Language Model
RAG
Explainable AI
Shapley Value
Attribution
			
	Supervisor
	
				AIOLLI, FABIO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Utility_based_Attribution_for_Retrieval_Augmented_Generation.pdf (restricted access)	2.78 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102110