Implementation and Validation of a Decision Tree-Based Approach for Interpretable Supervised Clustering
GUDERZO, MICHELE
2024/2025
Abstract
Statistical learning is a fundamental field that unifies statistical principles and machine learning approaches to enable data-driven analysis and prediction. Among its key techniques, clustering represents a pivotal task, allowing the identification of homogeneous groups within a dataset. While traditional clustering methods often prioritize accuracy and efficiency, interpretability remains a crucial aspect, particularly in domains where understanding the reasoning behind classifications is essential. This thesis explores the integration of decision trees into a supervised clustering framework, aiming to enhance transparency as well as effectiveness in cluster formation. The proposed method is based on a novel algorithm called Best Node Selection, which leverages decision trees to iteratively partition the dataset. At each iteration, a decision tree of fixed depth is trained, and the best node, along with its decision path, is selected using the F-score metric as the primary criterion. This node is then removed from the dataset, and the process continues on the remaining data, ensuring an interpretable step-by-step clustering approach. The algorithm has been implemented in Python and evaluated on the Breast Cancer Wisconsin (Diagnostic) dataset. The results demonstrate its ability to identify well-defined clusters while maintaining a clear selection process. However, a tendency to favor small, highly pure nodes raises concerns regarding generalizability and robustness. This trade-off between interpretability and cluster representativeness suggests avenues for further refinement. Future developments could focus on improving the node selection strategy to balance purity and sample size, integrating external knowledge, and exploring applications to different datasets to assess the method's adaptability and effectiveness across various domains.
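For illustration only, the sketch below shows how the iterative loop described in the abstract might be organized in Python. It is not the thesis implementation: the use of scikit-learn's `DecisionTreeClassifier`, the leaf-scoring rule (F-score of treating membership in a leaf as a positive prediction), the `path_rules` helper, and all parameter names are assumptions introduced here to make the step-by-step idea concrete.

```python
# Minimal sketch of an iterative "best node selection" loop (illustrative assumptions only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def path_rules(tree, x):
    """Return the decision-path conditions satisfied by a single sample x."""
    t = tree.tree_
    node_ids = tree.decision_path(x.reshape(1, -1)).indices
    rules = []
    for node in node_ids:
        if t.children_left[node] == t.children_right[node]:  # leaf: no split condition
            continue
        feat, thr = t.feature[node], t.threshold[node]
        op = "<=" if x[feat] <= thr else ">"
        rules.append(f"x[{feat}] {op} {thr:.3f}")
    return " AND ".join(rules)


def best_node_selection(X, y, max_depth=3, positive=1, min_remaining=10):
    """Iteratively carve out clusters by picking the leaf with the highest F-score."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    remaining = np.arange(len(y))          # indices of samples not yet assigned
    clusters = []                          # (member indices, decision-path description)

    while len(remaining) > min_remaining and (y[remaining] == positive).any():
        # Fit a fixed-depth tree on the data that is still unassigned.
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X[remaining], y[remaining])

        leaf_of = tree.apply(X[remaining])                 # leaf id of each remaining sample
        total_pos = (y[remaining] == positive).sum()

        # Score each leaf as the rule "samples in this leaf are positive".
        best_leaf, best_f1 = None, -1.0
        for leaf in np.unique(leaf_of):
            in_leaf = leaf_of == leaf
            tp = (y[remaining][in_leaf] == positive).sum()
            precision = tp / in_leaf.sum()
            recall = tp / total_pos
            f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_leaf, best_f1 = leaf, f1

        # Record the selected node and its decision path, then remove its samples.
        members = remaining[leaf_of == best_leaf]
        clusters.append((members, path_rules(tree, X[members[0]])))
        remaining = remaining[leaf_of != best_leaf]

    return clusters
```

Under these assumptions, each returned cluster comes with a human-readable conjunction of split conditions, which is what makes the procedure interpretable; the stopping threshold and the choice of scoring a leaf by its F-score against the remaining positives are placeholders for whatever criteria the thesis actually uses.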
File | Access | Size | Format
---|---|---|---
Guderzo_Michele.pdf | open access | 2.46 MB | Adobe PDF
https://hdl.handle.net/20.500.12608/84134