Implementation and Validation of a Decision Tree-Based Approach for Interpretable Supervised Clustering

GUDERZO, MICHELE
2024/2025

Abstract

Statistical learning is a fundamental field that unifies statistical principles and machine learning approaches to enable data-driven analysis and prediction. Among its key techniques, clustering represents a pivotal task, allowing the identification of homogeneous groups within a dataset. While traditional clustering methods often prioritize accuracy and efficiency, interpretability remains a crucial aspect, particularly in domains where understanding the reasoning behind classifications is essential. This thesis explores the integration of decision trees into a supervised clustering framework, aiming to enhance transparency as well as effectiveness in cluster formation. The proposed method is based on a novel algorithm called Best Node Selection, which leverages decision trees to iteratively partition the dataset. At each iteration, a decision tree of fixed depth is trained, and the best node, along with its decision path, is selected using the F-score metric as the primary criterion. This node is then removed from the dataset, and the process continues on the remaining data, ensuring an interpretable step-by-step clustering approach. The algorithm has been implemented in Python and evaluated on the Breast Cancer Wisconsin (Diagnostic) dataset. The results demonstrate its ability to identify well-defined clusters while maintaining a clear selection process. However, a tendency to favor small, highly pure nodes raises concerns regarding generalizability and robustness. This trade-off between interpretability and cluster representativeness suggests avenues for further refinement. Future developments could focus on improving the node selection strategy to balance purity and sample size, integrating external knowledge, and exploring applications to different datasets to assess the method’s adaptability and effectiveness across various domains.
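The abstract describes the Best Node Selection loop only at a high level, so the sketch below illustrates one possible reading of it rather than the thesis implementation. It assumes scikit-learn's DecisionTreeClassifier, restricts the candidate "best node" to leaves, and scores each leaf with the standard F-score, F1 = 2 * precision * recall / (precision + recall), where precision is the leaf's purity for a class and recall is the share of that class, among the samples still in play, that the leaf captures. The function names, default depth, and stopping rules (max_iter, min_remaining) are illustrative assumptions, not details taken from the thesis.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

def leaf_rule(tree, feature_names, leaf_id):
    """Reconstruct the split conditions (decision path) leading to a leaf."""
    left, right = tree.children_left, tree.children_right
    parent = {}
    for node in range(tree.node_count):
        if left[node] != -1:                      # internal node
            parent[left[node]] = (node, "<=")
            parent[right[node]] = (node, ">")
    conditions, node = [], leaf_id
    while node in parent:
        up, sign = parent[node]
        conditions.append(f"{feature_names[tree.feature[up]]} {sign} {tree.threshold[up]:.3f}")
        node = up
    return list(reversed(conditions))

def best_node_selection(X, y, feature_names, max_depth=3, min_remaining=10, max_iter=20):
    """Iteratively train fixed-depth trees, keep the leaf with the best F-score,
    remove its samples, and repeat on the remaining data."""
    X, y = np.asarray(X), np.asarray(y)
    remaining = np.arange(len(y))
    clusters = []
    for _ in range(max_iter):
        if len(remaining) < min_remaining or len(np.unique(y[remaining])) < 2:
            break
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        clf.fit(X[remaining], y[remaining])
        leaves = clf.apply(X[remaining])          # leaf index of each remaining sample
        best = None
        for leaf in np.unique(leaves):
            in_leaf = leaves == leaf
            for cls in np.unique(y[remaining]):
                tp = np.sum(y[remaining][in_leaf] == cls)
                precision = tp / in_leaf.sum()
                recall = tp / np.sum(y[remaining] == cls)
                if precision + recall == 0:
                    continue
                f1 = 2 * precision * recall / (precision + recall)
                if best is None or f1 > best[0]:
                    best = (f1, leaf, cls, in_leaf)
        f1, leaf, cls, in_leaf = best
        clusters.append({"rule": leaf_rule(clf.tree_, feature_names, leaf),
                         "class": int(cls), "size": int(in_leaf.sum()),
                         "f_score": round(float(f1), 3)})
        remaining = remaining[~in_leaf]           # drop the selected node's samples
    return clusters

# Example run on the same dataset used in the thesis
data = load_breast_cancer()
for c in best_node_selection(data.data, data.target, data.feature_names):
    print(c["f_score"], c["size"], data.target_names[c["class"]], " AND ".join(c["rule"]))

Because recall enters the score, this particular reading rewards larger leaves as well as pure ones; a criterion closer to pure precision would instead reproduce the small-but-pure bias that the abstract identifies as an open issue.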
Keywords

Statistical learning
Decision tree
Supervised clustering
F-score
Best node selection
Files in this item

File: Guderzo_Michele.pdf (open access)
Size: 2.46 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84134