Implementation and Validation of a Decision Tree-Based Approach for Interpretable Supervised Clustering
GUDERZO, MICHELE
2024/2025
Abstract
Statistical learning is a fundamental field that unifies statistical principles and machine learning approaches to enable data-driven analysis and prediction. Among its key techniques, clustering represents a pivotal task, allowing the identification of homogeneous groups within a dataset. While traditional clustering methods often prioritize accuracy and efficiency, interpretability remains a crucial aspect, particularly in domains where understanding the reasoning behind classifications is essential. This thesis explores the integration of decision trees into a supervised clustering framework, aiming to enhance transparency as well as effectiveness in cluster formation. The proposed method is based on a novel algorithm called Best Node Selection, which leverages decision trees to iteratively partition the dataset. At each iteration, a decision tree of fixed depth is trained, and the best node, along with its decision path, is selected using the F-score metric as the primary criterion. This node is then removed from the dataset, and the process continues on the remaining data, ensuring an interpretable step-by-step clustering approach. The algorithm has been implemented in Python and evaluated on the Breast Cancer Wisconsin (Diagnostic) dataset. The results demonstrate its ability to identify well-defined clusters while maintaining a clear selection process. However, a tendency to favor small, highly pure nodes raises concerns regarding generalizability and robustness. This trade-off between interpretability and cluster representativeness suggests avenues for further refinement. Future developments could focus on improving the node selection strategy to balance purity and sample size, integrating external knowledge, and exploring applications to different datasets to assess the method's adaptability and effectiveness across various domains.
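For illustration only, the sketch below shows how the iterative loop described in the abstract might be organized in Python. It is not the thesis implementation: the use of scikit-learn's `DecisionTreeClassifier`, the leaf-scoring rule (F-score of treating membership in a leaf as a positive prediction), the `path_rules` helper, and all parameter names are assumptions introduced here to make the step-by-step idea concrete.

```python
# Minimal sketch of an iterative "best node selection" loop (illustrative assumptions only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def path_rules(tree, x):
    """Return the decision-path conditions satisfied by a single sample x."""
    t = tree.tree_
    node_ids = tree.decision_path(x.reshape(1, -1)).indices
    rules = []
    for node in node_ids:
        if t.children_left[node] == t.children_right[node]:  # leaf: no split condition
            continue
        feat, thr = t.feature[node], t.threshold[node]
        op = "<=" if x[feat] <= thr else ">"
        rules.append(f"x[{feat}] {op} {thr:.3f}")
    return " AND ".join(rules)


def best_node_selection(X, y, max_depth=3, positive=1, min_remaining=10):
    """Iteratively carve out clusters by picking the leaf with the highest F-score."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    remaining = np.arange(len(y))          # indices of samples not yet assigned
    clusters = []                          # (member indices, decision-path description)

    while len(remaining) > min_remaining and (y[remaining] == positive).any():
        # Fit a fixed-depth tree on the data that is still unassigned.
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X[remaining], y[remaining])

        leaf_of = tree.apply(X[remaining])                 # leaf id of each remaining sample
        total_pos = (y[remaining] == positive).sum()

        # Score each leaf as the rule "samples in this leaf are positive".
        best_leaf, best_f1 = None, -1.0
        for leaf in np.unique(leaf_of):
            in_leaf = leaf_of == leaf
            tp = (y[remaining][in_leaf] == positive).sum()
            precision = tp / in_leaf.sum()
            recall = tp / total_pos
            f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_leaf, best_f1 = leaf, f1

        # Record the selected node and its decision path, then remove its samples.
        members = remaining[leaf_of == best_leaf]
        clusters.append((members, path_rules(tree, X[members[0]])))
        remaining = remaining[leaf_of != best_leaf]

    return clusters
```

Under these assumptions, each returned cluster comes with a human-readable conjunction of split conditions, which is what makes the procedure interpretable; the stopping threshold and the choice of scoring a leaf by its F-score against the remaining positives are placeholders for whatever criteria the thesis actually uses.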
File | Access | Size | Format
---|---|---|---
Guderzo_Michele.pdf | open access | 2.46 MB | Adobe PDF
https://hdl.handle.net/20.500.12608/84134