Partial least squares for classification: a new point of view

De Nardi, Martino
2020/2021

Abstract

Data are now everywhere, and collecting and analyzing them correctly is increasingly important for obtaining useful information, since many scientific and industrial fields rely on data analysis to solve a wide range of problems. With the advent of high-performance computers and of measurement and processing systems, data are growing in both size and complexity and therefore require careful analysis: consider, for example, sensors that take thousands of measurements in a few moments, which must then be analyzed properly to extract information. This scenario gives rise to so-called high-dimensional data, characterized by a number of predictor variables that is (possibly much) larger than the number of observations. Data of this type are found in areas such as economics, bioinformatics, astronomy, geology, chemistry, and physics. High dimensionality poses several problems for traditional approaches, so methods suited to this context are needed. One of these is Partial Least Squares regression (PLS), a technique initially designed for linear regression that uses dimensionality reduction to find a few orthogonal components which explain as much variance of the predictors as possible while being correlated with the response. The focus of this thesis is to adapt PLS for classification from a point of view different from those currently present in the literature. In most cases PLS is used as a discriminatory tool rather than as a classifier: it only separates the classes of the response variable, and the final classification is delegated to an additional classifier. This is our starting point: the aim of this work is to design a new classification method based purely on PLS.
The approach rests on two main ingredients. The first is the formulation of PLS as an iterative procedure that minimizes the distance between the response and the modelled response (which in Euclidean space corresponds to the least squares problem) through the steepest descent method. The second is the use of compositional data, which makes it possible to treat the response as a composition (and therefore as probabilities), gives a rigorous mathematical justification to the classification criterion used by the model, and provides transformations that allow calculations linking spaces with different structures. Exploiting these two ingredients, we developed a new approach that adapts PLS for classification with a clear theoretical foundation, focusing on the binary response case. The case of G > 2 classes is presented in its general framework, but a more detailed discussion requires further study. Several procedures are proposed; they share the underlying approach but differ in the space in which the calculations are made and in the transformation applied to the data. These classification techniques have the same performance as Partial Least Squares - Discriminant Analysis (PLS-DA), the most widely used state-of-the-art tool for classification with PLS; PLS-DA, however, is not a purely PLS-based method, since it requires an additional classifier, such as Linear Discriminant Analysis (LDA), to predict the classes of the observations. Moreover, the proposed methods also show good predictive ability in traditional scenarios, that is, when the number of X-variables is much lower than the number of observations and the collinearity between predictors is mild or moderate: in this setting, the results are comparable to those of logistic regression. The classification procedures are tested on both simulated and real datasets, which also provide evidence of their theoretical properties.
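
As an illustration of how a binary response can be read as a composition and moved between spaces with different structures, the following minimal sketch maps a two-part composition (p, 1 - p) to a single real coordinate and back. It assumes the isometric log-ratio (ilr) transformation, which the abstract does not name explicitly; the function names `ilr2`, `ilr2_inv` and `classify` are hypothetical and are not taken from the thesis.

```python
import math


def ilr2(p):
    """Map the two-part composition (p, 1 - p) to its single ilr coordinate.

    Assumption: ilr is used here only as one possible log-ratio transformation.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p)) / math.sqrt(2.0)


def ilr2_inv(z):
    """Map a real coordinate z back to the composition (p, 1 - p)."""
    p = 1.0 / (1.0 + math.exp(-math.sqrt(2.0) * z))
    return p, 1.0 - p


def classify(z):
    """Assign the class with the larger composition part (p > 0.5 iff z > 0)."""
    return 1 if z > 0 else 0


if __name__ == "__main__":
    z = ilr2(0.8)
    print(z, ilr2_inv(z), classify(z))
```

Under this assumption, a model fitted on the real line yields a coordinate whose inverse transform is a pair of class probabilities, so assigning the class with the larger part is equivalent to checking the sign of the coordinate.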
2020-09-17
120
regression, classification, partial least squares
Files in this item:
File: tesi_De_NardiDef.pdf
Access: open access
Size: 2.32 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/21693