Explainable Machine Learning Applied to Proteins

ROSSET, LORENZO
2021/2022

Abstract

This thesis focuses on the development and application of explainable machine learning techniques to the generation and characterization of proteins. Specifically, the work aims to develop a generalisation of the Restricted Boltzmann Machine (RBM) to categorical variables and to use it in an unsupervised fashion to produce new, biologically plausible amino acid sequences. Particular focus is also given to the design of evaluation metrics for the generated data, since it is generally hard to assess the quality of a trained model in generative tasks. Although RBMs are a rather old model, they are not yet completely understood, and they have recently proved very promising for applications in biology. Importantly, because the model is inspired by the physics of spin glasses, the problem of understanding how the RBM learns can be tackled with methods already developed in the disordered-systems physics community, which endows this machine learning model with the rare quality of being explainable. Particular attention is devoted to monitoring the training of the RBM in order to reliably identify the equilibrium and out-of-equilibrium regimes that the training procedure has been shown to undergo. To this end, the model is evaluated at different stages of training and for different generation times using purpose-designed scores, both data-agnostic and biologically relevant. The sampling and training procedures of RBMs rely on Markov Chain Monte Carlo (MCMC) algorithms to explore the configuration space. This poses a major challenge for highly clustered datasets, as protein datasets typically are, because the MCMC chains have difficulty jumping from one cluster to another. Hence, new methods for exploring the configuration space ergodically need to be developed if one aims to correctly interpret the parameters learned by the RBM.
In the end, the objective is to train an RBM that, on the one hand, is able to produce new biologically relevant proteins and, on the other hand, has a training process that is reliable and well understood.
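
To make the sampling step mentioned in the abstract concrete, here is a minimal sketch of alternating (block) Gibbs sampling for an RBM with categorical visible units. It is an illustration under stated assumptions, not the implementation used in the thesis: the 21-state alphabet (20 amino acids plus a gap symbol), the binary hidden units, the parameter names `W`, `a`, `b`, and the function names are all hypothetical choices made for this example.

```python
import numpy as np

def sample_hidden(v, W, b, rng):
    """Sample binary hidden units given categorical visible units.

    v: (n_visible,) integers in [0, q); W: (n_visible, q, n_hidden); b: (n_hidden,).
    """
    # Input to hidden unit mu: b[mu] + sum_i W[i, v[i], mu]
    field = b + W[np.arange(v.size), v].sum(axis=0)
    p_on = 1.0 / (1.0 + np.exp(-field))
    return (rng.random(p_on.shape) < p_on).astype(np.int64)

def sample_visible(h, W, a, rng):
    """Sample one categorical state per visible site given the hidden units."""
    # logits[i, s] = a[i, s] + sum_mu W[i, s, mu] * h[mu]
    logits = a + W @ h
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling: count how many cumulative bins lie below u
    cdf = probs.cumsum(axis=1)
    u = rng.random((probs.shape[0], 1))
    return np.minimum((u > cdf).sum(axis=1), probs.shape[1] - 1)

def gibbs_chain(v0, W, a, b, n_steps, rng):
    """Run n_steps of alternating Gibbs sampling starting from v0."""
    v = v0.copy()
    for _ in range(n_steps):
        h = sample_hidden(v, W, b, rng)
        v = sample_visible(h, W, a, rng)
    return v

# Toy usage with random, untrained parameters (assumed sizes).
rng = np.random.default_rng(0)
n_visible, q, n_hidden = 8, 21, 4
W = 0.1 * rng.standard_normal((n_visible, q, n_hidden))
a = np.zeros((n_visible, q))
b = np.zeros(n_hidden)
v0 = rng.integers(0, q, size=n_visible)
v = gibbs_chain(v0, W, a, b, n_steps=100, rng=rng)
```

Each step changes the whole configuration at once, but only through the conditional distributions of the current model. When the learned distribution has well-separated modes, as with the clustered datasets described above, a chain of this kind can stay within one cluster for a very long time, which is the mixing difficulty the abstract refers to.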
2021
Machine Learning
Statistical Physics
Proteins
RBM
Files in this item:
Master_thesis-Rosset_Lorenzo.pdf (open access, 9.32 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/36259