Explainable Machine Learning Applied to Proteins

ROSSET, LORENZO

2021/2022

Abstract

This thesis develops and applies explainable machine learning techniques to the generation and characterization of proteins. Specifically, it aims to generalise the Restricted Boltzmann Machine (RBM) to categorical variables and to use it, in an unsupervised setting, to produce new and biologically plausible amino acid sequences. Particular attention is also given to the design of evaluation metrics for the generated data, since the quality of a trained model is generally hard to assess in generative tasks.

Although RBMs are a relatively old model, they are not yet completely understood, and they have recently proved very promising for applications in biology. Importantly, because the model is inspired by the physics of spin glasses, its learning can be studied with methods already developed in the physics of disordered systems, which gives this machine learning model the rare quality of being explainable.

Particular attention is devoted to monitoring the training of the RBM in order to reliably identify the equilibrium and out-of-equilibrium regimes that training has been shown to undergo. To this end, the model is evaluated at different training ages and different generation times using purpose-designed scores, both data-agnostic and biologically relevant.

Sampling and training of RBMs rely on Markov Chain Monte Carlo (MCMC) algorithms to explore the configuration space. This poses a major challenge for highly clustered datasets, as protein datasets typically are, because the MCMC has difficulty jumping from one cluster to another. Hence, new methods for exploring the configuration space ergodically need to be developed if one aims to interpret correctly the parameters learned by the RBM.

In the end, the objective is to train an RBM that, on the one hand, can produce new biologically relevant proteins and, on the other hand, has a training procedure that is reliable and well understood.
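As a rough illustration of the sampling step mentioned in the abstract, the following is a minimal sketch of block Gibbs sampling for an RBM with categorical visible units. It is not taken from the thesis: the binary {0, 1} hidden units, the 21 visible states (20 amino acids plus a gap symbol), the random untrained parameters, and all function and variable names are assumptions made only for this example.

```python
import numpy as np

def sample_hidden(v, W, b, rng):
    """Sample binary hidden units given categorical visible units.

    v: (n_chains, n_sites) integer array with entries in [0, n_states)
    W: (n_sites, n_states, n_hidden) weight tensor (assumed layout)
    b: (n_hidden,) hidden biases
    """
    n_sites = W.shape[0]
    # Select W[i, v_i, :] for every site i, then sum over sites.
    field = b + W[np.arange(n_sites), v].sum(axis=1)
    p = 1.0 / (1.0 + np.exp(-field))
    return (rng.random(p.shape) < p).astype(np.int64)

def sample_visible(h, W, a, rng):
    """Sample categorical visible units given binary hidden units.

    a: (n_sites, n_states) visible biases
    """
    # logits[c, i, q] = a[i, q] + sum_m W[i, q, m] * h[c, m]
    logits = a + np.einsum("iqm,cm->ciq", W, h)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    # Inverse-CDF sampling of one state per site.
    cdf = p.cumsum(axis=-1)
    u = rng.random(p.shape[:-1] + (1,))
    states = (u > cdf).sum(axis=-1)
    # Guard against round-off in the last CDF entry.
    return np.minimum(states, p.shape[-1] - 1)

def gibbs_chain(v, W, a, b, n_steps, rng):
    """Run n_steps of alternating (block) Gibbs sampling."""
    for _ in range(n_steps):
        h = sample_hidden(v, W, b, rng)
        v = sample_visible(h, W, a, rng)
    return v

# Toy, untrained parameters (illustrative values only).
rng = np.random.default_rng(0)
n_sites, n_states, n_hidden = 10, 21, 5
W = 0.1 * rng.standard_normal((n_sites, n_states, n_hidden))
a = np.zeros((n_sites, n_states))
b = np.zeros(n_hidden)

v0 = rng.integers(0, n_states, size=(4, n_sites))
v = gibbs_chain(v0, W, a, b, n_steps=100, rng=rng)
```

With parameters trained on a clustered dataset, chains like these tend to remain near the cluster where they were initialised, which is the mixing difficulty the abstract refers to.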

Record

Faculty/Department: Dipartimento di Fisica e Astronomia "Galileo Galilei" - DFA
Degree programme: PHYSICS OF DATA, Laurea Magistrale (master's degree, D.M. 270/2004)
Academic year: 2021
English title: Explainable Machine Learning Applied to Proteins
Keywords: Machine Learning; Statistical Physics; Proteins; RBM
Supervisor: BAIESI, MARCO
Appears in document types: Master's theses (Lauree magistrali)

Files in this item:

File	Access	Size	Format
Master_thesis-Rosset_Lorenzo.pdf	open access	9.32 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/36259