Sequence–based prediction of protein specificity in liquid–liquid phase separation with machine learning

AQUISTAPACE TAGUA, FRANCO

2024/2025

Abstract

Protein liquid-liquid phase separation (LLPS) is a complex process through which membrane-less organelles (MLOs) are formed. MLOs serve many functions in the cell, which are partly determined by their specific composition, and alterations that disrupt their proper function have been linked with various diseases. However, despite a significant amount of research and experimental data, how the specific composition of an MLO is encoded in the amino acid sequences of its components remains an open question. In this thesis, we develop machine learning models to identify universal features encoding such information in protein sequences.

To train and test the models, we assemble multiple datasets from curated databases on protein LLPS and recently published experimental data. We explore multiple model architectures based on convolutional neural networks, on the assumption that the amino acid features that determine LLPS are independent of their absolute position within the sequence. The models are designed to be as minimal as possible and are improved iteratively, aiming for good performance while remaining interpretable.

We then assess the performance of the optimal models and study them through several approaches to identify protein features that may drive MLO assembly and specificity. We find that hydrophobic, disorder-promoting and charged amino acids can all play important roles in proteins that self-assemble. For proteins that preferentially condense with the partner proteins MED1 and FUS, we find that selectivity for these partners is encoded by mutually exclusive features, and that condensation with both partners is encoded by concatenation of the corresponding features.

This study provides an approach based on machine learning models to identify sequence features that determine MLO formation and composition specificity. The methods proposed here may be scaled to increasingly large datasets, which could deepen our understanding of protein LLPS and potentially guide the creation of treatments for condensate-related pathologies.

Record

	Faculty/Department
	
				Dipartimento di Fisica e Astronomia "Galileo Galilei" - DFA
			
	Degree programme
	
				PHYSICS OF DATA, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2024
			
	English title
	
				Sequence–based prediction of protein specificity in liquid–liquid phase separation with machine learning
			
	Keywords
	
				Phase separation
				Machine learning
				Protein aggregation
			
	Supervisor
	
				FUXREITER, MONIKA
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Aquistapace_Franco.pdf (open access)	5.52 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/100369