Sequence–based prediction of protein specificity in liquid–liquid phase separation with machine learning

AQUISTAPACE TAGUA, FRANCO
2024/2025

Abstract

Protein liquid-liquid phase separation (LLPS) is a complex process through which membrane-less organelles (MLOs) are formed. MLOs serve many functions in the cell, which are partly determined by their specific composition, and alterations that disrupt their proper function have been linked with various diseases. However, despite a significant amount of research and experimental data, how the specific composition of an MLO is encoded in the amino acid sequences of its components remains an open question. In this thesis, we develop machine learning models to identify universal features encoding such information in protein sequences. To train and test the models, we assemble multiple datasets from curated databases on protein LLPS and from recently published experimental data. We explore multiple model architectures based on convolutional neural networks, on the assumption that the amino acid features that determine LLPS are independent of their absolute position within the sequence. The models are designed to be as minimal as possible and are improved iteratively, aiming for both good performance and interpretability. We then assess the performance of the optimal models and study them through several approaches to identify protein features that may drive MLO assembly and specificity. We find that hydrophobic, disorder-promoting and charged amino acids can all play important roles in proteins that self-assemble. For proteins that preferentially condense with the partner proteins MED1 and FUS, we find that selectivity for these partners is encoded by mutually exclusive features, and that condensation with both partners is encoded by concatenation of the corresponding features. This study provides an approach based on machine learning models to identify sequence features that determine MLO formation and composition specificity. The methods proposed here may be scaled to increasingly large datasets, which could deepen our understanding of protein LLPS and potentially guide the creation of treatments for condensate-related pathologies.
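
The position-independence assumption mentioned in the abstract can be illustrated with a minimal sketch. This is not the architecture used in the thesis: the filter count, filter width, random weights, poly-alanine background and the embedded "RGG" repeat are all assumptions chosen only for illustration. The sketch shows that a 1D convolution followed by global max pooling gives the same feature vector when the same local pattern sits at different positions in a sequence.

```python
import numpy as np

# The 20 standard amino acids, used for one-hot encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x


def conv_global_max(x, filters):
    """Apply 1D convolution filters, ReLU, then global max pooling.

    x:       (length, 20) one-hot sequence
    filters: (n_filters, width, 20) convolution weights
    Returns a (n_filters,) vector with the strongest response of each
    filter anywhere in the sequence.
    """
    n_filters, width, _ = filters.shape
    n_windows = x.shape[0] - width + 1
    acts = np.empty((n_windows, n_filters))
    for start in range(n_windows):
        window = x[start:start + width]
        # Dot product of every filter with the current window.
        acts[start] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    acts = np.maximum(acts, 0.0)  # ReLU
    return acts.max(axis=0)       # pooling discards the position


# Hypothetical untrained filters: 4 filters of width 3.
rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, len(AMINO_ACIDS)))

# The same illustrative pattern placed at two different positions.
seq_a = "AAAA" + "RGGRGG" + "AAAAAAAAAA"
seq_b = "AAAAAAAAAA" + "RGGRGG" + "AAAA"

feat_a = conv_global_max(one_hot(seq_a), filters)
feat_b = conv_global_max(one_hot(seq_b), filters)
same_features = np.allclose(feat_a, feat_b)
print(same_features)
```

Because both sequences contain exactly the same set of length-3 windows, the pooled features coincide. A trained model of this kind would therefore respond to which local patterns are present rather than to where they occur.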
2024
Phase separation
Machine learning
Protein aggregation
Files in this item:
Aquistapace_Franco.pdf (open access)
Size: 5.52 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/100369