Sequence–based prediction of protein specificity in liquid–liquid phase separation with machine learning
AQUISTAPACE TAGUA, FRANCO
2024/2025
Abstract
Protein liquid-liquid phase separation (LLPS) is a complex process through which membrane-less organelles (MLOs) are formed. MLOs serve many functions in the cell, which are partly determined by their specific composition, and alterations that disrupt their proper function have been linked to various diseases. However, despite a significant amount of research and experimental data, how the specific composition of an MLO is encoded in the amino acid sequences of its components remains an open question. In this thesis, we develop machine learning models to identify universal features encoding such information in protein sequences. To train and test the models, we assemble multiple datasets from curated databases on protein LLPS and recently published experimental data. We explore multiple model architectures based on convolutional neural networks, on the assumption that the amino acid features that determine LLPS are independent of their absolute position within the sequence. The models are designed to be as minimal as possible and are improved iteratively, with the aim of achieving good performance while remaining interpretable. We then assess the performance of the optimal models and study them through several approaches to identify protein features that may drive MLO assembly and specificity. We find that hydrophobic, disorder-promoting and charged amino acids can all play important roles in proteins that self-assemble. For proteins that preferentially condense with the partner proteins MED1 and FUS, we find that selectivity for these partners is encoded by mutually exclusive features, and that condensation with both partners is encoded by concatenation of the corresponding features. This study provides an approach based on machine learning models to identify sequence features that determine MLO formation and composition specificity.
The methods proposed here may be scaled to increasingly large datasets, which could deepen our understanding of protein LLPS and potentially guide the creation of treatments for condensate-related pathologies.

| File | Size | Format |
|---|---|---|
| Aquistapace_Franco.pdf (open access) | 5.52 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/100369