PLoS Computational Biology
Public Library of Science
image
Insight into the protein solubility driving forces with neural attention
Volume: 16, Issue: 4
DOI 10.1371/journal.pcbi.1007722
  • PDF   
  • XML   
  •       
Abstract

The solubility of proteins is a crucial biophysical aspect when it comes to understanding many human diseases and to improve the industrial processes for protein production. Due to its relevance, computational methods have been devised in order to study and possibly optimize the solubility of proteins. In this work we apply a deep-learning technique, called neural attention to predict protein solubility while “opening” the model itself to interpretability, even though Machine Learning models are usually considered black boxes. Thank to the attention mechanism, we show that i) our model implicitly learns complex patterns related to emergent, protein folding-related, aspects such as to recognize β-amyloidosis regions and that ii) the N-and C-termini are the regions with the highes signal fro solubility prediction. When it comes to enhancing the solubility of proteins, we, for the first time, propose to investigate the synergistic effects of tandem mutations instead of “single” mutations, suggesting that this could minimize the number of required proposed mutations.

Raimondi, Orlando, Fariselli, Moreau, and Alexov: Insight into the protein solubility driving forces with neural attention

Introduction

Proteins have evolved within the different cellular environments to improve or preserve their functions, maintaining at the same time a degree of hydrophobicity necessary to proper fold, and enough solubility to prevent protein precipitation. Solubility is then a fundamental ingredient that must be properly balanced to maintain the protein functions and not aggregating [1]. Protein solubility is also an essential aspect in diagnostic and therapeutic applications [2, 3], as well as being a critical requirement for protein homeostasis [1, 4, 5]. Solubility deficit can hamper protein-based drug development, generating insoluble protein precipitates that can be toxic and may elicit an immune response in patients [6, 7]. Protein aggregation is considered an hallmark of more than forty human diseases, that span from neurodegenerative illness, to cancer and to metabolic disorders such as diabetes [8].

Since the soluble expression of proteins is crucial for protein production [9] for both pharmacological and research goals [10], in-silico bioinformatics models have been developed to predict i) the solubility of proteins [1013] or ii) the solubility change upon mutation [9, 12, 14, 15], in order to, respectively, i) select the most likely soluble proteins [12] and ii) run in-silico mutagenesis to increase the solubility of existent proteins [9].

The methods developed so far in both categories use various Machine Learning (ML) approaches, such as Neural Networks (NN) [10], Gradient Boosting Machines [11], Support Vector Machines (SVM) [13], Logistic Regression [12] or simpler statistical methods [9, 14]. Most of these approaches compute predictions starting from just the proteins sequence, with the exception of CamSol, which uses PDB structures [14]. Among the sequence-based features used, the most common are: i) biophysical propensity scales values (e.g. hydrophobicity, charge), ii) various forms of k-mers frequencies (such as mono, bi- or tri-peptides occurrences), iii) predictions from other methods (e.g. disorder, Secondary Structure, Relative Solvent Accessibility or aggregation predictors) and iv) global features such as sequence length and the fraction of residues exposed to the solvent.

A common issue that the methods predicting the solubility of proteins had to face is the fact that the input protein sequences may have very different lengths, and indeed building ML models able to work with protein sequences is a common task in structural bioinformatics. From the ML standpoint, this task is not trivial because the variable length of proteins poses some issues to conventional ML methods, such SVM or Random Forests. This problem is usually addressed by using sliding window techniques to predict each residue independently [16, 17], but different solutions are needed when a single prediction must be associated to an entire protein sequence [13, 14, 18], since the information content of an entire sequence needs to be shrunk into a single predictive scalar value.

Neural Networks (NN) are flexible models that can elegantly address this issue. The classical approaches consist in building a pyramid-like architecture [10] that takes the protein sequence as input and reduces it to a fixed size through subsequent abstraction layers, ending with a feed-forward sub-network that yields the final scalar prediction.

Here we propose a novel solution to this issue, which has been inspired by the neural attention mechanisms developed for Natural Language Processing and machine translation [19, 20]. Our model is called SKADE and uses a neural attention-like architecture to elegantly process the information contained in protein sequences towards the prediction of their solubility. By comparing it with state of the art methods we show that it has competitive performances while requiring as inputs just the protein sequence.

Additionally, the use of neural attention allows our model to be interpreted, showing that the learned patterns correlate with complex sequence aspects such as the presence of aggregating regions or the protein contact density, while it does not correlate with more trivial aspects such as biophysical propensity scales or solvent accessibility.

We also show that, even if it has not been specifically trained for the task, SKADE can distinguish between mutations that increase or decrease the proteins’ solubility. This, coupled with the fact that it can generate hundreds of predictions per second, makes it ideal to perform in-silico optimization of protein solubility. To show its potential in this sense, we performed a complete in-silico mutagenesis of the UPF0235 protein MTH_637, computing both the effect of all the possible single mutations and all possible pairs of tandem mutations (> 2 × 106 pairs). This allowed us to investigate the possible effects of synergistic interactions between mutations, indicating that, in certain regions of the proteins, the implementation of pairs of mutations could have a larger effect that the sum of the effects of independent mutations. Finally, we show that the predicted synergistic effects have a significant correlation with the average contact distances between residues, extracted from the protein PDB structure, suggesting that SKADE is able to catch a glimpse of complex emergent properties such as the contact density.

Materials and methods

Datasets

To train and test our model, we used the protein solubility datasets adopted in [10, 11]. Using the same training/testing data and procedure allowed us to compare the performances of SKADE with the most recently published methods. The training set contains 28972 soluble and 40448 insoluble proteins that have been annotated with the pepcDB [21] “soluble” (or subsequent stages) annotations in [12]. The test dataset contains 1000 soluble and 1001 insoluble proteins, and has been compiled by [22].

To validate the ability of SKADE to distinguish between variants increasing or decreasing the overall solubility of the protein, we adopted the dataset used in CamSol [14], which contains 142 variants known to increase or decrease the solubility of 42 proteins.

Neural attention for protein sequences

SKADE is a NN model which consists in two sub-networks: the predictor network P and the attention network A.

The NN takes as input just a the target proteins sequences, without using evolutionary information or other kinds of annotations. As a first step, each residue in the input sequence is translated into a trainable 20-dimensional embedding encoding the 20 possible amino acids. The embedded sequences are then sorted and padded, in order to maximize the efficiency of the NN computations on GPUs.

Both A and P are constituted by 2 layers of bidirectional Gated Recurrent Unit (GRU) [23] networks with 20 hidden dimensions. The GRU network runs on the input sequences, outputting 20 dimensional vector for each residue. Since the GRU is bidirectional, for each protein it provides two 20 × L tensors as output, where L is the length of the sequence. The two tensors obtained from the forward and backward pass of the bidirectional GRU are concatenated into a 40 × L and further processed by a linear layer that produces a single scalar value for each residue, obtaining a 1 × L tensor for each protein.

At this point the A and P network differ: the last layer of the predictor P has a LeakyReLU activation while the attention network A has a SoftMax activation that is globally applied over the entire 1 × L tensor T, ensuring that i=0LTi=1.

To obtain the final prediction, for each protein the scalar product aTp between the 1 × L tensors a and p, respectively obtained from the P and A sub-networks is computed, allowing the SoftMax-ed attention vector a to select the position on the vector p that should be given highest relevance for the final prediction. The resulting value is passed through a Sigmoid activation, constraining the final output of SKADE withing the 0-1 range.

Conceptually, the predictor network P should assign different solubility-related values to each residue in the protein, considering also the local sequence context. In parallel, the attention network A should learn on which protein regions its attention should be focused.

The final model has 25462 trainable parameters, uses a batch size of 1001 with an initial learning rate of 0.01. We used the Adam optimizer with a L2 regularization equal to 10−6 and it is trained for 50 epochs. The model has been implemented in Pytorch version 1.0.1 [24]. The code is freely available at https://bitbucket.org/eddiewrc/skade/src.

Validation procedure and performance evaluation

In order to compare our results with [10, 11], we reproduced their validation procedure. We trained our model on the 69420 proteins in the trainset and we tested it on the 2001 proteins in the test set from [22]. We used the widely adopted Sensitivity (Sen), Specificity (Spe), Precision (Pre), Accuracy (Acc), Matthews Correlation Coefficient (MCC), Area Under the ROC curve AUC and Area Under the Precision Recall Curve (AUPRC) metrices to evaluate the predictions. The Balanced Accuracy (BAC) is computed as the arithmetic mean of Sen and Spe.

To assess the ability of SKADE to distinguish between variants increasing or decreasing the solubility of proteins, we followed the procedure used in SODA [9].

Results

SKADE predicts protein solubility from just the protein amino acid sequence

The most recently developed tools for the protein solubility prediction [10, 11] use various kinds of features. These can be divided into sequence related features (e.g. k-mers and their frequencies of occurrence, biophysical propensity scales sequence descriptions), global features (e.g. sequence length, molecular weight, fraction of residues with certain biophysical properties) and structural features, such as secondary structure assignments (SS) and Relative Solvent Accessibility (RSA) [10, 11].

During the development of SKADE we chose to use only the protein sequences as inputs because the addition of any other kind of features would pose certain restrictions to our model. For example, in [25] it has been shown that, although extremely valuable, the use of Multiple Sequence Alignments (MSAs) as features could bias predictors towards more studied proteins (e.g. proteins for which many homologous sequences are known). Another recent study we conducted [26] showed that the use of biophysical propensity scales coupled with sophisticate ML method could be just a discrete and intrinsically limited way to find an optimized embedding description of the amino acids in the proteins, and using entirely random scales could give extremely similar results [26]. Finally, our rationale to avoid the use of global features is that, since the solubility of a protein is a crucial step for its expression and large scale manufacture, we envision SKADE as a tool for the in-silico optimization of protein solubility, and thus we deemed global features such as the protein length, the molecular weight and the absolute charge as not suitable for our framework, since they are not characteristics on which we want to act directly while perform the optimization.

S6 Fig shows a PCA of the learned embeddings, and S2 Text contains their actual 20-dimensional values.

SKADE positively compares with state of the art predictors

We trained and tested SKADE following the same procedure adopted in [10, 11], which are two of the most recently developed solubility predictors. This allowed us to compare SKADE with the most relevant state of the art methods, extending the benchmark initially proposed in [22]. This procedure involved a training set containing 69420 proteins and a test set of 2001 proteins. The results obtained by SKADE and the comparison with other methods is shown in Table 1.

Table 1
Table showing the comparison between SKADE and the state of the art tools benchmarked in [22].
MethodsSenSpePreAccMCCAUCAUPRC
SKADE66817873478282
PaRSnIP76727074488280
DeepSol S175716973468181
PROSO II67686964347471
CCSOL545451548--
SOLPRO6258516020--
PROSO5857545816--
RPSP525144523--
SCM6557426021--

From DeepSol [10] we reported the performance of their S1 model, which is the DeepSol version that uses only protein sequence information. PaRSnIP and all the other method, on the other hand, use any kind of features, some of which include information related to Secondary Structures, Relative Solvent Accessibility and other biophysical aspects. Notwithstanding this limitation, Table 1 shows that SKADE has the best AUPRC and the best AUC, on par with PaRSnIP.

The role of attention

As shown in Fig 1, SKADE’s NN is divided into two sub-networks. The sub-network A is used to compute the per-residues SoftMax-ed attention values a while the sub-network P is used to compute the per-residue predictions p. The final solubility prediction for each protein is obtained as σ(aTp), where σ is a Sigmoid activation. This value is thus a linear combination between p and a, and the A networks is responsible to assign the coefficients that are used to weight the per-residues predictions proposed by P. This is analogous to a bias-less Logistic Regression in which the weights are dynamically predicted for each sample instead of being learned once for all, and it allows SKADE to work with input sequences with arbitrary length, making it suitable to bioinformatics applications.

Figure showing the structure of the NN.
Fig 1
The protein sequence is translated into a 20 dimensional embedding and then passed to the predictor network P (left) and the attention network A (right). Each of these subnetworks contain 2 layers of bi-directional GRU network, followed by a feed-forward. The predictor ends with a LeakyReLU activation, while the attention network has a SoftMax activation. Finally, the 1 × L outputs of the two networks are reduced to a predictive, probability-like, scalar value by a dot product operation, followed by a SIgmoid activation.Figure showing the structure of the NN.

Moreover, since σ(aTp)=σ(iLaipi), with σ monotonically increasing, residues i with a positive value of ai × pi steer the prediction towards the class 1 (soluble), while residues with negative values are actually “voting” for the class 0. This means that these per-residue values can be considered as the solubility profile of the protein (see S3S5 Figs for the attention and prediction profiles of the proteins Q8TC59, Q9HBE1 and P25984).

In order to analyse the behavior of SKADE and to investigate the learned patterns of attention, we tried to correlate the per-residue attention and solubility profiles to known biophysical properties, such as hydrophobicity, polarity, charge, volume and the propensities of the residues being in α-helical or β -sheet conformation, but no correlation with these values has been detected (see Table A1 in S1 Text), indicating that the A and P networks behavior cannot be directly associated with trivial sequence properties. We also tested the correlation of the profiles against in-house Relative Solvent Accessibility (RSA) predictions, obtaining a Pearson correlation of r = 0.0237.

SKADE solubility profiles detect aggregation patches

We then tested the predictive power of SKADE’s solubility profiles on the AMYL dataset [27], which contains 34 amyloidogenic proteins whose aggregating regions have been used in [28] to benchmark the performances of per-residue in-silico aggregation predictors. The AMYL dataset contains annotations indicating the involvement of each residue in β-amyloidosis aggregation, for a total of 7732 residues, 2599 of which are responsible for aggregation.

SKADE’s solubility profiles show an interesting signal towards the prediction of aggregating regions in these proteins. The profiles have an AUC of 65 even though SKADE has not been trained on the proteins in the AMYL dataset nor on the β -amyloidosis aggregation task at any point. To better contextualize the magnitude of this signal, in Table 2 we compared the performance of SKADE’s solubility profiles with state of the art tools that have been specifically developed to predict aggregation. We can see that even though the solubility profiles of SKADE are a byproduct of the neural attention model and have been obtained in a completely unsupervised way with respect to the task at hand, SKADE provides quite competitive performance when it comes to identifying aggregating regions in proteins.

Table 2
Results are reported from [28].
Comparison of the performance of SKADE with state of the art aggregation predictors on the amyl33 dataset [27].
MethodSenSpeBACMCC
AgMata43846625
PASTA 2 (85 specificity)41856324
PASTA 2 (90 specificity)30906022
AMYLPRED239846222
SKADE55716320
MetAmyl52716217
Tango14965514
Aggrescan35795713
FishAmyloid14945410
FoldAmyloid2187548

We can thus conclude that, although the neural attention profiles do not show correlation with trivial biophysical aspects of the protein sequence, the ai × pi per-residue solubility profiles that are responsible for the final prediction show an interesting correlation with a complex and still only partially understood behavior such as β-amyloidosis aggregation of proteins.

The N- and C-termini are the most relevant regions for the protein solubility prediction

Further analysing the data extracted from the neural attention used in SKADE, in Fig 2, we show the mean and median values of attention a (red) and prediction p (blue) in function of the relative position of the residues in the sequence. S1 Fig shows the distributions of the ai × pi per-residues solubility profile value, where i is the residue position in each sequence. From both figures it clearly appears that SKADE focuses its attention to the N- and C-termini of the proteins, indicating that they contain signal for the prediction of the protein solubility. To investigate whether this behavior is an artifact of the recurrent modules in our NN, we tested SKADE on the 2001 sequences in the test set, but first removing the initial and final 20% of residues from each of them. The obtained AUC is very close to random (0.55), indicating that the beginning and the end of the sequences might indeed carry an important signal for the prediction of the protein solubility. On the contrary, when we test SKADE on the 2001 sequences in the test set after removing the central residues (located in the relative sequence positions between the 20th and 80th percentiles), the performances are very similar to normal (0.81 of AUC).

Plot showing the mean (solid red) and median (light red) attention values, compared with the mean (solid blue) and median (light blue) prediction values, in function of the position in the sequence.
Fig 2
On average, SKADE assigns higher values to positions close to the N- and C-termini.Plot showing the mean (solid red) and median (light red) attention values, compared with the mean (solid blue) and median (light blue) prediction values, in function of the position in the sequence.

To further ensure that this behavior is due to the signal carried by the N- and C-termini regions and not an artifact of SKADE’s architecture, we repeated the same experiments by using the DeepSol S1 webserver to predict the 2001 proteins in the test set after removing i) the initial and final 20% of residues from each sequence, and ii) the central portion of each protein (from the 20th to the 80th percentile). Similarly to the results obtained with SKADE, DeepSolS1 produces almost random predictions when the N- and C-termini are removed from the input sequences (AUC = 0.53), and exactly normal predictions when the central part of the protein is removed (AUC = 0.81). This shows that also DeepSol S1, which is based on a completely different architecture from SKADE, implicitly exploits the information contained in the extremities of the sequences to perform its prediction.

We also searched in literature whether the relevance of N- and C-termini was already established, and although we did not find exhaustive studies analysing this behavior, a number of papers investigated specific cases. For example, it has been shown that mutating the N-terminus of the mitochondrial aminoacyl-tRNA synthetases [29], sperm whale myoglobin [30] and hen egg-white lysozyme [31] enhances their expression and solubility.

Unsupervised prediction of solubility change upon mutation

One of the possible applications of protein solubility predictors is the in-silico optimization of protein sequences to enhance their solubility, for example by selecting the smallest possible set of variants able to increase the solubility of the sequence while minimizing the structural and functional alterations to the original protein.

To do so, it is necessary that the prediction methods are able to distinguish between variants that increase or decrease the overall solubility of the sequence. Unfortunately, very little experimental data is available in this sense, and in particular, to the best of our knowledge, no data concerning the effect of multiple mutations or insertions/deletions on the solubility of proteins is available.

To evaluate the ability of SKADE to identify mutations increasing or decreasing protein solubility, we used the CamSol [14] dataset, which contains 56 experimentally validated mutations. S7 Fig shows the distribution of the variants in CamSol on the corresponding protein sequences.

Since SKADE is designed to take the whole target protein sequence as input, to evaluate the solubility change upon mutation we predicted both the wildtype and the mutated target sequence, computing the difference in predicted solubility between the wildtype (WTs) and the mutant (MUTs) as ΔS = MUTsWTs.

In Table 3 we used the CamSol dataset to compare the performance of SKADE with methods designed for the task of predicting the effect of mutations on the protein solubility. To do so we reproduced the benchmark performed in SODA [9], where 4 existing predictors have been tested. SKADE correctly identifies 69% of the mutations increasing solubility and 100% of the mutations decreasing it, with an AUC of 82 and an AUPRC of 99%.

Table 3
Results have been preported form [9].
Table showing the comparison of the unsupervised predictions of SKADE on the CamSol dataset of mutations.
MethodsAccuracyTotal
SODA100.056/56
CamSol9654/56
SKADE7140/56
SolPro7140/56
PROSO II5732/56

An important consideration is that while methods such as SODA (see Table 3) have been specifically designed and trained to discriminate mutations increasing or decreasing the solubility, SKADE faces this task in a completely unsupervised way, since it has been trained to predict the solubility of entire protein sequences.

In-silico mutational screening and analysis of the synergistic effects of mutations

SKADE is a NN and thus it can be easily run in parallel on GPUs, computing predictions for hundreds or thousands of sequences per second. This can be used to compute in-silico mutational screening of proteins, as shown in Fig 3 for UPF0235 protein MTH_637 (Uniprot ID: O26734), which is present in the test dataset [22]. The protein is annotated as soluble in the dataset, and SKADE predicts it correctly (score of 0.733). We implemented each possible mutation in O26734 and we computed the the change in solubility ΔS = MUTsWTs, meaning that positive ΔS (shown in red) indicate that the mutation increases the solubility of the protein, while negative ΔS values (in blue) decrease it. Fig 3 shows that, since the predicted solubility of O26734 is already very high, most of the mutations have the effect of decreasing the predicted solubility. In particular, there are few regions with very high effect, such as the residues from 1 to 25, from 49 to 51 and close to the C-terminal. Among the possible mutations, it appears that most of the mutations of a wildtype residue into a Met are predicted to heavily decrease the solubility. We listed the mutations with the larges ΔS in S1 Text.

In-silico mutational screening of UPF0235 protein MTH_637 (O26734), showing the effect on the solubility of the protein of every possible mutation.
Fig 3
The change in solubility is computed as ΔS = MUTsWTs, meaning that positive ΔS (shown in red) indicate that the mutation increases the solubility of the protein, while negative ΔS values (in blue) decrease it.In-silico mutational screening of UPF0235 protein MTH_637 (O26734), showing the effect on the solubility of the protein of every possible mutation.

An aspect that it is not usually considered when performing classical in-silico mutational screenings is the difference between the effects of two mutations mi and mj when they are considered independently or in combination (i.e. both implemented at the same time), thus investigating the possible synergistic effects of mutations. This screening of pairs of mutations is usually not doable due to the extremely high number of possibilities that needs to be tested, but SKADE is fast enough to allow also the exhaustive analysis of the effects combinations of variants. As an example, we ran this experiment on the protein O26734, which is 103 residues long, and we tested the solubility changes ΔSi,j due to all the possible pairs of variants mi and mj, for a total of (103 × 20 × 102 × 20)/2 = 2101200 mutations. We then computed the change in solubility due to two independent mutations mi and mj from Fig 3 as ΔSsingle = ΔSi + ΔSj, and the change in solubility obtained when both mi and mj are implemented in the sequence ΔSpair = ΔSi,j = (MUTi,jWT). We then computed the synergistic versus individual effects as ΔE = ΔSsingle − ΔSpair for each pair of variants mi and mj.

In Fig 4 we averaged the ΔE for each position of O26734, indicating on which pairs of sequence positions the synergistic effects have a stronger effect than two independent variants, averaged over all the possible mutations that could be implemented in those positions. Fig 4 shows that SKADE’s model predicts that most of the positions in O26734 are more likely affected by synergistic effects of mutations in distant parts of the sequence, with respect to two single mutations acting independently. This might indicate that the most effective way to optimize proteins’ solubility could involve the study of the synergistic effects of mutations, instead of implementing variants whose effect on solubility has been assessed in isolation.

Heatmap showing the average single versus synergistic effects ΔE = ΔSsingle − ΔSpair on each position of the O26734 protein.
Fig 4
Negative (blue) values indicate that synergistic effects are stronger, while positive (red) values indicate that the effect of independent mutations is higher.Heatmap showing the average single versus synergistic effects ΔE = ΔSsingle − ΔSpair on each position of the O26734 protein.

In S2 Fig we show the mean predicted synergistic effects, averaged over the possible pairs of amino acids mutations. Residues such as A, N, P, W, Y and I are on average less prone to show synergistic effects, while C, D, Q, M and V are, on average, involved in stronger synergistic effects. Interestingly, among these residues, Cs appears to experience little influence from mutations of Qs, and Hs are generaly not influenced by mutations of Ps and Rs.

In Fig 5 we analysed the mean synergistic effects on the protein O26734 in function of the sequence separation |ij| between pairs of mutated residues at positions i, j. We see that residues which are very close (1 ≤ |ij| ≤ 10) or very distant (90 ≤ |ij | ≤ 100) tend to experience the highest synergistic effects (blue lines). In order to find an explanation to this behavior, we compared the magnitude of the synergistic effects with the distribution of the actual 3D distances between residues extracted from the 1JRM pdb structure. As shown in Fig 5, we noticed that a certain correlation exists between the magnitude of the synergistic effects and the mean contact distance between residues (red lines). The Pearson correlation between the mean predicted synergy and mean Angstrom distance between the residues’ C-β atoms is r = 0.29 (p-value = 0.003) and the Spearman’s correlation is r = 0.36 (p-value = 0.0002). This shows that the attention-based architecture on which SKADE is built is indeed able to catch a glimpse of more complex structural aspects of proteins, such as the distribution of contacts. S8 Fig shows a scatter plot version of the same data.

Plot showing the correlation between the average spatial distance between residues at a certain sequence separation |i − j| (red) with the magnitude of the synergistic effects between tandem mutations at the same positions i, j (blue).
Fig 5
Solid colors indicate the mean, while light colors indicate the median values.Plot showing the correlation between the average spatial distance between residues at a certain sequence separation |ij| (red) with the magnitude of the synergistic effects between tandem mutations at the same positions i, j (blue).

Discussion

In this study we propose a novel Neural Network (NN) architecture for the prediction of the solubility of protein sequences, and we exploit its attention-based interpretability to investigate the molecular forces driving protein solubility. This model, called SKADE, is based on a neural attention mechanism inspired from machine translation tasks and takes as input only the target protein sequence. SKADE’s performance positively compares with state of the art solubility predictors, and the neural attention architecture offers some opportunities for the interpretation of the predictions, in a first step towards opening the Machine Learning (ML) black-box.

In this study, we indeed analyzed the attention profiles learned by the model during training to investigate whether they showed a significant correlation with biophysical properties of proteins that may relate to the solubility of the chain. Interestingly, we did not find any correlation with trivial biophysical characteristics of amino acids, such as described in biophysical propensity scales, but we showed that the solubility profiles extracted from the model can be used as an unsupervised predictor for aggregation-prone regions in proteins. This suggests that the attention-like mechanism in SKADE is indeed learning non-trivial biophysical emergent characteristics of the protein sequences and that uses them as building blocks to compute the final solubility prediction.

From the analysis of the attention profiles it also appears that the portions of the protein that carry the strongest signal when it comes to predict the protein solubility are the ones closer to the N- and C-termini, more specifically the first and last 20% of residues of the sequences.

Protein solubiliity predictors are generally used to determine which minimal set of mutations could increase the overall solubility of the protein, thus facilitating its expression and production. In our analysis we also show that, although SKADE has not been trained for the task, the model can distinguish between mutations that increase or decrease the protein solubility, and that our model can thus be used to perform in-silico mutational screening with the goal of optimizing the solubility of proteins by selecting an optimal set of mutations.

When solubility predictors are used to screen mutations to increase protein solubility, only single mutations are usually analysed, because the number of possible pairs (or triplets, quadruplets) of mutations grows exponentially. One of the advantages of SKADE is that its NN implementation can be heavily parallelized on GPU, and thus an unprecedented amount of predictions can be computed in very short time. SKADE can thus compute the in-silico mutational screening of pairs or triplets of mutations in minutes, thus including the possible synergistic effects of multiple mutations in this analysis. To show an example of this, we predicted all the possible pairs of mutations on the fairly short protein O26734, and we compared the solubility changes due to couples of independent mutations with respect to pairs of tandem mutations. From this analysis it appears that many regions of O26734 are predicted to be affected by synergistic interactions between mutations, and we thus hypothesize that modelling the synergistic effects of mutations may provide an optimal way towards the optimization of proteins with respect to specific biophysical desiderata.

Finally, while analysing the distribution of the magnitude of the synergistic effects with respect to the sequence separation, we noticed that the synergies predicted by SKADE from the O26734 protein sequence have a significant correlation with the average contact distance between residues, extracted from the corresponding PDB sequence. SKADE is thus able to catch a glimpse of a complex emergent aspect of protein sequences, such as their contact density, from the sequence alone.

Acknowledgements

DR is grateful to A. L. Mascagni for the constructive discussion and to A. Reynolds for the inspiration.

References

1 

P Ciryam, GG Tartaglia, RI Morimoto, CM Dobson, M Vendruscolo. . Widespread aggregation and neurodegenerative diseases are associated with supersaturated proteins. Cell reports. 2013;5(3):, pp.781–790. , doi: 10.1016/j.celrep.2013.09.043

2 

CC Lee, JM Perchiacca, PM Tessier. . Toward aggregation-resistant antibodies by design. Trends in biotechnology. 2013;31(11):, pp.612–620. , doi: 10.1016/j.tibtech.2013.07.002

3 

JM Perchiacca, PM Tessier. . Engineering aggregation-resistant antibodies. Annual review of chemical and biomolecular engineering. 2012;3:, pp.263–286. , doi: 10.1146/annurev-chembioeng-062011-081052

4 

WE Balch, RI Morimoto, A Dillin, JW Kelly. . Adapting proteostasis for disease intervention. science. 2008;319(5865):, pp.916–919. , doi: 10.1126/science.1141448

5 

R Kundra, P Ciryam, RI Morimoto, CM Dobson, M Vendruscolo. . Protein homeostasis of a metastable subproteome associated with Alzheimer’s disease. Proceedings of the National Academy of Sciences. 2017;114(28):, pp.E5703–E5711. , doi: 10.1073/pnas.1618417114

6 

MC Manning, DK Chou, BM Murphy, RW Payne, DS Katayama. . Stability of protein pharmaceuticals: an update. Pharmaceutical research. 2010;27(4):, pp.544–575. , doi: 10.1007/s11095-009-0045-6

7 

JW Bye, L Platts, RJ Falconer. . Biopharmaceutical liquid formulation: a review of the science of protein stability and solubility in aqueous environments. Biotechnology letters. 2014;36(5):, pp.869–875. , doi: 10.1007/s10529-013-1445-6

8 

F Chiti, CM Dobson. . Protein misfolding, amyloid formation, and human disease: a summary of progress over the last decade. Annual review of biochemistry. 2017;86:, pp.27–68. , doi: 10.1146/annurev-biochem-061516-045115

9 

L Paladin, D Piovesan, SC Tosatto. . SODA: prediction of protein solubility from disorder and aggregation propensity. Nucleic acids research. 2017;45(W1):, pp.W236–W240. , doi: 10.1093/nar/gkx412

10 

S Khurana, R Rawi, K Kunji, GY Chuang, H Bensmail, R Mall. . DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018;34(15):, pp.2605–2613. , doi: 10.1093/bioinformatics/bty166

11 

R Rawi, R Mall, K Kunji, CH Shen, PD Kwong, GY Chuang. . PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics. 2017;34(7):, pp.1092–1098. , doi: 10.1093/bioinformatics/btx662

12 

P Smialowski, G Doose, P Torkler, S Kaufmann, D Frishman. . PROSO II–a new method for protein solubility prediction. The FEBS journal. 2012;279(12):, pp.2192–2200. , doi: 10.1111/j.1742-4658.2012.08603.x

13 

F Agostini, D Cirillo, CM Livi, R Delli Ponti, GG Tartaglia. . cc SOL omics: A webserver for solubility prediction of endogenous and heterologous expression in Escherichia coli. Bioinformatics. 2014;30(20):, pp.2975–2977. , doi: 10.1093/bioinformatics/btu420

14 

P Sormanni, FA Aprile, M Vendruscolo. . The CamSol method of rational design of protein mutants with enhanced solubility. Journal of molecular biology. 2015;427(2):, pp.478–490. , doi: 10.1016/j.jmb.2014.09.026

15 

CN Magnan, A Randall, P Baldi. . SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics. 2009;25(17):, pp.2200–2207. , doi: 10.1093/bioinformatics/btp386

16 

DW Buchan, F Minneci, TC Nugent, K Bryson, DT Jones. . Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic acids research. 2013;41(W1):, pp.W349–W357. , doi: 10.1093/nar/gkt381

17 

D Raimondi, G Orlando, R Pancsa, T Khan, WF Vranken. . Exploring the sequence-based prediction of folding initiation sites in proteins. Scientific reports. 2017;7(1):, pp.8826, doi: 10.1038/s41598-017-08366-3

18 

D Raimondi, G Orlando, Y Moreau, WF Vranken. . Ultra-fast global homology detection with Discrete Cosine Transform and Dynamic Time Warping. Bioinformatics. 2018;1:, pp.8.

19 

Rush AM, Chopra S, Weston J. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:150900685. 2015;.

20 

A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, AN Gomez, et al. Attention is all you need. In: Advances in neural information processing systems; 2017 p. , pp.5998–6008.

21 

HM Berman, JD Westbrook, MJ Gabanyi, W Tao, R Shah, A Kouranov, et al. The protein structure initiative structural genomics knowledgebase. Nucleic acids research. 2008;37(suppl_1):, pp.D365–D368. , doi: 10.1093/nar/gkn790

22 

CCH Chang, J Song, BT Tey, RN Ramanan. . Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Briefings in bioinformatics. 2013;15(6):, pp.953–962. , doi: 10.1093/bib/bbt057

23 

Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:14061078. 2014;.

24 

A Paszke, S Gross, S Chintala, G Chanan, E Yang, Z DeVito, et alAutomatic differentiation in PyTorch. 2017;.

25 

G Orlando, D Raimondi, W Vranken. . Observation selection bias in contact prediction and its implications for structural bioinformatics. Scientific Reports. 2016;6, doi: 10.1038/srep36679

26 

D Raimondi, G Orlando, WF Vranken, Y Moreau. . Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis. Scientific Reports. 2019;9(1):, pp.1–11. , doi: 10.1038/s41598-019-53324-w

27 

AC Tsolis, NC Papandreou, VA Iconomidou, SJ Hamodrakas. . A consensus method for the prediction of ‘aggregation-prone’peptides in globular proteins. PLoS One. 2013;8(1):, pp.e54175, doi: 10.1371/journal.pone.0054175

28 

I Walsh, F Seno, SC Tosatto, A Trovato. . PASTA 2.0: an improved server for protein aggregation prediction. Nucleic acids research. 2014;42(W1):, pp.W301–W307. , doi: 10.1093/nar/gku399

29 

A Gaudry, B Lorber, A Neuenfeldt, C Sauter, C Florentz, M Sissler. . Re-designed N-terminus enhances expression, solubility and crystallizability of mitochondrial protein. Protein Engineering, Design & Selection. 2012;25(9):, pp.473–481. , doi: 10.1093/protein/gzs046

30 

EA Ribeiro, CH Ramos. . Circular permutation and deletion studies of myoglobin indicate that the correct position of its N-terminus is required for native stability and solubility but not for native-like heme binding and folding. Biochemistry. 2005;44(12):, pp.4699–4709. , doi: 10.1021/bi047908c

31 

S Mine, T Ueda, Y Hashimoto, T Imoto. . Improvement of the refolding yield and solubility of hen egg-white lysozyme by altering the Met residue attached to its N-terminus to Ser. Protein engineering. 1997;10(11):, pp.1333–1338. , doi: 10.1093/protein/10.11.1333


7 Jan 2020

Dear Dr Raimondi,

Thank you very much for submitting your manuscript 'Insight into the protein solubility driving forces with neural attention' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Emil Alexov

Guest Editor

PLOS Computational Biology

Ruth Nussinov

Editor-in-Chief

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This is an interesting manuscript in the burgeoning field of protein solubility prediction. It is based on machine learning models, and produces results in line with other available methods. The authors describe their use of ML methods as novel in the field and this may be the case, I am afraid that I am not expert in ML methods. What I regard as novel are a couple of observations made in the manuscript, and that is where I would like to see more work carried out to clarify the results.

First, a statement is made to the effect that the predictions / model are not based on the commonly used biophysical properties (of amino acids). But, I do not see how this can be asserted if amino acids themselves are central to the prediction scheme - as I see it, sequence is the input to the training algorithm.

Second, and of the most interest to me, is the observation that N-t and C-t seem to possess (Fig. 2) more or less all the predictive power. I would not be surprised to find a tendency towards greater predictive power, but a flat line (Fig. 2) over the central 70% of the sequence I find amazing. Is it possible that the method is learning sequences (i.e. homology) and that this is best expressed at NT and CT when sequences of variable lengths are mapped to percentiles?

My own speculation is probably wrong, but I feel that the authors need more work here. e.g. What happens if the standard bioinformatics methods for reducing to non-redundant training set is used?

In the data reported for SKADE analysis of the CAMSOL mutation set, are all the mutations that affect solubility in the NT and CT regions - as might be expected given the results of Fig. 2?

Similarly, the synergistic mutations in later figures, are they also in NT and CT regions?

Reviewer #2: The paper presents a novel computational approach to predicting protein and peptide solubility from amino acid sequence alone SKADE. The model behind the method is a Recurrent Neural Network with soft self-attention and dot-product score function, built using Gated Recurrent Units. It is implemented in Python using Pytorch framework and supplies a pre-trained network for immediate applications.

SKADE is a great example of a neural network design without explicit feature engineering. The resulting model is comparable in performance or outperforming (depending on metrics) state of the art methods such as PaRSnIP and DeepSol that rely on knowledge-based and engineered features. The distributions of weights and scores along protein sequences suggest that N- and C-termini are most important determinants of protein and peptide solubility.

In addition, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, making it a useful and fast method for scanning mutagenesis, not limited to single mutations. Testing the effects of protein truncation on solubility is also possible with SKADE.

Major:

1. Although the model provides per-residue attention values and scores, interpretability of the methods remains limited. The only conclusion is that N- and C- termini are more important, but that is about it. Besides, these are the output layers of the network, what is inside is still a “black box” in terms of interpretability, I suggest deemphasizing this point in abstract and conclusions. The lack of correlations with biophysical properties contradicts this conclusion.

2. It is quite surprising to see that there is no correlation of attention and solubility profiles to any physico-chemical and/or structural properties of amino acids. Particularly, because these features work quite well in other solubility prediction methods. I think Table 1 in supporting information files does not comprehensively cover the biophysical features, for instance aggregation propensity, SASA. Besides, it is not clear how the correlations were calculated. Could the lack of correlation be related to scalability of attention/profile with sequence length? I.e. values of a short protein could be incomparable to values of a large protein.

3. According to neural network architecture it seems that the Embedding for converting 20 amino acid categories into a 20-dimensional tensor acts simply as a One Hot encoder. If there is any actual embedding that is learnt during training could you visualize the resulting amino acid similarities? Otherwise I suggest renaming embedding to encoding.

4. Python 2.7 is not acceptable by any means. This Python branch reached end of life on January 1 2020, is no longer supported. Moreover, is currently being removed from many computer systems for security concerns because it is unsupported. See https://devguide.python.org/#status-of-python-branches . It looks like there should not be any reasons not to update the code to Python 3.

5. Please format code according to PEP8 https://www.python.org/dev/peps/pep-0008/ Automatic conversion with autopep8 --aggressive could be used, for instance. Note that there is a lot of imported but unused modules, variables, and other issues that a lint program could fix, please also run pylint to get rid of the software issues.

Minor:

1. There is an imbalance between soluble and insoluble classes, since you have not used weighted sampling. To what extent do you think this imbalance affected the results? I suspect this could be the reason behind performance gap with respect to positive/negative class prediction separating SKADE and PaRSnIP.

2. The details of model convergence should be provided in supporting information. I could see that tensorboard was used in the program, so it should be possible to include those plots.

3. Line 98, why i<0 ?

4. Line 298: According to uniprot the protein should be 104, not 103 residues long https://www.uniprot.org/uniprot/O26734

5. The equation in line 300 is not clear, were the pairs of mutations generated including two mutations on the same position? Shouldn’t it be something like 104x19x19x103/2?

6. Current Pytorch v 1.3.1, version was 1.0.1 used in the paper. Model needs to be recalculated

7. Please submit the model to kipoi.org repository, that would make it much easier to run it

8. Peptide N- and C-termini are effectively charged residues in terms of their physico-chemical characteristics affecting solubility, i.e. Ala is hydrophobic, but will have a charge on the N-termini. However, it is not treated/encoded in any way different than any other amino acid. Was the reasoning behind it to an unbiased representation in the model or something else?

9. What is BAC in table 2?

10. Line 65, correlation is reported as AUC, which is not convenient

11. Line 319: “my” > “by”

12. In my opinion it is better to show dependence of synergetic effect on spatial distance directly as a scatter plot to emphasize on the correlation, particularly in proximity

13. In Figure 3 according to heatmap the first WT residue in O26734 should be V (the only white shaded residue), however this is not the case. What gives?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biologydata availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Alexander Goncearenco


27 Jan 2020

Submitted filename: answersReviewers.pdf

10 Feb 2020

Dear Dr. Raimondi,

We are pleased to inform you that your manuscript 'Insight into the protein solubility driving forces with neural attention' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch within two working days with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Emil Alexov

Guest Editor

PLOS Computational Biology

Ruth Nussinov

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The authors responded to all my questions.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biologydata availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Alexander Goncearenco


6 Apr 2020

PCOMPBIOL-D-19-02103R1

Insight into the protein solubility driving forces with neural attention

Dear Dr Raimondi,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://www.researchpad.co/tools/openurl?pubtype=article&doi=10.1371/journal.pcbi.1007722&title=Insight into the protein solubility driving forces with neural attention&author=Daniele Raimondi,Gabriele Orlando,Piero Fariselli,Yves Moreau,Emil Alexov,Ruth Nussinov,Emil Alexov,Ruth Nussinov,Emil Alexov,Ruth Nussinov,Emil Alexov,&keyword=&subject=Research Article,Physical Sciences,Materials Science,Material Properties,Solubility,Biology and Life Sciences,Biophysics,Physical Sciences,Physics,Biophysics,Computer and Information Sciences,Neural Networks,Biology and Life Sciences,Neuroscience,Neural Networks,Computer and Information Sciences,Artificial Intelligence,Machine Learning,Biology and Life Sciences,Molecular Biology,Macromolecular Structure Analysis,Protein Structure,Protein Structure Prediction,Biology and Life Sciences,Biochemistry,Proteins,Protein Structure,Protein Structure Prediction,Computer and Information Sciences,Artificial Intelligence,Machine Learning,Support Vector Machines,Research and Analysis Methods,Extraction Techniques,Protein Extraction,Biology and Life Sciences,Genetics,Mutagenesis,