Model Quality Assessment in Membrane Proteins Using Predicted Lipid Accessibility Profiles

Mukta Phatak1,3 and Jarosław Meller1,3,4*

1Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45221, USA
3Department of Environmental Health, University of Cincinnati, Cincinnati, OH 45267
4Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation, OH 45229

Abstract

Today, membrane proteins dominate the class of drug targets because of their key role in signal transduction as well as transport of ions and small molecules across the cell membrane. In order to understand the function of proteins, it is necessary to understand their 3D structure. Compared to soluble proteins, relatively few high-resolution membrane protein structures have been resolved experimentally. However, despite the paucity of 3D protein structures, we can design computational methods by utilizing protein sequence data, which is readily available. Towards the bigger goal of predicting the 3D structure of a protein, we adopt a step-by-step approach and first predict intermediate structural attributes of a protein structure from its sequence. For each residue in the transmembrane domains (TMDs) of a protein, we capture the level of exposure to the lipid in terms of relative lipid accessibility (RLA). We have developed robust predictors for RLA of membrane proteins using low-complexity Support Vector Regression (SVR) models, which are capable of learning from a limited number of examples and thereby minimize the risk of overfitting. Our results indicate that RLA can be predicted with a correlation coefficient (CC) of about 0.5 between the observed and the predicted values. Further, we generated multiple decoy models for proteins in the test set by swapping observed RLA values of the transmembrane (TM) helices. We hypothesize that RLA prediction profiles are well-correlated with the observed RLA values, and therefore can be used to discriminate near-native models from non-native decoy models.
We ranked all the models based on the CC between the observed and the predicted RLA. Our results indicate that the CC between the predicted and the observed RLA was highest for the native model compared to the decoy models. The results support the hypothesis that, given a list of models, sufficiently accurate RLA predictions can be used to narrow the list down to those models which are consistent with the predicted patterns. This will facilitate further efforts to improve de novo and template-based prediction of membrane protein structure.

Introduction

Membrane proteins are key regulators of many cellular and physiological processes and they represent a significant fraction of the entire proteome. Today, membrane proteins dominate the class of drug targets because of their critical role in signal transduction as well as transport of ions and small molecules across the cell membrane. Structural and functional studies on membrane proteins could lead to novel and improved pharmaceutical treatments for a broad range of diseases. In order to understand the function of proteins, it is necessary to understand their 3D structure. Compared to soluble proteins, relatively few high-resolution membrane protein structures have been resolved experimentally. However, despite the paucity of 3D protein structures, we can design computational methods by utilizing protein sequence data, which is readily available. The computational prediction of membrane protein structures, and their functional attributes, has therefore become an important alternative and complementary tool for membrane protein studies. However, the number of examples from which to learn is limited, making applications of statistical and machine learning techniques much more challenging in this case. In this work, we focus on alpha-helical transmembrane (TM) proteins which span the lipid bilayer. There are several successful methods for predicting TMD boundaries given a membrane protein sequence.
SOSUI [1], TopPred II [2], TMpred [3] and Minnou [4] are a few such examples. Once the part of a sequence corresponding to the TMDs is located, we then need to identify the arrangement of TM helices with respect to each other. Towards the bigger goal of predicting the 3D structure of a protein, the next step is to predict the overall 3D topology of TM helices. In order to achieve this goal, we adopt a step-by-step approach and first predict intermediate structural attributes of a protein structure from its sequence. For each residue in the TMDs of a protein, we capture the level of exposure to the lipid in terms of relative lipid accessibility (RLA). For this analysis, we focus on residues located in the TMDs, disregarding the residues in the non-TM loop structures. We have developed a robust method for predicting the extent of lipid exposure of amino acid residues in TM proteins in terms of lipid accessibility. To solve the underlying regression problem, we developed robust low-complexity support vector regression (SVR) models, which are capable of learning from a limited number of examples and thereby minimize the risk of overfitting. The primary aim of our work is to explore these one-dimensional (1D) lipid accessibility prediction profiles in support of further 3D structural and functional studies of membrane proteins. Here, we evaluate the efficacy of the predicted RLA profile for discriminating incorrectly folded non-native models from the native ones. If predicted RLA profiles are well-correlated with the observed RLA values, they can facilitate template-based structure prediction methods. We put this hypothesis to the test by generating decoy sets for membrane proteins for which the native 3D structure is resolved. We then computed the Correlation Coefficient (CC) between the observed and the predicted RLA profiles. We found that the CC between the predicted and observed RLA was highest for the native structure compared to the observed RLA from the decoys.
The results obtained underscore the importance of such 1D prediction protocols for selecting the correct template for the target protein structure in template-based structure prediction methods. RLA prediction profiles can potentially be applied as a filter to reduce the sample size of thousands of putative models generated by de novo structure prediction methods.

Materials and Methods

Training and test data

Carefully designing a representative training data set is a vital part of building any machine learning based predictive model. Due to the difficulties in applying experimental techniques, the number of high-resolution structures of membrane proteins that have been solved to date is limited compared to soluble proteins. Hence, in the case of membrane proteins, the problem is even more challenging because of the limited number of examples available to learn from. Here, we used the MPtopo [5] and PDB_TM [6] membrane protein databases to generate a non-redundant yet representative set of protein chains with resolved 3D structures, from which we constructed a training set for RLA (RD) predictors. Redundant entries were removed from the dataset using BLAST [7] sequence alignment. For this study we developed a set of 71 non-redundant alpha-helical protein chains with 6,049 residues in the TM domains available for cross-validated training purposes. Next, we created an independent test set using the PDB_TM database. Sequence homology with respect to the training set was evaluated to make sure that the proteins in the test set are non-redundant with respect to all the chains in the training set. This resulted in a non-redundant set of 49 chains with a total of 3,826 residues in the TM domains.

RLA Computation

Relative Lipid Accessibility (RLA): The RLA of amino acid residue i is defined as follows:

$$\mathrm{RLA}_i = 100 \times \frac{\mathrm{LA}_i}{\mathrm{MSA}_i}\ \%$$

where LA_i is the lipid-exposed surface area observed in a given structure, normalized by MSA_i, which is the maximum achievable surface area for that type of amino acid [8].
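As a minimal sketch of the RLA definition, the following Python snippet normalizes a lipid-exposed area by a per-residue maximum surface area. The function name, the dictionary of maximum areas, and the clipping to the 0-100% range are assumptions made for illustration; the area values are placeholders and not necessarily those used in this work, which takes MSA from [8].

```python
# Illustrative maximum surface areas (in square angstroms) for a few residue
# types. These are placeholder values for the sketch, not the paper's table.
MAX_SURFACE_AREA = {"A": 115.0, "G": 75.0, "L": 170.0, "W": 255.0}


def relative_lipid_accessibility(residue, lipid_area):
    """Return RLA in percent: 100 * LA / MSA for the given residue type.

    Clipping to [0, 100] is an assumption of this sketch; it guards against
    observed areas that slightly exceed the tabulated maximum.
    """
    msa = MAX_SURFACE_AREA[residue]
    rla = 100.0 * lipid_area / msa
    return max(0.0, min(100.0, rla))


# The same exposed area corresponds to a very different relative exposure
# for a small residue (alanine) and a large one (tryptophan).
rla_ala = relative_lipid_accessibility("A", 57.5)
rla_trp = relative_lipid_accessibility("W", 57.5)
print(rla_ala, round(rla_trp, 1))
```

The comparison of alanine and tryptophan shows why the raw area cannot be compared across residue types without normalization.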
It is important to normalize LA_i values to take into account the differences in the surface area of the side chains of the 20 amino acids. For example, the side chain of alanine (A) is very small compared to that of tryptophan (W). Values of RLA_i range between 0% and 100%, corresponding to a fully buried and a fully lipid-accessible state, respectively. Observed lipid accessibility values of the known 3D structures are computed using the DSSP program [9]. Given the actual 3D coordinates of the protein structure, it computes a parameterization of the protein surface to yield the percent exposure of each amino acid residue.

Sequence Based Predictor

Given the structural data of resolved membrane proteins, we know the protein sequence as well as the corresponding 3D structure coordinates for those proteins. Based on these examples, we can develop a model to predict RLA for each residue in a protein sequence. A typical sequence-based predictor model is depicted in Figure 1.

Figure 1: Typical sequence based predictor model

Here, we predict RLA for the residues located in the TM part. We adopted a regression approach to approximate the relationship between a sequence and RLA values. The input to the predictor consists of samples (every amino acid residue in a TM domain) and known labels (the observed value of the RLA to be estimated). Support Vector Regression (SVR) models are robust and are among the most promising methodologies for learning and inference with minimal parameter choices. Wagner et al. [10] have shown that a simple and computationally much more efficient linear SVR performs comparably to nonlinear models. Samples were characterized by various parameters, termed “features”. The information for the features was obtained from the amino acid sequence itself. Evolutionary conservation, as captured in the form of a multiple sequence alignment (MSA), is an important feature for the prediction of structural attributes.
In addition to MSA, we used the SABLE server [11] to obtain the predicted Relative Solvent Accessibility (RSA) value and its confidence factor as possible features. The notion of RSA in the case of soluble proteins is analogous to RLA in membrane proteins. The predicted confidence factor values qualitatively follow the periodic surface exposure of the residues in the TM helices. Lastly, we also explored hydropathy and lipophilicity profiles, as derived in terms of the KD [12] and TMLIP2H [13] scales, as additional features. The local structural environment of each residue is characterized by a sliding window of amino acids. The residue of interest is located at the central position in the window, which moves along the sequence. Different window sizes in the range of 9 to 21 were tested.

RLA Prediction Results

The performance of the predictor was assessed by means of 10-fold cross-validation on the training set. First, the training set was randomly split into 10 subsets of (approximately) equal size. Ten different SVRs were trained, each time leaving one group out as a test set while the remaining nine groups were merged to form a training set. The final result is taken to be the arithmetic average of the 10 SVRs trained. This process improves the accuracy on the independent test set. Here, we assess the performance in terms of the correlation coefficient (CC) between the observed and the predicted values as well as the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The final cross-validation accuracy reported is the arithmetic average over the 10 different models thus trained on different parts of the data. For the independent test set as well, 10 predictions are obtained and the final consensus prediction (in terms of the arithmetic average) is reported. We investigated different combinations of the features to derive an optimal feature representation.
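The accuracy measures and the consensus prediction described above can be sketched in plain Python as follows. The function names and the toy values are hypothetical and chosen only for illustration; the sketch assumes that each of the ten models returns one predicted value per residue.

```python
import math


def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def consensus(predictions_per_model):
    """Arithmetic average, residue by residue, over the models' predictions."""
    n_models = len(predictions_per_model)
    return [sum(values) / n_models for values in zip(*predictions_per_model)]


# Toy profiles (hypothetical values on a 0-1 scale).
observed = [0.1, 0.4, 0.35, 0.8]
predicted = consensus([[0.15, 0.25, 0.55, 0.65], [0.25, 0.35, 0.45, 0.75]])
print(pearson_cc(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

Averaging the per-model predictions before scoring mirrors the consensus step applied to the independent test set.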
The optimal feature representation was obtained by performing a 10-fold cross-validation study, fine-tuning the meta-parameters C and ε of the SVR model. In particular for RLA, the combined MSA+RSA+TMLIP2H representation performed best. The values for the error tolerance and the penalty parameter were set to ε = 0.1 and C = 0.03, respectively. Further, we chose a sliding window of length 15, which yielded the best results. The average CC between the observed and the predicted RLA in 10-fold cross-validation was of the order of 0.5, and the corresponding MAE and RMSE were 0.15 and 0.19, respectively. In order to further test the generalization of the SVR model, we computed average accuracies on the independent test set of 49 chains. The average CC between the observed and the predicted RLA was 0.49, and the corresponding MAE and RMSE were 0.16 and 0.20, respectively. Our estimates of the accuracy measures obtained on the test set are consistent with the estimates obtained using cross-validated training. Since the errors are of the same order as those observed in cross-validation, we conclude that the method is robust and avoids overfitting.

Model Quality Assessment

De novo and template-based computational approaches for 3D protein structure prediction generate multiple candidate models. Typically, the correctly folded 3D structure of a protein is called a “native” model and the putative models generated are called “decoy” or “non-native” models. For the most part, the decoys being generated have correct stereochemical properties. However, they differ in the overall 3D topology. For example, in the case of alpha-helical proteins, the arrangement as well as the orientation of the helices with respect to each other defines a characteristic 3D structure. Decoy models whose overall 3D structure is close to the native structure are referred to as “near-native” models, while decoy models whose 3D topology differs from the native structure are called “non-native” models.
Thus, when presented with a set of decoys, the challenge of how to filter out “non-native” models from the “near-native” ones becomes the next immediate problem that must be addressed. Improved predictions of intermediate structural attributes of amino acid residues in a protein (such as secondary structure or solvent accessibility) greatly facilitate template-based structure prediction methods as well as de novo simulations [14]. Here, we hypothesize that sufficiently accurate RLA predictions can be used to discriminate native (near-native) models from the non-native decoy structures. If RLA predictions are well-correlated with the structure, they can be used to narrow down the pool of decoys to those which are consistent with the predicted patterns.

Figure 2: Given RLA predictions for the TM helices (marked in yellow), matching predicted and observed TM helices can be used in the template-based structure prediction methods

The scenario is depicted in Figure 2. Matching RLA profiles for the predicted and observed TM helices can be used in template-based structure prediction methods. To investigate whether predicted RLA can discriminate between native and non-native models, we need a sufficiently large number of decoy sets. The decoy sets for the present study are obtained by randomly reshuffling the TM helices of the native models, as described in the following section.

Rearrangement of TM helices

In the first approach, our aim is to assess the overall efficacy of the predicted profile by simply randomly shuffling the observed RLA (RD) values corresponding to the TM helices and thereby generating a decoy model. The goal of this exercise is not to generate an overall 3D structure for each of the decoy models by re-computing the actual coordinates, but rather to assess the correlation between the predicted and the observed RLA (RD) values per TM helix.
The process involves a simple rearrangement of TM domains while keeping the non-TM part of the structure intact. The process is best explained with an example.

Figure 3: Demonstrates the process of shuffling observed RLA values in TM helices to generate an effect of an alternative structure considered as a decoy

Figure 3 depicts the scenario. Part A of Figure 3 shows the arrangement of 7 TM helices of a hypothetical protein in its native state. Here, helix 7 is located at the center and is therefore a buried helix. Now, suppose we swap helix 7 with helix 1 to generate an alternative model such that helix 7 is no longer placed at the center and helix 1 is located at the center instead. The alternative arrangement is depicted in Part B of Figure 3. We now have two models generated from the same sequence, one being native and the other a decoy. We achieve this effect by swapping the observed RLA values of the corresponding TM segments while keeping the rest of the values intact. For example, in the present case, we exchange the observed values of TM 7 with those of TM 1, keeping the rest of the values intact.

Generating and Ranking decoys

For this analysis, we selected 10 representative protein chains from the test set having at least 4 TM helices. This gives us the flexibility required to generate a sufficient number of distinct decoys for each protein chain. We generated decoys as follows. For each protein, we obtain the predicted RLA values. It should be noted that the prediction profile is obtained only once, based on the original sequence, and thus remains the same throughout the analysis. The corresponding observed RLA values of the native structure are obtained from DSSP [9]. We note the CC between the predicted and the observed RLA values of the native structure. We then generate alternative models by randomly shuffling the TM domain segments of the sequence, keeping the non-TM domain parts intact.
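The swap of observed RLA values between two TM segments can be sketched as follows. This is a minimal illustration rather than the exact procedure used here: the function name, the half-open (start, end) index convention, and the restriction to equal-length segments are assumptions of the sketch.

```python
def swap_helices(observed_rla, helices, a, b):
    """Return a decoy profile with the observed values of helices a and b exchanged.

    observed_rla: per-residue observed RLA values for the whole chain.
    helices: list of (start, end) half-open index ranges, one per TM helix.
    Values outside the two swapped segments are left untouched.
    """
    (start_a, end_a), (start_b, end_b) = helices[a], helices[b]
    if end_a - start_a != end_b - start_b:
        # Simplifying assumption: only equal-length segments are swapped.
        raise ValueError("helices must have equal length in this sketch")
    decoy = list(observed_rla)
    decoy[start_a:end_a] = observed_rla[start_b:end_b]
    decoy[start_b:end_b] = observed_rla[start_a:end_a]
    return decoy


# Toy chain: two 3-residue helices separated by a 2-residue loop.
native = [10, 20, 30, 0, 0, 70, 80, 90]
helices = [(0, 3), (5, 8)]
decoy = swap_helices(native, helices, 0, 1)
print(decoy)  # [70, 80, 90, 0, 0, 10, 20, 30]
```

The input list is copied, so the native profile remains available for computing the reference CC.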
We would like to highlight that we swap only helices of similar length in each of the proteins to avoid generating trivial cases. We then compute the CC between the predicted and the observed RLA of the decoy model generated. We rank all the decoys using the CCs thus obtained as a measure of their quality. Since the observed values for the decoys are now rearranged, the CC between the observed and the predicted RLA is expected to be the highest for the native structure. We then measure the separation of the non-native and native models by a Z score. We performed this exercise on 10 protein chains and observed that the CC with the native model was indeed the highest. The detailed results are listed in Table 1. Moreover, we also took into consideration flexible boundaries for the TM domains. Since the TM domains obtained from PDB_TM [6] are predicted, they are subject to some uncertainty. Sometimes TM domains can be shifted up or down by a couple of residues. To account for the fuzzy boundaries, we generated alternative models by shifting TM boundaries left or right by one residue in the protein sequence. Thus, the TM domain for each helix now has three alternative boundaries, including the original. In principle, we can generate a total of 3^(#helices) helix arrangements for a protein. We refer to these models as “native-like” models. For the purposes of this study, we limited the “native-like” models to 100 wherever applicable. For the “native-like” models, we then shuffled the TM domains randomly as explained in the previous section. We observed that the “native-like” models were also ranked at the top, suggesting the robustness of the predictions. These results indicate that the predicted RLA profiles are well-correlated with the observed values and that the predictions are not random.
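The ranking step can be illustrated with a short sketch. The exact Z score definition is not spelled out in the text, so the sketch assumes Z = (CC_native - mean of decoy CCs) / standard deviation of decoy CCs. It also assumes equal-length helices, enumerates all helix permutations rather than a random sample, and uses hypothetical function names and toy values.

```python
import itertools
import math
import statistics


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


def arrangement_ccs(predicted, observed, helices):
    """CC between the fixed prediction and every rearranged observed profile.

    The first entry corresponds to the identity permutation, i.e. the native.
    """
    ccs = []
    for perm in itertools.permutations(range(len(helices))):
        rearranged = list(observed)
        for target, source in enumerate(perm):
            t_start, t_end = helices[target]
            s_start, s_end = helices[source]
            assert t_end - t_start == s_end - s_start  # equal-length assumption
            rearranged[t_start:t_end] = observed[s_start:s_end]
        ccs.append(pearson_cc(predicted, rearranged))
    return ccs


def z_score(native_cc, decoy_ccs):
    """Separation of the native CC from the decoy CC distribution (assumed form)."""
    return (native_cc - statistics.mean(decoy_ccs)) / statistics.pstdev(decoy_ccs)


# Toy example: three 3-residue helices and a prediction close to the native.
observed = [10, 50, 90, 80, 20, 60, 30, 30, 70]
predicted = [15, 45, 85, 75, 25, 55, 35, 25, 65]
helices = [(0, 3), (3, 6), (6, 9)]
ccs = arrangement_ccs(predicted, observed, helices)
native_cc, decoy_ccs = ccs[0], ccs[1:]
z = z_score(native_cc, decoy_ccs)
print(round(native_cc, 2), round(z, 2))
```

In this toy case the native arrangement has the highest CC among the six permutations, which is the behavior the ranking relies on.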
By contrasting predicted RLAs and RDs with those observed in decoy models, one can consistently discriminate between native-like and non-native arrangements of TM helices.

PDB id   # TM   # native-like models   # shuffles per model   Native CC   Mean CC of native-like models   Average Z score
1aig_l   5      100                    120                    0.55        0.54 ± 0.01                     2.44 ± 0.29
1bcc_c   8      100                    1440                   0.51        0.50 ± 0.02                     2.86 ± 0.65
1c17_m   4      70                     24                     0.45        0.45 ± 0.02                     2.18 ± 0.57
1eys_m   5      100                    12                     0.53        0.52 ± 0.02                     2.21 ± 0.2
1fft_c   5      100                    120                    0.36        0.36 ± 0.04                     1.28 ± 0.4
1fx8_a   6      100                    24                     0.68        0.68 ± 0.01                     2.36 ± 0.25
1izl_a   5      100                    120                    0.35        0.35 ± 0.01                     2.17 ± 0.23
1ors_c   4      70                     4                      0.48        0.47 ± 0.01                     1.72 ± 0.48
1u77_a   11     100                    864                    0.71        0.70 ± 0.01                     2.20 ± 0.17
1xfh_a   9      100                    72                     0.7         0.69 ± 0.01                     2.38 ± 0.17

Table 1: Results of swapping TM helices in the protein chains. The third column lists the number of “native-like” structures considered. The fourth column lists the number of decoys generated. The fifth column lists the CC between actual and predicted RLA for the native structure. The sixth column lists the mean of the CC of all the native-like structures. The last column lists the average Z score of native and “native-like” models among all possible arrangements.

Template-based protein structure prediction rests on the assumption that similar proteins exhibit similar protein folds. Our results indicate that predicted RLA profiles can facilitate the search for the correct protein template that is structurally similar to the target protein. Ten proteins are too few to draw firm conclusions, and a more conclusive argument requires more representative examples. Nevertheless, the results obtained so far are encouraging and support the hypothesis that RLA prediction profiles are correlated with observed values and can therefore be used to discriminate near-native from non-native models. Several different types of decoys and datasets (e.g. using Rosetta-Membrane [15] and I-Tasser [16]) can further be obtained to validate the hypothesis.
The results will facilitate further efforts to improve de novo and template-based prediction of membrane protein structure.

Conclusions

Limitations of experimental approaches, especially in the case of membrane proteins, create an opportunity for computational approaches to complement and facilitate experimental efforts. In this paper, we proposed a novel method for the prediction (from sequence) of relative lipid accessibility in membrane proteins, using a linear Support Vector Regression approach to minimize the risk of overfitting and provide robust performance. Our results indicate that RLA can be predicted with a CC of about 0.5. While this is still lower than the estimated CC of 0.6-0.7 for state-of-the-art real-valued RSA prediction methods for soluble proteins [11], it is sufficient for model quality assessment for membrane proteins. Our results indicate that the predicted RLA profiles are well-correlated with the observed values and that the predictions are not random. Given a list of models, one can narrow it down to those models which are consistent with the predicted patterns. By contrasting predicted RLAs with those observed in decoy models, one can consistently discriminate between native-like and non-native arrangements of TM helices. The results so far look promising and will facilitate further efforts to improve de novo and template-based prediction of membrane protein structure.

References

1. Hirokawa, T., S. Boon-Chieng, and S. Mitaku, SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics, 1998. 14(4): p. 378-9.
2. Claros, M.G. and G. von Heijne, TopPred II: an improved software for membrane protein structure predictions. Comput Appl Biosci, 1994. 10(6): p. 685-6.
3. Hofmann, K. and W. Stoffel, TMbase - A database of membrane spanning proteins segments. Biol. Chem. Hoppe-Seyler, 1993. 374: p. 166.
4. Cao, B., et al., Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics, 2006. 22(3): p. 303-9.
5. Jayasinghe, S., K. Hristova, and S.H. White, MPtopo: A database of membrane protein topology. Protein Sci, 2001. 10(2): p. 455-8.
6. Tusnady, G.E., Z. Dosztanyi, and I. Simon, PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res, 2005. 33(Database issue): p. D275-8.
7. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402.
8. Chothia, C., The nature of the accessible and buried surfaces in proteins. J Mol Biol, 1976. 105(1): p. 1-12.
9. Kabsch, W. and C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-637.
10. Wagner, M., R. Adamczak, A. Porollo, and J. Meller, Linear Regression Models for Solvent Accessibility Prediction in Proteins. Journal of Computational Biology, 2005. 12(3): p. 355-369.
11. Adamczak, R., A. Porollo, and J. Meller, Accurate prediction of solvent accessibility using neural networks-based regression. Proteins, 2004. 56(4): p. 753-67.
12. Kyte, J. and R.F. Doolittle, A simple method for displaying the hydropathic character of a protein. J Mol Biol, 1982. 157(1): p. 105-32.
13. Adamian, L., et al., Empirical lipid propensities of amino acid residues in multispan alpha helical membrane proteins. Proteins, 2005. 59(3): p. 496-509.
14. Rohl, C.A., C.E. Strauss, K.M. Misura, and D. Baker, Protein structure prediction using Rosetta. Methods Enzymol, 2004. 383: p. 66-93.
15. Yarov-Yarovoy, V., J. Schonbrun, and D. Baker, Multipass membrane protein structure prediction using Rosetta. Proteins, 2006. 62(4): p. 1010-25.
16. Wu, S., J. Skolnick, and Y. Zhang, Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol, 2007. 5: p. 17.