
Model Quality Assessment in Membrane Proteins Using Predicted
Lipid Accessibility Profiles
Mukta Phatak1,3 and Jarosław Meller1,3,4*
1Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45221, USA
3Department of Environmental Health, University of Cincinnati, Cincinnati, OH 45267
4Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation, OH 45229
Abstract
Today, membrane proteins dominate the class of drug targets
because of their key role in signal transduction as well as
transport of ions and small molecules across the cell membrane. In
order to understand the function of proteins, it is necessary to
understand their 3D structure. Compared to soluble proteins, a
relatively small number of high-resolution membrane protein
structures have been resolved experimentally. However, despite the
paucity of 3D protein structures, we can design computational
methods by utilizing protein sequence data, which are readily
available. Towards the bigger goal of predicting the 3D structure of
a protein, we adopt a step-by-step approach to first predict
intermediate structural attributes of a protein structure from its
sequence. For each residue in the transmembrane domains
(TMDs) of a protein, we capture the level of exposure to the lipid
in terms of relative lipid accessibility (RLA). We have developed
robust predictors for RLA of membrane proteins using
low-complexity Support Vector Regression (SVR) models capable of
learning from a limited number of examples in order to minimize
the risk of overfitting. Our results indicate that RLA can be
predicted with a correlation coefficient (CC) of about 0.5 between
the predicted and the observed values.
Further, we generated multiple decoy models for proteins in the
test set by swapping observed RLA values of the transmembrane
(TM) helices. We hypothesize that predicted RLA profiles are
well-correlated with the observed RLA values, and therefore can
be used to discriminate near-native models from non-native decoy
models. We ranked all the models based on the CC between the
observed and the predicted RLA. Our results indicate that the CC
between the predicted and the observed RLA was highest for the
native model compared to the decoy models. The results support
the hypothesis that, given a list of models, sufficiently accurate
RLA predictions can be used to narrow the list down to those
models which are consistent with the predicted patterns. This will
facilitate further efforts to improve de novo and template-based
prediction of membrane protein structure.
Introduction
Membrane proteins are key regulators of many cellular and
physiological processes and they represent a significant
fraction of the entire proteome. Today, membrane proteins
dominate the class of drug targets because of their critical
role in signal transduction as well as transport of ions and
small molecules across the cell membrane. Structural and
functional studies on membrane proteins could lead to
novel and improved pharmaceutical treatments for a broad
range of diseases.
In order to understand the function of proteins, it is
necessary to understand their 3D structure. Compared to
soluble proteins, a relatively small number of
high-resolution membrane protein structures have been resolved
experimentally. However, despite the paucity of 3D
protein structures, we can design computational methods
by utilizing protein sequence data, which are readily
available. The computational prediction of membrane
proteins, and their functional attributes, has therefore
become an important alternative and complementary tool
for membrane protein studies. However, the number of
examples from which to learn is limited, making
applications of statistical and machine learning techniques
much more challenging in this case.
In this work, we focus on alpha-helical transmembrane
(TM) proteins, which span the lipid bilayer. There are
several successful methods for predicting the boundaries of
transmembrane domains (TMDs) given a membrane protein
sequence. SOSUI [1], TopPred II [2], TMpred [3], and
Minnou [4] are a few such examples.
Once the part of a sequence corresponding to the TMDs is
located, we then need to identify the arrangement of TM
helices with respect to each other. Towards the bigger goal
of predicting 3D structure of a protein, the next step is to
predict the overall 3D topology of TM helices. In order to
achieve this goal we adopt a step-by-step approach to first
predict intermediate structural attributes of a protein
structure from a sequence. For each residue in TMDs of a
protein, we capture the level of exposure to the lipid in
terms of relative lipid accessibility (RLA). For this
analysis, we focus on residues located in the TMDs,
disregarding the residues in the non-TM loop structures.
We have developed a robust method for predicting the
extent of lipid exposure of amino acid residues in TM
proteins in terms of lipid accessibility. To solve the
underlying regression problem, we developed robust
low-complexity support vector regression (SVR) models, which
are capable of learning from a limited number of examples
and thereby minimize the risk of overfitting.
The primary aim of our work is to explore these
one-dimensional (1D) lipid accessibility prediction profiles
as a step towards further 3D structural and functional studies of
membrane proteins. Here, we evaluate the efficacy of the
predicted RLA profile for discriminating incorrectly folded
non-native models from the native ones. If predicted RLA
profiles are well-correlated with the observed RLA values,
they can facilitate template-based structure prediction
methods. We put this hypothesis to the test by generating
decoy sets for membrane proteins for which the native 3D
structure is resolved. We then computed the Correlation
Coefficient (CC) between the observed and the predicted
RLA profiles. We found that the CC between the
predicted and the observed RLA was highest for the native
structure, compared with the CC obtained for the decoys.
The results obtained underscore the importance of such 1D
prediction protocols for selecting the correct template for
the target protein structure in template-based structure
prediction methods. RLA prediction profiles can
potentially be applied as a filter to reduce the pool
of thousands of putative models generated by de novo
structure prediction methods.
Materials and Methods
Training and test data
Carefully designing a representative training data set is a
vital part of any machine learning based predictive model.
Due to the difficulties in applying experimental techniques,
the number of high-resolution structures of membrane
proteins that have been solved to date is limited
compared to soluble proteins. Hence, in the case of
membrane proteins, the problem is even more challenging
because of the limited number of examples available to learn
from. Here, we used the MPtopo [5] and PDB_TM [6]
membrane protein databases to generate a non-redundant
yet representative set of protein chains with resolved 3D
structures and to construct a training set for RLA (RD)
predictors. Redundant entries were removed from the
dataset using BLAST [7] sequence alignment. For this
study we developed a set of 71 non-redundant alpha-helical
protein chains with 6,049 residues in the TM domains
available for cross-validated training purposes.
Next, we created an independent test set using the
PDB_TM database. Sequence homology with respect to
the training set was evaluated to make sure that the
proteins in the test set are non-redundant with respect to
all the chains in the training set. This resulted in a
non-redundant set of 49 chains with a total of 3,826
residues in the TM domains.
RLA Computation
Relative Lipid Accessibility (RLA): The RLA of
amino acid residue i is defined as follows.
$$\mathrm{RLA}_i = 100 \times \frac{\mathrm{LA}_i}{\mathrm{MSA}_i}\ \%$$
where LA_i is the lipid-exposed surface area observed in a
given structure, normalized by MSA_i, which is
the maximum achievable surface area for that type of
amino acid [8]. It is important to normalize LA_i values to
take into account the differences in the surface area of the side
chains of the 20 amino acids. For example, the side chain of
alanine (A) is very small compared to that of
tryptophan (W).
Values of RLAi can range between 0% and 100%,
corresponding to a fully buried and a fully lipid accessible
state respectively.
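The normalization can be illustrated with a minimal sketch. The function name, the clipping to the 0-100% range, and the maximum surface areas below are assumptions made for illustration; in particular, the numbers in `EXAMPLE_MSA` are placeholders and are not the values of reference [8].

```python
def relative_lipid_accessibility(lipid_area, max_area):
    """Return RLA in percent: 100 * LA_i / MSA_i, clipped to [0, 100]."""
    if max_area <= 0:
        raise ValueError("max_area must be positive")
    rla = 100.0 * lipid_area / max_area
    # Clipping is an assumption of this sketch; it guards against
    # observed areas that slightly exceed the tabulated maximum.
    return max(0.0, min(100.0, rla))


# Hypothetical maximum surface areas in square angstroms
# (placeholders only, NOT the values used in the paper).
EXAMPLE_MSA = {"A": 100.0, "W": 250.0}

# The same absolute exposure yields a different relative exposure
# for a small residue (A) and a large residue (W).
rla_ala = relative_lipid_accessibility(50.0, EXAMPLE_MSA["A"])
rla_trp = relative_lipid_accessibility(50.0, EXAMPLE_MSA["W"])
```

With these placeholder areas, an alanine and a tryptophan with identical lipid-exposed area end up with different RLA values, which is the effect the normalization is meant to capture.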
Observed lipid accessibility values of the known 3D
structures are computed using the DSSP program [9]. Given
the actual 3D coordinates of the protein structure, it computes
a parameterization of the protein surface to yield the percent
exposure of each amino acid residue.
Sequence Based Predictor
Given the structural data of resolved membrane proteins,
we know the protein sequence as well as the corresponding
3D structure coordinates for those proteins. Based on these
examples, we can develop a model to predict RLA for each
residue in a protein sequence. A typical sequence based
predictor model is depicted in Figure 1.
Figure 1: Typical sequence based predictor model
Here, we predict RLA for the residues located in the TM
part. We adopted a regression approach to approximate the
relationship between a sequence and its RLA values. The
input to the predictor consists of samples (every amino
acid residue in a TM domain) and known labels (the observed
value of the RLA to be estimated). Support Vector
Regression (SVR) models are robust and are among the most
promising methodologies for learning and inference with
minimal parameter choices. Wagner et al. [10] have shown
that a simple and computationally much more efficient
linear SVR performs comparably to nonlinear models.
Samples were characterized by various parameters, which are
termed “features”. The information for the features was
obtained from the amino acid sequence itself. Evolutionary
conservation, as captured in the form of a multiple sequence
alignment (MSA), is an important feature for the prediction
of structural attributes. In addition to the MSA, we used the
SABLE server [11] to obtain the predicted Relative Solvent
Accessibility (RSA) value and its confidence factor to be
used as possible features. The notion of RSA in the case of
soluble proteins is analogous to RLA in membrane
proteins. The predicted confidence factor values
qualitatively follow the periodic surface exposure of the
residues in the TM helices. Lastly, we also explored
hydropathy and lipophilicity profiles, as derived in terms
of the KD [12] and TMLIP2H [13] scales, as additional features.
The local structural environment of each residue is
characterized by a sliding window of amino acids. The
residue of interest is located at the central position in the
window which moves along the sequence. Different
window sizes in the range of 9 to 21 were tested.
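A sliding window of this kind could be sketched as follows. This is only an illustration: the function name is hypothetical, and padding the sequence ends with a `-` symbol is an assumption, since the text does not say how windows that run past the end of the sequence are handled.

```python
def sliding_windows(sequence, window=9, pad="-"):
    """Return (center_index, window_string) for every residue.

    The residue of interest sits at the central position, so the
    window length must be odd. Positions beyond the sequence ends
    are filled with `pad` (an assumption of this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window length must be odd")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [(i, padded[i:i + window]) for i in range(len(sequence))]


# Example with a short hypothetical fragment and a window of 3.
windows = sliding_windows("ACDEF", window=3)
```

In a full feature representation, each position of the window would then be replaced by its per-residue features (for example, the MSA profile column), giving one fixed-length feature vector per residue.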
RLA Prediction Results
The performance of the predictor was assessed by means
of 10-fold cross-validation on the training set. First, the
training set was randomly split into 10 subsets of
(approximately) equal size. Ten different SVRs were
trained, each time leaving one subset out as a test set while
the remaining 9 subsets were merged to form a training set.
We assess the performance in terms of the correlation coefficient
(CC) between the observed and the predicted values, as
well as the Root Mean Square Error (RMSE) and the mean
absolute error (MAE). The cross-validation accuracy
reported is the arithmetic average over the 10
models thus trained on different parts of the data. For
the independent test set, 10 predictions are likewise obtained,
one per model, and the final consensus prediction (their
arithmetic average) is reported. This consensus averaging
improves the accuracy on the independent test set.
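The evaluation protocol can be sketched in a few lines using only the Python standard library. All function names are hypothetical, and the sketch assumes the random split is made over whole items (for example, chains), which the text does not specify; training of the SVR models themselves is left out.

```python
import math
import random


def kfold_indices(n_items, k=10, seed=0):
    """Randomly split item indices into k folds of near-equal size."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mae(x, y):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def consensus(predictions):
    """Per-residue arithmetic average over the models' predictions."""
    return [sum(col) / len(col) for col in zip(*predictions)]


# 71 training chains split into 10 folds (fold sizes differ by at most 1).
folds = kfold_indices(71, k=10)
```

Each fold would serve once as the held-out set, and `consensus` would be applied to the 10 per-model predictions obtained for an independent test chain.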
We investigated different combinations of the features to
derive an optimal feature representation. The optimal feature
representation was obtained by performing a 10-fold
cross-validation study while fine-tuning the meta-parameters
C and ε of the SVR model. In particular for RLA, the combined
MSA+RSA+TMLIP2H representation performed best. The
values for the error tolerances and the penalty parameter were
set to ε_i = 0.1 and C = 0.03, respectively. Further, we
chose a sliding window of length 15, which yielded the best
results.
The average CC between the observed and the predicted
RLA in 10-fold cross-validation was of the order of 0.5, and
the corresponding MAE and RMSE were 0.15 and 0.19,
respectively. In order to further test the generalization of the
SVR model, we computed average accuracies on the
independent test set of 49 chains. The average CC
between the observed and the predicted RLA was 0.49, and the
corresponding MAE and RMSE were 0.16 and 0.20,
respectively.
Our estimates of accuracy measures obtained on the test set
are consistent with the estimates obtained using
cross-validated training. Since the errors are of the same order as
those observed in cross-validation, we conclude that the
method is robust and avoids overfitting.
Model Quality Assessment
De novo and template-based computational approaches for
3D protein structure prediction generate multiple
candidate models. Typically, the correctly folded 3D
structure of a protein is called the “native” model and the
putative models generated are called “decoy” or
“non-native” models. For the most part, the decoys being
generated have correct stereochemical properties.
However, they differ in the overall 3D topology. For
example, in the case of alpha-helical proteins, the arrangement
as well as the orientation of the helices with respect to each
other defines a characteristic 3D structure. Decoy models
which have an overall 3D structure close to
the native structure are referred to as “near-native” models,
while other decoy models with a 3D topology different from the
native structure are called “non-native” models. Thus,
when presented with a set of decoys, the challenge of how
to separate the “non-native” models from the “near-native”
ones becomes the next immediate problem that must be
addressed.
Improved predictions of intermediate structural attributes
of amino acid residues in a protein (such as secondary
structure or solvent accessibility) greatly facilitate
template-based structure prediction methods as well as de
novo simulations [14]. Here, we hypothesize that
sufficiently accurate RLA predictions can be used to
discriminate native (near-native) models from the
non-native decoy structures. If RLA predictions are
well-correlated with the structure, they can be used to narrow
down the pool of decoys to those which are consistent with
the predicted patterns.
Figure 2: Given RLA predictions for the TM helices
(marked in yellow), matching predicted and observed TM
helices can be used in the template-based structure
prediction methods
The scenario is depicted in Figure 2. Matching RLA
profiles for the predicted and observed TM helices can be
used in template-based structure prediction methods.
To investigate whether predicted RLA can discriminate
between native and non-native models, we need a
sufficiently large number of decoys. The decoy sets for
the present study are obtained by randomly reshuffling the
TM helices of the native models as described in the
following section.
Rearrangement of TM helices
In the first approach, our aim is to assess the overall
efficacy of the predicted profile by simply randomly
shuffling the observed RLA (RD) values corresponding to
the TM helices and thereby generating a decoy model. The
goal of this exercise is not to generate an overall 3D
structure for each of the decoy models by re-computing the
actual coordinates, but rather to assess the correlation
between the predicted and the observed RLA (RD) values
per TM helix. The process involves simple rearrangement
of TM domains while retaining the non-TM part of the
structure intact. The process can be best explained with an
example.
Figure 3: Shuffling the observed RLA values of TM helices
to mimic an alternative structure, which is treated as a
decoy
Figure 3 depicts the scenario. Part A of Figure 3 shows the
arrangement of 7 TM helices of a hypothetical protein in
its native state. Here, helix 7 is located at the center and is
therefore a buried helix. Now, suppose we swap helix 7
with helix 1 to generate an alternative model such that
helix 7 is no longer placed at the center and, instead, helix 1
is located at the center. The alternative arrangement is
depicted in Part B of Figure 3. We now have 2
models that are generated using the same sequence, one
being native and the other a decoy. We achieve this effect by
swapping the observed RLA values of the
corresponding TM segments. In the present case, we
exchange the observed values of TM 7 with those of TM 1,
keeping the rest of the values intact.
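The swap can be sketched as a simple exchange of two slices of the observed RLA profile. The function name and the representation of TM segments as half-open `(start, end)` index pairs are assumptions of this sketch; for simplicity it only handles segments of exactly equal length.

```python
def swap_helices(observed, segments, a, b):
    """Return a decoy profile with the observed RLA values of TM
    segments `a` and `b` exchanged; all other values stay intact.

    `segments` is a list of half-open (start, end) index pairs.
    Requiring equal segment lengths is a simplification of this sketch.
    """
    (sa, ea), (sb, eb) = segments[a], segments[b]
    if ea - sa != eb - sb:
        raise ValueError("this sketch only swaps helices of equal length")
    decoy = list(observed)  # copy, so the native profile is untouched
    decoy[sa:ea] = observed[sb:eb]
    decoy[sb:eb] = observed[sa:ea]
    return decoy


# Hypothetical profile: two 2-residue "helices" separated by a loop.
native = [10, 20, 0, 0, 70, 80, 5]
tm_segments = [(0, 2), (4, 6)]
decoy = swap_helices(native, tm_segments, 0, 1)
```

The predicted profile is not touched by this operation; only the observed values are rearranged, so the decoy can be scored against the same prediction as the native model.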
Generating and Ranking decoys
For this analysis, we selected 10 representative protein
chains from the test set having at least 4 TM helices. This
gives us the required flexibility to generate a sufficient
number of distinct decoys for each protein chain. We
generated decoys as follows.
For each protein,
• We obtain the predicted RLA values. It should be
noted that the prediction profile is obtained only
once, based on the original sequence, and thus
remains the same throughout the analysis.
• The corresponding observed RLA values of the
native structure are obtained from DSSP [9].
• We note the CC between the predicted and the
observed RLA values of the native structure.
• We then generate alternative models by randomly
shuffling the TM domain segments of the
sequence, keeping the non-TM parts intact.
We would like to highlight that we swap only
helices of similar length in each of the proteins to
avoid generating trivial cases.
• We then compute the CC between the predicted and the
observed RLA of each decoy model generated.
• We rank all the decoys using the CCs thus obtained as
a measure of their quality. Since the observed
values for the decoys are now rearranged, the CC
between the observed and the predicted RLA is
expected to be the highest for the native structure.
• We then measure the separation between the non-native
and the native models by a Z score.
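The steps above can be sketched as follows. Several choices here are assumptions rather than the procedure actually used: decoys are enumerated exhaustively rather than sampled, only segments of exactly equal length are exchanged, and the Z score is taken to be the native CC minus the mean decoy CC, divided by the standard deviation of the decoy CCs (the text does not give its definition). All function names are hypothetical.

```python
import itertools
import math
import statistics


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def shuffled_profiles(observed, segments):
    """Yield every non-identity rearrangement of equal-length TM segments."""
    blocks = [observed[s:e] for s, e in segments]
    identity = tuple(range(len(segments)))
    for perm in itertools.permutations(identity):
        if perm == identity:
            continue  # skip the native arrangement
        if any(len(blocks[p]) != len(blocks[i]) for i, p in enumerate(perm)):
            continue  # only exchange segments of equal length
        decoy = list(observed)
        for i, p in enumerate(perm):
            s, e = segments[i]
            decoy[s:e] = blocks[p]
        yield decoy


def score_native(predicted, observed, segments):
    """Return (native CC, rank of native among all models, Z score)."""
    native_cc = pearson_cc(predicted, observed)
    decoy_ccs = [pearson_cc(predicted, d)
                 for d in shuffled_profiles(observed, segments)]
    if len(decoy_ccs) < 2:
        raise ValueError("need at least two decoys")
    # Assumed Z score definition: distance of the native CC from the
    # decoy mean, in units of the decoy standard deviation.
    z = (native_cc - statistics.mean(decoy_ccs)) / statistics.pstdev(decoy_ccs)
    rank = 1 + sum(cc > native_cc for cc in decoy_ccs)
    return native_cc, rank, z


# Hypothetical toy profiles with three 2-residue TM segments.
predicted = [85, 75, 10, 15, 45, 50]
observed = [90, 80, 5, 10, 50, 40]
tm_segments = [(0, 2), (2, 4), (4, 6)]
native_cc, native_rank, z_score = score_native(predicted, observed, tm_segments)
```

A rank of 1 means that no decoy correlates better with the prediction than the native arrangement does, and a large positive Z score means the native model is well separated from the decoy distribution.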
We performed this exercise on 10 protein chains and
observed that CC with the native model was indeed the
highest. The detailed results are listed in Table 1.
Moreover, we also took into consideration flexible
boundaries for the TM domains. Since the TM domains
obtained from PDB_TM [6] are predicted, they are subject
to some uncertainty. Sometimes TM domains can be
shifted up or down by a couple of residues. To account for
the fuzzy boundaries, we generated alternative models by
shifting TM boundaries left or right by one residue in the
protein sequence. Thus, the TM domain for each helix now has
three alternative boundaries, including the original. In
principle, we can generate a total of 3^(number of helices)
helix arrangements for a protein. We refer to these models as
“native-like” models. For the purposes of this study, we
limited the “native-like” models to 100 wherever applicable.
For the “native-like” models, we then shuffled the TM
domains randomly as explained in the previous section.
We observed that the “native-like” models were also
ranked at the top, suggesting the robustness of the
predictions.
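The enumeration of shifted boundaries can be sketched with `itertools.product`. The function name is hypothetical, and drawing a random sample when more than 100 variants exist is an assumption, since the text does not state how the 100 “native-like” models were selected (nor why fewer models are reported for some chains in Table 1).

```python
import itertools
import random


def native_like_boundaries(segments, max_models=100, seed=0):
    """Shift each TM segment by -1, 0 or +1 residue and return up to
    `max_models` alternative sets of boundaries.

    `segments` is a list of half-open (start, end) index pairs. Sequence
    bounds and overlaps between neighbors are not checked in this sketch.
    """
    variants = [
        [(s + d, e + d) for (s, e), d in zip(segments, combo)]
        for combo in itertools.product((-1, 0, 1), repeat=len(segments))
    ]  # 3 ** len(segments) variants, including the original boundaries
    if len(variants) > max_models:
        # Assumed selection rule: uniform random sample without replacement.
        variants = random.Random(seed).sample(variants, max_models)
    return variants


# Two hypothetical TM segments give 3**2 = 9 boundary variants.
two_helix = [(5, 10), (20, 25)]
variants = native_like_boundaries(two_helix)
```

Each returned set of boundaries would then be passed through the same helix-shuffling and ranking procedure as the original boundaries.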
These results indicate that the predicted RLA profiles are
well-correlated with the observed values and that the
predictions are not random. By contrasting predicted RLAs
and RDs with those observed in decoy models, one can
consistently discriminate between native-like and non-native arrangements of TM helices.

| PDB id | # TM helices | # native-like models generated | # different shuffles for each model | Native CC | Mean CC of native-like models | Average Z score |
|--------|--------------|--------------------------------|-------------------------------------|-----------|-------------------------------|-----------------|
| 1aig_l | 5  | 100 | 120  | 0.55 | 0.54 ± 0.01 | 2.44 ± 0.29 |
| 1bcc_c | 8  | 100 | 1440 | 0.51 | 0.50 ± 0.02 | 2.86 ± 0.65 |
| 1c17_m | 4  | 70  | 24   | 0.45 | 0.45 ± 0.02 | 2.18 ± 0.57 |
| 1eys_m | 5  | 100 | 12   | 0.53 | 0.52 ± 0.02 | 2.21 ± 0.2  |
| 1fft_c | 5  | 100 | 120  | 0.36 | 0.36 ± 0.04 | 1.28 ± 0.4  |
| 1fx8_a | 6  | 100 | 24   | 0.68 | 0.68 ± 0.01 | 2.36 ± 0.25 |
| 1izl_a | 5  | 100 | 120  | 0.35 | 0.35 ± 0.01 | 2.17 ± 0.23 |
| 1ors_c | 4  | 70  | 4    | 0.48 | 0.47 ± 0.01 | 1.72 ± 0.48 |
| 1u77_a | 11 | 100 | 864  | 0.71 | 0.70 ± 0.01 | 2.20 ± 0.17 |
| 1xfh_a | 9  | 100 | 72   | 0.7  | 0.69 ± 0.01 | 2.38 ± 0.17 |

Table 1: Results of swapping TM helices in the protein chains. The third column lists the number of “native-like” structures
considered. The fourth column lists the number of decoys generated. The fifth column lists the CC between actual and
predicted RLA for the native structure. The sixth column lists the mean of the CC of all the native-like structures. The last
column lists the average Z score of native and “native-like” models among all possible arrangements.
Template based protein structure prediction is based on the
assumption that similar proteins exhibit similar protein
folds. Our results indicate that predicted RLA profiles can
facilitate the search for the correct protein template that is
structurally similar to the target protein.
While ten proteins are too few to draw a firm conclusion and
a more conclusive argument requires more representative
examples, the results obtained so far are
encouraging and support the hypothesis that RLA
prediction profiles are correlated with observed values, and
therefore can be used to discriminate near-native from
non-native models. Several different types of decoys and
datasets (e.g. using Rosetta-Membrane [15] and I-Tasser
[16]) can further be obtained to validate the hypothesis.
The results will facilitate further efforts to improve de novo
and template-based prediction of membrane protein
structure.
Conclusions
Limitations of experimental approaches, especially in the
case of membrane proteins, create an opportunity for
computational approaches to complement and facilitate
experimental efforts in that regard. In this paper, we
proposed a novel method for the prediction (from
sequence) of relative lipid accessibility in membrane
proteins using a linear Support Vector Regression
approach to minimize the risk of overfitting and provide
robust performance. Our results indicate that RLA can be
predicted at the level of about 0.5 CC. While this is still
lower than the estimated 0.6-0.7 CC for state-of-the-art
real-valued RSA prediction methods for soluble proteins
[11], it is sufficient for the model quality assessment for
membrane proteins. Our results indicate that the predicted
RLA profiles are well-correlated with the observed values
and that the predictions are not random. Given a list of
models, one can narrow it down to those models which are
consistent with the predicted patterns. By contrasting
predicted RLAs with those observed in decoy models, one
can consistently discriminate between native-like and
non-native arrangements of TM helices. The results so far look
promising and will facilitate further efforts to improve de
novo and template-based prediction of membrane protein
structure.
References
1. Hirokawa, T., S. Boon-Chieng, and S. Mitaku, SOSUI:
classification and secondary structure prediction
system for membrane proteins. Bioinformatics,
1998. 14(4): p. 378-9.
2. Claros, M.G. and G. von Heijne, TopPred II: an
improved software for membrane protein
structure predictions. Comput Appl Biosci, 1994.
10(6): p. 685-6.
3. Stoffel, K.H.W., TMbase - A database of membrane
spanning proteins segments. Biol. Chem.
Hoppe-Seyler, 1993. 374: p. 166.
4. Cao, B., et al., Enhanced recognition of protein
transmembrane domains with prediction-based
structural profiles. Bioinformatics, 2006. 22(3): p.
303-9.
5. Jayasinghe, S., K. Hristova, and S.H. White, MPtopo: A
database of membrane protein topology. Protein
Sci, 2001. 10(2): p. 455-8.
6. Tusnady, G.E., Z. Dosztanyi, and I. Simon, PDB_TM:
selection and membrane localization of
transmembrane proteins in the protein data bank.
Nucleic Acids Res, 2005. 33(Database issue): p.
D275-8.
7. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a
new generation of protein database search
programs. Nucleic Acids Res, 1997. 25(17): p.
3389-402.
8. Chothia, C., The nature of the accessible and buried
surfaces in proteins. J Mol Biol, 1976. 105(1): p.
1-12.
9. Kabsch, W. and C. Sander, Dictionary of protein
secondary structure: pattern recognition of
hydrogen-bonded and geometrical features.
Biopolymers, 1983. 22(12): p. 2577-637.
10. Wagner M., A.R., Porollo A and Meller J., Linear
Regression Models for Solvent Accessibility
Prediction in Proteins. Journal of Computational
Biology, 2005. 12(3): p. 355-369.
11. Adamczak, R., A. Porollo, and J. Meller, Accurate
prediction of solvent accessibility using neural
networks-based regression. Proteins, 2004. 56(4):
p. 753-67.
12. Kyte, J. and R.F. Doolittle, A simple method for
displaying the hydropathic character of a protein.
J Mol Biol, 1982. 157(1): p. 105-32.
13. Adamian, L., et al., Empirical lipid propensities of
amino acid residues in multispan alpha helical
membrane proteins. Proteins, 2005. 59(3): p.
496-509.
14. Rohl CA, S.C., Misura KM, Baker D., Protein
structure prediction using Rosetta. Methods
Enzymol, 2004. 383: p. 66–93.
15. Yarov-Yarovoy, V., J. Schonbrun, and D. Baker,
Multipass membrane protein structure prediction
using Rosetta. Proteins, 2006. 62(4): p. 1010-25.
16. Wu, S., J. Skolnick, and Y. Zhang, Ab initio
modeling of small proteins by iterative TASSER
simulations. BMC Biol, 2007. 5: p. 17.