Module 7 - Protein Structure Prediction

MODULE 7
Protein Structure Prediction
AIMS
To understand how computer algorithms can be used to predict the secondary and tertiary
structures of proteins
 To recognize different approaches to structure prediction
 To understand some aspects of the limitations of computer-based methods
OBJECTIVES
The student should be able to:
Predict the occurrence of aspects of secondary and tertiary structure in proteins using
Web-based analytical tools
Select which tools are most appropriate for a particular analysis
INTRODUCTION
Protein structure may be considered at a variety of levels (for further information see the Web-based tutorial):
1° (primary) structure is the actual amino acid sequence of the protein
2° (secondary) structure refers to the localized organization of parts of the polypeptide chain
(e.g. α helix, β sheet, turn etc.)
3° (tertiary) structure describes the three-dimensional organization of all the atoms in the
polypeptide
4° (quaternary) structure refers to the organization of a protein composed of more than one
polypeptide chain
This module deals with the prediction of the secondary and tertiary structure of proteins. The
most direct route to the study of protein structure is the use of techniques such as X-ray
crystallography and NMR to determine the atomic co-ordinates of a protein. However, whilst
there are over 100,000 entries in the primary protein sequence databases, there are only just
over 12,000 entries in the protein structure databases. In consequence, a variety of methods
are in development to predict secondary and tertiary structure from the 1o sequence
information and this is the topic covered by this module. In truth this is an enormous subject
worthy of a course all to itself, so only a somewhat superficial view can be presented here.
More detailed tutorials and guides, such as “Sisyphus and protein structure prediction”,
“Pedestrian guide to analysing sequence databases” and “A Guide to protein structure
Prediction”, are available on the Web.
Secondary structure prediction
The most successful area of protein structure prediction deals with secondary structure and
related topics including the interaction of proteins with membranes.
Signal peptides
Signal peptides (or signal sequences) are short N-terminal amino acid sequences that target
the protein for membrane translocation and are removed after translocation. SignalP predicts
signal peptide cleavage sites in Gram-positive, Gram-negative and eukaryotic amino acid
sequences. http://www.cbs.dtu.dk/services/SignalP/caution.html
Intracellular targeting
TargetP predicts the subcellular location of eukaryotic protein sequences. The subcellular
location assignment is based on the predicted presence of any of the N-terminal presequences:
chloroplast transit peptide, mitochondrial targeting peptide, or secretory pathway signal
peptide.
Trans-membrane α helices
Many proteins in the cell are integral membrane proteins that have one or more segments
embedded in the membrane. In transmembrane proteins one or more segments of the protein completely
traverse the phospholipid bilayer and these membrane-spanning domains are always α helices
or multiple β strands. Arguably, the most successful area in secondary structure prediction is
that of the prediction of trans-membrane α helices. There are a variety of computational
approaches which offer 90% accuracy or more in such predictions. We will focus on one of
the approaches, known as TMHMM, although there are others such as TopPred2, MEMSAT,
DAS and PHDhtm, which you might have a look at.
The large majority of trans-membrane α helices consist of an unusually long stretch of
hydrophobic amino acid residues and it is this feature that many programs employ to identify
such potential α helices. The helix also has a topology, i.e. whether it runs inwards or
outwards. Positively charged residues, arginine and lysine, play a central role in determining
the orientation since they are primarily found in non-transmembrane parts of the polypeptide
on the cytoplasmic side. TMHMM employs a hidden Markov model which maps closely onto these
features to make highly accurate predictions of trans-membrane α helices.
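The hydrophobic-stretch idea can be illustrated with a simple sliding-window hydropathy scan. This is a minimal sketch and not the hidden Markov model used by TMHMM: it uses the Kyte-Doolittle hydropathy scale, and the window of 19 residues and threshold of 1.6 are commonly quoted values that are assumed here rather than taken from this module. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence.

    Raises KeyError if the sequence contains a non-standard residue code.
    """
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_segments(seq, window=19, threshold=1.6):
    """Return (start, end) spans (0-based, end-exclusive) covered by windows
    whose mean hydropathy reaches the threshold; overlapping windows are
    merged into one span."""
    segments = []
    for i, mean in enumerate(hydropathy_profile(seq, window)):
        if mean < threshold:
            continue
        start, end = i, i + window
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], end)  # extend the previous span
        else:
            segments.append((start, end))
    return segments


if __name__ == "__main__":
    # Artificial test sequence: a 21-residue leucine run flanked by charged residues.
    toy = "DKRDE" * 4 + "L" * 21 + "DKRDE" * 4
    print(candidate_segments(toy))
```

A scan of this kind only flags hydrophobic stretches. It says nothing about topology, which is where the distribution of arginine and lysine residues comes in.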
Have a look at the output from a typical TMHMM analysis of the lactose permease (LacY)
from E. coli. Notice it has 12 predicted trans-membrane α helices with their polarity clearly
indicated.
α helices and β sheets etc.
One of the first predictive algorithms for secondary structure, GOR (Garnier, Osguthorpe &
Robson, 1978), was developed through a co-operation between a laboratory interested in
developing the theory for protein secondary structure prediction methods and a laboratory
interested in applying and comparing such methods. The GOR algorithm unambiguously
assigns each residue to one conformational state of α-helix, extended chain, reverse turn or
coil. In its initial form GOR was roughly 50% accurate on a test sample of 26 proteins. GOR
has now been through a series of developments and version IV of GOR has a mean accuracy
of 64.4% for a three state prediction. The program gives two outputs, one eye-friendly
(example) giving the sequence and the predicted secondary structure in rows, H=helix,
E=extended or beta strand and C=coil; the second (example) gives the probability values for
each secondary structure at each amino acid position. The predicted secondary structure is the
one of highest probability compatible with a predicted helix segment of at least four residues
and a predicted extended segment of at least two residues.
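To make the two outputs and the quoted accuracy figures concrete, the sketch below picks the most probable state at each position from a table of per-residue probabilities and computes a three-state accuracy as the percentage of residues whose predicted state matches the observed one. It is a simplified illustration only: it ignores the minimum segment lengths described above, and the function names and input layout are hypothetical rather than part of GOR.

```python
def predict_states(probabilities):
    """Choose the most probable state (H, E or C) at each residue.

    `probabilities` is a list with one dict per residue,
    e.g. {"H": 0.6, "E": 0.1, "C": 0.3}.
    """
    return "".join(max(p, key=p.get) for p in probabilities)


def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    probs = [
        {"H": 0.6, "E": 0.1, "C": 0.3},
        {"H": 0.2, "E": 0.5, "C": 0.3},
        {"H": 0.1, "E": 0.2, "C": 0.7},
    ]
    print(predict_states(probs))
    print(q3_accuracy("HHHEECCC", "HHHEEECC"))
```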
There are a number of other secondary structure prediction approaches including PSIPRED,
PHD, NNSSP, PROF, Predator and ZPRED. Most of these servers expect the input to be an
alignment of multiple sequences, which enhances the accuracy of the predictions.
Jpred was developed as a result of a study to test and compare different secondary structure
prediction methods. Jpred takes a single input sequence and scans it against a non-redundant
sequence database. The hits are aligned with CLUSTALW (v1.7) and the alignment is
submitted to MULPRED, which uses a combination of single sequence methods that are
combined to give a prediction profile, from which a consensus is taken. The methods used
within MULPRED are Lim, GOR, Chou-Fasman, Rose and Wilmot/Thornton turn prediction
methods. The accuracy of Jpred is approximately 73%.
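The idea of taking a consensus from several methods can be sketched as a per-residue majority vote. This is an assumption made for illustration, not a description of how MULPRED actually combines its prediction profile. The function name and the choice to fall back to coil on a tie are hypothetical.

```python
from collections import Counter


def consensus(predictions, tie_state="C"):
    """Per-residue majority vote over equal-length prediction strings.

    Each string uses one letter per residue (H, E or C). When the two
    most common states are tied, `tie_state` is used instead.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        top_state, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        result.append(tie_state if tied else top_state)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["HHHEC", "HHCEC", "HECEE"]))
```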
Super-secondary structure
Secondary structure elements are observed to combine in specific geometric arrangements
known as motifs or super-secondary structures (see Web-based tutorial) e.g. coiled coils,
helix-turn-helix etc.
Coiled-coils are another structural feature of proteins which sometimes separate domains.
Coiled coils comprise two, three or four amphipathic α helices wrapped round one another.
Coiled coil motifs are particularly amenable to computer-based prediction because of the
characteristic repeating pattern of hydrophobic residues spaced every four and then three
residues apart. This pattern forms a heptad repeat (abcdefg)n of amino acids in which
positions a and d tend to be hydrophobic and positions e and g are predominantly charged
residues. Predictions of coiled coils can be obtained at PAIRCOIL and MULTICOIL. The
leucine zipper structure is adopted by one family of the coiled coil proteins. Leucine zippers
have a characteristic leucine repeat: Leu-X6-Leu-X6-Leu-X6-Leu (where X may be any
residue) and TRESPASSER will detect such motifs with a high degree of accuracy.
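The leucine repeat quoted above translates directly into a regular expression. The sketch below is a naive pattern search, not the method used by TRESPASSER, and the function name is hypothetical. A lookahead is used so that overlapping occurrences are all reported.

```python
import re

# Leu-X6-Leu-X6-Leu-X6-Leu, wrapped in a lookahead so overlapping hits are found.
ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_zippers(seq):
    """Return (start, matched_text) for every occurrence of the leucine repeat.

    `start` is a 0-based index into the sequence.
    """
    return [(m.start(), m.group(1)) for m in ZIPPER.finditer(seq.upper())]


if __name__ == "__main__":
    toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"  # artificial sequence
    print(find_leucine_zippers(toy))
```

A match on this pattern alone does not show that the region is helical, so any hit is best checked against a coiled coil prediction.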
The helix-turn-helix motif occurs in many DNA binding proteins and can be predicted using
HTH.
Integrated structure prediction
There is a variety of servers which offer a secondary structure prediction integrated with a
variety of other analyses.
Predict Protein offers predictions of:
secondary structure,
solvent accessibility,
globular regions,
transmembrane helices,
coiled-coil regions,
as well as
a multiple sequence alignment (i.e. database search),
ProSite sequence motifs,
low-complexity regions (SEG),
ProDom domain assignments.
Tertiary structure prediction
This component of the module, more than any other, can only skim the surface of a complex
and extensive topic. An excellent and more detailed introduction to the topic is provided
in “A Guide to Structure Prediction (version 2)”.
The ultimate objective in protein structure prediction is to use ab initio methods to accurately
predict the tertiary structure of a protein from its primary structure using purely physicochemical information. However, such approaches are prevented at present by a lack of some
of the basic information required combined with the enormous computational complexity of
the task.
Tertiary structure describes the folding of the polypeptide chain to assemble the different
secondary structure elements into a particular arrangement. Just as helices, sheets etc. are the
units of secondary structure so the folds/domains are the units of tertiary structure. In
multidomain proteins, tertiary structure includes the arrangement of domains relative to each
other as well as the arrangement of residues within the domain. The terms ‘domain’ and ‘fold’
to a large extent mean the same thing though definitions may vary. Domains are regions of
contiguous polypeptide chain that have been described as compact, local, and semi-independent units. A fold is defined as a component of tertiary structure in which the proteins
have the same major secondary structures in the same arrangement with the same topological
connections. There are glossaries of the different protein folds/domains.
The overall strategy for tertiary structure prediction is summarized by the following
flowchart.
An excellent, more detailed, interactive flowchart has been produced by Robert Russell.
The first step in any attempt to predict the tertiary structure of a protein is to search the
sequence databases for proteins that show sequence similarity. If the result of the search
includes a protein of known structure then the route of choice is homology modelling. If there
is no homologue in the structural databases then things become rather more difficult, but not
impossible. Even with no homologues of known structure it may be possible to use fold
recognition methods. There is a so-called “twilight area” of 20-30% sequence identity, where
it is difficult to assess whether
One of the most important advances in sequence comparison recently has been the
development of both gapped BLAST and PSI-BLAST (position-specific iterated BLAST).
Both of these have made BLAST much more sensitive, and the latter is able to detect very
remote homologues by taking the results of one search, constructing a profile and then using
this to search the database again to find other homologues.
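The notion of constructing a profile from the hits of a search can be sketched with plain per-column residue frequencies. This is a deliberately simplified illustration: PSI-BLAST itself builds a position-specific scoring matrix of log-odds scores with sequence weighting and pseudocounts, none of which is attempted here. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def build_profile(aligned):
    """Per-column residue frequencies from equal-length aligned sequences.

    Gap characters ('-') are ignored; an all-gap column gives an empty dict.
    """
    profile = []
    for column in zip(*aligned):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def score_against_profile(profile, segment):
    """Sum the profile frequency of each residue in the segment.

    Higher scores mean the segment looks more like the aligned family.
    """
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))


if __name__ == "__main__":
    hits = ["ACD", "ACE", "A-D"]  # toy alignment of three hits
    profile = build_profile(hits)
    for candidate in ("ACD", "ACE", "GGG"):
        print(candidate, round(score_against_profile(profile, candidate), 3))
```

Because the profile records which residues are tolerated at each position, a sequence can score well against it even when it resembles no single member of the family closely.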
Homology modelling
The most successful tool for prediction of 3D structure is homology modelling. An
approximate 3D model can be built for a protein, if it has “significant similarity” to a protein
of known structure. So what is “significant similarity”? The answer is about 30% identity. At
this level of identity it is possible to construct a model which has a correct fold structure, but
may have inaccurate loops. Above levels of 90% sequence identity, homology modelling is
about as accurate as the experimental determination of a protein structure.
Part of the problem of homology modelling at lower levels of similarity is to correctly align the sequences.
Sequence alignments are more or less straightforward for levels of above 30% pairwise
sequence identity. The region between 20 and 30% sequence identity is frequently referred to
as the twilight zone.
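The identity thresholds discussed above can be turned into a small helper that computes percent identity from a pairwise alignment and suggests a route. This is a sketch under stated assumptions: identity is calculated over aligned, non-gap positions only (other definitions exist and give different numbers), and the function names and returned labels are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over positions where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def suggest_route(identity):
    """Map percent identity to the approach discussed in the text."""
    if identity >= 30.0:
        return "homology modelling"
    if identity >= 20.0:
        return "twilight zone"
    return "fold recognition"


if __name__ == "__main__":
    ident = percent_identity("ACDE-G", "ACDQWG")
    print(ident, suggest_route(ident))
```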
Fold recognition
It has long been recognised that proteins often adopt similar folds despite lack of significant
sequence or functional similarity. Fortunately, certain folds crop up time and time again in
proteins, and so fold recognition methods for predicting protein structure can be very
effective. Methods of fold recognition attempt to detect similarities between the 3D structure
of proteins that do not exhibit significant sequence similarity. There are numerous different
approaches to fold recognition, though ‘threading’ is a common feature of several of them.
Some fold recognition programs can be accessed through the Web, e.g. TOPITS and 3D-PSSM. If you have predicted that the protein under study contains a particular fold then it is
important to establish which other proteins contain a similar fold by looking at databases
such as SCOP (Structural Classification of Proteins) or CATH (Protein Structure
Classification).
Threading
Threading takes the query sequence of unknown structure and threads it through the atomic co-ordinates of a protein whose structure is known. The query sequence is moved residue by
residue through the template sequence and calculations are carried out to determine the degree of
“fitness” of the alignment by a variety of methods which could include thermodynamic
criteria, solvent accessibility, secondary structure information etc. Such approaches are quite
computationally intensive, but there are freely accessible Web-based sites which will carry
out a threading analysis e.g. bioinbgu.
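A toy version of the threading idea is sketched below. The template is reduced to a string of residue environments (B for buried, E for exposed), the query is slid along it one position at a time without gaps, and each placement is scored by how well hydrophobic residues coincide with buried positions. This is only an illustration of the principle: real threading programs use far richer scoring and allow gaps, and the residue grouping, scoring values and function names here are assumptions.

```python
# Residues treated as hydrophobic for this toy score (an assumed grouping).
HYDROPHOBIC = set("AVILMFWC")


def fitness(query, environments):
    """+1 when a hydrophobic residue sits in a buried position or a polar
    residue in an exposed one, -1 otherwise."""
    score = 0
    for aa, env in zip(query, environments):
        buried = env == "B"
        score += 1 if (aa in HYDROPHOBIC) == buried else -1
    return score


def thread(query, template_env):
    """Try every ungapped offset of the query along the template.

    Returns (best_offset, best_score); the first offset wins on ties.
    """
    n = len(query)
    if n == 0 or n > len(template_env):
        raise ValueError("query must be non-empty and no longer than the template")
    placements = ((off, fitness(query, template_env[off:off + n]))
                  for off in range(len(template_env) - n + 1))
    return max(placements, key=lambda item: item[1])


if __name__ == "__main__":
    template = "EEEBBBBEEE"  # hypothetical burial pattern of a known structure
    print(thread("LLVI", template))
```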
Building the model
Sophisticated and usually expensive software is commercially available for carrying out
tertiary structure predictions, but there is a freely accessible Web-based modelling server.
SWISS-MODEL is an Automated Protein Modelling Server running at the GlaxoWellcome
Experimental Research in Geneva, Switzerland. When a sequence is submitted to SWISS-MODEL the sequence of events is as follows:
1. BLASTP2 finds all similarities of target sequence with sequences of known structure.
2. Templates with sequence identities above 25% and projected model size larger than 20
residues are selected. This step also detects domains which can be modelled based on
unrelated templates.
3. ProModII then generates the models in which the key process is the production of a
framework which represents topology of corresponding atoms in the query sequence and the
template(s).
4. Energy minimisation analysis is done for all models.
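The template selection in step 2 amounts to a simple filter on the search hits. The sketch below applies the two thresholds quoted above. The record layout, the template identifiers and the decision to sort the survivors by identity are hypothetical and are not taken from SWISS-MODEL.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    template_id: str   # identifier of the structure (hypothetical values below)
    identity: float    # percent sequence identity to the target
    model_size: int    # projected model size in residues


def select_templates(hits, min_identity=25.0, min_size=20):
    """Keep hits above both thresholds, highest identity first."""
    chosen = [h for h in hits
              if h.identity > min_identity and h.model_size > min_size]
    return sorted(chosen, key=lambda h: h.identity, reverse=True)


if __name__ == "__main__":
    hits = [
        Hit("tmplA", 42.0, 150),
        Hit("tmplB", 24.0, 200),   # identity too low
        Hit("tmplC", 60.0, 15),    # model too small
        Hit("tmplD", 55.0, 90),
    ]
    for hit in select_templates(hits):
        print(hit.template_id, hit.identity, hit.model_size)
```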
CPHmodels is another Web based homology modelling server.
Exercises
1. Use TMHMM to predict whether the human integrin beta subunit is likely to be an integral
membrane protein and, if so, how many trans-membrane domains it has.
2. What advantages might TMHMM have over TopPred (see the original TMHMM paper)?
3. Use GORIV to do a secondary structure prediction on the alpha chain of human
hemoglobin. Compare the predictions to those of NNSSP.
4. Determine whether the human transcription factor AP-1 (proto-oncogene C-JUN) has a
coiled coil motif.
5. Does the E. coli Lac repressor contain any recognizable folds?
References
Cuff J. A. and Barton G. J. (1999) Evaluation and improvement of multiple sequence methods
for protein secondary structure prediction. PROTEINS: Structure, Function and Genetics
34:508-519.
Sonnhammer E. L. L., von Heijne G. and Krogh A. (1998) A hidden Markov model for
predicting transmembrane helices in protein sequences. In: Proc. of Sixth Int. Conf. on
Intelligent Systems for Molecular Biology, pp. 175-182. Ed. J. Glasgow, T. Littlejohn, F. Major,
R. Lathrop, D. Sankoff and C. Sensen. Menlo Park, CA: AAAI Press.
Garnier J, Osguthorpe DJ, Robson B (1978) Analysis of the accuracy and implications of
simple methods for predicting the secondary structure of globular proteins. J Mol Biol
120(1):97-120
Accuracy of structure prediction methods
Protein Structure Prediction Center
Fold recognition
Fold recognition links
Tertiary structure prediction tools and structure databases
SWISS-MODEL
Modeller-4
SCOP
Comprehensive lists of structure prediction sites can be found at:
Index of resources
Structure Prediction & Evaluation
Protein Structure Prediction
Some Other Structural Biology Databases and Servers around the world
Network Protein Sequence Analysis
Summary of protein structure
Principles of Protein Structure, Comparative Protein Modelling and Visualisation
Papers and essays on Tertiary structure prediction
Pedestrian guide to analysing sequence databases
Sisyphus and protein structure prediction
A Guide to Structure Prediction (version 2)