2D & 3D Structure Modelling

2d-3D Structure Modelling
S. Shahriar Arab
Flow of information
DNA
RNA
PROTEIN SEQ
PROTEIN STRUCT
PROTEIN FUNCTION
……….
Prediction in bioinformatics
•
Important prediction problems:
Protein sequence from genomic DNA
Protein 3D structure from sequence
Protein function from structure
Protein function from sequence
Why predict protein structure?
The sequence structure gap
Millions of known sequences, but only about 80 000 known
structures
Structural knowledge brings understanding of
function and mechanism of action
Can help in prediction of function
Why predict protein structure?
Predicted structures can be used in structure
based drug design
It can help us understand the effects of
mutations on structure or function
It is a very interesting scientific problem
still unsolved in its most general form after
more than 20 years of effort
What is protein structure prediction?
In its most general form
a prediction of the (relative) spatial position of
each atom in the tertiary structure generated
from knowledge only of the primary structure
(sequence)
Methods of structure prediction
Ab initio protein folding approaches
Comparative (homology) modelling
Fold recognition/threading
Prediction in one dimension
Secondary structure prediction
Surface accessibility prediction
2D Structure Identification
QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA
HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC
• DSSP - Database of Secondary Structures for Proteins
(http://swift.cmbi.kun.nl/gv/dssp/)
• VADAR - Volume Area Dihedral Angle Reporter
(http://redpoll.pharmacy.ualberta.ca/vadar/)
• PDB - Protein Data Bank (www.rcsb.org)
Secondary Structure
The DSSP code
• H = alpha helix
• B = residue in isolated beta-bridge
• E = extended strand, participates in beta ladder
• G = 3-helix (3/10 helix)
• I = 5-helix (pi helix)
• T = hydrogen-bonded turn
• S = bend
• C = coil
Simplifications
Eight states from DSSP:
H: α-helix
G: 3₁₀ helix
I: π-helix
E: β-strand
B: bridge
T: turn
S: bend
C: coil
Identification of secondary structures is focused on α-helices and β-strands; the others (turns, coils, other helices) are collectively called “coils”.
CASP Standard
H = (H, G, I),
E = (E, B),
C = (C, T, S)
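The CASP reduction above can be written as a lookup table. This is a minimal sketch; the names `DSSP_TO_3` and `reduce_dssp` are illustrative, and it assumes the input uses exactly the eight letters listed on the slide (coil written as C).

```python
# Mapping from the eight DSSP states to the three CASP states,
# as listed above: H = (H, G, I), E = (E, B), C = (C, T, S).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "T": "C", "S": "C",
}

def reduce_dssp(states):
    """Convert an eight-state string into a three-state string."""
    return "".join(DSSP_TO_3[s] for s in states)

print(reduce_dssp("HGIEBTSC"))
```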
What is Secondary structure prediction?
Given a protein sequence (primary structure)
GHWIATRGQLIREAYEDYRHFSSECPFIP
Predict its secondary structure content
(C=coils H=Alpha Helix E=Beta Strands)

CEEEEECHHHHHHHHHHHCCCHHCCCCCC
Why Secondary Structure Prediction?
Simply an easier problem than 3D structure prediction
Accurate secondary structure prediction can provide
important information for tertiary structure
prediction
Improving alignment accuracy
Protein function prediction
Protein classification
secondary structure prediction
• less detailed results
– only predicts the H (helix), E (extended) or C
(coil/loop) state of each residue, does not predict
the full atomic structure
• Accuracy of secondary structure prediction
– The best methods have an average accuracy of
just about 73% (the percentage of residues
predicted correctly)
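The accuracy measure described here (percentage of residues predicted correctly, usually called Q3 for three states) can be sketched as follows. The function name `q3` and the example strings are illustrative, not taken from any particular tool.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Three of four residues agree in this toy example.
print(q3("HHHC", "HHCC"))
```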
History of protein secondary structure prediction
First generation
How: single residue statistics
Example: Chou-Fasman method, LIM method, GOR I, etc
Accuracy: low
Second generation
How: segment statistics
Examples: ALB method, GOR III, etc
Accuracy: ~60%
Third generation
How: long-range interaction, homology based
Examples: PHD
Accuracy: ~70%
Chou-Fasman Method
Developed by Chou & Fasman in 1974 & 1978
Based on frequencies of residues in α-helices, β-sheets and turns
Accuracy ~50 - 60% Q3
Chou-Fasman statistics
R – amino acid, S- secondary structure
f(R,S) – number of occurrences of R in S
Ns – total number of amino acids in conformation S
N – total number of amino acids
P(R,S) – propensity of amino acid R to be in structure
S
P(R,S) = (f(R,S)/f(R))/(Ns/N)
Example
•#residues=20,000,
•#helix=4,000,
•#Ala=2,000,
•#Ala in helix=500
•f(Ala, α) = 500/20,000,
•f(Ala) = 2,000/20,000
•p(α) = Nα/N = 4,000/20,000
•P(Ala, α) = (500/2,000)/(4,000/20,000) = 0.25/0.20 = 1.25
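The propensity formula P(R,S) = (f(R,S)/f(R))/(Ns/N) can be checked against the numbers of this example. A minimal sketch; the function and argument names are illustrative.

```python
def propensity(count_r_in_s, count_r, count_s, count_total):
    """P(R,S) = (f(R,S) / f(R)) / (Ns / N), computed from raw counts."""
    return (count_r_in_s / count_r) / (count_s / count_total)

# Slide example: 20,000 residues, 4,000 in helix, 2,000 Ala, 500 Ala in helix.
p_ala_helix = propensity(500, 2_000, 4_000, 20_000)
print(round(p_ala_helix, 2))
```

A value above 1 means the residue occurs in that conformation more often than expected by chance; the later slides use the same quantity scaled by 100.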
Chou-Fasman Statistics
Amino acid propensities
Scan peptide for α-helix regions
1. Assign each residue its propensities P(H) and P(E) from the table of amino acid propensities.
2. Identify regions where 4 out of 6 contiguous residues have
P(H) > 100: the “α-helix nucleus”
Extend the α-helix nucleus
3. Extend the helix in both directions until a set of
four contiguous residues has an average P(H) < 100.
Repeat steps 1 – 3 for the entire peptide
Scan peptide for β-sheet regions
4. Identify regions where 3 out of 5 contiguous residues have
P(E) > 100: the “β-sheet nucleus”
5. Extend the β-sheet until 4 contiguous residues
have an average P(E) < 100
6. If the region average P(E) > 105 and the average P(E)
> average P(H), then assign “β-sheet”
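The helix nucleation and extension steps can be sketched as below. This is one reading of the slide, not the published implementation: the propensity table passed in is a parameter, and the two-letter table used in the usage lines holds made-up placeholder values (140 and 50), not real Chou-Fasman parameters.

```python
def mean_propensity(segment, propensity):
    """Average propensity over a segment of residues."""
    return sum(propensity[aa] for aa in segment) / len(segment)

def find_helix_nuclei(seq, propensity, window=6, min_formers=4, threshold=100):
    """Start indices of windows in which at least 4 of 6 residues have P(H) > 100."""
    nuclei = []
    for start in range(len(seq) - window + 1):
        formers = sum(propensity[aa] > threshold for aa in seq[start:start + window])
        if formers >= min_formers:
            nuclei.append(start)
    return nuclei

def extend_helix(seq, propensity, start, window=6, probe=4, threshold=100):
    """Grow the nucleus in both directions until the four residues at the
    growing end average below the threshold. Returns a half-open interval."""
    left, right = start, start + window
    while right < len(seq) and mean_propensity(
            seq[right - probe + 1:right + 1], propensity) >= threshold:
        right += 1
    while left > 0 and mean_propensity(
            seq[left - 1:left - 1 + probe], propensity) >= threshold:
        left -= 1
    return left, right

# Placeholder propensities for illustration only (not real parameters).
toy = {"A": 140, "G": 50}
sequence = "GGGAAAAAAGGGG"
print(find_helix_nuclei(sequence, toy))
print(extend_helix(sequence, toy, start=3))
```

The same two functions, called with a strand propensity table, a window of 5 and a minimum of 3 formers, would cover steps 4 and 5.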
The GOR method
developed by Garnier, Osguthorpe & Robson
builds on Pij values based on information
theory
evaluates each residue PLUS the adjacent 8 N-terminal and 8 carboxyl-terminal residues
sliding window of 17
GOR III method accuracy ~64% Q3
Second generation
GOR idea: Statistics that take into account the
whole window
Each residue carries two different types of information:
1. Intra-residue information – information about its own
secondary structure
2. Inter-residue information – the influence of this residue on
other residues
GOR….continued
1. Individual propensity of amino acid R to be in
secondary structure S. – same idea as in Chou –
Fasman
2. Contribution of 16 neighbors.
- take the window of radius 8 around the residue in
question (8 before and 8 after the residue)
- for each residue in the window consider its
contribution to the conformation of the middle
residue and add its value to PH, PS, PC.
-Like in Chou-Fasman the values of all
contributions are based on statistics.
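The window idea can be sketched as a sum of per-position contributions. This is a simplified illustration, not the published GOR parameterisation: the table `info[state][(offset, residue)]` is a hypothetical data structure, and the values in the usage lines are invented placeholders.

```python
def gor_like_predict(seq, info, radius=8, states="HEC"):
    """For each residue, sum the contributions of all residues within
    `radius` positions and pick the state with the highest total."""
    prediction = []
    for i in range(len(seq)):
        totals = {}
        for state in states:
            total = 0.0
            for offset in range(-radius, radius + 1):
                j = i + offset
                if 0 <= j < len(seq):
                    # Contribution of residue seq[j], at this offset,
                    # to the state of the central residue i.
                    total += info[state].get((offset, seq[j]), 0.0)
            totals[state] = total
        prediction.append(max(states, key=lambda s: totals[s]))
    return "".join(prediction)

# Invented placeholder statistics covering offsets -1, 0 and +1 only.
toy_info = {
    "H": {(0, "A"): 1.0, (-1, "A"): 0.5, (1, "A"): 0.5},
    "E": {(0, "V"): 1.0, (-1, "V"): 0.5, (1, "V"): 0.5},
    "C": {(0, "G"): 1.0},
}
print(gor_like_predict("AAAGVVV", toy_info))
```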
Third generation
Nearest Neighbour Method
• Idea: similar sequences are likely to have the same secondary
structure.
• Take a window around amino acid the conformation
of which is to be predicted
• Find several, say k, closest sequences (with
respect to a similarity measure defined differently
depending on the variant of the method) of known
structure.
• Assign secondary structure based on conformation
of the sequence neighbours.
• Use max(n_α, n_β, n_c) or max(s_α, s_β, s_c)
• Key: scoring measure of evolutionary similarity.
[Figure: neighbours 1 to n+1 of the query window, each shown with its secondary structure states (H, E, L) and its similarity score S1 to Sn+1; the central residue is assigned by max(n_α, n_β, n_L), by max(s_α, s_β, s_L), or by something else…]
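The nearest neighbour vote can be sketched as follows. It assumes the simplest possible similarity measure (number of identical positions) and a count-based vote; real variants of the method use evolutionary scoring measures. The database entries are invented for illustration.

```python
from collections import Counter

def knn_predict(window, database, k=3):
    """Predict the state of the central residue of `window` from the
    k most similar windows of known structure."""
    def similarity(a, b):
        # Assumption: similarity is the number of identical positions.
        return sum(x == y for x, y in zip(a, b))

    ranked = sorted(database, key=lambda entry: similarity(window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (window, state of central residue) pairs.
toy_database = [
    ("AAAAA", "H"), ("AAAGA", "H"),
    ("VVVVV", "E"), ("VIVIV", "E"),
    ("GPGPG", "L"),
]
print(knn_predict("AAAAG", toy_database, k=3))
```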
Advantages
Information from structural neighbours can be
used to provide details to predicted secondary
structure (phi,psi angles)
Much higher accuracy than previous methods.
Neural network models
• machine learning approach
provide training sets of structures (e.g. α-helices, non α helices)
computers are trained to recognize patterns in known
secondary structures
provide test set (proteins with known structures)
accuracy ~ 70 –75%
Neural Network Method
Recall artificial neurone:
How PHD works
• Step 1. BLAST search with input sequence
• Step 2. Perform multiple seq. alignment and
calculate aa frequencies for each position
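Step 2 can be sketched as a per-column frequency count over the multiple alignment. This only illustrates the idea of a frequency profile; it assumes gaps are written as "-" and are ignored, and the alignment shown is invented.

```python
from collections import Counter

def column_frequencies(alignment):
    """Amino acid frequencies for each column of a multiple alignment.
    Gap characters ("-") are not counted."""
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

toy_alignment = ["ACD", "ACE", "A-D", "GCD"]
profile = column_frequencies(toy_alignment)
for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```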
How PHD works (cont.)
Step3. Level 1: sequence to structure
•Take window of 13 adjacent residues
•Scores for helix, strand, loop in the output layer,
for each residue
Prediction tools that use NNs
• MACMATCH
- (Presnell et al., 1993)
- for Macintosh
• PHD
- (Rost & Sander, 1993)
- http://www.predictprotein.org/
• NNPREDICT
- (Kneller et al. 1990)
- http://www.cmpharm.ucsf.edu/nomi/nnpredict.html
PHD Prediction of rCD2
Prediction Accuracy
Best of the Best
PredictProtein-PHD (72%)
http://www.predictprotein.org/
Jpred (73-75%)
http://jura.ebi.ac.uk:8888/
PREDATOR (75%)
http://www.embl-heidelberg.de/cgi/predator_serv.pl
PSIpred (77%)
http://insulin.brunel.ac.uk/psipred
Accessible Surface Area
[Figure: a solvent probe rolling over the van der Waals surface, defining the accessible surface and the reentrant surface]
ASA Calculation
• DSSP - Database of Secondary Structures for
Proteins (swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter
(http://redpoll.pharmacy.ualberta.ca/vadar/)
• GetArea - www.scsb.utmb.edu/getarea/area_form.html
QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD
BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE
1056298799415251510478941496989999999
Other ASA sites
• Connolly Molecular Surface Home Page
– http://www.biohedron.com/
• Naccess Home Page
– http://sjh.bi.umist.ac.uk/naccess.html
• ASA Parallelization
– http://cmag.cit.nih.gov/Asa.htm
• Protein Structure Database
– http://www.psc.edu/biomed/pages/research/PSdb/
Accessibility
Accessibility = (Accessible Surface Area (ASA) in folded protein) / (Maximum ASA)
Two state = b (buried), e (exposed)
e.g. b <= 16%, e > 16%
Three state = b (buried), i (intermediate), e (exposed)
e.g. b <= 16%, 16% < i <= 36%, e > 36%
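The accessibility ratio and the two-state and three-state classifications can be sketched as below. The 16% and 36% cutoffs are the example values from the slide; treating exactly 36% as intermediate is an assumption, and the maximum ASA must be supplied by the caller since no reference values are given here.

```python
def relative_accessibility(asa_folded, asa_max):
    """Accessibility in percent: ASA in the folded protein / maximum ASA."""
    return 100.0 * asa_folded / asa_max

def two_state(rel, cutoff=16.0):
    """b (buried) if rel <= cutoff, otherwise e (exposed)."""
    return "b" if rel <= cutoff else "e"

def three_state(rel, low=16.0, high=36.0):
    """b (buried), i (intermediate) or e (exposed)."""
    if rel <= low:
        return "b"
    if rel <= high:
        return "i"
    return "e"

rel = relative_accessibility(20.0, 100.0)
print(rel, two_state(rel), three_state(rel))
```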
Accessibility Prediction
• PredictProtein-PHDacc (58%)
http://cubic.bioc.columbia.edu/predictprotein
• PredAcc (70%?)
http://condor.urbb.jussieu.fr/PredAccCfg.html
QHTAW...
QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB
PHD Prediction of rCD2
3D structure prediction
3D structure prediction of proteins
[Figure: prediction methods along a sequence similarity axis (0 to 100%): ab initio prediction for new folds; threading and building by homology for existing folds]
Choice of prediction methods
If you can find similar sequences of known
structure then comparative modelling is the best
way to predict structure
all other methods are less reliable
Of course, you can’t always find similar
sequences of known structure.
What if you can’t do comparative
modelling?
• Secondary structure prediction
• Fold recognition/threading
• Ab initio protein folding approaches
Divergent evolution
Different proteins in different organisms have
diverged from a common ancestor protein
Each copy of this ancestor in various organisms
has been subject to mutations, deletions, and
insertions of amino acids in its sequence
In general, its 3-D fold and function have
remained similar
Homology Modelling of Proteins
•Prediction of three dimensional structure of a target
protein from the amino acid sequence (primary
structure) of a homologous (template) protein for which
an X-ray or NMR structure is available.
Comparative modelling
Makes a prediction of tertiary structure based on
– sequences of known structure which are similar to
the target sequence (called template structures)
– an alignment between these and the target sequence
• Remember: ~25% seq ID means two proteins
have the same basic structure
What homology modelling can and cannot do
Best results relative to other methods
Unreliable in predicting the conformations of
insertions or deletions
Comparative models are unlikely to be useful in
modelling ligand docking (drug design) unless
the sequence identity with the template is >70%,
and even then, less reliable than an empirical
crystallographic or NMR structure.
What is a “good” comparative model?
Take the 3D alignment between the predicted structure
A’ and the native structure A.
Let a1,…..an be the coordinates of the carbon atoms in
the native structure and a’1,…..a’n those in the predicted
structure
RMSD = sqrt( (1/n) Σi |ai − a’i|² )
An RMSD below 2 Å is good for homology modelling results.
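The root-mean-square deviation between matched atoms can be computed as below. The sketch assumes the two structures are already superimposed (it does not perform the 3D alignment) and that coordinates are given in Å as (x, y, z) tuples; the coordinates in the usage lines are invented.

```python
import math

def rmsd(native, predicted):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates that are already superimposed."""
    if len(native) != len(predicted):
        raise ValueError("both structures need the same number of atoms")
    squared = sum(
        (x - xp) ** 2 + (y - yp) ** 2 + (z - zp) ** 2
        for (x, y, z), (xp, yp, zp) in zip(native, predicted)
    )
    return math.sqrt(squared / len(native))

# Every atom of the toy model is displaced by exactly 1 Å along z.
native_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(native_atoms, model_atoms))
```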
Factors affecting accuracy
The accuracy of comparative modelling is
controlled by the quality of the alignment
between target sequence and template
structures
Alignment is easier if the sequences are closely
related (e.g. sequence identity > 80%).
Homology model
Target sequence
Select templates
from DB
Align target sequence with template
structures
Build a model and evaluate
Homology Modelling
Assumptions
 The overall 3-D structure of the target protein is not dissimilar
to that of the related proteins.
 Regions of homologous sequence have similar structure.
Residues homologous throughout a family of proteins are
conserved structurally.
 Residues involved in biological activity have similar topology
throughout the protein family.
 Loop regions (non-conserved residues) allow insertions
and deletions without disrupting the overall structure of the protein.
 Loop regions are flexible and therefore need not be constructed
as strictly as the conserved regions - assuming that they play no role
in biological activity.
Homology Modelling of Proteins
• Steps in Molecular Modelling
– Identification of structures that will form the template for the target structure (model).
– Sequence alignment. The most important step. For proteins with low sequence homology
to the query protein (roughly <30% sequence identity), the model can be improved
by using secondary structure prediction (i.e. align-model-realign-remodel).
– Transfer the coordinates of structurally conserved regions (SCRs) from the template(s) to the target
- many fragment method
- single structure
– Modelling variable regions.
- Loop insertions: search of a high resolution fragment database
- Deletions: local minimization often sufficient.
– Modelling of side chains
- Rotamer database
– Minimization
- Local, especially loop-hinge regions
- Global
Model Building from template
Core conserved regions
Protein Fold
Variable Loop regions
Side chains
Multiple templates
Calculate the framework from
average of all template structures
Generate one model for each
template and evaluate
Model in loops
If it is a short deletion - often local Minimization is sufficient.
Insertions:
a. Look for same length in another homologue
b. Search database of short High Resolution fragments
Lowest RMSD from Anchor points
Best Sequence Homology
Least interference with Core structure.
5 residue insertion
Anchor points
(2 residues)
Database search
for 5 residue
fragments
annealing
Side Chain modelling
Same S.C.: conformer taken from template.
Partial similarity: most of the S.C. is built on the template.
Substitution: built based on rotamer library &
energetics.
Core model with side chains
Minimization
LOCAL: Minimize a fragment. Usually a loop and its
anchor regions - as these often have bad geometries. First minimize without
influence of surrounding structure then take surrounding structure into
account.
GLOBAL: Minimize whole protein (& H2O). Mainly
to relieve short contacts and to rectify bad geometry, like bond angles,
peptide planarity etc.
Errors in Models !!!
Incorrect template selection
Incorrect alignments
Errors in positioning of side-chains and loops
Fold recognition or threading
Aimed at detecting when the target sequence
adopts a known fold, even if it has no significant
similarity to sequences of known fold
How many folds are there ?
SCOP: Structural Classification of Proteins, 1.75 release: 38221 PDB entries (23 Feb
2009), 110800 domains, 1 literature reference (excluding nucleic acids and
theoretical models)
Source: http://scop.mrc-lmb.cam.ac.uk/scop/count.html
Threading
Definition
Threading - A protein fold recognition technique
that involves replacing the sequence of a known
protein structure with a query sequence of
unknown structure. The new “model” structure is
evaluated using a simple heuristic measure of
protein fold quality. The process is repeated
against all known 3D structures until an optimal
fit is found.
Why Threading?
Secondary structure is more conserved than
primary structure
Tertiary structure is more conserved than
secondary structure
Therefore very remote relationships can be
better detected through 2D or 3D structural
homology instead of sequence homology
Threading idea
Choose a set of candidate structures templates.
Align a sequence of proteins of unknown
structure to each template structure.
Design a test that will evaluate which template is
the most likely candidate for the correct fold for
the given sequences. If none is reasonable – be
able to recognize it as a possible new fold.
Threading
Database of 3D structures and sequences
– Protein Data Bank (or non-redundant subset)
Query sequence
– Sequence < 25% identity to known structures
Alignment protocol
– Dynamic programming
Evaluation protocol
– Distance-based potential or secondary structure
Ranking protocol
2 Kinds of Threading
• 2D Threading
• Prediction Based Methods (PBM)
– Predict secondary structure (SS) or ASA of query
– Evaluate on basis of SS and/or ASA matches
• 3D Threading
• Distance Based Methods (DBM)
– Create a 3D model of the structure
– Evaluate using a distance-based “hydrophobicity” or
pseudo-thermodynamic potential
2D Threading Algorithm
(prediction based method)
Convert PDB to a database containing sequence,
SS and ASA information
Predict the SS and ASA for the query sequence
Perform a dynamic programming alignment using
the query against the database (include sequence,
SS & ASA)
Rank the alignments and select the most probable
fold
Dynamic Programming
Match scores:
      G   E   N   E   S   I   S
G    10   0   0   0   0   0   0
E     0  10   0   0   0   0   0
N     0   0  10   0   0   0   0
E     0  10   0  10   0   0   0
T     0   0   0   0   0   0   0
I     0   0   0  10   0  10   0
C     0   0   0   0   0   0   0
S     0   0   0   0  10   0  10
Accumulated scores:
      G   E   N   E   S   I   S
G    60  40  30  20  20  10   0
E    40  50  30  20  20  10   0
N    30  30  40  20  20  10   0
E    20  30  20  30  20  10   0
T    20  20  20  20  20  10   0
I     0   0   0  10   0  20   0
C    10  10  10  10  10  10   0
S     0   0   0   0  10   0  10
Sij (Identity Matrix)
A C D E F G H I K L M N P Q R S T V W Y
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
N 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
A Simple Example...
The matrix is filled row by row with identity scores; each cell adds the best score found in the rows above it and the columns to its left:
    A A T V D
A   1 1 0 0 0
V   0 1 1 2 1
V   0 1 1 2 2
D   0 1 1 1 3
Two alignments reach the best score of 3:
AATVD
| |||
A-VVD
AATVD
|| ||
AV-VD
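The filled matrix of this example can be reproduced with a short routine. This is a sketch of the simple scheme the slide appears to use (identity scores, no gap penalty, each cell adds the best cell strictly above and to its left); it is not a full Needleman-Wunsch implementation and does not perform the traceback.

```python
def fill_matrix(query, template, score):
    """M[i][j] = score(template[i], query[j]) + best M[a][b] with a < i, b < j."""
    rows, cols = len(template), len(query)
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            best_previous = max(
                (matrix[a][b] for a in range(i) for b in range(j)),
                default=0,  # first row and first column have no predecessor
            )
            matrix[i][j] = score(template[i], query[j]) + best_previous
    return matrix

def identity(a, b):
    """Identity matrix score: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0

matrix = fill_matrix("AATVD", "AVVD", identity)
for residue, row in zip("AVVD", matrix):
    print(residue, row)
```

The bottom-right value (3) is the score shared by the two alignments A-VVD and AV-VD.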
Let’s Include 2D info & ASA
Sij (strc):
    H E C
H   1 0 0
E   0 1 0
C   0 0 1
Sij (asa):
    E P B
E   1 0 0
P   0 1 0
B   0 0 1
Sij (total) = k1 Sij (seq) + k2 Sij (strc) + k3 Sij (asa)
A Simple Example...
The query AATVD has secondary structure EEECC; the template AVVD has secondary structure EECC. Each cell now scores sequence identity plus secondary structure identity:
      E E E C C
      A A T V D
E A   2 2 1 0 0
E V   1 3 3 3 2
C V   0 2 3 5 4
C D   0 2 3 4 7
Two alignments reach the best score of 7:
AATVD
| |||
A-VVD
AATVD
|| ||
AV-VD
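The combined score Sij = k1 Sij(seq) + k2 Sij(strc) + k3 Sij(asa) can be plugged into the same matrix filling scheme. In this sketch k1 = k2 = 1 and k3 = 0, because the example gives no ASA strings; the "?" placeholders and all function names are illustrative assumptions.

```python
def combined_score(a, b, k_seq=1, k_strc=1, k_asa=0):
    """a and b are (residue, secondary structure, accessibility) tuples;
    each term is an identity score weighted by its k."""
    return (k_seq * (a[0] == b[0])
            + k_strc * (a[1] == b[1])
            + k_asa * (a[2] == b[2]))

def fill_matrix(query, template, score):
    """Each cell adds its own score to the best cell strictly above-left."""
    matrix = [[0] * len(query) for _ in template]
    for i in range(len(template)):
        for j in range(len(query)):
            best_previous = max(
                (matrix[a][b] for a in range(i) for b in range(j)), default=0)
            matrix[i][j] = score(template[i], query[j]) + best_previous
    return matrix

# Sequence, secondary structure and (unknown) accessibility per position.
query_positions = list(zip("AATVD", "EEECC", "?????"))
template_positions = list(zip("AVVD", "EECC", "????"))
threading_matrix = fill_matrix(query_positions, template_positions, combined_score)
for row in threading_matrix:
    print(row)
```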
2D Threading Performance
In test sets 2D threading methods can identify
30-40% of proteins having very remote
homologues (i.e. not detected by BLAST) using
“minimal” non-redundant databases (<700
proteins)
If the database is expanded ~4x the performance
jumps to 70-75%
2D Threading Advantages
Algorithm is easy to implement
Algorithm is very fast (10x faster than 3D threading
approaches)
The 2D database is small (<500 kbytes) compared to 3D
database (>2 Gbytes)
Appears to be just as accurate as DBM or other 3D
threading approaches
Very amenable to web servers
2D Threading
Disadvantages
Reliability is not 100% making most threading
predictions suspect unless experimental
evidence can be used to support the conclusion
Does not produce a 3D model at the end of the
process
Doesn’t include all aspects of 2D and 3D
structure features in prediction process
Servers - PredictProtein
Servers - PSIPRED
Servers - LIBRA I
More Servers - www.bronco.ualberta.ca
Force Fields
Molecular Mechanics
Statistical or Knowledge based
Molecular Mechanic Force Field
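• $E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{oop}$ (bonded terms)
$\quad + E_{vdw} + E_{el} + E_{hb}$ (non-bonded terms)
$\quad + E_{str-str} + E_{str-bnd} + E_{str-tor} + E_{bnd-bnd} + E_{bnd-tor}$ (cross terms)
$E_{str} = \sum_i k_{b,i} (b_i - b_0)^2$ (bond length)
$E_{bend} = \sum_i k_{\theta,i} (\theta_i - \theta_0)^2$ (bond angle)
$E_{tors} = \sum_i k_{\phi,i} \cos(3\phi_i + \gamma_0)$ (torsion angle)
$E_{oop} = \sum_i k_{imp} (\chi - \chi_0)^2$ (improper quadratic, out of plane)
$E_{vdw} = \sum_i \sum_j \left( A_{ij} d_{ij}^{-6} + B_{ij} d_{ij}^{-12} \right)$ (van der Waals interaction)
$E_{el} = \sum_i \sum_j v_i v_j / (\varepsilon d_{ij})$ (electrostatic interaction)
$E_{hb} = \sum_i \sum_j \varepsilon \left[ 5 (R_0/R_{ij})^{12} - 6 (R_0/R_{ij})^{10} \right]$ (hydrogen bond)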
Molecular Mechanic Force Field
AMBER, CHARMM, GROMACS, ...
Differences: terms of energy, parameters
Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM Jr, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA . A Second Generation Force Field for the
Simulation of Proteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. 1995 117: 5179–5197.
Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M: CHARMM: A program for macromolecular energy, minimization, and dynamics
calculations. J Comput Chem 1983, 4:187-217
Van Der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJ . GROMACS: fast, flexible, and free. J Comput Chem 2005, 26 (16): 1701–18
Statistical Force Field
Tanaka and Scheraga (1976): the idea of using the Boltzmann distribution to find a knowledge-based force field
Derived from an analysis of known structures in the Protein Data Bank
Schueler-Furman O, Wang C, Bradley P, Misura K, Baker D: Progress in modeling of protein structures and interactions. Science 2005, 310:638-642.
Bradley P, Misura KM, Baker D: Toward high-resolution de novo structure prediction for small proteins. Science 2005, 309:1868-1871.
Statistical Force Field
Reduce protein structure
Distribution of:
Distances
Angles
ASA
Lazaridis T, Karplus M: Effective energy functions for protein structure prediction. Curr Opin Struct Biol 2000, 10:139-145.
Bauer A, Beyer A: An improved pair potential to recognize native protein folds. Proteins Struct Funct Genet 1994, 18:254-261.
Jernigan RL, Bahar I: Structure-derived potentials and protein simulations. Curr Opin Struct Biol 1996, 6:195-209.
Melo F, Feytmans E: Assessing protein structures with a non-local atomic interaction energy. J Mol Biol 1998, 277:1141-1152.
Sippl MJ: Knowledge-based potentials for proteins. Curr Opin Struct Biol 1995, 5:229-235.
Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11:430-448.
Covell DG: Folding protein α-carbon chains into compact forms by Monte Carlo methods. Proteins Struct Funct Genet 1992, 14:409-420.
Tobi D, Elber R: Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins Struct Funct Genet 2000, 41:40-46.
Sun S: Reduced representation model of protein structure prediction: statistical potential and genetic algorithms. Protein Sci 1993, 2:762-785.
Statistical Force Field
P(c)≈ exp(−βE(c))
Contact Potential Calculation - 1
Interaction energy between AAs
E(interaction) = -kT ln(frequency of interaction)
k: Boltzmann constant
T: temperature (in K, 273 K = 0 ºC)
Frequency of interaction: measured in database of known struct.
More frequent ⇒ more favourable
“energy” based on contact potentials (Jones)
Pairwise contact potentials:
ΔEab(s) = -kT ln (fab(s)/f(s))
s : separation length
fab(s): frequency of occurrence of a, b with separation s
f(s): frequency of the separation
Define energy of a structure as the sum over all pairwise
contact potentials.
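The pairwise contact potential ΔEab(s) = -kT ln(fab(s)/f(s)) and the sum over all contacts can be sketched as follows. Energies are expressed in units of kT (kT = 1 by default), and the frequencies in the usage lines are invented placeholders, not values derived from a structure database.

```python
import math

def contact_energy(f_ab, f_ref, kT=1.0):
    """Delta E_ab(s) = -kT ln(f_ab(s) / f(s)); negative means favourable."""
    return -kT * math.log(f_ab / f_ref)

def structure_energy(contacts, pair_frequency, reference_frequency, kT=1.0):
    """Sum of pairwise contact potentials over (a, b, s) contacts."""
    total = 0.0
    for a, b, s in contacts:
        key = (min(a, b), max(a, b), s)  # residue pairs are unordered
        total += contact_energy(pair_frequency[key], reference_frequency[s], kT)
    return total

# Hypothetical frequencies for separation class 1.
pair_frequency = {("L", "V", 1): 0.02, ("D", "K", 1): 0.005}
reference_frequency = {1: 0.01}

print(contact_energy(0.02, 0.01))  # more frequent than reference: favourable
print(structure_energy([("V", "L", 1), ("K", "D", 1)],
                       pair_frequency, reference_frequency))
```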
Limitation of Contact Potential Method
The energy associated with an isolated AA pair is
assumed to be similar to that found in known
protein structures
Modification: the conformation energy of groups
of AAs larger than 2 may provide a more reliable
prediction
Ab Initio Prediction
Predicting the 3D structure without any “prior
knowledge”
Used when homology modelling or threading
have failed (no homologues are evident)
Equivalent to solving the “Protein Folding
Problem”
Still a research problem
Ab initio protein folding
Aims to predict tertiary structure from basic
physico-chemical principles
does not rely on any detection of similarity to
sequences of known structure
An important scientific question
As yet very unreliable for practical predictions
Some Ab Initio Methods
Molecular Dynamic Simulation
Using complex energy functions, simulate folding of the
primary sequence until it reaches its native state (1D->3D)
Genetic Algorithm
Used in refining a given potential function so that it can best
predict the native state of a protein
Simulated Annealing
Branch and Bound Methods (usually used in side-chain
conformation)
INPUT
1. Sequence of amino acids
2. The chemical structures of amino acids and peptide
backbone
constituent atoms
bond lengths, angles
constraints on dihedral angles
3. The properties of the media (water molecules, anions,
cations, other molecules…)
OUTPUT
3D coordinates of atoms in the protein (or some
equivalent representation)
We are also willing to accept partial information:
3D structure of active site only
Location (in sequence) of secondary structures
Prediction of the “class” or “family” of the protein
Is the problem hard?
YES.
Huge Search Space:
Assume each amino acid can adopt one of three conformations (alpha,
beta, coil); then a chain of 100 amino acids has 3^100 = 5 x 10^47 possible
folds.
If we sample one fold every 10^-13 seconds, it would take 10^27 years.
The universe is 10^10 years old.
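The orders of magnitude in this estimate can be checked with a few lines of arithmetic. The sketch assumes three conformations per residue, 100 residues and one fold sampled every 10^-13 seconds, as stated above; a year is taken as 365.25 days.

```python
# Number of conformations for 100 residues with 3 states each.
conformations = 3 ** 100

# Time to enumerate them all at one conformation per 1e-13 seconds.
seconds = conformations * 1e-13
years = seconds / (365.25 * 24 * 3600)

print(f"{conformations:.2e} conformations")
print(f"{years:.2e} years")
```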
Difficult criterion for “correct fold.”:
Interaction between thousands of atoms with each other, surrounding
water,and surrounding molecules.
Can it be done?
• YES.
Nature does it all the time.
Real proteins fold in the range of seconds.
• THUS
Nature must not sample all conformations.
Nature knows the correct criterion.
Potential Energy Function
• How do we know when a predicted structure is the
native shape of the protein?
In thermodynamics, a molecule is most stable when its free energy is at a
minimum, so the native shape is at a free energy minimum
• The potential energy function is a simplification of the actual
forces acting on a real protein molecule, and its formulation is
based on the given simplified structural model
Polypeptides can be...
Represented by a range of approaches or
approximations including:
all atom representations in cartesian space
all atom representations in dihedral space
simplified atomic versions in dihedral space
tube/cylinder/ribbon representations
lattice models
Ab Initio Folding
• Two Central Problems
Sampling conformational space (10^100)
The energy minimum problem
• The Sampling Problem (Solutions)
Lattice models, off-lattice models, simplified chain
methods
• The Energy Problem (Solutions)
Threading energies, simplified force fields, packing
assessment, topology assessment
A Simple 2D Lattice
3.5Å
Lattice Folding
Lattice Algorithm
1. Build an “n x m” matrix (a 2D array)
2. Choose an arbitrary point as your N-terminal residue (start
residue)
3. Add or subtract “1” from the x or y position of the start residue (then of each newly placed residue)
4. Check to see if the new point (residue) is off the lattice or is
already occupied
5. Evaluate the energy
6. Go to step 3 and repeat until done
Lattice Energy Algorithm
• Red = hydrophobic, Blue = hydrophilic
• If Red is near empty space E = E+1
• If Blue is near empty space E = E-1
• If Red is near another Red E = E-1
• If Blue is near another Blue E = E+0
• If Blue is near Red E = E+0
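The lattice algorithm and the energy rules can be sketched together. Several points are assumptions, since the slides do not fix them: "near" is read as the four lattice neighbours, every residue scores its own neighbours (so a Red-Red contact is counted once from each side), chain neighbours are included, sites outside the lattice count as empty space, and the chain is grown by random moves with restarts. H stands for Red (hydrophobic) and P for Blue (hydrophilic).

```python
import random

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def grow_chain(sequence, n, m, rng, max_tries=1000):
    """Place residues one by one on an n x m lattice as a self-avoiding walk."""
    for _ in range(max_tries):
        coords = [(rng.randrange(n), rng.randrange(m))]  # N-terminal residue
        while len(coords) < len(sequence):
            x, y = coords[-1]
            moves = [(x + dx, y + dy) for dx, dy in STEPS
                     if 0 <= x + dx < n and 0 <= y + dy < m
                     and (x + dx, y + dy) not in coords]
            if not moves:          # dead end: restart with a new start point
                break
            coords.append(rng.choice(moves))
        else:
            return coords
    raise RuntimeError("no self-avoiding chain found")

def lattice_energy(sequence, coords):
    """Apply the slide's rules to every residue and its four neighbours."""
    occupied = dict(zip(coords, sequence))
    energy = 0
    for (x, y), kind in occupied.items():
        for dx, dy in STEPS:
            neighbour = occupied.get((x + dx, y + dy))
            if neighbour is None:                    # empty space
                energy += 1 if kind == "H" else -1
            elif kind == "H" and neighbour == "H":   # Red near Red
                energy -= 1
            # Blue-Blue and Blue-Red contacts contribute 0
    return energy

chain = grow_chain("HPHPPH", 5, 5, random.Random(0))
print(chain, lattice_energy("HPHPPH", chain))
```

A folding search would repeat `grow_chain` many times and keep the conformation with the lowest energy.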
More Complex Lattices
3D Lattices
Really Complex 3D Lattices
J. Skolnick
Lattice Methods
Advantages
• Easiest and quickest way to build a polypeptide
• More complex lattices allow reasonably accurate representation
Disadvantages
• At best, only an approximation to the real thing
• Does not allow accurate constructs
• Complex lattices are as “costly” as the real thing
The CASP “contest”
CASP is a blind prediction contest. There is a set
of structures that have been solved (e.g. crystallized) but not yet published.
The predictors attempt to predict their structures.
The results are compared.
• http://predictioncenter.org/casp[1,2,3,4,5,6,7,8,9]/