CS273_StructurePrediction

Protein Structural Prediction
Protein Structure is Hierarchical
Structure Determines Function
The Protein Folding Problem
What determines structure?
• Energy
• Kinematics
How can we determine structure?
• Experimental methods
• Computational predictions
Primary Structure: Sequence
• The primary structure of a protein is the amino acid sequence
Primary Structure: Sequence
• Twenty different amino acids have distinct shapes and properties
Primary Structure: Sequence
A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"
Secondary Structure: α, β, & loops
• α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms
Secondary Structure: α helix
Secondary Structure: β sheet
β sheet
β bulge
Second-and-a-half-ary Structure: Motifs
beta helix
beta barrel
beta trefoil
Tertiary Structure: Domains
Mosaic Proteins
Tertiary Structure: A Protein Fold
Protein Folds Composed of α, β, other
Quaternary Structure: Multimeric Proteins or Functional Assemblies
• Multimeric Proteins
  - Hemoglobin: a tetramer
• Macromolecular Assemblies
  - Ribosome: protein synthesis
  - Replisome: DNA copying
Protein Folding
• The amino-acid sequence of a protein determines the 3D fold [Anfinsen et al., 1950s]
Some exceptions:
  - All proteins can be denatured
  - Some proteins have multiple conformations
  - Some proteins get folding help from chaperones
• The function of a protein is determined by its 3D fold
• Can we predict 3D fold of a protein given its amino-acid sequence?
The Levinthal Paradox
• Given a small protein (100 aa), assume 3 possible conformations per peptide bond
• 3^100 ≈ 5 × 10^47 conformations
• Fastest motions take 10^-15 sec, so sampling all conformations would take 5 × 10^32 sec
• 60 × 60 × 24 × 365 = 31,536,000 seconds in a year
• Sampling all conformations would take 1.6 × 10^25 years
• Yet each protein folds quickly into a single stable native conformation: the Levinthal paradox
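The arithmetic behind the paradox can be checked with a few lines of Python. This is only a sketch that plugs in the slide's own assumptions (100 residues, 3 conformations per peptide bond, 10^-15 sec per sample); the variable names are illustrative.

```python
# Levinthal estimate, using the assumptions stated on the slide.
residues = 100                  # a small protein
states_per_bond = 3             # assumed conformations per peptide bond
seconds_per_sample = 1e-15      # fastest molecular motions
seconds_per_year = 60 * 60 * 24 * 365

conformations = states_per_bond ** residues
total_seconds = conformations * seconds_per_sample
total_years = total_seconds / seconds_per_year

print(f"conformations: {conformations:.1e}")
print(f"seconds to sample all: {total_seconds:.1e}")
print(f"years to sample all: {total_years:.1e}")
```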
Quick Overview of Energy
Bond                        Strength (kcal/mole)
H-bonds                     3-7
Ionic bonds                 10
Hydrophobic interactions    1-2
Van der Waals interactions  1
Disulfide bridge            51
The Hydrophobic Effect
• Important for folding, because every amino acid participates!
Amino acid  Hydrophobicity    Amino acid  Hydrophobicity
Trp          2.25             Thr          0.26
Ile          1.80             His          0.13
Phe          1.79             Gly          0.00
Leu          1.70             Ser         -0.04
Cys          1.54             Gln         -0.22
Met          1.23             Asn         -0.60
Val          1.22             Glu         -0.64
Tyr          0.96             Asp         -0.77
Pro          0.72             Lys         -0.99
Ala          0.31             Arg         -1.01
Experimentally Determined Hydrophobicity Levels
Fauchère and Pliska (1983). Eur. J. Med. Chem. 18, 369-75.
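One way to put these values to use is a sliding-window hydrophobicity profile along a sequence. The sketch below copies the values listed above into a dictionary keyed by one-letter code; the function name, the window size of 5, and the example sequence are all hypothetical choices for illustration, not part of the lecture.

```python
# Hydrophobicity values from the Fauchere-Pliska table, keyed by one-letter code.
HYDROPHOBICITY = {
    "W": 2.25, "I": 1.80, "F": 1.79, "L": 1.70, "C": 1.54,
    "M": 1.23, "V": 1.22, "Y": 0.96, "P": 0.72, "A": 0.31,
    "T": 0.26, "H": 0.13, "G": 0.00, "S": -0.04, "Q": -0.22,
    "N": -0.60, "E": -0.64, "D": -0.77, "K": -0.99, "R": -1.01,
}


def window_hydrophobicity(sequence, window=5):
    """Mean hydrophobicity of every length-`window` stretch of `sequence`."""
    scores = [HYDROPHOBICITY[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical sequence: a non-polar stretch followed by a charged tail.
example = "MKLLVVAIGDEKR"
profile = window_hydrophobicity(example)
print([round(value, 2) for value in profile])
```

Windows with high averages are candidates for buried regions, and strongly negative windows are likely solvent-exposed.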
Protein Structure Determination
• Experimental
  - X-ray crystallography
  - NMR spectroscopy
• Computational – Structure Prediction (The Holy Grail)
Sequence implies structure; therefore, in principle, we can predict the structure from the sequence alone
Protein Structure Prediction
• ab initio
  - Use just first principles: energy, geometry, and kinematics
• Homology
  - Find the best match to a database of sequences with known 3D structure
• Threading
• Meta-servers and other methods
Ab initio Prediction
• Sampling the global conformation space
  - Lattice models / Discrete-state models
  - Molecular Dynamics
  - Pre-set libraries of fragment 3D motifs
• Picking native conformations with an energy function
  - Solvation model: how protein interacts with water
  - Pair interactions between amino acids
• Predicting secondary structure
  - Local homology
  - Fragment libraries
Lattice String Folding
• HP model: main modeled force is hydrophobic attraction
  - NP-hard on both the 2-D square and 3-D cubic lattices
  - Constant-factor approximation algorithms exist
  - Not so relevant biologically
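To make the HP model concrete, here is a minimal sketch of its energy function on the 2-D square lattice, using the common convention of -1 for every pair of hydrophobic (H) residues that are lattice neighbors but not adjacent in the chain. The function name and the example conformations are illustrative assumptions; finding the minimum-energy path is the NP-hard part and is not attempted here.

```python
def hp_energy(sequence, path):
    """Energy of an HP chain placed on the 2-D square lattice.

    sequence: string of 'H' (hydrophobic) and 'P' (polar) residues.
    path: list of (x, y) lattice sites, one per residue, in chain order.
    Scores -1 for each H-H pair on neighboring sites that is not
    consecutive in the chain.
    """
    if len(sequence) != len(path):
        raise ValueError("need one lattice site per residue")
    if len(set(path)) != len(path):
        raise ValueError("path must be self-avoiding")
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    index_at = {site: i for i, site in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look only right and up so each neighboring pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


folded = hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)])
extended = hp_energy("HPPH", [(0, 0), (1, 0), (2, 0), (3, 0)])
print(folded, extended)
```

The folded square brings the two H residues into contact, so it scores lower than the extended chain.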
ROSETTA
http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php
http://depts.washington.edu/bakerpg/papers/Bonneau-ARBBS-v30-p173.pdf
• Monte Carlo based method
• Limit conformational search space by using sequence-structure motif I-Sites library (http://isites.bio.rpi.edu/Isites/)
  - 261 patterns in library
  - Certain positions in motif favor certain residues
• Remove all sequences with <25% identity
• Find structures of the 25 nearest sequence neighbors of each 9-mer
Rationale
  - Local structures often fold independently of full protein
  - Can predict large areas of protein by matching sequence to I-Sites
I-Sites Examples
• Non-polar helix
  - Abundance of alanine at all positions
  - Non-polar side chains favored at positions 3, 6, 10 (methionine, leucine, isoleucine)
• Amphipathic helix
  - Non-polar side chains favored at positions 6, 9, 13, 16 (methionine, leucine, isoleucine)
  - Polar side chains favored at positions 1, 8, 11, 18 (glutamic acid, lysine)
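The position pattern of the amphipathic helix makes sense on a helical wheel. The sketch below assumes an ideal α helix with 3.6 residues per turn, that is, 100° of rotation per residue (a standard textbook value, not stated in the slides), and maps each motif position to its angle around the helix axis. The function name is illustrative.

```python
DEGREES_PER_RESIDUE = 100  # assumed ideal alpha helix: 360 / 3.6 residues per turn


def wheel_angles(positions):
    """Angle in degrees (0-359) of each motif position on a helical wheel."""
    return [(p * DEGREES_PER_RESIDUE) % 360 for p in positions]


nonpolar = wheel_angles([6, 9, 13, 16])  # [240, 180, 220, 160]
polar = wheel_angles([1, 8, 11, 18])     # [100, 80, 20, 0]

print("non-polar face:", sorted(nonpolar))
print("polar face:", sorted(polar))
```

Under this assumption the non-polar positions fall within an 80° arc (160° to 240°) and the polar positions within a separate 100° arc (0° to 100°), so the two groups sit on different faces of the helix.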
ROSETTA Method
• New structures generated by swapping compatible fragments
• Accepted structures are clustered based on energy and structural size
• Best cluster is the one with the greatest number of conformations within 4 Å rms deviation of the structure at its center
• Representative structures are taken from each of the best five clusters and returned to the user as predictions
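The "best cluster" rule above can be sketched as a neighbor count over a pairwise rmsd matrix. This is a simplified illustration, not Rosetta's actual clustering code: the function name, the toy matrix, and the tie-breaking (first structure wins) are assumptions.

```python
def largest_cluster(rmsd, cutoff=4.0):
    """Pick the structure with the most neighbors within `cutoff` Å.

    rmsd: symmetric matrix (list of lists) of pairwise rms deviations.
    Returns (center_index, member_indices); the center counts as a member
    of its own cluster because its self-distance is 0.
    """
    best_center, best_members = None, []
    for i, row in enumerate(rmsd):
        members = [j for j, distance in enumerate(row) if distance <= cutoff]
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members


# Toy pairwise rmsd values (Å) for four hypothetical decoys.
rmsd = [
    [0.0, 2.0, 4.5, 9.0],
    [2.0, 0.0, 3.0, 8.0],
    [4.5, 3.0, 0.0, 5.0],
    [9.0, 8.0, 5.0, 0.0],
]
center, members = largest_cluster(rmsd)
print(center, members)
```

To report the best five clusters, one could remove the members of each winning cluster and repeat the search on the remaining decoys.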
Robetta & Rosetta
Rosetta results in CASP
Rosetta Results
• In CASP4, Rosetta’s best models ranged from 6–10 Å rmsd Cα
• For comparison, good comparative models give 2–5 Å rmsd Cα
• Most effective with small proteins (<100 residues) and structures with helices
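For reference, the rmsd figures quoted above measure the root-mean-square distance between corresponding Cα atoms of a model and the experimental structure. The sketch below assumes the two structures have already been optimally superimposed (real evaluations first fit one onto the other); the function name and coordinates are hypothetical.

```python
import math


def ca_rmsd(model, reference):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (a - b) ** 2
        for atom_m, atom_r in zip(model, reference)
        for a, b in zip(atom_m, atom_r)
    )
    return math.sqrt(squared / len(model))


# Hypothetical three-residue example: every atom displaced by 2 Å along x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 2.0, y, z) for x, y, z in reference]
print(ca_rmsd(model, reference))
```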
Only a few folds are found in nature
The SCOP Database
Structural Classification Of Proteins
FAMILY: proteins that are >30% similar, or >15% similar and have
similar known structure/function
SUPERFAMILY: proteins whose families have some sequence and
function/structure similarity suggesting a common evolutionary origin
COMMON FOLD: superfamilies that have the same secondary structures in
the same arrangement, probably resulting from physics and chemistry
CLASS: alpha, beta, alpha–beta, alpha+beta, multidomain
Status of Protein Databases
PDB
EMBL
SCOP: Structural Classification of Proteins. 1.67 release
24037 PDB Entries (15 May 2004). 65122 Domains.
Class                               Number of folds  Number of superfamilies  Number of families
All alpha proteins                  202              342                      550
All beta proteins                   141              280                      529
Alpha and beta proteins (a/b)       130              213                      593
Alpha and beta proteins (a+b)       260              386                      650
Multi-domain proteins               40               40                       55
Membrane and cell surface proteins  42               82                       91
Small proteins                      71               104                      162
Total                               887              1447                     2630
Evolution of Proteins – Domains
Chothia, Gough, Vogel, Teichmann, Science 300:1701-1703, 2003
The number of members in different families obeys a power law
429 families common to all 14 eukaryotes: 80% of animal domains, 90% of fungi domains
80% of proteins are multidomain in eukaryotes; domains usually combine pairwise in the same order. Why?
Evolution of proteins happens mainly through duplication, recombination, and divergence
Homology-based Prediction
• Align query sequence with sequences of known structure, usually >30% similar
• Superimpose the aligned sequence onto the structure template, according to the computed sequence alignment
• Perform local refinement of the resulting structure in 3D
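The first step depends on how similar the query is to a template. As a rough sketch, the code below computes percent identity over the aligned (non-gap) columns of a pairwise alignment and applies the 30% rule of thumb. Using identity as a stand-in for "similar" is an assumption, as are the function names and the example alignment.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (q, t)
        for q, t in zip(aligned_query, aligned_template)
        if q != "-" and t != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(q == t for q, t in columns)
    return 100.0 * matches / len(columns)


def usable_template(aligned_query, aligned_template, threshold=30.0):
    """True if the template clears the (assumed) 30% identity rule of thumb."""
    return percent_identity(aligned_query, aligned_template) > threshold


# Hypothetical alignment: 6 gap-free columns, 4 of them identical.
identity = percent_identity("MKV-LLA", "MRVTLIA")
print(round(identity, 1), usable_template("MKV-LLA", "MRVTLIA"))
```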
The number of unique structural folds
is small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar folds in PDB
Examples of Fold Classes
Homology-based Prediction
Raw model
Loop modeling
Side chain placement
Refinement