Lecture 2

2. Introduction to Rosetta and
structural modeling
• Approaches for structural modeling of proteins
• The Rosetta framework and its prediction
modes
• Cartesian and polar coordinates
• Sampling (finding the structure) and scoring
(selecting the structure)
Structural Modeling of Proteins Approaches
Prediction of Structure from Sequence
Flowchart:
1. Comparison of query sequence to nr database
2. Similar to a sequence of known structure?
   Yes → Homology Modeling (Comparative Modeling)
   No → Fold Recognition (Threading)
3. Fits a known fold?
   Yes → build the model on the recognized fold
   No → Ab initio prediction
Protocols: ab initio, loops, side chains, active sites….
The Rosetta framework and its
prediction modes
A short history of Rosetta
In the beginning: ab initio modeling of protein
structure starting from sequence
• Short fragments of known proteins are
assembled by a Monte Carlo strategy to yield
native-like protein conformations
• Reliable fold identification for short proteins.
Recently improved to high-resolution models
(within 2 Å RMSD)
ATCSFFGRKLL…..
A short history of Rosetta
Success of the ab initio protocol led to its extension to:
• Protein design
• Design of a new fold: TOP7
• Protein loop modeling; homology modeling
• Protein-protein docking; protein interface design
• Protein-ligand docking
• Protein-DNA interactions; RNA modeling
• Many more, e.g. solving the phase problem in
X-ray crystallography
The Rosetta Strategy
• Observation: local sequence preferences bias, but do not
uniquely define the local structure of a protein
• Goal: mimic interplay of local and global interactions that
determine protein structure
• Local interactions: fragments derived from known structures
(sampled for similar sequences/secondary structure propensity)
• Global (non-local) interactions: buried hydrophobic residues,
paired β strands, specific side chain interactions, etc.
The Rosetta Strategy
• Local interactions – fragments
– Fragment library representing accessible local
structures for all short sequences in a protein
chain, derived from known structures
• Global (non-local) interactions – scoring
function
– Derived from conformational statistics of known
structures
More recent additions
• Boinc (Rosetta@home)
• FoldIt
• Rosettascripts; RosettaDiagrams
• PyRosetta
Scoring and Sampling
The basic assumption in structure
prediction
Native structure located
in global minimum
(free) energy
conformation (GMEC)
➜A good Energy function
can select the correct
model among decoys
➜A good sampling
technique can find the
GMEC in the rugged
landscape
[Figure: energy E versus conformation space, with the GMEC marked]
Two-Step Procedure
1. Low-resolution step
locates potential
minima (fast)
2. Cluster analysis
identifies broadest
basins in landscape
3. High-resolution step
can identify lowest
energy minimum in
the basins (slow)
[Figure: energy E versus conformation space, with the GMEC marked]
Low-Resolution Step
Structure Representation:
• Equilibrium bonds and
angles (Engh & Huber 1991)
• Centroid: average location
of center of mass of sidechain
(Centroid | aa, φ, ψ)
• No modeling of side chains
• Fast
Low-Resolution Scoring Function
Bayes' theorem:
P(str | seq) = P(str) * P(seq | str) / P(seq)
• P(str): structure-dependent features
• P(seq | str): sequence-dependent features
• P(seq): constant
• Independent components prevent over-counting
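Taking the negative logarithm of Bayes' theorem shows why the score can be written as a sum of terms. This is only a sketch, assuming (as the slide states) that the components are independent; the labels on the individual score terms are placeholders.

```latex
\begin{aligned}
-\ln P(\mathrm{str}\mid\mathrm{seq})
  &= -\ln P(\mathrm{str}) \;-\; \ln P(\mathrm{seq}\mid\mathrm{str}) \;+\; \ln P(\mathrm{seq}) \\
  &\approx \sum_k S_k^{\text{structure-dependent}}
     \;+\; \sum_l S_l^{\text{sequence-dependent}}
     \;+\; \text{const}
\end{aligned}
```

Because P(seq) is the same for every candidate structure of a given sequence, the constant does not affect the ranking of models.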
Sequence-Dependent Components
Bayes Theorem:
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = Senv + Spair + …
neighbors: Cβ-Cβ < 10 Å
Rohl et al. (2004) Methods in Enzymology 383:66
Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999
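The neighbor definition above (Cβ-Cβ distance below 10 Å) can be illustrated with a short neighbor count. This is a minimal sketch, not Rosetta's implementation; the function name and the coordinates are hypothetical.

```python
import numpy as np

def count_neighbors(cb_coords, cutoff=10.0):
    """For each residue, count the other residues whose C-beta lies within `cutoff` angstroms."""
    coords = np.asarray(cb_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # all-against-all distance matrix
    within = dist < cutoff
    np.fill_diagonal(within, False)            # a residue is not its own neighbor
    return within.sum(axis=1)

# Hypothetical C-beta positions (angstroms), not taken from a real structure
cb = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 0.0, 0.0), (25.0, 0.0, 0.0)]
neighbor_counts = count_neighbors(cb)
```

A residue with many neighbors is buried and one with few is exposed, which is the kind of environment information an Senv-like term would condition on.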
Structure-Dependent Components
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + Srg + Scb + Svdw + …
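As an illustration of the quantity behind a compactness term such as Srg, the radius of gyration of a set of points can be computed as below. This sketch assumes equal weights for all points and hypothetical coordinates; how Rosetta turns the value into a score is not shown here.

```python
import numpy as np

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their (unweighted) centroid."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    return float(np.sqrt(((coords - center) ** 2).sum(axis=1).mean()))

# Hypothetical centroid positions: a compact arrangement and an extended chain
compact = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
extended = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)]
rg_compact = radius_of_gyration(compact)
rg_extended = radius_of_gyration(extended)
```

The extended chain has the larger radius of gyration, so a term that rewards small values favors compact conformations.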
Score = … + Sss + …
Score = … + Ssheet + Shs + … + Srama
High-Resolution Step
• Slow, exact step
• Locates global energy minimum
Structure Representation:
• All-atom (including polar and non-polar hydrogens, but no water)
• Side chains as rotamers from backbone-dependent library (Dunbrack 1997)
• Side chain conformation adjusted frequently
High-Resolution Step: Rotamer Libraries
• Side chains have
preferred conformations
• They are summarized in
rotamer libraries
• Select one rotamer for
each position
• Best conformation:
lowest-energy
combination of rotamers
Serine χ1 preferences:
t = 180°
g+ = +60°
g− = −60°
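The idea of picking one rotamer per position so that the combination has the lowest energy can be sketched with a brute-force search. The energies, the position numbers, and the function name are hypothetical, and exhaustive enumeration is used only because the example is tiny; it grows exponentially with the number of positions.

```python
from itertools import product

def best_rotamer_combination(rotamers, self_energy, pair_energy):
    """Enumerate every rotamer assignment and return the lowest-energy one."""
    positions = sorted(rotamers)
    best, best_e = None, float("inf")
    for combo in product(*(rotamers[p] for p in positions)):
        assignment = dict(zip(positions, combo))
        # One-body terms: preference of each rotamer at its position
        e = sum(self_energy[(p, r)] for p, r in assignment.items())
        # Two-body terms: interaction between rotamers at different positions
        e += sum(pair_energy.get((p, assignment[p], q, assignment[q]), 0.0)
                 for i, p in enumerate(positions) for q in positions[i + 1:])
        if e < best_e:
            best, best_e = assignment, e
    return best, best_e

# Hypothetical energies for two positions using the t / g+ / g- rotamer names
rotamers = {1: ["t", "g+", "g-"], 2: ["t", "g+"]}
self_energy = {(1, "t"): 0.5, (1, "g+"): 0.0, (1, "g-"): 0.8,
               (2, "t"): 0.2, (2, "g+"): 0.0}
pair_energy = {(1, "g+", 2, "g+"): 2.0}  # the two individually best rotamers clash
best, energy = best_rotamer_combination(rotamers, self_energy, pair_energy)
```

In this toy case the individually preferred rotamers clash, so the best combination is not simply the best rotamer at each position.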
High-Resolution Scoring Function
• Major contributions:
– Burial of hydrophobic
groups away from water
– Void-free packing of
buried groups and atoms
– Buried polar atoms form
intra-molecular
hydrogen bonds
High-Resolution Scoring Function
Packing interactions
Score = SLJ(atr + rep) + ….
• Linearized repulsive part
• ε: well depth from CHARMm19
[Figure: Lennard-Jones energy as a function of the interatomic distance rij]
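A linearized repulsive part can be sketched as a 12-6 Lennard-Jones curve that is continued as a straight line below a switch distance, so that clashes give a large but finite penalty. The switch point of 0.6 times the minimum-energy distance and all parameter values below are assumptions for illustration, not values taken from the slides.

```python
def lj_linearized(r, sigma, epsilon, switch=0.6):
    """12-6 Lennard-Jones with minimum -epsilon at r = sigma, linear below switch * sigma."""
    def lj(d):
        s6 = (sigma / d) ** 6
        return epsilon * (s6 * s6 - 2.0 * s6)

    def dlj(d):
        # Analytical derivative of lj with respect to the distance
        s6 = (sigma / d) ** 6
        return 12.0 * epsilon / d * (s6 - s6 * s6)

    r_switch = switch * sigma
    if r >= r_switch:
        return lj(r)
    # Below the switch point, extrapolate with the slope at the switch point
    return lj(r_switch) + dlj(r_switch) * (r - r_switch)

# Hypothetical parameters: minimum at 4.0 angstroms, well depth 0.1
e_min = lj_linearized(4.0, sigma=4.0, epsilon=0.1)
e_clash = lj_linearized(0.5, sigma=4.0, epsilon=0.1)
```

The linear continuation keeps the energy and its slope continuous at the switch point while avoiding the steep r^-12 rise at very short distances.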
High-Resolution Scoring Function
Implicit solvation
Score = … + Ssolvation + ….
Gaussian terms in xij² and xji², with xij = (rij − Ri)/λi
[Figure: solvation free energy density of atom i; polar groups labeled]
Lazaridis & Karplus, Proteins 1999
High-Resolution Scoring Function
Hydrogen Bonds (original function)
Score = …. +
Shb(srbb+lrbb+sc) + ….
srbb: short range, backbone HB
lrbb: long range, backbone HB
sc: HB with side chain atom
[Figure: hydrogen-bond geometry for N-H···O=C, with distance d]
(Kortemme, 2003; Morozov 2004)
Hydrogen Bonding Energy
(Kortemme, Morozov & Baker 2003 JMB)
Based on statistics
from high-resolution
structures in the
Protein Data Bank
(rcsb.org)

$E_{HB} = W_{HB}\,[\,E(\delta_{HA}) + E(\Theta) + E(\Psi) + E(X)\,]$
$\Delta G = -kT \ln P$
Slide from Jeff Gray
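The relation between observed frequencies and energies (ΔG = −kT ln P) can be sketched by converting a histogram into per-bin energies. The counts below are hypothetical, not real PDB statistics, and the pseudocount is an added assumption to avoid taking the logarithm of zero.

```python
import math

def counts_to_energy(counts, kT=1.0, pseudocount=1.0):
    """Convert histogram counts to energies E = -kT ln P for each bin."""
    total = sum(counts) + pseudocount * len(counts)
    probabilities = [(c + pseudocount) / total for c in counts]
    return [-kT * math.log(p) for p in probabilities]

# Hypothetical histogram of hydrogen-bond distances (counts per distance bin)
counts = [2, 40, 120, 30, 8]
energies = counts_to_energy(counts)
lowest_bin = energies.index(min(energies))
```

The most frequently observed bin receives the lowest energy, which is how geometry statistics from high-resolution structures become a scoring term.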
High-Resolution Scoring Function
Rotamer preference
Score = … +
Sdunbrack + ….
Dunbrack, 1997
Scoring Function: Summary
One long, generic function ….
Score = Senv+ Spair + Srg + Scb + Svdw + Sss+ Ssheet+ Shs + Srama + Shb (srbb + lrbb) + docking_score
+ Sdisulf_cent+ Srs+ Sco + Scontact_prediction + Sdipolar+ Sprojection + Spc+ Stether+ Sf+ Sw+
Ssymmetry + Ssplicemsd + …..
docking_score = Sd env+ Sd pair + Sd contact+ Sd vdw+ Sd site constr + Sd
+ Sfab score
Score = SLJ(atr + rep) + Ssolvation + Shb(srbb+lrbb+sc) + Sdunbrack + Spair – Sref + Sprob1b + Sintrares + Sgb_elec
+ Sgsolt + Sh2o(solv + hb) + S_plane
Representations of protein structure:
Cartesian and polar coordinates
PDB (Cartesian coordinates):
                                     x       y       z
ATOM    490  N   GLN A  31      52.013 -87.359  -8.797  1.00  7.06           N
ATOM    491  CA  GLN A  31      52.134 -87.762 -10.201  1.00  8.67           C
ATOM    492  C   GLN A  31      51.726 -89.222 -10.343  1.00 10.90           C
ATOM    493  O   GLN A  31      51.015 -89.601 -11.275  1.00  9.63           O
…

Polar (internal) coordinates:
Position     PHI     PSI    OMEGA    CHI1   CHI2   CHI3   CHI4
1           0.00  -60.00  -180.00  -60.00   0.00   0.00   0.00
2
3
…
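Because PDB ATOM records are fixed-width, the Cartesian representation can be read by slicing columns. The sketch below builds the first record of the slide from its fields and parses it back; the function name and the returned dictionary keys are hypothetical.

```python
def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM record into a dict (0-based slices of the PDB columns)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }

# Build the record field by field so that every value lands in its PDB column
line = ("ATOM  " + f"{490:>5}" + " " + f"{' N':<4}" + " " + "GLN" + " " + "A"
        + f"{31:>4}" + "    "
        + f"{52.013:8.3f}{-87.359:8.3f}{-8.797:8.3f}"
        + f"{1.00:6.2f}{7.06:6.2f}" + " " * 10 + f"{'N':>2}")
atom = parse_atom_line(line)
```

Slicing by column is preferable to splitting on whitespace, since adjacent PDB fields can touch when values are wide.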
2 ways to represent the protein structure
Cartesian coordinates
(x, y, z; PDB format)
+ Intuitive: look at molecules in space
+ Easy calculation of energy score (based on atom-atom distances)
– Difficult to change conformation of structure (while keeping bond length and bond angle unchanged)
Polar coordinates
(Φ, Ψ, Ω; equilibrium angles and bond lengths)
+ Compact (3 values/residue)
+ Easy changes of protein structure (turn around one or more dihedral angles)
– Non-intuitive
– Difficult to evaluate energy score (calculation of neighboring matrix complicated)
A snake in the 2D world
• Cartesian representation:
points: 1 (0,0), 2 (1,1), 3 (1,2), 4 (2,2), 5 (3,3)
connections (predefined): 1-2, 2-3, 3-4, 4-5
A snake in the 2D world
• Internal coordinates:
bond lengths (predefined): √2, 1, 1, √2
angles (relative to the x axis): 45°, 90°, 0°, 45°
From Wikipedia
A snake wiggling in the 2D world
• Constraint: keep bond length fixed
• Move in Cartesian representation
(0,0),(1,1),(1,2),(2,2),(3,3) → (0,0),(1,1),(1,2),(2,2),(3,0)
Bond length changed! (last bond: √2 → √5)
A snake wiggling in the 2D world
• Constraint: keep bond length fixed
• Move in polar coordinates
45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°
Bond length unchanged!
Large impact on structure
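The two snake moves can be checked numerically. The sketch below rebuilds the snake from bond lengths and angles (each angle measured from the x axis, which matches the numbers on the slides) and compares bond lengths after the Cartesian move and after the internal-coordinate move; the function names are hypothetical.

```python
import math

def to_cartesian(lengths, angles_deg, origin=(0.0, 0.0)):
    """Build points from bond lengths and bond directions (degrees from the x axis)."""
    points = [origin]
    for length, angle in zip(lengths, angles_deg):
        x, y = points[-1]
        a = math.radians(angle)
        points.append((x + length * math.cos(a), y + length * math.sin(a)))
    return points

def bond_lengths(points):
    """Distances between consecutive points."""
    return [math.dist(p, q) for p, q in zip(points, points[1:])]

lengths = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]

# Cartesian move from the slide: point 5 goes from (3,3) to (3,0)
cartesian_move = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 0)]
# Internal move from the slide: the third angle goes from 0 to 45 degrees
internal_move = to_cartesian(lengths, [45, 90, 45, 45])

lengths_after_cartesian = bond_lengths(cartesian_move)
lengths_after_internal = bond_lengths(internal_move)
```

After the Cartesian move the last bond has length √5 instead of √2, while the internal move leaves every bond length untouched and shifts all points downstream of the changed angle.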
Polar → Cartesian coordinates
Convert r and θ to x and y: x = r cos θ, y = r sin θ
lengths √2, 1, 1, √2 and angles 45°, 90°, 0°, 45°
→ points (0,0),(1,1),(1,2),(2,2),(3,3)
From Wikipedia
Cartesian → polar coordinates
Convert x and y to r and θ: r = √(x² + y²), θ = atan2(y, x)
points (0,0),(1,1),(1,2),(2,2),(3,3)
→ lengths √2, 1, 1, √2 and angles 45°, 90°, 0°, 45°
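The reverse conversion can be sketched in the same spirit: each bond vector between consecutive points is turned into a length and a direction. The function name is hypothetical, and angles are again measured from the x axis.

```python
import math

def to_internal(points):
    """Convert consecutive 2D points into bond lengths and bond directions (degrees)."""
    lengths, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        lengths.append(math.hypot(dx, dy))
        angles.append(math.degrees(math.atan2(dy, dx)))
    return lengths, angles

snake = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)]
lengths, angles = to_internal(snake)
```

Using atan2 rather than a plain arctangent of dy/dx keeps the correct quadrant and handles vertical bonds, where dx is zero.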
Moving the snake to the 3D world
• Cartesian representation:
points (additional z-axis): (0,0,0),(1,1,0),(1,2,0),(2,2,0),(3,3,0)
connections (predefined): 1-2, 2-3, 3-4, 4-5
• Internal coordinates:
bond lengths (predefined): √2, 1, 1, √2
angles: 45°, 90°, 0°, 45°
dihedral angles: 180°, 180°
Proteins: bond lengths and angles fixed. Only dihedral angles are varied
Dihedral angles
Dihedral angles χ1-χ4 define the side chain conformation
• Dihedral angle: defines geometry of
4 consecutive atoms (given bond
lengths and angles)
From wikipedia
What we learned from our snake
• Cartesian representation: Easy to look at, difficult to move
– Moves do not preserve bond length (and angles in 3D)
• Internal coordinates: Easy to move, difficult to see
– calculation of distances between points not trivial
Proteins: bond lengths and angles fixed. Only dihedral angles are varied
Solution: toggle between the two representations
CALCULATE ENERGY (Cartesian coordinates):
• Derive distance matrix (neighbor list) for energy score calculation
Transform Cartesian → polar: calculate dihedral angles from coordinates
MOVE STRUCTURE (polar coordinates):
• Introduce changes in structure by rotating around dihedral angle(s) (change Φ values)
Transform polar → Cartesian: build positions in space according to dihedral angles
Cartesian → polar coordinates
How to calculate polar from Cartesian coordinates: example Φ: C′-N-Cα-C
– define plane perpendicular to N-Cα (b2) vector
– calculate projection of Cα-C (b3) and C′-N (b1) onto plane
– calculate angle between projections
[Figure: PDB coordinates of C′ (GLN 31) and of the GLY 32 atoms are converted to the torsion-table row for position 32: PHI -59.00, PSI -60.00, OMEGA -180.00, CHI1-CHI4 0.00]
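The projection recipe for a dihedral angle can be sketched with NumPy as below. To obtain the usual sign convention (0° for cis, 180° for trans), the first bond is taken pointing from the second atom back to the first; that choice, the function name, and the test points are assumptions of this sketch.

```python
import numpy as np

def dihedral(p1, p2, p3, p4):
    """Dihedral angle p1-p2-p3-p4 in degrees, from projections onto the plane normal to p2-p3."""
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    b1 = p1 - p2                      # first bond, pointing back toward p1
    axis = p3 - p2                    # central bond (N-CA for phi)
    b3 = p4 - p3                      # last bond
    axis = axis / np.linalg.norm(axis)
    # Remove the components along the central bond (projection onto the plane)
    v = b1 - np.dot(b1, axis) * axis
    w = b3 - np.dot(b3, axis) * axis
    # Signed angle between the two projections
    x = np.dot(v, w)
    y = np.dot(np.cross(axis, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# Hypothetical test points with known dihedral angles
cis = dihedral((-1, 1, 0), (0, 0, 0), (1, 0, 0), (2, 1, 0))
trans = dihedral((-1, 1, 0), (0, 0, 0), (1, 0, 0), (2, -1, 0))
quarter = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
```

For Φ, the four points would be C′ of the previous residue followed by N, Cα and C of the current residue.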
Polar → Cartesian coordinates
Find x, y, z coordinates of C, based on atom positions of C′, N and Cα, and a given Φ value (Φ: C′-N-Cα-C)
• create Cα-C vector:
– size Cα-C = 1.51 Å (equilibrium bond length)
– angle N-Cα-C = 111° (equilibrium value for N-Cα-C angle)
• rotate vector around N-Cα axis until the projections of Cα-C and N-C′ give the wanted Φ
[Figure: torsion-table row for position 32 (PHI -59.00) converted back to PDB coordinates]
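Placing the next atom from three known atoms, a bond length, a bond angle and a dihedral can be sketched as follows. The construction (a local frame built on the last bond) is one common way to do this and is not claimed to be Rosetta's code; the three input coordinates are the illustrative numbers shown on the slide, used here as C′, N and Cα.

```python
import numpy as np

def place_atom(a, b, c, bond_length, bond_angle_deg, torsion_deg):
    """Return position d such that |c-d|, angle b-c-d and dihedral a-b-c-d match the inputs."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    theta = np.radians(bond_angle_deg)
    phi = np.radians(torsion_deg)
    bc = (c - b) / np.linalg.norm(c - b)     # unit vector along the rotation axis
    n = np.cross(b - a, bc)                  # normal to the a-b-c plane
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)                      # in-plane direction pointing toward a
    local = np.array([-bond_length * np.cos(theta),
                      bond_length * np.sin(theta) * np.cos(phi),
                      bond_length * np.sin(theta) * np.sin(phi)])
    return c + local[0] * bc + local[1] * m + local[2] * n

c_prev = (52.013, -87.359, -8.797)    # used as C' of the previous residue
n_atom = (52.134, -87.762, -10.201)   # used as N
ca = (51.726, -89.222, -10.343)       # used as CA
c_new = place_atom(c_prev, n_atom, ca, 1.51, 111.0, -59.0)
```

Repeating this step atom by atom along the chain rebuilds the whole structure from its dihedral angles, with bond lengths and bond angles held at their equilibrium values.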
Representation of protein structure
Rosetta folding
[Figure: chain of residues 1-8, before and after a torsion change]
3 backbone dihedral angles per residue
Build coordinates of structure starting from first atom, according to dihedral angles
(and equilibrium bond length and angle)
Sampling and minimization in TORSIONAL space:
change angle and rebuild, starting from changed angle
Based on slides by Chu Wang
Representation of protein structure
Rosetta folding: sampling and minimization in TORSIONAL space
• 3 backbone dihedral angles per residue
Rosetta docking: sampling and minimization in RIGID-BODY space
• Backbone dihedral angles fixed (rigid-body)
• 6 rigid-body DOFs: 3 translational vectors, 3 rotational angles
[Figure: chain 1-8 (folding); chains 1-8 and 1′-8′ (docking)]
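The six rigid-body degrees of freedom can be illustrated by rotating one partner about its own center and then translating it. The rotation order (x, then y, then z), the function names, and the coordinates are assumptions of this sketch.

```python
import numpy as np

def rotation_matrix(rx_deg, ry_deg, rz_deg):
    """Rotation about the x, then y, then z axis (angles in degrees)."""
    rx, ry, rz = np.radians([rx_deg, ry_deg, rz_deg])
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def rigid_body_move(coords, rotation_deg, translation):
    """Rotate coordinates about their centroid, then translate (3 + 3 DOFs)."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    R = rotation_matrix(*rotation_deg)
    return (coords - center) @ R.T + center + np.asarray(translation, dtype=float)

# Hypothetical docking partner with three atoms
partner = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
moved = rigid_body_move(partner, rotation_deg=(30, 45, 60), translation=(5.0, 0.0, -2.0))
```

All internal distances of the moved partner are unchanged, which is what "backbone dihedral angles fixed" means for a rigid-body move.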
How can those two types of degrees of freedom be combined?
Fold tree representation
• Originally developed to improve sampling of strand registers in β-sheet proteins.
• Allows simultaneous optimization of rigid-body and backbone/sidechain torsional
degrees of freedom.
Example: fold-tree based docking
[Figure: chains 1-8 and 1′-8′ connected by one long-range edge]
“peptide” edge: 3 backbone dihedral angles (within each chain)
“long-range” edge: 6 rigid-body DOFs (between the chains)
• Construct fold-trees to treat a variety of protein folding and docking problems.
Fold tree: Bradley and Baker, Proteins (2006)
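A fold tree can be pictured as a list of edges, each either a peptide edge or a jump. The sketch below is a hypothetical data structure for illustration only, not Rosetta's FoldTree API; the residue numbering (partner two as residues 9-16) and the choice of anchor residues are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int
    stop: int
    kind: str  # "peptide" (backbone torsions) or "jump" (rigid-body transform)

def count_dofs(edges):
    """Count 3 backbone dihedrals per residue on peptide edges and 6 DOFs per jump."""
    residues, jumps = set(), 0
    for edge in edges:
        if edge.kind == "peptide":
            lo, hi = sorted((edge.start, edge.stop))
            residues.update(range(lo, hi + 1))
        elif edge.kind == "jump":
            jumps += 1
        else:
            raise ValueError(f"unknown edge kind: {edge.kind}")
    return 3 * len(residues) + 6 * jumps

# Two 8-residue partners connected by one jump between residues 4 and 12
docking_tree = [
    Edge(4, 1, "peptide"), Edge(4, 8, "peptide"),
    Edge(4, 12, "jump"),
    Edge(12, 9, "peptide"), Edge(12, 16, "peptide"),
]
total_dofs = count_dofs(docking_tree)
```

Keeping both edge types in one tree is what lets a single optimizer vary backbone torsions and the relative placement of the partners at the same time.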
Fold-trees for different modeling tasks
protein folding
[Figure: fold tree running from N to C]
Legend:
Color: flexible bb; Gray: fixed bb
Flexible “peptide” edge; rigid “peptide” edge
Rigid “jump” (1 to 1′); flexible “jump” (1 to 1′)
N: N-terminal; C: C-terminal; X: chain break; O: root of the tree
Fold-trees for different modeling tasks
loop modeling
[Figure: fold tree from N to C with chain breaks (x) and jumps 1-1′ and 2-2′]
Fold-trees for different modeling tasks
fully flexible docking
[Figure: two chains, each N to C, connected by jump 1-1′]
docking w/ loop modeling
[Figure: two chains with chain breaks (x) and jumps 1-1′, 2-2′, 3-3′]
docking w/ hinge motion
[Figure: two chains, each N to C, connected by jump 1-1′]
Fold-trees for different modeling tasks
Additional legend entries:
Pale: symmetry operation
• Filled colored circles: flexible sc
o Empty colored circles: flexible amino acid (design)
The Rosetta sampling strategy: a general
overview
Fragment Sampling
Local optimization
• 9 residue fragments
• 3 residue fragments
• Gradual addition of parameters
to scoring function
• Quick quenching
• Strategies to keep fragment
insertion/perturbation local
• Monte Carlo (MC) Sampling
• MC sampling with minimization
• Repacking and refinement
Side chain
rearrangement
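The Monte Carlo sampling named above relies on an acceptance test for each trial move. Below is a minimal sketch of the Metropolis criterion on a toy one-dimensional landscape; the energy function, the move, and all parameter values are hypothetical stand-ins for fragment insertions and the Rosetta score.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng):
    """Always accept downhill moves; accept uphill moves with probability exp(-dE/kT)."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / kT)

def monte_carlo(energy, propose, start, steps, kT=1.0, seed=0):
    """Run Metropolis Monte Carlo and return the lowest-energy state visited."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        trial = propose(current, rng)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_current, kT, rng):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best

def toy_energy(x):
    """Toy rugged landscape: a broad basin with many local minima."""
    return 0.1 * x * x + math.sin(5.0 * x)

def toy_move(x, rng):
    """Small random perturbation, standing in for a local structural move."""
    return x + rng.uniform(-0.5, 0.5)

best_x, best_e = monte_carlo(toy_energy, toy_move, start=4.0, steps=5000)
```

Occasionally accepting uphill moves is what lets the search escape local minima in a rugged landscape; lowering kT makes the search greedier.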