Robotics Algorithms for the Study of Protein Structure and Motion Jean-Claude Latombe

Robotics Algorithms
for the Study of
Protein Structure and Motion
Jean-Claude Latombe
Computer Science Department
Stanford University
Protein
Long sequence of amino-acids (dozens to thousands),
from a dictionary of 20 distinct amino-acids
Central Dogma
of Molecular Biology
Physiological conditions:
aqueous solution, 37°C, pH 7,
atmospheric pressure
Why Proteins?
▪ They are the workhorses of living organisms
• They perform many vital functions, e.g.:
- catalysis of reactions
- storage of energy
- transmission of signals
- building blocks of muscles
▪ They raise challenging computational issues
• Large molecules (100s to several 1000s of atoms)
• Made of building blocks drawn from a small “dictionary”
• Unusual kinematic structure
▪ They are associated with many critical problems
• Folded structure determination
• Global and local structural similarities
• Prediction of folding and binding motions
φ-ψ Kinematic Linkage Model
peptide group
side-chain group
Molecule and Robot
Two problems
▪ Structure determination from electron density maps
• Inverse kinematics techniques
[Itay Lotan, Henry van den Bedem, Ashley Deacon (Joint Center for Structural Genomics)]
▪ Energy maintenance during Monte Carlo simulation
• Collision detection techniques
[Itay Lotan, Fabian Schwarzer, and Danny Halperin (Tel Aviv University)]
Structure Determination/Prediction
▪ Experimental tools
• X-ray crystallography
• NMR spectrometry
▪ Computational tools
• Homology, threading
• Molecular dynamics
Protein Data Bank
Structures have been determined for only about 10% of known protein sequences
▪ Protein Structure Initiative (PSI)
[Timeline figure, labels in slide order. Years: 1990, 1999, 2000, 2004. Counts: 250 new structures, 2500 new structures, >20,000 structures total, ~30,000 structures total]
X-Ray Crystallography
Automated Model Building
Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID
• 1.0Å < d < 2.3Å: ~90% completeness
• 2.3Å ≤ d < 3.0Å: ~67% completeness (varies widely) [1]
JCSG: 43% of data sets ≥ 2.3Å
▪ Manually completing a model:
• Labor intensive, time consuming
• Existing tools are highly interactive
▪ Model completion is the high-throughput bottleneck
[1] Badger (2003) Acta Cryst. D59
The Completion Problem
▪ Input:
• Electron-density map
• Partial structure
• Two anchor residues
• Amino-acid sequence of missing fragment (typically 4 – 15 residues long)
▪ Output:
• Few candidate conformation(s) of fragment that
- Respect the closure constraint (IK)
- Maximize match with electron-density map
[Figure: protein fragment (fuzzy map) between Anchor 1 (3 atoms) and Anchor 2 (3 atoms), attached to main part of protein (folded)]
IK Problem
▪ Input:
• Closed kinematic chain with n > 6 degrees of freedom
• Relative positions/orientations X of end frames
• Target function T(Q) → R
▪ Output:
• Joint angles Q that
- Achieve closure
- Optimize T
Related Work
Robotics/Computer Science
• Exact IK solvers
– Manocha & Canny ’94
– Manocha et al. ’95
• Optimization IK solvers
– Wang & Chen ’91
• Redundant manipulators
– Khatib ’87
– Burdick ’89
• Motion planning for closed loops
– Han & Amato ’00
– Yakey et al. ’01
– Cortes et al. ’02, ’04
Biology/Crystallography
• Exact IK solvers
– Wedemeyer & Scheraga ’99
– Coutsias et al. ’04
• Optimization IK solvers
– Fine et al. ’86
– Canutescu & Dunbrack Jr. ’03
• Ab-initio loop closure
– Fiser et al. ’00
– Kolodny et al. ’03
• Database search loop closure
– Jones & Thirup ’86
– van Vlijmen & Karplus ’97
• Semi-automatic tools
– Jones & Kjeldgaard ’97
– Oldfield ’01
Two-Stage IK Method
1. Candidate generation
→ Closed fragments
2. Candidate refinement
→ Optimize fit with the electron-density map (EDM)
Stage 1: Candidate Generation
1. Generate random conformation of fragment (only one end attached to anchor)
2. Close fragment (i.e., bring other end to second anchor) using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
Closure Distance
Closure distance between the moving end (atoms N, Cα, C) and the fixed end (atoms N′, Cα′, C′):
$S = \|N - N'\|^2 + \|C_\alpha - C_\alpha'\|^2 + \|C - C'\|^2$
Compute $q_i$ s.t. $\partial S / \partial q_i = 0$
+ bias toward EDM
+ avoid steric clashes
A.A. Canutescu and R.L. Dunbrack Jr. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Prot. Sci. 12:963–972, 2003.
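To make the CCD idea concrete, here is a minimal sketch on a simplified problem: a planar chain of unit links whose free end must reach a single target point. This is an illustrative assumption, not the 3-atom protein closure of Canutescu & Dunbrack, and it omits the EDM bias and clash avoidance. The names `forward`, `closure_distance`, and `ccd_close` are hypothetical.

```python
import math

def forward(angles, link=1.0):
    """Joint positions of a planar chain with relative joint angles; base at the origin."""
    points = [(0.0, 0.0)]
    heading = 0.0
    for angle in angles:
        heading += angle
        x, y = points[-1]
        points.append((x + link * math.cos(heading), y + link * math.sin(heading)))
    return points

def closure_distance(angles, target):
    """Squared distance S between the moving end of the chain and the fixed target."""
    end = forward(angles)[-1]
    return (end[0] - target[0]) ** 2 + (end[1] - target[1]) ** 2

def ccd_close(angles, target, tolerance=1e-10, max_sweeps=500):
    """Cyclic coordinate descent: adjust one joint at a time to minimize S."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in range(len(angles)):
            points = forward(angles)
            pivot, end = points[i], points[-1]
            # For a single target point, S is minimized over joint i by rotating
            # about the pivot until the end lies on the pivot-to-target ray.
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            angles[i] += to_target - to_end
        if closure_distance(angles, target) < tolerance:
            break
    return angles

start = [0.3] * 6            # arbitrary open conformation (assumed)
target = (2.0, 1.5)          # reachable: distance 2.5 < total length 6
closed = ccd_close(start, target)
print(closure_distance(start, target), closure_distance(closed, target))
```

Each single-joint update can only decrease S, which is why the sweep can be repeated until the closure distance falls below the tolerance.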
Stage 2: Candidate Refinement
▪ Target function T(Q) measuring quality of the fit with the EDM
▪ Minimize T while retaining closure
▪ Closed conformations lie on a self-motion manifold of lower dimension
[Figure: 1-D manifold through (q1,q2,q3); null space shown in axes dq1, dq2, dq3]
Closure and Null Space
▪ $dX = J\,dQ$, where $J$ is the $6 \times n$ Jacobian matrix ($n > 6$)
▪ Null space $\{dQ \mid J\,dQ = 0\}$ has dim $= n - 6$
▪ $N$: orthonormal basis of null space
▪ Pseudo-inverse $J^+$ such that $J J^+ = I$
▪ $dQ = J^+ dX + N N^T y$, with $y = \nabla T(Q)$
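Computation of J+ and N
SVD of $J$:
$dX = U_{6\times 6}\,\begin{bmatrix} S_{6\times 6} & 0 \end{bmatrix} \begin{bmatrix} V^T_{6\times n} \\ N^T \end{bmatrix} dQ$, with $S = \mathrm{diag}(s_1, s_2, \ldots, s_6)$
$(n-6)$ basis $N$ of null space: Gram-Schmidt orthogonalization
$J^+ = V S^+ U^T$ where $S^+ = \mathrm{diag}[1/s_i]$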
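As a brief check of why a null-space step leaves closure intact, the following derivation uses only the relations stated on the slides ($J J^+ = I$, and $J N = 0$ because the columns of $N$ span the null space of $J$). The symbol $dX'$ for the resulting end-frame displacement is introduced here for illustration.

```latex
\begin{aligned}
dX' &= J\,dQ = J\left(J^+ dX + N N^T y\right) \\
    &= (J J^+)\,dX + (J N)\,N^T y \\
    &= I\,dX + 0 = dX .
\end{aligned}
```

In particular, with $dX = 0$ any step $dQ = N N^T y$ keeps the chain closed to first order, whatever direction $y$ is chosen.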
Refinement Procedure
Repeat until minimum is reached:
• Compute $J$, $J^+$ and $N$ at current $Q$
• Compute $\nabla T$ at current $Q$ (analytical expression of $\nabla T$ + linear-time recursive computation [Abe et al., Comput. Chem., 1984])
• Move along $dQ = J^+ dX + N N^T \nabla T$ until minimum is reached or closure is broken
+ Monte Carlo + simulated annealing protocol to deal with local minima
Monte Carlo Optimization
Repeat:
1. Perform a random move of the fragment:
– either by picking a random direction in null space
– or by using an exact IK solver over 6 dofs [Coutsias et al., 2004] (→ big jumps)
2. Minimize T(Q)
3. Accept move with Metropolis-criterion probability ~ $\exp(-\Delta T / \mathrm{Temp})$
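The three-step loop above can be sketched on a toy problem. The target function, the perturbation size, the crude local minimizer, and the geometric cooling schedule below are all assumptions for illustration; the closure constraint and the null-space/IK moves of the actual method are left out.

```python
import math
import random

def toy_target(q):
    """Hypothetical multi-minimum function standing in for T(Q)."""
    return sum(math.sin(3.0 * x) + 0.1 * x * x for x in q)

def local_minimize(target, q, step=0.05, iters=200):
    """Crude coordinate search: keep any single-angle nudge that lowers the target."""
    q = list(q)
    best = target(q)
    for _ in range(iters):
        improved = False
        for i in range(len(q)):
            for delta in (step, -step):
                trial = q[:i] + [q[i] + delta] + q[i + 1:]
                value = target(trial)
                if value < best:
                    q, best, improved = trial, value, True
        if not improved:
            break
    return q, best

def monte_carlo_minimize(target, q0, steps=200, temp=1.0, cooling=0.98, seed=0):
    rng = random.Random(seed)
    q, t = local_minimize(target, q0)
    best_q, best_t = q, t
    for _ in range(steps):
        # 1. Random move (here: Gaussian perturbation of every angle).
        trial = [x + rng.gauss(0.0, 0.5) for x in q]
        # 2. Minimize the target from the perturbed point.
        trial, trial_t = local_minimize(target, trial)
        # 3. Metropolis criterion on the change in target value.
        delta = trial_t - t
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            q, t = trial, trial_t
            if t < best_t:
                best_q, best_t = q, t
        temp *= cooling  # simulated-annealing style cooling (assumed schedule)
    return best_q, best_t

q0 = [2.0, -1.0, 0.5]
best_q, best_t = monte_carlo_minimize(toy_target, q0)
print(toy_target(q0), best_t)
```

Accepting occasional uphill moves at high temperature is what lets the search leave a local minimum that plain descent would be stuck in.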
Tests #1: Artificial Gaps
▪ TM1621 (234 residues) and TM0423 (376 residues), SCOP classification α/β
▪ Complete structures (gold standard) resolved with EDM at 1.6Å resolution
▪ Compute EDM at 2, 2.5, and 2.8Å resolution
▪ Remove fragments and rebuild
TM1621
103 Fragments from TM1621 at 2.5Å
Short Fragments:
100% < 1.0Å aaRMSD
Long Fragments:
12 residues: 96% < 1.0Å aaRMSD
15 residues: 88% < 1.0Å aaRMSD
Produced by H. van den Bedem
Comparison Across Resolutions
Resolution = 2.0Å
Resolution = 2.5Å
Resolution = 2.8Å
Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD
Tests #2: True Gaps
▪ Structure computed by RESOLVE
▪ Gaps completed independently (gold standard)
▪ Example: TM1742 (271 residues)
▪ 2.4Å resolution; 5 gaps left by RESOLVE

Length   Top scorer   Lowest error
4        0.22Å        0.22Å
5        0.78Å        0.78Å
5        0.36Å        0.36Å
7        0.72Å        0.66Å
10       0.43Å        0.43Å
Produced by H. van den Bedem
TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best 0.6Å aaRMSD
GLU-83
GLY-96
TM1621
▪ Green: manually completed conformation
▪ Cyan: conformation computed by stage 1
▪ Magenta: conformation computed by stage 2
▪ The aaRMSD improved by 2.4Å to 0.31Å
Alr1529
D72-D78
resolution: 2.0Å
initial model: ARP/wARP
contour: 1.0σ
PDB: 1VJG
aaRMSD: 0.33Å
TM0542
• Top-scoring fragment in cyan
• Manually completed fragment in green
• Residues A259 and A260 are flipped
Current/Future Work
▪ Software actively being used at the JCSG
▪ What about multi-modal loops?
▪ TM0755: data at 1.8Å
▪ 8-residue fragment crystallized in 2 conformations
▪ Overlapping density: difficult to interpret manually
[Figure: conformations A and B; residue labels A323 Hist, A316 Ser]
Algorithm successfully identified and built both conformations
Current/Future Work
▪ Software actively being used at the JCSG
▪ What about multi-modal loops?
▪ Fuzziness in EDM can then be exploited
▪ Use EDM to infer probability measure over the conformation space of the loop
[Figure: conformations A and B]
Amylosucrase
J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran.
J. Comp. Chemistry, 25:956-967, 2004
Energy maintenance during Monte Carlo simulation
joint work with Itay Lotan, Fabian Schwarzer, and Dan Halperin (Computer Science Department, Tel Aviv University)
Monte Carlo Simulation (MCS)
▪ Random walk through conformation space
▪ At each attempted step:
• Perturb current conformation at random
• Accept step with probability:
$P(\mathrm{accept}) = \min\{1,\ e^{-\Delta E / k_b T}\}$
▪ The conformations generated by an arbitrarily long MCS are Boltzmann distributed, i.e.,
#conformations in $V \sim \int_V e^{-E / k_b T}\, dV$
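As a small illustration of the acceptance rule and the resulting Boltzmann distribution, the sketch below runs the random walk on a hypothetical two-state system with energies 0 and 1 (in units where $k_b T = 1$). The function names and the toy system are assumptions, not part of the original slides.

```python
import math
import random

def metropolis_accept(delta_e, kt, rng):
    """Accept with probability min(1, exp(-delta_e / kt))."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kt)

def sample_two_state(energies, kt, steps, seed=0):
    """Random walk over two states; returns how often each state was visited."""
    rng = random.Random(seed)
    state = 0
    counts = [0, 0]
    for _ in range(steps):
        proposal = 1 - state  # the only possible perturbation in this toy system
        if metropolis_accept(energies[proposal] - energies[state], kt, rng):
            state = proposal
        counts[state] += 1  # a rejected step counts the current state again
    return counts

counts = sample_two_state(energies=(0.0, 1.0), kt=1.0, steps=200_000)
ratio = counts[1] / counts[0]
print(ratio, math.exp(-1.0))  # visit ratio vs. Boltzmann factor exp(-(E1 - E0)/kT)
```

For a long run the visit ratio approaches the Boltzmann factor, which is the discrete analogue of the integral on the slide.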
Monte Carlo Simulation (MCS)
▪ Used to:
• sample meaningful distributions of conformations
• generate energetically plausible motion pathways
▪ A simulation run may consist of millions of steps
→ energy must be evaluated frequently
Problem: How to maintain energy efficiently?
Energy Function
▪ $E = \sum \text{bonded terms} + \sum \text{non-bonded terms} + \sum \text{solvation terms}$
▪ Bonded terms
- $O(n)$
▪ Non-bonded terms
- E.g., Van der Waals and electrostatic
- Depend on distances between pairs of atoms
- $O(n^2)$ → Expensive to compute
▪ Solvation terms
- May require computing molecular surface
Non-Bonded Terms
▪ Energy terms go to 0 when distance increases
▪ Cutoff distance (6 - 12Å)
▪ vdW forces prevent atoms from bunching up
▪ Only O(n) interacting pairs [Halperin & Overmars ’98]
Problem: How to find interacting pairs without enumerating all atom pairs?
Grid Method
▪ Subdivide 3-space into cubic cells
▪ Compute cell that contains each atom center
▪ Represent grid as hashtable
[Figure: grid with cell size labeled $d_{\mathrm{cutoff}}$]
Grid Method
▪ Θ(n) time to build grid
▪ O(1) time to find interacting pairs for each atom
▪ Θ(n) to find all interacting pairs of atoms [Halperin & Overmars ’98]
▪ Asymptotically optimal in worst-case
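The grid method can be sketched with a Python dictionary as the hashtable. The sketch assumes cells whose side equals the cutoff distance, so only the 27 surrounding cells need to be inspected per atom. The function names and the random test points are illustrative assumptions.

```python
import math
import random
from collections import defaultdict
from itertools import product

def build_grid(centers, cutoff):
    """Hash each atom index into the cubic cell (side = cutoff) containing its center."""
    grid = defaultdict(list)
    for index, (x, y, z) in enumerate(centers):
        cell = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        grid[cell].append(index)
    return grid

def interacting_pairs(centers, cutoff):
    """All index pairs (i, j), i < j, whose centers are within the cutoff distance."""
    grid = build_grid(centers, cutoff)
    pairs = set()
    for (cx, cy, cz), members in grid.items():
        # Only the cell itself and its 26 neighbors can hold atoms within the cutoff.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbor = grid.get((cx + dx, cy + dy, cz + dz))
            if not neighbor:
                continue
            for i in members:
                for j in neighbor:
                    if i < j and math.dist(centers[i], centers[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs

rng = random.Random(1)
atoms = [(rng.uniform(0, 20), rng.uniform(0, 20), rng.uniform(0, 20)) for _ in range(150)]
pairs = interacting_pairs(atoms, cutoff=6.0)
print(len(pairs))
```

The linear bound on the slide relies on atoms not bunching up, so that each cell holds a bounded number of atoms; random points as used here do not guarantee that, they only exercise the code.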
Can we do better on average?
▪ Few DOFs are changed at each MC step
[Plot: simulation of 100,000 attempted steps; horizontal axis: number k of DOF changes (ticks 0, 5, 10, 20, 30)]
Can we do better on average?
▪ Few DOFs are changed at each MC step
▪ Proteins are long kinematic chains
▪ Long sub-chains stay rigid at each step
▪ Many partial energy sums remain constant
Problem: How to retrieve the unchanged partial sums?
Hierarchical Collision Checking
▪ Widely used technique in robotics/graphics to approximate distances between objects
▪ Pre-computation of bounding-volume hierarchy
▪ How to update this hierarchy if the objects deform?
Two New Data Structures
1. ChainTree
→ Fast detection of interacting atom pairs
2. EnergyTree
→ Retrieval of unchanged partial energy sums
ChainTree
(Twofold Hierarchy: BVs + Transforms)
[Figure: binary hierarchy built over the links and joints of the chain; example transforms labeled TNO, TJK, TAB]
Updating the ChainTree
Update path to root:
– Recompute transforms that “shortcut” the DOF change
– Recompute BVs that contain the DOF change
– O(k log(n/k)) work for k changes
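The transform half of this update can be sketched as a balanced binary tree that caches composed rigid transforms, so that a single DOF change only recomputes the nodes on one leaf-to-root path. This is a simplified planar illustration with unit links and k = 1, without bounding volumes; it is not the authors' ChainTree, and the class and method names are hypothetical.

```python
import math

IDENTITY = (0.0, 0.0, 0.0)

def compose(a, b):
    """Compose planar rigid transforms (angle, tx, ty): x -> R_a (R_b x + t_b) + t_a."""
    angle_a, ax, ay = a
    angle_b, bx, by = b
    c, s = math.cos(angle_a), math.sin(angle_a)
    return (angle_a + angle_b, ax + c * bx - s * by, ay + s * bx + c * by)

class TransformTree:
    """Binary tree whose internal nodes cache the composed transform of their leaves."""

    def __init__(self, joint_angles, link_length=1.0):
        self.link_length = link_length
        self.size = 1
        while self.size < len(joint_angles):
            self.size *= 2
        self.nodes = [IDENTITY] * (2 * self.size)  # unused leaves stay identity
        for i, angle in enumerate(joint_angles):
            self.nodes[self.size + i] = self._leaf(angle)
        for i in range(self.size - 1, 0, -1):
            self.nodes[i] = compose(self.nodes[2 * i], self.nodes[2 * i + 1])
        self.recomputed = 0  # internal nodes recomputed by updates

    def _leaf(self, angle):
        # Rotate by the joint angle, then advance along the rotated link.
        return (angle, self.link_length * math.cos(angle), self.link_length * math.sin(angle))

    def set_angle(self, index, angle):
        """Change one DOF and refresh only the cached transforms on its path to the root."""
        i = self.size + index
        self.nodes[i] = self._leaf(angle)
        i //= 2
        while i >= 1:
            self.nodes[i] = compose(self.nodes[2 * i], self.nodes[2 * i + 1])
            self.recomputed += 1
            i //= 2

    def end_frame(self):
        """Transform from the base frame to the end of the chain (cached at the root)."""
        return self.nodes[1]

angles = [0.1 * i for i in range(16)]
tree = TransformTree(angles)
tree.set_angle(5, 1.0)
print(tree.end_frame(), tree.recomputed)
```

With 16 joints the single change touches only the 4 internal nodes above the modified leaf, instead of recomposing all 16 transforms.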
Finding Interacting Pairs
▪ Do not search inside rigid sub-chains (unmarked nodes)
▪ Do not test two nodes with no marked node between them
▪ New interacting pairs
EnergyTree
[Figure: tree of partial energy sums, with nodes labeled E(N,N), E(J,L), E(K,L), E(L,L), E(M,M)]
Complexity
▪ n : total number of DOFs
▪ k : number of DOF changes at each MCS step
▪ k << n
▪ Complexity of:
• updating ChainTree: $O(k \log(n/k))$
• finding interacting pairs: $O(n^{4/3})$, but performs much better in practice!
Experimental Setup
▪ Energy function:
• Van der Waals
• Electrostatic
• Attraction between native contacts
• Cutoff at 12Å
▪ 300,000 steps MCS with Grid and ChainTree
▪ Steps are the same with both methods
▪ Early rejection for large vdW terms
Results: 1-DOF change
[Bar chart: speedup versus # amino acids (68, 144, 374, 755); bar labels: 3.5, 5.8, 7.8, 12.5]
Results: 5-DOF change
[Bar chart: speedup versus # amino acids (68, 144, 374, 755); bar labels: 2.2, 3.4, 4.5, 5.9]
Two-Pass ChainTree (ChainTree+)
1st pass: small cutoff distance to detect steric clashes
2nd pass: normal cutoff distance
[Figure labels: >5; tests around native state]
Interaction with Solvent
▪ Explicit solvent models: 100s or 1000s of discrete solvent molecules
▪ Implicit solvent models: solvent as continuous medium, interface is solvent-accessible surface
E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under
Conformational Changes.
http://www.give.nl/movie/publications/telaviv/EH04.pdf
Summary
▪ Inverse kinematics techniques → Improve structure determination from fuzzy electron density maps
▪ Collision detection techniques → Speed up energy maintenance during Monte Carlo simulation
About Computational Biology
▪ Computational Biology is more than applying computers to biological problems or mimicking nature (e.g., performing MD simulation)
▪ One of its goals is to achieve algorithmic efficiency by exploiting properties of molecules, e.g.:
• Proteins are long kinematic chains
• Atoms cannot bunch up together
• Forces have relatively short ranges