Robotics Algorithms for the Study of
Protein Structure and Motion
Jean-Claude Latombe
Computer Science Department
Stanford University
Based on Itay Lotan’s PhD
Unfolded (denatured) state
Many pathways
Folded (native) state
Loops connect helices and strands
Protein Sequence, Structure
[Figure labels: amino acid (residue), peptide bonds]
φ-ψ Kinematic Linkage Model
[Figure labels: Molecule, Robot]
Why Study Proteins?
They perform many vital functions, e.g.:
• catalysis of reactions
• storage of energy
• transmission of signals
• building blocks of muscles
They are linked to key biological problems that raise major computational challenges, mostly due to their large size (100s to several 1000s of atoms), their many kinematic degrees of freedom, and their huge number (millions)
Two problems
Structure determination from electron density maps
• Inverse kinematics techniques
[Itay Lotan, Henry van den Bedem, Ashley Deacon
(Joint Center for Structural Genomics)]
Energy maintenance during Monte Carlo simulation
• Distance computation techniques
[Itay Lotan, Fabian Schwarzer, and Danny Halperin
(Tel Aviv University)]
Structure Determination:
X-Ray Crystallography
Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID
• 1.0Å < d < 2.3Å: ~90% completeness
• 2.3Å ≤ d < 3.0Å: ~67% completeness (varies widely) ¹
JCSG: 43% of data sets 2.3Å
Manually completing a model:
• Labor intensive, time consuming
• Existing tools are highly interactive
Model completion is the bottleneck of high-throughput structure determination
1 Badger (2003) Acta Cryst. D59
The Completion Problem
Input:
• Electron-density map
• Partial structure
• Two anchor residues
• Amino-acid sequence of missing fragment (typically 4 – 15 residues long)
[Figure: partial structure (folded) with the missing fragment between Anchor 1 (3 atoms) and Anchor 2 (3 atoms)]
Output:
• Ranked conformations Q of fragment that:
– Respect the closure constraint (Inverse Kinematics)
– Maximize target function T(Q) measuring fit with electron-density map
– Have no atomic clashes
Two-Stage IK Method
1. Candidate generation: closed fragments
2. Candidate refinement: optimize fit with the electron-density map (EDM)
Stage 1: Candidate Generation
1. Generate a random conformation of fragment (only one end attached to anchor)
2. Close fragment (i.e., bring other end to second anchor) using Cyclic Coordinate Descent (CCD)
(Wang & Chen ’91, Canutescu & Dunbrack ’03)
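Closure distance: $S = \|N - N'\|^2 + \|C_\alpha - C_\alpha'\|^2 + \|C - C'\|^2$
($N$, $C_\alpha$, $C$: anchor atoms at the moving end; primed: their target positions at the fixed end)
For each angle in turn, compute $q_i$ that minimizes S
+ bias toward avoiding steric clashes
A.A. Canutescu and R.L. Dunbrack Jr. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Prot. Sci. 12:963–972, 2003.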
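The closing step can be illustrated on a much simpler system. The sketch below is a planar simplification, not the protein version: it assumes a 2-D chain of rigid links with relative joint angles and a single end point to close (instead of three backbone atoms), and it omits the steric-clash bias. The names `forward` and `ccd_close` are hypothetical.

```python
import math

def forward(lengths, angles):
    """Joint positions of a planar chain rooted at the origin (the fixed anchor)."""
    x = y = heading = 0.0
    points = [(x, y)]
    for length, angle in zip(lengths, angles):
        heading += angle  # each angle is relative to the previous link
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

def ccd_close(lengths, angles, target, tol=1e-5, max_sweeps=500):
    """Cyclic coordinate descent: adjust one angle at a time to bring the
    moving end onto `target`. Returns (angles, squared closure distance)."""
    angles = list(angles)
    s = float("inf")
    for _ in range(max_sweeps):
        for i in range(len(angles)):
            points = forward(lengths, angles)
            px, py = points[i]            # pivot of joint i
            ex, ey = points[-1]           # current moving end
            ux, uy = ex - px, ey - py     # pivot -> moving end
            vx, vy = target[0] - px, target[1] - py  # pivot -> target
            # Rotating about the pivot until both vectors are aligned
            # minimizes the end-to-target distance with respect to angle i.
            angles[i] += math.atan2(ux * vy - uy * vx, ux * vx + uy * vy)
        ex, ey = forward(lengths, angles)[-1]
        s = (ex - target[0]) ** 2 + (ey - target[1]) ** 2
        if s < tol ** 2:
            break
    return angles, s

closed_angles, closure = ccd_close([1.0] * 4, [0.3, -0.5, 0.4, 0.2], (2.0, 1.0))
```

In the protein setting each update is a rotation about a bond axis in 3-D and S sums over three atoms, but the one-angle-at-a-time structure is the same.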
Exact Inverse Kinematics
Repeat for each conformation of a closed fragment:
1. Pick 3 amino-acids at random (3 pairs of φ-ψ angles)
2. Apply exact IK solver to generate all IK solutions [Coutsias et al, 2004]
[Figure labels: GLU-83, TM0813, GLY-96]
Stage 2: Candidate Refinement
Target function T (Q) measuring quality of the fit with the EDM
Minimize T while retaining closure
Closed conformations lie on a self-motion manifold of lower dimension
[Figure: 1-D manifold through $(q_1, q_2, q_3)$, drawn with axes $dq_1$, $dq_2$, $dq_3$; the null space is marked at that point]
Closure and Null Space
$dX = J\,dQ$, where J is the $6 \times n$ Jacobian matrix ($n > 6$)
Null space $\{dQ \mid J\,dQ = 0\}$ has dim = n – 6
N: orthonormal basis of null space
$dQ = N N^T \nabla T(Q)$
Computation of N
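SVD of J: $J = U \Sigma V^T$
• $U$: $6 \times 6$
• $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_6)$: $6 \times 6$
• $V^T$: $6 \times n$
$N^T$: $(n-6) \times n$, basis N of null space (Gram-Schmidt orthogonalization)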
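The projection $N N^T \nabla T$ can also be computed without forming N explicitly. The sketch below makes that substitution: it builds an orthonormal basis of the row space of J by Gram-Schmidt and subtracts the row-space component of the gradient, which leaves the same vector as $N N^T \nabla T$ because the row space and the null space are orthogonal complements. It is a standard-library illustration, not the SVD route named on the slide; the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def row_space_basis(jacobian, eps=1e-12):
    """Orthonormal basis of the row space of J (modified Gram-Schmidt)."""
    basis = []
    for row in jacobian:
        v = list(row)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > eps:  # skip rows that are linearly dependent
            basis.append([vi / norm for vi in v])
    return basis

def project_to_null_space(jacobian, gradient):
    """Component of `gradient` that satisfies J dQ = 0."""
    d = list(gradient)
    for b in row_space_basis(jacobian):
        c = dot(d, b)
        d = [di - c * bi for di, bi in zip(d, b)]
    return d

# Toy example with 2 constraints and 4 DOFs (the slides use 6 and n).
J = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
dQ = project_to_null_space(J, [1.0, 2.0, 3.0, 4.0])
```

Moving along `dQ` leaves the constrained quantities unchanged to first order, which is why closure has to be re-checked after each finite step.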
Refinement Procedure
Repeat until minimum of T is reached:
1. Compute J and N at current Q
2. Compute $\nabla T$ at current Q
(analytical expression of T + linear-time recursive computation [Abe et al., Comput. Chem., 1984])
3. Move by small increment along $dQ = N N^T \nabla T$
(+ Monte Carlo / simulated annealing protocol to deal with local minima)
Tests #1: Artificial Gaps
TM1621 (234 residues) and TM0423 (376 residues), SCOP classification α/β
Complete structures (gold standard) resolved with EDM at 1.6Å resolution
Compute EDM at 2, 2.5, and 2.8Å resolution
Remove fragments and rebuild
103 fragments from TM1621 at 2.5Å
Short fragments: 100% < 1.0Å aaRMSD
Long fragments:
• length 12: 96% < 1.0Å aaRMSD
• length 15: 88% < 1.0Å aaRMSD
Produced by H. van den Bedem
Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD
Tests #2: True Gaps
Structure computed by RESOLVE
Gaps completed independently (gold standard)
Example: TM1742 (271 residues)
2.4Å resolution; 5 gaps left by RESOLVE
Length       4      5      5      7      10
Top scorer   0.22Å  0.78Å  0.36Å  0.72Å  0.43Å
Produced by H. van den Bedem
TM1621
Green: manually completed conformation
Cyan: conformation computed by stage 1
Magenta: conformation computed by stage 2
The aaRMSD improved by 2.4Å to 0.31Å
Current/Future Work
Software actively being used at the JCSG
What about multi-modal loops?
TM0755: data at 1.8Å
8-residue fragment crystallized in 2 conformations (A and B), from Ser A316 to His A323
Overlapping density: difficult to interpret manually
Algorithm successfully identified and built both conformations
Fuzziness in EDM can then be exploited
Use EDM to infer probability measure over the conformation space of the loop
Amylosucrase
J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran.
J. Comp. Chemistry, 25:956-967, 2004
Energy maintenance during Monte Carlo simulation
Joint work with Itay Lotan, Fabian Schwarzer, and Dan Halperin¹
¹ Computer Science Department, Tel Aviv University
Monte Carlo Simulation (MCS)
Random walk through conformation space
At each attempted step:
• Perturb current conformation at random
• Accept step with probability: $P = \min\left(1, e^{-\Delta E / kT}\right)$, where $\Delta E$ is the energy change of the step
The conformations generated by an arbitrarily long MCS are Boltzmann distributed, i.e.,
#conformations in V ~ $\int_V e^{-E/kT}\,dV$
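A minimal sketch of this random walk follows, assuming the standard Metropolis acceptance rule (always accept a step that lowers the energy, otherwise accept with probability $e^{-\Delta E/kT}$). The functions `metropolis_accept` and `monte_carlo`, and the one-dimensional harmonic energy in the usage example, are illustrative assumptions, not part of the original work.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng=random):
    """Accept downhill steps always, uphill steps with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)

def monte_carlo(energy, perturb, x0, kT, steps, seed=0):
    """Random walk through conformation space; returns the visited states."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(steps):
        candidate = perturb(x, rng)          # perturb current conformation at random
        e_new = energy(candidate)
        if metropolis_accept(e_new - e, kT, rng):
            x, e = candidate, e_new          # step accepted
        samples.append(x)                    # a rejected step repeats the old state
    return samples

# Toy usage: E(x) = x^2 / 2 with kT = 1, whose Boltzmann distribution
# is a unit-variance Gaussian.
samples = monte_carlo(lambda x: 0.5 * x * x,
                      lambda x, rng: x + rng.uniform(-1.0, 1.0),
                      x0=0.0, kT=1.0, steps=20000)
```

Every attempted step calls `energy` once, which is why the cost of an energy evaluation dominates a long run.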
Monte Carlo Simulation (MCS)
Used to:
• sample meaningful distributions of conformations
• generate energetically plausible motion pathways
A simulation run may consist of millions of steps, so energy must be evaluated a large number of times
Problem: How to maintain energy efficiently?
Energy Function
$E = \sum \text{bonded terms} + \sum \text{non-bonded terms} + \sum \text{solvation terms}$
Bonded terms
– O(n)
Non-bonded terms
– E.g., Van der Waals and electrostatic
– Depend on distances between pairs of atoms
– $O(n^2)$: expensive to compute
Solvation terms
– May require computing molecular surface
Non-Bonded Terms
Energy terms go to 0 when distance increases
Cutoff distance (6 - 12Å)
vdW forces prevent atoms from bunching up
Only O(n) interacting pairs
[Halperin&Overmars 98]
Problem: How to find interacting pairs without enumerating all atom pairs?
[Figure label: $d_{cutoff}$]
Grid Method
Subdivide 3-space into cubic cells
Compute cell that contains each atom center
Represent grid as hashtable
[Figure label: $d_{cutoff}$]
Grid Method
Θ(n) time to build grid
O(1) time to find interacting pairs for each atom
Θ(n) to find all interacting pairs of atoms
[Halperin&Overmars, 98]
Asymptotically optimal in worst-case
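The grid method can be sketched as follows, assuming the cell side equals the cutoff distance, so that every interacting partner of an atom lies in its own cell or one of the 26 adjacent cells. The function name `interacting_pairs` and the use of a Python dictionary as the hash table are illustrative choices.

```python
import math
from collections import defaultdict
from itertools import product

def interacting_pairs(centers, cutoff):
    """Return all index pairs (i, j), i < j, whose centers are within `cutoff`."""
    # Hash table from integer cell coordinates to the atoms in that cell.
    grid = defaultdict(list)
    for idx, (x, y, z) in enumerate(centers):
        cell = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        grid[cell].append(idx)

    pairs = set()
    for (cx, cy, cz), members in grid.items():
        # Only the cell itself and its 26 neighbors can hold partners.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbors = grid.get((cx + dx, cy + dy, cz + dz))
            if not neighbors:
                continue
            for i in members:
                for j in neighbors:
                    if i < j and math.dist(centers[i], centers[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs

atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
close = interacting_pairs(atoms, cutoff=6.0)
```

The linear bound relies on atoms not bunching up: each cell then holds a bounded number of atoms, so each atom is compared against a constant number of candidates. The whole grid is rebuilt at every step, whatever the number of DOF changes.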
Can we do better on average?
Few DOFs are changed at each MC step
[Plot: simulation of 100,000 attempted steps; x-axis: number k of DOF changes (5, 10, 20, 30)]
Proteins are long kinematic chains
Long sub-chains stay rigid at each step
Many interacting pairs of atoms are unchanged
Many partial energy sums remain constant
Problem: How to find new interacting pairs and retrieve unchanged partial sums?
Two New Data Structures
1. ChainTree
Fast detection of interacting atom pairs
2. EnergyTree
Retrieval of unchanged partial energy sums
ChainTree
(Twofold Hierarchy: BVs + Transforms)
[Figure: chain of links and joints, with the hierarchy of bounding volumes (BVs) and transforms such as $T_{AB}$, $T_{JK}$, $T_{NO}$]
Updating the ChainTree
Update path to root:
– Recompute transforms that “shortcut” the DOF change
– Recompute BVs that contain the DOF change
– $O(k \log_2(2n/k))$ work for k changes
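The transform half of this hierarchy can be sketched with a balanced binary tree in which each internal node stores the composition of its children's transforms (the "shortcut"). This is a simplified illustration under stated assumptions: a planar chain, rigid transforms stored as (angle, tx, ty), and no bounding volumes. The class name `TransformTree` is hypothetical.

```python
import math

IDENTITY = (0.0, 0.0, 0.0)

def compose(a, b):
    """Planar rigid transforms (angle, tx, ty): result applies b, then a."""
    ang_a, ax, ay = a
    ang_b, bx, by = b
    c, s = math.cos(ang_a), math.sin(ang_a)
    return (ang_a + ang_b, ax + c * bx - s * by, ay + s * bx + c * by)

class TransformTree:
    """Binary tree over the joints; each node caches a composed transform."""

    def __init__(self, lengths, angles):
        self.lengths = list(lengths)
        self.size = 1
        while self.size < len(self.lengths):
            self.size *= 2                      # pad to a power of two
        self.nodes = [IDENTITY] * (2 * self.size)
        for i, angle in enumerate(angles):
            self.nodes[self.size + i] = self._leaf(i, angle)
        for k in range(self.size - 1, 0, -1):   # build bottom-up
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])

    def _leaf(self, i, angle):
        # Rotate by the joint angle, then advance along the link.
        length = self.lengths[i]
        return (angle, length * math.cos(angle), length * math.sin(angle))

    def set_angle(self, i, angle):
        """Change one DOF; return how many internal nodes were recomputed."""
        k = self.size + i
        self.nodes[k] = self._leaf(i, angle)
        touched = 0
        k //= 2
        while k >= 1:                           # walk the path to the root
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])
            touched += 1
            k //= 2
        return touched

    def end_position(self):
        _, x, y = self.nodes[1]                 # root: whole-chain transform
        return (x, y)

tree = TransformTree([1.0] * 8, [0.1 * i for i in range(8)])
recomputed = tree.set_angle(3, 0.7)
```

A single change touches only the nodes on one root path. With k changes the paths share their upper ancestors, which is where a bound of the form k log(n/k), rather than k log n, comes from.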
Finding Interacting Pairs
Do not search inside rigid sub-chains (unmarked nodes)
Do not test two nodes with no marked node between them
New interacting pairs
EnergyTree
[Figure: tree of partial energy sums, with node labels E(J,L), E(K,L), E(L,L), E(M,M), E(N,N)]
Complexity
n : total number of DOFs
k : number of DOF changes at each MCS step
k << n
Complexity of:
updating ChainTree: $O(k \log_2(2n/k))$
finding interacting pairs: $O(n^{4/3})$, but performs much better in practice!
Experimental Setup
Energy function:
Van der Waals
Electrostatic
Attraction between native contacts
Cutoff at 12Å
300,000 steps MCS with Grid and ChainTree
Steps are the same with both methods
Early rejection for large vdW terms
Results: 1-DOF change
[Bar chart: speedup for proteins of 68, 144, 374, and 755 amino acids; values shown: 12.5, 7.8, 5.8, 3.5]
Results: 5-DOF change
[Bar chart: speedup for proteins of 68, 144, 374, and 755 amino acids; values shown: 5.9, 2.2, 4.5, 3.4]
Two-Pass ChainTree (ChainTree+)
1st pass: small cutoff distance to detect steric clashes
2nd pass: normal cutoff distance
[Plot: tests around native state; label: >5]
Interaction with Solvent
Implicit solvent model: the solvent is treated as a continuous medium, and the interface is the solvent-accessible surface
E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under
Conformational Changes. http://www.give.nl/movie/publications/telaviv/EH04.pdf
Summary
Inverse kinematics techniques
– Improve structure determination from fuzzy electron density maps
Collision detection techniques
– Speed up energy maintenance during Monte Carlo simulation
About Computational Biology
Computational Biology is more than mimicking nature (e.g., performing Molecular Dynamics simulation)
One of its goals is to achieve algorithmic efficiency by exploiting properties of molecules, e.g.:
• Atoms cannot bunch up together
• Forces have relatively short ranges
• Proteins are long kinematic chains