Robotics Algorithms for the Study of Protein Structure and Motion Jean-Claude Latombe Computer Science Department Stanford University Protein Long sequence of amino-acids (dozens to thousands), from a dictionary of 20 distinct amino-acids Central Dogma of Molecular Biology Physiological conditions: aqueous solution, 37°C, pH 7, atmospheric pressure Why Proteins? They are the workhorses of living organisms • They perform many vital functions, e.g.: - catalysis of reactions storage of energy transmission of signals building blocks of muscles They raise challenging computational issues • Large molecules (100s to several 1000s of atoms) • Made of building blocks drawn from a small “dictionary” • Unusual kinematic structure They are associated with many critical problems • Folded structure determination • Global and local structural similarities • Prediction of folding and binding motions f-y Kinematic Linkage Model peptide group side-chain group Molecule and Robot Two problems Structure determination from electron density maps • Inverse kinematics techniques [Itay Lotan, Henry van den Bedem, Ashley Deacon (Joint Center for Structural Genomics)] Energy maintenance during Monte Carlo simulation • Collision detection techniques [Itay Lotan, Fabian Schwarzer, and Danny Halperin (Tel Aviv University)] Structure Determination/Prediction Experimental tools X-ray crystallography Computational tools • Homology, threading • Molecular dynamics NMR spectrometry Protein Data Bank Only about 10% of structures have been determined for known protein sequences Protein Structure Initiative (PSI) 1990 1999 2000 2004 250 new structures 2500 new structures >20,000 structures total ~30,000 structures total X-Ray Crystallography Automated Model Building Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID • 1.0Å < d < 2.3Å ~ 90% completeness • 2.3Å ≤ d < 3.0Å ~ 67% completeness (varies widely)1 1.0Å 3.0Å JCSG: 43% of data sets 2.3Å Manually completing a model: • Labor intensive, time consuming • Existing tools are highly interactive Model completion is high-throughput bottleneck 1Badger (2003) Acta Cryst. D59 The Completion Problem Input: Anchor 1 (3 atoms) • Electron-density map • Partial structure • Two anchor residues • Amino-acid sequence of missing fragment (typically 4 – 15 residues long) Anchor 2 (3 atoms) Protein fragment (fuzzy map) Main part of protein (folded) Output: • Few candidate conformation(s) of fragment that - Respect the closure constraint (IK) - Maximize match with electron-density map IK Problem Input: • Closed kinematic chain with n > 6 degrees of freedom • Relative positions/orientations X of end frames • Target function T(Q) → R Output: • Joint angles Q that - Achieve closure - Optimize T T Related Work Biology/Crystallography Robotics/Computer Science • – – Manocha & Canny ’94 Manocha et al. ’95 – Wang & Chen ’91 – – Khatib ’87 Burdick ’89 – – – Han & Amato ’00 Yakey et al. ’01 Cortes et al. ’02, ’04 Optimization IK solvers • Redundant manipulators Motion planning for closed loops Exact IK solvers – – Exact IK solvers • • • • Optimization IK solvers – – • Fiser et al. ’00 Kolodny et al. ’03 Database search loop closure – – • Fine et al. ’86 Canutescu & Dunbrack Jr. ’03 Ab-initio loop closure – – • Wedemeyer & Scheraga ’99 Coutsias et al. ’04 Jones & Thirup ’86 Van Vlijman & Karplus ’97 Semi-automatic tools – – Jones & Kjeldgaard ’97 Oldfield ’01 Two-Stage IK Method 1. Candidate generations Closed fragments 2. Candidate refinement Optimize fit with EDM Stage 1: Candidate Generation 1. Generate random conformation of fragment (only one end attached to anchor) 2. Close fragment (i.e., bring other end to second anchor) using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03) Closure Distance Closure Distance: S N - N C - C C - C 2 moving end fixed end 2 2 A.A. Canutescu and R.L. Dunbrack Jr. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Prot. Sci. 12:963–972, 2003. S 0 Compute qi s.t. qi + bias toward EDM + avoid steric clashes Stage 2: Candidate Refinement Target function T (Q) measuring quality of the fit with the EDM Minimize T while retaining closure Closed conformations lie on a self-motion manifold of lower dimension dq3 dq2 Null space (q1,q2,q3) dq1 1-D manifold Closure and Null Space dX = J dQ, where J is the 6n Jacobian matrix (n > 6) Null space {dQ | J dQ = 0} has dim = n – 6 N: orthonormal basis of null space Pseudo-inverse J+ such that JJ+ = I dQ = J+dX + NNTy y = T(Q) Computation of J+ and N SVD of J dX U66 VT6n S66 s1 s2 dQ 0 = s6 NT (n-6) basis N of null space Gram-Schmidt orthogonalization J+ = V S+ UT where S+=diag[1/si] Refinement Procedure Repeat until minimum is reached: Compute J, J+ and N at current Q • Compute T at current Q (analytical expression of T + linear-time recursive computation [Abe et al., Comput. Chem., 1984]) • Move along dQ = J+dX + NNT T until minimum is reached or closure is broken + Monte Carlo + simulated annealing protocol to deal with local minima Monte Carlo Optimization Repeat: 1. Perform a random move of the fragment: – either by picking a random direction in null space – or by using an exact IK solver over 6 dofs [Coutsias et al, 2004] ( big jumps) 2. Minimize T(Q) 3. Accept move with Metropolis-criterion probability ~exp(-DT/Temp) Tests #1: Artificial Gaps TM1621 (234 residues) and TM0423 (376 residues), SCOP classification a/b Complete structures (gold standard) resolved with EDM at 1.6Å resolution Compute EDM at 2, 2.5, and 2.8Å resolution Remove fragments and rebuild TM1621 103 Fragments from TM1621 at 2.5Å Short Fragments: 100% < 1.0Å aaRMSD Long Fragments: 12: 96% < 1.0Å aaRMSD 15: 88% < 1.0Å aaRMSD Produced by H. van den Bedem Comparison Across Resolutions Resolution = 2.0Å Resolution = 2.5Å Resolution = 2.8Å Example: TM0423 PDB: 1KQ3, 376 res. 2.0Å resolution 12 residue gap Best: 0.3Å aaRMSD Tests #2: True Gaps Structure computed by RESOLVE Gaps completed independently (gold standard) Example: TM1742 (271 residues) 2.4Å resolution; 5 gaps left by RESOLVE Length Top scorer Lowest error 4 0.22Å 0.22Å 5 0.78Å 0.78Å 5 0.36Å 0.36Å 7 0.72Å 0.66Å 10 0.43Å 0.43Å Produced by H. van den Bedem TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap GLU-83 GLY-96 TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap Best 0.6Å aaRMSD GLU-83 GLY-96 TM1621 Green: manually completed conformation Cyan: conformation computed by stage 1 Magenta: conformation computed by stage 2 The aaRMSD improved by 2.4Å to 0.31Å Alr1529 D72-D78 resolution: initial model: contour: PDB: aaRMSD: 2.0Å ARP/wARP 1.0s 1VJG 0.33Å TM0542 • Top-scoring fragment in cyan • Manually completed fragment in green • Residues A259 and A260 are flipped Current/Future Work Software actively being used at the JCSG What about multi-modal loops? B A TM0755: data at 1.8Å 8-residue fragment crystallized in 2 conformations Overlapping density: Difficult to interpret manually A323 Hist A316 Ser Algorithm successfully identified and built both conformations Current/Future Work Software actively being used at the JCSG What about multi-modal loops? Fuzziness in EDM can then be exploited B Use EDM to infer probability measure over the conformation space of the loop A Amylosucrase J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran. J. Comp. Chemistry, 25:956-967, 2004 Energy maintenance during Monte Carlo simulation joint work with Itay Lotan, Fabian Schwarzer, and Dan Halperin1 1 Computer Science Department, Tel Aviv University Monte Carlo Simulation (MCS) Random walk through conformation space At each attempted step: • Perturb current conformation at random • Accept step with probability: P(accept ) min 1, e -DE / kbT The conformations generated by an arbitrarily long MCS are Boltzman distributed, i.e., #conformations in V ~ V e - E kT dV Monte Carlo Simulation (MCS) Used to: • sample meaningful distributions of conformations • generate energetically plausible motion pathways A simulation run may consist of millions of steps energy must be evaluated frequently Problem: How to maintain energy efficiently? Energy Function E = S bonded terms + S non-bonded terms Bonded terms + S solvation terms - O(n) Non-bonded terms - E.g., e.g. Van der Waals and electrostatic - Depend on distances between pairs of atoms - O(n2) Expensive to compute Solvation terms - May require computing molecular surface Non-Bonded Terms Energy terms go to 0 when distance increases Cutoff distance (6 - 12Å) vdW forces prevent atoms from bunching up Only O(n) interacting pairs [Halperin&Overmars 98] Problem: How to find interacting pairs without enumerating all atom pairs? Grid Method dcutoff Subdivide 3-space into cubic cells Compute cell that contains each atom center Represent grid as hashtable Grid Method dcutoff Θ(n) time to build grid O(1) time to find interactive pairs for each atom Θ(n) to find all interactive pairs of atoms [Halperin&Overmars, 98] Asymptotically optimal in worst-case Can we do better on average? Few DOFs are changed at each MC step 0 simulation of 100,000 attempted steps 5 10 20 30 Number k of DOF changes Can we do better on average? Few DOFs are changed at each MC step Proteins are long chain kinematics Long sub-chains stay rigid at each step Many partial energy sums remain constant Problem: How to retrieve the unchanged partial sums? Hierarchical Collision Checking Widely used technique in robotics/graphics to approximate distances between objects Pre-computation of bounding-volume hierarchy How to update this hierarchy if the objects deform Two New Data Structures 1. ChainTree Fast detection of interacting atom pairs 2. EnergyTree Retrieval of unchanged partial energy sums ChainTree (Twofold Hierarchy: BVs + Transforms) links ChainTree (Twofold Hierarchy: BVs + Transforms) TNO TJK TAB joints Updating the ChainTree Update path to root: – Recompute transforms that “shortcut” the DOF change – Recompute BVs that contain the DOF change – O(k log(n/k)) work for k changes Finding Interacting Pairs Finding Interacting Pairs Finding Interacting Pairs Do not search inside rigid sub-chains (unmarked nodes) Finding Interacting Pairs Do not search inside rigid sub-chains (unmarked nodes) Do not test two nodes with no marked node between them New interacting pairs EnergyTree E(N,N) E(J,L) E(K.L) E(L,L) E(M,M) EnergyTree E(N,N) E(J,L) E(K.L) E(L,L) E(M,M) Complexity n : total number of DOFs k : number of DOF changes at each MCS step k << n Complexity of: updating ChainTree: O(k log(n/k)) finding interacting pairs: O(n4/3) but performs much better in practice!!! Experimental Setup Energy function: Van der Waals Electrostatic Attraction between native contacts Cutoff at 12Å 300,000 steps MCS with Grid and ChainTree Steps are the same with both methods Early rejection for large vdW terms Results: 1-DOF change 12.5 7.8 speedup 5.8 3.5 # amino acids (68) (144) (374) (755) Results: 5-DOF change 5.9 speedup 4.5 3.4 2.2 (68) (144) (374) (755) Two-Pass ChainTree (ChainTree+) 1st pass: small cutoff distance to detect steric clashes 2nd pass: normal cutoff distance >5 Tests around native state Interaction with Solvent Explicit solvent models: 100s or 1000s of discrete solvent molecules Implicit solvent models: solvent as continuous medium, interface is solvent-accessible surface E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under Conformational Changes. http://www.give.nl/movie/publications/telaviv/EH04.pdf Summary Inverse kinematics techniques Improve structure determination from fuzzy electron density maps Collision detection techniques Speedup energy maintenance during Monte Carlo simulation About Computational Biology Computational Biology is more than using computers to biological problems or mimicking nature (e.g., performing MD simulation) One of its goals is to achieve algorithmic efficiency by exploiting properties of molecules, e.g.: • Proteins are long kinematic chains • Atoms cannot bunch up together • Forces have relatively short ranges