Molecular Motion Pathways: Computation of Ensemble Properties with Probabilistic Roadmaps

1) A.P. Singh, J.C. Latombe, and D.L. Brutlag. A Motion Planning Approach to Flexible Ligand Binding. Proc. 7th Int. Conf. on Intelligent Systems for Molecular Biology (ISMB), AAAI Press, Menlo Park, CA, pp. 252-261, 1999.
2) N.M. Amato, K.A. Dill, and G. Song. Using Motion Planning to Map Protein Folding Landscapes and Analyze Folding Kinetics of Known Native Structures. J. Comp. Biology, 10(2):239-255, 2003.
3) M.S. Apaydin, D.L. Brutlag, C. Guestrin, D. Hsu, J.C. Latombe, and C. Varma. Stochastic Roadmap Simulation: An Efficient Representation and Algorithm for Analyzing Molecular Motion. J. Comp. Biology, 10(3-4):257-281, 2003.
4) N. Singhal, C.D. Snow, and V.S. Pande. Using Path Sampling to Build Better Markovian State Models: Predicting the Folding Rate and Mechanism of a Tryptophan Zipper Beta Hairpin. J. Chemical Physics, 121(1):415-425, 2004.
5) J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran. Geometric Algorithms for the Conformational Analysis of Long Protein Loops. J. Comp. Chemistry, 25:956-967, 2004.

Molecular motion is an essential process of life
• Mad cow disease is caused by misfolding
• Drug molecules act by binding to proteins
So, studying molecular motion is of critical importance in molecular biology
However, few tools are available

Computer simulation:
- Monte Carlo simulation
- Molecular Dynamics
[Images: Stanford BioX cluster; NMR spectrometer]

Two Major Drawbacks of MD and MC Simulation
1) Each simulation run yields a single pathway, while molecules tend to move along many different pathways
Hence the interest in ensemble properties
[Figure: many pathways leading from the unfolded (denatured) state through intermediate states to the folded (native) state]

Example of Ensemble Property: Probability of Folding pfold
pfold measures the kinetic distance to the folded state
Du, Pande, Grosberg, Tanaka, and Shakhnovich. On the Transition Coordinate for Protein Folding. Journal of Chemical Physics (1998).
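For reference, pfold can be written as a "which state is reached first" probability. The notation below (conformation q, folded set F, unfolded set U, and first hitting times tau_F and tau_U) is introduced here for illustration and does not appear on the slides:

```latex
p_{\mathrm{fold}}(q) \;=\; \Pr\left[\, \tau_F < \tau_U \;\middle|\; q_0 = q \,\right],
\qquad
p_{\mathrm{fold}}(q) = 1 \ \text{for } q \in F,
\qquad
p_{\mathrm{fold}}(q) = 0 \ \text{for } q \in U .
```

Under this reading, 1 - pfold(q) is the probability that the molecule, started at q, reaches the unfolded state before the folded one.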
[Figure: starting from a given conformation, the molecule reaches the folded state first with probability pfold and the unfolded state first with probability 1 - pfold]

Other Examples of Ensemble Properties
Folding:
• Order of formation of secondary structure elements (SSE's)
• Folding rate / Mean first passage time
• Key intermediates
Binding:
• Average time to escape from active site
• Average energy barrier

Two Major Drawbacks of MD and MC Simulation
1) Each simulation run yields a single pathway, while molecules tend to move along many different pathways
2) Each simulation run tends to waste much time in local minima

Roadmap-Based Representation
• Compact representation of many motion pathways
• Coarse resolution relative to MC and MD simulation
• Efficient algorithms for analyzing multiple pathways

Roadmaps for Robot Motion Planning
[Figure: roadmap built in the free space] [Kavraki, Svestka, Latombe, Overmars, 96]

Initial Work
A.P. Singh, J.C. Latombe, and D.L. Brutlag. A Motion Planning Approach to Flexible Ligand Binding. Proc. 7th ISMB, pp. 252-261, 1999
• Study of ligand-protein binding
• The ligand is a small flexible molecule, but the protein is assumed rigid
• A fixed coordinate system P is attached to the protein and a moving coordinate system L is defined using three bonded atoms in the ligand
• A conformation of the ligand is defined by the position and orientation of L relative to P and the torsional angles of the ligand

Roadmap Construction (Node Generation)
The nodes of the roadmap are generated by sampling conformations of the ligand uniformly at random in the parameter space (around the protein)
The energy E at each sampled conformation is computed:
$E = E_{interaction} + E_{internal}$
$E_{interaction}$ = electrostatic + van der Waals potential
$E_{internal} = \sum_{\text{non-bonded pairs of atoms}} (\text{electrostatic} + \text{van der Waals})$
A sampled conformation is retained as a node of the roadmap with probability:
$P = \begin{cases} 0 & \text{if } E > E_{max} \\ \dfrac{E_{max} - E}{E_{max} - E_{min}} & \text{if } E_{min} \le E \le E_{max} \\ 1 & \text{if } E < E_{min} \end{cases}$
This yields a denser distribution of nodes in low-energy regions of conformational space

Roadmap Construction (Edge Generation)
[Figure: edge from q to q' discretized into intermediate conformations q_i, q_{i+1}; energy profile along the edge compared with Emax]
Each node is connected to its closest neighbors by straight edges
Each edge is discretized so that between $q_i$ and $q_{i+1}$ no atom moves by more than some ε (= 1Å)
If any $E(q_i) > E_{max}$, then the edge is rejected

Roadmap Construction (Edge Generation)
Any two nodes closer than some threshold distance are connected by a straight edge
Each edge is discretized so that between $q_i$ and $q_{i+1}$ no atom moves by more than some ε (= 1Å)
If all $E(q_i) \le E_{max}$, then the edge is retained and is assigned two weights $w(q \to q')$ and $w(q' \to q)$, where:
$w(q \to q') = \sum_i -\ln\left(P[q_i \to q_{i+1}]\right)$
(heuristic measure of the energetic difficulty of moving from q to q')
$P[q_i \to q_{i+1}] = \dfrac{e^{-(E_{i+1} - E_i)/kT}}{e^{-(E_{i+1} - E_i)/kT} + e^{-(E_{i-1} - E_i)/kT}}$
(probability that the ligand moves from $q_i$ to $q_{i+1}$ when it is constrained to move along the edge)

Querying the Roadmap
For a given goal node qg (e.g., binding conformation), Dijkstra's single-source algorithm computes the lowest-weight paths from qg to each node (in either direction) in O(N log N) time, where N = number of nodes
Various quantities can then be easily computed in O(N) time, e.g., average weights of all paths entering qg and of all paths leaving qg (~ binding and dissociation rates Kon and Koff)
[Figure: Protein: Lactate dehydrogenase; Ligand: Oxamate (7 degrees of freedom)]

Experiments on 3 Complexes
1) PDB ID: 1ldm
   Receptor: Lactate Dehydrogenase (2386 atoms, 309 residues)
   Ligand: Oxamate (6 atoms, 7 dofs)
2) PDB ID: 4ts1
   Receptor: Mutant of tyrosyl-transfer-RNA synthetase (2423 atoms, 319 residues)
   Ligand: L-leucyl-hydroxylamine (13 atoms, 9 dofs)
3) PDB ID: 1stp
   Receptor: Streptavidin (901 atoms, 121 residues)
   Ligand: Biotin (16 atoms, 11 dofs)

Computation of Potential Binding Conformations
1) Sample many (several thousand) conformations of the ligand at random around the protein active site
2) Repeat several times:
   - Select the lowest-energy conformations that are close to the protein surface
   - Resample around them
3) Retain the k (~10) lowest-energy conformations whose centers of mass are at least 5Å apart
[Figure: lactate dehydrogenase]

Results for 1ldm
• Some potential binding sites have slightly lower energy than the active site
  Energy is not a discriminating factor
• Average path weights (energetic difficulty) to enter and leave the binding site are significantly greater for the active site
  This indicates that the active site is surrounded by an energy barrier that “traps” the ligand
[Figure: energy versus conformation, with the active site lying between two potential binding sites]

Application of Roadmaps to Protein Folding
N.M. Amato, K.A. Dill, and G. Song. Using Motion Planning to Map Protein Folding Landscapes and Analyze Folding Kinetics of Known Native Structures. J. Comp. Biology, 10(2):239-255, 2003
• Known native state
• Degrees of freedom: φ-ψ angles
• Energy: van der Waals, hydrogen bonds, hydrophobic effect
• New idea: Sampling strategy
• Application: Finding order of SSE formation

Sampling Strategy (Node Generation)
• High dimensionality calls for non-uniform sampling
• Conformations are sampled using a Gaussian distribution around the native state
• Conformations are sorted into bins by number of native contacts (pairs of Cα atoms that are close together in the native structure)
• Sampling ends when all bins have a minimum number of conformations
This gives “good” coverage of conformational space

Application: Order of Formation of Secondary Structures
• The lowest-weight path is extracted from each denatured conformation to the folded one
• The order of formation of SSE's is computed along each path
• The formation order that appears most often over all paths is considered the SSE formation order of the protein

Method
1) The contact matrix showing the time step when each native contact appears is built
[Figure: protein CI2 (1α + 4β)]
[Figure: contact matrix of protein CI2 (1α + 4β); the native contact between residues 5 and 60 appears at step 216]

Method
1) The contact matrix showing the time step when each native contact appears is built
2) The time step at which a structure appears is approximated as the average of the appearance time steps of its contacts
[Figure: protein CI2 (1α + 4β), with formation events labeled I to V]
- α forms at time step 122 (II)
- β3 and β4 come together at 187 (V)
- β2 and β3 come together at 210 (IV)
- β1 and β4 come together at 214 (I)
- α and β4 come together at 214 (III)

Comparison with Experimental Data
[Table: protein, SSE's, roadmap size. Legible entries: CI2; SSE's 1α + 4β, 1α + 4β, 1α + 5β; roadmap sizes 5126, 70k; 5471, 104k; 7975, 104k; 8357, 119k]

Stochastic Roadmaps
M.S. Apaydin, D.L. Brutlag, C. Guestrin, D. Hsu, J.C. Latombe and C. Varma. Stochastic Roadmap Simulation: An Efficient Representation and Algorithm for Analyzing Molecular Motion. J. Comp. Biol., 10(3-4):257-281, 2003
New Idea: Capture the stochastic nature of molecular motion by assigning probabilities to edges
[Figure: nodes v_i and v_j joined by an edge with probability P_ij]

Edge probabilities
Follow the Metropolis criterion:
$P_{ij} = \begin{cases} \dfrac{1}{N_i} \exp(-\Delta E_{ij}/kT) & \text{if } \Delta E_{ij} > 0 \\ \dfrac{1}{N_i} & \text{otherwise} \end{cases}$
Self-transition probability:
$P_{ii} = 1 - \sum_{j \ne i} P_{ij}$
[Roadmap nodes are sampled uniformly at random and the energy profile along edges is not considered]

Stochastic Roadmap Simulation
Stochastic roadmap simulation (SRS) and Monte Carlo simulation converge to the Boltzmann distribution, i.e., the number of times SRS is at a node in a region V converges (up to normalization) toward $\int_V e^{-E/kT}\, dV$ when the number of nodes grows (and they are uniformly distributed)

Roadmap as Markov Chain
Transition probability $P_{ij}$ depends only on i and j

Example #1: Probability of Folding pfold
[Figure: from a node, the folded state is reached first with probability pfold and the unfolded state first with probability 1 - pfold]

First-Step Analysis
U: Unfolded state, F: Folded state
[Figure: node i with self-transition P_ii and edges P_ij, P_ik, P_il, P_im to neighbors j, k, l, m]
Let $f_i = pfold(i)$
After one step: $f_i = P_{ii} f_i + P_{ij} f_j + P_{ik} f_k + P_{il} f_l + P_{im} f_m$, with f = 1 for neighbors that lie in F (two terms are marked "= 1" on the slide)
• One linear equation per node
• Solution gives pfold for all nodes
• No explicit simulation run
• All pathways are taken into account
• Sparse linear system

Number of Self-Avoiding Walks on a 2D Grid
1, 2, 12, 184, 8512, 1262816, 575780564, 789360053252, 3266598486981642,
(10x10) 41044208702632496804,
(11x11) 1568758030464750013214100,
(12x12) 182413291514248049241470885236 > 10^28
http://mathworld.wolfram.com/Self-AvoidingWalk.html

In contrast …
Computing pfold with MC simulation requires:
For every conformation q of interest
  Perform many MC simulation runs from q
  Count the number of times F is attained first

Computational Tests
• 1ROP (repressor of primer): 2 helices, 6 DOF
• 1HDD (Engrailed homeodomain): 3 helices, 12 DOF
H-P energy model with steric clash exclusion [Sun et al., 95]

Correlation with MC Approach
[Figure: correlation between roadmap and MC estimates of pfold for 1ROP]

pfold for β hairpin
• Immunoglobulin binding protein (Protein G)
• Last 16 amino acids
• Cα based representation
• Go model energy function
• 42 DOFs
[Zhou and Karplus, '99]

Computation Times (β hairpin)
Monte Carlo (30 simulations): ~10 hours of computer time, over 10^7 energy computations, 1 conformation
Roadmap: 23 seconds of computer time, ~50,000 energy computations, 2000 conformations
~6 orders of magnitude
speedup!

Example #2: Ligand-Protein Interaction
Computation of escape time from funnels of attraction around potential binding sites
Funnel of attraction = ball of 10Å rmsd around the bound state [Camacho and Vajda, 01]

Computation Through Simulation
[Sept, Elcock and McCammon '99]
10K to 30K independent simulations

Computing Escape Time with Roadmap
[Figure: node i inside the funnel of attraction, with self-transition P_ii and edges P_ij, P_ik, P_il, P_im to neighbors j, k, l, m]
$t_i = 1 + P_{ii} t_i + P_{ij} t_j + P_{ik} t_k + P_{il} t_l + P_{im} t_m$, with t = 0 for a neighbor outside the funnel (one term is marked "= 0" on the slide)
(escape time is measured as number of steps of stochastic simulation)

Distinguishing Active Site
Given several potential binding sites, which one is the active one?
Energy: electrostatic + van der Waals + solvation free energy terms

Complexes Studied
ligand          protein   # random nodes   # DOFs
oxamate         1ldm      8000             7
Streptavidin    1stp      8000             11
Hydroxylamine   4ts1      8000             9
COT             1cjw      8000             21
THK             1aid      8000             14
IPM             1ao5      8000             10
PTI             3tpi      8000             13

Distinction Using Escape Time (# steps)
Protein   Bound state   Best potential binding site
1stp      3.4E+9        1.1E+7
4ts1      3.8E+10       1.8E+6
3tpi      1.3E+11       5.9E+5
1ldm      8.1E+5        3.4E+6
1cjw      5.4E+8        4.2E+6
1aid      9.7E+5        1.6E+8
1ao5      6.6E+7        5.7E+6
Legend: able to distinguish catalytic site / not able

Using Path Sampling to Construct Roadmaps
N. Singhal, C.D. Snow, and V.S. Pande. Using Path Sampling to Build Better Markovian State Models: Predicting the Folding Rate and Mechanism of a Tryptophan Zipper Beta Hairpin, J.
Chemical Physics, 121(1):415-425, 2004
New idea:
• Paths computed with Molecular Dynamics simulation techniques are used to create the nodes of the roadmap
  More pertinent and better distributed nodes
• Edges are labeled with the time needed to traverse them

Sampling Nodes from Computed Paths (Path Shooting)
[Figure: nodes sampled at intervals of ~dt along computed paths between U and F; the edge from node i to node j is labeled with p_ij and t_ij]
Example: the Langevin dynamics equation of motion is
$F_{ext} - m\gamma \dfrac{dx}{dt} + R = 0$
where R is a Gaussian random force

Node Merging
If two nodes are closer than some ε, they are merged into one and merging rules are applied to update edge probabilities and times
[Figure: node 1 has edges (P12, t12) to node 2 and (P14, t14) to node 4; after nodes 2 and 4 are merged into node 2', node 1 has a single edge (P12', t12') to 2']
$P_{12'} = P_{12} + P_{14}$
$t_{12'} = P_{12} \times t_{12} + P_{14} \times t_{14}$
Result: approximately uniform distribution of nodes over the reachable subset of conformational space

Application: Computation of MFPT
Mean First Passage Time: the average time at which a protein first reaches its folded state
First-Step Analysis yields:
$MFPT(i) = \sum_j P_{ij} \times (t_{ij} + MFPT(j))$
$MFPT(i) = 0$ if $i \in F$
Assuming first-order kinetics, the probability that a protein has folded by time t is:
$P_f(t) = 1 - e^{-rt}$
where r is the folding rate
$MFPT = \int_0^{\infty} t\, \dfrac{dP_f}{dt}\, dt = 1/r$

Computational Test
• 12-residue tryptophan zipper beta hairpin (TZ2)
• Folding@Home used to generate trajectories (fully atomistic simulation) ranging from 10 to 450 ns
• 1750 trajectories (14 reaching folded state)
• 22,400-node roadmap
• MFPT ~ 2-9 μs, which is similar to experimental measurements (from fluorescence and IR)

Conformational Analysis of Protein Loops
J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran. Geometric Algorithms for the Conformational Analysis of Long Protein Loops. J. Comp.
Chemistry, 25:956-967, 2004
New idea: Explore the clash-free subset of the conformational space of a loop by building a tree-shaped roadmap
Kinematic model: φ-ψ angles on the backbone + χ_i torsional angles in side-chains

Amylosucrase (AS)
- Only enzyme in its family that acts on sucrose substrate
- The 17-residue loop (named loop 7) between Gly433 and Gly449 is believed to play a pivotal role

Roadmap Construction
• A tree-shaped roadmap is created from a start conformation qstart
• At each step of the roadmap construction, a conformation qrand of the loop is picked at random, and a new roadmap node is created by iteratively pulling toward it the existing node that is closest to qrand
• The pulling stops when one can't get closer to qrand or a clash is detected
[Figure sequence: the tree grows from qstart toward qrand in the conformational space C, which contains the subsets Cclosed and Cfree]

Computational Results
• Surprisingly, loop 7 can't move much
• Main bottleneck is residue Asp231
[Figure: positions of the Cα atom of the middle residue (Ser441)]

Computational Results
If residue Asp231 is “removed”, then loop 7's mobility increases dramatically. The Cα atom of Ser441 can be displaced by more than 9Å from its crystallographic position

Conclusion
Probabilistic roadmaps are a recent but promising tool for exploring conformational space and computing ensemble properties of molecular pathways
Current/future research:
• Better sampling strategies able to handle more complex molecular models (protein-protein binding)
• More work to include time information in roadmaps
• More thorough experimental validation to compare computed and measured quantitative properties
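
As a concrete illustration of how an ensemble property can be computed from a roadmap, the first-step analysis for pfold described in the Stochastic Roadmaps part of the talk can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary-based graph layout and the toy four-node roadmap are hypothetical, N_i is taken to be the number of neighbors of node i, and a simple Gauss-Seidel iteration stands in for whatever sparse linear solver one would use on a real roadmap.

```python
import math


def metropolis_transitions(neighbors, energy, kT=1.0):
    """Build transition probabilities P[i][j] with the Metropolis rule.

    neighbors: dict mapping node -> list of adjacent nodes (no self-loops assumed)
    energy:    dict mapping node -> energy of that conformation
    """
    P = {}
    for i, nbrs in neighbors.items():
        n_i = len(nbrs)  # assumption: N_i is the number of neighbors of node i
        row = {}
        for j in nbrs:
            dE = energy[j] - energy[i]
            # Uphill moves are damped by exp(-dE/kT); other moves get weight 1.
            row[j] = (math.exp(-dE / kT) if dE > 0 else 1.0) / n_i
        row[i] = 1.0 - sum(row.values())  # self-transition probability P_ii
        P[i] = row
    return P


def pfold_first_step(P, folded, unfolded, tol=1e-12, max_iter=100000):
    """Solve f_i = sum_j P_ij f_j with f = 1 on folded and f = 0 on unfolded.

    Uses Gauss-Seidel sweeps, which is adequate for small illustrative chains.
    """
    f = {i: (1.0 if i in folded else 0.0) for i in P}
    free = [i for i in P if i not in folded and i not in unfolded]
    for _ in range(max_iter):
        delta = 0.0
        for i in free:
            p_ii = P[i].get(i, 0.0)
            if p_ii >= 1.0:
                continue  # isolated node: no information flows in
            # Move the P_ii f_i term to the left-hand side and divide.
            new = sum(p * f[j] for j, p in P[i].items() if j != i) / (1.0 - p_ii)
            delta = max(delta, abs(new - f[i]))
            f[i] = new
        if delta < tol:
            break
    return f


if __name__ == "__main__":
    # Toy roadmap: a path 0 - 1 - 2 - 3 with a flat energy landscape.
    nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    E = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
    P = metropolis_transitions(nbrs, E)
    print(pfold_first_step(P, folded={3}, unfolded={0}))
```

With a flat energy landscape the toy chain reduces to a symmetric random walk, so the two interior nodes should come out near 1/3 and 2/3. Replacing the boundary condition by t = 0 outside a funnel and adding 1 per step would turn the same iteration into the escape-time computation.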