Conformation Networks: an Application to Protein Folding Zoltán Toroczkai Erzsébet Ravasz Center for Nonlinear Studies Gnana Gnanakaran (T-10) Theoretical Biology and Biophysics Los Alamos National Laboratory Center for Nonlinear Studies Proteins the most complex molecules in nature globular or fibrous basic functional units of a cell chains of amino acids (50 – 103) peptide bonds link the backbone Native state unique 3D structure (native physiological conditions) biological function fold in nanoseconds to minutes about 1000 known 3D structures: crystallography, NMR X-ray Myoglobin 153 Residues, Mol. Weight=17181 [D], 1260 Atoms Main function: primary oxygen storage and carrier in muscle tissue It contains a heme (iron-containing porphyrin ) group in the center. C34H32N4O4FeHO Protein conformations • defined by dihedral angles 2 angles with 2-3 local minima of the torsion energy Amino-acid N monomers about 10N different conformations Levinthal’s paradox • Anfinsen: thermodynamic hypothesis native state is at the global minimum of the free energy • Levinthal’s paradox, 1968 Epstain, Goldberger, & Anfinsen, Cold Harbor Symp. Quant. Biol. 28, 439 (1963) Levinthal, J. Chim. Phys. 65, 44-45 (1968) finding the native state by random sampling is not possible 40 monomer polypeptide 1013 conf/s 3 1019 years to sample all universe ~ 2 1010 years old nucleation folding pathways Wetlaufer, P.N.A.S. 70, 691 (1973) Free energy landscapes • Bryngelson & Wolynes, 1987 Bryngelson & Wolynes, P.N.A.S. 84, 7524 (1987) free energy landscape a random hetero-polymer typically does NOT fold Experiment: — random sequences — GLU, ARG, LEU — 80-100 amino-acids ~ 95% did not fold in a stable manner Davidson & Sauer, P.N.A.S. 91, 2146 (1994) Funnels • Leopold, Mortal & Onuchic, 1992 Leopold, Mortal & Onuchic, P.N.A.S. 89, 8721 (1992) Energy funnels Given any amino-acid sequence: can we tell if it is a good folder? experiments (X-ray, NMR) molecular dynamics simulations homology modeling many folding pathways Difficult and slow Molecular dynamics • State of the art Sanbonmatsu, Joseph & Tung, P.N.A.S. 102 15854 (2005) supercomputer (LANL) Ribosome in explicit solvent: – targeted MD – 2.64x106 atoms (2.5x105 + water) – Q machine, 768 processors – 260 days of simulation (event: 2 ns) distributed computing (Stanford, Folding@home) ~ 1016 times slower – more than 100,000 CPU’s – simulation of complete folding event » BBA5, 23-residue, implicit water » 10,000 CPU days/folding event (~1s) Shirts & Pande, Science 290, 1903 (2000) Snow, Nguyen, Pande, Gruebele, Nature 420,102 (2002) Configuration networks • Configuration networks Protein conformations dihedral angles have few preferred values Ramachandran & Sasisekharan, J.Mol.Biol. 7, 95 (1963) • Helix • Sheet • other Ramachandran map PDB structures NODE configuration LINK change of one degree of freedom (angle) refinement of angle values continuous case Why networks? • VERY LARGE: 100 monomers 10100 nodes. However: Generic features of folding are determined by STATISTICAL properties of the configuration network toolkit from network research captures the high dimensionality degree distribution average distance clustering degree correlations Albert & Barabási, Rev. Mod. Phys. 74, 67 (2002); Newman, SIAM Rev. 45, 167 (2003) faster algorithms to simulate folding events pre-screening synthetic proteins insights into misfolding A real example • The Protein Folding Network: F. Rao, A. Caflisch, J.Mol.Biol, 342, 299 (2004) beta3s: 20 monomers, antiparallel beta sheets MD simulation, implicit water 330K, equilibrium folded random coil NODE -- 8 letters / AA (local secondary struct) LINK -- 2ps transition Its native conformation has been studied by NMR experiments: De Alba et.al. Prot.Sci. 8, 854 (1999). Beta3s in aqueous solution forms a monomeric triple-stranded antiparallel beta sheet in equilibrium with the denaturated state. •Simulations @ 330K •The average folding time from denaturated state ~ 83ns •The average unfolding time ~83ns •Simulation time ~12.6s •Coordinates saved at every 20ps (5105 snapshots in 10s) •Secondary structures: H,G,I,E,B,T,S,- (-helix, 310 helix, -helix, extended, isolated -bridge, hydrogen-bonded turn, bend and unstructured). •The native state: -EEEESSEEEEEESSEEEE•There are approx. 818 1016 conformations. •Nodes: conformations, transitions: links. Scale-free network beta3s randomized Many real-world networks are scale free hubs Barabási & Albert, Science 286, 509, (1999); Many reasons co-authorship (=1 - 2.5) behind SF topology citations (=3) • • • sexual contacts (=3.4) movie actors (=2.3) Internet Why is the protein network scale free?(y=2.4) World Why does the randomized chain haveWide Web (=2.1/2.5) Genetic regulation (=1.3) similar degree distribution? Protein-protein interactions ( =2.4) Metabolic pathways (=2.2) Why is = - 2 ? Food webs (=1.1) Robot arm networks 12 02 n=0 22 021 11 n=1 0 1 2 01 21 00 20 020 10 010 00 n=2 22 01 02 10 11 12 20 21 000 n-dimensional hypercube 100 200 • Steric constraints? binomial degree distribution Homogeneous missing nodes missing links Swiss cheese A bead-chain model • Beads on a chain in 3D: robot arm model similar to C protein models rod-rod angle 3 positions around axis N=18; = 120 2212112212111122 Honeycutt & Thirumalai, Biopolymers 32, 695 (1992) N=6; = 90 Homogeneous network Another example: L = 7, = 75 , r = 0.25 “00100” state “00100” allowed state forbidden state Adding monomers not only increases the number of nodes in the network but also its dimensionality!! The combined effect is small-world. Shortcuts in Folding Space The “dilemma” HOMOGENEOUS • from studies of conformation networks SCALE FREE • from polypeptide MD simulations bead chain beta3s robot arm randomized version ? Gradient Networks Gradients of a scalar (temperature, concentration, potential, etc.) induce flows (heat, particles, currents, etc.). Naturally, gradients will induce flows on networks as well. Ex.: Load balancing in parallel computation and packet routing on the internet Y. Rabani, A. Sinclair and R. Wanka, Proc. 39th Symp. On Foundations of Computer Science (FOCS), 1998: “Local Divergence of Markov Chains and the Analysis of Iterative Load-balancing Schemes” References: Z. T. and K.E. Bassler, “Jamming is Limited in Scale-free Networks”, Nature, 428, 716 (2004) Z. T., B. Kozma, K.E. Bassler, N.W. Hengartner and G. Korniss “Gradient Networks”, http://www.arxiv.org/cond-mat/0408262 Setup: Let G=G(V,E) be an undirected graph, which we call the substrate network. The vertex set: V {x0 , x1 ,..., xN 1} {0,1,2,..., N 1} The edge set: E V V , e E , e xi x j (i, j ), xx E (no self - loops) A simple representation of E is via the Nx N adjacency (or incidence) matrix A 1 if (i, j ) E A( xi , x j ) aij 0 if (i, j ) E Let us consider a scalar field (1) {h} : V Set of nearest neighbor nodes on G of i : Si(1) Definition 1 The gradient h(i) of the field {h} in node i is a directed edge: h(i ) (i, (i )) (2) (1) Which points from i to that nearest neighbor Si {i} for G for which the increase in the scalar is the largest, i.e.,: (i) arg max (h j ) (3) jS i(1) {i} The weight associated with edge (i,) is given by: h(i) h hi If (i ) i then h(i ) (i, i ) 0(i ) . The self-loop 0(i ) is a loop through i with zero weight. Definition 2 The set F of directed gradient edges on G together with the vertex set V forms the gradient network: G G (V , F ) If (3) admits more than one solution, than the gradient in i is degenerate. In the following we will only consider scalar fields with non-degenerate gradients. This means: Prob.{hi h j if (i, j ) E} 0 Theorem 1 Proof: Non-degenerate gradient networks form forests. Theorem 2 The number of trees in this forest = number of local maxima of {h} on G. 0.48 0.82 0.67 0.65 0.46 0.6 0.53 0.44 0.5 0.22 0.2 0.65 0.1 0.19 0.16 0.87 0.15 0.32 0.14 0.2 0.2 0.18 0.44 0.43 0.67 0.7 0.05 0.15 0.24 0.16 0.65 0.13 0.55 0.05 0.65 0.8 For Erdős - Rényi random graph substrates with i.i.d random numbers as scalars, the in-degree distribution is: In the limit p 0, N , z Np const. , z 1 : 1 RN (l ) , 1 l z Np , zl The Configuration model A. Clauset, C. Moore, Z.T., E. Lopez, to be published. Generating functions: g ( z ) ki z k i xg ( x) R( z ) dx g 1 (1 z ) g (1) 0 1 K-th Power of a Ring 4 3 9 K 4 K 2 2 Kl , 1 l K 1 (2 K l )( 2 K l 1)( 2 K l 2)( 2 K l 3) 6 2 7K 7K 2 , lK 3K (3K 1)(3K 2)(3K 3) R ( 2 K ) (l ) 42 K 1 , K 1 l 2K 1 (2 K l 1)( 2 K l 2)( 2 K l 3) 1 , l 2K 4 K 1 Power law with exponent =- 3 2K+l The energy landscape • Energy associated with each node (configuration) the gradient network most favorable transitions T=0 backbone of the flow MD simulation tracks the flow network biased walk close to the gradient network trees basins of local minima What generates = - 2 ? The REM generates an exponent of -1. Model ingredients • A network model of configuration spaces network topology homogeneous degree correlations how to associate energies constrained (folded) small kconf lower energy k, E increases loose (random coil) large kconf higher energy Random geometric graph • random geometric graph R=0.113, <k>=20 Dall & Christensen, Phys.Rev.E 66, 026121 (2002) • Energy proportional to connectivity E in higher D: similar to hypercube with holes degree correlations k N=30000, <k> = 1000, d=2. Exponent is - 2 2 essential ingredients: 1) 2) k1-k2 correlations <E> with k monotonic Bead-chain model • more realistic model: bead-chain configuration network excluded volume energy: Lennard-Jones Lennard-Jones potential Repulsive Attractive L = 30, = 75 The case of the -helix AKA peptide • ALA: orange • LYS: blue • TYR: green MD simulations, no water. The MD traced network T = 400 More than one simulation 3 different runs: yellow, red and green The role of temperature Conclusions • A network approach was introduced to study sterically constrained conformations of ball-chain like objects. • This networks approach is based on the “statistical dogma” stating that generic features must be the result of statistical properties of the networks and should not depend on details. • Protein conformation dynamics happens in high dimensional spaces that are not adequately described by simplistic reaction coordinates. • The dynamics performs a locally biased sampling of the full conformational network. For low enough temperatures the sampled network is a gradient graph which is typically a scale-free structure. • The -2 degree exponent appears at and bellow the temperature where the basins of the local energy minima become kinetically disconnected. • Understanding the protein folding network has the potential of leading to faster simulation algorithms towards closing the gap between nature’s speed and ours. Coming up: conditions on side chain distributions for the existence of funneled energy landscapes.