What parsing algorithms can tell us about protein folding
Julia Hockenmaier
Computer Science, UIUC
http://www.cs.uiuc.edu/~juliahmr
juliahmr@cs.uiuc.edu

Two unrelated facts
• People understand language.
• Proteins fold into unique 3D structures.
Natural language understanding and protein folding are both really hard for computers.

The need for protein folding and structure prediction
Designing new drugs (which bind to proteins)
Understanding misfolding diseases (Alzheimer's, etc.)

The need for natural language understanding
Information extraction (news, scientific papers)
Machine translation
Dialog systems (phone, robots)
[Figure: a passage of text in symbols the reader cannot decode]

Parsing: a necessary first step
[Figure: the same passage of undecodable symbols]
• What are these symbols?
• How do they fit together?

I eat sushi with tuna.
I eat sushi with chopsticks.
• Language is ambiguous.
• What is the most likely structure for a given sentence?

Dependency graphs describe sentence structures
I eat sushi with tuna.
I eat sushi with chopsticks.
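The two example sentences differ only in where "with ..." attaches. As a rough illustration, a dependency graph can be stored as a list of head-to-dependent arcs. The arcs below are the conventional readings of the two sentences and are an assumption made for this sketch, as are the names Arc and heads; the slide's own figure is not reproduced here.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int]  # (head index, dependent index), 0-based word positions


def heads(words: List[str], arcs: List[Arc]) -> Dict[str, str]:
    """Map each dependent word to the word that governs it."""
    return {words[d]: words[h] for h, d in arcs}


tuna = ["I", "eat", "sushi", "with", "tuna"]
chopsticks = ["I", "eat", "sushi", "with", "chopsticks"]

# Assumed reading: "with tuna" modifies the noun "sushi".
tuna_arcs = [(1, 0), (1, 2), (2, 3), (3, 4)]
# Assumed reading: "with chopsticks" modifies the verb "eat".
chopsticks_arcs = [(1, 0), (1, 2), (1, 3), (3, 4)]

print(heads(tuna, tuna_arcs)["with"])              # sushi
print(heads(chopsticks, chopsticks_arcs)["with"])  # eat
```

The word sequences are nearly identical, but one arc differs. That single arc is the ambiguity a parser has to resolve.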
Proteins are amino acid sequences
[Figure: amino acids linked along a backbone, each carrying a side chain (R)]
The 20 amino acids differ only in their side chains (hydrophobic or polar).

Proteins fold into a unique lowest-energy structure (native state)
Folded structures are stabilized by side chain contacts.

Contact graphs describe protein structures
α-Helix: fast folding
β-Sheet: slow folding

The Levinthal paradox: Folding is a search problem (C. Levinthal 1968)
A protein with 150 amino acids has 10^300 possible structures.
How can it find its native state?

Two unrelated search problems
• Natural language parsing: Find the grammatical structure of a sentence.
• Protein folding: Find the folded structure of a protein chain.

Two similar search problems
Find the optimal structure of a sequence.
• The structure is determined by the sequence.
• The number of possible structures is exponential.

Solving both search problems
Search Algorithm
Structural Representation
Scoring Function

Natural Language Parsing

Solving the parsing problem
Search Algorithm
Structural Representation
Scoring Function

Grammars for natural language parsing
• A grammar is a description of the syntax of a particular language.
• There are many different grammar formalisms (programming languages for grammars).

Context-free grammar
S → NP VP
VP → V NP
VP → VP PP
NP → NP PP
PP → P NP
NP → we
NP → sushi
V → eat
P → with
[Figure: two parse trees. "eat sushi with tuna": [VP [V eat] [NP [NP sushi] [PP [P with] [NP tuna]]]]. "eat sushi with chopsticks": [VP [VP [V eat] [NP sushi]] [PP [P with] [NP chopsticks]]].]

Statistical parsing
• We want the most likely parse of a sentence:
  $\operatorname*{argmax}_{\tau} P(\tau \mid s) = \operatorname*{argmax}_{\tau} \frac{P(\tau, s)}{P(s)} = \operatorname*{argmax}_{\tau} P(\tau, s)$
• We use machine learning to estimate $P(\tau, s)$.

The CKY parsing algorithm (Younger '67, Kasami '65)
S → NP VP
VP → V NP
V → eat
NP → we
NP → sushi
Input: We eat sushi
[Figure: the chart is filled step by step: NP (We), V (eat), NP (sushi), then VP (eat sushi)]
The CKY parsing algorithm (Younger '67, Kasami '65)
[Figure: the completed chart for "We eat sushi": NP (We), V (eat), NP (sushi), VP (eat sushi), S (We eat sushi)]

Protein Folding

Solving the folding problem
Search Algorithm
Structural Representation
Scoring Function

Real proteins are difficult to simulate
Blue Gene's biggest success: a protein with 21 amino acids.
Folding@home's biggest success: a protein with 36 amino acids.
We need a simple model system that captures the essential properties of proteins.

Protein chains are self-avoiding walks (SAWs)
• Proteins are connected chains. Sequence-adjacent amino acids are also physically adjacent.
• Proteins occupy space. The space occupied by one amino acid can't be occupied by another one.

The HP model: The simplest protein model
• Two kinds of amino acids: hydrophobic (H) and polar (P).
• Proteins are SAWs on a 2D square lattice. Sequence-adjacent amino acids are up, down, left or right.
[Figure: an example HP chain on the lattice, residues labeled H and P]
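The two SAW conditions above (connected chain, no shared lattice site) are easy to check in code. The sketch below assumes a conformation is written as a string of moves U, D, L, R, one per bond. That encoding and the function names are choices made for this example, not something the slides specify.

```python
from typing import List, Tuple

# One unit step per bond on the 2D square lattice (assumed encoding).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def walk_to_coords(moves: str) -> List[Tuple[int, int]]:
    """Place residue 0 at the origin and follow one lattice move per bond."""
    x, y = 0, 0
    coords = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def is_self_avoiding(coords: List[Tuple[int, int]]) -> bool:
    """A walk is self-avoiding if no lattice site is visited twice."""
    return len(set(coords)) == len(coords)


print(is_self_avoiding(walk_to_coords("RUL")))   # True
print(is_self_avoiding(walk_to_coords("RULD")))  # False: the walk returns to the origin
```

Connectivity holds by construction, because each move places the next residue on a neighboring site. Only the occupancy condition needs an explicit test.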
Folding = energy minimization
• Every structure has an energy F.
• Physical systems want to be in the state with the lowest energy.
• Proteins have a unique lowest-energy state, their "native state".
• This is why they fold.

The energy landscape is funnel-shaped
Folding = downhill moves in the landscape
(Fig.: Dill and Chan '97)

Folding is driven by the hydrophobic effect
Folded proteins have a hydrophobic core:
- Proteins are surrounded by water.
- Minimizing contact between the hydrophobic side chains and the water is energetically favorable.
Hydrophobic contacts are favorable.

Where do we get the energy function from?
• Physics-based (molecular dynamics, etc.)
• Statistics-based: learn it from databases of known protein structures.
• Simplified models: we need to define it ourselves.

The energy function in the HP model
• Contact energies: Every HH contact contributes -1.
• We only consider HP sequences with a unique native state.
• Folding is still NP-hard.
[Figure: an HP chain folded on the lattice, residues labeled H and P]

Structure prediction vs. protein folding
• We want to understand the folding process.
• For this, it is not sufficient to just predict the structure.

Zipping: Structure grows by adding local contacts (Fiebig & Dill '93)

Hierarchical folding: Zipping and Assembly
• Zipping is local structure growth.
• Folding also requires assembly of independent local structures.
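The HP energy function stated earlier ("every HH contact contributes -1") is the scoring function that any search over structures has to evaluate. A minimal sketch follows, assuming that a contact means two H residues on neighboring lattice sites that are not adjacent in the sequence. That is the usual convention for lattice models, but it is an assumption here, as is the name hp_energy.

```python
from typing import List, Tuple


def hp_energy(sequence: str, coords: List[Tuple[int, int]]) -> int:
    """Return -1 per HH contact.

    Assumed convention: a contact is a pair of H residues on neighboring
    lattice sites that are not neighbors along the chain.
    """
    assert len(sequence) == len(coords)
    position = {xy: i for i, xy in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each contact is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


# A 4-residue chain folded into a unit square: residues 0 and 3 touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))  # -1
print(hp_energy("HPPP", square))  # 0
```

A structure that buries more H residues next to each other gets a lower energy, which is how the model captures the hydrophobic effect.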
If folding is hierarchical...
...folding routes are trees

Evidence for hierarchical folding
Proteins have recursive domains.
Some fragments fold faster than the whole chain.
Some fragments fold by themselves.
(Fig.: G. Rose, 1979)

Implementing hierarchical search
A parsing-based search strategy: The CKY algorithm searches all binary trees defined by a context-free grammar.
Can we use the same search strategy?

A new folding algorithm (Hockenmaier, Joshi, Dill '07)
1. Split the chain into fragments.
2. Enter their structures into the chart.
3. Combine small structures (like a jigsaw puzzle).
4. Keep only the lowest-energy structures.
5. The top cell contains the folded structure.

Extracting folding routes

Charts as energy landscapes
[Figure: chart cells shaded by energy, on a scale from 0 to -4]

The chart landscape determines the amount of search
[Figure: three chart landscapes labeled Fast, Medium, Slow]

Folding rates and native state topology
[Figure, two panels. "Real proteins (Plaxco et al. '98)": folding rate k (log) against Relative Contact Order (%). "HP proteins with our algorithm": log10(k) against Native CO.]

How well does this work?
• CKY is not guaranteed to find the native state, but:
  - 24,900 HP sequences of length 20 have a unique native state.
  - Each sequence has 42,000,000 states.
  - CKY finds the native state for 96.7% of them.
• Folding speed is correlated with contact order.

But...
... Proteins don't use dynamic programming. They misfold, unfold, refold...
... Can we predict the collective behavior of an ensemble of protein molecules?

Modeling the folding process
• We assume folding is hierarchical.
• We model the process as a Markov chain.
• We use CKY to construct this chain for each protein.

Folding as a Markov chain
[Figure: two states q_i and q_j joined by a transition labeled r_ij]
• The protein can be in (and move between) a finite set of states.
• A Markov chain defines the probability that the protein is in each state at time t.

An example of hierarchical folding:
[Figure: a chain with residues 1, 4, 5, 8 marked, and the states {}, {(1,4)}, {(5,8)}, {(1,4),(5,8)}, {(1,4),(1,8),(5,8)}]

This is not allowed:
[Figure: the same diagram with the additional states {(1,8)} and {(1,8),(5,8)}]

Trees define states and folding steps
Folding: from siblings to the parent.
Unfolding: from the parent to children.
[Figure: a folding tree with the Folding and Unfolding directions marked]
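To make "the probability that the protein is in each state at time t" concrete, here is a small sketch that pushes a probability distribution through a discrete-time Markov chain. The three states and every number in the transition matrix are hypothetical. In the talk the chain is built from the parse chart, which this sketch does not attempt.

```python
from typing import List

Matrix = List[List[float]]


def step(p: List[float], T: Matrix) -> List[float]:
    """One time step: new_p[j] = sum over i of p[i] * T[i][j]."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]


def distribution_at(p0: List[float], T: Matrix, t: int) -> List[float]:
    """Distribution over states after t steps, starting from p0."""
    p = list(p0)
    for _ in range(t):
        p = step(p, T)
    return p


# Hypothetical 3-state chain: unfolded, intermediate, native.
# Each row holds the transition probabilities out of one state and sums to 1.
T = [
    [0.90, 0.10, 0.00],
    [0.05, 0.75, 0.20],
    [0.00, 0.01, 0.99],
]
p0 = [1.0, 0.0, 0.0]  # every molecule starts unfolded

for t in (0, 10, 100):
    print(t, [round(x, 3) for x in distribution_at(p0, T, t)])
```

The distribution describes an ensemble: at each time it gives the fraction of molecules expected in each state, including molecules that have moved backwards (unfolded) along the way.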
Calculating folding rates
Folding rates depend on the difference in energy between children and the parent node. We can estimate this difference.

Our folding algorithm
• Use "CKY" to find the native state.
• Construct the Markov chain from the parse chart.
• Let the protein fold! (Calculate the probability of where it is at time 0, 10, 100, ...)

Our test sequence
• 16mer: helix and hairpin
• The Markov chain has 193 states.

How the protein folds:
[Figure: Energy and Probability plotted against Time (log scale)]

Evidence against hierarchical folding
For some proteins, the ends come together early during folding (Maity et al., '05).

What experiments would see
[Figure: the Native and Trap structures of the 16mer (residues numbered 1 to 16), with Probability plotted against Time (log scale)]
As a line plot:
[Figure: Probability (0 to 1) against Time (0 to 3 x 10^5), with curves labeled 'Helix', 'Hairpin', 'End-to-End', '(3,6)', '(2,5)', 'Trap' and 'Native']

Experimental observations that "the ends come together early" are not evidence against hierarchical folding: macroscopic observations don't always correspond to microscopic behavior.

Challenges
• Can these algorithms be applied to realistic representations of proteins?
• Can we define coarse-grained representations of real proteins (and energy functions) that don't require supercomputers?

To conclude...
Search Algorithm
Structural Representation
Scoring Function
For a computer scientist, protein folding and parsing pose similar research questions and require similar techniques.

Thank you!
http://www.cs.uiuc.edu/~juliahmr
My collaborators/mentors:
Ken A. Dill, UC San Francisco
Aravind Joshi, U. of Pennsylvania
Our funder: National Science Foundation