Canada-China Industrial Workshop, 2005
Hong Kong Baptist University

The Inverse Protein Folding Problem*
Arvind Gupta, Simon Fraser University
May 24, 2005
*Joint work with J. Manuch, C. Mead, L. Stacho, B. Bhattacharyya, X. Huang

Outline
• Background
• Forces in Protein Folding
• Hydrophobic-Polar Model
• Protein Databank
• Determining Attributes of the Ideal Lattice
• Future Steps

DNA
• Genetic code
• A “string” of nucleotides over A, C, G, T
• Code for all proteins
• Self-replicating

Proteins
• A “string” over 20 amino acids
• In solvent, a protein folds into a unique 3D spatial structure with minimal energy

Protein Structure
• Structure determines protein function.
• Proteins are normally in an aqueous environment.
• Proteins are globular.

Proteins in the body
• Proteins are involved in all processes in the body.
[Figures: Insulin; Hemoglobin]

Proteins and diseases
M. Thorpe, Protein Folding, HIV and Drug Design, Physics and Technology Forefronts (2003).

Forward Protein Folding Problem
• Identify the protein structure for a specific amino acid sequence (e.g., MAGWTRLS..).
• Central open problem in biology
• NP-hard under most models

Inverse Protein Folding Problem
• Given a structure (or a functionality), identify an amino acid sequence whose fold will be that structure (exhibit that functionality).
• Crucial problem in drug design.
• NP-hard under most models.

Forces acting on Proteins
• Hydrogen bonding
• Van der Waals interactions
• Ion pairing
• Disulfide bonds
• Intrinsic properties (conformational preference)
Hydro (water) + philic (loving) / phobic (fearing)
• Hydrophobicity: the dominant force in protein folding (Dill, 1990)

Hydrophobic Interactions
• Each amino acid can be classified as either hydrophobic or hydrophilic (polar).
• Hydrophobic [polar] amino acids are in a higher [lower] energy state in an aqueous environment.

Hydrophobic-Polar (HP) Model
• Introduced by Dill (1985) and Chan (1985)
• “0” for polar; “1” for hydrophobic
• Protein sequence embedded on a lattice
• Each amino acid in exactly one cell
• Interactions across adjacent cells
• Empty lattice cells contain water
• Given a protein, maximize the number of hydrophobic interactions (native fold).
• I.e., given a 0-1 string, embed it onto a lattice, maximizing the number of adjacent 1's.

The 2-D Square Lattice
[Figure legend: hydrophobic “1”; polar “0”; peptide bond; hydrophobic interaction. Example fold of a protein.]

Inverse protein folding
• Problem: For a given shape, find a protein (amino acid string) with a native fold approximating the shape.
• Example. [Figure]

Constructible structures
Theorem: For any constructible structure S, there exists a protein p(S) with a native fold exactly filling the structure S.
• Proof by induction:
  – Base case: p(S)=010010010010
  – Inductive case: [Figure]
• Folds are saturated: every hydrophobic “1” is involved in two hydrophobic interactions.
  – Saturated implies native.

Stability of proteins
• A protein is stable if it has a unique “native fold” (fold with minimal energy).
• Most natural proteins are stable.
• The protein in our example is not stable: altogether it has 82 native folds!

Conjecture: For any constructible structure S, the protein p(S) is stable.
• Tested for >20,000 constructible structures.
• Mathematically proved for two simple infinite classes of constructible structures, L0 and L1.
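
To make the HP objective and the notion of a native fold concrete, here is a minimal brute-force sketch in Python. It is an illustration under stated assumptions, not the authors' method: the names `contacts` and `best_folds` are hypothetical, a hydrophobic interaction is taken to be a pair of 1s on adjacent lattice cells that are not consecutive in the chain, and only rotations are factored out (by fixing the first step), so the number of optimal folds it reports may follow a different counting convention than the slides. Exhaustive search is only practical for short strings.

```python
# Brute-force HP folding on the 2-D square lattice (illustrative sketch).
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def contacts(seq, coords):
    """Count hydrophobic interactions of a fold.

    Assumption: an interaction is a pair of '1's on adjacent cells
    whose chain positions differ by more than one.
    """
    index_at = {cell: i for i, cell in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if seq[i] != "1":
            continue
        # Look only right and up so each unordered pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                total += 1
    return total


def best_folds(seq):
    """Return (max interactions, list of optimal folds) for len(seq) >= 2.

    Enumerates all self-avoiding walks whose first step goes east.
    """
    best, folds = -1, []
    path = [(0, 0), (1, 0)]
    occupied = set(path)

    def extend():
        nonlocal best, folds
        if len(path) == len(seq):
            score = contacts(seq, path)
            if score > best:
                best, folds = score, [list(path)]
            elif score == best:
                folds.append(list(path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                extend()
                path.pop()
                occupied.remove(nxt)

    extend()
    return best, folds
```

In the base-case string 010010010010, none of the four 1s is a terminal, so each can take part in at most two interactions and no fold can have more than four. Calling `best_folds("010010010010")` is a quick way to check that a saturated fold exists for this small instance.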
[Figures: an L0 structure; an L1 structure]

Boundary squares
• Diagonal frame: the smallest diagonal rectangle containing all hydrophobic “1”s.
• Boundary square: a hydrophobic “1” lying on the border of the diagonal frame.
[Figure: a fold with 5 boundary squares]
• Useful for finding the last tile of a constructible structure.
• A saturated fold has at least 4 of them.

Lemma. Let p=0{0,1}*0 be a protein string containing none of 11, 000 and 10101 as a substring. For every saturated fold of p, each boundary square not adjacent to a terminal is the main square of a corner-closed core.

Proof for L0 structures
• Take a saturated fold of p(S), where S is in L0.
• It has at least 4 boundary squares, and at least 2 of them are not adjacent to a terminal (the first or the last amino acid).
• By the Lemma, each is contained in a corner-closed core, i.e., is a highlighted (red) 1 of the substring 1001001 of the protein string.
• In p(S)=0(10010)^n(01001)^n0, there are only two occurrences of the substring 1001001, and they overlap.
• Hence, the cores match each other and form a fully-closed core (closed on 3 sides): the last tile.
• Cut the last tile and apply induction.

L1 structures are more complex
• p(S)=0(10010)^n 010 (10010)^m (01001)^m 01 (01001)^(n-1) 0
• p(S) contains one occurrence of the substring 10101 (so the Lemma cannot be applied directly) and three occurrences of 1001001 (two corner-closed cores do not imply a fully-closed core).
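
The substring counts used in the two arguments above can be checked mechanically. The following Python sketch builds the strings p(S) from the formulas on the slides and counts overlapping occurrences; the helper names `l0_string`, `l1_string` and `count_overlapping` are hypothetical, and the L1 counts are checked here only for n ≥ 2, which is an assumption about the intended parameter range.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))


def l0_string(n):
    """p(S) = 0 (10010)^n (01001)^n 0 for an L0 structure."""
    return "0" + "10010" * n + "01001" * n + "0"


def l1_string(n, m):
    """p(S) = 0 (10010)^n 010 (10010)^m (01001)^m 01 (01001)^(n-1) 0."""
    return ("0" + "10010" * n + "010" + "10010" * m
            + "01001" * m + "01" + "01001" * (n - 1) + "0")


if __name__ == "__main__":
    for n in range(1, 4):
        p = l0_string(n)
        print("L0", n, count_overlapping(p, "1001001"),
              count_overlapping(p, "10101"))
    for n in range(2, 4):
        for m in range(1, 3):
            p = l1_string(n, m)
            print("L1", n, m, count_overlapping(p, "1001001"),
                  count_overlapping(p, "10101"))
```

Note that `l0_string(1)` reproduces the base-case string 010010010010 from the constructible-structures theorem, and that the L0 strings satisfy the hypotheses of the Lemma (no 11, 000 or 10101).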

Choosing a Lattice
• 2D is easier
  – Fewer options for combinatorial case analysis
  – More visually intuitive
  – Torsion angles describe protein mainchain
• 3D is more relevant
  – More biologically relevant
  – More representative of actual protein structures
  – Directly applicable to known protein structures

Protein Data Bank (PDB)
• Worldwide repository for 3-D biological macromolecular structure data
• Contains 30857 known protein structures (May 17, 2005)
• Structures derived using different techniques
  – Nuclear Magnetic Resonance spectroscopy
  – X-ray crystallography
• PDB ‘known structures’ are really models of the structure of a protein

Determining Ideal Lattice Attributes
1. Should all edges of the lattice be identical in length?
2. How should distances between nonadjacent lattice points behave?
3. What angles should the lattice have?
4. How regular should the lattice be?
Use PDB statistics to answer these questions.

Assemble a Set of Proteins
Create a subset of good-quality protein structures from the PDB:
a) Protein structures generated using X-ray diffraction
b) High-resolution structures (<= 1.75 Å)
c) Model fits the experimental data well
Result: 3704 protein structures in the subset

Q1: Uniform Edge Length?
[Figure: overall distribution of consecutive residue distance]
The distance between consecutive residues is consistently about 3.8 Å.
Answer to Question 1: All edge lengths should be uniform, with length 3.8 Å.

Q2: Non-adjacent Vertex Distances?
[Figure: overall distribution of non-consecutive residue distance]
• Minimum distance: 3.06 Å
• Only 10 distances < 3.5 Å
• 1813 distances < 3.8 Å (out of 426 billion pairs)
Answer to Question 2: Non-adjacent vertices should be at least 3.8 Å apart.

Q3: Lattice Angles?
[Figures: one amino acid; amino acid chain]
Overall distribution of Cα angles:
• Calculate Cα angles: the angle produced by three consecutive Cα atoms
• Group results by the residue type of the middle amino acid
Bimodal distribution:
• Sharp peak at 90°
• Shallow peak at 120°
Some differences appear for Cα angles around certain amino acids.
[Figure: Proline, Phenylalanine, Aspartic acid]

Q4: Lattice Regularity?
• Determine the average coordinate root mean square deviation (c-RMS) between the original PDB structure and lattice-approximated structures (over the entire 3704-protein PDB subset):

  c-RMS = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} |a_i - b_i|^2 }

  a_i = coordinates of the lattice vertex corresponding to b_i
  b_i = coordinates of the residue in the protein X-ray structure
• Periodic lattices: Cubic and Face-Centered-Cubic (FCC)
• Randomized lattices: Shift each vertex in the periodic lattices by a random value from a normal (0, 0.0025) distribution, preserving edges
• De novo random lattices: Generate random nodes and edges, maintaining the average degree and edge length of the periodic lattices
• Average c-RMS values generally increase as the randomization of the lattices increases:

  Average c-RMS
  Lattice model   Degree   Periodic lattice   Randomized periodic lattice   De novo random lattice
  FCC             12       1.82               1.967                         4.85
  Cubic           6        3.11               3.21                          3.96

Answer to Question 4: Periodic lattices achieve a better approximation of protein structure than random lattices of the same degree.

Results: Ideal Lattice Attributes
• Uniform edge lengths of 3.8 Å
• Minimum distance between any two vertices of 3.8 Å
• Supporting mainly 90° and 120° angles
• Periodic in structure

Candidate lattices (space-filling)
cubic, hex.
prism, truncated tetrahedron, cuboctahedron, truncated octahedron

Candidate lattices (vector-based)
Face-centered cubic (FCC), Side+FCC (S+FCC), Extended FCC (e-FCC)

RMS comparison of lattices
  Lattice                 c-RMS    d-RMS    a-RMS
  Truncated Octahedron    5.3053   3.2479   13.0982
  Hexagonal Prism         3.8704   2.4312   10.0313
  Truncated Tetrahedron   3.6913   2.4133   19.9030
  Simple Cubic            3.1123   2.1081   21.1005
  Cuboctahedron           2.5581   1.7427    8.3526
  FCC                     1.8212   1.4369    8.3346
  S+FCC                   2.1791   1.5819    6.2022
  e-FCC                   1.5385   1.1048    2.5700

Angle comparison of lattices
  Lattice                 Degree   Closeness to 90   Closeness to 120
  Trunc. octahedron       4        20                10
  Hexagonal prism         5        18                24
  Trunc. tetrahedron      6        42                36
  Cubic                   6        18                36
  Cuboctahedron           8        30                34.29
  FCC                     12       30                32.73
  S+FCC                   18       28.82             36.47
  e-FCC                   42       31.40             38.72

Future
1. Investigate candidate lattices to determine an ideal lattice for inverse protein folding
2. Mathematically prove that the ideal lattice can generate stable sequences for specified protein shapes within the HP model
3. Attempt to assign specific amino acids to lattice sites
4. Investigate protein sequences generated by the model for stability and folding properties
5. Incorporate other protein folding forces
  – Hydrogen bonding
  – Van der Waals interactions
  – Intrinsic properties (conformational preference)
  – Ion pairing
  – Disulfide bonds

Questions?
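
Supplementary sketch: the two geometric measurements behind the lattice questions, the Cα angle (Q3) and the c-RMS deviation (Q4), can be written in a few lines of Python. This is a minimal illustration under assumptions not stated on the slides: the function names `ca_angle` and `c_rms` are hypothetical, coordinates are plain (x, y, z) tuples in Å, and the lattice vertices are assumed to be already matched one-to-one with the residues and expressed in the same coordinate frame.

```python
import math


def ca_angle(p, q, r):
    """Angle in degrees at q formed by three consecutive points p, q, r."""
    u = [pi - qi for pi, qi in zip(p, q)]
    v = [ri - qi for ri, qi in zip(r, q)]
    cos_theta = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def c_rms(lattice_pts, residue_pts):
    """Root mean square of |a_i - b_i| over corresponding points.

    Assumption: lattice_pts[i] is the lattice vertex assigned to
    residue_pts[i]; no superposition or fitting is performed here.
    """
    if not lattice_pts or len(lattice_pts) != len(residue_pts):
        raise ValueError("need two non-empty point lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(lattice_pts, residue_pts))
    return math.sqrt(squared / len(lattice_pts))
```

For example, three consecutive vertices along two perpendicular edges of a cubic lattice give `ca_angle((1, 0, 0), (0, 0, 0), (0, 1, 0))` equal to 90 degrees, the position of the sharp peak reported for Q3.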