NEW APPROACHES TO PROTEIN STRUCTURE PREDICTION AND DESIGN
Joe DeBartolo

An overview of my thesis
[Diagram: amino acid sequence and native protein structure, linked by structure prediction (sequence to structure) and protein design (structure to sequence)]

Why do prediction and design matter?
• Structure prediction. The growth of sequence data outpaces experimental characterization. Knowing a protein's structure provides insight into its function and interactions.
• Protein design. Understanding design principles can allow the creation of new proteins with therapeutic and industrial applications.

Protein structure prediction and design
PART I    ItFix: Homology-free structure prediction
PART II   SPEED: ItFix enhanced with evolution
PART III  Future directions in prediction
PART IV   Protein design

Protein structure prediction
1° structure: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLR
[Figure panels: local 2° structure; 2° and 3° structure; topology diagram; 3D model; residue-residue contact map]

The Challenge: Distill the folding problem down to the basic principles, code them into an algorithm, and predict pathways and structure without using homology.
[Diagram: amino acid sequence (…LEKVQLN…) and native structure]

Capturing the interrelated forces of protein structure
Local structures:
• Ramachandran angles (φ, ψ)
• backbone hydrogen bonds
• local sterics
• solvation
• backbone entropy
Long range:
• sterics
• Van der Waals
• electrostatics
• hydrophobic effect

The overlapping features of local protein structure
[Figure: turn, β-strand, and α-helix compared by backbone Ramachandran torsion angles (φ, ψ maps from -180 to 180), backbone H-bonds, and sidechain patterning (labels: polar, amphipathic, apolar, mostly polar)]

Capturing the interrelated forces of protein structure
• Ramachandran angles (φ, ψ)
• backbone hydrogen bonds
• solvation
Long range effects (slide labels, in order):
• long-range hydrogen bonding
• sterics
• Van der Waals
• electrostatics
• 3° packing specificity of the chain
• hydrophobic effect
• surface residue placement: solvent exposed residues
• salt bridges and other favorable pairings
• apolar buried residues
• long-range hydrogen bonding: contacts that are highly separated in sequence

The structure prediction challenge: To integrate all of these features into an algorithm
Requirements:
• a way to sample conformations
• a way to evaluate conformations

Sample Ramachandran space
• Rama angle pairs (φ, ψ) describe the entire conformation
• NO sidechain rotamer sampling
• exclude sidechains beyond Cβ
[Figure: Rama map of the PDB; axes φ and ψ, each from -180 to 180]

1° and 2° structure information refines the Rama search space
• start from the entire PDB
• add amino acid identity (1° structure)
• add neighbor identity (1° structure)
• add 2° structure identity
[Figure: Rama maps (φ, ψ from -180 to 180) for each step; panel labels: ALL-ALL-ALL, ALL-ASN-ALL, ALL-ASN-GLY, BETA-ALL-ALL]

The structure prediction challenge: To integrate all of these features into one algorithm
Requirements:
• a way to sample conformations
• a way to evaluate conformations

The DOPE statistical potential
Discrete Optimized Protein Energy: knowledge-based modeling of the energy of a conformation.

The DOPE atom pair energy:
Energy_PDB(r_ij) = -ln( Prob_PDB(r_ij) )
r_ij is the distance between atoms i and j. Statistics are collected from the PDB by amino acid (i, j) and atom type (i, j).
I have added to DOPE:
• orientation dependence
• 2° structure dependence
• eliminate local biases
Shen and Sali, Proteins (2007)
[Figure: DOPE and DOPE-PW energy versus distance (Å) for GLU-Cβ - GLU-Cβ and LEU-Cβ - LEU-Cβ pairs]

Capturing sidechain orientation in a sidechain-free model
• ρ1-2 is the angle between two vectors
• PW is built from the squared terms (ρ1-2 - 90)² and (ρ2-1 - 90)²
[Diagram: residue 1 and residue 2, each with Cα and Cβ; angles ρ1-2 and ρ2-1; labels: high, low (in-line)]
DeBartolo et al. PNAS 2009

DOPE-PW (uniquely) captures the hydrophobic effect
• hydrophobic residue pairs buried in the core have lower energy at smaller distances
• slide annotation: large distance preferred
[Figure: DOPE energy versus distance (Å) for GLU-Cβ - GLU-Cβ and LEU-Cβ - LEU-Cβ; potential orientations of high PW]

DOPE-PW captures the amphipathic nature of β-sheets
• polar and apolar residues prefer opposing sides of the β-sheet
[Figure: DOPE energy versus distance (Å) for GLU-Cβ - LYS-Cβ and GLU-Cβ - LEU-Cβ; potential orientations of low PW: same side of β-sheet, opposite side of β-sheet]

The challenge: To integrate all of these features into one algorithm
Requirements:
• a way to sample conformations
• a way to evaluate conformations

ItFix: Iterative Fixing to reduce the conformational search
Sampling library, round by round:
• Starting configuration: 1° structure only (no 2° structure restriction). Fold with (φ, ψ) from Library_Initial.
• Remove trimers of lowly populated 2° structure. Fold with (φ, ψ) from Library_Restricted 1.
• Remove trimers. Fold with (φ, ψ) from Library_Restricted 2.
• Repeat removal until no further fixing is possible.
• Final round: fold with (φ, ψ) from Library_Restricted final.
2° structure options: helix, strand, Not(Strand), Not(Helix), coil subtypes. Each time a 2° structure option is removed, the search space is restricted.
[Figure: Rama maps (φ, ψ from -180° to 180°) across rounds; state labels "U", "I1", "I2", "N"]
DeBartolo et al., PNAS 2009

Homology-free ItFix 2° and 3° structure prediction results
For each target, the 2° structure strings appear in slide order: Native, ItFix, SSPro, PSIPRED.

1af7, 2.5 Å
---HHHHHHHHHHHHHHH-----GGGHHHHHHHHHHHHHHHT---HHHHHHHHHH-TT-THHHHHHHH---HHHHHHHHHHHHHHHT-----S-HHHHHHHHHHHHHHHT-S--HHHHHHHHHT---HHHHHHHHH---HHHHHHHHHHHHHHHHHHE-TTHHHHHHHHHHHHHHHHT--HHHHHHHHHHT-TTHHHHHHHHHH---HHHHHHHHHHHHHHH-----HHHHHHHHHHHHHHHHHH----HHHHHHHHH----HHHHHHHHH--

1b72, 1.6 Å
-HHHHHHHHHHHTT-SS--HHHHHHHHHHHT--HHHHHHHHHHHHHHHH--HHHHHHHHHHHH-----HHHHHHHHHHHH--S-HHHHHHHHHHHHHH-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-HHHHEEHEHHHHHHH--HHHHHHHHHHHHH-----HHHHHHHHHHHHHHHHHHHHHHH-HHHH---

1csp, 6.0 Å
-EEEEEEEEETTTTEEEEE-TTS--EEEEGGGB-SSSS----TT-EEEEEEEEETTEEEEEEEEE--EEEEEEEE-STTTEEEEEEET-T-EEEEEEE--SSS-----TS--EEEEEEES--S----EEEEE--TEEEEEE-TTTTEEEE--TT--EEEEEEEHEETTT--E--TT-EEEEEEEE-TT--E-EE------EEEEEEEE----EEEEE-----EEEEEEE--------------EEEEEEEE-----EEEEEE---

1tif, 4.2 Å
--BGGG---SEEEEE-TTS-EEEEEEHHHHHHHHHHTT-EEEEEETTSSS-EEEEE-EEE-SSSSEEEEEE-TTS-EEEEEEHHHHHHHHHHHT--EEEE-TTSSS-EEEEE--BBTEEE-EEEEEEETTT-EEEEE-HHHHHHHHHHHT--EEEE-TT----EEEE-----------EEEEE-----EEEEE-HHHHHHHHHH----EEEE-------EEEE--

1r69, 2.4 Å
-HHHHHHHHHHHTT--HHHHHHHHTS-HHHHHHHHTTS-SS-TTHHHHHHHTT--HHHHH-HHHHHHHHHHHHT--HHHHHHHHT--HHHHHHHHTT--SS----HHHHHHHT--HHHHH---HHHHHHHHHHHHHHHHHHHHHT-HHHHHHHHHTT-------HHHHHHHHHT--HHHH-HHHHHHHHHHH----HHHHHHHH---HHHHHHHH------HHHHHHHHHHH---HHHH--

1ubq, 3.1 Å
-EEEEEETTS-EEEEE--TTSBHHHHHHHHHHHH---GGGEEEEETTEE--TTSBTGGGT--TT-EEEEEE-EEEEEETTS-EEEEEE---S-B-HHHHHHHHHSS---SSEEEEETT----TT-B----------EEEEEE-EEEEEEETTEEEEEEE---SHHHHHHHHHHHTTT---T--E--ETT-E--TT-EEEEEE--TT-EEEEEE-EEEEEE----EEEEEE-----HHHHHHHHHHHH---HHHEEEEE--EE------HHH-------EEEEEE-

DeBartolo et al., PNAS 2009

Mimicking folding pathways
[Figure: 2° structure frequency (0 to 1) versus residue index (1 to 73) for Rounds 0, 1, 2, 3, 4, 6, and 9; structural elements β1, β2, helix, β3, β4, β5, and 3-10 helix. The major pathway (from experiment) runs from the unfolded state to the native state; labeled events include the β1-β2 hairpin, + β3, + helix, + β4, + 3-10 helix, and + β5.]
DeBartolo et al., PNAS 2009

Part I Conclusions
Challenge: Distill the folding problem down to the basic principles, code them into an algorithm, and predict pathways and structure without using homology.
What is novel about how we approached this challenge?
Use basic principles of protein structure and folding.
• Search strategies: mimic true folding behavior
  i) Coupled 2° & 3° structure formation
  ii) Iterative fixing to reduce the search
  iii) Outputs pathway information
• Energy functions: orientational and 2° structure dependence

Protein structure prediction and design
PART I    ItFix: Homology-free structure prediction
PART II   SPEED: ItFix enhanced with evolution
PART III  Future directions in prediction
PART IV   Protein design
[Cover image of Protein Science, March 2010]

SPEED: Structure Prediction Enhanced by Evolutionary Diversity
Increase φ, ψ diversity and accuracy
Inputs: target sequence, sequence database, multiple sequence alignment (~1000 sequences)
MQIFVKTLTGKTITLEV (target)
IEIKIRDIYSKTYKFMA
IEITCNDRLGKKVRVKC
MRLFIRSHLHDQVVISA
MKLSVKSPNGRIEIFNE
LQFFVRLLDGKSVTLTF
IEITLNDRLGKKIRVKC
IEIWVNDHLSHRERIKC
MDVFLMIRRQKTTIFDA
IIVTVNDRLGTKAQIPA
MRISVIKLDSTSFDVAV
MNVNFRTILGKTYTITV
MLLTVRDRSELTFSLQV
MQIFVTTPSENVFGLEV
MSLTIKF-GAKSIALSL
MKYRIRTISNDEAVIEL
…
[Figure: Rama maps (φ, ψ from -180° to 180°) for homology-free sampling and SPEED sampling]
Uses the sequence database: 10^7 sequences, growing fast. The PDB has only 10^4 structures, growing slowly.

ItFix-SPEED overview
• Homology-free: 1tif position 4, …AGTYEFRKAKIT…, trimer label INE
• SPEED: 1tif position 4, multiple sequence alignment, {IND, IGD, VGN, …}MSA
Procedure:
• Round 1 Rama distribution: fold 500x with E_radial
• ItFix: analyze 2° structure statistics
• 2° structure converged? If no, build the Round 2 Rama distribution and repeat. If yes, take the final 2° structure.
• Final Rama distribution: fold 10000x with E_radial or DOPE-PW (all α)
• Cluster and take the largest cluster
• Refine 100X each with DOPE-PW; reject ∆E_radial > 0
• Prediction: min <Energy>
[Figure: Rama maps (φ, ψ from -180° to 180°) for each round]
DeBartolo et al., Protein Sci. 2010

Assaying accuracy
Clustering predicts model accuracy and confidence
Steps: fold, ItFix predicted 2° structure, cluster, identify best cluster
[Figure: mean Cα-RMSD to native of cluster (Å) versus mean Cα-RMSD between models in cluster (Å); R² = 0.85; labeled points 1af7, 1b72, 1r69; panels: Global Accuracy, Local Accuracy]
(i.e. we know whether we got it right or wrong)

Performance in CASP8
[Figure: Global Distance Test plots, cut-off distance (Å) versus percentage of residues, with ItFix and RAPTOR curves]
Targets shown: T0482 (4.8 Å), T0405 D1 (6.4 Å), T0464 D1 (4.5 Å), T0429 D2 (6.8 Å)
Categories shown: free modeling; loop insertion modeling; template identification using folding (better template)
Aashish Adhikari
DeBartolo et al., Protein Sci. 2010

Part II Conclusions
• Adding evolutionary information to ItFix improves the accuracy of the conformational search
• Clustering permits global and local prediction of cluster accuracy and uncertainty
• SPEED is successful in the CASP8 experiment

Protein structure prediction and design
PART I    ItFix: Homology-free structure prediction
PART II   SPEED: ItFix enhanced with evolution
PART III  Future directions in prediction
PART IV   Protein design

Invert the structure prediction problem
1° structure: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLR
[Figure panels: local 2° structure; 2° and 3° structure; topology diagram; 3D model; 3D contacts]

Current designs are very similar to parent sequences

design       length  fold  wt % id (wt % sim)  top % id (top % sim)  top-wt % id (top-wt % sim)
protein L1   62      αβ    35 (61)             50 (62)               73 (86)
protein L2   62      αβ    45 (60)             45 (60)               73 (86)
ACP          98      αβ    41 (54)             39 (57)               67 (69)
PCP          70      αβ    31 (56)             33 (56)               73 (84)
S6           94      αβ    26 (43)             32 (46)               33 (52)
U1A          96      αβ    32 (57)             33 (57)               97 (100)
FKB          107     αβ    42 (59)             44 (62)               96 (96)
zinc-finger  28      αβ    21 (38)             N/A                   N/A
tenascin     89      β     42 (64)             42 (64)               100 (100)

Can we design a more unique protein sequence?
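The % id values in the design comparison are position-by-position comparisons of aligned sequences. A minimal sketch of such a calculation follows. It assumes the two sequences are already aligned to equal length, that "-" marks a gap, and that identity is counted only over columns where neither sequence is gapped; these conventions and the function name `percent_identity` are illustrative assumptions, not the scoring used for the table.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of aligned, non-gap columns in which the two residues match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences have a residue (assumed convention).
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Two 17-residue fragments that appear in the SPEED alignment slide.
target = "MQIFVKTLTGKTITLEV"
homolog = "MQIFVTTPSENVFGLEV"
print(f"{percent_identity(target, homolog):.1f}% identical")
```

Percent similarity (the values in parentheses) would additionally need a residue-similarity grouping or substitution matrix, which the slides do not specify.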
Design method
1. Restrict AA possibilities by burial in the native structure, for the hydrophobic effect
2. Find best sequences for maximum Rama propensity
3. Monte Carlo search of the statistical potential (DOPE-PW)
[Figure: 01010111; positional sequence library (MKLFVKTP…; LTVTIR, L, IV, R, E); DOPE-PW energy versus distance (Å) for GLU-Cβ - GLU-Cβ and LEU-Cβ - LEU-Cβ]

Hello Jello

Preliminary wetlab analysis
• 1ds0 expresses in inclusion bodies
• mutations enhance in vitro solubility
• further experiments needed
[Figure: CD versus wavelength (nm) for design, design-sol, and native]

Thesis defense Conclusions
• Homology-free structure prediction can provide accurate models by mimicking folding pathways
• Adding evolutionary information improves the accuracy of the conformational search
• Inverting our homology-free prediction method into a design algorithm aims to generate unique amino acid sequences

Acknowledgements
Prof. Tobin Sosnick, Prof. Karl Freed, Prof. Jinbo Xu
Glen Hocky, Andres Colubri, James Fitzgerald, Abhishek Jha, Esmael Haddadian, James Hinshaw, Aashish Adhikari, Jouko Virtanen, Chloe Antoniou, Josiah Zayner, Feng Zhao, Jian Peng, Grzegorz Gawlak, Srikanth Aravamuthan
Funding: NIH, NSF, Joint Theory Institute

Enhancement of Ramachandran propensity
[Figure: native Rama probability (φ, ψ) by AA, SecStr, and position]

Enhancement in energy and structure prediction
• ∆∆E = -120 (arb. units)
• 2X enhancement in native-like models in prediction

[Figure: secondary structure frequency (0 to 1) versus residue index, by round (Round 0 through Round 8), for 1b72 (1.6 Å), 1di2 (4.6 Å), 1r69 (2.4 Å), and 1af7 (2.7 Å)]

SPEED increases the native Rama probability
• SPEED reduces cases where native φ, ψ has a very low probability
• SPEED improves native φ, ψ probability across the sequence
[Figure: native Rama regions 1 to 4 on a φ, ψ map (-180 to 180); native basin probability for 1b72; % positions with P_Native > 0.25; 2° structure by position; amino acid by position; PDB id of target]

Radial energy terms enforce productive chain collapse (global terms)
• Rg-Cα: root-mean-squared distance of Cα from CM. Compactness of model.
• Ru-Cα: root-mean-squared deviation of Cα from CM. Enforces a spherical model.
• Rg-phob/Rg-phil (burial ratio): best packing of hydrophobic residues
[Diagram: CM, Cα, Cβ; Rg-Cα, Rg-phob, Rg-phil]

Eliminating the fixing thresholds from ItFix
MQIFVKT…STLHLVLR (e.g. pos. 67)
• round0 Rama distribution, then fold 2000X
• round1 Rama distribution, then fold 2000X
• round2 Rama distribution, then fold 2000X
• round3 Rama distribution
[Figure: Rama map (-180 to 180 on each axis) for each round]

An evolution-enhanced energy function: DOPE-PW-SPEED
[Figure: energy versus distance (Å) for DOPE-PW and DOPE-PW-SPEED; panels labeled WT:ILE, Homologs: polar; WT:Ala, Homologs: polar; PHE4, THR14]
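The energy-versus-distance curves in this deck come from a knowledge-based potential of the form Energy(r) = -ln(Prob(r)). The sketch below shows only that inverse-Boltzmann step on a plain distance histogram. The bin width, distance cutoff, pseudocount, and function name are illustrative assumptions, the distances are toy values rather than PDB statistics, and the reference-state normalization and PW orientation terms of DOPE and DOPE-PW are not reproduced.

```python
import math


def pair_energy_from_distances(distances, bin_width=1.0, max_dist=15.0, pseudocount=1.0):
    """Return one E = -ln(P) value per distance bin.

    P is the pseudocount-smoothed fraction of observed distances in the bin.
    """
    n_bins = int(math.ceil(max_dist / bin_width))
    # Start every bin at the pseudocount so empty bins keep a finite energy.
    counts = [pseudocount] * n_bins
    for r in distances:
        if 0.0 <= r < max_dist:
            index = min(int(r // bin_width), n_bins - 1)
            counts[index] += 1.0
    total = sum(counts)
    return [-math.log(c / total) for c in counts]


# Toy Cβ-Cβ distances in Å (hypothetical values for illustration only).
toy_distances = [4.2, 4.8, 5.1, 5.5, 5.9, 6.3, 9.7, 12.4]
energies = pair_energy_from_distances(toy_distances)
for i, energy in enumerate(energies):
    # Bin labels assume the default 1 Å bin width used above.
    print(f"{i:2d}-{i + 1:2d} Å  E = {energy:.2f}")
```

Frequently observed distances get low energies and rarely observed ones get high energies, which is the behavior the GLU-Cβ and LEU-Cβ curves illustrate.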