Computational Methods for Protein Structure Prediction
Ying Xu

Outline
• introduction to protein structures
• the problem of protein structure prediction
• why we can predict protein structure
• protein tertiary structure prediction
  – ab initio folding
  – homology modeling
• protein threading

Protein and Structure
• Protein sequence
  >1MBN:_ MYOGLOBIN (154 AA)
  MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
  KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
  PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
  GYQG
• Protein structure (ball-and-stick and spacefill views)
• Protein function: oxygen storage

Protein Structure
• A protein sequence folds into a “unique” shape (“structure”) that minimizes its free potential energy

Protein Structures
• Primary sequence
  MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• Secondary structure
  – α-helix
  – β-sheet (anti-parallel, parallel)

Protein Structures
• Tertiary structure
• Quaternary structure

Protein Structures
• Protein structure
  – generally compact
• Soluble protein structure
  – individual domains are generally globular
  – they share various common characteristics, e.g. hydrophobic moment profile
• Membrane protein structure
  – most of the amino acid side chains of transmembrane segments are non-polar
  – polar groups of the polypeptide backbone of transmembrane segments generally participate in hydrogen bonds

Protein Tertiary Structures
• Family: clear evolutionary relationship; proteins in the same family are homologous, with sequence identity >= 30%.
• Superfamily: low sequence identity, but probable common evolutionary origin.
• Fold: major structural similarity; may not have a common evolutionary origin.
• Class: all-α, all-β, α/β, α+β, …
• SCOP: Structural Classification Of Proteins (http://scop.mrc-lmb.cam.ac.uk/scop/)
• SCOP 1.65 release: 2327 families, 1294 superfamilies, 800 folds

Protein Tertiary Structures

Protein Structure Determination
• X-ray crystallography
  – most accurate
  – in vitro
  – needs protein crystals
  – ~50K per structure
• NMR
  – accurate
  – in vivo
  – no need for crystals
  – limited to small proteins
• Cryo-EM
  – imaging technology
  – low resolution

Protein Structure Prediction
• Problem: given the amino acid sequence of a protein, what is its 3-dimensional shape?
  MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
  KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
  PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
  GYQG
  ? ……..

Why Protein Structure Prediction?
• Importance of protein structure: knowledge of the structure of a protein enables us to
  – understand its function and functional mechanism
  – design better mutagenesis experiments
  – carry out structure-based rational drug design
• Experimental methods for protein structure determination
  – Pros: high resolution
  – Cons: time-consuming and very expensive

Why Protein Structure Prediction?
• Big gap between the number of protein sequences and the number of protein structures
  – UniProt/Swiss-Prot: 283,454 protein sequences
  – UniProt/TrEMBL: 4,864,587 gene sequences
  – PDB (Protein Data Bank): 48,385 protein structures
• Fundamental, unsolved, challenging problem

Why We Can Predict Structure
• In theory, a protein structure can be solved computationally
• A protein folds into a 3D structure to minimize its free potential energy
  – Anfinsen’s classic experiment on Ribonuclease A folding in the 1960s
  – energy functions
• This problem can be formulated as an optimization problem
  – protein folding problem, or ab initio folding

Why We Can Predict Structure
• While there could be billions of billions of proteins in nature, the number of unique structural folds (shapes) might be small
• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

Why We Can Predict Structure
• Theoretical studies suggest that the vast majority of the proteins in nature fall into not much more than 1000 structural folds
• This realization has fundamentally changed how protein structures can be predicted
• The structure prediction problem becomes: for a protein sequence, find which of the structural folds the protein can fold into, plus possibly some structural refinement
  MTYKLILN ….
  NGVDGEWTYTE

Computational Methods for Protein Structure Prediction
• ab initio
  – uses first principles to fold proteins
  – does not require templates
  – high computational complexity
• homology modeling
  – similar sequences imply similar structures
  – practically very useful, but needs homologues
• protein threading
  – many proteins share the same structural fold
  – a folding problem becomes a fold recognition problem
• Homology modeling and protein threading need known protein structures

ab initio structure prediction
• An energy function to describe the protein
  – bond energy
  – bond angle energy
  – dihedral angle energy
  – van der Waals energy
  – electrostatic energy
• Efficient and reliable algorithms to search the conformational space to minimize the function and obtain the structure.

ab initio structure prediction
• The problem is exceedingly difficult to solve
  – the search space is defined by the psi/phi angles of the backbone and the side-chain positions
  – the search space is enormous even for small proteins!
  – the number of local minima increases exponentially with the number of residues
• Theoretically solvable but practically infeasible!

ROSETTA (Dave Baker’s Lab)
• Construct a library of small structure fragments, e.g. 9 AA
• Cut a target sequence into sequence fragments
• For each sequence fragment, choose structural candidate fragments from the fragment library
• Assemble the fragment structures by Monte Carlo simulation
• The generated structures are grouped into clusters
• Clusters are ranked by their energy

Homology Modeling
• Observation: proteins with similar sequences tend to fold into similar structures.
1. The target sequence is aligned with the sequence of a known structure (the template); they usually share sequence identity of 30% or higher.
2. Superimpose the target sequence onto the template, replacing equivalent side-chain atoms where necessary.
3. Refine the model by minimizing an energy function.
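The 30% identity rule of thumb from step 1 can be checked with a few lines of Python. This is a minimal sketch, not taken from any homology modeling package: the function names `percent_identity` and `is_usable_template` are hypothetical, and identity is assumed to be counted only over aligned columns where neither sequence has a gap (other definitions use a different denominator).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which both sequences contribute a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def is_usable_template(aligned_target: str, aligned_template: str,
                       threshold: float = 30.0) -> bool:
    """Apply the 30%-or-higher rule of thumb (threshold is adjustable)."""
    return percent_identity(aligned_target, aligned_template) >= threshold


# Example pair of gapped sequences (the same pair appears later on the
# gap-penalty slide).
identity = percent_identity("FDSK---THRGHR", "FESYWTCTH-GHR")
print(round(identity, 1), is_usable_template("FDSK---THRGHR", "FESYWTCTH-GHR"))
```

In this example nine columns are gap-free and seven of them match, so the pair passes the threshold comfortably. In practice the alignment itself would come from a sequence alignment tool before this check is applied.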
• Programs:
  – Modeller (http://salilab.org/modeller/)
  – Swiss-Model (http://swissmodel.expasy.org//SWISS-MODEL.html)

Protein Threading
• Basic premise
  – The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
• Statistics from Protein Data Bank (~48,000 structures)
  – 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
• Chances for a protein to have a native-like structural fold in PDB are quite good (estimated to be 60-70%)
  – Proteins with similar structural folds could be homologues or analogues

Protein Threading
• The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
  MTYKLILN …. NGVDGEWTYTE
• Energy function
  – knowledge (or statistics) based rather than physics based
  – should be able to distinguish correct structural folds from incorrect structural folds
  – should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments

Protein Threading – four basic components
• Structure database
• Energy function
• Sequence-structure alignment algorithm
• Prediction reliability assessment

Protein Threading – structure database
• Build a template database

Protein Threading – structure database
• Non-redundant representatives through structure-structure and/or sequence-sequence comparison
  – FSSP, Families of Structurally Similar Proteins (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
  – SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
  – PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)
  – Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)

Protein Threading – energy function
  MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• how preferable it is to put two particular residues nearby: E_p
• how well a residue fits a structural environment: E_s
• alignment gap penalty: E_g
• total energy: E_p + E_s + E_g
• find a sequence-structure alignment that minimizes the energy function

Protein Threading – energy function
• A singleton energy
measures each residue’s preference for a specific structural environment
  – secondary structure
  – solvent accessibility
• It compares the actual occurrence of a residue in an environment against its “expected value” by chance

Protein Threading – energy function
• A simple definition of structural environment
  – secondary structure: alpha-helix, beta-strand, loop
  – solvent accessibility: 0, 10, 20, …, 100% of accessibility
  – each combination of secondary structure and solvent accessibility level defines a structural environment, e.g., (alpha-helix, 30%), (loop, 80%), …
• E_s: a scoring matrix of 30 structural environments by 20 amino acids
  – E.g., E_s((loop, 30%), A)
Singleton energy term

Protein Threading – energy function

       Helix                   Sheet                   Loop
     Buried   Inter Exposed  Buried   Inter Exposed  Buried   Inter Exposed
ALA  -0.578  -0.119  -0.160   0.010   0.583   0.921   0.023   0.218   0.368
ARG   0.997  -0.507  -0.488   1.267  -0.345  -0.580   0.930  -0.005  -0.032
ASN   0.819   0.090  -0.007   0.844   0.221   0.046   0.030  -0.322  -0.487
ASP   1.050   0.172  -0.426   1.145   0.322   0.061   0.308  -0.224  -0.541
CYS  -0.360   0.333   1.831  -0.671   0.003   1.216  -0.690  -0.225   1.216
GLN   1.047  -0.294  -0.939   1.452   0.139  -0.555   1.326   0.486  -0.244
GLU   0.670  -0.313  -0.721   0.999   0.031  -0.494   0.845   0.248  -0.144
GLY   0.414   0.932   0.969   0.177   0.565   0.989  -0.562  -0.299  -0.601
HIS   0.479  -0.223   0.136   0.306  -0.343  -0.014   0.019  -0.285   0.051
ILE  -0.551   0.087   1.248  -0.875  -0.182   0.500  -0.166   0.384   1.336
LEU  -0.744  -0.218   0.940  -0.411   0.179   0.900  -0.205   0.169   1.217
LYS   1.863  -0.045  -0.865   2.109  -0.017  -0.901   1.925   0.474  -0.498
MET  -0.641  -0.183   0.779  -0.269   0.197   0.658  -0.228   0.113   0.714
PHE  -0.491   0.057   1.364  -0.649  -0.200   0.776  -0.375  -0.001   1.251
PRO   1.090   0.705   0.236   1.249   0.695   0.145  -0.412  -0.491  -0.641
SER   0.350   0.260  -0.020   0.303   0.058  -0.075  -0.173  -0.210  -0.228
THR   0.291   0.215   0.304   0.156  -0.382  -0.584  -0.012  -0.103  -0.125
TRP  -0.379  -0.363   1.178  -0.270  -0.477   0.682  -0.220  -0.099   1.267
TYR  -0.111  -0.292   0.942  -0.267  -0.691   0.292  -0.015  -0.176   0.946
VAL  -0.374   0.236   1.144  -0.912  -0.334   0.089  -0.030   0.309   0.998

Protein Threading – energy function
• It measures the preference of a pair of amino acids to be close in 3D space.
• How close is close?
  – distance dependent
  – single cutoff
  – measured between Cα atoms, Cβ atoms, or side-chain centroids
• Observed occurrence of a pair compared with its “expected” occurrence
Pair-wise interaction energy term

Protein Threading – energy function

      ALA   ARG   ASN   ASP   CYS   GLN   GLU   GLY   HIS   ILE   LEU   LYS   MET   PHE   PRO   SER   THR   TRP   TYR   VAL
ALA  -140   268   105   217   330    27   122    11    58  -114  -182   123   -74   -65   174   169    58    51    53  -105
ARG         -18   -85  -616    67   -60  -564   -80  -263   110   263   310   304    62   -33   -80    60  -150  -132   171
ASN              -435  -417   106  -200  -136  -103    61   351   358  -201   314   201  -212  -223  -231   -18    53   298
ASP                      17   278    67   140  -267  -454   318   370  -564   211   284   -28  -299  -203   104   268   431
CYS                         -1923   191   122    88   190   154   238   246    50    34   105     7   372    52    62   196
GLN                                -115    10   -72   272   243    25  -184    32    72   -81  -163  -151   -12   -90   180
GLU                                        68   -31  -368   294   255  -667   141   235  -102  -212  -211   157   269   235
GLY                                            -288    74   179   237    95    13   114   -73  -186   -73   -69    58   202
HIS                                                  -448   294   200    54    -7   158   -65  -133  -239  -212    34   204
ILE                                                        -326  -160   194   -12   -96   369   206   109   -18  -163  -232
LEU                                                              -278   178  -106  -195   218   272   225    81   -93  -218
LYS                                                                     122   301   -17   -46   -58   -16    29  -312   269
MET                                                                          -494  -272    35   193   158    -5  -173   -50
PHE                                                                                -206   -21   114   283    31    -5   -42
PRO                                                                                      -210  -162   -98  -432   -81    46
SER                                                                                            -177  -215   129   104   267
THR                                                                                                  -210    95   163    73
TRP                                                                                                         -20   -95   101
TYR                                                                                                                -6   107
VAL                                                                                                                    -324

Protein Threading – energy function
• w(k) = h + gk for k ≥ 1, and w(0) = 0, where h and g are constants
  – h: gap opening penalty
  – g: gap extension penalty

  FDSK---THRGHR
  :.:    :: :::
  FESYWTCTH-GHR

  FDSK-T--HRGHR
  :.:  :  : :::
  FESYWTCTH-GHR

Gap penalty term

Protein Threading – energy function
• Secondary structure prediction is mature and can achieve ~80% accuracy
• Using the predicted probabilities of the three secondary structure states (α-helix, β-strand, and loop) gives better performance
Secondary structure match energy

Threading Parameter Optimization
• The contribution of each term (weight).
• Based on threading performance on a training set (fold recognition and alignment accuracy).
• Different weights for different classes (superfamily, fold)?
  – the pair-wise term may contribute more for fold-level threading
  – mutation/profile terms dominate in superfamily-level threading
• E_total = ω_s E_singleton + ω_p E_pairwise + ω_g E_gap + ω_ss E_ss

Protein Threading -- algorithm
• Considering only singleton energy + gap penalty
• Represent a structure as a sequence of “structural environments”
  – (helix, 100%), (helix, 90%), ….. (strand, 0%)
• Align a sequence MACKLPV …. with a structural sequence (helix, 100%), (helix, 90%), ….. (strand, 0%)

Protein Threading – dynamic programming
• Two sequences: AACG and AAGG

  AAGG
  || |
  AACG

• Step #1: calculating the alignment matrix

        A   A   C   G
    A   2   2  -1  -1
    A   2   4   3   2
    G  -1   3   3   5
    G  -1   2   2   5

• Rule:
  1. initialization: fill the first row and column with matching scores
  2. fill an empty cell based on the scores of its left, upper, and upper-left neighbors plus the matching score of the current cell
  3. choose the one giving the highest score

Protein Threading – dynamic programming
• Step #2: tracing back to recover the alignment

        A   A   C   G
    A   2   2  -1  -1
    A   2   4   3   2
    G  -1   3   3   5
    G  -1   2   2   5

• Rule:
  1. start from the lower-right corner
  2. trace back to the left, upper, or upper-left neighbor that gives the current cell’s score
  3. keep doing this until it cannot continue

Protein Threading – dynamic programming
Steps:
1. Initialization: construct an (n+1) x (m+1) matrix F for two sequences of lengths n and m.
2. Matrix fill: for each cell in the matrix F, check all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum-scoring pathway.
3. Traceback: construct an alignment back from the last cell in the matrix (or the highest-scoring cell) to give the highest-scoring alignment.
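The fill and traceback rules above can be sketched in Python on the AACG/AAGG example. This is an illustrative sketch under stated assumptions rather than the exact procedure behind the slides. The scores (match 2, mismatch -1, gap -1) are inferred from the worked matrix. The matrix is n x m with the first row and column holding plain matching scores, as in the worked example, instead of the (n+1) x (m+1) form in the general steps. The handling of the first row or column at the end of the traceback is an assumption, and the names `score` and `align` are hypothetical.

```python
def score(a, b, match=2, mismatch=-1):
    """Matching score for one residue pair (values inferred from the slide)."""
    return match if a == b else mismatch


def align(rows, cols, gap=-1):
    """Fill the matrix as in Step #1, then trace back as in Step #2.

    Both sequences are assumed to be non-empty.
    """
    n, m = len(rows), len(cols)
    F = [[0] * m for _ in range(n)]
    # Rule 1: the first row and first column hold plain matching scores.
    for j in range(m):
        F[0][j] = score(rows[0], cols[j])
    for i in range(n):
        F[i][0] = score(rows[i], cols[0])
    # Rules 2 and 3: best of upper-left (pair), upper (gap), left (gap).
    for i in range(1, n):
        for j in range(1, m):
            F[i][j] = max(
                F[i - 1][j - 1] + score(rows[i], cols[j]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    # Traceback: start from the lower-right corner and follow the neighbor
    # that explains the current cell's score.
    i, j = n - 1, m - 1
    top, bottom = [], []
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + score(rows[i], cols[j]):
            top.append(rows[i])
            bottom.append(cols[j])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(rows[i])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(cols[j])
            j -= 1
    # Assumption: once the first row or column is reached, pair the boundary
    # residues and pad any leftover prefix with gaps.
    top.append(rows[i])
    bottom.append(cols[j])
    while i > 0:
        i -= 1
        top.append(rows[i])
        bottom.append("-")
    while j > 0:
        j -= 1
        top.append("-")
        bottom.append(cols[j])
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = align("AAGG", "AACG")
for row in F:
    print(row)
print(top)
print(bottom)
```

The same fill-and-traceback scheme applies to singleton-energy threading if `score` is replaced by a lookup of a residue against a structural environment, since each cell still depends only on its three neighbors.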
Protein Threading -- dynamic programming
[Figure: dynamic programming matrix with the residues M, L, V, A on one axis and the structural environments (helix, 100%), (helix, 90%), (helix, 80%), (loop, 80%) on the other]

Protein Threading -- algorithm
• Considering all three energy terms
• Considering the pair-wise interaction energy makes the problem much more difficult to solve
  – the dynamic programming algorithm does not work any more!
• There are other techniques that can be used to solve the problem

Protein Threading -- algorithm
• Dynamic programming
• Heuristic algorithms for pair-wise interactions
  – Frozen approximation algorithm (A. Godzik et al.)
  – Double dynamic programming (D. Jones et al.)
  – Monte Carlo sampling (S.H. Bryant et al.)
• Rigorous algorithms for pair-wise interactions
  – Branch-and-bound (R.H. Lathrop and T.F. Smith)
  – Divide-and-conquer (Y. Xu et al.), PROSPECT
  – Linear programming (J. Xu et al.), RAPTOR
  – Tree decomposition (L. Cai et al.)
• Rigorous algorithm for treating backbone and side-chain simultaneously (Li et al.)

Fold Recognition
  MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
  Score = -1500    Score = -720    Score = -1120    Score = -900
• Which one is the correct structural fold for the target sequence, if any?
• The one with the highest score?

Fold Recognition
• Query sequence: AAAA
• Template #1: AATTAATACATTAATATAATAAAATTACTGA
• Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
• Better template?
• Which of these two sequences has a better chance of producing a good match with the query sequence after being randomly reshuffled?
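The reshuffling question can be explored empirically. The sketch below is a simplified illustration and not a threading procedure: it treats "a good match" as the query occurring verbatim somewhere in the shuffled template, which is an assumption made only to keep the example short. The function name `background_hit_rate`, the number of trials, and the random seed are likewise hypothetical choices.

```python
import random


def background_hit_rate(template, query, trials=2000, seed=0):
    """Fraction of random shuffles of `template` that contain `query`."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    letters = list(template)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # shuffling preserves the letter composition
        if query in "".join(letters):
            hits += 1
    return hits / trials


QUERY = "AAAA"
TEMPLATE_1 = "AATTAATACATTAATATAATAAAATTACTGA"
TEMPLATE_2 = "CGGTAGTACGTAGTGTTTAGTAGCTATGAA"

rate_1 = background_hit_rate(TEMPLATE_1, QUERY)
rate_2 = background_hit_rate(TEMPLATE_2, QUERY)
print(f"template #1: {rate_1:.3f}   template #2: {rate_2:.3f}")
```

Template #1 contains 17 A's out of 31 letters while template #2 contains 8 out of 30, so shuffled versions of template #1 match the query far more often by chance alone. A high raw score against template #1 is therefore weaker evidence than the same score against template #2.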
Fold Recognition
• Different template structures may have different background scores, making direct comparison of threading scores against different templates invalid
• Comparison of threading results should be based on how much a score stands out from its background score distribution rather than on the threading scores directly

Fold Recognition
• Threading 100,000 sequences against a template structure provides baseline information about the background scores of the template
• By locating where the threading score of a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is!
[Figure: background score distribution, with “not significant” and “significant” regions marked]

Fold Recognition
• Z-score = (score − average) / standard deviation
  – the average and standard deviation are obtained by randomly shuffling the query sequence and calculating the alignment score

Fold Recognition
[Figure: significance score versus prediction specificity/sensitivity; x-axis: Z-score]

Fold Recognition
• Examine the feature space of threading alignments: (singleton score, pair contact score, secondary structure score, hydrophobic moment score, ......) versus true/false fold recognition
    -2000, -500, -35, -90, ......, true
    -1000, -201, -11, -500, ......, false
    -5020, -900, -20, -75, ......, true
    -1050, -185, -18, -320, ......, false
    ......
• Separate true ones from false ones using a support vector machine (SVM)

Fold Recognition
• Each feature has a somewhat different distribution in the true and false predictions
• E.g., the hydrophobic moment (Hydrophobic moments of protein structures: spatially profiling the distribution, David Silverman, PNAS 2001 98: 4996-5001) is quite useful in distinguishing true from false threading predictions

Hydrophobic Moment profiles
[Figure: hydrophobic moment profiles; H2(d) versus d (Angstroms)]

Application
• Protein sequence
• Sequence preprocessing
  – remove signal peptide
  – membrane/soluble?
  – domain prediction
• Sequence profile generation
• Secondary structure prediction
• Experimental data constraints
• Database searching: find homolog in PDB?
  – YES: homology modeling
  – NO: fold recognition
• 3D structure

Challenging Issues
• Much improved energy functions for threading
• Considering side-chain information when doing protein threading
• Effective integration of fragment-based methods and threading techniques
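Returning to the Z-score defined in the Fold Recognition slides, the shuffle-based calculation can be sketched as follows. This is a toy illustration under stated assumptions, not the scoring used by any threading program. `toy_score` is a hypothetical stand-in for a real threading score that simply counts positions where a sequence agrees with a fixed template string, and higher is better here. With energy-like scores, where lower is better, a significant hit would instead have a strongly negative Z-score. The function names, the number of shuffles, and the seed are also assumptions.

```python
import random
import statistics


def z_score(query, score_fn, n_shuffles=200, seed=0):
    """(score - average) / standard deviation over shuffled copies of query."""
    rng = random.Random(seed)
    real_score = score_fn(query)
    letters = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        background.append(score_fn("".join(letters)))
    average = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0  # no variation in the background: score is uninformative
    return (real_score - average) / spread


# Hypothetical template, taken here to be the example sequence from the slides.
TOY_TEMPLATE = "MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE"


def toy_score(sequence):
    """Toy score: number of positions that agree with TOY_TEMPLATE."""
    return sum(a == b for a, b in zip(sequence, TOY_TEMPLATE))


z = z_score(TOY_TEMPLATE, toy_score)
print(f"Z-score of the unshuffled query: {z:.1f}")
```

Because each template gets its own background distribution, Z-scores computed this way can be compared across templates even when the raw scores cannot.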