2° structure, TM regions, and solvent accessibility
Chapter 29, Gu and Bourne, “Structural Bioinformatics”
Topic 13

The Truth (Information) is Out (In) There
But we’re still having a tough time finding it.

Protein Secondary Structure Prediction
Given a protein sequence (primary structure), predict its secondary structure.

  Sequence:             GHWIATRGQLIREAYEDYRHFSSECPFIP
  Secondary structure:  CEEEEECCCEEEEECCCHHHHHHCCCCCC

  E: β-strand   H: α-helix   C: coil

Each of the three states groups several finer assignments:
  H = (H: α-helix, G: 3_10 helix, I: π-helix)
  E = (E: β-strand, B: bridge)
  C = (T: turn, S: bend, C: coil)

Assumption: short stretches of residues have a propensity to adopt a certain conformation ⇒ the conformation of the central residue in a sequence fragment depends only on its flanking residues (sliding window).

Why secondary structure prediction?
-- Because we can (kind of).
-- Because it could be a first step towards prediction of protein tertiary structure.
“Have solution, need problem.” Nearly every imaginable algorithm has been applied to secondary structure prediction.

Secondary Structure Prediction Methods
1. First generation: single amino acid propensities. Chou-Fasman method (1974), GOR I-IV. ~56-60% accuracy.
2. Second generation: segments of 3-51 adjacent residues. NNSSP, SSPAL. ~65% accuracy.
3. Neural network: PHD, Psi-Pred, J-Pred
4. Support vector machine (SVM)
5. Hidden Markov models (HMM)
Third generation: methods using evolutionary information. ~76% accuracy.

Secondary Structure Prediction Accuracy
1. Three-state per-residue prediction accuracy:
   $Q_3 = \frac{100}{N_{obs}} \sum_{i=1}^{3} M_{ii}$
   M_ii: number of residues observed in state i and predicted in state i
   N_obs: total number of residues observed in the 3 states
2.
per-segment prediction accuracy (SOV, Segment of OVerlap)
   Per-state segment overlap, where S1 is an observed SS segment and S2 is a predicted SS segment.

Single Residue Propensity Methods
Calculate the propensity for a given amino acid to adopt a certain SS type:
  $P_i^{\alpha} = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$
  i: amino acid; α: secondary structure state

Example: from a data set with 30 proteins
  #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580
  p(α, aa) = 580/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
  P = 580 / (4,000/10) = 1.45

Amino Acid Propensities to Secondary Structures
Helix propensities P(H) (×100) along an example sequence (Chou-Fasman method):

  Residue:  T    S    P    T    A    E    L    M    R    S    T    G
  P(H):     69   77   57   69   142  151  121  145  98   77   69   57

Nearest Neighbor Methods
The idea is simple: predict the SS of the central residue of a given segment from homologous segments (neighbors). For example, find in a database some number of the sequences closest to a subsequence defined by a window around the central residue, then use max(N_α, N_β, N_c) to assign the SS.

[Figure: the query window RSTEVRASRQLAKEKVN aligned against homologous sequences, each labeled with the SS state (E, H, or C) of its central residue; the window size is marked.]

Key parameters:
1. How to define similarity?
2. What size window of sequence should be examined?
3. How many close sequences should be selected?
The Devil is in the details…

Psi-Pred Method
D. Jones, J. Mol. Biol. 292, 195 (1999).
Method: neural network
Input data: PSSM generated by PSI-BLAST
Bigger and better sequence database
-- Combining several databases and data filtering
Training and test set preparation
-- SS prediction only makes sense for proteins with no homologous structure.
-- No sequence or structural homologues between training and test sets, checked with CATH and PSI-BLAST (mimicking a realistic situation).
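Going back to the single-residue propensity example earlier in this section, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `propensity` and `propensities_from_data` are made up for this sketch, and the toy sequence/structure pair in the usage note is hypothetical. Only the Ala/helix counts come from the worked example.

```python
from collections import Counter


def propensity(n_joint, n_state, n_aa, n_total):
    """Propensity P = p(state, aa) / (p(state) * p(aa)), computed from counts.

    With p(state, aa) = n_joint / n_total, p(state) = n_state / n_total and
    p(aa) = n_aa / n_total, the ratio simplifies to
    n_joint * n_total / (n_state * n_aa).
    """
    return n_joint * n_total / (n_state * n_aa)


def propensities_from_data(pairs):
    """Tabulate propensities from (sequence, ss_string) pairs.

    Hypothetical helper: each pair holds an amino acid string and an
    equally long string of SS states (e.g. H, E, C). Returns a dict
    mapping (amino_acid, state) to its propensity.
    """
    joint, aa_counts, state_counts = Counter(), Counter(), Counter()
    for seq, ss in pairs:
        if len(seq) != len(ss):
            raise ValueError("sequence and SS string must have equal length")
        for aa, state in zip(seq, ss):
            joint[(aa, state)] += 1
            aa_counts[aa] += 1
            state_counts[state] += 1
    total = sum(aa_counts.values())
    return {
        (aa, state): propensity(n, state_counts[state], aa_counts[aa], total)
        for (aa, state), n in joint.items()
    }


# Counts from the worked example: 580 Ala in helix, 4,000 helix residues,
# 2,000 Ala, 20,000 residues in total.
ala_helix = propensity(580, 4000, 2000, 20000)
print(ala_helix)  # 1.45
```

A value above 1 means the amino acid is found in that state more often than expected by chance. For a hypothetical toy pair, `propensities_from_data([("AAGG", "HHCC")])` gives a propensity of 2.0 for both `("A", "H")` and `("G", "C")`.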
Psi-Pred Method--Neural Network
Window size = 15
Two networks

First network (sequence-to-structure):
-- 315 = (20 + 1) × 15 inputs; the extra unit indicates where the window spans either the N or C terminus
-- Data are scaled to the [0, 1] range using 1/[1 + exp(-x)]
-- 75 hidden units
-- 3 outputs (H, E, L)

Second network (structure-to-structure):
-- Structural correlation between adjacent sequence positions
-- 60 = (3 + 1) × 15 inputs
-- 60 hidden units
-- 3 outputs

Accuracy ~76%

Sample Psi-Pred Output
Conf: confidence (0 = low, 9 = high) -- very important!
Pred: predicted secondary structure (H = helix, E = strand, C = coil)
AA: target sequence

# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
              10        20        30        40        50        60

Conf: 777179998337888888988751235636899718261220179868899999998557
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
              70        80        90       100       110       120

Conf: 200242314703799714651435541487355188999999999999999889999999
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
             130       140       150       160       170       180

***Compare the prediction for residues 9 and 17*** (both are predicted H, but with confidence 9 and 0, respectively)

Sample Psi-Pred Output-II
Again, voting-rule methods tend to be best.

2SOD  ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP
BPS   CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC
D_R   CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC
DSC   CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC
GGR   CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC
GOR   HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC
H_K   CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC
K_S   CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC
JOI   CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC
2SOD  ---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE-------------

Rows in each block, top to bottom: 2SOD sequence; predictions by BPS, D_R, DSC, GGR, GOR, H_K, K_S, and JOI; 2SOD observed structure.

2SOD  HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK
BPS   CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC
D_R   CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC
DSC   CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC
GGR   CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC
GOR   CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC
H_K   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC
K_S   CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE
JOI   CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC
2SOD  --------------------EEEEEE------EEEEEEE--------------EEEEE--

Prediction Accuracy (EVA)
[Figure: distribution of per-protein accuracy for PSIPRED, SSpro, PROF, PHDpsi, JPred2, and PHD. x-axis: percentage of correctly predicted residues per protein (30-100); y-axis: percentage of all 150 proteins (0-25).]
EVA: automatic evaluation of prediction servers

How Far Can We Go?
Currently ~76%
Proteins with more than 100 homologues: 80%
-- Assignment is ambiguous (5-15%). Recall DSSP vs. STRIDE: non-unique protein structures (dynamic), H-bond cutoff, etc.
-- Different secondary structures between homologues (~12%).
-- Non-locality: secondary structure is influenced by long-range interactions. Some segments can have multiple structure types (chameleon sequences).

Solvent accessibility
Conceptually similar problem to SS prediction: Buried vs. Exposed.
Weighted Ensemble Solvent Accessibility predictor: http://pipe.scs.fsu.edu/wesa.html
  E E E E B B B B B B E E
Why bother? To provide structural context for putative mutations that one wants to characterize biochemically or biophysically.

Transmembrane Segment Prediction
Again, conceptually similar problem to SS prediction: TM vs. Not.
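
Returning to the accuracy measure Q3 and the remark that voting-rule methods tend to do best, both ideas can be sketched in Python. This is a minimal illustration, not the procedure used by EVA or by any of the servers named above. The function names `q3` and `majority_vote`, the tie-breaking rule, the reading of `-` as coil, and the three toy prediction strings are all assumptions made for this sketch.

```python
from collections import Counter


def q3(observed, predicted):
    """Three-state per-residue accuracy: 100 * (sum of M_ii) / N_obs.

    Assumption for this sketch: '-' in either string is read as coil (C).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("empty input")
    obs = observed.replace("-", "C")
    pred = predicted.replace("-", "C")
    # Summing M_ii over the three states equals counting matching positions.
    correct = sum(o == p for o, p in zip(obs, pred))
    return 100.0 * correct / len(obs)


def majority_vote(predictions):
    """Per-position consensus of several equally long prediction strings.

    Assumption: ties go to the state that appears first among the rows,
    which is how Counter.most_common orders equal counts.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*predictions)
    )


# Hypothetical toy data, not taken from the 2SOD comparison.
observed = "--EEEE--HHHH"
preds = ["CCEEEECCHHHH", "CEEEEECCCHHH", "CCCEEECCHHHC"]

consensus = majority_vote(preds)
print(consensus)                           # CCEEEECCHHHH
print(q3(observed, consensus))             # 100.0
print(round(q3(observed, preds[1]), 1))    # 83.3
```

In this toy case the consensus outscores two of the three individual predictions because their errors fall at different positions. The same applies to the 2SOD rows only insofar as the methods err independently.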