Secondary Structure Prediction Using Decision Lists
Deniz Yuret, Volkan Kurt

Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?

What is the problem?
• The generic prediction algorithm
• Some important pitfalls: definition, data set
• Upper and lower bounds on performance
• Evolution and homology enter the picture

Tertiary / Quaternary Structure
[Figures: tertiary and quaternary structure illustrations]

Secondary Structure
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
----------HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm
• Sequence to Structure
• Structure to Structure

A Generic Prediction Algorithm: Sequence to Structure
(animation: the predictor scans the sequence left to right, assigning one state per residue)
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
??????????????????????????????????????
-?????????????????????????????????????
--????????????????????????????????????
---???????????????????????????????????
----H-----????????????????????????????
----H-----H???????????????????????????
----H-----HH??????????????????????????
A Generic Prediction Algorithm: Sequence to Structure
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
----H-----HHHHHHHHHH------EEEEE------?
----H-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure
(animation: a second pass revisits each position using the surrounding predicted structure)
MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD
?---H-----HHHHHHHHHH------EEEEE-------
-?--H-----HHHHHHHHHH------EEEEE-------
--?-H-----HHHHHHHHHH------EEEEE-------
----?-----HHHHHHHHHH------EEEEE-------
----------HHHHHHHHHH------EEEEE------?
----------HHHHHHHHHH------EEEEE-------

Pitfalls for Newcomers
• Definition of secondary structure
• Choice of data set

Pitfall 1: Definition of Secondary Structure
• DSSP: H, B, E, G, I, T, S
• STRIDE: H, G, I, E, B, b, T, C
• DEFINE: ???
• Convert all to H, -, and E
• They only agree 71% of the time!
  (95% agreement for DSSP and STRIDE)
• Solution: use DSSP

Pitfall 2: Data Set
• Trivial to get 80%+ when homologies are present between the training and the test set
• Homology identification keeps evolving
• RS126, CB513, etc.
• Comparing programs on different data sets is meaningless

Performance Bounds
• Simple baselines for a lower bound
• A method for estimating an upper bound

Performance Bounds
• Baseline 1: 43% of all residues are tagged "loop" (43%: assign loop)
• Baseline 2: 49% of all residues are tagged with the most frequent structure for the given amino acid (49%: assign most frequent)
• Upper bound: only consider exact matches for a given frame size
• As the frame size increases, accuracy should increase but coverage should fall

[Chart: Upper Bound with Homologs. As the frame size grows from 1 to 9, the accuracy bound rises from about 43% toward 96%, while coverage falls from 100% to roughly 12%.]

[Chart: Upper Bound without Homologs. The bound levels off in the low 70s (peaking around 74%), while coverage collapses below 1% at large frame sizes.]

Performance Bounds
• 100%: ???
• 75%: estimated upper bound
• 49%: assign most frequent
• 43%: assign loop

The Miracle of Homology
• People used to be stuck at around 60%.
• Rost and Sander crossed the 70% barrier in 1993 using homology information.
• All algorithms benefit 5-10% from homology.
• The homologues are of unknown structure; training and test sets are still unrelated!
• Why?

[Charts: prediction accuracy around 60% without homology, around 70% with homology]

Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?
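The two lower-bound baselines above are easy to reproduce. The sketch below uses toy data and illustrative names (`train`, `test`, `accuracy`); on the real CB513-style data the talk reports 43% for the all-loop baseline and 49% for the per-amino-acid most-frequent baseline.

```python
# Two lower-bound baselines for secondary structure prediction, on toy data.
# '-' = loop, 'H' = helix, 'E' = strand. Names and data here are illustrative.
from collections import Counter, defaultdict

train = [("MRRWFH", "---HH-"), ("GVDGSF", "-EEE--")]
test  = [("RWFHGV", "--HH--")]

def accuracy(preds, golds):
    pairs = [(p, g) for ps, gs in zip(preds, golds) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs)

# Baseline 1: predict the overall most common state (loop) everywhere.
overall = Counter(s for _, struct in train for s in struct).most_common(1)[0][0]
base1 = ["".join(overall for _ in seq) for seq, _ in test]

# Baseline 2: predict the most frequent state for each amino acid.
per_aa = defaultdict(Counter)
for seq, struct in train:
    for aa, s in zip(seq, struct):
        per_aa[aa][s] += 1

def most_frequent(aa):
    return per_aa[aa].most_common(1)[0][0] if aa in per_aa else overall

base2 = ["".join(most_frequent(aa) for aa in seq) for seq, _ in test]
```

On real data, baseline 2 beats baseline 1 because some residues (e.g. proline, glycine) have strongly skewed structure preferences.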
GOR V (Garnier et al., 2002)
• Information function / Bayesian statistics; PSI-BLAST homologs; majority-vote filter
• Structure-to-structure output: 66.9% without homologs; homology adds +6.5%; final: 73.4%

PHD (Rost & Sander, 1993)
• HSSP frequency profiles; two neural-network levels; jury + filter
• Level 1: 61.7% / 65.9%; Level 2: 62.6% / 67.4%; final: 70.8% (homology gains of +4.3% and +3.4% in the original diagram)

JNet (Cuff & Barton, 2000)
• Profiles from PSI-BLAST, HMMER2, and CLUSTALW; neural networks with a jury network
• Final: 76.9%

PSIPRED (Jones, 1999)
• PSI-BLAST profiles; two neural-network levels
• Final: 76.3%

Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?

Introduction to Decision Lists
• Prototypical machine learning problem:
  – Decide democrat or republican for 435 representatives based on 16 votes.
• Class Name: 2 (democrat, republican)
  1. handicapped-infants: 2 (y,n)
  2. water-project-cost-sharing: 2 (y,n)
  3. adoption-of-the-budget-resolution: 2 (y,n)
  4. physician-fee-freeze: 2 (y,n)
  5. el-salvador-aid: 2 (y,n)
  6. religious-groups-in-schools: 2 (y,n)
  …
  16. export-administration-act-south-africa: 2 (y,n)

Introduction to Decision Lists
• A three-rule decision list for the same problem:
  1. If adoption-of-the-budget-resolution = y and anti-satellite-test-ban = n and water-project-cost-sharing = y then democrat
  2. If physician-fee-freeze = y then republican
  3. If TRUE then democrat

The Greedy Prepend Algorithm: Rule Search
• Initially everything is predicted to be the most common structure (i.e. loop)
• [Diagram: the base rule partitions the training set into correct and false assignments]
• At each step, add the maximum-gain rule
• [Diagram: the new rule further partitions the set covered by the base rule]

GPA Rules
• The first three rules of the sequence-to-structure decision list
  – 58.86% performance (of the full list's 66.36%)

GPA Rule 1
• Everything => Loop

GPA Rule 2
• Predicts HELIX from constraints over frame positions L4..R4
• [Rule table: wildcards at the edges, exclusions such as !GLY, !PRO, !ASN, !SER at inner positions; roughly "non-polar or large"]

GPA Rule 3
• Predicts STRAND from constraints over frame positions L4..R4
• [Rule table: exclusions such as !PRO, !ASP, !GLU, plus an allowed central set CYS, ILE, LEU, PHE, TRP, TYR, VAL; roughly "non-polar and not charged"]

GPA
• Two decision-list levels; PSI-BLAST homologs add +6.67%
• Level 1: 60.48%; Level 2: 62.54% / 69.21% (without / with homologs)

Experimental Setup
• DSSP assignments
• Reduction:
  – E (extended strand), B (beta bridge) -> Strand
  – H (alpha helix), G (3-10 helix) -> Helix
  – Others -> Loop
• Data set:
  – CB513 set
  – 7-fold cross-validation

GPA Performance
• Performance of the sequence-to-structure decision list:
  – Without homologs: 60.48% (29 to 66 rules)
  – With homologs: 66.36% (46 to 68 rules)
• Performance with the structure-to-structure filter:
  – Without homologs: 62.54% (18 to 116 rules)
  – With homologs: 69.21% (16 to 40 rules)

GPA Performance
• Performance with 20 rules at both steps:
  – Without homologs: 62.15%
  – With homologs: 69.08%
• Possible to make a back-of-the-envelope structure prediction using our model

Comparison on CB513
• PHD      72.3
• NNSSP    71.7
• GPA      69.2
• DSC      69.1
• Predator 69.0

Outline
• What is the problem?
• What are the different approaches?
• How do we use decision lists and why?
• Why does evolution help?

The Miracle of Homology
[Chart: accuracy around 70% with homology]

Discussion
• Training set homologues and test set homologues help for different reasons.
• Training set homologues use semi-accurate guesses of structure to provide information on amino-acid substitutions
• Test set homologues take advantage of "independent errors" in prediction
• The less similar the homologue sequences, the better

Summary
• Homologues between the training set and the test set unfairly influence results.
• Homologues within the training set and the test set still help significantly.
• There is an upper bound at around 75% unless we use a homologue of the target protein.
• Very different learning algorithms converge on comparable accuracy.

Some Educated Guesses
• Significant progress probably requires better homology detection rather than better learning algorithms.
• To exceed the 75% bound one needs to start incorporating long-range interactions.
• CASP shows that predicting tertiary structure first gives comparable results: any use for secondary structure?

Thank you…
• The algorithm, the paper, etc. available from: dyuret@ku.edu.tr

Introduction
• Protein Structure
  – What is Secondary Structure?
  – What is Tertiary Structure?
• Secondary Structure Prediction
  – What are decision lists?
  – GPA in action
• Tertiary Structure Prediction

Protein Structure
• Primary Structure – Sequences
• Secondary Structure – Frequent Motifs
• Tertiary Structure – Functional Form
• Quaternary Structure – Protein Complexes

Primary Structure
• Sequence information
• Contains only amino-acid sequences
  – 24 amino-acid codes present
  – 20 standard residues
  – Glutamine or glutamic acid: GLX (GLU)
  – Asparagine or aspartic acid: ASX (ASN)
  – Others (non-natural/unknown): X
  – Selenocysteine, pyrrolysine

Secondary Structure
• Rigid structure motifs
• Does not give information about the coordinates of residues
• Can be seen as a one-dimensional reduction of the tertiary structure
• If accurately predicted, can be used to
  – Predict the final (tertiary) structure
  – Predict the fold type (all-alpha/all-beta etc.)
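This one-dimensional reduction is computed from DSSP's eight-state assignment. The experimental-setup mapping from the talk (E, B to strand; H, G to helix; everything else to loop) can be written as a small function; the function name is illustrative.

```python
# Three-state reduction of DSSP secondary-structure assignments, following
# the mapping stated in the talk: E, B -> strand; H, G -> helix; rest -> loop.
REDUCE = {"H": "H", "G": "H",   # alpha helix, 3-10 helix -> helix
          "E": "E", "B": "E"}   # extended strand, beta bridge -> strand

def three_state(dssp_string):
    """Map an eight-state DSSP string to H/E/- ; unmapped states become loop."""
    return "".join(REDUCE.get(s, "-") for s in dssp_string)
```

Note that under this mapping the rarer states (I, T, S and blank) all collapse to loop, which is one reason different reduction conventions in the literature produce different Q3 numbers.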
Common Secondary Structure Motifs
[Figures: alpha helix, parallel beta sheet, antiparallel beta sheet]

Tertiary/Quaternary Structure
• Tertiary Structure
  – The functional form
  – Coordinates of residues in space
• Quaternary Structure
  – Protein-protein complexes
  – Assembly of one or more proteins

Structure Prediction
• Easier to determine sequence than structure
• Predictions may help close the gap
[Chart: Sequences vs. Structures, number of entries per year, 1986-2004; Swissprot grows far faster than the PDB]

Secondary Structure Prediction
• Assessment of prediction accuracy
• Common strategy
• Methods in the literature
• Decision lists: prediction using GPA
• A performance bound

Secondary Structure Prediction
• Predictions based on
  – Sequence information
  – Multiple sequence alignments
• Various algorithms present, based on
  – Information theory
  – Machine learning
  – Neural networks, etc.

Assessment of Accuracy
• Determination method: DSSP
• Performance metric: Q3 accuracy, i.e. three-state accuracy (helix/strand/loop)
• Data set selection: non-redundancy
• Homology information: multiple sequence alignments
• Cross-validation

Two Levels of Prediction
• First level: sequence to structure
  – Input: sequence information, multiple sequence alignments
  – Method: machine learning, neural networks
  – Output: secondary structure
• Second level: structure to structure
  – Input: structure information
  – Method: machine learning, neural networks, filters (simple filters, jury decisions)
  – Output: secondary structure

Decision Lists
• Machine learning method
• Simply, a list of rules
• Each rule asserts a guess
• Generalization by simple rule pruning
• Output is human readable/understandable

GPA
• Greedy decision list
• Start with a global (base) rule
• At every step
  – Find the maximum-gain rule
  – Prepend it to the previous list
• Stop when the gain drops to 0

Data Representation
• Frames of length W
  – The context of an amino acid is represented by W residues
  – (W-1)/2 to the left, (W-1)/2 to the right
  – Positions beyond the termini are represented as NAN
  – GLX = GLN, ASX = ASN
  – Newly found / non-natural amino acids = X

Sample Data
• evealekkv[aaLes]vqalekkvealehg -> helix
• Frame size = 5
• Represents the features used in predicting the secondary structure of L (leucine)

2-Level Algorithm
• Sequence-to-structure list
  – Find the first rule that matches the data point
  – Assign the output of that rule
  – Input: a frame of 9 residues
  – Output: secondary structure
• Structure-to-structure list
  – After all predictions are made, check for possible improvements
  – Input: a frame of 19 secondary structures
  – Output: secondary structure

GPA / PHD / GOR V
                       GPA      PHD     GOR V
Level 1, no homolog    60.48%   61.7%   N/A
Level 1, homolog       66.36%   65.9%   N/A
Level 2, no homolog    62.54%   62.6%   66.9%
Level 2, homolog       69.21%   67.4%   71.8%
Final                  69.21%   70.8%   73.4%

Discussion: Why GPA?
• Amazingly simple models, with as few as 20 rules in the first level and 20 in the second
• Rules (models) are human readable: biological rules may be inferred
• The second-level decision list may be used as a filter for other algorithms

A Performance Bound Claim
• Using only sequence information, the highest achievable performance has an upper bound
• The lower bound:
  – 43%, with everything assigned as loop
  – 49%, with every residue assigned the most probable structure
• The upper bound: 75%,
with non-homologous data

A Performance Bound Claim
• The bound is calculated by:
  – Taking only the exact sequence matches in the training and testing sets
  – Assigning the most common value of that frame in the training set as the guess
  – Comparing with the actual value
• A bound for non-homologous training and testing sets
• A bound for a carefully selected frame size
  – Not too short (assignments would be almost random)
  – Not too long (only unique frames will be available)

[Chart: Upper Bound with Homologs, repeated from earlier: the bound rises from about 43% toward 96% while coverage falls from 100% to roughly 12% as frame size grows from 1 to 9]
[Chart: Upper Bound without Homologs, repeated from earlier: the bound levels off around 72-74% while coverage collapses below 1%]

Tertiary Structure Prediction
• Predictions based on backbone dihedral angles
  – Phi and psi angles fully define the tertiary structure
• Goal: discover the right level of granularity

Data Set Selection
• PDB-Select
  – A set of non-homologous proteins of high resolution [Hobohm & Sander, 1994]
• Data representation
  – Frames of 9 residues
  – Residue names plus residue properties (hydrophobicity, polarity, volume, charge, etc.)
  – Train/validation/test split

Data Discretization
• Phi/psi angles are continuous; we need a discrete representation to predict them with a decision list
• Split the (-180, 180) region into bins
• Split the Ramachandran plot into bins

Ramachandran Plots
[Figures: Ramachandran plots; * Karplus, 1996]

How to Predict?
• Predictions using sequence information, no homology information
• Predicted angles may be incorporated; upper bounds will be given
• Accuracy measured as
  – Percent of correct estimates
  – RMSD of phi and psi angles

Using Predicted Angles
[Figure]

Performance: Accuracy (percent correct at each tolerance, in degrees)

                 ------------ Phi ------------  ------------ Psi ------------
                 15    30    60    90    120    15    30    60    90    120    Region  Secondary
All   All       37.27 51.22 64.87 77.15 80.51  32.79 52.06 68.81 76.04 80.44   58.75   71.82
      Same      31.90 44.52 61.44 68.44 78.26  30.47 49.05 64.97 71.99 77.42   58.04   73.05
      Identical 31.56 43.89 61.37 66.81 78.12  29.84 48.39 64.72 72.00 76.64   56.40   71.40
      None      29.40 42.58 61.23 59.92 78.11  22.83 35.69 49.85 53.90 60.77   39.38   53.18
Right All       36.64 50.36 63.57 76.84 79.51  31.08 48.29 65.22 71.74 75.58   57.20   69.60
      Same      31.64 43.81 61.66 67.24 78.19  29.89 47.41 63.74 70.16 74.81   56.47   69.51
      Identical 31.49 43.82 61.39 66.44 78.10  29.47 47.35 63.57 69.56 74.56   56.60   69.42
Left  All       32.33 46.78 62.27 68.14 79.05  31.60 49.87 67.67 75.57 79.78   54.96   69.21
      Same      31.23 44.29 61.48 66.72 78.16  28.47 45.96 63.34 70.83 76.26   54.60   68.66
      Identical 31.03 43.32 61.32 66.61 78.12  28.64 45.80 63.35 70.22 74.94   55.07   68.59

Performance: RMSD (degrees)

                 ------------ Phi ------------  ------------- Psi -------------
                 15    30    60    90    120    15     30     60     90     120    Region  Secondary
All   All       42.95 44.37 46.67 51.13 52.98  66.38  66.01  71.23  70.84  53.07   56.93   48.11
      Same      47.68 46.90 50.95 57.37 55.95  71.30  69.56  75.38  75.21  57.02   57.56   47.09
      Identical 47.70 47.42 51.19 58.07 56.13  72.25  70.15  75.80  75.19  57.99   60.64   48.60
      None      49.24 48.45 51.60 61.11 56.15  99.69  101.12 103.49 106.83 75.16   79.98   65.34
Right All       43.41 44.95 47.71 51.49 54.31  73.97  72.02  75.54  75.64  59.30   59.31   50.37
      Same      48.17 47.06 50.24 58.33 56.04  74.24  74.37  78.21  78.64  60.23   58.96   50.50
      Identical 48.26 47.72 51.08 58.73 56.16  74.13  73.57  78.58  80.03  60.52   58.94   50.61
Left  All       45.41 45.48 49.19 56.92 54.92  69.45  67.86  73.14  71.45  53.96   59.59   50.62
      Same      47.27 47.06 50.80 58.94 56.08  73.92  72.82  77.88  77.72  58.47   60.26   51.24
      Identical 47.37 48.56 51.39 59.01 56.13  73.15  72.47  78.00  78.72  60.07   59.73   51.32
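The angle discretization described above (splitting the (-180, 180) range into bins so a decision list can predict a discrete label) can be sketched as follows. The bin width and the bin-center decoding are illustrative choices, not the exact scheme evaluated in the talk.

```python
# Discretize continuous phi/psi backbone angles into equal-width bins.
# width is a free parameter; the talk evaluates tolerances of 15/30/60/90/120
# degrees, but this function itself is only an illustrative sketch.
def angle_bin(angle, width=30):
    """Map an angle in degrees, in (-180, 180], to a bin index 0..(360/width - 1)."""
    return int((angle + 180.0) % 360.0 // width)

def bin_center(index, width=30):
    """Representative (center) angle for a bin, used as the decoded prediction."""
    return -180.0 + width * (index + 0.5)
```

Predicting the bin index turns the continuous regression problem into the same kind of classification the secondary-structure decision lists already solve; the cost is a quantization error of at most half the bin width.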
Performance: Backbone RMSD
[Figure]

Performance: Input Features
[Figure]

Performance: Real Prediction
[Figure]

Future Work
• For tertiary structure predictions:
  – The two-level approach may be applied to tertiary structure predictions
  – Homology information may be incorporated
• For secondary structure predictions:
  – Find better homologues and better representations
  – Incorporating sequence and homology information in the structure-to-structure step may be an option
• For both predictions:
  – A reliability index for the predicted structure
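To tie the pieces together, here is a minimal sketch of the frame representation and first-match decision-list prediction described in the talk: length-W contexts are cut around each residue, positions beyond the termini are padded with "NAN" as on the data-representation slide, and the first matching rule wins. The single rule used below is a made-up illustration, not the learned GPA model.

```python
# Frame extraction plus first-match decision-list prediction (sketch).
def frames(seq, w=9):
    """Yield the length-w context around each residue, padded with 'NAN'."""
    pad = (w - 1) // 2
    padded = ["NAN"] * pad + list(seq) + ["NAN"] * pad
    for i in range(len(seq)):
        yield padded[i:i + w]

def predict(seq, rules, default="-"):
    """Apply a decision list: the first rule whose test matches assigns the state."""
    out = []
    for frame in frames(seq):
        for test, label in rules:
            if test(frame):
                out.append(label)
                break
        else:
            out.append(default)   # no rule fired: fall back to loop
    return "".join(out)

# Hypothetical rule, for illustration only: 'no proline or glycine anywhere
# in the frame -> helix'. The real GPA rules constrain individual positions.
toy_rules = [(lambda f: not set("PG") & set(f), "H")]
```

The structure-to-structure level has the same shape: the same `predict` loop run over frames of 19 predicted states instead of 9 residues.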