De Novo Peptide Sequencing from Matrix-Assisted Laser Desorption/Ionization-Time of Flight Post-Source-Decay Spectra

by

Tony Liang Eng

Bachelor of Science in EECS, MIT, 1992
Master of Science in EECS, MIT, 1994
Bachelor of Science in Mathematics, MIT, 1996
Bachelor of Science in Biology, MIT, 1998

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2001

© Tony Liang Eng, MMI. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, February 2, 2001
Certified by: Tomas Lozano-Perez, Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Arthur Smith, Chairman, Department Committee on Graduate Students

Abstract

With the explosion of research activity in genomic and proteomic bioinformatics, there is an increased demand for rapid protein sequencing algorithms, and mass spectrometry (MS) has been explored as a possible tool for aiding in this process [YME96]. Most sequencing from tandem mass spectra relies on either some form of comparison to a database of known peptides, or manual sequence inference by human analysis of spectra.
Such approaches encounter difficulties when presented with the spectra of unknown and novel proteins not catalogued in a database, or with complex spectra that do not easily lend themselves to manual interpretation. A few de novo approaches exist, but their performance is sensitive to noise and gaps in the dataset, and their scoring methods lack a formal framework for reasoning about the answer produced.

We propose a new approach that involves a probabilistic model for peptide fragmentation, a scoring function based on this model, and a simulated annealing search based on this scoring function. Our algorithm takes as input the original mass and the MALDI-TOF PSD mass spectrum of the peptide to be sequenced, and finds the amino acid sequence consistent with the best interpretation of the spectrum under the proposed model. If the model is good and the dataset is sufficient, then the real sequence scores optimally, and simulated annealing, under the appropriate searching conditions, will converge onto this sequence. We found that a simple model was sufficient to correctly predict the sequence of short peptides, and that our approach exhibited some resilience to noise and gaps in the data.

Thesis Supervisor: Tomas Lozano-Perez
Title: Professor of Electrical Engineering & Computer Science

Acknowledgments

This thesis has been long in the making: it is not simply a product of my doctoral years, but of my entire time at MIT. Consequently, there are many people to thank, and I will no doubt inadvertently omit several who deserve to be recognized and thanked. There are several categories of people that deserve mention.

First and foremost, I am grateful for the availability and support of my thesis committee. I have enjoyed having Professor Tomas Lozano-Perez as my thesis advisor. It was a wonderful research experience, and I respect him for his insight, counsel and passion for research.
Professor Paul Matsudaira suggested the de novo peptide sequencing problem, and I always felt he was "on my side, rooting for me" from his encouraging remarks and his support, both morally and financially through the MIT Whitehead Training Grant in Genomic Science. Professor Eric Grimson has been supportive, accommodating and affirming as a committee member, but also as the lecturer of a class I was pleased to be a part of. 6.001 has been a large part of my graduate life, and I am grateful for the chance to work with and learn from Professors Duane Boning and Eric Grimson (both of whom have inspired me in my teaching), TAs Robbin Chapman, Aileen Tang, Kyle Ingols, and of course the students themselves.

Professors F. Tom Leighton and Daniel Kleitman deserve special mention for their time and involvement in the earlier stages of this thesis (Sections E.1 and E.2 respectively). Thanks also to Professor Bonnie Berger for supplementing some of my support through the Program in Mathematics and Molecular Biology Graduate Student Fellowship and NIH/NHCRI HG00039.

Various other professors and colleagues have been helpful in providing information, lending a hand or giving advice. Thanks to Arnie Falick (Applied Biosystems) for three sets of data; Drs. Wishnok and Tannenbaum (MIT BEH) for the use of their mass spectrometer; Hong Bin Ni and Bryan Robinson (MIT); Kevin Hayden and Wade Hines (Applied Biosystems); Ivan Correia and James Pang (MIT Whitehead); Duane Boning (MIT MTL), Arthur Smith (MIT EECS) and Charles Leiserson (MIT LCS); David Williamson (IBM); Erik Winfree (Caltech); Ting Chen (USC); Bryan Che, Daniel Derksen and Tamara Williams (MIT) for use of pigtail, a laptop and asti-spumanti respectively. My palm pilot has also been indispensable during the thesis writing stages.
Thanks to the various administrators who are always rooting for us graduate students and who keep MIT running: Marilyn Pierce, Lisa Bella, Be Blackburn, Jill Fekete, Teresa Coates, Julie Ellis, David Jones and Bruce Dale.

My years at MIT have been punctuated with many faces who have walked beside me and made MIT more pleasant and bearable. At some point in time during my later PhD years, each of them, whether at MIT or from afar, has sustained me through one thing or another, in some fashion or another - a card, a hug, a smile, a prayer, a backrub or a meal. My grateful thanks to: Ona Wu, Kiet Van, Mona Lou, Jen Chen, David Stephenson, Nicole Lazo, Jeff Kuo, Jim Derksen, Irene Yeh, Anca Brad, the generations of Cross Products (especially the London year), Jesse Byler (thanks for helping with simulations), Vivian Cheung (you'll make a great mom!), Dan Shiau, Julie Gesch, Christine Ko, Connie Chang, Vanessa Wong, Ben Nunes, Bryan Che, David Robison, Jennifer Lee, Jane Hsu, Jennifer Tam Lin, Christina Park, Joonah Yoon, Buck Goh, Thomas Lee, Christian Sevilla, Eric Hsieh, Lawrence Chang, Susan Huang and the Doctor's Small Group. I am especially grateful to those who have been there through to the end, during the hard times and during the (6-month!) home stretch when thesis became all-consuming - your prayers and all the little ways you have cheered me up/on mean a lot to me.

Last but not least, to my loving parents who put their dreams aside so I could pursue mine - thanks for your belief in me and your solid support, patience and constant encouragement in all my endeavors.

As I reflect back on the PhD years, writing this moments before the thesis deadline, I thank God for all that has happened, for in many ways and by many people, I have been blessed.

Contents

1 Introduction
2 Mass Spectrometry
  2.1 Overview of MALDI MS
    2.1.1 Sample Preparation
    2.1.2 Desorption and Ionization of Analyte
    2.1.3 Ion Separation and Detection
    2.1.4 Useful Improvements and Variations
  2.2 Tandem Mass Spectrometry (MS/MS)
3 Protein Fragmentation
  3.1 MALDI-PSD Fragment Types
    3.1.1 Series Ions
    3.1.2 Internal Ions
    3.1.3 Immonium Ions
    3.1.4 Parent Ion
    3.1.5 Neutral Loss/Gain Variants
    3.1.6 Other Fragment Types
  3.2 Fragmentation Rules
4 Terminology and Concepts
  4.1 Fragmentation and Spectra
  4.2 Fundamental Graphs
    4.2.1 Purpose
    4.2.2 Construction
  4.3 De Novo Peptide Sequencing From Tandem Mass Spectra
  4.4 Notion of a Correct Sequence Prediction
5 de novo Protein Sequencing with Mass Spectra
  5.1 Chemical Sequencing
  5.2 Sequencing with Mass Spectrometry
    5.2.1 Enlisting the Aid of Fragment Types
    5.2.2 Persevering Despite the Effects of Noise
    5.2.3 Considering Different Interpretations of the Same Spectrum
6 Prior Work
  6.1 Four Categories of Approaches
  6.2 Database Search with MS
  6.3 Database Search with MS/MS
    6.3.1 Searching with Peptide Sequence Tags
    6.3.2 Evaluating Theoretically Predicted Spectra with Experimentally Obtained Spectra
  6.4 Hybrid Approaches
    6.4.1 Computational Approaches with a MS/MS Database Search
    6.4.2 Database Search with MS and MS/MS
    6.4.3 Database Search with MS or MS/MS
  6.5 Discussion of Database Approaches
  6.6 Computational Search with MS
    6.6.1 Ladder Sequencing with Mass Spectrometry
  6.7 Computational Search with MS/MS
    6.7.1 Sequence-to-Spectrum Categories
    6.7.2 Spectrum-to-Sequence Category
    6.7.3 Fundamental Graph (Global Spectrum-to-Sequence) Approaches
    6.7.4 Global Fundamental Graphs Approaches
7 Observations and Issues
  7.1 Spectrum-Related Issues
    7.1.1 Gaps
    7.1.2 Immonium Interference
    7.1.3 Mistaken Identities
    7.1.4 Under-/Over-Represented Families
    7.1.5 Experimental Peak Heights
    7.1.6 Mass Tolerances
  7.2 Scoring Function Issues
    7.2.1 Uses of a Scoring Function
    7.2.2 Theoretical Peak Heights
    7.2.3 Award / Penalty System
    7.2.4 Accounting for Disallowed Variants
    7.2.5 Accounting for Internals
    7.2.6 Fragment Type Frequencies
8 Approach
  8.1 Solution Schematic
  8.2 Modelling the Fragmentation of a Single Molecule
    8.2.1 Model I: Modelling Series Ions
    8.2.2 Model II: Modelling Internal Ions
    8.2.3 Model III: Modelling Variants
    8.2.4 Model IV: Modelling Residue Tendencies
  8.3 Accounting for Noise
    8.3.1 Physical Noise
    8.3.2 Measurement Noise
  8.4 Assumptions
  8.5 From Model to Scoring Function
    8.5.1 Computing the Probability Mass Function
    8.5.2 Evaluating Sequence Guesses
    8.5.3 Scoring Function Maximum
  8.6 Exploring the Search Space
  8.7 Summary
9 Testing the Model and Its Scoring Function
  9.1 Training Data
    9.1.1 Observations of Training Spectra
  9.2 Training the Model
    9.2.1 Parameterizing the Model
    9.2.2 Training the Model Parameters
  9.3 Examination of the Trained PMF
  9.4 Scoring Guesses Against an Observed Spectrum
    9.4.1 Not the Real Sequence, but Still Correct
  9.5 Summary: Model Training
10 Testing the Simulated Annealing Search
  10.1 Search Convergence
  10.2 Sequence Prediction
  10.3 Different Restricted Sizes
  10.4 Simulated Annealing on Data Without Noise
  10.5 Exploration of Simulated Annealing Parameters
  10.6 Summary
11 Testing the Approach
  11.1 Leave One Out Cross-Validation
    11.1.1 Results for the Different Scenarios
    11.1.2 Investigation of the 1205 Dataset
  11.2 Meta-Analysis of Published Spectra
  11.3 Data from Another Center
  11.4 Summary
12 Discussion
  12.1 A Study of Two Longer Peptides
    12.1.1 Enlargement of the Training Set
    12.1.2 Refining the Model to Improve the Scoring Function
    12.1.3 Performance of the Different Variations
  12.2 A Study of Dataset Size
    12.2.1 Removing Low Intensity Peaks
    12.2.2 Removing High Intensity Peaks
    12.2.3 Removing Noise Peaks
  12.3 Summary
13 Conclusions
  13.1 Room for Improvement
    13.1.1 Improvements in the Data
    13.1.2 Improvements in the Model
    13.1.3 Improvements in the Search
  13.2 Looking Towards the Future: Longer Peptides
    13.2.1 Effect of Isotopes
  13.3 Summary
A Amino Acid Information
B Experimental Methods
    B.0.1 Sample Preparation
    B.0.2 Data Collection
C Experimental Data
  C.1 Dataset Peaks and Peak Identities
  C.2 Distribution of Fragment Types
D Data Peaks of Unknown Origin
  D.1 Do bradykinin spectra also contain unknown peaks?
  D.2 Could these peaks be due to the matrix?
  D.3 Is there any way to explain these peaks?
  D.4 Might these unknowns be related to each other?
  D.5 Keep in Mind...
E Visits to the Drawing Board
  E.1 Understanding the Problem
    E.1.1 Understanding the Acquisition Process
    E.1.2 Understanding the Spectra
  E.2 Exploring Sequencing Algorithms
    E.2.1 Fundamental Graph-Based Approaches
    E.2.2 Expanding Islands of Certainty
    E.2.3 Bounding Partial Paths
F Scoring Function Maximum
Bibliography

List of Tables

6.1 Taxonomy of De Novo MS/MS Approaches
9.1 Peak Classification from [RYM95]
9.2 Model Parameters: Untrained and Trained Overall
9.3 Scores of Sequences with the Same Mass as Angiotensin for Datasets 0123 and 0119. A '*' denotes best score in a column. Sequences considered correct are the actual sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues flipped, listed fifth from last.
9.4 Scores of Sequences with the Same Mass as Angiotensin for Datasets 1205 and 0121. A '*' denotes best score in a column. Sequences considered correct are the actual sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues flipped, listed fifth from last.
9.5 Scores of Sequences with the Same Mass as Bradykinin. A '*' denotes best score in a column.
10.1 Results of Simulated Annealing Run on Different Datasets using an Untrained/Trained Model, and a Length-Preserving/Non-Length-Preserving Search
10.2 Simulated Annealing Results for Different Lengths. This table lists the ten predictions made by a search, restricted to sequences of length 8 to 13 inclusive, on the 0123, 0119 and 1205 datasets.
10.3 Simulated Annealing Results for Different Lengths (cont). This table lists the ten predictions made by a search, restricted to sequences of length 8 to 13 inclusive, on the 0121, 0220 and 0218 datasets.
10.4 Simulated Annealing of Datasets Without Noise
10.5 Exploring Simulated Annealing Parameter Space
10.6 Exploring Simulated Annealing Parameter Space (cont)
10.7 Exploring Simulated Annealing Parameter Space (cont)
11.1 Model Parameters for the Different Scenarios. The values of the Overall Model from Table 9.2 are included for ease of comparison.
11.2 Results of Running Simulated Annealing with Model Parameters from the Different Scenarios
11.3 Results of Running Simulated Annealing with Model Parameters Trained With All Angiotensin and Bradykinin Datasets Except 1205 (AllBut1205)
11.4 Model Parameters When Trained With a Single Dataset
11.5 Results of Running Simulated Annealing with Each Model of Table 11.4 on Its Training Set
11.6 AllAngioBut1205: Trained Parameter Values
11.7 Results of Running Simulated Annealing with Table 11.6 Model Parameters on the Other Datasets
11.8 Trained Parameter Values When Normalizing All Six Datasets
11.9 Results of Running Simulated Annealing When Datasets Are Normalized (method I)
11.10 Results of Running Simulated Annealing When Datasets Are Normalized (method II)
11.11 Factory Angiotensin: Trained Parameter Values
11.12 Results of Running Simulated Annealing on Factory Angiotensin with Various Trained Models
11.13 Results of Simulated Annealing Run on Datasets from the Literature Using a Model Trained Without 1205 (AllBut1205). The 1375.8 dataset was the only dataset for which extra peaks were not inferred.
11.14 Results of Simulated Annealing Run on Datasets from Applied BioSystems Using a Model Trained Without 1205
12.1 Model Parameters for Leave One Out Cross-Validation
12.2 Results of Leave One Out Cross Validation with the Eight Datasets
12.3 Trained Model Parameters: Overall Training Set Plus 830.4, 1237.5 and 1948.1
12.4 Results of Running Simulated Annealing On All Datasets with the Model Parameters of Table 12.3.
12.5 Results of Various Simulated Annealing Runs on Dataset with M+H 1758.9
12.6 Results of Various Simulated Annealing Runs on Dataset with M+H 1948.1
A.1 Basic Residues, their Frequencies and Masses
C.1 Angiotensin Dataset: data/012360c/unprependedpeaks
C.2 Angiotensin Dataset: data/011959adata/unprependedpeaks
C.3 Angiotensin Dataset: data/120598b/120598b data
C.4 Angiotensin Dataset: data/012170c/unprependedpeaks
C.5 Bradykinin Dataset: data/022064c/unprependedpeaks
C.6 Bradykinin Dataset: data/021829c/unprependedpeaks
C.7 Ions Present in Data: Counts in parenthesis represent counts when using a model without the refinement of Section 12.1.2.
D.1 Angiotensin Peaks of Unknown Identity: when a dataset contains an unknown, the height of the peak is listed. The height of the parent ion is included in the last row of the table for reference.
D.2 Number of times each unknown appears in training datasets (out of 10).
D.3 Bradykinin Peaks of Unknown Identity. Note that there are two 0218 experimentals (of height 462 and 308) that have the same checkpoint value of 71.0376.
D.4 Unknowns that are a Residue Distance Apart: recall that the sequence for angiotensin is DRVYIHPFHL.
D.5 Angiotensin Peptides and Consistent Path Nodes

List of Figures

2-1 Linear Mass Analyzer
2-2 Tandem Mass Spectrometer: (A) only molecules of the desired parent mass are allowed to pass through the timed ion selector, (B) post source decay continues, (C) fragment ions are detected
3-1 Peptide Fragment Ions Common to MALDI-PSD. Recall that a fragment must be positively charged to be detected - the H+ accompanied by a bracing line indicates that a proton has affixed itself to some part of the molecule encompassed by the bracing line.
6-1 Classification of Different Protein Sequencing Approaches
6-2 MS and MS/MS spectra: MS is simply a way to separate pieces of the peptide resulting from proteolytic digestion by mass. MS/MS is a means to home in on a particular mass to generate random non-specific fragmentation.
7-1 Theoretical Masses and Experimental Peaks: Region A contains those theoreticals that are absent from the experimental spectrum, Region B contains matched peaks and Region C contains the unaccounted experimentals.
8-1 Schematic of Key Algorithmic Components
8-2 Model I: Basic Fragmentation Tree. A single break produces only prefixes and suffixes.
8-3 Model II: A partial Model II fragmentation tree showing the second stage of cleavage for two Model I leaves. When two breaks are possible, immonium and internal ions are added to the repertoire of fragment types.
8-4 Model III: Each Model II leaf may express a variant and the decision process is shown here only for two leaves from Figure 8-3.
9-1 Overview of Matrix Layout: Ideally, every matrix entry would be a parameter, but only a few regions have been parameterized and singled out for estimation - the non-break tendencies, and the Histidine (H) and Proline (P) residue dependencies. All entries within the same shaded region are assumed to have the same likelihood in the current model. Entries marked NA are not possible.
9-2 PMF for Angiotensin Using an Untrained Model
9-3 PMF for Angiotensin Using a Trained Model
9-4 Trained Model Scores of Sequences Guesses for Angiotensin: Datasets 0123 (points 0-40 along the x-axis), 0119 (40-60), 1205 (80-120) and 0121 (120-160). Scores at zero were inserted to separate each dataset.
9-5 Trained Model Scores of Sequences Guesses for Bradykinin: Datasets 0220 (points 0-40) and 0218 (40-80). Scores at zero were inserted to separate each dataset.
10-1 Simulated Annealing Moves for 0119 Dataset. The x-axis represents the progress of the search, and the y-axis is the score of each successive move.
10-2 Simulated Annealing Moves for 0220 Dataset. The x-axis represents the progress of the search, and the y-axis is the score of each successive move.
12-1 Removal of Lowest Intensity Peaks
12-2 Removal of Highest Intensity Peaks
12-3 Removal and Subsequent Addition of Noise Peaks
A-1 Basic Residue Structure: The side-chain R of a residue hangs off of the α-carbon. An amide bond joins the α-carboxyl group of one residue to the α-amino group of an adjoining residue, polymerizing multiple basic residues into a peptide.
C-1 PSD for 0123 Angiotensin
C-2 PSD for 0119 Angiotensin
C-3 PSD for 1205 Angiotensin
C-4 PSD for 0121 Angiotensin
C-5 PSD for 0220 Bradykinin
C-6 PSD for 0218 Bradykinin
C-7 Manufacturer PSD for Angiotensin
D-1 Graphical Representation of Residue Relationships Between Unknowns

Abbreviations and Acronyms

CID      Collision-Induced Dissociation
ESI      Electro-Spray Ionization
FAB      Fast Atom Bombardment
MALDI    Matrix-Assisted Laser Desorption/Ionization
MS       Mass Spectrometry
MS/MS    Tandem Mass Spectrometry
PSD      Post-Source Decay
TOF      Time-Of-Flight
m/z      mass per charge ratio
P        peptide
S        mass spectrum
f_i      i-th ion family
M+H      mass of the original peptide that is singly protonated
P()      probability
m_i      mass of peak i
h_i      height/intensity/abundance of peak i
F()      probability mass function (PMF)
pico     10^-12
femto    10^-15
atto     10^-18

Chapter 1

Introduction

Proteins are essential to life, playing key roles in all biological processes: from enzymes that catalyze reactions, to antibodies in an immune response, from messengers in signaling pathways that allow a cell to react to stimuli, to secreted messengers that effect extracellular changes, and much more. Protein function is thus central to the survival of any organism.

One of the first steps in understanding a protein is discovering its primary structure. Knowledge of the primary sequence characterizes the protein, offering a glimpse of what it does (its role and functionality), where it does it (its targeted destination) and how it does it (its active sites, specificity and structural motifs).
Protein sequencing is the process by which this primary structure, the identity of each amino acid residue in order of appearance from one terminus to the other, is enumerated. A protein can be easily sequenced using the genetic code if the corresponding cDNA sequence is known. If, on the other hand, only the genomic DNA were available, one would not be able to predict the amino acid sequence with 100% certainty because post-transcriptional and post-translational modification events cannot be completely predicted from a genomic sequence [BS87, Yat85]. If, however, one were able to somehow correctly deduce four or five consecutive residues, then one might find sequence information by probing a cDNA library or searching a database of previously sequenced proteins.

With a newly-discovered protein, however, sequence information must be determined from the actual protein itself. This process is said to be de novo, done "from scratch", since no database contains an entry to look up and no genetic information may be available to consult.

Mass spectrometry has been explored as a possible tool for aiding in de novo sequencing. It is a fast and convenient means for sorting a mixture of molecules by mass and reporting the result as a histogram of masses and counts for each mass. Tandem mass spectrometry is based on a two-stage mass spectrometer: molecules are sorted by the first stage, only those molecules of a particular specified mass are allowed to pass into the second stage, non-specific fragmentation of the selected molecules produces various daughter ions, and these fragments are sorted by the second stage, yielding the desired tandem mass spectrum.

Various de novo computational approaches for sequencing from tandem mass spectra have been proposed in the literature.
They take as input the tandem mass spectrum and the mass of the peptide to be sequenced, and all have a common strategy that involves some means for generating a guess for the sequence and some means for evaluating how good each guess is. Scoring functions in the literature incorporate factors that are reasonable properties of good matches, but the choice of these factors and how they are combined is often based on empirical observations and arbitrary decisions. Only Dancik et al. [DAC+99b, DAC+99a] have attempted to build a more formal framework for reasoning about the scoring function and what a score means. Nonetheless, all approaches are sensitive to the quality of the observed spectrum, and encounter difficulties when the input spectrum is noisy and incomplete. And despite the existence of these algorithms, the majority are not widely used [DAC+99b].

This thesis serves as a proof of concept for a new de novo sequencing algorithm. An overview of our approach is as follows:

Fragmentation Model: We propose a probabilistic model of peptide fragmentation. Given a peptide sequence, this model produces a probability distribution that describes the masses of all possible ions that can result when a single peptide molecule of this sequence is fragmented.

Scoring Function: Given this distribution of outcomes for an individual trial, we derive a multinomial-based scoring function. If we consider a tandem mass spectrum as the cumulative outcome of many such independent trials, then we can compute the probability that a particular set of outcomes is observed.

Searching Method: A search strategy is formulated based on simulated annealing, a well known combinatorial optimization technique [KGJV83]. Simulated annealing efficiently samples the space of possible sequence guesses, identifying the one that scores optimally.
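To make the multinomial view of a spectrum concrete, the following is a minimal Python sketch and not the thesis implementation. It assumes that the fragmentation model has already been reduced to a probability mass function over discrete outcomes (the mass-bin labels below are hypothetical) and that peak heights can be treated as integer counts of independent trials. The function name `multinomial_log_score` and the toy numbers are illustrative assumptions.

```python
import math

def multinomial_log_score(pmf, counts):
    """Log-probability of `counts` under a multinomial with cell probabilities `pmf`.

    pmf:    dict mapping an outcome label (e.g. a mass bin) to its probability
    counts: dict mapping outcome labels to non-negative integer counts
    """
    n = sum(counts.values())
    log_p = math.lgamma(n + 1)            # log(n!)
    for outcome, k in counts.items():
        if k == 0:
            continue                      # zero counts contribute nothing
        p = pmf.get(outcome, 0.0)
        if p == 0.0:
            return float("-inf")          # observed an outcome the model forbids
        log_p -= math.lgamma(k + 1)       # minus log(k!)
        log_p += k * math.log(p)          # plus k * log(p)
    return log_p

# Toy example (hypothetical bins and heights, for illustration only).
toy_pmf = {"bin_a": 0.6, "bin_b": 0.3, "bin_c": 0.1}
toy_counts = {"bin_a": 5, "bin_b": 3, "bin_c": 2}
print(multinomial_log_score(toy_pmf, toy_counts))
```

Working in log space avoids numerical underflow when the total count is large. A search procedure such as simulated annealing would then compare these log scores across candidate sequences, each of which induces its own probability mass function.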
In this thesis, we show that if the model is good and the data are sufficient, then the real sequence scores optimally, and simulated annealing, under appropriate searching conditions, will find this sequence. A simple model was sufficient to correctly predict the sequence of short peptides, up to mass equivalence and initial dipeptide inversion, with some tolerance of noise and gaps in the datasets we had available.

This thesis is organized into three major parts. The first is an introduction to mass spectrometry (Chapter 2) and protein fragmentation (Chapter 3), and may be safely skipped if the reader is familiar with these physical/chemical processes. The second part begins with definitions and concepts (Chapter 4) used in this thesis, and then surveys existing de novo sequencing approaches (Chapter 5), with an emphasis on those computational approaches that involve the use of mass spectra (Chapter 6). The last part of this thesis contains our main contributions. After some remarks on the sequencing problem and issues that a solution should address (Chapter 7), we propose a new approach to de novo sequencing (Chapter 8), and test the model and scoring function (Chapter 9), the searching strategy (Chapter 10), and the predictions of our algorithm (Chapter 11). Chapter 12 discusses the sequencing of longer peptides and the effects of dataset size, and finally, Chapter 13 concludes this thesis.

Chapter 2 Mass Spectrometry

For more than a decade now, mass spectrometry (MS) has been used extensively for studying biopolymers such as proteins, oligonucleotides and carbohydrates. It is a useful tool for molecule detection, structural analysis, compositional analysis and, more recently, sequence prediction. For example, it can be used to determine the molecular weight of a molecule, and it can be used to compare the weight of a peptide product to that predicted by a gene sequence for verification or for intron/exon and modification discovery purposes.
Central to MS is the ability to obtain a "mass signature" of a given sample: a spectrum that tabulates the range of masses present and the abundance (also called height or intensity) of each such mass. There are two major components of a mass spectrometer: the ionization technique and the mass analyzer. The ionization method (e.g. Fast Atom Bombardment (FAB), ElectroSpray Ionization (ESI) and Matrix-Assisted Laser Desorption Ionization (MALDI)) takes the biopolymer of interest and forms gas-phase charged ions. Different ionization methods often produce different species of ion fragment types. The mass analyzer portion of the mass spectrometer allows for selection and analysis of the ions created by the ionization process. Examples of mass analyzers are the triple quadrupole, quadrupole ion trap and time-of-flight (TOF) mass spectrometers.

MALDI-TOF, MALD ionization combined with a TOF analyzer, is a widely-used combination [Zen97b] because of its attomole to picomole sensitivity, tolerance of mixtures and salt conditions, and ability to handle low-purity protein samples in excess of 300,000 Da in size [Kau95, FS96] (a practical limit; time-of-flight mass spectrometry imposes no theoretical limit [Sch97], although the error in mass measurement increases for higher masses). We also chose to work with MALDI-TOF because of the availability and accessibility of MALDI-TOF hardware and its ease of use (only a relatively short training time is needed before a neophyte can begin to acquire spectra, although spectra quality increases with skill and experience [Sch97]).

Since we are interested mainly in the output spectrum, and the rules governing fragmentation are largely unknown and incomplete (see Section 13.1.2), we treat the mass spectrometer as a black box. However, it will be useful to give a high-level description of MALDI-TOF to convey the basic intuition behind how the process works. For more thorough treatments, see [ZGW95, BC96, Yat85, BBG96, Kau95].
2.1 Overview of MALDI MS

2.1.1 Sample Preparation

Sample preparation is largely an experimental art rather than a methodical science, so there are countless variations to this recipe [BC96]. In general, the biopolymer analyte of interest is combined with an excess of a compound called the "matrix", usually a compound of low molecular weight, diluting the analyte and producing a sample that is slightly acidic (pH less than 4 [BC96]). The choice of matrix is one variable in the recipe, determined by such factors as the solubility of the matrix and analyte, its absorptive spectrum, and the inability of matrix and analyte to react to form a stable product [Bea92]. Unfortunately, which compounds make good matrices is often discovered only by "trial-and-error" [HKBC91, FS96]. Nevertheless, whatever is chosen as the matrix must serve its primary function well: it must successfully facilitate the transfer of charge to analyte molecules during the ionization process.

[Figure 2-1: Linear Mass Analyzer. Labeled components: prism, laser, sample plate, ion source, gas plume, charged fragments, accelerating grids, flight tube, detector, MS spectra.]

2.1.2 Desorption and Ionization of Analyte

Once the sample is prepared, it is deposited onto a sample plate and allowed to dry. The plate is then inserted into the mass spectrometer, and is illuminated by a laser. When the pulsed laser hits the solid crystalline matrix-analyte sample (see Figure 2-1), a small area is vaporized, and charged molecules are formed in the ensuing gaseous plume. The exact ionization mechanism is unknown [Bal95, Zen97a, Zen97b], but it is believed that the matrix enhances ion formation by somehow absorbing the laser's energy and imparting it to the analyte.
It is thought to serve as a hydrogen source for protonation [BCC91], and use of a matrix appears to facilitate ionization, expanding the range of masses that can be easily ionized from about 1 kDa in early non-matrix MS to in excess of 300 kDa [HKBC91, FS96].

2.1.3 Ion Separation and Detection

Charged analyte ions are extracted from the plume and propelled into the mass analyzer portion of the mass spectrometer by an electric field (typically 1-30 kV [HKBC91]) that accelerates them to a constant kinetic energy. Although each ion has the same kinetic energy, the velocity of each ion is inversely proportional to $\sqrt{m/z}$, where $m$ is the mass of the particle and $z$ is the number of charges it carries. Because of this, smaller mass ions travel at a higher velocity than those of larger mass, and the collection of ions can be separated by mass while in flight. A detector is situated at the end of the flight path, and the idea is to infer each ion's mass based on its flight time. This principle is the hallmark of time-of-flight (TOF) mass analyzers¹.

Extremely short laser pulses (e.g. 1-100 ns), highly focused laser bursts (e.g. 10-300 µm) and the fact that the start of ion production can be triggered by the firing of the laser make MALDI highly compatible with TOF analysis because ions can be considered to have originated from a "point source in time and space" [HKBC91]. Thus, as each molecule in the ion train collides with the detector, the mass is derived from its flight time, and the collision event is recorded by incrementing a count that keeps track of the abundance of each mass. What results is a spectrum of peaks that is a histogram distribution of the masses (mass per unit charge) of all analyte ions dislodged by the impact of a single laser pulse and detected by the machine. To improve signal-to-noise, the typical output that one obtains from a mass spectrometer is actually the aggregate sum or an average of spectra from a number of laser pulses (at least 50) [BC96].
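To make the relationship between mass and flight time concrete, the sketch below computes the drift time of ions under an idealized linear TOF model in which every ion acquires kinetic energy z·e·V and then coasts over a field-free drift region. The 1 m drift length and 20 kV accelerating potential are assumed values chosen only for illustration (the latter lies within the 1-30 kV range quoted above), the function names are hypothetical, and effects such as the initial velocity spread and the time spent in the acceleration region are ignored.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms per dalton

def flight_time(mass_da, charge=1, voltage=20_000.0, drift_length=1.0):
    """Idealized drift time in seconds: the ion gains kinetic energy
    z*e*V, then coasts over drift_length metres. Illustrative values only."""
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage
    velocity = math.sqrt(2.0 * kinetic_energy / (mass_da * DALTON))
    return drift_length / velocity

def mass_from_flight_time(time_s, charge=1, voltage=20_000.0, drift_length=1.0):
    """Invert flight_time: recover the mass in daltons from the drift time."""
    velocity = drift_length / time_s
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage
    return 2.0 * kinetic_energy / (velocity ** 2) / DALTON

# Two singly charged ions whose masses differ by a factor of four.
t_light = flight_time(500.0)
t_heavy = flight_time(2000.0)
```

Under these assumptions, quadrupling the mass doubles the flight time, which is the square-root dependence described above, and the inverse function shows how a detector event at a given time is mapped back to a mass.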
2.1.4 Useful Improvements and Variations

The arrangement of a simple straight-line flight path from an ion source, through a mass analyzer, to a detector is characteristic of a linear analyzer. The spectrum produced by a linear analyzer may contain peaks with poor resolution. This is due to the fact that the ionization energy imparted by the laser cannot be precisely controlled. Thus fragments of the same mass have slightly different energies, exhibit a variance in flight time and cause the detector to register an event at mass values that are slightly shifted from the actual value. Use of an analyzer in reflector mode and use of delayed extraction are two means to partially counter this.

¹ Other types of analyzers include magnetic-sector, triple quadrupole, hybrid mass, Fourier-transform and ion-trap analyzers.

Operating in reflector mode instead of linear mode allows for some deviations in ion velocity to be corrected, resulting in narrower peaks and higher mass resolution². The reflectron is implemented using an electrostatic mirror, which is discussed further in Section 2.2. Typical mass spectrometers have an accuracy of about 0.1% [Yal00] and a resolution of 300-500 in linear mode [HKBC91], and an accuracy of 0.05% [Yal00] and a resolution of 1200-3000 in reflector mode [HKBC91] (another source claims up to 4000 for peptides up to 10 kDa on TOF reflectron instruments [KH93]).

Delayed extraction is another means for correcting variations in ion velocities. After desorption, faster ions move farther from the surface, so the position of an ion in the cloud is determined by its initial escape velocity off the surface [Mur96]. By introducing a submicrosecond (typically 100-300 ns [CM98]) delay after the laser pulse but before the acceleration voltage is applied, slow ions are allowed to effectively "catch up" to the faster ions. This has been found to improve both mass accuracy and resolution [Yat85].
Once again, the basic idea behind MS is the ability to separate molecules by mass into a spectrum. With tandem mass spectrometry, this principle is built upon to obtain spectra of a slightly different nature.

² Two measures of performance are useful to mention: mass accuracy and mass resolution. Mass accuracy indicates the percentage by which the experimental mass as detected by the mass spectrometer is off from the actual theoretical mass. Mass resolution reflects the mass spectrometer's ability to separate two peaks that are close in mass, and is calculated by dividing a peak's mass by its peak width at the half-intensity point.

2.2 Tandem Mass Spectrometry (MS/MS)

Molecules can sometimes, either naturally or through artificial induction, break up into pieces or fragments during the post-desorption, pre-detector time frame. These fragments (called daughter ions), though smaller in mass, still travel with the same velocity as the original intact molecule (called the precursor or parent), and hence would hit the detector at the same time as the parent. It would be useful to be able to examine the spread of daughter fragments by targeting them for MS, producing a second spectrum that depicts the spread of daughter ion masses and their abundances. This is known as tandem mass spectrometry and it yields the information that we are interested in for protein sequencing.

Figure 2-2 depicts a schematic of how this works. One can imagine a gate at the end of the first MS stage that stays closed, allowing no ions to pass beyond it. If a particular mass, called the "parent ion mass" or "precursor ion mass", is the desired target mass, the time at which it reaches the gate can be computed and programmed into the timed-ion selector. When the timed-ion selector is activated at this desired point in time, the gate is opened, allowing molecules arriving at the gate during this time frame to pass through.
This has the effect of homing in on one particular peak of the spectrum that would have been produced from a single stage of MS. All other peaks, such as fragments due to other molecules, contaminants and matrix molecules, are ignored. The gate acts as a filtering mechanism, allowing only ions flying at the specific target time to pass through into a second stage of MS, which then separates these fragments according to their individual masses. The end result of MS/MS is thus the spectrum resulting from these daughter ions that have arisen from fragmentation of the precursor ion.

With MALDI-TOF analyzers, fragmentation can occur from Post Source Decay (PSD). Ions are stable long enough to survive extraction from the source, but then dissociate into smaller fragments before reaching the detector. Dissociation can occur within the dense analyte plume or while in flight through the flight tube. It is the fragment ions produced from PSD that we are interested in and that will allow us to sequence the peptide. In addition, one may use Collision Induced Dissociation (CID) to enhance fragmentation by injecting inert gas particles into the ions' flight trajectory towards the detector. A parent analyte molecule fragments when it collides with a gas particle. However, CID tends to produce a repertoire of fragment types different from those obtained by PSD (see Section 3.1), and our work does not take them into account.

[Figure 2-2: Tandem Mass Spectrometer: (A) only molecules of the desired parent mass are allowed to pass through the timed ion selector, (B) post source decay continues, (C) fragment ions are detected.]

When an ion fragments, all its pieces travel at the same velocity as the original intact parent ion. Although their velocities are the same, their kinetic energies are not.
The kinetic energy of each daughter fragment is equal to the kinetic energy of the parent times the ratio of the daughter mass to the parent mass [KSL93]. The daughter fragments can be separated by using the reflector mode mirror, which serves to slow, deflect and re-accelerate ions towards the detection apparatus. The net effect is that ions with higher kinetic energies will be allowed to traverse a slightly longer trajectory because ions with high kinetic energy (and hence high mass) penetrate farther into the mirror and arrive at the detector later than those ions with lower kinetic energy.

This action of the mirror depends upon its hardness, which is governed by a parameter called the "mirror ratio". Different ratio settings allow for selective deflection, enabling a particular mass range of peaks to be detected. A lower mirror ratio facilitates the detection of low-energy (low-mass) fragments, and a higher mirror ratio that of higher-mass fragments. The spectrum collected at a particular mirror ratio, which we refer to as a "stitch", contains peaks that are correctly focused for a particular mass interval. To produce one complete "PSD composite" (a spectrum correctly focused for the entire mass range), spectra must be collected at several mirror ratio settings chosen so that the mass range covered by the stitches, taken collectively, is enough to span the entire range of the analyte's molecular weight. These stitches (rather, the portion of each stitch for which the peaks are focused) are then stitched together to arrive at the desired composite. Note, however, that the conditions under which each stitch was acquired may be slightly different (e.g. laser intensity). The final PSD spectrum, then, is basically a concatenation of individual stitches obtained at different mirror ratios, each of which is an aggregate sum of several laser pulses.
In this manner, since one can only focus on a portion of the entire mass range for each stitch, each stitch, and hence the final spectrum, is only a subset sampling of all fragments produced during the process.

For the most part, we will consider the mass spectrometer as a black box: given a peptide as input, a spectrum of fragment peaks is produced as output. Peptide fragmentation is discussed in the next chapter.

Chapter 3 Protein Fragmentation

Proteins are polymers made up of amino acid building blocks called residues. Appendix A contains the twenty basic amino acids¹, including their one-letter codes, molecular weights, and frequencies. Of these, two pairs of amino acids, leucine and isoleucine, and glutamine and lysine, have similar masses. Hence, sequencing algorithms based solely on mass will be unable to distinguish between members of these pairs. Amino acid sequences called peptides (or proteins) are strings taken over this alphabet of residues, and are written with each residue's one-letter code, from the N terminus to the C terminus when reading left to right.

When a peptide undergoes MALDI PSD fragmentation, one or more bonds within the molecule break, allowing the molecule to dissociate into two or more daughter pieces. Given one of these pieces, the bonds which were broken to create it determine the type of daughter fragment that results.

¹ Aside from the basic 20, other nonstandard amino acids exist (Table 2 of [FHM+93]), but they will not be considered in this thesis.

3.1 MALDI-PSD Fragment Types

A variety of fragment types can be produced when a peptide undergoes fragmentation, and we classify them as follows:

1. Series Ions
2. Internal Ions
3. Immonium Ions
4. Parent Ion
5. Neutral Loss/Gain Variants

The first three result from backbone bond cleavages. The last is a variation that can occur with any of the first four. Examples of the first four are illustrated in Figure 3-1 for a peptide of length 4.
The fragment names are due to [RF84].

3.1.1 Series Ions

Series ions are those fragments that are prefixes and suffixes of the parent peptide, i.e. they contain either the original amino (N)- or the original carboxyl (C)-terminus respectively. An ion is named by which bond was broken to produce the fragment, and which terminus of the original peptide was retained. Since there are three bonds that can be broken (the NH-Cα bond, the Cα-CO bond and the peptide bond) and two fragments that can result when one of these is broken (a prefix and a suffix), there are six different possible series ions (the A, B, C, X, Y and Z ions); however, MALDI PSD usually only generates A, B and Y ions (see Figure 3-1).

3.1.2 Internal Ions

Internal ions represent other subsequences of the peptide that are not prefixes or suffixes. These generally result from double cleavage of the peptide. The three types of internal ions that we will be concerned with, the YB, YA and XB ions, are named for the types of cleavages that produced their ends [Hin97].

[Figure 3-1: Peptide Fragment Ions Common to MALDI-PSD. Panels: parent ion (N-terminus to C-terminus), family ions (A ion and B ion prefixes, Y ion suffixes), internal ions (YA ion, YB ion) and immonium ion. Recall that a fragment must be positively charged to be detected - the H+ accompanied by a bracing line indicates that a proton has affixed itself to some part of the molecule encompassed by the bracing line.]

3.1.3 Immonium Ions

Immonium ions are internal ions of length one, and hence are the class of ions of the lowest mass. Their presence and even their absence can provide hints of the amino acid content of the peptide.
For example, no peak at mass 70 means that proline is definitely absent from the sequence, and a peak at mass 61 means that methionine is present [FHM+93]. In some cases, certain combinations of low mass peaks with certain threshold intensities (strong/medium/weak) have been found to be indicators of a residue's presence. But what these combinations are and how to interpret them is still an area of investigation [FHM+93]. No comprehensive systematic study of immonium ions and sequence correlation has yet been done for PSD, but some general observations (for PSD and CID spectra) have been made in [SC98, FHM+93, MB94, Pap95]. Note that the immonium ions of the N-terminal residue are actually prefix series ions.

3.1.4 Parent Ion

The parent ion is the mass of the intact peptide of interest. It is also the fragment whose mass is selected during the first MS stage to continue on to secondary fragmentation. This ion is the maximal Yion, since it is a suffix ion containing the original C terminus. Its mass is also the same as that of the maximal Bion with the addition of an extra water molecule; the masses are the same, but whether or not it can be classified as such in reality depends on whether or not this Bion is allowed to neutrally gain water. The actual or real sequence of an experimentally obtained spectrum refers to the parent ion, the peptide that actually produced the spectrum.

3.1.5 Neutral Loss/Gain Variants

Unlike the fragment types discussed so far, the neutral loss/gain variations are not the result of peptide backbone bond cleavage, but rather, they involve the amino acid side-chains. Ions may exhibit a loss of ammonia, a loss of water, or a gain of water, and these are notated as "m17", "m18" and "p18" respectively, where 17 and 18 represent adjustments due to NH3 and H2O, and the letter "m" denotes a loss (minus) and "p" a gain (plus).
The literature offers an assortment of rules for when a variant is allowed and when it isn't (see Section 3.1.5 and 8.2.3). The exact mechanism for how these variants are produced is not known (e.g. is the B ion a precursor to a Bm17 ion?). But whether or not an ion is capable of a neutral loss/gain variant is dependent upon the amino acid composition, the location of particular residues, and even the fragment type of the ion. These observations are likely based on empirical studies; no conclusive systematic study has been done to investigate, explain or unify the different observations various researchers have made.

3.1.6 Other Fragment Types

Numerous other fragment types exist². They are less likely to appear in MALDI-PSD spectra and we do not concern ourselves with them in our work. Some of them are listed below to illustrate the range of different fragment types that can be produced and to emphasize that there are likely to be other fragment types that haven't yet been discovered, so identifying the origin of every peak appearing in a spectrum is frequently not possible.

² More than 20 ion types have been identified for high-energy CID [HFBG92] (actually in Martin, S. and Biemann, K., Int J Mass Spectrom Ion Processes, 1987, 78, 218-228).

Multiply Protonated Ions: In order for a fragment to reach the detector, it must be an ion (it must be charged). MALDI PSD ions are usually singly protonated molecules, but with other ionization methods, such as ESI, multiply protonated molecules are a common occurrence. Multiply charged peaks can provide an additional "dimension" of information, especially if the singly charged fragment is not present.

Isotope Peaks: A particular fragment ion may appear as a spread of peaks called isotope peaks that are at consecutive masses approximately 1 dalton apart (the mass of a neutron). This "mountain range" of peaks is most easily seen when the parent mass peak is magnified.
These peaks are due to the presence of isotopes, and all five elements found in proteins (C, H, N, O, S) have naturally occurring isotopes. Because the probability of finding a heavy isotope is low for small peptides, the leftmost peak is typically the highest in intensity and the heights of the other peaks in the range typically decrease rapidly with distance. This will not be true for longer peptides (see Section 13.2.1).

Alkalinated Ions: Impurities such as K+ and Na+ can attach to fragments, displacing peaks from their original mass [OSTV95].

Adduct Peaks: Adduct peaks can result when portions of the matrix (that break down into reactive components, for example) attach to a fragment (often the parent ion, causing a peak that is greater than the parent mass to appear [HKBC91, OSTV95]).

Side Chain Ions: Fragmentation of the amino acid side chain can also occur, producing fragments such as the D, V and W ions, which are useful for telling isoleucine apart from leucine. "R-group losses" for CID spectra can be found in Table III.10 of [Joh88].

In addition, there are other fragments that we haven't mentioned that possibly involve rearrangements [TBG90], and probably others that are yet to be discovered. With such a wide range of fragment ions possible, what are the rules or laws that govern the formation of all these ions? Are certain fragments more likely or unlikely to occur? How do the actual sequence and the position of its residues affect fragmentation?

3.2 Fragmentation Rules

There is still a great deal about the fragmentation process that is yet to be satisfactorily explained. Some things are known, though sometimes this knowledge is limited. For example, the chemical structures of the various fragment types (see Figure 3-1) are known but the mechanisms for how they form are not. Although some schemes have been postulated (e.g.
Scheme III.4 of [Joh88] for A ions in CID spectra; p.455 of [Bie90] for CID B and Y ions), observations are sometimes conflicting and often incomplete; no general unifying set of rules has yet been put forth. The extent of fragmentation (the intensity of ion formation and the fragmentation types observed) is probably dependent on several factors. These include:

Sample Preparation: According to [BC96], sample preparation is the "key to successful analysis". This includes matrix choice [KH93], relative concentration of matrix to peptide (to internal calibrant if any), etc.

Type of Mass Spectrometer: Different instruments produce different spectra. Collisions, and hence fragmentation, are more likely to occur with CID, for example. It is generally accepted that the MALDI-PSD fragment types are a subset of those present in high energy CID spectra [GME+95]. Although not as well studied and understood as CID spectra [ASR96, Bie90], PSD requires less energy but more time for fragmentation to occur [Hin97].

Spectrometer Settings: Laser intensity, probably the most influential setting [KH93], affects the potential amount of energy imparted to the precursor ion. Other settings such as the guide wire voltage, mirror ratios, etc. affect different aspects of acquisition.

Fragment Stability: Unstable products may decay too rapidly for detection to be possible.

Peptide Length: N-terminal fragments seem to dominate the spectra of small peptides [JnC96] while longer peptides seem harder to fragment even with higher laser energies [Fen91]. Effects of peptide chain length (possibly because of intra-molecular hydrogen bonding) and 3-dimensional peptide structure on fragmentation have been postulated by several groups [CGMW96].
Sequence Dependencies: Fragmentation is sequence-dependent; the identity of a residue and its location can affect fragmentation in a number of ways:

  Immonium Ions: Certain amino acids are more likely to exhibit characteristic immonium ions than others; for example, His, Ile/Leu, Phe, Pro and Tyr seem to express themselves strongly [FHM+93].

  Residue Specific Tendencies: Certain residues can be more reactive than others; for example, histidine and proline [LL95] exhibit more N-terminal breaks than most other residues. Additional supposed effects that various residues have on spectra are given in [Pap95].

  Protonation Sites: Fragmentation is highly dependent on the availability and location of protonation sites [CGMW96]. These sites acquire the positive charge which is prerequisite for detection, and their location determines which bonds may break and which of the peptide pieces are ionized. Likely protonation sites include the amino terminus, the carbonyl oxygen of peptide bonds, certain side chains, and basic residues.

  Basic Residues and their Location: In general, basic residues (arginine, histidine and lysine) are more readily protonated since the free electrons of the nitrogen(s) of the residue side chain are more apt to accept the H+ proton. The location and number of basic residues influence the type of ion fragments produced ([OSTV95, Bie90, CGMW96] for CID spectra). For this reason, when the basic residue is at the N-terminus (C-terminus), N-terminal (C-terminal) fragments will dominate the spectrum [MB94, LL95, KSL93]. Contrary to these observations, one analysis of MALDI-PSD spectra seems to show that the placement of a basic residue does not greatly affect the distribution of N-terminal and C-terminal fragments [RYM95].

The factors are numerous and diverse, but they are a taste of the elements at play in the larger fragmentation puzzle, of which only bits and pieces are known.
Chapter 4 Terminology and Concepts

At this point, we introduce some definitions that will be convenient to have and refer to in later portions of this thesis.

4.1 Fragmentation and Spectra

Although there is a wide range of different known fragment ions, by fragment type we mean any of the five ion types classified in Section 3.1. We will use the term variants to refer to the neutral loss/gain of water/ammonia, and the set of all series ions and their permissible variants are collectively referred to as the core ions.

The identity of an experimental peak consists of two pieces of information: a fragment type and a peptide sequence. The mass of this peptide sequence, with appropriate adjustments made for the terminal groups of the fragment type, explains the mass of the experimental peak. An experimental peak with more than one identity is said to have multiple identities since there is more than one way to explain this peak mass.

Let peptide $P = r_1 \ldots r_n$ be a string of residues $r_i$. Let $p_i$, $1 \le i \le n$, refer to the position between $r_i$ and $r_{i+1}$ (when $i = n$, $r_{i+1}$ is the C-terminal group). The ion family at $p_i$ is the set of all fragments that directly testify to some cleavage event at $p_i$. The immediate ion family associated with $p_i$, denoted $f_i$, consists of those family ions at $p_i$ that are core ions. Given the parent mass and the mass of any immediate family ion, the masses of all other immediate family members can be computed. Some of the mathematical relationships are given below in terms of the mass of the Bion member, where $(M+H)$ denotes the protonated parent mass; the remaining variant computations can be derived similarly:

\begin{align*}
\text{Aion} &= \text{Bion} - \mathrm{CO} \\
\text{Bm17} &= \text{Bion} - \mathrm{NH_3} \\
\text{Bm18} &= \text{Bion} - \mathrm{H_2O} \\
\text{Bp18} &= \text{Bion} + \mathrm{H_2O} \\
\text{Yion} &= (M+H) + H - \text{Bion}
\end{align*}

A spectrum is said to be representative if at least one family member from every immediate family is present. Every position of the peptide is then represented in the dataset. A spectrum contains a gap if there is at least one immediate family which is not represented.
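The family relationships above can be sketched in code as follows. This is a minimal illustration rather than part of the thesis's implementation: the nominal offsets (CO = 28, NH3 = 17, H2O = 18, H = 1) are an assumption suggested by the "m17/m18/p18" naming, the function names `immediate_family` and `is_representative` are hypothetical, and the 0.5 Da matching tolerance is an arbitrary choice.

```python
# Nominal mass offsets in daltons (assumed for illustration; exact
# monoisotopic or average masses could be substituted).
CO, NH3, H2O, H = 28.0, 17.0, 18.0, 1.0

def immediate_family(b_ion, parent_mass):
    """Masses of the immediate family members implied by one Bion mass.
    parent_mass is the protonated parent mass (M+H)."""
    return {
        "Aion": b_ion - CO,
        "Bion": b_ion,
        "Bm17": b_ion - NH3,
        "Bm18": b_ion - H2O,
        "Bp18": b_ion + H2O,
        "Yion": parent_mass + H - b_ion,
    }

def is_representative(peaks, b_ions, parent_mass, tol=0.5):
    """True if every immediate family (one per Bion mass in b_ions) has
    at least one member within tol of some observed peak; a family with
    no matching peak corresponds to a gap."""
    for b in b_ions:
        members = immediate_family(b, parent_mass).values()
        if not any(abs(m - p) <= tol for m in members for p in peaks):
            return False
    return True

family = immediate_family(534.24, 1296.68)
```

Here `b_ions` stands for the Bion masses at each cleavage position of a candidate sequence; the check simply asks whether the spectrum contains evidence for every position.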
An extended ion family for $p_i$, denoted $f_i'$, consists of the immediate family members plus certain internal ions which can also serve as evidence of a fracture at $p_i$. Internal ions related to $p_i$ fall into two categories: those with $p_i$ at the C-terminus (peptides of the form $r_k \ldots r_i$, $1 < k \le i$), and those with $p_i$ as the N-terminus (peptides of the form $r_i \ldots r_k$, $i < k < n$).

4.2 Fundamental Graphs

We define the notion of a fundamental graph, but defer discussion of its uses until Chapter 6.

4.2.1 Purpose

The idea is to transform an input spectrum containing a diverse range of fragment types into a list of fundamental peaks where each peak is of the same fragment type. Each fundamental is a representative of a potential ion family, and a natural way to depict neighboring ion families is with a directed acyclic graph where the nodes of the graph represent fundamental fragment masses and a directed edge connects a node of lower mass to one of higher mass if their masses are an amino acid distance apart. Since the edges along any path through the graph define an amino acid sequence, this graph can then be traversed to enumerate all possible sequence guesses consistent with the spectrum.

4.2.2 Construction

To build a fundamental graph, one needs to select a fragment type to serve as the fundamental and one has to select a subset of the family fragment types to be fundamental-generating or f-generating roles. Each experimental peak is then considered in every possible allowed f-generating role, and the corresponding fundamental mass is computed based on the mathematical relationships given above in Section 4.1. In this manner, a single experimental peak can give rise to a set of possible fundamentals, one fundamental per f-generating role.
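The construction just described can be sketched as follows, with the Bion chosen as the fundamental and { Aion, Bm17, Yion } as the f-generating roles. This is a hedged illustration, not the thesis's implementation: the nominal offsets (CO = 28, NH3 = 17, H = 1), the names `F_GENERATING_ROLES`, `fundamentals` and `group_fundamentals`, and the 0.1 Da merging tolerance are all assumptions made for the example.

```python
CO, NH3, H = 28.0, 17.0, 1.0   # nominal offsets in daltons (assumed)

# Each f-generating role maps an observed peak mass to the Bion mass
# (the chosen fundamental) that the peak would imply in that role.
F_GENERATING_ROLES = {
    "Aion": lambda peak, parent: peak + CO,
    "Bm17": lambda peak, parent: peak + NH3,
    "Yion": lambda peak, parent: parent + H - peak,
}

def fundamentals(peak, parent_mass, roles=F_GENERATING_ROLES):
    """One candidate fundamental per f-generating role for a single peak."""
    return {role: round(to_b(peak, parent_mass), 2)
            for role, to_b in roles.items()}

def group_fundamentals(peaks, parent_mass, tol=0.1):
    """Merge candidate fundamentals that lie within tol of their
    predecessor in sorted order, recording which experimental peaks
    support each merged fundamental."""
    pairs = sorted(
        (mass, peak)
        for peak in peaks
        for mass in fundamentals(peak, parent_mass).values()
    )
    groups = []
    for mass, peak in pairs:
        if groups and mass - groups[-1]["masses"][-1] <= tol:
            groups[-1]["masses"].append(mass)
            groups[-1]["supporters"].add(peak)
        else:
            groups.append({"masses": [mass], "supporters": {peak}})
    return groups

candidates = fundamentals(506.24, 1296.68)
nodes = group_fundamentals([506.24, 763.51], 1296.68)
```

Under these assumptions, a parent mass of 1296.68 and a peak at 506.24 yield the candidate fundamentals 534.24, 523.24 and 791.44. The greedy merge is only one simple way to decide that two nearby candidates are the same fundamental; any clustering with a mass tolerance would serve the same purpose.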
For example, assuming the parent mass is 1296.68, if the Bion is chosen as the fundamental, and the f-generating roles are { Aion, Bm17ion, Yion }, then the experimental peaks 506.24 and 763.51 give rise to the fundamentals { 534.24, 523.24, 791.44 } and { 791.51, 780.51, 534.17 } respectively. Some algorithm would need to detect and propose that 534.24 and 534.17 correspond to the same fundamental, and we say that the experimentals 506.24 and 763.51 both support (and hence are supporters of) this fundamental (footnote 1). Lastly, related fundamentals are a set of fundamentals (e.g. { 534.24, 523.24, 791.44 }) supported by the same experimental peak (e.g. 506.24).

Footnote 1: Although presented in a fundamental graph context, the terms "supporter" and "support" can apply more generally, and we will use them often simply to indicate that there is evidence in the spectrum that favors some particular interpretation or conclusion.

Thus, a dataset of $p$ peaks generates a graph with at most $cp$ fundamental nodes, where $c$ is the cardinality of the set of f-generating roles. Two fundamentals are special and of interest: the parent fundamental and the base fundamental. The parent fundamental is the fundamental consistent with the parent mass interpreted as a Yion. Much as the parent fundamental represents the entire peptide, the base fundamental represents the other extreme, the null peptide, and its value depends on what fragment type has been chosen as the fundamental. In the above example with the Bion as the fundamental, 1.0 would be the base fundamental.

Edges are created between any pair of fundamentals that differ by the weight of an amino acid residue. Each fundamental node and each edge may have scores associated with them (e.g. number of supporter peaks, sum of the intensities of supporter peaks, etc.). A path in the graph connects two nodes via a sequence of edges and intermediary nodes.
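The construction just described can be sketched in a few lines. The sketch below reproduces the worked example (parent mass 1296.68, Bion as the fundamental, f-generating roles Aion, Bm17ion and Yion). The rounded offsets (CO = 28.00, NH3 = 17.00, H = 1.00), the 0.3 Da merging tolerance and all function names are assumptions chosen for illustration; they are not taken from the thesis.

```python
# Rounded offsets that reproduce the worked example (assumed values).
CO, NH3, H = 28.00, 17.00, 1.00


def b_fundamentals(peak, parent_mass):
    """Bion fundamental implied by one experimental peak in each f-generating role."""
    return {
        "Aion": round(peak + CO, 2),
        "Bm17ion": round(peak + NH3, 2),
        "Yion": round(parent_mass + H - peak, 2),
    }


def merge_fundamentals(peaks, parent_mass, tol=0.3):
    """Group fundamentals lying within tol of their neighbor into one node.
    Returns a list of (mean mass, sorted supporter peaks)."""
    candidates = sorted(
        (mass, peak)
        for peak in peaks
        for mass in b_fundamentals(peak, parent_mass).values()
    )
    clusters = []
    for mass, peak in candidates:
        if clusters and mass - clusters[-1]["masses"][-1] <= tol:
            clusters[-1]["masses"].append(mass)
            clusters[-1]["supporters"].add(peak)
        else:
            clusters.append({"masses": [mass], "supporters": {peak}})
    return [
        (round(sum(c["masses"]) / len(c["masses"]), 3), sorted(c["supporters"]))
        for c in clusters
    ]


def edges(node_masses, residue_masses, tol=0.3):
    """Directed edges (low, high, residue) between nodes one residue mass apart."""
    return [
        (a, b, res)
        for a in node_masses
        for b in node_masses
        for res, m in residue_masses.items()
        if abs((b - a) - m) <= tol
    ]


print(b_fundamentals(506.24, 1296.68))
print(b_fundamentals(763.51, 1296.68))
for node in merge_fundamentals([506.24, 763.51], 1296.68):
    print(node)
```

With these two peaks the merge step yields four nodes. Besides the node near 534.2, the pair 791.44 and 791.51 is also merged: it corresponds to the alternative reading in which 506.24 is the Yion and 763.51 is the Aion. Such competing interpretations are what node and edge scores must arbitrate.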
To score a path, a common strategy is to sum the scores of the nodes (and/or edges) visited along the path (or the logarithms of these scores). Many variations of the fundamental graph appear in the literature: sometimes it is not constructed in its entirety; different approaches use different node and/or edge scoring functions; the methods for finding the best scoring path vary, etc. Nevertheless, the goal remains the same: find a path that connects the base fundamental to the parent fundamental. Such a path is called a complete path and it defines a complete sequence, one whose mass is the same as the parent mass. The hope (of algorithms employing the fundamental graph) is that with the appropriate scoring function and graph traversal algorithm, the best scoring complete path defines the correct sequence.

4.3 De Novo Peptide Sequencing From Tandem Mass Spectra

Let $S(P)$ (or $S$) be the tandem mass spectrum of peptide $P$. Let $m(P)$ be the mass of the protonated form of $P$, i.e.

$$ m(P) = m(\mathrm{Nterm}) + m(\mathrm{Cterm}) + m(\mathrm{H}) + \sum_i m(r_i) $$

where $m$ computes the mass of its argument, Nterm and Cterm are the N- and C-terminal groups, $r_i$ is the basic residue (see Appendix A) and H stands for hydrogen. De novo peptide sequencing from tandem mass spectra can be stated as:

Given $m(P)$ and $S(P)$ for some $P$, find $P'$ such that $m(P') = m(P)$ and $\mathrm{Prob}(S \mid P')$ is maximal (footnote 2).

An algorithm for de novo sequencing from tandem mass spectra should be designed with the following issues in mind:

Performance: A fast method that is inexpensive in terms of computational time (seconds/minutes of computer time, compared to hours/days of laboratory time) is desired. In addition, the method should involve a minimal number of laboratory steps, e.g. the acquisition of a tandem mass spectrum requiring no special chemical treatment of the peptide sample.

Robustness: The sequencing algorithm should be robust, able to tolerate some amount of noise in the form of extraneous peaks and data loss in the form of missing peaks.
Scalability: The algorithm should be able to accommodate the sequencing of longer peptides.

Reliability: The user should be able to reason about, or have some degree of confidence in, the validity of the predicted sequence.

Comprehensiveness: When possible, all portions of the spectrum (the range of masses, the intensities of the peaks, the distribution of the fragments and fragment types, etc.) should be accounted for and considered in the sequencing process.

Footnote 2: Another way to view this problem is: Given $S(P)$ and $m(P)$, find $P'$ such that $m(P') = m(P)$ and $c(P, P')$ is minimal, for some function $c$ that evaluates the similarity of $P$ and $P'$.

4.4 Notion of a Correct Sequence Prediction

When a de novo sequencing algorithm halts, it may make one or more predictions as to the sequence of the parent ion. Most of these predictions will be different from the real sequence, but some of them will be considered correct and acceptable predictions. The conditions under which a sequence prediction that differs from the real sequence is still considered correct stem from physical/chemical properties, and are as follows:

1. when an I appears in place of an L (or vice versa) in the real sequence,
2. when a K appears in place of a Q (or vice versa) in the real sequence, and/or
3. when the first two residues of the real sequence are interchanged.

The first two conditions are due to the fact that certain pairs of amino acids cannot be distinguished by their mass alone: leucine and isoleucine are isomeric (same atoms in a different arrangement, hence identical mass), and lysine and glutamine are isobaric (different atoms but nearly identical mass). Thus, a sequence guess is considered correct if all residues can be identified and ordered up to mass equivalence (footnote 3); for all intents and purposes, a guess and the real sequence that differ only in these ways are the same guess. The third condition reflects how difficult it is to observe the first Bion: researchers have found that the first Bion is "too weak for meaningful assignment" [SWM97]. Similarly, Yalcin et al. [YKC+95, YCPH96] found the B1 ion less favorable.
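The mass constraint of Section 4.3 and the three acceptance conditions listed above lend themselves to a short sketch. The residue table holds commonly tabulated monoisotopic residue masses and is an assumption here (the thesis keeps its own values in Appendix A), as are the terminal group masses (H and OH) and all function names.

```python
# Commonly tabulated monoisotopic residue masses in Daltons (assumed values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
H, OH = 1.00783, 17.00274  # assumed N-terminal (H) and C-terminal (OH) groups


def protonated_mass(seq):
    """m(P) = m(Nterm) + m(Cterm) + m(H) + sum of residue masses."""
    return H + OH + H + sum(RESIDUE_MASS[r] for r in seq)


def canonical(seq):
    """Collapse the distinctions that mass alone cannot resolve:
    I and L, K and Q, and the order of the first two residues."""
    s = seq.replace("I", "L").replace("Q", "K")
    return "".join(sorted(s[:2])) + s[2:]


def is_correct_prediction(guess, real):
    """True if guess equals real up to the three conditions of Section 4.4."""
    return canonical(guess) == canonical(real)


real = "DRVYIHPFHL"
print(round(protonated_mass(real), 2))
print(is_correct_prediction("RDVYLHPFHI", real))  # first two swapped, I/L exchanged
print(is_correct_prediction("DRFHPYLVHL", real))  # same composition, different order
```

Mapping both sequences to a canonical form, rather than comparing them position by position, handles any combination of the three conditions in a single comparison.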
Immonium interference, discussed later in Section 7.1.1, can also contribute to the reversal of the first two residues.

Footnote 3: Fragment types that occur in other kinds of mass spectra can allow for, for example, the differentiation of leucine and isoleucine [JMB88].

Chapter 5

De Novo Protein Sequencing with Mass Spectra

Current strategies for protein sequencing often involve proteolytic cleavage of the protein, isolation of each peptide product (e.g. by reverse-phase high-pressure liquid chromatography), sequencing of each digestion product, and then non-trivial assembly of these sequences to reconstruct the sequence of the original intact protein. In the case of a novel protein, the sequencing step is a de novo one which can be done chemically with certain reactions or computationally with mass spectrometry.

5.1 Chemical Sequencing

Early de novo sequencing efforts were chemical in nature, perhaps nothing more than a series of biochemical assays and analyses aimed at identifying residues based on some recognizable property. For example, acetylation could be used as a detector for lysine, and esterification for aspartic acid and glutamic acid [YME96]. One could, for example, acquire two sets of spectra, one with and one without the assay, and then compare the two datasets [CGAP99]. Analyses via these direct biochemical methods would have required large amounts of protein, which might have been hard to amass or taken a long time to isolate.

A peptide's sequence can be chemically determined in a more methodical manner by using the Edman degradation reaction, which removes amino acids from the N-terminus, one residue at a time. This chemical process involves the addition of phenylisothiocyanate (PITC) and subsequent acid treatment in order to cleave the N-terminal residue at the peptide bond. The identity of the released residue can be determined by comparing its reversed-phase HPLC retention time with that of known amino acid standards.
By chemically removing one residue at a time, a peptide sequence could be absolutely determined. This process has been automated, but it takes 30-60 minutes for each residue [YGHZ91, HFBG92]. As a result, Edman degradation is currently limited to the identification of about 50 residues per day in the best conditions (e.g. ample quantity and sufficient purity of sample) [CWBK93]. But proteins can range from 50 residues to more than 25,000 residues in size (average size is approximately 250) [Cre93], so Edman degradation can take days to sequence the entire protein.

Furthermore, errors in Edman degradation can be cumulative. If the N-terminal residue of some of the peptides fails to react and is not removed during one iteration of the degradation process, it will react and be removed during a subsequent cycle, potentially interfering with the identification of the proper residue. After n cycles, the true residue at the nth position from the N-terminus will be unclear and its identification unreliable (n can be on the order of 70) [Cre93].

5.2 Sequencing with Mass Spectrometry

Although many laboratories today still use Edman degradation to obtain de novo sequence information [SCE+97], in practice, Edman degradation and mass spectrometry approaches are often used simultaneously, and sometimes in a complementary fashion (for laboratories that can afford to) [Fen91, TJ97]. Mass spectrometry offers the following advantages:

Blocked N Terminus: A blocked N-terminus interferes with the Edman degradation reaction (e.g. it will not react with phenylisothiocyanate), but it does not pose a problem for mass spectrometric techniques. Roughly 30-50% of proteins isolated from SDS-PAGE gels, for example, are N-terminally blocked [YGHZ91].

Modified/Unanticipated Amino Acids: Modified residues and unexpected residue variants can interfere with residue identification during Edman analysis if a comparative standard is unavailable [BS87, HFBG92].
With mass spectra, fragments containing the modification will exhibit a shift in mass; however, identifying the chemical structure of the modified moiety would still be challenging despite possible knowledge of its location and molecular weight.

Sample Quantity: Low picomolar (e.g. less than 1 pmol [Bal95]) and femtomolar (e.g. 200 fmol [GME+95]) protein concentrations can be successfully used with mass spectrometry. Under certain conditions, even attomole sensitivity can be achieved [FS96]. Edman analysis requires picomolar amounts, e.g. 25-100 pmol [Pro]. Consequently, Edman degradation may be unfit for peptides that are scarce or rare.

Sample Purity: The sample does not have to be pure, since mass spectrometry can tolerate the presence of other peptides (e.g. in a proteolytic digest) and impurities including salts and other biochemical agents commonly used as buffers or detergents [BC96]. This is not so for Edman analysis, where sample purity matters.

How does a mass spectrum aid in the sequencing process? What can be done with an uninterpreted tandem mass spectrum for an unknown peptide? If an oracle could correctly label each experimental peak with its real identity (fragment type and sequence), the protein sequencing problem would be decidable: an algorithm could examine the identities and either supply the correct sequence, or indicate that it is unable to do so due to insufficient information. Such an oracle may not be realistic. What if the oracle were allowed to be less powerful: what if it were only able to supply the fragment type portion of a peak's identity?

5.2.1 Enlisting the Aid of Fragment Types

Different fragment types provide different pieces of information useful for sequencing.

Parent Ion: The parent ion reveals the parent mass (but this value is already known as it was needed to tune the timed ion selector for MS/MS).

Immonium Ions: Immonium ions are useful because they hint at a peptide's composition (they do not yield any sequence information).
However, immonium clues are less informative when sequencing large peptides because the probability that each residue is present in the sequence increases with peptide length, and thus it is unclear how to distinguish one possible sequence from another based on immonium ion data alone.

Variant Ions: Like immonium ions, variants can be indicative of residue composition, but again, variants of longer peptides are less informative than variants of shorter ones.

Immediate Family Ions: Immediate family ions are useful because they serve as evidence for the cleavage of some bond between two residues. When members of two neighboring families are present, e.g. $f_i$ and $f_{i-1}$, the mass of the intermediate residue $r_i$ can be discovered.

Internal Ions: Internal ions are most often used to confirm a sequence guess by matching some expected theoretical internal ion calculated from the proposed guess. They can also sometimes fill in for absent immediate family members by bridging a gap in data that is not representative.

Being able to label a real peak with its correct fragment type is not enough. In practice, the oracle needs to be able to contend with spectra marred by imperfections such as missing peaks, imprecise peaks and irrelevant peaks. This complicates the job of our oracle, which now must first distinguish noise peaks from real peaks before an assignment of fragment types to the real peaks can be made.

5.2.2 Persevering Despite the Effects of Noise

Noise is anything that causes the spectrum to deviate from the ideal, and can be divided into two general categories: physical noise and measurement noise. The effects of noise can be additive (bogus peaks that do not belong appear in the spectrum) or subtractive (legitimate peaks disappear). When a peak is present, measurement noise interferes with the accuracy with which certain properties of the peak can be measured.
The following is a taxonomy of the various categories of noise:

* Physical Noise

  Physical-Chemical Properties of Peptides: The peptide itself, by nature, may not ionize well or it may not fragment well because of its residue composition, e.g. dynorphin 1-13 is highly resistant to fragmentation [QC96] (also see Section 3.2). Certain fragments which do not occur readily produce a subtractive effect by being either totally absent or present in low abundance.

  Impurities: The presence of impurities, introduced either by the peptide supplier (e.g. as artifacts of purification) or by the analyst during sample preparation, can lead to foreign peaks, i.e. peaks which do not correspond to any real fragment. This additive effect can complicate spectra and mislead sequencing algorithms.

* Measurement Noise

  Mass Inaccuracies: The mass of a peak is considered to be accurate to within 0.1% (and sometimes up to 0.01%) of the actual mass [MHH+94, Bal95, FS96] (footnote 1). As discussed in Section 2.1.4, the laser source is one of the contributors to this imprecision. Extreme cases of inaccuracy can lead to broad mountain ranges rather than sharp well-defined spectrum peaks.

  Height Inaccuracies: Ambient electronic or background noise causes the spectrum to exhibit a non-zero baseline intensity level. The heights of peaks at particular masses may also be lower or higher than their actual counts due to the loss of ions during flight (e.g. an ion collides with the flight tube wall, or misses the detector due to improper mirror focusing), low or hyper detector sensitivity (e.g. an ion impacts the detector but fails to trigger the sensing/registering mechanism) and electronic disturbances [GMG+99].

  Peak Detection Errors: An algorithm is used on the raw mass spectrum data for peak detection.
Additive and subtractive effects are possible, as peak detection errors can treat spurious background noise peaks as possible data (false positives) or disqualify actual peaks of low intensity from consideration (false negatives).

Footnote 1: According to [ASR96], the mass accuracy is 0.1% for large proteins and 0.01% for small ones. An estimate for the Voyager Elite in reflector mode is 0.02% and for PSD, ±0.2-0.3 Daltons [Pet97].

5.2.3 Considering Different Interpretations of the Same Spectrum

Let us weaken the oracle further by considering one that does not always successfully distinguish signal from noise. It may mistakenly assign a peak an incorrect fragment type, but one that is still consistent with the fragment types assigned to the other peaks in the dataset. As a result, the peaks of a single dataset may be interpreted in multiple ways, each with a complete set of fragment type assignments that are plausible and self-consistent. For each such complete assignment, a candidate guess for the complete peptide sequence can be inferred.

Note that the sequences deduced from different complete assignments are fundamentally different, but a single complete assignment can lead to several sequences that are only marginally different. For example, DRVYIHPFHL (the correct angiotensin sequence) and YSMYFVYNPI may both be sequence guesses derived from different complete assignments which each explain a particular dataset in some consistent interpretation, but the fragment type assignments are fundamentally different. However, DRVYIHPFHI, DRVYLHPFHL, DRFHPYLVHL and ALWYIHPFHI are all marginally different, because it may be possible to derive all of them from the same complete assignment. (A large enough dataset could potentially constrain the interpretation and significantly reduce the number of marginally different sequences.) Two factors contribute to the existence of marginally different sequence inferences.
One is the non-uniqueness of residue weights, previously mentioned in Section 4.4, which leads to sequences like DRVYIHPFHI and DRVYLHPFHL. The second has to do with the fact that multiple distinct combinations of amino acids can have the same mass. Simple examples are: HR and GHV, N and GG, and TV and VT. Even if the amino acid composition were known, one could not tell the correct ordering from its anagrams if supporting (either core or internal) fragments are not present in the spectrum to resolve questions of residue position.

If the intervention of a weak oracle can lead to a (potentially large) number of sequence guesses, how are they to be ranked and judged? How is one to be chosen over the others as the most likely answer? One answer is to use redundancy. Not all sequence guesses are equal; some guesses (hopefully, those that are closer to the correct sequence) are supported by more evidence peaks than others. Redundancy also plays a role in solving the problem of missing peaks, since interpretation may still be successful if enough redundant information is present.

Some of the algorithms presented in the literature (the spectrum-to-sequence approaches discussed in Chapter 6) are basically simulations of this weak oracle. However, the idea of using such an oracle is not the only way to address the protein sequencing problem. De novo sequencing from MS/MS is the focus of this thesis, but it is only one of four answers the research community has given to the peptide sequencing question.

Chapter 6

Prior Work

In this chapter, we briefly survey some of the previous work (not necessarily in chronological order) in de novo protein sequencing using mass spectrometry. While many of these approaches were not designed for MALDI-PSD spectra, the ultimate goal is the same: to find the sequence of the peptide, from a pool of candidate sequences, that presumably best explains the observed data.
The purpose of this chapter is to examine how the different algorithms arrive at a solution.

6.1 Four Categories of Approaches

We partition the space of algorithms into four categories. In general, each approach falls into one of these, although there are some hybrid approaches that fall into two. Membership in a category depends on whether the approach takes as input single-stage MS spectra or tandem mass spectra, and whether it involves a database lookup. The works we visit are summarized in Figure 6-1.

[Figure 6-1: Classification of Different Protein Sequencing Approaches. Column labels: MS/MS, MS. Works listed: YGSH93, JQCG93, MHR93, PHB93, GMG+99, FGVP+96, MW94, EMY94, YEMS95, GME+95, YECB96, TWJ96, Tay00, PPCC99, SCE+97, CBB96, TJ97, CLS99, CWBK93, BJHP94, GP97, SMMK84, LS85, HWH86, IN86, SB88, JB89, ZTEB90, Bar90, YGH91, HFBG91, SZK95, FdCGB95, FdCGB+98, FdCGS+99, DAC+99a, DAC+99b, CKT+00.]

In the case of MS spectra, the peptide is cleaved (perhaps incompletely) by a known protease at specific residues, so that the spectrum is a conserved (all pieces are retained) histogram of masses corresponding to those subsequence blocks flanked by residues targeted by the protease. A tandem mass spectrum is a non-conserved histogram of masses resulting from a more complex fragmentation process that can occur at non-specific residue positions (see Figure 6-2).

Database approaches mine spectra for informational clues (called a fingerprint or signature) in an attempt to uniquely identify the peptide from a reservoir of sequenced proteins. Such approaches are often faster than their computational de novo counterparts, but they may encounter difficulties with homologs and would be unable to identify novel proteins not present in the database.

6.2 Database Search with MS

One can search a database of protein sequences using the MS masses as a mass fingerprint.
Five such algorithms appeared in the literature in 1993 [YGSH93, JQCG93, MHR93, HBS+93, PHB93], and all of them adopt a similar strategy: predict the theoretical fingerprints of all entries in a database, compare these to the experimental fingerprint, and choose the entry with the best agreement as the answer sequence.

[Figure 6-2: MS and MS/MS spectra. MS is simply a way to separate, by mass, the pieces of the peptide resulting from proteolytic digestion. MS/MS is a means to home in on a particular mass to generate random non-specific fragmentation. Panel labels: mixture of peptides; specific conserved fragmentation (e.g. proteolytic digestion); MS spectra; parent peptide; parent ion (M+H); non-specific non-conserved fragmentation; MS/MS spectra.]

In practice, one could use the complete set of MS masses as the mass fingerprint, but rather surprisingly, Pappin et al., in their molecular weight search (MOWSE, footnote 1) approach, found that a protein could be identified uniquely with a fingerprint of as few as 3 or 4 masses [PHB93]. Their comparison algorithm is more sophisticated than simply counting the number of matches; their scores are based on an empirically determined matrix that captures the size distribution of peptide masses as a function of protein mass.

In general, algorithms that fall in this category may have difficulty identifying the correct sequence from a database if:

* the peptide sample contains a modification and this modification is not present in the database,
* a number of sequences are homologous to the correct one, and the fingerprint is insufficient to distinguish amongst them, yielding false positives,
* the database is incomplete or contains errors,
* the fingerprint is corrupted by the bad choice of irrelevant peaks that are unrelated to the sample peptide and occur due to noise/impurities, or
* incomplete proteolytic digestion occurs or not enough of the products are recovered.

Some of these issues have been studied [MHR93, JQCG93], and except when a combination of these occurs [MHR93], the fingerprinting method has in general met with a great deal of success in identifying proteins that have already been catalogued in the database [LM97]. Further improvements have included better peak detection methods and enhanced scoring mechanisms that incorporate additional attributes for finer discrimination [GMG+99].

Footnote 1: http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse

6.3 Database Search with MS/MS

Database searches can also be done with MS/MS data. Because the information contained in this type of spectra is different and potentially more sequence-revealing, a number of approaches have been proposed.

6.3.1 Searching with Peptide Sequence Tags

In an approach called PeptideSearch (footnote 2), Mann and Wilm [MW94] enhance the success rate of a database search by using a mass fingerprint with more searching criteria: sequence information that can be readily gleaned and interpreted from ESI spectra. Their peptide sequence tag is composed of three parts, $(m_1, s, m_2)$, where $s$ is a short consecutive amino acid sequence that is the result of partial manual interpretation of the MS/MS spectra, $m_1$ is the mass of the prefix from the N-terminus to the start of $s$, and $m_2$ is the mass of the suffix from the end of $s$ to the C-terminus. In this manner, $m_1 + m_2 + \mathrm{mass}(s)$ is the mass of the entire original peptide. For database entries that contain subsequences that match this peptide tag, a second step scores the theoretically predicted fragmentation masses for each candidate against those experimentally observed.

One must identify a correct subsequence $s$, and in order to pinpoint its location in the real sequence, one must correctly assign fragment type roles to the mass peaks which support $s$.
To find $s$, one finds a chain of peaks, assigns a core fragment type to each of them, and checks to see if their masses are consistent: namely, does the mass difference, adjusted for the peaks' fragment roles, correspond to an amino acid weight? The sequence may be correct, but if the roles are incorrect (e.g. one assumes they are B ions when in fact they are Y ions), the computation of $m_1$ and $m_2$ will be incorrect, shifting $s$ to an incorrect position in the query tag. Multiple tag queries may result because different role assignments to different sets of peaks may be consistent, and if there is no redundancy making the choice obvious, there may be no other a priori bias towards favoring any one set of role assignments over another.

Footnote 2: http://www.mann.embl-heidelberg.de/Services/PeptideSearch/PeptideSearchIntro.html

Mann and Wilm found that using a subsequence of two or three residues was sufficient to effectively narrow down the number of database matches. Furthermore, in the case of a mass fingerprint with MS data, a modified residue will affect and shift the molecular weight of all fragments of which it is a member. With a sequence tag, however, if there is an unexpected modification and the query turns up no answer, the matching criteria can be relaxed; in this manner, the search is able to tolerate errors/modifications in one of the three parts of the sequence tag.

6.3.2 Evaluating Theoretically Predicted Spectra with Experimentally Obtained Spectra

An algorithm incorporated in a program called SeQuest (footnote 3) [EMY94] for low energy CID (footnote 4) searches a database looking for entries containing subsequences with a mass equal to that of the parent ion.
When a candidate entry is found, its theoretical spectrum is predicted and compared to the experimental data to arrive at a preliminary score that is based on: (1) the number of theoretical peaks in the experimental dataset, (2) the sum of the intensities of these matched peaks, (3) the continuity of an ion fragment type, and (4) the presence/absence of immonium ions for H, Y, W, M and F. The highest scoring candidates go on to a "cross-correlation" (or simply, correlation) analysis step which produces another score measuring the closeness-of-fit between the theoretical spectrum and the experimental spectrum.

The algorithm handles modifications by simultaneously considering them at every putative modification site of a database entry during the search (an all-or-nothing approach: either all sites are modified or none are). A subsequent paper [YEMS95] allowed combinations of up to three unmodified/modified sites (as the number of possible sites increases, the number of possibilities increases exponentially). Both versions, however, are only able to look out for a known set of modifications; they cannot handle unanticipated or previously unencountered ones.

PepID [TWJ96, Tay00], an algorithm implemented as part of a program called Sherpa (footnote 5), is similar to SeQuest, except that Sherpa is an interactive application for the interpretation of ESI MS/MS. A modified version of the cross-correlation function of [EMY94] is included as an optional analysis one might elect to perform.

Footnote 3: http://thompson.mbt.washington.edu/sequest.html

Footnote 4: Variations of the algorithm have been demonstrated for MALDI PSD data [GME+95] and high-energy CID spectra [YECB96].

Footnote 5: http://www.hairyfatguy.com/Sherpa/

6.4 Hybrid Approaches

A few hybrid approaches exist that either have both a computational component and a database search, or use both MS and MS/MS data. These are depicted in Figure 6-1 on the boundary between two quadrants.
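Both the database searches of Section 6.3.2 and the hybrid approaches that follow compare the theoretical spectrum of a candidate with the experimental spectrum. A minimal sketch of the two simplest ingredients of such a comparison, the number of matched theoretical peaks and the summed intensity of the matches, is given below. It is an illustration only and not the scoring function of SeQuest or of any other cited program; the function name, the 0.5 Da tolerance and the data layout are assumptions.

```python
def match_score(theoretical, experimental, tol=0.5):
    """Compare predicted fragment masses against an experimental peak list.

    theoretical  -- iterable of predicted fragment masses
    experimental -- list of (mass, intensity) pairs
    Returns (number of theoretical peaks matched, summed intensity of the
    strongest experimental peak matching each one).
    """
    matched, total_intensity = 0, 0.0
    for t in theoretical:
        hits = [inten for mass, inten in experimental if abs(mass - t) <= tol]
        if hits:
            matched += 1
            total_intensity += max(hits)
    return matched, total_intensity


predicted = [100.0, 200.0, 300.0]
observed = [(100.2, 5.0), (199.9, 3.0), (250.0, 9.0)]
print(match_score(predicted, observed))
```

In this toy input the most intense observed peak (250.0) contributes nothing because no predicted fragment explains it, which is one reason match counts alone cannot separate a good candidate from a noisy spectrum.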
6.4.1 Computational Approaches with a MS/MS Database Search

MSTag [CBB96] (footnote 6) uses MALDI PSD peaks as a tag for searching either a protein or a DNA database. It also has limited de novo sequencing capability (for peptides < 1300 Da) as it uses a combinatorial brute force approach.

Footnote 6: http://prospector.ucsf.edu/ucsfhtml3.2/mstagfd.htm

Lutefisk97 [TJ97], an algorithm for low energy CID, also combines a de novo sequencing step with a database lookup. The de novo step creates a fundamental graph of B ions that are scored with probability values (taken from [Bar90]). Paths through the graph are enumerated, and only the best scoring ones are subjected to cross-correlation à la [EMY94] with the experimental spectrum to produce a new combined score. In the resulting list of candidate sequences, several may be homologous with only slight differences (e.g. the order of a short subsequence might be reversed, a substitution might have occurred with two subsequences that are different but of the same mass, etc.). A modified FASTA algorithm allows each database entry to be compared to multiple sequence queries, allowing for a homology-based sequence search.

PepSeq [CLS99] combines features of a database approach and a computational/de novo approach. It examines PSD spectra and, according to an arsenal of rules and observations (e.g. residue presence/absence from immonium ions, C-terminal residue constraints due to the protease used, fragment type patterns, etc.), infers a list of properties the generating sequence should have and then combinatorially computes the possible candidates. The theoretical spectrum of each candidate is compared to the experimental data to arrive at a score, and a database lookup for the candidate is performed as well. One novelty is that it includes internal ions in its generation of theoretical spectra.

6.4.2 Database Search with MS and MS/MS

MassFrag, a hybrid approach devised by Gevaert et al. [GVP+96], works with both MS and PSD MS/MS spectra.
MOWSE is used on the MS spectra to obtain a list of possible sequences. For each guess, its theoretical MS/MS spectrum is generated and compared to the experimental MS/MS dataset to determine a score based on the number of matches.

PepFrag (footnote 7) allows nucleotide sequence databases to be searched as well. This program addresses the situation in which a single mass fingerprint query results in multiple possible candidate sequences, and includes an in silico investigation of how effective different search criteria (e.g. knowledge of the N-terminal residue identity, knowledge of the presence/absence of certain amino acids, etc.) are at constraining the search [FQC98].

6.4.3 Database Search with MS or MS/MS

The Mascot program (footnote 8) is an application that allows users to perform a number of the searches mentioned above: (1) mass fingerprinting with MS spectra, (2) peptide sequence tags with MS/MS, and (3) comparison of theoretical MS/MS with experimental MS/MS. Mascot's scoring algorithms are based upon those of MOWSE, but they are probability-based and measure the probability that an observed match between the theoretical and experimental spectra resulted from chance. The absolute probability that a match is random must be supplied, as well as the database size, but other details are not available [PPCC99].

Footnote 7: distributed as part of PROWL, http://prowl.rockefeller.edu/

Footnote 8: http://www.matrixscience.com/

6.5 Discussion of Database Approaches

In general, database searches can be made relatively fast, and if the peptide in question is part of a protein present in the database, then the search will produce a sequence. In the event that a search does not yield a unique answer, additional information such as partial sequence information (a longer contiguous stretch of amino acids, or several short ones in the case of a peptide sequence tag), other fragment masses and protease specificity can be supplied to constrain the search to a more definitive match.
However, a database search is only viable if the protein is present. Furthermore, protein databases may contain errors and are far from complete [SWM97]. When sequencing newly encountered proteins, a database search will yield either no answer or, even worse, a false positive - a sequence, e.g. of a homologous protein, that may fit the criteria well but is not the one actually responsible for the observed spectrum. If a peptide is not present in any database, then one must resort to some other means of protein sequencing, such as Edman degradation or one of the computational algorithms to be discussed.

6.6 Computational Search with MS

Recall that MS spectrum peaks represent subsequence blocks that result from proteolysis. For each block/peak mass, one might exhaustively enumerate all amino acid combinations that have the same mass (e.g. PAAS [MSM+83] embodies a library of routines that would be ideal for this). But the larger the mass, the greater the number of possible combinations. Furthermore, there exists no subsequent means of evaluation for distinguishing the right answer from the pool of possibilities.

6.6.1 Ladder Sequencing with Mass Spectrometry

Instead of a protease that cleaves after a certain set of specific residues, a non-residue-specific reagent is used to cleave after every amino acid. If one were able to isolate all possible intermediate product fragments containing the original amino (carboxyl) terminus, one would in effect have all the partial prefixes (suffixes) of the sequence string. One could then easily identify the amino acids in the sequence from the MS spectrum by calculating the mass difference between consecutive peaks. This is the idea behind ladder sequencing, and it was first proposed by Chait et al. Their scheme is a modified Edman reaction in which three reagents are used: phenylisocyanate (PIC), phenylisothiocyanate (PITC) and trifluoroacetic acid (TFA) [CWBK93].
PITC will react with and modify the N-terminal residue in a manner that enables TFA to subsequently strip the residue off, leaving a new free N-terminus that PITC can react with to repeat the cycle on the next residue. PIC is a compound that, like PITC, reacts with the N-terminal residue, but unlike PITC, once modified by PIC, the terminal residue is no longer susceptible to TFA cleavage. Thus, PIC is a terminating reagent, and a small amount is added when PITC is added, so that at the end of each cycle, a fraction of the products are blocked and do not participate further in the ladder-producing reactions. This enables one to obtain a ladder of partial suffixes which can then be flown on a mass spectrometer.

Another strategy, used by [BJHP94] (using chemical reactions based on the protocol in [CWBK93]) and later by [GP97] (using different chemicals in their protocol), involves the introduction of a new aliquot of the original full-length peptide at each cycle. In this manner, peptides get successively shorter in each cycle, but peptides that have been added later have fewer N-terminal residues stripped off and hence form the longer partial suffixes.

Ladder sequencing approaches can handle unexpected modifications effectively because, as long as these modifications do not interfere with the ladder-generating reactions, they will not hinder sequence discovery of the other unmodified residues. Only subpicomolar amounts of peptide are needed, but the chemistry is slow: [GP97], for example, requires 60 minutes per cycle (an hour per residue). There are various other complications for each of these ladder sequencing approaches:

* Chait, et al: instability of the terminating block [SSMW97] and loss of hydrophobic peptides [GP97]
* Bartlet-Jones, et al: difficulties with lysine-containing peptides [GP97,
* Gu, et al: certain reaction steps need to be optimized for different proteins, so this may be difficult to do for novel proteins; there are also some difficulties with side-chain reactions [GP97].
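The mass-difference read-out that ladder sequencing relies on can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `read_ladder`, the tolerance `tol`, and the reduced residue table (standard monoisotopic masses, omitting isobaric or near-isobaric pairs such as L/I and K/Q) are hypothetical choices, not part of any of the cited protocols.

```python
# Monoisotopic residue masses in Da (standard values; deliberately a subset
# that avoids the isobaric pair L/I and the near-isobaric pair K/Q).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}

def read_ladder(peaks, tol=0.3):
    """Infer one residue per pair of consecutive ladder peaks.

    Peaks are sorted from lightest to heaviest; a mass difference that
    matches no residue within `tol` is reported as '?'.
    """
    peaks = sorted(peaks)
    residues = []
    for lo, hi in zip(peaks, peaks[1:]):
        diff = hi - lo
        # Closest residue mass to the observed difference.
        best = min(RESIDUE_MASS, key=lambda r: abs(RESIDUE_MASS[r] - diff))
        residues.append(best if abs(RESIDUE_MASS[best] - diff) <= tol else "?")
    return residues

# Suffix ladder of the hypothetical peptide AGVF: F, VF, GVF, AGVF
# (neutral masses, i.e. residue sums plus one water).
ladder = [165.07897, 264.14738, 321.16884, 392.20595]
print(read_ladder(ladder))
```

For a suffix ladder, the residues come out from the C-terminal side toward the N-terminus, so the list is reversed to read the sequence in the usual N-to-C order. The lightest rung itself is not resolved by differences alone.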
6.7 Computational Search with MS/MS

Early de novo sequencing from mass spectra was done manually - an analyst would visually inspect spectra, seek clues and make deductions, with the ultimate objective of finding some consistent interpretation of the spectrum as a whole. He/she might take into account guidelines and hints described in [str95, Pap95, MB94], such as the known specificities of a protease that might have been used during sample preparation, which could restrict the choices for the C-terminal residue. The analyst could then attempt to build a sequence, one residue at a time, from the C-terminus by guessing and temporarily assigning fragment types, looking for the best next residue. Immonium ions might indicate the more likely residues, but the number of possibilities would still have made this process time-consuming and tedious. The task might be simpler if only the highest intensity peaks were considered, reducing complexity at the expense of a less reliable answer based on only a small fraction of the information in the spectrum. This game of trial and error became easier with experience, as the analyst learned to recognize patterns after dealing with spectra on a case-by-case basis, but in short, determination of the complete sequence required considerable effort. An analyst would have been able to keep track of and explore only the tip of the iceberg of possible combinations.

Computer algorithms were developed to automate the de novo sequencing process, and we classify them into two categories depending on how sequence guesses are made: the spectrum-to-sequence approaches enlist the aid of the experimental dataset, while the sequence-to-spectrum approaches do not. All the approaches considered in the following sections are classified in Table 6.1. We are interested in the strategies embodied in these algorithms, and while the details of each vary, approaches that fall in the same classification attack the problem in the same manner.
The table also indicates which approaches use fragment type probabilities. Often these probabilities are determined empirically from some sample data, or arbitrarily chosen to be loosely indicative of patterns in the sample data.

Table 6.1: Taxonomy of De Novo MS/MS Approaches. Columns: reference; sequence-to-spectrum (local, global); spectrum-to-sequence (local; global (fundamental graph), itself divided into local and global); fragment probabilities. Rows: SMMK84, LS85, HWH86, IN86, SBB87, SB88, JB89, ZTEB90, Ba90, YGHZ91, HFBG92, SZK95, FdCGB95, FdCGB98, FdCGB99, DAC+99a, DAC+99b, CKT+00, our work.

6.7.1 Sequence-to-Spectrum Categories

With a sequence-to-spectrum strategy, candidate sequences are generated independently of the experimental spectrum, but the spectrum is used later on to see if any peaks support a particular candidate sequence. This correlation process involves generating a theoretical spectrum from the candidate sequence guess, and then comparing it to the experimentally obtained spectrum. The simplest type of correlation is a count of the number of matches, where a match is a mass that appears in both the theoretical and experimental spectra. A more complex one was already seen in the cross-correlation routines of [EMY94].

Global Sequence-to-Spectrum Approaches

Global sequence-to-spectrum approaches correlate complete candidate sequences. These can be enumerated by brute force, with [HWH86] or without [SMMK84] knowledge of the amino acid composition. An advantage of correlating with a full peptide sequence is that all known fragment types, including internals and variants, can be considered since the entire sequence is available, resulting in a more accurate measure of similarity between the theoretical and experimental datasets. However, these approaches are computationally intensive, with a cost that grows exponentially with parent mass/peptide length.
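The simplest correlation mentioned above, a count of matching masses, might look like the following sketch. It assumes singly charged B and Y ions only, a reduced table of standard monoisotopic residue masses, and a hypothetical tolerance `tol`; the function names are illustrative and do not come from any of the cited programs.

```python
# Standard monoisotopic masses in Da (subset of residues for brevity).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276}
PROTON = 1.00728   # mass of the ionizing proton
WATER = 18.01056   # H2O carried by C-terminal (Y) fragments

def theoretical_b_y(sequence):
    """Singly charged B and Y ion masses for a candidate sequence."""
    masses = [RESIDUE_MASS[r] for r in sequence]
    total = sum(masses)
    ions = []
    prefix = 0.0
    for m in masses[:-1]:          # one cleavage site between each residue pair
        prefix += m
        ions.append(prefix + PROTON)                  # B ion (prefix)
        ions.append(total - prefix + WATER + PROTON)  # complementary Y ion
    return ions

def match_count(sequence, experimental, tol=0.5):
    """Number of theoretical masses with an experimental peak within tol."""
    return sum(
        any(abs(t - e) <= tol for e in experimental)
        for t in theoretical_b_y(sequence)
    )

# A noise-free "experimental" spectrum generated from GASP.
spectrum = theoretical_b_y("GASP")
print(match_count("GASP", spectrum), match_count("GSAP", spectrum))
```

A brute-force global search would call `match_count` on every enumerated candidate, which is exactly where the exponential cost noted above comes from.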
Local Sequence-to-Spectrum Approaches

Local sequence-to-spectrum approaches [IN86, YGHZ91] avoid exhaustive searching by exploring only those portions that seem promising, instead of the whole space of possibilities. Sequence guesses are built up in a stepwise manner by executing repeated rounds of 1) extension by one or more residues ([IN86] starts with tripeptide seeds and extends with all possible dipeptides), 2) correlation of the new partial sequences and 3) pruning to remove those of low potential from consideration. Only some limited number of high scoring candidates survive to the next round. This process ends when some number of complete sequences are found.

One might imagine an algorithm that attempts to use the experimental dataset intelligently as a guide for determining potential candidate sequences in hopes of avoiding unnecessary searching. This type of algorithm falls into the spectrum-to-sequence category of approaches.

6.7.2 Spectrum-to-Sequence Category

With a spectrum-to-sequence strategy, the candidate sequence guesses that are constructed are determined from certain relationships that may exist in the experimental spectrum itself.

Local Spectrum-to-Sequence Approaches

Local spectrum-to-sequence approaches [LS85, JB89] build a sequence up from one terminus to the other, one residue at a time, using the spectrum directly to decide which residue to grow a partial sequence with. In another approach [ZTEB90], candidate sequences are derived mathematically from equations based on atomic weights, isotope peak ratios and immonium ion hints that dictate composition constraints. Each possible composition is then considered, one residue at a time, to find a permutation that is well supported by peaks in the spectrum. An interactive program [SBB87] allows users to link pairs of peaks that differ by a basic amino acid mass in hopes of finding long chains that explain the parent mass.
This approach would work if the same fragment type were present for each family but not when neighboring peptide positions are represented by family members of different fragment types.

As is the case with local sequence-to-spectrum approaches, pruning is often used to reduce the number of outstanding sequence paths being explored. But the choice of which to keep and which to discard is made with only a local view of the situation at hand. Prefix pruning can inadvertently and prematurely remove the partial path that would develop into the real sequence when the portion of the sequence seen so far is underrepresented (even though the remainder of the sequence, yet to be visited, is better supported). However, if all possible extensions are extracted and compactly represented in a fundamental graph, then the entire path would be available for analysis. Pruning would no longer have to occur dynamically during residue extension; instead, all information could be preserved for as long as possible, and when pruning became unavoidable, an informed decision of which to rule out and prune could be made based on a more global examination of the situation in its entirety.

6.7.3 Fundamental Graph (Global Spectrum-to-Sequence) Approaches

The fundamental graph is a complete picture of all sequences that can possibly be found in a spectrum, and is the result of an analysis of mass differences between potential families in the experimental data. In the case of representative spectra, the real sequence is guaranteed to be present as a complete path in the graph, making the fundamental graph a natural and attractive tool for allowing peak mass relationships to guide candidate sequence prediction. Even within this global context, there is a local as well as a global way to use the information in a fundamental graph. Keep in mind that scores are typically associated with each node of a fundamental graph, and these are distinct from the correlation scores.
Node scores indicate the degree to which a supposed fundamental is potentially supported by the data - the more redundancy, the higher the node score, and the more likely it is legitimate and part of the real sequence. Note that the approaches in the following sections do not all use the same fundamental graph but some variation of the fundamental graph concept (different fragment types are chosen as the fundamental, different f-generating roles are used, [SB88, Bar90] do not explicitly construct the edges, etc.).

Local Fundamental Graph Approaches

The beginnings of the fundamental graph idea can be found in [SB88]. Siegel and Bauman transform a FAB MS/MS spectrum into a "reconstructed spectrum", which is simply a list of fundamentals (the nodes of the graph without the edges). Their fundamental is not a single fragment type, but a set of fragment types, namely, all prefix series ions. Their approach is rather complex and involves the costly pre-computation of tables containing the mass of every ion possible from all permutations of all residues likely to be in the peptide's sequence. Mass differences of neighboring reconstructed peaks were then computed and these tables were consulted to find subsequences that could account for them. If any complete path were found, the subsequences of each segment of the path could be permuted and then concatenated together to form a complete candidate guess. This approach generates a great number of possible guesses, and was found to be impractical for spectra with more than about 20 peaks [Bar90].

Hines et al. [HFBG92] use a pattern-based approach to identify likely fundamentals, and Scarberry et al. [SZK95] employ neural networks, well suited for pattern recognition tasks, to classify experimental peaks by likely fragment types, from which the fundamental list can be computed.
As with the local spectrum-to-sequence approaches, partial sequence guesses are extended one residue at a time, except here, the fundamentals [SZK95] and the fundamental graph [HFBG92] are used in place of the experimental spectrum to determine which residues to try.

6.7.4 Global Fundamental Graph Approaches

A more global use of a fundamental graph involves some computation on the entire graph before the most likely answer(s) is chosen. Examples of graph algorithms in this category might be dynamic programming algorithms that find the longest or the heaviest path.

In Bartels' approach, the fundamental graph nodes are scored based on probabilities (from [SB88]) associated with the fragment types of each of their supporters. Scores are propagated through the entire graph from one node to the next if the nodes are a basic residue apart. At each fundamental, two scores are kept, one for each of two passes through the graph: once when scores are propagated from higher mass nodes to lower mass nodes, and once in the reverse direction. The sum of these scores is meant to be indicative of "how well the best explanation will be if the corresponding mass is included in the interpretation". The highest scoring nodes then best explain the spectrum, but no further details are available in [Bar90].

Fernandez-de-Cossio et al.'s MSEQ ([FdCGB95] for high-energy CID, [FdCGS+99] for low-energy CID and PSD) and SeqMS [FdCGB+98] algorithms are both based on the work of Bartels. Scores depend on probabilities that are different from those of Bartels, and SeqMS takes more fragment types into account. A graph algorithm (e.g. a variation of Dijkstra's single-source shortest path algorithm) is used to find the maximum scoring path from the base fundamental to every fundamental, and this information is used to enumerate the best scoring paths through the entire graph.
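As a rough illustration of a "global" computation on a fundamental graph, the sketch below finds the maximum scoring complete path by dynamic programming over nodes sorted by mass. It is a plain DAG dynamic program, not Dijkstra's algorithm and not the method of any cited paper; the node scores, the function name `best_path`, the residue table passed in and the tolerance are all assumed for the example.

```python
def best_path(nodes, residue_mass, tol=0.5):
    """Return (score, sequence) of the best path from lightest to heaviest node.

    nodes: list of (mass, node_score) pairs; an edge joins two nodes whose
    mass difference equals a residue mass within `tol`. Returns None when
    no complete path exists.
    """
    nodes = sorted(nodes)
    n = len(nodes)
    unreachable = float("-inf")
    best = [unreachable] * n   # best cumulative score of a path ending at node j
    back = [None] * n          # (previous node index, residue label) on that path
    best[0] = nodes[0][1]
    for j in range(1, n):
        for i in range(j):
            if best[i] == unreachable:
                continue
            diff = nodes[j][0] - nodes[i][0]
            for res, m in residue_mass.items():
                if abs(diff - m) <= tol and best[i] + nodes[j][1] > best[j]:
                    best[j] = best[i] + nodes[j][1]
                    back[j] = (i, res)
    if best[-1] == unreachable:
        return None
    # Walk the back-pointers from the heaviest node to recover the residues.
    seq = []
    j = n - 1
    while back[j] is not None:
        j, res = back[j]
        seq.append(res)
    return best[-1], "".join(reversed(seq))

masses = {"G": 57.02146, "A": 71.03711, "S": 87.03203}
graph = [(0.0, 0), (57.02, 1), (71.04, 5), (128.06, 2), (215.09, 1)]
print(best_path(graph, masses, tol=0.05))
```

Because every edge points from a lighter node to a heavier one, processing nodes in mass order visits each edge once, so no priority queue is needed.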
Sherenga [DAC+99b, DAC+99a] also uses graph theory to find a maximum scoring path, and it is the only approach to our knowledge that presents a more rigorous argument for why the sequence for which their scoring function is maximal is the best choice for an answer. Interestingly, unlike all previous algorithms, which are designed for a particular type of spectra with a known set of common fragment types, theirs is also instrument-independent. It is theoretically able to process any type of spectra by examining mass spectrum samples produced by some ionization technique/mass spectrometer combination, and by predicting the commonly occurring fragment types for this machine.

An algorithm by Chen et al. [CKT+00] uses a dynamic programming algorithm to enumerate all paths in their version of a fundamental graph, but care is taken to avoid paths that visit related fundamentals. This is feasible because the B and Y ions are the only f-generating roles considered, keeping the size of the graph and the ensuing search small. However, if these ions are not the dominant fragment types in the experimental spectrum, then the search may not produce the correct answer.

This chapter surveyed a range of solutions for peptide sequencing from mass spectra. Of these, the computational approaches are most applicable to the sequencing of novel unsequenced peptides. Several de novo MS/MS algorithms exist, but none of them are widely used [Man98], perhaps due to interest in more interactive forms of analysis [Bar98] and low confidence in the predicted answers. Appendix E chronicles some of the approaches we investigated, Chapter 7 discusses what we learned from them, and subsequent chapters propose a new global sequence-to-spectrum strategy for de novo MS/MS sequencing.

Chapter 7

Observations and Issues

We make several observations regarding the input spectrum and the scoring function in this chapter.
These are issues that an algorithm designer should be aware of, and no doubt other researchers have also encountered them in their study of the protein sequencing problem. Some of them may have been touched upon in other parts of this thesis, but here, we focus on their effects on sequencing attempts.

7.1 Spectrum-Related Issues

7.1.1 Gaps

Missing peaks can lead to degeneracy and uncertainty in the predicted sequence. Enough missing peaks can cause algorithms which iteratively grow sequence guesses based on supporting experimentals to terminate prematurely. In an attempt to bridge a gap, fundamental graph approaches often deliberately introduce dipeptide and tripeptide edges in the fundamental graph. The success of these bridging edges is limited because the existence of a gap, the size of a gap and its location are indeterminable from the spectrum. Longer bridging edges may allow for better sequence recovery in the presence of large gaps, but this increase in edge population, even with the addition of dipeptide bridging edges only, can lead to an explosive number of possible sequences to consider.

If, on the other hand, dipeptide/tripeptide bridging edges are not included in the construction of the graph, then another problem can occur. Assume for example that there is a gap in the dataset because some family of the correct sequence is not represented, but there is an alternate path already in the graph that bridges the gap. This bridge may act as a detour, leading the path back onto the path of the real sequence (defining a sequence that is the real sequence with a short substitution), but it may also diverge to a completely separate path (defining a sequence whose prefix or suffix is the same as that of the real sequence). Since no "dead end" is encountered, the algorithm would not even realize that a gap was indeed present.
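The growth in edge population caused by bridging edges can be made concrete with a short sketch. It assumes the twenty standard residues with their usual monoisotopic masses and a hypothetical mass tolerance `tol`; the function names `bridging_labels` and `count_edges` are illustrative and do not belong to any cited algorithm.

```python
from itertools import product

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def bridging_labels(max_len):
    """Map every residue string of length 1..max_len to its summed mass."""
    labels = {}
    for k in range(1, max_len + 1):
        for combo in product(RESIDUE_MASS, repeat=k):
            labels["".join(combo)] = sum(RESIDUE_MASS[r] for r in combo)
    return labels

def count_edges(node_masses, labels, tol=0.5):
    """Count (node pair, label) combinations whose mass difference fits."""
    nodes = sorted(node_masses)
    count = 0
    for i, lo in enumerate(nodes):
        for hi in nodes[i + 1:]:
            count += sum(abs((hi - lo) - m) <= tol for m in labels.values())
    return count

# Number of distinct edge labels with single, di- and tripeptide edges.
print(len(bridging_labels(1)), len(bridging_labels(2)), len(bridging_labels(3)))
# Two nodes one asparagine apart: N alone, then N or the dipeptide GG.
print(count_edges([0.0, 114.04293], bridging_labels(1)),
      count_edges([0.0, 114.04293], bridging_labels(2)))
```

Even this two-node case shows the ambiguity: once dipeptide edges are allowed, a single residue and a dipeptide of the same mass become indistinguishable explanations of the same pair of nodes.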
7.1.2 Immonium Interference

Immonium ions happen to be the same mass as the first A ion, so often, the node representing the base fundamental has a large fanout. Algorithms which work from the N-terminus to the C-terminus are immediately faced with a large number of possibilities for the first residue, and may choose an incorrect one. The earliest opportunity for such an algorithm to recover from this mistake is at the third residue. One solution is to reverse the edges of the fundamental graph and work from the C-terminus to the N-terminus. Another alternative is to require complete paths to begin and end with a dipeptide edge so that the first two and the last two residues are temporarily unresolved.

7.1.3 Mistaken Identities

A case of mistaken identity occurs when an experimental peak is interpreted incorrectly and assigned an identity other than its real one. This false identity can potentially support a sequence other than the actual one, particularly if the peak is of high intensity, thereby diverting the algorithm onto the wrong path.

7.1.4 Under-/Over-Represented Families

Families can suffer from under-representation and over-representation. A gap is the extreme case of an under-represented legitimate family, and algorithms that construct sequence guesses by extending partial sequences one residue at a time [IN86, JB89, YGHZ91, SZK95] will run into a dead end since a gap offers no supporting evidence for continued expansion of a partial sequence. Families that are not totally devoid of members, but are under-represented, face similar problems. Poor representation can lead to low scores, and prefix pruning can remove the partial sequence from the running, even though the algorithm may later reach a portion of the spectrum that is more redundant and richer in supportive fragments.

Over-representation of a legitimate family, namely redundancy of real fragments, is desirable.
However, over-representation of an illegitimate family can be problematic, especially if the bogus fundamental has a large number of supporters, many of which might be due to mistaken identities. Such extraneous peaks can lead the algorithm astray by artificially elevating the standing of incorrect candidate sequences so that the search may prune away the correct path.

7.1.5 Experimental Peak Heights

Some approaches normalize/scale the heights of experimental peaks within a spectrum. Others classify them into weak/medium/strong. When multiple supporters for a fundamental exist, scoring functions often sum together their abundances. Is there a different view of peak heights or a more intuitive way of taking them into account in the computation of the score? Here is one view:

"Peak intensities depend strongly on the physical and chemical properties of the analytes, so that it would be rash to assume that the more intense peaks were more "valid" than the weaker ones [but] because MS/MS spectra tend to exhibit much higher levels of apparently random noise, often a peak at every mass, it becomes essential for peaks to be selected on the basis of intensity." [PPCC99]

7.1.6 Mass Tolerances

A fragment ion of mass $m$ seldom appears as a peak centered at $m$ but rather at $m \pm \epsilon$ for some $\epsilon > 0$. Mass inaccuracies almost always occur in experimental mass measurements, and it is commonly assumed that a mass is within 0.5 daltons of the actual value ([SZK95, HFBG92], $\epsilon$ in [DAC+99b]).

When creating a fundamental graph or merging multiple datasets, algorithms will need to merge two nodes together if they correspond to the same peak. Dancik et al. write: "If we do not merge vertices that correspond to the same partial peptide, we will interpret meaningful peaks as noise. On the other hand, if we merge vertices that do not correspond to the same peptide, we may interpret noise as meaningful peaks" [DAC+99b].
Their merging algorithm [DAC+99b] combines fundamentals that are believed to correspond to the same fragment ion. But peak merging can be tricky; one difficulty with these algorithms is deciding when to cluster peaks together - e.g. which peaks should be merged if there are three peaks $p_1$, $p_2$ and $p_3$ such that $p_2 - p_1 < \epsilon$ and $p_3 - p_2 < \epsilon$ but $p_3 - p_1 > \epsilon$? Indeed, a patch was necessary [DAC+99a] since two peaks that were originally an amino acid distance apart may no longer be so after being separately merged with other peaks. In short, some means is needed for telling when two unequal mass values are close enough that they actually refer to the same theoretical peak. We address this issue with the use of checkpoints.

Checkpoints

Molecules are composed of atoms, and atoms are in turn comprised of neutrons, protons and electrons. Each of these fundamental particles has a particular mass: $m_n = 1.008664904$ amu, $m_p = 1.00727647$ amu and $m_e = 0.0005485799$ amu (note, 1 amu = 1 Da). The mass of every molecule must then be a linear combination $m_n N_n + m_p N_p + m_e N_e$, where $N_x$ represents the number of $x$ particles. If we assume that every proton is matched with an electron so that $N_p \approx N_e$ (MALDI ions are assumed to be singly charged, so here there is one more proton than electrons, but the weight of an electron is so small that we can ignore the difference), then $N_p m_p + N_e m_e$ is approximately $N_p (m_p + m_e)$, and we can combine the proton and electron masses together. Since a neutron and a proton are roughly equal in mass, we use the average of $m_n$ and $(m_p + m_e)$ as an approximation of a quantum of mass called a "checkpoint", a notion suggested by F. Tom Leighton. As a result, the mass of every molecule must be very close to some integer multiple of the checkpoint, and the range of possible masses is not continuous but discrete. (Note that it is actually not quite this simple - when protons and neutrons combine to produce the nucleus of an atom, some mass is lost in the form of energy.
The total mass is then the sum of the masses of the individual nucleons, less this fusion energy. We computed the average fusion energy loss per nucleon for each amino acid, weighted over the frequency of all amino acids. This value was found to be 0.0077143251 amu (see the calculation note below), and thus a checkpoint is basically $\frac{m_n + m_p + m_e}{2} - 0.0077143251$ amu.)

The checkpoint turns out to be a convenient "unit" of mass. Any mass calculations that we perform are done in checkpoint units, and our algorithms round the masses of all experimental peaks to the nearest checkpoint. This imposes an assumption that any errors in mass must be less than one half the distance between two checkpoints (a value slightly greater than 0.5 Da).

Calculation note: the average fusion loss was calculated in the following manner. (1) Let $A = \{C, H, N, O, S\}$, the set of elements found in proteins. Compute the fusion loss of atom $a \in A$ as the difference between the nuclear mass of its components and the published mass less the mass of any electrons, i.e. $\Delta f_a = (N_p m_p + N_n m_n) - (m_a - N_e m_e)$, where $N_x$ denotes the number of $x$ particles in atom $a$ and $m_a$ is the published mass of $a$. (2) For each amino acid $r$, calculate the total fusion loss of all its atoms as $\Delta f_r = \sum_{a \in A} \Delta f_a N_a$. (3) The fusion loss of a residue per nucleon is then given by $\Delta f_r / (N_p + N_n)$, and (4) the average fusion energy per nucleon, weighted over all residues, is $\mathrm{Avg}_r\left(\frac{\Delta f_r}{N_p + N_n} \, freq_r\right)$.

7.2 Scoring Function Issues

The scoring function is a critical component of any sequencing algorithm.

7.2.1 Uses of a Scoring Function

All approaches are dependent on a good scoring function, and in the literature, scoring functions are used in several ways:

1. to evaluate the amount of supportive evidence for a particular peak - in the fundamental graph approach, this corresponds to node and/or edge scores,
2. to evaluate and rank partial paths through a graph comprised of these nodes and edges, and
3. to correlate a complete sequence guess against the experimental spectrum.
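Returning briefly to the checkpoints of Section 7.1.6, the arithmetic can be checked with a small sketch. The particle masses and the fusion-loss constant are the values quoted in that section; the function names `to_checkpoints` and `same_peak` are hypothetical and only illustrate rounding to the nearest checkpoint.

```python
# Constants quoted in Section 7.1.6 (amu).
M_N = 1.008664904          # neutron
M_P = 1.00727647           # proton
M_E = 0.0005485799         # electron
FUSION_LOSS = 0.0077143251 # average fusion loss per nucleon

# Average of m_n and (m_p + m_e), less the average fusion loss.
CHECKPOINT = (M_N + M_P + M_E) / 2 - FUSION_LOSS

def to_checkpoints(mass):
    """Round a mass in amu to the nearest integer number of checkpoints."""
    return round(mass / CHECKPOINT)

def same_peak(m1, m2):
    """Two masses refer to the same theoretical peak if they share a checkpoint."""
    return to_checkpoints(m1) == to_checkpoints(m2)

print(CHECKPOINT)
print(to_checkpoints(114.04293), same_peak(113.08406, 114.04293))
```

Because each peak is assigned to a checkpoint independently of its neighbors, the three-peak ambiguity described for pairwise merging does not arise: two peaks are either in the same bin or they are not.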
Scoring Fundamentals

The score of a fundamental node is intended to serve as a measure of how likely the fundamental is to be a correct and legitimate one, and hence the more redundancy, the better the score. Node scores are often based on such factors as the number of supporter peaks [SB88, HFBG92], the intensity of each supporter peak [SB88, IN86, HFBG92] and the fragment type role of each supporter peak [FdCGS+99]. But one drawback of node scores is that they are often over-encompassing and can be misleading - the score is a combined score for all possible interpretations of supporting peaks in the spectrum. Not all of these interpretations may be compatible and simultaneously correct, so the node score can be artificially higher than is warranted.

Scoring Partial and Complete Paths

The score of a (partial or complete) sequence is a measure of how well the guess accounts for the observed data, and can be used to rank a pool of candidates. This is useful for deciding which partial path to expand next, or which to prune away. However, any such threshold or cutoff process introduces the risk that the path of the real sequence may be discarded, particularly when the real sequence scores poorly at first due to gaps/underrepresented peaks, but improves later on.

Scoring functions for paths can take into account peak intensity [ZTEB90, HFBG92], the number of matched experimentals, series continuity [ZTEB90, FdCGS+99] and even sequence length [ZTEB90]. Most account for series ions only, but some account for internal ions [JB89, FdCGS+99] and immoniums [SZK95, HFBG92]. There has, however, been no comprehensive systematic study of how to account for each factor - Should they contribute equally? If not, what weighting function(s) should be used?

7.2.2 Theoretical Peak Heights

Oftentimes, the scoring process involves generating theoretical spectra, and the simplest approach is to predict the daughter masses only.
A theoretical spectrum and an experimental one can match well if they have a lot of fragment masses in common. This seems promising, but the theoretical mass peaks, both strong and weak alike, are treated equally, and scoring functions can compare fragment masses, but they are not able to compare peak intensities. There is another dimension of information available from an experimental spectrum that can be harnessed - in addition to predicting fragment masses, it would be useful to predict fragment abundances that are indicative of a fragment's relative likelihood to occur. Were this possible, then not only would a high overlap in fragment masses matter, but also the general "shape" of the theoretical and experimental spectra.

7.2.3 Award / Penalty System

When comparing experimental and theoretical spectra, peaks fall into one of three categories: (I) matched peaks - peaks that are predicted and observed, (II) unaccounted experimental peaks - peaks that appear in the experimental but are not expected in the theoretical (and consequently attributed to noise), and (III) absent theoretical peaks - peaks that are present in the theoretical but not in the experimental.

Scoring functions in the literature have largely considered Category I peaks only. Awarding for matches gives a feel for how well an experimental spectrum agrees with what is theoretically expected, but it is only a part of a bigger picture. The candidate with the highest match score can still be a poor candidate if there are many unaccounted peaks in the experimental spectrum. Dancik et al. consider both Categories I and II by awarding for matches and penalizing for unaccounted experimentals.

Figure 7-1: Theoretical Masses and Experimental Peaks: Region A contains those theoreticals that are absent from the experimental spectrum, Region B contains matched peaks and Region C contains the unaccounted experimentals.
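The three categories can be computed directly from two mass lists, as in the sketch below. The award and fine weights, the tolerance and the function names are hypothetical placeholders chosen for illustration; they are not values or names taken from the literature discussed here.

```python
def classify_peaks(theoretical, experimental, tol=0.5):
    """Split peaks into Category I, II and III lists.

    I   matched: theoretical masses with an experimental peak within tol
    II  unaccounted: experimental peaks with no theoretical mass within tol
    III absent: theoretical masses with no experimental peak within tol
    """
    matched, absent = [], []
    for t in theoretical:
        hit = any(abs(t - e) <= tol for e in experimental)
        (matched if hit else absent).append(t)
    unaccounted = [
        e for e in experimental
        if not any(abs(t - e) <= tol for t in theoretical)
    ]
    return matched, unaccounted, absent

def award_penalty_score(theoretical, experimental,
                        award=1.0, fine_ii=0.5, fine_iii=0.5, tol=0.5):
    """Award matches, fine unaccounted experimentals and absent theoreticals."""
    matched, unaccounted, absent = classify_peaks(theoretical, experimental, tol)
    return (award * len(matched)
            - fine_ii * len(unaccounted)
            - fine_iii * len(absent))

theo = [100.0, 200.0, 300.0]
expt = [100.2, 250.0, 300.4]
print(classify_peaks(theo, expt))
print(award_penalty_score(theo, expt))
```

Setting `fine_ii` and `fine_iii` to zero recovers a pure match count, which makes it easy to compare the effect of each penalty on a ranking of candidates.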
We submit that penalizing for Category III is also important. The real sequence and a longer sequence guess that is only marginally different (e.g. an N in the correct sequence is replaced by the dipeptide GG) can have the same bonus for Category I and the same fine for Category II, but a slightly different Category III score, which may be enough to distinguish the two sequences from each other.

7.2.4 Accounting for Disallowed Variants

Recall that the score of a fundamental depends on its set of supporters, the score of a path is a function (often the sum) of its individual component fundamentals, and a complete path through a fundamental graph defines a complete peptide sequence. Including series variants in the set of f-generating roles is beneficial, but these ions are sequence dependent. Path scores may require adjusting to remove the contribution of variant supporters that turn out to be prohibited by the path's sequence.

7.2.5 Accounting for Internals

For the same reason of path dependence, it is difficult to account for supporting internal ions in a fundamental graph. Internal ion contributions should be included in the score of a complete path only when all fundamental nodes necessary for defining the partial path corresponding to the internal ion are visited by the complete path being scored.

7.2.6 Fragment Type Frequencies

The ability to quantify the tendency of a particular fragment type to occur in spectra can be useful, and several approaches attempt to do this (see Table 6.1). These fragment probabilities are useful for predicting peak heights in theoretical spectra [YEMS95, TJ97], but they are more often used to calculate the influence of supporting experimental peaks in different f-generating roles when scoring fundamentals in a fundamental graph [Bar90, DAC+99b, DAC+99a, TJ97, FdCGB95, FdCGS+99, FdCGB+98].
These fragment type frequencies are often arbitrarily chosen or set at values that are compatible with simple observations of fragment type likelihoods in spectra - e.g. the tendency of an A, B or Y series ion to occur appears greater than that of an internal ion. This results in a simple but very rough initial approximation of the relative frequencies which, in reality, are likely to depend on other factors such as the peptide sequence itself. Analyses of experimental spectra can lead to improved estimates of these frequencies. Fernandez-de-Cossio et al. [FdCGB95] arrive at probabilities based on "hundreds of CAD spectra" but details are not included in their paper. Recently, Dancik et al., in an effort to make their algorithms instrument-independent, describe a means for inferring the fragment types expressed by a particular MS instrument and deriving probabilities for them based on experimental spectra of known sequences gathered on the instrument [DAC+99b]. A better understanding of fragmentation could lead to more precise fragment type frequencies and scores that reflect supporter fragment type influences more accurately.

Chapter 8

Approach

8.1 Solution Schematic

The approaches that we have considered so far, those in the literature (Section 6.7) as well as those in our own explorations (Appendix E), have a common structural theme of two basic modules: one for guess evaluation and another for guess generation (see Figure 8-1). The former serves as a measure of how good a guess is and thus requires access to the input spectrum, while the latter is the source of these guesses. Variations of this schematic exist: some approaches use the input spectrum to aid in guess generation and others use the score of the current guess as a feedback mechanism for choosing the next guess (see dotted line of Figure 8-1). Fundamental graph-based approaches transform the input spectrum into a fundamental graph which can then be used in the guess generation process.
One could easily imagine additional dotted arrows representing other variations of this schematic - e.g. the use of the fundamental graph as a means for generating guesses by enumerating some path through the graph.

Figure 8-1: Schematic of Key Algorithmic Components (Guess Generation and Guess Evaluation)

This chapter describes an implementation for each of these algorithmic components. Our evaluation module consists of a simple model for fragmentation and a scoring function based on this model. Despite the fact that our understanding of the rules governing fragmentation is incomplete (Section 3.2), we begin with the development of a simple probabilistic model for a trial, the fragmentation of a single peptide molecule. Using this model, all possible outcomes of a single trial can be predicted from the peptide's sequence, and in addition, a probability can be associated with each outcome. Once we have a probability distribution describing the outcomes of a single trial, if we view a tandem mass spectrum as a summary of the outcomes of a number of independent single trials, then we can compute the probability that a particular set of outcomes could have occurred. Given this, we have a means for scoring a sequence guess against an experimentally observed spectrum.

Our generation module involves a simulated annealing search strategy that uses this scoring function to efficiently explore a space of possible sequence guesses with the objective of finding the optimal scoring sequence.

8.2 Modelling the Fragmentation of a Single Molecule

We begin by examining the fragmentation of a single peptide molecule p. A simplified view of the process is as follows:

1. a parent molecule is protonated, becoming positively charged,
2. the parent ion may then fragment into any number of pieces,
3. only the piece retaining the charge is capable of being detected and, if detected, would serve as this molecule's contribution to the mass spectrum.
We develop a fragmentation model incrementally by beginning with a very simple one, incorporating various improvements into each successive model, and ending with a model that is more complex but also more reflective of what actually happens. Each model can be described by a fragmentation tree, a decision tree where the root node represents the intact parent molecule and the leaves represent all possible outcomes that a single parent molecule can produce when fragmented under a particular model. Edges represent certain decisions in the fragmentation process and probabilities are associated with every edge. A path from the root to a leaf of the tree describes the series of events responsible for the ion outcome specified at the leaf. The probability of this outcome is the product of the probabilities of all edges in this path (so the order in which these events happen does not affect the probability of the outcome).

8.2.1 Model I: Modelling Series Ions

The simplest model for a trial involves at most one single cleavage event so that the ion that results is either a prefix (an A ion or a B ion), a suffix (a Y ion) or the original molecule (M+H) left intact. The Model I fragmentation tree for the short peptide DRVY is illustrated in Figure 8-2. The generalization to any arbitrary peptide can be easily made.

A break can occur at any break position $p_i$ of the peptide with some probability $P_{p_i}$ such that $\sum_i P_{p_i} = 1$.
For all internal break positions $p_i$ ($1 \le i < n$ where n is the length of the sequence), there are two possible bonds along the peptide backbone that can be cleaved (the $C_\alpha$-CO bond and the peptide bond) with probabilities $P_a$ and $P_y = 1 - P_a$, respectively.¹ The type of series ion that results depends on which bond is broken and which piece is retained: cleavage of the $C_\alpha$-CO bond (see Figures A-1 and 3-1) produces an A ion only (X ions are not prevalent in MALDI-PSD), while cleavage of the amide bond produces a B ion with probability P and a Y ion with probability (1 - P).

¹ There is a third bond in the backbone, the NH-$C_\alpha$ bond, but cleavage of this bond produces the C and Z series ions which are not prevalent in MALDI-PSD spectra.

Figure 8-2: Model I: Basic Fragmentation Tree. A single break produces only prefixes and suffixes. (The figure shows the parent ion DRVY, from N-terminus to C-terminus, with break positions 0 through 5 and the A ion, B ion, Y ion and M+H leaves reached from each position.)

Note that the C-terminal group is treated as if it were a residue in that it is flanked by break positions. Cleavage can thus occur on either side of the C-terminal group: a break on the N-terminal side leads to a prefix ion for DRVY only, while a break on the C-terminal side, called a "non-break", implies that the molecule remained intact. A "non-break" can also occur at the N terminus at break position 0. Unlike the C-terminal group, the N-terminal group was not generalized (although this can be easily done) because a break on either side of a hydrogen N-terminal group would have produced the same structure.

8.2.2 Model II: Modelling Internal Ions

A single cleavage event was enough to generate prefix and suffix series ions.
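The single-cleavage Model I of Section 8.2.1 can be sketched as an enumeration of leaves, each with a probability equal to the product of its edge probabilities. This is a minimal illustration under stated assumptions, not the thesis implementation: break positions are taken to be equally likely, the values of `p_a` and `p_b` are arbitrary, and at the break position just before the C-terminal group the whole amide-bond probability is assigned to the B ion (the text only says that a prefix results there). All names are hypothetical.

```python
def model1_outcomes(seq, p_a=0.3, p_b=0.5):
    """Enumerate Model I leaves for a peptide as {(ion_type, fragment): probability}.

    Assumptions (not from the source): uniform break-position probabilities,
    illustrative p_a / p_b values, and B-only amide cleavage at position n.
    """
    n = len(seq)
    p_pos = 1.0 / (n + 2)          # break positions 0 .. n+1, assumed uniform
    p_y = 1.0 - p_a                # probability that the amide bond is the one cleaved
    outcomes = {}

    def add(label, prob):
        outcomes[label] = outcomes.get(label, 0.0) + prob

    # Positions 0 and n+1 are "non-breaks": the parent stays intact.
    add(("M+H", seq), p_pos)
    add(("M+H", seq), p_pos)

    for i in range(1, n + 1):
        prefix, suffix = seq[:i], seq[i:]
        add(("A", prefix), p_pos * p_a)                 # C_alpha-CO cleavage
        if i < n:
            add(("B", prefix), p_pos * p_y * p_b)       # amide cleavage, prefix kept
            add(("Y", suffix), p_pos * p_y * (1 - p_b)) # amide cleavage, suffix kept
        else:
            add(("B", prefix), p_pos * p_y)             # assumed: prefix only at position n
    return outcomes


if __name__ == "__main__":
    for (ion, frag), prob in sorted(model1_outcomes("DRVY").items()):
        print(f"{ion:>3} {frag:<5} {prob:.4f}")
```

Because every path from the root ends in exactly one leaf, the leaf probabilities sum to one, which is the property the scoring function later relies on.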
In order to generate internal ions, however, two cleavage events must occur. The Model II fragmentation tree, then, is basically the same as the Model I fragmentation tree, except every leaf is allowed to undergo a second round of breaking. When a prefix (suffix) undergoes a second cleavage event, it can produce another prefix (suffix) or an internal ion. If a molecule survived the first break intact (a non-break), then the products of its second round of breaking resemble a Model I tree. A partial Model II fragmentation tree for DRVY is shown in Figure 8-3.

Some remarks about Model II are in order:

Multiple Pathways for the Same Ion. There can be several leaves in the fragmentation tree that represent ions with the exact same identity. This means there may be multiple pathways (either a different set of events or a different ordering of the same events) to generate the same fragment ion. For example, there are two ways depicted in Figure 8-3 to produce the YA internal RV ion. In general, there are more ways to produce shorter peptides than longer ones.

Different Ions With the Same Mass. The resulting fragment ions may not have unique masses; there can be two ions with different identities that have the identical mass. Peaks at these masses are said to have multiple identities.

A Ion Formation. Our model allows for the possibility of a B ion, formed from the first break, to produce an A ion after the second break. In reality, there is no conclusive evidence for this, or for the contrary.

8.2.3 Model III: Modelling Variants

The side chains of certain amino acids can lose ammonia/water or gain water.
Model III addresses the ability of a fragment to exhibit a neutral loss/gain variant according to the following assumed rules:

* a peptide containing S, T, D or E anywhere can lose water
* a peptide containing R anywhere can lose ammonia
* a peptide containing R, K or H can gain water only if the R, K or H residue is the C-terminal residue

Figure 8-3: Model II: A partial Model II fragmentation tree showing the second stage of cleavage for two Model I leaves. When two breaks are possible, immonium and internal ions are added to the repertoire of fragment types.

Note that in our model, we have selected and included some of the more agreed upon rules in the literature for variants and when they are allowed. A sampling of the many other observations, which have not been incorporated into our model, includes:

losing ammonia: N, Q, K also may lose ammonia [str95, FHM+93]; K also may lose ammonia [CLS99]; an N-terminal Q may also lose ammonia [Hin97].

losing water: only S, T lose water [CLS99]; S, T, D, E are more prone to lose water if they are the ultimate or penultimate residues [Hin97].

gaining water: Bp18 ions only occur for the (n - 1)st and (n - 2)nd B ions (i.e. the B ion of $f_{n-1}$ and the B ion of $f_{n-2}$), where n is the length of the peptide [BC].

Since variants are residue dependent, at least one variant-permitting (or v-permitting) residue must be present (at the required location if applicable) in order for the corresponding variant to be even possible.
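The three assumed rules above translate directly into a membership test on a fragment's residues. The sketch below is illustrative only: the function name is hypothetical, the labels m18, m17 and p18 (loss of water, loss of ammonia, gain of water) follow the naming used for variant ions in this chapter, and "C-terminal residue" is read here as the last residue of the fragment being examined.

```python
def permitted_variants(fragment):
    """Return the set of neutral loss/gain variants a fragment may express.

    'm18' = loses water, 'm17' = loses ammonia, 'p18' = gains water.
    The C-terminal rule is applied to the fragment's own last residue
    (an interpretation, not a statement from the source).
    """
    variants = set()
    if any(residue in "STDE" for residue in fragment):
        variants.add("m18")   # S, T, D or E anywhere permits water loss
    if "R" in fragment:
        variants.add("m17")   # R anywhere permits ammonia loss
    if fragment and fragment[-1] in "RKH":
        variants.add("p18")   # R, K or H at the C-terminal end permits water gain
    return variants


if __name__ == "__main__":
    for frag in ("DR", "RVY", "VY", "GK"):
        print(frag, sorted(permitted_variants(frag)))
```

Under this reading the prefix DR permits all three variants, while VY permits none and would always appear unmodified.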
The number of instances of each v-permitting residue is assumed to affect how likely the variant will occur as well. Let $v_{m18}$, $v_{m17}$, $v_{p18}$ be the number of v-permitting residues for each variant. If a variant x also has associated with it a "per-instance" likelihood of occurring, $\mathrm{tendency}_x$, then the probability that variant x is expressed is given by:

\[
\frac{v_x \cdot \mathrm{tendency}_x}{v_{m18} \cdot \mathrm{tendency}_{m18} + v_{m17} \cdot \mathrm{tendency}_{m17} + v_{p18} \cdot \mathrm{tendency}_{p18}} \tag{8.1}
\]

The Model III fragmentation tree is the same as its predecessor except each leaf of the Model II tree can now either appear as is with probability $P_{novariant}$ (this probability equals 1 if no v-permitting residues are present at all), or appear as a variant with the appropriate probability computed using Equation 8.1.

Figure 8-4: Model III: Each Model II leaf may express a variant and the decision process is shown here only for two leaves from Figure 8-3.

8.2.4 Model IV: Modelling Residue Tendencies

Variants were the first example of residue-dependent fragmentation. Aside from variants, there is little in the model so far that would differentiate the fragmentation tree of one peptide from that of another, save for differences in the set of masses produced that are due to peptide composition and peptide length.

Certain residues are more prone to fragment than others, and certain bonds may be more likely to break depending upon which residues flank them. Lin and Glish remark that "the current state of knowledge is insufficient to predict or understand routinely which [of these] bonds will be broken, and when a bond is broken, which end of the dissociated peptide will retain the charge." [LG98] Again, the complete picture has not yet been elucidated, but certain tendencies have been observed.
To incorporate what little is known, we introduce a 22x22 square non-symmetric matrix² M of break tendencies where each entry M[i, j] represents the likelihood that a bond flanked by residues i and j will break. This is a value determined by a training process discussed in Section 9.2.1. Note that for M[i, j], residue i is N-terminal to residue j, so it is not necessarily the case that M[i, j] = M[j, i].

² In addition to the basic residues, this matrix also includes entries for the C terminal group and a "null" entity for handling non-breaks, hence 22. Since hydrogen is the N terminal group, we do not include an entry for the N terminal, but this can be easily generalized if desired.

Matrix M allows us to evaluate the likelihood that a bond will break based on the actual residues that flank the break. The probability that a break will occur at a particular break position can be computed by finding the ratio of its likelihood to the sum of the likelihoods of all bonds in the molecule, so that the break probabilities sum to one. In practice, the actual location of the bond in the peptide may have an effect as well, but aside from the bonds positioned at the termini, our model does not account for bond location.

8.3 Accounting for Noise

Our models so far have been concerned only with legitimate fragments, but real data is afflicted with various forms of noise, and any model that purports to realistically explain experimental data cannot ignore them. The following is a discussion of which noise categories from Section 5.2.2 are accounted for in our model, Model V, and which are not:

8.3.1 Physical Noise

Physical-Chemical Properties of Peptides. Matrix M of Section 8.2.4 accounts for some of the effects the residues themselves may have on the fragmentation process. Low matrix values correspond to bonds between certain residue pairs that are unlikely to break. This may lead to an under-represented family in the resulting spectrum.
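The way matrix M turns residue-pair tendencies into break probabilities can be sketched as a simple normalization. This is a simplified illustration, not the trained model: it covers only the bonds between adjacent residues (the C-terminal group and the "null" non-break entry are left out), the tendency values are made up, and the names `break_probabilities`, `tendency` and `default` are hypothetical.

```python
def break_probabilities(sequence, tendency, default=1.0):
    """Normalize pairwise break tendencies into per-bond break probabilities.

    tendency maps (n_terminal_residue, c_terminal_residue) -> likelihood,
    so tendency[("D", "R")] and tendency[("R", "D")] may differ.
    Pairs missing from the mapping fall back to `default` (an assumption).
    """
    pairs = list(zip(sequence, sequence[1:]))        # residues flanking each bond
    weights = [tendency.get(pair, default) for pair in pairs]
    total = sum(weights)
    return [weight / total for weight in weights]    # ratio to the sum over all bonds


if __name__ == "__main__":
    toy_tendency = {("D", "R"): 2.0, ("R", "V"): 1.0, ("V", "Y"): 1.0}
    print(break_probabilities("DRVY", toy_tendency))
```

A low entry for a residue pair shrinks the probability of every outcome that requires that bond to break, which is how an under-represented ion family arises in this picture.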
Other more complex interactions, based on chemical properties and residue location for example, can exist but are not taken into account.

Impurities. The current model addresses only those outcomes possible from the fragmentation of the desired parent peptide. It is not concerned with fragments of molecules of origin other than the parent, such as impurities. If impurities are known to be present, one might account for them by modifying the concept of a trial to include fragmentation events for other molecules. In our fragmentation tree, we assumed that the root represented the parent peptide molecule. Instead, one might envision the root as a decision point that branches to one of several fragmentation trees, one for each molecule possibility. Thus, with some probability, the molecule being fragmented is the parent molecule, and its branch would lead to a fragmentation tree like the ones we have been describing. With some other probability, a contaminant branch is chosen, leading to the appropriate fragmentation tree for this other molecule. The probabilities of these choices sum to one and can perhaps be made dependent on the relative concentrations of each species. Recall that in order to influence the resulting tandem mass spectrum, the effect of an impurity must survive the filtration effects of the timed ion selector, so it is not immediately obvious how to incorporate impurities into the model, which may depend heavily on when, where and how the impurity is introduced.

8.3.2 Measurement Noise

Mass Inaccuracies. Mass inaccuracies are handled on two fronts: (1) the input experimental spectrum is subjected to a processing step where the experimental masses are converted to the nearest checkpoint (see Section 7.1.6), and (2) the masses of the fragments predicted by the model are given in terms of checkpoint values. Again, this is acceptable so long as the experimental peaks are correct to within roughly 0.5 daltons.
Height Inaccuracies

Undetected Trials. Every ion that is successfully produced has some probability of not making it into the final observed spectrum. To model this, each leaf of the Model IV fragmentation tree is taken through an additional decision step, where with some probability, $P_{unobserved}$, an ion can be lost and unobserved.

Basal Noise. Low intensity basal noise existing across the entire mass range artificially raises the intensities of all peaks in the spectrum. With probability $P_{random}$, a trial results in random noise, which means the detector registers it at some mass according to some random distribution.

Peak Detection Errors. This is not directly accounted for in our model because this occurs at the spectrum level, not the trial level which we are modelling - i.e. peak detection errors occur separately, after the outcomes of all trials have completed and are tabulated. If the behavior of the peak detection software were known (how often and with what distribution it fails to detect a real peak or introduces a spurious peak), one might be able to incorporate its behavior into the model.

8.4 Assumptions

The fragmentation model that we have presented thus far makes the following assumptions:

* Spectrum Features

  Single Protonation. Only ions with a single positive charge are produced.

  Core Fragment Types. The A, B, Y ions and their loss/gain variants.

  Mass Error. When a trial produces a legitimate ion, the mass that is registered is within 0.5 daltons of the actual mass.

* Variant Assumptions

  Residue and Fragment Type Dependence. Variant formation is subject to the rules in Section 8.2.3.

  Multiple Loss/Gain Events. Multiple variant events, e.g. the loss of both water and ammonia, are theoretically possible, where the sequence permits, but are assumed to be a rare occurrence. We consider only the loss/gain of at most one neutral molecule (either water or ammonia).
  Influence of V-permitting Instances. The more instances of a variant-permitting residue, the more likely that variant will occur. There has been no formal study of this in the literature, but it is reasonable to assume that more instances implies more sites liable to undergo a loss/gain event. However, the actual natural probabilities governing this may differ from those used by our model.

The scoring function makes the following assumptions:

  Independent Trials. A tandem mass spectrum is the collective sum of a number of independent trials.

The current implementation assumes:

  Terminal Groups. H- and -OH are the N-terminal and C-terminal groups respectively of the parent peptide. The algorithm can be modified so that the masses of both terminal groups can be specified by the user.

  Unmodified Basic Residues. The parent peptide is composed of residues drawn from the 20 basic unmodified residues. Modified amino acids can be accounted for by including their masses in the list of residues the algorithm knows about.

For the remainder of this thesis, we also assume that our model and its probability distributions hold for the fragmentation of all instances of a specific peptide and across all peptides. In reality though, there exists some experimental variance in such acquisition conditions as the laser intensity, the sample concentration, the duration of laser irradiation, etc.

8.5 From Model to Scoring Function

8.5.1 Computing the Probability Mass Function

Once the Model V fragmentation tree for peptide p is constructed, it is simple to derive the associated probability mass function (PMF) $F_p$ from the set of outcome masses and their probabilities. The PMF $F_p$ is defined for all checkpoint masses m, from 0 to the checkpoint of the parent, such that $0 \le F_p(m) \le 1$ and $\sum_m F_p(m) = 1$. The probability of a multiple identity mass is the sum of the probability of each identity.
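Collapsing the leaves of a fragmentation tree into a PMF amounts to summing leaf probabilities by checkpoint mass. The sketch below assumes the leaves are already available as (checkpoint mass, probability) pairs; the function name and the toy numbers are hypothetical and serve only to show how multiple identity masses accumulate.

```python
from collections import defaultdict


def build_pmf(outcomes, tolerance=1e-9):
    """Sum leaf probabilities by checkpoint mass to obtain a PMF.

    outcomes: iterable of (checkpoint_mass, probability) pairs, one per leaf.
    Leaves that share a mass (multiple identities) are added together.
    """
    pmf = defaultdict(float)
    for mass, probability in outcomes:
        pmf[mass] += probability
    total = sum(pmf.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"leaf probabilities sum to {total}, expected 1")
    return dict(pmf)


if __name__ == "__main__":
    # Two different leaves land on mass 157, so their probabilities are summed.
    leaves = [(157, 0.2), (157, 0.1), (555, 0.3), (1060, 0.4)]
    print(build_pmf(leaves))
```

The check that the total is one mirrors the requirement $\sum_m F_p(m) = 1$ and is a convenient guard against a tree whose edge probabilities were specified inconsistently.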
Basal noise is modelled as a uniform distribution, so the probability of basal noise is added to the probability of every checkpoint mass.

8.5.2 Evaluating Sequence Guesses

Recall that a tandem mass spectrum is essentially the aggregate sum of individual independent trials. Let $S = \{s_1, \ldots, s_r\}$ be a tandem mass spectrum with r experimental peaks. Let $m(s_i) = m_i$ and $h(s_i) = h_i$ be the mass and height respectively of peak $s_i$. Given a particular sequence p, the model predicts a fragmentation tree and PMF $F_p$ for a single trial. This PMF provides a means for evaluating the likelihood that the fragmentation of a number of molecules of this sequence could have given rise to S. This probability can be computed using the multinomial distribution:

\[
\mathrm{Prob}(S, F_p) = \binom{N}{h_1} F_p(m_1)^{h_1} \binom{N - h_1}{h_2} F_p(m_2)^{h_2} \cdots \binom{N - \sum_{k=1}^{r-1} h_k}{h_r} F_p(m_r)^{h_r} = \frac{N!}{h_1! \, h_2! \cdots h_r!} \, F_p(m_1)^{h_1} F_p(m_2)^{h_2} \cdots F_p(m_r)^{h_r}
\]

where N is the total number of trials (molecules).³ We use this as a scoring function for evaluating different sequence guesses by examining how likely each guess explains S, preferring that sequence which maximizes $\mathrm{Prob}(S, F_p)$. It will be convenient to compute the logarithm of the probability,

\[
\log \mathrm{Prob}(S, F_p) = \sum_{i=1}^{r} h_i \log F_p(m_i)
\]

and then minimize $-\log \mathrm{Prob}(S, F_p)$ instead. Note that the multinomial coefficient has been dropped because when different sequence PMFs are scored against the same S, this term is constant and can be omitted with no effect on score comparisons.

³ Recall that some number of total molecules N were fragmented, and that a spectrum reports the subset that are successfully detected.

8.5.3 Scoring Function Maximum

Given a spectrum S, how is the scoring function affected by different peptide guesses (and hence different PMFs)? In particular, if the best sequence guess ought to be the sequence with the highest probability, when, then, is the scoring function maximal? The scoring function is maximal when the spectrum resembles the PMF, i.e. the heights of the observed peaks are proportional to the heights of the PMF distribution.

For the simplest case, when $f(k) = \binom{n}{k} p^k (1 - p)^{n - k}$, a proof of the claim that f(k) is maximal when $k = pn$ was given by David Stephenson [Ste00] and entails showing that $f(pn - 1) < f(pn)$ and $f(pn + 1) < f(pn)$. In Appendix F, we generalize this result to $f(pn - \delta) < f(pn)$ and $f(pn + \delta) < f(pn)$ for $\delta > 0$, and show that this is true for the multinomial case as well.

The take home message is that if we could model fragmentation perfectly and if the experimentally obtained spectrum were identical in shape to the calculated PMF distribution for large enough N (by the law of large numbers), then the real sequence would be the best scoring.

8.6 Exploring the Search Space

Given a number of sequence guesses, one could find the PMF for each and compute the probability that each sequence produced the observed experimental dataset. If the real sequence is present among these guesses and the model is good, then the real sequence will in fact be the best scoring. The problem we are now faced with is how to generate these guesses and how to ensure that the correct sequence is among them. If this were possible, then the scoring function would be able to separate the wheat from the chaff, identifying the correct sequence from a pool of many wrong ones.

The naive strategy is to enumerate all possible amino acid sequences whose mass equals that of the parent mass, evaluate the scoring function on each of them and pick the best scoring one as the sequence prediction. For anything but the smallest peptides, such an exhaustive approach is computationally infeasible, requiring $O(20^N)$ guesses, where $N = \lceil (M + H) / R_{max} \rceil$. Here, (M + H) is the parent mass and $R_{max}$ is the maximum basic residue weight. Spectrum-dependent pruning techniques may help, but if the experimental data is poor, there is a good chance the real sequence may not survive.
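Whatever the strategy for producing guesses, each one is evaluated with the log-probability of Section 8.5.2. The following sketch shows that evaluation together with the naive "score every candidate and keep the best" selection on a toy pool. It is an illustration under assumptions: spectra and PMFs are plain dictionaries keyed by checkpoint mass, the names are hypothetical, and the small `floor` value is a stand-in that keeps the logarithm finite (in the model itself, basal noise gives every checkpoint mass a nonzero probability).

```python
import math


def neg_log_score(spectrum, pmf, floor=1e-12):
    """Return -sum_i h_i * log F_p(m_i), with the multinomial coefficient omitted.

    spectrum: {checkpoint_mass: peak_height}
    pmf:      {checkpoint_mass: probability} for one sequence guess
    floor:    assumed lower bound on probabilities, only to avoid log(0)
    """
    return -sum(
        height * math.log(max(pmf.get(mass, 0.0), floor))
        for mass, height in spectrum.items()
    )


def best_guess(spectrum, candidate_pmfs):
    """Naive selection: score every candidate PMF and keep the minimum."""
    return min(candidate_pmfs, key=lambda guess: neg_log_score(spectrum, candidate_pmfs[guess]))


if __name__ == "__main__":
    spectrum = {100: 3, 200: 1}
    candidates = {
        "shape_matches": {100: 0.75, 200: 0.25},   # proportional to the peak heights
        "shape_differs": {100: 0.25, 200: 0.75},
    }
    for name, pmf in candidates.items():
        print(name, round(neg_log_score(spectrum, pmf), 4))
    print("best:", best_guess(spectrum, candidates))
```

In this toy case the candidate whose PMF is proportional to the observed heights receives the lower (better) score, in line with the claim of Section 8.5.3. The cost of the naive approach lies entirely in the size of the candidate pool, not in the per-candidate evaluation.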
Instead, we use an efficient combinatorial minimization technique called simulated annealing to explore probabilistically through the space of possibilities.

Simulated annealing is a stochastic means for finding the minimum scoring state of a system through gradual descent [KGJV83, BT93, PTVF95]. It finds its origin in the thermodynamic process of cooling. When a liquid freezes, it crystallizes; if the temperature is reduced gradually, the system settles into its minimum energy state and the crystal formed is pure. With simulated annealing, the algorithm starts at some initial temperature, and with each iteration, this temperature is gradually lowered according to some cooling schedule. Some starting sequence is randomly selected, and the search space is explored by making random modifications, called moves, first to the initial sequence and then to all subsequent sequences. Any move resulting in a better scoring sequence (a lower energy state) is immediately taken. Any move resulting in a higher energy state is allowed with some probability that depends on the current temperature. These uphill events allow the search to escape local minima, but they are less likely to occur as the algorithm progresses (and the temperature drops). Thus, the start of the algorithm is essentially a random exploration of the space as long range moves have a higher probability of being taken, while more fine tuning is done at lower temperatures. It has been shown that if the cooling schedule is slow enough, then the algorithm will converge on the global optimum with high certainty [Haj88].

The simulated annealing specification for our problem is as follows:

Configuration. A peptide sequence guess is a string of amino acid residues $r_1 \ldots r_n$ such that $\sum_i m(r_i)$ is constrained to equal the mass of the parent molecule of interest.

Rearrangements. The simulated annealing algorithm can make three different types of sequence moves⁴:

1. Permutation: randomly rearrange the order of the residues of a random subsequence.
2. Reversal: reverse the order of the residues of a random subsequence.
3. Substitution: replace a random subsequence with a different random subsequence of the same mass, but not necessarily of the same length.

⁴ The choice of moves is based on an observation that sequence prediction algorithms frequently produce a list of sequence guesses that are often very similar, differing only in small portions of the sequence [TJ97].

Note that there exists a finite series of moves leading from one sequence to any other in the space of sequences whose mass equals that of the parent mass. The reverse move is a special case of the permute move, and the permute move is a special case of the substitute move. The identity move is not allowed. We also require that the pre-move and post-move sequences have the same checkpoint so that moves are mass-preserving. If, instead, we had allowed the mass of the post-move sequence to be within some tolerance of the pre-move sequence mass, then mass skew could result: there exists a series of moves such that each individual move produces a sequence within the required mass bound, but the mass of the final sequence is more than the allowed tolerance from the mass of the original starting sequence. This is an example of the usefulness of checkpoints.

Objective Function. The function to be optimized is based on the scoring function $\mathrm{Prob}(S, F_g)$, i.e. the likelihood that a particular sequence guess g with a fragmentation outcome distribution $F_g$ produced the observed experimental data S. Recall that we desire the g that produces the maximum $\mathrm{Prob}(S, F_g)$ value; this corresponds to finding a sequence that yields the minimum $-\log \mathrm{Prob}(S, F_g)$ value.

Annealing Schedule. The cooling schedule consists of several parameters:

1. initial temperature, $T_0$,
2. number of temperature changes allowed, tempsteps,
3. temperature change factor $k < 1$ such that $T_{i+1} = k \cdot T_i$, $i \ge 0$,
4. max number of sequences to try at $T_i$, nover,
5. max number of successful moves at $T_i$ before continuing to $T_{i+1}$, nlimit.

Initially, arbitrary values were chosen for these parameters, so the annealing schedule was:

* $T_0$ = 100,000,
* tempsteps = 100,
* k = 0.9,
* nover = 100 * avgLength,
* nlimit = 10 * avgLength,

where avgLength is equal to the ceiling of the parent mass divided by the average amino acid mass.

An implementation of the simulated annealing algorithm, based on an example in [PTVF95], was written in Java. In addition, we modified the algorithm so that it would keep track of and report the minimum scoring sequence that it encountered during its search.

8.7 Summary

We have proposed a model for fragmentation based on some simple rules. Given a sequence, the model predicts a tree of possible fragmentation outcomes, and this fragmentation tree in turn describes a PMF from which a scoring function can be derived. A simulated annealing search can then use this model and its scoring function as the basis for designing an efficient strategy for traversing the vast space of sequence guesses. The next chapter contains a more detailed analysis of the effectiveness and performance of our approach.

Chapter 9

Testing the Model and Its Scoring Function

This chapter evaluates the model and the scoring function module of our approach. We begin with a discussion of some experimental data that we acquired. Then we describe how the model was trained using this data, and what effect this had on the scoring function.

9.1 Training Data

We acquired some experimental spectra locally so that we could gain a hands-on feel and understanding for the acquisition process (see Appendix B), examine and analyze sample data, and use the data to train our model and validate our algorithm. A total of six MALDI-PSD datasets were obtained: four for angiotensin (DRVYIHPFHL) and two for bradykinin (RPPGFSPFR), and they are included in Appendix C.
Note that the 1205 dataset for angiotensin is believed to be problematic because of a calibration error encountered during acquisition. Ordinarily, such a dataset would have been immediately thrown out, but it was retained so that we might study the effects a poor dataset might have on our model.

Incidentally, the mass spectrometer came with a diagnostic PSD spectrum¹ for angiotensin, acquired on the instrument independently by its manufacturer.

9.1.1 Observations of Training Spectra

A number of observations can be made from simple visual inspection of our PSD data and of spectra appearing in the literature². The key issue is whether our model is consistent with actual data, and these observations represent features that the model designer should be aware of.

The most immediate observation is that there is a wide range of peak intensities, with the parent ion being one of the highest peaks, if not the highest. Other observations concern the distribution of peak masses, the different fragment types present and the problem of unknown peaks.

Distribution of Peak Masses. With regard to the shape of the spectrum or the overall distribution of masses, there are more peaks concentrated in the lower mass region than in the higher mass range. There are two plausible types of explanations for why this might occur: the first pertains to the fragmentation process and the second, the fragment distribution.

Large molecules, which have a larger mass to charge ratio, travel at slower velocities, perhaps slow enough that they are below the threshold necessary to trigger the collision-induced electron conversion mechanism for detection [Bea92]. Alternatively, perhaps the energy imparted to a molecule is so great that the molecule shatters into smaller pieces, making it unlikely to encounter large intact fragments [Mat98].

The distribution of input fragments may be another explanation for the observed spectrum shape.
From a mathematical viewpoint, assuming n is the length of the peptide, there are O(n) core ions, and these ions are evenly spread out across the range of masses. There are also O(n^2) possible internal ions, but there are more shorter internal ions than longer ones; for example, there is only 1 internal ion of length (n - 2) but n - 3 internals of length 2. As a result, the concentration of lower mass fragments is naturally greater.

Footnote 1: Located at C:\VOYAGER\FACTORY\INSTALL\PSD\PARENOO1.MSA of the Voyager computer.
Footnote 2: [CBB96], Figure 1; [CLS99], Figures 2,3; [Spe97], Figure 20; [JnC96], Figures 4a,5; [LL95], Figures 3,4; [KSL93], Figure 3; [KKS94], Figure 6.

Fragment Types Present in Spectra

A comparison study of spectra from different mass spectrometers is found in [RYM95], and it includes an analysis of the fragment types present in each type of spectra. The breakdowns that they found for their MALDI-PSD spectra (taken from pie charts in Figures 1,2,3,4 of [RYM95]) are reported in Table 9.1.

Peptide Sequence          prefixes  suffixes  internals  immoniums
PPGFSPFR                  32%       28%       31%        9%
RPPGFSPF                  38%       15%       36%        11%
DRVYIHPFHL                33%       15%       14%        38%
RPVKVYPNGAEDESAEAFPLEF    59%       3%        18%        20%

Table 9.1: Peak Classification from [RYM95]

Since the sequences for our datasets are known, we can likewise confirm the presence of immoniums, cores, internals, variants and noise in our data (footnote 3). Table C.7 of Appendix C lists various statistics for our datasets: size, number of experimentals accounted for, number of multiple identity peaks present, number of each ion present and various other totals. Multiple identity peaks are multiply counted in the fragment type tally of each identity.

We are particularly interested in the presence of internal ions, which accounted for about 20% or more of the total number of peaks in each dataset, and at least 33% of all experimental peaks with known identities.
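The counting argument for internal ions given above (one internal of length n - 2, but n - 3 internals of length 2) can be checked with a short sketch. It assumes that an internal fragment is a contiguous run of residues containing neither terminal residue, and that single residues are tallied separately as immoniums; the function name is hypothetical and not part of the thesis code.

```python
def internal_ion_counts(n):
    """Return {length: count} of internal fragments for an n-residue peptide.

    Assumption: an internal fragment is a contiguous run of residues that
    excludes both the first and the last residue. Lengths start at 2 because
    single residues are treated here as immonium ions.
    """
    interior = n - 2  # residues left once both termini are excluded
    return {length: interior - length + 1 for length in range(2, interior + 1)}


counts = internal_ion_counts(10)   # a 10-residue peptide such as angiotensin
total = sum(counts.values())

for length, count in counts.items():
    print(f"length {length}: {count} internal fragment(s)")
print("total internals:", total)
```

The count falls linearly with fragment length while the total grows quadratically with n, which is the skew toward short, low-mass fragments described above.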
Researchers [SC98, FHM+93] have noted the possible sequencing value of internal ions, which are indeed present [CCC95, JnC96], but most approaches in the literature that do not perform a global correlation that includes internals do not account for them. It is difficult for global fundamental graph approaches to account for internals because whether or not an internal supports a fundamental depends on the path one takes to reach the fundamental. Local approaches that grow partial sequences can check for internals with the partial sequence calculated so far, but the real partial sequence could still be disqualified if internals are poorly represented. In short, it is hard for certain algorithms to harness the sequence information available from internal peaks, but were it possible, they could benefit greatly. Algorithms that employ some form of global correlation are favorable because they are able to do precisely this.

Footnote 3: Note that verification was based solely upon matching theoreticals to experimentals; no chemical verification was actually performed.

Fragment Type Intensities

Peaks of high intensity tend to be core ions, core variants and certain immonium ions. Internals and other immoniums are often responsible for the low intensity peaks.

Presence of Multiple Identity Peaks

Multiple identity peaks are present in these datasets. For example, for angiotensin, the peak at 354.2 could be explained as a Bm17 ion of the subsequence DRV, or as a YA internal ion for PFH. And for bradykinin: (1) 157.1 is the mass of a B ion for R and a YA internal for SP, and (2) 555.3 is the mass of a B ion for RPPGF and a YA internal for PPGFSP. Note that in these illustrations, the multiple identities involved a series ion having the same mass as an internal ion, but nothing precludes them from being a prefix and a suffix, both series ions.
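The mass coincidences listed above can be reproduced with a small calculation. The sketch below is illustrative only: it uses standard monoisotopic residue masses (not values from the thesis), and it assumes that a B ion is the residue sum plus a proton, that a YA internal is an a-type internal ion (residue sum plus a proton minus CO), and that the "m17" variant is a loss of ammonia. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues used below; standard
# reference values, not taken from the thesis.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "D": 115.02694, "H": 137.05891, "F": 147.06841, "R": 156.10111,
}
PROTON = 1.00728   # charge carrier
CO = 27.99491      # difference between a b-type and an a-type ion
NH3 = 17.02655     # assumed identity of the "m17" neutral loss


def residue_sum(seq):
    """Sum of residue masses for a peptide fragment."""
    return sum(RESIDUE_MASS[r] for r in seq)


def b_ion(seq, loss=0.0):
    """Singly charged prefix (B) ion, optionally with a neutral loss."""
    return residue_sum(seq) + PROTON - loss


def ya_internal(seq):
    """Singly charged a-type internal (YA) ion."""
    return residue_sum(seq) + PROTON - CO


pairs = [
    ("B(R)", b_ion("R"), "YA(SP)", ya_internal("SP")),
    ("B(RPPGF)", b_ion("RPPGF"), "YA(PPGFSP)", ya_internal("PPGFSP")),
    ("Bm17(DRV)", b_ion("DRV", NH3), "YA(PFH)", ya_internal("PFH")),
]
for name1, m1, name2, m2 in pairs:
    print(f"{name1} = {m1:.3f}   {name2} = {m2:.3f}   gap = {abs(m1 - m2):.3f} Da")
```

Under these assumptions each pair differs by roughly 0.01 to 0.02 Da, so the two identities are indistinguishable unless the instrument's mass accuracy is finer than that gap.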
Unknown Angiotensin Peaks

There are a handful of experimental peaks of fairly low intensities that appear consistently across multiple datasets, but are of unknown origin. They are discussed further in Appendix D.

9.2 Training the Model

9.2.1 Parameterizing the Model

In our discussion of the model and its fragmentation tree (Chapter 8), a number of parameters related to the various decision points of the fragmentation tree were discussed. These comprise the parameters of our model, and they fall into several major categories:

Variant Likelihoods: Pnovariant, tendencym18, tendencym17, tendencyp18
Noise Probabilities: Prandom, Punobserved
Series Likelihoods: Pax, Pb
Matrix Tendencies: matrix M

An overview of the various regions of matrix M is depicted in Figure 9-1. Only a few matrix parameters are being estimated: the two columns M[*, H] and M[*, P], and the portions involving a non-break. The remaining entries have a default value of 1. We have only scratched the surface of fully utilizing this matrix because ideally, every matrix entry would be a parameter; these regions were singled out because we thought they were the most influential factors in fragmentation. Histidine and proline are known (see Section 3.2) to favor N-terminal fragmentation. The non-break tendency also seems to be greater than that of the other positions because core ions consistently appear more abundantly than internal ions (a view held by the literature and supported by our datasets). Furthermore, the parent ion, the only kind of molecule with two non-breaks, is almost always the peak of highest intensity in the spectrum.

An accurate estimate for all model parameters is necessary for producing PMFs that resemble what actually occurs in the physical/chemical process of fragmentation.
9.2.2 Training the Model Parameters

We trained our model by finding a set of assignments for these model parameters that maximized the likelihood of the datasets we acquired (by minimizing the sum of their -log(Prob(S, Fp)) scores; see Section 8.5.2). Rather than try all possible parameter assignments, we started with a few values for each parameter and gradually reduced the range of values for each parameter as we homed in on a set of parameter values that yielded an optimal sum (it is possible that this is not the global optimum). The trained model settings are tabulated in Table 9.2.

Figure 9-1: Overview of Matrix Layout: Ideally, every matrix entry would be a parameter, but only a few regions have been parameterized and singled out for estimation: the non-break tendencies, and the Histidine (H) and Proline (P) residue dependencies. All entries within the same shaded region are assumed to have the same likelihood in the current model. Entries marked NA are not possible. (Legend: normal tendency; non-break tendency at ends of sequence; special residue dependent tendencies.)

Note that we have allowed variant tendencies to have only a small range of values. While more or less sufficient for now, tendency values can be more representative of variant content if a wider range of values is permitted. For example, if the three tendency parameters were allowed to take on any integer value, then training would have produced the following parameter values, listed in order: 0.57, 1, 22, 16, 0.17, 0.0, 0.23, 0.6, 3.0, 2.2. With little data, too large a range can lead to an overtrained model. With more data available, however, a more fine-grained scale may be preferred.

Unmatched experimental peaks are attributed to basal noise. The more intense a noise peak, the more trials resulted in noise and the lower the probability of the spectrum occurring.
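The training procedure described above (start with a few values per parameter, then narrow each range around the best assignment found) can be sketched as a coarse-to-fine grid search. This is a minimal illustration, not the thesis implementation: the objective below is a toy stand-in for the summed -log(Prob(S, Fp)) scores, and the parameter names, ranges, and target values are hypothetical.

```python
import itertools


def coarse_to_fine(objective, ranges, points=5, rounds=6):
    """Minimise objective(params) by gridding each range, then shrinking it.

    ranges maps a parameter name to (low, high). Each round evaluates a
    points-per-parameter grid and halves every range around the best
    assignment seen so far. Like the procedure in the text, this may
    settle on a local rather than the global optimum.
    """
    ranges = dict(ranges)
    best_params, best_value = None, float("inf")
    for _ in range(rounds):
        names = list(ranges)
        grids = []
        for name in names:
            low, high = ranges[name]
            step = (high - low) / (points - 1)
            grids.append([low + i * step for i in range(points)])
        for combo in itertools.product(*grids):
            params = dict(zip(names, combo))
            value = objective(params)
            if value < best_value:
                best_params, best_value = params, value
        for name in names:
            low, high = ranges[name]
            quarter = (high - low) / 4
            centre = best_params[name]
            ranges[name] = (max(low, centre - quarter), min(high, centre + quarter))
    return best_params, best_value


def toy_objective(p):
    """Toy stand-in for the summed score; its minimum is at (0.2, 3.0)."""
    return (p["p_random"] - 0.2) ** 2 + (p["non_break"] - 3.0) ** 2


best, value = coarse_to_fine(
    toy_objective, {"p_random": (0.0, 1.0), "non_break": (1.0, 5.0)}
)
print(best, value)
```

Because each round costs points raised to the number of parameters, a real model with ten parameters would need far fewer grid points per parameter, or a search that varies only a few parameters at a time.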
Parameter          Untrained   Trained Overall
Pnovariant         0.5         0.58
tendencym18        1           1
tendencym17        1           2
tendencyp18        1           2
Prandom            0.2         0.18
Punobserved        0.1         0.0
Pax                0.5         0.24
Pb                 0.5         0.56
non-breaks         1           3
M[*,H], M[*,P]     1           2.4

Table 9.2: Model Parameters: Untrained and Trained Overall

During training, a dataset with no unmatched peaks should favor a trained Prandom value of 0; conversely, a dataset with many unmatched peaks, especially intense ones, will increase the Prandom value.

Finally, scores are optimal when Punobserved is 0. This makes sense because the parameter Punobserved is the probability that an ion is lost and undetected, but a spectrum is comprised of exactly those ions that are detected. Without knowledge of N, the total number of molecules fragmented, 0 is the best assignment (when N is assumed to be the number of observed molecules), and we can in effect ignore this parameter. One could approximate N by estimating the number of molecules in the sample (based on sample quantity and concentration), but it is difficult to tell how many of these get ionized during data acquisition. Fortunately, as we saw in Section 8.5.2, the one place N occurs is factored out, so it is not needed for our purposes of comparing scores of different sequences evaluated against the same spectrum.

Does training make a difference? We endeavor to show that even if the parameters do not produce the global optimum, the trained model performs better than the untrained one.

9.3 Examination of the Trained PMF

The untrained model performs poorly. The PMF for angiotensin generated by the untrained model using arbitrary, but reasonable, parameters (Column 2 of Table 9.2) is shown in Figure 9-2. Note that since cleavage at all break positions is equally likely, M is a constant matrix.
Figure 9-2: PMF for Angiotensin Using an Untrained Model

There is little resemblance between the PMF in Figure 9-2 and the actual angiotensin experimental data in Appendix C. The parent molecule is certainly not the most abundant ion in the model, and the probabilities seem inversely proportional to fragment length; for example, the shortest prefix and the shortest suffix are especially favored because there are many more ways to produce them. The PMF shape is also too regular, being solely dependent on peptide length. Differentiation based on other factors is necessary so that the PMFs of different peptides of the same length are not identical.

On the other hand, compared to the untrained model, the trained model allows for some differentiation, and the resulting PMF is shown in Figure 9-3. Though still not identical in shape to experimental spectra, this PMF already shows improvement compared to the untrained one of Figure 9-2. Now the parent mass is the highest peak, and the higher probability ions correspond to the core ions, with the internal ions being less probable. Less regularity is observed because dependencies on the peptide composition have been introduced.

Figure 9-3: PMF for Angiotensin Using a Trained Model

More differentiation is still possible (the entire residue tendency matrix could have been parameterized, residue position could be taken into account, rules for neighboring/distant residue combinations could be introduced, etc.), but our training data is too small and doesn't capture the diversity needed to train a more comprehensive model. Nevertheless, even with a small amount of differentiation, we can implement a fairly robust scoring function.
9.4 Scoring Guesses Against an Observed Spectrum

How good is this scoring function? What happens when different sequence guesses are scored against the same experimental spectrum? How does the real sequence fare? Two measures of performance, suggested by [PPCC99], are sensitivity and selectivity. If the scoring algorithm is sensitive, the correct sequence scores well; if selective, unrelated sequences score poorly.

We used our scoring function to correlate our experimental data against a pool of peptide sequences. These sequences, listed in the leftmost column of Tables 9.3 and 9.4, all have the same mass as the parent ion, but vary in length and similarity to the parent sequence. The list was chosen so that the first set of sequences differed from the parent the most (in length and sequence), while the remaining sequences exhibited a greater resemblance to the parent, being of the correct length and including sequences such as the parent reversed, the parent permuted and the parent with neighboring residues flipped. The last sequence is the actual parent sequence.

The tables record the correlation scores for each sequence, under both the untrained and trained models, as well as the difference between these scores. In the ideal case, use of the trained model improves only the score of the actual sequence. The situation here is not ideal, but the real sequence is one of the sequences that show the greatest improvement when the trained model is used. In all cases, the correct sequence (either the parent sequence or the parent sequence with the first 2 residues flipped) is the best scoring sequence.
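As a small illustration of how sensitivity and selectivity can be read off such a table, the sketch below ranks a handful of trained-model scores for the 0123 dataset (copied from Table 9.3; lower is better) and measures the gap between the best score and the best incorrect sequence. The variable names and the choice of rows are illustrative only.

```python
# Trained-model scores for the 0123 dataset, copied from a few rows of Table 9.3.
trained_0123 = {
    "EWHTWFIMF": 986586.2,    # unrelated sequence of a different length
    "LHFPHIYVRD": 768738.7,   # parent reversed
    "DRVYIHFPHL": 670690.6,   # parent with P and F flipped
    "RDVYIHPFHL": 648696.4,   # first two residues flipped (counted as correct)
    "DRVYIHPFHL": 644984.9,   # actual parent sequence
}
correct = {"DRVYIHPFHL", "RDVYIHPFHL"}

# Sensitivity: does a correct sequence rank first (lowest score)?
ranked = sorted(trained_0123.items(), key=lambda item: item[1])
best_sequence = ranked[0][0]

# Selectivity: how far behind is the best incorrect sequence?
best_incorrect = min(s for seq, s in trained_0123.items() if seq not in correct)
margin = best_incorrect - ranked[0][1]

print("best sequence:", best_sequence)
print("margin to best incorrect sequence:", round(margin, 1))
```

A large margin relative to the spread among near-parent sequences suggests the scoring function separates correct from incorrect guesses; a small one means minor permutations can compete with the real sequence.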
               0123 Dataset                        0119 Dataset
Sequence       Untrained  Trained  Difference      Untrained  Trained  Difference
EWHTWFIMF      1000610.2  986586.2  14024.0      376255.5  369742.7  6512.8
PRMKWCPCIY     975305.5  951105.7  24199.8      372495.9  362382.5  10113.4
VLQFHLYTYI     972085.0  960256.5  11828.5      366327.7  360586.0  5741.7
HKMKFREYSA     977920.6  957214.7  20706.0      352852.6  347123.8  5728.8
FRLSCWYAHN     996250.5  974148.6  22101.9      377026.7  367970.0  9056.7
DCSYAKAWYSC    966902.2  939270.9  27631.3      373901.7  362997.1  10904.6
QCYLTGQAYYS    952266.0  932862.8  19403.2      371618.9  362988.3  8630.6
VWFLNPEHGPT    961941.0  942032.0  19909.0      359397.6  352370.4  7027.2
TFGKETIKEIM    971034.8  961195.5  9839.3      377140.7  371161.4  5979.2
YLQMSIEELGN    986816.0  964675.0  22141.0      383318.1  373152.3  10165.8
EKDICDCCHSGS   1006346.8  993525.5  12821.3      376325.8  369655.7  6670.1
ICSVWMSVVICG   1021791.8  1003735.2  18056.6      391636.1  383796.8  7839.3
HSTENCAAIITH   967937.2  953169.5  14767.7      353608.4  349607.1  4001.4
ANYSNCGCEPAPA  1016495.5  998025.3  18470.3      389600.9  381338.4  8262.5
ANCSTTRKTNGGS  995608.6  975148.5  20460.1      382945.8  372557.5  10388.3
DRVYLHLHML     721760.7  679433.3  42327.4      275185.3  261156.2  14029.2
DRVYLHMFCL     727984.5  676692.6  51291.9      276618.7  258219.5  18399.2
DRVYLHMPCY     737855.5  685587.0  52268.6      278511.3  261330.0  17181.3
DRVYLHMPEH     726715.4  684229.4  42486.0      273222.0  260859.8  12362.2
RDVYLHPWPN     761811.3  703958.5  57852.8      284003.5  259902.8  24100.6
DRVYYSLHML     760932.9  713756.0  47176.9      291209.7  275682.2  15527.6
DRVYIHFPHL     715022.7  670690.6  44332.0      272046.8  258363.3  13683.6
DVRYIHPFHL     739678.7  680309.0  59369.7      283188.6  262018.9  21169.7
DRYVIHPFHL     737180.8  676238.9  60941.9      278286.2  255804.3  22481.9
DRVIYHPFHL     738506.3  678979.6  59526.7      281096.1  259648.8  21447.3
DRVYHIPFHL     740373.8  676650.9  63722.9      281312.1  260521.3  20790.8
DRVYIPHFHL     762835.2  721162.0  41673.2      289738.6  278062.0  11676.6
DRVYIHPHFL     739913.0  682881.0  57032.0      278161.2  256277.1  21884.1
DRVYIHPFLH     726697.1  677339.3  49357.8      273243.1  254749.1  18494.0
DRVPFYIHHL     872519.8  824716.8  47803.1      330034.1  315203.6  14830.5
DRVHIYPFHL     778471.5  718644.9  59826.6      295005.1  276876.4  18128.7
DRHVYIHPFL     840385.2  796113.4  44271.8      315209.8  297241.9  17967.9
RDVYIHPFHL     711485.4  648696.4  62789.0      *269161.1  *246465.2  22696.0
LHFPHIYVRD     790750.9  768738.7  22012.3      300054.2  293322.9  6731.3
PRDVYIHFHL     856353.2  822967.0  33386.1      320955.7  310903.2  10052.5
HRDVYIHPFL     816051.5  778512.3  37539.2      295598.0  284770.6  10827.3
DRVYIHPFHL     *707448.0  *644984.9  62463.1      269449.3  247236.8  22212.5

Table 9.3: Scores of Sequences with the Same Mass as Angiotensin for Datasets 0123 and 0119. A '*' denotes best score in a column. Sequences considered correct are the actual sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues flipped, listed fifth from last.

               0121 Dataset                        1205 Dataset
Sequence       Untrained  Trained  Difference      Untrained  Trained  Difference
EWHTWFIMF      3591140.3  3585832.4  5307.9      1583255.8  1562844.2  20411.6
PRMKWCPCIY     3481601.0  3440904.1  40696.9      1538727.6  1500969.5  37758.1
VLQFHLYTYI     3422311.3  3404494.1  17817.2      1538805.3  1525055.9  13749.4
HKMKFREYSA     3539148.9  3523349.4  15799.5      1549703.5  1513659.1  36044.4
FRLSCWYAHN     3605618.2  3588582.6  17035.6      1588387.5  1558620.7  29766.8
DCSYAKAWYSC    3475024.8  3440729.6  34295.3      1509786.6  1469993.5  39793.1
QCYLTGQAYYS    3440372.5  3432988.8  7383.7      1458486.1  1429528.5  28957.6
VWFLNPEHGPT    3484229.2  3452109.6  32119.6      1513068.5  1483061.6  30007.0
TFGKETIKEIM    3447098.4  3457061.1  -9962.6      1491126.1  1486291.2  4834.9
YLQMSIEELGN    3526325.5  3498799.0  27526.6      1539122.3  1505162.6  33959.7
EKDICDCCHSGS   3534730.3  3527607.1  7123.3      1571204.9  1547374.4  23830.6
ICSVWMSVVICG   3437413.3  3363538.8  73874.6      1593272.0  1565586.2  27685.8
HSTENCAAIITH   3490206.1  3480559.8  9646.3      1545114.6  1522448.0  22666.6
ANYSNCGCEPAPA  3590061.1  3558109.3  31951.8      1587972.5  1553456.6  34515.9
ANCSTTRKTNGGS  3533219.1  3520911.5  12307.6      1556569.3  1528001.8  28567.4
DRVYLHLHML     2658723.3  2564833.4  93889.9      1113992.0  1044975.8  69016.2
DRVYLHMFCL     2663227.4  2547695.6  115531.8      1124362.5  1043150.1  81212.4
DRVYLHMPCY     2785226.9  2686913.3  98313.6      1132070.5  1042132.9  89937.6
DRVYLHMPEH     2765007.8  2689378.4  75629.4      1116415.0  1040693.1  75721.9
RDVYLHPWPN     2838234.4  2709785.2  128449.1      1181329.1  1094742.5  86586.5
DRVYYSLHML     2724515.6  2609064.6  115451.1      1148614.0  1066800.9  81813.1
DRVYIHFPHL     2569794.0  2450760.5  119033.5      1093327.9  1014123.0  79204.9
DVRYIHPFHL     2638866.2  2476966.3  161899.9      1141435.7  1045422.1  96013.6
DRYVIHPFHL     2639212.4  2470831.0  168381.4      1128156.5  1026553.8  101602.7
DRVIYHPFHL     2671037.9  2508420.0  162617.9      1135684.4  1038167.0  97517.4
DRVYHIPFHL     2615010.7  2430165.6  184845.2      1120847.7  1011129.0  109718.7
DRVYIPHFHL     2710566.2  2600019.9  110546.3      1156037.5  1080070.5  75967.0
DRVYIHPHFL     2700650.0  2560715.7  139934.3      1159050.9  1070259.6  88791.4
DRVYIHPFLH     2768536.1  2666870.9  101665.2      1112153.9  1039635.2  72518.7
DRVPFYIHHL     3043928.2  2915263.4  128664.8      1325732.7  1239322.1  86410.7
DRVHIYPFHL     2743394.9  2574565.3  168829.6      1181530.6  1077197.6  104333.0
DRHVYIHPFL     3037303.8  2942113.1  95190.8      1323411.4  1258188.6  65222.8
RDVYIHPFHL     *2534547.7  *2361854.8  172692.8      *1076543.8  *973923.5  102620.3
LHFPHIYVRD     2821830.3  2709905.1  111925.2      1222077.9  1187291.4  34786.5
PRDVYIHFHL     3006660.2  2907902.0  98758.1      1319292.1  1256930.4  62361.7
HRDVYIHPFL     2929037.6  2843633.2  85404.4      1289927.7  1233125.8  56801.9
DRVYIHPFHL     2539808.1  2369364.1  170444.0      1078123.1  975763.0  102360.1

Table 9.4: Scores of Sequences with the Same Mass as Angiotensin for Datasets 1205 and 0121. A '*' denotes best score in a column. Sequences considered correct are the actual sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues flipped, listed fifth from last.
               0220 Dataset                        0218 Dataset
Sequence       Untrained  Trained  Difference      Untrained  Trained  Difference
FFNWAGRY       637458.2  620229.3  17228.9      1313724.7  1317429.7  -3705.0
RSSCQPDMH      573000.0  559245.2  13754.8      1182864.6  1186785.4  -3920.8
WDPNICTLV      605878.4  591745.4  14133.0      1234530.6  1232528.8  2001.8
VFFMDHGAH      617818.7  604156.9  13661.8      1289685.8  1289786.0  -100.2
TRQKRMLAG      612576.6  597983.7  14592.9      1241441.3  1252506.1  -11064.8
VAYHGYFCT      648209.6  631083.7  17125.9      1342050.6  1337724.3  4326.2
CQSFCDGDSV     621628.3  605042.8  16585.6      1267178.7  1268856.7  -1678.0
TVNVHHPGSI     616680.9  604741.1  11939.8      1258687.3  1253688.5  4998.8
LRQNLAMGGT     622702.2  605893.8  16808.4      1262377.0  1270854.0  -8477.1
LSQKACPGKE     617880.6  600357.9  17522.7      1274532.3  1270929.9  3602.5
SAFAPRIVCP     603815.9  587313.5  16502.4      1266159.0  1265483.1  675.9
DCHMASDGAGP    607585.3  602810.6  4774.7      1265081.6  1280059.7  -14978.1
TYLFGSCGGGV    625747.3  604016.7  21730.6      1259824.2  1254787.7  5036.5
EIRAIQGCGGG    617889.9  594656.0  23233.9      1242888.5  1234011.7  8876.8
KNAGPGTEISS    604424.9  584624.0  19800.9      1230376.3  1221202.1  9174.2
RPVDSSPRF      516809.6  489526.9  27282.7      1008198.9  977668.9  30530.0
RPPGFSPRF      496651.5  466971.0  29680.5      939676.7  898853.3  40823.4
RPVSDSPRF      516406.2  489144.7  27261.5      1010008.5  978912.0  31096.5
RPTAESPRF      509738.6  480533.7  29204.9      992577.2  958308.4  34268.8
RPWDSAMPT      521433.1  501069.5  20363.7      1026787.8  1019673.7  7114.1
RPVDSSPFR      511242.9  484323.2  26919.7      995771.5  965772.3  29999.2
RPVDSSLMR      518657.1  494543.2  24113.8      1014359.4  1001363.7  12995.7
RPSFGPPRF      503237.9  478142.5  25095.4      975513.9  944049.5  31464.4
RPWDSPGVF      518995.3  492483.1  26512.2      1021498.9  996014.4  25484.5
RPTCPSPRF      512074.9  488138.4  23936.5      1002086.2  971484.8  30601.3
RPWDSAMVV      522496.4  498975.2  23521.2      1024707.9  1014970.9  9737.0
PRPGFSPFR      508248.9  486135.7  22113.3      984056.3  952750.9  31305.4
RPGPFSPFR      496290.8  475676.0  20614.8      948037.7  931302.2  16735.5
RPPFGSPFR      499639.0  472065.1  27574.0      958188.2  923245.2  34943.0
RPPGSFPFR      509410.8  480895.7  28515.1      995003.5  953766.6  41236.9
RPPGFPSFR      511823.5  487540.5  24283.0      998267.0  975018.5  23248.5
RPPGFSFPR      495785.1  475180.1  20605.1      946461.2  936713.1  9748.1
RFPSFGPPR      529863.1  519962.3  9900.8      1063121.5  1068835.8  -5714.3
RPPGFSPFR      *492761.5  *463690.3  29071.2      *931460.8  *892716.5  38744.3

Table 9.5: Scores of Sequences with the Same Mass as Bradykinin. A '*' denotes best score in a column.

The sequences that are similar to the parent are expected to score better than those that are not. This can be seen when the trained model scores of Tables 9.3 and 9.5 are graphically displayed (see Figures 9-4 and 9-5). The scores fall into clusters when plotted, two clusters for each dataset (the first two clusters of Figure 9-4 (points 0 to 40 on the x-axis) correspond to the sequence scores for the 0123 dataset, the next two clusters for 0119, and so forth). There is a marked distinction between the cluster of sequences that resemble the real sequence (these have better, i.e. lower, scores) and the cluster of sequences that are more distant.

Figure 9-4: Trained Model Scores of Sequence Guesses for Angiotensin: Datasets 0123 (points 0-40 along the x-axis), 0119 (40-60), 1205 (80-120) and 0121 (120-160). Scores at zero were inserted to separate each dataset.

Figure 9-5: Trained Model Scores of Sequence Guesses for Bradykinin: Datasets 0220 (points 0-40) and 0218 (40-80). Scores at zero were inserted to separate each dataset.

9.4.1 Not the Real Sequence, but Still Correct

De novo algorithms find the amino acid sequence that best explains the observed data according to some measure of evaluation. Ideally, the sequence that is the best fit is also the actual parent sequence. Examination of the scores for each dataset reveals that the real sequence is indeed found for the bradykinin datasets and the 0123 angiotensin dataset.
For the remaining angiotensin datasets, however, there is an alternate sequence, RDVYIHPFHL, that outscores the real sequence. Notice that this alternate sequence is the real sequence with the first two residues interchanged, and from Section 4.4, this is an acceptable prediction and considered correct.

Investigation of our data reveals two reasons why this sequence would be favored: (1) the data contains no f_1 ions and weak f_1' ions, and (2) peaks (both legitimate and noise) are misinterpreted as supportive fragment ions for the alternate sequence. Take the 0119 and 0121 datasets for example. The 0119 dataset contains no f_1 or f_1' ions, but it has an immonium YAm17 ion for R at mass 112. The mass of the Am17 ion for R is also 112, and moreover, the probability of the Am17 ion, being a core ion, is higher than that of the immonium YAm17 ion. Consequently, a sequencing algorithm would favor the misinterpretation of 112 as an Am17 ion, identifying R as the first residue, because this assignment yields a better score. The same immonium interference by the same ion occurs with the 0121 dataset, except that, unlike the 0119 dataset, an extended family peak is present: an internal YB ion for RVYIH at mass 669.47. Unfortunately, its intensity is too low to be of consequence.

Thus, because f_1 prefix ions are more probable according to our rules than immonium ions, competing sequences whose first residue is well supported by immonium peaks in the spectrum might outscore a real sequence that contains a gap at position 1. One might be able to rescue the real sequence by appropriately reducing the break probability of break position one. This would lower the probabilities of all f_1 prefix ions as a consequence, and diminish the effect of immonium interference.

The 1205 dataset, on the other hand, has no f_1, but it does have a single extended family ion: an internal ion YBp18 for RVYIHPFH at mass 1068.83.
Noise peaks at 473.147 and 696.93, however, can be interpreted as a YBm18 for DVYI and a YA for DVYIHP respectively. These support RDVYIHPFHL as the sequence and help it to outscore the real sequence despite the presence of a legitimate peak at 1068.83.

Note that while certain ion families can be better represented, it is not the case that the dataset, as a whole, is too small. Were this true, the scores of sequences that are radically different from the real one would be more competitive, and quite likely surpass that of the real sequence. With "decent" spectra (good signal to noise, good redundancy), the danger is not that a sequence totally distinct from the real sequence would score better (because the real sequence, being the real sequence, should explain the bulk of the peaks), but that a sequence deviating ever so slightly would outperform the real one. Such a sequence scores well because of its high degree of overlap with the real sequence, but it scores better because the minor differences in sequence happen to be better supported.

Incidentally, no fundamental graph approach would have fared better. In the case of 1205, an extended family ion is present, but it is an internal ion, and not one of the f-generating roles. Unlike series ions, fundamentals cannot be definitively inferred from internals (without more localizing information). Internal ion supporters cannot be easily incorporated into a fundamental graph. Global approaches might attempt to find all possible subpaths in the graph that can explain the internal ion, and then increment some score for all fundamentals of each subpath. But because there can be many such subpaths, this process may obscure the real sequence instead of highlighting it. There is also no reason to believe that fundamentals that are located at the crossroads of many such subpaths are, in fact, fundamentals of the real sequence.
Local approaches can take internals into account as they progress through the search space of partial sequences (a huge tree where each node has outgoing degree 18), but they would not be able to account for the f_1' internal ion of dataset 1205 until they have almost reached the end of the search. It is unclear whether the real sequence would have survived pruning long enough to make it this far, whether the influence of non-internal peaks interpreted as internals would artificially elevate the score of incorrect sequences enough to knock out the real sequence, and whether a search with an increased number of survivors allowed after each round would be too inefficient.

Thus, while the angiotensin sequence is actually DRVYIHPFHL, the sequence RDVYIHPFHL is considered a correct solution.

9.5 Summary: Model Training

The different fragment types are all present in the datasets we collected. In addition, the spectra also contained noise, as well as consistently occurring peaks of unknown origin. Our model was trained with these datasets, and despite the presence of unknowns, the PMF improved, as did the selectivity/sensitivity of the scoring function for the correct sequence.

Chapter 10

Testing the Simulated Annealing Search

This chapter examines the simulated annealing search and addresses such questions as: Does the search converge? Does restricting the search to moves that are length-preserving help? How do different searching parameters affect the search?

10.1 Search Convergence

The scores of each successive move made by the simulated annealing search on an angiotensin dataset and a bradykinin dataset are plotted in Figures 10-1 and 10-2 respectively. The search used the trained model of Table 9.2 to evaluate each move that it made. The scores fluctuate randomly at first, but as the search progresses, they gradually descend to a minimum value. In both cases, the search found the real sequence successfully.
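The convergence behaviour described above can be reproduced on a toy problem. The sketch below is not the thesis's Java implementation: it uses the annealing schedule quoted in Chapter 8 (T_0 = 100,000, tempsteps = 100, k = 0.9, nover = 100 * avgLength, nlimit = 10 * avgLength) with a geometric cooling step, but the scoring function is a hypothetical stand-in (10,000 per mismatched residue against a known target), the 18-letter alphabet is an assumption, and avgLength is simply taken to be the target length.

```python
import math
import random

RESIDUES = "ACDEFGHIKMNPRSTVWY"  # hypothetical 18-letter alphabet for this toy
TARGET = "RPPGFSPFR"             # bradykinin, used only as a toy target


def toy_score(seq):
    """Stand-in for the -log probability score: 10,000 per mismatched residue."""
    return 10000.0 * sum(a != b for a, b in zip(seq, TARGET))


def mutate(seq, rng):
    """Length-preserving move: replace one residue with a different one."""
    i = rng.randrange(len(seq))
    choices = [r for r in RESIDUES if r != seq[i]]
    return seq[:i] + rng.choice(choices) + seq[i + 1:]


def anneal(start, score, rng, t0=100000.0, tempsteps=100, k=0.9, avg_length=9):
    """Simulated annealing that also tracks the minimum-scoring sequence seen."""
    nover = 100 * avg_length    # max sequences tried at each temperature
    nlimit = 10 * avg_length    # max accepted moves at each temperature
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(tempsteps):
        successes = 0
        for _ in range(nover):
            candidate = mutate(current, rng)
            delta = score(candidate) - current_score
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_score = candidate, current_score + delta
                successes += 1
                if current_score < best_score:
                    best, best_score = current, current_score
            if successes >= nlimit:
                break
        if successes == 0:
            break  # no accepted move at this temperature: treat as converged
        temperature *= k
    return current, best, best_score


rng = random.Random(0)
start = "".join(rng.choice(RESIDUES) for _ in TARGET)
final, best, best_score = anneal(start, toy_score, rng)
print("start:", start, "final:", final, "best seen:", best, "score:", best_score)
```

Returning both the final sequence and the best sequence seen mirrors the distinction drawn in the text between the answer the algorithm returns and the minimum-scoring sequence it encountered. Note that the temperature scale only makes sense relative to the score scale, which is why the toy score is inflated to the order of 10^4 per residue.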
10.2 Sequence Prediction

Simulated annealing was performed on all our datasets using a model whose parameters were trained by the very same datasets (Table 9.2) (cross-validation and the use of other validating datasets are the subject of Chapter 11). Table 10.1 summarizes these results and reveals the effects of using a model that is untrained versus one that is trained, and using moves that are restricted (length-preserving) versus moves that are unrestricted.

Figure 10-1: Simulated Annealing Moves for 0119 Dataset. The x-axis represents the progress of the search, and the y-axis is the score of each successive move.

Figure 10-2: Simulated Annealing Moves for 0220 Dataset. The x-axis represents the progress of the search, and the y-axis is the score of each successive move.

Each entry in the table indicates the number of times, out of 10, simulated annealing found the correct sequence. The outcomes of the algorithm are broken down into three categories:

• the number of times the algorithm returned the correct answer,
• the number of times the algorithm terminated with no answer, but the minimum sequence encountered was correct, and
• the number of times the algorithm returned the wrong answer, but the minimum sequence encountered was correct.

The second category occurs when the algorithm does not think it has converged, and the third category occurs when it gets stuck in a local minimum. We distinguish among these three categories to better understand the algorithm's behavior; one can easily modify the code so that the minimum scoring sequence that was encountered is always returned at the completion of the algorithm.
One can consider the sum of these three numbers as the number of times the algorithm succeeds for the given parameters.

                          Untrained Model                  Trained Model
Moves         Category    0123 0119 1205 0121 0220 0218    0123 0119 1205 0121 0220 0218
Restricted    1              0    0    1    0    9    9      10    8    7    7   10   10
              2              0    0    0    0    0    0       0    0    0    0    0    0
              3              0    0    0    0    0    0       0    2    0    1    0    0
Unrestricted  1              0    0    0    0   10   10       4    1    3    4   10   10
              2              0    0    0    0    0    0       0    0    0    0    0    0
              3              0    0    0    0    0    0       0    0    0    0    0    0

Table 10.1: Results of Simulated Annealing Run on Different Datasets using an Untrained/Trained Model, and a Length-Preserving/Non-Length-Preserving Search

These results indicate that simulated annealing performs better when scoring is done using a trained model and when length-preserving moves are made. When moves are length-preserving, the search space is smaller, and hence the likelihood of the search succeeding (if the subspace contains the correct sequence) is higher. Neither of these observations comes as a surprise, especially since the trained model is being validated on the very datasets used to train it. What is interesting is that the bradykinin datasets (0220, 0218) appear to do quite well regardless of the simulation conditions.

Finally, the algorithm, when used with the untrained model, frequently found sequences that scored better than the correct one, indicating that training definitely made a difference by improving sensitivity and/or selectivity. The simulations that used a trained model encountered no sequence that scored better than the real one.

10.3 Different Restricted Sizes

Ideally, the simulated annealing algorithm should run in unrestricted mode, and predict the correct sequence after considering sequences of all lengths. We attempted to improve the performance of the unrestricted version by trying different move probabilities and slower schedules (results not included), but met with little success. It was also unclear whether the tremendous cost in time, due to the use of even slower annealing schedules, would be worth the gain in prediction success.
However, since the algorithm worked well when restricted to sequences of the right length, one might consider partitioning the space of sequences by length into smaller non-overlapping subspaces. Since the right length is not known, one would have to run several restricted searches, but instead of a single search over the full space, one could run a few well-chosen searches over smaller spaces.

The following question is then raised: how do the best scoring sequences of each length compare? Is the real sequence the best scoring sequence not only of its own length, but of the other lengths as well? One could reason that this might be true for the shorter and longer lengths. Very short sequences account for fewer of the experimental peaks and, as a result, more unexplained experimental peaks must be attributed to noise. If the probability of noise is low, then this would greatly hurt the resulting score. Very long sequences are able to generate many more theoretical peaks, especially low mass ones, and their PMFs will experience an overall reduction in probabilities, particularly the probabilities of the higher mass fragments. This also has the potential of hurting the resulting score.

The simulated annealing algorithm was run in restricted mode on a random initial sequence¹ for every length from 8 to 13. Each size was tested 10 times per dataset, and the sequences predicted by our model are listed by size in Tables 10.2 and 10.3. We found that the correct sequence for each dataset was indeed the best scoring sequence over all lengths. This suggests that the unrestricted searches (Tables 10.1 and 11.2) did poorly because the sequence space was vast, not because the correct sequence scored suboptimally.

10.4 Simulated Annealing on Data Without Noise

Simulated annealing was also performed on our datasets with all noise (unmatched) peaks removed.
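As a minimal sketch of what removing unmatched peaks can look like, the helper below keeps only those experimental peaks that lie within a tolerance of some theoretical fragment mass of the known sequence. The function name, the (m/z, intensity) peak representation and the 0.5 tolerance are hypothetical choices for illustration, not the procedure actually used to prepare the datasets.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def remove_unmatched(peaks: Sequence[Peak],
                     theoretical_mz: Sequence[float],
                     tol: float = 0.5) -> List[Peak]:
    """Keep only peaks lying within `tol` of at least one theoretical m/z."""
    return [(mz, intensity) for mz, intensity in peaks
            if any(abs(mz - t) <= tol for t in theoretical_mz)]


# Hypothetical toy spectrum: the peak at 250.0 matches nothing and is dropped.
toy_peaks = [(100.1, 5.0), (250.0, 2.0), (300.4, 1.0)]
clean = remove_unmatched(toy_peaks, theoretical_mz=[100.0, 300.0])
```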
One would expect the algorithm to do very well with clean data, but the statistics compiled in Table 10.4 show little improvement compared to Table 10.1. This indicates that the algorithm is somewhat tolerant of a certain level of noise (because removing the noise present in the original datasets does not markedly change the algorithm's prediction results). We will explore noise further in Section 12.2.3. Since the correct sequence is still the best scoring sequence encountered, the performance of the algorithm on the noiseless 0119, 1205 and 0121 datasets suggests that perhaps the search itself could use improvement.

10.5 Exploration of Simulated Annealing Parameters

Thus, we shift our attention to the simulated annealing search and its parameters in hopes of finding settings that will improve the performance of the algorithm on the 0119, 1205 and 0121 datasets. We make forays into the multi-variable space of simulated annealing parameters by examining the algorithm's predictions when a single parameter is changed and when combinations of parameters are changed. Note that we are exploring this space under the "best possible conditions" in terms of the data and the model: namely, we used datasets containing no noise and we used a model whose parameters were trained with all six datasets.

¹ At the start of every search, the algorithm attempts to pick a random sequence of the desired mass for the specified length. It may have difficulty finding such a sequence if the length is too short or too long.
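The footnote does not say how the random starting sequence is chosen. One possible way, sketched below purely as an assumption, is to work with nominal (integer) residue masses, tabulate which masses are reachable with a given number of residues, and then pick residues at random while staying on a reachable path. The names NOMINAL and random_sequence are hypothetical, and the letter I stands for both leucine and isoleucine, as in the tables. The sequences produced are random but not uniformly distributed.

```python
import random
from typing import Optional

# Nominal (integer) residue masses; I stands for both Leu and Ile.
NOMINAL = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
           "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def random_sequence(length: int, residue_mass: int,
                    rng: random.Random) -> Optional[str]:
    """Return a random sequence of `length` residues whose nominal masses
    sum to `residue_mass`, or None if no such sequence exists
    (for example when the length is too short or too long)."""
    # reachable[n] = set of total masses that n residues can produce.
    reachable = [{0}]
    for _ in range(length):
        prev = reachable[-1]
        reachable.append({m + w for m in prev for w in NOMINAL.values()
                          if m + w <= residue_mass})
    if residue_mass not in reachable[length]:
        return None
    # Choose residues one at a time, keeping the remainder reachable.
    chosen, remaining = [], residue_mass
    for n in range(length, 0, -1):
        options = [aa for aa, w in NOMINAL.items()
                   if remaining - w in reachable[n - 1]]
        aa = rng.choice(options)
        chosen.append(aa)
        remaining -= NOMINAL[aa]
    return "".join(chosen)


# Nominal residue mass of the angiotensin sequence as written in the tables.
target = sum(NOMINAL[aa] for aa in "DRVYIHPFHI")
start = random_sequence(10, target, random.Random(0))
```

With these masses, a request for length 3 or length 30 at the same target mass returns None, which mirrors the difficulty mentioned in the footnote.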
119 0123 0119 1205 8 9 10 11 12 13 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 HICHGVHPFHI,694030.16 HMIHAGHPSGRP,699639.42 PNGIHQHPSGVGP,715958.5 YYWYYWRP,914142.34 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 NCRGCIHPFHI,693794.29 CRGGGCIHPFHI,700914.51 PAQGPVPHPKGVI,725074.9 NCRGCIHPFHI,693794.29 CRGGGCIHPFHI,700914.51 HMIPAPHPGGAVI,707189.9 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 HICYPGIPFHI,700056.62 HMIPAPHPKGVI,702909.07 PAAGGHGVHPFHI,709675.0 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 NCRGCIHPFHI,693794.29 HMIHAGHPSGRP,699639.42 PQAGHRHPSGVGP,722727.3 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 PNAAHRHPFHI,692246.85 DRVYIHPGGGII,682825.96 PKAGHRHPSGVGP,722727.3 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 HICHGVHPFHI,694030.16 PGGAAHRHPFHI,699720.92 SAAAADCGHPFHI,724811.9 HKWWKYYW,910361.35 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 HIMPAAYPFHI,707205.39 PGGAAHRHPFHI,699720.92 HICTGGSAGPFHI,723111.5 YRWHHWRR,909109.71 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 HICHGVHPFHI,694030.16 PGGAAHRHPFHI,699720.92 PKAGHGVHPSGPR,722596.9 YWYWREFF,912071.27 HMIMFYRWI,747736.38 DRVYIHPFHI,644984.87 HICHGVHPFHI,694030.16 CRGGGCIHPFHI,700914.51 HMIHAGHPSGVGP,706485.7 (didn't converge),0.00 HMIMFYRWI,278945.79 RDVYIHPFHI,246465.16 HPNANIHPFHI,254911.32 HMIPAPHPNAVI,256264.55 HMIPAPHPGGAVI,258728.6 YYWYYWRP,344307.06 HMIMFYRWI,278945.79 RDVYIHPFHI,246465.16 HICHGVHPFHI,255383.07 HMIPAPHPNAVI,256264.55 HMIPAPHPGGAVI,258728.6 YYWYYWRP,344307.06 HMIMFYRWI,278945.79 HHRCIHPFHI,256265.93 HICHGVHPFCF,255278.29 HIMCAGGNPFCF,263968.83 HICMAGNGPNAVI,264847.9 HHWWRYRR,327530.14 HMIMFYRWI,278945.79 RDVYIHPFHI,246465.16 HICHGVHPFCF,255278.29 HPNAGGIHPFHI,256512.18 HMIPAPHPGGAVI,258728.6 HWWKKYYW,324801.74 HMIMFYRWI,278945.79 RDVYIHPFHI,246465.16 HPNANIHPFHI,254911.32 HPNAGGIHPFHI,256512.18 HPGGAGGIHPFHI,259194.2 HHWWRYRR,327530.14 HMIMFYRWI,278945.79 
RDVYIHPFHI,246465.16 HICHGVHPFHI,255383.07 HPNAGGIHPFHI,256512.18 HMIPAPHPGGAVI,258728.6 (didn't converge),0.00 HICYRQWYK,294150.11 RDVYIHPFHI,246465.16 HMIHQHPNAVI,256400.84 HMIPAPHPNAVI,256264.55 HMIHAGHPPAGGD,261261.7 YWYFIWRY,346389.43 HHWWHPFHI,269454.26 HHRCIHPFHI,256265.93 HPNANIHPFHI,254911.32 HPNAGGIHPFHI,256512.18 HICMAGNGPNAVI,264847.9 (didn't converge),0.00 HMIMWYYPR,279180.02 RDVYIHPFHI,246465.16 HPNANIHPFHI,254911.32 HMIPAPHPNAVI,256264.55 HICHGVHPGGGII,263342.6 (didn't converge),0.00 HMIMFYRWI,278945.79 RDVYIHPFHI,246465.16 HPFYPAAPFFC,259035.64 HMIPAPHPNAVI,256264.55 HICYPIGPGGAVI,261826.2 (didn't converge),0.00 IHMMFWRYI,2617491.58 RDVYIHPFHI,2361854.84 HICYPGIPFHI,2397777.77 IHCYPAVPKGVI,2414683.06 IHCGGAACAPFHI,2506416.0 MHHWWWWK,3142580.37 IHMMFWRYI,2617491.58 IHCHRHPFHI,2426544.80 HICYPGIPFHI,2397777.77 IHCYPAVPGQVI,2414131.77 IHCYPVAPGGAVI,2431328.6 (didn't converge),0.00 IHMMYYYYI,2645006.18 RDVYIHPFHI,2361854.84 HICYPGIPFHI,2397777.77 PGQAYPGIPFHI,2467377.90 IHCPVPHPGGAVI,2435215.2 (didn't converge),0.00 IHMMFWRYI,2617491.58 RDVYIHPFHI,2361854.84 HICYPGIPFHI,2397777.77 (didn't converge),0.00 IHCYPVAPGGAVI,2431328.6 (didn't converge),0.00 IHMMFWRYI,2617491.58 IHCHRHPFHI,2426544.80 HICYPGIPFHI,2397777.77 IHMGGCANPMCY,2495356.20 IHCYPVAPGGAVI,2431328.6 (didn't converge),0.00 IHMMFWRYI,2617491.58 IHCHRHPFHI,2426544.80 HICYPGIPFHI,2397777.77 IPGGGHQHPFHI,2458782.53 IHCYPVAPGGAVI,2431328.6 IRTYWWWW,2957211.69 IHMMFWRYI,2617491.58 RDVYIHPFHI,2361854.84 HICYPGIPFHI,2397777.77 PGQAYPGIPFHI,2467377.90 IPGGGPAPHPFHI,2482744.5 IRTYWWWW,2957211.69 IHMMFWRYI,2617491.58 IHCHRHMFCI,2449025.18 HICYPGIPFHI,2397777.77 IHCYPAVPGQVI,2414131.77 (didn't converge),0.0 IRTYWWWW,2957211.69 IHMMFWRYI,2617491.58 RDVYIHPFHI,2361854.84 HICYPGIPFHI,2397777.77 IHCYPAVPGQVI,2414131.77 IHCPVPHPGGAVI,2435215.2 (didn't converge),0.00 IHMMFWRYI,2617491.58 RDVYIHPFHI,2361854.84 HICYPGIPFHI,2397777.77 IHCPVPHPGQVI,2416431.25 IHCYPVAPGGAVI,2431328.6 Table 10.2: Simulated Annealing 
Results for Different Lengths. This table lists the ten predictions made by a search, restricted to sequences of length 8 to 13 inclusive, on the 0123, 0119 and 1205 datasets. 0121 0220 0218 8 9 YYWYYWRP,1426797.79 HMIMFYRWI,1121049.04 HICHRHPFHI,1042694.80 HICYPGIPFHI,1033167.66 PANAYVPAMIHI,1056730.57 PGAGAHGVHPFHI,1068282.4 YWYWREFF,1413671.31 HMIMYYWRP,1124357.12 RDVYIHPFHI,973923.46 HICYPGIPFHI,1033167.66 HMIHGAHPGSRP,1065444.91 PGAGAHGVHPFHI,1068282.4 YWYWREFF,1413671.31 HMIMFYRWI,1121049.04 RDVYIHPFHI,973923.46 RDVYIPGVKHI,1052268.12 PANAYVPAMIHI,1056730.57 HMIPACMPGGAVI,1066611.8 YWYWREFF,1413671.31 HMIMFYRWI,1121049.04 RDVYIHPFHI,973923.46 HICYPGIPFHI,1033167.66 PANAYVPAMIHI,1056730.57 PGAGAHGVHPFHI,1068282.4 YWYWREFF,1413671.31 HMIMFYRWI,1121049.04 RDVYIHPFHI,973923.46 HMIHQHIGARP,1059967.29 HMIPACMPNAVI,1060398.24 PGAGAHGVHPFHI,1068282.4 RDVYWWWW,1297273.15 HMIMFYRWI,1121049.04 RDVYIHPFHI,973923.46 HMICNANPMCY,1063898.78 HMIMPIGSSAPR,1066942.84 PGAGAHGVHPFHI,1068282.4 YWYWREFF,1413671.31 HMIMFYRWI,1121049.04 RDVYIHPFHI,973923.46 RDVYIHPGNII,1044130.50 HMIYPAAIGAPR,1058998.80 PGAGAHGVHPFHI,1068282.4 YWYWREFF,1413671.31 HMIMYYWRP,1124357.12 RDVYIHPFHI,973923.46 IHCHRHIGARP,1066807.39 PGQAHGVHPFHI,1057926.96 HICTGGAGSPFHI,1077096.2 10 11 12 13 YWYWREFF,1413671.31 HMIMYYWRP,1124357.12 RDVYIHPFHI,973923.46 HICYPGIPFHI,1033167.66 IHCYVPAIGAPR,1066213.71 PGAGAHGVHPFHI,1068282.4 ERDWWYRW,1401440.03 HMIMFYRWI,1121049.04 RDVYIHPFHI,973923.46 HICYPGIPFHI,1033167.66 IHCYVPAIGAPR,1066213.71 PGGAAYVPAMIHI,1069764.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 PPGVCAAGGGGAF,529482.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSNASAG,481168.09 RPPGFSGGASAG,485736.16 SSPPGESAGGSGA,532466.9 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 PPGHVPGPFGV,514439.80 RPPGFSGGASAG,485736.16 SPPGSESSGGAGA,532099.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 
RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 SPPGSESSGGAGA,532099.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 PPGVCAAGGGGAF,529482.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 SPPGSESSGGAGA,532099.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 SPPGSESSGGAGA,532099.6 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 PDGEGVSGGAGGT,534916.8 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPSDT,477261.05 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 SSPPGESAGGSGA,532466.9 RPDWSPFR,474834.93 RPPGFSPFR,463690.29 RPPGFSPFGV,475238.25 RPPGFSAGGSK,480531.83 RPPGFSGGASAG,485736.16 SPPGSESSGGAGA,532099.6 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 RPPGFSGGASK,948303.57 RPPGFSGGAGGT,956080.93 PPGESSSGGAGGT,1088705.5 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 RPPGFSGGASK,948303.57 RPPGFSGGAGGT,956080.93 PPGESSSGGAGGT,1088705.5 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 RPPGFSGGANT,948506.39 RPPGFSGGAGGT,956080.93 EDGSPSSPGGGGG, 1083355.0 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 PPGHVPGFPVG,1038657.50 RPPGFSGGAGGT,956080.93 RPGGGGCGGGASK,1066973.5 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPFGV,928388.14 RPPGFSGGANT,948506.39 RPPGFSGGAGGT,956080.93 PPGESSSGGAGGT,1088705.5 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 PPGFSVGPESS,1053005.48 RPPGFSGGAGGT,956080.93 PDGEGVSGGAGGT,1094221.3 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 RPPGFSGGANT,948506.39 RPPGFSGGAGGT,956080.93 PPGESSSGGAGGT, 1088705.5 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 EDGSPSPPGAF,1039618.80 RPPGFSGGAGGT,956080.93 PPGESSSGGAGGT,1088705.5 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 EDGSPSPPGAF,1039618.80 RPPGFSGGAGGT,956080.93 
EDGSPSSPGGGGG,1083355.0 RPDWSPFR,939042.40 RPPGFSPFR,892716.45 RPPGFSPSDT,927701.33 EDGSPSPPGAF,1039618.80 RPPGFSGGAGGT,956080.93 EDGSPSSPGGGGG,1083355.0

Table 10.3: Simulated Annealing Results for Different Lengths (cont). This table lists the ten predictions made by a search, restricted to sequences of length 8 to 13 inclusive, on the 0121, 0220 and 0218 datasets.

Dataset:          0123  0119  1205  0121  0220  0218
No Noise Size:      49    34    34    42    46    47
Restricted          10     8     8     8    10    10
                     0     0     0     0     0     0
                     0     0     0     0     0     0
Unrestricted         6     3     1     3    10    10
                     0     0     0     0     0     0
                     0     0     0     0     0     0

Table 10.4: Simulated Annealing of Datasets Without Noise

The various experiments and their resulting outcomes, again out of 10 executions, are detailed in Tables 10.5, 10.6 and 10.7. The parameter settings for experiment A are those of Section 8.6 (namely, the settings we have been using thus far), and the settings for the other experiments are the same as those of experiment A unless otherwise noted. Additionally, running times (an average of three runs) for each experiment are included², and the ratio to the corresponding experiment A running time is given in parentheses.

Certain experiments (C, D and E, which involve a single parameter change, and H, K, M, O and P, which involve multiple changes) show some improvement in sequence prediction, but at the expense of an increase in computation time. One could imagine using a genetic algorithm as a more rigorous method for optimizing these search parameters as well as the model parameters from Section 9.2. But in the meantime, we elect to continue using the initially chosen settings (experiment A) since they seem to do fairly well for the price paid in computation time.

10.6 Summary

Simulated annealing proved to be an effective means for exploring the space of sequence guesses. It performed best when peptide guesses were restricted to a specific length.
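The length-partitioned strategy of Section 10.3 (one restricted search per candidate length, keeping the best score over all lengths) can be sketched as follows. Lower scores are taken as better, as in Tables 10.2 and 10.3. The function name best_over_lengths and the restricted_search callable are hypothetical stand-ins for the actual search; a run that does not converge is represented here by None.

```python
from typing import Callable, Iterable, Optional, Tuple

Result = Tuple[str, float]  # (sequence, score); lower scores are better


def best_over_lengths(restricted_search: Callable[[int], Optional[Result]],
                      lengths: Iterable[int],
                      runs: int = 10) -> Result:
    """Run a length-restricted search `runs` times for each length and
    return the lowest-scoring (sequence, score) pair seen overall."""
    best: Optional[Result] = None
    for n in lengths:
        for _ in range(runs):
            candidate = restricted_search(n)
            if candidate is None:  # search did not converge
                continue
            if best is None or candidate[1] < best[1]:
                best = candidate
    if best is None:
        raise ValueError("no search produced a sequence")
    return best


# Stub search that replays the first 0123 prediction of each length
# from Table 10.2 instead of running simulated annealing.
first_row_0123 = {
    8: ("YWYWREFF", 912071.27), 9: ("HMIMFYRWI", 747736.38),
    10: ("DRVYIHPFHI", 644984.87), 11: ("HICHGVHPFHI", 694030.16),
    12: ("HMIHAGHPSGRP", 699639.42), 13: ("PNGIHQHPSGVGP", 715958.5),
}
winner = best_over_lengths(lambda n: first_row_0123.get(n), range(8, 14), runs=1)
```

Skipping non-converged runs matters here: if they were recorded with a score of 0.00, a lowest-score rule would wrongly select them.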
Since the length of the correct sequence may not be known a priori, and an unrestricted search does not seem to perform as well, one solution is to conduct several searches restricted to different sizes and then select the best overall sequence.

² The running times are only very rough estimates: the algorithm is probabilistic in nature, and the machines on which these simulations were run serve other processes as well.

Experiment A: (see Section 8.6) B: To = 150000 restricted unrestricted restricted unrestricted C: nover* 2 restricted unrestricted D: nlimit* 2 restricted unrestricted E: k = 0.95 restricted unrestricted F: k 0.75 restricted unrestricted G: tempsteps* 5 restricted unrestricted Outcomes 0119 1205 0121 (see Table 10.1) (see Table 10.1) 8 6 10 0 0 0 0 2 0 6 1 4 0 0 0 1 0 0 10 7 9 0 0 0 0 1 1 2 1 8 0 0 0 0 0 0 8 7 10 0 0 0 2 2 0 4 0 1 0 0 0 0 1 0 0 4 7 9 6 2 0 0 1 0 1 5 2 0 1 0 1 0 6 7 8 0 0 0 0 0 0 2 2 1 0 0 0 0 8 0 7 0 8 0 0 0 1 2 0 1 0 0 0 0 0 3 0 0 Performance in Minutes(Ratio to A) 0119 1205 0121 7.58(1.00) 7.73(1.00) 10.56(1.00) 8.65(1.00) 9.16(1.21) 9.13(1.00) 7.88(1.02) 11.35(1.00) 11.15(1.06) 8.82(1.02) 10.14(1.11) 10.68(0.94) 16.41(2.16) 14.47(1.87) 20.96(1.98) 14.19(1.64) 15.71(1.72) 19.41(1.71) 10.52(1.39) 8.61(1.11) 11.96(1.13) 10.78(1.25) 10.54(1.15) 13.70(1.21) 13.44(1.77) 16.35(2.12) 20.66(1.96) 13.03(1.51) 17.11(1.87) 16.78(1.48) 3.09(0.41) 3.50(0.45) 3.55(0.34) 3.34(0.39) 3.52(0.39) 4.29(0.38) 9.05(1.19) 7.79(1.01) 10.31(0.98) 10.23(1.18) 7.99(0.88) 10.19(0.90)

Table 10.5: Exploring Simulated Annealing Parameter Space

Experiment H: nover* 2, nlimit* = 2 restricted unrestricted I: To = 75000 restricted unrestricted J: To 50000 restricted unrestricted K: k = 0.95, tempsteps* = 5 restricted unrestricted L: To = 150000, tempsteps* = 5 restricted unrestricted M: k = 0.95, nover* = 2, nlimit* = 2 restricted unrestricted Outcomes 0119 1205 0121 10 9 10 0 0 0 0 0 0 5 0 7 0 0 0 0 1 0 8 9 6 0 0 0 0 0 0 3 2 1 0 0 0 0 0 0 7 4 10 0 0 1 0 0 8 0 2 4 0 0 9 0 0 4 0 0 0 10 0 0 0 1 2 0 0 9 0 0 3 0 0 4 0 0 1 0 0 0 10 0 0 0 0 2 0 0 10 0 0 4 0 0 10 0 0 2 0 0 2 8 0 2 7 1 6 0 3 0 Performance in Minutes(Ratio to A) 0121 1205 0119 23.83(2.26) 19.13(2.47) 17.59(2.32) 17.83(2.06) 18.97(2.08) 22.83(2.01) 8.87(1.17) 6.95(0.90) 11.45(1.08) 8.34(0.96) 7.25(0.79) 8.89(0.78) 7.76(1.02) 6.95(0.90) 11.32(1.07) 8.82(1.02) 8.62(0.94) 9.72(0.86) 13.89(1.83) 17.47(2.26) 22.04(2.09) 15.37(1.78) 17.65(1.93) 22.02(1.94) 9.61(1.27) 8.48(1.10) 11.27(1.07) 9.79(1.13) 9.25(1.01) 10.88(0.96) 26.25(3.46) 37.17(4.81) 47.46(4.49) 27.14(3.14) 36.98(4.05) 45.48(4.01)

Table 10.6: Exploring Simulated Annealing Parameter Space (cont)

Experiment 0119 N: To = 75000, 8 9 7 9.98(1.32) 8.05(1.04) 12.66(1.20) unrestricted 0 1 3 0 0 2 0 1 4 10.40(1.20) 9.45(1.04) 11.01(0.97) 0 0 0 0 0 0 8 0 9 0 9 16.38(2.16) 11.86(1.53) 22.21(2.10) 0 5 0 3 0 1 4 16.28(1.88) 16.18(1.77) 22.84(2.01) 0 0 0 0 0 0 9 0 10 0 9 0 18.69(2.47) 17.71(2.29) 30.33(2.87) 1 2 0 2 1 4 17.70(2.05) 18.89(2.07) 30.18(2.66) 0 0 0 0 0 0 9 8 8 7.56(1.00) 8.85(1.14) 14.12(1.34) 0 0 0 1 0 1 2 0 3 0 5 0 8.85(1.02) 8.47(0.93) 13.05(1.15) 0 0 1 restricted nover* = 2 unrestricted P: To = 75000, restricted nover* =2, nlimit* 2 unrestricted Q: To = 75000, Performance in Minutes(Ratio to A) 0119 1205 0121 restricted nlimit* = 2 0: To = 75000, Outcomes 1205 0121 restricted tempsteps* = 2 unrestricted

Table 10.7: Exploring Simulated Annealing Parameter Space (cont)

Chapter 11

Testing the Approach

This chapter focuses more on the data and the predictions made by the algorithm than on the specifics of the simulated annealing search. Because data was limited, we performed a Leave One Out Cross-Validation, and looked for other sources of data to use for algorithm validation. We also investigate the 1205 dataset in more detail.

11.1 Leave One Out Cross-Validation

Is the model too specific for its training set?
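One standard way to probe this question is Leave One Out Cross-Validation, applied in this section both to individual datasets and to whole peptides. As a rough sketch, the helpers below only enumerate the train/validate splits; the function names are hypothetical, and training and scoring are left out entirely. The grouping of the six datasets into four angiotensin and two bradykinin datasets follows the text.

```python
from typing import Dict, Iterator, List, Tuple

DATASETS: Dict[str, List[str]] = {
    "angiotensin": ["0123", "0119", "1205", "0121"],
    "bradykinin": ["0220", "0218"],
}

Split = Tuple[List[str], List[str]]  # (training ids, validation ids)


def leave_one_dataset_out(groups: Dict[str, List[str]]) -> Iterator[Split]:
    """Train on all datasets but one, validate on the one left out."""
    all_ids = [d for ids in groups.values() for d in ids]
    for held_out in all_ids:
        yield [d for d in all_ids if d != held_out], [held_out]


def leave_one_peptide_out(groups: Dict[str, List[str]]) -> Iterator[Split]:
    """Train on every peptide but one, validate on all datasets of that one."""
    for held_out, ids in groups.items():
        train = [d for name, other in groups.items()
                 if name != held_out for d in other]
        yield train, list(ids)


dataset_splits = list(leave_one_dataset_out(DATASETS))
peptide_splits = list(leave_one_peptide_out(DATASETS))
```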
One would naturally expect the algorithm to perform well when run on datasets that were used in the algorithm's training (Table 10.1). However, when sequencing de novo, the peptide is novel and could not have been used previously to train the model. So a more realistic situation is to train on a subset of the data, and then ask how the algorithm performs on the remaining datasets. This technique, called Leave One Out Cross-Validation, is commonly used when data is scarce; typically, one trains on the largest possible training set (n-1 of n datasets) and validates with the remaining dataset.

Leave One Out Cross-Validation was performed on two levels: (1) since we had data for two peptides, we trained the model on one peptide and used the other for validation, and (2) since we had six individual datasets, we trained the model on all but one, and validated on the remaining one. This led to the following scenarios:

• Scenario 1: train the model using the angiotensin datasets only, validate on the bradykinin datasets,

• Scenario 2: train the model using the bradykinin datasets only, validate on the angiotensin datasets,

• Scenario 3: train the model using five of the six datasets, validate on the sixth.

11.1.1 Results for the Different Scenarios

The trained parameter values for each scenario are listed in Table 11.1, and for each scenario, Table 11.2 summarizes the number of times the sequence was correctly predicted, again out of 10 attempts.

                 Overall  Scen. 1  Scen. 2       Scenario 3: All But...
Parameter        (All 6)  Angio    Brady    0123  0119  1205  0121  0220  0218
Pnovariant       0.58     0.60     0.52     0.57  0.58  0.61  0.58  0.58  0.60
tendency_m18     1        1        1        1     1     1     1     1     1
tendency_m17     2        2        2        2     2     2     2     2     2
tendency_p18     2        2        1        2     2     1     2     2     2
Prandom          0.18     0.16     0.22     0.18  0.18  0.17  0.19  0.17  0.17
Punobserved      0.0      0.0      0.0      0.0   0.0   0.0   0.0   0.0   0.0
Pax              0.24     0.23     0.27     0.24  0.31  0.26  0.25  0.24  0.23
Pb               0.56     0.60     0.47     0.57  0.56  0.47  0.58  0.57  0.58
non-breaks       3.0      3.2      2.6      2.9   3.0   3.3   2.9   2.9   3.3
M[*,H],M[*,P]    2.4      2.2      2.9      2.3   2.4   3.3   2.3   2.4   2.3

Table 11.1: Model Parameters for the Different Scenarios. The values of the Overall Model from Table 9.2 are included for ease of comparison.

The results of Scenarios 1 and 2 indicate that the model, when trained on one peptide, is able to successfully predict the sequence of a different peptide not seen by the model. The algorithm was not as successful with the 1205 dataset in Scenario 2, and even less so in Scenario 3, but recall that 1205 was known to be flawed.

Restricted Unrestricted 3 2 1 Scenario: Dataset: 0123 9 0119 8 1205 1 0121 9 0220 9 0218 8 0 0 0 0 0 0 0 0 6 0 5 0 2 0 1 0 3 0 10 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0220 8 0218 9 0123 10 0119 10 1205 5 0121 10 0 0 0 0 0 0 10 0 10 0 7 0 1 2 1 0 0 0 0 0 1 0 0 0 0

Table 11.2: Results of Running Simulated Annealing with Model Parameters from the Different Scenarios

11.1.2 Investigation of the 1205 Dataset

Training on All But 1205 (AllBut1205)

When the model is trained without 1205 (the All But 1205 column of Table 11.1), the algorithm also fails to find the correct sequence for 1205 (see Table 11.3). This indicates that there is something about the 1205 dataset that is different.

Dataset:          0123  0119  1205  0121  0220  0218
Restricted          10    10     1    10    10    10
                     0     0     0     0     0     0
                     0     0     1     0     0     0
Unrestricted         8     5     0     6    10    10
                     0     0     0     1     0     0
                     0     0     0     0     0     0

Table 11.3: Results of Running Simulated Annealing with Model Parameters Trained With All Angiotensin and Bradykinin Datasets Except 1205 (AllBut1205)

If one were to compare the 1205 dataset with the other angiotensin datasets, one would notice the following feature differences: the 1205 parent ion is not the highest peak (rather, a Bp18 ion is); the magnitudes of the 1205 intensities are higher than those of the other datasets; and the peaks appear to be saturated (an effect of high laser intensity or long acquisition times).
Examination of the model parameters when training with all datasets but 1205 reveals that there are three model parameter values that draw particular attention: tendency_p18 = 1, Pb = 0.47 and M[*,H] = M[*,P] = 3.3. Since these values result when 1205 is excluded, they would suggest that in the 1205 dataset:

1. the "influence" of p18 variants is high (because when the 1205 dataset is excluded, the tendency_p18 value decreases),

2. the "influence" of B type breaks is high, and

3. the "influence" of N-terminal H and P breaks is low.

Some means was needed for comparing the datasets. Because the total number of observed trials differs from one dataset to another, one cannot simply compare the number of peaks and/or their intensities without some sort of normalizing factor. One might consider the contribution of a set of peaks to the overall score as a measure of the influence of those peaks. Since a few peaks of high intensity can be more influential than many peaks of low intensity, this quantity is attractive because it depends upon both the cardinality of the set and the intensity of each peak. However, contributions must be computed with respect to some PMF generated by some model. Scores will change and compare differently depending upon the model parameters. The difficulty thus lies in determining which parameter values are appropriate to use in order to effect such a comparison.

Comparing Datasets by Training on a Single Dataset

We decided to use the model parameters themselves, when trained on each dataset alone, as a means for comparing the datasets. The resulting parameter values should reflect the fragment type composition and fragmentation patterns of the training dataset, and thus can potentially serve as indicators of influence. Table 11.4 lists the settings obtained from single dataset training, and we can indeed verify that the parameter settings for the model, when trained with 1205 only, are supportive of the above assertions.
Note that for some of the parameters, the settings cover a wide range of possible values across all datasets (e.g. Pnovariant and non-break tendencies). Also, other datasets have deviant setting values as well: 0220 has a high Prandom and non-break tendency; 0119, a higher Pax and M[*,H],M[*,P] value; and 0218, a rather low Pnovariant and non-break tendency. Some of these values were surprising as they were not readily apparent from simple visual inspection of the dataset, nor were they anticipated from the experiments performed so far. These may indicate that the model needs to be further tuned in terms of the number of parameters and/or the granularity of each parameter.

Parameter        0123  0119  1205  0121  0220  0218
Pnovariant       0.71  0.78  0.56  0.62  0.62  0.47
tendency_m18     1     1     1     1     1     1
tendency_m17     2     2     2     2     2     2
tendency_p18     1     1     2     1     1     1
Prandom          0.15  0.12  0.19  0.13  0.29  0.18
Punobserved      0.0   0.0   0.0   0.0   0.0   0.0
Pax              0.28  0.34  0.21  0.23  0.28  0.26
Pb               0.46  0.4   0.69  0.49  0.44  0.49
non-breaks       4.1   4.1   2.8   3.5   5.6   1.6
M[*,H],M[*,P]    3.7   4.4   1.5   3.2   3.6   2.7

Table 11.4: Model Parameters When Trained With a Single Dataset

As a control, we ran the algorithm on the very same dataset used to train the model. One would expect the algorithm to do well, and this was the case for almost all the datasets except 1205.

Dataset:          0123  0119  1205  0121  0220  0218
Restricted          10     9     6    10     9    10
                     0     0     0     0     0     0
                     0     1     1     0     0     0
Unrestricted         7     5     1     6    10    10
                     0     0     0     0     0     0
                     0     0     0     0     0     0

Table 11.5: Results of Running Simulated Annealing with Each Model of Table 11.4 on Its Training Set

Training on All Angiotensin Except 1205 (AllAngioBut1205)

This time, we trained the model on all angiotensin datasets except 1205 (see Tables 11.6 and 11.7). The algorithm did not work well for 1205: the correct sequence was not the best scoring sequence. We knew 1205 was questionable, and this indicates that 1205 is very "un-angiotensin like".
Parameter        All Angio But 1205
Pnovariant       0.68
tendency_m18     1
tendency_m17     2
tendency_p18     1
Prandom          0.14
Punobserved      0.0
Pax              0.27
Pb               0.45
non-breaks       3.8
M[*,H],M[*,P]    3.6

Table 11.6: AllAngioBut1205: Trained Parameter Values

Dataset:          0123  0119  1205  0121  0220  0218
Restricted          10     9     0    10    10    10
                     0     0     0     0     0     0
                     0     0     0     0     0     0
Unrestricted        10     7     0     7    10    10
                     0     0     0     0     0     0
                     0     0     0     0     0     0

Table 11.7: Results of Running Simulated Annealing with Table 11.6 Model Parameters on the Other Datasets

It is interesting to note that when 1205 is removed from the training set, there seems to be an overall improvement in the performance of the unrestricted search for the other datasets (compare Tables 11.3 and 11.7 to Table 10.1).

Influential Training Datasets

Whenever 1205 is included in our training sets, the resulting parameter settings bear a closer resemblance to those of 1205 than to those of any other dataset (compare columns 2 and 3 of Table 11.1 to Table 11.4). In addition, with Leave One Out Cross-Validation, the greatest change in parameter settings was observed when the single dataset excluded was the 1205 dataset (compare column 2 of Table 11.1 to columns 5-10 of Table 11.1).

The reason 1205 exerts such a strong influence during training is likely its higher peak intensities. Compared to the other datasets, the 1205 score is larger in magnitude¹, and since the model is trained by optimizing on the sum of the scores of the training datasets, large scoring datasets will overshadow and unfairly overpower the scores of other datasets. (Conversely, of our datasets, 0119 is the least influential; it also has the lowest peak intensities.) Thus, model parameters are biased towards datasets with high intensities, and will tend to reflect any idiosyncrasies these datasets may have. In the case of 1205, these parameters would be adversely affected.
Some form of normalization may help re-weight the datasets and equalize them during training so that no one dataset will be strong enough to unduly sway the parameter settings. Another solution would be to train with a much larger training set so that negative effects due to any single dataset are diluted.

Normalization of Training Data

We explored the effect of normalization during training² by modifying the heights of the experimental peaks in one of two ways:

Method I: Normalize on the parent ion by dividing the intensity of each peak by the intensity of the parent.

Method II: Adjust all the resulting intensity quotients of Method I so that their sum is 1.

The Method I correlation score is essentially the original unnormalized score divided by the height of the parent ion. The Method II correlation score is the original divided by the sum of all peak heights in the dataset. The parameter settings are listed in Table 11.8 and the results in Tables 11.9 and 11.10.

¹ Recall that the logarithm of Prob(S,F) from Section 8.2 is a sum of products. The higher the intensity, the larger the multiplier, the larger the product and the resulting sum, and the larger the score.

² Note that the intent of this line of inquiry is not to try to normalize in order to improve the performance of Tables 11.7 or 11.3, because 1205 was not used to train these models. Furthermore, if 1205 were indeed unlike the typical angiotensin dataset, then no form of normalization would help salvage it.
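The two normalization methods described above can be sketched as follows. The function names and the toy peak heights are hypothetical, and the list of intensities is assumed to include the parent ion peak itself.

```python
from typing import List, Sequence


def normalize_method_1(intensities: Sequence[float],
                       parent_intensity: float) -> List[float]:
    """Method I: divide every peak intensity by the parent ion intensity."""
    if parent_intensity <= 0:
        raise ValueError("parent intensity must be positive")
    return [x / parent_intensity for x in intensities]


def normalize_method_2(intensities: Sequence[float],
                       parent_intensity: float) -> List[float]:
    """Method II: rescale the Method I quotients so that they sum to 1."""
    quotients = normalize_method_1(intensities, parent_intensity)
    total = sum(quotients)
    if total <= 0:
        raise ValueError("intensities must have a positive sum")
    return [q / total for q in quotients]


# Toy spectrum (hypothetical heights); the first peak is the parent ion.
heights = [200.0, 50.0, 150.0]
method_1 = normalize_method_1(heights, parent_intensity=200.0)
method_2 = normalize_method_2(heights, parent_intensity=200.0)
```

In this toy case the Method II values equal each height divided by the sum of all heights, which is why the Method II score amounts to the original score divided by the sum of all peak heights.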
Parameter        Method I  Method II
Pnovariant       0.60      0.63
tendency_m18     1         1
tendency_m17     2         2
tendency_p18     1         1
Prandom          0.18      0.18
Punobserved      0.0       0.0
Pax              0.26      0.17
Pb               0.5       0.49
non-breaks       3.0       3.4
M[*,H],M[*,P]    2.9       3.1

Table 11.8: Trained Parameter Values When Normalizing All Six Datasets

Dataset: Restricted Unrestricted 0123 10 0119 10 1205 5 0 0 0 5 0 2 0 0 0121 9 0220 10 0218 10 0 0 0 0 1 1 0 5 0 10 0 10 0 0 0 0 0 0 0 0 0 0

Table 11.9: Results of Running Simulated Annealing When Datasets Are Normalized (method I)

Dataset: Restricted Unrestricted 0123 10 0119 10 1205 0 0121 10 0220 10 0218 10 0 0 0 0 0 2 0 0 0 0 0 0 5 3 0 4 10 10 0 0 0 0 0 0 0 0 0 0 0 0

Table 11.10: Results of Running Simulated Annealing When Datasets Are Normalized (method II)

One would expect the influence of 1205 to increase with Method I because, of all our datasets, it has the lowest intensity parent ion. And indeed, the Method I results were comparable to those of the unnormalized model (Table 10.1): there was enough "1205 character" in the Method I parameters to allow the algorithm to correctly predict 1205 some of the time. Note that the performance of the algorithm on the other datasets remained relatively unaffected.

One would also expect the influence of 1205 to decrease with Method II because the sum of all peak heights is largest for 1205 and therefore 1205 contributes less to the sum being minimized during training. It was not a surprise, then, that with Method II, the other datasets prevailed, and because 1205 was so unlike them, the algorithm did poorly, preferring a better-scoring competing sequence, IRTYIHPFHI.

While normalization during training may be useful, normalization of the input spectra, i.e. the data used for algorithm validation, is unnecessary. An input spectrum is used only to score sequence guesses against each other.
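A toy check of why rescaling an input spectrum cannot reorder the guesses is given below. It assumes a score that is a sum of intensity-weighted terms, in the spirit of the earlier footnote describing the logarithm of Prob(S,F) as a sum of products. The per-peak weights, the guess names and the function names are all hypothetical.

```python
from typing import Dict, List, Sequence


def linear_score(intensities: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Score as a sum of products: each intensity multiplies a per-peak term."""
    return sum(i * w for i, w in zip(intensities, weights))


def ranking(intensities: Sequence[float],
            weights_by_guess: Dict[str, List[float]]) -> List[str]:
    """Guesses ordered from lowest (best) to highest score."""
    return sorted(weights_by_guess,
                  key=lambda g: linear_score(intensities, weights_by_guess[g]))


# Hypothetical per-peak terms for three candidate sequences.
weights_by_guess = {
    "guess_a": [1.0, 3.0, 2.0],
    "guess_b": [2.0, 1.0, 1.0],
    "guess_c": [3.0, 0.5, 4.0],
}
raw = [4.0, 2.0, 1.0]
scaled = [x / 4.0 for x in raw]  # e.g. dividing by the parent ion intensity

raw_order = ranking(raw, weights_by_guess)
scaled_order = ranking(scaled, weights_by_guess)
```

Every score is divided by the same positive constant, so the two orderings coincide.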
Normalizing the input spectrum with either Method I or Method II amounts to multiplying the score assigned to every guess by the same constant factor, so normalizing with either method would not alter any guess rankings.

Factory Angiotensin: a Valid Angiotensin Dataset

This last section in our discussion of 1205 establishes that our non-1205 datasets are indeed valid datasets for angiotensin, and confirms that 1205 is a poor dataset for angiotensin.

As mentioned in Section 9.1, our mass spectrometer came with an angiotensin PSD spectrum of 56 peaks, probably acquired for instrument testing/evaluation purposes. Fortunately, this provided us with another angiotensin sample point. The parameter settings for a model trained on this dataset alone are given in Table 11.11 and they are within normal range of the other non-1205 angiotensin settings (Table 11.4). The tendency_p18 setting resembles that of 1205, but recall that coarse-grained values were used for these parameters (so that a value even slightly above 1.5 would be reported as a 2 and, similarly, a value slightly below 1.5 would be treated as a 1).

Parameter        Factory Angiotensin
Pnovariant       0.77
tendency_m18     1
tendency_m17     2
tendency_p18     2
Prandom          0.14
Punobserved      0.0
Pax              0.33
Pb               0.44
non-breaks       4.2
M[*,H],M[*,P]    4.1

Table 11.11: Factory Angiotensin: Trained Parameter Values

Sequence prediction was performed on this dataset (see Table 11.12) using the Overall model, as well as models trained without 1205 such as AllBut1205 and AllAngioBut1205. The unrestricted Overall search did not do well because it encountered a lower scoring sequence of length 11, HICYPVAPFHI. When 1205 is removed from the training set, however, the algorithm is more successful (last two columns of Table 11.12).
Model: Restricted Unrestricted Overall 3 AllBut1205 4 AllAngioBut1205 5 0 3 0 4 0 5 0 0 3 0 0 0 0 0 4

Table 11.12: Results of Running Simulated Annealing on Factory Angiotensin with Various Trained Models

This dataset and these results are important because they are consistent with the view that our non-1205 angiotensin datasets are valid and that 1205 is markedly different. Although acquired on the same machine, the factory dataset was collected independently and by a different operator. Despite any resulting variations in sample preparation style and acquisition technique, its tendencies are similar to those of the other non-1205 angiotensin datasets, and it is recognized by a model trained with them only (AllAngioBut1205).

We conclude that the 1205 dataset is an outlier and exclude it from the training sets in all subsequent analyses; it was thought likely to cause problems, and when included in the training, it did.

11.2 Meta-Analysis of Published Spectra

To obtain more validating data, we searched the literature for cation MALDI-PSD spectra of peptides that contained no modified residues, and had H- and -OH as the N- and C-terminal groups respectively. Although we preferred spectra with a substantial number of peaks labeled, for datasets that were small, we manually interpolated and included in the dataset peaks that were unlabeled but of a high enough intensity to be of potential interest. Since the real sequence was known, we could sometimes use the theoretical spectrum to deduce masses for unlabeled peaks when the resolution of the spectrum was good enough. Otherwise, manual measurements, which may correctly salvage a legitimate peak otherwise ignored, could potentially introduce unwanted noise into the dataset.
If the peak intensities of a published spectrum were not included in the paper, then the intensities were inferred by measuring the height of the printed peak with a ruler, picking an intensity for the parent peak, and computing the heights of all others relative to this. Note that as a result, the absolute intensities of these peaks differ from their actual intensities, but their relative intensities are roughly preserved. This may result in a slight fluctuation in the computed score, but our approximation of peak heights does not appear to be grossly unreasonable. Recall that datasets are not compared to each other; rather, the same dataset is used to evaluate different candidate sequences.

The performance of the algorithm using the AllBut1205 model is shown in Table 11.13. The algorithm produced acceptable sequence predictions for the first three peptides. The poor performance on the first peptide (4 out of 10), particularly when the correct sequence is the minimum scoring sequence, is indicative of a poor search. The searching parameters may be suboptimal, and the search may be further complicated by the surface features of the scoring function; e.g. the global minimum could be located at the bottom of a very narrow well, so that moves out of the well are easy but moves into the well are difficult. After trying several different combinations of searching parameters, we were able to improve the performance of the algorithm to (6 2 0), but the algorithm took six times as long.

Originally, the fourth dataset consisted of the 33 labeled peaks in the published spectrum, and our algorithm predicted DRVYIHPFIHIIHV (restricted search). Three other masses (251, 263 and 463.2, corresponding to internal ions for LH, VY and HLLV respectively) were mentioned in the text of the paper but not labeled in its figure. When these were included in the dataset, the restricted search then found DRVYIHPFIHIVYS, but only 3 out of 10 times.
The unrestricted search, on the other hand, found a longer sequence that did not resemble the real sequence but scored better. The paper mentioned other immonium ions, including two for tyrosine (Y), and these might have helped focus the unrestricted search on a sequence more similar to the real one, but we could not locate these peaks in the spectrum to estimate peak intensities, so they were not included. The problem with this dataset is that it is too small for a peptide of this length.

The use of published spectra in this manner is definitely suboptimal and not terribly reliable. Fortunately, we were able to obtain several datasets that were more suitable for analysis.

11.3 Data from Another Center

Having data from another center is beneficial because it is collected independently, by a different operator on a different machine. This is especially useful for validation, because such data can help reveal any invalid assumptions we made that were specific to our instrument, data and/or protocol. Dr. Arnie Falick of Applied BioSystems supplied us with three PSD datasets, and Drs. Wishnok and Tannenbaum (MIT) allowed us to view the datafiles on their mass spectrometer. Peaks in the spectra for AVPYPQR and for DIYETDYYR were already labelled, so the peak lists were used as the datasets for these two peptides and no additional peaks were added. A peak list for the third peptide, AIEAQQHLLQLTVWGIK, was supplied to us directly by Dr. Falick.

The results of running our algorithm using the AllBut1205 model are shown in Table 11.14. For the third dataset, the performance of the restricted search can be improved to (5 0 1) with a longer search. Also, the unrestricted search actually found IAEAQKHIIQITVWGIQ six times (6 0 0), but QANAKKHIIQITVSVGGPS (1 0 2) was a better scoring sequence.
It is also a longer sequence, so when sequencing larger peptides, this may mean that the search cannot simply take the best scoring guess over all lengths (as was proposed in Section 10.3) without some improvement to the scoring function or the dataset quantity/quality. We will return to a discussion of longer peptides in Section 13.2.

11.4 Summary

Leave One Out Cross-Validation was promising, the inadequacy of 1205 was confirmed, and the performance of our algorithm was satisfactory for the most part. When spectra for peptides not used in the training set were presented to the algorithm, the predictions were adequate for the most part; however, we began to see the effects of (1) datasets that are too small (more data peaks are always welcome to increase redundancy), (2) searching parameters that may need improvement, and (3) longer peptides on the sequencing process. We address some of these issues in the next chapter.

Parent Mass of Dataset Peptide   1137.6      1375.8          1046.5     1758.9
Dataset Size:                    47          71              28         36
Source:                          [CLS99]     [Spe97]         [JnC96]    [JnC96]
Real Sequence:                   YGGFLRRIR   GDHFAPAVTLYGK   DRVYIHPF   DRVYIHPFHLLVYS
Predicted: Restricted            YGGFIRRIR   DGHFAPAVTIYGK   DRVYIHPF   DRVYIHPFIHIVYS
                                 4 0 0       8 0 0           10 0 0     3 0 0
Predicted: Unrestricted          YGGFIRRIR   DGHFAPAVTIYGK   DRVYIHPF   HIMIIIGCFTIYVHV
                                 7 0 0       8 0 1           9 0 0      5 0 0

Table 11.13: Results of Simulated Annealing Run on Datasets from the Literature Using a Model Trained Without 1205 (AllBut1205). The 1375.8 dataset was the only dataset for which extra peaks were not inferred.
Parent Mass of Dataset Peptide   830.4     1237.5      1948.1
Dataset Size:                    50        35          72
Real Sequence:                   AVPYPQR   DIYETDYYR   AIEAQQHLLQLTVWGIK
Predicted: Restricted            AVPYPQR   DIYETDYYR   IAEAQKHIIQITVWGIQ
                                 9 0 0     10 0 0      4 0 0
Predicted: Unrestricted          AVPYPQR   DIYETDYYR   QANAKKHIIQITVSVGGPS
                                 8 0 0     8 0 0       1 0 2

Table 11.14: Results of Simulated Annealing Run on Datasets from Applied BioSystems Using a Model Trained Without 1205

Chapter 12

Discussion

This chapter examines two issues in more detail: What happens when the algorithm is presented with spectra of longer peptides? How does dataset size affect performance?

12.1 A Study of Two Longer Peptides

Our algorithm encountered problems when handling spectra of longer peptides because longer sequence guesses frequently outscore the real sequence. Scoring problems indicate that the data is poor, or the model is inaccurate/insufficient, or both. Since we have more control over the model, we considered two possible ways to improve it: (1) by expanding the training set so that a model trained on more diverse data would be more encompassing in scope, and (2) by augmenting the fragmentation model so that it would be a truer reflection of the real process.

12.1.1 Enlargement of the Training Set

We included the three datasets of Section 11.3 in the training set, bringing the training set membership to a total of 8.
Parameter        0123   0119   0121   0220   0218   830.4  1237.5  1948.1
Pnovariant       0.72   0.72   0.74   0.72   0.75   0.73   0.64    0.69
tendencym18      1      1      1      1      1      1      1       1
tendencym17      2      2      2      2      2      2      2       2
tendencyp18      1      1      1      1      1      1      1       1
Prandom          0.12   0.12   0.12   0.11   0.11   0.12   0.15    0.12
Punobserved      0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0
Pax              0.14   0.15   0.14   0.15   0.14   0.16   0.19    0.18
Pb               0.44   0.44   0.44   0.44   0.44   0.45   0.52    0.36
non-breaks       2.6    2.6    2.6    2.5    2.8    2.6    3.0     2.7
M[*,H],M[*,P]    2.9    2.9    3.0    2.9    2.9    2.7    3.0     3.3

Table 12.1: Model Parameters for Leave One Out Cross-Validation

Dataset: Restricted Unrestricted 0123 10 0119 9 0121 10 0220 10 0218 10 830.4 10 1237.5 9 1948.1 4* 0 0 0 0 0 0 0 0 0 10 0 5 0 6 0 10 0 10 0 7 1 7 1* 8* 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0

Table 12.2: Results of Leave One Out Cross Validation with the Eight Datasets

Leave One Out Cross Validation for a Larger Training Set

The trained model parameters and the corresponding validation results are given in Tables 12.1 and 12.2. In all cases, except for the peptide of mass 1948.1, the predicted sequence was the correct sequence and the best scoring candidate found. In the case of 1948.1, the restricted search found PDTAQKHIIQITVWGIK (score 1859158.0) to be better scoring than IAEAQKHIIQITVWGIK (score 1860305.6), and the unrestricted search found IAEAQKHIIQITVSVGIQ instead. We will return to this momentarily.

Training on All Eight Datasets

The model was trained on all eight datasets, and the resulting parameters and performance are shown in Tables 12.3 and 12.4.
Parameter        Trained Value
Pnovariant       0.75
tendencym18      1
tendencym17      2
tendencyp18      1
Prandom          0.12
Punobserved      0.0
Pax              0.16
Pb               0.44
non-breaks       2.7
M[*,H],M[*,P]    2.9

Table 12.3: Trained Model Parameters: Overall Training Set Plus 830.4, 1237.5 and 1948.1

Restricted 0123 10 0 0 Unrestricted Original Dataset 0119 0121 0220 10 8 10 0218 10 M+H of Other Datasets 830.4 1237.5 1948.1 7 10 3* 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8 7 6 10 10 10 7 10* 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Table 12.4: Results of Running Simulated Annealing On All Datasets with the Model Parameters of Table 12.3.

In general, the performance of the search was fairly good, but this should not be surprising as the training set was also used as the validating set. Some difficulty was encountered, again with 1948.1 (the last column of Table 12.4). The restricted search found PSEAQKHIIQITVWGIQ, which outscored the real sequence due to better support for the first break position: masses 217.2, 242.2 and 1330.9 could be interpreted as a YB internal for SE, YAm18 for SEA and YBm18 for SEAQKHIIQITV respectively. The real sequence, on the other hand, had two supporters for p1 (masses 1329.0 and 1346.9, which are YAm18 for IEAQQHIIQITV and YA for IEAQQHIIQITV), but unfortunately, these peaks had multiple identities (Bm18 and Bion respectively), and these alternate identities were also generated and accounted for by the competing sequence. The competing sequence could thus account for more peaks, and without any additional support for the first position, the scoring favored the competition.

The unrestricted search found IAEAQKHIIQITVSVGIQ, which scored better than the real sequence because low mass peaks supported an SV in place of the W. These masses were 159, 242, 355 and 892, which can be interpreted as a YA for VS, a YAm18 for TVS, a YAm18 for ITVS, and a YB for HIIQITVS respectively. Even with an enlarged training set, the model experienced difficulty predicting the correct sequence for 1948.1.
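The SV-for-W substitution is hard to rule out on parent mass alone, because the two alternatives are nearly isobaric. The short check below is only an illustration and not part of the thesis software: it sums standard monoisotopic residue masses, and the names RESIDUE_MASS and residue_sum are made up for this sketch.

```python
# Standard monoisotopic residue masses (Da) for the residues discussed here.
RESIDUE_MASS = {
    "S": 87.03203,
    "V": 99.06841,
    "W": 186.07931,
    "Q": 128.05858,
    "K": 128.09496,
    "I": 113.08406,
    "L": 113.08406,
}


def residue_sum(subsequence):
    """Sum the residue masses of a short subsequence (hypothetical helper)."""
    return sum(RESIDUE_MASS[residue] for residue in subsequence)


# Mass differences between the competing interpretations.
sv_vs_w = abs(residue_sum("SV") - residue_sum("W"))
q_vs_k = abs(residue_sum("Q") - residue_sum("K"))
i_vs_l = abs(residue_sum("I") - residue_sum("L"))

print(f"SV vs W: {sv_vs_w:.5f} Da")
print(f"Q vs K:  {q_vs_k:.5f} Da")
print(f"I vs L:  {i_vs_l:.5f} Da")
```

The SV/W difference is about 0.02 Da and the Q/K difference about 0.04 Da, while I and L are identical in mass. This is consistent with the predictions above, which write I for L and sometimes exchange Q and K, so only peaks that support a break inside the substituted region can separate such candidates.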
One comment can be made about the performance of the other datasets. Despite the variance in parameter settings for the different models, the various searches did well for these other peptides (see Tables 12.4, 11.7 and 11.3). This suggested that there was a range of model parameter settings that were acceptable (at least for these peptides). Note, however, that even though training on two peptides allowed the model to perform well on peptides not used in the training set, one may still encounter some novel peptide whose fragmentation behavior is entirely distinct from anything our model could be trained with. More generally, suppose each peptide has a subspace of optimal model/search parameter settings. There is no guarantee that the intersection of all such subspaces for every possible peptide is non-empty, i.e. there may not exist a universal set of parameter values that will allow our algorithm to recognize all peptides. Or, if such a set existed, would the settings be so compromised that the resulting algorithm performs poorly for all datasets, and reliably for none?

12.1.2 Refining the Model to Improve the Scoring Function

When the search finds that the real sequence is not the best scoring sequence, one solution is to refine the model to improve the score of the real sequence. From the previous section, we noticed that the incorrect sequences were supported by YAm18 ions, which are forbidden by the following rule: a peptide containing S, T, D or E can lose water, and a peptide containing R, K or H can gain water, only if the fragment type in question is a Bion [Hin97]. Inclusion of this rule in the model restricts the allowed fragment types to {Am17, A, Bm18, Bm17, B, Bp18, Ym17, Y}, which could potentially reduce the amount of supporting evidence for incorrect sequences. In addition, the refined model specifies that variant formation is residue dependent, location dependent and now, fragment type dependent.
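One way to picture the refinement is as a filter over (ion series, variant) pairs. The sketch below is a simplification made for illustration, not the thesis implementation: it assumes the unrefined candidates are every combination of the series A, B, Y with the variants none, m17, m18 and p18, and it ignores the residue and location dependence as well as the internal YA and YB ions. The names SERIES, VARIANTS and is_allowed are hypothetical.

```python
from itertools import product

SERIES = ("A", "B", "Y")                # ion series considered in this sketch
VARIANTS = ("", "m17", "m18", "p18")    # no variant, -17, -18 (water loss), +18 (water gain)


def is_allowed(series, variant):
    """Water loss (m18) and water gain (p18) are permitted only on B ions."""
    if variant in ("m18", "p18"):
        return series == "B"
    return True


# Fragment types that survive the B-ion variant check.
allowed = {series + variant
           for series, variant in product(SERIES, VARIANTS)
           if is_allowed(series, variant)}

print(sorted(allowed))
```

Under these assumptions the filter reproduces the eight fragment types listed above and rejects water-loss variants on the A and Y series, which is the mechanism by which support for the competing sequences is meant to be reduced.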
While one additional rule may not be enough to remedy the problem in the end, it is an example of one of the many things that can be done to make the model a more realistic simulator of the rules by which Nature operates. Note also that we are now close to the edge of uncharted territory; for example, the literature does not address the following questions: If a peptide containing S, T, D or E can exhibit a Bm18 or Bp18 ion and if the loss/gain of water is side-chain specific (because it is residue specific), why does it not also express the Am18 and Ap18 ions? Is it dependent on some interaction between the side-chain and the C-terminus?

12.1.3 Performance of the Different Variations

Predictions made by the various models for the peptides of mass 1758.9 (length 14) and 1948.1 (length 17) are compiled in Tables 12.5 and 12.6 for convenience. Column 1 reports the performance of the AllBut1205 model, while Columns 2 and 3 both use a model trained on an enlarged training set (Table 12.3). The Bion variant check refinement was also incorporated into the model of the third column.

It is difficult to evaluate the effect of the different models on 1758.9. Again, the central problem is one of dataset size. An analyst with the intent of sequencing would have compiled a larger peak list if possible. Nevertheless, the predictions made by a restricted search seemed to worsen, but one could argue that the unrestricted predictions showed improvement, although in some cases, the unrestricted search preferred a longer sequence.
                          AllBut1205        Enlarged          Enlarged+Refined
Predicted: Restricted     DRVYIHPFIHIVYS    DRVYIHPFIHIVYS    DRVYIHPFIHIIHV
                          3 0 0             4 0 0             7 0 0
Predicted: Unrestricted   HIMIIIGCFTIYVHV   HIMIERTFTIYVHV    DYAIAIHPFITYVHV
                          5 0 0             1 0 0             1 0 0

Table 12.5: Results of Various Simulated Annealing Runs on Dataset with M+H 1758.9

                          AllBut1205            Enlarged             Enlarged+Refined
Predicted: Restricted     IAEAQKHIIQITVWGIQ     PSEAQKHIIQITVWGIQ    PSEAQKHIIQITVWGIK
                          4 0 0                 3 0 0                6 0 0
Predicted: Unrestricted   QANAKKHIIQITVSVGGPS   IAEAQKHIIQITVSVGIQ   IAEAQKHIIQITVSVGIQ
                          1 0 2                 10 0 0               8 0 0

Table 12.6: Results of Various Simulated Annealing Runs on Dataset with M+H 1948.1

A similar phenomenon occurred with the 1948.1 dataset: the restricted predictions seemed to move farther from the real sequence, while the unrestricted ones improved. An analysis of the scores showed that in the restricted case, the shift from IAEAQKHIIQITVWGIQ (score 1843917.7) to PSEAQKHIIQITVWGIQ (score 1843207.4) was due to the enlargement of the training set, not to model refinement. On the other hand, both enlargement and refinement independently and individually caused a shift from QANAKKHIIQITVSVGGPS to the closer sequence, IAEAQKHIIQITVSVGIQ.

So is it an issue of the model, the search or the data? Probably all three. Here we saw one small refinement to the model which showed some improvement in (unrestricted) prediction. The search did not find a better scoring sequence than the ones listed in these tables, but judging from the low performance numbers, it did not converge on them reliably. The input spectrum is the component that we have the least control over, but more (legitimate) experimental peaks are always a bonus, and are especially essential when dealing with spectra of longer peptides.

12.2 A Study of Dataset Size

Our algorithm consistently found the correct sequence for the 0220, 0218 and often 0123 datasets over a wide range of parameter settings, and we conjecture that the size of the datasets and the redundancy contained therein are the reasons for this.
Larger datasets constrain the space of possibilities and decrease the scores of candidate sequences that do not account for a large percentage of the dataset, since penalties are levied for unexplained experimentals. In contrast, small datasets are less constraining, so that it is easier to find a peptide guess that accounts for almost all experimental peaks, and perhaps easier to find one that may even score better than the actual sequence. In short, what matters is not only how many experimentals are accounted for, but also how many are unaccounted for.

We ran three experiments to study dataset size using the AllBut1205 model parameters of Table 11.1 (and the refined model of Section 12.1.2). They entailed running the sequence prediction algorithm in restricted mode while:

1. removing the lowest intensity peaks one at a time,
2. removing the highest intensity peaks one at a time, and
3. removing all noise peaks and adding them back in one at a time.

Certain behaviors and patterns can be seen in our datasets, but additional data is needed if more general conclusions are to be drawn.

12.2.1 Removing Low Intensity Peaks

The results of the first series of experiments are depicted in Figure 12-1. The x-axis is the number of peaks left in the dataset, and this decreases as more and more of the lowest intensity peaks are removed. The y-axis is the number of times, out of 10, the correct sequence was encountered as the minimum scoring sequence in a length-preserving search (i.e. it is the sum of the three numbers that we have been reporting separately, see Section 10.2).

[Figure 12-1: Removal of Lowest Intensity Peaks. Panels for datasets 0119, 0123, 0220 and 0218; x-axis: number of peaks left in the dataset.]
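The first experiment (removing the lowest intensity peaks one at a time) can be outlined as follows. This is only an illustrative sketch, not the thesis code: the function names are made up, and the count_hits callback is a hypothetical stand-in for running the restricted search 10 times and counting how often the correct sequence is the minimum scoring one.

```python
def remove_lowest_intensity(peaks, min_peaks=1):
    """Yield successively smaller peak lists, dropping the weakest peak each time.

    peaks: list of (mass, intensity) tuples.
    """
    # Sort strongest first so the weakest peak is always the last element.
    remaining = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    while remaining and len(remaining) >= min_peaks:
        yield list(remaining)
        remaining = remaining[:-1]


def success_curve(peaks, count_hits, runs=10):
    """Return (num_peaks_left, hits) pairs, where hits is out of `runs`."""
    return [(len(subset), count_hits(subset, runs))
            for subset in remove_lowest_intensity(peaks)]


def toy_count_hits(subset, runs):
    """Placeholder for the real search: succeeds while 2 or more peaks remain."""
    return runs if len(subset) >= 2 else 0


# Toy data, purely for demonstration.
peaks = [(100.1, 5.0), (200.2, 50.0), (300.3, 20.0)]
curve = success_curve(peaks, toy_count_hits)
print(curve)  # [(3, 10), (2, 10), (1, 0)]
```

Each pair corresponds to one point in a panel of Figure 12-1. The second experiment would follow the same pattern with the sort order reversed, so that the strongest peak is dropped first.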
At the start of the experiment, the peaks that are removed are largely noise peaks. As the removal progresses, real peaks begin to be removed, but the performance of the algorithm is unaffected because these peaks are extra and expendable; enough redundancy remains in the dataset so that the correct sequence continues to be found. When more critical peaks begin to be removed, the remnant is insufficient and incapable of supporting the correct sequence. Before this point is reached, the correct sequence usually abdicates its title of best scoring sequence to a sequence that bears some similarity to the correct one (for example, RPPGFSPRF in the case of 0220 and DRVYIHPHFI in the case of 0123). In Figure 12-1, this transition event is indicated by a shift from the use of "+" to "*".¹ Most datasets degenerated into this condition when there were fewer than 20 peaks left. Another way to view these results is that roughly 20 of the highest intensity peaks were sufficient for sequence recovery of these specific peptides.

Figure 12-1 sheds some light on the question of how much loss a dataset can tolerate before the real sequence is no longer optimal. Examination of the fragment types present in the dataset right before the "+"/"*" transition reveals the presence of several gaps; for example, the 15 peaks of 0123 have gaps at break positions 1, 7 and 9, and the 16 peaks of 0220 have gaps at 3, 4, 5 and 7. Interestingly enough, the correct sequence is still recoverable, in some cases because of internal ions (extended family ions) that fill in these gaps and help resolve ambiguous residue ordering in the complete sequence, and in other cases (the gap at position 7 of 0123), probably because of the limited number of possibilities from constraints imposed by the fragment types that are present.
Fundamental graph approaches, which often do not make use of internal ion information for candidate generation, rely on other fragment types and would have difficulty enumerating the real sequence when so many gaps are present. Note also that the 15 peak 0220 dataset has a huge gap which neither a dipeptide nor a tripeptide edge in a fundamental graph can rescue.

¹ Note, this is for the restricted case. In the case of 0123 and 0121, the unrestricted search produced a better scoring alternate sequence at dataset size 14 and 16 respectively.

12.2.2 Removing High Intensity Peaks

While a large number of low intensity peaks can be removed before the real sequence is no longer optimal, only a small number of the highest intensity peaks can be removed before the search begins to falter, and when it does, it often fails dramatically (see Figure 12-2). This supports the idea that high intensity peaks tend to be more meaningful than lower intensity peaks.

12.2.3 Removing Noise Peaks

Figure 12-3 contains the results of the third series of experiments, which examines how performance is affected by noise content. In this experiment, all noise peaks are first removed from a dataset, and then they are gradually added back in, one at a time, starting with the most intense. The performance of the noiseless angiotensin and bradykinin datasets, previously examined in Table 10.4, did not differ noticeably from the performance of the entire datasets (Table 10.1), so one would not expect much fluctuation in these plots. In all cases, the correct sequence remained the optimal scoring one, and the datasets were rather resilient to the addition of noise. The effect of noise peaks is less severe here than in other approaches, where noise peaks could greatly contribute to the size of a fundamental graph, and to the number of incorrect sequence guesses.
It may be worthwhile to devise a means for assaying just how much noise a dataset can tolerate before becoming completely ineffective; this would require some controlled means for deciding which noise masses are reasonable to add and what their intensities should be.

[Figure 12-2: Removal of Highest Intensity Peaks. Panels for datasets 0123, 0119, 0121, 0220 and 0218; x-axis: number of peaks left in the dataset.]

[Figure 12-3: Removal and Subsequent Addition of Noise Peaks. Panels for datasets 0123, 0119, 0218, 0121 and 0220; x-axis: number of peaks in the dataset.]

12.3 Summary

An expanded set of training data and an additional rule for our model led to some improvement in prediction, but it became apparent that the next challenge to address is the effect of longer peptides. With regard to dataset size, it appears that most low intensity peaks are either noise or redundant legitimate peaks, while most high intensity peaks are more meaningful and relevant to sequencing. An extended family member for each break position is necessary for correct sequencing.

Chapter 13

Conclusions

In this thesis, a new method for de novo sequencing was presented. Like other approaches in the literature, we presented a means for traversing the space of possible guesses, and a means for scoring a sequence guess against an input spectrum.
Exploration of the search space was implemented with simulated annealing, an efficient technique for locating the sequence most likely to have produced the observed data. In order to compute this likelihood, a probabilistic model for protein fragmentation was proposed, and a scoring function was implemented based on a probability distribution of fragment masses predicted by this model.

Since our approach is a global sequence-to-spectrum strategy, complete sequence guesses are generated in a manner that is independent of the spectrum, and because of this, it is less vulnerable to the effects of noise and gaps. With fundamental graph approaches, noise peaks can be mistakenly interpreted as supporting peaks; such false positives can elevate the standing of competing sequence guesses. Candidate generation in sequence-to-spectrum approaches is not as affected by gaps in the data: there is no fear of pruning away the real sequence due to under-representation, and because guess generation is not dependent on the existence of paths through a graph or the presence of supporting experimentals, these approaches can generate sequences that fundamental graph approaches will not.

Our scoring function is based upon a probabilistic model of fragmentation. Other approaches in the literature make use of fragment type probabilities, and to our knowledge, only [DAC+99b, DAC+99a] have tried using a more formal framework. Fragment probabilities appearing in the literature are often either empirically or arbitrarily determined, and are independent of each other; e.g. an Aion appears with some probability, a Bion with another and so on, and these probabilities do not sum to 1 [DAC+99b, FdCGS+99]. Our probabilities are empirically determined by fitting parameters to training data, and they are not independent: a single molecule can fragment to produce either an Aion or some other ion, but the sum of the probabilities of all possible outcomes is 1.
Because of this, our scoring function not only awards for matches (which all other approaches do), but naturally penalizes for unmatched experimentals (which some other approaches do) and even for unmatched theoreticals. The PMF nicely handles multiple identity peaks: when there are several legitimate explanations for a single peak, the probability of this peak is the sum of the probabilities of each identity. Peak intensities are also nicely accounted for by viewing the input spectrum as a histogram of outcomes of repeated trials.

13.1 Room for Improvement

We tried to keep our approach as simple as possible, but there are extensions that can be made to the model, search and data to improve the algorithm.

13.1.1 Improvements in the Data

Advances in Mass Spectrometry

Progress is continually made in mass spectrometry. With increased accuracy and decreased mass fluctuation, spectra with cleaner and clearer peaks would enable better peak identification and labeling. An increase in range capability would allow for the analysis of larger molecules.

Advances in Peptide Preparation and Data Acquisition

Alternate methods for obtaining spectra may yield better spectra. There are many variations to the recipe for generating spectra: use of different ionization methods, matrices, peptide concentrations, laser intensities and chemical treatments of the analyte prior to spectra acquisition may lead to data that is more bountiful in relevant peaks. The choice of method and technique affects the types of peaks that are obtained. Future techniques may be developed that increase the intensity and presence of helpful peaks, while minimizing noise and other peaks which confuse and misdirect sequence reconstruction.
13.1.2 Improvements in the Model

Advances in Understanding Peptide Fragmentation

Since an effective correlation function requires knowledge of the fragmentation process [SZK95], as more of the fragmentation process is elucidated, the model can be updated to better simulate peptide fragmentation, resulting in more accurate PMFs and improved scores. There are quite a number of rules that have not been incorporated into our model. For example, break probabilities depend on a combination of other factors such as break position and fragmentation tendencies/proton affinities of the residues involved. Certain dipeptides such as Arg-Lys and His-His [OSTV95], and the amide bonds of the dipeptides Asp-Pro, Asp-X, Glu-X [KCKS96], exhibit certain affinities that could be captured in the matrix. But there are also quite a number of rules that are not known. Are there other fragmentation events that occur? What are the relevant factors? Are parameters constant or subject to some distribution? The better we can approximate Nature's chemical/physical rules, the better the model will be.

Training and Validating with More Data

A larger training set, more representative of different types of peptides, would be essential for proper training of the model. One may find that additional parameters, as well as finer granularity in parameter values, are helpful. A larger validation set would also help demonstrate the effectiveness of our approach.

Handling Modifications and Different Terminal Groups

Our approach is currently not equipped to handle residue modifications or other terminal groups. Known modifications (for a list of some of them, see Table III of [Yat96]) can be handled by considering the modified residue simply as another amino acid. A walk through the peptide search space would then include peptides involving these modifications, and any relevant fragmentation rules would need to be programmed into the model and PMF generation routines.
Different terminal groups can be handled by parameterizing the mass of the N- and C-terminal groups in all mass calculations.

13.1.3 Improvements in the Search

Aside from further optimization of the searching parameters, there are other aspects of the search that can be improved. For example, it may be possible to implement an efficient incremental scoring method so that as the search makes a move from one sequence to another, the entire PMF for the new sequence does not have to be recomputed from scratch.

Move Strategies

The three different sequence moves currently occur with equal probability; allowing them to be selected according to some non-uniform distribution may be beneficial. To save time, our implementation precomputes all dipeptides and tripeptides, and stores them by mass. The substitute move uses these as building blocks (instead of building a replacement sequence one residue at a time). Dr. Ting Chen suggested going one step further and precomputing all amino acid combinations for every possible sequence mass up to the parent ion. This basically amounts to a "complete fundamental graph" where all nodes have outgoing edges for every possible residue, so that the paths from the base node to a particular fundamental node with mass n represent all possible sequences of mass n. Choosing one of these paths randomly would be akin to finding a sequence of mass n when making an unrestricted substitute move.

The length of the subsequence that a move affects can also be made a function of the temperature so that at lower temperatures, shorter subsequences are changed.

An Educated Initial Sequence Guess

One might imagine a situation where the initial guess is not randomly chosen, but is intelligently selected based on some relevant information. For example, one might use the best scoring sequence guess obtained from some fundamental graph approach.
Or one might consult the immonium ions or some other feature of the spectrum, and specify a pool of possible residues to draw from when constructing this guess. With an intelligent initial guess, the search is "much farther along" than a search that starts with a random sequence, and could be made faster by starting at a lower initial temperature. Too high an initial temperature would cause the algorithm to make a move to a random sequence, throwing away any benefits afforded by an intelligent non-random initial guess. (On the other hand, too low an initial temperature is also undesirable since it impedes the ability of the algorithm to explore the search space.) Such an optimization may be possible depending on the structure of the scoring function and the specific application [Mit00], and there has been some study of initial temperature settings [RKW88], but no theoretical result of a general nature [Mit00].

Prediction of Correct Length

With restricted searches, we partitioned the large peptide sequence space into smaller disjoint subspaces and used length-preserving moves. This was an effort to see if the global optimum could be found more reliably by reducing the size of the search space without changing scheduling parameters. Since the range of sizes is large (from v+H to M+H), if there were some way to guess the length of the real sequence, the savings in search time could potentially be large. Several of the approaches we tried in Appendix E enumerated promising paths through a graph, sorted by length. It may be possible to use this information as a rudimentary filter for determining which lengths are worth searching.

13.2 Looking Towards the Future: Longer Peptides

Most of the spectra used in this thesis were acquired from short peptides.
For certain applications, this may be sufficient: assuming each residue is equally likely to appear, and assuming a protease cleaves after a single particular amino acid, each position is a cleavage site with probability 1/20, so the expected length of a proteolytic product is 20 residues. For proteases with specificity pockets for several side-chains, the expected fragment length is less than 10 residues, well within the range of our results. However, it is useful to consider the effects of longer peptides on sequence guess evaluation and on peptide space searching.

The experimental spectrum of a longer peptide would exhibit a larger range of mass values, and if N, the total number of trials, is kept constant, peak intensities decrease overall, as there are fewer ions available for covering the entire spectrum of possible fragments. Low masses would tend to have higher intensities than high masses, as a longer peptide sequence provides more opportunities for arriving at lower mass fragments. These features are also mimicked in the theoretical spectrum of the PMF: since the fragment probabilities must sum to 1, the individual probabilities are likely to be lower than those of shorter peptides, because they are distributed over a greater range of mass possibilities. Lower masses may exhibit higher probabilities because there may be multiple ways to arrive at a small mass value, so the probabilities aggregate.

What of their effect on the search? Unrestricted searches were not as successful as their restricted counterparts because a much larger space had to be traversed, and without a comparable increase in search time, the correct answer was less likely to be found. Larger peptides only exacerbate the problem, and if it were only a problem of a vast search space, one could simply adopt an extremely slow cooling schedule. But as we saw with the AIEAQQHLLQLTVWGIK peptide from Chapter 11, other problems develop.
Namely, alternate sequences begin to score better, especially longer ones, as they tend to be able to account for more peaks and, currently, no penalty is imposed based on guess length. Predictions that resemble the real sequence but substitute one short subsequence for another (e.g. SV for W) may be common, because of the increased chance that some peak happens to support a break position within the substitution. Solutions that might help are a model that uses some sort of minimum description length-based measure and, possibly, input datasets with large signal-to-noise ratios (so that even if alternate interpretations exist, they do not outscore the correct one).

13.2.1 Effect of Isotopes

With longer peptides, the presence of isotopes will also affect mass computations. This effect has been ignored so far in our discussions because we have been working with short peptides. Let I(n) represent the isotope distribution of a molecule: the probability that the molecule contains n extra neutrons due to isotopes. Let the "isotope contribution" be the value of n for which I achieves its maximum. For short peptides, the non-isotope form (no extra neutrons) is most probable, and hence the isotope contribution is 0: the mass of the peptide is exactly the mass of its constituent atoms. One could avoid isotope issues by limiting sequencing to short peptides only.

For longer peptides, this no longer holds. With enough atoms, the occurrence of an isotope is more likely, and the isotope contribution is non-zero. Consequently, some adjustment needs to be made for the corresponding shift in mass. Gras et al. account for this effect in their peak detection algorithms [GMG+99], but otherwise, it seems this has been largely ignored in the literature. The peptides used in this thesis had few enough nucleons¹ that it was safe to disregard the effect of isotopes.
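The isotope distribution I(n) defined above can be estimated by convolving the isotope distributions of the individual atoms. The following is a minimal sketch under stated assumptions, not code from this thesis: the abundance values are rounded reference figures and should be treated as assumptions, only C, H, N and O are covered, the distribution is truncated at max_n extra neutrons, and all names (ABUNDANCE, isotope_distribution, polyglycine, isotope_contribution) are hypothetical.

```python
# Approximate natural abundances, indexed by the number of extra neutrons
# relative to the lightest isotope. Rounded reference values (assumption).
ABUNDANCE = {
    "C": [0.9893, 0.0107],            # 12C, 13C
    "H": [0.999885, 0.000115],        # 1H, 2H
    "N": [0.99636, 0.00364],          # 14N, 15N
    "O": [0.99757, 0.00038, 0.00205], # 16O, 17O, 18O
}
GLYCINE_RESIDUE = {"C": 2, "H": 3, "N": 1, "O": 1}  # C2H3NO, nominal mass 57
WATER = {"H": 2, "O": 1}                            # terminal H and OH


def convolve(a, b, max_n):
    """Discrete convolution of two distributions, truncated at max_n."""
    out = [0.0] * min(len(a) + len(b) - 1, max_n + 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < len(out):
                out[i + j] += x * y
    return out


def isotope_distribution(atom_counts, max_n=12):
    """I(n) for a molecule given as {element: count}, for n = 0..max_n."""
    dist = [1.0]
    for element, count in atom_counts.items():
        for _ in range(count):
            dist = convolve(dist, ABUNDANCE[element], max_n)
    return dist


def polyglycine(n):
    """Atom counts of a peptide made of n glycine residues plus H and OH."""
    atoms = dict(WATER)
    for element, count in GLYCINE_RESIDUE.items():
        atoms[element] = atoms.get(element, 0) + n * count
    return atoms


def isotope_contribution(dist):
    """The value of n at which I(n) is largest."""
    return max(range(len(dist)), key=dist.__getitem__)
```

With these assumed abundances, the contribution for polyglycine moves from 0 to 1 at a length close to the 38 glycines quoted in the footnote to this section; the exact crossover is sensitive to the abundance values used.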
However, our peptides are at the fringes of the mass region where the appearance of an isotope begins to become more probable. Mass calculations in this region and beyond should include a mass offset equal to the isotope contribution to account for the most likely number of extra neutrons. Our estimate of average fusion loss from Section 7.1.6, currently weighted by amino acid frequency, may also have to be weighted by isotopes of these residues as well.

¹ For a molecule, its isotopic distribution I can be estimated as follows: take the (known) isotopic distribution of each atom and convolve them over every instance of each atom in the peptide. How many nucleons does it take to shift the isotope contribution from 0 to 1? We approximate this as follows: (1) for each residue r, compute n_r, the number of instances of the residue necessary for I(1) to exceed I(0), the non-isotope probability (e.g. with 38 glycines, one expects to find the mass to be 2185 rather than 2184 = 38*57 + m(H) + m(OH)); (2) the isotope contribution per nucleon of the residue is then 1/(n_r (N_p + N_n)), where N_x denotes the number of x particles (protons p, neutrons n) in residue r; (3) the weighted sum of these is computed, weighted by residue frequency; and finally, (4) the inverse is taken to arrive at the number of nucleons necessary to achieve a weighted isotope contribution of 1. It takes about 1867 nucleons to shift the isotope contribution from 0 to 1.

13.3 Summary

Current de novo sequencing approaches exhibit limited success in solving the sequencing puzzle. We have proposed and designed a de novo sequencing algorithm with the following properties in mind:

Performance: Our Java implementation currently takes about 10 minutes to issue a prediction for the angiotensin datasets (using the restricted search with parameter settings from Section 8.6) on a dual processor Pentium III 500MHz machine running Linux. The running time varies with machine load, search parameters and dataset size, and increases with peptide length.
Some simulations were also run on two other machines: a 400MHz Pentium II PC running Linux and a 550MHz Pentium III PC running Linux. An optimized C implementation should run faster.

Robustness: Because candidate generation is independent of the spectrum, sequence-to-spectrum approaches are slightly more tolerant of noise, and of gaps in particular. Prediction is still possible if internal ions serve as extended family members at gap positions.

Scalability: Since the PMF computation is quadratic in the length of the peptide (there are O(n^2) possible fragments), and the computation of the score is linear in the size of the dataset, the complexity of the scoring module is polynomial in the length of the peptide. When the model is good, the real sequence scores optimally, but it may be hidden amongst competing local extrema that are all embedded within a vast search space whose size is exponential in the length of the peptide. This is exactly the type of optimization problem suited to simulated annealing, which performs an efficient walk of the space according to a set of search parameters. Currently, two of the parameters, nlimit and nover, are directly proportional to the parent mass. We have not adequately studied the effects of longer peptides; other parameters may need to depend on the parent mass as well. In our investigations, performance on longer peptides seemed to suffer because of the model rather than the search.

Reliability: Our approach finds the answer that maximizes the probability of the observed spectrum, and it is built on a simple probabilistic framework for reasoning about the likelihoods of sequence guesses.

Comprehensiveness: As we saw earlier in Section 7.2, some scoring functions do not take into account all possible supporting fragment types, namely the internal ions. There is no aspect of our approach that only makes use of a subset of the input spectrum.
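The quadratic fragment count mentioned under Scalability, and the inclusion of internal fragments mentioned under Comprehensiveness, can be illustrated by enumerating every contiguous subsequence of a candidate. This is only a sketch: the function name and the "prefix", "suffix", "internal" and "parent" labels are hypothetical, and the ion variants and probabilities that the PMF attaches to each fragment are not modelled.

```python
def contiguous_fragments(sequence):
    """Return (kind, fragment) for every contiguous subsequence.

    A sequence of length n has n(n+1)/2 such fragments, hence O(n^2).
    """
    n = len(sequence)
    fragments = []
    for start in range(n):
        for end in range(start + 1, n + 1):
            if start == 0 and end == n:
                kind = "parent"      # the intact sequence
            elif start == 0:
                kind = "prefix"      # contains the N-terminus
            elif end == n:
                kind = "suffix"      # contains the C-terminus
            else:
                kind = "internal"    # contains neither terminus
            fragments.append((kind, sequence[start:end]))
    return fragments
```

For a 10-residue sequence such as angiotensin I (DRVYIHPFHL) this yields 55 fragments, most of them internal, which is why a scoring function that ignores internal ions discards a large share of the possible evidence.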
By virtue of the scoring function, the entire input spectrum (peaks of all heights and of all hypothetical identities) is taken into account.

We have described an algorithm that takes as input a tandem mass spectrum and the parent mass and, given a finite set of residue building blocks, predicts the parent sequence using a model that makes certain assumptions about the fragmentation process. This approach may also be applicable to the general linear sequencing problem for synthetic polymers and other biopolymers such as DNA, by identifying the set of building blocks, capturing the MALDI-PSD fragmentation patterns in a relevant model, and implementing an appropriate simulated annealing search specification.

Nature obeys certain fragmentation rules and we have endeavored to capture its rules in a simple probabilistic model. If this model is good and the data is sufficient, then the real sequence scores optimally (Appendix F) and a simulated annealing search under the right conditions will find it. Investing in improvements in the model is the more immediate need and the most promising next goal. While there is still much room for improvement, our approach is one step in a direction that deserves further development and study.

Appendix A Amino Acid Information

Figure A-1: Basic Residue Structure: The side-chain R of a residue hangs off of the α-carbon. An amide bond joins the α-carboxyl group of one residue to the α-amino group of an adjoining residue, polymerizing multiple basic residues into a peptide.
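As a usage sketch for the monoisotopic residue masses of Table A.1, the singly protonated mass (M+H) of a peptide can be computed by summing its residue masses and adding the terminal H and OH plus one charge-carrying proton. The constants WATER and PROTON and the function name mh_mass are assumptions introduced for this illustration; in particular, whether the thesis adds a proton mass or a hydrogen atom mass for the charge is not stated in this appendix.

```python
# Monoisotopic residue masses, as listed in Table A.1.
MONOISOTOPIC = {
    "A": 71.03712, "C": 103.00919, "D": 115.02695, "E": 129.04260,
    "F": 147.06842, "G": 57.02147, "H": 137.05891, "I": 113.08407,
    "K": 128.09497, "L": 113.08407, "M": 131.04049, "N": 114.04293,
    "P": 97.05277, "Q": 128.05858, "R": 156.10112, "S": 87.03203,
    "T": 101.04768, "V": 99.06842, "W": 186.07932, "Y": 163.06333,
}
WATER = 18.01056   # H on the N-terminus plus OH on the C-terminus
PROTON = 1.00728   # charge carrier (assumption: proton, not hydrogen atom)


def mh_mass(sequence):
    """Monoisotopic M+H of an unmodified linear peptide."""
    return sum(MONOISOTOPIC[r] for r in sequence) + WATER + PROTON


if __name__ == "__main__":
    for name, seq in [("angiotensin I", "DRVYIHPFHL"),
                      ("bradykinin", "RPPGFSPFR")]:
        print(f"{name}: {mh_mass(seq):.4f}")
```

Parameterizing WATER (or splitting it into separate N- and C-terminal masses) is one simple way to accommodate different terminal groups.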
Residue   Frequency   Monoisotopic Mass
A         .076         71.03712
C         .0189       103.00919
D         .0521       115.02695
E         .0632       129.04260
F         .0397       147.06842
G         .0719        57.02147
H         .0228       137.05891
I         .0529       113.08407
K         .0581       128.09497
L         .0917       113.08407
M         .0229       131.04049
N         .0436       114.04293
P         .052         97.05277
Q         .0417       128.05858
R         .0523       156.10112
S         .0715        87.03203
T         .0587       101.04768
V         .0649        99.06842
W         .0131       186.07932
Y         .0321       163.06333

Table A.1: Basic Residues, their Frequencies and Masses

Appendix B Experimental Methods

Peptide samples for angiotensin I and bradykinin were prepared and data was acquired in-house. This appendix describes the process used to acquire PSD spectra for angiotensin I. The procedure is exactly the same for bradykinin, except that appropriate concentration modifications were made to account for the difference in mass.

B.0.1 Sample Preparation

Materials

• matrix: α-cyano-4-hydroxy-cinnamic acid, C10H7NO3, Sigma Chemical Company (St Louis, MO, USA) [28166-41-8],
• solvent: 70% CH3CN in Q-H2O + 0.1% TFA (100 µl TFA + 70 ml acetonitrile + enough Q-H2O to make 100 ml),
• peptide analyte: angiotensin I, Sigma Chemical Company (St Louis, MO, USA) [70937-97-2],
• 1.5 ml microfuge tubes,
• Mettler weigher and spatula,
• sample spotting gold plate (#5-2204-00-0002 Sample Plate/Polished surface).

Peptide samples were used directly out of their commercial packaging without any purification or further processing, and because of the mass range of these peptides, α-cyano-4-hydroxy-cinnamic acid was chosen as the matrix (works best for 500-5000 Da [BCC91]).

Preparation of Matrix and Analyte

Using a Mettler weigher, 0.0148 g of desiccated α-cyano-4-hydroxy-cinnamic acid was weighed into a 1.5 ml microfuge tube, then 1.48 ml of (70% CH3CN in Q-H2O + 0.1% TFA) was added to produce a matrix concentration of 10 mg/ml. A 12.97 mg/ml solution of angiotensin was made using 0.0011 g of desiccated angiotensin and 84.8 µl of (70% CH3CN in Q-H2O + 0.1% TFA).
Since 1 mole of angiotensin weighs 1296.5 g, this sample was approximately 10 nmol/µl. To obtain the desired 10 pmol/µl concentration for MALDI-PSD, we performed three serial dilutions, each time diluting 10 µl of the sample with 90 µl of (70% CH3CN in Q-H2O + 0.1% TFA). Finally, we combined 2 µl of angiotensin (10 pmol/µl) with 2 µl of matrix, so that the final concentration of the peptide sample was 5 µM. A 1 µl aliquot of this was spotted onto the sample plate, air dried at ambient temperature and placed into the source area of the mass spectrometer for spectral acquisition.

B.0.2 Data Collection

Spectral data was acquired on a PerSeptive Biosystems Voyager Elite, later upgraded to a Voyager DE STR, at the MIT Whitehead Institute. An N2 laser produces a 337.1 nm wavelength output at a pulse rate of 3.01 Hz, and a 4 mm² view of the sample plate and laser illumination area can be seen on a color video monitor (Hitachi, model CT1396VM). A 1129 mm long flight tube directs ionized fragments to a detector, and a digitizer scope, model TDS520B, displays the growing spectrum of recorded collisions with the detector. This spectrum can then be downloaded to an IBM compatible computer running GRAMS software.

At the start of each PSD acquisition run, a calibration file was created by performing a one point calibration on the parent ion¹ with the mass spectrometer in PSD mode (mirror ratio of 1.00, low mass gate off, and timed ion selector on). This calibration file is used for the collection of snapshots of small overlapping mass ranges. These stitches are linearly overlaid by the spectrometer computer software to arrive at the final desired PSD spectrum. Piecemeal spectral concatenation is needed because a particular mirror ratio is only capable of properly focussing a particular range of mass fragments. For this reason, a complete PSD run consisted of collecting stitch data for several mirror ratios, e.g.
1.0, 0.9126, 0.6049, 0.4125, 0.2738, 0.1975, 0.1213, 0.0859, 0.0674 and 0.0566. Once the PSD composite is successfully created, the spectrum is displayed on the computer monitor, and peaks may be selected for inclusion in the experimental dataset to be used as input to a sequencing algorithm.

¹ When obtaining spectra for a peptide other than angiotensin, angiotensin was used as an internal calibrant.

Appendix C Experimental Data

Four PSD datasets for angiotensin and two for bradykinin were collected. A cutoff intensity was determined by visual inspection, and all peaks with intensities greater than this threshold were selected. Additional peaks of lower intensity were also selected if they were well-defined and sharp. Each dataset consists of these selected peaks, which are listed below, along with each peak's intensity, checkpoint, distance from checkpoint and identity. Some statistics are also compiled for each of the datasets and these appear in Table C.7.

Note that in the 0218 dataset, when the experimentals are converted to checkpoints, the intensities of two peaks, masses 71.1355 and 71.3207, are added together since they both resolve to the same checkpoint. Examination of the spectra reveals that there are not two clean peaks at these masses; rather, there is a single jagged peak with ridges, so the duplication is likely to be the fault of the labelling software (in that it incorrectly interpreted a ridge as a peak, and may have done so imprecisely) and/or the operator (in that he/she elected to have the extra ridge labelled). A similar situation occurs for peaks 42.6474 and 43.0738 in dataset 0220.

Entries in these tables (fragment identities in the data tables and the statistics of Table C.7) are based on a model of fragmentation (from Chapter 8) that includes the refinement of Section 12.1.2. Table entries enclosed in parentheses indicate the identities/values that would result when using a model without this refinement.
Differences occur because in the refined model, the Am18, Ap18, Ym18 and Yp18 variant ions are not allowed. 168 C.1 Dataset Peaks and Peak Identities Experimental Mass(E) Intensity Checkpoint(C) 39.0087 375 239 39.0206 50.0265 70.1586 72.1756 86.2138 574 70.0371 72.0382 86.0456 110.2028 113.2269 115.2049 5914 253 49.9061 136.1618 138.1603 156.1609 166.2488 207.3252 212.2403 213.2452 214.2558 217.2618 223.3004 230.2403 235.2327 237.2398 245.2347 249.3441 251.2606 255.2284 256.2551 257.2591 263.1501 269.1432 270.1365 272.0874 279.1152 285.1174 __________________________________ 512 480 297 941 464 502 362 311 293 268 154 449 563 812 1523 210 261 302 692 3889 110.0583 113.0599 115.0610 136.0721 212.1124 213.1130 214.1135 217.1151 223.1183 230.1220 -0.1607 -0.2153 -0.1278 -0.1321 -0.1422 -0.1466 -0.1820 -0.1182 235.1247 -0.1079 237.1257 -0.1140 -0.1046 -0.2119 -0.1274 -0.0930 -0.1192 -0.1227 -0.0105 -4.5465E-4 0.0067 0.0569 0.0328 0.0338 245.1300 249.1321 251.1331 255.1353 256.1358 257.1363 2558 272.1443 279.1480 I _______________________ .11 -0.0870 -0.0781 138.0732 156.0827 166.0880 207.1098 747 540 1789 8976 1654 866 1635 Delta Mass(C-E) 0.0119 0.1204 -0.1214 -0.1373 -0.1681 -0.1444 -0.1669 -0.1438 -0.0896 263.1395 269.1427 270.1432 285.1512 ____________________________________ 169 1 Identity M:YA P, (P:Aionml8 D) M:YA V M:YA I M:YA H M:YA Y M:YB H M:YBp18 H I:YA HP I:YA PF I:YA IH I:YA VY, I:YB HP I:YB PF I:YA YI I:YB IH P:Bionml7 DR I:YB RV I:YA FH I:YB VY S:Yion HL, I:YBp18 IH P:Bion DR I:YB FH 303.1342 313.0369 326.1207 329.0269 337.0419 343.0863 354.035 364.3554 371.3498 382.3082 400.3986 414.369 416.2967 426.276 489.2467 506.2331 513.2342 517.196 798 1085 1346 1149 1044 1230 12208 1542 847 2630 1237 971 1824 1226 1175 2377 4542 4430 303.1607 313.1660 326.1729 329.1745 337.1788 343.1820 354.1878 364.1931 371.1968 382.2027 400.2122 414.2196 416.2207 426.2260 489.2594 506.2685 513.2722 517.2743 0.0265 0.1291 0.0522 0.1476 0.1369 0.0957 0.1528 -0.1622 -0.1529 
-0.1054 -0.1863 -0.1493 -0.0759 -0.0499 0.0127 0.0354 0.0380 0.0783 527.5279 534.2053 548.9729 730 1947 520 527.2796 534.2833 549.2913 -0.2482 0.0780 0.3184 P:Bion DRVY 619.6552 632.4269 641.4564 3029 1537 665 619.3284 632.3353 641.3401 -0.3267 -0.0915 -0.1162 P:Aion DRVYI I:YB IHPFH I:YA RVYIH 647.5013 2216 647.3433 -0.1579 P:Bion DRVYI 650.4275 1768 650.3449 -0.0825 S:Yion HPFHL, I:YBp18 IHPFH 654.7842 677 654.3470 -0.4371 739.6634 994 739.3921 -0.2712 P:Aionm17 DRVYIH 756.3962 4053 756.4011 0.0049 P:Aion DRVYIH 767.3647 1689 767.4070 0.0423 P:Bionm17 DRVYIH, I:YA YIHPFH 784.3812 3857 784.4160 0.0348 P:Bion DRVYIH 1137.8063 1166.8656 1181.8581 332 372 601 1137.6033 1166.6187 1181.6266 -0.2029 -0.2468 -0.2314 P:Aion DRVYIHPFH 1183.0156 612 1182.6272 -0.3883 1183.7039 1279.7142 1296.7269 818 891 16750 1183.6277 1279.6787 1296.6877 -0.0761 -0.0354 -0.0391 1300.7974 896 1300.6898 -0.1075 1311.5771 624 1311.6956 0.1185 I:YBp18 FH P:Aionml7 DRV P:Aion DRV P:Bionml7 DRV, I:YA HPF P:Bion DRV I:YB HPF I:YBp18 PFH I:YB YIH S:Yion FHL P:Aionm17 DRVY P:Aion DRVY S:Yion PFHL, I:YB VYIH P:Bionm17 DRVY S:Yion RVYIHPFHL P:Bionp18 DRVYIHPFH M+Hm17 DRVYIHPFHL M+H DRVYIHPFHL Table C.1: Angiotensin Dataset: data/012360c/unprependedpeaks 170 Post Source Decay Analysis File # 1=C:\MATSU\TLENG\012360C\PSDiPOOI.MSA Stitch Factors 0,600 - 1.010 -50000 L -100000. . -150000 -200000 -- - ...... ......... . 560 1000 15000-i 10000- 5000-- 0 500 Mass (m/z) Figure C-1: PSD for 0123 Angiotensin 171 100o .. .. 
Experimental Mass(E) 66.597 70.141 72.1589 86.1969 110.2144 112.2346 113.2549 115.1613 136.182 138.1471 156.199 166.175 212.1768 223.2092 230.1264 235.1531 251.1221 255.0997 263.1244 269.1443 272.1034 285.0768 326.169 343.0809 354.063 364.1417 382.1243 416.2279 426.272 489.2264 506.172 513.2287 517.1144 534.3137 619.6771 632.516 647.818 650.583 Intensity Checkpoint(C) 278 627 541 67.0355 0.4385 70.0371 72.0382 86.0456 -0.1038 -0.1206 -0.1512 -0.1560 -0.1751 -0.1949 -0.1002 -0.1098 516 6525 296 221 164 462 191 226 320 278 366 110.0583 112.0594 113.0599 115.0610 136.0721 138.0732 156.0827 166.0880 521 836 433 2160 212.1124 223.1183 230.1220 235.1247 251.1331 255.1353 452 263.1395 2359 860 269.1427 272.1443 285.1512 326.1729 343.1820 676 562 578 3519 491 744 440 487 752 2152 416.2207 1616 721 995 740 720 730 M:YA P, (P:Aionml8 D) M:YA V M:YA I M:YA H M:YAm17 R M:YA Y M:YB H -0.0738 M:YBp18 H I:YA IH 0.0039 0.1011 0.1248 I:YA VY, I:YB HP I:YB IH P:Bionml7 DR I:YB VY S:Yion HL, I:YBp18 IH P:Bion DR I:YB FH P:Aionml7 DRV P:Aion DRV P:Bionml7 DRV, I:YA HPF 0.0514 0.0784 -0.0071 I:YB HPF S:Yion FHL 0.0744 364.1931 382.2027 Identity -0.1162 -0.0869 -0.0643 -0.0908 -0.0043 -0.0283 0.0110 0.0356 0.0151 -0.0015 0.0409 354.1878 849 Delta Mass(C-E) 426.2260 489.2594 506.2685 513.2722 517.2743 -0.0459 534.2833 619.3284 -0.0303 0.0330 0.0965 0.0435 0.1599 -0.3486 -0.1806 -0.4746 -0.2380 632.3353 647.3433 650.3449 L-. 
172 P:Aionml7 DRVY P:Aion DRVY S:Yion PFHL, I:YB VYIH P:Bionml7 DRVY P:Bion DRVY P:Aion DRVYI I:YB IHPFH P:Bion DRVYI S:Yion HPFHL, I:YBpl8 IHPF H 741.322 540 741.3932 0.0712 756.262 767.648 784.265 1570 723 1820 756.4011 767.4070 784.4160 0.1391 -0.2409 0.1510 1133.7786 302 1133.6012 -0.1773 1182.7143 472 1182.6272 -0.0870 1184.6494 1253.771 310 302 1184.6282 1253.6649 -0.0211 -0.1060 1296.7507 7326 1296.6877 -0.0629 1311.8763 364 1311.6956 -0.1806 P:Aion DRVYIH P:Bionml7 DRVYIH, I:YA YIHPFH P:Bion DRVYIH M+H DRVYIHPFHL Table C.2: Angiotensin Dataset: data/011959adata/unprependedpeaks 0) PerSeptive Biosystems Original Filename: This File # 1= eI950pd1po0.msa c:\matsuille~g11 C:ATATLENG\119SOAPSDiPC0.M6A CoNectd: 1/199 12-20 PM SampK MIT- I 4000- 14 L 0- 20 Commmen: Method: Made: 00 600 400 000 20 Engic PDE2000 PSD 59 Acomlersting Voltage 20W Gdd Votager. 75.000 % Guide Wre Voltage: Delay. Laser: Praefurm: 2.03e.07 0.01 % 50 ON Low Mass Gets: Mirror RAtie: 1.110 1700 Scans Avrwge: 110 OFF PSD Tuned Mirror RatioIon Set1ctor: 126.7 ON Negatdve Figure C-2: PSD for 0119 Angiotensin 173 tns. 
OFF Experimental Mass(E) 69.9885838 109.953654 135.980053 217.015005 223.038243 229.935623 234.930675 250.871086 254.865274 263.180953 269.24055 272.167793 279.068532 285.106881 303.143077 313.067376 326.068144 329.072206 337.008348 343.025774 353.96337 354.951488 364.567033 370.989102 382.421177 400.350111 416.320086 426.27812 473.147628 489.228243 506.149113 513.091947 517.045502 527.238686 534.036839 619.532451 647.383406 664.539414 696.932765 714.445666 740.276771 Intensity _ Checkpoint(C) Delta Mass(C-E) Identity 70.0371 110.0583 136.0721 0.0485 0.1047 0.0921 217.1151 223.1183 230.1220 0.1001 M:YA P, (P:Aionml8 D) M:YA H M:YA Y I:YA PF I:YA IH _ 1749 16416 2416 1890 2041 235.1247 0.0800 0.1864 0.1940 251.1331 0.2621 2301 255.1353 263.1395 269.1427 272.1443 279.1480 5475 285.1512 1919 4140 303.1607 313.1660 326.1729 329.1745 337.1788 343.1820 354.1878 355.1883 364.1931 371.1968 382.2027 0.2700 -0.0413 -0.0978 -0.0234 0.0795 0.0443 0.0177 0.0987 3350 5581 2392 13451 4058 20367 6697 6345 4656 3588 5128 39182 10481 4570 2761 7502 3550 6689 3773 1877 4833 10248 18042 17791 4508 8245 7705 6339 1889 1858 2175 3134 0.1048 0.1023 0.1704 0.1562 0.2244 0.2368 -0.3738 0.2077 -0.2184 -0.1378 -0.0993 -0.0520 0.1033 400.2122 416.2207 426.2260 473.2509 489.2594 506.2685 513.2722 517.2743 527.2796 534.2833 619.3284 647.3433 664.3523 697.3698 714.3788 740.3926 0.0312 0.1193 0.1802 0.2288 0.0409 0.2465 -0.2039 -0.0400 -0.1870 0.4370 -0.0667 0.1159 174 I:YA VY, I:YB HP I:YB IH P:Bionml7 DR I:YB VY S:Yion HL, I:YBp18 IH P:Bion DR I:YB FH I:YBp18 FH P:Aionml7 DRV P:Aion DRV P:Bionml7 DRV, I:YA HPF P:Bion DRV I:YB HPF I:YBp18 PFH S:Yion FHL P:Aionml7 DRVY P:Aion DRVY S:Yion PFHL, I:YB VYIH P:Bionml7 DRVY P:Bion DRVY P:Aion DRVYI P:Bion DRVYI 756.225266 784.09736 927.159077 985.637163 1001.06581 1029.11267 1068.8317 1137.57857 1165.45098 1183.43523 1197.7826 1296.40479 12088 16216 2270 1761 3900 3585 2719 8810 19237 44775 5591 33467 1,. 
756.4011 784.4160 927.4919 985.5226 1001.5311 1029.5460 1068.5667 1137.6033 1165.6182 1183.6277 1197.6351 1296.6877 0.1759 0.3186 0.3328 -0.1144 0.4653 0.4333 -0.2649 0.0247 0.1672 0.1925 -0.1474 0.2829 1 P:Aion DRVYIH P:Bion DRVYIH I:YBp18 RVYIHPFH P:Aion DRVYIHPFH P:Bion DRVYIHPFH P:Bionpl8 DRVYIHPFH 1 M+H DRVYIHPFHL Table C.3: Angiotensin Dataset: data/120598b/120598bdata Post Source Decay Analysis FilIe 40000 Stitch Factors: 1-C:\MATSUTLENG120508BPO1PA.MSA .600 0- -20000a-am00 0- 40W0 Soo 0$0 000 MasInh Figure C-3: PSD for 1205 Angiotensin 175 - 1 010 Experimental Mass(E) 39.0365 Intensity Checkpoint(C) 478 39.0206 70.1559 72.1845 498 469 380 6786 694 70.0371 72.0382 86.2137 110.1966 111.1807 112.2168 115.161 136.1696 138.1572 156.1484 166.204 207.2945 217.2656 223.2725 230.2349 235.2373 245.2555 251.2266 255.2144 263.1141 269.1331 272.1083 285.1055 303.0332 313.0381 326.0851 343.0954 354.0267 355.0395 364.3496 382.2554 400.2776 416.2982 426.2724 489.1976 306 344 1061 584 576 728 468 783 1023 1655 2585 441 1287 7438 3454 21122 5155 3976 1626 2510 86.0456 110.0583 111.0589 112.0594 115.0610 136.0721 138.0732 Delta Mass(C-E) -0.0158 -0.1187 -0.1462 -0.1680 -0.1382 -0.1217 -0.1573 -0.0999 -0.0974 -0.0839 -0.0656 -0.1159 -0.1846 -0.1504 -0.1541 -0.1128 -0.1125 -0.1254 -0.0934 156.0827 166.0880 207.1098 217.1151 223.1183 230.1220 235.1247 245.1300 251.1331 255.1353 263.1395 -0.0790 0.0254 0.0096 0.0360 269.1427 272.1443 285.1512 303.1607 313.1660 0.0457 0.1275 0.1279 3028 2835 326.1729 0.0878 343.1820 25045 354.1878 5483 355.1883 364.1931 382.2027 400.2122 0.0866 0.1611 0.1488 -0.1564 2727 5133 2263 3514 2711 2078 -0.0526 -0.0653 416.2207 426.2260 506.2023 4013 513.2117 8758 489.2594 506.2685 513.2722 517.1686 8603 517.2743 527.2644 1255 534.1615 3114 527.2796 534.2833 -0.0774 -0.0463 0.0618 0.0662 0.0605 0.1057 0.0152 0.1218 176 Identity M:YA P, (P:Aionml8 D) M:YA V M:YA I M:YA H M:YAm17 R M:YA Y M:YB H M:YBp18 H I:YA HP I:YA PF I:YA IH I:YA VY, I:YB HP I:YB PF 
I:YB IH P:Bionml7 DR I:YB VY S:Yion HL, I:YBp18 IH P:Bion DR I:YB FH I:YBp18 FH P:Aionml7 DRV P:Aion DRV P:Bionml7 DRV, I:YA HPF I:YB HPF I:YBp18 PFH S:Yion FHL P:Aionml7 DRVY P:Aion DRVY S:Yion PFHL, I:YB VYIH P:Bionml7 DRVY P:Bion DRVY 577.225 602.4684 619.5363 632.5459 647.4372 669.4731 680.2988 740.3599 756.3997 767.296 784.3165 1183.8495 1279.7243 1296.6713 722 725 3543 2153 2404 818 840 1112 4490 1700 5108 1132 984 24878 577.3061 577.3061 602.3194 619.3284 632.3353 647.3433 669.3550 680.3608 740.3926 756.4011 767.4070 784.4160 1183.6277 1279.6787 1296.6877 0.0811 0.0811 -0.1489 -0.2078 -0.2105 -0.0938 -0.1180 0.0620 0.0327 0.0014 0.1110 0.0995 -0.2217 -0.0455 0.0164 P:Aionml7 DRVYI P:Aion DRVYI I:YB IHPFH P:Bion DRVYI I:YB RVYIH P:Aion DRVYIH P:Bionml7 DRVYIH, I:YA YIHPFH P:Bion DRVYIH P:Bionpl8 DRVYIHPFH M+Hml7 DRVYIHPFHL M+H DRVYIHPFHL Table C.4: Angiotensin Dataset: data/01 2170c/unprependedpeaks 4, PerSeptive Biosystems Original Filename: C:MATSU\TLENG1012170C\PSD1P00 mae This File # 1 = C:\MATSU\TLENG\012170C\PSD1P00.MSA Sample: 70 Collected: 1/21/98 2:03 PM 2 5 0 0; 20000- 15000M r 10000- 5000- J'i LliiLlL~1L ' ' ' 0- 200 400 e50 80 - -- - 1000 "-. 
* - -- -T 1200 Mass (mn/z) Comment angio, bwIdth=fulI Method: PDE2000 Mode- PSD Acculeratng Voltage: 20000 Grid Voltage: 75.000 % Guide Wire Voltage: 0.01 % Delay 50 ON Laser: 1790 Scans Averaged: 231 Pressure: 2.33e-07 Low Mass Gate: OFF MirrorRatio: 1.110 PSD Miror Ratio: Timed Ion Selector: 1298.7 ON Negative Ions: OFF Figure C-4: PSD for 0121 Angiotensin 177 Experimental Mass(E) 38.9917 42.6474 43.0738 43.5834 60.045 70.1447 71.1589 98.1434 112.1678 Intensity Checkpoint(C) 772 380 363 353 304 6643 549 465 3290 39.0206 43.0228 43.0228 44.0233 60.0318 70.0371 71.0376 98.0520 112.0594 Delta Mass(C-E) 0.0289 0.3754 -0.0509 0.4399 -0.0131 -0.1075 -0.1212 -0.0913 -0.1083 115.1478 419 115.0610 -0.0867 120.1403 526 120.0636 -0.0766 126.0993 780 126.0668 -0.0324 140.2396 155.2502 157.2721 762 1009 1258 140.0742 155.0822 157.0833 -0.1653 -0.1679 -0.1887 P:Bionml7 R, M:YBm17 R I:YB PG P:Bion R, I:YA SP, M:YB R 158.2839 166.3079 616 705 158.0838 166.0880 -0.2000 -0.2198 S:Yionml7 R 167.2804 175.2782 185.223 1309 1619 739 167.0886 175.0928 185.0981 -0.1917 -0.1853 -0.1248 I:YA PP, I:YBm18 SP P:Bionpl8 R, S:Yion R, M:YBp18 R I:YB SP 192.2196 195.2451 1327 1357 192.1018 195.1034 -0.1177 -0.1416 I:YB PP 209.2512 212.2254 217.123 218.2249 236.5522 237.1787 245.1472 1152 627 1163 625 911 2982 1435 209.1109 212.1124 217.1151 218.1156 236.1252 237.1257 245.1300 -0.1402 -0.1129 -0.0078 -0.1092 -0.4269 -0.0529 -0.0171 252.1834 254.2386 1217 1043 252.1337 254.1347 -0.0496 -0.1038 J:YB PPG P:Bion RP 302.1729 305.5506 359.4264 363.414 371.3495 389.3236 544 426 439 396 1329 591 302.1602 305.1618 359.1905 363.1926 371.1968 389.2064 -0.0126 -0.3887 -0.2358 -0.2213 -0.1526 -0.1171 I:YB PGF S:Yionml7 FR 178 Identity M:YA S M:YA P, M:YBm18 S M:YB P P:Aionm7 R, M:YAm17 R M:YA F P:Aionml7 RP I:YBm18 FS, I:YA PF P:Bionml7 RP I:YB PF P:Aionm17 RPPG I:YA PPGF, I:YBm18 PGFS I:YB PGFS 402.3099 632 1158 408.2843 631 419.2012 796 458.4447 527 842 399.2848 486.3314 506.4223 510.5818 787 995 527.6151 
528.556 900 555.2614 556.6239 1029 539 531 466 572.8344 588.9908 597.6304 598.5546 559 837 892 614.3491 1013 614.956 625.3541 640.1559 642.1966 668.6404 706.4312 751.518 744 579 806.9349 827.41 833.4403 850.4326 862.8687 868.6491 877.6875 883.1278 887.5898 890.0149 904.5686 905.6564 591 1287 184 183 226 218 216 216 252 315 232 257 227 273 192 1948 911 915.3463 920.911 950.0765 990.8686 1014.4169 265 1018.3881 641 256 212 623 644 399.2117 -0.0730 402.2133 408.2165 419.2223 -0.0965 458.2430 486.2578 506.2685 510.2706 527.2796 528.2801 555.2945 556.2950 573.3040 589.3125 597.3167 598.3173 614.3258 615.3263 625.3316 640.3396 642.3406 668.3544 706.3746 751.3985 807.4282 827.4388 833.4420 850.4510 862.4574 868.4606 877.4653 883.4685 887.4706 890.4722 904.4797 905.4802 915.4855 920.4881 I:YB PPGF S:Yionml7 PFR P:Bion RPPG S:Yion PFR, I:YBp18 PFR I:YA PPGFS I:YB PPGFS S:Yion SPFR, I:YBp18 SPFR P:Aionml7 RPPGF P:Aion RPPGF -0.0677 0.0211 -0.2016 -0.0735 -0.1537 -0.3111 -0.3354 -0.2758 P:Bion RPPGF, I:YA PPGFSP 0.0331 -0.3288 0.4696 0.3217 -0.3136 -0.2372 -0.0232 0.3703 -0.0224 0.1837 (I:YAm18 FSPFR) P:Aionml7 RPPGFS P:Aion RPPGFS I:YBm18 PGFSPF P:Bionml7 RPPGFS 0.1440 -0.2859 -0.0565 P:Bion RPPGFS -0.1194 0.4933 S:Yion PGFSPFR, I:YBp18 PGFSPFR 0.0288 0.0017 0.0184 -0.4112 -0.1884 -0.2221 0.3407 -0.1191 0.4573 -0.0888 P:Bionml8 RPPGFSPF, I:YBm18 PPGFSPFR S:Yionml7 PPGFSPFR S:Yion PPGFSPFR, I:YBp18 PPGFSPFR -0.1761 0.1392 -0.4228 0.4276 950.5041 -0.3432 990.5253 0.1211 1014.5380 1018.5402 I 0.1521 I P:Aion RPPGFSPFR 179 1019.4492 1038.6707 673 617 1019.5407 1038.5508 0.0915 -0.1198 1043.4846 1044.4258 1038 719 1043.5534 1044.5540 0.0688 0.1282 1045.5164 1049.0638 1054.096 632 599 562 1045.5545 1049.5566 1054.5593 0.0381 0.4928 0.4633 1060.4957 14240 1060.5624 0.0667 M+Hm17 RPPGFSPFR M+H RPPGFSPFR, P:Bionpl8 RPPGFSPFR Table C.5: Bradykinin Dataset: data/022064c/unprependedpeaks Post Source Decay Analysis File # 1=CAMATSU\TLENG\022064C\PSD1POOA.MSA - -- Stitch Factors: 0.600 - 
1.010 ------- - -- ---- -5000----........ ...... -100000- .... -150000 2.. 4...... 200 400 ..... .... . 600 800 1000 15000 1000i 5000- 04 I.___...._._6 I. ; I I 600 400 Mass (m/z) Figure C-5: PSD for 0220 Bradykinin 180 800 1000 Experimental Intensity Checkpoint(C) Delta Identity 507 1051 7764 462 308 311 456 871 8015 0.0024 0.0011 -0.1109 M:YA P, M:YBm18 S 3105 23.0122 39.0206 70.0371 71.0376 71.0376 87.0461 97.0514 98.0520 112.0594 115.0610 120.0636 126.0668 140.0742 151.0801 155.0822 156.0827 157.0833 158.0838 165.0875 166.0880 167.0886 168.0891 175.0928 185.0981 192.1018 193.1024 194.1029 195.1034 196.1040 209.1109 212.1124 217.1151 237.1257 245.1300 252.1337 2282 263.1395 Mass(E) 23.0098 39.0195 70.1481 71.1355 71.3207 87.2063 97.1742 98.1618 112.1888 115.1916 120.1754 126.1679 140.2768 151.3537 155.3028 156.282 157.3173 158.311 165.3573 166.3184 167.3258 168.3327 175.3017 185.2757 192.2622 193.294 194.3283 195.2752 196.3088 209.294 212.3102 217.2673 237.233 245.2819 252.2211 263.2083 777 1677 1644 2424 1099 3114 652 3083 1276 491 1845 3432 678 4637 1971 3237 1347 1245 3412 561 3764 1317 2563 8240 2456 -0.0978 -0.2830 -0.1601 -0.1227 -0.1097 -0.1293 M:YB P P:Aionml7 R, M:YAm17 R -0.1305 -0.1117 -0.1010 -0.2025 -0.2735 -0.2205 M:YA F P:Bionml7 R, M:YBm17 R I:YB PG -0.1992 -0.2339 -0.2271 P:Bion R, I:YA SP, M:YB R S:Yionml7 R -0.2697 -0.2303 -0.2371 -0.2435 I:YA PP, I:YBm18 SP -0.2088 P:Bionpl8 R, S:Yion R, M:YBp18 R I:YB SP -0.1775 -0.1603 -0.1915 -0.2253 -0.1717 -0.2047 -0.1830 -0.1977 -0.1521 -0.1072 -0.1518 -0.0873 -0.0687 181 (S:Yionpl8 R) I:YB PP P:Aionml7 RP I:YBm18 FS, I:YA PF P:Bionml7 RP I:YB PF I:YB PPG 274.1503 275.0342 292.2046 302.4998 305.4816 316.5282 332.4663 351.4181 359.3487 361.4266 363.4294 371.421 377.4286 380.4515 389.3622 399.2901 402.3501 408.3318 419.3184 446.8625 458.7359 468.6683 486.6087 506.6194 510.6261 527.5678 538.522 555.4765 573.1925 597.4805 614.4251 624.8962 642.2694 904.5703 1043.5426 1060.5442 2348 1105 1422 2712 1828 1382 
920 902 1258 1445 2061 4546 934 1041 2378 2172 4689 1312 3136 830 1528 1353 3034 2054 3479 3786 1096 2115 1352 3398 3530 1361 3551 1101 729 10508 274.1453 275.1459 292.1549 302.1602 305.1618 316.1676 332.1761 351.1862 359.1905 361.1915 363.1926 371.1968 377.2000 380.2016 389.2064 399.2117 402.2133 408.2165 419.2223 447.2372 458.2430 468.2483 486.2578 506.2685 510.2706 527.2796 538.2854 555.2945 573.3040 597.3167 614.3258 625.3316 642.3406 904.4797 1043.5534 1060.5624 , -0.0049 0.1117 -0.0496 -0.3395 -0.3197 -0.3605 -0.2901 -0.2318 -0.1581 -0.2350 -0.2367 -0.2241 -0.2285 -0.2498 -0.1557 -0.0783 -0.1367 -0.1152 -0.0960 0.3747 -0.4928 -0.4199 -0.3508 -0.3508 -0.3554 -0.2881 -0.2365 -0.1819 0.1115 -0.1637 -0.0992 0.4354 0.0712 -0.0905 0.0108 0.0182 I:YA PGF, I:YBm18 GFS I:YB GFS I:YB PGF S:Yionml7 FR I:YB FSP P:Bion RPP I:YA PGFS P:Aionml7 RPPG I:YA PPGF, I:YBm18 PGFS P:Aion RPPG I:YB PGFS I:YB PPGF S:Yionml7 PFR P:Bion RPPG S:Yion PFR, I:YBp18 PFR I:YA PPGFS I:YBm18 PPGFS I:YB PPGFS S:Yion SPFR, I:YBp18 SPFR P:Aionml7 RPPGF P:Aion RPPGF P:Bionml7 RPPGF P:Bion RPPGF, I:YA PPGFSP P:Aionml7 RPPGFS P:Aion RPPGFS P:Bionml7 RPPGFS,(I:YAp18 FSPFR) P:Bion RPPGFS S:Yion PPGFSPFR, I:YBp18 PPGFSPFR M+Hml7 RPPGFSPFR M+H RPPGFSPFR, P:Bionpl8 RPPGFSPFR Table C.6: Bradykinin Dataset: data/021829c/unprependedpeaks 182 Post Source Decay Analysis Stitch Factors: 0.600 - 1.010 File# 1=C:\MATSL\TLENG\021829C\PSD1P0OBMSA ~~ .50000- -1000 --- 200 400 200 400 60 800 1000 66)0 Bo 1000 10000- 5000- Mass (mlz) Figure C-6: PSD for 0218 Bradykinin C.2 Distribution of Fragment Types Statistics on these datasets are compiled in Table C.7. These include: dataset size, number of experimentals accounted for, number of multiple identity peaks present, number of the different ions present and various other totals. The ion counts include any peak that can be assigned a particular fragment type, so multiple identity peaks are multiply counted in the tallies for the fragment type of each identity. 
Fragment Type 0123 0119 Dataset 1205 0121 Dataset Size Matched Peaks Multiple Identities 73 49 6(7) 48 34 6(7) 53 34 4(5) 5 5 4 4 5 6 0220 0218 55 42 5(6) 86 46 15 71 47 14(15) 4 4 3 5 3 5 5 Cores: Aion Bion Yion 6 5 4 4 6 Am18 0(1) 0(1) 0(1) 0(1) 0 0 Am17 Bm18 3 0 2 0 2 0 3 0 5 1 4 0 Bm17 4 4 3 4 3 4 Bp18 1 0 1 1 2 2 Ym18 0 0 0 0 0 0 Ym17 1 0 0 1 5 5 Yp18 0 0 0 0 0 0(1) 9 10 0 0 0 0 0 4 4 7 0 0 0 0 0 2 4 6 0 0 0 0 0 4 6 9 0 0 0 0 0 3 6 9 0(1) 0 0 5 0 4 8 11 0 0 0(1) 5 0 3 Internals: YA YB YAm18 YAm17 YAp18 YBm18 YBm17 YBp18 Immoniums: YA 5 5 3 5 4 2 YB YAm17 1 0 1 1 0 0 1 1 1 1 2 1 YBm18 0 0 0 0 1 1 YBm17 YBp18 0 1 0 1 0 0 0 1 1 1 1 1 25(26) 23 19(20) 13 21(22) 14 21(22) 18 30 24(25) 28(29) 27(28) Totals: cores internals immoniums m18 m17 p18 7 8 3 8 9 8 0(1) 8 6 0(1) 7 3 0(1) 5 5 0(1) 9 5 7(8) 15 6(7) 6 15 6(8)

Table C.7: Ions Present in Data: Counts in parentheses represent counts when using a model without the refinement of Section 12.1.2.

Figure C-7: Manufacturer PSD for Angiotensin

Appendix D

Data Peaks of Unknown Origin

There are a handful of experimental peaks of fairly low abundance whose identities are not known. They do not appear in the theoretical spectrum, and were it not for the fact that they occur consistently across multiple datasets, one might have attributed them to noise.
checkpoint 1205 0123 0121 0119 factory 115.0610 166.0880 297 362 344 728 164 320 1679 906 212.1124 293 278 1095 521 230.1220 279.1480 303.1607 3350 2301 1919 812 866 798 1655 1626 2644 1759 1702 313.1660 4140 1085 2510 2750 329.1745 337.1788 364.1931 4656 3588 4570 1149 1044 1542 2727 2345 2050 3642 400.2122 3550 1237 2263 426.2260 527.2796 740.3926 M+H: 1296.6805 3773 4508 3134 33467 1226 730 2711 1255 1112 24878 16750 491 2941 440 2692 7326 3388 51538

Table D.1: Angiotensin Peaks of Unknown Identity: when a dataset contains an unknown, the height of the peak is listed. The height of the parent ion is included in the last row of the table for reference.

The masses in question are listed by checkpoint in Table D.1. When a dataset contains an unknown, the height of the peak is listed. The diagnostic spectrum (see Section 9.1) also contains peaks at many of these unknown masses (see column 4 of Table D.1). Lastly, we consulted a batch of 10 angiotensin datasets, which were among the first datasets we acquired as part of our training on how to use the mass spectrometer. While unfit for use in our analyses, they also exhibited a number of the peaks in question:

checkpoint    instances (out of 10)
166.0880      7
212.1124      5
230.1220      6
426.2260      7

Table D.2: Number of times each unknown appears in training datasets (out of 10).

Again, these unknowns are not major peaks; however, they occur consistently enough to arouse suspicion. From a purely mathematical analysis of masses and with our elementary knowledge of fragmentation, we could not deduce a structure for these fragments using the residues in angiotensin. Several questions then came to mind:

D.1 Do bradykinin spectra also contain unknown peaks?
Even though there are only two bradykinin datasets, they were checked for recurring unknowns, and the following were found in both:

checkpoint         0220      0218
71.0376            549       770
115.0610           419       777
126.0668           780       1644
166.0880           705       1845
192.1018           1327      3237
212.1124           1152      1317
573.3040           531       1352
M+H: 1060.5653     114240    110508

Table D.3: Bradykinin Peaks of Unknown Identity. Note that there are two 0218 experimentals (of height 462 and 308) that have the same checkpoint value of 71.0376.

Only 212.1124 and 115.061 are an amino acid (proline) distance apart, and interestingly enough, these two are the only unknowns also present in the angiotensin spectra. What do these two peptides have in common?

D.2 Could these peaks be due to the matrix?

In both cases, the same matrix compound was used. Could these peaks be somehow due to the influence of the matrix? We acquired MS spectra for the matrix alone in linear and reflector modes, and two of our unknowns, 212.1124 and 426.2260, appeared. We do not know if the molecules behind these peaks are also responsible for the unknowns at these masses in our datasets, but we suspect that it is unlikely. The timed ion selector, used during MS/MS acquisition, should have barred non-angiotensin/non-bradykinin molecules from entering the second stage of MS. It is true, however, that the timed ion selector actually allows a small window of masses through, and it is also conceivable that portions of the matrix may interact with pieces of the analyte to form a molecule that happens to reach the selector during this window, but such an occurrence is thought to be extremely rare. We also acquired MS/MS spectra for the matrix control with the timed ion selector set at 1296.68, the mass of the angiotensin parent ion, and none of the unknown peaks were present.

D.3 Is there any way to explain these peaks?
Of course, one can find some hypothetical structure to account for peaks at these masses, but we have no reason to believe any of them are likely to occur. In some of our earlier investigations, we used a larger and more general set of fragment types that included the C, D, V, W, X, Z ions and alternate forms for the A and Y ions (Scheme 111.2 of [Joh88] contains structures for these). We found that in the resulting PMF, these unknown masses could be explained by these extra fragment types, and often in more ways than one. But while they may be possible explanations, it is unlikely that they are plausible ones because these fragment types are extremely rare in MALDI-PSD spectra.

D.4 Might these unknowns be related to each other?

In the event that some of the unknowns might be the result of the same phenomenon, mass differences were calculated, and Table D.4 lists the pairs of unknown peaks that are an amino acid apart. Note that this is only useful if the two unknowns are of the same fragment type, in which case the one of smaller mass could be a prefix or suffix of the larger. It is interesting to note that almost all of the residues (save T, S and A) are residues present in the angiotensin sequence.

pair of unknown masses    residue distance    angiotensin residue
212.1124   115.061        P                   yes
230.122    115.061        D                   yes
279.148    166.088        I                   yes
279.148    166.088        L                   yes
303.1607   166.088        H                   yes
313.166    166.088        F                   yes
329.1745   166.088        Y                   yes
313.166    212.1124       T                   no
329.1745   230.122        V                   yes
426.226    279.148        F                   yes
400.2122   303.1607       P                   yes
400.2122   313.166        S                   no
426.226    313.166        I                   yes
426.226    313.166        L                   yes
400.2122   329.1745       A                   no
426.226    329.1745       P                   yes
527.2796   364.1931       Y                   yes
527.2796   426.226        T                   no

Table D.4: Unknowns that are a Residue Distance Apart: recall that the sequence for angiotensin is DRVYIHPFHL.

The graph of Figure D-1 is an alternate representation of the information in Table D.4. It is reminiscent of a fundamental graph, and non-angiotensin residue edges are represented by dotted arrows.
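The pairing behind Table D.4 can be sketched in a few lines of Python. This is only an illustration, not the code used for this thesis: the residue masses are standard nominal (integer) values, and the factor `SCALE`, which relates a checkpoint to its nominal mass, is an assumption inferred from the listed values (for example 115.061 / 115); Section 7.1.6 gives the actual definition of a checkpoint. The function name `residue_pairs` is hypothetical.

```python
from itertools import combinations

# Nominal (integer) residue masses; I/L and K/Q share a nominal mass.
RESIDUES = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
SCALE = 1.00053  # assumed checkpoint-to-nominal-mass factor (not from the thesis)
ANGIOTENSIN = "DRVYIHPFHL"

# Unknown checkpoints listed in Table D.1.
UNKNOWNS = [115.0610, 166.0880, 212.1124, 230.1220, 279.1480, 303.1607,
            313.1660, 329.1745, 337.1788, 364.1931, 400.2122, 426.2260,
            527.2796, 740.3926]

def residue_pairs(masses):
    """Return (larger, smaller, residue, in_angiotensin) for every pair of
    masses whose difference corresponds to one residue."""
    rows = []
    for low, high in combinations(sorted(masses), 2):
        nominal = round((high - low) / SCALE)
        for residue, mass in RESIDUES.items():
            if mass == nominal:
                rows.append((high, low, residue, residue in ANGIOTENSIN))
    return rows

for high, low, residue, in_angiotensin in residue_pairs(UNKNOWNS):
    print(f"{high:9.4f} {low:9.4f}  {residue}  {'yes' if in_angiotensin else 'no'}")
```

Under these assumptions the sketch lists I and L as separate rows for the same pair, since the two residues cannot be told apart by mass.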
Short partial subsequences of angiotensin that are consistent with paths through the graph are listed in Table D.5, but how these sequences can exhibit a structure of this mass is unknown to us, e.g. what chemical structure for R has a mass of 115?

Figure D-1: Graphical Representation of Residue Relationships Between Unknowns

sequence of peptides    sequence of nodes visited
R, DR, DRV              115,230,329
IH, YIH, YIHP           166,329,426
H, HL, FHL              166,279,426
HP, IHP, IHPF           166,279,426
I, IH, IHP              166,303,400
H, FH, FHL              166,313,426
H, HP                   115,212
F, PF                   115,212
V, VY                   364,527
I, YI                   364,527

Table D.5: Angiotensin Peptides and Consistent Path Nodes

D.5 Keep in Mind...

In some sense, our portrayal of unknowns is a little misleading because the datasets that we are working with are really subsets of the raw data output by the mass spectrometer:

* the operator labels and selects peaks from raw data to be included in the final dataset, and understandably, often prefers the higher intensity ones
* the GRAMS peak labelling software does not always label all peaks below a selected threshold, and even worse, certain labels may disappear during subsequent attempts at labelling other peaks. Mass labels also do not always correspond to the peak apex; sometimes they indicate the centroid location instead. At times, we resorted to peak magnification and individual labelling when we were unable to tease the software into labelling a specific desired peak.

Therefore, not all experimental peaks make it from the raw data into the dataset that is input to a sequencing algorithm. There could be other unknown peaks that were deemed insignificant and consequently not included simply because their intensities were too low. Other unknowns, which may shed additional light on the matter, may not have appeared frequently enough to be noticed by our examination and included in our list.
No one seems to know, with any degree of certainty, what the origin of these unknown peaks is. Johnson [Joh88] surmises that if they are novel ion types, then they are not related to the parent sequence or they are due to higher energy processes. In short, they could be evidence of new fragmentation, of interactions between matrix and/or peptide and/or other contaminants (e.g. on the sample plate, in solvents, in the peptide crystals when purchased), of noise or of some other phenomenon.

Appendix E

Visits to the Drawing Board

This appendix chronicles some of the strategies that were tried and efforts that were made to tackle de novo sequencing. While some of these turned out to be variations of approaches that have appeared in the literature, they were instructive and instrumental in shedding light on the nature of the problem and the characteristics of a good solution.

E.1 Understanding the Problem

Our early investigations focused on understanding the nature of the experimental data: the acquisition process, the types of peaks present and the extent of noise. Much of what has been learned in these areas has already been discussed in the background sections of this thesis.

E.1.1 Understanding the Acquisition Process

Spectra were acquired in-house as a means for becoming familiar with the procedure and the equipment.
Among other things, we learned that:

* high laser intensities can tease out additional fragments that require more energy to form, but at the expense of poorer quality for existing peaks, which widen and become less crisp;
* low mass peaks are acquired blindly: they are present in the downloaded spectrum but not in the oscilloscope output;
* when stitches are combined to form the PSD composite, peak intensities do not appear to be preserved across the board or normalized;
* some peaks had to be manually selected for labelling because the peak detection/labelling software was unable to label every single peak (see Section D.5); and
* the mass spectrometer software, surprisingly, allowed no direct access to the raw digital data of the displayed spectrum.

E.1.2 Understanding the Spectra

Spectra were analyzed to discover what information is contained within and which features might prove useful. Observations included the following:

* fragmentation is non-conserved and incomplete, i.e. not all pieces of a parent molecule are retained and not all possible fragments are produced;
* peak masses are often shifted from their calculated theoretical value with no apparent consistent bias; and
* the most common peak mass difference is 28, which is the distance between an A type break and its B type counterpart (e.g. Am17 and Bm17, YA and YB internals, etc.).

PSD spectra for an assortment of polyamino acids (e.g. poly L arginine, poly L histidine, etc.) and short homopeptides (e.g. FFF, VV, YYY, AAAA) were also acquired in the hopes that fragmentation patterns particular to a residue might be useful general aids to sequencing, but fragmentation appears to depend not only on composition but on order as well.

We found that several spectra for the same peptide could be merged into a single dataset (with some rules for deciding when two peaks were the same mass) to improve the signal to noise ratio.
If legitimate peaks are assumed to occur consistently but noise peaks randomly, then combining datasets helps reinforce actual fragments while diluting noise. Access to multiple spectra for the same peptide acquired under different conditions can also be useful because affected peaks can provide additional information [SCE+97], but we had little success with deuterium exchange (the use of D2O instead of H2O so that the mass of molecules which acquire a deuterium atom will be shifted by 1).

Two other important outcomes from understanding the data are (1) the idea that the mass of a molecule can only be at certain discrete values, which we call checkpoints (see Section 7.1.6), and (2) the realization that there would be portions of the data that cannot be explained by the current knowledge of fragmentation. These peaks of unknown identity appear persistently in spectra for the same peptide, so any de novo sequencing algorithm would have to be robust enough to formulate a sequence prediction in spite of these offending peaks.

E.2 Exploring Sequencing Algorithms

What follows is a description of a variety of different strategies and ideas we considered. They run the gamut in terms of approach: from static to dynamic scoring, from deterministic to stochastic, and from poor to fair performance (some fail to find the real sequence; some generate it but need to better distinguish it from competing sequences). Only some of the approaches are described; the observations made from these explorations are summarized in Chapter 7.

One simple early approach was one where a spectrum of n peaks was converted into a graph of n nodes. Nodes representing peaks that were potentially of the same fragment family were linked together with edges in some lexicographic order. Edges between families were then added to the graph if a family was an amino acid distance apart from another.
Indeed, the use of a graph is attractive because a wealth of efficient graph algorithms can now be used to identify and isolate a path with some specific property; in our case, the longest path that accounted for the parent mass. The idea was that this path visited the most vertices, and hence was the most likely sequence because of the redundancy it could account for. With complete (and hence, representative) spectra, this would be a reasonable conclusion since the amount of redundancy would constrain the spectrum to likely support only one interpretation. With incomplete, but representative, spectra, it is difficult to prove that the "best" path is the correct answer. Paths longer than the correct sequence could score better because more nodes are visited, and even though these may be "uninteresting" (in that they are low scoring), their aggregate sum may outscore that of the correct path.

Although not the next paradigm we considered, the fundamental graph, introduced in Section 4.2, is the first we discuss because it conveniently highlights potential sequences embedded in the data, and is therefore a natural starting point for algorithm development.

E.2.1 Fundamental Graph-Based Approaches

The simplest way to score a fundamental node is to use the number of experimental supporters for the fundamental. More complex scoring functions have been tried by us and by other groups, and while they include attributes and factors that reward redundancy, there is no reason to believe them optimal, being largely based on empirical observation. Given a graph with node scores, and possibly edge scores, one can efficiently identify the optimum path, using a greedy strategy for example, but with spectra that are not representative, it is difficult to prove that the optimal path is the correct one. Algorithms may list all paths that score above a certain threshold or simply the best n paths in hopes that the real sequence scores well enough to be included among them.
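As a rough illustration of finding the best-scoring path through a graph with node scores, the sketch below uses dynamic programming over nodes sorted by mass. It is not the greedy strategy mentioned above, and it is not the implementation used in this work. The node scores, the reduced residue table and the name `best_path` are all made up for the example.

```python
# Nominal residue masses for a toy alphabet (a subset, chosen for the example).
RESIDUES = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}

def best_path(node_scores, base, parent):
    """Return (score, sequence) of the highest scoring base -> parent path.

    node_scores maps a nominal mass to a score. An edge joins two nodes
    whose mass difference equals a residue mass, so every edge points from
    a lighter node to a heavier one and the graph is acyclic.
    """
    best = {base: (node_scores[base], "")}
    for mass in sorted(node_scores):
        if mass not in best:
            continue  # node is not reachable from the base
        score, seq = best[mass]
        for residue, delta in RESIDUES.items():
            nxt = mass + delta
            if nxt in node_scores:
                cand = (score + node_scores[nxt], seq + residue)
                if nxt not in best or cand[0] > best[nxt][0]:
                    best[nxt] = cand
    return best.get(parent)

# Hypothetical node scores for a peptide of total residue mass 225 (G + A + P).
scores = {0: 0, 57: 3, 71: 1, 128: 2, 154: 1, 168: 1, 225: 1}
print(best_path(scores, 0, 225))
```

Because nodes are visited in order of increasing mass, every predecessor of a node is finalized before the node itself is expanded. The sketch returns only the single best path; it says nothing about whether that path is the correct sequence.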
With a graph of 559 nodes and 4645 edges for the 0123 dataset, it is too costly to first enumerate all possible complete paths and then select from them. A few methods exist for keeping the problem tractable: optimizations on the physical graph can reduce its size (e.g. to fewer than 396 nodes and 2753 edges), and intelligent pruning during path enumeration can limit the number of paths considered by discarding undesirable paths and retaining only those which correlate well with the experimental data. However, there is a chance that the real answer is also removed in the process.

One means for finding the best n paths involved a propagation stage where scores were propagated through the graph from the base to the parent fundamental. Each node kept track of the n best ways to reach it from the base, and these were collated by the distance of the node from the base (the number of residues in the sequence defined by the partial path). A backtracing stage in the reverse direction, starting from the parent fundamental, delineated the best paths (sequences) of a particular desired length.

An algorithm usually generates a pool of predictions. The highest ranking sequence predictions often share a common subsequence with the real sequence. So, even when a peptide cannot be completely sequenced, an accurate partial prediction might be preferred when the answer cannot be found in its entirety.

A single gap is one reason why a real sequence may not be complete. Even worse is the inability to realize that the real sequence is missing from a pool of candidates. The inclusion of short peptide edges in the graph is a simple measure against gaps in the dataset. However, when only dipeptides and tripeptides are considered, 25189 new edges were introduced into the graph; the dipeptides alone ballooned the graph by 10432 edges.
The introduction of edges that represent more than one residue can make it difficult to find a fair scoring mechanism, and may not turn out to be a viable solution for handling large gaps.

Recall that during the construction of the graph, each experimental peak gave rise to a number of fundamentals, and these are collectively referred to as related fundamentals (see Section 4.2). Certain paths were found to visit related fundamentals. Paths like these should be disallowed if the same experimental peak is explained by multiple conflicting roles. Dancik et al. and Chen et al. were careful in their handling of related fundamentals [DAC+99a, CKT+00].

Internal ions can be included in the score of a node by considering suffixes of the sequences described by each partial path. The legitimacy of some variant supporters can also be verified from the composition of a partial path. These are examples of dynamic score calculation. Because a partial path is available to a dynamic scoring function, it can compute a score that is more accurate than that of a static one. This allows for more path-specific (hence sequence-specific) features to be accounted for earlier in the search instead of at the end when a sequence guess is complete. However, despite a number of optimizations, the resulting search took too long.

E.2.2 Expanding Islands of Certainty

Instead of sequencing from one terminus to another, this approach tries to work inside out. The idea is to guess very likely subsequences of the peptide, and then gradually extend these islands in both directions with residues that are less and less certain until no further extensions are possible or the entire parent mass is accounted for.

How does one find a good starting guess? In our datasets, there were definitely regions of high redundancy as well as regions with low amounts of support.
The best scoring fundamentals, those which are highly redundant, are selected¹, and those that are an amino acid apart are chained together to form islands. These fundamentals and their corresponding peaks can then be removed from the data, and the best scoring fundamentals of the remaining data can then be assimilated into the growing islands. Of course, it is possible that not all of the initially chosen fundamentals are correct, and some of the second-tier fundamentals may not be compatible with the existing islands, so one must decide what to do in situations like this: with only a few potential fundamentals, one could consider all possible combinations and generate best solutions for each, or one could try to find a maximum compatible set.

Because this approach depends largely upon redundancy and on how quickly support peters out with every iteration of the algorithm, it is not clear that it will perform well in general. Since each iteration considers information that is less and less reliable, this might instead suggest that finding a portion of the real sequence with high confidence is preferable to proposing an entire sequence whose correctness is uncertain.

E.2.3 Bounding Partial Paths

A branch and bound search can be performed on the fundamental graph. Here, the search algorithm would keep track of some limited number of partial paths sorted by score, and always choose the best path to expand next. A better strategy would be one that calculates an upper bound for the score of a partial path so that the most promising partial path would always be the one chosen for expansion.

To estimate the promise possible from continued expansion of a partial path, the scoring function was modified so that it could compute a score from a corresponding partial sequence. The length of the real sequence must be known in order to pad the partial sequence with the appropriate number of wildcard characters, special placeholders for to-be-determined residues.
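To make the padding concrete, the following sketch pads a partial sequence with wildcards and computes, for every prefix, the interval in which the sum of its residue masses must lie. It assumes nominal residue masses and that a wildcard can stand for any residue between glycine (57) and tryptophan (186). The values are sums of residue masses only, not ion masses, and the names `pad` and `prefix_mass_ranges` are hypothetical.

```python
# Nominal (integer) residue masses.
RESIDUES = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
WILDCARD = "?"
MIN_MASS = min(RESIDUES.values())  # glycine
MAX_MASS = max(RESIDUES.values())  # tryptophan

def pad(partial, length):
    """Pad a partial sequence with wildcards up to the known length."""
    return partial + WILDCARD * (length - len(partial))

def prefix_mass_ranges(padded):
    """Return a (low, high) residue-mass interval for every prefix.

    A prefix without wildcards has low == high (its mass is known exactly);
    each wildcard widens the interval by the spread of residue masses.
    """
    low = high = 0
    ranges = []
    for symbol in padded:
        if symbol == WILDCARD:
            low += MIN_MASS
            high += MAX_MASS
        else:
            low += RESIDUES[symbol]
            high += RESIDUES[symbol]
        ranges.append((low, high))
    return ranges

print(prefix_mass_ranges(pad("DR", 4)))
```

The sketch does not use the known parent mass to tighten the intervals, which a real bound could do.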
The theoretical peaks computed for a padded partial sequence now fall into one of two categories: definites and ranges. Definites refer to those masses that are exactly known. Subsequences containing wildcards induce a range of possible mass values because the wildcard stands in place of a residue whose mass is between 57 (glycine) and 186 (tryptophan). The more wildcards, the greater the possible range.

¹ A variation on the selection of fundamentals in Section E.2.2 was to sort the experimentals by intensity and choose the most intense ones. This assumes that high intensity peaks are more likely to represent legitimate fragments and not noise.

Now, computation of a score involves finding the best mapping or assignment of experimental peaks to these theoreticals, and the Hungarian method [Kuh55, PS82], an algorithm for solving the weighted bipartite matching problem, was used. The Hungarian method is commonly used on job scheduling problems: jobs need to be assigned to processors, there is a cost associated with each pairing, and the minimum cost assignment is desired. This problem can be represented as a bipartite graph where processors and jobs are nodes, edges exist between processor and job nodes only, and the cost of each edge is given by a square cost matrix (square because the Hungarian method requires the number of processors to equal the number of jobs). The Hungarian method produces a perfect assignment, one where the mapping between processors and jobs is a bijection (one-to-one).

Variations of this algorithm abound [DE96, HHF95, TYC93]. In particular, [TYC93] formulate the problem as a maximization one, and [HHF95] allow matchings where multiple jobs are assigned to a processor by introducing dummy processors/jobs when necessary to equalize the number of processor and job nodes. With these in mind, we reduced our problem to an instance of bipartite matching that was suitable for the Hungarian method.
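The square cost matrix and the use of dummy nodes can be illustrated with a small sketch. It pads a rectangular cost matrix with zero-cost dummy rows or columns and then finds the minimum cost perfect assignment by exhaustive enumeration. This brute-force search is not the Hungarian method and is only practical for tiny matrices; the cost values and function names are invented for the example.

```python
from itertools import permutations

def pad_to_square(cost, dummy_cost=0):
    """Add dummy rows or columns so the cost matrix becomes square."""
    rows, cols = len(cost), len(cost[0])
    size = max(rows, cols)
    square = [list(row) + [dummy_cost] * (size - cols) for row in cost]
    square += [[dummy_cost] * size for _ in range(size - rows)]
    return square

def min_cost_assignment(cost):
    """Exhaustively find the minimum cost one-to-one assignment.

    Returns (total_cost, columns), where columns[i] is the column matched
    to row i. The running time is factorial in the matrix size.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return sum(cost[i][best[i]] for i in range(n)), list(best)

# Two theoreticals (rows) and three experimentals (columns), made-up costs.
cost = [[1, 9, 5],
        [8, 2, 6]]
print(min_cost_assignment(pad_to_square(cost)))
```

In this toy case the third experimental is matched to the dummy row, which plays the role of an unmatched peak.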
Our reduction accounted for:

* an unequal number of theoreticals and experimentals by adding the appropriate number of dummy nodes
* the possibility of multiple identity peaks by allowing multiple theoreticals to be assigned to (different instances of nodes representing) the same experimental, and
* the presence of unmatched experimentals by matching them to theoretical nodes which represent noise.

Even with a number of optimizations, the resulting graph was huge, and the Hungarian method took very long.

An alternate means for calculating an upper bound was found to be very fast, but the bound was too relaxed and the real partial sequence was nowhere near the top. In fact, it scored so poorly that it was not chosen to be expanded until much later: only when incorrect partial paths were expanded practically to completion were they outscored by the real partial path. It was basically a tradeoff between time and accuracy: a good estimate of a path's potential would have been accurate but extremely costly to compute; a very fast estimate was possible, but the result was too imprecise to be useful in the ensuing search.

Some of the other approaches that were considered include:

Hidden Markov Models: The observation sequence of an HMM is a series of outputs of states visited in some order. The most straightforward approach is to regard an experimental spectrum as this observation sequence. However, a spectrum has no notion of temporal ordering associated with it, neither in terms of peak generation nor peak appearance, while HMM observations are not simply a set of observations (as in a mass spectrum), but a time dependent ordering of events. Perhaps there exists some other non-trivial way to view the problem so that HMMs can be applicable.

Bottom-Up Subsequence Reconstruction: A subsequence analysis approach attempted to take advantage of the fact that there are (1) internals present and (2) more peaks due to shorter subsequences than longer ones.
The idea is to list all possible distinct dipeptides that are supported by an experimental peak. If two list members are found such that one has a prefix that is a suffix of the other, a new sequence is formed by merging the overlapping portions (e.g. DR and RV would give rise to DRV), and if this sequence is supported by an experimental and has not yet been considered, the new merged sequence is added to the list. This algorithm relies on there being enough overlapping pieces present, something that is not characteristic of our datasets. Without some additional provisions, there are not enough subsequence peaks to permit assembly in this splint-like fashion.

Path Reinforcement for Internal Ions: The main idea behind this approach was to augment the scores of (reinforce) certain subpaths through the fundamental graph. Since subpaths correspond to possible internal ions, only those that are supported by the data are reinforced; namely, reinforce all edges along all subpaths between two nodes only if the mass difference of the nodes is supported by some experimental peak. The hope was that the most reinforced path would be the most redundant and hence the path of the correct sequence. However, as a result of being situated at the crossroads of several reinforced paths, certain edges ended up with very high edge scores. These edges were not part of the correct path, and had been "taken out of context": the score was high because the edge was part of a particular subpath; if the other edges of that subpath were not also included in the selected complete path, the component of the score due to this subpath should be disregarded. It is possible that a highly redundant spectrum could overcome this because the most reinforced edges would be exactly those edges of the supported subpaths.

The various approaches sought to capitalize on different features of the data, and each met with varying degrees of success/failure.
Because two pairs of amino acids have indistinguishable weights, it is already impossible to predict the true sequence exactly. Gaps compound the problem and only serve to obscure additional residues. When a pool of solutions generated by an approach contains the best correct sequence possible, the task becomes one of examining why the real sequence scored poorly and why an incorrect competitor scored well. The score and/or search were often modified to account for some additional property, in as general a manner as possible, to elevate the standing of the correct answer. Oftentimes, these modifications, and even the entire approach itself, seemed to be ad hoc patchwork. What resulted seemed a complicated, motley collection of rules and heuristics that lacked a formal framework for proving why the answer offered by the algorithm should be a good one. In formulating our approach of Chapter 8, however, we desired an algorithm that had a more structured framework, and Chapter 7 describes the issues we had to keep in mind.

Appendix F

Scoring Function Maximum

The scoring function Prob(S, F) of Chapter 8 is maximal when the spectrum S most resembles the PMF F. A proof sketch is contained in this appendix.

Let $r$ be the parent mass. The PMF $F$ is defined from $0$ to $r$; let the value of $F$ at $i$ be denoted $F_i$. The experimental spectrum $S$, also defined for all integer values between $0$ and $r$, can be thought of as a vector of assignments $k$ where $k_i$ is the height of peak $i$. Given an integer $N$ and an assignment $k = \langle k_0, k_1, \ldots, k_r \rangle$ where the $k_i$ are integers and $\sum_{i=0}^{r} k_i = N$, let

\[ \mathrm{Prob}(k, F) = \frac{N!}{k_0!\, k_1! \cdots k_r!}\; F_0^{k_0} F_1^{k_1} \cdots F_r^{k_r} \]

where $0 < F_i < 1$ and $\sum_{i=0}^{r} F_i = 1$.

Theorem F.0.1 $\mathrm{Prob}(k, F)$ attains its maximum value when $k = k_{opt} = \langle F_0 N, F_1 N, \ldots, F_r N \rangle$.

Proof: The goal is to show that for any assignment $k$, $\mathrm{Prob}(k, F) \le \mathrm{Prob}(k_{opt}, F)$. Consider some assignment $k = \langle k_0, k_1, \ldots, k_r \rangle$. For each $i$, one of the following must be true:

1. $k_i > F_i N \Rightarrow k_i = F_i N + e_i$,
2. $k_i = F_i N$,
3. $k_i < F_i N \Rightarrow k_i = F_i N - e_i$

where $e_i$ is an integer with $e_i \ge 1$. Rearrange the expression for $\mathrm{Prob}(k, F)$ by grouping together terms that fall in each of these three categories, and without loss of generality, shuffle and rename the indices, so that for index $i$,

* $0 \le i \le a$: $k_i > F_i N$,
* $a+1 \le i \le b$: $k_i = F_i N$, and
* $b+1 \le i \le r$: $k_i < F_i N$.

We now arrive at:

\[
\begin{aligned}
\mathrm{Prob}(k, F) &= \frac{N!}{(F_0 N + e_0)! \cdots (F_a N + e_a)!\,(F_{a+1} N)! \cdots (F_b N)!\,(F_{b+1} N - e_{b+1})! \cdots (F_r N - e_r)!} \\
&\quad \times F_0^{F_0 N + e_0} \cdots F_a^{F_a N + e_a}\; F_{a+1}^{F_{a+1} N} \cdots F_b^{F_b N}\; F_{b+1}^{F_{b+1} N - e_{b+1}} \cdots F_r^{F_r N - e_r} \\
&= \frac{N!}{(F_0 N)! \cdots (F_r N)!}\; F_0^{F_0 N} \cdots F_r^{F_r N}
\times \frac{F_0^{e_0} \cdots F_a^{e_a}}{\prod_{t=1}^{e_0} (F_0 N + t) \cdots \prod_{t=1}^{e_a} (F_a N + t)}
\times \frac{\prod_{t=-e_{b+1}+1}^{0} (F_{b+1} N + t) \cdots \prod_{t=-e_r+1}^{0} (F_r N + t)}{F_{b+1}^{e_{b+1}} \cdots F_r^{e_r}} \\
&= \alpha\, \mathrm{Prob}(k_{opt}, F).
\end{aligned}
\]

We desire to show that $\alpha \le 1$. We can rewrite $\alpha$ as:

\[
\alpha = \prod_{i=0}^{a} \frac{F_i^{e_i}}{\prod_{t=1}^{e_i} (F_i N + t)} \;\times\; \prod_{i=b+1}^{r} \frac{\prod_{t=-e_i+1}^{0} (F_i N + t)}{F_i^{e_i}} \qquad \text{(F.1)}
\]

Consider the fractions of the form $F_i^{e_i} / \prod_{t=1}^{e_i} (F_i N + t)$. These terms result from all $k_i$ that fall into case 1. We can bound each of these as follows:

\[
\frac{F_i^{e_i}}{\prod_{t=1}^{e_i} (F_i N + t)} \le \left( \frac{F_i}{F_i N + 1} \right)^{e_i} = \left( \frac{1}{N + \frac{1}{F_i}} \right)^{e_i} < \left( \frac{1}{N} \right)^{e_i}.
\]

Substituting this into F.1 (and noting that when $k \ne k_{opt}$ at least one index falls into case 1) leads to:

\[
\alpha < \left( \frac{1}{N} \right)^{e_0} \cdots \left( \frac{1}{N} \right)^{e_a} \times \frac{\prod_{t=-e_{b+1}+1}^{0} (F_{b+1} N + t)}{F_{b+1}^{e_{b+1}}} \cdots \frac{\prod_{t=-e_r+1}^{0} (F_r N + t)}{F_r^{e_r}}.
\]

Consider the remaining fractions in $\alpha$, namely those that result from case 3:

\[
\frac{\prod_{t=-e_i+1}^{0} (F_i N + t)}{F_i^{e_i}} = \frac{F_i N - e_i + 1}{F_i} \cdots \frac{F_i N}{F_i} = \left( N - \frac{e_i - 1}{F_i} \right) \cdots \left( N \right) \le N^{e_i} \quad (\text{because } e_i \ge 1).
\]

Since the assignment $k$ is one way to partition $N$ things into $r + 1$ parts:

\[
N = \sum_{i=0}^{r} k_i = (F_0 N + e_0) + \cdots + (F_r N - e_r) = \sum_{i=0}^{r} F_i N + \sum_{i=0}^{a} e_i - \sum_{i=b+1}^{r} e_i = N + \sum_{i=0}^{a} e_i - \sum_{i=b+1}^{r} e_i
\;\Rightarrow\; \sum_{i=0}^{a} e_i = \sum_{i=b+1}^{r} e_i .
\]

Therefore,

\[
\alpha < \left( \frac{1}{N} \right)^{\sum_{i=0}^{a} e_i} N^{\sum_{i=b+1}^{r} e_i} = 1
\]

and the maximum value of $\mathrm{Prob}(k, F)$ is obtained when $k$ "mimics" the probability distribution defined by $F$; namely, when $k = k_{opt} = \langle F_0 N, F_1 N, \ldots, F_r N \rangle$.
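The statement of the theorem can be checked numerically on a small case. The sketch below enumerates every assignment for a made-up three-bin distribution and confirms that the probability peaks at $k_i = F_i N$. It is a sanity check on one example, not a substitute for the proof, and the values of `F` and `N` are arbitrary choices for which every $F_i N$ is an integer.

```python
from math import factorial

def multinomial_prob(k, F):
    """Prob(k, F) = N! / (k_0! ... k_r!) * F_0^k_0 ... F_r^k_r."""
    coeff = factorial(sum(k))
    for ki in k:
        coeff //= factorial(ki)  # exact: each partial quotient is an integer
    p = float(coeff)
    for ki, Fi in zip(k, F):
        p *= Fi ** ki
    return p

def assignments(N, parts):
    """Yield every tuple of `parts` non-negative integers summing to N."""
    if parts == 1:
        yield (N,)
        return
    for first in range(N + 1):
        for rest in assignments(N - first, parts - 1):
            yield (first,) + rest

F = (0.5, 0.3, 0.2)  # hypothetical PMF over three masses
N = 10               # total peak height to distribute
best = max(assignments(N, len(F)), key=lambda k: multinomial_prob(k, F))
print(best, multinomial_prob(best, F))
```

For this example the maximizing assignment is the one with five, three and two units in the three bins, in line with $k_{opt} = \langle F_0 N, F_1 N, F_2 N \rangle$.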
Thanks to David Stephenson for assistance with the proof for the case when r = 2.