De Novo Peptide Sequencing from Matrix-Assisted Laser
Desorption/Ionization-Time of Flight Post-Source-Decay Spectra
by
Tony Liang Eng
Bachelor of Science in EECS, MIT, 1992
Master of Science in EECS, MIT, 1994
Bachelor of Science in Mathematics, MIT, 1996
Bachelor of Science in Biology, MIT, 1998
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2001
© Tony Liang Eng, MMI. All rights reserved.
The author hereby grants to MIT permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part.
Author ..................................................
Department of Electrical Engineering and Computer Science
February 2, 2001
Certified by ............................................
Tomas Lozano-Perez
Professor of Electrical Engineering & Computer Science
Thesis Supervisor
Accepted by .............................................
Arthur Smith
Chairman, Department Committee on Graduate Students
De Novo Peptide Sequencing from Matrix-Assisted Laser
Desorption/Ionization-Time of Flight Post-Source-Decay Spectra
by
Tony Liang Eng
Submitted to the Department of Electrical Engineering and Computer Science
on February 2, 2001, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
With the explosion of research activity in genomic and proteomic bioinformatics, there is
an increased demand for rapid protein sequencing algorithms, and mass spectrometry (MS)
has been explored as a possible tool for aiding in this process [YME96]. Most sequencing
from tandem mass spectra relies on either some form of comparison to a database of known
peptides, or manual sequence inference by human analysis of spectra. Such approaches
encounter difficulties when presented with the spectra of unknown and novel proteins not
catalogued in a database, or with complex spectra that do not easily lend themselves to
manual interpretation. A few de novo approaches exist, but their performance is sensitive
to noise and gaps in the dataset, and their scoring methods lack a formal framework for
reasoning about the answer produced.
We propose a new approach that involves a probabilistic model for peptide fragmentation, a
scoring function based on this model, and a simulated annealing search based on this scoring
function. Our algorithm takes as input the original mass and the MALDI-TOF PSD mass
spectrum of the peptide to be sequenced, and finds the amino acid sequence consistent with
the best interpretation of the spectrum under the proposed model. If the model is good and
the dataset is sufficient, then the real sequence scores optimally, and simulated annealing,
under the appropriate searching conditions, will converge onto this sequence. We found
that a simple model was sufficient to correctly predict the sequence of short peptides, and
that our approach exhibited some resilience to noise and gaps in the data.
Thesis Supervisor: Tomas Lozano-Perez
Title: Professor of Electrical Engineering & Computer Science
Acknowledgments
This thesis has been long in the making: it is a product not simply of my doctoral years,
but of my entire time at MIT. Consequently, there are many people to thank, and I will
no doubt inadvertently omit several who deserve to be recognized and thanked.
There are several categories of people who deserve mention. First and foremost, I am grateful for the availability and support of my thesis committee. I have enjoyed having Professor
Tomas Lozano-Perez as my thesis advisor. It was a wonderful research experience, and I
respect him for his insight, counsel and passion for research. Professor Paul Matsudaira
suggested the de novo peptide sequencing problem, and I always felt he was "on my side,
rooting for me" from his encouraging remarks and his support, both morally and financially
through the MIT Whitehead Training Grant in Genomic Science. Professor Eric Grimson
has been supportive, accommodating and affirming as a committee member, but also as the
lecturer of a class I was pleased to be a part of.
6.001 has been a large part of my graduate life, and I am grateful for the chance to work with
and learn from Professors Duane Boning and Eric Grimson (both of whom have inspired
me in my teaching), TAs Robbin Chapman, Aileen Tang, Kyle Ingols, and of course the
students themselves.
Professors F. Tom Leighton and Daniel Kleitman deserve special mention for their time and
involvement in the earlier stages of this thesis (Sections E.1 and E.2, respectively). Thanks
also to Professor Bonnie Berger for supplementing some of my support through the Program in Mathematics and Molecular Biology Graduate Student Fellowship and NIH/NHCRI
HG00039.
Various other professors and colleagues have been helpful in providing information, lending
a hand, or giving advice. Thanks to Arnie Falick (Applied Biosystems) for three sets of data;
Drs. Wishnok and Tannenbaum (MIT BEH) for the use of their mass spectrometer; Hong Bin
Ni and Bryan Robinson (MIT); Kevin Hayden and Wade Hines (Applied Biosystems); Ivan
Correia and James Pang (MIT Whitehead); Duane Boning (MIT MTL), Arthur Smith (MIT
EECS) and Charles Leiserson (MIT LCS); David Williamson (IBM); Erik Winfree (Caltech);
Ting Chen (USC); Bryan Che, Daniel Derksen and Tamara Williams (MIT) for use of pigtail,
a laptop and Asti Spumante, respectively. My Palm Pilot has also been indispensable during
the thesis writing stages.
Thanks to the various administrators who are always rooting for us graduate students
and who keep MIT running: Marilyn Pierce, Lisa Bella, Be Blackburn, Jill Fekete, Teresa
Coates, Julie Ellis, David Jones and Bruce Dale.
My years at MIT have been punctuated with many faces who have walked beside me and
made MIT more pleasant and bearable. At some point in time during my later PhD years,
each of them, whether at MIT or from afar, has sustained me through one thing or another,
in some fashion or another - a card, a hug, a smile, a prayer, a backrub or a meal. My
grateful thanks to: Ona Wu, Kiet Van, Mona Lou, Jen Chen, David Stephenson, Nicole
Lazo, Jeff Kuo, Jim Derksen, Irene Yeh, Anca Brad, the generations of Cross Products
(especially the London year), Jesse Byler (thanks for helping with simulations), Vivian
Cheung (you'll make a great mom!), Dan Shiau, Julie Gesch, Christine Ko, Connie Chang,
Vanessa Wong, Ben Nunes, Bryan Che, David Robison, Jennifer Lee, Jane Hsu, Jennifer
Tam Lin, Christina Park, Joonah Yoon, Buck Goh, Thomas Lee, Christian Sevilla, Eric
Hsieh, Lawrence Chang, Susan Huang and the Doctor's Small Group.
I am especially grateful to those who have been there through to the end, during the hard
times and during the (6 month!) home stretch when the thesis became all-consuming - your
prayers and all the little ways you have cheered me up/on mean a lot to me.
Last but not least, to my loving parents who put their dreams aside so I could pursue mine
- thanks for your belief in me and your solid support, patience and constant encouragement
in all my endeavors.
As I reflect on the PhD years, writing this moments before the thesis deadline, I thank
God for all that has happened, for in many ways and by many people, I have been blessed.
Contents
1 Introduction
2 Mass Spectrometry
  2.1 Overview of MALDI MS
    2.1.1 Sample Preparation
    2.1.2 Desorption and Ionization of Analyte
    2.1.3 Ion Separation and Detection
    2.1.4 Useful Improvements and Variations
  2.2 Tandem Mass Spectrometry (MS/MS)
3 Protein Fragmentation
  3.1 MALDI-PSD Fragment Types
    3.1.1 Series Ions
    3.1.2 Internal Ions
    3.1.3 Immonium Ions
    3.1.4 Parent Ion
    3.1.5 Neutral Loss/Gain Variants
    3.1.6 Other Fragment Types
  3.2 Fragmentation Rules
4 Terminology and Concepts
  4.1 Fragmentation and Spectra
  4.2 Fundamental Graphs
    4.2.1 Purpose
    4.2.2 Construction
  4.3 De Novo Peptide Sequencing From Tandem Mass Spectra
  4.4 Notion of a Correct Sequence Prediction
5 de novo Protein Sequencing with Mass Spectra
  5.1 Chemical Sequencing
  5.2 Sequencing with Mass Spectrometry
    5.2.1 Enlisting the Aid of Fragment Types
    5.2.2 Persevering Despite the Effects of Noise
    5.2.3 Considering Different Interpretations of the Same Spectrum
6 Prior Work
  6.1 Four Categories of Approaches
  6.2 Database Search with MS
  6.3 Database Search with MS/MS
    6.3.1 Searching with Peptide Sequence Tags
    6.3.2 Evaluating Theoretically Predicted Spectra with Experimentally Obtained Spectra
  6.4 Hybrid Approaches
    6.4.1 Computational Approaches with a MS/MS Database Search
    6.4.2 Database Search with MS and MS/MS
    6.4.3 Database Search with MS or MS/MS
  6.5 Discussion of Database Approaches
  6.6 Computational Search with MS
    6.6.1 Ladder Sequencing with Mass Spectrometry
  6.7 Computational Search with MS/MS
    6.7.1 Sequence-to-Spectrum Categories
    6.7.2 Spectrum-to-Sequence Category
    6.7.3 Fundamental Graph (Global Spectrum-to-Sequence) Approaches
    6.7.4 Global Fundamental Graphs Approaches
7 Observations and Issues
  7.1 Spectrum-Related Issues
    7.1.1 Gaps
    7.1.2 Immonium Interference
    7.1.3 Mistaken Identities
    7.1.4 Under-/Over-Represented Families
    7.1.5 Experimental Peak Heights
    7.1.6 Mass Tolerances
  7.2 Scoring Function Issues
    7.2.1 Uses of a Scoring Function
    7.2.2 Theoretical Peak Heights
    7.2.3 Award / Penalty System
    7.2.4 Accounting for Disallowed Variants
    7.2.5 Accounting for Internals
    7.2.6 Fragment Type Frequencies
8 Approach
  8.1 Solution Schematic
  8.2 Modelling the Fragmentation of a Single Molecule
    8.2.1 Model I: Modelling Series Ions
    8.2.2 Model II: Modelling Internal Ions
    8.2.3 Model III: Modelling Variants
    8.2.4 Model IV: Modelling Residue Tendencies
  8.3 Accounting for Noise
    8.3.1 Physical Noise
    8.3.2 Measurement Noise
  8.4 Assumptions
  8.5 From Model to Scoring Function
    8.5.1 Computing the Probability Mass Function
    8.5.2 Evaluating Sequence Guesses
    8.5.3 Scoring Function Maximum
  8.6 Exploring the Search Space
  8.7 Summary
9 Testing the Model and Its Scoring Function
  9.1 Training Data
    9.1.1 Observations of Training Spectra
  9.2 Training the Model
    9.2.1 Parameterizing the Model
    9.2.2 Training the Model Parameters
  9.3 Examination of the Trained PMF
  9.4 Scoring Guesses Against an Observed Spectrum
    9.4.1 Not the Real Sequence, but Still Correct
  9.5 Summary: Model Training
10 Testing the Simulated Annealing Search
  10.1 Search Convergence
  10.2 Sequence Prediction
  10.3 Different Restricted Sizes
  10.4 Simulated Annealing on Data Without Noise
  10.5 Exploration of Simulated Annealing Parameters
  10.6 Summary
11 Testing the Approach
  11.1 Leave One Out Cross-Validation
    11.1.1 Results for the Different Scenarios
    11.1.2 Investigation of the 1205 Dataset
  11.2 Meta-Analysis of Published Spectra
  11.3 Data from Another Center
  11.4 Summary
12 Discussion
  12.1 A Study of Two Longer Peptides
    12.1.1 Enlargement of the Training Set
    12.1.2 Refining the Model to Improve the Scoring Function
    12.1.3 Performance of the Different Variations
  12.2 A Study of Dataset Size
    12.2.1 Removing Low Intensity Peaks
    12.2.2 Removing High Intensity Peaks
    12.2.3 Removing Noise Peaks
  12.3 Summary
13 Conclusions
  13.1 Room for Improvement
    13.1.1 Improvements in the Data
    13.1.2 Improvements in the Model
    13.1.3 Improvements in the Search
  13.2 Looking Towards the Future: Longer Peptides
    13.2.1 Effect of Isotopes
  13.3 Summary
A Amino Acid Information
B Experimental Methods
    B.0.1 Sample Preparation
    B.0.2 Data Collection
C Experimental Data
  C.1 Dataset Peaks and Peak Identities
  C.2 Distribution of Fragment Types
D Data Peaks of Unknown Origin
  D.1 Do bradykinin spectra also contain unknown peaks?
  D.2 Could these peaks be due to the matrix?
  D.3 Is there any way to explain these peaks?
  D.4 Might these unknowns be related to each other?
  D.5 Keep in Mind...
E Visits to the Drawing Board
  E.1 Understanding the Problem
    E.1.1 Understanding the Acquisition Process
    E.1.2 Understanding the Spectra
  E.2 Exploring Sequencing Algorithms
    E.2.1 Fundamental Graph-Based Approaches
    E.2.2 Expanding Islands of Certainty
    E.2.3 Bounding Partial Paths
F Scoring Function Maximum
Bibliography
List of Tables
6.1 Taxonomy of De Novo MS/MS Approaches
9.1 Peak Classification from [RYM95]
9.2 Model Parameters: Untrained and Trained Overall
9.3 Scores of Sequences with the Same Mass as Angiotensin for Datasets 0123 and 0119. A '*' denotes best score in a column. Sequences considered correct are the actual sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues flipped, listed fifth from last.
9.4 Scores of Sequences with the Same Mass as Angiotensin for Datasets 1205 and 0121. A '*' denotes best score in a column. Sequences considered correct are the actual sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues flipped, listed fifth from last.
9.5 Scores of Sequences with the Same Mass as Bradykinin. A '*' denotes best score in a column.
10.1 Results of Simulated Annealing Run on Different Datasets using an Untrained/Trained Model, and a Length-Preserving/Non-Length-Preserving Search
10.2 Simulated Annealing Results for Different Lengths. This table lists the ten predictions made by a search, restricted to sequences of length 8 to 13 inclusive, on the 0123, 0119 and 1205 datasets.
10.3 Simulated Annealing Results for Different Lengths (cont). This table lists the ten predictions made by a search, restricted to sequences of length 8 to 13 inclusive, on the 0121, 0220 and 0218 datasets.
10.4 Simulated Annealing of Datasets Without Noise
10.5 Exploring Simulated Annealing Parameter Space
10.6 Exploring Simulated Annealing Parameter Space (cont)
10.7 Exploring Simulated Annealing Parameter Space (cont)
11.1 Model Parameters for the Different Scenarios. The values of the Overall Model from Table 9.2 are included for ease of comparison.
11.2 Results of Running Simulated Annealing with Model Parameters from the Different Scenarios
11.3 Results of Running Simulated Annealing with Model Parameters Trained With All Angiotensin and Bradykinin Datasets Except 1205 (AllBut1205)
11.4 Model Parameters When Trained With a Single Dataset
11.5 Results of Running Simulated Annealing with Each Model of Table 11.4 on Its Training Set
11.6 AllAngioBut1205: Trained Parameter Values
11.7 Results of Running Simulated Annealing with Table 11.6 Model Parameters on the Other Datasets
11.8 Trained Parameter Values When Normalizing All Six Datasets
11.9 Results of Running Simulated Annealing When Datasets Are Normalized (method I)
11.10 Results of Running Simulated Annealing When Datasets Are Normalized (method II)
11.11 Factory Angiotensin: Trained Parameter Values
11.12 Results of Running Simulated Annealing on Factory Angiotensin with Various Trained Models
11.13 Results of Simulated Annealing Run on Datasets from the Literature Using a Model Trained Without 1205 (AllBut1205). The 1375.8 dataset was the only dataset for which extra peaks were not inferred.
11.14 Results of Simulated Annealing Run on Datasets from Applied BioSystems Using a Model Trained Without 1205
12.1 Model Parameters for Leave One Out Cross-Validation
12.2 Results of Leave One Out Cross Validation with the Eight Datasets
12.3 Trained Model Parameters: Overall Training Set Plus 830.4, 1237.5 and 1948.1
12.4 Results of Running Simulated Annealing On All Datasets with the Model Parameters of Table 12.3.
12.5 Results of Various Simulated Annealing Runs on Dataset with M+H 1758.9
12.6 Results of Various Simulated Annealing Runs on Dataset with M+H 1948.1
A.1 Basic Residues, their Frequencies and Masses
C.1 Angiotensin Dataset: data/012360c/unprependedpeaks
C.2 Angiotensin Dataset: data/011959adata/unprependedpeaks
C.3 Angiotensin Dataset: data/120598b/120598b data
C.4 Angiotensin Dataset: data/012170c/unprependedpeaks
C.5 Bradykinin Dataset: data/022064c/unprependedpeaks
C.6 Bradykinin Dataset: data/021829c/unprependedpeaks
C.7 Ions Present in Data: Counts in parentheses represent counts when using a model without the refinement of Section 12.1.2.
D.1 Angiotensin Peaks of Unknown Identity: when a dataset contains an unknown, the height of the peak is listed. The height of the parent ion is included in the last row of the table for reference.
D.2 Number of times each unknown appears in training datasets (out of 10).
D.3 Bradykinin Peaks of Unknown Identity. Note that there are two 0218 experimentals (of height 462 and 308) that have the same checkpoint value of 71.0376.
D.4 Unknowns that are a Residue Distance Apart: recall that the sequence for angiotensin is DRVYIHPFHL.
D.5 Angiotensin Peptides and Consistent Path Nodes
List of Figures
2-1 Linear Mass Analyzer
2-2 Tandem Mass Spectrometer: (A) only molecules of the desired parent mass are allowed to pass through the timed ion selector, (B) post source decay continues, (C) fragment ions are detected
3-1 Peptide Fragment Ions Common to MALDI-PSD. Recall that a fragment must be positively charged to be detected - the H+ accompanied by a bracing line indicates that a proton has affixed itself to some part of the molecule encompassed by the bracing line.
6-1 Classification of Different Protein Sequencing Approaches
6-2 MS and MS/MS spectra: MS is simply a way to separate pieces of the peptide resulting from proteolytic digestion by mass. MS/MS is a means to home in on a particular mass to generate random non-specific fragmentation.
7-1 Theoretical Masses and Experimental Peaks: Region A contains those theoreticals that are absent from the experimental spectrum, Region B contains matched peaks and Region C contains the unaccounted experimentals.
8-1 Schematic of Key Algorithmic Components
8-2 Model I: Basic Fragmentation Tree. A single break produces only prefixes and suffixes.
8-3 Model II: A partial Model II fragmentation tree showing the second stage of cleavage for two Model I leaves. When two breaks are possible, immonium and internal ions are added to the repertoire of fragment types.
8-4 Model III: Each Model II leaf may express a variant and the decision process is shown here only for two leaves from Figure 8-3.
9-1 Overview of Matrix Layout: Ideally, every matrix entry would be a parameter, but only a few regions have been parameterized and singled out for estimation - the non-break tendencies, and the Histidine (H) and Proline (P) residue dependencies. All entries within the same shaded region are assumed to have the same likelihood in the current model. Entries marked NA are not possible.
9-2 PMF for Angiotensin Using an Untrained Model
9-3 PMF for Angiotensin Using a Trained Model
9-4 Trained Model Scores of Sequences Guesses for Angiotensin: Datasets 0123 (points 0-40 along the x-axis), 0119 (40-60), 1205 (80-120) and 0121 (120-160). Scores at zero were inserted to separate each dataset.
9-5 Trained Model Scores of Sequences Guesses for Bradykinin: Datasets 0220 (points 0-40) and 0218 (40-80). Scores at zero were inserted to separate each dataset.
10-1 Simulated Annealing Moves for 0119 Dataset. The x-axis represents the progress of the search, and the y-axis is the score of each successive move.
10-2 Simulated Annealing Moves for 0220 Dataset. The x-axis represents the progress of the search, and the y-axis is the score of each successive move.
12-1 Removal of Lowest Intensity Peaks
12-2 Removal of Highest Intensity Peaks
12-3 Removal and Subsequent Addition of Noise Peaks
A-1 Basic Residue Structure: The side-chain R of a residue hangs off of the α-carbon. An amide bond joins the α-carboxyl group of one residue to the α-amino group of an adjoining residue polymerizing multiple basic residues into a peptide.
C-1 PSD for 0123 Angiotensin
C-2 PSD for 0119 Angiotensin
C-3 PSD for 1205 Angiotensin
C-4 PSD for 0121 Angiotensin
C-5 PSD for 0220 Bradykinin
C-6 PSD for 0218 Bradykinin
C-7 Manufacturer PSD for Angiotensin
D-1 Graphical Representation of Residue Relationships Between Unknowns
Abbreviations and Acronyms
CID       Collision-Induced Dissociation
ESI       Electro-Spray Ionization
FAB       Fast Atom Bombardment
MALDI     Matrix-Assisted Laser Desorption/Ionization
MS        Mass Spectrometry
MS/MS     Tandem Mass Spectrometry
PSD       Post-Source Decay
TOF       Time-Of-Flight
m/z       mass per charge ratio
P         peptide
S         mass spectrum
f_i       ith ion family
M+H       mass of the original peptide that is singly protonated
P()       probability
m_i       mass of peak i
h_i       height/intensity/abundance of peak i
F()       probability mass function (PMF)
pico      10^-12
femto     10^-15
atto      10^-18
Chapter 1
Introduction
Proteins are essential to life, playing key roles in all biological processes: from enzymes
that catalyze reactions, to antibodies in an immune response, from messengers in signaling
pathways that allow a cell to react to stimuli, to secreted messengers that effect extracellular
changes, and much more. Such is the importance of protein function to the survival of any
organism.
One of the first steps in understanding a protein is discovering its primary structure. Knowledge of the primary sequence characterizes the protein, offering a glimpse of what it does
(its role and functionality), where it does it (its targeted destination) and how it does it (its
active sites, specificity and structural motifs). Protein sequencing is the process by which
this primary structure, the identity of each amino acid residue in order of appearance from
one terminus to the other, is enumerated.
A protein can be easily sequenced using the genetic code if the corresponding cDNA sequence
is known. If, on the other hand, the genomic DNA were available, one would not be
able to predict the amino acid sequence with 100% certainty because post-transcriptional
and post-translational modification events cannot be completely predicted from a genomic
sequence [BS87, Yat85]. If, however, one were able to somehow correctly deduce four or five
consecutive residues, then one might find sequence information by probing a cDNA library
or searching a database of previously sequenced proteins.
With a newly-discovered protein, however, sequence information must be determined from
the actual protein itself. This process is said to be de novo, done "from scratch", since
no database contains an entry to look up and no genetic information may be available to
consult.
Mass spectrometry has been explored as a possible tool for aiding in de novo sequencing.
It is a fast and convenient means for sorting a mixture of molecules by mass and reporting
the result as a histogram of masses and counts for each mass. Tandem mass spectrometry
is based on a two-stage mass spectrometer: molecules are sorted by the first stage, only
those molecules of a particular specified mass are allowed to pass into the second stage,
non-specific fragmentation of the selected molecules produces various daughter ions, and
these fragments are sorted by the second stage yielding the desired tandem mass spectrum.
Various de novo computational approaches for sequencing from tandem mass spectra have
been proposed in the literature. They take as input the tandem mass spectrum and the mass
of the peptide to be sequenced, and all have a common strategy that involves some means
for generating a guess for the sequence and some means for evaluating how good each guess
is. Scoring functions in the literature incorporate factors that are reasonable properties of
good matches, but the choice of these factors and how they are combined is often based on
empirical observations and arbitrary decisions. Only Dancik et al. [DAC+99b, DAC+99a]
have attempted to build a more formal framework for reasoning about the scoring function and what a score means. Nonetheless, all approaches are sensitive to the quality of
the observed spectrum, and encounter difficulties when the input spectrum is noisy and
incomplete. And despite the existence of these algorithms, the majority are not widely
used [DAC+99b].
This thesis serves as a proof of concept for a new de novo sequencing algorithm. An overview
of our approach is as follows:
Fragmentation Model We propose a probabilistic model of peptide fragmentation. Given
a peptide sequence, this model produces a probability distribution that describes the
masses of all possible ions that can result when a single peptide molecule of this
sequence is fragmented.
Scoring Function Given this distribution of outcomes for an individual trial, we derive
a multinomial-based scoring function, and if we consider a tandem mass spectrum as
the cumulative outcome of many such independent trials, then we can compute the
probability that a particular set of outcomes is observed.
Searching Method A search strategy is formulated based on simulated annealing, a well
known combinatorial optimization technique [KGJV83].
Simulated annealing efficiently samples the space of possible sequence guesses, identifying the one that scores
optimally.
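As a rough illustration of the scoring idea above, the following sketch computes the log-probability of a vector of observed counts under a multinomial distribution. It is not the scoring function derived later in this thesis: the function name, the binning of outcomes into categories, and the example probabilities are all assumptions made for illustration.

```python
import math


def multinomial_log_prob(counts, probs):
    """Log-probability of observing `counts` in n independent trials,
    where each trial lands in category i with probability probs[i]."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)  # log(n!)
    for c, p in zip(counts, probs):
        if c == 0:
            continue  # a category that was never observed contributes nothing
        if p == 0.0:
            return float("-inf")  # observed an outcome the model forbids
        log_p += c * math.log(p) - math.lgamma(c + 1)
    return log_p


# Hypothetical example: two outcome categories, three trials.
score = multinomial_log_prob([2, 1], [0.5, 0.5])
print(f"log-probability: {score:.4f}")
```

Working in log space avoids underflow when many trials are accumulated, which is the situation a spectrum built from many laser pulses would present.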
In this thesis, we show that if the model is good and the data is sufficient, then the real
sequence scores optimally, and simulated annealing, under appropriate searching conditions,
will find this sequence. A simple model was sufficient to correctly predict the sequence of
short peptides, up to mass equivalence and initial dipeptide inversion, with some tolerance
of noise and gaps in the datasets we had available.
This thesis is organized into three major parts. The first is an introduction to mass spectrometry(Chapter 2) and protein fragmentation(Chapter 3), and may be safely skipped if
the reader is familiar with these physical/chemical processes.
The second part begins with definitions and concepts(Chapter 4) used in this thesis, and
then surveys existing de novo sequencing approaches(Chapter 5), with an emphasis on those
computational approaches that involve the use of mass spectra(Chapter 6).
The last part of this thesis contains our main contributions. After some remarks on the
sequencing problem and issues that a solution should address(Chapter 7), we propose a new
approach to de novo sequencing(Chapter 8), and test the model and scoring function (Chapter 9), the searching strategy(Chapter 10), and the predictions of our algorithm(Chapter
11). Chapter 12 discusses the sequencing of longer peptides and the effects of dataset size,
and finally, Chapter 13 concludes this thesis.
Chapter 2
Mass Spectrometry
For more than a decade now, mass spectrometry(MS) has been used extensively for studying
biopolymers such as proteins, oligonucleotides and carbohydrates.
It is a useful tool for
molecule detection, structural analysis, compositional analysis and more recently, sequence
prediction. For example, it can be used to determine the molecular weight of a molecule,
and it can be used to compare the weight of a peptide product to that predicted by a gene
sequence for verification or for intron/exon and modification discovery purposes. Central
to MS is the ability to obtain a "mass signature" of a given sample - namely, a spectrum that
tabulates the range of masses present and the abundance (also called height or intensity) of
each such mass.
There are two major components of a mass spectrometer: the ionization technique and
the mass analyzer. The ionization method (e.g. Fast Atom Bombardment (FAB), ElectroSpray Ionization (ESI) and Matrix-Assisted Laser Desorption Ionization (MALDI)) takes
the biopolymer of interest and forms gas-phase charged ions. Different ionization methods
often produce different species of ion fragment types. The mass analyzer portion of the
mass spectrometer allows for selection and analysis of the ions created by the ionization
process. Examples of mass analyzers are the triple quadrupole, quadrupole ion trap and the
time-of-flight(TOF) mass spectrometers.
MALDI-TOF, MALDI ionization combined with a TOF analyzer, is a widely-used combination [Zen97b] because of its attomole to picomole sensitivity, tolerance of mixtures and
salt conditions, and ability to handle low-purity protein samples in excess of 300,000 Da in
size [Kau95, FS96] (a practical limit; time-of-flight mass spectrometry imposes no theoretical limit [Sch97], although the error in mass measurement increases for higher masses). We
also chose to work with MALDI-TOF because of the availability/accessibility of MALDI-TOF hardware and its ease of use (only a relatively short training time is needed before
a neophyte can begin to acquire spectra, although spectra quality increases with skill and
experience [Sch97]).
Since we are interested mainly in the output spectrum and the rules governing fragmentation
are largely unknown and incomplete(see Section 13.1.2), we treat the mass spectrometer as
a black box. However, it will be useful to give a high-level description of MALDI-TOF to
convey the basic intuition behind how the process works. For more thorough treatments,
see [ZGW95, BC96, Yat85, BBG96, Kau95].
2.1
Overview of MALDI MS
2.1.1
Sample Preparation
Sample preparation is largely an experimental art rather than a methodical science, so there
are countless variations to this recipe [BC96]. In general, the biopolymer analyte of interest
is combined with an excess of a compound called the "matrix", usually a compound of low
molecular weight, diluting the analyte and producing a sample that is slightly acidic (pH
less than 4 [BC96]). The choice of matrix is one variable in the recipe, determined by such
factors as the solubility of the matrix and analyte, its absorptive spectrum, and the inability
of matrix and analyte to react to form a stable product [Bea92].
Unfortunately, which
compounds make good matrices are often discovered only by "trial-and-error" [HKBC91,
FS96].
Nevertheless, whatever is chosen as the matrix must serve its primary function
well - it must successfully facilitate the transfer of charge to analyte molecules during the
ionization process.
Figure 2-1: Linear Mass Analyzer (schematic labels: laser, prism, sample plate, ion source, gas plume, charged fragments, accelerating grids, flight tube, detector, MS spectra)
2.1.2
Desorption and Ionization of Analyte
Once the sample is prepared, it is deposited onto a sample plate and allowed to dry. The
plate is then inserted into the mass spectrometer, and is illuminated by a laser. When the
pulsed laser hits the solid crystalline matrix-analyte sample (see Figure 2-1), a small area
is vaporized, and charged molecules are formed in the ensuing gaseous plume.
The exact ionization mechanism is unknown [Bal95, Zen97a, Zen97b], but it is believed that
the matrix enhances ion formation, by somehow absorbing the laser's energy and imparting
it to the analyte. It is thought to serve as a hydrogen source for protonation [BCC91], and
use of a matrix appears to facilitate ionization, expanding the range of masses that can be
easily ionized from about 1 kDa in early non-matrix MS, to in excess of 300 kDa [HKBC91,
FS96].
2.1.3
Ion Separation and Detection
Charged analyte ions are extracted from the plume and propelled into the mass analyzer
portion of the mass spectrometer by an electric field (typically 1-30kV [HKBC91]) that
accelerates them to a constant kinetic energy.
Although each ion has the same kinetic energy, the velocity of each ion is inversely proportional to
$\sqrt{m/\sigma}$, where $m$ is the mass of the particle and $\sigma$ is the number of charges it carries.
Because of this, smaller mass ions travel at a higher velocity than those of larger mass, and
the collection of ions can be separated by mass while in flight.
A detector is situated at the end of the flight path, and the idea is to infer each ion's
mass based on its flight time. This principle is the hallmark of time-of-flight (TOF) mass
analyzers¹.
Extremely short laser pulses (e.g. 1-100 ns), highly focused laser bursts (e.g.
10-300 µm) and the fact that the start of ion production can be triggered by the firing of the
laser make MALDI highly compatible with TOF analysis because ions can be considered to
have originated from a "point source in time and space" [HKBC91].
Thus, as each molecule in the ion train collides with the detector, the mass is derived from
its flight time, and the collision event is recorded by incrementing a count that keeps track
of the abundance of each mass.
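The relationship between mass and flight time can be sketched numerically. The code below assumes an idealized linear instrument: ions start at rest, gain energy from a single accelerating potential in proportion to their charge, and then drift through a field-free tube of fixed length. The function names, the 20 kV potential and the 1 m tube length are illustrative assumptions, not settings used in this work.

```python
import math

DALTON_KG = 1.66053906660e-27        # one dalton in kilograms
ELEMENTARY_CHARGE_C = 1.602176634e-19  # elementary charge in coulombs


def flight_time(mass_da, charge, voltage_v, length_m):
    """Drift time (s) of an ion with the given mass (Da) and charge count."""
    mass_kg = mass_da * DALTON_KG
    energy_j = charge * ELEMENTARY_CHARGE_C * voltage_v  # kinetic energy gained
    velocity = math.sqrt(2.0 * energy_j / mass_kg)
    return length_m / velocity


def mass_from_time(time_s, charge, voltage_v, length_m):
    """Invert flight_time: recover the mass (Da) from a measured drift time."""
    velocity = length_m / time_s
    energy_j = charge * ELEMENTARY_CHARGE_C * voltage_v
    return 2.0 * energy_j / velocity**2 / DALTON_KG


# Hypothetical settings: 20 kV accelerating potential, 1 m flight tube.
for mass in (500.0, 1000.0, 2000.0):
    t = flight_time(mass, 1, 20000.0, 1.0)
    print(f"{mass:7.1f} Da -> {t * 1e6:.2f} microseconds")
```

Under these assumptions, quadrupling the mass doubles the flight time, which is why lighter ions reach the detector first.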
What results is a spectrum of peaks that is a histogram distribution of the masses (mass
per unit charge) of all analyte ions dislodged by the impact of a single laser pulse and
detected by the machine. To improve signal-to-noise, the typical output that one obtains
from a mass spectrometer is actually the aggregate sum or an average of spectra from a
number of laser pulses (at least 50) [BC96].
¹Other types of analyzers include: triple quadrupole, hybrid mass, Fourier-transform, ion-trap, and
magnetic-sector.
2.1.4
Useful Improvements and Variations
The arrangement of a simple straight-line flight path from an ion source, through a mass
analyzer, to a detector is characteristic of a linear analyzer. The spectrum produced by a
linear analyzer may contain peaks with poor resolution. This is due to the fact that the
ionization energy imparted by the laser cannot be precisely controlled. Thus fragments of
the same mass have slightly different energies, exhibit a variance in flight time and cause the
detector to register an event at mass values that are slightly shifted from the actual. Use
of an analyzer in reflector mode and use of delayed extraction are two means to partially
counter this.
Operating in reflector mode instead of linear mode allows for some deviations in ion velocity
to be corrected, resulting in narrower peaks and higher mass resolution². The reflectron is
implemented using an electrostatic mirror which is discussed further in Section 2.2. Typical
mass spectrometers have an accuracy of about 0.1% [Yal00] and a resolution of 300-500 in
linear mode [HKBC91], and an accuracy of 0.05% [Yal00] and a resolution of 1200-3000 in
reflector mode [HKBC91] (another source claims up to 4000 for peptides up to 10 kDa on
TOF reflectron instruments [KH93]).
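To make these figures concrete, the sketch below converts a percentage accuracy into a mass window and computes resolution as a peak's mass divided by its width at half intensity. The function names and the example peak width are assumptions chosen for illustration.

```python
def mass_error_window(mass_da, accuracy_percent):
    """Range of masses consistent with a reading at the given accuracy."""
    delta = mass_da * accuracy_percent / 100.0
    return mass_da - delta, mass_da + delta


def resolution(mass_da, width_at_half_height_da):
    """Mass resolution: peak mass divided by its width at half intensity."""
    return mass_da / width_at_half_height_da


# A 1000 Da peak at linear-mode (0.1%) and reflector-mode (0.05%) accuracy.
print(mass_error_window(1000.0, 0.1))
print(mass_error_window(1000.0, 0.05))
# A hypothetical 1000 Da peak that is 2 Da wide at half height.
print(resolution(1000.0, 2.0))
```

At 0.1% accuracy a 1000 Da reading is uncertain by about 1 Da either way, which is already comparable to the spacing between some residue masses.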
Delayed extraction is another means for correcting variations in ion velocities. After desorption, faster ions move farther from the surface, so the position of an ion in the cloud is
determined by its initial escape velocity off the surface [Mur96]. By introducing a submicrosecond (typically 100-300nsec [CM98]) delay after the laser pulse but before the acceleration voltage is applied, slow ions are allowed to effectively "catch up" to the faster ions.
This has been found to improve both mass accuracy and resolution [Yat85].
Once again, the basic idea behind MS is the ability to separate molecules by mass into a
spectrum. With tandem mass spectrometry, this principle is built upon to obtain spectra
of a slightly different nature.
²Two measures of performance are useful to mention: mass accuracy and mass resolution. Mass accuracy
indicates the percentage by which the experimental mass as detected by the mass spectrometer is off from the
actual theoretical mass. Mass resolution, $m/\Delta m$, reflects the mass spectrometer's ability to separate two peaks
that are close in mass, and is calculated by dividing a peak's mass by its peak width at the half-intensity
point.
2.2
Tandem Mass Spectrometry (MS/MS)
Molecules can sometimes break up into pieces or fragments, either spontaneously or by artificial induction,
during the post-desorption pre-detector time frame. These fragments (called
daughter ions), though smaller in mass, still travel with the same velocity as the original intact molecule (called the precursor or parent), and hence would hit the detector at
the same time as the parent. It would be useful to be able to examine the spread of daughter
fragments by targeting them for MS, producing a second spectrum that depicts the spread
of daughter ion masses and their abundances. This is known as tandem mass spectrometry
and it yields the information that we are interested in for protein sequencing.
Figure 2-2 depicts a schematic of how this works. One can imagine a gate at the end of
the first MS stage that stays closed, allowing no ions to pass beyond it. If a particular
mass, called the "parent ion mass" or "precursor ion mass", is the desired target mass, the
time at which it reaches the gate can be computed and programmed into the timed-ion
selector. When the timed-ion selector is activated at this desired point in time, the gate
is opened, allowing molecules arriving at the gate during this time frame to pass through.
This has the effect of homing in onto one particular peak of the spectrum that would
have been produced from a single stage of MS. All other peaks, such as fragments due
to other molecules, contaminants and matrix molecules, are ignored. The gate acts as a
filtering mechanism allowing only ions flying at the specific target time to pass through into
a second stage of MS which then separates these fragments according to their individual
masses. The end result of MS/MS is thus the spectrum resulting from the daughter ions
that have arisen from fragmentation of the precursor ion.
With MALDI-TOF analyzers, fragmentation can occur from Post Source Decay(PSD). Ions
are stable long enough to survive extraction from the source, but then dissociate into smaller
fragments before reaching the detector. Dissociation can occur within the dense analyte
plume or while in flight through the flight tube. It is the fragment ions produced from PSD
that we are interested in and will allow us to sequence the peptide.
In addition, one may use Collision Induced Dissociation (CID) to enhance fragmentation
by injecting inert gas particles into the ions' flight trajectory towards the detector. A
parent analyte molecule fragments when it collides with a gas particle. However, CID
tends to produce a repertoire of fragment types different from those obtained by PSD (see
Section 3.1), and our work does not take them into account.
Figure 2-2: Tandem Mass Spectrometer: (A) only molecules of the desired parent mass
are allowed to pass through the timed ion selector, (B) post source decay continues, (C)
fragment ions are detected (schematic labels: laser, prism, gas plume, accelerating grids, flight tube, mirror, detector, MS/MS spectra)
When an ion fragments, all its pieces travel at the same velocity as the original intact parent
ion. Although their velocities are the same, their kinetic energies are not. The kinetic energy
of each daughter fragment is equal to the kinetic energy of the parent times the ratio of the
daughter mass to the parent mass [KSL93]. The daughter fragments can be separated by
using the reflector mode mirror which serves to slow, deflect and re-accelerate ions towards
the detection apparatus.
The net effect is that ions with higher kinetic energies will be
allowed to traverse a slightly longer trajectory because ions with high kinetic energy (and
hence high mass) penetrate farther into the mirror and arrive at the detector later than
those ions with lower kinetic energy.
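The energy relationship from [KSL93] quoted above is simple enough to state as code. The sketch below assumes only that the daughter keeps the parent's velocity; the function name and the 20 keV parent energy are illustrative assumptions.

```python
def daughter_kinetic_energy(parent_energy, daughter_mass, parent_mass):
    """Kinetic energy of a daughter ion that keeps the parent's velocity.

    Since E = (1/2) m v^2 and v is unchanged, energy scales with mass.
    """
    return parent_energy * daughter_mass / parent_mass


# Hypothetical example: a 20 keV parent of mass 1296.68 and a daughter of mass 534.24.
energy = daughter_kinetic_energy(20000.0, 534.24, 1296.68)
print(f"daughter kinetic energy: {energy:.0f} eV")
```

Because energy is proportional to mass at fixed velocity, sorting the daughters by how far they penetrate the mirror is equivalent to sorting them by mass.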
This action of the mirror depends upon its hardness, which is governed by a parameter
called the "mirror ratio". Different ratio settings allow for selective deflection, enabling a
particular mass range of peaks to be detected. A lower mirror ratio facilitates the detection
of low-energy (and hence low-mass) fragments, and a higher mirror ratio that of higher-mass fragments.
The spectrum collected at a particular mirror ratio, which we refer to as a "stitch", contains
peaks that are correctly focused for a particular mass interval. To produce one complete
"PSD composite" (a spectrum correctly focused for the entire mass range), spectra must
be collected at several mirror ratio settings chosen so that the mass range covered by the
stitches, taken collectively, is enough to span the entire range of the analyte's molecular
weight. These stitches (rather, the portion of each stitch for which the peaks are focused) are
then stitched together to arrive at the desired composite. Note, however, that the conditions
under which each stitch was acquired may be slightly different (e.g. laser intensity). The
final PSD spectrum, then, is basically a concatenation of individual stitches obtained at
different mirror ratios, each of which is an aggregate sum of several laser pulses.
In this
manner, since one can only focus on a portion of the entire mass range for each stitch, each
stitch, and hence the final spectrum, is only a subset sampling of all fragments produced
during the process.
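The stitching procedure can be sketched as follows. This is a simplified illustration rather than the instrument's actual software: it assumes each stitch comes with a known focused mass interval, that the intervals do not overlap, and that a spectrum is a list of (mass, intensity) pairs. All names are hypothetical.

```python
def stitch_composite(stitches):
    """Build a composite spectrum from per-mirror-ratio stitches.

    `stitches` is a list of (low_mass, high_mass, peaks) tuples, where
    [low_mass, high_mass) is the interval in which that stitch is focused
    and `peaks` is a list of (mass, intensity) pairs.
    """
    composite = []
    for low, high, peaks in stitches:
        # Keep only the portion of each stitch that is correctly focused.
        composite.extend((m, h) for m, h in peaks if low <= m < high)
    composite.sort()
    return composite


# Two hypothetical stitches; out-of-focus peaks (600.0 and 450.0) are discarded.
stitches = [
    (0.0, 500.0, [(100.0, 5), (600.0, 1)]),
    (500.0, 1000.0, [(450.0, 2), (700.0, 9)]),
]
print(stitch_composite(stitches))
```

Note that intensities from different stitches are not rescaled here, even though acquisition conditions such as laser intensity may differ between them.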
For the most part, we will consider the mass spectrometer as a black box - given a peptide
as input, a spectrum of fragment peaks is produced as output. Peptide fragmentation is
discussed in the next chapter.
Chapter 3
Protein Fragmentation
Proteins are polymers made up of amino acid building blocks called residues. Appendix A
contains the twenty basic amino acids¹, including their one letter codes, molecular weights,
and frequencies. Of these, two pairs of amino acids have the same or nearly the same mass: leucine and isoleucine are isomers with identical masses, and glutamine
and lysine differ by less than 0.05 Da. Hence, sequencing algorithms based solely on mass will be
unable to distinguish between members of these pairs.
Amino acid sequences called peptides (or proteins) are strings taken over this alphabet of
residues, and are written with the residue's one letter code, starting from the N terminus
to the C terminus when reading left to right.
When a peptide undergoes MALDI PSD fragmentation, one or more bonds within the
molecule break, allowing the molecule to dissociate into two or more daughter pieces. Given
one of these pieces, the bonds which were broken to create it determine the type of daughter
fragment that results.
¹Aside from the basic 20, other nonstandard amino acids exist (Table 2 of [FHM+93]), but they will not
be considered in this thesis.
3.1
MALDI-PSD Fragment Types
A variety of fragment types can be produced when a peptide undergoes fragmentation, and
we classify them as follows:
1. Series Ions
2. Internal Ions
3. Immonium Ions
4. Parent Ion
5. Neutral Loss/Gain Variants
The first three result from backbone bond cleavages. The last is a variation that can occur
with any of the first four. Examples of the first four are illustrated in Figure 3-1 for a
peptide of length 4. The fragment names are due to [RF84].
3.1.1
Series Ions
Series ions are those fragments that are prefixes and suffixes of the parent peptide, i.e. they
contain either the original amino(N)- or the original carboxyl(C)-terminus respectively. An
ion is named by which bond was broken to produce the fragment, and which terminus of
the original peptide was retained.
Since there are three bonds that can be broken (the
NH-Cα bond, the Cα-CO bond and the peptide bond) and two fragments that can result
when one of these is broken (a prefix and a suffix), there are six different possible series ions
(the A, B, C, X, Y, Z ions); however, MALDI PSD usually only generates A, B and Y ions (see
Figure 3-1).
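A minimal sketch of how the A, B and Y series masses can be computed for a given sequence is shown below. It assumes singly protonated ions and standard monoisotopic residue masses, which may differ from the values tabulated in Appendix A; the function name and the example peptide are hypothetical.

```python
# Standard monoisotopic residue masses (Da); an assumption, see lead-in.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056
CO = 27.99491


def series_ions(peptide):
    """Masses of the A, B and Y ions for each internal cleavage position.

    Index i of each list corresponds to the cleavage after residue i+1,
    so B[i] and Y[i] are the complementary prefix and suffix.
    """
    total = sum(RESIDUE_MASS[r] for r in peptide)
    ions = {"A": [], "B": [], "Y": []}
    prefix = 0.0
    for residue in peptide[:-1]:
        prefix += RESIDUE_MASS[residue]
        b_ion = prefix + PROTON
        ions["B"].append(b_ion)
        ions["A"].append(b_ion - CO)  # an A ion is a B ion minus CO
        ions["Y"].append(total - prefix + WATER + PROTON)
    return ions


for name, masses in series_ions("GAS").items():
    print(name, [round(m, 2) for m in masses])
```

For every cleavage position the B and Y masses sum to the protonated parent mass plus one proton, which is the relationship that lets one series be inferred from the other.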
3.1.2
Internal Ions
Internal ions represent other subsequences of the peptide that are not prefixes or suffixes.
These generally result from double cleavage of the peptide. The three types of internal ions
that we will be concerned with (YB, YA and XB ions) are named for the types of cleavages
that produced their ends [Hin97].
Figure 3-1: Peptide Fragment Ions Common to MALDI-PSD. Recall that a fragment must
be positively charged to be detected - the H+ accompanied by a bracing line indicates that
a proton has affixed itself to some part of the molecule encompassed by the bracing line.
(Panels: Parent Ion; Family Ions, with A ion and B ion prefixes and Y ion suffixes; Internal Ions, with YA ion and YB ion; Immonium Ion.)
3.1.3
Immonium Ions
Immonium ions are internal ions of length one, and hence are the class of ions of the lowest
mass. Their presence and even their absence can provide hints of the amino acid content
of the peptide. For example, no peak at mass 70 means that proline is definitely absent
from the sequence, and a peak at mass 61 means that methionine is present [FHM+93].
In some cases, certain combinations of low mass peaks with certain threshold intensities
(strong/medium/weak) have been found to be indicators of a residue's presence. But what
these combinations are and how to interpret them is still an area of investigation [FHM+93].
No comprehensive systematic study of immonium ions and sequence correlation has yet been
done for PSD, but some general observations (for PSD and CID spectra) have been made
in [SC98, FHM+93, MB94, Pap95].
Note that the immonium ions of the N-terminal residue are actually prefix series ions.
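The low-mass peaks mentioned above follow from the residue masses. The sketch below computes an immonium ion mass as the residue mass minus CO plus a proton, using standard monoisotopic values (an assumption; Appendix A may list different values) for a handful of residues. The function name is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for a few residues; assumed values.
RESIDUE_MASS = {
    "P": 97.05276, "H": 137.05891, "F": 147.06841,
    "Y": 163.06333, "L": 113.08406, "I": 113.08406,
}
PROTON = 1.00728
CO = 27.99491


def immonium_mass(residue):
    """Mass of the singly charged immonium ion of one residue."""
    return RESIDUE_MASS[residue] - CO + PROTON


for residue in "PHFYL":
    print(residue, round(immonium_mass(residue), 2))
```

Proline comes out near 70 Da, consistent with the mass-70 peak discussed above, while leucine and isoleucine again share a single value.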
3.1.4
Parent Ion
The parent ion is the mass of the intact peptide of interest. It is also the fragment whose
mass is selected during the first MS stage to continue onto secondary fragmentation. This
ion is the maximal Yion, since it is a suffix ion containing the original C terminus. Its mass
is also the same as that of the maximal Bion with the addition of an extra water molecule. The masses are the same, but whether or not it can be classified as such in reality depends
on whether or not this B ion is allowed to neutrally gain water.
The actual or real sequence of an experimentally obtained spectrum refers to the parent ion,
the peptide that actually produced the spectrum.
3.1.5
Neutral Loss/Gain Variants
Unlike the fragment types discussed so far, the neutral loss/gain variations are not the result
of peptide backbone bond cleavage, but rather, they involve the amino acid side-chains. Ions
may exhibit a loss of ammonia, a loss of water, or a gain of water, and these are notated as
"m17", "m18" and "p18" respectively, where 17 and 18 represent adjustments due to NH3
and H2O, and the letter "m" denotes a loss (minus) and "p" a gain (plus).
The literature offers an assortment of rules for when a variant is allowed and when it isn't
(see Section 3.1.5 and 8.2.3). The exact mechanism for how these variants are produced
is not known (e.g. Is the B ion a precursor to a Bm17 ion?). But whether or not an ion is
capable of a neutral loss/gain variant is dependent upon the amino acid composition, the
location of particular residues, and even the fragment type of the ion. These observations
are likely based on empirical studies; no conclusive systematic study has been done to
investigate/explain/unify the different observations various researchers have made.
3.1.6
Other Fragment Types
Numerous other fragment types exist². They are less likely to appear in MALDI-PSD spectra and we do not concern ourselves with them in our work. Some of them are listed below to
illustrate the range of different fragment types that can be produced and to emphasize that
there are likely to be other fragment types that haven't yet been discovered, so identifying
the origin of every peak appearing in a spectrum is frequently not possible.
²More than 20 ion types have been identified for high-energy CID [HFBG92] (actually in Martin, S. and
Biemann, K., Int J Mass Spectrom Ion Processes, 1987, 78, 218-228).
Multiply Protonated Ions In order for a fragment to reach the detector, it must be an
ion (it must be charged). MALDI PSD ions are usually singly protonated molecules,
but with other ionization methods, such as ESI, multiply protonated molecules are a
common occurrence. Multiply charged peaks can provide an additional "dimension"
of information, especially if the singly charged fragment is not present.
Isotope Peaks A particular fragment ion may appear as a spread of peaks called isotope
peaks that are at consecutive masses approximately 1 dalton apart (the mass of a
neutron). This "mountain range" of peaks is most easily seen when the parent mass
peak is magnified.
These peaks are due to the presence of isotopes, and all five
elements found in proteins - C,H,N,O,S - have naturally occurring isotopes. Because
the probability of finding an isotope is low for small peptides, the leftmost peak is
typically the highest in intensity and the heights of the other peaks in the range
typically decrease rapidly with distance. This will not be true for longer peptides (see
Section 13.2.1).
Alkalinated Ions Impurities such as K+ and Na+ can attach to fragments displacing
peaks from their original mass [OSTV95].
Adduct Peaks Adduct peaks can result when portions of the matrix (that break down
into reactive components, for example) attach to a fragment (often the parent ion),
causing a peak that is greater than the parent mass to appear [HKBC91, OSTV95].
Side Chain Ions Fragmentation of the amino acid side chain can also occur, producing
fragments such as the D, V and W ions which are useful for telling isoleucine apart
from leucine. "R-group losses" for CID spectra can be found in Table III.10 of [Joh88].
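The remark under Isotope Peaks, that the leftmost peak dominates for small peptides but not for longer ones, can be illustrated with a deliberately simplified model. The sketch below considers carbon only, treats each carbon atom as independently being carbon-13 with a natural abundance of about 1.07%, and ignores the isotopes of H, N, O and S. The function name and the carbon counts are assumptions for illustration.

```python
import math

C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def carbon_isotope_pattern(n_carbons, n_peaks=4):
    """Relative heights of the first few isotope peaks (carbon-only model).

    Entry k is the binomial probability that exactly k of the carbon
    atoms are carbon-13, i.e. the peak k daltons above the leftmost one.
    """
    p = C13_ABUNDANCE
    return [
        math.comb(n_carbons, k) * p**k * (1.0 - p) ** (n_carbons - k)
        for k in range(n_peaks)
    ]


# Hypothetical carbon counts for a short and a much longer peptide.
for n_carbons in (50, 150):
    pattern = carbon_isotope_pattern(n_carbons)
    print(n_carbons, [round(x, 3) for x in pattern])
```

In this model the leftmost peak is the tallest for the 50-carbon case, whereas for 150 carbons the second peak overtakes it.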
In addition, there are other fragments that we haven't mentioned that possibly involve
rearrangements [TBG90] and probably others that are yet to be discovered. With such a
wide range of fragment ions possible, what are the rules or laws that govern the formation
of all these ions? Are certain fragments more likely or unlikely to occur? How does the
actual sequence and position of its residues affect fragmentation?
3.2
Fragmentation Rules
There is still a great deal about the fragmentation process that is yet to be satisfactorily
explained. Some things are known, though sometimes this knowledge is limited. For example, the chemical structures of the various fragment types(see Figure 3-1) are known but
the mechanisms for how they form are not. Although some schemes have been postulated
(e.g. Scheme III.4 of [Joh88] for A ions in CID spectra; p.455 of [Bie90] for CID B and Y
ions), observations are sometimes conflicting and often incomplete; no general unifying set
of rules has yet been put forth.
The extent of fragmentation - the intensity of ion formation and the fragmentation types
observed - is probably dependent on several factors. These include:
Sample Preparation According to [BC96], sample preparation is the "key to successful
analysis".
This includes matrix choice [KH93], the relative concentration of matrix to
peptide (and to internal calibrant, if any), etc.
Type of Mass Spectrometer Different instruments produce different spectra. Collisions,
and hence fragmentation, are more likely to occur with CID, for example. It is generally
accepted that the MALDI-PSD fragment types are a subset of those present in high
energy CID spectra [GME+95]. Although not as well studied and understood as CID
spectra [ASR96, Bie90], PSD requires less energy but more time for fragmentation to
occur [Hin97].
Spectrometer Settings Laser intensity, probably the most influential setting [KH93],
affects the potential amount of energy imparted to the precursor ion. Other settings
such as the guide wire voltage, mirror ratios, etc. affect different aspects of acquisition.
Fragment Stability Unstable products may decay too rapidly so that detection is not
possible.
Peptide Length N-terminal fragments seem to dominate the spectra of small peptides [JnC96]
while longer peptides seem harder to fragment even with higher laser energies [Fen91].
Effects of peptide chain length (possibly because of intra-molecular hydrogen bonding) and 3-dimensional peptide structure on fragmentation have been postulated by
several groups [CGMW96].
Sequence Dependencies Fragmentation is sequence-dependent; the identity of a residue
and its location can affect fragmentation in a number of ways:
Immonium Ions Certain amino acids are more likely to exhibit characteristic immonium ions than others; for example, His, Ile/Leu, Phe, Pro and Tyr seem to
express themselves strongly [FHM+93].
Residue Specific Tendencies Certain residues can be more reactive than others; for example, histidine and proline [LL95] exhibit more N terminal breaks than
most other residues. Additional supposed effects that various residues have on
spectra are given in [Pap95].
Protonation Sites Fragmentation is highly dependent on the availability and location of protonation sites [CGMW96]. These sites acquire the positive charge
which is prerequisite for detection, and their location determines which bonds
may break and which of the peptide pieces are ionized. Likely protonation sites
include the amino terminus, the carbonyl oxygen of peptide bonds, certain side
chains, and basic residues.
Basic Residues and their Location In general, basic residues (arginine, histidine
and lysine) are more readily protonated since the free electrons of the nitrogen(s) of the residue side chain are more apt to accept the H+ proton. The location and number of basic residues influence the type of ion fragments produced
([OSTV95, Bie90, CGMW96] for CID spectra). For this reason, when the basic
residue is at the N-terminus(C-terminus), N-terminal(C-terminal) fragments will
dominate the spectrum [MB94, LL95, KSL93]. Contrary to these observations,
one analysis of MALDI-PSD spectra seems to show that the placement of a basic residue does not greatly affect the distribution of N-terminal and C-terminal
fragments [RYM95].
The factors are numerous and diverse, but they are a taste of the elements at play in the
larger fragmentation puzzle, of which only bits and pieces are known.
Chapter 4
Terminology and Concepts
At this point, we introduce some definitions that will be convenient to have and refer to in
later portions of this thesis.
4.1
Fragmentation and Spectra
Although there is a wide range of different known fragment ions, by fragment type we
mean any of the five ion types classified in Section 3.1. We will use the term variants to
refer to the neutral loss/gain of water/ammonia, and the set of all series ions and their
permissible variants are collectively referred to as the core ions.
The identity of an experimental peak consists of two pieces of information: a fragment type
and a peptide sequence. The mass of this peptide sequence with appropriate adjustments
made for the terminal groups of the fragment type explains the mass of the experimental
peak.
An experimental peak with more than one identity is said to have multiple identities since
there is more than one way to explain this peak mass.
Let peptide $P = r_1 \ldots r_n$ be a string of residues $r_i$. Let $p_i$, $1 \le i \le n$, refer to the position
between $r_i$ and $r_{i+1}$ (when $i = n$, $r_{i+1}$ is the C-terminal group). The ion family at $p_i$ is the
set of all fragments that directly testify to some cleavage event at $p_i$. The immediate ion
family associated with $p_i$, denoted $f_i$, consists of those family ions at $p_i$ that are core ions.
Given the parent mass and the mass of any immediate family ion, the masses of all other
immediate family members can be computed. Some of the mathematical relationships are
given below in terms of the mass of the Bion member; the remaining variant computations
can be derived similarly:
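\begin{align*}
\text{Aion} &= \text{Bion} - \text{CO} \\
\text{Bm17} &= \text{Bion} - \text{NH}_3 \\
\text{Bm18} &= \text{Bion} - \text{H}_2\text{O} \\
\text{Bp18} &= \text{Bion} + \text{H}_2\text{O} \\
\text{Yion} &= (M+H) + H - \text{Bion}
\end{align*}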
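These relationships translate directly into code. The sketch below assumes rounded masses for the adjustments (CO = 28, NH3 = 17, H2O = 18, H = 1), which is the precision suggested by the worked example in Section 4.2.2, and takes the parent mass to be the protonated mass (M+H). The function name is hypothetical.

```python
# Rounded adjustment masses; an assumption, see lead-in.
CO, NH3, H2O, H = 28.0, 17.0, 18.0, 1.0


def immediate_family(b_ion, parent_mh):
    """Masses of the immediate family members derived from the Bion mass.

    `parent_mh` is the protonated parent mass (M+H).
    """
    return {
        "Aion": b_ion - CO,
        "Bion": b_ion,
        "Bm17": b_ion - NH3,
        "Bm18": b_ion - H2O,
        "Bp18": b_ion + H2O,
        "Yion": parent_mh + H - b_ion,
    }


for name, mass in immediate_family(534.24, 1296.68).items():
    print(f"{name}: {mass:.2f}")
```

Each relationship can also be inverted, so an observed peak in any one role determines the Bion mass and hence the rest of the family.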
A spectrum is said to be representative if at least one family member from every immediate
family is present. Every position of the peptide is then represented in the dataset. A
spectrum contains a gap if there is at least one immediate family which is not represented.
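The two definitions above can be checked mechanically once the family masses for each position are known. The following sketch assumes that each immediate family is supplied as a list of expected masses and that a peak matches if it lies within a fixed tolerance; the function names, the tolerance and the example masses are all hypothetical.

```python
def find_gaps(families, peaks, tol=0.5):
    """Return the 1-based positions whose immediate family has no peak.

    `families[i]` lists the expected masses of the immediate family at
    position i+1; `peaks` lists the observed peak masses.
    """
    gaps = []
    for position, family in enumerate(families, start=1):
        matched = any(abs(peak - mass) <= tol for mass in family for peak in peaks)
        if not matched:
            gaps.append(position)
    return gaps


def is_representative(families, peaks, tol=0.5):
    """A spectrum is representative if every position is matched."""
    return not find_gaps(families, peaks, tol)


# Hypothetical families for three positions; position 2 is unmatched.
families = [[100.0, 72.0], [200.0, 172.0], [300.0, 272.0]]
peaks = [72.1, 300.2]
print(find_gaps(families, peaks))
print(is_representative(families, peaks))
```

In practice the real sequence is unknown, so a check of this kind is only possible when evaluating a candidate sequence or testing against a known peptide.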
An extended ion family for $p_i$, denoted $f_i'$, consists of the immediate family members $f_i$
plus certain internal ions which can also serve as evidence of a fracture at $p_i$. Internal
ions related to $p_i$ fall into two categories: those with $p_i$ at the C-terminus (peptides of
the form $r_k \ldots r_i$, $1 < k \le i$), and those with $p_i$ as the N-terminus (peptides of the form
$r_{i+1} \ldots r_k$, $i < k < n$).
4.2
Fundamental Graphs
We define the notion of a fundamental graph, but defer discussion of its uses until Chapter 6.
4.2.1 Purpose
The idea is to transform an input spectrum containing a diverse range of fragment types into
a list of fundamental peaks where each peak is of the same fragment type. Each fundamental
is a representative of a potential ion family, and a natural way to depict neighboring ion
families is with a directed acyclic graph where the nodes of the graph represent fundamental
fragment masses and a directed edge connects a node of lower mass to one of higher mass
if their masses are an amino acid distance apart. Since the edges along any path through
the graph define an amino acid sequence, this graph can then be traversed to enumerate all
possible sequence guesses consistent with the spectrum.
4.2.2 Construction
To build a fundamental graph, one needs to select a fragment type to serve as the fundamental and one has to select a subset of the family fragment types to be fundamental-generating
or f-generating roles. Each experimental peak is then considered in every possible allowed
f-generating role, and the corresponding fundamental mass is computed based on the mathematical relationships given above in Section 4.1.
In this manner, a single experimental
peak can give rise to a set of possible fundamentals, one fundamental per f-generating role.
For example, assuming the parent mass is 1296.68, if the Bion is chosen as the fundamental,
and the f-generating roles are { Aion, Bm17ion, Yion }, then the experimental peaks 506.24
and 763.51 each give rise to the following fundamentals: { 534.24, 523.24, 791.44 } and
{ 791.51, 780.51, 534.17 } respectively. Some algorithm would need to detect and propose that
534.24 and 534.17 correspond to the same fundamental, and we say that the experimentals
506.24 and 763.51 both support (and hence are supporters of) this fundamental¹. Lastly,
related fundamentals are a set of fundamentals (e.g. { 534.24, 523.24, 791.44 }) supported
by the same experimental peak (e.g. 506.24).
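The worked example above can be reproduced with a short sketch. The nominal offsets (H = 1.0, CO = 28.0, NH3 = 17.0) are an assumption: they are the values that reproduce the figures quoted in the text. The function names and the 0.3 Da agreement tolerance are likewise hypothetical.

```python
# Nominal offsets (assumed) chosen because they reproduce the figures in the text.
H, CO, NH3 = 1.0, 28.0, 17.0

# Each f-generating role maps an experimental mass to a Bion fundamental.
ROLE_TO_BION = {
    "Aion": lambda m, parent: m + CO,
    "Bm17ion": lambda m, parent: m + NH3,
    "Yion": lambda m, parent: parent + H - m,
}

def fundamentals(peak, parent):
    """One candidate Bion fundamental per f-generating role."""
    return {role: round(f(peak, parent), 2) for role, f in ROLE_TO_BION.items()}

def shared_fundamentals(peak_a, peak_b, parent, tol=0.3):
    """Pairs of fundamentals from two peaks that agree within tol (in Da)."""
    fa, fb = fundamentals(peak_a, parent), fundamentals(peak_b, parent)
    return [(x, y) for x in fa.values() for y in fb.values() if abs(x - y) <= tol]

print(fundamentals(506.24, 1296.68))
print(fundamentals(763.51, 1296.68))
print(shared_fundamentals(506.24, 763.51, 1296.68))
```

Under these assumptions two pairs agree, (534.24, 534.17) and (791.44, 791.51). The second pair is the complementary reading in which the roles of the two peaks are exchanged, which illustrates why a single peak can have multiple identities.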
Thus, a dataset of p peaks generates a graph with at most cp fundamental nodes, where
c is the cardinality of the set of f-generating roles. Two fundamentals are special and
of interest: the parent fundamental and the base fundamental. The parent fundamental
is the fundamental consistent with the parent mass interpreted as a Yion. Much as the
parent fundamental represents the entire peptide, the base fundamental represents the other
extreme, the null peptide, and its value depends on what fragment type has been chosen as
the fundamental. In the above example with the Bion as the fundamental, 1.0 would
be the base fundamental.
¹ Although presented in a fundamental graph context, the terms "supporter" and "support" can apply
more generally and we will use them often simply to indicate that there is evidence in the spectrum that
favors some particular interpretation or conclusion.
Edges are created between any pair of fundamentals that differ in the weight of an amino
acid residue. Each fundamental node and each edge may have scores associated with them
(e.g. number of supporter peaks or sum of the intensities of supporter peaks, etc).
A path in the graph connects two nodes via a sequence of edges and intermediary nodes.
To score a path, a common strategy is to sum the scores of the nodes (and/or edges) visited
along the path (or the logarithm of these scores).
Many variations of the fundamental graph appear in the literature:
sometimes it is not
constructed in its entirety; different approaches use different node and/or edge scoring
functions; the methods for finding the best scoring path vary, etc. Nevertheless the goal remains the same: find a path that connects the base fundamental to the parent fundamental.
Such a path is called a complete path and it defines a complete sequence, one whose mass is
the same as that of the parent mass. The hope (of algorithms employing the fundamental
graph) is that with the appropriate scoring function and graph traversal algorithm, the best
scoring complete path defines the correct sequence.
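As a rough sketch of the construction and traversal just described, the code below connects fundamentals that lie one residue mass apart and then finds the best-scoring complete path by dynamic programming over nodes in increasing mass order. The residue table is deliberately truncated to four entries, and the function names, node scores and 0.05 Da tolerance are assumptions; published algorithms differ in their scoring and search strategies.

```python
# Monoisotopic residue masses for a few amino acids (a full table has 20 entries).
RESIDUES = {"G": 57.02146, "A": 71.03711, "V": 99.06841, "L": 113.08406}

def build_edges(nodes, tol=0.05):
    """Connect lower-mass to higher-mass nodes that differ by a residue mass."""
    edges = {}
    for a in nodes:
        for b in nodes:
            for res, mass in RESIDUES.items():
                if abs((b - a) - mass) <= tol:
                    edges.setdefault(a, []).append((b, res))
    return edges

def best_complete_path(nodes, scores, base, parent, tol=0.05):
    """Highest-scoring path from base to parent; path score = sum of node scores."""
    edges = build_edges(nodes, tol)
    best = {base: (scores[base], "")}          # node -> (score, sequence so far)
    for a in sorted(nodes):                    # mass order is a topological order
        if a not in best:
            continue
        for b, res in edges.get(a, []):
            cand = (best[a][0] + scores[b], best[a][1] + res)
            if b not in best or cand[0] > best[b][0]:
                best[b] = cand
    return best.get(parent)                    # None if no complete path exists
```

For instance, with Bion fundamentals at 1.0 (base), 58.02146, 72.03711, 129.05857 and 228.12698 (parent), both GA and AG reach 129.05857, and the node scores decide which prefix survives.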
4.3 De Novo Peptide Sequencing From Tandem Mass Spectra
Let $S(P)$ (or $S$) be the tandem mass spectrum of peptide $P$. Let $m(P)$ be the mass of the
protonated form of $P$, i.e.
$$m(P) = m(\mathrm{Nterm}) + m(\mathrm{Cterm}) + m(\mathrm{H}) + \sum_i m(r_i)$$
where $m$ computes the mass of its argument, Nterm and Cterm are the N- and C-terminal groups,
$r_i$ is the basic residue (see Appendix A) and H stands for hydrogen. De novo peptide
sequencing from tandem mass spectra can be stated as:
Given $m(P)$ and $S(P)$ for some $P$,
find $P'$ such that $m(P') = m(P)$ and $\mathrm{Prob}(S \mid P')$ is maximal².
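The mass constraint in this problem statement can be illustrated with a small sketch of $m(P)$. The residue table holds standard monoisotopic masses rounded to five decimals; the function name and the choice of a free N-terminus (H) and free C-terminus (OH) as defaults are assumptions, since the thesis defers these values to Appendix A.

```python
# Monoisotopic residue masses (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
M_H = 1.00783      # hydrogen; also used here for the free N-terminal group
M_OH = 17.00274    # hydroxyl, the free C-terminal group

def protonated_mass(sequence, n_term=M_H, c_term=M_OH):
    """m(P) = m(Nterm) + m(Cterm) + m(H) + sum of residue masses."""
    return n_term + c_term + M_H + sum(RESIDUE_MASS[r] for r in sequence)

print(round(protonated_mass("DRVYIHPFHL"), 3))
```

For the angiotensin sequence DRVYIHPFHL this lands within 0.01 Da of the parent mass 1296.68 used in the example of Section 4.2.2.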
An algorithm for de novo sequencing from tandem mass spectra should be designed with
the following issues in mind:
Performance A fast method, inexpensive in terms of computational time - seconds/minutes
of computer time, compared to hours/days of laboratory time, is desired. In addition,
the method should involve a minimal number of laboratory steps, e.g. the acquisition
of a tandem mass spectrum requiring no special chemical treatment of the peptide
sample.
Robustness The sequencing algorithm should be robust, able to tolerate some amount of
noise in the form of extraneous peaks and data loss in the form of missing peaks.
Scalability The algorithm should be able to accommodate the sequencing of longer peptides.
Reliability The user should be able to reason about or have some degree of confidence in
the validity of the predicted sequence.
Comprehensiveness When possible, all portions of the spectrum - the range of masses,
the intensities of the peaks, the distribution of the fragments and fragment types, etc.
- should be accounted for and considered in the sequencing process.
4.4 Notion of a Correct Sequence Prediction
When a de novo sequencing algorithm halts, it may make one or more predictions as to the
sequence of the parent ion. Most of these predictions will be different from the real sequence,
but some of them will be considered correct and acceptable predictions. The conditions for
when a sequence prediction is correct stem from physical/chemical properties, and are as
follows:
1. when an I appears in place of an L (or vice versa) in the real sequence,
2. when a K appears in place of a Q (or vice versa) in the real sequence, and/or
3. when the first two residues of the real sequence are interchanged.
The first two conditions are due to the fact that certain pairs of amino acids cannot be
distinguished by their mass alone: leucine and isoleucine are isomeric (same atoms in a
different arrangement, hence identical mass), and lysine and glutamine are isobaric (different
atoms but nearly identical mass). Thus, a sequence guess is considered correct if all residues
can be identified and ordered up to mass equivalence³, so for all intents and purposes, a
guess that differs from the real sequence only by such substitutions is the same guess.
The third condition reflects the weakness of the evidence for the cleavage between the first
two residues. Researchers have found that the $f_1$ Bion is "too weak for meaningful
assignment" [SWM97]. Similarly, Yalcin et al. [YKC+95, YCPH96] found the $B_1$ ion less
favorable. Immonium interference, discussed later in Section 7.1.1, can also contribute to
this reversal of the first two residues.
² Another way to view this problem is: Given $S(P)$ and $m(P)$, find $P'$ such that $m(P') = m(P)$ and
$c(P, P')$ is minimal, for some function $c$ that evaluates the similarity of $P$ and $P'$.
³ Fragment types that occur in other kinds of mass spectra can allow for, for example, the differentiation
of leucine and isoleucine [JMB88].
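The three conditions of Section 4.4 can be expressed as a small comparison routine. This is a sketch only: the function names are hypothetical, and it treats the conditions as combinable ("and/or"), so a prediction may differ by I/L and K/Q substitutions and also have its first two residues interchanged.

```python
def canonical(seq):
    """Collapse residues that cannot be told apart by mass alone."""
    return seq.replace("I", "L").replace("Q", "K")

def is_correct_prediction(predicted, real):
    """Apply the three conditions: I/L, K/Q, and a swap of the first two residues."""
    p, r = canonical(predicted), canonical(real)
    if p == r:
        return True
    swapped = r[1::-1] + r[2:]        # real sequence with residues 1 and 2 swapped
    return len(r) >= 2 and p == swapped
```

With the angiotensin example, DRVYLHPFHI and RDVYIHPFHL would both count as correct predictions of DRVYIHPFHL, whereas DRFHPYLVHL would not.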
Chapter 5
De Novo Protein Sequencing with Mass Spectra
Current strategies for protein sequencing often involve proteolytic cleavage of the protein,
isolation of each peptide product (e.g. by reverse-phase high-pressure liquid chromatography),
sequencing of each digestion product, and then non-trivial assembly of these sequences to reconstruct the sequence of the original intact protein. In the case of a novel
protein, the sequencing step is a de novo one which can be done chemically with certain
reactions or computationally with mass spectrometry.
5.1 Chemical Sequencing
Early de novo sequencing efforts were chemical in nature, perhaps nothing more than a series
of biochemical assays and analyses aimed at identifying residues based on some recognizable
property. For example, acetylation could be used as a detector for lysine, and esterification
for aspartic acid and glutamic acid [YME96]. One could, for example, acquire two sets of
spectra - one with and one without the assay and then compare the two datasets [CGAP99].
Analyses via these direct biochemical methods would have required large amounts of protein
which might have been hard to amass or taken a long time to isolate.
A peptide's sequence can be chemically determined in a more methodical manner by using
the Edman degradation reaction, which removes amino acids from the N-terminus, one
residue at a time. This chemical process involves the addition of phenylisothiocyanate (PITC)
and subsequent acid treatment in order to cleave the N-terminal residue at the peptide bond.
The identity of the released residue can be determined by comparing its reversed-phase HPLC
retention time with that of known amino acid standards. By chemically removing one
residue at a time, a peptide sequence could be absolutely determined. This process has
been automated, but it takes 30-60 minutes for each residue [YGHZ91, HFBG92]. As a
result, Edman degradation is currently limited to the identification of about 50 residues per
day in the best conditions (e.g. ample quantity and sufficient purity of sample) [CWBK93].
But proteins can range from 50 residues to more than 25,000 residues in size (average size
is approximately 250) [Cre93], so Edman degradation can take days to sequence the entire
protein.
Furthermore, errors in Edman degradation can be cumulative. If the N-terminal residues of
some of the peptides fail to react and are not removed during one iteration of the degradation
process, they will react and be removed during a subsequent cycle, potentially interfering
with the identification of the proper residue. After n cycles, the true residue at the nth
position from the N-terminus will be unclear and its identification, unreliable (n can be on
the order of 70) [Cre93].
5.2 Sequencing with Mass Spectrometry
Although many laboratories today still use Edman degradation to obtain de novo sequence
information [SCE+97], in practice, Edman degradation and mass spectrometry approaches
are often used simultaneously, and sometimes in a complementary fashion (for laboratories
that can afford to) [Fen91, TJ97]. Mass spectrometry offers the following advantages:
Blocked N Terminus A blocked N-terminus interferes with the Edman degradation reaction (e.g. will not react with phenylisothiocyanate), but it does not pose a problem for
mass spectrometric techniques. Roughly 30-50% of proteins isolated from SDS-PAGE
gels, for example, are N-terminally blocked [YGHZ91].
Modified/Unanticipated Amino Acids Modified residues and unexpected residue variants
can interfere with residue identification during Edman analysis if a comparative
standard is unavailable [BS87, HFBG92]. With mass spectra, fragments containing
the modification will exhibit a shift in mass; however, identifying the chemical structure
of the modified moiety would still be challenging despite possible knowledge of
its location and molecular weight.
Sample Quantity Low picomole (e.g. less than 1 pmol [Bal95]) and femtomole (e.g.
200 fmol [GME+95]) quantities of protein can be successfully used with mass spectrometry.
Under certain conditions, even attomole sensitivity can be achieved [FS96].
Edman analysis requires picomole amounts, e.g. 25-100 pmol [Pro]. Consequently,
Edman degradation may be unsuitable for peptides that are scarce or rare.
Sample Purity The sample does not have to be pure since mass spectrometry can tolerate
the presence of other peptides (e.g. in a proteolytic digest) and impurities including
salts and other biochemical agents commonly used as buffers or detergents [BC96].
Not so for Edman analysis; sample purity matters.
How does a mass spectrum aid in the sequencing process? What can be done with an
uninterpreted tandem mass spectrum for an unknown peptide? If an oracle could correctly
label each experimental peak with its real identity (fragment type and sequence), the protein
sequencing problem would be decidable: an algorithm could examine the identities and
either supply the correct sequence, or indicate that it is unable to do so due to insufficient
information.
Such an oracle may not be realistic. What if the oracle were allowed to be less powerful -
what if it were only able to supply the fragment type portion of a peak's identity?
5.2.1 Enlisting the Aid of Fragment Types
Different fragment types provide different pieces of information useful for sequencing.
Parent Ion The parent ion reveals the parent mass (but this value is already known as it
was needed to tune the timed ion selector for MS/MS).
Immonium Ions Immonium ions are useful because they hint at a peptide's composition
(they do not yield any sequence information).
However, immonium clues are less
informative when sequencing large peptides because the probability that each residue
is present in the sequence increases with peptide length, and thus, it is unclear how to
distinguish one possible sequence from another based on immonium ion data alone.
Variant Ions Like immonium ions, variants can be indicative of residue composition, but
again, variants of longer peptides are less informative than variants of shorter ones.
Immediate Family Ions Immediate family ions are useful because they serve as evidence
for the cleavage of some bond between two residues. When members of two neighboring
families are present, e.g. $f_i$ and $f_{i+1}$, the mass of the intermediate residue $r_{i+1}$ can
be discovered.
Internal Ions Internal ions are most often used to confirm a sequence guess by matching
some expected theoretical internal calculated from the proposed guess. They can also
sometimes fill in for absent immediate family members by bridging a gap in data that
is not representative.
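As one illustration of how a fragment type contributes information, the sketch below turns low-mass peaks into composition hints via immonium ions. It relies on the standard relation that a singly charged immonium ion weighs the residue mass minus CO plus a proton; the residue subset, function names and 0.3 Da tolerance are assumptions.

```python
# Monoisotopic residue masses for a small, illustrative subset of residues.
RESIDUE_MASS = {"H": 137.05891, "F": 147.06841, "Y": 163.06333, "W": 186.07931}
CO, PROTON = 27.99491, 1.00728

def immonium_mass(residue):
    """Singly charged immonium ion: residue mass minus CO plus a proton."""
    return RESIDUE_MASS[residue] - CO + PROTON

def composition_hints(peaks, tol=0.3):
    """Residues whose immonium ion mass matches some peak in the spectrum."""
    return sorted(r for r in RESIDUE_MASS
                  if any(abs(immonium_mass(r) - p) <= tol for p in peaks))
```

A hint only says that a residue is probably present somewhere in the peptide; as noted above, it carries no positional information.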
Being able to label a real peak with its correct fragment type is not enough. In practice, the
oracle needs to be able to contend with spectra marred with imperfections such as missing
peaks, imprecise peaks and irrelevant peaks. This complicates the job of our oracle which
now must first distinguish noise peaks from real peaks before an assignment of fragment
types to the real peaks can be made.
5.2.2 Persevering Despite the Effects of Noise
Noise is anything that causes the spectrum to deviate from the ideal, and can be divided
into two general categories: physical noise and measurement noise. The effects of noise
can be additive (bogus peaks that do not belong appear in the spectrum), or subtractive
(legitimate peaks disappear). When a peak is present, measurement noise interferes with
the accuracy with which certain properties of the peak can be measured. The following is
a taxonomy of the various categories of noise:
* Physical Noise
  Physical-Chemical Properties of Peptides The peptide itself, by nature, may
  not ionize well or it may not fragment well because of its residue composition, e.g.
  dynorphin 1-13 is highly resistant to fragmentation [QC96] (also see Section 3.2).
  Certain fragments which do not occur readily produce a subtractive effect by
  being either totally absent or present with low abundance.
  Impurities The presence of impurities, introduced either by the peptide supplier
  (e.g. as artifacts of purification) or by the analyst during sample preparation,
  can lead to foreign peaks, peaks which do not correspond to any real fragment.
  This additive effect can complicate spectra and mislead sequencing algorithms.
* Measurement Noise
  Mass Inaccuracies The mass of a peak is considered to be accurate to within
  0.1% (and sometimes up to 0.01%) of the actual mass [MHH+94, Bal95, FS96]¹.
  As discussed in Section 2.1.4, the laser source is one of the contributors to this
  imprecision. Extreme cases of inaccuracy can lead to broad mountain ranges
  rather than sharp well-defined spectrum peaks.
  Height Inaccuracies Ambient electronic or background noise causes the spectrum
  to exhibit a non-zero baseline intensity level. The heights of peaks at
  particular masses may also be lower or higher than their actual counts due to the
  loss of ions during flight (e.g. an ion collides against the flight tube wall or misses the
  detector due to improper mirror focusing), low or hyper detector sensitivity (e.g.
  an ion impacts the detector but fails to trigger the sensing/registering mechanism)
  and electronic disturbances [GMG+99].
  Peak Detection Errors An algorithm is used on the raw mass spectrum data for
  peak detection. Additive and subtractive effects are possible as peak detection
  errors can treat spurious background noise peaks as possible data (false positives)
  or disqualify actual peaks of low intensity from consideration (false negatives).
¹ According to [ASR96], the mass accuracy is 0.1% for large proteins and 0.01% for small ones. An
estimate for the Voyager Elite in reflector mode is 0.02% and for PSD, ±0.2-0.3 Daltons [Pet97].
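To give a feel for the mass accuracies quoted above, the following sketch converts a relative accuracy into an absolute window and asks whether two nearly isobaric residues could be told apart. The function names are hypothetical, and treating the window at a single fragment mass as the resolving limit is a simplifying assumption; the K and Q values are standard monoisotopic residue masses.

```python
def mass_window(mass, relative_accuracy):
    """Half-width (in Da) of the uncertainty window at a given mass."""
    return mass * relative_accuracy

K, Q = 128.09496, 128.05858          # monoisotopic residue masses

def distinguishable(m1, m2, at_mass, relative_accuracy):
    """True if two masses differ by more than the window at at_mass."""
    return abs(m1 - m2) > mass_window(at_mass, relative_accuracy)

for acc in (0.001, 0.0001):          # 0.1% and 0.01%
    print(acc, round(mass_window(1296.68, acc), 3),
          distinguishable(K, Q, 1296.68, acc))
```

At a mass near 1296.68 the windows are roughly 1.3 Da and 0.13 Da respectively, both wider than the 0.036 Da gap between lysine and glutamine, which is consistent with treating K and Q as interchangeable in Section 4.4.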
5.2.3 Considering Different Interpretations of the Same Spectrum
Let us weaken the oracle further by considering one that does not always successfully distinguish signal from noise. It may mistakenly assign a peak an incorrect fragment type
but one that is still consistent with the fragment type assigned to the other peaks in the
dataset. As a result, the peaks of a single dataset may be interpreted in multiple ways, each
with a complete set of fragment type assignments that are plausible and self-consistent. For
each such complete assignment, a candidate guess for the complete peptide sequence can be
inferred. Note that the sequences deduced from different complete assignments are fundamentally different, but a single complete assignment can lead to several sequences that are
only marginally different. For example, DRVYIHPFHL (correct angiotensin sequence) and
YSMYFVYNPI may both be sequence guesses derived from different complete assignments
which each explain a particular dataset in some consistent interpretation, but the fragment
type assignments are fundamentally different. However, DRVYIHPFHI, DRVYLHPFHL,
DRFHPYLVHL and ALWYIHPFHI are all marginally different, because it may be possible
to derive all of them from the same complete assignment. (A large enough dataset could
potentially constrain the interpretation and significantly reduce the number of marginally
different sequences.)
Two factors contribute to the existence of marginally different sequence inferences. One is
the non-uniqueness of residue weights, previously mentioned in Section 4.4, which leads to
sequences like DRVYIHPFHI and DRVYLHPFHL.
The second has to do with the fact that multiple distinct combinations of amino acids can
have the same mass. Simple examples are: HR and GHV, N and GG, and TV and VT.
Even if the amino acid composition were known, one could not tell the correct ordering
from its anagrams if supporting (either core or internal) fragments are not present in the
spectrum to resolve questions of residue position.
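The mass coincidences mentioned above are easy to enumerate. The sketch below lists residue combinations (ignoring order) whose total mass falls within a tolerance of a target; the six-residue table, the function name, the length limit and the 0.05 Da tolerance are assumptions chosen to keep the example small.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses for the residues used in the examples above.
RESIDUE_MASS = {"G": 57.02146, "V": 99.06841, "T": 101.04768,
                "N": 114.04293, "H": 137.05891, "R": 156.10111}

def compositions_matching(target, max_len=3, tol=0.05):
    """Residue multisets (order ignored) whose total mass is within tol of target."""
    hits = []
    for k in range(1, max_len + 1):
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), k):
            if abs(sum(RESIDUE_MASS[r] for r in combo) - target) <= tol:
                hits.append("".join(combo))
    return hits

print(compositions_matching(RESIDUE_MASS["N"]))                      # N vs GG
print(compositions_matching(RESIDUE_MASS["H"] + RESIDUE_MASS["R"]))  # HR vs GHV
```

Each composition returned still stands for all of its orderings (for example TV and VT), so the number of marginally different sequences grows with both kinds of ambiguity.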
If the intervention of a weak oracle can lead to a (potentially large) number of sequence
guesses, how are they to be ranked and judged? How is one to be chosen over the others
as the most likely answer? One answer is to use redundancy. Not all sequence guesses
are equal; some guesses - hopefully, those that are closer to the correct sequence - are
supported by more evidence peaks than others. Redundancy also plays a role in solving the
problem of missing peaks since interpretation may still be successful if enough redundant
information is present.
Some of the algorithms presented in the literature (the spectrum-to-sequence approaches
discussed in Chapter 6) are basically simulations of this weak oracle. However, the idea of
using such an oracle is not the only way to address the protein sequencing problem. De
novo sequencing from MS/MS is the focus of this thesis, but it is but one of four answers
the research community has given to the peptide sequencing question.
Chapter 6
Prior Work
In this chapter, we briefly survey some of the previous work (not necessarily in chronological
order) in de novo protein sequencing using mass spectrometry. While many of these approaches were not designed for MALDI-PSD spectra, the ultimate goal is the same: to find
the peptide sequence that, from a pool of candidate sequences, presumably best explains
the observed data. The purpose of this chapter is to examine how the different algorithms
arrive at a solution.
6.1 Four Categories of Approaches
We partition the space of algorithms into four categories, and in general, each approach falls
into one of these, although there are some hybrid approaches that fall into two. Membership
in a category is dependent on whether an approach takes as input experimental data that is
single-stage MS spectra or tandem mass spectra, and whether it involves a database lookup. The
works we visit are summarized in Figure 6-1.
In the case of MS spectra, the peptide is cleaved (perhaps incompletely) by a known protease
at specific residues so that the spectrum is a conserved (all pieces are retained) histogram
of masses corresponding to those subsequence blocks flanked by residues targeted by the
protease. A tandem mass spectrum is a non-conserved histogram of masses resulting from a
more complex fragmentation process that can occur at non-specific residue positions (see
Figure 6-2).
[Figure 6-1 diagram; the quadrant layout did not survive extraction. Column labels: MS/MS, MS.
Works listed, in extracted order: YGSH93; JQCG93; MHR93; PHB93; GMG+99; FGVP+96; MW94;
EMY94, YEMS95, GME+95, YECB96; TWJ96, Tay00; PPCC99; SCE+97; CBB96; TJ97; CLS99; CWBK93;
BJHP94; GP97; SMMK84; LS85; HWH86; IN86; SB88; JB89; ZTEB90; Bar90; YGH91; HFBG91; SZK95;
FdCGB95, FdCGB+98, FdCGS+99; DAC+99a, DAC+99b; CKT+00.]
Figure 6-1: Classification of Different Protein Sequencing Approaches
Database approaches mine spectra for informational clues (called a fingerprint or signature)
in an attempt to uniquely identify the peptide from a reservoir of sequenced proteins. Such
approaches are often faster than their computational de novo counterparts but they may
encounter difficulties with homologs and would be unable to identify novel proteins not
present in the database.
6.2 Database Search with MS
One can search a database of protein sequences using the MS masses as a mass fingerprint. Five such algorithms appeared in the literature in 1993: [YGSH93, JQCG93, MHR93,
HBS+93, PHB93] and all of them adopt a similar strategy: predict the theoretical fingerprints of all entries in a database, compare these to the experimental fingerprint, and choose
the entry with the best agreement as the answer sequence.
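The shared strategy (predict, compare, choose) can be sketched as follows. This is not any of the cited algorithms: the trypsin rule (cleave after K or R unless followed by P), the toy two-entry database, the match-counting score and the 0.5 Da tolerance are all assumptions for illustration.

```python
import re

# Monoisotopic residue masses for the residues in the toy database below.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "V": 99.06841, "T": 101.04768, "L": 113.08406,
                "K": 128.09496, "R": 156.10111}
WATER, H = 18.01056, 1.00783

def tryptic_peptides(protein):
    """Cleave after K or R, except before P (a common simplification)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def fingerprint(protein):
    """Theoretical (M+H) masses of the tryptic peptides."""
    return [sum(RESIDUE_MASS[r] for r in p) + WATER + H
            for p in tryptic_peptides(protein)]

def best_match(observed, database, tol=0.5):
    """Entry whose theoretical fingerprint explains the most observed masses."""
    def hits(protein):
        theo = fingerprint(protein)
        return sum(any(abs(o - t) <= tol for t in theo) for o in observed)
    return max(database, key=lambda name: hits(database[name]))

database = {"protA": "GAVKLLSR", "protB": "GGTKPAVR"}   # hypothetical entries
print(best_match([374.24, 488.32], database))
```

Counting matches is the simplest possible comparison; as discussed below, MOWSE replaces it with an empirically derived scoring matrix.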
[Figure 6-2 diagram: a mixture of peptides from specific, conserved fragmentation (e.g. proteolytic
digestion) gives the MS spectra; a selected parent peptide (parent ion, M+H) undergoes non-specific,
non-conserved fragmentation to give the MS/MS spectra.]
Figure 6-2: MS and MS/MS spectra: MS is simply a way to separate pieces of the peptide
resulting from proteolytic digestion by mass. MS/MS is a means to home in on a particular
mass to generate random non-specific fragmentation.
In practice, one could use the complete set of MS masses as the mass fingerprint, but
rather surprisingly, Pappin et al., in their molecular weight search (MOWSE¹) approach,
found that a protein could be identified uniquely with a fingerprint of as few as 3 or 4
masses [PHB93]. Their comparison algorithm is more sophisticated than simply counting
the number of matches; their scores are based on an empirically determined matrix that
captures the size distribution of peptide masses as a function of protein mass.
In general, algorithms that fall in this category may have difficulty identifying the correct
sequence from a database if:
* the peptide sample contains a modification and this modification is not present in the
database,
* a number of sequences are homologous to the correct one, and the fingerprint is
insufficient to distinguish amongst them, yielding false positives,
* the database is incomplete or contains errors,
* the fingerprint is corrupted by the bad choice of irrelevant peaks that are unrelated
to the sample peptide and occur due to noise/impurities, or
* incomplete proteolytic digestion occurs or not enough of the products are recovered.
Some of these issues have been studied [MHR93, JQCG93], and except when a combination
of these occurs [MHR93], the fingerprinting method has in general met with a great deal
of success in identifying proteins that have already been catalogued in the
database [LM97]. Further improvements have included better peak detection methods and
enhanced scoring mechanisms that incorporate additional attributes for finer discrimination [GMG+99].
¹ http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
6.3 Database Search with MS/MS
Database searches can also be done with MS/MS data. Because the information contained
in this type of spectra is different and potentially more sequence-revealing, a number of
approaches have been proposed.
6.3.1 Searching with Peptide Sequence Tags
In an approach called PeptideSearch², Mann and Wilm [MW94] enhance the success rate
of a database search by using a mass fingerprint with more searching criteria - sequence
information that can be readily gleaned and interpreted from ESI spectra. Their peptide
sequence tag is composed of three parts: $(m_1, s, m_2)$ where $s$ is a short consecutive amino
acid sequence that is the result of partial manual interpretation of the MS/MS spectra, $m_1$
is the mass of the prefix from the N-terminus to the start of $s$, and $m_2$ is the mass of the
suffix from the end of $s$ to the C-terminus. In this manner, $m_1 + m_2 + \mathrm{mass}(s)$ is the mass
of the entire original peptide. For database entries that contain subsequences that match
of the entire original peptide. For database entries that contain subsequences that match
this peptide tag, a second step scores the theoretically predicted fragmentation masses for
each candidate to those experimentally observed.
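The first, tag-matching step can be sketched as below. This is an illustrative reading of the $(m_1, s, m_2)$ tag, not the PeptideSearch implementation: masses are taken as plain sums of residue masses (terminal groups ignored), and the function names and 0.5 Da tolerance are assumptions.

```python
# Monoisotopic residue masses for the residues of the angiotensin example.
RESIDUE_MASS = {"D": 115.02694, "R": 156.10111, "V": 99.06841, "Y": 163.06333,
                "I": 113.08406, "H": 137.05891, "P": 97.05276, "F": 147.06841,
                "L": 113.08406}

def mass(seq):
    """Sum of residue masses (terminal groups ignored in this sketch)."""
    return sum(RESIDUE_MASS[r] for r in seq)

def tag_matches(peptide, tag, tol=0.5):
    """Start positions where the tag (m1, s, m2) fits the candidate peptide."""
    m1, s, m2 = tag
    starts = []
    pos = peptide.find(s)
    while pos != -1:
        prefix, suffix = peptide[:pos], peptide[pos + len(s):]
        if abs(mass(prefix) - m1) <= tol and abs(mass(suffix) - m2) <= tol:
            starts.append(pos)
        pos = peptide.find(s, pos + 1)
    return starts
```

Relaxing the search to tolerate an error in one part of the tag would amount to dropping one of the three tests (prefix mass, subsequence, suffix mass) in turn.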
One must identify a correct subsequence s and in order to pinpoint its location in the real
sequence, one must correctly assign fragment type roles to the mass peaks which support
s. To find s, one finds a chain of peaks, assigns a core fragment type to each of them,
and checks to see if their masses are consistent - namely, does the mass difference adjusted
for the peaks' fragment roles correspond to an amino acid weight? The sequence may be
correct, but if the roles are incorrect (e.g. one assumes they are B ions when in fact they are
Y ions), the computation of $m_1$ and $m_2$ will be incorrect, shifting $s$ to an incorrect position
in the query tag. Multiple tag queries may result because different role assignments to
different sets of peaks may be consistent, and if there is no redundancy making the choice
obvious, there may be no other a priori bias towards favoring any one set of role assignments
over another.
² http://www.mann.embl-heidelberg.de/Services/PeptideSearch/PeptideSearchIntro.html
Mann and Wilm found that using a subsequence of two or three residues was sufficient to
effectively narrow down the number of database matches. Furthermore, in the case of a
mass fingerprint with MS data, a modified residue will affect and shift the molecular weight
of all fragments of which it is a member. However, if there is an unexpected modification
and the query turns up no answer, the matching criteria can be relaxed - in this manner,
the search is able to tolerate errors/modifications in one of the three parts of the sequence tag.
6.3.2 Evaluating Theoretically Predicted Spectra with Experimentally Obtained Spectra
An algorithm incorporated in a program called SeQuest³ [EMY94] for low energy CID⁴
searches a database looking for entries containing subsequences with a mass equal to that
of the parent ion. When a candidate entry is found, its theoretical spectrum is predicted
and compared to the experimental data to arrive at a preliminary score that is based
on: (1) the number of theoretical peaks in the experimental dataset, (2) the sum of the
intensities of these matched peaks, (3) the continuity of an ion fragment type, and (4) the
presence/absence of immonium ions for H, Y, W, M and F. The highest scoring ones go
on to a "cross-correlation" (or simply, correlation) analysis step which produces another
score measuring the closeness-of-fit between the theoretical spectrum and the experimental
spectrum.
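The flavor of the preliminary comparison can be conveyed with a sketch covering only components (1) and (2), the number of matched theoretical peaks and their summed intensity. It is not the scoring function of [EMY94]: the restriction to singly charged b- and y-type ions, the function names, the three-residue table and the 0.5 Da tolerance are assumptions.

```python
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "V": 99.06841}  # illustrative subset
PROTON, WATER = 1.00728, 18.01056

def theoretical_ions(seq):
    """Singly charged b- and y-type ion masses for every backbone cleavage."""
    ions = []
    for i in range(1, len(seq)):
        b = sum(RESIDUE_MASS[r] for r in seq[:i]) + PROTON
        y = sum(RESIDUE_MASS[r] for r in seq[i:]) + WATER + PROTON
        ions += [b, y]
    return ions

def preliminary_score(seq, spectrum, tol=0.5):
    """Return (matched theoretical peaks, summed intensity of the matches).

    spectrum is a list of (mass, intensity) pairs.
    """
    count, total = 0, 0.0
    for t in theoretical_ions(seq):
        hits = [inten for mz, inten in spectrum if abs(mz - t) <= tol]
        if hits:
            count += 1
            total += max(hits)
    return count, total
```

Candidates would then be ranked by such a score before the more expensive correlation step is applied to the top few.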
The algorithm handles modifications by simultaneously considering them at every putative
modification site of a database entry during the search (an all-or-nothing approach - either
all sites are modified or none are). A subsequent paper [YEMS95] allowed combinations
of up to three unmodified/modified sites (as the number of possible sites increases, the
number of possibilities increases exponentially). Both versions, however, are only able to
look out for a known set of modifications; they cannot handle unanticipated or previously
unencountered ones.
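The growth in the number of modification patterns can be made concrete. The sketch below enumerates which candidate sites are modified when at most three may be modified at once; the function names are hypothetical and the site positions are arbitrary.

```python
from itertools import combinations
from math import comb

def modification_patterns(sites, max_modified=3):
    """Yield every subset of candidate sites with at most max_modified members."""
    for k in range(max_modified + 1):
        yield from combinations(sites, k)

def pattern_count(n_sites, max_modified=3):
    """Number of patterns without enumerating them."""
    return sum(comb(n_sites, k) for k in range(max_modified + 1))

print(len(list(modification_patterns([2, 5, 9, 11]))))   # patterns for 4 sites
print(pattern_count(10), 2 ** 10)                        # capped vs unrestricted
```

With ten candidate sites the cap of three gives 176 patterns, against 1024 if any subset were allowed, which is why the restriction keeps the search tractable.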
PepID [TWJ96, Tay00], an algorithm implemented as part of a program called Sherpa⁵, is
similar to SeQuest, except Sherpa is an interactive application for the interpretation of ESI
MS/MS. A modified version of the cross-correlation function of [EMY94] is included as an
optional analysis one might elect to perform.
³ http://thompson.mbt.washington.edu/sequest.html
⁴ Variations of the algorithm have been demonstrated for MALDI PSD data [GME+95] and high-energy CID
spectra [YECB96].
⁵ http://www.hairyfatguy.com/Sherpa/
6.4 Hybrid Approaches
A few hybrid approaches exist that either have both a computational component and a
database search, or use both MS and MS/MS data. These are depicted in Figure 6-1 on
the boundary between two quadrants.
6.4.1 Computational Approaches with a MS/MS Database Search
MSTag⁶ [CBB96] uses MALDI PSD peaks as a tag for searching either a protein or a DNA
database. It also has limited de novo sequencing capability (for peptides < 1300 Da) as it
uses a combinatorial brute force approach.
Lutefisk97 [TJ97], an algorithm for low energy CID, also combines a de novo sequencing
step with a database lookup.
The de novo step creates a fundamental graph of B ions
that are scored with probability values (taken from [Bar90]). Paths through the graph are
enumerated, and only the best scoring ones are subjected to cross-correlation a la [EMY94]
with the experimental spectrum to produce a new combined score. In the resulting list of
candidate sequences, several of them may be homologous with only slight differences (e.g.
the order of a short subsequence might be reversed, a substitution might have occurred
with two subsequences that are different but of the same mass, etc). A modified FASTA
algorithm allows each database entry to be compared to multiple sequence queries, allowing
for a homology-based sequence search.
PepSeq [CLS99] combines features of a database approach and a computational/de novo
approach. It examines PSD spectra and according to an arsenal of rules and observations
(e.g. residue presence/absence from immonium ions, C terminal residue constraints due
to protease used, fragment type patterns, etc.), infers a list of properties the generating
sequence should have and then combinatorially computes these possible candidates. The
theoretical spectrum of each candidate is compared to the experimental data arriving at a
score, and a database lookup for the candidate is performed as well. One novelty is that it
includes internal ions in its generation of theoretical spectra.
⁶ http://prospector.ucsf.edu/ucsfhtml3.2/mstagfd.htm
6.4.2 Database Search with MS and MS/MS
MassFrag, a hybrid approach devised by Gevaert et al. [GVP+96], works with both MS
and PSD MS/MS spectra. MOWSE is used on the MS spectra to obtain a list of possible
sequences. For each guess, its theoretical MS/MS spectrum is generated and compared to
the experimental MS/MS dataset to determine a score based on the number of matches.
PepFrag 7 allows nucleotide sequence databases to be searched as well. This program addresses the situation when a single mass fingerprint query results in multiple possible candidate sequences, and includes an in silico investigation of how effective different search criteria
(e.g. knowledge of the N-terminal residue identity, knowledge of the presence/absence of
certain amino acids, etc.) are at constraining the search [FQC98].
6.4.3 Database Search with MS or MS/MS
The Mascot program 8 is an application that allows users to perform a number of the searches
mentioned above: (1) mass fingerprinting with MS spectra, (2) peptide sequence tags with
MS/MS, and (3) comparison of theoretical MS/MS with experimental MS/MS. Mascot's
scoring algorithms are based upon those of MOWSE, but they are probability-based and
measure the probability that an observed match between the theoretical and experimental
spectra resulted from chance. The absolute probability that a match is random must be
supplied, as well as the database size, but other details are not available [PPCC99].
7 distributed as part of PROWL, http://prowl.rockefeller.edu/
8 http://www.matrixscience.com/
6.5 Discussion of Database Approaches
In general, database searches can be made relatively fast, and if the peptide in question is
part of a protein present in the database, then the search will produce a sequence. In the
event that a search does not yield a unique answer, additional information such as partial
sequence information (a longer contiguous stretch of amino acids, or several short ones in
the case of a peptide sequence tag), other fragment masses and protease specificity can be
supplied to constrain the search to a more definitive match.
However, a database search is only viable if the protein is present. Furthermore, protein
databases may contain errors and are far from being complete [SWM97]. When sequencing
newly encountered proteins, a database search will yield either no answer or, even worse, a
false positive - a sequence (e.g. of a homologous protein) that may fit the criteria well but
is not the real one responsible for the observed spectrum.
If a peptide is not present in any database, then one must now resort to some other means
for protein sequencing such as Edman degradation or one of the computational algorithms
to be discussed.
6.6 Computational Search with MS
Recall that MS spectrum peaks represent subsequence blocks that result from proteolysis.
For each block/peak mass, one might exhaustively enumerate all amino acid combinations
that have the same mass (e.g. PAAS [MSM+83] embodies a library of routines that would
be ideal for this). But the larger the magnitude of the mass, the greater the number of
possible combinations. Furthermore, there exists no subsequent means of further evaluation
for distinguishing the right answer from the pool of possibilities.
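The exhaustive enumeration described above can be sketched in a few lines of Python. This is an illustration only and is not the PAAS library of [MSM+83]: the function name `compositions`, the 0.01 Da tolerance and the residue mass table (standard monoisotopic values, with isoleucine folded into leucine because the two are isobaric) are assumptions introduced here.

```python
# Standard monoisotopic residue masses in Da (I is omitted; it is isobaric with L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
# Sorted by mass so that each composition is generated once, in non-decreasing order.
RESIDUES = sorted((m, aa) for aa, m in RESIDUE_MASS.items())


def compositions(target, tol=0.01, start=0, prefix=()):
    """Return every unordered residue combination whose mass is within tol of target."""
    found = []
    if prefix and abs(target) <= tol:
        found.append(prefix)
    for idx in range(start, len(RESIDUES)):
        mass, aa = RESIDUES[idx]
        if mass <= target + tol:
            # Reusing idx (not idx + 1) allows repeated residues such as GG.
            found.extend(compositions(target - mass, tol, idx, prefix + (aa,)))
    return found


if __name__ == "__main__":
    print(compositions(114.04293))  # block mass of a single asparagine residue
```

Even for a block as light as one asparagine residue, the enumeration returns two compositions (GG and N), and nothing in the MS spectrum alone distinguishes them.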
6.6.1 Ladder Sequencing with Mass Spectrometry
Instead of a protease that cleaves after a certain set of specific residues, a non-residue-specific reagent is used to cleave after every amino acid. If one were able to isolate all
possible intermediate product fragments containing the original amino (carboxyl) terminus,
one would in effect have all the partial prefixes (suffixes) of the sequence string. One could
then easily identify the amino acids in the sequence from the MS spectrum by calculating
the mass difference between consecutive peaks. This is the idea behind ladder sequencing
and it was first proposed by Chait et al.
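As a minimal sketch of the read-out step, the Python below turns a ladder of fragment masses into residues by differencing consecutive peaks. The helper names (`suffix_ladder`, `read_ladder`), the 0.3 Da tolerance and the use of neutral monoisotopic masses are assumptions made for illustration; they are not taken from [CWBK93].

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def suffix_ladder(seq):
    """Simulate an ideal ladder: neutral masses of every suffix of seq."""
    return [sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER for i in range(len(seq))]


def read_ladder(peaks, tol=0.3):
    """Name the residue(s) matching each mass difference between consecutive peaks."""
    ordered = sorted(peaks, reverse=True)
    residues = []
    for heavier, lighter in zip(ordered, ordered[1:]):
        diff = heavier - lighter
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol]
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    print(read_ladder(suffix_ladder("GLKV")))
```

The last residue of the peptide is not recovered by differencing, since it is the lightest rung of the ladder. Isobaric or near-isobaric residues (L/I, and Q/K at this tolerance) are reported together because a mass difference alone cannot separate them.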
Their scheme is a modified Edman reaction in which three reagents are used: phenylisocyanate (PIC), phenylisothiocyanate (PITC) and trifluoroacetic acid (TFA) [CWBK93]. PITC
will react with and modify the N-terminal residue in a manner that enables TFA to subsequently strip the residue off, leaving another new free N-terminus that PITC can now react
with to repeat the cycle on the next residue. PIC is a compound that like PITC reacts with
the N-terminal residue, but unlike PITC, once modified by PIC, the terminal residue is no
longer susceptible to TFA cleavage. Thus, PIC is a terminating reagent and a small amount
is added when PITC is added, so that at the end of each cycle, a fraction of the products
are blocked and do not participate further in the ladder producing reactions. This enables
one to obtain a ladder of partial suffixes which can then be flown on a mass spectrometer.
Another strategy used by [BJHP94] (using chemical reactions based on the protocol in
[CWBK93]) and later by [GP97] (using different chemicals in their protocol) involves the
introduction of a new aliquot of the original full length peptide at each cycle. In this
manner, peptides get successively shorter in each cycle, but peptides that have been added
later have fewer N-terminal residues stripped off and hence form the longer partial suffixes.
Ladder sequencing approaches can handle unexpected modifications effectively because as
long as these modifications do not interfere with the ladder generating reactions, they will
not hinder sequence discovery of the other unmodified residues. Only subpicomolar amounts
of peptide are needed, but [GP97], for example, requires 60 minutes per cycle (an hour
per residue). There are various other complications for each of these ladder sequencing
approaches:
* Chait, et al: instability of the terminating block [SSMW97] and loss of hydrophobic
peptides [GP97]
* Bartlet-Jones, et al: difficulties with lysine-containing peptides [GP97,
* Gu, et al: certain reaction steps need to be optimized for different proteins, so this
may be difficult to do for novel proteins. There are also some difficulties with side-chain
reactions [GP97].
6.7 Computational Search with MS/MS
Early de novo sequencing from mass spectra was done manually - an analyst would visually
inspect spectra, seek clues and make deductions, with the ultimate objective of finding
some consistent interpretation of the spectrum as a whole. He/she might take into account
guidelines and hints described in [str95, Pap95, MB94], such as the known specificities of
a protease that might have been used during sample preparation, which could restrict the
choices for the C-terminal residue. The analyst could then attempt to build a sequence,
one residue at a time, from the C-terminus by guessing and temporarily assigning fragment
types, looking for the best next residue. Immonium ions might indicate the more likely
residues, but the number of possibilities would have still made this process time-consuming
and tedious. The task might be simpler if only the highest intensity peaks were considered,
reducing complexity at the expense of a less reliable answer based only on a small fraction of
the information in the spectrum. This game of trial and error became easier with experience,
as the analyst learned to recognize patterns after dealing with spectra on a case by case
basis, but in short, determination of the complete sequence required considerable effort. An
analyst could realistically keep track of and explore only the tip of the
iceberg of possible combinations.
Computer algorithms were developed to automate the de novo sequencing process, and we
classify them into two categories depending on how sequence guesses are made: the spectrum-to-sequence approaches enlist the aid of the experimental dataset, while the sequence-to-spectrum approaches do not.
All the approaches considered in the following sections are classified in Table 6.1. We are
interested in the strategies embodied in these algorithms, and while the details of each vary,
approaches that fall in the same classification attack the problem in the same manner.
The table also indicates which approaches use fragment type probabilities. Often these
probabilities are determined empirically from some sample data, or arbitrarily chosen to be
loosely indicative of patterns in the sample data.
[Table 6.1 lists each reference (SMMK84, LS85, HWH86, IN86, SBB87, SB88, JB89, ZTEB90, Ba90, YGHZ91, HFBG92, SZK95, FdCGB95, FdCGB98, FdCGB99, DAC+99a, DAC+99b, CKT+00, our work) against the columns: sequence-to-spectrum (local, global), spectrum-to-sequence (local; fundamental graph: local, global), and fragment probabilities.]
Table 6.1: Taxonomy of De Novo MS/MS Approaches
6.7.1 Sequence-to-Spectrum Categories
With a sequence-to-spectrum strategy, candidate sequences are generated independently of
the experimental spectrum, but the spectrum is used later on to see if any peaks support
a particular candidate sequence. This correlation process involves generating a theoretical
spectrum from the candidate sequence guess, and then comparing it to the experimentally
obtained spectrum. The simplest type of correlation is a count of the number of matches,
where a match is a mass that appears in both the theoretical and experimental spectra.
A more complex one was already seen in the cross-correlation routines of [EMY94].
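The simplest correlation, a match count, can be sketched as follows in Python. The theoretical spectrum here contains only singly-charged B and Y series ions, which is a simplifying assumption; the function names, the 0.5 Da tolerance and the mass constants are likewise illustrative rather than drawn from any of the cited programs.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def theoretical_spectrum(seq):
    """Singly-charged B and Y ion masses for every cleavage site of seq."""
    masses = [RESIDUE_MASS[aa] for aa in seq]
    peaks = []
    for i in range(1, len(seq)):
        peaks.append(sum(masses[:i]) + PROTON)            # B ion (prefix)
        peaks.append(sum(masses[i:]) + WATER + PROTON)    # Y ion (suffix)
    return sorted(peaks)


def match_count(theoretical, experimental, tol=0.5):
    """Number of theoretical masses that have an experimental peak within tol."""
    return sum(any(abs(t - e) <= tol for e in experimental) for t in theoretical)


if __name__ == "__main__":
    observed = theoretical_spectrum("GASV")   # stand-in for an experimental spectrum
    for guess in ("GASV", "GSAV"):
        print(guess, match_count(theoretical_spectrum(guess), observed))
```

In this toy case the correct guess matches all six peaks, while swapping two adjacent residues still matches four of them, which illustrates why a raw match count separates near-homologous candidates only weakly.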
Global Sequence-to-Spectrum Approaches
Global sequence-to-spectrum approaches correlate complete candidate sequences. These
can be enumerated by brute force, with [HWH86] or without [SMMK84] knowledge of the
amino acid composition.
An advantage of correlating with a full peptide sequence is that all known fragment types,
including internals and variants, can be considered since the entire sequence is available,
resulting in a more accurate measure of similarity between the theoretical and experimental
datasets. However, these approaches are computationally intensive, growing exponentially
with parent mass/peptide length.
Local Sequence-to-Spectrum Approaches
Local sequence-to-spectrum approaches [IN86, YGHZ91] avoid exhaustive searching by exploring only those portions that seem promising, instead of the whole space of possibilities.
Sequence guesses are built up in a stepwise manner by executing repeated rounds of 1) extension by one or more residues ([IN86] starts with tripeptide seeds and extends them by all possible
dipeptides), 2) correlation of the new partial sequences and 3) pruning to remove those of
low potential from consideration. Only some limited number of high scoring candidates
survive to the next round. This process ends when some number of complete sequences are
found.
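The extend/correlate/prune cycle can be illustrated with a small beam search in Python. This is a simplified sketch and not the procedure of [IN86] or [YGHZ91]: it extends by a single residue per round, scores a partial sequence by the number of its prefixes supported by a B-ion peak, and keeps a fixed number of survivors. The names `beam_search` and `beam_width`, and the 0.5 Da tolerance, are assumptions.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def beam_search(b_peaks, total_residue_mass, beam_width=5, tol=0.5):
    """Return complete candidates as (score, sequence), best first."""
    def supported(prefix_mass):
        return any(abs(prefix_mass + PROTON - p) <= tol for p in b_peaks)

    partial = [("", 0.0, 0)]          # (prefix, prefix residue mass, score)
    complete = []
    while partial:
        extended = []
        for prefix, mass, score in partial:
            for aa, m in RESIDUE_MASS.items():            # 1) extension
                new_mass = mass + m
                if new_mass > total_residue_mass + tol:
                    continue                              # overshoots the parent mass
                if abs(new_mass - total_residue_mass) <= tol:
                    complete.append((score, prefix + aa))
                else:                                     # 2) correlation
                    extended.append((prefix + aa, new_mass, score + supported(new_mass)))
        extended.sort(key=lambda c: c[2], reverse=True)
        partial = extended[:beam_width]                   # 3) pruning
    return sorted(complete, reverse=True)


if __name__ == "__main__":
    seq = "GASV"
    prefix_sums = [sum(RESIDUE_MASS[a] for a in seq[:i]) for i in range(1, len(seq))]
    b_peaks = [m + PROTON for m in prefix_sums]
    total = sum(RESIDUE_MASS[a] for a in seq)
    print(beam_search(b_peaks, total)[0])
```

Because only `beam_width` partial sequences survive each round, the correct prefix can be lost whenever its early residues are poorly supported, which is the weakness of local pruning discussed later in this chapter.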
One might imagine an algorithm that attempts to use the experimental dataset intelligently as a guide for determining potential candidate sequences in hopes of avoiding unnecessary searching. This type of algorithm falls into the spectrum-to-sequence category of
approaches.
6.7.2 Spectrum-to-Sequence Category
With a spectrum-to-sequence strategy, the candidate sequence guesses that are constructed
are determined from certain relationships that may exist in the experimental spectrum
itself.
Local Spectrum-to-Sequence Approaches
Local spectrum-to-sequence approaches [LS85, JB89] build a sequence up from one terminus
to the other, one residue at a time, using the spectrum directly to decide which residue to
grow a partial sequence with.
In another approach [ZTEB90], candidate sequences are derived mathematically from equations based on atomic weights, isotope peak ratios and immonium ion hints that dictate
composition constraints.
Each possible composition is then considered, one residue at a
time, to find a permutation that is well supported by peaks in the spectrum.
An interactive program [SBB87] allows users to link pairs of peaks that differ by a basic
amino acid mass in hopes of finding long chains that explain the parent mass. This approach
would work if the same fragment type were present for each family but not when neighboring
peptide positions are represented by family members of different fragment types.
As is the case with local sequence-to-spectrum approaches, pruning is often used to reduce
the number of outstanding sequence paths being explored. But the choice of which to keep
and which not to keep is made with only a local view of the situation at hand. Prefix pruning
can inadvertently and prematurely remove the partial path that would develop into the real
sequence when the portion of the sequence seen so far is underrepresented (even though the
remainder of the sequence, yet to be visited, is better supported).
However, if all possible extensions are extracted and compactly represented in a fundamental
graph, then the entire path would be available for analysis. Pruning would no longer have to
occur dynamically during residue extension; instead, all information could be preserved for
as long as possible, and when pruning became unavoidable, an informed decision of which
to rule out and prune could be made based on a more global examination of the situation
in its entirety.
6.7.3 Fundamental Graph (Global Spectrum-to-Sequence) Approaches
The fundamental graph is a complete picture of all sequences that can possibly be found in
a spectrum, and is the result of an analysis of mass differences between potential families in
the experimental data. In the case of representative spectra, the real sequence is guaranteed
to be present as a complete path in the graph, making the fundamental graph a natural and
attractive tool for allowing peak mass relationships to guide candidate sequence prediction.
Even within this global context, there is a local as well as a global way to use the information
in a fundamental graph.
Keep in mind that scores are typically associated with each node of a fundamental graph,
and these are distinct from the correlation scores. Node scores indicate the degree to which
a supposed fundamental is potentially supported by the data: the more redundancy, the
higher the node score, and the more likely it is legitimate and part of the real sequence.
Note that the approaches in the following sections do not all use the same fundamental
graph but some variation of the fundamental graph concept (different fragment types are
chosen as the fundamental, different f-generating roles are used, [SB88, Bar90] do not explicitly
construct the edges, etc.).
Local Fundamental Graph Approaches
The beginnings of the fundamental graph idea can be found in [SB88]. Siegel and Bauman
transform a FAB MS/MS spectrum into a "reconstructed spectrum", which is simply a list
of fundamentals 9 (the nodes of the graph without the edges). Their approach is rather
complex and involves the costly pre-computation of tables containing the mass of every
ion possible from all permutations of all residues likely to be in the peptide's sequence.
Mass differences of neighboring reconstructed peaks are then computed and these tables
are consulted to find subsequences that could account for them. If any complete path
is found, the subsequences of each segment of the path can be permuted and then
concatenated together to form a complete candidate guess. This approach generates a
great number of possible guesses, and was found to be impractical for spectra with more
than about 20 peaks [Bar90].
Hines et al. [HFBG92] use a pattern-based approach to identify likely fundamentals, and
Scarberry et al. [SZK95] employ neural networks, well suited for pattern recognition tasks,
to classify experimental peaks by likely fragment types, from which the fundamental list
can be computed. As with the local spectrum-to-sequence approaches, partial sequence
guesses are extended one residue at a time, except here, the fundamentals [SZK95] and the
fundamental graph [HFBG92] are used in place of the experimental spectrum to determine
which residues to try.
9 Their fundamental (in [SB88]) is not a single fragment type, but a set of fragment types, namely, all prefix series
ions.
6.7.4 Global Fundamental Graph Approaches
A more global use of a fundamental graph involves some computation on the entire graph
before the most likely answer(s) is chosen. Examples of graph algorithms in this category
might be dynamic programming algorithms that find the longest or the heaviest path.
In Bartels' approach, the fundamental graph nodes are scored based on probabilities (from
[SB88]) associated with the fragment types of each node's supporters. Scores are propagated
through the entire graph from one node to the next if the nodes are a basic residue apart.
At each fundamental, two scores are kept, one for each of two passes made through the graph: once when scores are propagated from higher mass nodes to lower mass nodes, and once in
the reverse direction. The sum of these scores is meant to be indicative of "how well the
best explanation will be if the corresponding mass is included in the interpretation". The
highest scoring nodes then best explain the spectrum, but no further details are available
in [Bar90].
Fernandez-de-Cossio et al.'s MSEQ ([FdCGB95] for high energy CID, [FdCGS+99] for low
energy CID and PSD) and SeqMS [FdCGB+98] algorithms are both based on the work of
Bartels. Scores depend on probabilities that are different from those of Bartels, and SeqMS
takes more fragment types into account. A graph algorithm (e.g. a variation of Dijkstra's
single source shortest path algorithm) is used to find the maximum scoring path from the
base fundamental to every fundamental, and this information is used to enumerate the best
scoring paths through the entire graph.
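To make the idea of a maximum scoring path concrete, here is a small Python sketch. It is not the MSEQ or SeqMS algorithm: instead of a Dijkstra variant it uses plain dynamic programming over nodes sorted by mass, which is sufficient because every edge of a fundamental graph points from a lighter node to a heavier one. The node scores, the 0.02 Da edge tolerance and the function name `best_path` are assumptions.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
NEG_INF = float("-inf")


def best_path(nodes, tol=0.02):
    """nodes: (mass, score) pairs; lightest is the base, heaviest the terminal node.

    Returns (total score, sequence) of the best base-to-terminal path, or None.
    """
    nodes = sorted(nodes)
    best = [NEG_INF] * len(nodes)
    back = [None] * len(nodes)
    best[0] = nodes[0][1]
    for j in range(1, len(nodes)):
        for i in range(j):
            if best[i] == NEG_INF:
                continue
            diff = nodes[j][0] - nodes[i][0]
            # An edge exists when the mass difference equals one residue mass.
            # Isobaric residues share a mass, so the first match (L for L/I) is used.
            label = next((aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol), None)
            if label is not None and best[i] + nodes[j][1] > best[j]:
                best[j] = best[i] + nodes[j][1]
                back[j] = (i, label)
    last = len(nodes) - 1
    if best[last] == NEG_INF:
        return None
    labels = []
    while back[last] is not None:
        last, label = back[last]
        labels.append(label)
    return best[-1], "".join(reversed(labels))


if __name__ == "__main__":
    # Prefix masses of GASV plus one spurious node at 71.03711 (an "A first" reading).
    graph = [(0.0, 0.0), (57.02146, 3.0), (71.03711, 1.0),
             (128.05857, 2.0), (215.09060, 2.0), (314.15901, 1.0)]
    print(best_path(graph))
```

The spurious node opens an alternative route (A then G), but the better supported node at 57.02146 wins once scores have been accumulated over the whole graph.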
Sherenga [DAC+99b, DAC+99a] also uses graph theory to find a maximum scoring path,
and it is the only approach to our knowledge that presents a more rigorous argument for why
the sequence for which their scoring function is maximal is the best choice for an answer.
Interestingly, unlike all previous algorithms, which are designed for a particular type of
spectrum with a known set of common fragment types, theirs is also instrument-independent.
It is theoretically able to process any type of spectrum by examining mass spectrum samples
produced by some ionization technique/mass spectrometer combination, and by predicting
the commonly occurring fragment types for this machine.
An algorithm by Chen et al. [CKT+00] uses a dynamic programming algorithm to enumerate all paths in their version of a fundamental graph, but care is taken to avoid paths
that visit related fundamentals. This is feasible because the B and Y ions are the only
f-generating roles considered, keeping the size of the graph and the ensuing search small.
However, if these ions are not the dominant fragment types in the experimental spectrum,
then the search may not produce the correct answer.
This chapter surveyed a range of solutions for peptide sequencing from mass spectra. Of
these, the computational approaches are most applicable to the sequencing of novel unsequenced peptides. Several de novo MS/MS algorithms exist, but none of them are widely
used [Man98], perhaps due to interest in more interactive forms of analysis [Bar98] and
low confidence in the predicted answers. Appendix E chronicles some of the approaches
we investigated, Chapter 7 discusses what we learned from them, and subsequent chapters
propose a new global sequence-to-spectrum strategy for de novo MS/MS sequencing.
Chapter 7
Observations and Issues
We make several observations regarding the input spectrum and the scoring function in this
chapter. These are issues that an algorithm designer should be aware of and no doubt, other
researchers have also encountered them in their study of the protein sequencing problem.
Some of them may have been touched upon in other parts of this thesis, but here, we focus
on their effects on sequencing attempts.
7.1 Spectrum-Related Issues
7.1.1 Gaps
Missing peaks can lead to degeneracy and uncertainty in the predicted sequence. Enough
missing peaks can cause algorithms which iteratively grow sequence guesses based on supporting experimentals to prematurely terminate. In an attempt to bridge a gap, fundamental
graph approaches often deliberately introduce dipeptide and tripeptide edges in the
fundamental graph. The success of these bridging edges is limited because the existence
of a gap, the size of a gap and its location are indeterminable from the spectrum. Longer
bridging edges may allow for better sequence recovery in the presence of large gaps, but
this increase in edge population, even with the addition of dipeptide bridging edges only,
can lead to an explosive number of possible sequences to consider.
If, on the other hand, dipeptide/tripeptide bridging edges are not included in the construction of the graph, then another problem can occur. Assume for example that there is a gap
in the dataset because some family of the correct sequence is not represented, but there is an
alternate path already in the graph that bridges the gap. This bridge may act as a detour,
leading the path back onto the path of the real sequence (defining a sequence that is the
real sequence with a short substitution), but it may also diverge to a completely separate
path (defining a sequence whose prefix or suffix is the same as that of the real sequence). Since
no "dead end" is encountered, the algorithm would not even realize that a gap was indeed
present.
7.1.2 Immonium Interference
Immonium ions happen to have the same mass as the first A ion, so often, the node representing
the base fundamental has a large fanout. Algorithms which work from the N-terminus
to the C-terminus are immediately faced with a large number of possibilities for the first
residue, and may choose an incorrect one. The earliest opportunity for such an algorithm
to recover from this mistake is at the third residue.
One solution is to reverse the edges of the fundamental graph and work from the C-terminus
to the N-terminus. Another alternative is to require complete paths to begin and end with
a dipeptide edge so that the first two and the last two residues are temporarily unresolved.
7.1.3 Mistaken Identities
A case of mistaken identity occurs when an experimental peak is interpreted incorrectly
and assigned an identity other than its real one. This false identity can potentially support
a sequence other than the actual one, particularly if it is of high intensity, thereby diverting
the algorithm onto the wrong path.
7.1.4 Under-/Over-Represented Families
Families can suffer from under-representation and over-representation. A gap is the extreme
case of an under-represented legitimate family, and algorithms that construct sequence
guesses by extending partial sequences one residue at a time [IN86, JB89, YGHZ91, SZK95]
will run into a dead end since a gap offers no supporting evidence for continued expansion
of a partial sequence.
Families that are not totally devoid of members, but are under-represented face similar
problems. Poor representation can lead to low scores and prefix pruning can remove the
partial sequence from the running, even though the algorithm may later reach a portion of
spectra that is more redundant and richer in supportive fragments.
Over-representation of a legitimate family, namely redundancy of real fragments, is desirable. However, over-representation of an illegitimate family can be problematic, especially
if the bogus fundamental has a large number of supporters, many of which might be due to
mistaken identities. Such extraneous peaks can lead the algorithm astray by artificially
elevating the standing of incorrect candidate sequences so that the search may prune away
the correct path.
7.1.5 Experimental Peak Heights
Some approaches normalize/scale the heights of experimental peaks within a spectrum.
Others classify them into weak/medium/strong. When multiple supporters for a fundamental
exist, scoring functions often sum together their abundances. Is there a different
view of peak heights or a more intuitive way of taking them into account in the computation
of the score? Here is one view:
Peak intensities depend strongly on the physical and chemical properties of the
analytes, so that it would be rash to assume that the more intense peaks were
more "valid" than the weaker ones [but] because MS/MS spectra tend to exhibit
much higher levels of apparently random noise, often a peak at every mass, it
becomes essential for peaks to be selected on the basis of intensity. [PPCC99]
7.1.6 Mass Tolerances
A fragment ion of mass $m$ seldom appears as a peak centered at $m$ but rather at $m \pm \epsilon$
for some $\epsilon > 0$. Mass inaccuracies almost always occur in experimental mass measurements,
and it is commonly assumed that a mass is within 0.5 daltons of the actual value ([SZK95,
HFBG92], $\epsilon$ in [DAC+99b]).
When creating a fundamental graph or merging multiple datasets, algorithms will need to
merge two nodes together if they correspond to the same peak. Dancik et al. write: "If we do
not merge vertices that correspond to the same partial peptide, we will interpret meaningful
peaks as noise. On the other hand, if we merge vertices that do not correspond to the same
peptide, we may interpret noise as meaningful peaks" [DAC+99b]. Their merging algorithm
[DAC+99b] combines fundamentals that are believed to correspond to the same fragment
ion. But peak merging can be tricky; one difficulty with these algorithms is deciding when
to cluster peaks together - e.g. which peaks should be merged if there are 3 peaks
$p_1$, $p_2$ and $p_3$ such that $p_2 - p_1 < \epsilon$ and $p_3 - p_2 < \epsilon$ but $p_3 - p_1 > \epsilon$? Indeed, a patch was
necessary [DAC+99a] since two peaks that were originally an amino acid distance apart
may not be, after being separately merged with other peaks.
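The three-peak dilemma can be reproduced with a naive chain-merging rule. The Python below is only an illustration of the difficulty and is not the merging algorithm of [DAC+99b]; the function name `chain_merge` and the example masses are made up.

```python
def chain_merge(peaks, eps):
    """Merge each peak into the current cluster if it lies within eps of the cluster's last peak."""
    clusters = []
    for p in sorted(peaks):
        if clusters and p - clusters[-1][-1] < eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


if __name__ == "__main__":
    eps = 0.5
    clusters = chain_merge([100.0, 100.4, 100.8], eps)
    span = clusters[0][-1] - clusters[0][0]
    # Neighbours are closer than eps, yet the merged cluster spans more than eps.
    print(clusters, span > eps)
```

Because "within $\epsilon$" is not a transitive relation, pairwise merging can silently fuse peaks that are farther apart than the stated tolerance.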
In short, some means is needed for telling when two unequal mass values are close enough
that they actually refer to the same theoretical peak. We address this issue with the use of
checkpoints.
Checkpoints
Molecules are composed of atoms, and atoms are in turn comprised of neutrons, protons and electrons.
Each of these fundamental particles has a particular mass: $m_n = 1.008664904$ amu,
$m_p = 1.00727647$ amu and $m_e = 0.0005485799$ amu (note, 1 amu = 1
Da). The mass of every molecule must then be a linear combination $m_n N_n + m_p N_p + m_e N_e$
where $N_x$ represents the number of $x$ particles. If we assume that every proton is matched
with an electron 1 so that $N_p \approx N_e$, then $N_p m_p + N_e m_e$ is approximately $N_p (m_p + m_e)$, and
we can combine the proton and electron masses together. Since a neutron and a proton are
roughly equal in mass, we use the average of $m_n$ and $(m_p + m_e)$ as an approximation of a
quantum of mass called a "checkpoint", a notion suggested by F Tom Leighton. As a result,
the mass of every molecule must be very close to some integer multiple of the checkpoint,
and the range of possible masses is not continuous but discrete.
1 MALDI ions are assumed to be singly-charged, so here, there is one more proton than electrons, but the
weight of an electron is so small that we can ignore the difference.
(Note that it is actually not quite this simple - when protons and neutrons combine to
produce the nucleus of an atom, some mass is lost in the form of energy. The total mass is
then the sum of the masses of the individual nucleons, less this fusion energy. We computed
the average fusion energy loss per nucleon for each amino acid, weighted over the frequency
of all amino acids. This value was found to be 0.0077143251 amu 2 and thus, a checkpoint
is basically $\frac{m_n + m_p + m_e}{2} - 0.0077143251$ amu.)
The checkpoint turns out to be a convenient "unit" of mass. Any mass calculations that
we perform are done in checkpoint units, and our algorithms round the masses of all experimental peaks to the nearest checkpoint. This imposes an assumption that any errors
in mass must be less than one half the distance between two checkpoints (a value slightly
greater than 0.5Da).
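As a worked illustration, the checkpoint value and the rounding step can be computed directly from the constants quoted above. The Python names below (`CHECKPOINT`, `to_checkpoints`) are assumptions introduced for this sketch, and the sample masses are arbitrary.

```python
M_N = 1.008664904        # neutron mass in amu, as quoted in the text
M_P = 1.00727647         # proton mass in amu
M_E = 0.0005485799       # electron mass in amu
FUSION_LOSS = 0.0077143251   # average loss per nucleon, as quoted in the text

# Average of the neutron mass and the combined proton + electron mass,
# less the average fusion loss per nucleon.
CHECKPOINT = (M_N + M_P + M_E) / 2 - FUSION_LOSS


def to_checkpoints(mass):
    """Round a mass in Da to the nearest integer number of checkpoints."""
    return round(mass / CHECKPOINT)


if __name__ == "__main__":
    print(f"checkpoint = {CHECKPOINT:.8f} amu")
    # Two measurements of the same glycine-sized difference land on one checkpoint.
    print(to_checkpoints(57.02146), to_checkpoints(57.3))
```

With these constants the checkpoint comes to about 1.00053065 amu, so half a checkpoint is about 0.50027 Da, consistent with the tolerance stated above.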
7.2 Scoring Function Issues
The scoring function is a critical component of any sequencing algorithm.
7.2.1 Uses of a Scoring Function
All approaches are dependent on a good scoring function, and in the literature, scoring
functions are used in several ways:
1. to evaluate the amount of supportive evidence for a particular peak - in the fundamental graph approach, this corresponds to node and/or edge scores,
2. to evaluate and rank partial paths through a graph comprised of these nodes and
edges, and
3. to correlate a complete sequence guess against the experimental spectrum.
2 This was calculated in the following manner: (1) Let $A = \{C, H, N, O, S\}$, the set of elements found in
proteins. Compute the fusion loss of atom $a \in A$ as the difference between the nuclear mass of its components
and the published mass less the mass of any electrons, i.e. $\Delta f_a = (N_p m_p + N_n m_n) - (m_a - N_e m_e)$, where
$N_x$ denotes the number of $x$ particles in atom $a$ and $m_a$ is the published mass of $a$. (2) For each amino acid
$r$, calculate the total fusion loss of all its atoms as $\Delta f_r = \sum_a \Delta f_a N_a$. (3) The fusion loss of a residue per
nucleon is then given by $\Delta f_r / (N_p + N_n)$, and (4) the average fusion energy per nucleon, weighted over all
residues, is: $\mathrm{Avg}(\Delta f_r / (N_p + N_n) \cdot freq_r)$.
Scoring Fundamentals
The score of a fundamental node is intended to serve as a measure of how likely the
fundamental is to be a correct and legitimate one, and hence the more redundancy, the better the score.
Node scores are often based on such factors as the number of supporter
peaks [SB88, HFBG92], the intensity of each supporter peak [SB88, IN86, HFBG92] and
the fragment type role of each supporter peak [FdCGS+99].
But one drawback with node scores is that they are often over-encompassing and can be
misleading - the score is a combined score for all possible interpretations of supporting peaks
in the spectrum. Not all of these interpretations may be compatible and simultaneously
correct, so the node score can be artificially higher than is warranted.
Scoring Partial and Complete Paths
The score of a (partial or complete) sequence is a measure of how well the guess accounts for
the observed data, and can be used to rank a pool of candidates. This is useful for deciding
which partial path to expand next, or which to prune away. However, any such threshold
or cutoff process introduces the risk that the path of the real sequence may be discarded,
particularly when the real sequence may score poorly at first due to gaps/underrepresented
peaks, but improve later on.
Scoring functions for paths can take into account peak intensity [ZTEB90, HFBG92], the
number of matched experimentals, series continuity [ZTEB90, FdCGS+99] and even sequence
length [ZTEB90]. Most account for series ions only, but some account for internal
ions [JB89, FdCGS+99] and immoniums [SZK95, HFBG92]. There has, however, been no
comprehensive systematic study of how to account for each factor - Should they contribute
equally? If not, what weighting function(s) should be used?
7.2.2 Theoretical Peak Heights
Oftentimes, the scoring process involves generating theoretical spectra, and the simplest
approach is to predict the daughter masses only. A theoretical spectrum and an experimental one can match well if they have a lot of fragment masses in common. This seems
promising, but the theoretical mass peaks, both strong and weak alike, are treated equally,
and scoring functions can compare fragment masses, but they are not able to compare peak
intensities.
There is another dimension of information available from an experimental spectrum that
can be harnessed - in addition to predicting fragment masses, it would be useful to predict
fragment abundances that are indicative of a fragment's relative likelihood to occur. Were
this possible, then not only would a high overlap in fragment masses matter, but also the
general "shape" of the theoretical and experimental spectra.
7.2.3 Award/Penalty System
When comparing experimental and theoretical spectra, peaks fall into one of three categories:
(I) matched peaks - peaks that are predicted and observed, (II) unaccounted
experimental peaks - peaks that appear in the experimental but are not expected in the
theoretical (and consequently attributed to noise), and (III) absent theoretical peaks - peaks
that are present in the theoretical but not in the experimental.
Scoring functions in the literature have largely considered Category I peaks only. Awarding for matches gives a feel for how well an experimental spectrum agrees with what is
theoretically expected, but it is only a part of a bigger picture. The candidate with the
highest match score can still be a poor candidate if there are many unaccounted peaks in
the experimental spectrum.
[Figure 7-1: Theoretical Masses and Experimental Peaks: Region A contains those theoreticals that are absent from the experimental spectrum, Region B contains matched peaks
and Region C contains the unaccounted experimentals.]
Dancik et al. consider both Categories I and II by awarding for matches and penalizing for
unaccounted experimentals. We submit that penalizing for Category III is also important.
The real sequence and a longer sequence guess that is only marginally different (e.g. an
N in the correct sequence is replaced by the dipeptide GG) can have the same bonus for
Category I and the same fine for Category II, but a slightly different Category III score,
which may be enough to distinguish the two sequences from each other.
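The N versus GG argument can be made concrete with a small sketch. Everything here is an assumption chosen for illustration: the award and fine weights, the tolerance, and the simplification that a "theoretical spectrum" is just the list of cumulative prefix residue masses (real ion masses carry additional offsets).

```python
# Rounded monoisotopic residue masses; N and GG have the same mass.
RESIDUE_MASS = {"G": 57.02, "A": 71.04, "N": 114.04}


def prefix_masses(seq):
    """Cumulative residue masses, used here as a stand-in for theoretical peaks."""
    total, out = 0.0, []
    for residue in seq:
        total += RESIDUE_MASS[residue]
        out.append(total)
    return out


def award_penalty_score(experimental, theoretical,
                        award=1.0, fine_ii=1.0, fine_iii=0.5, tol=0.5):
    """Award Category I matches; fine Category II and Category III peaks."""
    def near(x, masses):
        return any(abs(x - y) <= tol for y in masses)

    n_i = sum(near(t, experimental) for t in theoretical)        # matched
    n_ii = sum(not near(e, theoretical) for e in experimental)   # unaccounted
    n_iii = len(theoretical) - n_i                               # absent
    return award * n_i - fine_ii * n_ii - fine_iii * n_iii


observed = prefix_masses("ANA")            # pretend the real sequence is ANA
real_score = award_penalty_score(observed, prefix_masses("ANA"))
guess_score = award_penalty_score(observed, prefix_masses("AGGA"))
```

In this toy case both candidates earn the same Category I bonus and the same Category II fine; only the extra theoretical peak introduced by GG, which falls into Category III, separates them.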
7.2.4 Accounting for Disallowed Variants
Recall that the score of a fundamental depends on its set of supporters, that the score of a path
is a function (often the sum) of the scores of its individual component fundamentals, and that a complete path
through a fundamental graph defines a complete peptide sequence. Including series variants
in the set of f-generating roles is beneficial, but these ions are sequence dependent. Path
scores may require adjusting to remove the contribution of variant supporters that turn out
to be prohibited by the path's sequence.
7.2.5 Accounting for Internals
For the same reason of path dependence, it is difficult to account for supporting internal
ions in a fundamental graph. Internal ion contributions should be included in the score of
a complete path only when all fundamental nodes necessary for defining the partial path
corresponding to the internal ion are visited by the complete path being scored.
7.2.6 Fragment Type Frequencies
The ability to quantify the tendency that a particular fragment type occurs in spectra
can be useful, and several approaches attempt to do this (see Table 6.1). These fragment
probabilities are useful for predicting peak heights in theoretical spectra [YEMS95, TJ97],
but they are more often used to calculate the influence of supporting experimental peaks
in different f-generating roles when scoring fundamentals in a fundamental graph [Bar90,
DAC+99b, DAC+99a, TJ97, FdCGB95, FdCGS+99, FdCGB+98].
These fragment type frequencies are often arbitrarily chosen or set at values that are compatible with simple observations of fragment type likelihoods in spectra - e.g. the tendency
that an A, B or Y series ion occurs appears greater than that for an internal ion. This
results in a simple but very rough initial approximation of the relative frequencies which,
in reality, are likely to depend on other factors such as the peptide sequence itself.
Analyses of experimental spectra can lead to improved estimates of these frequencies.
Fernandez-de-Cossio et al. [FdCGB95] arrive at probabilities based on "hundreds of CAD
spectra" but details are not included in their paper. Recently, Dancik et al., in an effort
to make their algorithms instrument-independent, describe a means for inferring fragment
types expressed by a particular MS instrument and deriving probabilities for them based
on experimental spectra of known sequences gathered on the instrument [DAC+99b].
A better understanding of fragmentation could lead to more precise fragment type frequencies and scores that reflect supporter fragment type influences more accurately.
Chapter 8
Approach
8.1 Solution Schematic
The approaches that we have considered so far, those in the literature (Section 6.7) as well
as those in our own explorations (Appendix E), have a common structural theme of two
basic modules: one for guess evaluation and another for guess generation(see Figure 8-1).
The former serves as a measure of how good a guess is and thus requires access to the input
spectrum, while the latter is the source of these guesses. Variations of this schematic exist:
some approaches use the input spectrum to aid in guess generation and others use the score
of the current guess as a feedback mechanism for choosing the next guess(see dotted line of
Figure 8-1).
Fundamental graph-based approaches transform the input spectrum into a fundamental
graph which can then be used in the guess generation process. One could easily imagine
additional dotted arrows representing other variations of this schematic - e.g. the
use of the fundamental graph as a means for generating guesses by enumerating some path
through the graph.
This chapter describes an implementation for each of these algorithmic components. Our
evaluation module consists of a simple model for fragmentation and a scoring function based
on this model. Despite the fact that our understanding of the rules governing fragmentation
is incomplete (Section 3.2), we begin with the development of a simple probabilistic model
for a trial, the fragmentation of a single peptide molecule. Using this model, all possible
outcomes of a single trial can be predicted from the peptide's sequence, and in addition, a
probability can be associated with each outcome. Once we have a probability distribution
describing the outcomes of a single trial, if we view a tandem mass spectrum as a summary of
the outcomes of a number of independent single trials, then we can compute the probability
that a particular set of outcomes could have occurred. Given this, we have a means for
scoring a sequence guess against an experimentally observed spectrum.
[Figure 8-1: Schematic of Key Algorithmic Components. The diagram shows the Guess Generation and Guess Evaluation modules, the input spectrum S and a sequence guess g = r1 r2 r3 ...]
Our generation module involves a simulated annealing search strategy that uses this scoring
function to efficiently explore a space of possible sequence guesses with the objective of
finding the optimal scoring sequence.
8.2 Modelling the Fragmentation of a Single Molecule
We begin by examining the fragmentation of a single peptide molecule p. A simplified view
of the process is as follows:
1. a parent molecule is protonated, becoming positively charged,
2. the parent ion may then fragment into any number of pieces,
3. only the piece retaining the charge is capable of being detected and, if detected, would
serve as this molecule's contribution to the mass spectrum.
We develop a fragmentation model incrementally by beginning with a very simple one,
incorporating various improvements into each successive model, and ending with a model
that is more complex but also more reflective of what actually happens. Each model can
be described by a fragmentation tree, a decision tree where the root node represents the
intact parent molecule and the leaves represent all possible outcomes that a single parent
molecule can produce when fragmented under a particular model. Edges represent certain
decisions in the fragmentation process and probabilities are associated with every edge. A
path from the root to a leaf of the tree describes the series of events responsible for the
ion outcome specified at the leaf. The probability of this outcome is the product of the
probabilities of all edges in this path (so the order in which these events happen does not
affect the probability of the outcome).
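As a minimal sketch of this idea, a fragmentation tree can be represented as nested lists of (edge probability, child) pairs and the leaf probabilities obtained by multiplying along each path. The representation, the function name and the toy edge probabilities below are assumptions made for illustration only.

```python
def leaf_probabilities(tree, path_prob=1.0):
    """Return {leaf label: probability} for a fragmentation tree.

    tree: list of (edge_probability, child) pairs, where child is either
    a leaf label (str) or another list of the same form.
    Leaves that share a label have their probabilities added together.
    """
    out = {}
    for edge_prob, child in tree:
        if isinstance(child, str):
            out[child] = out.get(child, 0.0) + path_prob * edge_prob
        else:
            for leaf, p in leaf_probabilities(child, path_prob * edge_prob).items():
                out[leaf] = out.get(leaf, 0.0) + p
    return out


# Toy tree: either no break, or a break after D followed by a bond-type
# decision and, for the peptide bond, a prefix/suffix retention decision.
toy_tree = [
    (0.5, "M+H DRVY"),
    (0.5, [
        (0.25, "A ion D"),
        (0.75, [(0.5, "B ion D"), (0.5, "Y ion RVY")]),
    ]),
]
probs = leaf_probabilities(toy_tree)
```

Because the edge probabilities leaving every node sum to one, the leaf probabilities of the whole tree also sum to one.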
8.2.1 Model I: Modelling Series Ions
The simplest model for a trial involves at most one single cleavage event so that the ion
that results is either a prefix (an Aion or a Bion), a suffix (a Yion) or the original molecule
(M+H) left intact. The Model I fragmentation tree for the short peptide DRVY is illustrated
in Figure 8-2. The generalization to any arbitrary peptide can be easily made.
A break can occur at any break position $p_i$ of the peptide with some probability $P_{p_i}$ such that
$\sum_i P_{p_i} = 1$. For all internal break positions $p_i$ ($1 < i < n$ where $n$ is the length of sequence),
there are two possible bonds along the peptide backbone that can be cleaved (the C$_\alpha$-CO
bond and the peptide bond) with probabilities $P_{ax}$ and $P_{by} = 1 - P_{ax}$, respectively (footnote 1). The
type of series ion that results depends on which bond is broken and which piece is retained:
cleavage of the C$_\alpha$-CO bond (see Figures A-1 and 3-1) produces an A ion only (X ions are
not prevalent in MALDI-PSD), while cleavage of the amide bond produces a B ion with
probability $P$ and a Y ion with probability $(1 - P)$.
Footnote 1: There is a third bond in the backbone, the NH-C$_\alpha$ bond, but cleavage of this bond produces the C and
Z series ions which are not prevalent in MALDI-PSD spectra.
[Figure 8-2: Model I: Basic Fragmentation Tree. A single break produces only prefixes and
suffixes. The tree for the parent ion DRVY branches on break position (0 to 5), then on bond type (A/X or B/Y), then on prefix or suffix retention; its leaves include M+H DRVY, the A and B ions of D, DR, DRV and DRVY, and the Y ions RVY, VY and Y.]
Note that the C-terminal group is treated as if it were a residue in that it is flanked by break
positions. Cleavage can thus occur on either side of the C-terminal group: a break on the
N-terminal side leads to a prefix ion for DRVY only, while a break on the C-terminal side,
called a "non-break", implies that the molecule remained intact. A "non-break" can also
occur at the N terminus at break position 0. Unlike the C-terminal group, the N-terminal
group was not generalized (although this can be easily done) because a break on either side
of a hydrogen N-terminal group would have produced the same structure.
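The Model I outcomes for a short peptide can be enumerated with a sketch like the following. Several choices here are assumptions, not taken from the text: break positions are treated as equally likely, the parameter names `p_ax` and `p_prefix` and their default values are hypothetical, and a B/Y cleavage next to the C-terminal group is assumed to always yield the B ion.

```python
def model_one_outcomes(seq, p_ax=0.3, p_prefix=0.5):
    """Enumerate single-cleavage outcomes and their probabilities."""
    n = len(seq)
    p_pos = 1.0 / (n + 2)            # break positions 0 .. n+1, assumed uniform
    out = {}

    def add(label, prob):
        out[label] = out.get(label, 0.0) + prob

    # Positions 0 and n+1 are "non-breaks": the molecule stays intact.
    add("M+H " + seq, 2 * p_pos)
    for i in range(1, n + 1):
        prefix, suffix = seq[:i], seq[i:]
        add("A ion " + prefix, p_pos * p_ax)
        if suffix:
            add("B ion " + prefix, p_pos * (1 - p_ax) * p_prefix)
            add("Y ion " + suffix, p_pos * (1 - p_ax) * (1 - p_prefix))
        else:
            # Assumption: beside the C-terminal group only a prefix ion results.
            add("B ion " + prefix, p_pos * (1 - p_ax))
    return out


outcomes = model_one_outcomes("DRVY")
```

For DRVY this yields the intact M+H ion, four A ions, four B ions and three Y ions, matching the kinds of leaves listed for the Model I tree.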
8.2.2 Model II: Modelling Internal Ions
A single cleavage event was enough to generate prefix and suffix series ions. In order to
generate internal ions, however, two cleavage events must occur. The Model II fragmentation
tree, then, is basically the same as the Model I fragmentation tree, except every leaf is
allowed to undergo a second round of breaking.
When a prefix (suffix) undergoes a second cleavage event, it can produce another prefix (suffix) or an internal ion. If a molecule survived the first break intact (a non-break),
then the products of its second round of breaking resemble a Model I tree. A partial Model
II fragmentation tree for DRVY is shown in Figure 8-3.
Some remarks about Model II are in order:
Multiple Pathways for the Same Ion: There can be several leaves in the fragmentation
tree that represent ions with the exact same identity. This means there may be
multiple pathways (either a different set of events or a different ordering of the same
events) to generate the same fragment ion. For example, there are two ways depicted
in Figure 8-3 to produce the YA internal RV ion. In general, there are more ways to
produce shorter peptides than longer ones.
Different Ions With the Same Mass: The resulting fragment ions may not have unique
masses; there can be two ions with different identities that have the identical mass.
Peaks at these masses are said to have multiple identities.
A Ion Formation: Our model allows for the possibility of a B ion, formed from the first
break, to produce an A ion after the second break. In reality, there is no proven
evidence of this, or of the contrary.
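The remark about multiple pathways can be checked by counting, for each fragment, how many two-round event sequences lead to it. This sketch deliberately simplifies the model: it ignores bond type (A/X versus B/Y) and tracks only which stretch of residues is retained, so the counts are pathway counts for residue spans rather than for fully typed ions. The function names are hypothetical.

```python
from collections import Counter


def one_round(span):
    """All results of one round on span (start, end): stay intact, or cut and keep one side."""
    start, end = span
    results = [span]                      # non-break
    for cut in range(start + 1, end):
        results.append((start, cut))      # retain the N-terminal side
        results.append((cut, end))        # retain the C-terminal side
    return results


def pathway_counts(seq, rounds=2):
    """Count event sequences leading to each retained fragment.

    Fragments are keyed by their residue string, which is unambiguous
    only when the peptide has no repeated substrings (true for DRVY).
    """
    spans = [(0, len(seq))]
    for _ in range(rounds):
        spans = [nxt for span in spans for nxt in one_round(span)]
    return Counter(seq[a:b] for a, b in spans)


counts = pathway_counts("DRVY")
```

Under these simplifications the internal fragment RV is reached by two pathways, and the prefixes D, DR and DRV are reached by four, three and two pathways respectively.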
8.2.3 Model III: Modelling Variants
The side chains of certain amino acids can lose ammonia/water or gain water. Model III
addresses the ability of a fragment to exhibit a neutral loss/gain variant according to the
following assumed rules:
* a peptide containing S, T, D or E anywhere can lose water
* a peptide containing R anywhere can lose ammonia
* a peptide containing R, K or H can gain water only if the R, K or H residue is the
C-terminal residue
[Figure 8-3: Model II: A partial Model II fragmentation tree showing the second stage of
cleavage for two Model I leaves. When two breaks are possible, immonium and internal ions
are added to the repertoire of fragment types.]
Note that in our model, we have selected and included some of the more agreed upon rules
in the literature for variants and when they are allowed. A sampling of the many other
observations, which have not been incorporated into our model, includes:
losing ammonia: N, Q, K also may lose ammonia [str95, FHM+93]; K also may lose ammonia [CLS99]; an N-terminal
Q may also lose ammonia [Hin97].
losing water: only S, T lose water [CLS99]; S, T, D, E are more prone to lose water if they
are the ultimate or penultimate residues [Hin97]
gaining water: Bp18 ions only occur for the (n - 1)st and (n - 2)nd B ions (i.e. the B ion of $f_{n-1}$
and the B ion of $f_{n-2}$) where n is the length of the peptide [BC].
Since variants are residue dependent, at least one variant-permitting (or v-permitting)
residue must be present (at the required location if applicable) in order for the corresponding
variant to be even possible. The number of instances of each v-permitting residue
is assumed to affect how likely the variant will occur as well. Let $v_{m18}$, $v_{m17}$, $v_{p18}$ be the
number of v-permitting residues for each variant. If a variant $x$ also has associated with it
a "per-instance" likelihood of occurring, $\mathit{tendency}_x$, then the probability that variant $x$ is
expressed is given by:
$$\frac{v_x \cdot \mathit{tendency}_x}{v_{m18} \cdot \mathit{tendency}_{m18} + v_{m17} \cdot \mathit{tendency}_{m17} + v_{p18} \cdot \mathit{tendency}_{p18}} \qquad (8.1)$$
The Model III fragmentation tree is the same as its predecessor except each leaf of the Model
II tree can now either appear as is with probability Pnovariant (this probability equals 1 if
no v-permitting residues are present at all), or appear as a variant with the appropriate
probability computed using Equation 8.1.
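A sketch of this decision step, under stated assumptions, is given below. The variant rules are the three listed in this section. How the no-variant probability combines with Equation 8.1 is not spelled out in the text, so scaling each variant's share by (1 - p_novariant) is an assumption of this example, as are the function name and the tendency values.

```python
def variant_probabilities(fragment, tendency, p_novariant=0.5):
    """Probability of each variant (m18, m17, p18) or of no variant."""
    counts = {
        # lose water: S, T, D or E anywhere in the fragment
        "m18": sum(fragment.count(r) for r in "STDE"),
        # lose ammonia: R anywhere in the fragment
        "m17": fragment.count("R"),
        # gain water: only when the C-terminal residue is R, K or H
        "p18": 1 if fragment[-1] in "RKH" else 0,
    }
    total = sum(counts[x] * tendency[x] for x in counts)
    if total == 0:
        return {"none": 1.0}          # no v-permitting residues at all
    # Assumption: Equation 8.1 shares out the remaining (1 - p_novariant).
    probs = {x: (1 - p_novariant) * counts[x] * tendency[x] / total
             for x in counts}
    probs["none"] = p_novariant
    return probs


equal = {"m18": 1.0, "m17": 1.0, "p18": 1.0}
drv = variant_probabilities("DRV", equal)
vy = variant_probabilities("VY", equal)
```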
[Figure 8-4: Model III: Each Model II leaf may express a variant and the decision process is
shown here only for two leaves from Figure 8-3. The leaf B ion DR branches to B ion DR, Bm18 ion DR, Bm17 ion DR and Bp18 ion DR; the leaf YA internal RVY branches to YA internal RVY and YAm17 internal RVY.]
8.2.4 Model IV: Modelling Residue Tendencies
Variants were the first example of residue-dependent fragmentation. Aside from variants,
there is little in the model so far that would differentiate the fragmentation tree of one
peptide from that of another save for differences due to peptide composition and peptide
length in the set of masses produced.
Certain residues are more prone to fragment than others, and certain bonds may be more
likely to break depending upon which residues flank them. Lin and Glish remark that "the
current state of knowledge is insufficient to predict or understand routinely which [of these]
bonds will be broken, and when a bond is broken, which end of the dissociated peptide will
retain the charge." [LG98] Again, the complete picture has not yet been elucidated, but
certain tendencies have been observed.
To incorporate what little is known, we introduce a 22x22 square non-symmetric matrix M (footnote 2)
of break tendencies where each entry M[i, j] represents the likelihood that a bond flanked
by residues i and j will break. This is a value determined by a training process discussed
in Section 9.2.1. Note that for M[i, j], residue i is N-terminal to residue j, so it is not
necessarily the case that M[i, j] = M[j, i].
Matrix M allows us to evaluate the likelihood that a bond will break based on the actual
residues that flank the break. The probability that a break will occur at a particular break
position can be computed by finding the ratio of its likelihood to the sum of the likelihoods
of all bonds in the molecule, so that the break probabilities sum to one. In practice, the actual location of the bond in the peptide
may have an effect as well, but aside from the bonds positioned at the termini, our model
does not account for bond location.
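The normalization can be sketched as follows. For brevity the example covers only the bonds between adjacent residues and leaves out the C-terminal group and the "null" entry; the tendency values, the default of 1.0 for unlisted pairs and the function name are all hypothetical.

```python
def break_probabilities(seq, tendency, default=1.0):
    """Normalize pairwise break tendencies into break probabilities.

    tendency maps (N-terminal residue, C-terminal residue) to a likelihood,
    so tendency[("D", "R")] and tendency[("R", "D")] may differ.
    """
    likelihoods = [tendency.get((a, b), default) for a, b in zip(seq, seq[1:])]
    total = sum(likelihoods)
    return [x / total for x in likelihoods]


toy_tendency = {("D", "R"): 2.0, ("R", "V"): 1.0, ("V", "Y"): 1.0}
p_break = break_probabilities("DRVY", toy_tendency)
```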
8.3 Accounting for Noise
Our models so far have been concerned only with legitimate fragments, but real data is
afflicted with various forms of noise, and any model that purports to realistically explain
experimental data cannot ignore them. The following is a discussion of which noise categories from Chapter 5.2.2 are accounted for in our model, Model V, and which are not:
8.3.1 Physical Noise
Physical-Chemical Properties of Peptides: Matrix M of Section 8.2.4 accounts for
some of the effects the residues themselves may have on the fragmentation process.
Low matrix values correspond to bonds between certain residue pairs that are unlikely
to break. This may lead to an under-represented family in the resulting spectrum.
Other more complex interactions based on chemical properties and residue location,
for example, can exist but are not taken into account.
Footnote 2: In addition to the basic residues, this matrix also includes entries for the C terminal group and a "null"
entity for handling non-breaks, hence 22. Since hydrogen is the N terminal group, we do not include an
entry for the N terminal, but this can be easily generalized if desired.
Impurities: The current model addresses only those outcomes possible from the fragmentation of the desired parent peptide. It is not concerned with fragments of molecules
of origin other than the parent, such as impurities.
If impurities are known to be present, one might account for them by modifying
the concept of a trial to include fragmentation events for other molecules. In our
fragmentation tree, we assumed that the root represented the parent peptide molecule.
Instead, one might envision the root as a decision point that branches to one of several
fragmentation trees, one for each molecule possibility. Thus, with some probability,
the molecule being fragmented is the parent molecule, and its branch would lead to a
fragmentation tree like the ones we have been describing. With some other probability,
a contaminant branch is chosen, leading to the appropriate fragmentation tree for this
other molecule. The probabilities of these choices sum to one and can perhaps be made
dependent on the relative concentrations of each species.
Recall that in order to influence the resulting tandem mass spectrum, the effect of
an impurity must survive the filtration effects of the timed ion selector, so it is not
immediately obvious how to incorporate impurities into the model, which may depend
heavily on when/where/how the impurity is introduced.
8.3.2 Measurement Noise
Mass Inaccuracies: Mass inaccuracies are handled on two fronts: (1) the input experimental
spectrum is subjected to a processing step where the experimental masses
are converted to the nearest checkpoint (see Section 7.1.6), and (2) the masses of the
fragments predicted by the model are given in terms of checkpoint values. Again,
this is acceptable so long as the experimental peaks are correct to within roughly 0.5
daltons.
Height Inaccuracies:
  Undetected Trials: Every ion that is successfully produced has some probability of
  not making it into the final observed spectrum. To model this, each leaf of the
  Model IV fragmentation tree is taken through an additional decision step, where
  with some probability, $P_{unobserved}$, an ion can be lost and unobserved.
  Basal Noise: Low intensity basal noise existing across the entire mass range artificially
  raises the intensities of all peaks in the spectrum. With probability
  $P_{random}$, a trial results in random noise, which means the detector registers it at
  some mass according to some random distribution.
Peak Detection Errors: This is not directly accounted for in our model because this
occurs at the spectrum level, not the trial level which we are modelling - i.e. peak
detection errors occur separately, after the outcomes of all trials have completed and
are tabulated.
If the behavior of the peak detection software were known - how often and with what
distribution it fails to detect a real peak or introduces a spurious peak - one might be
able to incorporate its behavior into the model.
8.4 Assumptions
The fragmentation model that we have presented thus far makes the following assumptions:
* Spectrum Features
  Single Protonation: Only ions with a single positive charge are produced.
  Core Fragment Types: The A, B, Y ions and their loss/gain variants.
  Mass Error: When a trial produces a legitimate ion, the mass that is registered is
  within 0.5 daltons of the actual mass.
* Variant Assumptions
  Residue and Fragment Type Dependence: Variant formation is subject to the
  rules in Section 8.2.3.
  Multiple Loss/Gain Events: Multiple variant events, e.g. the loss of both water
  and ammonia, are theoretically possible, where sequence-permitting, but assumed to be a rare occurrence. We consider only the loss/gain of at most one
  neutral molecule (either water or ammonia).
  Influence of V-permitting Instances: The more instances of a variant-permitting
  residue, the more likely that variant will occur. There has been no formal study
  of this in the literature, but it is reasonable to assume that more instances implies more sites liable to undergo a loss/gain event. However, the actual natural
  probabilities governing this may differ from those used by our model.
The scoring function makes the following assumptions:
Independent Trials: A tandem mass spectrum is the collective sum of a number of independent trials.
The current implementation assumes:
Terminal Groups: H- and -OH are the N-terminal and C-terminal groups respectively
of the parent peptide. The algorithm can be modified so that the masses of both
terminal groups can be specified by the user.
Unmodified Basic Residues: The parent peptide is composed of residues drawn from the
20 basic unmodified residues. Modified amino acids can be accounted for by including
their masses in the list of residues the algorithm knows about.
For the remainder of this thesis, we also assume that our model and its probability distributions hold for the fragmentation of all instances of a specific peptide and across all peptides.
In reality though, there exists some experimental variance in such acquisition conditions as
the laser intensity, the sample concentration, the duration of laser irradiation, etc.
8.5 From Model to Scoring Function
8.5.1 Computing the Probability Mass Function
Once the Model V fragmentation tree for peptide $p$ is constructed, it is simple to derive the
associated probability mass function (PMF) $F_p$ from the set of outcome masses and their
probabilities. The PMF $F_p$ is defined for all checkpoint masses $m$, from 0 to the checkpoint
of the parent, such that $0 \le F_p(m) \le 1$ and $\sum_m F_p(m) = 1$.
The probability of a multiple identity mass is the sum of the probability of each identity.
Basal noise is modelled as a uniform distribution, so the probability of basal noise is added
to the probability of every checkpoint mass.
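The construction can be sketched as below. Two things are assumptions of this example rather than statements about the thesis implementation: checkpoints are approximated by rounding to the nearest integer mass (the actual checkpoint definition is in Section 7.1.6), and legitimate outcomes are scaled by (1 - p_random) so that the result still sums to one. The function name and the toy outcome list are hypothetical.

```python
def build_pmf(outcomes, parent_mass, p_random=0.1):
    """Build a PMF over checkpoint masses 0 .. checkpoint(parent).

    outcomes: list of (mass, probability) pairs for legitimate ions,
    with probabilities summing to one.
    """
    n_bins = round(parent_mass) + 1
    # Basal noise: uniform, so every checkpoint gets an equal share.
    pmf = [p_random / n_bins] * n_bins
    for mass, prob in outcomes:
        # Outcomes landing on the same checkpoint (multiple identities) add up.
        pmf[round(mass)] += (1.0 - p_random) * prob
    return pmf


toy_pmf = build_pmf([(2.1, 0.5), (1.9, 0.25), (4.0, 0.25)], parent_mass=4.0)
```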
8.5.2 Evaluating Sequence Guesses
Recall that a tandem mass spectrum is essentially the aggregate sum of individual independent trials. Let $S = \{s_1, \ldots, s_r\}$ be a tandem mass spectrum with $r$ experimental peaks. Let
$m(s_i) = m_i$ and $h(s_i) = h_i$ be the mass and height respectively of peak $s_i$.
Given a particular sequence $p$, the model predicts a fragmentation tree and PMF $F_p$ for a
single trial. This PMF provides a means for evaluating the likelihood that the fragmentation
of a number of molecules of this sequence could have given rise to $S$. This probability can
be computed using the multinomial distribution:
$$\begin{aligned}
\mathrm{Prob}(S, F_p) &= \binom{N}{h_1} F_p(m_1)^{h_1} \binom{N - h_1}{h_2} F_p(m_2)^{h_2} \cdots \binom{N - \sum_{k=1}^{r-1} h_k}{h_r} F_p(m_r)^{h_r} \\
&= \frac{N!}{h_1! \, h_2! \cdots h_r!} \, F_p(m_1)^{h_1} F_p(m_2)^{h_2} \cdots F_p(m_r)^{h_r}
\end{aligned}$$
where $N$ is the total number of trials (molecules) (footnote 3). We use this as a scoring function for
evaluating different sequence guesses by examining how likely each guess explains $S$, preferring that sequence which maximizes $\mathrm{Prob}(S, F_p)$. It will be convenient to compute the
logarithm of the probability, $\log(\mathrm{Prob}(S, F_p)) = \sum_{i=1}^{r} h_i \log F_p(m_i)$, and then minimize $-\log(\mathrm{Prob}(S, F_p))$ instead. Note that the
$\frac{N!}{h_1! \, h_2! \cdots h_r!}$ term has been dropped because when different
sequence PMFs are scored against the same $S$, this term is constant and can be omitted
with no effect on score comparisons.
Footnote 3: Recall that some number of total molecules $N$ were fragmented, and that a spectrum reports the subset
that are successfully detected.
8.5.3 Scoring Function Maximum
Given a spectrum $S$, how is the scoring function affected by different peptide guesses (and hence
different PMFs)? In particular, if the best sequence guess ought to be the sequence with
the highest probability, when, then, is the scoring function maximal?
The scoring function is maximal when the spectrum resembles the PMF, i.e. the heights
of the observed peaks are proportional to the heights of the PMF distribution. For the
simplest case, when $f(k) = \binom{n}{k} p^k (1 - p)^{n - k}$, a proof of the claim that $f(k)$ is maximal
when $k = pn$ was given by David Stephenson [Ste00] and entails showing that $f(pn - 1) < f(pn)$
and $f(pn + 1) < f(pn)$. In Appendix F, we generalize this result to $f(pn - \delta) < f(pn)$
and $f(pn + \delta) < f(pn)$ for $\delta > 0$, and show that this is true for the multinomial
case as well. The take home message is that if we could model fragmentation perfectly
and if the experimentally obtained spectrum were identical in shape to the calculated PMF
distribution for large enough $N$ (by the law of large numbers), then the real sequence would
be the best scoring.
8.6 Exploring the Search Space
Given a number of sequence guesses, one could find the PMF for each and compute the
probability that each sequence produced the observed experimental dataset. If the real
sequence is present among these guesses and the model is good, then the real sequence will
in fact be the best scoring. The problem we are now faced with is how to generate these
guesses and how to ensure that the correct sequence is among them. If this were possible,
then the scoring function would be able to separate the wheat from the chaff, identifying
the correct sequence from a pool of many wrong ones.
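Ranking a pool of guesses with the scoring function of Section 8.5.2 might look like the following sketch. The negative log form with the multinomial coefficient dropped follows the text; the function name, the toy spectrum and the three toy PMFs are hypothetical and stand in for PMFs that would come from fragmentation trees.

```python
import math


def neg_log_score(spectrum, pmf):
    """Negative log-likelihood of a spectrum under a PMF (lower is better).

    spectrum: list of (checkpoint_mass, height) pairs.
    pmf: mapping from checkpoint mass to a probability greater than zero.
    The constant multinomial coefficient is omitted.
    """
    return -sum(height * math.log(pmf[mass]) for mass, height in spectrum)


spectrum = [(1, 6), (2, 3), (3, 1)]
candidate_pmfs = {
    "guess_a": {1: 0.6, 2: 0.3, 3: 0.1},   # same shape as the spectrum
    "guess_b": {1: 0.1, 2: 0.3, 3: 0.6},   # reversed shape
    "guess_c": {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
}
scores = {name: neg_log_score(spectrum, pmf)
          for name, pmf in candidate_pmfs.items()}
best = min(scores, key=scores.get)
```

The guess whose PMF is proportional to the observed peak heights scores best, which is the behaviour Section 8.5.3 argues for.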
The naive strategy is to enumerate all possible amino acid sequences whose mass equals
that of the parent mass, evaluate the scoring function on each of them and pick the best
scoring one as the sequence prediction. For anything but the smallest peptides, such an
exhaustive approach is computationally infeasible, requiring $O(20^N)$ guesses, where
$N = \lceil (M + H)/R_{max} \rceil$. Here, $(M + H)$ is the parent mass and $R_{max}$ is the maximum basic
residue weight.
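To give a feel for the numbers, the bound can be evaluated directly. The parent mass of 1296.7 is an illustrative value, and 186.08 (the residue mass of tryptophan, the heaviest of the 20 basic residues) is used for the maximum residue weight; the function name is hypothetical.

```python
import math


def exhaustive_guess_count(parent_mass, r_max=186.08, alphabet=20):
    """Return (N, alphabet**N) with N = ceil(parent_mass / r_max)."""
    n = math.ceil(parent_mass / r_max)
    return n, alphabet ** n


n_residues, n_guesses = exhaustive_guess_count(1296.7)
```

Even with N as small as 7, the count already exceeds a billion guesses.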
Spectrum-dependent pruning techniques may help, but if the experimental data is poor,
there is a good chance the real sequence may not survive. Instead, we use an efficient
combinatorial minimization technique called simulated annealing to probabilistically
explore the space of possibilities.
Simulated annealing is a stochastic means for finding the minimum scoring state of a system
through gradual descent [KGJV83, BT93, PTVF95]. It finds its origin in the thermodynamic
process of cooling. When a liquid freezes, it crystallizes; if the temperature is reduced
gradually, the system settles into its minimum energy state and the crystal formed is pure.
With simulated annealing, the algorithm starts at some initial temperature, and with each
iteration, this temperature is gradually lowered according to some cooling schedule. Some
starting sequence is randomly selected, and the search space is explored by making random modifications, called moves, first to the initial sequence and then to all subsequent
sequences. Any move resulting in a better scoring sequence(a lower energy state) is immediately taken. Any move resulting in a higher energy state is allowed with some probability
that depends on the current temperature. These uphill events allow for the search to escape
local minima, but they are less likely to occur as the algorithm progresses (and the temperature drops). Thus, the start of the algorithm is essentially a random exploration of the
space as long range moves have a higher probability of being taken, while more fine tuning is
done at lower temperatures. It has been shown that if the cooling schedule is slow enough,
then the algorithm will converge on the global optimum with high certainty [Haj88].
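The acceptance rule described above is commonly implemented with the Metropolis criterion, sketched below. Using exp(-delta / T) as the uphill acceptance probability is the standard textbook choice; that the thesis implementation uses exactly this form is an assumption, as are the function and parameter names.

```python
import math
import random


def accept_move(delta, temperature, rng=random):
    """Decide whether to take a move that changes the score by delta.

    delta: new score minus old score (negative means an improvement).
    Downhill moves are always taken; uphill moves are taken with
    probability exp(-delta / temperature).
    """
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)
```

At a high temperature almost every uphill move passes this test, while at a low temperature the same uphill move is almost always rejected, which matches the shift from exploration to fine tuning described above.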
The simulated annealing specification for our problem is as follows:
Configuration: A peptide sequence guess is a string of amino acid residues $r_1 \ldots r_n$ such
that $\sum_i m(r_i)$ is constrained to equal the mass of the parent molecule of interest.
Rearrangements: The simulated annealing algorithm can make three different types of
sequence moves (footnote 4):
1. Permutation: randomly rearrange the order of the residues of a random subsequence.
2. Reversal: reverse the order of the residues of a random subsequence.
3. Substitution: replace a random subsequence with a different random subsequence
of the same mass, but not necessarily of the same length.
Note that there exists a finite number of moves from one sequence to any other in
the space of sequences whose mass equals that of the parent mass. The reverse move
is a special case of the permute move, and the permute move is a special case of the
substitute move. The identity move is not allowed.
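The three move types can be sketched as follows. This is a simplified illustration, not the thesis implementation: nominal integer residue masses are used, the substitution table contains only the N versus GG pair mentioned in Section 7.2.3, and the sketch does not reject identity moves. All function names are hypothetical.

```python
import random

# Nominal (integer) residue masses for a few residues.
NOMINAL_MASS = {"G": 57, "A": 71, "V": 99, "N": 114, "D": 115, "R": 156, "Y": 163}
# Equal-mass replacements; only the N vs. GG pair is included here.
EQUAL_MASS_SWAPS = {"N": "GG", "GG": "N"}


def mass(seq):
    return sum(NOMINAL_MASS[r] for r in seq)


def random_span(seq, rng):
    """Pick a random subsequence [i, j) of length at least 2."""
    i = rng.randrange(len(seq) - 1)
    j = rng.randrange(i + 2, len(seq) + 1)
    return i, j


def reverse_move(seq, rng):
    i, j = random_span(seq, rng)
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def permute_move(seq, rng):
    i, j = random_span(seq, rng)
    middle = list(seq[i:j])
    rng.shuffle(middle)
    return seq[:i] + "".join(middle) + seq[j:]


def substitute_move(seq, rng):
    """Replace one occurrence of a table entry by its equal-mass partner."""
    sites = [(i, key) for key in EQUAL_MASS_SWAPS
             for i in range(len(seq)) if seq.startswith(key, i)]
    if not sites:
        return seq                     # nothing to substitute in this sketch
    i, key = rng.choice(sites)
    return seq[:i] + EQUAL_MASS_SWAPS[key] + seq[i + len(key):]
```

All three moves preserve the total mass: reversal and permutation only reorder residues, and the substitution table lists only equal-mass replacements.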
We also require that the pre-move and post-move sequences have the same checkpoint
so that moves are mass-preserving. If, instead, we had allowed the mass of the post-move sequence to be within some tolerance of the pre-move sequence mass, then mass
skew can result - there exists a series of moves such that each individual move produces
a sequence within the required mass bound, but the mass of the final sequence is more
than the allowed tolerance from the mass of the original starting sequence. This is an
example of the usefulness of checkpoints.
Objective Function: The function to be optimized is based on the scoring function $\mathrm{Prob}(S, F_g)$,
i.e. the likelihood that a particular sequence guess $g$ with a fragmentation outcome
distribution $F_g$ produced the observed experimental data $S$. Recall that we desire
the $g$ that produces the maximum $\mathrm{Prob}(S, F_g)$ value; this corresponds to finding a
sequence that yields the minimum $-\log \mathrm{Prob}(S, F_g)$ value.
Footnote 4: The choice of moves is based on an observation that sequence prediction algorithms frequently produce
a list of sequence guesses that are often very similar, differing only in small portions of the sequence [TJ97].
Annealing Schedule: The cooling schedule consists of several parameters:
1. initial temperature, $T_0$,
2. number of temperature changes allowed, tempsteps,
3. temperature change factor $k < 1$ such that $T_{i+1} = k \cdot T_i$, $i \ge 0$,
4. max number of sequences to try at $T_i$, nover,
5. max number of successful moves at $T_i$ before continuing to $T_{i+1}$, nlimit.
Initially, arbitrary values were chosen for these parameters, so the annealing schedule was:
* $T_0$ = 100,000,
* tempsteps = 100,
* k = 0.9,
* nover = 100 * avgLength,
* nlimit = 10 * avgLength,
where avgLength is equal to the ceiling of the parent mass divided by the average amino
acid mass.
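The schedule parameters can be computed with a short sketch. The average amino acid mass is not stated in this section, so the value of 110 daltons used below is an assumption, as are the function name, the returned dictionary layout and the illustrative parent mass.

```python
import math


def annealing_schedule(parent_mass, avg_residue_mass=110.0,
                       t0=100_000.0, tempsteps=100, k=0.9):
    """Return the temperature sequence and per-temperature move limits."""
    avg_length = math.ceil(parent_mass / avg_residue_mass)
    # T_0 followed by tempsteps geometric reductions: T_{i+1} = k * T_i.
    temperatures = [t0 * k ** i for i in range(tempsteps + 1)]
    return {
        "temperatures": temperatures,
        "nover": 100 * avg_length,    # max sequences tried per temperature
        "nlimit": 10 * avg_length,    # max successful moves per temperature
    }


schedule = annealing_schedule(1296.7)
```

With k = 0.9 and 100 reductions the final temperature is a few units, several orders of magnitude below the starting value of 100,000.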
An implementation of the simulated annealing algorithm, based on an example in [PTVF95],
was written in Java. In addition, we modified the algorithm so that it would keep track of
and report the minimum scoring sequence that it encountered during its search.
8.7 Summary
We have proposed a model for fragmentation based on some simple rules. Given a sequence,
the model predicts a tree of possible fragmentation outcomes, and this fragmentation tree in
turn describes a PMF from which a scoring function can be derived. A simulated annealing
search can then use this model and its scoring function as the basis for designing an efficient
strategy for traversing the vast space of sequence guesses.
The next chapter contains a more detailed analysis of the effectiveness and performance of
our approach.
Chapter 9
Testing the Model and Its Scoring Function
This chapter evaluates the model and the scoring function module of our approach. We
begin with a discussion of some experimental data that we acquired. Then we describe how
the model was trained using this data, and what effect this had on the scoring function.
9.1 Training Data
We acquired some experimental spectra locally so that we could gain a hands-on feel for the acquisition process (see Appendix B), examine and analyze sample data, and use the data to train our model and validate our algorithm. A total of six MALDI-PSD datasets were obtained: four for angiotensin (DRVYIHPFHL) and two for bradykinin (RPPGFSPFR); they are included in Appendix C. Note that the 1205 dataset for angiotensin is believed to be problematic because of a calibration error encountered during acquisition. Ordinarily, such a dataset would have been thrown out immediately, but it was retained so that we might study the effects a poor dataset might have on our model.

Incidentally, the mass spectrometer came with a diagnostic PSD spectrum¹ for angiotensin,
acquired on the instrument independently by its manufacturer.
9.1.1 Observations of Training Spectra
A number of observations can be made from simple visual inspection of our PSD data and
of spectra appearing in the literature². The key issue is whether our model is consistent
with actual data, and these observations represent features that the model designer should
be aware of.
The most immediate observation is that there is a wide range of peak intensities with the
parent ion being one of the highest, if not the highest, peaks. Other observations concern
the distribution of peak masses, the different fragment types present and the problem of
unknown peaks.
Distribution of Peak Masses
With regard to the shape of the spectrum, i.e. the overall distribution of masses, there are
more peaks concentrated in the lower mass region than in the higher mass range. There
are two plausible types of explanation for why this might occur: the first pertains to the
fragmentation process and the second to the fragment distribution. Large molecules, which
have a larger mass to charge ratio, travel at slower velocities, perhaps slow enough that
they are below the threshold necessary to trigger the collision-induced electron conversion
mechanism for detection [Bea92]. Alternatively, perhaps the energy imparted to a molecule
is so great that the molecule shatters into smaller pieces, making it unlikely to encounter
large intact fragments [Mat98].
¹ Located at C:\VOYAGER\FACTORY\INSTALL\PSD\PARENOO1.MSA of the Voyager computer.
² [CBB96], Figure 1; [CLS99], Figures 2,3; [Spe97], Figure 20; [JnC96], Figures 4a,5; [LL95], Figures 3,4; [KSL93], Figure 3; [KKS94], Figure 6.

The distribution of input fragments may be another explanation for the observed spectrum
shape. From a mathematical viewpoint, assuming n is the length of the peptide, there are
O(n) core ions, and these ions are evenly spread out across the range of masses. There are
also O(n^2) possible internal ions, but there are more shorter internal ions than longer ones
- e.g. there is only 1 internal ion of length (n - 2) but n - 3 internals of length 2. As a
result, the concentration of lower mass fragments is naturally greater.
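The counting argument can be checked with a short sketch. It assumes that an internal fragment is a contiguous run of residues touching neither terminus, which gives n - 1 - L internal fragments of length L; the class and method names are illustrative only.

```java
/** Counts candidate fragments of a peptide of length n; names are illustrative. */
class FragmentCounts {

    /** Number of break positions between adjacent residues; each core series has at most this many ions. */
    static int breakPositions(int n) {
        return n - 1;
    }

    /** Internal fragments of a given length: contiguous runs that touch neither terminus. */
    static int internalsOfLength(int n, int length) {
        if (length < 1 || length > n - 2) {
            return 0;
        }
        return (n - 2) - length + 1;
    }

    /** Total internal fragments with length between minLength and n - 2 inclusive. */
    static int totalInternals(int n, int minLength) {
        int total = 0;
        for (int length = minLength; length <= n - 2; length++) {
            total += internalsOfLength(n, length);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 10; // e.g. the length of DRVYIHPFHL
        for (int length = 2; length <= n - 2; length++) {
            System.out.println("length " + length + ": " + internalsOfLength(n, length));
        }
        System.out.println("total internals: " + totalInternals(n, 2));
    }
}
```

For a 10-residue peptide this gives 7 internals of length 2 and a single internal of length 8, so short (low mass) internals dominate the count.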
Fragment Types Present in Spectra
A comparison study of spectra from different mass spectrometers is found in [RYM95], and
it includes an analysis of the fragment types present in each type of spectrum. The breakdowns
that they found for their MALDI-PSD spectra (taken from pie charts in Figures 1,2,3,4 of
[RYM95]) are reported in Table 9.1.
Peptide Sequence          prefixes   suffixes   internals   immoniums
PPGFSPFR                     32%        28%        31%          9%
RPPGFSPF                     38%        15%        36%         11%
DRVYIHPFHL                   33%        15%        14%         38%
RPVKVYPNGAEDESAEAFPLEF       59%         3%        18%         20%
Table 9.1: Peak Classification from [RYM95]
Since the sequences for our datasets are known, we can likewise confirm the presence of immoniums, cores, internals, variants and noise in our data³. Table C.7 of Appendix C lists various statistics for our datasets: size, number of experimentals accounted for, number of multiple identity peaks present, number of each ion present and various other totals. Multiple identity peaks are multiply counted in the fragment type tally of each identity.

We are particularly interested in the presence of internal ions, which accounted for about 20% or more of the total number of peaks in each dataset, and at least 33% of all experimental peaks with known identities. Researchers [SC98, FHM+93] have noted the possible sequencing value of internal ions, which are indeed present [CCC95, JnC96], but most approaches in the literature do not account for internals unless they perform a global correlation that includes them. It is difficult for global fundamental graph approaches to account for internals because whether or not an internal supports a fundamental depends on the path one takes to reach the fundamental. Local approaches that grow partial sequences can check for internals with the partial sequence calculated so far, but the real partial sequence could still be disqualified if internals are poorly represented. In short, it is hard for certain algorithms to harness the sequence information available from internal peaks, but were it possible, they could benefit greatly. Algorithms that employ some form of global correlation are favorable because they are able to do precisely this.

³ Note that verification was based solely upon matching theoreticals to experimentals; no chemical verification was actually performed.
Fragment Type Intensities
Peaks of high intensity tend to be core ions, core variants and certain immonium ions.
Internals and other immoniums are often responsible for the low intensity peaks.
Presence of Multiple Identity Peaks
Multiple identity peaks are present in these datasets. For example, for angiotensin, the
peak at 354.2 could be explained as a Bm17 ion of the subsequence DRV, or as a YA internal ion
for PFH. And for bradykinin: (1) 157.1 is the mass of a B ion for R and a YA internal for
SP, and (2) 555.3 is the mass of a B ion for RPPGF and a YA internal for PPGFSP. Note
that in these illustrations, the multiple identities involved a series ion having the same mass
as an internal ion, but nothing precludes them from being a prefix and a suffix, both series
ions.
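A minimal sketch of how multiple identities arise when theoretical fragments are matched to an observed peak by mass. The labels and masses are the bradykinin examples quoted above; the class, the method names and the 0.5 mass tolerance are assumptions made for illustration and do not describe the matching code used in this work.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative mass matching; names and tolerance are hypothetical. */
class PeakAssignment {

    /** A theoretical fragment: a descriptive label and its mass. */
    static final class Theoretical {
        final String label;
        final double mass;

        Theoretical(String label, double mass) {
            this.label = label;
            this.mass = mass;
        }
    }

    /** Returns the labels of all theoreticals within the tolerance of the observed mass. */
    static List<String> identities(double observedMass, List<Theoretical> theoreticals, double tolerance) {
        List<String> labels = new ArrayList<>();
        for (Theoretical t : theoreticals) {
            if (Math.abs(t.mass - observedMass) <= tolerance) {
                labels.add(t.label);
            }
        }
        return labels;
    }

    /** The bradykinin examples from the text; masses are the quoted peak masses. */
    static List<Theoretical> bradykininExamples() {
        List<Theoretical> list = new ArrayList<>();
        list.add(new Theoretical("B ion for R", 157.1));
        list.add(new Theoretical("YA internal for SP", 157.1));
        list.add(new Theoretical("B ion for RPPGF", 555.3));
        list.add(new Theoretical("YA internal for PPGFSP", 555.3));
        return list;
    }

    public static void main(String[] args) {
        // A peak with more than one label is a multiple identity peak.
        System.out.println(identities(157.1, bradykininExamples(), 0.5));
        System.out.println(identities(555.3, bradykininExamples(), 0.5));
    }
}
```

A peak that returns more than one label would be counted once in the tally of each identity, as described for Table C.7.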
Unknown Angiotensin Peaks
There are a handful of experimental peaks of fairly low intensities that appear consistently
across multiple datasets, but are of unknown origin. They are discussed further in Appendix D.
9.2 Training the Model
9.2.1 Parameterizing the Model
In our discussion of the model and its fragmentation tree (Chapter 8), a number of parameters related to the various decision points of the fragmentation tree were discussed. These
comprise the parameters of our model, and they fall into several major categories:
Variant Likelihoods Pnovariant, tendencym18, tendencym17, tendencyp18
Noise Probabilities Prandom, Punobserved
Series Likelihoods Pax, Pb
Matrix Tendencies matrix M
An overview of the various regions of matrix M is depicted in Figure 9-1. Only a few
matrix parameters are being estimated: the two columns M[*, H] and M[*, P], and the
portions involving a non-break. The remaining entries have a default value of 1. We have
only scratched the surface of fully utilizing this matrix because ideally, every matrix entry
would be a parameter; these regions were singled out because we thought they were the
most influential factors in fragmentation. Histidine and proline are known (see Section 3.2)
to favor N-terminal fragmentation. The non-break tendency also seems to be greater than
that of the other positions because core ions consistently appear more abundantly than
internal ions (a view held by the literature and supported by our datasets). Furthermore,
the parent ion, the only kind of molecule with two non-breaks, is almost always the peak
of highest intensity in the spectrum.
An accurate estimate for all model parameters is necessary for producing PMFs that resemble
what actually occurs in the physical/chemical process of fragmentation.
9.2.2 Training the Model Parameters
[Figure 9-1 diagram: matrix M with axes labeled by amino acid residues (H and P singled out), CTERM and NULL; legend: normal tendency, non-break tendency at ends of sequence, special residue dependent tendencies; some CTERM and NULL entries marked NA]
Figure 9-1: Overview of Matrix Layout: Ideally, every matrix entry would be a parameter,
but only a few regions have been parameterized and singled out for estimation - the non-break tendencies, and the Histidine(H) and Proline(P) residue dependencies. All entries
within the same shaded region are assumed to have the same likelihood in the current
model. Entries marked NA are not possible.

We trained our model by finding a set of assignments for these model parameters that
maximized the likelihood of the datasets we acquired (by minimizing the sum of their
-log(Prob(S, Fp)) scores - see Section 8.5.2). Rather than try all possible parameter assignments, we started with a few values for each parameter and gradually reduced the range
of values for each parameter as we homed in on a set of parameter values that yielded an
optimal sum (it is possible that this is not the global optimum). The trained model settings
are tabulated in Table 9.2.
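The idea of starting with a few values per parameter and narrowing the range can be sketched as a coarse-to-fine search. This is only an illustration under assumptions: the coordinate-wise update, the halving of each range, the number of grid points and all names are hypothetical, and a toy objective stands in for the sum of -log(Prob(S, Fp)) scores. As noted above, a search of this kind may settle on a local rather than the global optimum.

```java
import java.util.function.ToDoubleFunction;

/** Illustrative coarse-to-fine parameter search; not the procedure used in the thesis. */
class CoarseToFineSearch {

    static double[] minimize(ToDoubleFunction<double[]> objective,
                             double[] lo, double[] hi,
                             int pointsPerAxis, int rounds) {
        if (pointsPerAxis < 2) {
            throw new IllegalArgumentException("need at least two grid points per parameter");
        }
        int d = lo.length;
        double[] lower = lo.clone();
        double[] upper = hi.clone();
        double[] best = new double[d];
        for (int i = 0; i < d; i++) {
            best[i] = 0.5 * (lower[i] + upper[i]); // start in the middle of each range
        }
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < d; i++) {
                double bestValue = best[i];
                double bestScore = objective.applyAsDouble(best);
                // Try a few evenly spaced values for parameter i, holding the others fixed.
                for (int p = 0; p < pointsPerAxis; p++) {
                    double v = lower[i] + (upper[i] - lower[i]) * p / (pointsPerAxis - 1);
                    double[] trial = best.clone();
                    trial[i] = v;
                    double s = objective.applyAsDouble(trial);
                    if (s < bestScore) {
                        bestScore = s;
                        bestValue = v;
                    }
                }
                best[i] = bestValue;
                // Halve the range, re-centred on the best value, clipped to the original bounds.
                double half = 0.25 * (upper[i] - lower[i]);
                lower[i] = Math.max(lo[i], bestValue - half);
                upper[i] = Math.min(hi[i], bestValue + half);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy objective with a known minimum at (0.58, 2.4).
        ToDoubleFunction<double[]> toy =
            p -> (p[0] - 0.58) * (p[0] - 0.58) + (p[1] - 2.4) * (p[1] - 2.4);
        double[] found = minimize(toy, new double[] {0.0, 1.0}, new double[] {1.0, 5.0}, 5, 20);
        System.out.println(found[0] + ", " + found[1]);
    }
}
```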
Note that we have allowed variant tendencies to take only a small range of values. While
this is more or less sufficient for now, tendency values could be more representative of variant content
if a wider range of values were permitted. For example, if the three tendency parameters
were allowed to take on any integer value, then training would have produced the following
parameter values, listed in order: 0.57, 1, 22, 16, 0.17, 0.0, 0.23, 0.6, 3.0, 2.2. With little
data, too large a range can lead to an overtrained model. With more data available, however,
a more fine-grained scale may be preferred.
Unmatched experimental peaks are attributed to basal noise. The more intense a noise
peak, the more trials resulted in noise, and the lower the probability of the spectrum occurring.
Parameter          Untrained   Trained Overall
Pnovariant         0.5         0.58
tendencym18        1           1
tendencym17        1           2
tendencyp18        1           2
Prandom            0.2         0.18
Punobserved        0.1         0.0
Pax                0.5         0.24
Pb                 0.5         0.56
non-breaks         1           3
M[*,H], M[*,P]     1           2.4
Table 9.2: Model Parameters: Untrained and Trained Overall
During training, a dataset with no unmatched peaks should favor a trained Prandom value of
0; conversely, a dataset with many unmatched peaks, especially intense ones, will increase
the Prandom value.

Finally, scores are optimal when Punobserved is 0. This makes sense because the parameter, Punobserved, is the probability that an ion is lost and undetected, but a spectrum is comprised of exactly those ions that are detected. Without knowledge of N, the total number of molecules fragmented, 0 is the best assignment (when N is assumed to be the number of observed molecules), and we can in effect ignore this parameter. One could approximate N by estimating the number of molecules in the sample (based on sample quantity and concentration), but it is difficult to tell how many of these get ionized during data acquisition. Fortunately, as we saw in Section 8.5.2, the one place N occurs is factored out, so it is not needed for our purposes of comparing scores of different sequences evaluated against the same spectrum.
Does training make a difference? We endeavor to show that even if the parameters do not
produce the global optimum, the trained model performs better than the untrained one.
9.3 Examination of the Trained PMF
The untrained model performs poorly. The PMF for angiotensin generated by the
untrained model using arbitrary, but reasonable, parameters (Column 2 of Table 9.2) is
shown in Figure 9-2. Note that since cleavage at all break positions is equally likely, M is
a constant matrix.
[Figure 9-2 plot: PMF for angiotensin, untrained parameters; horizontal axis ticks from 0 to 1400, vertical axis ticks from 0 to 0.04]
Figure 9-2: PMF for Angiotensin Using an Untrained Model
There is little resemblance between the PMF in Figure 9-2 and the actual angiotensin
experimental data in Appendix C. The parent molecule is certainly not the most abundant
ion in the model and the probabilities seem inversely proportional to fragment length - e.g.
the shortest prefix and the shortest suffix are especially favored because there are many
more ways to produce them. The PMF shape is also too regular, being solely dependent
on peptide length. Differentiation based on other factors is necessary so that the PMFs of
different peptides of the same length are not identical.
On the other hand, the trained model allows for some differentiation, and the resulting PMF is shown in Figure 9-3. Though still not identical in shape to experimental spectra, this PMF already shows improvement compared to the untrained one of Figure 9-2. Now the parent mass is the highest peak, and the higher probability ions correspond to the core ions with the internal ions being less probable. Less regularity is observed because dependencies on the peptide composition have been introduced.

[Figure 9-3 plot: PMF for angiotensin, trained parameters; horizontal axis ticks from 0 to 1400, vertical axis ticks from 0 to 0.045]
Figure 9-3: PMF for Angiotensin Using a Trained Model

More differentiation is still possible - the entire residue tendency matrix could have been parameterized; residue position could be taken into account; rules for neighboring/distant residue combinations could be introduced, etc. - but our training data is too small and doesn't capture the diversity needed to train a more comprehensive model. Nevertheless, even with a small amount of differentiation, we can implement a fairly robust scoring function.
9.4 Scoring Guesses Against an Observed Spectrum
How good is this scoring function? What happens when different sequence guesses are scored
against the same experimental spectrum? How does the real sequence fare? Two measures
of performance, suggested by [PPCC99], are sensitivity and selectivity. If the scoring
algorithm is sensitive, the correct sequence scores well; if selective, unrelated sequences
score poorly.
We used our scoring function to correlate our experimental data against a pool of peptide
sequences. These sequences, listed in the leftmost column of Tables 9.3 and 9.4, all have
the same mass as the parent ion, but vary in length and similarity to the parent sequence.
The list was chosen so that the first set of sequences differed from the parent the most (in
length and sequence), while the remaining sequences exhibited a greater resemblance to the
parent, being of the correct length and including sequences such as the parent reversed, the
parent permuted and the parent with neighboring residues flipped. The last sequence is the
actual parent sequence.
The tables record the correlation scores for each sequence, under both the untrained and
trained models, as well as the difference between these scores. In the ideal case, use of the
trained model improves only the score of the actual sequence. The situation here is not ideal,
but the real sequence is one of the sequences that show the greatest improvement when the
trained model is used. In all cases, the correct sequence (either the parent sequence or the
parent sequence with the first 2 residues flipped) is the best scoring sequence.
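The comparison made in the tables can be sketched in a few lines. Because the scores are -log likelihoods, the best guess is the one with the lowest score, and the Difference column is the untrained score minus the trained score. The class and method names are illustrative; the three trained scores and the untrained score used in main are taken from the 0123 dataset columns of Table 9.3.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative score comparison; names are hypothetical. */
class ScoreComparison {

    /** Returns the sequence with the lowest score (lower means more likely). */
    static String bestSequence(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() < bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    /** Improvement due to training: untrained score minus trained score. */
    static double improvement(double untrained, double trained) {
        return untrained - trained;
    }

    public static void main(String[] args) {
        // Trained-model scores for three guesses against the 0123 dataset.
        Map<String, Double> trained0123 = new LinkedHashMap<>();
        trained0123.put("EWHTWFIMF", 986586.2);
        trained0123.put("RDVYIHPFHL", 648696.4);
        trained0123.put("DRVYIHPFHL", 644984.9);
        System.out.println("best guess: " + bestSequence(trained0123));
        System.out.println("improvement for DRVYIHPFHL: " + improvement(707448.0, 644984.9));
    }
}
```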
                           0123 Dataset                        0119 Dataset
Sequence          Untrained     Trained  Difference   Untrained     Trained  Difference
EWHTWFIMF         1000610.2    986586.2     14024.0    376255.5    369742.7      6512.8
PRMKWCPCIY         975305.5    951105.7     24199.8    372495.9    362382.5     10113.4
VLQFHLYTYI         972085.0    960256.5     11828.5    366327.7    360586.0      5741.7
HKMKFREYSA         977920.6    957214.7     20706.0    352852.6    347123.8      5728.8
FRLSCWYAHN         996250.5    974148.6     22101.9    377026.7    367970.0      9056.7
DCSYAKAWYSC        966902.2    939270.9     27631.3    373901.7    362997.1     10904.6
QCYLTGQAYYS        952266.0    932862.8     19403.2    371618.9    362988.3      8630.6
VWFLNPEHGPT        961941.0    942032.0     19909.0    359397.6    352370.4      7027.2
TFGKETIKEIM        971034.8    961195.5      9839.3    377140.7    371161.4      5979.2
YLQMSIEELGN        986816.0    964675.0     22141.0    383318.1    373152.3     10165.8
EKDICDCCHSGS      1006346.8    993525.5     12821.3    376325.8    369655.7      6670.1
ICSVWMSVVICG      1021791.8   1003735.2     18056.6    391636.1    383796.8      7839.3
HSTENCAAIITH       967937.2    953169.5     14767.7    353608.4    349607.1      4001.4
ANYSNCGCEPAPA     1016495.5    998025.3     18470.3    389600.9    381338.4      8262.5
ANCSTTRKTNGGS      995608.6    975148.5     20460.1    382945.8    372557.5     10388.3
DRVYLHLHML         721760.7    679433.3     42327.4    275185.3    261156.2     14029.2
DRVYLHMFCL         727984.5    676692.6     51291.9    276618.7    258219.5     18399.2
DRVYLHMPCY         737855.5    685587.0     52268.6    278511.3    261330.0     17181.3
DRVYLHMPEH         726715.4    684229.4     42486.0    273222.0    260859.8     12362.2
RDVYLHPWPN         761811.3    703958.5     57852.8    284003.5    259902.8     24100.6
DRVYYSLHML         760932.9    713756.0     47176.9    291209.7    275682.2     15527.6
DRVYIHFPHL         715022.7    670690.6     44332.0    272046.8    258363.3     13683.6
DVRYIHPFHL         739678.7    680309.0     59369.7    283188.6    262018.9     21169.7
DRYVIHPFHL         737180.8    676238.9     60941.9    278286.2    255804.3     22481.9
DRVIYHPFHL         738506.3    678979.6     59526.7    281096.1    259648.8     21447.3
DRVYHIPFHL         740373.8    676650.9     63722.9    281312.1    260521.3     20790.8
DRVYIPHFHL         762835.2    721162.0     41673.2    289738.6    278062.0     11676.6
DRVYIHPHFL         739913.0    682881.0     57032.0    278161.2    256277.1     21884.1
DRVYIHPFLH         726697.1    677339.3     49357.8    273243.1    254749.1     18494.0
DRVPFYIHHL         872519.8    824716.8     47803.1    330034.1    315203.6     14830.5
DRVHIYPFHL         778471.5    718644.9     59826.6    295005.1    276876.4     18128.7
DRHVYIHPFL         840385.2    796113.4     44271.8    315209.8    297241.9     17967.9
RDVYIHPFHL         711485.4    648696.4     62789.0   *269161.1   *246465.2     22696.0
LHFPHIYVRD         790750.9    768738.7     22012.3    300054.2    293322.9      6731.3
PRDVYIHFHL         856353.2    822967.0     33386.1    320955.7    310903.2     10052.5
HRDVYIHPFL         816051.5    778512.3     37539.2    295598.0    284770.6     10827.3
DRVYIHPFHL        *707448.0   *644984.9     62463.1    269449.3    247236.8     22212.5
Table 9.3: Scores of Sequences with the Same Mass as Angiotensin for Datasets 0123 and
0119. A '*' denotes best score in a column. Sequences considered correct are the actual
sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues
flipped, listed fifth from last.
                           0121 Dataset                        1205 Dataset
Sequence          Untrained     Trained  Difference   Untrained     Trained  Difference
EWHTWFIMF         3591140.3   3585832.4      5307.9   1583255.8   1562844.2     20411.6
PRMKWCPCIY        3481601.0   3440904.1     40696.9   1538727.6   1500969.5     37758.1
VLQFHLYTYI        3422311.3   3404494.1     17817.2   1538805.3   1525055.9     13749.4
HKMKFREYSA        3539148.9   3523349.4     15799.5   1549703.5   1513659.1     36044.4
FRLSCWYAHN        3605618.2   3588582.6     17035.6   1588387.5   1558620.7     29766.8
DCSYAKAWYSC       3475024.8   3440729.6     34295.3   1509786.6   1469993.5     39793.1
QCYLTGQAYYS       3440372.5   3432988.8      7383.7   1458486.1   1429528.5     28957.6
VWFLNPEHGPT       3484229.2   3452109.6     32119.6   1513068.5   1483061.6     30007.0
TFGKETIKEIM       3447098.4   3457061.1     -9962.6   1491126.1   1486291.2      4834.9
YLQMSIEELGN       3526325.5   3498799.0     27526.6   1539122.3   1505162.6     33959.7
EKDICDCCHSGS      3534730.3   3527607.1      7123.3   1571204.9   1547374.4     23830.6
ICSVWMSVVICG      3437413.3   3363538.8     73874.6   1593272.0   1565586.2     27685.8
HSTENCAAIITH      3490206.1   3480559.8      9646.3   1545114.6   1522448.0     22666.6
ANYSNCGCEPAPA     3590061.1   3558109.3     31951.8   1587972.5   1553456.6     34515.9
ANCSTTRKTNGGS     3533219.1   3520911.5     12307.6   1556569.3   1528001.8     28567.4
DRVYLHLHML        2658723.3   2564833.4     93889.9   1113992.0   1044975.8     69016.2
DRVYLHMFCL        2663227.4   2547695.6    115531.8   1124362.5   1043150.1     81212.4
DRVYLHMPCY        2785226.9   2686913.3     98313.6   1132070.5   1042132.9     89937.6
DRVYLHMPEH        2765007.8   2689378.4     75629.4   1116415.0   1040693.1     75721.9
RDVYLHPWPN        2838234.4   2709785.2    128449.1   1181329.1   1094742.5     86586.5
DRVYYSLHML        2724515.6   2609064.6    115451.1   1148614.0   1066800.9     81813.1
DRVYIHFPHL        2569794.0   2450760.5    119033.5   1093327.9   1014123.0     79204.9
DVRYIHPFHL        2638866.2   2476966.3    161899.9   1141435.7   1045422.1     96013.6
DRYVIHPFHL        2639212.4   2470831.0    168381.4   1128156.5   1026553.8    101602.7
DRVIYHPFHL        2671037.9   2508420.0    162617.9   1135684.4   1038167.0     97517.4
DRVYHIPFHL        2615010.7   2430165.6    184845.2   1120847.7   1011129.0    109718.7
DRVYIPHFHL        2710566.2   2600019.9    110546.3   1156037.5   1080070.5     75967.0
DRVYIHPHFL        2700650.0   2560715.7    139934.3   1159050.9   1070259.6     88791.4
DRVYIHPFLH        2768536.1   2666870.9    101665.2   1112153.9   1039635.2     72518.7
DRVPFYIHHL        3043928.2   2915263.4    128664.8   1325732.7   1239322.1     86410.7
DRVHIYPFHL        2743394.9   2574565.3    168829.6   1181530.6   1077197.6    104333.0
DRHVYIHPFL        3037303.8   2942113.1     95190.8   1323411.4   1258188.6     65222.8
RDVYIHPFHL       *2534547.7  *2361854.8    172692.8  *1076543.8   *973923.5    102620.3
LHFPHIYVRD        2821830.3   2709905.1    111925.2   1222077.9   1187291.4     34786.5
PRDVYIHFHL        3006660.2   2907902.0     98758.1   1319292.1   1256930.4     62361.7
HRDVYIHPFL        2929037.6   2843633.2     85404.4   1289927.7   1233125.8     56801.9
DRVYIHPFHL        2539808.1   2369364.1    170444.0   1078123.1    975763.0    102360.1
Table 9.4: Scores of Sequences with the Same Mass as Angiotensin for Datasets 1205 and
0121. A '*' denotes best score in a column. Sequences considered correct are the actual
sequence (listed last) and the sequence RDVYIHPFHL, the actual with the first 2 residues
flipped, listed fifth from last.
                           0220 Dataset                        0218 Dataset
Sequence          Untrained     Trained  Difference   Untrained     Trained  Difference
FFNWAGRY           637458.2    620229.3     17228.9   1313724.7   1317429.7     -3705.0
RSSCQPDMH          573000.0    559245.2     13754.8   1182864.6   1186785.4     -3920.8
WDPNICTLV          605878.4    591745.4     14133.0   1234530.6   1232528.8      2001.8
VFFMDHGAH          617818.7    604156.9     13661.8   1289685.8   1289786.0      -100.2
TRQKRMLAG          612576.6    597983.7     14592.9   1241441.3   1252506.1    -11064.8
VAYHGYFCT          648209.6    631083.7     17125.9   1342050.6   1337724.3      4326.2
CQSFCDGDSV         621628.3    605042.8     16585.6   1267178.7   1268856.7     -1678.0
TVNVHHPGSI         616680.9    604741.1     11939.8   1258687.3   1253688.5      4998.8
LRQNLAMGGT         622702.2    605893.8     16808.4   1262377.0   1270854.0     -8477.1
LSQKACPGKE         617880.6    600357.9     17522.7   1274532.3   1270929.9      3602.5
SAFAPRIVCP         603815.9    587313.5     16502.4   1266159.0   1265483.1       675.9
DCHMASDGAGP        607585.3    602810.6      4774.7   1265081.6   1280059.7    -14978.1
TYLFGSCGGGV        625747.3    604016.7     21730.6   1259824.2   1254787.7      5036.5
EIRAIQGCGGG        617889.9    594656.0     23233.9   1242888.5   1234011.7      8876.8
KNAGPGTEISS        604424.9    584624.0     19800.9   1230376.3   1221202.1      9174.2
RPVDSSPRF          516809.6    489526.9     27282.7   1008198.9    977668.9     30530.0
RPPGFSPRF          496651.5    466971.0     29680.5    939676.7    898853.3     40823.4
RPVSDSPRF          516406.2    489144.7     27261.5   1010008.5    978912.0     31096.5
RPTAESPRF          509738.6    480533.7     29204.9    992577.2    958308.4     34268.8
RPWDSAMPT          521433.1    501069.5     20363.7   1026787.8   1019673.7      7114.1
RPVDSSPFR          511242.9    484323.2     26919.7    995771.5    965772.3     29999.2
RPVDSSLMR          518657.1    494543.2     24113.8   1014359.4   1001363.7     12995.7
RPSFGPPRF          503237.9    478142.5     25095.4    975513.9    944049.5     31464.4
RPWDSPGVF          518995.3    492483.1     26512.2   1021498.9    996014.4     25484.5
RPTCPSPRF          512074.9    488138.4     23936.5   1002086.2    971484.8     30601.3
RPWDSAMVV          522496.4    498975.2     23521.2   1024707.9   1014970.9      9737.0
PRPGFSPFR          508248.9    486135.7     22113.3    984056.3    952750.9     31305.4
RPGPFSPFR          496290.8    475676.0     20614.8    948037.7    931302.2     16735.5
RPPFGSPFR          499639.0    472065.1     27574.0    958188.2    923245.2     34943.0
RPPGSFPFR          509410.8    480895.7     28515.1    995003.5    953766.6     41236.9
RPPGFPSFR          511823.5    487540.5     24283.0    998267.0    975018.5     23248.5
RPPGFSFPR          495785.1    475180.1     20605.1    946461.2    936713.1      9748.1
RFPSFGPPR          529863.1    519962.3      9900.8   1063121.5   1068835.8     -5714.3
RPPGFSPFR         *492761.5   *463690.3     29071.2   *931460.8   *892716.5     38744.3
Table 9.5: Scores of Sequences with the Same Mass as Bradykinin. A '*' denotes best score
in a column.
The sequences that are similar to the parent are expected to score better than those that are not. This can be seen when the trained model scores of Tables 9.3, 9.4 and 9.5 are graphically displayed (see Figures 9-4 and 9-5). The scores fall into clusters when plotted - two clusters for each dataset (the first two clusters of Figure 9-4 (points 0 to 40 on the x-axis) correspond to the sequence scores for the 0123 dataset, the next two clusters for 0119, and so forth). There is a marked distinction between the cluster of sequences that resemble the real sequence (these have better, i.e. lower, scores) and the cluster of sequences that are more distant.

[Figure 9-4 plot: trained model score (vertical axis, 0 to 3.5e+06) for each sequence guess (horizontal axis, points 0 to 180)]
Figure 9-4: Trained Model Scores of Sequence Guesses for Angiotensin: Datasets 0123 (points 0-40 along the x-axis), 0119 (40-80), 1205 (80-120) and 0121 (120-160). Scores at zero were inserted to separate each dataset.

[Figure 9-5 plot: trained model score (vertical axis, 0 to 1.2e+06) for each sequence guess (horizontal axis, points 0 to 80)]
Figure 9-5: Trained Model Scores of Sequence Guesses for Bradykinin: Datasets 0220 (points 0-40) and 0218 (40-80). Scores at zero were inserted to separate each dataset.
9.4.1 Not the Real Sequence, but Still Correct
De novo algorithms find the amino acid sequence that best explains the observed data
according to some measure of evaluation. Ideally, the sequence that is the best fit is also
the actual parent sequence. Examination of the scores for each dataset reveals that the real
sequence is indeed found for the bradykinin datasets and the 0123 angiotensin dataset. For
the remaining angiotensin datasets, however, there is an alternate sequence, RDVYIHPFHL,
that outscores the real sequence.
Notice that this alternate sequence is the real sequence with the first two residues interchanged, and from Section 4.4, this is an acceptable prediction and considered correct. Investigation of our data reveals two reasons why this sequence would be favored: (1) the data contains no f1 ions and weak f' ions, and (2) peaks (both legitimate and noise) are misinterpreted as supportive fragment ions for the alternate sequence.

Take the 0119 and 0121 datasets for example. The 0119 dataset contains no f1 or f' ions, but it has an immonium YAm17 ion for R at mass 112. The mass of the Am17 ion for R is also 112, and moreover, the probability of the Am17 ion, being a core ion, is higher than that of the immonium YAm17 ion. Consequently, a sequencing algorithm would favor the misinterpretation of 112 as an Am17 ion, identifying R as the first residue, because this assignment yields a better score. The same immonium interference by the same ion occurs with the 0121 dataset, except that, unlike the 0119 dataset, an extended family peak, an internal YB ion for RVYIH at mass 669.47, is present, but unfortunately, its intensity is too low to be of consequence. Thus, because f1 prefix ions are more probable according to our rules than immonium ions, competing sequences whose first residue is well supported by immonium peaks in the spectrum might outscore a real sequence that contains a gap at position 1. One might be able to rescue the real sequence by appropriately reducing the break probability of break position one. This would lower the probabilities of all f1 prefix ions as a consequence, and diminish the effect of immonium interference.
The 1205 dataset, on the other hand, has no f1, but it does have a single extended family ion - an internal ion YBp18 for RVYIHPFH at mass 1068.83. Noise peaks at 473.147 and 696.93, however, can be interpreted as a YBm18 for DVYI and a YA for DVYIHP respectively. These support RDVYIHPFHL as the sequence and help it to outscore the real sequence despite the presence of a legitimate peak at 1068.83.
Note that while certain ion families can be better represented, it is not the case that the
dataset, as a whole, is too small. Were this true, the scores of sequences that are radically
different from the real one would be more competitive, and quite likely surpass that of the
real sequence. With "decent" spectra (good signal to noise, good redundancy), the danger
is not that a sequence totally distinct from the real sequence would score better (because
the real sequence being the real sequence should explain the bulk of the peaks), but that
a sequence, deviating ever so slightly, would outperform the real one. The reason it scores
high is because of its high degree of overlap with the real sequence; but the reason it scores
better is because the minor differences in sequence happen to be better supported.
Incidentally, no fundamental graph approach would have fared better. In the case of 1205, an
extended family ion is present, but it is an internal ion, and not one of the f-generating roles.
Unlike series ions, fundamentals cannot be definitively inferred from internals (without more
localizing information).
Internal ion supporters cannot be easily incorporated into a fundamental graph. Global
approaches might attempt to find all possible subpaths in the graph that can explain the
internal ion, and then increment some score for all fundamentals of each subpath. But
because there can be many such subpaths, this process may obscure the real sequence
instead of highlighting it. There is also no reason to believe that fundamentals that are
located at the crossroads of many such subpaths are, in fact, fundamentals of the real
sequence.
Local approaches can take internals into account as they progress through the search space of partial sequences (a huge tree where each node has outgoing degree 18); but they would not be able to account for the f' internal ion of dataset 1205 until they had almost reached the end of their search. It is unclear whether the real sequence would have survived pruning long enough to make it this far, whether the influence of non-internal peaks interpreted as internals would artificially elevate the score of incorrect sequences enough to knock out the real sequence, and whether a search with an increased number of survivors allowed after each round would be too inefficient.

Thus, while the angiotensin sequence is actually DRVYIHPFHL, the sequence RDVYIHPFHL is considered a correct solution.
9.5 Summary: Model Training
The different fragment types are all present in the datasets we collected. In addition, the
spectra also contained noise, as well as consistently occurring peaks of unknown origin. Our
model was trained with these datasets, and despite the presence of unknowns, the PMF
improved, as did the selectivity/sensitivity of the scoring function for the correct sequence.
Chapter 10
Testing the Simulated Annealing Search
This chapter examines the simulated annealing search and addresses such questions as:
Does the search converge? Does restricting the search to moves that are length-preserving
help? How do different searching parameters affect the search?
10.1 Search Convergence
The scores of each successive move made by the simulated annealing search on an angiotensin
dataset and a bradykinin dataset are plotted in Figures 10-1 and 10-2 respectively. The
search used the trained model of Table 9.2 to evaluate each move that it made. The scores
fluctuate randomly at first, but as the search progresses, they gradually descend to a minimum
value. In both cases, the search found the real sequence successfully.
[Figure 10-1 plot: score of each successive move (vertical axis, 240000 to 400000) against search progress (horizontal axis, 0 to 4500)]
Figure 10-1: Simulated Annealing Moves for 0119 Dataset. The x-axis represents the
progress of the search, and the y-axis is the score of each successive move.

[Figure 10-2 plot: score of each successive move (vertical axis, 460000 to 640000) against search progress (horizontal axis, 0 to 3500)]
Figure 10-2: Simulated Annealing Moves for 0220 Dataset. The x-axis represents the
progress of the search, and the y-axis is the score of each successive move.

10.2 Sequence Prediction
Simulated annealing was performed on all our datasets using a model whose parameters were trained by the very same datasets (Table 9.2) (cross-validation and the use of other validating datasets are the subject of Chapter 11). Table 10.1 summarizes these results and reveals the effects of using a model that is untrained versus one that is trained, and using moves that are restricted (length-preserving) versus moves that are unrestricted. Each entry in the table indicates the number of times, out of 10, simulated annealing found the correct sequence. The outcomes of the algorithm are broken down into three categories:
• the number of times the algorithm returned the correct answer,
• the number of times the algorithm terminated with no answer, but the minimum
sequence encountered was correct, and
• the number of times the algorithm returned the wrong answer, but the minimum
sequence encountered was correct.
The second category occurs when the algorithm does not think it has converged, and the
third category occurs when it gets stuck in some local minima. We distinguish among these
three categories to better understand the algorithm's behavior; one can easily modify the
code so that the minimum scoring sequence that was encountered is always returned at
the completion of the algorithm. One can consider the sum of these three numbers as the
number of times the algorithm succeeds for the given parameters.
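The bookkeeping behind these three categories can be sketched as follows. This is a minimal illustration in Python, not the code used for the experiments; the function names, the category labels, and the representation of a run as a (returned, best_seen) pair are assumptions made for the example, and the runs listed are made up.

```python
from collections import Counter

def classify_run(returned, best_seen, correct):
    """Place one annealing run into an outcome category.

    returned  -- sequence reported by the search, or None if it did not converge
    best_seen -- lowest-scoring sequence encountered during the search
    correct   -- the known true sequence
    (All names here are hypothetical; they are not taken from the thesis code.)
    """
    if returned == correct:
        return "returned_correct"
    if best_seen == correct:
        # The right answer was visited but was not the one reported.
        if returned is None:
            return "no_answer_min_correct"
        return "wrong_answer_min_correct"
    return "failure"

def tally_runs(runs, correct):
    """Count outcomes over several runs and sum the three successful categories."""
    counts = Counter(classify_run(returned, best, correct) for returned, best in runs)
    successes = sum(n for name, n in counts.items() if name != "failure")
    return counts, successes

# Made-up runs for a peptide whose true sequence is RPPGFSPFR.
runs = [
    ("RPPGFSPFR", "RPPGFSPFR"),  # converged on the correct sequence
    (None, "RPPGFSPFR"),         # no convergence, but the minimum seen was correct
    ("RPPGFPSFR", "RPPGFSPFR"),  # wrong answer returned, minimum seen was correct
    ("RPPGFPSFR", "RPPGFPSFR"),  # wrong answer, correct sequence never the minimum
]
counts, successes = tally_runs(runs, "RPPGFSPFR")
```

Returning `best_seen` unconditionally would fold the second and third categories into the first, which is the code modification mentioned above.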
                           Untrained Model                        Trained Model
Moves         Outcome  0123  0119  1205  0121  0220  0218    0123  0119  1205  0121  0220  0218
Restricted    (1)         0     0     1     0     9     9      10     8     7     7    10    10
              (2)         0     0     0     0     0     0       0     0     0     0     0     0
              (3)         0     0     0     0     0     0       0     2     0     1     0     0
Unrestricted  (1)         0     0     0     0    10    10       4     1     3     4    10    10
              (2)         0     0     0     0     0     0       0     0     0     0     0     0
              (3)         0     0     0     0     0     0       0     0     0     0     0     0
Outcome rows: (1) correct answer returned; (2) no answer returned, minimum sequence encountered
was correct; (3) wrong answer returned, minimum sequence encountered was correct.
Table 10.1: Results of Simulated Annealing Run on Different Datasets using an Untrained/Trained Model, and a Length-Preserving/Non-Length-Preserving Search
These results indicate that simulated annealing performs better when scoring is done using
a trained model and when length-preserving moves are made. When moves are length-preserving,
the search space is smaller, and hence the likelihood of the search succeeding (if
the subspace contains the correct sequence) is higher. Neither of these observations comes
as a surprise, especially since the trained model is being validated on the very datasets used
to train it. What is interesting is that the bradykinin datasets (0220, 0218) appear to do
quite well regardless of the simulation conditions.
Finally, the algorithm, when used with the untrained model, frequently found sequences
that scored better than the correct one, indicating that training made a difference
by improving sensitivity and/or selectivity. The simulations that used a trained model
encountered no sequence that scored better than the real one.
10.3
Different Restricted Sizes
Ideally, the simulated annealing algorithm should run in unrestricted mode, and predict the
correct sequence after considering sequences of all lengths. We attempted to improve the
performance of the unrestricted version by trying different move probabilities and slower
schedules (results not included), but met with little success. It was also unclear whether
the tremendous cost in time, due to the use of even slower annealing schedules, would be
worth the gain in prediction success.
However, since the algorithm worked well when restricted to sequences of the right length,
one might consider partitioning the space of sequences by length into smaller non-overlapping
subspaces. Since the right length is not known, one would have to run several restricted
searches, but instead of a single search over the full space, one could run a few well-chosen
searches over smaller spaces.
The following question is then raised - how do the best scoring sequences of each length
compare? Is the real sequence the best scoring sequence of not only its length, but of the
other lengths as well? One could reason that this might be true for the shorter and longer
lengths. Very short sequences account for fewer of the experimentals and as a result, more
unexplained experimentals must be attributed to noise. If the probability of noise is low,
then this would greatly hurt the resulting score. Very long sequences are able to generate
many more theoretical peaks, especially low mass ones, and their PMFs will experience an
overall reduction in probabilities, particularly the probabilities of the higher mass fragments.
This also has the potential of hurting the resulting score.
The simulated annealing algorithm was run in restricted mode, starting from a random initial
sequence[1], for every length from 8 to 13. Each length was tested 10 times per dataset, and the
sequences predicted by our model are listed by length in Tables 10.2 and 10.3. We found that the
correct sequence for each dataset was indeed the best scoring sequence over all lengths. This
suggests that the unrestricted searches (Tables 10.1 and 11.2) did poorly
because the sequence space was vast, not because the correct sequence scored suboptimally.
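Selecting the overall prediction from a set of length-restricted searches amounts to taking the lowest-scoring result across lengths. The sketch below illustrates this with the first six entries listed for the 0123 dataset in Table 10.2; the function name and the use of None for a search that did not converge are assumptions of the example, not part of the original implementation.

```python
def best_over_lengths(results_by_length):
    """Pick the lowest-scoring prediction among length-restricted searches.

    results_by_length maps a sequence length to a (sequence, score) pair,
    or to None when the search for that length did not converge.
    Lower scores are better, as in the annealing search itself.
    """
    converged = {n: r for n, r in results_by_length.items() if r is not None}
    if not converged:
        return None
    length = min(converged, key=lambda n: converged[n][1])
    sequence, score = converged[length]
    return length, sequence, score

# One restricted search per length on the 0123 dataset (values from Table 10.2).
run_0123 = {
    8: ("YWYWREFF", 912071.27),
    9: ("HMIMFYRWI", 747736.38),
    10: ("DRVYIHPFHI", 644984.87),
    11: ("HICHGVHPFHI", 694030.16),
    12: ("HMIHAGHPSGRP", 699639.42),
    13: ("PNGIHQHPSGVGP", 715958.5),
}
best = best_over_lengths(run_0123)  # (10, "DRVYIHPFHI", 644984.87)
```

In practice one would repeat each restricted search several times, as was done here, since a single run for a given length may end in a local minimum.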
10.4
Simulated Annealing on Data Without Noise
Simulated annealing was also performed on our datasets with all noise (unmatched) peaks
removed. One would expect the algorithm to do very well with clean data, but the statistics
compiled in Table 10.4 show little improvement compared to Table 10.1. This indicates that
the algorithm is somewhat tolerant of a certain level of noise (removing the noise present
in the original datasets does not markedly change the prediction
results). We will explore noise further in Section 12.2.3. Since the correct sequence is still
the best scoring sequence encountered, the performance of the algorithm on the noiseless
0119, 1205 and 0121 datasets suggests that perhaps the search itself could use improvement.
10.5
Exploration of Simulated Annealing Parameters
Thus, we shift our attention to the simulated annealing search and its parameters in hopes
of finding settings that will improve the performance of the algorithm on the 0119, 1205
and 0121 datasets.
We make forays into the multi-variable space of simulated annealing
parameters by examining the algorithm's predictions when a single parameter is changed
and when combinations of parameters are changed.
Note that we are exploring this space under the "best possible conditions" in terms of the
data and the model - namely, we used datasets containing no noise and we used a model
whose parameters were trained with all six datasets.
[1] At the start of every search, the algorithm attempts to pick a random sequence of the desired mass for
the specified length. It may have difficulty finding such a sequence if the length is too short or too long.
Each entry is given as predicted sequence,score. Rows are the ten searches; columns are the restricted sequence lengths.
Dataset 0123
Run | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13
1 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | HICHGVHPFHI,694030.16 | HMIHAGHPSGRP,699639.42 | PNGIHQHPSGVGP,715958.5
2 | YYWYYWRP,914142.34 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | NCRGCIHPFHI,693794.29 | CRGGGCIHPFHI,700914.51 | PAQGPVPHPKGVI,725074.9
3 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | NCRGCIHPFHI,693794.29 | CRGGGCIHPFHI,700914.51 | HMIPAPHPGGAVI,707189.9
4 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | HICYPGIPFHI,700056.62 | HMIPAPHPKGVI,702909.07 | PAAGGHGVHPFHI,709675.0
5 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | NCRGCIHPFHI,693794.29 | HMIHAGHPSGRP,699639.42 | PQAGHRHPSGVGP,722727.3
6 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | PNAAHRHPFHI,692246.85 | DRVYIHPGGGII,682825.96 | PKAGHRHPSGVGP,722727.3
7 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | HICHGVHPFHI,694030.16 | PGGAAHRHPFHI,699720.92 | SAAAADCGHPFHI,724811.9
8 | HKWWKYYW,910361.35 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | HIMPAAYPFHI,707205.39 | PGGAAHRHPFHI,699720.92 | HICTGGSAGPFHI,723111.5
9 | YRWHHWRR,909109.71 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | HICHGVHPFHI,694030.16 | PGGAAHRHPFHI,699720.92 | PKAGHGVHPSGPR,722596.9
10 | YWYWREFF,912071.27 | HMIMFYRWI,747736.38 | DRVYIHPFHI,644984.87 | HICHGVHPFHI,694030.16 | CRGGGCIHPFHI,700914.51 | HMIHAGHPSGVGP,706485.7
Dataset 0119
Run | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13
1 | (didn't converge),0.00 | HMIMFYRWI,278945.79 | RDVYIHPFHI,246465.16 | HPNANIHPFHI,254911.32 | HMIPAPHPNAVI,256264.55 | HMIPAPHPGGAVI,258728.6
2 | YYWYYWRP,344307.06 | HMIMFYRWI,278945.79 | RDVYIHPFHI,246465.16 | HICHGVHPFHI,255383.07 | HMIPAPHPNAVI,256264.55 | HMIPAPHPGGAVI,258728.6
3 | YYWYYWRP,344307.06 | HMIMFYRWI,278945.79 | HHRCIHPFHI,256265.93 | HICHGVHPFCF,255278.29 | HIMCAGGNPFCF,263968.83 | HICMAGNGPNAVI,264847.9
4 | HHWWRYRR,327530.14 | HMIMFYRWI,278945.79 | RDVYIHPFHI,246465.16 | HICHGVHPFCF,255278.29 | HPNAGGIHPFHI,256512.18 | HMIPAPHPGGAVI,258728.6
5 | HWWKKYYW,324801.74 | HMIMFYRWI,278945.79 | RDVYIHPFHI,246465.16 | HPNANIHPFHI,254911.32 | HPNAGGIHPFHI,256512.18 | HPGGAGGIHPFHI,259194.2
6 | HHWWRYRR,327530.14 | HMIMFYRWI,278945.79 | RDVYIHPFHI,246465.16 | HICHGVHPFHI,255383.07 | HPNAGGIHPFHI,256512.18 | HMIPAPHPGGAVI,258728.6
7 | (didn't converge),0.00 | HICYRQWYK,294150.11 | RDVYIHPFHI,246465.16 | HMIHQHPNAVI,256400.84 | HMIPAPHPNAVI,256264.55 | HMIHAGHPPAGGD,261261.7
8 | YWYFIWRY,346389.43 | HHWWHPFHI,269454.26 | HHRCIHPFHI,256265.93 | HPNANIHPFHI,254911.32 | HPNAGGIHPFHI,256512.18 | HICMAGNGPNAVI,264847.9
9 | (didn't converge),0.00 | HMIMWYYPR,279180.02 | RDVYIHPFHI,246465.16 | HPNANIHPFHI,254911.32 | HMIPAPHPNAVI,256264.55 | HICHGVHPGGGII,263342.6
10 | (didn't converge),0.00 | HMIMFYRWI,278945.79 | RDVYIHPFHI,246465.16 | HPFYPAAPFFC,259035.64 | HMIPAPHPNAVI,256264.55 | HICYPIGPGGAVI,261826.2
Dataset 1205
Run | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13
1 | (didn't converge),0.00 | IHMMFWRYI,2617491.58 | RDVYIHPFHI,2361854.84 | HICYPGIPFHI,2397777.77 | IHCYPAVPKGVI,2414683.06 | IHCGGAACAPFHI,2506416.0
2 | MHHWWWWK,3142580.37 | IHMMFWRYI,2617491.58 | IHCHRHPFHI,2426544.80 | HICYPGIPFHI,2397777.77 | IHCYPAVPGQVI,2414131.77 | IHCYPVAPGGAVI,2431328.6
3 | (didn't converge),0.00 | IHMMYYYYI,2645006.18 | RDVYIHPFHI,2361854.84 | HICYPGIPFHI,2397777.77 | PGQAYPGIPFHI,2467377.90 | IHCPVPHPGGAVI,2435215.2
4 | (didn't converge),0.00 | IHMMFWRYI,2617491.58 | RDVYIHPFHI,2361854.84 | HICYPGIPFHI,2397777.77 | (didn't converge),0.00 | IHCYPVAPGGAVI,2431328.6
5 | (didn't converge),0.00 | IHMMFWRYI,2617491.58 | IHCHRHPFHI,2426544.80 | HICYPGIPFHI,2397777.77 | IHMGGCANPMCY,2495356.20 | IHCYPVAPGGAVI,2431328.6
6 | (didn't converge),0.00 | IHMMFWRYI,2617491.58 | IHCHRHPFHI,2426544.80 | HICYPGIPFHI,2397777.77 | IPGGGHQHPFHI,2458782.53 | IHCYPVAPGGAVI,2431328.6
7 | IRTYWWWW,2957211.69 | IHMMFWRYI,2617491.58 | RDVYIHPFHI,2361854.84 | HICYPGIPFHI,2397777.77 | PGQAYPGIPFHI,2467377.90 | IPGGGPAPHPFHI,2482744.5
8 | IRTYWWWW,2957211.69 | IHMMFWRYI,2617491.58 | IHCHRHMFCI,2449025.18 | HICYPGIPFHI,2397777.77 | IHCYPAVPGQVI,2414131.77 | (didn't converge),0.0
9 | IRTYWWWW,2957211.69 | IHMMFWRYI,2617491.58 | RDVYIHPFHI,2361854.84 | HICYPGIPFHI,2397777.77 | IHCYPAVPGQVI,2414131.77 | IHCPVPHPGGAVI,2435215.2
10 | (didn't converge),0.00 | IHMMFWRYI,2617491.58 | RDVYIHPFHI,2361854.84 | HICYPGIPFHI,2397777.77 | IHCPVPHPGQVI,2416431.25 | IHCYPVAPGGAVI,2431328.6
Table 10.2: Simulated Annealing Results for Different Lengths. This table lists the ten predictions made by a search, restricted to
sequences of length 8 to 13 inclusive, on the 0123, 0119 and 1205 datasets.
Each entry is given as predicted sequence,score. Rows are the ten searches; columns are the restricted sequence lengths.
Dataset 0121
Run | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13
1 | YYWYYWRP,1426797.79 | HMIMFYRWI,1121049.04 | HICHRHPFHI,1042694.80 | HICYPGIPFHI,1033167.66 | PANAYVPAMIHI,1056730.57 | PGAGAHGVHPFHI,1068282.4
2 | YWYWREFF,1413671.31 | HMIMYYWRP,1124357.12 | RDVYIHPFHI,973923.46 | HICYPGIPFHI,1033167.66 | HMIHGAHPGSRP,1065444.91 | PGAGAHGVHPFHI,1068282.4
3 | YWYWREFF,1413671.31 | HMIMFYRWI,1121049.04 | RDVYIHPFHI,973923.46 | RDVYIPGVKHI,1052268.12 | PANAYVPAMIHI,1056730.57 | HMIPACMPGGAVI,1066611.8
4 | YWYWREFF,1413671.31 | HMIMFYRWI,1121049.04 | RDVYIHPFHI,973923.46 | HICYPGIPFHI,1033167.66 | PANAYVPAMIHI,1056730.57 | PGAGAHGVHPFHI,1068282.4
5 | YWYWREFF,1413671.31 | HMIMFYRWI,1121049.04 | RDVYIHPFHI,973923.46 | HMIHQHIGARP,1059967.29 | HMIPACMPNAVI,1060398.24 | PGAGAHGVHPFHI,1068282.4
6 | RDVYWWWW,1297273.15 | HMIMFYRWI,1121049.04 | RDVYIHPFHI,973923.46 | HMICNANPMCY,1063898.78 | HMIMPIGSSAPR,1066942.84 | PGAGAHGVHPFHI,1068282.4
7 | YWYWREFF,1413671.31 | HMIMFYRWI,1121049.04 | RDVYIHPFHI,973923.46 | RDVYIHPGNII,1044130.50 | HMIYPAAIGAPR,1058998.80 | PGAGAHGVHPFHI,1068282.4
8 | YWYWREFF,1413671.31 | HMIMYYWRP,1124357.12 | RDVYIHPFHI,973923.46 | IHCHRHIGARP,1066807.39 | PGQAHGVHPFHI,1057926.96 | HICTGGAGSPFHI,1077096.2
9 | YWYWREFF,1413671.31 | HMIMYYWRP,1124357.12 | RDVYIHPFHI,973923.46 | HICYPGIPFHI,1033167.66 | IHCYVPAIGAPR,1066213.71 | PGAGAHGVHPFHI,1068282.4
10 | ERDWWYRW,1401440.03 | HMIMFYRWI,1121049.04 | RDVYIHPFHI,973923.46 | HICYPGIPFHI,1033167.66 | IHCYVPAIGAPR,1066213.71 | PGGAAYVPAMIHI,1069764.6
Dataset 0220
Run | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13
1 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | PPGVCAAGGGGAF,529482.6
2 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSNASAG,481168.09 | RPPGFSGGASAG,485736.16 | SSPPGESAGGSGA,532466.9
3 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | PPGHVPGPFGV,514439.80 | RPPGFSGGASAG,485736.16 | SPPGSESSGGAGA,532099.6
4 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | SPPGSESSGGAGA,532099.6
5 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | PPGVCAAGGGGAF,529482.6
6 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | SPPGSESSGGAGA,532099.6
7 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | SPPGSESSGGAGA,532099.6
8 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | PDGEGVSGGAGGT,534916.8
9 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPSDT,477261.05 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | SSPPGESAGGSGA,532466.9
10 | RPDWSPFR,474834.93 | RPPGFSPFR,463690.29 | RPPGFSPFGV,475238.25 | RPPGFSAGGSK,480531.83 | RPPGFSGGASAG,485736.16 | SPPGSESSGGAGA,532099.6
Dataset 0218
Run | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13
1 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | RPPGFSGGASK,948303.57 | RPPGFSGGAGGT,956080.93 | PPGESSSGGAGGT,1088705.5
2 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | RPPGFSGGASK,948303.57 | RPPGFSGGAGGT,956080.93 | PPGESSSGGAGGT,1088705.5
3 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | RPPGFSGGANT,948506.39 | RPPGFSGGAGGT,956080.93 | EDGSPSSPGGGGG,1083355.0
4 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | PPGHVPGFPVG,1038657.50 | RPPGFSGGAGGT,956080.93 | RPGGGGCGGGASK,1066973.5
5 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPFGV,928388.14 | RPPGFSGGANT,948506.39 | RPPGFSGGAGGT,956080.93 | PPGESSSGGAGGT,1088705.5
6 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | PPGFSVGPESS,1053005.48 | RPPGFSGGAGGT,956080.93 | PDGEGVSGGAGGT,1094221.3
7 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | RPPGFSGGANT,948506.39 | RPPGFSGGAGGT,956080.93 | PPGESSSGGAGGT,1088705.5
8 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | EDGSPSPPGAF,1039618.80 | RPPGFSGGAGGT,956080.93 | PPGESSSGGAGGT,1088705.5
9 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | EDGSPSPPGAF,1039618.80 | RPPGFSGGAGGT,956080.93 | EDGSPSSPGGGGG,1083355.0
10 | RPDWSPFR,939042.40 | RPPGFSPFR,892716.45 | RPPGFSPSDT,927701.33 | EDGSPSPPGAF,1039618.80 | RPPGFSGGAGGT,956080.93 | EDGSPSSPGGGGG,1083355.0
Table 10.3: Simulated Annealing Results for Different Lengths (cont). This table lists the ten predictions made by a search, restricted to
sequences of length 8 to 13 inclusive, on the 0121, 0220 and 0218 datasets.
Dataset:         0123  0119  1205  0121  0220  0218
No Noise Size:     49    34    34    42    46    47
Restricted         10     8     8     8    10    10
                    0     0     0     0     0     0
                    0     0     0     0     0     0
Unrestricted        6     3     1     3    10    10
                    0     0     0     0     0     0
                    0     0     0     0     0     0
Table 10.4: Simulated Annealing of Datasets Without Noise
The various experiments and their resulting outcomes, again out of 10 executions, are
detailed in Tables 10.5, 10.6 and 10.7. The parameter settings for experiment A are
those of Section 8.6 (namely, the settings we have been using thus far), and the settings
for the other experiments are the same as those of experiment A unless otherwise noted.
Additionally, running times (an average of three runs) for each experiment are included[2],
and the ratio to the corresponding experiment A running time is given in parentheses.
Certain experiments (C, D and E, which involve a single parameter change, and H, K, M,
O and P, which involve multiple changes) show some improvement in sequence prediction,
but at the expense of an increase in computation time. One could imagine using a genetic
algorithm as a more rigorous method for optimizing these search parameters as well as the
model parameters from Section 9.2. In the meantime, we elect to continue using the
initially chosen settings (experiment A), since they seem to do fairly well for the price paid
in computation time.
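To make the role of these search parameters concrete, the following is a generic simulated annealing loop in Python. It is only a sketch under stated assumptions, not the implementation of Section 8.6: k is taken to be a geometric cooling factor (which is consistent with the running times in Table 10.5, where k = 0.95 is slower and k = 0.75 faster), tempsteps is taken to be the maximum number of temperature steps, and nover and nlimit are taken to be the maximum numbers of attempted and accepted moves per temperature. The default values and the toy scoring problem are purely illustrative.

```python
import math
import random

def anneal(initial, score, propose, t0=10.0, k=0.9, tempsteps=200,
           nover=200, nlimit=50, rng=None):
    """Minimize score() by simulated annealing; return (best_state, best_score).

    t0        -- initial temperature
    k         -- cooling factor applied after each temperature step (assumed geometric)
    tempsteps -- maximum number of temperature steps (assumed meaning)
    nover     -- maximum moves attempted per temperature (assumed meaning)
    nlimit    -- maximum moves accepted per temperature (assumed meaning)
    """
    rng = rng or random.Random(0)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(tempsteps):
        accepted = 0
        for _ in range(nover):
            candidate = propose(current, rng)
            candidate_score = score(candidate)
            delta = candidate_score - current_score
            # Metropolis rule: always accept improvements, sometimes accept worse moves.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_score = candidate, candidate_score
                accepted += 1
                if current_score < best_score:
                    best, best_score = current, current_score
            if accepted >= nlimit:
                break
        if accepted == 0:
            break  # nothing accepted at this temperature: treat the search as converged
        temperature *= k
    return best, best_score

# Toy problem (not peptide sequencing): find the integer minimizing (x - 3)^2.
def toy_score(x):
    return (x - 3) ** 2

def toy_propose(x, rng):
    return x + rng.choice((-1, 1))

best_x, best_value = anneal(50, toy_score, toy_propose, rng=random.Random(1))
```

Under these assumed meanings, doubling nover or nlimit or raising k lengthens the time spent at each temperature or the number of temperatures visited, which matches the direction of the running-time ratios reported in the tables.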
10.6
Summary
Simulated annealing proved to be an effective means for exploring the space of sequence
guesses. It performed best when peptide guesses are restricted to a specific length. Since
the length of the correct sequence may not be known a priori, and an unrestricted search
does not seem to perform as well, one solution is to conduct several searches restricted to
[2] The running times are only very rough estimates: the algorithm is probabilistic in nature, and the
machines on which these simulations were run serve other processes as well.
Experiment (each run in restricted and unrestricted mode)
A: (see Section 8.6)
B: T0 = 150000
C: nover* = 2
D: nlimit* = 2
E: k = 0.95
F: k = 0.75
G: tempsteps* = 5
Outcomes
0119 1205 0121
(see Table 10.1)
(see Table 10.1)
8
6
10
0
0
0
0
2
0
6
1
4
0
0
0
1
0
0
10
7
9
0
0
0
0
1
1
2
1
8
0
0
0
0
0
0
8
7
10
0
0
0
2
2
0
4
0
1
0
0
0
0
1
0
0
4
7
9
6
2
0
0
1
0
1
5
2
0
1
0
1
0
6
7
8
0
0
0
0
0
0
2
2
1
0
0
0
0
8
0
7
0
8
0
0
0
1
2
0
1
0
0
0
0
0
3
0
0
Performance in Minutes(Ratio to A)
0119
1205
0121
7.58(1.00)
7.73(1.00)
10.56(1.00)
8.65(1.00)
9.16(1.21)
9.13(1.00)
7.88(1.02)
11.35(1.00)
11.15(1.06)
8.82(1.02)
10.14(1.11)
10.68(0.94)
16.41(2.16)
14.47(1.87)
20.96(1.98)
14.19(1.64)
15.71(1.72)
19.41(1.71)
10.52(1.39)
8.61(1.11)
11.96(1.13)
10.78(1.25)
10.54(1.15)
13.70(1.21)
13.44(1.77)
16.35(2.12)
20.66(1.96)
13.03(1.51)
17.11(1.87)
16.78(1.48)
3.09(0.41)
3.50(0.45)
3.55(0.34)
3.34(0.39)
3.52(0.39)
4.29(0.38)
9.05(1.19)
7.79(1.01)
10.31(0.98)
10.23(1.18)
7.99(0.88)
10.19(0.90)
Table 10.5: Exploring Simulated Annealing Parameter Space
Experiment (each run in restricted and unrestricted mode)
H: nover* = 2, nlimit* = 2
I: T0 = 75000
J: T0 = 50000
K: k = 0.95, tempsteps* = 5
L: T0 = 150000, tempsteps* = 5
M: k = 0.95, nover* = 2, nlimit* = 2
Outcomes
0119 1205 0121
10
9
10
0
0
0
0
0
0
5
0
7
0
0
0
0
1
0
8
9
6
0
0
0
0
0
0
3
2
1
0
0
0
0
0
0
7
4
10
0
0
1
0
0
8
0
2
4
0
0
9
0
0
4
0
0
0
10
0
0
0
1
2
0
0
9
0
0
3
0
0
4
0
0
1
0
0
0
10
0
0
0
0
2
0
0
10
0
0
4
0
0
10
0
0
2
0
0
2
8
0
2
7
1
6
0
3
0
Performance in Minutes(Ratio to A)
0121
1205
0119
23.83(2.26)
19.13(2.47)
17.59(2.32)
17.83(2.06)
18.97(2.08)
22.83(2.01)
8.87(1.17)
6.95(0.90)
11.45(1.08)
8.34(0.96)
7.25(0.79)
8.89(0.78)
7.76(1.02)
6.95(0.90)
11.32(1.07)
8.82(1.02)
8.62(0.94)
9.72(0.86)
13.89(1.83)
17.47(2.26)
22.04(2.09)
15.37(1.78)
17.65(1.93)
22.02(1.94)
9.61(1.27)
8.48(1.10)
11.27(1.07)
9.79(1.13)
9.25(1.01)
10.88(0.96)
26.25(3.46)
37.17(4.81)
47.46(4.49)
27.14(3.14)
36.98(4.05)
45.48(4.01)
Table 10.6: Exploring Simulated Annealing Parameter Space (cont)
Experiment
0119
N: To = 75000,
8
9
7
9.98(1.32)
8.05(1.04)
12.66(1.20)
unrestricted
0
1
3
0
0
2
0
1
4
10.40(1.20)
9.45(1.04)
11.01(0.97)
0
0
0
0
0
0
8
0
9
0
9
16.38(2.16)
11.86(1.53)
22.21(2.10)
0
5
0
3
0
1
4
16.28(1.88)
16.18(1.77)
22.84(2.01)
0
0
0
0
0
0
9
0
10
0
9
0
18.69(2.47)
17.71(2.29)
30.33(2.87)
1
2
0
2
1
4
17.70(2.05)
18.89(2.07)
30.18(2.66)
0
0
0
0
0
0
9
8
8
7.56(1.00)
8.85(1.14)
14.12(1.34)
0
0
0
1
0
1
2
0
3
0
5
0
8.85(1.02)
8.47(0.93)
13.05(1.15)
0
0
1
restricted
nover* = 2
unrestricted
P: To = 75000,
restricted
nover* =2,
nlimit* 2
unrestricted
Q:
To = 75000,
Performance in Minutes(Ratio to A)
0119
1205
0121
restricted
nlimit* = 2
O: T0 = 75000,
Outcomes
1205 0121
restricted
tempsteps* = 2
unrestricted
Table 10.7: Exploring Simulated Annealing Parameter Space (cont)
different sizes and then select the best overall sequence.
Chapter 11
Testing the Approach
This chapter focuses more on the data and the predictions made by the algorithm rather
than the specifics of the simulated annealing search. Because data was limited, we performed
a Leave One Out Cross-Validation, and looked for other sources of data to use for algorithm
validation. We also investigate the 1205 dataset in more detail.
11.1
Leave One Out Cross-Validation
Is the model too specific for its training set? One would naturally expect the algorithm to
perform well when run on datasets that were used in the algorithm's training (Table 10.1).
However, when sequencing de novo, the peptide is novel and could not have been used
previously to train the model. A more realistic test is therefore to train on a subset of the data,
and then ask how the algorithm performs on the remaining datasets. This technique,
called Leave One Out Cross-Validation, is commonly used when data
is scarce; typically, one trains on the largest possible training set (n-1 of n datasets)
and validates with the remaining dataset.
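The splitting step of this procedure can be sketched in a few lines of Python. The grouping of the six datasets by peptide follows the text (0220 and 0218 are bradykinin, the others angiotensin); the function names and data layout are assumptions made for the illustration. The sketch produces splits both by individual dataset and by whole peptide.

```python
# Dataset labels grouped by peptide, as used throughout this chapter.
DATASETS_BY_PEPTIDE = {
    "angiotensin": ["0123", "0119", "1205", "0121"],
    "bradykinin": ["0220", "0218"],
}

def leave_one_dataset_out(datasets_by_peptide):
    """Return (training, validation) pairs that hold out one dataset at a time."""
    all_datasets = [d for group in datasets_by_peptide.values() for d in group]
    return [([d for d in all_datasets if d != held_out], [held_out])
            for held_out in all_datasets]

def leave_one_peptide_out(datasets_by_peptide):
    """Return (training, validation) pairs that hold out all datasets of one peptide."""
    splits = []
    for held_out, validation in datasets_by_peptide.items():
        training = [d for peptide, group in datasets_by_peptide.items()
                    if peptide != held_out for d in group]
        splits.append((training, list(validation)))
    return splits

dataset_splits = leave_one_dataset_out(DATASETS_BY_PEPTIDE)  # six splits, five training sets each
peptide_splits = leave_one_peptide_out(DATASETS_BY_PEPTIDE)  # two splits, one per peptide
```

Each training list would be passed to the model training procedure, and the corresponding validation list to the annealing search.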
Leave One Out Cross-Validation was performed on two levels: (1) since we had data for
two peptides, we trained the model on one peptide and used the other for validation, and
(2) since we had six individual datasets, we trained the model on all but one, and validated
on the remaining one. This led to the following scenarios:
• Scenario 1: train the model using the angiotensin datasets only, validate on the
bradykinin datasets,
• Scenario 2: train the model using the bradykinin datasets only, validate on the
angiotensin datasets,
• Scenario 3: train the model using five of the six datasets, validate on the sixth.
11.1.1
Results for the Different Scenarios
The trained parameter values for each scenario are listed in Table 11.1, and for each scenario,
Table 11.2 summarizes the number of times the sequence was correctly predicted, again out
of 10 attempts.
                 Overall  Scenario 1  Scenario 2  Scenario 3: All But...
Parameter        (All 6)  Angio       Brady       0123  0119  1205  0121  0220  0218
Pnovariant       0.58     0.60        0.52        0.57  0.58  0.61  0.58  0.58  0.60
tendencym18      1        1           1           1     1     1     1     1     1
tendencym17      2        2           2           2     2     2     2     2     2
tendencyp18      2        2           1           2     2     1     2     2     2
Prandom          0.18     0.16        0.22        0.18  0.18  0.17  0.19  0.17  0.17
Punobserved      0.0      0.0         0.0         0.0   0.0   0.0   0.0   0.0   0.0
Pax              0.24     0.23        0.27        0.24  0.31  0.26  0.25  0.24  0.23
Pb               0.56     0.60        0.47        0.57  0.56  0.47  0.58  0.57  0.58
non-breaks       3.0      3.2         2.6         2.9   3.0   3.3   2.9   2.9   3.3
M[*,H],M[*,P]    2.4      2.2         2.9         2.3   2.4   3.3   2.3   2.4   2.3
Table 11.1: Model Parameters for the Different Scenarios. The values of the Overall Model
from Table 9.2 are included for ease of comparison.
The results of Scenarios 1 and 2 indicate that the model, when trained on one peptide, is
able to successfully predict the sequence of a different peptide not seen by the model. The
algorithm was not as successful with the 1205 dataset in Scenario 2, and even less so in
Scenario 3, but recall that 1205 was known to be flawed.
Restricted
Unrestricted
3
2
1
Scenario:
Dataset:
0123
9
0119
8
1205
1
0121
9
0220
9
0218
8
0
0
0
0
0
0
0
0
6
0
5
0
2
0
1
0
3
0
10
0
10
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0220
8
0218
9
0123
10
0119
10
1205
5
0121
10
0
0
0
0
0
0
10
0
10
0
7
0
1
2
1
0
0
0
0
0
1
0
0
0
0
Table 11.2: Results of Running Simulated Annealing with Model Parameters from the
Different Scenarios
11.1.2
Investigation of the 1205 Dataset
Training on All But 1205 (AllBut1205)
When the model is trained without 1205 (column 8 of Table 11.1), the algorithm also fails
to find the correct sequence for 1205 (see Table 11.3). This indicates that there is something
about the 1205 dataset that is different.
Dataset:         0123  0119  1205  0121  0220  0218
Restricted         10    10     1    10    10    10
                    0     0     0     0     0     0
                    0     0     1     0     0     0
Unrestricted        8     5     0     6    10    10
                    0     0     0     1     0     0
                    0     0     0     0     0     0
Table 11.3: Results of Running Simulated Annealing with Model Parameters Trained With
All Angiotensin and Bradykinin Datasets Except 1205 (AllBut1205)
If one were to compare the 1205 dataset with the other angiotensin datasets, one would
notice the following differences: the 1205 parent ion is not the highest peak (rather,
a Bp18 ion is); the 1205 intensities are higher in magnitude than those of the other
datasets; and the peaks appear to be saturated (an effect of high laser intensity or long
acquisition times). Examination of the model parameters when training with all datasets
but 1205 reveals three model parameter values that draw particular attention:
tendencyp18 = 1, Pb = 0.47 and M[*,H] = M[*,P] = 3.3. Since these values result when 1205
is excluded, they would suggest that in the 1205 dataset:
1. the "influence" of p18 variants is high (because when the 1205 dataset is excluded,
the tendencyp18 value decreases),
2. the "influence" of B type breaks is high, and
3. the "influence" of N-terminal H and P breaks is low.
Some means was needed for comparing the datasets. Because the total number of observed
trials differs from one dataset to another, one cannot simply compare the number of peaks
and/or their intensities without some sort of normalizing factor.
One might consider the contribution of a set of peaks to the overall score as a measure of the
influence of those peaks. Since a few peaks of high intensity can be more influential than
many peaks of low intensity, this quantity is attractive because it depends upon both the
cardinality of the set and the intensity of each peak. However, contributions must
be computed with respect to some PMF generated by some model. Scores will change
and compare differently depending upon the model parameters. The difficulty thus lies
in determining which parameter values are appropriate to use in order to effect such a
comparison.
Comparing Datasets by Training on a Single Dataset
We decided to use the model parameters themselves, when trained on each dataset alone,
as a means for comparing the datasets.
The resulting parameter values should reflect
fragment type composition and fragmentation patterns of the training dataset, and thus,
can potentially serve as indicators of influence. Table 11.4 lists the settings obtained from
single dataset training, and we can indeed verify that the parameter settings for the model,
when trained with 1205 only, are supportive of the above assertions.
Parameter        0123  0119  1205  0121  0220  0218
Pnovariant       0.71  0.78  0.56  0.62  0.62  0.47
tendencym18      1     1     1     1     1     1
tendencym17      2     2     2     2     2     2
tendencyp18      1     1     2     1     1     1
Prandom          0.15  0.12  0.19  0.13  0.29  0.18
Punobserved      0.0   0.0   0.0   0.0   0.0   0.0
Pax              0.28  0.34  0.21  0.23  0.28  0.26
Pb               0.46  0.4   0.69  0.49  0.44  0.49
non-breaks       4.1   4.1   2.8   3.5   5.6   1.6
M[*,H],M[*,P]    3.7   4.4   1.5   3.2   3.6   2.7
Table 11.4: Model Parameters When Trained With a Single Dataset
Note that for some of the parameters, the settings cover a wide range of possible values
across all datasets (e.g. Pnovariant and non-break tendencies). Also, other datasets have
deviant setting values as well: 0220 has a high Prandom and non-break tendency; 0119, a
higher Pax and M[*,H], M[*,P] value; and 0218, a rather low Pnovariant and non-break
tendency. Some of these values were surprising, as they were not readily apparent from
simple visual inspection of the dataset, nor were they anticipated from the experiments
performed so far. These may indicate that the model needs to be further tuned in terms of
the number of parameters and/or the granularity of each parameter.
As a control, we ran the algorithm on the very same dataset used to train the model. One
would expect the algorithm to do well, and this was the case for almost all the datasets
except 1205.
Dataset:         0123  0119  1205  0121  0220  0218
Restricted         10     9     6    10     9    10
                    0     0     0     0     0     0
                    0     1     1     0     0     0
Unrestricted        7     5     1     6    10    10
                    0     0     0     0     0     0
                    0     0     0     0     0     0
Table 11.5: Results of Running Simulated Annealing with Each Model of Table 11.4 on Its
Training Set
Training on All Angiotensin Except 1205 (AllAngioBut1205)
This time, we trained the model on all angiotensin datasets except 1205 (see Tables 11.6
and 11.7). The algorithm did not work well for 1205: the correct sequence was not the
best scoring sequence. We knew 1205 was questionable, and this indicates that 1205 is very
"un-angiotensin like".
Parameter        All Angio But 1205
Pnovariant       0.68
tendencym18      1
tendencym17      2
tendencyp18      1
Prandom          0.14
Punobserved      0.0
Pax              0.27
Pb               0.45
non-breaks       3.8
M[*,H],M[*,P]    3.6
Table 11.6: AllAngioBut1205: Trained Parameter Values
Dataset:         0123  0119  1205  0121  0220  0218
Restricted         10     9     0    10    10    10
                    0     0     0     0     0     0
                    0     0     0     0     0     0
Unrestricted       10     7     0     7    10    10
                    0     0     0     0     0     0
                    0     0     0     0     0     0
Table 11.7: Results of Running Simulated Annealing with Table 11.6 Model Parameters on
the Other Datasets
It is interesting to note that when 1205 is removed from the training set, there seemed to be
an overall improvement in the performance of the unrestricted search for the other datasets
(compare Tables 11.3 and 11.7 to Table 10.1).
Influential Training Datasets
Whenever 1205 is included in our training sets, the resulting parameter settings bear a
closer resemblance to those of 1205 than to those of any other dataset (compare columns 2 and 3 of
Table 11.1 to Table 11.4). In addition, with Leave One Out Cross-Validation, the greatest
change in parameter settings was observed when the single dataset excluded was the 1205
dataset (compare column 2 of Table 11.1 to columns 5-10 of Table 11.1).
The reason 1205 exerts such a strong influence during training is likely its higher
peak intensities. Compared to the other datasets, the 1205 score is larger in magnitude[1],
and since the model is trained by optimizing the sum of the scores of the training
datasets, large scoring datasets will overshadow the scores of other
datasets. (Conversely, of our datasets, 0119 is the least influential; it also has the lowest
peak intensities.) Thus, model parameters are biased towards datasets with high intensities,
and will tend to reflect any idiosyncrasies these datasets may have. In the case of 1205,
these parameters would be adversely affected.
Some form of normalization may help re-weigh the datasets and equalize them during training so that no one dataset will be strong enough to unduly sway the parameter settings.
Another solution would be to train with a much larger training set so that negative effects
due to any single dataset are diluted.
Normalization of Training Data
We explored the effect of normalization during training[2] by modifying the heights of the
experimental peaks in one of two ways:
Method I Normalize on the parent ion by dividing the intensity of each peak by the
intensity of the parent.
Method II Adjust all the resulting intensity quotients of Method I so that their sum is 1.
The Method I correlation score is essentially the original unnormalized score divided by the
height of the parent ion. The Method II correlation score is the original divided by the sum
of all peak heights in the dataset.
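The two methods can be written out directly. The sketch below is an illustration in Python with made-up peak heights; the function names are hypothetical. It also shows why Method II is equivalent to dividing each peak by the sum of all peak heights.

```python
def normalize_method_1(intensities, parent_intensity):
    """Method I: divide every peak intensity by the intensity of the parent ion."""
    return [h / parent_intensity for h in intensities]

def normalize_method_2(intensities, parent_intensity):
    """Method II: rescale the Method I quotients so that they sum to 1."""
    quotients = normalize_method_1(intensities, parent_intensity)
    total = sum(quotients)
    return [q / total for q in quotients]

# Made-up spectrum: three peak heights, the first being the parent ion.
peaks = [200.0, 50.0, 150.0]
parent = peaks[0]

method_1 = normalize_method_1(peaks, parent)  # [1.0, 0.25, 0.75]
method_2 = normalize_method_2(peaks, parent)  # [0.5, 0.125, 0.375]

# The parent intensity cancels in Method II, leaving h / sum(h) for each peak.
direct = [h / sum(peaks) for h in peaks]
```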
The parameter settings are listed in Table 11.8 and the results in Tables 11.9 and 11.10.
[1] Recall that the logarithm of Prob(S,F) from Section 8.2 is a sum of products. The higher the intensity,
the larger the multiplier, the larger the product and the resulting sum, and thus the larger the score.
[2] Note that the intent of this line of inquiry is not to normalize in order to improve the performance
of Tables 11.7 or 11.3, because 1205 was not used to train these models. Furthermore, if 1205 were indeed
unlike the typical angiotensin dataset, then no form of normalization would help salvage it.
Parameter        Method I  Method II
Pnovariant       0.60      0.63
tendencym18      1         1
tendencym17      2         2
tendencyp18      1         1
Prandom          0.18      0.18
Punobserved      0.0       0.0
Pax              0.26      0.17
Pb               0.5       0.49
non-breaks       3.0       3.4
M[*,H],M[*,P]    2.9       3.1
Table 11.8: Trained Parameter Values When Normalizing All Six Datasets
Dataset:
Restricted
Unrestricted
Table 11.9:
0123
10
0119
10
1205
5
0
0
0
5
0
2
0
0
0121
9
0220
10
0218
10
0
0
0
0
1
1
0
5
0
10
0
10
0
0
0
0
0
0
0
0
0
0
Results of Running Simulated Annealing When Datasets Are Normalized
(method I)
Dataset:
Restricted
Unrestricted
Table 11.10:
0123
10
0119
10
1205
0
0121
10
0220
10
0218
10
0
0
0
0
0
2
0
0
0
0
0
0
5
3
0
4
10
10
0
0
0
0
0
0
0
0
0
0
0
0
Results of Running Simulated Annealing When Datasets Are Normalized
(method II)
One would expect the influence of 1205 to increase with Method I because of all our datasets,
it has the lowest intensity parent ion. And indeed, the Method I results were comparable to
those of the unnormalized model (Table 10.1) - there was enough "1205 character" in the
Method I parameters to allow the algorithm to correctly predict 1205 some of the time. Note
that the performance of the algorithm on the other datasets remained relatively unaffected.
One would also expect the influence of 1205 to decrease with Method II because the sum
of all peak heights is largest for 1205 and therefore, 1205 contributes less to the sum being
minimized during training. It was not a surprise, then, that with Method II, the other
datasets prevailed, and because 1205 was so unlike them, the algorithm did poorly on it,
preferring a better-scoring competing sequence, IRTYIHPFHI.
While normalization during training may be useful, normalization of the input spectra (the
data used for algorithm validation) is unnecessary. An input spectrum is used only to score
sequence guesses against each other. Normalizing the input spectrum with either Method I
or Method II amounts to multiplying the score assigned to every guess by the same constant
factor, so normalizing with either method would not alter any guess rankings.
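The argument above can be checked on a toy example. The following Python sketch is only an illustration under stated assumptions: the score is taken to be a plain sum of intensity times (-log probability) products, and the function names, the spectrum and the three guesses are all hypothetical rather than taken from the implementation described in this thesis.

```python
def normalize_method_1(intensities, parent_intensity):
    """Method I: divide each peak intensity by the parent-ion intensity."""
    return [x / parent_intensity for x in intensities]


def normalize_method_2(intensities, parent_intensity):
    """Method II: rescale the Method I quotients so that they sum to 1."""
    quotients = normalize_method_1(intensities, parent_intensity)
    total = sum(quotients)
    return [q / total for q in quotients]


def score(intensities, neg_log_probs):
    """Assumed score: sum of intensity * (-log probability); lower is better."""
    return sum(x * w for x, w in zip(intensities, neg_log_probs))


def rank(intensities, guesses):
    """Order guess names from best (lowest score) to worst."""
    return sorted(guesses, key=lambda name: score(intensities, guesses[name]))


# Toy spectrum and made-up per-peak -log probabilities for three guesses.
spectrum = [40.0, 10.0, 25.0, 5.0]
parent = 20.0
guesses = {
    "guess_a": [1.0, 2.0, 0.5, 3.0],
    "guess_b": [0.8, 2.5, 1.5, 1.0],
    "guess_c": [2.0, 0.5, 1.0, 0.5],
}

raw_ranking = rank(spectrum, guesses)
ranking_1 = rank(normalize_method_1(spectrum, parent), guesses)
ranking_2 = rank(normalize_method_2(spectrum, parent), guesses)
print(raw_ranking, ranking_1, ranking_2)
```

Because each method multiplies every intensity by one positive constant, all three rankings agree. Method II is also equivalent to dividing each intensity by the sum of all peak heights, which matches the description of the Method II correlation score.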
Factory Angiotensin: a Valid Angiotensin Dataset
This last section in our discussion of 1205 establishes that our non-1205 datasets are indeed
valid datasets for angiotensin, and confirms that 1205 is a poor dataset for angiotensin.
As mentioned in Section 9.1, our mass spectrometer came with an angiotensin PSD spectrum
of 56 peaks, probably acquired for instrument testing/evaluation purposes. Fortunately, this
provided us with another angiotensin sample point.
The parameter settings for a model trained on this dataset alone are given in Table 11.11
and they are within normal range of the other non-1205 angiotensin settings (Table 11.4).
The tendencyp18 setting resembles that of 1205, but recall that coarse-grained values were
used for these parameters (so that a value even slightly above 1.5 would be reported as a 2
and similarly a value slightly below 1.5 would be treated as a 1).
Parameter        Factory Angiotensin
Pnovariant       0.77
tendencym18      1
tendencym17      2
tendencyp18      2
Prandom          0.14
Punobserved      0.0
Pax              0.33
Pb               0.44
non-breaks       4.2
M[*,H],M[*,P]    4.1
Table 11.11: Factory Angiotensin: Trained Parameter Values
Sequence prediction was performed on this dataset (see Table 11.12) using the Overall
model, as well as models trained without 1205 such as AllBut1205 and AllAngioBut1205.
The unrestricted Overall search did not do well because it encountered a lower scoring
sequence of length 11, HICYPVAPFHI. When 1205 is removed from the training set, however,
the algorithm is more successful (last two columns of Table 11.12).
Model:
Restricted
Unrestricted
Overall
3
AllBut1205
4
AllAngioBut1205
5
0
3
0
4
0
5
0
0
3
0
0
0
0
0
4
Table 11.12: Results of Running Simulated Annealing on Factory Angiotensin with Various
Trained Models
This dataset and these results are important because they are consistent with the claim
that our non-1205 angiotensin datasets are valid and that 1205 is markedly different.
Although acquired on the same machine, the factory dataset was collected independently
and by a different operator. Despite any resulting variations in sample preparation style and
acquisition technique, its tendencies are similar to the other non-1205 angiotensin datasets,
and it is recognized by a model trained with them only (AllAngioBut1205).
We conclude that the 1205 dataset is an outlier and exclude it from the training sets in
all subsequent analyses; it was thought likely to cause problems, and when included in the
training, it did.
11.2 Meta-Analysis of Published Spectra
To obtain more validating data, we searched the literature for cation MALDI-PSD spectra
of peptides that contained no modified residues, and had H- and -OH as the N- and
C-terminal groups respectively. Although we preferred spectra with a substantial number of
peaks labeled, for datasets that were small, we manually interpolated and included in the
dataset peaks that were unlabeled but of a high enough intensity to be of potential interest.
Since the real sequence was known, we could sometimes use the theoretical spectrum to
deduce masses for unlabeled peaks when the resolution of the spectrum was good enough.
Otherwise, manual measurements, which may correctly salvage a legitimate peak otherwise
ignored, could potentially introduce unwanted noise into the dataset.
If the peak intensities of a published spectrum were not included in the paper, then the
intensities were inferred by measuring the height of the printed peak with a ruler, picking
an intensity for the parent peak, and computing the heights of all others relative to this.
Note that as a result, the intensities of these peaks differ from their actual intensities, but
their relative intensities are roughly preserved. This may result in a slight fluctuation in
the computed score, but our approximation of peak heights does not appear to be grossly
unreasonable. Recall that datasets are not compared to each other, but rather, the same
dataset is used to evaluate different candidate sequences.
The performance of the algorithm using the AllBut1205 model is shown in Table 11.13. The
algorithm produced acceptable sequence predictions for the first three peptides. The poor
performance on the first peptide (4 out of 10), particularly when the correct sequence is the
minimum scoring sequence, is indicative of a poor search. The searching parameters may
be suboptimal, and the search may be further complicated by the surface features of the scoring
function - e.g. the global minimum could be located at the bottom of a very narrow well so
that moves out of the well are easy, but moves into the well are difficult. After trying several
different combinations of searching parameters, we were able to improve the performance
of the algorithm to (6 2 0), but the algorithm took six times as long.
Originally, the fourth dataset consisted of the 33 labeled peaks in the published spectrum,
and our algorithm predicted DRVYIHPFIHIIHV (restricted search). Three other masses
(251, 263 and 463.2, corresponding to internal ions for LH, VY and HLLV respectively) were
mentioned in the text of the paper but not labeled in their figure. When these were included
in the dataset, the restricted search then found DRVYIHPFIHIVYS, but only 3 out of 10
times. The unrestricted search, on the other hand, found a longer sequence that did not
resemble the real sequence but scored better. The paper mentioned other immonium ions,
including two for tyrosine (Y), and these might have helped focus the unrestricted search to
a sequence more similar to the real one, but we could not locate these peaks in the spectrum
to estimate peak intensities, so they were not included. The problem with this dataset is
that for a peptide of this length, the dataset is too small.
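The three extra masses quoted above can be checked with a small calculation. The Python sketch below is an illustration only: it assumes standard monoisotopic residue masses and treats each internal ion as the sum of its residue masses plus one proton. The function name and the tolerance are hypothetical choices, not part of the method described here.

```python
# Monoisotopic residue masses (Da) for the residues that occur below.
RESIDUE_MASS = {
    "H": 137.05891,
    "L": 113.08406,
    "V": 99.06841,
    "Y": 163.06333,
}
PROTON_MASS = 1.00728


def internal_ion_mass(fragment):
    """Assumed internal-ion mass: residue masses of the fragment plus one proton."""
    return sum(RESIDUE_MASS[residue] for residue in fragment) + PROTON_MASS


# Masses mentioned in the text for the internal ions LH, VY and HLLV.
quoted = {"LH": 251.0, "VY": 263.0, "HLLV": 463.2}
computed = {fragment: internal_ion_mass(fragment) for fragment in quoted}

for fragment in quoted:
    print(fragment, round(computed[fragment], 2), quoted[fragment])
```

Under these assumptions each computed mass falls within a few tenths of a dalton of the quoted value, which is consistent with the assignments given in the text.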
The use of published spectra in this manner is definitely suboptimal and not terribly reliable.
Fortunately, we were able to obtain several datasets that were more suitable for analysis.
11.3 Data from Another Center
Having data from another center is beneficial because it is collected independently, by a
different operator on a different machine. Validating with such data is especially useful
because it can help reveal any invalid assumptions we made that were specific to our
instrument, data and/or protocol.
Dr. Arnie Falick of Applied BioSystems supplied us with three PSD datasets, and Drs.
Wishnok and Tannenbaum (MIT) allowed us to view the data files on their mass spectrometer.
Peaks in the spectra for AVPYPQR and for DIYETDYYR were already labelled,
so the peak lists were used as the datasets for these two peptides and no additional peaks
were added. A peak list for the third peptide, AIEAQQHLLQLTVWGIK, was supplied to
us directly by Dr. Falick. The results of running our algorithm using the AllBut1205 model
are shown in Table 11.14.
For the third dataset, the performance of the restricted search can be improved to (5 0 1)
with a longer search. Also, the unrestricted search actually found IAEAQKHIIQITVWGIQ
six times (6 0 0), but QANAKKHIIQITVSVGGPS (1 0 2) was a better scoring sequence. It
is also a longer sequence, so when sequencing larger peptides, the search may not be able to
simply take the best scoring guess over all lengths (as was proposed in Section 10.3)
without some improvement to the scoring function or the dataset quantity/quality. We will
return to a discussion of longer peptides in Section 13.2.
11.4 Summary
Leave One Out Cross-Validation was promising, the inadequacy of 1205 was confirmed,
and the performance of our algorithm was satisfactory for the most part. When spectra
for peptides not used in the training set were presented to the algorithm, the predictions
were mostly adequate. However, we began to see the effects of (1) datasets that
are too small (more data peaks are always welcome to increase redundancy), (2) searching
parameters that may need improvement, and (3) longer peptides on the sequencing process.
We address some of these issues in the next chapter.
Parent Mass of Dataset Peptide:  1137.6       1375.8          1046.5      1758.9
Dataset Size:                    47           71              28          36
Source:                          [CLS99]      [Spe97]         [JnC96]     [JnC96]
Real Sequence:                   YGGFLRRIR    GDHFAPAVTLYGK   DRVYIHPF    DRVYIHPFHLLVYS
Predicted: Restricted            YGGFIRRIR    DGHFAPAVTIYGK   DRVYIHPF    DRVYIHPFIHIVYS
                                 (4 0 0)      (8 0 0)         (10 0 0)    (3 0 0)
Predicted: Unrestricted          YGGFIRRIR    DGHFAPAVTIYGK   DRVYIHPF    HIMIIIGCFTIYVHV
                                 (7 0 0)      (8 0 1)         (9 0 0)     (5 0 0)
Table 11.13: Results of Simulated Annealing Run on Datasets from the Literature Using a
Model Trained Without 1205 (AllBut1205). The 1375.8 dataset was the only dataset for
which extra peaks were not inferred.
Parent Mass of Dataset Peptide:  830.4      1237.5      1948.1
Dataset Size:                    50         35          72
Real Sequence:                   AVPYPQR    DIYETDYYR   AIEAQQHLLQLTVWGIK
Predicted: Restricted            AVPYPQR    DIYETDYYR   IAEAQKHIIQITVWGIQ
                                 (9 0 0)    (10 0 0)    (4 0 0)
Predicted: Unrestricted          AVPYPQR    DIYETDYYR   QANAKKHIIQITVSVGGPS
                                 (8 0 0)    (8 0 0)     (1 0 2)
Table 11.14: Results of Simulated Annealing Run on Datasets from Applied BioSystems
Using a Model Trained Without 1205
Chapter 12
Discussion
This chapter examines two issues in more detail: what happens when the algorithm is
presented with spectra of longer peptides, and how dataset size affects performance.
12.1 A Study of Two Longer Peptides
Our algorithm encountered problems when handling spectra of longer peptides because
longer sequence guesses frequently outscore the real sequence. Scoring problems indicate
that the data is poor, or the model is inaccurate/insufficient, or both. Since we have more
control over the model, we considered two possible ways to improve it: (1) by expanding the
training set so that a model trained on more diverse data would be more encompassing in
scope, and (2) by augmenting the fragmentation model so that it would be a truer reflection
of the real process.
12.1.1 Enlargement of the Training Set
We included the three datasets of Section 11.3 in the training set, bringing the training set
membership to a total of 8.
Parameter        0123   0119   0121   0220   0218   830.4  1237.5  1948.1
Pnovariant       0.72   0.72   0.74   0.72   0.75   0.73   0.64    0.69
tendencym18      1      1      1      1      1      1      1       1
tendencym17      2      2      2      2      2      2      2       2
tendencyp18      1      1      1      1      1      1      1       1
Prandom          0.12   0.12   0.12   0.11   0.11   0.12   0.15    0.12
Punobserved      0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0
Pax              0.14   0.15   0.14   0.15   0.14   0.16   0.19    0.18
Pb               0.44   0.44   0.44   0.44   0.44   0.45   0.52    0.36
non-breaks       2.6    2.6    2.6    2.5    2.8    2.6    3.0     2.7
M[*,H],M[*,P]    2.9    2.9    3.0    2.9    2.9    2.7    3.0     3.3
Table 12.1: Model Parameters for Leave One Out Cross-Validation
Dataset:
Restricted
Unrestricted
0123
10
0119
9
0121
10
0220
10
0218
10
830.4
10
1237.5
9
1948.1
4*
0
0
0
0
0
0
0
0
0
10
0
5
0
6
0
10
0
10
0
7
1
7
1*
8*
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
Table 12.2: Results of Leave One Out Cross Validation with the Eight Datasets
Leave One Out Cross Validation for a Larger Training Set
The trained model parameters and the corresponding validation results are given in Tables
12.1 and 12.2. In all cases, except for the peptide of mass 1948.1, the predicted
sequence was the correct sequence and the best scoring candidate found. In the case of
1948.1, the restricted search found PDTAQKHIIQITVWGIK (score 1859158.0) to be better
scoring than IAEAQKHIIQITVWGIK (score 1860305.6), and the unrestricted search found
IAEAQKHIIQITVSVGIQ instead. We will return to this momentarily.
Training on All Eight Datasets
The model was trained on all eight datasets, and the resulting parameters and performance
are shown in Tables 12.3 and 12.4.
Parameter        Trained Value
Pnovariant       0.75
tendencym18      1
tendencym17      2
tendencyp18      1
Prandom          0.12
Punobserved      0.0
Pax              0.16
Pb               0.44
non-breaks       2.7
M[*,H],M[*,P]    2.9
Table 12.3: Trained Model Parameters: Overall Training Set Plus 830.4, 1237.5 and 1948.1
Restricted
0123
10
0
0
Unrestricted
Original Dataset
0119 0121 0220
10
8
10
0218
10
M+H of Other Datasets
830.4 1237.5
1948.1
7
10
3*
0
0
0
1
0
0
0
0
0
0
0
0
0
0
8
7
6
10
10
10
7
10*
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
Table 12.4: Results of Running Simulated Annealing On All Datasets with the Model
Parameters of Table 12.3.
In general, the performance of the search was fairly good but this should not be surprising as
the training set was used as the validating set. Some difficulty was encountered, again with
1948.1, the last column of Table 12.4. The restricted search found PSEAQKHIIQITVWGIQ
which outscored the real sequence due to better support for the first break position: masses
217.2, 242.2 and 1330.9 could be interpreted as a YB internal for SE, YAm18 for SEA and
YBm18 for SEAQKHIIQITV respectively. The real sequence, on the other hand, had two
supporters for p1 (masses 1329.0 and 1346.9 which are YAm18 for IEAQQHIIQITV and
YA for IEAQQHIIQITV) but unfortunately, these peaks had multiple identities (Bm18 and
Bion respectively) and these alternate identities were also generated and accounted for by
the competing sequence. The competing sequence could thus account for more peaks, and
without any additional support for the first position, the scoring favored the competition.
The unrestricted search found IAEAQKHIIQITVSVGIQ which scored better than the real
sequence because low mass peaks supported an SV in place of the W. These masses were:
159, 242, 355 and 892, which can be interpreted as a YA for VS, a YAm18 for TVS, a
YAm18 for ITVS, and a YB for HIIQITVS respectively.
Even with an enlarged training set, the model experienced difficulty predicting the correct
sequence for 1948.1.
One comment can be made about the performance on the other datasets. Despite the
variance in parameter settings for the different models, the various searches did well for
these other peptides (see Tables 12.4, 11.7 and 11.3). This suggested that there was a
range of model parameter settings that were acceptable (at least for these peptides).
Note however, that even though training on two peptides allowed the model to perform well
on peptides not used in the training set, one may still encounter some novel peptide whose
fragmentation behavior is entirely distinct from anything our model could be trained with.
More generally, suppose each peptide has a subspace of optimal model/search parameter
settings. There is no guarantee that the intersection of all such subspaces for every possible
peptide is non-empty, i.e. there may not exist a universal set of parameter values that will
allow our algorithm to recognize all peptides. Or, if such a set existed, the settings might
be so compromised that the resulting algorithm performs poorly for all datasets, and
reliably for none.
12.1.2 Refining the Model to Improve the Scoring Function
When the search finds that the real sequence is not the best scoring sequence, one solution
is to refine the model to improve the score of the real sequence. From the previous
section, we noticed that the incorrect sequences were supported by YAm18 ions, which are
forbidden by the following rule: a peptide containing S, T, D or E can lose water, and
a peptide containing R, K or H can gain water, only if the fragment type in question is a
Bion [Hin97]. Inclusion of this rule in the model restricts the allowed fragment types to
{Am17, A, Bm18, Bm17, B, Bp18, Ym17, Y} which could potentially reduce the amount
of supporting evidence for incorrect sequences. In addition, the refined model specifies that
variant formation is residue dependent, location dependent and now, fragment type
dependent. While one additional rule may not be enough to remedy the problem in the end, it is
an example of one of the many things that can be done to make the model a more realistic
simulator of the rules by which Nature operates.
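One way to picture the variant check is as a small filter over candidate fragment types. The Python sketch below is a hedged illustration, not the code used in this work. The function name is hypothetical, and because the text does not state a residue condition for the m17 variants, the sketch assumes they are always permitted.

```python
WATER_LOSERS = set("STDE")  # residues associated with loss of water (m18)
WATER_GAINERS = set("RKH")  # residues associated with gain of water (p18)


def allowed_fragment_types(base, fragment):
    """List the fragment types permitted for a base type ('A', 'B' or 'Y').

    The m18 and p18 variants are generated only for B ions, and only when
    the fragment contains a residue that can lose or gain water.
    """
    allowed = [base, base + "m17"]  # assumption: m17 is unrestricted here
    if base == "B":
        if WATER_LOSERS & set(fragment):
            allowed.append("Bm18")
        if WATER_GAINERS & set(fragment):
            allowed.append("Bp18")
    return allowed


# A fragment containing both D (water loss) and R (water gain).
all_types = set()
for base in "ABY":
    all_types.update(allowed_fragment_types(base, "DRVY"))
print(sorted(all_types))
```

For a fragment containing both kinds of residue, the union over the three base types is the eight-member set listed above, and water-loss variants of non-B types are never generated.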
Note also that we are now close to the edge of uncharted territory; for example, the
literature does not address the following questions: If a peptide containing S, T, D or E can
exhibit a Bm18 or Bp18 ion and if the loss/gain of water is side-chain specific (because it
is residue specific), why does it not also express the Am18 and Ap18 ions? Is it dependent
on some interaction between the side-chain and the C-terminus?
12.1.3 Performance of the Different Variations
Predictions made by the various models for the peptides of mass 1758.9 (length 14) and
1948.1 (length 17) are compiled in Tables 12.5 and 12.6 for convenience. Column 1 reports
the performance of the AllBut1205 model, while Columns 2 and 3 both use a model trained
on an enlarged training set (Table 12.3). The Bion variant check refinement was also
incorporated into the model of the third column.
It is difficult to evaluate the effect of the different models on 1758.9. Again, the central
problem is one of dataset size. An analyst with the intent of sequencing would have
compiled a larger peak list if possible. Nevertheless, the predictions made by a restricted search
seemed to worsen, but one could argue that the unrestricted predictions showed improvement,
although in some cases, the unrestricted search preferred a longer sequence.
                         AllBut1205        Enlarged          Enlarged+Refined
Predicted: Restricted    DRVYIHPFIHIVYS    DRVYIHPFIHIVYS    DRVYIHPFIHIIHV
                         (3 0 0)           (4 0 0)           (7 0 0)
Predicted: Unrestricted  HIMIIIGCFTIYVHV   HIMIERTFTIYVHV    DYAIAIHPFITYVHV
                         (5 0 0)           (1 0 0)           (1 0 0)
Table 12.5: Results of Various Simulated Annealing Runs on Dataset with M+H 1758.9
                         AllBut1205            Enlarged              Enlarged+Refined
Predicted: Restricted    IAEAQKHIIQITVWGIQ     PSEAQKHIIQITVWGIQ     PSEAQKHIIQITVWGIK
                         (4 0 0)               (3 0 0)               (6 0 0)
Predicted: Unrestricted  QANAKKHIIQITVSVGGPS   IAEAQKHIIQITVSVGIQ    IAEAQKHIIQITVSVGIQ
                         (1 0 2)               (10 0 0)              (8 0 0)
Table 12.6: Results of Various Simulated Annealing Runs on Dataset with M+H 1948.1
A similar phenomenon occurred with the 1948.1 dataset - the restricted predictions seemed
to move farther from the real one, while the unrestricted ones improved. An analysis of
the scores showed that in the restricted case, the shift from IAEAQKHIIQITVWGIQ (score
1843917.7) to PSEAQKHIIQITVWGIQ (score 1843207.4) was due to the enlargement of the
training set, not to model refinement. On the other hand, both enlargement and refinement
independently and individually caused a shift from QANAKKHIIQITVSVGGPS to the
closer sequence, IAEAQKHIIQITVSVGIQ.
So is it an issue of the model, the search or the data? Probably all three. Here we saw one
small refinement to the model which showed some improvement in (unrestricted) prediction.
The search did not find a better scoring sequence than the ones listed in these tables, but
judging from the low performance numbers, it did not converge on them reliably. The input
spectrum is the component that we have least control over, but more (legitimate) experimental
peaks are always a bonus, and are especially essential when dealing with spectra of longer
peptides.
12.2 A Study of Dataset Size
Our algorithm consistently found the correct sequence for the 0220, 0218 and often 0123
datasets over a wide range of parameter settings, and we conjecture that the size of the
datasets and the redundancy contained therein are the reasons for this. Larger datasets
constrain the space of possibilities and worsen the scores of candidate sequences that do
not account for a large percentage of the dataset since penalties are levied for unexplained
experimentals. In contrast, small datasets are less constraining so that it is easier to find
a peptide guess that accounts for almost all experimental peaks, and perhaps easier to find
one that may even score better than the actual sequence. In short, what matters is not
only how many experimentals are accounted for, but also how many are unaccounted for.
We ran three experiments to study dataset size using the AllBut1205 model parameters of
Table 11.1 (and the refined model of Section 12.1.2). They entailed running the sequence
prediction algorithm in restricted mode while:
1. removing the lowest intensity peaks one at a time,
2. removing the highest intensity peaks one at a time, and
3. removing all noise peaks and adding them back in one at a time
Certain behaviors and patterns can be seen in our datasets, but additional data is needed
if more general conclusions are to be drawn.
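The three experiments amount to three schedules for shrinking or growing a peak list. The following Python sketch shows one way such schedules could be generated. It is an assumption-laden illustration: peaks are represented as (mass, intensity) tuples, the set of noise masses is supplied by hand, and none of the names come from the implementation described here. Each generated dataset would then be passed to the sequence prediction algorithm.

```python
def remove_lowest_first(peaks):
    """Yield successively smaller datasets, dropping the weakest peak each time."""
    remaining = sorted(peaks, key=lambda peak: peak[1])  # ascending intensity
    while remaining:
        yield list(remaining)
        remaining = remaining[1:]


def remove_highest_first(peaks):
    """Yield successively smaller datasets, dropping the strongest peak each time."""
    remaining = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    while remaining:
        yield list(remaining)
        remaining = remaining[1:]


def add_noise_back(peaks, noise_masses):
    """Start from the noiseless dataset and re-add noise peaks, most intense first."""
    signal = [peak for peak in peaks if peak[0] not in noise_masses]
    noise = sorted(
        (peak for peak in peaks if peak[0] in noise_masses),
        key=lambda peak: peak[1],
        reverse=True,
    )
    for count in range(len(noise) + 1):
        yield signal + noise[:count]


# Toy dataset of (mass, intensity) pairs; two peaks are designated as noise.
toy_peaks = [(100.0, 5.0), (200.0, 50.0), (300.0, 1.0), (400.0, 20.0)]
toy_noise = {100.0, 300.0}

low_series = list(remove_lowest_first(toy_peaks))
high_series = list(remove_highest_first(toy_peaks))
noise_series = list(add_noise_back(toy_peaks, toy_noise))
print(len(low_series), len(high_series), len(noise_series))
```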
12.2.1 Removing Low Intensity Peaks
The results of the first series of experiments are depicted in Figure 12-1. The x-axis is the
number of peaks left in the dataset and this decreases as more and more of the lowest
intensity peaks are removed. The y-axis is the number of times, out of 10, the correct sequence
was encountered as the minimum scoring sequence in a length-preserving search (i.e. it is
the sum of the three numbers that we have been reporting separately, see Section 10.2).
Figure 12-1: Removal of Lowest Intensity Peaks (one panel per dataset: 0119, 0123, 0220 and 0218; x-axis: num peaks left)
At the start of the algorithm, the peaks that are removed are largely noise peaks. As the
algorithm progresses, it begins to remove real peaks, but the performance of the algorithm
is unaffected because these peaks are extra and expendable; enough redundancy remains in
the dataset so that the correct sequence continues to be found. When the algorithm begins
to remove more critical peaks, the remnant is insufficient and incapable of supporting the
correct sequence.
Before this point is reached, the correct sequence usually abdicates its title of best scoring
sequence to a sequence that bears some similarity to the correct one (for example,
RPPGFSPRF in the case of 0220 and DRVYIHPHFI in the case of 0123). In Figure 12-1,
this transition event is indicated by a shift from the use of "+" to "*"¹. Most datasets
degenerated into this condition when there were fewer than 20 peaks left. Another way to view
these results is that roughly 20 of the highest intensity peaks were sufficient for sequence
recovery of these specific peptides.
Figure 12-1 sheds some light on the question of how much loss a dataset can tolerate
before the real sequence is no longer optimal. Examination of the fragment types present
in the dataset right before the "+" to "*" transition reveals the presence of several gaps - for
example, the 15 peaks of 0123 have gaps at break positions 1,7 and 9; the 16 peaks of 0220
have gaps at 3,4,5 and 7. Interestingly enough, the correct sequence is still recoverable, in
some cases because of internal ions (extended family ions) that fill in these gaps and help
resolve ambiguous residue ordering in the complete sequence, and in other cases (gap at
position 7 of 0123), probably because of the limited number of possibilities from constraints
imposed by the fragment types that are present. Fundamental graph approaches, which
often do not make use of internal ion information for candidate generation, rely on other
fragment types and would have difficulty enumerating the real sequence when so many
gaps are present. Note also that the 15 peak 0220 dataset has a huge gap which neither a
dipeptide nor tripeptide edge in a fundamental graph can rescue.
¹ Note, this is for the restricted case. In the case of 0123 and 0121, the unrestricted search produced a
better scoring alternate sequence at dataset size 14 and 16 respectively.
12.2.2 Removing High Intensity Peaks
While a large number of low intensity peaks can be removed before the real sequence is no
longer optimal, only a small number of the highest intensity peaks can be removed before
the search begins to falter, and when it does, it often fails dramatically (see Figure 12-2).
This supports the idea that high intensity peaks tend to be more meaningful than
lower intensity peaks.
12.2.3 Removing Noise Peaks
Figure 12-3 contains the results of the third series of experiments which examines how
performance is affected by noise content. In this experiment, all noise peaks are first removed
from a dataset, and then they are gradually added back in, one at a time, starting with the
most intense.
The performance of the noiseless angiotensin and bradykinin datasets, previously examined
in Table 10.4, did not differ noticeably from the performance of the entire datasets
(Table 10.1), so one would not expect much fluctuation in these plots.
In all cases, the correct sequence remained the optimal scoring sequence, and the datasets
were rather resilient to the addition of noise. The effect of noise peaks is less severe here than
in other approaches where noise peaks could greatly contribute to the size of a fundamental graph,
and to the number of incorrect sequence guesses. It may be worthwhile to devise a means
for assaying just how much noise a dataset can tolerate before becoming completely ineffective
- this would require some controlled means for deciding which noise masses are reasonable
to add and what their intensities should be.
Figure 12-2: Removal of Highest Intensity Peaks (one panel per dataset: 0123, 0119, 0121, 0220 and 0218; x-axis: num peaks left)
Figure 12-3: Removal and Subsequent Addition of Noise Peaks (one panel per dataset: 0123, 0119, 0218, 0121 and 0220; x-axis: num peaks)
12.3 Summary
An expanded set of training data and an additional rule for our model led to some
improvement in prediction, but it became apparent that the next challenge to address is the effect
of longer peptides.
With regard to dataset size, it appears that most low intensity peaks are either noise
or redundant, legitimate peaks while most high intensity peaks are more meaningful and
relevant to sequencing. An extended family member for each break position is necessary
for correct sequencing.
Chapter 13
Conclusions
In this thesis, a new method for de novo sequencing was presented. Like other approaches in
the literature, we presented a means for traversing the space of possible guesses, and a means
for scoring a sequence guess against an input spectrum. Exploration of the search space
was implemented with simulated annealing, an efficient technique for locating the sequence
most likely to have produced the observed data. In order to compute this likelihood, a
probabilistic model for protein fragmentation was proposed and a scoring function was
implemented based on a probability distribution of fragment masses predicted by this model.
Since our approach is a global sequence-to-spectrum strategy, complete sequence guesses
are generated in a manner that is independent of the spectrum, and because of this, it is
less vulnerable to the effects of noise and gaps. With fundamental graph approaches, noise
peaks can be mistakenly interpreted as supporting peaks; such false positives can elevate
the standing of competing sequence guesses. Candidate generation in sequence-to-spectrum
approaches is not as affected by gaps in the data - there is no fear of pruning away the real
sequence due to under-representation, and because guess generation is not dependent on
the existence of paths through a graph or the presence of supporting experimentals, these
approaches can generate sequences that fundamental graph approaches will not.
Our scoring function is based upon a probabilistic model of fragmentation. Other approaches
in the literature make use of fragment type probabilities, and to our knowledge,
only [DAC+99b, DAC+99a] have tried using a more formal framework. Fragment probabilities
appearing in the literature are often either empirically or arbitrarily determined, and
are independent of each other - e.g. an Aion appears with some probability, a Bion with
another and so on, and these probabilities do not sum to 1 [DAC+99b, FdCGS+99]. Our
probabilities are empirically determined by fitting parameters to training data, and they
are not independent - a single molecule can fragment to produce either an Aion or some
other ion, but the sum of the probabilities of all possible outcomes is 1.
Because of this, our scoring function not only rewards matches (which all other approaches
do), but naturally penalizes unmatched experimentals (which some other approaches do)
and even unmatched theoreticals. The PMF nicely handles multiple identity peaks: when
there are several legitimate explanations for a single peak, the probability of this peak
is the sum of the probabilities of each identity. Peak intensities are also nicely accounted
for by viewing the input spectrum as a histogram of outcomes of repeated trials.
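The two ideas in this paragraph, summing the probabilities of multiple identities and treating intensities as counts of repeated trials, can be sketched in a few lines of Python. This is a simplified illustration and not the scoring function of this thesis: the unit-width mass bins, the small floor probability for unexplained peaks, and all names are assumptions made for the example.

```python
import math
from collections import defaultdict


def build_pmf(theoretical, bin_width=1.0):
    """Merge (mass, probability) identities into bins; shared bins sum their probabilities."""
    pmf = defaultdict(float)
    for mass, probability in theoretical:
        pmf[round(mass / bin_width)] += probability
    return dict(pmf)


def neg_log_likelihood(spectrum, pmf, floor=1e-6, bin_width=1.0):
    """Treat each intensity as a count of repeated trials; lower scores are better."""
    total = 0.0
    for mass, intensity in spectrum:
        probability = pmf.get(round(mass / bin_width), 0.0)
        # Unexplained experimental peaks fall back to a small floor probability.
        total -= intensity * math.log(max(probability, floor))
    return total


# Toy spectrum of (mass, intensity) pairs.
toy_spectrum = [(100.0, 10.0), (250.3, 5.0)]

# Guess A: two identities share the bin near mass 100, so their probabilities add.
pmf_a = build_pmf([(100.1, 0.3), (100.2, 0.2), (250.0, 0.5)])
# Guess B: spends probability mass on a theoretical peak that is never observed.
pmf_b = build_pmf([(100.1, 0.3), (250.0, 0.3), (400.0, 0.4)])

score_a = neg_log_likelihood(toy_spectrum, pmf_a)
score_b = neg_log_likelihood(toy_spectrum, pmf_b)
print(score_a, score_b)
```

Because each PMF sums to 1, probability assigned to an unobserved theoretical peak is necessarily taken away from the observed ones, which is how the penalty for unmatched theoreticals arises without a separate term.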
13.1 Room for Improvement
We tried to keep our approach as simple as possible, but there are extensions that can be
made to the model, search and data to improve the algorithm.
13.1.1 Improvements in the Data
Advances in Mass Spectrometry
Progress is continually made in mass spectrometry. With increased accuracy and decreased
mass fluctuation, spectra with cleaner and clearer peaks would enable better peak identification and labeling. An increase in range capability would allow for the analysis of larger
molecules.
Advances in Peptide Preparation and Data Acquisition
Alternate methods for obtaining spectra may yield better spectra. There are many variations
to the recipe for generating spectra - use of different ionization methods, matrices,
peptide concentrations, laser intensities and chemical treatments of the analyte prior to
spectra acquisition may lead to data that is more bountiful in relevant peaks. The choice of
method and technique affects the types of peaks that are obtained. Future techniques may
be developed that increase the intensity and presence of helpful peaks, while minimizing
noise and other peaks which confuse and misdirect sequence reconstruction.
13.1.2 Improvements in the Model
Advances in Understanding Peptide Fragmentation
Since an effective correlation function requires knowledge of the fragmentation process [SZK95],
as more of the fragmentation process is elucidated, the model can be updated to better simulate peptide fragmentation, resulting in more accurate PMFs and improved scores.
There are quite a number of rules that have not been incorporated into our model. For
example, break probabilities depend on a combination of other factors such as break position
and fragmentation tendencies/proton affinities of the residues involved. Certain dipeptides
such as Arg-Lys and His-His [OSTV95], and the amide bonds of the dipeptides Asp-Pro,
Asp-X, Glu-X [KCKS96] exhibit certain affinities that could be captured in the matrix.
But there are also quite a number of rules that are not known. Are there other fragmentation
events that occur? What are the relevant factors? Are parameters constant or subject to
some distribution? The better we can approximate Nature's chemical/physical rules, the
better the model will be.
Training and Validating with More Data
A larger training set, more representative of different types of peptides, would be essential
for proper training of the model. Additional parameters, and finer granularity in
parameter values, may also prove helpful.
A larger validation set would also help demonstrate the effectiveness of our approach.
Handling Modifications and Different Terminal Groups
Our approach is currently not equipped to handle residue modifications or other terminal
groups. Known modifications (for a list of some of them, see Table III of [Yat96]) can be
handled by considering the modified residue simply as another amino acid. A walk through
the peptide search space would then include peptides involving these modifications, and
any relevant fragmentation rules would need to be programmed into the model and PMF
generation routines. Different terminal groups can be handled by parameterizing the mass
of the N- and C-terminal groups in all mass calculations.
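As a minimal sketch of this parameterization, the following Python function computes a singly protonated parent mass with the terminal-group masses passed in as arguments, and treats a modified residue as one more entry in the residue table. The residue masses come from Table A.1. The function name, the lower-case symbol s for phosphoserine and the amidated C-terminus are illustrative assumptions and are not taken from our Java implementation.

```python
# Monoisotopic residue masses from Table A.1 (only the residues used below).
RESIDUE_MASS = {
    "D": 115.02695, "R": 156.10112, "V": 99.06842, "Y": 163.06333,
    "I": 113.08407, "H": 137.05891, "P": 97.05277, "F": 147.06842,
    "L": 113.08407, "S": 87.03203,
}

H = 1.007825       # hydrogen atom, the usual N-terminal group
OH = 17.002740     # hydroxyl group, the usual C-terminal group
NH2 = 16.018724    # amide group, for an amidated C-terminus (assumed example)
PROTON = 1.007276  # charge carrier for M+H


def parent_mass(sequence, n_term=H, c_term=OH, residues=RESIDUE_MASS):
    """Return M+H for `sequence` with explicit terminal-group masses."""
    return sum(residues[r] for r in sequence) + n_term + c_term + PROTON


# A known modification is handled as just another building block.
# 's' is a hypothetical symbol for phosphoserine: serine plus HPO3 (79.96633).
EXTENDED = dict(RESIDUE_MASS, s=RESIDUE_MASS["S"] + 79.96633)

angiotensin = parent_mass("DRVYIHPFHL")
amidated = parent_mass("DRVYIHPFHL", c_term=NH2)
phospho_dipeptide = parent_mass("sL", residues=EXTENDED)
```

Because the terminal masses are ordinary arguments, the search and PMF routines would not need separate code paths for each terminal group; only the fragmentation rules specific to a modification would still have to be added to the model.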
13.1.3
Improvements in the Search
Aside from further optimization of the searching parameters, there are other aspects of the
search that can be improved. For example, it may be possible to implement an efficient
incremental scoring method so that as the search makes a move from one sequence to
another, the entire PMF for the new sequence doesn't have to be recomputed from scratch.
Move Strategies
The three different sequence moves currently occur with equal probability; selecting
them according to some non-uniform distribution may be beneficial.
To save time, our implementation precomputes all dipeptides and tripeptides, and stores
them by mass. The substitute move uses these as building blocks (instead of building a
replacement sequence one residue at a time).
Dr. Ting Chen suggested going one step
further and precomputing all amino acid combinations for every possible sequence mass up
to the parent ion. This basically amounts to a "complete fundamental graph" where all
nodes have outgoing edges for every possible residue so that all paths from the base node
to a particular fundamental node with mass n represent all possible sequences of mass n.
Choosing one of these paths randomly would be akin to finding a sequence of mass n when
making an unrestricted substitute move.
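The path-counting idea can be sketched with a small dynamic program. The version below is only an illustration under simplifying assumptions: it uses integer nominal masses, keeps one symbol for each isobaric pair (L for I/L, Q for K/Q), and the function names are hypothetical. It counts the sequences of every mass up to n and then draws one sequence of mass n uniformly at random by walking the table backwards.

```python
import random

# Nominal (integer) residue masses; I and K are omitted because they share
# a nominal mass with L and Q respectively (a simplification of this sketch).
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "Q": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def count_table(n):
    """counts[m] = number of residue sequences whose nominal masses sum to m."""
    counts = [0] * (n + 1)
    counts[0] = 1  # the empty sequence
    for m in range(1, n + 1):
        counts[m] = sum(counts[m - w] for w in NOMINAL.values() if w <= m)
    return counts


def sample_sequence(n, rng=random):
    """Draw a sequence of nominal mass n uniformly from all such sequences."""
    counts = count_table(n)
    if counts[n] == 0:
        return None  # no sequence has this mass
    residues = []
    m = n
    while m > 0:
        # Choose the last residue in proportion to the number of ways
        # of completing the remaining mass.
        options = [(r, w) for r, w in NOMINAL.items() if w <= m and counts[m - w] > 0]
        weights = [counts[m - w] for _, w in options]
        r, w = rng.choices(options, weights=weights)[0]
        residues.append(r)
        m -= w
    return "".join(reversed(residues))
```

The table is built once per parent mass, so each subsequent draw costs time proportional to the length of the sampled sequence times the alphabet size.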
The length of the subsequence that a move affects can also be made a function of the
temperature so that at lower temperatures, shorter subsequences are changed.
An Educated Initial Sequence Guess
One might imagine a situation where the initial guess is not randomly chosen, but is intelligently selected based on some relevant information. For example, one might use the best
scoring sequence guess obtained from some fundamental graph approach.
Or one might
consult the immonium ions or some other feature of the spectrum, and specify a pool of
possible residues to draw from when constructing this guess.
With an intelligent initial
guess, the search is "much farther along" than a search that starts with a random sequence,
and could be made faster by starting at a lower initial temperature.
Too high an initial
temperature would cause the algorithm to make a move to a random sequence, throwing
away any benefits afforded by an intelligent non-random initial guess. (On the other hand,
too low an initial temperature is also undesirable since it impedes the ability of the algorithm to explore the search space.) Such an optimization may be possible depending on the
structure of the scoring function and the specific application [Mit00], and there has been
some study of initial temperature settings [RKW88], but no theoretical result of a general
nature [Mit00].
Prediction of Correct Length
With restricted searches, we partitioned the large peptide sequence space into smaller disjoint subspaces and used length-preserving moves. This was an effort to see if the global
optimum could be found more reliably by reducing the size of the search space without
changing scheduling parameters. Since the range of possible lengths is large (from about
(M+H)/m(W) residues, if every residue were tryptophan, to about (M+H)/m(G) residues, if
every residue were glycine), if
there were some way to guess the length of the real sequence, the savings in search time
could be potentially large. Several of the approaches we tried in Appendix E enumerated
promising paths through a graph, sorted by length. It may be possible to use this information as a rudimentary filter for determining which lengths are worth searching.
13.2
Looking Towards the Future: Longer Peptides
Most of the spectra used in this thesis were acquired from short peptides.
For certain
applications, this may be sufficient - assuming each residue is equally likely to appear, and
assuming a protease cleaves after a single particular amino acid, then the expected length of
a proteolytic product is 20 residues. For proteases with specificity pockets for several
side-chains, the expected length of fragments is less than 10 residues - well within the
range of our results.
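The figure of 20 residues follows from a geometric model. The derivation below is a sketch under the same assumptions as the text (residues independent and equally likely), plus the simplification that the protease cleaves after every occurrence of any of its k recognized residues; the symbols k, p and L are notation introduced only for this sketch.

```latex
P(L = \ell) = (1-p)^{\ell-1}\,p, \qquad p = \frac{k}{20},
\qquad
E[L] = \sum_{\ell \ge 1} \ell\,(1-p)^{\ell-1}\,p = \frac{1}{p} = \frac{20}{k}.
```

With k = 1 this gives 20 residues, and with k = 3 it gives about 6.7 residues.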
However, it is useful to consider the effects of longer peptides on sequence guess evaluation
and on peptide space searching.
The experimental spectrum of a longer peptide would
exhibit a larger range of mass values, and if N, the total number of trials, is kept constant,
peak intensities decrease overall as there are fewer ions available for covering the entire spectrum of possible fragments. Low masses would tend to have higher intensities than high
masses as a longer peptide sequence provides more opportunities for arriving at lower mass
fragments.
These features are also mimicked in the theoretical spectrum of the PMF - since the sum of
the fragment probabilities must equal 1, the individual probabilities are likely to be lower
than those of shorter peptides because these are distributed over a greater range of mass
possibilities. Lower masses may exhibit higher probabilities because there may be multiple
ways to arrive at a small mass value so the probabilities aggregate.
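To illustrate the aggregation effect only, the sketch below enumerates every contiguous subsequence of a peptide and pools them by summed residue mass. Giving each subsequence the same weight is a deliberate simplification and is not the PMF of our model; the function name and the rounding to two decimals are likewise assumptions of this example.

```python
from collections import defaultdict

# Monoisotopic residue masses from Table A.1 (residues of angiotensin I).
RESIDUE_MASS = {
    "D": 115.02695, "R": 156.10112, "V": 99.06842, "Y": 163.06333,
    "I": 113.08407, "H": 137.05891, "P": 97.05277, "F": 147.06842,
    "L": 113.08407,
}


def pooled_fragment_weights(sequence, decimals=2):
    """Pool equal weights of all contiguous subsequences by rounded mass."""
    n = len(sequence)
    total = n * (n + 1) // 2  # number of contiguous subsequences, O(n^2)
    pooled = defaultdict(float)
    for start in range(n):
        mass = 0.0
        for end in range(start, n):
            mass += RESIDUE_MASS[sequence[end]]
            pooled[round(mass, decimals)] += 1.0 / total
    return pooled


weights = pooled_fragment_weights("DRVYIHPFHL")
```

For this 10-residue peptide there are 55 subsequences, so a single subsequence carries a weight of 1/55, while the mass of a lone histidine, which occurs twice, carries 2/55. A longer peptide increases the denominator for every entry, which is the dilution described above.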
What of their effect on the search? Unrestricted searches were not as successful as their
restricted counterparts because a much larger space had to be traversed and without a comparable increase in search time, the correct answer was less likely to be found. Larger peptides
only exacerbate the problem, and if it were only a problem of a vast search space, one could
simply adopt an extremely slow cooling schedule.
But as we saw with the AIEAQQHLLQLTVWGIK
peptide from Chapter 11, other problems develop. Namely, alternate sequences begin to score better, especially longer ones as they tend to be able to account
for more peaks and currently, no penalty is imposed based on guess length. Predictions
that resemble the real sequence but have short substitutions of one subsequence for another
(e.g. SV for W) may be common occurrences because of the increased chance that some
peak may happen to accidentally support a break position within the substitution. Solutions
that might help are a model that uses some sort of minimum description length-based measure, and possibly, input datasets with large signal to noise ratios (so that even if alternate
interpretations exist, they do not outscore the correct one).
13.2.1
Effect of Isotopes
With longer peptides, the presence of isotopes will also affect mass computations.
This
effect has been ignored so far in our discussions because we have been working with short
peptides.
Let I(n) represent the isotope distribution of a molecule: the
probability that the molecule contains n extra neutrons due to isotopes. Let the "isotope
contribution" be the value of n for which I achieves its maximum.
For short peptides, the non-isotope form, the case of no extra neutrons, is most probable
and hence, the isotopic contribution is 0. The mass of the peptide is exactly the mass of its
constituent atoms. One could avoid isotope issues by limiting sequencing to short peptides
only.
For longer peptides, this no longer holds. With enough atoms, the occurrence of an isotope
is more likely, and the isotopic contribution is non-zero. Consequently, some adjustment
needs to be made for the corresponding shift in mass. Gras et al. account for this effect
in their peak detection algorithms [GMG+99], but otherwise, it seems this has been largely
ignored in the literature.
The peptides used in this thesis had few enough nucleons¹ that it was safe to disregard
the effect of isotopes. However, our peptides are at the fringes of the mass region where
the appearance of an isotope begins to become more probable. Mass calculations in this
region and beyond should include a mass offset equal to the isotope contribution to account
for the most likely number of extra neutrons. Our estimate of average fusion loss from
Section 7.1.6, currently weighted by amino acid frequency, may also have to be weighted by
the isotopes of these residues.
¹ For a molecule, its isotopic distribution I can be estimated as follows: take the isotopic distribution of
each atom (which is known) and convolve them for every instance of each atom in the peptide.
How many nucleons does it take to shift the isotopic contribution from 0 to 1? We approximate this
as follows: (1) for each residue r, compute n_r, the number of instances of the residue necessary for I(1)
to exceed I(0), the non-isotope probability (e.g. with 38 glycines, one expects to find the mass to be
2185 rather than 2184 = 38*57 + m(H) + m(OH)), (2) the isotope contribution per nucleon of the residue is
then 1/(n_r (N_p + N_n)), where N_x denotes the number of x particles in residue r, (3) the weighted sum of these
is computed, weighted by residue frequency, and finally, (4) the inverse is taken to arrive at the number of
nucleons necessary to achieve a weighted isotope contribution of 1. It takes about 1867 nucleons to shift the
isotopic contribution from 0 to 1.
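The convolution estimate of the isotope distribution can be sketched as follows. The natural abundances in the table are approximate reference values and are an assumption of this example, as are the function names and the truncation of each distribution to its first ten entries (the low-order entries stay exact under truncation). Depending on the abundance table used, the residue count at which I(1) first exceeds I(0) can differ by one from the figure of 38 glycines quoted above.

```python
# Approximate natural isotope abundances, indexed by number of extra neutrons.
# These values are assumed for illustration; they are not from the thesis.
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
}


def convolve(a, b, limit=10):
    """Convolve two distributions, keeping only the first `limit` entries."""
    out = [0.0] * min(limit, len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < len(out):
                out[i + j] += x * y
    return out


def distribution(formula, limit=10):
    """Isotope distribution I(n) of a molecule given as {element: count}."""
    dist = [1.0]
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element], limit)
    return dist


def isotope_contribution(dist):
    """The number of extra neutrons n at which I(n) is largest."""
    return max(range(len(dist)), key=dist.__getitem__)


def residues_needed(residue_formula, limit=10, max_n=500):
    """Smallest number of identical residues for which I(1) exceeds I(0)."""
    residue = distribution(residue_formula, limit)
    dist = distribution({"H": 2, "O": 1}, limit)  # terminal H and OH
    for n in range(1, max_n + 1):
        dist = convolve(dist, residue, limit)
        if dist[1] > dist[0]:
            return n
    return None


GLYCINE = {"C": 2, "H": 3, "N": 1, "O": 1}  # glycine residue, C2H3NO
n_gly = residues_needed(GLYCINE)
```

Repeating `residues_needed` for each residue and combining the results by residue frequency would follow steps (2) to (4) of the estimate; that part is omitted here.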
13.3
Summary
Current de novo sequencing approaches exhibit limited success in solving the sequencing
puzzle. We have proposed and designed a de novo sequencing algorithm with the following
properties in mind:
Performance Our Java implementation currently takes about 10 minutes to issue a prediction for the angiotensin datasets (using the restricted search with parameter settings
from Section 8.6) on a dual processor Pentium III 500MHz machine running Linux.
The running time varies with machine load, search parameters and dataset size, and
increases with peptide length. Some simulations were also run on two other machines:
a 400MHz Pentium II PC running Linux and a 550MHz Pentium III PC running
Linux. An optimized C implementation should run faster.
Robustness Because candidate generation is independent of the spectrum, sequence-to-spectrum approaches are slightly more tolerant of noise and of gaps in particular.
Prediction is still possible if internal ions serve as extended family members at gap
positions.
Scalability Since the PMF computation is quadratic in the length of the peptide (there
are O(n^2) possible fragments), and the computation of the score is linear in the size
of the dataset, the complexity of the scoring module is polynomial in the length of
the peptide.
When the model is good, the real sequence scores optimally but may be hidden
amongst competing local extrema that are all embedded within a vast search space
whose size is exponential in the length of the peptide. This is exactly the type of
optimization problem suitable for simulated annealing, which performs an efficient
walk of the space according to a set of search parameters. Currently, two of the parameters, nlimit and nover, are directly proportional to the parent mass. We have
not adequately studied the effects of longer peptides; other parameters
may need to depend on the parent mass as well.
In our investigations, the
performance of longer peptides seemed to suffer because of the model rather than the
search.
Reliability Our approach finds the answer that maximizes the probability of the observed
spectrum, and it is built on a simple probabilistic framework for reasoning about the
likelihoods of sequence guesses.
Comprehensiveness As we saw earlier in Section 7.2, some scoring functions do not take
into account all possible supporting fragment types, namely the internal ions. There
is no aspect of our approach that only makes use of a subset of the input spectrum.
By virtue of the scoring function, the entire input spectrum, peaks of all heights and
of all hypothetical identities, is taken into account.
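For readers unfamiliar with the search loop referred to under Scalability, the following is a generic simulated annealing sketch, not our implementation. It assumes the usual textbook meaning of the two parameter names, with nover capping the moves attempted and nlimit capping the moves accepted at each temperature; the geometric cooling schedule, the toy scoring function (number of positions matching a hidden target) and the single-residue substitution move are all stand-ins chosen to keep the example self-contained.

```python
import math
import random


def anneal(initial, score, propose, t0=1.0, cooling=0.9, steps=50,
           nover=200, nlimit=20, rng=None):
    """Maximize `score` by simulated annealing; return (best, best_score)."""
    rng = rng or random.Random()
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t0
    for _ in range(steps):
        successes = 0
        for _ in range(nover):           # at most nover attempted moves
            candidate = propose(current, rng)
            cand_score = score(candidate)
            delta = cand_score - current_score
            # Always accept improvements; accept worse moves with
            # probability exp(delta / t).
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, cand_score
                successes += 1
                if current_score > best_score:
                    best, best_score = current, current_score
            if successes >= nlimit:      # at most nlimit accepted moves
                break
        if successes == 0:               # nothing accepted: stop early
            break
        t *= cooling
    return best, best_score


# Toy problem standing in for the real scoring module.
TARGET = "DRVYIHPFHL"
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    return sum(a == b for a, b in zip(seq, TARGET))


def substitute_one(seq, rng):
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]


start = "A" * len(TARGET)
best, best_score = anneal(start, toy_score, substitute_one, rng=random.Random(1))
```

In this sketch, scaling nover and nlimit with the parent mass would simply mean passing larger values for heavier peptides; whether other parameters, such as the cooling rate, should scale as well is the open question raised above.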
We have described an algorithm that takes as input a tandem mass spectrum and the parent
mass, and given a finite set of residue building blocks, predicts the parent sequence using
a model that makes certain assumptions about the fragmentation process. This approach
may also be applicable to the general linear sequencing problem for synthetic polymers and
other biopolymers such as DNA by identifying the set of building blocks, capturing the
MALDI-PSD fragmentation patterns in a relevant model and implementing an appropriate
simulated annealing search specification.
Nature obeys certain fragmentation rules and we have endeavored to capture its rules in a
simple probabilistic model. If this model is good and the data is sufficient, then the real
sequence scores optimally (Appendix F) and a simulated annealing search under the right
conditions will find it. Investing in improvements in the model is the more immediate need
and the most promising next goal. While there is still much room for improvement, our
approach is one step in a direction that deserves further development and study.
Appendix A
Amino Acid Information
[Figure: residue backbone N-C-C(=O), with the side-chain R attached to the α-carbon]
Figure A-1: Basic Residue Structure: The side-chain R of a residue hangs off of the α-carbon. An amide bond joins the α-carboxyl group of one residue to the α-amino group of
an adjoining residue, polymerizing multiple basic residues into a peptide.
Residue   Frequency   Monoisotopic Mass
A         .076         71.03712
C         .0189       103.00919
D         .0521       115.02695
E         .0632       129.04260
F         .0397       147.06842
G         .0719        57.02147
H         .0228       137.05891
I         .0529       113.08407
K         .0581       128.09497
L         .0917       113.08407
M         .0229       131.04049
N         .0436       114.04293
P         .052         97.05277
Q         .0417       128.05858
R         .0523       156.10112
S         .0715        87.03203
T         .0587       101.04768
V         .0649        99.06842
W         .0131       186.07932
Y         .0321       163.06333
Table A.1: Basic Residues, their Frequencies and Masses
Appendix B
Experimental Methods
Peptide samples for angiotensin I and bradykinin were prepared and data was acquired
in-house. This appendix describes the process used to acquire PSD spectra for angiotensin
I. The procedure is exactly the same for bradykinin except appropriate concentration modifications were made to account for the difference in mass.
B.0.1
Sample Preparation
Materials
• matrix: α-cyano-4-hydroxy-cinnamic acid, C10H7NO3, Sigma Chemical Company (St Louis, MO, USA) [28166-41-8],
• solvent: 70% CH3CN in Q-H2O + 0.1% TFA (100 µl TFA + 70 ml acetonitrile + enough Q-H2O to make 100 ml),
• peptide analyte: angiotensin I, Sigma Chemical Company (St Louis, MO, USA) [70937-97-2],
• 1.5 ml microfuge tubes,
• Mettler weigher and spatula,
• sample spotting gold plate (#5-2204-00-0002 Sample Plate/Polished surface).
Peptide samples were used directly out of their commercial packaging without any purification
or further processing, and because of the mass range of these peptides, α-cyano-4-hydroxy-cinnamic acid was chosen as the matrix (works best for 500-5000 Da [BCC91]).
Preparation of Matrix and Analyte
Using a Mettler weigher, 0.0148 g of desiccated α-cyano-4-hydroxy-cinnamic acid was weighed
into a 1.5 ml microfuge tube, then 1.48 ml of (70% CH3CN in Q-H2O + 0.1% TFA) was added
to produce a matrix concentration of 10 mg/ml.
A 12.97 mg/ml solution of angiotensin was made using 0.0011 g of desiccated angiotensin and
84.8 µl of (70% CH3CN in Q-H2O + 0.1% TFA). Since 1 mole of angiotensin weighs 1296.5 g,
this sample was approximately 10 nmol/µl. To obtain a desired 10 pmol/µl concentration
for MALDI-PSD, we performed three serial dilutions, each time diluting 10 µl of the sample
with 90 µl of (70% CH3CN in Q-H2O + 0.1% TFA).
Finally, we combined 2 µl of angiotensin (10 pmol/µl) with 2 µl of matrix, so that the final
concentration of the peptide sample was 5 µM. A 1 µl aliquot of this was spotted onto the
sample plate, air dried at the ambient temperature and placed into the source area of the
mass spectrometer for spectral acquisition.
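The concentration arithmetic in this procedure can be checked with a few lines of Python. The quantities are the ones quoted in the text; the variable and function names are chosen for this example only.

```python
def mg_per_ml(mass_g, volume_ul):
    """Concentration in mg/ml from a mass in grams and a volume in microliters."""
    return (mass_g * 1000.0) / (volume_ul / 1000.0)


MOLAR_MASS = 1296.5  # g/mol for angiotensin I, as quoted in the text

matrix_mg_per_ml = mg_per_ml(0.0148, 1480.0)
stock_mg_per_ml = mg_per_ml(0.0011, 84.8)

# mg/ml is the same as ug/ul; dividing by g/mol (= ug/umol) gives umol/ul,
# and multiplying by 1000 converts to nmol/ul.
stock_nmol_per_ul = stock_mg_per_ml / MOLAR_MASS * 1000.0

# Three serial 1:10 dilutions (10 ul sample + 90 ul solvent each time),
# expressed in pmol/ul (1 nmol = 1000 pmol).
diluted_pmol_per_ul = stock_nmol_per_ul * 1000.0 / 10 ** 3

# Mixing 2 ul of sample with 2 ul of matrix halves the concentration.
final_pmol_per_ul = diluted_pmol_per_ul * 2.0 / (2.0 + 2.0)
final_uM = final_pmol_per_ul  # 1 pmol/ul equals 1 umol/l
```

The computed values agree with the rounded figures in the text: roughly 10 mg/ml of matrix, a 12.97 mg/ml stock at about 10 nmol/µl, and a final peptide concentration of about 5 µM.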
B.0.2
Data Collection
Spectral data was acquired on a PerSeptive Biosystems Voyager Elite, later upgraded to
a Voyager DE STR, at the MIT Whitehead Institute.
An N2 laser produces a 337.1 nm
wavelength output at a pulse rate of 3.01 Hz, and a 4 mm² view of the sample plate and
laser illumination area can be seen on a color video monitor (Hitachi, model CT1396VM).
A 1129 mm long flight tube directs ionized fragments to a detector, and a digitizer scope,
model TDS520B, displays the growing spectrum of recorded collisions with the detector.
This spectrum can then be downloaded to an IBM compatible computer running GRAMS
software.
At the start of each PSD acquisition run, a calibration file was created by performing a one
point calibration on the parent ion¹ with the mass spectrometer in PSD mode (mirror ratio
of 1.00, low mass gate off, and timed ion selector on). This calibration file is used for the
collection of snapshots of small overlapping mass ranges. These stitches are linearly overlaid
by the spectrometer computer software to arrive at the final desired PSD spectrum. Piecemeal spectral concatenation is a result of the fact that a particular mirror ratio is only
capable of properly focussing a particular range of mass fragments. For this reason, a
complete PSD run consisted of collecting stitch data for several mirror ratios, e.g. 1.0,
0.9126, 0.6049, 0.4125, 0.2738, 0.1975, 0.1213, 0.0859, 0.0674 and 0.0566.
Once the PSD composite is successfully created, the spectrum is displayed on the computer
monitor, and peaks may be selected for inclusion in the experimental dataset to be used as
input to a sequencing algorithm.
¹ When obtaining spectra for a peptide other than angiotensin, angiotensin was used as an internal
calibrant.
Appendix C
Experimental Data
Four PSD datasets for angiotensin and two for bradykinin were collected. A cutoff intensity was determined by visual inspection, and all peaks with intensities greater than this
threshold were selected. Additional peaks of lower intensity were also selected if they were
well-defined and sharp. Each dataset is comprised of these selected peaks, which are listed
below, along with each peak's intensity, checkpoint, distance from checkpoint and identity.
Some statistics are also compiled for each of the datasets and these appear in Table C.7.
Note that in the 0218 dataset, when the experimentals are converted to checkpoints, the
intensities of two peaks, masses 71.1355 and 71.3207, are added together since they both
resolve to the same checkpoint. Examination of the spectra reveals that it is not the case
that there are two clean peaks at these masses, rather, there is a single jagged peak with
ridges, so it is likely to be the fault of the labelling software (in that it incorrectly interpreted
a ridge as a peak, and may have done so imprecisely) and/or the operator (in that he/she
elected to have the extra ridge labelled). A similar situation occurs for peaks 42.6474 and
43.0738 in dataset 0220.
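The merging step can be made concrete with the dataset 0220 peaks mentioned above. In the sketch below the experimental masses, intensities and checkpoints are transcribed from the dataset 0220 listing, and the checkpoints are taken as given rather than recomputed; the function names are assumptions of this example.

```python
# (experimental mass E, intensity, checkpoint C) for three peaks of dataset 0220.
PEAKS_0220 = [
    (38.9917, 772, 39.0206),
    (42.6474, 380, 43.0228),
    (43.0738, 363, 43.0228),
]


def merge_by_checkpoint(peaks):
    """Sum the intensities of all peaks that resolve to the same checkpoint."""
    merged = {}
    for _, intensity, checkpoint in peaks:
        merged[checkpoint] = merged.get(checkpoint, 0) + intensity
    return merged


def delta_masses(peaks):
    """Delta mass C - E for each peak, as listed in the dataset tables."""
    return [checkpoint - mass for mass, _, checkpoint in peaks]


merged = merge_by_checkpoint(PEAKS_0220)
deltas = delta_masses(PEAKS_0220)
```

Here the two peaks near mass 43 collapse into one checkpoint with a combined intensity of 380 + 363 = 743, while the peak at 38.9917 is left unchanged.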
Entries in these tables, fragment identities in the data tables and the statistics of Table C.7,
are based on a fragmentation model (from Chapter 8) that includes the refinement
of Section 12.1.2. Table entries enclosed in parentheses indicate the identities/values that
would result when using a model without this refinement. Differences occur because in the
refined model, the Am18, Ap18, Ym18 and Yp18 variant ions are not allowed.
C.1
Dataset Peaks and Peak Identities
Experimental
Mass(E)
Intensity
Checkpoint(C)
39.0087
375
239
39.0206
50.0265
70.1586
72.1756
86.2138
574
70.0371
72.0382
86.0456
110.2028
113.2269
115.2049
5914
253
49.9061
136.1618
138.1603
156.1609
166.2488
207.3252
212.2403
213.2452
214.2558
217.2618
223.3004
230.2403
235.2327
237.2398
245.2347
249.3441
251.2606
255.2284
256.2551
257.2591
263.1501
269.1432
270.1365
272.0874
279.1152
285.1174
__________________________________
512
480
297
941
464
502
362
311
293
268
154
449
563
812
1523
210
261
302
692
3889
110.0583
113.0599
115.0610
136.0721
212.1124
213.1130
214.1135
217.1151
223.1183
230.1220
-0.1607
-0.2153
-0.1278
-0.1321
-0.1422
-0.1466
-0.1820
-0.1182
235.1247
-0.1079
237.1257
-0.1140
-0.1046
-0.2119
-0.1274
-0.0930
-0.1192
-0.1227
-0.0105
-4.5465E-4
0.0067
0.0569
0.0328
0.0338
245.1300
249.1321
251.1331
255.1353
256.1358
257.1363
2558
272.1443
279.1480
I _______________________ .11
-0.0870
-0.0781
138.0732
156.0827
166.0880
207.1098
747
540
1789
8976
1654
866
1635
Delta
Mass(C-E)
0.0119
0.1204
-0.1214
-0.1373
-0.1681
-0.1444
-0.1669
-0.1438
-0.0896
263.1395
269.1427
270.1432
285.1512
Identity
M:YA P, (P:Aionml8 D)
M:YA V
M:YA I
M:YA H
M:YA Y
M:YB H
M:YBp18 H
I:YA HP
I:YA PF
I:YA IH
I:YA VY, I:YB HP
I:YB PF
I:YA YI
I:YB IH
P:Bionml7 DR
I:YB RV
I:YA FH
I:YB VY
S:Yion HL, I:YBp18 IH
P:Bion DR
I:YB FH
Experimental                                     Delta
Mass(E)       Intensity   Checkpoint(C)   Mass(C-E)   Identity
303.1342        798       303.1607         0.0265     I:YBp18 FH
313.0369       1085       313.1660         0.1291
326.1207       1346       326.1729         0.0522     P:Aionm17 DRV
329.0269       1149       329.1745         0.1476
337.0419       1044       337.1788         0.1369
343.0863       1230       343.1820         0.0957     P:Aion DRV
354.035       12208       354.1878         0.1528     P:Bionm17 DRV, I:YA HPF
364.3554       1542       364.1931        -0.1622
371.3498        847       371.1968        -0.1529     P:Bion DRV
382.3082       2630       382.2027        -0.1054     I:YB HPF
400.3986       1237       400.2122        -0.1863     I:YBp18 PFH
414.369         971       414.2196        -0.1493     I:YB YIH
416.2967       1824       416.2207        -0.0759     S:Yion FHL
426.276        1226       426.2260        -0.0499
489.2467       1175       489.2594         0.0127     P:Aionm17 DRVY
506.2331       2377       506.2685         0.0354     P:Aion DRVY
513.2342       4542       513.2722         0.0380     S:Yion PFHL, I:YB VYIH
517.196        4430       517.2743         0.0783     P:Bionm17 DRVY
527.5279        730       527.2796        -0.2482
534.2053       1947       534.2833         0.0780     P:Bion DRVY
548.9729        520       549.2913         0.3184
619.6552       3029       619.3284        -0.3267     P:Aion DRVYI
632.4269       1537       632.3353        -0.0915     I:YB IHPFH
641.4564        665       641.3401        -0.1162     I:YA RVYIH
647.5013       2216       647.3433        -0.1579     P:Bion DRVYI
650.4275       1768       650.3449        -0.0825     S:Yion HPFHL, I:YBp18 IHPFH
654.7842        677       654.3470        -0.4371
739.6634        994       739.3921        -0.2712     P:Aionm17 DRVYIH
756.3962       4053       756.4011         0.0049     P:Aion DRVYIH
767.3647       1689       767.4070         0.0423     P:Bionm17 DRVYIH, I:YA YIHPFH
784.3812       3857       784.4160         0.0348     P:Bion DRVYIH
1137.8063       332      1137.6033        -0.2029     P:Aion DRVYIHPFH
1166.8656       372      1166.6187        -0.2468
1181.8581       601      1181.6266        -0.2314     S:Yion RVYIHPFHL
1183.0156       612      1182.6272        -0.3883
1183.7039       818      1183.6277        -0.0761     P:Bionp18 DRVYIHPFH
1279.7142       891      1279.6787        -0.0354     M+Hm17 DRVYIHPFHL
1296.7269     16750      1296.6877        -0.0391     M+H DRVYIHPFHL
1300.7974       896      1300.6898        -0.1075
1311.5771       624      1311.6956         0.1185
Table C.1: Angiotensin Dataset: data/012360c/unprependedpeaks
[Spectrum plot: Post Source Decay Analysis; stitch factors 0.600 - 1.010; x-axis: Mass (m/z)]
Figure C-1: PSD for 0123 Angiotensin
Experimental
Mass(E)
66.597
70.141
72.1589
86.1969
110.2144
112.2346
113.2549
115.1613
136.182
138.1471
156.199
166.175
212.1768
223.2092
230.1264
235.1531
251.1221
255.0997
263.1244
269.1443
272.1034
285.0768
326.169
343.0809
354.063
364.1417
382.1243
416.2279
426.272
489.2264
506.172
513.2287
517.1144
534.3137
619.6771
632.516
647.818
650.583
Intensity
Checkpoint(C)
278
627
541
67.0355
0.4385
70.0371
72.0382
86.0456
-0.1038
-0.1206
-0.1512
-0.1560
-0.1751
-0.1949
-0.1002
-0.1098
516
6525
296
221
164
462
191
226
320
278
366
110.0583
112.0594
113.0599
115.0610
136.0721
138.0732
156.0827
166.0880
521
836
433
2160
212.1124
223.1183
230.1220
235.1247
251.1331
255.1353
452
263.1395
2359
860
269.1427
272.1443
285.1512
326.1729
343.1820
676
562
578
3519
491
744
440
487
752
2152
416.2207
1616
721
995
740
720
730
M:YA P, (P:Aionml8 D)
M:YA V
M:YA I
M:YA H
M:YAm17 R
M:YA Y
M:YB H
-0.0738
M:YBp18 H
I:YA IH
0.0039
0.1011
0.1248
I:YA VY, I:YB HP
I:YB IH
P:Bionml7 DR
I:YB VY
S:Yion HL, I:YBp18 IH
P:Bion DR
I:YB FH
P:Aionml7 DRV
P:Aion DRV
P:Bionml7 DRV, I:YA HPF
0.0514
0.0784
-0.0071
I:YB HPF
S:Yion FHL
0.0744
364.1931
382.2027
Identity
-0.1162
-0.0869
-0.0643
-0.0908
-0.0043
-0.0283
0.0110
0.0356
0.0151
-0.0015
0.0409
354.1878
849
Delta
Mass(C-E)
426.2260
489.2594
506.2685
513.2722
517.2743
-0.0459
534.2833
619.3284
-0.0303
0.0330
0.0965
0.0435
0.1599
-0.3486
-0.1806
-0.4746
-0.2380
632.3353
647.3433
650.3449
P:Aionml7 DRVY
P:Aion DRVY
S:Yion PFHL, I:YB VYIH
P:Bionml7 DRVY
P:Bion DRVY
P:Aion DRVYI
I:YB IHPFH
P:Bion DRVYI
S:Yion HPFHL, I:YBpl8 IHPF H
741.322
540
741.3932
0.0712
756.262
767.648
784.265
1570
723
1820
756.4011
767.4070
784.4160
0.1391
-0.2409
0.1510
1133.7786
302
1133.6012
-0.1773
1182.7143
472
1182.6272
-0.0870
1184.6494
1253.771
310
302
1184.6282
1253.6649
-0.0211
-0.1060
1296.7507
7326
1296.6877
-0.0629
1311.8763
364
1311.6956
-0.1806
P:Aion DRVYIH
P:Bionml7 DRVYIH, I:YA YIHPFH
P:Bion DRVYIH
M+H DRVYIHPFHL
Table C.2: Angiotensin Dataset: data/011959adata/unprependedpeaks
[Spectrum plot (PerSeptive Biosystems). Method: PDE2000; Mode: PSD; Grid Voltage: 75.000 %; Guide Wire Voltage: 0.01 %; Low Mass Gate: OFF; Mirror Ratio: 1.110]
Figure C-2: PSD for 0119 Angiotensin
Experimental
Mass(E)
69.9885838
109.953654
135.980053
217.015005
223.038243
229.935623
234.930675
250.871086
254.865274
263.180953
269.24055
272.167793
279.068532
285.106881
303.143077
313.067376
326.068144
329.072206
337.008348
343.025774
353.96337
354.951488
364.567033
370.989102
382.421177
400.350111
416.320086
426.27812
473.147628
489.228243
506.149113
513.091947
517.045502
527.238686
534.036839
619.532451
647.383406
664.539414
696.932765
714.445666
740.276771
Intensity
_
Checkpoint(C)
Delta
Mass(C-E)
Identity
70.0371
110.0583
136.0721
0.0485
0.1047
0.0921
217.1151
223.1183
230.1220
0.1001
M:YA P, (P:Aionml8 D)
M:YA H
M:YA Y
I:YA PF
I:YA IH
_
1749
16416
2416
1890
2041
235.1247
0.0800
0.1864
0.1940
251.1331
0.2621
2301
255.1353
263.1395
269.1427
272.1443
279.1480
5475
285.1512
1919
4140
303.1607
313.1660
326.1729
329.1745
337.1788
343.1820
354.1878
355.1883
364.1931
371.1968
382.2027
0.2700
-0.0413
-0.0978
-0.0234
0.0795
0.0443
0.0177
0.0987
3350
5581
2392
13451
4058
20367
6697
6345
4656
3588
5128
39182
10481
4570
2761
7502
3550
6689
3773
1877
4833
10248
18042
17791
4508
8245
7705
6339
1889
1858
2175
3134
0.1048
0.1023
0.1704
0.1562
0.2244
0.2368
-0.3738
0.2077
-0.2184
-0.1378
-0.0993
-0.0520
0.1033
400.2122
416.2207
426.2260
473.2509
489.2594
506.2685
513.2722
517.2743
527.2796
534.2833
619.3284
647.3433
664.3523
697.3698
714.3788
740.3926
0.0312
0.1193
0.1802
0.2288
0.0409
0.2465
-0.2039
-0.0400
-0.1870
0.4370
-0.0667
0.1159
I:YA VY, I:YB HP
I:YB IH
P:Bionml7 DR
I:YB VY
S:Yion HL, I:YBp18 IH
P:Bion DR
I:YB FH
I:YBp18 FH
P:Aionml7 DRV
P:Aion DRV
P:Bionml7 DRV, I:YA HPF
P:Bion DRV
I:YB HPF
I:YBp18 PFH
S:Yion FHL
P:Aionml7 DRVY
P:Aion DRVY
S:Yion PFHL, I:YB VYIH
P:Bionml7 DRVY
P:Bion DRVY
P:Aion DRVYI
P:Bion DRVYI
756.225266
784.09736
927.159077
985.637163
1001.06581
1029.11267
1068.8317
1137.57857
1165.45098
1183.43523
1197.7826
1296.40479
12088
16216
2270
1761
3900
3585
2719
8810
19237
44775
5591
33467
1,.
756.4011
784.4160
927.4919
985.5226
1001.5311
1029.5460
1068.5667
1137.6033
1165.6182
1183.6277
1197.6351
1296.6877
0.1759
0.3186
0.3328
-0.1144
0.4653
0.4333
-0.2649
0.0247
0.1672
0.1925
-0.1474
0.2829
1
P:Aion DRVYIH
P:Bion DRVYIH
I:YBp18 RVYIHPFH
P:Aion DRVYIHPFH
P:Bion DRVYIHPFH
P:Bionpl8 DRVYIHPFH
1
M+H DRVYIHPFHL
Table C.3: Angiotensin Dataset: data/120598b/120598bdata
[Spectrum plot: Post Source Decay Analysis; stitch factors 0.600 - 1.010; x-axis: Mass (m/z)]
Figure C-3: PSD for 1205 Angiotensin
Experimental
Mass(E)
39.0365
Intensity
Checkpoint(C)
478
39.0206
70.1559
72.1845
498
469
380
6786
694
70.0371
72.0382
86.2137
110.1966
111.1807
112.2168
115.161
136.1696
138.1572
156.1484
166.204
207.2945
217.2656
223.2725
230.2349
235.2373
245.2555
251.2266
255.2144
263.1141
269.1331
272.1083
285.1055
303.0332
313.0381
326.0851
343.0954
354.0267
355.0395
364.3496
382.2554
400.2776
416.2982
426.2724
489.1976
306
344
1061
584
576
728
468
783
1023
1655
2585
441
1287
7438
3454
21122
5155
3976
1626
2510
86.0456
110.0583
111.0589
112.0594
115.0610
136.0721
138.0732
Delta
Mass(C-E)
-0.0158
-0.1187
-0.1462
-0.1680
-0.1382
-0.1217
-0.1573
-0.0999
-0.0974
-0.0839
-0.0656
-0.1159
-0.1846
-0.1504
-0.1541
-0.1128
-0.1125
-0.1254
-0.0934
156.0827
166.0880
207.1098
217.1151
223.1183
230.1220
235.1247
245.1300
251.1331
255.1353
263.1395
-0.0790
0.0254
0.0096
0.0360
269.1427
272.1443
285.1512
303.1607
313.1660
0.0457
0.1275
0.1279
3028
2835
326.1729
0.0878
343.1820
25045
354.1878
5483
355.1883
364.1931
382.2027
400.2122
0.0866
0.1611
0.1488
-0.1564
2727
5133
2263
3514
2711
2078
-0.0526
-0.0653
416.2207
426.2260
506.2023
4013
513.2117
8758
489.2594
506.2685
513.2722
517.1686
8603
517.2743
527.2644
1255
534.1615
3114
527.2796
534.2833
-0.0774
-0.0463
0.0618
0.0662
0.0605
0.1057
0.0152
0.1218
Identity
M:YA P, (P:Aionml8 D)
M:YA V
M:YA I
M:YA H
M:YAm17 R
M:YA Y
M:YB H
M:YBp18 H
I:YA HP
I:YA PF
I:YA IH
I:YA VY, I:YB HP
I:YB PF
I:YB IH
P:Bionml7 DR
I:YB VY
S:Yion HL, I:YBp18 IH
P:Bion DR
I:YB FH
I:YBp18 FH
P:Aionml7 DRV
P:Aion DRV
P:Bionml7 DRV, I:YA HPF
I:YB HPF
I:YBp18 PFH
S:Yion FHL
P:Aionml7 DRVY
P:Aion DRVY
S:Yion PFHL, I:YB VYIH
P:Bionml7 DRVY
P:Bion DRVY
577.225
602.4684
619.5363
632.5459
647.4372
669.4731
680.2988
740.3599
756.3997
767.296
784.3165
1183.8495
1279.7243
1296.6713
722
725
3543
2153
2404
818
840
1112
4490
1700
5108
1132
984
24878
577.3061
577.3061
602.3194
619.3284
632.3353
647.3433
669.3550
680.3608
740.3926
756.4011
767.4070
784.4160
1183.6277
1279.6787
1296.6877
0.0811
0.0811
-0.1489
-0.2078
-0.2105
-0.0938
-0.1180
0.0620
0.0327
0.0014
0.1110
0.0995
-0.2217
-0.0455
0.0164
P:Aionml7 DRVYI
P:Aion DRVYI
I:YB IHPFH
P:Bion DRVYI
I:YB RVYIH
P:Aion DRVYIH
P:Bionml7 DRVYIH, I:YA YIHPFH
P:Bion DRVYIH
P:Bionpl8 DRVYIHPFH
M+Hml7 DRVYIHPFHL
M+H DRVYIHPFHL
Table C.4: Angiotensin Dataset: data/01 2170c/unprependedpeaks
[Spectrum plot (PerSeptive Biosystems); x-axis: Mass (m/z)]
File # 1 = C:\MATSU\TLENG\012170C\PSD1P00.MSA
Sample: 70
Collected: 1/21/98 2:03 PM
Comment: angio, bwidth=full
Method: PDE2000
Mode: PSD
Accelerating Voltage: 20000
Grid Voltage: 75.000 %
Guide Wire Voltage: 0.01 %
Delay: 50 ON
Laser: 1790
Scans Averaged: 231
Pressure: 2.33e-07
Low Mass Gate: OFF
Mirror Ratio: 1.110
Timed Ion Selector: 1298.7 ON
Negative Ions: OFF
Figure C-4: PSD for 0121 Angiotensin
Experimental
Mass(E)
38.9917
42.6474
43.0738
43.5834
60.045
70.1447
71.1589
98.1434
112.1678
Intensity
Checkpoint(C)
772
380
363
353
304
6643
549
465
3290
39.0206
43.0228
43.0228
44.0233
60.0318
70.0371
71.0376
98.0520
112.0594
Delta
Mass(C-E)
0.0289
0.3754
-0.0509
0.4399
-0.0131
-0.1075
-0.1212
-0.0913
-0.1083
115.1478
419
115.0610
-0.0867
120.1403
526
120.0636
-0.0766
126.0993
780
126.0668
-0.0324
140.2396
155.2502
157.2721
762
1009
1258
140.0742
155.0822
157.0833
-0.1653
-0.1679
-0.1887
P:Bionml7 R, M:YBm17 R
I:YB PG
P:Bion R, I:YA SP, M:YB R
158.2839
166.3079
616
705
158.0838
166.0880
-0.2000
-0.2198
S:Yionml7 R
167.2804
175.2782
185.223
1309
1619
739
167.0886
175.0928
185.0981
-0.1917
-0.1853
-0.1248
I:YA PP, I:YBm18 SP
P:Bionpl8 R, S:Yion R, M:YBp18 R
I:YB SP
192.2196
195.2451
1327
1357
192.1018
195.1034
-0.1177
-0.1416
I:YB PP
209.2512
212.2254
217.123
218.2249
236.5522
237.1787
245.1472
1152
627
1163
625
911
2982
1435
209.1109
212.1124
217.1151
218.1156
236.1252
237.1257
245.1300
-0.1402
-0.1129
-0.0078
-0.1092
-0.4269
-0.0529
-0.0171
252.1834
254.2386
1217
1043
252.1337
254.1347
-0.0496
-0.1038
J:YB PPG
P:Bion RP
302.1729
305.5506
359.4264
363.414
371.3495
389.3236
544
426
439
396
1329
591
302.1602
305.1618
359.1905
363.1926
371.1968
389.2064
-0.0126
-0.3887
-0.2358
-0.2213
-0.1526
-0.1171
I:YB PGF
S:Yionml7 FR
Identity
M:YA S
M:YA P, M:YBm18 S
M:YB P
P:Aionm7 R, M:YAm17 R
M:YA F
P:Aionml7 RP
I:YBm18 FS, I:YA PF
P:Bionml7 RP
I:YB PF
P:Aionm17 RPPG
I:YA PPGF, I:YBm18 PGFS
I:YB PGFS
402.3099
632
1158
408.2843
631
419.2012
796
458.4447
527
842
399.2848
486.3314
506.4223
510.5818
787
995
527.6151
528.556
900
555.2614
556.6239
1029
539
531
466
572.8344
588.9908
597.6304
598.5546
559
837
892
614.3491
1013
614.956
625.3541
640.1559
642.1966
668.6404
706.4312
751.518
744
579
806.9349
827.41
833.4403
850.4326
862.8687
868.6491
877.6875
883.1278
887.5898
890.0149
904.5686
905.6564
591
1287
184
183
226
218
216
216
252
315
232
257
227
273
192
1948
911
915.3463
920.911
950.0765
990.8686
1014.4169
265
1018.3881
641
256
212
623
644
399.2117
-0.0730
402.2133
408.2165
419.2223
-0.0965
458.2430
486.2578
506.2685
510.2706
527.2796
528.2801
555.2945
556.2950
573.3040
589.3125
597.3167
598.3173
614.3258
615.3263
625.3316
640.3396
642.3406
668.3544
706.3746
751.3985
807.4282
827.4388
833.4420
850.4510
862.4574
868.4606
877.4653
883.4685
887.4706
890.4722
904.4797
905.4802
915.4855
920.4881
I:YB PPGF
S:Yionml7 PFR
P:Bion RPPG
S:Yion PFR, I:YBp18 PFR
I:YA PPGFS
I:YB PPGFS
S:Yion SPFR, I:YBp18 SPFR
P:Aionml7 RPPGF
P:Aion RPPGF
-0.0677
0.0211
-0.2016
-0.0735
-0.1537
-0.3111
-0.3354
-0.2758
P:Bion RPPGF, I:YA PPGFSP
0.0331
-0.3288
0.4696
0.3217
-0.3136
-0.2372
-0.0232
0.3703
-0.0224
0.1837
(I:YAm18 FSPFR)
P:Aionml7 RPPGFS
P:Aion RPPGFS
I:YBm18 PGFSPF
P:Bionml7 RPPGFS
0.1440
-0.2859
-0.0565
P:Bion RPPGFS
-0.1194
0.4933
S:Yion PGFSPFR, I:YBp18 PGFSPFR
0.0288
0.0017
0.0184
-0.4112
-0.1884
-0.2221
0.3407
-0.1191
0.4573
-0.0888
P:Bionml8 RPPGFSPF, I:YBm18 PPGFSPFR
S:Yionml7 PPGFSPFR
S:Yion PPGFSPFR, I:YBp18 PPGFSPFR
-0.1761
0.1392
-0.4228
0.4276
950.5041
-0.3432
990.5253
0.1211
1014.5380
1018.5402 I 0.1521 I
P:Aion RPPGFSPFR
1019.4492
1038.6707
673
617
1019.5407
1038.5508
0.0915
-0.1198
1043.4846
1044.4258
1038
719
1043.5534
1044.5540
0.0688
0.1282
1045.5164
1049.0638
1054.096
632
599
562
1045.5545
1049.5566
1054.5593
0.0381
0.4928
0.4633
1060.4957
14240
1060.5624
0.0667
M+Hm17 RPPGFSPFR
M+H RPPGFSPFR, P:Bionpl8 RPPGFSPFR
Table C.5: Bradykinin Dataset: data/022064c/unprependedpeaks
[Figure: spectrum plot, intensity versus Mass (m/z)]
Post Source Decay Analysis
File # 1 = C:\MATSU\TLENG\022064C\PSD1P00A.MSA
Stitch Factors: 0.600 - 1.010
Figure C-5: PSD for 0220 Bradykinin
Experimental
Intensity
Checkpoint(C)
Delta
Identity
507
1051
7764
462
308
311
456
871
8015
0.0024
0.0011
-0.1109
M:YA P, M:YBm18 S
3105
23.0122
39.0206
70.0371
71.0376
71.0376
87.0461
97.0514
98.0520
112.0594
115.0610
120.0636
126.0668
140.0742
151.0801
155.0822
156.0827
157.0833
158.0838
165.0875
166.0880
167.0886
168.0891
175.0928
185.0981
192.1018
193.1024
194.1029
195.1034
196.1040
209.1109
212.1124
217.1151
237.1257
245.1300
252.1337
2282
263.1395
Mass(E)
23.0098
39.0195
70.1481
71.1355
71.3207
87.2063
97.1742
98.1618
112.1888
115.1916
120.1754
126.1679
140.2768
151.3537
155.3028
156.282
157.3173
158.311
165.3573
166.3184
167.3258
168.3327
175.3017
185.2757
192.2622
193.294
194.3283
195.2752
196.3088
209.294
212.3102
217.2673
237.233
245.2819
252.2211
263.2083
777
1677
1644
2424
1099
3114
652
3083
1276
491
1845
3432
678
4637
1971
3237
1347
1245
3412
561
3764
1317
2563
8240
2456
-0.0978
-0.2830
-0.1601
-0.1227
-0.1097
-0.1293
M:YB P
P:Aionml7 R, M:YAm17 R
-0.1305
-0.1117
-0.1010
-0.2025
-0.2735
-0.2205
M:YA F
P:Bionml7 R, M:YBm17 R
I:YB PG
-0.1992
-0.2339
-0.2271
P:Bion R, I:YA SP, M:YB R
S:Yionml7 R
-0.2697
-0.2303
-0.2371
-0.2435
I:YA PP, I:YBm18 SP
-0.2088
P:Bionpl8 R, S:Yion R, M:YBp18 R
I:YB SP
-0.1775
-0.1603
-0.1915
-0.2253
-0.1717
-0.2047
-0.1830
-0.1977
-0.1521
-0.1072
-0.1518
-0.0873
-0.0687
(S:Yionpl8 R)
I:YB PP
P:Aionml7 RP
I:YBm18 FS, I:YA PF
P:Bionml7 RP
I:YB PF
I:YB PPG
274.1503
275.0342
292.2046
302.4998
305.4816
316.5282
332.4663
351.4181
359.3487
361.4266
363.4294
371.421
377.4286
380.4515
389.3622
399.2901
402.3501
408.3318
419.3184
446.8625
458.7359
468.6683
486.6087
506.6194
510.6261
527.5678
538.522
555.4765
573.1925
597.4805
614.4251
624.8962
642.2694
904.5703
1043.5426
1060.5442
2348
1105
1422
2712
1828
1382
920
902
1258
1445
2061
4546
934
1041
2378
2172
4689
1312
3136
830
1528
1353
3034
2054
3479
3786
1096
2115
1352
3398
3530
1361
3551
1101
729
10508
274.1453
275.1459
292.1549
302.1602
305.1618
316.1676
332.1761
351.1862
359.1905
361.1915
363.1926
371.1968
377.2000
380.2016
389.2064
399.2117
402.2133
408.2165
419.2223
447.2372
458.2430
468.2483
486.2578
506.2685
510.2706
527.2796
538.2854
555.2945
573.3040
597.3167
614.3258
625.3316
642.3406
904.4797
1043.5534
1060.5624
,
-0.0049
0.1117
-0.0496
-0.3395
-0.3197
-0.3605
-0.2901
-0.2318
-0.1581
-0.2350
-0.2367
-0.2241
-0.2285
-0.2498
-0.1557
-0.0783
-0.1367
-0.1152
-0.0960
0.3747
-0.4928
-0.4199
-0.3508
-0.3508
-0.3554
-0.2881
-0.2365
-0.1819
0.1115
-0.1637
-0.0992
0.4354
0.0712
-0.0905
0.0108
0.0182
I:YA PGF, I:YBm18 GFS
I:YB GFS
I:YB PGF
S:Yionml7 FR
I:YB FSP
P:Bion RPP
I:YA PGFS
P:Aionml7 RPPG
I:YA PPGF, I:YBm18 PGFS
P:Aion RPPG
I:YB PGFS
I:YB PPGF
S:Yionml7 PFR
P:Bion RPPG
S:Yion PFR, I:YBp18 PFR
I:YA PPGFS
I:YBm18 PPGFS
I:YB PPGFS
S:Yion SPFR, I:YBp18 SPFR
P:Aionml7 RPPGF
P:Aion RPPGF
P:Bionml7 RPPGF
P:Bion RPPGF, I:YA PPGFSP
P:Aionml7 RPPGFS
P:Aion RPPGFS
P:Bionml7 RPPGFS,(I:YAp18 FSPFR)
P:Bion RPPGFS
S:Yion PPGFSPFR, I:YBp18 PPGFSPFR
M+Hml7 RPPGFSPFR
M+H RPPGFSPFR, P:Bionpl8 RPPGFSPFR
Table C.6: Bradykinin Dataset: data/021829c/unprependedpeaks
[Figure: spectrum plot, intensity versus Mass (m/z)]
Post Source Decay Analysis
File # 1 = C:\MATSU\TLENG\021829C\PSD1P00B.MSA
Stitch Factors: 0.600 - 1.010
Figure C-6: PSD for 0218 Bradykinin
C.2
Distribution of Fragment Types
Statistics on these datasets are compiled in Table C.7. These include: dataset size, number
of experimentals accounted for, number of multiple identity peaks present, number of the
different ions present and various other totals. The ion counts include any peak that can
be assigned a particular fragment type, so multiple identity peaks are multiply counted in
the tallies for the fragment type of each identity.
Fragment Type
0123
0119
Dataset
1205
0121
Dataset Size
Matched Peaks
Multiple Identities
73
49
6(7)
48
34
6(7)
53
34
4(5)
5
5
4
4
5
6
0220
0218
55
42
5(6)
86
46
15
71
47
14(15)
4
4
3
5
3
5
5
Cores:
Aion
Bion
Yion
6
5
4
4
6
Am18
0(1)
0(1)
0(1)
0(1)
0
0
Am17
Bm18
3
0
2
0
2
0
3
0
5
1
4
0
Bm17
4
4
3
4
3
4
Bp18
1
0
1
1
2
2
Ym18
0
0
0
0
0
0
Ym17
1
0
0
1
5
5
Yp18
0
0
0
0
0
0(1)
9
10
0
0
0
0
0
4
4
7
0
0
0
0
0
2
4
6
0
0
0
0
0
4
6
9
0
0
0
0
0
3
6
9
0(1)
0
0
5
0
4
8
11
0
0
0(1)
5
0
3
Internals:
YA
YB
YAm18
YAm17
YAp18
YBm18
YBm17
YBp18
Immoniums:
YA
5
5
3
5
4
2
YB
YAm17
1
0
1
1
0
0
1
1
1
1
2
1
YBm18
0
0
0
0
1
1
YBm17
YBp18
0
1
0
1
0
0
0
1
1
1
1
1
25(26)
23
19(20)
13
21(22)
14
21(22)
18
30
24(25)
28(29)
27(28)
Totals:
cores
internals
immoniums
m18
m17
p18
7
8
3
8
9
8
0(1)
8
6
0(1)
7
3
0(1)
5
5
0(1)
9
5
7(8)
15
6(7)
6
15
6(8)
Table C.7: Ions Present in Data: Counts in parenthesis represent counts when using a
model without the refinement of Section 12.1.2.
[Figure: spectrum plot, intensity versus Mass (m/z); PerSeptive Biosystems instrument header]
Figure C-7: Manufacturer PSD for Angiotensin
Appendix D
Data Peaks of Unknown Origin
There are a handful of experimental peaks of fairly low abundance whose identities are not
known. They do not appear in the theoretical spectrum, and were it not for the fact that
they occur consistently across multiple datasets, one might have attributed them to noise.
checkpoint
1205
0123
0121
0119
factory
115.0610
166.0880
297
362
344
728
164
320
1679
906
212.1124
293
278
1095
521
230.1220
279.1480
303.1607
3350
2301
1919
812
866
798
1655
1626
2644
1759
1702
313.1660
4140
1085
2510
2750
329.1745
337.1788
364.1931
4656
3588
4570
1149
1044
1542
2727
2345
2050
3642
400.2122
3550
1237
2263
426.2260
527.2796
740.3926
M+H: 1296.6805
3773
4508
3134
33467
1226
730
2711
1255
1112
24878
16750
491
2941
440
2692
7326
3388
51538
Table D.1: Angiotensin Peaks of Unknown Identity: when a dataset contains an unknown,
the height of the peak is listed. The height of the parent ion is included in the last row of
the table for reference.
The masses in question are listed by checkpoint in Table D.1. When a dataset contains an
unknown, the height of the peak is listed. The diagnostic spectrum, see Section 9.1, also
contains peaks at many of these unknown masses (see column 4 of Table D.1).
Lastly, we consulted a batch of 10 angiotensin datasets, which were among the first datasets
we acquired as part of our training on how to use the mass spectrometer. While unfit for
use in our analyses, they also exhibited a number of the peaks in question:
checkpoint     instances (out of 10)
166.0880       7
212.1124       5
230.1220       6
426.44926      7
Table D.2: Number of times each unknown appears in training datasets (out of 10).
Again, these unknowns are not major peaks; however, they occur consistently enough to
arouse suspicion. From a purely mathematical analysis of masses and with our elementary
knowledge of fragmentation, we could not deduce a structure for these fragments using the
residues in angiotensin. Several questions then came to mind:
D.1
Do bradykinin spectra also contain unknown peaks?
Even though there are only two bradykinin datasets, they were checked for recurring unknowns, and the following were found in both:
checkpoint         0220     0218
71.0376             549      770
115.0610            419      777
126.0668            780     1644
166.0880            705     1845
192.1018           1327     3237
212.1124           1152     1317
573.3040            531     1352
M+H: 1060.5653    14240    10508
Table D.3: Bradykinin Peaks of Unknown Identity. Note that there are two 0218 experimentals (of height 462 and 308) that have the same checkpoint value of 71.0376.
Only 212.1124 and 115.061 are an amino acid (proline) distance apart, and interestingly
enough, these two are the only unknowns also present in the angiotensin spectra. What do
these two peptides have in common?
D.2
Could these peaks be due to the matrix?
In both cases, the same matrix compound was used. Could these peaks be somehow due
to the influence of the matrix? We acquired MS spectra for the matrix alone in linear and
reflector modes, and two of our unknowns, 212.1124 and 426.2260, appeared. We do not
know if the molecules behind these peaks are also responsible for the unknowns at these
masses in our datasets, but we suspect that it is unlikely. The timed ion selector, used
during MS/MS acquisition, should have barred non-angiotensin/non-bradykinin molecules
from entering the second stage of MS. It is true, however, that the timed ion selector
actually allows a small window of masses through, and it is also conceivable that portions of
the matrix may interact with pieces of the analyte to form a molecule that happens to reach
the selector during this window, but it is thought that such an occurrence is extremely rare.
We also acquired MS/MS spectra for the matrix control with the timed ion selector set
at 1296.68, the mass of the angiotensin parent ion, and none of the unknown peaks were
present.
D.3
Is there any way to explain these peaks?
Of course - one can find some hypothetical structure to account for peaks at these masses,
but we have no reason to believe any of them are likely to occur. In some of our earlier
investigations, we used a larger and more general set of fragment types that included the
C, D, V, W, X, Z ions and alternate forms for the A and Y ions (Scheme 111.2 of [Joh88]
contains structures for these). We found that in the resulting PMF, these unknown masses
could be explained by these extra fragment types, and often, in more ways than one. But
while they may be possible explanations, it is unlikely that they are plausible ones because
they are extremely rare for MALDI-PSD spectra.
D.4
Might these unknowns be related to each other?
In the event that some of the unknowns might be the result of the same phenomenon,
mass differences were calculated, and Table D.4 lists the pairs of unknown peaks that are
an amino acid apart. Note that this is only useful if the two unknowns are of the same
fragment type, in which case the one of smaller mass could be a prefix or suffix of the larger.
It is interesting to note that almost all of the residues (save T, S and A) are residues present
in the angiotensin sequence.
pair of unknown masses    residue distance    angiotensin residue
212.1124   115.061        P                   yes
230.122    115.061        D                   yes
279.148    166.088        I                   yes
279.148    166.088        L                   yes
303.1607   166.088        H                   yes
313.166    166.088        F                   yes
329.1745   166.088        Y                   yes
313.166    212.1124       T                   no
329.1745   230.122        V                   yes
426.226    279.148        F                   yes
400.2122   303.1607       P                   yes
400.2122   313.166        S                   no
426.226    313.166        I                   yes
426.226    313.166        L                   yes
400.2122   329.1745       A                   no
426.226    329.1745       P                   yes
527.2796   364.1931       Y                   yes
527.2796   426.226        T                   no
Table D.4: Unknowns that are a Residue Distance Apart: recall that the sequence for
angiotensin is DRVYIHPFHL.
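The pairing behind Table D.4 can be sketched in a few lines. This is only an illustration under stated assumptions: the checkpoint masses are taken from the first column of Table D.1, mass differences are rounded to the nearest integer, and the lookup table of nominal residue masses (`NOMINAL_RESIDUE`) and the function name `residue_pairs` are our own hypothetical choices, not part of the original analysis code.

```python
# Nominal (integer) residue masses; isobaric residues share one entry.
NOMINAL_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

# Checkpoint masses of the angiotensin unknowns (first column of Table D.1).
UNKNOWNS = [
    115.0610, 166.0880, 212.1124, 230.1220, 279.1480, 303.1607, 313.1660,
    329.1745, 337.1788, 364.1931, 400.2122, 426.2260, 527.2796, 740.3926,
]


def residue_pairs(masses):
    """Return (larger, smaller, residue) for every pair one residue apart."""
    ordered = sorted(masses)
    pairs = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            gap = round(high - low)  # nominal mass difference
            if gap in NOMINAL_RESIDUE:
                pairs.append((high, low, NOMINAL_RESIDUE[gap]))
    return pairs


if __name__ == "__main__":
    for high, low, residue in residue_pairs(UNKNOWNS):
        print(f"{high:9.4f} {low:9.4f}  {residue}")
```

Because I and L are isobaric, this sketch reports them as a single "I/L" entry, whereas the table lists them on separate rows.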
The graph of Figure D-1 is an alternate representation of the information in Table D.4. It
is reminiscent of a fundamental graph and non-angiotensin residue edges are represented
by dotted arrows. Short partial subsequences of angiotensin that are consistent with paths
through the graph are listed in Table D.5, but how these sequences can exhibit a structure
of this mass is unknown to us, e.g. what chemical structure for R has a mass of 115?
[Figure: directed graph whose nodes are the unknown masses 115, 166, 212, 230, 279, 303, 313, 329, 364, 400, 426 and 527, with edges labelled by residue]
Figure D-1: Graphical Representation of Residue Relationships Between Unknowns
sequence of peptides    sequence of nodes visited
R, DR, DRV              115,230,329
IH, YIH, YIHP           166,329,426
H, HL, FHL              166,279,426
HP, IHP, IHPF           166,279,426
I, IH, IHP              166,303,400
H, FH, FHL              166,313,426
H, HP                   115,212
F, PF                   115,212
V, VY                   364,527
I, YI                   364,527
Table D.5: Angiotensin Peptides and Consistent Path Nodes
D.5
Keep in Mind...
In some sense, our portrayal of unknowns is a little misleading because the datasets that
we are working with are really subsets of the raw data output by the mass spectrometer:
• the operator labels and selects peaks from raw data to be included in the final dataset,
  and understandably, often prefers the higher intensity ones
• the GRAMS peak labelling software does not always label all peaks below a selected
  threshold, and even worse, certain labels may disappear during subsequent attempts
  at labelling other peaks. Mass labels also do not always correspond to the peak apex
  - sometimes they indicate the centroid location instead. At times, we resorted to peak
  magnification and individual labelling when we were unable to tease the software into
  labelling a specific desired peak.
Therefore, not all experimental peaks make it from the raw data into the dataset that is
input to a sequencing algorithm. There could be other unknown peaks that were deemed
insignificant and consequently not included simply because their intensities were too low.
Other unknowns, which may shed additional light on the matter, may not have appeared
frequently enough to be noticed by our examination and included in our list.
No one seems to know, with any degree of certainty, what the origin of these unknown peaks
is. Johnson [Joh88] surmises that if they are novel ion types, then they are not related
to the parent sequence or they are due to higher energy processes. In short, they could
be evidence of new fragmentation, of interactions between matrix and/or peptide and/or
other contaminants (e.g. on the sample plate, in solvents, in the peptide crystals when
purchased), of noise or of some other phenomenon.
Appendix E
Visits to the Drawing Board
This appendix chapter chronicles some of the strategies that were tried and efforts that
were made to tackle de novo sequencing. While some of these turned out to be variations
of approaches that have appeared in the literature, they were instructive and instrumental
in shedding light on the nature of the problem and the characteristics of a good solution.
E.1
Understanding the Problem
Our early investigations focused on understanding the nature of the experimental data - the acquisition process, the types of peaks present and the extent of noise. Much of what
has been learned in these areas has already been discussed in the background sections of
this thesis.
E.1.1
Understanding the Acquisition Process
Spectra were acquired in-house as a means for becoming familiar with the procedure and
the equipment. Among other things, we learned that: high laser intensities can tease out
additional fragments that require more energy to form, but at the expense of poorer quality
for existing peaks, which widen and become less crisp and sharp; low mass peaks are acquired
blindly - they are present in the downloaded spectrum but not in the oscilloscope output;
when stitches are combined to form the PSD composite, peak intensities do not appear
to be preserved across the board or normalized; some peaks had to be manually selected
for labelling because the peak detection/labelling software was unable to label every single
peak (see Section D.5); and finally, the mass spectrometer software surprisingly allowed no
direct access to the raw digital data of the displayed spectrum.
E.1.2
Understanding the Spectra
Spectra were analyzed to discover what information is contained within and which features
might prove useful. Observations included the following: fragmentation is non-conserved
and incomplete - i.e. not all pieces of a parent molecule are retained and not all possible
fragments are produced; peak masses are often shifted from their calculated theoretical
value with no apparent consistent bias; and the most common peak mass difference is 28,
which is the distance between an A type break and its B type counterpart (e.g. Am17 and
Bm17, YA and YB internals, etc.).
PSD spectra for an assortment of polyamino acids (e.g. poly L arginine, poly L histidine,
etc.) and short homopeptides (e.g. FFF, VV, YYY, AAAA) were also acquired in the
hopes that fragmentation patterns particular to a residue might be useful general aids to
sequencing, but fragmentation appears to depend not only on composition but order as
well.
We found that several spectra for the same peptide could be merged into a single dataset
(with some rules for deciding when two peaks were the same mass) to improve the signal to
noise ratio. If legitimate peaks are assumed to occur consistently but noise peaks randomly,
then combining datasets helps reinforce actual fragments while diluting noise.
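As a rough illustration of the merging idea, the sketch below clusters peaks from several spectra and records how many spectra support each cluster. The rule for deciding that two peaks are the same mass (adjacent sorted peaks within `tol` of each other), the function names and the toy numbers are all assumptions made for this example; the rules actually used are not spelled out here.

```python
def merge_spectra(spectra, tol=0.5):
    """Merge several peak lists into one.

    spectra: list of spectra, each a list of (mass, intensity) pairs.
    Returns a list of (mean mass, summed intensity, number of supporting spectra).
    """
    peaks = sorted(
        (mass, intensity, index)
        for index, spectrum in enumerate(spectra)
        for mass, intensity in spectrum
    )
    clusters = []
    for mass, intensity, index in peaks:
        # Start a new cluster unless this peak sits within tol of the previous one.
        if not clusters or mass - clusters[-1]["masses"][-1] > tol:
            clusters.append({"masses": [], "intensity": 0.0, "sources": set()})
        cluster = clusters[-1]
        cluster["masses"].append(mass)
        cluster["intensity"] += intensity
        cluster["sources"].add(index)
    return [
        (sum(c["masses"]) / len(c["masses"]), c["intensity"], len(c["sources"]))
        for c in clusters
    ]


def consistent_peaks(merged, min_support):
    """Keep only merged peaks seen in at least min_support spectra."""
    return [peak for peak in merged if peak[2] >= min_support]


if __name__ == "__main__":
    # Toy spectra with made-up values, for illustration only.
    toy = [[(175.28, 1619.0), (300.00, 50.0)], [(175.30, 1309.0)]]
    print(consistent_peaks(merge_spectra(toy), min_support=2))
```

Peaks that recur across spectra accumulate support, while a peak present in only one spectrum is dropped once `min_support` exceeds one.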
Access to multiple spectra for the same peptide acquired under different conditions can also
be useful because affected peaks can provide additional information [SCE+97], but we had
little success with deuterium exchange (the use of D2O instead of H2O so that the mass of
molecules which acquire a deuterium atom will be shifted by 1).
Two other important outcomes from understanding the data are (1) the idea that the
mass of a molecule can only be at certain discrete values, which we call checkpoints (see
Section 7.1.6), and (2) the realization that there would be portions of the data that cannot
be explained by the current knowledge of fragmentation. These peaks of unknown identity
appear persistently in spectra for the same peptide, so any de novo sequencing algorithm
would have to be robust enough to formulate a sequence prediction in spite of
these offending peaks.
E.2
Exploring Sequencing Algorithms
What follows is a description of a variety of different strategies and ideas we considered.
They span the gamut in terms of approach - from static to dynamic scoring, from deterministic to stochastic, and from poor to fair performance (some fail to find the real sequence;
some generate it but need to better distinguish it from competing sequences). Only some
of the approaches are described; the observations made from these explorations are summarized in Chapter 7.
One simple early approach was one where a spectrum of n peaks was converted into a graph
of n nodes. Nodes representing peaks that were potentially of the same fragment family
were linked together with edges in some lexicographic order. Edges between families were
then added to the graph if a family was an amino acid distance apart from another. Indeed,
the use of a graph is attractive because a wealth of efficient graph algorithms can now be used
to identify and isolate a path with some specific property - in our case, the longest path
that accounted for the parent mass. The idea was that this path visited the most vertices,
and hence was the most likely sequence because of the redundancy it could account for.
With complete (and hence, representative) spectra, this would be a reasonable conclusion
since the amount of redundancy would constrain the spectrum to likely support only one
interpretation. With incomplete, but representative, spectra, it is difficult to prove that
the "best" path is the correct answer. Paths longer than the correct sequence could score
better because more nodes are visited, and even though these may be "uninteresting" (in
that they are low scoring), their aggregate sum may outscore that of the correct path.
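A much simplified sketch of this idea follows. It is not the original implementation: fragment families are ignored, every peak is a node, an edge joins two peaks whose mass difference matches a residue within an assumed tolerance, and the largest mass is assumed to be the parent-related peak at which the path must end. The residue table holds standard monoisotopic masses; the function names and tolerance are hypothetical.

```python
# Monoisotopic residue masses (L also stands for the isobaric I).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def is_residue_gap(gap, tol):
    """True if the mass difference matches some residue within tol."""
    return any(abs(gap - mass) <= tol for mass in RESIDUE_MASS.values())


def longest_path_to_parent(masses, tol=0.3):
    """Longest chain of residue-spaced peaks ending at the largest mass."""
    nodes = sorted(masses)
    if not nodes:
        return []
    # best[j] = (number of nodes on the best chain ending at j, predecessor index)
    best = [(1, None)] * len(nodes)
    for j in range(len(nodes)):
        for i in range(j):
            if is_residue_gap(nodes[j] - nodes[i], tol) and best[i][0] + 1 > best[j][0]:
                best[j] = (best[i][0] + 1, i)
    path, k = [], len(nodes) - 1
    while k is not None:
        path.append(nodes[k])
        k = best[k][1]
    return path[::-1]


if __name__ == "__main__":
    # Synthetic ladder (G, A, S steps) plus one unrelated peak at 210.0.
    print(longest_path_to_parent([100.0, 157.02146, 228.05857, 315.0906, 210.0]))
```

Since edges always run from lower to higher mass, the graph is acyclic and the longest path can be found by dynamic programming in one pass over the sorted peaks.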
Although not the next paradigm we considered, the fundamental graph, introduced in
Section 4.2, is the first we discuss because it conveniently highlights potential sequences
embedded in the data, and is therefore a natural starting point for algorithm development.
E.2.1
Fundamental Graph-Based Approaches
The simplest way to score a fundamental node is to use the number of experimental supporters for the fundamental. More complex scoring functions have been tried by us and
other centers, and while they include attributes and factors that reward for redundancy,
there is no reason to believe them optimal, being largely based on empirical observation.
Given a graph with node scores, and possibly edge scores, one can efficiently identify the
optimum path, using a greedy strategy for example, but with spectra that are not representative, it is difficult to prove that the optimal path is the correct one. Algorithms may list
all paths that score above a certain threshold or simply the best n paths in hopes that the
real sequence scores well enough to be included among them.
With a graph of 559 nodes and 4645 edges for the 0123 dataset, it is too costly to first
enumerate all possible complete paths and then select from them. A few methods exist for
keeping the problem tractable - optimizations on the physical graph can reduce its size (e.g.
to less than 396 nodes and 2753 edges), and intelligent pruning during path enumeration
can limit the number of paths considered by discarding undesirable paths and retaining
only those which correlate well with the experimental data. However, there is a chance that
the real answer is also removed in the process.
One means for finding the best n paths involved a propagation stage where scores were
propagated through the graph from the base to the parent fundamental. Each node kept
track of the n best ways to reach it from the base, and these were collated by the distance
of the node from the base (the number of residues in the sequence defined by the partial
path). A backtracing stage in the reverse direction starting from the parent fundamental
delineated the best paths (sequences) of a particular desired length.
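The propagation stage can be sketched as below, under simplifying assumptions: nodes are integers ordered so that every edge points to a larger node, a path's score is the sum of its node scores, and each node stores its n best partial paths explicitly (grouped by length) rather than recovering them in a separate backtracing stage. The function name and data layout are hypothetical.

```python
def n_best_paths(edges, scores, base, parent, n=3):
    """Propagate the n best partial paths from base towards parent.

    edges:  dict mapping a node to the nodes reachable from it (all larger).
    scores: dict mapping every node to its score.
    Returns {path length in edges: [(score, path), ...]} for the parent node,
    with at most n entries per length, best first.
    """
    best = {base: {0: [(scores[base], (base,))]}}
    for node in sorted(scores):  # predecessors are always processed first
        for length, entries in best.get(node, {}).items():
            for nxt in edges.get(node, ()):
                bucket = best.setdefault(nxt, {}).setdefault(length + 1, [])
                bucket.extend((s + scores[nxt], p + (nxt,)) for s, p in entries)
                bucket.sort(reverse=True)
                del bucket[n:]  # keep only the n best ways to reach nxt
    return best.get(parent, {})


if __name__ == "__main__":
    edges = {0: [1, 2, 3], 1: [3], 2: [3]}
    scores = {0: 0, 1: 5, 2: 3, 3: 1}
    print(n_best_paths(edges, scores, base=0, parent=3, n=2))
```

Grouping by length mirrors the collation by distance from the base, so the best sequences of a particular desired length can be read directly from the parent's entry for that length.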
An algorithm usually generates a pool of predictions. The highest ranking sequence predictions often share a common subsequence with the real sequence. So, when a peptide
cannot be sequenced in its entirety, an accurate partial prediction might be preferred.
A single gap is one reason why a real sequence may not be complete. Even worse is the
inability to realize that the real sequence is missing from a pool of candidates. The inclusion
of short peptide edges in the graph is a simple measure against gaps in the dataset. However,
when only dipeptides and tripeptides were considered, 25189 new edges were introduced into
the graph - the dipeptides alone ballooned the graph by 10432 edges. The introduction
of edges that represent more than one residue can make it difficult to find a fair scoring
mechanism, and may not turn out to be a viable solution for handling large gaps.
Recall that during the construction of the graph, each experimental peak gave rise to a
number of fundamentals, and these are collectively referred to as related fundamentals (see
Section 4.2). Certain paths were found to visit related fundamentals. Paths like these
should be disallowed if the same experimental peak is explained by multiple conflicting
roles. Dancik et al. and Chen et al. were careful in their handling of related fundamentals [DAC+99a, CKT+00].
Internal ions can be included in the score of a node by considering suffixes of the sequences
described by each partial path. The legitimacy of some variant supporters can also be
verified from the composition of a partial path. These are examples of dynamic score
calculation. Because a partial path is available to a dynamic scoring function, it can compute
a score that is more accurate than that of a static one. This allows for more path-specific
(hence sequence-specific) features to be accounted for earlier in the search instead of at the
end when a sequence guess is complete. However, despite a number of optimizations, the
resulting search took too long.
E.2.2
Expanding Islands of Certainty
Instead of sequencing from one terminus to another, this approach tries to work inside out.
The idea is to guess very likely subsequences of the peptide, and then gradually extend
these islands in both directions with residues that are less and less certain until no further
extensions are possible or the entire parent mass is accounted for.
How does one find a good starting guess? In our datasets, there were definitely regions of
high redundancy as well as regions with low amounts of support. The best scoring fundamentals, those which are highly redundant, are selected (see footnote 1), and those that are an amino
acid apart are chained together to form islands. These fundamentals and their corresponding peaks can then be removed from the data, and the best scoring fundamentals of the
remaining data can then be assimilated into the growing islands. Of course, it is possible
that not all of the initially chosen fundamentals are correct, and some of the second-tier
fundamentals may not be compatible with the existing islands, so one must decide what
to do in situations like this - with only a few potential fundamentals, one could consider
all possible combinations and generate best solutions for each, or one could try to find a
maximum compatible set.
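The first step, selecting the best scoring fundamentals and chaining those that are an amino acid apart, might look like the sketch below. It is a minimal illustration only: fundamentals are reduced to (mass, score) pairs, the residue table holds standard monoisotopic masses, and the tolerance, the `top_k` cutoff and the toy values are assumptions rather than the settings used in our experiments.

```python
# Monoisotopic residue masses (L also stands for the isobaric I).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def is_residue_gap(gap, tol):
    """True if the mass difference matches some residue within tol."""
    return any(abs(gap - mass) <= tol for mass in RESIDUE_MASS.values())


def build_islands(fundamentals, top_k, tol=0.3):
    """Chain the top_k best scoring fundamentals into islands.

    fundamentals: list of (mass, score) pairs.
    Returns a sorted list of islands, each a sorted list of masses in which
    every member is one residue away from at least one other member.
    """
    ranked = sorted(fundamentals, key=lambda f: f[1], reverse=True)[:top_k]
    islands = []
    for mass in sorted(m for m, _ in ranked):
        # Islands that this fundamental can attach to.
        home = [isl for isl in islands
                if any(is_residue_gap(mass - other, tol) for other in isl)]
        merged = [mass]
        for isl in home:
            merged = isl + merged
            islands.remove(isl)
        islands.append(sorted(merged))
    return sorted(islands)


if __name__ == "__main__":
    # Made-up fundamentals: two residue ladders plus one low scoring outlier.
    toy = [(100.0, 9), (157.02, 8), (228.06, 7), (500.0, 6), (597.05, 5), (300.0, 1)]
    print(build_islands(toy, top_k=5))
```

Later iterations would remove the chosen fundamentals from the data and repeat the selection on what remains; that step, and the handling of incompatible second-tier fundamentals, is left out of the sketch.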
Being largely dependent upon redundancy and how quickly support peters out with every
iteration of the algorithm, it is not clear that this approach will perform well in general.
Since each iteration considers information that is less and less reliable, this suggests
that finding a portion of the real sequence with high confidence may be preferable
to proposing an entire sequence whose correctness is uncertain.
E.2.3
Bounding Partial Paths
A branch and bound search can be performed on the fundamental graph. Here, the search
algorithm would keep track of some limited number of partial paths sorted by score, and
always choose the best path to expand next. A better strategy would be one that calculates
an upper bound for the score of a partial path so that the most promising partial path
would always be the one chosen for expansion.
To estimate the promise possible from continued expansion of a partial path, the scoring function was modified so that it could compute a score from a corresponding partial
sequence. The length of the real sequence must be known in order to pad the partial
sequence with the appropriate number of wildcard characters, special placeholders for
to-be-determined residues.
Footnote 1: A variation was to sort the experimentals by intensity and choose the most intense ones. This assumes
that high intensity peaks are more likely to represent legitimate fragments and not noise.
The theoretical peaks computed for a padded partial sequence now fall into one of two
categories: definites and ranges. Definites refer to those masses that are exactly known.
Subsequences containing wildcards induce a range of possible mass values because the wildcard stands in place of a residue whose mass is between 57 (glycine) and 186 (tryptophan).
The more wildcards, the greater the possible range. Now, computation of a score involves
finding the best mapping or assignment of experimental peaks to these theoreticals, and the
Hungarian method [Kuh55, PS82], an algorithm for solving the weighted bipartite matching
problem, was used.
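To make the distinction between definites and ranges concrete, the sketch below computes, for each prefix of a padded partial sequence, the lowest and highest possible sum of residue masses. It is an assumption-laden illustration: nominal integer residue masses are used, terminal groups and ion-type offsets are ignored, and the function name `prefix_mass_ranges` is hypothetical.

```python
# Nominal (integer) residue masses.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
LIGHTEST, HEAVIEST = 57, 186  # glycine and tryptophan bound every wildcard


def prefix_mass_ranges(partial, length):
    """Return (low, high) residue-mass sums for each prefix of the padded sequence.

    partial: the residues determined so far, e.g. "DRV".
    length:  the assumed length of the full sequence; the remaining positions
             are wildcards.
    """
    if len(partial) > length:
        raise ValueError("partial sequence is longer than the full sequence")
    ranges = []
    low = high = 0
    for position in range(length):
        if position < len(partial):
            low += NOMINAL[partial[position]]
            high += NOMINAL[partial[position]]
        else:  # wildcard: any residue between glycine and tryptophan
            low += LIGHTEST
            high += HEAVIEST
        ranges.append((low, high))
    return ranges


if __name__ == "__main__":
    for low, high in prefix_mass_ranges("DRV", 5):
        kind = "definite" if low == high else "range"
        print(low, high, kind)
```

Each additional wildcard widens the interval by 129 mass units (186 - 57), which is why the ranges grow quickly as more positions are left undetermined.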
The Hungarian method is commonly used on job scheduling problems: jobs need to be
assigned to processors, there is a cost associated with each pairing, and the minimum
cost assignment is desired. This problem can be represented as a bipartite graph where
processors and jobs are nodes, edges exist between processor and job nodes only, and the
cost of each edge is given by a square cost matrix (square because the Hungarian method
requires the number of processors to equal the number of jobs). The Hungarian method
produces a perfect assignment - one where the mapping between processors and jobs is a
bijection (one-to-one).
Variations of this algorithm abound [DE96, HHF95, TYC93]. In particular, [TYC93]
formulate the problem as a maximization one, and [HHF95] allow matchings where multiple
jobs are assigned to a processor by introducing dummy processors/jobs when necessary to
equalize the number of processor and job nodes. With these in mind, we reduced our
problem to an instance of bipartite matching that was suitable for the Hungarian method.
Our reduction accounted for:
• an unequal number of theoreticals and experimentals by adding the appropriate number of dummy nodes
• the possibility of multiple identity peaks by allowing multiple theoreticals to be assigned to (different instances of nodes representing) the same experimental, and
• the presence of unmatched experimentals by matching them to theoretical nodes which
  represent noise.
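The shape of this reduction can be illustrated on a toy cost matrix. The sketch below pads a rectangular matrix (theoreticals as rows, experimentals as columns) with dummy rows or columns and then finds the minimum cost perfect assignment. Note that it uses exhaustive search over permutations as a stand-in for the Hungarian method, which is only feasible for tiny matrices; the dummy cost, the cost values and the function names are all assumptions for illustration, and duplicated experimental nodes for multiple identity peaks are not modelled.

```python
from itertools import permutations


def pad_to_square(cost, dummy_cost):
    """Add dummy rows or columns so that the cost matrix becomes square."""
    rows, cols = len(cost), len(cost[0])
    size = max(rows, cols)
    padded = [list(row) + [dummy_cost] * (size - cols) for row in cost]
    padded += [[dummy_cost] * size for _ in range(size - rows)]
    return padded


def min_cost_assignment(cost):
    """Brute-force minimum cost perfect assignment on a square matrix.

    Returns (assignment, total) where assignment[r] is the column given to row r.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda cols: sum(cost[r][cols[r]] for r in range(n)),
    )
    return list(best), sum(cost[r][best[r]] for r in range(n))


if __name__ == "__main__":
    # Two theoreticals (rows) against three experimentals (columns).
    cost = [[4, 1, 6], [2, 0, 5]]
    square = pad_to_square(cost, dummy_cost=2)  # dummy row plays the noise node
    print(min_cost_assignment(square))
```

In this toy case the third experimental is matched to the dummy row, which corresponds to explaining an unmatched experimental as noise.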
Even with a number of optimizations, the resulting graph was huge, and the Hungarian
method took very long.
An alternate means for calculating an upper bound was found to be very fast, but the
bound was too relaxed and the real partial sequence was nowhere near the top. In fact,
it scored so poorly that it was not chosen to be expanded until much later - only when
incorrect partial paths were expanded practically to completion were they outscored by the
real partial path. It was basically a tradeoff between time and accuracy: a good estimate of
a path's potential would have been accurate but extremely costly to compute; a very fast
estimate was possible, but the result was too imprecise to be useful in the ensuing search.
Some of the other approaches that were considered include:
Hidden Markov Models The observation sequence of an HMM is a series of outputs of
states visited in some order. The most straightforward approach is to regard an
experimental spectrum as this observation sequence. However, a spectrum has no notion
of temporal ordering associated with it, neither in terms of peak generation nor peak
appearance, whereas HMM observations are not simply a set of observations (as in a
mass spectrum) but a time-dependent ordering of events. Perhaps there exists
some other non-trivial way to view the problem so that HMMs become applicable.
Bottom-Up Subsequence Reconstruction A subsequence analysis approach attempted
to take advantage of the fact that there are (1) internals present and (2) more peaks
due to shorter subsequences than longer ones. The idea is to list all possible distinct
dipeptides that are supported by an experimental peak. If two list members are found
such that one has a prefix that is a suffix of the other, a new sequence is formed by
merging the overlapping portions (e.g. DR and RV would give rise to DRV), and if this
sequence is supported by an experimental and has not yet been considered, the new
merged sequence is added to the list. This algorithm relies on there being enough
overlapping pieces present, something that is not characteristic of our datasets. Without
some additional provisions, there are not enough subsequence peaks to permit
assembly in this splint-like fashion.
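The merging step can be sketched as follows. This is a minimal illustration rather than the code used in this work: `merge_overlap`, `assemble` and `max_len` are hypothetical names, and the `is_supported` callback stands in for the real test of whether some experimental peak supports the mass of a candidate subsequence (in the usage below it is replaced by a simple substring check against a made-up sequence).

```python
def merge_overlap(left, right):
    """Merge two sequences when a suffix of `left` equals a prefix of `right`.

    The longest proper overlap is tried first; None means no overlap.
    """
    for size in range(min(len(left), len(right)) - 1, 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


def assemble(seeds, is_supported, max_len=50):
    """Grow the list of supported subsequences by repeated overlap merging."""
    known = set(seeds)
    frontier = list(seeds)
    while frontier:
        current = frontier.pop()
        for other in list(known):
            for a, b in ((current, other), (other, current)):
                merged = merge_overlap(a, b)
                if (merged is not None
                        and len(merged) <= max_len
                        and merged not in known
                        and is_supported(merged)):
                    known.add(merged)
                    frontier.append(merged)
    return known


# Stand-in support test: a candidate counts as supported if it occurs in DRVY.
found = assemble(["DR", "RV", "VY"], lambda seq: seq in "DRVY")
print(sorted(found, key=lambda s: (len(s), s)))
```

If the dipeptide RV were missing from the seed list, DR and VY could never be joined, which is the failure mode described above when too few overlapping pieces are present.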
Path Reinforcement for Internal Ions The main idea behind this approach was to
augment (reinforce) the scores of certain subpaths through the fundamental graph.
Since subpaths correspond to possible internal ions, only those that are supported
by the data are reinforced: all edges along all subpaths between two nodes are
reinforced only if the mass difference of the nodes is supported by some experimental
peak. The hope was that the most reinforced path would be the most redundant and
hence the path of the correct sequence.
However, as a result of being situated at the crossroads of several reinforced paths,
certain edges ended up with very high edge scores. These edges were not part of the
correct path, and had been "taken out of context": an edge's score was high because the
edge was part of a particular subpath; if the other edges of that subpath were not also
included in the selected complete path, the component of the score due to this subpath
should have been disregarded.
It is possible that a highly redundant spectrum could overcome this because the most
reinforced edges would be exactly those edges of the supported subpaths.
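A minimal sketch of the reinforcement step is given below. It assumes, for illustration only, that the fundamental graph is a directed acyclic graph whose nodes carry masses, that reinforcement is a simple count, and that a mass difference is "supported" when it lies within a tolerance `tol` of some peak; the names `reinforce`, `reachable_from` and `tol` are hypothetical. An edge (a, b) lies on some subpath from s to t exactly when a is reachable from s and t is reachable from b.

```python
def reachable_from(adj, start):
    """Return the set of nodes reachable from `start`, including itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reinforce(masses, edges, peaks, tol=0.5):
    """Count, for every edge, the supported node pairs whose subpaths use it."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    reach = {node: reachable_from(adj, node) for node in masses}
    score = {edge: 0 for edge in edges}
    for s in masses:
        for t in reach[s]:
            if t == s:
                continue
            diff = masses[t] - masses[s]
            if any(abs(diff - peak) <= tol for peak in peaks):
                # Reinforce every edge on every subpath from s to t.
                for a, b in edges:
                    if a in reach[s] and t in reach[b]:
                        score[(a, b)] += 1
    return score


# Hypothetical four-node chain A -> B -> C -> D with made-up masses.
masses = {"A": 0.0, "B": 100.0, "C": 250.0, "D": 450.0}
edges = [("A", "B"), ("B", "C"), ("C", "D")]
scores = reinforce(masses, edges, peaks=[150.0, 350.0])
print(scores)
```

In this toy chain the edge from B to C is reinforced by both supported pairs. In a graph with many alternative subpaths, an edge shared by several of them accumulates score in the same way even when it is not on the correct path.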
The various approaches sought to capitalize on different features of the data, and each met
with varying degrees of success/failure. Because two pairs of amino acids have indistinguishable weights, it is already impossible to predict the true sequence exactly. Gaps compound
the problem and only serve to obscure additional residues. When a pool of solutions generated by an approach contains the best correct sequence possible, the task becomes one
of examining why the real sequence scored poorly and why an incorrect competitor scored
well.
The score and/or search were often modified to account for some additional property, in
as general a manner as possible, to elevate the standing of the correct answer. Oftentimes,
these modifications, and even the entire approach itself, seemed to be ad hoc patchwork.
What resulted seemed a complicated motley collection of rules and heuristics that lacked
a formal framework for proving why the answer offered by the algorithm should be a good
one. In formulating our approach of Chapter 8, however, we desired an algorithm
that had a more structured framework, and Chapter 7 describes the issues we had to keep
in mind.
Appendix F
Scoring Function Maximum
The scoring function Prob(S, F) of Chapter 8 is maximal when the spectrum S most resembles
the PMF F. A proof sketch is contained in this appendix. Let r be the parent mass. The
PMF F is defined from 0 to r, and the value of F at i is denoted $F_i$. The experimental
spectrum S, also defined for all integer values between 0 and r, can be thought of as a vector
of assignments k where $k_i$ is the height of peak i.
Given integer N and assignment $k = \langle k_0, k_1, \ldots, k_r \rangle$ where the $k_i$ are integers and
$\sum_{i=0}^{r} k_i = N$, let
\[
\mathrm{Prob}(k, F) = \frac{N!}{k_0!\, k_1! \cdots k_r!}\; F_0^{k_0} F_1^{k_1} \cdots F_r^{k_r}
\]
where $0 < F_i < 1$ and $\sum_{i=0}^{r} F_i = 1$.
Theorem F.0.1 Prob(k, F) attains its maximum value when $k_{opt} = \langle F_0 N, F_1 N, \ldots, F_r N \rangle$.
Proof: The goal is to show that for any assignment k, $\mathrm{Prob}(k, F) \le \mathrm{Prob}(k_{opt}, F)$.
Consider some assignment $k = \langle k_0, k_1, \ldots, k_r \rangle$. For each i, one of the following
must be true:
1. $k_i > F_i N \Rightarrow k_i = F_i N + e_i$,
2. $k_i = F_i N$,
3. $k_i < F_i N \Rightarrow k_i = F_i N - e_i$
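where $e_i$ is an integer with $e_i \ge 1$.
Rearrange the expression for Prob(k, F) by grouping together terms that fall in each of
these three categories, and without loss of generality, shuffle and rename the
indices, so that for index i,
* $0 \le i \le a$: $k_i > F_i N$,
* $a + 1 \le i \le b$: $k_i = F_i N$, and
* $b + 1 \le i \le r$: $k_i < F_i N$.
We now arrive at:
\[
\begin{aligned}
\mathrm{Prob}(k, F)
&= \frac{N!}{(F_0 N + e_0)! \cdots (F_a N + e_a)!\,(F_{a+1} N)! \cdots (F_b N)!\,(F_{b+1} N - e_{b+1})! \cdots (F_r N - e_r)!} \\
&\quad \times \left( F_0^{F_0 N + e_0} \cdots F_a^{F_a N + e_a}\; F_{a+1}^{F_{a+1} N} \cdots F_b^{F_b N}\; F_{b+1}^{F_{b+1} N - e_{b+1}} \cdots F_r^{F_r N - e_r} \right) \\
&= \frac{N!}{(F_0 N)! \cdots (F_r N)!} \cdot
\frac{\bigl((F_{b+1} N - e_{b+1} + 1) \cdots (F_{b+1} N)\bigr) \cdots \bigl((F_r N - e_r + 1) \cdots (F_r N)\bigr)}
{\bigl((F_0 N + 1) \cdots (F_0 N + e_0)\bigr) \cdots \bigl((F_a N + 1) \cdots (F_a N + e_a)\bigr)} \\
&\quad \times \left( F_0^{F_0 N + e_0} \cdots F_a^{F_a N + e_a}\; F_{a+1}^{F_{a+1} N} \cdots F_b^{F_b N}\; F_{b+1}^{F_{b+1} N - e_{b+1}} \cdots F_r^{F_r N - e_r} \right) \\
&= \mathrm{Prob}(k_{opt}, F) \cdot
\frac{F_0^{e_0} \cdots F_a^{e_a}}{\prod_{t=1}^{e_0} (F_0 N + t) \cdots \prod_{t=1}^{e_a} (F_a N + t)} \cdot
\frac{\prod_{t=-e_{b+1}+1}^{0} (F_{b+1} N + t) \cdots \prod_{t=-e_r+1}^{0} (F_r N + t)}{F_{b+1}^{e_{b+1}} \cdots F_r^{e_r}} \\
&= \alpha\, \mathrm{Prob}(k_{opt}, F).
\end{aligned}
\]
We desire to show that $\alpha \le 1$. We can rewrite $\alpha$ as:
\[
\alpha = \frac{F_0^{e_0}}{\prod_{t=1}^{e_0} (F_0 N + t)} \cdots \frac{F_a^{e_a}}{\prod_{t=1}^{e_a} (F_a N + t)} \cdot
\frac{\prod_{t=-e_{b+1}+1}^{0} (F_{b+1} N + t)}{F_{b+1}^{e_{b+1}}} \cdots \frac{\prod_{t=-e_r+1}^{0} (F_r N + t)}{F_r^{e_r}}
\tag{F.1}
\]
Consider the fractions of the form $F_i^{e_i} / \prod_{t=1}^{e_i} (F_i N + t)$. These terms result from all $k_i$
that fall into case 1. We can bound each of these as follows:
\[
\frac{F_i^{e_i}}{\prod_{t=1}^{e_i} (F_i N + t)}
= \frac{F_i}{F_i N + 1} \cdots \frac{F_i}{F_i N + e_i}
= \frac{1}{N + \frac{1}{F_i}} \cdots \frac{1}{N + \frac{e_i}{F_i}}
< \left( \frac{1}{N} \right)^{e_i}.
\]
Substituting this into F.1 leads to:
\[
\begin{aligned}
\alpha &< \left( \frac{1}{N} \right)^{e_0} \cdots \left( \frac{1}{N} \right)^{e_a} \cdot
\frac{\prod_{t=-e_{b+1}+1}^{0} (F_{b+1} N + t)}{F_{b+1}^{e_{b+1}}} \cdots \frac{\prod_{t=-e_r+1}^{0} (F_r N + t)}{F_r^{e_r}} \\
&= \left( \frac{1}{N} \right)^{\sum_{i=0}^{a} e_i} \cdot
\frac{\prod_{t=-e_{b+1}+1}^{0} (F_{b+1} N + t)}{F_{b+1}^{e_{b+1}}} \cdots \frac{\prod_{t=-e_r+1}^{0} (F_r N + t)}{F_r^{e_r}}.
\end{aligned}
\]
Consider the remaining fractions in $\alpha$, namely those that result from case 3:
\[
\frac{\prod_{t=-e_i+1}^{0} (F_i N + t)}{F_i^{e_i}}
= \frac{F_i N - e_i + 1}{F_i} \cdots \frac{F_i N}{F_i}
= \left( N - \frac{e_i - 1}{F_i} \right) \cdots (N)
\le N^{e_i} \quad (\text{because } e_i \ge 1).
\]
Since the assignment k is one way to partition N things into r + 1 parts:
\[
\begin{aligned}
N = \sum_{i=0}^{r} k_i &= (F_0 N + e_0) + \cdots + (F_r N - e_r) \\
&= \sum_{i=0}^{r} F_i N + \sum_{i=0}^{a} e_i - \sum_{i=b+1}^{r} e_i \\
&= N + \sum_{i=0}^{a} e_i - \sum_{i=b+1}^{r} e_i \\
\Rightarrow \quad \sum_{i=0}^{a} e_i &= \sum_{i=b+1}^{r} e_i.
\end{aligned}
\]
In particular, any $k \ne k_{opt}$ has at least one index in case 1 and at least one in case 3, while $\alpha = 1$ when $k = k_{opt}$.
Therefore, for $k \ne k_{opt}$, $\alpha < \left( \frac{1}{N} \right)^{\sum_{i=0}^{a} e_i} N^{\sum_{i=b+1}^{r} e_i} = 1$, and the maximum value of Prob(k, F)
is obtained when k "mimics" the probability distribution defined by F; namely,
when $k = k_{opt} = \langle F_0 N, F_1 N, \ldots, F_r N \rangle$.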
Thanks to David Stephenson for assistance with the proof for the case when r = 2.
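The theorem can be checked numerically on a small case. The sketch below is only an illustration: the PMF F and the total N are made-up values chosen so that every $F_i N$ is an integer (as the proof implicitly requires), and the helper names `prob` and `compositions` are hypothetical. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from math import factorial


def prob(k, F):
    """Prob(k, F) = N! / (k_0! ... k_r!) * F_0^k_0 ... F_r^k_r, computed exactly."""
    coeff = factorial(sum(k))
    for ki in k:
        coeff //= factorial(ki)  # each partial quotient is an integer
    value = Fraction(coeff)
    for Fi, ki in zip(F, k):
        value *= Fi ** ki
    return value


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


# Made-up PMF over three positions and a total peak height N with integer F_i * N.
F = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
N = 6
k_opt = tuple(int(Fi * N) for Fi in F)

# Exhaustively search all assignments k with sum(k) = N for the best score.
best = max(compositions(N, len(F)), key=lambda k: prob(k, F))
print(k_opt, best, prob(k_opt, F))
```

Here the exhaustive search returns the same assignment as $k_{opt}$, in agreement with the theorem.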
Bibliography
[ASR96]
J. Andersen, B. Svensson, and P. Roepstorff. Electrospray ionization and matrix assisted laser desorption/ionization mass spectrometry: Powerful analytical tools in recombinant protein chemistry. Nature Biotechnology, 14:449-457, 1996.
[Bal95]
M. Baldwin. Modern mass spectrometry in bioorganic analysis. Natural Products Reports, pages 33-44, 1995.
[Bar90]
C. Bartels. Fast algorithm for peptide sequencing by mass spectroscopy. Biomedical and Environmental Mass Spectrometry, 19:363-368, 1990.
[Bar98]
C. Bartels. personal email communication, April 1998.
[BBG96]
A. Burlingame, R. Boyd, and S. Gaskell. Mass spectrometry. Analytical Chemistry, 68:599R-651R, 1996.
[BC]
P. Baker and K. Clauser. Ms-product from ucsf mass spectrometry facility. http://prospector.ucsf.edu/ucsfhtml3.2/instruct/prodman.htm.
[BC96]
R. Beavis and B. Chait. Matrix-assisted laser desorption ionization mass-spectrometry of proteins. Methods in Enzymology, 270:519-551, 1996.
[BCC91]
R. Beavis, T. Chaudhary, and B. Chait. α-cyano-4-hydroxycinnamic acid as a matrix for matrix-assisted laser desorption mass spectrometry. Organic Mass Spectrometry, 27:156-158, 1991.
[Bea92]
R. Beavis. Matrix-assisted ultraviolet laser desorption: Evolution and principles. Organic Mass Spectrometry, 27:653-659, 1992.
[Bie90]
K. Biemann. Sequencing of peptides by tandem mass spectrometry and high-energy collision-induced dissociation. Methods in Enzymology, 193:455-479, 1990.
[BJHP94]
M. Bartlet-Jones, W. Jeffery, H. Hansen, and D. Pappin. Peptide ladder sequencing by mass-spectrometry using a novel, volatile degradation reagent. Rapid Communications in Mass Spectrometry, 8:737-742, 1994.
[BS87]
K. Biemann and H. Scoble. Characterization by tandem mass spectrometry of structural modifications in proteins. Science, 237:992-998, 1987.
[BT93]
D. Bertsimas and J. Tsitsiklis. Simulated annealing. Statistical Science, 8:10-15, 1993.
[CBB96]
K. Clauser, P. Baker, and A. Burlingame. Peptide fragment-ion tags from maldi/psd for error-tolerant searching of genomic databases. 44th Conference on Mass Spectrometry and Allied Topics, Portland, OR, page 365, 1996.
[CCC95]
M. Cordero, T. Cornish, and R. Cotter. Sequencing peptides without scanning the reflectron: Post-source decay with a curved-field reflectron time-of-flight mass spectrometer. Rapid Communications in Mass Spectrometry, 9:1356-1361, 1995.
[CGAP99]
G. Corthals, S. Gygi, R. Aebersold, and S. Patterson. Proteome Research: 2D Gel Electrophoresis and Detection Methods, Ed. Rabilloud, T., chapter Identification of Proteins by Mass Spectrometry, pages 197-231. Springer, New York, 1999.
[CGMW96]
K. Cox, S. Gaskell, M. Morris, and A. Whiting. Role of the site of protonation in the low-energy decompositions of gas-phase peptide ions. Journal of the American Society of Mass Spectrometry, 7:522-531, 1996.
[CKT+00]
T. Chen, M. Kao, M. Tepel, J. Rush, and G. Church. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. 11th Annual SIAM-ACM Symposium on Discrete Algorithms (SODA 2000), pages 389-398, 2000.
[CLS99]
P. Chaurand, F. Luetzenkirchen, and B. Spengler. Peptide and protein identification by matrix-assisted laser desorption ionization (maldi) and maldi-post-source-decay time-of-flight mass spectrometry. J Am Soc Mass Spectrometry, 10:91-103, 1999.
[CM98]
P. Crain and J. McCloskey. Applications of mass spectrometry to the characterization of oligonucleotides and nucleic acids. Current Opinion in Biotechnology, 9:25-34, 1998.
[Cre93]
T. Creighton. Proteins: Structures and Molecular Properties, 2nd ed, chapter Chemical Properties of Polypeptides, page 34. W.H. Freeman and Company, 1993.
[CWBK93]
B. Chait, R. Wang, R. Beavis, and S. Kent. Protein ladder sequencing. Science, 262:89-92, 1993.
[DAC+99a]
V. Dancik, T. Addona, K. Clauser, J. Vath, and P. Pevzner. De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology, 6:327-342, 1999.
[DAC+99b]
V. Dancik, T. Addona, K. Clauser, J. Vath, and P. Pevzner. De novo peptide sequencing via tandem mass spectrometry: A graph-theoretical approach. RECOMB, pages 135-144, 1999.
[DE96]
V. Dondeti and H. Emmons. Max-min matching problems with multiple assignments. Journal of Optimization Theory and Applications, 91:491-511, 1996.
[EMY94]
J. Eng, A. McCormack, and J. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 5:976-989, 1994.
[FdCGB95]
J. Fernandez-de Cossio, J. Gonzalez, and V. Besada. A computer program to aid the sequencing of peptides in collision-activated decomposition experiments. Computer Applications in the Biosciences, 11:427-434, 1995.
[FdCGB+98]
J. Fernandez-de Cossio, J. Gonzalez, L. Betancourt, V. Besada, G. Padron, Y. Shimonishi, and T. Takao. Automated interpretation of high energy collision-induced dissociation spectra of singly protonated peptides by 'seqms', a software aid for de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 12:1867-1878, 1998.
[FdCGS+99]
J. Fernandez-de Cossio, J. Gonzalez, Y. Satomi, T. Shima, N. Okumura, V. Besada, L. Betancourt, G. Padron, Y. Shimonishi, and T. Takao. Automated interpretation of low energy collision-induced dissociation spectra by 'seqms', a software aid for de novo sequencing by tandem mass spectrometry. Electrophoresis, 21:1694-1699, 1999.
[Fen91]
C. Fenselau. Beyond gene sequencing: analysis of protein structure with mass spectrometry. Annual Review of Biophysics and Biophysical Chemistry, 20:205-220, 1991.
[FHM+93]
A. Falick, W. Hines, K. Medzihradszky, M. Baldwin, and B. Gibson. Low-mass ions produced from peptides by high-energy collision-induced dissociation in tandem mass spectrometry. J Am Soc Mass Spectrom, pages 882-893, 1993.
[FQC98]
D. Fenyo, J. Qin, and B. Chait. Protein identification using mass spectrometric information. Electrophoresis, 19:998-1005, 1998.
[FS96]
M. Fitzgerald and G. Siuzdak. Biochemical mass spectrometry: Worth the weight? Chemistry & Biology, 3:707-715, 1996.
[GME+95]
P. Griffin, M. MacCoss, J. Eng, R. Blevins, J. Aaronson, and J. Yates, III. Direct database searching with maldi-psd spectra of peptides. Rapid Communications in Mass Spectrometry, 19:1546-1551, 1995.
[GMG+99]
R. Gras, M. Muller, E. Gasteiger, S. Gay, P. Binz, W. Bienvenut, C. Hoogland, J. Sanchez, A. Bairoch, D. Hochstrasser, and R. Appel. Improving protein identification from peptide mass fingerprinting through a parameterized multi-level scoring algorithm and an optimized peak detection. Electrophoresis, 20:3535-3550, 1999.
[GP97]
Q. Gu and G. Prestwich. Efficient peptide ladder sequencing by maldi-tof mass spectrometry using allyl isothiocyanate. Journal of Peptide Research, 49:484-491, 1997.
[GVP+96]
K. Gevaert, J. Verschelde, M. Puype, J. Van Damme, M. Goethals, S. DeBoeck, and J. Vandekerckhove. Structural analysis and identification of gel-purified proteins, available in the femtomole range, using a novel computer program for peptide sequence assignment, by matrix-assisted laser desorption ionization-reflectron time-of-flight-mass spectrometry. Electrophoresis, 17:918-924, 1996.
[Haj88]
B. Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13:311-329, 1988.
[HBS+93]
W. Henzel, T. Billeci, J. Stults, S. Wong, C. Grimley, and C. Watanabe. Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc Natl Acad Sci, U.S.A., 90:5011-5015, 1993.
[HFBG92]
W. Hines, A. Falick, A. Burlingame, and B. Gibson. Pattern-based algorithm for peptide sequencing from tandem high energy collision-induced dissociation mass spectra. J Am Soc Mass Spectrom, 3:326-336, 1992.
[HHF95]
A. Hsieh, C. Ho, and K. Fan. An extension of the bipartite weighted matching problem. Pattern Recognition Letters, 16:347-353, 1995.
[Hin97]
W. Hines. personal communication, PerSeptive Biosystems, 1997.
[HKBC91]
F. Hillenkamp, M. Karas, R. Beavis, and B. Chait. Matrix-assisted laser desorption/ionization mass spectrometry of biopolymers. Analytical Chemistry, 63(24):1193A-1202A, 1991.
[HWH86]
C. Hamm, W. Wilson, and D. Harvan. Peptide sequencing program. Computer Applications in the Biosciences, 2:115-118, 1986.
[IN86]
K. Ishikawa and Y. Niwa. Computer-aided peptide sequencing by fast atom bombardment mass spectrometry. Biomedical and Environmental Mass Spectrometry, 13:373-380, 1986.
[JB89]
R. Johnson and K. Biemann. Computer program (seqpep) to aid in the interpretation of high-energy collision tandem mass spectra of peptides. Biomed Environ Mass Spectrom, 18:945-957, 1989.
[JMB88]
R. Johnson, S. Martin, and K. Biemann. Collision-induced fragmentation of (m+h)+ ions of peptides: Side chain specific sequence ions. International Journal of Mass Spectrometry and Ion Processes, 86:137-154, 1988.
[JnC96]
J. Jai-nhuknan and C. Cassady. Anion and cation post-source decay time-of-flight mass spectrometry of small peptides: Substance p, angiotensin ii, and renin substrate. Rapid Communications in Mass Spectrometry, 10(13):1678-1682, 1996.
[Joh88]
R. Johnson. Determination of peptide and protein structure by tandem mass spectrometry. MIT PhD Thesis, Dept of Chemistry, 2 volumes, 1988.
[JQCG93]
P. James, M. Quadroni, E. Carafoli, and G. Gonnet. Protein identification by mass spectrometry. Biochemical and Biophysical Research Communications, 195:58-64, 1993.
[Kau95]
R. Kaufmann. Matrix-assisted laser desorption ionization mass spectrometry: a novel analytical tool in molecular biology and biotechnology. Journal of Biotechnology, 41:155-175, 1995.
[KCKS96]
R. Kaufmann, P. Chaurand, D. Kirsch, and B. Spengler. Post-source decay and delayed extraction in matrix-assisted laser desorption/ionization-reflectron time-of-flight mass spectrometry. Are there trade-offs? Rapid Communications in Mass Spectrometry, 10:1199-1208, 1996.
[KGJV83]
S. Kirkpatrick, C. Gelatt Jr, and M. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.
[KH93]
M. Karas and F. Hillenkamp. Matrix-assisted laser desorption ionization mass spectrometry - fundamentals and applications. AIP Conference Proceedings, Second International Conference, Tennessee, 288:447-458, 1993.
[KKS94]
R. Kaufmann, D. Kirsch, and B. Spengler. Sequencing of peptides in a time-of-flight mass spectrometer: Evaluation of postsource decay following matrix-assisted laser desorption ionisation. International Journal of Mass Spectrometry and Ion Processes, pages 355-385, 1994.
[KSL93]
R. Kaufmann, B. Spengler, and F. Lutzenkirchen. Mass spectrometric sequencing of linear peptides by product-ion analysis in a reflectron time-of-flight mass spectrometer using matrix-assisted laser desorption ionization. Rapid Communications in Mass Spectrometry, 7:902-910, 1993.
[Kuh55]
H. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
[LG98]
T. Lin and G. Glish. C-terminal peptide sequencing via multistage mass spectrometry. Analytical Chemistry, 70:5162-5165, 1998.
[LL95]
H. Lee and D. Lubman. Sequence-specific fragmentation generated by matrix-assisted laser desorption/ionization in a quadrupole ion trap/reflectron time-of-flight device. Analytical Chemistry, 67:1400-1408, 1995.
[LM97]
A. Lamond and M. Mann. Cell biology and the genome projects - a concerted strategy for characterizing multiprotein complexes by using mass spectrometry. Trends in Cell Biology, 7:139-142, 1997.
[LS85]
T. Lee and V. Spayth. Computer assisted interpretation of fast atom bombardment mass spectra of peptides. 33rd Conference on Mass Spectrometry and Allied Topics, San Diego, pages 266-267, 1985.
[Man98]
M. Mann. personal email communication, April 1998.
[Mat98]
P. Matsudaira. personal communication, Whitehead Institute, 1998.
[MB94]
K. Medzihradszky and A. Burlingame. The advantages and versatility of a high-energy collision-induced dissociation-based strategy for the sequence and structural determination of proteins. Methods: A Companion to Methods in Enzymology, 6:284-303, 1994.
[MHH+94]
S. Martin, F. Hsieh, W. Hines, D. Dalke, C. Elicone, and M. Vestal. Use of a poroszyme immobilized trypsin cartridge and a voyager elite biospectrometry research station for analysis of myoglobin tryptic peptides. Application Note, PerSeptive Biosystems, PA427, 1994.
[MHR93]
M. Mann, P. Hojrup, and P. Roepstorff. Use of mass spectrometric molecular weight information to identify proteins in sequence databases. Biological Mass Spectrometry, 22:338-345, 1993.
[Mit00]
S. Mitter. personal communication, LIDS, MIT, 2000.
[MSM+83]
T. Matsuo, T. Sakurai, H. Matsuda, H. Wollnik, and I. Katakuse. Improved paas, a computer-program to determine possible amino-acid-sequences of peptides. Biomedical Mass Spectrometry, 10:57-60, 1983.
[Mur96]
K. Murray. Dna sequencing by mass spectrometry. Journal of Mass Spectrometry, 31:1203-1215, 1996.
[MW94]
M. Mann and M. Wilm. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Analytical Chemistry, 66:4390-4399, 1994.
[OSTV95]
Z. Olumee, M. Sadeghi, X. Tang, and A. Vertes. Amino acid composition and wavelength effects in matrix-assisted laser desorption/ionization. Rapid Communications in Mass Spectrometry, 9:744-752, 1995.
[Pap95]
I. Papayannopoulos. The interpretation of collision-induced dissociation tandem mass spectra of peptides. Mass Spectrometry Reviews, 14:49-73, 1995.
[Pet97]
E. Petit. Instrumentation and applications with maldi/tof. presentation at the Whitehead Institute. PerSeptive Biosystems, Voyager Elite Representative, July 1997.
[PHB93]
D. Pappin, P. Hojrup, and A. Bleasby. Rapid identification of proteins by peptide-mass fingerprinting. Current Biology, 3(6):327-332, 1993.
[PPCC99]
D. Perkins, D. Pappin, D. Creasy, and J. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, pages 3551-3567, 1999.
[Pro]
Protein sequencing and amino acid analysis. http://www.biotech.ufl.edu/ pccl/protseq.html.
[PS82]
C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Inc, USA, 1982.
[PTVF95]
W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing, 2 ed. Cambridge University Press, USA, 1995.
[QC96]
J. Qin and B. Chait. Matrix-assisted laser desorption ion trap mass spectrometry: Efficient isolation and effective fragmentation of peptide ions. Analytical Chemistry, 68:2108-2122, 1996.
[RF84]
P. Roepstorff and J. Fohlman. Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomedical Mass Spectrometry, 11:601, 1984.
[RKW88]
J. Rose, W. Klebsch, and J. Wolf. Temperature measurement of simulated annealing placements. International Conference on Computer-Aided Design, pages 514-517, 1988.
[RYM95]
J. Rouse, W. Yu, and S. Martin. A comparison of the peptide fragmentation obtained from a reflector matrix-assisted laser desorption-ionization time-of-flight and a tandem four sector mass spectrometer. Journal of the American Society for Mass Spectrometry, 6:822-835, 1995.
[SB88]
M. Siegel and N. Bauman. An efficient algorithm for sequencing peptides using fast atom bombardment mass spectral data. Biomedical and Environmental Mass Spectrometry, 15:333-343, 1988.
[SBB87]
H. Scoble, J. Biller, and K. Biemann. A graphics display-oriented strategy for the amino acid sequencing of peptides by tandem mass spectrometry. Fresenius Z Anal Chem, 327:239-245, 1987.
[SC98]
D. Suckau and D. Cornett. Protein sequencing by isd and psd maldi-tof ms. Analusis, 26:M18-M21, 1998.
[SCE+97]
A. Shevchenko, I. Chernushevich, W. Ens, K. Standing, B. Thomson, M. Wilm, and M. Mann. Rapid 'de novo' peptide sequencing by a combination of nanoelectrospray, isotopic labeling and a quadrupole/time-of-flight mass spectrometer. Rapid Communications in Mass Spectrometry, 11:1015-1024, 1997.
[Sch97]
M. Schar. Maldi-ms at the ingenieurschule burgdorf: The technique, some applications and expected benefits for the education in modern analytical chemistry. Chimia, 51:782-785, 1997.
[SMMK84]
T. Sakurai, T. Matsuo, H. Matsuda, and I. Katakuse. Paas-3 - a computer-program to determine probable sequence of peptides from mass-spectrometric data. Biomedical Mass Spectrometry, 11:396-399, 1984.
[Spe97]
B. Spengler. Post-source decay analysis in matrix-assisted laser desorption/ionization mass spectrometry of biomolecules. Journal of Mass Spectrometry, 32:1019-1036, 1997.
[SSMW97]
J. Scott, S. Schurch, S. Moore, and C. Wilkins. Evaluation of maldi-ftms for analysis of peptide mixtures generated by ladder sequencing. International Journal of Mass Spectrometry and Ion Processes, 160:291-302, 1997.
[Ste00]
D. Stephenson. personal communication, 2000.
[str95]
Strategy for the interpretation of peptide cad spectra. Typewritten Manuscript, 1995.
[SWM97]
A. Shevchenko, M. Wilm, and M. Mann. Peptide sequencing by mass spectrometry for homology searches and cloning of genes. Journal of Protein Chemistry, 16:481-490, 1997.
[SZK95]
R. Scarberry, Z. Zhang, and D. Knapp. Peptide sequence determination from high-energy collision-induced dissociation spectra using artificial neural networks. J Am Soc Mass Spectrom, 6:947-961, 1995.
[Tay00]
A. Taylor. User guide for sherpa: Your guide to the peaks. http://www.hairyfatguy.com/Sherpa/docs/Sherpa331/SherpaDoc.html, Version 3.3.1, 2000.
[TBG90]
G. Thorne, K. Ballard, and S. Gaskell. Metastable decomposition of peptide [m+h]+ ions via rearrangement involving loss of the c-terminal amino acid residue. Journal of the American Society for Mass Spectrometry, 1:249-57, 1990.
[TJ97]
J. Taylor and R. Johnson. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 11:1067-1075, 1997.
[TWJ96]
J. Taylor, K. Walsh, and R. Johnson. Sherpa: A macintosh-based expert system for the interpretation of electrospray ionization lc/ms and ms/ms data from protein digests. Rapid Communications in Mass Spectrometry, pages 679-687, 1996.
[TYC93]
F. Tseng, W. Yang, and A. Chen. Finding a complete matching with the maximum product on weighted bipartite graphs. Computers Math Applic, 25:65-71, 1993.
[Yal00]
More information on matrix assisted laser desorption ionization (maldi) mass spectrometry. http://info.med.yale.edu/wmkeck/procmald.htm, 2000.
[Yat85]
J. Yates, III. Mass spectrometry and the age of the proteome. Journal of Mass Spectrometry, 33:1-19, 1985.
[Yat96]
J. Yates, III. Protein structure analysis by mass spectrometry. Methods in Enzymology, 271:351-377, 1996.
[YCPH96]
T. Yalcin, I. Csizmadia, M. Peterson, and A. Harrison. The structure and fragmentation of b_n (n ≥ 3) ions in peptide spectra. Journal of the American Society of Mass Spectrometry, 7:233-242, 1996.
[YECB96]
J. Yates, III, J. Eng, K. Clauser, and A. Burlingame. Search of sequence databases with uninterpreted high-energy collision-induced dissociation spectra of peptides. J Am Soc Mass Spectrometry, 7:1089-1098, 1996.
[YEMS95]
J. Yates, III, J. Eng, A. McCormack, and D. Schieltz. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Analytical Chemistry, 67:1426-1436, 1995.
[YGHZ91]
J. Yates, III, P. Griffin, L. Hood, and J. Zhou. Computer aided interpretation of low energy MS/MS mass spectra of peptides. Techniques in Protein Chemistry II, pages 477-485, 1991.
[YGSH93]
J. Yates, III, P. Griffin, S. Speicher, and T. Hunkapiller. Peptide mass maps: A highly informative approach to protein identification. Analytical Biochemistry, 214:397-408, 1993.
[YKC+95]
T. Yalcin, C. Khouw, I. Csizmadia, M. Peterson, and A. Harrison. Why are b ions stable species in peptide spectra? Journal of the American Society of Mass Spectrometry, 6:1164-1174, 1995.
[YME96]
J. Yates, III, A. McCormack, and J. Eng. Mining genomes with MS. Analytical Chemistry News and Features, pages 534A-540A, 1996.
[Zen97a]
R. Zenobi. Frontiers of laser chemical analysis. Chimia, 51:234-236, 1997.
[Zen97b]
R. Zenobi. Laser-assisted mass spectrometry. Chimia, 51:801-803, 1997.
[ZGW95]
E. Zaluzec, D. Gage, and J. Watson. Matrix-assisted laser desorption ionization mass spectrometry: Applications in peptide and protein characterization. Protein Expression and Purification, 6:109-123, 1995.
[ZTEB90]
D. Zidarov, P. Thibault, M. Evans, and M. Bertrand. Determination of the primary structure of peptides using fast atom bombardment mass spectrometry. Biomed Environ Mass Spectrom, 19:13-26, 1990.