MDEPALO Group 2 Project Paper 11-28-04

Probabilistic Models in Proteomics Research
Maury DePalo
RBIF-103: Probability and Statistics
Brandeis University
November 22, 2004
Preface
This paper is one outcome of a group project completed for the above course. Our group was composed of three
members: Patrick Cody, Barry Coflan and myself. We three collaborated on the identification and development of
an appropriate topic; on the basic research of the subject matter; and on the preparation and delivery of the class
presentation. For both the research and the presentation, we each primarily concentrated on a single aspect of the
overall topic, and we have each prepared and submitted an individual paper with the knowledge that our separate
papers fit together as part of a larger whole that covers the overall topic more completely. This synergy is evident in
the class presentation and in the associated materials.
This present paper focuses on the development and performance of a probabilistic model being used to identify
proteins in a complex mixture. At the start of this project, we expected that we would research and present different
methods and different metrics used for protein identification. As our individual research proceeded, it became
apparent that proteomics researchers who were moving beyond the individual search scores and
metrics, and were integrating these simpler scores into a more comprehensive probability-based
model, were having very good
success. In my own research, it became evident that a particular research group at the Institute for Systems Biology
was having success with what I describe in this paper as a “two-level model”, one that considers both peptide
assignments to MS spectra and the consequent peptide-based evidence for the underlying proteins [13][19].
Furthermore, this model was shown to perform better than the individual scores and metrics. Consequently, the
primary focus of this paper is on summarizing the rationale for and the successive development and refinement of
this two-level model by these researchers, and on its performance in estimating the probabilities of proteins being
present in a complex mixture.
Introduction
The primary goals of proteomics research are to separate, identify, catalog and quantify the
proteins and protein complexes that are present in a mixed sample, as representative of a change
in the metabolic or signaling state of cells under a variety of experimental conditions.
Researchers hope to characterize specific changes in protein levels as either disease or diagnostic
markers for a variety of complex diseases. Numerous studies have been performed to this end,
using a variety of experimental and analytical techniques [9][11][12].
The process of preparing protein samples for measurement and analysis and the subsequent
interpretation of the data resulting from these studies are complex and subject to substantial
variability from a number of sources. Each variable introduces an additional dimension of
uncertainty in the conclusions that can be drawn from such experiments. Researchers are using a
variety of analytical techniques to reduce or eliminate the uncertainty inherent in these methods.
This paper examines the application of a particular set of Bayesian-inspired probabilistic models
through which a particular group of proteomics researchers are making notable progress toward a
clearer understanding of the sensitivities and specificities underlying the effective measurement
of protein expression patterns.
Measuring Mixed Protein Samples
Tandem mass spectrometry (MS/MS) is becoming the method of choice for determining the
individual protein components within a complex mixture [1][10][17]. The proteins in a mixed
sample are first digested using a proteolytic enzyme, such as trypsin, resulting in a set of shorter
peptides. The peptides are subjected to reversed-phase chromatography or some other separation
technique, and are run through any of a number of different types of mass spectrometer. An
MS/MS instrument ionizes and then fragments the peptides to produce characteristic spectra that can be
used for identification purposes. The collected MS/MS spectra are usually then searched against
a protein sequence database to find the best matching peptide in the database. The matched
peptides, those assigned to the generated spectra and used during searching, are then used to
infer the set of proteins in the original sample.
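The digestion step can be made concrete with a short sketch. The following Python function performs an in-silico tryptic digest; the function name and the handling of missed cleavages are illustrative assumptions, and real trypsin behavior (e.g. suppressed cleavage before proline) is simplified away.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """In-silico trypsin digest: cleave on the C-terminal side of K or R.

    Simplified sketch: real trypsin is less efficient before proline,
    which this function ignores.
    """
    # Collect fragments by cutting after each lysine (K) or arginine (R).
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    # Join adjacent fragments to model missed cleavages, if requested.
    peptides = list(fragments)
    for k in range(1, missed_cleavages + 1):
        for j in range(len(fragments) - k):
            peptides.append("".join(fragments[j:j + k + 1]))
    return peptides

print(tryptic_digest("MKWVTFISLLR"))  # ['MK', 'WVTFISLLR']
```

Allowing one missed cleavage additionally yields the undigested span "MKWVTFISLLR", illustrating how enzyme efficiency changes the peptide population.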
The Process
Although conceptually straightforward, the process of obtaining the protein mixture can
encounter a number of variables that can lead to significant variation in the results. First, the
extraction process used to acquire the protein sample from the tissue or fluid under study must be
reproduced precisely, using the exact sequence of centrifugation, fractionation, dissolution and
extraction techniques. Once the sample is obtained, the proteolytic enzyme must be chosen,
since each enzyme attacks the proteins at specific amino acid junctures, with different
efficiencies, leading to different collections of peptides depending upon the preponderance of
those specific amino acids in the sample and the number of missed cleavages. The separation
technology, such as gel electrophoresis or any of dozens of types of chromatography, must be
performed consistently. And then the specific type of mass spectrometry equipment must be
used. In this paper we will be mostly concerned with MS/MS, during which selected peptides
are further fragmented and computationally reconstructed to provide greater resolution into the
exact sequence composition of the peptides. Each of these steps introduces the potential for
variability in the resulting peptide population, in terms of composition, concentration, accuracy
and various other factors.
The end result of the spectrometry stage is a set of MS/MS spectra that presumably correspond to
some subset of the individual peptides that comprised the proteins in the original sample.
Searching the Protein Database
Once the spectra are obtained, each spectrum is searched against a reference database of proteins
and their corresponding spectra and/or sequences. Most search algorithms begin by comparing
each spectrum against those that would be predicted for peptides obtained from the reference
database, using their masses and the precursor ion masses within an acceptable error tolerance.
Each spectrum is then assigned a peptide from the database, along with a score that reflects
various aspects of the match between the spectrum and the identified peptide. The scores are
often based on the number of common fragment ion masses between the spectrum and the
peptide (often expressed as a correlation coefficient), but also reflect additional information
pertaining to the match, and can be used to help discriminate between correct and incorrect
peptide assignments [3][4][7][8][16][18][20][21].
Improving on the Search Results
In an effort to substantiate or further increase the level of confidence associated with the search
scores returned with the peptide assignments, researchers have applied various additional criteria
to evaluate the search results. Properties of the assigned peptides, such as the number of tryptic
termini or the charge state(s), are often used to filter the search results to try to improve their
accuracy. Often the search results must be verified by an expert, but this is a time-consuming
process that is generally not feasible for larger datasets. And these approaches do little to reduce
the inherent variability in the process. Furthermore, the number of false negatives (correct
identifications that are rejected) and false positives (incorrect identifications that are accepted)
that result from the application of such filters are generally not well understood. This is
complicated by the fact that different researchers use different filtering criteria, making it
difficult to compare results across studies.
Applying Probabilistic Models
The bulk of this paper focuses on a pair of statistical models developed by proteomics
researchers at the Institute for Systems Biology (ISB), described in detail in [13] and [19]. The
inherent two-step process used to first decompose proteins into peptides, and then peptides into
spectra, is reflected in a two-level model used to reconstruct the composition of the original
protein mixture. The peptide-level model estimates the probabilities of the peptide assignments
to the spectra. The protein-level model uses the peptide probabilities to estimate the probabilities
of proteins in the mixture. Using these models in tandem has been shown to provide very good
discrimination between correct and incorrect assignments, and leads to predictable sensitivity
(true positive) and false positive error rates. We now examine these models individually in
greater detail.
The Peptide-Level Model
The next several sections describe how the peptide-level model was derived and refined by
successive execution against the available experimental data. Of particular note is the manner in
which the model was repeatedly extended, successively introducing additional information
known about potential target proteins, and the processes and technologies being used to identify
them, into the model [13].
Inferring Peptides from Mass Spectra
The first step toward identifying the proteins in the sample is to identify the peptides represented
by the individual MS/MS spectra observed from the sample. As described above this consists of
matching the individual observed spectra against a reference database of proteins and the spectra
that correspond to the peptides that comprise each protein. The spectra and the peptides recorded
in the database are either actual spectra and peptides observed by previous experiments with the
corresponding protein, or spectra and peptides that are predicted for the protein on the basis of
computationally anticipated proteolytic activity on that protein. Since the search-and-match
process is not exact, a degree of uncertainty exists in any results returned by the database. Most
searching and matching algorithms in use today return one or more scores to aid in assessing the
accuracy of the matched peptides returned from the database [3][18]. For example, the
SEQUEST search algorithm returns a number of individual scores (described further below),
each representing an assessment by the matching software of the quality of the match between
the experimental spectra and the reference spectra.
The challenge facing researchers is how to evaluate and interpret the scores returned by these
search algorithms and databases, and how to use them systematically and consistently to reach
a conclusion about the presence of the identified peptides in the original sample.
Researchers at ISB describe a peptide-level statistical model that estimates the accuracy of these
peptide assignments to the observed spectra. The model uses a machine-learning algorithm to
distinguish between correct and incorrect peptide assignments, and computes probabilities that
the individual peptide assignments are correct using the various matching scores and other
known characteristics about proteins in general and the individual proteins in the mixture.
Experimental Conditions and Datasets
To begin, the authors generated a number of individual datasets from various control samples of
known, purified proteins at various concentrations. Each sample was run through the process
summarized above, subjecting each sample to proteolytic cleavage by trypsin, and subsequent
decomposition by ESI (electro-spray ionization)-MS/MS. They also generated a training dataset
of peptide assignments of known validity by searching the spectra against a SEQUEST peptide
database appended with the sequences of the known control proteins. This allowed them to
observe the behavior of the SEQUEST searching and matching algorithms against a known
sample of peptides, in the context of a larger database of peptides known to be incorrect with
respect to the sample proteins. This was done with both a Drosophila database and a human
database. Each of the resulting spectrum matches was reviewed manually by an expert to
determine whether it was correct. The result was a set of approximately 1600 peptide
assignments of [M + 2H]2+ ions and 1000 peptide assignments of [M + 3H]3+ ions determined
to be correct for each of the species databases.
Interpreting the Search Scores
The authors recognized that in order to be useful beyond their initial value, the individual scores
returned by the matching algorithm needed to be combined in some manner. Using Bayes’ Law
[2][14][15][22][23] they reasoned that the probability that a particular peptide assignment with a
given set of search scores (x1, x2, … xS) is correct (+) could be computed as:
[Eq 1] p(+ | x1, x2, … xS) =
p(x1, x2, … xS | +) p(+) / ( p(x1, x2, … xS | +) p(+) + p(x1, x2, … xS | –) p(–) )
where p(x1, x2, … xS | +) and p(x1, x2, … xS | –) represent the probabilities that the search scores
(x1, x2, … xS) are found among correctly (+) and incorrectly (–) assigned peptides, respectively,
and the prior probabilities p(+) and p(–) represent the overall proportion of correct and incorrect
peptide assignments represented in the dataset, determined through the prior analysis using the
control samples and searches. These latter values can be considered an indication of the quality
of the dataset.
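As a minimal illustration of [Eq 1], collapsing the score vector to a single likelihood value per class (the numbers in the usage line are hypothetical, chosen only to show the calculation):

```python
def posterior_correct(lik_pos, lik_neg, prior_pos):
    """Bayes' Law as in [Eq 1]: the probability that a peptide
    assignment is correct, given the likelihood of its score(s) among
    correct (+) and incorrect (-) assignments, and the prior fraction
    of correct assignments in the dataset."""
    prior_neg = 1.0 - prior_pos
    num = lik_pos * prior_pos
    return num / (num + lik_neg * prior_neg)

# A score three times as likely under correct assignments, in a
# dataset where only 20% of assignments are correct overall:
print(round(posterior_correct(0.3, 0.1, 0.2), 3))  # 0.429
```

Note how the low prior (a dataset of poor quality) pulls the posterior well below what the 3:1 likelihood ratio alone would suggest.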
Rather than attempting the complex process of computing these probabilities using a joint
probability distribution for the several scores (x1, x2, … xS), the authors employed discriminant
function analysis [5] to combine together the individual search scores into a single discriminant
score that was devised to separate the training data into two groups: correct and incorrect peptide
assignments. The discriminant score, F, is a weighted combination of the database search scores,
computed as:
[Eq 2] F(x1, x2, … xS) = c0 + Sum Of( ci xi )
where c0 is a constant determined through experimentation, and the weights, ci, are derived to
maximize between-class variation relative to within-class variation, in this case to
maximize the separation between the correct and incorrect peptide assignments. The function is
derived using the training datasets with known peptide assignment validity. Once derived, the
discriminant score can be substituted as a single combined value back into [Eq 1] in place of the
original individual search scores and the resulting probabilities computed as follows:
[Eq 3] p(+ | F) = p(F | +) p(+) / ( p(F | +) p(+) + p(F | –) p(–) )
where p(+ | F) is the probability that the peptide assignment with discriminant score, F, is
correct, and p(F | +) and p(F | –) are the probabilities of F using the discriminant score
distributions of correct and incorrect peptide assignments, respectively. The authors show that
the resulting probabilities retain much of the discriminating power of the original combination of
scores, but offer a simpler calculation than the joint distributions required in [Eq 1].
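[Eq 2] is straightforward to express in code. The weights below are illustrative placeholders, since the actual coefficients must be derived by discriminant analysis of the training data:

```python
def discriminant_score(scores, weights, c0):
    """[Eq 2]: F = c0 + sum over i of (c_i * x_i), a weighted
    combination of the individual database search scores."""
    return c0 + sum(c * x for c, x in zip(weights, scores))

# Hypothetical weights for two search scores:
print(discriminant_score([2.0, 0.5], [1.0, 4.0], c0=-3.0))  # 1.0
```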
Using this discriminant scoring approach against a variety of search scores returned from the
SEQUEST algorithm, four specific SEQUEST scores were found to contribute significantly to
effective discrimination: 1) Xcorr, a cross-correlation measure based on the number of peaks of
common mass between observed and predicted spectra; 2) Delta Cn, the relative difference
between the first and second highest Xcorr score for all peptides queried from the database; 3)
SpRank, a measure of how well the assigned peptide scored, relative to those of similar mass in
the database; and 4) dM, the absolute value of the difference in mass between the precursor ion of
the spectrum and the assigned peptide.
They further discovered that transformation of some (raw) search scores significantly improved
the discrimination power of this approach. For example, Xcorr shows a strong dependence on
the length of the assigned peptides. This is because Xcorr reflects the number of matches
identified between ion fragments in the observed and predicted spectra, leading to larger values
for assignments of longer peptides with more fragment ions than for assignments of shorter
peptides with fewer fragments. Consequently, assignments of shorter peptides can be difficult to
classify as correct or incorrect, since even the correct assignments will often result in low Xcorr
scores. They found that this length dependence could be reduced by transforming Xcorr to
Xcorr’, which was computed as the ratio of the log of Xcorr to the log of the number of
fragments predicted for the peptide, using a two-part function that included a threshold for the
length of the peptide. It was found that beyond a certain length threshold, Xcorr was largely
independent of peptide length, so this factor was used in the calculation [13].
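A sketch of this length correction follows; the fragment count (2(L - 1), for the b- and y-ion series) and the threshold value are illustrative assumptions, not the authors' exact constants:

```python
import math

def xcorr_prime(xcorr, peptide_length, length_threshold=15):
    """Length-corrected Xcorr: the ratio of log(Xcorr) to the log of
    the number of predicted fragment ions, with peptide length capped
    at a threshold beyond which Xcorr was found to be largely
    length-independent (two-part function)."""
    effective_len = min(peptide_length, length_threshold)
    n_fragments = 2 * (effective_len - 1)  # b- and y-ion series
    return math.log(xcorr) / math.log(n_fragments)
```

Under this transformation, a short peptide with a modest raw Xcorr is no longer penalized simply for producing fewer fragment ions.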
Using this analysis, the SEQUEST scores Xcorr’ and Delta Cn were found to
contribute the most to the discrimination achieved by the function between the correct and
incorrect peptide assignments. After reviewing these results against the training dataset, they
observed excellent distinction between the correct and incorrect peptide assignments, with 84%
of correct peptide assignments having F scores of 1.7 or greater, and 99% of incorrect
assignments having F scores below that value.
Recognizing that [Eq 3] would be sensitive to F score distributions, the authors computed the
distributions of the F scores for the training datasets. By binning the F scores into 0.2-wide
discrete intervals, the distributions of the scores among the correct and incorrect
assignments were determined.
The probability that a correct peptide assignment has discriminant score, F, was found to fit a
Gaussian distribution, with calculated mean, m, and standard deviation, s, as follows:
[Eq 4] p(F | +) = ( 1 / ( s Sqrt( 2 pi ) ) ) e^( – (F – m)^2 / ( 2 s^2 ) )
Furthermore, the probability that an incorrect peptide assignment has a discriminant score, F,
was found to fit a gamma distribution, with parameter g set below the minimum F in the dataset,
and parameters a and b computed from the population. The resulting distribution is computed as
follows:
[Eq 5] p(F | –) = ( ( F – g )^( a – 1 ) e^( – ( F – g ) / b ) ) / ( b^a Gamma(a) )
These two expressions for p(F | +) and p(F | –) were then substituted back into [Eq 3], which
improved the calculation of accurate probabilities that the peptides assigned to the spectra in the
training dataset are correct.
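The two fitted densities and their substitution into [Eq 3] can be sketched as follows. The parameter values in the usage line are invented for illustration; in practice they are estimated from the data:

```python
import math

def gaussian_pdf(f, m, s):
    """[Eq 4] p(F|+): Gaussian density with mean m, std deviation s."""
    return math.exp(-(f - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def gamma_pdf(f, a, b, g):
    """[Eq 5] p(F|-): gamma density with shape a, scale b, and shift
    g chosen below the minimum F in the dataset."""
    if f <= g:
        return 0.0
    return ((f - g) ** (a - 1) * math.exp(-(f - g) / b)) / (b ** a * math.gamma(a))

def peptide_probability(f, prior_pos, m, s, a, b, g):
    """[Eq 3] with the fitted densities substituted in."""
    lp = gaussian_pdf(f, m, s) * prior_pos
    ln = gamma_pdf(f, a, b, g) * (1.0 - prior_pos)
    return lp / (lp + ln)

# Invented parameters: correct scores centered near F=3, incorrect near F=0.
p = peptide_probability(2.5, prior_pos=0.3, m=3.0, s=1.0, a=2.0, b=0.8, g=-5.0)
```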
Considering the Number of Tryptic Termini
If we know that the process to generate the peptides includes a specific proteolytic enzyme, we
can exploit our knowledge of the specific amino acids cleaved by that enzyme to further inform
our database search and our probability calculations. The ISB researchers used their knowledge
that trypsin cleaves proteins on the –COOH side of lysine or arginine to first determine the
number of tryptic termini (NTT) that would be present in a sample, and then use the NTT of
peptides assigned to spectra as additional information for assessing whether the assignments are
correct. The value of NTT for an assigned peptide is 0, 1, or 2, indicating how many of the
peptide’s termini are consistent with cleavage by the specific proteolytic enzyme used, trypsin.
Similar consideration of alternative cleavage enzymes would also be valid.
The NTT distributions were found to be sufficiently distinct to be a useful additional piece of
evidence for computing probabilities of peptide assignments. In the training dataset, the
incorrectly assigned peptides had NTT = 0, 80% of the time; NTT = 1, 19% of the time; and
NTT = 2, 1% of the time. Conversely, the correctly assigned peptides had NTT = 0, 3% of the
time; NTT = 1, 23% of the time; and NTT = 2, 74% of the time. The authors hypothesized that
combining the NTT information with the discriminant score, F, would improve the probability
calculation for the peptide assignments. They reasoned that of two peptide assignments with the
same F score (discriminant function score), the one with NTT = 2 would be more likely to be
correct versus one with NTT = 0, since peptides with NTT = 2 would be more prevalent among
correct peptide assignments, given that trypsin was used in their study.
Using Bayes’ Law, the probability that a peptide assignment is correct would be expanded to:
[Eq 6] p(+ | F, NTT) = p(F, NTT | +) p(+) / ( p(F, NTT | +) p(+) + p(F, NTT | –) p(–) )
At this point one must consider whether the database scores used to compute the discriminant
score, F, are dependent on the number of tryptic termini of the peptides. To determine this, the
authors plotted the distribution of discriminant scores, F, for each subset of peptides for NTT =
0, 1, and 2 separately, and found that their distributions were very similar. If we conclude from
this that these measures are independent among both correct and incorrect assignments, the
simplified calculation for [Eq 6] becomes:
[Eq 7] p(+ | F, NTT) =
p(F | +)p(NTT | +) p(+) / ( p(F | +)p(NTT | +) p(+) + p(F | –)p(NTT | –) p(–) )
The authors also noted that if the database search is constrained to only fully tryptic peptides,
then all assigned peptides will have NTT = 2, effectively simplifying [Eq 7] back to [Eq 3], since
NTT is no longer a discriminating factor in the determination of the resulting probability.
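The joint computation in [Eq 7] can be sketched using the training-set NTT distributions quoted above as the conditional NTT probabilities (the F-score likelihoods in the usage lines are hypothetical):

```python
# NTT distributions observed in the training data (quoted above).
P_NTT_CORRECT = {0: 0.03, 1: 0.23, 2: 0.74}
P_NTT_INCORRECT = {0: 0.80, 1: 0.19, 2: 0.01}

def peptide_probability_ntt(p_f_pos, p_f_neg, ntt, prior_pos):
    """[Eq 7]: posterior probability of a correct assignment, assuming
    F and NTT are independent within the correct and incorrect
    populations.  p_f_pos and p_f_neg are p(F|+) and p(F|-)."""
    lp = p_f_pos * P_NTT_CORRECT[ntt] * prior_pos
    ln = p_f_neg * P_NTT_INCORRECT[ntt] * (1.0 - prior_pos)
    return lp / (lp + ln)

# Identical F evidence, different NTT: the fully tryptic peptide
# receives a far higher probability of being correct.
p2 = peptide_probability_ntt(0.3, 0.1, ntt=2, prior_pos=0.5)
p0 = peptide_probability_ntt(0.3, 0.1, ntt=0, prior_pos=0.5)
```

This reproduces the authors' reasoning that, of two assignments with the same discriminant score, the one with NTT = 2 is more likely to be correct.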
Extending the Model to Other Datasets
One might be tempted to conclude that such a model would be generally applicable to all
experimental datasets. However, the authors noted that the discriminant score, F, can vary
significantly from dataset to dataset, leading to reduced accuracies in the computed probabilities.
For example, the Xcorr’ score from SEQUEST is known to be strongly affected by the levels of
signal-to-noise reflected in the spectra. Additional inaccuracies would result from variations in
the NTT distributions, due either to the efficiency of the trypsin cleavage activity, or to the
presence of contaminants such as protease in the sample. In addition, the NTT distributions can
vary depending upon the prevalence of lysine and arginine in the proteins residing in the
particular reference database used for searching. Finally, the prior probabilities of correctly and
incorrectly assigned spectra would be expected to vary with each experimental dataset,
potentially as a result of sample purity and spectral quality. Consequently, the extension of the
model from the training dataset to all datasets (i.e. as a global model) would not be considered
valid.
To deal with this issue and thereby improve the model’s ability to discriminate with datasets
other than the training dataset, the authors devised a mechanism to compute the prior
probabilities and the discriminant score and NTT distributions among correct and incorrect
peptide assignments using the specific dataset as a mixture model, using the expectation
maximization (EM) algorithm [6]. In a mixture model, each matched spectrum contributes to the
correct and incorrect peptide assignment distributions in proportion to its computed probability
of being correctly and incorrectly assigned, respectively. The EM algorithm works through an
iterative two-step process that finds the distributions that best fit the observed data. With each
iteration, the mixture model distributions more closely match the observed data, converging until
there is negligible difference between the model and the observed data. Upon termination, the
algorithm reports the final probabilities that peptides are correctly assigned to spectra using the
learned distributions. Using this approach, the model can be used to tune itself more tightly to
the particular dataset under study, minimizing several types of bias that might affect its ability to
discriminate.
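A compact EM sketch of the iteration described above follows. For brevity both mixture components are modeled as Gaussians here (the paper fits a Gaussian to correct and a gamma distribution to incorrect assignments), so this is a simplified illustration, not the authors' implementation:

```python
import math

def em_mixture(f_scores, n_iter=50):
    """Two-component mixture fit by EM.  Each E-step assigns every
    spectrum a probability of belonging to the 'correct' component;
    each M-step re-estimates the prior and component parameters from
    the probability-weighted data."""
    def npdf(x, m, s):
        return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

    prior_pos = 0.5
    m_pos, s_pos = max(f_scores), 1.0  # crude initialization
    m_neg, s_neg = min(f_scores), 1.0
    probs = []
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters.
        probs = []
        for f in f_scores:
            lp = npdf(f, m_pos, s_pos) * prior_pos
            ln = npdf(f, m_neg, s_neg) * (1.0 - prior_pos)
            probs.append(lp / (lp + ln))
        # M-step: probability-weighted parameter updates.
        w = sum(probs)
        prior_pos = w / len(f_scores)
        m_pos = sum(p * f for p, f in zip(probs, f_scores)) / w
        s_pos = max(math.sqrt(sum(p * (f - m_pos) ** 2
                                  for p, f in zip(probs, f_scores)) / w), 1e-3)
        wn = len(f_scores) - w
        m_neg = sum((1 - p) * f for p, f in zip(probs, f_scores)) / wn
        s_neg = max(math.sqrt(sum((1 - p) * (f - m_neg) ** 2
                                  for p, f in zip(probs, f_scores)) / wn), 1e-3)
    return probs, prior_pos
```

On well-separated synthetic scores the learned prior recovers the true fraction of correct assignments, and the returned responsibilities are the per-spectrum probabilities of correctness.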
Assessing the Peptide-level Model
The EM method described above was examined against the test datasets of combined MS/MS
spectra generated from various runs on the control samples of proteins. The spectra were
searched against a peptide database appended with the control proteins. The discriminant score
distributions predicted for both positive and negative assignments were found to closely match
the actual positive and negative distributions of the test dataset. They also observed close
correlation between the actual and model-derived prior probabilities and NTT distributions.
Furthermore, good agreement was demonstrated for all values of NTT, further supporting the
assumption that the discriminant scores and NTT values for both correct and incorrect peptide
assignments are independent. Not surprisingly, peptide assignments with NTT = 2 had much higher probabilities
for any discriminant score, F, relative to peptide assignments with NTT = 0 or 1. This was
consistent with the higher proportion of peptides with NTT = 2 among correct assignments than
among incorrect assignments.
The authors demonstrated the accuracy of the computed probabilities by plotting the actual
probability that peptide assignments were correct as a function of the computed probability for
the combined test data spectra. Spectra were sorted by computed probability and the mean
computed probability and actual probability were determined using a sliding window of 100
spectra.
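This sliding-window comparison is easy to reproduce. The function below is a hypothetical re-implementation of the analysis described, not the authors' code:

```python
def calibration_curve(computed_probs, is_correct, window=100):
    """Sliding-window comparison of computed vs. actual probability:
    sort assignments by computed probability, then within each window
    compare the mean computed probability against the observed
    fraction of correct assignments."""
    paired = sorted(zip(computed_probs, is_correct))
    points = []
    for i in range(len(paired) - window + 1):
        chunk = paired[i:i + window]
        mean_computed = sum(p for p, _ in chunk) / window
        actual_fraction = sum(c for _, c in chunk) / window
        points.append((mean_computed, actual_fraction))
    return points
```

A well-calibrated model yields points lying close to the diagonal, which is what the authors observed.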
Good correspondence between the computed and actual probabilities was shown [Figure 1],
indicating that the computed probabilities are an accurate reflection of the likelihood of correct
assignment of peptides to spectra.
Figure 1
The authors then plotted the results of a similar analysis using individual models computed from
each of the smaller datasets derived from the separate MS/MS runs, as well as a combined model
from the combined datasets. The probabilities computed for the individual MS/MS runs were
found to be nearly as accurate as those computed from the single model from the combined data.
Sensitivity and Error Rates of Peptide Identification
It is well known that probabilistic testing methods that are designed to discriminate between
correct and incorrect identification of a test condition will exhibit varying sensitivities when
reporting correct and incorrect results. Sensitivity is defined as the ability to correctly identify a
positive test condition (true positive) when it is actually present. Specificity is defined as the
ability to correctly identify a negative condition (true negative) when it is actually absent.
The accuracy of a testing method depends simultaneously upon its sensitivity and specificity
to the test condition, as well as upon the prevalence of the test condition among the full
population being tested [14][15][23].
An ideal model would enable complete and unambiguous separation between correct and
incorrect peptide assignments. However, in practice, this is generally not feasible. As an
alternative, the authors demonstrated the use of a threshold value for their model, with all
probabilities above the threshold accepted as correct, and all probabilities below the threshold
accepted as incorrect. A user of this model would need to balance the desire for setting a low
threshold, which would maximize sensitivity at the cost of an increased error rate (i.e. false
positive), against the desire for a higher threshold, which would ensure a lower error rate, at the
cost of decreased sensitivity and lost true positives. The relative importance of these factors
would need to be determined by the individual user for each study at hand.
The authors go on to illustrate the trade-off between these two criteria in the performance of the
peptide-level model. A range of values for the minimum probability threshold was applied,
and the resulting sensitivity and error rate were computed at each threshold value. These
results are depicted in [Figure 2].
Figure 2
As a comparison, a plot of several individual SEQUEST scores and NTT in the same graph
clearly showed that the peptide-level model outperformed each of the individual SEQUEST
scores that might be used to filter the database search results, as evidenced by the model’s higher
sensitivity and correspondingly lower error rate at each point that was measured.
The authors point out that the user must choose an appropriate probability threshold for a given
dataset. The accuracy of the peptide-level model enables the expected sensitivity and the
expected false identification rate for any selected minimum probability threshold to be computed
from the underlying probabilities [Figure 3].
Figure 3
The sensitivity and error rates predicted by the model were found to agree well with those
observed for the data, and can be used to select the probability threshold that achieves the
optimal trade-off between the two criteria.
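The model-predicted quantities behind this trade-off can be computed directly from the probabilities themselves; a sketch (the function name is ours, and the usage values are hypothetical):

```python
def predicted_tradeoff(probs, threshold):
    """Model-predicted sensitivity and false identification (error)
    rate for a minimum-probability threshold: each accepted assignment
    contributes p to the expected number of correct identifications
    and (1 - p) to the expected number of incorrect ones."""
    accepted = [p for p in probs if p >= threshold]
    if not accepted:
        return 0.0, 0.0
    total_correct = sum(probs)  # expected correct assignments in the dataset
    sensitivity = sum(accepted) / total_correct
    error_rate = sum(1 - p for p in accepted) / len(accepted)
    return sensitivity, error_rate
```

Sweeping the threshold over a range of values traces out the sensitivity/error curve from which the optimal operating point can be chosen.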
The Protein-Level Model
Once equipped with a reliable model for estimating probabilities at the peptide level, a protein-level model can be constructed using the peptide-level probabilities as input to facilitate the
inference about which proteins are present in the sample. The next several sections describe the
protein-level model that was developed in a separate study by the same authors.
Getting Peptides Organized by Protein
The first step to devising a protein-level model is to group all of the assigned peptides according
to their corresponding proteins in the database. This can be a difficult process when one (or
more) assigned peptide is a so-called “degenerate” peptide, meaning that its sequence actually
appears in more than a single entry in the protein sequence database. This can occur when the
reference database being searched is comprised of multiple species containing homologous or
redundant entries, such as occurs in some eukaryotic or human protein databases.
Combining Knowledge About Proteins
Once the grouping of peptides is completed, the assigned peptides that correspond to a particular
protein and their associated probabilities must be combined to compute a single metric that can
further distinguish between correct and incorrect protein identifications. There also exists the
unique challenge that some proteins may legitimately have only a single peptide assigned to a
corresponding spectrum, and this can be very difficult to distinguish from a false positive result,
since most incorrectly identified proteins also have only a single corresponding peptide.
A number of different approaches of increasing complexity have been devised for identifying the
proteins on the basis of MS/MS peptide spectra. These range from the relatively simple filtering
and visualization tools that report on the list of proteins corresponding to the assigned spectra,
without attempting to resolve degenerate peptides or consider probabilities; to tools that group
peptides according to proteins, and report a score indicating the confidence of each protein
identification; to algorithms that estimate the confidence of protein identifications by taking into
account the total number of identified peptides in the data set and the number corresponding to
each protein. These tools, although useful as filtering criteria to separate correct from incorrect protein identifications, provide no means to estimate the resulting sensitivity (true positive rate) and false positive error rate.
The protein-level model described by the authors computes a probability that a protein is present
in the sample, by combining the probabilities that corresponding peptide assignments are correct
after adjusting them for observed protein grouping information.
The model apportions degenerate peptides among their corresponding proteins and collapses redundant entries, grouping together those proteins that are impossible to differentiate on the basis of the assigned spectra [19].
Inferring Protein Probabilities on the Basis of Accumulating Evidence
Recall that MS/MS spectra are produced by peptides that presumably comprise the sample
proteins, and not by the proteins themselves. Consequently, all conclusions we might reach
about which proteins are present in the sample are based upon the identification of the specific
peptides that correspond to those proteins.
Once we have a credible set of individual peptide assignment probabilities (as with the peptide-level model), we can use these to estimate the probability that a particular protein is present in the sample. However, these peptide probabilities do not eliminate all further variability. Many distinct peptides can correspond to the same protein of interest, and each of these peptides may be assigned (matched) to more than a single spectrum in the data set. Our next step must be to exploit our knowledge of how peptides comprise proteins to improve our likelihood of identifying the correct proteins.
Each peptide assigned to a spectrum contributes evidence that the corresponding protein is
present in the sample. If each peptide assignment is considered independent evidence that its
corresponding protein is present, then the probability that a protein is present can be computed as
the probability that at least one peptide assignment corresponding to the protein is correct:
[Eq 8] p(PROTEIN) = 1 – Product_i ( Product_j (1 – p(+ | D_ij)) )
where p(+ | D_ij) represents the computed probability that the jth assignment of peptide i to a
spectrum is correct. In this case D represents the accumulated data or observations (such as
database search scores, number of tryptic termini, number of missed cleavages, etc.) contributing
evidence that the peptide assignment is correct. The authors reasoned that assignments of the
same peptide to multiple spectra are not justifiably independent events, since those spectra would
have nearly identical fragmentation patterns. For example, multiple spectra corresponding to a
peptide that is not in the reference database, perhaps due to a post-translational modification
(PTM), would each likely be assigned to the same incorrect peptide. This would lead to an
inaccurately high computed probability for the corresponding protein. Consequently, multiple
identifications of the same peptide in a data set should not necessarily result in increased
confidence that the corresponding protein is correct [19].
To correct for this factor, the authors suggest a more conservative estimate in which each distinct peptide contributes only once, through its highest-probability assignment among the spectra to which it is matched, computed as:
[Eq 9] p(PROTEIN) = 1 – Product_i (1 – Max_j ( p(+ | D_ij) ) )
This adjustment was shown to improve the estimated probabilities calculated for these peptide
assignments to proteins.
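The two estimators can be sketched side by side. This is a minimal illustration of [Eq 8] and [Eq 9] with hypothetical function names and toy probabilities, not code from the study.

```python
def protein_prob_all(peptide_probs):
    """[Eq 8]: treat every assignment of every peptide as independent
    evidence. peptide_probs maps peptide -> list of assignment
    probabilities p(+|D_ij) for that peptide."""
    p_all_wrong = 1.0
    for probs in peptide_probs.values():
        for p in probs:
            p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

def protein_prob_max(peptide_probs):
    """[Eq 9]: count each distinct peptide once, using only its
    highest-probability assignment (the conservative estimate)."""
    p_all_wrong = 1.0
    for probs in peptide_probs.values():
        p_all_wrong *= (1.0 - max(probs))
    return 1.0 - p_all_wrong

# A peptide matched to three near-identical spectra inflates [Eq 8]
# but contributes only once under [Eq 9]:
probs = {"PEPTIDEA": [0.6, 0.6, 0.6], "PEPTIDEK": [0.5]}
```

With these toy numbers the [Eq 9] estimate is noticeably lower than the [Eq 8] estimate, illustrating why the authors preferred it when repeated assignments are not independent.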
It was also noted that this effect was only an issue with multiple assignments of a peptide to MS/MS spectra of the same precursor ion charge state, i.e. [M + 2H]2+ or [M + 3H]3+. The authors found that assignments corresponding to the same peptide but with different charge states all contribute (separate) evidence for the presence of a corresponding protein, since it is known that peptides with different charge states have significantly different fragmentation patterns. Consequently, these peptide assignments were allowed to remain in the calculation [19].
Exploiting Protein Grouping Information
It is known that correct peptide assignments, more than incorrect assignments, tend to
correspond to “multi-hit” proteins. Conversely, incorrect peptide assignments tend to correspond
to proteins to which no other correctly assigned peptide corresponds. And this pattern is even
more pronounced in “high coverage” data sets, where the number of acquired MS/MS spectra is
relatively large with respect to the complexity of the sample.
Consequently, probabilities
computed for peptide assignments in the context of the complete data set from which they are
calculated may not be as accurate for the peptide subsets grouped according to corresponding
proteins. These probabilities must be adjusted to reflect whether the protein is a “multi-hit”
protein in the database.
This factor is computed by estimating each peptide’s number of sibling peptides (NSP), i.e. the
number of other correctly identified peptides that correspond to the same protein. The NSP
value for each sibling peptide is computed as the sum of the individual maximum probabilities of
all other sibling peptides for a given protein. The authors demonstrated that the difference in
NSP values for correct versus incorrect peptide assignments was pronounced, with 92% of
correct peptide assignments having an NSP value above 5 (with an average of 7), versus fewer
than 1% of incorrect peptide assignments having NSP values above 5, with the majority having a
value below 0.25 (with an average of 0.01).
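The NSP calculation itself is simple to express. The sketch below follows the definition given above (sum of the maximum assignment probabilities of all other sibling peptides); the names and numbers are my own illustration.

```python
def nsp_values(sibling_max_probs):
    """For each peptide of one protein, NSP = sum of the maximum
    assignment probabilities of all *other* peptides ('siblings')
    mapped to that protein. sibling_max_probs maps
    peptide -> its max p(+|D)."""
    total = sum(sibling_max_probs.values())
    return {pep: total - p for pep, p in sibling_max_probs.items()}

# Three peptides mapped to one protein: PEPA's NSP is the sum of
# PEPB's and PEPC's maximum probabilities.
nsp = nsp_values({"PEPA": 0.9, "PEPB": 0.8, "PEPC": 0.3})
```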
The authors assume (reasonably) that the NSP value is independent of the database search scores and other observations aggregated under D, and arrive, via Bayes' theorem, at the calculation:
[Eq 10] p(+ | D, NSP) = p(+ | D) p(NSP | +) / [ p(+ | D) p(NSP | +) + p(– | D) p(NSP | –) ]
The authors demonstrate that the adjusted probabilities have improved power to discriminate
correct and incorrect database search results, and are more accurate among subsets of peptides
grouped according to corresponding proteins. The authors also note that NSP distributions can
be expected to vary between data sets, reflecting differences in dataset size (number of spectra),
database size, sample complexity and relative concentrations, and data quality. They conducted
a further analysis using an EM-like algorithm that converged on an adjusted set of probabilities
for p(+ | D, NSP) [19]. This approach could likely be used to improve comparisons across
disparate data sets.
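The [Eq 10] update can be sketched directly. The function name is hypothetical, and the likelihood values p(NSP | +) and p(NSP | –) would in practice come from the learned NSP distributions rather than being supplied by hand.

```python
def adjust_for_nsp(p_correct, lik_nsp_pos, lik_nsp_neg):
    """[Eq 10]: Bayesian update of a peptide probability p(+|D)
    using the NSP likelihoods p(NSP|+) and p(NSP|-), which are
    assumed independent of the search-score data D."""
    num = p_correct * lik_nsp_pos
    den = num + (1.0 - p_correct) * lik_nsp_neg
    return num / den if den > 0 else 0.0

# An NSP value much more typical of correct assignments raises the
# probability; one typical of incorrect assignments lowers it.
raised = adjust_for_nsp(0.5, 0.9, 0.1)
lowered = adjust_for_nsp(0.5, 0.1, 0.9)
```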
NSP Distribution Based on Sample Complexity and Dataset Size
As an additional consideration of NSP distribution, the authors plotted the log of the ratio of
p(NSP | +) / p(NSP | –) learned by the protein-level model. A ratio greater than 1 (positive log)
indicates that these probabilities are enhanced by including the NSP information, whereas a ratio
less than 1 (negative log) indicates that the NSP adjustment reduces the probability that a peptide
assignment is correct.
They plotted the impact of NSP on this ratio for two data sets consisting of approximately the same number of spectra, but with one sample having 18 proteins (a relatively simple sample, depicted with triangles), and a second (more complex) sample having hundreds of proteins (depicted with squares) [Figure 4]. One would expect a greater percentage of correctly identified proteins in the simpler sample to be multi-hit proteins (since there are fewer total proteins). In fact, the data show that the effect of the log ratio is more pronounced in the low complexity sample, with the ratio being more strongly positive at high NSP, and more strongly negative at low NSP [19].
Therefore, incorporating NSP data into the calculation enhances the probabilities of peptides
with high NSP and reduces the probabilities of peptides with low NSP to a greater extent in the
low complexity data set than in the higher complexity dataset.
The authors perform a similar comparison, but this time using an increasing number of spectra in the data set, run against the exact same sample of proteins. One would expect that as the size of the dataset (number of spectra) increases with the sample complexity kept constant, the sample coverage would increase, so that more correctly identified proteins would be "multi-hit". In addition, the degree of adjustment for NSP should also depend on the likelihood of observing a given peptide's siblings in the sample.
For any protein, the likelihood of observing sibling peptides depends on a number of factors, including its overall abundance in the sample, its length, and the number of tryptic peptides it yields (or other peptides if a different enzyme is used). Furthermore, some peptides are rarely
seen, because their physicochemical properties result in poor ionization efficiency or poor
fragmentation. Consequently, some proteins might only produce one distinctive peptide that
could be identified in a given MS/MS run. These proteins should not have their probabilities
reduced due to a low NSP value, as the low value is not sufficient indication that the protein is
not present [19].
Reducing False Negatives
The authors demonstrate that, since the protein-level model incorporates both the log ratio and the unadjusted probabilities, among all peptides in the data set having a low NSP value, those with an unadjusted probability p(+ | D) close to 1.0 are penalized less than those elsewhere in the range. This ensures that the NSP-based adjustment does not inappropriately suppress proteins identified by only a small number of peptides relative to other proteins in the sample. Such a low peptide count is often the case with
very small proteins or low-abundance proteins, which can still be identified as long as at least one high-probability peptide supports them [19].
The authors go on to deal with degenerate peptides, those that correspond to more than a single protein due to homologous proteins, splicing variants, or redundant entries in the reference database. The impact of degenerate peptides can be reduced essentially by
apportioning such peptides among their possible proteins according to the estimated probabilities
of those proteins in the original sample. So, if a given peptide corresponds to several different
proteins, the relative weight that this peptide belongs to a given protein is apportioned according
to the probability of the given protein relative to all of the proteins matched by the given peptide.
The protein probabilities in [Eq 9] are then calculated using these weighting factors with the
maximum probability among the multiple assignments. The model learns the degenerate peptide
weights using the EM algorithm. It is occasionally possible that even following this adjustment,
certain proteins remain indistinguishable.
In these cases, the indistinguishable proteins are
reported as a group, and assigned a single probability that any member of the group is present in
the sample. Treatment of degenerate peptides can be combined with NSP to compute an
accurate probability that each protein is present in the sample [19].
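The apportionment described above can be sketched as an EM-like alternation: weights for a degenerate peptide are set proportional to the current protein probabilities, and protein probabilities are then recomputed from the weighted peptide evidence using a weighted form of [Eq 9]. This is an illustrative reconstruction under those assumptions, not the authors' published implementation; all names are hypothetical.

```python
def apportion_degenerate(peptide_to_proteins, peptide_max_prob, n_iter=50):
    """EM-like sketch: alternately (E) set each peptide's weight for
    a protein proportional to that protein's current probability,
    and (M) recompute protein probabilities from the weighted
    peptide evidence (a weighted form of [Eq 9])."""
    proteins = {pr for prs in peptide_to_proteins.values() for pr in prs}
    prot_prob = {pr: 0.5 for pr in proteins}  # flat starting point
    for _ in range(n_iter):
        # E-step: weight w_ip proportional to protein probability
        weights = {}
        for pep, prs in peptide_to_proteins.items():
            z = sum(prot_prob[pr] for pr in prs) or 1.0
            for pr in prs:
                weights[(pep, pr)] = prot_prob[pr] / z
        # M-step: weighted "at least one correct peptide" probability
        for pr in proteins:
            p_none = 1.0
            for pep, prs in peptide_to_proteins.items():
                if pr in prs:
                    p_none *= 1.0 - weights[(pep, pr)] * peptide_max_prob[pep]
            prot_prob[pr] = 1.0 - p_none
    return prot_prob

# A peptide shared by proteins A and B, where A also has unique
# evidence: the shared peptide's weight drifts toward A.
result = apportion_degenerate(
    {"SHARED": ["A", "B"], "UNIQ": ["A"]},
    {"SHARED": 0.9, "UNIQ": 0.9})
```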
Assessing the Protein-level Model
The effectiveness of the model was tested using several datasets representing different numbers
of proteins being searched in the context of different reference databases. These results are
summarized in [Figure 5].
Figure 5
The first three datasets, 18prot_Hinf, 18prot_Dr and 18prot_Hum, were generated from the same set of MS/MS spectra, but searched against reference databases of increasing size. The 18prot_Hum database has many degenerate proteins, due to the large number of homologous proteins, splicing variants and other redundant entries. The other two datasets, 18prot_Dr and 18prot_Hinf, have progressively fewer degenerate proteins. [Figure 5] shows that the model
performed well on all datasets, independent of the reference database that was searched. Using
the NSP information yielded greater accuracy in correctly identifying proteins and suppressing
incorrectly identified proteins. By contrast, when NSP information was not used, the number of
incorrect peptides increased significantly [19].
The model was then tested against datasets generated from more complex samples. All of the
protein identifications were sorted according to computed probabilities and the corresponding
actual probabilities were determined using a sliding window of 20 identifications. [Figure 6]
shows that the computed probabilities are accurate in comparison to the actual corresponding
probabilities. A hypothetical ideal model would be represented by the dotted 45° line in the graph.
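The sliding-window procedure for estimating the "actual" probabilities can be sketched as follows. This assumes ground-truth correctness labels are available, as in the authors' control datasets; the function and variable names are my own.

```python
def sliding_window_accuracy(sorted_ids, window=20):
    """Estimate the 'actual' probability at each position as the
    fraction of known-correct identifications within a window of
    neighbouring entries. sorted_ids: list of (computed_prob,
    is_correct) tuples sorted by computed probability, descending."""
    actual = []
    for i in range(len(sorted_ids)):
        lo = max(0, i - window // 2)
        hi = min(len(sorted_ids), lo + window)
        lo = max(0, hi - window)  # keep the window full near the ends
        frac = sum(c for _, c in sorted_ids[lo:hi]) / (hi - lo)
        actual.append(frac)
    return actual

# Toy data: top half correct, bottom half incorrect.
data = [(0.99, 1)] * 20 + [(0.01, 0)] * 20
actual = sliding_window_accuracy(data)
```

Plotting `actual` against the computed probabilities yields the kind of calibration curve shown in [Figure 6].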
Figure 6
The probabilities for proteins computed without the NSP adjustment are overestimated as seen in
the graph, particularly in the range of the intermediate probabilities. Interestingly, it was shown
that nearly all of the incorrectly identified proteins were identified on the basis of only one
peptide having a significant (high) probability of being correct. The NSP adjustment penalizes
such single peptides, resulting in more accurate probabilities. [Figure 6] shows that with the NSP adjustment, the computed probabilities are far more accurate, and in fact tend to be slightly conservative relative to the actual probabilities (as evidenced by their falling to the left of the 45° line in the graph).
Sensitivity and Error Rates on Protein Identification
Similar to the peptide-level model, the authors plotted the false positive error rates against
varying levels of sensitivity for the protein-level model. They filtered the data using various
values for the minimum computed probability threshold.
[Figure 7] indicates that the
probabilities computed by the protein-level model have a high power to discriminate correct
protein identifications from incorrect ones.
Figure 7
As an example, a minimum probability threshold of 0.7 results in 94% sensitivity, with a false
positive error rate of 1.2%. Of particular interest, 39% of all correct identifications passing the
0.7 filter had only one peptide corresponding to the identified protein, and these were not
(incorrectly) suppressed. Probabilities computed without the NSP adjustment had much lower
sensitivity for any given false positive error rate. The graph also demonstrates very good
agreement between the actual sensitivity and error rates and those predicted by the model.
Consequently, researchers can choose a minimum probability threshold that gives the desired
level of sensitivity and error rate for any given experimental data analysis [19].
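Threshold selection of this kind can be sketched as follows, assuming a validation set with known correctness labels; the names and toy numbers are hypothetical.

```python
def sensitivity_error(identifications, threshold):
    """Sensitivity and false-positive error rate when retaining
    identifications whose computed probability >= threshold.
    identifications: list of (computed_prob, is_correct) tuples."""
    kept = [(p, c) for p, c in identifications if p >= threshold]
    n_correct_total = sum(c for _, c in identifications)
    n_correct_kept = sum(c for _, c in kept)
    sensitivity = n_correct_kept / n_correct_total if n_correct_total else 0.0
    error = (len(kept) - n_correct_kept) / len(kept) if kept else 0.0
    return sensitivity, error

# Sweep thresholds to find the desired sensitivity/error trade-off.
ids = [(0.95, 1), (0.9, 1), (0.8, 1), (0.75, 0), (0.4, 1), (0.2, 0)]
sens, err = sensitivity_error(ids, 0.7)
```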
Benefits of the Two-Level Model
It is clear from these studies that the process used to first decompose a protein sample into its
constituent peptides, and then reconstitute the proteins on the basis of MS/MS spectra matched to
peptides in a reference database must contend with a number of variables. Any one of these
variables can negatively impact the results. At the same time, the problem space is evidently rich
enough to enable researchers to exploit specific knowledge of the proteomics domain as an aid to
interpreting the results of specific experiments and analyses. The primary objective is to weigh the relative importance of the various observations and data points bearing on the presence or absence of specific proteins in the mixed sample.
The probabilities computed using the peptide-level model can be used to effectively identify
correct peptide assignments and filter data with predictable false identification error rates. They
also serve as useful inputs for estimating the likelihood of the presence of corresponding proteins
in the sample [13].
The protein-level model shows high sensitivity in identifying correct protein assignments, while at the same time demonstrating a low incidence of false positive errors. The model was also able to correctly identify a large number of proteins on the basis of a single peptide. This is of particular importance in proteomics studies, as these proteins are often low-abundance and/or low molecular weight proteins, which are very difficult to identify and are quite often lost under suggested filtering criteria requiring two or more corresponding peptides [19]. These models have been shown to improve the interpretation of data from such experiments.
The computed probabilities are an accurate measure of confidence to accompany protein
identifications and can potentially provide a standardized way to publish large data sets in order
to enable cross-study comparison of results.
Conclusion
In this paper, we examined the rationale and performance of a specific two-level probabilistic
model that recognizes the hierarchical nature of the proteomics domain. The model exploits
specific knowledge about peptides and the probabilities of their matches against MS/MS spectra
being correct.
In addition, the model exploits specific knowledge about proteins and the
probabilities that constituent peptides correctly identify those proteins, both singly and in
combination.
Probabilistic models such as those examined here are starting to show effective discrimination
between correct and incorrect identifications of proteins in complex mixtures. As the amount of
proteomics data being generated continues to increase and the scale of proteomics research
continues to expand, models such as these will become increasingly important as analytical tools
to proteomics researchers.
References
[1] Aebersold R, Goodlett DR. Mass spectrometry in proteomics. Chem Rev. 2001 Feb;101(2):269-95. PMID: 11712248
[2] Annis, C., "Statistical Engineering: Conditional Probability Applet", http://www.statisticalengineering.com/conditional.htm, 2004.
[3] Bafna V, Edwards N. SCOPE: a probabilistic model for scoring tandem mass spectra against
a peptide database. Bioinformatics. 2001;17 Suppl 1:S13-21. PMID: 11472988
[4] Chamrad DC, Korting G, Stuhler K, Meyer HE, Klose J, Bluggel M. Evaluation of
algorithms for protein identification from sequence databases using mass spectrometry data.
Proteomics. 2004 Mar;4(3):619-28. PMID: 14997485
[5] Delyon, Bernard, Discriminant Function Analysis.
StatSoft Electronic Textbook.
http://name.math.univ-rennes1.fr/bernard.delyon/textbook/stdiscan.html. 1984-2000.
[6] Dempster, A, Laird, N, Rubin, D. Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.
[7] Eriksson J, Chait BT, Fenyo D. A statistical basis for testing the significance of mass
spectrometric protein identification results. Anal Chem. 2000 Mar 1;72(5):999-1005. PMID:
10739204
[8] Fenyo D. Identifying the proteome: software tools. Curr Opin Biotechnol. 2000 Aug;11(4):391-5. Review. PMID: 10975459
[9] Gavin AC, Bosche M, Krause R, et al, Functional organization of the yeast proteome by
systematic analysis of protein complexes. Nature. 2002 Jan 10;415(6868):141-7. PMID:
11805826
[10] Gygi SP, Aebersold R. Mass spectrometry and proteomics. Curr Opin Chem Biol. 2000
Oct; 4(5):489-94. Review. PMID: 11006534
[11] Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. Quantitative analysis of
complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol. 1999 Oct;17(10):
994-9. PMID: 10504701
[12] Gygi SP, Rist B, Griffin TJ, Eng J, Aebersold R. Proteome analysis of low-abundance
proteins using multidimensional chromatography and isotope-coded affinity tags. J Proteome
Res. 2002 Jan-Feb;1(1):47-54. PMID: 12643526
[13] Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate
the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002
Oct 15;74(20):5383-92. PMID: 12403597
[14] King, A., "AP Statistics Lectures: Bayes Theorem", http://arnoldkling.com/apstats/index.html, 2004.
[15] Lowry, R., "Bayes Theorem: Conditional Probabilities", http://faculty.vassar.edu/lowry/bayes.html, 1998-2004.
[16] Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by
peptide sequence tags. Anal Chem. 1994 Dec 15;66(24):4390-9. PMID: 7847635
[17] McDonald WH, Yates JR 3rd. Proteomic tools for cell biology. Traffic. 2000
Oct;1(10):747-54. Review. PMID: 11208064
[18] Moore RE, Young MK, Lee TD. Qscore: an algorithm for evaluating SEQUEST database
search results. J Am Soc Mass Spectrom. 2002 Apr;13(4):378-86. PMID: 11951976
[19] Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying
proteins by tandem mass spectrometry. Anal Chem. 2003 Sep 1;75(17):4646-58. PMID:
14632076
[20] Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by
searching sequence databases using mass spectrometry data. Electrophoresis. 1999 Dec;20(18):
3551-67. PMID: 10612281
[21] Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y.
A computational method for assessing peptide-identification reliability in tandem mass
spectrometry analysis with SEQUEST. Proteomics. 2004 Apr;4(4):961-9. PMID: 15048978
[22] Spiegel, M., Schaum’s Outline of Probability and Statistics, McGraw-Hill, 2000, 1975.
[23] Wikipedia, the free encyclopedia. Bayesian Inference. http://en.wikipedia.org/wiki/
Bayesian_inference. 2004.