Probabilistic Models in Proteomics Research
Maury DePalo
RBIF-103: Probability and Statistics
Brandeis University
November 22, 2004

Preface

This paper is one outcome of a group project completed for the above course. Our group was composed of three members: Patrick Cody, Barry Coflan and myself. We three collaborated on the identification and development of an appropriate topic; on the basic research of the subject matter; and on the preparation and delivery of the class presentation. For both the research and the presentation, we each primarily concentrated on a single aspect of the overall topic, and we have each prepared and submitted an individual paper with the knowledge that our separate papers fit together as part of a larger whole that covers the overall topic more completely. This synergy is evident in the class presentation and in the associated materials.

This present paper focuses on the development and performance of a probabilistic model being used to identify proteins in a complex mixture. At the start of this project, we expected that we would research and present different methods and different metrics used for protein identification. As our individual research proceeded, it became apparent that proteomics researchers who were moving beyond the individual search scores and metrics, and who were integrating these simpler scores into a more comprehensive probability-based model, were having very good success. In my own research, it became evident that a particular research group at the Institute for Systems Biology was having success with what I describe in this paper as a “two-level model”, one that considers both peptide assignments to MS spectra and the consequent peptide-based evidence for the underlying proteins [13][19]. Furthermore, this model was shown to perform better than the individual scores and metrics.
Consequently, the primary focus of this paper is on summarizing the rationale for and the successive development and refinement of this two-level model by these researchers, and on its performance in estimating the probabilities of proteins being present in a complex mixture.

Introduction

The primary goals of proteomics research are to separate, identify, catalog and quantify the proteins and protein complexes that are present in a mixed sample, as representative of a change in the metabolic or signaling state of cells under a variety of experimental conditions. Researchers hope to characterize specific changes in protein levels as either disease or diagnostic markers for a variety of complex diseases. Numerous studies have been performed to this end, using a variety of experimental and analytical techniques [9][11][12]. The process of preparing protein samples for measurement and analysis and the subsequent interpretation of the data resulting from these studies are complex and subject to substantial variability from a number of sources. Each variable introduces an additional dimension of uncertainty in the conclusions that can be drawn from such experiments. Researchers are using a variety of analytical techniques to reduce or eliminate the uncertainty inherent in these methods. This paper examines the application of a particular set of Bayesian-inspired probabilistic models through which a particular group of proteomics researchers are making notable progress toward a clearer understanding of the sensitivities and specificities underlying the effective measurement of protein expression patterns.

Measuring Mixed Protein Samples

Tandem mass spectrometry (MS/MS) is becoming the method of choice for determining the individual protein components within a complex mixture [1][10][17].
The proteins in a mixed sample are first digested using a proteolytic enzyme, such as trypsin, resulting in a set of shorter peptides. The peptides are subjected to reversed-phase chromatography or some other separation technique, and are run through any of a number of different types of mass spectrometer. The MS/MS instrument ionizes and then fragments the peptides to produce characteristic spectra that can be used for identification purposes. The collected MS/MS spectra are usually then searched against a protein sequence database to find the best matching peptide in the database. The matched peptides, those assigned to the generated spectra and used during searching, are then used to infer the set of proteins in the original sample.

The Process

Although conceptually straightforward, the process of obtaining the protein mixture involves a number of variables that can lead to significant variation in the results. First, the extraction process used to acquire the protein sample from the tissue or fluid under study must be reproduced precisely, using the exact sequence of centrifugation, fractionation, dissolution and extraction techniques. Once the sample is obtained, the proteolytic enzyme must be chosen carefully, since each enzyme attacks the proteins at specific amino acid junctures, with different efficiencies, leading to different collections of peptides depending upon the preponderance of those specific amino acids in the sample and the number of missed cleavages. The separation technology, such as gel electrophoresis or any of dozens of types of chromatography, must be performed consistently. Finally, the specific type of mass spectrometry equipment must be used consistently. In this paper we will be mostly concerned with MS/MS, during which selected peptides are further fragmented and computationally reconstructed to provide greater resolution into the exact sequence composition of the peptides.
Each of these steps introduces the potential for variability in the resulting peptide population, in terms of composition, concentration, accuracy and various other factors. The end result of the spectrometry stage is a set of MS/MS spectra that presumably correspond to some subset of the individual peptides that comprised the proteins in the original sample.

Searching the Protein Database

Once the spectra are obtained, each spectrum is searched against a reference database of proteins and their corresponding spectra and/or sequences. Most search algorithms begin by comparing each spectrum against those that would be predicted for peptides obtained from the reference database, using their masses and the precursor ion masses within an acceptable error tolerance. Each spectrum is then assigned a peptide from the database, along with a score that reflects various aspects of the match between the spectrum and the identified peptide. The scores are often based on the number of common fragment ion masses between the spectrum and the peptide (often expressed as a correlation coefficient), but also reflect additional information pertaining to the match, and can be used to help discriminate between correct and incorrect peptide assignments [3][4][7][8][16][18][20][21].

Improving on the Search Results

In an effort to substantiate or further increase the level of confidence associated with the search scores returned with the peptide assignments, researchers have applied various additional criteria to evaluate the search results. Properties of the assigned peptides, such as the number of tryptic termini or the charge state(s), are often used to filter the search results to try to improve their accuracy. Often the search results must be verified by an expert, but this is a time-consuming process that is generally not feasible for larger datasets.
And these approaches do little to reduce the inherent variability in the process. Furthermore, the number of false negatives (correct identifications that are rejected) and false positives (incorrect identifications that are accepted) that result from the application of such filters is generally not well understood. This is complicated by the fact that different researchers use different filtering criteria, making it difficult to compare results across studies.

Applying Probabilistic Models

The bulk of this paper focuses on a pair of statistical models developed by proteomics researchers at the Institute for Systems Biology (ISB), described in detail in [13] and [19]. The inherent two-step process used to first decompose proteins into peptides, and then peptides into spectra, is reflected in a two-level model used to reconstruct the composition of the original protein mixture. The peptide-level model estimates the probabilities of the peptide assignments to the spectra. The protein-level model uses the peptide probabilities to estimate the probabilities of proteins in the mixture. Using these models in tandem has been shown to provide very good discrimination between correct and incorrect assignments, and leads to predictable sensitivity (true-positive) and error (false-positive) rates. We now examine these models individually in greater detail.

The Peptide-Level Model

The next several sections describe how the peptide-level model was derived and refined by successive execution against the available experimental data. Of particular note is the manner in which the model was repeatedly extended, successively introducing into the model additional information known about potential target proteins, and about the processes and technologies being used to identify them [13].
Inferring Peptides from Mass Spectra

The first step toward identifying the proteins in the sample is to identify the peptides represented by the individual MS/MS spectra observed from the sample. As described above, this consists of matching the individual observed spectra against a reference database of proteins and the spectra that correspond to the peptides that comprise each protein. The spectra and the peptides recorded in the database are either actual spectra and peptides observed in previous experiments with the corresponding protein, or spectra and peptides that are predicted for the protein on the basis of computationally anticipated proteolytic activity on that protein. Since the search-and-match process is not exact, a degree of uncertainty exists in any results returned by the database. Most searching and matching algorithms in use today return one or more scores to aid in assessing the accuracy of the matched peptides returned from the database [3][18]. For example, the SEQUEST search algorithm returns a number of individual scores (described further below), each representing an assessment by the matching software of the quality of the match between the experimental spectra and the reference spectra. The challenge facing researchers is how to evaluate and interpret the scores returned by these search algorithms and databases, and how to use them systematically and consistently to reach a conclusion about the presence of the identified peptides in the original sample. Researchers at ISB describe a peptide-level statistical model that estimates the accuracy of these peptide assignments to the observed spectra. The model uses a machine-learning algorithm to distinguish between correct and incorrect peptide assignments, and computes probabilities that the individual peptide assignments are correct using the various matching scores and other known characteristics about proteins in general and the individual proteins in the mixture.

Experimental Conditions and Datasets

To begin, the authors generated a number of individual datasets from various control samples of known, purified proteins at various concentrations. Each sample was run through the process summarized above, subjecting each sample to proteolytic cleavage by trypsin, and subsequent decomposition by ESI (electrospray ionization)-MS/MS. They also generated a training dataset of peptide assignments of known validity by searching the spectra with SEQUEST against a peptide database appended with the sequences of the known control proteins. This allowed them to observe the behavior of the SEQUEST searching and matching algorithms against a known sample of peptides, in the context of a larger database of peptides known to be incorrect with respect to the sample proteins. This was done with both a Drosophila database and a human database. Each of the resulting spectrum matches was reviewed manually by an expert to determine whether it was correct. The result was a set of approximately 1600 peptide assignments of [M + 2H]2+ ions and 1000 peptide assignments of [M + 3H]3+ ions determined to be correct for each of the species databases.

Interpreting the Search Scores

The authors recognized that in order to be useful beyond their initial value, the individual scores returned by the matching algorithm needed to be combined in some manner.
Using Bayes’ Law [2][14][15][22][23], they reasoned that the probability that a particular peptide assignment with a given set of search scores (x1, x2, … xS) is correct (+) could be computed as:

[Eq 1] p(+ | x1, x2, … xS) = p(x1, x2, … xS | +) p(+) / ( p(x1, x2, … xS | +) p(+) + p(x1, x2, … xS | –) p(–) )

where p(x1, x2, … xS | +) and p(x1, x2, … xS | –) represent the probabilities that the search scores (x1, x2, … xS) are found among correctly (+) and incorrectly (–) assigned peptides, respectively, and the prior probabilities p(+) and p(–) represent the overall proportions of correct and incorrect peptide assignments represented in the dataset, determined through the prior analysis using the control samples and searches. These latter values can be considered an indication of the quality of the dataset. Rather than attempting the complex process of computing these probabilities using a joint probability distribution for the several scores (x1, x2, … xS), the authors employed discriminant function analysis [5] to combine the individual search scores into a single discriminant score that was devised to separate the training data into two groups: correct and incorrect peptide assignments. The discriminant score, F, is a weighted combination of the database search scores, computed as:

[Eq 2] F(x1, x2, … xS) = c0 + Sum_i( ci xi )

where c0 is a constant determined through experimentation, and the weights, ci, are derived to maximize between-class variation relative to within-class variation, in this case to maximize the distinction between the correct and incorrect peptide assignments. The function is derived using the training datasets with known peptide assignment validity.
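The relationship between [Eq 2] and the Bayesian posterior of [Eq 1] can be sketched as follows. The coefficients, example scores and likelihood values below are made up for illustration; real values come from fitting against a training set of assignments of known validity.

```python
def discriminant_score(scores, c0, weights):
    """[Eq 2]: F = c0 + sum(ci * xi) over the search scores xi."""
    return c0 + sum(c * x for c, x in zip(weights, scores))

def posterior_correct(likelihood_pos, likelihood_neg, prior_pos):
    """Bayes' Law as in [Eq 1]/[Eq 3]: posterior probability of a
    correct assignment from class-conditional likelihoods and priors."""
    prior_neg = 1.0 - prior_pos
    num = likelihood_pos * prior_pos
    return num / (num + likelihood_neg * prior_neg)

# Hypothetical scores (Xcorr', DeltaCn, SpRank, dM) and weights:
F = discriminant_score([2.1, 0.3, 1.0, 0.02],
                       c0=-1.0, weights=[3.0, 8.0, -0.2, -0.5])
# Hypothetical likelihoods of that F among correct/incorrect assignments:
p = posterior_correct(likelihood_pos=0.5, likelihood_neg=0.05, prior_pos=0.3)
```

The point of the discriminant step is visible in the code: the posterior needs only one-dimensional likelihoods of F rather than a joint distribution over all four scores.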
Once derived, the discriminant score can be substituted as a single combined value back into [Eq 1] in place of the original individual search scores, and the resulting probabilities computed as follows:

[Eq 3] p(+ | F) = p(F | +) p(+) / ( p(F | +) p(+) + p(F | –) p(–) )

where p(+ | F) is the probability that the peptide assignment with discriminant score, F, is correct, and p(F | +) and p(F | –) are the probabilities of F using the discriminant score distributions of correct and incorrect peptide assignments, respectively. The authors show that the resulting probabilities retain much of the discriminating power of the original combination of scores, but offer a simpler calculation than the joint distributions required in [Eq 1]. Using this discriminant scoring approach against a variety of search scores returned from the SEQUEST algorithm, four specific SEQUEST scores were found to contribute significantly to effective discrimination: 1) Xcorr, a cross-correlation measure based on the number of peaks of common mass between observed and predicted spectra; 2) Delta Cn, the relative difference between the first and second highest Xcorr scores for all peptides queried from the database; 3) SpRank, a measure of how well the assigned peptide scored, relative to those of similar mass in the database; and 4) dM, the absolute value of the difference in mass between the precursor ion of the spectrum and the assigned peptide. They further discovered that transformation of some (raw) search scores significantly improved the discrimination power of this approach. For example, Xcorr shows a strong dependence on the length of the assigned peptides.
This is because Xcorr reflects the number of matches identified between ion fragments in the observed and predicted spectra, leading to larger values for assignments of longer peptides with more fragment ions than for assignments of shorter peptides with fewer fragments. Consequently, assignments of shorter peptides can be difficult to classify as correct or incorrect, since even the correct assignments will often result in low Xcorr scores. They found that this length dependence could be reduced by transforming Xcorr to Xcorr’, computed as the ratio of the log of Xcorr to the log of the number of fragments predicted for the peptide, using a two-part function that included a threshold for the length of the peptide. It was found that beyond a certain length threshold, Xcorr was largely independent of peptide length, so this threshold length was used in the calculation [13]. Using this analysis, the SEQUEST scores Xcorr’ and Delta Cn were found to contribute the most to the discrimination achieved by the function between the correct and incorrect peptide assignments. After reviewing these results against the training dataset, they observed excellent distinction between the correct and incorrect peptide assignments, with 84% of correct peptide assignments having F scores of 1.7 or greater, and 99% of incorrect assignments having F scores below that value. Recognizing that [Eq 3] would be sensitive to the F score distributions, the authors computed the distributions of the F scores for the training datasets. By binning the F scores into discrete intervals 0.2 wide, the distributions of the scores among the correct and incorrect assignments were determined.
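The two-part length correction described above can be sketched as follows. The length threshold and the assumption of two predicted fragment ions per residue are illustrative placeholders, not the published constants.

```python
import math

def xcorr_prime(xcorr, peptide_length, length_threshold=15):
    """Length-corrected Xcorr': log(Xcorr) divided by the log of the
    number of predicted fragments, with the peptide length capped at a
    threshold beyond which Xcorr is roughly length-independent.
    Threshold and fragment count are assumptions for illustration."""
    effective_length = min(peptide_length, length_threshold)
    num_fragments = 2 * effective_length  # assumed b- and y-ion series
    return math.log(xcorr) / math.log(num_fragments)
```

With this transform, a short and a long peptide with the same raw Xcorr no longer receive systematically different corrected scores once the cap is reached.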
The probability that a correct peptide assignment has discriminant score, F, was found to fit a Gaussian distribution, with calculated mean, m, and standard deviation, s, as follows:

[Eq 4] p(F | +) = ( 1 / ( s Sqrt( 2 pi ) ) ) e^( –(F – m)^2 / (2 s^2) )

Furthermore, the probability that an incorrect peptide assignment has a discriminant score, F, was found to fit a gamma distribution, with parameter g set below the minimum F in the dataset, and parameters a and b computed from the population. The resulting distribution is computed as follows:

[Eq 5] p(F | –) = ( (F – g)^(a–1) e^( –(F – g) / b ) ) / ( b^a Gamma(a) )

These two expressions for p(F | +) and p(F | –) were then substituted back into [Eq 3], improving the accuracy of the computed probabilities that the peptides assigned to the spectra in the training dataset are correct.

Considering the Number of Tryptic Termini

If we know that the process to generate the peptides includes a specific proteolytic enzyme, we can exploit our knowledge of the specific amino acids cleaved by that enzyme to further inform our database search and our probability calculations. The ISB researchers used their knowledge that trypsin cleaves proteins on the –COOH side of lysine or arginine to determine the number of tryptic termini (NTT) of each peptide assigned to a spectrum, and then used the NTT as additional information for assessing whether the assignments are correct. The value for NTT indicates how many of a peptide's termini are consistent with cleavage by the specific proteolytic enzyme used, trypsin: either 0 (neither terminus is tryptic), 1 (only one of the two termini is tryptic), or 2 (both termini are tryptic).
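The substitution of [Eq 4] and [Eq 5] into [Eq 3] can be sketched as below. The distribution parameters and prior used in the example are hypothetical, standing in for values fitted from a training set.

```python
import math

def gaussian_pdf(F, m, s):
    """[Eq 4]: density of discriminant score F among correct assignments."""
    return math.exp(-(F - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def gamma_pdf(F, a, b, g):
    """[Eq 5]: density of F among incorrect assignments, shifted by g."""
    if F <= g:
        return 0.0
    return ((F - g) ** (a - 1) * math.exp(-(F - g) / b)) / (b ** a * math.gamma(a))

def posterior(F, prior_pos, m, s, a, b, g):
    """[Eq 3] with the fitted distributions substituted in."""
    pos = gaussian_pdf(F, m, s) * prior_pos
    neg = gamma_pdf(F, a, b, g) * (1.0 - prior_pos)
    return pos / (pos + neg)

# Hypothetical parameters: correct assignments centered near F = 2.5,
# incorrect assignments gamma-distributed with shift g below all F values.
p = posterior(2.0, prior_pos=0.5, m=2.5, s=0.8, a=2.0, b=1.0, g=-5.0)
```

Note that g shifts the gamma distribution so that it assigns nonzero density to every F actually observed, which is why it is set below the minimum F in the dataset.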
Similar consideration of alternative cleavage enzymes would also be valid. The NTT distributions were found to be sufficiently distinct to be a useful additional piece of evidence for computing probabilities of peptide assignments. In the training dataset, the incorrectly assigned peptides had NTT = 0, 80% of the time; NTT = 1, 19% of the time; and NTT = 2, 1% of the time. Conversely, the correctly assigned peptides had NTT = 0, 3% of the time; NTT = 1, 23% of the time; and NTT = 2, 74% of the time. The authors hypothesized that combining the NTT information with the discriminant score, F, would improve the probability calculation for the peptide assignments. They reasoned that of two peptide assignments with the same F score, the one with NTT = 2 would be more likely to be correct than one with NTT = 0, since peptides with NTT = 2 are more prevalent among correct peptide assignments, given that trypsin was used in their study. Using Bayes’ Law, the probability that a peptide assignment is correct would be expanded to:

[Eq 6] p(+ | F, NTT) = p(F, NTT | +) p(+) / ( p(F, NTT | +) p(+) + p(F, NTT | –) p(–) )

At this point one must consider whether the database scores used to compute the discriminant score, F, are dependent on the number of tryptic termini of the peptides. To determine this, the authors plotted the distribution of discriminant scores, F, for each subset of peptides with NTT = 0, 1, and 2 separately, and found that their distributions were very similar.
If we conclude from this that F and NTT are independent among both correct and incorrect assignments, the likelihoods in [Eq 6] factor, and the simplified calculation becomes:

[Eq 7] p(+ | F, NTT) = p(F | +) p(NTT | +) p(+) / ( p(F | +) p(NTT | +) p(+) + p(F | –) p(NTT | –) p(–) )

The authors also noted that if the database search is constrained to only fully tryptic peptides, then all assigned peptides will have NTT = 2, effectively simplifying [Eq 7] back to [Eq 3], since NTT is no longer a discriminating factor in the determination of the resulting probability.

Extending the Model to Other Datasets

One might be tempted to conclude that such a model might be generally applicable to all experimental datasets. However, the authors noted that the discriminant score, F, can vary significantly from dataset to dataset, leading to reduced accuracies in the computed probabilities. For example, the Xcorr’ score from SEQUEST is known to be strongly affected by the levels of signal-to-noise reflected in the spectra. Additional inaccuracies would result from variations in the NTT distributions, due either to the efficiency of the trypsin cleavage activity, or to the presence of contaminants such as proteases in the sample. In addition, the NTT distributions can vary depending upon the prevalence of lysine and arginine in the proteins residing in the particular reference database used for searching. Finally, the prior probabilities of correctly and incorrectly assigned spectra would be expected to vary with each experimental dataset, potentially as a result of sample purity and spectral quality. Consequently, the extension of the model from the training dataset to all datasets (i.e. as a global model) would not be considered valid.
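As a concrete illustration of the NTT evidence and of [Eq 7], the sketch below counts the tryptic termini of an assigned peptide within its parent protein (ignoring, for simplicity, the exception that trypsin does not cleave before proline) and combines the training-set NTT frequencies quoted above with example F-score likelihoods. The protein and peptide strings and the F likelihood values are hypothetical.

```python
def count_tryptic_termini(protein, peptide):
    """NTT of a peptide located within its parent protein sequence.
    Trypsin cleaves C-terminal to lysine (K) or arginine (R)."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    ntt = 0
    # N-terminus is tryptic if the peptide begins the protein
    # or the preceding residue is K or R.
    if start == 0 or protein[start - 1] in "KR":
        ntt += 1
    # C-terminus is tryptic if the peptide ends in K/R or ends the protein.
    if peptide[-1] in "KR" or end == len(protein):
        ntt += 1
    return ntt

# NTT frequencies among correct/incorrect assignments, as quoted above.
P_NTT_GIVEN_CORRECT = {0: 0.03, 1: 0.23, 2: 0.74}
P_NTT_GIVEN_INCORRECT = {0: 0.80, 1: 0.19, 2: 0.01}

def posterior_with_ntt(p_F_pos, p_F_neg, ntt, prior_pos):
    """[Eq 7]: F and NTT likelihoods multiply under independence."""
    pos = p_F_pos * P_NTT_GIVEN_CORRECT[ntt] * prior_pos
    neg = p_F_neg * P_NTT_GIVEN_INCORRECT[ntt] * (1.0 - prior_pos)
    return pos / (pos + neg)

# Two assignments with identical F likelihoods but different NTT:
p2 = posterior_with_ntt(0.4, 0.1, ntt=2, prior_pos=0.3)
p0 = posterior_with_ntt(0.4, 0.1, ntt=0, prior_pos=0.3)
```

As the authors reasoned, the same discriminant-score evidence yields a much higher probability when both termini are tryptic (p2 far exceeds p0 here).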
To deal with this issue and thereby improve the model’s ability to discriminate with datasets other than the training dataset, the authors devised a mechanism to compute the prior probabilities and the discriminant score and NTT distributions among correct and incorrect peptide assignments from the specific dataset itself, treating it as a mixture model fit with the expectation-maximization (EM) algorithm [6]. In a mixture model, each matched spectrum contributes to the correct and incorrect peptide assignment distributions in proportion to its computed probability of being correctly and incorrectly assigned, respectively. The EM algorithm works through an iterative two-step process that finds the distributions that best fit the observed data. With each iteration, the mixture model distributions more closely match the observed data, converging until there is negligible difference between the model and the observed data. Upon termination, the algorithm reports the final probabilities that peptides are correctly assigned to spectra using the learned distributions. Using this approach, the model can tune itself more tightly to the particular dataset under study, minimizing several types of bias that might affect its ability to discriminate.

Assessing the Peptide-Level Model

The EM method described above was examined against the test datasets of combined MS/MS spectra generated from various runs on the control samples of proteins. The spectra were searched against a peptide database appended with the control proteins. The discriminant score distributions predicted for both positive and negative assignments were found to closely match the actual positive and negative distributions of the test dataset. They also observed close correlation between the actual and model-derived prior probabilities and NTT distributions.
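The iterative mixture-model idea can be sketched minimally as below. For simplicity both components are modeled here as Gaussians, whereas the authors fit a Gaussian and a gamma distribution; the initialization scheme and iteration count are also illustrative assumptions.

```python
import math
import random

def normal_pdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def em_two_gaussians(scores, iterations=50):
    """EM fit of a two-component mixture to observed discriminant scores.
    Each score contributes to the 'correct' and 'incorrect' component in
    proportion to its current posterior probability (its responsibility)."""
    xs = sorted(scores)
    prior = 0.5                       # crude initialization
    m_pos, m_neg = xs[-1], xs[0]
    s_pos = s_neg = max((xs[-1] - xs[0]) / 4.0, 1e-3)
    for _ in range(iterations):
        # E-step: posterior that each score is a correct assignment.
        resp = []
        for x in scores:
            p = prior * normal_pdf(x, m_pos, s_pos)
            q = (1 - prior) * normal_pdf(x, m_neg, s_neg)
            resp.append(p / (p + q))
        # M-step: re-estimate the prior, means and spreads from the
        # responsibility-weighted data.
        w = sum(resp)
        prior = w / len(scores)
        m_pos = sum(r * x for r, x in zip(resp, scores)) / w
        m_neg = sum((1 - r) * x for r, x in zip(resp, scores)) / (len(scores) - w)
        s_pos = max(math.sqrt(sum(r * (x - m_pos) ** 2
                                  for r, x in zip(resp, scores)) / w), 1e-3)
        s_neg = max(math.sqrt(sum((1 - r) * (x - m_neg) ** 2
                                  for r, x in zip(resp, scores)) / (len(scores) - w)), 1e-3)
    return prior, (m_pos, s_pos), (m_neg, s_neg), resp

# Synthetic dataset: two overlapping score populations of equal size.
random.seed(0)
data = [random.gauss(3.0, 0.5) for _ in range(200)] + \
       [random.gauss(0.0, 0.5) for _ in range(200)]
prior, pos_params, neg_params, resp = em_two_gaussians(data)
```

The final responsibilities play the role of the reported peptide-assignment probabilities: they are the posteriors under the distributions the algorithm has learned from this dataset alone.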
Furthermore, good agreement was demonstrated for all values of NTT, further justifying the conclusion that the discriminant scores and NTT values for both correct and incorrect peptide assignments are independent. Not surprisingly, peptide assignments with NTT = 2 had much higher probabilities for any discriminant score, F, relative to peptide assignments with NTT = 0 or 1. This was consistent with the higher proportion of peptides with NTT = 2 among correct assignments than among incorrect assignments. The authors demonstrated the accuracy of the computed probabilities by plotting the actual probability that peptide assignments were correct as a function of the computed probability for the combined test data spectra. Spectra were sorted by computed probability, and the mean computed probability and actual probability were determined using a sliding window of 100 spectra. Good correspondence between the computed and actual probabilities was shown [Figure 1], indicating that the computed probabilities are an accurate reflection of the likelihood of correct assignment of peptides to spectra.

[Figure 1]

The authors then plotted the results of a similar analysis using individual models computed from each of the smaller datasets derived from the separate MS/MS runs, as well as a combined model from the combined datasets. The probabilities computed for the individual MS/MS runs were found to be nearly as accurate as those computed from the single model from the combined data.

Sensitivity and Error Rates of Peptide Identification

It is well known that probabilistic testing methods that are designed to discriminate between correct and incorrect identification of a test condition will exhibit varying sensitivities when reporting correct and incorrect results. Sensitivity is defined as the ability to correctly identify a positive test condition (true positive) when it is actually present.
Specificity is defined as the ability to correctly identify a negative condition (true negative) when it is actually not present. It is well known that the accuracy of a testing method is simultaneously dependent upon the sensitivity and specificity to the test condition, as well as the prevalence of the test condition among the full population being tested [14][15][23]. An ideal model would enable complete and unambiguous separation between correct and incorrect peptide assignments. However, in practice, this is generally not feasible. As an alternative, the authors demonstrated the use of a threshold value for their model, with all probabilities above the threshold accepted as correct, and all probabilities below the threshold rejected as incorrect. A user of this model would need to balance the desire for a low threshold, which would maximize sensitivity at the cost of an increased error rate (i.e. false positives), against the desire for a higher threshold, which would ensure a lower error rate at the cost of decreased sensitivity and lost true positives. The relative importance of these factors would need to be determined by the individual user for each study at hand. The authors go on to illustrate the trade-off between these two criteria in the performance of the peptide-level model. A range of values for the minimum probability threshold was used to compute the resulting sensitivity and error rates for each probability threshold value. These results are depicted in [Figure 2].

[Figure 2]

As a comparison, a plot of several individual SEQUEST scores and NTT in the same graph clearly showed that the peptide-level model outperformed each of the individual SEQUEST scores that might be used to filter the database search results, as evidenced by the model’s higher sensitivity and correspondingly lower error rate at each point that was measured.
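The threshold trade-off can be sketched as follows: given computed probabilities and ground truth (here, a small hypothetical dataset), sweep a minimum-probability threshold and report the sensitivity and the false-positive error rate among accepted assignments.

```python
def sensitivity_and_error(probabilities, is_correct, threshold):
    """Sensitivity = fraction of correct assignments retained;
    error rate = fraction of accepted assignments that are incorrect."""
    accepted = [ok for p, ok in zip(probabilities, is_correct) if p >= threshold]
    total_correct = sum(is_correct)
    sensitivity = sum(accepted) / total_correct if total_correct else 0.0
    error_rate = (len(accepted) - sum(accepted)) / len(accepted) if accepted else 0.0
    return sensitivity, error_rate

# Hypothetical data: higher computed probabilities tend to be correct.
probs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
truth = [True, True, True, True, False, True, False, False]
sens_loose, err_loose = sensitivity_and_error(probs, truth, 0.5)
sens_strict, err_strict = sensitivity_and_error(probs, truth, 0.9)
```

Even in this toy example the trade-off appears: lowering the threshold raises sensitivity but also raises the error rate, which is exactly the balance the user must strike.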
The authors point out that the user must choose an appropriate probability threshold for a given dataset. The accuracy of the peptide-level model enables the expected sensitivity and the expected false identification rate for any selected minimum probability threshold to be computed from the underlying probabilities [Figure 3].

[Figure 3]

The sensitivity and error rates predicted by the model were found to agree well with those observed for the data, and can be used to select the probability threshold that achieves the optimal trade-off between the two criteria.

The Protein-Level Model

Once equipped with a reliable model for estimating probabilities at the peptide level, a protein-level model can be constructed using the peptide-level probabilities as input to facilitate inference about which proteins are present in the sample. The next several sections describe the protein-level model that was developed in a separate study by the same authors.

Getting Peptides Organized by Protein

The first step to devising a protein-level model is to group all of the assigned peptides according to their corresponding proteins in the database. This can be a difficult process when one (or more) assigned peptide is a so-called “degenerate” peptide, meaning that its sequence actually appears in more than a single entry in the protein sequence database. This can occur when the reference database being searched is comprised of multiple species containing homologous or redundant entries, as occurs in some eukaryotic or human protein databases.
Combining Knowledge About Proteins

Once the grouping of peptides is completed, the assigned peptides that correspond to a particular protein and their associated probabilities must be combined to compute a single metric that can further distinguish between correct and incorrect protein identifications. There also exists the unique challenge that some proteins may legitimately have only a single peptide assigned to a corresponding spectrum, and this can be very difficult to distinguish from a false positive result, as most incorrectly identified proteins also have only a single corresponding peptide. A number of different approaches of increasing complexity have been devised for identifying the proteins on the basis of MS/MS peptide spectra. These range from relatively simple filtering and visualization tools that report on the list of proteins corresponding to the assigned spectra, without attempting to resolve degenerate peptides or consider probabilities; to tools that group peptides according to proteins and report a score indicating the confidence of each protein identification; to algorithms that estimate the confidence of protein identifications by taking into account the total number of identified peptides in the data set and the number corresponding to each protein. These tools, although useful as filtering criteria to separate correct from incorrect protein identifications, provide no means to estimate the resulting sensitivity (true-positive) and error (false-positive) rates. The protein-level model described by the authors computes a probability that a protein is present in the sample by combining the probabilities that corresponding peptide assignments are correct, after adjusting them for observed protein grouping information.
The model apportions each degenerate peptide among its corresponding proteins and collapses redundant entries, grouping together those proteins that are impossible to differentiate on the basis of the assigned spectra [19].

Inferring Protein Probabilities on the Basis of Accumulating Evidence

Recall that MS/MS spectra are produced by the peptides that presumably comprise the sample proteins, and not by the proteins themselves. Consequently, all conclusions we might reach about which proteins are present in the sample are based upon the identification of the specific peptides that correspond to those proteins.

Once we have a credible set of individual peptide assignment probabilities (as with the peptide-level model), we can use these to estimate the probability that a particular protein is present in the sample. However, these peptide probabilities do not eliminate all further variability. Many distinct peptides may be identified that all correspond to the same protein of interest, and each of these peptides may in turn be assigned (matched) to more than a single spectrum in the data set. Our next step must be to exploit our knowledge of how peptides comprise proteins to improve our likelihood of identifying the correct proteins.

Each peptide assigned to a spectrum contributes evidence that the corresponding protein is present in the sample. If each peptide assignment is considered independent evidence that its corresponding protein is present, then the probability that a protein is present can be computed as the probability that at least one peptide assignment corresponding to the protein is correct:

[Eq 8] p(PROTEIN) = 1 − ∏_i ∏_j (1 − p(+ | D_ij))

where p(+ | D_ij) represents the computed probability that the jth assignment of peptide i to a spectrum is correct.
In this case D represents the accumulated data or observations (such as database search scores, number of tryptic termini, number of missed cleavages, etc.) contributing evidence that the peptide assignment is correct.

The authors reasoned that assignments of the same peptide to multiple spectra are not justifiably independent events, since those spectra would have nearly identical fragmentation patterns. For example, multiple spectra corresponding to a peptide that is not in the reference database, perhaps due to a post-translational modification (PTM), would each likely be assigned to the same incorrect peptide. This would lead to an inaccurately high computed probability for the corresponding protein. Consequently, multiple identifications of the same peptide in a data set should not necessarily result in increased confidence that the corresponding protein is correct [19]. To correct for this factor, the authors suggest a more conservative estimate that uses only a single contributing assignment per peptide, the one showing the highest probability of those assigned to the same protein, computed as:

[Eq 9] p(PROTEIN) = 1 − ∏_i (1 − max_j p(+ | D_ij))

This adjustment was shown to improve the estimated probabilities calculated for these peptide assignments to proteins. It was also noted that this effect was only an issue with multiple assignments of a peptide to MS/MS spectra of the same precursor ion charge state, i.e. [M + 2H]2+ or [M + 3H]3+. The authors found that assignments corresponding to the same peptide but with different charge states each contribute separate evidence for the presence of a corresponding protein, since it is known that peptides with different charge states have significantly different fragmentation patterns. Consequently, these peptide assignments were allowed to remain in the calculation [19].
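A minimal sketch of the two estimates, [Eq 8] and [Eq 9], using hypothetical assignment probabilities:

```python
# Sketch of [Eq 8] and [Eq 9]: protein probability from peptide-assignment
# probabilities. p_matrix[i] lists the probabilities p(+|D_ij) for the j
# assignments of peptide i to spectra (hypothetical numbers).

def protein_prob_independent(p_matrix):
    """[Eq 8]: treat every assignment as independent evidence."""
    prod = 1.0
    for peptide in p_matrix:
        for p in peptide:
            prod *= (1.0 - p)
    return 1.0 - prod

def protein_prob_conservative(p_matrix):
    """[Eq 9]: count only the best assignment of each distinct peptide."""
    prod = 1.0
    for peptide in p_matrix:
        prod *= (1.0 - max(peptide))
    return 1.0 - prod

# Two distinct peptides; the first was matched to three spectra.
p_matrix = [[0.9, 0.85, 0.8], [0.5]]
```

On this example, [Eq 8] inflates the protein probability by letting three near-identical spectra of the same peptide count three times, while [Eq 9] counts that peptide once at its best probability.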
Exploiting Protein Grouping Information

It is known that correct peptide assignments, more than incorrect assignments, tend to correspond to "multi-hit" proteins. Conversely, incorrect peptide assignments tend to correspond to proteins to which no other correctly assigned peptide corresponds. This pattern is even more pronounced in "high coverage" data sets, where the number of acquired MS/MS spectra is relatively large with respect to the complexity of the sample. Consequently, probabilities computed for peptide assignments in the context of the complete data set from which they are calculated may not be as accurate for the peptide subsets grouped according to corresponding proteins. These probabilities must be adjusted to reflect whether the protein is a "multi-hit" protein in the data set.

This factor is computed by estimating each peptide's number of sibling peptides (NSP), i.e. the number of other correctly identified peptides that correspond to the same protein. The NSP value for each peptide is computed as the sum of the individual maximum probabilities of all of its sibling peptides for a given protein. The authors demonstrated that the difference in NSP values for correct versus incorrect peptide assignments was pronounced: 92% of correct peptide assignments had an NSP value above 5 (with an average of 7), versus fewer than 1% of incorrect peptide assignments, the majority of which had a value below 0.25 (with an average of 0.01). The authors assume (reasonably) that the NSP value is independent of the database search scores and other observations aggregated under D, and arrive at the calculation:

[Eq 10] p(+ | D, NSP) = p(+ | D) p(NSP | +) / [ p(+ | D) p(NSP | +) + p(− | D) p(NSP | −) ]

The authors demonstrate that the adjusted probabilities have improved power to discriminate correct from incorrect database search results, and are more accurate among subsets of peptides grouped according to corresponding proteins. The authors also note that NSP distributions can be expected to vary between data sets, reflecting differences in dataset size (number of spectra), database size, sample complexity and relative concentrations, and data quality. They conducted a further analysis using an EM-like algorithm that converged on an adjusted set of probabilities for p(+ | D, NSP) [19]. This approach could likely be used to improve comparisons across disparate data sets.

NSP Distribution Based on Sample Complexity and Dataset Size

As an additional consideration of the NSP distribution, the authors plotted the log of the ratio p(NSP | +) / p(NSP | −) learned by the protein-level model. A ratio greater than 1 (positive log) indicates that peptide probabilities are enhanced by including the NSP information, whereas a ratio less than 1 (negative log) indicates that the NSP adjustment reduces the probability that a peptide assignment is correct. They plotted the impact of NSP on this ratio for two data sets consisting of approximately the same number of spectra, but with one sample containing 18 proteins (a relatively simple sample, depicted with triangles) and a second (more complex) sample containing hundreds of proteins (depicted with squares) [Figure 4].

Figure 4

One would expect a greater percentage of the correctly identified proteins in the simpler sample to be multi-hit proteins (since there are fewer total proteins).
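Stepping back to the adjustment itself, the Bayesian update in [Eq 10] can be sketched directly. The NSP likelihood values below are purely illustrative, not the distributions learned by the model.

```python
# Sketch of [Eq 10]: Bayesian adjustment of a peptide probability using
# its NSP (number of sibling peptides) value. The likelihoods p(NSP|+)
# and p(NSP|-) below are illustrative stand-ins, not learned values.

def adjust_for_nsp(p_plus_d, lik_nsp_correct, lik_nsp_incorrect):
    """Return p(+|D,NSP) per [Eq 10], assuming NSP is independent of D."""
    num = p_plus_d * lik_nsp_correct
    den = num + (1.0 - p_plus_d) * lik_nsp_incorrect
    return num / den

# A peptide with p(+|D) = 0.6 whose protein has many other confident
# siblings (a high NSP value, far more likely under the "correct" model):
boosted = adjust_for_nsp(0.6, 0.8, 0.05)
# The same peptide with a low NSP value (more typical of incorrect hits):
penalized = adjust_for_nsp(0.6, 0.1, 0.7)
```

The same search-based probability is thus pushed up or down depending on how strongly the peptide's protein is corroborated by siblings.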
In fact, the data show that the effect of the log ratio is more pronounced in the low complexity sample, with the ratio being more strongly positive at high NSP and more strongly negative at low NSP [19]. Therefore, incorporating NSP data into the calculation enhances the probabilities of peptides with high NSP, and reduces the probabilities of peptides with low NSP, to a greater extent in the low complexity data set than in the higher complexity data set.

The authors performed a similar comparison, but this time using an increasing number of spectra in the data set, run against the exact same sample of proteins. One would expect that as the size of the dataset (number of spectra) increases with the sample complexity kept constant, the sample coverage would increase, so that more correctly identified proteins would be "multi-hit".

In addition, the degree of adjustment for NSP should also depend on the likelihood of observing a given peptide's siblings in the sample. For any protein, the likelihood of observing sibling peptides depends on a number of factors, including its overall abundance in the sample, its length, the number of tryptic peptides (or other peptides if a different enzyme is used) and other factors. Furthermore, some peptides are rarely seen, because their physicochemical properties result in poor ionization efficiency or poor fragmentation. Consequently, some proteins might produce only one distinctive peptide that could be identified in a given MS/MS run. These proteins should not have their probabilities reduced due to a low NSP value, as the low value is not sufficient indication that the protein is absent [19].

Reducing False Negatives

The authors demonstrate that, since the protein-level model incorporates both the log ratio and the unadjusted probabilities, among all peptides in the data set having a low NSP value, peptides with unadjusted probability p(+ | D) close to 1.0 are penalized less than those elsewhere in the range.
This ensures that the adjustment of probability based on NSP does not inappropriately suppress the identification of proteins represented by only a small number of peptides relative to other proteins in the sample. This is often the case with very small proteins or low-abundance proteins, and such proteins are retained as long as they are identified by at least one high probability peptide [19].

The authors go on to deal with degenerate peptides, those that correspond to more than a single protein due to the presence of homologous proteins, splicing variants, or redundant entries in the reference database. The impact of degenerate peptides can be reduced essentially by apportioning such peptides among their possible proteins according to the estimated probabilities of those proteins in the original sample. So, if a given peptide corresponds to several different proteins, the relative weight assigned to the hypothesis that the peptide belongs to a given protein is apportioned according to the probability of that protein relative to all of the proteins matched by the peptide. The protein probabilities in [Eq 9] are then calculated using these weighting factors along with the maximum probability among the multiple assignments. The model learns the degenerate peptide weights using the EM algorithm.

It is occasionally possible that, even following this adjustment, certain proteins remain indistinguishable. In these cases, the indistinguishable proteins are reported as a group and assigned a single probability that any member of the group is present in the sample. The treatment of degenerate peptides can be combined with the NSP adjustment to compute an accurate probability that each protein is present in the sample [19].

Assessing the Protein-Level Model

The effectiveness of the model was tested using several datasets representing different numbers of proteins searched against different reference databases.
These results are summarized in [Figure 5].

Figure 5

The first three datasets, 18prot_Hinf, 18prot_Dr and 18prot_Hum, were generated from the same set of MS/MS spectra, but searched against reference databases of increasing size. The 18prot_Hum database has many degenerate proteins, due to its large number of homologous proteins, splicing variants and other redundant entries. The other two datasets, 18prot_Dr and 18prot_Hinf, have progressively fewer degenerate proteins. [Figure 5] shows that the model performed well on all datasets, independent of the reference database that was searched. Using the NSP information yielded greater accuracy in correctly identifying proteins and suppressing incorrectly identified proteins. By contrast, when the NSP information was not used, the number of incorrect identifications increased significantly [19].

The model was then tested against datasets generated from more complex samples. All of the protein identifications were sorted according to their computed probabilities, and the corresponding actual probabilities were determined using a sliding window of 20 identifications. [Figure 6] shows that the computed probabilities are accurate in comparison to the actual probabilities. A hypothetical ideal model would be represented by the dotted 45° line in the graph.

Figure 6

The probabilities for proteins computed without the NSP adjustment are overestimated, as seen in the graph, particularly in the range of intermediate probabilities. Interestingly, it was shown that nearly all of the incorrectly identified proteins were identified on the basis of only one peptide having a significant (high) probability of being correct. The NSP adjustment penalizes such single peptides, resulting in more accurate probabilities.
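The sliding-window procedure used to estimate the "actual" probabilities can be sketched as follows; the correctness labels here are hypothetical, standing in for the known composition of the test samples.

```python
# Sketch of the validation procedure: identifications are sorted by
# computed probability, and the "actual" probability at each position is
# estimated as the fraction correct within a sliding window of 20.
# The 1/0 correctness labels below are hypothetical.

def windowed_actual(sorted_labels, window=20):
    """sorted_labels: 1/0 correctness flags, sorted by computed prob."""
    out = []
    for i in range(len(sorted_labels) - window + 1):
        chunk = sorted_labels[i:i + window]
        out.append(sum(chunk) / window)
    return out

# 40 identifications: the high-probability half is mostly correct,
# the low-probability half mostly incorrect.
labels = [1] * 18 + [0] * 2 + [0] * 15 + [1] * 5
actual = windowed_actual(labels, window=20)
```

Plotting these windowed fractions against the computed probabilities is what produces curves like those in [Figure 6].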
[Figure 6] shows that with the NSP adjustment the computed probabilities are far more accurate, in fact coming out predominantly slightly conservative relative to the actual probabilities (as evidenced by their appearing to the left of the 45° line in the graph).

Sensitivity and Error Rates on Protein Identification

As with the peptide-level model, the authors plotted the false positive error rates against varying levels of sensitivity for the protein-level model, filtering the data using various values of the minimum computed probability threshold. [Figure 7] indicates that the probabilities computed by the protein-level model have a high power to discriminate correct protein identifications from incorrect ones.

Figure 7

As an example, a minimum probability threshold of 0.7 results in 94% sensitivity with a false positive error rate of 1.2%. Of particular interest, 39% of all correct identifications passing the 0.7 filter had only one peptide corresponding to the identified protein, and these were not (incorrectly) suppressed. Probabilities computed without the NSP adjustment had much lower sensitivity at any given false positive error rate. The graph also demonstrates very good agreement between the actual sensitivity and error rates and those predicted by the model. Consequently, researchers can choose a minimum probability threshold that gives the desired level of sensitivity and error rate for any given experimental data analysis [19].

Benefits of the Two-Level Model

It is clear from these studies that the process used to first decompose a protein sample into its constituent peptides, and then reconstitute the proteins on the basis of MS/MS spectra matched to peptides in a reference database, must contend with a number of variables. Any one of these variables can negatively impact the results.
At the same time, the problem space is evidently rich enough to enable researchers to exploit specific knowledge of the proteomics domain as an aid to interpreting the results of specific experiments and analyses. The primary objective is to weigh the relative importance of the various observations and data points bearing on the presence or absence of specific proteins in the mixed sample.

The probabilities computed using the peptide-level model can be used to effectively identify correct peptide assignments and to filter data with predictable false identification error rates. They also serve as useful inputs for estimating the likelihood of the presence of the corresponding proteins in the sample [13]. The protein-level model shows high sensitivity in identifying correct protein assignments, while at the same time demonstrating a low incidence of false positive errors. The model was also able to correctly identify a large number of proteins on the basis of a single peptide. This is of particular importance in proteomics studies, as these proteins are often low-abundance and/or low molecular weight proteins, which are very difficult to identify and are quite often lost under suggested filtering criteria requiring two or more corresponding peptides [19].

These models have been shown to improve the interpretation of data coming from these experiments. The computed probabilities are an accurate measure of confidence to accompany protein identifications, and can potentially provide a standardized way to publish large data sets in order to enable cross-study comparison of results.

Conclusion

In this paper, we examined the rationale and performance of a specific two-level probabilistic model that recognizes the hierarchical nature of the proteomics domain. The model exploits specific knowledge about peptides and the probabilities of their matches against MS/MS spectra being correct.
In addition, the model exploits specific knowledge about proteins and the probabilities that constituent peptides correctly identify those proteins, both singly and in combination. Probabilistic models such as those examined here are starting to show effective discrimination between correct and incorrect identifications of proteins in complex mixtures. As the amount of proteomics data being generated continues to increase and the scale of proteomics research continues to expand, models such as these will become increasingly important analytical tools for proteomics researchers.

References

[1] Aebersold R, Goodlett DR. Mass spectrometry in proteomics. Chem Rev. 2001 Feb;101(2):269-95. PMID: 11712248

[2] Annis, C., "Statistical Engineering: Conditional Probability Applet", http://www.statisticalengineering.com/conditional.htm, 2004.

[3] Bafna V, Edwards N. SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics. 2001;17 Suppl 1:S13-21. PMID: 11472988

[4] Chamrad DC, Korting G, Stuhler K, Meyer HE, Klose J, Bluggel M. Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data. Proteomics. 2004 Mar;4(3):619-28. PMID: 14997485

[5] Delyon, Bernard, Discriminant Function Analysis. StatSoft Electronic Textbook. http://name.math.univ-rennes1.fr/bernard.delyon/textbook/stdiscan.html. 1984-2000.

[6] Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977.

[7] Eriksson J, Chait BT, Fenyo D. A statistical basis for testing the significance of mass spectrometric protein identification results. Anal Chem. 2000 Mar 1;72(5):999-1005. PMID: 10739204

[8] Fenyo D. Identifying the proteome: software tools. Curr Opin Biotechnol. 2000 Aug;11(4):391-5. Review. PMID: 10975459

[9] Gavin AC, Bosche M, Krause R, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002 Jan 10;415(6868):141-7. PMID: 11805826

[10] Gygi SP, Aebersold R. Mass spectrometry and proteomics. Curr Opin Chem Biol. 2000 Oct;4(5):489-94. Review. PMID: 11006534

[11] Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol. 1999 Oct;17(10):994-9. PMID: 10504701

[12] Gygi SP, Rist B, Griffin TJ, Eng J, Aebersold R. Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. J Proteome Res. 2002 Jan-Feb;1(1):47-54. PMID: 12643526

[13] Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002 Oct 15;74(20):5383-92. PMID: 12403597

[14] King, A., "AP Statistics Lectures: Bayes Theorem", http://arnoldkling.com/apstats/index.html, 2004.

[15] Lowry, R., "Bayes Theorem: Conditional Probabilities", http://faculty.vassar.edu/lowry/bayes.html, 1998-2004.

[16] Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994 Dec 15;66(24):4390-9. PMID: 7847635

[17] McDonald WH, Yates JR 3rd. Proteomic tools for cell biology. Traffic. 2000 Oct;1(10):747-54. Review. PMID: 11208064

[18] Moore RE, Young MK, Lee TD. Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom. 2002 Apr;13(4):378-86. PMID: 11951976

[19] Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003 Sep 1;75(17):4646-58. PMID: 14632076

[20] Perkins DN, Pappin DJ, Creasy DM, Cottrell JS.
Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999 Dec;20(18):3551-67. PMID: 10612281

[21] Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y. A computational method for assessing peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics. 2004 Apr;4(4):961-9. PMID: 15048978

[22] Spiegel, M., Schaum's Outline of Probability and Statistics, McGraw-Hill, 2000, 1975.

[23] Wikipedia, the free encyclopedia. Bayesian Inference. http://en.wikipedia.org/wiki/Bayesian_inference. 2004.