Proteomics Informatics – Protein identification II: search engines and protein sequence databases (Week 5)

General Criteria for a Good Protein Identification Algorithm
- The response to random input data should be random.
- Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
- Maximal separation between the scores for correct identifications and the distribution of scores for randomly matching proteins for any data set.
- The statistical significance of the results should be calculated.
- The searches should be fast.

Search Parameters
Parent tolerance             +/- daltons or ppm
Fragment tolerance           +/- daltons or ppm
Complete mods                Cys alkylation
Potential mods (artifacts)   Met/Trp oxidation, Gln/Asn deamidation
Potential mods (PTMs)        Phosphoryl, sulfonyl, acetyl, methyl, glycosyl, GPI
Cleavage                     Trypsin ([KR]|{P})
Scoring method               Scores or statistics
Sequences                    FASTA files

Identification – Peptide Mass Fingerprinting
[Workflow diagram: Sequence DB -> Pick Protein -> Digestion -> All Peptide Masses; sample -> Digestion -> MS; Compare, Score, Test Significance (repeat for each protein) -> Identified Proteins]

Response to Random Data
[Figure; y-axis: Normalized Frequency]

ProFound – Search Parameters
http://prowl.rockefeller.edu/

ProFound – Protein Identification by Peptide Mapping
P(k \mid DI) \propto P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{\sqrt{\frac{2}{\pi}}\,\frac{m_{\max}-m_{\min}}{\sigma_i}\,g_i\,\exp\left[-\frac{(m_i-m_{i0})^2}{2\sigma_i^2}\right]\right\}F_{\text{pattern}}
W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000) 2482-2489

ProFound Results

Peptide Mapping – Mass Accuracy
[Figure: two panels, Mascot and ProFound; y-axes: Score and -log(e); x-axis: Mass Tolerance (Da), 0 to 2]

Peptide Mapping - Database Size
Expectation values, peptide mapping example:
Database searched   Expectation value
S. cerevisiae       4.8e-7
Fungi               8.4e-6
All Taxa            2.9e-4

Missed Cleavage Sites
Expectation values, peptide mapping example:
Missed cleavage sites   Expectation value
u=1                     4.8e-7
u=2                     1.1e-5
u=4                     6.8e-4

Peptide Mapping - Partial Modifications
Protein     Searched Without Modifications   Searched With Possible Phosphorylation of S/T/Y
DARPP-32    0.00006                          0.01
CFTR        0.00002                          0.005
[Figure panels: No Modifications; Phosphorylation (S, T, or Y)]
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.

Peptide Mapping - Ranking by Direct Calculation of the Significance

Tandem MS – Database Search
[Workflow diagram: Lysis -> Fractionation -> Digestion -> LC-MS -> MS/MS; Sequence DB -> Pick Protein -> Pick Peptide -> All Fragment Masses; Compare, Score, Test Significance; repeat for all peptides; repeat for all proteins]

Algorithms

Comparing and Optimizing Algorithms
[Figure: for Algorithm 1 and Algorithm 2, the score distributions of False and True identifications, and Sensitivity vs. 1-Specificity]

MS/MS - Parent Mass Error and Enzyme Specificity
Expectation values, MS/MS example:
Search                  Expectation value
Δm=2, Trypsin           2.5e-5
Δm=100, Trypsin         2.5e-5
Δm=2, non-specific      7.9e-5
Δm=100, non-specific    1.6e-4

xII xI (nb! ny!)

Sequest
Cross-correlation

X! Tandem - Search Parameters
http://www.thegpm.org/

Conventional, single stage searching
[Diagram: spectra and sequences -> Generic search engine: test all cleavages, modifications, & mutations for all sequences]

Some hard problems in MS/MS analysis in proteomics
Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient
Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete
Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete

Multi-stage searching
[Diagram: spectra and sequences -> Tryptic cleavage -> sequences -> Modifications #1 -> Modifications #2 -> Point mutation]

X! Tandem Search Results

Sequence Annotations

Mascot
http://www.matrixscience.com/cgi/search_form.pl?FORMVER=2&SEARCH=MIS

Identification – Spectrum Library Search
[Workflow diagram: Lysis -> Fractionation -> Digestion -> LC-MS/MS -> MS/MS; Spectrum Library -> Pick Spectrum; Compare, Score, Test Significance (repeat for all spectra) -> Identified Proteins]

Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, together with its parent ion z, m/z, sequence, protein accessions & quality.

Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (%) vs. peptide length (0 to 50)]

Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage (residues, peptides) vs. protein Mr (kDa), 10 to 190]

Identification – Spectrum Library Search
Library spectrum (5:25)
Test spectrum (5:25)
Results: 4 peaks selected, 1 peak missed

Identification – Spectrum Library Search
How likely is this?
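A minimal sketch of how the chance of such a match can be put in numbers with a hypergeometric model, using the 5:25 example (25 possible m/z values, 5 library peaks). It assumes the 4 peaks selected by the test spectrum are treated as the draws; the function names are illustrative and not taken from any search engine. Under these assumptions the probabilities of exactly 1, 2, 3 and 4 matches come out near 0.45, 0.15, 0.016 and 0.0004.

```python
from math import comb


def hypergeom_pmf(k: int, n_bins: int, n_library: int, n_test: int) -> float:
    """Probability that exactly k of the n_test selected m/z values fall on
    library peaks, when n_library of the n_bins possible m/z values hold a peak.
    Assumption: every m/z value is equally likely to be selected by chance."""
    if k < 0 or k > min(n_library, n_test):
        return 0.0
    hits = comb(n_library, k)                      # ways to pick k library peaks
    misses = comb(n_bins - n_library, n_test - k)  # ways to pick the non-matching values
    return hits * misses / comb(n_bins, n_test)


def hypergeom_tail(k: int, n_bins: int, n_library: int, n_test: int) -> float:
    """Probability of k or more matches (one common choice of p-value)."""
    upper = min(n_library, n_test)
    return sum(hypergeom_pmf(j, n_bins, n_library, n_test) for j in range(k, upper + 1))


if __name__ == "__main__":
    # 25 possible m/z values, 5 library peaks, 4 peaks selected by the test spectrum
    for matches in range(1, 5):
        p = hypergeom_pmf(matches, n_bins=25, n_library=5, n_test=4)
        print(f"{matches} matches: p = {p:.2g}")
```

The tail function is included because a significance test usually asks for "this many matches or more"; whether the exact or the cumulative probability is reported is a design choice.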
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.

Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037

Identification – Spectrum Library Search
If you have 1000 possible m/z values and 20 peaks in test and library spectrum?
[Figure: p (log scale, 1.0E+00 to 1.0E-14) vs. number of matches (1 to 10)]
- 1 matched: p = 0.6
- 5 matched: p = 0.0002
- 10 matched: p = 0.0000000000001

Identification – Spectrum Library Search
[Diagram: Experimental Mass Spectrum (M/Z) compared against a Library of Assigned Mass Spectra -> Best search result]

X! Hunter
X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.

X! Hunter Result
[Figure: Query Spectrum and Library Spectrum]

Dynamic Range In Proteomics
The goal is to identify and characterize all components of a proteome.
Large discrepancy between the experimental dynamic range and the range of amounts of different proteins in a proteome.
[Figure: Distribution of Protein Amounts, Number of Proteins vs. Log (Protein Amount), with the Experimental Dynamic Range and the Desired Dynamic Range marked]

[Workflow diagram: Sample Extraction -> Protein Labeling -> Protein Separation -> Digestion -> Peptide Labeling -> Peptide Separation -> Ionization -> Mass Separation -> Fragmentation -> Mass Separation -> Detection; annotations: Limit of amount of material, Loss of material, Separation of material, Detection limit, Dynamic range]

Experimental Designs Simulated
[Diagram: Sample (Protein Abundance) -> Protein Separation -> Digestion -> Peptide Separation (# of peptides per bin vs. "Retention time" (bin), bins 1 to k) -> Mass Spectrometry (MS dynamic range for each bin)]

Parameters in Simulation
Sample
● Distribution of protein amounts in sample
Protein Separation
● # of proteins in each fraction
Digestion, Peptide Separation
● Total amount of peptides that are loaded on column (limited by column loading capacity)
● Loss of peptides before binding to the column
● # of peptide fractions
● Loss of peptides after elution off the column
Mass Spectrometry
● Distribution of mass spectrometric response for different peptides present at the same amount
● Dynamic range of mass spectrometer
● Detection limit of mass spectrometer

Simulation Results for 1D-LC-MS
[Workflow: Complex Mixtures of Proteins -> Digestion -> RPC -> MS Analysis]
[Figure: Number of Proteins vs. log(Protein Amount) for Tissue (0 to 6) and Body Fluid (0 to 10); panels: No Protein Separation, and Protein Separation: 10 fractions]

Success Rate of a Proteomics Experiment
[Figure: Distribution of Protein Amounts and Proteins Detected, Number of Proteins vs. Log (Protein Amount)]
DEFINITION: The success rate of a proteomics experiment is defined as the number of proteins detected divided by the total number of proteins in the proteome.
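The success-rate definition is a plain ratio, which the sketch below computes for a toy proteome. Everything beyond the ratio itself is an assumption made for illustration: the normal distribution of log10 protein amounts, the single hard detection limit, and the function names are hypothetical and are not the simulation described in these slides.

```python
import random


def success_rate(n_detected: int, n_total: int) -> float:
    """Number of proteins detected divided by the total number of proteins."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_detected / n_total


def toy_detection(log_amounts: list[float], log_detection_limit: float) -> list[bool]:
    """Assumed toy rule: a protein counts as detected when its log10 amount
    is at or above a single detection limit."""
    return [amount >= log_detection_limit for amount in log_amounts]


# Assumed toy proteome: log10 amounts drawn from a normal distribution.
rng = random.Random(0)
log_amounts = [rng.gauss(3.0, 1.0) for _ in range(10_000)]

detected = toy_detection(log_amounts, log_detection_limit=4.0)
rate = success_rate(sum(detected), len(detected))

if __name__ == "__main__":
    print(f"success rate = {rate:.3f}")
```

Lowering the assumed detection limit moves the cut-off further into the distribution and raises the success rate, which is the kind of dependence the simulated experimental designs explore.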
Relative Dynamic Range of a Proteomics Experiment
[Figure: Distribution of Protein Amounts and Proteins Detected, Number of Proteins vs. Log (Protein Amount); Fraction of Proteins Detected with RDR90, RDR50 and RDR10 marked]
DEFINITION: RELATIVE DYNAMIC RANGE, RDRx, where x is e.g. 10%, 50%, or 90%

Number of Proteins in Mixture
[Figure: Success Rate and Relative Dynamic Range (RDR50) vs. Number of Proteins in Mixture (1 to 100000) for Tissue and Body Fluid, points 1 and 2 marked; Number of Proteins vs. log(Protein Amount) distributions at points 1 and 2]

Amount of Peptides Loaded on the Column
[Figure: Success Rate and Relative Dynamic Range (RDR50) vs. Amount Loaded [µg] (0.01 to 100) for Tissue and Body Fluid, points 2 and 3 marked; Number of Proteins vs. log(Protein Amount) distributions at points 2 and 3]

Peptide Separation
[Figure: Success Rate and Relative Dynamic Range (RDR50) vs. Number of Peptide Fractions (10 to 100000) for Tissue and Body Fluid, points 3 and 4 marked; Number of Proteins vs. log(Protein Amount) distributions at points 3 and 4]

Amount loaded and peptide separation
[Figure: Relative Dynamic Range vs. Success Rate for Tissue, with paths for Protein separation, Amount loaded and Peptide separation through points 1 to 4; Number of Proteins vs. log(Protein Amount) distributions at each point]

Ranges:
- Protein separation: 30000 – 3000 proteins in each fraction
- Amount loaded: 0.1 µg – 10 µg
- Peptide separation: 100 – 1000 fractions

Order:
- 1. Protein separation, 2. Peptide separation, 3. Amount loaded
- 1. Protein separation, 2. Amount loaded, 3. Peptide separation

Repeat Analysis
[Figure series: Number of Proteins vs. log(Protein Amount) after 1 Analysis, 2 Analyses, and so on up to 8 Analyses]

Repeat Analysis: Simulations
[Figure: Success Rate and RDR10 vs. Number of Repeats (0 to 10), Experiment and Simulation]

Summary
• The success rate of proteome analysis is influenced by the following factors (listed in order of importance):
  • The degree of protein separation
  • Amount of peptides loaded on column or mass spectrometric detection limit
  • The degree of peptide separation or mass spectrometric dynamic range