Proteomics Informatics Workshop
Part I: Protein Identification
David Fenyö
February 4, 2011

• Introduction to proteomics
• Introduction to mass spectrometry
• Analysis of mass spectra
• Database searching
• Spectrum library searching
• de novo sequencing
• Significance testing

Why Proteomics?
Geiger et al., “Proteomic changes resulting from gene copy number variations in cancer cells”, PLoS Genet. 2010 Sep 2;6(9). pii: e1001090.

Proteomics Informatics
[Workflow diagram with the labels: Biological System; Experimental Design; Samples; Sample Preparation; Measurements (MS, MS/MS); Data Analysis; Information about each sample; Information Integration; Information about the biological system]
What does the sample contain? How much?

Sample Preparation
[Same workflow diagram, with Sample Preparation expanded: Enrichment, Separation, Digestion, etc.; Top-down and Bottom-up]
What does the sample contain? How much?

Mass Spectrometry (MS)
Ion Source: MALDI, ESI
Mass Analyzer: Quadrupole, Ion Trap (3D, linear), Time-of-Flight, Orbitrap, FTICR
Detector
[Figure: mass spectrum, intensity versus mass/charge]

Mass Spectrometry – MALDI-TOF
[Instrument schematic with the labels: Ion Source (MALDI, Laser, HV); Mass Analyzer (Time-of-Flight); Detector]

Tandem Mass Spectrometry (MS/MS)
[Instrument schematic with the labels: Ion Source; Mass Analyzer 1; Fragmentation; Mass Analyzer 2; Detector; three Quadrupole stages; CAD – Collision Activated Dissociation. Panels of m/z versus time are marked YES or NO, one panel is labeled "Δm/z is constant", and the resulting spectrum is plotted as intensity versus mass/charge]

Dissociation Techniques
CAD: Collision Activated Dissociation (b, y ions), increase of internal energy through collisions
ETD: Electron Transfer Dissociation (c, z ions), radical-driven fragmentation

Dissociation Techniques: CAD versus ETD
CAD                                          ETD
Low charge                                   High charge
Short peptides                               Up to intact proteins
Weakest bonds break first                    More uniform fragmentation
Preferred cleavage N-terminal to proline     No cleavage N-terminal to proline

Liquid Chromatography (LC)-MS/MS
[Instrument schematic with the labels: LC; Ion Source; Mass Analyzer 1; Fragmentation; Mass Analyzer 2; Detector. A series of spectra (intensity versus mass/charge) is acquired over time as numbered scans 1 to 24, in repeating cycles of one MS scan followed by MS/MS 1, MS/MS 2 and MS/MS 3]

Data Independent Acquisition
Data Dependent Acquisition
[Figures: numbered scans in repeating cycles of one MS scan followed by MS/MS 1 to MS/MS 10; axes: intensity versus mass/charge]

Mass Spectrometry – ESI-LC-MS/MS
[Instrument schematic with the labels: Ion Source (ESI); Mass Analyzer 1; Linear Ion Trap; Fragmentation (CAD, ETD); Fragmentation (HCD); Mass Analyzer 2; Orbitrap; Detector]
Olsen J V et al.
Mol Cell Proteomics 2009;8:2759-2769

Charge-State Distributions
[Figure: charge-state distributions of a peptide and of a protein with MALDI and with ESI; peak labels include 1+ to 5+ for the lower charge states and 27+ and 31+ for the protein with ESI; axes: intensity versus mass/charge]
$\frac{m}{z} = \frac{M + nH}{n}$
M - molecular mass
n - number of charges
H - mass of a proton

Isotope Distributions
[Figure: isotope distributions for m = 1035 Da, m = 1878 Da and m = 2234 Da, with peaks at +1 Da, +2 Da and +3 Da from the monoisotopic peak; axes: intensity versus m/z]
Most abundant isotopes: 1H, 12C, 14N, 16O, 32S
Heavier isotopes: 0.015% 2H; 1.11% 13C; 0.366% 15N; 0.038% 17O, 0.200% 18O; 0.75% 33S, 4.21% 34S, 0.02% 36S
Considering only 12C and 13C (p = 0.0111):
n is the number of C in the peptide
m is the number of 13C in the peptide
T_m is the relative intensity of the peptide containing m 13C
$T_m = \binom{n}{m} p^m (1-p)^{n-m}$

Isotope distributions
[Figures: intensity ratio versus peptide mass (two panels); isotope distribution of GFP (29 kDa) with the monoisotopic mass marked, intensity versus m/z]

Noise
[Figure: intensity versus m/z]

Peak Finding
Find maxima of
$S(l) = \sum_{|k-l| \le w/2} I(k)$
The signal in a peak can be estimated with the RMSD, the root-mean-square deviation of the intensities I(k) from their mean $\bar{I}$ over the window $|k-l| \le w/2$,
and the signal-to-noise ratio of a peak can be estimated by dividing the signal by the RMSD of the background.
The centroid m/z of a peak:
$\left(\frac{m}{z}\right)_{\mathrm{centroid}} = \frac{\sum_{|k-l| \le w/2} I(k)\,\frac{m}{z}(k)}{\sum_{|k-l| \le w/2} I(k)}$

Isotope Clusters and Charge State
[Figure: isotope clusters with peak spacings of 1, 0.5 and 0.33 m/z for charge states 1+, 2+ and 3+, shown in several panels; axes: intensity versus m/z]
Possible to determine charge? Panel answers: Yes, Yes, Maybe, No.

Identification – Peptide Mass Fingerprinting
Lysis, Fractionation, Digestion, Mass spectrometry (MS), Identified Proteins

Example data – Peptide Mapping by MALDI-TOF
[Figure: MALDI-TOF spectra with zoomed regions around m/z 1444.0 to 1458.0 and m/z 2378.0 to 2394.0; axes: intensity versus m/z]

Information Content in a Single Mass Measurement
[Figure: average number of matching peptides (1 to 10) versus tryptic peptide mass (1000 to 3000 Da), one panel for Human and one for S. cerevisiae]

Identification – Peptide Mass Fingerprinting
Lysis, Fractionation, Digestion, Mass spectrometry (MS), Identified Proteins
Analysis steps: Peak Finding, Charge determination, De-isotoping, Searching

Identification – Peptide Mass Fingerprinting
[Search diagram with the labels: Sequence DB; Pick Protein; Digestion; All Peptide Masses; MS; Compare, Score, Test Significance; Identified Proteins; Repeat for each protein]

ProFound – Search Parameters
http://prowl.rockefeller.edu/

ProFound Results

Example data – ESI-LC-MS/MS
[Figure: chromatogram versus time and an MS/MS spectrum, % relative abundance (0 to 100) versus m/z (250 to 1000); labeled peaks: 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080; [M+2H]2+]

Peptide Fragmentation
[Instrument schematic with the labels: Ion Source; Mass Analyzer 1; Fragmentation; Mass Analyzer 2; Detector; b and y fragment ions]

Identification – Tandem MS

Tandem MS – Sequence Confirmation
Peptide: S G F L E E D E L K

b ions   residue   y ions
88       S         1166
145      G         1080
292      F         1022
405      L         875
534      E         762
663      E         633
778      D         504
907      E         389
1020     L         260
1166     K         147

[Figure: MS/MS spectrum, % relative abundance (0 to 100) versus m/z (250 to 1000); labeled peaks: 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080; [M+2H]2+]
[The builds of the Sequence Confirmation slide for S G F L E E D E L K continue: the b-ion and y-ion ladders are overlaid on the MS/MS spectrum, and mass differences of 113 and 129 between adjacent fragment peaks are highlighted]

Tandem MS – de novo Sequencing
Amino acid masses

1-letter code   3-letter code   Chemical formula   Monoisotopic   Average
A               Ala             C3H5ON             71.0371        71.0788
R               Arg             C6H12ON4           156.101        156.188
N               Asn             C4H6O2N2           114.043        114.104
D               Asp             C4H5O3N            115.027        115.089
C               Cys             C3H5ONS            103.009        103.139
E               Glu             C5H7O3N            129.043        129.116
Q               Gln             C5H8O2N2           128.059        128.131
G               Gly             C2H3ON             57.0215        57.0519
H               His             C6H7ON3            137.059        137.141
I               Ile             C6H11ON            113.084        113.159
L               Leu             C6H11ON            113.084        113.159
K               Lys             C6H12ON2           128.095        128.174
M               Met             C5H9ONS            131.04         131.193
F               Phe             C9H9ON             147.068        147.177
P               Pro             C5H7ON             97.0528        97.1167
S               Ser             C3H5O2N            87.032         87.0782
T               Thr             C4H7O2N            101.048        101.105
W               Trp             C11H10ON2          186.079        186.213
Y               Tyr             C9H9O2N            163.063        163.176
V               Val             C5H9ON             99.0684        99.1326

[Figure: MS/MS spectrum, % relative abundance (0 to 100) versus m/z (250 to 1000); labeled peaks: 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080; [M+2H]2+. Panel labels: Mass Differences; Sequences consistent with spectrum]

Tandem MS – de novo Sequencing
Fragment peaks: 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079
[Table: all pairwise mass differences between these peaks, one row per peak. The first two rows read:
260: 32 129 145 244 274 373 403 502 518 615 647 760 762 819
292: 97 113 212 242 341 371 470 486 583 615 728 730 787
In the final build of the slide, differences that match amino acid residue masses are labeled with the residue: G, P, V, I/L, D, E, F]

Sequences consistent with the spectrum:
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 - 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
1166 - 1020 - 18 = 128 => K or Q
SGF(I/L)EEDE(I/L)(K/Q)
Tandem MS – de novo Sequencing
Challenges in de novo sequencing:
- Neutral loss (-H2O, -NH3)
- Modifications
- Background peaks
- Incomplete information

Tandem MS – Database Search
[Search diagram with the labels: Lysis; Fractionation; Digestion; LC-MS; MS/MS; Sequence DB; Pick Protein; Pick Peptide; All Fragment Masses; Compare, Score, Test Significance; Repeat for all peptides; Repeat for all proteins]

Tandem MS – Database Search

X! Tandem - Search Parameters
http://www.thegpm.org/

Multi-stage searching
[Diagram: spectra are searched against sequences in successive stages: Tryptic cleavage; Modifications #1; Modifications #2; Point mutation]

X! Tandem Search Results

How many fragment masses are needed for identification?
[Figures: probability of identification (0 to 1) versus number of matching fragments (0 to 15), with the critical number of matching fragments marked; critical number of matching fragments (0 to 16) versus a parameter]

Small peptides are slightly more difficult to identify
[Figures: probability of identification versus number of fragment ions (0 to 20) for m_precursor = 1000, 1500, 2000 and 2500 Da; critical number of fragments versus precursor mass (500 to 3000 Da)]
Δm_precursor = 1 Da; Δm_fragment = 0.5 Da; no modification

A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides
[Figures: probability of identification versus number of fragment ions (0 to 20) for Δm_precursor = 0.01, 1 and 10 Da; critical number of fragments versus precursor mass error (0.001 to 10 Da)]
m_precursor = 2000 Da; Δm_fragment = 0.5 Da; no modification

The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides
[Figures: probability of identification versus number of fragment ions (0 to 20) for Δm_fragment = 0.01, 0.5, 1 and 2 Da; critical number of fragments versus fragment mass error]
m_precursor = 2000 Da; Δm_precursor = 1 Da; no modification

A moderate number of background peaks can be tolerated when identifying unmodified peptides
[Figures: probability of identification versus number of fragment ions (0 to 20) for 0%, 50% and 80% background; critical number of fragments versus background (0 to 100%)]
m_precursor = 2000 Da; Δm_precursor = 1 Da; Δm_fragment = 0.5 Da; no modification

A large number of background peaks can be tolerated if the fragment mass is accurate
[Figures: probability of identification versus number of fragment ions (0 to 20) for 0%, 50% and 80% background; critical number of fragments versus background (0 to 100%)]
m_precursor = 2000 Da; Δm_precursor = 1 Da; Δm_fragment = 0.01 Da; no modification

Identification of phosphopeptides is only slightly more difficult
[Figure: probability of identification versus number of fragment ions (0 to 20) for phosphorylated and unmodified peptides]
m_precursor = 2000 Da; Δm_precursor = 1 Da; Δm_fragment = 0.5 Da

Identification – Spectrum Library Search
[Search diagram with the labels: Lysis; Fractionation; Digestion; LC-MS/MS; MS/MS; Spectrum Library; Pick Spectrum; Compare, Score, Test Significance; Repeat for all spectra; Identified Proteins]

Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (0 to 10%) versus peptide length (0 to 50)]

Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage (0 to 50), by residues and by peptides, versus protein Mr (10 to 190 kDa)]

Spectrum Library Characteristics – Size

Species            Spectra    Peptides   Redundancy
H. sapiens         1002326    270345     ×3.7
P. troglodytes     889232     238688     ×3.7
M. mulatta         754601     195701     ×3.9
M. musculus        732382     199182     ×3.7
R. norvegicus      637776     160439     ×4.0
B. taurus          592070     140063     ×4.2
E. caballus        590514     139849     ×4.2
S. cerevisiae      201253     133166     ×1.5
C. elegans         190952     90981      ×2.1
D. rerio           174049     46546      ×3.7
T. rubripes        169551     36514      ×4.6
D. melanogaster    122353     71928      ×1.7
A. thaliana        111689     62574      ×1.8

Identification – Spectrum Library Search
Library spectrum (5:25)
Test spectrum (5:25)
Results: 4 peaks selected, 1 peak missed

Identification – Spectrum Library Search
How likely is this? Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.

Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037

Identification – Spectrum Library Search
What if you have 1000 possible m/z values and 20 peaks in both the test and the library spectrum?
[Figure: p (log scale, 1.0E+00 to 1.0E-14) versus number of matches (1 to 10)]
1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001

Identification – Spectrum Library Search
[Diagram: Experimental Mass Spectrum (M/Z); Library of Assigned Mass Spectra; Best search result]

X! Hunter Result
[Figure: Query Spectrum compared with Library Spectrum]

Significance Testing
False protein identification is caused by random matching.
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.

Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.
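The hypergeometric model used in the spectrum library slides above can be sketched in a few lines of Python. This is a minimal sketch under the textbook parameterization (a fixed number of possible m/z values, a fixed number of library peaks, and a fixed number of test peaks drawn without replacement); the slides do not spell out exactly how their tabulated probabilities were computed, so the output should not be expected to reproduce that table digit for digit. The function names `hypergeom_pmf` and `hypergeom_tail` are illustrative and do not come from the slides.

```python
from math import comb


def hypergeom_pmf(k: int, n_bins: int, n_library: int, n_test: int) -> float:
    """Probability that exactly k of the n_test test peaks land on library peaks
    by chance, when n_library of the n_bins possible m/z values hold a library peak."""
    if k < max(0, n_test - (n_bins - n_library)) or k > min(n_library, n_test):
        return 0.0
    return (comb(n_library, k) * comb(n_bins - n_library, n_test - k)
            / comb(n_bins, n_test))


def hypergeom_tail(k: int, n_bins: int, n_library: int, n_test: int) -> float:
    """Probability of k or more chance matches (a p-value for observing k matches)."""
    upper = min(n_library, n_test)
    return sum(hypergeom_pmf(j, n_bins, n_library, n_test)
               for j in range(k, upper + 1))


if __name__ == "__main__":
    # Small example: 25 possible m/z values, 5 library peaks, 5 test peaks.
    for k in range(6):
        print(k, hypergeom_pmf(k, 25, 5, 5), hypergeom_tail(k, 25, 5, 5))
    # Larger example: 1000 possible m/z values, 20 peaks in each spectrum.
    for k in (1, 5, 10):
        print(k, hypergeom_tail(k, 1000, 20, 20))
```

Whether the point probability or the tail probability is the more useful quantity depends on the question being asked; the tail is the usual choice when testing whether an observed number of matches could have arisen by chance.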
Significance Testing - Expectation Values
[Diagram: Database Search (M/Z spectrum) → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values]

Rho-diagrams: Overall Quality of a Data Set
Expectation values as a function of score for random matching:
$e(s) = \exp(-s)$
Definition: $E_i$ (i = 0, -1, -2, …) is the number of spectra that have been assigned an expectation value between exp(i) and exp(i-1).
For random matching:
$E_i = \int_{e=\exp(i-1)}^{e=\exp(i)} N\,de = N\{\exp(i) - \exp(i-1)\} = N\exp(i)\{1 - \exp(-1)\}$
$\rho(i) = \log\left(\frac{E_i}{E_0}\right) = \log\left(\frac{N\exp(i)\{1 - \exp(-1)\}}{N\{1 - \exp(-1)\}}\right) = i$

Rho-diagram: Random Matching
[Figure: rho-diagram for random matching; horizontal axis log(e), both axes running from 0 to -6]

Rho-diagram: Data Quality
[Figure: rho-diagram; horizontal axis log(e), both axes running from 0 to -10]

Rho-diagram: Parameters

Summary
Protein identification strategies:
- de novo Sequencing
- Searching Sequence Collections
- Searching Spectrum Libraries
It is important to report the significance of the results.

Google Group for Proteomics in NYC
Please join!

Proteomics Informatics Workshop
Part II: Protein Characterization
February 18, 2011
• Top-down/bottom-up proteomics
• Post-translational modifications
• Protein complexes
• Cross-linking
• The Global Proteome Machine Database

Proteomics Informatics Workshop
Part III: Protein Quantitation
February 25, 2011
• Metabolic labeling – SILAC
• Chemical labeling
• Label-free quantitation
• Spectrum counting
• Stoichiometry
• Protein processing and degradation
• Biomarker discovery and verification

Proteomics Informatics Workshop
Part I: Protein Identification, February 4, 2011
Part II: Protein Characterization, February 18, 2011
Part III: Protein Quantitation, February 25, 2011
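Supplementary sketch for the rho-diagram slides: the definition given there (count the spectra whose expectation value falls between exp(i-1) and exp(i), then take the logarithm of the count relative to the i = 0 bin) can be tried out numerically. The Python below is a minimal sketch that assumes rho(i) = log(E_i / E_0) as read from the slide; the function name `rho_diagram` and the `min_bin` cutoff are hypothetical and do not come from the slides or from any GPM software.

```python
import math
from collections import Counter


def rho_diagram(expectation_values, min_bin=-10):
    """Return {i: rho(i)} with rho(i) = log(E_i / E_0), where E_i is the number
    of spectra whose expectation value e satisfies exp(i-1) < e <= exp(i)."""
    counts = Counter()
    for e in expectation_values:
        if e <= 0.0 or e > 1.0:
            continue  # outside the binned range i = 0, -1, -2, ...
        i = math.ceil(math.log(e))  # bin index with exp(i-1) < e <= exp(i)
        if i >= min_bin:
            counts[i] += 1
    if counts[0] == 0:
        raise ValueError("no spectra in bin 0; rho is undefined")
    return {i: math.log(counts[i] / counts[0]) for i in sorted(counts, reverse=True)}


if __name__ == "__main__":
    # Random matching corresponds to expectation values spread uniformly in e,
    # emulated here with an evenly spaced grid on (0, 1).
    n = 100_000
    uniform_e = [(k + 0.5) / n for k in range(n)]
    for i, rho in rho_diagram(uniform_e, min_bin=-6).items():
        print(i, round(rho, 3))
```

For the uniform input the printed rho values should lie close to the bin index i, which is the diagonal expected for random matching; a data set with real identifications would show an excess of spectra at low expectation values and therefore rise above that diagonal.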