Introduction to Biostatistics and Bioinformatics
Previous Lecture: Regression and Correlation
This Lecture: Proteomics Informatics

Proteomics Informatics – Learning Objectives
• Structure of mass spectrometry data
• Protein identification
• Protein quantitation

Protein Identification and Quantitation by Mass Spectrometry
[Figure: Samples → Peptides → Mass Spectrometry. In the spectrum (intensity vs m/z), m/z gives identity and intensity gives quantity.]

Sample preparation for protein identification, characterization and quantitation
Lysis → Fractionation → Digestion → Mass spectrometry

Overview of Mass spectrometry
Ion Source → Mass Analyzer → Detector
[Figure: spectrum, intensity vs mass/charge]

Mass Spectrometry (MS)
$$\mathbf{F} = m\mathbf{a} = m\,\frac{d\mathbf{v}}{dt} = z\,(\mathbf{E} + \mathbf{v} \times \mathbf{B})$$
$$\frac{m}{z}\,\frac{d\mathbf{v}}{dt} = \mathbf{E} + \mathbf{v} \times \mathbf{B}$$

Example data – MALDI-TOF
[Figure: MALDI-TOF spectra at increasing zoom, Intensity vs m/z. Caption: Peptide intensity vs m/z]

Peptide Fragmentation
Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
[Figure: peptide fragmentation into b and y ions]

Liquid Chromatography (LC)-MS/MS
LC → Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
[Figure: spectra (intensity vs mass/charge) acquired over time]

Example data – ESI-LC-MS/MS
[Figure: Peptide intensity vs m/z vs time]
[Figure: MS/MS spectrum, % Relative Abundance vs m/z, with the [M+2H]2+ precursor labeled. Caption: Fragment intensity vs m/z]

Charge-State Distributions
[Figure: intensity vs mass/charge for a peptide and a protein, MALDI vs ESI. Peaks are labeled by charge state, from 1+ up to 31+.]

Charge-State
$$\frac{m}{z} = \frac{M + nH}{n}$$
M - molecular mass
n - number of charges
H - mass of a proton
Example: peptide of mass 898
carrying 1 H+: (898 + 1) / 1 = 899 m/z
carrying 2 H+: (898 + 2) / 2 = 450 m/z
carrying 3 H+: (898 + 3) / 3 = 300.3 m/z

Isotope Distributions
[Figure: isotope distributions (Intensity vs m/z) for m = 1035 Da, m = 1878 Da and m = 2234 Da, with peaks at the monoisotopic mass and at +1 Da, +2 Da, +3 Da]
Natural abundance of the heavier isotopes:
1H:  2H 0.015%
12C: 13C 1.11%
14N: 15N 0.366%
16O: 17O 0.038%, 18O 0.200%
32S: 33S 0.75%, 34S 4.21%, 36S 0.02%
Only 12C and 13C: p = 0.0111
n is the number of C in the peptide
m is the number of 13C in the peptide
T_m is the relative intensity of the peptide containing m 13C atoms
$$T_m = \binom{n}{m}\, p^{m} (1 - p)^{n - m}$$

Isotope Clusters and Charge State
[Figure: isotope clusters, Intensity vs m/z. The spacing between isotope peaks is 1 for 1+, 0.5 for 2+ and 0.33 for 3+.]

What is the Charge State?
713.3225, 713.8239, 714.3251, 714.8263: the spacing between the isotopes is 0.5 Da, so the charge state is 2+
432.8990, 433.2330, 433.5671, 433.9014: the spacing between the isotopes is 0.33 Da, so the charge state is 3+

Protein Identification by Mass Spectrometry
[Figure: Samples → Peptides → Mass Spectrometry (intensity vs m/z) → Identity]

Protein Identification - Exercise
1. Protein identification: NUP1 was genomically tagged with protein A, affinity purified under two conditions, and the resulting protein mixture was analyzed with liquid chromatography mass spectrometry (LC-MS). Search the resulting spectra (NUP1-less-stringent-wash.mgf, NUP1-more-stringent-wash.mgf) using X! Tandem (http://h.thegpm.org/tandem/thegpm_tandem.html). Change the taxon to “S. cerevisiae (budding yeast)” but otherwise keep the default parameter settings.
a. Look at the list of identified proteins and explain why they are found in this sample. More information is also available by selecting the “go”, “path”, “ppi”, “doms”, “string” tabs at the top of the page.
b. Select the “mh” display at the top right of the page, and zoom in to +/-100 ppm (the default setting for the mass accuracy that was used in the search). What precursor mass accuracy should we have used?
Zoom in further and determine what precursor mass accuracy could have been used if the spectra were recalibrated (so that the error distribution is centered at zero).

Identification – Tandem MS

Tandem MS – Sequence Confirmation
Peptide: S G F L E E D E L K

b ions   residue   y ions
    88      S        1166
   145      G        1080
   292      F        1022
   405      L         875
   534      E         762
   663      E         633
   778      D         504
   907      E         389
  1020      L         260
  1166      K         147

[Figure: MS/MS spectrum, % Relative Abundance (0 to 100) vs m/z (250 to 1000). Labeled peaks: 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080 and [M+2H]2+. Mass differences of 113 and 129 between neighboring peaks are marked.]

Tandem MS – de novo Sequencing
Amino acid masses

1-letter  3-letter  Chemical
code      code      formula     Monoisotopic  Average
A         Ala       C3H5ON      71.0371       71.0788
R         Arg       C6H12ON4    156.101       156.188
N         Asn       C4H6O2N2    114.043       114.104
D         Asp       C4H5O3N     115.027       115.089
C         Cys       C3H5ONS     103.009       103.139
E         Glu       C5H7O3N     129.043       129.116
Q         Gln       C5H8O2N2    128.059       128.131
G         Gly       C2H3ON      57.0215       57.0519
H         His       C6H7ON3     137.059       137.141
I         Ile       C6H11ON     113.084       113.159
L         Leu       C6H11ON     113.084       113.159
K         Lys       C6H12ON2    128.095       128.174
M         Met       C5H9ONS     131.04        131.193
F         Phe       C9H9ON      147.068       147.177
P         Pro       C5H7ON      97.0528       97.1167
S         Ser       C3H5O2N     87.032        87.0782
T         Thr       C4H7O2N     101.048       101.105
W         Trp       C11H10ON2   186.079       186.213
Y         Tyr       C9H9O2N     163.063       163.176
V         Val       C5H9ON      99.0684       99.1326

Mass Differences
Peaks: 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079
Each entry is the difference between the column peak and the row peak.

      292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
 260   32  129  145  244  274  373  403  502  518  615  647  760  762  819
 292        97  113  212  242  341  371  470  486  583  615  728  730  787
 389             16  115  145  244  274  373  389  486  518  631  633  690
 405                  99  129  228  258  357  373  470  502  615  617  674
 504                       30  129  159  258  274  371  403  516  518  575
 534                            99  129  228  244  341  373  486  488  545
 633                                 30  129  145  242  274  387  389  446
 663                                      99  115  212  244  357  359  416
 762                                           16  113  145  258  260  317
 778                                                97  129  242  244  301
 875                                                     32  145  147  204
 907                                                         113  115  172
1020                                                                2   59
1022                                                                    57

Differences that match an amino acid residue mass are read as that residue: 57 = G, 97 = P, 99 = V, 113 = I/L, 115 = D, 129 = E, 147 = F.

Sequences consistent with spectrum
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 - 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
1166 - 1020 - 18 = 128 => K or Q
SGF(I/L)EEDE(I/L)(K/Q)

Challenges in de novo sequencing
- Neutral loss (-H2O, -NH3)
- Modifications
- Background peaks
- Incomplete information

Tandem MS – Database Search
Experiment: Lysis → Fractionation → Digestion → LC-MS → MS/MS
Search: Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses → Compare with the MS/MS spectrum, Score, Test Significance
Repeat for all peptides
Repeat for all proteins

Information Content in a Single Mass Measurement
[Figure: two panels, Human and S. cerevisiae. Avg. # of matching peptides (1 to 10) vs Tryptic peptide mass [Da] (1000 to 3000).]

Protein Identification and Quantitation by Mass Spectrometry
[Figure: Samples → Peptides → Mass Spectrometry. m/z gives identity and intensity gives quantity.]

Protein Quantitation by Mass Spectrometry
Sample i, Protein j, Peptide k
Lysis → Fractionation → Digestion → LC-MS
[Equation: the measured intensity $I^{MS}_{ik}$ expressed in terms of the protein amount $C_{ij}$ and the factors $p^{L}_{ij}$, $p^{Pr}_{ij}$, $p^{D}_{ijk}$, $p^{Pep}_{ik}$, $p^{LC}_{ik}$, $p^{MS}_{ik}$]

Quantitation – Label-Free (MS)
Sample i, Protein j, Peptide k
Lysis → Fractionation → Digestion → LC-MS
Assumption: $p^{L}_{ij}\, p^{Pr}_{ij}\, p^{D}_{ijk}\, p^{Pep}_{ik}\, p^{LC}_{ik}\, p^{MS}_{ik}$ constant for all samples
$$\frac{C_{i_n j}}{C_{i_m j}} = \frac{I^{MS}_{i_n j}}{I^{MS}_{i_m j}}$$

Quantitation – Metabolic Labeling
Sample i, Protein j, Peptide k
Light and Heavy samples: Lysis → Fractionation → Digestion → LC-MS
[Equation: relates the light and heavy protein amounts $C^{L}_{i_n j}$, $C^{H}_{i_m j}$ and the factors $p^{M}$ to the measured light and heavy intensities $I^{L}_{i_n k}$, $I^{H}_{i_m k}$]
Oda et al. PNAS 96 (1999) 6591
Ong et al. MCP 1 (2002) 376

Quantitation – Labeled Synthetic Peptides
Lysis → Fractionation → Digestion → Enrichment with Peptide antibody → LC-MS
Light: sample peptides. Heavy: Synthetic Peptides.
Assumption: All losses after mixing are identical for the heavy and light isotopes and [condition on the factors $p^{L}$, $p^{Pr}$, $p^{D}$, $p^{M}$]
Gerber et al. PNAS 100 (2003) 6940
Anderson, N.L., et al. Proteomics 3 (2004) 235-44

Estimating peptide quantity
[Figure: a peak in Intensity vs m/z, illustrating Peak height, Peak area and Curve fitting]

What is the best way to estimate quantity?
Peak height
- resistant to interference
- poor statistics
Peak area
- better statistics
- more sensitive to interference
Curve fitting
- better statistics
- needs to know the peak shape
- slow
Spectrum counting
- resistant to interference
- easy to implement
- poor statistics for low-abundance proteins

Proteomics Informatics - Summary
• Structure of mass spectrometry data
• Protein identification
• Protein quantitation

Next Lecture: Gene Expression

Protein Quantitation - Exercise
2.
Protein quantitation: Two breast tumor xenografts (one basal and one luminal) were analyzed by LC-MS, and the spectral counts for the identified peptides in the different analyses are listed in twosample-three-replicate-comparison.txt.
a. Compare replicate one of Sample 1 with replicate one of Sample 2 using proteomics_no_replicate.py. Which differences are significant?
b. Compare replicates one and two of Sample 1 using proteomics_one_replicate.py. Compare to the distribution in 2a. Which differences are significant in 2a?
c. Compare the three replicates of Sample 1 with the three replicates of Sample 2 using proteomics_three_replicates.py. Which differences are significant?
d. In cases when a protein is not observed in one sample, how many spectra do we need to observe in the other sample to say that there is a significant difference?

Phosphorylation Exercise: an unmodified peptide
[Figure: Theoretical fragment ions]

Spectrum of the phosphorylated peptide
Stat3_cytosolic_a #7952 RT: 62.88 AV: 1 NL: 6.00E3 T: ITMS +c ESI d Full ms2 1196.04@cid35.00 [315.00-2000.00]
[Figure: MS/MS spectrum, relative abundance (0 to 100) vs m/z, shown in panels covering m/z 315 to 2000]
Labeled peaks (m/z): 341.9, 361.2, 383.2, 407.2, 421.3, 439.3, 472.3, 500.3, 520.1, 541.4, 569.4, 603.3, 621.2, 635.2, 664.5, 667.2, 668.2, 704.4, 723.9, 762.2, 780.3, 797.3, 819.2, 858.2, 876.8, 885.5, 895.3, 920.4, 936.1, 959.2, 976.5, 990.2, 1008.3, 1048.1, 1065.7, 1080.2, 1109.4, 1137.8, 1146.7, 1178.0, 1204.1, 1223.4, 1250.6, 1281.3, 1310.5, 1350.4, 1382.4, 1400.1, 1432.8, 1453.2, 1478.4, 1495.5, 1501.3, 1547.5, 1571.1, 1592.3, 1610.3, 1654.8, 1671.7, 1706.5, 1723.3, 1751.6, 1769.4, 1802.6, 1820.6, 1852.6, 1870.5, 1916.5, 1934.8, 1951.6, 1970.5, 1995.3

Spectrum of the peptide phosphorylated at a different site
Stat3_cytosolic_a #8053 RT: 63.59 AV: 1 NL: 6.15E3 T: ITMS +c ESI d Full ms2 1196.04@cid35.00 [315.00-2000.00]
[Figure: MS/MS spectrum, relative abundance (0 to 100) vs m/z, shown in panels covering m/z 315 to 2000]
Labeled peaks (m/z): 343.2, 359.2, 373.2, 411.2, 421.2, 441.1, 472.2, 490.2, 520.2, 558.1, 569.3, 587.3, 588.2, 623.1, 641.6, 667.2, 682.3, 700.4, 724.4, 738.2, 780.4, 801.2, 815.4, 845.6, 858.4, 884.2, 902.4, 928.4, 955.3, 964.5, 985.4, 1016.5, 1029.6, 1065.6, 1077.4, 1116.1, 1137.6, 1146.6, 1177.7, 1180.4, 1217.3, 1230.4, 1247.3, 1281.0, 1316.2, 1333.4, 1364.2, 1390.5, 1426.4, 1445.4, 1477.5, 1491.4, 1507.3, 1540.3, 1558.4, 1575.3, 1593.5, 1628.4, 1654.6, 1672.5, 1689.6, 1705.6, 1724.4, 1767.6, 1785.4, 1803.4, 1834.5, 1852.6, 1870.4, 1901.5, 1933.4, 1970.5, 1996.7
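
The theoretical fragment ions used in the phosphorylation exercise can be computed from the monoisotopic residue masses in the amino acid table. The Python sketch below is illustrative only. The sequence of the exercise peptide is not given in these slides, so the sketch uses the lecture's example peptide SGFLEEDELK, and the phosphorylation position is a hypothetical choice. It assumes singly charged b and y ions and a mass shift of 79.9663 Da (HPO3) per phosphate. The function names are made up for this example.

```python
# Monoisotopic residue masses (Da), copied from the amino acid table in this lecture.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.0106    # monoisotopic mass of H2O (Da)
PHOSPHO = 79.9663  # assumed mass added by one phosphate group, HPO3 (Da)


def fragment_ions(sequence, phospho_sites=()):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    phospho_sites holds 1-based residue positions carrying a phosphate.
    b_ions[i] is b(i+1) and y_ions[i] is y(i+1).
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    for pos in phospho_sites:
        masses[pos - 1] += PHOSPHO

    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal fragments
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal fragments
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def shifted_ions(sequence, site):
    """Names of the b and y ions that contain the residue at `site` (1-based)."""
    n = len(sequence)
    b_shift = [f"b{i}" for i in range(site, n)]          # b_i contains residues 1..i
    y_shift = [f"y{i}" for i in range(n - site + 1, n)]  # y_i contains the last i residues
    return b_shift + y_shift


if __name__ == "__main__":
    peptide = "SGFLEEDELK"           # example peptide from the lecture
    b, y = fragment_ions(peptide)
    print("b ions:", [f"{x:.1f}" for x in b])
    print("y ions:", [f"{x:.1f}" for x in y])

    site = 1                         # hypothetical phosphorylation on S1
    b_p, y_p = fragment_ions(peptide, phospho_sites=(site,))
    print("b ions, phospho at", site, ":", [f"{x:.1f}" for x in b_p])
    print("ions that shift:", shifted_ions(peptide, site))
```

Only the fragments that contain the modified residue shift by the phosphate mass, so comparing which b and y ions move between two spectra of the same precursor indicates which site carries the modification. Calling `shifted_ions` for each candidate site gives the pattern to look for in the two labeled peak lists.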