Data Collection and Analysis for High Throughput Quantitative Proteomics: Current Status and Challenges

Ruedi Aebersold, Ph.D.
Institute for Systems Biology
Seattle, Washington
email: raebersold@systemsbiology.org

Proteomics: The systematic (quantitative) analysis of the proteins expressed in a cell at a time
• Enumerate all the components of a proteome
  Proteome as database: proteome analyzed once
• Detect dynamic changes in the proteome following external or internal perturbations
  Proteomics as biological or clinical assay: proteome analyzed multiple (infinite) times

Protein Identification Strategy
[Figure: workflow in three stages. I: protein mixture. II: peptides; 1D, 2D, 3D peptide separation (chromatogram, axis: Time (min)); Q1, Q2 (collision cell), Q3; tandem mass spectrum (axis: m/z). III: correlative sequence database searching, comparing the acquired spectrum with theoretical spectra; protein identification.]

Accurate Quantitation Using Isotope Dilution
Sample 1 (Reference): incorporate stable light isotope
Sample 2: incorporate stable heavy isotope
Combine samples
Analyze by mass spectrometer
• h/l analytes are chemically identical → identical specific signal in MS
• Ratio of h/l signals indicates ratio of analytes

Isotope Coded Affinity Tags (ICAT)
Heavy reagent: d8-ICAT (X = deuterium)
Light reagent: d0-ICAT (X = hydrogen)
[Figure: reagent structure with three parts: biotin tag; linker (heavy or light) carrying eight X positions; thiol-reactive group.]
Detection of Cys-containing peptides and accurate quantification using stable isotope dilution

Quantitative proteomics by isotope labeling-LC-MS/MS
[Figure: Mixture 1 and Mixture 2, labeled with light and heavy isotope label; combine and proteolyze; optional fractionation; avidin affinity enrichment; MS spectrum showing a light/heavy pair (m/z 550 to 580); MS/MS spectrum of NH2-EACDPLR-COOH (m/z 0 to 800).]
Compatible with any separation/fractionation method at protein/peptide level.
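The isotope dilution principle above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any tool named in these slides: the function names are hypothetical, the peak intensities are made up, and summing intensities across the elution profile is an assumed simplification. The only outside facts used are the monoisotopic masses of hydrogen and deuterium, which give the expected spacing of a d0/d8 pair.

```python
# Monoisotopic masses in Da (standard values); their difference is the
# mass added each time one hydrogen in the linker is replaced by deuterium.
MASS_H = 1.007825
MASS_D = 2.014102


def icat_pair_spacing(charge, n_cys=1, n_labels=8):
    """Expected m/z spacing between the heavy (d8) and light (d0) form
    of a peptide carrying n_cys labeled cysteines at the given charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return n_cys * n_labels * (MASS_D - MASS_H) / charge


def heavy_to_light_ratio(light_intensities, heavy_intensities):
    """Ratio of heavy to light signal, each summed over the same scans.
    Assumption: a plain sum stands in for proper peak integration."""
    light = sum(light_intensities)
    heavy = sum(heavy_intensities)
    if light <= 0:
        raise ValueError("light signal must be positive to form a ratio")
    return heavy / light


if __name__ == "__main__":
    # Hypothetical intensities for one peptide pair across three scans.
    light = [10.0, 20.0, 10.0]
    heavy = [30.0, 60.0, 30.0]
    print("pair spacing at 2+:", round(icat_pair_spacing(charge=2), 3))
    print("heavy/light ratio:", heavy_to_light_ratio(light, heavy))
```

For a doubly charged peptide with a single cysteine, such as EACDPLR on the slide, the sketch gives a pair spacing of about 4.03 m/z. Because the heavy and light forms are chemically identical, the ratio of their signals is read directly as the ratio of the analytes in the two samples.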
Stable Isotope Labeling Strategies
[Figure: three labeling routes (metabolic stable isotope labeling; isotope tagging by chemical reaction; stable isotope incorporation via enzyme reaction), each with label and digest steps, drawn across three stages: PROTEIN LABELING, DATA COLLECTION (mass spectrometry; spectra of intensity vs. m/z) and DATA ANALYSIS (quantitation and protein identification).]

Quantitative Proteomics Technology
Protein identification: Automated peptide tandem mass spectrometry of complex peptide mixtures
Protein quantification: Isotope dilution
Selective chemical reactions: reduction of sample complexity; selective analyte isolation
Results: Identification of proteins in sample and quantitative profiles
Current capacity: ~1000 proteins per day/instrument
Total yeast lysate: ~2000 proteins identified and quantified
In 1991, all the world’s labs combined had identified just about 2000 genes

Current Limitations (and Potential Solutions)
• The efficiency problem
• The validation problem
• The biological inference problem

Standard Method for Complex Peptide Mixture Analysis
Cation Exchange → RP-HPLC → ESI-MS/MS

Proteome Analysis: The Analytical Challenges
Yeast Proteome
• Expected number of ORFs: 6118
• Expected number of tryptic peptides: ~350,000

Synchronous Timepoint Samples Compared to Reference Sample
[Figure: experimental design. Asynchronous reference sample; timepoint samples from yeast cells synchronously transiting the cell cycle.]

Data Summary
[Table: matrix over timepoints T0, T30, T60, T90, T120; column alignment lost, values listed per row in source order]
T0:    678  1648   320   342   340   319
T30:  1095  1523   998   555   604   626
T60:  1184  1055  1448  1006   571   587
T90:  1112  1140  1051  1713  1243   684
T120:  892   921   871   960  1229  1047
• 2735/6562 proteins quantified across all timepoints (42%)
• 696 proteins quantified in every experiment
• 1513 proteins quantified in at least one timepoint
• 34,400 peptides quantified on average per timepoint
• >1 million mass spectra collected

Pep3D: Xiao-jun Li et al., submitted
[Figure: Pep3D view of one LC-MS run.]
Features: 2720
CIDs: 1633
IDs: 363
ID/CID: 22%
ID/feature: 13%

Possible Solutions
• Better separation technology
• Selective peptide isolation
• Smart precursor ion selection

Number of peptides identified in each fraction
[Figure: two bar charts. Number of peptides identified in each SCX fraction (average overlap: 52%) and in each FFE fraction (average overlap: 29%). x axis: fraction number (1 to 27); y axis: number of peptides. Legend: peptides overlapping with the previous fraction; unique peptides in the fraction.]
• Tryptic yeast digest separated by FFE-IEX or SAX
• 30 fractions collected and analyzed by capLC-MS/MS
• Overlap: same peptide identified in adjacent fractions

Peptide overlap in SCX and FFE
[Figure: two histograms, "Peptide overlap in SCX" and "Peptide overlap in FFE". x axis: number of fractions one peptide is distributed to; y axis: number of peptides. Annotations: 92% and 68%.]

Possible Solutions
• Better separation technology
• Selective peptide isolation
  – Zhang H, et al. Curr. Op. Chem. Biol. (2004) 8:66-75.
  – Aebersold R. Nature (2003) 422(6928):115-6.
• Smart precursor ion selection
  – Griffin T et al. Anal Chem. (2003) 75:867-74.
  – Griffin et al. J Am Soc Mass Spectrom. (2001) 12:1238-46.

Summary: Efficiency Problem
• Only a (small) subset of peptides present is identified
• Current separation strategies do not have sufficient resolving power
• MS/MS of every peptide in every experiment is a bottleneck of current MS-based proteomics
• LC-ESI MS/MS wastes a high fraction of MS/MS cycles sequencing precursor ions that do not lead to a positive identification
• Most positive identifications are not informative in profiling experiments
• Smart precursor ion selection is required

Current Limitations (and Potential Solutions)
• The efficiency problem
• The validation problem
• The biological inference problem

Protein Identification by MS/MS
[Figure: three levels. Protein level: protein sample (A, B, C, D) and protein identifications (A, B, C). Peptide level: peptide mixture and peptide identifications. MS/MS spectrum level: MS/MS spectra, linked to peptide identifications by database search.]
Database search tools:
- Sequest
- Mascot
- SpectrumMill
- etc.
Output from Search Algorithm
[Figure: peptide assignments sorted by search score, each either "correct" or incorrect.]

Threshold Model
[Figure: the same list sorted by search score, with a score threshold applied.]
SEQUEST: Xcorr > 2.0, ΔCn > 0.1
MASCOT: Score > 47

Difficulty Interpreting Protein Identifications based on MS/MS
• Different search score thresholds used to filter data
• Unknown and variable false positive error rates
• No reliable measures of confidence

Statistical Model
entire dataset:
MS/MS spectrum   Best database match   Search score   Probability   Call at p = 0.5
Spectrum 1       LGEYGH                4.5            1.0           correct
Spectrum 2       FQSEEQ                3.4            0.97          correct
Spectrum 3       FLYQE                 1.3            0.01          incorrect
…                …                     …
Spectrum N       EIQKKF                2.2            0.3           incorrect
Unsupervised learning: an EM mixture model algorithm learns the most likely distributions among correct and incorrect peptide assignments given the observed data.

Threshold Model: Bad Discrimination and Inconsistency
[Figure: sensitivity vs. error rate for SEQUEST thresholds (from literature), with the ideal spot marked.]
Sensitivity: fraction of all correct results passing filter
Error Rate: fraction of all results passing filter that are incorrect
Test data: A. Keller et al. OMICS 6(2), 207 (2002)

Discriminating Power of Peptide Prophet
[Figure: the same sensitivity vs. error rate plot, comparing Peptide Prophet with the SEQUEST thresholds (from literature), with the ideal spot marked.]
Improved discrimination: more identifications (for the same error rate)
Keller et al. Anal. Chem. 2003

Protein Identification
>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSA
PLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLV
LDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHI
RLSFNPTQLEEQCHI

KPTPEGDLEILLQK : p = 0.83
LSFNPTQLEEQCHI : p = 0.65
LSFNPTQLEEQCHI : p = 0.76
TPEVDDEALEK : p = 0.96
TPEVDDEALEKFDK : p = 0.96
sp|P02754|LACB_BOVIN Probability = ???

ProteinProphet™ software combines probabilities of peptides assigned to MS/MS spectra to compute accurate probabilities that corresponding proteins are present.
Nesvizhskii et al. Anal Chem. (2003) 75:4646-58.

Issues for Protein Identification
• Many peptides are present in more than a single database protein entry
  ProteinProphet apportions such peptides among all corresponding proteins to derive the simplest list of proteins that explains the observed peptides
• Peptides corresponding to ‘single-hit’ proteins are less likely to be correct than those corresponding to ‘multi-hit’ proteins
  ProteinProphet learns by how much peptide probabilities should be adjusted to reflect this protein grouping information

Amplification of False Positive Error Rate from Peptide to Protein Level
[Figure: Peptides 1 to 10, of which 5 are correct (+), mapped to seven proteins. Prot A and Prot B are in the sample (enriched for ‘multi-hit’ proteins); the other five proteins are not in the sample (enriched for ‘single hits’).]
Peptide level: 50% false positives (5 of 10 peptides)
Protein level: 71% false positives (5 of 7 proteins)

Serum Protein Identifications from Large-scale (~375 run) Experiment
Data Filter                # ids                              # non-single hits   # single-hits
Publ. Threshold model #1   2257                               359                 1898
Publ. Threshold model #2   2742                               441                 2301
ProteinProphet, p ≥ 0.5    713 (predicted error rate: 7%)     511                 202
Reference: H. Zhang et al., in prep

Consistency of Manual Validation of SEQUEST Search Results
[Figure: search results assessed by manual authenticators; outcomes shown: correct validation, incorrect validation, validation withheld.]

Tasks for a proteomic analysis pipeline
[Figure: Data Analysis Pipeline, pairing each task with its tools.]
• Suitable input: mzXML
• Peptide assignment: COMET, ProbID
• Validation: Peptide Prophet
• Protein assignment: Protein Prophet
• Quantitation: ASAPRatio
• Interpretation: SBEAMS, Cytoscape

Data Analysis Summary:
• Processing of data collected from different platforms, samples, experiments and operators requires transparent methods to score data
• Publication and relational database analysis require consistently scored data
• Tools assigning probability-based scores are essential
• Openly accessible, transparent (open source) tools bring in new talent and lead to community-improved tools
Nesvizhskii and Aebersold (2004) Drug Discov Today. 9:173-81
http://www.proteomecenter.org/software.php

Current Limitations (and Potential Solutions)
• The efficiency problem
• The validation problem
• The biological inference problem

Wei Yan et al
[Figure: mock-treated and IFN-treated samples labeled with C12 and C13 ICAT label, combined (C12/C13) and analyzed by HPLC-MS/MS.]

          S100   P100   P3    Sum    Unique ID
P ≥ 0.9   523    270    671   1464   1113
P ≥ 0.4   590    330    748   1668   1272

54 IFN-induced proteins (2-fold): 15 previously reported, 39 novel
Name | Cellular pathway | Probability | ASAPRatio Mean | ASAPRatio Std.
DNAH11: dynein, axonemal, heavy polypeptide 11 | motor protein complex | 0.94 | 9999 | -1
UBE2L6: ubiquitin-conjugating enzyme E2L 6 | ubiquitination and protein degradation | 0.57 | 9999 | -1
IFIT1: interferon-induced protein with tetratricopeptide repeats 1 | unknown and ESIs | 0.48 | 9999 | -1
GPR111: G protein-coupled receptor 111 | G-protein coupled receptor and G-protein signaling | 0.63 | 21.270 | 4.741
PASK: PAS domain containing serine/threonine kinase | signaling pathway | 0.49 | 12.006 | 1.024
ADRM1: adhesion regulating molecule 1 | adhesion molecule and extracellular matrix protein | 0.79 | 9.508 | 1.043
CSA_PPIase: PEPTIDYL PROLYL CIS TRANS ISOMERASE | chaperone and protein folding | 1 | 8.104 | 1.070
AHCY: S-adenosylhomocysteine hydrolase | one-carbon compound metabolism | 0.93 | 6.279 | 0.936
IFIT4: interferon-induced protein with tetratricopeptide repeats 4 | unknown | 1 | 6.230 | 0.794
FLJ32915: hypothetical protein FLJ32915 | unknown | 0.73 | 6.054 | 4.883
GNB1: guanine nucleotide binding protein (G protein), beta polypeptide 1 | G-protein coupled receptor and G-protein signaling | 1 | 5.845 | 0.133
G1P2: interferon, alpha-inducible protein (clone IFI-15K) | cytoskeleton and intracellular transport | 0.98 | 4.858 | 0.661
MTP: microsomal triglyceride transfer protein (large polypeptide, 88kDa) | lipid and fatty acid metabolism | 0.97 | 4.748 | 0.751
PLCD1: phospholipase C, delta 1 | signaling pathway; lipid metabolism | 0.69 | 4.569 | 0.116
CD7: CD7 antigen (p41) | immune response | 0.57 | 4.523 | 2.204
EEF1A protein [Fragment] | translation and ribosomal protein; GTP binding | 1 (1) (1) | 4.164 (2.741) (2.2) | 1.284 (0.195) (0.394)
PRKR: protein kinase, interferon-inducible double stranded RNA dependent | translation and ribosomal protein; anti-viral response | 1 | 3.963 | 0.659
KIAA1276: KIAA1276 protein | unknown and ESIs | 0.62 | 3.815 | 0.058
NUDT2: nudix (nucleoside diphosphate linked moiety X)-type motif 2 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism | 0.98 | 3.684 | 0.224
CABC1: chaperone, ABC1 activity of bc1 complex like (S. pombe) | chaperone and protein folding | 0.99 | 3.533 | 1.659
ACACA: acetyl-Coenzyme A carboxylase alpha | lipid and fatty acid metabolism | 1 | 3.351 | 0.259
KNS2: kinesin 2 60/70kDa | cytoskeleton and intracellular transport | 1 | 3.140 | 0.335
LOC151636: rhysin 2 | cytoskeleton and intracellular transport? | 1 | 2.975 | 0.231
M96: likely ortholog of mouse metal response element binding transcription factor 2 | transcription | 0.98 | 2.923 | 0.390
ETFA: electron-transfer-flavoprotein, alpha polypeptide (glutaric aciduria II) | electron transfer | 0.45 | 2.890 | 0.484
NMI: N-myc (and STAT) interactor | signaling pathway; transcription; apoptosis | 0.57 | 2.875 | 0.138
GSA7: ubiquitin activating enzyme E1-like protein | ubiquitination and protein degradation | 0.98 | 2.844 | 0.663

23 IFN-repressed proteins (0.5-fold)
Name | Cellular pathway | Probability | ASAPRatio Mean | ASAPRatio Std.
MGC3207: hypothetical protein MGC3207 | translation and ribosomal protein | 0.61 | 0.499 | 0.071
SPK: symplekin | unknown | 1 | 0.496 | 0.029
KRT10: keratin 10 (epidermolytic hyperkeratosis; keratosis palmaris et plantaris) | cytoskeleton and intracellular transport | 0.97 | 0.495 | 0.055
SARDH: sarcosine dehydrogenase | electron transfer | 0.98 | 0.484 | 0.008
TRA1: tumor rejection antigen (gp96) 1 | chaperone and protein folding | 1 | 0.452 | 0.165
GPS1: G protein pathway suppressor 1 | G-protein coupled receptor and G-protein signaling | 0.98 | 0.455 | 0.138
SRRM2: serine/arginine repetitive matrix 2 | RNA splicing and processing | 0.82 | 0.434 | 0.224
KIAA0007: KIAA0007 protein | unknown | 1 | 0.426 | 0.014
FACL4: fatty-acid-Coenzyme A ligase, long-chain 4 | lipid and fatty acid metabolism | 0.98 | 0.416 | 0.081
FXR2: fragile X mental retardation, autosomal homolog 2 | RNA binding and ribosomal association | 0.95 | 0.391 | 0.074
TUBA6: tubulin alpha 6 | cytoskeleton and intracellular transport; GTP binding | 1 | 0.383 | 0.165
CPSF4: cleavage and polyadenylation specific factor 4, 30kDa | RNA splicing and processing | 0.96 | 0.378 | 0.154
MAPRE1: microtubule-associated protein, RP/EB family, member 1 | cytoskeleton and intracellular transport | 0.98 | 0.339 | 0.016
OAT: ornithine aminotransferase (gyrate atrophy) | amino acid and peptide metabolism | 0.98 | 0.331 | 0.018
PPGB: protective protein for beta-galactosidase (galactosialidosis) | chaperone and protein folding; protein protection | 1 | 0.323 | 0.084
WNT9A: wingless-type MMTV integration site family, member 9A | signaling pathway | 0.99 | 0.316 | 0.091
FASN: fatty acid synthase | lipid and fatty acid metabolism | 0.99 | 0.304 | 0.100
Ig lambda chain C regions | immune response | 0.98 | 0.265 | 0.110
G2AN: alpha glucosidase II alpha subunit | carbohydrate metabolism | 1 | 0.198 | 0.033
Hypothetical protein FLJ21140 | unknown | 0.71 | 0.043 | 0.064
KRT6: keratin 6 | cytoskeleton and intracellular transport | 1 | 0.003 | 0.008
MIG-6: Gene 33/Mig-6 | signaling pathway | 0.99 | 0.000 | -1.250
HIC1: hypermethylated in cancer 1 | transcription suppression | 0.94 | 0.000 | -1.250

Lots of data - what does it mean?
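The 2-fold and 0.5-fold cut-offs quoted for the IFN-induced and IFN-repressed lists can be sketched as a simple filter over (probability, ASAPRatio mean, ASAPRatio std.) rows. This is an illustration under stated assumptions, not the procedure used by the authors: the class and function names are hypothetical, the minimum probability of 0.4 is an assumed default, and treating a mean of 9999 or a negative standard deviation as "no finite ratio" is an assumption about how such placeholder values should be read.

```python
from dataclasses import dataclass


@dataclass
class ProteinRatio:
    name: str
    probability: float   # protein identification probability
    ratio_mean: float    # IFN / mock abundance ratio
    ratio_std: float     # standard deviation of the ratio


def classify(p, induced_at=2.0, repressed_at=0.5,
             min_probability=0.4, sentinel=9999.0):
    """Assign one protein to a regulation class.
    All thresholds are parameters; the defaults are assumptions."""
    if p.probability < min_probability:
        return "low confidence"
    # Assumption: 9999 or a negative std. marks a ratio that could not
    # be expressed as a finite number with an error estimate.
    if p.ratio_mean >= sentinel or p.ratio_std < 0:
        return "no finite ratio"
    if p.ratio_mean >= induced_at:
        return "induced"
    if p.ratio_mean <= repressed_at:
        return "repressed"
    return "unchanged"


if __name__ == "__main__":
    # Three rows taken from the table above.
    rows = [
        ProteinRatio("PRKR", 1.0, 3.963, 0.659),
        ProteinRatio("FASN", 0.99, 0.304, 0.100),
        ProteinRatio("DNAH11", 0.94, 9999.0, -1.0),
    ]
    for row in rows:
        print(row.name, classify(row))
```

Keeping the placeholder rows in their own class, rather than folding them into the induced or repressed lists, leaves the decision about how to interpret them to the analyst.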
Interferon (IFN) Pathway
[Figure: IFN pathway annotated with IFN / Mock ratios; Katze et al (2002) 2: 675.]
PKR: 3.963 ± 0.659
2’,5’-OAS: 2.460 ± 0.076
Mx: 2.359 ± 0.149
ADAR: 1.398 ± 0.118
IRFs: Not identified
MHC: β2-microglobulin (MHC I) 2.768 ± 0.583; IFI-30 (MHC II) 2.219 ± 0.183
One further ratio, 2.215 ± 0.079, appears in the figure without a recoverable label.

GO Analysis of Interferon regulated proteins
[Figure: GO tree under "Physiological process", GO levels 3 to 12. Terms shown, by level as extracted:]
Level 3: Cell growth and/or maintenance; Death; Metabolism; Response to external stimulus; Response to stress; Pathogenesis
Level 4: Cell organization; Cell growth; Cell death; Transport; Cytoplasm organization; Nuclear organization; Catabolism; Nitrogen metabolism
Level 5: DNA metabolism; Defense response
Level 6: Amino acid metabolism; Fatty acid metabolism; Immune response
Levels 7 to 12: no terms recovered
Summary labels: Cell growth and/or maintenance; Metabolism; Cellular defense response

Islands of intense knowledge in ocean of unknown
[Figure: islands labeled Hormone responses, Cell motility, Energy metabolism, Transcription.]

Charting the path between landmarks
[Figure: the same landmarks (Hormone responses, Cell motility, Energy metabolism, Transcription) connected through unassigned observations.]

Walking down the interaction map
[Figure: interaction map with nodes A to I.]

First round of TAP-tagging: Identification of IGBP1 and TIP41 interactors
Baits: IGBP1, TIP41
CCT complex: TCP1, CCT2, CCT3, CCT4, CCT5, CCT6A, CCT7, CCT8
Catalytic subunits of PP2A-type phosphatases: PPP2CA, PPP2CB, PPP4C, PPP6C
Uncharacterized proteins: PPP4R2*, PPP6R1*, PPP6R2A*
Anne-Claude Gingras

Human phosphatase-interaction network: Segregation into functional modules
[Figure: network modules labeled Centrosome; Meiosis; Exit from mitosis; Actin cytoskeleton; G1/S transition. Node labels as extracted: PP4 C, PP6 C, PP2 C, PP2 B, PP2A a.]

Acknowledgements
Separation strategies: Hookeun Lee, Eugene Yi, Mingliang Yi
Abundance dependent MS/MS: Tim Griffin, Chris Lock (Sciex)
Software development and statistical models: Eric Deutsch, Xiao-Jun Li, Jimmy Eng, Alex Nesvizhskii, Andy Keller, Benno Schwikowski, Patrick Pedrioli, Ning Zhang
Inference of biological function: Wei Yan, Anne-Claude Gingras, Cytoscape project (www.cytoscape.org)
Funding: NIH (NCI, NCRR, NIDA, NHLBI), Merck, ABI