Experimental & Bioinformatic Tools for Proteomics
Steve Oliver
Professor of Genomics
Faculty of Life Sciences, The University of Manchester
http://www.cogeme.man.ac.uk
http://www.bioinf.man.ac.uk

Functional Genomics
Level of analysis: definition, status, method of analysis
• Genome
  – Definition: Complete set of genes of an organism or its organelles.
  – Status: Context-independent (modifications to the yeast genome may be made with exquisite precision).
  – Method of analysis: Systematic DNA sequencing.
• Transcriptome
  – Definition: Complete set of mRNA molecules present in a cell, tissue or organ.
  – Status: Context-dependent (the complement of mRNAs varies with changes in physiology, development or pathology).
  – Method of analysis: Hybridisation arrays. SAGE. High-throughput Northern analysis.
• Proteome
  – Definition: Complete set of protein molecules present in a cell, tissue or organ.
  – Status: Context-dependent.
  – Method of analysis: 2-D gel electrophoresis. Peptide mass fingerprinting. Two-hybrid analysis.
• Metabolome
  – Definition: Complete set of metabolites (low molecular weight intermediates) present in a cell, tissue or organ.
  – Status: Context-dependent.
  – Method of analysis: Infra-red spectroscopy. Mass spectrometry. Nuclear magnetic resonance spectroscopy.

GENOME, TRANSCRIPTOME, PROTEOME, METABOLOME

Proteomics
Separation, Identification, Quantitation, Bioinformatics

Complex mixture analysis
[Diagram: knowledge + prediction; genome; "virtual" proteome; peptide mass database; post-translational modification; separation methods (2D-gels, functional separations, n-dimensional chromatography); real proteome; bioinformatics]

Identification
[Diagram: complex mixtures & subsets, [digest], complex peptide map / fingerprint; simple mixtures & single proteins, [digest], simple peptide map / fingerprint]

Aberdeen PRF1: S. cerevisiae 2D map
[Figure: annotated 2D gel map; axis ticks 4.0 to 6.5 (presumably pI) and 10 to 150 (presumably kDa); labelled spots include ADE6, CDC48, HIS4, SSA1, SSA2, HSP60, PDC1, ENO1, ENO2, TDH3, ADH1, PGK1, FBA1, TPI1, ACT1, SOD1 and others]

Peptide mass fingerprinting
denature:
KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCLPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHFDASV
digest (trypsin), giving peptides labelled m1 to m12 on the slide:
KETAAAK, FER, QHMDSSTSAASSSNYCNQMMK, SR, NLTK, DR, CLPVNTFVHESLADVQAVCSQK, NVACK, NGQTNCYQSYSTMSITDCR, ETGSSK, YPNCAYKTTQANK, HIIVACEGNPYVPVHFDASV
mass spectrometry:
[Figure: mass spectrum (abundance versus mass) with one peak per peptide, m1 to m12]

Proteomic applications
• Quantitative Proteomics – “Expression” proteomics
  • protein levels under different conditions/times
• Qualitative Proteomics – Identification proteomics
  • protein:protein interactions
  • post-translational modifications

“A MASS SPECTROMETER MEASURES THE MW….”
“...A MS ANALYSIS GIVES THE MASS-TO-CHARGE RATIO (m/z) FOR IONS…IN GAS PHASE”.
Brancia FL, Trieste, 12/02/2004

What is a “mass spectrometer”...?
Components of a mass spectrometer:
• Sample introduction: DIRECT INTRODUCTION (solid, liquid, gas) or SEPARATION TECHNIQUES (HPLC, CE, GC)
• ION SOURCE (“ion generation”): EI, FAB, MALDI, Electrospray
• ANALYZER (“mass analysis”): TOF, quadrupole, ion trap
• Detector
• Data Processing
• Pumping system (vacuum)

Various ionisation methods
• Electron impact ionisation EI (1919, A.J. Dempster)
• Chemical ionisation CI
• Fast atom bombardment FAB (1981, M. Barber)
• Matrix-assisted laser desorption ionisation MALDI (1988, K. Tanaka; M. Karas and F. Hillenkamp)
• Electrospray ES (1985, J. Fenn)

‘Soft’ Ionisation Techniques
‘Soft’ refers to the low amount of energy imparted to the analyte during ionisation. Too much internal energy will result in fragmentation. Soft ionisation techniques form intact molecular or pseudo-molecular (M+H) ions.
• Matrix-assisted laser desorption ionisation (MALDI)
• Electrospray (ES)

Nobel Prize in Chemistry 2002
“...for their developments of soft desorption ionisation methods for mass spectrometric analysis of biological macromolecules”.
• 1/4 to John B. Fenn (USA), Virginia Commonwealth University: Electrospray Ionization
• 1/4 to Koichi Tanaka (Japan), Shimadzu Corp., Kyoto: Laser Ionization
• 1/2 of the prize went to Kurt Wüthrich (Switzerland) for the development of NMR analysis

Electrospray (ES)
[Diagram: sample solution in the electrospray capillary at +HV and atmospheric pressure; counter electrode (near ground); skimmer electrodes; mass analyzer at high vacuum; pressure gradient and potential gradient between capillary and analyzer]
• Droplet shrinks due to solvent evaporation
• Droplet explodes due to charge density limit
• Gaseous ions [M+nH]n+ formed via one of two proposed mechanisms

The principal outcome of the electrospray process is the transfer of analyte species, generally ionised in condensed phase, into the gas phase as isolated entities.
Gaskell SJ, Journal of Mass Spectrometry 1997
[Diagram: aerosol of charged droplets leaving the capillary at +HV]

ES spectrum of Rho protein
Rho Protein: 47004.33 Da
[Figure: ES spectrum, relative abundance (%) versus m/z (600 to 1300), showing a series of multiply charged ions from [M+56H]56+ to [M+50H]50+ and beyond; labelled peaks include 653.9, 672.2, 682.0, 702.4, 713.0, 724.1, 735.5, 747.1, 759.1, 771.6, 784.4, 797.7, 825.6, 840.3, 855.6, 871.7, 888.0, 905.0, 941.0, 960.2, 980.3 and 1001.2. Courtesy of Dr Matt Openshaw]

Electrospray (ES)
For the [M+56H]56+ ion observed at m/z 840.3 (taking the proton mass as 1 Da):
$m/z = (M + 56)/56 = 840.3$
Therefore, $M = (840.3 \times 56) - 56 = 47000.8$ Da
Deconvolution: takes all the multiply charged ions and converts them into a spectrum on a mass (Da) scale, i.e. works out what the molecular weight is most likely to be.
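The charge-state arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any instrument software: the function names are hypothetical, the proton mass is taken as 1.00728 Da rather than the rounded 1 Da used on the slide, and the charge is inferred on the assumption that two neighbouring peaks differ by exactly one proton.

```python
PROTON = 1.00728  # mass of a proton in Da (the slide rounds this to 1)


def neutral_mass(mz, z):
    """Neutral mass M of an [M+zH]z+ ion observed at the given m/z."""
    return z * (mz - PROTON)


def charge_from_adjacent(mz_low, mz_high):
    """Charge of the ion at mz_high, assuming the neighbouring peak at
    mz_low is the same molecule carrying one additional proton."""
    return round((mz_low - PROTON) / (mz_high - mz_low))


def deconvolve(peaks):
    """Average the neutral mass implied by each adjacent pair of peaks.
    `peaks` must be m/z values of consecutive charge states, ascending."""
    masses = []
    for mz_low, mz_high in zip(peaks, peaks[1:]):
        z = charge_from_adjacent(mz_low, mz_high)
        masses.append(neutral_mass(mz_high, z))
    return sum(masses) / len(masses)


if __name__ == "__main__":
    # Three neighbouring peaks read off the Rho protein spectrum
    print(charge_from_adjacent(825.6, 840.3))
    print(deconvolve([825.6, 840.3, 855.6]))
```

For the labelled peaks at m/z 825.6, 840.3 and 855.6 this assigns charges 56 and 55 and averages to roughly 47001.5 Da, close to the 47004 Da quoted for the Rho protein. Peak labels read off a plot are only accurate to about 0.1 m/z, so a real deconvolution would fit the whole charge series rather than individual pairs.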
ES spectrum after deconvolution
[Figure: deconvoluted spectrum, relative abundance (%) versus mass (44000 to 50000 Da), with a single peak labelled 47004.9; annotation 47004.0 Da]

Advantages
• Production of molecular ions from solution
• The ease of coupling with separation techniques (micro LC-MS/MSMS, nano LC-MS/MSMS)
• Production of multiply charged ions

Matrix Assisted Laser Desorption Ionisation
MALDI Time-of-Flight

Matrix assisted laser desorption ionisation (MALDI)
Matrices (chemical structures shown on the slide):
• α-cyano-4-hydroxycinnamic acid (CHCA)
• 2,5-dihydroxybenzoic acid (DHB)
• trans-3,5-dimethoxy-4-hydroxycinnamic acid (sinapinic acid; SA)
Typically used with a nitrogen laser (337 nm)

MALDI is an efficient desorption ionisation technique for producing gaseous ions, e.g. [M+H]+, from a solid sample by laser pulses

Matrix Assisted Laser Desorption/Ionisation (MALDI)
Unlike ES, MALDI forms predominantly singly charged ions, e.g. [M+H]+, or adducts (sodium [M+Na]+ or potassium [M+K]+)
Sodium = 23 amu; Potassium = 39 amu
[M+Na]+ appears 22 m/z above [M+H]+; [M+K]+ appears 38 m/z above [M+H]+

Why is the matrix so important?
• Matrix is necessary to dilute and disperse the analyte
• It functions as an energy mediator for ionising the analyte itself or other neutral molecules
• It forms an activated state produced by photoionisation

Advantages
• MALDI primarily creates singly charged ions [M+H]+
• Less sensitive to contaminants
• Sensitivity at femtomole level
• High throughput analysis

Time-of-flight (ToF) mass spectrometer
[Diagram: MALDI target, extraction grid, flight tube (field-free region), detector; ions leave at t = 0 and are detected at t > 0]
$\frac{mv^2}{2} = zV$
$t^2 = \frac{m}{z}\left(\frac{d^2}{2V}\right)$
Here V is the accelerating voltage, d the length of the flight tube and t the flight time, so ions of larger m/z arrive later.

Reflectron time-of-flight mass analyser
[Diagram: laser, target, accelerating voltage (VACCEL), electrostatic mirror, detector 1, detector 2]

MALDI compared with ESI (the column assignment of some entries was lost in extraction; values are listed in the order they appear)
• Sensitivity: femtomole, 10^-15 mol (...attomole, 10^-18 mol)
• Simplicity: very easy; training required
• $$$: 70 to 650 k$; 120 to 650 k$
• Speed (“high throughput”): ~10^4/day; dynamic system
• Selectivity (“resolution”): >5000
• Structural information: MSn; MSn
• Software: “...evaluation in progress.”

Structural information can be achieved by tandem mass spectrometry

The tandem mass spectrometry experiment
• Ion source, e.g. electrospray
• Analyser 1, e.g. quadrupole
• Decomposition region: collisionally activated decomposition (CAD)
• Analyser 2, e.g. quadrupole, time-of-flight

[Diagram: ion beam from the ion source passes through MS1, a collision cell containing collision gas molecules, and MS2 to the ion detector; precursor ion m+ and fragment ions f1+ to f4+; panels (a) and (b) show spectra (TIC versus m/z)]

PROBLEMS WITH ‘CLASSICAL’ PROTEOME ANALYSIS:
1. Not comprehensive
2. Not high-throughput
3. Destroys protein-protein interactions that provide important clues to function

[Figure: number of (protein) database matches (0 to 450) versus peptide mass (1000 to 2000 Da) for C. elegans, S. cerevisiae, E. coli and H. influenzae]

• Multidimensional protein identification technology (MudPIT)
• Washburn MP, et al. Nat Biotechnol 2001, 19:242-247.
[Diagram: SCX and reverse phase column coupled to MS/MS]
• Load complete digest of sample
• Develop with gradient and spray directly onto MS/MS
• Identified 1500 proteins from yeast, including lower abundance species and membrane proteins
• 2415 proteins (46% of the Plasmodium genome) identified across the 4 stages of the parasite life cycle

Just Enough Diagnostic Information
Sidhu KS, Sangavich P, Brancia FL, Sullivan AG, Gaskell SJ, Wolkenhauer O, Oliver SG, Hubbard SJ (2001) Bioinformatic assessment of mass spectrometric chemical derivatisation techniques for proteome database searching. Proteomics 1, 1368-1377.
Provide limited sequence information by:
1. Identification of N-terminal amino acid by PTC derivatisation
2. Use guanidination to identify C-terminus, determine lysine content, and improve signal response
3. Specifically fragment next to Asp residues using MALDI-QToF MS

PTC-derivatisation
• phenylthiocarbamoyl derivative
• Edman chemistry
• N-terminal amino acid
• b1 ion created via low energy collisions
• precursor ion scan gives parents
• increased sensitivity
[Diagram: peptide ions; ms1 (scan for precursors); collision cell; ms2 (fixed on b1)]
Spectra collected of all peptides which give rise to a given b1 ion (implying knowledge of the N-terminal amino acid)

Database peptide hits by N-terminal amino acid
N-terminal amino acid    Mean number of peptides
ANY                      74.15
W                        1.70
C                        1.77
H                        2.30
M                        3.41
:
N                        5.61
I                        5.76
E                        6.04
S                        7.18
L                        8.39
:
I/L                      14.16
Error = ± 0.5 Da
Average number of matching proteins in the yeast proteome when searching with a peptide mass in the 1000-2000 Da range
Rare amino acids give a bigger search gain

Guanidination of Lysine
[Reaction scheme: lysine + O-methyl isourea gives homoarginine]
[Figure: MALDI spectrum of an enolase tryptic digest (counts versus mass, m/z); peaks labelled K or R]
[Figure: MALDI spectrum of a tryptic digest of enolase after guanidination (counts versus mass, m/z, 800 to 2600); peaks labelled *K or R]

Iterative database search
1. Initial set of search peptides and associated information
2. Search database, compile protein “hit list” with matching peptides
3. Top-scoring protein is matched. Remove corresponding peptides from search list
4. If all initial search peptide masses are matched, stop; else continue searching

Real yeast proteomics
• Alternatives to 2D-gels
  – denaturing technology
  – low abundance spots difficult to identify
• Many steps of orthogonal 1D-steps
  – Size exclusion chromatography
  – Ion exchange chromatography
  – 1D-gels

Yeast proteome sample
[Figure: MALDI spectra before and after guanidination (mass, m/z, 800 to 3600); peaks labelled K or R]

Database search gains
Standard MALDI, 7 search peptides (before guanidination)
  YDR457w    6 out of 7    85.7%
  YGR192c    5 out of 7    71.4%
  YIL192c    5 out of 7    71.4%
  YJR009c    4 out of 7    57.1%
  YDL140c    4 out of 7    57.1%
  YJR109c    4 out of 7    57.1%
Standard MALDI, 12 search peptides (after guanidination)
  YGR192c    10 out of 12  83.3%
  YJR009c    9 out of 12   75.0%
  YBR208c    7 out of 12   58.3%
  YFR031c    6 out of 12   50.0%
  YER075c    6 out of 12   50.0%
  TY1B_LR2   5 out of 12   41.7%
Combined, 19 (7 + 12) search peptides (both experiments)
  YGR192c    15 out of 19  78.9%
  YJR009c    13 out of 19  68.4%
  YDR457w    10 out of 19  52.6%
  YIL129c    10 out of 19  52.6%
  YGR098c    8 out of 19   42.1%
  YFR031c    8 out of 19   42.1%
Proteins matching at least 1 peptide (assignment to the three searches lost in extraction): 2549, 3235, 1656

Database search gains
Search peptides in common (5 from expt 1, 4 from expt 2)
  YGR192c    9 out of 9    100.0%
  YJR009c    7 out of 9    77.8%
  YJL052w    5 out of 9    55.6%
  O7535      4 out of 9    44.4%
  YDR545w    4 out of 9    44.4%
  YBR223c    4 out of 9    44.4%
  Only 289 proteins match at least 1 peptide in both experiments
PTC derivatised, 3 peptides, N-term = Ile/Leu
  YGR192c    3 out of 3    100%
  YJR009c    3 out of 3    100%
  YJL052w    2 out of 3    66.7%
  YLR060w    2 out of 3    66.7%
  YNL271c    2 out of 3    66.7%
  YAL019w    2 out of 3    66.7%
  Only 204 proteins match at least 1 peptide
All 3 sets of experimental data combined
  YGR192c    18 out of 22  81.8%
  YJR009c    16 out of 22  72.7%
  YJL052w    9 out of 22   40.9%
  YLR454w    8 out of 22   36.4%
  YJL165c    6 out of 22   27.3%
  YLR060w    5 out of 22   22.7%
  Only 18 proteins match at least 1 peptide in all 3 experiments
# peptides in common (two columns, assignment lost in extraction): 5, 4, 3, 3, 2, 2 and 3, 2, 2, 2, 2, 2

[Figure: four panels of % unambiguous identification versus total number of search peptides, for S. cerevisiae (1 protein), S. cerevisiae (2 proteins), C. elegans (1 protein) and C. elegans (2 proteins); series: standard, guanidination, PTC (500), PTC (50), Asp-frag, Asp-frag (All)]

Improved bioinformatics approaches for complex mixtures
[Diagram: primary data (input masses) feed a search engine over a database (proteome, proteins, peptides), giving a protein hit list (quantitative data; probability); secondary data (experimental proteome data) feed a rule-based system, giving protein information (qualitative data; possibility); the combined evidence gives the final scores]

Contextual information
• pI (theoretical & experimental)
• Molecular weight (oligomerisation state)
• Subcellular localisation (known, predicted - PSORT)
• Molecular environment (soluble, membrane, DNA-, actin-associated)
• Post-translational modifications (known, putative, predicted)
• Sequence motifs
• Homology relationships
• Non-native state digestions

Scoring systems
• Bayesian approach
$P(k \mid D I) = \frac{P(k \mid I)\, P(D \mid k I)}{P(D \mid I)}$
  – k is the hypothesis that the sample protein is protein k,
  – D is the mass spec fingerprint data,
  – I is background information,
  – P(k|DI) is the posterior probability for k given D and I,
  – P(k|I) is the prior probability of k given I,
  – P(D|kI) is the likelihood of the data D given k and I,
  – P(D|I) is a normalisation constant

QUANTITATIVE PROTEOMICS

DiGE: Difference Gel Electrophoresis
• Ünlü M. et al. (1997). Difference gel electrophoresis: a single gel method for detecting changes in cell extracts. Electrophoresis 18, 2071-2077.
• Sample 1: label with Cy2 in dark, 30 min @ 4 °C
• Sample 2: label with Cy3 in dark, 30 min @ 4 °C
• Sample 3: label with Cy5 in dark, 30 min @ 4 °C
• Quench un-reacted dye by adding 1 mM lysine in dark, 10 min @ 4 °C
• 2D gel electrophoresis

Difference Gel Electrophoresis
[Figure: overlay of Cy3 and Cy5 images (Cy3 + Cy5); spot colours indicate no difference, presence / absence, or up / down-regulation]

Stable Isotope Labelling
• In vivo labelling = isotopes introduced during cell culture
[Figure: 14N and 15N peaks on an m/z axis]
Pro
• Cheap
• Information rich
Con
• Only works for microbes and cell culture????
• Very complex samples
• Have to deduce sequence before assigning pairs
  – Growth of C. elegans on isotopically labelled E. coli

Metabolic labelling of C. elegans
• E. coli grown on 15N nitrogen source; E. coli grown on 14N nitrogen source
• Light mutant with heavy WT; light WT with heavy mutant
• Also grew Drosophila on metabolically labelled yeast
Krijgsveld et al. (2003) Nat. Biotech.

In vitro labelling - continued
I. Isotopes introduced during proteolysis: 18O-labelled water, C-termini
II. Guanidinylation of lysine using isotopes of O-methyl isourea: lysine residues
III. Dimethyl labelling: lysine residues
Pro
• Cheap
• Universal
Con
• Complex peptide mixture
• Small mass difference on MS

ICAT – Isotope Coded Affinity Tags
Gygi SP, et al. Nat Biotechnol 1999, 17:994-999.
Reagent components:
• Biotin affinity tag
• Cleavable linker
• Isotope coded linker: 227 / 236 amu (9 × 13C)
• SH-reactive group (iodoacetamide)
Pros
• Universal
• Simplified sample
Cons
• Protein must contain cysteine

ICAT method
[Structure: biotin; linker (heavy or light, at positions X); thiol-specific reactive group]
Gygi S, Rist B et al. (1999) Nature Biotech. 17: 994.
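Because the heavy and light ICAT reagents differ by nine 13C atoms, labelled peptide pairs appear in a spectrum at a fixed spacing per cysteine. The Python sketch below illustrates how such pairs might be picked out of a peak list. It is an illustration only: the function name, the tolerance, the single assumed charge state and the example peak values are all hypothetical, and real software would also check intensities and isotope envelopes.

```python
C13_SHIFT = 1.003355          # mass difference between 13C and 12C, in Da
ICAT_DELTA = 9 * C13_SHIFT    # heavy minus light reagent, per labelled cysteine


def find_icat_pairs(mz_values, charge=1, max_cys=2, tol=0.02):
    """Return (light_mz, heavy_mz, n_cys) for every pair of peaks whose
    spacing matches n_cys * ICAT_DELTA / charge within `tol` (in m/z)."""
    peaks = sorted(mz_values)
    pairs = []
    for i, light in enumerate(peaks):
        for heavy in peaks[i + 1:]:
            for n_cys in range(1, max_cys + 1):
                expected = n_cys * ICAT_DELTA / charge
                if abs((heavy - light) - expected) <= tol:
                    pairs.append((light, heavy, n_cys))
    return pairs


if __name__ == "__main__":
    # Made-up singly charged peak list: one 1-Cys pair, one 2-Cys pair
    example = [1000.50, 1009.53, 1200.60, 1218.66, 1300.00]
    for light, heavy, n_cys in find_icat_pairs(example):
        print(light, heavy, n_cys)
```

Unlike 15N metabolic labelling, where the shift depends on the number of nitrogen atoms in each peptide, the ICAT spacing depends only on the cysteine count, which is why pairs can be assigned before the sequence is known.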
ICAT workflow (control sample and test sample)
1. Denature (SDS) and reduce (TCEP) each sample to expose free SH groups
2. Label one sample with light reagent and the other with heavy reagent
3. Pool samples
4. Digest overnight with trypsin
5. Purify labelled peptides using avidin column
6. Cleave biotin portion of the tag with concentrated TFA
7. LC-MSMS

iTRAQ
Ross P. et al. Mol Cell Proteomics. 2004 Sep 22
WORKFLOW
• reduce, alkylate (cysteine block) and digest protein sample with trypsin as usual
• label each sample (max of 4) with a different iTRAQ reagent; 100 µg of protein is optimal
• combine all iTRAQ-labelled samples into one sample mixture
• clean up sample by cation-exchange chromatography
• for complex sample mixtures, pre-fractionation is achieved by using a high-resolution cation-exchange column
• analyze the mixture by LC/MS/MS
• results are analysed by Pro Quant software

PROTEIN TURNOVER
The missing dimension of proteomics
JM Pratt, J Petty, I Riba-Garcia, DHL Robertson, SJ Gaskell, SG Oliver, RJ Beynon (2002) Molec. Cell. Proteomics 1, 579-591.

Experimental Approach
• Deuterated leucine labelling, followed by unlabelled chase
• Loss of label from proteins at different rates = turnover
[Figure: protein labelling curve (0 to 1) versus time (0 to 80 h); annotations: 100 ml/h-1; doubling times (0.1 h-1)]
Dilution rate = 0.1 h-1
Half-time = 6.9 h (ln 2 / 0.1 h-1)

Pratt et al., Figure 3
[Figure: MALDI spectra at 100% d9, 50% d9 and 0% d9 labelling; peptides carry L = 0, 1, 2 or 3 leucines; mass shifts of 9 Da (1 Leu) and 27 Da (3 Leu)]
[Figure: spectra (relative abundance, %, versus m/z) during the chase at 0, 4, 6, 8, 12, 25 and 51 h, with expanded regions around m/z 1520 to 1550 and 2100 to 2130]

Pratt et al., Figure 3
[Figure: RIAt versus time (h) for NADP-glutamate dehydrogenase (GDH; 3 peptides), Hsp26 (2 peptides), pyruvate decarboxylase (PDC; 4 peptides) and Hsp71 (4 peptides); bar chart of kloss (h-1) ± SEM for NADP-GDH, Hsp26, Hsp71 and PDC]

Pratt et al., Figure 5
[Figure: degradation rate constant (h-1) ± SEM by protein (spot ID); inset: distribution (%) of degradation rate constants in the bins < 0.01, 0.01-0.02, 0.02-0.03, 0.03-0.04 and > 0.04 h-1]

INTEGRATION

Evaluating protein-interaction data
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403.
Cornell M, Paton NW, Oliver SG (2004) A critical and integrated view of the yeast interactome. Comp. Funct. Genom. 5, 382-402.

(A) DNA binding domain fused to Protein A [UAS, promoter, LacZ reporter gene]. The fusion of the “bait” protein and the DNA binding domain of the transcriptional activator cannot turn on the reporter gene.
(B) Activator region fused to Protein B [UAS, promoter, LacZ reporter gene]. The fusion of the “prey” protein and the activating region of the transcriptional activator is also insufficient to switch on the reporter.
(C) DNA binding domain fused to Protein A, activator region fused to Protein B [UAS, promoter, transcription of the LacZ reporter gene]. The association of “bait” and “prey” brings the DNA binding domain and the activator region close enough to switch on the reporter gene and turn yeast blue.
Fig. 1 How the two-hybrid system detects protein associations in yeast.
Schematic representation of the two-hybrid system in case of interaction of proteins A and B
[Diagram: DNA-binding domain fused to A, bound at the UAS; activation domain fused to B; RNA Pol II; reporter gene; gene expression]

Schematic representation of the two-hybrid system in absence of interaction of proteins A and B
[Diagram: DNA-binding domain fused to A, bound at the UAS; activation domain fused to B; RNA Pol II; reporter gene; NO TRANSCRIPT]

Synthetic lethals
Definition: lethality is caused by mutating two or more genes
[Diagram: a single essential pathway (gene1 to gene5) compared with functionally overlapping pathways (gene1 to gene5 alongside geneA to geneC)]

Asparagine-linked Glycosylation
• Dol-PP-GlcNAc2Man9Glc3 (substrate); ALG genes are responsible for the core synthesis
• Acceptor: Asp-NH2 - X - SER/THR, giving Asp-NH-GlcNAc2Man9Glc3
• Oligosaccharyltransferase subunits: STT3, OST1, WBP1, OST3, OST6, SWP1, OST2, OST5, OST4
• alg mutations are synthetically lethal with conditional mutations affecting oligosaccharyltransferase activity

Integrating complex data with yeast two-hybrid data
• Complex consists of six proteins A, B, C, D, E, F
• In a yeast two-hybrid experiment, A interacts with another protein. Is it B, C, D, E or F?

Large-scale interaction data and the distribution of interactions according to functional categories.
Quantitative comparison of interaction datasets.

Set of confirmed Y2H interactions
Confirmation of an interaction requires:
1. Identification in more than one Y2H screen, OR
2. The reverse interaction must have been identified, OR
3. The two proteins must have been identified in the same protein complex (from either classical or high-throughput affinity purification studies).
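The three confirmation rules above can be expressed as a short filter. The Python sketch below is one possible reading of them, not the authors' implementation: the function name and data layout are hypothetical, each screen is assumed to be a set of directed (bait, prey) pairs, and each complex a set of protein names.

```python
def confirmed_interactions(screens, complexes):
    """Apply the three confirmation rules to Y2H data.

    screens:   list of sets of (bait, prey) tuples, one set per Y2H screen
    complexes: list of sets of protein names, one set per known complex
    Returns a set of frozensets, one per confirmed (unordered) interaction.
    """
    all_pairs = set().union(*screens)
    confirmed = set()
    for bait, prey in all_pairs:
        # Rule 1: the same bait-prey pair was found in more than one screen
        in_screens = sum((bait, prey) in screen for screen in screens)
        # Rule 2: the reverse (prey as bait) interaction was also found
        reverse_seen = bait != prey and (prey, bait) in all_pairs
        # Rule 3: both proteins were identified in the same complex
        same_complex = any(bait in c and prey in c for c in complexes)
        if in_screens > 1 or reverse_seen or same_complex:
            confirmed.add(frozenset((bait, prey)))
    return confirmed


if __name__ == "__main__":
    # Made-up example: A-B seen twice, C-D seen in both directions,
    # G-H supported by a complex, E-F unsupported
    screens = [
        {("A", "B"), ("C", "D"), ("E", "F")},
        {("A", "B"), ("D", "C"), ("G", "H")},
    ]
    complexes = [{"G", "H", "I"}]
    for pair in sorted(sorted(p) for p in confirmed_interactions(screens, complexes)):
        print(pair)
```

Storing each confirmed interaction as a frozenset treats A-B and B-A as the same interaction, which matches the way the rules treat a reverse hit as confirmation rather than as a separate result.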
A total of 451 reliable interactions, involving 581 proteins, have been identified from a combined data set comprising 5214 interactions and 4025 proteins

PEDRo: A Systematic Approach to Modelling, Capturing and Disseminating Proteomics Data
Taylor CF, Paton NW, Garwood KL, Kirby PD, Stead DA, Yin Z, Deutsch EW, Selway L, Walker J, Riba-Garcia I, Mohammed S, Deery MJ, Howard JA, Dunkley T, Aebersold R, Kell DB, Lilley KS, Roepstorff P, Yates JR III, Brass A, Brown AJP, Cash P, Gaskell SJ, Hubbard SJ, Oliver SG (2003) Nature Biotechnol. 21, 247-254.
Garwood K, McLaughlin T, Garwood C, Joens S, Morrison N, Taylor CF, Carroll K, Evans C, Whetton AD, Hart S, Stead D, Yin Z, Brown AJP, Hesketh A, Chater K, Hansson L, Mewissen M, Ghazal P, Howard J, Lilley KS, Gaskell SJ, Brass A, Hubbard SJ, Oliver SG, Paton NW (2004) PEDRo: A database for storing, searching and disseminating experimental proteomics data. BMC Genomics 5, 68 doi:10.1186/1471-2164-5-68.

Proteomics: the state of play
• The volume of generated proteome data is rapidly increasing
  – Movement towards high-throughput approaches
  – Experimental techniques increasing in complexity
  – Analyses also increasing in complexity
• Current publicly available proteomics data is limited
  – 2D-gel image databases (e.g. SWISS-2DPAGE) contain little information about sample preparation, or analysis of results
  – No widely used databases of mass spectrometry data or analyses
• A robust, future-proofed, standard representation of both methods and data from proteomics experiments is required
  – Analogous to the MIAME guidelines for transcriptomics
  – Users will know what to expect from datasets (formats etc.)
  – Will facilitate handling, exchange and dissemination of data
  – Will guide the development of effective search/analysis tools

PEDRo and PEML
• The PEDRo (Proteome Experiment Data Repository) model
  – Specifies the information required about a proteomics experiment
    • sufficient information to exactly replicate that experiment
  – Organised in a manner reflecting the procedures that generated it
  – Flexible enough to accommodate new technological developments
  – Described in UML (Unified Modeling Language), making it implementation-independent (effectively a generic blueprint)
    • Implemented in SQL (the relational database repository)
    • Also implemented in Java (later slide), and XML (next bullet)
• PEML (Proteomics Experiment Markup Language)
  – The XML implementation of PEDRo for data exchange and rapid dissemination (using XSLT to display PEML files as web pages)
• Two benefits arising from early implementation of the model
  – Implementation allows the underlying technologies to be tested
  – Making explicit what data might most usefully be captured about proteomics experiments will speed the model’s evolution

The nature of proteomics experiment data
• Sample generation
  – Origin of sample
    • hypothesis, organism, environment, preparation, paper citations
• Sample processing
  – Gels (1D/2D) and columns
    • images, gel type and ranges, band/spot coordinates
    • stationary and mobile phases, flow rate, temperature, fraction details
• Mass spectrometry
    • machine type, ion source, voltages
• In silico analysis
    • peak lists, database name + version, partial sequence, search parameters, search hits, accession numbers

The PEDRo UML schema in reduced form
Classes: Organism, TaggingProcess, OntologyEntry, PercentX, MobilePhaseComponent, AssayDataPoint, SampleOrigin, GradientStep, Column, OtherAnalyteProcessingStep, ChemicalTreatment, Fraction, AnalyteProcessingStep, OtherAnalyte, Analyte, Sample, TreatedAnalyte, Experiment, MassSpecMachine, RelatedGelItem, mzAnalysis, IonSource, GelItem, Electrospray, BoundaryPoint,
DiGEGelItem, Gel1D, Gel, Detection, Spot, Gel2D, DiGEGel, Tandem, SequenceData, MSMSFraction, IonTrap, PeakList, MALDI, Band, MassSpecExperiment, DBSearch, ToF, DBSearchParameters, ListProcessing, OtherIonisation, Hexapole, PeptideHit, Peak, ProteinHit, OthermzAnalysis, Quadrupole, CollisionCell, ChromatogramPoint, PeakSpecificChromatogramIntegration, Protein

PEDRo UML Class Diagram
Key to colours: Sample Generation; Sample Processing; Mass Spectrometry; MS Results Analysis
[Figure: full UML class diagram giving the attributes and multiplicities of the classes listed above, e.g. Experiment (hypothesis, method_citations, result_citations) and MassSpecMachine (manufacturer, model_name, software_version)]

The Framework Around PEDRo
1. Lab generated data is encoded using the PEDRo data entry tool, producing an XML (PEML) file for local storage, or submission
2. Locally stored PEML files may be viewed in a web browser (with XSLT), allowing web pages to be quickly generated from datasets
3. Upon receipt of a PEML file at the repository site, a validation tool checks the file before entering it into the database
4. The repository (a relational database) holds submitted data, allowing various analyses to be performed, or data to be extracted as a PEML file or another format

The PEDRo Data Collator
• The tool with which a user enters information about, and data from, proteomics experiments
  – The tool collates these data into a single PEML file
  – The hierarchical nature of the PEDRo schema (and PEML) is reflected in the structure of the data entry tool
    • Successive stages of the experimental design are added as ‘children’ of the previous stage
    • Enforces an audit trail for data; e.g. details of a gel cannot be entered without first describing the sample
• A simple, filterable list of all the sub-records present and a tree-style browser act as ‘index’ and ‘contents’ for the PEML file being edited

Conclusions
• The PEDRo model does require a substantial amount of data
  – Much of this information will be available in the lab of origin
  – Some data will be common to many experiments, and therefore need only be entered once, then saved as a template in PEDRoDC
• But there are several advantages to adopting such a model
  – All datasets will contain information sufficient to quickly establish the provenance and relevance (to the researcher) of a dataset
  – Datasets will be detailed enough to allow non-standard searches, for example, by sample extraction technique
  – Tools can be developed that allow easy access to large numbers of such datasets, from a wide range of proteomics sites
  – Integration with other resources, such as the major sequence databases, will provide sophisticated search and analysis capability
  – Information exchange between researchers will be facilitated through the use of a common language (PEML), and the ability to rapidly display PEML-encoded data as a web page