InnoMol Proteomics Workshop
April 8, 2014

Principles of Shotgun Proteomics and Proteogenomics
Boris Maček
Proteome Center Tuebingen

General MS-based proteomics workflow
[Figure: Aebersold R and Mann M. 2003. Nature 422: 198-207]

Principle of protein database search
[Figure: measured fragment ion spectrum (intensity vs. m/z; peak labels A, S, L, K, G, A) next to theoretical spectra derived from the database]
• Database: translated genomic sequence → theoretical spectra for proteins
• Theoretical spectra that fall into the defined mass range are selected; each of them is compared to our fragment ion spectra.

Principle of protein database search
MaxQuant Software
[Figure: measured fragment ion spectrum (intensity vs. m/z; peak labels A, S, L, K, G, A) searched against the sequence database excerpted here]

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGAR
RSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQP
ESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLG
LALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLT
LWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARR
ASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANT
GESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRL
GLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNL
TLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|P62258-2|1433E_HUMAN Isoform SV of 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE
MVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKL
KMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFA
TGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACR
LAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDV
EDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARR
SSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCND
FQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPI
RLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRD
NLTLWTSDQQDEEAGEGN
>tr|F2Z3E5|F2Z3E5_HUMAN Hydroxyacid-oxoacid transhydrogenase, mitochondrial OS=Homo sapiens GN=ADHFE1 PE=4 SV=1
MAAAARARVAYLLRQLQRAACQCPTHSHTYSQDGCFKY
>tr|Q5SS58|Q5SS58_HUMAN MHC class I polypeptide-related sequence A OS=Homo sapiens GN=MICA PE=4 SV=2
MGQRDQGLDRERKGPQDDPGSYQGPERRNFLKEDAMKTKTHYHAMHADCLQELRRYL
ESGVVLRRTVPPMVNVTRSEASEGNITVTCRASSFYPRNIILTWRQDGVSLSHDTQQ
WGDVLPDGNGTYQTWVATRICRGEEQRFTCYMEHSGNHSTHPVPSGKVLVLQSHWQT
FHVSAVAAGCCYFCYYYFLCPLL
>tr|Q5T409|Q5T409_HUMAN Disrupted in schizophrenia 1 OS=Homo sapiens GN=DISC1 PE=2 SV=1
MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFL
SPAVGTLFRFPGGVSGEESHHSESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGT
SAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLDASWEAACSDGARRV
RAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEG
CPPSREAESHCQSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSS
RPERDMHSLPDMDPGSSSSLDPSLAGCGGDGSSGSGDAHSWDTLLRKWEPVLRDCLL
RNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSRQP
ALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRD
WLLQEKQQLQKEIEALQARMFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSL
GQLQEVSKALQDTLASAGQIPFHAEPPETIRSLQERIKSLNLSLKEITTKVCMSEKF
CSTLRKKVNDIETQLPALLEAKMHAISGNHFWTAKDLTEEIRSLTSEREGLEGLLSK
LLVLSSRNVKKLGSVKEDYNRLRREVEHQETAYETSVKENTMKYMETLKNKLCSCKC
PLLGKVWEADLEACRLLIQSLQLQEARGSLSVEDERQMDDLEGAAPPIPPRLHSEDK
RKTPLKESYILSAELGEKCEDIGKKLLYLEDQLHTAIHSHDEDLIHSLRRELQMVKE
TLQAMILQLQPAKEAGEREAAASCMTAGVHEAQA

• Database: translated genomic sequence → theoretical spectra for proteins
• Homo sapiens Reference Proteome: 71,434 entries (20,246 reviewed proteins; 51,188 un-reviewed)

MS instrumentation in proteomics
[Figure: Aebersold R and Mann M. 2003. Nature 422: 198-207]

Coupling LC to MS for complex mixture analysis
Nanoflow LC/MS interface set-up:
• Column (75 µm)/spray tip (8 µm)
• Proxeon Easy nLC nanoflow LC System
• Reverse-phase C18 beads, 3 µm
• LTQ-Orbitrap
• No precolumn or split!
• Platinum wire, 2.0 kV
• 12-15 cm
• Sample loading: ~700 nl/min
• Gradient elution: ~200 nl/min

Coupling LC to MS for complex mixture analysis
[Figure: BSA tryptic in-solution digest, 50 fmol on column]

LTQ-Orbitrap (2005)
[Figure: instrument schematic; labels: Source, Linear ion trap (LTQ), C-Trap, Octopole coll. cell, Orbitrap]

LTQ-FT MS/MS optimized scan cycle:
• Orbitrap-MS: MS full scan → peptide mass measurement
• LTQ-MS: successive MS2 scans → peptide sequencing
[Figure: scan cycle timeline; time axis 0-1800 msec]

Data processing workflow: MaxQuant
[Figure]

Acquisition speed
[Figure: LTQ Orbitrap XL vs. LTQ Orbitrap Velos; legend: CID identified, CID not identified]

Acquisition speed
[Figure: number of MS/MS scans (axis 0-120,000) at 60 min, 100 min, 140 min and 240 min for LTQ Orbitrap XL (2007), LTQ Orbitrap Velos (2009) and LTQ Orbitrap Elite (2011)]

Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC)
• "Normal AA" (Lys-12C6) and "heavy AA" (Lys-13C6)
• Resting cells and treated cells (drug, GF)
• Combine and lyse, protein purification or fractionation
• Proteolysis (trypsin, Lys-C, etc.)
• Quantitation and identification by MS (nanoscale LC-MS/MS)

Current research at the PCT
• Proteogenomics
    • B. subtilis, E. coli (Krug et al., 2011, Mol Biosystems; 2013, MCP)
    • Pristionchus pacificus (Borchert et al., 2010, Genome Res)
    • cancer cell lines/tissues
• Proteomics for systems biology
    • In-depth sequencing and quantitation of model organisms (B. subtilis, E. coli, S. pombe, A. thaliana) (Soufi et al., 2010, J Prot Res; Schütz et al., 2011, Plant Cell; Soufi et al., 2012, Curr Opinion Microbiol; Soares et al., 2013, JPR)
• Phosphoproteomics
    • targets of Aurora kinase in S. pombe (Koch et al., 2011, Science Signaling)
    • targets of protein kinase D in human cells (Franz-Wachtel et al., 2012, MCP)
    • targets of S/T/Y kinases and phosphatases in B. subtilis and E. coli
• Protein modifications
    • ubiquitylation (Ikeda et al., 2011, Nature)
    • lysine acetylation (Carpy et al., in preparation)
• Clinical proteomics
    • genetic rescue of Fragile X phenotype in FMR1 KO mice

Super-SILAC in Bacteria
E. coli: Replicate 1 and 2

Parameter                     Number
Total MS/MS                   757,835
Total Peptides Identified     18,273
Total Proteins Identified     2,292
Single Peptide Hits           6.5%
Total Proteins Quantified*    1,923
*in all phases of growth
(Soufi et al., in preparation)

Biological reproducibility
[Figure] (Soufi et al., in preparation)

Proteome dynamics during growth
[Figure] (Soufi et al., in preparation)

Dynamics of stress proteins during growth
[Figure] (Soufi et al., in preparation)

Estimation of absolute copy numbers
[Figure: growth curve, OD 600 vs. time (min), with sampling points T1-T7; UPS standard (iBAQ)] (Soufi et al., in preparation)

Summary of absolutely quantified proteins

                           During Growth    Membrane Proteins
Identified                 2,292            684
Quantified (All Phases)    1,923            588
Absolutely Quantified      2,096            494
(Soufi et al., in preparation)

Most abundant Proteins (ES)

Protein                                          Copies per cell (ES)
Elongation factor Tu 1;P-43                      341,047.56
Outer membrane protein A                         313,464.22
Braun lipoprotein                                216,037.00
Cysteine synthase A;O                            187,791.26
Enolase                                          164,914.38
DNA-binding protein HU-alpha                     136,208.45
Scavengase P20;Thiol peroxidase                  131,599.61
Glyceraldehyde-3-phosphate dehydrogenase A       127,416.09
Malate dehydrogenase                             123,943.77
IDP;Isocitrate dehydrogenase [NADP]              117,787.02
High-affinity zinc uptake system protein znuA    111,748.80
Cadmium-induced protein yodA                     107,098.12
Outer membrane protein C                         106,108.02
50S ribosomal protein L6                         98,724.11
Universal stress protein A                       94,784.63
(Soufi et al., in preparation)

Dynamic range of protein abundance
[Figure: histogram, count vs. log2 protein copy number; blue: all proteins, red: membrane proteins] (Soufi et al., in preparation)

Proteogenomics
• Application of tandem mass spectrometry to genome re-annotation
• Search MS/MS spectra against a database containing the complete genome translated in 6 reading frames

Problem: database size and structure
• "Usual" proteomics applications: Predicted ORFs; REV_Predicted ORFs
• Proteogenomics applications: Predicted ORFs, Frame1 to Frame6; REV_Predicted ORFs, REV_Frame1 to REV_Frame6
• Incompatibility with some data processing programs
• Long search times
• Decreased sensitivity of database search
• Unequal target and decoy search spaces
• Most translated frames are in fact decoy sequences
• Overestimation of the FDR

Proteogenomics of E. coli
• Model Gram-negative bacterium
• Small (4.6 Mb) and well characterized genome
• ~4,300 protein coding genes (manually annotated and reviewed)
• Comprehensive high accuracy MS dataset comprising >42,000 unique peptide sequences from >2,600 proteins
• Hypothesis: genome annotation approaches completeness
• Assessment of general properties of a simple proteogenomic experiment

Results I

                                MQ           TPP
MS/MS spectra acquired          1,941,724    1,941,724
MS/MS spectra identified        370,231      162,028
MS/MS spectra identified (%)    19.1         8.3
Peptide sequences               33,964       25,724
Novel peptides                  263          59
Decoy peptides                  336          0
Lab contaminant peptides        306          209
E. coli proteins                2,653        2,524

Proteogenomics of E. coli
[Figure: 1.9M peptide mass spectra]

Proteogenomics of E. coli
[Figure, panels A-D; tracks: annotated genes, detected peptides, six-frame ORFs; x-axis: position (Mb)] (Krug et al., Mol Cell Proteomics, 2013)
• Panels A/B: fes, fepA, ybdZ locus; PEP = 4.02E-08, PP = 0.9999
    • fes: MFEVTFWWRDPQGSEEY...
    • Detected peptide: VGSESWWQSK
    • Sequence context: TWGYGVTALKVGSESWWQSKHGPEWQRLNDEMFEVTFWWRDPQGSEEY...
• Panels C/D: yhjA, yhjB, treF locus; PEP = 0.027976, PP = 0.9504
    • treF: MLNQKIQNPNPDELMIEVDLCYELDPYELKLDEMIEAEP...
    • Detected peptide: KPPQIRISL
    • Sequence context: ...NAVFKPPQIRISL and LATNFGGWILMLNQKIQNPNPDELMIEVDLCYELDPYELKLDEMIEAEP...

Majority of Novel Peptides are False Positives
[Figure] (Krug et al., Mol Cell Proteomics, 2013)

Assessment of Processing Workflows
[Figure] (Krug et al., Mol Cell Proteomics, 2013)

Deep Proteome Coverage of Escherichia coli
[Figure: distribution of MS/MS scans (axis 0-100); mean: 20 scans; median: 7 scans] (Krug et al., Mol Cell Proteomics, 2013)
• 20-fold base coverage of 27.5% genome sequence

Conclusions
• proteomics reaches analytical capacity to identify and quantify all gene products in microorganisms grown in culture
• several regulatory protein modifications (e.g. S/T/Y-phosphorylation, lysine acetylation) can routinely be analyzed on a global scale
• many challenges ahead:
    • analysis of H/D-phosphorylation
    • analysis of environmental samples
    • coverage of genome/protein sequence by detected peptides
• future developments:
    • faster MS/MS acquisition
    • smarter acquisition software
    • large-scale targeted proteomics
    • metaproteomics and individual proteomics

Acknowledgements
Proteome Center Tuebingen
Boumediene Soufi
Nelson C. Soares
Philipp Spät
Karsten Krug
Alejandro Carpy
Sasa Popic
Silke Wahl

Funding
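
Supplementary sketch: theoretical spectra for database search
The "Principle of protein database search" slides describe deriving theoretical spectra from database sequences and comparing them with measured fragment ion spectra. The following is a minimal sketch of that idea, not the MaxQuant implementation. The function names, the cleavage rule (after K or R unless followed by P, no missed cleavages), the restriction to singly charged b- and y-ions, and the 0.02 m/z tolerance are all illustrative assumptions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def fragment_ions(peptide):
    """Return m/z lists of singly charged b-ions and y-ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


def count_matches(observed_mz, theoretical_mz, tol=0.02):
    """Count theoretical peaks that have an observed peak within tol."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in theoretical_mz
    )


# Peptide taken from the E. coli proteogenomics slide.
b_ions, y_ions = fragment_ions("VGSESWWQSK")
```

A real search engine would also handle charge states, modifications, missed cleavages and a statistical score; the peak count here only illustrates the comparison step.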
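
Supplementary sketch: six-frame translation
The proteogenomics slides describe searching MS/MS spectra against the complete genome translated in 6 reading frames. A minimal sketch of such a translation is given here. It uses the standard codon assignments; the frame naming (Frame1 to Frame3 on the forward strand, Frame4 to Frame6 on the reverse complement), the stop-to-stop ORF definition and the minimum ORF length are assumptions made for illustration and are not taken from the slides.

```python
from itertools import product

# Standard codon table, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping Frame1..Frame6 to translated sequences."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"Frame{offset + 1}"] = translate(dna[offset:])
        frames[f"Frame{offset + 4}"] = translate(reverse[offset:])
    return frames


def stop_to_stop_orfs(protein, min_length=20):
    """Split a translated frame at stop codons and keep long segments."""
    return [orf for orf in protein.split("*") if len(orf) >= min_length]


frames = six_frame_translation("ATGGCCAAATAA")
```

Because every frame of every strand is translated, most of the resulting sequences are never expressed, which is the database-size problem named on the slides.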
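
Supplementary sketch: target-decoy FDR estimate
The "Problem: database size and structure" slide lists REV_ entries next to the target sequences and warns about unequal target and decoy search spaces and an overestimated FDR. The sketch below shows one common way such an estimate can be formed: reversed sequences serve as decoys, and the FDR is approximated as accepted decoy hits divided by accepted target hits. The function names, the "higher score is better" convention and the `space_ratio` correction (decoy search space divided by target search space) are assumptions for illustration, not the procedure used in the presented work.

```python
def add_reversed_decoys(targets):
    """Return a database with a reversed REV_ decoy for every target."""
    database = dict(targets)
    for name, sequence in targets.items():
        database["REV_" + name] = sequence[::-1]
    return database


def estimate_fdr(psms, threshold, space_ratio=1.0):
    """Estimate FDR among matches scoring at or above the threshold.

    psms: iterable of (database entry name, score) pairs.
    space_ratio: assumed decoy/target search-space ratio; 1.0 means
    equally sized spaces, larger values scale the decoy count down.
    """
    accepted = [name for name, score in psms if score >= threshold]
    n_decoy = sum(name.startswith("REV_") for name in accepted)
    n_target = len(accepted) - n_decoy
    if n_target == 0:
        return 0.0
    return (n_decoy / space_ratio) / n_target


database = add_reversed_decoys({"fes": "MFEVTFWWR"})

# Hypothetical scores, invented purely to exercise the function.
example_psms = [
    ("fes", 90), ("treF", 80), ("REV_fes", 75), ("yhjA", 70), ("REV_treF", 40),
]
fdr = estimate_fdr(example_psms, threshold=60)
```

With equal search spaces the ratio is left at 1.0; when one side is larger, a naive decoy count no longer reflects the expected number of false target hits, which is one way to read the slide's warning.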