Top-down characterization of proteins in bacteria with unsequenced genomes
Nathan Edwards
Georgetown University Medical Center

Microorganism Identification
- Homeland-security/defense applications
- Clinical applications in strain identification: selection of treatment and/or antibiotics
- New applications in microbiome analysis: bacterial colonies in gut, ....; chronic wound infections
- Long history of fingerprinting approaches
- Compete with genomic approaches (PCR, next-gen sequencing)? Primary sales pitch is speed.

Microorganism Identifications
- Match spectra with proteome (or genome) sequence for (species) identity
- Provides robust match with respect to instrumentation and sample prep
- Many bacteria will never be sequenced or "finished" (pathogen simulants, for example), but many have: about 2500 to date.
- Can we use the available sequence to identify proteins from unknown, unsequenced bacteria? Yes, for some proteins in some organisms!

Intact protein LC-MS/MS
- Crude cell lysate
- Capillary HPLC: C8 column
- LTQ-Orbitrap XL
  - Precursor scan: resolution 30,000 @ 400 m/z
  - Data-dependent precursor selection: 5 most abundant ions, 10 second dynamic exclusion, charge state +3 or greater
  - CAD product ion scan: resolution 15,000 @ 400 m/z

CID Protein Fragmentation Spectrum from Y. rohdei
[Figure: chromatogram (yr_inclusion, time 19.5-25.0 min) above an FTMS MS2 spectrum of precursor m/z 756.70 (charge +8, MW 6044.11), cid35.00, scans #1937-2437, RT 19.45-24.36 min, average of 21 scans, NL 4.80E4, m/z range 195.00-2000.00.]

Enterobacteriaceae Protein Sequences
- Exhaustive set of all Enterobacteriaceae family protein sequences from Swiss-Prot, TrEMBL, RefSeq, GenBank, and [CMR], plus Glimmer3 predictions on RefSeq Enterobacteriaceae genomes
- Primary and alternative translation start sites
- Filter for intact mass in range 1 kDa to 20 kDa
- 253,626 distinct protein sequences, 256 species
- Derived from "Rapid Microorganism Identification Database" (RMIDb.org) infrastructure.

ProSightPC 2.0
- Product ion scan decharging
  - Enabled by high-resolution fragment ion measurements
  - THRASH algorithm implementation
- Absolute mass search mode
  - 15 ppm fragment ion match tolerance
  - 250 Da precursor ion match tolerance
- "Single-click" analysis of entire LC-MS/MS datafile.

Other tools
- Explored using standard search engines:
  - Decharge and format as charge +1 spectrum
  - X!Tandem scoring plugin (ProSight, delta M)
  - OMSSA, Mascot, etc.
- MS-Tools: MS-Deconv, MS-TopDown, MS-Align, MS-Align+, MS-Align-E!

CID Protein Fragmentation Spectrum from Y. rohdei
[Figure: the same MS2 spectrum of precursor m/z 756.70 (charge +8, MW 6044.11), annotated "Match to Y. pestis 50S Ribosomal Protein L32".]

Exact match sequence…

Phylogeny: Protein vs DNA
[Figure: two panels, "Protein Sequence" and "16S-rRNA Sequence".]

What about mixtures?
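The ProSightPC search settings listed above (15 ppm fragment ion match tolerance, 250 Da precursor ion match tolerance) can be illustrated with a minimal sketch. This is not ProSightPC's implementation; the function names and the example masses are hypothetical, and all masses are assumed to be decharged neutral masses in Da.

```python
# Minimal sketch of tolerance-based matching, assuming decharged (neutral) masses in Da.
# Function names and example values are hypothetical, not taken from ProSightPC.

def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def fragment_matches(observed, theoretical, tol_ppm=15.0):
    """True if a fragment mass lies within the relative (ppm) tolerance."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


def precursor_matches(observed_mass, candidate_mass, tol_da=250.0):
    """True if a candidate protein mass lies within the absolute (Da) tolerance."""
    return abs(observed_mass - candidate_mass) <= tol_da


def count_matched_fragments(observed, theoretical, tol_ppm=15.0):
    """Count theoretical fragments matched by at least one observed mass."""
    return sum(
        any(fragment_matches(o, t, tol_ppm) for o in observed)
        for t in theoretical
    )


if __name__ == "__main__":
    # Hypothetical candidate masses tested against the 6044.11 Da precursor.
    for candidate in (6044.11, 6200.0, 6300.0):
        print(candidate, precursor_matches(6044.11, candidate))
```

The wide absolute window on the precursor admits candidate proteins whose mass differs from the measured one, while the tight relative window on fragments keeps individual ion matches specific.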
Shared Small Ribosomal Proteins

Identified E. herbicola proteins
- 30S Ribosomal Protein S19: m/z 686.39, z 15+, E-value 1.96e-16, Δ 0.007
- Six proteins identified with |Δ| < 0.02

Identified E. herbicola proteins
- DNA-binding protein HU-alpha: m/z 732.71, z 13+, E-value 7.5e-26, Δ -14.128
- Eight proteins identified with "large" |Δ|

Identified E. herbicola proteins
- DNA-binding protein HU-alpha: m/z 732.71, z 13+, E-value 1.91e-58
- Use "Sequence Gazer" to find mass shift
- ΔM mode can "tolerate" one shift for free!

ProSightPC: ΔM mode
- Match a single "blind" mass-shift for free!
- Matched fragment series: b-, b'-, y- and y'-ions
- Also: PIITA (Tsai et al. 2009)
[Figure: protein sequence drawn against the experimental precursor, with the difference labeled ΔM; b- and y-ions and the shifted b'- and y'-ions are marked.]

Identified E. herbicola proteins
- DNA-binding protein HU-alpha: m/z 732.71, z 13+, E-value 7.5e-26, Δ -14.128
- Extract N- and C-terminus sequence supported by at least 3 b- or y-ions

E. herbicola protein sequences

E. herbicola sequences found in other species

Phylogenetic placement of E. herbicola
- Phylogram
- Cladogram
- phylogeny.fr, "One-Click"

Genome annotation errors
- UniProt: E. coli Cell division protein ZapB
- MQFRRGMTMSLEVFEKLEAKVQQAIDTITL…
- Need ±1500 Da precursor tolerance…
[Figure: values across E. coli strains: 3 (204), 17 (166), 0 (2), 22 (371).]

Conclusions
- Protein identification for unsequenced organisms.
- Identification and localization for sequence mutations and post-translational modifications.
- Extraction of confidently established sequence suitable for phylogenetic analysis.
- Genome annotation correction.
- New paradigm for phylogenetic analysis?

Acknowledgements
- Dr. Catherine Fenselau; Avantika Dhabaria, Joe Cannon*, Colin Wynne* (University of Maryland Biochemistry)
- Dr. Yan Wang (University of Maryland Proteomics Core)
- Dr. Art Delcher (University of Maryland CBCB)
- Funding: NIH/NCI
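Supplementary sketch: the ΔM mode described in the ProSightPC slides, where b'- and y'-ions are the b- and y-ions shifted by the precursor mass difference ΔM, can be illustrated as follows. This is a simplified illustration and not ProSightPC's algorithm. It assumes neutral monoisotopic fragment masses (b = sum of residues, y = sum of residues plus water), uses a deliberately small residue table, and all function names and the test peptide are hypothetical.

```python
# Sketch of "ΔM mode": search unshifted (b, y) and ΔM-shifted (b', y') fragment series.
# Assumptions: neutral monoisotopic masses; names and example peptide are hypothetical.

WATER = 18.010565
RESIDUE_MASS = {  # monoisotopic residue masses (Da), small subset for illustration
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}


def protein_mass(sequence):
    """Neutral monoisotopic mass of the unmodified sequence."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def fragment_ladders(sequence):
    """Return (b, y) neutral fragment mass lists for an unmodified sequence."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b, total = [], 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        total += m
        b.append(total)
    y, total = [], WATER
    for m in reversed(masses[1:]):   # C-terminal suffixes
        total += m
        y.append(total)
    return b, y


def delta_m_series(sequence, observed_precursor_mass):
    """Build b, y and the ΔM-shifted b', y' series; also return ΔM."""
    delta = observed_precursor_mass - protein_mass(sequence)
    b, y = fragment_ladders(sequence)
    series = {
        "b": b, "y": y,
        "b'": [m + delta for m in b],
        "y'": [m + delta for m in y],
    }
    return series, delta


def count_matched(observed, theoretical, tol_ppm=15.0):
    """Count observed masses within tol_ppm of any theoretical mass."""
    return sum(
        any(abs(o - t) / t * 1e6 <= tol_ppm for t in theoretical)
        for o in observed
    )
```

Fragments that do not contain the modified residue match the unshifted series, and fragments that do contain it match the shifted series, so a single shift of unknown position is tolerated without enumerating its location.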