Prediction and identification of novel proteins from the marine cyanophage S-PM2

Konstantinos Thalassinos; Susan E. Slade; James H. Scrivens; Martha R. Clokie; Nicholas H. Mann
Biological Mass Spectrometry and Proteomics Group, University of Warwick, Coventry, CV4 7AL, U.K.

OVERVIEW

Purpose
To undertake a holistic proteomic study of a novel cyanobacterial virus (S-PM2). In parallel, the well-characterised proteins of the Escherichia coli virus T4 were identified and then compared to their “homologues” in S-PM2.

Methods
Candidate protein sequences were generated from the genomic sequence of S-PM2 using three programs: the Expasy Translate Tool, which translates all six reading frames, and the gene prediction programs Glimmer and GeneMarkS. The output was converted to a Fasta format suitable for database searching. Mass spectrometric analysis and subsequent protein identification were achieved using both MALDI-MS and LC-ESI-MS/MS techniques.

INTRODUCTION

T4 is a virus (phage) that infects the enteric bacterium Escherichia coli; see Figure 1. Extensive studies have fully characterised the genes and proteins involved in host infection, phage DNA replication and phage protein translation prior to the self-assembly of mature viral particles for release from the cell.

Figure 2. Genome of cyanophage S-PM2 showing predicted gene function and homologies to similar genes in other organisms. Legend (ORF function): structural / homology to T4; cyanobacterial / bacterial; enzyme; other; tRNAs; unknown.

This study aims to identify the structural proteins of S-PM2 where little or no homology exists with known DNA sequences. The work began at the nucleic acid level by predicting genes and their products. A mass spectrometric approach was then used for protein identification, and the information generated was used to annotate the genome; see Figure 3.
The DNA sequence of S-PM2 was analysed using two gene prediction programs, GeneMarkS and Glimmer, which use Markov models (interpolated Markov models in the case of Glimmer) to assess the protein-coding potential of stretches of DNA and distinguish them from non-coding regions. A third, less specific program, the Expasy Translate Tool, which simply translates nucleotide sequences into protein sequences, was used for comparison. A mass spectrometry-based proteomic approach was taken to identify gel-resolved, purified phage proteins from both S-PM2 and T4. The amino acid sequences obtained from the identified S-PM2 proteins were compared with the predicted sequences generated by the three programs. The focus of the study then returned to the genome, annotating the identified genes with their respective functions.

Figure 3. Overview of a holistic proteomic study. Workflow labels: Genome; Experimental; Protein extract; Proteins; Separation (1D / 2D gels); Digestion (enzymatic, using trypsin); Theoretical; Predict proteins from genome; Create databases of predicted proteins.

Table 1. Selected structural proteins from T4 and S-PM2, indicating those identified by MS-based techniques.

Function | T4 protein | S-PM2 ORF | Size (Da) | S-PM2 MS identification
Major capsid protein | Gp23 | 108 | 51981 | Yes
Tail completion protein | Gp3 | 110 | 20681 | Yes
Tail tube protein | Gp19 | 103 | 23048 | Yes
Neck protein | Gp13 | 95 | 32316 | Yes
Tail sheath stabilizer and completion protein | Gp15 | 105 | 30735 | Yes
Baseplate wedge subunit | Gp6 | 80 | 67750 | Yes
Baseplate wedge subunit | Gp8 | 83 | 67781 | Yes
Tail sheath protein | Gp18 | 102 | 79730 | Yes
Short tail fibers | Gp12 | 225 | 110526 | Yes
Unknown | - | 87 | 18676 | Yes
Unknown | - | 82 | 19247 | Yes
Contains a fibrinogen domain | - | 89 | 33790 | Yes
Unknown | - | 146 | 32849 | Yes
Unknown | - | 86 | 134100 | Yes
Unknown | - | 223 | 59748 | Yes
Highly immunogenic outer capsid protein |  | - | - | -
NAD protein ADP ribosyltransferase |  | - | - | -
Head vertex protein | Gp24 | - | - | -

T4 MS identification: five of the listed T4 proteins were identified by MS (to date).

MATERIALS AND METHODS

Genes were predicted using GeneMarkS (http://opal.biology.gatech.edu/GeneMark/) and Glimmer (http://www.tigr.org/software/glimmer/). Again, the output was converted to a Fasta-formatted database using Perl scripts, and all three databases were added to SwissProt.

Protein preparation and identification
S-PM2 and T4 virus particles were purified using a CsCl gradient and the proteins solubilised in Laemmli buffer prior to resolution on a 1D SDS-PAGE gel, which was stained with Coomassie G-250. Protein bands were excised and processed using a MassPrep robotic protein handling system (Waters Micromass MS Technologies, U.K.). Protein samples were destained, reduced, alkylated with iodoacetamide and digested with trypsin, and the resultant peptides were extracted according to standard protocols described by the supplier. The tryptic peptides were characterised by matrix-assisted laser desorption/ionisation (MALDI) MS on a M@LDI-LR instrument (Waters Micromass MS Technologies, U.K.). The tryptic extract was mixed with matrix (alpha-cyano-4-hydroxycinnamic acid) prior to spotting.

Comparison of two predicted proteomes
The proteomes predicted by the two programs were compared using the MATLAB Bioinformatics Toolbox (MathWorks, UK). A script was written to perform a global alignment, using the Needleman-Wunsch algorithm, between each protein of the first proteome and every protein of the second. The alignment scores were saved in an m x n matrix (the scores matrix), where m is the total number of proteins predicted by Glimmer and n the number predicted by GeneMarkS. The PAM10 substitution matrix was used for the alignment so that only very similar proteins produced a good alignment.

To find the matching proteins between the proteomes, a second script was written to query the scores matrix and find all the protein pairs that gave a score greater than 0 (indicating a match). From this information it was possible to identify the proteins that were unique to each prediction.
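The proteome comparison described in the methods (all-against-all global alignment, an m x n scores matrix, and a query for pairs scoring above 0) can be illustrated with a minimal Python sketch. This is not the MATLAB script used in the study. The scoring values below are an assumed stand-in for PAM10 (a reward for identity, a heavy mismatch penalty and a linear gap penalty), and the function names and the toy sequences are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed stand-in scores, NOT PAM10: the heavy mismatch penalty only mimics
# the idea that just near-identical proteins should score above zero.
MATCH, MISMATCH, GAP = 5, -10, -8


def nw_score(a: str, b: str) -> int:
    """Needleman-Wunsch global alignment score (linear gap penalty)."""
    prev = [j * GAP for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * GAP] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + GAP,       # a[i-1] against a gap
                          curr[j - 1] + GAP)   # b[j-1] against a gap
        prev = curr
    return prev[-1]


def scores_matrix(first: Dict[str, str], second: Dict[str, str]) -> List[List[int]]:
    """m x n matrix: rows follow `first`, columns follow `second`."""
    return [[nw_score(p, q) for q in second.values()] for p in first.values()]


def compare_proteomes(
    first: Dict[str, str], second: Dict[str, str], threshold: int = 0
) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Return matching pairs (score > threshold) and the proteins unique to each set."""
    scores = scores_matrix(first, second)
    rows, cols = list(first), list(second)
    pairs = [(rows[i], cols[j])
             for i in range(len(rows))
             for j in range(len(cols))
             if scores[i][j] > threshold]
    matched_rows = {r for r, _ in pairs}
    matched_cols = {c for _, c in pairs}
    unique_first = [r for r in rows if r not in matched_rows]
    unique_second = [c for c in cols if c not in matched_cols]
    return pairs, unique_first, unique_second


# Toy data (hypothetical): G1 is M1 with a three-residue N-terminal extension.
glimmer = {"G1": "MSTMKVLAAGIT", "G2": "MPQRSTWYF"}
genemarks = {"M1": "MKVLAAGIT", "M2": "MHHHHDDDE"}

pairs, unique_glimmer, unique_genemarks = compare_proteomes(glimmer, genemarks)
print(pairs, unique_glimmer, unique_genemarks)
```

With a strict scoring scheme, two predictions of the same gene that differ only in their start position still score above 0, while unrelated sequences fall well below it, which is the behaviour the PAM10 choice was intended to give.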
The two S-PM2 proteins that differed in their predicted N-termini were identified by MS as a gp3 homologue and a second protein of currently unknown function. In both cases, peptides were identified from regions of the genome designated as non-coding by the GeneMarkS program; see Figures 4 and 5.

The tryptic extracts were also analysed by nano-LC-ESI-MS/MS on a Q-Tof Ultima Global with an in-line CapLC system (Waters Micromass MS Technologies, U.K.). The tryptic extract was desalted using an in-line C18 precolumn cartridge (Dionex, U.S.A.) and the peptides were further resolved on a 75 µm C18 PepMap column (Dionex, U.S.A.) using an increasing acetonitrile concentration gradient.

Figure 3 workflow labels: Acquire data; Use existing database; Database searching software; Search; Protein identification.

CONCLUSIONS

The Expasy Translate Tool generated the highest number of potential protein sequences owing to its lack of specificity in identifying coding regions within nucleic acid sequences. The Glimmer and GeneMarkS programs generated similar numbers of potential protein sequences. The majority of these sequences differed only in the start position of transcription/translation of the DNA sequence, resulting in minor differences at the N-terminus of each protein. Both Glimmer and GeneMarkS predicted unique proteins, none of which has been identified to date.

Over 50% of the structural proteins expected from S-PM2 have now been identified, and recent improvements in phage protein production are expected to allow the remaining proteins to be identified.

Evidence of peptides from two proteins that originated from regions of DNA designated as non-coding by the GeneMarkS program indicates that further optimisation of this prediction tool is required. Relaxing the parameters that identify coding regions within DNA could allow a greater number of genes and their products to be identified.
Our study clearly indicates that a holistic proteomic approach to the study of novel organisms is highly successful in the identification of proteins where little homology to known gene products exists.

ProteinLynx Global Server 2.0 (Waters Micromass MS Technologies, U.K.) was used to interrogate the data obtained from both MALDI-MS and LC-ESI-MS/MS experiments.

REFERENCES

Mann, N.H., Cook, A., Millard, A., Bailey, S. and Clokie, M. (2003). Marine ecosystems: Bacterial photosynthesis genes in a virus. Nature 424: 741.

Mann, N.H. et al. (2004). Genome of the phage S-PM2 which infects the cyanobacterium Synechococcus (in prep).

RESULTS

S-PM2 Gene Prediction Results
The Expasy Translate Tool predicted over 3700 ORFs from the S-PM2 genome. GeneMarkS predicted 217 proteins and Glimmer 202 for S-PM2, of which 189 are almost identical in sequence.

The genome was annotated accordingly upon confirmation that a protein product had been translated from a specified region of S-PM2 nucleic acid. Our preliminary experiment identifying a number of structural proteins from T4 was successful, and further samples of T4 phage proteins have been produced for analysis.

During the identification of S-PM2 proteins, no T4 sequences were identified and, conversely, no S-PM2 sequences were identified during T4 protein analysis. This indicates that these “homologous” proteins are in fact quite dissimilar at the amino acid level.

Figure 3 workflow labels: CapLC-ESI-MS/MS; MALDI, ESI MS; Use MS and MS/MS data to search against predicted databases; Link back to genome.

Figure 5. Protein Workpad view for an S-PM2 protein of unknown function showing peptides identified by MS from Glimmer (left) and GeneMarkS (right) predictions.

Figure 1. Electron micrographs of S-PM2 (left) and the distantly related T4 phage (right).

In contrast, S-PM2, a cyanophage that infects the marine bacterium Synechococcus, was first isolated in 1993 (Wilson et al.)
and, although its impact on natural populations of Synechococcus is still not known, it is thought to have a significant effect. The genome of S-PM2 has recently been sequenced (Millard et al.; Mann et al. 2003 and 2004, in prep.) and shows little homology to other known viral genomes, making the identification of viral proteins more complicated; see Figure 2. Consequently, a combined bioinformatics and proteomics approach was undertaken to solve this complex problem.

Results (Overview)
The Expasy Translate Tool predicted the greatest number of proteins from the S-PM2 genome. Two programs (Glimmer and GeneMarkS) each predicted over 200 translated proteins, of which almost 20% are truly unique to one proteome. Over 20 S-PM2 proteins were purified and identified, along with a number of T4 proteins. Some of the S-PM2 proteins show similarity to T4, whilst a number seem to be truly unique to this virus. For two S-PM2 proteins, GeneMarkS incorrectly predicted the translation start site. Previously unconfirmed genes were identified and the genome annotated accordingly.

Protein identification
In total, 24 structural proteins were identified from the cyanophage S-PM2 and 5 (to date) from the bacteriophage T4; see Table 1. During database searches of the S-PM2 protein data, no T4 protein sequences were identified. Also, no S-PM2 sequences were identified during the searches of the T4 data.

Comparison of two predicted proteomes
The Glimmer prediction contained 13 unique proteins, while the GeneMarkS prediction contained 28. For two S-PM2 proteins, both the Glimmer and Expasy Translate predictions contained greater numbers of residues per protein because they included an extended N-terminal region.

Construction of the protein database
The raw genomic sequence of S-PM2 was translated in all six reading frames using the Translate tool from the Expasy web server (http://ca.expasy.org/). Perl scripts transformed the output into a Fasta-formatted database. Open reading frames (ORFs) with a molecular weight greater than 300 Da were included.
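The construction of the six-frame protein database can be sketched in Python as follows. This is an illustration only, not the Expasy Translate Tool or the Perl scripts used in the study. It assumes the standard genetic code, treats every stretch between stop codons as an ORF, and estimates molecular weight from rounded average residue masses. The function names, the FASTA header format and the example sequence are hypothetical.

```python
from itertools import product
from typing import Iterator, Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Rounded average residue masses in Da (approximate values).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unrecognised codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> Iterator[Tuple[str, str]]:
    """Yield (frame label, translation) for the three forward and three reverse frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield f"{strand}{offset + 1}", translate(seq[offset:])


def mass(peptide: str) -> float:
    """Approximate average mass; 'X' residues contribute nothing (assumption)."""
    return WATER + sum(RESIDUE_MASS.get(aa, 0.0) for aa in peptide)


def orfs_to_fasta(dna: str, min_mass: float = 300.0) -> str:
    """FASTA text of all stop-to-stop ORFs heavier than min_mass (Da)."""
    records = []
    for frame, protein in six_frames(dna):
        for index, orf in enumerate(protein.split("*"), start=1):
            if orf and mass(orf) > min_mass:
                records.append(f">frame{frame}_orf{index}\n{orf}\n")
    return "".join(records)


# Hypothetical 18-base example, not S-PM2 sequence.
example_dna = "ATGAAAGTTCTGGCGTAA"
fasta = orfs_to_fasta(example_dna)
print(fasta)
```

Because every stop-to-stop segment above the mass cut-off is kept, a database built this way is far larger than one produced by a gene predictor, which is consistent with the Expasy Translate Tool giving the greatest number of candidate sequences.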
Figure 4. Comparison of the predicted gene products for an S-PM2 protein of unknown function: the GeneMarkS prediction is shown in blue, the Glimmer prediction in red + blue, and the ORF in turquoise + red + blue.

Millard, A., Clokie, M., Shub, D.A. and Mann, N.H. (2004). Genetic organization of the psbAD region in phages infecting marine Synechococcus strains. Proceedings of the National Academy of Sciences (in press).

Wilson, W.H., Joint, I.R., Carr, N.G. et al. (1993). Isolation and molecular characterization of five marine cyanophages propagated on Synechococcus sp. strain WH7803. Appl. Environ. Microbiol. 59 (11): 3736-3743.