
Prediction and identification of novel proteins from the marine cyanophage S-PM2.
Konstantinos Thalassinos; Susan E. Slade; James H. Scrivens; Martha R. Clokie; Nicholas H. Mann
Biological Mass Spectrometry and Proteomics Group, University of Warwick, Coventry, CV4 7AL, U.K.
Protein identification
MATERIALS AND METHODS
OVERVIEW
[Figure 2 legend, ORF function: Structural / homology to T4; Cyanobacterial / bacterial; Enzyme; Other; tRNAs; Unknown]
Purpose
To undertake a holistic proteomic study of a novel cyanobacterial virus (S-PM2).
In parallel, the well-characterised proteins of the Escherichia coli virus T4 were identified and
compared with their “homologues” in S-PM2.
Methods
The genomic sequence of S-PM2 was translated in all six reading frames using three gene
prediction programs (Expasy Translate Tool, Glimmer and GeneMarkS) and the output
converted to a Fasta format suitable for database searching.
Mass spectrometric analysis and subsequent protein identification were achieved using both
MALDI-MS and LC-ESI-MS/MS techniques.
INTRODUCTION
T4 is a virus (phage) that infects the enteric bacterium Escherichia coli; see Figure 1.
Extensive studies have fully characterised the genes and proteins involved in host infection,
phage DNA replication and phage protein translation prior to the self-assembly of mature viral
particles for release from the cell.
Figure 2. Genome of cyanophage S-PM2 showing predicted gene function and homologies to
similar genes in other organisms.
This study aims to identify the structural proteins of S-PM2, for which little or no homology
exists with known DNA sequences. The work began at the nucleic acid level, by predicting genes
and their products, and then used a mass spectrometric approach for protein identification.
The information generated was used to annotate the genome; see Figure 3.
The DNA sequence of S-PM2 was analysed with two gene prediction programs, GeneMarkS and
Glimmer, which use Markov models (interpolated Markov models in the case of Glimmer) to
assess the protein-coding potential of stretches of DNA and distinguish them from non-coding
regions. A third, less specific program, the Expasy Translate Tool, which simply translates
nucleotide sequences into protein sequences, was used for comparison.
A mass spectrometry-based proteomic approach was taken to identify gel-resolved, purified phage proteins from both S-PM2 and T4.
The amino acid sequences obtained from the identified S-PM2 proteins were compared with
the predicted sequences generated by the three programs.
The focus of the study then returned to the genome, annotating the identified genes with
their respective functions.
[Figure 3 flowchart labels: Genome; Experimental (protein extract; protein separation on 1D / 2D gels); Theoretical (predict proteins from genome; create databases of predicted proteins)]
Genes were predicted using GeneMarkS (http://opal.biology.gatech.edu/GeneMark/) and
Glimmer (http://www.tigr.org/software/glimmer/). As with the Expasy Translate output, the
predictions were converted to FASTA-formatted databases using Perl scripts, and all three
databases were added to SwissProt.
T4
Protein
MS
identification
Gp23
Gp3
Gp19
Gp13
Yes
-
Gp15
-
The proteomes predicted by the two different programs were compared using the MATLAB
Bioinformatics Toolbox (MathWorks, U.K.). A script was written to perform a global alignment,
using the Needleman-Wunsch algorithm, between each protein of the first proteome and every
protein of the second proteome. The alignment scores were saved in an m x n matrix (the
scores matrix), where m is the total number of proteins predicted by Glimmer and n the number
predicted by GeneMarkS. The substitution matrix used for the alignment was PAM10, chosen so
that only very similar proteins produced a good alignment.
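The all-against-all comparison described above can be sketched in Python rather than MATLAB. This is a minimal illustration only: the function names are hypothetical, and a simple match/mismatch score with a linear gap penalty stands in for the PAM10 matrix, whose values are not reproduced here.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score of sequences a and b.

    Assumption: match/mismatch scoring replaces PAM10, and gaps carry a
    linear penalty. Only the score is returned, not the alignment itself.
    """
    # prev holds the previous row of the dynamic-programming table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def scores_matrix(proteome_a, proteome_b):
    """Return the m x n matrix of scores (rows: proteome_a, columns: proteome_b)."""
    return [[nw_score(a, b) for b in proteome_b] for a in proteome_a]


if __name__ == "__main__":
    # Toy sequences, not S-PM2 data.
    print(scores_matrix(["MKT", "AAA"], ["MKT", "MT"]))
```

With a stringent substitution matrix such as PAM10, dissimilar pairs score strongly negative, which is what makes a simple score threshold usable for pairing proteins.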
Gp6
Gp8
Gp18
Gp12
-
Yes
Yes
In order to find the matching proteins between the proteomes, another script was written to
query the scores matrix and find all the protein pairs that gave rise to a score of more than 0
(thus indicating a match).
From this information it was possible to identify proteins that were unique to each prediction.
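Querying the scores matrix for matches and for prediction-specific proteins might look like the following sketch. The function names are hypothetical, and the threshold of 0 is taken from the text above; the scores in the example are invented.

```python
def matching_pairs(scores, threshold=0):
    """Return (row, column) index pairs whose alignment score exceeds threshold."""
    return [(i, j)
            for i, row in enumerate(scores)
            for j, score in enumerate(row)
            if score > threshold]


def unique_indices(scores, threshold=0):
    """Return (rows, columns) that have no match in the other proteome."""
    pairs = matching_pairs(scores, threshold)
    matched_rows = {i for i, _ in pairs}
    matched_cols = {j for _, j in pairs}
    n_cols = len(scores[0]) if scores else 0
    unique_rows = [i for i in range(len(scores)) if i not in matched_rows]
    unique_cols = [j for j in range(n_cols) if j not in matched_cols]
    return unique_rows, unique_cols


if __name__ == "__main__":
    # Invented scores: rows are Glimmer proteins, columns are GeneMarkS proteins.
    toy_scores = [[5, -3, -8],
                  [-2, -7, -1]]
    print(matching_pairs(toy_scores))
    print(unique_indices(toy_scores))
```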
Protein preparation and identification
S-PM2 and T4 virus particles were purified using a CsCl gradient and the proteins solubilised
in Laemmli buffer prior to resolution on a 1D SDS-PAGE gel, stained with Coomassie G-250.
Protein bands were excised and processed using a MassPrep robotic protein handling
system (Waters Micromass MS Technologies, U.K.). Protein samples were destained,
reduced, alkylated with iodoacetamide, digested with trypsin and the resultant peptides
extracted according to standard protocols described by the supplier.
The tryptic peptides were characterised by means of matrix-assisted laser desorption/
ionisation (MALDI) MS on a M@LDI-LR instrument (Waters Micromass MS Technologies, U.K.). The
tryptic extract was mixed with matrix (alpha-cyano-4-hydroxycinnamic acid) prior to spotting.
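To illustrate how measured peptide masses relate to a predicted protein sequence, the sketch below performs an in silico tryptic digest and computes singly protonated monoisotopic masses. It is not the search algorithm of the software used in this study. Assumptions: cleavage after K or R except before P (a common convention), no missed cleavages, and every cysteine carbamidomethylated by iodoacetamide; the function names are hypothetical.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
CARBAMIDOMETHYL = 57.02146  # mass added to Cys by iodoacetamide alkylation


def tryptic_peptides(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def mh_plus(peptide):
    """Monoisotopic [M+H]+ of a peptide with carbamidomethylated cysteines."""
    mass = sum(MONO[aa] for aa in peptide) + WATER + PROTON
    return mass + CARBAMIDOMETHYL * peptide.count("C")


if __name__ == "__main__":
    # Toy sequence, not an S-PM2 protein.
    for pep in tryptic_peptides("MKRPACDEFK"):
        print(pep, round(mh_plus(pep), 4))
```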
-
Yes
Gp24
Yes
T4 function | S-PM2 ORF | Size (Da) | MS identification
Major capsid protein | 108 | 51981 | Yes
Tail completion protein | 110 | 20681 | Yes
Tail tube protein | 103 | 23048 | Yes
Neck protein | 95 | 32316 | Yes
Tail sheath stabilizer and completion protein | 105 | 30735 | Yes
Baseplate wedge subunit
Baseplate wedge subunit
Tail sheath protein
Short tail fibers
Unknown
Unknown
Contains a fibrinogen domain
Unknown
Unknown
Unknown
Highly immunogenic outer capsid
protein
NAD protein ADP
ribosyltransferase
Head vertex protein
80
83
102
225
87
82
89
146
86
223
-
67750
Yes
67781
79730
110526
18676
19247
33790
32849
134100
59748
-
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
-
-
-
-
-
-
-
Table 1. Selected structural proteins from T4 and S-PM2, indicating those identified by
MS-based techniques.
Figure 3. Overview of a holistic proteomic study.
The two S-PM2 proteins that differed in their predicted N-termini were identified by MS
as a gp3 homologue and a second protein of currently unknown function. In both cases,
peptides were identified from regions of the genome that the GeneMarkS program had
designated as non-coding; see Figures 4 and 5.
The tryptic extracts were also analysed by means of nano-LC-ESI-MS/MS on a Q-Tof Ultima
Global with in-line CapLC system (Waters Micromass MS Technologies, U.K.). The tryptic
extract was desalted using an in-line C18 precolumn cartridge (Dionex, U.S.A.) and the
peptides further resolved on a 75 µm C18 PepMap column (Dionex, U.S.A.) using an
increasing acetonitrile concentration gradient.
[Figure 3 flowchart labels: Acquire data; Search using existing database searching software; Protein identification]
CONCLUSIONS
The Expasy Translate Tool generated the highest number of potential protein sequences owing
to its lack of specificity in identifying coding regions within nucleic acid sequences.
Both the Glimmer and GeneMarkS programs generated similar numbers of potential protein
sequences. The majority of these sequences differed only in the start position of
transcription/translation of the DNA sequence, resulting in minor differences at the
N-terminus of each protein.
Both the Glimmer and GeneMarkS programs predicted unique proteins, none of which has
been identified to date.
Over 50% of the structural proteins expected from S-PM2 have now been identified, and
recent improvements in phage protein production should enable identification of the
remaining proteins.
Peptides from two proteins were found to originate from regions of DNA that the GeneMarkS
program had designated as non-coding, which indicates that further optimisation of this
prediction tool is required. Relaxing the parameters that identify coding regions within DNA
could potentially allow a greater number of genes and their products to be identified.
Our study clearly indicates that a holistic proteomic approach to the study of novel organisms
is highly successful in the identification of proteins where little homology to known gene
products exists.
ProteinLynx Global Server 2.0 (Waters Micromass MS Technologies, U.K.) was used to
interrogate the data obtained from both MALDI-MS and LC-ESI-MS/MS experiments.
REFERENCES
Mann, N.H., Cook, A., Millard, A., Bailey, S. and Clokie, M. (2003). Marine ecosystems:
Bacterial photosynthesis genes in a virus. Nature 424: 741.
S-PM2 Gene Prediction Results
Mann N.H. et al. (2004). Genome of the phage S-PM2 which infects the cyanobacterium
Synechococcus (in prep).
[Figure 3 flowchart labels: CapLC-ESI-MS/MS; Use MS and MS/MS data to search against predicted databases]
The genome was annotated accordingly upon confirmation that a protein product had been
translated from a specified region of S-PM2 nucleic acid.
Our preliminary experiment identifying a number of structural proteins from T4 was
successful and further samples of T4 phage proteins have been produced for analysis.
The Expasy Translate Tool predicted over 3700 ORFs from the S-PM2 genome. GeneMarkS
predicted 217 proteins and Glimmer 202 for S-PM2, of which 189 are almost identical in
sequence.
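As a consistency check, the counts above can be related to the numbers of prediction-specific proteins (13 for Glimmer, 28 for GeneMarkS) and to the "almost 20%" figure quoted elsewhere on this poster. The choice of denominator below (all distinct proteins across both predictions) is an assumption; the poster does not state how the percentage was calculated.

```python
# Counts reported on the poster.
genemarks_total = 217
glimmer_total = 202
shared = 189  # proteins almost identical in both predictions

glimmer_unique = glimmer_total - shared      # proteins only in the Glimmer set
genemarks_unique = genemarks_total - shared  # proteins only in the GeneMarkS set

# Assumed denominator: distinct proteins across the two predictions.
distinct = shared + glimmer_unique + genemarks_unique
fraction_unique = (glimmer_unique + genemarks_unique) / distinct

print(glimmer_unique, genemarks_unique, distinct)
print(f"{fraction_unique:.1%} of distinct predicted proteins are specific to one program")
```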
Figure 5. Protein Workpad view for an S-PM2 protein of unknown function showing peptides
identified by MS from Glimmer (left) and GeneMarkS (right) predictions.
During the identification of S-PM2 proteins, no T4 sequences were identified and, conversely,
no S-PM2 sequences were identified during T4 protein analysis. This indicates that these
“homologous” proteins are quite dissimilar at the amino acid level.
RESULTS
Figure 1. Electron micrographs of S-PM2 (left) and the distantly related T4 phage (right).
In contrast, S-PM2, a cyanophage that infects the marine bacterium Synechococcus, was first
isolated in 1993 (Wilson et al.) and, although its impact on natural populations of
Synechococcus is still not known, it is thought to have a significant effect. The genome of
S-PM2 has recently been sequenced (Millard et al.; Mann et al. 2003 and 2004, in prep.) and
shows little homology to other known viral genomes, making identification of viral proteins
more complicated; see Figure 2. Consequently, a combined bioinformatics and proteomics
approach was undertaken to address this problem.
The raw genomic sequence of S-PM2 was translated in all six reading frames using the
Translate tool from the Expasy web server (http://ca.expasy.org/). Perl scripts transformed
the output to a FASTA-formatted database. Open Reading Frames (ORFs) with a molecular
weight greater than 300 Da were included.
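The database construction step can be sketched in Python (the study itself used the Expasy Translate tool and Perl scripts, which are not reproduced here). Assumptions: the standard genetic code, ORFs defined as stop-to-stop segments, and average residue masses for the 300 Da filter; all function and record names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Average residue masses in Da.
AVG_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.015


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def molecular_weight(protein):
    """Average molecular weight; unknown residues ('X') contribute nothing."""
    return sum(AVG_MASS.get(aa, 0.0) for aa in protein) + WATER


def six_frame_orfs(dna, min_mass=300.0):
    """Yield (name, sequence) for stop-to-stop ORFs heavier than min_mass."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            segments = translate(seq[frame:]).split("*")
            for n, orf in enumerate(segments):
                if orf and molecular_weight(orf) > min_mass:
                    yield f"frame{strand}{frame + 1}_orf{n}", orf


def to_fasta(records):
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    # Toy sequence, not S-PM2 DNA.
    print(to_fasta(six_frame_orfs("ATGGCCAAATAAGGCTTTACC")))
```

Because no start codon is required, this kind of translation yields far more candidate sequences than a gene predictor would, which matches the low specificity noted for the Translate tool.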
During database searches of the S-PM2 protein data, no T4 protein sequences were
identified. Also, no S-PM2 sequences were identified during the searches of the T4 data.
Comparison of two predicted proteomes
Results
The Expasy Translate Tool predicted the greatest number of proteins from the S-PM2 genome.
Glimmer and GeneMarkS each predicted over 200 translated proteins, of which
almost 20% are truly unique to one proteome.
Over 20 S-PM2 proteins were purified and identified, together with a number of T4 proteins.
Some of the S-PM2 proteins show similarity to T4, whilst a number seem to be truly unique to
this virus.
For two S-PM2 proteins, GeneMarkS incorrectly predicted the translation start site.
Previously unconfirmed genes were identified and the genome annotated accordingly.
Construction of the protein database
In total, 24 structural proteins were identified from the cyanophage S-PM2 and 5 (to date) from
the bacteriophage T4; see Table 1.
The Glimmer prediction contained 13 unique proteins, while the GeneMarkS prediction contained 28.
For two S-PM2 proteins, both the Glimmer and Expasy Translate predictions contained
greater numbers of residues per protein because they included an extended N-terminal
region.
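One way to flag MS-identified peptides that support an extended N-terminal region is sketched below. The sequences and peptides are invented for illustration, the function name is hypothetical, and exact substring matching is an assumption; real search software scores matches against spectra rather than comparing strings.

```python
def upstream_peptides(long_pred, short_pred, peptides):
    """Return (peptide, position) for peptides explained only by the longer prediction.

    long_pred is the prediction with the extended N-terminus, short_pred the
    prediction with the later start site. A peptide is reported when it starts
    inside the N-terminal extension and does not occur in the shorter prediction.
    """
    if not long_pred.endswith(short_pred):
        raise ValueError("short prediction is not a C-terminal part of the long one")
    extension_len = len(long_pred) - len(short_pred)
    hits = []
    for pep in peptides:
        if pep in short_pred:
            continue  # fully explained by the shorter prediction
        pos = long_pred.find(pep)
        if pos != -1 and pos < extension_len:
            hits.append((pep, pos))
    return hits


if __name__ == "__main__":
    # Invented sequences: the longer prediction carries an 8-residue extension.
    longer = "MSTLKRAAMKVLGDEK"
    shorter = "MKVLGDEK"
    print(upstream_peptides(longer, shorter, ["STLK", "VLGDEK", "AAMK"]))
```

Any peptide reported this way is evidence that translation starts upstream of the later predicted start site.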
Figure 4. A comparison of the predicted gene products for an S-PM2 protein of unknown
function: the GeneMarkS prediction is shown in blue, the Glimmer prediction in red + blue and
the ORF in turquoise + red + blue.
Millard, A., Clokie, M., Shub, D.A. and Mann, N.H. (2004). Genetic organization of the psbAD
region in phages infecting marine Synechococcus strains. Proceedings of the National
Academy of Sciences (In press).
Wilson, W.H., Joint, I.R., Carr, N.G. et al. (1993). Isolation and molecular characterization of
five marine cyanophages propagated on Synechococcus sp. strain WH7803. Appl. Environ.
Microbiol. 59 (11): 3736-3743.