Identification of Signal Peptides with a Sequence Motif Learning Technique
Dec 18, 1999
CS 838 Bioinformatics
Zhiqi Qiu <zqiu@cs.wisc.edu>
I: Introduction
This project investigates the application of a motif learning scheme to the task of identifying
signal peptides, which in turn are used to predict the sub-cellular localization of a given protein.
First, using the best Z score (the probability that the motif starts at a given position in a
sequence), we attempt to distinguish signal peptides from non-signal peptides. Second,
using the position in the sequence that yields the best Z score as a hint, we try to find the exact
cleavage site location in the signal peptide. To evaluate the method, the results are compared to those of SignalP
(http://www.cbs.dtu.dk/services/SignalP/), a World Wide Web prediction server based on a
combination of several artificial neural networks. Though our approach
is far less complicated, the results it obtains are still quite respectable.
II: Background
The prediction of the sub-cellular localization of proteins is a very important topic in
molecular biology. The foremost pioneer in this field, Günter Blobel, was awarded the 1999 Nobel
Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern
their transport and localization in the cell." The first such signal to be discovered was the secretory
signal peptide. His "signal hypothesis" postulates that proteins secreted out of the cell contain an
intrinsic signal that directs them to and across membranes. The signal consists of a peptide, i.e. a
sequence of amino acids in a particular order that forms an integral part of the protein, usually found
at the N-terminus of the sequence. Eventually it was shown that the signal hypothesis was both
correct and universal, since the processes operate in the same way in yeast, plant, and animal cells.
Furthermore, similar intrinsic signals target the transport of proteins also to other intracellular
organelles. Therefore, the problem of predicting the subcellular localization of a given protein can
be solved by identifying the signal peptide contained in the protein sequence.
The common structure of signal peptides is described as a positively charged n-region
followed by a hydrophobic h-region and a neutral but polar c-region. The peptide is cleaved off
while the protein is translocated through the membrane. The (-3, -1) rule states that the residues at
positions –3 and –1 (relative to the cleavage site) must be small and neutral for cleavage to occur
correctly (von Heijne, 1983, 1985). In 1986 von Heijne published a weight matrix that is widely
used to predict the location of the cleavage site, and to discriminate between signal peptides and
non-signal peptides by using the maximum cleavage site score. The positions in the matrix run from -13 to +2 relative to the cleavage site.
Since that time, the amount of signal peptide data available has increased by a factor of 5-10
(Nielsen, 1997). This growth motivated me to construct weight matrices from the new data using
the expectation maximization algorithm for motif discovery. Because signal peptides of different
organisms have different characteristics, I separate the sequences into three classes: Gram-positive
prokaryotes, Gram-negative prokaryotes, and eukaryotes, and attempt to find one motif in each
class. The motifs are then applied to signal peptides and non-signal peptides to test their
distinguishing power.
                         Eukaryotes               Prokaryotes
                                                  Gram-negative                  Gram-positive
Total length (average)   22.6 aa                  25.1 aa                        32.0 aa
n-regions                only slightly Arg-rich   Lys+Arg-rich (both prokaryotic classes)
h-regions                short, very hydrophobic  bit longer, less hydrophobic   long, less hydrophobic
c-regions                short, no pattern        short, Ser+Ala-rich            longer, Pro+Thr-rich
-3,-1 positions          small, neutral residues  almost exclusively Ala (both prokaryotic classes)
+1 to +5 region          no pattern               rich in Ala, Asp/Glu, and Ser/Thr (both prokaryotic classes)
<Table 1: characteristics of signal peptides from different organisms (Nielsen, 1997) >
III: Method
Classification:
A motif is a pattern common to a set of nucleic or amino acid subsequences that share some
biological property of interest. Lawrence & Reilly published the first EM approach to motif
discovery in 1990. Bailey & Elkan refined the algorithm in 1993 to accommodate multiple motifs.
Here, however, only a single motif is sought in each data set, and we do not have to worry
about multiple occurrences of a motif in a sequence. A motif is represented by a matrix P, in which
each column gives the probability of every character at one position of the motif.
       1     2     3
A     0.2   0.3   0.2
C     0.1   0.3   0.2
G     0.5   0.1   0.1
T     0.2   0.3   0.5
< Graph 1: sample of a very simple DNA motif with Width=3>
Another data structure, the Z matrix, is used to discover the motif starting position in every
sequence. Element Z(i, j) represents the probability that the motif starts in position j in sequence i.
The basic EM approach is to set up a loop to re-estimate Z from P and re-estimate P from Z until
values in P become stable.
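The loop described above can be sketched as follows. This is only a minimal illustration on a toy DNA alphabet, assuming exactly one motif occurrence per sequence; the function names (estimate_z, estimate_p, run_em), the pseudocount, and the convergence tolerance are assumptions made for the example and are not taken from em.cc.

```python
ALPHABET = "ACGT"  # toy DNA alphabet; the project itself works on amino acids


def estimate_z(seqs, p, width):
    """Re-estimate Z from P: Z[i][j] is the probability that the motif
    starts at position j of sequence i (one occurrence per sequence)."""
    z = []
    for seq in seqs:
        row = []
        for j in range(len(seq) - width + 1):
            likelihood = 1.0
            for k in range(width):
                likelihood *= p[seq[j + k]][k]
            row.append(likelihood)
        total = sum(row)
        # A uniform background model would cancel out in this normalisation.
        z.append([v / total for v in row])
    return z


def estimate_p(seqs, z, width, pseudocount=0.1):
    """Re-estimate P from Z: each column holds Z-weighted residue frequencies."""
    p = {a: [pseudocount] * width for a in ALPHABET}
    for seq, row in zip(seqs, z):
        for j, weight in enumerate(row):
            for k in range(width):
                p[seq[j + k]][k] += weight
    for k in range(width):
        col_sum = sum(p[a][k] for a in ALPHABET)
        for a in ALPHABET:
            p[a][k] /= col_sum
    return p


def run_em(seqs, p, width, tol=1e-6, max_iter=200):
    """Alternate the two estimation steps until the values in P become stable."""
    for _ in range(max_iter):
        z = estimate_z(seqs, p, width)
        new_p = estimate_p(seqs, z, width)
        change = max(abs(new_p[a][k] - p[a][k])
                     for a in ALPHABET for k in range(width))
        p = new_p
        if change < tol:
            break
    return p, estimate_z(seqs, p, width)


# Made-up sequences; the starting matrix is the sample motif from Graph 1.
seqs = ["ACGTGAT", "TTGATCC", "GGATGAT", "CTGATAA"]
start = {"A": [0.2, 0.3, 0.2], "C": [0.1, 0.3, 0.2],
         "G": [0.5, 0.1, 0.1], "T": [0.2, 0.3, 0.5]}
p, z = run_em(seqs, start, width=3)
```

Each row of z sums to 1 and each column of p sums to 1, which is a quick sanity check on any implementation of this loop.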
Since the EM method only finds a local maximum, care must be taken to initialize the
starting P matrix. Our method is adapted from MEME's scheme of trying many starting points
to initialize the P matrix: for every signal peptide sequence in the training set, an initial P matrix is
derived from the subsequence starting at -(Width-3) from the cleavage site position. Then we run
EM for one iteration for each initial P matrix, choose the motif model with the highest likelihood,
and run EM from there to convergence. This is a good heuristic, but it still does not guarantee
that the optimal starting point is found.
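One way to derive a starting matrix from a single subsequence is sketched below. The text does not say how the subsequence is turned into probabilities, so the weighting used here (the observed residue gets probability x in its column, the rest is shared equally) is an assumption in the spirit of MEME, as are the name initial_p, the value x=0.5, and the convention that cleavage_index is the 0-based index of the first residue of the mature protein. The example sequence is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def initial_p(seq, cleavage_index, width, x=0.5):
    """Build a starting P matrix from the window that begins (width - 3)
    residues before the cleavage site.

    cleavage_index: 0-based index of the first mature-protein residue
                    (assumed convention).
    x: probability given to the observed residue in each column
       (assumed value; the remainder is spread over the other residues).
    """
    start = cleavage_index - (width - 3)
    if start < 0 or start + width > len(seq):
        return None  # the window does not fit inside this sequence
    window = seq[start:start + width]
    rest = (1.0 - x) / (len(AMINO_ACIDS) - 1)
    return {a: [x if window[k] == a else rest for k in range(width)]
            for a in AMINO_ACIDS}


# Made-up sequence whose cleavage site is taken to lie before index 19.
toy = "MKRLLAVLLAALLSGAASAQDTPEKLW"
p0 = initial_p(toy, cleavage_index=19, width=16)
```

Repeating this for every training sequence gives the set of candidate starting points from which the one with the highest likelihood after a single EM iteration is kept.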
The main constraint of the EM method is that it only deals with contiguous motifs, so
insertions or deletions within the motif, which can arise through natural mutation, are not considered.
Cleavage Site Location:
To restate the (-3, -1) rule, the residues at positions –3 and –1 (relative to the cleavage site)
must be small and neutral for cleavage to occur correctly. In the previous step, along with the best Z
score of a sequence, we also get the most probable starting position of the motif, which gives us a
strong hint of the location of the cleavage site: the cleavage site should occur
somewhere close to the end of the motif. Scanning through that neighborhood, we assign each
position a score based on the residues at its -3 and -1 positions. Counting the occurrences of each
amino acid at positions -3 and -1 relative to the true cleavage sites gives a good idea of which
amino acids are preferred. For human signal peptides, for instance, they are A, C, L, P, S, T, V. The
relative frequencies of these residues are directly translated into a scoring function:
A 30; C 5; L 2; P 1; S 11; T 5; V 7;
The position with the highest combined score from its -3 and -1 residues is taken to be the cleavage
site location.
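The scan can be sketched as below, using the human weights listed above. Several details are assumptions made for illustration: the two weights are combined by addition, residues not in the list score 0, the expected site is taken as motif_start + width - 3, and the neighborhood is searched within a hypothetical radius of 3 positions. The sequence is made up.

```python
# Human weights as listed above; residues not listed are assumed to score 0.
WEIGHTS = {"A": 30, "C": 5, "L": 2, "P": 1, "S": 11, "T": 5, "V": 7}


def site_score(seq, site):
    """Score a candidate cleavage site, where `site` is the 0-based index of
    the residue at position +1. Adding the two weights is an assumption."""
    if site < 3:
        return 0
    return WEIGHTS.get(seq[site - 3], 0) + WEIGHTS.get(seq[site - 1], 0)


def predict_sites(seq, motif_start, width, radius=3, max_ties=4):
    """Scan the neighborhood of the expected site and return the
    highest-scoring positions (up to max_ties of them)."""
    expected = motif_start + width - 3  # motif assumed to begin at -(width-3)
    lo = max(3, expected - radius)
    hi = min(len(seq), expected + radius)
    scored = [(site_score(seq, s), s) for s in range(lo, hi + 1)]
    if not scored:
        return []
    best = max(score for score, _ in scored)
    return [s for score, s in scored if score == best][:max_ties]


# Made-up sequence; the best motif start is assumed to be position 6.
toy = "MKRLLAVLLAALLSGAASAQDTPEKLW"
sites = predict_sites(toy, motif_start=6, width=16)
```

For this toy input the scan covers sites 16 to 22, and site 19 wins because both its -3 and -1 residues are Ala (30 + 30 = 60).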
Program Input:
The first four parameters are used to configure the program. The first is the width of the
motif. The second one is the expected latest start of the motif. The third one is the index of the
sequence we use for initializing the P matrix, and the last one is the threshold value by which we
classify the sequence as a signal peptide or not. Most of the values actually used here were
chosen empirically and may not be optimal.
1. Motif width
16 is chosen as the default motif width because the matrix published by von Heijne in 1986
has the range -13 to +2 relative to the cleavage site. This is the most important parameter for the
program and influences the choice of other parameters. For all three classes of signal peptides, 16
yields good results, but I later found that the motif width should also be matched to the average
length of the peptides. For the prokaryotic data sets, whose signal peptides are longer than those of
the eukaryotes, 22 is more appropriate. However, I have noticed that when the motif width increases
beyond an optimal point, the percentage of correct cleavage site predictions decreases perceptibly,
while the discrimination task yields results similar to before.
2. Latest start position
Besides having a Z score that exceeds the threshold value, the best motif starting position in
a sequence also has to be no later than the latest starting position. This rule is particularly useful in
preventing signal anchors from being mistaken for signal peptides. The reasoning is explained later.
Here let it suffice to say that when motif width is 16, 18 seems to be the best value.
3. Initializing index
The next parameter is the index of the sequence in the training set for P matrix
initialization. If it is smaller than 0, the program tries a different initialization matrix for
every sequence in the data set.
4. Threshold
We classify a sequence as a signal peptide only if its best Z score exceeds the threshold
likelihood. (Actually, it is the logarithm of the cumulative probability value.) When the motif width is
16, we use -40.0 as the threshold value. When the motif width is 22, we use -57.0.
Beyond these four parameters, the program takes the filenames of
the data sets as further parameters. The first one is always the signal peptide data set, and it can be
followed by any number of other data sets from the same organism. The location of the cleavage site is only
calculated for the signal peptide data set.
A heuristic rule, inspired by the Nielsen 1997 paper, is added to help distinguish cleaved
signal peptides from uncleaved N-terminal signal anchors. Signal anchors often have sites similar
to signal peptide cleavage sites after their hydrophobic (transmembrane) region. Therefore, a
prediction method can easily mistake signal anchors for signal peptides. Indeed, before
this rule was added, the best Z score alone usually identified over 50% of signal anchors as
signal peptides. However, signal anchors are generally significantly longer than signal peptides. By
setting an upper bound on the position at which the motif may start in a signal peptide (between 16
and 20 in my tests), the percentage of false positives among signal anchors is greatly reduced.
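Taken together, the threshold and the latest start position give a two-part decision rule, sketched here. The function names and the input format (one log score per candidate start position, indexed from 0) are assumptions for illustration; the defaults are the values quoted above for a motif width of 16.

```python
def best_start(log_scores):
    """Return (best score, start position) over all candidate motif starts."""
    best = max(range(len(log_scores)), key=log_scores.__getitem__)
    return log_scores[best], best


def is_signal_peptide(log_scores, threshold=-40.0, latest_start=18):
    """Accept only if the best score exceeds the threshold AND the motif
    starts no later than the latest allowed start position."""
    score, start = best_start(log_scores)
    return score > threshold and start <= latest_start


# Hypothetical score profiles over 30 candidate start positions.
early_peak = [-60.0] * 12 + [-35.0] + [-60.0] * 17   # strong match at 12
late_peak = [-60.0] * 25 + [-35.0] + [-60.0] * 4     # strong match at 25
weak_peak = [-60.0] * 12 + [-45.0] + [-60.0] * 17    # weak match at 12
```

Only early_peak is accepted: late_peak clears the threshold but starts too late, which is the case the rule is meant to catch for signal anchors, and weak_peak starts early enough but falls below the threshold.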
IV: Testing and results
The data sets come from the SignalP web site and were originally extracted from
SWISS-PROT version 29. The data sets are divided into prokaryotic and eukaryotic entries, and the
prokaryotic data sets are further divided into Gram-positive eubacteria (Firmicutes) and Gram-negative eubacteria (Gracilicutes). Additionally, two single-species data sets are selected: a human
subset of the eukaryotic data, and an E. coli subset of the Gram-negative data.
From secretory proteins, the sequence of the signal peptide and the first 30 amino acids of
the mature protein are included in the data set. From cytoplasmic and (for the eukaryotes) nuclear
proteins, the first 70 amino acids of each sequence are used. Additionally, a set of eukaryotic signal
anchor sequences, i.e. N-terminal parts of type II membrane proteins, is extracted (Nielsen,
1997). To avoid redundancy in the data sets, pairs of sequences that are functionally homologous
are excluded. The final count of sequences in each data set can be found in Table 2.
             Number of sequences       Configuration       Performance
Data set     Signal      Non-signal    Width   Threshold   Cleavage site   Signal peptide
             peptides                                      location        discrimination (correlation)
Human        416         251           16      -40.0       0.51            0.88 (0.91)
Eukaryote    1011        820           16      -40.0       0.43            0.85 (0.87)
E.coli       105         119           22      -57.0       0.64            0.93
Gram-        266         186           22      -57.0       0.75            0.92
Gram+        141         64            22      -57.0       0.64            0.96
Number of sequences: the number of sequences in the data sets after redundancy reduction. Configuration: motif width
and the threshold Z value. Performance: cleavage site location is the fraction of signal peptide sequences whose cleavage site is
correctly predicted; signal peptide discrimination is the ability of the method to distinguish between signal peptides and the N-terminals of non-secretory proteins. The numbers in brackets are measured when signal anchor data sets are not used.
<Table 2: Data and performance summary>
The formula for calculating the correlation coefficient is:
$$C = \frac{P_t N_t - P_f N_f}{\sqrt{(P_t + P_f)(P_t + N_f)(N_t + P_f)(N_t + N_f)}}$$
where Pt and Pf are the numbers of true and false positives, while Nt and Nf are the numbers of true
and false negatives.
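For reference, a coefficient defined over these four counts can be computed as sketched below. The sketch assumes the measure is the Matthews correlation coefficient, the one used in the Nielsen 1997 study; the function name and the choice to return 0.0 when the denominator vanishes are assumptions.

```python
import math


def correlation(pt, pf, nt, nf):
    """Matthews correlation coefficient from counts of true/false positives
    (pt, pf) and true/false negatives (nt, nf)."""
    denom = math.sqrt((pt + pf) * (pt + nf) * (nt + pf) * (nt + nf))
    if denom == 0:
        return 0.0  # undefined case; returning 0 is a convention assumed here
    return (pt * nt - pf * nf) / denom


perfect = correlation(pt=50, pf=0, nt=50, nf=0)     # every call correct
inverted = correlation(pt=0, pf=50, nt=0, nf=50)    # every call wrong
chance = correlation(pt=25, pf=25, nt=25, nf=25)    # no better than guessing
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right).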
          Signal Peptide     Cytoplasmic protein   Nuclear protein    Signal Anchor
          Pt       Nf        Nt       Pf           Nt       Pf        Nt       Pf
HUMAN     0.909    0.091     1.000    0.000        0.987    0.013     0.750    0.250
EUK       0.860    0.140     0.993    0.007        0.980    0.020     0.835    0.165
ECOLI     0.962    0.038     0.941    0.059        /        /         /        /
GRAM-     0.940    0.060     0.957    0.043        /        /         /        /
GRAM+     0.958    0.042     0.922    0.078        /        /         /        /
<Table 3: Detailed classification results>
The test performance has been verified by cross-validation. Half of the signal peptide data
are used to calculate the P matrix, and the other half are used as testing data. The performance
values are measured on the test sets. Because of the redundancy reduction of the data, the sequence
similarity between training and test sets is very low. Consequently, the prediction accuracy on
sequences with some degree of homology to the sequences in the data sets will in general be higher.
V: Analysis
The results collected here generally agree with the findings published on the SignalP web
site. For instance, the difference in structure between the signal peptides from different organisms
is reflected in the performance values. The signal peptides from prokaryotes are longer than those
of the eukaryotes, with more extended h-regions. Therefore a larger value for motif width works
better for them. Gram-negative cleavage sites have the strongest pattern, and we had the highest
success rate in predicting them. The eukaryotic cleavage sites are, in contrast, significantly more
difficult to predict. Though all three classes of signal peptides follow the (-3, -1) rule near the cleavage site,
eukaryotes accept a number of different amino acids while the prokaryotes almost exclusively use
Ala in these two positions. This is reflected in our algorithm by the two separate scoring functions.
However, unlike Nielsen's 1997 study, we do not find the discrimination of signal peptides versus
non-secretory proteins easier for the eukaryotes than for the prokaryotes, but rather the other way
around.
Training and testing on single-species data sets did not noticeably improve the predictive
performance. I have also tried to classify E. coli sequences using the motif matrix from the Gram-negative data, and human sequences using the matrix from all eukaryotic data. That does not seem to affect
performance in either direction.
Note that the correlation coefficient might not be the best metric by which to judge the
performance of our program. The data sets are of very different sizes. For instance, we have 416
sequences in the human signal peptide data set, but only 251 sequences in the human non-signal
peptide data sets. Because the counts are not normalized, the true positive rate carries more
weight than the true negative rate. Also, the percentages of correctly located cleavage sites are slightly
exaggerated: the program saves up to 4 candidate positions that share the highest score, and if one of
them is the true cleavage site, we count it as a success.
VI: Conclusion
In general, our performance is not quite as good as that of the much more sophisticated SignalP
server, which trains neural networks for the classification and cleavage site location tasks.
However, we do outperform SignalP in classifying signal peptides of prokaryotes. The
percentage of false positives among the signal anchors is also lower. For a simple algorithm with
somewhat rigid assumptions (the foremost being that the motif must be contiguous),
it is still quite effective. Future work could include further improving its performance by fine-tuning
the parameters discussed above. Also, paralleling the Nielsen 1997 study, we can apply our
prediction method to all the amino acid sequences of the predicted coding regions in the
Haemophilus influenzae genome. It would be interesting to see whether our estimate of the number
of sequences with cleavable signal peptides in H. influenzae is close to the result of their study.
References:

T. Bailey and C. Elkan:
"Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation
Maximization." Machine Learning, 21, 51-83, 1995.

H. Nielsen, J. Engelbrecht, G. von Heijne, and S. Brunak:
"Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage
sites." Protein Engineering, 10(1), 1-6, 1997.

H. Nielsen, J. Engelbrecht, G. von Heijne, and S. Brunak:
"Defining a similarity threshold for a functional protein sequence pattern: The signal
peptide cleavage site." PROTEINS, 24, 165-177, 1996.

A. Bairoch and B. Boeckmann:
"The SWISS-PROT protein sequence data bank: current status." Nucleic Acids Res.,
22, 3578-3580, 1994.
Attachment:
Program listing in file: em.cc