Machine Learning Approaches for Predicting Signal Peptides and
their Cleavage Sites
Problem Definition:
A signal peptide is a short peptide chain (3-60 amino acids long) that directs the
transport of a protein. It is the pre-sequence that targets proteins in eukaryotes and
prokaryotes to other organelles, such as mitochondria, chloroplasts and apicoplasts,
through the secretory pathway.
The signal sequence is usually located in the N-terminal part of the protein and is
cleaved off by an extracellular signal peptidase while the protein is transferred through
the membrane. Knowledge of how signal peptides work is therefore very important
for understanding molecular mechanisms and for studying new drugs [1] [2]. If a
sorting signal in a protein is cleaved at the wrong position, the protein can be delivered
to the wrong cellular location, leading to several diseases [1].
The challenging problem to be solved is to discriminate secretory proteins (proteins
that are synthesized in the endoplasmic reticulum and secreted by the cell) from
non-secretory proteins, and to identify the cleavage sites of the secretory proteins.
[Figure: prediction workflow. Query protein -> determine whether it is secretory or
non-secretory -> predict the signal peptide cleavage.]
The lengths and residue order of signal peptide sequences vary considerably among
different proteins, which makes cleavage site identification more difficult. Nevertheless,
they still share some common features that can be used for identification. The most
important feature is a stretch of hydrophobic amino acids called the h-region, which is
generally seven to fifteen amino acids long. Before the h-region lies another region,
called the n-region, which contains one to five amino acids that generally carry positive
charges. Between the h-region and the cleavage site is the c-region, which consists
of three to seven polar but uncharged amino acids.
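The region structure described above can be illustrated with a minimal sketch that scans the N-terminus for its most hydrophobic window (a rough stand-in for the h-region) and counts the net positive charge before it (a rough stand-in for the n-region). The Kyte-Doolittle hydropathy scale, the window width, the search length and the toy sequence are all assumptions chosen for illustration; they are not taken from the methods reviewed here.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of hydrophobicity scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def most_hydrophobic_window(seq, width=7, search_len=40):
    """Return (start, mean hydropathy) of the most hydrophobic window of
    `width` residues found within the first `search_len` residues."""
    region = seq[:search_len]
    if len(region) < width:
        raise ValueError("sequence is shorter than the window width")
    best_start, best_score = 0, float("-inf")
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        score = sum(KD[aa] for aa in window) / width
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def n_region_charge(seq, h_start):
    """Count of K and R minus count of D and E before the candidate h-region."""
    prefix = seq[:h_start]
    positive = sum(prefix.count(aa) for aa in "KR")
    negative = sum(prefix.count(aa) for aa in "DE")
    return positive - negative


# Hypothetical toy sequence, invented for illustration only.
toy = "MKRLLLALLVAAASAQAEDKTPG"
h_start, h_score = most_hydrophobic_window(toy)
print(h_start, round(h_score, 2), n_region_charge(toy, h_start))
```

Such hand-written heuristics only motivate the features; the machine learning methods discussed later learn the corresponding patterns from data.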
We can predict the presence of signal peptides and their cleavage sites using machine
learning techniques.
Support vector machine
Kernel Methods:
Evolutionary Approaches:
Materials and Dataset:
The desired dataset can be extracted from the Swiss-Prot database (UniProt database
server, ExPASy server), a biological database of protein sequences. The study
may focus on particular organisms such as eukaryotes, gram
classification (OC) line with the specified type.
Secretory proteins are then chosen from these proteins if they are marked with
(Signal) in the feature table (FT), while non-secretory proteins are chosen if they are
marked as cytoplasm and nucleus in the comment line (CC). If the chosen dataset
contains more than one protein with the same first 100 residues, only one of them
is kept, to avoid redundancy [3].
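The redundancy rule above (keep one protein per distinct first 100 residues) can be sketched as follows. The function name, the record layout and the identifiers are hypothetical; which duplicate is kept is not specified in the text, so keeping the first one seen is an assumption.

```python
def drop_redundant(records, prefix_len=100):
    """Keep the first record seen for each distinct N-terminal prefix.

    `records` is an iterable of (identifier, sequence) pairs.
    """
    seen = set()
    kept = []
    for ident, seq in records:
        key = seq[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append((ident, seq))
    return kept


# Hypothetical identifiers and sequences, for illustration only.
records = [
    ("P_A", "M" + "A" * 120),
    ("P_B", "M" + "A" * 150),   # same first 100 residues as P_A
    ("P_C", "M" + "L" * 120),
]
kept_ids = [ident for ident, _ in drop_redundant(records)]
print(kept_ids)
```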
Constructing the data according to the above criteria provides rigorous, high-quality
benchmark datasets for specific organism groups, most likely eukaryotes, gram-positive
and gram-negative.
Results Testing:
The performance of the chosen method can be tested using cross-validation. The
technique most frequently used in previous work is the jackknife (leave-one-out) test.
Self-consistency testing (where the training data is also the testing data) has been
used as well, and some studies used 20% of the data for testing and the rest for training.
The dataset is relatively small, not more than 3000 proteins in each benchmark, so
the jackknife test is suitable here.
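A minimal sketch of the jackknife (leave-one-out) test follows: each sample is held out once, the model is trained on the remaining samples, and the held-out sample is predicted. The nearest-mean classifier and the one-dimensional toy data are placeholders invented for illustration; any training and prediction functions with the same shape could be substituted.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out test: hold out each sample in turn and predict it."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_nearest_mean(xs, ys):
    """Toy model: the mean feature value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(means[label] - x))


# Toy one-dimensional data, invented for illustration only.
samples = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
labels = [0, 0, 0, 1, 1, 1]
accuracy = jackknife_accuracy(samples, labels,
                              train_nearest_mean, predict_nearest_mean)
print(accuracy)
```

The cost is one training run per sample, which is why the test suits benchmarks of a few thousand proteins but becomes expensive for larger ones.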
Recent Work:
Several methods have been used to predict signal peptides and their cleavage sites.
Here I will introduce three different approaches used in this field: support vector
machines, hidden Markov models and conditional random fields.
1. Support Vector Machine Approach:
An SVM-based ternary classifier (SecretP) is proposed for predicting mammalian
secreted proteins, using pseudo-amino acid composition (PseAA) and five
additional features to distinguish three types of proteins:
classically secreted proteins (CSP)
non-secreted proteins (NSP)
non-classically secreted proteins (NCSP)
For training, 864 mammalian proteins confirmed to follow the non-classical secretion
route (NCSP) were collected from Swiss-Prot through data mining; 149 proteins from this
set were used for testing. Proteins marked as "secreted" in the keyword (KW) line without
"signal" in the feature table (FT) line were selected to construct the dataset of
Proteins in the training and testing sets were compared using the BLASTCLUST program
in order to avoid redundancy and homology bias, so that only proteins with less than 25%
sequence identity were kept. For the CSP dataset, 3321 classically secreted proteins were
selected: extracellular proteins with N-terminal signal peptides that are released via the
classical ER-Golgi pathway. For the NSP dataset, 3654 proteins annotated as residing in
the cytoplasm and/or the nucleus were selected.
Feature extraction and Methods:
Amino Acid Composition (AAC): incorporates the occurrence frequency
information of 20 amino acids in protein sequence.
Seven physicochemical properties of amino acids: hydrophobicity, solvent
accessible surface area, net charge, polarity and polarizability.
N-terminal signal peptides are important for the release of CSP, so they are useful for
distinguishing CSP from NCSP and NSP.
After the proteins were translated into numeric vectors, pseudo-amino acid composition
was used to transform the unequal-length vectors into uniform matrices. A new
PseAA model was constructed based on amino acid composition (AAC) and auto
covariance (AC), which accounts for sequence order effects.
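The idea of turning sequences of unequal length into fixed-length vectors can be sketched as below: 20 composition values are concatenated with a few auto covariance terms computed on a per-residue property. This is a minimal sketch, not the SecretP feature set. The auto covariance formula is one common definition, and the single Kyte-Doolittle property and the lag range are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One per-residue property (Kyte-Doolittle hydropathy), an assumed stand-in
# for the physicochemical properties mentioned in the text.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aac(seq):
    """Occurrence frequency of each of the 20 amino acids."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def auto_covariance(values, max_lag):
    """Mean product of deviations from the mean at lags 1..max_lag."""
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    result = []
    for lag in range(1, max_lag + 1):
        total = sum((values[i] - mean) * (values[i + lag] - mean)
                    for i in range(n - lag))
        result.append(total / (n - lag))
    return result


def feature_vector(seq, max_lag=5):
    """Fixed-length vector: 20 AAC values followed by max_lag AC values."""
    values = [KD[aa] for aa in seq]
    return aac(seq) + auto_covariance(values, max_lag)


# Two hypothetical sequences of different length give vectors of equal length.
v1 = feature_vector("MKRLLLALLVAAASAQAEDKTPG")
v2 = feature_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(v1), len(v2))
```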
The SVM was implemented using the LIBSVM library, with the RBF as kernel function.
Five-fold cross-validation rather than the jackknife test was used because of the large
number of proteins. Four measures (sensitivity, specificity, accuracy and the Matthews
correlation coefficient, MCC) were used to evaluate the performance of SecretP [4].
2. Hidden Markov Model Approach:
An HMM-based approach predicts signal peptides and their cleavage sites in archaea,
and also discriminates such proteins from cytoplasmic and transmembrane ones.
Archaeal signal peptides exhibit a more eukaryotic-like cleavage site (c-region)
and a unique h-region resembling the bacterial ones, with a slight
over-representation of leucine and isoleucine.
Leucine is the dominant residue in eukaryotes, so predictors trained on eukaryotic or
bacterial proteins cannot reliably be applied to archaeal sequences.
The UniProt database lists only 12 archaeal sequences with an experimentally verified,
precise location of the cleavage site, and the signal peptide database SPDB lists
only 9. An extensive literature review was made on PubMed to identify archaeal
Materials and Methods:
The HMM consists of three different submodels:
SP submodel corresponding to secretory signal peptides
TM submodel (N-terminal transmembrane)
Globular submodel
The SP submodel is the central core model; it models the positively charged
n-region, the h-region and the c-region.
The model was trained using the Baum-Welch algorithm for labeled
sequences, and the Viterbi algorithm was used for decoding.
Decoding produces the optimal path of states through the model and
predicts the type of the sequence (SP, TM or globular), as well as the cleavage
site, if any.
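Viterbi decoding can be illustrated with a toy left-to-right model of a signal peptide (n-region, h-region, c-region, mature protein). This is only a sketch of the decoding step: the four states, the three residue classes and every probability below are invented for illustration and are not the parameters or topology of the published archaeal model, which would be estimated with Baum-Welch.

```python
import math

STATES = ["n", "h", "c", "mature"]
# All probabilities are hypothetical toy values.
START = {"n": 1.0, "h": 0.0, "c": 0.0, "mature": 0.0}
TRANS = {
    "n": {"n": 0.7, "h": 0.3},
    "h": {"h": 0.85, "c": 0.15},
    "c": {"c": 0.7, "mature": 0.3},
    "mature": {"mature": 1.0},
}
EMIT = {
    "n": {"pos": 0.5, "hyd": 0.2, "other": 0.3},
    "h": {"pos": 0.02, "hyd": 0.9, "other": 0.08},
    "c": {"pos": 0.05, "hyd": 0.35, "other": 0.6},
    "mature": {"pos": 0.2, "hyd": 0.3, "other": 0.5},
}


def residue_class(aa):
    """Map a residue to one of three coarse classes (an assumed grouping)."""
    if aa in "KR":
        return "pos"
    if aa in "AVLIMFWC":
        return "hyd"
    return "other"


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for `seq` (log-space Viterbi)."""
    obs = [residue_class(aa) for aa in seq]
    scores = [{s: log(START[s]) + log(EMIT[s][obs[0]]) for s in STATES}]
    backpointers = []
    for symbol in obs[1:]:
        row, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: scores[-1][p] + log(TRANS[p].get(s, 0.0)))
            row[s] = (scores[-1][prev] + log(TRANS[prev].get(s, 0.0))
                      + log(EMIT[s][symbol]))
            pointers[s] = prev
        scores.append(row)
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy sequence; the first "mature" position marks the predicted
# cleavage site, if the decoded path reaches that state at all.
toy = "MKRLLLALLVAAASAQAEDKTPG"
path = viterbi(toy)
cleavage = path.index("mature") if "mature" in path else None
print("".join(s[0] for s in path), cleavage)
```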
Results: 35-fold cross-validation correctly predicts all 69 SPs and correctly
rejects 248 out of 252 cytoplasmic and TM proteins. These results correspond
to 100% sensitivity and 98.41% specificity, with an MCC of 0.964 [5].
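The reported figures can be checked from the counts given above: 69 true positives, 0 false negatives, 248 true negatives and 4 false positives. The sketch below uses the standard definitions of sensitivity, specificity, accuracy and MCC; the function name is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


sens, spec, acc, mcc = binary_metrics(tp=69, fn=0, tn=248, fp=4)
print(f"{sens:.2%} {spec:.2%} {mcc:.3f}")  # prints "100.00% 98.41% 0.964"
```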
3. Conditional Random Fields Approach:
Conditional random fields (CRFs) can be applied to predict signal peptides and their
cleavage sites. This work demonstrates how amino acid properties can be exploited and
incorporated into the CRF to boost prediction performance.
CRFs were originally designed for sequence labeling tasks. Given a sequence of
observations, a CRF finds the most likely label for each observation.
CRFs have a graphical structure consisting of edges and vertices, in which an edge
represents the dependency between two random variables (two amino acids in a
protein), and a vertex represents a random variable whose distribution is to be
CRFs are undirected graphical models, as opposed to directed graphical models such as
Dataset: 1937 eukaryotic protein sequences were extracted from Swiss-Prot
version 56.5.
Methods and materials:
The prediction problem is formulated as a labeling task. Amino acids with similar
properties can be categorized into subgroups.
The 20 amino acids are divided according to their hydrophobicity and charge/polarity,
because these properties are believed to carry information about cleavage sites: the
h-region in signal peptides is rich in hydrophobic residues, and the c-region is
dominated by small, non-polar residues.
These groups were used as observations to train the CRF.
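How grouped amino acids become observations for a sequence labeler can be sketched as follows. The four groups, the window size, the feature names and the two-label scheme (S for signal peptide, M for mature protein) are assumptions made for illustration; the grouping actually used in the cited work is not given here, and no CRF library is called.

```python
# Hypothetical grouping of the 20 amino acids by hydrophobicity and charge/polarity.
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def position_features(seq, i, window=2):
    """Observation for position i: its residue, its group, and neighbor groups."""
    feats = {"aa": seq[i], "group": GROUP_OF[seq[i]]}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(seq):
            feats[f"group[{offset:+d}]"] = GROUP_OF[seq[j]]
    return feats


def label_sequence(seq, cleavage_index):
    """Label residues before the cleavage site 'S' and the rest 'M'."""
    return ["S" if i < cleavage_index else "M" for i in range(len(seq))]


# Hypothetical sequence and cleavage position, for illustration only.
toy = "MKRLLLALLVAAASAQAEDKTPG"
observations = [position_features(toy, i) for i in range(len(toy))]
labels = label_sequence(toy, cleavage_index=17)
print(observations[3], labels[16], labels[17])
```

A list of per-position feature dictionaries paired with a list of labels is the input shape that typical CRF toolkits expect, so the training step can be added on top of this without changing the representation.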
Ten-fold cross-validation was used to verify performance, and up to 79.81%
accuracy was achieved [6].
Machine learning approaches have shown strong performance in solving supervised
classification problems. The support vector machine is one of the most promising
classifiers that can be applied efficiently to bioinformatics problems, because it is a
kernel-based model: it represents the data by means of kernel functions.
The kernel function takes relationships that are implicit in the data and makes them
explicit, so that patterns become easier to detect.
Choosing a kernel function among the large number of kernel types is the most important
design decision in an SVM. Developing new string kernels and optimizing their
parameters is still a big challenge in the field.
We will develop a new string kernel and use evolutionary approaches to optimize it.
This optimized kernel will be applied to the prediction of signal peptides and their
cleavage sites in proteins.
Contribution: to develop a kernel for biological sequences and to optimize it using
evolutionary approaches.
The proposed methodology is the Support Vector Machine (SVM), a kernel-based machine
learning technique that generates the optimal hyperplane separating two classes of
data. It is used to distinguish signal peptides from other sequences and to predict
whether a signal peptide will be cleaved during its transport out of the cell. A string
kernel function maps the string sequences into a high-dimensional space in which they
can be classified linearly. In this work a new string kernel will be developed and
optimized using evolutionary approaches. This optimized string kernel will then be
used with an SVM to predict signal peptides and their cleavage sites.
A multi-string kernel function is represented as a tree whose nodes are connected by
mathematical operators and coefficients, such that the resulting function satisfies
Mercer's conditions. Evolutionary approaches will then be used to find the optimal
structure and parameters for this tree, to produce the optimized new string kernel.
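The tree representation can be sketched with k-spectrum kernels as leaves and three operators (sum, product, scaling by a non-negative coefficient) as internal nodes. These operators are chosen because sums, products and non-negative multiples of valid kernels are again valid kernels, so every tree built this way satisfies Mercer's conditions. The choice of spectrum kernels, the tuple encoding and all names below are assumptions for illustration, not the kernel proposed in this work.

```python
from collections import Counter


def spectrum_kernel(k):
    """k-spectrum kernel: inner product of the k-mer count vectors of two strings."""
    def kernel(s, t):
        counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
        return float(sum(counts_s[m] * counts_t[m] for m in counts_s))
    return kernel


def evaluate(tree, s, t):
    """Evaluate a kernel tree on the pair of strings (s, t).

    Node forms: ("leaf", kernel), ("scale", coeff, child),
    ("add", left, right), ("mul", left, right).
    """
    op = tree[0]
    if op == "leaf":
        return tree[1](s, t)
    if op == "scale":
        coeff, child = tree[1], tree[2]
        if coeff < 0:
            raise ValueError("negative coefficients can break Mercer's conditions")
        return coeff * evaluate(child, s, t)
    left = evaluate(tree[1], s, t)
    right = evaluate(tree[2], s, t)
    if op == "add":
        return left + right
    if op == "mul":
        return left * right
    raise ValueError(f"unknown operator: {op}")


# Hypothetical tree: 0.5 * K1 + (K2 * K3), where Kk is the k-spectrum kernel.
tree = ("add",
        ("scale", 0.5, ("leaf", spectrum_kernel(1))),
        ("mul", ("leaf", spectrum_kernel(2)), ("leaf", spectrum_kernel(3))))
value = evaluate(tree, "AAB", "AB")
print(value)
```

An evolutionary search would then mutate and recombine such trees (swapping operators, changing coefficients or the value of k) and score each candidate by the cross-validated accuracy of an SVM that uses it.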
[Figure: new string kernel function -> evolutionary approaches -> optimized new string
kernel function.]
The proposed methodology of combining a new optimized string kernel with a Support
Vector Machine is expected to predict signal peptides and their cleavage sites with
high accuracy in comparison with previous work that uses either an SVM with a
numerical kernel or an SVM with a single traditional string kernel.
Literature Review:
This section will introduce the techniques used in signal peptide prediction tools and the
available techniques that are used to optimize kernels.
The problem of predicting signal peptides and their cleavage sites has been addressed
using machine learning techniques.