Machine Learning Approaches for Predicting Signal Peptides and their Cleavage Sites

Problem Definition:
A signal peptide is a short (3-60 amino acids long) peptide chain that directs the transport of a protein. It is the pre-sequence that targets proteins in eukaryotes and prokaryotes to other compartments, such as mitochondria, chloroplasts and apicoplasts, or through the secretory pathway. The signal sequence is usually located at the N-terminus of the protein and is cleaved off by a signal peptidase while the protein is transferred through the membrane. Knowledge of how signal peptides work is therefore important for understanding molecular mechanisms and for studying new drugs [1] [2]. If a sorting signal in a protein is cleaved at the wrong position, the protein may be delivered to the wrong cellular location, which can lead to disease [1].

The challenging problem to be solved is to discriminate secretory proteins (proteins that are synthesized at the endoplasmic reticulum and secreted by the cell) from non-secretory proteins, and to identify the cleavage sites of the secretory proteins.

[Figure: flowchart. A query protein is first classified as secretory or non-secretory; for a non-secretory protein the procedure ends, and for a secretory protein the signal peptide cleavage site is predicted.]

The length and amino acid order of signal peptide sequences vary considerably among proteins, which makes cleavage site identification more difficult. Nevertheless, signal peptides share some common features that can be used for identification. The most important feature is a series of hydrophobic amino acids called the h-region, which is generally seven to fifteen amino acids long. Before the h-region lies another region, the n-region, which contains one to five amino acids that generally carry positive charges. Between the h-region and the cleavage site is the c-region, which consists of three to seven polar but uncharged amino acids.
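As an illustration of the n-region and h-region features described above, the following minimal Python sketch slides a window over the N-terminal part of a sequence and reports the most hydrophobic stretch, then counts the positively charged residues before it. The Kyte-Doolittle hydropathy scale, the window length of 7, the 40-residue search limit and the function names find_h_region and n_region_charge are all assumptions made for this example; none of them is taken from the cited methods.

```python
# Kyte-Doolittle hydropathy values. Using this particular scale is an
# assumption of the sketch, not something prescribed by the cited work.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def find_h_region(seq, window=7, search_limit=40):
    """Return (start, mean_hydropathy) of the most hydrophobic window
    within the first `search_limit` residues (hypothetical helper)."""
    region = seq[:search_limit].upper()
    if len(region) < window:
        raise ValueError("sequence is shorter than the window")
    best_start, best_score = 0, float("-inf")
    for start in range(len(region) - window + 1):
        segment = region[start:start + window]
        # Unknown residue codes contribute 0.0 to the window score.
        score = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in segment) / window
        if score > best_score:  # strict ">" keeps the first of tied windows
            best_start, best_score = start, score
    return best_start, best_score


def n_region_charge(seq, h_start):
    """Count lysine (K) and arginine (R) residues before the h-region."""
    return sum(1 for aa in seq[:h_start].upper() if aa in "KR")


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real protein.
    toy = "MKRLLLLLLLASASAQDEEKDEEK"
    start, score = find_h_region(toy)
    print(start, round(score, 2), n_region_charge(toy, start))
```

A scan like this only locates a candidate h-region; it does not by itself decide whether a protein is secretory or where the cleavage site lies.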
We can predict the presence of signal peptides and their cleavage sites using machine learning techniques.

Support Vector Machine:

Kernel Methods:

Evolutionary Approaches:

Materials and Dataset:
The desired dataset can be extracted from the Swiss-Prot database (UniProt database server, ExPASy server), a biological database of protein sequences. The study may focus on particular groups of organisms, such as eukaryotes, Gram-positive bacteria or Gram-negative bacteria, by selecting the entries whose organism classification (OC) line has the specified type. From these entries, secretory proteins are chosen if they are marked with "Signal" in the feature table (FT), and non-secretory proteins are chosen if they are marked as cytoplasm or nucleus in the comment (CC) lines. If the chosen dataset contains more than one protein with the same first 100 residues, only one of them is kept, to avoid redundancy [3]. Constructing the data according to these criteria provides a rigorous, high-quality benchmark dataset for each group of organisms, most likely eukaryotes, Gram-positive bacteria and Gram-negative bacteria.

Results Testing:
The performance of the method can be tested using cross-validation. The test most frequently used in previous work is the jackknife (leave-one-out) test. Self-consistency testing (the training data is also used as the testing data) has also been used, and some studies held out 20% of the data for testing and used the rest for training. The dataset is relatively small, with no more than 3000 proteins in each benchmark, so the jackknife test is suitable here.

Recent Work:
Several methods have been used to predict signal peptides and their cleavage sites. Three different approaches used in this field are introduced here: support vector machines, hidden Markov models and conditional random fields.

1. Support vector machine approach:
An SVM-based ternary classifier (SecretP) was proposed for predicting mammalian secreted proteins, using pseudo-amino acid composition (PseAA) and five additional features to distinguish three types of proteins:
classically secreted proteins (CSP)
non-secreted proteins (NSP)
non-classically secreted proteins (NCSP)

Dataset:
For training, 864 mammalian proteins confirmed to follow the non-classical secretion route (NCSP) were collected from Swiss-Prot through data mining; 149 proteins from this set were used for testing. Proteins marked as "secreted" in the keyword (KW) line but without "signal" in the feature table (FT) lines were selected to construct the NCSP dataset. Proteins in the training and testing sets were clustered using the BLASTCLUST program to avoid redundancy and homology bias, so that only proteins with less than 25% sequence identity were kept. The CSP dataset contains 3321 classically secreted proteins, that is, extracellular proteins with N-terminal signal peptides that are released via the classical ER-Golgi pathway. For the NSP dataset, 3654 proteins annotated as residing in the cytoplasm and/or the nucleus were selected.

Feature extraction and Methods:
Amino acid composition (AAC): incorporates the occurrence frequencies of the 20 amino acids in the protein sequence.
Seven physicochemical properties of amino acids, including hydrophobicity, solvent-accessible surface area, net charge, polarity and polarizability.
N-terminal signal peptides: they are important for the release of CSPs, so they are useful for distinguishing CSPs from NCSPs and NSPs.
After the proteins were translated into numeric vectors, pseudo-amino acid composition was used to transform the unequal-length vectors into uniform matrices. A new PseAA model was constructed from amino acid composition (AAC) and auto covariance (AC), which accounts for sequence-order effects. The SVM was implemented using the LIBSVM library, with a radial basis function (RBF) kernel.
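The idea of turning sequences of unequal length into fixed-length descriptors through amino acid composition and auto covariance can be sketched as follows. This is a simplified illustration and not the SecretP feature set: it uses a single toy property (a 0/1 hydrophobicity indicator) in place of the physicochemical properties used in [4], and the names aac, auto_covariance, pseaa_vector and the max_lag value of 5 are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy property: 1.0 for residues treated as hydrophobic, 0.0 otherwise.
# This stand-in replaces the real physicochemical scales (assumption).
HYDROPHOBIC = set("AVLIMFWC")


def aac(seq):
    """Amino acid composition: frequency of each of the 20 residues."""
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def auto_covariance(values, max_lag):
    """Auto covariance of a per-residue property for lags 1..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def pseaa_vector(seq, max_lag=5):
    """Fixed-length descriptor: 20 AAC values plus max_lag AC values."""
    seq = seq.upper()
    if len(seq) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    prop = [1.0 if aa in HYDROPHOBIC else 0.0 for aa in seq]
    return aac(seq) + auto_covariance(prop, max_lag)


if __name__ == "__main__":
    # Two hypothetical sequences of different length give vectors of equal length.
    for s in ("MKRLLLLLLLASASAQDEEK", "MKTAYIAKQRQISFVKSHFSRQ"):
        print(len(s), len(pseaa_vector(s)))
```

Vectors of this kind could then be passed to an RBF-kernel SVM, for example scikit-learn's SVC(kernel="rbf"), which is built on LIBSVM; that step is omitted here to keep the sketch dependency-free.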
Results:
Five-fold cross-validation rather than the jackknife test was used because of the large number of proteins. Four measures, sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC), were used to evaluate the performance of SecretP [4].

2. Hidden Markov Model Approach:
An HMM-based approach predicts signal peptides in archaea and their cleavage sites, and also discriminates such proteins from cytoplasmic and transmembrane ones. Archaeal signal peptides exhibit a more eukaryotic-like cleavage site (c-region) and a unique h-region that resembles the bacterial one, with a slight over-representation of leucine and isoleucine. Leucine is the dominant residue in eukaryotes, so predictors trained on eukaryotic or bacterial proteins cannot reliably be applied to archaeal sequences.

Dataset:
The UniProt database lists only 12 archaeal sequences with experimentally verified, precise cleavage site locations, and the signal peptide database (SPDB) lists only 9. An extensive literature review was therefore carried out on PubMed to identify archaeal sequences.

Materials and Methods:
The HMM consists of three submodels:
SP submodel, corresponding to secretory signal peptides
TM submodel (N-terminal transmembrane)
Globular submodel
The SP submodel is the central core of the model; it models the positively charged n-region, the h-region and the c-region. The model was trained using the Baum-Welch algorithm for labeled sequences, and the Viterbi algorithm was used for decoding. Decoding produces the optimal path of states through the model and predicts the type of the sequence (SP, TM or globular), as well as the cleavage site, if any.

Results:
In 35-fold cross-validation, the method correctly predicted all 69 SPs and correctly rejected 248 out of 252 cytoplasmic and TM proteins. These results correspond to 100% sensitivity and 98.41% specificity, with an MCC of 0.964 [5].

3. Conditional Random Fields Approach:
Conditional random fields (CRFs) can be applied to predict signal peptides and their cleavage sites.
This work demonstrates how amino acid properties can be exploited and incorporated into a CRF to boost prediction performance. CRFs were originally designed for sequence labeling tasks: given a sequence of observations, a CRF finds the most likely label for each observation. A CRF has a graphical structure consisting of edges and vertices, in which a vertex represents a random variable whose distribution is to be inferred and an edge represents the dependency between two random variables (two amino acids in a protein). CRFs are undirected graphical models, as opposed to directed graphical models such as HMMs.

Dataset:
1937 eukaryotic protein sequences were extracted from Swiss-Prot version 56.5.

Methods and materials:
The prediction problem is formulated as a labeling task. Amino acids with similar properties can be categorized into subgroups. The 20 amino acids were divided according to their hydrophobicity and charge/polarity, because these properties are believed to carry information about cleavage sites: the h-region of signal peptides is rich in hydrophobic residues, and the c-region is dominated by small, non-polar residues. These groupings were used as observations to train the CRF.

Results:
Ten-fold cross-validation was used to verify performance, and an accuracy of up to 79.81% was achieved [6].

References:
1. Jingjing Sun, Lipo Wang: Predicting Signal Peptides and Their Cleavage Sites using Support Vector Machines and Improved Position Weight Matrixes. IEEE, Fourth International Conference on Natural Computation, 2008.
2. Karsten Hiller, Andreas Grote, Maurice Scheer, Richard Munch, Dieter Jahn: PrediSi: Prediction of Signal Peptides and their cleavage positions. Oxford, Nucleic Acids Research, 2004.
3. Kuo-Chen Chou, Hong-Bin Shen: Signal-CF: A subsite-coupled and window-fusing approach for predicting signal peptides. 2007.
4. Lezheng Yu, Yanzhi Guo, Zheng Zhang, Yizhou Li, Menglong Li, Gongbing Li, Wenjia Xiong, Yuhong Zeng: SecretP: A new method for predicting mammalian secreted proteins. Elsevier, Peptides 31, 2010.
5. P.G. Bagos, K.D. Tsirigos, S.K. Plessas, T.D. Liakopoulos, S.J. Hamodrakas: Prediction of signal peptides in archaea. Oxford, Protein Engineering Design and Selection, 2009.
6. Man-Wai Mak, Sun-Yuan Kung: Conditional Random Fields for the prediction of signal peptide cleavage sites. IEEE, ICASSP, 2009.

Abstract
Machine learning approaches have shown strong performance in solving supervised classification problems. The support vector machine (SVM) is one of the most promising classifiers that can be applied efficiently to bioinformatics problems, because it is a kernel-based model: it represents the data by means of kernel functions. A kernel function takes relationships that are implicit in the data and makes them explicit, so that patterns become easier to detect. Choosing a kernel function among the large number of kernel types is the most important design decision for an SVM. Developing new string kernels and optimizing their parameters is still a major challenge in the field. We will develop a new string kernel and optimize it using evolutionary approaches. The optimized kernel will be applied to the prediction of signal peptides and their cleavage sites in proteins.

Introduction

Contribution:
To develop a kernel for biological sequences and to optimize the kernel using evolutionary approaches.

Methodology:
The proposed methodology is based on the Support Vector Machine (SVM), a kernel-based machine learning technique that generates the optimal hyperplane separating two types of data. It is intended to distinguish signal peptides from other sequences, and to predict whether a signal peptide will be cleaved during its transport out of the cell. A string kernel function maps the string sequences into a high-dimensional space in which they can be classified linearly. In this work, a new string kernel will be developed and optimized using evolutionary approaches.
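To make the notion of a string kernel concrete, the sketch below implements the k-spectrum kernel, a well-known string kernel that counts shared substrings of length k. It is shown only as an example of how sequences are implicitly mapped to a high-dimensional feature space (one dimension per k-mer); it is not the new kernel this proposal sets out to develop, and the function names and the default k = 3 are assumptions made for the illustration.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Explicit feature map: how often each k-mer occurs in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    if len(cs) > len(ct):
        cs, ct = ct, cs  # iterate over the smaller table
    # A Counter returns 0 for k-mers that are absent from the other sequence.
    return sum(count * ct[kmer] for kmer, count in cs.items())


def normalized_spectrum_kernel(s, t, k=3):
    """Cosine-normalized kernel, so that K(s, s) = 1 for any non-trivial s."""
    denom = math.sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical toy sequences, not real signal peptides.
    a, b = "MKRLLLLLLLASASA", "MKKLLLALLLAGASA"
    print(spectrum_kernel(a, b), round(normalized_spectrum_kernel(a, b), 3))
```

Because the kernel is an inner product of count vectors, it is positive semidefinite, which is the property referred to by Mercer's conditions.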
This optimized string kernel will then be used with an SVM to predict signal peptides and their cleavage sites. The new kernel is represented as a tree in which multiple string kernels are connected by mathematical operators and coefficients, such that the resulting function satisfies Mercer's conditions. Evolutionary approaches will then be used to find the optimal structure and parameters of this tree, producing the optimized new string kernel.

[Figure: the new string kernel function drawn as a tree of mathematical operators (MathOperator) and coefficients (Coeff) over the string kernels SK1, SK2, ..., SKn; evolutionary approaches optimize the tree, and the optimized new string kernel function is passed to the SVM for prediction.]

Expectation
The proposed methodology, a new optimized string kernel used with a Support Vector Machine, is expected to predict signal peptides and their cleavage sites with higher accuracy than previous work that uses either an SVM with a numerical kernel or an SVM with a single traditional string kernel.

Literature Review:
This section introduces the techniques used in signal peptide prediction tools and the available techniques for optimizing kernels. The problem of predicting signal peptides and their cleavage sites has been addressed using machine learning techniques.