A Framework for Identifying Affinity Classes of Inorganic Materials Binding Peptide Sequences

Nan Du, Computer Science and Engineering Department, SUNY at Buffalo, U.S.A., nandu@buffalo.edu
Paras N. Prasad, Department of Chemistry, SUNY at Buffalo, U.S.A., pnprasad@buffalo.edu
Mark T. Swihart, Department of Chemical and Biological Engineering, SUNY at Buffalo, U.S.A., swihart@buffalo.edu
Marc R. Knecht, Department of Chemistry, University of Miami, Coral Gables, U.S.A., knecht@miami.edu
Aidong Zhang, Computer Science and Engineering Department, SUNY at Buffalo, U.S.A., azhang@buffalo.edu
Tiffany Walsh, Institute for Frontier Materials, Deakin University, Geelong, Australia, tiffany.walsh@deakin.edu.au

ABSTRACT
With the rapid development of bionanotechnology, there has recently been growing interest in identifying the affinity classes of inorganic material binding peptide sequences. Unfortunately, some distinct characteristics of inorganic material binding sequence data limit the performance of many widely-used classification methods. In this paper, we propose a novel framework to predict the affinity classes of peptide sequences with respect to an associated inorganic material. We first generate a large set of simulated peptide sequences based on our new amino acid transition matrix, and the probability of a test sequence belonging to a specific affinity class is then calculated by solving an objective function. The objective function is solved through iterative propagation of probability estimates among sequences and sequence clusters. Experimental results on a real inorganic material binding sequence dataset show that the proposed framework is highly effective at identifying the affinity classes of inorganic material binding sequences. In addition, an experiment on the SCOP (Structural Classification of Proteins) dataset shows that the proposed framework is general and can also be applied to traditional protein sequences.

Categories and Subject Descriptors
J.3 [Life And Medical Sciences]: Biology and Genetics; I.6.1 [SIMULATION AND MODELING]: Simulation Theory—Model classification

1. INTRODUCTION
Over the past decades, many studies have analyzed peptide sequences with affinity to biological entities such as enzymes, cells, viruses, lipids and proteins. Recently, there has been growing interest in identifying the characteristics of desired inorganic material binding peptide sequences by utilizing biocombinatorial peptide libraries such as phage display [27], cell surface display [5], and yeast display [3]. In particular, numerous studies have reported peptide sequences that bind to inorganic materials such as noble metals (gold, silver, platinum) [18, 19], semiconductors (zinc sulfide, cadmium sulfide) [9, 10], and metal oxides (silica, titania and magnetite) [4, 23], which are crucial for practical implementation in technology and medicine. Inorganic material binding peptide sequences, which are usually 7-14 amino acids long, differ from other polypeptides in their specific molecular recognition of targeted inorganic material surfaces [7]. Among the many studies of inorganic material binding sequences, effectively identifying the affinity class, which indicates the binding strength of the desired sequence with respect to the target inorganic material, is crucial for the further design of novel peptides [25].
The binding affinity of a peptide to an inorganic surface is the result of a complex interplay between the binding strength of its individual residues and its conformation. The binding strength of the desired sequence for a specific material is usually measured with the adsorption free energy (∆G_ads), which is then used to classify each sequence into an affinity class such as weak, medium, or strong. Despite extensive recent reports on combinatorially selected inorganic binding peptides and their bionanotechnological utility as synthesizers and molecular linkers [7, 15, 30], there is still limited knowledge about the relationships between binding peptide sequences and their associated inorganic materials. Therefore, by using machine learning to suggest sequence affinity classes, we can improve performance and functionality and reduce cost without conducting new large-scale phage-display screenings.

Various approaches have been used or developed for recognizing both close and distant homologs of given protein sequences, which is one of the central themes in bioinformatics. Most of this work is based on traditional machine learning models such as the Hidden Markov Model (HMM) [8, 20], Neural Networks (NN) [32, 34] and the Support Vector Machine (SVM) [13, 16]. However, the problem of identifying the affinity classes of inorganic material binding peptide sequences has some distinct challenges that are rarely faced in protein sequence identification, and these challenges markedly limit the performance of the models mentioned above, successful as they are for protein sequence detection.

Challenge I: The number of labeled samples is usually insufficient. As an emerging topic, peptide sequences binding solid inorganic materials have only been studied for about a decade, far less than protein sequence analysis, which has a much longer history. For example, unlike protein sequence analysis, which benefits from large-scale public datasets such as GPCR [17] or SCOP [33], no complete results of large-scale screening experiments have been made publicly available for inorganic material binding sequences, so the available data are usually quite limited. Most existing protein sequence classification approaches require a large number of labeled samples to train an accurate model, but labeling the affinity classes of a large number of inorganic material binding sequences is time-consuming and expensive, and thus usually infeasible. If only a limited number of labeled samples are available for model training, the learned model may suffer from over-fitting or under-fitting. Semi-Supervised Learning (SSL) [36], a machine learning approach that has received much attention in the past decade, is good at handling the lack of sufficient labeled training data. However, its utility may be markedly limited by the next challenge.

Challenge II: The peptide sequences belonging to the same affinity class may be very dissimilar. Usually, protein sequences belonging to the same family follow some apparent patterns; in other words, they are similar to each other from some perspective. However, the "similarity" among inorganic material binding peptide sequences from the same affinity class may not be so apparent.
In some cases, the intra-similarity, which measures the similarity among all sequences inside the same class, is even lower than the inter-similarity, which measures the similarity among sequences from different classes. This phenomenon also means that some peptide sequences belonging to the same class may be dissimilar to each other, at least according to current knowledge. This observation shows that inorganic material binding sequences violate, at the class level, the smoothness assumption generally made in both supervised and semi-supervised learning.

In light of these challenges in identifying the affinity classes of inorganic material binding sequences, we propose a novel framework consisting of two parts. First, to tackle the insufficient-data challenge, we augment the training sequence set with simulated sequences generated from a new amino acid transition matrix. By using the simulated sequences, we incorporate not only the prior phylogenetic knowledge but also the specific sequence patterns responsible for the target inorganic material into the training data. Second, instead of searching for patterns globally among the peptide sequences belonging to the same affinity class, we separate the sequences into smaller clusters and learn the patterns from them locally. Intuitively, since few obvious patterns can be found at the class level, we search for them at the smaller cluster level.

In this paper, we make the following contributions:

• We introduce the distinct challenges in identifying affinity classes of inorganic material binding sequences.

• We propose a novel framework which can effectively predict the affinity classes of inorganic material binding sequences, and provide an efficient iterative algorithm to find the optimal solution of the proposed objective function. Moreover, the proposed framework is general and is also effective at identifying traditional protein sequences.

• Extensive experiments show that the proposed method outperforms many baseline methods.

The rest of the paper is organized as follows. In Section 2, we describe the datasets used in this paper and the setting of our problem. The peptide sequence simulation method and the graph-based optimization model are presented in Section 3 and Section 4, respectively. Extensive experimental results are shown in Section 5. In Section 6, we explain the relationship between our work and previous related work. The conclusion and future work are presented in Section 7.

2. DATA SET AND PROBLEM DEFINITION
In this section, we describe the datasets used in this paper and present the problem definition.

2.1 Data Set
We have used two datasets to demonstrate the proposed method's performance. The first dataset is from Ersin Emre Oren et al. [25]. This data set consists of a total of 39 quartz (rhombohedral silica, SiO2) binding peptide sequences produced using phage-display techniques. These peptide sequences are further classified based on their affinity; the strong and weak binder classes contain 10 and 15 sequences, respectively. To better illustrate the problem and the proposed method in the following, we extract a sample set covering the two affinity classes from this data set and show it in Table 1.

Table 1: Sample set of peptide sequences data

  Strong Class                  Weak Class
  Name    Sequence              Name    Sequence
  DS202   RLNPPSQMDPPF          DS201   MEGQYKSNLLFT
  DS189   QTWPPPLWFSTS          DS191   VAPRVQNLHFGA
  ...     ...                   ...     ...
It is worth noticing that this data set is a good illustrative example of the two challenges mentioned above. First, only around ten sequences are available for each affinity class, far fewer than the hundreds or thousands of sequences usually involved in classifier training for protein sequence analysis [20, 29]. Second, the unobvious-pattern challenge in this data set is well illustrated by Figure 1. In this figure, based on the total similarity scores (TSS) defined in [25], we first calculate the total similarity of the sequences from the same class (i.e., the strong class in blue and the weak class in orange) via the following equation:

$$TSS_A = \frac{1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \sum_{j=1}^{N_A} PSS_{ij}\,(1 - \delta_{ij}), \qquad (1)$$

where δ is the usual Kronecker delta function, in which δ_ij = 1 when i = j and 0 otherwise, N_A is the total number of sequences in set A, and PSS_ij is the similarity between the i-th and the j-th sequence of set A, calculated via the Needleman-Wunsch algorithm [24]. For the sake of simplicity, we call this quantity the self-class similarity. Moreover, the TSS of the sequences across two classes is calculated as

$$TSS_{A-B} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} PSS_{ij}, \qquad (2)$$

where N_B is the total number of sequences in set B. Correspondingly, the total similarity of sequences across classes is called the across-class similarity. To calculate PSS_ij, we need to provide a transition matrix on which the optimal scoring alignment is made. Without loss of generality, we have used Blosum 62 [14] and Pam 250 [26] as the transition matrices (Figure 1). It can be seen from the figure that the sequences belonging to the weak class have very low or no significant similarity, even much lower than the cross-class similarity. Due to this phenomenon, it is hard for traditional classification approaches to learn an effective pattern from the weak class.

[Figure 1: Total similarity scores of the self-class (the strong class and the weak class) and the cross-class for quartz binders, using (a) Pam 250 and (b) Blosum 62 as the transition matrix.]
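As a concrete reading of Eqs. (1) and (2), the following minimal Python sketch computes the self-class and across-class similarities from precomputed pairwise alignment scores. The function names and the assumption that the PSS values are supplied as numpy arrays are ours, for illustration only.

```python
import numpy as np

def self_class_similarity(pss):
    """Eq. (1): mean pairwise similarity within one class.

    pss is an N_A x N_A matrix of pairwise alignment scores PSS_ij for the
    sequences of a single class; the diagonal is excluded, mirroring the
    (1 - delta_ij) factor.
    """
    n = pss.shape[0]
    off_diagonal_sum = pss.sum() - np.trace(pss)
    return off_diagonal_sum / (n * (n - 1))

def across_class_similarity(pss_ab):
    """Eq. (2): mean pairwise similarity between two classes.

    pss_ab is an N_A x N_B matrix of alignment scores between the sequences
    of class A (rows) and class B (columns).
    """
    n_a, n_b = pss_ab.shape
    return pss_ab.sum() / (n_a * n_b)
```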
To demonstrate that the proposed work is a general framework which is also effective at predicting the homology families of traditional protein sequences, the SCOP dataset from [22] is used. In addition, we employ the approach developed by Anoop Kumar and Lenore Cowen [20] to pick the SCOP families, where the acquired proteins are further grouped into seven families (i.e., A, B, C, D, E, F and G). The sizes of the families in the dataset are shown in Table 2, and the data we used are available at http://www.acsu.buffalo.edu/~nandu/Datasets/.

Table 2: Summary of the protein sequence data

  Family    Number of Seq   Shortest Seq   Longest Seq
  Class A   23              160            177
  Class B   23              136            72
  Class C   16              224            260
  Class D   19              120            144
  Class E   18              324            429
  Class F   14              131            221
  Class G   20              45             83

2.2 Problem Definition
We consider our problem as identifying the affinity classes of the test inorganic material binding peptide sequences based on the training sequences. We are given a pool of l + u peptide sequences S_source = {s_1, ..., s_l, ..., s_{l+u}}, where each peptide sequence is represented as a series of ordered amino acids. To better understand what a peptide sequence looks like, take a sequence from Table 1 as an example: DS202, RLNPPSQMDPPF, is a peptide sequence composed of twelve ordered amino acids, where each letter denotes one of the 20 standard amino acids. We assume that in this sequence pool S_source the first l sequences s_i ∈ S_source (1 ≤ i ≤ l) are labeled according to their affinity to the target inorganic material (e.g., weak or strong) and together form the labeled set L, and that the remaining u sequences s_i ∈ S_source (l + 1 ≤ i ≤ l + u) are unlabeled and together form the unlabeled set U, where L ∪ U = S_source. Our goal is to predict the labels of the peptide sequences in U using the training sequences in L.

3. PEPTIDE SEQUENCES SIMULATION
As mentioned before, lack of labeled data is a general problem when working on inorganic material binding sequence research. One of the most successful approaches to date for recognizing protein sequences based on evolutionary knowledge is to use simulated sequences. Many studies [1, 20] have shown that augmenting the training set with simulated sequences generated from an amino acid transition matrix such as Blosum 62 or Pam 250 can increase homolog identification performance. Also, it is believed that a set of peptides generated by directed evolution to recognize a given solid material will have similar sequences [25].

Although these transition matrices have been shown to be effective and have gained wide acceptance, we cannot directly apply this technique to generate simulated sequences: these matrices are derived from large-scale natural protein sequence databases rather than from sequences binding the target inorganic material, and therefore cannot represent the target inorganic material well. Thus, we only use a traditional transition matrix as a seed and, based on it, generate a new transition matrix that not only maintains the prior knowledge from proteins but also captures the significant knowledge about the target inorganic material. Aiming to provide a more comprehensive and diverse view for our model, we use a two-step simulated sequence generation approach to enlarge our training set.

First, we generate a new transition matrix that better measures the amino acid transition relations for the target inorganic material. Specifically, we use a traditional transition matrix (e.g., Blosum 62 or Pam 250) as a seed matrix M, which is a 20 × 20 symmetric matrix whose rows and columns are indexed by the single-letter amino acid codes. Then we greedily and iteratively mutate each entry m_ij, an integer coefficient between two amino acids in the seed transition matrix, so as to maximize the difference between the self-class similarity and the cross-class similarity [25], which is designed to enlarge the gap among the affinity classes.

Second, after the new transition matrix M* is constructed, the simulated sequences are generated based on the labeled sequence set L. When a sequence is selected as a seed, a simulated sequence is generated by randomly selecting a position in it and replacing the amino acid i at that position with a new amino acid j with the probability defined in Eq. (3):

$$P_{ij} = \frac{M^*_{ij}}{\sum_{j=1}^{20} M^*_{ij}}. \qquad (3)$$

As an example, Figure 2 shows the process of mutating the amino acid at a selected position based on the mutation probability; we keep replacing amino acids in the target sequence until a desired mutation threshold t is reached [20].
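To make the second step concrete, here is a small Python sketch of the mutation procedure of Eq. (3). It assumes the adapted matrix M* is supplied as a dictionary of non-negative entries indexed by amino-acid pairs, and it interprets the mutation threshold t as the fraction of positions to mutate; both are our assumptions for illustration, not a prescription from the original method.

```python
import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids

def mutate_once(sequence, matrix):
    """Replace one randomly chosen residue according to Eq. (3).

    `matrix` maps (current_aa, candidate_aa) pairs to the non-negative
    entries of the adapted transition matrix M* (hypothetical input format).
    """
    pos = random.randrange(len(sequence))
    current = sequence[pos]
    weights = [matrix[(current, aa)] for aa in AMINO_ACIDS]  # row of M*
    new_aa = random.choices(AMINO_ACIDS, weights=weights, k=1)[0]  # P_ij of Eq. (3)
    return sequence[:pos] + new_aa + sequence[pos + 1:]

def simulate(sequence, matrix, mutation_rate=0.15):
    """Generate one simulated sequence by repeated single-position mutations,
    stopping once a fraction `mutation_rate` of the positions has been mutated."""
    n_mutations = max(1, round(mutation_rate * len(sequence)))
    for _ in range(n_mutations):
        sequence = mutate_once(sequence, matrix)
    return sequence
```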
[Figure 2: An example of mutating an amino acid at the selected position. When a specific position (e.g., the 8-th) of the target sequence RLNPPSQMDPPF is selected, the corresponding amino acid M (Methionine) is mutated into another amino acid, here W (Tryptophan), according to the mutation probabilities, yielding RLNPPSQWDPPF.]

By this two-step method, we incorporate not only the prior phylogenetic knowledge but also the specific amino acid patterns responsible for the target inorganic material into the data. Accordingly, based on this peptide sequence simulation method, for each labeled source sequence s_i ∈ S_source (1 ≤ i ≤ l) we generate m mutated sequences, represented by a simulated peptide sequence set S_simulated = {s*_1, ..., s*_{l×m}}. Finally, we define the sequence pool as S = S_source ∪ S_simulated, which includes both the source peptide sequences and the simulated sequences. We will show in the experiments that the simulated sequences effectively improve the performance.

4. GRAPH-BASED OPTIMIZATION MODEL
Aiming to handle the challenge that obvious patterns are hard to find at the class level, we propose a graph-based optimization model to estimate the conditional probability of each test sequence belonging to each affinity class. Our method begins by mapping sequences from the sequence pool into nodes of a sequence-to-sequence graph (Section 4.1), where the relationships among sequences are better measured and many efficient clustering methods are available. Instead of searching for patterns at the class level, we partition the sequences into clusters, where we believe the significant patterns exist, and propose an objective function (Section 4.2) to learn the conditional probability of each sequence belonging to a specific affinity class. Finally, we present an efficient iterative algorithm to obtain the optimal value of the objective function (Section 4.3).

4.1 Mapping Sequences into Nodes of a Graph
We map all the sequences into a graph where each node denotes a peptide sequence and each edge denotes the pairwise similarity between two sequences. This graph offers a better view of the pairwise relationships among peptide sequences and is easy to partition into clusters. The pairwise similarity between sequences is calculated with the Needleman-Wunsch algorithm [24] after local alignment of each sequence pair using the Smith-Waterman algorithm [28].
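As one possible implementation of this step, the sketch below scores all sequence pairs with Biopython's PairwiseAligner under BLOSUM62. For simplicity it uses a single global (Needleman-Wunsch-style) alignment per pair rather than the local-then-global procedure described above, and the gap penalties are illustrative choices rather than values taken from the paper.

```python
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices

def similarity_matrix(sequences):
    """Pairwise alignment scores behind the sequence-to-sequence graph.

    Simplified sketch: global alignment scores under BLOSUM62 with affine
    gap penalties; the Smith-Waterman local-alignment step of the paper is
    omitted here.
    """
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
    aligner.open_gap_score = -10
    aligner.extend_gap_score = -0.5

    n = len(sequences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = aligner.score(sequences[i], sequences[j])
    return sim
```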
4.2 The Objective Function
The key idea of our approach is that, instead of searching for patterns at the class level, we narrow the affinity class prediction problem down from the class level to the cluster level. We believe that, if the patterns are obscure at the class level, shifting the focus to the cluster level lets us find clearer patterns. Before proceeding further, we introduce the notation used in the following discussion: d_ij denotes the ij-th entry of a matrix D, and the vectors formed by its i-th row and j-th column are denoted d⃗_i. and d⃗_.j, respectively.

Belongingness Matrix: We denote the belongingness matrix B as an N × V matrix, where N is the number of all sequences (including source sequences, simulated sequences and test sequences) and V is the number of clusters obtained by clustering the N sequences. Each entry of the belongingness matrix corresponds to the probability of a peptide sequence belonging to a cluster: if peptide sequence s_i is assigned to cluster c_j, then b_ij = 1, and 0 otherwise. To construct the belongingness matrix B, we use spectral clustering [31], which has proved effective for graph partitioning problems, to partition the sequence-to-sequence graph constructed in the previous section into V clusters.

Sequence Probability Matrix: The conditional probability of peptide sequence s_i belonging to class z (s_iz = P̂(y = z | s_i)) is estimated with an N × D matrix S, where D is the number of affinity classes we want to classify.

Cluster Probability Matrix: The conditional probability of cluster c_j belonging to class z (c_jz = P̂(y = z | c_j)) is estimated with a V × D matrix C, where c_jz represents the probability of cluster c_j belonging to class z.

Sequence Labeled Matrix: Since we have a labeled sequence set L in which the sequences carry initial class labels, these labels are represented by an N × D matrix F, where f_iz = 1 if we know in advance that sequence s_i belongs to class z, and 0 otherwise.

Cluster Labeled Matrix: We may also have prior information that some cluster belongs to a specific class. We use a V × D matrix Y to define initial labels for clusters, where y_jz = 1 denotes that we are confident that cluster c_j represents class z according to some prior knowledge, and 0 otherwise. In our case, we assign a cluster c_j to a specific class z if all the source sequences in it belong to the same class z.

Cluster Similarity Matrix: In addition, a V × V matrix W denotes the similarity among the sequence clusters, where w_ij is the similarity between sequence clusters c_i and c_j. In our work, the pairwise cluster similarity is calculated with TSS_{A-B} between the two sets of sequence binders A and B.

Now we formulate the affinity class identification problem as the following objective function:

$$\min_{S,C} J(S,C) = \sum_{i=1}^{N}\sum_{j=1}^{V} b_{ij}\,\|\vec{s}_{i.}-\vec{c}_{j.}\|^2 + \alpha\sum_{i=1}^{V}\sum_{j=1}^{V} w_{ij}\,\|\vec{c}_{i.}-\vec{c}_{j.}\|^2 + \beta\sum_{i=1}^{N} h_i\,\|\vec{s}_{i.}-\vec{f}_{i.}\|^2 + \gamma\sum_{j=1}^{V} k_j\,\|\vec{c}_{j.}-\vec{y}_{j.}\|^2, \qquad (4)$$

subject to $\sum_{z=1}^{D} s_{iz} = 1,\; s_{iz} \ge 0$ and $\sum_{z=1}^{D} c_{jz} = 1,\; c_{jz} \ge 0$,

where ‖·‖ indicates the L2 norm, and h_i = Σ_z f_iz and k_j = Σ_z y_jz indicate whether sequence s_i and cluster c_j carry prior labels. The first term in Eq. (4), $\sum_{i=1}^{N}\sum_{j=1}^{V} b_{ij}\|\vec{s}_{i.}-\vec{c}_{j.}\|^2$, ensures that a sequence should have a probability vector similar to that of the cluster it belongs to; namely, cluster c_j should correspond to class z if the majority of sequences in this cluster belong to class z. Intuitively, the higher the deviation, the larger the penalty. The second term, $\alpha\sum_{i=1}^{V}\sum_{j=1}^{V} w_{ij}\|\vec{c}_{i.}-\vec{c}_{j.}\|^2$, corresponds to the intuition that clusters which are close to each other should have similar classes, and α denotes the confidence placed on this source of information. From the view of graph theory, this term propagates the class information among the clusters. The third term, $\beta\sum_{i=1}^{N} h_i\|\vec{s}_{i.}-\vec{f}_{i.}\|^2$, puts the constraint that the predictions should not deviate too much from the corresponding sequence ground truth, and β expresses the confidence of our belief in the prior knowledge of sequences. Similarly, the last term, $\gamma\sum_{j=1}^{V} k_j\|\vec{c}_{j.}-\vec{y}_{j.}\|^2$, is the loss penalizing the deviation between the predictions and our prior knowledge of clusters, and γ expresses the confidence of our belief in the prior knowledge of clusters.

4.3 Iterative Update Algorithm
It is easy to prove that the objective function in Eq. (4) is a convex program, which makes it possible to find a globally optimal solution. To obtain the optimal solution for the matrices S and C, we propose to solve Eq. (4) using the block coordinate descent method [2]. At iteration t, fixing the value of $\vec{s}_{i.}$, we take the partial derivative of Eq. (4) with respect to $\vec{c}^{\,t}_{j.}$, set it to zero, and obtain the update formula of Eq. (5):

$$\vec{c}^{\,t}_{j.} = \frac{\sum_{i=1}^{N} b_{ij}\,\vec{s}^{\,t-1}_{i.} + \gamma k_j\,\vec{y}_{j.}}{\sum_{i=1}^{N} b_{ij} + \gamma k_j + \alpha\,(\vec{i}_{j.} - \vec{\tilde{l}}_{j.})}, \qquad (5)$$

where $\vec{i}_{j.}$ and $\vec{\tilde{l}}_{j.}$ denote the j-th rows of I and of $\tilde{L}$ defined below. Accordingly, the update can be represented in matrix form as Eq. (6), where $D_v = \mathrm{diag}(\sum_{i=1}^{N} b_{ij})$ is the normalization factor, $K_v = \mathrm{diag}(\sum_{z=1}^{D} y_{jz})$ indicates the constraints for the clusters, and diag denotes a diagonal matrix built from the given elements. Furthermore, $\tilde{L} = D_w^{-1/2} W D_w^{-1/2}$ is the symmetrically normalized similarity matrix of [35], where $D_w$ is the diagonal degree matrix of W (so that $I - \tilde{L}$ is the normalized graph Laplacian):

$$C^t = \big(D_v + \gamma K_v + \alpha (I - \tilde{L})\big)^{-1}\big(B^T S^{t-1} + \gamma K_v Y\big). \qquad (6)$$

The Hessian matrix with respect to C is the sum of a diagonal matrix with entries $\sum_{i=1}^{N} b_{ij} + \gamma k_j > 0$ and the matrix $\alpha(I - \tilde{L})$. The diagonal matrix is positive definite, and it is easy to prove that $I - \tilde{L}$ is positive semi-definite; thus the Hessian is positive definite, which means that setting the derivative with respect to C to zero gives the unique minimum of Eq. (4).
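For reference, a small numpy sketch that evaluates the objective J(S, C) of Eq. (4), e.g. to monitor the iterative updates derived here; the vectorized broadcasting is an implementation choice of ours, and h and k are the per-sequence and per-cluster label indicators introduced above.

```python
import numpy as np

def objective(S, C, B, F, Y, W, h, k, alpha, beta, gamma):
    """Evaluate J(S, C) of Eq. (4).

    S: N x D sequence probabilities;   C: V x D cluster probabilities;
    B: N x V belongingness matrix;     F, Y: sequence / cluster label matrices;
    W: V x V cluster similarities;     h, k: label indicators per sequence / cluster.
    """
    fit = (B * np.square(S[:, None, :] - C[None, :, :]).sum(axis=2)).sum()
    smooth = (W * np.square(C[:, None, :] - C[None, :, :]).sum(axis=2)).sum()
    seq_prior = (h * np.square(S - F).sum(axis=1)).sum()
    clu_prior = (k * np.square(C - Y).sum(axis=1)).sum()
    return fit + alpha * smooth + beta * seq_prior + gamma * clu_prior
```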
Similarly, fixing $\vec{c}^{\,t}_{j.}$, we can obtain the update formula for $\vec{s}_{i.}$ as Eq. (7):

$$\vec{s}^{\,t}_{i.} = \frac{\sum_{j=1}^{V} b_{ij}\,\vec{c}^{\,t}_{j.} + \beta h_i\,\vec{f}_{i.}}{\sum_{j=1}^{V} b_{ij} + \beta h_i}. \qquad (7)$$

The matrix form of Eq. (7) is

$$S^t = (D_n + \beta H_n)^{-1}\,(B\,C^t + \beta H_n F), \qquad (8)$$

where $D_n = \mathrm{diag}(\sum_{j=1}^{V} b_{ij})$ is the normalization factor and $H_n = \mathrm{diag}(\sum_{z=1}^{D} f_{iz})$ indicates the constraints for the sequences. The Hessian matrix with respect to S is also diagonal, with diagonal elements $\sum_{j=1}^{V} b_{ij} + \beta h_i > 0$, which means that setting the derivative with respect to S to zero gives the unique minimum of Eq. (4).

To sum up, the pseudo-code for iteratively solving Eq. (4) by the block coordinate descent method is shown as Algorithm 1, where ϵ is a convergence threshold. Since the proposed method is based on a graph model, we name our approach the Peptide Sequences Identification Graph Model (PSIGM).

Algorithm 1 The PSIGM Algorithm
Input: belongingness matrix B_{N×V}, sequence labeled matrix F_{N×D}, cluster labeled matrix Y_{V×D}, cluster similarity matrix W_{V×V}, and parameters α, β, γ, ϵ
Output: estimated sequence probability matrix S_{N×D}
1: Initialize S^0 randomly;
2: t ← 1;
3: begin
4: repeat
5:   Update C^t using Eq. (6);
6:   Update S^t using Eq. (8);
7:   t ← t + 1;
8: until ‖S^t − S^{t−1}‖ ≤ ϵ
9: Output S^t;
10: end
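Below is a minimal numpy sketch of Algorithm 1 using the matrix updates of Eqs. (6) and (8). The matrix names follow Section 4.2; the Dirichlet initialization of S^0 and the max-norm convergence test are implementation choices of ours, not fixed by the text.

```python
import numpy as np

def psigm(B, F, Y, W, alpha=2.0, beta=10.0, gamma=3.0, eps=1e-4, max_iter=100):
    """Sketch of Algorithm 1: block coordinate descent on Eq. (4).

    B: N x V belongingness matrix, F: N x D sequence labels,
    Y: V x D cluster labels, W: V x V cluster similarities.
    Returns the estimated N x D sequence probability matrix S.
    """
    N, V = B.shape
    D = F.shape[1]

    Dv = np.diag(B.sum(axis=0))            # cluster-side normalization
    Dn = np.diag(B.sum(axis=1))            # sequence-side normalization
    Kv = np.diag(Y.sum(axis=1))            # which clusters carry prior labels
    Hn = np.diag(F.sum(axis=1))            # which sequences carry prior labels

    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_tilde = d_inv_sqrt @ W @ d_inv_sqrt  # normalized cluster similarity
    I = np.eye(V)

    S = np.random.dirichlet(np.ones(D), size=N)   # random row-stochastic S^0
    for _ in range(max_iter):
        # Eq. (6): update cluster probabilities given S
        C = np.linalg.solve(Dv + gamma * Kv + alpha * (I - L_tilde),
                            B.T @ S + gamma * Kv @ Y)
        # Eq. (8): update sequence probabilities given C
        S_new = np.linalg.solve(Dn + beta * Hn, B @ C + beta * Hn @ F)
        if np.abs(S_new - S).max() <= eps:
            S = S_new
            break
        S = S_new
    return S
```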
This iterative process is a procedure of information propagation among the clusters. In each iteration step, each cluster estimates its class based on the classes of its member sequences while retaining its initial class Y. After all the clusters have received this label information, they propagate it to their neighboring clusters based on the smoothness assumption. After the clusters have received the information from their neighbors, they pass it back to the nodes belonging to them, while the nodes retain their initial classes F. This process continues until convergence. To better demonstrate it, Figure 3 shows an example of how the information propagates.

[Figure 3: An example illustrating the label propagation at each iteration. (A) Partition of all the nodes (each sequence is represented as a node) into multiple clusters; (B) the conditional probability estimates of the clusters C receive the label information from the sequences (nodes) belonging to them; (C) each cluster propagates its class information to its neighboring clusters; (D) after updating the probabilities, each cluster passes the label information back to its members' conditional probabilities S. Node classes: strong, weak, simulated, test; cluster classes: strong, weak, unlabeled.]

4.4 Time Complexity
The time complexity of the proposed algorithm is composed of two parts: updating the cluster probability matrix C and updating the sequence probability matrix S. For updating the matrix C, the time complexity is O(V N^2 D + V^3 + V^2 D), where N is the size of the peptide sequence pool, V is the number of clusters, and D is the number of affinity classes. Since in our case the number of sequences is usually much larger than the number of clusters, the time complexity of this step is O(V N^2 D). For updating the matrix S, the time complexity is O(N^2 D + N V D), i.e., O(N^2 D). Therefore, the overall time complexity per iteration is O(V N^2 D). Supposing the number of iterations is k, the time complexity of the whole algorithm is O(k V N^2 D). In the experiments, we observe that k is usually between 8 and 20.

5. EXPERIMENTS
In the following, we conduct two extensive sets of experiments to show the proposed framework's performance. First, the experiment on the quartz binding sequence dataset shows that PSIGM is effective at identifying the binding affinity classes of inorganic material binding sequences. Second, the experiment on the SCOP protein sequence dataset shows that PSIGM is a general framework which also works effectively on other kinds of sequence sets. Because most of our baselines are designed as binary classifiers, for the sake of simplicity, in the following experiments we only consider weak versus strong binder class identification on the data set mentioned in Section 2.1, although our proposed method is not restricted to binary classification. Throughout all the experiments, we set α = 2, β = 10, and γ = 3 for our algorithm. The rationale is that both α and γ depend on the clustering result, which is influenced by uncertain factors such as the number of clusters and the initial centroid of each cluster, so assigning them relatively low values is better; on the other hand, β reflects our confidence in the labeled sequences, which come from strict and reliable experiments, so β should be assigned a relatively larger value.

5.1 Experiment on Material Binding Sequences
To show that the proposed framework is effective at predicting the binding affinity classes of inorganic material binding sequences, we perform the following three experiments to demonstrate that: 1) simulated sequences and information propagation among the clusters can effectively alleviate the limitation caused by Challenge I; 2) searching for patterns in clusters rather than classes helps handle Challenge II; and 3) the newly generated transition matrix contributes to the proposed method's performance. For each accuracy reported in the following experiments, we repeat the run 5 times using leave-one-out validation and report the mean value. In addition, for each experiment, we fix the convergence threshold ϵ to 10^-4.
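For completeness, a sketch of the leave-one-out protocol used here; `predict_affinity` is a hypothetical stand-in for the full pipeline (sequence simulation, graph construction and Algorithm 1) and is not part of the original paper.

```python
import numpy as np

def leave_one_out_accuracy(sequences, labels, predict_affinity, repeats=5):
    """Leave-one-out evaluation averaged over `repeats` runs (Section 5.1).

    `predict_affinity(train_seqs, train_labels, test_seq)` should return the
    predicted affinity class of `test_seq`; repeating accounts for the random
    initialization of S^0 in Algorithm 1.
    """
    accuracies = []
    for _ in range(repeats):
        correct = 0
        for i, (seq, label) in enumerate(zip(sequences, labels)):
            train_seqs = sequences[:i] + sequences[i + 1:]
            train_labels = labels[:i] + labels[i + 1:]
            correct += int(predict_affinity(train_seqs, train_labels, seq) == label)
        accuracies.append(correct / len(sequences))
    return float(np.mean(accuracies))
```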
The effect of simulated sequences and cluster information propagation. To test the effect of simulated sequences and of information propagation among the clusters, we ran our method without using simulated sequences and without information propagation among clusters, respectively. Furthermore, as mentioned in Section 4.2, we need to partition all the sequences into V clusters, so we also want to show the relationship between the proposed method's performance and the number of clusters. We vary the number of clusters over 2, 5, 10, 15, 18 and 20. We want to show that both strategies, sequence simulation and information propagation among the clusters, are crucial for improving the performance. The result is shown in Figure 4, where the x axis denotes the number of clusters and the y axis denotes the accuracy. We can see that the simulated sequences contribute to the performance improvement. Also, in the absence of information propagation among clusters, the performance drops, which is the expected behavior. Finally, we notice that hill-like shapes appear as the number of clusters increases for all three cases, likely because when the number of clusters is too low the partition is close to a global view, and when it is too high the clusters become too small to learn from.

[Figure 4: Performance comparison with different strategies (PSIGM, PSIGM without information propagation, PSIGM without sequence simulation): accuracy versus the number of clusters (2, 5, 10, 15, 18, 20).]

Performance comparison with baselines. In this part, we compare the proposed method with 5 other algorithms mentioned above, including SVM, Neural Network (NN), HMM, Learning with Local and Global Consistency (LLGC) [35], a well-known graph-based semi-supervised learning algorithm, and the approach proposed in [25]. For fairness and comprehensiveness, we have also tried adding the simulated sequences used by our framework to these methods, which are then marked as SVM*, NN* and HMM*. Note that, since LLGC is designed as a semi-supervised algorithm that needs unlabeled instances to help propagate the label information, we only consider LLGC with the simulated sequences, which are used as unlabeled data. Thus, all the methods can be separated into two groups: those using the simulated sequences and those without them. To measure the influence of the mutation rate t used to generate the simulated sequences, we vary t over 5%, 10%, 15% and 20%. The results of predicting the quartz binding affinity classes, compared with the baselines, are shown in Table 3. Note that the proposed method significantly outperforms the others in predicting the affinity classes of the test inorganic material binding sequences. In addition, the performance of the proposed method is not very sensitive to the setting of the mutation rate when it stays in a relatively low range. It is worth noticing that, instead of helping, the simulated sequences in SVM*, HMM* and NN* make the performance worse than that of SVM, HMM and NN without them. The main reason for this phenomenon is that all these methods take a global view of the training data, which is represented as only two large classes, the strong and the weak binding class. However, as we mentioned, the sequences inside the same class may be very different, so the more simulated sequences are added, the less obvious the patterns become. Since the proposed method treats the sequences locally as clusters, it can properly handle this problem.

Table 3: Comparison with baselines under different mutation rates

                Mutation Rate
  Method        5%     10%    15%    20%
  PSIGM         0.82   0.82   0.84   0.83
  LLGC          0.60   0.63   0.60   0.62
  NN*           0.60   0.62   0.58   0.55
  HMM*          0.66   0.65   0.69   0.67
  SVM*          0.62   0.63   0.58   0.59
  SVM           0.68   (no simulated sequences)
  NN            0.65   (no simulated sequences)
  HMM           0.70   (no simulated sequences)
  Oren et al.   0.80   (no simulated sequences)
Comparison with varying transition matrices. Finally, we show the performance of the proposed method with different transition matrices. We want to demonstrate that the newly generated matrix improves the performance of the proposed algorithm for the target inorganic material. In this experiment, we also vary the mutation rate t over 5%, 10%, 15% and 20%. We compare the new transition matrix M1, generated based on Blosum 62, and M2, generated based on Pam 250, with four other widely-used transition matrices, Blosum 62, Pam 250, Dayhoff [6] and Gonnet [12], in Table 4. The results show that the newly generated transition matrices perform better than the others at every mutation rate.

Table 4: Performance with different transition matrices

                      Mutation Rate
  Transition Matrix   5%     10%    15%    20%
  M1                  0.82   0.82   0.84   0.83
  M2                  0.80   0.81   0.82   0.82
  Blosum 62           0.72   0.74   0.75   0.74
  Pam 250             0.72   0.73   0.73   0.74
  Dayhoff             0.71   0.71   0.72   0.72
  Gonnet              0.79   0.78   0.79   0.80

5.2 Experiment on Protein Sequences
The proposed PSIGM is in fact a general framework which is not limited to identifying the affinity classes of inorganic material binding sequences. To demonstrate this, we use the SCOP protein data mentioned in Section 2.1. Instead of predicting the sequences' affinity classes, we consider a homology family prediction problem: for a specific family, can the proposed framework identify the sequences belonging to it among the remaining families? Correspondingly, we construct seven identification tasks from this dataset, where the sequences from one particular family are used as the positive set and the sequences from the remaining six families are used as the negative set. For example, when the sequences in family A are used as the positive set, the sequences from families B, C, D, E, F and G are used as the negative set. Two experiments are performed to demonstrate that: 1) PSIGM is a general framework which can also handle traditional protein sequence identification; and 2) a moderate setting of the mutation rate is conducive to improving the performance.

Performance comparison with baselines. It is worth noticing that, by handling the data in this way, the task acquires the characteristics of inorganic binding sequences to some extent. First, the number of labeled sequences is limited: as shown in Table 2, only around 20 sequences are given for each family. Second, since our negative set is a collection drawn from several different families, the sequences in it do not follow the same pattern. Table 5 shows the results of predicting the homology family compared with the baselines mentioned in Section 5.1. Each result shown here is the average over 10 runs of 5-fold cross-validation.

Table 5: Comparison with baselines at each class

  Family   PSIGM   LLGC   SVM     HMM     NN
  A        0.999   0.80   0.998   0.966   0.913
  B        0.965   0.81   0.959   0.944   0.947
  C        0.999   0.82   0.999   0.952   0.968
  D        0.999   0.82   0.999   0.969   0.999
  E        0.999   0.82   0.985   0.935   0.999
  F        0.978   0.82   0.946   0.952   0.857
  G        0.987   0.82   0.975   0.972   0.929

As the table shows, the proposed method outperforms the other methods on each protein family's prediction. Note that the accuracies of predicting the homology families are much higher than the accuracies of predicting the affinity classes of the inorganic material binding sequences. The reason behind this can be well explained by Figure 5, which shows the self-class similarity of each prediction task. Unlike the case shown in Figure 1, where the cross-class similarity is remarkably higher than the weak self-class similarity, in this protein sequence dataset all the self-class similarities are still higher than or close to the cross-class similarities.
As we know, the more the cross-class similarity exceeds the self-class similarity, the more difficult the two classes are to separate.

[Figure 5: Total similarity scores of the self-class and the cross-class for each prediction task based on Pam 250. (A) class A and non-A; (B) class B and non-B; (C) class C and non-C; (D) class D and non-D; (E) class E and non-E; (F) class F and non-F; (G) class G and non-G.]

Parameter sensitivity. The performance of PSIGM is influenced by the setting of the mutation rate used to generate the simulated sequences. To fully evaluate how the mutation rate affects the performance, we increase it from 0.05 to 0.3 with a step of 0.05 and report the accuracy of each family's prediction task in Figure 6. It is clear that most families show an increase in accuracy as the mutation rate rises, until a threshold of 15% is reached, after which the performance begins to decrease. This shows that the performance of PSIGM can be improved by a moderate setting of the mutation rate. In addition, we can infer that PSIGM is not only effective at identifying the affinity classes of inorganic material binding sequences, but also effective at predicting the homology families of traditional protein sequences.

[Figure 6: Parameter sensitivity experiment: AUC for each family's prediction task (families A-G) as the mutation rate varies from 0.05 to 0.3.]

6. RELATED WORK
As an emerging research topic, there is very little published work on identifying the affinity classes of inorganic material binding sequences that we can compare to. On the similar topic of identifying homologs of protein sequences, however, much research has been conducted. The Hidden Markov Model (HMM) is a widely-used probabilistic modeling method for protein homology detection [8, 20, 29]; it first builds a probabilistic model for each specific sequence family and then calculates the likelihood of an unknown sequence fitting each family. Another type of direct modeling method for protein homology detection is based on Neural Networks [32, 34], where the multilayer nature of neural networks allows them to discover non-linear, higher-order correlations among the sequences. As a widely-used machine learning algorithm, the Support Vector Machine (SVM) [13] has also been applied to protein homology detection problems. Mak et al. [21] proposed an SVM-based model named PairProSVM to automatically predict the sub-cellular locations of protein sequences. Karchin et al. [17] combined the HMM with the SVM to identify protein homologies. However, these methods are inappropriate in our case for two reasons. First, they require a training set consisting of sufficient labeled examples. Second, they try to learn the pattern from each class, which may not exist at this level.
Moreover, besides the differences from traditional classification approaches, the proposed framework also differs from the following work. 1) [25] proposed a method to generate a new transition matrix and make the classification based on it; the main difference between the proposed work and Oren's work is that they still consider the sequence classification problem from a global view. 2) [11] proposed a consensus maximization model to solve the problem of finding informative genes from multiple studies. Although the proposed method shares the intuition of Ge's work that a cluster should correspond to a particular class z if the majority of instances in the cluster belong to class z, their method aims at making reliable predictions by utilizing multiple sources, which is quite different from our topic.

7. CONCLUSION AND FUTURE WORK
Identifying the affinity classes of peptide sequences binding to a specific inorganic material is a challenging research problem with broad applications. In this paper, we proposed a novel framework, PSIGM, to solve this problem. We begin by providing a two-step simulated peptide sequence generation method to make the training set more comprehensive and diverse. Moreover, unlike traditional machine learning approaches used for protein sequence identification that try to find patterns at the class level, our framework partitions the sequences into smaller clusters and learns the patterns from them using a graph-based optimization model. Extensive experimental studies demonstrate that the proposed framework can effectively identify the affinity classes of inorganic material binding sequences.

In the future, we will use a cyclic model to validate and retrain our framework: first, the model selects sequences with the highest/lowest probabilities of binding to a target inorganic material as a candidate set; second, we would use quick experimental methods, such as QCM (Quartz Crystal Microbalance), to validate the candidate sequence set; finally, the validated sequences are used to retrain our model, and new candidate sequences are then selected from the sequence database based on their affinity, and so on. We believe that with this cyclic validation model, we can not only validate our model's effectiveness but also keep improving it through retraining.

8. REFERENCES
[1] Afiahayati and S. Hartati. Multiple sequence alignment using hidden Markov model with augmented set based on Blosum 80 and its influence on phylogenetic accuracy, 2010.
[2] D. P. Bertsekas. Nonlinear Programming, volume 43. Athena Scientific, 1995.
[3] E. T. Boder and K. D. Wittrup. Yeast surface display for screening combinatorial polypeptide libraries. Nature Biotechnology, 15(6):553–557, 1997.
[4] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe. Probing the interaction between peptides and metal oxides using point mutants of a TiO2-binding peptide. Langmuir, 24(13):6852–6857, 2008.
[5] K. Y. Dane, C. Gottstein, and P. S. Daugherty. Cell surface profiling with peptide libraries yields ligand arrays that classify breast tumor subtypes. Molecular Cancer Therapeutics, 8(5):1312–1318, 2009.
[6] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5(Suppl 3):345–352, 1978.
[7] S. Donatan, M. Sarikaya, C. Tamerler, and M. Urgen. Effect of solid surface charge on the binding behaviour of a metal-binding peptide. Journal of the Royal Society Interface, 2012.
[8] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[9] E. Estephan, C. Larroque, F. J. G. Cuisinier, Z. Bálint, and C. Gergely. Tailoring GaN semiconductor surfaces with biomolecules. The Journal of Physical Chemistry B, 112(29):8799–8805, 2008.
[10] E. Estephan, M.-b. Saab, C. Larroque, M. Martin, F. Olsson, S. Lourdudoss, and C. Gergely. Peptides for functionalization of InP semiconductors. Journal of Colloid and Interface Science, 337(2):358–363, 2009.
[11] L. Ge, N. Du, and A. Zhang. Finding informative genes from multiple microarray experiments: A graph-based consensus maximization model, 2011.
[12] G. H. Gonnet, M. A. Cohen, and S. A. Benner. Exhaustive matching of the entire protein sequence database. Science, 256(5062):1443–1445, 1992.
[13] M. J. Grimble. Adaptive systems for signal processing, communications and control. Control, 3, 2001.
[14] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.
[15] M. Hnilova, E. E. Oren, U. O. S. Seker, B. R. Wilson, S. Collino, J. S. Evans, C. Tamerler, and M. Sarikaya. Effect of molecular conformations on the adsorption behavior of gold-binding peptides. Langmuir, 24(21):12440–12445, 2008.
[16] M. K. Kalita, U. K. Nandal, A. Pattnaik, A. Sivalingam, G. Ramasamy, M. Kumar, G. P. S. Raghava, and D. Gupta. CyclinPred: A SVM-based method for predicting cyclin protein sequences. PLoS ONE, 3(7):12, 2008.
[17] R. Karchin, K. Karplus, and D. Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 18(1):147–159, 2002.
[18] E. Kasotakis, E. Mossou, L. Adler-Abramovich, E. P. Mitchell, V. T. Forsyth, E. Gazit, and A. Mitraki. Design of metal-binding sites onto self-assembled peptide fibrils. Biopolymers, 92(3):164–172, 2009.
[19] J. Kimling, M. Maier, B. Okenve, V. Kotaidis, H. Ballot, and A. Plech. Turkevich method for gold nanoparticle synthesis revisited. The Journal of Physical Chemistry B, 110(32):15700–15707, 2006.
[20] A. Kumar and L. Cowen. Augmented training of hidden Markov models to recognize remote homologs via simulated evolution. Bioinformatics, 25(13):1602–1608, 2009.
[21] M.-W. Mak, J. Guo, and S.-Y. Kung. PairProSVM: protein subcellular localization based on local pairwise profile alignment and SVM. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(3):416–422, 2008.
[22] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4):536–540, 1995.
[23] R. R. Naik, L. L. Brott, S. J. Clarson, and M. O. Stone. Silica-precipitating peptides isolated from a combinatorial phage display peptide library. Journal of Nanoscience and Nanotechnology, 2(1):95–100, 2002.
[24] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[25] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala. A novel knowledge-based approach to design inorganic-binding peptides.
Bioinformatics, 23(21):2816–2822, 2007.
[26] W. R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183:63–98, 1990.
[27] G. P. Smith. Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science, 228(4705):1315–1317, 1985.
[28] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[29] N. Terrapon, O. Gascuel, E. Marechal, and L. Brehelin. Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum. BMC Bioinformatics, 13(1):67, 2012.
[30] A. Vila Verde, P. J. Beltramo, and J. K. Maranas. Adsorption of homopolypeptides on gold investigated using atomistic molecular dynamics. Langmuir, 27(10):5918–5926, 2011.
[31] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[32] D. Wang, N. K. Lee, T. S. Dillon, and N. J. Hoogenraad. Protein sequences classification using radial basis function (RBF) neural networks, 2002.
[33] M. Wistrand and E. L. L. Sonnhammer. Improving profile HMM discrimination by adapting transition probabilities. Journal of Molecular Biology, 338(4):847–854, 2004.
[34] C. H. Wu, S. Zhao, H. L. Chen, C. J. Lo, and J. McLarty. Motif identification neural design for rapid and sensitive protein family search. Computer Applications in the Biosciences (CABIOS), 12(2):109–118, 1996.
[35] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems 16, pages 595–602, 2003.
[36] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2007.