Identifying Affinity Classes of Inorganic Materials Binding Sequences via a Graph-based Model

Nan Du, Marc R. Knecht, Mark T. Swihart, Zhenghua Tang, Tiffany R. Walsh and Aidong Zhang

• Nan Du and Aidong Zhang are with the Computer Science and Engineering Department, University at Buffalo (SUNY), Buffalo, NY 14260. E-mail: nandu,azhang@buffalo.edu
• Marc R. Knecht and Zhenghua Tang are with the Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146. E-mail: knecht,z.tang@miami.edu
• Mark T. Swihart is with the Department of Chemical and Biological Engineering, University at Buffalo (SUNY), Buffalo, NY 14260. E-mail: swihart@buffalo.edu
• Tiffany R. Walsh is with the Institute for Frontier Materials, Deakin University, Geelong, Vic. 3216, Australia. E-mail: tiffany.walsh@deakin.edu.au

Abstract: Rapid advances in bionanotechnology have recently generated growing interest in identifying peptides that bind to inorganic materials and classifying them based on their inorganic material affinities. However, some distinct characteristics of inorganic materials binding sequence data limit the performance of many widely-used classification methods when applied to this problem. In this paper, we propose a novel framework to predict the affinity classes of peptide sequences with respect to an associated inorganic material. We first generate a large set of simulated peptide sequences based on an amino acid transition matrix tailored for the specific inorganic material. Then the probability of test sequences belonging to a specific affinity class is calculated by minimizing an objective function. The objective function is minimized through iterative propagation of probability estimates among sequences and sequence clusters. Results of computational experiments on two real inorganic material binding sequence datasets show that the proposed framework is highly effective for identifying the affinity classes of inorganic material binding sequences. Moreover, experiments on the SCOP (structural classification of proteins) dataset show that the proposed framework is general and can be applied to traditional protein sequences.

Index Terms: inorganic material, peptide sequences, classification

1 INTRODUCTION

Over the past decade, many studies have analyzed peptide sequences with affinity to biological entities such as enzymes, cells, viruses, lipids and proteins. Recently, interest in identifying and classifying peptides that interact specifically with inorganic materials has grown. These inorganic materials binding peptide sequences have been identified from biocombinatorial peptide libraries using phage display [1], cell surface display [2], and yeast display [3]. In particular, numerous studies have reported peptide sequences that bind to inorganic materials such as noble metals (gold, silver, platinum) [4], [5], [6], [7], [8], semiconductors (zinc sulfide, cadmium sulfide) [9], [10], [11], [12], and metal oxides (silica, titanium and magnetite) [13], [14], [15], [16], [17], [18], [19], which are of great interest for applications in technology and medicine. Inorganic material binding peptide sequences, which are usually 7-14 amino acids long, are differentiated from other polypeptides by their specific molecular recognition properties for targeted inorganic material surfaces [20]. Effectively identifying the affinity class, which indicates the binding strength of a specific sequence with respect to the target inorganic material, is crucial for designing novel peptides [21]. The binding affinity of a peptide to an inorganic surface is the result of a complex interplay between the binding strength of its individual residues and its conformation.
The binding strength of a sequence for a specific material is usually measured by the adsorption free energy (∆G_ads), which is then used to assign each sequence to a weak, medium, or strong affinity class. Despite extensive recent reports on combinatorially selected inorganic binding peptides and their bionanotechnological utility as synthesizers and molecular linkers [22], [23], [20], there is still limited knowledge about the relationships between binding peptide sequences and their associated inorganic materials. Therefore, by using machine learning to suggest sequence affinity classes, we can predict new sequences having the desired affinity for specific inorganic materials without running new large-scale screenings via phage display.

Various approaches have been used or developed for recognizing both close and distant homologs of given protein sequences, which is one of the central themes in bioinformatics. Most of this work is based on established machine learning models such as the Hidden Markov Model (HMM) [24], [25], the Neural Network (NN) [26], [27] and the Support Vector Machine (SVM) [28]. However, identifying the affinity classes of inorganic material binding peptide sequences poses distinct challenges that are rarely faced in protein sequence identification. These challenges markedly limit the performance of the models mentioned above, despite their success on other types of protein sequence detection.

Challenge I: The number of labeled samples is usually insufficient. As an emerging topic, peptide sequences that bind solid inorganic materials have been developed only in the last decade, and they are not as well studied as protein sequences, whose analysis has a much longer history.
For example, unlike protein sequences analysis that has numerous large-scale public datasets such as GPCR [29] or SCOP [30], no complete result of large-scale screening experiments has been made publicly available for the inorganic material binding sequences. Therefore, unlike protein sequence research which has many public large databases and publicly available experiment results, the data about inorganic binding peptide sequences are usually quite few. Most existing protein sequence classification approaches require a large set of labeled samples to train an accurate model. However, labeling the affinity classes for a large number of inorganic material binding sequences is very time-consuming and expensive. Thus it is usually infeasible. If only a limited number of labeled samples are available for the model training, the learned model may suffer from the problems of over-fitting or under-fitting. As a machine learning method which has received much attention in the past decade, Semi-Supervised Learning (SSL) [31] is good at handling the lack of sufficient labeled training data problem. However, the utility of this method may be markedly limited due to the next challenge. Challenge II: The peptide sequences belonging to the same affinity class may be very dissimilar. Usually, the protein sequences which belong to the same family follow some apparent patterns, in other words, they are similar to each other by some views. However, the “similarity” between inorganic material binding peptide sequences from the same affinity class may be not so apparent. In some cases, the intra-similarity which measures the similarity of all sequences inside the same class is even less than the inter-similarity which measures the similarity among the sequences from different classes. This phenomenon also means some peptide sequences belonging to the same class may be dissimilar with each other, at least by the current knowledge. 
This observation reflects the fact that the inorganic material binding sequences do not satisfy the smoothness assumption at the class level which is generally assumed in both supervised learning and semi-supervised learning. In light of these challenges for inorganic material binding sequence affinity classes identification, we propose a novel framework which includes two parts. First, to tackle the insufficient data challenge, we augment the training sequence set with simulated sequences which are generated based on a new amino acid transition matrix. By using the simulated sequences, we incorporate not only the prior phylogenetic knowledge but also the specific sequence patterns responsible for the target inorganic material into the training data. Second, instead of searching the patterns globally from the peptide sequences belonging to the same affinity class, we separate the sequences into smaller clusters and try to learn the patterns from them locally via a graphbased optimization model. Intuitively, since there are few obvious patterns that could be found at the class level, we search for them at the smaller cluster level. Based on the two strategies mentioned above, we propose a novel model that combines the sequence simulation and cluster-based sequence affinity identification. The initial idea was published in [32]. This paper extends the original idea to formulate a solid method and provide more supportive, comprehensive experiments. The main process of the proposed method is shown in Fig. 1, where we first use the labeled sequences as seeds to simulate more sequences, and then all the labeled and simulated sequences are used to train our graph-based optimization model which is effective at identifying the sequences’ affinity classes. We will discuss the proposed method in detail in the following sections. 
[Figure 1: the labeled Strong Set and Weak Set pass through Peptide Sequences Simulation, which adds a Simulated Set; all three sets feed the Graph-based Optimization Model.]
Fig. 1: Main process of the proposed method.

In this paper, we make the following contributions:
• We introduce the distinct challenges associated with identifying affinity classes for inorganic material binding sequences.
• We propose a novel framework which can effectively predict the affinity classes of the inorganic material binding sequences and provide an efficient iterative algorithm to find the optimal solution of the proposed objective function.
• Moreover, our framework is a general framework which is also effective for identifying the classes of traditional protein sequences. The extensive computational experiments show that the proposed method outperforms many other baseline methods.

The rest of the paper is organized as follows. In Section 2, we explain the relationship between our work and previous related work. In Section 3, we describe the datasets used in this paper and the setting of our problem. The peptide sequence simulation method and the graph-based optimization model are presented in Section 4 and Section 5, respectively. Extensive experimental results are shown in Section 6. The conclusion and future work are presented in Section 7.

2 RELATED WORK

As this is an emerging research topic, there is very little published work on identifying the affinity classes of inorganic material binding sequences that we can compare to. But as a similar topic, much research has been devoted to the question of identifying the homologs of protein sequences. HMM is a widely-used probability modeling method for protein homology detection [24], [25], [33] which first generates a probability for each specific sequence family and then calculates the likelihood of an unknown sequence fitting each family.
Another type of direct modeling methods for protein homology detection is based on Neural Network [26], [27], where the multilayer nature of neural network allows them to discover non-linear higher order correlations among the sequences. As a widely-used machine learning algorithm, SVM [28] has been also applied to protein homology detection problems. Mak et al. [34] proposed a SVM based model named PairProSVM to automatically predict the sub-cellular locations of proteins sequences. Karchin et al. [29] combined the HMM with the SVM to identify the protein homologies. Tian et al. proposed a weighted version of SVM to weaken the influence of outliers for improving protein sub-cellular localization predictions [35]. However, these methods are inappropriate in our case for two reasons. First, they ask for a training set consisting of sufficient labeled examples. Second, they try to learn the pattern from each class which may not exist at this level. Moreover, besides the differences with the traditional classification approaches, the proposed framework is also different from the following work: 1) Oren et al. [21] has proposed a method to generate a new transition matrix and make the classification based on it. The first difference between the work presented here and Oren’s work is that they only consider the sequence classification problem via learning the patterns from the entire sequence set belonging to the same affinity class. Second, the newly generated transition matrix in [21] was only used to calculate the pairwise distance between sequences. In our proposed method, the newly generated matrix is also used to generate the simulated sequences. 2) Ge et al. [36] proposed a consensus maximization model to solve the problem of finding informative genes from multiple studies. 
Although the proposed method shares the intuition of Ge's work, in which a cluster should correspond to a particular class z if the majority of instances in this cluster belong to class z, their work aims to make reliable predictions by utilizing multiple experimental results, which is quite different from our setting. In our case, we only have the raw dataset (i.e. labeled inorganic material sequences) rather than multiple experimental results.

3 DATASETS AND PROBLEM DEFINITION

In this section, we describe the datasets used in this paper and present the problem definition.

3.1 Datasets

We have used three datasets to demonstrate the proposed method's performance. The first dataset is from Oren et al. [21]. This dataset consists of a total of 25 quartz (rhombohedral silica, SiO2) binding peptide sequences which were identified using phage-display techniques. These peptide sequences are further classified into two classes based on their affinity strength: strong and weak binder classes, which contain 10 and 15 sequences, respectively. To better demonstrate the problem and illustrate the proposed method in the rest of the paper, we extract a sample set which includes two affinity classes from this dataset and show it in Table 1.

TABLE 1: Sample set of peptide sequences data

Strong Class                 Weak Class
Name    Sequence             Name    Sequence
DS202   RLNPPSQMDPPF         DS201   MEGQYKSNLLFT
DS189   QTWPPPLWFSTS         DS191   VAPRVQNLHFGA
...     ...                  ...     ...

The second inorganic material binding peptide sequence dataset is from our systematic study of peptide binding on gold (Au) [37], combined with the previous data from Wang et al. [38], to give a total of 32 peptide sequences. Sequences in our sequence set following the pattern XHXHXHX, where X is an arbitrary amino acid, are from Wang et al. [38]. Since any peptide sequence containing cysteine (i.e. amino acid C) can bind strongly onto the gold surface, without loss of generality, sequences that contain cysteine are not considered.
Using measured adsorption free energies (∆G, kJ/mol) for all the sequences, we drew the boundary between strong and weak binding sequences such that the weak class has ∆G > −25 kJ/mol and the strong class has ∆G ≤ −25 kJ/mol. Note that Hnilova et al. [39] have shown that the sequence 'TLRRWRDRRILN' (AUBP30) has weak binding ability to gold. Although they did not report its free energy, it is very likely to reside in the weak set based on the qualitative binding analysis. All the sequences from the strong and weak classes are listed in Table 2.

TABLE 2: Summary of the gold binding peptide sequences (∆G in kJ/mol)

Strong                       Weak
Sequence          ∆G         Sequence        ∆G
WAGAKRLVLRRE      -37.6      HHHHHHH         -23.9
MHGKTQATSGTIQS    -37.6      MHMHMHM         -23.1
LKAHLPPSRLPS      -36.6      RHRHRHR         -22.9
WALRRSIRRQSY      -36.4      YHYHYHY         -22.8
TGTSVLIATPYV      -35.7      WHWHWHW         -22.3
EQLGVRKELRGV      -35.3      KHKHKHK         -22.3
RMRMKMK           -35.0      AHAHAHA         -21.6
PPPWLPYMPPWS      -35.0      GHGHGHG         -21.6
AYSSGAPPMPPF      -31.8      QHQHQHQ         -20.5
TGIFKSARAMRN      -31.6      IHIHIHI         -20.4
KHKHWHW           -31.3      NHNHNHN         -19.9
TSNAVHPTLRHL      -30.3      VHVHVHV         -19.9
                             SHSHSHS         -18.9
                             THTHTHT         -18.7
                             EHEHEHE         -18.6
                             DHDHDHD         -18.2
                             LHLHLHL         -16.2
                             FHFHFHF         -15.4
                             PHPHPHP         -14.1
                             TLRRWRDRRILN    -

It is worth noticing that these datasets illustrate well the two challenges mentioned above. First, there are only around ten sequences available for each affinity class, which is very few in comparison to the data size used for classifier training in protein sequence analysis, where hundreds or thousands of sequences are usually involved [33], [25]. Second, the lack of an obvious pattern in these datasets is illustrated in Fig. 2 and Fig. 3. In these figures, based on the total similarity scores (TSS) defined in [21], we first calculate the total similarity of sequences from the same class A via the following equation:

$$TSS_{A} = \frac{1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \sum_{j=1}^{N_A} PSS_{ij}\,(1 - \delta_{ij}), \qquad (1)$$

where $\delta$ is the usual Kronecker delta function in which $\delta_{ij} = 1$ when $i = j$ and 0 otherwise, $N_A$ is the total number of sequences in set A, and $PSS_{ij}$ is the similarity between the $i$-th sequence and the $j$-th sequence of set A calculated via the Needleman-Wunsch algorithm [40]. For the sake of simplicity, we call it self-class similarity for short. Moreover, the TSS of the sequences across the classes A and B is calculated as:

$$TSS_{A-B} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} PSS_{ij}, \qquad (2)$$

where $N_B$ is the total number of sequences in set B. Correspondingly, the total similarity for sequences across the classes is named cross-class similarity for short. To calculate $PSS_{ij}$, we need to provide a transition matrix on which the optimal scoring alignment is made. Without loss of generality, we have used both Pam 250 [41] (Fig. 2(a) and Fig. 3(a)) and Blosum 62 [42] (Fig. 2(b) and Fig. 3(b)) as the transition matrices.

[Figure 2: bar charts of Similarity Score for the Strong/Strong, Weak/Weak and Weak/Strong class pairs; panels (a) Pam 250 and (b) Blosum 62.]
Fig. 2: Total similarity scores of the self-class (the strong class and weak class) and the cross-class for quartz binders.

[Figure 3: bar charts of Similarity Score for the Strong/Strong, Weak/Weak and Weak/Strong class pairs; panels (a) Pam 250 and (b) Blosum 62.]
Fig. 3: Total similarity scores of the self-class (the strong class and weak class) and the cross-class for gold binders.

Fig. 2 shows that the sequences belonging to the weak class have very low or no significant similarities. Their self-class similarity is much lower than the cross-class similarity. Similarly, as shown in Fig. 3, the similarities of the sequences belonging to the strong gold binding set are very close to the cross-class similarity. Due to this phenomenon, the traditional classification approaches cannot readily identify an effective pattern.

To demonstrate that the proposed work is a general framework which is also effective at predicting the homology families of traditional protein sequences, a third dataset, the Structural Classification of Proteins (SCOP) dataset from [43], is also used. In addition, we employ the approach developed by Anoop Kumar and Lenore Cowen [25] to pick the SCOP families, where the acquired proteins are further grouped into seven families (i.e. A, B, C, D, E, F and G). The size of each family and the lengths of its shortest and longest sequences are shown in Table 3, and the data we used are available at http://www.acsu.buffalo.edu/~nandu/InorganicSeq/.

TABLE 3: Summary of the protein sequence data

Family    Number of Seq   Length of Shortest Seq   Length of Longest Seq
Class A   23              160                      177
Class B   23              72                       136
Class C   16              224                      260
Class D   19              120                      144
Class E   18              324                      429
Class F   14              131                      221
Class G   20              45                       83

3.2 Problem Definition

We consider our problem as identifying the affinity classes for the test inorganic material binding peptide sequences based on the training sequences. We start from a pool of $l+u$ peptide sequences $S_{source} = \{s_1, ..., s_l, ..., s_{l+u}\}$ where each peptide sequence is represented as a series of ordered amino acids. To better understand what a peptide sequence looks like, let us take a peptide sequence from Table 1 as an example. DS202: RLNPPSQMDPPF is a peptide sequence composed of twelve ordered amino acids where each letter denotes one of the 20 standard amino acids. We also assume that in this sequence pool $S_{source}$, the first $l$ sequences $s_i \in S_{source}$ ($1 \le i \le l$) are labeled based on their affinity to the target inorganic material (e.g. weak or strong) and together are named L, and the remaining $u$ sequences $s_i \in S_{source}$ ($l+1 \le i \le l+u$) are unlabeled and together are named U, where $L \cup U = S_{source}$.
Our goal is to predict the labels of the peptide sequences in U, using the training sequences in L.

4 PEPTIDE SEQUENCE SIMULATION

As mentioned above, the lack of labeled data is a general problem when working on inorganic material binding sequences. One of the most successful methods to date for recognizing protein sequences based on evolutionary knowledge is the use of simulated sequences. Several studies [25], [44] have shown that augmenting the training set with simulated sequences generated from an amino acid transition matrix such as Blosum 62 or Pam 250 can improve homolog identification performance. One can reasonably expect that a set of peptides generated by directed evolution to recognize a given solid material will have similar sequences [21]. Although these transition matrices have been shown to be effective and are widely accepted, we cannot directly apply them to generate simulated sequences. They are derived from large-scale natural protein sequence databases rather than from the target inorganic material binding sequences, so they cannot represent the target inorganic material well. Thus, we use a traditional transition matrix only as a seed, and from it we generate a new transition matrix which not only maintains the prior knowledge from proteins but also captures the significant knowledge inside the target inorganic material.

Here, aiming to provide a more comprehensive and diverse view for our model, we use a two-step simulated sequence generation approach to enlarge our training set. First, we generate a new transition matrix which can better measure the amino acid transition relations for the target inorganic material. Specifically, we use a traditional transition matrix (e.g. Blosum 62 or Pam 250) as a seed matrix M, a 20 × 20 symmetric matrix in which each row and column is labeled by a single letter representing an amino acid.
Then we greedily and iteratively mutate each profile $m_{ij}$, which is an integer coefficient between two amino acids in the seed transition matrix, to maximize the difference between the self-class similarity of class A (i.e. $TSS_A$) and the cross-class similarity between classes A and B (i.e. $TSS_{A-B}$) [21], which is designed to enlarge the gap among the affinity classes. Second, after the new transition matrix $M^*$ is constructed, the simulated sequences are generated based on the labeled sequence set L. When a sequence is selected as a seed, the simulated sequence is generated by randomly selecting a position from it and replacing the amino acid $i$ in the corresponding position with a new amino acid $j$ with a probability defined in Eq. (3):

$$P_{ij} = \frac{M^*_{ij}}{\sum_{j=1}^{20} M^*_{ij}}. \qquad (3)$$

Note that all the probabilities are calculated after normalization of the values in $M^*$ into a positive value space (e.g. 0-1). As an example, Fig. 4 shows the process of mutating an amino acid in the selected position based on the mutation probability, and we keep replacing the amino acids in the target sequence until a desired mutation threshold $t$ is reached [25].

[Figure 4: the seed sequence RLNPPSQMDPPF, in which the 8-th amino acid M (Methionine) is selected to be mutated; a chart of the mutation probability of M to each of the 20 amino acids; and the resulting sequence RLNPPSQWDPPF, in which W (Tryptophan) is selected to replace M.]
Fig. 4: An example of mutating an amino acid in the selected position. When a specific position (e.g. the 8-th) is selected from the target sequence, the corresponding amino acid M is mutated to another amino acid W based on the mutation probability.

By this two-step method, we can incorporate not only the prior phylogenetic knowledge but also the specific amino acid pattern responsible for binding to the target inorganic material into the data.
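The replacement step of Eq. (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, `m_star` stands for a transition matrix already normalized to positive values, and the mutation threshold t is interpreted here as the fraction of positions to mutate, which the text does not specify.

```python
import random

# One-letter codes of the 20 standard amino acids (row/column order is an assumption).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def mutation_probabilities(m_star, i):
    """Normalize row i of the positive-valued matrix m_star, as in Eq. (3)."""
    row = m_star[i]
    total = sum(row)
    return [value / total for value in row]

def simulate(seed, m_star, t, rng):
    """Mutate a fraction t of the positions of `seed` (assumed meaning of t)."""
    seq = list(seed)
    n_mut = max(1, round(t * len(seq)))
    for pos in rng.sample(range(len(seq)), n_mut):
        i = AMINO_ACIDS.index(seq[pos])
        probs = mutation_probabilities(m_star, i)
        # The drawn residue may equal the original one if M*_ii is positive.
        seq[pos] = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return "".join(seq)

# Toy usage with a uniform matrix standing in for the learned M*.
uniform = [[1.0] * 20 for _ in range(20)]
print(simulate("RLNPPSQMDPPF", uniform, 0.25, random.Random(0)))
```

With a real matrix, rows whose entries favor chemically similar residues would make conservative substitutions more likely than with the uniform stand-in used here.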
Accordingly, based on this peptide sequence simulation method, for each labeled source sequence $s_i \in S_{source}$ ($1 \le i \le l$), we generate $m$ mutated sequences, which together form a simulated peptide sequence set $S_{simulated} = \{s^*_1, ..., s^*_{l \times m}\}$. Finally, we define the sequence pool as $S = S_{source} \cup S_{simulated}$, which includes the source peptide sequences and the simulated sequences. We will show in the experiments that the simulated sequences effectively improve the performance.

5 GRAPH-BASED OPTIMIZATION MODEL

To handle the challenge that obvious patterns are hard to find at the class level, we propose a graph-based optimization model to estimate the conditional probability of the test sequences belonging to each affinity class. Our method begins by mapping sequences from the sequence pool into nodes of a sequence-to-sequence graph (Section 5.1), where the relationships among sequences are better measured and many efficient clustering methods are available. Instead of searching for the patterns at the class level, we partition the sequences into clusters where we believe the significant patterns exist, and an objective function (Section 5.2) is proposed to learn the conditional probability of each sequence belonging to a specific affinity class. Finally, we present an efficient iterative algorithm to obtain the optimal value of the objective function (Section 5.3).

5.1 Mapping Sequences into Nodes of a Graph

We map all the sequences into a graph where each node denotes a peptide sequence and each edge denotes the pairwise similarity between two sequences. This graph offers a good understanding of the pairwise relationships among peptide sequences and is easily partitioned into clusters. The pairwise similarity among sequences is calculated using the Needleman-Wunsch algorithm [40] after local alignment between each sequence pair using the Smith-Waterman algorithm [45].
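As a rough illustration of the edge weights and of the total similarity scores in Eqs. (1) and (2), the sketch below computes a Needleman-Wunsch global alignment score and averages it within and across sequence sets. It is a simplification, not the authors' pipeline: it skips the Smith-Waterman local alignment step, uses a hypothetical identity scoring function in place of Pam 250 or Blosum 62, and the linear gap penalty of -4 is an arbitrary assumption.

```python
def nw_score(a, b, score, gap=-4):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                           prev[j] + gap,                            # gap in b
                           cur[j - 1] + gap))                        # gap in a
        prev = cur
    return prev[-1]

def tss_self(seqs, score):
    """Self-class similarity, Eq. (1): mean pairwise score over pairs with i != j."""
    n = len(seqs)
    total = sum(nw_score(seqs[i], seqs[j], score)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def tss_cross(set_a, set_b, score):
    """Cross-class similarity, Eq. (2): mean score over all pairs (a, b)."""
    total = sum(nw_score(x, y, score) for x in set_a for y in set_b)
    return total / (len(set_a) * len(set_b))

def identity_score(x, y):
    """Toy stand-in for a transition matrix lookup (assumption)."""
    return 1 if x == y else -1

strong = ["RLNPPSQMDPPF", "QTWPPPLWFSTS"]
weak = ["MEGQYKSNLLFT", "VAPRVQNLHFGA"]
print(tss_self(strong, identity_score), tss_cross(strong, weak, identity_score))
```

Replacing `identity_score` with a lookup into a 20 × 20 matrix gives the matrix-dependent scores that the figures in Section 3.1 compare.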
Specifically, we assign a cluster cj to a specific class z if all the source sequences in it belong to the same class z. Cluster Similarity Matrix: In addition, a V × V matrix W denotes the similarity among the sequence clusters, where wij is the similarity between the sequence cluster ci and cj . Specifically, the pairwise cluster similarity is calculated using T SSA−B between the sets of sequence binders A and B. Now we formulate the affinity class identification problem as the following objective function: min J(S, C) = S,C N ∑ V V ∑ V ∑ ∑ 2 min ( bij ∥s⃗i. − c⃗j. ∥ + α wij ∥c⃗i. − c⃗j. ∥2 5.2 The Objective Function The key idea of our approach is that, instead of searching for patterns at the class level, we narrow down the affinity class prediction problem from the class level to the cluster level. We believe that, if the patterns are obscure, shifting the focus from the class level to the cluster level, we can find clearer patterns. Before proceeding further, we introduce the notation that will be used in the following discussion: dij denotes the ij-th entry in the matrix D, d⃗i. and d⃗.j denote vectors of i-th row and j-th column of matrix D, respectively. Belongingness Matrix: We denote the belongingness matrix B as an N × V matrix where N is the number of all sequences (including source sequences, simulated sequences and test sequences) and V is the number of clusters detected from the clustering result on N. Note that, each entry in the belongingness matrix corresponds to the probability of a peptide sequence belonging to a cluster. If peptide sequence si is assigned to a cluster cj , then bij = 1 and 0 otherwise. To construct the belongingness matrix B, we have used spectral clustering [46] which has proven effective for solving the graph partitioning problems, to partition the sequence-to-sequence graph that we have constructed in the previous section into V clusters. 
Sequence Probability Matrix: The conditional probability of peptide sequence si belonging to class z (siz = P̂ (y = z|si )) is estimated with an N × D matrix S where D is the number of affinity classes we want to classify. Cluster Probability Matrix: The conditional probability of cluster cj belonging to class z (cjz = P̂ (y = z|cj )) is estimated as a V × D matrix C, where cjz represents the probability of a cluster cj belonging to a class z. Sequence Labeled Matrix: In the labeled sequence set L the sequences have the initial class labels which are represented by an N × D matrix F, where fiz = 1 if we know sequence si belonging to class z in advance, and 0 otherwise. Cluster Labeled Matrix: We may also have prior information of a cluster belonging to a specific class. We use a V × D matrix Y to define initial labels for clusters where yjz = 1 denotes that we are confident that a cluster cj belongs to a specific class z, and 0 otherwise. S,C +β i=1 j=1 N ∑ i=1 j=1 hi ∥s⃗i. − f⃗i. ∥2 + γ i=1 V ∑ kj ∥c⃗j. − y⃗j. ∥2 ) j=1 subject to the following conditions: D ∑ siz = 1, siz ≥ 0 z=1 D ∑ cjz = 1, cjz ≥ 0, z=1 (4) 2 where ∥.∥ the L2 norm. The first term in ∑Nindicates ∑V Eq. (4), b ∥s⃗i. − c⃗j. ∥2 , ensures that a seij i=1 j=1 quence should have similar probability vector as the cluster it belongs to, namely, cluster cj should correspond to class z if the majority of sequences in this cluster belong to class z. Intuitively, the higher the deviation, the larger penalty would get. The second term ∑V ∑V α i=1 j=1 wij ∥c⃗i. − c⃗j. ∥2 corresponds to the intuition that the clusters which are close to each other should have similar class, and α denotes the confidence over this source of information. From the view of graph theory, this term is propagating the class information among ∑N the clusters. The third term β i=1 hi ∥s⃗i. − f⃗i. 
∥2 applies the constraint that the predictions should not deviate too much from the corresponding sequence ground-truth and β is the parameter that expresses the confidence of our belief on the prior knowledge of sequences. Similar∑V ly, the last term γ j=1 kj ∥c⃗j. − y⃗j. ∥2 is the loss function penalizing the deviation between predictions and our prior knowledge of clusters, and γ is the parameter that expresses the confidence of our belief on the prior knowledge of clusters. 5.3 Iterative Update Algorithm It is easy to prove that the objective function Eq. (4) is convex which makes it possible to find a global optimal solution. To obtain the optimal solution for matrices S and C, we propose to solve Eq. (4) using the the block coordinate descent method [47]. At iteration t, fixing the value of s⃗i. , we can take the partial derivative to c⃗tj. in Eq. (4) and set it to 0, and then obtain the update Formula Eq. (5): 7 ∑n c⃗tj. = ∑ n ⃗ t−1 i=1 bij sj. i=1 bij + γkj y⃗j. e + γkj + α(i⃗j. − l⃗j. ) . (5) Accordingly, the update can be represented as a matrix ∑N form as Eq. (6), where Dv = diag{( i=1 bij )} is the ∑D normalization factor, Kv = diag{( z=1 yjz ))} indicates the constraints for the clusters and diag denotes the e is the diagonal elements of a matrix. Furthermore, L 1 − 12 − 2 e = Dw W Dw normalized laplacian [48] defined as L , where Dw is the diagonal degree matrix of W. e −1 (AT S t−1 + γKv Y ). (6) C t = (Dv + γKv + α(I − L)) The Hessian matrix with respect to C is a diagonal ∑n e matrix with entries i=1 bij + α > 0 and I − L. The diagonal matrix is positive definite and it is easy to e is also a semi-positive definite. Thus, the prove that I − L hessian matrix is a positive definite matrix, which means derivative for C gives the unique minimum of Eq. (4). Similarly, we can obtain the update formula Eq. (7) with respect to s⃗i . through fixing c⃗tj. . ∑v s⃗ti. = j=1 ∑ v bij c⃗tj. + βhi f⃗i. j=1 bij + βhi . (7) Also, the matrix form of Eq. 
(7) is as follows:

$$
S^{t} = (D_n + \beta H_n)^{-1}(B C^{t} + \beta H_n F), \tag{8}
$$

where $D_n = \mathrm{diag}\{\sum_{j=1}^{V} b_{ij}\}$ is the normalization factor and $H_n = \mathrm{diag}\{\sum_{z=1}^{D} f_{iz}\}$ indicates the constraints for the sequences. The Hessian matrix is also a diagonal matrix with diagonal elements $\sum_{j=1}^{V} b_{ij} > 0$, which means that setting the derivative with respect to $S$ to zero gives the unique minimum of Eq. (4).

To sum up, the pseudo-code for iteratively solving Eq. (4) by the block coordinate descent method is shown as Algorithm 1, where $\epsilon$ is a convergence threshold. Because the proposed method is based on a graph model, we name our approach Peptide Sequences Identification Graph Model (PSIGM).

This iterative process is a procedure of information propagation among the clusters. Fig. 5 shows an example. In each iteration step, each cluster estimates its class based on its members' classes while retaining its initial class $Y$ (Fig. 5-(A)). After all the clusters have received the label information (Fig. 5-(B)), they propagate it to their neighboring clusters based on the smoothness assumption (Fig. 5-(C)). After the clusters have received the information from their neighbors, they pass it back to their member nodes, while the nodes retain their initial classes (Fig. 5-(D)). This process continues until convergence.

Algorithm 1 The PSIGM Algorithm
Input: Belongingness matrix $B_{N \times V}$, sequence labeled matrix $F_{N \times D}$, cluster labeled matrix $Y_{V \times D}$, cluster similarity matrix $W_{V \times V}$, and parameters $\alpha$, $\beta$, $\gamma$, $\epsilon$
Output: Estimated sequence probability matrix $S_{N \times D}$
1: Initialize $S^{0}$ randomly;
2: $t \leftarrow 1$;
3: begin
4: repeat
5: Update $C^{t}$ using Eq. (6);
6: Update $S^{t}$ using Eq. (8);
7: $t \leftarrow t + 1$;
8: until $\|S^{t} - S^{t-1}\| \le \epsilon$
9: Output $S^{t}$;
10: end

5.4 Time Complexity

The time complexity of the proposed algorithm is composed of two parts: updating the cluster probability matrix $C$ and updating the sequence probability matrix $S$.
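The two updates named above, Eq. (6) for C and Eq. (8) for S, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `psigm_iterate`, the `max_iter` safeguard, the Frobenius norm in the stopping test, and the use of `B` for the belongingness matrix in both updates are choices made here. The sketch applies the closed-form updates directly and does not separately enforce the simplex constraints of Eq. (4); it also assumes every cluster has at least one member and every cluster has a nonzero degree in W.

```python
import numpy as np


def psigm_iterate(B, W, F, Y, alpha=2.0, beta=10.0, gamma=2.0,
                  eps=1e-4, max_iter=100, seed=0):
    """Alternate the matrix updates of Eq. (6) and Eq. (8) until S stabilizes.

    B: (N, V) belongingness matrix, W: (V, V) cluster similarity matrix,
    F: (N, D) sequence labels, Y: (V, D) cluster labels.
    Returns (S, C, number_of_iterations).
    """
    N, V = B.shape
    D = F.shape[1]

    # Random initial S with rows normalized to sum to one (assumption).
    rng = np.random.default_rng(seed)
    S = rng.random((N, D))
    S /= S.sum(axis=1, keepdims=True)

    # Diagonal factors named in the text.
    Dv = np.diag(B.sum(axis=0))   # cluster normalization
    Dn = np.diag(B.sum(axis=1))   # sequence normalization
    Kv = np.diag(Y.sum(axis=1))   # which clusters carry a prior label
    Hn = np.diag(F.sum(axis=1))   # which sequences carry a label

    # Normalized similarity D_w^{-1/2} W D_w^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_tilde = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    lhs_C = Dv + gamma * Kv + alpha * (np.eye(V) - L_tilde)
    lhs_S = Dn + beta * Hn

    C = np.zeros((V, D))
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Eq. (6): update cluster probabilities with S fixed.
        C = np.linalg.solve(lhs_C, B.T @ S + gamma * Kv @ Y)
        # Eq. (8): update sequence probabilities with C fixed.
        S_new = np.linalg.solve(lhs_S, B @ C + beta * Hn @ F)
        change = np.linalg.norm(S_new - S)
        S = S_new
        if change <= eps:
            break
    return S, C, iterations
```

Solving the two linear systems with `np.linalg.solve` avoids forming the explicit inverses that appear in Eq. (6) and Eq. (8); since both left-hand matrices are fixed across iterations, they are built once outside the loop.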
For updating the matrix $C$, the time complexity is $O(VN^2D + V^3 + V^2D)$, where $N$ is the size of the peptide sequence pool, $V$ is the number of clusters, and $D$ is the number of affinity classes. Because in our case the sequence set is usually much larger than the number of clusters, the time complexity for the first step is $O(VN^2D)$. For updating the matrix $S$, the time complexity is $O(N^2D + NVD)$, thus $O(N^2D)$. Therefore, the overall time complexity of one iteration is $O(VN^2D)$. If the number of iterations is $k$, the time complexity of the whole algorithm is $O(kVN^2D)$. In experiments, we observe that $k$ is usually between 8 and 20.

6 EXPERIMENTS

In the following, we first conduct experiments on both the quartz and gold binding sequence datasets to show that PSIGM is effective for identifying the binding affinity classes of inorganic material binding sequences, and then conduct experiments on the SCOP protein sequence dataset to show that PSIGM is a general framework which also works effectively on other kinds of sequence sets. Because most of our baselines are designed as binary classifiers, for the sake of simplicity, in the following experiments we only consider the identification of weak and strong binders in the datasets mentioned in Section 3.1, although our proposed method is not restricted to binary classification. Throughout all the experiments, we set $\alpha = 2$, $\beta = 10$, and $\gamma = 2$ as the default values of our algorithm. The rationale is that both $\alpha$ and $\gamma$ depend on the clustering result, which is influenced by uncertain factors such as the number of clusters and the initial center of each cluster, so assigning them relatively low values is better; on the other hand, $\beta$ expresses our confidence in the labeled sequences, which come from strict and reliable experiments, so $\beta$ should be assigned a relatively larger value.

Fig. 5: An example illustrating the label propagation at each iteration. (A) Partition of all the nodes (each sequence is represented as a node here) into multiple clusters; (B) conditional probability estimates of the clusters (i.e., cluster probability matrix $C$) receiving the label information from the sequences (nodes) belonging to them; (C) each cluster propagates its class information to its neighboring clusters; and (D) after updating its probability, each cluster passes the label information back to the conditional probabilities of its members (i.e., sequence probability matrix $S$). Legend entries: node class and cluster class (Strong, Weak, Test, Simulated, Unlabeled).

6.1 Experiments on Material Binding Sequences

To show that our proposed framework is effective at predicting the binding affinity class of inorganic material binding sequences, we perform the following three experiments to demonstrate that: 1) simulated sequences and the information propagation among the clusters can effectively alleviate the limitation due to challenge I; 2) searching for patterns in clusters rather than in classes helps in handling challenge II; and 3) the newly generated transition matrix contributes to our proposed method's performance. For each accuracy (i.e., the percentage of test set examples correctly classified when compared with the ground truth) shown in the following experiments, we performed the experiment 5 times using leave-one-out validation and report the mean value. We iteratively select one labeled sequence as the test sequence, and use the rest of the labeled sequences to generate the simulated sequences and train the model. Once the model is trained, we predict the affinity class of the test sequence. The reason why we use leave-one-out validation rather than k-fold cross validation for the inorganic material binding sequences is that the number of labeled sequences is insufficient. In such a case, each one of them may represent a significant underlying pattern or characteristic.
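The leave-one-out protocol described above can be sketched as follows. This is an illustrative outline only: `train_and_predict` stands in for whatever model is being evaluated (PSIGM or a baseline), and `hamming_nearest_neighbor` is a hypothetical toy classifier included so the sketch runs on its own; neither name comes from the paper, and the toy classifier is not one of the paper's baselines.

```python
from statistics import mean
from typing import Callable, List

# A predictor takes (training sequences, training labels, test sequence)
# and returns the predicted label of the test sequence.
Predictor = Callable[[List[str], List[str], str], str]


def hamming_nearest_neighbor(train_x: List[str], train_y: List[str],
                             query: str) -> str:
    """Toy stand-in classifier: label of the closest training sequence."""
    def distance(a: str, b: str) -> int:
        mismatches = sum(x != y for x, y in zip(a, b))
        return mismatches + abs(len(a) - len(b))

    best = min(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    return train_y[best]


def leave_one_out_accuracy(sequences: List[str], labels: List[str],
                           train_and_predict: Predictor) -> float:
    """Hold out each labeled sequence once and train on all the others."""
    correct = 0
    for i in range(len(sequences)):
        train_x = sequences[:i] + sequences[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = train_and_predict(train_x, train_y, sequences[i])
        correct += int(predicted == labels[i])
    return correct / len(sequences)


def repeated_leave_one_out(sequences: List[str], labels: List[str],
                           train_and_predict: Predictor,
                           repeats: int = 5) -> float:
    """Mean accuracy over several leave-one-out runs, as reported in the text."""
    return mean(
        leave_one_out_accuracy(sequences, labels, train_and_predict)
        for _ in range(repeats)
    )
```

Repeating the run only changes the result when the evaluated model is stochastic, for example through random initialization or randomly generated simulated sequences; with a deterministic predictor such as the toy one here, every repeat gives the same accuracy.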
In addition, for each experiment, we fix the convergence threshold $\epsilon$ to $10^{-4}$.

The effect of simulated sequences and cluster information propagation. To test the effect of simulated sequences and information propagation among the clusters, we ran our method with and without simulated sequences, and with and without information propagation among clusters. Furthermore, as mentioned in Section 5.2, we need to partition all the sequences into $V$ clusters, so we also want to show the relationship between the proposed method's performance and the number of clusters. We vary the number of clusters as 2, 5, 10, 15, 18 and 20. We show that both strategies, sequence simulation and information propagation among the clusters, are crucial in improving the performance. The result is shown in Fig. 6, where the x axis denotes the number of clusters and the y axis denotes the accuracy. We can see that simulated sequences contribute to the performance improvement. Also, in the absence of information propagation among clusters, the performance degrades. Finally, we notice that hill-like shapes appear as the number of clusters increases in all three cases. Most likely, when the number of clusters is too low, the model is close to a global view; when it is too high, the clusters are too small to learn from.

Fig. 6: Performance comparison with different strategies. (x axis: number of clusters; y axis: accuracy; curves: PSIGM, Without Information Propagation, Without Sequences Simulation.)

Performance comparison with baselines. In this part, we compare the proposed method with 5 other algorithms mentioned above including SVM, Neural Network, HMM, and Learning with local and global consistency (LLGC) [48], which is a well-known graph-based semi-supervised learning algorithm. For fairness and comprehensiveness, we have also tried adding the simulated sequences used for our framework to these methods; these variants are marked as SVM*, Neural Network* (NN*) and HMM*. Note that, since LLGC was designed as a semi-supervised algorithm which needs unlabeled instances to aid in propagating the labeled information, we only consider LLGC with the simulated sequences, which are used as unlabeled data. Thus, all the methods are separated into two groups: those using the simulated sequences and those not using them. To measure the influence of the mutation rate $t$ used to generate the simulated sequences, we vary $t$ at 5%, 10%, 15% and 20%. The results of predicting the binding affinity classes on the quartz and gold binding sequence datasets, compared with the baselines, are shown in Table 4 and Table 5, respectively.

TABLE 4: Comparison with baselines under different mutation rates for quartz-binding sequences

Method   Mutation Rate
         5%     10%    15%    20%
PSIGM    0.82   0.82   0.84   0.83
LLGC     0.60   0.63   0.60   0.62
NN*      0.60   0.62   0.58   0.55
HMM*     0.66   0.65   0.69   0.67
SVM*     0.62   0.63   0.58   0.59
SVM      0.68 (no simulated sequences)
NN       0.65 (no simulated sequences)
HMM      0.70 (no simulated sequences)

TABLE 5: Comparison with baselines under different mutation rates of the gold binding sequences

Method   Mutation Rate
         5%     10%    15%    20%
PSIGM    0.91   0.91   0.92   0.91
LLGC     0.81   0.82   0.84   0.81
NN*      0.86   0.82   0.84   0.81
HMM*     0.89   0.90   0.90   0.87
SVM*     0.83   0.82   0.80   0.82
SVM      0.84 (no simulated sequences)
NN       0.82 (no simulated sequences)
HMM      0.90 (no simulated sequences)

Note that the proposed method outperforms the others in predicting the affinity classes of the test inorganic material binding sequences: by at least 0.12 in accuracy over the best baseline on the quartz dataset, and by a narrower margin of 0.01 to 0.02 over the best HMM variant on the gold dataset. In addition, the performance of the proposed method is not very sensitive to the setting of the mutation rate over the range considered. It is worth noticing that, instead of aiding the performance, the simulated sequences in SVM*, HMM* and NN* generally make the performance worse than without them. The main reason for this phenomenon is that all these methods take a global view of the training data, which is represented only as two large classes: strong binding and weak binding. However, as we mentioned, the sequences inside the same class may be very different, so the more simulated sequences are added, the less distinct the class-level patterns are likely to become. The proposed method treats the sequences locally as clusters, and thus it can properly handle this problem.

Parameter Sensitivity. There are three parameters in our objective function Eq. (4): $\alpha$, $\beta$ and $\gamma$. We conducted sensitivity experiments, shown in Fig. 7. In the experiments, when one parameter is varied, the other two parameters are fixed at their default settings (i.e., $\alpha = 2$, $\beta = 10$, and $\gamma = 2$). Note that $\alpha$ represents our confidence in the information propagation among clusters. The clusters of the sequences are obtained from arbitrary clustering methods, which are not very stable. In other words, the clustering may not be completely correct. Therefore, a smaller $\alpha$ usually yields better performance. $\beta$ expresses our confidence in the prior knowledge of the sequence classes. These sequence classes, which are obtained from rigorous physical or chemical experiments, are deemed to be reliable, and thus a large $\beta$ is usually better. $\gamma$ denotes the confidence in the prior knowledge of cluster classes. This information may not be totally reliable, and therefore a lower value usually yields better results. The results in Fig. 7 confirm our observation.

Fig. 7: Parameter sensitivity experiments. (x axis: parameter value, from 0 to 20; y axis: accuracy; curves: $\alpha$, $\beta$, $\gamma$.)

Comparison with Varying Transition Matrices. Finally, we show the performance of the proposed method with different transition matrices. We want to demonstrate that the newly generated matrix improves the performance of the proposed algorithm for the target inorganic material. In this experiment, we also vary the mutation rate $t$ at 5%, 10%, 15% and 20%.
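To make the role of the mutation rate t and of the transition matrix concrete, the following is a rough sketch of one way a simulated sequence could be derived from a labeled one. It is an assumption made for illustration, not the two-step generation procedure defined earlier in the paper: here each residue is replaced with probability t, and the replacement is drawn from the corresponding row of a row-stochastic transition matrix. The names `mutate_sequence` and `uniform_offdiagonal_matrix` are hypothetical, and the uniform matrix is a placeholder, not M1, M2 or any published substitution matrix.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def uniform_offdiagonal_matrix() -> np.ndarray:
    """Placeholder 20x20 row-stochastic matrix with zero self-transition."""
    n = len(AMINO_ACIDS)
    weights = np.ones((n, n)) - np.eye(n)
    return weights / weights.sum(axis=1, keepdims=True)


def mutate_sequence(seq: str, transition: np.ndarray, t: float,
                    rng: np.random.Generator) -> str:
    """Replace each residue with probability t (assumed mutation model).

    The replacement residue is sampled from the row of `transition`
    that corresponds to the original residue.
    """
    mutated = []
    for residue in seq:
        if rng.random() < t:
            row = transition[AMINO_ACIDS.index(residue)]
            new_index = rng.choice(len(AMINO_ACIDS), p=row)
            mutated.append(AMINO_ACIDS[new_index])
        else:
            mutated.append(residue)
    return "".join(mutated)
```

Under this assumed model, a mutation rate of 5% changes on average well under one residue of a 7 to 14 residue peptide, whereas 20% changes roughly two; swapping in a different transition matrix changes only which replacements are likely, not how many positions mutate.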
We have compared the new transition matrix M1, which was generated based on Blosum 62, and M2, which was generated based on Pam 250, with four other widely used transition matrices, namely Blosum 62, Pam 250, Dayhoff [49] and Gonnet [50], in Table 6. The result shows that the newly generated transition matrices perform better than the others at each mutation rate.

TABLE 6: Performance with different transition matrices

Transition Matrix   Mutation Rate
                    5%     10%    15%    20%
M1                  0.82   0.82   0.84   0.83
M2                  0.80   0.81   0.82   0.82
Blosum 62           0.72   0.74   0.75   0.74
Pam 250             0.72   0.73   0.73   0.74
Dayhoff             0.71   0.71   0.72   0.72
Gonnet              0.79   0.78   0.79   0.80

6.2 Experiments on Protein Sequences

The proposed PSIGM is a general framework which is not limited to identifying the affinity class of inorganic material binding sequences. To show this, we have used the SCOP protein data mentioned in Section 3.1. Instead of predicting the sequences' affinity classes, we consider the problem of homology family prediction: for a specific family, can the proposed framework distinguish the sequences belonging to it from those of the remaining families? Correspondingly, we construct seven identification tasks from this dataset, where the sequences from one particular family are used as the positive set and the sequences from the remaining six families are used as the negative set. For example, when the sequences in family A are used as the positive set, the sequences from families B, C, D, E, F and G are used as the negative set. Two experiments are performed to demonstrate that: 1) PSIGM is a general framework which can also handle traditional protein sequence identification; and 2) a moderate setting of the mutation rate is conducive to improving the performance.

Performance comparison with baselines. It is worth noticing that, by constructing the tasks in this way, the data take on the characteristics of inorganic binding sequences to some extent. Note that each result shown in the following experiments (i.e., Table 7 and Fig. 9) is the average over 10 runs of 5-fold cross validation. Since the protein sequence dataset has relatively sufficient training samples and the sequences that belong to the same protein family are similar to each other, we have used cross validation rather than leave-one-out validation. Table 7 shows the result of predicting the homology family, compared with the baselines mentioned in Section 6.1. As the table shows, the proposed method matches or outperforms the other methods on each protein family's prediction task. Note that the accuracies of predicting the homology families are much higher than the accuracies of predicting the affinity classes of the inorganic material binding sequences. The reasons behind this can be well explained by Fig. 8, which shows the self-class similarity of each prediction task. As we know, the more the cross-class similarity surpasses the self-similarity, the more difficult it is to separate the two classes.

TABLE 7: Comparison with baselines of each class

Family   PSIGM   LLGC   SVM     HMM     NN
A        0.999   0.80   0.998   0.966   0.913
B        0.965   0.81   0.959   0.944   0.947
C        0.999   0.82   0.999   0.952   0.968
D        0.999   0.82   0.999   0.969   0.999
E        0.999   0.82   0.985   0.935   0.999
F        0.978   0.82   0.946   0.952   0.857
G        0.987   0.82   0.975   0.972   0.929

Fig. 8: Total similarity scores (TSS) of the self-class and the cross-class for each prediction task based on Pam 250 (y axis: similarity score). (A) self-class and cross-class TSS of class A and non-A; (B) self-class and cross-class TSS of class B and non-B; (C) self-class and cross-class TSS of class C and non-C; (D) self-class and cross-class TSS of class D and non-D; (E) self-class and cross-class TSS of class E and non-E; (F) self-class and cross-class TSS of class F and non-F; and (G) self-class and cross-class TSS of class G and non-G.

Mutation rate sensitivity. The performance of PSIGM is influenced by the setting of the mutation rate which is used to generate the simulated sequences. To fully evaluate how the mutation rate affects the performance, we increase it from 0.05 to 0.3 with a step of 0.05 and report the accuracy of each family's prediction task in Fig. 9. It is clear that most families show an increase in accuracy as the mutation rate rises until it reaches a threshold of 15%, after which the performance begins to decrease. This indicates that the performance of PSIGM can be improved by a moderate setting of the mutation rate. In addition, we can infer that PSIGM is not only effective at identifying the affinity classes of the inorganic material binding sequences, but also effective at predicting the homology families of traditional protein sequences.

Fig. 9: Mutation rate sensitivity experiment. (x axis: mutation rate, from 0.05 to 0.3; y axis: AUC; one curve for each of the families A to G.)

7 CONCLUSION AND FUTURE WORK

Identifying the affinity classes of peptide sequences binding to a specific inorganic material is a new and challenging research problem with broad applications. In this paper, we proposed a novel framework, PSIGM, to solve this problem. We begin by providing a two-step simulated peptide sequence generation method to make the training set more comprehensive and diverse. Moreover, unlike traditional machine learning approaches used for protein sequence identification that try
to find the patterns from the class level, our framework partitions the sequences into smaller clusters and learns the patterns from them by using a graph-based optimization model. Extensive experimental studies demonstrate that the proposed framework can effectively identify the affinity classes of the inorganic material binding sequences. In the future, to achieve better performance, we plan to use a cyclic model to validate and retrain PSIGM: first, we will use PSIGM to select, as a candidate set, the sequences with the highest and lowest predicted probabilities of binding to a target inorganic material; second, we plan to validate the candidate sequence set with efficient experimental methods such as QCM (Quartz Crystal Microbalance); finally, the validated sequences will be used to retrain PSIGM, and new candidate sequences will then be selected from the sequence database based on their affinity, and so on. We believe that with this cyclic validation model, we can not only further validate PSIGM's effectiveness but also keep improving it through retraining.

8 ACKNOWLEDGMENTS

This material is based upon work supported by the Air Force Office of Scientific Research (AFOSR), grant number FA9550-12-1-0226. We gratefully acknowledge the Victorian Life Sciences Computation Facility (VLSCI) for allocation of computational resources. TRW thanks veski for an Innovation Fellowship.

REFERENCES

[1] G. P. Smith, “Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface,” Science, vol. 228, no. 4705, pp. 1315–1317, 1985.
[2] K. Y. Dane, C. Gottstein, and P. S. Daugherty, “Cell surface profiling with peptide libraries yields ligand arrays that classify breast tumor subtypes,” Molecular Cancer Therapeutics, vol. 8, no. 5, pp. 1312–1318, 2009.
[3] E. T. Boder and K. D. Wittrup, “Yeast surface display for screening combinatorial polypeptide libraries,” Nature Biotechnology, vol. 15, no. 6, pp. 553–557, 1997.
[4] E.
Kasotakis, E. Mossou, L. Adler-Abramovich, E. P. Mitchell, V. T. Forsyth, E. Gazit, and A. Mitraki, “Design of metal-binding sites onto self-assembled peptide fibrils,” Biopolymers, vol. 92, no. 3, pp. 164–172, 2009.
[5] J. Kimling, M. Maier, B. Okenve, V. Kotaidis, H. Ballot, and A. Plech, “Turkevich method for gold nanoparticle synthesis revisited,” The Journal of Physical Chemistry B, vol. 110, no. 32, pp. 15700–15707, 2006.
[6] Y. Huang, C.-Y. Chiang, S. K. Lee, Y. Gao, E. L. Hu, J. De Yoreo, and A. M. Belcher, “Programmable assembly of nanoarchitectures using genetically engineered viruses,” Nano Letters, vol. 5, no. 7, pp. 1429–1434, 2005.
[7] K. T. Nam, D.-W. Kim, P. J. Yoo, C.-Y. Chiang, N. Meethong, P. T. Hammond, Y.-M. Chiang, and A. M. Belcher, “Virus-enabled synthesis and assembly of nanowires for lithium ion battery electrodes,” Science, vol. 312, no. 5775, pp. 885–888, 2006.
[8] R. R. Naik, S. J. Stringer, G. Agarwal, S. E. Jones, and M. O. Stone, “Biomimetic synthesis and patterning of silver nanoparticles,” Nature Materials, vol. 1, no. 3, pp. 169–172, 2002.
[9] E. Estephan, C. Larroque, F. J. G. Cuisinier, Z. Bálint, and C. Gergely, “Tailoring GaN semiconductor surfaces with biomolecules,” The Journal of Physical Chemistry B, vol. 112, no. 29, pp. 8799–8805, 2008.
[10] E. Estephan, M.-b. Saab, C. Larroque, M. Martin, F. Olsson, S. Lourdudoss, and C. Gergely, “Peptides for functionalization of InP semiconductors,” Journal of Colloid and Interface Science, vol. 337, no. 2, pp. 358–363, 2009.
[11] M. M. Tomczak, M. K. Gupta, L. F. Drummy, S. M. Rozenzhak, and R. R. Naik, “Morphological control and assembly of zinc oxide using a biotemplate,” Acta Biomaterialia, vol. 5, no. 3, pp. 876–882, 2009.
[12] C. Vreuls, G. Zocchi, A. Genin, C. Archambeau, J. Martial, and C. V. De Weerdt, “Inorganic-binding peptides as tools for surface quality control,” Journal of Inorganic Biochemistry, vol. 104, no. 10, pp. 1013–1021, 2010.
[13] R. R. Naik, L. L. Brott, S. J. Clarson, and M. O. Stone, “Silica-precipitating peptides isolated from a combinatorial phage display peptide library,” Journal of Nanoscience and Nanotechnology, vol. 2, no. 1, pp. 95–100, 2002.
[14] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe, “Probing the interaction between peptides and metal oxides using point mutants of a TiO2-binding peptide,” Langmuir, vol. 24, no. 13, pp. 6852–6857, 2008.
[15] Y. Liu, J. Mao, B. Zhou, W. Wei, and S. Gong, “Peptide aptamers against titanium-based implants identified through phage display,” Journal of Materials Science: Materials in Medicine, vol. 21, no. 4, pp. 1103–1107, 2010.
[16] M. B. Dickerson, S. E. Jones, Y. Cai, G. Ahmad, R. R. Naik, N. Kröger, and K. H. Sandhage, “Identification and design of peptides for the rapid, high-yield formation of nanoparticulate TiO2 from aqueous solutions at room temperature,” Chemistry of Materials, vol. 20, no. 4, pp. 1578–1584, 2008.
[17] C. Tamerler, T. Kacar, D. Sahin, H. Fong, and M. Sarikaya, “Genetically engineered polypeptides for inorganics: A utility in biological materials science and engineering,” Materials Science and Engineering C, vol. 27, no. 3, pp. 558–564, 2007.
[18] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe, “QCM-D analysis of binding mechanism of phage particles displaying a constrained heptapeptide with specific affinity to SiO2 and TiO2,” Analytical Chemistry, vol. 78, no. 14, pp. 4872–4879, 2006.
[19] E. Eteshola, L. J. Brillson, and S. C. Lee, “Selection and characteristics of peptides that bind thermally grown silicon dioxide films,” Biomolecular Engineering, vol. 22, no. 5-6, pp. 201–204, 2005.
[20] S. Donatan, M. Sarikaya, C. Tamerler, and M. Urgen, “Effect of solid surface charge on the binding behaviour of a metal-binding peptide,” Journal of the Royal Society Interface, no. April, p. rsif.2012.0060, 2012.
[21] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala, “A novel knowledge-based approach to design inorganic-binding peptides,” Bioinformatics, vol. 23, no. 21, pp. 2816–2822, 2007.
[22] M. Hnilova, E. E. Oren, U. O. S. Seker, B. R. Wilson, S. Collino, J. S. Evans, C. Tamerler, and M. Sarikaya, “Effect of molecular conformations on the adsorption behavior of gold-binding peptides,” Langmuir, vol. 24, no. 21, pp. 12440–12445, 2008.
[23] A. Vila Verde, P. J. Beltramo, and J. K. Maranas, “Adsorption of homopolypeptides on gold investigated using atomistic molecular dynamics,” Langmuir, vol. 27, no. 10, pp. 5918–5926, 2011.
[24] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[25] A. Kumar and L. Cowen, “Augmented training of hidden Markov models to recognize remote homologs via simulated evolution,” Bioinformatics, vol. 25, no. 13, pp. 1602–1608, 2009.
[26] C. H. Wu, S. Zhao, H. L. Chen, C. J. Lo, and J. McLarty, “Motif identification neural design for rapid and sensitive protein family search,” Computer Applications in the Biosciences (CABIOS), vol. 12, no. 2, pp. 109–118, 1996.
[27] D. Wang, N. K. Lee, T. S. Dillon, and N. J. Hoogenraad, “Protein sequences classification using radial basis function (RBF) neural networks,” 2002.
[28] M. J. Grimble, “Adaptive systems for signal processing, communications and control,” Control, vol. 3, 2001.
[29] R. Karchin, K. Karplus, and D. Haussler, “Classifying G-protein coupled receptors with support vector machines,” Bioinformatics, vol. 18, no. 1, pp. 147–159, 2002.
[30] M. Wistrand and E. L. L. Sonnhammer, “Improving profile HMM discrimination by adapting transition probabilities,” Journal of Molecular Biology, vol. 338, no. 4, pp. 847–854, 2004.
[31] X. Zhu, “Semi-supervised learning literature survey,” Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison, 2007.
[32] N. Du, M. R. Knecht, P. N. Prasad, M. T. Swihart, T. Walsh, and A. Zhang, “A framework for identifying affinity classes of inorganic materials binding peptide sequence,” ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (ACM BCB), 2013.
[33] N. Terrapon, O. Gascuel, E. Marechal, and L. Brehelin, “Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum,” BMC Bioinformatics, vol. 13, no. 1, p. 67, 2012.
[34] M.-W. Mak, J. Guo, and S.-Y. Kung, “PairProSVM: protein subcellular localization based on local pairwise profile alignment and SVM,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 3, pp. 416–422, 2008.
[35] J. Tian, H. Gu, W. Liu, and C. Gao, “Robust prediction of protein subcellular localization combining PCA and WSVMs,” Computers in Biology and Medicine, vol. 41, no. 8, pp. 648–652, 2011.
[36] L. Ge, N. Du, and A. Zhang, “Finding informative genes from multiple microarray experiments: A graph-based consensus maximization model,” in Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM ’11, pp. 506–511, 2011.
[37] Z. Tang, J. Palafox-Hernandez, W.-C. Law, Z. Hughes, M. T. Swihart, P. N. Prasad, M. R. Knecht, and T. R. Walsh, “Biomolecular recognition principles for bionanocombinatorics: An integrated approach to elucidate enthalpic and entropic factors,” ACS Nano, vol. 7, pp. 9632–9646, 2013, DOI: 10.1021/nn404427y.
[38] Y. N. Tan, J. Y. Lee, and D. I. C. Wang, “Uncovering the design rules for peptide synthesis of metal nanoparticles,” Journal of the American Chemical Society, vol. 132, no. 16, pp. 5677–5686, 2010.
[39] M. Hnilova, C. R. So, E. E. Oren, B. R. Wilson, T. Kacar, C. Tamerler, and M. Sarikaya, “Peptide-directed co-assembly of nanoprobes on multimaterial patterned solid surfaces,” Soft Matter, vol. 8, pp. 4327–4334, 2012.
[40] S. B. Needleman and C. D.
Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[41] W. R. Pearson, “Rapid and sensitive sequence comparison with FASTP and FASTA,” Methods in Enzymology, vol. 183, no. 1988, pp. 63–98, 1990.
[42] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.
[43] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: A structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.
[44] Afiahayati and S. Hartati, “Multiple sequence alignment using hidden Markov model with augmented set based on Blosum 80 and its influence on phylogenetic accuracy,” 2010.
[45] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[46] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[47] D. P. Bertsekas, Nonlinear Programming, vol. 43. Athena Scientific, 1995.
[48] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” Advances in Neural Information Processing Systems 16, vol. 1, pp. 595–602, 2003.
[49] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, “A model of evolutionary change in proteins,” Atlas of Protein Sequence and Structure, vol. 5, no. Suppl 3, pp. 345–352, 1978.
[50] G. H. Gonnet, M. A. Cohen, and S. A. Benner, “Exhaustive matching of the entire protein sequence database,” Science, vol. 256, no. 5062, pp. 1443–1445, 1992.

Nan Du
Nan Du received his B.S. degree from Guangdong University of Technology in 2006. After that, he received his M.S.
degree from Southern China University of Technology in 2009. Since 2009, he has been working toward the Ph.D. degree at the State University of New York at Buffalo, NY, under the supervision of Prof. Aidong Zhang. His research interests are in the areas of data mining, machine learning and bioinformatics.

Marc R. Knecht
Marc R. Knecht earned a B.S. degree in Chemistry from Duquesne University in 2001. In 2004, he received a Ph.D. in BioInspired Chemistry from Vanderbilt University under the direction of Professor David W. Wright, followed by postdoctoral research at the University of Texas with Professor Richard M. Crooks focused on characterizing the structure/function relationship of nanocatalysts. After completing his postdoctoral studies, he began his independent career as an assistant professor of Chemistry at the University of Kentucky. In the summer of 2011, Professor Knecht joined the Department of Chemistry at the University of Miami as an associate professor. During his independent career, Professor Knecht has established a research program focused on elucidating the effects of the biotic/abiotic interface of bio-inspired nanomaterials. In this regard, his group has employed high-resolution characterization, activity studies, and synthetic analyses of peptides to demonstrate that the biological surface of bionanomaterials possesses significant control over the functionality and could serve as modification sites to control the activity. He has authored 47 publications in this area.

Mark T. Swihart
Mark T. Swihart is a Professor in the Department of Chemical and Biological Engineering at the University at Buffalo (SUNY). He earned a B.S. in Chemical Engineering from Rice University in 1992, and a Ph.D. in Chemical Engineering in 1997 from the University of Minnesota. He then spent one year as a postdoctoral researcher in Mechanical Engineering at the University of Minnesota before joining the University at Buffalo as an assistant professor in 1998.
Since 2007, he has directed a university-wide strategic initiative in Integrated Nanostructured Systems. His research interests include synthesis, processing, and applications of nanoparticles and other nanomaterials, and he has co-authored more than 120 journal papers in these areas. Dr. Swihart is a recipient of the Kenneth Whitby award from the American Association for Aerosol Research, the Schoellkopf medal from the Western New York section of the American Chemical Society, and the J.B. Wagner award from the Electrochemical Society.

Zhenghua Tang
Zhenghua Tang is currently a postdoctoral research associate working in the group of Marc R. Knecht at the University of Miami. He obtained his B.S. degree at the College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, Gansu, P. R. China in 2005. He attended graduate school there from Aug. 2005 to Jun. 2007. During his graduate study, he went to the Institute of Chemistry, Chinese Academy of Science (ICCAS) as a visiting student for about one year (2006-2007). In August 2007, he moved to the US, and he obtained his Ph.D. degree in chemistry from the Department of Chemistry, Georgia State University in July 2012. He started his current position in August 2012. His research interest focuses on bio-inspired nanomaterials for targeted applications, including bionanocombinatorics, self-assembly, catalysis, and multifunctional design. He is the recipient of the 2010 Chair's Award in the Chemistry Department at GSU as well as the 2011 Chinese Government Award for Outstanding Self-Financed Students Abroad.

Tiffany R. Walsh
Tiff Walsh graduated with a B.Sci(Hons) from the University of Melbourne. She earned her PhD degree in theoretical chemistry from the University of Cambridge, U.K., working in the group of Prof. David Wales in the Dept. of Chemistry as a Cambridge Commonwealth Trust scholar. Walsh then joined the Dept. of Materials, University of Oxford, U.K., as a postdoctoral researcher in the Materials Modelling Laboratory (MML) with Prof. Adrian Sutton. She was then awarded a Glasstone fellowship, which she held in the MML in Oxford. In 2002, she joined the faculty of the University of Warwick, U.K., as a joint appointment in the Dept. of Chemistry and the Centre for Scientific Computing. Her research interests focus on computational modelling of the interface between biomolecules and inorganic surfaces, using molecular dynamics simulations. She was a lead investigator in the team that won £5.3 M (US$8.2 M) of funding for a 5-year EPSRC Programme Grant in this area (started in Oct 2010). In 2012, Walsh joined the Institute for Frontier Materials at Deakin University in Australia, where she holds the position of Associate Prof. in Bio/Nanotechnology.

Aidong Zhang
Dr. Aidong Zhang is University Distinguished Professor and Chair in the Department of Computer Science and Engineering at the State University of New York at Buffalo. Her research interests include bioinformatics, data mining, multimedia and database systems, and content-based image retrieval. She is an author of over 250 research publications in these areas. She has chaired or served on over 100 program committees of international conferences and workshops, and currently serves on several journal editorial boards. She has published two books, Protein Interaction Networks: Computational Analysis (Cambridge University Press, 2009) and Advanced Analysis of Gene Expression Microarray Data (World Scientific Publishing Co., Inc. 2006). Dr. Zhang is a recipient of the National Science Foundation CAREER award and the State University of New York (SUNY) Chancellor's Research Recognition award. Dr. Zhang is an IEEE Fellow.