A Framework for Identifying Affinity Classes of Inorganic Materials Binding Peptide Sequences

Nan Du
Computer Science and Engineering Department, SUNY at Buffalo, U.S.A.
nandu@buffalo.edu

Paras N. Prasad
Department of Chemistry, SUNY at Buffalo, U.S.A.
pnprasad@buffalo.edu

Mark T. Swihart
Department of Chemical and Biological Engineering, SUNY at Buffalo, U.S.A.
swihart@buffalo.edu

Marc R. Knecht
Department of Chemistry, University of Miami, Coral Gables, U.S.A.
knecht@miami.edu

Tiffany Walsh
Institute for Frontier Materials, Deakin University, Geelong, Australia
tiffany.walsh@deakin.edu.au

Aidong Zhang
Computer Science and Engineering Department, SUNY at Buffalo, U.S.A.
azhang@buffalo.edu
ABSTRACT
With the rapid development of bionanotechnology, there has been growing interest in identifying the affinity classes of inorganic material binding peptide sequences. Unfortunately, some distinct characteristics of inorganic material binding sequence data limit the performance of many widely used classification methods. In this paper, we propose a novel framework to predict the affinity classes of peptide sequences with respect to an associated inorganic material. We first generate a large set of simulated peptide sequences based on our new amino acid transition matrix, and then the probability of a test sequence belonging to a specific affinity class is calculated by solving an objective function. The objective function is solved through iterative propagation of probability estimates among sequences and sequence clusters. Experimental results on a real inorganic material binding sequence dataset show that the proposed framework is highly effective at identifying the affinity classes of inorganic material binding sequences. Furthermore, an experiment on the SCOP (Structural Classification of Proteins) dataset shows that the proposed framework is general and can also be applied to traditional protein sequences.
Categories and Subject Descriptors
J.3 [Life and Medical Sciences]: Biology and Genetics; I.6.1 [Simulation and Modeling]: Simulation Theory—Model classification
1. INTRODUCTION
Over the past decades, many studies have been published analyzing peptide sequences with affinity to biological entities such as enzymes, cells, viruses, lipids and proteins. Recently, there has been growing interest in identifying the characteristics of desired inorganic material binding peptide sequences by utilizing biocombinatorial peptide libraries such as phage display [27], cell surface display [5], and yeast display [3].
In particular, numerous studies have reported peptide sequences that bind to inorganic materials such as noble metals (gold, silver, platinum) [18, 19], semiconductors (zinc sulfide, cadmium sulfide) [9, 10], and metal oxides (silica, titania and magnetite) [4, 23], which are crucial for practical implementation in technology and medicine. Inorganic material binding peptide sequences, which are usually 7-14 amino acids long, are differentiated from other polypeptides by their specific molecular recognition properties toward targeted inorganic material surfaces [7]. Among the many studies of inorganic material binding sequences, effectively identifying the affinity class, which indicates the binding strength of a sequence with respect to the target inorganic material, is crucial for designing novel peptides [25]. The binding affinity of a peptide to an inorganic surface results from a complex interplay between the binding strengths of its individual residues and its conformation. The binding strength of a sequence for a specific material is usually measured by the adsorption free energy (∆G_ads), which is then used to classify each sequence's affinity as weak, medium, or strong.
Despite extensive recent reports on combinatorially selected
inorganic binding peptides and their bionanotechnological
utility as synthesizers and molecular linkers [7, 15, 30], there
is still limited knowledge about the relationships between
binding peptide sequences and their associated inorganic
materials. Therefore, by using machine learning to suggest sequence affinity classes, we can increase product performance and functionality and reduce cost without performing new large-scale screenings via phage display.
Various approaches have been used or developed for recognizing both close and distant homologs of given protein sequences, which is one of the central themes in bioinformatics. Most of this work is based on traditional machine learning models such as the Hidden Markov Model (HMM) [8, 20], Neural Networks (NN) [32, 34] and Support Vector Machines (SVM) [13, 16]. However, identifying the affinity classes of inorganic material binding peptide sequences poses some distinct challenges that are rarely faced in protein sequence identification, and these markedly limit the performance of the successful models mentioned above.
Challenge I: The number of labeled samples is usually insufficient. As an emerging topic, peptide sequences binding solid inorganic materials have only been studied over the last decade, a much shorter history than that of protein sequence analysis. For example, unlike protein sequence analysis, which has many large-scale public datasets such as GPCR [17] and SCOP [33], no complete results of large-scale screening experiments have been made publicly available for inorganic material binding sequences, so data on inorganic binding peptide sequences are usually scarce. Most existing protein sequence classification approaches require a large set of labeled samples to train an accurate model, but labeling the affinity classes of a large number of inorganic material binding sequences is time-consuming and expensive, and thus usually infeasible. If only a limited number of labeled samples are available for model training, the learned model may suffer from over-fitting or under-fitting. Semi-Supervised Learning (SSL) [36], a machine learning approach that has received much attention in the past decade, is good at handling the lack of sufficient labeled training data. However, its utility may be markedly limited by the next challenge.
Challenge II: Peptide sequences belonging to the same affinity class may be very dissimilar. Usually, protein sequences belonging to the same family follow apparent patterns; in other words, they are similar to each other in some respects. However, the similarity between inorganic material binding peptide sequences from the same affinity class may not be so apparent. In some cases, the intra-similarity, which measures the similarity of all sequences inside the same class, is even lower than the inter-similarity, which measures the similarity among sequences from different classes. This means that some peptide sequences belonging to the same class may be dissimilar to each other, at least by current knowledge. This observation indicates that inorganic material binding sequences violate the smoothness assumption at the class level, which is generally assumed in both supervised and semi-supervised learning.
In light of these challenges for inorganic material binding sequence affinity class identification, we propose a novel framework that includes two parts. First, to tackle the insufficient-data challenge, we augment the training sequence set with simulated sequences generated from a new amino acid transition matrix. By using the simulated sequences, we incorporate into the training data not only prior phylogenetic knowledge but also the specific sequence patterns responsible for binding the target inorganic material. Second, instead of searching for patterns globally among the peptide sequences belonging to the same affinity class, we separate the sequences into smaller clusters and learn the patterns from them locally. Intuitively, since few obvious patterns can be found at the class level, we search for them at the smaller cluster level.
In this paper, we make the following contributions:
• We introduce the distinct challenges in identifying affinity classes for inorganic material binding sequences.
• We propose a novel framework that can effectively predict the affinity classes of inorganic material binding sequences, together with an efficient iterative algorithm for finding the optimal solution of the proposed objective function. Moreover, the framework is general and is also effective at identifying traditional protein sequences.
• Extensive experiments show that the proposed method outperforms many baseline methods.
The rest of the paper is organized as follows. In Section 2,
we describe the datasets used in this paper and the setting
of our problem. The peptide sequences simulation method
and the graph-based optimization model are presented in
Section 3 and Section 4, respectively. Extensive experimental results are shown in Section 5. In Section 6, we explain the relationship between our work and previous related work. The conclusion and future work are presented in Section 7.
2. DATA SET AND PROBLEM DEFINITION
In this section, we describe the datasets used in this paper
and present the problem definition.
2.1 Data Set
We have used two datasets to demonstrate the proposed method's performance. The first dataset is from Ersin Emre Oren et al. [25]. This dataset consists of a total of 39 quartz (rhombohedral silica, SiO2) binding peptide sequences produced using phage-display techniques. These peptide sequences are further classified into two classes based on their affinity: a strong and a weak binder class, which have 10 and 15 sequences, respectively. To better illustrate the problem and the proposed method in the following, we abstract a sample set including the two affinity classes from this dataset and show it in Table 1.
Table 1: Sample set of peptide sequences data

Strong Class                 Weak Class
Name    Sequence             Name    Sequence
DS202   RLNPPSQMDPPF         DS201   MEGQYKSNLLFT
DS189   QTWPPPLWFSTS         DS191   VAPRVQNLHFGA
...     ...                  ...     ...
It is worth noting that this dataset is a good illustrative example of the two challenges mentioned above. First, only around ten sequences are available for each affinity class, which is far fewer than the data sizes used for classifier training in protein sequence analysis, where hundreds or thousands of sequences are usually involved [20, 29]. Second, the challenge of unobvious patterns in this dataset is well illustrated by Figure 1. In this figure, based on the total similarity scores (TSS) defined in [25], we first calculate the total similarity of sequences from the same class (i.e., the strong class in blue and the weak class in orange) via the following equation:
TSS_A = \frac{1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \sum_{j=1}^{N_A} PSS_{ij} (1 - \delta_{ij}),    (1)
where δ is the usual Kronecker delta function, with δ_ij = 1 when i = j and 0 otherwise, N_A is the total number of sequences in set A, and PSS_ij is the similarity between the i-th and j-th sequences of set A, calculated via the Needleman-Wunsch algorithm [24]. For the sake of simplicity, we call this quantity the self-class similarity. Moreover, the TSS of the sequences across two classes is calculated as:
TSS_{A-B} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} PSS_{ij},    (2)
where N_B is the total number of sequences in set B. Correspondingly, the total similarity for sequences across the classes is called the across-class similarity for short.
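As a sketch, Eqs. (1) and (2) can be computed as follows; the `pss` argument stands in for the pairwise Needleman-Wunsch similarity and is an assumption here (any symmetric scoring function works):

```python
from itertools import product

def tss_self(seqs, pss):
    """Self-class similarity (Eq. 1): average pairwise similarity
    within one class, excluding each sequence paired with itself."""
    n = len(seqs)
    total = sum(pss(seqs[i], seqs[j])
                for i, j in product(range(n), repeat=2) if i != j)
    return total / (n * (n - 1))

def tss_cross(seqs_a, seqs_b, pss):
    """Across-class similarity (Eq. 2): average pairwise similarity
    between the sequences of two different classes."""
    total = sum(pss(a, b) for a in seqs_a for b in seqs_b)
    return total / (len(seqs_a) * len(seqs_b))
```

Comparing `tss_self` of the weak class against `tss_cross` of the two classes reproduces the intra- versus inter-similarity comparison of Figure 1.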
To calculate PSS_ij, we need to provide a transition matrix on which the optimal scoring alignment is made. Without loss of generality, we have used Blosum 62 [14] (Figure 1(a)) and Pam 250 [26] (Figure 1(b)) as the transition matrices, respectively. It can be seen from the figures that the sequences belonging to the weak class have very low or no significant similarity, which is even much lower than the cross-class similarity. Due to this phenomenon, traditional classification approaches can hardly learn an effective pattern from the weak class.
[Figure 1(a): Pam 250. Bar chart of similarity scores for Strong-Strong, Weak-Weak, and Strong-Weak sequence pairs.]
The acquired SCOP proteins are further grouped into seven families (i.e., A, B, C, D, E, F and G). The sizes of the families in the dataset are shown in Table 2, and the data we used are available at http://www.acsu.buffalo.edu/~nandu/Datasets/.
Table 2: Summary of the protein sequence data

Family    Number of Seq    Shortest Seq    Longest Seq
Class A   23               160             177
Class B   23               136             72
Class C   16               224             260
Class D   19               120             144
Class E   18               324             429
Class F   14               131             221
Class G   20               45              83

2.2 Problem Definition
We consider our problem as identifying the affinity classes of test inorganic material binding peptide sequences based on the training sequences. We are given a pool of l+u peptide sequences S_source = {s_1, ..., s_l, ..., s_{l+u}}, where each peptide sequence is represented as a series of ordered amino acids. To better understand what a peptide sequence looks like, take a peptide sequence from Table 1 as an example: DS202: RLNPPSQMDPPF is a peptide sequence composed of twelve ordered amino acids, where each letter denotes one of the 20 standard amino acids. We assume that in this sequence pool the first l sequences s_i ∈ S_source (1 ≤ i ≤ l) are labeled based on their affinity to the target inorganic material (e.g., weak or strong) and together form the labeled set L, and the remaining u sequences s_i ∈ S_source (l+1 ≤ i ≤ l+u) are unlabeled and together form the set U, where L ∪ U = S_source. Our goal is to predict the labels of the peptide sequences in U using the training sequences in L.
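In code, the split of S_source into L and U can be sketched as follows; the sequences are taken from Table 1, and treating DS191 as unlabeled is purely illustrative:

```python
# Toy sequence pool: (sequence, affinity label), with None marking unlabeled.
pool = [
    ("RLNPPSQMDPPF", "strong"),   # DS202
    ("QTWPPPLWFSTS", "strong"),   # DS189
    ("MEGQYKSNLLFT", "weak"),     # DS201
    ("VAPRVQNLHFGA", None),       # DS191, treated as a test sequence here
]
L = [(s, y) for s, y in pool if y is not None]   # labeled set L
U = [s for s, y in pool if y is None]            # unlabeled set U
```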
3. PEPTIDE SEQUENCES SIMULATION
As mentioned before, a lack of labeled data is a general problem when working on inorganic material binding sequence research. One of the most successful methods to date for recognizing protein sequences based on evolutionary knowledge is the use of simulated sequences. Many studies [1, 20] have shown that augmenting the training set with simulated sequences generated from an amino acid transition matrix such as Blosum 62 or Pam 250 can improve homolog identification performance. It is also believed that a set of peptides generated by directed evolution to recognize a given solid material will have similar sequences [25].
[Figure 1(b): Blosum 62.]
Figure 1: Total similarity scores of the self-class (the strong class and the weak class) and the cross-class for quartz binders.
To demonstrate that the proposed work is a general framework that is also effective at predicting the homology families of traditional protein sequences, the SCOP dataset from [22] is used. In addition, we employ the approach developed by Anoop Kumar and Lenore Cowen [20] to pick the SCOP families.
Although these transition matrices have been shown to be effective and have gained wide acceptance, we cannot directly apply them to generate simulated sequences: they are derived from large-scale natural protein sequence databases rather than from target inorganic material binding sequences, which means these existing matrices may not represent the target inorganic material well. Thus, we only use a traditional transition matrix as a seed and, based on it, generate a new transition matrix that not only maintains the prior knowledge from proteins but also captures the significant knowledge inside the target inorganic material.
Here, aiming to provide a more comprehensive and diverse view for our model, we use a two-step simulated sequence generation approach to enlarge the training set. First, we generate a new transition matrix that better measures the amino acid transition relations for the target inorganic material. Specifically, we use a traditional transition matrix (e.g., Blosum 62 or Pam 250) as a seed matrix M, a 20 × 20 symmetric matrix whose rows and columns are indexed by single-letter amino acid codes. Then we greedily and iteratively mutate each entry m_ij, an integer coefficient between two amino acids in the seed transition matrix, to maximize the difference between the self-class similarity and the cross-class similarity [25], which is designed to enlarge the gap among the affinity classes.
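A minimal sketch of one greedy pass of this refinement, under the assumption that `objective(M)` returns the self-class minus cross-class similarity gap computed with matrix M; the ±1 integer steps are an illustrative choice:

```python
def refine_matrix(M, objective):
    """One greedy pass over a symmetric seed matrix: each entry m_ij is
    nudged by an integer step if that increases the objective, and the
    symmetric entry m_ji is kept equal to m_ij."""
    for i in range(len(M)):
        for j in range(i, len(M)):
            orig = M[i][j]
            best_delta, best_val = 0, objective(M)
            for delta in (-1, 1):
                M[i][j] = M[j][i] = orig + delta
                val = objective(M)
                if val > best_val:
                    best_delta, best_val = delta, val
            M[i][j] = M[j][i] = orig + best_delta
    return M
```

Iterating such passes until the objective stops improving yields the new transition matrix M*.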
Second, after the new transition matrix M* is constructed, the simulated sequences are generated based on the labeled sequence set L. When a sequence is selected as a seed, the simulated sequence is generated by randomly selecting a position from it and replacing the amino acid i in that position with a new amino acid j with a probability defined as Eq. (3):
P_{ij} = \frac{M^{*}_{ij}}{\sum_{j=1}^{20} M^{*}_{ij}}.    (3)
As an example, Figure 2 shows the process of mutating an amino acid in the selected position based on the mutation probability; we keep replacing amino acids in the target sequence until a desired mutation threshold t is reached [20].
[Figure 2: the seed sequence RLNPPSQMDPPF, a row of mutation probabilities over the 20 amino acids for the selected 8th position (M), and the resulting sequence RLNPPSQWDPPF.]
Figure 2: An example of mutating an amino acid in the selected position. When a specific position (e.g., the 8th) is selected from the target sequence, the corresponding amino acid M (Methionine) is mutated to another amino acid W (Tryptophan) based on the mutation probability.
By this two-step method, we incorporate into the data not only prior phylogenetic knowledge but also the specific amino acid patterns responsible for the target inorganic material. Accordingly, based on this peptide sequence simulation method, for each labeled source sequence s_i ∈ S_source (1 ≤ i ≤ l) we generate m mutated sequences, represented by a simulated peptide sequence set S_simulated = {s*_1, ..., s*_{l×m}}. Finally, we define the sequence pool as S = S_source ∪ S_simulated, which includes the source peptide sequences and the simulated sequences. We will show in the experiments that the simulated sequences effectively improve the performance.
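The second step can be sketched as follows; the amino-acid ordering and the nonnegative matrix entries are assumptions (raw Blosum/Pam scores contain negative values and would first need shifting to serve as sampling weights):

```python
import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids

def simulate(seq, M, t=0.15, rng=random):
    """Generate one simulated sequence from a labeled seed sequence.

    M is the 20x20 transition matrix M* with nonnegative entries, indexed
    by AMINO_ACIDS. A random position is picked and its amino acid i is
    replaced by amino acid j with probability M[i][j] / sum_j M[i][j]
    (Eq. 3), repeated until a fraction t of the positions has been mutated.
    """
    seq = list(seq)
    for _ in range(max(1, round(t * len(seq)))):
        pos = rng.randrange(len(seq))
        row = M[AMINO_ACIDS.index(seq[pos])]
        seq[pos] = rng.choices(AMINO_ACIDS, weights=row, k=1)[0]
    return "".join(seq)
```

Applying `simulate` m times to each labeled sequence yields S_simulated.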
4. GRAPH-BASED OPTIMIZATION MODEL

Aiming to handle the challenge that obvious patterns are hard to find at the class level, we propose a graph-based optimization model to estimate the conditional probability of the test sequences belonging to each affinity class. Our method begins by mapping sequences from the sequence pool into nodes of a sequence-to-sequence graph (Section 4.1), where the relationships among sequences are better measured and many efficient clustering methods are available. Instead of searching for patterns at the class level, we partition the sequences into clusters, where we believe the significant patterns exist, and an objective function (Section 4.2) is proposed to learn the conditional probability of each sequence belonging to a specific affinity class. Finally, we present an efficient iterative algorithm to obtain the optimal value of the objective function (Section 4.3).
4.1 Mapping Sequences into Nodes of a Graph

We map all the sequences into a graph where each node denotes a peptide sequence and each edge denotes the pairwise similarity between two sequences. This graph offers a better understanding of the pairwise relationships among peptide sequences and is easy to partition into clusters. The pairwise similarity among sequences is calculated with the Needleman-Wunsch algorithm [24] after local alignment of each sequence pair using the Smith-Waterman algorithm [28].
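A self-contained sketch of the global alignment score used for the edge weights; only the Needleman-Wunsch part is shown (the Smith-Waterman local alignment step is omitted), and the simple match/mismatch/gap scores stand in for a substitution matrix:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score via dynamic programming,
    using O(min-row) memory: only the previous DP row is kept."""
    m, n = len(a), len(b)
    prev = [j * gap for j in range(n + 1)]       # aligning empty prefix of a
    for i in range(1, m + 1):
        cur = [i * gap] + [0] * n                # aligning a[:i] with empty b
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]
```

Evaluating `nw_score` over all sequence pairs fills the (symmetric) edge-weight matrix of the sequence-to-sequence graph.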
4.2 The Objective Function

The key idea of our approach is that, instead of searching for patterns at the class level, we narrow the affinity class prediction problem down from the class level to the cluster level. We believe that, if the patterns are obscure at the class level, shifting the focus to the cluster level lets us find clearer patterns.
Before proceeding further, we introduce the notation used in the following discussion: d_ij denotes the ij-th entry of a matrix D, and d⃗_i. and d⃗_.j denote the vectors of the i-th row and j-th column of D, respectively.
Belongingness Matrix: We denote the belongingness matrix B as an N × V matrix, where N is the number of all sequences (including source sequences, simulated sequences and test sequences) and V is the number of clusters detected from the clustering result. Each entry of the belongingness matrix corresponds to the probability of a peptide sequence belonging to a cluster: if peptide sequence s_i is assigned to cluster c_j, then b_ij = 1, and 0 otherwise. To construct B, we use spectral clustering [31], which has proved effective for graph partitioning problems, to partition the sequence-to-sequence graph constructed in the previous section into V clusters.
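Given hard cluster assignments (e.g., the output labels of a spectral clustering step), building B is a one-hot encoding; this sketch assumes NumPy and integer cluster labels:

```python
import numpy as np

def belongingness(labels, V):
    """One-hot belongingness matrix B (N x V): b_ij = 1 iff sequence i
    was assigned to cluster j by the clustering step."""
    B = np.zeros((len(labels), V))
    B[np.arange(len(labels)), labels] = 1.0
    return B
```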
Sequence Probability Matrix: The conditional probability of peptide sequence si belonging to class z (siz = P̂ (y =
z|si )) is estimated with an N × D matrix S where D is the
number of affinity classes we want to classify.
Cluster Probability Matrix: The conditional probability
of cluster cj belonging to class z (cjz = P̂ (y = z|cj )) is
estimated as a V × D matrix C, where cjz represents the
probability of a cluster cj belonging to a class z.
Sequence Labeled Matrix: The sequences in the labeled set L have initial class labels, represented by an N × D matrix F, where f_iz = 1 if we know in advance that sequence s_i belongs to class z, and 0 otherwise.
Cluster Labeled Matrix: We may also have prior information that some cluster belongs to a specific class. We use a V × D matrix Y to define initial labels for clusters, where y_jz = 1 denotes that we are confident, based on some prior knowledge, that cluster c_j represents class z, and 0 otherwise. In our case, we assign cluster c_j to a specific class z if all the source sequences in it belong to the same class z.
Cluster Similarity Matrix: In addition, a V × V matrix W denotes the similarity among the sequence clusters, where w_ij is the similarity between clusters c_i and c_j. In our work, the pairwise cluster similarity is calculated as the TSS_{A-B} between the sets of sequence binders A and B.
Now we formulate the affinity class identification problem as the following objective function:

\min_{S,C} J(S,C) = \sum_{i=1}^{N} \sum_{j=1}^{V} b_{ij} \|\vec{s}_{i.} - \vec{c}_{j.}\|^2 + \alpha \sum_{i=1}^{V} \sum_{j=1}^{V} w_{ij} \|\vec{c}_{i.} - \vec{c}_{j.}\|^2 + \beta \sum_{i=1}^{N} h_i \|\vec{s}_{i.} - \vec{f}_{i.}\|^2 + \gamma \sum_{j=1}^{V} k_j \|\vec{c}_{j.} - \vec{y}_{j.}\|^2,    (4)

subject to the following conditions:

\sum_{z=1}^{D} s_{iz} = 1, \; s_{iz} \geq 0; \qquad \sum_{z=1}^{D} c_{jz} = 1, \; c_{jz} \geq 0,

where \|.\|_2 indicates the L2 norm. The first term in Eq. (4), \sum_{i=1}^{N} \sum_{j=1}^{V} b_{ij} \|\vec{s}_{i.} - \vec{c}_{j.}\|^2, ensures that a sequence should have a probability vector similar to that of the cluster it belongs to; namely, cluster c_j should correspond to class z if the majority of sequences in this cluster belong to class z. Intuitively, the higher the deviation, the larger the penalty. The second term, \alpha \sum_{i=1}^{V} \sum_{j=1}^{V} w_{ij} \|\vec{c}_{i.} - \vec{c}_{j.}\|^2, corresponds to the intuition that clusters close to each other should have similar classes, and \alpha denotes the confidence in this source of information. From the view of graph theory, this term propagates the class information among the clusters. The third term, \beta \sum_{i=1}^{N} h_i \|\vec{s}_{i.} - \vec{f}_{i.}\|^2, imposes the constraint that the predictions should not deviate too much from the corresponding sequence ground truth, and \beta expresses the confidence of our belief in the prior knowledge of sequences. Similarly, the last term, \gamma \sum_{j=1}^{V} k_j \|\vec{c}_{j.} - \vec{y}_{j.}\|^2, penalizes the deviation between the prediction results and our prior knowledge of clusters, and \gamma expresses the confidence of our belief in the prior knowledge of clusters.

4.3 Iterative Update Algorithm

It is easy to prove that the objective function Eq. (4) is a convex program, which makes it possible to find a globally optimal solution. To obtain the optimal solution for matrices S and C, we propose to solve Eq. (4) using the block coordinate descent method [2]. At iteration t, fixing the value of \vec{s}_{i.}, we can take the partial derivative with respect to \vec{c}^{\,t}_{j.} in Eq. (4), set it to 0, and obtain the update formula as Eq. (5):

\vec{c}^{\,t}_{j.} = \frac{\sum_{i=1}^{N} b_{ij} \vec{s}^{\,t-1}_{i.} + \gamma k_j \vec{y}_{j.}}{\sum_{i=1}^{N} b_{ij} + \gamma k_j + \alpha (\tilde{i}_{j.} - \tilde{l}_{j.})}.    (5)

Accordingly, the update can be represented in matrix form as Eq. (6):

C^t = (D_v + \gamma K_v + \alpha (I - \tilde{L}))^{-1} (A^T S^{t-1} + \gamma K_v Y),    (6)

where D_v = diag\{\sum_{i=1}^{N} b_{ij}\} is the normalization factor, K_v = diag\{\sum_{z=1}^{D} y_{jz}\} indicates the constraints for the clusters, and diag denotes the diagonal elements of a matrix. Furthermore, \tilde{L} is the normalized Laplacian [35] defined as \tilde{L} = D_w^{-1/2} W D_w^{-1/2}, where D_w is the diagonal degree matrix of W.

The Hessian matrix with respect to C is the sum of a diagonal matrix with entries \sum_{i=1}^{N} b_{ij} + \alpha > 0 and I - \tilde{L}. The diagonal matrix is positive definite, and it is easy to prove that I - \tilde{L} is positive semi-definite. Thus, the Hessian matrix is positive definite, which means that the derivative for C gives the unique minimum of Eq. (4). Similarly, we can obtain the update formula Eq. (7) with respect to \vec{s}_{i.} by fixing \vec{c}^{\,t}_{j.}:

\vec{s}^{\,t}_{i.} = \frac{\sum_{j=1}^{V} b_{ij} \vec{c}^{\,t}_{j.} + \beta h_i \vec{f}_{i.}}{\sum_{j=1}^{V} b_{ij} + \beta h_i}.    (7)

Also, the matrix form of Eq. (7) is as follows:

S^t = (D_n + \beta H_n)^{-1} (A C^t + \beta H_n F),    (8)

where D_n = diag\{\sum_{j=1}^{V} b_{ij}\} is the normalization factor and H_n = diag\{\sum_{z=1}^{D} f_{iz}\} indicates the constraints for the sequences. The Hessian matrix here is also a diagonal matrix, with diagonal elements \sum_{j=1}^{V} b_{ij} + \beta h_i > 0, which means that the derivative for S gives the unique minimum of Eq. (4). To sum up, the pseudo-code for iteratively solving Eq. (4) by the block coordinate descent method is shown in Algorithm 1, where ϵ is a convergence threshold. Since the proposed method is based on a graph model, we name our approach Peptide Sequences Identification Graph Model (PSIGM).
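The alternating updates can be sketched in NumPy as follows. Two assumptions are made: A is taken to be the belongingness matrix B, as suggested by the elementwise forms in Eqs. (5) and (7), and S is initialized uniformly rather than randomly, for determinism:

```python
import numpy as np

def psigm(B, F, Y, W, alpha=2.0, beta=10.0, gamma=3.0, eps=1e-4, max_iter=100):
    """Block coordinate descent on Eq. (4): alternate Eq. (6) and Eq. (8)."""
    N, V = B.shape
    deg = np.maximum(W.sum(axis=1), 1e-12)          # cluster degrees (guarded)
    L = W / np.sqrt(np.outer(deg, deg))             # L~ = Dw^-1/2 W Dw^-1/2
    Dv = np.diag(B.sum(axis=0))                     # per-cluster normalization
    Kv = np.diag(Y.sum(axis=1))                     # labeled-cluster indicator
    Dn = np.diag(B.sum(axis=1))                     # per-sequence normalization
    Hn = np.diag(F.sum(axis=1))                     # labeled-sequence indicator
    S = np.full((N, F.shape[1]), 1.0 / F.shape[1])  # uniform initialization
    for _ in range(max_iter):
        C = np.linalg.solve(Dv + gamma * Kv + alpha * (np.eye(V) - L),
                            B.T @ S + gamma * Kv @ Y)               # Eq. (6)
        S_new = np.linalg.solve(Dn + beta * Hn, B @ C + beta * Hn @ F)  # Eq. (8)
        if np.abs(S_new - S).max() <= eps:          # convergence test
            return S_new
        S = S_new
    return S
```

On a toy problem with two clusters and two classes, the labeled sequences dominate their own predictions (β = 10) and their clusters pull the unlabeled members toward the same class.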
Algorithm 1 The PSIGM Algorithm
Input: Belongingness matrix B_{N×V}, sequence labeled matrix F_{N×D}, cluster labeled matrix Y_{V×D}, cluster similarity matrix W_{V×V}, and parameters α, β, γ, ϵ
Output: Estimated sequence probability matrix S_{N×D}
1: Initialize S^0 randomly;
2: t ← 1;
3: begin
4: repeat
5:   Update C^t using Eq. (6);
6:   Update S^t using Eq. (8);
7:   t ← t + 1;
8: until ∥S^t − S^{t−1}∥ ≤ ϵ
9: Output S^t;
10: end

This iterative process is a procedure of information propagation among the clusters. In each iteration step, each cluster estimates its class based on its members' classes while retaining its initial class Y. After all the clusters have received the label information, they propagate it to their neighboring clusters based on the smoothness assumption. After the clusters have received the information from their neighbors, they pass it back to the nodes belonging to them, while the nodes retain their initial classes F. This process continues until convergence. To better demonstrate this, Figure 3 shows an example of the information propagation.

[Figure 3: four panels (A)-(D) showing six numbered nodes grouped into clusters; node classes: strong, weak, test, simulated, unlabeled; cluster classes: strong, weak.]
Figure 3: An example illustrating the label propagation at each iteration. (A) partition of all the nodes (each sequence is represented as a node here) into multiple clusters; (B) conditional probability estimates of the clusters C, receiving the label information from the sequences (nodes) belonging to them; (C) each cluster propagates its class information to its neighboring clusters; and (D) after updating the probabilities, each cluster passes the label information back to its members' conditional probabilities S.
4.4 Time Complexity
The time complexity of the proposed algorithm is composed of two parts: updating the cluster probability matrix C and updating the sequence probability matrix S. For updating C, the time complexity is O(VN²D + V³ + V²D), where N is the size of the peptide sequence pool, V is the number of clusters and D is the number of affinity classes. Since in our case the size of the sequence set is usually much larger than the number of clusters, the time complexity of the first step is O(VN²D). For updating S, the time complexity is O(N²D + NVD), thus O(N²D). Therefore, the overall time complexity per iteration is O(VN²D). Supposing the number of iterations is k, the time complexity of the whole algorithm is O(kVN²D). In experiments, we observe that k is usually between 8 and 20.
5. EXPERIMENTS
In the following, we conduct two extensive experiments to show the proposed framework's performance. First, the experiment on the quartz binding sequence dataset shows that PSIGM is effective at identifying the binding affinity classes of inorganic material binding sequences. Second, the experiment on the SCOP protein sequence dataset shows that PSIGM is a general framework that also works effectively on other kinds of sequence sets. Because most of our baselines are designed as binary classifiers, for the sake of simplicity the following experiments only consider weak versus strong binder class identification on the dataset mentioned in Section 2.1, although our proposed method is not restricted to binary classification. Throughout all the experiments, we set α = 2, β = 10, and γ = 3 for our algorithm. The rationale is that both α and γ depend on the clustering result, which is influenced by uncertain factors such as the number of clusters and the initial center of each cluster, so assigning them relatively low values is better; on the other hand, β reflects our confidence in the labeled sequences, which come from strict and reliable experiments, so β should be assigned a relatively large value.
5.1 Experiment on Material Binding Sequences
To show that our proposed framework is effective at predicting the binding affinity class of inorganic material binding sequences, we perform the following three experiments to demonstrate that: 1) simulated sequences and information propagation among the clusters can effectively alleviate the limitation due to Challenge I; 2) searching for patterns in clusters rather than classes helps handle Challenge II; and 3) the newly generated transition matrix contributes to the proposed method's performance. For each accuracy value shown in the following experiments, we repeat the run 5 times using leave-one-out validation and report the mean value. In addition, for each experiment, we fix the convergence threshold ϵ to 10^-4.
The effect of simulated sequences and cluster information propagation. To test the effect of the simulated sequences and of the information propagation among clusters, we run our method without the simulated sequences and without the information propagation among clusters, respectively. Furthermore, as mentioned in Section 4.2, we need to partition all the sequences into V clusters, so we also examine the relationship between the proposed method's performance and the number of clusters, varying it over 2, 5, 10, 15, 18 and 20.
We want to show that both strategies, sequence simulation and information propagation among clusters, are crucial to the performance. The results are shown in Figure 4, where the x axis denotes the number of clusters and the y axis denotes the accuracy. We can see that the simulated sequences contribute to the performance improvement. Likewise, in the absence of information propagation among clusters, the performance drops, which is the expected behavior. Finally, we notice that hill-like shapes appear as the number of clusters increases in all three cases, likely because when the number of clusters is too low, the view is close to a global one, and when it is too high, the clusters become too small to learn from.
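The partitioning into V clusters can be illustrated with a minimal sketch; the greedy agglomerative merging below over a pairwise similarity matrix is a stand-in under stated assumptions, not necessarily the clustering algorithm the framework actually uses in Section 4.2:

```python
# Minimal sketch: partition sequences into V clusters by greedily
# merging the pair of clusters with the highest average pairwise
# similarity. A stand-in for the actual clustering step of Section 4.2.

def agglomerate(sim, V):
    """sim: symmetric pairwise similarity matrix (list of lists);
    V: desired number of clusters. Returns lists of sequence indices."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > V:
        best, pair = float("-inf"), (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average similarity between cluster a and cluster b
                s = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

Varying V here corresponds to the cluster counts 2 through 20 swept in the experiment above.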
Figure 4: Performance comparison with different strategies.

Performance comparison with baselines. In this part, we compare the proposed method with 5 other algorithms mentioned above: SVM, Neural Network (NN), HMM, Learning with Local and Global Consistency (LLGC) [35], a well-known graph-based semi-supervised learning algorithm, and the approach proposed in [25]. For fairness and comprehensiveness, we have also tried adding the simulated sequences used by our framework to these methods; the resulting variants are marked as SVM*, NN* and HMM*. Note that since LLGC was designed as a semi-supervised algorithm that needs unlabeled instances to propagate the label information, we only consider LLGC with the simulated sequences used as unlabeled data. All the methods thus fall into two groups: those using the simulated sequences and those without them. To measure the influence of the mutation rate t used to generate the simulated sequences, we vary t over 5%, 10%, 15% and 20%. The results of predicting the quartz binding affinity classes compared with the baselines are shown in Table 3.

Table 3: Comparison with baselines under different mutation rates

  Method        5%     10%    15%    20%
  PSIGM         0.82   0.82   0.84   0.83
  LLGC          0.60   0.63   0.60   0.62
  NN*           0.60   0.62   0.58   0.55
  HMM*          0.66   0.65   0.69   0.67
  SVM*          0.62   0.63   0.58   0.59
  SVM           0.68   (no simulated sequences)
  NN            0.65   (no simulated sequences)
  HMM           0.70   (no simulated sequences)
  Oren et al.   0.80   (no simulated sequences)

(SVM, NN, HMM and the approach of Oren et al. do not use the simulated sequences, so their accuracies do not depend on the mutation rate.)

Note that the proposed method outperforms the others in predicting the affinity classes of the test inorganic material binding sequences. In addition, the performance of the proposed method is not very sensitive to the mutation rate as long as it stays in a relatively low range. It is worth noticing that, instead of aiding the performance, the simulated sequences in SVM*, HMM* and NN* make the performance worse than in SVM, HMM and NN without them. The main reason for this phenomenon is that all these methods take a global view of the training data, which is represented as only two large classes: the strong binding class and the weak binding class. However, as mentioned, the sequences inside the same class may be very different, so the more simulated sequences are added, the less obvious the patterns become. Since the proposed method treats the sequences locally as clusters, it can properly handle this problem.

Comparison with varying transition matrices. Finally, we show the performance of the proposed method with different transition matrices. We want to demonstrate that the newly generated matrix improves the performance of the proposed algorithm for the target inorganic material. In this experiment, we again vary the mutation rate t over 5%, 10%, 15% and 20%. We compare the new transition matrix M1, generated from Blosum 62, and M2, generated from Pam 250, with four other widely-used transition matrices, Blosum 62, Pam 250, Dayhoff [6] and Gonnet [12], as shown in Table 4. The results show that the newly generated transition matrices perform better than the others at every mutation rate.
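As an illustration of how a mutation rate t and a transition matrix combine to produce simulated sequences, consider the following sketch; the three-letter alphabet and the `mutate` helper are toy stand-ins for the 20 amino acids and the two-step generation procedure of Section 4.1:

```python
import random

# Sketch of simulated-sequence generation: each residue is mutated with
# probability t (the mutation rate), and the replacement is drawn from
# the residue's row of a substitution/transition matrix. The alphabet
# and matrix here are toys; the real framework uses the 20 amino acids
# and the matrices M1/M2 derived from Blosum 62 / Pam 250.

def mutate(seq, t, transition, alphabet, rng=random):
    out = []
    for aa in seq:
        if rng.random() < t:
            # substitution probabilities for this residue
            weights = [transition[aa][b] for b in alphabet]
            out.append(rng.choices(alphabet, weights=weights)[0])
        else:
            out.append(aa)
    return "".join(out)
```

At t = 0 the sequence is returned unchanged; higher t yields more diverse simulated training sequences, matching the 5% to 20% sweep above.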
Table 4: Performance with different transition matrices

  Transition Matrix   5%     10%    15%    20%
  M1                  0.82   0.82   0.84   0.83
  M2                  0.80   0.81   0.82   0.82
  Blosum 62           0.72   0.74   0.75   0.74
  Pam 250             0.72   0.73   0.73   0.74
  Dayhoff             0.71   0.71   0.72   0.72
  Gonnet              0.79   0.78   0.79   0.80

5.2 Experiment on Protein Sequences
Actually, the proposed PSIGM is a general framework that is not limited to identifying the affinity classes of inorganic material binding sequences. To demonstrate this, we use the SCOP protein data mentioned in Section 2.1. Instead of predicting the sequences' affinity classes, we consider the problem of homology family prediction: for a specific family, can the proposed framework identify the sequences belonging to it among the remaining families? Correspondingly, we construct seven identification tasks from this dataset, where the sequences from one particular family are used as the positive set and the sequences from the remaining six families are used as the negative set. For example, when the sequences in family A are used as the positive set, the sequences from families B, C, D, E, F and G are used as the negative set. Two experiments are performed to demonstrate that: 1) PSIGM is a general framework that can also handle traditional protein sequence identification; and 2) a moderate setting of the mutation rate is conducive to improving the performance.
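The construction of these one-vs-rest tasks can be sketched directly; `one_vs_rest_tasks` and the `families` mapping are illustrative names, not part of our implementation:

```python
# Building the seven one-vs-rest identification tasks from the SCOP
# families: each family's sequences form the positive set, and the
# sequences of all remaining families form the negative set.
# `families` is a hypothetical dict mapping family name -> sequences.

def one_vs_rest_tasks(families):
    tasks = {}
    for name, seqs in families.items():
        positives = list(seqs)
        negatives = [s for other, group in families.items()
                     if other != name for s in group]
        tasks[name] = (positives, negatives)
    return tasks
```

For family A, this yields exactly the split described above: A as positives and B through G pooled as negatives.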
Performance comparison with baselines. It is worth noticing that the data, handled in this way, exhibits the characteristics of inorganic binding sequences to some extent. First, the number of labeled sequences is limited: as shown in Table 2, only around 20 sequences are given for each family. Second, since our negative set is a collection from several different families, the sequences in it do not follow the same pattern. Table 5 shows the results of predicting the homology family compared with the baselines mentioned in Section 5.1. Note that each result shown here is the average over 10 runs of 5-fold cross validation. As the table shows, the proposed method outperforms the other methods on each protein family's prediction. Note also that the accuracies of predicting the homology families are much higher than those of predicting the affinity classes of the inorganic material binding sequences. The reason is well explained by Figure 5, which shows the self-class similarity of each prediction task. Unlike the case shown in Figure 1, where the cross-class similarity is remarkably higher than the weak self-class similarity, in this protein sequence dataset all the self-class similarities are higher than or close to the cross-class similarities. As we know, the more the cross-class similarity surpasses the self-class similarity, the more difficult the two classes are to separate.
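This comparison of self-class and cross-class similarity can be made concrete with a small sketch; the identity-count score below is a crude stand-in for the Pam 250-based alignment scores used for Figure 5:

```python
# Sketch of the self-class vs. cross-class similarity comparison behind
# Figure 5. `score` can be any pairwise sequence similarity (the paper
# uses Pam 250-based alignment scores); a simple identity count over
# aligned positions stands in here for illustration.

def identity_score(a, b):
    return sum(x == y for x, y in zip(a, b))

def class_similarities(pos, neg, score=identity_score):
    """Total self-class similarity (all pairs within the positive set)
    and total cross-class similarity (all positive-negative pairs)."""
    self_sim = sum(score(a, b) for i, a in enumerate(pos)
                   for b in pos[i + 1:])
    cross_sim = sum(score(a, b) for a in pos for b in neg)
    return self_sim, cross_sim
```

When the self-class total clearly exceeds the cross-class total, as in this protein dataset, the two classes are easier to separate.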
Table 5: Comparison with baselines at each class

  Family   PSIGM   LLGC   SVM     HMM     NN
  A        0.999   0.80   0.998   0.966   0.913
  B        0.965   0.81   0.959   0.944   0.947
  C        0.999   0.82   0.999   0.952   0.968
  D        0.999   0.82   0.999   0.969   0.999
  E        0.999   0.82   0.985   0.935   0.999
  F        0.978   0.82   0.946   0.952   0.857
  G        0.987   0.82   0.975   0.972   0.929

Parameter sensitivity. The performance of PSIGM is influenced by the setting of the mutation rate used to generate the simulated sequences. To fully evaluate how the mutation rate affects the performance, we increase it from 0.05 to 0.3 with a step of 0.05 and report the accuracy of each family's prediction task in Figure 6. It is clear that most families show an increase in accuracy as the mutation rate rises until reaching a threshold of 15%, after which the performance begins to decrease. This shows that the performance of PSIGM can be improved by a moderate setting of the mutation rate. In addition, we can infer that PSIGM is effective not only at identifying the affinity classes of the inorganic material binding sequences but also at predicting the homology families of traditional protein sequences.

Figure 6: Parameter sensitivity experiment (AUC of each family's prediction task as the mutation rate varies from 0.05 to 0.3).
6. RELATED WORK

As an emerging research topic, there is very little published work on identifying the affinity classes of inorganic material binding sequences that we can compare to. However, much research has been devoted to the similar question of identifying homologs of protein sequences. The Hidden Markov Model (HMM) is a widely-used probabilistic modeling method for protein homology detection [8, 20, 29]; it first builds a probability model for each sequence family and then calculates the likelihood that an unknown sequence fits each family. Another type of direct modeling method for protein homology detection is based on neural networks [32, 34], whose multilayer nature allows them to discover non-linear, higher-order correlations among the sequences. As a widely-used machine learning algorithm, the Support Vector Machine (SVM) [13] has also been applied to protein homology detection problems. Mak et al. [21] proposed an SVM-based model named PairProSVM to automatically predict the subcellular locations of protein sequences. Karchin et al. [17] combined the HMM with the SVM to identify protein homologies. However, these methods are inappropriate in our case for two reasons.
First, they require a training set consisting of sufficient labeled examples. Second, they try to learn a pattern from each class, which may not exist at that level.

Figure 5: Total similarity scores of the self-class and the cross-class for each prediction task based on Pam 250. (A) class A and non-A; (B) class B and non-B; (C) class C and non-C; (D) class D and non-D; (E) class E and non-E; (F) class F and non-F; (G) class G and non-G.
Moreover, besides these differences from traditional classification approaches, the proposed framework also differs from the following work: 1) [25] proposed a method to generate a new transition matrix and to make the classification based on it. The main difference between our work and Oren's work is that they still treat the sequence classification problem from a global view. 2) [11] proposed a consensus maximization model to solve the problem of finding informative genes from multiple studies. Although the proposed method shares the same intuition as Ge's work, in which a cluster should correspond to a particular class z if the majority of instances in the cluster belong to class z, their method aims at making reliable predictions by utilizing multiple sources, which is quite different from our topic.
7. CONCLUSION AND FUTURE WORK
Identifying the affinity classes of peptide sequences binding to a specific inorganic material is a challenging research problem with broad applications. In this paper, we proposed a novel framework, PSIGM, to solve this problem. We first provide a two-step simulated peptide sequence generation method to make the training set more comprehensive and diverse. Moreover, unlike traditional machine learning approaches for protein sequence identification that try to find patterns at the class level, our framework partitions the sequences into smaller clusters and learns the patterns from them using a graph-based optimization model. Extensive experimental studies demonstrate that the proposed framework can effectively identify the affinity classes of inorganic material binding sequences.
In the future, we will use a cyclic model to validate and retrain our framework: first, the model selects the sequences with the highest and lowest probabilities of binding to a target inorganic material as a candidate set; second, we validate the candidate set with quick experimental methods such as QCM (Quartz Crystal Microbalance); finally, the validated sequences are used to retrain the model, after which new candidate sequences are selected from the sequence database based on their affinity, and so forth. We believe that with this cyclic validation model we can not only validate our model's effectiveness but also keep improving it through retraining.
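The cyclic procedure described above can be sketched as a simple loop; every name here (`model.rank`, `validate`, `retrain`) is a placeholder for the trained model, the quick experimental assay such as QCM, and the retraining step, respectively:

```python
# Sketch of the cyclic validate-and-retrain model. All functions are
# hypothetical placeholders: `model.rank` scores a sequence's predicted
# binding affinity, `validate` stands in for a quick experimental assay
# (e.g. QCM), and `retrain` folds validated labels back into the model.

def cyclic_refinement(model, database, validate, rounds=3, k=10):
    for _ in range(rounds):
        ranked = sorted(database, key=model.rank, reverse=True)
        # candidates: sequences with the highest and lowest predicted affinity
        candidates = ranked[:k] + ranked[-k:]
        labels = {seq: validate(seq) for seq in candidates}
        model = model.retrain(labels)
    return model
```

Each round both tests the current model against experiment and enlarges the reliable training set, which is exactly the feedback loop envisioned above.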
8. REFERENCES
[1] Afiahayati and S. Hartati. Multiple sequence alignment using hidden Markov model with augmented set based on BLOSUM 80 and its influence on phylogenetic accuracy, 2010.
[2] D. P. Bertsekas. Nonlinear Programming, volume 43.
Athena Scientific, 1995.
[3] E. T. Boder and K. D. Wittrup. Yeast surface display
for screening combinatorial polypeptide libraries.
Nature Biotechnology, 15(6):553–557, 1997.
[4] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe. Probing
the interaction between peptides and metal oxides
using point mutants of a tio2-binding peptide.
Langmuir The Acs Journal Of Surfaces And Colloids,
24(13):6852–6857, 2008.
[5] K. Y. Dane, C. Gottstein, and P. S. Daugherty. Cell
surface profiling with peptide libraries yields ligand
arrays that classify breast tumor subtypes. Molecular
Cancer Therapeutics, 8(5):1312–1318, 2009.
[6] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5(Suppl 3):345–352, 1978.
[7] S. Donatan, M. Sarikaya, C. Tamerler, and M. Urgen. Effect of solid surface charge on the binding behaviour of a metal-binding peptide. Journal of the Royal Society Interface, 2012.
[8] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[9] E. Estephan, C. Larroque, F. J. G. Cuisinier, Z. Bálint, and C. Gergely. Tailoring GaN semiconductor surfaces with biomolecules. The Journal of Physical Chemistry B, 112(29):8799–8805, 2008.
[10] E. Estephan, M.-b. Saab, C. Larroque, M. Martin, F. Olsson, S. Lourdudoss, and C. Gergely. Peptides for functionalization of InP semiconductors. Journal of Colloid and Interface Science, 337(2):358–363, 2009.
[11] L. Ge, N. Du, and A. Zhang. Finding informative genes from multiple microarray experiments: A graph-based consensus maximization model, 2011.
[12] G. H. Gonnet, M. A. Cohen, and S. A. Benner. Exhaustive matching of the entire protein sequence database. Science, 256(5062):1443–1445, 1992.
[13] M. J. Grimble. Adaptive systems for signal processing, communications and control. Control, 3, 2001.
[14] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.
[15] M. Hnilova, E. E. Oren, U. O. S. Seker, B. R. Wilson, S. Collino, J. S. Evans, C. Tamerler, and M. Sarikaya. Effect of molecular conformations on the adsorption behavior of gold-binding peptides. Langmuir, 24(21):12440–12445, 2008.
[16] M. K. Kalita, U. K. Nandal, A. Pattnaik, A. Sivalingam, G. Ramasamy, M. Kumar, G. P. S. Raghava, and D. Gupta. CyclinPred: An SVM-based method for predicting cyclin protein sequences. PLoS ONE, 3(7), 2008.
[17] R. Karchin, K. Karplus, and D. Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 18(1):147–159, 2002.
[18] E. Kasotakis, E. Mossou, L. Adler-Abramovich, E. P. Mitchell, V. T. Forsyth, E. Gazit, and A. Mitraki. Design of metal-binding sites onto self-assembled peptide fibrils. Biopolymers, 92(3):164–172, 2009.
[19] J. Kimling, M. Maier, B. Okenve, V. Kotaidis, H. Ballot, and A. Plech. Turkevich method for gold nanoparticle synthesis revisited. The Journal of Physical Chemistry B, 110(32):15700–15707, 2006.
[20] A. Kumar and L. Cowen. Augmented training of hidden Markov models to recognize remote homologs via simulated evolution. Bioinformatics, 25(13):1602–1608, 2009.
[21] M.-W. Mak, J. Guo, and S.-Y. Kung. PairProSVM: Protein subcellular localization based on local pairwise profile alignment and SVM. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(3):416–422, 2008.
[22] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4):536–540, 1995.
[23] R. R. Naik, L. L. Brott, S. J. Clarson, and M. O.
Stone. Silica-precipitating peptides isolated from a
combinatorial phage display peptide library. Journal of
Nanoscience and Nanotechnology, 2(1):95–100, 2002.
[24] S. B. Needleman and C. D. Wunsch. A general
method applicable to the search for similarities in the
amino acid sequence of two proteins. Journal of
Molecular Biology, 48(3):443–453, 1970.
[25] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova,
U. O. S. Seker, M. Sarikaya, and R. Samudrala. A
novel knowledge-based approach to design
inorganic-binding peptides. Bioinformatics,
23(21):2816–2822, 2007.
[26] W. R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183:63–98, 1990.
[27] G. P. Smith. Filamentous fusion phage: novel
expression vectors that display cloned antigens on the
virion surface. Science, 228(4705):1315–1317, 1985.
[28] T. F. Smith and M. S. Waterman. Identification of
common molecular subsequences. Journal of Molecular
Biology, 147(1):195–197, 1981.
[29] N. Terrapon, O. Gascuel, E. Marechal, and
L. Brehelin. Fitting hidden markov models of protein
domains to a target species: application to
plasmodium falciparum. BMC Bioinformatics,
13(1):67, 2012.
[30] A. Vila Verde, P. J. Beltramo, and J. K. Maranas.
Adsorption of homopolypeptides on gold investigated
using atomistic molecular dynamics. Langmuir The
Acs Journal Of Surfaces And Colloids,
27(10):5918–5926, 2011.
[31] U. Von Luxburg. A tutorial on spectral clustering.
Statistics and Computing, 17(4):395–416, 2007.
[32] D. Wang, N. K. Lee, T. S. Dillon, and N. J. Hoogenraad. Protein sequences classification using radial basis function (RBF) neural networks, 2002.
[33] M. Wistrand and E. L. L. Sonnhammer. Improving
profile hmm discrimination by adapting transition
probabilities. Journal of Molecular Biology,
338(4):847–854, 2004.
[34] C. H. Wu, S. Zhao, H. L. Chen, C. J. Lo, and
J. McLarty. Motif identification neural design for
rapid and sensitive protein family search. Computer
applications in the biosciences CABIOS,
12(2):109–118, 1996.
[35] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems 16, pages 595–602, 2003.
[36] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2007.