Identifying Affinity Classes of Inorganic Materials

Identifying Affinity Classes of Inorganic Materials
Binding Sequences via a Graph-based Model
Nan Du, Marc R. Knecht, Mark T. Swihart, Zhenghua Tang,
Tiffany R. Walsh and Aidong Zhang
Abstract—Rapid advances in bionanotechnology have recently generated growing interest in identifying peptides that bind to inorganic
materials and classifying them based on their inorganic material affinities. However, there are some distinct characteristics of inorganic
materials binding sequence data that limit the performance of many widely-used classification methods when applied to this problem.
In this paper, we propose a novel framework to predict the affinity classes of peptide sequences with respect to an associated inorganic
material. We first generate a large set of simulated peptide sequences based on an amino acid transition matrix tailored for the specific
inorganic material. Then the probability of test sequences belonging to a specific affinity class is calculated by minimizing an objective
function. In addition, the objective function is minimized through iterative propagation of probability estimates among sequences and
sequence clusters. Results of computational experiments on two real inorganic material binding sequence datasets show that the
proposed framework is highly effective for identifying the affinity classes of inorganic material binding sequences. Moreover,
experiments on the SCOP (structural classification of proteins) dataset show that the proposed framework is general and can be
applied to traditional protein sequences.
Index Terms—inorganic material, peptide sequences, classification
1 INTRODUCTION
Over the past decade, many studies have analyzed
peptide sequences with affinity to biological entities such as enzymes, cells, viruses, lipids and
proteins. Recently, interest in identifying and classifying
peptides that interact specifically with inorganic materials has grown. These inorganic materials binding peptide
sequences have been identified from biocombinatorial
peptide libraries using phage display [1], cell surface
display [2], and yeast display [3].
In particular, numerous studies have been reported
about the peptide sequences that bind to the inorganic
materials, such as noble metals (gold, silver, platinum)
[4], [5], [6], [7], [8], semiconductors (zinc sulfide, cadmium sulfide) [9], [10], [11], [12], and metal oxides (silica,
titanium and magnetite) [13], [14], [15], [16], [17], [18],
[19], which are of great interest for applications in technology and medicine. Inorganic material binding peptide
sequences, which are usually 7-14 amino acids long, are
differentiated from other polypeptides by their specific
molecular recognition properties for targeted inorganic
material surfaces [20]. Effectively identifying the affinity
class, which indicates the binding strength of a specific
sequence with respect to the target inorganic material,
is crucial for designing novel peptides [21]. The
binding affinity of a peptide to an inorganic surface is
the result of a complex interplay between the binding
strength of its individual residues and its conformation.
The binding strength of a sequence for a specific material
is usually measured by the adsorption free energy
(ΔG_ads), which is then used to assign each sequence
to a weak, medium, or strong affinity class.

• Nan Du and Aidong Zhang are with the Computer Science and Engineering Department, University at Buffalo (SUNY), Buffalo, NY 14260.
E-mail: {nandu, azhang}@buffalo.edu
• Marc R. Knecht and Zhenghua Tang are with the Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146.
E-mail: {knecht, z.tang}@miami.edu
• Mark T. Swihart is with the Department of Chemical and Biological Engineering, University at Buffalo (SUNY), Buffalo, NY 14260.
E-mail: swihart@buffalo.edu
• Tiffany R. Walsh is with the Institute for Frontier Materials, Deakin University, Geelong, Vic. 3216, Australia.
E-mail: tiffany.walsh@deakin.edu.au
Despite extensive recent reports on combinatorially selected inorganic binding peptides and their bionanotechnological utility as synthesizers and molecular linkers
[22], [23], [20], there is still limited knowledge about the
relationships between binding peptide sequences and
their associated inorganic materials. Therefore, by using
machine learning technology to suggest sequence affinity
classes, we can predict new sequences having desired
affinity for specific inorganic materials, without doing
new large-scale screenings via phage display.
Various approaches have been used or developed for
recognizing both close and distant homologs of given
protein sequences, which is one of the central themes
in bioinformatics. Most of the work is based on established machine learning models such as the hidden Markov
model (HMM) [24], [25], the neural network (NN) [26], [27]
and the support vector machine (SVM) [28]. However,
identifying the affinity classes of inorganic material binding peptide sequences
poses some distinct challenges
that are rarely faced in protein sequence identification
and that markedly limit the performance of the models
mentioned above, despite their success in detecting other types
of protein sequences.
Challenge I: The number of labeled samples is usually insufficient. Peptide sequences that bind solid inorganic materials
have been identified only in the last decade and are not
as well studied as protein sequences, whose analysis
has a much longer history. For example, unlike
protein sequence analysis, which has numerous large-scale
public datasets such as GPCR [29] or SCOP [30], no
complete result of large-scale screening experiments has
been made publicly available for inorganic material
binding sequences, so the available data are usually scarce.
Most existing protein sequence classification approaches
require a large set of labeled samples to train an accurate
model. However, labeling the affinity classes for a large
number of inorganic material binding sequences is very
time-consuming and expensive, and is thus usually infeasible. If only a limited number of labeled samples are
available for model training, the learned model may
suffer from over-fitting or under-fitting.
Semi-supervised learning (SSL) [31], a machine learning method
which has received much attention in the past decade,
is good at handling the lack of sufficient
labeled training data. However, the utility of
this method may be markedly limited due to the next
challenge.
Challenge II: The peptide sequences belonging to
the same affinity class may be very dissimilar. Usually,
protein sequences which belong to the same family
follow some apparent patterns; in other words, they
are similar to each other in some respects. However, the
“similarity” between inorganic material binding peptide
sequences from the same affinity class may not be so
apparent. In some cases, the intra-similarity, which measures the similarity of all sequences inside the same class,
is even lower than the inter-similarity, which measures the
similarity among sequences from different classes.
This phenomenon also means that some peptide sequences
belonging to the same class may be dissimilar to each
other, at least according to current knowledge. This observation reflects the fact that inorganic material binding
sequences do not satisfy the smoothness assumption
at the class level, which is generally assumed in both
supervised learning and semi-supervised learning.
In light of these challenges for inorganic material binding sequence affinity classes identification, we propose
a novel framework which includes two parts. First, to
tackle the insufficient data challenge, we augment the
training sequence set with simulated sequences which
are generated based on a new amino acid transition
matrix. By using the simulated sequences, we incorporate not only the prior phylogenetic knowledge but
also the specific sequence patterns responsible for the
target inorganic material into the training data. Second, instead of searching the patterns globally from the
peptide sequences belonging to the same affinity class,
we separate the sequences into smaller clusters and try
to learn the patterns from them locally via a graph-based optimization model. Intuitively, since there are few
obvious patterns that could be found at the class level,
we search for them at the smaller cluster level.
Based on the two strategies mentioned above, we
propose a novel model that combines the sequence simulation and cluster-based sequence affinity identification.
The initial idea was published in [32]. This paper extends
the original idea to formulate a solid method and provide more supportive, comprehensive experiments. The
main process of the proposed method is shown in Fig.
1, where we first use the labeled sequences as seeds to
simulate more sequences, and then all the labeled and
simulated sequences are used to train our graph-based
optimization model which is effective at identifying the
sequences’ affinity classes. We will discuss the proposed
method in detail in the following sections.
[Figure 1: labeled peptide sequences from the strong set (e.g. PPTNSM, HFQN) and the weak set (e.g. LWSTVA, SNLFT) pass through the peptide sequence simulation step, which yields a simulated set (e.g. PPANST, SNMFT); the strong, weak and simulated sets are then given to the graph-based optimization model.]
Fig. 1: Main process of the proposed method.
In this paper, we make the following contributions:
• We introduce the distinct challenges associated with
identifying affinity classes for inorganic material
binding sequences.
• We propose a novel framework which can effectively predict the affinity classes of inorganic
material binding sequences and provide an efficient iterative algorithm to find the optimal solution
of the proposed objective function. Moreover, our
framework is general and is also
effective for identifying the classes of traditional
protein sequences.
• Extensive computational experiments show that
the proposed method outperforms many other baseline methods.
The rest of the paper is organized as follows. In Section
2, we explain the relationship between our work and previous related work. In Section 3, we describe the datasets
used in this paper and the setting of our problem. The
peptide sequence simulation method and the graph-based optimization model are presented in Section 4 and
Section 5, respectively. Extensive experimental results are
shown in Section 6. The conclusion and future work are
presented in Section 7.
2 RELATED WORK
As an emerging research topic, there is very little published work on identifying the affinity classes of inorganic material binding sequences that we can compare
to. However, much research has been devoted to the similar question of identifying the homologs of
protein sequences. The HMM is a widely used probabilistic
modeling method for protein homology detection [24],
[25], [33], which first builds a probabilistic model for each specific sequence family and then calculates the likelihood
of an unknown sequence fitting each family. Another
type of direct modeling method for protein homology
detection is based on neural networks [26], [27], whose
multilayer nature allows them
to discover non-linear higher order correlations among
the sequences. As a widely used machine learning algorithm, the SVM [28] has also been applied to protein
homology detection problems. Mak et al. [34] proposed an
SVM-based model named PairProSVM to automatically
predict the sub-cellular locations of protein sequences.
Karchin et al. [29] combined the HMM with the SVM
to identify protein homologies. Tian et al. proposed
a weighted version of the SVM that weakens the influence of
outliers to improve protein sub-cellular localization
predictions [35]. However, these methods are inappropriate in our case for two reasons. First, they require
a training set consisting of sufficient labeled examples.
Second, they try to learn a pattern from each class,
and such a pattern may not exist at the class level.
Moreover, besides the differences from traditional
classification approaches, the proposed framework
also differs from the following work. 1) Oren et al.
[21] proposed a method to generate a new transition
matrix and perform classification based on it. The
first difference between the work presented here and
Oren’s work is that they only consider the sequence
classification problem via learning the patterns from the
entire sequence set belonging to the same affinity class.
Second, the newly generated transition matrix in [21]
was only used to calculate the pairwise distance between
sequences, whereas in our proposed method the newly generated
matrix is also used to generate the simulated sequences.
2) Ge et al. [36] proposed a consensus maximization
model to solve the problem of finding informative genes
from multiple studies. Although the proposed method
shares the intuition of Ge’s work that a cluster
should correspond to a particular class z if the majority
of instances in this cluster belong to class z, their work aimed
at making reliable predictions by utilizing multiple
experimental results, which is quite different from our
work. In our case, we only have the raw dataset (i.e. labeled inorganic material sequences) rather than multiple
experimental results.
3 DATASETS AND PROBLEM DEFINITION
In this section, we describe the datasets used in this
paper and present the problem definition.
3.1 Datasets
We have used three datasets to demonstrate the proposed method’s performance. The first dataset is from
Oren et al. [21]. This dataset consists of a total of 25 quartz
(rhombohedral silica, SiO2) binding peptide sequences
which were identified using phage-display techniques.
These peptide sequences are further classified into
two classes based on their affinity strength: strong and
weak binder classes, which contain 10 and 15 sequences,
respectively. To better demonstrate the problem and
illustrate the proposed method in the rest of the paper, we
extract a sample set which includes two affinity classes
from this dataset and show it in Table 1.
TABLE 1: Sample set of peptide sequences data
Strong Class                Weak Class
Name     Sequence           Name     Sequence
DS202    RLNPPSQMDPPF       DS201    MEGQYKSNLLFT
DS189    QTWPPPLWFSTS       DS191    VAPRVQNLHFGA
...      ...                ...      ...
The second inorganic material binding peptide sequence dataset is from our systematic study of peptide
binding on gold (Au) [37], combined with the previous
data from Wang et al. [38], to give a total of 32 peptide
sequences. Sequences in our sequence set that follow the
pattern XHXHXHX, where X is an arbitrary amino acid,
are from Wang et al. [38]. Since any peptide sequence
containing cysteine (i.e. amino acid C) can bind
strongly onto the gold surface, sequences that contain cysteine are not considered.
Using the measured adsorption free energies (ΔG, in kJ/mol)
for all the sequences, we drew the boundary between
strong and weak binding sequences such that the weak
class has ΔG > −25 kJ/mol and the strong class has
ΔG ≤ −25 kJ/mol. Note that Hnilova et al. [39] have
shown that the sequence ’TLRRWRDRRILN’ (AUBP30) has
weak binding ability to gold. Although they did not
report the free energy for it, it is very likely to reside in
the weak set based on the qualitative binding analysis.
All the sequences from the strong and weak classes are
listed in Table 2.
TABLE 2: Summary of the gold binding peptide sequences (weak class)
Weak Sequence    ΔG (kJ/mol)
HHHHHHH          -23.9
MHMHMHM          -23.1
RHRHRHR          -22.9
YHYHYHY          -22.8
WHWHWHW          -22.3
KHKHKHK          -22.3
AHAHAHA          -21.6
GHGHGHG          -21.6
QHQHQHQ          -20.5
IHIHIHI          -20.4
NHNHNHN          -19.9
VHVHVHV          -19.9
SHSHSHS          -18.9
THTHTHT          -18.7
EHEHEHE          -18.6
DHDHDHD          -18.2
LHLHLHL          -16.2
FHFHFHF          -15.4
PHPHPHP          -14.1
TLRRWRDRRILN     -

[Figure 2: bar charts of similarity scores for the strong and weak classes and across the two classes; panels (a) Pam 250 and (b) Blosum 62.]
Fig. 2: Total similarity scores of the self-class (the strong
class and weak class) and the cross-class for quartz
binding binders.

It is worth noticing that these datasets illustrate well
the two challenges mentioned above. First, there are
only around ten sequences available for each affinity
class, which is very few in comparison to the data size
used for classifier training in protein sequence analysis,
where hundreds or thousands of sequences are usually
involved [33], [25]. Second, the unobvious pattern challenge in these datasets is illustrated in Fig.
2 and Fig. 3. In these figures, based on the total similarity
scores (TSS) defined in [21], we first calculate the total
similarity of sequences from the same class A via the
following equation:
$$ TSS_A = \frac{1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \sum_{j=1}^{N_A} PSS_{ij} (1 - \delta_{ij}), \tag{1} $$
where $\delta$ is the usual Kronecker delta function in which
$\delta_{ij} = 1$ when $i = j$ and 0 otherwise, $N_A$ is the total
number of sequences in set A, and $PSS_{ij}$ is the similarity
between the i-th sequence and the j-th sequence of set A
calculated via the Needleman-Wunsch algorithm [40]. For
the sake of simplicity, we call it self-class similarity for
short. Moreover, the TSS of the sequences across the
classes A and B is calculated as:
$$ TSS_{A-B} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} PSS_{ij}, \tag{2} $$
where $N_B$ is the total number of sequences in set B.
Correspondingly, the total similarity for sequences across
the classes is named across-class similarity for short.
To calculate $PSS_{ij}$, we need to provide a transition
matrix on which the optimal scoring alignment is
made. We have used both
the Pam 250 [41] (Fig. 2(a) and Fig. 3(a)) and the Blosum 62
[42] (Fig. 2(b) and Fig. 3(b)) matrices as the transition matrix.
Fig. 2 shows that the sequences belonging to the weak class have very low or no significant
similarities. Their self-class similarity is much lower than the
cross-class similarity. Similarly, as shown in Fig. 3, the
similarities of the sequences belonging to the strong gold
binding set are very close to the cross-class similarity.
Due to this phenomenon, the traditional classification
approaches cannot readily identify an effective pattern.
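The two similarity scores in Eq. (1) and Eq. (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_score` is a hypothetical match/mismatch stand-in for a Pam 250 or Blosum 62 lookup, and the linear gap penalty of -1 is an assumption. The sample sequences are the four shown in Table 1.

```python
from itertools import product


def nw_score(a, b, score, gap=-1):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                           prev[j] + gap,                            # gap in b
                           cur[j - 1] + gap))                        # gap in a
        prev = cur
    return prev[-1]


def toy_score(x, y):
    # Hypothetical stand-in for a Pam 250 / Blosum 62 lookup (assumption).
    return 1 if x == y else -1


def tss_self(seqs, score=toy_score):
    """Eq. (1): mean pairwise score within one class, excluding i == j."""
    n = len(seqs)
    total = sum(nw_score(seqs[i], seqs[j], score)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def tss_cross(seqs_a, seqs_b, score=toy_score):
    """Eq. (2): mean pairwise score between two classes."""
    total = sum(nw_score(a, b, score) for a, b in product(seqs_a, seqs_b))
    return total / (len(seqs_a) * len(seqs_b))


strong = ["RLNPPSQMDPPF", "QTWPPPLWFSTS"]
weak = ["MEGQYKSNLLFT", "VAPRVQNLHFGA"]
print(tss_self(strong), tss_self(weak), tss_cross(strong, weak))
```

Swapping `toy_score` for a lookup into a real substitution matrix would be needed to obtain scores comparable to those plotted in Fig. 2 and Fig. 3.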
TABLE 2: Summary of the gold binding peptide sequences (strong class)
Strong Sequence    ΔG (kJ/mol)
WAGAKRLVLRRE       -37.6
MHGKTQATSGTIQS     -37.6
LKAHLPPSRLPS       -36.6
WALRRSIRRQSY       -36.4
TGTSVLIATPYV       -35.7
EQLGVRKELRGV       -35.3
RMRMKMK            -35.0
PPPWLPYMPPWS       -35.0
AYSSGAPPMPPF       -31.8
TGIFKSARAMRN       -31.6
KHKHWHW            -31.3
TSNAVHPTLRHL       -30.3

[Figure 3: bar charts of similarity scores for the strong and weak classes and across the two classes; panels (a) Pam 250 and (b) Blosum 62.]
Fig. 3: Total similarity scores of the self-class (the strong
class and weak class) and the cross-class for gold binding
binders.

To demonstrate that the proposed work is a general framework which is also effective at predicting the homology
families of traditional protein sequences, a third
dataset, the Structural Classification of Proteins (SCOP) dataset
from [43], is also used. We employ the
approach developed by Anoop Kumar and Lenore Cowen
[25] to pick the SCOP families, where the acquired proteins
are grouped into seven families (i.e. A, B, C, D,
E, F and G). The number of sequences and the lengths of the shortest and longest
sequences in each family are shown
in Table 3, and the data we used are available at
http://www.acsu.buffalo.edu/~nandu/InorganicSeq/.
TABLE 3: Summary of the protein sequence data
Family     Number of Seq    Length of Shortest Seq    Length of Longest Seq
Class A    23               160                       177
Class B    23               72                        136
Class C    16               224                       260
Class D    19               120                       144
Class E    18               324                       429
Class F    14               131                       221
Class G    20               45                        83

3.2 Problem Definition
We consider our problem as identifying the affinity
classes of test inorganic material binding peptide sequences based on the training sequences. We
start from a pool of l+u peptide sequences $S_{source} = \{s_1, ..., s_l, ..., s_{l+u}\}$,
where each peptide sequence is represented as a series of ordered amino acids. To
illustrate what a peptide sequence looks like, take
a peptide sequence from Table 1 as an example: DS202: RLNPPSQMDPPF is a peptide sequence composed
of twelve ordered amino acids, where each letter denotes
one of the 20 standard amino acids. We also assume that,
in this sequence pool $S_{source}$, the first l sequences
$s_i \in S_{source}$ ($1 \le i \le l$) are labeled based on their affinity to the
target inorganic material (e.g. weak or strong) and
together form the set L, and the remaining u sequences
$s_i \in S_{source}$ ($l + 1 \le i \le l + u$) are unlabeled and together form
the set U, where $L \cup U = S_{source}$. Our goal is to predict
the labels of the peptide sequences in U using the training
sequences in L.
4 PEPTIDE SEQUENCE SIMULATION
As mentioned above, the lack of labeled data is a general
problem when working on inorganic
material binding sequences. One of the most successful methods to date for recognizing protein sequences
based on evolutionary knowledge is the use of simulated
sequences. Many studies [25], [44]
have shown that augmenting the training set with
simulated sequences generated from an amino acid
transition matrix such as Blosum 62 or Pam 250 can
improve homolog identification performance. One
can reasonably expect that a set of peptides generated
by directed evolution to recognize a given solid material
will have similar sequences [21].
Although these transition matrices are shown to be
efficient and gain wide acceptance, we cannot directly
apply this technique to generate simulated sequences.
These transition matrices are derived from the large-scale
natural protein sequence databases rather than the target inorganic material binding sequences, which means
these existing matrices could not represent the target
inorganic material well. Thus, we only use a traditional
transition matrix as a seed and based on it we generate
a new transition matrix which not only maintains the
prior knowledge from proteins but also captures the significant knowledge inside the target inorganic material.
Here, aiming to provide a more comprehensive and
diverse view for our model, we use a two-step simulated
sequence generation approach to enlarge our training
set. First, we generate a new transition matrix which
can better measure the amino acids transition relations
for the target inorganic material. Specifically, we use
a traditional transition matrix (e.g. Blosum 62 or Pam
250) as a seed matrix M , which is a 20 × 20 symmetric
matrix and each name on the column or row is a single
letter representing an amino acid. Then we greedily
and iteratively mutate each entry $m_{ij}$, which is an
integer coefficient between two amino acids in the seed
transition matrix, to maximize the difference between the
self-class similarity of class A (i.e. $TSS_A$) and the cross-class similarity between classes A and B (i.e. $TSS_{A-B}$) [21],
which is designed to enlarge the gap among the affinity
classes.
Second, after the new transition matrix M ∗ is constructed, the simulated sequences are generated based
on the labeled sequence set L. When a sequence is
selected as a seed, the simulated sequence is generated
by randomly selecting a position from it and replacing
the amino acid i in the corresponding position with a
new amino acid j with a probability defined in Eq. (3):
$$ P_{ij} = \frac{M^*_{ij}}{\sum_{j'=1}^{20} M^*_{ij'}}. \tag{3} $$
Note that, all the probabilities are calculated after
normalization of the values in M ∗ into a positive value
space (e.g. 0−1). As an example, Fig. 4 shows the process
of mutating an amino acid in the selected position based
on the mutation probability, and we keep replacing
the amino acids in the target sequence until a desired
mutation threshold t is reached [25].
[Figure 4: the seed sequence RLNPPSQMDPPF is shown with its 8-th amino acid M (Methionine) selected to be mutated. The mutation probabilities listed for the 20 amino acids are A 4.6%, R 4.6%, N 3.5%, D 2.2%, C 4.6%, Q 5.8%, E 3.5%, G 2.2%, H 3.5%, I 6.9%, L 8.0%, K 4.6%, M 11.4%, F 5.8%, P 3.5%, S 4.6%, T 4.6%, W 4.6%, Y 4.6%, V 6.9%. W (Tryptophan) is selected to replace M, giving RLNPPSQWDPPF.]
Fig. 4: An example of mutating an amino acid in the
selected position. When a specific position (e.g. 8-th)
is selected from the target sequence, the corresponding
amino acid M is mutated to be another amino acid W
based on the mutation probability.
By this two-step method, we can incorporate not
only the prior phylogenetic knowledge but also the
specific amino acid pattern responsible for binding to
the target inorganic material into the data. Accordingly,
based on this peptide sequence simulation method, for
each labeled source sequence si ∈ Ssource (1 ≤ i ≤ l),
we generate m mutated sequences, which is represented as a simulated peptide sequence set Ssimulated =
{s∗1 , ..., s∗l×m }. Finally, we define the sequence pool as
S = Ssource ∪ Ssimulated which includes the source
peptide sequences and simulated sequences. We will
show that the simulated sequences effectively improve
the performance in the experiments.
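The second simulation step can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the shift used to move scores into a positive range, the reading of the mutation threshold t as a fraction of positions, and the `toy_matrix` (a placeholder for the learned matrix M*) are all hypothetical. A redrawn position may keep its original residue, because the diagonal of the matrix also carries probability mass.

```python
import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # row/column order assumed for the matrix


def mutation_probs(m_star):
    """Shift scores into (0, 1] (assumed scaling), then row-normalize as in Eq. (3)."""
    lo = min(min(row) for row in m_star)
    hi = max(max(row) for row in m_star)
    probs = []
    for row in m_star:
        shifted = [(v - lo + 1) / (hi - lo + 1) for v in row]
        total = sum(shifted)
        probs.append([v / total for v in shifted])
    return probs


def simulate(seed_seq, probs, t, rng):
    """Redraw a fraction t of randomly chosen positions of one seed sequence."""
    seq = list(seed_seq)
    n_mut = max(1, round(t * len(seq)))
    for pos in rng.sample(range(len(seq)), n_mut):
        i = AMINO_ACIDS.index(seq[pos])
        seq[pos] = rng.choices(AMINO_ACIDS, weights=probs[i], k=1)[0]
    return "".join(seq)


def simulate_set(labeled_seqs, probs, m, t, seed=0):
    """Generate m simulated sequences per labeled seed (l * m in total)."""
    rng = random.Random(seed)
    return [simulate(s, probs, t, rng) for s in labeled_seqs for _ in range(m)]


# Hypothetical toy matrix: 4 on the diagonal, 1 elsewhere.
toy_matrix = [[4 if i == j else 1 for j in range(20)] for i in range(20)]
probs = mutation_probs(toy_matrix)
pool = simulate_set(["RLNPPSQMDPPF", "QTWPPPLWFSTS"], probs, m=3, t=0.25)
print(pool)
```

Passing an explicit `random.Random` instance keeps the simulated pool reproducible across runs, which helps when comparing classifiers on the same augmented training set.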
5 GRAPH-BASED OPTIMIZATION MODEL
To handle the challenge that obvious patterns
are hard to find at the class level, we propose a graph-based optimization model to estimate the conditional
probability of the test sequences belonging to each affinity class. Our method begins by mapping sequences
from the sequence pool into nodes of a sequence-to-sequence graph (Section 5.1), where the relationships
among sequences are better measured and many efficient
clustering methods are available. Instead of searching for
the patterns at the class level, we partition the sequences
into clusters, where we believe the significant patterns
exist, and propose an objective function (Section 5.2)
to learn the conditional probability of each sequence
belonging to a specific affinity class. Finally, we present
an efficient iterative algorithm to obtain the optimal
value of the objective function (Section 5.3).
5.1 Mapping Sequences into Nodes of a Graph
We map all the sequences into a graph where each node
denotes a peptide sequence and each edge denotes the
pairwise similarity between two sequences. This graph
offers a good understanding of the pairwise relationships
among peptide sequences and is easily partitioned into
clusters. The pairwise similarity among sequences is
calculated using the Needleman-Wunsch algorithm [40] after
local alignment between each sequence pair using the Smith-Waterman algorithm [45].
5.2 The Objective Function
The key idea of our approach is that, instead of searching
for patterns at the class level, we narrow down the
affinity class prediction problem from the class level to
the cluster level. We believe that, if the patterns are
obscure, we can find clearer patterns by shifting the focus
from the class level to the cluster level.
Before proceeding further, we introduce the notation
that will be used in the following discussion: $d_{ij}$ denotes
the ij-th entry in the matrix D, and $\vec{d}_{i\cdot}$ and $\vec{d}_{\cdot j}$ denote the vectors
of the i-th row and j-th column of matrix D, respectively.
Belongingness Matrix: We denote the belongingness
matrix B as an N × V matrix, where N is the number
of all sequences (including source sequences, simulated
sequences and test sequences) and V is the number of
clusters detected by clustering the N sequences. Note
that each entry in the belongingness matrix corresponds
to the probability of a peptide sequence belonging to a
cluster. If peptide sequence $s_i$ is assigned to a cluster $c_j$,
then $b_{ij} = 1$, and 0 otherwise. To construct the belongingness matrix B, we have used spectral clustering [46],
which has proven effective for solving graph partitioning problems, to partition the sequence-to-sequence
graph that we have constructed in the previous section
into V clusters.
Sequence Probability Matrix: The conditional probability of peptide sequence $s_i$ belonging to class z ($s_{iz} = \hat{P}(y = z \mid s_i)$)
is estimated with an N × D matrix S, where
D is the number of affinity classes we want to classify.
Cluster Probability Matrix: The conditional probability of cluster $c_j$ belonging to class z ($c_{jz} = \hat{P}(y = z \mid c_j)$)
is estimated as a V × D matrix C, where $c_{jz}$ represents
the probability of a cluster $c_j$ belonging to a class z.
Sequence Labeled Matrix: In the labeled sequence set
L, the sequences have initial class labels which are
represented by an N × D matrix F, where $f_{iz} = 1$ if we
know in advance that sequence $s_i$ belongs to class z, and
0 otherwise.
Cluster Labeled Matrix: We may also have prior
information of a cluster belonging to a specific class.
We use a V × D matrix Y to define initial labels for
clusters, where $y_{jz} = 1$ denotes that we are confident that
a cluster $c_j$ belongs to a specific class z, and 0 otherwise.
Specifically, we assign a cluster $c_j$ to a specific class z if
all the source sequences in it belong to the same class z.
Cluster Similarity Matrix: In addition, a V × V matrix
W denotes the similarity among the sequence clusters,
where $w_{ij}$ is the similarity between the sequence clusters
$c_i$ and $c_j$. Specifically, the pairwise cluster similarity is
calculated using $TSS_{A-B}$ between the sets of sequence
binders A and B.
Now we formulate the affinity class identification
problem as the following objective function:
$$
\begin{aligned}
\min_{S,C} J(S, C) = \min_{S,C} \Big( & \sum_{i=1}^{N} \sum_{j=1}^{V} b_{ij} \|\vec{s}_{i\cdot} - \vec{c}_{j\cdot}\|^2 + \alpha \sum_{i=1}^{V} \sum_{j=1}^{V} w_{ij} \|\vec{c}_{i\cdot} - \vec{c}_{j\cdot}\|^2 \\
& + \beta \sum_{i=1}^{N} h_i \|\vec{s}_{i\cdot} - \vec{f}_{i\cdot}\|^2 + \gamma \sum_{j=1}^{V} k_j \|\vec{c}_{j\cdot} - \vec{y}_{j\cdot}\|^2 \Big)
\end{aligned}
$$
subject to the following conditions:
$$ \sum_{z=1}^{D} s_{iz} = 1,\ s_{iz} \ge 0, \qquad \sum_{z=1}^{D} c_{jz} = 1,\ c_{jz} \ge 0, \tag{4} $$
where $\|\cdot\|$ indicates the L2 norm. The first term in
Eq. (4), $\sum_{i=1}^{N} \sum_{j=1}^{V} b_{ij} \|\vec{s}_{i\cdot} - \vec{c}_{j\cdot}\|^2$, ensures that a sequence
should have a probability vector similar to that of the
cluster it belongs to; namely, cluster $c_j$ should correspond to class z if the majority of sequences in this
cluster belong to class z. Intuitively, the higher the
deviation, the larger the penalty. The second term,
$\alpha \sum_{i=1}^{V} \sum_{j=1}^{V} w_{ij} \|\vec{c}_{i\cdot} - \vec{c}_{j\cdot}\|^2$, corresponds to the intuition
that clusters which are close to each other should
have similar classes, and $\alpha$ denotes the confidence in this
source of information. From the view of graph theory,
this term propagates the class information among
the clusters. The third term, $\beta \sum_{i=1}^{N} h_i \|\vec{s}_{i\cdot} - \vec{f}_{i\cdot}\|^2$, applies
the constraint that the predictions should not deviate
too much from the corresponding sequence ground truth,
and $\beta$ is the parameter that expresses the confidence of
our belief in the prior knowledge of sequences. Similarly,
the last term, $\gamma \sum_{j=1}^{V} k_j \|\vec{c}_{j\cdot} - \vec{y}_{j\cdot}\|^2$, is the loss function
penalizing the deviation between predictions and our
prior knowledge of clusters, and $\gamma$ is the parameter
that expresses the confidence of our belief in the prior
knowledge of clusters.
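The four terms of Eq. (4) can be evaluated directly for small matrices, which is a convenient way to check an implementation. The sketch below is an illustration only: it takes $h_i$ and $k_j$ to be the row sums of F and Y (as in the definitions of $H_n$ and $K_v$ in Section 5.3), and the toy matrices with four sequences, two clusters and two classes are hypothetical.

```python
import numpy as np


def objective(S, C, B, W, F, Y, alpha, beta, gamma):
    """Value of J(S, C) in Eq. (4), without enforcing the simplex constraints."""
    h = F.sum(axis=1)  # h_i = 1 for labeled sequences, 0 otherwise
    k = Y.sum(axis=1)  # k_j = 1 for labeled clusters, 0 otherwise
    # Squared distances between every sequence row and every cluster row (N x V).
    d_sc = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # Squared distances between every pair of cluster rows (V x V).
    d_cc = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    fit = (B * d_sc).sum()
    smooth = alpha * (W * d_cc).sum()
    seq_prior = beta * (h * ((S - F) ** 2).sum(axis=1)).sum()
    clu_prior = gamma * (k * ((C - Y) ** 2).sum(axis=1)).sum()
    return fit + smooth + seq_prior + clu_prior


# Hypothetical toy problem: N = 4 sequences, V = 2 clusters, D = 2 classes.
B = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
F = np.array([[1, 0], [0, 0], [0, 1], [0, 0]], dtype=float)  # sequences 0 and 2 labeled
Y = np.array([[1, 0], [0, 1]], dtype=float)
W = np.array([[0.0, 0.2], [0.2, 0.0]])

S_sharp = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
S_flat = np.full((4, 2), 0.5)
J_sharp = objective(S_sharp, Y, B, W, F, Y, 1.0, 1.0, 1.0)
J_flat = objective(S_flat, Y, B, W, F, Y, 1.0, 1.0, 1.0)
print(J_sharp, J_flat)
```

In this toy setting the assignment that agrees with the labels and with the cluster memberships gives the lower objective value, and only the cluster smoothness term contributes to it.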
5.3 Iterative Update Algorithm
It is easy to prove that the objective function Eq. (4) is
convex which makes it possible to find a global optimal
solution. To obtain the optimal solution for matrices S
and C, we propose to solve Eq. (4) using the the block
coordinate descent method [47]. At iteration t, fixing the
value of s⃗i. , we can take the partial derivative to c⃗tj. in Eq.
(4) and set it to 0, and then obtain the update Formula
Eq. (5):
7
∑n
c⃗tj. = ∑
n
⃗
t−1
i=1 bij sj.
i=1 bij
+ γkj y⃗j.
e
+ γkj + α(i⃗j. − l⃗j. )
.
(5)
Accordingly, the update can be represented as a matrix
∑N
form as Eq. (6), where Dv = diag{( i=1 bij )} is the
∑D
normalization factor, Kv = diag{( z=1 yjz ))} indicates
the constraints for the clusters and diag denotes the
e is the
diagonal elements of a matrix. Furthermore, L
1
− 12
−
2
e = Dw W Dw
normalized laplacian [48] defined as L
,
where Dw is the diagonal degree matrix of W.
e −1 (AT S t−1 + γKv Y ). (6)
C t = (Dv + γKv + α(I − L))
The Hessian matrix with respect to C consists of a diagonal matrix with entries $\sum_{i=1}^{N} b_{ij} + \alpha > 0$ and the matrix $I - \tilde{L}$. The diagonal matrix is positive definite, and it is easy to prove that $I - \tilde{L}$ is positive semi-definite. Thus, the Hessian matrix is positive definite, which means that setting the derivative with respect to C to zero gives the unique minimum of Eq. (4) for fixed S. Similarly, we can obtain the update formula Eq. (7) with respect to $\vec{s}_{i\cdot}$ by fixing $\vec{c}^{\,t}_{j\cdot}$.
\[
\vec{s}^{\,t}_{i\cdot} = \frac{\sum_{j=1}^{V} b_{ij}\,\vec{c}^{\,t}_{j\cdot} + \beta h_i \vec{f}_{i\cdot}}{\sum_{j=1}^{V} b_{ij} + \beta h_i}. \tag{7}
\]
The matrix form of Eq. (7) is as follows:
\[
S^t = (D_n + \beta H_n)^{-1}(A C^t + \beta H_n F), \tag{8}
\]
where $D_n = \mathrm{diag}\{\sum_{j=1}^{V} b_{ij}\}$ is the normalization factor and $H_n = \mathrm{diag}\{\sum_{z=1}^{D} f_{iz}\}$ indicates the constraints for the sequences. The Hessian matrix is also a diagonal matrix with diagonal elements $\sum_{i=1}^{N} b_{ij} > 0$, which means that setting the derivative with respect to S to zero gives the unique minimum of Eq. (4) for fixed C. To sum up, the pseudo-code for iteratively solving Eq. (4) by the block coordinate descent method is shown in Algorithm 1, where ϵ is a convergence threshold. Because the proposed method is based on a graph model, we name our approach the Peptide Sequences Identification Graph Model (PSIGM).
This iterative process is a procedure of information propagation among the clusters, and Fig. 5 shows an example. In each iteration, each cluster estimates its class based on the classes of its members while retaining its initial class Y (Fig. 5-(A)). After all the clusters have received the label information (Fig. 5-(B)), they propagate it to their neighboring clusters based on the smoothness assumption (Fig. 5-(C)). After the clusters have received the information from their neighbors, they pass it back to their member nodes, while the nodes retain their initial classes (Fig. 5-(D)). This process continues until convergence.
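The alternating updates of Eq. (6) and Eq. (8) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrix A of Eqs. (6) and (8) is taken to be the belongingness matrix B, convergence is measured with the Frobenius norm, and the function name `psigm_iterate` and the `max_iter` and `seed` arguments are hypothetical additions.

```python
import numpy as np


def psigm_iterate(B, F, Y, W, alpha=2.0, beta=10.0, gamma=2.0,
                  eps=1e-4, max_iter=200, seed=0):
    """Alternate Eq. (6) and Eq. (8) until S changes by at most eps."""
    B = np.asarray(B, dtype=float)   # N x V belongingness (assumed to be A)
    F = np.asarray(F, dtype=float)   # N x D sequence labels
    Y = np.asarray(Y, dtype=float)   # V x D cluster labels
    W = np.asarray(W, dtype=float)   # V x V cluster similarity
    N, V = B.shape

    # Diagonal normalization and constraint matrices, named as in the text.
    Dv = np.diag(B.sum(axis=0))
    Dn = np.diag(B.sum(axis=1))
    Kv = np.diag(Y.sum(axis=1))
    Hn = np.diag(F.sum(axis=1))

    # L_tilde = Dw^{-1/2} W Dw^{-1/2}; zero-degree clusters are left at 0.
    deg = W.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L_tilde = inv_sqrt[:, None] * W * inv_sqrt[None, :]

    M_c = Dv + gamma * Kv + alpha * (np.eye(V) - L_tilde)
    M_s = Dn + beta * Hn

    rng = np.random.default_rng(seed)
    S = rng.random((N, F.shape[1]))  # random S^0, as in Algorithm 1
    for t in range(1, max_iter + 1):
        C = np.linalg.solve(M_c, B.T @ S + gamma * Kv @ Y)    # Eq. (6)
        S_new = np.linalg.solve(M_s, B @ C + beta * Hn @ F)   # Eq. (8)
        converged = np.linalg.norm(S_new - S) <= eps
        S = S_new
        if converged:
            break
    return S, C, t
```

Solving the linear systems with `np.linalg.solve` avoids forming the explicit inverses that appear in Eqs. (6) and (8); both system matrices are constant across iterations, so they are built once before the loop.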
Algorithm 1 The PSIGM Algorithm
Input: Belongingness matrix $B_{N \times V}$, sequence label matrix $F_{N \times D}$, cluster label matrix $Y_{V \times D}$, cluster similarity matrix $W_{V \times V}$, and parameters α, β, γ, ϵ
Output: Estimated sequence probability matrix $S_{N \times D}$
  Initialize $S^0$ randomly;
  $t \leftarrow 0$;
  repeat
    $t \leftarrow t + 1$;
    Update $C^t$ using Eq. (6);
    Update $S^t$ using Eq. (8);
  until $\|S^t - S^{t-1}\| \le \epsilon$
  Output $S^t$;

5.4 Time Complexity
The time complexity of the proposed algorithm is composed of two parts: updating the cluster probability matrix C and updating the sequence probability matrix S. For updating the matrix C, the time complexity is $O(VN^2D + V^3 + V^2D)$, where N is the size of the peptide sequence pool, V is the number of clusters, and D is the number of affinity classes. Because in our case the sequence set is usually much larger than the number of clusters, the time complexity of the first step is $O(VN^2D)$. For updating the matrix S, the time complexity is $O(N^2D + NVD)$, and thus $O(N^2D)$. Therefore, the overall time complexity of one iteration is $O(VN^2D)$. If the number of iterations is k, the time complexity of the whole algorithm is $O(kVN^2D)$. In our experiments, we observe that k is usually between 8 and 20.
6 EXPERIMENTS
In the following, we first conduct experiments on both the quartz and gold binding sequence datasets to show that PSIGM is effective for identifying the binding affinity classes of inorganic material binding sequences. We then conduct experiments on the SCOP protein sequence dataset to show that PSIGM is a general framework that also works effectively on other kinds of sequence sets. Because most of our baselines are designed as binary classifiers, for the sake of simplicity we only consider the identification of weak and strong binders in the datasets mentioned in Section 3.1, although our proposed method is not restricted to binary classification. Throughout the experiments, we set α = 2, β = 10, and γ = 2 as the default values of our algorithm. The rationale is that both α and γ depend on the clustering result, which is influenced by uncertain factors such as the number of clusters and the initial centroid of each cluster, so assigning them relatively low values is better. On the other hand, β reflects our confidence in the labeled sequences, which come from strict and reliable experiments, so β should be assigned a relatively large value.
[Figure 5: four panels, (A) to (D), each showing six numbered nodes grouped into clusters. Legend: node class (strong, weak, test, simulated); cluster class (strong, weak, unlabeled).]
Fig. 5: An example illustrating the label propagation at each iteration. (A) All the nodes (each sequence is represented as a node) are partitioned into multiple clusters; (B) the conditional probability estimates of the clusters (i.e., the cluster probability matrix C) receive the label information from the sequences (nodes) belonging to them; (C) each cluster propagates its class information to its neighboring clusters; and (D) after updating its probability, each cluster passes the label information back to the conditional probabilities of its members (i.e., the sequence probability matrix S).
6.1 Experiments on Material Binding Sequences
To show that our proposed framework is effective at predicting the binding affinity class of inorganic material binding sequences, we perform the following three experiments to demonstrate that: 1) simulated sequences and the information propagation among the clusters can effectively alleviate the limitation due to challenge I; 2) searching for patterns in clusters rather than classes helps in handling challenge II; and 3) the newly generated transition matrix contributes to the performance of our proposed method. For each accuracy (i.e., the percentage of test examples correctly classified when compared with the ground truth) shown in the following experiments, we performed the experiment 5 times using leave-one-out validation and report the mean value. We iteratively select one labeled sequence as the test sequence, and use the rest of the labeled sequences to generate the simulated sequences and train the model. Once the model is trained, we predict the affinity class of the test sequence. We use leave-one-out validation rather than k-fold cross-validation for the inorganic material binding sequences because the number of labeled sequences is small. In such a case, each of them may represent a significant underlying pattern or characteristic. In addition, for each experiment, we fix the convergence threshold ϵ to $10^{-4}$.
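The leave-one-out protocol described above can be sketched as follows. This is a minimal illustration rather than the authors' code: `fit_predict` is a hypothetical callback standing in for the whole "generate simulated sequences, train the model, predict the held-out sequence" step, and `leave_one_out_accuracy` is a name introduced here.

```python
from typing import Callable, List, Sequence


def leave_one_out_accuracy(
    sequences: Sequence[str],
    labels: Sequence[str],
    fit_predict: Callable[[List[str], List[str], str], str],
) -> float:
    """Hold out each labeled sequence once and report the fraction predicted correctly."""
    sequences = list(sequences)
    labels = list(labels)
    correct = 0
    for i, held_out in enumerate(sequences):
        # Train on every labeled sequence except the held-out one.
        train_x = sequences[:i] + sequences[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = fit_predict(train_x, train_y, held_out)
        correct += int(predicted == labels[i])
    return correct / len(sequences)
```

Because the simulated sequences are generated from the training sequences only, any simulation step belongs inside `fit_predict`, so that the held-out sequence never influences the model that predicts it.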
The effect of simulated sequences and cluster information propagation. To test the effect of the simulated sequences and of the information propagation among the clusters, we ran our method with and without each of them. Furthermore, as mentioned in Section 5.2, we need to partition all the sequences into V clusters, so we also examine the relationship between the performance of the proposed method and the number of clusters. We vary the number of clusters over 2, 5, 10, 15, 18 and 20.
We show that both strategies, sequence simulation and information propagation among the clusters, are crucial to improving the performance. The result is shown in Fig. 6, where the x axis denotes the number of clusters and the y axis denotes the accuracy. We can see that the simulated sequences contribute to the performance improvement. Also, in the absence of information propagation among the clusters, the performance degrades. Finally, we notice that a hill-like shape appears as the number of clusters increases in all three cases. Most likely, when the number of clusters is too low, the model is close to a global view; when it is too high, the clusters are too small to learn from.
[Figure 6: accuracy (y axis) versus number of clusters (x axis) for PSIGM, PSIGM without information propagation, and PSIGM without sequence simulation.]
Fig. 6: Performance comparison with different strategies.

Performance comparison with baselines. In this part, we compare the proposed method with 5 other algorithms mentioned above, including SVM, Neural Network (NN), HMM, and Learning with Local and Global Consistency (LLGC) [48], which is a well-known graph-based semi-supervised learning algorithm. For fairness and comprehensiveness, we have also added the simulated sequences used in our framework to these methods; the resulting variants are marked as SVM*, NN* and HMM*. Since LLGC was designed as a semi-supervised algorithm that needs unlabeled instances to help propagate the label information, we only consider LLGC with the simulated sequences, which are used as unlabeled data. Thus, the methods are separated into two groups: those using the simulated sequences and those not using them. To measure the influence of the mutation rate t used to generate the simulated sequences, we vary t over 5%, 10%, 15% and 20%. The results of predicting the binding affinity classes on the two inorganic material binding sequence datasets, compared with the baselines, are shown in Table 4 (quartz) and Table 5 (gold).

TABLE 4: Comparison with baselines under different mutation rates for quartz-binding sequences
                Mutation Rate
Method    5%      10%     15%     20%
PSIGM     0.82    0.82    0.84    0.83
LLGC      0.60    0.63    0.60    0.62
NN*       0.60    0.62    0.58    0.55
HMM*      0.66    0.65    0.69    0.67
SVM*      0.62    0.63    0.58    0.59
SVM       0.68
NN        0.65
HMM       0.70
(SVM, NN and HMM do not use simulated sequences, so a single accuracy is reported for each.)

TABLE 5: Comparison with baselines under different mutation rates of the gold binding sequences
                Mutation Rate
Method    5%      10%     15%     20%
PSIGM     0.91    0.91    0.92    0.91
LLGC      0.81    0.82    0.84    0.81
NN*       0.86    0.82    0.84    0.81
HMM*      0.89    0.90    0.90    0.87
SVM*      0.83    0.82    0.80    0.82
SVM       0.84
NN        0.82
HMM       0.90
(SVM, NN and HMM do not use simulated sequences, so a single accuracy is reported for each.)

Note that the proposed method outperforms the others in predicting the affinity classes of the test inorganic material binding sequences: by 0.12 or more over the best baseline on the quartz data (Table 4) and by 0.01 to 0.02 on the gold data (Table 5). In addition, the performance of the proposed method is not very sensitive to the setting of the mutation rate over the range considered. It is worth noticing that, instead of aiding the performance, the simulated sequences in SVM*, HMM* and NN* generally make the performance worse than without them. The main reason for this phenomenon is that all these methods take a global view of the training data, which is represented as only two large classes: strong binding and weak binding. However, as we mentioned, the sequences inside the same class may be very different, so the more simulated sequences are added, the less distinct the class-level patterns are likely to be. The proposed method treats the sequences locally as clusters, and thus it can properly handle this problem.

Parameter Sensitivity. There are three parameters in our objective function Eq. (4): α, β and γ. We conducted sensitivity experiments, shown in Fig. 7. In these experiments, when one parameter is varied, the other two are fixed at their default settings (i.e., α = 2, β = 10, and γ = 2). Note that α represents our confidence in the information propagation among clusters. The clusters of sequences can be obtained from arbitrary clustering methods, which are not very stable; in other words, the clustering may not be completely correct. Therefore, a smaller α usually yields better performance. β reflects our confidence in the prior knowledge of the sequence classes. These sequence classes, which are obtained from rigorous physical or chemical experiments, are deemed reliable, and thus a large β is usually better. γ denotes our confidence in the prior knowledge of the cluster classes. This information may not be totally reliable, so a lower value usually yields better results. The results in Fig. 7 confirm these observations.

[Figure 7: accuracy (y axis) versus parameter value (x axis; 0, 1, 2, 5, 8, 10, 15, 20) for α, β and γ.]
Fig. 7: Parameter sensitivity experiments.
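The one-parameter-at-a-time protocol of the sensitivity experiment can be sketched as follows. This is an illustrative sketch only: `evaluate` is a hypothetical function that would train the model with the given α, β and γ and return its accuracy, and the grid is read off the x-axis ticks of Fig. 7.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def one_at_a_time_sweep(
    evaluate: Callable[..., float],
    defaults: Dict[str, float],
    grid: Sequence[float],
) -> Dict[str, List[Tuple[float, float]]]:
    """Vary each parameter over grid while the others stay at their defaults."""
    results: Dict[str, List[Tuple[float, float]]] = {}
    for name in defaults:
        curve = []
        for value in grid:
            params = dict(defaults)   # copy, so the defaults stay untouched
            params[name] = value      # override only the parameter being varied
            curve.append((value, evaluate(**params)))
        results[name] = curve
    return results


# Default settings from the text and the tick values of Fig. 7.
DEFAULTS = {"alpha": 2, "beta": 10, "gamma": 2}
GRID = [0, 1, 2, 5, 8, 10, 15, 20]
```

Calling `one_at_a_time_sweep(evaluate, DEFAULTS, GRID)` returns one curve per parameter, which corresponds to the three curves plotted in Fig. 7.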
Comparison with Varying Transition Matrices. Finally, we show the performance of the proposed method with different transition matrices. We want to demonstrate that the newly generated matrix improves the performance of the proposed algorithm for the target inorganic material. In this experiment, we also vary the mutation rate t over 5%, 10%, 15% and 20%. In Table 6, we compare the new transition matrices M1, which was generated based on Blosum 62, and M2, which was generated based on Pam 250, with four other widely used transition matrices: Blosum 62, Pam 250, Dayhoff [49] and Gonnet [50]. The results show that the newly generated transition matrices perform better than the others at each mutation rate.

TABLE 6: Performance with different transition matrices
                           Mutation Rate
Transition Matrix    5%      10%     15%     20%
M1                   0.82    0.82    0.84    0.83
M2                   0.80    0.81    0.82    0.82
Blosum 62            0.72    0.74    0.75    0.74
Pam 250              0.72    0.73    0.73    0.74
Dayhoff              0.71    0.71    0.72    0.72
Gonnet               0.79    0.78    0.79    0.80

6.2 Experiments on Protein Sequences
The proposed PSIGM is a general framework that is not limited to identifying the affinity class of inorganic material binding sequences. To show this, we have used the SCOP protein data mentioned in Section 3.1. Instead of predicting the affinity classes of the sequences, we consider the problem of homology family prediction: for a specific family, can the proposed framework distinguish the sequences belonging to it from those of the remaining families? Correspondingly, we construct seven identification tasks from this dataset, where the sequences from one particular family are used as the positive set and the sequences from the remaining six families are used as the negative set. For example, when the sequences in family A are used as the positive set, the sequences from families B, C, D, E, F and G are used as the negative set. Two experiments are performed to demonstrate that: 1) PSIGM is a general framework that can also handle traditional protein sequence identification; and 2) a moderate setting of the mutation rate is conducive to improving the performance.

Performance comparison with baselines. It is worth noticing that, by constructing the tasks in this way, the data take on the characteristics of inorganic material binding sequences to some extent. Note that each result shown in the following experiments (i.e., Table 7 and Fig. 9) is the average over 10 runs of 5-fold cross-validation. Since the protein sequence dataset has relatively sufficient training samples and the sequences that belong to the same protein family are similar to each other, we have used cross-validation rather than leave-one-out validation. Table 7 shows the results of predicting the homology family, compared with the baselines mentioned in Section 6.1. As the table shows, the proposed method matches or outperforms the other methods on every protein family's prediction task. Note that the accuracies of predicting the homology families are much higher than the accuracies of predicting the affinity classes of the inorganic material binding sequences. The reason can be explained by Fig. 8, which shows the self-class and cross-class similarity of each prediction task. The more the cross-class similarity surpasses the self-class similarity, the more difficult it is to separate the two classes.

TABLE 7: Comparison with baselines of each class
Family    PSIGM    LLGC    SVM      HMM      NN
A         0.999    0.80    0.998    0.966    0.913
B         0.965    0.81    0.959    0.944    0.947
C         0.999    0.82    0.999    0.952    0.968
D         0.999    0.82    0.999    0.969    0.999
E         0.999    0.82    0.985    0.935    0.999
F         0.978    0.82    0.946    0.952    0.857
G         0.987    0.82    0.975    0.972    0.929

Mutation rate sensitivity. The performance of PSIGM is influenced by the setting of the mutation rate used to generate the simulated sequences. To evaluate how the mutation rate affects the performance, we increase it from 0.05 to 0.3 in steps of 0.05 and report the accuracy of each family's prediction task in Fig. 9. For most families, the accuracy increases as the mutation rate rises until it reaches a threshold of 15%, and then the performance begins to decrease. This suggests that the performance of PSIGM can be improved by a moderate setting of the mutation rate. In addition, we can infer that PSIGM is effective not only at identifying the affinity classes of inorganic material binding sequences, but also at predicting the homology families of traditional protein sequences.

[Figure 9: AUC (y axis, 0.93 to 1) versus mutation rate (x axis, 0.05 to 0.3) for families A to G.]
Fig. 9: Mutation rate sensitivity experiment.
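The construction of the seven identification tasks in Section 6.2 (one family as the positive set, the remaining families as the negative set) can be sketched as follows. This is a minimal illustration; the function name `one_vs_rest_tasks` and the dictionary layout are assumptions made here, not part of the original framework.

```python
from typing import Dict, List, Tuple


def one_vs_rest_tasks(
    family_to_sequences: Dict[str, List[str]],
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Build one (positive set, negative set) pair per family."""
    tasks = {}
    for family, positives in family_to_sequences.items():
        # Negatives: every sequence from every other family, in family order.
        negatives = [
            seq
            for other, seqs in family_to_sequences.items()
            if other != family
            for seq in seqs
        ]
        tasks[family] = (list(positives), negatives)
    return tasks
```

With seven families A to G as keys, the returned dictionary holds the seven tasks described in the text; for key "A", the negative set is the union of the sequences from B, C, D, E, F and G.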
[Figure 8: seven panels, (a) Class A to (g) Class G, each plotting the similarity score (y axis) of the self-class and cross-class comparisons between a class and its complement.]
Fig. 8: Total similarity scores (TSS) of the self-class and the cross-class for each prediction task based on Pam 250. (a) Self-class and cross-class TSS of class A and non-A; (b) self-class and cross-class TSS of class B and non-B; (c) self-class and cross-class TSS of class C and non-C; (d) self-class and cross-class TSS of class D and non-D; (e) self-class and cross-class TSS of class E and non-E; (f) self-class and cross-class TSS of class F and non-F; and (g) self-class and cross-class TSS of class G and non-G.

7 CONCLUSION AND FUTURE WORK
Identifying the affinity classes of peptide sequences binding to a specific inorganic material is a new and challenging research problem with broad applications. In this paper, we proposed a novel framework, PSIGM, to solve this problem. We began by providing a two-step method for generating simulated peptide sequences, which makes the training set more comprehensive and diverse. Moreover, unlike traditional machine learning approaches to protein sequence identification, which try to find patterns at the class level, our framework partitions the sequences into smaller clusters and learns the patterns from them using a graph-based optimization model. Extensive experimental studies demonstrate that the proposed framework can effectively identify the affinity classes of inorganic material binding sequences.
In the future, to achieve better performance, we plan to use a cyclic model to validate and retrain PSIGM. First, we will use PSIGM to select the sequences with the highest and lowest predicted probabilities of binding to a target inorganic material as a candidate set. Second, we plan to validate the candidate set using efficient experimental methods such as QCM (Quartz Crystal Microbalance). Finally, the validated sequences will be used to retrain PSIGM, new candidate sequences will be selected from the sequence database based on their predicted affinity, and the cycle will repeat. We believe that this cyclic validation model will allow us not only to further validate the effectiveness of PSIGM but also to keep improving it through retraining.
8 ACKNOWLEDGMENTS
This material is based upon work supported by the
Air Force Office of Scientific Research (AFOSR), grant
number FA9550-12-1-0226. We gratefully acknowledge
the Victorian Life Sciences Computation Facility (VLSCI)
for allocation of computational resources. TRW thanks
veski for an Innovation Fellowship.
REFERENCES
[1] G. P. Smith, “Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface,” Science, vol. 228, no. 4705, pp. 1315–1317, 1985.
[2] K. Y. Dane, C. Gottstein, and P. S. Daugherty, “Cell surface profiling with peptide libraries yields ligand arrays that classify breast tumor subtypes,” Molecular Cancer Therapeutics, vol. 8, no. 5, pp. 1312–1318, 2009.
[3] E. T. Boder and K. D. Wittrup, “Yeast surface display for screening combinatorial polypeptide libraries,” Nature Biotechnology, vol. 15, no. 6, pp. 553–557, 1997.
[4] E. Kasotakis, E. Mossou, L. Adler-Abramovich, E. P. Mitchell, V. T. Forsyth, E. Gazit, and A. Mitraki, “Design of metal-binding sites onto self-assembled peptide fibrils,” Biopolymers, vol. 92, no. 3, pp. 164–172, 2009.
[5] J. Kimling, M. Maier, B. Okenve, V. Kotaidis, H. Ballot, and A. Plech, “Turkevich method for gold nanoparticle synthesis revisited,” The Journal of Physical Chemistry B, vol. 110, no. 32, pp. 15700–15707, 2006.
[6] Y. Huang, C.-Y. Chiang, S. K. Lee, Y. Gao, E. L. Hu, J. De Yoreo, and A. M. Belcher, “Programmable assembly of nanoarchitectures using genetically engineered viruses,” Nano Letters, vol. 5, no. 7, pp. 1429–1434, 2005.
[7] K. T. Nam, D.-W. Kim, P. J. Yoo, C.-Y. Chiang, N. Meethong, P. T. Hammond, Y.-M. Chiang, and A. M. Belcher, “Virus-enabled synthesis and assembly of nanowires for lithium ion battery electrodes,” Science, vol. 312, no. 5775, pp. 885–888, 2006.
[8] R. R. Naik, S. J. Stringer, G. Agarwal, S. E. Jones, and M. O. Stone, “Biomimetic synthesis and patterning of silver nanoparticles,” Nature Materials, vol. 1, no. 3, pp. 169–172, 2002.
[9] E. Estephan, C. Larroque, F. J. G. Cuisinier, Z. Blint, and C. Gergely, “Tailoring GaN semiconductor surfaces with biomolecules,” The Journal of Physical Chemistry B, vol. 112, no. 29, pp. 8799–8805, 2008.
[10] E. Estephan, M.-b. Saab, C. Larroque, M. Martin, F. Olsson, S. Lourdudoss, and C. Gergely, “Peptides for functionalization of InP semiconductors,” Journal of Colloid and Interface Science, vol. 337, no. 2, pp. 358–363, 2009.
[11] M. M. Tomczak, M. K. Gupta, L. F. Drummy, S. M. Rozenzhak, and R. R. Naik, “Morphological control and assembly of zinc oxide using a biotemplate,” Acta Biomaterialia, vol. 5, no. 3, pp. 876–882, 2009.
[12] C. Vreuls, G. Zocchi, A. Genin, C. Archambeau, J. Martial, and C. V. De Weerdt, “Inorganic-binding peptides as tools for surface quality control,” Journal of Inorganic Biochemistry, vol. 104, no. 10, pp. 1013–1021, 2010.
[13] R. R. Naik, L. L. Brott, S. J. Clarson, and M. O. Stone, “Silica-precipitating peptides isolated from a combinatorial phage display peptide library,” Journal of Nanoscience and Nanotechnology, vol. 2, no. 1, pp. 95–100, 2002.
[14] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe, “Probing the interaction between peptides and metal oxides using point mutants of a TiO2-binding peptide,” Langmuir, vol. 24, no. 13, pp. 6852–6857, 2008.
[15] Y. Liu, J. Mao, B. Zhou, W. Wei, and S. Gong, “Peptide aptamers against titanium-based implants identified through phage display,” Journal of Materials Science: Materials in Medicine, vol. 21, no. 4, pp. 1103–1107, 2010.
[16] M. B. Dickerson, S. E. Jones, Y. Cai, G. Ahmad, R. R. Naik, N. Kröger, and K. H. Sandhage, “Identification and design of peptides for the rapid, high-yield formation of nanoparticulate TiO2 from aqueous solutions at room temperature,” Chemistry of Materials, vol. 20, no. 4, pp. 1578–1584, 2008.
[17] C. Tamerler, T. Kacar, D. Sahin, H. Fong, and M. Sarikaya, “Genetically engineered polypeptides for inorganics: A utility in biological materials science and engineering,” Materials Science and Engineering C, vol. 27, no. 3, pp. 558–564, 2007.
[18] H. Chen, X. Su, K.-G. Neoh, and W.-S. Choe, “QCM-D analysis of binding mechanism of phage particles displaying a constrained heptapeptide with specific affinity to SiO2 and TiO2,” Analytical Chemistry, vol. 78, no. 14, pp. 4872–4879, 2006.
[19] E. Eteshola, L. J. Brillson, and S. C. Lee, “Selection and characteristics of peptides that bind thermally grown silicon dioxide films,” Biomolecular Engineering, vol. 22, no. 5-6, pp. 201–204, 2005.
[20] S. Donatan, M. Sarikaya, C. Tamerler, and M. Urgen, “Effect of solid surface charge on the binding behaviour of a metal-binding peptide,” Journal of the Royal Society Interface, no. April, pp. rsif.2012.0060–, 2012.
[21] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala, “A novel knowledge-based approach to design inorganic-binding peptides,” Bioinformatics, vol. 23, no. 21, pp. 2816–2822, 2007.
[22] M. Hnilova, E. E. Oren, U. O. S. Seker, B. R. Wilson, S. Collino, J. S. Evans, C. Tamerler, and M. Sarikaya, “Effect of molecular conformations on the adsorption behavior of gold-binding peptides,” Langmuir, vol. 24, no. 21, pp. 12440–12445, 2008.
[23] A. Vila Verde, P. J. Beltramo, and J. K. Maranas, “Adsorption of homopolypeptides on gold investigated using atomistic molecular dynamics,” Langmuir, vol. 27, no. 10, pp. 5918–5926, 2011.
[24] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, “Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge Univ,” 1998.
[25] A. Kumar and L. Cowen, “Augmented training of hidden Markov models to recognize remote homologs via simulated evolution,” Bioinformatics, vol. 25, no. 13, pp. 1602–1608, 2009.
[26] C. H. Wu, S. Zhao, H. L. Chen, C. J. Lo, and J. McLarty, “Motif identification neural design for rapid and sensitive protein family search,” Computer applications in the biosciences CABIOS, vol. 12, no. 2, pp. 109–118, 1996.
[27] D. W. D. Wang, N. K. L. N. K. Lee, T. S. Dillon, and N. J. Hoogenraad, “Protein sequences classification using radial basis function (RBF) neural networks,” 2002.
[28] M. J. Grimble, “Adaptive systems for signal processing, communications and control,” Control, vol. 3, 2001.
[29] R. Karchin, K. Karplus, and D. Haussler, “Classifying G-protein coupled receptors with support vector machines,” Bioinformatics, vol. 18, no. 1, pp. 147–159, 2002.
[30] M. Wistrand and E. L. L. Sonnhammer, “Improving profile HMM discrimination by adapting transition probabilities,” Journal of Molecular Biology, vol. 338, no. 4, pp. 847–854, 2004.
[31] X. Zhu, “Semi-supervised learning literature survey,” SciencesNew York, vol. Tech. Rep., no. 1530, pp. 1–59, 2007.
[32] N. Du, M. R. Knecht, P. N. Prasad, M. T. Swihart, T. Walsh, and A. Zhang, “A framework for identifying affinity classes of inorganic materials binding peptide sequence,” ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (ACM BCB), 2013.
[33] N. Terrapon, O. Gascuel, E. Marechal, and L. Brehelin, “Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum,” BMC Bioinformatics, vol. 13, no. 1, p. 67, 2012.
[34] M.-W. M. M.-W. Mak, J. G. J. Guo, and S.-Y. K. S.-Y. Kung, “PairProSVM: protein subcellular localization based on local pairwise profile alignment and SVM,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 3, pp. 416–422, 2008.
[35] J. Tian, H. Gu, W. Liu, and C. Gao, “Robust prediction of protein subcellular localization combining PCA and WSVMs,” Computers in Biology and Medicine, vol. 41, no. 8, pp. 648–652, 2011.
[36] L. Ge, N. Du, and A. Zhang, “Finding informative genes from multiple microarray experiments: A graph-based consensus maximization model,” in Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM ’11, pp. 506–511, 2011.
[37] Z. Tang, J. Palafox-Hernandez, W.-C. Law, Z. Hughes, M. T. Swihart, P. N. Prasad, M. R. Knecht, and T. R. Walsh, “Biomolecular recognition principles for bionanocombinatorics: An integrated approach to elucidate enthalpic and entropic factors,” ACS Nano, Article ASAP, DOI: 10.1021/nn404427y, vol. 7, pp. 9632–9646, 2013.
[38] Y. N. Tan, J. Y. Lee, and D. I. C. Wang, “Uncovering the design rules for peptide synthesis of metal nanoparticles,” Journal of the American Chemical Society, vol. 132, no. 16, pp. 5677–5686, 2010.
[39] M. Hnilova, C. R. So, E. E. Oren, B. R. Wilson, T. Kacar, C. Tamerler, and M. Sarikaya, “Peptide-directed co-assembly of nanoprobes on multimaterial patterned solid surfaces,” Soft Matter, vol. 8, pp. 4327–4334, 2012.
[40] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[41] W. R. Pearson, “Rapid and sensitive sequence comparison with FASTP and FASTA,” Methods in Enzymology, vol. 183, no. 1988, pp. 63–98, 1990.
[42] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.
[43] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: A structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.
[44] Afiahayati and S. Hartati, “Multiple sequence alignment using hidden Markov model with augmented set based on Blosum 80 and its influence on phylogenetic accuracy,” 2010.
[45] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[46] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[47] D. P. Bertsekas, Nonlinear Programming, vol. 43. Athena Scientific, 1995.
[48] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” Advances in Neural Information Processing Systems 16, vol. 1, pp. 595–602, 2003.
[49] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, “A model of evolutionary change in proteins,” Atlas of protein sequence and structure, vol. 5, no. Suppl 3, pp. 345–352, 1978.
[50] G. H. Gonnet, M. A. Cohen, and S. A. Benner, “Exhaustive matching of the entire protein sequence database,” Science, vol. 256, no. 5062, pp. 1443–1445, 1992.
Nan Du Nan Du received his B.S. degree from Guangdong University of Technology in 2006. After that, he received his M.S. degree from Southern China University of Technology in 2009. Since 2009, he has been working toward the Ph.D. degree at the State University of New York at Buffalo, NY, under the supervision of Prof. Aidong Zhang. His research interests are in the areas of data mining, machine learning and bioinformatics.
Marc R. Knecht Marc R. Knecht earned a B.S. degree in Chemistry from Duquesne University in 2001. In 2004, he received a Ph.D. in Bio-Inspired Chemistry from Vanderbilt University under the direction of Professor David W. Wright, followed by postdoctoral research at the University of Texas with Professor Richard M. Crooks focused on characterizing the structure/function relationship of nanocatalysts. After completing his postdoctoral studies, he began his independent career as an assistant professor of Chemistry at the University of Kentucky. In the summer of 2011, Professor Knecht joined the Department of Chemistry at the University of Miami as an associate professor. During his independent career, Professor Knecht has established a research program focused on elucidating the effects of the biotic/abiotic interface of bio-inspired nanomaterials. In this regard, his group has employed high-resolution characterization, activity studies, and synthetic analyses of peptides to demonstrate that the biological surface of bionanomaterials possesses significant control over the functionality and could serve as modification sites to control the activity. He has authored 47 publications in this area.
Mark T. Swihart Mark T. Swihart is a Professor
in the Department of Chemical and Biological
Engineering at the University at Buffalo (SUNY).
He earned a B.S. in Chemical Engineering from
Rice University in 1992, and a Ph.D. in Chemical Engineering in 1997 from the University of
Minnesota. He then spent one year as a postdoctoral researcher in Mechanical Engineering
at the University of Minnesota before joining the
University at Buffalo as an assistant professor in
1998. Since 2007, he has directed a university-wide strategic initiative in Integrated Nanostructured Systems. His
research interests include synthesis, processing, and applications of
nanoparticles and other nanomaterials, and he has co-authored more
than 120 journal papers in these areas. Dr. Swihart is a recipient of
the Kenneth Whitby award from the American Association for Aerosol
Research, the Schoellkopf medal from the Western New York section
of the American Chemical Society, and the J.B. Wagner award from the
Electrochemical Society.
Zhenghua Tang Zhenghua Tang is currently a postdoctoral research associate in the group of Marc R. Knecht at the University of Miami. He obtained his B.S. degree at the College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, Gansu, P. R. China in 2005. He attended graduate school there from Aug. 2005 to Jun. 2007. During his graduate study, he went to the Institute of Chemistry, Chinese Academy of Science (ICCAS) as a visiting student for about one year (2006-2007). In August 2007, he moved to the US, and he obtained his Ph.D. degree in chemistry from the Department of Chemistry, Georgia State University in July 2012. He has held his current position since August 2012. His research interests focus on bio-inspired nanomaterials for targeted applications, including bionanocombinatorics, self-assembly, catalysis, and multifunctional design. He is the recipient of the 2010 chairs award in the Chemistry Department at GSU as well as the 2011 Chinese Government Award for Outstanding Self-Financed Students Abroad.
Tiffany R. Walsh Tiff Walsh graduated with a
B.Sci(Hons) from the University of Melbourne.
She earned her PhD degree in theoretical chemistry from the University of Cambridge, U.K.,
working in the group of Prof. David Wales in the
Dept. of Chemistry as a Cambridge Commonwealth Trust scholar. Walsh then joined the Dept.
of Materials, University of Oxford, U.K. as a postdoctoral researcher in the Materials Modelling
Laboratory (MML) with Prof. Adrian Sutton. She
was then awarded a Glasstone fellowship, which
she held in the MML in Oxford. In 2002, she joined the faculty of the
University of Warwick, U.K., as a joint appointment in the Dept. of Chemistry and the Centre for Scientific Computing. Her research interests
focus on computational modelling of the interface between biomolecules
and inorganic surfaces, using molecular dynamics simulations. She was
a lead investigator in the team that won £5.3 M ($US 8.2 M) of funding
for a 5-year EPSRC Programme Grant in this area (started in Oct
2010). In 2012, Walsh joined the Institute for Frontier Materials at Deakin
University in Australia, where she holds the position of Associate Prof.
in BioNanotechnology.
Aidong Zhang Dr. Aidong Zhang is University
Distinguished Professor and Chair in the Department of Computer Science and Engineering
at the State University of New York at Buffalo. Her
research interests include bioinformatics, data
mining, multimedia and database systems, and
content-based image retrieval. She is an author
of over 250 research publications in these areas.
She has chaired or served on over 100 program committees of international conferences
and workshops, and currently serves on several
journal editorial boards. She has published two books: Protein Interaction
Networks: Computational Analysis (Cambridge University Press, 2009)
and Advanced Analysis of Gene Expression Microarray Data (World
Scientific Publishing Co., Inc., 2006). Dr. Zhang is a recipient of the
National Science Foundation CAREER award and State University of
New York (SUNY) Chancellor’s Research Recognition award. Dr. Zhang
is an IEEE Fellow.