Clustering and overlapping modules detection in PPI network based on IBFO

DOI 10.1002/pmic.201200309
Proteomics 2013, 13, 278–290
RESEARCH ARTICLE
Clustering and overlapping modules detection in PPI
network based on IBFO
Xiujuan Lei (1), Shuang Wu (1), Liang Ge (2) and Aidong Zhang (2)
(1) College of Computer Science, Shaanxi Normal University, Xi’an, P. R. China
(2) Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA
Traditional clustering algorithms do not work well on protein–protein interaction networks because of the topological features of these networks. This paper proposes an improved clustering method, improved BFO (IBFO), based on the bacteria foraging optimization (BFO) mechanism and the intuitionistic fuzzy set, in which trigonometric functions define the membership degrees and the indeterminacy degree is introduced to detect overlapping modules. In the chemotactic operation of BFO, the algorithm initializes a cluster center according to the comprehensive network feature value of each node and eliminates isolated points according to the edge-clustering coefficient. In the reproduction operation of BFO, the nodes with high membership degrees are merged into the cluster to which the cluster center belongs and are labeled as visited nodes. Nodes that also have high indeterminacy degrees are visited again when another cluster is generated. The elimination–dispersal operation is equivalent to selecting the next cluster center. Finally, the algorithm merges the clusters that have high similarity. The results show that the algorithm not only determines the cluster number automatically and improves the f-measure value of the cluster results, but also successfully identifies the overlaps in protein–protein interaction networks.
Received: July 25, 2012
Revised: September 19, 2012
Accepted: October 11, 2012
Keywords:
Bacteria foraging optimization / Bioinformatics / Indeterminacy degree / Overlap /
Protein–protein interaction networks
Correspondence: Dr. Xiujuan Lei, College of Computer Science, Shaanxi Normal University, Xi’an, Shaanxi Province, 710062, P. R. China
E-mail: xjlei@snnu.edu.cn
Fax: +86 29 85310161
Abbreviations: BFO, bacteria foraging optimization; CNFV, comprehensive network feature value; PPI, protein–protein interaction
Colour Online: See the article online to view Figs. 3, 4 and 7–10 in colour.
© 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.com
1 Introduction
In the postgenomic era, the research focus of biological science has gradually shifted from genomics to proteomics. Recently, the rapid development of proteomics and the explosion of protein–protein interaction (PPI) datasets have drawn more and more researchers to investigate PPI networks in order to predict the functions of unknown proteins. Research on PPI networks contributes to predicting the functions of unknown proteins at the molecular level and to uncovering regularities of cellular activities such as growth, development, and metabolism. In addition, it is extremely helpful in the diagnosis of major diseases and the intensive study of therapy, and it stimulates the development of biology, medicine, bioinformatics, and related fields.
PPI networks share the small-world feature [1], which is characterized by high clustering coefficients. They are also scale-free [2], which suggests an important topological feature of PPI networks, namely modularity [3]. It is therefore natural to use clustering methods to predict the functional modules. However, traditional clustering methods such as hierarchical clustering, density-based methods, and fuzzy clustering algorithms [4,5] either require prior knowledge of the cluster number or are sensitive to noisy data. Nabieva et al. [6] first put forward the functional-flow model to explore the underlying structure of PPI networks. The experimental results showed that the method performed well. However, the running time
was relatively high. Cho et al. [7] proposed another flow-based modularization algorithm to predict the overlapping functional modules in a weighted graph, but the f-measure [8] value of the clustering result was relatively low. Kenley et al. [9] proposed a novel information-theoretic definition, graph entropy, as a measure of the structural complexity of a graph. The results showed that the approach had higher accuracy in predicting protein complexes. Recently, intelligent optimization algorithms have found broad applications in many research fields. Goel et al. [10] presented software for the generation and analysis of dynamic, four-dimensional PPI networks. Our research [10] successfully applied the quantum-behaved particle swarm optimization algorithm and the artificial bee colony algorithm to optimize the functional-flow model, and also utilized joint-strength-based ant colony optimization to cluster the functional modules. The results showed that these methods performed well in identifying the functional modules.
Following the rapid development of intelligent algorithms, Passino [14] presented the bacteria foraging optimization (BFO) algorithm, inspired by the social foraging behavior of Escherichia coli. The BFO algorithm has drawn the attention of researchers from various fields such as harmonic estimation, transmission loss reduction, and machine learning. To explore the searching ability of the BFO algorithm, several researchers integrated it with other intelligent methods [15] such as the genetic algorithm and the particle swarm optimization algorithm. We then designed a new model that takes advantage of the BFO algorithm to cluster PPI networks [16]. The initial positions of the bacteria are treated as cluster centers, and the cluster modules are generated during the reproduction and elimination–dispersal operations. The experimental results showed that the method could effectively improve the accuracy of cluster results. However, the recall value was low, and the algorithm ignored the fact that a protein may belong to two or more clusters.
Fuzzy methods [17], introduced in 1965, have been adopted in cluster analysis. Atanassov [18] extended the theory of fuzzy sets and proposed the intuitionistic fuzzy set, which includes the concepts of membership degree, nonmembership degree, and indeterminacy degree. In this paper, we adopt the indeterminacy degree of protein nodes to account for the overlapping functional modules of PPI networks and thereby improve the recall value of cluster results.
In this paper, we propose a novel model and algorithm that uses the BFO mechanism and the intuitionistic fuzzy set to handle the overlapping functional modules of PPI networks and to determine the cluster number automatically. In Section 2, the principle of the BFO algorithm, the definitions of the intuitionistic fuzzy set, several graph concepts, and the evaluation criteria of PPI network clustering are briefly presented. Section 3 describes the model design and implementation steps of the improved algorithm, which integrates the intuitionistic fuzzy set into the mechanism of the BFO algorithm. In Section 4, we execute the algorithm and compare it with the BFO algorithm of ref. [16] and the functional-flow algorithm [7]. The experimental results show that the algorithm is superior to the BFO algorithm in f-measure value.
2 Materials and methods
2.1 Basic principles and concepts
2.1.1 Principle of BFO algorithm
BFO algorithm [19] is an intelligent algorithm that has been
widely accepted as a global optimization searching method.
It is composed of chemotactic operation, reproduction operation, and elimination–dispersal operation.
2.1.1.1 Chemotactic operation
Figure 1(A) shows that the bacterium moves in two different ways, swimming and tumbling, by means of a set of tensile flagella. When all the flagella rotate clockwise, each flagellum operates relatively independently of the others, and the bacterium tumbles. When all the flagella rotate counterclockwise, they push the bacterium so that it swims in one direction at a very fast rate. Figure 1(B) shows the swimming and tumbling behavior of the bacterium in a neutral medium. Figure 1(C) shows a nutrient concentration gradient: the darker the shade, the higher the nutrient concentration. The bacterium alternately swims and tumbles in order to move up the nutrient gradient and avoid noxious environments. The BFO algorithm models this phenomenon as the chemotactic behavior, which largely broadens the local exploring ability of the algorithm.
2.1.1.2 Reproduction operation
In general, bacteria that become incapable of finding food during the chemotactic behavior are eliminated. To maintain the size of the population, the remaining bacteria replicate themselves and generate new individuals. To improve the global convergence speed and efficiency, the BFO algorithm generally selects the bacteria that rank in the top half of the population to reproduce, generating new individuals that are identical to the original bacteria.
2.1.1.3 Elimination–dispersal operation
Owing to sudden changes in the local environment, the bacteria population may gradually become unable to adapt, so that a group of bacteria is either killed or dispersed to a new location. This phenomenon is simulated by the elimination–dispersal operation, which is executed with a certain probability. If a bacterium satisfies the probability of the elimination–dispersal operation, it dies and the algorithm generates a new individual at a random position in the feasible solution space. The elimination–dispersal operation enhances the random searching ability of the BFO algorithm, maintains the diversity of the population, and avoids premature convergence.
Figure 1. The foraging behavior of bacteria [20].
Define a chemotactic step to be a tumble followed by a tumble, or a tumble followed by a run. Let j be the index of the chemotactic step, k the index of the reproduction step, and l the index of the elimination–dispersal event. The positions of the S bacteria in the population at the jth chemotactic step, kth reproduction step, and lth elimination–dispersal event are represented as follows:
$$P(j, k, l) = \{x_i(j, k, l) \mid i = 1, 2, \ldots, S\}.$$
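The three operations and the indices j, k, and l can be illustrated with a toy, general-purpose BFO loop that minimizes a continuous cost function. This is only a minimal sketch and not the clustering algorithm of this paper: the function name `bfo_minimize`, every parameter name and default value, the step size, and the search bounds are assumptions chosen for illustration.

```python
import random

def bfo_minimize(cost, dim=2, S=10, Nc=20, Ns=4, Nre=4, Ned=2,
                 p_ed=0.25, step=0.1, bounds=(-5.0, 5.0), seed=0):
    """Toy bacteria foraging optimization (illustrative parameters, S even)."""
    rng = random.Random(seed)
    lo, hi = bounds
    new_pos = lambda: [rng.uniform(lo, hi) for _ in range(dim)]
    pop = [new_pos() for _ in range(S)]          # positions x_i, i = 1..S
    best = min(pop, key=cost)[:]

    for l in range(Ned):                         # elimination-dispersal events
        for k in range(Nre):                     # reproduction steps
            health = [0.0] * S
            for j in range(Nc):                  # chemotactic steps
                for i in range(S):
                    # Tumble: draw a random unit direction.
                    d = [rng.gauss(0, 1) for _ in range(dim)]
                    norm = sum(v * v for v in d) ** 0.5 or 1.0
                    d = [v / norm for v in d]
                    # Swim: keep moving while the cost improves (at most Ns moves).
                    for _ in range(Ns):
                        cand = [min(hi, max(lo, x + step * v))
                                for x, v in zip(pop[i], d)]
                        if cost(cand) < cost(pop[i]):
                            pop[i] = cand
                        else:
                            break
                    health[i] += cost(pop[i])
                    if cost(pop[i]) < cost(best):
                        best = pop[i][:]
            # Reproduction: the half with the lowest accumulated cost is duplicated.
            order = sorted(range(S), key=lambda i: health[i])
            survivors = [pop[i] for i in order[:S // 2]]
            pop = [p[:] for p in survivors] + [p[:] for p in survivors]
        # Elimination-dispersal: relocate each bacterium with probability p_ed.
        pop = [new_pos() if rng.random() < p_ed else p for p in pop]
    return best

sphere = lambda x: sum(v * v for v in x)
print(sphere(bfo_minimize(sphere)))
```

The nesting of the three loops mirrors the indexing of P(j, k, l): chemotaxis is the innermost loop, reproduction wraps it, and elimination–dispersal is the outermost event.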
2.1.2 Fuzzy set
In the typical setting, the clusters are nonoverlapping. However, PPI networks contain many overlapping modules. Therefore, we adopt the fuzzy concept that each node belongs to certain clusters with a probability between 0 and 1.
Definition 1. The fuzzy set A in the domain X is defined as follows [17]:
$$A = \{(x, A(x)) \mid x \in X\},$$
where A(x) is the membership function, a mapping A: X → M in which M represents the membership space. The membership function A(x) denotes the membership degree, or the probability, that the element x belongs to the fuzzy set A. Therefore, each element (x, A(x)) in the fuzzy set A expresses the membership degree of the element x.
Definition 2. Let a set X be fixed. An intuitionistic fuzzy set [21] is an object having the form
$$B = \{\langle x, \mu_B(x), \nu_B(x) \rangle \mid x \in X\},$$
where the functions μB(x) and νB(x) define the membership degree and nonmembership degree of the element x with respect to the set B, respectively. In addition, for each object x in the set B, the sum of μB(x) and νB(x) lies between 0 and 1. The indeterminacy degree of the element x is equal to 1 − μB(x) − νB(x).
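Definition 2 can be made concrete with a few lines of code. The sketch below stores an intuitionistic fuzzy set as a mapping from elements to (membership, nonmembership) pairs and derives the indeterminacy degree; the function name and the example values are hypothetical.

```python
def indeterminacy(mu: float, nu: float) -> float:
    """Indeterminacy degree 1 - mu - nu of one element (Definition 2)."""
    if mu < 0.0 or nu < 0.0 or mu + nu > 1.0:
        raise ValueError("Definition 2 requires mu, nu >= 0 and mu + nu <= 1")
    return 1.0 - mu - nu

# Hypothetical intuitionistic fuzzy set: element -> (mu_B(x), nu_B(x)).
B = {"p1": (0.75, 0.25), "p2": (0.5, 0.25), "p3": (0.0, 1.0)}
pi_B = {x: indeterminacy(mu, nu) for x, (mu, nu) in B.items()}
print(pi_B)
```

An element with a large indeterminacy degree is neither clearly inside nor clearly outside the set, which is the property the paper later uses to flag candidate overlap nodes.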
2.1.3 Relevant concepts of graph
In an undirected graph, the degree of a node is the number of its direct neighboring nodes. For weighted graphs, the weighted degree of a node i [8] is the sum of the weights of the edges between node i and its neighbors,
$$w(i) = \sum_{j \in N(i)} w(i, j). \tag{1}$$
For a node i, let ki denote the node degree and ni the number of edges linking the neighbor nodes of node i with each other. The node clustering coefficient is calculated as follows [8]:
$$C_i = \frac{2 n_i}{k_i (k_i - 1)}. \tag{2}$$
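Eqs. (1) and (2) can be computed directly from an adjacency structure. The following sketch assumes the weighted graph is stored as a symmetric dict of dicts; the toy graph, the function names, and the convention that a node with fewer than two neighbors gets a coefficient of 0 are illustrative assumptions.

```python
def weighted_degree(w, i):
    """Eq. (1): sum of the weights of the edges between node i and its neighbors."""
    return sum(w[i].values())

def clustering_coefficient(w, i):
    """Eq. (2): C_i = 2 n_i / (k_i (k_i - 1)); taken as 0 when k_i < 2 (assumption)."""
    nbrs = list(w[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    # n_i: number of edges among the neighbors of node i.
    n_i = sum(1 for p in range(k) for q in range(p + 1, k)
              if nbrs[q] in w[nbrs[p]])
    return 2.0 * n_i / (k * (k - 1))

# Hypothetical weighted graph, stored symmetrically.
w = {
    "a": {"b": 0.5, "d": 1.0},
    "b": {"a": 0.5, "d": 0.8},
    "c": {"d": 0.3},
    "d": {"a": 1.0, "b": 0.8, "c": 0.3},
}
print(weighted_degree(w, "a"), clustering_coefficient(w, "a"))
print(weighted_degree(w, "d"), clustering_coefficient(w, "d"))
```

Node a has two neighbors that are themselves connected, so its coefficient is 1; node d has three neighbors with a single edge among them, giving 1/3.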
With respect to Ci, the edge clustering coefficient WEi,j is defined as the ratio of the number of triangles containing the edge to the number of all possible triangles including this edge. In a weighted graph, the edge clustering coefficient [22] is calculated as follows:
$$WE_{i,j} = \frac{\sum_{k \in I_{i,j}} w(i, k) \cdot \sum_{k \in I_{i,j}} w(j, k)}{\sum_{s \in N_i} w(i, s) \cdot \sum_{t \in N_j} w(j, t)}. \tag{3}$$
In Eq. (3), the sets Ni and Nj represent the sets of adjacent nodes of node i and node j, respectively, and w(i, s) stands for the weight of the edge linking node i with node s. The set Ii,j is the set of common neighbors of nodes i and j, that is, Ii,j = Ni ∩ Nj. The weighted edge clustering coefficient is thus the ratio between two products: the product of the summed weights of the edges connecting the two nodes (i, j) with their common neighbors (k), and the product of the summed weights of the edges linking the two nodes (i, j) with all their respective neighbors (s, t). The edge clustering coefficient is not sensitive to false-positive data. Therefore, it is well suited to large-scale PPI data, which contain many false positives.
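A direct transcription of Eq. (3) is sketched below under the assumption that the weighted graph is a symmetric dict of dicts; the toy graph and the function name are hypothetical, and returning 0 for a zero denominator is a convention chosen here.

```python
def edge_clustering_coefficient(w, i, j):
    """Eq. (3) for the edge (i, j) of a symmetric dict-of-dicts weighted graph."""
    common = set(w[i]) & set(w[j])                 # I_ij = N_i ∩ N_j
    num = sum(w[i][k] for k in common) * sum(w[j][k] for k in common)
    den = sum(w[i].values()) * sum(w[j].values())  # sums over N_i and N_j
    return num / den if den else 0.0

# Hypothetical weighted graph.
w = {
    "a": {"b": 0.5, "d": 1.0},
    "b": {"a": 0.5, "d": 0.8},
    "c": {"d": 0.3},
    "d": {"a": 1.0, "b": 0.8, "c": 0.3},
}
print(edge_clustering_coefficient(w, "a", "b"))
print(edge_clustering_coefficient(w, "c", "d"))
```

The edge (c, d) has no common neighbor, so its coefficient is 0; a node whose incident edges all have coefficient 0 is the kind of node the algorithm later treats as isolated.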
2.1.4 Object function
The density of a subgraph s is defined by the following equation:
$$D(s) = \frac{2e}{n(n-1)}, \tag{4}$$
where the parameter n represents the number of nodes and e is the number of edges connecting protein nodes with each other in the subgraph s. For weighted PPI networks, each cluster can be assessed according to Eq. (5):
$$WD(s) = \frac{2 \sum_{i=1}^{n-1} \sum_{j=2}^{n} WE_{i,j}}{n(n-1)}, \tag{5}$$
where the symbols i and j stand for the ith protein node and the jth protein node in the subgraph s, respectively, and WEi,j is the edge clustering coefficient of the edge connecting the ith node with the jth node. The value WD(s) thus gives the average edge clustering coefficient over the protein nodes in cluster s. Suppose that the obtained cluster number is numclu; a set of clusters C can then be assessed as follows:
$$fun(C) = \frac{1}{numclu} \sum_{k=1}^{numclu} WD(s)_k. \tag{6}$$
2.1.5 Evaluation criteria of cluster results
Precision, recall, and p-value are usually adopted to evaluate clustering results [8]. Suppose that X represents one cluster in the cluster results and Fi stands for the matched cluster in the standard PPI dataset. Then
$$precision(X, F_i) = \frac{|X \cap F_i|}{|X|}, \tag{7}$$
$$recall(X, F_i) = \frac{|X \cap F_i|}{|F_i|}, \tag{8}$$
where the expression |X ∩ Fi| stands for the number of common proteins between clusters X and Fi. However, both of these evaluation criteria are biased for clusters of different sizes. Therefore, in order to balance the precision and recall values, we can define the f-measure value as follows:
$$f\text{-measure} = \frac{2}{\dfrac{1}{precision} + \dfrac{1}{recall}}. \tag{9}$$
In PPI networks, protein modules can be statistically evaluated using the p-value from the hypergeometric distribution, which is defined as
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{|X|}{i} \binom{|V| - |X|}{n - i}}{\binom{|V|}{n}}, \tag{10}$$
where |V| is the total number of proteins, |X| is the number of proteins in a reference function, n is the number of proteins in an identified module, and k is the number of proteins in common between the function and the module.
2.2 Methods
2.2.1 Data preprocessing
2.2.1.1 Calculation of distance among protein nodes
In PPI networks, protein names can be mapped to positive integers and the data converted into an adjacency matrix P. Assume that the number of protein nodes is n and that Xi represents the ith protein, denoted as Xi = (Pi1, Pi2, . . . , Pin). The inner product of two protein nodes is calculated by the equation Xij = (Pi1, Pi2, . . . , Pin) • (Pj1, Pj2, . . . , Pjn) = Pi1 × Pj1 + Pi2 × Pj2 + . . . + Pin × Pjn. The similarity between nodes i and j is defined as follows [23]:
$$S_{ij} = \frac{\sum_{k=1}^{n} \min(X_{ik}, X_{jk})}{\sum_{k=1}^{n} \max(X_{ik}, X_{jk})}. \tag{11}$$
A protein node not only interacts with adjacent protein nodes, but also interacts with other protein nodes via one or several intermediate protein nodes.
As Fig. 2 shows, the protein node a can be denoted as Xa = (paa, pab, pac, pad, pae, paf) = (0, 1, 0, 1, 0, 0). Similarly, Xb = (1, 0, 0, 1, 0, 0), Xc = (0, 0, 0, 1, 0, 0), Xd = (1, 1, 1, 0, 1, 1), Xe = (0, 0, 0, 1, 0, 0), and Xf = (0, 0, 0, 1, 0, 0). The associated matrix is obtained as follows:
$$X = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$
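The similarity of Eq. (11) can be checked numerically on the six-node subgraph of Fig. 2. The sketch below is a plain-Python illustration: it first forms all pairwise inner products of the adjacency rows and then applies the min/max ratio. The function names are hypothetical.

```python
def inner_products(rows):
    """Pairwise inner products X_ij of the adjacency rows."""
    return [[sum(p * q for p, q in zip(ri, rj)) for rj in rows] for ri in rows]

def similarity(inner, i, j):
    """Eq. (11): sum_k min(X_ik, X_jk) / sum_k max(X_ik, X_jk)."""
    num = sum(min(a, b) for a, b in zip(inner[i], inner[j]))
    den = sum(max(a, b) for a, b in zip(inner[i], inner[j]))
    return num / den if den else 0.0

# Adjacency rows of nodes a..f in the Fig. 2 example.
rows = [
    [0, 1, 0, 1, 0, 0],  # a
    [1, 0, 0, 1, 0, 0],  # b
    [0, 0, 0, 1, 0, 0],  # c
    [1, 1, 1, 0, 1, 1],  # d
    [0, 0, 0, 1, 0, 0],  # e
    [0, 0, 0, 1, 0, 0],  # f
]
inner = inner_products(rows)
s_ab = similarity(inner, 0, 1)
d_ab = 1 - s_ab
print(s_ab, d_ab)  # 0.75 0.25
```

The distance used in the rest of the method is simply one minus this similarity, so nodes that share many two-step connections end up close together.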
Figure 2. A sketch subgraph of PPI network.
Then, the value Xaa = Xa • (Xa)T = (0, 1, 0, 1, 0, 0) • (0, 1, 0, 1, 0, 0)T = 2, Xab = Xa • (Xb)T = (0, 1, 0, 1, 0, 0) • (1, 0, 0, 1, 0, 0)T = 1, and Xbb = Xb • (Xb)T = (1, 0, 0, 1, 0, 0) • (1, 0, 0, 1, 0, 0)T = 2. In a similar way, Xac = 1, Xbc = 1, Xad = 1, Xbd = 1, Xae = 1, Xbe = 1, Xaf = 1, and Xbf = 1. Therefore, the similarity between nodes a and b is Sab = (1 + 1 + 1 + 1 + 1 + 1)/(2 + 2 + 1 + 1 + 1 + 1) = 6/8 = 0.75. Clearly, the higher the similarity between two protein nodes, the shorter the distance between them. So, the distance between the two protein nodes a and b is denoted as dab = 1 − Sab = 0.25.
The similarity between two protein nodes can be calculated according to Eq. (11), while the similarity between a module Mi and a different module Mj is measured [7] by Eq. (12):
$$S(M_i, M_j) = \frac{\sum_{x \in M_i,\, y \in M_j} c(x, y)}{\min(|M_i|, |M_j|)}, \tag{12}$$
where
$$c(x, y) = \begin{cases} 1 & \text{if } x = y \\ w(x, y) & \text{if } x \neq y \text{ and } (x, y) \in E \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$
2.2.1.2 Initialization of cluster center
The node clustering coefficient merely measures the joint density and strength among the nodes in the local proximity of a node. In addition to this feature, the comprehensive network feature value (CNFV) of a node [8] reveals the joint strength between this node and other nodes. The CNFV of node i is defined as follows:
$$CNFV_i = \beta \times C_i + (1 - \beta) \times w(i)/n. \tag{14}$$
The parameter β is a random number between 0 and 1, and n stands for the number of protein nodes in the PPI network.
2.2.2 Determination of membership function and nonmembership function
The concepts of membership degree μ and nonmembership degree ν play an important role in the clustering procedure, and it is essential to construct appropriate functions to calculate the membership degree and nonmembership degree among protein nodes. If two protein nodes are close, the possibility that they can be grouped into one cluster is high, and the membership degree between the two nodes is close to 1. As the distance increases, the membership degree gradually decreases. The relationship between the distance among protein nodes and the membership degree can be roughly described as in Fig. 3(A).
Figure 3. The relationships among membership degree, nonmembership degree, and distance of protein nodes.
The membership function is obtained according to the corresponding relationship between the membership degree and the distance of two protein nodes:
$$\mu_{ij} = \begin{cases} 1 & 0 \le d_{ij} < 0.1 \\ \cos\left(d_{ij} \cdot \dfrac{\pi}{2}\right) & 0.1 \le d_{ij} < 0.9 \\ 0 & 0.9 \le d_{ij} \le 1 \end{cases} \tag{15}$$
Conversely, when two protein nodes are relatively close, they can be merged into one cluster, so the nonmembership degree between the two nodes is considered to be close to 0. As the distance increases from 0 to 1, it becomes less likely that the two nodes belong to the same cluster, and the nonmembership degree keeps rising. The relation between the nonmembership degree and the distance of two protein nodes is illustrated in Fig. 3(B). Similarly, the nonmembership degree is calculated as follows:
$$\nu_{ij} = \begin{cases} 0 & 0 \le d_{ij} < 0.1 \\ \sin\left(d_{ij} \cdot \dfrac{\pi}{2}\right) & 0.1 \le d_{ij} < 0.9 \\ 1 & 0.9 \le d_{ij} \le 1 \end{cases} \tag{16}$$
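The piecewise functions of Eqs. (15) and (16) translate directly into code. The sketch below is illustrative only, and the function names are hypothetical. One point worth noting when experimenting with it: since cos x + sin x ≥ 1 on [0, π/2], the raw value 1 − μ − ν from Section 2.1.2 is not positive for 0.1 ≤ d < 0.9. How the indeterminacy degree is normalized is not stated in this section, so the sketch prints the raw value and makes no assumption about any rescaling.

```python
import math

def membership(d):
    """Eq. (15): membership degree for a distance d in [0, 1]."""
    if d < 0.1:
        return 1.0
    if d < 0.9:
        return math.cos(d * math.pi / 2)
    return 0.0

def nonmembership(d):
    """Eq. (16): nonmembership degree for a distance d in [0, 1]."""
    if d < 0.1:
        return 0.0
    if d < 0.9:
        return math.sin(d * math.pi / 2)
    return 1.0

for d in (0.05, 0.25, 0.5, 0.95):
    mu, nu = membership(d), nonmembership(d)
    # Raw 1 - mu - nu, shown without any normalization (none is specified).
    print(d, round(mu, 3), round(nu, 3), round(1 - mu - nu, 3))
```

For the distance d_ab = 0.25 of the Fig. 2 example, the membership degree is cos(π/8), roughly 0.92, which is above the membership threshold chosen later in the experiments.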
Figure 4. The overlap of protein functional modules.
Table 1. Corresponding relationship of the IBFO mechanism and the PPI networks clustering
BFO algorithm | The problem of clustering PPI networks
Bacterium | Protein node
Chemotactic operation | Eliminate isolated node according to edge-clustering coefficient
Reproduction operation | Merge nodes into the cluster that cluster center belongs to
Membership degree | Nodes that possess high membership degree are grouped into the cluster
Indeterminacy degree | Node that has a lower indeterminacy degree than the given threshold is merged into this cluster. Otherwise merge node into cluster and mark this node as unvisited node
Elimination–dispersal operation | Randomly choose one node as the new cluster center
Food distribution function | Average value of the edge-clustering coefficient in each cluster
2.2.3 Model design of improved algorithm
The BFO algorithm in ref. [16] mapped three behaviors of bacteria, namely the chemotactic, reproduction, and elimination–dispersal operations, onto the problem of clustering PPI networks. However, the recall value of the cluster results is relatively low. Because PPI networks are distinct from other complex networks and have small-world and scale-free characteristics, a large number of proteins have few interactions with other proteins and are abandoned in the clustering procedure.
In fact, a protein (black node in Fig. 4) in a real PPI network can be included in several different protein complexes to perform different functions, i.e., protein functional modules overlap with each other, as Fig. 4 shows. Naturally, the concepts of membership degree and indeterminacy degree in the intuitionistic fuzzy set can be introduced to detect the overlapping modules. This paper proposes an improved BFO clustering algorithm based on the intuitionistic fuzzy set.
We adopt the principle of improved bacteria foraging optimization (IBFO) to cluster PPI networks. During the clustering procedure, one bacterium is regarded as one protein node. The correspondence between the IBFO mechanism and PPI network clustering is listed in Table 1.
The clustering model based on IBFO is shown in Fig. 5. Figure 6 is the flow chart of the clustering method based on IBFO.
Here, flag represents the parameter controlling the chemotactic operation, and Nre and Ned stand for the parameters of the reproduction operation and the elimination–dispersal operation, respectively. The counters of the reproduction operation and the elimination–dispersal operation are denoted as k and l. In the preliminary stage, k and l are both set to 0.
First, in the chemotactic operation, the algorithm takes advantage of the edge clustering coefficient to eliminate the isolated protein nodes. For each protein node i in the PPI network, the algorithm calculates the sum of the edge clustering coefficients of the edges connecting node i with other protein nodes. If the sum is zero, node i is regarded as an isolated node and abandoned. Then, the algorithm chooses one protein node that has a high CNFV value as the initial cluster center.
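The chemotactic operation described above can be sketched as a filter followed by a selection. Everything below is an illustration under stated assumptions: the input layout (edge clustering coefficients keyed by `frozenset` pairs), the fixed value β = 0.5 (the text draws β at random between 0 and 1), the choice of the single highest-CNFV node as center (the text only says "high"), and all numbers in the toy data are hypothetical.

```python
def cnfv(C, wdeg, n, beta):
    """Eq. (14): CNFV_i = beta * C_i + (1 - beta) * w(i) / n."""
    return {i: beta * C[i] + (1 - beta) * wdeg[i] / n for i in C}

def chemotactic_step(we, C, wdeg, beta=0.5):
    """Drop isolated nodes, then pick an initial cluster center.

    we   : dict frozenset({i, j}) -> edge clustering coefficient WE_ij
    C    : dict node -> node clustering coefficient C_i
    wdeg : dict node -> weighted degree w(i)
    """
    total = {i: 0.0 for i in C}
    for edge, value in we.items():
        for i in edge:
            total[i] += value
    # A node whose edge clustering coefficients sum to zero is abandoned.
    kept = [i for i in C if total[i] > 0]
    scores = cnfv(C, wdeg, len(C), beta)
    # Assumption: "high CNFV" is read as "highest CNFV among the kept nodes".
    center = max(kept, key=lambda i: scores[i])
    return kept, center

# Hypothetical toy values for a four-node graph.
C = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 1.0 / 3.0}
wdeg = {"a": 1.5, "b": 1.3, "c": 0.3, "d": 2.1}
we = {
    frozenset(("a", "b")): 0.41,
    frozenset(("a", "d")): 0.13,
    frozenset(("b", "d")): 0.18,
    frozenset(("c", "d")): 0.0,
}
kept, center = chemotactic_step(we, C, wdeg)
print(kept, center)
```

Node c is attached by a single edge that closes no triangle, so it is removed before any center is chosen.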
In the reproduction period, the algorithm searches for the protein nodes that have high membership degrees with respect to the cluster center and merges them into the cluster to which the cluster center belongs. It calculates the membership degree, nonmembership degree, and indeterminacy degree between cluster center j and each node i among the unvisited nodes of the PPI network based on Eqs. (15) and (16). If the membership degree μij of node i is higher than the threshold of membership degree (T1) and the indeterminacy degree πij is lower than the threshold of indeterminacy degree (T3), then the protein node i is classified into the cluster to which the cluster center j belongs and marked as a visited protein node (the threshold of nonmembership degree (T2) has no effect on the clustering and is not considered). Conversely, if the indeterminacy degree πij is higher than the threshold of indeterminacy degree, the node i is not only grouped into the cluster containing cluster center j, but is also labeled as a fuzzy node that will be visited again and may be merged into other clusters. The procedure continues until all nodes in the PPI network are evaluated. Then one cluster is obtained.
In the elimination–dispersal phase, several bacteria die and the population produces new individuals. This operation corresponds to selecting a new protein node as the next cluster center according to the CNFV of the nodes. The algorithm then generates the next cluster according to the reproduction operation described above. If the cluster number is larger than 3, it calculates the similarities between any two cluster modules according to Eqs. (12) and (13). Afterwards, it merges the clusters whose similarity is higher than the given threshold.
Figure 5. The clustering model design based on IBFO.
2.2.4 Implementation steps of IBFO algorithm
The specific implementation steps are as follows:
Procedure Initialization
Assign values to several parameters: set the index of the external loop iter = 1 and the maximal number of iterations of the external loop maxiter = 100. Initialize the optimal fitness value gfval and the global optimal cluster result gcluster. Calculate the CNFV of all the protein nodes and the distances d between any two protein nodes. Calculate the membership degree, nonmembership degree, and indeterminacy degree among all the protein nodes according to the membership function Eq. (15) and the nonmembership function Eq. (16). Determine the threshold T1 of membership degree and the threshold T3 of indeterminacy degree.
Step 1: In the chemotactic operation, for each node i, calculate the sum of the edge clustering coefficients of the edges connecting node i with other protein nodes in the PPI network. If the sum is zero, node i is regarded as an isolated node and eliminated.
Step 2: Set the index of the internal loop count = 1 and the maximal number of iterations of the internal loop maxcount = 100.
Step 3: Randomly select one protein node that has a high CNFV as the cluster center.
Step 4: Corresponding to the reproduction operation of the BFO algorithm, cluster the PPI network in accordance with the membership degree, nonmembership degree, and indeterminacy degree between the cluster center and the other protein nodes. Several protein nodes are grouped into the cluster to which the cluster center belongs and marked as visited nodes. However, some nodes are classified into the cluster and labeled as unvisited nodes, so that they may also participate in other clusters.
Step 5: Set count = count + 1, use the elimination–dispersal operation to randomly select the new cluster center according to the reproduction operation of the BFO algorithm, and go back to Step 4.
Step 6: If the cluster number is larger than 3, calculate the similarities between any two clusters. Afterwards, merge the clusters whose similarity is higher than the given threshold.
Step 7: When all the protein nodes are visited or the index of the internal loop count reaches the maximal number of iterations of the internal loop maxcount, a set of clusters of the PPI network is obtained.
Step 8: Calculate the fitness value of the obtained cluster results, compare it with the optimal fitness value gfval, and then update the value gfval and the global optimal cluster result gcluster. Set iter = iter + 1.
Step 9: If iter has reached the maximal number of iterations of the external loop maxiter, the algorithm terminates; otherwise go back to Step 2.
Output the final clustering result.
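Steps 4 and 6 above can be sketched as two small routines: one that grows a cluster around a center using the thresholds T1 and T3, and one that merges clusters using the module similarity of Eqs. (12) and (13). This is a sketch under assumptions rather than the authors' implementation: the degrees are taken as precomputed inputs, the membership test is assumed to apply to fuzzy nodes as well, the greedy merge order is a choice made here, the condition on the cluster number is omitted, and all names and toy values are hypothetical.

```python
def grow_cluster(center, nodes, mu, pi, visited, T1, T3):
    """Step 4: collect nodes whose membership degree to `center` exceeds T1.

    mu[i][center], pi[i][center]: precomputed membership / indeterminacy degrees.
    Nodes with pi <= T3 are marked visited; nodes with pi > T3 stay unvisited
    ("fuzzy") so that later clusters may claim them as well.
    """
    cluster = {center}
    visited.add(center)
    for i in nodes:
        if i in visited or mu[i][center] <= T1:
            continue
        cluster.add(i)
        if pi[i][center] <= T3:
            visited.add(i)
    return cluster

def module_similarity(Mi, Mj, w):
    """Eqs. (12)-(13); w maps frozenset({x, y}) -> weight of an existing edge."""
    total = sum(1.0 if x == y else w.get(frozenset((x, y)), 0.0)
                for x in Mi for y in Mj)
    return total / min(len(Mi), len(Mj))

def merge_similar(clusters, w, threshold):
    """Step 6 (greedy variant): merge a cluster into the first similar one."""
    merged = []
    for c in clusters:
        for m in merged:
            if module_similarity(c, m, w) > threshold:
                m |= c
                break
        else:
            merged.append(set(c))
    return merged

# Hypothetical degrees relative to center "a"; thresholds as set in the experiments.
nodes = ["a", "b", "c", "d"]
mu = {"b": {"a": 0.9}, "c": {"a": 0.7}, "d": {"a": 0.2}}
pi = {"b": {"a": 0.05}, "c": {"a": 0.3}, "d": {"a": 0.1}}
visited = set()
cluster = grow_cluster("a", nodes, mu, pi, visited, T1=0.52, T3=0.15)
print(sorted(cluster), sorted(visited))

w = {frozenset(("a", "b")): 0.5, frozenset(("b", "c")): 0.4}
print(merge_similar([{"a", "b"}, {"b", "c"}, {"d"}], w, threshold=0.5))
```

In the toy run, node c joins the cluster but remains unvisited, which is how a protein can end up in more than one module once further centers are processed.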
Figure 6. The flow chart of clustering method based on IBFO.
2.2.5 Time complexity of algorithm
In this algorithm, suppose that the number of protein nodes in the PPI dataset is n, the number of obtained clusters is numclu, the number of protein nodes in a cluster is num, the maximal number of iterations of the external loop is maxiter, and the maximal number of iterations of the internal loop is maxcount. The time complexity of the algorithm is as follows:
(i) In the data preprocessing phase, the time complexity of calculating the membership degree and indeterminacy degree between any two protein nodes is O(n^2).
(ii) For each cluster center, the time complexity of obtaining one cluster via the chemotactic operation and the reproduction operation is O(n).
(iii) The time complexity of calculating the similarity and merging the clusters that have high similarity is O(numclu × num^2 + numclu × num).
(iv) The time complexity of obtaining a set of clusters is O(maxcount × (numclu × num^2)).
(v) The global optimal cluster results are obtained after executing the algorithm maxiter times, so the time complexity is O(maxiter × maxcount × (numclu × num^2)).
3 Results
3.1 Parameter analysis
To assess the performance of the algorithm, the experiments were carried out under Windows XP on an Intel Core 2 Duo processor running at 2.93 GHz with 2 GB of memory. We use the Munich Information Center for Protein Sequence (MIPS) PPI datasets as our data source and the MIPS complex database as ground truth to evaluate the protein complexes predicted by our method [24]. Several parameters influence the cluster results, such as the parameters of BFO, the threshold of membership degree, the threshold of indeterminacy degree, and the maximal number of iterations of the IBFO algorithm.
Figure 7. The consequence of two initializing mechanisms on
clustering results.
We executed the algorithms 20 times. Figure 7 illustrates the influence on the f-measure value of two different initialization mechanisms: initializing the cluster center by the bacterial reproduction operation, and initializing it randomly. The f-measure values obtained with the former are higher than those obtained with the latter. With the former, the values range between 0.78 and 0.82, which is relatively stable.
Figure 8(A) illustrates the influence of the threshold T1 of membership degree in terms of precision, recall, and f-measure values. While T1 is less than 0.5, the three values show an upward trend, and they decrease as T1 reaches 0.8. Figure 8(B) describes the influence of the threshold T3 of indeterminacy degree. As T3 varies from 0 to 0.15, the precision, recall, and f-measure values gradually increase to their optimal values. Therefore, the threshold of membership degree is set to 0.52 and the threshold of indeterminacy degree is set to 0.15 in the following experiments.
Figure 9 shows the effect of maximal iteration on the clustering results. The algorithm performs best in precision, recall,
and f-measure values when the maximal iteration reaches 100.
Figure 8. The effect of the threshold of membership degree and
indeterminacy degree on clustering results.
3.2 Performance of IBFO algorithm
This paper integrates the concepts of membership degree and
indeterminacy degree in the intuitionistic fuzzy set into the
principle of BFO algorithm, so the algorithm may generate
several overlapping functional modules in the final cluster
results.
Figure 9. The influence of maximum iterations on clustering results.
Figure 10. Comparisons of clustering results with and without the indeterminacy degree.
During the reproduction operation in the model design, the indeterminacy degree is used to determine whether a protein node can be regarded as a fuzzy node and visited more than once when clustering the PPI network. This is intended to address the fact that one protein node may belong to one or more functional modules. It is essential to show whether this improvement based on the intuitionistic fuzzy set is reasonable and effective. In Fig. 10, the dotted line stands for the clustering results obtained by the algorithm that ignores the indeterminacy degree of protein nodes, while the solid line represents the clustering results of the improved algorithm proposed in this paper. Figure 10(A) compares the two algorithms in terms of recall value. The IBFO algorithm achieves a higher recall value because each protein node can be merged into one or more clusters, so the algorithm can find the clusters as completely as possible. Figure 10(B) compares the two algorithms in terms of f-measure values. The f-measure value of the IBFO algorithm is superior to that of the algorithm that takes no account of the overlapping functional modules in the PPI network.
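The role of the indeterminacy degree in the reproduction operation can be illustrated with a minimal Python sketch. The membership and indeterminacy values are hypothetical inputs (their trigonometric definitions are not reproduced here), the function name `grow_cluster` and the node labels P1 to P3 are assumptions, and the default thresholds 0.52 and 0.15 are the values selected in the parameter experiments. It is a sketch under these assumptions, not the implementation used in the experiments.

```python
from typing import Dict, Set, Tuple


def grow_cluster(
    degrees: Dict[str, Tuple[float, float]],
    visited: Set[str],
    t1: float = 0.52,  # membership threshold T1
    t3: float = 0.15,  # indeterminacy threshold T3
    use_indeterminacy: bool = True,
) -> Set[str]:
    """Collect unvisited nodes whose membership degree reaches t1.

    degrees maps node -> (membership, indeterminacy) with respect to the
    current cluster center; both numbers are hypothetical inputs here.
    """
    cluster = set()
    for node, (membership, indeterminacy) in degrees.items():
        if node in visited or membership < t1:
            continue
        cluster.add(node)
        # A node with high indeterminacy stays unvisited, so a later
        # cluster center may claim it again (an overlapping protein).
        if not (use_indeterminacy and indeterminacy >= t3):
            visited.add(node)
    return cluster


# Two consecutive cluster centers; P2 is a fuzzy node for the first one.
first_center = {"P1": (0.90, 0.02), "P2": (0.60, 0.20), "P3": (0.30, 0.05)}
second_center = {"P2": (0.70, 0.10), "P3": (0.80, 0.01)}

visited: Set[str] = set()
first = grow_cluster(first_center, visited)
second = grow_cluster(second_center, visited)
overlap = first & second

# Same data, but ignoring the indeterminacy degree (no overlaps possible).
visited_plain: Set[str] = set()
first_plain = grow_cluster(first_center, visited_plain, use_indeterminacy=False)
second_plain = grow_cluster(second_center, visited_plain, use_indeterminacy=False)
```

With the indeterminacy check enabled, P2 joins both clusters; with it disabled, every node ends up in at most one cluster, which is the behaviour the dotted line in Fig. 10 corresponds to.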
The functional-flow algorithm [7] is a relatively effective method for clustering PPI networks. It is based on the principle that the functional information of a protein flows through every possible path, so that we can quantify how much a protein functionally influences other adjacent proteins. The algorithm allows overlapping functional modules to be generated in the model design and merges the modules that have high similarity in the postprocessing stage. Moreover, the experiments show that the algorithm is comparatively efficient. However, the algorithm requires the cluster number to be predefined, and its precision and recall values are relatively low. The method presented in ref. [16] therefore uses the mechanism of the BFO algorithm to optimize the procedure of clustering PPI networks. In its model design, the clusters are created one by one, which overcomes the drawback of predefining the cluster number. However, each protein node can be grouped into only one cluster, which goes against the topological character of PPI networks that one protein node can belong to two or more clusters. As a result, the obtained clusters contain fewer protein nodes than the matched modules in the standard dataset, so the BFO algorithm does not perform well in terms of the recall value of the clustering results. To address the shortcomings of predefining the cluster number and the lower recall value, the IBFO algorithm proposed in this paper introduces the concept of indeterminacy degree on the basis of the BFO algorithm. We executed each of the three algorithms 20 times; the precision, recall, and f-measure values of the clustering results are shown in Table 2.
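The link between cluster size and recall described above can be made concrete with a short Python sketch. It assumes the usual set-overlap definitions of precision and recall and the harmonic-mean form of the f-measure; the function names and the protein labels P1 to P5 are illustrative and are not taken from the dataset.

```python
from typing import Set, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cluster_scores(cluster: Set[str], module: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-measure) of a cluster against a module."""
    common = len(cluster & module)
    precision = common / len(cluster) if cluster else 0.0
    recall = common / len(module) if module else 0.0
    return precision, recall, f_measure(precision, recall)


# A small cluster fully contained in a larger module:
# every member is correct, but most of the module is missed.
p, r, f = cluster_scores({"P1", "P2"}, {"P1", "P2", "P3", "P4", "P5"})

# Precision and recall of the first run in Table 2 (Flow and IBFO columns).
flow_run1 = f_measure(0.3159, 0.6635)
ibfo_run1 = f_measure(0.8628, 0.7626)
```

In the toy case the precision is perfect while the recall is only 0.4, which is the pattern the text attributes to clusters that cannot share proteins. For the first run in Table 2, the harmonic mean of the listed precision and recall agrees with the listed f-measure to four decimal places.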
As Tables 2 and 3 show, the IBFO algorithm achieves higher average precision, recall, and f-measure values than the Flow and BFO algorithms, and the highest average f-measure among the algorithms compared in Table 3. The ultimate goal of the algorithm is to predict the clusters as accurately as possible. The top 20 clusters obtained by the IBFO algorithm are listed in Table 4.
As Table 4 shows, the top 20 clusters obtained by the IBFO algorithm include proteins that are classified correctly as well as proteins that should be grouped into clusters other than the corresponding modules. Relatively more protein nodes are classified correctly in modules 1, 6, and 19, so the recall value of the clustering results is largely improved. A low p-value indicates that the module closely corresponds to the function, because it is less probable that the network would produce the module by chance. The clustering results in Table 4 can effectively identify sets of unknown proteins that have the same function and protein complexes, which helps to predict the function of unknown proteins.
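A p-value of this kind is commonly computed from the hypergeometric distribution, as the probability that a randomly drawn set of the same size as the cluster shares at least as many proteins with the functional category. The sketch below assumes that form; the function and argument names and the numbers in the usage lines are illustrative, and it does not reproduce the values in Table 4.

```python
from math import comb


def hypergeometric_p_value(
    n_total: int,     # proteins in the whole network
    n_function: int,  # proteins annotated with the function
    n_cluster: int,   # proteins in the cluster
    n_common: int,    # proteins in the cluster that carry the function
) -> float:
    """P(X >= n_common) for a cluster drawn at random without replacement."""
    upper = min(n_cluster, n_function)
    tail = sum(
        comb(n_function, i) * comb(n_total - n_function, n_cluster - i)
        for i in range(n_common, upper + 1)
    )
    return tail / comb(n_total, n_cluster)


# Toy network of 100 proteins, 10 of which carry the function.
# A cluster of 8 proteins sharing 6 with the function is far less likely
# to arise by chance than one sharing only 2.
strong = hypergeometric_p_value(100, 10, 8, 6)
weak = hypergeometric_p_value(100, 10, 8, 2)
```

The tail is summed directly rather than computed as one minus the lower tail, which avoids cancellation when the p-value is very small.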
We can see from Table 5 that three overlapping proteins are detected when the indeterminacy degree is set to 0.05, 0.2, and 0.25, respectively, and that 5 and 7 overlapping proteins are identified when the indeterminacy degree is 0.1 and 0.15, respectively. This illustrates that the most overlapping proteins are obtained when the indeterminacy degree is set to 0.15, which can also be seen from Fig. 8(B). This result reflects only the statistics and simulation point of view. In fact, how to set
Table 2. Comparisons among the three algorithms
Running   Flow                            BFO                             IBFO
times     Precision  Recall   f-measure   Precision  Recall   f-measure   Precision  Recall   f-measure
1         0.3159     0.6635   0.4280      0.7436     0.4447   0.5566      0.8628     0.7626   0.8096
2         0.2727     0.7316   0.3973      0.7446     0.7669   0.5740      0.9053     0.7170   0.8002
3         0.2554     0.7454   0.3804      0.6776     0.4413   0.5345      0.8491     0.7586   0.8013
4         0.2341     0.5225   0.3233      0.7102     0.7391   0.5427      0.8519     0.7343   0.7887
5         0.2445     0.5364   0.3359      0.6651     0.4460   0.5340      0.8492     0.7728   0.8092
6         0.2365     0.5265   0.3264      0.6364     0.4480   0.5259      0.8564     0.7435   0.7959
7         0.2192     0.5367   0.3117      0.7117     0.4355   0.5404      0.8877     0.7217   0.7961
8         0.2292     0.5212   0.3123      0.6715     0.7179   0.5153      0.8736     0.7494   0.8067
9         0.2192     0.5253   0.3093      0.7590     0.4222   0.5426      0.8839     0.7164   0.7913
10        0.2083     0.5155   0.2967      0.7294     0.4673   0.5697      0.8907     0.6824   0.7727
11        0.2211     0.5276   0.3116      0.6510     0.4435   0.5277      0.8661     0.7278   0.7909
12        0.2044     0.5395   0.2965      0.6679     0.4677   0.5502      0.8504     0.7690   0.8076
13        0.2215     0.5352   0.3133      0.6868     0.4429   0.5386      0.8575     0.7081   0.7756
14        0.2264     0.5194   0.3153      0.6943     0.4316   0.5324      0.8472     0.7937   0.8195
15        0.2176     0.5423   0.3106      0.7222     0.4294   0.5386      0.8750     0.7042   0.7803
16        0.2212     0.5337   0.3128      0.6480     0.4221   0.5113      0.8491     0.7577   0.8008
17        0.2373     0.5221   0.3263      0.7165     0.7148   0.5255      0.8644     0.7702   0.8149
18        0.2222     0.5334   0.3137      0.7050     0.4273   0.5322      0.8709     0.6942   0.7725
19        0.2302     0.5126   0.3177      0.7179     0.4366   0.5431      0.8726     0.6777   0.7629
20        0.2427     0.5243   0.3318      0.7185     0.4327   0.5402      0.8539     0.7674   0.8083
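Because the comparison rests on averages over the 20 runs, a short Python sketch can show how such averages are obtained from the f-measure columns of Table 2. The lists transcribe those columns and the variable names are illustrative. The resulting means can be set beside the averages in Table 3, although small differences are possible because the text does not state how those averages were rounded or whether they come from the same runs.

```python
from statistics import mean, stdev

# f-measure columns of Table 2, one value per run (runs 1 to 20).
f_values = {
    "Flow": [0.4280, 0.3973, 0.3804, 0.3233, 0.3359, 0.3264, 0.3117,
             0.3123, 0.3093, 0.2967, 0.3116, 0.2965, 0.3133, 0.3153,
             0.3106, 0.3128, 0.3263, 0.3137, 0.3177, 0.3318],
    "BFO": [0.5566, 0.5740, 0.5345, 0.5427, 0.5340, 0.5259, 0.5404,
            0.5153, 0.5426, 0.5697, 0.5277, 0.5502, 0.5386, 0.5324,
            0.5386, 0.5113, 0.5255, 0.5322, 0.5431, 0.5402],
    "IBFO": [0.8096, 0.8002, 0.8013, 0.7887, 0.8092, 0.7959, 0.7961,
             0.8067, 0.7913, 0.7727, 0.7909, 0.8076, 0.7756, 0.8195,
             0.7803, 0.8008, 0.8149, 0.7725, 0.7629, 0.8083],
}

# Mean and sample standard deviation of the f-measure per algorithm.
means = {name: mean(values) for name, values in f_values.items()}
spreads = {name: stdev(values) for name, values in f_values.items()}

for name in f_values:
    print(f"{name}: mean={means[name]:.4f}, stdev={spreads[name]:.4f}")
```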
Table 3. Comparisons of flow and our algorithms on average value
Algorithms          Average value
                    Precision  Recall  f-measure
Flow [7]            0.23       0.56    0.31
IQ-Flow [11]        0.67       __      __
IQ-Flow fast [11]   0.72       __      __
ABC-Flow [12]       0.70       0.84    0.76
JSACO [13]          0.87       0.26    0.55
BFO [16]            0.70       0.50    0.54
IBFO                0.87       0.74    0.79
the appropriate value of indeterminacy degree and which proteins overlap should be determined by the experiments and analysis of biologists. Nevertheless, the result at least provides them with a reference to a certain extent.
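The counts quoted from Table 5 can be reproduced with a few lines of Python. The dictionary transcribes Table 5; the variable names are illustrative.

```python
from collections import Counter

# Overlapping proteins listed in Table 5, keyed by indeterminacy degree.
overlaps = {
    0.05: ["YDR276c", "YDR488c", "YPL017c"],
    0.10: ["YNL088w", "YDR488c", "YER007w", "YPL155c", "YMR138w"],
    0.15: ["YNL088w", "YDR276c", "YEL020w-a", "YFL018c", "YLR055c",
           "YDR149c", "YOR266w"],
    0.20: ["YLR055c", "YBR107c", "YPR070w"],
    0.25: ["YIL095w", "YPR023c", "YLR212c"],
}

# Number of overlapping proteins per setting and the setting with the most.
counts = {degree: len(proteins) for degree, proteins in overlaps.items()}
best = max(counts, key=counts.get)

# Proteins reported as overlapping under more than one setting.
occurrences = Counter(p for proteins in overlaps.values() for p in proteins)
repeated = sorted(p for p, n in occurrences.items() if n > 1)
```

The second part lists the proteins that recur across settings, which may be a useful starting point for the biological validation the text calls for.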
4 Discussion
The BFO algorithm in ref. [16] has a low recall value when clustering PPI networks. In this paper, we proposed a novel method that uses the BFO mechanism based on the intuitionistic fuzzy set. The algorithm initially eliminates the isolated points based on the
Table 4. The proteins and p-value of the top 20 clusters
Cluster ordinal   The proteins classified rightly   The proteins classified wrongly   ID of the protein function modules   p-value
1
YBR120c, YOR334w, YIR021w, YDR194c,
YMR023c, YGR222w
YPR025c, YDL108w, YLR005w, YPR056w,
YPL122c, YER171w
YDR176w, YDR145w, YGL112c, YPL254w,
YCL010c, YOL148c, YLR055c, YGL066w,
YDR392w, YBR081c, YBR192c,
YMR236w
YHR069c, YDR280w, YOL021c, YGR195W,
YDL111c, YGR095c, YCR035c
YLR381w, YJR135c
YLR115w, YLR277c, YAL043c
YPL010w, YNL287w, YDL145c,YFR051c,
YGL137w
YDR028c, YER133w, YOR178c, YKL193c,
YER054c, YMR311c
YKR052c, YLR382c,YDL044c, YHR005c-a,
YJL133w, YPR134w
YIL143c,YDR311w
500.50
0.1644
510.100
0.0771
YDR167w
230.20.10
0.1044
YOL077w-a, YKL190w,YKL058w, YJL074c
440.12.10
0.0985
YPR046w
YDR301w
YLR093c,YDR238c
270.20.20
440.10.20
260.30.10
0.2321
0.2021
0.1864
YNL126w
450
0.0936
2
3
4
5
6
7
8
Table 4. Continued
Cluster ordinal   The proteins classified rightly   The proteins classified wrongly   ID of the protein function modules   p-value
9
YPL129w, YPL016w, YJL176c, YMR033w,
YBR289w, YOR290c, YDR073w,
YPR034w
YML099c, YMR043w, YMR042w, YDR137c
YRB123c, YAL001c, YGR047c, YOR110w,
YDR362c
YMR033w, YOR213c, YCR052w,YKR008,
YFR037c, YLR321c, YIL126w, YLR357w,
YPR034w, YGR056w
YOL123w, YGL044c, YDR228c,YMR061w,
YOR250c
YEL020w-a, YHR005c-a
YJL154c, YJL053w, YHR012w
YNL062c, YNL244c, YOR361c, YDR429c,
YMR146c, YBR079c
YNL199c, YPL075w
YDL069c, YHL038c,YJL209w, YDR197w
YNL290w, YOR217w, YJR068w, YOL094c,
YBR087w
YGR072w, YMR080c, YHR077c
YNR023w,YHL025w
510.190.50
0.1563
YGL154c, YGR244c,YGR113w
___
510.190.120
510.150
0.2179
0.0827
YLR060w, YLP045w, YLR148w
400
0.1329
YHR012w
440.10.10
0.1195
YCL009c
___
YGL130w, YHR164c,YER176w
440.40
260.30.30.10
500.10.40
0.3019
0.1092
0.3919
YPL001w
YPL075w
___
510.190.90
440.20
410.40.30
0.2109
0.1560
0.1210
YJR052w, YER090w
300
0.2057
10
11
12
13
14
15
16
17
18
19
20
Table 5. The overlapping proteins under different indeterminacy degree
Indeterminacy degree   The overlapping proteins
0.05                   YDR276c, YDR488c, YPL017c
0.10                   YNL088w, YDR488c, YER007w, YPL155c, YMR138w
0.15                   YNL088w, YDR276c, YEL020w-a, YFL018c, YLR055c, YDR149c, YOR266w
0.20                   YLR055c, YBR107c, YPR070w
0.25                   YIL095w, YPR023c, YLR212c
This work was supported by the National Natural Science
Foundation of China under Grant No. 61100164 and 61173190,
the Natural Science Foundation of Shaanxi Province of China
in 2010 under Grant No. 2010JQ8034, and the Fundamental
Research Funds for the Central Universities under Grant No.
GK200902016.
The authors have declared no conflict of interest.
edge-clustering coefficient in the chemotactic operation. In the reproduction operation, the nodes that have a high membership degree are merged into the cluster to which the cluster center belongs. Meanwhile, the nodes that also have a high indeterminacy degree are labeled as unvisited protein nodes and may be grouped into two or more clusters. The elimination–dispersal operation is equivalent to selecting the next cluster center and generating another cluster. In the end, the algorithm merges the clusters that have high similarity and terminates when the maximal number of iterations is reached. The simulation results on the PPI dataset showed that the algorithm not only improves the accuracy of the clustering results and determines the cluster number automatically, but also identifies the overlapping modules successfully. However, some parameters of the algorithm influence the clustering results and should be discussed further. In addition, constructing a dynamic model of the PPI network and designing the corresponding algorithms are directions for future research.
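The final merging step can be sketched as follows. The similarity measure is not restated in this section, so the sketch substitutes the Jaccard index and a hypothetical threshold of 0.5; both, together with the function names and the protein labels, are assumptions made only for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(clusters: List[Set[str]], threshold: float = 0.5) -> List[Set[str]]:
    """Repeatedly merge any pair of clusters whose similarity reaches threshold."""
    merged = [set(c) for c in clusters]  # copy so the input is not modified
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


clusters = [{"P1", "P2", "P3"}, {"P2", "P3", "P4"}, {"P7", "P8"}]
result = merge_similar(clusters)
```

Restarting the scan after each merge keeps the logic simple, because a merged cluster may become similar to one that was previously below the threshold; each merge reduces the number of clusters by one, so the loop terminates.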
5 References
[1] Watts, D. J., Strogatz, S. H., Collective dynamics of ‘small-world’ networks. Nature 1998, 393, 440–442.
[2] Barabási, A. L., Oltvai, Z. N., Network biology: understanding
the cell’s functional organization. Nature Rev. Genet. 2004,
5, 101–113.
[3] Soon-Hyung, Y., Oltvai, Z. N., Barabási, A. L., Functional and
topological characterization of protein interaction networks.
Proteomics 2004, 4, 928–942.
[4] Penggang, S., Lin, G., Identification of overlapping and non-overlapping community structure by fuzzy clustering in complex networks. Inform. Sci. 2011, 181, 1060–1071.
[5] Berggård, T., Linse, S., James, P., Methods for the detection and analysis of protein-protein interactions. Proteomics
2007, 7, 2833–2842.
[6] Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M.,
Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005,
21, i302–i310.
[7] Cho, Y. R., Hwang, W., Ramanathan, M., Aidong, Z., Semantic integration to identify overlapping functional modules
in protein interaction networks. BMC Bioinformatics 2007,
doi:10.1186/1471-2105-8-265.
[8] Aidong, Z., Protein Interaction Networks, Cambridge University Press, New York 2009.
[9] Kenley, E. C., Cho, Y. R., Detecting protein complexes
and functional modules from protein interaction networks: a graph entropy approach. Proteomics 2011, 11,
3825–3844.
[10] Goel, A., Simone, S. Li, Marc, R. W., Four-dimensional visualisation and analysis of protein–protein interaction networks.
Proteomics 2011, 11, 2672–2682.
[11] Xiujuan, L., Xu, H., Lei, S., Aidong, Z., Clustering PPI
data based on improved functional-flow model through
quantum-behaved PSO. Int. J. Data Min. Bioinform. 2012,
6, 42–60.
[12] Xiujuan, L., Jianfang, T., The information flow clustering
model and algorithm based on the artificial bee colony
mechanism of PPI network. Chinese J. Comput. 2012, 35,
134–145.
[13] Xiujuan, L., Xu, H., Shuang, W., Ling, G., Joint strength
based ant colony optimization clustering algorithm for PPI
networks. Acta Electron. Sin. 2012, 40, 695–702.
[14] Passino, K. M., Biomimicry of bacterial foraging for distributed optimization and control. IEEE Contr. Syst. Mag. N
Y 2002, 22, 52–67.
[15] Kim, D. H., Abraham, A., Cho, J. H., A hybrid genetic algorithm and bacterial foraging approach
for global optimization. Inform. Sci. 2007, 177, 3918–3937.
[16] Xiujuan, L., Shuang, W., Liang, G., Aidong, Z., Clustering
PPI data based on bacteria foraging optimization algorithm.
2011 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM11), Atlanta, Georgia, 2011, 96–99.
[17] Zadeh, L. A., Fuzzy sets. Inform. Contr. 1965, 8, 338–353.
[18] Atanassov, K., Intuitionistic fuzzy sets. Fuzzy Sets Syst. 1986,
20, 87–96.
[19] Dasgupta, S., Biswas, A., Abraham, A., Das, S., Adaptive
computational chemotaxis in bacterial foraging algorithm,
2008 International Conference on Complex, Intelligent and
Software Intensive Systems 2008, 13, 64–71.
[20] Veysel, G., Kevin, M. P., Swarm Stability and Optimization.
Springer Verlag, Berlin Heidelberg 2011.
[21] De, S. K., Biswas, R., Roy, A. R., Some operations on intuitionistic fuzzy sets. Fuzzy Sets Syst., Arti. Intell. 2003, 2715,
285–292.
[22] Huan, W., Min, L., Jianxin, W., Yi, P., A new method for identifying essential proteins based on edge clustering coefficient.
Lecture Notes in Computer Science 2011, 6674, 87–98.
[23] Letovsky, S., Kasif, S., Predicting protein function from
protein-protein interaction data: a probabilistic approach.
BMC Bioinformatics 2003, 19, 197–204.
[24] Güldener, U., Münsterkötter, M., Kastenmüller, G., Strack,
N. et al., CYGD: the comprehensive yeast genome database.
Nucl. Acids Res. 2005, 33, D364–D368.