DOI 10.1002/pmic.201200309 Proteomics 2013, 13, 278–290

RESEARCH ARTICLE

Clustering and overlapping modules detection in PPI network based on IBFO

Xiujuan Lei1, Shuang Wu1, Liang Ge2 and Aidong Zhang2

1 College of Computer Science, Shaanxi Normal University, Xi’an, P. R. China
2 Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA

Traditional clustering algorithms do not work well on protein–protein interaction (PPI) networks because of the topological features of these networks. This paper proposes an improved clustering method, improved BFO (IBFO), based on the bacteria foraging optimization (BFO) mechanism and the intuitionistic fuzzy set, in which trigonometric functions define the membership degrees and the indeterminacy degree is introduced to detect overlapping modules. In the chemotactic operation of BFO, the algorithm initializes a cluster center according to the comprehensive network feature value of the nodes and eliminates isolated points according to the edge-clustering coefficient. In the reproduction operation of BFO, nodes with high membership degrees are merged into the cluster that the cluster center belongs to and are labeled as visited nodes. Nodes that also have high indeterminacy degrees are visited again when another cluster is generated. The elimination–dispersal operation is equivalent to the selection of the next cluster center. Finally, the algorithm merges clusters that have high similarity. The results show that the algorithm determines the cluster number automatically, improves the f-measure value of the cluster results, and successfully identifies the overlaps in PPI networks.

Received: July 25, 2012
Revised: September 19, 2012
Accepted: October 11, 2012

Keywords: Bacteria foraging optimization / Bioinformatics / Indeterminacy degree / Overlap / Protein–protein interaction networks

Correspondence: Dr. Xiujuan Lei, College of Computer Science, Shaanxi Normal University, Xi’an, Shaanxi Province, 710062, P. R. China
E-mail: xjlei@snnu.edu.cn
Fax: +86 29 85310161

Abbreviations: BFO, bacteria foraging optimization; CNFV, comprehensive network feature value; PPI, protein–protein interaction

Colour Online: See the article online to view Figs. 3, 4 and 7–10 in colour.

© 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com

1 Introduction

In the postgenomic era, the research focus of biological science has gradually shifted from genomics to proteomics. Recently, the rapid development of proteomics and the explosion of protein–protein interaction (PPI) datasets have drawn more and more researchers to investigate PPI networks in order to predict the function of unknown proteins. Research on PPI networks contributes to predicting the functions of unknown proteins at the molecular level and to uncovering the regularities of cellular activities such as growth, development, and metabolism. In addition, it is extremely helpful in the diagnosis of major diseases and the intensive study of therapy, and it stimulates the development of biology, medicine, bioinformatics, and related fields.

PPI networks share the small-world feature [1], which is characterized by high clustering coefficients. PPI networks are also scale-free [2], which suggests an important topological feature of PPI networks, namely modularity [3]. It is therefore natural to use clustering methods to predict the functional modules. However, traditional clustering methods such as hierarchical clustering, density-based methods, and fuzzy clustering algorithms [4,5] either require prior knowledge of the cluster number or are sensitive to noisy data. Nabieva et al. [6] first put forward the functional-flow model to explore the underlying structure of PPI networks. The experimental results showed that the method performed well; however, the running time was relatively high. Cho et al. [7] proposed another flow-based modularization algorithm to predict the overlapping functional modules in a weighted graph, but the f-measure [8] value of the clustering result was relatively low. Kenley et al. [9] proposed a novel information-theoretic definition, graph entropy, as a measure of the structural complexity of a graph. The results showed that the approach had higher accuracy in predicting protein complexes.

Recently, more and more intelligent optimization algorithms have found broad applications in many research fields. Goel et al. [10] presented software for the generation and analysis of dynamic, four-dimensional PPI networks. Our research [10] successfully applied the quantum-behaved particle swarm optimization algorithm and the artificial bee colony algorithm to optimize the functional-flow model, and also utilized joint strength based ant colony optimization to cluster the functional modules; the results show that these methods performed well in identifying the functional modules.

Following the rapid development of intelligent algorithms, Passino [14] presented the bacteria foraging optimization (BFO) algorithm, inspired by the social foraging behavior of Escherichia coli. The BFO algorithm has drawn the attention of researchers from various fields such as harmonic estimation, transmission loss reduction, and machine learning. In order to explore the searching ability of the BFO algorithm, several researchers integrated it with other intelligent methods [15] such as the genetic algorithm and the particle swarm optimization algorithm. We then designed a new model that takes advantage of the BFO algorithm to cluster PPI networks [16]. The initial positions where the bacteria are located are treated as cluster centers, and the cluster modules are generated during the reproduction operation and the elimination–dispersal operation.
The experimental results showed that the method could effectively improve the accuracy of the cluster results. However, the recall value was low, and the algorithm ignored the fact that a protein may belong to two or more clusters.

Fuzzy methods, which date back to 1965 [17], have been adopted in cluster analysis. Atanassov [18] extended the theory of fuzzy sets and proposed the intuitionistic fuzzy set, which includes the concepts of membership degree, nonmembership degree, and indeterminacy degree. In this paper, we adopt the indeterminacy degree of protein nodes to take the overlapping functional modules of PPI networks into consideration, in order to improve the recall value of the cluster results.

We propose a novel model and algorithm that uses the BFO mechanism and the intuitionistic fuzzy set to tackle the overlapping functional modules of PPI networks and to determine the cluster number automatically. Section 2.1 briefly presents the principle of the BFO algorithm, the definitions of the intuitionistic fuzzy set, several graph concepts, and the evaluation criteria of PPI network clustering. Section 2.2 describes the model design and implementation steps of the improved algorithm, which integrates the intuitionistic fuzzy set into the mechanism of the BFO algorithm. In Section 3, we execute the algorithm and compare it with the BFO algorithm of ref. [16] and the functional-flow algorithm [7]. The experimental results show that the algorithm is superior to the BFO algorithm in f-measure value.

2 Materials and methods

2.1 Basic principles and concepts

2.1.1 Principle of BFO algorithm

The BFO algorithm [19] is an intelligent algorithm that has been widely accepted as a global optimization searching method. It is composed of the chemotactic operation, the reproduction operation, and the elimination–dispersal operation.

2.1.1.1 Chemotactic operation

Figure 1(A) shows that the bacterium moves in two different ways, swimming and tumbling, by means of a set of tensile flagella. When all the flagella rotate clockwise, each flagellum operates relatively independently of the others. When all the flagella rotate counterclockwise, they push the bacterium so that it moves in one direction at a very fast rate. Figure 1(B) shows the swimming and tumbling behavior of the bacterium in a neutral medium. Figure 1(C) shows a nutrient concentration gradient: the darker the shade, the higher the concentration of the nutrient. The bacterium alternately swims and tumbles in order to move up the nutrient gradient and avoid noxious environments. The BFO algorithm regards this phenomenon as the chemotactic behavior, which largely broadens the local exploring ability of the algorithm.

2.1.1.2 Reproduction operation

In general, bacteria that become incapable of searching for food during the chemotactic behavior are eliminated. In order to maintain the size of the population, the remaining bacteria replicate themselves and generate new individuals. To improve the global convergence speed and efficiency, the BFO algorithm generally selects the bacteria that rank in the top half to reproduce, generating new individuals that are completely identical to the original bacteria.

2.1.1.3 Elimination–dispersal operation

Owing to sudden changes in the local environment, the bacterial population may gradually become ill-adapted to the environment, so that a group of bacteria is either killed or dispersed to a new location. This phenomenon is simulated by the elimination–dispersal operation, which is normally executed with a certain probability.
If a bacterium satisfies the probability of the elimination–dispersal operation, it dies and the algorithm generates a new individual at a random position in the feasible solution space. The elimination–dispersal operation enhances the random searching ability of the BFO algorithm, maintains the variety of the population, and avoids premature convergence.

Figure 1. The foraging behavior of bacteria [20].

Define a chemotactic step to be a tumble followed by a tumble, or a tumble followed by a run. Let j be the index of the chemotactic step, k the index of the reproduction step, and l the index of the elimination–dispersal event. The position of each member in the population of S bacteria at the jth chemotactic step, kth reproduction step, and lth elimination–dispersal event is represented as follows:

$$P(j, k, l) = \{x_i(j, k, l) \mid i = 1, 2, \cdots, S\}.$$

2.1.2 Fuzzy set

In the typical setting, clusters are nonoverlapping. However, PPI networks contain many overlapping modules. Therefore, we adopt the fuzzy concept that each node belongs to certain clusters with a probability between 0 and 1.

Definition 1. The fuzzy set A in the domain X is defined as follows [17]:

$$A = \{(x, A(x)) \mid x \in X\},$$

where A(x) is the membership function. The fuzzy set A satisfies the requirement A: X → M, where M represents the membership space. The membership function A(x) denotes the membership degree, or the probability, that the element x belongs to the fuzzy set A. Therefore, each element (x, A(x)) in the fuzzy set A expresses the membership degree of the element x.

Definition 2. Let a set X be fixed. An intuitionistic fuzzy set [21] is an object having the form

$$B = \{\langle x, \mu_B(x), \nu_B(x) \rangle \mid x \in X\},$$

where the functions $\mu_B(x)$ and $\nu_B(x)$ define the membership degree and the nonmembership degree with which the element x belongs to set B, respectively.
In addition, for each object x in the set B, the sum of $\mu_B(x)$ and $\nu_B(x)$ lies between 0 and 1. The indeterminacy degree of the element x is equal to $1 - \mu_B(x) - \nu_B(x)$.

2.1.3 Relevant concepts of graph

In an undirected graph, the degree of a node is the number of its direct neighbors. For weighted graphs, the weighted degree of a node [8] is the sum of the weights of the edges between node i and its neighbors,

$$w(i) = \sum_{j \in N(i)} w(i, j). \tag{1}$$

For a node i, let $k_i$ be the node degree and $n_i$ the number of edges linking the neighbors of node i with each other. The node clustering coefficient is calculated as follows [8]:

$$C_i = \frac{2 n_i}{k_i (k_i - 1)}. \tag{2}$$

By analogy with $C_i$, the edge clustering coefficient $WE_{u,v}$ is defined as the ratio of the number of triangles containing the edge to the number of all possible triangles including this edge. In a weighted graph, the edge clustering coefficient [22] is calculated as follows:

$$WE_{i,j} = \frac{\sum_{k \in I_{i,j}} w(i, k) \cdot \sum_{k \in I_{i,j}} w(j, k)}{\sum_{s \in N_i} w(i, s) \cdot \sum_{t \in N_j} w(j, t)}. \tag{3}$$

In Eq. (3), the sets $N_i$ and $N_j$ are the sets of adjacent nodes of node i and node j, respectively, and w(i, s) is the weight of the edge linking node i with node s. The set $I_{i,j}$ is the set of common neighbors of nodes i and j, that is, $I_{i,j} = N_i \cap N_j$. The weighted edge clustering coefficient is thus the ratio between the product of the summed weights of the edges connecting the two nodes (i, j) with their common neighbors (k) and the product of the summed weights of the edges linking the two nodes (i, j) with their respective neighbors (s, t). The edge clustering coefficient is not sensitive to the influence of false-positive data.
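The quantities in Eqs. (1)–(3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the toy graph, the dict-of-dicts representation, the function names, and the convention of returning 0 when a denominator vanishes are not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy graph (not from the paper): graph[u][v] is the weight
# of the undirected edge (u, v), stored in both directions.
graph = {
    "a": {"b": 1.0, "d": 1.0},
    "b": {"a": 1.0, "d": 1.0},
    "c": {"d": 1.0},
    "d": {"a": 1.0, "b": 1.0, "c": 1.0},
}

def weighted_degree(g, i):
    """Eq. (1): sum of the weights of the edges incident to node i."""
    return sum(g[i].values())

def node_clustering_coefficient(g, i):
    """Eq. (2): C_i = 2 * n_i / (k_i * (k_i - 1))."""
    k = len(g[i])
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    # n_i: number of edges among the neighbors of i
    n_i = sum(1 for u, v in combinations(g[i], 2) if v in g[u])
    return 2.0 * n_i / (k * (k - 1))

def edge_clustering_coefficient(g, i, j):
    """Eq. (3): product of weight sums over common neighbors, divided by
    the product of weight sums over all neighbors of i and of j."""
    common = set(g[i]) & set(g[j])
    num = sum(g[i][k] for k in common) * sum(g[j][k] for k in common)
    den = weighted_degree(g, i) * weighted_degree(g, j)
    return num / den if den else 0.0  # assumed convention for a zero denominator

print(weighted_degree(graph, "d"))                    # 3.0
print(node_clustering_coefficient(graph, "a"))        # 1.0
print(edge_clustering_coefficient(graph, "a", "b"))   # 0.25
print(edge_clustering_coefficient(graph, "c", "d"))   # 0.0
```

In this toy graph the edge (c, d) has no common neighbor, so its coefficient is zero, whereas the edge (a, b) sits in the triangle a–b–d and receives a positive value.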
The edge clustering coefficient is therefore well suited to large-scale PPI data, which contain many false-positive interactions.

2.1.4 Object function

The density of a subgraph s is defined by the following equation:

$$D(s) = \frac{2e}{n(n - 1)}, \tag{4}$$

where n is the number of nodes and e is the number of edges connecting protein nodes with each other in the subgraph s. For weighted PPI networks, each cluster can be assessed according to Eq. (5):

$$WD(s) = \frac{2 \sum_{i=1}^{n} \sum_{j=2}^{n-1} WE_{i,j}}{n(n - 1)}, \tag{5}$$

where i and j denote the ith and the jth protein node in the subgraph s, respectively, and $WE_{i,j}$ is the edge clustering coefficient of the edge connecting the ith node with the jth node. The value WD(s) thus gives the average edge clustering coefficient over the protein nodes in cluster s. Suppose that the obtained cluster number is numclu; a set of clusters C can then be evaluated as follows:

$$fun(C) = \frac{1}{numclu} \sum_{k=1}^{numclu} WD(s)_k. \tag{6}$$

2.1.5 Evaluation criteria of cluster results

Precision, recall, and p-value are usually adopted to evaluate clustering results [8]. Suppose that X represents one cluster in the cluster results and $F_i$ stands for the matched cluster in the standard PPI dataset. Then

$$precision(X, F_i) = \frac{|X \cap F_i|}{|X|}, \tag{7}$$

$$recall(X, F_i) = \frac{|X \cap F_i|}{|F_i|}, \tag{8}$$

where $|X \cap F_i|$ is the number of proteins common to clusters X and $F_i$. However, both criteria are biased with respect to cluster size. Therefore, in order to balance the precision and recall values, we define the f-measure value as follows:

$$f\text{-}measure = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}. \tag{9}$$

In a PPI network, protein modules can be statistically evaluated using the p-value from the hypergeometric distribution, which is defined as

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{|X|}{i} \binom{|V| - |X|}{n - i}}{\binom{|V|}{n}}, \tag{10}$$

where |V| is the total number of proteins, |X| is the number of proteins in a reference function, n is the number of proteins in an identified module, and k is the number of proteins in common between the function and the module.

2.2 Methods

2.2.1 Data preprocessing

2.2.1.1 Calculation of distance among protein nodes

In PPI networks, each protein name can be mapped to a positive integer and the data converted into an adjacency matrix P. Assume that the number of protein nodes is n and that $X_i$ represents the ith protein, denoted as $X_i = (P_{i1}, P_{i2}, \ldots, P_{in})$. The inner product of two protein nodes is calculated by the equation $X_{ij} = (P_{i1}, P_{i2}, \ldots, P_{in}) \bullet (P_{j1}, P_{j2}, \ldots, P_{jn}) = P_{i1} \times P_{j1} + P_{i2} \times P_{j2} + \ldots + P_{in} \times P_{jn}$. The similarity between nodes i and j is defined as follows [23]:

$$S_{ij} = \frac{\sum_{k=1}^{n} \min(X_{ik}, X_{jk})}{\sum_{k=1}^{n} \max(X_{ik}, X_{jk})}. \tag{11}$$

A protein node not only interacts with adjacent protein nodes, but also interacts with other protein nodes via one or several intermediate protein nodes. As Fig. 2 shows, the protein node a can be denoted as $X_a = (p_{aa}, p_{ab}, p_{ac}, p_{ad}, p_{ae}, p_{af}) = (0, 1, 0, 1, 0, 0)$; similarly, $X_b = (1, 0, 0, 1, 0, 0)$, $X_c = (0, 0, 0, 1, 0, 0)$, $X_d = (1, 1, 1, 0, 1, 1)$, $X_e = (0, 0, 0, 1, 0, 0)$, $X_f = (0, 0, 0, 1, 0, 0)$. The associated matrix is obtained as follows:

$$X = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$

Figure 2. A sketch subgraph of PPI network.

Then, $X_{aa} = X_a \bullet (X_a)^T = (0, 1, 0, 1, 0, 0) \bullet (0, 1, 0, 1, 0, 0)^T = 2$, $X_{ab} = X_a \bullet X_b = (0, 1, 0, 1, 0, 0) \bullet (1, 0, 0, 1, 0, 0) = 1$, and $X_{bb} = X_b \bullet (X_b)^T = (1, 0, 0, 1, 0, 0) \bullet (1, 0, 0, 1, 0, 0)^T = 2$. In a similar way, $X_{ac} = 1$, $X_{bc} = 1$, $X_{ad} = 1$, $X_{bd} = 1$, $X_{ae} = 1$, $X_{be} = 1$, $X_{af} = 1$, $X_{bf} = 1$. Therefore, the similarity between nodes a and b is $S_{ab} = (1 + 1 + 1 + 1 + 1 + 1)/(2 + 2 + 1 + 1 + 1 + 1) = 6/8 = 0.75$. Clearly, the higher the similarity between two protein nodes, the shorter the distance between them. So, the distance between the two protein nodes a and b is $d_{ab} = 1 - S_{ab} = 0.25$.

The similarity between two protein nodes is calculated according to Eq. (11), while the similarity between a module $M_i$ and a different module $M_j$ is measured [7] by Eq. (12):

$$S(M_i, M_j) = \frac{\sum_{x \in M_i, y \in M_j} c(x, y)}{\min(|M_i|, |M_j|)}, \tag{12}$$

where

$$c(x, y) = \begin{cases} 1 & \text{if } x = y \\ w(x, y) & \text{if } x \neq y \text{ and } (x, y) \in E \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$

2.2.1.2 Initialization of cluster center

The node clustering coefficient only measures the joint density and strength among the nodes in the local proximity of a node. The comprehensive network feature value (CNFV) of a node [8] additionally reveals the joint strength between this node and other nodes. The CNFV of node i is defined as follows:

$$CNFV_i = \beta \times C_i + (1 - \beta) \times w(i)/n. \tag{14}$$

The parameter $\beta$ is a random number between 0 and 1, and n is the number of protein nodes in the PPI network.

2.2.2 Determination of membership function and nonmembership function

The concepts of membership degree and nonmembership degree play an important role in the clustering procedure, and it is essential to construct appropriate functions to calculate the membership degree and nonmembership degree among protein nodes. If two protein nodes are close, the possibility that they can be grouped into one cluster is high, and the membership degree between the two nodes is close to 1. As the distance increases, the membership degree gradually decreases. The relationship between the distance among protein nodes and the membership degree can be roughly described as in Fig. 3(A).

Figure 3. The relationships among membership degree, nonmembership degree, and distance of protein nodes.
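The distance $d_{ij}$ on which the membership degree depends is obtained from the similarity of Eq. (11). As a minimal sketch, the worked example for nodes a and b of Fig. 2 can be reproduced as follows; the function names and the list-of-lists matrix representation are illustrative assumptions, not part of the original method description.

```python
# Adjacency matrix of the Fig. 2 subgraph, rows/columns ordered a, b, c, d, e, f.
ADJ = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

def inner_product_matrix(adj):
    """X_ij = inner product of adjacency rows i and j."""
    n = len(adj)
    return [
        [sum(adj[i][k] * adj[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def similarity(x, i, j):
    """Eq. (11): sum of element-wise minima over sum of element-wise maxima."""
    num = sum(min(p, q) for p, q in zip(x[i], x[j]))
    den = sum(max(p, q) for p, q in zip(x[i], x[j]))
    return num / den if den else 0.0  # assumed convention for all-zero rows

def distance(x, i, j):
    """d_ij = 1 - S_ij."""
    return 1.0 - similarity(x, i, j)

X = inner_product_matrix(ADJ)
a, b = 0, 1
print(X[a])                 # [2, 1, 1, 1, 1, 1]
print(similarity(X, a, b))  # 0.75
print(distance(X, a, b))    # 0.25
```

The printed values agree with the hand calculation in the text: $S_{ab} = 6/8 = 0.75$ and $d_{ab} = 0.25$.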
The membership function is obtained from the relationship between the membership degree and the distance of two protein nodes:

$$\mu_{ij} = \begin{cases} 1 & 0 \le d_{ij} < 0.1 \\ \cos\left(d_{ij} \cdot \frac{\pi}{2}\right) & 0.1 \le d_{ij} < 0.9 \\ 0 & 0.9 \le d_{ij} \le 1 \end{cases} \tag{15}$$

Conversely, when two protein nodes are relatively close, they can be merged into one cluster, so the nonmembership degree between the two nodes is considered to be close to 0. As the distance increases from 0 to 1, it becomes less likely that the two nodes belong to the same cluster, and the nonmembership degree rises accordingly. The relation between the nonmembership degree and the distance of two protein nodes is illustrated in Fig. 3(B). Similarly, the nonmembership degree is calculated as follows:

$$\nu_{ij} = \begin{cases} 0 & 0 \le d_{ij} < 0.1 \\ \sin\left(d_{ij} \cdot \frac{\pi}{2}\right) & 0.1 \le d_{ij} < 0.9 \\ 1 & 0.9 \le d_{ij} \le 1 \end{cases} \tag{16}$$

2.2.3 Model design of improved algorithm

The BFO algorithm of ref. [16] mainly mapped three behaviors of bacteria, namely the chemotactic, reproduction, and elimination–dispersal operations, onto the problem of clustering PPI networks. However, the recall value of its cluster results is relatively low. This is because the PPI network differs from other complex networks and has small-world and scale-free characteristics: a large number of proteins have few interactions with other proteins and are abandoned in the clustering procedure. In fact, a protein (black node in Fig. 4) in a real PPI network can be included in several different protein complexes to perform different functions, i.e., protein functional modules overlap with each other, as Fig. 4 shows. Naturally, the concepts of membership degree and indeterminacy degree in the intuitionistic fuzzy set can be introduced to detect the overlapping modules.

Figure 4. The overlap of protein functional modules.

This paper proposes an improved BFO clustering algorithm based on the intuitionistic fuzzy set. We adopt the principle of improved bacteria foraging optimization (IBFO) to cluster the PPI network. During the clustering procedure, one bacterium is regarded as one protein node; the correspondence between the IBFO mechanism and PPI network clustering is listed in Table 1. The clustering model based on IBFO is shown in Fig. 5. Figure 6 is the flow chart of the clustering method based on IBFO.

Table 1. Corresponding relationship of the IBFO mechanism and the PPI networks clustering

BFO algorithm | The problem of clustering PPI networks
Bacterium | Protein node
Chemotactic operation | Eliminate isolated node according to edge-clustering coefficient
Reproduction operation | Merge nodes into the cluster that cluster center belongs to
Membership degree | Nodes that possess high membership degree are grouped into the cluster
Indeterminacy degree | Node that has the lower indeterminacy degree than the given threshold is merged into this cluster. Otherwise merge node into cluster and mark this node as unvisited node
Elimination–dispersal operation | Randomly choose one node as the new cluster center
Food distribution function | Average value of the edge-clustering coefficient in each cluster

In Fig. 6, flag is the parameter controlling the chemotactic operation, and Nre and Ned are the parameters of the reproduction operation and the elimination–dispersal operation, respectively. The counters of the reproduction operation and the elimination–dispersal operation are denoted as k and l. Initially, k and l are both set to 0.

First, in the chemotactic operation, the algorithm uses the edge clustering coefficient to eliminate the isolated protein nodes.
For each protein node i in the PPI network, the algorithm calculates the sum of the edge clustering coefficients of the edges connecting node i with other protein nodes. If the sum is zero, node i is regarded as an isolated node and abandoned. Then, the algorithm chooses one protein node that has a high CNFV value as the initial cluster center.

Figure 5. The clustering model design based on IBFO.

In the reproduction period, the algorithm searches for the protein nodes that have high membership degrees with respect to the cluster center and merges these nodes into the cluster that the cluster center belongs to. It calculates the membership degree, nonmembership degree, and indeterminacy degree between the cluster center j and each node i among the other unvisited nodes of the PPI network, based on Eqs. (15) and (16). If the membership degree $\mu_{ij}$ of node i is higher than the threshold of membership degree (T1) and the indeterminacy degree is lower than the threshold of indeterminacy degree (T3), then the protein node i is classified into the cluster that the cluster center j belongs to and marked as a visited protein node (the threshold of nonmembership degree (T2) has no effect on the clustering result and is not considered). Conversely, if the indeterminacy degree is higher than the threshold of indeterminacy degree, node i is not only grouped into the cluster that the cluster center j belongs to, but is also labeled as a fuzzy node that will be visited again and has the potential to be merged into other clusters. The procedure continues until all nodes in the PPI network are evaluated. Then one cluster is obtained.

In the elimination–dispersal phase, several bacteria die and the population produces new individuals. This operation corresponds to selecting a new protein node as the next cluster center according to the CNFV of the nodes. Then the algorithm starts to generate the next cluster by the reproduction operation described above. If the cluster number is larger than 3, the algorithm calculates the similarity between any two cluster modules according to Eqs. (12) and (13). Afterwards, it merges the clusters whose similarity is higher than the given threshold.

2.2.4 Implementation steps of IBFO algorithm

The specific implementation steps are as follows:

Procedure

Initialization: Assign values to the parameters: set the index of the external loop iter = 1 and the maximal number of iterations of the external loop maxiter = 100. Initialize the optimal fitness value gfval and the global optimal cluster result gcluster. Calculate the CNFV of all the protein nodes and the distance d between any two protein nodes. Calculate the membership degree, nonmembership degree, and indeterminacy degree among all the protein nodes according to the membership function Eq. (15) and the nonmembership function Eq. (16). Determine the threshold T1 of membership degree and the threshold T3 of indeterminacy degree.

Step 1: In the chemotactic operation, for each node i, the algorithm calculates the sum of the edge clustering coefficients of the edges connecting node i with other protein nodes in the PPI network. If the sum is zero, node i is regarded as an isolated node and eliminated.

Step 2: Set the index of the internal loop count = 1 and the maximal number of iterations of the internal loop maxcount = 100.

Step 3: Randomly select one protein node that has a high comprehensive network feature value as the cluster center.

Step 4: Corresponding to the reproduction operation of the BFO algorithm, the algorithm clusters the PPI network in accordance with the membership degree, nonmembership degree, and indeterminacy degree between the cluster center and the other protein nodes. Several protein nodes are grouped into the cluster that the cluster center belongs to and marked as visited nodes. However, some nodes are classified into the cluster and labeled as unvisited nodes, so that they may also participate in other clusters.

Step 5: Set count = count + 1; use the elimination–dispersal operation to randomly select a new cluster center, and go back to Step 4 to generate the next cluster by the reproduction operation.

Step 6: If the cluster number is larger than 3, calculate the similarity between any two clusters. Afterwards, merge the clusters whose similarity is higher than the given threshold.

Step 7: When all the protein nodes have been visited or the index of the internal loop count reaches the maximal number of iterations of the internal loop maxcount, a set of clusters of the PPI network is obtained.

Step 8: Calculate the fitness value of the obtained cluster results, compare it with the optimal fitness value gfval, and then update gfval and the global optimal cluster result gcluster. Set iter = iter + 1.

Step 9: The algorithm terminates when iter reaches the maximal number of iterations of the external loop maxiter; otherwise go back to Step 2.

Output the ultimate clustering result.

Figure 6. The flow chart of clustering method based on IBFO.
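The inner loop of the procedure (Steps 3 to 5 and 7) can be illustrated with the following Python sketch. It is a simplified reading of the steps above under several assumptions that are not taken from the paper: the membership and indeterminacy degrees are supplied as precomputed nested dictionaries, the next cluster center is chosen deterministically as the unvisited node with the largest CNFV instead of randomly among high-CNFV nodes, a fuzzy node must still exceed T1 to join a cluster, and the merging of similar clusters (Step 6) is omitted. All names and input values are hypothetical.

```python
def grow_clusters(nodes, cnfv, mu, hes, t1=0.52, t3=0.15):
    """Grow clusters one at a time around high-CNFV centers.

    nodes: iterable of node labels (isolated nodes already removed)
    cnfv:  dict node -> CNFV value
    mu:    dict center -> dict node -> membership degree
    hes:   dict center -> dict node -> indeterminacy degree
    t1/t3: thresholds of membership and indeterminacy degree
    """
    unvisited = set(nodes)
    clusters = []
    while unvisited:
        # Simplification: pick the unvisited node with the largest CNFV.
        center = max(unvisited, key=lambda v: cnfv[v])
        unvisited.discard(center)
        cluster = {center}
        for i in sorted(unvisited):  # sorted() copies, so the set may change
            if mu[center][i] > t1:
                cluster.add(i)
                if hes[center][i] < t3:
                    unvisited.discard(i)  # clear member: mark as visited
                # otherwise: fuzzy node, stays unvisited and may join
                # another cluster later (overlap)
        clusters.append(cluster)
    return clusters

# Hypothetical input values, chosen only to show an overlapping node.
nodes = ["a", "b", "c", "d"]
cnfv = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.8}
mu = {
    "a": {"b": 0.9, "c": 0.7, "d": 0.2},
    "d": {"a": 0.2, "b": 0.3, "c": 0.8},
    "b": {"a": 0.9, "c": 0.1, "d": 0.3},
    "c": {"a": 0.7, "b": 0.1, "d": 0.8},
}
hes = {
    "a": {"b": 0.05, "c": 0.30, "d": 0.10},
    "d": {"a": 0.10, "b": 0.10, "c": 0.10},
    "b": {"a": 0.05, "c": 0.10, "d": 0.10},
    "c": {"a": 0.30, "b": 0.10, "d": 0.10},
}

result = grow_clusters(nodes, cnfv, mu, hes)
print([sorted(c) for c in result])  # [['a', 'b', 'c'], ['c', 'd']]
```

In this toy run node c joins the cluster of center a as a fuzzy node and is therefore still available when the cluster of center d is generated, which yields two overlapping clusters.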

2.2.5 Time complexity of algorithm

Suppose that the number of protein nodes in the PPI dataset is n, the number of obtained clusters is numclu, the number of protein nodes in a cluster is num, the maximal number of iterations of the external loop is maxiter, and the maximal number of iterations of the internal loop is maxcount. The time complexity of the algorithm is as follows:

(i) In the data preprocessing phase, the time complexity of calculating the membership degree and indeterminacy degree between any two protein nodes is $O(n^2)$.
(ii) For each cluster center, the time complexity of obtaining one cluster via the chemotactic operation and the reproduction operation is $O(n)$.
(iii) The time complexity of calculating the similarity and merging the clusters that have high similarity is $O(numclu \times num^2 + numclu \times num)$.
(iv) The time complexity of obtaining a set of clusters is $O(maxcount \times (numclu \times num^2))$.
(v) The global optimal cluster result is obtained after executing the algorithm maxiter times, so the time complexity is $O(maxiter \times maxcount \times (numclu \times num^2))$.

3 Results

3.1 Parameter analysis

To assess the performance of the algorithm, the experiments were carried out on a Windows XP system with an Intel Core 2 Duo processor running at 2.93 GHz and 2 GB of memory. We use the Munich Information Center for Protein Sequence (MIPS) PPI datasets as our data source and the MIPS complex database as ground truth to evaluate the protein complexes predicted by our method [24]. Several parameters influence the cluster results, such as the parameters of BFO, the threshold of membership degree, the threshold of indeterminacy degree, and the maximal number of iterations of the IBFO algorithm.

Figure 7. The consequence of two initializing mechanisms on clustering results.

We execute the algorithms 20 times.
Figure 7 compares the f-measure values obtained with two initialization mechanisms: initializing the cluster center by the bacterial reproduction operation and initializing it randomly. The f-measure values obtained with the former are higher than those obtained with the latter, and they remain within a relatively stable range of 0.78 to 0.82.

Figure 8(A) illustrates the influence of the threshold T1 of membership degree on the cluster center in terms of precision, recall, and f-measure values. When T1 is less than 0.5, the three values show an upward trend, and they decrease as T1 approaches 0.8. Figure 8(B) describes the influence of the threshold T3 of indeterminacy degree. When T3 varies from 0 to 0.15, the precision, recall, and f-measure values gradually increase to their optimal values. Therefore, the threshold of membership degree is set to 0.52 and the threshold of indeterminacy degree is set to 0.15 in the following experiments. Figure 9 shows the effect of the maximal number of iterations on the clustering results. The algorithm performs best in precision, recall, and f-measure values when the maximal number of iterations reaches 100.

Figure 8. The effect of the threshold of membership degree and indeterminacy degree on clustering results.

Figure 9. The influence of maximum iterations on clustering results.

3.2 Performance of IBFO algorithm

This paper integrates the concepts of membership degree and indeterminacy degree from the intuitionistic fuzzy set into the BFO algorithm, so the algorithm may generate several overlapping functional modules in the final cluster results. During the reproduction operation, the indeterminacy degree is used to determine whether a protein node can be regarded as a fuzzy node and visited more than once when clustering the PPI network. This is intended to address the fact that one protein node may belong to one or more functional modules. It is essential to show whether the improvement based on the intuitionistic fuzzy set is reasonable and effective.

In Fig. 10, the dotted line stands for the cluster results obtained by the algorithm that ignores the indeterminacy degree of protein nodes, while the solid line represents the cluster results of the improved algorithm proposed in this paper. Figure 10(A) compares the two algorithms in terms of recall value. The IBFO algorithm achieves the higher recall, because each protein node may be merged into one or more clusters, so the algorithm can find the clusters as completely as possible. Figure 10(B) compares the two algorithms in terms of f-measure values. The f-measure value of the IBFO algorithm is superior to that of the algorithm that takes no account of the overlapping functional modules in the PPI network.

Figure 10. Comparisons of clustering results with and without the indeterminacy degree.

The functional-flow algorithm [7] is a relatively effective method for clustering PPI networks. It is based on the principle that the functional information of a protein flows through every possible path, so that we can quantify how much a protein functionally influences other adjacent proteins. The algorithm considers the generation of overlapping functional modules in the model design and merges the modules that have high similarity in the postprocessing stage. Moreover, the experiments show that the algorithm is comparatively efficient. However, the algorithm has to predefine the cluster number, and its precision and recall values are relatively low. Consequently, the method presented in ref. [16] adopts the mechanism of the BFO algorithm to optimize the procedure of clustering PPI networks. In the model design, the clusters are created one by one, which overcomes the drawback of predefining the cluster number. However, each protein node can be grouped into only one cluster, which goes against the topological character of PPI networks that one protein node can be grouped into two or more clusters. Therefore, the obtained clusters contain fewer protein nodes than the matched modules in the standard dataset, and the BFO algorithm does not work well in clustering PPI networks from the perspective of recall.

To address the shortcomings of predefining the cluster number and the lower recall value, the IBFO algorithm proposed in this paper introduces the concept of indeterminacy degree on the basis of the BFO algorithm. We execute each of the three algorithms 20 times; the precision, recall, and f-measure values of the cluster results are shown in Table 2. As Tables 2 and 3 show, the IBFO algorithm achieves higher average precision, recall, and f-measure values than the functional-flow and BFO algorithms, and it obtains the highest average f-measure among the algorithms compared in Table 3. The ultimate goal of the algorithm is to predict the clusters as accurately as possible. The top 20 clusters obtained by the IBFO algorithm are listed in Table 4. As Table 4 shows, these clusters include both the proteins classified rightly and other proteins that should have been grouped into clusters different from the corresponding modules. Modules 1, 6, and 19 contain relatively more rightly classified protein nodes, so the recall value of the cluster results is largely improved. A low value of p indicates that the module closely corresponds to the function, because it is less probable that the network will produce the module by chance.
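The way the two thresholds interact when a cluster is grown around a center can be illustrated with a small sketch. It assumes that the pairwise membership and indeterminacy degrees are already computed, that "high" means at or above the thresholds T1 and T3, and that a node with high indeterminacy is simply left unvisited so that a later center may claim it again. The function name, the dictionary layout, and the toy values are hypothetical; this is not the authors' implementation.

```python
def grow_cluster(center, nodes, membership, indeterminacy, visited, t1=0.52, t3=0.15):
    """Grow one cluster around `center` (illustrative only).

    membership / indeterminacy: dicts keyed by (center, node), values in [0, 1].
    visited: set of nodes fixed in an earlier cluster; updated in place.
    t1, t3: thresholds of membership and indeterminacy degree (values from Section 3.1).
    """
    cluster = {center}
    visited.add(center)
    for node in nodes:
        if node == center or node in visited:
            continue
        if membership.get((center, node), 0.0) >= t1:
            cluster.add(node)
            # Assumption: a node with high indeterminacy stays unvisited,
            # so a later cluster center may merge it again (overlap).
            if indeterminacy.get((center, node), 0.0) < t3:
                visited.add(node)
    return cluster


# Hypothetical toy network with two successive cluster centers, A and D.
nodes = ["A", "B", "C", "D", "E"]
membership = {("A", "B"): 0.9, ("A", "C"): 0.6, ("A", "E"): 0.1,
              ("D", "B"): 0.9, ("D", "C"): 0.7, ("D", "E"): 0.8}
indeterminacy = {("A", "B"): 0.05, ("A", "C"): 0.20,
                 ("D", "C"): 0.05, ("D", "E"): 0.02}
visited = set()
first = grow_cluster("A", nodes, membership, indeterminacy, visited)
second = grow_cluster("D", nodes, membership, indeterminacy, visited)
print(sorted(first), sorted(second), sorted(first & second))
```

In this toy run, node C exceeds both thresholds for center A, so it remains available and is also merged into the cluster of center D, whereas node B is fixed in the first cluster and cannot be claimed again.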
The cluster results in Table 4 can effectively identify sets of unknown proteins that share the same function and protein complex, which helps to predict the function of unknown proteins. Table 5 shows that three overlapping proteins are detected when the indeterminacy degree is set to 0.05, 0.20, or 0.25, whereas five and seven overlapping proteins are identified when it is set to 0.10 and 0.15, respectively. Thus, the most overlapping proteins are obtained when the indeterminacy degree is set to 0.15, which agrees with Fig. 8(B). This result reflects only the statistical and simulation point of view. In practice, the appropriate value of the indeterminacy degree and the question of which proteins overlap should be determined by the experiments and analysis of biologists. Nevertheless, the result provides them with a reference to a certain extent.

Table 2. Comparisons among the three algorithms

Running   Flow                             BFO                              IBFO
times     Precision  Recall   f-measure    Precision  Recall   f-measure    Precision  Recall   f-measure
1         0.3159     0.6635   0.4280       0.7436     0.4447   0.5566       0.8628     0.7626   0.8096
2         0.2727     0.7316   0.3973       0.7446     0.7669   0.5740       0.9053     0.7170   0.8002
3         0.2554     0.7454   0.3804       0.6776     0.4413   0.5345       0.8491     0.7586   0.8013
4         0.2341     0.5225   0.3233       0.7102     0.7391   0.5427       0.8519     0.7343   0.7887
5         0.2445     0.5364   0.3359       0.6651     0.4460   0.5340       0.8492     0.7728   0.8092
6         0.2365     0.5265   0.3264       0.6364     0.4480   0.5259       0.8564     0.7435   0.7959
7         0.2192     0.5367   0.3117       0.7117     0.4355   0.5404       0.8877     0.7217   0.7961
8         0.2292     0.5212   0.3123       0.6715     0.7179   0.5153       0.8736     0.7494   0.8067
9         0.2192     0.5253   0.3093       0.7590     0.4222   0.5426       0.8839     0.7164   0.7913
10        0.2083     0.5155   0.2967       0.7294     0.4673   0.5697       0.8907     0.6824   0.7727
11        0.2211     0.5276   0.3116       0.6510     0.4435   0.5277       0.8661     0.7278   0.7909
12        0.2044     0.5395   0.2965       0.6679     0.4677   0.5502       0.8504     0.7690   0.8076
13        0.2215     0.5352   0.3133       0.6868     0.4429   0.5386       0.8575     0.7081   0.7756
14        0.2264     0.5194   0.3153       0.6943     0.4316   0.5324       0.8472     0.7937   0.8195
15        0.2176     0.5423   0.3106       0.7222     0.4294   0.5386       0.8750     0.7042   0.7803
16        0.2212     0.5337   0.3128       0.6480     0.4221   0.5113       0.8491     0.7577   0.8008
17        0.2373     0.5221   0.3263       0.7165     0.7148   0.5255       0.8644     0.7702   0.8149
18        0.2222     0.5334   0.3137       0.7050     0.4273   0.5322       0.8709     0.6942   0.7725
19        0.2302     0.5126   0.3177       0.7179     0.4366   0.5431       0.8726     0.6777   0.7629
20        0.2427     0.5243   0.3318       0.7185     0.4327   0.5402       0.8539     0.7674   0.8083

Table 3. Comparisons of flow and our algorithms on average value

Algorithms           Average value
                     Precision   Recall   f-measure
Flow [7]             0.23        0.56     0.31
IQ-Flow [11]         0.67        __       __
IQ-Flow fast [11]    0.72        __       __
ABC-Flow [12]        0.70        0.84     0.76
JSACO [13]           0.87        0.26     0.55
BFO [16]             0.70        0.50     0.54
IBFO                 0.87        0.74     0.79

Table 4. The proteins and p-value of the top 20 clusters

Columns: Cluster ordinal; The proteins classified rightly; The proteins classified wrongly; ID of the protein function modules; p-value

1 YBR120c, YOR334w, YIR021w, YDR194c, YMR023c, YGR222w YPR025c, YDL108w, YLR005w, YPR056w, YPL122c, YER171w YDR176w, YDR145w, YGL112c, YPL254w, YCL010c, YOL148c, YLR055c, YGL066w, YDR392w, YBR081c, YBR192c, YMR236w YHR069c, YDR280w, YOL021c, YGR195W, YDL111c, YGR095c, YCR035c YLR381w, YJR135c YLR115w, YLR277c, YAL043c YPL010w, YNL287w, YDL145c, YFR051c, YGL137w YDR028c, YER133w, YOR178c, YKL193c, YER054c, YMR311c YKR052c, YLR382c, YDL044c, YHR005c-a, YJL133w, YPR134w YIL143c, YDR311w 500.50 0.1644 510.100 0.0771 YDR167w 230.20.10 0.1044 YOL077w-a, YKL190w, YKL058w, YJL074c 440.12.10 0.0985 YPR046w YDR301w YLR093c, YDR238c 270.20.20 440.10.20 260.30.10 0.2321 0.2021 0.1864 YNL126w 450 0.0936 2 3 4 5 6 7 8

9 YPL129w, YPL016w, YJL176c, YMR033w, YBR289w, YOR290c, YDR073w, YPR034w YML099c, YMR043w, YMR042w, YDR137c YRB123c, YAL001c, YGR047c, YOR110w, YDR362c YMR033w, YOR213c, YCR052w, YKR008, YFR037c, YLR321c, YIL126w, YLR357w, YPR034w, YGR056w YOL123w, YGL044c, YDR228c, YMR061w, YOR250c YEL020w-a, YHR005c-a YJL154c, YJL053w, YHR012w YNL062c, YNL244c, YOR361c, YDR429c, YMR146c, YBR079c YNL199c, YPL075w YDL069c, YHL038c, YJL209w, YDR197w YNL290w, YOR217w, YJR068w, YOL094c, YBR087w YGR072w, YMR080c, YHR077c YNR023w, YHL025w 510.190.50 0.1563 YGL154c, YGR244c, YGR113w ___ 510.190.120 510.150 0.2179 0.0827 YLR060w, YLP045w, YLR148w 400 0.1329 YHR012w 440.10.10 0.1195 YCL009c ___ YGL130w, YHR164c, YER176w 440.40 260.30.30.10 500.10.40 0.3019 0.1092 0.3919 YPL001w YPL075w ___ 510.190.90 440.20 410.40.30 0.2109 0.1560 0.1210 YJR052w, YER090w 300 0.2057 10 11 12 13 14 15 16 17 18 19 20

Table 5. The overlapping proteins under different indeterminacy degree

Indeterminacy degree   The overlapping proteins
0.05                   YDR276c, YDR488c, YPL017c
0.10                   YNL088w, YDR488c, YER007w, YPL155c, YMR138w
0.15                   YNL088w, YDR276c, YEL020w-a, YFL018c, YLR055c, YDR149c, YOR266w
0.20                   YLR055c, YBR107c, YPR070w
0.25                   YIL095w, YPR023c, YLR212c

4 Discussion

The BFO algorithm in ref. [16] has a low recall value in clustering PPI networks. In this paper, we proposed a novel method that combines the BFO mechanism with the intuitionistic fuzzy set. In the chemotactic operation, the algorithm first eliminates the isolated points based on the edge-clustering coefficient. In the reproduction operation, the nodes that have a high membership degree are merged into the cluster that the cluster center belongs to. Meanwhile, the nodes that also have a high indeterminacy degree are labeled as unvisited protein nodes and may be grouped into two or more clusters. The elimination–dispersal operation is equivalent to selecting the next cluster center and generating another cluster. In the end, the algorithm merges the clusters having high similarity and terminates when it reaches the maximal number of iterations. The simulation results on the PPI dataset showed that the algorithm not only effectively improves the accuracy of the cluster results and automatically determines the cluster number, but also identifies the overlapping modules successfully. However, some parameters of the algorithm influence the cluster results, which should be discussed further. In addition, constructing a dynamic model of the PPI network and designing the corresponding algorithms is a direction for future research.

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61100164 and 61173190, the Natural Science Foundation of Shaanxi Province of China in 2010 under Grant No. 2010JQ8034, and the Fundamental Research Funds for the Central Universities under Grant No. GK200902016.

The authors have declared no conflict of interest.

5 References

[1] Watts, D. J., Strogatz, S. H., Collective dynamics of ‘small-world’ networks. Nature 1998, 393, 440–442.
[2] Barabási, A. L., Oltvai, Z. N., Network biology: understanding the cell’s functional organization. Nature Rev. Genet. 2004, 5, 101–113.
[3] Soon-Hyung, Y., Oltvai, Z. N., Barabási, A. L., Functional and topological characterization of protein interaction networks. Proteomics 2004, 4, 928–942.
[4] Penggang, S., Lin, G., Identification of overlapping and nonoverlapping community structure by fuzzy clustering in complex networks. Inform. Sci. 2011, 181, 1060–1071.
[5] Berggård, T., Linse, S., James, P., Methods for the detection and analysis of protein-protein interactions. Proteomics 2007, 7, 2833–2842.
[6] Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M., Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21, i302–i310.
[7] Cho, Y. R., Hwang, W., Ramanathan, M., Aidong, Z., Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 2007, doi:10.1186/1471-2105-8-265.
[8] Aidong, Z., Protein Interaction Networks, Cambridge University Press, New York 2009.
[9] Kenley, E. C., Cho, Y. R., Detecting protein complexes and functional modules from protein interaction networks: a graph entropy approach. Proteomics 2011, 11, 3825–3844.
[10] Goel, A., Simone, S. Li, Marc, R. W., Four-dimensional visualisation and analysis of protein–protein interaction networks. Proteomics 2011, 11, 2672–2682.
[11] Xiujuan, L., Xu, H., Lei, S., Aidong, Z., Clustering PPI data based on improved functional-flow model through quantum-behaved PSO. Int. J. Data Min. Bioinform. 2012, 6, 42–60.
[12] Xiujuan, L., Jianfang, T., The information flow clustering model and algorithm based on the artificial bee colony mechanism of PPI network. Chinese J. Comput. 2012, 35, 134–145.
[13] Xiujuan, L., Xu, H., Shuang, W., Ling, G., Joint strength based ant colony optimization clustering algorithm for PPI networks. Acta Electron. Sin. 2012, 40, 695–702.
[14] Passino, K. M., Biomimicry of bacterial foraging for distributed optimization and control. IEEE Contr. Syst. Mag. N Y 2002, 22, 52–67.
[15] Kim, D. H., Abraham, A., Cho, J. H., A hybrid genetic algorithm and bacterial foraging approach for global optimization. Inform. Sci. 2007, 177, 3918–3937.
[16] Xiujuan, L., Shuang, W., Liang, G., Aidong, Z., Clustering PPI data based on bacteria foraging optimization algorithm. 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM11), Atlanta, Georgia, 2011, 96–99.
[17] Zadeh, L. A., Fuzzy sets. Inform. Cont. 1965, 8, 338–353.
[18] Atanassov, K., Intuitionistic fuzzy sets. Fuzzy Sets Syst. 1986, 20, 87–96.
[19] Dasgupta, S., Biswas, A., Abraham, A., Das, S., Adaptive computational chemotaxis in bacterial foraging algorithm, 2008 International Conference on Complex, Intelligent and Software Intensive Systems 2008, 13, 64–71.
[20] Veysel, G., Kevin, M. P., Swarm Stability and Optimization. Springer Verlag, Berlin Heidelberg 2011.
[21] De, S. K., Biswas, R., Roy, A. R., Some operations on intuitionistic fuzzy sets. Fuzzy Sets Syst., Arti. Intell. 2003, 2715, 285–292.
[22] Huan, W., Min, L., Jianxin, W., Yi, P., A new method for identifying essential proteins based on edge clustering coefficient. Lecture Notes in Computer Science 2011, 6674, 87–98.
[23] Letovsky, S., Kasif, S., Predicting protein function from protein-protein interaction data: a probabilistic approach. BMC Bioinformatics 2003, 19, 197–204.
[24] Güldener, U., Münsterkötter, M., Kastenmüller, G., Strack, N. et al., CYGD: the comprehensive yeast genome database. Nucl. Acids Res. 2005, 33, D364–D368.