A Graph-based Cluster Ensemble Method to Detect Protein Functional Modules From Multiple Information Sources

Yuan Zhang∗, Liang Ge+, Nan Du+, Kebin Jia∗, Aidong Zhang+
∗ Beijing University of Technology, Beijing, 100124, China
+ State University of New York at Buffalo, Buffalo, 14260, U.S.A.
zhangyuan@emails.bjut.edu.cn, kebinj@bjut.edu.cn
liangge, nandu, azhang@buffalo.edu

ABSTRACT

Much work has been done to identify functional modules in Protein-Protein Interaction (PPI) networks, but the results are far from satisfactory. One main reason is that the PPI data generated from high-throughput experiments is noisy and incomplete. Solving the problem goes beyond what a single data source can provide and thus requires the integration of multiple information sources. To address this problem, we propose a graph-based cluster ensemble method which integrates gene ontology (GO) and gene expression data with PPI networks. Experimental results show that our method is superior to the baseline methods and demonstrate the benefits of integrating multiple biological information sources and diverse clustering methods.

1. INTRODUCTION

Predicting protein functional modules among interacting proteins is a significant problem for understanding biological processes at the cellular level. PPI networks are the most important sources of information for functional module detection. It is common practice to adopt clustering methods to find the protein functional modules [2, 8]. However, we observe that using a single information source or criterion to find protein functional modules suffers from the following three limitations. Firstly, it is well known that a PPI network contains a lot of noise and errors: the rate of false positive links is sometimes up to 50% [11]. Also, gene expression data is reported to be noisy and incomplete [6]. Given a data source with so much noise and so many errors, the clustering results can be expected to be unsatisfactory.
Secondly, the clusters produced by methods that are based on a single information source may lack biological meaning. The third limitation is caused by the clustering methods themselves. Most module detection methods are case dependent, and their results typically depend on the specific initial conditions, seed selection, or parameter settings. Furthermore, different clustering methods tend to produce different and unstable cluster partitions based on their own objective functions.

The challenge of protein module detection thus becomes how to effectively integrate multiple biological data sources. Here we propose a graph-based cluster ensemble method to approach the problem. Extensive experimental results show that our method is superior to the baseline methods. The evaluation results also show that our method predicts protein functional modules considerably better than the individual clustering methods alone. Also, by adding additional information, the full-graph-based cluster ensemble method yields more accurate functional modules than the bipartite graph alone.

2. GRAPH BASED CLUSTER ENSEMBLE

2.1 Base Similarity Graphs and Base Clusters

We adopt two different protein correlation measurements that integrate Gene Ontology information and gene expression data with the PPI network. The Pearson correlation coefficient (normalized to the range of 0 to 1) is used to calculate the co-expression correlation.
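As an illustration of the co-expression measure, the sketch below computes a Pearson correlation matrix from expression profiles and rescales it to the range 0 to 1. The text does not state which normalization is used, so the linear map (r + 1) / 2 is an assumption (taking the absolute value |r| would be another plausible choice). The function name `coexpression_similarity` and the random input are hypothetical and serve only as an example.

```python
import numpy as np


def coexpression_similarity(expr):
    """Pearson correlation between gene profiles, rescaled to [0, 1].

    expr: array of shape (n_genes, n_conditions), one expression profile per row.
    The rescaling (r + 1) / 2 is an assumed normalization, not taken from the paper.
    Rows with zero variance would yield NaN and should be filtered out beforehand.
    """
    r = np.corrcoef(expr)                      # row-wise Pearson correlation in [-1, 1]
    return np.clip((r + 1.0) / 2.0, 0.0, 1.0)  # clip guards against rounding error


# Hypothetical input: 5 genes, each described by 131 expression values.
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 131))
coexp = coexpression_similarity(expr)
print(coexp.round(3))
```

With this mapping, perfectly correlated profiles get similarity 1, uncorrelated profiles 0.5, and perfectly anti-correlated profiles 0.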
GO-driven similarity is calculated based on the information content of GO terms, using Lin's model [7]. The co-expression correlation and GO-driven similarity are denoted as CoExp and Sim respectively. Three different clustering methods, namely hierarchical agglomerative clustering, K-means, and Markov clustering (MCL), are performed on these two correlation matrices to obtain the base cluster partitions. Each of the three methods optimizes its own objective function and thus contributes diverse clustering results to our cluster ensemble model.

2.2 Bipartite Graph Model

As illustrated in Figure 1, we construct a bipartite graph based on the affiliation of proteins and clusters, denoted as BGENS. Suppose we are given a set of $T$ clustering partitions $G = \{G_1, G_2, \ldots, G_T\}$ of the original data. Each clustering partition $G_t$, $t \in \{1, 2, \ldots, T\}$, consists of a set of clusters from a certain clustering method with one binary matrix. BGENS constructs the graph $G_B(V, E)$, where $V = V^G \cup V^N$, $V^G$ represents all the cluster nodes from the base clustering partitions, and $V^N$ is the set of $N$ protein nodes in the data. $E$ is the pairwise connectivity relation between the protein nodes and the clusters: if protein node $i$ belongs to cluster node $j$, then $E(i, j) = E(j, i) = 1$, and 0 otherwise. From the clustering partitions we thus get the adjacency matrix $A_B$ of the bipartite graph:

$$A_B = \begin{pmatrix} 0 & E^T \\ E & 0 \end{pmatrix} \tag{1}$$

[Figure 1: diagram with cluster nodes g1 to g8 and protein nodes p1, p2]
Figure 1: Full graph based on the bipartite graph. Circles represent proteins and triangles represent clusters derived from base clustering methods. Solid lines form a bipartite graph between proteins and clusters; adding the light dotted lines, which represent relationships among the clusters or among the proteins, yields the full graph. The final clustering results are separated by bold dashed lines.

2.3 Full Graph Model

In the bipartite graph above, the original interactions between the proteins and the relationships between the clusters, which intuitively give us more information for identifying the final clusters, are omitted. For proteins that appear evenly in different clusters, the original protein interactions provide a basis for deciding which cluster they prefer. As illustrated in Figure 1, when protein p1 has been clustered into clusters g2 and g7, and p2 into clusters g6 and g7, we partition the two proteins into different final clusters by taking their original relationships with other proteins into consideration. The same situation arises for the cluster nodes. We calculate the similarity of clusters with the Jaccard similarity measurement. By adding the relationships among proteins and among clusters, we get the full graph of the cluster ensemble, $G_F$, whose adjacency matrix is $A_F$. We denote this model as FGENS.

$$A_F = \begin{pmatrix} P_s & E^T \\ E & G_s \end{pmatrix} \tag{2}$$

where $P_s$ is the binary matrix describing protein interaction and $G_s$ is the binary matrix of cluster similarity. We calculate $P_s$ based on a joint measurement of co-expression correlation and GO-driven similarity:

$$P_s(p_i, p_j) = \begin{cases} 1 & \text{if } \mathrm{InSim}(p_i, p_j) > \beta, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

where InSim is the joint similarity measurement:

$$\mathrm{InSim}(p_i, p_j) = (1 - \alpha) \times \mathrm{CoExp}(p_i, p_j) \times P_{pi}(p_i, p_j) + \alpha \times \mathrm{Sim}(p_i, p_j) \times P_{pi}(p_i, p_j), \tag{4}$$

where $P_{pi}$ is the PPI network adjacency matrix and $\alpha$ is a real number between 0 and 1 that specifies how much the GO-driven similarity measurement contributes to the joint matrix. In our experiments, we choose $\alpha = 0.7$ according to the experimental optima. $G_s$ in Equation 2 is the binary matrix of Jaccard similarity of clusters:

$$G_s(g_i, g_j) = \begin{cases} 1 & \text{if } \dfrac{|g_i \cap g_j|}{|g_i \cup g_j|} > \beta, \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$

In our experiments, we set $\beta = 0.8$.

2.4 Graph Partition

With the graph constructed above, the cluster ensemble problem is reduced to a graph partition problem. We use the spectral clustering method, which is a well-studied and efficient graph partition algorithm. The goal of spectral clustering is to partition the graph into $K$ parts with the objective of minimizing the cut (the sum of the weights of the edges connecting different parts of the graph) [9]. Spectral clustering normalizes the similarity matrix, $A_F$ in our case, with the diagonal matrix $D(i, i) = \sum_j A_F(i, j)$, which is the degree matrix of $A_F$. This gives the normalized matrix $L = D^{-1} A_F$. The method then maps the nodes into a $K$-dimensional space whose coordinates are given by the eigenvectors of the $K$ largest eigenvalues of $L$, and the resulting matrix is clustered via K-means clustering.

3. EXPERIMENT AND RESULTS

3.1 Data sets

Our PPI data set is from Gavin [5]. It contains 2551 proteins annotated with their GO function terms and 21413 interactions, which were obtained by mass spectrometry of tandem affinity purification (TAP) data. We downloaded the GO-slim file from http://www.geneontology.org/. We used the January 14, 2012 version of the GO slim mapping file and chose the Biological Process (BP) hierarchy to calculate the GO-driven similarity of proteins. The gene expression data is from Brem [3], in which each gene is described by 131 expression values. We evaluate our results with the CYC2008 benchmark dataset [10], a comprehensive catalogue of 408 manually curated heteromeric protein complexes in S. cerevisiae reliably backed by small-scale experiments from the literature.

[Figure 2: bar chart of Precision, Recall and F-measure for MCODE, CFinder, RRW, Flownet, INENS, CLENS, BGENS and FGENS]
Figure 2: Comparison with baseline methods.

3.2 Comparison With Baseline Methods

We compared the performance of our methods with the previous methods: Flownet [4], MCODE [2], CFinder [1], the spectral clustering method [9], RRW [8], and MCL.
We implemented these baseline methods on the TAP data with optimal parameter settings, and the comparison results are presented in Figure 2. Based on the comparison, our method outperforms the other approaches with higher F-measure and Recall, which demonstrates that our method finds more matched modules in the PPI network than the others. In detail, we have the following observations:

1) FGENS gets higher F-measure and Recall than MCODE, CFinder and RRW, which use only the PPI network topology, and almost achieves the Precision of CFinder. Furthermore, FGENS obtains better results on all three evaluation criteria than the Flownet method, which combines GO similarity information. These results demonstrate that the integration of multiple information sources greatly benefits functional module detection.

2) FGENS performs better than the other cluster ensemble methods, i.e., INENS and CLENS. This indicates that our graph-based cluster ensemble method is more effective than other ensemble methods and produces more consensual and meaningful modules.

3) By adding the additional information about proteins and clusters to the bipartite graph, our full graph model captures more precise membership relationships than BGENS, as shown by its higher values on all three evaluation criteria.

3.3 Analysis of Ensemble Power

To evaluate the performance of the graph-based cluster ensemble method, we compared the ensemble result with the results of the single clustering partitions in detail.

Table 1: Comparison of the base clustering partitions and proposed method

Method    Precision  Recall    F-measure
CoexHie   0.528428   0.446328  0.48392
CoexKm    0.548495   0.463277  0.502297
GOHie     0.521739   0.440678  0.477795
GOKm      0.535117   0.451977  0.490046
CoexMCL   0.40856    0.29661   0.343699
GOMCL     0.561338   0.426554  0.484751
BGENS     0.603333   0.511299  0.553517
FGENS     0.636667   0.539548  0.584098
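The three evaluation criteria are not defined explicitly in the text. As a consistency check, the sketch below assumes the usual definition of F-measure as the harmonic mean of Precision and Recall and recomputes it for each row of Table 1. The helper name `f_measure` is hypothetical; the numbers are copied from Table 1.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (Precision, Recall, reported F-measure) for each row of Table 1.
table1 = {
    "CoexHie": (0.528428, 0.446328, 0.48392),
    "CoexKm":  (0.548495, 0.463277, 0.502297),
    "GOHie":   (0.521739, 0.440678, 0.477795),
    "GOKm":    (0.535117, 0.451977, 0.490046),
    "CoexMCL": (0.40856, 0.29661, 0.343699),
    "GOMCL":   (0.561338, 0.426554, 0.484751),
    "BGENS":   (0.603333, 0.511299, 0.553517),
    "FGENS":   (0.636667, 0.539548, 0.584098),
}

# Largest gap between the recomputed and the reported F-measure over all rows.
max_abs_error = max(abs(f_measure(p, r) - f) for p, r, f in table1.values())
print(f"largest deviation from reported F-measure: {max_abs_error:.2e}")
```

Under this assumption the recomputed values agree with the reported F-measure column up to rounding, which supports reading the column as 2PR / (P + R).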
The results for single clustering partitions are denoted as GOKm, CoexKm (for K-means), GOHie, CoexHie (for hierarchical clustering) and GOMCL, CoexMCL (for MCL) in Table 1. Note that FGENS gets higher Precision, Recall and F-measure than all the single clustering partitions, indicating that the integration of multiple sources helps considerably in predicting protein functional modules in PPI networks. We extracted all the matched modules from the clustering methods and counted the number of proteins they covered and the modules they detected. The results are shown in Table 2. FGENS retrieves a larger number of matched modules and covers more proteins than all of the base clustering methods. The results of FGENS are also slightly better than those of BGENS, which omits the original protein interactions and the relationships between clusters.

Table 2: The matched modules from the base clustering methods and proposed method

Method    Average size  Matched modules  Cover proteins
CoexHie   8.16          158              1289
CoexKm    8.34          164              1368
GOHie     4.42          156              689
GOKm      5.25          160              840
CoexMCL   5.92          105              622
GOMCL     4.46          151              674
BGENS     9.14          181              1654
FGENS     9.56          191              1826

4. CONCLUSIONS

This paper has presented a graph-based cluster ensemble method that integrates multiple data sources to identify protein functional modules from PPI networks. We applied our method to the TAP data and evaluated the results against the hand-curated complexes from the CYC2008 benchmark dataset. The experimental results show that our cluster ensemble framework is an efficient and effective way to integrate different biological information sources for identifying more stable and balanced functional modules. The diversity of the biological information sources and of the clustering methods has reinforced our results rather than degraded them.

5. ACKNOWLEDGMENTS

The research work is supported by National Natural Science Foundation of China under Grant No. 30970780 and Ph.D. Programs Foundation of Ministry of Education of China under Grant No.
20091103110005.

6. REFERENCES

[1] Balazs Adamcsek, Gergely Palla, Illes J Farkas, Imre Derenyi, and Tamas Vicsek. CFinder: Locating cliques and overlapping modules in biological networks. Bioinformatics, 22(8):1021–1023, 2006.
[2] Gary D Bader and Christopher W V Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(1):2, 2003.
[3] Rachel B Brem and Leonid Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1572–1577, 2005.
[4] Young-Rae Cho, Lei Shi, and Aidong Zhang. Flownet: Flow-based approach for efficient analysis of complex biological networks. In 2009 Ninth IEEE International Conference on Data Mining, pages 91–100, 2009.
[5] A C Gavin, P Aloy, P Grandi, R Krause, M Boesche, M Marzioch, C Rau, L J Jensen, S Bastuck, B Dumpelfeld, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084):631–636, 2006.
[6] Heather Hardway. Gene network models robust to spatial scaling and noisy input. Mathematical Biosciences, 237(March):1–16, 2012.
[7] Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML), pages 296–304, 1998.
[8] Kathy Macropol, Tolga Can, and Ambuj K Singh. RRW: Repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics, 10:283, 2009.
[9] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[10] Shuye Pu, Jessica Wong, Brian Turner, Emerson Cho, and Shoshana J Wodak. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Research, 37(3):825–831, 2009.
[11] Christian von Mering, Roland Krause, Berend Snel, Michael Cornell, Stephen G Oliver, Stanley Fields, and Peer Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 2002.