A Graph-based Cluster Ensemble Method to Detect Protein
Functional Modules From Multiple Information Sources
Yuan Zhang∗, Liang Ge+, Nan Du+, Kebin Jia∗, Aidong Zhang+
∗ Beijing University of Technology, Beijing, 100124, China
+ State University of New York at Buffalo, Buffalo, 14260, U.S.A.
zhangyuan@emails.bjut.edu.cn, kebinj@bjut.edu.cn
{liangge, nandu, azhang}@buffalo.edu
ABSTRACT
Much work has been done to identify functional modules in Protein-Protein Interaction (PPI) networks, but the results are far from satisfactory. One main reason is that the PPI data generated from high-throughput experiments is noisy and incomplete. Solving the problem goes beyond what a single data source can provide and thus requires the integration of multiple information sources. To address this problem, we propose a graph-based cluster ensemble method which integrates gene ontology (GO) and gene expression data with PPI networks. Experimental results show that our method is superior to the baseline methods and demonstrate the benefits of integrating multiple biological information sources and diverse clustering methods.
1. INTRODUCTION
Predicting protein functional modules among interacting proteins is a significant problem for understanding biological processes at the cellular level. PPI networks are the most important sources of information for functional module detection, and it is common practice to adopt clustering methods to find the protein functional modules [2, 8]. However, we observe that using a single information source or criterion to find protein functional modules suffers from three limitations. First, it is well known that a PPI network contains a lot of noise and errors: the rate of false positive links is sometimes up to 50% [11]. Gene expression data is also reported to be noisy and incomplete [6]. Given a data source with so much noise and so many errors, the clustering results cannot be expected to be satisfactory. Second, the clusters produced by methods that are based on a single information source may lack biological meaning. Third, most module detection methods are case dependent, and their results typically depend on the specific initial conditions, seed selection, or parameter settings. Furthermore, different clustering methods tend to produce different and unstable cluster partitions based on their own objective functions.
The challenge of protein module detection therefore becomes how to effectively integrate multiple biological data sources. Here we propose a graph based cluster ensemble method to approach the problem. Extensive experimental results show that our method is superior to the baseline methods. The evaluation results also show that our method greatly improves the prediction of protein functional modules compared with using individual clustering methods alone. Moreover, by adding additional information, the full graph based cluster ensemble method achieves more accurate functional modules than the bipartite graph alone.
2. GRAPH BASED CLUSTER ENSEMBLE
2.1 Base Similarity Graphs and Base Clusters
We adopt two different protein correlation measurements that integrate Gene Ontology information and gene expression data with the PPI network. The Pearson correlation coefficient (normalized to the range of 0 to 1) is used to calculate the co-expression correlation. The GO-driven similarity is calculated based on the information content of GO terms, using Lin's model [7]. The co-expression correlation and the GO-driven similarity are denoted as CoExp and Sim, respectively.
Three different clustering methods, namely hierarchical agglomerative clustering, K-means, and Markov clustering (MCL), are performed on those two correlation matrices to obtain the base cluster partitions. Each of the three methods optimizes its own objective function and thus produces diverse clustering results for our cluster ensemble model.
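The two base similarity measures can be illustrated with a short Python sketch. It is not the implementation used in the experiments: the rescaling (r + 1) / 2 is only one assumed way to map the Pearson coefficient to the range of 0 to 1, the function names are hypothetical, and Lin's measure is shown for a single pair of GO terms whose probabilities are taken as given inputs.

```python
import math
import numpy as np


def coexp_matrix(expr):
    """Pearson correlation between expression profiles (one row per gene),
    rescaled from [-1, 1] to [0, 1]."""
    r = np.corrcoef(expr)
    # Assumed rescaling; the text only states "normalized to the range of 0 to 1".
    return (r + 1.0) / 2.0


def lin_similarity(p_a, p_b, p_ancestor):
    """Lin's measure for two GO terms: 2 * IC(ancestor) / (IC(a) + IC(b)),
    where IC(t) = -log p(t) and the ancestor is the most informative common
    ancestor of the two terms. Probabilities must lie strictly between 0 and 1."""
    def ic(p):
        return -math.log(p)
    return 2.0 * ic(p_ancestor) / (ic(p_a) + ic(p_b))


# Made-up expression profiles for three genes over four conditions.
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],
                 [4.0, 3.0, 2.0, 1.0]])
CoExp = coexp_matrix(expr)
print(CoExp)
print(lin_similarity(0.25, 0.25, 0.5))
```

Perfectly correlated profiles map to a CoExp value near 1 and perfectly anti-correlated ones to a value near 0, so the matrix can be used directly as a similarity for the base clustering methods.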
2.2 Bipartite Graph Model
As illustrated in Figure 1, we construct a bipartite graph based on the affiliation of proteins and clusters, and denote this model as BGENS. Suppose we are given a set of $T$ clustering partitions $G = \{G_1, G_2, \ldots, G_T\}$ of the original data. Each clustering partition $G_t$, $t \in \{1, 2, \ldots, T\}$, consists of a set of clusters from a certain clustering method and is described by one binary matrix. BGENS constructs the graph $G^B(V, E)$, where $V = V^G \cup V^N$, $V^G$ contains all the cluster nodes from the base clustering partitions, and $V^N$ contains the $N$ protein nodes in the data. $E$ is the pairwise connectivity relation between the protein nodes and the cluster nodes: $E(i, j) = E(j, i) = 1$ if protein node $i$ belongs to cluster node $j$, and 0 otherwise. From the clustering partitions we thus obtain the adjacency matrix $A^B$ of the bipartite graph:

$$A^B = \begin{bmatrix} 0 & E^T \\ E & 0 \end{bmatrix} \qquad (1)$$

Figure 1: Full graph based on the bipartite graph (proteins p1, p2 and clusters g1 to g8). Circles represent proteins and triangles represent clusters derived from the base clustering methods. Solid lines form the bipartite graph between proteins and clusters; adding the light dotted lines, which represent relationships among the clusters or among the proteins, yields the full graph. The final clustering results are separated by bold dashed lines.

2.3 Full Graph Model
In the bipartite graph above, the original interactions between the proteins and the relationships between the clusters are omitted, although they intuitively give us more information for identifying the final clusters. For proteins that appear evenly in different clusters, the original protein interactions provide the basis for a cluster-membership preference. As illustrated in Figure 1, protein p1 has been clustered into clusters g2 and g7, and p2 into clusters g6 and g7; we partition the two proteins into different final clusters by taking their original relationships with other proteins into consideration. The same situation arises with the cluster nodes, whose similarity we calculate with the Jaccard similarity measurement. By adding the relationships among proteins and among clusters, we obtain the full graph of the cluster ensemble, $G^F$, with adjacency matrix $A^F$. We denote this model as FGENS.

$$A^F = \begin{bmatrix} P_s & E^T \\ E & G_s \end{bmatrix} \qquad (2)$$

where $P_s$ is the binary matrix describing protein interaction and $G_s$ is the binary matrix of cluster similarity. We calculate $P_s$ based on a joint measurement of co-expression correlation and GO-driven similarity:

$$P_s(p_i, p_j) = \begin{cases} 1 & \text{if } \mathrm{InSim}(p_i, p_j) > \beta, \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$

where InSim is the joint similarity measurement:

$$\mathrm{InSim}(p_i, p_j) = (1 - \alpha) \times \mathrm{CoExp}(p_i, p_j) \times \mathrm{Ppi}(p_i, p_j) + \alpha \times \mathrm{Sim}(p_i, p_j) \times \mathrm{Ppi}(p_i, p_j), \qquad (4)$$

where Ppi is the PPI network adjacency matrix and $\alpha$ is a real positive number between 0 and 1 that specifies how much the GO-driven similarity measurement contributes to the joint matrix. In our experiments, we choose $\alpha = 0.7$ according to the experimental optima. $G_s$ in Equation 2 is the binary matrix of Jaccard similarity of clusters. In our experiments, we set $\beta = 0.8$.

$$G_s(g_i, g_j) = \begin{cases} 1 & \text{if } \dfrac{|g_i \cap g_j|}{|g_i \cup g_j|} > \beta, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

Figure 2: Comparison with baseline methods (MCODE, CFinder, RRW, Flownet, INENS, CLENS, BGENS and FGENS) in terms of Precision, Recall and F-measure.

2.4 Graph Partition
With the graph constructed above, the cluster ensemble problem is reduced to a graph partition problem. We use the spectral clustering method, which is a well-studied and efficient graph partition algorithm. The goal of spectral clustering is to partition the graph into $K$ parts with the objective of minimizing the cut (the sum of the weights of the edges connecting different parts of the graph) [9]. Spectral clustering normalizes the similarity matrix, $A^F$ in our case, with the diagonal matrix $D(i, i) = \sum_j A^F(i, j)$, which is the degree matrix of $A^F$, and we get the normalized Laplacian matrix $L = D^{-1} A^F$. It then transforms the normalized matrix into a $K$-dimensional space whose coordinates are given by the eigenvectors of the $K$ largest eigenvalues, and the new matrix is then clustered via K-means clustering.
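The graph construction of Equations 1 to 5 and the spectral partition step can be sketched in Python as follows. This is a minimal illustration on made-up toy data, not the code used in the experiments: all function names are hypothetical, E is stored with one row per cluster and one column per protein, the diagonal of Gs follows Equation 5 literally, and K-means is a plain Lloyd iteration with a farthest-point initialization chosen only to keep the sketch deterministic.

```python
import numpy as np


def membership_matrix(partitions, n_proteins):
    """E[j, i] = 1 if protein i belongs to base cluster j (all partitions stacked)."""
    clusters = [set(c) for part in partitions for c in part]
    E = np.zeros((len(clusters), n_proteins))
    for j, members in enumerate(clusters):
        E[j, sorted(members)] = 1.0
    return E, clusters


def protein_block(coexp, sim, ppi, alpha=0.7, beta=0.8):
    """Equations 3 and 4: binarize the joint similarity InSim at threshold beta."""
    insim = (1 - alpha) * coexp * ppi + alpha * sim * ppi
    return (insim > beta).astype(float)


def cluster_block(clusters, beta=0.8):
    """Equation 5: binarize the pairwise Jaccard similarity of clusters at beta."""
    m = len(clusters)
    Gs = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            union = clusters[a] | clusters[b]
            if union and len(clusters[a] & clusters[b]) / len(union) > beta:
                Gs[a, b] = 1.0
    return Gs


def full_adjacency(Ps, E, Gs):
    """Equation 2; all-zero Ps and Gs give the bipartite matrix of Equation 1."""
    return np.block([[Ps, E.T], [E, Gs]])


def spectral_partition(A, k, n_iter=100):
    """Embed nodes with the top-k eigenvectors of D^-1 A, then run a plain K-means."""
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0  # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # D^-1 A has the same eigenvalues as the symmetric D^-1/2 A D^-1/2,
    # which eigh can decompose stably (A is assumed symmetric).
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    X = d_inv_sqrt[:, None] * vecs[:, -k:]  # eigenvectors of D^-1 A, k largest eigenvalues
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Toy example with made-up data: 6 proteins, two base partitions, K = 2.
partitions = [[{0, 1, 2}, {3, 4, 5}], [{0, 1, 2}, {3, 4}, {5}]]
ppi = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    ppi[i, j] = ppi[j, i] = 1.0
coexp = np.full((6, 6), 0.9)
sim = np.full((6, 6), 0.9)

E, clusters = membership_matrix(partitions, n_proteins=6)
A_F = full_adjacency(protein_block(coexp, sim, ppi), E, cluster_block(clusters))
labels = spectral_partition(A_F, k=2)
print(labels[:6])  # final cluster label of each protein node
```

The first N entries of the label vector correspond to the protein nodes and the remaining entries to the cluster nodes, following the block order of Equation 2. Calling full_adjacency with all-zero protein and cluster blocks gives the BGENS variant.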
3. EXPERIMENT AND RESULTS
3.1 Data sets
Our PPI data set is from Gavin [5]; it contains 2551 proteins annotated with their GO function terms and 21413 interactions, which were retrieved by mass spectrometry of tandem affinity purification (TAP) data. We downloaded the GO slim file from http://www.geneontology.org/. We used the January 14, 2012 version of the GO slim mapping file and chose the Biological Process (BP) hierarchy to calculate the GO-driven similarity of proteins. The gene expression data is from Brem [3], in which each gene is described by 131 expression values. We evaluate our results with the CYC2008 benchmark dataset [10], a comprehensive catalogue of 408 manually curated heteromeric protein complexes in S. cerevisiae reliably backed by small-scale experiments from the literature.
3.2 Comparison With Baseline Methods
We compared the performance of our methods with previous methods: Flownet [4], MCODE [2], CFinder [1], the spectral clustering method [9], RRW [8], and MCL. We implemented these baseline methods on the TAP data with their optimal parameter settings, and the comparison results are presented in Figure 2.

Table 1: Comparison of the base clustering partitions and the proposed method
          Precision  Recall    F-measure
CoexHie   0.528428   0.446328  0.48392
CoexKm    0.548495   0.463277  0.502297
GOHie     0.521739   0.440678  0.477795
GOKm      0.535117   0.451977  0.490046
CoexMCL   0.40856    0.29661   0.343699
GOMCL     0.561338   0.426554  0.484751
BGENS     0.603333   0.511299  0.553517
FGENS     0.636667   0.539548  0.584098
Based on the comparison, our method outperforms the other approaches with higher F-measure and Recall, which demonstrates that our method finds more matched modules in the PPI network than the others. In detail, we make the following observations: 1) FGENS achieves higher F-measure and Recall than MCODE, CFinder and RRW, which use only the PPI network topology, and almost reaches the Precision of CFinder. Furthermore, FGENS obtains better results in all three evaluation criteria than the Flownet method, which incorporates GO similarity information. These results demonstrate that the integration of multiple information sources greatly benefits functional module detection. 2) FGENS performs better than the other cluster ensemble methods, i.e., INENS and CLENS. This indicates that our graph based cluster ensemble method is more effective than other ensemble methods and yields more consensual and meaningful modules. 3) By adding the additional information about proteins and clusters to the bipartite graph, our full graph model captures more precise membership relationships than BGENS, as shown by its higher values on all three evaluation criteria.
3.3 Analysis of Ensemble Power
To evaluate the performance gain of the graph based cluster ensemble method, we compared the ensemble result with the results of the single clustering partitions in detail. The results for the single clustering partitions are denoted as GOKm and CoexKm (for K-means), GOHie and CoexHie (for hierarchical clustering), and GOMCL and CoexMCL (for MCL) in Table 1. Note that FGENS achieves higher Precision, Recall and F-measure than all the single clustering partitions, which indicates that the integration of multiple sources has great power to predict protein functional modules in PPI networks. We extracted all the matched modules from the clustering methods and counted the number of modules they detected and the number of proteins they covered. The results are shown in Table 2. FGENS retrieves a larger number of matched modules and covers more proteins than all the base clustering methods. The results of FGENS are also slightly better than those of BGENS, which omits the original protein interactions and the relationships between clusters.
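The F-measure column of Table 1 is consistent with the harmonic mean of Precision and Recall. The text does not state the definition explicitly, so the following Python check treats it as an assumption and only confirms that it reproduces the reported values to within rounding.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of the F-measure)."""
    return 2.0 * precision * recall / (precision + recall)


# (Precision, Recall, reported F-measure), copied from Table 1.
table1 = {
    "CoexHie": (0.528428, 0.446328, 0.48392),
    "CoexKm": (0.548495, 0.463277, 0.502297),
    "GOHie": (0.521739, 0.440678, 0.477795),
    "GOKm": (0.535117, 0.451977, 0.490046),
    "CoexMCL": (0.40856, 0.29661, 0.343699),
    "GOMCL": (0.561338, 0.426554, 0.484751),
    "BGENS": (0.603333, 0.511299, 0.553517),
    "FGENS": (0.636667, 0.539548, 0.584098),
}

# Largest gap between the recomputed and the reported F-measure over all rows.
max_gap = max(abs(f_measure(p, r) - f) for p, r, f in table1.values())
print(max_gap)
```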
Table 2: The matched modules from the base clustering methods and the proposed method
          Average size  Matched modules  Covered proteins
CoexHie   8.16          158              1289
CoexKm    8.34          164              1368
GOHie     4.42          156              689
GOKm      5.25          160              840
CoexMCL   5.92          105              622
GOMCL     4.46          151              674
BGENS     9.14          181              1654
FGENS     9.56          191              1826

4. CONCLUSIONS
This paper has presented a graph-based cluster ensemble method that integrates multiple data sources to identify protein functional modules from PPI networks. We applied our method to the TAP data and evaluated the results against the hand-curated complexes of the CYC2008 benchmark dataset. The experimental results show that our cluster ensemble framework is an efficient and effective method for integrating different biological information sources to identify more stable and balanced functional modules. The diversity of the biological information sources and of the clustering methods has strengthened our results rather than degraded them.
5. ACKNOWLEDGMENTS
This research is supported by the National Natural Science Foundation of China under Grant No. 30970780 and the Ph.D. Programs Foundation of the Ministry of Education of China under Grant No. 20091103110005.
6. REFERENCES
[1] Balazs Adamcsek, Gergely Palla, Illes J Farkas, Imre
Derenyi, and Tamas Vicsek. CFinder: Locating cliques and
overlapping modules in biological networks. Bioinformatics,
22(8):1021–1023, 2006.
[2] Gary D Bader and Christopher W V Hogue. An automated
method for finding molecular complexes in large protein
interaction networks. BMC Bioinformatics, 4(1):2, 2003.
[3] Rachel B Brem and Leonid Kruglyak. The landscape of
genetic complexity across 5,700 gene expression traits in
yeast. Proceedings of the National Academy of Sciences of
the United States of America, 102(5):1572–1577, 2005.
[4] Young-Rae Cho, Lei Shi, and Aidong Zhang. Flownet:
Flow-based approach for efficient analysis of complex
biological networks. 2009 Ninth IEEE International
Conference on Data Mining, pages 91–100, 2009.
[5] A C Gavin, P Aloy, P Grandi, R Krause, M Boesche,
M Marzioch, C Rau, L J Jensen, S Bastuck, B Dumpelfeld,
et al. Proteome survey reveals modularity of the yeast
cell machinery. Nature, 440(7084):631–636, 2006.
[6] Heather Hardway. Gene network models robust to spatial
scaling and noisy input. Mathematical Biosciences,
237(March):1–16, 2012.
[7] Dekang Lin. An information-theoretic definition of
similarity. In Proceedings of the 15th International Conference
on Machine Learning, pages 296–304, 1998.
[8] Kathy Macropol, Tolga Can, and Ambuj K Singh. RRW:
repeated random walks on genome-scale protein networks
for local cluster discovery. BMC Bioinformatics,
10:283, 2009.
[9] Ulrike von Luxburg. A tutorial on spectral clustering.
Statistics and Computing, 17(4):395–416, 2007.
[10] Shuye Pu, Jessica Wong, Brian Turner, Emerson Cho, and
Shoshana J Wodak. Up-to-date catalogues of yeast protein
complexes. Nucleic Acids Research, 37(3):825–831, 2009.
[11] Christian Von Mering, Roland Krause, Berend Snel,
Michael Cornell, Stephen G Oliver, Stanley Fields, and
Peer Bork. Comparative assessment of large-scale data sets
of protein-protein interactions. Nature, 417(6887):399–403,
2002.