A Collective NMF Method for Detecting Protein Functional Module from Multiple Data Sources

Yuan Zhang∗, Nan Du+, Liang Ge+, Kebin Jia∗, Aidong Zhang+

∗ Beijing University of Technology, Beijing, 100124, China
zhangyuan@emails.bjut.edu.cn, kebinj@bjut.edu.cn

+ State University of New York at Buffalo, Buffalo, 14260, U.S.A.
nandu, liangge, azhang@buffalo.edu
ABSTRACT
Detecting functional modules from protein-protein interaction (PPI) networks is an active research area with many practical applications. However, a critical concern remains: high-throughput experiments introduce false interactions, and a single PPI network with severe information insufficiency yields unsatisfactory results. To address this problem, we propose a Collective Non-negative Matrix Factorization (CoNMF) based soft clustering method which efficiently integrates information from gene ontology (GO), gene expression data and PPI networks. In our method, the three data sources are formed into two graphs with similarity adjacency matrices, and these graphs are approximated by a matrix factorization with a common factor which provides a straightforward interpretation of the clustering results. Extensive experiments show that we can improve module detection performance by integrating multiple biological data sources, and that CoNMF yields superior results compared to other multi-source fusion methods, identifying a larger number of more precise protein modules with actual biological meaning and a certain degree of overlap.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining; J.3 [Life and Medical Sciences]: Biology and genetics
General Terms
Algorithm
1. INTRODUCTION
The techniques of identifying communities from networks,
such as online social networks [17], mobile phone networks
[18], scientific collaboration networks and biological networks
[10], are of great use in helping us understand and further
exploit these networks [27]. Previous studies have shown that proteins belonging to the same module are densely connected with each other but only sparsely connected to the other proteins in the network [22]. Protein functional modules can thus be understood as relatively independent sub-networks: proteins in the same module tend to interact more frequently and show stronger functional dependencies in the cellular system.
It is a common practice to adopt clustering methods to
detect protein functional modules from PPI networks [1,
4]. These clustering methods can be broadly characterized as distance-based or graph-based [15]. Distance-based
clustering methods use classic clustering techniques and focus on the definition of the distance between proteins. Graph-based clustering approaches consider the topology of PPI networks and partition the graph/network based on criteria such as maximizing the density of subgraphs or minimizing the cost of the cut while separating the graph.
However, we observe that these clustering methods have not
provided satisfactory results because of the complex, incomplete and noisy nature of PPI networks. It is well known
that a PPI network contains a lot of noise and errors: the
rate of false positive links is sometimes up to 50% [25]. Solving the problem goes beyond what a single data source can
provide and thus requires the integration of multiple information sources.
These days, multiple high-throughput techniques, such as microarray expression profiles and mass spectrometry experiments, provide additional evidence that helps us deal with the false information. Also, the Gene Ontology project, which aims at assigning functional annotations derived from small-scale experiments or published literature to genes and gene products, is now well developed [21]. The challenging task therefore shifts to integrating these data sources in a manner that leads to more reliable and valid functional modules. The other challenge we always face is the overlapping of functional modules in the PPI network. Since some proteins may perform different cellular functions, such multifunctional proteins are expected to interact with distinct sets of partners, either simultaneously or not, depending on the function performed. Although the overlapping nature of protein functional modules has already been recognized, most existing clustering methods cannot handle overlapping clusters.
To address these challenges, we propose a multiple-graph clustering method based on collective symmetric non-negative matrix factorization (NMF). NMF is a popular matrix decomposition method which factorizes an input non-negative matrix into two non-negative matrices of lower rank via a multiplicative update algorithm [6]. NMF has proved useful for dimension reduction of image, text, and signal data. It has also been applied in unsupervised settings in natural language processing, such as document clustering [14]. More recently, NMF was successfully utilized to find co-expressed genes in gene expression data, directly exploiting the dimensionality reduction nature of classic NMF by finding optimal approximate factorizations of the high-dimensional data [6]. In our problem, we adopt multiple biological data sources, including gene expression data, gene ontology and the PPI network, and formulate them into two similarity-based graphs representing the relationships between pairs of proteins. Our objective is to find a consistent partition across all the graphs, i.e., the common factor of the graphs. In this paper, we develop a graph clustering objective function based on symmetric NMF which simultaneously analyzes multiple similarity-based graphs. In our method, the clustering of multiple similarity-based graphs is reduced to a multiplicative update algorithm which achieves locally optimal solutions. Since the desired solution is sparse, containing a large number of zeros, a 1-norm penalty on the matrix factors is introduced to achieve a sparser solution. Moreover, by setting an experimental threshold on the optimized matrix factors, we obtain overlapping clusters of the graphs.
The rest of the paper is organized as follows. In Section
2, we briefly introduce the related work on multiple data
sources clustering. In Section 3, we introduce the construction of the similarity based graphs which integrate GO and
gene expression data with the PPI network, respectively. In Section 4, the collective NMF method is proposed. Extensive experimental studies are carried out in Section 5, which show the improvement achieved by our CoNMF method. Finally, further discussions and conclusions are presented.
2. RELATED WORK
For mining clusters from multiple data sources, there are mainly three approaches: weighted summation of the original data, summation of kernels, and clustering ensembles. Given multiple similarity adjacency matrices $A^{(m)} \in \mathbb{R}_+^{N \times N}$, $m = 1, 2, \ldots, M$, which are derived from the data sources, we summarize the three methods as follows:

Weighted Summation of Original Data. By a linear combination of all the similarity adjacency matrices with appropriate weights, the integrated similarity matrix $A = \sum_{m=1}^{M} A^{(m)}$ is constructed. With this new matrix $A$, classical clustering methods can be used to find the clusters, such as spectral clustering [19] or Markov clustering (MCL) [13].
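As a minimal sketch of this baseline (weights and inputs are illustrative; the text above uses an unweighted sum), one can combine the similarity matrices and hand the result to an off-the-shelf clustering method such as spectral clustering:

```python
# Weighted-summation baseline: combine similarity matrices, then cluster.
import numpy as np
from sklearn.cluster import SpectralClustering

A1 = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], dtype=float)
weights = [0.5, 0.5]                              # illustrative weights
A = weights[0] * A1 + weights[1] * A2             # integrated similarity matrix

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```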
Summation of Kernels. Given the original data sets, which are the multiple graphs in our problem, kernel-based clustering methods first map the data into some feature space F by means of a certain mapping Φ, and then group patterns in the feature space according to a similarity or dissimilarity criterion, where clusters are sets of similar patterns. One of the most relevant aspects of kernel-based clustering is that it is possible to compute the distances between nodes in the kernel space without explicitly knowing the mapping Φ; this is done by applying the so-called distance kernel trick. For the multiple graph partition problem, the common method is to sum the kernel of each graph as $K = \sum_{m=1}^{M} K^{(m)}$. One particular example for graph partition is the spectral kernel [23]:

$$K^{(m)} = \sum_{k=1}^{d} v_k^{(m)} (v_k^{(m)})^T, \quad (1)$$

where $v_k^{(m)}$ is the $k$th smallest eigenvector of the graph Laplacian $L^{(m)}$ and $d \ll N$ is the number of eigenvectors used per individual graph. Clustering can then be obtained by performing kernel K-means on $K$.
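As a concrete illustration of this baseline, the sketch below builds the spectral kernel of Equation 1 from each graph Laplacian, sums the kernels, and runs a plain kernel K-means on the result. Inputs and helper names (spectral_kernel, kernel_kmeans) are illustrative assumptions, not the implementation used in the cited work.

```python
import numpy as np


def spectral_kernel(adjacency: np.ndarray, d: int) -> np.ndarray:
    """Spectral kernel of Eq. (1): sum of outer products of the eigenvectors of
    the graph Laplacian associated with the d smallest eigenvalues."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    _, eigvecs = np.linalg.eigh(laplacian)        # columns sorted by ascending eigenvalue
    V = eigvecs[:, :d]
    return V @ V.T


def kernel_kmeans(K: np.ndarray, n_clusters: int, n_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Plain kernel K-means: reassign each node to the cluster whose implicit
    centroid is closest in the kernel-induced feature space."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue
            within = K[np.ix_(members, members)].sum() / (size * size)
            cross = K[:, members].sum(axis=1) / size
            dist[:, c] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Two toy graphs over the same four nodes:
A1 = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A2 = A1.copy()
K = spectral_kernel(A1, d=2) + spectral_kernel(A2, d=2)   # K = sum_m K^(m)
print(kernel_kmeans(K, n_clusters=2))
```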
Clustering Ensemble. Cluster ensembles have been studied by many researchers in the machine learning community. Strehl and Ghosh [24] proposed two approaches, i.e., instance-based and cluster-based approaches, to formulating graph partitioning problems for cluster ensembles. The instance-based approach, denoted INENS, models each instance as a vertex in a graph and computes the similarity between a pair of instances according to how frequently they are clustered into the same clusters. The cluster-based approach, denoted CLENS, takes each cluster from all the clustering partitions as a vertex and uses the percentage of instances two clusters share as the weight of the edge between them. Fern and Brodley [8] developed a Hybrid Bipartite Graph Formulation (HBGF) which constructs a bipartite graph from the instances and clusters and then applies a graph partition method to obtain the ensemble result; it is one of the most stable and effective approaches for combining cluster partitions.
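To make the instance-based idea concrete, the following small sketch (illustrative inputs, not the cited implementations) builds a co-association matrix that records how often two instances are placed in the same cluster across the base partitions; any standard clustering method can then be applied to this similarity matrix.

```python
# Instance-based ensemble idea: co-association matrix over base partitions.
import numpy as np


def co_association(partitions: list[np.ndarray]) -> np.ndarray:
    """partitions: list of label vectors, one per base clustering."""
    n = len(partitions[0])
    C = np.zeros((n, n))
    for labels in partitions:
        same = labels[:, None] == labels[None, :]   # co-clustered pairs
        C += same.astype(float)
    return C / len(partitions)


# Three toy base partitions of 5 instances:
parts = [np.array([0, 0, 1, 1, 2]),
         np.array([0, 0, 0, 1, 1]),
         np.array([1, 1, 2, 2, 2])]
S = co_association(parts)
print(np.round(S, 2))   # pairwise co-clustering frequencies in [0, 1]
```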
3. CONSTRUCTION OF RELATIONSHIP GRAPHS
In this section, we construct two relationship graphs by integrating co-expression correlation and GO functional similarity with the PPI network, respectively. By integrating multiple biological information sources, we are able to alleviate the effect of false information in the PPI network.
3.1 Co-expression Correlation
Gene expression data has been used to enhance the reliability of PPI networks and to detect co-expressed protein modules in many clustering algorithms [4]. We use the Pearson correlation coefficient (normalized to the range of 0 to 1), denoted CoExp, to calculate the co-expression correlation. The co-expression correlation graph $A^{(1)}$ is then constructed by combining it with the PPI network as follows:

$$A^{(1)}(p_i, p_j) = CoExp(p_i, p_j) \times Ppi(p_i, p_j). \quad (2)$$
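A minimal sketch of this construction, assuming an expression matrix expr (proteins × time points) and a binary PPI adjacency matrix ppi as illustrative inputs; mapping the Pearson correlation to [0, 1] via (r + 1)/2 is one plausible reading of the normalization mentioned above.

```python
# Co-expression graph A^(1) of Eq. (2): normalized correlation masked by PPI.
import numpy as np


def coexpression_graph(expr: np.ndarray, ppi: np.ndarray) -> np.ndarray:
    corr = np.corrcoef(expr)             # Pearson correlation in [-1, 1]
    coexp = (corr + 1.0) / 2.0           # assumed normalization to [0, 1]
    return coexp * ppi                   # A^(1)(p_i, p_j) = CoExp * Ppi


rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 131))         # 5 proteins, 131 expression values
ppi = (rng.random((5, 5)) > 0.5).astype(float)
ppi = np.triu(ppi, 1); ppi = ppi + ppi.T  # symmetric adjacency, zero diagonal
print(coexpression_graph(expr, ppi).shape)
```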
3.2 GO-driven Similarity
The GO-driven similarity is referred to as semantic similarity. It measures the similarity between two terms by quantifying the information the two terms share in common. One concept commonly used to quantify the information of terms is Information Content (IC), which gives a measure of how specific and informative a term is. Let C be a set of terms in the GO, and let p(c) be the probability of finding a child c ∈ C in the annotation structure. The IC of a term c can be quantified as the negative log likelihood, $-\log(p(c))$. If c is the root term of the taxonomy, $-\log(p(c))$ equals 0. One important model was proposed by Lin [16]; it can be seen as a normalized version of Resnik's model:

$$sim(c_1, c_2) = \frac{2 \times \max_{c \in S(c_1, c_2)}[\log(p(c))]}{\log(p(c_1)) + \log(p(c_2))}, \quad (3)$$

where $S(c_1, c_2)$ denotes the set of ancestor terms shared by $c_1$ and $c_2$.
The GO-driven similarity, Sim(pi , pj ), is calculated by aggregating maximum inter-set similarity values as follows:
Sim(pi , pj ) =
1
max simp∈Cj (ck , cp )
×
m×n
k∈Ci
+
max simk∈Ci (ck , cp ) . (4)
p∈Cj
W,H≥0
(5)
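The sketch below walks through this pipeline on a toy ontology: information content from assumed term probabilities, the information-content form of Lin's measure over shared ancestors (Equation 3), the best-match aggregation of Equation 4, and the masking by a PPI edge as in Equation 5. The data structures and names are assumptions for illustration, not the authors' code.

```python
import numpy as np


def lin_similarity(c1, c2, p, ancestors):
    """Eq. (3), in information-content form; p maps term -> probability,
    ancestors maps term -> set of ancestor terms (including itself)."""
    shared = ancestors[c1] & ancestors[c2]
    if not shared:
        return 0.0
    num = 2.0 * max(-np.log(p[c]) for c in shared)   # most informative common ancestor
    den = -np.log(p[c1]) - np.log(p[c2])
    return num / den if den > 0 else 0.0


def protein_go_similarity(Ci, Cj, p, ancestors):
    """Eq. (4): best-match term similarities aggregated in both directions."""
    m, n = len(Ci), len(Cj)
    if m == 0 or n == 0:
        return 0.0
    forward = sum(max(lin_similarity(ck, cp, p, ancestors) for cp in Cj) for ck in Ci)
    backward = sum(max(lin_similarity(ck, cp, p, ancestors) for ck in Ci) for cp in Cj)
    return (forward + backward) / (m * n)


# Toy ontology: term probabilities and ancestor sets (hypothetical values).
p = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.2}
ancestors = {"a": {"root", "a"}, "b": {"root", "a", "b"}, "c": {"root", "c"}}
sim = protein_go_similarity({"b"}, {"a", "c"}, p, ancestors)
ppi_edge = 1.0
print(sim * ppi_edge)    # Eq. (5): A^(2) entry for this protein pair
```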
4. CONSENSUS CLUSTERING BASED ON COLLECTIVE SYMMETRIC NMF

NMF was first introduced as a dimension reduction method for pattern analysis. It has attracted great attention from researchers in many fields because of its straightforward interpretability, i.e., each observation can be explained by an additive linear combination of nonnegative basis vectors. In many real-world pattern analysis problems, non-negativity constraints are prevalent; e.g., image pixel values, chemical compound concentrations and signal intensities are always nonnegative. Recently, NMF has received growing attention for its application to data clustering. Previous work has shown direct relationships between NMF and kernel K-means [7, 12]. NMF can model widely varying data distributions and accomplish both hard and soft clustering simultaneously [3].

In our problem, the data are formulated into similarity-based graphs as discussed in Section 3. In these graphs, each node corresponds to a protein and each edge carries the similarity or relationship between the corresponding pair of proteins. When a similarity matrix is constructed for a graph, the factorization of this similarity matrix generates a clustering assignment matrix that is nonnegative and captures the cluster structure inherent in the graph representation. Different from the traditional NMF discussed below, similarity-based graph clustering is formulated as an alternative symmetric NMF. In this section, in order to integrate multiple similarity relationships, we propose a new collective symmetric NMF method.

4.1 Symmetric NMF for Graph Clustering

NMF performs a lower-rank approximation by minimizing the distance between a non-negative matrix A and the product of its factor matrices. The typical NMF can be formulated as the following optimization problem [11]:

$$\min_{W,H \ge 0} \|A - WH^T\|_F^2, \quad (6)$$

where $A \in \mathbb{R}_+^{m \times n}$, $W \in \mathbb{R}_+^{m \times k}$, $H \in \mathbb{R}_+^{n \times k}$, $\mathbb{R}_+$ denotes the set of nonnegative real numbers, $\|\cdot\|_F$ denotes the Frobenius norm, and $k \ll \min\{m, n\}$. As shown in Figure 1(a), the interpretation of NMF in the clustering problem can be explained in this way: the columns of W provide the clustering indicators for the columns of A, and the columns of $H^T$ provide the clustering indicators for the rows of A.

Intuitively, the symmetric NMF defined in Equation 7 is more suitable for graph clustering based on a similarity matrix, which is illustrated in Figure 1(b). Since the similarity matrix of a graph is symmetric, the clustering indicators for the rows and columns are in a transpose relation:

$$\min_{H \ge 0} \|A - HH^T\|_F^2. \quad (7)$$

The nonnegativity constraint on H is crucial for its success, since the entry $h_{ij}$ of H represents the real-valued membership weight of protein i belonging to cluster j.
4.2 Collective Symmetric NMF
To simultaneously analyze multiple graphs and extract the consensus clusters, we construct a collective symmetric NMF model which aims at finding the common factor for all graphs. Suppose we are given M graphs whose adjacency matrices are $A^{(m)}$, $m = 1, 2, \ldots, M$, each of size N × N and corresponding to the same nodes, i.e., the same proteins. The modified formulation is given as:

$$F = \min_{H} \; \frac{1}{2} \sum_{m=1}^{M} \|A^{(m)} - HH^T\|_F^2 + \eta \|H\|_F^2 + \beta \|H\|_1, \quad \text{s.t.} \; 0 \le H \le 1, \quad (8)$$
where the H matrix is the cluster indicator matrix. The regularization terms on H are added to improve numerical stability and avoid overfitting. Moreover, to achieve sparsity of the solution on the cluster indicator matrix, we integrate 1-norm regularization, which has been successfully utilized in a variety of sparse optimization problems [12]. Each row of the cluster indicator matrix H can be seen as a vector representing the probability of a protein occurring in each cluster. Hence, we impose the constraint that the elements of H must fall between 0 and 1, and we are then able to obtain overlapping clusters by setting an experimental threshold on it.
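A small sketch of that thresholding step is given below; the fallback to the single best cluster for proteins whose weights all stay below the threshold is our assumption, not something stated above.

```python
# Reading overlapping clusters off the factor H by thresholding its rows.
import numpy as np


def overlapping_clusters(H: np.ndarray, lam: float = 0.72) -> list[set[int]]:
    n, k = H.shape
    clusters = [set() for _ in range(k)]
    for i in range(n):
        members = np.where(H[i] >= lam)[0]
        if members.size == 0:                   # assumed fallback: best cluster only
            members = [int(H[i].argmax())]
        for c in members:
            clusters[int(c)].add(i)
    return clusters


H = np.array([[0.9, 0.1], [0.8, 0.75], [0.05, 0.95]])  # protein 1 overlaps both
print(overlapping_clusters(H, lam=0.7))
```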
We minimize the cost function, i.e., Equation 8, by extending gradient descent and employing multiplicative update rules. Taking the partial derivative of the objective function F with respect to H yields the following:

$$\frac{\partial F}{\partial H} = \sum_{m=1}^{M} A^{(m)} H - 2 \sum_{m=1}^{M} HH^T H - \eta H - \beta/2. \quad (9)$$
[Figure 1: W and H in (a) are the clustering indicator matrices for non-negative matrix A; H in (b) is the clustering indicator matrix for symmetric non-negative matrix A.]

Given the objective function and its partial derivative, one can solve for H using a gradient descent approach. Here, we develop the following iterative, matrix-factorization-based multiplicative update rule for the unknown factor:

$$H \leftarrow H \circ \frac{\sum_{m=1}^{M} A^{(m)} H}{HH^T H + \eta H + \beta/2}, \quad (10)$$

where ◦ stands for element-wise matrix multiplication and the division is element-wise.
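A sketch of this collective update under the stated constraint 0 ≤ H ≤ 1 follows; the parameter values mirror those reported in Section 5 (η = 0.4, β = 0.04), while the initialization, iteration count, and explicit clipping step are assumptions.

```python
# Collective symmetric NMF via the multiplicative update of Eq. (10).
import numpy as np


def conmf(A_list: list[np.ndarray], k: int, eta: float = 0.4, beta: float = 0.04,
          n_iter: int = 500, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n = A_list[0].shape[0]
    H = rng.random((n, k))
    eps = 1e-12
    for _ in range(n_iter):
        numer = sum(A @ H for A in A_list)           # sum_m A^(m) H
        denom = H @ (H.T @ H) + eta * H + beta / 2.0 + eps
        H *= numer / denom                           # Eq. (10), element-wise
        H = np.clip(H, 0.0, 1.0)                     # enforce 0 <= H <= 1 (assumed step)
    return H


# Two toy similarity graphs over the same six proteins:
A1 = np.zeros((6, 6)); A1[:3, :3] = 1.0; A1[3:, 3:] = 1.0
A2 = A1 + 0.1 * np.eye(6)
H = conmf([A1, A2], k=2)
print(np.round(H, 2))
```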
5. EXPERIMENT AND RESULTS

5.1 Experiment Setup
Our PPI data set is from Gavin et al. [9]; it contains 2551 proteins annotated with their GO function terms and 21413 interactions, retrieved by mass spectrometry of tandem affinity purification (TAP) data. We downloaded the GO-slim file from http://www.geneontology.org/. Since we want to build a conservative similarity matrix based only on experimentally determined annotations, avoiding any bias due to electronic annotation systems, we included the annotations under the evidence codes IDA, IEP, IGI, IMP, IPI, RCA, and TAS, and excluded the codes IC, IEA, ISS, NAS, and ND, because the latter are either inferred from electronic annotations or assigned by automated methods without curatorial judgment [21]. We used the January 14, 2012 version of the GO-slim mapping file and chose the Biological Process (BP) hierarchy to calculate the GO-driven similarity of proteins. The gene expression data is from Brem and Kruglyak [2], in which each gene is described by 131 expression values associated with 131 time courses during certain cell processes.
To check whether the extracted modules correspond to real complexes, we compared our results with the CYC2008 benchmark dataset [20], a comprehensive catalogue of manually curated protein complexes in S. cerevisiae, reliably backed by small-scale experiments from the literature.
5.2 Comparison Criteria
Known complexes are available from the CYC2008 database, which serves as the gold standard. Obviously, the gold-standard complexes $G_c$ and the predicted clusters $P_c$ are expected to be matched as much as possible. Thus, the overlapping score $OL(P_c, G_c)$ is used to find the matched complexes:

$$OL(P_c, G_c) = \frac{|V_{P_c} \cap V_{G_c}|^2}{|V_{P_c}| \times |V_{G_c}|}, \quad (11)$$

where $|V_{P_c}|$ is the size of the predicted cluster, $|V_{G_c}|$ is the size of the known complex, and $|V_{P_c} \cap V_{G_c}|$ is the size of their intersection. $P_c$ and $G_c$ are considered to be matched if their OL score is larger than a threshold δ, which is typically chosen as 0.2 [5, 26].
Based on the number of matched clusters, the F-measure is used to evaluate the matching results by taking into account both precision and recall. Recall is defined as $Rec = TP/(TP + FN)$, where TP (true positive) is the number of predicted clusters matched with the known complexes by $OL \ge \delta$, and FN (false negative) is the number of known complexes that are not matched by any predicted cluster. Precision is defined as $Prec = TP/(TP + FP)$, where FP (false positive) is the total number of predicted clusters minus TP. The F-measure is:

$$F\text{-}measure = \frac{2 \times Prec \times Rec}{Prec + Rec}. \quad (12)$$
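The evaluation protocol of Equations 11 and 12 can be sketched as follows, using the definitions of TP, FN and FP given above; the container types are illustrative.

```python
# Overlap-score matching and F-measure (Eqs. 11-12) on toy cluster sets.
def overlap_score(pred: set, gold: set) -> float:
    inter = len(pred & gold)
    return (inter * inter) / (len(pred) * len(gold))   # Eq. (11)


def f_measure(predicted: list, complexes: list, delta: float = 0.2) -> float:
    tp = sum(any(overlap_score(p, g) >= delta for g in complexes) for p in predicted)
    fn = sum(not any(overlap_score(p, g) >= delta for p in predicted) for g in complexes)
    fp = len(predicted) - tp
    prec = tp / (tp + fp)                              # matched predicted clusters
    rec = tp / (tp + fn)                               # coverage of known complexes
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0   # Eq. (12)


predicted = [{1, 2, 3}, {4, 5}, {7, 8, 9}]
complexes = [{1, 2, 3, 4}, {7, 8}]
print(round(f_measure(predicted, complexes), 3))
```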
5.3 Comparison Result
We compared the performance of our method with the baseline methods introduced in Section 2. We implemented the MCL method, with its best parameter setting, on the weighted summation of the original data; this method is denoted as WeiSum. We also ran traditional NMF on the GO-similarity graph and on the co-expression-similarity graph, denoted as NMFonGO and NMFonCo respectively. For the cluster ensemble methods, K-means, hierarchical clustering and spectral clustering were implemented separately on the $A^{(i)}$ matrices to obtain the base clusters. In the clustering methods where an a priori cluster number k is needed, we set it to 300; we also set stopping rules for the hierarchical clustering method to obtain k clusters. In CoNMF, the parameters in Equation 8 were set to η = 0.4 and β = 0.04 according to extensive experiments. The comparison results are presented in Figure 2. Moreover, we extracted all the matched modules from these clustering methods and counted the number of proteins they covered and the modules they detected. The results are shown in Table 1, where Cover proteins denotes the number of unique proteins each method detected.

Table 1: The matched modules from the base clustering methods and the proposed method

                  WeiSum  KerSpe  INENS  CLENS  HBGF  NMFGO  NMFCo  CoNMF
Matched modules   201     97      104    147    181   176    195    205
Average size      5.97    7.19    8.16   8.34   9.14  7.85   10.94  17.52
Cover proteins    1199    699     849    1226   1654  1496   2178   2277
Overlap proteins  -       -       -      -      -     220    354    752
[Figure 2: Comparison with baseline methods (Precision, Recall and F-measure for WeiSum, KerSpe, INENS, CLENS, HBGF, NMFonGO, NMFonCo and CoNMF).]

From Figure 2, we can see that our method outperforms the other approaches by getting higher rates on all three evaluation criteria. Our method has not only found more matched modules from the PPI network but has also obtained more precise cluster results than the others.
The significant improvement is partly because the GO-driven similarity and the co-expression correlation measurement enhance the accuracy and completeness of the PPI network, as we can see from the comparison with NMFonGO and NMFonCo. Also, the new objective function in our method is a more effective way of extracting the consolidated cluster structures in both graphs than the other clustering methods that deal with multiple data sources. The reasons for the poor performance of the kernel summation method are that the structure patterns of those graphs do not necessarily satisfy a linear ensemble in the kernel space and that it is always challenging to find the proper kernel for specific data. The cluster ensemble methods find the consensus clusters based on the meta-clusters from different clustering methods and data sources, but none of them is able to capture the overlapping proteins in the graph, which definitely affects their performance. The straightforward method of summing the original datasets achieves a relatively high F-measure too, but its clusters tend to be smaller than ours according to Table 1, and it cannot deal with the overlapping problem either.
5.4 Parameter Sensitivity Analysis
We obtain overlapping clusters by setting a threshold λ on the factor matrix H. This parameter also affects the performance of CoNMF. We study the effect of the parameter choice and present the change of F-measure and overlapping rate in Figure 3. The F-measure generally increases at first as λ rises, reaching its highest value, F-measure = 0.617, around λ = 0.72, and decreases slightly and gradually after that, while the overlapping rate (the number of proteins belonging to more than one module divided by the total number of proteins detected by our method) decreases as λ increases. At the point λ = 0.72, the overlapping rate of the modules is 0.330.

[Figure 3: Parameter effect on F-measure and overlapping rate.]
This phenomenon is consistent with the fact of functional
overlapping of proteins. With a lower threshold, the clusters
tend to absorb some proteins that do not really belong to
them, and the precision of the method drops. As λ increases, proteins are assigned to clusters more strictly. Although the precision does not fall dramatically, the overlapping rate does.
6. CONCLUSION
We have presented a novel approach, CoNMF, which adopts multiple non-negative matrix factorization models to detect protein functional modules. This method makes better use of multiple biological information sources. We applied our method to the TAP data and evaluated the results against the hand-curated complexes of the CYC2008 benchmark dataset. Compared with other clustering methods that deal with multiple sources, our methodology is superior in finding more modules with actual biological meaning and in naturally tackling the overlap detection problem. In the present work, we integrate gene expression data and GO functional annotations with the PPI network, but the model can be extended to more similarity-based data sources and to other multiple graph clustering problems.
7. ACKNOWLEDGMENTS
This research work is supported by the National Natural Science Foundation of China under Grant No. 30970780 and the Ph.D. Programs Foundation of the Ministry of Education of China under Grant No. 20091103110005.
8. REFERENCES
[1] Gary D Bader and Christopher W. V. Hogue. An
automated method for finding molecular complexes in
large protein interaction networks. BMC
Bioinformatics, 4(1):2, 2003.
[2] Rachel B Brem and Leonid Kruglyak. The landscape
of genetic complexity across 5,700 gene expression
traits in yeast. Proceedings of the National Academy of
Sciences of the United States of America,
102(5):1572–1577, 2005.
[3] Yanhua Chen, Manjeet Rege, Ming Dong, and Jing
Hua. Non-negative matrix factorization for
semi-supervised data clustering. Knowledge and
Information Systems, 17(3):355–379, 2008.
[4] Young-rae Cho, Woochang Hwang, and Aidong
Zhang. Efficient modularization of weighted protein
interaction networks using k-hop graph reduction.
BioInformatics and BioEngineering 2006 BIBE 2006
Sixth IEEE Symposium on, pages 289–298, 2006.
[5] Young-Rae Cho, Lei Shi, and Aidong Zhang. Flownet:
Flow-based approach for efficient analysis of complex
biological networks. 2009 Ninth IEEE International
Conference on Data Mining, pages 91–100, 2009.
[6] Karthik Devarajan. Nonnegative matrix factorization:
An analytical and interpretive tool in computational
biology. PLoS Computational Biology, 4(7):12, 2008.
[7] Chris Ding, Xiaofeng He, and Horst D Simon. On the
equivalence of nonnegative matrix factorization and
spectral clustering. Proc SIAM Data Mining Conf,
44(4):606–610, 2005.
[8] Xiaoli Zhang Fern and Carla E Brodley. Solving
cluster ensemble problems by bipartite graph
partitioning. Twentyfirst international conference on
Machine learning ICML 04, pages 36–41, 2004.
[9] A C Gavin, P Aloy, P Grandi, R Krause, M Boesche,
M Marzioch, C Rau, L J Jensen, S Bastuck,
B Dumpelfeld, and et al. Proteome survey reveals
modularity of the yeast cell machinery. Nature,
440(7084):631–636, 2006.
[10] M Girvan and M E J Newman. Community structure
in social and biological networks. Proceedings of the
National Academy of Sciences of the United States of
America, 99(12):7821–7826, 2002.
[11] Hyunsoo Kim and Haesun Park. Nonnegative matrix
factorization based on alternating nonnegativity
constrained least squares and active set method.
SIAM Journal on Matrix Analysis and Applications,
30(2):713, 2008.
[12] Jingu Kim and Haesun Park. Sparse nonnegative
matrix factorization for clustering. Science, pages
1–15, 2006.
[13] Nevan J Krogan, Gerard Cagney, Haiyuan Yu,
Gouqing Zhong, Xinghua Guo, Alexandr Ignatchenko,
Joyce Li, Shuye Pu, Nira Datta, Aaron P Tikuisis,
and et al. Global landscape of protein complexes in
the yeast Saccharomyces cerevisiae. Nature,
440(7084):637–643, 2006.
[14] D D Lee and H S Seung. Learning the parts of objects
by non-negative matrix factorization. Nature,
401(6755):788–91, 1999.
[15] Chuan Lin, Young-rae Cho, Woo-chang Hwang,
Pengjun Pei, and Aidong Zhang. Clustering methods
in protein-protein interaction network. in Knowledge
Discovery in Bioinformatics Techniques Methods and
Application, 2006.
[16] Dekang Lin. An information-theoretic definition of
similarity. Quality, 1:296–304, 1998.
[17] M E J Newman. Finding community structure in
networks using the eigenvectors of matrices. Physical
Review E - Statistical, Nonlinear and Soft Matter
Physics, 74(3 Pt 2):036104, 2006.
[18] J-P Onnela, J Saramäki, J Hyvönen, G Szabó,
D Lazer, K Kaski, J Kertész, and A-L Barabási.
Structure and tie strengths in mobile communication
networks. Proceedings of the National Academy of
Sciences of the United States of America,
104(18):7332–7336, 2007.
[19] Ulrike von Luxburg. A tutorial on spectral clustering.
Statistics and Computing, 17(4):395–416, 2007.
[20] Shuye Pu, Jessica Wong, Brian Turner, Emerson Cho,
and Shoshana J Wodak. Up-to-date catalogues of
yeast protein complexes. Nucleic Acids Research,
37(3):825–831, 2009.
[21] Seung Yon Rhee, Valerie Wood, Kara Dolinski, and
Sorin Draghici. Use and misuse of the gene ontology
annotations. Nature Reviews Genetics, 9(7):509–515,
2008.
[22] B Schwikowski, P Uetz, and S Fields. A network of
protein-protein interactions in yeast. Nature
Biotechnology, 18(12):1257–1261, 2000.
[23] Alexander J Smola and Risi Kondor. Kernels and
regularization on graphs. Machine Learning,
2777:1–15, 2003.
[24] Alexander Strehl and Joydeep Ghosh. Cluster
ensembles: a knowledge reuse framework for combining
multiple partitions. Journal of Machine Learning
Research, 3(3):583–617, 2003.
[25] Christian Von Mering, Roland Krause, Berend Snel,
Michael Cornell, Stephen G Oliver, Stanley Fields,
and Peer Bork. Comparative assessment of large-scale
data sets of protein-protein interactions. Nature,
417(6887):399–403, 2002.
[26] Jianxin Wang, Min Li, Jianer Chen, and Yi Pan. A
fast hierarchical clustering algorithm for functional
modules discovery in protein interaction networks.
IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 8(3):607–620, 2011.
[27] R Wang, S Zhang, Y Wang, X Zhang, and L Chen.
Clustering complex networks and biological networks
by nonnegative matrix factorization with various
similarity measures. Neurocomputing, 72(1-3):134–141,
2008.