Flexible and Robust Co-Regularized
Multi-Domain Graph Clustering
Wei Cheng¹, Xiang Zhang², Zhishan Guo¹, Yubao Wu², Patrick F. Sullivan¹, Wei Wang³
¹University of North Carolina at Chapel Hill
²Case Western Reserve University
³University of California, Los Angeles
Speaker: Wei Cheng
The 19th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD’13)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Outline
• Introduction
• Motivation
• Co-regularized multi-domain graph clustering
  - Single-domain graph clustering
  - Cross-domain co-regularization
    - Residual sum of squares (RSS) loss
    - Clustering disagreement (CD) loss
• Re-evaluating cross-domain relationships
• Experimental Study
• Conclusion
Graph and Graph Clustering
• Graphs are ubiquitous
  - social networks
  - biological interaction networks
  - literature citation networks, etc.
• Graph clustering
  - Decompose a network into sub-networks based on some topological properties
  - Usually we look for dense sub-networks
E.g., detecting protein functional modules in a PPI network
(figure from Nataša Pržulj, Introduction to Bioinformatics, 2011)
E.g., community detection
Collaboration network between scientists
(figure from Santo Fortunato, Community detection in graphs)
Multi-view Graph Clustering
• Graphs are collected from multiple sources/domains
• Multi-view graph clustering can
  - Refine clustering
  - Resolve ambiguity
Motivation
• Multi-view clustering assumes
  - Exact one-to-one mapping
  - Complete mapping
  - Graphs of the same size
• More common cases
  - Many-to-many mappings
  - Partial mappings must be tolerated
  - Graphs of different sizes
  - Mappings are associated with weights (confidence)
Motivation
• Objective: design an algorithm that is both flexible and robust
  - Flexibility: suitable for the common cases of many-to-many, weighted, partial mappings
  - Robustness: noisy graphs have little influence on the others
Problem Formulation
• Each domain $D_\pi$ is described by an affinity matrix $A^{(\pi)}$ (e.g., $A^{(1)}$, $A^{(2)}$, $A^{(3)}$).
• $S^{(i,j)}_{a,b}$ denotes the weight between the $a$-th instance in $D_j$ and the $b$-th instance in $D_i$.
• Goal: partition each $A^{(\pi)}$ into $k_\pi$ clusters while considering the co-regularization constraints implicitly encoded in the cross-domain relationships in $S$.
Co-regularized multi-domain
graph clustering (CGC)
• Single-domain clustering
  - Symmetric non-negative matrix factorization (NMF).
  - Minimize:
    $L^{(\pi)} = \| A^{(\pi)} - H^{(\pi)} (H^{(\pi)})^T \|_F^2$  s.t. $H^{(\pi)} \ge 0$
  - Here, $H^{(\pi)} = [h^{(\pi)}_{1*}, \ldots, h^{(\pi)}_{a*}, \ldots, h^{(\pi)}_{n_\pi *}] \in \mathbb{R}^{n_\pi \times k_\pi}$, where each $h^{(\pi)}_{a*}$ represents the cluster assignment of the $a$-th instance in domain $D_\pi$
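The slides do not say how the symmetric NMF problem is solved. As a rough illustration only, the sketch below minimizes $\| A - H H^T \|_F^2$ with a damped multiplicative update, a common choice for symmetric NMF that is assumed here rather than taken from the talk. The names `symnmf`, `symnmf_loss`, `beta` and the toy matrix are hypothetical.

```python
import numpy as np

def symnmf_loss(A, H):
    """Squared Frobenius loss ||A - H H^T||_F^2."""
    R = A - H @ H.T
    return float(np.sum(R * R))

def symnmf(A, k, iters=200, beta=0.5, seed=0, eps=1e-12):
    """Symmetric NMF with a damped multiplicative update (an assumed solver)."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k)) + 0.1      # strictly positive start
    for _ in range(iters):
        numer = A @ H
        denom = H @ (H.T @ H) + eps            # eps avoids division by zero
        # The factor is at least 1 - beta > 0, so H stays non-negative.
        H = H * (1.0 - beta + beta * numer / denom)
    return H

# Toy affinity matrix with two obvious blocks of three nodes each.
A = np.kron(np.eye(2), np.ones((3, 3)))
H0 = np.random.default_rng(0).random((6, 2)) + 0.1   # same start as symnmf(seed=0)
H = symnmf(A, k=2)
labels = H.argmax(axis=1)   # hard assignment: largest entry in each row
```

Each row of `H` plays the role of $h_{a*}$; taking the arg-max per row is one simple way to turn the soft assignment into a cluster label.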
Co-regularized multi-domain
graph clustering (CGC)
• Cross-domain co-regularization
  - Residual sum of squares (RSS) loss (when the number of clusters is the same in all domains).
  - Clustering disagreement (CD) loss (when the number of clusters is the same or different across domains).
Co-regularized multi-domain
graph clustering (CGC)
• Residual sum of squares (RSS) loss
  - Directly compare the $H^{(\pi)}$ inferred in different domains.
  - To penalize the inconsistency of cross-domain cluster partitions for the $l$-th cluster in $D_i$, the loss for the $b$-th instance is
    $J^{(i,j)}_{b,l} = \left( E^{(i,j)}(x^{(j)}_b, l) - h^{(j)}_{b,l} \right)^2$
    where
    $E^{(i,j)}(x^{(j)}_b, l) = \frac{1}{|N^{(i,j)}(x^{(j)}_b)|} \sum_{a \in N^{(i,j)}(x^{(j)}_b)} S^{(i,j)}_{b,a} h^{(i)}_{a,l}$
    $N^{(i,j)}(x^{(j)}_b)$ denotes the set of indices of instances in $D_i$ that are mapped to $x^{(j)}_b$, and $|N^{(i,j)}(x^{(j)}_b)|$ is its cardinality.
  - The RSS loss is
    $J^{(i,j)}_{RSS} = \sum_{l=1}^{k} \sum_{b=1}^{n_j} J^{(i,j)}_{b,l} = \| S^{(i,j)} H^{(i)} - H^{(j)} \|_F^2$
    where, in the matrix form, the $1/|N^{(i,j)}(x^{(j)}_b)|$ factor is absorbed into the rows of $S^{(i,j)}$.
Co-regularized multi-domain
graph clustering (CGC)
• Clustering disagreement (CD) loss
  - Indirectly measure the clustering inconsistency of cross-domain cluster partitions.
  - Intuition: nodes A and B are mapped to node 2, and node C is mapped to node 4. If the similarity between the cluster assignments of 2 and 4 is small, then the similarity between the cluster assignments of A and C, and between those of B and C, should also be small.
  - The CD loss (with a linear kernel) is
    $J^{(i,j)}_{CD} = \| S^{(i,j)} H^{(i)} (S^{(i,j)} H^{(i)})^T - H^{(j)} (H^{(j)})^T \|_F^2$
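The following sketch evaluates the CD loss on a toy version of the A, B, C / 2, 4 example, with two clusters in one domain and three in the other. The assignment matrices and mapping weights are made up for illustration, and `cd_loss` is a hypothetical helper name.

```python
import numpy as np

def cd_loss(S, H_i, H_j):
    """||(S H_i)(S H_i)^T - H_j H_j^T||_F^2, i.e. the linear-kernel CD loss."""
    P = S @ H_i                      # assignments of D_i projected onto D_j
    D = P @ P.T - H_j @ H_j.T        # compare pairwise similarities, not H directly
    return float(np.sum(D * D))

# D_i has instances A, B, C with k_i = 2 clusters.
H_i = np.array([[1.0, 0.0],    # A
                [1.0, 0.0],    # B
                [0.0, 1.0]])   # C
# Rows are the D_j instances "2" and "4"; A and B map to 2, C maps to 4.
S = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

# D_j uses k_j = 3 clusters, so H_i and H_j cannot be compared entry by entry.
H_j_agree = np.array([[1.0, 0.0, 0.0],      # 2 and 4 in different clusters
                      [0.0, 0.0, 1.0]])
H_j_disagree = np.array([[1.0, 0.0, 0.0],   # 2 and 4 forced into one cluster
                         [1.0, 0.0, 0.0]])

loss_agree = cd_loss(S, H_i, H_j_agree)        # 0.0
loss_disagree = cd_loss(S, H_i, H_j_disagree)  # 2.0
```

Because both kernel matrices are $n_j \times n_j$, the loss is defined even when the two domains use different numbers of clusters.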
Co-regularized multi-domain
graph clustering (CGC)
• Objective function (joint matrix optimization):
  $\min_{H^{(\pi)} \ge 0 \ (1 \le \pi \le d)} O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J^{(i,j)}$
• It can be solved with an alternating scheme: optimize the objective with respect to one variable while fixing the others.
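As an illustration of the alternating idea, the sketch below handles two domains with the RSS loss and updates one factor at a time while the other is held fixed. The update rule (projected gradient with step halving) is an assumption made for this example, not the update used in the talk. The names `alternate`, `objective`, `gradient` and the weight `lam` (standing in for the cross-domain weight in the objective) are hypothetical.

```python
import numpy as np

def objective(A, H, S, lam):
    """O = sum_i ||A_i - H_i H_i^T||_F^2 + lam * ||S H_0 - H_1||_F^2."""
    single = sum(np.sum((A[i] - H[i] @ H[i].T) ** 2) for i in range(2))
    cross = np.sum((S @ H[0] - H[1]) ** 2)
    return float(single + lam * cross)

def gradient(A, H, S, lam, which):
    """Gradient of O with respect to H[which], the other factor held fixed."""
    Hw = H[which]
    g = -4.0 * (A[which] - Hw @ Hw.T) @ Hw        # symmetric NMF term
    R = S @ H[0] - H[1]                           # cross-domain residual
    g += 2.0 * lam * (S.T @ R) if which == 0 else -2.0 * lam * R
    return g

def alternate(A, S, k, lam=1.0, sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    H = [rng.random((A[i].shape[0], k)) for i in range(2)]
    current = objective(A, H, S, lam)
    history = [current]
    for _ in range(sweeps):
        for which in (0, 1):                      # update one factor, fix the other
            g = gradient(A, H, S, lam, which)
            step = 1.0
            while step > 1e-10:                   # halve the step until O decreases
                cand = list(H)
                cand[which] = np.maximum(H[which] - step * g, 0.0)  # keep H >= 0
                value = objective(A, cand, S, lam)
                if value < current:
                    H, current = cand, value
                    break
                step *= 0.5
        history.append(current)
    return H, history

# Toy data: 6 nodes in domain 0, 4 nodes in domain 1, two blocks in each.
A = [np.kron(np.eye(2), np.ones((3, 3))), np.kron(np.eye(2), np.ones((2, 2)))]
S = np.kron(np.eye(2), np.ones((2, 3)) / 3.0)     # 4 x 6 many-to-many mapping
H, history = alternate(A, S, k=2)
```

Since a step is accepted only when it lowers the objective, `history` is non-increasing by construction; this says nothing about reaching a global optimum.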
Re-Evaluating Cross-Domain
Relationship
• The cross-domain instance relationship
based on prior knowledge may contain
noise.
• It is crucial to allow users to evaluate
whether the provided relationships violate
any single-domain clustering structures.
Re-Evaluating Cross-Domain
Relationship
• We only need to slightly modify the co-regularization loss functions by multiplying by a confidence matrix $W^{(i,j)}$ (element-wise, written $\circ$):
  $J^{(i,j)}_W = \| (W^{(i,j)} \circ S^{(i,j)}) H^{(i)} - H^{(j)} \|_F^2$
• Optimize:
  $\min_{W \ge 0,\ H^{(\pi)} \ge 0 \ (1 \le \pi \le d)} O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J^{(i,j)}_W$
• Sort the values of $W^{(i,j)}$ and report the smallest elements to users.
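The reporting step can be sketched as follows, assuming the confidence matrix is applied element-wise to the mapping matrix. The confidence values are made up; in CGC they would come out of the optimization. `weighted_rss_loss` and `least_trusted` are hypothetical helper names.

```python
import numpy as np

def weighted_rss_loss(W, S, H_i, H_j):
    """||(W o S) H_i - H_j||_F^2, with o taken as the element-wise product."""
    R = (W * S) @ H_i - H_j
    return float(np.sum(R * R))

def least_trusted(W, S, top=3):
    """Return (row, column, confidence) for the supplied mappings with smallest W."""
    rows, cols = np.nonzero(S)                    # only mappings that were provided
    order = np.argsort(W[rows, cols], kind="stable")[:top]
    return [(int(rows[t]), int(cols[t]), float(W[rows[t], cols[t]])) for t in order]

# Hypothetical mapping matrix and confidence values.
S = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
W = np.array([[0.9, 0.2, 0.0],
              [0.0, 0.05, 0.8]])

suspects = least_trusted(W, S, top=2)
# suspects == [(1, 1, 0.05), (0, 1, 0.2)]
```

Restricting the ranking to non-zero entries of `S` keeps the report focused on relationships the user actually supplied.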
Experimental Study
• Data sets:
  - UCI (Iris, Wine, Ionosphere, WDBC)
    Construct two cross-domain relationships: Iris-Wine and Ionosphere-WDBC (positive/negative instances are mapped only to positive/negative instances in the other domain)
  - Newsgroup data (6 groups from 20 Newsgroups)
    comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware (3 comp)
    rec.motorcycles, rec.sport.baseball, rec.sport.hockey (3 rec)
  - Protein-protein interaction (PPI) networks (from BioGrid), gene co-expression networks (from Gene Expression Omnibus), genetic interaction network (from TEAM)
Experimental Study
• Effectiveness (UCI data set)
Experimental Study
• Robustness Evaluation (UCI)
Experimental Study
• Re-Evaluating Cross-Domain Relationship
(UCI)
Experimental Study
• Binary vs. Weighted Relationship
Experimental Study
• Binary vs. Weighted Relationship
Experimental Study
• Protein Module Detection by Integrating Multi-Domain Heterogeneous Data
  - 490032 genetic markers across 4890 samples (1952 disease and 2938 healthy). We use the 1 million top-ranked genetic marker pairs to construct the network, with the test statistics as the edge weights.
  - 5412 genes
Experimental Study
• Protein Module Detection:
  - Evaluation: standard Gene Set Enrichment Analysis (GSEA)
    - We identify the most significantly enriched Gene Ontology categories
    - Significance (p-value) is determined by Fisher's exact test
    - Raw p-values are further calibrated to correct for the multiple testing problem
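For readers unfamiliar with the enrichment test, here is a minimal sketch of a one-sided Fisher's exact test for a single cluster and a single Gene Ontology category, followed by a multiple-testing correction. The slides do not name the correction method, so Bonferroni is used purely as an assumed example; the function names and gene counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, overlap):
    """One-sided Fisher's exact test: P(X >= overlap) under the hypergeometric null.

    population   -- total number of genes considered
    annotated    -- genes in the population that carry the GO category
    cluster_size -- genes in the detected module
    overlap      -- genes in the module that carry the GO category
    """
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, x) * comb(population - annotated, cluster_size - x)
               for x in range(overlap, upper + 1))
    return tail / comb(population, cluster_size)

def bonferroni(p_values):
    """Bonferroni correction (an assumed choice of calibration)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical counts: 20 genes, 5 annotated, a module of 5 genes, all annotated.
p_raw = enrichment_p_value(population=20, annotated=5, cluster_size=5, overlap=5)
p_adj = bonferroni([p_raw, 0.5])
```

In practice one p-value is computed per (module, category) pair, and the correction is applied across all of them.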
Experimental Study
• Protein Module Detection:
Comparison of CGC and single-domain graph clustering (k = 100)
Experimental Study
• Protein Module Detection:
Conclusion
• In this paper…
  - We propose a flexible co-regularized method, CGC, to handle many-to-many, weighted, partial mappings in multi-domain graph clustering.
  - CGC uses cross-domain relationships as a co-regularizing penalty to guide the search for a consensus clustering structure.
  - CGC is robust even when the cross-domain relationships based on prior knowledge are noisy.
Thank You !
Questions?
Experimental Study
• Performance Evaluation