March 11, 2010
Sparsity-Cognizant Overlapping Co-clustering
Hao Zhu
Dept. of ECE, Univ. of Minnesota
http://spincom.ece.umn.edu
Acknowledgements: G. Mateos, Profs. G. B. Giannakis,
N. D. Sidiropoulos, A. Banerjee, and G. Leus
NSF grants CCF 0830480 and CON 014658
ARL/CTA grant no. DAAD19-01-2-0011
Outline
• Motivation and context
• Problem statement and Plaid models
• Sparsity-cognizant overlapping co-clustering (SOC)
• Uniqueness
• Simulated tests
• Conclusions and future research
Context
Co-clustering (biclustering) = two-way clustering
• Clustering: partition the objects (samples, rows) based on a similarity criterion over their attributes (features, columns) [Tan-Steinbach-Kumar ’06]
• Co-clustering: simultaneous clustering of objects and attributes [Busygin et al ’08]
• Dense, approximately constant-valued submatrices
• NP-hard: commonly reduced to ordinary clustering, as in k-means
[Figure: data matrix with objects as rows and attributes as columns]
Application areas
• Social networks: cohesive subgroups of actors within a network [Wasserman et al ’94]
• Bioinformatics: interpretable biological structure in gene expression data [Lazzeroni et al ’02]
• Internet traffic: dominant host groups with strong interactions [Jin et al ’09]
Related Works and Our Focus
Matrix factorization based on SVD
• Bipartite spectral graph partitioning [Dhillon ’01]
• Orthogonal nonnegative matrix factorization (tNMF) [Ding et al ’06]
• Search for non-overlapping co-clusters using orthogonality
Overlapping co-clustering under a Bayesian framework [Fu-Banerjee ’09]
• Probabilistic model for co-cluster membership indicators and parameters
• EM algorithm for inference and parameter estimation
• E-step: Gibbs sampling for membership indicator detection
Plaid models [Lazzeroni et al ’02]
• Superposition of multiple overlapping layers (co-clusters)
• Greedy layer search: one layer at a time
Related Works and Our Focus (cont’d)
Borrow plaid model features
• Overlapping: some objects/attributes may relate to multiple co-clusters
• Partial co-clustering motivated by an “uninteresting background”
• Linear model incurs a low computational burden
Exploit sparsity in co-cluster membership vectors
• Sparse information hidden in large data sets
• Parsimonious models: more interpretable and informative
Simultaneous cross-layer optimization
• Compared with the greedy layer-by-layer strategy
Our focus: Sparsity-cognizant overlapping co-clustering (SOC) algorithm
Modeling
Matrix Y: n × p
• induced by two groups of interacting nodes {x_i, i = 1, ..., n} and {y_j, j = 1, ..., p}
• Y_ij measures the strength of the relationship between x_i and y_j
Ex-1: Internet traffic: traffic activity graph (TAG)¹
• Track the traffic flow between inside hosts and outside hosts
Ex-2: Gene expression microarray data²
• Measure the level with which gene i is expressed in sample j
¹,² The two pictures are taken from [Jin et al ’09] and [Lazzeroni et al ’02], respectively.
Submatrices
Hidden dense/uniform submatrices
• capture a subset of the row nodes {x_i} related to a subset of the column nodes {y_j} with similar feature values
• reveal certain informative behavioral patterns
Features
• distributed “sparsely” in Y compared to the data dimension np
• may overlap because of multi-functional nodes
Goal: efficient co-clustering algorithms to extract the underlying submatrices by exploiting sparsity and accounting for possible overlap
Plaid Models
Matrix Y: superposition of k submatrices (layers) over a background layer (l = 0):
Y_ij = θ_ij0 + Σ_{l=1}^k θ_ijl ρ_il κ_jl, with θ_ijl = μ_l + α_il + β_jl
• μ_l: level of layer l
• ρ_il (κ_jl) = 1 if x_i (y_j) is in the l-th layer, and 0 otherwise
Row/column-related effects α_il and β_jl
• μ_l is common to all the nodes in the layer
• α_il and β_jl express the node-related responses
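To make the model concrete, here is a minimal NumPy sketch that synthesizes Y from the plaid equation above (the sizes, distributions, and constant background are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 40, 2  # illustrative dimensions

# Binary membership indicators: rho (n x k) for rows, kappa (p x k) for columns
rho = rng.integers(0, 2, size=(n, k))
kappa = rng.integers(0, 2, size=(p, k))

# Layer levels and node-related effects
mu = rng.normal(size=k)          # mu_l: common level of layer l
alpha = rng.normal(size=(n, k))  # alpha_il: row-related response
beta = rng.normal(size=(p, k))   # beta_jl: column-related response

# Y_ij = theta_ij0 + sum_l (mu_l + alpha_il + beta_jl) * rho_il * kappa_jl
theta0 = 0.1  # constant background level (illustrative)
Y = np.full((n, p), theta0)
for l in range(k):
    theta_l = mu[l] + alpha[:, l][:, None] + beta[None, :, l]
    Y += theta_l * np.outer(rho[:, l], kappa[:, l])
```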
Problem Statement
Problem: Given the plaid model, seek the optimal membership indicators
• Data-fitting error penalized by the L1 norm of the indicators (criterion written out below)
• λ > 0 controls the amount of sparsity enforced
• Facilitates extraction of the more informative/interpretable submatrices out of Y
Finding the optimal solution is NP-hard
• Binary constraints on membership indicators [Turner et al ’05]
• Products of different variables
Efficient sub-optimal algorithm to identify the submatrices jointly
• Recall that the submatrices are detected one at a time in [Lazzeroni et al ’02]
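Written out, the criterion described above takes the following form (a reconstruction from the slide's prose, using the plaid-model notation and the background-free residue Z defined on the next slide):

```latex
\min_{\{\mu_l,\,\alpha_{il},\,\beta_{jl},\,\rho_{il},\,\kappa_{jl}\}}
\sum_{i=1}^{n}\sum_{j=1}^{p}
  \Big( Z_{ij} - \sum_{l=1}^{k} (\mu_l + \alpha_{il} + \beta_{jl})\,\rho_{il}\,\kappa_{jl} \Big)^{2}
+ \lambda \sum_{l=1}^{k} \Big( \sum_{i=1}^{n} \rho_{il} + \sum_{j=1}^{p} \kappa_{jl} \Big)
\quad \text{s.t.}\ \rho_{il},\,\kappa_{jl} \in \{0,1\}
```

Because the indicators are binary and non-negative, the L1 penalty equals the plain sum of the indicators, which is why it later reduces to a linear term.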
Sparsity-cognizant overlapping co-clustering (SOC)
Background-layer-free residue matrix Z: Z_ij = Y_ij − θ_ij0
Iterative cycling update of θ, ρ, and κ
• Per iteration s, θ(s) collects all the θ_ijl(s) values, and likewise for ρ(s) and κ(s)
• Update order: θ(s) ← {ρ(s−1), κ(s−1)}, then ρ(s) ← {θ(s), κ(s−1)}, then κ(s) ← {θ(s), ρ(s)}
Different from [Lazzeroni et al ’02]
• All k layers are updated jointly, so the algorithm is less prone to error propagation across layers
• Membership indicators are updated under the binary constraint (combinatorial complexity); see the schematic below
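The cycle can be summarized in a short schematic (a sketch only: the callables passed in are hypothetical stand-ins for the subproblem solvers detailed on the next two slides):

```python
def soc(Z, k, lam, S, init, update_theta, update_rho, update_kappa):
    """Cyclic SOC updates for s = 1, ..., S; the subproblem solvers
    are passed in as callables (sketched on the following slides)."""
    rho, kappa = init(Z, k)                       # rho(0), kappa(0), e.g. via k-means
    theta = None
    for s in range(S):
        theta = update_theta(Z, rho, kappa)       # theta(s) <- {rho(s-1), kappa(s-1)}
        rho = update_rho(Z, theta, kappa, lam)    # rho(s)   <- {theta(s), kappa(s-1)}
        kappa = update_kappa(Z, theta, rho, lam)  # kappa(s) <- {theta(s), rho(s)}
    return theta, rho, kappa
```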
Updating θ(s)
Given ρ(s−1) and κ(s−1)
• Unconstrained quadratic program with a closed-form solution
• Inversion of a large matrix leads to numerical instability
Coordinate descent algorithm alternating across all the layers: for l = 1, ..., k
• Define the residue matrix Z_l by removing all layers other than l from Z
• Reduce to the submatrix of Z_l obtained by extracting the rows with ρ_il(s−1) = 1 and the columns with κ_jl(s−1) = 1
• Update {μ_l, α_il, β_jl} for T cycles (T small), as sketched below
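A sketch of the per-layer update, assuming the closed-form least-squares fit of (μ_l, α_l, β_l) on the restricted residue matrix used in plaid-model fitting [Lazzeroni et al ’02] (function and variable names are hypothetical):

```python
import numpy as np

def update_theta(Z, rho, kappa, mu, alpha, beta, T=1):
    """Coordinate descent over the k layers, cycled T times."""
    k = rho.shape[1]
    for _ in range(T):
        for l in range(k):
            # Residue matrix Z_l: remove every layer except l
            Zl = Z.copy()
            for m in range(k):
                if m != l:
                    theta_m = mu[m] + alpha[:, m][:, None] + beta[None, :, m]
                    Zl -= theta_m * np.outer(rho[:, m], kappa[:, m])
            # Restrict to rows/columns currently assigned to layer l
            r, c = rho[:, l] == 1, kappa[:, l] == 1
            if not r.any() or not c.any():
                continue
            sub = Zl[np.ix_(r, c)]
            # Closed-form least-squares fit of mu_l, alpha_il, beta_jl
            mu[l] = sub.mean()
            alpha[r, l] = sub.mean(axis=1) - mu[l]
            beta[c, l] = sub.mean(axis=0) - mu[l]
    return mu, alpha, beta
```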
Updating ρ(s) and κ(s)
Given θ(s) and κ(s−1), determine ρ(s)
• Obtain the membership indicators {ρ_il, l = 1, ..., k} of the i-th row jointly
• Important for overlapping submatrices, to eliminate cross effects
• The L1-norm penalty reduces to a linear term due to non-negativity
• Quadratic minimization subject to {0,1} binary constraints: NP-hard
• Similar problems arise in MIMO/multiuser detection with a binary alphabet
(Near-)optimal sphere decoding algorithm (SDA)
• Incurs polynomial (cubic) complexity in general
The same techniques detect κ(s) given θ(s) and ρ(s); a brute-force stand-in is sketched below
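For small k, the per-row binary subproblem can even be solved by exhaustive search, a convenient stand-in for the sphere decoder when prototyping (a sketch under that assumption; names are hypothetical):

```python
import itertools
import numpy as np

def detect_row_indicators(z_i, A, lam):
    """Minimize ||z_i - A b||^2 + lam * sum(b) over b in {0,1}^k,
    where A[j, l] = theta_ijl(s) * kappa_jl(s-1) for the i-th row.
    Exhaustive search over all 2^k candidates; the SDA on this slide
    solves the same problem with better complexity for larger k."""
    k = A.shape[1]
    best_b, best_cost = None, np.inf
    for bits in itertools.product((0, 1), repeat=k):
        b = np.asarray(bits)
        cost = np.sum((z_i - A @ b) ** 2) + lam * b.sum()
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b
```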
Convergence and Implementation
SOC algorithm converges (at least) to a stationary point
• Data-fitting error cost: bounded below and non-increasing per iteration
Pruning steps [Lazzeroni et al ’02], [Turner et al ’05]
Initialization
• Background-level fitting to obtain matrix Z (recall the submatrix parameter fitting)
• Membership indicators ρ(0) and κ(0): k-means [Turner et al ’05]
Parameter choices
• Number of layers k: explain a certain percentage of the variation
• Sparsity regularization parameter λ: trial and error / bi-cross-validation [Witten et al ’09]
Uniqueness
Plain plaid models: decomposition of Z into a product of unknown matrices, Z = R D Kᵀ
• Binary-valued matrices R = [ρ_il] (n × k) and K = [κ_jl] (p × k)
• Diagonal matrix D = diag(μ_1, ..., μ_k)
Blind source separation (BSS) [Talwar et al ’96], [van der Veen et al ’96]
• Product of two matrices, finite alphabet (FA)/constant modulus (CM) constraints
• (Generally) uniquely identifiable with a sufficient number of samples
3-way arrays (Candecomp/Parafac) [Kruskal ’77], [Sidiropoulos et al ’00]
• Unique up to permutation and scaling under Kruskal’s condition
• The condition fails to hold for a two-way array such as Z
Uniqueness (cont’d)
Sparsity in blind identification
• Sparse component analysis: very sparse representation [Georgiev et al ’07]
• Non-negative source separation using local dominance [Chan et al ’08]
Proposition: Consider Z = R D Kᵀ, where the diagonal matrix D and the binary-valued matrices R, K are all of full rank. Suppose each column vector k_l of K is locally sparse ∀l, meaning there exists an (unknown) row index j_l such that κ_{j_l,l} = 1 while κ_{j_l,m} = 0 for all m ≠ l. Then, given Z, the matrices R, D, and K are unique up to permutations. (Restated in symbols below.)
Proof relies on convex analysis
• The affine hull of the column vectors of K coincides with that of Z
• Under local sparseness, the convex hull of K’s columns is the intersection of this affine hull and the positive orthant
• The columns of K are the extreme points of this convex hull
Results also hold when R is locally sparse (symmetry)
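In symbols, the proposition’s setting and local-sparsity condition can be restated as follows (a reconstruction of the slide’s prose in LaTeX):

```latex
Z = R\,D\,K^{\mathsf{T}}, \qquad
R = [\rho_{il}] \in \{0,1\}^{n \times k}, \qquad
K = [\kappa_{jl}] \in \{0,1\}^{p \times k}, \qquad
D = \operatorname{diag}(\mu_1, \dots, \mu_k),

\text{local sparsity: } \forall\, l \;\exists\, j_l \text{ such that }
\kappa_{j_l l} = 1 \ \text{and} \ \kappa_{j_l m} = 0 \;\; \forall\, m \neq l .
```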
Preliminary Simulation
Two uniform blocks + noise ~ Unif[0, 0.5]
SOC parameters: k = 2, S = 20, T = 1, and λ ∈ {0, 3}
=0
Original
Plaid
Permuted
SPinCOM University of Minnesota
=3
17
Real Data to Test
Internet traffic flow data
• Uncover different types of co-clusters: in-star, out-star, bi-mesh, ...
• Examples from email applications: department servers, Gmail
• Overlapping co-clusters may reveal server farms
Gene expression microarray data
• Co-clusters may exhibit some biological patterns
• Need to validate against gene enrichment values
Concluding Summary
• Plaid models to reveal overlapping co-clusters
– Exploit sparsity for parsimonious recovery
– Jointly decide among multiple layers
• SOC algorithm iteratively updates the unknown parameters
– Coordinate descent solver for the layer-level parameters
– Sphere decoder detects the membership indicators jointly
• Local sparseness leads to a unique decomposition
Future Directions
• Implementation issues with parameter choices
• Efficient initializations and membership-vector detection
• Comprehensive numerical experiments on real data
Thank You!