Sparsity-Cognizant Overlapping Co-clustering
March 11, 2010
Hao Zhu, Dept. of ECE, Univ. of Minnesota
http://spincom.ece.umn.edu
Acknowledgements: G. Mateos, Profs. G. B. Giannakis, N. D. Sidiropoulos, A. Banerjee, and G. Leus
NSF grants CCF 0830480 and CON 014658; ARL/CTA grant no. DAAD19-01-2-0011

Outline
- Motivation and context
- Problem statement and plaid models
- Sparsity-cognizant overlapping co-clustering (SOC)
- Uniqueness
- Simulated tests
- Conclusions and future research

Context
- Co-clustering (biclustering) = two-way clustering
  - Clustering: partition the objects (samples, rows) based on a similarity criterion over their attributes (features, columns) [Tan-Steinbach-Kumar '06]
  - Co-clustering: simultaneous clustering of objects and attributes [Busygin et al '08]
    - seeks dense, approximately constant-valued submatrices
    - NP-hard in general; often reduced to ordinary clustering, as in k-means
- Application areas
  - Social networks: cohesive subgroups of actors within a network [Wasserman et al '94]
  - Bioinformatics: interpretable biological structure in gene expression data [Lazzeroni et al '02]
  - Internet traffic: dominant host groups with strong interactions [Jin et al '09]

Related Work and Our Focus
- Matrix factorization based on the SVD
  - bipartite spectral graph partitioning [Dhillon '01]
  - orthogonal nonnegative matrix factorization (tNMF) [Ding et al '06]
  - both search for non-overlapping co-clusters via orthogonality
- Overlapping co-clustering in a Bayesian framework [Fu-Banerjee '09]
  - probabilistic model for the co-cluster membership indicators and parameters
  - EM algorithm for inference and parameter estimation; the E-step uses Gibbs sampling to detect the membership indicators
- Plaid models [Lazzeroni et al '02]
  - superposition of multiple overlapping layers (co-clusters)
  - greedy layer search: one layer at a time

Related Work and Our Focus (cont'd)
- Borrow the plaid-model features
  - overlapping: some objects/attributes may relate to multiple co-clusters
  - partial co-clustering, motivated by the "uninteresting background"
  - the linear model keeps the computational burden low
- Exploit sparsity in the co-cluster membership vectors
  - sparse information hidden in a large data set
  - parsimonious models are more interpretable and informative
- Optimize across layers simultaneously, rather than with the greedy layer-by-layer strategy
- Our focus: the sparsity-cognizant overlapping co-clustering (SOC) algorithm

Modeling
- Matrix Y (n × p) is induced by two groups of interacting nodes; $Y_{ij}$ measures the strength of the relationship between node $i$ and node $j$
- Ex-1, Internet traffic: the traffic activity graph (TAG) tracks the traffic flow between inside host $i$ and outside host $j$
- Ex-2, gene expression microarray data: $Y_{ij}$ measures the level with which gene $i$ is expressed in sample $j$
(The TAG and microarray illustrations on this slide are taken from [Jin et al '09] and [Lazzeroni et al '02], respectively.)
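To make the modeling concrete, here is a minimal NumPy sketch (illustrative only, not from the slides; all sizes, levels, and the overlap pattern are invented) that builds a toy Y with a Unif[0, 0.5] background, plants two overlapping dense blocks, and then hides them with random row/column permutations, as happens in observed data:

```python
# Toy generator for the kind of matrix Y modeled above: two groups of
# interacting nodes, hidden dense submatrices over a noisy background.
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 30
Y = rng.uniform(0.0, 0.5, size=(n, p))   # "uninteresting background"

# Two overlapping co-clusters: rows 0-14 x cols 0-11 and rows 10-24 x
# cols 8-19, so rows 10-14 and columns 8-11 belong to both layers.
Y[0:15, 0:12] += 2.0
Y[10:25, 8:20] += 3.0

# Random permutations hide the block structure, as in real data.
Y_perm = Y[rng.permutation(n)][:, rng.permutation(p)]
```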
Submatrices
- Hidden dense/uniform submatrices
  - capture a subset of rows related to a subset of columns with similar feature values
  - reveal informative behavioral patterns
- Their features
  - distributed "sparsely" in Y relative to the data dimension np
  - may overlap, because some nodes serve multiple functions
- Goal: efficient co-clustering algorithms that extract the underlying submatrices by exploiting sparsity and accounting for possible overlap

Plaid Models
- Matrix Y: superposition of k submatrices (layers) over a background,
  $Y_{ij} = \theta_{ij0} + \sum_{l=1}^{k} \theta_{ijl}\,\rho_{il}\,\kappa_{jl}$, where $\theta_{ijl} = \mu_l + \alpha_{il} + \beta_{jl}$
- $\theta_{ij0}$: level of the background layer ($l = 0$)
- Membership indicators: $\rho_{il} = 1$ if row $i$ is in the $l$-th layer and $0$ otherwise; likewise $\kappa_{jl}$ for column $j$
- Row/column-level effects: $\mu_l$ is common to all the nodes in the layer, while $\alpha_{il}$ and $\beta_{jl}$ express the node-related response

Problem Statement
- Problem: given the plaid model, seek the optimal membership indicators by minimizing the data-fitting error penalized by the L1 norm of the indicators,
  $\min \sum_{i,j} \Big( Z_{ij} - \sum_{l=1}^{k} \theta_{ijl}\,\rho_{il}\,\kappa_{jl} \Big)^{2} + \lambda \sum_{l=1}^{k} \Big( \sum_{i} \rho_{il} + \sum_{j} \kappa_{jl} \Big)$, with $\rho_{il}, \kappa_{jl} \in \{0,1\}$
- $\lambda > 0$ controls the sparsity enforced, and facilitates extraction of the more informative/interpretable submatrices out of Y
- Finding the optimal solution is NP-hard
  - binary constraints on the membership indicators [Turner et al '05]
  - products of different variables
- Sought: an efficient sub-optimal algorithm that identifies the submatrices jointly (recall that [Lazzeroni et al '02] detects them one at a time)

Sparsity-Cognizant Overlapping Co-clustering (SOC)
- Work with the background-layer-free residue matrix Z
- Iterative cycling updates of $\theta$, $\rho$, and $\kappa$ (a toy end-to-end sketch follows the implementation slide below)
  - per iteration $s$, $\theta^{(s)}$ collects all the $\theta_{ijl}^{(s)}$ values; likewise for $\rho^{(s)}$ and $\kappa^{(s)}$
  - update $\theta^{(s)}$ given $(\rho^{(s-1)}, \kappa^{(s-1)})$, then $\rho^{(s)}$ given $(\theta^{(s)}, \kappa^{(s-1)})$, then $\kappa^{(s)}$ given $(\theta^{(s)}, \rho^{(s)})$
- Different from [Lazzeroni et al '02]
  - all k layers are updated jointly, which is less prone to error propagation across layers
  - membership indicators are updated under the binary constraint (combinatorial complexity)

Updating $\theta^{(s)}$
- Given $\rho^{(s-1)}$ and $\kappa^{(s-1)}$: an unconstrained quadratic program with a closed-form solution, but inverting the resulting large matrix leads to numerical instability
- Instead, a coordinate descent algorithm alternates across all the layers: for l = 1, ..., k
  - define the residue matrix $[Z_l]_{ij} = Z_{ij} - \sum_{m \neq l} \theta_{ijm}\,\rho_{im}\,\kappa_{jm}$
  - reduce to the layer-l subproblem by extracting from $Z_l$ the rows with $\rho_{il}^{(s-1)} = 1$ and the columns with $\kappa_{jl}^{(s-1)} = 1$
  - update the layer-l parameters $(\mu_l, \alpha_{il}, \beta_{jl})$
- Run for T cycles (T small)

Updating $\rho^{(s)}$ and $\kappa^{(s)}$
- Given $\theta^{(s)}$ and $\kappa^{(s-1)}$, determine $\rho^{(s)}$
  - obtain the membership indicators for the i-th row jointly across all k layers
  - important for overlapping submatrices, to eliminate cross effects
- The L1-norm penalty reduces to a linear term thanks to the non-negativity of the binary indicators
- Quadratic minimization subject to {0,1} binary constraints: NP-hard
  - similar problems arise in MIMO/multiuser detection with a binary alphabet
  - the (near-)optimal sphere decoding algorithm (SDA) applies, incurring polynomial (cubic) complexity in general
- The same technique detects $\kappa^{(s)}$ given $\theta^{(s)}$ and $\rho^{(s)}$

Convergence and Implementation
- The SOC algorithm converges (at least) to a stationary point: the data-fitting cost is bounded below and non-increasing per iteration
- Pruning steps as in [Lazzeroni et al '02], [Turner et al '05]
- Initialization
  - background-level fitting to obtain the matrix Z (recall the submatrix parameter fitting)
  - membership indicators $\rho^{(0)}$ and $\kappa^{(0)}$: k-means [Turner et al '05]
- Parameter choices
  - number of layers k: chosen to explain a certain percentage of the variation
  - sparsity-regularization parameter $\lambda$: trial-and-error / bi-cross-validation [Witten et al '09]
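A minimal end-to-end sketch of the SOC iteration, under two loudly flagged simplifications: it uses the plain plaid model $\theta_{ijl} = \mu_l$ (the slides also fit the row/column effects $\alpha_{il}$, $\beta_{jl}$), and it swaps the sphere decoder for exhaustive enumeration over the $2^k$ binary patterns per row, which is only viable for small k. Function names are illustrative, not from the authors' code.

```python
# SOC-style iteration sketch under the plain plaid model theta_ijl = mu_l.
import itertools
import numpy as np

def update_levels(Z, R, K, mu, T=1):
    """Coordinate descent over the layer levels mu_l (T cycles)."""
    k = len(mu)
    for _ in range(T):
        for l in range(k):
            # Residue matrix with every layer except l peeled off.
            Zl = Z - sum(mu[m] * np.outer(R[:, m], K[:, m])
                         for m in range(k) if m != l)
            rows, cols = R[:, l] == 1, K[:, l] == 1
            if rows.any() and cols.any():
                mu[l] = Zl[np.ix_(rows, cols)].mean()
    return mu

def detect_indicators(Z, K, mu, lam):
    """Jointly detect each row's k membership bits.

    Exhaustive search over {0,1}^k stands in for the sphere decoder.
    """
    n, k = Z.shape[0], len(mu)
    A = mu[:, None] * K.T                      # k x p layer signatures
    patterns = np.array(list(itertools.product([0, 1], repeat=k)))
    R = np.zeros((n, k), dtype=int)
    for i in range(n):
        # The L1 penalty on binary bits is the linear term lam * sum(r).
        costs = [np.sum((Z[i] - r @ A) ** 2) + lam * r.sum()
                 for r in patterns]
        R[i] = patterns[int(np.argmin(costs))]
    return R

def soc(Z, k=2, lam=3.0, S=20, T=1, seed=0):
    """Cycle the level and indicator updates for S iterations."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    R = rng.integers(0, 2, (n, k))             # slides use k-means init
    K = rng.integers(0, 2, (p, k))
    mu = np.ones(k)
    for _ in range(S):
        mu = update_levels(Z, R, K, mu, T)     # theta given rho, kappa
        R = detect_indicators(Z, K, mu, lam)   # rho given theta, kappa
        K = detect_indicators(Z.T, R, mu, lam) # kappa given theta, rho
    return R, K, mu
```

On the toy Y built earlier (after subtracting a fitted background level to form Z), calling soc with S = 20, T = 1, and lam = 3.0 mirrors the settings of the preliminary simulation on the slides that follow.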
Uniqueness
- Plain plaid models (no row/column effects): decomposition into a product of unknown matrices, $Z = R\,D\,K^{T}$
  - binary-valued matrices $R = [\rho_{il}]$ (n × k) and $K = [\kappa_{jl}]$ (p × k)
  - diagonal matrix $D = \mathrm{diag}(\mu_1, \ldots, \mu_k)$
- Blind source separation (BSS) [Talwar et al '96], [van der Veen et al '96]
  - product of two matrices under a finite-alphabet (FA)/constant-modulus (CM) constraint
  - (generally) uniquely identifiable given a sufficient number of samples
- Three-way arrays (Candecomp/Parafac) [Kruskal '77], [Sidiropoulos et al '00]: unique up to permutation and scaling, but the guarantee fails to hold in the two-way case here (third dimension equal to 1)

Uniqueness (cont'd)
- Sparsity in blind identification
  - sparse component analysis: very sparse representations [Georgiev et al '07]
  - non-negative source separation using local dominance [Chan et al '08]
- Proposition: Consider $Z = R\,D\,K^{T}$, where the diagonal matrix D and the binary-valued matrices R, K are all of full rank, and each column vector $\mathbf{k}_l$ of K is locally sparse $\forall l$: there exists an (unknown) row index $j_l$ such that $\kappa_{j_l l} = 1$ while $\kappa_{j_l m} = 0$ for all $m \neq l$. Then, given Z, the matrices R, D, and K are unique up to permutations.
- The proof relies on convex analysis (a small numerical illustration closes the deck)
  - the affine hull of the column vectors of K coincides with that of Z
  - under local sparseness, its convex hull becomes the intersection of the affine hull and the positive orthant
  - the columns of K are the extreme points of this convex hull
- The result also holds when R is locally sparse (by symmetry)

Preliminary Simulation
- Two uniform blocks + noise ~ Unif[0, 0.5]
- SOC parameters: k = 2, S = 20, T = 1, and $\lambda = 0, 3$
- [Figure: the original matrix, its randomly permuted version, and the plaid layers recovered with $\lambda = 0$ and $\lambda = 3$.]

Real Data To Simulate
- Internet traffic flow data
  - uncover different types of co-clusters: in-star, out-star, bi-mesh, ...
  - examples from the e-mail application: department servers, Gmail
  - overlapping co-clusters may reveal server farms
- Gene expression microarray data
  - co-clusters may exhibit biological patterns
  - need to be checked against the gene enrichment value

Concluding Summary
- Plaid models reveal overlapping co-clusters
  - exploit sparsity for parsimonious recovery
  - decide jointly among multiple layers
- The SOC algorithm iteratively updates the unknown parameters
  - a coordinate descent solver handles the layer-level parameters
  - a sphere decoder detects the membership indicators jointly
- Local sparseness leads to a unique decomposition

Future Directions
- Implementation issues with parameter choices
- Efficient initializations and membership-vector detection
- Comprehensive numerical experiments on real data

Thank You!
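Closing appendix: a tiny numerical illustration (a construction assumed here, not taken from the slides) of why local dominance pins the plain-plaid decomposition down. If column $\mathbf{k}_l$ of K owns a pure row $j_l$, then column $j_l$ of $Z = R\,D\,K^{T}$ equals $\mu_l \mathbf{r}_l$; under the extra assumptions made below (noiseless Z, distinct positive $\mu_l$), such pure columns are recognizable as the ones taking a single nonzero value, so R and D can be read off and K recovered by least squares. This illustrates the proposition's conclusion, not its convex-analytic proof.

```python
# Local dominance on K exposes scaled copies of R's columns as "pure"
# columns of Z, recovering R, D, K up to permutation. Extra assumptions:
# noiseless Z and distinct positive mu_l, so a pure column is exactly one
# with a single distinct nonzero level.
import numpy as np

R = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]])   # n=5, k=2
K = np.array([[1, 0], [1, 1], [0, 1]])                   # rows 0, 2 pure
mu = [2.0, 3.0]
Z = R @ np.diag(mu) @ K.T

pure = {}                                 # level -> binary support pattern
for j in range(Z.shape[1]):
    levels = np.unique(Z[:, j][Z[:, j] > 0])
    if len(levels) == 1:                  # single nonzero level => pure
        pure[float(levels[0])] = (Z[:, j] > 0).astype(int)

mu_hat = sorted(pure)                     # recovered levels (any order)
R_hat = np.stack([pure[m] for m in mu_hat], axis=1)
K_hat = np.rint(np.linalg.pinv(R_hat @ np.diag(mu_hat)) @ Z).T.astype(int)
print(mu_hat)                             # [2.0, 3.0]
print(np.array_equal(R_hat, R), np.array_equal(K_hat, K))  # True True
```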