Model based variable clustering: minimax membership detection

Florentina Bunea
Department of Statistical Science
Cornell University

Workshop on the Theory of Big Data, UCL, January 2016

Variable clustering

Model based variable clustering
$X = (X_1, \ldots, X_a, \ldots, X_p)$ zero-mean random vector.
Cluster = a collection of probabilistically similar $X_a$'s = well defined target.
Observe data: $X_1, \ldots, X_n$ i.i.d. $X$.
Task: Estimate target.

Algorithmic variable clustering
Observe data: $X_1, \ldots, X_n$ i.i.d. $X$.
Task: Group $p$ vectors in $\mathbb{R}^n$.

An instance of variable clustering

Clusters found for two industry sectors, by method:

CORD (model based)
    Telecom: AT&T, Verizon
    Home Improvement: Home Depot, Lowe's
K-means (model free)
    Telecom: AT&T, Verizon, Pfizer, Merck, Lilly, Bristol-Myers
    Home Improvement: Home Depot, Lowe's, Starbucks
HC (model free)
    Telecom: AT&T, Verizon
    Home Improvement: Home Depot, Lowe's, Starbucks, Costco, Target, Wal-Mart, FedEx, United Parcel Service

The relationship between AT&T and all other companies is similar to the relationship between Verizon and all other companies.

Models for variable clustering: G-models

$G = \{G_k\}_{1 \le k \le K}$ partition of $\{1, \ldots, p\}$.
1. All variables $X_a$ in a group $G_k$ should have similar relationships to one another.
2. All variables in a group should act as one block in relation to all variables in another group.
Featured models: G-latent, G-exchangeable and G-block covariance.

G-latent models

$X$ follows a G-latent variable model if there exists $Z \in \mathbb{R}^K$ with covariance $C$ such that
\[ X_a = Z_k + E_a, \qquad \mathrm{Var}(E_a) = \gamma_k, \qquad \text{for } a \in G_k,\ 1 \le k \le K. \]
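To make the G-latent definition concrete, the following sketch computes the covariance it implies. It assumes (this is not stated on the slide) that the errors $E_a$ are uncorrelated with $Z$ and with each other, so that $\mathrm{Cov}(X) = ACA^T + \Gamma$ with $A$ the $p \times K$ membership matrix. The partition, the matrix `C`, and the values in `gamma` are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical example: p = 5 variables, K = 2 groups, G_1 = {0, 1}, G_2 = {2, 3, 4}.
groups = [[0, 1], [2, 3, 4]]
C = np.array([[0.6, 0.2],
              [0.2, 0.5]])      # covariance of the latent vector Z (illustrative values)
gamma = np.array([0.4, 0.5])    # error variances gamma_k, one per group (illustrative)

def membership_matrix(groups, p):
    """Return the p x K matrix A with A[a, k] = 1 if variable a belongs to group k."""
    A = np.zeros((p, len(groups)))
    for k, members in enumerate(groups):
        A[members, k] = 1.0
    return A

def g_latent_covariance(groups, C, gamma):
    """Covariance of X_a = Z_k + E_a, assuming errors uncorrelated with Z and each other."""
    p = sum(len(g) for g in groups)
    A = membership_matrix(groups, p)
    Gamma = np.diag(A @ gamma)  # Var(E_a) = gamma_k for a in G_k
    return A @ C @ A.T + Gamma

Sigma = g_latent_covariance(groups, C, gamma)
print(np.round(Sigma, 2))
```

Under these assumptions every off-diagonal entry of `Sigma` depends only on the groups of the two variables involved.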

G-exchangeable models

The distribution of $X$ is G-exchangeable if $X_\sigma \overset{\text{law}}{=} X$ for all $\sigma \in S_G$, the set of permutations $\sigma$ of $\{1, \ldots, p\}$ that only permute elements within each group of the partition, but not between groups.

G-block covariance models

$X \in \mathbb{R}^p$ with covariance matrix $\Sigma$ follows a G-block covariance model if
\[
\Sigma = \begin{pmatrix}
D_1 & C_{11} & C_{12} & C_{12} & C_{12} \\
C_{11} & D_1 & C_{12} & C_{12} & C_{12} \\
C_{12} & C_{12} & D_2 & C_{22} & C_{22} \\
C_{12} & C_{12} & C_{22} & D_2 & C_{22} \\
C_{12} & C_{12} & C_{22} & C_{22} & D_2
\end{pmatrix}
\]

Identifiability in G-models: well defined estimation targets

G-latent ⇒ G-exchangeable ⇒ G-block covariance structure.

Definition of identifiability
We say that the G-models above are identifiable if the sets
\[ \mathcal{B} = \{G \text{ partition of } \{1, \ldots, p\} : \Sigma \text{ has G-block covariance structure}\}, \]
\[ \mathcal{E} = \{G \text{ partition of } \{1, \ldots, p\} : X \text{ is G-exchangeable}\}, \]
\[ \mathcal{L} = \{G \text{ partition of } \{1, \ldots, p\} : X \text{ is G-latent}\}, \]
have, respectively, a unique minimum with respect to the partial order induced by inclusion.

Identifiable G-models
1. G-exchangeable: Yes, for all distributions, no assumptions. Min = $G_{\mathrm{exchg}}$.
2. G-block covariance: Yes, for all distributions, no assumptions. Min = $G_{\mathrm{block}}$.
3. G-latent models: Not in general, need further assumptions.

Identifiable clusters in Gaussian copula G-models

Goal: Define identifiable clusters that are invariant to unknown monotone transformations of the data.
A solution: Assume that $X \in \mathbb{R}^p$ has a zero-mean Gaussian copula distribution with copula correlation matrix $R$ and marginals $F_1, \ldots, F_p$. Then:
\[ Y := h(X) \sim N(0, R); \qquad h_a = \Phi^{-1} \circ F_a. \]
The target: The clusters of $Y$ as defined by one of the G-models.

• $Y = h(X) \sim N(0, R)$, $h$ unknown, $R$ correlation matrix.
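As a small illustration of the transformation $h_a = \Phi^{-1} \circ F_a$, the sketch below uses a marginal that is known in closed form. The choice $X_a = Y_a^3$ with $Y_a \sim N(0, 1)$ is an assumption made only for this example (it keeps $X_a$ zero-mean and gives $F_a(x) = \Phi(x^{1/3})$). In the setting of the talk $h$ is unknown and is never computed.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf is Phi, inv_cdf is Phi^{-1}

def cube_root(x):
    """Real cube root, valid for negative inputs as well."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def marginal_cdf(x):
    """F_a for the illustrative marginal X_a = Y_a**3 with Y_a ~ N(0, 1)."""
    return STD_NORMAL.cdf(cube_root(x))

def h(x):
    """h_a = Phi^{-1} o F_a, mapping X_a back to the Gaussian scale."""
    return STD_NORMAL.inv_cdf(marginal_cdf(x))

# Latent Gaussian values, their monotone transforms, and the recovered values.
y_values = [-1.5, -0.3, 0.0, 0.8, 2.0]
x_values = [y ** 3 for y in y_values]
recovered = [h(x) for x in x_values]
print(recovered)
```

Because $h_a$ is monotone, any rank-based statistic computed from $X$ agrees with the same statistic computed from $Y$, which is why the clusters of $Y$ can be targeted without knowing $h$.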

Equality of target partitions in three Gaussian G-models

Let $Y$ be a zero-mean Gaussian vector with covariance matrix $R$.
1. $G_{\mathrm{block}} = G_{\mathrm{exchg}}$.
2. If $C \succeq 0$, then $G_{\mathrm{block}}$ is also the unique minimal latent partition of $Y$.
\[
R = \begin{pmatrix}
1 & W_{11} & B_{12} & B_{12} & B_{12} \\
W_{11} & 1 & B_{12} & B_{12} & B_{12} \\
B_{12} & B_{12} & 1 & W_{22} & W_{22} \\
B_{12} & B_{12} & W_{22} & 1 & W_{22} \\
B_{12} & B_{12} & W_{22} & W_{22} & 1
\end{pmatrix},
\qquad
C = \begin{pmatrix} W_{11} & B_{12} \\ B_{12} & W_{22} \end{pmatrix}
\]

Clusters = Blocks

Variables in the same group have zero CORD.
If $a \in G_k$, $b \in G_j$:
\[ \mathrm{CORD}(a, b) := \max_{c \ne a, b} |R_{ac} - R_{bc}| = \max_{1 \le l \le K} |C_{lj} - C_{lk}|. \]
If $Y_a$ and $Y_b$ are in different groups ($j \ne k$): $\mathrm{CORD}(a, b) > 0$.
\[
R = \begin{pmatrix}
1 & R_{11} & R_{12} & R_{12} & R_{13} & R_{13} \\
R_{11} & 1 & R_{12} & R_{12} & R_{13} & R_{13} \\
R_{12} & R_{12} & 1 & R_{22} & R_{23} & R_{23} \\
R_{12} & R_{12} & R_{22} & 1 & R_{23} & R_{23} \\
R_{13} & R_{13} & R_{23} & R_{23} & 1 & R_{33} \\
R_{13} & R_{13} & R_{23} & R_{23} & R_{33} & 1
\end{pmatrix},
\qquad
C := \begin{pmatrix} R_{11} & R_{12} & R_{13} \\ R_{12} & R_{22} & R_{23} \\ R_{13} & R_{23} & R_{33} \end{pmatrix}
=: \begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{12} & C_{22} & C_{23} \\ C_{13} & C_{23} & C_{33} \end{pmatrix}
\]

Minimally separated blocks: a minimax lower bound

"Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth." Arthur Conan Doyle

No algorithm can recover blocks in $\mathcal{R}(\eta)$ connected by too short a CORD.
If
\[ 0 \le \eta < \eta^* \sim 0.92 \sqrt{\frac{\log(p)}{n}}, \]
we have
\[ \inf_{\widehat{G}} \sup_{R \in \mathcal{R}(\eta)} P_R(\widehat{G} \ne G_{\mathrm{block}}) \ge \frac{1}{7}, \tag{1} \]
where the infimum is taken over all possible estimators.
\[ \mathcal{R}(\eta) := \{R : \mathrm{CORD}(a, b) > \eta \text{ for all } a \overset{G_{\mathrm{block}}}{\nsim} b\} = \{C : \min_{1 \le j \ne k \le K} \max_{1 \le l \le K} |C_{lj} - C_{lk}| > \eta\}. \]

Variable clustering in Gaussian Copula Models

Data: $X_1, \ldots, X_n$ i.i.d. $X$. Assume $Y := h(X) \sim N(0, R)$, $h$ unknown.
Model based clusters: The unknown block structure of $R$.
Unstructured estimator of $R$, without estimating $h$:
\[ \widehat{T}_{ab} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \mathrm{sgn}\big((X_{i,a} - X_{j,a})(X_{i,b} - X_{j,b})\big), \qquad \widehat{R}_{ab} = \sin\Big(\frac{\pi}{2} \widehat{T}_{ab}\Big). \tag{2} \]

CORD for variable clustering

\[ \widehat{\mathrm{CORD}}(a, b) = \max_{c \ne a, b} |\widehat{R}_{ac} - \widehat{R}_{bc}| \]
Algorithm 1 (Runs in polynomial time)
• Parameter: $\alpha > 0$
• Initialization: $S = \{1, \ldots, p\}$ and $l = 0$
• Repeat: while $S \ne \emptyset$
    - $l \leftarrow l + 1$
    - Choose an index $a \in S$.
    - $\widehat{G}_l = \{b \in S : \widehat{\mathrm{CORD}}(a, b) < \alpha\}$
    - $S \leftarrow S \setminus \widehat{G}_l$
• Return: the partition $\widehat{G} = (\widehat{G}_l)_{l = 1, \ldots, k}$
• If $X$ is Gaussian, can use Pearson's sample correlation instead of Kendall's $\tau$.
• A refined version is on CRAN.

Exact partition recovery via CORD

♥ Exact partition recovery at separation level $\eta^{♥} = O\big(\sqrt{\log p / n}\big)$.
If $X$ has a Gaussian copula distribution with copula correlation matrix $R \in \mathcal{R}(\eta)$, with $\eta \ge \eta^{♥}$, then the CORD algorithm, applied at a threshold level $\alpha$ of order $\sqrt{\log p / n}$, guarantees that $\widehat{G} = G_{\mathrm{block}}$, w.h.p.
\[ \mathcal{R}(\eta) := \{R : \mathrm{CORD}(a, b) > \eta \text{ for all } a \overset{G_{\mathrm{block}}}{\nsim} b\}. \]
If $X$ has a Gaussian copula distribution, then $\|\widehat{R} - R\|_\infty = O_P\big(\sqrt{\log p / n}\big)$.

Block-correlation matrices with extra information

• $R = ACA^t + \Gamma$, $A$ membership matrix, $\Gamma$ is a diagonal matrix.
• Equally sized groups: $|G_k| = m = p/K$, for all $k$.
• $J_K$ matrix of 1's.
\[ \mathcal{R}_m(\eta) = \{R : C = \eta I_K + \alpha J_K; \text{ and } |G_k| = m = p/K, \text{ for all } k\}. \]
• $\eta$ is the difference between within-group and between-group correlations.

A minimax lower bound over $\mathcal{R}_m(\eta)$

No algorithm can recover blocks in $\mathcal{R}_m(\eta)$ if $\eta$ is too small.
If
\[ 0 \le \eta < \eta^* \sim c_1 \left( \sqrt{\frac{\log(p)}{nm}} \vee \frac{1}{n} \right), \tag{3} \]
we have, for $0 < c_2 < 1$,
\[ \inf_{\widehat{G}} \sup_{R \in \mathcal{R}_m(\eta)} P_R(\widehat{G} \ne G_{\mathrm{block}}) \ge c_2 > 0, \]
where the infimum is taken over all possible estimators.
In progress: lower bound over general $C$ classes, with possibly unbalanced groups, $m = \min_k |G_k|$.

Generics on spectral clustering

General principle
• $ACA' = U \Lambda U'$, a rank $K$ eigenvector decomposition.
• $a \overset{G_{\mathrm{block}}}{\sim} b \iff U_{a:} = U_{b:}$, where $U_{a:} \in \mathbb{R}^K$ is the $a$-th row of $U$.
• Use (for instance) approximate K-means algorithms to find blocks.
• Good performance for good estimators of $U$.
• Widely used in network analysis for fitting (approximate) SBM (stochastic block models).

(A version of) one-step spectral clustering

• Use approximate K-means on the rows of $\widehat{U}$, the first $K$ eigenvectors of $\widetilde{R} := \widehat{R} - \mathrm{diag}(\widehat{R})$.
• Expected to work: $R_0 := R - \mathrm{diag}(R)$ is approximately rank $K$ when $\lambda_K(C) \ge 4K \|C\|_\infty / p$.
• Need to control the error $\widetilde{W} := \widetilde{R} - R_0$. Let $r_e(R) := \mathrm{trace}(R) / \|R\|_{op}$. Then
\[ \|\widetilde{W}\|_{op} \lesssim \|R\|_{op} \left( \sqrt{\frac{r_e(R)}{n}} \vee \frac{r_e(R)}{n} \right). \]
B. and Xiao (2015); Koltchinskii and Lounici (2015).

Limitations of (a version of) one-step spectral clustering

• $L$ = overall proportion of mis-assigned variables.
• Consistent clustering: $L < 1/n$.
For the illustrative case: $C = \eta I_K$, $n = p$.
Current state of affairs for one-step spectral clustering:
If $K/(n\rho) \lesssim \eta \le 1$, then $L \lesssim \rho$, w.h.p., for $\rho < 1/K$.
Consistent clustering does not follow from current results. Problem: proof techniques.
Networks: one step (Lei and Rinaldo, 2015); two-step refinements (Gao et al. 2015), with provably better theoretical performance.
Note: one-step spectral clustering may have excellent practical performance.
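The general principle of spectral clustering (equal rows of $U$ within a block) can be checked at the population level with a short sketch. It rests on several assumptions: it uses the true $R$ with $C = \eta I_K + \alpha J_K$ rather than an estimate, it uses illustrative values for $\eta$, $\alpha$ and the partition, and it replaces approximate K-means with exact row matching up to a tolerance. It illustrates the idea only and is not the estimator analysed in the talk; the function names are hypothetical.

```python
import numpy as np

def block_correlation(groups, C):
    """Population R: equal to A C A^T off the diagonal, with unit diagonal."""
    p = sum(len(g) for g in groups)
    A = np.zeros((p, C.shape[0]))
    for k, members in enumerate(groups):
        A[members, k] = 1.0
    R = A @ C @ A.T
    np.fill_diagonal(R, 1.0)
    return R

def one_step_spectral(R, K, tol=1e-8):
    """Label variables by matching rows of the top-K eigenvectors of R - diag(R)."""
    R0 = R - np.diag(np.diag(R))
    _, eigvecs = np.linalg.eigh(R0)        # eigenvalues come back in ascending order
    U = eigvecs[:, -K:]                    # eigenvectors of the K largest eigenvalues
    labels = -np.ones(R.shape[0], dtype=int)
    next_label = 0
    for a in range(R.shape[0]):
        if labels[a] >= 0:
            continue
        same_row = np.linalg.norm(U - U[a], axis=1) < tol
        labels[same_row & (labels < 0)] = next_label
        next_label += 1
    return labels

# Illustrative balanced example: K = 3 groups of size m = 3, interleaved indices.
K, eta, alpha = 3, 0.3, 0.25
C = eta * np.eye(K) + alpha * np.ones((K, K))
groups = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
R = block_correlation(groups, C)
labels = one_step_spectral(R, K)
print(labels)
```

With an estimated $\widehat{R}$ the rows of $\widehat{U}$ are only approximately equal within a block, which is where a K-means step and the operator-norm control of $\widetilde{W}$ come in.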

A convex optimization approach to estimation of correlation matrices with balanced blocks

Target: the block matrix of 1's, $B^* = AA^T = \big(\mathbf{1}\{a \overset{G_{\mathrm{block}}}{\sim} b\}\big)_{1 \le a, b \le p}$.
Input: An estimator $\widehat{R}$ of $R$, and an estimator $\widehat{K}$ of $K$.
Estimate $B^*$ by:
\[ \widehat{B} \in \operatorname*{argmax}_{B \in \widehat{\mathcal{C}}} \langle \widehat{R}, B \rangle, \]
\[
\widehat{\mathcal{C}} = \left\{ B :
\begin{array}{l}
B \in S^+ = \{\text{symmetric and positive semidefinite}\} \\
0 \le B_{ab} \le 1 \\
\forall b,\ \sum_a B_{ab} = \widehat{m} := p / \widehat{K} \\
\sum_{a,b} B_{ab} = \widehat{K} \widehat{m}^2 \\
\mathrm{diag}(B) = I
\end{array}
\right\}
\]

• $\widehat{B}$ is a restricted least squares estimator.
• Networks: Chen and Xu (2015), Amini and Levina (2014). Analysis not transferable to covariance/correlation matrices.
• $\widehat{K} = \mathrm{rank}(\widetilde{L})$, where
\[ (\widetilde{L}, \widetilde{\Gamma}) = \operatorname*{argmin}_{L, \Gamma} \left\{ \frac{1}{2} \|L + \Gamma - \widehat{R}\|_F^2 + \lambda \|L\|_* \right\} \quad \text{s.t. } \Gamma \text{ is diagonal}. \]
• $\widehat{K} = K$, w.h.p. (Wegkamp and Zhao, 2015)

Near minimax optimal membership estimation

♣ Exact partition recovery at level
\[ \eta^{♣} = O\left( \frac{\log(p) \vee K}{n} \vee \sqrt{\frac{K \vee \log(p)}{nm}} \right). \]
If $R \in \mathcal{R}_m(\eta)$, with $\eta \ge \eta^{♣}$, then $\widehat{B} = B^*$, w.h.p.
\[ \mathcal{R}_m(\eta) = \{R : C = \eta I_K + \alpha J_K; \text{ and } |G_k| = m = p/K, \text{ for all } k\}. \]
• Upper bound holds for more general classes of block-matrices.
• Minimax lower bound: $\eta^* = c_1 \left( \sqrt{\frac{\log(p)}{nm}} \vee \frac{1}{n} \right)$.

Let $q$ be any vector in $\mathbb{R}^K$ and $e_1, \ldots, e_K$ be the canonical basis of $\mathbb{R}^K$.
Let $\alpha_k = \max_{j : j > k} C_{jk} / C_{kk} = \max_{j : j \ne k} C_{jk} / (C_{kk} \vee C_{jj})$. If
• $\log(K) \lesssim n$ (it can be made explicit)
• $(1 - \alpha_k) C_{kk} \gtrsim \sqrt{\dfrac{(e_k - q)^T C (e_k - q) \log(p)}{n}} \vee \sqrt{\dfrac{K \vee \log p}{nm}} \vee \dfrac{K}{n}$,
then $\widehat{B} = B^*$ w.h.p.
Take $q = \tfrac{1}{K}\mathbf{1}$ to come close to the minimax separation bound.
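The constraint set of the convex program introduced earlier can be sanity-checked numerically. The sketch below does not solve the optimization problem. It only verifies, for an illustrative balanced example, that the target $B^* = AA^T$ satisfies the constraints as they are read from the slide, and that the population objective $\langle R, B \rangle$ is larger at the true partition than at one wrong balanced partition. Using the population $R$ in place of $\widehat{R}$, the parameter values, and the function names are all assumptions of this example.

```python
import numpy as np

def block_indicator_matrix(groups, p):
    """B = A A^T: B[a, b] = 1 if a and b share a group, else 0."""
    B = np.zeros((p, p))
    for members in groups:
        B[np.ix_(members, members)] = 1.0
    return B

def is_feasible(B, K, tol=1e-9):
    """Check the constraints as read from the slide, with m = p / K."""
    m = B.shape[0] / K
    return bool(
        np.allclose(B, B.T)
        and np.linalg.eigvalsh(B).min() >= -tol        # positive semidefinite
        and B.min() >= -tol and B.max() <= 1 + tol     # 0 <= B_ab <= 1
        and np.allclose(B.sum(axis=0), m)              # every column sums to m
        and np.isclose(B.sum(), K * m ** 2)            # total mass K m^2
        and np.allclose(np.diag(B), 1.0)               # unit diagonal
    )

# Illustrative population correlation: K = 3 groups of size m = 2.
K, m, eta, alpha = 3, 2, 0.3, 0.25
p = K * m
C = eta * np.eye(K) + alpha * np.ones((K, K))
R = np.kron(C, np.ones((m, m)))            # blocks of R follow the entries of C
np.fill_diagonal(R, 1.0)

B_true = block_indicator_matrix([[0, 1], [2, 3], [4, 5]], p)
B_wrong = block_indicator_matrix([[0, 2], [1, 3], [4, 5]], p)

obj_true = float(np.sum(R * B_true))       # <R, B*>
obj_wrong = float(np.sum(R * B_wrong))     # <R, B> for a wrong balanced partition
print(is_feasible(B_true, K), is_feasible(B_wrong, K), obj_true, obj_wrong)
```

Both matrices are feasible here, and the gap between the two objective values is driven by $\eta$, the difference between within-group and between-group correlations.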

Recovery with $C = \eta I_K + \alpha J$
[Figure: panels $\widehat{R}$, $R$, $R$ (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0. $\eta = 0.29$, $\alpha = 0.25$, $n = 100$.]

Recovery with $C = \eta I_K + \alpha J$
[Figure: panels $\widehat{R}$, $R$, $R$ (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0. $\eta = 0.49$, $\alpha = 0.25$, $n = 100$.]

Recovery with $C = \eta I_K + \alpha J$
[Figure: panels $\widehat{R}$, $R$, $R$ (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0. $\eta = 0.89$, $\alpha = 0.25$, $n = 100$.]

Recovery with $C = BB^T$; $B$ random; balanced groups
[Figure: panels $\widehat{R}$, $R$, $R$ (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0. $n = 500$, $p = 20$, $K = 5$. $C$ is positive definite.]

Recovery with $C = BB^T$; $B$ random; unbalanced groups
[Figure: panels $\widehat{R}$, $R$, CORD, Spectral, $R$ (sorted); scale from 0.0 to 1.0. $n = 400$, $p = 100$, $K = 10$.]

Recovery with $C \not\succeq 0$
[Figure: panels $\widehat{R}$, $R$, $R$ (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0. $n = 500$. $C$ is symmetric but has an important negative eigenvalue.]

Summary: clustering via G-models

\[ \mathcal{R}(\eta) = \{C : \min_{1 \le j \ne k \le K} \max_{1 \le l \le K} |C_{lj} - C_{lk}| > \eta > 0\}. \]
• Does not require $C_{kk} \vee C_{jj}$ (within block) to be larger than $C_{jk}$ (between blocks).
• Minimax $\eta = O\big(\sqrt{\log p / n}\big)$.
• Exact recovery at and above the minimax level: CORD.
• General partitions.
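The CORD procedure summarized above (Kendall's tau plug-in estimator (2) followed by Algorithm 1) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the slide does not specify how the index $a \in S$ is chosen, so the sketch takes the first remaining index; it assumes $p \ge 3$ and no ties in the data; and it is not the refined version available on CRAN. The example matrices and the threshold value are illustrative.

```python
import math
from itertools import combinations
import numpy as np

def kendall_sin_correlation(X):
    """R_hat[a, b] = sin(pi/2 * T[a, b]) with T the Kendall tau matrix (no tie handling)."""
    n, p = X.shape
    T = np.zeros((p, p))
    for i, j in combinations(range(n), 2):
        d = np.sign(X[i] - X[j])      # signs of the coordinate-wise differences
        T += np.outer(d, d)           # equals sgn((X_ia - X_ja)(X_ib - X_jb)) for all a, b
    T *= 2.0 / (n * (n - 1))
    return np.sin(0.5 * math.pi * T)

def cord(R, a, b):
    """CORD(a, b) = max over c not in {a, b} of |R[a, c] - R[b, c]| (needs p >= 3)."""
    mask = np.ones(R.shape[0], dtype=bool)
    mask[[a, b]] = False
    return float(np.max(np.abs(R[a, mask] - R[b, mask])))

def cord_clustering(R, alpha):
    """Algorithm 1 as on the slide, taking the first remaining index as 'a' (assumption)."""
    S = list(range(R.shape[0]))
    partition = []
    while S:
        a = S[0]
        block = [a] + [b for b in S[1:] if cord(R, a, b) < alpha]
        partition.append(block)
        S = [c for c in S if c not in block]
    return partition

# Population example: three groups of size 2 with an illustrative matrix C.
C = np.array([[0.5, 0.1, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.3, 0.7]])
R_pop = np.kron(C, np.ones((2, 2)))
np.fill_diagonal(R_pop, 1.0)
partition = cord_clustering(R_pop, alpha=0.05)
print(partition)

# Estimator example: columns that are monotone transforms of one another.
X_demo = np.array([[1.0, 2.7, 4.0],
                   [2.0, 7.4, 3.0],
                   [3.0, 20.1, 2.0],
                   [4.0, 54.6, 1.0]])
R_demo = kendall_sin_correlation(X_demo)
print(np.round(R_demo, 3))
```

With data, `cord_clustering` would be applied to `kendall_sin_correlation(X)` with a threshold $\alpha$ of order $\sqrt{\log p / n}$. The second example shows the invariance to monotone transformations that motivates Kendall's tau.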
\[ \mathcal{R}_m(\eta) = \{R : C = \eta I_K + \alpha J_K;\ \eta > 0, \text{ and } |G_k| = m = p/K, \text{ for all } k\}. \]
• Does require $C_{kk} \vee C_{jj}$ (within block) to be larger than $C_{jk}$ (between blocks).
• Minimax $\eta = O\left( \sqrt{\frac{\log(p)}{nm}} \vee \frac{1}{n} \right)$.
• Exact recovery at and above the near minimax level: CONVEX.
• Balanced partitions.

Many open problems ....

• Develop a unifying framework for the study of variable clustering in G-models.
    1. Identify key quantities in model based variable clustering.
    2. Establish minimax lower bounds over general model classes.
    3. Develop effective (near) minimax adaptive algorithms over these classes.
    4. Bridge the gap between theory and practice in spectral clustering.
    5. "Post-clustering" inference.
    6. .....
• Beyond G-models ....

The team: Xi Luo (Brown), Christophe Giraud (Polytechnique and Paris Sud), Martin Royer (Paris Sud and Cornell), Nicolas Verzelen (INRA).

Community estimation in G-models via CORD, 2015, F.B., C. Giraud, X. Luo (arXiv + CRAN).
A convex optimization approach to near minimax optimal variable clustering, 2016, F.B., C. Giraud, M. Royer and N. Verzelen.

Thank you!

G-latent models: merits and limitations

G-latent models = most commonly used G-models.
$X$ follows a G-latent variable model if there exists $Z \in \mathbb{R}^K$ such that
\[ X_a = Z_k + E_a, \qquad \mathrm{Var}(E_a) = \gamma_k, \qquad \text{for } a \in G_k,\ 1 \le k \le K. \]
Merits: simple and interpretable.
Limitations: not identifiable.
Example. $X$ is zero-mean Gaussian with
\[
\Sigma = \begin{pmatrix}
1 & 0.25 & 0.26 & 0.26 \\
0.25 & 1 & 0.26 & 0.26 \\
0.26 & 0.26 & 1 & 0.25 \\
0.26 & 0.26 & 0.25 & 1
\end{pmatrix}
\]
$X$ is G-latent, with $G = \{\{1\}, \{2\}, \{3, 4\}\}$.
$X$ is also $G'$-latent, with $G' = \{\{1, 2\}, \{3\}, \{4\}\}$.
$X$ cannot be latent with respect to $G^* = \{\{1, 2\}, \{3, 4\}\}$.
The minimal partition for a latent decomposition is not unique!
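The example can be verified numerically. The sketch assumes, as is standard for this model but not spelled out on the slide, that the errors $E_a$ are uncorrelated with $Z$ and with each other. Then a latent decomposition with respect to a partition exists only if the matrix $C$ it forces is a valid covariance matrix (positive semidefinite) with nonnegative error variances. For singleton groups $C_{kk}$ is not determined by $\Sigma$; the value 0.9 below is an illustrative choice, leaving error variance 0.1.

```python
import numpy as np

def min_eigenvalue(C):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(np.array(C, dtype=float)).min())

# G* = {{1, 2}, {3, 4}}: C is forced by Sigma.
# C_11 = Cov(X_1, X_2) = 0.25, C_22 = Cov(X_3, X_4) = 0.25, C_12 = 0.26.
C_star = [[0.25, 0.26],
          [0.26, 0.25]]

# G = {{1}, {2}, {3, 4}}: off-diagonal entries and C_33 are forced by Sigma,
# while C_11 and C_22 may be any value in [0, 1]; 0.9 is an illustrative choice.
C_G = [[0.90, 0.25, 0.26],
       [0.25, 0.90, 0.26],
       [0.26, 0.26, 0.25]]

print(min_eigenvalue(C_star))  # about -0.01: not a covariance matrix
print(min_eigenvalue(C_G))     # positive: a valid covariance for Z
```

The same computation with the roles of the two blocks swapped covers $G'$, so both three-group partitions admit a latent decomposition while their common coarsening $G^*$ does not.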