Model based variable clustering: minimax membership detection

Florentina Bunea
Department of Statistical Science
Cornell University
Workshop on the Theory of Big Data, UCL, January, 2016
Variable clustering
Model based variable clustering
$X = (X_1, \dots, X_a, \dots, X_p)$: a zero-mean random vector.
Cluster = a collection of probabilistically similar $X_a$'s = well-defined target.
Observe data: $X_1, \dots, X_n$ i.i.d. copies of $X$.
Task: Estimate target.
Algorithmic variable clustering
Observe data: $X_1, \dots, X_n$ i.i.d. copies of $X$.
Task: Group $p$ vectors in $\mathbb{R}^n$.
An instance of variable clustering
Industry section: Telecom
  CORD (model based): AT&T, Verizon
  K-means (model free): AT&T, Verizon, Pfizer, Merck, Lilly, Bristol-Myers
  HC (model free): AT&T, Verizon
Industry section: Home Improvement
  CORD (model based): Home Depot, Lowe’s
  K-means (model free): Home Depot, Lowe’s, Starbucks
  HC (model free): Home Depot, Lowe’s, Starbucks, Costco, Target, Wal-Mart, FedEx, United Parcel Service
The relationship between AT&T and all other companies is similar to the
relationship between Verizon and all other companies.
Models for variable clustering: G-models
$G = \{G_k\}_{1 \le k \le K}$ partition of $\{1, \dots, p\}$.
1. All variables $X_a$ in a group $G_k$ should have similar relationships to one another.
2. All variables in a group should act as one block in relation to all variables in another group.
Featured models: G-latent, G-exchangeable and G-block covariance.
G-latent models
X follows a G-latent variable model if there exists $Z \in \mathbb{R}^K$ with covariance $C$ such that
$$X_a = Z_k + E_a, \quad \mathrm{Var}(E_a) = \gamma_k, \quad \text{for } a \in G_k,\ 1 \le k \le K.$$
G-exchangeable models
The distribution of X is G-exchangeable if $X_\sigma \overset{\text{law}}{=} X$ for all $\sigma \in S_G$, the set of permutations $\sigma$ of $\{1, \dots, p\}$ that only permute elements within each group of the partition, but not between groups.
G-block covariance models
$X \in \mathbb{R}^p$ with covariance matrix $\Sigma$ follows a G-block covariance model if (shown for $p = 5$ and $G = \{\{1, 2\}, \{3, 4, 5\}\}$)
$$\Sigma = \begin{pmatrix}
D_1 & C_{11} & C_{12} & C_{12} & C_{12} \\
C_{11} & D_1 & C_{12} & C_{12} & C_{12} \\
C_{12} & C_{12} & D_2 & C_{22} & C_{22} \\
C_{12} & C_{12} & C_{22} & D_2 & C_{22} \\
C_{12} & C_{12} & C_{22} & C_{22} & D_2
\end{pmatrix}.$$
Identifiability in G-models: well defined estimation targets
G-latent ⇒ G-exchangeable ⇒ G-block covariance structure.
Definition of identifiability
We say that the G-models above are identifiable, if the sets
B = {G partition of {1, . . . , p} : Σ has G-block covariance structure},
E = {G partition of {1, . . . , p} : X is G-exchangeable},
L = {G partition of {1, . . . , p} : X is G-latent},
have, respectively, a unique minimum with respect to the partial order induced by
inclusion.
Identifiable G-models
1. G-exchangeable: Yes, for all distributions, no assumptions. Min = $G_{\text{exchg}}$.
2. G-block covariance: Yes, for all distributions, no assumptions. Min = $G_{\text{block}}$.
3. G-latent models: Not in general, need further assumptions.
Identifiable clusters in Gaussian copula G-models
Goal: Define identifiable clusters that are invariant to unknown
monotone transformations of the data.
A solution: Assume that $X \in \mathbb{R}^p$ has a zero-mean Gaussian copula distribution with copula correlation matrix $R$ and marginals $F_1, \dots, F_p$. Then:
$$Y =: h(X) \sim N(0, R); \quad h_a = \Phi^{-1} \circ F_a.$$
The target: The clusters of Y as defined by one of the G-models.
• $Y = h(X) \sim N(0, R)$, $h$ unknown, $R$ correlation matrix.
Equality of target partitions in three Gaussian G-models
Let Y be a zero mean Gaussian vector with covariance matrix R.
1. $G_{\text{block}} = G_{\text{exchg}}$.
2. If $C \ge 0$, then $G_{\text{block}}$ is also the unique minimal latent partition of Y.
$$R = \begin{pmatrix}
1 & W_{11} & B_{12} & B_{12} & B_{12} \\
W_{11} & 1 & B_{12} & B_{12} & B_{12} \\
B_{12} & B_{12} & 1 & W_{22} & W_{22} \\
B_{12} & B_{12} & W_{22} & 1 & W_{22} \\
B_{12} & B_{12} & W_{22} & W_{22} & 1
\end{pmatrix}, \qquad
C = \begin{pmatrix} W_{11} & B_{12} \\ B_{12} & W_{22} \end{pmatrix}$$
Clusters = Blocks
Variables in the same group have zero CORD
If $a \in G_k$, $b \in G_j$: $\mathrm{CORD}(a, b) =: \max_{c \ne a, b} |R_{ac} - R_{bc}| = \max_{1 \le l \le K} |C_{lj} - C_{lk}|$.
If $Y_a$ and $Y_b$ are in different groups ($j \ne k$): $\mathrm{CORD}(a, b) > 0$.
$$R = \begin{pmatrix}
1 & R_{11} & R_{12} & R_{12} & R_{13} & R_{13} \\
R_{11} & 1 & R_{12} & R_{12} & R_{13} & R_{13} \\
R_{12} & R_{12} & 1 & R_{22} & R_{23} & R_{23} \\
R_{12} & R_{12} & R_{22} & 1 & R_{23} & R_{23} \\
R_{13} & R_{13} & R_{23} & R_{23} & 1 & R_{33} \\
R_{13} & R_{13} & R_{23} & R_{23} & R_{33} & 1
\end{pmatrix}$$
$$C =: \begin{pmatrix}
R_{11} & R_{12} & R_{13} \\
R_{12} & R_{22} & R_{23} \\
R_{13} & R_{23} & R_{33}
\end{pmatrix} =: \begin{pmatrix}
C_{11} & C_{12} & C_{13} \\
C_{12} & C_{22} & C_{23} \\
C_{13} & C_{23} & C_{33}
\end{pmatrix}$$
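The CORD identities on this slide can be checked numerically on a population matrix. The sketch below is illustrative only: the three groups of size two and the entries of C are made-up values, and the helper names `block_correlation` and `cord` are hypothetical, not taken from the talk or from the CRAN package.

```python
def block_correlation(groups, C):
    """Correlation matrix with R[a][b] = C[k][j] for a in G_k, b in G_j (a != b)
    and a unit diagonal."""
    p = sum(len(g) for g in groups)
    label = [0] * p
    for k, g in enumerate(groups):
        for a in g:
            label[a] = k
    return [[1.0 if a == b else C[label[a]][label[b]] for b in range(p)]
            for a in range(p)]


def cord(R, a, b):
    """CORD(a, b) = max over c not in {a, b} of |R[a][c] - R[b][c]|."""
    return max(abs(R[a][c] - R[b][c]) for c in range(len(R)) if c not in (a, b))


# Hypothetical example: K = 3 groups of size 2 (0-based indices), made-up C.
groups = [[0, 1], [2, 3], [4, 5]]
C = [[0.6, 0.3, 0.1],
     [0.3, 0.5, 0.2],
     [0.1, 0.2, 0.7]]
R = block_correlation(groups, C)

within = cord(R, 0, 1)    # a and b in the same group
between = cord(R, 0, 2)   # a in the first group, b in the second
formula = max(abs(C[l][0] - C[l][1]) for l in range(len(C)))
print(within, between, formula)
```

In this example every group has at least two members, which is what lets the maximum over c reach each column of C.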
Minimally separated blocks: a minimax lower bound
Once you eliminate the impossible, whatever remains, no matter how improbable, must
be the truth. Arthur Conan Doyle
No algorithm can recover blocks in $\mathcal{R}(\eta)$ connected by too short a CORD
If
$$0 \le \eta < \eta^* \sim 0.92 \sqrt{\frac{\log(p)}{n}},$$
we have
$$\inf_{\widehat G} \sup_{R \in \mathcal{R}(\eta)} P_R(\widehat G \ne G_{\text{block}}) \ge \frac{1}{7}, \quad (1)$$
where the infimum is taken over all possible estimators.
$$\mathcal{R}(\eta) =: \{R : \mathrm{CORD}(a, b) > \eta \ \text{for all } a \overset{G_{\text{block}}}{\nsim} b\} = \{C : \min_{1 \le j \ne k \le K} \max_{1 \le l \le K} |C_{lj} - C_{lk}| > \eta\}.$$
Variable clustering in Gaussian Copula Models
Data: X1 , . . . , Xn i.i.d. X. Assume Y =: h(X) ∼ N(0, R), h unknown.
Model based clusters: The unknown block structure of R.
Unstructured estimator of R, without estimating h:
$$\widehat T_{ab} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \mathrm{sgn}\big((X_{i,a} - X_{j,a})(X_{i,b} - X_{j,b})\big), \qquad \widehat R_{ab} = \sin\left(\frac{\pi}{2} \widehat T_{ab}\right). \quad (2)$$
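A minimal pure-Python sketch of the rank-based estimator: Kendall's tau for each pair of columns, followed by the sine transform. The function names, the toy sample size and the transforms `exp` and cube are assumptions made for illustration. Because tau depends only on the ordering of the observations, applying strictly increasing maps to the data should leave the estimate unchanged, which is why h never needs to be estimated.

```python
import math
import random


def sign(x):
    return (x > 0) - (x < 0)


def kendall_tau(x, y):
    """Kendall's tau: average sign of concordance over all pairs i < j."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sign((x[i] - x[j]) * (y[i] - y[j]))
    return 2.0 * total / (n * (n - 1))


def copula_correlation(data):
    """data: list of n rows with p entries each. Returns the p x p matrix
    with entries sin(pi/2 * tau_ab) and a unit diagonal."""
    p = len(data[0])
    cols = [[row[a] for row in data] for a in range(p)]
    R_hat = [[1.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(a + 1, p):
            r = math.sin(math.pi / 2 * kendall_tau(cols[a], cols[b]))
            R_hat[a][b] = R_hat[b][a] = r
    return R_hat


# Toy data (assumed): bivariate Gaussian Y with correlation rho,
# observed only through strictly increasing transforms X = (exp(Y1), Y2^3).
rng = random.Random(0)
rho = 0.6
Y = []
for _ in range(400):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    Y.append([z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2])
X = [[math.exp(y1), y2 ** 3] for y1, y2 in Y]

R_from_Y = copula_correlation(Y)
R_from_X = copula_correlation(X)
print(R_from_Y[0][1], R_from_X[0][1])
```

The double loop costs O(n^2) per pair of variables, which is fine for a sketch but not for large n.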
CORD for variable clustering
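$\widehat{\mathrm{CORD}}(a, b) = \max_{c \ne a, b} |\widehat R_{ac} - \widehat R_{bc}|$
Algorithm 1 (Runs in polynomial time)
• Parameter: $\alpha > 0$
• Initialization: $S = \{1, \dots, p\}$ and $l = 0$
• Repeat: while $S \ne \emptyset$
  - $l \leftarrow l + 1$
  - Choose an index $a \in S$.
  - $\widehat G_l = \{b \in S : \widehat{\mathrm{CORD}}(a, b) < \alpha\}$
  - $S \leftarrow S \setminus \widehat G_l$
• Return: the partition $\widehat G = (\widehat G_l)_{l = 1, \dots, k}$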
• If X is Gaussian, can use Pearson’s sample correlation instead of Kendall’s τ .
• A refined version on CRAN.
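The steps of Algorithm 1 can be sketched in a few lines of Python. This follows the simple version stated on the slide, not the refined CRAN implementation. Two choices are assumptions of the sketch: the index a is taken to be the smallest remaining index, and a is always placed in its own group, since CORD(a, a) is not defined on the slide. The toy matrix and its small perturbation are made up.

```python
def cord_hat(R_hat, a, b):
    """Empirical CORD: max over c not in {a, b} of |R_hat[a][c] - R_hat[b][c]|."""
    p = len(R_hat)
    return max(abs(R_hat[a][c] - R_hat[b][c]) for c in range(p) if c not in (a, b))


def cord_clustering(R_hat, alpha):
    """Greedy partition of {0, ..., p-1} at threshold alpha (requires p >= 3)."""
    S = list(range(len(R_hat)))
    partition = []
    while S:
        a = S[0]  # assumption: pick the smallest remaining index
        group = [b for b in S if b == a or cord_hat(R_hat, a, b) < alpha]
        partition.append(group)
        S = [b for b in S if b not in group]
    return partition


# Toy input (assumed): a block matrix with groups {0, 1, 2} and {3, 4},
# perturbed entrywise by at most 0.01 to mimic estimation error.
groups = [[0, 1, 2], [3, 4]]
C = [[0.5, 0.1], [0.1, 0.6]]
label = {a: k for k, g in enumerate(groups) for a in g}
p = 5
R_hat = [[1.0 if a == b else C[label[a]][label[b]] + 0.01 * (((a + b) % 3) - 1)
          for b in range(p)] for a in range(p)]

print(cord_clustering(R_hat, alpha=0.1))
```

Here the within-group CORD values stay below 0.02 and the between-group values stay above 0.38, so any threshold between the two recovers the blocks.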
Exact partition recovery via CORD
Exact partition recovery at separation level $\eta^{\heartsuit} = O\left(\sqrt{\frac{\log p}{n}}\right)$
If X has a Gaussian copula distribution with copula correlation matrix $R \in \mathcal{R}(\eta)$, with $\eta \ge \eta^{\heartsuit}$, then the CORD algorithm, applied at threshold level $\sqrt{\frac{\log p}{n}} \lesssim \alpha$, guarantees that:
$$\widehat G = G_{\text{block}}, \ \text{whp}.$$
$$\mathcal{R}(\eta) =: \{R : \mathrm{CORD}(a, b) > \eta \ \text{for all } a \overset{G_{\text{block}}}{\nsim} b\}.$$
If X has a Gaussian copula distribution, $\|\widehat R - R\|_\infty = O_P\left(\sqrt{\frac{\log p}{n}}\right)$.
Block-correlation matrices with extra information
• $R = A C A^t + \Gamma$, $A$ membership matrix, $\Gamma$ is a diagonal matrix.
• Equally sized groups: $|G_k| = m = \frac{p}{K}$, for all $k$.
• $J_K$ matrix of 1's.
$$\mathcal{R}_m(\eta) = \{R : C = \eta I_K + \alpha J_K; \ \text{and } |G_k| = m = \tfrac{p}{K}, \ \text{for all } k\}.$$
• $\eta$ is the difference between within group and between group correlations.
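As a small illustration of this class, the sketch below builds a matrix of the form A C A^t + Γ with C = ηI_K + αJ_K and balanced groups. The function name and the consecutive-index group layout are hypothetical. One assumption is made explicit in the code: Γ is chosen as (1 − η − α)I so that R has a unit diagonal, which requires η + α ≤ 1.

```python
def balanced_block_correlation(K, m, eta, alpha):
    """R = A C A^t + Gamma with C = eta * I_K + alpha * J_K and K groups of size m.

    Assumption of this sketch: Gamma = (1 - eta - alpha) * I, so diag(R) = 1.
    Variables a = 0, ..., K*m - 1 are assigned to group a // m.
    """
    p = K * m
    label = [a // m for a in range(p)]
    C = [[alpha + (eta if k == j else 0.0) for j in range(K)] for k in range(K)]
    gamma = 1.0 - (eta + alpha)
    R = [[C[label[a]][label[b]] + (gamma if a == b else 0.0) for b in range(p)]
         for a in range(p)]
    return R, C, gamma


R, C, gamma = balanced_block_correlation(K=3, m=2, eta=0.29, alpha=0.25)
within_group = R[0][1]    # variables 0 and 1 share a group
between_group = R[0][2]   # variables 0 and 2 do not
print(within_group - between_group)  # approximately eta
```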
A minimax lower bound over Rm (η)
No algorithm can recover blocks in $\mathcal{R}_m(\eta)$ if $\eta$ is too small
If
$$0 \le \eta < \eta^* \sim c_1 \left( \sqrt{\frac{\log(p)}{nm}} \vee \frac{1}{n} \right), \quad (3)$$
we have, for $0 < c_2 < 1$,
$$\inf_{\widehat G} \sup_{R \in \mathcal{R}_m(\eta)} P_R(\widehat G \ne G_{\text{block}}) \ge c_2 > 0,$$
where the infimum is taken over all possible estimators.
In progress: lower bound over general C classes, with possibly unbalanced groups, $m = \min_k |G_k|$.
Generics on spectral clustering
General principle
• $A C A' = U \Lambda U'$, a rank $K$ eigenvector decomposition.
• $a \overset{G_{\text{block}}}{\sim} b \iff U_{a:} = U_{b:}$, where $U_{a:} \in \mathbb{R}^K$ is the $a$-th row of $U$.
• Use (for instance) approximate K-means algorithms to find blocks.
• Good performance for good estimators of U .
• Widely used in network analysis for fitting (approximate)
SBM (stochastic block models).
(A version of) one-step spectral clustering
• Use approximate K-means on the rows of $\widehat U$, the first $K$ eigenvectors of $\widetilde R =: \widehat R - \mathrm{diag}(\widehat R)$.
• Expected to work: $R_0 =: R - \mathrm{diag}(R)$ is approximately rank $K$, when $\lambda_K(C) \ge 4 \frac{K}{p} \|C\|_\infty$.
• Need to control the error $\widetilde W =: \widetilde R - R_0$. Let $\mathrm{Re}(R) =: \frac{\mathrm{trace}\, R}{\|R\|_{op}}$. Then
$$\|\widetilde W\|_{op} \le \|R\|_{op} \left( \sqrt{\frac{\mathrm{Re}(R)}{n}} \vee \frac{\mathrm{Re}(R)}{n} \right).$$
B. and Xiao (2015); Koltchinskii and Lounici (2015).
Limitations of (a version of) one-step spectral clustering
• L = overall proportion of mis-assigned variables.
• Consistent clustering: $L < \frac{1}{n}$.
For the illustrative case: $C = \eta I_K$, $n = p$.
Current state of affairs for one-step spectral clustering:
If $\frac{K}{n\rho} \lesssim \eta \le 1$, then $L \lesssim \rho$, w.h.p., for $\rho < \frac{1}{K}$.
Consistent clustering does not follow from current results. Problem: proof
techniques.
Networks: one step (Lei and Rinaldo, 2015); two-step refinements (Gao et al. 2015),
with provably better theoretical performance.
! One-step spectral clustering may have excellent practical performance.
A convex optimization approach to estimation of
correlation matrices with balanced blocks
Target: the block matrix of 1's, $B^* = AA^T = \big(\mathbf{1}\{a \overset{G_{\text{block}}}{\sim} b\}\big)_{1 \le a, b \le p}$.
Input: An estimator $\widehat R$ of $R$, and an estimator $\widehat K$ of $K$.
Estimate $B^*$ by:
$$\widehat B \in \operatorname*{argmax}_{B \in \widehat{\mathcal{C}}} \langle \widehat R, B \rangle,$$
$$\widehat{\mathcal{C}} = \left\{ B \in S^+ : \ 0 \le B_{ab} \le 1; \quad \forall b, \ \sum_a B_{ab} = \widehat m =: \frac{p}{\widehat K}; \quad \sum_{a,b} B_{ab} = \widehat K \widehat m^2; \quad \mathrm{diag}(B) = I \right\},$$
where $S^+$ = {symmetric and positive semidefinite}.
• $\widehat B$ is a restricted least squares estimator.
• Networks: Chen and Xu (2015), Amini and Levina (2014).
Analysis not transferable to covariance/correlation matrices.
• $\widehat K = \mathrm{rank}(\widetilde L)$, where
$$(\widetilde L, \widetilde \Gamma) = \operatorname*{argmin}_{L, \Gamma} \left\{ \frac{1}{2} \|L + \Gamma - \widehat R\|_F^2 + \lambda \|L\|_* \right\} \quad \text{s.t. } \Gamma \text{ is diagonal}.$$
• $\widehat K = K$, w.h.p. (Wegkamp and Zhao, 2015)
Near minimax optimal membership estimation
Exact partition recovery at level $\eta^{\clubsuit} = O\left( \frac{\log(p) \vee K}{n} \bigvee \sqrt{\frac{K \vee \log(p)}{nm}} \right)$
If $R \in \mathcal{R}_m(\eta)$, with $\eta \ge \eta^{\clubsuit}$, then $\widehat B = B^*$, whp.
$$\mathcal{R}_m(\eta) = \{R : C = \eta I_K + \alpha J_K; \ \text{and } |G_k| = m = \tfrac{p}{K}, \ \text{for all } k\}.$$
• Upper bound holds for more general classes of block-matrices.
• Minimax lower bound: $\eta^* = c_1 \left( \sqrt{\frac{\log(p)}{nm}} \vee \frac{1}{n} \right)$.
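Let $q$ be any vector in $\mathbb{R}^K$ and $e_1, \dots, e_K$ be the canonical basis of $\mathbb{R}^K$.
Let $\alpha_k = \max_{j : j > k} C_{jk} / C_{kk} = \max_{j : j \ne k} C_{jk} / (C_{kk} \vee C_{jj})$.
If
• $\log(K) \ll n$ (it can be made explicit)
• $(1 - \alpha_k) C_{kk} \gtrsim \sqrt{\frac{(e_k - q)^T C (e_k - q) \log(p)}{n}} \bigvee \sqrt{\frac{K \vee \log p}{nm}} \bigvee \frac{K}{n}$,
then $\widehat B = B^*$ w.h.p.
Take $q = \frac{1}{K}$ to come close to the minimax separation bound.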
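For the balanced class C = ηI_K + αJ_K, the two quantities in this condition can be evaluated directly. The sketch below is a numeric check under one stated assumption: "q = 1/K" is read as the vector with every coordinate equal to 1/K. The values of K, η and α are illustrative, and the helper names are hypothetical. Under that reading, the left-hand side (1 − α_k)C_kk comes out as η and the quadratic form as η(1 − 1/K).

```python
def alpha_k(C, k):
    """max over j != k of C[j][k] / max(C[k][k], C[j][j])."""
    return max(C[j][k] / max(C[k][k], C[j][j]) for j in range(len(C)) if j != k)


def quad_form(C, v):
    """v^T C v for a square matrix C given as a list of rows."""
    n = len(v)
    return sum(v[i] * C[i][j] * v[j] for i in range(n) for j in range(n))


# Illustrative values (assumed): C = eta * I_K + alpha * J_K.
K, eta, alpha = 5, 0.29, 0.25
C = [[alpha + (eta if i == j else 0.0) for j in range(K)] for i in range(K)]

# Assumption: q is the constant vector (1/K, ..., 1/K).
q = [1.0 / K] * K
k = 0
e_minus_q = [(1.0 if i == k else 0.0) - q[i] for i in range(K)]

lhs = (1 - alpha_k(C, k)) * C[k][k]   # separation term, equals eta here
qf = quad_form(C, e_minus_q)          # equals eta * (1 - 1/K) here
print(lhs, qf)
```

With this q the αJ_K part of C drops out of the quadratic form, because the coordinates of e_k − q sum to zero.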
Recovery with $C = \eta I_K + \alpha J$
[Figure omitted; panels: $\widehat R$, R, R (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0]
Figure: η = 0.29, α = 0.25, n = 100
Recovery with $C = \eta I_K + \alpha J$
[Figure omitted; panels: $\widehat R$, R, R (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0]
Figure: η = 0.49, α = 0.25, n = 100
Recovery with $C = \eta I_K + \alpha J$
[Figure omitted; panels: $\widehat R$, R, R (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0]
Figure: η = 0.89, α = 0.25, n = 100
Recovery with $C = BB^T$; B random; balanced groups
[Figure omitted; panels: $\widehat R$, R, R (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0]
Figure: n = 500, p = 20, K = 5
C is positive definite
Recovery with $C = BB^T$; B random; unbalanced groups
[Figure omitted; panels: $\widehat R$, R, CORD, Spectral, R (sorted); scale from 0.0 to 1.0]
Figure: n = 400, p = 100, K = 10
Recovery with C ≺ 0
[Figure omitted; panels: $\widehat R$, R, R (sorted), CORD, Spectral, Convex; scale from 0.0 to 1.0]
Figure: n = 500
C is symmetric but has an important negative eigenvalue
Summary: clustering via G-models
$$\mathcal{R}(\eta) = \{C : \min_{1 \le j \ne k \le K} \max_{1 \le l \le K} |C_{lj} - C_{lk}| > \eta > 0\}.$$
• Does not require $C_{kk} \vee C_{jj}$ (within block) be larger than $C_{jk}$ (between blocks).
• Minimax $\eta = O\left(\sqrt{\frac{\log p}{n}}\right)$.
• Exact recovery at and above the minimax level: CORD.
• General partitions.
$$\mathcal{R}_m(\eta) = \{R : C = \eta I_K + \alpha J_K; \ \eta > 0, \ \text{and } |G_k| = m = \tfrac{p}{K}, \ \text{for all } k\}.$$
• Does require $C_{kk} \vee C_{jj}$ (within block) be larger than $C_{jk}$ (between blocks).
• Minimax $\eta = O\left(\sqrt{\frac{\log(p)}{nm}} \vee \frac{1}{n}\right)$.
• Exact recovery at and above the near minimax level: CONVEX.
• Balanced partitions.
Many open problems ....
• Develop a unifying framework for the study of variable clustering in G-models.
  1. Identify key quantities in model based variable clustering.
  2. Establish minimax lower bounds over general model classes.
  3. Develop effective (near) minimax adaptive algorithms over these classes.
  4. Bridge the gap between theory and practice in spectral clustering.
  5. "Post-clustering" inference.
  6. .....
• Beyond G-models ....
The team: Xi Luo (Brown), Christophe Giraud (Polytechnique and Paris
Sud), Martin Royer (Paris Sud and Cornell), Nicolas Verzelen (INRA).
Community estimation in G-models via CORD, 2015,
F.B., C. Giraud, X. Luo (arXiv + CRAN).
A convex optimization approach to near minimax optimal variable clustering,
2016, F.B., C. Giraud, M. Royer and N. Verzelen.
Thank you !
G-latent models: merits and limitations
G-latent models = most commonly used G-models.
X follows a G-latent variable model if there exists $Z \in \mathbb{R}^K$ such that
$$X_a = Z_k + E_a, \quad \mathrm{Var}(E_a) = \gamma_k, \quad \text{for } a \in G_k,\ 1 \le k \le K.$$
Merits: simple and interpretable.
Limitations: not identifiable.
Example. X is zero-mean Gaussian with
$$\Sigma = \begin{pmatrix}
1 & 0.25 & 0.26 & 0.26 \\
0.25 & 1 & 0.26 & 0.26 \\
0.26 & 0.26 & 1 & 0.25 \\
0.26 & 0.26 & 0.25 & 1
\end{pmatrix}$$
X is G-latent, with $G = \{\{1\}, \{2\}, \{3, 4\}\}$.
X is also $G'$-latent, with $G' = \{\{1, 2\}, \{3\}, \{4\}\}$.
X cannot be latent with respect to $G^* = \{\{1, 2\}, \{3, 4\}\}$.
The minimal partition for a latent decomposition is not unique!
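The example can be verified by checking whether a valid covariance C of Z exists for each partition. The sketch assumes the noise terms E_a are uncorrelated with Z and with each other, so that Σ = A C A^T + Γ with Γ diagonal and nonnegative. Under that assumption the off-diagonal entries of Σ fix C, except for the variance of a singleton group's latent variable, which is set to 1 here (zero noise). The helper names are hypothetical.

```python
def det(M):
    """Determinant by Laplace expansion along the first row (tiny matrices only)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))


def is_positive_definite(M):
    """Sylvester's criterion: all leading principal minors are positive."""
    return all(det([row[:k] for row in M[:k]]) > 0 for k in range(1, len(M) + 1))


# G = {{1}, {2}, {3, 4}}: Var(Z_3) must equal Sigma_34 = 0.25;
# the singleton variances are free and set to 1 (assumption: gamma = 0 there).
C_G = [[1.0, 0.25, 0.26],
       [0.25, 1.0, 0.26],
       [0.26, 0.26, 0.25]]

# G' = {{1, 2}, {3}, {4}}: Var(Z_1) must equal Sigma_12 = 0.25.
C_G_prime = [[0.25, 0.26, 0.26],
             [0.26, 1.0, 0.25],
             [0.26, 0.25, 1.0]]

# G* = {{1, 2}, {3, 4}}: every entry of C is forced by Sigma.
C_star = [[0.25, 0.26],
          [0.26, 0.25]]

print(is_positive_definite(C_G), is_positive_definite(C_G_prime), det(C_star))
```

A negative determinant for the 2 x 2 matrix means it has a negative eigenvalue, so it cannot be the covariance of any Z, which is why G* fails.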