Phase Transitions in the Information Distortion Albert E. Parker

Phase Transitions in the
Information Distortion
NIPS 2003 workshop on Information Theory
and Learning:
The Bottleneck and Distortion Approach
December 13, 2003
Albert E. Parker
Department of Mathematical Sciences
Center for Computational Biology
Montana State University
Collaborators: Tomas Gedeon, Alex Dimitrov, John Miller, and Zane Aldworth
The Goal:
To determine the phase transitions or the bifurcation structure of solutions to
clustering problems of the form
$\max_{q \in \Delta} G(q)$ constrained by $D(q) \geq I_0$
where
• $\Delta$ is the set of valid conditional probabilities $q(Z|X)$ in $\mathbb{R}^{NK}$, which assign the K objects of X to the N clusters of Z.
• G and D are sufficiently smooth in $\Delta$.
• G and D have symmetry: they are invariant to relabelling of the classes of Z.
• The Hessians $\nabla_q^2 G$ and $\nabla_q^2 D$ are block diagonal.
A similar formulation:
Using the method of Lagrange multipliers, the goal of determining the bifurcation
structure of solutions of the optimization problem can be rephrased as finding
the bifurcation structure of stationary points of the problem
$\max_{q \in \Delta} (G(q) + \beta D(q))$
where
• $\beta \in [0, \infty)$.
• $\Delta$ is the set of valid conditional probabilities $q(Z|X)$ in $\mathbb{R}^{NK}$, which assign the K objects of X to the N clusters of Z.
• G and D are sufficiently smooth in $\Delta$.
• G and D have symmetry: they are invariant to relabelling of the classes of Z.
• The Hessian $\nabla_q^2 (G + \beta D)$ is block diagonal, and satisfies a set of regularity
conditions at bifurcation (e.g. the kernel of each block is one dimensional).
How: Use the Symmetries
By capitalizing on the symmetries of the cost functions, we have described the
bifurcation structure of stationary points to problems of the form
$\max_{q \in \Delta} G(q)$ constrained by $D(q) \geq I_0$
or
$\max_{q \in \Delta} (G(q) + \beta D(q))$
where
• $\beta \in [0, \infty)$.
• $\Delta$ is the set of valid conditional probabilities in $\mathbb{R}^{NK}$.
• G and D are sufficiently smooth in $\Delta$.
• G and D have symmetry: they are invariant to relabelling of the classes of Z.
• The Hessian $\nabla_q^2 (G + \beta D)$ is block diagonal, and satisfies a set of regularity
conditions at bifurcation (e.g. the kernel is one dimensional).
Examples
optimizing at a distortion level $D(Y,Z) \leq D_0$
• Rate Distortion Theory (Shannon 1950's)
Minimal Informative Compression
$\min_q I(X,Z)$ constrained by $D(X,Z) \leq D_0$, or equivalently
$\max_q -I(X,Z)$ constrained by $D(X,Z) \leq D_0$
• Deterministic Annealing (Rose 1998)
A Clustering Algorithm
$\max_{C,\, q \in \Delta} H(Z|X)$ constrained by $D(X,Z) \leq D_0$
The two objectives are related by $I(X,Z) = H(Z) - H(Z|X)$.
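Inputs and Outputs and Clustered Outputs
• Inputs: Y, with L objects $\{y_i\}$
• Outputs: X, with K objects $\{x_i\}$, related to the inputs by $p(X,Y)$
• Clustered Outputs: Z, with N objects $\{z_i\}$, obtained from the outputs by $q(Z|X)$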
Two methods which use an
information distortion function to cluster
• Information Bottleneck Method (Tishby, Pereira, Bialek 1999)
$\min_q I(X,Z)$ constrained by $D_I(X,Z) \leq D_0$
$\max_q -I(X,Z) + \beta I(Y;Z)$
The Hessian is always singular ($-I(X,Z)$ is not strictly concave), so the theory which follows does not apply.
• Information Distortion Method (Dimitrov and Miller 2001)
$\max_q H(Z|X)$ constrained by $D_I(X,Z) \leq D_0$
$\max_q H(Z|X) + \beta I(Y;Z)$
$H(Z|X)$ is strictly concave, so the theory which follows does apply.
A basic annealing algorithm
to solve
$\max_{q \in \Delta} (G(q) + \beta D(q))$
Let $q_0$ be the maximizer of $\max_q G(q)$, and let $\beta_0 = 0$. For $k \geq 0$, let $(q_k, \beta_k)$ be
a solution to $\max_q G(q) + \beta_k D(q)$. Iterate the following steps until
$\beta_K = \beta_{\max}$ for some K.
1. Perform $\beta$-step: Let $\beta_{k+1} = \beta_k + d_k$ where $d_k > 0$.
2. The initial guess for $q_{k+1}$ at $\beta_{k+1}$ is $q_{k+1}^{(0)} = q_k + \eta$ for some small
perturbation $\eta$.
3. Optimization: solve $\max_q (G(q) + \beta_{k+1} D(q))$ to get the maximizer $q_{k+1}$,
using initial guess $q_{k+1}^{(0)}$.
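The three steps above can be sketched in Python for the Information Distortion cost, G = H(Z|X) and D = I(Y;Z). This is a minimal sketch under assumptions that are not in the slides: the inner optimizer is a simple fixed-point iteration obtained by setting the gradient of the Lagrangian to zero, the toy joint distribution has two blobs rather than four, and the names `anneal`, `solve_at_beta` and `information_yz` are hypothetical.

```python
import numpy as np

def information_yz(q, pxy):
    """I(Y;Z) in bits, where p(z,y) = sum_x q(z|x) p(x,y) and q[z, x] = q(z|x)."""
    pzy = q @ pxy                                   # shape (N, L)
    pz = pzy.sum(axis=1, keepdims=True)
    py = pzy.sum(axis=0, keepdims=True)
    return float(np.sum(pzy * np.log(pzy / (pz * py))) / np.log(2))

def solve_at_beta(q, pxy, beta, iters=200):
    """Assumed optimizer: iterate q(z|x) ~ exp(beta * sum_y p(y|x) log p(y|z)).

    This is the stationarity condition of H(Z|X) + beta*I(Y;Z) under the
    normalization constraint; it finds stationary points, not guaranteed maxima.
    """
    py_given_x = pxy / pxy.sum(axis=1, keepdims=True)           # (K, L)
    for _ in range(iters):
        pzy = q @ pxy
        log_py_given_z = np.log(pzy / pzy.sum(axis=1, keepdims=True))
        logits = beta * (log_py_given_z @ py_given_x.T)         # (N, K)
        logits -= logits.max(axis=0, keepdims=True)             # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=0, keepdims=True)
    return q

def anneal(pxy, n_classes, beta_max=3.0, d_beta=0.1, eta=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    K = pxy.shape[0]
    q = np.full((n_classes, K), 1.0 / n_classes)    # q_0 maximizes H(Z|X)
    beta, history = 0.0, []
    while beta < beta_max:
        beta = min(beta + d_beta, beta_max)          # 1. beta-step
        guess = np.clip(q + eta * rng.standard_normal(q.shape), 1e-12, None)
        guess /= guess.sum(axis=0, keepdims=True)    # 2. perturbed initial guess
        q = solve_at_beta(guess, pxy, beta)          # 3. optimization
        history.append((beta, information_yz(q, pxy)))
    return q, history

# Toy data (an assumption): two blobs on a 4 x 4 grid, plus a small noise floor.
block = np.array([[1.0, 0.6], [0.6, 1.0]])
pxy = np.kron(np.eye(2), block) + 0.01
pxy /= pxy.sum()
q_final, history = anneal(pxy, n_classes=2)
```

In this toy run the uniform clustering persists for small values of beta, and the perturbation in step 2 is what lets the iterate leave it once it loses stability. Plotting `history` gives a small analogue of the I(Y,Z) curves shown in the talk.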
Application of the annealing method to the Information Distortion problem
$\max_{q \in \Delta} (H(Z|X) + \beta I(Y;Z))$
when p(X,Y) is defined by four Gaussian blobs
[Figure, first panel: p(X,Y), with L = 52 inputs Y and K = 52 outputs X. Second panel: a clustering q(Z|X) of the K = 52 outputs X into N = 4 clustered outputs Z.]
Evolution of the optimal clustering:
Observed Bifurcations for the Four Blob problem:
[Figure: the optimal clusterings q*, with I(Y,Z) in bits.]
We just saw the optimal clusterings q* at some $\beta^* = \beta_{\max}$. What do the clusterings look like for $\beta < \beta_{\max}$?
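Observed Bifurcations for the 4 Blob Problem
Conceptual Bifurcation Structure
[Figures: the observed bifurcations, with I(Y,Z) in bits; and the conceptual bifurcation structure, with q* against $\beta$.]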
Why are there only 3 bifurcations observed? In general, are there only N-1 bifurcations?
What kinds of bifurcations do we expect: pitchfork-like, transcritical, saddle-node, or some
other type?
How many bifurcating branches are there?
What do the bifurcating branches look like? Are they 1st order phase transitions
(subcritical) or 2nd order phase transitions (supercritical) ?
What is the stability of the bifurcating branches? Is there always a bifurcating branch
which contains solutions of the optimization problem?
Are there bifurcations after all of the classes have resolved ?
Recall the Symmetries:
To better understand the bifurcation structure, we capitalize
on the symmetries of the function $G(q) + \beta D(q)$
[Figure: a clustering q(Z|X) of the K objects $\{x_i\}$ of X into the N objects $\{z_i\}$ of Z, shown twice with the labels "class 1" and "class 3" interchanged.]
The symmetry group of all permutations on N symbols is $S_N$.
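The relabelling symmetry can be checked numerically for the Information Distortion cost functions. The sketch below, with randomly generated distributions and illustrative sizes that are not taken from the slides, applies every permutation in $S_4$ to the classes of Z (the rows of q) and records how much H(Z|X) and I(Y;Z) change. The function names are hypothetical.

```python
import itertools
import numpy as np

def cond_entropy_zx(q, px):
    """H(Z|X) in bits for q[z, x] = q(z|x) and px[x] = p(x)."""
    return float(-np.sum(px * np.sum(q * np.log2(q), axis=0)))

def information_yz(q, pxy):
    """I(Y;Z) in bits, where p(z,y) = sum_x q(z|x) p(x,y)."""
    pzy = q @ pxy
    pz = pzy.sum(axis=1, keepdims=True)
    py = pzy.sum(axis=0, keepdims=True)
    return float(np.sum(pzy * np.log2(pzy / (pz * py))))

rng = np.random.default_rng(0)
N, K, L = 4, 6, 5                       # illustrative sizes (an assumption)
pxy = rng.random((K, L))
pxy /= pxy.sum()                        # random joint distribution p(X,Y)
q = rng.random((N, K))
q /= q.sum(axis=0, keepdims=True)       # random clustering q(Z|X)
px = pxy.sum(axis=1)

base = (cond_entropy_zx(q, px), information_yz(q, pxy))

# Relabel the classes of Z by every permutation in S_N and re-evaluate both costs.
values = [
    (cond_entropy_zx(q[list(perm)], px), information_yz(q[list(perm)], pxy))
    for perm in itertools.permutations(range(N))
]
max_deviation = max(abs(h - base[0]) + abs(i - base[1]) for h, i in values)
```

Here `max_deviation` stays at the level of floating-point rounding, which is the invariance of G and D to relabelling assumed throughout the talk.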
A partial subgroup lattice for $S_N$ when N = 4:
$S_4$ at the top; four copies of $S_3$ below it; three copies of $S_2$ below each $S_3$; the trivial group 1 at the bottom.
A partial lattice of the maximal subgroups
$S_2 \times S_2$ of $S_4$:
$S_4$
$\langle (12), (34) \rangle$   $\langle (13), (24) \rangle$   $\langle (14), (23) \rangle$
This Group Structure
determines the
Bifurcation Structure
Define a Gradient Flow
Goal: To determine the bifurcation structure of stationary points of
$\max_{q \in \Delta} (G(q) + \beta D(q))$
Method: Study the equilibria of the flow
$$\begin{pmatrix} \dot{q} \\ \dot{\lambda} \end{pmatrix} = \nabla_{q,\lambda} \mathcal{L}(q, \lambda, \beta) := \nabla_{q,\lambda} \left( G(q) + \beta D(q) + \sum_{x \in X} \lambda_x \left( \sum_{z} q(z|x) - 1 \right) \right)$$
• Equilibria of this system (in $\mathbb{R}^{NK+K}$) are possible solutions of the optimization
problem.
• The Jacobian $\nabla_{q,\lambda}^2 \mathcal{L}$ at an equilibrium is symmetric, and so only bifurcations of equilibria
can occur.
• The first equilibrium is $q^*(\beta_0 = 0) \equiv 1/N$.
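The claim that the uniform clustering is an equilibrium can be checked numerically for the Information Distortion cost. The sketch below assumes G = H(Z|X) and D = I(Y;Z), a randomly generated p(X,Y), and an arbitrary value of beta; none of these numbers come from the slides. It estimates the gradient of G + beta D at q = 1/N by central differences and checks that, for each x, the gradient is the same for every class z, so that a single multiplier $\lambda_x$ cancels it.

```python
import numpy as np

def objective(q, pxy, beta):
    """F(q) = H(Z|X) + beta * I(Y;Z), in nats, with q[z, x] = q(z|x)."""
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0, keepdims=True)
    h_zx = -np.sum(px * np.sum(q * np.log(q), axis=0))
    pzy = q @ pxy
    pz = pzy.sum(axis=1, keepdims=True)
    i_yz = np.sum(pzy * np.log(pzy / (pz * py)))
    return h_zx + beta * i_yz

def numerical_gradient(f, q, h=1e-6):
    """Central-difference gradient of f with respect to every entry of q."""
    grad = np.zeros_like(q)
    for idx in np.ndindex(*q.shape):
        step = np.zeros_like(q)
        step[idx] = h
        grad[idx] = (f(q + step) - f(q - step)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
N, K, L = 4, 5, 6                        # illustrative sizes (an assumption)
pxy = rng.random((K, L)) + 0.1
pxy /= pxy.sum()
beta = 1.3                               # arbitrary annealing parameter

q_uniform = np.full((N, K), 1.0 / N)
grad = numerical_gradient(lambda q: objective(q, pxy, beta), q_uniform)

# For each x, the spread of the gradient across classes z should vanish at q = 1/N.
spread = np.ptp(grad, axis=0).max()
# Multipliers lambda_x for which grad_q L = grad + lambda_x = 0.
lam = -grad[0]
```

Because the gradient does not depend on z at the uniform clustering, choosing `lam` as above makes the right-hand side of the flow vanish there, for any beta.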
Symmetry Breaking Bifurcations
[Figure: bifurcation diagram of q* against $\beta$, built up alongside the subgroup lattices of $S_4$. The labelled solutions are:]
• $q_{1/N} \equiv 1/N = 1/4$ is fixed by $S_N = S_4$.
• q* fixed by $S_{N-1} = S_3$.
• q* fixed by $S_{N-2} = S_2$.
• q* fixed by $S_2 \times S_2 = \langle (12), (34) \rangle$.
Existence Theorems for
Bifurcating Branches
Given a bifurcation at a point fixed by $S_N$:
• Equivariant Branching Lemma (Vanderbauwhede and Cicogna 1980-1):
There are N bifurcating branches, each of which has symmetry $S_{N-1}$.
• The Smoller-Wasserman Theorem (Smoller and Wasserman 1985-6):
There are $N!/(2\, m!\, n!)$ bifurcating branches which have symmetry $S_m \times S_n$ if N = m + n.
Existence Theorems for
Bifurcating Branches
Given a bifurcation at a point fixed by $S_{N-1}$:
• Equivariant Branching Lemma (Vanderbauwhede and Cicogna 1980-1):
There are N-1 bifurcating branches, each of which has symmetry $S_{N-2}$.
• The Smoller-Wasserman Theorem (Smoller and Wasserman 1985-6):
There are $(N-1)!/(2\, m!\, n!)$ bifurcating branches which have symmetry $S_m \times S_n$ if N-1 = m + n.
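For N = 4 the branch counts quoted above can be checked by direct enumeration. The sketch below covers only the case m = n = 2 that appears in the slides, and it assumes the interpretation that an $S_{N-1}$ branch singles out one class while an $S_m \times S_n$ branch corresponds to an unordered split of the classes into two blocks.

```python
from itertools import combinations
from math import factorial

N, m, n = 4, 2, 2

# Equivariant Branching Lemma count: one branch per choice of the class that
# separates from the other N-1 classes, which stay interchangeable (S_{N-1}).
ebl_branches = [(c,) for c in range(1, N + 1)]

# S_m x S_n count from the formula quoted on the slide, N!/(2 m! n!).
formula_count = factorial(N) // (2 * factorial(m) * factorial(n))

# Direct enumeration: unordered splits of {1, ..., N} into blocks of sizes m and n.
classes = range(1, N + 1)
splits = {
    frozenset([block, tuple(c for c in classes if c not in block)])
    for block in combinations(classes, m)
}
```

The three splits found, {1,2}|{3,4}, {1,3}|{2,4} and {1,4}|{2,3}, match the three subgroups $\langle (12),(34) \rangle$, $\langle (13),(24) \rangle$ and $\langle (14),(23) \rangle$ listed for $S_4$.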
Observed Bifurcation Structure and Group Structure
[Figure: the observed bifurcation diagram of q* against $\beta$, shown beside the subgroup lattices of $S_4$.]
• The Equivariant Branching Lemma shows that the bifurcation structure contains the branches … (those with symmetry $S_3$ and $S_2$ in the lattice).
• The subgroups $\{S_2 \times S_2\}$ give additional structure …
• Theorem: There are exactly K bifurcations on the branch $(q_{1/N}, \beta)$ whenever the Hessian $\nabla_q^2 G(q_{1/N})$ is nonsingular.
• There are K = 52 bifurcations on the first branch.
A partial subgroup lattice for $S_4$ and the corresponding bifurcating
directions given by the Equivariant Branching Lemma
(signs are such that the class blocks of each direction sum to zero, so that $\sum_z q(z|x) = 1$ is preserved)
$S_4$
$S_3$: $(3v, -v, -v, -v, 0)^T$
    $S_2$: $(0, 2v, -v, -v, 0)^T$   $(0, -v, 2v, -v, 0)^T$   $(0, -v, -v, 2v, 0)^T$
$S_3$: $(-v, 3v, -v, -v, 0)^T$
    $S_2$: $(2v, 0, -v, -v, 0)^T$   $(-v, 0, 2v, -v, 0)^T$   $(-v, 0, -v, 2v, 0)^T$
$S_3$: $(-v, -v, 3v, -v, 0)^T$
    $S_2$: $(2v, -v, 0, -v, 0)^T$   $(-v, 2v, 0, -v, 0)^T$   $(-v, -v, 0, 2v, 0)^T$
$S_3$: $(-v, -v, -v, 3v, 0)^T$
    $S_2$: $(2v, -v, -v, 0, 0)^T$   $(-v, 2v, -v, 0, 0)^T$   $(-v, -v, 2v, 0, 0)^T$
1
A partial subgroup lattice for $S_4$ and the bifurcating
directions corresponding to subgroups isomorphic to $S_2 \times S_2$
(signs are such that the class blocks of each direction sum to zero)
$S_4$
$\langle (12), (34) \rangle$: $(v, v, -v, -v)^T$
$\langle (13), (24) \rangle$: $(v, -v, v, -v)^T$
$\langle (14), (23) \rangle$: $(v, -v, -v, v)^T$
This theory
enables us to answer the
questions previously posed …
Conceptual Bifurcation Structure
[Figure: conceptual bifurcation diagram of q* against $\beta$, with branches labelled by the subgroups $S_4$, $S_3$, $S_2$ and 1.]
Why are there only 3 bifurcations observed? In general, are there only N-1 bifurcations?
There are N-1 symmetry breaking bifurcations from $S_M$ to $S_{M-1}$ for $M \leq N$.
What kinds of bifurcations do we expect: pitchfork-like, transcritical, saddle-node, or some other type?
How many bifurcating solutions are there? There are at least N from the first bifurcation, at least N-1 from the next one, etc.
What do the bifurcating branches look like? They are subcritical or supercritical depending on the sign of the bifurcation discriminator $\zeta(q^*, \beta^*, u_k)$.
What is the stability of the bifurcating branches? Is there always a bifurcating branch which contains solutions of the optimization problem? No.
Are there bifurcations after all of the classes have resolved? Generically, no.
Continuation techniques
numerically illustrate the theory
using the
Information Distortion
[Figures: bifurcation diagrams of q* against $\beta$, with I(Y,Z) in bits. Panels: bifurcating branches with symmetry $S_2 \times S_2 = \langle (12), (34) \rangle$; additional structure; a closer look; the bifurcation from $S_4$ to $S_3$.]
The bifurcation from $S_4$ to $S_3$ is subcritical
(the theory predicted this since the bifurcation discriminator $\zeta(q_{1/4}, \beta^*, u) < 0$).
What does this mean regarding solutions of the original problems?
(4) $R_H(I_0) = \max_{q \in \Delta} H(Z|X)$ constrained by $I(Y,Z) \geq I_0$
(7) $\max_{q \in \Delta} (H(Z|X) + \beta I(Y,Z))$
$R_H$ as a function of $I_0$:
• is not convex and not concave
• is a monotonically decreasing, continuous function
Theorem:
• $dR/dI_0 = -\beta(I_0)$
• $d^2R/dI_0^2 = -d\beta(I_0)/dI_0$
[Figure: $R_H$ plotted against $I_0$.]
Consequences??
• Analogue for the Information Distortion
$R_H(I_0) = \max_q H(Z|X)$ constrained by $I(Y;Z) \geq I_0$
is neither concave nor convex since subcritical bifurcations and saddle nodes exist.
• Rate Distortion Function (from Information Theory)
$R(D_0) = \min_q I(X;Z)$ constrained by $D(X,Z) \leq D_0$
is convex if $D(X,Z)$ is linear in q (Rose, 1994; Cover and Thomas; Gray).
• Relevance Compression Function (for Information Bottleneck)
$R_I(I_0) = \min_q I(X;Z)$ constrained by $I(Y;Z) \geq I_0$
is convex if N > K+1 (Witsenhausen and Wyner 1975, Bachrach et al 2003)
So What??
• $R_I(I_0)$ and $R_H(I_0)$ are related by $I(X;Z) = H(Z) - H(Z|X)$.
• The Information Bottleneck cannot have a subcritical bifurcation when
N > K+1. Are there subcritical bifurcations when N < K+1?
• Is $R_H(I_0)$ convex when N > K+1? That would mean that the subcritical
bifurcations go away when considering the gradient flow in $\mathbb{R}^{(K+2)K}$ instead of
$\mathbb{R}^{NK}$.
Application to cricket sensory data
[Figure: the optimal clustering of the spike patterns, and E(Y|Z), the stimulus means conditioned on each of the classes.]
Conclusions …
• We have a complete theoretical picture of how the
clusterings evolve for a class of annealing problems of the
form
$\max_{q \in \Delta} (G(q) + \beta D(q))$
subject to the assumptions stated earlier.
o When clustering to N classes, there are N-1 bifurcations.
o In general, there are only pitchfork and saddle-node bifurcations.
o We can determine whether pitchfork bifurcations are
subcritical or supercritical (1st or 2nd order phase transitions).
o We know the explicit bifurcating directions.
• SO WHAT??
• There are theoretical consequences …
• This suggests an algorithm for solving the
annealing problem … (NIPS 2002)