An Introduction to Deterministic Annealing

Fabrice Rossi
SAMM – Université Paris 1
2012
Optimization
- in machine learning:
  - optimization is one of the central tools
  - methodology:
    - choose a model with some adjustable parameters
    - choose a goodness of fit measure of the model to some data
    - tune the parameters in order to maximize the goodness of fit
  - examples: artificial neural networks, support vector machines, etc.
- in other fields:
  - operational research: project planning, routing, scheduling, etc.
  - design: antennas, wings, engines, etc.
  - etc.
Easy or Hard?
- is optimization computationally difficult?
- the convex case is relatively easy:
  - $\min_{x \in C} J(x)$ with $C$ convex and $J$ convex
  - polynomial algorithms (in general)
  - why?
    - a local minimum is global
    - the local tangent hyperplane is a global lower bound
- the non convex case is hard:
  - multiple minima
  - no local to global inference
  - NP hard in some cases
In machine learning
- convex case:
  - linear models with(out) regularization (ridge, lasso)
  - kernel machines (SVM and friends)
  - nonlinear projections (e.g., semi-definite embedding)
- non convex case:
  - artificial neural networks (such as multilayer perceptrons)
  - vector quantization (a.k.a. prototype based clustering)
  - general clustering
  - optimization with respect to meta parameters:
    - kernel parameters
    - discrete parameters, e.g., feature selection
- in this lecture, non convex problems:
  - combinatorial optimization
  - mixed optimization
Particular cases
- Pure combinatorial optimization problems:
  - a solution space $\mathcal{M}$: finite but (very) large
  - an error measure $E$ from $\mathcal{M}$ to $\mathbb{R}$
  - goal: solve
    $$M^* = \arg\min_{M \in \mathcal{M}} E(M)$$
  - example: graph clustering
- Mixed problems:
  - a solution space $\mathcal{M} \times \mathbb{R}^p$
  - an error measure $E$ from $\mathcal{M} \times \mathbb{R}^p$ to $\mathbb{R}$
  - goal: solve
    $$(M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$$
  - example: clustering in $\mathbb{R}^n$
Graph clustering
Goal: find an optimal clustering of a graph with N nodes in K clusters
- $\mathcal{M}$: set of all partitions of $\{1, \ldots, N\}$ in $K$ classes (the asymptotic behavior of $|\mathcal{M}|$ is roughly $K^{N-1}$ for a fixed $K$ with $N \to \infty$)
- assignment matrix: a partition in $\mathcal{M}$ is described by an $N \times K$ matrix $M$ such that $M_{ik} \in \{0, 1\}$ and $\sum_{k=1}^{K} M_{ik} = 1$
- many error measures are available:
  - graph cut measures (node normalized, edge normalized, etc.)
  - modularity
  - etc.
Graph clustering
[Figure: a small example graph (nodes numbered 1 to 34), shown as the original graph]

Graph clustering
- original graph
- four clusters
[Figure: the same graph partitioned into four clusters]
Graph visualization
[Figure: a social network of 2386 persons (women, bisexual men, heterosexual men); the raw drawing is unreadable]
- 2386 persons: unreadable
- maximal modularity clustering in 39 clusters
- hierarchical display
Vector quantization
- $N$ observations $(x_i)_{1 \le i \le N}$ in $\mathbb{R}^n$
- $\mathcal{M}$: set of all partitions
- quantization error:
  $$E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \|x_i - w_k\|^2$$
- the "continuous" parameters are the prototypes $w_k \in \mathbb{R}^n$
- the assignment matrix notation is equivalent to the standard formulation
  $$E(M, W) = \sum_{i=1}^{N} \|x_i - w_{k(i)}\|^2,$$
  where $k(i)$ is the index of the cluster to which $x_i$ is assigned
Example
[Figure: a simple two-dimensional dataset (X1, X2) used as a running vector quantization example]
Combinatorial optimization
- a research field by itself
- many problems are NP hard... and one therefore relies on heuristics or specialized approaches:
  - relaxation methods
  - branch and bound methods
  - stochastic approaches:
    - simulated annealing
    - genetic algorithms
- in this presentation: deterministic annealing
Outline
Introduction
Mixed problems
Soft minimum
Computing the soft minimum
Evolution of β
Deterministic Annealing
Annealing
Maximum entropy
Phase transitions
Mass constrained deterministic annealing
Combinatorial problems
Expectation approximations
Mean field annealing
In practice
Optimization strategies
- mixed problem transformation
  $$(M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W),$$
- remove the continuous part
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big[ M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big]$$
- or the combinatorial part
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big[ W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big]$$
- this is not alternate optimization
Alternate optimization
- elementary heuristics:
  1. start with a random configuration $M^0 \in \mathcal{M}$
  2. compute $W^i = \arg\min_{W \in \mathbb{R}^p} E(M^{i-1}, W)$
  3. compute $M^i = \arg\min_{M \in \mathcal{M}} E(M, W^i)$
  4. go back to 2 until convergence
- e.g., the k-means algorithm for vector quantization (a sketch in code follows this slide):
  1. start with a random partition $M^0 \in \mathcal{M}$
  2. compute the optimal prototypes with
     $$W_k^i = \frac{\sum_{j=1}^{N} M_{jk}^{i-1} x_j}{\sum_{j=1}^{N} M_{jk}^{i-1}}$$
  3. compute the optimal partition with $M_{jk}^i = 1$ if and only if $k = \arg\min_{1 \le l \le K} \|x_j - W_l^i\|^2$
  4. go back to 2 until convergence
- alternate optimization converges to a local minimum
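A minimal sketch, in Python/NumPy, of the alternate optimization heuristic above for vector quantization (k-means). The one-hot matrix `M` plays the role of the slide's assignment matrix; the random initialization, iteration cap and empty-cluster guard are assumptions not specified on the slide.

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=np.random.default_rng(0)):
    N, _ = X.shape
    # random initial partition M^0 as a one-hot N x K assignment matrix
    M = np.eye(K)[rng.integers(0, K, size=N)]
    for _ in range(n_iter):
        # step 2: optimal prototypes = centroids of the current clusters
        counts = M.sum(axis=0)
        counts[counts == 0] = 1.0          # guard against empty clusters
        W = (M.T @ X) / counts[:, None]
        # step 3: optimal partition = nearest prototype for each point
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        new_M = np.eye(K)[d2.argmin(axis=1)]
        if np.array_equal(new_M, M):       # step 4: stop at convergence
            break
        M = new_M
    return M, W
```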
Combinatorial first
- let us consider the combinatorial first approach:
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big[ W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big]$$
- not attractive a priori, as $W \mapsto \min_{M \in \mathcal{M}} E(M, W)$:
  - has no reason to be convex
  - has no reason to be $C^1$
- vector quantization example:
  $$F(W) = \sum_{i=1}^{N} \min_{1 \le k \le K} \|x_i - w_k\|^2$$
  is neither convex nor $C^1$
Example
Clustering into 2 clusters of elements of R
[Figure: contour plot of ln F(W) as a function of the two prototypes (w1, w2)]
- multiple local minima and singularities: minimizing F(W) is difficult
Soft minimum
- a soft minimum approximation of $\min_{M \in \mathcal{M}} E(M, W)$ (a small numerical illustration follows this slide)
  $$F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$$
  solves the regularity issue: if $E(M, \cdot)$ is $C^1$, then $F_\beta(W)$ is also $C^1$
- if $M^*(W)$ is such that for all $M \ne M^*(W)$, $E(M, W) > E(M^*(W), W)$, then
  $$\lim_{\beta \to \infty} F_\beta(W) = E(M^*(W), W) = \min_{M \in \mathcal{M}} E(M, W)$$
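A small numerical illustration (not from the slides) of the soft minimum: for a handful of toy energy values, F_beta increases towards the true minimum from below as beta grows. The toy values are my own.

```python
import numpy as np

def soft_min(energies, beta):
    # F_beta = -(1/beta) * ln( sum_M exp(-beta * E(M)) ), computed stably
    e = np.asarray(energies, dtype=float)
    m = (-beta * e).max()
    return -(m + np.log(np.exp(-beta * e - m).sum())) / beta

E = np.array([3.0, 1.2, 5.7, 1.3])      # toy values of E(M, W) over a tiny M
for beta in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(beta, soft_min(E, beta))
# the printed values increase towards min(E) = 1.2 from below,
# staying within the (1/beta) * ln|M| gap stated on the "Discussion" slide
```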
Example
[Figure: a bar plot of E(M, W) over the configurations M, then the normalized weights exp(-βE(M, W))/exp(-βE(M*, W)) for β from 10^{-2} to 10^2, which progressively concentrate on the best configuration M*, and finally F_β plotted as a function of β]
Discussion
- positive aspects:
  - if $E(M, W)$ is $C^1$ with respect to $W$ for all $M$, then $F_\beta$ is $C^1$
  - $\lim_{\beta \to \infty} F_\beta(W) = \min_{M \in \mathcal{M}} E(M, W)$
  - $\forall \beta > 0$,
    $$-\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0$$
  - when $\beta$ is close to 0, $F_\beta$ is generally easy to minimize
- negative aspects:
  - how to compute $F_\beta(W)$ efficiently?
  - how to solve $\min_{W \in \mathbb{R}^p} F_\beta(W)$?
  - why would that work in practice?
  - philosophy: where does $F_\beta(W)$ come from?
Computing Fβ(W)
- in general computing $F_\beta(W)$ is intractable: exhaustive calculation of $E(M, W)$ on the whole set $\mathcal{M}$
- a particular case: clustering with an additive cost, i.e.
  $$E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik}\, e_{ik}(W),$$
  where $M$ is an assignment matrix (and e.g., $e_{ik}(W) = \|x_i - w_k\|^2$)
- then
  $$F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta e_{ik}(W))$$
  computational cost $O(NK)$ (compared to $\simeq K^{N-1}$); see the sketch below
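A sketch of the O(NK) evaluation of F_beta(W) for the additive clustering cost e_ik(W) = ||x_i - w_k||^2. Squared distances plus a stabilized log-sum-exp are the only ingredients; the function and variable names are mine, not the slides'.

```python
import numpy as np

def soft_min_energy(X, W, beta):
    # pairwise costs e_ik = ||x_i - w_k||^2, shape (N, K)
    e = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    # stabilized log-sum-exp over clusters, one term per observation
    m = (-beta * e).max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(-beta * e - m).sum(axis=1))
    return -lse.sum() / beta
```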
Proof sketch
- let $Z_\beta(W)$ be the partition function given by
  $$Z_\beta(W) = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$$
- assignments in $M$ are independent and the sum can be rewritten
  $$Z_\beta(W) = \sum_{M_{1.} \in C_K} \cdots \sum_{M_{N.} \in C_K} \exp(-\beta E(M, W)),$$
  with $C_K = \{(1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)\}$
- then
  $$Z_\beta(W) = \prod_{i=1}^{N} \sum_{M_{i.} \in C_K} \exp\Big(-\beta \sum_k M_{ik}\, e_{ik}(W)\Big)$$
Independence
- additive cost corresponds to some form of conditional independence
- given the prototypes, observations are assigned independently to their optimal clusters
- in other words: the global optimal assignment is the concatenation of the individual optimal assignments
- this is what makes alternate optimization tractable in the k-means despite its combinatorial aspect:
  $$\min_{M \in \mathcal{M}} \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \|x_i - w_k\|^2$$
Minimizing Fβ(W)
- assume $E$ to be $C^1$ with respect to $W$, then
  $$\nabla F_\beta(W) = \frac{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W)) \nabla_W E(M, W)}{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))}$$
- at a minimum $\nabla F_\beta(W) = 0$ ⇒ solve the equation or use gradient descent
- same calculation problem as for $F_\beta(W)$: in general, an exhaustive scan of $\mathcal{M}$ is needed
- if $E$ is additive
  $$\nabla F_\beta(W) = \sum_{i=1}^{N} \frac{\sum_{k=1}^{K} \exp(-\beta e_{ik}(W)) \nabla_W e_{ik}(W)}{\sum_{k=1}^{K} \exp(-\beta e_{ik}(W))}$$
Fixed point scheme
- a simple strategy to solve $\nabla F_\beta(W) = 0$
- starting from a random value of $W$:
  1. compute (expectation phase)
     $$\mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}$$
  2. keeping the $\mu_{ik}$ constant, solve for $W$ (maximization phase)
     $$\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \nabla_W e_{ik}(W) = 0$$
  3. loop to 1 until convergence
- generally converges to a minimum of $F_\beta(W)$ (for a fixed $\beta$)
Vector quantization
- if $e_{ik}(W) = \|x_i - w_k\|^2$ (vector quantization),
  $$\nabla_{w_k} F_\beta(W) = 2 \sum_{i=1}^{N} \mu_{ik}(W)(w_k - x_i),$$
  with
  $$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$$
- setting $\nabla F_\beta(W) = 0$ leads to
  $$w_k = \frac{1}{\sum_{i=1}^{N} \mu_{ik}(W)} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$$
  (a sketch of the resulting fixed point iteration follows this slide)
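A minimal sketch of the fixed point (EM-like) scheme for vector quantization at a fixed beta: soft assignments mu_ik, then weighted prototype averages. The convergence tolerance and iteration cap are assumptions.

```python
import numpy as np

def fixed_point(X, W, beta, n_iter=200, tol=1e-8):
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # e_ik
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)               # stability
        mu = np.exp(logits)
        mu /= mu.sum(axis=1, keepdims=True)                       # expectation phase
        new_W = (mu.T @ X) / mu.sum(axis=0)[:, None]               # maximization phase
        if np.abs(new_W - W).max() < tol:
            return new_W, mu
        W = new_W
    return W, mu
```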
Links with the K-means
- if $\mu_{il} = \delta_{k(i)=l}$, we obtain the prototype update rule of the k-means
- in the k-means, we have
  $$k(i) = \arg\min_{1 \le l \le K} \|x_i - w_l\|^2$$
- in deterministic annealing, we have
  $$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$$
- this is a soft minimum version of the crisp rule of the k-means
Links with EM
- isotropic Gaussian mixture with a unique and fixed variance:
  $$p_k(x \mid w_k) = \frac{1}{(2\pi)^{p/2}} e^{-\frac{1}{2}\|x - w_k\|^2}$$
- given the mixing coefficients $\delta_k$, responsibilities are
  $$P(x_i \in C_k \mid x_i, W) = \frac{\delta_k\, e^{-\frac{1}{2}\|x_i - w_k\|^2}}{\sum_{l=1}^{K} \delta_l\, e^{-\frac{1}{2}\|x_i - w_l\|^2}}$$
- identical update rules for $W$
- $\beta$ can be seen as an inverse variance: it quantifies the uncertainty about the clustering results
So far...
- if $E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik}\, e_{ik}(W)$, one tries to reach $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$ using
  $$F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta e_{ik}(W))$$
- $F_\beta(W)$ is smooth
- a fixed point EM-like scheme can be used to minimize $F_\beta(W)$ for a fixed $\beta$
- but:
  - the k-means algorithm is also an EM-like algorithm: it does not reach a global optimum
  - how to handle $\beta \to \infty$?
Limit case β → 0
- limit behavior of the EM scheme:
  - $\mu_{ik} = \frac{1}{K}$ for all $i$ and $k$ (and $W$!)
  - $W$ is therefore a solution of
    $$\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W) = 0$$
  - no iteration needed
- vector quantization:
  - we have
    $$w_k = \frac{1}{N} \sum_{i=1}^{N} x_i$$
  - each prototype is the center of mass of the data: only one cluster!
- in general the case $\beta \to 0$ is easy (unique minimum under mild hypotheses)
Example
Clustering into 2 clusters of elements of R: back to the example
[Figure: contour plots over (w1, w2) of the original problem, of a simple soft minimum with β = 10^{-1}, and of a more complex soft minimum with β = 0.5]
Increasing β
- when $\beta$ increases, $F_\beta(W)$ converges to $\min_{M \in \mathcal{M}} E(M, W)$
- optimizing $F_\beta(W)$ becomes more and more complex:
  - local minima
  - rougher and rougher: $F_\beta(W)$ remains $C^1$ but with large values of the gradient at some points
- path following strategy (homotopy):
  - for an increasing series $(\beta_l)_l$ with $\beta_0 = 0$
  - initialize $W_0^* = \arg\min_W F_0(W)$ for $\beta = 0$ by solving $\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W) = 0$
  - compute $W_l^*$ via the EM scheme for $\beta_l$ starting from $W_{l-1}^*$
- a similar strategy is used in interior point algorithms
Example
Clustering in 2 classes of 6 elements from R (the points are approximately 1, 2, 3, 7, 7.5 and 8.25)

Behavior of Fβ(W)
[Figure: contour plots over (w1, w2) of min_{M∈M} E(M, W) and of F_β(W) for β = 10^{-2}, 10^{-1.75}, ..., 10^{-0.25}, 1: as β grows, F_β approaches the non-smooth objective]
One step closer to the final algorithm
- given an increasing series $(\beta_l)_l$ with $\beta_0 = 0$:
  1. compute $W_0^*$ such that $\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W_0^*) = 0$
  2. for $l = 1$ to $L$:
     2.1 initialize $W_l^*$ to $W_{l-1}^*$
     2.2 compute $\mu_{ik} = \frac{\exp(-\beta e_{ik}(W_l^*))}{\sum_{m=1}^{K} \exp(-\beta e_{im}(W_l^*))}$
     2.3 update $W_l^*$ such that $\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \nabla_W e_{ik}(W_l^*) = 0$
     2.4 go back to 2.2 until convergence
  3. use $W_L^*$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} \big(\min_{M \in \mathcal{M}} E(M, W)\big)$
- this is (up to some technical details) the deterministic annealing algorithm for minimizing $E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik}\, e_{ik}(W)$ (a sketch in code follows)
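A sketch of the whole annealing loop for vector quantization, wrapping the EM scheme of the previous slides. The geometric beta schedule and the tiny noise added at each stage (which anticipates the noise injection discussed later on the "Stability" slide) are assumptions; the slides only require an increasing sequence of beta values.

```python
import numpy as np

def deterministic_annealing(X, K, betas, inner_iter=100, rng=np.random.default_rng(0)):
    _, p = X.shape
    W = np.tile(X.mean(axis=0), (K, 1))                 # beta -> 0 solution: the global mean
    for beta in betas:                                   # path following over the beta_l
        W = W + 1e-6 * rng.standard_normal((K, p))       # tiny noise (see "Stability" later)
        for _ in range(inner_iter):                      # EM scheme at fixed beta
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)          # mu_ik
            new_W = (mu.T @ X) / mu.sum(axis=0)[:, None]  # prototype update
            if np.abs(new_W - W).max() < 1e-9:
                W = new_W
                break
            W = new_W
    return W, mu

# usage sketch: W, mu = deterministic_annealing(X, K=4, betas=np.geomspace(1e-2, 1e2, 50))
```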
What’s next?
- general topics:
  - why the name annealing?
  - why should that work? (and other philosophical issues)
- specific topics:
  - technical "details" (e.g., annealing schedule)
  - generalization:
    - non additive criteria
    - general combinatorial problems
Annealing
- from Wikipedia:
  Annealing is a heat treatment wherein a material is altered, causing changes in its properties such as strength and hardness. It is a process that produces conditions by heating to above the re-crystallization temperature, maintaining a suitable temperature, and then cooling.
- in our context, we'll see that
  - $T = \frac{1}{k_B \beta}$ acts as a temperature and models some thermal agitation
  - $E(W, M)$ is the energy of a configuration of the system, while $F_\beta(W)$ corresponds to the free energy of the system at a given temperature
  - increasing $\beta$ reduces the thermal agitation
Simulated Annealing
- a classical combinatorial optimization algorithm for computing $\min_{M \in \mathcal{M}} E(M)$
- given an increasing series $(\beta_l)_l$
  1. choose a random initial configuration $M_0$
  2. for $l = 1$ to $L$
     2.1 take a small random step from $M_{l-1}$ to build $M_l^c$ (e.g., change the cluster of an object)
     2.2 if $E(M_l^c) < E(M_{l-1})$ then set $M_l = M_l^c$
     2.3 else set $M_l = M_l^c$ with probability $\exp(-\beta_l (E(M_l^c) - E(M_{l-1})))$
     2.4 and set $M_l = M_{l-1}$ otherwise
  3. use $M_L$ as an estimation of $\arg\min_{M \in \mathcal{M}} E(M)$
- naive vision (a sketch in code follows this slide):
  - always accept improvements
  - "thermal agitation" allows escaping local minima
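A minimal sketch of simulated annealing for a clustering-type objective E(M). The proposal (move one randomly chosen object to a random cluster) and the way the schedule is passed in are assumptions, not prescribed by the slide.

```python
import numpy as np

def simulated_annealing(E, N, K, betas, rng=np.random.default_rng(0)):
    M = rng.integers(0, K, size=N)          # cluster label of each object
    energy = E(M)
    for beta in betas:
        cand = M.copy()
        i = rng.integers(N)
        cand[i] = rng.integers(K)           # small random step
        cand_energy = E(cand)
        delta = cand_energy - energy
        # accept improvements, and worse moves with probability exp(-beta * delta)
        if delta < 0 or rng.random() < np.exp(-beta * delta):
            M, energy = cand, cand_energy
    return M
```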
Statistical physics
- consider a system with state space $\mathcal{M}$
- denote $E(M)$ the energy of the system when in state $M$
- at thermal equilibrium with the environment, the probability for the system to be in state $M$ is given by the Boltzmann (Gibbs) distribution
  $$P_T(M) = \frac{1}{Z_T} \exp\left(-\frac{E(M)}{k_B T}\right)$$
  with $T$ the temperature, $k_B$ Boltzmann's constant, and
  $$Z_T = \sum_{M \in \mathcal{M}} \exp\left(-\frac{E(M)}{k_B T}\right)$$
Back to annealing
- physical analogy:
  - maintain the system at thermal equilibrium:
    - set a temperature
    - wait for the system to settle at this temperature
  - slowly decrease the temperature:
    - works well in real systems (e.g., crystallization)
    - allows the system to explore the state space
- computer implementation:
  - direct computation of $P_T(M)$ (or related quantities)
  - sampling from $P_T(M)$
Simulated Annealing Revisited
- simulated annealing samples from $P_T(M)$!
- more precisely: the asymptotic distribution of $M_l$ for a fixed $\beta$ is given by $P_{1/(k_B \beta)}$
- how? Metropolis-Hastings Markov Chain Monte Carlo
- principle:
  - $P$ is the target distribution
  - $Q(\cdot \mid \cdot)$ is the proposal distribution (sampling friendly)
  - start with $x^t$, get $x'$ from $Q(x \mid x^t)$
  - set $x^{t+1}$ to $x'$ with probability
    $$\min\left(1, \frac{P(x')\, Q(x^t \mid x')}{P(x^t)\, Q(x' \mid x^t)}\right)$$
  - keep $x^{t+1} = x^t$ when this fails
Simulated Annealing Revisited
- in SA, $Q$ is the random local perturbation
- major feature of Metropolis-Hastings MCMC:
  $$\frac{P_T(M')}{P_T(M)} = \exp\left(-\frac{E(M') - E(M)}{k_B T}\right)$$
  so $Z_T$ is not needed
- underlying assumption: symmetric proposals $Q(x^t \mid x') = Q(x' \mid x^t)$
- rationale:
  - sampling directly from $P_T(M)$ for small $T$ is difficult
  - track likely areas during cooling
Mixed problems
- Simulated Annealing applies to combinatorial problems
- in mixed problems, this would mean removing the continuous part
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big[ M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big]$$
- in the vector quantization example
  $$E(M) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \left\| x_i - \frac{1}{\sum_{j=1}^{N} M_{jk}} \sum_{j=1}^{N} M_{jk}\, x_j \right\|^2$$
- no obvious direct relation to the proposed algorithm
Statistical physics
- thermodynamic (free) energy: the useful energy available in a system (a.k.a. the energy that can be extracted to produce work)
- Helmholtz (free) energy is $F_T = U - TS$, where $U$ is the internal energy of the system and $S$ its entropy
- one can show that
  $$F_T = -k_B T \ln Z_T$$
- the natural evolution of a system is to reduce its free energy
- deterministic annealing mimics the cooling process of a system by tracking the evolution of the minimal free energy
Gibbs distribution
- still no use of $P_T$...
- important property: for $f$ a function defined on $\mathcal{M}$,
  $$E_{P_T}(f(M)) = \frac{1}{Z_T} \sum_{M \in \mathcal{M}} f(M) \exp\left(-\frac{E(M)}{k_B T}\right)$$
- then
  $$\lim_{T \to 0} E_{P_T}(f(M)) = f(M^*),$$
  where $M^* = \arg\min_{M \in \mathcal{M}} E(M)$
- useful to track global information
Membership functions
- in the EM-like phase, we have
  $$\mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}$$
- this does not come out of thin air:
  $$\mu_{ik} = E_{P_{\beta,W}}(M_{ik}) = \sum_{M \in \mathcal{M}} M_{ik}\, P_{\beta,W}(M)$$
- $\mu_{ik}$ is therefore the probability for $x_i$ to belong to cluster $k$ under the Gibbs distribution
- at the limit $\beta \to \infty$, the $\mu_{ik}$ peak to Dirac-like membership functions: they give the "optimal" partition
Example
Clustering in 2 classes of 6 elements from R (the points are approximately 1, 2, 3, 7, 7.5 and 8.25)

Membership functions
[Figure: the membership functions μ_{ik} plotted over the data range for a sequence of increasing β, together with the prototype positions (w1, w2); the memberships are flat at high temperature and sharpen as β grows, with snapshots shown just before and just after a phase transition]
Summary
- two principles:
  - physical systems tend to reach a state of minimal free energy at a given temperature
  - when the temperature is slowly decreased, a system tends to reach a well organized state
- annealing algorithms tend to reproduce this slow cooling behavior
- simulated annealing:
  - uses Metropolis-Hastings MCMC to sample the Gibbs distribution
  - easy to implement but needs a very slow cooling
- deterministic annealing:
  - direct minimization of the free energy
  - computes expectations of global quantities with respect to the Gibbs distribution
  - aggressive cooling is possible, but the $Z_T$ computation must be tractable
Maximum entropy
- another interpretation/justification of DA
- reformulation of the problem: find a probability distribution on $\mathcal{M}$ which is "regular" and gives a low average error
- in other words: find $P_W$ such that
  - $\sum_{M \in \mathcal{M}} P_W(M) = 1$
  - $\sum_{M \in \mathcal{M}} E(W, M) P_W(M)$ is small
  - the entropy of $P_W$, $-\sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M)$, is high
- entropy plays a regularization role
- this leads to the minimization of
  $$\sum_{M \in \mathcal{M}} E(W, M) P_W(M) + \frac{1}{\beta} \sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M)$$
  where $\beta$ sets the trade-off between fitting and regularity
Maximum entropy
- some calculations (sketched after this slide) show that the minimum over $P_W$ is given by
  $$P_{\beta,W}(M) = \frac{1}{Z_{\beta,W}} \exp(-\beta E(M, W)),$$
  with
  $$Z_{\beta,W} = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W)).$$
- plugged back into the optimization criterion, we end up with the soft minimum
  $$F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$$
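For completeness, a sketch of the "some calculations" above, under the assumption that the normalization constraint is handled with a Lagrange multiplier (positivity of P_W turns out to hold automatically):

```latex
% Constrained minimization behind the Gibbs solution
\begin{aligned}
L(P_W,\lambda) &= \sum_{M\in\mathcal{M}} E(W,M)P_W(M)
  + \frac{1}{\beta}\sum_{M\in\mathcal{M}} P_W(M)\ln P_W(M)
  + \lambda\Big(\sum_{M\in\mathcal{M}} P_W(M) - 1\Big),\\
\frac{\partial L}{\partial P_W(M)} &= E(W,M) + \frac{1}{\beta}\big(\ln P_W(M) + 1\big) + \lambda = 0
  \;\Longrightarrow\; P_W(M) \propto \exp\big(-\beta E(W,M)\big),\\
&\text{and normalization gives } P_{\beta,W}(M) = \frac{\exp(-\beta E(M,W))}{Z_{\beta,W}}.
\end{aligned}
```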
So far...
- three derivations of the deterministic annealing:
  - soft minimum
  - thermodynamics inspired annealing
  - maximum entropy principle
- all of them boil down to two principles:
  - replace the crisp optimization problem in the $\mathcal{M}$ space by a soft one
  - track the evolution of the solution when the crispness of the approximation increases
- now the technical details...
Fixed point
- remember the EM phase:
  $$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$$
  $$w_k = \frac{1}{\sum_{i=1}^{N} \mu_{ik}(W)} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$$
- this is a fixed point method, i.e., $W = U_\beta(W)$ for a data dependent $U_\beta$
- the fixed point is generally stable, except during a phase transition
Phase transition
[Figure: membership functions and prototypes (w1, w2) just before and just after a phase transition]

Stability
[Figure: behavior of the fixed point iteration in the (w1, w2) plane around a phase transition]
Stability
- unstable fixed point:
  - even during a phase transition, the exact previous fixed point can remain a fixed point
  - but we want the transition to take place: this is a new cluster birth!
  - fortunately, the previous fixed point is unstable during a phase transition
- modified EM phase:
  1. initialize $W_l^*$ to $W_{l-1}^* + \varepsilon$
  2. compute $\mu_{ik} = \frac{\exp(-\beta e_{ik}(W_l^*))}{\sum_{m=1}^{K} \exp(-\beta e_{im}(W_l^*))}$
  3. update $W_l^*$ such that $\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \nabla_W e_{ik}(W_l^*) = 0$
  4. go back to 2 until convergence
- the noise is important to prevent missing a phase transition
Annealing strategy
- neglecting symmetries, there are $K - 1$ phase transitions for $K$ clusters
- tracking $W^*$ between transitions is easy
- trade-off:
  - slow annealing: long running time but no transition is missed
  - fast annealing: quicker but with a higher risk of missing a transition
- but critical temperatures can be computed:
  - via a stability analysis of fixed points
  - corresponds to a linear approximation of the fixed point equation $U_\beta$ around a fixed point
Critical temperatures
- for vector quantization
- first temperature:
  - the fixed point stability is related to the variance of the data
  - the critical $\beta$ is $1/(2\lambda_{\max})$ where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix of the data
- in general (a sketch in code follows this slide):
  - the stability of the whole system is related to the stability of each cluster
  - the $\mu_{ik}$ play the role of membership functions for each cluster
  - the critical $\beta$ for cluster $k$ is $1/(2\lambda_{\max}^k)$ where $\lambda_{\max}^k$ is the largest eigenvalue of
    $$\sum_i \mu_{ik}\, (x_i - w_k)(x_i - w_k)^T$$
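A sketch of the critical beta computation described above: largest eigenvalue of the membership-weighted scatter around a prototype. I normalize by the total membership so that the uniform-weight case reduces to the data covariance used for the very first critical temperature; that normalization choice is my assumption.

```python
import numpy as np

def critical_beta(X, w, mu_k=None):
    # mu_k: soft memberships of cluster k (defaults to uniform weights)
    if mu_k is None:
        mu_k = np.ones(X.shape[0])
    D = X - w                                   # deviations from the prototype
    C = (mu_k[:, None] * D).T @ D / mu_k.sum()  # weighted covariance around w
    lam_max = np.linalg.eigvalsh(C).max()
    return 1.0 / (2.0 * lam_max)
```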
A possible final algorithm
1. initialize $W^1 = \frac{1}{N} \sum_{i=1}^{N} x_i$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \frac{\exp(-\beta \|x_i - W_k^l\|^2)}{\sum_{m=1}^{K} \exp(-\beta \|x_i - W_m^l\|^2)}$
       2.3.3 compute $W_k^l = \frac{1}{\sum_{i=1}^{N} \mu_{ik}} \sum_{i=1}^{N} \mu_{ik}\, x_i$
       2.3.4 go back to 2.3.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} \big(\min_{M \in \mathcal{M}} E(M, W)\big)$ and the $\mu_{ik}$ as the corresponding $M$
Computational cost
- $N$ observations in $\mathbb{R}^p$ and $K$ clusters
- prototype storage: $Kp$
- $\mu$ matrix storage: $NK$
- one EM iteration costs $O(NKp)$
- we need at most $K - 1$ full EM runs: $O(NK^2 p)$
- compared to the K-means:
  - more storage
  - no simple distance calculation tricks: all distances must be computed
  - roughly corresponds to $K - 1$ runs of the k-means, not taking into account the number of distinct $\beta$ considered during each phase transition
  - neglecting the eigenvalue analysis (in $O(p^3)$)...
Collapsed prototypes
- when $\beta \to 0$, only one cluster
- waste of computational resources: $K$ identical calculations
- this is a general problem:
  - each phase transition adds new clusters
  - prior to that, prototypes are collapsed
  - but we need them to track cluster births...
- a related problem: uniform cluster weights
- in a Gaussian mixture
  $$P(x_i \in C_k \mid x_i, W) = \frac{\delta_k\, e^{-\frac{1}{2}\|x_i - w_k\|^2}}{\sum_{l=1}^{K} \delta_l\, e^{-\frac{1}{2}\|x_i - w_l\|^2}}$$
  prototypes are weighted by the mixing coefficients $\delta$
Mass constrained DA
- we introduce prototype weights $\delta_k$ and plug them into the soft minimum formulation
  $$F_\beta(W, \delta) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \delta_k \exp(-\beta e_{ik}(W))$$
- $F_\beta$ is minimized under the constraint $\sum_{k=1}^{K} \delta_k = 1$
- this leads to (a sketch of one update step follows this slide)
  $$\mu_{ik}(W) = \frac{\delta_k \exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \delta_l \exp(-\beta \|x_i - w_l\|^2)}$$
  $$\delta_k = \frac{1}{N} \sum_{i=1}^{N} \mu_{ik}(W)$$
  $$w_k = \frac{1}{N \delta_k} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$$
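A sketch of one mass constrained DA update at a fixed beta, following the three formulas above: weighted soft assignments, weight update, prototype update. Array shapes and loop control are assumptions.

```python
import numpy as np

def mcda_step(X, W, delta, beta):
    N = X.shape[0]
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(delta)[None, :] - beta * d2
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)                    # mu_ik
    delta = mu.sum(axis=0) / N                              # cluster weights delta_k
    W = (mu.T @ X) / (N * delta)[:, None]                   # prototypes w_k
    return W, delta, mu
```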
MCDA
- increases the similarity with EM for Gaussian mixtures:
  - exactly the same algorithm for a fixed $\beta$
  - isotropic Gaussian mixture with variance $\frac{1}{2\beta}$
- however:
  - the variance is generally a parameter in Gaussian mixtures
  - different goals: minimal distortion versus maximum likelihood
  - no support for other distributions in DA
Cluster birth monitoring
- avoid wasting computational resources:
  - consider a cluster $C_k$ with a prototype $w_k$
  - prior to a phase transition in $C_k$:
    - duplicate $w_k$ into $w_k'$
    - apply some noise to both prototypes
    - split $\delta_k$ into $\frac{\delta_k}{2}$ and $\frac{\delta_k}{2}$
    - apply the EM-like algorithm and monitor $\|w_k - w_k'\|$
    - accept the new cluster if $\|w_k - w_k'\|$ becomes "large"
- two strategies:
  - eigenvalue based analysis (costly, but quite accurate)
  - opportunistic: always maintain duplicate prototypes and promote diverging ones
Final algorithm
1. initialize $W^1 = \frac{1}{N} \sum_{i=1}^{N} x_i$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 duplicate the prototype of the critical cluster and split the associated weight
   2.4 for $t$ values of $\beta$ around $\beta_l$:
       2.4.1 add some small noise to $W^l$
       2.4.2 compute $\mu_{ik} = \frac{\delta_k \exp(-\beta \|x_i - W_k^l\|^2)}{\sum_{m=1}^{K} \delta_m \exp(-\beta \|x_i - W_m^l\|^2)}$
       2.4.3 compute $\delta_k = \frac{1}{N} \sum_{i=1}^{N} \mu_{ik}(W)$
       2.4.4 compute $W_k^l = \frac{1}{N \delta_k} \sum_{i=1}^{N} \mu_{ik}\, x_i$
       2.4.5 go back to 2.4.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} \big(\min_{M \in \mathcal{M}} E(M, W)\big)$ and the $\mu_{ik}$ as the corresponding $M$
Example
[Figure: a simple two-dimensional dataset (X1, X2)]

Example
[Figure: evolution of the error E during annealing as β increases from about 0.1 to 3.4, with the cluster births marked]

Clusters
[Figure: a sequence of scatter plots of the dataset showing the clusters and prototypes obtained after each successive phase transition]
Summary
- deterministic annealing addresses combinatorial optimization by smoothing the cost function and tracking the evolution of the solution while the smoothing progressively vanishes
- motivated by:
  - heuristics (soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximal entropy)
- in practice:
  - a rather simple multiple EM-like algorithm
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - excellent results (frequently better than those obtained by the K-means)
Combinatorial only
- the situation is quite different for a combinatorial problem
  $$\arg\min_{M \in \mathcal{M}} E(M)$$
- no soft minimum heuristic
- no direct use of the Gibbs distribution $P_T$; one needs to rely on $E_{P_T}(f(M))$ for interesting $f$
- calculation of $P_T$ is not tractable:
  - tractability is related to independence
  - if $E(M) = E(M_1) + \ldots + E(M_D)$, the sums over $M_1, \ldots, M_D$ factorize and independent optimization can be done on each variable...
Graph clustering
- a non oriented graph with $N$ nodes and weight matrix $A$ ($A_{ij}$ is the weight of the connection between nodes $i$ and $j$)
- maximal modularity clustering:
  - node degree $k_i = \sum_{j=1}^{N} A_{ij}$, total weight $m = \frac{1}{2} \sum_{i,j} A_{ij}$
  - null model $P_{ij} = \frac{k_i k_j}{2m}$, $B_{ij} = \frac{1}{2m}(A_{ij} - P_{ij})$
  - assignment matrix $M$
  - modularity of the clustering (a small computation sketch follows this slide)
    $$\mathrm{Mod}(M) = \sum_{i,j} \sum_{k} M_{ik} M_{jk} B_{ij}$$
- difficulty: the coupling $M_{ik} M_{jk}$ (non linearity)
- in the sequel $E(M) = -\mathrm{Mod}(M)$
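A small sketch (not from the slides) computing the modularity of a given partition directly from the weight matrix A, following the definitions above; the label-vector input format is my choice.

```python
import numpy as np

def modularity(A, labels, K):
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)                    # k_i
    m = A.sum() / 2.0                          # total weight
    B = (A - np.outer(degrees, degrees) / (2 * m)) / (2 * m)
    M = np.eye(K)[labels]                      # assignment matrix, N x K
    # Mod(M) = sum_{i,j,k} M_ik M_jk B_ij
    return np.einsum('ik,jk,ij->', M, M, B)
```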
Strategy for DA
- choose some interesting statistics $E_{P_T}(f(M))$:
  - for instance $E_{P_T}(M_{ik})$ for clustering
- compute an approximation of $E_{P_T}(f(M))$ at high temperature $T$
- track the evolution of the approximation while lowering $T$
- use $\lim_{T \to 0} E_{P_T}(f(M)) = f(\arg\min_{M \in \mathcal{M}} E(M))$ to claim that the approximation converges to the interesting limit
- rationale: approximating $E_{P_T}(f(M))$ for a low $T$ is probably difficult; we use the path following strategy to ease the task
Example
[Figure: a one-dimensional toy problem; the Gibbs probability distribution over x is plotted for β from 10^{-1} to 10^2 and concentrates on the optimum as β grows; a final plot shows the resulting estimate x̂ as a function of β]
Expectation approximation
- computing
  $$E_{P_Z}(f(Z)) = \int f(Z)\, dP_Z$$
  has many applications
- for instance, in machine learning:
  - performance evaluation (generalization error)
  - EM for probabilistic models
  - Bayesian approaches
- two major numerical approaches:
  - sampling
  - variational approximation
Sampling
- (strong) law of large numbers
  $$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(z_i) = \int f(Z)\, dP_Z$$
  if the $z_i$ are independent samples of $Z$
- many extensions, especially to dependent samples (slower convergence)
- MCMC methods: build a Markov chain with stationary distribution $P_Z$ and take the average of $f$ on a trajectory
- applied to $E_{P_T}(f(M))$: this is simulated annealing!
Variational approximation
- main idea:
  - the intractability of the calculation of $E_{P_Z}(f(Z))$ is induced by the complexity of $P_Z$
  - let's replace $P_Z$ by a simpler distribution $Q_Z$...
  - ...and $E_{P_Z}(f(Z))$ by $E_{Q_Z}(f(Z))$
- calculus of variations:
  - optimization over functional spaces
  - in this context: choose an optimal $Q_Z$ in a space of probability measures $\mathcal{Q}$ with respect to a goodness of fit criterion between $P_Z$ and $Q_Z$
- probabilistic context:
  - simple distributions: factorized distributions
  - quality criterion: Kullback-Leibler divergence
Variational approximation
- assume $Z = (Z_1, \ldots, Z_D) \in \mathbb{R}^D$ and $f(Z) = \sum_{i=1}^{D} f_i(Z_i)$
- for a general $P_Z$, the structure of $f$ cannot be exploited in $E_{P_Z}(f(Z))$
- variational approximation:
  - choose $\mathcal{Q}$, a set of tractable distributions on $\mathbb{R}$
  - solve
    $$Q^* = \arg\min_{Q = (Q_1 \times \ldots \times Q_D) \in \mathcal{Q}^D} KL(Q \| P_Z)$$
    with
    $$KL(Q \| P_Z) = -\int \ln \frac{dP_Z}{dQ}\, dQ$$
  - approximate $E_{P_Z}(f(Z))$ by
    $$E_{Q^*}(f(Z)) = \sum_{i=1}^{D} E_{Q_i^*}(f_i(Z_i))$$
Application to DA
- general form: $P_\beta(M) = \frac{1}{Z_\beta} \exp(-\beta E(M))$
- intractable because $E(M)$ is not linear in $M = (M_1, \ldots, M_D)$
- variational approximation:
  - replace $E$ by $F(W, M)$ defined by
    $$F(W, M) = \sum_{i=1}^{D} W_i M_i$$
  - use for $Q_{W,\beta}$
    $$Q_{W,\beta} = \frac{1}{Z_{W,\beta}} \exp(-\beta F(W, M))$$
  - optimize $W$ via $KL(Q_{W,\beta} \| P_\beta)$
Fitting the approximation
- a complex expectation calculation is replaced by a new optimization problem
- we have
  $$KL(Q_{W,\beta} \| P_\beta) = \ln Z_\beta - \ln Z_{W,\beta} + \beta\, E_{Q_{W,\beta}}\big(E(M) - F(W, M)\big)$$
  - tractable by factorization of $F$, except for $Z_\beta$
- solved by $\nabla_W KL(Q_{W,\beta} \| P_\beta) = 0$
  - tractable because $Z_\beta$ does not depend on $W$
  - generally solved via a fixed point approach (EM-like again)
  - similar to variational approximation in EM or Bayesian methods
Example
- in the case of (graph) clustering, we use
  $$F(W, M) = \sum_{i=1}^{N} \sum_{k=1}^{K} W_{ik} M_{ik}$$
- a crucial (general) property is that, when $i \ne j$,
  $$E_{Q_{W,\beta}}(M_{ik} M_{jl}) = E_{Q_{W,\beta}}(M_{ik})\, E_{Q_{W,\beta}}(M_{jl})$$
- a (long and boring) series of equations leads to the general mean field equations
  $$\frac{\partial E_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \sum_{k=1}^{K} \frac{\partial E_{Q_{W,\beta}}(M_{jk})}{\partial W_{jl}}\, W_{jk}, \quad \forall j, l.$$
Example
- classical EM-like scheme to solve the mean field equations (a sketch in code follows this slide):
  1. we have
     $$E_{Q_{W,\beta}}(M_{ik}) = \frac{\exp(-\beta W_{ik})}{\sum_{l=1}^{K} \exp(-\beta W_{il})}$$
  2. keep $E_{Q_{W,\beta}}(M_{ik})$ fixed and solve the simplified mean field equations for $W$
  3. update $E_{Q_{W,\beta}}(M_{ik})$ based on the new $W$ and loop on 2 until convergence
- in the graph clustering case
  $$W_{jk} = 2 \sum_{i} E_{Q_{W,\beta}}(M_{ik})\, B_{ij}$$
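A sketch of the resulting mean field fixed point for modularity-based graph clustering at a fixed beta. I fold E(M) = -Mod(M) into the orientation of the field, so that memberships are proportional to exp(2β Σ_i B_ij μ_ik) and high-affinity clusters get high membership; this sign convention, the damping-free iteration and the stopping rule are my assumptions about how the slide's update is meant to be applied.

```python
import numpy as np

def mean_field_modularity(B, mu, beta, n_iter=200, tol=1e-8):
    # B: symmetric N x N modularity matrix, mu: initial N x K memberships
    for _ in range(n_iter):
        field = 2.0 * B @ mu                  # 2 * sum_i B_ij mu_ik, for each node j
        logits = beta * field
        logits -= logits.max(axis=1, keepdims=True)
        new_mu = np.exp(logits)
        new_mu /= new_mu.sum(axis=1, keepdims=True)
        if np.abs(new_mu - mu).max() < tol:
            return new_mu
        mu = new_mu
    return mu
```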
Summary
- deterministic annealing for combinatorial optimization
- given an objective function $E(M)$:
  - choose a linear parametric approximation $F(W, M)$ of $E(M)$, with the associated distribution $Q_{W,\beta} = \frac{1}{Z_{W,\beta}} \exp(-\beta F(W, M))$
  - write the mean field equations, i.e.
    $$\nabla_W \Big( -\ln Z_{W,\beta} + \beta\, E_{Q_{W,\beta}}\big(E(M) - F(W, M)\big) \Big) = 0$$
  - use an EM-like algorithm to solve the equations:
    - given $W$, compute $E_{Q_{W,\beta}}(M_{ik})$
    - given $E_{Q_{W,\beta}}(M_{ik})$, solve the equations
- back to our classical questions:
  - why would that work?
  - how to do that in practice?
Physical interpretation
- back to the (generalized) Helmholtz (free) energy
  $$F_\beta = U - \frac{S}{\beta}$$
- the free energy can be generalized to any distribution $Q$ on the states, by
  $$F_{\beta,Q} = E_Q(E(M)) - \frac{H(Q)}{\beta},$$
  where $H(Q) = -\sum_{M \in \mathcal{M}} Q(M) \ln Q(M)$ is the entropy of $Q$
- the Boltzmann-Gibbs distribution minimizes the free energy, i.e.
  $$F_T \le F_{T,Q}$$

Physical interpretation
- in fact we have
  $$F_{T,Q} = F_T + \frac{1}{\beta} KL(Q \| P),$$
  where $P$ is the Gibbs distribution
- minimizing $KL(Q \| P) \ge 0$ over $Q$ corresponds to finding the best upper bound of the free energy on a class of distributions
- in other words: given that the system states must be distributed according to a distribution in $\mathcal{Q}$, find the most stable distribution
Mean field
- $E(M)$ is difficult to handle because of coupling (dependencies), while $F(W, M)$ is linear and corresponds to a de-coupling in which dependencies are replaced by mean effects
- example:
  - modularity
    $$E(M) = \sum_{i} \sum_{k} M_{ik} \Big( \sum_{j} M_{jk} B_{ij} \Big)$$
  - approximation
    $$F(W, M) = \sum_{i} \sum_{k} M_{ik} W_{ik}$$
  - the complex influence of $M_{ik}$ on $E$ is replaced by a single parameter $W_{ik}$: the mean effect associated to a change in $M_{ik}$
In practice
- homotopy again:
  - at high temperature ($\beta \to 0$):
    - $E_{Q_{W,\beta}}(M_{ik}) = \frac{1}{K}$
    - the fixed point equation is generally easy to solve; for instance in graph clustering
      $$W_{jk} = \frac{2}{K} \sum_{i} B_{ij}$$
  - slowly decrease the temperature
  - use the mean field at the previous higher temperature as a starting point for the fixed point iterations
- phase transitions:
  - stable versus unstable fixed points (eigenanalysis)
  - noise injection and mass constrained version
A graph clustering algorithm
1. initialize $W_{jk}^1 = \frac{2}{K} \sum_{i} B_{ij}$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \frac{\exp(-\beta W_{ik})}{\sum_{m=1}^{K} \exp(-\beta W_{im})}$
       2.3.3 compute $W_{jk} = 2 \sum_{i} \mu_{ik} B_{ij}$
       2.3.4 go back to 2.3.2 until convergence
3. threshold the $\mu_{ik}$ into an optimal partition of the original graph
- mass constrained approach variant
- stability based transition detection (slower annealing)
Karate
- clustering of a simple graph
- Zachary's Karate club
[Figure: the Karate club graph (34 numbered nodes)]

Modularity
[Figure: evolution of the modularity during annealing, as a function of β (from about 50 to 500)]
Phase transitions
- first phase: 2 clusters
- second phase: 4 clusters
[Figure: the Karate club graph colored by cluster after each phase]

Karate
- clustering of a simple graph
- Zachary's Karate club
- four clusters
[Figure: the final clustering of the Karate club graph in four clusters]
DA Roadmap
How to apply deterministic annealing to a combinatorial optimization problem E(M)?
1. define a linear mean field approximation $F(W, M)$
2. specialize the mean field equations, i.e. compute
   $$\frac{\partial E_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \frac{\partial}{\partial W_{jl}} \left( \frac{1}{Z_{W,\beta}} \sum_{M} E(M) \exp(-\beta F(W, M)) \right)$$
3. identify a fixed point scheme to solve the mean field equations, using $E_{Q_{W,\beta}}(M_i)$ as a constant if needed
4. wrap the corresponding EM scheme in an annealing loop: this is the basic algorithm
5. analyze the fixed point scheme to find stability conditions: this leads to an advanced annealing schedule
Summary
- deterministic annealing addresses combinatorial optimization by solving a related but simpler problem and by tracking the evolution of the solution while the simpler problem converges to the original one
- motivated by:
  - heuristics (soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximal entropy)
- works mainly because of the solution following strategy (homotopy): do not solve a difficult problem from scratch, but rather start from a good guess of the solution

Summary
- difficulties:
  - boring and technical calculations (when facing a new problem)
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - tends to be slow in practice: computing exp is very costly even on modern hardware (roughly 100 flops)
- advantages:
  - versatile
  - appears as a simple EM-like algorithm embedded in an annealing loop
  - very interesting intermediate results
  - excellent final results in practice
References
T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14, January 1997.
K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, November 1998.
F. Rossi and N. Villa. Optimizing an organized modularity measure for topographic graph clustering: a deterministic annealing approach. Neurocomputing, 73(7–9):1142–1163, March 2010.