An Introduction to Deterministic Annealing
Fabrice Rossi
SAMM – Université Paris 1
2012

Optimization
- in machine learning, optimization is one of the central tools
- methodology:
  - choose a model with some adjustable parameters
  - choose a goodness-of-fit measure of the model to some data
  - tune the parameters in order to maximize the goodness of fit
  - examples: artificial neural networks, support vector machines, etc.
- in other fields:
  - operational research: project planning, routing, scheduling, etc.
  - design: antennas, wings, engines, etc.
  - etc.

Easy or Hard?
- is optimization computationally difficult?
- the convex case is relatively easy:
  - $\min_{x \in C} J(x)$ with $C$ convex and $J$ convex
  - polynomial algorithms (in general)
  - why?
    - a local minimum is global
    - the local tangent hyperplane is a global lower bound
- the non-convex case is hard:
  - multiple minima
  - no local-to-global inference
  - NP-hard in some cases

In machine learning
- convex case:
  - linear models with(out) regularization (ridge, lasso)
  - kernel machines (SVM and friends)
  - nonlinear projections (e.g., semi-definite embedding)
- non-convex case:
  - artificial neural networks (such as multilayer perceptrons)
  - vector quantization (a.k.a. prototype-based clustering)
  - general clustering
  - optimization with respect to meta-parameters:
    - kernel parameters
    - discrete parameters, e.g., feature selection
- in this lecture, non-convex problems:
  - combinatorial optimization
  - mixed optimization

Particular cases
- pure combinatorial optimization problems:
  - a solution space $\mathcal{M}$: finite but (very) large
  - an error measure $E$ from $\mathcal{M}$ to $\mathbb{R}$
  - goal: solve $M^* = \arg\min_{M \in \mathcal{M}} E(M)$
  - example: graph clustering
- mixed problems:
  - a solution space $\mathcal{M} \times \mathbb{R}^p$
  - an error measure $E$ from $\mathcal{M} \times \mathbb{R}^p$ to $\mathbb{R}$
  - goal: solve $(M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$
  - example: clustering in $\mathbb{R}^n$

Graph clustering
Goal: find an optimal clustering of a graph with $N$ nodes in $K$ clusters
- $\mathcal{M}$: set of all partitions of $\{1, \ldots, N\}$ in $K$ classes (the asymptotic behavior of $|\mathcal{M}|$ is roughly $K^{N-1}$ for a fixed $K$ with $N \to \infty$)
- assignment matrix: a partition in $\mathcal{M}$ is described by an $N \times K$ matrix $M$ such that $M_{ik} \in \{0, 1\}$ and $\sum_{k=1}^{K} M_{ik} = 1$
- many error measures are available:
  - graph cut measures (node normalized, edge normalized, etc.)
  - modularity
  - etc.
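Since everything below is phrased in terms of assignment matrices, here is a minimal sketch (in NumPy, with a made-up partition) of how a partition and its $N \times K$ assignment matrix correspond to each other; the specific labels are only an illustration.

```python
import numpy as np

# A partition of N = 5 nodes into K = 3 clusters, written as the cluster
# index of each node and as the equivalent N x K assignment matrix M
# (M[i, k] = 1 iff node i belongs to cluster k; each row sums to one).
labels = np.array([0, 2, 1, 0, 2])          # hypothetical partition
N, K = labels.size, 3
M = np.zeros((N, K))
M[np.arange(N), labels] = 1.0
assert np.all(M.sum(axis=1) == 1.0)
print(M)
```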
Graph clustering
[Figure: an example social graph with 34 numbered nodes, shown first as the original graph, then partitioned into four clusters.]
- original graph
- four clusters

Graph visualization
[Figure: a large social network of 2386 persons (legend: woman, bisexual man, heterosexual man), its maximal modularity clustering in 39 clusters, and a hierarchical display of the clusters.]
- 2386 persons: unreadable
- maximal modularity clustering in 39 clusters
- hierarchical display

Vector quantization
- $N$ observations $(x_i)_{1 \le i \le N}$ in $\mathbb{R}^n$
- $\mathcal{M}$: set of all partitions
- quantization error:
  $E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \|x_i - w_k\|^2$
- the "continuous" parameters are the prototypes $w_k \in \mathbb{R}^n$
- the assignment matrix notation is equivalent to the standard formulation
  $E(M, W) = \sum_{i=1}^{N} \|x_i - w_{k(i)}\|^2$,
  where $k(i)$ is the index of the cluster to which $x_i$ is assigned

Example
[Figure: a two-dimensional example dataset (axes X1 and X2).]

Combinatorial optimization
- a research field by itself
- many problems are NP-hard... and one therefore relies on heuristics or specialized approaches:
  - relaxation methods
  - branch and bound methods
  - stochastic approaches:
    - simulated annealing
    - genetic algorithms
- in this presentation: deterministic annealing

Outline
- Introduction: Mixed problems; Soft minimum; Computing the soft minimum; Evolution of β
- Deterministic Annealing: Annealing; Maximum entropy; Phase transitions; Mass constrained deterministic annealing
- Combinatorial problems: Expectation approximations; Mean field annealing; In practice

Optimization strategies
- mixed problem transformation
  $(M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$
- remove the continuous part
  $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big)$
- or the combinatorial part
  $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big)$
- this is not alternate optimization

Alternate optimization
- elementary heuristic:
  1. start with a random configuration $M^0 \in \mathcal{M}$
  2. compute $W^i = \arg\min_{W \in \mathbb{R}^p} E(M^{i-1}, W)$
  3. compute $M^i = \arg\min_{M \in \mathcal{M}} E(M, W^i)$
  4. go back to 2 until convergence
- e.g., the k-means algorithm for vector quantization (a code sketch follows below):
  1. start with a random partition $M^0 \in \mathcal{M}$
  2. compute the optimal prototypes with
     $W_k^i = \dfrac{\sum_{j=1}^{N} M_{jk}^{i-1} x_j}{\sum_{j=1}^{N} M_{jk}^{i-1}}$
  3. compute the optimal partition with $M_{jk}^i = 1$ if and only if $k = \arg\min_{1 \le l \le K} \|x_j - W_l^i\|^2$
  4. go back to 2 until convergence
- alternate optimization converges to a local minimum
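A minimal sketch of the alternate optimization heuristic above for vector quantization (k-means), assuming the data are rows of a NumPy array; the empty-cluster fallback is a detail of this sketch, not part of the algorithm as stated.

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=np.random.default_rng(0)):
    """Alternate optimization of E(M, W): prototype step then assignment step."""
    N = X.shape[0]
    labels = rng.integers(0, K, size=N)          # random initial partition M^0
    for _ in range(n_iter):
        # W step: the optimal prototypes are the cluster means
        W = np.stack([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(N)] for k in range(K)])
        # M step: the optimal partition assigns each point to its nearest prototype
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged to a local minimum
            break
        labels = new_labels
    return labels, W
```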
Combinatorial first
- let us consider the combinatorial-first approach:
  $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big)$
- not attractive a priori, as $W \mapsto \min_{M \in \mathcal{M}} E(M, W)$:
  - has no reason to be convex
  - has no reason to be $C^1$
- vector quantization example:
  $F(W) = \sum_{i=1}^{N} \min_{1 \le k \le K} \|x_i - w_k\|^2$
  is neither convex nor $C^1$

Example
Clustering in 2 clusters of elements from $\mathbb{R}$
[Figure: contour plot of $\ln F(W)$ over the prototype plane $(w_1, w_2)$.]
- multiple local minima and singularities: minimizing $F(W)$ is difficult

Soft minimum
- a soft minimum approximation of $\min_{M \in \mathcal{M}} E(M, W)$,
  $F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W)),$
  solves the regularity issue: if $E(M, \cdot)$ is $C^1$, then $F_\beta(W)$ is also $C^1$
- if $M^*(W)$ is such that for all $M \neq M^*(W)$, $E(M, W) > E(M^*(W), W)$, then
  $\lim_{\beta \to \infty} F_\beta(W) = E(M^*(W), W) = \min_{M \in \mathcal{M}} E(M, W)$

Example
[Figure: for a fixed $W$, the ratios $\exp(-\beta E(M, W)) / \exp(-\beta E(M^*, W))$ over all $M$, for $\beta$ ranging from $10^{-2}$ to $10^{2}$, and the resulting $F_\beta$ as a function of $\beta$.]

Discussion
- positive aspects:
  - if $E(M, W)$ is $C^1$ with respect to $W$ for all $M$, then $F_\beta$ is $C^1$
  - $\lim_{\beta \to \infty} F_\beta(W) = \min_{M \in \mathcal{M}} E(M, W)$
  - $\forall \beta > 0$, $-\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0$
  - when $\beta$ is close to 0, $F_\beta$ is generally easy to minimize
- negative aspects:
  - how to compute $F_\beta(W)$ efficiently?
  - how to solve $\min_{W \in \mathbb{R}^p} F_\beta(W)$?
  - why would that work in practice?
  - philosophy: where does $F_\beta(W)$ come from?
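To make the soft minimum above concrete, here is a brute-force numerical sketch on a tiny one-dimensional problem (the six points are the ones used in the later examples; the prototypes are made up). It evaluates $F_\beta(W)$ directly from its definition, summing over all $K^N$ assignments, which is only feasible at this toy size.

```python
import itertools
import numpy as np

x = np.array([1.0, 2.0, 3.0, 7.0, 7.5, 8.25])     # toy 1-D data
K = 2

def E(labels, w):
    """Quantization error E(M, W) for one assignment and prototypes w."""
    return sum((xi - w[k]) ** 2 for xi, k in zip(x, labels))

def F_hard(w):
    """min_M E(M, W): each point picks its best prototype."""
    return sum(min((xi - wk) ** 2 for wk in w) for xi in x)

def F_soft(w, beta):
    """Soft minimum -(1/beta) * ln sum_M exp(-beta * E(M, W)), brute force
    over the K^N assignments; the shift by the minimum keeps it stable."""
    energies = np.array([E(labels, w)
                         for labels in itertools.product(range(K), repeat=len(x))])
    m = energies.min()
    return m - np.log(np.exp(-beta * (energies - m)).sum()) / beta

w = np.array([2.0, 7.5])
for beta in (0.1, 1.0, 10.0):
    print(beta, F_hard(w), F_soft(w, beta))   # F_soft approaches F_hard as beta grows
```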
Computing F_β(W)
- in general, computing $F_\beta(W)$ is intractable: exhaustive calculation of $E(M, W)$ on the whole set $\mathcal{M}$
- a particular case: clustering with an additive cost, i.e.
  $E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} e_{ik}(W),$
  where $M$ is an assignment matrix (and e.g., $e_{ik}(W) = \|x_i - w_k\|^2$)
- then
  $F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta e_{ik}(W))$
- computational cost $O(NK)$ (compared to roughly $K^{N-1}$)

Proof sketch
- let $Z_\beta(W)$ be the partition function given by
  $Z_\beta(W) = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$
- assignments in $M$ are independent and the sum can be rewritten
  $Z_\beta(W) = \sum_{M_{1\cdot} \in C_K} \cdots \sum_{M_{N\cdot} \in C_K} \exp(-\beta E(M, W)),$
  with $C_K = \{(1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)\}$
- then
  $Z_\beta(W) = \prod_{i=1}^{N} \sum_{M_{i\cdot} \in C_K} \exp\Big(-\beta \sum_{k} M_{ik} e_{ik}(W)\Big)$

Independence
- the additive cost corresponds to some form of conditional independence
- given the prototypes, observations are assigned independently to their optimal clusters
- in other words: the global optimal assignment is the concatenation of the individual optimal assignments
- this is what makes alternate optimization tractable in the k-means despite its combinatorial aspect:
  $\min_{M \in \mathcal{M}} \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \|x_i - w_k\|^2$

Minimizing F_β(W)
- assume $E$ to be $C^1$ with respect to $W$; then
  $\nabla F_\beta(W) = \dfrac{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))\, \nabla_W E(M, W)}{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))}$
- at a minimum, $\nabla F_\beta(W) = 0$ ⇒ solve the equation or use gradient descent
- same calculation problem as for $F_\beta(W)$: in general, an exhaustive scan of $\mathcal{M}$ is needed
- if $E$ is additive,
  $\nabla F_\beta(W) = \sum_{i=1}^{N} \dfrac{\sum_{k=1}^{K} \exp(-\beta e_{ik}(W))\, \nabla_W e_{ik}(W)}{\sum_{k=1}^{K} \exp(-\beta e_{ik}(W))}$

Fixed point scheme
- a simple strategy to solve $\nabla F_\beta(W) = 0$
- starting from a random value of $W$:
  1. compute (expectation phase)
     $\mu_{ik} = \dfrac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}$
  2. keeping the $\mu_{ik}$ constant, solve for $W$ (maximization phase)
     $\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \nabla_W e_{ik}(W) = 0$
  3. loop to 1 until convergence
- generally converges to a minimum of $F_\beta(W)$ (for a fixed $\beta$)
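A small numerical check of the additive-cost formula above (assuming SciPy is available): on a tiny made-up cost matrix, the $O(NK)$ expression agrees with the defining sum over all $K^N$ assignment matrices.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, K, beta = 6, 2, 2.0
e = rng.random((N, K))                      # made-up costs e_ik(W)

# O(NK) formula: F_beta = -(1/beta) * sum_i log sum_k exp(-beta * e_ik)
F_additive = -logsumexp(-beta * e, axis=1).sum() / beta

# brute force over the K^N assignments (intractable beyond toy sizes)
energies = [sum(e[i, k] for i, k in enumerate(labels))
            for labels in itertools.product(range(K), repeat=N)]
F_brute = -logsumexp(-beta * np.array(energies)) / beta

assert np.isclose(F_additive, F_brute)
```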
Vector quantization
- if $e_{ik}(W) = \|x_i - w_k\|^2$ (vector quantization),
  $\nabla_{w_k} F_\beta(W) = 2 \sum_{i=1}^{N} \mu_{ik}(W)(w_k - x_i),$
  with
  $\mu_{ik}(W) = \dfrac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$
- setting $\nabla F_\beta(W) = 0$ leads to
  $w_k = \dfrac{1}{\sum_{i=1}^{N} \mu_{ik}(W)} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$

Links with the k-means
- if $\mu_{il} = \delta_{k(i)=l}$, we obtain the prototype update rule of the k-means
- in the k-means, we have $k(i) = \arg\min_{1 \le l \le K} \|x_i - w_l\|^2$
- in deterministic annealing, we have
  $\mu_{ik}(W) = \dfrac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$
- this is a soft minimum version of the crisp rule of the k-means

Links with EM
- isotropic Gaussian mixture with a unique and fixed variance:
  $p_k(x_i \mid w_k) = \dfrac{1}{(2\pi)^{p/2}}\, e^{-\frac{1}{2}\|x_i - w_k\|^2}$
- given the mixing coefficients $\delta_k$, the responsibilities are
  $P(x_i \in C_k \mid x_i, W) = \dfrac{\delta_k e^{-\frac{1}{2}\|x_i - w_k\|^2}}{\sum_{l=1}^{K} \delta_l e^{-\frac{1}{2}\|x_i - w_l\|^2}}$
- identical update rules for $W$
- $\beta$ can be seen as an inverse variance: it quantifies the uncertainty about the clustering results

So far...
- if $E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} e_{ik}(W)$, one tries to reach $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$ using
  $F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta e_{ik}(W))$
- $F_\beta(W)$ is smooth
- a fixed point EM-like scheme can be used to minimize $F_\beta(W)$ for a fixed $\beta$
- but:
  - the k-means algorithm is also an EM-like algorithm, and it does not reach a global optimum
  - how to handle $\beta \to \infty$?

Limit case β → 0
- limit behavior of the EM scheme:
  - $\mu_{ik} = \frac{1}{K}$ for all $i$ and $k$ (and all $W$!)
  - $W$ is therefore a solution of
    $\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W) = 0$
  - no iteration needed
- vector quantization: we have
  $w_k = \frac{1}{N} \sum_{i=1}^{N} x_i$
  - each prototype is the center of mass of the data: only one cluster!
- in general, the case $\beta \to 0$ is easy (unique minimum under mild hypotheses)
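A minimal sketch of the EM-like fixed point at a fixed $\beta$ for the vector quantization case above (the convergence tolerance and the small regularizer in the denominator are details of this sketch).

```python
import numpy as np

def da_fixed_beta(X, W, beta, n_iter=100, tol=1e-8):
    """Fixed point for vector quantization at fixed beta:
    E step: mu_ik = softmax_k(-beta * ||x_i - w_k||^2),
    M step: w_k = weighted mean of the data under mu_.k."""
    mu = np.full((X.shape[0], W.shape[0]), 1.0 / W.shape[0])
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # N x K
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)               # stability
        mu = np.exp(logits)
        mu /= mu.sum(axis=1, keepdims=True)
        W_new = (mu.T @ X) / (mu.sum(axis=0)[:, None] + 1e-12)
        if np.linalg.norm(W_new - W) < tol:
            return W_new, mu
        W = W_new
    return W, mu
```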
Example
Back to the two-cluster example:
[Figure: contour plots over $(w_1, w_2)$ of the original problem, of the soft minimum for $\beta = 10^{-1}$ (simple), and of the soft minimum for $\beta = 0.5$ (more complex).]

Increasing β
- when $\beta$ increases, $F_\beta(W)$ converges to $\min_{M \in \mathcal{M}} E(M, W)$
- optimizing $F_\beta(W)$ becomes more and more complex:
  - local minima
  - rougher and rougher: $F_\beta(W)$ remains $C^1$ but with large values of the gradient at some points
- path following strategy (homotopy):
  - take an increasing series $(\beta_l)_l$ with $\beta_0 = 0$
  - initialize $W_0^* = \arg\min_W F_0(W)$ for $\beta = 0$ by solving $\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W) = 0$
  - compute $W_l^*$ via the EM scheme for $\beta_l$, starting from $W_{l-1}^*$
- a similar strategy is used in interior point algorithms

Example
Clustering in 2 classes of 6 elements from $\mathbb{R}$: 1.00, 2.00, 3.00, 7.00, 7.50, 8.25
[Figure: contour plots over $(w_1, w_2)$ of $F_\beta(W)$ for $\beta = 10^{-2}, 10^{-1.75}, \ldots, 1$, compared with $\min_{M \in \mathcal{M}} E(M, W)$.]

One step closer to the final algorithm
- given an increasing series $(\beta_l)_l$ with $\beta_0 = 0$:
  1. compute $W_0^*$ such that $\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W_0^*) = 0$
  2. for $l = 1$ to $L$:
     2.1 initialize $W_l^*$ to $W_{l-1}^*$
     2.2 compute $\mu_{ik} = \dfrac{\exp(-\beta_l e_{ik}(W_l^*))}{\sum_{t=1}^{K} \exp(-\beta_l e_{it}(W_l^*))}$
     2.3 update $W_l^*$ such that $\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \nabla_W e_{ik}(W_l^*) = 0$
     2.4 go back to 2.2 until convergence
  3. use $W_L^*$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} \big(\min_{M \in \mathcal{M}} E(M, W)\big)$
- this is (up to some technical details) the deterministic annealing algorithm for minimizing $E(M, W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} e_{ik}(W)$ (a code sketch follows below)
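A compact sketch of the algorithm above for vector quantization, under assumptions of this sketch: a user-supplied geometric schedule of $\beta$ values and a small noise injection at each level (the role of the noise is discussed later, in the phase transition section).

```python
import numpy as np

def deterministic_annealing_vq(X, K, betas, noise=1e-3, n_inner=200,
                               rng=np.random.default_rng(0)):
    """Annealing loop: start at small beta (all prototypes at the data mean),
    then track the EM-like fixed point along an increasing schedule of beta."""
    W = np.tile(X.mean(axis=0), (K, 1))                  # beta -> 0 solution
    mu = np.full((X.shape[0], K), 1.0 / K)
    for beta in betas:                                   # increasing schedule
        W = W + noise * rng.standard_normal(W.shape)     # helps cluster splits
        for _ in range(n_inner):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)
            W_new = (mu.T @ X) / (mu.sum(axis=0)[:, None] + 1e-12)
            if np.allclose(W_new, W, atol=1e-9):
                break
            W = W_new
    return W, mu

# hypothetical usage with a geometric schedule of inverse temperatures:
# W, mu = deterministic_annealing_vq(X, 4, np.geomspace(0.01, 100, 40))
```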
What's next?
- general topics:
  - why the name annealing?
  - why should that work? (and other philosophical issues)
- specific topics:
  - technical "details" (e.g., the annealing schedule)
  - generalization:
    - non-additive criteria
    - general combinatorial problems

Annealing
- from Wikipedia: Annealing is a heat treatment wherein a material is altered, causing changes in its properties such as strength and hardness. It is a process that produces conditions by heating to above the re-crystallization temperature, maintaining a suitable temperature, and then cooling.
- in our context, we'll see that:
  - $T = \frac{1}{k_B \beta}$ acts as a temperature and models some thermal agitation
  - $E(M, W)$ is the energy of a configuration of the system, while $F_\beta(W)$ corresponds to the free energy of the system at a given temperature
  - increasing $\beta$ reduces the thermal agitation

Simulated Annealing
- a classical combinatorial optimization algorithm for computing $\min_{M \in \mathcal{M}} E(M)$
- given an increasing series $(\beta_l)_l$:
  1. choose a random initial configuration $M_0$
  2. for $l = 1$ to $L$:
     2.1 take a small random step from $M_{l-1}$ to build $M_l^c$ (e.g., change the cluster of an object)
     2.2 if $E(M_l^c) < E(M_{l-1})$, set $M_l = M_l^c$
     2.3 otherwise, set $M_l = M_l^c$ with probability $\exp(-\beta_l (E(M_l^c) - E(M_{l-1})))$
     2.4 and set $M_l = M_{l-1}$ in the remaining case
  3. use $M_L$ as an estimation of $\arg\min_{M \in \mathcal{M}} E(M)$
- naive vision:
  - always accept improvements
  - "thermal agitation" allows escaping local minima

Statistical physics
- consider a system with state space $\mathcal{M}$
- denote $E(M)$ the energy of the system when in state $M$
- at thermal equilibrium with the environment, the probability for the system to be in state $M$ is given by the Boltzmann (Gibbs) distribution
  $P_T(M) = \frac{1}{Z_T} \exp\Big(-\frac{E(M)}{k_B T}\Big)$,
  with $T$ the temperature, $k_B$ Boltzmann's constant, and
  $Z_T = \sum_{M \in \mathcal{M}} \exp\Big(-\frac{E(M)}{k_B T}\Big)$

Back to annealing
- physical analogy:
  - maintain the system at thermal equilibrium:
    - set a temperature
    - wait for the system to settle at this temperature
  - slowly decrease the temperature:
    - works well in real systems (e.g., crystallization)
    - allows the system to explore the state space
- computer implementation (a sketch follows below):
  - direct computation of $P_T(M)$ (or related quantities)
  - sampling from $P_T(M)$
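A minimal simulated annealing sketch for a cost defined on label vectors; the cost function, the proposal move (a single label change) and the schedule are placeholders to be supplied by the user.

```python
import numpy as np

def simulated_annealing(E, N, K, betas, rng=np.random.default_rng(0)):
    """Metropolis-style simulated annealing: propose a single-label change,
    always accept improvements, accept degradations with prob. exp(-beta*dE)."""
    labels = rng.integers(0, K, size=N)          # random initial configuration
    energy = E(labels)
    for beta in betas:                           # increasing inverse temperature
        for _ in range(N):                       # a few proposals per level
            i = rng.integers(N)
            candidate = labels.copy()
            candidate[i] = rng.integers(K)
            delta = E(candidate) - energy
            if delta <= 0 or rng.random() < np.exp(-beta * delta):
                labels, energy = candidate, energy + delta
    return labels, energy
```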
Simulated Annealing Revisited
- simulated annealing samples from $P_T(M)$!
- more precisely: the asymptotic distribution of $M_l$ for a fixed $\beta$ is given by $P_{1/(k_B \beta)}$
- how? Metropolis-Hastings Markov Chain Monte Carlo
- principle:
  - $P$ is the target distribution
  - $Q(\cdot \mid \cdot)$ is the proposal distribution (sampling friendly)
  - start with $x^t$, get $x'$ from $Q(x \mid x^t)$
  - set $x^{t+1}$ to $x'$ with probability
    $\min\Big(1, \dfrac{P(x')\, Q(x^t \mid x')}{P(x^t)\, Q(x' \mid x^t)}\Big)$
  - keep $x^{t+1} = x^t$ when this fails
- in SA, $Q$ is the random local perturbation
- major feature of Metropolis-Hastings MCMC:
  $\dfrac{P_T(M')}{P_T(M)} = \exp\Big(-\dfrac{E(M') - E(M)}{k_B T}\Big)$,
  so $Z_T$ is not needed
- underlying assumption: symmetric proposals, $Q(x^t \mid x') = Q(x' \mid x^t)$
- rationale:
  - sampling directly from $P_T(M)$ for small $T$ is difficult
  - track the likely area during cooling

Mixed problems
- simulated annealing applies to combinatorial problems
- in mixed problems, this would mean removing the continuous part:
  $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big)$
- in the vector quantization example,
  $E(M) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \Big\| x_i - \frac{1}{\sum_{j=1}^{N} M_{jk}} \sum_{j=1}^{N} M_{jk} x_j \Big\|^2$
- no obvious direct relation to the proposed algorithm

Statistical physics
- thermodynamic (free) energy: the useful energy available in a system (a.k.a. the one that can be extracted to produce work)
- the Helmholtz (free) energy is $F_T = U - TS$, where $U$ is the internal energy of the system and $S$ its entropy
- one can show that $F_T = -k_B T \ln Z_T$
- the natural evolution of a system is to reduce its free energy
- deterministic annealing mimics the cooling process of a system by tracking the evolution of the minimal free energy

Gibbs distribution
- still no use of $P_T$...
- important property: for $f$ a function defined on $\mathcal{M}$,
  $\mathbb{E}_{P_T}(f(M)) = \frac{1}{Z_T} \sum_{M \in \mathcal{M}} f(M) \exp\Big(-\frac{E(M)}{k_B T}\Big)$
- then
  $\lim_{T \to 0} \mathbb{E}_{P_T}(f(M)) = f(M^*),$
  where $M^* = \arg\min_{M \in \mathcal{M}} E(M)$
- useful to track global information

Membership functions
- in the EM-like phase, we have
  $\mu_{ik} = \dfrac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}$
- this does not come out of thin air:
  $\mu_{ik} = \mathbb{E}_{P_{\beta,W}}(M_{ik}) = \sum_{M \in \mathcal{M}} M_{ik}\, P_{\beta,W}(M)$
- $\mu_{ik}$ is therefore the probability for $x_i$ to belong to cluster $k$ under the Gibbs distribution
- at the limit $\beta \to \infty$, the $\mu_{ik}$ peak to Dirac-like membership functions: they give the "optimal" partition

Membership functions (example)
[Figure: for the 6-point example (1, 2, 3, 7, 7.5, 8.25), evolution of the membership probabilities and of the prototypes $(w_1, w_2)$ as $\beta$ increases, before and after a phase transition.]
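A tiny numerical check of the identity $\mu_{ik} = \mathbb{E}_{P_{\beta,W}}(M_{ik})$ stated above: for an additive cost, the closed-form memberships coincide with the Gibbs expectation of $M_{ik}$, estimated here by enumerating every assignment of a made-up toy problem.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, K, beta = 5, 2, 1.5
e = rng.random((N, K))                               # made-up costs e_ik

mu_closed = np.exp(-beta * e)                        # softmax per observation
mu_closed /= mu_closed.sum(axis=1, keepdims=True)

mu_gibbs = np.zeros((N, K))
Z = 0.0
for labels in itertools.product(range(K), repeat=N):
    w = np.exp(-beta * sum(e[i, k] for i, k in enumerate(labels)))
    Z += w
    for i, k in enumerate(labels):
        mu_gibbs[i, k] += w                          # accumulate P(M) over M with M_ik = 1
mu_gibbs /= Z

assert np.allclose(mu_closed, mu_gibbs)
```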
Summary
- two principles:
  - physical systems tend to reach a state of minimal free energy at a given temperature
  - when the temperature is slowly decreased, a system tends to reach a well organized state
- annealing algorithms try to reproduce this slow cooling behavior
- simulated annealing:
  - uses Metropolis-Hastings MCMC to sample the Gibbs distribution
  - easy to implement, but needs a very slow cooling
- deterministic annealing:
  - direct minimization of the free energy
  - computes expectations of global quantities with respect to the Gibbs distribution
  - aggressive cooling is possible, but the $Z_T$ computation must be tractable

Maximum entropy
- another interpretation/justification of DA
- reformulation of the problem: find a probability distribution on $\mathcal{M}$ which is "regular" and gives a low average error
- in other words: find $P_W$ such that
  - $\sum_{M \in \mathcal{M}} P_W(M) = 1$
  - $\sum_{M \in \mathcal{M}} E(W, M)\, P_W(M)$ is small
  - the entropy of $P_W$, $-\sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M)$, is high
- entropy plays a regularization role
- this leads to the minimization of
  $\sum_{M \in \mathcal{M}} E(W, M)\, P_W(M) + \frac{1}{\beta} \sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M),$
  where $\beta$ sets the trade-off between fitting and regularity
- some calculations (sketched below) show that the minimum over $P_W$ is given by
  $P_{\beta,W}(M) = \frac{1}{Z_{\beta,W}} \exp(-\beta E(M, W)),$
  with
  $Z_{\beta,W} = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$
- plugged back into the optimization criterion, we end up with the soft minimum
  $F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$
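For completeness, a sketch of those calculations: minimize the entropy-regularized average error over $P_W$ with a Lagrange multiplier for the normalization constraint.

```latex
\begin{aligned}
\mathcal{L}(P_W,\lambda) &= \sum_{M\in\mathcal{M}} E(M,W)\,P_W(M)
  + \frac{1}{\beta}\sum_{M\in\mathcal{M}} P_W(M)\ln P_W(M)
  + \lambda\Big(\sum_{M\in\mathcal{M}} P_W(M) - 1\Big),\\
\frac{\partial \mathcal{L}}{\partial P_W(M)} &= E(M,W)
  + \frac{1}{\beta}\big(\ln P_W(M) + 1\big) + \lambda = 0
  \;\Longrightarrow\; P_W(M) \propto \exp(-\beta E(M,W)),\\
\text{hence } P_{\beta,W}(M) &= \frac{\exp(-\beta E(M,W))}{Z_{\beta,W}},
\quad\text{and plugging back:}\quad
\sum_M E\,P_{\beta,W} + \frac{1}{\beta}\sum_M P_{\beta,W}\ln P_{\beta,W}
  = -\frac{1}{\beta}\ln Z_{\beta,W} = F_\beta(W).
\end{aligned}
```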
So far...
- three derivations of deterministic annealing:
  - soft minimum
  - thermodynamics-inspired annealing
  - maximum entropy principle
- all of them boil down to two principles:
  - replace the crisp optimization problem in the $\mathcal{M}$ space by a soft one
  - track the evolution of the solution when the crispness of the approximation increases
- now the technical details...

Fixed point
- remember the EM phase:
  $\mu_{ik}(W) = \dfrac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$,
  $\quad w_k = \dfrac{1}{\sum_{i=1}^{N} \mu_{ik}(W)} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$
- this is a fixed point method, i.e., $W = U_\beta(W)$ for a data-dependent $U_\beta$
- the fixed point is generally stable, except during a phase transition

Phase transition
[Figure: membership probabilities and prototypes for the 6-point example, before and after a phase transition; the two prototypes separate at the transition.]

Stability
- unstable fixed point:
  - even during a phase transition, the exact previous fixed point can remain a fixed point
  - but we want the transition to take place: this is a new cluster birth!
  - fortunately, the previous fixed point is unstable during a phase transition
- modified EM phase:
  1. initialize $W_l^*$ to $W_{l-1}^* + \varepsilon$
  2. compute $\mu_{ik} = \dfrac{\exp(-\beta_l e_{ik}(W_l^*))}{\sum_{t=1}^{K} \exp(-\beta_l e_{it}(W_l^*))}$
  3. update $W_l^*$ such that $\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \nabla_W e_{ik}(W_l^*) = 0$
  4. go back to 2 until convergence
- the noise is important to prevent missing a phase transition

Annealing strategy
- neglecting symmetries, there are $K - 1$ phase transitions for $K$ clusters
- tracking $W^*$ between transitions is easy
- trade-off:
  - slow annealing: long running time, but no transition is missed
  - fast annealing: quicker, but with a higher risk of missing a transition
- but critical temperatures can be computed:
  - via a stability analysis of the fixed points
  - this corresponds to a linear approximation of the fixed point equation $U_\beta$ around a fixed point

Critical temperatures
- for vector quantization
- first temperature:
  - the fixed point stability is related to the variance of the data
  - the critical $\beta$ is $1/(2\lambda_{\max})$, where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix of the data
- in general:
  - the stability of the whole system is related to the stability of each cluster
  - the $\mu_{ik}$ play the role of membership functions for each cluster
  - the critical $\beta$ for cluster $k$ is $1/(2\lambda_{\max}^k)$, where $\lambda_{\max}^k$ is the largest eigenvalue of
    $\sum_{i} \mu_{ik} (x_i - w_k)(x_i - w_k)^T$

A possible final algorithm
1. initialize $W^1 = \frac{1}{N} \sum_{i=1}^{N} x_i$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \dfrac{\exp(-\beta \|x_i - W_k^l\|^2)}{\sum_{t=1}^{K} \exp(-\beta \|x_i - W_t^l\|^2)}$
       2.3.3 compute $W_k^l = \dfrac{1}{\sum_{i=1}^{N} \mu_{ik}} \sum_{i=1}^{N} \mu_{ik}\, x_i$
       2.3.4 go back to 2.3.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} \big(\min_{M \in \mathcal{M}} E(M, W)\big)$ and $\mu_{ik}$ as the corresponding $M$

Computational cost
- $N$ observations in $\mathbb{R}^p$ and $K$ clusters
- prototype storage: $Kp$
- $\mu$ matrix storage: $NK$
- one EM iteration costs $O(NKp)$
- we need at most $K - 1$ full EM runs: $O(NK^2 p)$
- compared to the k-means:
  - more storage
  - no simple distance calculation tricks: all distances must be computed
  - roughly corresponds to $K - 1$ k-means runs, not taking into account the number of distinct $\beta$ values considered during each phase transition
  - and neglecting the eigenvalue analysis (in $O(p^3)$)...
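A sketch of the critical temperature computation mentioned above. The weighted covariance is normalized here by the cluster mass so that, at $\beta \to 0$ (uniform memberships, prototype at the data mean), the formula reduces to the covariance of the data; this normalization convention is an assumption of the sketch.

```python
import numpy as np

def critical_beta(X, mu_k, w_k):
    """Critical inverse temperature for one cluster: beta_c = 1 / (2 * lambda_max),
    where lambda_max is the largest eigenvalue of the membership-weighted
    covariance of the cluster."""
    D = X - w_k
    C = (mu_k[:, None] * D).T @ D / mu_k.sum()      # weighted covariance
    lam_max = np.linalg.eigvalsh(C).max()
    return 1.0 / (2.0 * lam_max)

# first transition: a single cluster at the data mean with uniform memberships
# beta_1 = critical_beta(X, np.ones(len(X)), X.mean(axis=0))
```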
Collapsed prototypes
- when $\beta \to 0$, there is only one cluster
- waste of computational resources: $K$ identical calculations
- this is a general problem:
  - each phase transition adds new clusters
  - prior to that, prototypes are collapsed
  - but we need them to track cluster births...
- a related problem: uniform cluster weights
- in a Gaussian mixture,
  $P(x_i \in C_k \mid x_i, W) = \dfrac{\delta_k e^{-\frac{1}{2}\|x_i - w_k\|^2}}{\sum_{l=1}^{K} \delta_l e^{-\frac{1}{2}\|x_i - w_l\|^2}}$,
  prototypes are weighted by the mixing coefficients $\delta$

Mass constrained DA
- we introduce prototype weights $\delta_k$ and plug them into the soft minimum formulation:
  $F_\beta(W, \delta) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \delta_k \exp(-\beta e_{ik}(W))$
- $F_\beta$ is minimized under the constraint $\sum_{k=1}^{K} \delta_k = 1$
- this leads to
  $\mu_{ik}(W) = \dfrac{\delta_k \exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \delta_l \exp(-\beta \|x_i - w_l\|^2)}$,
  $\quad \delta_k = \frac{1}{N} \sum_{i=1}^{N} \mu_{ik}(W)$,
  $\quad w_k = \frac{1}{N \delta_k} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$

MCDA
- increases the similarity with EM for Gaussian mixtures:
  - exactly the same algorithm for a fixed $\beta$
  - isotropic Gaussian mixture with variance $\frac{1}{2\beta}$
- however:
  - the variance is generally a parameter in Gaussian mixtures
  - different goals: minimal distortion versus maximum likelihood
  - no support for other distributions in DA

Cluster birth monitoring
- avoid wasting computational resources:
  - consider a cluster $C_k$ with a prototype $w_k$
  - prior to a phase transition in $C_k$:
    - duplicate $w_k$ into $w_k'$
    - apply some noise to both prototypes
    - split $\delta_k$ into $\frac{\delta_k}{2}$ and $\frac{\delta_k}{2}$
  - apply the EM-like algorithm and monitor $\|w_k - w_k'\|$
  - accept the new cluster if $\|w_k - w_k'\|$ becomes "large"
- two strategies:
  - eigenvalue based analysis (costly, but quite accurate)
  - opportunistic: always maintain duplicate prototypes and promote diverging ones

Final algorithm
1. initialize $W^1 = \frac{1}{N} \sum_{i=1}^{N} x_i$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 duplicate the prototype of the critical cluster and split the associated weight
   2.4 for $t$ values of $\beta$ around $\beta_l$:
       2.4.1 add some small noise to $W^l$
       2.4.2 compute $\mu_{ik} = \dfrac{\delta_k \exp(-\beta \|x_i - W_k^l\|^2)}{\sum_{t=1}^{K} \delta_t \exp(-\beta \|x_i - W_t^l\|^2)}$
       2.4.3 compute $\delta_k = \frac{1}{N} \sum_{i=1}^{N} \mu_{ik}(W)$
       2.4.4 compute $W_k^l = \frac{1}{N \delta_k} \sum_{i=1}^{N} \mu_{ik}\, x_i$
       2.4.5 go back to 2.4.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} \big(\min_{M \in \mathcal{M}} E(M, W)\big)$ and $\mu_{ik}$ as the corresponding $M$
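Two small helpers sketching the mass constrained updates and the cluster-birth trick above; the noise level and the in-place bookkeeping are details of this sketch.

```python
import numpy as np

def mcda_step(X, W, delta, beta):
    """One EM-like update of mass-constrained DA at fixed beta:
    weighted soft assignments, then cluster masses delta_k and prototypes w_k."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logits = np.log(delta)[None, :] - beta * d2
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)
    delta_new = mu.mean(axis=0)                         # delta_k = sum_i mu_ik / N
    W_new = (mu.T @ X) / (len(X) * delta_new[:, None])  # w_k = sum_i mu_ik x_i / (N delta_k)
    return W_new, delta_new, mu

def split_cluster(W, delta, k, eps=1e-3, rng=np.random.default_rng(0)):
    """Cluster-birth monitoring: duplicate prototype k, perturb both copies and
    share its mass; the two copies only drift apart after a phase transition."""
    w_copy = W[k] + eps * rng.standard_normal(W.shape[1])
    W = np.vstack([W, w_copy[None, :]])
    W[k] = W[k] + eps * rng.standard_normal(W.shape[1])
    delta = np.append(delta, delta[k] / 2.0)
    delta[k] /= 2.0
    return W, delta
```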
Example
[Figure: a simple two-dimensional dataset; evolution of the error $E$ and cluster births as $\beta$ increases; the successive clusterings produced during annealing.]

Summary
- deterministic annealing addresses combinatorial optimization by smoothing the cost function and tracking the evolution of the solution while the smoothing progressively vanishes
- motivated by:
  - a heuristic (the soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximum entropy)
- in practice:
  - a rather simple, repeated EM-like algorithm
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - excellent results (frequently better than those obtained by k-means)

Combinatorial only
- the situation is quite different for a purely combinatorial problem
  $\arg\min_{M \in \mathcal{M}} E(M)$
- no soft minimum heuristic
- no direct use of the Gibbs distribution $P_T$; one needs to rely on $\mathbb{E}_{P_T}(f(M))$ for interesting $f$
- the calculation of $P_T$ is not tractable:
  - tractability is related to independence
  - if $E(M)$ decomposes as $E(M_1) + \cdots + E(M_D)$, then sums over $\mathcal{M}$ factor as $\sum_{M_1} \cdots \sum_{M_D}$ and independent optimization can be done on each variable...
Graph clustering
- a non-oriented graph with $N$ nodes and weight matrix $A$ ($A_{ij}$ is the weight of the connection between nodes $i$ and $j$)
- maximal modularity clustering:
  - node degree $k_i = \sum_{j=1}^{N} A_{ij}$, total weight $m = \frac{1}{2} \sum_{i,j} A_{ij}$
  - null model $P_{ij} = \frac{k_i k_j}{2m}$, and $B_{ij} = \frac{1}{2m}(A_{ij} - P_{ij})$
  - assignment matrix $M$
  - modularity of the clustering:
    $\mathrm{Mod}(M) = \sum_{i,j} \sum_{k} M_{ik} M_{jk} B_{ij}$
- difficulty: the coupling $M_{ik} M_{jk}$ (non-linearity)
- in the sequel, $E(M) = -\mathrm{Mod}(M)$

Strategy for DA
- choose some interesting statistics $\mathbb{E}_{P_T}(f(M))$:
  - for instance $\mathbb{E}_{P_T}(M_{ik})$ for clustering
- compute an approximation of $\mathbb{E}_{P_T}(f(M))$ at high temperature $T$
- track the evolution of the approximation while lowering $T$
- use $\lim_{T \to 0} \mathbb{E}_{P_T}(f(M)) = f(\arg\min_{M \in \mathcal{M}} E(M))$ to claim that the approximation converges to the interesting limit
- rationale: approximating $\mathbb{E}_{P_T}(f(M))$ for a low $T$ is probably difficult, so we use the path following strategy to ease the task

Example
[Figure: a one-dimensional function $y(x)$ on $[-1, 1]$; the Gibbs distribution over $x$ for $\beta$ ranging from $10^{-1}$ to $10^{2}$, concentrating around the global minimum as $\beta$ grows; evolution of the estimate $\hat{x}$ with $\beta$.]

Expectation approximation
- computing
  $\mathbb{E}_{P_Z}(f(Z)) = \int f(Z)\, dP_Z$
  has many applications
- for instance, in machine learning:
  - performance evaluation (generalization error)
  - EM for probabilistic models
  - Bayesian approaches
- two major numerical approaches:
  - sampling
  - variational approximation

Sampling
- (strong) law of large numbers:
  $\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(z_i) = \int f(Z)\, dP_Z$
  if the $z_i$ are independent samples of $Z$
- many extensions, especially to dependent samples (slower convergence)
- MCMC methods: build a Markov chain with stationary distribution $P_Z$ and take the average of $f$ on a trajectory
- applied to $\mathbb{E}_{P_T}(f(M))$: this is simulated annealing!
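Before moving to the variational route, a small sketch grounding the modularity-based cost defined above: the matrix $B$ and the energy $E(M) = -\mathrm{Mod}(M)$ for a given partition.

```python
import numpy as np

def modularity_matrix(A):
    """B_ij = (A_ij - k_i k_j / (2m)) / (2m) for a symmetric weight matrix A."""
    k = A.sum(axis=1)
    two_m = A.sum()
    return (A - np.outer(k, k) / two_m) / two_m

def energy(labels, B, K):
    """E(M) = -Mod(M) = -sum_{i,j} sum_k M_ik M_jk B_ij = -trace(M^T B M)."""
    M = np.eye(K)[labels]                  # N x K assignment matrix
    return -np.trace(M.T @ B @ M)
```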
Variational approximation
- main idea:
  - the intractability of the calculation of $\mathbb{E}_{P_Z}(f(Z))$ is induced by the complexity of $P_Z$
  - let's replace $P_Z$ by a simpler distribution $Q_Z$...
  - ...and $\mathbb{E}_{P_Z}(f(Z))$ by $\mathbb{E}_{Q_Z}(f(Z))$
- calculus of variations:
  - optimization over functional spaces
  - in this context: choose an optimal $Q_Z$ in a space of probability measures $\mathcal{Q}$ with respect to a goodness-of-fit criterion between $P_Z$ and $Q_Z$
- probabilistic context:
  - simple distributions: factorized distributions
  - quality criterion: the Kullback-Leibler divergence

Variational approximation
- assume $Z = (Z_1, \ldots, Z_D) \in \mathbb{R}^D$ and $f(Z) = \sum_{i=1}^{D} f_i(Z_i)$
- for a general $P_Z$, the structure of $f$ cannot be exploited in $\mathbb{E}_{P_Z}(f(Z))$
- variational approximation:
  - choose $\mathcal{Q}$, a set of tractable distributions on $\mathbb{R}$
  - solve
    $Q^* = \arg\min_{Q = (Q_1 \times \cdots \times Q_D) \in \mathcal{Q}^D} \mathrm{KL}(Q \,\|\, P_Z)$,
    with
    $\mathrm{KL}(Q \,\|\, P_Z) = -\int \ln \frac{dP_Z}{dQ}\, dQ$
  - approximate $\mathbb{E}_{P_Z}(f(Z))$ by
    $\mathbb{E}_{Q^*}(f(Z)) = \sum_{i=1}^{D} \mathbb{E}_{Q_i^*}(f_i(Z_i))$

Application to DA
- general form: $P_\beta(M) = \frac{1}{Z_\beta} \exp(-\beta E(M))$
- intractable because $E(M)$ is not linear in $M = (M_1, \ldots, M_D)$
- variational approximation:
  - replace $E$ by $F(W, M)$ defined by
    $F(W, M) = \sum_{i=1}^{D} W_i M_i$
  - use for $Q_{W,\beta}$
    $Q_{W,\beta} = \frac{1}{Z_{W,\beta}} \exp(-\beta F(W, M))$
  - optimize $W$ via $\mathrm{KL}(Q_{W,\beta} \,\|\, P_\beta)$

Fitting the approximation
- a complex expectation calculation is replaced by a new optimization problem
- we have
  $\mathrm{KL}(Q_{W,\beta} \,\|\, P_\beta) = \ln Z_\beta - \ln Z_{W,\beta} + \beta\, \mathbb{E}_{Q_{W,\beta}}(E(M) - F(W, M))$
  - tractable thanks to the factorization of $F$, except for $Z_\beta$
- solved by $\nabla_W \mathrm{KL}(Q_{W,\beta} \,\|\, P_\beta) = 0$
  - tractable because $Z_\beta$ does not depend on $W$
  - generally solved via a fixed point approach (EM-like again)
  - similar to the variational approximation used in EM or Bayesian methods

Example
- in the case of (graph) clustering, we use
  $F(W, M) = \sum_{i=1}^{N} \sum_{k=1}^{K} W_{ik} M_{ik}$
- a crucial (general) property is that, when $i \neq j$,
  $\mathbb{E}_{Q_{W,\beta}}(M_{ik} M_{jl}) = \mathbb{E}_{Q_{W,\beta}}(M_{ik})\, \mathbb{E}_{Q_{W,\beta}}(M_{jl})$
- a (long and boring) series of equations leads to the general mean field equations
  $\dfrac{\partial \mathbb{E}_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \sum_{k=1}^{K} \dfrac{\partial \mathbb{E}_{Q_{W,\beta}}(M_{jk})}{\partial W_{jl}}\, W_{jk}, \quad \forall j, l$

Example
- classical EM-like scheme to solve the mean field equations:
  1. we have
     $\mathbb{E}_{Q_{W,\beta}}(M_{ik}) = \dfrac{\exp(-\beta W_{ik})}{\sum_{l=1}^{K} \exp(-\beta W_{il})}$
  2. keep $\mathbb{E}_{Q_{W,\beta}}(M_{ik})$ fixed and solve the simplified mean field equations for $W$
  3. update $\mathbb{E}_{Q_{W,\beta}}(M_{ik})$ based on the new $W$ and loop on 2 until convergence
- in the graph clustering case,
  $W_{jk} = 2 \sum_{i} \mathbb{E}_{Q_{W,\beta}}(M_{ik})\, B_{ij}$

Summary
- deterministic annealing for combinatorial optimization
- given an objective function $E(M)$:
  - choose a linear parametric approximation $F(W, M)$ of $E(M)$, with the associated distribution $Q_{W,\beta} = \frac{1}{Z_{W,\beta}} \exp(-\beta F(W, M))$
  - write the mean field equations, i.e.
    $\nabla_W \big( -\ln Z_{W,\beta} + \beta\, \mathbb{E}_{Q_{W,\beta}}(E(M) - F(W, M)) \big) = 0$
  - use an EM-like algorithm to solve the equations:
    - given $W$, compute $\mathbb{E}_{Q_{W,\beta}}(M_{ik})$
    - given $\mathbb{E}_{Q_{W,\beta}}(M_{ik})$, solve the equations
- back to our classical questions:
  - why would that work?
  - how to do that in practice?
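A one-line derivation of the expression being minimized in the recipe above, showing why $\ln Z_\beta$ can be ignored when optimizing over $W$:

```latex
\begin{aligned}
\mathrm{KL}\!\left(Q_{W,\beta}\,\|\,P_\beta\right)
&= \sum_{M} Q_{W,\beta}(M)\,\ln\frac{Q_{W,\beta}(M)}{P_\beta(M)}
 = \sum_{M} Q_{W,\beta}(M)\Big[\big(-\beta F(W,M)-\ln Z_{W,\beta}\big)-\big(-\beta E(M)-\ln Z_\beta\big)\Big]\\
&= \ln Z_\beta - \ln Z_{W,\beta} + \beta\,\mathbb{E}_{Q_{W,\beta}}\!\big(E(M)-F(W,M)\big),
\end{aligned}
```

and since $\ln Z_\beta$ does not depend on $W$, $\nabla_W \mathrm{KL}(Q_{W,\beta}\|P_\beta) = \nabla_W\big(-\ln Z_{W,\beta} + \beta\,\mathbb{E}_{Q_{W,\beta}}(E(M)-F(W,M))\big)$.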
Physical interpretation
- back to the (generalized) Helmholtz (free) energy $F_\beta = U - \frac{S}{\beta}$
- the free energy can be generalized to any distribution $Q$ on the states by
  $F_{\beta,Q} = \mathbb{E}_Q(E(M)) - \frac{H(Q)}{\beta},$
  where $H(Q) = -\sum_{M \in \mathcal{M}} Q(M) \ln Q(M)$ is the entropy of $Q$
- the Boltzmann-Gibbs distribution minimizes the free energy, i.e. $F_T \le F_{T,Q}$
- in fact we have
  $F_{T,Q} = F_T + \frac{1}{\beta} \mathrm{KL}(Q \,\|\, P),$
  where $P$ is the Gibbs distribution
- minimizing $\mathrm{KL}(Q \,\|\, P) \ge 0$ over $Q$ corresponds to finding the best upper bound of the free energy on a class of distributions
- in other words: given that the system states must be distributed according to a distribution in $\mathcal{Q}$, find the most stable distribution

Mean field
- $E(M)$ is difficult to handle because of coupling (dependencies), while $F(W, M)$ is linear and corresponds to a de-coupling in which dependencies are replaced by mean effects
- example:
  - modularity:
    $E(M) = \sum_{i} \sum_{k} M_{ik} \sum_{j} M_{jk} B_{ij}$
  - approximation:
    $F(W, M) = \sum_{i} \sum_{k} M_{ik} W_{ik}$
  - the complex influence of $M_{ik}$ on $E$ is replaced by a single parameter $W_{ik}$: the mean effect associated to a change in $M_{ik}$

In practice
- homotopy again:
  - at high temperature ($\beta \to 0$):
    - $\mathbb{E}_{Q_{W,\beta}}(M_{ik}) = \frac{1}{K}$
    - the fixed point equation is generally easy to solve; for instance, in graph clustering,
      $W_{jk} = \frac{2}{K} \sum_{i} B_{ij}$
  - slowly decrease the temperature
  - use the mean field at the previous, higher temperature as a starting point for the fixed point iterations
- phase transitions:
  - stable versus unstable fixed points (eigenanalysis)
  - noise injection and the mass constrained version

A graph clustering algorithm
1. initialize $W_{jk}^1 = \frac{2}{K} \sum_{i} B_{ij}$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \dfrac{\exp(-\beta W_{ik})}{\sum_{t=1}^{K} \exp(-\beta W_{it})}$
       2.3.3 compute $W_{jk} = 2 \sum_{i} \mu_{ik} B_{ij}$
       2.3.4 go back to 2.3.2 until convergence
3. threshold $\mu_{ik}$ into an optimal partition of the original graph
- mass constrained approach variant
- stability based transition detection (slower annealing)

Karate
- clustering of a simple graph
- Zachary's Karate club
[Figure: the Karate club graph with 34 numbered nodes.]

Modularity
[Figure: evolution of the modularity (up to about 0.4) during annealing, as $\beta$ increases from roughly 50 to 500.]

Phase transitions
- first phase: 2 clusters
- second phase: 4 clusters
[Figure: cluster memberships after each transition.]

Karate
- clustering of a simple graph
- Zachary's Karate club
- four clusters
[Figure: the Karate club graph partitioned into four clusters.]
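A self-contained sketch of an annealed mean-field scheme of this kind for modularity-based graph clustering (applicable, e.g., to the Karate graph). Two assumptions of the sketch: the memberships are computed from the field $W_{jk} = 2\sum_i B_{ij}\mu_{ik}$ with a positive sign in the softmax, so that nodes are attracted to clusters with a large modularity field (treat this sign bookkeeping as a convention of the sketch, not as the authoritative formulation), and a plain geometric $\beta$ schedule replaces the computation of critical temperatures.

```python
import numpy as np

def mean_field_graph_clustering(A, K, betas, noise=1e-4,
                                n_inner=200, rng=np.random.default_rng(0)):
    """Mean-field annealing sketch for modularity-based graph clustering."""
    k = A.sum(axis=1)
    B = (A - np.outer(k, k) / A.sum()) / A.sum()       # modularity matrix
    N = A.shape[0]
    mu = np.full((N, K), 1.0 / K)                      # beta -> 0 solution
    for beta in betas:                                 # increasing schedule
        mu = mu + noise * rng.random((N, K))           # break the symmetry
        mu /= mu.sum(axis=1, keepdims=True)
        for _ in range(n_inner):
            W = 2.0 * B @ mu                           # mean field W_jk
            logits = beta * W                          # attraction to high-modularity clusters
            logits -= logits.max(axis=1, keepdims=True)
            mu_new = np.exp(logits)
            mu_new /= mu_new.sum(axis=1, keepdims=True)
            if np.allclose(mu_new, mu, atol=1e-8):
                break
            mu = mu_new
    return mu.argmax(axis=1)                           # threshold mu into a partition
```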
DA Roadmap
How to apply deterministic annealing to a combinatorial optimization problem $E(M)$?
1. define a linear mean field approximation $F(W, M)$
2. specialize the mean field equations, i.e. compute
   $\dfrac{\partial \mathbb{E}_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \dfrac{\partial}{\partial W_{jl}} \left( \dfrac{1}{Z_{W,\beta}} \sum_{M} E(M) \exp(-\beta F(W, M)) \right)$
3. identify a fixed point scheme to solve the mean field equations, using $\mathbb{E}_{Q_{W,\beta}}(M_i)$ as a constant if needed
4. wrap the corresponding EM scheme in an annealing loop: this is the basic algorithm
5. analyze the fixed point scheme to find stability conditions: this leads to an advanced annealing schedule

Summary
- deterministic annealing addresses combinatorial optimization by solving a related but simpler problem and by tracking the evolution of the solution while the simpler problem converges to the original one
- motivated by:
  - a heuristic (the soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximum entropy)
- works mainly because of the solution following strategy (homotopy): do not solve a difficult problem from scratch, but rather start from a good guess of the solution

Summary
- difficulties:
  - tedious and technical calculations (when facing a new problem)
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - tends to be slow in practice: computing exp is very costly even on modern hardware (roughly 100 flops)
- advantages:
  - versatile
  - appears as a simple EM-like algorithm embedded in an annealing loop
  - very interesting intermediate results
  - excellent final results in practice