LIDS-P-1494                                                 August 20, 1985

ANALYSIS OF SIMULATED ANNEALING FOR OPTIMIZATION

by

Saul B. Gelfand and Sanjoy K. Mitter
Department of Electrical Engineering and Computer Science
and Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

Simulated annealing is a popular Monte Carlo algorithm for combinatorial optimization. The annealing algorithm simulates a nonstationary finite state Markov chain whose state space is the domain of the cost function to be minimized. We analyze this chain focusing on those issues most important for optimization. In all of our results we consider an arbitrary partition $\{I,J\}$ of the state space; important special cases are when $I$ is the set of minimum cost states or a set of all states with sufficiently small cost. We give a lower bound on the probability that the chain visits $I$ at some time $n \le k$, for $k = 1,2,\ldots$. This bound may be useful even when the algorithm does not converge. We give conditions under which the chain converges to $I$ in probability and obtain an estimate of the rate of convergence as well. We also give conditions under which the chain visits $I$ infinitely often, visits $I$ almost always, or does not converge to $I$, with probability 1.

Research supported by the U.S. Army Research Office under Grant DAAG29-84-K-0005 and by the Air Force Office of Scientific Research under Grant AFOSR 82-0135B.

1. Introduction

Simulated annealing, as proposed by Kirkpatrick [1], is a popular Monte Carlo algorithm for combinatorial optimization. Simulated annealing is a variation on an algorithm introduced by Metropolis [2] for approximate computation of mean values of various statistical-mechanical quantities for a physical system in equilibrium at a given temperature. In simulated annealing the temperature of the system is slowly decreased to zero; if the temperature is decreased slowly enough, the system should end up among the minimum energy states, or at least among states of sufficiently low energy.
Hence the annealing algorithm can be viewed as minimizing a cost function (energy) over a finite set (the system's states). Simulated annealing has been applied to several combinatorial optimization problems including the traveling salesman problem [2], computer design problems [2],[3], and image reconstruction problems [4], with apparently good results.

The annealing algorithm consists of simulating a nonstationary finite-state Markov chain which we shall call the annealing chain. We now describe the precise relationship between this chain and the optimization problem to be solved. Here and in the sequel we take $\mathbb{R}$ to be the real numbers, $\mathbb{N}$ the natural numbers, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, and we denote by $|A|$ the cardinality of a finite set $A$. Let $\Omega$ be a finite set, say $\Omega = \{1,\ldots,|\Omega|\}$, and let $U_i \in \mathbb{R}$ for $i \in \Omega$; we want to minimize $U_i$ over $i \in \Omega$. The finite set $\Omega$ shall be the state space for the annealing chain, and we shall refer to $\{U_i\}_{i\in\Omega}$ as the energies or the energy function. Let $T_k > 0$ for $k \in \mathbb{N}_0$; we shall refer to $\{T_k\}_{k\in\mathbb{N}_0}$ as the annealing schedule of temperatures. Let $\pi^{(k)}$ (a row vector) be the Boltzmann distribution over the energies $\{U_i\}_{i\in\Omega}$ at temperature $T_k$, i.e.,

$$\pi_i^{(k)} = \frac{e^{-U_i/T_k}}{\sum_{j\in\Omega} e^{-U_j/T_k}} \qquad \text{for all } i \in \Omega,\ k \in \mathbb{N}_0.$$

The annealing chain will be constructed such that at each time $k$ it has a 1-step transition matrix $P^{(k,k+1)} = [p_{ij}^{(k,k+1)}]_{i,j\in\Omega}$ such that $\pi^{(k)}$ is the unique solution of the vector equation $\pi = \pi P^{(k,k+1)}$, i.e., at each time $k$ the chain has $\pi^{(k)}$ as its unique invariant distribution. The motivation for this is as follows. Let $S^*$ be the set of minimum energy states in $\Omega$. Now if $T_k \to 0$ as $k \to \infty$ then

$$\pi_i^{(k)} \to \frac{1}{|S^*|} \ \text{ if } i \in S^*, \qquad \pi_i^{(k)} \to 0 \ \text{ if } i \notin S^*,$$

i.e., the invariant distributions converge to a uniform distribution over the minimum energy states. The hope is then that the chain itself converges to the minimum energy states. We now show how Metropolis constructs a transition matrix $P^{(k,k+1)}$ with invariant vector $\pi^{(k)}$ for $k \in \mathbb{N}_0$.
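As a numerical illustration of the limit above (a minimal sketch; the five-state energy values below are hypothetical and are not taken from any figure in this paper), one can evaluate the Boltzmann distribution for a decreasing sequence of temperatures and watch the mass concentrate uniformly on the minimum energy states:

```python
import math

def boltzmann(U, T):
    """Boltzmann distribution over the energies U at temperature T > 0."""
    w = [math.exp(-u / T) for u in U]
    Z = sum(w)                       # normalizing constant
    return [x / Z for x in w]

# Hypothetical energy function with two minimum energy states (indices 0 and 3).
U = [0.0, 2.0, 1.0, 0.0, 3.0]

for T in (10.0, 1.0, 0.1, 0.01):
    pi = boltzmann(U, T)
    print(T, [round(p, 4) for p in pi])
# As T -> 0 the distribution tends to (1/2, 0, 0, 1/2, 0),
# i.e., uniform over the set S* of minimum energy states.
```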
Let $Q = [q_{ij}]_{i,j\in\Omega}$ be a symmetric and irreducible stochastic matrix, and let

$$p_{ij}^{(k,k+1)} = \begin{cases} q_{ij}\, e^{-(U_j-U_i)/T_k} & \text{if } U_j > U_i, \\ q_{ij} & \text{if } U_j \le U_i,\ j \ne i, \\ 1 - \sum_{\ell \ne i} p_{i\ell}^{(k,k+1)} & \text{if } j = i, \end{cases}$$

for all $i,j \in \Omega$ and $k \in \mathbb{N}_0$. Then it is easily verified that $\pi^{(k)} P^{(k,k+1)} = \pi^{(k)}$ for all $k \in \mathbb{N}_0$. In fact, $P^{(k,k+1)}$ and $\pi^{(k)}$ satisfy the reversibility condition

$$\pi_i^{(k)}\, p_{ij}^{(k,k+1)} = \pi_j^{(k)}\, p_{ji}^{(k,k+1)} \qquad \text{for all } i,j \in \Omega,\ k \in \mathbb{N}_0.$$

Let $\{x_k\}_{k\in\mathbb{N}_0}$ be the annealing chain with 1-step transition matrices $\{P^{(k,k+1)}\}_{k\in\mathbb{N}_0}$ and some initial distribution, constructed on a suitable probability space $(\bar\Omega, \mathcal{A}, P)$. Let $p_i^{(k)} = P\{x_k = i\}$ for $i \in \Omega$ and $k \in \mathbb{N}_0$.

The annealing chain is simulated as follows. Suppose $x_k = i \in \Omega$. Then generate a random variable $y$ with $P\{y = j\} = q_{ij}$ for $j \in \Omega$. Suppose $y = j \in \Omega$. Then set

$$x_{k+1} = \begin{cases} j & \text{if } U_j \le U_i, \\ j & \text{with probability } e^{-(U_j-U_i)/T_k} \text{ if } U_j > U_i, \\ i & \text{else.} \end{cases}$$

Hence we may think of the annealing algorithm as a "probabilistic descent" algorithm where the $Q$ matrix represents some prior distribution of "directions", transitions to same or lower energy states are always allowed, and transitions to higher energy states are allowed with positive probability which tends to 0 as $k \to \infty$ (when $T_k \to 0$ as $k \to \infty$).

Even though simulated annealing was proposed as a heuristic, its apparent success in dealing with hard combinatorial optimization problems makes it desirable to understand in a rigorous fashion why it works. The recent works of Geman [4], Gidas [5], and Mitra et al. [6] have approached this problem by showing the existence of an annealing schedule for which the sequence of invariant distributions $\{\pi^{(k)}\}_{k\in\mathbb{N}}$ converges weakly to the same limit, i.e., to a uniform distribution over $S^*$. In each case a (different) constant $c$ is given such that if $T_k \to 0$ as $k \to \infty$ and $T_k \ge c / \log k$ for large enough $k$, then

$$p_i^{(k)} \to \frac{1}{|S^*|} \ \text{ if } i \in S^*, \qquad p_i^{(k)} \to 0 \ \text{ if } i \notin S^*, \qquad \text{as } k \to \infty.$$
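The simulation procedure above is easy to sketch in code (a minimal illustration only; the 3-state energies, the $Q$ matrix, and the schedule $T_k = T/\log(k+2)$ are hypothetical choices made for the sketch, not the constructions analyzed in this paper):

```python
import math
import random

def anneal(U, Q, T_schedule, x0, steps, rng):
    """Simulate the annealing chain: propose j with P{y=j} = Q[i][j];
    accept uphill moves with probability exp(-(U[j]-U[i])/T_k),
    always accept moves to same or lower energy."""
    x = x0
    for k in range(steps):
        i = x
        j = rng.choices(range(len(U)), weights=Q[i])[0]   # y ~ row i of Q
        if U[j] <= U[i] or rng.random() < math.exp(-(U[j] - U[i]) / T_schedule(k)):
            x = j                                         # accept
    return x

# Hypothetical 3-state problem: state 2 is the unique minimum energy state.
U = [1.0, 2.0, 0.0]
Q = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]                 # symmetric, irreducible
T = lambda k: 4.0 / math.log(k + 2)   # schedule of the form T / log(k + k0)
print(anneal(U, Q, T, x0=0, steps=2000, rng=random.Random(0)))
```

Run repeatedly, the chain ends in the minimum energy state most of the time, illustrating the "probabilistic descent" interpretation.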
Furthermore, under an annealing schedule of the form $T_k = T / \log(k+k_0)$ where $T \ge c$ and $k_0 > 1$, Mitra et al. obtain an upper bound on $\sum_{i\in\Omega} |p_i^{(k)} - \pi_i^{(k)}|$ for $k \in \mathbb{N}_0$. The results of Geman, Gidas, and Mitra et al. are extensions of weak convergence results for stationary aperiodic irreducible chains [7] and certain nonstationary chains [8], and are useful in proving ergodic theorems (which Gidas does). However, if one is simply interested in finding any minimum energy state, then weak convergence seems unnecessarily strong.

In a recent paper Hajek [9] investigates when the annealing chain converges in probability to $S^*$. Hajek gives an expression for a constant $d^*$ such that under the annealing schedule $T_k = T / \log k$ for large enough $k \in \mathbb{N}$, $P\{x_k \in S^*\} \to 1$ as $k \to \infty$ iff $T \ge d^*$. Furthermore, the condition that $Q$ be symmetric is relaxed to what is called "weak reversibility".

In this paper, we analyze simulated annealing focusing on optimization issues. Here we are not so much interested in the statistics of individual states as in that of certain groups of states, such as the set $S^*$ of minimum energy states or, more generally, a set $S$ of all states with sufficiently low energy. In all of our results we consider an arbitrary partition $\{I,J\}$ of $\Omega$, and examine the behavior of the annealing chain relative to this partition; we obtain results for $I = S$ as a special case. We investigate both finite-time and asymptotic behavior as it depends on the $Q$ matrix and the annealing schedule of temperatures $\{T_k\}_{k\in\mathbb{N}_0}$.

In Section 2 we establish notation. In Section 3 we examine finite-time behavior. We observe that since we may keep track of the minimum energy state visited up to time $k$, it seems more appropriate to lower bound the probability of visiting $S$ at some time $n \le k$, rather than the probability of visiting $S$ at time $k$. Under an annealing schedule of the form $T_k = T / \log(k+k_0)$ where $T > 0$ and $k_0 \ge 1$, we obtain a lower bound on $P\{x_n \in I, \text{ some } n \le k\}$ for $k \in \mathbb{N}_0$.
For large $T$ this bound converges to 1 exponentially fast; for small $T$ the bound converges to a positive value less than 1. Hence the bound is potentially useful even for small $T$, when the algorithm may not converge.

In Section 4 we examine asymptotic behavior. First, we show that under suitable conditions on $Q$ there exists a constant $U^*$ such that if $T_k \ge U^* / \log k$ for large enough $k \in \mathbb{N}$, then the probability that $x_k \in I$ infinitely often is 1. Second, we show that under suitable conditions on $Q$ and $\{T_k\}_{k\in\mathbb{N}}$, if $T_k = T / \log k$ with $T > U^*$ for large enough $k \in \mathbb{N}$, then $x_k$ converges in probability to $I$; in fact, we show that $P\{x_k \in I\} = 1 - O(k^{-\tau/T})$ as $k \to \infty$, where $\tau > 0$ does not depend on $T$ and depends on $Q$ only through the set $\{(i,j) \in \Omega\times\Omega : q_{ij} > 0\}$ of ordered pairs of allowed transitions. Third, we show that under suitable conditions on $Q$ there exists a constant $U^{**}$ such that if $U^* < T < U^{**}$ and $T_k = T / \log k$ for large enough $k \in \mathbb{N}$, then the probability that $x_k \in I$ almost always is 1. Hence we obtain three results about the convergence of the annealing algorithm with increasingly stronger assumptions and conclusions. In Section 4 we also obtain a converse which gives conditions under which the annealing algorithm does not converge: we show that under suitable conditions on $Q$ there exists a constant $W^*$ such that if $\varepsilon > 0$ and $T_k \le (W^*-\varepsilon) / \log k$ for large enough $k \in \mathbb{N}$, then the probability that $x_k \in I$ infinitely often is $< 1$. Finally, we briefly compare our results to Hajek's work and indicate some directions for further research. We remark that Sections 3 and 4 are essentially independent of each other.

2. Notation and Preliminaries

In this section we describe notation which is necessary to state our results, give a few examples of this notation, and discuss a technical condition which we shall often impose in the sequel.

Let $\underline{U} = \min_{i\in\Omega} U_i$ and $\overline{U} = \max_{i\in\Omega} U_i$.
Define $S^* = \{i \in \Omega : U_i = \underline{U}\}$, and $S = \{i \in \Omega : U_i \le U\}$ for some $\underline{U} \le U < \overline{U}$.

Following standard notation, define $P^{(k,k+d)} = [p_{ij}^{(k,k+d)}]_{i,j\in\Omega}$ to be the $d$-step transition matrix starting at time $k$, i.e., $P^{(k,k+d)} = P^{(k,k+1)} \cdots P^{(k+d-1,k+d)}$, for all $k \in \mathbb{N}_0$ and $d \in \mathbb{N}$.

In defining the annealing chain $\{x_k\}_{k\in\mathbb{N}_0}$ in Section 1 we assumed that the stochastic matrix $Q$ was symmetric and irreducible. This assumption is unnecessarily strong for our purposes. If $\{I,J\}$ is a partition of $\Omega$ and we want $x_k$ to converge to $I$ as $k \to \infty$, then we need only require some kind of condition which guarantees that transitions can be made from $J$ to $I$, and possibly another condition which makes transitions from $J$ to $I$ more likely than transitions from $I$ to $J$, depending on the mode of convergence. We will be more precise in Section 4; for now assume that $Q$ is an arbitrary stochastic matrix.

For $i,j \in \Omega$ we shall say that $i$ can reach $j$ if there exists a sequence of states $i = i_0, i_1, \ldots, i_k = j$ such that $q_{i_n i_{n+1}} > 0$ for $n = 0,\ldots,k-1$; if, in addition, $U_{i_n} \le U$ for $n = 0,\ldots,k$, we shall say that $i$ can reach $j$ at energy $U \in \mathbb{R}$. For every $i,j \in \Omega$ and $d \in \mathbb{N}$ let

$$\Lambda_{ij}^{(d)} = \{\lambda = (i_0,\ldots,i_d) : i_0 = i,\ i_d = j,\ q_{i_n i_{n+1}} > 0 \text{ for } n = 0,\ldots,d-1\},$$
$$M_{ij}^{(d)} = \{\lambda = (i_0,\ldots,i_d) : i_0 = i,\ i_d = j,\ p_{i_n i_{n+1}}^{(k,k+1)} > 0 \text{ for } n = 0,\ldots,d-1\}$$

(since $T_k > 0$ for all $k \in \mathbb{N}_0$, the set $M_{ij}^{(d)}$ does not depend on $k$). We might think of $\Lambda_{ij}^{(d)}$ as the sequences of allowed transitions of length $d$ from $i$ to $j$ at infinite temperature, and $M_{ij}^{(d)}$ as the sequences of allowed transitions of length $d$ from $i$ to $j$ at positive temperature. Note that $\Lambda_{ij}^{(d)} \subset M_{ij}^{(d)}$, and the elements of $M_{ij}^{(d)} \setminus \Lambda_{ij}^{(d)}$ are precisely those sequences which contain a self-transition, say at a state $s$, with $q_{ss} = 0$.

Now for $d \in \mathbb{N}$, $i,j \in \Omega$, and $\lambda = (i_0,\ldots,i_d) \in M_{ij}^{(d)}$ let

$$U(\lambda) = \sum_{n=0}^{d-1} \max[0,\ U_{i_{n+1}} - U_{i_n}],$$
$$V(\lambda) = \max_{n=0,\ldots,d-1} \max[0,\ U_{i_{n+1}} - U_{i_0}],$$
$$W(\lambda) = \max_{n=0,\ldots,d-1} \max[0,\ U_{i_{n+1}} - U_{i_n}].$$

Also let

$$U_{ij}^{(d)} = \min_{\lambda \in M_{ij}^{(d)}} U(\lambda) \ \text{ if } M_{ij}^{(d)} \ne \emptyset, \qquad U_{ij}^{(d)} = \infty \ \text{ if } M_{ij}^{(d)} = \emptyset, \tag{2.1}$$

for all $d \in \mathbb{N}$ and $i,j \in \Omega$, and let $U_{ij} = \inf_{d\in\mathbb{N}} U_{ij}^{(d)}$ for all $i,j \in \Omega$. Similarly define $V_{ij}^{(d)}$, $V_{ij}$ and $W_{ij}^{(d)}$, $W_{ij}$ by replacing $U(\lambda)$ with $V(\lambda)$ and $W(\lambda)$, respectively, in the definitions above. Finally, if one or both of the indices $i,j \in \Omega$ are replaced by subsets $I,J \subset \Omega$ in these definitions, then an additional minimization is to be performed over the elements of $I,J$; e.g., $U_{IJ}^{(d)} = \min_{i\in I,\, j\in J} U_{ij}^{(d)}$ and $U_{IJ} = \min_{i\in I,\, j\in J} U_{ij}$, etc. Note that if we replace $M_{ij}^{(d)}$ by $\Lambda_{ij}^{(d)}$ in the definitions of $U_{ij}^{(d)}$, $V_{ij}^{(d)}$, and $W_{ij}^{(d)}$, then the values of these quantities will in general be changed; however, the values of $U_{ij}$, $V_{ij}$, and $W_{ij}$ will be unchanged. We shall refer to $U_{xy}$ ($U_{xy}^{(d)}$) as the transition energy ($d$-step transition energy) from $x$ to $y$, for $x,y \in \Omega \cup 2^\Omega$.

Example 2.1  In Figure 2.1 we show a state transition diagram for $\Omega = \{1,\ldots,5\}$ where transitions are governed by the $Q$ matrix, i.e., an edge from $i \in \Omega$ to $j \in \Omega$ is shown iff $q_{ij} > 0$, in which case the edge is labelled with the value of the corresponding $q_{ij}$. To obtain the state transition diagram for the $P^{(k,k+1)}$ matrix, $k \in \mathbb{N}_0$, simply add a self-transition loop to every state which can make a transition to a higher energy state (if one is not already present) and relabel the edges appropriately. Self-transitions which are allowed under $P^{(k,k+1)}$ but not under $Q$ are depicted by broken loops. Also observe that the ordinate axis gives the energy of the corresponding state. To illustrate the notation we have

$$\Lambda_{15}^{(5)} = \{(1,1,2,3,4,5)\}, \qquad M_{15}^{(5)} = \{(1,1,2,3,4,5),\,(1,2,3,3,4,5),\,(1,2,3,4,5,5)\},$$

and $U_{15} = 4$, $V_{15} = 3$, $W_{15} = 2$.

Let $\{I,J\}$ be a partition of $\Omega$. In Section 4 we will often impose the following condition: there exists $d_0 \in \mathbb{N}$ such that the $d$-step transition energy from $j$ to $I$ equals the transition energy from $j$ to $I$, for all $d \ge d_0$ and $j \in J$ ($U_{jI}^{(d)} = U_{jI}$ for all $d \ge d_0$, $j \in J$). This will allow us to get lower bounds on the quantity $P\{x_{(k+1)d} \in I \mid x_{kd} = j\}$ for $j \in J$.
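The path quantities $U(\lambda)$, $V(\lambda)$, and $W(\lambda)$ defined above are straightforward to compute; the following sketch (with a hypothetical energy function, not the one of Figure 2.1) illustrates the three definitions:

```python
def path_energies(U, lam):
    """Given energies U and a path lam = (i_0, ..., i_d), return
    U(lam): total uphill energy, sum over n of max[0, U[i_{n+1}] - U[i_n]];
    V(lam): maximum elevation above the starting state;
    W(lam): largest single uphill step."""
    ups = [max(0.0, U[b] - U[a]) for a, b in zip(lam, lam[1:])]
    U_lam = sum(ups)
    V_lam = max(max(0.0, U[b] - U[lam[0]]) for b in lam[1:])
    W_lam = max(ups)
    return U_lam, V_lam, W_lam

# Hypothetical energies for states 0..4.
U = [1.0, 3.0, 2.0, 4.0, 0.0]
print(path_energies(U, (0, 1, 2, 3, 4)))  # prints (4.0, 3.0, 2.0)
```

Here the path climbs 2, descends 1, climbs 2, and descends 4, so the total uphill energy is 4, the maximum elevation above the start is 3, and the largest single uphill step is 2.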
It is easy to show that if every $j \in J$ can reach $I$ ($U_{jI} < \infty$ for all $j \in J$), then this condition is satisfied; in fact, in this case there exists $d_0 \le |J|$ such that $U_{jI}^{(d)} = U_{jI}$ for all $d \ge d_0$ and $j \in J$.

Example 2.2  In Figure 2.2 we show a state transition diagram for $\Omega = \{1,\ldots,7\}$ (see Example 2.1). Let $I = S = \{i \in \Omega : U_i < 2\} = \{1,2\}$ and $J = \{3,\ldots,7\}$. Then $U_{3I}^{(d)} = U_{3I} = 1$ for $d \ge 2$ and $U_{7I}^{(d)} = U_{7I} = 2$ for $d \ge 3$, so the condition above is satisfied. Note that if we replace $M_{ij}^{(d)}$ by $\Lambda_{ij}^{(d)}$ in the definition (2.1), then there does not exist $d \in \mathbb{N}$ such that $U_{3I}^{(d)} = U_{3I}$.

3. Finite-time Behavior

From the point of view of applications it is important to understand the finite-time behavior of the annealing algorithm. Certainly it is interesting to know whether the annealing algorithm converges according to various criteria, and this information may well give insight into finite-time behavior. However, this information may also be misleading, for the following reasons. First, the finite-time behavior of the annealing algorithm may be quite satisfactory even when the algorithm does not converge, which may well be the case for typical applications. Second, the finite-time behavior of the annealing algorithm may not be clearly related to the convergence rate when the algorithm does converge, as the following example indicates.

Example 3.1  It is a simple consequence of Proposition 4.1(ii) that if $Q$ is symmetric and irreducible, $T > 0$, and $T_k \ge T / \log k$ for large enough $k \in \mathbb{N}$, then there exist $a, \alpha > 0$ such that $P\{x_k \in S^*\} \le 1 - a\,k^{-\alpha}$ for large enough $k \in \mathbb{N}$.
Now let $\{y_k\}_{k\in\mathbb{N}_0}$ be a stationary Markov chain with 1-step transition matrix $\bar P$ obtained from $P^{(k,k+1)}$ by setting $T_k = 0$, and some initial distribution, constructed on $(\bar\Omega, \mathcal{A}, P)$. Since $S^*$ is just the set of persistent states for this chain, it is well-known that there exist $b > 0$ and $0 < \rho < 1$ such that $P\{y_k \in S^*\} \ge 1 - b\rho^k$ for all $k \in \mathbb{N}_0$. Hence, assuming that $T$ is chosen such that $P\{x_k \in S^*\} \to 1$ as $k \to \infty$, the rate at which $P\{x_k \in S^*\} \to 1$ is at best polynomial, while the rate at which $P\{y_k \in S^*\} \to 1$ is at worst exponential. Of course, we would hope that the finite-time behavior of the annealing chain would be better than the finite-time behavior of the stationary chain, for an appropriate choice of $T$.

We now address the question of what is an appropriate criterion to assess the finite-time behavior of the annealing algorithm. For our purposes, we are simply interested in finding any state of sufficiently low energy, i.e., an element of $S$. Hence it seems reasonable to lower bound $P\{x_k \in S\}$ for $k \in \mathbb{N}_0$. However, we observe that by just doubling the annealing algorithm's memory requirements we can keep track of one of the minimum energy states visited by the chain up to the current time. In this case we are really interested in having visited $S$ at some time $n \le k$, as opposed to actually occupying $S$ at time $k$. Hence it seems more appropriate to lower bound $P\{x_n \in S, \text{ some } n \le k\}$ for $k \in \mathbb{N}_0$.

We start with a proposition which gives a lower bound on the $d$-step transition probabilities in terms of the transition energies $U(\lambda)$ of sequences $\lambda \in M_{ij}^{(d)}$.

Proposition 3.1  Let $T > 0$, $k_0 > 1$, and $T_k = T / \log(k+k_0)$ for $k \in \mathbb{N}_0$. Then for every $d \in \mathbb{N}$ and $i,j \in \Omega$

$$p_{ij}^{(k,k+d)} \ge \sum_{\lambda \in M_{ij}^{(d)}} r(\lambda)\,(k+k_0+d-1)^{-U(\lambda)/T}, \qquad k \in \mathbb{N}_0, \tag{3.1}$$

where $r(\lambda) > 0$ is given in (3.2).

Proof  Let $r_{ij}^{(k)} = q_{ij}$ for $j \ne i$ and $r_{ii}^{(k)} = p_{ii}^{(k,k+1)}$, for all $i,j \in \Omega$ and $k \in \mathbb{N}_0$. Also, for every $i,j \in \Omega$ and $\lambda = (i_0,\ldots,i_d) \in M_{ij}^{(d)}$ let

$$\Delta_n(\lambda) = \max[0,\ U_{i_{n+1}} - U_{i_n}], \qquad n = 0,\ldots,d-1,$$
$$r_k(\lambda) = \prod_{n=0}^{d-1} r_{i_n i_{n+1}}^{(k+n)} > 0, \qquad k \in \mathbb{N}_0,$$

and

$$r(\lambda) = r_0(\lambda) > 0. \tag{3.2}$$

Since $T_k$ is strictly decreasing, $p_{ii}^{(k,k+1)}$ is nondecreasing in $k$, so that $r_k(\lambda) \ge r(\lambda)$ for all $k \in \mathbb{N}_0$.
Hence for every $i,j \in \Omega$

$$p_{ij}^{(k,k+d)} \ge \sum_{(i_0,\ldots,i_d)\in M_{ij}^{(d)}} \prod_{n=0}^{d-1} p_{i_n i_{n+1}}^{(k+n,k+n+1)} = \sum_{\lambda\in M_{ij}^{(d)}} r_k(\lambda)\exp\Bigl[-\sum_{n=0}^{d-1}\frac{\Delta_n(\lambda)}{T_{k+n}}\Bigr] \ge \sum_{\lambda\in M_{ij}^{(d)}} r(\lambda)\exp\Bigl[-\frac{\log(k+k_0+d-1)}{T}\sum_{n=0}^{d-1}\Delta_n(\lambda)\Bigr] = \sum_{\lambda\in M_{ij}^{(d)}} r(\lambda)\,(k+k_0+d-1)^{-U(\lambda)/T}.$$

Remarks on Proposition 3.1
(1) In Figure 2.1 we have $r((1,2,3,4,5)) = q_{12}q_{23}q_{34}q_{45}$, $r((2,3,3,4,5)) = q_{23}\,p_{33}^{(1,2)}\,q_{34}q_{45}$, and $r((2,3,4,5,5)) = q_{23}q_{34}q_{45}\,p_{55}^{(3,4)}$.

(2) Fix $k \in \mathbb{N}_0$. From (3.2) it is easy to see that $r(\lambda)$ is nondecreasing as $T$ decreases or $k_0$ increases, which reflects the fact that self-transitions in the sequence $\lambda$ have larger probability at lower temperature. On the other hand, $(k+k_0+d-1)^{-U(\lambda)/T} \downarrow 0$ as $T \downarrow 0$ or $k_0 \uparrow \infty$ (if $U(\lambda) > 0$), which reflects the fact that transitions to higher energy states in the sequence $\lambda$ have smaller probability at lower temperature. These two phenomena compete with each other in the lower bound (3.1).

The next theorem gives a lower bound on $P\{x_n \in S, \text{ some } n \le k\}$ by setting $I = S$.

Theorem 3.1  Let $\{I,J\}$ be a partition of $\Omega$, and for $d \in \mathbb{N}$ let $U = \max_{j\in J} U_{jI}^{(d)}$. Also let $T > 0$, $k_0 \ge 1$, and $T_k = T / \log(k+k_0)$ for $k \in \mathbb{N}_0$. Then

$$P\{x_{nd} \in J,\ n = 0,\ldots,k\} \le \begin{cases} \exp\Bigl[-\dfrac{a}{d(1-\alpha)}\bigl((kd+n_0)^{1-\alpha} - n_0^{1-\alpha}\bigr)\Bigr] & \text{if } T > U, \\[1ex] \exp\Bigl[-\dfrac{a}{d}\log\dfrac{kd+n_0}{n_0}\Bigr] & \text{if } T = U, \\[1ex] \exp\Bigl[-\dfrac{a}{d(\alpha-1)}\bigl(n_0^{-(\alpha-1)} - (kd+n_0)^{-(\alpha-1)}\bigr)\Bigr] & \text{if } T < U, \end{cases} \tag{3.3}$$

for all $k \in \mathbb{N}_0$, where $\alpha = U/T$, $n_0 = k_0+d-1$, and $a > 0$ is given in (3.5).

Note  In the statement of Theorem 3.1 and in the proof to follow we suppress the dependence of the constants $U$ and $a$ on $d$. Later, we shall make this dependence explicit by writing $U^{(d)}$ and $a^{(d)}$.

Proof  From Proposition 3.1, for every $i,j \in \Omega$

$$p_{ij}^{(k,k+d)} \ge \sum_{\lambda \in M_{ij}^{(d)}} r(\lambda)\,(k+k_0+d-1)^{-U(\lambda)/T}, \qquad k \in \mathbb{N}_0,$$

where $r(\lambda) > 0$ is given in (3.2). Hence

$$P\{x_{nd} \in J,\ n = 0,\ldots,k\} \le \prod_{n=0}^{k-1} \max_{j\in J}\, P\{x_{(n+1)d} \in J \mid x_{nd} = j\} = \prod_{n=0}^{k-1} \max_{j\in J}\Bigl[1 - \sum_{i\in I} p_{ji}^{(nd,(n+1)d)}\Bigr] \le \prod_{n=0}^{k-1} \bigl[1 - a\,(nd+n_0)^{-\alpha}\bigr], \tag{3.4}$$

where

$$a = \min_{j\in J} \sum_{i\in I} \sum_{\substack{\lambda \in M_{ji}^{(d)} \\ U(\lambda) \le U}} r(\lambda) > 0 \tag{3.5}$$

(if $U = \infty$ let $a$ be any positive real). Since $1+x \le e^x$ for all $x \in \mathbb{R}$, we have

$$\prod_{n=0}^{k-1} \bigl[1 - a(nd+n_0)^{-\alpha}\bigr] \le \exp\Bigl[-a\sum_{n=0}^{k-1} (nd+n_0)^{-\alpha}\Bigr] \qquad \text{for all } k \in \mathbb{N}.$$
Also,

$$\sum_{n=0}^{k-1} (nd+n_0)^{-\alpha} \ge \int_0^k (xd+n_0)^{-\alpha}\,dx = \begin{cases} \dfrac{(kd+n_0)^{1-\alpha} - n_0^{1-\alpha}}{d(1-\alpha)} & \text{if } \alpha \ne 1, \\[1ex] \dfrac{1}{d}\log\dfrac{kd+n_0}{n_0} & \text{if } \alpha = 1. \end{cases} \tag{3.6}$$

Combining (3.4) and (3.6) completes the proof.

Remarks on Theorem 3.1
(1) Let $I = S^* = \{5\}$, $J = \{1,2,3,4\}$, and $d = 4$ in Figure 2.1. Then $U = U^{(4)} = U_{15}^{(4)} = 4$ and

$$a = \min_{j\in\{1,2,3,4\}}\ \sum_{\substack{\lambda \in M_{j5}^{(4)} \\ U(\lambda) \le 4}} r(\lambda).$$

Now it is not hard to see that the minimum is obtained by $j = 1$ or $2$; using the values of $r(\lambda)$ computed in the first remark following Proposition 3.1, we have

$$a = \min\bigl[\,r((1,2,3,4,5)),\ r((2,3,3,4,5)) + r((2,3,4,5,5))\,\bigr].$$

(2) Note that

$$P\{x_{nd} \in J,\ n \in \mathbb{N}_0\} = \lim_{k\to\infty} P\{x_{nd} \in J,\ n = 0,\ldots,k\} \le \exp\Bigl[-\frac{a}{d(\alpha-1)}\,n_0^{-(\alpha-1)}\Bigr] \quad \text{if } T < U,$$

so that the bound is potentially useful even when $T < U$.

(3) Fix $k \in \mathbb{N}$. It will be convenient to analyze the dependence of the upper bound (3.3) on $T$ and $k_0$ in the form

$$P\{x_{nd} \in J,\ n = 0,\ldots,k\} \le \exp\Bigl[-a\int_0^k (xd+n_0)^{-\alpha}\,dx\Bigr] \tag{3.7}$$

(see (3.6)). Since $r(\lambda)$ is nondecreasing as $T$ decreases or $k_0$ increases, we have from (3.5) that $a$ is nondecreasing as $T$ decreases or $k_0$ increases, which reflects the fact that self-transitions in sequences of transitions from $J$ to $I$ have larger probability at lower temperature. On the other hand, $\int_0^k (xd+n_0)^{-\alpha}\,dx \downarrow 0$ as $T \downarrow 0$ or $k_0 \uparrow \infty$ (if $U > 0$), which reflects the fact that transitions to higher energy states in sequences of transitions from $J$ to $I$ have smaller probability at lower temperature. Since these two phenomena compete with each other, one could consider minimizing the r.h.s. of (3.7) over $T$ and $k_0$ to obtain the best bound.

(4) We can generalize Theorem 3.1 by replacing $U = \max_{j\in J} U_{jI}^{(d)}$ with any $U' \ge U$ (if $U' < U$ then $a = 0$ and the upper bound (3.3) is useless). Since $a$ and $\alpha$ are both nondecreasing with increasing $U'$, one could consider minimizing the r.h.s. of (3.7) over $U'$ as well as $T$ and $k_0$ to obtain the best bound (see the previous remark).

In order to apply Theorem 3.1 we must obtain suitable estimates for the constants $U^{(d)}$ and $a^{(d)}$. We are currently investigating this in the context of a particular problem.
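The three regimes of the bound in Theorem 3.1 can be checked numerically. The sketch below (the constants $a = 0.3$, $d = 2$, $k_0 = 1$, and the values of $\alpha = U/T$ are hypothetical choices for the illustration) compares the product form of the bound with the closed-form exponential bound obtained from the integral comparison:

```python
import math

def product_bound(a, alpha, d, k0, k):
    """Product form: prod over n < k of (1 - a (nd + n0)^(-alpha)),
    with n0 = k0 + d - 1."""
    n0 = k0 + d - 1
    p = 1.0
    for n in range(k):
        p *= 1.0 - a * (n * d + n0) ** (-alpha)
    return p

def exp_bound(a, alpha, d, k0, k):
    """Closed-form bound exp(-a * integral of (xd + n0)^(-alpha) over [0,k])."""
    n0 = k0 + d - 1
    if alpha == 1.0:
        integral = math.log((k * d + n0) / n0) / d
    else:
        integral = ((k * d + n0) ** (1 - alpha) - n0 ** (1 - alpha)) / (d * (1 - alpha))
    return math.exp(-a * integral)

# alpha < 1 (T > U): bound -> 0; alpha = 1 (T = U): polynomial decay;
# alpha > 1 (T < U): bound converges to a positive value less than 1.
for alpha in (0.5, 1.0, 2.0):
    print(alpha, product_bound(0.3, alpha, 2, 1, 200), exp_bound(0.3, alpha, 2, 1, 200))
```

Since the summand is decreasing, the sum dominates the integral, so the product form never exceeds the closed-form bound; for $\alpha > 1$ the bound stays bounded away from 0 even as $k \to \infty$, matching remark (2).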
4. Asymptotic Analysis

In the previous section we pointed out some of the difficulties associated with using the asymptotic behavior of the annealing algorithm to predict its finite-time behavior. Nonetheless, it is certainly interesting from a theoretical viewpoint to perform an asymptotic analysis, i.e., to find conditions under which the annealing algorithm does or does not converge according to various criteria, and, when the algorithm converges, to estimate the rate of convergence as well. In this section we address these questions, and then briefly compare our results to Hajek's work and indicate some directions for further research.

We first address the question of what are appropriate criteria to assess the asymptotic performance of the annealing algorithm. For our purposes, we are simply interested in finding any state of sufficiently low energy, i.e., an element of $S$. Hence we shall investigate conditions on the $Q$ matrix and the annealing schedule of temperatures $\{T_k\}_{k\in\mathbb{N}_0}$ under which one or more of the following is true:

(i) $P\{x_k \in S \text{ i.o.}\} = 1$,
(ii) $P\{x_k \in S\} \to 1$ as $k \to \infty$,
(iii) $P\{x_k \in S \text{ a.a.}\} = 1$.

Here "i.o." and "a.a." are abbreviations for "infinitely often" and "almost always", i.e.,

$$\{x_k \in S \text{ i.o.}\} = \limsup_{k\to\infty}\, \{x_k \in S\} = \bigcap_{n=1}^{\infty} \bigcup_{k\ge n} \{x_k \in S\}$$

and

$$\{x_k \in S \text{ a.a.}\} = \liminf_{k\to\infty}\, \{x_k \in S\} = \bigcup_{n=1}^{\infty} \bigcap_{k\ge n} \{x_k \in S\}.$$

Since (c.f. [7])

$$P\{x_k \in S \text{ a.a.}\} \le \liminf_{k\to\infty} P\{x_k \in S\} \le \limsup_{k\to\infty} P\{x_k \in S\} \le P\{x_k \in S \text{ i.o.}\}, \tag{4.1}$$

it follows that (i), (ii), and (iii) are increasingly strong results, and so we expect increasingly strong conditions under which each is true. We are also interested in obtaining the rate of convergence in (ii), as well as conditions under which (i), (ii), and (iii) do not hold.

We start by giving a proposition which establishes asymptotic upper and lower bounds on the $d$-step transition probability $p_{ij}^{(k,k+d)}$ in terms of the transition energy $U_{ij}$, for $i,j \in \Omega$.
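Before stating the proposition, it may help to note the elementary scaling that drives these bounds (a numerical sketch; the values of $q_{ij}$, $U_j - U_i$, and $T$ are hypothetical, and this is an illustration rather than part of any proof): under the schedule $T_k = T/\log k$, a 1-step uphill transition probability $q_{ij}\,e^{-(U_j-U_i)/T_k}$ equals $q_{ij}\,k^{-(U_j-U_i)/T}$, i.e., it decays polynomially in $k$.

```python
import math

def p_up(q, dU, T, k):
    """1-step uphill transition probability q * exp(-dU / T_k)
    under the logarithmic schedule T_k = T / log k."""
    Tk = T / math.log(k)
    return q * math.exp(-dU / Tk)

# exp(-dU / T_k) = exp(-(dU / T) * log k) = k**(-dU / T), so the uphill
# probability decays polynomially in k with exponent dU / T:
q, dU, T = 0.5, 2.0, 4.0
for k in (10, 100, 1000):
    print(k, p_up(q, dU, T, k), q * k ** (-dU / T))
```

The two printed columns agree, which is exactly the $k^{-U_{ij}/T}$ behavior that Proposition 4.1 extends to $d$-step transition probabilities.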
Proposition 4.1  Let $d \in \mathbb{N}$ and $T > 0$. Then there exist $a_{ij} > 0$, $i,j \in \Omega$, such that each of the following is true:

(i) if $T_k \le T / \log k$ for large enough $k \in \mathbb{N}$, then for all $i,j \in \Omega$

$$p_{ij}^{(k,k+d)} \le a_{ij}\,k^{-U_{ij}/T} \qquad \text{for large enough } k \in \mathbb{N};$$

(ii) if $T_k \ge T / \log k$ for large enough $k \in \mathbb{N}$ and $T_k \to 0$ as $k \to \infty$, then for all $i,j \in \Omega$ such that $U_{ij}^{(d)} = U_{ij}$

$$\liminf_{k\to\infty}\ p_{ij}^{(k,k+d)}\,k^{U_{ij}/T} \ge a_{ij};$$

(iii) if $T_k = T / \log k$ for large enough $k \in \mathbb{N}$, then for all $i,j \in \Omega$ such that $U_{ij}^{(d)} = U_{ij}$

$$p_{ij}^{(k,k+d)} \ge a_{ij}\,k^{-U_{ij}/T} \qquad \text{for large enough } k \in \mathbb{N}.$$

Proof  We prove (i); the proof of (ii) is similar, and (iii) follows from (i) and (ii). So assume $T_k \le T / \log k$ for large enough $k \in \mathbb{N}$. For every $i,j \in \Omega$ and $\lambda = (i_0,\ldots,i_d) \in M_{ij}^{(d)}$ let

$$\Delta_n(\lambda) = \max[0,\ U_{i_{n+1}} - U_{i_n}], \qquad n = 0,\ldots,d-1,$$
$$r_k(\lambda) = \prod_{n=0}^{d-1} r_{i_n i_{n+1}}^{(k+n)} > 0, \qquad k \in \mathbb{N},$$

and $r(\lambda) = \lim_{k\to\infty} r_k(\lambda) = \sup_{k\in\mathbb{N}} r_k(\lambda) > 0$. That the limit in the definition of $r(\lambda)$ exists and is equal to the supremum is a consequence of $\lim_{k\to\infty} p_{ii}^{(k,k+1)} = \sup_{k\in\mathbb{N}} p_{ii}^{(k,k+1)}$ (since $T_k \to 0$ as $k \to \infty$). Hence for every $i,j \in \Omega$

$$p_{ij}^{(k,k+d)} = \sum_{(i_0,\ldots,i_d)\in M_{ij}^{(d)}} \prod_{n=0}^{d-1} p_{i_n i_{n+1}}^{(k+n,k+n+1)} = \sum_{\lambda\in M_{ij}^{(d)}} r_k(\lambda)\exp\Bigl[-\sum_{n=0}^{d-1}\frac{\Delta_n(\lambda)}{T_{k+n}}\Bigr] \le \sum_{\lambda\in M_{ij}^{(d)}} r(\lambda)\,k^{-U(\lambda)/T} \quad (k \text{ large enough}) \le a_{ij}\,k^{-U_{ij}/T},$$

where

$$a_{ij} = \sum_{\lambda\in M_{ij}^{(d)}} r(\lambda) > 0$$

(if $M_{ij}^{(d)} = \emptyset$ let $a_{ij}$ be any positive real), using the fact that $U(\lambda) \ge U_{ij}$ for all $\lambda \in M_{ij}^{(d)}$.

The following theorem gives conditions under which $P\{x_k \in S \text{ i.o.}\} = 1$ by setting $I = S$.

Theorem 4.1  Let $\{I,J\}$ be a partition of $\Omega$ and assume
(a) there exists $d \in \mathbb{N}$ such that the $d$-step transition energy from $j$ to $I$ equals the transition energy from $j$ to $I$, for all $j \in J$ ($U_{jI}^{(d)} = U_{jI}$ for all $j \in J$);
(b) every $j \in J$ can reach some $i \in I$ ($\max_{j\in J} U_{jI} < \infty$).
Also let $U^* = \max_{j\in J} U_{jI}$, $T_k \ge U^* / \log k$ for large enough $k \in \mathbb{N}$, and $T_k \to 0$ as $k \to \infty$. Then $P\{x_k \in I \text{ i.o.}\} = 1$.

Proof  From Proposition 4.1(ii) there exist $a > 0$ and $\bar k \in \mathbb{N}$ such that

$$p_{ij}^{(k,k+d)} \ge a\,k^{-U_{ij}/U^*} \qquad \text{for all } k \ge \bar k$$

and all $i,j \in \Omega$ such that $U_{ij}^{(d)} = U_{ij}$.
Hence for every large enough $k \in \mathbb{N}$

$$P\{x_{nd} \in J,\ n \ge k\} \le \prod_{n=k}^{\infty} \max_{j\in J}\, P\{x_{(n+1)d} \in J \mid x_{nd} = j\} = \prod_{n=k}^{\infty}\Bigl[1 - \min_{j\in J}\sum_{i\in I} p_{ji}^{(nd,(n+1)d)}\Bigr] \le \prod_{n=k}^{\infty}\bigl[1 - a\,(nd)^{-1}\bigr] = 0,$$

since $\min_{j\in J}\sum_{i\in I} p_{ji}^{(nd,(n+1)d)} \ge a\,(nd)^{-\max_{j\in J} U_{jI}/U^*} = a\,(nd)^{-1}$ for all large enough $n$, by (a). Hence $P\{x_{nd} \in I \text{ i.o.}\} = 1$, and the theorem follows.

Remark on Theorem 4.1  In Figure 2.1 let $I = S^* = \{5\}$ and $J = \{1,2,3,4\}$. Then $U^* = U_{15} = 4$.

Our next theorem gives conditions under which $P\{x_k \in S\} \to 1$ as $k \to \infty$, and obtains an estimate of the rate of convergence as well. We shall need the following lemma.

Lemma  Let $a > 0$, $0 \le \alpha < 1$, and $\gamma > \alpha$. Then
(i) $\prod_{n=m}^{k} \bigl(1 - a\,n^{-\alpha}\bigr) = O\bigl(e^{-b k^{1-\alpha}}\bigr)$ as $k \to \infty$ for each fixed $m \in \mathbb{N}$, where $b = a/(1-\alpha) > 0$;
(ii) $\displaystyle\sum_{m=m_0}^{k} m^{-\gamma} \prod_{n=m+1}^{k} \bigl(1 - a\,n^{-\alpha}\bigr) = O\bigl(k^{-r}\bigr)$ as $k \to \infty$ for each fixed $m_0 \in \mathbb{N}$, where $r = \gamma - \alpha > 0$.

Theorem 4.2  Let $\{I,J\}$ be a partition of $\Omega$ and assume
(a) there exists $d \in \mathbb{N}$ such that the $d$-step transition energy from $j$ to $I$ equals the transition energy from $j$ to $I$, for all $j \in J$ ($U_{jI}^{(d)} = U_{jI}$ for all $j \in J$);
(b) every $j \in J$ can reach some $i \in I$ ($\max_{j\in J} U_{jI} < \infty$);
(c) the transition energy from $I$ to $j$ is greater than the transition energy from $j$ to $I$, for all $j \in J$ ($\min_{j\in J}[U_{Ij} - U_{jI}] > 0$).
Also let $U^* = \max_{j\in J} U_{jI}$, $T > U^*$, and $T_k = T / \log k$ for large enough $k \in \mathbb{N}$. Then $P\{x_k \in I\} \to 1$ as $k \to \infty$. Furthermore, if we assume
(d) some $i \in I$ can reach some $j \in J$ ($U_{IJ} < \infty$),
then

$$P\{x_k \in I\} = 1 - O\bigl(k^{-\tau/T}\bigr) \qquad \text{as } k \to \infty,$$

where $\tau = \min_{j\in J}[U_{Ij} - U_{jI}]$ ($0 < \tau < \infty$ by (c) and (d)).

Proof  From Proposition 4.1 there exists $a_1 > 0$ such that

$$p_{ij}^{(k,k+d)} \le a_1\,k^{-U_{ij}/T}, \qquad k \text{ large enough}, \tag{4.2}$$

for all $i,j \in \Omega$. Also from Proposition 4.1 there exists $a_2 > 0$ such that

$$p_{ij}^{(k,k+d)} \ge a_2\,k^{-U_{ij}/T}, \qquad k \text{ large enough}, \tag{4.3}$$

for all $i,j \in \Omega$ such that $U_{ij}^{(d)} = U_{ij}$. In the sequel, (4.2) ((4.3)) will be used to upper (lower) bound the probability of transitions from $I$ to $J$ (from $J$ to $I$). Let $J_1,\ldots,J_{r_0}$ be the partition of $J$ such that $U_{jI}$ is constant on each $J_r$ and $U_{J_r I} < U_{J_s I}$ for all $r < s$. For example, in Figure 2.1 let $I = \{5\}$ and $J = \{1,2,3,4\}$; then $J_1 = \{4\}$, $J_2 = \{2,3\}$, and $J_3 = \{1\}$.
Also let $a = U^*/T$, $a_r = U_{J_r I}/T$, and $\gamma_r = U_{IJ_r}/T$ for $r = 1,\ldots,r_0$. Note that $a_r \le a < 1$ for $r = 1,\ldots,r_0$. Let $K_r = J_1 \cup \cdots \cup J_r$ and $k_r = |K_r|$, so that $K_{r_0} = J$. Finally, let

$$\rho(j,m,n,r) = P\{x_{kd} \in K_r,\ k = m+1,\ldots,n \mid x_{md} = j\}$$

and

$$\sigma(i,j,m,n,r) = P\{x_{kd} \in K_r,\ k = m+1,\ldots,n-1;\ x_{nd} = j \mid x_{md} = i\}$$

for $i,j \in \Omega$, $m,n \in \mathbb{N}$, and $r = 1,\ldots,r_0$. Then for every $k_0 \in \mathbb{N}$ we can write

$$P\{x_{kd} \in J\} = p_1^{(k)} + p_2^{(k)} \tag{4.4}$$

where

$$p_1^{(k)} = \sum_{j\in J} p_j^{(k_0 d)}\,\rho(j,k_0,k,r_0) \tag{4.5}$$

and

$$p_2^{(k)} = \sum_{m=k_0}^{k-1} \sum_{i\in I} \sum_{j\in J} p_i^{(md)}\, p_{ij}^{(md,(m+1)d)}\,\rho(j,m+1,k,r_0)$$

for all $k = k_0+1,\ldots$. In words, $p_1^{(k)}$ is the probability that $x_{nd} \in J$ for all $n = k_0,\ldots,k$, and $p_2^{(k)}$ is the probability that $x_{md} \in I$ for some $m = k_0,\ldots,k-1$ and $x_{nd} \in J$ for all $n = m+1,\ldots,k$. We can further write

$$p_2^{(k)} = p_3^{(k)} + p_4^{(k)} \tag{4.6}$$

where

$$p_3^{(k)} = \sum_{m=k_0}^{k-1} \sum_{r=1}^{r_0} \sum_{i\in I} \sum_{j\in J_r} p_i^{(md)}\, p_{ij}^{(md,(m+1)d)}\,\rho(j,m+1,k,r) \tag{4.7}$$

and

$$p_4^{(k)} = \sum_{m=k_0}^{k-2} \sum_{r=2}^{r_0} \sum_{i\in I} \sum_{j\in J_r} \sum_{n=m+2}^{k} p_i^{(md)}\,\sigma(i,j,m,n,r-1)\,\rho(j,n,k,r) \tag{4.8}$$

for all $k = k_0+2,\ldots$. In words, $p_3^{(k)}$ ($p_4^{(k)}$) is the probability that when the chain makes the transition from $I$ to $J$ at some time $m$, it visits the state in $J$ with the largest transition energy back to $I$, amongst the states in $J$ that are visited from time $m+1,\ldots,k$, at time $m+1$ (at some time $n \ge m+2$).

The motivation for the decomposition in (4.6) is as follows. Suppose we work directly with (4.4). Observe that the $p_2^{(k)}$ term only keeps track of how the chain makes transitions from $I$ to $J$ but not how it stays in $J$. In this case we are forced to work with the "worst case" scenario where the chain makes minimum energy $d$-step transitions from $I$ to $J$ (with energy $U_{IJ}$) and maximum energy $d$-step transitions from $J$ to $I$ (with energy $\max_{j\in J} U_{jI}$). In order to show that $p_2^{(k)} \to 0$ as $k \to \infty$, it seems clear that we would have to require $U_{IJ} - \max_{j\in J} U_{jI} > 0$.
On the other hand, in the $p_3^{(k)}$ and $p_4^{(k)}$ terms of (4.6) we keep track not only of how the chain makes transitions from $I$ to $J$ but also of how it stays in $J$. In order to show that $p_3^{(k)}, p_4^{(k)} \to 0$ as $k \to \infty$, it is not hard to see that we need only require $\min_{j\in J}[U_{Ij} - U_{jI}] > 0$, which is guaranteed by (c). We now proceed with the details.

We start by upper bounding $p_1^{(k)}$. Using (4.3), for every large enough $k_0 \in \mathbb{N}$ we have

$$\rho(j_0,k_0,k,r_0) \le \prod_{\ell=k_0}^{k-1} \max_{j\in J}\, P\{x_{(\ell+1)d} \in J \mid x_{\ell d} = j\} = \prod_{\ell=k_0}^{k-1}\Bigl[1 - \min_{j\in J}\sum_{i\in I} p_{ji}^{(\ell d,(\ell+1)d)}\Bigr] \le \prod_{\ell=k_0}^{k-1}\bigl[1 - a_2(\ell d)^{-a}\bigr] \tag{4.9}$$

for all $j_0 \in J$ and $k = k_0+1,\ldots$, by (a). Combining (4.5) and (4.9) gives, for every large enough $k_0 \in \mathbb{N}$,

$$p_1^{(k)} \le \prod_{\ell=k_0}^{k-1}\bigl[1 - a_2(\ell d)^{-a}\bigr], \qquad k = k_0+1,\ldots. \tag{4.10}$$

Since $a_2 > 0$ and $a < 1$, we can apply Lemma (i) to (4.10) for every large enough $k_0 \in \mathbb{N}$ to get

$$p_1^{(k)} = O\bigl(e^{-b(kd)^{1-a}}\bigr) \qquad \text{as } k \to \infty, \tag{4.11}$$

where $b = a_2/(1-a) > 0$.

We continue by upper bounding $p_3^{(k)}$ and $p_4^{(k)}$. First, by almost the same reasoning that led to (4.9), for every large enough $n \in \mathbb{N}$ we have

$$\rho(j,n,k,r) \le \prod_{\ell=n}^{k-1}\bigl[1 - a_2(\ell d)^{-a_r}\bigr], \qquad j \in J_r,\ k = n+1,\ldots,\ r = 1,\ldots,r_0. \tag{4.12}$$

Next, suppose that $x_{md} = i$, $x_{\ell d} \in K_r$ for $\ell = m+1,\ldots,n-1$, and $x_{nd} = j$, for some $i,j \in \Omega$, $m \in \mathbb{N}$, $n = m+2,m+3,\ldots$, and $r = 1,\ldots,r_0$. Then clearly there exist $\kappa$ ($1 \le \kappa \le \min[n-m-1, k_r]$), intermediate times $m < \ell_1 < \cdots < \ell_{\kappa-1} < n-1$, and distinct intermediate states $j_1,\ldots,j_\kappa \in K_r$ such that

$$x_{md} = i, \quad x_{(m+1)d} = j_1, \quad x_{(\ell_q+1)d} = j_{q+1} \ \text{ for } q = 1,\ldots,\kappa-1, \quad x_{nd} = j. \tag{4.13}$$

Let $A(i,j,m,n,r;\kappa,\ell_1,\ldots,\ell_{\kappa-1},j_1,\ldots,j_\kappa)$ be the event defined by (4.13). Then we have shown that

$$\sigma(i,j,m,n,r) \le \sum P\{A(i,j,m,n,r;\kappa,\ell_1,\ldots,\ell_{\kappa-1},j_1,\ldots,j_\kappa)\} \le k_r\, 2^{k_r}\,(n-m-2)^{k_r}\,\max P\{A(i,j,m,n,r;\kappa,\ell_1,\ldots,\ell_{\kappa-1},j_1,\ldots,j_\kappa)\}$$

for $i,j \in \Omega$, $m \in \mathbb{N}$, $n = m+2,m+3,\ldots$, and $r = 1,\ldots,r_0$, where the sum and the maximum are over all admissible $\kappa,\ell_1,\ldots,\ell_{\kappa-1},j_1,\ldots,j_\kappa$.
Now using (4.2) and (4.12) it is not hard to show that, for large enough $m \in \mathbb{N}$,

$$P\{A(i,j,m,n,r;\kappa,\ell_1,\ldots,\ell_{\kappa-1},j_1,\ldots,j_\kappa)\} \le a_1^{\kappa+1}\,(md)^{-U_{ij}/T} \prod_{\ell=m+\kappa}^{n-2}\bigl[1 - a_2(\ell d)^{-a_r}\bigr],$$

and consequently

$$\sigma(i,j,m,n,r) \le c_1\,(n-m-2)^{k_r}\,(md)^{-U_{ij}/T} \prod_{\ell=m+k_r}^{n-2}\bigl[1 - a_2(\ell d)^{-a_r}\bigr] \tag{4.14}$$

for $n = m+2,m+3,\ldots$ and $r = 1,\ldots,r_0$, where $c_1$ is an unimportant constant. Combining (4.7), (4.8), (4.12), and (4.14) gives, for every large enough $k_0 \in \mathbb{N}$,

$$p_3^{(k)} + p_4^{(k)} \le c_2 \sum_{r=1}^{r_0} \sum_{m=k_0}^{k-1} (md)^{-\gamma_r} \prod_{\ell=m+1}^{k-1}\bigl[1 - a_2(\ell d)^{-a_r}\bigr] + c_3 \sum_{r=2}^{r_0} \sum_{m=k_0}^{k-2} (md)^{-\gamma_r} \prod_{\ell=m+k_r}^{k-1}\bigl[1 - a_2(\ell d)^{-a_r}\bigr], \tag{4.15}$$

where $c_2, c_3$ are unimportant constants and $(\gamma_r - a_r)T = U_{IJ_r} - U_{J_r I} \ge \min_{j\in J}[U_{Ij} - U_{jI}] > 0$. Since $a_2 > 0$ and $a_r < \gamma_r$ with $a_r \le a < 1$ for all $r = 1,\ldots,r_0$, we can apply Lemma (ii) to each term in (4.15) for every large enough $k_0 \in \mathbb{N}$ to get

$$p_3^{(k)} + p_4^{(k)} = \sum_{r=1}^{r_0} O\bigl(k^{-(\gamma_r-a_r)}\bigr) = O\bigl(k^{-\tau/T}\bigr) \qquad \text{as } k \to \infty, \tag{4.16}$$

where the last equality follows from

$$\tau = \min_{j\in J}[U_{Ij} - U_{jI}] = \min_{r=1,\ldots,r_0}[U_{IJ_r} - U_{J_r I}] = \min_{r=1,\ldots,r_0}(\gamma_r - a_r)T.$$

Finally, combining (4.4), (4.6), (4.11), and (4.16) gives

$$P\{x_{kd} \in J\} = O\bigl(e^{-b(kd)^{1-a}}\bigr) + O\bigl(k^{-\tau/T}\bigr) \qquad \text{as } k \to \infty. \tag{4.17}$$

Similarly we can show that (4.17) holds with $P\{x_{kd} \in J\}$ replaced by $P\{x_{kd+\ell} \in J\}$ for all $\ell = 0,\ldots,d-1$. Hence

$$P\{x_k \in J\} = O\bigl(e^{-bk^{1-a}}\bigr) + O\bigl(k^{-\tau/T}\bigr) \qquad \text{as } k \to \infty, \tag{4.18}$$

and the theorem follows (since $\tau < \infty$ if (d) is true).

Remarks on Theorem 4.2
(1) In Figure 2.1 let $I = S^* = \{5\}$ and $J = \{1,2,3,4\}$. Then $U^* = U_{15} = 4$ and $\tau = U_{51} - U_{15} = U_1 - U_5 = 1$.

(2) Condition (a) was discussed in Section 2 and is satisfied for $I = S$.

(3) Condition (c) is satisfied for $I = S$ when $Q$ is symmetric, since

$$\min_{j\in J}[U_{Ij} - U_{jI}] \ge \min_{i\in I,\, j\in J}[U_{ij} - U_{ji}] = \min_{i\in I,\, j\in J}[U_j - U_i] > 0.$$

(4) When condition (d) is not satisfied ($\tau = \infty$), (4.18) shows that

$$P\{x_k \in I\} = 1 - O\bigl(e^{-bk^{1-a}}\bigr) \qquad \text{as } k \to \infty,$$

where $a = U^*/T$ and $b > 0$. What we have actually shown is that

$$P\{x_n \in I, \text{ some } n \le k\} = 1 - O\bigl(e^{-bk^{1-a}}\bigr) \qquad \text{as } k \to \infty,$$

and this is valid when only (a), (b) are assumed, $T > U^*$, and $T_k \ge T / \log k$ for large enough $k \in \mathbb{N}$. Theorem 4.1 can be deduced from this. It is possible to lower bound $b$ in terms of the $a_{ij}$'s from Proposition 4.1, but we shall not do so here.
(5) We can get a somewhat better estimate of the rate of convergence as follows. Let I* be the collection of subsets I_0 of Ω such that the partition {I_0, J_0} satisfies conditions (a), (b), (c), and (d). Assume that I ∈ I*, let r* = max_{I_0 ∈ I*} r*(I_0), and suppose T ≥ T*(I*) and T_k = T / log k for large enough k ∈ N. Then P{x_k ∈ I} = 1 − O(k^{−r*/T}) as k → ∞.

The corollary to the next theorem gives, by setting I = S*, conditions under which P{x_k ∈ S* a.a.} = 1.

Theorem 4.3  Let {I,J} be a partition of Ω and assume that the transition energy from I to J is positive (U_{IJ} > 0). Also let U* = U_{IJ}, ε > 0 (ε < U*), and T_k ≤ (U* − ε) / log k for large enough k ∈ N. Then

    P{x_k ∈ I a.a.} = P{x_k ∈ I i.o.}.

Proof  Let T = U* − ε. Then from Proposition 4.1(i) there exists a > 0 such that

    p_{ij}^{(k,k+1)} ≤ a k^{−U_{ij}/T}   for all i,j ∈ Ω, k ∈ N.

Hence

    P{x_k ∈ I, x_{k+1} ∈ J} ≤ max_{i∈I} P{x_{k+1} ∈ J | x_k = i}
                            = max_{i∈I} ∑_{j∈J} p_{ij}^{(k,k+1)}
                            ≤ max_{i∈I} ∑_{j∈J} a k^{−U_{ij}/T}
                            ≤ a |J| k^{−U*/T},   k ∈ N.

Now U*/T > 1, and so

    ∑_{k=1}^{∞} P{x_k ∈ I, x_{k+1} ∈ J} < ∞.

Applying the "first" Borel-Cantelli Lemma (c.f. [7]) we have P{x_k ∈ I, x_{k+1} ∈ J i.o.} = 0, and the theorem follows. □

Corollary 4.1  Let {I,J} be a partition of Ω and assume that
(a) there exists d ∈ N such that the d-step transition energy from j to I equals the transition energy from j to I, for all j ∈ J (U_{jI}^{(d)} = U_{jI} for all j ∈ J),
(b) every j ∈ J can reach some i ∈ I (max_{j∈J} U_{jI} < ∞),
(c) the transition energy from I to J is greater than the transition energy from j to I, for all j ∈ J (U_{IJ} − max_{j∈J} U_{jI} > 0).
Also let U* = U_{IJ}, max_{j∈J} U_{jI} ≤ T < U*, and T_k = T / log k for large enough k ∈ N. Then P{x_k ∈ I a.a.} = 1.

Proof  Combine Theorems 4.1 and 4.3. □

Remarks on Corollary 4.1

(1) In Figure 2.1 let I = S* = {5}, J = {1,2,3,4}. Then max_{j∈J} U_{jI} = U_{15} = 4 and U* = U_{54} = 4, so that condition (c) fails.

(2) Hence, unlike condition (c) of Theorem 4.2, condition (c) of Corollary 4.1 is not generally satisfied, even when Ω is symmetric.
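The summability step in the proof of Theorem 4.3 is easy to check numerically: with T = U* − ε we have U*/T > 1, so the bound a|J| k^{−U*/T} on P{x_k ∈ I, x_{k+1} ∈ J} has bounded partial sums, which is exactly what the first Borel-Cantelli Lemma requires. A minimal sketch (the constants a = 1, |J| = 4, U* = 4, ε = 1 are illustrative assumptions, not values from the paper's examples):

```python
def transition_bound(k, a, J_size, U_star, T):
    """Upper bound a * |J| * k^(-U*/T) on P{x_k in I, x_{k+1} in J}."""
    return a * J_size * k ** (-U_star / T)

# Illustrative constants (assumptions, not taken from the paper).
a, J_size, U_star, eps = 1.0, 4, 4.0, 1.0
T = U_star - eps          # T = U* - eps, hence s = U*/T = 4/3 > 1
s = U_star / T

# Partial sums of the bound at increasing horizons: they stay bounded,
# so sum_k P{x_k in I, x_{k+1} in J} < infinity (integral test), and the
# first Borel-Cantelli lemma gives P{x_k in I, x_{k+1} in J i.o.} = 0.
partial = [sum(transition_bound(k, a, J_size, U_star, T) for k in range(1, n + 1))
           for n in (10**3, 10**4, 10**5)]

def tail_bound(N):
    """Integral-test bound a * |J| * N^(1-s) / (s-1) on the tail sum past N."""
    return a * J_size * N ** (1 - s) / (s - 1)
```

Here the tail past N = 10^4 is at most tail_bound(10**4) ≈ 0.56, confirming convergence; with ε = 0 (so U*/T = 1) the same partial sums would instead grow like log k.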
Note that U_{IJ} = min_{i∈I, j∈J: q_{ij}>0} max[0, U_j − U_i].

The corollary to the next theorem gives, by setting I = S*, conditions under which P{x_k ∈ S* i.o.} < 1. By (4.1), these are conditions under which the algorithm does not converge according to any of our criteria.

Theorem 4.4  Let {I,J} be a partition of Ω and assume that
(a) the transition energy from J to I is positive (U_{JI} > 0),
(b) every i ∈ I can reach some j ∈ J (max_{i∈I} U_{iJ} < ∞).
Also let ε > 0 (ε < U_{JI}) and T_k ≤ (U_{JI} − ε) / log k for large enough k ∈ N. Then P{x_k ∈ I i.o.} < 1.

Proof  Let T = U_{JI} − ε. From Proposition 4.1(i) there exists a > 0 such that

    p_{ij}^{(k,k+1)} ≤ a k^{−U_{ij}/T}   for all i,j ∈ Ω, k ∈ N.

Hence for every large enough k ∈ N

    P{x_n ∈ J, n ≥ k} ≥ P{x_k ∈ J} ∏_{n=k}^{∞} min_{j∈J} P{x_{n+1} ∈ J | x_n = j}
                      = P{x_k ∈ J} ∏_{n=k}^{∞} [1 − max_{j∈J} ∑_{i∈I} p_{ji}^{(n,n+1)}]
                      ≥ P{x_k ∈ J} ∏_{n=k}^{∞} [1 − a |I| n^{−U_{JI}/T}].

Since U_{JI}/T > 1 the infinite product converges (to a positive value), and by (b) P{x_k ∈ J} > 0 for some large enough k ∈ N. Hence P{x_n ∈ J, n ≥ k} > 0 for some large enough k ∈ N, and the theorem follows. □

Corollary 4.2  Let {I,J} be a partition of Ω and assume that
(a) the transition energy from J to I is positive (U_{JI} > 0).
Let W* = max_{j∈J} W_{jI}, J* = {j ∈ J: W_{jI} = W*}, I* = Ω \ J*, and assume that
(b) the transition energy from J* to I* is positive (U_{J*I*} > 0).
Finally let ε > 0 and T_k ≤ (W* − ε) / log k for large enough k ∈ N. Then P{x_k ∈ I i.o.} < 1.

Proof  Observe that W* ≤ U_{J*I*} and apply Theorem 4.4 to the partition {I*, J*}. □

In Figure 2.1 let I = S* = {5}, J = {1,2,3,4}. Then W* = W_{15} = W_{25} = W_{35} = 2 and J* = {1,2,3}.

We next state a theorem of Hajek's which gives necessary and sufficient conditions for P{x_k ∈ S*} → 1 as k → ∞.

Theorem (Hajek)  Assume that
(a) j can be reached from i, for all i,j ∈ Ω (Ω is irreducible),
(b) if j can be reached from i at energy U, then i can be reached from j at energy U, for all U ∈ R and i,j ∈ Ω.
Let d* = max_{j∉S*} V_j < ∞, T > 0, and T_k = T / log k for large enough k ∈ N. Then

    P{x_k ∈ S*} → 1   as k → ∞   iff   T ≥ d*.

Proof  See [9]. □

Remarks on Hajek's Theorem

(1) In Hajek's paper conditions (a) and (b) are called "strong irreducibility" and "weak reversibility", respectively.

(2) In Figure 2.1 we have d* = V_{15} = 3.

(3) Condition (b) is satisfied when Ω is symmetric.

(4) Obviously W* ≤ d* ≤ U*, and the equalities hold only in fairly trivial cases. Hence, under conditions (a) and (b), Hajek's Theorem is stronger than our Theorem 4.2 and Corollary 4.2 with I = S*. However, the conditions under which our results are obtained are different from, and in general weaker than, Hajek's, with the exception that condition (c) of Theorem 4.2 can be true when condition (b) of Hajek's Theorem is false, and conversely. Also, we obtain an estimate of the rate at which P{x_k ∈ S*} → 1 as k → ∞.

We close this section by indicating how various modifications of the annealing algorithm can be analyzed by our methods. Such modifications might include
(i) allowing the matrix Q to depend on time,
(ii) measuring the energy differences U_j − U_i with random error,
(iii) allowing the temperature T_k to depend on the current state x_k.
The important point to observe in modifications such as these is that our results depend only on the Markov property of the annealing chain {x_k}_{k∈N_0} and on the asymptotic behavior of its d-step transition matrices {P^{(k,k+d)}}_{k∈N_0} as k → ∞, for fixed d ∈ N. In particular, our results are based on satisfying one or both of the inequalities

    liminf_{k→∞} k^{U_{ij}/T} p_{ij}^{(k,k+d)} > 0                                      (4.19)

    limsup_{k→∞} k^{U_{ij}/T} p_{ij}^{(k,k+d)} < ∞                                      (4.20)

for appropriate i,j ∈ Ω. Hence our results are valid for any Markov chain which satisfies (4.19) and/or (4.20) for appropriate i,j ∈ Ω. Of course, in general the U_{ij}'s are not given by (2.1), and can in fact be any non-negative real numbers (or ∞), with the exception that in Theorem 4.2 we require U_{iℓ} ≤ U_{ij} + U_{jℓ} for certain i,j,ℓ ∈ Ω.
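Hajek's quantity d* is computable for small examples. The sketch below is an illustration on a hypothetical five-state landscape (not the example of Figure 2.1, whose energies are not reproduced here): the depth of a local but non-global minimum i is the least E such that some state of strictly lower energy can be reached from i along a path whose states all have energy at most U_i + E, and d* is the largest such depth.

```python
import heapq

def depth_star(U, nbrs):
    """Largest depth over local, non-global minima: for each such state i,
    depth(i) = min over j with U[j] < U[i] of (bottleneck energy of a path
    from i to j) - U[i], where the bottleneck energy of a path is the
    maximum energy among its states."""
    n, u_min = len(U), min(U)
    depths = []
    for i in range(n):
        if U[i] == u_min or any(U[j] < U[i] for j in nbrs[i]):
            continue  # skip global minima and states that are not local minima
        # Dijkstra-style minimax search: best[v] = min over paths i -> v
        # of the maximum energy encountered along the path.
        best = [float("inf")] * n
        best[i] = U[i]
        heap = [(U[i], i)]
        while heap:
            h, v = heapq.heappop(heap)
            if h > best[v]:
                continue
            for w in nbrs[v]:
                h2 = max(h, U[w])
                if h2 < best[w]:
                    best[w] = h2
                    heapq.heappush(heap, (h2, w))
        depths.append(min(best[j] for j in range(n) if U[j] < U[i]) - U[i])
    return max(depths, default=0)

# Hypothetical 1-D landscape on the path graph 0-1-2-3-4.
U = [0, 3, 1, 5, 0]
nbrs = [[1], [0, 2], [1, 3], [2, 4], [3]]
d = depth_star(U, nbrs)  # the only non-global local minimum is state 2
```

Here state 2 can escape to a lower-energy state only over state 1 (energy 3), so d* = 3 − 1 = 2; by Hajek's Theorem, a schedule T_k = T / log k with T ≥ 2 would then be required for convergence in probability to the minimum energy states (under conditions (a) and (b)).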
We are currently examining the modifications of the annealing algorithm mentioned above, and are also attempting to extend our results to more general (countably infinite and uncountable) state spaces.

5. Conclusion

We have analyzed the simulated annealing algorithm focusing on those issues most important for optimization. Here we are interested in finding good, but not necessarily optimal, solutions. We distinguished between the finite-time and asymptotic behavior of the annealing algorithm. In our finite-time analysis we gave a lower bound on the probability that the annealing chain visits a set of low energy states at some time ≤ k, for k = 1,2,.... This bound may be useful even when the algorithm does not converge, and as such is probably our most important result for applications. We are currently engaged in trying to apply this bound to a specific problem. In our asymptotic analysis we obtained conditions under which the annealing algorithm converges to a set of low energy states according to various criteria. Hajek has recently given necessary and sufficient conditions that the annealing chain converge in probability to the minimum energy states. We gave an estimate of the rate of convergence. Our methods apply to various modifications of the annealing algorithm. We hope to explore some of these modifications and to extend our results to more general state spaces.

6. Appendix

Proof of Lemma (i)  Without loss of generality we assume k_0 = 1. Using the inequality 1 + x ≤ e^x for all x ∈ R, we have

    ∏_{ℓ=1}^{k} [1 − a ℓ^{−α}] ≤ exp[−a ∑_{ℓ=1}^{k} ℓ^{−α}]
                               ≤ exp[−a ∫_1^k x^{−α} dx]
                               = e^b e^{−bk^{1−α}},   k ∈ N,                            (A.1)

where b = a/(1−α). □

Proof of Lemma (ii)  Without loss of generality we assume m_0 = 1. Then using (A.1) and the inequality (x+1)^y ≤ x^y + y for all x ≥ 1 and 0 ≤ y ≤ 1, we have

    ∏_{ℓ=m+1}^{k} [1 − a ℓ^{−α}] ≤ e^a e^{bm^{1−α}} e^{−bk^{1−α}},   k = m+1, m+2, ...,  m ∈ N.

Let

    f_n(k,ℓ) = ∑_{m=1}^{ℓ} (k+1−m)^n e^{bm^{1−α}} m^{−A},   ℓ = 1,...,k,  k ∈ N,  n ∈ N_0,

so that we can write

    ∑_{m=1}^{k} (k+1−m)^n m^{−A} ∏_{ℓ=m+1}^{k} [1 − a ℓ^{−α}] ≤ e^a e^{−bk^{1−α}} f_n(k,k).

We shall show that for every n ∈ N_0 there exist a_n, b_n ∈ R such that

    f_n(k,k) ≤ a_n k^{(n+1)α−A} e^{bk^{1−α}} + b_n,   k ∈ N,                            (A.2)

and consequently

    ∑_{m=1}^{k} (k+1−m)^n m^{−A} ∏_{ℓ=m+1}^{k} [1 − a ℓ^{−α}] = O(k^{−(A−(n+1)α)})   as k → ∞,

as required.

The proof of (A.2) is by induction on n ∈ N_0. First consider n = 0. Let g(x) = e^{bx^{1−α}} x^{−A}, x ≥ 1. Since g'(x) > 0 for large enough x, it follows that

    f_0(k,k) = ∑_{m=1}^{k} g(m) ≤ ∫_1^k g(x) dx + g(1) + g(k).

Let δ = (A−α)/(1−α) and let [δ] be the largest integer ≤ δ. Expanding e^{bx^{1−α}} in a Taylor series and integrating term by term shows that ∫_1^k g(x) dx ≤ a_0' k^{α−A} e^{bk^{1−α}} + b_0' for suitable constants a_0', b_0', and (A.2) for n = 0 follows.

Next assume (A.2) is valid for n and consider n+1. Summing by parts (c.f. [10]) we have

    f_{n+1}(k,ℓ) = (k+1−ℓ) f_n(k,ℓ) + ∑_{m=1}^{ℓ−1} f_n(k,m),   ℓ = 1,...,k,  k ∈ N.

Applying the induction hypothesis to the right-hand side, and estimating the resulting sum as in the case n = 0, gives (A.2) for n+1 with suitable a_{n+1}, b_{n+1}. By induction (A.2) is valid for all n ∈ N_0. □

7. References

[1] Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi: "Optimization by Simulated Annealing," Science 220 (1983) 671-680

[2] Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller: "Equations of state calculations by fast computing machines," J. Chem. Phys. 21 (1953) 1087-1091

[3] White, S.: "Concepts of Scale in Simulated Annealing," Preprint, IBM Thomas J. Watson Res. Center (1984)

[4] Geman, S. and D. Geman: "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Trans. Pattern Analysis and Machine Intelligence 6 (1984) 721-741

[5] Gidas, B.: "Nonstationary Markov Chains and the Convergence of the Annealing Algorithm," Preprint, Div.
of Applied Math, Brown University (1983)

[6] Mitra, D., F. Romeo, and A. Sangiovanni-Vincentelli: "Convergence and Finite-time Behavior of Simulated Annealing," Memo M85/23, Elec. Res. Lab., College of Eng., U.C. Berkeley (1985)

[7] Billingsley, P.: Probability and Measure, John Wiley and Sons, Inc. (1979)

[8] Seneta, E.: Non-negative Matrices and Markov Chains, Springer-Verlag (1981)

[9] Hajek, B.: "Cooling Schedules for Optimal Annealing," Preprint, Dept. Elec. Eng. and Coor. Science Lab., U. Illinois at Champaign-Urbana (1985)

[10] Rudin, W.: Principles of Mathematical Analysis, McGraw-Hill, Inc. (1976)

FIGURE 2.1  TRANSITION DIAGRAM FOR EXAMPLE 2.1

FIGURE 2.2  TRANSITION DIAGRAM FOR EXAMPLE 2.2
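The bound of Lemma (i) proved in the Appendix, ∏_{ℓ=1}^{k} [1 − aℓ^{−α}] ≤ e^b e^{−bk^{1−α}} with b = a/(1−α), can be checked numerically. A minimal sketch (the parameter values a = 0.5, α = 0.5, k = 1000 and a = 0.9, α = 0.3, k = 500 are illustrative, not taken from the paper):

```python
import math

def product_bound_check(a, alpha, k):
    """Return (product, bound) where product = prod_{l=1..k} (1 - a*l^(-alpha))
    and bound = e^b * exp(-b * k^(1-alpha)) with b = a/(1-alpha), as in
    Lemma (i) of the Appendix (0 < a <= 1, 0 < alpha < 1)."""
    b = a / (1.0 - alpha)
    prod = 1.0
    for ell in range(1, k + 1):
        prod *= 1.0 - a * ell ** (-alpha)
    bound = math.exp(b) * math.exp(-b * k ** (1.0 - alpha))
    return prod, bound

prod, bound = product_bound_check(a=0.5, alpha=0.5, k=1000)
prod2, bound2 = product_bound_check(a=0.9, alpha=0.3, k=500)
```

In both cases the product is positive but lies below the bound, as the comparison 1 + x ≤ e^x and the integral estimate in the proof predict.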