Experience with a Parallel Simulated Annealing Algorithm for the Traveling Salesman Problem

by

Mua Dinh Lam Tran

B.S.E.E., Boston University (1987)

SUBMITTED TO THE DEPARTMENT OF AERONAUTICS AND ASTRONAUTICS IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

July, 1989

© Mua Dinh Lam Tran, 1989

Signature of the Author
    Department of Aeronautics and Astronautics
    July 11, 1989

Certified by
    Dr. Richard E. Harper
    Thesis Supervisor, C. S. Draper Laboratory, Inc.

Certified by
    Professor Wallace E. Vander Velde
    Thesis Supervisor, Professor of Aeronautics and Astronautics

Accepted by
    Professor Harold Y. Wachman
    Chairman, Departmental Graduate Committee


Experience with a Parallel Simulated Annealing Algorithm for the Traveling Salesman Problem

by Mua Dinh Lam Tran

Submitted to the Department of Aeronautics and Astronautics on July 12, 1989 in partial fulfillment of the requirements for the degree of Master of Science

Abstract

A Synchronous Parallel Simulated Annealing Algorithm was designed for the Traveling Salesman Problem. The speedup of the candidate parallel algorithm was analyzed, and general bounds on that speedup were obtained. On the average, a processor handles O(N) messages in a network, and the communication overhead is O(N) or O(NpNs) units of time, where Np is the number of processors, N is the number of cities in the TSP, and Ns is the number of cities in a subtour. As N or Ns approaches infinity, the speedup is O(Np) and is independent of communication overhead. For large Np, the speedup is O(Ns). If communication overhead is neglected, the speedup is αNp + β, where 0 << α ≤ 1 and 0 ≤ β << 1. Through a computational study, the behavior of the candidate parallel algorithm was investigated for two annealing schedules: Tk+1 = cTk and Tk = d/log k, where Tk is the temperature at the kth iteration and c and d are constants under investigation. In a comparison between the Citywise Exchange and Edgewise Exchange neighborhood structures, Citywise Exchange provided the better solution.


Dedication

A Diploma at Nether Providence High School (1983)
A Bachelor of Science at Boston University (1987)
A Master of Science at the Massachusetts Institute of Technology (1990)
And this thesis (1989)

have been pursued on behalf of and are dedicated to two truly beloved human beings, who have always dreamed of an education but never had an opportunity to see inside a first grade classroom,

my parents: Mr. Tran and Mrs. Lam

"Mua, you are our oldest son and the oldest grandson. You are your father's right arm and the prospective 'Head' of our family. You are the sole measure of our true values, standards and honors. You are everything we have, and we love you."

On 29th September 1978 at 9:48 PM, just before the author stepped out of his front door to begin his "Odysseus" journey.

"Công cha như núi Thái Sơn; Nghĩa mẹ như nước trong nguồn chảy ra. Một lòng thờ mẹ kính cha; Cho tròn chữ hiếu mới là đạo con."
(A father's devotion towers like Mount Thai Son; a mother's love flows like water from the source. Honor your mother and respect your father with all your heart; only then is a child's duty of gratitude fulfilled.)


Acknowledgements

On the night of 29th September, 1978, navigating a small fishing boat away from the shore of Vietnam in a threatening darkness of the Eastern sky, I silently said to myself, "If I am to live through this journey, I shall never let my parents' dream die in vain." Although I have imperfectly executed the last phase of this long project, today, I have completed and fulfilled this quest.
How can I, as an Englishless and penniless refugee who set his feet on this great nation of America less than a decade ago, be able to compose this page of acknowledgements? I have been blessed and honored more than I actually deserve. The journey I have travelled was made only to be experienced--not to be expressed. Neither "thank you", "gratitude", nor any English word is an adequate word; I hope some day I will have the opportunity to define it properly from the perspective of a refugee. For now, TO my beloved family, Dad (Dr. Alan), Mom (Mrs. Erika), Brother Brian and Sister Samantha Kors, who have cheered me up when I am down; warmed me when I am cold; comforted me when I am struggling; pulled me to my feet when I am falling flat on my face; and slowed me down when I am moving too fast. They are the Gothic foundation upon which I have built, and the central axis of love and faith around which I have evolved. TO Grandpop Juston and Grandmom Marcia Wallace and Aunt Bibi, Uncle Phil, Cousins Jeri and Howard Feintuch, for reminding me to use the $20 gifts to go out for "a good dinner with a friend", for sending me a ton of different clothes to keep me warm during the Bostonian winters, and for "having a lot of faith" in me. TO my first high school teacher and tutor, Mrs. Pat Frantz, without whose dedicated teaching, I would neither know the Bill of Rights, have learned five-years equivalence of elementary mathematics in six months nor have been ranked 3rd in my graduating class of 1983. TO G. De Aragon, for shopping my grocery and keeping our apartment neat and clean. TO T. A. Le and Dr. N. V. Nguyen, for the persistent prodding, advice, and enthusiasm. TO K. C. Luong, for the loyal friendship, for the immediate availability when I needed him most, and for reminding me, "You are the star of your family; keep it shining." -4- M. Tran Acknowledgements TO the Staff Engineers and Scientists of Draper, especially the Fault-Tolerant Systems Division, Dr. J. Lala, M. Dzwonczyk, P. Mukai, T. Sims, D. Fuhry, and R. Hain for constantly prodding me during these past two years, M. Busa for "having a lot of faith" in me, John P. Deyst for offering me some money to pay my rents, B. Mc Carragher for giving me his Mac skills so freely, and S. Kim, J. Cervantes and J. Turkovich for engaging in several fruitful technical discussions. TO my former Education Director and present V.P. of Engineering: "MIT is not for everyone. It is hard when you are in it, but it is not that hard when you are out of it. It is always nice to be a member of the Institute, you know." Dr. David Burke advised wisely; I understood it fully. TO my direct Technical Supervisor, Dr. Richard E. Harper: "Relax!" Rick demanded humanely; I listened impatiently. "MIT first!" Rick commanded authoritatively; I obeyed soldierly. "I was there (MIT), Sport." Rick comforted teasingly; I was relieved naturally. TO my Honorable Professor and Thesis Advisor, Prof. Wallace E. Vander Velde: If any discrepancy is found in this thesis, it is the sole responsibility of the author, if any result or beauty is perceived in this thesis, it is a mere reflection of the professional guidance, keen technical insights, and boundless patience of my "humanly-down-to-earth" Thesis Advisor, without whom the struggle for completion of this thesis could be like the days and the nights of the ultimate fight for survival in the turbulent South China Sea. 
This report was prepared at The Charles Stark Draper Laboratory, Incorporated under an internal research and development contract. Publication of this report does not constitute approval by the Draper Laboratory of the findings or conclusions contained herein. It is published for the exchange and stimulation of ideas.

I hereby assign my copyright of this thesis to The Charles Stark Draper Laboratory, Incorporated, Cambridge, Massachusetts.

Mua D. L. Tran

Permission is hereby granted by The Charles Stark Draper Laboratory, Incorporated to the Massachusetts Institute of Technology to reproduce any or all of this thesis.


Table of Contents

Title
Abstract
Acknowledgements
Table of Contents
Lists of Figures and Tables
Nomenclature

1.0 INTRODUCTION
  1.1 Motivation
  1.2 Problem Statement
  1.3 TSP and Combinatorial Optimization
  1.4 Methodology
    1.4.1 Simulated Annealing
    1.4.2 Parallelization
  1.5 Objective and Thesis Outline

2.0 CLASSICAL SIMULATED ANNEALING ALGORITHM
  2.1 Introduction
  2.2 Local Optimization
  2.3 Statistical Mechanics--A Physical Analogy
  2.4 Classical Simulated Annealing

3.0 QUANTITATIVE ANALYSIS OF THE SIMULATED ANNEALING ALGORITHM
  3.1 Introduction
  3.2 Mathematical Model
    3.2.1 Asymptotic Convergence
    3.2.2 Annealing Schedules
  3.3 Analysis of the Cost Function

4.0 DESIGN OF THE PARALLEL SIMULATED ANNEALING ALGORITHM
  4.1 Introduction
  4.2 Algorithm Framework and Parallelization Methodology
  4.3 Neighborhood Structures
    4.3.1 Citywise Exchange
    4.3.2 Lin's 2-Opt or Edgewise Exchange
  4.4 Costs of TSP and Subtours
  4.5 Candidate Algorithms
    4.5.1 Synchronous Parallel Simulated Annealing Algorithm (A)
    4.5.2 Asynchronous Parallel Simulated Annealing Algorithm (B)
  4.6 Implementation Issues of the Candidate Algorithms

5.0 SPEEDUP ANALYSIS OF THE PARALLEL SIMULATED ANNEALING ALGORITHM
  5.1 Introduction
  5.2 Speedup Analysis of Independent Subtours
  5.3 Interprocessor Communication
    5.3.1 Message Communication
    5.3.2 Data Communication
  5.4 Speedup Analysis of Interprocessor Communication
  5.5 Speedup Analysis of Step 3 of Algorithm A
  5.6 General Bounds on Speedup of Algorithm A

6.0 EMPIRICAL ANALYSIS OF THE PARALLEL SIMULATED ANNEALING ALGORITHM
  6.1 Introduction
  6.2 Analysis Methodology
  6.3 Annealing Schedule Analysis
  6.4 Simulated Annealing Versus Local Optimization
  6.5 Citywise Exchange Versus Edgewise Exchange

7.0 SUMMARY AND CONCLUSIONS

APPENDIX A: SIMULATION PROGRAM FOR A SYNCHRONOUS SIMULATED ANNEALING ALGORITHM

BIBLIOGRAPHY


Lists of Figures and Tables

Figure 2.1  Local Optimization Algorithm
Figure 2.2  Plateau, Local Minima and Global Minimum for the Cost Function
Figure 2.3  Boltzmann Distribution Curves for an Energy Function at Various Temperatures
Figure 2.4  General Metropolis Algorithm
Figure 2.5  Analogy Between Physical System and Combinatorial Optimization
Figure 2.6  Simulated Annealing Algorithm
Figure 4.1  (a) An Arbitrary TSP Tour. (b) TSP Tour is Divided into Four Subtours
Figure 4.2  High-Level Description of the Parallel Scheme
Figure 4.3  General Algorithm of a Subtour
Figure 4.4  (a) Algorithm of Subtour with Citywise Exchange
Figure 4.4  (b) Neighborhood Structure with Citywise Exchange
Figure 4.5  Neighborhood Structure with Lin's 2-Opt Exchange or Edgewise Exchange
Figure 4.6  Algorithm of Subtour with Lin's 2-Opt Exchange
Figure 5.1  A 2-Processor System with Interprocessor Communication
Figure 5.2  A 4-Subtour System with Interprocessor Communication
Figure 6.1  Temperature versus Time for T(k+1) = cT(k) at Different Values of c and T(0) = 20.0
Figure 6.2  Costs versus Time for T(k+1) = cT(k) at c = 0.94, T(0) = 20.0 and N = 50
Figure 6.3  Costs versus Time for T(k+1) = cT(k) at c = 0.95, T(0) = 20.0 and N = 50
Figure 6.4  Costs versus Time for T(k+1) = cT(k) at c = 0.96, T(0) = 20.0 and N = 50
Figure 6.5  Costs versus Time for T(k+1) = cT(k) at c = 0.97, T(0) = 20.0 and N = 50
Figure 6.6  Costs versus Time for T(k+1) = cT(k) at c = 0.98, T(0) = 20.0 and N = 50
Figure 6.7  Costs versus Time for T(k+1) = cT(k) at c = 0.99, T(0) = 20.0 and N = 50
Figure 6.8  Perturbed Costs versus Time for T(k+1) = cT(k) at Various Values of c, T(0) = 20.0 and N = 50
Figure 6.9  Best Costs versus Time for T(k+1) = cT(k) at Various Values of c, T(0) = 20.0 and N = 50
Figure 6.10  Map of the Best Tour at 1st Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50 and Best Cost = 1376.23
Figure 6.11  Map of the Best Tour at 15th Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50 and Best Cost = 742.42
Figure 6.12  Map of the Best Tour at 35th Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50 and Best Cost = 529.52
Figure 6.13  Map of the Best Tour at 60th Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50 and Best Cost = 342.27
Figure 6.14  Map of the Best Tour at 93rd Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50 and Best Cost = 319.95
Figure 6.15  Map of the Best Tour at 122nd Iteration for T(k+1) = cT(k), c = 0.95, T(0) = 20.0, N = 50 and Best Cost = 299.06
Figure 6.16  Map of the Best Tour at 230th Iteration for T(k+1) = cT(k), c = 0.96, T(0) = 20.0, N = 50 and Best Cost = 292.27
Figure 6.17  Map of the Best Tour at 286th Iteration for T(k+1) = cT(k), c = 0.97, T(0) = 20.0, N = 50 and Best Cost = 293.79
Figure 6.18  Map of the Best Tour at 277th Iteration for T(k+1) = cT(k), c = 0.98, T(0) = 20.0, N = 50 and Best Cost = 295.45
Figure 6.19  Map of the Best Tour at 407th Iteration for T(k+1) = cT(k), c = 0.99, T(0) = 20.0, N = 50 and Best Cost = 296.85
Figure 6.20  Minimum Costs or Quality of Final Solutions versus c for T(k+1) = cT(k) at T(0) = 20.0 and N = 50
Figure 6.21  Temperature versus Time for Tk = d/log k at Different Values of d
Figure 6.22  Costs versus Time for Tk = d/log k at d = 5 and N = 50
Figure 6.23  Costs versus Time for Tk = d/log k at d = 10 and N = 50
Figure 6.24  Costs versus Time for Tk = d/log k at d = 15 and N = 50
Figure 6.25  Costs versus Time for Tk = d/log k at d = 20 and N = 50
Figure 6.26  Perturbed Costs versus Time for Tk = d/log k at Various Values of d and N = 50
Figure 6.27  Best Costs versus Time for Tk = d/log k at Various Values of d and N = 50
Figure 6.28  Map of the Best Tour at 1st Iteration for Tk = d/log k, d = 5, N = 50 and Best Cost = 1376.23
Figure 6.29  Map of the Best Tour at 10th Iteration for Tk = d/log k, d = 5, N = 50 and Best Cost = 596.85
Figure 6.30  Map of the Best Tour at 43rd Iteration for Tk = d/log k, d = 5, N = 50 and Best Cost = 317.23
Figure 6.31  Map of the Best Tour at 67th Iteration for Tk = d/log k, d = 5, N = 50 and Best Cost = 311.63
Figure 6.32  Map of the Best Tour at 203rd Iteration for Tk = d/log k, d = 5, N = 50 and Best Cost = 307.54
Figure 6.33  Map of the Best Tour at 293rd Iteration for Tk = d/log k, d = 5, N = 50 and Best Cost = 285.54
Figure 6.34  Map of the Best Tour at 300th Iteration for Tk = d/log k, d = 10, N = 50 and Best Cost = 365.65
Figure 6.35  Map of the Best Tour at 300th Iteration for Tk = d/log k, d = 15, N = 50 and Best Cost = 514.874
Figure 6.36  Map of the Best Tour at 300th Iteration for Tk = d/log k, d = 20, N = 50 and Best Cost = 630.85
Figure 6.37  Minimum Costs or Quality of Final Solutions versus d for Tk = d/log k
Figure 6.38  Perturbed Costs versus Time at N = 50 for T(k+1) = cT(k), T(0) = 20.0, c = 0.96 and Tk = d/log k at d = 5
Figure 6.39  Best Costs versus Time at N = 50 for T(k+1) = cT(k), T(0) = 20.0, c = 0.96 and Tk = d/log k at d = 5
Figure 6.40  Map of the Best Tour at 108th Iteration using Local Optimization for Tk = d/log k, d = 5, N = 50 and Best Cost = 327.93
Figure 6.41  Local Optimization Costs versus Time for Tk = d/log k at d = 5 and N = 50


Nomenclature

PP(Np)  Speedup of the sequential codes of Algorithm A.
C(σ)  Cost or objective function of the entire TSP tour.
C(σn)  Cost value of the TSP tour at the nth iteration, as a perturbed cost value.
C(σn)B  Best cost value of the TSP tour at the nth iteration, as an accepted cost value.
c, d, depth  Constants of the annealing schedules.
⟨C⟩, σ²(C)  Expectation and variance of the cost function.
C(T), σ(T)  Average and standard deviation of the cost function.
I1  Number of serial iterations required for Algorithm A to converge to some desired cost.
INp  Number of parallel iterations required for Algorithm A to converge to some desired cost.
L  Markov-chain length.
Lij(s)  Edgewise Exchange or Lin's 2-Opt.
Ms  Total number of messages that processor i sends to other processors.
Mr  Total number of messages that processor i receives from other processors.
N, num_nodes  Number of cities in the TSP tour.
Ns, cardinality  Number of cities in a subtour, or cardinality of a subtour.
Np  Number of processors in a parallel system.
Or  Total communication overhead for exchanging (transmitting and receiving) Ns cities among P subtours.
Oa  Average communication overhead of the ith subtour executing on an Np-processor system.
ω(C)  Configuration density.
Q(C,T)  Equilibrium-configuration density.
P, G, A  Transition matrix, generation matrix, and acceptance matrix.
P or npe  Number of partitioned subtours in the TSP.
Pij, Gij, Aij  Transition probability, generation probability, and acceptance probability.
q, qi  Stationary distribution and components of the stationary distribution.
R  Configuration space.
Ropt  Set of globally minimal configurations.
S  Speedup, as a measure of the degree of parallelism of the parallel system performance.
Si  Speedup of Step i of Algorithm A per iteration.
s, s'  Subtour and perturbed subtour.
σ, σ'  Entire current TSP tour and entire perturbed TSP tour.
T, Tk  Annealing schedule and temperature at the kth iteration.
Tij(s)  Citywise Exchange.
T1  Execution time of Algorithm A from Steps A1 to A7 per iteration using 1 processor.
TNp  Execution time of Algorithm A from Steps A1 to A7 per iteration using Np processors.
T1^i  Execution time of Step i of Algorithm A per iteration using 1 processor.
TNp^i  Execution time of Step i of Algorithm A per iteration using Np processors.
Execution time of transmission or receipt of a city between 2 subtours or processors.


CHAPTER 1

INTRODUCTION

1.1 Motivation

Vehicle routing problems involve finding a set of pickup and/or delivery routes from one or several central depots to various demand points (e.g. customers), in order to minimize some objective function (minimization of routing costs, or of the sum of fixed and variable costs, or of the number of vehicles required, etc.). Vehicles may have capacity and, possibly, maximum-route-time constraints. For example, the problem that arises when there is a single domicile (depot), a single vehicle of unlimited capacity, unit demands, only routing costs, and an objective function which minimizes total distance traveled is the famous Traveling Salesman Problem (TSP). Of course, instead of minimizing the distance, other notions such as time, cost, number of vehicles in the fleet, etc. can equivalently be considered. With several vehicles of common capacity, a single depot, known demands, and the same objective function as the TSP, we have a standard vehicle routing problem.

The large body of scholarly literature devoted exclusively to the TSP is quite impressive. One has simply to consult review papers such as Bodin et al. [Bod83] to be convinced that the TSP is perhaps the most fundamental and prominent, and also the most intensively investigated, of all unsolved classical combinatorial optimization problems. Although one can easily state and clearly conceptualize the TSP, it is, in fact, the most "difficult" and the first problem to be described in the book Computers and Intractability [Gar79]; it is also the most common conversational comparator ("Why, it's as hard as the Traveling Salesman!"). The effort spent on this problem is partially a reflection of the fact that the TSP encompasses and represents quite a diverse set of practical problems. A specific and representative example of such practical problems is the application of the TSP in a mission plan for a fully autonomous/semi-autonomous flight vehicle under development at the C.S. Draper Laboratory ([Deut85] and [Adams86]). Furthermore, the TSP is an essential component of most of the general vehicle routing problems such as retail distribution, mail and newspaper delivery, municipal waste collection, fuel oil delivery, etc.; it also has numerous, and sometimes surprising, other applications (see, for example, [Len75]).

Motivated by the TSP's wide applicability and computational intensiveness, the primary goal of this thesis is to design a parallel method to solve the TSP concurrently.
The secondary goal is to analyze the performance of this parallel algorithm with respect to a "typical" sequential algorithm. And, using a Fault Tolerant Parallel Processor (FTPP), currently under development at the C.S. Draper Laboratory, as a testbed ([Harp85] and [Harp87]), together with the parallel algorithm being developed, the third goal of this thesis is to conduct a computational study of the TSP via examination of the quality of its final solutions.

Before the above objectives are discussed in further detail in Section 1.5, and the methodology is introduced in Section 1.4, it is essential to clearly understand the definition of the TSP in Section 1.2 and to briefly review its important role in combinatorial optimization in Section 1.3.

1.2 Problem Statement

The Traveling Salesman Problem can be stated as follows: given a set of N cities and the distances between any two cities, a salesman is required to visit each of the N cities once and only once, starting from any city and returning to the original city of departure; then, "what is the shortest route, or tour, that he must choose in order to minimize the total distance he traveled?" Again, instead of minimizing the total distance, other notions such as time, cost, number of vehicles in the fleet, etc. can equivalently be considered.

Mathematically, it can be formulated as follows: given a set of vertices V = {v1, ..., vN} and the distance dij from vi to vj, what is the best ordered cyclic permutation, σ = (σ(1), ..., σ(N)) of V, that minimizes the cost function

    C(σ) = Σ_{i=1}^{N-1} d_{σ(i),σ(i+1)} + d_{σ(N),σ(1)}                    (1.1)

It is essential to note that this best permutation is selected by comparison of all possible cyclic permutations; there are (N-1)!/2 such permutations in a TSP with a symmetric distance matrix.

1.3 TSP and Combinatorial Optimization

Combinatorial optimization is a subject which consists of a set of problems that are central to many disciplines of science and engineering [Law76]. Research in these areas has aimed at developing efficient techniques for finding minimum (or maximum) values of a function of very many independent variables [Aho74]. This function, usually called the cost function or objective function, represents a quantitative measure of the "goodness" of some complex system. This is one of the main reasons that the last two generations of combinatorial mathematicians and operations research analysts, including computer scientists and engineers, have cumulatively devoted many man-years to the study of combinatorial optimization.

Since the TSP is the most basic and the most representative of all combinatorial optimization problems in general, and an essential component of most vehicle routing problems in particular, it is extremely interesting to know how often old methods for solving the TSP have precipitated and generated new and important general techniques in combinatorial optimization. Because it is not the purpose of this thesis to investigate the historically important role of the TSP, an interested reader is encouraged to examine a recently published monograph, The Traveling Salesman Problem [Law85]; the aim of this section is simply to give the reader a brief flavor and a good appreciation of its historical importance and to highlight certain key events.

From even the earliest studies of discrete models, the Traveling Salesman Problem has been a major stimulant to research on combinatorial optimization.
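As a concrete illustration of the cost function (1.1), the short sketch below evaluates a tour against a symmetric Euclidean distance matrix. It is only an illustration: the function and variable names are invented here and are not taken from the simulation program in Appendix A.

    import math

    def distance_matrix(coords):
        # Symmetric Euclidean distance matrix built from (x, y) city coordinates.
        return [[math.dist(a, b) for b in coords] for a in coords]

    def tour_cost(tour, dist):
        # Tour cost per Eq. (1.1): the N-1 successive leg distances plus the
        # closing leg from the last city back to the first.
        n = len(tour)
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    # A 4-city example: the tour visits each city exactly once and returns home.
    cities = [(0.0, 0.0), (0.0, 3.0), (4.0, 3.0), (4.0, 0.0)]
    d = distance_matrix(cities)
    print(tour_cost([0, 1, 2, 3], d))   # perimeter of the rectangle: 14.0

For N = 4 there are only (N-1)!/2 = 3 distinct tours, so exhaustive comparison is trivial; it is the factorial growth of this count that makes exhaustive search hopeless for large N.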
Early studies of the TSP pioneered the use of cutting plane techniques in integer programming of Dantzig, Fulkerson, and Johnson [Dant54] and were responsible for several important ideas associated with tree enumeration methods including coining the term "branch and bound" ([Lit63] and [Moh82]). They also introduced problem partitioning and decomposition techniques in the context of dynamic programming [Held62] that later proved to be fruitful in other applications of dynamic programming, and in assessing heuristic methods for combinatorial optimization. An isolated probabilistic study of the TSP in the plane [Beard59] has become widely recognized as the seminal contribution to the probabilistic evaluation of heuristic methods for combinatorial optimization problems. Many contributions to combinatorial optimization throughout the 1950's and 1960's, for such problem classes as machine scheduling and production planning with setup costs, crew scheduling, set covering, and facility location problems were extensions and generalizations on these basic themes. Research focused on the design of optimization algorithms, usually based upon dynamic programming recursions or somewhat tailored -18- M. Tran Chapterl:Introduction versions of general-purpose integer programming methods, often for special cases in the problem class. Studies of scheduling theory as summarized by Conway, Maxwell, and Miller [Con67] and of uncapacitated inventory and production lot size planning by dynamic programming are prototypes of this period, as are branch and bound methods for plant location problems ([Wag58], [Dav69], and [Fed80]). At the same time that these integer and dynamic programming methods were evolving, combinatorial optimization was emerging and flourishing as a discipline in applied mathematics, based, in large part, on the widespread practical and combinatorial applications of network flow theory [Ford62] and its generalizations such as nonbipartite matching and matroid optimization [Law76]. Indeed, it is easy to understate the importance of these landmark contributions in defining combinatorial optimization as we know it today. In a survey, Klee summarizes much of the research devoted to these topics during this period [Klee80]. Although researchers were designing and applying heuristic (or approximate) algorithms during the 1950's and 1960's (for example, exchange heuristics for both the Traveling Salesman Problem [Lin65] and facility location problem ([Lit63] and [Man64])), optimization-based methods remained at the forefront of academic activity. The heuristic algorithms developed at this time may have been progenitors of algorithms studied later, but their analysis was often of such a rudimentary nature that heuristics did not capture the imagination and full acceptance of the academic community of this era. Rather than statistical assessment or error bound analysis, limited empirical verification of heuristics ruled the 1960's. Two developments in the 1970's, namely the emergence of computational complexity theory and the evolution of enhanced capabilities in mathematical programming, revitalized combinatorial optimization and precipitated a new focus in its research. The familiar computational complexity theory ([Cook71], and [Karp72]) has shown that the TSP and nearly every other "difficult" combinatorial problem, the so called NP-complete -19- M. 
Tran Chapterl:Introduction (nondeterministic polynomial time complete) class of problems, are all computationally equivalent; namely, each of these problems has eluded any algorithmic design guaranteed to be more efficient than tree enumeration, and if one problem could be solved by an algorithm that is polynomial in its problem size, then they all could. This revelation suggested that algorithmic possibilities for optimization methods were limited, and motivated renewed interest to design and analyze effective heuristics. Lin & Kemighan's variable r-opt exchanges [Lin73] is an example of such heuristics. Again, the TSP was at the forefront during this era. Worst case (i.e. performance guarantee) analysis [Chris76], statistical analysis, and probabilistic analysis [Karp77] of various heuristics for the TSP typified this period of research and were among the first steps in the evolution of interesting analytic approaches for evaluating heuristic methods. Indeed, the mere fact that computational complexity theory embraced the "infamous" Traveling Salesman Problem undoubtedly was instrumental in the theory's acceptance as a new paradigm in operation research, computer science and engineering. Computational complexity theory has become pervasive, so much so that Garey and Johnson's comprehensive monograph discusses more than 300 combinatorial applications (320 applications exactly), and the TSP is the first representative problem to be described in their book [Gar79]. Lenstra and Rinnooy Kan's [Len81] summary of computational complexity as applied specifically to vehicle routing and scheduling also shows that most of these problems are NP-complete. Consequently, the Traveling Salesman Problem would appear to be both a source of inspiration and a prime candidate for analysis by heuristic methods. Cumulatively, this fertile decade of 1970's research has yielded much improved capabilities for applying optimization methods to combinatorial problems, capabilities that tend to counterbalance the trend, stimulated by computational complexity theory, toward heuristic methods. As a consequence, heuristic methods provide excellent opportunities for current algorithmic developments for the Traveling Salesman Problem of the 1980's. -20- M. Tran 1.4 Chapterl: Introduction Methodology Currently, the search for faster heuristic methods for combinatorial optimization problems seems to follow two directions. On one hand, the search for faster computing machinery such as the FTPP [Harp87] has recently received a substantial amount of attention. On the other hand, there has been a considerable amount of effort devoted to the development of better algorithms. The number of instances in which these methodologies and algorithms thus developed to solve the TSP have been used successfully in practical applications has been growing encouragingly over the past years. These algorithms can be classified into two categories, namely exact and heuristic (or approximate)algorithms. An example of such an exact algorithm for solving the TSP is the branch and bound method [Moh82]. This method is a generalized scheme for limiting the necessary enumeration of the entire configuration space of the TSP, thus improving on the exhaustive search technique. It accomplishes this by arranging the configuration space as a tree and attempting to find bounds by which entire branches of the configuration space may be discarded from the search. 
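A minimal sketch of this bounding idea is given below. It is an illustrative toy that branches on which city is visited next, rather than on the inclusion or exclusion of individual tour edges as in the tree described in the following paragraphs, and it is not the parallel branch and bound method of [Moh82]; all names are invented for the example.

    import math

    def branch_and_bound_tsp(dist):
        # Toy branch and bound: a partial tour is extended one city at a time,
        # and a branch is pruned as soon as its partial length already reaches
        # the best complete tour found so far.
        n = len(dist)
        best = {"cost": math.inf, "tour": None}

        def extend(partial, length, remaining):
            if length >= best["cost"]:            # bound: discard this whole branch
                return
            if not remaining:                     # complete tour: add the closing edge
                total = length + dist[partial[-1]][partial[0]]
                if total < best["cost"]:
                    best["cost"], best["tour"] = total, list(partial)
                return
            for city in sorted(remaining):
                extend(partial + [city],
                       length + dist[partial[-1]][city],
                       remaining - {city})

        extend([0], 0.0, frozenset(range(1, n)))
        return best["tour"], best["cost"]

Even with pruning, the worst-case effort remains exponential in N, which is why the discussion below turns from exact enumeration toward heuristic methods.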
Let us consider a configuration space of an N-city TSP, which may be represented by a number of binary state variables corresponding to the presence or absence of a tour edge (or branch of the tour); each edge directly connects city i and city j. From this configuration space, we can derive that there are N(N-1)/2 such possible edges, and of course "only" (N- 1)!/2 combinations of the binary state variables to map to the valid tours. It is important to note that, in this method, a tree is constructed such that it branches in two directions at each node, depending on whether or not a particular edge is considered part of the tour. As we descend through the tree, the distance of the current incomplete tour grows as certain edges considered are included in the tour. If a certain upper bound has already been established for the optimal tour length, then an entire -21- M. Tran Chapterl:Introduction branch of the configuration space tree may be eliminated if the current incomplete tour length already exceeds that bound. Equally important to know that as the algorithm proceeds through the search tree, lower and upper bounds may be discovered as new branches are traversed. Although this ability to prune the search tree is vital to the success of the branch and bound algorithm, such expansion and pruning of the search tree can continue endlessly. Though most of these algorithms have aimed for efficiency and computational tractability, the TSP is a NP-complete problem. In other words, the TSP is unlikely to be solvable exactly by any algorithm or amount of computation time when the problem size (the number of cities in the TSP), N, is large. Because of the exponentially dependent nature of the computation on N, the computing time in employing an exhaustive search in solving the TSP for the exact solution is practically infeasible. Such infeasibility can be evidently demonstrated in a simple example. Let us consider a computer that can be programmed to enumerate all the possible tours for a set of N cities, keeping track of the the shortest tour. Suppose this computer enumerates and examines a tour in one microsecond. At this rate the computer would solve a ten city problem in 0.18 seconds, which is not too bad. In a fifteen city problem, the computational effort would require over twelve hours. But, a twenty city problem would require nearly two thousand years! [Cerv87]. It is not too difficult to render such an algorithm entirely impractical--only a small increase in the problem size causes a nearly unbounded computation time. Because of the impracticality in employing exact algorithms in solving the TSP, there exists fortunately another algorithmic category that constitutes quite a few practical heuristic (or approximate) algorithms, known as iterative improvement algorithms, for finding nearly optimal solutions for the TSP and other combinatorial optimization problems. Iterative improvement starts with a feasible tour and seeks to improve the tour via a sequence of interchanges. In other words, it begins by selecting an initial state in the -22- M. Tran Chapterl: Introduction configuration space and successively applying a set of rules for alternating the configuration so as to increase the optimality of the current solution. Given a random starting solution, the algorithm descends on the surface of the objective function until it terminates at a local minimum. 
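The interchanges used by such iterative improvement schemes can be made concrete with the 2-opt (edge exchange) move, which reappears later as the Edgewise Exchange neighborhood: removing two edges of the tour and reconnecting it the other way is equivalent to reversing the intervening segment of the city list. The sketch below generates one such neighboring tour; it is only an illustration of the standard move, with invented names, and is not the thesis's implementation.

    import random

    def two_opt_neighbor(tour, rng=random):
        # One 2-opt move: choose two cut points and reverse the segment between
        # them; this removes two edges of the cycle and reconnects the tour the
        # other way around.
        n = len(tour)
        i, j = sorted(rng.sample(range(n), 2))
        if j - i < 2 or (i == 0 and j == n - 1):
            return list(tour)      # this pair of cuts would leave the cycle unchanged
        return tour[:i] + tour[i:j][::-1] + tour[j:]

A plain improvement pass accepts such a move only when it shortens the tour, and repeats until no improving exchange remains.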
The local minimum occurs because none of the allowed transitions or moves in the configuration space yield states with lower objective function. Thus, one application of this algorithm yields what may be a fairly optimal solution. By repeating this procedure many times, the probability of finding more highly optimal states is increased. The best-known algorithms of this type are the edge (branch) exchange algorithms provided by Lin [Lin65] and Lin & Kernighan [Lin73], respectively. In the general case, r edges in a feasible tour are exchanged for r edges not in that solution as long as the result remains a tour whose length, distance, or cost is less than that of the previous tour. Exchange algorithms are referred to as r-opt algorithms where r is the number of edges exchanged at each iteration. In an r-opt algorithm, all exchanges of r edges are tested until there is no feasible exchange that improves the current solution. This solution is then said to be r-optimal [Lin65]. In general, the larger the value of r, the more likely it is that the final solution is optimal. Even for approximate algorithms, the number of operations necessary to test all r exchanges unfortunately increases rapidly as the number of cities increases. As a result, values of r = 2 and r = 3 are the ones most commonly used. Lin & Kernighan's (variable r-opt) algorithm, which decides at each iteration how many edges to exchange, has proven to be more powerful than Lin's (r-opt) algorithm. Lin & Kernighan's algorithm requires considerably more effort to code than either the 2-opt or 3-opt approach. However, it produces solutions that are usually near-optimal. Since Lin & Kernighan's algorithm decides dynamically at each iteration what the value of r (the number of edges to exchange) should be, series of tests are performed to determinine whether (r+l) edge exchanges should be considered. This process continues until stopping conditions are satisfied. -23- M. Tran Chapterl: Introduction Such heuristic algorithms, whose rate of growth of the computation time is a low order polynomial in N, rather than exponential in N, have been observed to perform well. Among these heuristic algorithms, a modified iterative improvement heuristic known as Synchronous Parallel Simulated Annealing Algorithm is selected to investigate the Traveling Salesman Problem. 1.4.1 Simulated Annealing Annealing is the process of heating a solid and cooling it slowly so as to remove strain and crystal imperfections. The Simulated Annealing process consists of first "melting" the system being optimized at a high effective temperature, then lowering the temperature by slow stages until the system "freezes" and no further changes occur. At each temperature, the simulation must proceed long enough for the system to reach a steady state. The sequence of temperatures attempted to reach to a steady-state equilibrium is referred to as an annealing schedule. During this annealing process, the free energy of the solid is minimized. The initial heating is necessary to avoid becoming trapped in a local minimum. Virtually every function can be viewed as the free energy of some system and thus studying and imitating how nature reaches a minimum during the annealing process should yield optimization algorithms. In 1982, Kirkpatrick, Gelatt & Vecchi [Kirk82,83] observed that there is an analogy between combinatorial optimization problems such as the TSP and large physical systems of the kind studied in statistical mechanics. 
Using the cost function in place of the energy function and defining configurations by a set of energy states, it is possible, with the Metropolis procedure that allows uphill transitions, to generate a population of configurations of a given optimization problem at some effective temperature schedule. This temperature schedule is simply a control parameter in the same units as the cost function. -24- M. Tran Chapterl: Introduction Just what simulation, i.e. imitation, here means mathematically, along with its underlying relation with statistical physics, will be the subject of Chapter 2 and Chapter 3. The resulting method called "Simulated Annealing", which is a heuristic combinatorial optimization technique that modifies the iterative improvement method by allowing the possibility of uphill moves in the configuration space, has become a remarkably powerful tool in solving global optimization problems in general and the TSP in particular. 1.4.2 Parallelization As briefly mentioned in the section before last, one of the major developments in computing in recent years has been the introduction of a variety of parallel computers and the development of parallel algorithms which effectively utilize these computers' capabilities. Parallel computer is informally meant to be any capable electronic machinery which performs two or more computations simultaneously or concurrently. Algorithms that are designed to carry out many simultaneous or concurrent operations are called "parallel algorithms". In contrast, traditional computers that are designed to execute one instruction at a time are "sequential computers", and algorithms designed for such computers are "sequential algorithms". The development of parallel computers and parallel algorithms is motivated by a number of objectives. First, sensor systems may be geographically separated which may dictate a need for distributed computations at sensor sites. Gathering these distributed informations may be impractical due to limited communication link bandwidth or severe contention for links between processors. The second goal is the desire to solve problems in an ever-increasing range of problems. The third objective for parallel computation is the desire to solve problems more cheaply than with sequential computers. For instance, it is well known that many optimization problems are very expensive to solve due to one (or more) of the following reasons: the size of the problem, i.e. number of variables and -25- M. Tran Chapter]:Introduction constraints, is large; the objective function or constraints are expensive to evaluate; many iterations or function evaluations are required to solve the problem; a large number of variations of the same problem must be solved. Indeed, often optimization problems are not solved or are only "solved" approximately because the cost of solving them with existing sequential computers and algorithms is prohibitive or does not fit within real-time constraints. Finally, it may simply be desirable to implement parallelism or concurrency to increase the computational speed of the algorithm, enabling the solution of problems that were too time-consuming to be solved in a reasonable amount of time, or allowing the solution of problems within the time constraints of real-time systems. Questions such as "what is the speedup," i.e. how much faster the parallel algorithm is over the serial algorithm, is addressed in Chapter 5. 
Because of the above objectives, a major research effort in the design of parallel computers and algorithms has been under way. Parallel algorithms have been broadly classified into two classes [Kung76]: synchronous and asynchronous algorithms. A synchronous algorithm is one where a global "clock" controls the sequence of operations to be performed by all processors. The advantage of such an algorithm is that the flow of data in the processors is easy to control and the analysis of the algorithm is somewhat simpler than the asynchronous counterpart. However, the speed of the algorithm may be dependent upon the computation time of the slowest processor. For example, it may often be the case that while the slowest processor is computing, the faster processors are idle waiting for the next operation to be performed. An asynchronous algorithm, on the other hand, performs computations independently at each processor according to a local "clock". Each processor computes independently with the knowledge it currently has. Because each processor is always performing useful computation, the speed of the algorithm is less dependent upon the slowest processor. There are several methodologies presently existing for reconstructing a serial algorithm into a parallel algorithm. One way is the use of the "branch and bound" method as in [Moh82]. The parallelization of the algorithms for the Traveling Salesman Problem -26- M. Tran Chapterl:Introduction and the results on speedup and parallelism for synchronous and asynchronous parallel algorithms were compared. Both approaches were tested on the Cm* multiprocessor system. In general, it was shown that the asynchronous approach resulted in a higher speedup than the synchronous counterpart. Furthermore, increased parallelism severely affected the synchronous algorithm due to bottlenecks and idle processors. The asynchronous algorithm, however, behaved reasonably well with increased parallelism. Another method for parallelization is the use of partition of the Markov chains of the Simulated Annealing Algorithm [van87]. The basic idea underlying this methodology is to assign a Markov chain to each available processor and let processors compute and generate Markov chains simultaneously. Using this method, it is reported that the speedup of this method is about 6 to 9, i.e. the parallel algorithms are about 6 to 9 times faster than the sequential algorithms. Still another method is to partition the Traveling Salesman Problem into subproblems. Then, each of these subproblems is assigned to a processor which is responsible for all the computations of that particular subproblem. And, the results are combined together when all subcomputations are complete. This methodology, ideally, reduces the computation time by the number of processors used. However, due to bottlenecks, time delays and processor saturation, the approach usually reduces the computation time by an amount less than the ideal. This methodology was considered by Tsitsiklis [Tsit84] and Bertsekas [Bert85], Schnabel [Sch84] and is examined in this thesis. Within this framework of dividing a problem into subproblems and of categorizing parallel algorithms into synchronous and asynchronous algorithms, as will be examined in detail in Chapter 4, a Parallel Simulated Annealing Algorithm is designed by partitioning the tour of the TSP into subtours. 
Each subtour is assigned to a processor which is responsible for computing the current cost of its subtour, perturbing the current subtour and computing the cost of this perturbed subtour, and performing the annealing process. Then, the results of these subtours are combined and the annealing process is examined -27- M. Tran Chapterl:Introduction globally. It is important to note that one must carefully partition the tour of the TSP into subtours such that one does not introduce too much overheads, thus, defeating the effectiveness of the parallelization, and these overheads will be addressed in Chapter 5. 1.5 Objective and Thesis Outline It is hoped that the investigation in this thesis will yield some interesting and useful results. These expected results are briefly outlined as follows: (1). The performance of parallel software and hardware systems is measured by means of an important metric called speedup. Speedup for the Synchronous Parallel Simulated Annealing Algorithm is analytically analyzed in Chapter 5. In this way, questions such as "How much faster this candidate parallel algorithm is over a serial algorithm?" can be evaluated. Furthermore, the second metric known as message capacity is also analytically studied in Chapter 5; so that question such as "How much message communication overhead, i.e. congestion and competition, in a network-based parallel processor?" can be studied. (2). Since the performance of the Simulated Annealing Algorithm is measured by the quality of solutions and running times, a computational study of the Synchronous Parallel Simulated Annealing Algorithm for an instance of the TSP is performed. In this way, questions such as "Is Simulated Annealing Algorithm a good heuristic?" can be answered by a comparison of the results of Local Optimization with the Annealing Algorithm. (3). It has been a well-known fact that neighborhood structures or perturbation functions have major effects on the performance of the Simulated Annealing Algorithm. Two specific neighborhood structures, namely Citywise Exchange and Edgewise Exchange, are selected to investigate the performance of the algorithm. In this way, questions such as "How strongly do neighborhood structures affect the overall performance of the Simulated -28- M. Tran Chapterl: Introduction Annealing Algorithm?" can be addressed by analyzing and comparing the results of two neighborhood structures. (4). It is also well-known that the performance of the Simulated Annealing Algorithm is dependent upon the annealing schedules. In order to answer "How will different annealing schedules affect the behavior of the Simulated Annealing Algorithm?", two different annealing schedules which are derived from the theoretical efforts on convergence of the Simulated Annealing Algorithm are investigated experimentally for an instance of the TSP. These annealing schedules are as follows: Tk+1 = cTk and Tk = d/logk, where c and d are constants under investigation. In this way, the results of two annealing schedules can be analyzed, and the effects of these annealing schedules on the overall performance of the algorithm can be evaluated. In Chapter 2, the underlying motivation and historical development of the Simulated Annealing Algorithm are outlined and discussed. In Chapter 3, the mathematical foundations of the Simulated Annealing Algorithm are examined. In Chapter 4, the design of a Parallel Simulated Annealing Algorithms for the TSP is formally developed and presented. 
In Chapter 5, the speedup of the Parallel Simulated Annealing Algorithm is analyzed. The performance analyses and comparisons of the candidate parallel algorithm, in the context of quality of solutions and running times for two different neighborhood structures and two different annealing schedules, are the subjects of Chapter 6. Finally, in Chapter 7, the main contributions of this thesis are summarized and highlighted, and future research directions are suggested.


CHAPTER 2

CLASSICAL SIMULATED ANNEALING

2.1 Introduction

As briefly introduced in the previous chapter, Simulated Annealing ([Kirk82,83] and independently [Cern85]) is one of the most powerful heuristic optimization techniques for solving difficult combinatorial optimization problems, which have been known to belong to the class of NP-complete problems. This new approach was originally invented and developed by physicists, based on ideas from statistical mechanics and motivated by an analogy to the behavior of physical systems in the presence of a heat bath. Because the number of molecules in the physical system of interest is very large, experimental measurement of the energy of every molecule in the system is practically impossible. Physicists were thus forced to develop statistical methods to describe the probable internal behavior of molecules. In its original form, the Simulated Annealing Algorithm is based on the analogy between the simulation of the annealing of solids and the problem of solving large combinatorial optimization problems, where the configurations actually are states (in an idealized model of a physical system), and the cost function is the amount of (magnetic) energy in a state. For this reason, the algorithm became known as "Simulated Annealing."

With the Metropolis procedure, Simulated Annealing offers a mechanism for accepting increases in the objective function in a controlled fashion. At each temperature setting, an increase in the tour length is accepted with a certain probability, while a decrease in the tour length is always accepted. In this way, it is possible that accepting an increase will reveal a new configuration that will avoid a local minimum, or at least a bad local minimum. The effect of the method is that one descends slowly. By controlling these probabilities, through the temperatures, many random starting configurations are in essence simulated in a controlled fashion. An analogy similar to this is well known in statistical mechanics. The non-physicist, however, can view it simply as an enhanced version of the familiar technique of "iterative improvement," in which an initial configuration is repeatedly improved by making small local alterations until no such alteration yields a better configuration. Simulated Annealing randomizes this procedure in a way that allows for occasional "uphill moves," changes that worsen the configuration, in an attempt to reduce the probability of getting stuck at a poor and locally optimal configuration. Since the Simulated Annealing Algorithm is a generalization of "iterative improvement," and because of its apparent ability to avoid poor local optima, it can readily be adapted to solving new combinatorial optimization problems, thus offering hope of obtaining significantly better results.

Ever since Kirkpatrick et al.
[Kirk82,83] introduced the concepts of annealing, with the incorporation of the Metropolis procedure [Met53], into the field of combinatorial optimization and applied it successfully to the "Ising spin glass" problem, much attention has been devoted to research on the theory and applications of Simulated Annealing. Important fields as diverse as VLSI design ([Kirk82,83] and [Rom85]) and pattern recognition [Gem84] have been applying Simulated Annealing with substantial success. Computational results to date have been mixed. For further detailed examinations, an interested reader is encouraged to refer to Kirkpatrick, Gelatt & Vecchi [Kirk82,83], Golden & Skiscim [Gold86], and Kim [Kim86].

In order to fully appreciate the thrust underlying the Simulated Annealing Algorithm as introduced in Section 2.4, it is important to understand Local Optimization, which is briefly reviewed in Section 2.2, and the birth of the Simulated Annealing Algorithm, which is fully discussed in Section 2.3.

2.2 Local Optimization

To gain a real appreciation of the Simulated Annealing Algorithm as will be described in more detail in Section 2.3 and Section 2.4, one must first understand Local Optimization. A combinatorial optimization problem can be specified by identifying a set of configurations together with a "cost function" that assigns a numerical value to each configuration. An optimal configuration is a configuration with the minimum possible cost (there may be more than one such configuration). Given an arbitrary configuration to such a problem, Local Optimization attempts to improve on that configuration by a series of incremental, local changes. To define a Local Optimization algorithm, one first specifies a method for perturbing configurations so as to obtain different ones. The set of configurations that can be obtained in one such step from a given configuration i is called the neighborhood of i. The algorithm then performs the simple loop shown in Figure 2.1 (with the specific methods for choosing i and j left as implementation details).

    1. Get an initial configuration i.
    2. While (there is an untested neighbor of i) do the following:
       2.1 Let j be an untested neighbor of i.
       2.2 If cost(j) < cost(i), set i = j.
    3. Return i.

    Figure 2.1: Local Optimization Algorithm.

Although i need not be a globally optimal configuration when the loop is finally exited, it will be locally optimal in that none of its neighbors has lower cost. The hope is that "locally optimal" will be good enough. Because the locally optimal configuration is not always sufficient, as can be seen from Figure 2.2, the Simulated Annealing Algorithm may provide the means to find both good locally optimal configurations and possibly a globally optimal configuration. Hence, it is the topic of discussion of the next section and the following.

    [Figure: the cost function C(i) plotted against the configurations i]
    Figure 2.2: Plateau, Local Minima and Global Minimum for the Cost Function.

2.3 Statistical Mechanics--A Physical Analogy

As will be seen in the next section, Simulated Annealing is the algorithmic counterpart to a physical annealing process of statistical mechanics, using the well-known Metropolis Algorithm as its inner loop. Statistical mechanics concerns itself with analyzing aggregate properties of large numbers of atoms in liquids or solids.
The behavior is characterized by random fluctuations about a most probable behavior, namely the average behavior of the system at that temperature. An important question is: what happens to the molecules in the system at extremely low temperatures, i.e., near absolute zero? The low-temperature state may be referred to as the ground state, or lowest-energy state, of the system. Since low-temperature states are very rare, experiments that reveal the low-temperature state of a material are performed by a process referred to as annealing. In condensed matter physics, annealing denotes a physical process in which a solid material under study in a heat bath is first melted by increasing the temperature of the heat bath to a maximum value at which all particles of the solid arrange themselves randomly in the liquid phase; this melted material is then cooled slowly by gradually lowering the temperature of the heat bath, with a long time spent at temperatures near the freezing point. It is important to note that the period of time spent at each temperature must be sufficiently long to allow thermal equilibrium to be achieved; otherwise, certain random fluctuations will be frozen into the material and the true low-energy state, or ground-state energy, will not be reached. The process is like growing a crystal from a melt.

To simulate the evolution toward thermal equilibrium at any given temperature T, Metropolis et al. [Met53] introduced a Monte Carlo method, a simple algorithm that can be used both to generate sequences of internal configurations, or states, and to provide an efficient simulation of collections of atoms, in order to examine the behavior of gases in the presence of an external heat bath at a fixed temperature (here the energies of the individual gas molecules are presumed to jump randomly from level to level in line with the computed probabilities). In each step of this algorithm, a randomly chosen atom is given a small random displacement, and the resulting change, $\Delta E$, in the energy of the system between the current configuration and the perturbed configuration is computed. If $\Delta E \leq 0$, the displacement is accepted, and the configuration with the displaced atom is used as the starting point of the next step. The case $\Delta E > 0$ is treated probabilistically: the probability that the configuration is accepted is $P(\Delta E) = \exp(-\Delta E / k_B T)$. This acceptance rule for new configurations is known as the Metropolis criterion. Random numbers uniformly distributed in the interval (0,1) are a convenient means of implementing the random part of the algorithm. One such number is selected and compared with $P(\Delta E)$; if this random number is less than $P(\Delta E)$, the new configuration is retained for the next step; otherwise, the original configuration is used to start the next step. By repeating the basic step many times and using the above acceptance criterion, one simulates the thermal motion of the atoms of a solid in thermal contact with a heat bath at temperature T, thus allowing the solid to reach thermal equilibrium. This choice of $P(\Delta E)$ has the consequence that the system evolves into the Boltzmann distribution, in which a state i with energy E(i) occurs with probability
$$P_T(i) = \frac{\exp(-E(i)/k_B T)}{Z(T)} \qquad (2.1)$$

where

* $Z(T)$ is a normalization factor, known as the partition function,
* $T$ is the temperature,
* $k_B$ is the Boltzmann constant,
* $i$ is a configuration of molecules in the system,
* $E(i)$ is the energy of configuration $i$,
* $\exp(-E(i)/k_B T)$ is known as the Boltzmann factor,
* and $P_T(i)$ is the probability of configuration $i$.

Figure 2.3 illustrates the probability distribution curves for an energy function at various temperatures.

Figure 2.3: Boltzmann Distribution Curves for an Energy Function at Various Temperatures.

Note that, as the temperature decreases, the Boltzmann distribution concentrates on the states with the lowest energy, and finally, when the temperature approaches zero, only the minimum-energy states have a non-zero probability of occurrence. In statistical mechanics, this Monte Carlo method, the Metropolis Algorithm, is a well-known method for estimating averages or integrals by means of random sampling techniques. The general structure of the Metropolis Algorithm is summarized in Figure 2.4. It is important to note that a decrease (downhill) in energy is always accepted, while an increase (uphill) in energy is accepted probabilistically. After many iterations of the Metropolis Algorithm, it is expected that the configuration of atoms varies according to its stationary probability distribution.

1. Generate an initial random state i of the system.
2. Set the initial temperature T > 0.
3. While (not yet "frozen") do the following:
   3.1 While (not yet in "thermal equilibrium") do the following:
       3.1.1 Perturb an atom, taking the system from state i to state j.
       3.1.2 Compute: dE = Energy(j) - Energy(i).
       3.1.3 If dE <= 0  *decreased energy transition*
             Then set i = j.
       3.1.4 If dE > 0   *increased energy transition*
             Then set i = j with probability = exp(-dE/kBT).
   3.2 Set T = update(T).  *reduce the temperature*
4. Return i.  *return the best state*

Figure 2.4: General Metropolis Algorithm.

The type of acceptance probability used for uphill moves in the Metropolis Algorithm may also be used in the Simulated Annealing Algorithm. The energy change of the Metropolis Algorithm is replaced by the change in the value of the objective function, and the quantity $k_B T$ is replaced by a dimensionless temperature, T. Given a sufficiently low temperature, the distribution of configurations of the optimization problem will converge to a Boltzmann distribution that strongly favors lower objective-function states (the optimal states). The probability of accepting any uphill move approaches zero as the temperature approaches zero. As a result, approaching thermal equilibrium directly at a low temperature requires an unacceptably large number of steps of the algorithm. The general approach of Simulated Annealing is therefore to let the algorithm spend a sufficient amount of time at a higher temperature and then to lower the temperature slowly, in small incremental steps. The process is repeated until a sufficiently low temperature has been reached, i.e., T = 0. This is faster than simply setting the temperature to a low value initially and waiting for the configurations to reach thermal equilibrium. Annealing may be considered as the process of cooling slowly enough that phase transitions are allowed to occur at their corresponding critical temperatures.
Thus, to obtain pure crystalline systems, the cooling phase of the annealing process must proceed slowly while the system freezes. However, it is well known [Kirk82,83] that if the cooling is too rapid, i.e., if the solid or crystal structure is not allowed to reach thermal equilibrium at each temperature value, defects and widespread irregularities or non-equilibrium states can be "frozen" or locked into the solid, and metastable amorphous structures corresponding to glasses can result rather than the low-energy crystalline lattice structure. This process is known in condensed matter physics as "rapid quenching": the temperature of the heat bath is lowered instantaneously, which results in a freezing of the particles of the solid into one of the metastable amorphous structures. The resulting energy level is much higher than it would be in a perfectly structured crystal. Rapid quenching can be viewed as analogous to Local Optimization. When crystals are grown in practice, the danger of bad "local optima" is avoided because the temperature is lowered in a much more gradual way, by a process that Kirkpatrick calls "careful annealing." In this process, the temperature descends slowly through a series of levels, each held long enough for the crystal melt to reach "equilibrium" at that temperature. As long as the temperature is nonzero, uphill moves remain possible. By keeping the temperature from getting too far ahead of the current equilibrium energy level, we can hope to avoid local optima until we are relatively close to the ground state.

The corresponding analogy we are seeking now presents itself. Each feasible configuration of the combinatorial optimization problem, or each feasible tour of the TSP, corresponds to a state of the system; the configuration space of the combinatorial optimization problem, or the permutation space of the TSP, corresponds to the state space of the system; the cost or objective function corresponds to the energy function; the objective value associated with each feasible tour corresponds to the energy value associated with each state of that system; and the optimal configuration or tour, associated with the optimal cost value, corresponds to the ground state, associated with the lowest energy value of the physical system. The analogy is summarized in Figure 2.5.

Physical System      Optimization Problem      Traveling Salesman
State                Feasible Configuration    Feasible Tour
State Space          Configuration Space       Permutation Space
Ground State         Optimal Configuration     Optimal Tour
Energy Function      Cost Function             Cost Function
Energy               Cost                      Cost
Rapid Quenching      Local Optimization        Local Optimization
Careful Annealing    Simulated Annealing       Simulated Annealing

Figure 2.5: Analogy Between a Physical System and Combinatorial Optimization.

2.4 Classical Simulated Annealing

As was discussed in Section 2.2 and illustrated by Figure 2.2, the difficulty with Local Optimization is that it has no way to "back out" of unattractive local optima, because it never moves to a new configuration unless the direction is "downhill," i.e., toward a better value of the cost function. Simulated Annealing is an approach that attempts to avoid entrapment in poor local optima by allowing an occasional "uphill" move. This is done under the influence of a random number generator and an annealing schedule.
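As a concrete illustration of this acceptance rule, the following minimal C sketch implements the Metropolis criterion of Figure 2.4 in the form used by Simulated Annealing. It is illustrative only and is not taken from the thesis software of Appendix A; the function name and the use of the standard library rand() are assumptions made here for clarity.

    #include <math.h>
    #include <stdlib.h>

    /* Metropolis acceptance test: a decrease in cost (or energy) is always
     * accepted; an increase of size delta is accepted with probability
     * exp(-delta / T).  Returns 1 if the move is accepted, 0 otherwise. */
    int accept_move(double delta, double T)
    {
        double r;

        if (delta <= 0.0)              /* downhill move: always accepted   */
            return 1;
        if (T <= 0.0)                  /* frozen: no uphill moves accepted */
            return 0;
        r = (double)rand() / ((double)RAND_MAX + 1.0);   /* uniform on [0,1) */
        return r < exp(-delta / T) ? 1 : 0;
    }

At a high temperature nearly every uphill move passes this test, while as T approaches zero the rule degenerates into the pure descent of Local Optimization described in Section 2.2.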
The attractiveness of the Simulated Annealing approach for combinatorial optimization problems is that transitions away from a local optimum are always possible when the temperature is nonzero. As pointed out by Kirkpatrick et al., the temperature is merely a control parameter; it controls the probability of accepting an increase in tour length. As such, it is expressed in the same units as the objective function. In implementing the approach, any improvement procedure could be used.

As was seen, the Metropolis Algorithm can also be used to generate sequences of configurations of a combinatorial optimization problem. In that case, the configurations assume the role of the states of a solid, while the cost function C and the control parameter T, called the annealing schedule, take the roles of energy and the product of temperature and Boltzmann's constant, respectively. The Simulated Annealing Algorithm can now be viewed as a sequence of Metropolis Algorithms evaluated at each value of the decreasing sequence of the annealing schedule, which is defined to be $T = \{t_1, t_2, \ldots, t_n\}$, where $t_1 > t_2 > \cdots > t_{n-1} > t_n$. It can thus be described as follows. Initially, the annealing schedule is given a high value, and a sequence of configurations of the combinatorial optimization problem is generated. As in the iterative improvement algorithm, a generation mechanism is defined so that, given a configuration i, another configuration j can be obtained by choosing a configuration at random from the neighborhood of i. The latter corresponds to the small perturbation in the Metropolis Algorithm. Let $\Delta C(i,j) = C(j) - C(i)$; then the probability that configuration j is the next configuration in the sequence is 1 if $\Delta C(i,j) \leq 0$, and $\exp(-\Delta C(i,j)/T)$ if $\Delta C(i,j) > 0$ (the Metropolis criterion). Thus, there is a non-zero probability of continuing with a configuration of higher cost than the current configuration. This process is continued until equilibrium is reached, i.e., until the probability distribution of the configurations approaches the Boltzmann distribution, now given by

$$P_T(\text{configuration} = i) = q_i(T) = \frac{\exp(-C(i)/T)}{Q(T)} \qquad (2.2)$$

where $Q(T)$ is a normalization constant depending on the annealing schedule T, equivalent to the partition function $Z(T)$. The probability distribution curves for the cost function are analogous to Figure 2.3, with E(i) replaced by C(i). The annealing schedule T is then lowered in incremental steps, with the system being allowed to approach equilibrium at each step. The algorithm is terminated at some small value of T, at which virtually no further deteriorations, or increases in cost, are accepted. The final "frozen" configuration is then taken as the solution of the problem under consideration. The main steps of the Simulated Annealing Algorithm are outlined in Figure 2.6.

1. Generate an initial random configuration i.
2. Set the initial temperature T > 0.
3. While (not yet "frozen") do the following:
   3.1 While ("inner loop iteration" not yet satisfied) do the following:
       3.1.1 Select a random neighbor j of configuration i.
       3.1.2 Compute: dC(i,j) = Cost(j) - Cost(i).
       3.1.3 If dC(i,j) <= 0  * downhill transition *
             Then set i = j.
       3.1.4 If dC(i,j) > 0   * uphill transition *
             Then set i = j with probability = exp(-dC(i,j)/T).
   3.2 Set T = update(T).
4. Return i.
(Step 3.2 reduces the temperature; Step 4 returns the best configuration.)

Figure 2.6: Simulated Annealing Algorithm.

Thus, as with iterative improvement, we again have a generally applicable approximation algorithm: once configurations, a cost function, and a generation mechanism (or, equivalently, a neighborhood structure) are defined, a combinatorial optimization problem can be solved along the lines given by the description of the Simulated Annealing Algorithm. The heart of this procedure is the loop at Step 3.1, and the importance of this step will be further analyzed in a subsequent chapter when the Parallel Simulated Annealing Algorithm is discussed. Note that the acceptance criterion is implemented by drawing random numbers from a uniform distribution on (0,1) and comparing them with $\exp(-\Delta C(i,j)/T)$. Note also that $\exp(-\Delta C(i,j)/T)$ is a number in the interval (0,1) when $\Delta C$ and T are positive, and so it can rightfully be interpreted as a probability. Note, finally, how this probability depends on $\Delta C$ and T: the probability that an uphill move of size $\Delta C$ will be accepted diminishes as the temperature declines, and, for a fixed temperature T, small uphill moves have higher probabilities of acceptance than larger ones. This particular method of operation is motivated by the physical analogy of crystal growth described in the last section.

The main difference between the Simulated Annealing Algorithm and the Metropolis Algorithm is that the Simulated Annealing Algorithm iterates with a variable temperature, while the Metropolis Algorithm iterates at a constant temperature. As the temperature is slowly decreased to zero, or annealed, the system approaches steady-state equilibrium. This implies that the cost function should converge to a global minimum. It is worth emphasizing that the cooling, or annealing, process should be done slowly; otherwise, the system can get stuck at a local minimum.

Ever since Kirkpatrick recognized the physical analogy between statistical mechanics and combinatorial optimization, the Simulated Annealing Algorithm has been important in many disciplines. Not only has it been successfully applied in many important fields of science and engineering, but it has also been one of the major stimulants of research in the academic and industrial communities. What makes the Simulated Annealing Algorithm powerful is its inherent ability to avoid, or to escape from, entrapment at local minima, of which there are many for even a medium-size combinatorial optimization problem in general and the TSP in particular.

In this chapter, the underlying motivation and historical development of the Simulated Annealing Algorithm have been covered. To provide some useful results for the subsequent chapters, a mathematical model and a quantitative analysis of the Simulated Annealing Algorithm are studied in the next chapter.

CHAPTER 3
QUANTITATIVE ANALYSIS

3.1 Introduction

In Chapter 1, a brief description of Simulated Annealing was introduced. In Chapter 2, the origin and motivation of Simulated Annealing were examined in detail, and the algorithm was outlined. In this chapter, certain key mathematical concepts that form the underlying foundation of Simulated Annealing are investigated. The Simulated Annealing Algorithm can be modelled mathematically using concepts from the theory of Markov chains.
A detailed analysis of these Markov chains is beyond the scope of this thesis; it has been extensively discussed and proved by a number of authors ([van87], [Mit85], [Gem84], and [Haj85]) that, under certain conditions, the algorithm converges asymptotically to an optimal solution. Thus, asymptotically, the algorithm is an optimization algorithm. In practical applications, however, asymptoticity is never attained, and convergence to an optimal solution is therefore no longer guaranteed. Consequently, in practice, the algorithm is an approximation algorithm. The performance analysis of an approximation algorithm concentrates on the following two quantities:

* the quality of the final solution obtained by the algorithm, i.e., the difference in cost value between the final solution and a globally minimal configuration;
* and the running time required by the algorithm.

For the Simulated Annealing Algorithm, these quantities depend on the problem instance as well as on the annealing schedule. Traditionally, three different types of performance analysis are distinguished, namely worst-case analysis [Law85, Chapter 5], average-case analysis [Law85, Chapter 6], and empirical analysis [Law85, Chapter 7]. Worst-case analysis is concerned with upper bounds on the quality of the final solutions, i.e., how far from optimal the constructed tour can be, while average-case analysis focuses on the expected values of the quality of the final solutions and of the running times for a given probability distribution of problem instances. Empirical analysis here means analysis originating in or based on computational experience: in other words, solving many different instances of the TSP with different annealing schedules and drawing conclusions from the results, with respect to both quality of solutions and running time. In this way, the effects of the annealing schedules on the algorithm can be analyzed. It is interesting to analyze these effects because, even for a fixed instance, the computation time and the quality of the final solution are random variables, due to the probabilistic nature of the algorithm.

All three approaches are attempts to provide information that will help in answering the question, "How well will the algorithm perform (how near to optimal will the tours it constructs be) on the problem instances to which I intend to apply it?" Each approach has its advantages and its drawbacks. Worst-case analysis can provide guarantees that hold for individual instances and does not involve the assumption of any probability distribution. The drawback here is that, since the guarantee must hold for all instances, even ones that may be quite atypical, there may be a considerable discrepancy between the bound and the typical behavior of the algorithm. Empirical analysis can be most appropriate if the problem instances on which it is based are similar to the problem of interest. It may be quite misleading if care is not taken in the choice of test problems, or if the test problems chosen have very different characteristics from those at hand. Average-case (or average ensemble) analysis can tell us a lot, especially when we will be applying the algorithm to many instances having similar characteristics. However, by its nature, this type of analysis must make assumptions about the probability distribution over the class of instances, and if these assumptions are not appropriate, then the results of the analysis may not be germane to the instances at hand.
A final problem with both average-case analysis and worst-case analysis of heuristics stems from the rigorous nature of both approaches. Analyzing a heuristic in either way can be a very challenging mathematical task. Heuristics that yield nice probabilistic bounds may be inappropriate for worst-case analysis, and heuristics that behave well in the worst case are often exceedingly difficult to analyze probabilistically. In addition, many heuristics (including quite useful ones, such as that of Lin & Kernighan [Lin73]) do not seem to be susceptible to either type of analysis.

When studying the Simulated Annealing Algorithm, an additional probabilistic aspect is added to the above classification. Besides the probability distribution over the set of problem instances, there is also a probability distribution over the set of possible solutions for a given problem. Thus, in an average-case analysis, the average can refer to the average over the set of solutions of a given problem instance. In this chapter, and computationally in Chapter 6, a combination of average-case analysis over the set of solutions of a given problem instance (also known as an average ensemble) and empirical analysis, grouped together as a "semiempirical" average-case analysis, is investigated for two representative instances of the Traveling Salesman Problem. Using these instances to present a "semiempirical" average-case analysis of the algorithm by running it a number of times, it is possible to reproduce the observed behavior by using standard techniques from statistical physics, discussed in Chapter 2, and some assumptions on the configuration density. Presently, a systematic investigation of the typical behavior and the average-case performance of the Simulated Annealing Algorithm remains an open research problem.

In Section 3.2, the core mathematical model of the Simulated Annealing Algorithm, based on the theory of Markov chains, is presented and discussed. In that section, the salient features of the annealing schedules, which will be useful in the computational study of Chapter 6, are also highlighted. The analysis of the cost function is presented in Section 3.3.

3.2 Mathematical Model

A combinatorial optimization problem can be characterized by a configuration space $\Re$, denoting the set of all possible configurations i, and a cost function $C: \Re \rightarrow \mathbb{R}$, which assigns a real number C(i) to each configuration i. C is assumed to be defined such that the lower the value of C, the better the corresponding configuration with respect to the optimization criteria. This can be done without loss of generality. The objective is to find an optimal configuration $i^*$ for which

$$C(i^*) = C_{\min} = \min\{\, C(i) \mid i \in \Re \,\} \qquad (3.1)$$

where $C_{\min}$ denotes the minimum cost. To apply the Simulated Annealing Algorithm, a mechanism known as the neighborhood structure, or perturbation function, which will be defined precisely in Chapter 4, is used to generate a new configuration, i.e., a neighbor of i, by a small perturbation. The neighborhood of i is defined as the set of configurations that can be reached from configuration i by a single perturbation. The Simulated Annealing Algorithm starts off with a given initial configuration and continuously tries to transform the current configuration into one of its neighbors by applying a perturbation mechanism and an acceptance criterion; a concrete sketch for the TSP is given below.
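To make these definitions concrete for the TSP, the following C sketch writes out a configuration (a tour stored as a permutation of city indices) and its cost function, the total tour length. It is a minimal illustration under the assumption of Euclidean city coordinates; the array names and the problem size are hypothetical and are not taken from the thesis software.

    #include <math.h>

    #define N_CITIES 10                 /* illustrative problem size */

    /* City coordinates: purely illustrative data, not a thesis instance. */
    static double x[N_CITIES], y[N_CITIES];

    /* Euclidean distance between cities a and b. */
    static double dist(int a, int b)
    {
        double dx = x[a] - x[b], dy = y[a] - y[b];
        return sqrt(dx * dx + dy * dy);
    }

    /* Cost function C(i) of Equation 3.1 specialized to the TSP: the total
     * length of the closed tour, i.e. the sum of the edge lengths along the
     * permutation plus the closing edge back to the starting city. */
    double tour_cost(const int tour[N_CITIES])
    {
        double c = 0.0;
        int i;

        for (i = 0; i < N_CITIES - 1; i++)
            c += dist(tour[i], tour[i + 1]);
        return c + dist(tour[N_CITIES - 1], tour[0]);
    }

A neighbor of a tour is then any permutation reachable by one application of the perturbation function, for example by swapping two cities; the specific neighborhood structures used in this thesis are defined in Chapter 4.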
The acceptance criterion allows for deteriorations in the cost function, thus enabling the algorithm to escape from local minima.

3.2.1 Asymptotic Convergence

As mentioned in the last section, the Simulated Annealing Algorithm can be formulated as a sequence of Markov chains, each Markov chain being a sequence of trials whose outcomes, $X_1, X_2, \ldots$, satisfy the following two properties:

(i) Each outcome belongs to a finite set of outcomes $\{1, 2, 3, \ldots, n\}$, called the configuration space $\Re$ of the system; if the outcome of the kth trial is i, then the system is said to be in state i at time k, or at the kth step.

(ii) The outcome of any trial depends at most upon the outcome of the immediately preceding trial and not upon any earlier outcome; with each pair of states, or configurations, (i,j) there is an associated probability $P_{ij}$ that j occurs immediately after i occurs.

Such a stochastic process is called a (finite) Markov chain. The numbers $P_{ij}$, called the transition probabilities, can be arranged into the transition matrix

$$P = \begin{pmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{pmatrix}.$$

Thus, with each configuration i there corresponds the ith row $(P_{i1}, P_{i2}, \ldots, P_{in})$ of the transition matrix P; if the system is in configuration i, this row vector represents the probabilities of all the possible outcomes of the next trial, and so it is a probability vector whose row sum is always equal to 1.

Note that the outcomes of the trials here are the configurations. For example, the outcome of the given trial is the perturbed configuration j, while the outcome of the previous trial is the current configuration i. A Markov chain is thus described by means of a set of conditional probabilities $P_{ij}(k-1, k)$ for each pair of outcomes (i,j); $P_{ij}(k-1,k)$ is the probability that the outcome of the kth trial is j, given that the outcome of the (k-1)th trial is i. Let $a_i(k)$ denote the probability of outcome i at the kth trial; then $a_i(k)$ is obtained by solving the recursive relation

$$a_i(k) = \sum_{l} a_l(k-1)\, P_{li}(k-1, k), \qquad k = 1, 2, \ldots \qquad (3.2)$$

where the sum is taken over all possible outcomes. Let X(k) denote the outcome of the kth trial. Then

$$P_{ij}(k-1, k) = \Pr\{X(k) = j \mid X(k-1) = i\} \qquad (3.3)$$

and

$$a_i(k) = \Pr\{X(k) = i\}. \qquad (3.4)$$

If the conditional probabilities do not depend on k, the corresponding Markov chain is called homogeneous; otherwise it is called inhomogeneous.

In the case of the Simulated Annealing Algorithm, the conditional probability $P_{ij}(k-1,k)$ denotes the probability that the kth transition is a transition from configuration i to configuration j. Thus, X(k) is the configuration obtained after k transitions. In this view, $P_{ij}(k-1,k)$ is the transition probability and the $|\Re| \times |\Re|$ matrix $P(k-1,k)$ the transition matrix. The transition probabilities depend on the value of the annealing schedule T. Thus, if T is kept constant, the corresponding Markov chain is homogeneous, and its transition probability, i.e., the probability that a trial transforms configuration i into configuration j, is defined as

$$P_{ij}(T) = \begin{cases} G_{ij}(T)\,A_{ij}(T) & \text{if } i \neq j \\[4pt] 1 - \displaystyle\sum_{k \in \Re,\, k \neq i} G_{ik}(T)\,A_{ik}(T) & \text{if } i = j \end{cases} \qquad (3.5)$$

where

* $P_{ij}(T)$ denotes the transition probability,
* $G_{ij}(T)$ denotes the generation probability, i.e., the probability of generating configuration j from configuration i,
* $A_{ij}(T)$ denotes the acceptance probability, i.e.,
the probability of accepting configuration j, given configurations i and j,
* and T is the annealing schedule.

Each transition probability is defined as the product of two conditional probabilities: the generation probability $G_{ij}(T)$ of generating configuration j from configuration i, and the acceptance probability $A_{ij}(T)$ of accepting configuration j once it has been generated from i. The corresponding matrices G(T) and A(T) are called the generation and acceptance matrices, respectively. As a result of the definition in Equation 3.5, P(T) is a stochastic matrix, i.e., $\forall i: \sum_j P_{ij}(T) = 1$. Two formulations of the algorithm can be distinguished:

* a homogeneous algorithm, in which the algorithm is described by a sequence of homogeneous Markov chains; each Markov chain is generated at a fixed value of T, and T is decreased between subsequent Markov chains; and
* an inhomogeneous algorithm, in which the algorithm is described by a single inhomogeneous Markov chain; the value of T is decreased between subsequent transitions.

It is not within the scope of this chapter to analyze these two different types of algorithms; they are discussed extensively in [van87]. The Simulated Annealing Algorithm obtains a global minimum if, after a large number of transitions K, i.e., $K \to \infty$, the following relation holds:

$$\Pr\{X(K) \in \Re_{\mathrm{opt}}\} = 1, \qquad (3.6)$$

where $\Re_{\mathrm{opt}}$ is the set of globally minimal configurations. Equation 3.6 can be proved under a number of conditions on the probabilities $G_{ij}(T)$ and $A_{ij}(T)$; asymptotically, i.e., for infinitely long Markov chains and $T \to 0$, the algorithm finds an optimal configuration with probability equal to 1. The proof is based on the existence of an equilibrium distribution [van87]. Let X(k) denote the outcome of the kth trial of a Markov chain; then, under the condition that the Markov chains are irreducible, aperiodic, and recurrent, there exists a unique equilibrium distribution given by the $|\Re|$-vector q(T). The component $q_i(T)$ denotes the probability that configuration i is found after an infinite number of trials and is given by the following expression:

$$q_i(T) = \lim_{k \to \infty} \Pr\{X(k) = i \mid T\} = \lim_{k \to \infty} \left( [P^k(T)]^{\mathsf{T}} a \right)_i \qquad (3.7)$$

where a denotes the initial probability distribution of the configurations and P(T) the transition matrix, whose entries are given by the $P_{ij}(T)$. Under certain additional conditions on the probabilities $G_{ij}(T)$ and $A_{ij}(T)$, the algorithm converges, as $T \to 0$, to a uniform distribution on the set of optimal configurations, i.e.,

$$\lim_{T \to 0} \left( \lim_{k \to \infty} \Pr\{X(k) = i \mid T\} \right) = \lim_{T \to 0} q_i(T) = \pi_i \qquad (3.8)$$

and

$$\pi_i = \begin{cases} |\Re_{\mathrm{opt}}|^{-1} & \text{if } i \in \Re_{\mathrm{opt}} \\ 0 & \text{elsewhere} \end{cases} \qquad (3.9)$$

where $\Re_{\mathrm{opt}}$ denotes the set of optimal configurations. Here, we apply the standard form of the Simulated Annealing Algorithm: the generation probability $G_{ij}(T)$ is chosen independent of T and uniform over the neighborhood of a given configuration i. The acceptance probability is chosen as

$$A_{ij}(T) = \begin{cases} \exp(-\Delta C_{ij} / T) & \text{if } \Delta C_{ij} > 0 \\ 1 & \text{if } \Delta C_{ij} \leq 0 \end{cases} \qquad (3.10)$$

where $\Delta C_{ij} = C(j) - C(i)$. For this choice, the components of the equilibrium distribution take the form

$$q_i(T) = \frac{\exp\{[C_{\min} - C(i)] / T\}}{\displaystyle\sum_{j \in \Re} \exp\{[C_{\min} - C(j)] / T\}}. \qquad (3.11)$$

The above results are extremely useful when the cost function is analyzed in Section 3.3.

3.2.2 Annealing Schedules

As mentioned previously, the performance of the Simulated Annealing Algorithm is a function of the annealing schedule.
Hence, one commonly resorts to an implementation of the Simulated Annealing Algorithm in which a sequence of Markov chains of finite length is generated at decreasing values of the annealing schedule. Optimization begins at a starting temperature $T_0$ and continues by repeatedly generating Markov chains for decreasing values of T until T approaches 0. This procedure is governed by the annealing schedule. Generally, the parameters used in studying the performance of the Simulated Annealing Algorithm are (1) the length L of the individual Markov chains; (2) the stopping criterion used to terminate the algorithm; (3) the starting value $T_0$ of the annealing schedule; and (4) the decrement function of the annealing schedule. The salient features of these parameters, which are the subjects of investigation in Chapter 6, are summarized here. For an extensive treatment of these parameters, the reader is encouraged to examine reference [van87].

(1) Markov-chain length L: All Markov chains are chosen equally long. In practice, the number of cities in the TSP tour or the number of runs of the algorithm is taken to be the length of the Markov chains. For the computational study of Chapter 6, the Markov-chain length is taken to be the number of runs of the algorithm.

(2) Stopping criterion: Many criteria for terminating the Simulated Annealing Algorithm presently exist. To reduce the complexity of the software implementation and of the analysis of the computational results, in our study of Chapter 6 the algorithm is terminated after a maximum number of iterations set by the user.

(3) Starting value $T_0$: The purpose of the starting temperature value is to begin the thermal system at a high temperature, as discussed in Chapter 2. There are many variations for the starting value of the annealing schedules; reported starting values range from as low as 20 to as high as 100. For the purpose of our computational study in Chapter 6, this starting value is set appropriately for the particular annealing schedule under consideration.

(4) Annealing schedule T: As mentioned in Section 1.5 and at various points throughout Chapter 2 and this chapter, the performance of the Simulated Annealing Algorithm is a function of the annealing schedule. Because of this dependence, the following two well-known annealing schedules, which have been shown to provide good solutions to the TSP, are extensively investigated in Chapter 6 by varying the parameters c and d over $0.9 \leq c \leq 0.99$ and $5 \leq d \leq 30$:

$$T_{k+1} = c\,T_k; \qquad k = 0, 1, 2, \ldots, \text{max\_iterations} \qquad (3.12)$$

and

$$T_k = d / \log k; \qquad k = 2, 3, 4, \ldots, \text{max\_iterations}. \qquad (3.13)$$

Note that, as a consequence of the asymptotic convergence of the Simulated Annealing Algorithm, it is intuitively clear that the slower the "cooling" is carried out, the larger the probability that the final configuration is close to an optimal configuration. Thus, the deviation of the final configuration from an optimal configuration can be made as small as desired by investing more computational effort. The literature has not elaborated on the probabilistic dependence on the parameters of the annealing schedule. In this chapter and in Chapter 6, semiempirical results on this topic are presented. A more theoretical treatment is still considered an open research topic.

3.3 Analysis of the Cost Function

In this section, some quantitative aspects of the Simulated Annealing Algorithm are discussed.
The discussion is based on an extensive set of numerical data obtained by applying the algorithm to a specific instance of the Traveling Salesman Problem [Law85]. The description of the problem instance is given in Chapter 6, where the empirical results are obtained and the behavior of the Simulated Annealing Algorithm is analyzed. In this section, an analytical approach for deriving the expectation and the variance of the cost function in terms of the annealing schedule is presented. The discussion is based on an average-case performance analysis.

To model the behavior of the Simulated Annealing Algorithm, an analytical approach to calculating the expectation $\langle C \rangle_T$ and the variance $\sigma_T^2$ of the cost function is discussed. Let X denote the outcome of a given trial; then $\langle C \rangle_T$ and $\sigma_T^2$ can be defined as

$$\langle C \rangle_T = \sum_{i \in \Re} \Pr\{X = i \mid T\}\, C(i) \qquad (3.14)$$

and

$$\sigma_T^2 = \sum_{i \in \Re} \Pr\{X = i \mid T\}\, [C(i) - \langle C \rangle_T]^2. \qquad (3.15)$$

In equilibrium we obtain, using Equations 3.7 and 3.11,

$$\langle C \rangle_T = \sum_{i \in \Re} q_i(T)\, C(i) = \frac{\displaystyle\sum_{i \in \Re} \exp\{[C_{\min} - C(i)]/T\}\, C(i)}{\displaystyle\sum_{j \in \Re} \exp\{[C_{\min} - C(j)]/T\}} \qquad (3.16)$$

and

$$\sigma_T^2 = \sum_{i \in \Re} q_i(T)\, [C(i) - \langle C \rangle_T]^2 = \frac{\displaystyle\sum_{i \in \Re} \exp\{[C_{\min} - C(i)]/T\}\, [C(i) - \langle C \rangle_T]^2}{\displaystyle\sum_{j \in \Re} \exp\{[C_{\min} - C(j)]/T\}}. \qquad (3.17)$$

Next, the configuration density $\omega(C)$ is defined by

$$\omega(C)\, dC = |\Re|^{-1}\, \bigl|\{\, i \in \Re \mid C \leq C(i) < C + dC \,\}\bigr|. \qquad (3.18)$$

Then, for the Simulated Annealing Algorithm employing the acceptance probability of Equation 3.10, the equilibrium-configuration density $\Omega(C,T)$ at a given value of T is given by

$$\Omega(C, T)\, dC = \frac{\omega(C)\, \exp[(C_{\min} - C)/T]\, dC}{\displaystyle\int \omega(C')\, \exp[(C_{\min} - C')/T]\, dC'}. \qquad (3.19)$$

Clearly, $\Omega(C,T)$ is the equivalent of the stationary distribution q(T) given by Equation 3.11. As indicated by the term "equilibrium," $\Omega(C,T)$ is the configuration density in equilibrium when applying the Simulated Annealing Algorithm. Thus, one obtains

$$\langle C \rangle_T = \int C'\, \Omega(C', T)\, dC' \qquad (3.20)$$

and

$$\sigma_T^2 = \int [C' - \langle C \rangle_T]^2\, \Omega(C', T)\, dC'. \qquad (3.21)$$

Given an analytical expression for the configuration density $\omega(C)$, it is possible to evaluate the integrals of Equations 3.19 through 3.21. However, estimating $\omega(C)$ for a given combinatorial optimization problem is in most cases very hard. Indeed, $\omega(C)$ may vary drastically for different specific problem instances, especially for C values close to $C_{\min}$.

The average cost $\bar{C}(T)$ and the standard deviation $\sigma(T)$ of the cost, as functions of the annealing schedule T, when applying the Simulated Annealing Algorithm to an instance of the TSP, are given by

$$\bar{C}(T) = L^{-1} \sum_{i=1}^{L} C_i(T) \qquad (3.22)$$

and

$$\sigma(T) = \left( L^{-1} \sum_{i=1}^{L} [C_i(T) - \bar{C}(T)]^2 \right)^{1/2} \qquad (3.23)$$

where the average is taken over the values of the cost function $C_i(T)$, $i = 1, \ldots, L$, of the Markov chain generated at a given value of the annealing schedule T. From the above relations, the behavior of the Simulated Annealing Algorithm has been observed for many different problem instances and reported by a number of authors (for example, [Kirk83] and [van87]). Furthermore, some characteristic features of the expectation $\langle C \rangle_T$ and the variance $\sigma_T^2$ of the cost function can be deduced. For large values of T, the average and the standard deviation of the cost are approximately constant and equal to $\bar{C}(\infty)$ and $\sigma(\infty)$. This behavior follows directly from Equations 3.16 and 3.17, or Equations 3.18 through 3.21, namely

$$\langle C \rangle_\infty = \lim_{T \to \infty} \langle C \rangle_T = |\Re|^{-1} \sum_{i \in \Re} C(i) \qquad (3.24)$$

and

$$\sigma_\infty^2 = \lim_{T \to \infty} \sigma_T^2 = |\Re|^{-1} \sum_{i \in \Re} [C(i) - \langle C \rangle_\infty]^2. \qquad (3.25)$$

The results presented in this section and in Section 3.2.2 are extremely useful for the computational study of Chapter 6.
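For a configuration space small enough to enumerate, Equations 3.11, 3.16, and 3.17 can be evaluated directly. The following C sketch does so; it is illustrative only (real TSP configuration spaces are far too large to enumerate, which is precisely why the semiempirical averages of Equations 3.22 and 3.23 are used in Chapter 6), and the function name is an assumption made here.

    #include <math.h>

    /* Equilibrium expectation <C>_T and variance sigma_T^2 of the cost over a
     * fully enumerated configuration space, following Eqs. 3.11, 3.16, 3.17.
     * cost[] holds C(i) for the n configurations; results go to *mean, *var. */
    void cost_moments(const double cost[], int n, double T,
                      double *mean, double *var)
    {
        double cmin = cost[0], z = 0.0, m = 0.0, v = 0.0;
        int i;

        for (i = 1; i < n; i++)          /* C_min, for numerical stability */
            if (cost[i] < cmin)
                cmin = cost[i];

        for (i = 0; i < n; i++)          /* partition function of Eq. 3.11 */
            z += exp((cmin - cost[i]) / T);

        for (i = 0; i < n; i++)          /* <C>_T, Eq. 3.16                */
            m += (exp((cmin - cost[i]) / T) / z) * cost[i];

        for (i = 0; i < n; i++)          /* sigma_T^2, Eq. 3.17            */
            v += (exp((cmin - cost[i]) / T) / z)
                 * (cost[i] - m) * (cost[i] - m);

        *mean = m;
        *var  = v;
    }

Evaluating this routine over a decreasing sequence of temperatures reproduces the qualitative behavior noted above: for large T the two moments approach the constants of Equations 3.24 and 3.25, and as T tends to zero the expectation collapses onto C_min.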
Note that more detailed estimates of the average-case performance of the Simulated Annealing Algorithm can only be deduced from a rigorous performance analysis that takes into account the detailed structure of the optimization problem at hand. Presently, such a theoretical average-case performance analysis remains an open research problem. The average-case performance of the Simulated Annealing Algorithm has been discussed here by analyzing the expectation and the variance of the cost function as functions of the annealing schedule for a certain instance of the Traveling Salesman Problem. The results can be summarized as follows:

* the performance of the Simulated Annealing Algorithm depends strongly on the chosen annealing schedule; this is especially true for the quality of the solution obtained by the algorithm;
* with a properly chosen annealing schedule, near-optimal solutions may be obtained;
* computation times can be extremely long for some problems. For example, in solving a TSP instance of 100 cities, a few hundred hours of CPU time on a VAX 11/780 has been reported [van87].

In this chapter, certain key mathematical concepts that form the underlying foundation of Simulated Annealing were examined. In the next chapter, a full description of the design of the Parallel Simulated Annealing Algorithm is presented.

CHAPTER 4
PARALLEL SIMULATED ANNEALING ALGORITHM

4.1 Introduction

In Chapter 2, the underlying historical development of the Simulated Annealing Algorithm was discussed. In Chapter 3, a mathematical model and a quantitative analysis of the Simulated Annealing Algorithm were examined. In this chapter, a Parallel Simulated Annealing Algorithm is designed, an algorithm which provides the basis for the speedup analysis of the next chapter and for the computational study of the following chapter.

Program partitioning, or parallelization, and interprocessor communication are two popular terms in parallel processing. Intuitively, these two terms are self-explanatory: parallelization refers to the process of breaking a program or a problem down into smaller components, but this can be done using several different approaches and for different objectives. Questions such as how one can partition a program, what the boundaries are, and what the tradeoffs and precise goals of parallelization are remain largely unanswered. The term interprocessor communication is also self-explanatory, but similar questions about the precise meaning of interprocessor communication and its impact on program execution have no unique answers. Also, there is no available methodology for quantitatively characterizing these terms. The problems of parallelization of the Simulated Annealing Algorithm for the TSP in this chapter, and of intercommunication between processors discussed in the next chapter, are addressed by modelling them, quantifying them, and identifying their variables.

To make these ideas more concrete, Section 4.2 establishes a framework for the TSP, explains the partitioning process in detail, and gives a high-level description of the overall algorithm. In Section 4.3, two different neighborhood structures, or perturbation functions, for the TSP are presented. The relationship between the cost of the TSP tour and the cost of a subtour is analyzed in Section 4.4. The candidate synchronous and asynchronous parallel algorithms are formally outlined in Section 4.5.
Finally, Section 4.6 discusses key implementation issues arising during the different phases of the candidate Parallel Simulated Annealing Algorithm.

4.2 Algorithm Framework and Parallelization Methodology

In order to formulate the TSP concretely and precisely, let us establish some common ground by introducing some notation. From graph theory, a graph G is defined as G = (V, E), where $V = \{v_0, \ldots, v_{N-1}\}$ is the set of vertices and $E = \{(v_i, v_j) : \text{there is an arc from } v_i \text{ to } v_j\}$ is the set of edges. A tour is defined as an ordered sequence $\sigma = (\sigma(0), \ldots, \sigma(N-1), \sigma(0))$ (or, equivalently, as a function $\sigma : \{0, \ldots, N-1\} \to V$) satisfying $\sigma(i) \neq \sigma(j)$ for $i \neq j$. Here $\sigma$ is the permutation sequence through which the tour is traversed, and $\sigma(i)$ is the ith vertex in the tour. As illustrated in Figure 4.1(a), an arbitrary TSP tour has the permutation, or ordered sequence, $\sigma = (\sigma(0), \sigma(1), \ldots, \sigma(15), \sigma(0)) = (v_0, v_4, v_9, v_{13}, v_3, v_7, v_{11}, v_1, v_2, v_{15}, v_6, v_5, v_{12}, v_8, v_{10}, v_{14}, v_0)$. Thus, the framework for the TSP is established.

As discussed in Section 1.4.2, there are several approaches to partitioning the TSP tour into subtours. The approach here is to partition the entire TSP tour into a number of equal subtours, obtained by dividing the total number of cities (or vertices) in the TSP tour by the number of cities desired in a subtour. To illustrate this concept, consider the arbitrary TSP tour of Figure 4.1(a), and divide this tour into four subtours as shown in Figure 4.1(b).

Figure 4.1: (a) An Arbitrary TSP Tour. (b) The TSP Tour Divided into Four Subtours.

Hence, the first subtour consists of the permutation sequence $s_1 = (s(0), s(1), s(2), s(3)) = (v_0, v_4, v_9, v_{13})$; the second subtour has the permutation sequence $s_2 = (v_3, v_7, v_{11}, v_1)$; the third subtour consists of the permutation sequence $s_3 = (v_2, v_{15}, v_6, v_5)$; and the fourth subtour has the permutation sequence $s_4 = (v_{12}, v_8, v_{10}, v_{14})$. Note that $\sigma$ and $\sigma'$ denote, respectively, the entire TSP tour and the entire perturbed TSP tour, while s and s' denote, respectively, a subtour and a perturbed subtour.

It is important to notice one key feature before proceeding: the number of cities in the TSP tour should be a multiple of the number of cities in the subtours. One can then partition the total TSP tour into subtours of arbitrary size and number. To state this precisely, let N denote the number of cities in the TSP tour; let P denote the number of subtours; let Np denote the number of processing elements (PEs), or processors; and let Ns denote the number of cities in a subtour. Then, provided $N = P\,N_s$ always holds, one can partition the TSP tour into P subtours of size Ns. For example, let the number of cities in the TSP tour be N = 50, and suppose we process this tour using 5 processing elements; then there are 5 (= P) subtours, each of which consists of 10 (= Ns) cities.
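The partitioning step just described can be stated compactly in C. The sketch below is illustrative and is not quoted from the Appendix A software; the function and array names are assumptions, and N is assumed to be an exact multiple of P, as required above.

    #include <stdlib.h>

    /* Partition an N-city tour into P equal subtours of Ns = N / P cities,
     * beginning at a randomly chosen position of the tour (Section 4.2).
     * Subtour k occupies positions [k*Ns, (k+1)*Ns) of subtours[]. */
    void partition_tour(const int tour[], int N, int P, int subtours[])
    {
        int Ns = N / P;                  /* cities per subtour, N = P * Ns */
        int start = rand() % N;          /* random starting position       */
        int k, j;

        for (k = 0; k < P; k++)          /* subtour k receives Ns          */
            for (j = 0; j < Ns; j++)     /* consecutive cities             */
                subtours[k * Ns + j] = tour[(start + k * Ns + j) % N];
    }

For the example above, N = 50 and P = 5 yield five subtours of Ns = 10 cities each, with subtour k stored in subtours[10k] through subtours[10k + 9].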
In order to deepen the understanding of the communication issues discussed in the next chapter, of the candidate algorithms provided in a later section, and of the software provided in Appendix A, it is important to be familiar with the overall conceptual framework of the parallel processing scheme, or software system model, depicted in Figure 4.2. In the next chapter, the speedup of the software system model is analyzed in general, where the number of subtours P may be less than, equal to, or greater than the number of available processors Np. In this chapter, for simplicity of analysis and discussion, the case in which the number of subtours P equals the number of processors Np is considered. Thus, this software system model is assumed to be mapped exactly onto the FTPP [Harp87], which is a loosely coupled Multiple Instruction stream, Multiple Data stream (MIMD) architecture whose Processing Elements (PEs) communicate via I/O devices. These PEs are interconnected by means of an interconnection network or a bus; each PE consists of a general-purpose CPU (an MC68020), RAM, ROM, timers, I/O, and interprocessor communication ports, and is capable of executing its own program independently and operating on its own copy of the data. Mapping the software system model onto the FTPP means that each subtour si of Figure 4.2 is assigned to exactly one PE.

Figure 4.2: High-Level Description of the Parallel Scheme (a central coordinator serving PE0 through PE3).

As can be seen from Figure 4.2, the central coordinator assumes several principal tasks. First, it partitions the entire TSP tour into subtours. Second, for each partitioned subtour, it creates a processing element datum structure. Third, at each temperature value, it assigns a process, which consists of a processing element datum structure and a subtour, to an available PE. Finally, it reconstructs the new TSP tour, computes the cost of the new TSP tour, and performs the global annealing process after all processes have completed their computations. Note that, in this chapter, the terms PE and processor are used interchangeably.

It is important to notice a few key features. First, communications are channelled not only between the central coordinator and the PEs (or subtours), but also between the PEs (or subtours) themselves. These interprocessor communications will be discussed in detail in the next chapter. Second, Simulated Annealing is performed both locally and globally. Finally, the parallelization methodology in this thesis is generally applicable; in other words, if the number of subtours P were set to 1, the Parallel Simulated Annealing Algorithm would reduce to a Serial Simulated Annealing Algorithm.

4.3 Neighborhood Structures

In this section, the two neighborhood structures that are experimentally studied in Chapter 6 are presented. For easy reference, the loop performed on a particular subtour si, which is assigned to a particular processing element PEi in Figure 4.2, can be extracted from Figure 2.6 as shown in Figure 4.3.

3.1. While ("inner loop iteration" not yet satisfied) do the following:
     3.1.1. Select a random neighbor s' of configuration s.
     3.1.2. Compute: dC(s',s) = Cost(s') - Cost(s).
     3.1.3. If dC(s',s) <= 0  * downhill transition *
            Then set s = s'.
     3.1.4. If dC(s',s) > 0   * uphill transition *
            Then set s = s' with probability = exp(-dC/T).

Figure 4.3: General Algorithm of a Subtour.
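A C sketch of the subtour loop of Figure 4.3, as it would run on one PE at a fixed temperature, is given below. It is a minimal illustration, not the Appendix A implementation: perturb() stands for either neighborhood structure of Section 4.3, subtour_cost() for the cost defined in Section 4.4, and the inner-loop count and buffer size are assumptions.

    #include <math.h>
    #include <stdlib.h>

    extern void   perturb(const int s[], int s_trial[], int Ns);  /* Sec. 4.3 */
    extern double subtour_cost(const int s[], int Ns);            /* Sec. 4.4 */

    /* One pass of the Figure 4.3 loop on a subtour s of Ns cities at
     * temperature T: perturb, compute the change in cost, and apply the
     * local annealing (Metropolis) test of steps 3.1.3 and 3.1.4. */
    void anneal_subtour(int s[], int Ns, double T, int inner_iterations)
    {
        int s_trial[256];                /* assumes Ns <= 256 */
        int i, k;
        double dC, r;

        for (i = 0; i < inner_iterations; i++) {
            perturb(s, s_trial, Ns);                              /* step 3.1.1 */
            dC = subtour_cost(s_trial, Ns) - subtour_cost(s, Ns); /* step 3.1.2 */
            r  = (double)rand() / ((double)RAND_MAX + 1.0);
            if (dC <= 0.0 || r < exp(-dC / T)) {          /* steps 3.1.3-3.1.4 */
                for (k = 0; k < Ns; k++)
                    s[k] = s_trial[k];
            }
        }
    }

Note that when dC <= 0.0 the move is accepted regardless of the drawn random number, exactly as in step 3.1.3.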
The general algorithm of a subtour in Figure 4.3 is performed at a particular temperature. The statement that influences the computational results most substantially is statement 3.1.1; therefore, it is germane to pay close attention to the variations of this statement as we proceed. The following neighborhood structures for statement 3.1.1 are subjects of the computational study in Chapter 6.

4.3.1 Citywise Exchange

For a given subtour s, a perturbed subtour s' is obtained by interchanging the positions of two distinct, randomly chosen cities. Such a modification is denoted by Tij(s). The operation Tij takes a subtour s and produces a new subtour s' = Tij(s) which satisfies s'(j) = s(i), s'(i) = s(j), and s'(k) = s(k) for every k not equal to i, j. The algorithmic steps for the subtour are provided in Figure 4.4(a), while the neighborhood structure is outlined in Step 3.1.1 and graphically illustrated in Figure 4.4(b).

3.1. For i = 0 to (Ns - 1), do the following:  * Ns: number of cities in subtour *
     3.1.1. Neighborhood structure:
            (1) Generate a random city j, 0 <= j <= (Ns - 1), j != i.
            (2) Construct a trial permutation from the current permutation as follows:
                Find: i' = min(i,j) and j' = max(i,j).
                Set:  s'(k) = s(k), k = 0, 1, ..., i'-1,
                      s'(i') = s(j'), s'(j') = s(i'),
                      s'(k) = s(k), k = i'+1, ..., j'-1,
                      s'(k) = s(k), k = j'+1, j'+2, ..., (Ns - 1).
     3.1.2. Compute: dC(s',s) = Cost(s') - Cost(s).
     3.1.3. If dC(s',s) <= 0  * downhill transition *
            Then set s = s'.
     3.1.4. If dC(s',s) > 0   * uphill transition *
            Then set s = s' with probability = exp(-dC/T).

Figure 4.4(a): Algorithm of Subtour with Citywise Exchange.

Figure 4.4(b): Neighborhood Structure with Citywise Exchange (cities i and j of subtour s trade positions in the perturbed subtour s').

It is important to notice that, by this construction, only the positions of the ith and jth cities are interchanged.

4.3.2 Lin's 2-Opt Algorithm, or Edgewise Exchange

Another neighborhood structure under consideration in Chapter 6 is Lin's 2-Opt Algorithm, or Edgewise Exchange. In their well-known paper [Lin73], Lin & Kernighan proposed a series of heuristics of increasing complexity that give approximate solutions to the TSP. The simplest one, known as Lin's 2-Opt Algorithm and denoted Lij(s), is the following: starting from a given tour, exchange two edges (or branches) by replacing two edges in a given subtour with two edges not in the subtour, provided that the resulting connection is also a subtour. Whenever this exchange yields a shorter, or better, subtour, it is repeated until no more improvements are possible. To illustrate this concept, let s = (v1, v2, ..., vNs) be the current permutation sequence.

Figure 4.5: Neighborhood Structure with Lin's 2-Opt Exchange, or Edgewise Exchange (edges (vi, vi+1) and (vj, vj+1) of subtour s are replaced by (vi, vj) and (vi+1, vj+1) in the perturbed subtour s').

3.1. For i = 0 to (Ns - 1), do the following:  * Ns: number of cities in subtour *
     3.1.1. Neighborhood structure:
            (1) Generate a random city j, 0 <= j <= (Ns - 1), j != i.
            (2) Construct a trial permutation from the current permutation as follows:
                Find: i' = min(i,j) and j' = max(i,j).
                Set:  s'(k) = s(k), k = 0, 1, ..., i'-1,
                      s'(i' + k) = s(j' - k), k = 0, 1, 2, ..., j' - i',
                      s'(k) = s(k), k = j'+1, j'+2, ..., (Ns - 1).
     3.1.2. Compute: dC(s',s) = Cost(s') - Cost(s).
     3.1.3. If dC(s',s) <= 0  * downhill transition *
            Then set s = s'.
     3.1.4.
            If dC(s',s) > 0   * uphill transition *
            Then set s = s' with probability = exp(-dC/T).

Figure 4.6: Algorithm of Subtour with Lin's 2-Opt Exchange.

There are edges connecting v1 and v2, v2 and v3, ..., vNs-1 and vNs in the current permutation. In the two-edge perturbation strategy, two edges are chosen and broken from the current permutation. These are replaced by the two unique edges required to rejoin the permutation (and create a new one). For example, if the edges (vi, vi+1) and (vj, vj+1) are broken for some i and j such that i < j < Ns, the new edges are (vi, vj) and (vi+1, vj+1). The net result is that the permutation segment between vi+1 and vj is reversed. The perturbed permutation is s' = (v1, v2, ..., vi, vj, vj-1, ..., vi+1, vj+1, ..., vNs), as depicted in Figure 4.5. To be precise, the algorithm of the subtour with Lin's 2-Opt Exchange is provided in Figure 4.6.

It is important to point out the difference between the permutation processes of the Tij(s) and Lij(s) exchanges. Two cities of the subtour are interchanged in the Tij(s) permutation process, hence the term Citywise Exchange, whereas two edges are switched in the Lij(s) process, hence the term Edgewise Exchange. Figure 4.4(b) and Figure 4.5 illustrate this difference: city i is interchanged with city j in Figure 4.4(b), whereas edges (vi, vi+1) and (vj, vj+1) are exchanged for edges (vi, vj) and (vi+1, vj+1) in Figure 4.5.

4.4 Cost of TSP Tour and Subtour

Steps 3.1.2 through 3.1.4 of the Algorithm of a Subtour were outlined in Figure 4.3. Let us examine in more detail what the functions of these steps are in computing the subtour costs, and how they are related to the cost of the entire TSP tour.

Let us attach weights, or costs, to the edges of the entire TSP tour. Let $d_{uv}$ be the cost of an edge (u,v). Thus, $d_{\sigma(i),\sigma(i+1)}$ is the cost associated with an edge of a tour. The cost of a tour $\sigma$, denoted $C(\sigma)$, is given by

$$C(\sigma) = \sum_{i=1}^{N-1} d_{\sigma(i),\sigma(i+1)} + d_{\sigma(N),\sigma(1)} \qquad (4.10)$$

which is exactly the same as Equation 1.1. The cost of the subtour with the Tij(s) exchange is

$$C(T_{ij}(s)) = C(s') = \sum_{e' \in E_{s'}} d_{e'} \qquad (4.11)$$

where $E_{s'}$ is the set of edges resulting from Tij(s). For the Simulated Annealing Algorithm, one is interested in the change in cost from C(s) to C(s'). The change in cost, $\Delta C(s',s)$, is

$$\Delta C(s',s) = C(s') - C(s) = \sum_{e' \in E_{s'}} d_{e'} - \sum_{e \in E_{s}} d_{e} \qquad (4.12)$$

where $E_s$ is the set of edges in the subtour s and $E_{s'}$ is the set of edges in the perturbed subtour s'. Equation 4.13 gives an expression for the cost of the entire TSP tour with P simultaneous subtour perturbations:

$$C(\sigma') = C(\sigma) + \sum_{k=0}^{P-1} \Delta C(s_k', s_k). \qquad (4.13)$$

Equation 4.13 can be interpreted as follows: the perturbed cost of the TSP tour is equal to the cost of the current tour plus the sum of the changes in cost of the individual subtours. This equation is incorporated into the candidate algorithm in the next section.

4.5 Candidate Algorithms

Two algorithms are presented below, both based on the ideas outlined in [Kim86]; both are iterative and utilize the central coordinator. Algorithm A is a Synchronous Parallel Simulated Annealing Algorithm, while Algorithm B is an Asynchronous Parallel Simulated Annealing Algorithm. Both rely on the two perturbation operators of Section 4.3, sketched in code below.
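A C sketch of the two perturbation operators follows: citywise_exchange() realizes the Tij(s) swap of Figure 4.4(a), and edgewise_exchange() realizes the Lij(s) segment reversal of Figure 4.6. The sketch follows the index conventions of the figures and is illustrative rather than a transcription of the Appendix A software.

    /* Citywise Exchange T_ij(s): interchange the cities at positions i and j,
     * so that s'(i) = s(j), s'(j) = s(i) and s'(k) = s(k) otherwise. */
    void citywise_exchange(int s[], int i, int j)
    {
        int tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }

    /* Edgewise (Lin 2-Opt) Exchange L_ij(s): break edges (s(i), s(i+1)) and
     * (s(j), s(j+1)) and rejoin, which reverses the segment between positions
     * i+1 and j.  Assumes i < j within the subtour. */
    void edgewise_exchange(int s[], int i, int j)
    {
        int lo = i + 1, hi = j;

        while (lo < hi) {
            int tmp = s[lo];
            s[lo] = s[hi];
            s[hi] = tmp;
            lo++;
            hi--;
        }
    }

Either operator can serve as the perturb() step of the subtour loop; Chapter 6 compares the quality of the tours each one produces.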
Though an analysis of Algorithm B is not performed in Chapter 6, it would be a very interesting subject for a computational study in future research efforts. Let $\sigma^n$ denote the tour obtained at the nth iteration and let $C(\sigma^n)$ be its associated cost. Also, let $C(\sigma^n)_B$ be the best, or minimum, cost found up to iteration n.

4.5.1 Algorithm A: Synchronous Parallel Simulated Annealing Algorithm

Initialize: At iteration n = 1, generate an initial random tour $\sigma^1$, compute its initial cost $C(\sigma^1)$, and set the initial best cost $C(\sigma^1)_B = C(\sigma^1)$.

1. The central coordinator partitions the entire TSP tour into P subtours and selects the annealing temperature $T^n$ for the nth iteration.

2. Each subtour, along with its processing element datum structure at the nth iteration and $T^n$, is delivered to the kth available processor.

3. (A) At the nth iteration, the kth processor calculates the change in cost of its subtour and performs the local annealing process according to the following relations:

$$\Delta C(s_k'^{\,n}, s_k^n) = C(s_k'^{\,n}) - C(s_k^n) \qquad (4.14)$$

$$s_k^n = \begin{cases} s_k'^{\,n} & \text{if } r < \exp\{-\Delta C(s_k'^{\,n}, s_k^n)/T^n\} \text{ or } \Delta C(s_k'^{\,n}, s_k^n) < 0 \\ s_k^n & \text{otherwise} \end{cases} \qquad (4.15)$$

and

$$C(s_k^n) = \begin{cases} C(s_k'^{\,n}) & \text{if } r < \exp\{-\Delta C(s_k'^{\,n}, s_k^n)/T^n\} \text{ or } \Delta C(s_k'^{\,n}, s_k^n) < 0 \\ C(s_k^n) & \text{otherwise} \end{cases} \qquad (4.16)$$

where r is a uniformly distributed random variable over (0,1). Note that $s_k'^{\,n} = T_{i_k j_k}(s_k^n)$ is the Tij(s) or Lij(s) interchange at the nth iteration, as presented in Section 4.3.

(B) The kth processor participates in Citywise Exchanges with other processors and performs the local annealing process as in Step 3A.

(C) The kth processor then performs the local annealing process as in Step 3A to optimize its subtour again.

NOTE: The kth processor repeats Steps 3A, 3B, and 3C a number of times equal to the cardinality of the subtour (equivalently, the number of cities in the subtour) at each iteration, i.e., Step 3.1 of Figure 4.4(a).

4. The kth processor keeps the central coordinator informed of the status of acceptance of the interchange. Whenever the interchange is accepted, the kth processor delivers the change in cost and the accepted perturbed subtour s' to the central coordinator.

5. The central coordinator reconstructs the new tour $\sigma^{n+1}$ and computes its new cost according to the following relations:

$$\sigma^{n+1} = \bigcup_{k=0}^{P-1} s_k^n \qquad (4.17)$$

and

$$C(\sigma^{n+1}) = C(\sigma^n) + \sum_{k=0}^{P-1} \Delta C(s_k'^{\,n}, s_k^n). \qquad (4.18)$$

6. The central coordinator computes the change in cost and performs the global annealing process according to the following relations:

$$\Delta C = C(\sigma^{n+1}) - C(\sigma^n)_B \qquad (4.19)$$

$$C(\sigma^{n+1})_B = \begin{cases} C(\sigma^{n+1}) & \text{if } r < \exp\{-\Delta C / T^n\} \text{ or } \Delta C < 0 \\ C(\sigma^n)_B & \text{otherwise} \end{cases} \qquad (4.20)$$

and

$$\sigma^{n+1}_B = \begin{cases} \sigma^{n+1} & \text{if } r < \exp\{-\Delta C / T^n\} \text{ or } \Delta C < 0 \\ \sigma^n_B & \text{otherwise} \end{cases} \qquad (4.21)$$

where r is a uniformly distributed random variable over (0,1).

7. Check the stopping criterion in terms of the maximum number of iterations, i.e., "frozen."

8. If the condition of Step 7 is not satisfied, then go to Step 1.

4.5.2 Algorithm B: Asynchronous Parallel Simulated Annealing Algorithm

Though the design of Algorithm B has not been laid out carefully in detail, its main steps are outlined as follows:

Initialize: At iteration n = 1, generate an initial random tour $\sigma^1$, compute its initial cost $C(\sigma^1)$, and set the initial best cost $C(\sigma^1)_B = C(\sigma^1)$.

1. A central coordinator partitions the entire TSP tour into P subtours and selects the annealing temperature $T^n$ for the nth iteration.

2. A subtour s and $T^n$ are delivered to the kth processor, making the kth processor busy.

3.
3. The same step as Step A3.

4. The same step as Step A4, and the kth processor becomes free.

5. The central coordinator reconstructs the tour and updates the cost of the tour by

s <== s' = T_{i_k j_k}(s)
C(\sigma) <== C(\sigma) + \Delta C(s', s)

6. The same as Step A6.

7. The same as Step A7.

8. The same as Step A8.

4.6 Implementation Issues of the Candidate Algorithm

Algorithms A and B outline the major phases of the iterative process. The central coordinator's tasks, Steps A1, A5 and A6, should be implemented as efficiently as possible; otherwise, processors may become idle with no jobs to perform. Thus, one of the most important design goals for the central coordinator tasks is to reduce the overhead introduced by the parallelization.

At each iteration, which is one time step, the central coordinator has basically four tasks: (1) selection of the temperature; (2) generation of a starting city, partition of the entire TSP tour into subtours, creation of a processing element data structure for each subtour, and assignment of a subtour and a temperature value to each processor; (3) reconstruction of the new tour from the "better" subtours and computation of the cost of the new tour; and (4) performance of the global annealing. This section discusses these tasks in detail.

First of all, in Step A1, the central coordinator selects the annealing temperature at a given iteration. Two different annealing schedules, namely Equations 3.12 and 3.13 (repeated below as Equations 4.22 and 4.23, respectively), are considered for Algorithm A; they were extensively discussed in Section 3.2.2. The first annealing schedule, Equation 4.22, has been shown to provide the Simulated Annealing Algorithm with good solutions when computing the probability of accepting an uphill movement in cost [Mit85]. With this annealing schedule, the temperature is initially started at a reasonably high value and then held constant until the cost function has reached equilibrium, or the steady-state probabilities. Then, the temperature is lowered slowly according to Equation 4.22. However, it is impossible to determine exactly when the cost function has reached equilibrium. Instead of lowering the temperature at equilibrium, one can lower the temperature after some fixed number of iterations or after the number of accepted interchanges exceeds some pre-determined threshold. In Chapter 6, a computational study of the first alternative, i.e. lowering the temperature after some fixed number of iterations, is performed. Figure 4.7 plots temperature versus time for various values of c.

T_{k+1} = c T_k, or equivalently T_k = c^k T_0, for 0.9 \le c \le 0.99 and k = 0, 1, 2, ..., max_iterations    (4.22)

[Figure 4.7: Temperature versus time for T_k = c^k T_0 for different values of c at T_0 = 20.0.]

The second annealing schedule, Equation 4.23, provides convergence in probability to the set of optimum tours. However, the theory assumes that the algorithm is run for an infinite number of time steps. Practically, the algorithm cannot be run indefinitely; hence, only near-optimal solutions can be expected.

T_k = d / \log k, for d \ge L and k = 2, 3, ..., max_iterations    (4.23)

Figure 4.8 plots temperature versus time for various values of d.

[Figure 4.8: Temperature versus time for T_k = d/\log k for different values of d.]
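Both schedules are simple enough to state directly in code. The following is a minimal C sketch of the Step A1 temperature computation; the function names are illustrative only and are not the thesis software of Appendix A.

    #include <math.h>

    /* Geometric schedule, Equation 4.22: T(k+1) = c*T(k), i.e. T(k) = c^k * T0. */
    double temp_geometric(double t0, double c, int k)
    {
        return t0 * pow(c, (double)k);
    }

    /* Logarithmic schedule, Equation 4.23: T(k) = d / log(k), defined for k >= 2. */
    double temp_logarithmic(double d, int k)
    {
        return d / log((double)k);
    }

For example, with T_0 = 20.0 and c = 0.96, temp_geometric(20.0, 0.96, 100) is roughly 0.34, illustrating how quickly the geometric schedule cools relative to the logarithmic one.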
In Chapter 6, the results of a computational study of these annealing schedules are presented.

Secondly, the central coordinator randomly generates a starting city. Then, from this initial city, it partitions the tour \sigma into an equal number of subtours s, each of which is associated with a data structure. Along with the computed temperature value, the central coordinator assigns each available processor, or PE, a data structure.

Finally, the central coordinator performs Steps A5 and A6. It reconstructs the new tour \sigma' from the processor updates. It then computes the new cost of the tour and performs the global annealing process. If the central coordinator accepts the new tour, it updates its data structure and uses the accepted tour as the current tour; otherwise, the central coordinator keeps its old data structure and uses it for the next iteration. Whether the new tour is accepted or rejected, the algorithm is repeated until the stopping condition, i.e. a maximum number of iterations, is satisfied.

In summary, the central coordinator starts an iteration by selecting the temperature value, randomly generating a starting city, and partitioning the current TSP tour into subtours. It then delivers the temperature and the partitioned subtours to the Np processing elements in the system. Each of these processors is responsible for perturbing its subtour, computing the cost of the perturbed subtour, and performing the local annealing process as in Step A3. When each of these processors has finished its tasks on its subtour, the central coordinator reconstructs the new tour from these "better" subtours, computes its new cost, and performs the global annealing process. The central coordinator then repeats the iteration until the "maximum iterations" stopping criterion has been satisfied.

In this chapter, the overall framework of the Parallel Simulated Annealing Algorithm and the parallelization methodology were presented. Two different neighborhood structures were discussed. The candidate algorithm and its implementation issues were outlined and discussed. In the next chapter, the speedup of the Parallel Simulated Annealing Algorithm (Algorithm A) in Section 4.5.1 is analyzed.

CHAPTER 5
SPEEDUP ANALYSIS OF THE PARALLEL SIMULATED ANNEALING ALGORITHM

5.1 Introduction

In Chapters 2 and 3, the historical development and a mathematical model of the Classical Simulated Annealing Algorithm were covered. In the last chapter, a Parallel Simulated Annealing Algorithm was designed. In this chapter, a measure of the performance of the Parallel Simulated Annealing Algorithm (Algorithm A) is analyzed.

Generally, there are several objectives in parallel algorithm performance analysis. First, one would like to know the execution time of the algorithm. The real execution time depends on the details of the system on which the algorithm is run. If a model does not include all the details of a particular system, the execution time computed with respect to that model is only an approximate measure of the actual time. If the model includes parameters describing the system, then the impact of the system on algorithm performance can be studied.
In a parallel system environment, communication factors can also have a significant impact on algorithm performance. An algorithm analysis based on a model that includes this aspect is generally more accurate and useful than one that does not. One objective of this chapter is to show how communication can be incorporated into the analysis of parallel algorithms and into measures of algorithm performance. Another purpose of this chapter is to discover methods for improving algorithm performance by removing inefficiencies caused by communication overheads.

Intuitively, the degree of parallelism can be defined as some function of the number of processors in use at any given moment. Ideally, in a system with Np processors, it is desirable that the degree of parallelism always be Np, or as close to Np as possible. This degree of parallelism, known as speedup, is a measure of parallel algorithm performance. Though various definitions of speedup exist [Poly86], [Crane86], [Eag89], the speedup of Algorithm A in Section 4.5.1 with the software system model in Figure 4.2 is in general defined as

S = (T_1 I_1) / (T_{N_p} I_{N_p})    (5.1)

where

- T_1 is the execution time of Algorithm A from Steps A1 to A7 per iteration using 1 processor.
- I_1 is the number of serial iterations required for Algorithm A to converge to some desired cost.
- T_{N_p} is the execution time of Algorithm A from Steps A1 to A7 per iteration using Np processors.
- I_{N_p} is the number of parallel iterations required for Algorithm A to converge to some desired cost.

Although they may differ in practice, I_1 and I_{N_p} are assumed to be the same throughout the analysis of this chapter. Thus, the speedup definition of Equation 5.1 can be reduced to Equation 5.2, so that the analysis and discussion which follow can be done per iteration of Algorithm A:

S = T_1 / T_{N_p}    (5.2)

where

- T_1 is the execution time of Algorithm A per iteration using 1 processor.
- T_{N_p} is the execution time of Algorithm A per iteration using P subtours and Np processors.

From Section 4.5.1, it can be seen that Algorithm A consists principally of 7 steps. The parallelization scheme designed in this thesis is Step 3; all other steps are essentially sequential. In Section 5.2, the speedup of independent subtours (Step 3A or Step 3C) is analyzed. Interprocessor communication is discussed in Section 5.3. The speedup analysis of interprocessor communication (Step 3B) is investigated in Section 5.4. In Section 5.5, the speedup of Step 3 of Algorithm A is examined, and general bounds on the speedup of Algorithm A are given in Section 5.6.

5.2 Speedup Analysis of Independent Subtours

In this section, the speedup of independent subtours (Step 3A or Step 3C) is analyzed. Let d_ij denote the execution time of a distance calculation between city i and city j, and let \epsilon denote the execution time of an s(i) = s'(i) or s'(i) = s(i) operation for some integer i. For each iteration of Step 3A or Step 3C of Algorithm A in Section 4.5.1, the following 4 steps are performed:

1. Generate a perturbed subtour s'.
2. Compute the current subtour cost C(s).
3. Compute the perturbed subtour cost C(s').
4. Perform the local annealing.
Since Steps 1 and 4 each take time (N/P)\epsilon and Steps 2 and 3 each take time (N/P - 1)d_{ij}, the execution time of Step 3A for one iteration is 2(N/P - 1)d_{ij} + 2(N/P)\epsilon. Thus, the total execution time t_A to compute Step 3A of Algorithm A for (N/P) iterations is in general (N/P)[2(N/P - 1)d_{ij} + 2(N/P)\epsilon], or

t_A = 2 (N/P)^2 (d_{ij} + \epsilon) - 2 (N/P) d_{ij}    (5.3)

Assuming that

(N/P)(d_{ij} + \epsilon) >> d_{ij}    (5.4)

the total execution time to compute an independent subtour (Step 3A or Step 3C of Algorithm A) is

t_A \approx 2 (N/P)^2 (d_{ij} + \epsilon) = (N/P)^2 \gamma, \quad \gamma = 2(d_{ij} + \epsilon)    (5.5)

The execution time of Step 3A or Step 3C using P subtours and 1 processor is

T_1^{3A} = T_1^{3C} = P t_A    (5.6)

Before we consider the total execution time of Step 3A or Step 3C using P subtours and Np processors, let us examine how processors are allocated to ready subtours and how they become idle when no subtour is available.

There are generally two cases of processor allocation and two cases of idle processors [Poly86], [Crane86], [Eag89]: an unlimited number of processors, where the number of processors is greater than or equal to the number of subtours, i.e. Np >= P; and a limited number of processors, where the number of processors is less than the number of subtours, i.e. Np < P. Let Np be the number of processors, P the number of subtours, I(Np) the average processor idle time, and let t_A be defined as in Equation 5.5.

For an unlimited number of processors (Np >= P), we can allocate as many processors to the subtours as we please. The P ready subtours are always kept busy by P allocated processors, and the execution time is just t_A. In this case there are (Np - P) idle processors, and the average processor idle time I(Np) is ((Np - P)/Np) t_A.

Unlike the unlimited case, the limited case of processor allocation (Np < P) is assumed to work as follows. Let s_j denote the jth subtour, j = 1, ..., P, and let PE_i denote the ith processor, where i = (j + Np) % Np and % denotes the modulus operator, which yields only the remainder of the two numbers, e.g. 10 % 3 = 1. So each processor PE_i is assigned the jth subtour in Figure 4.2. In other words, s_1 is assigned to PE_1, s_2 to PE_2, ..., s_{Np} to PE_{Np}, s_{Np+1} to PE_1, s_{Np+2} to PE_2, ..., until all subtours have been executed. In this case, Np processors must keep P subtours busy; the last (P + Np) % Np subtours are kept busy by the first (P + Np) % Np processors, and (Np - (P + Np) % Np) processors are idle. Thus, the average processor idle time I(Np) is ((Np - (P + Np) % Np)/Np) t_A. For the worst idle-processor case, where there are at most (Np - 1) idle processors, the average processor idle time I(Np) is ((Np - 1)/Np) t_A for all Np.

From the above discussion, the total execution time of Step 3A or Step 3C using P subtours and Np processors, taking the idle processor time I(Np) into consideration, is in general

T_{N_p}^{3A} = T_{N_p}^{3C} = \begin{cases} (P/N_p) t_A + ((N_p - (P + N_p)%N_p)/N_p) t_A, & N_p < P \\ t_A + ((N_p - P)/N_p) t_A, & N_p \ge P \end{cases}    (5.7)

or equivalently,

T_{N_p}^{3A} = T_{N_p}^{3C} = \begin{cases} ((P + N_p - (P + N_p)%N_p)/N_p) t_A, & N_p < P \\ ((2N_p - P)/N_p) t_A, & N_p \ge P \end{cases}    (5.8)

and the corresponding speedup for an arbitrary number of idle processors is

S^{3A} = S^{3C} = \begin{cases} P N_p / (P + N_p - (P + N_p)%N_p), & N_p < P \\ P N_p / (2N_p - P), & N_p \ge P \end{cases}    (5.9)
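As a quick check of Equations 5.5, 5.8, and 5.9, the small C sketch below evaluates the Step 3A speedup for a limited-processor case. The function names and parameter values are illustrative assumptions only, not part of the thesis software.

    #include <stdio.h>

    /* Equation 5.5: execution time of one independent subtour, t_A = (N/P)^2 * gamma. */
    double t_a(double n, double p, double gamma)
    {
        return (n / p) * (n / p) * gamma;
    }

    /* Equation 5.8: execution time of Step 3A (or 3C) on Np processors,
       including average idle time, using the modulus-based allocation above. */
    double t_3a_np(int p, int np, double ta)
    {
        if (np < p)
            return (double)(p + np - (p + np) % np) / np * ta;
        else
            return (double)(2 * np - p) / np * ta;
    }

    int main(void)
    {
        int np = 4, p = 10;                                 /* illustrative values only */
        double ta = t_a(50.0, (double)p, 1.0);
        double s3a = (double)p * ta / t_3a_np(p, np, ta);   /* Equation 5.9 */
        printf("S3A = %.3f with Np = %d\n", s3a, np);
        return 0;
    }

For P = 10 subtours on Np = 4 processors this prints a speedup of about 3.33, which stays below Np, as the observations below indicate it must.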
If there are at most (Np - 1) idle processors, the speedup in Equation 5.9 for Step 3A or Step 3C becomes

S^{3A} = S^{3C} = \begin{cases} P N_p / (P + N_p - 1), & N_p < P \\ P N_p / (2N_p - 1), & N_p \ge P \end{cases}    (5.10)

Let P be defined as follows,

P = \begin{cases} k N_p, & N_p < P, \forall k > 1 \\ N_p / m, & N_p \ge P, \forall m \ge 1 \end{cases}    (5.11)

Substituting P of Equation 5.11 into Equation 5.9, the following relation results,

S^{3A} = S^{3C} = \begin{cases} k N_p / (k + 1), & N_p < P \text{ or } P = k N_p, \forall k > 1 \\ N_p / (2m - 1), & N_p \ge P \text{ or } P = N_p/m, \forall m \ge 1 \end{cases}    (5.12)

Similarly, for the worst idle-processor case, with at most (Np - 1) idle processors, Equation 5.10 becomes

S^{3A} = S^{3C} = \begin{cases} k N_p^2 / ((k + 1) N_p - 1), & N_p < P \text{ or } P = k N_p, \forall k > 1 \\ N_p^2 / ((2N_p - 1) m), & N_p \ge P \text{ or } P = N_p/m, \forall m \ge 1 \end{cases}    (5.13)

Let us make a few observations here.

Observation 5.2.1: If P >> Np, or k is large in Equation 5.12, the speedup of Step 3A or Step 3C cannot be greater than Np and asymptotically approaches Np. Thus, there is little benefit in partitioning the number of subtours P much beyond the number of available processors Np in the system.

Observation 5.2.2: If Np >> P, or m is large, the speedup of Step 3A or Step 3C degrades somewhat and may decrease to a value less than 1. In fact, from Equation 5.9, in the limit as Np approaches infinity, S^{3A} approaches P/2; if we process Algorithm A sequentially, i.e. P = 1, then the speedup is only 0.5, which is worse than executing a sequential program on a 1-processor system. The consequence of this result proves to be very useful when we consider the sequential steps of Algorithm A in a later section. As in Observation 5.2.1, there is very little or no benefit in allocating a number of processors Np much beyond the number of partitioned subtours P.

Observation 5.2.3: For the special case of m = 1, or Np = P, in Equation 5.12, the speedup S^{3A} = Np results, as one would expect for Np independent subtours executing on an Np-processor system. As we can see, the maximum speedup occurs when m = 1, or Np = P.

Observation 5.2.4: For the worst idle-processor case, where there are (Np - 1) idle processors, the speedup is at least that given by Equation 5.10 or Equation 5.13. Note that the speedup of the worst idle-processor case is always equal to or less than the speedup for an arbitrary number of idle processors, i.e. Equation 5.10 is equal to or less than Equation 5.9, and Equation 5.13 is equal to or less than Equation 5.12.

Thus far, we have discussed only Step 3A and Step 3C of Algorithm A, with little said about Step 3B, because the two former steps process subtours independently whereas the latter step is one in which subtours "communicate" with one another. Step 3B is therefore the subject of the next two sections.

5.3 Interprocessor Communication

In the last section, the speedup of independent subtours was analyzed. In this section, interprocessor communication is examined, which provides some useful results for the speedup analysis of the next section. Interprocessor communication is the process of information or data exchange between two or more processors. Two types of interprocessor communication are distinguished: message communication, where processors exchange message information, and data communication, during which one processor receives data that it needs from other processors.
Both types of interprocessor communication are significant because they are reflected as overheads in the system network and in the total execution time of a program, respectively. In Algorithm A, interprocessor communication is essentially Step 3B, where subtours, or processors, exchange cities with one another.

Consider the 2-processor system with interprocessor communication in Figure 5.1. Assume that PE_i wants to exchange a city with PE_j. It first reserves a city in its own subtour, say city A. It then requests an available city from PE_j, say city B, and computes the new cost including city B. If PE_i accepts city B as an exchange, it sends a message to PE_j asking it to compute the cost of its subtour including city A; otherwise, PE_i sends a message to PE_j stating that it does not accept city B as an exchange. Suppose PE_i accepts city B but PE_j does not accept city A; then PE_i cannot take city B away from PE_j. Thus, if either PE_i or PE_j does not accept the exchange, then neither can PE_i take city B from PE_j nor can PE_j take city A from PE_i. Hence, both processors PE_i and PE_j need to come to a mutual agreement when exchanging the cities in their corresponding subtours. In an Np-processor system, processors communicate analogously.

[Figure 5.1: A 2-Processor System with Interprocessor Communication.]

In the next subsection, message communication [Horo81] is analyzed; in the following subsection, data communication [Poly86] is examined.

5.3.1 Message Communication

Message communication is a key overhead factor that influences parallel system performance. There is a methodology for measuring this message communication, called message capacity. The rationale for this message capacity is to capture the phenomena which lead to congestion and competition in a network-based parallel processor. Using this metric, one can examine how well a network supports multiple message transfers between arbitrary processors in a parallel machine. The "message capacity" is based on an examination of message density along each of the interconnection paths between processors in the machine, and it gives a figure of merit for the message capacity of the network. This measure is an extension of the ideas in Horowitz and Zorat [Horo81], who use the number of communication paths for which each processor in a network is responsible as a measure of the communication overhead in the system.

For our application, a message is defined as a transmission or reception of a city between processor i and processor j. Let Ms be the total number of messages that processor i sends to other processors, i.e. one to every other processor; let Mr be the total number of messages that processor i receives from other processors, i.e. one from every other processor; let Np denote the number of available processors in the parallel system; let P denote the number of subtours in the TSP; let N denote the number of cities in the TSP tour; and let Ns be the number of messages which processor i sends to or receives from every other processor, where Ns = N/P. Assuming that a given TSP tour is partitioned into Np subtours, i.e. Np = P, and these subtours are executed concurrently on an Np-processor system, then there are

- Ms = (Np - 1)Ns messages sent by processor i to every other processor, and
- Mr = (Np - 1)Ns messages received by processor i from every other processor,

since each processor is responsible for transmitting or receiving the Ns cities, or messages, in a subtour. Thus, processor i exchanges (sends and receives)

Ms + Mr = 2(Np - 1)Ns messages    (5.13)

with other processors. But there are Np such processors. Thus, the total number of messages exchanged in the interconnection network of an Np-processor system at a particular iteration is

2 Np (Np - 1) Ns = 2 N (Np - 1) messages    (5.14)

and, on the average, a processor handles, on the bus, at each iteration,

2 N (Np - 1) / Np = O(N) messages    (5.15)

To illustrate, consider the case of the 2-processor system in Figure 5.1. It can easily be seen that there are a total of 16 message transmissions in the system, and on the average, each processor handles 8 messages, i.e. 4 transmissions and 4 receptions.

5.3.2 Data Communication

In the last subsection, we saw how many messages a processor handles, on the average, at a particular iteration. In this subsection, we examine the execution time due to the communication overhead factor. Let us assume that we are dealing with all cases of processor allocation, so that the P subtours are not necessarily equal in number to the Np available processors. Of course, different subtours will be executed by different processors during the parallel execution of Algorithm A, especially Step 3.

Let us again consider the problem of interprocessor communication for the case of a 2-processor system and 2 subtours in Figure 5.1, and review how interprocessor communication takes place between two subtours s_i and s_j. Let subtour s_i be assigned to processor i, and subtour s_j to processor j. Again, interprocessor communication between processor i and processor j takes place if processor i and processor j mutually agree to exchange a city in their corresponding subtours. Therefore, interprocessor communication involves explicit transmission or reception of cities between processors, and of course, the number of cities that processor i sends to processor j is equal to the number of cities that processor j receives from processor i.

Let us define the communication unit \tau to be the execution time it takes to exchange (to transmit and to receive) a city between two subtours or processors, i.e. the execution time of 2 messages. Let Ns be the cardinality of the subtour, or the number of cities in the subtour, and let w be the communication weight between processor i and processor j. Then the time spent on communication during the concurrent execution of Algorithm A with 2 subtours on 2 processors is w, which is equal to \tau Ns, or \tau(N/P), where N is the number of cities in the TSP and P is the number of subtours. Thus, for 2 subtours i and j executing on a 2-processor system, the execution time of subtour i by processor i when exchanging Ns cities (transmitting to and receiving from) with processor j is (t_i + \tau(N/P)). For P subtours executing on a 1-processor system, which is a simulated environment, the execution time due to communication overhead of the ith subtour when exchanging Ns cities with the other (P - 1) subtours is

(P - 1) w units of time    (5.16)
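To make the message bookkeeping of Equations 5.13 through 5.15 concrete, the short C sketch below recomputes the counts for the 2-processor illustration. The assumption Ns = 4 cities per subtour is an illustrative value chosen so that the totals match the figures quoted above; the code is not part of the thesis software.

    #include <stdio.h>

    int main(void)
    {
        int np = 2;   /* processors, with Np = P assumed as in Section 5.3.1 */
        int ns = 4;   /* cities per subtour -- assumed value for the example */

        int per_proc = 2 * (np - 1) * ns;      /* Equation 5.13: Ms + Mr            */
        int total    = 2 * np * (np - 1) * ns; /* Equation 5.14: network-wide total */
        double avg   = (double)total / np;     /* Equation 5.15: average, O(N)      */

        printf("per processor = %d, total = %d, average = %.1f\n",
               per_proc, total, avg);
        return 0;
    }

With these values the program reproduces the 16 total message transmissions and the 8 messages handled per processor cited for the 2-processor system.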
Thus, the total execution time due to communication overhead for exchanging Ns cities among P subtours when using 1 processor is

O_T = P(P - 1) w = N \tau (P - 1) units of time    (5.17)

and, on the average, the execution time due to communication overhead for exchanging Ns cities among P subtours when executing P subtours on an Np-processor system is

O_a = N \tau (P - 1) / N_p units of time    (5.18)

To illustrate, consider the 4-subtour system with interprocessor communication shown in Figure 5.2. The total execution time for exchanging Ns cities among 4 subtours when using only 1 processor is 12w, or 3N\tau, and the average execution time for exchanging Ns cities among 4 subtours when using Np processors is 12w/Np, or 3N\tau/Np, for any Np; e.g. if Np = 4, then each processor, on the average, takes 3w or 0.75N\tau units of time to intercommunicate with the other processors.

[Figure 5.2: 4-Subtour System with Interprocessor Communication.]

Note that Equation 5.17 is the special case of Equation 5.18 with Np = 1. Note also that when using 1 processor we generally set P = 1, so that the communication overheads in Equations 5.16, 5.17, and 5.18 are equal to zero; thus, no interprocessor communication exists when executing a program sequentially. For the purpose of calculating the average communication overhead, P != 1 is assumed.

5.4 Speedup Analysis of Interprocessor Communication

In the last section, communication overheads were analyzed. Let us incorporate the result of Equation 5.18 into our analysis of the speedup of Step 3B of Algorithm A. As can be seen from Figure 5.1 of a 2-processor system with interprocessor communication, the execution of Step 3B follows the same steps as the execution of Step 3A or Step 3C except for the communication factor, namely the Citywise Exchange between 2 subtours or processors. Thus, the execution time of Step 3B of Algorithm A on an Np-processor system is just the execution time of Step 3A plus the average communication overhead, as shown in Equation 5.19,

t_B = t_A + O_a    (5.19)

or equivalently,

t_B = t_A + N \tau (P - 1) / N_p    (5.20)

Let us consider the speedup of Step 3B for an arbitrary number of idle processors. For P subtours, the sequential execution time of Step 3B is

T_1^{3B} = P t_A    (5.21)

and the execution time of P subtours on an Np-processor system in general takes the following form,

T_{N_p}^{3B} = \begin{cases} ((P + N_p - (P + N_p)%N_p)/N_p) t_A + O_a, & N_p < P \\ ((2N_p - P)/N_p) t_A + O_a, & N_p \ge P \end{cases}    (5.22)

or equivalently,

T_{N_p}^{3B} = \begin{cases} [(P + N_p - (P + N_p)%N_p) t_A + N\tau(P - 1)] / N_p, & N_p < P \\ [(2N_p - P) t_A + N\tau(P - 1)] / N_p, & N_p \ge P \end{cases}    (5.23)

Note that if Np = P, so that there is no idle processor, Equation 5.23 reduces to Equation 5.20. The speedup of Step 3B for an arbitrary number of idle processors is of the following form,

S^{3B} = \begin{cases} P N_p t_A / [(P + N_p - (P + N_p)%N_p) t_A + N\tau(P - 1)], & N_p < P \\ P N_p t_A / [(2N_p - P) t_A + N\tau(P - 1)], & N_p \ge P \end{cases}    (5.24)

And, in the worst idle-processor case, where there are (Np - 1) idle processors, the speedup is

S^{3B} = \begin{cases} P N_p t_A / [(P + N_p - 1) t_A + N\tau(P - 1)], & N_p < P \\ P N_p t_A / [(2N_p - 1) t_A + N\tau(P - 1)], & N_p \ge P \end{cases}    (5.25)

Substituting Equation 5.5 for t_A into Equation 5.24, the following equation results,

S^{3B} = \begin{cases} \gamma P N_p N / [\gamma (P + N_p - (P + N_p)%N_p) N + \tau P^2 (P - 1)], & N_p < P \\ \gamma P N_p N / [\gamma (2N_p - P) N + \tau P^2 (P - 1)], & N_p \ge P \end{cases}    (5.26)

Let us define N as follows,

N = N_s P = \begin{cases} k N_s N_p, & N_p < P, \forall k > 1 \\ N_s N_p / m, & N_p \ge P, \forall m \ge 1 \end{cases}    (5.27)

and substituting it and Equation 5.11 into Equation 5.26, we have the following speedup equation,
S^{3B} = \begin{cases} \gamma k N_s N_p / [\gamma (k + 1) N_s + \tau k (k N_p - 1)], & N_p < P, \ P = k N_p, \forall k > 1 \\ \gamma m N_s N_p / [\gamma m (2m - 1) N_s + \tau (N_p - m)], & N_p \ge P, \ P = N_p/m, \forall m \ge 1 \end{cases}    (5.28)

Similarly, for the worst idle-processor case with (Np - 1) idle processors, substituting t_A into Equation 5.25 gives

S^{3B} = \begin{cases} \gamma P N_p N / [\gamma (P + N_p - 1) N + \tau P^2 (P - 1)], & N_p < P \\ \gamma P N_p N / [\gamma (2N_p - 1) N + \tau P^2 (P - 1)], & N_p \ge P \end{cases}    (5.29)

and substituting Equations 5.11 and 5.27 into Equation 5.29, we have the following relation,

S^{3B} = \begin{cases} \gamma k N_s N_p^2 / [\gamma ((k + 1) N_p - 1) N_s + \tau k N_p (k N_p - 1)], & P = k N_p, \forall k > 1 \\ \gamma m N_s N_p^2 / [\gamma m^2 (2N_p - 1) N_s + \tau N_p (N_p - m)], & P = N_p/m, \forall m \ge 1 \end{cases}    (5.30)

A few key points can be observed from the above equations.

Observation 5.4.1: In the limit as N goes to infinity, Equation 5.26 asymptotically approaches Equation 5.31, which is the same as Equation 5.9 for P independent subtours executing on an Np-processor system. Note also that Equation 5.31 is independent of the \tau term, the communication overhead factor.

\lim_{N \to \infty} S^{3B} = \begin{cases} P N_p / (P + N_p - (P + N_p)%N_p), & N_p < P \\ P N_p / (2N_p - P), & N_p \ge P \end{cases}    (5.31)

Observation 5.4.2: In the limit as the number of cities in a subtour Ns goes to infinity, the speedup of Equation 5.28 approaches a linear function of the number of processors, as shown in Equation 5.32. Note that if we substitute Equation 5.11 into Equation 5.26 and take the limit as N goes to infinity, we obtain the same speedup result, i.e. Equation 5.32, which is equivalent to Equation 5.12, and Observations 5.2.1 through 5.2.3 also apply here. So, as the number of cities N or Ns increases, the effect of the communication overhead factor becomes less dominant, and as the number of cities approaches infinity, the speedup of Step 3B becomes independent of the communication overhead factor. This independence will be examined in some detail in a later section.

\lim_{N_s \to \infty} S^{3B} = \lim_{N \to \infty} S^{3B} = \begin{cases} k N_p / (k + 1), & P = k N_p, \forall k > 1 \\ N_p / (2m - 1), & P = N_p/m, \forall m \ge 1 \end{cases}    (5.32)

Observation 5.4.3: For (Np - 1) idle processors, in the limit as Ns approaches infinity in Equation 5.30, the speedup of Step 3B approaches Equation 5.33, which is equivalent to Equation 5.13. Note that if we substitute Equation 5.11 into Equation 5.29 and take the limit as N approaches infinity, we obtain the same result. Of course, Observation 5.2.4 applies here as well.

\lim_{N_s \to \infty} S^{3B} = \lim_{N \to \infty} S^{3B} = \begin{cases} k N_p^2 / ((k + 1) N_p - 1), & P = k N_p, \forall k > 1 \\ N_p^2 / ((2N_p - 1) m), & P = N_p/m, \forall m \ge 1 \end{cases}    (5.33)

5.5 Speedup Analysis of Step 3 of Algorithm A

In the last section, we analyzed the speedup of the interprocessor communication, namely Step 3B. In this section, we examine the speedup of Step 3 of Algorithm A. As usual, we begin with the execution time. The execution time of Step 3 consists of the execution times of Step 3A, Step 3B, and Step 3C. Since Step 3A is the same as Step 3C, the execution time of Step 3 comprises the execution time of Step 3B plus twice the execution time of Step 3A. Thus, the execution time of Step 3 of Algorithm A using 1 processor is

T_1^3 = 2 T_1^{3A} + T_1^{3B} = 3 P t_A    (5.34)

and the execution time of Step 3 using an Np-processor system is

T_{N_p}^3 = 2 T_{N_p}^{3A} + T_{N_p}^{3B}    (5.35)

or equivalently,

T_{N_p}^3 = \begin{cases} [3 (P + N_p - (P + N_p)%N_p) t_A + N\tau(P - 1)] / N_p, & N_p < P \\ [3 (2N_p - P) t_A + N\tau(P - 1)] / N_p, & N_p \ge P \end{cases}    (5.36)

and the speedup of Step 3 for an arbitrary number of idle processors is of the following form,

S^3 = \begin{cases} 3 P N_p t_A / [3 (P + N_p - (P + N_p)%N_p) t_A + N\tau(P - 1)], & N_p < P \\ 3 P N_p t_A / [3 (2N_p - P) t_A + N\tau(P - 1)], & N_p \ge P \end{cases}    (5.37)
In the worst idle-processor case, where there are (Np - 1) idle processors, the speedup of Step 3 becomes

S^3 = \begin{cases} 3 P N_p t_A / [3 (P + N_p - 1) t_A + N\tau(P - 1)], & N_p < P \\ 3 P N_p t_A / [3 (2N_p - 1) t_A + N\tau(P - 1)], & N_p \ge P \end{cases}    (5.38)

After substituting Equation 5.5 for t_A into Equation 5.37 and simplifying, the following speedup result for Step 3 is obtained,

S^3 = \begin{cases} 3\gamma P N_p N / [3\gamma (P + N_p - (P + N_p)%N_p) N + \tau P^2 (P - 1)], & N_p < P \\ 3\gamma P N_p N / [3\gamma (2N_p - P) N + \tau P^2 (P - 1)], & N_p \ge P \end{cases}    (5.39)

If we substitute Equations 5.11 and 5.27 into Equation 5.39, Equation 5.40 results,

S^3 = \begin{cases} 3\gamma k N_s N_p / [3\gamma (k + 1) N_s + \tau k (k N_p - 1)], & P = k N_p, \forall k > 1 \\ 3\gamma m N_s N_p / [3\gamma m (2m - 1) N_s + \tau (N_p - m)], & P = N_p/m, \forall m \ge 1 \end{cases}    (5.40)

Let us make a few observations about the above equations.

Observation 5.5.1: In the limit as the number of cities N in the TSP approaches infinity in Equation 5.39, the following speedup of Step 3 is obtained. Note that Equation 5.41 is exactly equivalent to Equations 5.9 and 5.31. Of course, Observation 5.4.1 can be made here as well.

\lim_{N \to \infty} S^3 = \begin{cases} P N_p / (P + N_p - (P + N_p)%N_p), & N_p < P \\ P N_p / (2N_p - P), & N_p \ge P \end{cases}    (5.41)

Observation 5.5.2: In the limit as the number of cities Ns in a subtour approaches infinity in Equation 5.40, the speedup of Equation 5.42 holds. Note that Equation 5.42 is also exactly equivalent to Equations 5.12 and 5.32. Similarly, Observation 5.4.2 is also applicable here.

\lim_{N_s \to \infty} S^3 = \lim_{N \to \infty} S^3 = \begin{cases} k N_p / (k + 1), & P = k N_p, \forall k > 1 \\ N_p / (2m - 1), & P = N_p/m, \forall m \ge 1 \end{cases}    (5.42)

Observation 5.5.3: From Equation 5.40, if m = 1, or Np = P, Equation 5.43 is obtained. Note that the communication overhead factor, the \tau term in the denominator, is not likely to be dominant in the speedup equation, a point which will be discussed in some detail when we analyze the speedup of the overall algorithm in the next section.

S^3 = 3\gamma N_s N_p / [3\gamma N_s + \tau (N_p - 1)]    (5.43)

Observation 5.5.4: If there were no communication overhead (\tau = 0), then S^3 = Np, the ideal speedup, which is consistent with our intuition for Np independent subtours executing on an Np-processor system.

Observation 5.5.5: From Equation 5.43, when the number of processors Np is large, the speedup of Step 3 is approximately

S^3 \approx 3\gamma N_s / \tau    (5.44)

Note that Equation 5.44 is essentially a function of the number of cities Ns in a subtour.

5.6 General Bounds on Speedup of Algorithm A

In the last section, we analyzed the speedup of the parallelization step of Algorithm A, namely Step 3. In this section, we examine the speedup of the overall algorithm. As mentioned earlier, Algorithm A consists principally of 7 steps, all of which are essentially sequential except Step 3. Let us consider the sequential steps. Let T_1^* denote the execution time of Steps 1 through 7, except 3, using 1 processor. Then

T_1^* = T_1^1 + T_1^2 + T_1^4 + T_1^5 + T_1^6 + T_1^7    (5.45)

and let T_{N_p}^* denote the execution time of Steps 1 through 7, except 3, when using Np processors. Then

T_{N_p}^* = T_1^* / \beta(N_p)    (5.46)

It is important to notice the assumption about \beta(N_p) in Equation 5.46, rather than Np, because Steps 1 through 7 except 3 are essentially serial code rather than parallel code. Generally, the speedup is not defined when a sequential program module is being executed by a 1-processor system.
In practice, we do not know exactly how the value of \beta(N_p) varies as a function of Np when executing the sequential code on an Np-processor system, because we would have to take other overhead problems, such as resource contention and bottlenecks, into consideration, and we do not know exactly how the scheduler of the system will assign each piece of sequential code to a processor. We can speculate that as the number of allocated processors increases, the number of idle processors increases. Let us examine what our mathematical model tells us about the value of \beta(N_p). Consider Equation 5.9. A given program being processed sequentially means that P = 1. Thus, only the case of unlimited processor allocation (Np >= P) in Equation 5.9 is applicable, and S^{3A} = N_p/(2N_p - 1) for all Np >= 1. As we can see, \beta(N_p) is equivalent to S^{3A}. If Np = 1, \beta(N_p) = 1, and as Np approaches infinity, \beta(N_p) asymptotically approaches 0.5. Thus,

\beta(N_p) = N_p / (2N_p - 1) and 0.5 < \beta(N_p) \le 1.0    (5.47)

Let us now consider the total of 7 steps of Algorithm A. The execution time of Algorithm A when using 1 processor is the sum of the parallel step and the sequential steps, namely,

T_1 = T_1^3 + T_1^* = 3 P t_A + T_1^*    (5.48)

and the execution time of Algorithm A when using Np processors is of the following form,

T_{N_p} = T_{N_p}^3 + T_{N_p}^*    (5.49)

or equivalently,

T_{N_p} = \begin{cases} [3\beta (P + N_p - (P + N_p)%N_p) t_A + \beta N\tau(P - 1) + T_1^* N_p] / (\beta N_p), & N_p < P \\ [3\beta (2N_p - P) t_A + \beta N\tau(P - 1) + T_1^* N_p] / (\beta N_p), & N_p \ge P \end{cases}    (5.50)

Thus, the speedup of Algorithm A for an arbitrary number of idle processors is

S = \begin{cases} [3\beta P N_p t_A + \beta T_1^* N_p] / [3\beta (P + N_p - (P + N_p)%N_p) t_A + \beta N\tau(P - 1) + T_1^* N_p], & N_p < P \\ [3\beta P N_p t_A + \beta T_1^* N_p] / [3\beta (2N_p - P) t_A + \beta N\tau(P - 1) + T_1^* N_p], & N_p \ge P \end{cases}    (5.51)

And, if there are at most (Np - 1) idle processors, the speedup is at least

S = \begin{cases} [3\beta P N_p t_A + \beta T_1^* N_p] / [3\beta (P + N_p - 1) t_A + \beta N\tau(P - 1) + T_1^* N_p], & N_p < P \\ [3\beta P N_p t_A + \beta T_1^* N_p] / [3\beta (2N_p - 1) t_A + \beta N\tau(P - 1) + T_1^* N_p], & N_p \ge P \end{cases}    (5.52)

Substituting t_A of Equation 5.5 into Equation 5.51, the speedup of Algorithm A for an arbitrary number of idle processors is obtained,

S = \begin{cases} [3\beta\gamma P N_p N^2 + \beta T_1^* P^2 N_p] / [3\beta\gamma (P + N_p - (P + N_p)%N_p) N^2 + \beta\tau P^2 (P - 1) N + T_1^* P^2 N_p], & N_p < P \\ [3\beta\gamma P N_p N^2 + \beta T_1^* P^2 N_p] / [3\beta\gamma (2N_p - P) N^2 + \beta\tau P^2 (P - 1) N + T_1^* P^2 N_p], & N_p \ge P \end{cases}    (5.53)

and substituting Equation 5.11 into Equation 5.53, the following equation results,

S = \begin{cases} [3\beta\gamma k N_p N^2 + \beta T_1^* k^2 N_p^2] / [3\beta\gamma (k + 1) N^2 + \beta\tau k^2 N_p (k N_p - 1) N + T_1^* k^2 N_p^2], & P = k N_p, \forall k > 1 \\ [3\beta\gamma m^2 N_p N^2 + \beta T_1^* m N_p^2] / [3\beta\gamma m^2 (2m - 1) N^2 + \beta\tau N_p (N_p - m) N + T_1^* m N_p^2], & P = N_p/m, \forall m \ge 1 \end{cases}    (5.54)

Let us make a few observations from the above equations.

Observation 5.6.1: In the limit as N goes to infinity in Equation 5.53, Equation 5.55 results. Note that Equation 5.55 is equivalent to Equations 5.9, 5.31, and 5.41, and the comments made there apply equally here.

\lim_{N \to \infty} S = \begin{cases} P N_p / (P + N_p - (P + N_p)%N_p), & N_p < P \\ P N_p / (2N_p - P), & N_p \ge P \end{cases}    (5.55)

Again, the speedup of Algorithm A is independent of the communication overhead in the limit as the number of cities in the TSP approaches infinity.

Observation 5.6.2: It is interesting to observe that the first term of both the numerator and the denominator in Equations 5.53 and 5.54 is a quadratic function of N, the number of cities in the TSP, whereas the communication overhead factor in the denominator (the term with \tau) is linear in N and quadratic in Np. Since the number of cities in the TSP is much greater than the number of processors, i.e.
N >> Np, in the limit as N approaches infinity the N^2 terms dominate in both the numerator and the denominator, and the speedup of Algorithm A is independent of the communication overhead term.

Observation 5.6.3: Consider Equation 5.56, which is the second case of Equation 5.54,

S = [3\beta\gamma m^2 N_p N^2 + \beta T_1^* m N_p^2] / [3\beta\gamma m^2 (2m - 1) N^2 + \beta\tau N_p (N_p - m) N + T_1^* m N_p^2], \quad P = N_p/m, \forall m \ge 1    (5.56)

and let m = 1, or Np = P, in Equation 5.56. Then Equation 5.57 results. As we can see, the effect of the communication overhead factor \beta\tau N_p (N_p - 1) N is to reduce the speedup.

S = [3\beta\gamma N_p N^2 + \beta T_1^* N_p^2] / [3\beta\gamma N^2 + \beta\tau N_p (N_p - 1) N + T_1^* N_p^2]    (5.57)

Observation 5.6.4: To see the above effect more clearly, suppose there is no communication overhead factor (\tau = 0). Then Equation 5.57 is further reduced to Equation 5.58. Obviously, Equation 5.58 is approximately a linear function of Np, as will be shown later.

S = [3\beta\gamma N_p N^2 + \beta T_1^* N_p^2] / [3\beta\gamma N^2 + T_1^* N_p^2]    (5.58)

In Observation 5.6.2 we mentioned that the speedup is a function of the square of the number of cities N, while the communication overhead factor is a linear function of N. However, we have not elaborated on where these terms come from and what they mean. Consider the execution time of Step 3A or Step 3C, namely t_A in Equation 5.5 or Equation 5.59, and the average communication overhead, namely O_a in Equation 5.18. As we can see, t_A is O(N^2) whereas O_a is O(N). So, as N approaches infinity, t_A increases much faster than O_a, and t_A thus dominates the speedup in Equation 5.51. Hence, in the limit as N approaches infinity, the speedup equation is independent of the communication overhead factor. What this means is that if there exists a reliable "stand-alone" computer with massive memory to store the data and extensive computer time to execute Algorithm A for a long period, then the speedup of Algorithm A approaches an ideal value; otherwise, the speedup is affected by the communication overhead factor \beta\tau N_p (N_p - 1) N.

One may wonder about the quadratic term in Np in the communication overhead factor. Let us examine this by substituting N = N_s P into Equation 5.5 for t_A and into Equation 5.54 for the speedup of Algorithm A. We thus have the two following equations,

t_A = \gamma N_s^2    (5.59)

and

S = [3\beta\gamma m N_s^2 N_p + \beta T_1^* m^2] / [3\beta\gamma m (2m - 1) N_s^2 + \beta\tau N_s (N_p - m) + T_1^* m^2], \quad P = N_p/m, \forall m \ge 1    (5.60)

Let us further observe the above equations.

Observation 5.6.5: Notice that both t_A in Equation 5.59 and the speedup in Equation 5.60 are now quadratic in Ns, the number of cities in a subtour.

Observation 5.6.6: In the limit as Ns approaches infinity in Equation 5.60, the speedup asymptotically approaches Equation 5.61, which is equivalent to Equations 5.12, 5.32, and 5.42, and all previous observations about the characteristics of this equation are also applicable here. As N approaches infinity in Equation 5.54, Equation 5.61 also holds.

\lim_{N_s \to \infty} S = \lim_{N \to \infty} S = \begin{cases} k N_p / (k + 1), & P = k N_p, \forall k > 1 \\ N_p / (2m - 1), & P = N_p/m, \forall m \ge 1 \end{cases}    (5.61)

Observation 5.6.7: From Equation 5.60, we can see that the communication overhead factor is now linear rather than quadratic in Np. Let us examine the special case of Equation 5.60 where m = 1, or Np = P. Then Equation 5.60 reduces to Equation 5.62, with and without the communication overhead factor:

S = \begin{cases} [3\beta\gamma N_s^2 N_p + \beta T_1^*] / [3\beta\gamma N_s^2 + \beta\tau N_s (N_p - 1) + T_1^*], & \tau \ne 0 \\ [3\beta\gamma N_s^2 N_p + \beta T_1^*] / [3\beta\gamma N_s^2 + T_1^*], & \tau = 0 \end{cases}    (5.62)

Observation 5.6.8: If Np = 1 and \beta = 1, then the speedup is S = 1, as expected.
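As a quick numerical sanity check of Equation 5.62 as written above, the short C sketch below evaluates the \tau != 0 branch for a range of processor counts; the parameter values are illustrative assumptions only, and the function name is hypothetical. With Np = 1 and \beta = 1 it returns 1, consistent with Observation 5.6.8, and as Np grows the value saturates rather than climbing without bound.

    #include <stdio.h>

    /* Equation 5.62 (tau != 0 branch), for the special case m = 1, i.e. Np = P. */
    double speedup(double np, double beta, double gamma, double tau,
                   double ns, double t1s)
    {
        double num = 3.0 * beta * gamma * ns * ns * np + beta * t1s;
        double den = 3.0 * beta * gamma * ns * ns
                   + beta * tau * ns * (np - 1.0) + t1s;
        return num / den;
    }

    int main(void)
    {
        double gamma = 2.0, tau = 1.0, ns = 5.0, t1s = 10.0;  /* assumed values */
        int np;
        for (np = 1; np <= 64; np *= 2) {
            double beta = (double)np / (2.0 * np - 1.0);       /* Equation 5.47 */
            printf("Np = %2d  S = %.2f\n", np, speedup(np, beta, gamma, tau, ns, t1s));
        }
        return 0;
    }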
Observation 5.6.9: If Np is large, then the speedup S is approximately equal to 3\gamma N_s / \tau, which is equivalent to Equation 5.44.

Observation 5.6.10: As we can see from Equation 5.62, the effect of the communication overhead factor \tau N_s (N_p - 1) is to degrade the speedup of Algorithm A.

Observation 5.6.11(a): For 0.5 \le \beta \le 1.0, \tau \ne 0, and an arbitrary number of idle processors, the bounds on the speedup of Algorithm A are

[1.5\gamma N_s^2 N_p + 0.5 T_1^*] / [1.5\gamma N_s^2 + 0.5\tau N_s (N_p - 1) + T_1^*] \le S \le [3\gamma N_s^2 N_p + T_1^*] / [3\gamma N_s^2 + \tau N_s (N_p - 1) + T_1^*], \quad \tau \ne 0    (5.63)

Observation 5.6.11(b): Similarly, for 0.5 \le \beta \le 1.0, \tau = 0, and an arbitrary number of idle processors, the bounds on the speedup of Algorithm A are of the following form,

[1.5\gamma N_s^2 N_p + 0.5 T_1^*] / [1.5\gamma N_s^2 + T_1^*] \le S \le [3\gamma N_s^2 N_p + T_1^*] / [3\gamma N_s^2 + T_1^*], \quad \tau = 0    (5.64)

Observation 5.6.11(c): If we substitute Equation 5.47 for \beta(N_p) into Equation 5.62, Equation 5.65 results. Note that Equation 5.65 always lies on or within the bounds of Equations 5.63 and 5.64, respectively.

S = \begin{cases} [3\gamma N_s^2 N_p^2 + T_1^* N_p] / [3\gamma N_s^2 N_p + \tau N_s N_p (N_p - 1) + T_1^* (2N_p - 1)], & \tau \ne 0 \\ [3\gamma N_s^2 N_p^2 + T_1^* N_p] / [3\gamma N_s^2 N_p + T_1^* (2N_p - 1)], & \tau = 0 \end{cases}    (5.65)

Observation 5.6.11(d): In comparison, we can see that the communication overhead factor has shifted the bounds on the speedup of Algorithm A to a different range, from the range in Equation 5.63 to the range in Equation 5.64. Of course, the range in Equation 5.63 is strictly less than the range in Equation 5.64 for all \tau \ne 0.

Observation 5.6.11(e): As one would expect, if there is no communication overhead (\tau = 0), the speedup of Algorithm A in Equations 5.62 and 5.64 is a linear function of Np.

Observation 5.6.12: As Ns approaches infinity in Equations 5.63, 5.64, and 5.65, the speedup approaches Np, the ideal speedup.

Observation 5.6.13: If no sequential part of Algorithm A is involved, i.e. T_1^* = 0, the speedup of Equation 5.64 reduces to S = Np, the ideal speedup.

Let us now consider the bounds on speedup for the worst idle-processor case, where there are (Np - 1) idle processors. Substituting Equation 5.59 for t_A into Equation 5.52, after some simplification, the following relation is obtained,

S = \begin{cases} [3\beta\gamma k N_s^2 N_p^2 + \beta T_1^* N_p] / [3\beta\gamma ((k + 1) N_p - 1) N_s^2 + \beta\tau k N_s N_p (k N_p - 1) + T_1^* N_p], & P = k N_p, \forall k > 1 \\ [3\beta\gamma m N_s^2 N_p^2 + \beta T_1^* m^2 N_p] / [3\beta\gamma m^2 (2N_p - 1) N_s^2 + \beta\tau N_s N_p (N_p - m) + T_1^* m^2 N_p], & P = N_p/m, \forall m \ge 1 \end{cases}    (5.66)

As usual, let us consider the special case of processor allocation where m = 1, or Np = P. Thus, Equation 5.66 reduces to Equation 5.67, with and without the communication overhead factor:

S = \begin{cases} [3\beta\gamma N_s^2 N_p^2 + \beta T_1^* N_p] / [3\beta\gamma (2N_p - 1) N_s^2 + \beta\tau N_s N_p (N_p - 1) + T_1^* N_p], & \tau \ne 0 \\ [3\beta\gamma N_s^2 N_p^2 + \beta T_1^* N_p] / [3\beta\gamma (2N_p - 1) N_s^2 + T_1^* N_p], & \tau = 0 \end{cases}    (5.67)

Observation 5.6.14(a): From Equation 5.67, for 0.5 \le \beta \le 1.0, \tau \ne 0, and (Np - 1) idle processors, the following bounds on the speedup of Algorithm A hold,

[1.5\gamma N_s^2 N_p^2 + 0.5 T_1^* N_p] / [1.5\gamma (2N_p - 1) N_s^2 + 0.5\tau N_s N_p (N_p - 1) + T_1^* N_p] \le S \le [3\gamma N_s^2 N_p^2 + T_1^* N_p] / [3\gamma (2N_p - 1) N_s^2 + \tau N_s N_p (N_p - 1) + T_1^* N_p], \quad \tau \ne 0    (5.68)

Observation 5.6.14(b): From Equation 5.68, for 0.5 \le \beta \le 1.0, \tau = 0, and (Np - 1) idle processors, the bounds on speedup in Equation 5.69 result,

[1.5\gamma N_s^2 N_p^2 + 0.5 T_1^* N_p] / [1.5\gamma (2N_p - 1) N_s^2 + T_1^* N_p] \le S \le [3\gamma N_s^2 N_p^2 + T_1^* N_p] / [3\gamma (2N_p - 1) N_s^2 + T_1^* N_p], \quad \tau = 0    (5.69)

Of course, most of the previous observations on the speedup of Algorithm A for an arbitrary number of idle processors apply to the worst case, where (Np - 1) processors are idle, as well. Let us now establish the general bounds within which all speedups of Algorithm A must lie.
It can easily be argued that the lower bound (the worst possible case) on the speedup of Algorithm A occurs where \beta = 0.5 and (Np - 1) processors are idle, and that the upper bound (the best possible case) occurs where \beta = 1.0 and no processor is idle. Thus, the lower bound on the speedup of Algorithm A is just the lower bound in Equation 5.68, and the upper bound is the upper bound in Equation 5.63. Hence, the general bounds on the speedup of Algorithm A are established as shown in Equation 5.70,

[1.5\gamma N_s^2 N_p^2 + 0.5 T_1^* N_p] / [1.5\gamma (2N_p - 1) N_s^2 + 0.5\tau N_s N_p (N_p - 1) + T_1^* N_p] \le S \le [3\gamma N_s^2 N_p + T_1^*] / [3\gamma N_s^2 + \tau N_s (N_p - 1) + T_1^*]    (5.70)

If there is no communication overhead, the general bounds in Equation 5.70 reduce to Equation 5.71,

[1.5\gamma N_s^2 N_p^2 + 0.5 T_1^* N_p] / [1.5\gamma (2N_p - 1) N_s^2 + T_1^* N_p] \le S \le [3\gamma N_s^2 N_p + T_1^*] / [3\gamma N_s^2 + T_1^*]    (5.71)

and if the execution times of both the communication overhead and the sequential parts of Algorithm A are not present, then the bounds on speedup in Equation 5.70 reduce to

N_p^2 / (2N_p - 1) \le S \le N_p, \quad \tau = 0 \text{ and } T_1^* = 0    (5.72)

As one would expect, if Np independent subtours are executed on an Np-processor system, the upper bound is the ideal speedup Np, whereas the lower bound is given by the first term in Equation 5.72. Note that as Np approaches infinity, the lower bound in Equation 5.72 grows like Np/2, i.e. half the ideal speedup, and the upper bound approaches infinity; if Np = 1, then the speedup is 1.

In this chapter, the speedup of the Parallel Simulated Annealing Algorithm (Algorithm A) was analyzed. Communication overheads were discussed as the principal factors which degrade the performance of the Parallel Simulated Annealing Algorithm. It was shown that each processor in an Np-processor system must handle on the average O(N) messages, and that the overhead due to data communication is O(N) or O(NpNs) units of time, i.e. Equation 5.18 or Equation 5.60. The speedup of Algorithm A is a linear function of the number of processors Np if there is no communication overhead (\tau = 0), and is equal to the ideal speedup Np if both the communication overhead and the sequential parts of Algorithm A are absent, i.e. Observation 5.6.13. General bounds on the speedup of Algorithm A were established in Equation 5.70.

In the next chapter, a computational study of the Synchronous Parallel Simulated Annealing Algorithm is conducted for the two different annealing schedules and for the two different neighborhood structures presented in Chapter 4. By varying certain parameters, the quality of solutions can be obtained and analyzed, and questions such as "Which annealing schedule or which neighborhood structure provides consistent quality of solutions for the TSP?" can be addressed.

CHAPTER 6
EMPIRICAL ANALYSIS OF THE PARALLEL SIMULATED ANNEALING ALGORITHM

6.1 Introduction

In Chapters 2 and 3, the underlying foundation of the Classical Simulated Annealing Algorithm was laid. In Chapter 4, the Parallel Simulated Annealing Algorithm was designed. In the last chapter, the speedup of the Parallel Simulated Annealing Algorithm was analyzed. In this chapter, computational results of the Parallel Simulated Annealing Algorithm are analyzed, discussed, compared, and evaluated.

Thus far, empirical results for the Classical Simulated Annealing Algorithm have been reported by a number of authors for various combinatorial optimization problems.
Most of these results focus on the quality of the final solutions and the corresponding running times obtained by solving a given TSP instance. However, empirical results for the Parallel Simulated Annealing Algorithm are still in their infancy. Thus, besides the work of Kim [Kim86], the computational results presented in this chapter represent one of the first attempts to investigate the performance of the Parallel Simulated Annealing Algorithm empirically.

The methodology under investigation is analyzed and discussed in Section 6.2. Since it is well known that the performance of any Simulated Annealing Algorithm is dependent upon the annealing schedule, two different annealing schedules discussed in Section 3.2.2, namely Equation 3.12 and Equation 3.13, are selected to examine the behavior of the algorithm. Thus far, a good numerical value for the constant c has not been reported. Therefore, for the annealing schedule in Equation 3.12, a specifically good choice for the value of c is determined by comparing the quality of solutions for different values of c. Similarly, for the annealing schedule in Equation 3.13, a specifically good choice for the value of d is determined by evaluating the quality of solutions for different values of d; this d is then compared with that of Kim. In this way, the results of the two annealing schedules can be systematically analyzed and evaluated in Section 6.3, and the question "How will different annealing schedules affect the behavior of the Parallel Simulated Annealing Algorithm?" can be addressed. From this analysis, the most effective annealing schedule is obtained.

In order to illustrate the assertion that the Parallel Simulated Annealing Algorithm is more powerful than Local Optimization, an assertion which was discussed in Section 2.2 and repeated throughout this thesis, an experimental study is conducted and illustrated in Section 6.4.

Since it is equally well known that neighborhood structures have major impacts on the performance of the Simulated Annealing Algorithm, two specific neighborhood structures, namely Citywise Exchange and Edgewise Exchange, are selected to investigate the performance of the algorithm; the qualities of solutions of these structures are compared and evaluated in Section 6.5. In this way, questions such as "How much effect does the neighborhood structure have on the overall performance of the Parallel Simulated Annealing Algorithm?" can be addressed.

6.2 Analysis Methodology

The computational study of the performance of the Synchronous Parallel Simulated Annealing Algorithm (Algorithm A) is measured in terms of the quality of the final solutions. The results are presented in Sections 6.3 through 6.5. The purpose of the experiments is to provide useful computational results on the quality of solutions and running times. Since the Parallel Simulated Annealing Algorithm is a probabilistic hill-climbing algorithm, the quality of solutions is measured by sampling C(\sigma^n)_B at regular intervals (maximum iterations / number of samples to be taken) over a number of runs with different random seeds for the random number generator. In this way, the average-case performance analysis, or ensemble averaging, discussed in Chapter 3 can be ideally investigated.

Let C_opt be the cost value of the globally minimal configuration; then the landmark paper of Beardwood et al.
[Beard59] has shown that

\lim_{N \to \infty} C_{opt} = \theta N \sqrt{N}    (6.1)

where \theta is an unknown constant (numerical estimates by means of Monte Carlo experiments yield \theta = 0.749) and N is the number of cities in the tour. As was seen in Chapter 3, the quality of the final solution is defined as the difference in cost value between the final, or minimum, configuration and the globally minimal configuration. However, Equation 6.1 is applicable only for N near infinity, and we do not really know the value of C_opt when N is small. Thus, in this chapter, let us define the quality of the final solution to be the minimum value of the best cost, C_min = min{C(\sigma^n)_B} for some n. Note that n is the iteration at which the minimum cost occurs during one run of Algorithm A. These minimum values and iterations are tabulated in Tables 6.1 and 6.2.

The software for Algorithm A is written in "C", provided in Appendix A, and is simulated on a uniprocessor of the FTPP, which consists of 16 32-bit Motorola 68020 processors at 25 MHz with Motorola 68881 Floating Point Coprocessors and other important features described in Section 4.2. All floating point operations, namely the computations of distances and the calculations of temperatures, are performed by the coprocessor. The execution of the program follows the steps of Algorithm A as outlined in Section 4.5.1. Some of the properties of the software are discussed below, and the reader is encouraged to refer to Appendix A for a more detailed understanding.

The inputs to the program are the number of cities, num_nodes, which is equivalent to N; the number of subtours, npe, which is equivalent to P; the maximum number of iterations, max_ito; a random seed for a map, mapseed; a random seed for the algorithm, seed; the number of runs of the algorithm, num_runs; and the number of samples, sample_pts. Another input variable, depth, is for the annealing schedule, where depth can be used either for the variable c in Equation 3.12 or for the variable d in Equation 3.13, as appropriate. At any given iteration, the outputs of the program are the temperature value T, the perturbed cost C(\sigma^n), the best cost C(\sigma^n)_B, the perturbed tour \sigma^n, and the best tour (\sigma^n)_B. At the end of each experiment, the average values of C(\sigma^n) and C(\sigma^n)_B and their standard deviations over the number of runs can also be output.
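The heart of the per-processor work in Step A3 is the acceptance test of Equations 4.15 and 4.16. The fragment below is a minimal illustrative sketch of that test in C; it is not the thesis code of Appendix A, and the function and variable names are assumed for illustration.

    #include <math.h>
    #include <stdlib.h>

    /* Local annealing acceptance test of Equations 4.15-4.16:
       accept a perturbed subtour if it is downhill, or uphill with
       probability exp(-delta_c / t).  Returns 1 on acceptance. */
    int accept_move(double delta_c, double t)
    {
        double r = (double)rand() / (double)RAND_MAX;   /* uniform on [0,1] */
        return (delta_c < 0.0) || (r < exp(-delta_c / t));
    }

A processor would call accept_move(cost_perturbed - cost_current, T) once per attempted interchange; on acceptance it keeps the perturbed subtour and its cost, otherwise it retains the current ones, and the same test applies to the coordinator's global annealing in Step A6 (Equations 4.20 and 4.21).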
The coordinates (x,y) of the cities in a 2dimensional plane are generated in the beginning of the program based upon the input mapseed for a random number generator. These coordinates are stored in memory and the euclidean distances are computed. The timing results in calculating the running time is obtained by calling the system function time() provided by the Unix Operating System. 6.3 Annealing Schedule Analysis In this section, a computational study for two annealing schedules, namely T(k+1) = cT(k) (Equation 3.12) and Tk = d/logk (Equation 3.13), is performed. The 50-cities TSP instance considered in this analysis is a set of random vertex locations on a square, and this 50-cities TSP instance is processed as 10 subtours, each of which consists of 5 cities. From this analysis, the design choices of constants of the annealing schedules can be obtained and evaluated. Algorithm A with Citywise Exchange discussed in Section 4.3.1 is used throughout this comparison. As briefly discussed in Chapter 4, the temperature value at the nth iteration is computed in Step Al of Algorithm A by the central coordinator and is passed to each processor. -115- M. Tran Chapter6: EmpiricalAnalysis Let us consider the first annealing schedule, T(k+l) = cT(k); Figure 6.1 plots the temperature versus time (iterations) at various values of c and T(0) = 20.0. As was mentioned in Chapter 4, this annealing schedule has been shown to provide good solutions with the Simulated Annealing Algorithm. To examine how the behavior of the Simulated Annealing Algorithm depends upon the annealing schedule, Figure 6.2 through Figure 6.7 plot the perturbed cost C(o n ) and the best cost C(on)B versus time for a sample run of Algorithm A with c = 0.94, 0.95, 0.96, 0.97, 0.98, and 0.99, respectively. And, Figure 6.8 and Figure 6.9 plot the combined perturbed costs and the best costs. Note that the best cost functions allow some uphill movements in the cost values when the temperature values are high because the best cost C(Gn)B = ((AC < 0) or (r < e-AC/T)}. As the temperature values decrease, the frequency of accepting uphill movements in the costs is reduced. Maps of the best tours for a sample run at various iterations for c = 0.94 are shown in Figure 6.10 through Figure 6.14 while maps of the best tour for c = 0.94, 0.95, 0.96, 0.97, 0.98, and 0.99 are shown in Figure 6.14 through Figure 6.19. From the best cost figures and Figure 6.1, it is interesting to note how the annealing schedule, particularly the value of c, has influenced the behavior of the Parallel Simulated Annealing Algorithm. If the rate of temperature reduction, or cooling, is too slow, i.e. c = 0.99, both the perturbed cost function and the best cost function converge fairly slowly; On the other hand, if the annealing schedule increases its cooling rate, thus decreases the value of c to 0.94, both the perturbed cost function and the best cost function converge much faster. But the quality of the final solution has deteriorated. This justifies the claim of Kirkpatrick that the annealing schedule is merely a control parameter. From Table 6.1, as the value of c increases, the number of iterations that Algorithm A required to converge to the minimum cost value increases. Note that the iteration from Table 6.1 is taken when the value of the cost function is minimal. Note also that the number of iterations that Algorithm A required to converge to the best minimum cost value at c = 0.96 is less than half that at c = 0.99. 
[Figure 6.1: Temperature versus time (iterations) for T(k+1) = cT(k) at various values of c, T(0) = 20.0.]

[Figures 6.2 through 6.7: Perturbed cost C(σn) and best cost C(σn)B versus time (iterations) for c = 0.94, 0.95, 0.96, 0.97, 0.98, and 0.99, respectively.]

[Figures 6.8 and 6.9: Combined perturbed costs and combined best costs versus time (iterations) for the values of c above.]

Figure 6.10: Map of the Best Tour at 1st Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50, and Best Cost = 1376.23.

Figure 6.11: Map of the Best Tour at 15th Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50, and Best Cost = 742.42.

Figure 6.12: Map of the Best Tour at 35th Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50, and Best Cost = 529.52.

Figure 6.13: Map of the Best Tour at 60th Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50, and Best Cost = 342.27.

Figure 6.14: Map of the Best Tour at 93rd Iteration for T(k+1) = cT(k), c = 0.94, T(0) = 20.0, N = 50, and Best Cost = 319.95.

Figure 6.15: Map of the Best Tour at 122nd Iteration for T(k+1) = cT(k), c = 0.95, T(0) = 20.0, N = 50, and Best Cost = 299.06.

Figure 6.16: Map of the Best Tour at 230th Iteration for T(k+1) = cT(k), c = 0.96, T(0) = 20.0, N = 50, and Best Cost = 292.27.

Figure 6.17: Map of the Best Tour at 286th Iteration for T(k+1) = cT(k), c = 0.97, T(0) = 20.0, N = 50, and Best Cost = 293.79.

Figure 6.18: Map of the Best Tour at 277th Iteration for T(k+1) = cT(k), c = 0.98, T(0) = 20.0, N = 50, and Best Cost = 295.45.

Figure 6.19: Map of the Best Tour at 407th Iteration for T(k+1) = cT(k), c = 0.99, T(0) = 20.0, N = 50, and Best Cost = 296.85.

c       Iteration   Cmin
0.94    93          319.997
0.95    122         299.064
0.96    230         292.267
0.97    286         293.791
0.98    277         295.450
0.99    470         296.845

Table 6.1: Minimum Costs or Quality of Final Solutions for Various Values of c for T(k+1) = cT(k), T(0) = 20.0 and N = 50.

Figure 6.20: Minimum Costs or Quality of Final Solutions Versus c for T(k+1) = cT(k).
It is more interesting to investigate which value of c provides the minimum cost, or the best quality of the final solution, Cmin. As can be seen from Figure 6.20 and Table 6.1, the annealing schedule with c = 0.96 provides the best solution, i.e. Cmin = 292.267 units. Comparing the solution of this annealing schedule with the solutions for the other values of c, the minimum cost values at c = 0.94, 0.95, 0.97, 0.98, and 0.99 are respectively 319.997, 299.064, 293.791, 295.450, and 296.845 units, which are respectively 27.730, 6.797, 1.524, 3.183, and 4.578 units higher than at c = 0.96. Comparing the maps of the best tours, Figure 6.16 at c = 0.96 also provides the most appealing tour with respect to its counterparts in Figure 6.14 at c = 0.94, Figure 6.15 at c = 0.95, Figure 6.17 at c = 0.97, Figure 6.18 at c = 0.98, and Figure 6.19 at c = 0.99.

Now let us consider the second annealing schedule, Tk = d/logk, whose temperature values versus time are shown in Figure 6.21 for different values of d. From the theoretical viewpoint, Geman and Geman required d ≥ L = |S|A for convergence, where |S| is the cardinality of the configuration space (here |S| = 50!/2) and A is some constant. Independent of the value of A, L is extremely large, and the cooling of Equation 3.13 becomes very slow. Moreover, Mitra, Romeo and Sangiovanni-Vincentelli [Mit85] required d ≥ rX for convergence, and the values of r and X are impossible to determine a priori computationally. A similar condition on d is required by Hajek [Haj85]. Therefore, a computational study is required to investigate how the value of d influences the behavior of our algorithm.

The behavior of Algorithm A with different values of d can best be seen from Figure 6.22 through Figure 6.25. These figures plot the perturbed costs C(σn) and the best costs C(σn)B versus time (iterations) for d = 5, 10, 15, and 20, respectively, and Figure 6.26 and Figure 6.27 summarize these perturbed costs and best costs. Examining closely how the value of d influences the behavior of Algorithm A, Figure 6.27 shows that the best cost with d = 5 provides the minimal cost function with respect to the other best cost functions, i.e. the best costs with d = 10, 15, and 20. At d = 5, Figures 6.22 and 6.27 indicate a stable best cost function, allowing regular uphill and downhill movements in the cost values. As d increases, the behavior of Algorithm A becomes more irregular. At d = 20, Figures 6.25 and 6.27 show many uphill movements in the cost functions. From these figures, one can deduce that the performance of Algorithm A is very dependent on the annealing schedule Tk = d/logk and is extremely sensitive to variation in the value of d. If the rate of temperature reduction, or cooling, is too slow, i.e. d >> 20, the cost functions worsen and never seem to converge to a "near-optimal" solution. Conversely, if the cooling rate is too fast, i.e. d << 5, the cost functions may show no uphill movements in cost at all; thus, the algorithm may have a tendency to get stuck at some local minimum. Between these two extreme cases, the cooling rate is gradual enough to allow the algorithm to converge with regular uphill and downhill movements in the cost functions. Thus, this value of d is likely to be around 5, in contrast with the value found in [Kim86], which is 20.
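As a quick arithmetic check on these observations (with the base-10 logarithm assumed, matching the log10() call in the Appendix A listing), the schedule Tk = d/logk can be tabulated for d = 5 and d = 20 at a few iterations:

#include <stdio.h>
#include <math.h>

/* Tabulate Tk = d / log10(k) for d = 5 and d = 20 at a few iterations k. */
int main(void)
{
    int iters[] = { 10, 100, 500 };
    int i;
    for (i = 0; i < 3; i++) {
        int k = iters[i];
        printf("k = %3d   T(d=5) = %5.2f   T(d=20) = %5.2f\n",
               k, 5.0 / log10((double)k), 20.0 / log10((double)k));
    }
    return 0;
}

By iteration 500 the d = 20 schedule is still above 7 temperature units while the d = 5 schedule has cooled below 2 units, which is consistent with the frequent uphill movements observed for d = 20 in Figure 6.25.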
Although experiments to "fine-tune" the value of d are not conducted in this thesis, in a future research effort it would be very interesting to see how varying d between 1 and 5 affects the performance of Algorithm A.

In addition to the cost functions, the maps of the best tours are invaluable tools in examining the performance of the Parallel Simulated Annealing Algorithm. The maps of the best tours for a sample run at d = 5 are shown in Figure 6.28 through Figure 6.33, and maps at d = 10, 15, and 20 are respectively provided in Figure 6.34 through Figure 6.36. Since the quality of the final solution is inversely related to the frequency of crossings in the best tour, i.e. the less frequent the crossings, the better the tour and the solution, it can be observed that the best tour map in Figure 6.33 with d = 5 provides the best tour compared to the maps in Figure 6.34 with d = 10, Figure 6.35 with d = 15, and Figure 6.36 with d = 20, respectively. Since each tour is associated with a minimum cost, the plot of the minimum costs, or the quality of the final solutions, versus d in Figure 6.37 and Table 6.2 confirms that the cost value is minimal at d = 5. Hence, the annealing schedule Tk = d/logk with d = 5, with best cost = 285.54 units, provides the best solution.

Let us compare the performance of Algorithm A using the two different annealing schedules discussed above. As was determined, T(k+1) = cT(k) with c = 0.96 provides the best solution among all the solutions with other values of c. As was determined in the previous paragraph, Tk = d/logk with d = 5 provides the most competitive solution with respect to the other values of d. To investigate which annealing schedule, T(k+1) = cT(k) with c = 0.96 or Tk = d/logk with d = 5, provides the better solution to the Traveling Salesman Problem, three criteria are examined. First, the perturbed cost and the best cost functions plotted in Figure 6.38 and Figure 6.39 for the two different annealing schedules are considered. As can be seen from Figure 6.39, the best cost function with d = 5 initially converges to a good solution faster than that with c = 0.96, but it may converge to some local minimum. The best cost function for c = 0.96 reaches its minimum cost value at the 230th iteration, while the best cost function for d = 5 converges to its minimum cost value at the 293rd iteration. Although the best cost function with d = 5 converges to its best solution more slowly than that with c = 0.96, it allows many uphill cost movements, thus allowing the algorithm to escape from local minima. This analysis intuitively supports the theoretical fact that, as time goes to infinity, the algorithm converges to an optimal solution with an annealing schedule Tk = d/logk. Secondly, comparing the TSP tours in Figure 6.16 for c = 0.96 and Figure 6.33 for d = 5, one can observe that their characteristics are quite similar. Finally, the minimum cost or the quality of the final solution of the annealing schedule with c = 0.96 is respectively 292.267 units and 286.97 units, while for the annealing schedule with d = 5 it is respectively 285.54 units and 280.244 units; with d = 5 these values are thus 6.727 and 6.726 units lower, respectively, than with c = 0.96.

[Figure 6.21: Temperature versus time (iterations) for Tk = d/logk at various values of d.]
[Figures 6.22 through 6.25: Perturbed cost C(σn) and best cost C(σn)B versus time (iterations) for Tk = d/logk with d = 5, 10, 15, and 20, respectively.]

[Figures 6.26 and 6.27: Combined perturbed costs and combined best costs versus time (iterations) for d = 5, 10, 15, and 20.]

Figure 6.28: Map of the Best Tour at 1st Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 1376.23.

Figure 6.29: Map of the Best Tour at 10th Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 596.85.

Figure 6.30: Map of the Best Tour at 43rd Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 317.23.

Figure 6.31: Map of the Best Tour at 67th Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 311.63.

Figure 6.32: Map of the Best Tour at 203rd Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 307.54.

Figure 6.33: Map of the Best Tour at 293rd Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 285.54.

Figure 6.34: Map of the Best Tour at 300th Iteration for Tk = d/logk, d = 10, N = 50 and Best Cost = 365.65.

Figure 6.35: Map of the Best Tour at 300th Iteration for Tk = d/logk, d = 15, N = 50 and Best Cost = 504.87.

Figure 6.36: Map of the Best Tour at 122nd Iteration for Tk = d/logk, d = 20, N = 50 and Best Cost = 514.24.

d     Iteration   Cmin
5     293         285.54
10    300         365.65
15    300         504.87
20    122         514.24

Table 6.2: Minimum Costs or Quality of Final Solutions for Various Values of d.

Figure 6.37: Minimum Costs or Quality of Final Solutions Versus d for Tk = d/logk.

[Figures 6.38 and 6.39: Perturbed cost and best cost functions versus time (iterations) for the two annealing schedules, T(k+1) = cT(k) with c = 0.96 and Tk = d/logk with d = 5.]

Hence, in the next section and subsequent sections, the annealing schedule Tk = d/logk with d = 5 is assumed unless explicitly indicated otherwise.

6.4 Simulated Annealing Versus Local Optimization

In the previous section, the annealing schedule Tk = d/logk with d = 5 was shown to be better than the annealing schedule T(k+1) = cT(k), T(0) = 20.0, for any c. In this section, using this annealing schedule, namely Tk = d/logk with d = 5, the performance of Algorithm A using Simulated Annealing and of Algorithm A using Local Optimization is investigated.
Recall that Simulated Annealing accepts the next configuration with probability 1 if ΔC < 0 and accepts it with probability exp(-ΔC/T) if ΔC > 0. On the other hand, Local Optimization accepts the next configuration only if ΔC < 0 and rejects it otherwise. In order to implement Algorithm A using Local Optimization, i.e. without annealing, only Step A6 is changed: by removing the factor (r < exp{-ΔC/Tn}) from both Equations 4.20 and 4.21, Algorithm A becomes a Local Optimization Algorithm. The experiments for both Simulated Annealing and Local Optimization are conducted using the annealing schedule Tk = d/logk, d = 5; note that the annealing schedule is still used for the local Simulated Annealing, i.e. for annealing the subtours. In this way, the behavior of Simulated Annealing can be compared with that of Local Optimization. The map of the best tour using Local Optimization is shown in Figure 6.40, while its associated cost functions are plotted in Figure 6.41. Figure 6.42 through Figure 6.44 plot different combinations of the perturbed and best costs using Local Optimization and the perturbed and best costs using Simulated Annealing.

Figure 6.40: Map of the Best Tour using Local Optimization at 108th Iteration for Tk = d/logk, d = 5, N = 50 and Best Cost = 327.93.

[Figures 6.41 through 6.44: Perturbed cost and best cost functions versus time (iterations) for Local Optimization and for Simulated Annealing with Tk = d/logk, d = 5, N = 50.]

The characteristic appeal of the best tour map using Simulated Annealing, shown in Figure 6.33, appears much better than that of the best tour map using Local Optimization, shown in Figure 6.40. The best cost of Algorithm A with Simulated Annealing is 285.54 units, which occurred at the 293rd iteration, while the best cost with Local Optimization is 327.93 units, which occurred at the 108th iteration; Algorithm A with the Simulated Annealing technique converges to a much better solution than Algorithm A with the Local Optimization technique, but it has to iterate much longer than Local Optimization. This observation is also clearly seen in Figure 6.44. Notice that while Algorithm A with Local Optimization converges fairly quickly, it gets stuck at the first local minimum at the 108th iteration with the best cost equal to 327.93 units; Algorithm A with Simulated Annealing continues to iterate, avoiding this local minimum, and reaches a lower (or better) best cost value.
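The difference between the two acceptance rules can be seen in the following sketch; the function names are illustrative (not from Appendix A), and the random number r is drawn by drand48() over [0,1), as in the thesis program.

#include <stdlib.h>
#include <math.h>

/* Metropolis acceptance used by Simulated Annealing: always accept an
   improvement; accept an uphill move with probability exp(-dC/T). */
int accept_annealing(double dC, double T)
{
    if (dC < 0.0)
        return 1;
    return drand48() < exp(-dC / T);      /* the factor (r < exp(-dC/Tn)) */
}

/* Local Optimization: the exponential factor is removed, so only
   improvements are accepted. */
int accept_local_opt(double dC)
{
    return dC < 0.0;
}

Removing the exponential test is exactly the change to Step A6 described above; everything else in Algorithm A, including the schedule that supplies T to the subtour-level annealing, stays the same.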
Both the ability to avoid being entrapped in local minima and the convergence to a better solution verify our intuition (Chapter 2), and the various notes throughout this thesis, that the Simulated Annealing Algorithm is much more powerful than Local Optimization. The running time of Algorithm A for the 50-cities TSP instance investigated in this section and the last is on the average about 13 hours in real time.

6.5 Citywise Versus Edgewise Exchanges

In the last section, the usefulness of Simulated Annealing over Local Optimization was shown. In this section, the question "How much impact does the neighborhood structure have on the overall performance of the Parallel Simulated Annealing Algorithm?" is addressed. For this particular investigation, two neighborhood structures, namely Citywise Exchange and Edgewise Exchange, are selected. Furthermore, a 100-cities TSP instance is used so that an independent set of experimental results is obtained, and these results are compared. This 100-cities TSP is processed as 20 subtours, each of which consists of 5 cities. Maps of the best tours using Citywise Exchange for a sample run of Algorithm A are shown in Figure 6.45 through Figure 6.49. For comparison purposes, a map of the best tour using Edgewise Exchange is also provided in Figure 6.50. As can be seen from Figure 6.49 and Figure 6.50, the map of the best tour using Citywise Exchange is intuitively more appealing than the map of the best tour using Edgewise Exchange. This difference is confirmed by their corresponding best cost values; the best cost using Citywise Exchange is 865.09 units while the best cost using Edgewise Exchange is 1087.16 units. Moreover, Citywise Exchange allows the algorithm some regular uphill and downhill movements in both the perturbed cost and best cost functions, as shown in Figure 6.51, while still allowing the algorithm to converge to a good solution. On the other hand, Edgewise Exchange seems to allow the algorithm "only" downhill movements in the cost functions, as shown in Figure 6.52. Because of these "only" downhill movements, the algorithm converges to a high cost value. This comparison can be seen in Figure 6.53 and Figure 6.54, where only the best cost functions of the two neighborhood structures are plotted versus time and versus temperature, respectively. It is interesting to note, from Figure 6.54, the divergence point of these two functions, which occurs at a temperature of about 7.15 units.

Figure 6.45: Map of the Best Tour using Citywise Exchange at 1st Iteration for Tk = d/logk, d = 5, N = 100 and Best Cost = 5517.64.

Figure 6.46: Map of the Best Tour using Citywise Exchange at 10th Iteration for Tk = d/logk, d = 5, N = 100 and Best Cost = 1778.22.

Figure 6.47: Map of the Best Tour using Citywise Exchange at 40th Iteration for Tk = d/logk, d = 5, N = 100 and Best Cost = 1236.59.
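The two neighborhood moves compared here can be sketched on a tour stored as an array of city indices. This is an illustrative fragment with hypothetical names (swap_cities(), reverse_segment()), assuming position indices i < j; the thesis' own subtour-based versions appear as citywise_exchange() and edgewise_exchange() in Appendix A. Citywise Exchange is taken as exchanging the cities at two positions, and Edgewise Exchange as reversing the segment between them, i.e. the 2-opt move of Lin noted later in this section.

/* Citywise Exchange: exchange the cities at positions i and j (i < j). */
void swap_cities(int *tour, int i, int j)
{
    int tmp = tour[i];
    tour[i] = tour[j];
    tour[j] = tmp;
}

/* Edgewise Exchange (2-opt): reverse the segment between positions i and j,
   which replaces the two tour edges that bound the segment. */
void reverse_segment(int *tour, int i, int j)
{
    while (i < j) {
        int tmp = tour[i];
        tour[i] = tour[j];
        tour[j] = tmp;
        i++;
        j--;
    }
}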
Figure 6.48: Map of the Best Tour using Citywise Exchange at 94th Iteration for Tk = d/logk, d = 5, N = 100 and Best Cost = 992.75.

Figure 6.49: Map of the Best Tour using Citywise Exchange at 436th Iteration for Tk = d/logk, d = 5, N = 100 and Best Cost = 865.09.

Figure 6.50: Map of the Best Tour using Edgewise Exchange at 492nd Iteration for Tk = d/logk, d = 5, N = 100 and Best Cost = 1087.16.

[Figure 6.51: Citywise Exchange perturbed and best costs versus time (iterations) for Tk = d/logk at d = 5 and N = 100.]

[Figure 6.52: Edgewise Exchange perturbed and best costs versus time (iterations) for Tk = d/logk at d = 5 and N = 100.]

[Figures 6.53 and 6.54: Best cost functions of the two neighborhood structures versus time (iterations) and versus temperature, respectively.]

For this particular study, it is safe to conclude that the performance of Algorithm A using Citywise Exchange exceeds the performance using Edgewise Exchange (Lin's 2-Opt Algorithm) from three different perspectives: the maps of the TSP tours, the perturbed and best cost functions, and the minimum cost values. From this analysis, one can see that the performance of the Parallel Simulated Annealing Algorithm depends heavily upon the neighborhood structure(s) being used. Though further investigation of the above neighborhood structures is not conducted in this thesis, it would be a very interesting subject for further study to compare the performance of Citywise Exchange with some other neighborhood structure or with a combination of neighborhood structures. The running time of Algorithm A for a 100-cities TSP instance is on the average about 45 hours in real time, which is more than triple the running time for a 50-cities TSP instance.

In this chapter, we have seen how strongly the performance of the Parallel Simulated Annealing Algorithm depends on the chosen annealing schedule and its variables; this was experimentally examined in Section 6.3. With a properly chosen annealing schedule, good solutions can be obtained for many combinatorial optimization problems. In Section 6.4, the usefulness of Simulated Annealing over Local Optimization was demonstrated. The performance of the Parallel Simulated Annealing Algorithm depends not only on the annealing schedule but also on the neighborhood structure, as illustrated in Section 6.5. Finally, the computation times can be very extensive for some large combinatorial optimization problems.
For example, doubling the size of the TSP instance from 50 cities to 100 cities increases the running time from 13 hours to 45 hours on the average, which is more than triple.

CHAPTER 7

SUMMARY AND CONCLUSIONS

In Chapter 1, the Traveling Salesman Problem was defined and introduced. Practical applications of the TSP were also briefly covered. The historical relationship of the TSP with the field of combinatorial optimization was reviewed, and the methodology for solving the TSP was reviewed and introduced. In Chapter 2, the historical development of classical Simulated Annealing was studied, and the Simulated Annealing Algorithm was outlined. In Chapter 3, certain key mathematical concepts underlying the Simulated Annealing Algorithm were examined.

In Chapter 4, the Synchronous Parallel Simulated Annealing Algorithm was designed (Algorithm A in Section 4.5.1). In this candidate parallel algorithm (Algorithm A), the central coordinator begins each iteration by selecting the temperature value, randomly generating a starting city, and partitioning the current tour into P subtours. Then, it delivers the temperature value and the P partitioned subtours to the Np processing elements in the system. Each of these processors is responsible for perturbing its subtour, computing the cost of the perturbed subtour, and performing the local annealing process as in Step A3. When each of these processors has finished the tasks on its subtour, the central coordinator reconstructs the new tour from these "better" subtours, computes its new cost, and performs the global annealing process. The central coordinator then repeats this process at the next iteration until some "maximum iterations" stopping criterion has been satisfied. In Chapter 4, two neighborhood structures were also outlined and discussed.

In Chapter 5, the speedup of Algorithm A was analyzed for two different cases of processor allocation, namely the unlimited processor allocation (Np ≥ P) and the limited processor allocation (Np < P), and for two different cases of idle processors, a special case of which is that there are at most (Np - 1) idle processors. It was shown that there is little benefit in partitioning the tour into a number of subtours P much beyond the number of allocated processors Np, or in allocating a number of processors Np much beyond the number of partitioned subtours P; the maximum speedup occurs when the number of allocated processors is equal to the number of subtours, i.e. Np = P (Observations 5.2.1-5.2.3). Communication takes place in Algorithm A between the central coordinator and the processors, and between the processors themselves; interprocessor communication occurs when processors exchange cities with one another. It was found that, on the average, a processor in the network handles O(N) messages, and the communication overhead is O(N) or O(NpNs) units of time (Equation 5.18), i.e. a term proportional to Ns(Np - 1) at each iteration. The speedup of Algorithm A is not, in general, independent of the communication overhead factor but rather depends upon it.
However, as the number of cities N or Ns increases, the communication overhead factor becomes less dominant in the speedup equation; as N or Ns approaches infinity, the speedup of Algorithm A asymptotically approaches a linear function of the number of allocated processors Np (Equation 5.61 and Observation 5.6.6) and becomes independent of the communication overhead factor, which is proportional to Ns(Np - 1). For a large number of allocated processors Np, the speedup is approximately proportional to Ns, i.e. O(Ns). For the arbitrary idle processors case, the bounds on the speedup of Algorithm A for all Ns and Np were shown to lie in the range of Equation 5.63 if the communication overhead factor is taken into consideration, and in the range of Equation 5.64 otherwise. For the worst idle processors case, the bounds on the speedup of Algorithm A were shown to lie in the range of Equation 5.68 if the communication overhead factor is taken into consideration, and in the range of Equation 5.69 otherwise. For any case of idle processors, the general bounds on the speedup of Algorithm A for all Ns and Np were shown to lie in the range of Equation 5.70 if the communication overhead factor is taken into consideration, and in the range of Equation 5.71 otherwise.

In Chapter 6, a computational study was conducted for two instances of the TSP, namely one with 50 cities and one with 100 cities. Using the 50-cities TSP instance, experiments were performed in Section 6.3 for two annealing schedules, namely T(k+1) = cT(k) and Tk = d/logk, and the results were compared and evaluated. For the annealing schedule T(k+1) = cT(k), experiments were performed for T(0) = 20.0 and for different values of c in the range 0.94 ≤ c ≤ 0.99. It was found that the annealing schedule T(k+1) = cT(k) with c = 0.96 provided the best solution with respect to the other values of c; the quality of the final solution, or the minimum cost Cmin, was found to be 292.267 units. For the annealing schedule Tk = d/logk, experiments were performed for d = 5, 10, 15, and 20, and the results were compared and evaluated. It was found that the annealing schedule Tk = d/logk with d = 5 provides the best solution with respect to the other values of d; the quality of the final solution, or the minimum cost Cmin, was 285.54 units. Comparing the results of the two annealing schedules, the minimum cost value, or the quality of the final solution, of Tk = d/logk with d = 5 is 6.727 units lower than that of T(k+1) = cT(k) with c = 0.96; thus, the annealing schedule Tk = d/logk with d = 5 provided a better solution than the annealing schedule T(k+1) = cT(k). In general, the temperature should be lowered slowly; otherwise, the cost function has a tendency to get stuck at higher cost values, possibly a local minimum. Similarly, experiments were performed for Simulated Annealing and Local Optimization in Section 6.4, and the results were compared and evaluated. The results indicated that Simulated Annealing yielded a much better solution than Local Optimization, but it had to iterate much longer. In Section 6.5, a 100-cities TSP instance was used to obtain experimental results for two neighborhood structures, namely Citywise Exchange and Edgewise Exchange. It was found that the best cost using Citywise Exchange was 865.09 units while the best cost using Edgewise Exchange was 1087.16 units.
Together with the other criteria, it was concluded that Algorithm A using Citywise Exchange outperformed Algorithm A using Edgewise Exchange.

During the course of this study, the author has encountered several "interesting" and "challenging" open research problems concerning the Traveling Salesman Problem and the Parallel Simulated Annealing Algorithm. Among these, the two areas below are suggested for further investigation:

1. A rigorous theoretical treatment of the average-case performance analysis for the Parallel Simulated Annealing Algorithm.

2. A computational study of Algorithm B for the Traveling Salesman Problem.

APPENDIX A

SIMULATION PROGRAM FOR A SYNCHRONOUS PARALLEL SIMULATED ANNEALING ALGORITHM

The software program for the Parallel Simulated Annealing Algorithm is given in the following pages. It is written in "C" on the Unix Operating System. In the simulation program, five math library subroutines were accessed by including <math.h>: drand48(), which returns non-negative double-precision values uniformly distributed over the interval [0,1); pow(), which performs x^y operations; exp(), which performs e^x operations; log10(), which performs the logarithmic operations for the annealing schedule; and sqrt(), which performs square root operations. Timing information is obtained by calling the function time() provided by the Unix Operating System; this time() function measures the execution of the program in real time. The structure of the program follows the general steps outlined in Section 4.5.1.

/*****************************************************************/
/*   SYNCHRONOUS PARALLEL SIMULATED ANNEALING ALGORITHM          */
/*   FOR THE TRAVELING SALESMAN PROBLEM                          */
/*   written in "C", July 15, 1989.                              */
/*****************************************************************/

#define INDEBUG                  /** input() debug constant **/
#define MAXNODES     500         /** maximum nodes, subtours and iterations **/
#define MAXSUBTOURS  20
#define MAXITO       500

#define #define NOINIT INIT /** states of the subtour **/ #define #define ACCEPTED NOTACCEPTED #define #define FREE BUSY #define #define AVAILABLE RESERVED #define #define FALSE TRUE #define #define NONODE NODE +1 3 9 /** states of Annealing **/ 10 /** status of the processors **/ 19 /** states of the intersubtour pairwise exchange */

/**************************STRUCTURES**********************/

/** structure to save xy coordinates of a random vertex map **/
struct dsp_map_struct {
    float x_coord;
    float y_coord;
};

/** message structure between central coordinator and processors **/
struct message_struct {
    int   current_subtour[MAXNODES];   /* current nodes in subtour */
    int   perturb_subtour[MAXNODES];   /* the perturbed subtour */
    int   node_state[MAXNODES];        /* states of the nodes */
    int   reserved_node[MAXNODES];     /* reserved nodes */
    int   status;                      /* status of the processor */
    int   state_subtour;               /* defined as above */
    int   state_tour;
    int   cardinality;                 /* number of nodes in a subtour */
    float temperature;
    float cost_curr_subtour;           /* cost of the current subtour */
    float cost_perturb_subtour;        /* cost of the perturb subtour */
Tran float float float float Appendix A: Simulation Program cost_subtour; chgin cost; rng; /* acceptprob; /* accepted cost of subtour */ /* change in cost of subtour */ random genertor */ /* acceptance probability */ /** processing element structure **/ struct tsp_pe_struct { struct message_struct msg; /** central coordinator structure **/ struct tsp_ccstruct I float cost_init_tour[20]; int soln_tour[MAXNODES]; /** float cost_soln tour; /** C(O) **/ int best_tour[MAXNODES]; float int float float float float float float float float float int /***** extern extern extern extern extern extern extern extern extern extern **/ /** **/ 3B cost_best_tour; /** C(})B **/ optimal_tour[MAXNODES]; /** min{oB} **/ /** min{C(C)B} **/ costopt_tour[20]; opttemperature[20]; opt_iteration[20]; iteration; mgcc; acceptprobcc; chgincostcc; temperature; state_tour; not_acc_cnt; global variables ******/ int init_run_time, final_run_time, elapsed_run_time; int npe; /** number of processing elements **/ int max_subtours; /** maximum # of subtours formed **/ int num_nodes,num_edges; /** number of vertices, nodes in TSP */ int seed,mapseed; /** seeds for random number generator **/ int num_runs,run_num; /** # of runs of program */ int sample_pts; /** # of sample points to be taken **/ int SamplePts; /** running index for sample_pts **/ float max_ito; /* stopping criterion, on maximum iterations*/ int iter, -184- M. Tran Appendix A: Simulation Program /*****structures ***************/ extern struct tsp cc struct central_coord; /* central coordinator */ extern struct tsp_pe_struct proc_pe[]; /* p processing elements */ extern struct dsp_map_struct map_nodes[]; /* map */ /***** annealing schedule externals ******/ extern float depth; /** T=depth / log(k) or T = (depth)kT(O) **/ /******* functions ***********/ extern int extern float extern int step_A3(),step_A10(),step_A5()0; anneal_schedule(); inito,inputo,output(); extern int extern int timeO,results(); update stats(); extern extern extern extern best_history[] [MAXNODES]; soln_history[] [MAXNODES]; cost_history[][40]; uphill_history[][40]; int int float float #include #include /** ALGORITHM A main steps */ "defs.c" "external.c" /************************* BEGIN OF GLOBAL.C ******************/ /** Allocate storage space for global parameters **/ I****************************************************************I int int initruntime, final_run_time, elapsed_run_time; npe; float max_ito; int int int max_subtours; num_nodes,num_edges; num_runs, run_num; float depth; int int int seed, mapseed; sample_pts; SamplePts; struct tsp cc_struct central_coord; /** 1 central coordinator */ struct tsp_pe_struct proc_pe[MAXSUBTOURS]; /** p processing elements **/ struct dsp_map_struct map_nodes[MAXNODES]; /** coordinator per node **/ int best_history[MAXNODES][MAXNODES]; int soln_history[MAXNODES][MAXNODES]; float cost_history[MAXNODES][40]; -185- Appendix A: Simulation Program M. 
Tran float uphill_history[MAXNODES] [40]; /******************* END OF GLOBAL.C ***********************I /********************** BEGIN OF IN.DAT ***************/ 10 50 500.0 10 10 5.0 1 100 #define INDAT #ifdef INDAT printf('%\,t npe \t num_nodes \t max_ito\n"); printf('t\t %d \t\t %d \t\t %f\n",npe,num_nodes,max_ito); printf('"t\t mapseed \t seed \n"); printf('"t\t %d \t\t\t %d\n",mapseed,seed); printf('ttt depth\n"); printf('"\tt %fAn",depth); printf("num_runs\t sample_pts\n"); printf("%d\t %d~\",num_runs,sample_pts); #endif ********************* END OF IN.DAT ******************/ #include <stdio.h> #include <math.h> "defs.c" #include #include "external.c" /************************* BEGIN OF INPUT.C *****************/ /** Inputs variables from file in.dat **/ input() { struct pair struct *prs; /* pointers to structures */ FILE *fopeno, *fp; /* fp=ptr to FILE = type */ static char infile[] = "in.dat"; /** get input via file name "in.dat" **/ /* fp=fopen(name,mode) */ if( (fp = fopen(infile,"r")) == NULL){ /* Std i/o library function */ printf('"n error on reading input file! \n"); exit(); -186- Appendix A: Simulation Program M. Tran fscanf(fp,"%d %d %f\n", &npe, &num_nodes, &max_ito); fscanf(fp,"%d %d\n", &mapseed, &seed); fscanf(fp,"%f\n", &depth); fscanf(fp,"%d %d\n", &numruns, &sample_pts); /** maximum # of processors **/ /** number of nodes in TSP */ /** stopping criterion on iterations */ /** seed for map generation **/ /** seed for random number generator*/ /** annealing schedule, d **/ /** # of runs of program **/ /** # of sample points taken **/ fclose(fp); #define INDEBUG #ifdef INDEBUG printf('\t INPUT DATA \n\n"); printf('"týt npe \t num_nodes \t max_ito\n"); printf('"V\tt %d'\t\t %d \t\t %f \n", npe,num_nodes,max_ito); printf('"\ft mapseed \t seed \n"); printf('Nt %d t\t %d\n",mapseed,seed); printf('l\\t depth \n"); printf('"\.t %f\n",depth); t num_runs\t printf(' A\t sample_pts \n "); printf("\t\t %d \tNt %d\n ",num_runs,sample_pts); #endif return; I /************************** #include #include #include #include END OF INPUT.C * <stdio.h> <math.h> "defs.c" "external.c" /**************************** BEGIN OF MAP.C *****************/ /** this subroutine generates a random vertex map in a 50x50 square **/ /**************************************************************/ make_map() struct dspmapstruct struct dsp-map-struct *map; /* map pts to dsp_map_struct */ -187- M. 
Tran int Appendix A: Simulation Program i,node, nodel, node2; float chg_x, chg_y, dist; double sqrt(), drand48(); static char mapfl] = "map.dat"; FILE *fp, *fopen(); /*** for every node or city; num_nodes = max # of cities in TSP ***/ for( node = 0; node < num_nodes; node++ )( /**pointer to map structure **/ map = &map_nodes[node]; /**get a random x coordinate for a node **/ map->x_coord = drand48(mapseed) * num_nodes; /**get a random y coordinate for a node **/ map->y_coord = drand48(mapseed) * num_nodes; #define XYDEBUG #ifdef XYDEBUG if ( num_nodes < MAXNODES )( if( (fp = fopen(mapf,"w") ) == NULL )( printf(" map.dat failed to open\n "); exit(); /*** for every node or city; num_nodes = max # of cities in TSP ***/ fprintf(fp,'\nNODE NUMBER, X AND Y COORDINATE POSITION\n\n"); fprintf(fp,"Node Number \t X Coordinate \t Y Coordinate \n"); for( node = 0; node < numnodes; node++ ){ /**pointer to map structure **/ map = &map_nodes[node]; /**get a random x coordinate for a node **/ map->x_coord = drand48(mapseed) * num_nodes; /**get a random y coordinate for a node **/ map->y_coord = drand48(mapseed) * num_nodes; fprintf(fp,"%d \t %.2f \t %.2f\n",node,map->x_coord,map->y_coord); } #endif XYDEBUG #ifdef MAPDEBUG fprintf(fp,"\n DISTANCE MATRIX \n\n"); -188- M. Tran Appendix A: Simulation Program i= 0; for(nodel = 0; node 1 < num_nodes; nodel++) { for(node2 = 0; node2 < num_nodes; node2++){ chg_x = map_nodes[nodel].x_coord map_nodes[node2].x_coord; chg_y = map_nodes[nodel].y_coord map_nodes[node2].y_coord; dist = sqrt( (chg_x * chgx) + (chg_y * chg_y) ); if( i !=0) if( (i % 10) ==0 ) fprintf(fp,'"\n"); /**next line **/ fprintf(fp," %.2f ",dist); i++; I fprintf(fp,' n\n"); ) #endif fclose(fp); }else{ printf('"\n MAP.C: num_nodes must be less than MAXNODES \n"); exit(); return; /************************** END OF MAP.C **********************/ #include #include #include #include <stdio.h> <math.h> "defs.c" "external.c" /******************* BEGIN OF INIT.C ******************/ /** initialize all the necessary parameter for ALGORITHM A init() { struct tsp cc_struct *cc; struct tsppe_struct *pe; double drand48(, sqrt(); float xdelta, ydelta, distance; int node, i, node 1l,node2; int tour_loc[MAXNODES]; static char init_tourf[] = "init_tour.dat"; FILE *fp, *fopen(); -189- **/ M. 
Tran Appendix A: Simulation Program cc = &central_coord; seed += 10; /** generate a random initial tour generating a # between 0 & num_nodes **/ /** and assigning it to each elt of the soln_tour[] **/ for (i = 0; i < num_nodes; i++) tourloc[i] = NONODE; i= 0; while (i != num_nodes)( node = drand48(seed) * num_nodes; if ( tourloc[node] == 1 ) continue; cc->soln_tour[node] = i; tourloc[node] = 1; i++; /** calculate initial cost of tour by calculating the sum of the distances **/ /** of the map_nodes array whose indices are controlled by soln_tour[] **/ cc->cost_soln_tour = 0.0; for( node = 0; node < num_nodes; node++){ nodel = cc->soln_tour[node]; /* a randomly generated node*/ node2 = cc->soln_tour[(node+ 1+num_nodes)%num_nodes]; /* []= nodel + 1 ==> num_nodes */ xdelta = map_nodes[nodel ].x_coord - map_nodes[node2].x_coord; ydelta = map_nodes[nodel].y_coord - map_nodes[node2].y_coord; distance = sqrt( (xdelta * xdelta) + (ydelta * ydelta) ); cc->cost_soln_tour += distance; /* get the cost of the initial tour */ /** Save the cost of the initial and optimal tours for each run **/ cc->cost inittour[runnum] = cc->cost_soln_tour; cc->cost_opt_tour[run_num] = cc->cost_soln_tour; #ifdef INITDEBUG if( (fp = fopen(init_tourf,"w") ) == NULL){ printf(" init_tour.dat failed to open \n"); exit(); }else{ fprintf(fp,'"n\n THE RANDOM INITIAL TOUR IS \nLn"); for (node = 0; node < num_nodes; node++)( if (node != 0 ) if ((node % 10)== 0) fprintf(fp,'\n"); /** next line **/ -190- Appendix A: Simulation Program M. Tran fprintf(fp," %d\t",cc->soln_tour[node]); fprintf(fp,'\n\n"); fprintf(fp,'\n\n TOTAL COST OF THE INITIAL TOUR = %f\n",cc->cost_soln_tour); fclose(fp); #endif /** intialize the best_tour **/ cc->cost_besttour = cc->cost_solntour; /* Let best cost = initial cost */ for ( node = 0; node < num_nodes; node++ ) cc->best_tour[node] = cc->soln_tour[node]; /** initialize the initial statistics **/ cc->iteration = 1.0; /** initial time before any execution **/ opt tour_record(); update_stats(); /** initialize other variables **/ /* At iteration = n = 2 */ cc->iteration = 2.0; cc->temperature = anneal_schedule(cc->iteration); opt_tour_record(); num_edges = num_nodes; /* Let num of edges = num of nodes */ max_subtours = npe; /* max_subtour = npe = P = max # of parallel processes */ return; ************************* END OF INIT.C ************************ #include #include #include #include <stdio.h> <math.h> "defs.c" "external.c" BEGIN OF TOUR1.C *********************/ /** stepA () tasks: (1). establishes a message channel beween the **/ every processor **/ central coordinator and /** (2). select the annealing temperature **/ /** /** (3). and partition the tsp tour into subtours **/ /********************* /**************************************************************/ -191- Appendix A: Simulation Program M. 
Tran stepA 1() { struct tsp_cc_struct *cc; struct tsp_pe_struct *pe; i,k,proc, node,snode,nodel,node2,cardinality; int int iter, InitRanSeed, InitRanNode; float xdelta,ydelta,distance,init_cost; double sqrto, drand48(); int offset_tourloc [MAXNODES]; cc = &central_coord; /** Generate the initial random node **/ iter = cc->iteration; InitRanSeed = (iter + num_nodes)%num_nodes; InitRanNode = drand48(seed + InitRanSeed)*num_nodes; /** Offset the tour to begin at the initial random node **/ for ( i = 0; i < num_nodes; i++ ){ offsettour_loc[i] = cc->soln_tour[(InitRanNode + i + num_nodes)%num_nodes]; printf("offet_tour_loc[%d] = cc->soln_tour[%d] = %d\n", i,( (InitRanNode + i + num_nodes)%num_nodes ),cc->soln_tour[i]); /** Get the offset tour **/ for (i = 0; i < num_nodes; i++) cc->soln_tour[i] = offset_tourloc[i]; /*** for each process in system, make a message between the central ***/ /*** coordinator and processes. Partition the tsptour into subtours.***/ node = 0; cardinality = num_nodes/npe; /* cardinality = # nodes in subtour */ for( proc = 0; proc < npe; proc++ ){ pe = &proc_pe[proc]; pe->msg.state_subtour = INIT; pe->msg.status = FREE; pe->msg.cardinality = cardinality; /** All nodes are available **/ for (k = 0; k < cardinality; k++) pe->msg.node_state[k] = AVAILABLE; /** Update the temperature at every iteration **/ -192- Appendix A: Simulation Program M. Tran if ( cc->iteration < 2.0 ) cc->iteration = 2.0; }else( cc->temperature = pe->msg.temperature = anneal_schedule(cc->iteration); } if( pe->msg.temperature == ){ printf(" tourl.c: The temperature can not be ZERO\n"); exit(); } /**partition the tsp tour into 'npe' or P subtours **/ for ( snode = 0; snode < cardinality; snode++){ node = ( node + num_nodes ) % num_nodes; pe->msg.current subtour[snode] = cc->soln_tour[node]; node++; /**compute the initial cost of the subtour **/ init_cost = 0.0; for ( snode = 0; snode < cardinality-1; snode++)( node 1 = pe->msg.current_subtour[snode]; node2 = pe->msg.current_subtour[snode+ 1]; xdelta = map_nodes[nodel].x_coord - map_nodes[node2].x_coord; ydelta = map_nodes[nodel].y_coord - map_nodes[node2].y coord; /**eucledian distance, cost of edge **/ distance = sqrt( ( xdelta * xdelta ) + ( ydelta * ydelta) ); init_cost += distance; pe->msg.cost_subtour = init_cost; #ifdef T1DEBUG printf("\n\t TOUR1=A1(): msg number = %d\n",proc); if ( pe->msg.state_subtour == INIT) printf('Nt\t state_subtour = INIT\n"); if ( pe->msg.status == FREE ) printf('\t\t processor %d status = FREE\n",proc); printf('Nti temperature = %f\n",pe->msg.temperature); printf('\t\t cardinality = %d\n",pe->msg.cardinality); printf('"t\t cost_subtour = %f\n",pe->msg.cost_subtour); printf('\t\t subtour is ... \n"); for ( i = 0; i < cardinality; i++)( if( i !=0) if( (i%10)== 0) printf("\n"); printf("%d\t",pe->msg.current_subtour[i]); } printf('M"\n"); #endif T1DEBUG -193- m M. 
Tran Appendix A: Simulation Program ) /* for loop */ return; S/* Step_A1() */ /**************** END OF TOUR1.C ***********************/ #include <math.h> #include <stdio.h> #include "defs.c" #include "external.c" /*********** annealing schedule computation **********/ **/ /** The following function will computes appropriate /** different annealing schedules /** Input: time or iteration **/ /** Output: temperature value **/ float anneal_schedule(curr_time) float curr_time; /* curr_time = cc->iteration */ { struct tsp_cc_struct *cc; float prob,tempiter, anneal_tmp; double pow(),log10(); cc = &central_coord; /****** Inverse natural log annealing T = d/logt ******/ anneal_tmp = depth / log10(curr_time); /****** linear schedule with various scalings if ( curr_time <= 2 ){ anneal_tmp = 20.0; }else { anneal_tmp = 20.0*pow(depth,curr_time); return(anneal_tmp); #include <stdio.h> -194- M. Tran Appendix A: Simulation Program #include <math.h> #include "defs.c" #include "external.c" BEGIN OF SUBTOUR.C ************/ /* This subroutine performs STEP A3 of Algorithm A . */ /* Code for the p subtours or processors in the parallel computer; */ perform a Tij or Lij interchange, change in costs, interprocessors */ /* /* communication, and performs the annealing test. */ /*********************** step_A3() int proc; #ifdef CITYWISE /** Step 3A for Citywise Exchanges and annealing in subtours **/ for( proc = 0; proc < max_subtours; proc++ ) citywise_exchange(&proc_pe[proc],proc); #endif #define EDGEWISE #ifdef EDGEWISE /** Step 3A Edgewise Exchanges and annealing in subtours **/ for( proc = 0; proc < max_subtours; proc++ ) edgewise_exchange(&proc_pe[proc],proc); #endif /**Step 3B for Interprocessor Communication **/ if (npe != 1 ) for( proc = 0; proc < max_subtours; proc++ ) Intercommunication(&proc_pe[proc],proc); } #ifdef CITYWISE /** Step 3C for Citywise Exchanges and annnealing in subtours **/ to sort out the subtours for ( proc = 0; proc < max_subtours; proc++) citywise_exchange(&proc_pe[proc],proc); #endif #define EDGEWISE #ifdef EDGEWISE /** Step 3C for Edgewise Exchanges and annealing in subtours **/ for( proc = 0; proc < max_subtours; proc++ ) edgewise_exchange(&proc_pe[proc],proc); #endif -195- Appendix A: Simulation Program M. 
Tran return; /** Each processing element performs its Tij exchange and internal annealing **/ /***************************************************************** citywise_exchange(pe,proc) struct tsp_pe_struct *pe; /* pointer to processor 'proc' data */ int proc; struclt tsp_cc_struct *cc; structt dsp_map_struct *map; /* pointer to xy coordinates of ma p */ float xdelta,ydelta,distance; /* map variables */ float chgjincost; /* chg in cost for Lij or Tij */ float temp,accept_prob,rng; doub le sqrto,powo; /* for the cost of subtours */ doub le drand48(),exp(); /*returns a non-negative double-precisi on floating-point values uniformly dis tributed over the interval [0,1)*/ int i,j,node 1,node2,pairl ,pair2,node,next_node; intcardinality; /* cardinality of subtour */ int p,k,nodei,nodej,nodei_prime,nodej_prime; intireservedjreserve,jreserved,proci,procj; intjreserve_seed,nootheravail node; cc = &central_coord; pe = &proc_pe[proc]; cardinality = pe->msg.cardinality = num_nodes/npe; /** NODEWISE EXCHANGE **/ for ( nodei = 0; nodei < cardinality; nodei++ ){ current_subtour_cost(&proc_pe[proc],proc); /** set up for NODEWISE EXCHANGE **/ for ( i = 0; i < cardinality; i++) pe->msg.perturb_subtour[i] = pe->msg.current_subtour[i]; nodej = drand48(seed) * cardinality; while (nodei == nodej ) ( nodej = drand48(seed) * cardinality; n nodei_prime = ( ( nodei < nodej ) ? nodei : nodej ); /* min(nodei,nodej) */ nodejprime = ( ( nodei > nodej ) ? nodei : nodej ); /* max(nodei,nodej) */ for (k = 0; k < nodei; k++ ) pe->msg.perturb_subtour[k] = pe->msg.current_subtour[k]; -196- Appendix A: Simulation Program M. Tran for ( k = nodei_prime; k <= nodej_prime ;k++) pe->msg.perturb_subtour[k] = pe->msg.current_subtour[k]; for ( k = nodej_prime+1; k < cardinality; k++ ) pe->msg.perturb_subtour[k] = pe->msg.current_subtour[k]; #ifdef SUBTOURS printf(" INTRASUBTOUR RESULTS for processor %d\n",proc); current_subtours(&proc_pe[proc],proc); #endif /** compute cost of (ordered) permuted_subtour[] **/ perturbed_subtour_cost(&proc_pe[proc],proc); /** ANNEALING THE SUBTOUR **/ intra_subtour anneal(&proc_pe[proc],proc); #ifdef SUBDEBUG printf(" INTRASUBTOURS RESULTS for processor %d\n",proc); subtour_result(&proc_pe[proc],proc); #endif )/* nodei */ #ifdef SUBDEBUG printf(" INTRASUBTOURS RESULTS for processor %d\n",proc); subtour_result(&proc_pe[proc],proc); #endif return; /** Each processing element performs its Lij exchange and internal annealing **/ /1**************************************************************** edgewise_exchange(pe,proc) struct tsp_pe_struct *pe; int proc; /* pointer to processor 'proc' data */ struct tsp_ccstruct *cc; /* pointer to xy coordinates of map */ struct dspmap_struct * ma]p; float xdelta,ydelta,distance; /* map variables */ float chgincost; /* chg in cost for Lij or Tij */ float temp,accept_prob,rng; double sqrtO,pow(); /* for the cost of subtours */ double drand48(),exp(); /*returns a non-negative double-precisi -197- M. 
on floating-point values uniformly distributed over the interval [0,1) */
    int i, j, node1, node2, pair1, pair2, node, next_node;
    int cardinality;                          /* cardinality of subtour */
    int p, k, nodei, nodej, nodei_prime, nodej_prime;
    int ireserved, jreserve, jreserved, proci, procj;
    int jreserve_seed, no_other_avail_node;

    cc = &central_coord;
    pe = &proc_pe[proc];
    cardinality = pe->msg.cardinality = num_nodes/npe;

    for ( nodei = 0; nodei < cardinality; nodei++ ){
        current_subtour_cost(&proc_pe[proc],proc);

        /** set up for EDGEWISE EXCHANGE **/
        for ( i = 0; i < cardinality; i++ )
            pe->msg.perturb_subtour[i] = pe->msg.current_subtour[i];
        nodej = drand48(seed) * cardinality;
        while( nodei == nodej ){
            nodej = drand48(seed) * cardinality;
        }
        nodei_prime = ( (nodei < nodej) ? nodei : nodej );   /* min(nodei,nodej) */
        nodej_prime = ( (nodei > nodej) ? nodei : nodej );   /* max(nodei,nodej) */
        for ( k = 0; k < nodei_prime; k++ )
            pe->msg.perturb_subtour[k] = pe->msg.current_subtour[k];
        for ( k = 0; k <= (nodej_prime - nodei_prime); k++ )
            pe->msg.perturb_subtour[nodei_prime+k] = pe->msg.current_subtour[nodej_prime-k];
        for ( k = nodej_prime+1; k < cardinality; k++ )
            pe->msg.perturb_subtour[k] = pe->msg.current_subtour[k];

#ifdef SUBTOURS
        printf(" INTRASUBTOUR RESULTS for processor %d\n",proc);
        current_subtours(&proc_pe[proc],proc);
#endif
        /** compute cost of the (ordered) perturbed subtour **/
        perturbed_subtour_cost(&proc_pe[proc],proc);

        /** ANNEALING THE SUBTOUR **/
        intra_subtour_anneal(&proc_pe[proc],proc);

#ifdef SUBDEBUG
        printf(" INTRASUBTOUR RESULTS for processor %d\n",proc);
        subtour_result(&proc_pe[proc],proc);
#endif
    } /* nodei */

#ifdef SUBDEBUG
    printf(" INTRASUBTOUR RESULTS for processor %d\n",proc);
    subtour_result(&proc_pe[proc],proc);
#endif
    return;
}

/** INTERPROCESSOR COMMUNICATION **/
Intercommunication(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    struct tsp_cc_struct *cc;
    struct tsp_pe_struct *pei;
    struct tsp_pe_struct *pej;
    struct dsp_map_struct *map;          /* pointer to xy coordinates of map */
    float xdelta, ydelta, distance;      /* map variables */
    float chg_in_cost;                   /* change in cost for Lij or Tij */
    float temp, accept_prob, rng;
    double sqrt(), pow();                /* for the cost of subtours */
    double drand48(), exp();             /* drand48() returns a non-negative double-precision value
                                            uniformly distributed over the interval [0,1) */
    int i, j, node1, node2, pair1, pair2, node, next_node;
    int cardinality;                     /* cardinality of subtour */
    int p, k, nodei, nodej, nodei_prime, nodej_prime;
    int ireserved, jreserve, jreserved, proci, procj;
    int jreserve_seed, no_other_avail_node;

    cc = &central_coord;
    pe = &proc_pe[proc];
    cardinality = pe->msg.cardinality = num_nodes/npe;
    proci = proc;
    pei = &proc_pe[proci];

    for ( ireserved = 0; ireserved < cardinality; ireserved++ ){
        wait_with_timeout(&proc_pe[proci]);
        pei->msg.node_state[ireserved] = RESERVED;
        pei->msg.reserved_node[ireserved] = pei->msg.current_subtour[ireserved];
        pei->msg.status = FREE;

        for ( procj = 0; procj < npe; procj++ ){
            if ( procj != proci ){
                pej = &proc_pe[procj];
                wait_with_timeout(&proc_pe[procj]);

                /** Exchange reserved_node[ireserved] with 'cardinality' randomly
                    generated jreserved nodes **/
                for ( j = 0; j < cardinality; j++ ){
                    for ( jreserve = 0; jreserve < cardinality; jreserve++ ){
                        if ( pej->msg.node_state[jreserve] != RESERVED ){
                            jreserve_seed = seed;
                            jreserved = drand48(jreserve_seed)*cardinality;
                            no_other_avail_node = 0;
                            while( (jreserved < jreserve) ||
                                   (pej->msg.node_state[jreserved] == RESERVED) ){
                                jreserve_seed += 5;
                                jreserved = drand48(jreserve_seed)*cardinality;
                                if ( no_other_avail_node == num_nodes ){
                                    printf(" proc %d may not have another available node\n",procj);
                                    jreserved = jreserve;
                                    break;
                                }
                                no_other_avail_node += 1;
                            } /** while **/
                            pej->msg.node_state[jreserved] = RESERVED;
                            pej->msg.reserved_node[jreserved] = pej->msg.current_subtour[jreserved];
#ifdef SUB
                            printf("pei->msg.reserved_node[%d] = %d of processor %d\n",
                                   ireserved, pei->msg.reserved_node[ireserved], proci);
                            printf("pej->msg.reserved_node[%d] = %d of processor %d\n\n",
                                   jreserved, pej->msg.reserved_node[jreserved], procj);
#endif
                            break;
                        } /** if pej **/
                        if ( pej->msg.node_state[cardinality-1] == RESERVED ){
                            printf(" No available node in processor %d\n",procj);
                            goto NextProcessor;
                        }
                    } /** jreserve **/

                    pej->msg.status = FREE;
                    wait_with_timeout(&proc_pe[proci]);

                    /************ pei SUBTOUR ANNEALING *************/
                    /** Set up to perturb the current_subtour **/
                    for ( k = 0; k < cardinality; k++ )
                        pei->msg.perturb_subtour[k] = pei->msg.current_subtour[k];
                    wait_with_timeout(&proc_pe[procj]);
                    /** Nodewise Exchange **/
                    pei->msg.perturb_subtour[ireserved] = pej->msg.reserved_node[jreserved];
#ifdef SUB
                    printf("\n/** Nodewise Exchange **/\n ");
                    printf("PEi: current and perturbed subtours \n");
                    current_subtours(&proc_pe[proci],proci);
#endif /* SUB */
                    pej->msg.status = FREE;

                    /** Get the current cost of the subtour **/
                    current_subtour_cost(&proc_pe[proci],proci);
                    /** Get the perturbed cost of the subtour **/
                    perturbed_subtour_cost(&proc_pe[proci],proci);
                    /** Anneal the subtour **/
                    inter_subtour_anneal(&proc_pe[proci],proci);

                    if ( pei->msg.state_subtour == ACCEPTED ){
#ifdef SUB
                        printf("\npei->msg.state_subtour = ACCEPTED\n ");
                        printf("pei: current and perturbed subtours \n");
                        current_subtours(&proc_pe[proci],proci);
#endif /* SUB */
                        wait_with_timeout(&proc_pe[procj]);

                        /************ pej SUBTOUR ANNEALING *************/
                        /** Set up to perturb the current_subtour **/
                        for ( k = 0; k < cardinality; k++ )
                            pej->msg.perturb_subtour[k] = pej->msg.current_subtour[k];
                        /** Nodewise Exchange **/
                        pej->msg.perturb_subtour[jreserved] = pei->msg.reserved_node[ireserved];
#ifdef SUB
                        printf("\n/** Nodewise Exchange **/\n ");
                        printf("PEj: current and perturbed subtours \n");
                        current_subtours(&proc_pe[procj],procj);
#endif /* SUB */
                        pei->msg.status = FREE;

                        /** Get the current cost of the subtour **/
                        current_subtour_cost(&proc_pe[procj],procj);
                        /** Get the perturbed cost of the subtour **/
                        perturbed_subtour_cost(&proc_pe[procj],procj);
                        /** Anneal the subtour **/
                        inter_subtour_anneal(&proc_pe[procj],procj);

                        if ( pej->msg.state_subtour == ACCEPTED ){    /** pej ACCEPTED **/
#ifdef SUB
                            printf("\npej->msg.state_subtour = ACCEPTED\n ");
                            printf("pej: current and perturbed subtours: \n");
                            current_subtours(&proc_pe[procj],procj);
#endif /* SUB */
                            wait_with_timeout(&proc_pe[proci]);
#ifdef SUB
                            printf("\n\nPEi: BEFORE THE ACTUAL EXCHANGE \n");
                            printf("PEi: current and perturbed subtours \n");
                            current_subtours(&proc_pe[proci],proci);
#endif /* SUB */
                            /************ ACTUAL EXCHANGE for PEi *************/
                            /* get the accepted subtour cost */
                            pei->msg.state_subtour = ACCEPTED;
                            pei->msg.cost_subtour = pei->msg.cost_perturb_subtour;
                            /* update the current_subtour[] */
                            cardinality = pei->msg.cardinality;
                            for ( k = 0; k < cardinality; k++ ){
                                pei->msg.current_subtour[k] = pei->msg.perturb_subtour[k];
                                pei->msg.node_state[k] = AVAILABLE;
                            }
                            /** Set up for the next NODEWISE EXCHANGE **/
                            pei->msg.node_state[ireserved] = RESERVED;
                            pei->msg.reserved_node[ireserved] = pei->msg.current_subtour[ireserved];
#ifdef SUBDEBUG
                            printf("\t PEi: AFTER THE ACTUAL EXCHANGE \n ");
                            printf("\t INTERSUBTOUR RESULTS for (ith) processor %d :\n",proci);
                            subtour_result(&proc_pe[proci],proci);
#endif
#ifdef SUB
                            printf("\n\nPEj: BEFORE THE ACTUAL EXCHANGE \n");
                            printf("PEj: current and perturbed subtours: \n");
                            current_subtours(&proc_pe[procj],procj);
#endif /* SUB */
                            /************ ACTUAL EXCHANGE for PEj *************/
                            pej->msg.state_subtour = ACCEPTED;
                            /* get the accepted subtour cost */
                            pej->msg.cost_subtour = pej->msg.cost_perturb_subtour;
                            /* update the current_subtour[] */
                            cardinality = pej->msg.cardinality;
                            for ( k = 0; k < cardinality; k++ ){
                                pej->msg.current_subtour[k] = pej->msg.perturb_subtour[k];
                                pej->msg.node_state[k] = AVAILABLE;
                            }
#ifdef SUBDEBUG
                            printf("\t PEj: AFTER THE ACTUAL EXCHANGE \n ");
                            printf("\t INTERSUBTOUR RESULTS for (jth) processor %d :\n",procj);
                            subtour_result(&proc_pe[procj],procj);
#endif
                        }else{    /** pej NOT ACCEPTED **/
                            pej->msg.state_subtour = NOTACCEPTED;
                            /** PEj wants the reserved node back **/
                            pej->msg.current_subtour[jreserved] = pej->msg.reserved_node[jreserved];
                            /** Make all nodes of pej available for Nodewise Exchange **/
                            for ( k = 0; k < cardinality; k++ )
                                pej->msg.node_state[k] = AVAILABLE;

                            /** PEj NOT ACCEPTED, so set up for the next NODEWISE EXCHANGE **/
                            pei->msg.state_subtour = NOTACCEPTED;
                            for ( k = 0; k < cardinality; k++ )
                                pei->msg.node_state[k] = AVAILABLE;
                            /** Get the reserved node back **/
                            pei->msg.node_state[ireserved] = RESERVED;
                            pei->msg.current_subtour[ireserved] = pei->msg.reserved_node[ireserved];
#ifdef SUBDEBUG
                            printf(" \t INTERSUBTOUR RESULTS for (ith) processor %d :\n",proci);
                            subtour_result(&proc_pe[proci],proci);
#endif
#ifdef SUBDEBUG
                            printf(" \t INTERSUBTOUR RESULTS for (jth) processor %d :\n",procj);
                            subtour_result(&proc_pe[procj],procj);
#endif
                        } /** pej NOT ACCEPTED **/
                    }else{    /** pei NOT ACCEPTED **/
                        pei->msg.state_subtour = NOTACCEPTED;
                        /** Set up for the next NODEWISE EXCHANGE with another processor **/
                        for ( k = 0; k < cardinality; k++ )
                            pei->msg.node_state[k] = AVAILABLE;
                        /** Get the reserved node back **/
                        pei->msg.node_state[ireserved] = RESERVED;
                        pei->msg.current_subtour[ireserved] = pei->msg.reserved_node[ireserved];

                        /** PEi NOT ACCEPTED, so set up for the next NODEWISE EXCHANGE **/
                        pej->msg.state_subtour = NOTACCEPTED;
                        /** Get the reserved node back **/
                        pej->msg.current_subtour[jreserved] = pej->msg.reserved_node[jreserved];
                        /** Make all nodes available for Exchange **/
                        for ( k = 0; k < cardinality; k++ )
                            pej->msg.node_state[k] = AVAILABLE;
#ifdef SUBDEBUG
                        printf(" \t INTERSUBTOUR RESULTS for (ith) processor %d :\n",proci);
                        subtour_result(&proc_pe[proci],proci);
#endif
#ifdef SUBDEBUG
                        printf(" \t INTERSUBTOUR RESULTS for (jth) processor %d :\n",procj);
                        subtour_result(&proc_pe[procj],procj);
#endif
                    } /** pei NOT ACCEPTED **/
                } /** for j **/
            } /* if procj */
NextProcessor:  { ; }   /** Last processor had no available node.  Check the next processor. **/
        } /** for procj **/
    } /** for ireserved **/
    return;
}

/********************** FUNCTIONS FOR SUBTOUR ******************/

wait_with_timeout(pe)
struct tsp_pe_struct *pe;
{
    static int time_out = 0;

    /** the processor must have been BUSY with the present task **/
    while( (pe->msg.status == BUSY) && (time_out < 5) )
        time_out += 1;          /** do nothing **/
    /** Mark the processor BUSY for the next task **/
    return(pe->msg.status = BUSY);
}

/** Compute the cost of the current subtour **/
current_subtour_cost(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    struct dsp_map_struct *map;          /* pointer to xy coordinates of map */
    float xdelta, ydelta, distance;      /* map variables */
    float current_cost;                  /* cost for a Tij interchange */
    double sqrt(), pow();                /* for the cost of subtours */
    int node1, node2, node;
    int cardinality;                     /* cardinality of subtour */

    pe = &proc_pe[proc];
    cardinality = pe->msg.cardinality = num_nodes/npe;
    current_cost = 0.0;
    for( node = 0; node < cardinality-1; node++ ){      /* compute cost of subtour */
        node1 = pe->msg.current_subtour[node];           /* these msg arrays hold node numbers */
        node2 = pe->msg.current_subtour[(node+1+cardinality)%cardinality];
                                                         /* or coordinates if map_nodes[]; from A1() = tour1.c */
        xdelta = map_nodes[node1].x_coord - map_nodes[node2].x_coord;
        ydelta = map_nodes[node1].y_coord - map_nodes[node2].y_coord;
        /** Euclidean distance, cost of edge **/
        distance = sqrt( (xdelta * xdelta) + (ydelta * ydelta) );
        current_cost += distance;                        /* total cost of current_subtour */
    }
    pe->msg.cost_curr_subtour = current_cost;
    return;
}
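As a reading aid only, the Euclidean cost loop above can be isolated from the simulator's global tsp_pe_struct and map_nodes[] bookkeeping. The following is a minimal standalone sketch, not part of the thesis program; the names point and subtour_cost are illustrative. It accumulates the same open-path cost over cardinality-1 edges that current_subtour_cost() computes.

/* Standalone sketch (illustrative, not part of the simulator). */
#include <math.h>

struct point { double x, y; };

double subtour_cost(nodes, subtour, cardinality)
struct point nodes[];           /* city coordinates, indexed by node number */
int subtour[];                  /* ordered node numbers of one subtour */
int cardinality;                /* number of nodes in the subtour */
{
    double cost = 0.0, dx, dy;
    int k;

    for ( k = 0; k < cardinality-1; k++ ){   /* sum the cardinality-1 edges */
        dx = nodes[subtour[k]].x - nodes[subtour[k+1]].x;
        dy = nodes[subtour[k]].y - nodes[subtour[k+1]].y;
        cost += sqrt(dx*dx + dy*dy);         /* Euclidean edge length */
    }
    return cost;
}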
/** Compute the perturbed subtour cost **/
perturbed_subtour_cost(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    struct dsp_map_struct *map;          /* pointer to xy coordinates of map */
    float xdelta, ydelta, distance;      /* map variables */
    float perturb_cost;                  /* cost for a Tij interchange */
    float chg_in_cost;                   /* change in cost for Lij or Tij */
    double sqrt(), pow();                /* for the cost of subtours */
    int node1, node2, node;
    int cardinality;                     /* cardinality of subtour */

    pe = &proc_pe[proc];
    cardinality = pe->msg.cardinality = num_nodes/npe;

    /** compute cost of the (ordered) perturbed_subtour[] **/
    perturb_cost = 0.0;
    for( node = 0; node < cardinality-1; node++ ){
        node1 = pe->msg.perturb_subtour[node];
        node2 = pe->msg.perturb_subtour[node+1];
        xdelta = map_nodes[node1].x_coord - map_nodes[node2].x_coord;
        ydelta = map_nodes[node1].y_coord - map_nodes[node2].y_coord;
        /** cost of an ordered edge of the perturbed subtour **/
        distance = sqrt( (xdelta * xdelta) + (ydelta * ydelta) );
        perturb_cost += distance;        /* total cost of perturbed subtour */
    }
    pe->msg.cost_perturb_subtour = perturb_cost;
    return;
}

/**** ANNEALING INSIDE A GIVEN SUBTOUR ****/
intra_subtour_anneal(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    struct tsp_cc_struct *cc;
    struct dsp_map_struct *map;          /* pointer to xy coordinates of map */
    float xdelta, ydelta, distance;      /* map variables */
    float chg_in_cost;                   /* change in cost for Lij or Tij */
    float temp, accept_prob, rng;
    double sqrt(), pow();                /* for the cost of subtours */
    double drand48(), exp();             /* drand48() returns a non-negative double-precision value
                                            uniformly distributed over the interval [0,1) */
    int i, node1, node2, node;
    int cardinality;                     /* cardinality of subtour */

    pe = &proc_pe[proc];
    cc = &central_coord;

    /************ ANNEALING ************/
    /** If the change in cost is negative, accept the move;        **/
    /** otherwise accept it with the annealing probability.        **/
    rng = drand48(seed);                 /* uniform over [0,1) */
    pe->msg.rng = rng;                   /* used in the r < exp{-(chg_in_cost)/Tn} condition */
    chg_in_cost = pe->msg.cost_perturb_subtour - pe->msg.cost_curr_subtour;   /* difference in cost */
    temp = pe->msg.temperature;
    if ( temp == 0 ){
        printf(" ERROR ===> CAN'T DIVIDE BY A ZERO TEMPERATURE !!!\n");
        exit();
    }
    accept_prob = ( (chg_in_cost > 0) ? exp(-chg_in_cost/temp) : exp(chg_in_cost/temp) );
    pe->msg.accept_prob = accept_prob;
    if ( (pe->msg.cost_perturb_subtour < pe->msg.cost_curr_subtour)
         || (rng < accept_prob) ){
        pe->msg.state_subtour = ACCEPTED;                     /** accept the perturbed tour **/
        pe->msg.cost_subtour = pe->msg.cost_perturb_subtour;  /* get the accepted subtour cost */
        pe->msg.chg_in_cost = chg_in_cost;                    /* get the decremental cost */
        /* update the current_subtour[] */
        cardinality = pe->msg.cardinality;
        for ( i = 0; i < cardinality; i++ )
            pe->msg.current_subtour[i] = pe->msg.perturb_subtour[i];
    }else{
        pe->msg.state_subtour = NOTACCEPTED;
        pe->msg.cost_subtour = pe->msg.cost_curr_subtour;
        pe->msg.chg_in_cost = 0.0;
    }
    return;
}

/**** ANNEALING BETWEEN SUBTOURS ****/
inter_subtour_anneal(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    struct tsp_cc_struct *cc;
    struct dsp_map_struct *map;          /* pointer to xy coordinates of map */
    float xdelta, ydelta, distance;      /* map variables */
    float chg_in_cost;                   /* change in cost for Lij or Tij */
    float temp, accept_prob, rng;
    double sqrt(), pow();                /* for the cost of subtours */
    double drand48(), exp();
    int i, node1, node2, node;
    int cardinality;                     /* cardinality of subtour */

    pe = &proc_pe[proc];
    cc = &central_coord;

    /************ ANNEALING ************/
    rng = drand48(seed);                 /* uniform over [0,1) */
    pe->msg.rng = rng;                   /* used in the r < exp{-(chg_in_cost)/Tn} condition */
    chg_in_cost = pe->msg.cost_perturb_subtour - pe->msg.cost_curr_subtour;   /* difference in cost */
    temp = pe->msg.temperature;
    if ( temp == 0 ){
        printf(" ERROR ===> CAN'T DIVIDE BY A ZERO TEMPERATURE !!!\n");
        exit();
    }
    accept_prob = ( (chg_in_cost > 0) ? exp(-chg_in_cost/temp) : exp(chg_in_cost/temp) );
    pe->msg.accept_prob = accept_prob;
    if ( (pe->msg.cost_perturb_subtour < pe->msg.cost_curr_subtour)
         || (rng < accept_prob) ){
        pe->msg.state_subtour = ACCEPTED;     /** accept the perturbed tour **/
        pe->msg.chg_in_cost = chg_in_cost;    /* get the decremental cost */
    }else{
        pe->msg.state_subtour = NOTACCEPTED;
        pe->msg.chg_in_cost = chg_in_cost;    /* get the decremental cost */
    }
    return;
}

/** States of the tours **/
current_subtours(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    int i, cardinality;

    pe = &proc_pe[proc];
    cardinality = pe->msg.cardinality;

    printf(" the current subtour is....\n");
    for ( i = 0; i < cardinality; i++ ){
        printf(" %d\t",pe->msg.current_subtour[i]);
        if( i != 0 )
            if( (i % 10) == 0 )
                printf("\n");            /** new line **/
    }
    printf("\n the perturbed subtour is ....\n");
    for( i = 0; i < cardinality; i++ ){
        printf(" %d\t",pe->msg.perturb_subtour[i]);
        if( i != 0 )
            if( (i % 10) == 0 )
                printf("\n");            /** new line **/
    }
    printf("\n");
    return;
}

/** Results of the subtours **/
subtour_result(pe,proc)
struct tsp_pe_struct *pe;       /* pointer to processor 'proc' data */
int proc;
{
    int i, cardinality;

    pe = &proc_pe[proc];
    cardinality = pe->msg.cardinality;

    printf("\t SUBTOUR RESULT: for processor %d ",proc);
    if ( pe->msg.status == FREE ) printf(" FREE \n");
    else printf(" BUSY \n");
    printf("\t\t RNG           = %f\n",pe->msg.rng);
    printf("\t\t PWR           = %f\n",pe->msg.accept_prob);
    printf("\t\t temperature   = %f\n",pe->msg.temperature);
    printf("\t\t cardinality   = %d\n",pe->msg.cardinality);
    printf("\t\t current_cost  = %f\n",pe->msg.cost_curr_subtour);
    printf("\t\t perturb_cost  = %f\n",pe->msg.cost_perturb_subtour);
    printf("\t\t cost_subtour  = %f\n",pe->msg.cost_subtour);
    printf("\t\t chg_in_cost   = %f\n",pe->msg.chg_in_cost);
    if ( pe->msg.chg_in_cost < 0.0 )
        printf("\t\t CHG IN COST < 0\n");
    if ( pe->msg.state_subtour == INIT )
        printf("\t\t subtour_state = INIT\n");
    else if ( pe->msg.state_subtour == ACCEPTED )
        printf("\t\t subtour_state = ACCEPTED\n");
    else
        printf("\t\t subtour_state = NOT ACCEPTED\n");
    printf(" the new subtour is....\n");
    for ( i = 0; i < cardinality; i++ ){
        if( i != 0 )
            if( (i % 10) == 0 )
                printf("\n");            /** new line **/
        printf(" %d\t",pe->msg.current_subtour[i]);    /* from A1() */
    }
    printf("\n\n");
    return;
}

#include <stdio.h>
#include <math.h>
#include "defs.c"
#include "external.c"

/**************** BEGIN OF TOUR2.C ********************/
/** The following function performs steps A5, A6, and A7.    **/
/** Tasks: (1) Reconstruct the new tour from the subtours.   **/
/**        (2) Compute the cost of the new tour.             **/
/**        (3) Perform the global annealing.                 **/
/**        (4) Check for the stopping condition.             **/
int step_A5()
{
    struct tsp_cc_struct *cc;
    struct tsp_pe_struct *pe;
    int i, iter, proc, cardinality, rotate_tour;
    int node, node1, node2;
    float xdelta, ydelta, distance;
    float tempcc, accept_probcc, rngcc;
    float chg_in_costcc, min_cost;
    double drand48();

    cc = &central_coord;

    /** read the processor updates and reconstruct the tour **/
    reconstruct_tour();

    /****************** ANNEALING GLOBALLY ***************/
    /** compute the cost of the new tour **/
    cost_current_tour();

    /** Find the best cost of the tour and the best tour **/
    global_annealing();

#ifdef TOURDEBUG
    tour_results();
#endif

    /**** record of the optimal tour ****/
    opt_tour_record();

#define T2DEBUG
#ifdef T2DEBUG
    if( cc->cost_best_tour < cc->cost_opt_tour[run_num] )
        tour_results();
#endif /* T2DEBUG */

    /** increment iterations and call statistics **/
    update_stats();              /* update the statistical data of the tour */
    cc->iteration += 1.0;        /** next iteration **/

    /** Set up for the next iteration **/
    if ( cc->state_tour != ACCEPTED )
        for ( i = 0; i < num_nodes; i++ )
            cc->soln_tour[i] = cc->best_tour[i];

    /** check for stopping criterion: step_A7 & step_A8 **/
    /** # of iterations > maximum # of iterations        **/
    if( cc->iteration > max_ito )
        return(TRUE);
    return(FALSE);               /** continue to iterate **/
} /** tsp **/

/************************** END TOUR2.C ************************/
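The same accept/reject rule appears in intra_subtour_anneal(), inter_subtour_anneal(), and global_annealing(): a cost-decreasing move is always taken, and a cost-increasing move of size delta is taken with probability exp(-delta/T). The sketch below is a reading aid only, not part of the simulation program; the function name metropolis_accept is illustrative.

/* Standalone sketch (illustrative, not part of the simulator). */
#include <math.h>
#include <stdlib.h>

int metropolis_accept(delta, temperature)
double delta;                   /* perturbed cost minus current cost */
double temperature;             /* current annealing temperature, T > 0 */
{
    double r;

    if ( delta < 0.0 )
        return 1;                               /* downhill move: always accept */
    r = drand48();                              /* uniform on [0,1) */
    return ( r < exp(-delta / temperature) );   /* uphill move: accept with prob exp(-delta/T) */
}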
/************ FUNCTIONS CALLED BY STEP_A5() BEGIN ************/

/** reconstruct the entire TSP tour from the updated subtours **/
reconstruct_tour()
{
    struct tsp_cc_struct *cc;
    struct tsp_pe_struct *pe;
    int i, cardinality;
    int node, proc;

    cc = &central_coord;

    /** read the processor updates and reconstruct the tour **/
    node = 0;
    cardinality = num_nodes/npe;

    /** reconstruct the TSP tour **/
    for ( proc = 0; proc < npe; proc++ ){
        pe = &proc_pe[proc];
        for ( i = 0; i < cardinality; i++ ){
            node = ( node + num_nodes ) % num_nodes;
            cc->soln_tour[node++] = pe->msg.current_subtour[i];
        }
    }
    return;
}

/**************** cost of the current tour **************/
cost_current_tour()
{
    struct tsp_cc_struct *cc;
    struct tsp_pe_struct *pe;
    int proc, node, node1, node2;
    float xdelta, ydelta, distance;

    cc = &central_coord;
    cc->cost_soln_tour = 0;
    for ( proc = 0; proc < npe; proc++ ){
        pe = &proc_pe[proc];
        cc->cost_soln_tour += pe->msg.cost_subtour;
    }
    return;
}

/*************** Global Annealing of the TSP tour **************/
global_annealing()
{
    struct tsp_cc_struct *cc;
    int i;
    float tempcc, accept_probcc, rngcc;
    float chg_in_costcc;
    double drand48();

    cc = &central_coord;
    cc->rngcc = rngcc = drand48(seed);
    chg_in_costcc = cc->chg_in_costcc = cc->cost_soln_tour - cc->cost_best_tour;
    tempcc = cc->temperature;
    if ( tempcc == 0 ){
        printf(" ERROR ===> CAN'T DIVIDE BY A ZERO TEMPERATURE!!!\n");
        exit();
    }
    accept_probcc = ( (chg_in_costcc > 0) ? exp(-chg_in_costcc/tempcc)
                                          : exp(chg_in_costcc/tempcc) );
    cc->accept_probcc = accept_probcc;
    if( (cc->cost_soln_tour < cc->cost_best_tour) || (rngcc < accept_probcc) ){
        cc->state_tour = ACCEPTED;
        cc->not_acc_cnt = 0;
        cc->chg_in_costcc = chg_in_costcc;
        cc->cost_best_tour = cc->cost_soln_tour;
        /** Update the TSP tour **/
        for( i = 0; i < num_nodes; i++ )
            cc->best_tour[i] = cc->soln_tour[i];
    }else{
        cc->state_tour = NOTACCEPTED;
        cc->not_acc_cnt += 1;
        cc->chg_in_costcc = chg_in_costcc;
    }
    return;
}

/********************* update_stats() **************/
update_stats()
{
    struct stats_struct *data;
    struct tsp_cc_struct *cc;
    int i, j;
    unsigned itose, k, offsets, zoos, l;
    FILE *fopen(), *fp;
    static char bestf[] = "best.dat";
    int ito;

    cc = &central_coord;

    if( cc->iteration == 1 ){
        SamplePts = 0;
        cc->state_tour = INIT;
        tour_results();
    }
    itose = cc->iteration;
    k = max_ito / sample_pts;
    if( ((itose % k) == 0) || (itose == 1) ){
        /* history of the cost of best_tour */
        cost_history[SamplePts][run_num] = cc->cost_best_tour;
        /* history of the cost of soln_tour */
        uphill_history[SamplePts][run_num] = cc->cost_soln_tour;
        /******* history of the best_tour and the soln_tour ******/
        for( i = 0; i < num_nodes; i++ ){
            best_history[SamplePts][i] = cc->best_tour[i];
            soln_history[SamplePts][i] = cc->soln_tour[i];
        }
#ifdef T2
        printf("uphill_history[%d][%d] = %f\t",SamplePts,run_num,cc->cost_soln_tour);
        printf("cost_history[%d][%d] = %f\n",SamplePts,run_num,cc->cost_best_tour);
#endif /* T2 */
        itose = cc->iteration;
        printf("SampPts = %d, iter = %d, cost_perb_tour = %f, cost_acc_tour = %f\n",
               SamplePts,itose,cc->cost_soln_tour,cc->cost_best_tour);
        SamplePts++;
    }

#define STDEBUG
#ifdef STDEBUG
    itose = cc->iteration;
    k = max_ito / sample_pts;
    if( ((itose % k) == 0) || (itose == 1) ){
        if( (fp = fopen(bestf,"a+")) == NULL ){
            printf(" best.dat failed to open!!!\n");
            exit();
        }
        if( itose == 1 ){
            fprintf(fp," run_number = %d\n",run_num);
            fprintf(fp,"iteration\t temperature\t cost_soln_tour\t cost_best_tour\n\n");
            fprintf(fp,"%d \t %f\t %f\t %f\n",
                    1,cc->temperature,cc->cost_soln_tour,cc->cost_best_tour);
        }else{
            fprintf(fp,"%f\t %f\t %f\t %f\n",
                    cc->iteration,cc->temperature,cc->cost_soln_tour,cc->cost_best_tour);
        }
        if( itose == max_ito )
            fprintf(fp,"\n\n");
        fclose(fp);
    }
#endif
    return;
}

/********** record of the optimal tour ***********/
opt_tour_record()
{
    struct tsp_cc_struct *cc;
    int i;
    int num_opt_out, iter, opt_out;
    FILE *fopen(), *fp;
    static char optf[] = "opt.dat";

    cc = &central_coord;

    /** Keep track of the optimal tour **/
    if( cc->cost_best_tour < cc->cost_opt_tour[run_num] ){
        cc->cost_opt_tour[run_num] = cc->cost_best_tour;
        cc->opt_temperature[run_num] = cc->temperature;
        cc->opt_iteration[run_num] = cc->iteration;
        for ( i = 0; i < num_nodes; i++ )
            cc->optimal_tour[i] = cc->best_tour[i];
    }

#define OPTDEBUG
#ifdef OPTDEBUG
    /** number of optimum tours to be output **/
    num_opt_out = 5;
    iter = cc->iteration;
    opt_out = max_ito/num_opt_out;
    if( ((iter % opt_out) == 0) || (iter == 1) ){
        if( (fp = fopen(optf,"a+")) == NULL ){
            printf("****** opt.dat failed to open *****\n");
            exit();
        }
        if( cc->iteration == 1.0 ){
            fprintf(fp,"number of pe = %d\n",npe);
            fprintf(fp,"number of nodes = %d\n",num_nodes);
            fprintf(fp,"mapseed = %d\n",mapseed);
            fprintf(fp,"seed = %d\n",seed);
            fprintf(fp,"annealing depth = %f\n",depth);
            fprintf(fp,"\n THE RANDOM INITIAL COST = %f\n",cc->cost_soln_tour);
            fprintf(fp,"\n\n THE RANDOM INITIAL TOUR .... \n");
            for ( i = 0; i < num_nodes; i++ ){
                if( i != 0 )
                    if( (i % 10) == 0 )
                        fprintf(fp,"\n");        /** next line **/
                fprintf(fp,"%d\t",cc->soln_tour[i]);
            }
            fprintf(fp,"\n");
        }
        fprintf(fp," At %dth run\n",run_num);
        fprintf(fp," The optimal tour cost = %f\n",cc->cost_opt_tour[run_num]);
        fprintf(fp," occurs at the optimal iteration = %f\n",cc->opt_iteration[run_num]);
        fprintf(fp," and at the optimal temperature = %f\n",cc->opt_temperature[run_num]);
        fprintf(fp,"\n");
        fprintf(fp,"THE OPTIMAL TOUR ....\n");
        for ( i = 0; i < num_nodes; i++ ){
            if( i != 0 )
                if( (i % 10) == 0 )
                    fprintf(fp,"\n");            /* line feed */
            fprintf(fp,"%d\t",cc->optimal_tour[i]);
        }
        fprintf(fp,"\n\n");
        fclose(fp);
    }
#endif
    return;
}
/**** results of the best tour or the current tour ****/
tour_results()
{
    struct tsp_cc_struct *cc;
    int i;

    cc = &central_coord;

    if ( cc->state_tour == ACCEPTED ){
        printf("\t\n ITERATION ACCEPTED_TOUR RESULTS (in tour2.c):\n ");
        printf("\t\t rngcc            = %f\n",cc->rngcc);
        printf("\t\t accept_probcc    = %f\n",cc->accept_probcc);
        printf("\t\t temperature      = %f\n",cc->temperature);
        printf("\t\t opt_temperature  = %f\n",cc->opt_temperature[run_num]);
        printf("\t\t cost_opt_tour    = %f\n",cc->cost_opt_tour[run_num]);
        printf("\t\t opt_iteration    = %f\n",cc->opt_iteration[run_num]);
        printf("\t\t iteration        = %f\n",cc->iteration);
        printf("\t\t chg_in_costcc    = %f\n",cc->chg_in_costcc);
        if ( cc->state_tour == INIT )
            printf("\t\t state_tour = INIT \n");
        if ( cc->state_tour == ACCEPTED )
            printf("\t\t state_tour = ACCEPTED \n");
        else
            printf("\t\t state_tour = NOT ACCEPTED \n");
        printf("\t\t not_acc_cnt        = %d\n",cc->not_acc_cnt);
        printf("\t\t cost_accepted_tour = %f\n",cc->cost_best_tour);
        printf("\t\t the accepted tour = ....\n");
        for( i = 0; i < num_nodes; i++ ){
            if( i != 0 )
                if( (i % 10) == 0 )
                    printf("\n");        /** next line **/
            printf("%d\t",cc->best_tour[i]);
        }
        printf("\n\n");
    }else{
        printf("\t\n ITERATION PERTURBED_TOUR RESULTS (in tour2.c):\n ");
        printf("\t\t rngcc            = %f\n",cc->rngcc);
        printf("\t\t accept_probcc    = %f\n",cc->accept_probcc);
        printf("\t\t temperature      = %f\n",cc->temperature);
        printf("\t\t cost_opt_tour    = %f\n",cc->cost_opt_tour[run_num]);
        printf("\t\t opt_temperature  = %f\n",cc->opt_temperature[run_num]);
        printf("\t\t opt_iteration    = %f\n",cc->opt_iteration[run_num]);
        printf("\t\t iteration        = %f\n",cc->iteration);
        printf("\t\t chg_in_costcc    = %f\n",cc->chg_in_costcc);
        if ( cc->state_tour == INIT )
            printf("\t\t state_tour = INIT \n");
        if ( cc->state_tour == ACCEPTED )
            printf("\t\t state_tour = ACCEPTED \n");
        else
            printf("\t\t state_tour = NOT ACCEPTED \n");
        printf("\t\t not_acc_cnt         = %d\n",cc->not_acc_cnt);
        printf("\t\t cost_perturbed_tour = %f\n",cc->cost_soln_tour);
        printf("\t\t the perturbed tour = ....\n");
        for( i = 0; i < num_nodes; i++ ){
            if( i != 0 )
                if( (i % 10) == 0 )
                    printf("\n");        /** next line **/
            printf("%d\t",cc->soln_tour[i]);
        }
        printf("\n\n");
    } /* else */
    return;
}

/********** END OF FUNCTIONS CALLED BY STEP_A5() ***********/

#include <stdio.h>
#include <math.h>
#include "defs.c"
#include "external.c"

/************************* BEGIN OF OUTPUT.C ***************/
/** This function outputs the best possible tour, or optimal tour, of the TSP **/
output()
{
    struct tsp_cc_struct *cc;
    int time();
    int upper, lower;
    int i, j;
    FILE *fopen(), *fp;
    static char outfil[] = "out.dat";

    cc = &central_coord;
    final_run_time = time();

#define OUTDEBUG
#ifdef OUTDEBUG
    if( (fp = fopen(outfil,"a+")) == NULL ){
        printf(" ****** out.dat failed to open ***** \n");
        exit();
    }
    fprintf(fp,"\t\t\t** TRAVELING SALESMAN / ANNEALING ALGORITHM **\n\n");
    fprintf(fp,"number of nodes = %d\n",num_nodes);
    fprintf(fp,"mapseed = %d\n",mapseed);
    fprintf(fp,"seed = %d\n",seed);
    fprintf(fp,"\n");
    fprintf(fp,"annealing depth = %f\n",depth);
    fprintf(fp,"init_run_time = %d\n",init_run_time);
    fprintf(fp,"final_run_time = %d\n",final_run_time);
    fprintf(fp,"elapsed_run_time = %d\n",final_run_time-init_run_time);
    fprintf(fp,"\n\n");
    fprintf(fp," At %dth run\n",run_num);
    fprintf(fp," The optimal tour cost = %f\n",cc->cost_opt_tour[run_num]);
    fprintf(fp," occurs at the optimal iteration = %f\n",cc->opt_iteration[run_num]);
    fprintf(fp," and at the optimal temperature = %f\n",cc->opt_temperature[run_num]);
    fprintf(fp,"\n");
    fprintf(fp,"THE WINNING TOUR \n");
    for ( i = 0; i < num_nodes; i++ ){
        if( (i % 10) == 0 )
            fprintf(fp,"\n");
        fprintf(fp," %d\t",cc->optimal_tour[i]);
    }
    fprintf(fp,"\n\n");
    for( i = 0; i < sample_pts; i++ )
        fprintf(fp,"%f\n",cost_history[i][run_num]);
    fprintf(fp,"\n\n");
    fclose(fp);
#endif
    return;
}

#include <stdio.h>
#include <math.h>
#include "defs.c"
#include "external.c"

/******************** BEGIN OF STAT.C ******************/
/** This function outputs the statistical data for every run **/
results()
{
    struct tsp_cc_struct *cc;
    static char total[] = "total.dat";
    FILE *fopen(), *fp;
    float init_tot_cost, best_tot_cost, soln_tot_cost;
    float opt_tot_cost, opt_tot_iter, opt_tot_temp;
    float opt_mean, iter_mean, temp_mean, soln_mean, best_mean, std_dev;
    float mean_differ;
    int itr, sample, iter;

    cc = &central_coord;

    if( (fp = fopen(total,"a+")) == NULL ){
        printf("total.dat failed to open\n");
        exit();
    }
    fprintf(fp,"\t **** INPUT DATA ****\n\n");
    fprintf(fp,"\t npe \t num_nodes \t max_ito \n");
    fprintf(fp,"\t %d \t %d \t %f\n\n",npe,num_nodes,max_ito);
    fprintf(fp,"\t mapseed \t seed \t depth \n");
    fprintf(fp,"\t %d \t %d \t %f\n\n",mapseed,seed,depth);
    fprintf(fp,"\t num_runs \t sample_pts\n");
    fprintf(fp,"\t %d \t %d \n",num_runs,sample_pts);
    fprintf(fp,"\n");

    /** Calculate the average optimal cost, or min{best cost}, iteration, and temperature **/
    opt_tot_cost = 0.0;
    opt_tot_iter = 0.0;
    opt_tot_temp = 0.0;
    for ( run_num = 0; run_num < num_runs; run_num++ ){
        opt_tot_cost += cc->cost_opt_tour[run_num];
        opt_tot_iter += cc->opt_iteration[run_num];
        opt_tot_temp += cc->opt_temperature[run_num];
    }
    opt_mean  = opt_tot_cost/num_runs;
    iter_mean = opt_tot_iter/num_runs;
    temp_mean = opt_tot_temp/num_runs;
    fprintf(fp,"The average optimal cost = %f\n",opt_mean);
    fprintf(fp,"occurs at the average temperature = %f\n",temp_mean);
    fprintf(fp," and at the average iteration = %f\n\n",iter_mean);

    fprintf(fp,"\t\t\t COST_HISTORY STATISTICS \n");
    fprintf(fp,"Iter\t Soln_Mean\t Best_Mean \t Std_dev \n\n");

    /** Calculate the average initial cost **/
    init_tot_cost = 0.0;
    for ( run_num = 0; run_num < num_runs; run_num++ ){
        init_tot_cost += cc->cost_init_tour[run_num];
    }
    soln_mean = best_mean = init_tot_cost / num_runs;
    mean_differ = std_dev = 0.0;
    fprintf(fp,"%d \t %.2f \t %.2f\t %.2f\t %.2f\n",
            1,soln_mean,best_mean,mean_differ,std_dev);

    /** Ensemble-average the best costs and the current costs **/
    for( sample = 0; sample < SamplePts; sample++ ){
        best_tot_cost = 0.0;     /** must be re-initialized at every sample point **/
        soln_tot_cost = 0.0;
        for ( run_num = 0; run_num < num_runs; run_num++ ){
            /* total of the cost_best_tour history */
            best_tot_cost += cost_history[sample][run_num];
            /* total of the cost_soln_tour history */
            soln_tot_cost += uphill_history[sample][run_num];
        }
        /* average of the cost_soln_tour history */
        soln_mean = soln_tot_cost / num_runs;
        /* average of the cost_best_tour history */
        best_mean = best_tot_cost / num_runs;

        best_tot_cost = 0;
        for ( run_num = 0; run_num < num_runs; run_num++ ){
            /* sum of squared deviations of the cost_best_tour history */
            best_tot_cost += ( cost_history[sample][run_num] - best_mean )
                           * ( cost_history[sample][run_num] - best_mean );
        }
        std_dev = sqrt(best_tot_cost/num_runs);

        /** the iteration starts at 2 and advances by (max_ito/sample_pts) per sample **/
        iter = (max_ito/sample_pts)*sample;
        fprintf(fp,"%d \t %.2f \t %.2f \t %.2f\n",
                iter,soln_mean,best_mean,std_dev);
    } /* sample */
    fprintf(fp,"\n\n\n");
    fclose(fp);
    return;
}
/*********************** END OF STAT.C ***********************/
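For reference, the per-sample statistics that results() writes to total.dat reduce to an ensemble mean and a sample standard deviation of the best-tour cost over the num_runs runs. The sketch below is illustrative only and is not part of the thesis program; the name ensemble_stats and its parameter list are assumptions.

/* Standalone sketch (illustrative, not part of the simulator). */
#include <math.h>

void ensemble_stats(costs, num_runs, mean, std_dev)
float costs[];                  /* cost_history[sample][*] for one sample point */
int num_runs;                   /* number of independent annealing runs */
float *mean, *std_dev;          /* returned ensemble mean and standard deviation */
{
    float sum = 0.0, sumsq = 0.0;
    int r;

    for ( r = 0; r < num_runs; r++ )
        sum += costs[r];
    *mean = sum / num_runs;

    for ( r = 0; r < num_runs; r++ )         /* sum of squared deviations */
        sumsq += (costs[r] - *mean) * (costs[r] - *mean);
    *std_dev = sqrt(sumsq / num_runs);
}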
Bibliography

[Adams86] M.B. Adams and R.M. Beaton, "Automated Mission and Trajectory Planning," Pilot's Associate Planning Conference, Aspen, Colorado, September 16, 1986.
[Aho74] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[Amda67] G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. AFIPS, 30, 483-485, 1967.
[Beard59] J. Beardwood, J.H. Halton, and J.M. Hammersley, "The Shortest Path Through Many Points," Proc. Cambridge Philos. Soc., 55, 299-327, 1959.
[Bell68] M. Bellmore and G.L. Nemhauser, "The Traveling Salesman Problem: A Survey," Operations Research, 16, 538-558, 1968.
[Bert85] D. Bertsekas and J. Tsitsiklis, "Distributed Asynchronous Optimal Routing in Data Networks," MIT LIDS-P-1452, 1985.
[Bod83] L. Bodin, A.A. Golden, and M. Ball, "Routing and Scheduling of Vehicles and Crews: The State of the Art," Comp. and Oper. Res., 10, 69-211, 1983.
[Bon84] E. Bonomi and J. Lutton, "The N-City Traveling Salesman Problem: Statistical Mechanics and the Metropolis Algorithm," SIAM Review, 26, No. 4, 551-568, October 1984.
[Cern85] V. Cerny, "Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm," J. Opt. Theory Appl., 45, 41-51, 1985.
[Cerv87] J.H. Cervantes, "The Boltzmann Machine and Its Application to the Traveling Salesman Problem," S.B. Thesis, Dept. of EECS, Massachusetts Institute of Technology, June 1987.
[Chen75] T.C. Chen, "Overlap and Pipeline Processing," in Introduction to Computer Architecture, H. Stone, Ed., SRA, 375-431, 1975.
[Chris76] N. Christofides, "Worst-Case Analysis of a New Heuristic for the Traveling Salesman Problem," Report 388, Graduate School of Industrial Administration, Carnegie Mellon University, February 1976.
[Con67] R.W. Conway, W.L. Maxwell, and L.W. Miller, Theory of Scheduling, Addison-Wesley, Reading, MA, 1967.
[Cook71] S.A. Cook, "The Complexity of Theorem-Proving Procedures," Proc. 3rd Annual ACM Symp. on Theory of Computing, Association for Computing Machinery, New York, 151-158, 1971.
[Crane86] R.L. Crane, M. Minkoff, K.E. Hillstrom, and S.D. King, "Performance Modeling of Large-Grained Parallelism," Technical Memorandum No. 63, Argonne National Laboratory, Argonne, Illinois, March 1986.
[Dant54] G.B. Dantzig, D.R. Fulkerson, and S.M. Johnson, "Solution of a Large-Scale Traveling Salesman Problem," Operations Research, 2, 393-410, 1954.
[Dav69] P.S. Davis and T.L. Ray, "A Branch-Bound Algorithm for the Capacitated Facilities Location Problem," Nav. Res. Log. Q., 16, 331-344, 1969.
[Deut85] O.L. Deutsch, J.V. Harrison, and M.G. Adams, "Heuristically-Guided Planning for Mission Control/Decision Support," Technical Report, The Charles Stark Draper Laboratory, Cambridge, MA, 1985.
[Eag89] D.L. Eager et al., "Speedup Versus Efficiency in Parallel Systems," IEEE Transactions on Computers, 38, No. 3, March 1989.
[Fed80] A. Federgruen and P. Zipkin, "A Combined Vehicle Routing and Inventory Allocation Problem," Technical Report No. 345A, Graduate School of Business, Columbia University, June 1980.
[Ford62] L.R. Ford and D.R. Fulkerson, Flows in Networks, Princeton University Press, Princeton, NJ, 1962.
[Gar79] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, 1979.
[Gem84] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Trans. on Pattern Analysis and Machine Intelligence, November 1984.
[Greene84] J.W. Greene, "Simulated Annealing without Rejected Moves," Proceedings IEEE Conference on Computer Design: VLSI in Computers, Port Chester, NY, October 1984.
[Gold86] B.L. Golden and C.C. Skiscim, "Using Simulated Annealing to Solve Routing and Location Problems," Naval Research Logistics Quarterly, 33, 261-279, 1986.
[Harp85] R.E. Harper and J. Lala, "A Fault-Tolerant Parallel Processor," Technical Report, The Charles Stark Draper Laboratory, Cambridge, MA, 1985.
[Harp87] R.E. Harper, "Critical Issues in Ultra-Reliable Parallel Processing," Ph.D. Thesis, Dept. of Aeronautics and Astronautics, Massachusetts Institute of Technology, June 1987.
[Haj85] B. Hajek, "Cooling Schedules for Optimal Annealing," Dept. of EE, University of Illinois at Urbana-Champaign, 1985.
[Horo81] E. Horowitz and A. Zorat, "The Binary Tree as an Interconnection Network: Applications to Multiprocessor Systems and VLSI," IEEE Trans. on Computers, 30, No. 4, April 1981.
[Held62] M. Held and R.M. Karp, "A Dynamic Programming Approach to Sequencing Problems," SIAM, 10, 196-210, 1962.
[Karp72] R.M. Karp, "Reducibility Among Combinatorial Problems," in Complexity of Computer Computations, R.E. Miller and J.W. Thatcher, Eds., Plenum, New York, 85-103, 1972.
[Karp77] R.M. Karp, "Probabilistic Analysis of Partitioning Algorithms for the Traveling Salesman Problem," SIAM J. Computing, 8, 561-573, 1977.
[Kim86] S. Kim, "A Computational Study of a Parallel Simulated Annealing Algorithm for the Traveling Salesman Problem," S.M. Thesis, Dept. of Aeronautics and Astronautics, Massachusetts Institute of Technology, June 1986.
[Kirk82] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing," IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 1982.
[Kirk83] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing," Science, 220, 671-680, May 1983.
[Klee80] V. Klee, "Combinatorial Optimization: What Is the State of the Art?" Math. Oper. Res., 5, 1-26, 1980.
[Kron85] L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms, John Wiley & Sons, 1985.
[Kung76] H.T. Kung, "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," in Algorithms and Complexity, Academic Press, 153-200, 1976.
[Law85] E.L. Lawler et al., Eds., The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization, John Wiley & Sons, 1985.
[Law76] E.L. Lawler, Combinatorial Optimization: Networks and Matroids, Holt, Rinehart & Winston, New York, 1976.
[Len75] J.K. Lenstra and A.H.G. Rinnooy Kan, "Some Simple Applications of the Traveling Salesman Problem," Oper. Res. Quart., 24, 717-733, 1975.
[Len81] J.K. Lenstra and A.H.G. Rinnooy Kan, "Complexity of Vehicle Routing and Scheduling Problems," Networks, 11, 221-227, 1981.
[Lin65] S. Lin, "Computer Solutions of the Traveling Salesman Problem," Bell System Technical Journal, 44, 2245-2269, 1965.
[Lin73] S. Lin and B.W. Kernighan, "An Effective Heuristic Algorithm for the Traveling Salesman Problem," Operations Research, 21, 498-516, 1973.
[Lit63] J.D.C. Little, K.G. Murty, D.W. Sweeney, and C. Karel, "An Algorithm for the Traveling Salesman Problem," Operations Research, 11, 972-989, 1963.
[Man64] A.S. Manne, "Plant Location Under Economies of Scale: Decentralization and Computation," Management Science, 11, 213-235, 1964.
[Met53] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087-1092, 1953.
[Mit85] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli, "Convergence and Finite-Time Behavior of Simulated Annealing," UC Berkeley Memo No. UCB/ERL M85/23, March 1985.
[Moh82] J. Mohan, "A Study in Parallel Computation: The Traveling Salesman Problem," Dept. of CS, Carnegie Mellon University, 1982.
[Park83] R.G. Parker and R.L. Rardin, "The Traveling Salesman Problem: An Update of Research," Naval Res. Log. Quart., 30, 69-96, 1983.
[Poly86] C.D. Polychronopoulos, "On Program Restructuring, Scheduling, and Computation for Parallel Systems," Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1986.
[Rom85] F. Romeo and A.L. Sangiovanni-Vincentelli, "Probabilistic Hill Climbing Algorithms: Properties and Applications," Proc. 1985 Chapel Hill Conference on VLSI, 393-417, May 1985.
[Ross86] Y. Rossier, M. Troyon, and T. Liebling, "Probabilistic Exchange Algorithms and Euclidean Traveling Salesman Problems," OR Spektrum, 8, No. 3, 151-164, 1986.
[Sch84] R.B. Schnabel, "Parallel Computing in Optimization," Dept. of CS, University of Colorado, CU-CS-282-84, October 1984.
[Simp69] R. Simpson, "Scheduling and Routing Models for Airline Systems," unpublished report, Flight Transportation Laboratory, MIT, 1969.
[Tsit84] J.N. Tsitsiklis, "Problems in Decentralized Decision Making and Computation," Ph.D. Thesis, Department of EECS, MIT, 1984.
[Tsit85] J.N. Tsitsiklis, "Markov Chains with Rare Transitions and Simulated Annealing," Dept. of EECS, MIT LIDS Memo, August 1985.
[van87] P.J.M. van Laarhoven and E.H.L. Aarts, Simulated Annealing: Theory and Applications, Kluwer, Dordrecht, 1987.
[Vec83] M.P. Vecchi and S. Kirkpatrick, "Global Wiring by Simulated Annealing," IEEE Trans. on Computer-Aided Design, CAD-2, 259-271, 1983.
[Wag58] H.M. Wagner and T.M. Whitin, "Dynamic Version of the Economic Lot Size Model," Management Science, 5, 89-96, 1958.

LIST OF DISTRIBUTION

INTERNAL:
Copies   Name               MS    Room
20       Mua Tran           3E    3443A
10       Rick Harper        3E    2466A
1        Mark Busa          3E    3441A
1        Mark Dzwonczyk     3E    3440A
1        Paul Mukai         3E    3435A
1        Ken Jaskowiak      3E    1452A
1        Nghia Nguyen       20    3130
1        Tuan Le            63    2325D
1        John Deyst         92    8105C
1        James Cervantes    3B    3463
1        Doug Fuhry         2B    2463
1        Roger Hain         2B    2451
1        Education Office   5-7   5286

EXTERNAL:
Professor Wallace E. Vander Velde, Room MIT-33-109