A Genetic Algorithm for the Optimisation of a Multiprocessor Computer Architecturere C. J. Burgess December 1995 CSTR-95-024 University of Bristol Department of Computer Science Also issued as ACRC-95:CS-024 A Genetic Algorithm for the Optimisation of a Multiprocessor Computer Architecture C. J. Burgess Department of Computer Science, University of Bristol, Bristol, BS8 1TR, England. Abstract Many computer problems today need the computer power that is only available using large scale parallel processing. For a signicant number of these problems, the density of the global communications between the individual processors dominates the performance of the whole parallel implementation on a distributed memory multiprocessor system. In these cases, the design of the interconnection network for the processors is known to play a signicant part in the ecient implementation of real problems. Networks of processors connected as an irregular conguration have the advantage that they satisfy the best known criteria for producing congurations that perform well on real applications. This paper presents a new genetic algorithm for the generation of optimal irregular congurations, which is a challenging problem since there is no obvious representation which allows all the main genetic operations to be implemented eectively. 1 Introduction Many computer problems today need the computer power that is only available using large scale parallel processing. For a signicant number of these problems, the density of the global communications between the individual processors dominates the performance of the whole parallel implementation on a distributed memory multiprocessor system. When a problem with a global communication pattern is implemented, every processor will need to communicate with the all other processors and the overheads that result from this global communication increase as more processors are added. There are a large number of problems, from a wide variety of disciplines, which have these global communication requirements, for example: free Lagrangian codes; boundary element methods; radiosity and spectral methods in graphics; and the modelling of molecular interactions. Networks of processors connected as an irregular conguration have the advantage that they satisfy the best known criteria for producing congurations that perform well. Some of the best performance gures for real problems requiring global communication, which have been implemented on a network of T800 transputers, have been achieved using irregular congurations [5, 6]. The design of these networks requires nding an optimal conguration over a very large search space. In this paper, congurations are reported that are better than those produced by any previous method, including other genetic algorithm approaches, and many, for the smaller networks, are very close to the theoretical limits. The problem also poses some interesting problems for the application of genetic algorithms, particularly with respect to nding a suitable representation of the problem to support the eective use of traditional genetic operators. 1 2 Theoretical limits related to optimal irregular congurations Each processor in a distributed memory multiprocessor architecture has a number of links available for interconnection with the other processors in the system. If this number of links is not restricted then the system may be fully interconnected. In practice, however, processors have a limited number of links available, which is typically 4 or 6, although processors with more links do exist. There are a wide variety of regular congurations that have been proposed for the interconnection of processors, in addition to the irregular congurations considered in this paper. Two important metrics for measuring the eectiveness of these congurations are the diameter and the average interprocessor distance. The diameter, Diam , is the largest number of links in the shortest path connecting any two processors in the network. The average interprocessor distance, Avg , is the average number of links that occur in the shortest paths between any two processors. Thus the average interprocessor distance is equivalent to the average number of links a message has to cross when passing from any source processor to its desired destination processor. It is important to show how the maximum number of processors that can be interconnected in a conguration of a certain diameter can be derived [7]. An upper bound on this value may be obtained from extremal graph theory [2]. Assuming that every processor in the conguration has the same number of links (or valency), , the upper bound on the maximum number of processors in the conguration, usually called the Moore limit, can be found by counting the number of processors which are at distances of 0 1 2 Diam links from a given processor. This gives at most: 1 + + ( , 1) + ( , 1)DDiam,1 (1) processors for a diameter Diam. Graphs achieving this bound are called Moore graphs. Biggs [1] shows that (except for Diam =1 or =2), Moore graphs can exist only for the cases: Diam=2 and =3, 7, or 57; but in all other cases, graphs can be found that approach these bounds. In order to provide a practical parallel processing system, the system must have access to Input/Output facilities. Most systems achieve this by designating one of the processors as the system controller with the responsibilities of providing this Input/Output interface with the other processors performing the actual processing of the problem. The inclusion of a system controller within a conguration may be achieved either by the provision of an additional link to one or more processors or by using some of the existing links. For practical irregular congurations, two existing links from the conguration are used for connection to the system controller, since as the number of links per processor is normally even, using one or three links would leave another link unconnected. In such a conguration of processors, each with valency , processors will have , 1 links available and ( , 2) processors will have links D D ; ; ;:::; D ::: D D D n two n 2 available for interconnection with the other processors. This results in a smaller upper limit on the maximum number of processors that can be obtained, by considering the number of processors at dierent distances from a processor connected to the system controller. Since such a processor has ( , 1) links available for interconnection with other processors, this gives the Dnumber of processors as: 1 + ( , 1) + ( , 1) Diam (2) It is possible to consider an optimal irregular conguration of n processors as a combination of two nodes containing the system controller connections , which will be called the system nodes, and , 2 other nodes. The two system nodes can then be considered to be connected to all the other nodes by a spanning tree with a root node with one less link available, as represented by formula (2), and the other , 2 nodes by a spanning tree with root node of normal valency, as represented by formula (1). It is then possible to calculate the average interprocessor distance, Avg , of this conguration, and the number of shortest paths Diam , which are equal to the diameter. These values, which are the theoretical limits for optimal congurations for various numbers of processors, are included in Table 1. The Moore limits for a valency of 4 are 40 processors for a diameter of 3 and 121 processors for a diameter of 4. It is important to realise that these are theoretical limits on the metrics for the best possible congurations, and do not constitute a proof that such congurations exist. ::: n n D P 3 Generating algorithm 3.1 Introduction There are two major complications associated with this problem which make it interesting from the genetic algorithm viewpoint. Firstly, it does not seem possible to nd a representation of the network that supports a crossover operator that still maintains a valid conguration, in the sense of preserving the correct number of links on each processor. In practise, the new strings formed are made valid using a network repair function. The second complication is that, even after applying the repair function, the solutions produced by the crossover are nearly always very much inferior to any previous solutions, unless a very long way from the optimal, and thus some additional method must be used to produce comparable new solutions. In this algorithm, a very large use is made of mutations in order to achieve this. 3.2 Representation For this problem, the n processors in the network are numbered from 0 to n-1, with processors 0 and , 1 being the two system processors. Every link from a processor is then given a dierent numeric value. Thus for a valency of 4, processor 0 will have links 0, 1, 2 and 3; processor 1 will have links 4,5,6 and 7 etc. The system links are then the lowest and highest numbered links. A link between two processors is then represented by a pair of integers, e.g. the pair (6,13) in a network of valency 4, would represent a connection between processors 1 and 3, using links 6 and 13. This numbering is the same as that used by Sharpe [8]. n 3 However the representation is dierent, as the network is represented by a string of pairs of these integers, one pair for each link, ordered by the rst link number and then by the second link number. Thus each link number only occurs once in the string and the same representation can be used for any valency. Two alternative representations have been proposed by Warwick [9] and by Sharpe [8]. 3.3 Genetic algorithm The genetic algorithm used is based on reproduction, one-point random crossover, mutation and an elitest strategy. There were two major inuences on the design of the algorithm. The rst is that after the crossover operation, a repair function normally has to be applied to produce a valid solution. Since the left-hand side of the string formed after the crossover, up to the crossover point, had previously been part of a valid network, then the repair function can be conned to the righthand side. This is designed to help the development of schema in the left-hand sides. A second strong inuence is that, in a previous paper by the author and Chalmers [4], it has been shown that using only random mutation of the links without any other genetic operators, which is basically just a random local search procedure, can achieve very good solutions. Thus the algorithm used was the following: 1. Construct at random a set of N1 strings which represent valid solutions to the problem and evaluate each of these solution using a tness function. A better solution has a lower value of the tness function chosen. 2. A weight is then associated with each of these solutions, which is calculated as 1000 less the dierence between the tness functions of the current solution and the best solution in the set. 3. A mating pool of N2 distinct strings, (N2 N1 ), is formed by weighted random selection from the N1 strings. Currently N1 is set at 40 and N2 at 10 for most problems. 4. The new family of strings then consists of a set of elitest strings, which is currently set at 10% of N1, plus a set of strings formed by crossover, making N1 strings in total. 5. A single-point crossover operation is used, where the point of crossover is chosen at random. The crossover selects both the strings from any of the N2 strings in the mating pool. 6. A repair function is applied to the links in the right-hand half, which will always succeed. This repair function replaces the rightmost value of any duplicated link number by the value of one that has not yet been used anywhere in the string. 7. All the new solutions are evaluated. It is found that the solutions formed from the crossover and repair function are nearly always very much worse than the elitest solutions from the previous generation. This seems to be due to the crossover and not just the implementation of the repair function. Thus a series of M1 random mutations is applied to the right-part of the solutions followed by M2 to the whole solution. A string is only replaced by < 4 its mutation, if this results in a direct improvement in its tness function. It is found that by choosing suitable values for M1 and M2, some of the solutions produced by the crossover are then better than the elitest solutions, which is necessary if the reproduction and crossover operations are to have any eect. Without this comparatively large use of mutation, the crossover was not very eective. M1 and M2 are usually set to 10 to 40 times the number of processors. 8. This whole process is repeated for a number of generations, currently set at 20. 9. Finally, an attempt is made to further improve just the elitest solutions obtained by using M3 random mutations. M3 is current set to 40 times the number of processors. There is no particular constraint, either by the algorithm or by the representation of the network, on the way the tness function is formulated. It was shown by Sharpe et al [8] that the minimum diameter metric was more important than the average interprocessor distance for the performance of real applications and thus the following tness function was used: = 1 Diam + 2 + 3 Diam where Diam is the diameter of the conguration, is the total length of all the minimum length paths connecting nodes, Diam is the number of interprocessor distances with length Diam, and i for = 1, 2, 3 are constant weights. This gives an integer value for the tness function. The average interprocessor distance of the conguration is then given by ( 2). Since it is important to ensure that the minimum diameter dominates the value of the tness function, 1 is kept large. The weights used for the tness function are: 1 = 1000000, 2 = 1000 and 3 = 1 since past experience showed that the values of changed more smoothly than those of Diam, i.e. one mutation of two links can have a very appreciable change on Diam , but a comparatively small change on . When evaluating the tness function, successively higher diameter congurations are constructed until all the nodes are connected. Thus networks that are not fully connected gain a very high diameter, chosen to be 9, which ensures that such congurations never survive. The choice of which two links to swap is made at random subject to the following constraints, which are to avoid generating congurations which could not possibly be an improvement. These are that no two processors, of the four processors involved in the two links, can be the same and the system controller links can not be moved. The combined eect of all these decisions is that the vast majority of the computation time is spent in evaluating the tness function. f f w D w Length w P D Length P D w i Length= n w w w w Length P P Length 4 Results Although the algorithm described can be applied to any valency or numbers of system nodes, this paper will restrict itself to a valency of 4 and two system nodes. 5 The results obtained from this genetic algorithm are then presented in tables 1 to 3. Processors 32 40 53 64 128 Results Achieved Diam D 3 4 4 4 5 Diam P 247 19 176 479 795 Avg D 2.297 2.461 2.697 2.878 3.511 Theoretical Limit D Diam 3 3 4 4 5 Diam P 244 464 13 365 7 D Avg 2.291 2.431 2.578 2.821 3.409 Diam D 3 4 4 4 5 RLS [4] Diam P 247 17 171 492 899 D Table 1: Comparison with theoretical limits and best previously published results 8 13 16 GA 2 2 3 GA [8] 2 2 3 GA [9] - - RLS [4] 2 2 3 DFS [7] 2 2 3 Hypercube 3 - 4 Torus 3 - 4 Ternary Tree 4 4 5 Mesh 4 - 6 Ring 4 6 8 Chain 7 12 15 Processors 32 40 53 3 4 4 4 4 4 4 - 3 4 4 4 4 4 5y - 6 7 6 6 8 10 11 16 20 26 31 39 52 64 128 4 5 5 6 4 5 4 5 6y 7y 7 12 8 10 14 22 32 64 63 127 Table 2: Comparison of conguration diameters Table 1 shows a comparison between the results obtained with the theoretical limits described in Section 2 and the best previously published results in the column marked RLS [4], which represents results obtained by the author and Chalmers using a random local search algorithm. This table shows that it is possible to obtain the best diameter for a given number of processors, when not too near a Moore limit. For the smaller sized networks, by comparing the values of Diam and Avg , it can be seen that these congurations are very close or may actually be the optimal congurations with respect to smallest average interprocessor distance for the minimum diameter. When compared to the best previously published results, these results are comparable for the smaller networks, but show a considerable improvement in the values of Diam for the larger networks. Smaller values of Avg have been obtained previously for some of these congurations, but only at the expense of larger diameters [8]. P D P D 6 Avg 2.297 2.456 2.693 2.884 3.523 8 GA 1.28 GA [8] 1.28 GA [9] RLS [4] 1.28 DFS [7] 1.28 Hypercube 1.50 Torus 1.50 Ternary Tree 1.97 Mesh 1.75 Ring 2.00 Chain 2.63 13 1.55 1.55 1.55 1.55 2.56 3.23 4.31 16 1.68 1.69 1.66 1.73 2.00 2.00 2.91 2.5 4.00 5.31 Processors 32 40 2.30 2.46 2.30 2.47 2.47 2.30 2.46 2.31 2.53 2.50y 3.00 3.50 3.93 4.25 3.88 4.23 8.00 10.00 10.67 13.99 53 64 128 2.69 2.88 3.51 2.72 2.92 2.69 2.88 3.52 2.76 2.92 3.58 - 3.00y 3.50y - 4.00 5.64 4.77 5.01 6.25 - 5.26 7.94 13.25 16.00 32.00 17.66 21.33 42.66 a Note that the values marked with a y in the tables for the hypercube for more than 16 processors cannot be compared directly as, unlike the other congurations, they assume more than four links are available on each processor a Table 3: Comparison of average interprocessor distances These results are also compared with a variety of previous published results for the minimum diameter for a given number of processors, as shown in Table 2; and with respect to the minimum average interprocessor distance for the best known diameter for that number of processors as shown in Table 3. In these tables, the row labelled GA represents results from the genetic algorithm described in this paper, GA [8] and GA [9] show previous results achieved using genetic algorithms as reported by Sharpe et al. and by Warwick and Tsang. respectively. The nal row of results labelled DFS [7] are those obtained using a heuristic depth-rst search strategy implemented in Prolog as reported by Chalmers and Gregory. Also included in the tables for comparison are the diameters and average processor distances of some of the well known regular congurations. A fuller comparison between regular and irregular congurations can be found in Burgess and Chalmers [3]. These tables show that this genetic algorithm is better than any previous genetic algorithm published for this problem. When compared with the random local search procedure described by the author in [4], it appears to be giving similar results for congurations up to 53 processors, but producing better results for the larger networks, particularly with respect to values of Diam. It is very dicult to compare directly the timings of the two methods at this stage, as the results obtained do not scale easily with the size of families chosen, numbers of mutations and crossovers, and other adjustable parameters. The author's opinion is that this should prove to be a better method for obtaining the best possible congurations for the larger sized networks, and also for reducing the computation time required to achieve a given optimality, but a detailed comparison requires far more research. P 7 5 Conclusion This paper has described the successful application of a recently developed genetic algorithm for the design of an irregular conguration of processors. It has reported better results than any previously published for the larger sized congurations. There are several areas for future research. One of these is the renement of the genetic algorithm to try and improve on both the results obtained and to reduce the number of evaluations of the tness function, so that problems with larger number of processors can be tackled in reasonable computing times. Another research area is the optimal design of the routes to be used between the processors in a particular conguration, so as to avoid bottlenecks on any particular links. 6 Acknowledgements I would like to acknowledge the help of Dr. A.G. Chalmers who rst described the problem to me, and has provided advice on the computer architecture implications. References [1] N. Biggs, Algebraic Graph Theory, Cambridge Tracts in Math. No. 67, Cambridge University Press, London, 1974. [2] B. Bollobas, Extremal Graph Theory, Academic Press, London, 1978. [3] C. J. Burgess and A. G. Chalmers, Optimum congurations for multiprocessor systems, Research Report CSTR-94-18, Department of Computer Science, University of Bristol, Dec. 1994. [4] C. J. Burgess and A. G. Chalmers, Optimum transputer congurations for real applications requiring global communications, 18th World Occam and Transputer Users Group Conference, IOS Press, Manchester, Apr. 1995. [5] A. G. Chalmers, S. Fiddes, and D. J. Paddon, Parallel panel methods, 13th Occam Users Group conference, pp. 313-321, IOS Press, York, 1990. [6] A. G. Chalmers and D. J. Paddon, Parallel radiosity methods, 4th North American Transputer Users Group, pages 183-193, IOS Press, Ithaca, NY, 1990. [7] A. G. Chalmers and S. Gregory, Constructing minimum path congurations for multiprocessor systems, Parallel Computing, Vol 19, pp. 343-355, 1993. [8] P. Sharpe, A. Chalmers, and A. Greenwood, Genetic algorithms for generating minimum path congurations, Microprocessors and Microsystems, Dec. 1994. [9] T. Warwick and E.P.K. Tsang, Using a Genetic Algorithm to tackle the processor conguration problem, Computer Science Technical Report No. CSM190, University of Essex, 1993. 8