Department of Computer Science

advertisement
A Genetic Algorithm for the Optimisation of a
Multiprocessor Computer Architecturere
C. J. Burgess
December 1995
CSTR-95-024
University of Bristol
Department of Computer Science
Also issued as ACRC-95:CS-024
A Genetic Algorithm for the Optimisation of a
Multiprocessor Computer Architecture
C. J. Burgess
Department of Computer Science, University of Bristol,
Bristol, BS8 1TR, England.
Abstract
Many computer problems today need the computer power that is only
available using large scale parallel processing. For a signicant number of
these problems, the density of the global communications between the individual processors dominates the performance of the whole parallel implementation on a distributed memory multiprocessor system. In these cases,
the design of the interconnection network for the processors is known to play
a signicant part in the ecient implementation of real problems. Networks
of processors connected as an irregular conguration have the advantage
that they satisfy the best known criteria for producing congurations that
perform well on real applications. This paper presents a new genetic algorithm for the generation of optimal irregular congurations, which is a
challenging problem since there is no obvious representation which allows
all the main genetic operations to be implemented eectively.
1 Introduction
Many computer problems today need the computer power that is only available using large scale parallel processing. For a signicant number of these problems, the
density of the global communications between the individual processors dominates
the performance of the whole parallel implementation on a distributed memory
multiprocessor system. When a problem with a global communication pattern is
implemented, every processor will need to communicate with the all other processors and the overheads that result from this global communication increase as
more processors are added. There are a large number of problems, from a wide
variety of disciplines, which have these global communication requirements, for example: free Lagrangian codes; boundary element methods; radiosity and spectral
methods in graphics; and the modelling of molecular interactions.
Networks of processors connected as an irregular conguration have the advantage that they satisfy the best known criteria for producing congurations that
perform well. Some of the best performance gures for real problems requiring
global communication, which have been implemented on a network of T800 transputers, have been achieved using irregular congurations [5, 6]. The design of these
networks requires nding an optimal conguration over a very large search space.
In this paper, congurations are reported that are better than those produced by
any previous method, including other genetic algorithm approaches, and many,
for the smaller networks, are very close to the theoretical limits. The problem also
poses some interesting problems for the application of genetic algorithms, particularly with respect to nding a suitable representation of the problem to support
the eective use of traditional genetic operators.
1
2 Theoretical limits related to optimal irregular congurations
Each processor in a distributed memory multiprocessor architecture has a number
of links available for interconnection with the other processors in the system. If
this number of links is not restricted then the system may be fully interconnected.
In practice, however, processors have a limited number of links available, which is
typically 4 or 6, although processors with more links do exist.
There are a wide variety of regular congurations that have been proposed
for the interconnection of processors, in addition to the irregular congurations
considered in this paper. Two important metrics for measuring the eectiveness of
these congurations are the diameter and the average interprocessor distance. The
diameter, Diam , is the largest number of links in the shortest path connecting
any two processors in the network. The average interprocessor distance, Avg ,
is the average number of links that occur in the shortest paths between any two
processors. Thus the average interprocessor distance is equivalent to the average
number of links a message has to cross when passing from any source processor to
its desired destination processor.
It is important to show how the maximum number of processors that can
be interconnected in a conguration of a certain diameter can be derived [7].
An upper bound on this value may be obtained from extremal graph theory [2].
Assuming that every processor in the conguration has the same number of links
(or valency), , the upper bound on the maximum number of processors in the
conguration, usually called the Moore limit, can be found by counting the number
of processors which are at distances of 0 1 2
Diam links from a given
processor. This gives at most:
1 + + ( , 1)
+ ( , 1)DDiam,1
(1)
processors for a diameter Diam.
Graphs achieving this bound are called Moore graphs. Biggs [1] shows that
(except for Diam =1 or =2), Moore graphs can exist only for the cases: Diam=2
and =3, 7, or 57; but in all other cases, graphs can be found that approach these
bounds.
In order to provide a practical parallel processing system, the system must have
access to Input/Output facilities. Most systems achieve this by designating one of
the processors as the system controller with the responsibilities of providing this
Input/Output interface with the other processors performing the actual processing
of the problem. The inclusion of a system controller within a conguration may
be achieved either by the provision of an additional link to one or more processors
or by using some of the existing links.
For practical irregular congurations, two existing links from the conguration are used for connection to the system controller, since as the number of links
per processor is normally even, using one or three links would leave another link
unconnected. In such a conguration of processors, each with valency ,
processors will have , 1 links available and ( , 2) processors will have links
D
D
;
;
;:::;
D
:::
D
D
D
n
two
n
2
available for interconnection with the other processors. This results in a smaller
upper limit on the maximum number of processors that can be obtained, by considering the number of processors at dierent distances from a processor connected
to the system controller. Since such a processor has ( , 1) links available for
interconnection with other processors, this gives the Dnumber of processors as:
1 + ( , 1)
+ ( , 1) Diam
(2)
It is possible to consider an optimal irregular conguration of n processors as
a combination of two nodes containing the system controller connections , which
will be called the system nodes, and , 2 other nodes. The two system nodes can
then be considered to be connected to all the other nodes by a spanning tree with a
root node with one less link available, as represented by formula (2), and the other
, 2 nodes by a spanning tree with root node of normal valency, as represented
by formula (1). It is then possible to calculate the average interprocessor distance,
Avg , of this conguration, and the number of shortest paths Diam , which are
equal to the diameter. These values, which are the theoretical limits for optimal
congurations for various numbers of processors, are included in Table 1. The
Moore limits for a valency of 4 are 40 processors for a diameter of 3 and 121
processors for a diameter of 4.
It is important to realise that these are theoretical limits on the metrics for the
best possible congurations, and do not constitute a proof that such congurations
exist.
:::
n
n
D
P
3 Generating algorithm
3.1 Introduction
There are two major complications associated with this problem which make it
interesting from the genetic algorithm viewpoint. Firstly, it does not seem possible
to nd a representation of the network that supports a crossover operator that still
maintains a valid conguration, in the sense of preserving the correct number of
links on each processor. In practise, the new strings formed are made valid using
a network repair function. The second complication is that, even after applying
the repair function, the solutions produced by the crossover are nearly always
very much inferior to any previous solutions, unless a very long way from the
optimal, and thus some additional method must be used to produce comparable
new solutions. In this algorithm, a very large use is made of mutations in order
to achieve this.
3.2 Representation
For this problem, the n processors in the network are numbered from 0 to n-1,
with processors 0 and , 1 being the two system processors. Every link from a
processor is then given a dierent numeric value. Thus for a valency of 4, processor
0 will have links 0, 1, 2 and 3; processor 1 will have links 4,5,6 and 7 etc. The
system links are then the lowest and highest numbered links. A link between
two processors is then represented by a pair of integers, e.g. the pair (6,13) in
a network of valency 4, would represent a connection between processors 1 and
3, using links 6 and 13. This numbering is the same as that used by Sharpe [8].
n
3
However the representation is dierent, as the network is represented by a string of
pairs of these integers, one pair for each link, ordered by the rst link number and
then by the second link number. Thus each link number only occurs once in the
string and the same representation can be used for any valency. Two alternative
representations have been proposed by Warwick [9] and by Sharpe [8].
3.3 Genetic algorithm
The genetic algorithm used is based on reproduction, one-point random crossover,
mutation and an elitest strategy. There were two major inuences on the design
of the algorithm. The rst is that after the crossover operation, a repair function
normally has to be applied to produce a valid solution. Since the left-hand side
of the string formed after the crossover, up to the crossover point, had previously
been part of a valid network, then the repair function can be conned to the righthand side. This is designed to help the development of schema in the left-hand
sides. A second strong inuence is that, in a previous paper by the author and
Chalmers [4], it has been shown that using only random mutation of the links
without any other genetic operators, which is basically just a random local search
procedure, can achieve very good solutions.
Thus the algorithm used was the following:
1. Construct at random a set of N1 strings which represent valid solutions to
the problem and evaluate each of these solution using a tness function. A
better solution has a lower value of the tness function chosen.
2. A weight is then associated with each of these solutions, which is calculated as 1000 less the dierence between the tness functions of the current
solution and the best solution in the set.
3. A mating pool of N2 distinct strings, (N2 N1 ), is formed by weighted
random selection from the N1 strings. Currently N1 is set at 40 and N2 at
10 for most problems.
4. The new family of strings then consists of a set of elitest strings, which is
currently set at 10% of N1, plus a set of strings formed by crossover, making
N1 strings in total.
5. A single-point crossover operation is used, where the point of crossover is
chosen at random. The crossover selects both the strings from any of the
N2 strings in the mating pool.
6. A repair function is applied to the links in the right-hand half, which will
always succeed. This repair function replaces the rightmost value of any
duplicated link number by the value of one that has not yet been used
anywhere in the string.
7. All the new solutions are evaluated. It is found that the solutions formed
from the crossover and repair function are nearly always very much worse
than the elitest solutions from the previous generation. This seems to be
due to the crossover and not just the implementation of the repair function.
Thus a series of M1 random mutations is applied to the right-part of the
solutions followed by M2 to the whole solution. A string is only replaced by
<
4
its mutation, if this results in a direct improvement in its tness function.
It is found that by choosing suitable values for M1 and M2, some of the
solutions produced by the crossover are then better than the elitest solutions,
which is necessary if the reproduction and crossover operations are to have
any eect. Without this comparatively large use of mutation, the crossover
was not very eective. M1 and M2 are usually set to 10 to 40 times the
number of processors.
8. This whole process is repeated for a number of generations, currently set at
20.
9. Finally, an attempt is made to further improve just the elitest solutions
obtained by using M3 random mutations. M3 is current set to 40 times the
number of processors.
There is no particular constraint, either by the algorithm or by the representation of the network, on the way the tness function is formulated. It was shown
by Sharpe et al [8] that the minimum diameter metric was more important than
the average interprocessor distance for the performance of real applications and
thus the following tness function was used:
= 1 Diam + 2 + 3 Diam
where Diam is the diameter of the conguration,
is the total length of all
the minimum length paths connecting nodes, Diam is the number of interprocessor
distances with length Diam, and i for = 1, 2, 3 are constant weights. This
gives an integer value for the tness function. The average interprocessor distance
of the conguration is then given by
( 2).
Since it is important to ensure that the minimum diameter dominates the value
of the tness function, 1 is kept large. The weights used for the tness function
are:
1 = 1000000, 2 = 1000 and 3 = 1
since past experience showed that the values of
changed more smoothly
than those of Diam, i.e. one mutation of two links can have a very appreciable
change on Diam , but a comparatively small change on
. When evaluating
the tness function, successively higher diameter congurations are constructed
until all the nodes are connected. Thus networks that are not fully connected gain
a very high diameter, chosen to be 9, which ensures that such congurations never
survive.
The choice of which two links to swap is made at random subject to the following constraints, which are to avoid generating congurations which could not
possibly be an improvement. These are that no two processors, of the four processors involved in the two links, can be the same and the system controller links can
not be moved. The combined eect of all these decisions is that the vast majority
of the computation time is spent in evaluating the tness function.
f
f
w
D
w
Length
w
P
D
Length
P
D
w
i
Length= n
w
w
w
w
Length
P
P
Length
4 Results
Although the algorithm described can be applied to any valency or numbers of
system nodes, this paper will restrict itself to a valency of 4 and two system nodes.
5
The results obtained from this genetic algorithm are then presented in tables 1 to
3.
Processors
32
40
53
64
128
Results Achieved
Diam
D
3
4
4
4
5
Diam
P
247
19
176
479
795
Avg
D
2.297
2.461
2.697
2.878
3.511
Theoretical Limit
D
Diam
3
3
4
4
5
Diam
P
244
464
13
365
7
D
Avg
2.291
2.431
2.578
2.821
3.409
Diam
D
3
4
4
4
5
RLS [4]
Diam
P
247
17
171
492
899
D
Table 1: Comparison with theoretical limits and best previously published results
8 13 16
GA
2 2 3
GA [8]
2 2 3
GA [9]
- - RLS [4]
2 2 3
DFS [7]
2 2 3
Hypercube 3 - 4
Torus
3 - 4
Ternary Tree 4 4 5
Mesh
4 - 6
Ring
4 6 8
Chain
7 12 15
Processors
32 40 53
3 4 4
4 4 4
4 - 3 4 4
4 4 4
5y - 6 7 6 6 8
10 11 16 20 26
31 39 52
64 128
4 5
5 6
4 5
4 5
6y 7y
7 12
8 10
14 22
32 64
63 127
Table 2: Comparison of conguration diameters
Table 1 shows a comparison between the results obtained with the theoretical limits described in Section 2 and the best previously published results in the
column marked RLS [4], which represents results obtained by the author and
Chalmers using a random local search algorithm. This table shows that it is possible to obtain the best diameter for a given number of processors, when not too
near a Moore limit. For the smaller sized networks, by comparing the values of
Diam and Avg , it can be seen that these congurations are very close or may
actually be the optimal congurations with respect to smallest average interprocessor distance for the minimum diameter. When compared to the best previously
published results, these results are comparable for the smaller networks, but show
a considerable improvement in the values of Diam for the larger networks. Smaller
values of Avg have been obtained previously for some of these congurations, but
only at the expense of larger diameters [8].
P
D
P
D
6
Avg
2.297
2.456
2.693
2.884
3.523
8
GA
1.28
GA [8]
1.28
GA [9]
RLS [4]
1.28
DFS [7]
1.28
Hypercube
1.50
Torus
1.50
Ternary Tree 1.97
Mesh
1.75
Ring
2.00
Chain
2.63
13
1.55
1.55
1.55
1.55
2.56
3.23
4.31
16
1.68
1.69
1.66
1.73
2.00
2.00
2.91
2.5
4.00
5.31
Processors
32
40
2.30 2.46
2.30 2.47
2.47
2.30 2.46
2.31 2.53
2.50y
3.00 3.50
3.93 4.25
3.88 4.23
8.00 10.00
10.67 13.99
53
64 128
2.69 2.88 3.51
2.72 2.92
2.69 2.88 3.52
2.76 2.92 3.58
- 3.00y 3.50y
- 4.00 5.64
4.77 5.01 6.25
- 5.26 7.94
13.25 16.00 32.00
17.66 21.33 42.66
a
Note that the values marked with a y in the tables for the hypercube for more than 16
processors cannot be compared directly as, unlike the other congurations, they assume more
than four links are available on each processor
a
Table 3: Comparison of average interprocessor distances
These results are also compared with a variety of previous published results
for the minimum diameter for a given number of processors, as shown in Table
2; and with respect to the minimum average interprocessor distance for the best
known diameter for that number of processors as shown in Table 3.
In these tables, the row labelled GA represents results from the genetic algorithm described in this paper, GA [8] and GA [9] show previous results achieved
using genetic algorithms as reported by Sharpe et al. and by Warwick and Tsang.
respectively. The nal row of results labelled DFS [7] are those obtained using a heuristic depth-rst search strategy implemented in Prolog as reported by
Chalmers and Gregory. Also included in the tables for comparison are the diameters and average processor distances of some of the well known regular congurations. A fuller comparison between regular and irregular congurations can be
found in Burgess and Chalmers [3].
These tables show that this genetic algorithm is better than any previous
genetic algorithm published for this problem. When compared with the random
local search procedure described by the author in [4], it appears to be giving similar
results for congurations up to 53 processors, but producing better results for the
larger networks, particularly with respect to values of Diam. It is very dicult
to compare directly the timings of the two methods at this stage, as the results
obtained do not scale easily with the size of families chosen, numbers of mutations
and crossovers, and other adjustable parameters. The author's opinion is that this
should prove to be a better method for obtaining the best possible congurations
for the larger sized networks, and also for reducing the computation time required
to achieve a given optimality, but a detailed comparison requires far more research.
P
7
5 Conclusion
This paper has described the successful application of a recently developed genetic
algorithm for the design of an irregular conguration of processors. It has reported
better results than any previously published for the larger sized congurations.
There are several areas for future research. One of these is the renement of the
genetic algorithm to try and improve on both the results obtained and to reduce
the number of evaluations of the tness function, so that problems with larger
number of processors can be tackled in reasonable computing times. Another
research area is the optimal design of the routes to be used between the processors
in a particular conguration, so as to avoid bottlenecks on any particular links.
6 Acknowledgements
I would like to acknowledge the help of Dr. A.G. Chalmers who rst described the
problem to me, and has provided advice on the computer architecture implications.
References
[1] N. Biggs, Algebraic Graph Theory, Cambridge Tracts in Math. No. 67, Cambridge University Press, London, 1974.
[2] B. Bollobas, Extremal Graph Theory, Academic Press, London, 1978.
[3] C. J. Burgess and A. G. Chalmers, Optimum congurations for multiprocessor
systems, Research Report CSTR-94-18, Department of Computer Science,
University of Bristol, Dec. 1994.
[4] C. J. Burgess and A. G. Chalmers, Optimum transputer congurations for
real applications requiring global communications, 18th World Occam and
Transputer Users Group Conference, IOS Press, Manchester, Apr. 1995.
[5] A. G. Chalmers, S. Fiddes, and D. J. Paddon, Parallel panel methods, 13th
Occam Users Group conference, pp. 313-321, IOS Press, York, 1990.
[6] A. G. Chalmers and D. J. Paddon, Parallel radiosity methods, 4th North
American Transputer Users Group, pages 183-193, IOS Press, Ithaca, NY,
1990.
[7] A. G. Chalmers and S. Gregory, Constructing minimum path congurations
for multiprocessor systems, Parallel Computing, Vol 19, pp. 343-355, 1993.
[8] P. Sharpe, A. Chalmers, and A. Greenwood, Genetic algorithms for generating
minimum path congurations, Microprocessors and Microsystems, Dec. 1994.
[9] T. Warwick and E.P.K. Tsang, Using a Genetic Algorithm to tackle the processor conguration problem, Computer Science Technical Report No. CSM190, University of Essex, 1993.
8
Download