HARDWARE IMPLEMENTATION OF A PARALLELIZED
GENETIC ALGORITHM FOR
TASK SCHEDULING
by
VIJAY TIRUMALAI
A THESIS
Submitted in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering in the Department
of Electrical and Computer Engineering
in the Graduate School of
The University of Alabama
TUSCALOOSA, ALABAMA
2006
Submitted by Vijay Tirumalai in partial fulfillment of the requirements for the
degree of Master of Science in Electrical Engineering.
Accepted on behalf of the Faculty of the Graduate School by the thesis committee:
William A. Stapleton, Ph.D.
Keith A. Woodbury, Ph. D.
David J. Jackson, Ph.D.
Kenneth G. Ricks, Ph.D.
Chairperson
David J. Jackson, Ph.D.
Department Head
Date
Ronald W. Rogers, Ph.D.
Dean of the Graduate School
Date
I dedicate this thesis to my family, for their love and support
LIST OF ABBREVIATIONS AND SYMBOLS
NP      Nondeterministic polynomial
GA      Genetic algorithm
TG      Task graph
V       Finite set of vertices
E       Finite set of directed edges
Ti      ith computational task
eij     Directed edge from Ti to Tj
Pi      Static priority of ith computational task
DAG     Directed acyclic graph
MCP     Multiple constrained path
EDF     Earliest deadline first
HLF     Highest level first
HGA     Hardware genetic algorithm
Fi      Fitness value of ith chromosome
F       Average fitness of the population
PMX     Partially matched crossover
PSG     Peer selected graph
msec    Milliseconds
PGA     Parallel genetic algorithm
MPI     Message passing interface
FPGA    Field programmable gate array
NRE     Non-recurring engineering
HDL     Hardware description language
PLD     Programmable logic device
SPLD    Simple PLD
CPLD    Complex PLD
SRAM    Static random access memory
MC      Memory controller module
RNG     Random number generator module
IPG     Initial population generator module
RTG     Read task graph module
FM      Fitness module
SM      Selection module
CMM     Crossover and mutation module
PS      Population sequencer
VHDL    Very high speed integrated circuit hardware description language
CA      Cellular automaton
PHGA    Parallel hardware-based genetic algorithm
ACKNOWLEDGMENTS
I would like to express my gratitude to Dr. Kenneth G. Ricks, my advisor, for his
experienced counsel, and for his support, both financial and moral. Without his guidance,
this thesis would have never been finished. I would also like to thank Dr. David J.
Jackson for providing me with his expertise in the hardware implementation of the thesis,
Dr. Keith A. Woodbury for explaining the intricacies of genetic algorithms and for
providing me the access to the mechanical engineering cluster, and Dr. William A.
Stapleton for guiding me through the parallel implementation of my algorithm. I would
also like to extend my appreciation to my friends Praveen, Pramodh, Smita, Gokul, Jai
and Uday for their invaluable suggestions and continuous moral support. Finally, I would
like to thank my family for their never ending encouragement and spiritual support.
CONTENTS
LIST OF ABBREVIATIONS AND SYMBOLS……………………………………….. iv
ACKNOWLEDGMENTS……………………………………………………………….. vi
LIST OF TABLES……………………………………………………………………...... ix
LIST OF FIGURES………………………………………………………………………. x
ABSTRACT………………………………………………………………….………......xii
1. INTRODUCTION……………………………………………………………………. 1
a. Problem Overview-Task Scheduling….……………………………………………… 3
b. Solving the Scheduling Problem……………………...……………………………… 5
c. Thesis Outline……………………………………………………………………….. 10
2. GENETIC ALGORITHMS…………………………………………………………. 11
a. Overview of Genetic Algorithms……………………………………………………. 11
b. Operational Principles……………………………………………………………......13
c. Basic Operations………..…………………………………………………………... 14
3. SEQUENTIAL GENETIC ALGORITHMS……………………………………....... 20
a. Advanced Operations………………………………………………………………... 20
4. PARAMETRIC STUDY OF GENETIC ALGORITHMS………………………….. 26
a. Experimental Verification………………………………..…………………………. 26
b. Study of GA Configurations………………………………………………………… 29
c. Results and Discussions…….….……………………………………………………. 37
5. PARALLEL GENETIC ALGORITHMS…..……………………………………….. 39
a. Overview of Parallel GA……………………………………………………………. 39
b. Software Implementation of GA……………………………………………………. 42
c. Results and Discussions…………………………………………………………….. 43
6. HARDWARE-BASED GA TASK SCHEDULER MODEL………………………. 46
a. Motivation for the HGA system…………………………………………………….. 46
b. Overview of the Field Programmable Gate Arrays…………………………………. 48
c. Previous Research Work Related to HGA…………………………………………... 50
d. Basic HGA model…………………………………………………………………… 51
e. Overview of the System……………………………………………………………... 52
f. Module Description…………………………………………………………………. 54
g. Pipelining……………………………………………………………………………. 59
h. Results and Discussions…………………………………………………………….. 60
7. PARALLEL HARDWARE-BASED GA TASK SCHEDULER MODEL………… 62
a. PHGA Model………………………………………………………………………... 62
b. Results and Discussions…………………………………………………………….. 64
8. CONCLUSIONS AND FUTURE WORK…………………………………………… 65
a. Results and Conclusions…………………………………………………………….. 65
b. Future Work…………………………………………………………………………. 66
REFERENCES……………………………………………….…..………………………68
APPENDIX A…………………………………………………………………………… 73
LIST OF TABLES
1. Task schedule for random graphs…….……………………………………………....37
2. Task schedules for peer selected graphs……………………………………………...37
3. Cyclone EP1C12Q240 device features and resource utilization……………………. 60
4. Resource utilization for PHGA……………….………………..……………………. 64
5. Speedup for scheduler implementations…………………………………………...... 65
LIST OF FIGURES
1. Task graph and its schedule …………………………………………………….……. 5
2. Search space example…………………….…………………………….…………….. 7
3. Structure of GA……………….……………………………………………………... 14
4. Permutation encoding mechanism..……………………………………….………… 15
5. Roulette wheel selection operation………………………….………………………. 16
6. Single-point crossover operation……………………………………….………….... 17
7. Partially matched crossover operation……………………………………….……… 24
8. Number swapping mutation operation………….……………………………….…. 24
9. Random graphs…………………………….………………………………………... 27
10. Peer selected graphs……………………………………………………….……..….. 28
11. Average execution time (msec) vs population size…………………………………..31
12. Average execution time (msec) vs crossover rate……………………………………33
13. Average execution time (msec) vs mutation rate..………………………………….. 34
14. Average execution time (msec) vs elitism………………………………………….. 35
15. Master-slave implementation ………………………………………………………..40
16. Asynchronous concurrent implementation …………………………………………. 40
17. Distributed or network implementation …………………………………………….. 40
18. Speedup vs number of processors…………………………………………………… 44
19. Structure of an FPGA..……………………………………………………………….49
20. Simple HGA model……………………………..…………………………………....53
21. Coarse-grained pipeline………………………..……………………………………. 59
22. PHGA model with two FMs…………………..…………………………………….. 63
ABSTRACT
Computing systems play a vital role in our society, influencing both industry and
academia, and the spectrum of applications executing on these systems and their
complexity varies widely. Future generations of computing systems are expected to
execute processes which are more complex and larger in size. The tasks comprising these
applications will have many constraints associated with their execution, and scheduling
of such tasks using exact methods will become intractable, since the task scheduling
problem is NP-complete (NP: nondeterministic polynomial time). The NP-completeness of the task scheduling problem has incited the use of
various heuristics to obtain good solutions in a reasonable amount of time. In this thesis
one such meta-heuristic known as a genetic algorithm (GA) is used.
GAs are generic, domain independent search and optimization techniques based
on the theory of evolution. GAs have been successfully applied to many classes of
scheduling problems. However, with the increase in the size of the scheduling problem,
the time complexity of the software implementation of a GA becomes high. To reduce
the time complexity without compromising the solution quality, GAs have been
parallelized. However, parallelization demands additional resources, leading to a poor
cost-performance tradeoff. The advent of hardware technologies and design tools in the field
of reconfigurable devices has enabled us to achieve a better cost-performance tradeoff. This
thesis explores one such approach, in which a parallelized hardware-based GA task
scheduler is implemented and its performance is compared to that of a parallelized
software-based GA.
CHAPTER 1
INTRODUCTION
Computing systems play a vital role in our society, influencing both industry and
academia. They control manufacturing processes, laboratory experiments, automobile
engines, and other industrial processes. The spectrum of their complexity varies widely from the
very simple to the very complex. A process model of computation is often used for
describing the behavior of any application that needs to be executed on such computing
systems. In this model, the complete system application is partitioned into individual
entities, called processes or tasks, each representing a schedulable unit of execution. The
tasks must be scheduled for execution on a given set of processing elements. The
problem of sequencing the tasks according to a certain policy (criteria) depending on the
application, utilizing a given set of available resources is commonly referred to as the
task scheduling problem. The methods used to schedule tasks vary widely and are
dependent upon the architecture of the target processing system as well as the task
characteristics. Some task characteristics that impact the scheduling methodology include
task dependencies, periodicity, task priority, task deadline, task execution time, and task
preemption. Architectural influences on the scheduling methodology might include
uniprocessor and multiprocessor implementations as well as distributed and centralized
approaches.
Future generations of computing problems and applications are expected to
become more complex, distributed and larger in size (number of tasks). The tasks
generally will have many types of constraints associated with them, and scheduling of
such tasks will become nearly intractable using exact methods such as branch and bound
and integer programming techniques (Lae-Jeong and Cheol-Hoon, 1995). It has been
shown that a multiple constrained scheduling problem is NP-complete
and cannot be solved exactly in polynomial time (Korkmaz, Krunz and Tragoudas, 2002
and Zomaya, Ward and Macey, 1999). The NP-completeness of the problem has incited
the application of many heuristics or stochastic search techniques such as genetic
algorithms (GAs), simulated annealing, multi-step algorithms, algorithms using local
search, and list-based heuristics to provide reasonable solutions for restricted instances of
the scheduling problem in a reasonable amount of time (Zomaya et al., 1999 and
Rădulescu and Gemund, 2002). However, as the size of the problem increases, the
performance of the software implementation of these techniques deteriorates, and
additional resources are needed to meet the specified performance requirements. In order
to achieve a better cost-performance tradeoff, much research has focused on exploring the
parallel and hardware implementations of various software scheduling algorithms. It has
been shown that significant performance gain can be achieved by implementing a few or
all of the components of a system utilizing these techniques (Loo, Wells and
Winningham, 2003). Similarly, the focus of this thesis is the improvement of scheduling
performance using parallelization and hardware acceleration.
Problem Overview - Task Scheduling
Computing systems have evolved considerably over the past few decades, and so
has the magnitude and complexity of the work to be processed by them. Although
existing computing systems are highly efficient, a good scheduling policy can improve
the productivity of the system for a given application (Parker, 1995). The scheduling
policy involves optimizing some desired performance criteria (like makespan or space
complexity) by managing the access to and the use of the resources by various consumers
satisfying some pre-defined constraints (like priority or dependency) (El-Rewini, Lewis
and Ali, 1994).
A typical scheduling model has three important characteristics. First, there are
different task qualities and various constraints which dictate their sequence of execution.
Secondly, there are metrics that guide scheduling decisions to achieve a desired
performance level for a given application. Finally, the machine environment on which the
task set has to be scheduled must be defined (Coffman, 1975). There are many open
research scheduling models/problems available in the field of computing. To isolate and
focus on the evaluation of the performance of a hardware-based task scheduler, various
assumptions have been made to simplify the scheduling problem.
In this thesis, the scheduling problem involves determining an execution order for
a task set that maintains predefined priority relationships and predefined partial orders of
execution among the tasks in a uniprocessor environment. The partial order relationships
are defined by a set of precedence constraints that are independent of the priority
relationships among the tasks. The tasks can be independent, i.e. there is no direct
communication between them, else they are dependent. If two tasks are dependent, the
source of communication is called the parent task, and the recipient of communication is
called the child task. Upon completion, a parent task sends output to all its children
simultaneously. A child task receives all input required from its parent(s) before
beginning execution. Each task may have several predecessors, and may not begin
execution until all its predecessors have completed their execution. Such tasks are called
AND tasks and the partial order over them an AND-only dependency graph (Gilles and
Liu, 1995). Once a task begins executing, it executes to completion without interruption.
This is called non-preemptive execution. Each instance of a task is called a job. The
communication delays among tasks are assumed to be negligible.
Based on the above description, a set of partially ordered computational tasks can
be represented by a directed acyclic task graph, TG = (V, E), consisting of a finite
nonempty set of vertices, V, and a finite set of directed edges, E, connecting the vertices.
The collection of vertices, V = {T1, T2, ..., Tm}, represents the set of tasks to be executed,
and the directed edges, E = {eij}, (eij denotes a directed edge from vertex Ti to Tj) implies
a partial ordering or precedence relation, >>, exists between the tasks. That is, if Ti >> Tj
then task Ti must be completed before Tj can be initiated (Hou, Ansari and Ren, 1994).
Each vertex also has a static priority, Pi, assigned to task Ti,
and Pi > Pj implies that a ready job of task Ti will be allocated to an available processor
before a ready job of task Tj. The execution order of ready jobs of two different tasks
having equal priorities is determined arbitrarily. In the case where the priority
relationship implies that Tj >> Ti while the precedence relationship implies that Ti >> Tj,
both relationships cannot be satisfied, and the precedence relationship will override the
priority relationship. The reason for this is the priority relationship Pj > Pi only truly
defines the order in which jobs of tasks Ti and Tj leave the ready queue, whereas the
precedence relations define the order in which they enter the ready queue. The goal of
this research is to generate a feasible schedule which satisfies all the precedence and
priority constraints within the task set. Figure 1 shows a task graph and its respective
schedule, where a higher priority has a higher value.
[Figure shows a six-task graph (tasks A-F with priorities 1-6 and precedence edges) and the resulting schedule C B A E D F]
Figure 1. Task graph and its schedule
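As a concrete illustration of this representation, the following C sketch shows one plausible way to encode such a task graph and to count precedence violations in a candidate schedule; priority mismatches, used later in the fitness evaluation, can be counted analogously. The type and field names, the fixed MAX_TASKS bound, and the adjacency-matrix layout are illustrative assumptions, not the data structures actually used in this thesis.

    #include <stdbool.h>

    #define MAX_TASKS 20   /* assumed upper bound on the number of tasks (cf. Chapter 4) */

    /* TG = (V, E): adjacency matrix for the directed edges plus a static priority Pi
     * for every task Ti.  Field names are illustrative only. */
    typedef struct {
        int  num_tasks;                      /* |V|                                       */
        bool edge[MAX_TASKS][MAX_TASKS];     /* edge[i][j] true  <=>  eij in E (Ti >> Tj) */
        int  priority[MAX_TASKS];            /* Pi, higher value = higher priority        */
    } TaskGraph;

    /* A chromosome: a permutation of task indices, i.e. a candidate schedule. */
    typedef struct {
        int order[MAX_TASKS];
    } Schedule;

    /* Count precedence violations: pairs (Ti, Tj) with Ti >> Tj but Tj scheduled first. */
    int precedence_misses(const TaskGraph *g, const Schedule *s)
    {
        int pos[MAX_TASKS], misses = 0;

        for (int k = 0; k < g->num_tasks; k++)
            pos[s->order[k]] = k;            /* position of each task in the schedule */

        for (int i = 0; i < g->num_tasks; i++)
            for (int j = 0; j < g->num_tasks; j++)
                if (g->edge[i][j] && pos[i] > pos[j])
                    misses++;
        return misses;
    }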
Solving the Scheduling Problem
For a directed acyclic graph (DAG) TG = (V, E), each edge is associated with two
non-negative additive values. In this present study these values represent the number of
priority and precedence constraint mismatches. The problem is to find a path p from a
source node to a destination node such that all the edges along the path produce zero
mismatches by complying with the two constraints. This problem is the well known
Multiple Constrained Path Selection (MCP) problem (Korkmaz et al., 2002). The specific
task scheduling problem studied here is a variation of the MCP problem, wherein the path
passing through all the nodes represents a possible schedule and the algorithm used to
obtain such a path is termed the task scheduling algorithm. It is assumed that an
independent task has an incoming edge from every other node, i.e. an independent task
can be inserted into a path after any other node, based on its priority value and is ready to
be executed at any given instant of time.
Generally speaking, scheduling algorithms vary widely depending on the
architecture of the computing system and the characteristics of the tasks to be scheduled.
Broadly these algorithms have been classified as static and dynamic. When task
characteristics are known a priori, the tasks of the application can be scheduled statically,
i.e. at compile time. When the structure of the application is unpredictable, the scheduling
activities are performed at run-time. Such scheduling is dynamic scheduling. In this
thesis, the scheduling methodology considered is static scheduling. A more general
overview of task scheduling algorithms can be found in (Casavant and Kuhl, 1988) and
(Ramamritham and Stankovic, 1994).
In static task scheduling, the application is represented by a DAG as previously
described. The scheduling problem equates to finding a path from the source task to the
destination task that complies with both the priority and precedence constraints. This path
is called the task schedule for the TG. Solving this specific problem is known to be NP-complete. In fact, there is no efficient polynomial-time algorithm that can find an optimal
solution which simultaneously satisfies both the constraints even for more restricted cases
(Korkmaz et al., 2002). This has led to the development of many algorithms that search
for solution from among all possible solutions. For large task sets, the search space
(solution space containing all the possible solutions) becomes so large that searching such
a solution space for the optimal solution becomes prohibitively expensive and time consuming.
An example of a complex search space is shown in Figure 2. In these cases, the goal
becomes to locate a sub-optimal solution in a reasonable amount of time.
[Figure shows a complex search space with several local maxima and one global maximum]
Figure 2. Search space example (Reade, 2002)
Sub-optimal algorithms can be further classified as approximate and heuristic. For
a sub-optimal approximate algorithm, instead of searching the entire solution space for an
optimal solution, it is satisfactory to obtain a good solution that has been determined
based on some criterion function. There are many approaches that can be applied to static
approximate scheduling which include queuing theory, graph theoretic approaches,
mathematical programming and enumeration, and state-space. The second category under
sub-optimal algorithms is heuristics, which make the most realistic assumptions about a
priori knowledge concerning process and system characteristics. Heuristics make use of
special parameters which affect the system performance in an indirect way, and this
alternate parameter is much easier to monitor or evaluate (Casavant and Kuhl, 1988 and
Kwok and Ahmad, 1996).
A heuristic is said to be better than another heuristic if solutions approach
optimality more often or if a near-optimal solution is obtained in less time. A heuristic
that can optimally schedule a particular task set on a certain target machine may not
produce optimal schedules for other task sets on other machines. As a result, several
heuristics have been proposed, each of which may work under different circumstances.
Since heuristics currently offer the most successful solutions to scheduling
problems, considerable research is being done to create various types of heuristic-based
scheduling algorithms (Zomaya et al., 1999). A detailed description of each of
these algorithms is outside the scope of this thesis, but a general
classification and some of the most popular algorithms need to be mentioned.
Heuristic algorithms may be divided into two main classes: first, general-purpose
optimization algorithms that are independent of the given optimization problem and,
second, heuristic approaches specifically designed for a particular problem. The area
of interest in this thesis is the first class of algorithms. Some widely used techniques that
belong to this class are list scheduling, hill climbing algorithm, simulated annealing and
genetic algorithms. In list scheduling, all the ready tasks are arranged in decreasing order
of priority before the scheduling process begins. During the scheduling process, as the
processor becomes available, the task with the highest priority is selected from the list
and assigned to the processor. The algorithms belonging to this class basically differ in
the method employed to assign the priority to the task. Some popular methods of
assigning priorities are based on earliest deadline first (EDF algorithm), shortest period
first (Rate Monotonic algorithm), highest level first (HLF algorithm) and critical path
(MCP algorithm). Another class of heuristic is the insertion scheduling heuristic which is
an improvement over list heuristics. It tries to assign ready tasks to the idle time slots
resulting from communication delays. Another heuristic studied by researchers is the
mapping heuristic which incorporates adaptive routing to minimize communication
delays in inter-connect topologies (Ramamritham and Stankovic, 1994, Kwok and
Ahmad, 1996, Zomaya et al., 1999, and Correa, Ferreira, and Rebreyend, 1999). The hill-climbing algorithm is typically used to find the global minimum in convex solution
spaces; in practice, however, it often finds only a local minimum.
Simulated annealing offers a way to overcome this major drawback of hill-climbing but
the price to pay is a large computation time. Also, simulated annealing is rather
sequential in nature; its parallelization is quite a difficult task. More distributed
optimization techniques, inherently parallel, have also been considered. Some of them are
closely related to neural network algorithms and evolutionary algorithms (El-Rewini,
Lewis and Ali, 1995 and Talbi and Muntean, 1993). One such class of algorithm is the
meta-heuristic known as a genetic algorithm (GA), a guided random search method
where elements in a given set of solutions are randomly combined and modified
iteratively until some termination condition is achieved (Correa et al., 1999). This
heuristic is being used in this thesis and is discussed further in the next chapter.
Thesis Outline
This thesis is organized into eight chapters. In Chapter 2 a detailed description of
genetic algorithms including their operational principles and applications is provided. In
Chapter 3 the advanced operators used to enhance the performance of the software
implementation of a sequential GA are presented. Chapter 4 presents the parametric
analysis of the GA along with the description of the procedure used to experimentally
verify the functionality of the GA. Chapter 5 presents the motivation for implementing
the parallelized version of the GA in software with a brief overview of the various
parallel models available. The parallelization of GA is then supported by the speedup
results obtained for different sizes of cluster, followed by a discussion on the choice of
cluster size. Chapter 6 discusses the motivation behind the hardware-based scheduling
system, presents previous work related to this field, and a general model of the system. It
also presents the results obtained for various test cases. In Chapter 7, the parallel
implementation of the Hardware Genetic Algorithm (HGA) is presented and the results
obtained for the test cases are presented and discussed. Finally, in Chapter 8 a
performance comparison of the software implementation of a simple sequential GA, a
parallel GA, the hardware implementation of sequential GA and parallel hardware-based
GA is presented, along with possible future work in this field.
CHAPTER 2
GENETIC ALGORITHMS
This chapter begins by providing a brief overview of GAs and their applications,
followed by a detailed description of their operation. Further, a detailed description of
each module and its role in the operation of a GA is presented. Also, various methods of
implementing a GA and its operators are discussed to provide a complete view of the
research area.
Overview of Genetic Algorithms
Genetic algorithms are generic, domain independent search and optimization
techniques which borrow from nature the basic concept of “survival of the fittest and
natural selection”, as stated in Darwin’s evolutionary theory. According to this theory,
the stronger or the more fit individuals survive while the weaker individuals die or get
eliminated. This phenomenon eventually tends to transfer characteristics of the dominant
individuals to the next generation. Over a number of generations, individuals comprising
the population attain features which make them more fit, enabling them to withstand the
external pressure of the environment (Mahmood, 2000). In a way analogous to the
evolutionary process, GAs simulate the process of natural selection and genetic
recombination to guide a highly exploitative search through a coding of a parameter
space for a near optimal solution, avoiding convergence to false peaks (local optima).
GAs are randomized techniques which, based on probabilistic transition rules and
historical information, speculate new search points with expected improvement in
solutions (Zomaya et al., 1999). The reason for their success and for a wide and ever-growing
research field is the combination of power and flexibility along with the robustness and
simplicity the algorithms exhibit. They are being applied successfully to find acceptable
solutions to problems in business, engineering, and science.
Genetic algorithms differ from the traditional problem solving methods in four
major ways described below.
1. They work with a coding of the parameter set rather than the parameters themselves.
2. They work from a population of strings instead of a single point.
3. They use payoff (objective function) information, not derivatives or other auxiliary
knowledge.
4. They use probabilistic transition rules, not deterministic rules.
These characteristics make GAs more robust, powerful and efficient compared to their
traditional counterparts (Goldberg, 1989).
Genetic algorithms have been applied to many applications in the area of
computer science, engineering, business and the social sciences. Some successful
application areas include job shop scheduling, VLSI circuit layout compaction,
transportation scheduling, neural network design, image processing and pattern
recognition, traveling salesman problem, economic forecasting, nonlinear equation
solving, calibration of population migration and many more (Goldberg, 1989).
In the field of task scheduling a considerable amount of research has been done in
the implementation of GAs for obtaining a near optimal solution. GAs have been successfully
applied to the multiprocessing scheduling problem in real-time systems, scheduling jobs
on computational grids, scheduling tasks on parallel processors, and task scheduling on a
heterogeneous computing environment.
Operational Principles
In a GA, the offspring are produced by standard genetic operators: selection,
crossover, and mutation. In each generation, a selection scheme is used to stochastically
select the survivors to the next generation according to their fitness values as evaluated
by a problem-based user defined function. With this artificial evolution, the fitness of the
solutions gradually improves generation by generation. The GA process starts with a
random population and iterates until a termination condition is met (Aporntewan and
Chongstitvatana, 2001).
The operation of a GA can be summarized in the following manner. The first step
in a GA is to encode any possible solution to an optimization problem as a set of strings
and then derive a random initial population, which acts as the first generation from which
the evolution starts. After that the fitness of each individual, also called a chromosome, in
the population is evaluated. Once the fitness has been evaluated, chromosomes are
selected for crossover from the current population. Random characteristics are then
introduced after crossover into the population using mutation, based on some probability.
This process of selection, crossover and mutation continues until a new population has
been generated. Once the new population has been generated its fitness is evaluated and it
replaces the old population. If the termination criterion is not met, the new population
undergoes another iteration of selection, crossover, mutation and fitness evaluation
(Wang, Siegel, Roychowdhury and Maciejewski, 1997).
The basic structure of a simple GA is shown in Figure 3.
1. [Start] Encode and generate initial population
2. [Fitness] Evaluate the fitness of initial population
3. [Test] If termination condition is met stop ELSE
4. [New population] Create a new population
4.1. [Selection]
4.2. [Crossover]
4.3. [Mutation]
4.4. [Fitness]
5. [Test] If termination condition is met stop ELSE
6. [Replace] Update old population
7. [Loop] Go to step 4
Figure 3. Structure of a GA
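To make the structure of Figure 3 concrete, the following toy example in C (the language later used for the software implementation, Chapter 4) runs the same loop on a deliberately trivial problem: evolving a 16-bit string towards all 1s. It illustrates the control flow only, not the scheduler GA of this thesis, and all parameter values are arbitrary.

    /* Toy GA with the structure of Figure 3: evolve a 16-bit string towards all 1s. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define POP  20        /* population size   */
    #define LEN  16        /* chromosome length */
    #define GENS 200       /* generation limit  */
    #define PC   0.8       /* crossover rate    */
    #define PM   0.05      /* mutation rate     */

    static int fitness(const int *c)               /* steps 2 and 4.4: count the 1s */
    {
        int f = 0;
        for (int i = 0; i < LEN; i++)
            f += c[i];
        return f;
    }

    static int roulette(const int *fit, int total) /* step 4.1: roulette wheel selection */
    {
        int r = rand() % (total + 1);
        for (int i = 0; i < POP; i++) {
            r -= fit[i];
            if (r <= 0)
                return i;
        }
        return POP - 1;
    }

    int main(void)
    {
        int pop[POP][LEN], next[POP][LEN], fit[POP];

        srand((unsigned)time(NULL));
        for (int i = 0; i < POP; i++)              /* step 1: random initial population */
            for (int j = 0; j < LEN; j++)
                pop[i][j] = rand() % 2;

        for (int g = 0; g < GENS; g++) {
            int total = 0, best = 0;
            for (int i = 0; i < POP; i++) {        /* evaluate current generation       */
                fit[i] = fitness(pop[i]);
                total += fit[i];
                if (fit[i] > best)
                    best = fit[i];
            }
            if (best == LEN)                       /* steps 3/5: termination test       */
                break;

            for (int i = 0; i < POP; i++) {        /* step 4: create a new population   */
                int a = roulette(fit, total), b = roulette(fit, total);
                int cut = ((double)rand() / RAND_MAX < PC) ? rand() % LEN : LEN;
                for (int j = 0; j < LEN; j++) {
                    next[i][j] = (j < cut) ? pop[a][j] : pop[b][j];  /* 4.2: crossover */
                    if ((double)rand() / RAND_MAX < PM)
                        next[i][j] ^= 1;                             /* 4.3: mutation  */
                }
            }
            memcpy(pop, next, sizeof pop);         /* step 6: replace old population    */
        }

        int best = 0;
        for (int i = 0; i < POP; i++)
            if (fitness(pop[i]) > best)
                best = fitness(pop[i]);
        printf("best fitness: %d out of %d\n", best, LEN);
        return 0;
    }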
The next section provides a detailed description of GA modules and their operation
emphasizing their significance in the process of optimization.
Basic Operations
A GA starts with a pool of feasible solutions (gene pool) to a specific problem,
encoded as strings, known as chromosomes, based on a particular representation scheme.
At each iteration, a set of genetic operators (crossover and mutation) defined over the
population are applied and new populations of solutions are created, with the more fit
solutions more likely to procreate. The basic operations of a GA are described below.
Encoding. This is the initial step in a GA, wherein a specific mechanism is
employed for uniquely mapping chromosome values onto the decision variables of a
given problem. The most commonly used representation for chromosomes in a GA is the
binary code {0, 1}. Other representations include value encoding (ternary, integer and
real valued), tree encoding and permutation encoding. In this thesis, permutation
encoding has been employed, which is mostly used for ordering problems like the
traveling salesman problem and the task ordering problem (Zomaya et al., 1999 and
Wang et al., 1997). An example of permutation encoding is shown in Figure 4 where a
set of individuals are generated by reordering the sequence of numbers 1-7 and each
sequence represents an order of execution of tasks.
Chromosome 1:  1 2 3 4 5 6 7
Chromosome 2:  3 2 1 5 7 6 4
Figure 4. Permutation encoding mechanism
Initial Population Generation. The initial population represents the starting set of
individuals that need to be evolved using the GA. These individuals are usually generated
randomly, but if prior knowledge about the system is known then such information can
be utilized to produce a better initial population. This can help to promote a faster
convergence without biasing the search.
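For a permutation encoding such as that of Figure 4, a random individual can be generated with a Fisher-Yates shuffle, as in the C sketch below. This is a generic illustration using C's rand(); the thesis itself relies on Marsaglia's random number generator (Chapter 4) and also seeds the initial population with priority-sorted individuals (Chapter 3).

    #include <stdlib.h>

    /* Fill chromosome[0..n-1] with a uniformly random permutation of the task
     * indices 0..n-1 (Fisher-Yates shuffle). */
    void random_permutation(int *chromosome, int n)
    {
        for (int i = 0; i < n; i++)
            chromosome[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);     /* pick a slot from the not-yet-fixed prefix */
            int tmp = chromosome[i];
            chromosome[i] = chromosome[j];
            chromosome[j] = tmp;
        }
    }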
Fitness and Objective Function. The objective function provides the mechanism
for evaluating each chromosome. In the case of a minimization problem, as is the case in
this thesis, the more fit individuals will have a lower numerical value for their objective
function. The fitness function converts/normalizes the objective function value,
transforming it into a relative measure of fitness in a given range. This value is in turn
used by the selection mechanism (Zomaya et al., 1999).
Selection. This operation mimics nature’s survival of the fittest mechanism, which
ensures that more fit solutions survive while the weaker ones perish. It is more likely that
a more fit string will produce a higher number of offspring, increasing its chances of
survival. There are many methods for selecting the best chromosomes. The simplest form
of selection is roulette wheel selection. In this method, a string with a fitness value of Fi
is allocated a relative fitness value of Fi/F, where F is the average fitness value of the
population. The GA uses a roulette wheel style of selection to implement this
proportional selection scheme as shown in Figure 5.
[Figure shows a roulette wheel whose sectors correspond to chromosomes 1-4, each sector sized by the relative fitness of that chromosome]
Figure 5. Roulette wheel selection
In the above figure, each chromosome is allocated a sector of the roulette wheel,
with the sector size calculated based on the relative fitness of the respective chromosome.
In the above case, chromosome 1 has the maximum fitness, followed by chromosome 2
with chromosome 3 and 4 having the same fitness. The probability of selection of an
individual is provided by the relative fitness of the chromosome. The count value of the
expected number of an individual is determined by multiplying a randomly generated
number in the range of 0-(population size) with the relative fitness of that individual, and
then taking the integer portion of the product. Some other schemes are stochastic
remainder with and without replacement, tournament selection and rank selection. In this
thesis, the stochastic remainder without replacement method is used (Zomaya et al., 1999). It
is described in Chapter 3.
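A minimal C sketch of the roulette wheel mechanism is given below. It assumes the fitness values have already been converted into a non-negative relative measure in which larger values are better, as in the fitness discussion above; the exact scaling used in this thesis is not reproduced here.

    #include <stdlib.h>

    /* Roulette wheel selection: return an index chosen with probability proportional
     * to fitness[i].  Fitness values are assumed non-negative, larger = more fit. */
    int roulette_select(const double *fitness, int pop_size)
    {
        double total = 0.0;
        for (int i = 0; i < pop_size; i++)
            total += fitness[i];

        double spin = ((double)rand() / RAND_MAX) * total; /* random point on the wheel */
        for (int i = 0; i < pop_size; i++) {
            spin -= fitness[i];                            /* walk the sectors          */
            if (spin <= 0.0)
                return i;
        }
        return pop_size - 1;                               /* guard against rounding    */
    }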
Crossover/Recombination. The crossover (also called recombination) mechanism
stochastically causes the exchange of genetic material between the selected parents to
produce new individuals having some portions of both the parents’ genetic material. The
probability of occurrence of crossover is user defined and is called the crossover rate. A
typical value of crossover rate is in the range of 0.5 – 1.0. The simplest form of crossover
is the single-point crossover, where a crossover point is chosen stochastically and the
substrings are swapped between the parents across that point to create new individuals.
Simple crossover is shown in Figure 6.
Chromosome 1:  3 2 1 | 6 5 4
Chromosome 2:  1 2 3 | 4 5 6
Offspring 1:   3 2 1 | 4 5 6
Offspring 2:   1 2 3 | 6 5 4
Figure 6. Single-point crossover operation (| marks the crossover point)
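The following C sketch shows single-point crossover on integer-array chromosomes. As noted above, swapping substrings in this way can duplicate tasks under a permutation encoding, which is why the PMX operator of Chapter 3 is used in this thesis; the sketch is generic and illustrative only.

    #include <stdlib.h>

    /* Single-point crossover: genes before the cut come from one parent, the rest
     * from the other.  Requires length >= 2; duplicates may appear if the encoding
     * is a permutation, hence the use of PMX in Chapter 3. */
    void single_point_crossover(const int *parent1, const int *parent2,
                                int *child1, int *child2, int length)
    {
        int cut = 1 + rand() % (length - 1);   /* crossover point in 1..length-1 */
        for (int i = 0; i < length; i++) {
            child1[i] = (i < cut) ? parent1[i] : parent2[i];
            child2[i] = (i < cut) ? parent2[i] : parent1[i];
        }
    }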
It is not necessary that the parts that contribute most to the fitness of an individual
be contained in the substrings formed. Such a disruptive nature instigates exploration of
new areas of the search space instead of converging into local maxima. This makes GAs
robust. There are other methods of crossover depending on the type of encoding and type
of application such as two point, multipoint, uniform, arithmetic, order, cycle and
partially matched crossover (Zomaya et al., 1999, and Goldberg, 1989). In this thesis
partially matched crossover is used. It will be described in detail in the advanced operator
section in Chapter 3.
Mutation. This mechanism causes a small alteration of the chromosome
stochastically. Mutation is applied uniformly on the entire population, with a random
probability called the mutation rate, which for a GA is typically very low in the range of
0.1 – 0.4 percent. Most researchers believe mutation plays a secondary role in the GA. It
reintroduces divergence in a converging population, ensures the probability of finding an
optimal solution is never zero, and recovers good genetic material if lost through
selection and crossover (Zomaya et al., 1999). Typical methods of performing mutation
are bit inversion, order changing, adding or subtracting small values at the mutation point
and number swapping.
Replacement. Once the new population has been generated and evaluated, a
scheme is needed to replace the existing members of the old population with new
members. Some methods include the generational scheme where the new members
replace their parents, steady state scheme where the new members replace the weaker
members of the population, and incremental scheme where the new members are simply
added to the existing population provided enough memory is available (Scott, 1994).
Termination. A GA is terminated depending on the expected output. It is possible
that a GA is terminated after a preset number of populations have been obtained, or the
population’s average fitness has reached a threshold fitness value. Other termination
conditions include the best individual having reached a threshold fitness value, or some
degree of homogeneity being obtained in the population. Often a combination of the
above mentioned approaches is used (Zomaya et al., 1999).
Depending on the particular optimization problem, the GA parameters and
operators need to be chosen to provide better performance. Although a GA with simple
operators is a very powerful tool, there are ways to improve its performance by applying
some advanced genetic operators and using some variations of the GA. The advanced
genetic operators and functions used in implementing the sequential GA for performing
task scheduling in this thesis are discussed in the next chapter.
CHAPTER 3
SEQUENTIAL GENETIC ALGORITHMS
A GA in itself is a robust and powerful tool and has been used to solve many
optimization problems. But with the complexity associated with some specific
optimization problems, it is not desirable to use the standard genetic operators discussed
in Chapter 2, and special operators need to be utilized to improve the performance of the
GA.
Advanced Operations
The execution time of the GA can be defined as the time taken by the algorithm to
converge to an optimal solution. This time reflects on the quality of the solutions
comprising each generation. If the quality of the solutions is poor, i.e. the individuals do
not conform to the constraints imposed, the GA will take more time to converge to the best
solution. Some of the techniques and operators used to improve the quality of solutions
and thereby the execution times of the sequential GA implemented in this thesis are
described below.
Population initialization. Instead of randomly generating the initial population,
some individuals known to be in the vicinity of the global maximum can be included to
speed up the convergence. In this thesis the initial population contains individuals which
have been sorted in accordance with their priorities, along with randomly generated
individuals.
Fitness function. As previously mentioned, task scheduling is a multiple
constrained problem. A GA is directly applicable only to unconstrained optimization
problems. Hence, it is necessary to use some additional methods that will keep the
solutions in the region of feasible schedules. Traditional GAs have incorporated
constraints in the form of bounds on the variables, but these suffer from some drawbacks.
However, recently researchers have developed methods that can be used to solve general
constrained optimization problems. Presently there are four methods to handle
constrained problems with GAs, namely rejection of the offspring, repair algorithms,
modified genetic operators, and penalty functions (Joines and Houck, 1994 and Yeniay,
2005). This thesis utilizes the penalty functions method to evaluate the fitness of the
chromosomes.
Penalty functions convert a constrained problem into an unconstrained problem
by penalizing those solutions which are infeasible (Yeniay, 2005). There are two basic
ways to apply penalty functions. The first is the additive form shown below.
Eval(x) = f(x),          if x is a feasible solution
Eval(x) = f(x) + p(x),   otherwise
where f(x) is the objective function and p(x) is the penalty function. The second method
is the multiplicative form, which is shown below.
Eval(x) = f(x),          if x is a feasible solution
Eval(x) = f(x) * p(x),   otherwise
In this thesis, a combination of these two techniques has been used: a distinct penalty
value is multiplied by the number of priority mismatches and by the number of precedence
mismatches, and the resulting terms are added to provide a fitness value for the chromosome.
Researchers have shown that if one imposes a high degree of penalty, more emphasis
is placed on obtaining feasibility and the GA will move very quickly towards a feasible
solution. The system will tend to converge to a feasible point even if it is far from
optimal. However, if one imposes a low degree of penalty, less emphasis is placed on
feasibility, and the system may never converge to a feasible solution. Hence, it is
important to choose correct values for penalty function parameters (Joines and Houck,
1994).
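The sketch below illustrates the combined scheme described above: each type of constraint mismatch is multiplied by its own penalty value and the products are summed to give the chromosome's fitness, with zero indicating a feasible schedule. The two penalty weights are placeholder values chosen only for illustration; the thesis does not fix them here.

    /* Penalty-based fitness (minimization: 0 means a feasible schedule).
     * The two weights are illustrative placeholders only. */
    #define PRIORITY_PENALTY    1.0
    #define PRECEDENCE_PENALTY  5.0

    double fitness_value(int priority_misses, int precedence_misses)
    {
        return PRIORITY_PENALTY   * priority_misses
             + PRECEDENCE_PENALTY * precedence_misses;
    }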
Selection scheme. The basic selection scheme used in the GA is the roulette wheel
selection, but it has been shown by De Jong that this scheme suffers from high stochastic
errors (Goldberg, 1989). Hence, an advanced selection scheme like stochastic remainder
without replacement is used. This scheme starts in the same manner as the roulette wheel
selection scheme, where the count value of the expected number of an individual is
determined by multiplying a randomly generated number in the range of 0- (population
size) with the relative fitness of that individual, and then taking the integer portion of the
product. The fractional portion of the product provides us with the probability of any
additional copies of the chromosome that can be expected. For example, a chromosome
having a product of 2.5 will receive two copies surely, and another one with a probability
of 0.5. This process continues until the next generation is completely generated (Zomaya
et al., 1999, and Goldberg, 1989).
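The sketch below shows the textbook form of stochastic remainder selection without replacement (Goldberg, 1989): each individual first receives the integer part of its expected count and then one additional copy with probability equal to the fractional remainder. The description above additionally involves a random scaling factor, so this should be read as a generic illustration rather than the exact routine used in this thesis.

    #include <stdlib.h>

    /* Stochastic remainder selection without replacement (textbook form).
     * `fitness` is assumed to be a relative measure in which larger values are better. */
    int stochastic_remainder_select(const double *fitness, int pop_size, int *selected)
    {
        double avg = 0.0;
        for (int i = 0; i < pop_size; i++)
            avg += fitness[i];
        avg /= pop_size;
        if (avg <= 0.0)
            return 0;                            /* degenerate population: caller falls back */

        int count = 0;
        for (int i = 0; i < pop_size && count < pop_size; i++) {
            double expected = fitness[i] / avg;  /* expected number of copies          */
            int copies = (int)expected;          /* guaranteed (integer) copies        */
            if ((double)rand() / RAND_MAX < expected - copies)
                copies++;                        /* extra copy with remainder probability */
            while (copies-- > 0 && count < pop_size)
                selected[count++] = i;           /* record index of selected parent    */
        }
        return count;                            /* may be less than pop_size; remaining
                                                    slots are filled by the caller     */
    }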
Advanced crossover operators. For order based problems like the traveling
salesman problem and task ordering, a simple crossover operation does not produce the
desired results. Hence, advanced crossover operators, known as reordering operators,
were constructed, which combine features of inversion and crossover into a single
operator. Some such operators include order, cycle, and partially matched crossover
(PMX). In this thesis, the PMX operator, which was initially constructed for solving the
blind traveling salesman problem, has been used. Under PMX, two strings are aligned,
and two crossover points are picked at random along the strings. These two points define
the crossover section used during the crossover operation, through position-by-position
exchange. Figure 7 shows a simple PMX crossover between two chromosomes A and B.
PMX proceeds first by position exchanges between crossover points 4 to 6, i.e. tasks 5
and 2, 6 and 3, and 7 and 10 exchange places. Once the exchange is complete there might
exist duplication of tasks in the offspring. To overcome this problem, partial matching
takes place, wherein if during the crossover 5 was replaced by 2 in chromosome A, and
there exists a 2 in the un-exchanged part of chromosome A then that 2 is replaced by a 5.
Similar matching operation takes place for the rest of the exchanges. Hence, after
performing the PMX operation, strings containing ordering information partially
determined by each of its parents are produced (Goldberg, 1989).
Chromosome A:  9 8 4 | 5 6 7  | 1 3 2 10
Chromosome B:  8 7 1 | 2 3 10 | 9 5 4 6
Offspring 1:   9 8 4 | 2 3 10 | 1 6 5 7
Offspring 2:   8 10 1 | 5 6 7 | 9 2 4 3
Figure 7. Partially matched crossover operation (| marks the crossover section, positions 4-6)
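A generic C sketch of the PMX operator follows; it implements the textbook operator (Goldberg, 1989) and is not the thesis's actual code. Applied to the chromosomes of Figure 7 with the crossover section at zero-based indices 3 to 5 (i.e. positions 4-6), it reproduces the two offspring shown there.

    #include <string.h>

    /* PMX on permutation chromosomes (both parents must be permutations of the same
     * task set).  The genes between cut1 and cut2 (zero-based, inclusive) are exchanged
     * position by position; duplicates elsewhere are repaired by the induced value
     * mapping (the "partial matching" step). */
    static void pmx_one(const int *p1, const int *p2, int *child,
                        int cut1, int cut2, int n)
    {
        memcpy(child, p1, n * sizeof *child);
        for (int k = cut1; k <= cut2; k++) {
            int wanted = p2[k];          /* gene the child must carry at position k */
            int pos = 0;
            while (child[pos] != wanted) /* locate it in the child...               */
                pos++;
            int tmp = child[k];          /* ...and swap it into place               */
            child[k] = child[pos];
            child[pos] = tmp;
        }
    }

    void pmx_crossover(const int *p1, const int *p2, int *child1, int *child2,
                       int cut1, int cut2, int n)
    {
        pmx_one(p1, p2, child1, cut1, cut2, n);
        pmx_one(p2, p1, child2, cut1, cut2, n);
    }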
Advanced mutation operators. Advanced operations for performing mutation need
to be used, to ensure that no item/task occurs more than once in the chromosome. One
such operator is the number swapping operator where two points of mutation are
randomly chosen and the numbers at those points swap their position. Figure 8
demonstrates the number swapping mutation operation used in this thesis.
Original offspring:  3 2 1 4 5 6
Mutated offspring:   3 2 1 5 4 6
Figure 8. Number swapping mutation operation (positions 4 and 5 swapped)
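A minimal C sketch of this operator is shown below; the choice of random source is illustrative only.

    #include <stdlib.h>

    /* Number swapping mutation: exchange the tasks at two randomly chosen positions.
     * The chromosome remains a valid permutation. */
    void swap_mutation(int *chromosome, int length)
    {
        int a = rand() % length;
        int b = rand() % length;
        int tmp = chromosome[a];
        chromosome[a] = chromosome[b];
        chromosome[b] = tmp;
    }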
Elitism. When creating a new population using crossover and mutation, the
chances are that some of the most fit chromosomes from the previous population might
be lost. To prevent this, elitism is performed, which copies the best chromosomes and
replaces the weaker chromosomes present in the new population. The rest of the
population is constructed in the usual manner. Elitism can rapidly increase the
performance of the GA by preventing the loss of the best found solutions.
Now that all the advanced operators and methods to improve the execution time
of the GA have been described, the next chapter explains the methods used to test the
performance, and verify the correctness of the algorithm.
CHAPTER 4
PARAMETRIC STUDY OF GENETIC ALGORITHMS
Various methods and advanced genetic operators were discussed in Chapter 3 to
improve the performance of the GA. But if the appropriate configuration of GA
parameters is not used during execution, performance improvements cannot be
guaranteed. This chapter presents the importance of GA configuration and the tests
performed to obtain the appropriate GA configuration for this problem. Before describing
these methods, a test suite of task graphs used to evaluate the GA’s performance and
functionality is created. Finally, after describing the test suite and obtaining an
appropriate configuration, the results obtained for the sequential GA are presented.
Experimental Verification
To verify that the GA created here is a robust algorithm and works for most task
graphs, it must be tested on a wide range of graphs of varying size and connectivity.
Hence, the experimental verification of the sequential GA is performed using various
benchmark graphs described by Kwok and Ahmad (1998). The benchmark graphs are
diverse in nature and have variations in parameters such as graph complexity and size.
The benchmark graphs are used to test the correctness of the algorithm, and the results
are manually verified to confirm that all the precedence and priority constraints are met.
These benchmark graphs are of two types: random graphs and peer selected graphs (PSGs).
Random Graphs. These are a set of graphs with random precedence and priority
constraints, representing task sets with dissimilar structures. Such graphs play an
important role in the evaluation of an algorithm, since it is necessary that the graphs used
for testing do not follow a specific pattern that might lead to a bias to a particular
algorithm. To create the random graphs used in this analysis, several graph characteristics
are randomized. The maximum order of the task graph, i.e. the number of tasks, is limited
to 20 to facilitate manual verification. Next, the priority and execution time for each task
is generated randomly. The priorities are selected between 1 and 30, and execution times are
selected between 1 and 7. All the random numbers have been generated using Marsaglia's
random number generator algorithm (Marsaglia, Zaman, & Tsang, 1990 and James,
1990). Some random graphs generated are shown in Figure 9.
[Figure shows three randomly generated task graphs, R1, R2 and R3, with nodes A-G and randomly assigned priorities and precedence edges]
Figure 9. Random graphs
Peer Selected Graphs (PSGs). These are example task graphs used by various
researchers and are documented in publications. These graphs are usually small in size
and can be used to trace the operation of an algorithm by examining the schedule
produced (Kwok and Ahmad, 1998). Eight PSGs with varying degrees of connectivity are
used. The priorities of the tasks in the task graphs are assigned randomly. Some of the
peer selected graphs used to analyze the functionality of the sequential GA are shown in
Figure 10.
[Figure shows the eight peer selected task graphs: Graph 1 (Colin & Chretienne, 1991), Graph 2 (Chung & Ranka, 1992), Graph 3 (Kwok & Ahmad, 1998), Graph 4 (Wu & Gajski, 1990), Graph 5 (Yang & Gerasoulis, 1993), Graph 6 (Adam, Chandy & Dickson, 1974), Graph 7 (Correa et al., 1999) and Graph 8 (Correa et al., 1999)]
Figure 10. Peer Selected Graphs
Before testing the GA with the above test suite, a GA configuration for which the
GA performs optimally must be found. Searching for an optimal configuration can be a
very tedious and time-consuming job. The approach taken to determine this configuration is
discussed in the next section.
Study of GA Configurations
To compare performance in terms of execution time for the hardware
implementation of a GA-based task scheduler with the software implementation of the
scheduler, first a GA configuration for which both the implementations provide good
performance results must be found.
As mentioned at the start of this chapter, the performance of the GA is very much
affected by the values of the GA parameters, namely population size, crossover rate,
mutation rate and elitism. A set of unique values for these parameters define a unique GA
configuration. The use of a non-optimal configuration for the implementation of a GA
might result in a tedious and time-consuming optimization process. Therefore, various
tests need to be performed on the sequential GA in order to obtain an optimal GA
configuration. This section studies the effect produced by each parameter on the
performance of the algorithm, for the test suite described previously.
There can exist numerous GA configurations resulting from different
combinations of parameter values. Searching for an optimal GA configuration by varying
each parameter, for each task graph of the test suite, is a very difficult and exhaustive
process. To narrow this search space, the following steps were taken. First, the
performance of each GA configuration was observed for only a subset of the test suite
which includes PSGs 3, 4, 5 and 8. This subset represents task graphs of varying
complexity and size. The execution time is noted for each possible configuration for each
of the four graphs. Based on the results obtained for each of the graphs, an optimal GA
configuration was picked that shows a relatively shorter execution time. Also, this
configuration provides similar results on average for numerous runs of the algorithm.
This optimal configuration has the following parameter values: population size of 140,
crossover probability of 0.8, mutation probability of 0.6, and elitism of 90 (2/3 population
size). The following sub-section discusses the effect of each parameter on the GA’s
performance by varying the value of one parameter at a time. Also, in order to get a
statistically significant quality assessment of the GA’s performance, the algorithm was
executed 20 times and the average execution time is used in the results.
A simple non-pipelined version of the GA was implemented in the C
programming language. It was executed on a UNIX machine, with an EM64T CPU
having a clock frequency of 3.2 GHz and a peak performance of 102.4 GFLOPS. The test
results obtained by varying each parameter of the GA configuration are shown and
discussed in the following sub-sections.
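The benchmarking harness itself is not described in the thesis; the sketch below merely illustrates one way such an average could be gathered in C, assuming the GA is wrapped in a run_ga() function and that clock() resolution is adequate for the millisecond-scale times reported in this chapter.

    #include <time.h>

    #define RUNS 20   /* number of repetitions used for the averages in this chapter */

    /* Average execution time of one GA run in milliseconds, over RUNS repetitions.
     * run_ga is a placeholder for the scheduler GA being timed. */
    double average_execution_time_msec(void (*run_ga)(void))
    {
        double total = 0.0;
        for (int i = 0; i < RUNS; i++) {
            clock_t start = clock();
            run_ga();
            total += 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
        }
        return total / RUNS;
    }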
Effect of Altering the Population Size. It is important to choose an appropriate
population size during the execution of a GA. Choosing a population size which is too
small makes the algorithm prone to the risk of premature convergence to a local
optimum, whereas choosing a larger population size might provide a global optimum, but
at the expense of a longer CPU computation time since the GA needs to search through a
larger solution space. A population size must be chosen that does not compromise
solution quality, i.e. the fitness of the solution, and that searches a reasonably sized
section of the solution space without increasing the computation time.
Figure 11 shows the average execution time of the GA for each population size
for the four PSGs. The population size was varied from 20 to the maximum allowed
population size of 200 in increments of 20 keeping other GA parameters constant. The
average value of the GA execution time is shown. The execution time of PSG 3 is scaled
down by a factor of 50 in order to show the results clearly in the same graph.
[Plot: average execution time (msec) vs. population size (20-200) for PSGs 3, 4, 5 and 8]
Figure 11. Average execution time (msec) vs. population size
It can be observed that for large task graphs, such as PSGs 3 and 4 the execution
time is high for low population size (20-80). It then decreases for mid-range population
sizes (80-160), and finally there is an increase in the execution time at the higher end of
the population size (160-200). Although PSGs 3 and 4 may not completely follow this trend,
it was observed for most of the other large task graphs in the test suite.
For low population size, there are not enough individuals present in the
population. The GA tends to search around the local optimum leading to longer execution
times. In the middle range, the population size is large enough to make the GA move
away from the local optimum and eventually converge to a global optimum while still
being small enough to take little CPU time. Towards the higher end of population size,
the GA has more individuals to evaluate, most of them being multiple copies of the same
individual. This redundancy of individuals in the population does not improve the
chances of the GA to converge to a global optimum, but just presents extra computation
overhead; thereby producing the same result produced by a medium sized population in a
longer time. With the increase in population size the portion of the solution space being
searched also increases and the probability of accidentally moving into the section
containing the global optimal solution improves and is depicted by the occasional dips in
the graph.
For small task graphs such as PSGs 5 and 8, the execution time increases as the
population size increases. This is because the solution space for these graphs is small
and the GA converges to the global optimum very quickly. The variation over the range of
population sizes is small due to the small size of the graphs. As can be observed, population
sizes 100 and 120 provide good execution time for PSG 3, 5 and 8, and population size of
140 for PSG 4. A population size of 140 is finally chosen, as it produced good results for
most of the other large task graphs in the test suite.
Effect of Crossover Probability. The crossover rate was varied from 0.2 to 1.0 in
increments of 0.1, and the average execution time is shown in Figure 12 for the four test
graphs. Similar kinds of increments have also been used by Zomaya et al. (1999) while
evaluating the performance of a GA used for task scheduling. To fit the plot for PSG 3 in
the figure, its values have been scaled down by a factor of 10.
[Plot: average execution time (msec) vs. crossover rate (0.2-1.0) for PSGs 3, 4, 5 and 8]
Figure 12. Average execution time (msec) vs. crossover rate
The results for PSGs indicate that the execution time is high for low crossover
rate (0.2-0.4), starts decreasing and remains low over the higher range (0.4-1.0). A low
crossover rate produces a higher convergence time as the GA does not perform frequent
recombination of good genetic material between individuals and thus explores fewer
solutions. For most of the task graphs in the test suite, the best value of
crossover rate was found to be in the range of 0.4 - 0.8, for which the average execution
time was low. The execution time increased with further increase in the crossover rate, as
higher crossover probability makes the GA explore a large portion of the solution space
swiftly, prolonging its convergence. Hence, a crossover rate of 0.8 was finally chosen to
provide sufficient variation in the population for most of the task graphs in the test suite.
Effect of Changing the Mutation Probability. The mutation rate was varied from
0.2 to 1.0 in increments of 0.1 (Zomaya et al., 1999). Figure 13 shows the graph of the
average execution time versus mutation rate for some of the task graphs. To fit the plot
for PSG 3 in the figure, its values have been scaled down by a factor of 10.
PSG 3
Average execution
time (msec)
PSG 4
1000
800
600
400
200
0
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Mutation rate
PSG 5
Average execution
time (msec)
PSG 8
10
8
6
4
2
0
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Mutation rate
Figure 13. Average execution time (msec) vs. mutation rate
It is observed for most of the graphs that the execution time for low mutation rate
(0.2-0.5) is high since the probability of the GA getting stuck at a local optimum or losing
good genetic material is higher. Towards the middle range (0.5-0.8), enough disruption is
produced in the population so as to recover any good material lost during selection and
crossover. Also, such a rate ensures that the GA never stagnates at a local optimum. With
further increase (0.8-1.0) in mutation rate, frequent disruption is produced causing the
GA to hop through various sections of the solution space never staying long enough in
one section to converge to an optimum. This leads to a high execution time. A mutation
rate of 0.6 in the mid-range is finally chosen as it produced low execution time for most
of the task graphs for the given GA configuration.
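As an illustration of how these two rates act on each selected pair, the following C fragment is a minimal sketch rather than the thesis code: pmx_crossover() stands in for the PMX operator described in Chapter 3, a simple task-swap mutation is assumed, and applying mutation to each child independently is only one plausible reading of the operator.

#include <stdlib.h>

#define P_CROSSOVER 0.8               /* crossover rate chosen above */
#define P_MUTATION  0.6               /* mutation rate chosen above  */

void pmx_crossover(int *a, int *b, int len);   /* PMX operator, assumed defined elsewhere */

/* simple task-swap mutation: exchange the tasks at two random positions */
static void swap_mutation(int *chrom, int len)
{
    int i = rand() % len, j = rand() % len;
    int tmp = chrom[i];
    chrom[i] = chrom[j];
    chrom[j] = tmp;
}

/* recombine with probability 0.8 and mutate each child with probability 0.6 */
void reproduce(int *child_a, int *child_b, int len)
{
    if ((double)rand() / RAND_MAX < P_CROSSOVER)
        pmx_crossover(child_a, child_b, len);
    if ((double)rand() / RAND_MAX < P_MUTATION)
        swap_mutation(child_a, len);
    if ((double)rand() / RAND_MAX < P_MUTATION)
        swap_mutation(child_b, len);
}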
Effect of Changing the Value of Elitism. The sole purpose of introducing elitism is
to replicate the previous generation’s best solutions in a particular generation thereby
increasing the average fitness of the population. The elitism value was varied from 10 to 140 in increments of 10, and the corresponding execution time was recorded and is shown in Figure 14. To fit the data for PSG 3 in the plot, its execution time has been scaled down by a factor of 10.
Figure 14. Average execution time (msec) vs. elitism
The execution time is observed to be high for low elitism values (10-40) and decreases as elitism increases towards the higher end (40-110), with the lowest computation time occurring at roughly two thirds of the population size for PSGs 3, 4 and 8. Mid-range (60-110) elitism allows the GA to work on a large portion of good individuals, leading to faster convergence to the global optimum. Although elitism values of 100 and 110 seem to provide better execution times, an elitism of 90 provided the best result for the given configuration for most of the task graphs in the test suite. Hence, an elitism of 90 has been used while executing this GA. It should also be noted that for smaller PSGs, such as graphs 5 and 8, elitism does not produce a significant improvement in the performance of the GA.
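One common way to realize this replication, sketched below in C for the chosen elitism value of 90, is to carry the 90 fittest members of the previous generation into the new one in place of its weakest members; the exact replacement policy is not prescribed here, so the sketch is illustrative only and the chromosome length is a placeholder.

#include <stdlib.h>
#include <string.h>

#define POP_SIZE  140
#define ELITISM    90
#define CHROM_LEN  32                 /* placeholder chromosome length */

typedef struct { int tasks[CHROM_LEN]; int fitness; } chromosome_t;

/* sort descending by fitness so the fittest members come first */
static int by_fitness_desc(const void *a, const void *b)
{
    const chromosome_t *x = a, *y = b;
    return y->fitness - x->fitness;
}

static void apply_elitism(chromosome_t *old_pop, chromosome_t *new_pop)
{
    qsort(old_pop, POP_SIZE, sizeof *old_pop, by_fitness_desc);
    qsort(new_pop, POP_SIZE, sizeof *new_pop, by_fitness_desc);
    /* overwrite the ELITISM weakest members of the new generation with the
       ELITISM best members carried over from the previous generation */
    memcpy(new_pop + (POP_SIZE - ELITISM), old_pop, ELITISM * sizeof *old_pop);
}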
Now that the optimal configuration has been obtained and justified, the GA was executed on the full set of test graphs to verify its correctness.
Results and Discussions
For both sets of benchmark graphs described in the first section of the chapter, the
software implementation of the GA was tested and its output schedules were manually
verified to meet all priority and precedence constraints. The test suite is comprised of
eight peer selected graphs and 100 randomly generated graphs. The resultant schedules
obtained for random graphs are shown in Table 1 and for peer selected graphs in Table 2.
Table 1. Task schedules for random graphs
Randomly Generated Graph      Task Schedule
Graph R1                      ABCFEGD
Graph R2                      ABEFCD
Graph R3                      ABDCEFG
Table 2. Task schedules for peer selected graphs
Peer-Selected Graph           Task Schedule
Graph 1                       ACBEDHGFI
Graph 2                       ABDHCFGEIJK
Graph 3                       BCADMFEIGHJLNKOPQ
Graph 4                       ADCBEFGIJHKLONPQMR
Graph 5                       ACEBDGF
Graph 6                       ACDGBFEKIHJL
Graph 7                       BACFGEDHI
Graph 8                       CBAEDF
Despite the advanced techniques developed to improve performance, GAs can
still require long computation times especially for large problems or task graphs.
Performance improvements are possible through parallelization and hardware
acceleration.
CHAPTER 5
PARALLEL GENETIC ALGORITHMS
Intractability of the scheduling problem has led to the use of heuristics for obtaining
sub-optimal solutions in a reasonable amount of time. Although heuristics can produce
good solutions, i.e. solutions with high fitness value, their time complexities are fairly
high. Many heuristics work well for small task sets, but their performance deteriorates as
the problem size and the task set’s complexity increases. To reduce the time complexity
without compromising the solution quality, a natural approach is to parallelize the
scheduling algorithm (Ahmad and Kwok, 1999). In this chapter an overview of various
parallel GA (PGA) structures is presented. Then, the parallelization of the sequential GA
presented earlier is discussed and the results are presented.
Overview of Parallel GAs
The simple structure and operational principle of the GA inherently makes it easy
to parallelize. Much research has been devoted to parallelizing GAs. Several parallel GA
implementations have been examined and these approaches can be broadly classified into
four categories: synchronous master-slave, semi-synchronous master-slave, asynchronous
concurrent, and distributed or network (Goldberg, 1989), shown in Figures 15, 16, and
17.
Figure 15. Master-slave implementation
Figure 16. Asynchronous concurrent implementation
Figure 17. Distributed or network implementation
The master-slave, or global-parallel, GA shown in Figure 15 is one of the most
popular approaches for parallelizing a GA. In this approach, the master performs most of
the genetic operations, like selection, crossover and mutation; it is the evaluation of
fitness of the chromosomes which is distributed over slave processors. Once the slaves
are done evaluating, they return the fitness value to the master. Meanwhile, the master
waits on the slaves for completion. The second approach, the semi-synchronous master-slave, is similar to the master-slave except that the master does not synchronize its
operation with the slaves. Instead, selection and recombination take place as the slaves
perform the evaluation.
In the asynchronous concurrent GA approach, shown in Figure 16, the GA
executes independently on multiple processors, which work on a population present in a
shared memory. While executing the GA, these processors make sure that there is no data
incoherency. The final approach is the network-based GA shown in Figure 17. In this
case, multiple processors use local populations to evolve. These processors work
independently of each other and synchronization takes place at the end of each generation
when each local GA sends the best solution to the other processors (Goldberg, 1989). One
variation of this approach, implemented by Solar, Kri and Parada (2001), uses the best
solution sent by the local GAs to the central GA at the end of a generation. The central
GA selects the best solution and broadcasts it to the local GAs. Such a GA is often called
a distributed GA (Cantu-Paz, 1998; Yoshida and Yasuoka, 1999; Solar et al., 2001).
Ahmad and Kwok (1999) demonstrate another method for implementing a
parallel GA. In this method, the DAG is partitioned into multiple parts. One can partition
the DAG either horizontally or vertically into layers of nodes. A horizontal partitioning divides the DAG into layers of nodes such that the nodes in each layer are at the same level of the DAG. Vertical partitioning is performed by partitioning along the paths. After
partitioning, each part is scheduled by a node, and at the end these schedules are
combined to generate the final schedule (Ahmad and Kwok, 1999).
Software Implementation of PGA
In this thesis, a synchronous master-slave parallel GA has been implemented by
parallelizing the fitness evaluation module to improve the performance of the scheduling
algorithm. The evaluation of fitness constitutes the largest part of the GA's execution
time. In the master-slave GA, the evaluation of individuals is parallelized by assigning a
fraction of the newly generated population to different slave processors. Communication
is minimized by grouping all individuals intended for a slave into one transfer. In the
semi-synchronous model, each individual is sent to the slave as it is generated after
crossover and mutation, and does not wait for the whole population to be generated, as is
the case in synchronous master-slave model. Sending each individual in this manner
leads to larger communication cost as compared to sending a group of individuals for
evaluation. This overhead might overshadow any performance gain that is achieved
through parallelism. Hence, the semi-synchronous model has not been implemented.
In synchronous master-slave PGA the master node executes the sequential GA
process of selection and reproduction. But when it comes to the evaluation of fitness, the
master node distributes fitness evaluations amongst itself and the slave nodes. The master
evaluates its fraction of population and waits on the slaves to return the fitness values for
all the other individuals comprising the population before moving on to the next
generation if a desired solution has not been obtained. Hence, a master-slave GA is the
same as the sequential GA only with its performance improved through parallelization of
the evaluation/fitness function (Cantu-Paz, 1998).
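A minimal C/MPI sketch of this fitness distribution is shown below. It assumes a population of 140 chromosomes stored as rows of an integer array, a placeholder chromosome length, and a hypothetical evaluate_fitness() identical to the sequential GA's penalty-based evaluation; it illustrates the synchronous master-slave structure and is not the thesis code.

#include <mpi.h>
#include <stdlib.h>

#define POP_SIZE  140                 /* population size used by the GA                  */
#define CHROM_LEN 18                  /* placeholder chromosome (task string) length     */

int evaluate_fitness(const int *chromosome);   /* assumed: sequential GA's penalty function */

void evaluate_population(int population[POP_SIZE][CHROM_LEN],
                         int fitness[POP_SIZE], int nprocs)
{
    int share = POP_SIZE / nprocs;    /* 140 divides evenly for 1, 2, 4, 5, 7, 10 and 14 nodes */
    int (*sub)[CHROM_LEN] = malloc(share * sizeof *sub);
    int *subfit           = malloc(share * sizeof *subfit);

    /* the master scatters one contiguous block of chromosomes to every node ... */
    MPI_Scatter(population, share * CHROM_LEN, MPI_INT,
                sub,        share * CHROM_LEN, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < share; i++)   /* every node, master included, evaluates its share */
        subfit[i] = evaluate_fitness(sub[i]);

    /* ... and blocks here until the slowest node has returned its fitness values */
    MPI_Gather(subfit, share, MPI_INT, fitness, share, MPI_INT, 0, MPI_COMM_WORLD);

    free(sub);
    free(subfit);
}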
Results and Discussions
The PGA implemented here uses the same parameter configuration as the
sequential GA, hence only the effect of parallelization has been discussed. The PGA
distributes fitness evaluation over the nodes of a cluster, with the master distributing the
population over these slave nodes, the slave nodes evaluating the sub-population’s fitness
and returning them to the master. The PGA is coded in C using the Message Passing
Interface (MPI) standard and is executed on a Rocks cluster with 16 EM64T CPUs, each with a clock frequency of 3.2 GHz, for an aggregate peak performance of 102.4 GFLOPS. To study the
speedup in the execution time of the GA with respect to the number of processors used, a
mixed subset of the PSGs described in Chapter 4 is used.
The time taken to evaluate fitness depends on two things. First is the size of the
population; more individuals require longer computation time. Second is the quality of the solutions. Poor solutions require the fitness module to apply a penalty, thus requiring more computations. Although all fitness modules receive an equal number of individuals
to evaluate, the parallel implementation does not ensure that all the modules receive
individuals of the same quality. This type of condition is known as load imbalance. Due
to the synchronization between master and slaves, the computation time of the overall
algorithm is constrained by the slowest evaluating fitness module.
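This behavior can be summarized by the toy model below, sketched in C with hypothetical parameters rather than measured data: the master must wait for the slowest fitness evaluation before proceeding, and the communication cost grows with the number of subpopulation transfers.

/* toy model: time for one synchronous master-slave generation */
double generation_time(const double *eval_time, int nprocs, double comm_per_transfer)
{
    double slowest = 0.0;
    for (int i = 0; i < nprocs; i++)
        if (eval_time[i] > slowest)
            slowest = eval_time[i];                     /* synchronize on the slowest node    */
    return slowest + comm_per_transfer * (nprocs - 1);  /* plus transfers to/from the slaves  */
}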
Since the population size is set to 140 for this GA, the numbers of processors used are 1, 2, 4, 5, 7, 10, and 14, all of which are divisors of 140. These numbers divide the population equally among the slave processors. The speedup obtained by
parallelizing the GA is shown in Figure 18.
Figure 18. Speedup vs. number of processors
As shown in the graph, for most of the tested task graphs speedup increases as the
number of processors increase. After a certain point, speedup typically remains constant
before gradually decreasing. This characteristic can be attributed to load imbalance and
the increase in communication costs incurred while sending subpopulations to slave
nodes, thereby negating any improvement obtained through parallelism. These
communication costs increase as the subpopulation size decreases, eventually dominating
computation time. It is also observed that, for smaller PSGs, PSGs 5 and 8, the
communication overhead increases so much that the performance of the PGA sinks below
that of the sequential GA. A maximum speedup of approximately 3 was obtained using 7
processors for complex task sets. For simple task graphs, best speedups were obtained
using fewer processors on the order of three. Finally, it is observed that by parallelizing
the fitness evaluation a speedup on the order of 1.5-3 can be achieved over the sequential
GA depending on characteristics of the task graph being scheduled.
Despite parallelization, a GA may still require long execution times, in some cases several days, to solve a large problem (Vega-Rodriguez, Gutierrez-Gil, Avila-Román, Sánchez Pérez and Gómez Pulido, 2005). As Figure 18 shows, using additional
computing resources may not provide improved performance. Thus, better methods to
improve the execution time of a GA are being researched. Implementing the GA directly
in hardware using reconfigurable hardware devices is one promising approach.
CHAPTER 6
HARDWARE-BASED GA TASK SCHEDULER MODEL
This chapter begins with the motivation behind a hardware-based GA (HGA),
followed by a brief description of field programmable gate arrays (FPGAs), and previous
work done in the field of HGAs. A conceptual description of the HGA design is given
including an overview, a detailed description of each module’s functionality, and a brief
description of how pipelining has been implemented in the design. Finally, the
performance results for the HGA are presented and discussed.
Motivation for the HGA System
Sequential GA implementations tend to be slow due to the amount of computation
required to efficiently search the large solution spaces associated with task scheduling
problems. Parallelizing these GAs can offer only limited speedup due to communication
costs, as shown in Figure 18. More substantial performance improvement is possible
using hardware acceleration.
With the advent of hardware technologies and design tools in the field of
reconfigurable devices, there has been a drastic increase in the hardware implementation
of applications that were previously being executed in the software environment. With
hardware implementation, high performance can be expected with minimum resources at
hand.
The advent of such technology features has enabled engineers to think of new
ways to approach the research and development of both hardware and software systems.
Much work has been done to demonstrate the significant speedup that can be achieved by
implementing hardware accelerators of software algorithms.
A typical sequential GA is comprised mainly of a set of simple genetic operations
and fitness evaluation functions that are executed repetitively until the required solution
is obtained. If a particular GA executes for num_gen generations and performs genetic
operations on a population of size popsize, then the GA operations are executed num_gen
* popsize times. For example, for one particular execution of the GA for PSG 3, the GA
executed 454 generations with a population size of 120 until it converged to the global
optimum. Thus, the GA operations executed 454 * 120 = 54,480 times. It has been
noticed for complex scheduling problems that the population size and the number of
generations needed is large, making the number of GA operations executed before the
termination condition is reached very large. Therefore, hardware acceleration of these
algorithms may provide significant speedup.
Hardware acceleration can be achieved using different technologies. Custom
hardware can be used, but it is not desirable due to high non-recurring engineering (NRE)
costs and long design time (Vahid and Givargis, 2000). Reconfigurable hardware devices
such as FPGAs are popular because of their low cost, flexibility, and speed. Also,
standardized hardware description languages (HDLs) and powerful design tools are
available for these devices making the design process easier. The next section provides an
overview of FPGAs.
Overview of Field Programmable Gate Arrays
The process of designing digital hardware has changed dramatically over the past
few years with the availability of new types of sophisticated programmable logic devices
(PLDs). Unlike previous generations of technology, in which board-level designs
included large numbers of small scale integrated chips containing basic gates, virtually
every digital design produced today consists mostly of high-density devices. When logic
circuits are destined for high-volume systems they have been integrated into high-density
gate arrays. However, gate arrays suffer from high NRE costs and take too long to
manufacture to be viable for prototyping or other low-volume applications. For these
reasons, most prototypes, and also many production designs are now built using PLDs
because of features such as instant manufacturing turnaround, low start-up costs, and ease
of design changes. The three main categories of PLDs are: Simple PLDs (SPLDs),
Complex PLDs (CPLDs) and FPGAs (Brown and Rose, 1996, Brown and Vranesic, 2005
and Vahid and Givargis, 2000).
An FPGA is a programmable logic device that supports implementation of
relatively large logic circuits. FPGAs contain an array of uncommitted circuit elements,
called logic blocks, and interconnect resources. Configuration is performed through
programming by the end user. The general structure of an FPGA is shown in Figure 19.
Figure 19. Structure of an FPGA
As shown in Figure 19, an FPGA contains three main types of resources: logic
blocks, I/O blocks for connecting to the pins of the package, and interconnection wires
and switches. The logic blocks, arranged in a two dimensional array, provide the
implementation of the required functions, and the interconnection wires are organized as
horizontal and vertical routing channels between rows and columns of logic blocks. The
routing channels contain wires and programmable switches that allow the logic blocks to
be interconnected in many ways. The switches can be implemented by pass transistors controlled by static RAM cells, by anti-fuses, or by EPROM or E2PROM transistors. There are
two basic categories of FPGAs on the market today, Static Random Access Memory
(SRAM)-based FPGAs and antifuse-based FPGAs. For SRAM-based arrays, Xilinx and
Altera are the leading manufacturers. For antifuse-based products the leading
manufacturers are Actel, Quicklogic, Cypress, and Xilinx (Brown and Rose, 1996 and
Scott, 1994).
Previous Research Work Related to HGA
For years, researchers have undertaken the study of hardware implementation of
schedulers and their performance. These studies have been mainly in the area of
development of hardware schedulers using VLSI, evolvable hardware, and implementation of schedulers using FPGAs. Burleson et al. (1999) developed a hardware
scheduler (Spring scheduler) in the form of a coprocessor which was used as a VLSI
accelerator for multiprocessing environments (Burleson et al., 1999). The scheduler is
used for sophisticated static scheduling as well as on-line scheduling using various
algorithms like EDF, HLF or the Spring scheduling algorithm (Niehaus et al., 1993).
Yoshida and Yasuoka (1999) proposed a hardware-based GA called Genetic Algorithm
Processor that employs a steady state GA and pipeline processing. They also
implemented a two level parallelization of parallel and distributed GA, which was
effective in performance and convergence improvement (Yoshida and Yasuoka, 1999).
Although high performance is provided by such conventional custom hardware
schedulers, lack of flexibility and reconfigurability of such components has prompted
many researchers to use reconfigurable devices such as FPGAs for the development of
hardware-based schedulers.
Use of pre-fabricated reconfigurable devices has allowed rapid implementation of
sizeable systems without the need to create custom hardware. Such hardware provides
economies-of-scale production advantages as well as flexibility formerly found in
software-based systems (Loo et al., 2003). Scott, Samal and Seth (1995) proposed a
general hardware-based GA engine which implements a parallelized simple genetic
algorithm. Loo et al. (2003) developed a static GA-based scheduler on a reconfigurable
device. Tang and Yip (2004) implemented a general GA using FPGAs with the ability to
reconfigure the hardware-based fitness function.
A considerable amount of research has been applied to a sub-category of hardware-based evolutionary algorithms known as evolvable hardware or evolware. In such hardware systems, the function and architecture of the system can be dynamically changed in self-adaptation to the outside environment (Lei, Ming-cheng, and Jing-xia, 2002). Evolware consists of a reconfigurable hardware device that
can undergo a number of evolutionary cycles to produce a suitable solution to the
application at hand. Evolvable hardware has become a rapidly growing field, in which
new discoveries are being made at an astounding pace (Abielmona and Groza, 2001).
Abielmona and Groza (2001) and Lei et al.(2002) implemented such evolvable hardware
using a GA to design new chips and map a design to a technology to minimize the
structure of a circuit. Although much research effort has been devoted to hardware
schedulers, little has been done in developing hardware-based task schedulers which
perform scheduling using a GA.
Basic HGA Model
The HGA fits into a general computing environment in the following way. The
front end, which might be a CPU, contains the task graph that needs to be scheduled.
Before starting the GA scheduler, the front end writes the task graph into the random
access memory provided on the FPGA board. Then using one of the I/O pins, the front
end sends a GO signal to the GA scheduler. The scheduler detects the GO signal and
starts reading the parameters and the task graph from memory. It executes until the
termination condition has been met. Once the solution has been found, the scheduler
signals the front end, indicating that the GA has completed the scheduling operation, and
the solution has been written to the on-board memory. The CPU reads the schedule from
the memory and writes it to a file.
Overview of the System
The basic simple HGA model is shown in Figure 20. It consists of a shared on-board memory module which stores the input task graph, output schedule, GA
configurations and both the old and new population. A memory controller (MC) module
reads the external signals and controls what is read and written to the memory based on
the requests obtained from other GA modules and the front end.
Figure 20. Simple HGA model
A GA is a randomized technique, and its core of operation is based on the random
number generator (RNG) module which provides various random numbers of different
sizes to other GA modules. The initial population is generated by the initial population
generator (IPG) module. It uses the random numbers provided by the RNG module and
fills the initial population with random strings and priority sorted strings of tasks.
Simultaneously, the read task graph (RTG) module reads the task graph from memory,
fills the task table, and sends it to the fitness module (FM). As the initial population is
being generated, the random strings produced are evaluated by the FM. After the initial
population has been generated, the FM signals the selection module (SM), crossover and
mutation module (CMM), and the population sequencer (PS) module to start the GA.
Once the best solution is obtained, the FM signals the MC that the GA needs to be
terminated. The MC stops all the modules and sends a DONE signal to the front end. The
front end on receiving the DONE signal starts reading the solution. Typically, there might
exist multiple task graphs in the memory at any given time. The front end maintains a
count of task schedules generated and task schedules that need to be generated. Hence,
when the front end performs a read operation it knows which task schedule corresponds
to which task graph. A detailed description of the individual modules is given in the next
sub-section.
Module Description
The modules as shown in Figure 20 and coded in Appendix A are mostly based
on the operators used in the sequential GA model described in Chapter 3. As shown,
these modules form a coarse grained pipeline and execute concurrently with each other.
All these modules have been described using the Very High Speed Integrated Circuit
Hardware Description Language (VHDL), which is an IEEE standard, and implemented
on the Altera Cyclone FPGA device. Each module can be described as a finite state
machine with an asynchronous reset. The modules communicate with each other
asynchronously using a handshaking protocol. For example, if a consumer module A
wants some service from producer module B it initiates a request signal to module B.
Module B senses the request line and returns an acknowledgement to module A
indicating that it is ready to service A’s request. Module A then gets the service and
lowers its request line. Module B, on completion of service, lowers its acknowledgement
line. Examples of such handshaking might exist between the MC and the IPG or the IPG
and the FM. A detailed description of each module is given below.
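As an illustration of this protocol, the following C sketch models the four-phase handshake in software; the volatile flags stand in for the request and acknowledge wires, and in the actual design each side is a clocked finite state machine rather than a busy-waiting function.

#include <stdbool.h>

volatile bool req = false, ack = false;   /* stand-ins for the request/acknowledge lines */

void consumer_request(void)       /* module A asking module B for service   */
{
    req = true;                   /* raise the request line                 */
    while (!ack) { }              /* wait until B acknowledges              */
    /* ... receive the service (data transfer takes place here) ...         */
    req = false;                  /* lower the request line                 */
    while (ack) { }               /* wait for B to complete the handshake   */
}

void producer_service(void)       /* module B servicing the request         */
{
    while (!req) { }              /* sense the request line                 */
    /* ... perform the requested service ... */
    ack = true;                   /* acknowledge that the service is ready  */
    while (req) { }               /* wait for A to drop its request         */
    ack = false;                  /* lower the acknowledge line             */
}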
On-Board Memory. The memory module used in the HGA represents the on-board SRAM M4K blocks, comprising 130 kilobits of memory. The memory is
implemented using Altera synchronous single port RAM memory module with a width of
8 bits and capacity to store 16K words. All memory read and writes are performed
through the MC. The memory module is used to store the old and new population used in
the GA along with the input task graphs and the output schedules.
Memory Controller (MC). The MC is the module which acts as the interface
between the external environment and the GA scheduler. It is the MC which responds to
all the front end signals which initiate and stop the GA scheduler operation. It also
provides signals to the front end indicating that scheduling is completed and the output
schedule is ready. The MC also responds to all the memory read/write requests from
other GA modules and acts as a memory interface.
Random Number Generator (RNG). The RNG module forms the core of a GA-based scheduler. This module provides random numbers to three GA modules. It
supplies random strings to the IPG module to produce the initial random population. It
also provides two random strings to the SM which uses the strings to scale down the sum
of fitness which is in turn used to select two parents from the population for performing
recombination. Six random numbers are sent to the CMM: the test probabilities for crossover and mutation, two crossover points, and two mutation points.
The RNG module is based on cellular automaton (CA) theory. A CA can be thought of as a dynamical system, discrete in both time and space, implemented as an array of cells with homogeneous functionality constrained to a regular lattice of some dimensionality. In this thesis, a one-dimensional CA, called a linear CA, has been implemented. The CA relies on bit-level computation and local interconnection to implement a hardware-based random number generator. Such generators produce random number sequences that are sufficiently long for use in a GA, and statistical tests have shown that CA-based random number generators are far superior to linear feedback shift register-based generators (Shackleford, Tanaka, Carter and Snider, 2002; Martin, 2002).
After receiving an initiate signal from the MC, the RNG module first gets the
random seed from memory. The CA used in the RNG is a linear CA with 16 cells which
change their states according to rules 90 and 150.
Rule 90: Si_next = Si-1_present XOR Si+1_present
Rule 150: Si_next = Si-1_present XOR Si_present XOR Si+1_present
Here Si_next is the next state of cell i and Si_present is the current state of cell i. It has
been proven that a 16-cell CA whose cells are updated by the rule sequence 150-150-90-150-90-150 ... 90-150 produces a maximum-length cycle (Serra, Slater, Muzio and Miller, 1990).
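A software model of one update step of this 16-cell CA is sketched below in C. The assignment of rule 90 or rule 150 to each cell is passed in as a bit mask, since the full 150/90 sequence is only partially reproduced above, and null (zero) boundary cells are assumed.

#include <stdint.h>

/* one CA step: bit i of rule150_mask selects rule 150 for cell i, otherwise rule 90 */
uint16_t ca_step(uint16_t state, uint16_t rule150_mask)
{
    uint16_t next = 0;
    for (int i = 0; i < 16; i++) {
        int left  = (i > 0)  ? (state >> (i - 1)) & 1 : 0;   /* S(i-1), zero outside the lattice */
        int self  = (state >> i) & 1;                         /* S(i)                            */
        int right = (i < 15) ? (state >> (i + 1)) & 1 : 0;    /* S(i+1)                          */
        int bit   = ((rule150_mask >> i) & 1) ? (left ^ self ^ right)   /* rule 150 */
                                              : (left ^ right);         /* rule 90  */
        next |= (uint16_t)bit << i;
    }
    return next;
}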
Initial Population Generator (IPG). The IPG module gets the population size and
chromosome size from memory and uses input from the RNG to generate random strings
containing randomly ordered tasks and a few ordered strings based on the priority of the
tasks in the task graph. The IPG also initializes the initial fitness value to 50000, which is
the maximum fitness. After generating a random individual, the IPG sends it to the FM
for evaluating the fitness of the random string. Once the IPG is done generating the
random population and the FM is done evaluating them, the FM enables the rest of the
GA modules, and the normal GA operation is started.
Read Task Graph (RTG). The RTG module reads the task graph for which the
schedule is to be obtained from memory and generates the task table containing various
values needed for the fitness evaluation of a schedule by the FM. It is only after the RTG
is done generating the task table that the FM is initiated to perform fitness evaluation.
Population Sequencer (PS). The job of the PS module is to scan through the
current population and pass the members to the selection module. The PS module acts as
an interface between the selection module and the MC, and provides the address of the
member to be read to the MC and passes the member and its fitness to the selection
module.
Selection Module (SM). The HGA's SM implements the stochastic remainder
without replacement selection method. Each time a new member is to be selected, the SM
gets a random number from the RNG and scales the sum of fitness. It then signals the PS
to cycle through the population, and accumulates the fitness value until it reaches the
scaled sum of fitness value. Once the scaled fitness is reached, the member is stored so
that it can be passed to the CMM. The SM repeats the same procedure for obtaining the
next selected member. After obtaining the two parents, it signals the CMM that parents
are ready for recombination. After passing the selected parents to the CMM, the SM
resets itself and starts over with the selection process.
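The selection mechanism just described can be summarized by the following C sketch; fitness[] and sum_fitness are assumed to come from the statistics maintained by the FM, and rnd16 is the 16-bit random number supplied by the RNG.

/* return the index of the first member whose accumulated fitness reaches the scaled target */
int select_member(const int *fitness, int popsize, long sum_fitness, unsigned rnd16)
{
    long target = (long)(((double)rnd16 / 65535.0) * (double)sum_fitness);  /* scaled sum of fitness */
    long acc = 0;
    for (int i = 0; i < popsize; i++) {
        acc += fitness[i];            /* accumulate as the PS streams members to the SM */
        if (acc >= target)
            return i;
    }
    return popsize - 1;               /* fallback when the target equals the full sum */
}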
Crossover and Mutation Module (CMM). The CMM is based on the PMX and
number swapping mutation methods. After receiving a ready signal from the SM, the
CMM module obtains random strings from the RNG. Using these random numbers, the
CMM performs crossover and mutation based on the user defined crossover and mutation
rates. Once the CMM completes its operation, it signals the FM that new offspring are
available for fitness evaluation.
Fitness Module (FM). The FM implements the additive and multiplicative penalty
function described previously. The penalty function is based on the number of constraints
in the scheduling problem that have not been followed by the chromosome and is used to
evaluate its fitness. The FM receives inputs from two GA modules, initially, from the
IPG, while evaluating the fitness of the initial population and afterwards during the
normal GA operation for evaluating the new members provided by the CMM module.
After evaluating new individuals, the FM sends the new members to the MC which
writes them to memory. The FM also keeps track of whether the best possible solution has been obtained, based upon the pre-defined fitness threshold. If it has, it signals
the MC that the GA has obtained the solution and sends the solution to the MC which
then writes it to a separate memory location which is accessed by the user. The FM also
has the responsibility of maintaining the statistics for a given generation, like sum of
fitness, average fitness value, maximum fitness value, and minimum fitness value. These
statistics are used by the SM to select new members. After evaluating a generation, the
FM performs elitism, re-evaluates the sum of fitness, and writes it to memory through the
MC.
Pipelining
Employing hardware resources provides higher levels of concurrency. The simple
nature of GA operations makes it easy to form a coarse-grained pipeline within the hardware implementation, as shown in Figure 21. The PS, SM, CMM, and FM modules together
form a pipeline.
Figure 21. Coarse-grained pipeline
First the PS gets the members from the MC and passes them to the SM. The SM,
based on a pre-calculated value, continues accepting the incoming members until it has
chosen a pair of members. Once this decision has been made, the SM passes the pair to
the CMM which, based on the crossover and mutation probability, performs crossover
and mutation on the selected members. While the CMM is performing recombination and
mutation operations on the selected members, the SM and PS start getting the next set of
parents. Once the CMM is done, it passes the newly generated offspring to the FM and
starts working on the next set of parents if available. The FM evaluates the fitness and
sends the result to the MC which writes it to memory. This sequence of operation
represents the GA pipeline architecture. Populations are created and evaluated one
individual at a time. This is in stark contrast to the software implementation where the
entire population is created before any individuals can be evaluated. In this case, the
fitness evaluation must idly wait for the population creation to complete. This pipelining
provides a significant speedup compared to software implementations. The next section
discusses the performance of the HGA with respect to the variations in the GA
parameters.
Results and Discussions
Similar to the sequential GA, the performance of the HGA is very much affected by
the values of the GA parameters. Various tests were performed on the HGA to study the
variations in its performance as the values of these parameters change. The HGA was
implemented using VHDL on an Altera Cyclone EP1C12Q240 device. The device
features and the corresponding resource utilization for each PSG are shown in Table 3
below.
Table 3. Cyclone EP1C12Q240 device features and resource utilization

Device features     Value      PSG 5           PSG 8
Logic Elements      12060      10218 (84%)     8650 (71%)
RAM Blocks          52
Total RAM bits      239616     131072 (54%)    131072 (54%)
User IO pins        173        4 (2%)          4 (2%)
Due to resource constraints, the tests have been restricted to graphs of size less
than or equal to eight tasks. This limits the test suite to PSGs 5 and 8 only. In each test
only one of the GA parameters was changed, and the tests were executed 20 times to get
an average performance of the system. To calculate the execution time of the HGA, the
number of clock cycles required by the algorithm to generate the schedule was divided by
the clock frequency of the hardware. From these tests, the following parameter values
were determined: population size of 140, crossover probability of 0.8, and mutation
probability of 0.6.
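The execution-time calculation can be expressed as in the C sketch below, which also shows how the front end could assemble the 32-bit cycle count written byte-wise (most significant byte first) by the MC, as in the mem_ctrl listing in Appendix A; the clock frequency is left as a caller-supplied parameter since no specific value is assumed here.

#include <stdint.h>

/* convert the cycle count written by the MC into an execution time in milliseconds */
double hga_exec_time_msec(const uint8_t bytes[4], double clock_hz)
{
    uint32_t cycles = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
                      ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
    return (double)cycles / clock_hz * 1.0e3;     /* cycles / f gives seconds; scale to msec */
}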
Comparing the execution times obtained using this method with those of the PGA, the performance for PSG 5 improved by a factor of 1.4, while the performance for PSG 8 deteriorated to a factor of 0.75. This performance can be further improved by performing timing optimization on the hardware. Since the Cyclone device used was limited in the number of logic elements, as shown in Table 3, no timing optimization could be performed. It was observed that on a Stratix device with larger resources, an improvement on the order of 2-3 can be achieved. Another way to obtain higher speedup is parallelization of the HGA modules. The process of achieving a parallelized version of the HGA is discussed in the next chapter.
CHAPTER 7
PARALLEL HARDWARE-BASED GA TASK SCHEDULER MODEL
The HGA implementation discussed in the previous chapter can be improved with
parallelization. But it is important to note that the logic resources available within
commercial off-the-shelf FPGAs are limited, thereby making it difficult in some applications to configure the FPGA in such a way that all the modules of the GA can be implemented in hardware. Thus, compromises may be necessary where parts of the
GA are implemented in hardware and other parts are implemented in software. For
instance, it has been noticed that in the GA it is the evaluation of the fitness function
which takes the longest time to complete. Therefore, if the hardware resources are
limited, the FM can be implemented in hardware and interfaced with the software GA.
PHGA Model
If sufficient hardware resources are available and enough chip area is present,
then the entire GA can be implemented in hardware and various HGA modules can be
duplicated to produce additional concurrency. Scott et al. (1995) describe various
techniques to achieve this additional concurrency. In this thesis, the FM is duplicated
providing parallelization of the fitness evaluation. This approach is taken because there
are two chromosomes needing evaluation concurrently and it is the fitness evaluation that
consumes the majority of the overall execution time. The parallel hardware-based GA
(PHGA) task scheduler is shown in Figure 22.
Figure 22. PHGA model with two FMs
Two FMs have been implemented. These FMs take in the two new offspring
provided by CMM, evaluate their fitness and pass the evaluated offspring to the MC. This
concurrency reduces the time required to evaluate the fitness of two chromosomes.
Combining this parallelization with the existing pipelining, the performance of the GA
scheduler can be significantly improved over that of other implementations. The
performance results obtained from this implementation are presented in the next section.
Results and Discussions
The PHGA was implemented using VHDL on an Altera Cyclone EP1C12Q240
device. The improvement in the performance of the GA scheduler is recorded for PSGs 5
and 8 described in Chapter 4.
Similar to the HGA, load imbalance due to solution quality will not affect the
execution time for the PHGA because of pipelining. Also, the concurrency from the two
FMs reduces the overall evaluation time of individuals by almost half. Additionally, since
the nodes coexist in the same reconfigurable hardware device, the communication cost is
negligible. These architectural features explain the improved performance seen by this
implementation.
To measure the speedup for this implementation, tests were performed on PSGs 5
and 8 using the GA configuration of population size 140, crossover rate 0.8 and mutation
rate of 0.6. It is observed that for PSG 8 the speedup achieved by PHGA over HGA was
of the order of 2.3 and for PSG 5 it was of the order of 1.6. Table 4 shows the resource
utilization for PSG 5 and 8 for PHGA.
Table 4. Resource utilization for the PHGA

Device features     Value      PSG 5           PSG 8
Logic Elements      12060      10837 (89%)     10340 (85%)
RAM Blocks          52
Total RAM bits      239616     131072 (54%)    131072 (54%)
User IO pins        173        4 (2%)          4 (2%)
CHAPTER 8
CONCLUSIONS AND FUTURE WORK
The goal of this thesis is to apply a GA to the task scheduling problem and to
improve the overall execution time of the algorithm using hardware acceleration. This
section compares the performance of different implementations of the scheduler with
respect to the software sequential implementation. Finally, the future direction of the
research is presented.
Results and Conclusions
Table 5 displays the speedup obtained for each implementation of the scheduler
for the test task graphs with respect to the software sequential implementation.
Table 5. Speedup for scheduler implementations

GA Implementation      PSG 3          PSG 4           PSG 5          PSG 8
Software sequential    1.0            1.0             1.0            1.0
PGA                    2.8 (7 FMs)    3.75 (7 FMs)    1.6 (5 FMs)    1.3 (5 FMs)
HGA                    No Result      No Result       2.2            1.0
PHGA                   No Result      No Result       3.4 (2 FMs)    2.3 (2 FMs)
For the software implementations, it can be concluded that significant speedup on
the order of 2-3 can be achieved by parallelizing the fitness evaluation using 5-7 slave
processors. It was noted that for large task graphs, such as PSG 3 and 4, more nodes are
required for achieving high speedup, whereas the same speedup can be achieved using
fewer slave nodes for the smaller, simpler task graphs such as PSGs 5 and 8. In the hardware implementation, it is observed that the HGA did not provide any performance enhancement for PSG 8, for reasons discussed in Chapter 6, though its performance remains comparable, while a performance enhancement was noted for PSG 5. The PHGA provides high speedups due to the advantages of parallelism and pipelining. FPGAs with higher resources can also be used to optimize the algorithm implemented in the thesis to
achieve a higher operating frequency. This has been confirmed by testing the same
algorithm on the Stratix device which provided a speedup of 2-3 for PSG 5 and 8.
Future Work
The PHGA has been shown to provide an improvement over the performance of
the software implementations for most of the task graphs used in this thesis. However,
different software parallelization schemes are available for parallelizing a GA as
discussed in Chapter 5. Future work might involve comparing the performance of these
other schemes. Also, hardware implementations of these schemes can be evaluated
regarding performance and resource utilization. Some of these are described by Scott
(1993).
Another area for future consideration involves the use of different search
techniques which can be incorporated to speed up the penalty function. Implementing
these techniques might require extra resources thus limiting them to reconfigurable
devices having a large amount of resources.
Finally, a considerable amount of research is taking place in the field of hybrid
algorithms. Hybrid algorithms combine the characteristics of two or more scheduling
algorithms. Some popular algorithms used in hybrid techniques are list-based algorithms
and simulated annealing. These algorithms are known to converge to a local optimum
faster than a GA. Hence, a hybrid algorithm can be implemented where the simulated
annealing algorithm can perform local searches on sub-populations in slave nodes and
send the results to the GA running on the master node, which in turn uses these solutions
to search for the global optimum.
REFERENCES
Abielmona, R., & Groza, V. (2001). Circuit Synthesis Evolution Using a Hardware-Based Genetic Algorithm. Canadian Conference on Electrical and Computer
Engineering. 2, 963-968.
Adam, T. L., Chandy, K. M., & Dickson, J. R. (1974). A Comparison of List Schedules
for Parallel Processing Systems. Communications of the ACM. 17, 12, 685-690.
Ahmad, I., & Kwok, Y-K. (1999). On Parallelizing the Multiprocessor Scheduling
Problem. IEEE Transactions on Parallel and Distributed Systems, 16, 4, 414-432.
Aporntewan, C., & Chongstitvatana, P. (2001). A Hardware Implementation of the
Compact Genetic Algorithm. Proceedings of the IEEE Congress on Evolutionary
Computation, 624-629.
Brown, S., & Rose, J. (1996). Architecture of FPGAs and CPLDs: A Tutorial. IEEE
Design and Test of Computer, 13, 2, 42-57.
Brown, S., & Vranesic, Z. (2005). Fundamentals of Digital Logic with VHDL Design.
New York: McGraw Hill.
Burleson, W., Ko, J., Niehaus, D. , Ramamritham, K., Stankovic, J. A., Wallace, G., &
Weems, C. (1999). The Spring Scheduling Coprocessor: A Scheduling Accelerator.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7, 1, 38-47.
Cantu-Paz, E. (1998). A Survey of Parallel Genetic Algorithms. Calculateurs Paralleles,
10, 2.
Casavant, T. L., & Kuhl, J. G. (1988). A Taxonomy of Scheduling in General-Purpose
Distributed Computing Systems. IEEE Transactions on Software Engineering, 14, 2,
141-154.
Chung, Y. C., & Ranka, S. (1992). Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory
Multiprocessors. Proceedings of Supercomputing’92. 512-522.
Coffman Jr., E.G. (1975) Computer and Job Shop Scheduling Theory. New York: John
Wiley & Sons Inc.
Colin, J. Y., & Chretienne, P. (1991). C.P.M. Scheduling with Small Computation Delays
and Task Duplication. Operations Research. 680-684.
Correa, R. C., Ferreira, A., & Rebreyend, P. (1999). Scheduling Multiprocessor Tasks
with Genetic Algorithms. IEEE Transactions on Parallel and Distributed Systems,
10, 8, 825-837.
El-Rewini, H., Lewis, T.G., & Ali, H.H. (1994). Task scheduling in parallel and
distributed systems. New Jersey:Prentice-Hall Inc.
El-Rewini, H., Lewis, T.G., & Ali, H.H. (1995). Task scheduling in Multiprocessing
Systems. IEEE Computer Society Press, 28, 12, 27-37.
Gilles, D.W., & Liu, J.W.S. (1995). Scheduling tasks with AND/OR precedence
constraints. SIAM Journal on Computing, 24, 4 , 797–810.
Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Massachusetts: Addison-Wesley.
Hou, E. S. H., Ansari, N., & Ren, H. (1994). A Genetic Algorithm for Multiprocessor
Scheduling. IEEE Transactions on Parallel and Distributed Systems, 5, 2, 113-120.
James, F. (1990). A Review of Pseudorandom Number Generators. Computer Physics
Communications, 60, 329-344.
Joines, J. A., & Houck, C. R. (1994). On the Use of Non-Stationary Penalty Functions to
Solve Nonlinear Constrained Optimization Problems with GA’s. IEEE Transactions
on Evolutionary Computation, 7, 5, 445-455.
Korkmaz, T., Krunz, M., & Tragoudas, S. (2002). An Efficient Algorithm for Finding a
Path Subject to Two Additive Constraints. Computer Communications Journal, 25, 3,
225-238.
Kwok, Y.-K., & Ahmad, I. (1996). Dynamic Critical-Path Scheduling: An Effective
Technique for Allocating Task Graphs to Multiprocessors. IEEE Transactions on
Parallel and Distributed Systems, 7, 5, 506-521.
Kwok, Y.-K., & Ahmad, I. (1998). Benchmarking the Task Graph Scheduling
Algorithms. Proceedings of the 12th International Parallel Processing Symposium,
531-537.
Park, L.-J., & Park, C.-H. (1995). Application of Genetic Algorithm to Job
Shop Scheduling Problems with Active Schedule Constructive Crossover.
Proceedings of IEEE International Conference on Systems, Man, and Cybernetics,
530-535.
Lei, T., Ming-Cheng, Z., & Jing-xia, W. (2002). The Hardware Implementation of a
Genetic Algorithm Model with FPGA. Proceedings of IEEE International
Conference on Field-Programmable Technology, 374-377.
Loo, S.M., Wells, B.E., & Winningham, J.D. (2003). A Genetic Algorithm Approach to
Static Task Scheduling in a Reconfigurable Hardware Environment. Computers and
Their Applications, 36-39.
Mahmood, A. (2000). A Hybrid Genetic Algorithm for Task Scheduling in
Multiprocessor Real-Time Systems. Retrieved May 05, 2005, from http://www.ici.ro/ici/revista/sic2000_3/art05.html
Marsaglia, G., Zaman, A., & Tsang, W. W. (1990). Toward a universal random number
generator. Letters in Statistics and Probability, 9, 1, 35-39.
Martin, P. (2002). An Analysis of Random Number Generators for a Hardware
Implementation of Genetic Programming using FPGAs and Handel-C. Technical
Report, Department of Computer Science, University of Essex. 1-13.
Niehaus, D., Ramamritham, K., Stankovic, J. A., Wallace, G., Weems, C., Burleson,
W., & Ko, J. (1993). The Spring Scheduling Coprocessor: Design, Use and
Performance. IEEE Real-Time Systems Symposium, 106-111.
Parker, R. G. (1995). Deterministic Scheduling Theory. London: Chapman and Hall.
Rădulescu, A., & van Gemund, A. J. C. (2002). Fast and Effective Task Scheduling in
Heterogeneous Systems. IEEE Transactions on Parallel and Distributed Systems, 13,
3, 260-274.
Ramamritham, K., & Stankovic, J. A. (1994). Scheduling Algorithms and Operating
Systems Support for Real-Time Systems. Proceedings of the IEEE, 82, 1, 55-67.
Reade, W. (2002). Optimization with Genetic Algorithms. Retrieved February 25, 2006,
from Web site: http://engr.smu.edu/~mhd/8331f04/GA.ppt
Scott, S.D. (1994). HGA: A Hardware-Based Genetic Algorithm. Master's thesis,
University of Nebraska-Lincoln, 1994.
Scott, S.D., Samal, A., & Seth, S.C. (1995). HGA:A Hardware-Based Genetic Algorithm.
Proceedings ACM/SIGDA International Symposium Field-Programmable Gate
Arrays, 53–59.
Serra, M., Slater, T., Muzio, J. C., & Miller, D. M. (1990). The Analysis of One-Dimensional Linear Cellular Automata and Their Aliasing Properties. IEEE
Transactions on Computer Aided Design of Circuits and Systems. 9, 7, 767 – 778.
Shackleford, B., Tanaka, M., Carter, R. J., & Snider, G. (2002). FPGA Implementation of
Neighbourhood-of-Four Cellular Automata Random Number Generators.
Proceedings of the 10th ACM International Symposium on Field-Programmable Gate
Arrays. 106-112.
Solar, M., Kri, F., & Parada, V. (2001). A Parallel Genetic Scheduling Algorithm to
Generate High Quality Solutions in a Short Time. 4th Metaheuristics International
Conference, 115-119.
Talbi, E.–G., & Muntean, T. (1993). Hill-climbing, simulated annealing and genetic
algorithms: A comparative study. Proceedings of the 26th Hawaii International
Conference on Systems Science (HICSS-26), 2, 565-573.
Tang, W., & Yip, L. (2004). Hardware Implementation of Genetic Algorithm using
FPGA. The 47th IEEE International Midwest Symposium on Circuits and Systems.
549-552.
Vahid, F., & Givargis, T. (2000). Embedded System Design: A Unified Hardware/Software Introduction. New York: John Wiley & Sons Inc.
Vega-Rodriguez, M. A., Gutierrez-Gil, R., Avila-Román, J. M., Sánchez Pérez, J. M., &
Gómez Pulido, J. A. (2005). Genetic Algorithms Using Parallelism and FPGAs: The
TSP as Case Study. Proceedings of the 2005 International Conference on Parallel
Processing Workshops. 573-579.
Wang, L., Siegel, H. J., Roychowdhury, V. P., & Maciejewski, A. A. (1997). Task
Matching and Scheduling in Heterogeneous Computing Environments Using a
Genetic-Algorithm Based Approach. Journal of Parallel and Distributed Computing,
47, 8-22.
Wu, M. Y., & Gajski, D. D. (1990). Hypertool: A Programming Aid for Message-Passing
Systems. IEEE Transactions on Parallel and Distributed Systems. 1, 3, 330-343.
Yang, T., & Gerasoulis, A. (1993). List Scheduling with and without Communication
Delays. Parallel Computing Journal, 19, 1321–1344.
Yeniay, O. (2005). Penalty Function Methods for Constrained Problems Optimization
with Genetic Algorithms. Mathematical and Computational Applications, 10, 1, 45-56.
Yoshida, N., & Yasuoka, T. (1999). Multi-GAP: Parallel and Distributed Genetic
Algorithms in VLSI. Proceedings of the IEEE SMC’99 Conference Proceedings on
Systems, Man and Cybernetics. 5, 571-576.
Zomaya, A.Y., Ward, C., & Macey, B. (1999). Genetic scheduling for parallel processor
systems: comparative studies and performance issues. IEEE Transactions on Parallel
and Distributed Systems, 10, 8, 795-812.
APPENDIX A
This appendix presents the VHDL code for some of the modules that are a part of
the HGA and PHGA implementation of the GA based task scheduler.
A1. GA HEADER
PACKAGE GA_Header IS
-- Initial Parameters of GA
constant maxnumtasks    : integer;  -- maximum number of tasks
constant width_numtasks : integer;  -- Width of maximum number of tasks
constant maxfitness     : integer;  -- Maximum fitness
constant precfitness    : integer;  -- Weight for precedence
constant priofitness    : integer;  -- Weight for priority
constant maxpop         : integer;  -- maximum size of population
constant maxnumgen      : integer;  -- maximum number of generations
constant width_membus   : integer;  -- width of memory bus
constant rngseed        : STD_LOGIC_VECTOR (width_rng DOWNTO 0)
                          := "1001011011001100";  -- Random number generator seed
constant numtasks       : integer;  -- number of tasks in task graph
constant popsize        : integer := 140;  -- population size
constant probx          : STD_LOGIC_VECTOR(width_prob DOWNTO 0)
                          := "110010";  -- crossover rate
constant probm          : STD_LOGIC_VECTOR(width_prob DOWNTO 0)
                          := "100110";  -- mutation rate
--Data structure for task string
type taskn IS ARRAY (0 TO numtasks) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
--Data structure for a chromosome
type chromosome IS
record
    tasknum : taskn;
    fit     : STD_LOGIC_VECTOR(15 DOWNTO 0);
END record;
--Data structure for a task
type task IS
record
    tnumber   : STD_LOGIC_VECTOR(7 DOWNTO 0);
    tetime    : STD_LOGIC_VECTOR(7 DOWNTO 0);
    tpriority : STD_LOGIC_VECTOR(7 DOWNTO 0);
    numofpred : STD_LOGIC_VECTOR(7 DOWNTO 0);
    pred      : taskn;
    numofsuc  : STD_LOGIC_VECTOR(7 DOWNTO 0);
    suc       : taskn;
    lpriority : integer range 0 to 21;
    lflag     : STD_LOGIC;
END record;
--Data structure for a task graph
type taskgraph IS ARRAY (0 TO numtasks) OF task;
END;
package body GA_Header IS
END;
A2. MEMORY CONTROLLER MODULE
ENTITY mem_ctrl IS
PORT(
-- Basic signals to start GA scheduler
    clk       : IN STD_LOGIC;   -- clock signal
    reset     : IN STD_LOGIC;   -- reset signal
    start     : IN STD_LOGIC;   -- start GA signal
    -- Signal to start RTG, IPG, RNG modules
    init      : OUT STD_LOGIC;  -- initiate module signal
    -- Signals for memory read write
    datain    : IN STD_LOGIC_VECTOR(7 DOWNTO 0);    -- data from mem
    address   : OUT STD_LOGIC_VECTOR(13 DOWNTO 0);  -- addr for r/w
    dataout   : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);   -- data to mem
    rw        : OUT STD_LOGIC;                      -- read/write signal
    -- Signals for reading Task graph
    reqtg     : IN STD_LOGIC;                       -- request from RTG
    acktg     : OUT STD_LOGIC;                      -- acknowledge RTG
    addrTG    : IN STD_LOGIC_VECTOR(13 DOWNTO 0);   -- addr from RTG
    valout    : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);   -- data to RTG
    -- Signals for writing new evaluated population from FM
    old_new   : IN STD_LOGIC;                       -- select new population address signal
    reqFM     : IN STD_LOGIC;                       -- request from FM
    ackFM     : OUT STD_LOGIC;                      -- acknowledge FM
    valin     : IN STD_LOGIC_VECTOR(7 DOWNTO 0);    -- data to be written
    addrFM    : IN STD_LOGIC_VECTOR(13 DOWNTO 0);   -- address of data
    GA_donein : IN STD_LOGIC;                       -- GA found best schedule signal
    -- Signals for reading old population for PS module
    reqseq    : IN STD_LOGIC;                       -- request from PS
    addrseq   : IN INTEGER RANGE 0 to popsize;      -- address of member
    ackseq    : OUT STD_LOGIC;                      -- acknowledge to PS
    seqout    : OUT chromosome;                     -- send member to PS
    -- Signal for indicating GA is done scheduling
    GA_done   : OUT STD_LOGIC                       -- GA done scheduling
);
END mem_ctrl;
ARCHITECTURE behavior OF mem_ctrl IS
TYPE states IS(start1, start2, idle, readTG, seqpop, seqpop1, FMinitial, FMwrite,
done1, done2, write1, write2, read1, read2, readseq1, readseq2,
readseq3, readseq4, clkdone, clkdone1, clkdone2, clkdone3);
SIGNAL state: states;
constant popbase0 : INTEGER := 1153;
constant popbase1 : INTEGER := 7954;
constant popbase2 : INTEGER := 32;
constant bestaddr : STD_LOGIC_VECTOR(13 DOWNTO 0) := "11111111110000";
BEGIN
main:
PROCESS(clk,reset,GA_donein,old_new)
VARIABLE popbase   : INTEGER range 0 TO 7955;
variable clkcntvec : STD_LOGIC_VECTOR(31 DOWNTO 0);
variable chromo    : chromosome;
variable chromsize : integer range 0 to 32;
variable psizetemp : INTEGER RANGE 0 to popsize;
-- flags and counters referenced below; these declarations are inferred from their use in this listing
variable flag, flag1, stcnt : STD_LOGIC;
variable clkcnt, tempcnt, cnt : INTEGER;
BEGIN
IF reset = '0' THEN
init <= '0';
GA_done <= '0';
rw <= '0';
flag := '0';
flag1 := '0';
ackFM <= '0';
ackseq <= '0';
clkcnt := 0;
chromsize := 0;
tempcnt := 0;
cnt := 0;
state <= start1;
ELSIF (clk'EVENT AND clk = '1' AND clk'LAST_VALUE = '0') THEN
CASE state IS
WHEN start1 =>
init <= '0';
rw <= '0';
flag := '0';
ackFM <= '0';
acktg <= '0';
ackseq <= '0';
chromsize := 0;
IF start = '0' THEN
GA_done <= '0';
state <= start2;
END IF;
WHEN start2 =>
init <= '0';
GA_done <= '0';
rw <= '0';
flag := '0';
acktg <= '0';
ackFM <= '0';
ackseq <= '0';
chromsize := 0;
stcnt := '0';
IF start = '0' THEN
init <= '1';
state <= idle;
END IF;
WHEN idle =>
flag := '0';
flag1 := '0';
rw <= '0';
IF reqtg = '1' THEN
state <= readTG; -- read task graph
ELSIF GA_donein = '1' THEN
stcnt := '0';
clkcntvec := conv_std_logic_vector(clkcnt,32);
state <= clkdone; -- count number of clocks
ELSIF reqFM = '1' THEN
stcnt := '1';
state <= FMinitial; -- initialize FM
ELSIF reqseq = '1' THEN
chromsize := 0;
state <= seqpop; -- sequence through population
END IF;
WHEN readTG =>
IF reqtg = '1' THEN
popbase := 0;
acktg <= '1';
address <= conv_std_logic_vector(popbase + conv_integer((addrTG)), 14);
state <= read1;
END IF;
WHEN read1 =>
rw <= '1';
state <= read2;
WHEN read2 =>
valout <= datain;
rw <='1';
dataout <= datain;
acktg <= '0';
state <= idle;
WHEN seqpop =>
IF reqseq = '1' THEN
IF old_new = '0' THEN
popbase :=
popbase1;
ELSE
popbase := popbase0;
END IF;
ackseq <= '1';
state <= seqpop1;
END IF;
WHEN seqpop1 =>
IF reqseq = '0' THEN -- chromosome number to be read available
chromsize := 0;
psizetemp := addrseq;
state <= readseq1;
END IF;
WHEN readseq1 =>
IF chromsize < numtasks THEN -- read tasks
address <= conv_std_logic_vector((popbase + 34*(psizetemp) +
chromsize), 14);
flag := '0';
flag1 := '0';
ELSE -- read fitness value
IF flag1 = '0' THEN
address <= conv_std_logic_vector(popbase + ((32*(psizetemp +
1))+(psizetemp*2)), 14);
flag := '1';
ELSE
address <= conv_std_logic_vector(popbase + ((32*(psizetemp +
1))+(psizetemp*2)+1), 14);
flag := '1';
END IF;
END IF;
state <= readseq2;
WHEN readseq2 =>
rw <= '1';
state <= readseq3;
WHEN readseq3 =>
IF flag = '0' THEN -- read task numbers
chromo.tasknum(chromsize) := datain;
rw <= '1';
dataout <= datain;
chromsize := chromsize + 1;
state <= readseq4;
ELSE
IF flag1 = '0' THEN
-- read upper byte of fitness value
chromo.fit(15 DOWNTO 8) := datain;
rw <= '1';
dataout <= datain;
flag1 := '1';
state <= readseq4;
ELSE
chromo.fit(7 DOWNTO 0) := datain;-- read lower byte of fitness
rw <= '1';
dataout <= datain;
seqout <= chromo;
flag := '0';
flag1 := '0';
ackseq <= '0'; -- reading chromosome complete
state <= idle;
END IF;
END IF;
WHEN readseq4 =>
rw <= '0';
state <= readseq1;
WHEN FMinitial =>
IF reqFM = '1' THEN
IF old_new = '0' THEN
popbase := popbase0;
ELSE
popbase := popbase1;
END IF;
ackFM <= '1';
state <= FMwrite;
END IF;
WHEN FMwrite =>
IF reqFM = '0' THEN
address <= conv_std_logic_vector(popbase + conv_integer(('0' &
addrFM)), 14);
dataout <= valin;
state <= write1;
END IF;
WHEN write1 =>
rw <= '1';
state <= write2;
WHEN write2 =>
IF flag = '0' THEN
rw <= '0';
ackFM <= '0';
state <= idle;
ELSE
IF cnt = 0 THEN
state <= clkdone1;
ELSIF cnt = 1 THEN
state <= clkdone2;
ELSIF cnt = 2 THEN
state <= clkdone3;
ELSE
state <= done1;
END IF;
END IF;
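-- clkdone..clkdone3 write the 32-bit clock-cycle count one byte at a time to
-- consecutive addresses starting at bestaddr, reusing the write1/write2
-- handshake; after the fourth byte the FSM proceeds to done1.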
WHEN clkdone =>
address <= bestaddr;
dataout <= clkcntvec(31 DOWNTO 24);
flag := '1';
state <= write1;
WHEN clkdone1 =>
address <= bestaddr+1;
dataout <= clkcntvec(23 DOWNTO 16);
flag := '1';
cnt := 1;
state <= write1;
WHEN clkdone2 =>
address <= bestaddr+2;
dataout <= clkcntvec(15 DOWNTO 8);
flag := '1';
cnt := 2;
state <= write1;
WHEN clkdone3 =>
address <= bestaddr+3;
dataout <= clkcntvec(7 DOWNTO 0);
flag := '1';
cnt := 3;
state <= write1;
WHEN done1 =>
rw <= '0';
state <= done2;
WHEN done2 =>
init <= '0';
rw <= '0';
ackFM <= '0';
GA_done <= '1';
flag := '0';
acktg <= '0';
ackseq <= '0';
chromsize := 0;
state <= start1;
END CASE;
IF stcnt /= '0' THEN
clkcnt := clkcnt+1;
END IF;
END IF;
END PROCESS main;
END behavior;
A3. RANDOM NUMBER GENERATOR MODULE
ENTITY rand_gen IS
PORT(
clk       : IN STD_LOGIC;  -- clock signal
load      : IN STD_LOGIC;  -- load signal from MC
-- Random signals to CMM
domut     : OUT STD_LOGIC_VECTOR(width_prob DOWNTO 0);
docross   : OUT STD_LOGIC_VECTOR(width_prob DOWNTO 0);
mutpt1    : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
mutpt2    : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
crosspt1  : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
crosspt2  : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
-- Random signals to SM
select1   : OUT STD_LOGIC_VECTOR(width_pop DOWNTO 0);
select2   : OUT STD_LOGIC_VECTOR(width_pop DOWNTO 0);
-- Random signal to IPG
rand_init : OUT INTEGER RANGE 0 TO numtasks);
END rand_gen;
ARCHITECTURE behavior OF rand_gen IS
TYPE states IS (idle, init, toint, tomod);
SIGNAL state: states;
BEGIN
PROCESS(clk,load)
variable rn_vec1,rn_vec2 : STD_LOGIC_VECTOR(width_rng DOWNTO 0);
variable evenodd : STD_LOGIC;
variable rn_int1: INTEGER RANGE 0 to 65535;
variable b : INTEGER := 0;
constant c : INTEGER := numtasks;
constant d : REAL := 65535.0;
variable rn_int2: INTEGER RANGE 0 to 9;
BEGIN
IF LOAD = '0' THEN
state <= idle;
ELSIF rising_edge ( CLK ) THEN
CASE state IS
WHEN idle =>
rn_vec1 := rngseed;
state <= init;
WHEN init =>
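-- One update step of the 16-bit random word: roughly, each bit becomes the
-- XOR of its neighbours (with '0' substituted at the ends), and every
-- alternate bit also XORs in its own previous value, a cellular-automaton
-- style rule.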
rn_vec2(15) := '0' XOR rn_vec1(15) XOR rn_vec1(14);
rn_vec2(14) := rn_vec1(15) XOR rn_vec1(14) XOR rn_vec1(13);
evenodd := '0';
FOR i IN 13 DOWNTO 1 LOOP
rn_vec2(i) := rn_vec1(i+1) XOR rn_vec1(i-1);
IF evenodd = '1' THEN
rn_vec2(i) := rn_vec2(i) XOR rn_vec1(i);
END IF;
evenodd := NOT evenodd;
END LOOP;
rn_vec2(0) := rn_vec1(1) XOR '0';
IF evenodd = '1' THEN
rn_vec2(0) := rn_vec2(0) XOR rn_vec1(0);
END IF;
rn_vec1 := rn_vec2;
state <= toint;
WHEN toint =>
rn_int1 := 0;
rn_int2 := 0;
b := 1;
FOR i in 0 to 15 LOOP
IF rn_vec1(i) = '1' THEN
rn_int1 := rn_int1 + b;
END IF;
b := b*2;
END LOOP;
state <= tomod;
WHEN tomod =>
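-- All random outputs are derived from the same 16-bit word: its integer value
-- modulo numtasks seeds the IPG, while differently shifted slices give the
-- mutation/crossover probabilities and points and the selection indices.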
rn_int2 := rn_int1 mod c;
rand_init <= rn_int2;
domut <= rn_vec1(width_rng DOWNTO (width_rng-width_prob));
mutpt1 <= conv_std_logic_vector(((conv_integer('0' & (rn_vec1((width_rng-7) DOWNTO
          (width_rng - width_numtasks - 7))))) mod numtasks), (width_numtasks + 1));
mutpt2 <= conv_std_logic_vector(((conv_integer('0' & (rn_vec1((width_rng-9) DOWNTO
          (width_rng - width_numtasks - 9))))) mod numtasks), (width_numtasks + 1));
docross <= rn_vec1(width_rng-2 DOWNTO width_rng-width_prob-2);
crosspt1 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-8
DOWNTO width_rng-width_numtasks-8)) mod numtasks),
width_numtasks+1);
crosspt2 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-4
DOWNTO width_rng-width_numtasks-4)) mod numtasks),
width_numtasks+1);
pselect1 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-3
DOWNTO width_rng-width_pop-3)) mod popsize),
width_pop+1);
pselect2 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-1
DOWNTO width_rng-width_pop-1)) mod popsize),
width_pop+1);
select1 <= conv_std_logic_vector(((conv_integer(rn_vec1(width_rng-5
DOWNTO width_rng-width_pop-5)) mod (popsize-1)) + 1),
width_pop+1);
select2 <= conv_std_logic_vector(((conv_integer(rn_vec1(width_rng-6
DOWNTO width_rng-width_pop-6)) mod (popsize-1)) + 1),
width_pop+1);
state <= init;
END CASE;
END IF;
END PROCESS;
END behavior;
A4. READ TASK GRAPH MODULE
ENTITY readTG IS
PORT(
clk          : IN STD_LOGIC;                       -- clock signal
-- Signals to and from MC
inittg       : IN STD_LOGIC;                       -- initiate RTG
ackmc        : IN STD_LOGIC;                       -- acknowledge from MC
reqmc        : OUT STD_LOGIC;                      -- request to MC
datain       : IN STD_LOGIC_VECTOR(7 DOWNTO 0);    -- data from MC
addrmc       : OUT STD_LOGIC_VECTOR(13 DOWNTO 0);  -- address of TG
-- Signals to and from FM
tasktableout : OUT taskgraph;                      -- task table to FM
initFM       : OUT STD_LOGIC);                     -- initiate FM
END readTG;
ARCHITECTURE behavior OF readTG IS
TYPE states IS (idle, read1, readpred, awaitackmem1, awaitackmem2, filltable, loop1, loop2,
loop3,loop4,loop5,loop6,done);
SIGNAL state: states;
SIGNAL tg : taskgraph;
BEGIN
PROCESS(clk,inittg,tg)
variable tempcnt             : INTEGER RANGE 0 TO 32;
variable temppred, i, j, k   : INTEGER RANGE 0 TO 32;
variable count1, test2, temp : INTEGER RANGE 0 TO 40;
variable count2              : STD_LOGIC_VECTOR(13 DOWNTO 0);
variable addr                : STD_LOGIC_VECTOR(13 DOWNTO 0);
variable data                : STD_LOGIC_VECTOR(7 DOWNTO 0);
BEGIN
IF inittg = '0' THEN
reqmc <= '0';
tempcnt := 0;
temppred := 0;
count1 := 0;
count2 := "00000000000000";
initFM <= '0';
state <= idle;
ELSIF rising_edge ( CLK) THEN
CASE state IS
WHEN idle =>
reqmc <= '0';
tempcnt := 0;
temppred := 0;
count1 := 0;
count2 := "00000000000000";
initFM <= '0';
i := 0;
j := 0;
k := 0;
state <= read1;
WHEN read1 =>
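-- read1 / awaitackmem1 / awaitackmem2 fetch the task graph one byte at a time
-- through the MC handshake; count1 selects the field being filled (task
-- number, execution time, priority, number of predecessors, then the
-- predecessor list via readpred).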
IF tempcnt < numtasks THEN
tg(tempcnt).lflag <= '0';
tg(i).lpriority <= 0;
addr := count2;
addrmc <= addr;
reqmc <= '1'; -- request to Mem ctrller
state <= awaitackmem1;
ELSE
tempcnt := 0;
i := 0;
state <= filltable;
END IF;
WHEN awaitackmem1 =>
IF ackmc ='1' THEN
reqmc <= '0';
state <= awaitackmem2;
END IF;
WHEN awaitackmem2 =>
IF ackmc ='0' THEN
IF count1 = 0 THEN
tg(tempcnt).tnumber <= datain;
count1 := count1 + 1;
count2 := count2 + 1;
state <= read1;
ELSIF count1 = 1 THEN
tg(tempcnt).tetime <= datain;
count1 := count1 + 1;
count2 := count2 + 1;
state <= read1;
ELSIF count1 = 2 THEN
tg(tempcnt).tpriority <= datain;
count1 := count1 + 1;
count2 := count2 + 1;
state <= read1;
ELSIF count1 = 3 THEN
tg(tempcnt).numofpred <= datain;
count1 := count1 + 1;
count2 := count2 + 1;
temppred := 0;
state <= readpred;
ELSE
tg(tempcnt).pred(temppred) <= datain;
count1 := count1 + 1;
count2 := count2 + 1;
temppred := temppred + 1;
state <= readpred;
END IF;
END IF;
WHEN readpred =>
IF temppred < conv_integer(tg(tempcnt).numofpred)THEN
addr := count2;
addrmc <= addr;
reqmc <= '1';
state <= awaitackmem1;
ELSE
count1 := 0;
tempcnt := tempcnt + 1;
state <= read1;
END IF;
WHEN filltable =>
IF tempcnt < numtasks THEN
i := 0;
state <= loop1;
ELSE
i := 0;
state <= loop4;
END IF;
WHEN loop1 =>
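-- loop1..loop3 assign the static level priorities: a task with no
-- predecessors gets the highest priority H_P; otherwise, once all of its
-- predecessors have been labelled, its priority is derived from the
-- priorities of those predecessors (computed in loop2/loop3).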
IF i < numtasks THEN
IF tg(i).lflag = '0' THEN
IF tg(i).numofpred = "00000000" THEN
tg(i).lpriority <= H_P;
tg(i).lflag <= '1';
tempcnt := tempcnt + 1;
i := i + 1;
state <= loop1;
ELSE
test2 := 1;
temp := 0;
j := 0;
state <= loop2;
END IF;
ELSE
i := i + 1;
state <= loop1;
END IF;
ELSE
state <= filltable;
END IF;
WHEN loop2 =>
IF j < conv_integer(tg(i).numofpred) THEN
k := 0;
state <= loop3;
ELSE
IF temp = conv_integer(tg(i).numofpred) THEN
tg(i).lpriority <= H_P - test2;
tg(i).lflag <= '1';
tempcnt := tempcnt + 1;
i := i + 1;
state <= loop1;
END IF;
i := i + 1;
state <= loop1;
END IF;
WHEN loop3 =>
IF k < numtasks THEN
IF tg(i).pred(j) = tg(k).tnumber THEN
IF tg(k).lflag /= '0' THEN
IF tg(k).numofpred /= "00000000" THEN
IF test2 <= (H_P - tg(i).lpriority) THEN
test2 := H_P - tg(k).lpriority + 1;
temp := temp + 1;
ELSE
temp := temp + 1;
END IF;
ELSE
temp := temp + 1;
END IF;
END IF;
END IF;
k := k+1;
state <= loop3;
ELSE
j:= j+1;
state <= loop2;
END IF;
WHEN loop4 =>
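-- loop4..loop6 build each task's successor list by scanning every other
-- task's predecessor list: whenever task i appears as a predecessor of task
-- j, j is appended to i's successor list.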
IF i < numtasks THEN
j := 0;
tg(i).numofsuc <= "00000000";
state <= loop5;
ELSE
state <= done;
END IF;
WHEN loop5 =>
IF j < numtasks THEN
k := 0;
state <= loop6;
ELSE
i := i + 1;
state <= loop4;
END IF;
WHEN loop6 =>
IF k < conv_integer(tg(j).numofpred) THEN
IF tg(i).tnumber = tg(j).pred(k) THEN
tg(i).suc(conv_integer(tg(i).numofsuc)) <= tg(j).tnumber;
tg(i).numofsuc <= tg(i).numofsuc + 1;
k := k + 1;
j := j+1 ;
state <= loop5;
ELSE
k := k+1;
state <= loop6;
END IF;
ELSE
j := j + 1;
state <= loop5;
END IF;
WHEN done =>
reqmc <= '0';
tasktableout <= tg;
initFM <= '1';
END CASE;
END IF;
END PROCESS;
END behavior;
A5. INITIAL POPULATION GENERATOR MODULE
ENTITY GA_initdata1 IS
PORT(
clk       : IN STD_LOGIC;                    -- clock signal
initipg   : IN STD_LOGIC;                    -- initiate IPG
randint   : IN INTEGER RANGE 0 TO numtasks;  -- random number from RNG
-- Signals to FM
ackfm     : IN STD_LOGIC;                    -- acknowledge from FM
reqfm     : OUT STD_LOGIC;                   -- request to FM
chromoout : OUT chromosome);                 -- member to FM
END GA_initdata1;
ARCHITECTURE behavior OF GA_initdata1 IS
TYPE states IS (idle, init1,init2,awaitackfm1,awaitackfm2,initrand1, initrand2, done);
SIGNAL state: states;
BEGIN
PROCESS(clk,initipg)
variable chromo : chromosome;
variable tog : STD_LOGIC := '0';
variable psizetemp : integer RANGE 0 TO 200;
variable chromsize : integer RANGE 0 TO 32;
variable tempcnt : integer RANGE 0 TO 32;
variable index1, index2 : integer RANGE 0 TO 32;
variable temp : STD_LOGIC_VECTOR(7 DOWNTO 0);
BEGIN
IF initipg = '0' THEN
reqfm <= '0';
state <= idle;
psizetemp := 0;
chromsize := 0;
tempcnt := 0;
ELSIF rising_edge (CLK) THEN
CASE state IS
WHEN idle =>
state <= init1;
WHEN init1 =>
IF psizetemp < popsize THEN
state <= init2;
ELSE
state <= done;
END IF;
WHEN init2 =>
IF chromsize < numtasks THEN
chromo.tasknum(chromsize) := conv_std_logic_vector((chromsize + 1), 8);
chromsize := chromsize + 1;
state <= init1;
ELSE
chromo.fit := conv_std_logic_vector(50000,16);
tempcnt := 0;
state <= initrand1;
END IF;
WHEN initrand1 =>
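-- initrand1/initrand2 shuffle the identity ordering 1..numtasks by performing
-- numtasks swaps, each pairing position tempcnt with a random position
-- supplied by the RNG.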
IF tempcnt < numtasks THEN
index1 := tempcnt;
index2 := randint;
state <= initrand2;
ELSE
reqfm <= '1';
state <= awaitackfm1;
END IF;
WHEN initrand2 =>
temp := chromo.tasknum(index1);
chromo.tasknum(index1) := chromo.tasknum(index2);
chromo.tasknum(index2) := temp;
tempcnt := tempcnt+1;
state <= initrand1;
WHEN awaitackfm1 =>
IF ackfm ='1' THEN
chromoout <= chromo;
reqfm <= '0';
state <= awaitackfm2;
END IF;
WHEN awaitackfm2 =>
IF ackfm = '0' THEN
psizetemp := psizetemp +1;
IF psizetemp < popsize THEN
state <= init2;
ELSE
state <= done;
END IF;
END IF;
WHEN done =>
reqfm <= '0';
END CASE;
END IF;
END PROCESS;
END behavior;
A6. FITNESS MODULE
ENTITY GA_Fitness IS
PORT(
clk         : IN STD_LOGIC;                       -- clock signal
-- Signals to and from RTG
initfm      : IN STD_LOGIC;                       -- start FM
tasktablein : IN taskgraph;                       -- task graph table
-- Signals to and from IPG
reqipg      : IN STD_LOGIC;                       -- request from IPG
dataipg     : IN chromosome;                      -- data from IPG
ackipg      : OUT STD_LOGIC;                      -- acknowledge IPG
-- Signals to and from MC
ackmc       : IN STD_LOGIC;                       -- acknowledge from MC
reqmc       : OUT STD_LOGIC;                      -- request to MC
addrmc      : OUT STD_LOGIC_VECTOR(13 DOWNTO 0);  -- address of data
dataout     : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);   -- data
old_new     : OUT STD_LOGIC;                      -- old/new population toggle
best_done   : OUT STD_LOGIC;                      -- best schedule obtained
-- Signals to and from CMM
reqxover    : IN STD_LOGIC;                       -- request from CMM
child1      : IN chromosome;                      -- child1 from CMM
child2      : IN chromosome;                      -- child2 from CMM
ackxover    : OUT STD_LOGIC;                      -- acknowledge to CMM
-- Signals to and from SM
sof         : OUT integer RANGE 0 TO 10000000;    -- Sum of fitness to SM
initGAmod   : OUT STD_LOGIC);                     -- Start PS, SM and CMM modules
END GA_Fitness;
ARCHITECTURE behavior OF GA_Fitness IS
TYPE states IS (idle0, idle1, idle, fm1,lp1,lp2,lp3,lp4,lp5,lp6,lp7,lp8,lp9,lp10,lp11,lp12,
fm2, ackmem1_1, ackmem1_2, ackmem2, done1, done2, done3, bestdone);
SIGNAL state: states;
SIGNAL bestfit : chromosome;
SIGNAL best : STD_LOGIC;
BEGIN
PROCESS(bestfit.fit)
BEGIN
IF bestfit.fit = "1100001101010000" THEN
best <= '1';
ELSE
best <= '0';
END IF;
END PROCESS;
PROCESS(clk,initfm,best)
variable individual, ch1, ch2 : chromosome;
variable tg                   : taskgraph;
variable tog                  : STD_LOGIC;
variable psizetemp            : integer RANGE 0 TO 200;
variable chromsize            : integer RANGE 0 TO 32;
variable addr                 : STD_LOGIC_VECTOR(13 DOWNTO 0);
variable data                 : STD_LOGIC_VECTOR(7 DOWNTO 0);
variable data1                : STD_LOGIC_VECTOR(15 DOWNTO 0);
variable flag, ipgdone        : STD_LOGIC;
variable cnt, cntchild        : STD_LOGIC;
variable RQ                   : taskn;
variable EQ                   : STD_LOGIC_VECTOR((numtasks-1) DOWNTO 0);
variable sumfit1, sumfit2     : integer RANGE 0 TO 10000000;
variable i,j,k,l,m,n,p,rqc    : integer RANGE 0 TO 32;
variable pri, prec            : integer RANGE 0 TO 45000;
variable bestnum              : integer RANGE 0 TO 200;
BEGIN
IF initfm = '0' THEN
reqmc <= '0';
ackipg <= '0';
best_done <= '0';
bestfit.fit <= "0000000000000000";
psizetemp := 0;
chromsize := 0;
tog := '0';
old_new <= tog;
flag := '0';
cnt := '0';
cntchild := '0';
initGAmod <= '0';
ipgdone := '0';
state <= idle0;
ELSIF rising_edge(CLK) THEN
CASE state IS
WHEN idle0 =>
tg := tasktablein;
state <= idle1;
WHEN idle1 =>
psizetemp := 0;
chromsize := 0;
ipgdone := '0';
sumfit1 := 0;
sumfit2 := 0;
state <= idle;
WHEN idle =>
flag := '0';
prec := precfitness;
pri := priofitness;
cntchild := '0';
IF best = '1' THEN
state <= bestdone;
ELSIF ipgdone = '0' THEN
IF reqipg = '1' THEN
IF psizetemp < popsize THEN
cntchild := '1';
ackipg <= '1';
state <= fm1;
END IF;
END IF;
ELSE
IF reqxover = '1' THEN
IF psizetemp < popsize THEN
ackxover <= '1';
state <= fm1;
END IF;
END IF;
END IF;
WHEN fm1 =>
flag := '0';
cnt := '0';
IF ipgdone = '0' THEN
IF reqipg = '0' THEN
individual := dataipg;
i := 0;
rqc := 0;
state <= lp1;
END IF;
ELSIF reqxover = '0' THEN
IF cntchild = '0' THEN
ch1 := child1;
ch2 := child2;
individual := ch1;
i := 0;
rqc := 0;
state <= lp1;
ELSE
individual := ch2;
i := 0;
rqc := 0;
state <= lp1;
END IF;
END IF;
WHEN lp1 =>
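-- lp1 initialises the evaluation of one chromosome: RQ is the ready queue and
-- EQ the "already scheduled" flags; every task with no predecessors is seeded
-- into the ready queue.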
ackipg <= '0';
IF cntchild = '1' THEN
ackxover <= '0'; -- acknowledge that children are copied
IF psizetemp = (popsize-1) THEN
initGAmod <= '0';
END IF;
END IF;
IF i <numtasks THEN
RQ(i) := "00000000";
EQ (i):= '0';
IF tg(i).numofpred = 0 THEN
RQ(rqc) := tg(i).tnumber;
rqc := rqc + 1;
END IF;
i := i+1;
state <= lp1;
ELSE
i := 0;
state <= lp2;
END IF;
WHEN lp2 =>
IF i < numtasks THEN
j := 0;
state <= lp3;
ELSE
IF prec /= precfitness THEN
IF prec > 0 THEN
IF prec < priofitness THEN
individual.fit := conv_std_logic_vector((maxfitness/1000),
16);
ELSE
IF pri < 0 THEN
individual.fit := conv_std_logic_vector((prec + pri),16);
ELSE
individual.fit := conv_std_logic_vector((prec + pri),16);
END IF;
END IF;
ELSE
individual.fit := conv_std_logic_vector((maxfitness/1000),16);
END IF;
ELSE
individual.fit := conv_std_logic_vector((prec + pri),16);
END IF;
chromsize := 0;
state <= fm2;
END IF;
WHEN lp3 =>
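-- lp3 checks whether the i-th task of the chromosome is currently in the
-- ready queue. If it is, lp4/lp5 subtract a priority penalty when a
-- higher-priority ready task was bypassed and lp6..lp10 move newly enabled
-- successors into the ready queue; if it is not ready, lp11/lp12 subtract a
-- precedence penalty proportional to the number of tasks remaining after
-- position i.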
IF j < rqc THEN
IF individual.tasknum(i) = RQ(j) THEN
k := 0;
state <= lp4;
ELSE
IF j = (rqc-1) THEN
EQ(conv_integer(individual.tasknum(i))-1) := '1';
j := numtasks;
m := 0;
n := 0;
state <= lp11;
ELSE
j := j+1;
state <= lp3;
END IF;
END IF;
ELSE
i := i + 1;
state <= lp2;
END IF;
WHEN lp4 =>
IF k < rqc THEN
IF RQ(k) /= "00000000" THEN
l := 0;
state <= lp5;
ELSE
k := k+1;
state <= lp4;
END IF;
ELSE
EQ(conv_integer(individual.tasknum(i))-1) := '1';
RQ(j) := "00000000";
j := numtasks;
state <= lp6;
END IF;
WHEN lp5 =>
IF l < numtasks THEN
IF RQ(k) = tg(l).tnumber THEN
IF tg(conv_integer(individual.tasknum(i))-1).tpriority < tg(l).tpriority
THEN
pri := pri - 50 * (numtasks - i - 1);
END IF;
END IF;
l := l + 1;
state <= lp5;
ELSE
k := k + 1;
state <= lp4;
END IF;
WHEN lp6 =>
IF conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofsuc) > 0
THEN
k := 0;
state <= lp7;
ELSE
j := j+1;
state <= lp3;
END IF;
WHEN lp7 =>
IF k < conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofsuc)
THEN
l := 0;
state <= lp8;
ELSE
j := j+1;
state <= lp3;
END IF;
WHEN lp8 =>
IF l < numtasks THEN
IF tg(conv_integer(individual.tasknum(i))-1).suc(k) = tg(l).tnumber
THEN
IF EQ(l) = '0' THEN
IF conv_integer(tg(l).numofpred) > 1 THEN
m := 0;
n := 0;
state <= lp9;
ELSE
RQ(rqc) := tg(l).tnumber;
IF rqc = (numtasks - 1) THEN
rqc := 0;
ELSE
rqc := rqc+1;
END IF;
END IF;
END IF;
END IF;
l := l+1;
state <= lp8;
ELSE
k := k + 1;
state <= lp7;
END IF;
WHEN lp9 =>
IF n < conv_integer(tg(l).numofpred) THEN
p := 0;
state <= lp10;
ELSE
IF m = conv_integer(tg(l).numofpred) THEN
RQ(rqc) := tg(m).tnumber;
IF rqc = (numtasks) THEN
rqc := 0;
ELSE
rqc := rqc + 1;
END IF;
END IF;
l := l+1;
state <= lp8;
END IF;
WHEN lp10 =>
IF p < numtasks THEN
IF tg(l).pred(n) = tg(p).tnumber THEN
IF EQ(p) = '1' THEN
m := m + 1;
END IF;
END IF;
p := p + 1;
state <= lp10;
ELSE
n := n+1;
state <= lp9;
END IF;
WHEN lp11 =>
IF n < conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofpred)
THEN
p := 0;
state <= lp12;
ELSE
IF m /= conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofpred) THEN
prec := prec - 150 * (numtasks - i -1);
END IF;
state <= lp6;
END IF;
WHEN lp12 =>
IF p < numtasks THEN
IF tg(conv_integer(individual.tasknum(i))-1).pred(n) = tg(p).tnumber
THEN
IF EQ(p) = '1' THEN
m := m + 1;
END IF;
END IF;
p := p + 1;
state <= lp12;
ELSE
n := n+1;
state <= lp11;
END IF;
WHEN fm2 =>
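-- fm2 writes the evaluated chromosome back through the MC: the task ordering
-- goes to addresses 34*psizetemp .. 34*psizetemp+numtasks-1 and the 16-bit
-- fitness follows as two bytes; the best fitness seen so far is tracked in
-- bestfit.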
IF chromsize < numtasks THEN
data := individual.tasknum(chromsize);
addr := conv_std_logic_vector((34*(psizetemp) + chromsize), 14);
chromsize := chromsize + 1;
reqmc <= '1';
state <= ackmem1_1;
ELSE
data1 := individual.fit;
IF individual.fit > bestfit.fit THEN
bestnum := psizetemp;
bestfit <= individual;
END IF;
reqmc <= '1';
state <= ackmem1_2;
END IF;
WHEN ackmem1_1 =>
IF ackmc ='1' THEN
addrmc <= addr;
dataout <= data;
reqmc <= '0';
state <= ackmem2;
END IF;
WHEN ackmem1_2 =>
IF ackmc ='1' THEN
IF cnt = '0' THEN
addr := conv_std_logic_vector(((32*(psizetemp + 1))+
(psizetemp*2)), 14);
addrmc <= addr;
dataout <= data1(15 DOWNTO 8);
cnt := '1';
flag := '0';
reqmc <= '0';
ELSE
addr := conv_std_logic_vector(((32*(psizetemp + 1))+
(psizetemp*2) +1), 14);
addrmc <= addr;
dataout <= data1(7 DOWNTO 0);
reqmc <= '0';
chromsize := 0;
sumfit1 := sumfit1 + conv_integer('0' & data1);
flag := '1';
END IF;
state <= ackmem2;
END IF;
WHEN ackmem2 =>
IF ackmc = '0' THEN
IF flag = '1' THEN -- checks IF fitness lower byte is written
psizetemp := psizetemp +1;
IF cntchild = '0' THEN
flag := '0';
prec := precfitness;
pri := priofitness;
cntchild := '1';
state <= fm1;
ELSE
IF psizetemp < popsize THEN
state <= idle;
ELSE
state <= done1;
END IF;
END IF;
ELSE
state <= fm2;
END IF;
END IF;
WHEN done1 =>
reqmc <= '0';
state <= done2;
WHEN done2 =>
reqmc <= '0';
tog := NOT(tog);
old_new <= tog;
sumfit2 := sumfit1;
state <= done3;
WHEN done3 =>
reqmc <= '0';
ipgdone := '1';
sof <= sumfit2;
sumfit1 := 0;
psizetemp := 0;
chromsize := 0;
initGAmod <= '1';
state <= idle;
WHEN bestdone =>
reqmc <= '0';
best_done <= '1';
initGAmod <= '0';
END CASE;
END IF;
END PROCESS;
END behavior;
A7. POPULATION SEQUENCER MODULE
ENTITY GA_popseq IS
PORT(
clk       : IN STD_LOGIC;                    -- clock signal
initseq   : IN STD_LOGIC;                    -- initiate signal from FM
-- Signals to and from MC
ackmc     : IN STD_LOGIC;                    -- acknowledge from MC
datain    : IN chromosome;                   -- member from MC
reqmc     : OUT STD_LOGIC;                   -- request to MC
addrmc    : OUT INTEGER RANGE 0 TO popsize;  -- address of member
-- Signals to SM
chromoout : OUT chromosome);                 -- member to SM
END GA_popseq;
ARCHITECTURE behavior OF GA_popseq IS
TYPE states IS (idle, getmember, getnext);
SIGNAL state: states;
BEGIN
PROCESS(clk,initseq)
variable member : chromosome;
variable memaddr: INTEGER RANGE 0 to popsize;
BEGIN
IF initseq ='0' THEN
state <= idle;
reqmc <= '0';
addrmc <= 0;
ELSIF rising_edge (CLK) THEN
CASE state IS
WHEN idle =>
memaddr := 0;
reqmc <= '1';
state <= getmember;
WHEN getmember =>
IF ackmc = '1' THEN
addrmc <= memaddr;
reqmc <= '0';
state <= getnext;
END IF;
WHEN getnext =>
IF ackmc = '0' THEN
member := datain;
chromoout <= member;
IF memaddr < (popsize-1) THEN
memaddr := memaddr + 1;
ELSE
memaddr := 0;
END IF;
reqmc <= '1';
state <= getmember;
END IF;
END CASE;
END IF;
END PROCESS;
END behavior;
A8. SELECTION MODULE
ENTITY GA_selection IS
PORT(
clk        : IN STD_LOGIC;                              -- clock signal
rand1      : IN STD_LOGIC_VECTOR (width_pop DOWNTO 0);  -- random select1 from RNG
rand2      : IN STD_LOGIC_VECTOR (width_pop DOWNTO 0);  -- random select2 from RNG
-- Signals from FM
initselect : IN STD_LOGIC;                              -- initiate SM signal from FM
sof        : IN integer RANGE 0 TO 10000000;            -- sum of fitness from FM
-- Signals from SM
dup        : IN STD_LOGIC;                              -- duplicate member signal from SM
datain     : IN chromosome;                             -- member from SM
-- Signals to and from CMM
ackxover   : IN STD_LOGIC;                              -- acknowledge from CMM
reqxover   : OUT STD_LOGIC;                             -- request to CMM
parent1    : OUT chromosome;                            -- parent 1 to CMM
parent2    : OUT chromosome);                           -- parent 2 to CMM
END GA_selection;
ARCHITECTURE behavior OF GA_selection IS
TYPE states IS (idle, init, getmembers,ackxover1,ackxover2);
SIGNAL state: states;
SIGNAL fitness: integer RANGE 0 TO 10000000;
SIGNAL scale : STD_LOGIC_VECTOR (width_pop DOWNTO 0);
SIGNAL scaledfit: integer RANGE 0 TO 10000000;
BEGIN
PROCESS(fitness,scale)
BEGIN
scaledfit <= fitness /(conv_integer('0' & scale));
END PROCESS;
PROCESS(clk,initselect)
variable member                 : chromosome;
variable memaddr, psizecnt      : INTEGER RANGE 0 TO popsize;
variable sum                    : integer RANGE 0 TO 10000000;
variable donea, doneb           : STD_LOGIC;
variable first, second          : STD_LOGIC := '0';
variable rnd1, rnd2             : STD_LOGIC_VECTOR (width_pop DOWNTO 0);
variable scaledfit1, scaledfit2 : integer RANGE 0 TO 10000000;
variable accum1, accum2         : integer RANGE 0 TO 10000000;
variable member1, member2       : chromosome;
BEGIN
IF initselect = '0' THEN
state <= idle;
sum := 0;
reqxover <= '0';
ELSIF rising_edge (CLK) THEN
CASE state IS
WHEN idle =>
sum := sof; -- get sum of fitness from FM
state <= init;
WHEN init => -- initialize all initial variables
IF second = '0' THEN
IF first = '0' THEN
rnd1 := rand1;
rnd2 := rand2;
fitness <= sum;
scale <= rnd1;
first := '1'; --get scaled fitness for first individual
ELSE
scaledfit1 := scaledfit; --scaled fitness for 1
fitness <= sum;
scale <= rnd2;
first := '0';
second := '1';--get scaled fitness for second individual
END IF;
ELSE
scaledfit2 := scaledfit; --scaled fitness for 2
accum1 := 0;
accum2 := 0;
donea := '0';
doneb := '0';
first := '0';
second := '0';
state <= getmembers;
END IF;
WHEN getmembers =>
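-- Roulette-wheel selection: the scaled-fitness thresholds (sum of fitness
-- divided by a random value) were computed in init; here member fitnesses
-- streamed in from the PS are accumulated until each accumulator exceeds its
-- threshold, and the members reached at those points become parent1 and
-- parent2.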
IF donea = '0' THEN -- got 1st member
IF doneb = '0' THEN -- got 2nd member
member1 := datain;
member2 := datain;
first := '1';
ELSE
member1 := datain;
first := '1';
END IF;
ELSE
IF doneb = '0' THEN
member2 := datain;
first := '1';
ELSE -- IF both members have been selected
reqxover <= '1';
state <= ackxover1;
END IF;
END IF;
IF first = '1' THEN --IF both members have not been selected
accum1 := accum1 + conv_integer('0' & member1.fit);
accum2 := accum2 + conv_integer('0' & member2.fit);
first := '0';
END IF;
IF accum1 > scaledfit1 THEN
donea := '1';
ELSIF accum2 > scaledfit2 THEN
doneb := '1';
END IF;
WHEN ackxover1 =>
IF ackxover = '1' THEN
parent1 <= member1;
parent2 <= member2;
reqxover <= '0';
state <= ackxover2;
END IF;
WHEN ackxover2 =>
IF ackxover = '0' THEN
state <= init;
END IF;
END CASE;
END IF;
END PROCESS;
END behavior;
A9. CROSSOVER AND MUTATION MODULE
ENTITY GA_xover_mut IS
PORT(
clk       : IN STD_LOGIC;                                  -- clock signal
-- Signals from RNG
rndm      : IN STD_LOGIC_VECTOR (width_prob DOWNTO 0);     -- probability of mutation
rndx      : IN STD_LOGIC_VECTOR (width_prob DOWNTO 0);     -- probability of crossover
xpoint1   : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);  -- crossover point 1
xpoint2   : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);  -- crossover point 2
mpoint1   : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);  -- mutation point 1
mpoint2   : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);  -- mutation point 2
-- Signals to and from SM
reqSM     : IN STD_LOGIC;                                  -- request from SM
parent1   : IN chromosome;                                 -- parent 1
parent2   : IN chromosome;                                 -- parent 2
ackSM     : OUT STD_LOGIC;                                 -- acknowledge to SM
-- Signals to and from FM
initxover : IN STD_LOGIC;                                  -- initiate CMM signal from FM
ackFM     : IN STD_LOGIC;                                  -- acknowledge from FM
reqFM     : OUT STD_LOGIC;                                 -- request to FM
child1    : OUT chromosome;                                -- child 1
child2    : OUT chromosome);                               -- child 2
END GA_xover_mut;
ARCHITECTURE behavior OF GA_xover_mut IS
TYPE states IS (idle, getparents, acksel, xover1, lp1, lp2, lp3, lp4, xover2, xover3,
mutation,awaitackfm1,awaitackfm2);
SIGNAL state: states;
BEGIN
PROCESS(clk,initxover)
variable c1, c2, p1, p2     : chromosome;
variable temp1, temp2       : STD_LOGIC_VECTOR(7 DOWNTO 0);
variable chk1, chk2         : STD_LOGIC_VECTOR((numtasks-1) DOWNTO 0);
variable i, j, k            : integer RANGE 0 TO 32;
variable cross1, cross2     : STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
variable mut1, mut2, x1, x2 : STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
variable test1, test2       : STD_LOGIC;
BEGIN
IF initxover ='0' THEN
state <= idle;
reqFM <= '0';
ackSM <= '0';
i := 0;
j := 0;
k := 0;
ELSIF rising_edge (CLK) THEN
CASE state IS
WHEN idle =>
reqFM <= '0';
ackSM <= '0';
state <= getparents;
WHEN getparents =>
IF reqSM = '1' THEN
ackSM <= '1';
x1 := xpoint1;
x2 := xpoint2;
mut1 := mpoint1;
mut2 := mpoint2;
state <= acksel;
END IF;
WHEN acksel =>
IF reqSM = '0' THEN
p1 := parent1;
p2 := parent2;
IF x1 <= x2 THEN
IF x1 = x2 THEN
cross1 := x1;
cross2 := x2;
state <= xover1;
ELSE
cross1 := x1;
cross2 := x2;
state <= xover1;
END IF;
ELSE
cross1 := x2;
cross2 := x1;
state <= xover1;
END IF;
END IF;
WHEN xover1 =>
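-- xover1 decides the crossover type: if the random value is below the
-- crossover probability, equal crossover points select the simple crossover
-- in lp1..lp4, otherwise the PMX-style exchange in xover2/xover3; if no
-- crossover is performed the parents are copied unchanged (with the fitness
-- field reset to a placeholder) and only mutation is applied.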
ackSM <= '0';
chk1 := conv_std_logic_vector(0,numtasks);
chk2 := conv_std_logic_vector(0,numtasks);
test1 := '0';
test2 := '0';
IF rndx < probx THEN -- perform crossover
IF cross1 = cross2 THEN -- simple crossover
i := 0;
state <= lp1;
ELSE -- PMX crossover
c1 := p1;
c2 := p2;
c1.fit := conv_std_logic_vector(50000,16);
c2.fit := conv_std_logic_vector(50000,16);
j := conv_integer('0' & cross1);
state <= xover2;
END IF;
ELSE -- do not perform crossover
c1 := p1;
c2 := p2;
c1.fit := conv_std_logic_vector(50000,16);
c2.fit := conv_std_logic_vector(50000,16);
state <= mutation;
END IF;
WHEN lp1 => -- copy tasks from parent to child before crossover point
IF i < conv_integer('0' & cross1) THEN
c1.tasknum(i) := p1.tasknum(i);
chk1(((conv_integer('0' & p1.tasknum(i)))-1)) := '1';
c2.tasknum(i) := p2.tasknum(i);
chk2(((conv_integer('0' & p2.tasknum(i)))-1)) := '1';
i := i + 1;
state <= lp1;
ELSE
i := conv_integer('0' & cross1);
j := 0;
state <= lp2;
END IF;
WHEN lp2 => -- fill in tasks after the crossover point that were not already copied
IF i < numtasks THEN
IF chk1(((conv_integer('0' & p2.tasknum(i)))-1)) = '1' THEN
c1.tasknum(i) := conv_std_logic_vector(0,8);
ELSE
c1.tasknum(i) := p2.tasknum(i);
chk1(((conv_integer('0' & p2.tasknum(i)))-1)) := '1';
END IF;
IF chk2(((conv_integer('0' & p1.tasknum(i)))-1)) = '1' THEN
c2.tasknum(i) := conv_std_logic_vector(0,8);
ELSE
c2.tasknum(i) := p1.tasknum(i);
chk2((conv_integer('0' & p1.tasknum(i))-1)) := '1';
END IF;
i := i + 1;
state <= lp2;
ELSE
i := conv_integer('0' & cross1);
j := 0;
state <= lp3;
END IF;
WHEN lp3 =>
IF i < numtasks THEN
IF c1.tasknum(i) = "00000000" THEN
IF j < numtasks THEN
IF chk1(j) = '0' THEN
c1.tasknum(i) := conv_std_logic_vector((j + 1),8);
chk1(j) := '1';
state <= lp3;
ELSE
j := j + 1;
state <= lp3;
END IF;
ELSE
j := 0;
state <= lp4;
END IF;
ELSE
j := 0;
state <= lp4;
END IF;
ELSE
state <= mutation;
END IF;
WHEN lp4 =>
IF c2.tasknum(i) = "00000000" THEN
IF j < numtasks THEN
IF chk2(j) = '0' THEN
c2.tasknum(i) := conv_std_logic_vector((j + 1),8);
chk2(j) := '1';
state <= lp4;
ELSE
j := j + 1;
state <= lp4;
END IF;
ELSE
i := i + 1;
j := 0;
state <= lp3;
END IF;
ELSE
i := i + 1;
j := 0;
state <= lp3;
END IF;
WHEN xover2 =>
IF j <= conv_integer('0'& cross2) THEN
temp1 := c1.tasknum(j);
c1.tasknum(j) := c2.tasknum(j);
c2.tasknum(j) := temp1;
k := 0;
state <= xover3;
ELSE
state <= mutation;
END IF;
WHEN xover3 =>
IF k < numtasks THEN
IF k /= j THEN
IF c1.tasknum(j) = c1.tasknum(k) THEN
c1.tasknum(k) := c2.tasknum(j);
END IF;
IF c2.tasknum(j) = c2.tasknum(k) THEN
c2.tasknum(k) := c1.tasknum(j);
END IF;
END IF;
k := k +1;
ELSE
j := j+1;
state <= xover2;
END IF;
WHEN mutation =>
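-- Swap mutation: if the random value is below the mutation probability and
-- the two mutation points differ, the tasks at those two positions are
-- exchanged in both children before the results are handed to the FM.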
IF rndm < probm THEN
IF mut1 = mut2 THEN
reqFM <= '1';
state <= awaitackfm1;
ELSE
temp1 := c1.tasknum((conv_integer('0' & mut1)));
temp2 := c2.tasknum((conv_integer('0' & mut2)));
c1.tasknum(conv_integer('0' & mut1)) := c1.tasknum(( conv_integer
('0' & mut2)));
c2.tasknum(conv_integer('0' & mut2)) := c2.tasknum((conv_integer
('0' & mut1)));
c1.tasknum((conv_integer('0' & mut2))):= temp1;
c2.tasknum((conv_integer('0' & mut1))):= temp2 ;
reqFM <= '1';
state <= awaitackfm1;
END IF;
ELSE
reqFM <= '1';
state <= awaitackfm1;
END IF;
WHEN awaitackfm1 =>
IF ackFM = '1' THEN
child1 <= c1;
child2 <= c2;
reqFM <= '0';
state <= awaitackfm2;
END IF;
WHEN awaitackfm2 =>
IF ackFM = '0' THEN
state <= getparents;
END IF;
END CASE;
END IF;
END PROCESS;
END behavior;
A10. FITNESS MODULES FOR PHGA
ENTITY GA_Fitness IS
PORT(
clk         : IN STD_LOGIC;                       -- clock signal
-- Signals to and from RTG
initfm      : IN STD_LOGIC;                       -- start FM
tasktablein : IN taskgraph;                       -- task graph table
-- Signals to and from IPG
reqipg      : IN STD_LOGIC;                       -- request from IPG
dataipg     : IN chromosome;                      -- data from IPG
ackipg      : OUT STD_LOGIC;                      -- acknowledge IPG
-- Signals to and from MC
ackmc       : IN STD_LOGIC;                       -- acknowledge from MC
reqmc       : OUT STD_LOGIC;                      -- request to MC
addrmc      : OUT STD_LOGIC_VECTOR(13 DOWNTO 0);  -- address of data
dataout     : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);   -- data
old_new     : OUT STD_LOGIC;                      -- old/new population toggle
best_done   : OUT STD_LOGIC;                      -- best schedule obtained
-- Signals to and from CMM
reqxover    : IN STD_LOGIC;                       -- request from CMM
child1      : IN chromosome;                      -- child1 from CMM
child2      : IN chromosome;                      -- child2 from CMM
ackxover    : OUT STD_LOGIC;                      -- acknowledge to CMM
-- Signals to and from SM
sof         : OUT integer RANGE 0 TO 10000000;    -- Sum of fitness to SM
initGAmod   : OUT STD_LOGIC);                     -- Start PS, SM and CMM modules
END GA_Fitness;
ARCHITECTURE behavior OF GA_Fitness IS
TYPE states IS (idle0, idle1, idle, ipgfm1, ipgfm2, xoverfm1, xoverfm2, stfm, prefm2, fm2,
ackmem1_1,ackmem1_2,ackmem2, done1, done2, done3, bestdone);
TYPE statesfm IS (lp0, lp1,lp2,lp3,lp4,lp5,lp6,lp7,lp8,lp9,lp10,lp11,lp12,donefm);
SIGNAL state                          : states;
SIGNAL statefm1, statefm2             : statesfm;
SIGNAL bestfit                        : chromosome;
SIGNAL best                           : STD_LOGIC;
SIGNAL ch1m2f, ch2m2f, ch1f2m, ch2f2m : chromosome;
SIGNAL tg                             : taskgraph;
SIGNAL st1, st2, dn1, dn2             : STD_LOGIC;
BEGIN
PROCESS(bestfit.fit)
BEGIN
IF bestfit.fit = "1100001101010000" THEN
best <= '1';
ELSE
best <= '0';
END IF;
END PROCESS;
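-- In the PHGA fitness module the evaluation is split across two identical
-- processes (statefm1 and statefm2) that score the chromosomes handed over on
-- ch1m2f and ch2m2f concurrently; the main process starts them with st1/st2,
-- waits on dn1/dn2, and collects the scored chromosomes from ch1f2m and
-- ch2f2m before writing them back through the MC.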
PROCESS(clk,st1)
variable chromo1           : chromosome;
variable tg1               : taskgraph;
variable RQ                : taskn;
variable EQ                : STD_LOGIC_VECTOR((numtasks-1) DOWNTO 0);
variable i,j,k,l,m,n,p,rqc : integer RANGE 0 TO 32;
variable pri, prec         : integer RANGE 0 TO 45000;
BEGIN
IF st1 = '0' THEN
dn1 <= '0';
statefm1 <= lp0;
ELSIF rising_edge(CLK) THEN
CASE statefm1 IS
WHEN lp0 =>
i := 0;
rqc := 0;
chromo1 := ch1m2f;
statefm1 <= lp1;
WHEN lp1 =>
prec := precfitness;
pri := priofitness;
IF i <numtasks THEN
RQ(i) := "00000000";
EQ(i) := '0';
IF tg(i).numofpred = 0 THEN
RQ(rqc) := tg(i).tnumber;
rqc := rqc + 1;
END IF;
i := i+1;
statefm1 <= lp1;
ELSE
i := 0;
statefm1 <= lp2;
END IF;
WHEN lp2 =>
IF i < numtasks THEN
j := 0;
statefm1 <= lp3;
ELSE
IF prec /= precfitness THEN
IF prec > 0 THEN
IF prec < priofitness THEN
chromo1.fit := conv_std_logic_vector((maxfitness/1000),
16);
ELSE
IF pri < 0 THEN
chromo1.fit := conv_std_logic_vector((prec + pri),16);
ELSE
chromo1.fit := conv_std_logic_vector((prec + pri),16);
END IF;
END IF;
ELSE
chromo1.fit := conv_std_logic_vector((maxfitness/1000),16);
END IF;
ELSE
chromo1.fit := conv_std_logic_vector((prec + pri),16);
END IF;
ch1f2m <= chromo1;
statefm1 <= donefm;
END IF;
WHEN lp3 =>
IF j < rqc THEN
IF chromo1.tasknum(i) = RQ(j) THEN
k := 0;
statefm1 <= lp4;
ELSE
IF j = (rqc-1) THEN
EQ(conv_integer(chromo1.tasknum(i))-1) := '1';
j := numtasks;
m := 0;
n := 0;
statefm1 <= lp11;
ELSE
j := j+1;
statefm1 <= lp3;
END IF;
END IF;
ELSE
i := i + 1;
statefm1 <= lp2;
END IF;
WHEN lp4 =>
IF k < rqc THEN
IF RQ(k) /= "00000000" THEN
l := 0;
statefm1 <= lp5;
ELSE
k := k+1;
statefm1 <= lp4;
END IF;
ELSE
EQ(conv_integer(chromo1.tasknum(i))-1) := '1';
RQ(j) := "00000000";
j := numtasks;
statefm1 <= lp6;
END IF;
WHEN lp5 =>
IF l < numtasks THEN
IF RQ(k) = tg1(l).tnumber THEN
IF tg(conv_integer(chromo1.tasknum(i))-1).tpriority < tg(l).tpriority
THEN
pri := pri - 50 * (numtasks - i - 1);
END IF;
END IF;
l := l + 1;
statefm1 <= lp5;
ELSE
k := k + 1;
statefm1 <= lp4;
END IF;
WHEN lp6 =>
IF conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofsuc) > 0
THEN
k := 0;
statefm1 <= lp7;
ELSE
j := j+1;
statefm1 <= lp3;
END IF;
WHEN lp7 =>
IF k < conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofsuc)
THEN
l := 0;
statefm1 <= lp8;
ELSE
j := j+1;
statefm1 <= lp3;
END IF;
WHEN lp8 =>
IF l < numtasks THEN
IF tg(conv_integer(chromo1.tasknum(i))-1).suc(k) = tg(l).tnumber
THEN
IF EQ(l) = '0' THEN
IF conv_integer(tg(l).numofpred) > 1 THEN
m := 0;
n := 0;
statefm1 <= lp9;
ELSE
RQ(rqc) := tg(l).tnumber;
IF rqc = (numtasks - 1) THEN
rqc := 0;
ELSE
rqc := rqc+1;
END IF;
END IF;
END IF;
END IF;
l := l+1;
statefm1 <= lp8;
ELSE
k := k + 1;
statefm1 <= lp7;
END IF;
WHEN lp9 =>
IF n < conv_integer(tg(l).numofpred) THEN
p := 0;
statefm1 <= lp10;
ELSE
IF m = conv_integer(tg(l).numofpred) THEN
RQ(rqc) := tg(m).tnumber;
IF rqc = (numtasks) THEN
rqc := 0;
ELSE
rqc := rqc + 1;
END IF;
END IF;
l := l+1;
statefm1 <= lp8;
END IF;
WHEN lp10 =>
IF p < numtasks THEN
IF tg(l).pred(n) = tg(p).tnumber THEN
IF EQ(p) = '1' THEN
m := m + 1;
END IF;
END IF;
p := p + 1;
statefm1 <= lp10;
ELSE
n := n+1;
statefm1 <= lp9;
END IF;
WHEN lp11 =>
IF n < conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofpred)
THEN
p := 0;
statefm1 <= lp12;
ELSE
IF m /= conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofpred) THEN
prec := prec - 150 * (numtasks - i -1);
END IF;
statefm1 <= lp6;
END IF;
WHEN lp12 =>
IF p < numtasks THEN
IF tg(conv_integer(chromo1.tasknum(i))-1).pred(n) = tg(p).tnumber
THEN
IF EQ(p) = '1' THEN
m := m + 1;
END IF;
END IF;
p := p + 1;
statefm1 <= lp12;
ELSE
n := n+1;
statefm1 <= lp11;
END IF;
WHEN donefm =>
dn1 <= '1';
END CASE;
END IF;
END PROCESS;
PROCESS(clk,st2)
variable chromo2                  : chromosome;
variable RQ                       : taskn;
variable EQ                       : STD_LOGIC_VECTOR((numtasks-1) DOWNTO 0);
variable i, j, k, l, m, n, p, rqc : integer RANGE 0 TO 32;
variable pri, prec                : integer RANGE 0 TO 45000;
BEGIN
IF st2 = '0' THEN
dn2 <= '0';
statefm2 <= lp0;
ELSIF rising_edge(CLK) THEN
CASE statefm2 IS
WHEN lp0 =>
i := 0;
rqc := 0;
chromo2 := ch2m2f;
statefm2 <= lp1;
WHEN lp1 =>
prec := precfitness;
pri := priofitness;
IF i <numtasks THEN
RQ(i) := "00000000";
EQ(i) := '0';
IF tg(i).numofpred = 0 THEN
RQ(rqc) := tg(i).tnumber;
rqc := rqc + 1;
END IF;
i := i+1;
statefm2 <= lp1;
ELSE
i := 0;
statefm2 <= lp2;
END IF;
WHEN lp2 =>
IF i < numtasks THEN
j := 0;
statefm2 <= lp3;
ELSE
IF prec /= precfitness THEN
IF prec > 0 THEN
IF prec < priofitness THEN
chromo2.fit := conv_std_logic_vector((maxfitness/1000),
16);
ELSE
IF pri < 0 THEN
chromo2.fit := conv_std_logic_vector((prec + pri),16);
ELSE
chromo2.fit := conv_std_logic_vector((prec + pri),16);
END IF;
END IF;
ELSE
chromo2.fit := conv_std_logic_vector((maxfitness/1000),16);
END IF;
ELSE
chromo2.fit := conv_std_logic_vector((prec + pri),16);
END IF;
ch2f2m <= chromo2;
statefm2 <= donefm;
END IF;
WHEN lp3 =>
IF j < rqc THEN
IF chromo2.tasknum(i) = RQ(j) THEN
k := 0;
statefm2 <= lp4;
ELSE
IF j = (rqc-1) THEN
EQ(conv_integer(chromo2.tasknum(i))-1) := '1';
j := numtasks;
m := 0;
n := 0;
statefm2 <= lp11;
ELSE
j := j+1;
statefm2 <= lp3;
END IF;
END IF;
ELSE
i := i + 1;
statefm2 <= lp2;
END IF;
WHEN lp4 =>
IF k < rqc THEN
IF RQ(k) /= "00000000" THEN
l := 0;
statefm2 <= lp5;
ELSE
k := k+1;
statefm2 <= lp4;
END IF;
ELSE
EQ(conv_integer(chromo2.tasknum(i))-1) := '1';
RQ(j) := "00000000";
j := numtasks;
statefm2 <= lp6;
END IF;
WHEN lp5 =>
IF l < numtasks THEN
IF RQ(k) = tg(l).tnumber THEN
IF tg(conv_integer(chromo2.tasknum(i))-1).tpriority < tg(l).tpriority
THEN
pri := pri - 50 * (numtasks - i - 1);
END IF;
END IF;
l := l + 1;
statefm2 <= lp5;
ELSE
k := k + 1;
statefm2 <= lp4;
END IF;
WHEN lp6 =>
IF conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofsuc) > 0
THEN
k := 0;
statefm2 <= lp7;
ELSE
j := j+1;
statefm2 <= lp3;
END IF;
WHEN lp7 =>
IF k < conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofsuc)
THEN
l := 0;
statefm2 <= lp8;
ELSE
j := j+1;
statefm2 <= lp3;
END IF;
WHEN lp8 =>
IF l < numtasks THEN
IF tg(conv_integer(chromo2.tasknum(i))-1).suc(k) = tg(l).tnumber
THEN
IF EQ(l) = '0' THEN
IF conv_integer(tg(l).numofpred) > 1 THEN
m := 0;
n := 0;
statefm2 <= lp9;
ELSE
RQ(rqc) := tg(l).tnumber;
IF rqc = (numtasks - 1) THEN
rqc := 0;
ELSE
rqc := rqc+1;
END IF;
END IF;
END IF;
END IF;
l := l+1;
statefm2 <= lp8;
ELSE
k := k + 1;
statefm2 <= lp7;
END IF;
WHEN lp9 =>
IF n < conv_integer(tg(l).numofpred) THEN
p := 0;
statefm2 <= lp10;
ELSE
IF m = conv_integer(tg(l).numofpred) THEN
RQ(rqc) := tg(m).tnumber;
IF rqc = (numtasks) THEN
rqc := 0;
ELSE
rqc := rqc + 1;
END IF;
END IF;
l := l+1;
statefm2 <= lp8;
END IF;
WHEN lp10 =>
IF p < numtasks THEN
IF tg(l).pred(n) = tg(p).tnumber THEN
IF EQ(p) = '1' THEN
m := m + 1;
END IF;
END IF;
p := p + 1;
statefm2 <= lp10;
ELSE
n := n+1;
statefm2 <= lp9;
END IF;
WHEN lp11 =>
IF n < conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofpred)
THEN
p := 0;
statefm2 <= lp12;
ELSE
IF m /= conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofpred) THEN
prec := prec - 150 * (numtasks - i -1);
END IF;
statefm2 <= lp6;
END IF;
WHEN lp12 =>
IF p < numtasks THEN
IF tg(conv_integer(chromo2.tasknum(i))-1).pred(n) = tg(p).tnumber
THEN
IF EQ(p) = '1' THEN
m := m + 1;
END IF;
END IF;
p := p + 1;
statefm2 <= lp12;
ELSE
n := n+1;
statefm2 <= lp11;
END IF;
WHEN donefm =>
dn2 <= '1';
END CASE;
END IF;
END PROCESS;
PROCESS(clk,initfm,best)
variable individual       : chromosome;
variable tog              : STD_LOGIC;
variable psizetemp        : integer RANGE 0 TO 200;
variable chromsize, i     : integer RANGE 0 TO 32;
variable addr             : STD_LOGIC_VECTOR(13 DOWNTO 0);
variable data             : STD_LOGIC_VECTOR(7 DOWNTO 0);
variable data1            : STD_LOGIC_VECTOR(15 DOWNTO 0);
variable flag, ipgdone    : STD_LOGIC;
variable cnt, cntchild    : STD_LOGIC;
variable sumfit1, sumfit2 : integer RANGE 0 TO 10000000;
variable bestnum          : integer RANGE 0 TO 200;
BEGIN
IF initfm = '0' THEN
reqmc <= '0';
ackipg <= '0';
best_done <= '0';
bestfit.fit <= "0000000000000000";
st1 <= '0';
st2 <= '0';
psizetemp := 0;
chromsize := 0;
tog := '0';
old_new <= tog;
flag := '0';
cnt := '0';
cntchild := '0';
initGAmod <= '0';
ipgdone := '0';
state <= idle0;
ELSIF rising_edge(CLK) THEN
CASE state IS
WHEN idle0 =>
tg <= tasktablein;
state <= idle1;
WHEN idle1 =>
psizetemp := 0;
chromsize := 0;
ipgdone := '0';
sumfit1 := 0;
sumfit2 := 0;
state <= idle;
WHEN idle =>
flag := '0';
cntchild := '0';
IF best = '1' THEN
state <= bestdone;
ELSIF ipgdone = '0' THEN
IF reqipg = '1' THEN
IF psizetemp < popsize THEN
ackipg <= '1';
state <= ipgfm1;
END IF;
END IF;
ELSE
IF reqxover = '1' THEN
IF psizetemp < popsize THEN
ackxover <= '1';
state <= xoverfm1;
END IF;
END IF;
END IF;
WHEN ipgfm1 =>
flag := '0';
cnt := '0';
IF reqipg = '0' THEN
cntchild := '1';
ch2m2f <= dataipg;
st2 <= '1';
state <= ipgfm2;
END IF;
WHEN ipgfm2 =>
flag := '0';
cnt := '0';
IF reqipg = '0' THEN
cntchild := '1';
ch2m2f <= dataipg;
st2 <= '1';
state <= stfm;
END IF;
WHEN xoverfm1 =>
flag := '0';
cnt := '0';
IF reqxover = '0' THEN
ch1m2f <= child1;
ch2m2f <= child2;
st1 <= '1';
st2 <= '1';
state <= xoverfm2;
END IF;
WHEN xoverfm2 =>
flag := '0';
cnt := '0';
IF reqxover = '0' THEN
ch1m2f <= child1;
ch2m2f <= child2;
st1 <= '1';
st2 <= '1';
state <= stfm;
END IF;
WHEN stfm =>
ackipg <= '0';
ackxover <= '0';
IF cntchild = '1' THEN
IF psizetemp = (popsize-1) THEN
initGAmod <= '0';
END IF;
ELSE
IF psizetemp = (popsize-2) THEN
initGAmod <= '0';
END IF;
END IF;
state <= prefm2;
WHEN prefm2 =>
chromsize := 0;
IF dn1 = '1' THEN
IF dn2 = '1' THEN
individual := ch2f2m;
st2 <= '0';
state <= fm2;
ELSE
individual := ch1f2m;
st1 <= '0';
state <= fm2;
END IF;
ELSIF dn2 = '1' THEN
individual := ch2f2m;
st2 <= '0';
state <= fm2;
END IF;
WHEN fm2 =>
IF chromsize < numtasks THEN
data := individual.tasknum(chromsize);
addr := conv_std_logic_vector((34*(psizetemp) + chromsize), 14);
chromsize := chromsize + 1;
reqmc <= '1';
state <= ackmem1_1;
ELSE
data1 := individual.fit;
IF individual.fit > bestfit.fit THEN
bestnum := psizetemp;
bestfit <= individual;
END IF;
reqmc <= '1';
state <= ackmem1_2;
END IF;
WHEN ackmem1_1 =>
IF ackmc ='1' THEN
addrmc <= addr;
dataout <= data;
reqmc <= '0';
state <= ackmem2;
END IF;
WHEN ackmem1_2 =>
IF ackmc ='1' THEN
IF cnt = '0' THEN
addr := conv_std_logic_vector(((32*(psizetemp + 1))+
(psizetemp*2)), 14);
addrmc <= addr;
dataout <= data1(15 DOWNTO 8);
cnt := '1';
flag := '0';
reqmc <= '0';
ELSE
addr := conv_std_logic_vector(((32*(psizetemp + 1))+
(psizetemp*2)+1), 14);
addrmc <= addr;
dataout <= data1(7 DOWNTO 0);
reqmc <= '0';
chromsize := 0;
sumfit1 := sumfit1 + conv_integer('0' & data1);
flag := '1';
END IF;
state <= ackmem2;
END IF;
WHEN ackmem2 =>
IF ackmc = '0' THEN
IF flag = '1' THEN -- checks IF fitness lower byte is written
psizetemp := psizetemp +1;
IF dn1 = '0' THEN
IF dn2 = '0' THEN
IF psizetemp < popsize THEN
state <= idle;
ELSE
state <= done1;
END IF;
ELSE
state <= prefm2;
END IF;
ELSIF dn2 = '0' THEN
state <= prefm2;
END IF;
ELSE
state <= fm2;
END IF;
END IF;
WHEN done1 =>
reqmc <= '0';
state <= done2;
WHEN done2 =>
reqmc <= '0';
tog := NOT(tog);
old_new <= tog;
sumfit2 := sumfit1;
state <= done3;
WHEN done3 =>
reqmc <= '0';
ipgdone := '1';
sof <= sumfit2;
sumfit1 := 0;
psizetemp := 0;
chromsize := 0;
initGAmod <= '1';
state <= idle;
WHEN bestdone =>
reqmc <= '0';
best_done <= '1';
initGAmod <= '0';
END CASE;
END IF;
END PROCESS;
END behavior;