HARDWARE IMPLEMENTATION OF A PARALLELIZED GENETIC ALGORITHM FOR TASK SCHEDULING by VIJAY TIRUMALAI A THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering in the Department of Electrical and Computer Engineering in the Graduate School of The University of Alabama TUSCALOOSA, ALABAMA 2006 Submitted by Vijay Tirumalai in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering. Accepted on behalf of the Faculty of the Graduate school by the thesis committee: William A. Stapleton, Ph.D. Keith A. Woodbury, Ph. D. David J. Jackson, Ph.D. Kenneth G. Ricks, Ph.D. Chairperson David J. Jackson, Ph.D. Department Head Date Ronald W. Rogers, Ph.D. Dean of the Graduate school Date ii I dedicate this thesis to my family, for their love and support iii LIST OF ABBREVIATIONS AND SYMBOLS NP Non polynomial GA Genetic algorithms TG Task graph V Finite set of vertices E Finite set of directed edges T ith computational task eij Directed edge from Ti to Tj Pi Static priority of ith computational task DAG Directed acyclic graph MCP Multiple constrained path EDF Earliest deadline first HLF Highest level first HGA Hardware genetic algorithm Fi Fitness value of ith chromosome F Average fitness of the population PMX Partially matched crossover PSG Peer selected graph msec Milliseconds PGA Parallel genetic algorithm MPI Message passing interface iv FPGA Field programmable gate array NRE Non-recurring engineering HDL Hardware description language PLD Programmable logic devices SPLD Simple PLD CPLD Complex PLD SRAM Static random access memory MC Memory controller module RNG Random number generator module IPG Initial population generator module RTG Read task graph module FM Fitness module SM Selection module CMM Crossover and mutation module PS Population sequencer VHDL Very high speed integrated circuit hardware description language CA Cellular automaton PHGA Parallel hardware based genetic algorithm v ACKNOWLEDGMENTS I would like to express my gratitude to Dr. Kenneth G. Ricks, my advisor, for his experienced counsel, and for his support, both financial and moral. Without his guidance, this thesis would have never been finished. I would also like to thank Dr. David J. Jackson for providing me with his expertise in the hardware implementation of the thesis, Dr. Keith A. Woodbury for explaining the intricacies of genetic algorithms and for providing me the access to the mechanical engineering cluster, and Dr. William A. Stapleton for guiding me through the parallel implementation of my algorithm. I would also like to extend my appreciation to my friends Praveen, Pramodh, Smita, Gokul, Jai and Uday for their invaluable suggestions and continuous moral support. Finally, I would like to thank my family for their never ending encouragement and spiritual support. vi CONTENTS LIST OF ABBREVIATIONS AND SYMBOLS……………………………………….. iv ACKNOWLEDGMENTS……………………………………………………………….. vi LIST OF TABLES……………………………………………………………………...... ix LIST OF FIGURES………………………………………………………………………. x ABSTRACT………………………………………………………………….………......xii 1. INTRODUCTION……………………………………………………………………. 1 a. Problem Overview-Task Scheduling….……………………………………………… 3 b. Solving the Scheduling Problem……………………...……………………………… 5 c. Thesis Outline……………………………………………………………………….. 10 2. GENETIC ALGORITHMS…………………………………………………………. 11 a. Overview of Genetic Algorithms……………………………………………………. 11 b. Operational Principles……………………………………………………………......13 c. 
Basic Operations………..…………………………………………………………... 14 3. SEQUENTIAL GENETIC ALGORITHMS……………………………………....... 20 a. Advanced Operations………………………………………………………………... 20 4. PARAMETRIC STUDY OF GENETIC ALGORITHMS………………………….. 26 a. Experimental Verification………………………………..…………………………. 26 b. Study of GA Configurations………………………………………………………… 29 c. Results and Discussions…….….……………………………………………………. 37 vii 5. PARALLEL GENETIC ALGORITHMS…..……………………………………….. 39 a. Overview of Parallel GA……………………………………………………………. 39 b. Software Implementation of GA……………………………………………………. 42 c. Results and Discussions…………………………………………………………….. 43 6. HARDWARE-BASED GA TASK SCHEDULER MODEL………………………. 46 a. Motivation for the HGA system…………………………………………………….. 46 b. Overview of the Field Programmable Gate Arrays…………………………………. 48 c. Previous Research Work Related to HGA…………………………………………... 50 d. Basic HGA model…………………………………………………………………… 51 e. Overview of the System……………………………………………………………... 52 f. Module Description…………………………………………………………………. 54 g. Pipelining……………………………………………………………………………. 59 h. Results and Discussions…………………………………………………………….. 60 7. PARALLEL HARDWARE-BASED GA TASK SCHEDULER MODEL………… 62 a. PHGA Model………………………………………………………………………... 62 b. Results and Discussions…………………………………………………………….. 64 8. CONCLUSIONS AND FUTURE WORK…………………………………………… 65 a. Results and Conclusions…………………………………………………………….. 65 b. Future Work…………………………………………………………………………. 66 REFERENCES……………………………………………….…..………………………68 APPENDIX A…………………………………………………………………………… 73 viii LIST OF TABLES 1. Task schedule for random graphs…….……………………………………………....37 2. Task schedules for peer selected graphs……………………………………………...37 3. Cyclone EP1C12Q240 device features and resource utilization……………………. 60 4. Resource utilization for PHGA……………….………………..……………………. 64 5. Speedup for scheduler implementations…………………………………………...... 65 ix LIST OF FIGURES 1. Task graph and its schedule …………………………………………………….……. 5 2. Search space example…………………….…………………………….…………….. 7 3. Structure of GA……………….……………………………………………………... 14 4. Permutation encoding mechanism..……………………………………….………… 15 5. Roulette wheel selection operation………………………….………………………. 16 6. Single-point crossover operation……………………………………….………….... 17 7. Partially matched crossover operation……………………………………….……… 24 8. Number swapping mutation operation………….……………………………….…. 24 9. Random graphs…………………………….………………………………………... 27 10. Peer selected graphs……………………………………………………….……..….. 28 11. Average execution time (msec) vs population size…………………………………..31 12. Average execution time (msec) vs crossover rate……………………………………33 13. Average execution time (msec) vs mutation rate..………………………………….. 34 14. Average execution time (msec) vs elitism………………………………………….. 35 15. Master-slave implementation ………………………………………………………..40 16. Asynchronous concurrent implementation …………………………………………. 40 17. Distributed or network implementation …………………………………………….. 40 18. Speedup vs number of processors…………………………………………………… 44 19. Structure of an FPGA..……………………………………………………………….49 20. Simple HGA model……………………………..…………………………………....53 x 21. Course-grained pipeline………………………..……………………………………. 59 22. PHGA model with two FMs…………………..…………………………………….. 63 xi ABSTRACT Computing systems play a vital role in our society, influencing both industry and academia, and the spectrum of applications executing on these systems and their complexity varies widely. Future generations of computing systems are expected to execute processes which are more complex and larger in size. 
The tasks comprising these applications will have many constraints associated with their execution, and scheduling of such tasks using exact methods will become intractable or Non Polynomial (NP) complete. The NP completeness of the task scheduling problem has incited the use of various heuristics to obtain good solutions in a reasonable amount of time. In this thesis one such meta-heuristic known as a genetic algorithm (GA) is used. GAs are generic, domain independent search and optimization techniques based on the theory of evolution. GAs have been successfully applied to many classes of scheduling problems. However, with the increase in the size of the scheduling problem, the time complexity of the software implementation of a GA becomes high. To reduce the time complexity without compromising the solution quality, GAs have been parallelized. However, parallelization demands additional resources leading to a poor cost performance tradeoff. The advent of hardware technologies and design tools in the field of reconfigurable devices has enabled us to achieve better cost performance tradeoff. This thesis explores one such approach where a parallelized hardware-based GA task scheduler is implemented and its performance is compared to that of parallelized software-based GA. xii CHAPTER 1 INTRODUCTION Computing systems play a vital role in our society, influencing both industry and academia. They control manufacturing processes, laboratory experiments, automobile engines, and process control. The spectrum of their complexity varies widely from the very simple to the very complex. A process model of computation is often used for describing the behavior of any application that needs to be executed on such computing systems. In this model, the complete system application is partitioned into individual entities, called processes or tasks, each representing a schedulable unit of execution. The tasks must be scheduled for execution on a given set of processing elements. The problem of sequencing the tasks according to a certain policy (criteria) depending on the application, utilizing a given set of available resources is commonly referred to as the task scheduling problem. The methods used to schedule tasks vary widely and are dependent upon the architecture of the target processing system as well as the task characteristics. Some task characteristics that impact the scheduling methodology include task dependencies, periodicity, task priority, task deadline, task execution time, and task preemption. Architectural influences on the scheduling methodology might include uniprocessor and multiprocessor implementations as well as distributed and centralized approaches. 1 2 Future generations of computing problems and applications are expected to become more complex, distributed and larger in size (number of tasks). The tasks generally will have many types of constraints associated with them, and scheduling of such tasks will become nearly intractable using exact methods such as branch and bound and integer programming techniques (Lae-Jeong and Cheol-Hoon, 1995). It has been shown that a multiple constrained scheduling problem is Non Polynomial (NP)-complete and cannot be exactly solved in polynomial time (Korkmaz, Krunz and Tragoudas, 2002 and Zomaya, Ward and Macey, 1999). 
The NP-completeness of the problem has incited the application of many heuristics or stochastic search techniques such as genetic algorithms (GAs), simulated annealing, multi-step algorithms, algorithms using local search, and list-based heuristics to provide reasonable solutions for restricted instances of the scheduling problem in a reasonable amount of time (Zomaya et al., 1999 and RĖadulescu and Gemund, 2002). However, as the size of the problem increases, the performance of the software implementation of these techniques deteriorates, and additional resources are needed to meet the specified performance requirements. In order to achieve cost-performance tradeoff, much research has focused on exploring the parallel and hardware implementations of various software scheduling algorithms. It has been shown that significant performance gain can be achieved by implementing a few or all of the components of a system utilizing these techniques (Loo, Wells and Winningham, 2003). Similarly, the focus of this thesis is the improvement of scheduling performance using parallelization and hardware acceleration. 3 Problem Overview - Task Scheduling Computing systems have evolved considerably over the past few decades, and so has the magnitude and complexity of the work to be processed by them. Although existing computing systems are highly efficient, a good scheduling policy can improve the productivity of the system for a given application (Parker, 1995). The scheduling policy involves optimizing some desired performance criteria (like makespan or space complexity) by managing the access to and the use of the resources by various consumers satisfying some pre-defined constraints (like priority or dependency) (El-Rewini, Lewis and Ali, 1994). A typical scheduling model has three important characteristics. First, there are different task qualities and various constraints which dictate their sequence of execution. Secondly, there are metrics that guide scheduling decisions to achieve a desired performance level for a given application. Finally, the machine environment on which the task set has to be scheduled must be defined (Coffman, 1975). There are many open research scheduling models/problems available in the field of computing. To isolate and focus on the evaluation of the performance of a hardware-based task scheduler, various assumptions have been made to simplify the scheduling problem. In this thesis, the scheduling problem involves determining an execution order for a task set that maintains predefined priority relationships and predefined partial orders of execution among the tasks in a uniprocessor environment. The partial order relationships are defined by a set of precedence constraints that are independent of the priority relationships among the tasks. The tasks can be independent, i.e. there is no direct communication between them, else they are dependent. If two tasks are dependent, the 4 source of communication is called the parent task, and the recipient of communication is called the child task. Upon completion, a parent task sends output to all its children simultaneously. A child task receives all input required from its parent(s) before beginning execution. Each task may have several predecessors, and may not begin execution until all its predecessors have completed their execution. Such tasks are called AND tasks and the partial order over them an AND-only dependency graph (Gilles and Liu, 1995). Once a task begins executing, it executes to completion without interruption. 
This is called non-preemptive execution. Each instance of a task is called a job. The communication delays among tasks are assumed to be negligible. Based on the above description, a set of partially ordered computational tasks can be represented by a directed acyclic task graph, TG = (V, E), consisting of a finite nonempty set of vertices, V, and a finite set of directed edges, E, connecting the vertices. The collection of vertices, V = {T1, T2, ..., Tm}, represents the set of tasks to be executed, and the directed edges, E = {eij} (where eij denotes a directed edge from vertex Ti to Tj), imply that a partial ordering or precedence relation, >>, exists between the tasks. That is, if Ti >> Tj then task Ti must be completed before Tj can be initiated (Hou, Ansari and Ren, 1994). Each vertex also has a static priority, Pi, assigned to task Ti, and Pi > Pj implies that a ready job of task Ti will be allocated to an available processor before a ready job of task Tj. The execution order of ready jobs of two different tasks having equal priorities is determined arbitrarily. In the case where the priority relationship implies that Tj should execute before Ti while the precedence relationship implies that Ti >> Tj, both relationships cannot be satisfied, and the precedence relationship overrides the priority relationship. The reason for this is that the priority relationship Pj > Pi only defines the order in which jobs of tasks Ti and Tj leave the ready queue, whereas the precedence relations define the order in which they enter the ready queue. The goal of this research is to generate a feasible schedule which satisfies all the precedence and priority constraints within the task set. Figure 1 shows a task graph and its respective schedule, where a higher priority has a higher value.

Figure 1. Task graph and its schedule

Solving the Scheduling Problem

For a directed acyclic graph (DAG) TG = (V, E), each edge is associated with two non-negative additive values. In the present study these values represent the number of priority and precedence constraint mismatches. The problem is to find a path p from a source node to a destination node such that all the edges along the path produce zero mismatches by complying with the two constraints. This problem is the well known Multiple Constrained Path Selection (MCP) problem (Korkmaz et al., 2002). The specific task scheduling problem studied here is a variation of the MCP problem, wherein a path passing through all the nodes represents a possible schedule, and the algorithm used to obtain such a path is termed the task scheduling algorithm. It is assumed that an independent task has an incoming edge from every other node, i.e. an independent task can be inserted into a path after any other node, based on its priority value, and is ready to be executed at any given instant of time. Generally speaking, scheduling algorithms vary widely depending on the architecture of the computing system and the characteristics of the tasks to be scheduled. Broadly, these algorithms have been classified as static and dynamic. When task characteristics are known a priori, the tasks of the application can be scheduled statically, i.e. at compile time. When the structure of the application is unpredictable, the scheduling activities are performed at run-time; such scheduling is dynamic scheduling. In this thesis, the scheduling methodology considered is static scheduling.
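To make this formulation concrete, the task graph and the two mismatch counts for a candidate schedule can be sketched in C as follows. This is only an illustration: the structure names, the fixed 20-task limit, and the decision to count a priority mismatch only between tasks that are not directly related by a precedence edge are choices made for the example rather than details of the scheduler implemented in this thesis.

#include <stdio.h>

#define MAX_TASKS 20   /* illustrative limit, matching the largest graphs used later */

/* Task graph TG = (V, E): prec[i][j] != 0 encodes an edge eij, i.e. Ti >> Tj. */
typedef struct {
    int num_tasks;
    int priority[MAX_TASKS];          /* static priority Pi of task Ti   */
    int prec[MAX_TASKS][MAX_TASKS];   /* precedence (adjacency) matrix   */
} TaskGraph;

/* Precedence mismatches: pairs where Tj is scheduled before Ti although
   Ti >> Tj requires the opposite order. */
static int precedence_misses(const TaskGraph *tg, const int sched[])
{
    int pos[MAX_TASKS], misses = 0;
    for (int k = 0; k < tg->num_tasks; k++)
        pos[sched[k]] = k;                        /* position of each task in the schedule */
    for (int i = 0; i < tg->num_tasks; i++)
        for (int j = 0; j < tg->num_tasks; j++)
            if (tg->prec[i][j] && pos[i] > pos[j])
                misses++;
    return misses;
}

/* Priority mismatches: a task scheduled ahead of a higher-priority task to
   which it is not directly related by a precedence edge. */
static int priority_misses(const TaskGraph *tg, const int sched[])
{
    int misses = 0;
    for (int a = 0; a < tg->num_tasks; a++)
        for (int b = a + 1; b < tg->num_tasks; b++) {
            int ti = sched[a], tj = sched[b];     /* ti runs before tj */
            if (!tg->prec[ti][tj] && !tg->prec[tj][ti] &&
                tg->priority[tj] > tg->priority[ti])
                misses++;
        }
    return misses;
}

int main(void)
{
    /* Three tasks: A(0) precedes C(2), B(1) is independent; priorities B > A > C. */
    TaskGraph tg = { .num_tasks = 3, .priority = { 2, 3, 1 } };
    tg.prec[0][2] = 1;

    int sched[] = { 0, 1, 2 };                    /* candidate schedule A, B, C */
    printf("precedence misses: %d, priority misses: %d\n",
           precedence_misses(&tg, sched), priority_misses(&tg, sched));
    return 0;
}

A schedule is feasible exactly when both counts are zero, which corresponds to the zero-mismatch path condition described above.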
A more general overview of task scheduling algorithms can be found in (Casavant and Kuhl, 1988) and (Ramamritham and Stankovic, 1994). In static task scheduling, the application is represented by a DAG as previously described. The scheduling problem equates to finding a path from the source task to the destination task that complies with both the priority and precedence constraints. This path is called the task schedule for the TG. Solving this specific problem is known to be NPcomplete. In fact, there is no efficient polynomial-time algorithm that can find an optimal solution which simultaneously satisfies both the constraints even for more restricted cases (Korkmaz et al., 2002). This has led to the development of many algorithms that search for solution from among all possible solutions. For large task sets, the search space (solution space containing all the possible solutions) becomes so large that searching such a solution space for the optimal solution becomes very exhaustive and time consuming. 7 An example of a complex search space is shown in Figure 2. In these cases, the goal becomes to locate a sub-optimal solution in a reasonable amount of time. Local maxima Global maxima Figure 2. Search space example (Reade, 2002) Sub-optimal algorithms can be further classified as approximate and heuristic. For a sub-optimal approximate algorithm, instead of searching the entire solution space for an optimal solution, it is satisfactory to obtain a good solution that has been determined based on some criterion function. There are many approaches that can be applied to static approximate scheduling which include queuing theory, graph theoretic approaches, mathematical programming and enumeration, and state-space. The second category under sub-optimal algorithms is heuristics, which make the most realistic assumptions about a priori knowledge concerning process and system characteristics. Heuristics make use of special parameters which affect the system performance in an indirect way, and this 8 alternate parameter is much easier to monitor or evaluate (Casavant and Kuhl, 1988 and Kwok and Ahmad, 1996). A heuristic is said to be better than another heuristic if solutions approach optimality more often or if a near-optimal solution is obtained in less time. A heuristic that can optimally schedule a particular task set on a certain target machine may not produce optimal schedules for other task sets on other machines. As a result, several heuristics have been proposed, each of which may work under different circumstances. At the moment since heuristics offer the most successful solutions to scheduling problems, considerable research is being done to create various types of heuristic-based scheduling algorithms (Zomaya et al., 1999). Giving a detailed description of each of these algorithms is exhaustive and outside the scope of this thesis, but a general classification and some of the most popular algorithms need to be mentioned. Heuristic algorithms may be divided in two main classes. First, the general purpose optimization algorithms independent of the given optimization problem and, second, the heuristic approaches specifically designed for a particular problem. The area of interest in this thesis is the first class of algorithms. Some widely used techniques that belong to this class are list scheduling, hill climbing algorithm, simulated annealing and genetic algorithms. 
In list scheduling, all the ready tasks are arranged in decreasing order of priority before the scheduling process begins. During the scheduling process, as the processor becomes available, the task with the highest priority is selected from the list and assigned to the processor. The algorithms belonging to this class basically differ in the method employed to assign the priority to the task. Some popular methods of assigning priorities are based on earliest deadline first (EDF algorithm), shortest period 9 first (Rate Monotonic algorithm), highest level first (HLF algorithm) and critical path (MCP algorithm). Another class of heuristic is the insertion scheduling heuristic which is an improvement over list heuristics. It tries to assign ready tasks to the idle time slots resulting from communication delays. Another heuristic studied by researchers is the mapping heuristic which incorporates adaptive routing to minimize communication delays in inter-connect topologies (Ramamritham and Stankovic, 1994, Kwok and Ahmad, 1996, Zomaya et al., 1999, and Correa, Ferreira, and Rebreyend, 1999). The hillclimbing algorithm is typically used to find the global minimum in convex solution spaces. Most often it is rather a local instead of a global minimum which is found. Simulated annealing offers a way to overcome this major drawback of hill-climbing but the price to pay is a large computation time. Also, simulated annealing is rather sequential in nature; its parallelization is quite a difficult task. More distributed optimization techniques, inherently parallel, have also been considered. Some of them are closely related to neural network algorithms and evolutionary algorithms (El-Rewini, Lewis and Ali, 1995 and Talbi and Muntean, 1993). One such class of algorithm is the meta-heuristic known as a genetic algorithm (GA), a guided random search method where elements in a given set of solutions are randomly combined and modified iteratively until some termination condition is achieved (Correa et al., 1999). This heuristic is being used in this thesis and is discussed further in the next chapter. 10 Thesis Outline This thesis is organized into eight chapters. In Chapter 2 a detailed description of genetic algorithms including their operational principles and applications is provided. In Chapter 3 the advanced operators used to enhance the performance of the software implementation of a sequential GA is presented. Chapter 4 presents the parametric analysis of the GA along with the description of the procedure used to experimentally verify the functionality of the GA. Chapter 5 presents the motivation for implementing the parallelized version of the GA in software with a brief overview of the various parallel models available. The parallelization of GA is then supported by the speedup results obtained for different sizes of cluster, followed by a discussion on the choice of cluster size. Chapter 6 discusses the motivation behind the hardware-based scheduling system, presents previous work related to this field, and a general model of the system. It also presents the results obtained for various test cases. In Chapter 7, the parallel implementation of the Hardware Genetic Algorithm (HGA) is presented and the results obtained for the test cases are presented and discussed. 
Finally, in Chapter 8 a performance comparison of the software implementation of a simple sequential GA, a parallel GA, the hardware implementation of sequential GA and parallel hardware-based GA is presented, along with possible future work in this field. CHAPTER 2 GENETIC ALGORITHMS This chapter begins by providing a brief overview of GAs and their applications, followed by a detailed description of their operation. Further, a detailed description of each module and its role in the operation of a GA is presented. Also, various methods of implementing a GA and its operators are discussed to provide a complete view of the research area. Overview of Genetic Algorithms Genetic algorithms are generic, domain independent search and optimization techniques which borrow from nature the basic concept of “survival of the fittest and natural selection”, as stated in Darwin’s evolutionary theory. According to this theory, the stronger or the more fit individuals survive while the weaker individuals die or get eliminated. This phenomenon eventually tends to transfer characteristics of the dominant individuals to the next generation. Over a number of generations, individuals comprising the population attain features which make them more fit, enabling them to withstand the external pressure of the environment (Mahmood, A., 2000). In a way analogous to the evolutionary process, GAs simulate the process of natural selection and genetic recombination to guide a highly exploitative search through a coding of a parameter space for a near optimal solution, avoiding convergence to false peaks (local optima). 11 12 GAs are randomized techniques which, based on probabilistic transition rules and historical information, speculate new search points with expected improvement in solutions (Zomaya et al., 1999). The reason for their success and a wide and ever growing research field is a combination of power and flexibility along with the robustness and simplicity the algorithms exhibit. They are being applied successfully to find acceptable solutions to problems in business, engineering, and science. Genetic algorithms differ from the traditional problem solving methods in four major ways described below. 1. They work with a coding of the parameter rather than the actual parameter. 2. They work from a population of strings instead of a single point. 3. They use payoff (objective function) information, not derivatives or other auxiliary knowledge. 4. They use probabilistic transition rules, not deterministic rules. These characteristics make GAs more robust, powerful and efficient compared to their traditional counterparts (Goldberg, 1989). Genetic algorithms have been applied to many applications in the area of computer science, engineering, business and the social sciences. Some successful application areas include job shop scheduling, VLSI circuit layout compaction, transportation scheduling, neural network design, image processing and pattern recognition, traveling salesman problem, economic forecasting, nonlinear equation solving, calibration of population migration and many more (Goldeberg, 1989). In the field of task scheduling a considerable amount of research has been done in the implementation of GAs for obtaining a near optimal solution. It has been successfully 13 applied to the multiprocessing scheduling problem in real-time systems, scheduling jobs on computational grids, scheduling tasks on parallel processors, and task scheduling on a heterogeneous computing environment. 
Operational Principles In a GA, the offspring are produced by standard genetic operators: selection, crossover, and mutation. In each generation, a selection scheme is used to stochastically select the survivors to the next generation according to their fitness values as evaluated by a problem-based user defined function. With this artificial evolution, the fitness of the solutions gradually improves generation by generation. The GA process starts with a random population and iterates until a termination condition is met (Aporntewan and Chongstitvatana, 2001). The operation of a GA can be summarized in the following manner. The first step in a GA is to encode any possible solution to an optimization problem as a set of strings and then derive a random initial population, which acts as the first generation from which the evolution starts. After that the fitness of each individual, also called a chromosome, in the population is evaluated. Once the fitness has been evaluated, chromosomes are selected for crossover from the current population. Random characteristics are then introduced after crossover into the population using mutation, based on some probability. This process of selection, crossover and mutation continues until a new population has been generated. Once the new population has been generated its fitness is evaluated and it replaces the old population. If the termination criterion is not met, the new population 14 undergoes another iteration of selection, crossover, mutation and fitness evaluation (Wang, Siegel, Roychowdhary and Maciejewski, 1997). The basic structure of a simple GA is shown in Figure 3. 1. [Start] Encode and generate initial population 2. [Fitness] Evaluate the fitness of initial population 3. [Test] If termination condition is met stop ELSE 4. [New population] Create a new population 4.1. [Selection] 4.2. [Crossover] 4.3. [Mutation] 4.4. [Fitness] 5. [Test] If termination condition is met stop ELSE 6. [Replace] Update old population 7. [Loop] Goto to step 4 Figure 3. Structure of a GA The next section provides a detailed description of GA modules and their operation emphasizing their significance in the process of optimization. Basic Operations A GA starts with a pool of feasible solutions (gene pool) to a specific problem, encoded as strings, known as chromosomes, based on a particular representation scheme. At each iteration, a set of genetic operators (crossover and mutation) defined over the population are applied and new populations of solutions are created, with the more fit solutions more likely to procreate. The basic operations of a GA are described below. 15 Encoding. This is the initial step in a GA where in a specific mechanism is employed for uniquely mapping chromosome values onto the decision variable for a given problem. The most commonly used representation for chromosomes in a GA is the binary code {0, 1}. Other representations include value encoding (ternary, integer and real valued), tree encoding and permutation encoding. In this thesis, permutation encoding has been employed, which is mostly used for ordering problems like the traveling salesman problem and the task ordering problem (Zomaya et al., 1999 and Wang et al., 1997). An example of permutation encoding is shown in Figure 4 where a set of individuals are generated by reordering the sequence of numbers 1-7 and each sequence represents an order of execution of tasks. Chromosome 1 1 2 3 4 5 6 7 Chromosome 2 3 2 1 5 7 6 4 Figure 4. 
Permutation encoding mechanism Initial Population Generation. The initial population represents the starting set of individuals that need to be evolved using the GA. These individuals are usually generated randomly, but if prior knowledge about the system is known then such information can be utilized to produce a better initial population. This can help to promote a faster convergence without biasing the search. Fitness and Objective Function. The objective function provides the mechanism for evaluating each chromosome. In the case of a minimization problem, as is the case in this thesis, the more fit individuals will have a lower numerical value for their objective function. The fitness function converts/normalizes the objective function value, 16 transforming it into a relative measure of fitness in a given range. This value is in turn used by the selection mechanism (Zomaya et al., 1999). Selection. This operation mimics nature’s survival of the fittest mechanism, which ensures that more fit solutions survive while the weaker ones perish. It is more likely that a more fit string will produce a higher number of offspring, increasing its chances of survival. There are many methods for selecting the best chromosomes. The simplest form of selection is roulette wheel selection. In this method, a string with a fitness value of Fi is allocated a relative fitness value of Fi/F, where F is the average fitness value of the population. The GA uses a roulette wheel style of selection to implement this proportional selection scheme as shown in Figure 5. Chromosome 4 Chromosome 3 Chromosome 1 Chromosome 2 Figure 5. Roulette wheel selection In the above figure, each chromosome is allocated a sector of the roulette wheel, with the sector size calculated based on the relative fitness of the respective chromosome. In the above case, chromosome 1 has the maximum fitness, followed by chromosome 2 with chromosome 3 and 4 having the same fitness. The probability of selection of an individual is provided by the relative fitness of the chromosome. The count value of the 17 expected number of an individual is determined by multiplying a randomly generated number in the range of 0-(population size) with the relative fitness of that individual, and then taking the integer portion of the product. Some other schemes are stochastic reminder with and without replacement, tournament selection and rank selection. In this thesis, stochastic reminder without replacement method is used (Zomaya et al., 1999). It is described in Chapter 3. Crossover/Recombination. The crossover (also called recombination) mechanism stochastically causes the exchange of genetic material between the selected parents to produce new individuals having some portions of both the parents’ genetic material. The probability of occurrence of crossover is user defined and is called the crossover rate. A typical value of crossover rate is in the range of 0.5 – 1.0. The simplest form of crossover is the single-point crossover, where a crossover point is chosen stochastically and the substrings are swapped between the parents across that point to create new individuals. Simple crossover is shown in Figure 6. Chromosome 1 3 2 1 6 5 4 Chromosome 2 1 2 3 4 5 6 Offspring 1 3 2 1 4 5 6 Offspring 2 1 2 3 6 5 4 Crossover Point Figure 6. Single-point crossover operation 18 It is not necessary that the parts that contribute most to the fitness of an individual be contained in the substrings formed. 
Such a disruptive nature instigates exploration of new areas of the search space instead of converging into local maxima. This makes GAs robust. There are other methods of crossover depending on the type of encoding and type of application such as two point, multipoint, uniform, arithmetic, order, cycle and partially matched crossover (Zomaya et al., 1999, and Goldeberg, 1989). In this thesis partially matched crossover is used. It will be described in detail in the advanced operator section in Chapter 3. Mutation. This mechanism causes a small alteration of the chromosome stochastically. Mutation is applied uniformly on the entire population, with a random probability called the mutation rate, which for a GA is typically very low in the range of 0.1 – 0.4 percent. Most researchers believe mutation plays a secondary role in the GA. It reintroduces divergence in a converging population, ensures the probability of finding an optimal solution is never zero, and recovers good genetic material if lost through selection and crossover (Zomaya et al., 1999). Typical methods of performing mutation are bit inversion, order changing, adding or subtracting small values at the mutation point and number swapping. Replacement. Once the new population has been generated and evaluated, a scheme is needed to replace the existing members of the old population with new members. Some methods include the generational scheme where the new members replace their parents, steady state scheme where the new members replace the weaker 19 members of the population, and incremental scheme where the new members are simply added to the existing population provided enough memory is available (Scott, 1994). Termination. A GA is terminated depending on the expected output. It is possible that a GA is terminated after a preset number of populations have been obtained, or the population’s average fitness has reached a threshold fitness value. Other termination conditions include the best individual having reached a threshold fitness value, or some degree of homogeneity being obtained in the population. Often a combination of the above mentioned approaches is used (Zomaya et al., 1999). Depending on the particular optimization problem, the GA parameters and operators need to be chosen to provide better performance. Although a GA with simple operators is a very powerful tool, there are ways to improve its performance by applying some advanced genetic operators and using some variations of the GA. The advanced genetic operators and functions used in implementing the sequential GA for performing task scheduling in this thesis are discussed in the next chapter. CHAPTER 3 SEQUENTIAL GENETIC ALGORITHMS A GA in itself is a robust and powerful tool and has been used to solve many optimization problems. But with the complexity associated with some specific optimization problems, it is not desirable to use the standard genetic operators discussed in Chapter 2, and special operators need to be utilized to improve the performance of the GA. Advanced Operations The execution time of the GA can be defined as the time taken by the algorithm to converge to an optimal solution. This time reflects on the quality of the solutions comprising each generation. If the quality of the solutions is poor, i.e. the individuals do not concur to the constraints imposed, the GA will take more time to converge to the best solution. 
Some of the techniques and operators used to improve the quality of solutions, and thereby the execution times, of the sequential GA implemented in this thesis are described below.

Population initialization. Instead of randomly generating the initial population, some individuals known to be in the vicinity of the global maximum can be included to speed up the convergence. In this thesis the initial population contains individuals which have been sorted in accordance with their priority, along with the randomly generated individuals.

Fitness function. As previously mentioned, task scheduling is a multiple constrained problem. A GA is directly applicable only to unconstrained optimization problems. Hence, it is necessary to use some additional methods that will keep the solutions in the region of feasible schedules. Traditional GAs have incorporated constraints in the form of bounds on the variables, but these suffer from some drawbacks. However, researchers have recently developed methods that can be used to solve general constrained optimization problems. Presently there are four methods to handle constrained problems with GAs, namely rejection of the offspring, repair algorithms, modified genetic operators, and penalty functions (Joines and Houck, 1994 and Yeniay, 2005). This thesis utilizes the penalty function method to evaluate the fitness of the chromosomes. Penalty functions convert a constrained problem into an unconstrained problem by penalizing those solutions which are infeasible (Yeniay, 2005). There are two basic ways to apply penalty functions. The first is the additive form shown below.

Eval(x) = f(x)          if x is a feasible solution
Eval(x) = f(x) + p(x)   otherwise

where f(x) is the objective function and p(x) is the penalty function. The second method is the multiplicative form, which is shown below.

Eval(x) = f(x)          if x is a feasible solution
Eval(x) = f(x) * p(x)   otherwise

In this thesis, a combination of these two techniques has been used, where a unique penalty value is multiplied with the number of priority and precedence misses, and the results are then added to provide a fitness value for the chromosome. Researchers have shown that if one imposes a high degree of penalty, more emphasis is placed on obtaining feasibility and the GA will move very quickly towards a feasible solution; the system will tend to converge to a feasible point even if it is far from optimal. However, if one imposes a low degree of penalty, less emphasis is placed on feasibility, and the system may never converge to a feasible solution. Hence, it is important to choose correct values for the penalty function parameters (Joines and Houck, 1994).

Selection scheme. The basic selection scheme used in the GA is roulette wheel selection, but it has been shown by De Jong that this scheme suffers from high stochastic errors (Goldberg, 1989). Hence, an advanced selection scheme, stochastic remainder selection without replacement, is used. This scheme starts in the same manner as the roulette wheel selection scheme, where the expected count of an individual is determined by multiplying a randomly generated number in the range of 0 to the population size with the relative fitness of that individual, and then taking the integer portion of the product. The fractional portion of the product provides the probability of any additional copies of the chromosome that can be expected. For example, a chromosome having a product of 2.5 will surely receive two copies, and another one with a probability of 0.5.
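The expected-count calculation just described can be sketched in C as follows. The sketch is illustrative only: the function name, the use of the standard rand() generator, and the assumption that the relative fitness values Fi/F have already been computed (scaled so that more fit chromosomes receive larger values) are not details taken from the implementation used in this thesis.

#include <stdlib.h>
#include <math.h>

/* Expected-count selection as described above (a sketch, not the thesis code).
   rel_fit[i] is assumed to hold the precomputed relative fitness of
   chromosome i. */
static int select_copies(const double rel_fit[], int pop_size,
                         int copies[] /* out: copies awarded to each chromosome */)
{
    int total = 0;
    for (int i = 0; i < pop_size; i++) {
        /* Random number in the range 0..pop_size, multiplied by the relative fitness. */
        double u = ((double)rand() / RAND_MAX) * pop_size;
        double expected = u * rel_fit[i];

        copies[i] = (int)floor(expected);          /* guaranteed copies        */
        double frac = expected - floor(expected);  /* chance of one extra copy */
        if ((double)rand() / RAND_MAX < frac)
            copies[i] += 1;

        total += copies[i];
    }
    return total;   /* the caller keeps selecting until the next generation is full */
}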
This process continues until the next generation is completely generated (Zomaya et al., 1999, & Goldeberg, 1989). 23 Advanced crossover operators. For order based problems like the traveling salesman problem and task ordering, a simple crossover operation does not produce the desired results. Hence, advanced crossover operators, known as reordering operators, were constructed, which combine features of inversion and crossover into a single operator. Some such operators include order, cycle, and partially matched crossover (PMX). In this thesis, the PMX operator, which was initially constructed for solving the blind traveling salesman problem, has been used. Under PMX, two strings are aligned, and two crossover points are picked at random along the strings. These two points define the crossover section used during the crossover operation, through position-by-position exchange. Figure 7 shows a simple PMX crossover between two chromosomes A and B. PMX proceeds first by position exchanges between crossover points 4 to 6, i.e. tasks 5 and 2, 6 and 3, and 7 and 10 exchange places. Once the exchange is complete there might exist duplication of tasks in the offsprings. To overcome this problem partial matching takes place, wherein if during the crossover 5 was replaced by 2 in chromosome A, and there exists a 2 in the un-exchanged part of chromosome A then that 2 is replaced by a 5. Similar matching operation takes place for the rest of the exchanges. Hence, after performing the PMX operation, strings containing ordering information partially determined by each of its parents are produced (Goldeberg, 1989). 24 Chromosome A 9 8 4 5 6 7 Chromosome B 8 7 1 2 3 10 9 5 4 6 Offspring 1 9 8 4 2 3 10 1 6 5 7 Offspring 2 8 10 1 5 6 7 1 3 2 10 9 2 4 3 Crossover Point Figure 7. Partially matched crossover operation Advanced mutation operators. Advanced operations for performing mutation need to be used, to ensure that no item/task occurs more than once in the chromosome. One such operator is the number swapping operator where two points of mutation are randomly chosen and the numbers at those points swap their position. Figure 8 demonstrates the number swapping mutation operation used in this thesis. Original offspring 3 2 1 4 5 6 Mutated offspring 3 2 1 5 4 6 Mutation Point Figure 8. Number swapping mutation operation Elitism. When creating a new population using crossover and mutation, the chances are that some of the most fit chromosomes from the previous population might be lost. To prevent this, elitism is performed, which copies the best chromosomes and 25 replaces the weaker chromosomes present in the new population. The rest of the population is constructed in the usual manner. Elitism can rapidly increase the performance of the GA by preventing the loss of the best found solutions. Now that all the advanced operators and methods to improve the execution time of the GA have been described, the next chapter explains the methods used to test the performance, and verify the correctness of the algorithm. CHAPTER 4 PARAMETRIC STUDY OF GENETIC ALGORITHMS Various methods and advanced genetic operators were discussed in Chapter 3 to improve the performance of the GA. But, if the appropriate configuration of GA parameters is not used during execution performance improvements cannot be guaranteed. This chapter presents the importance of GA configuration and the tests performed to obtain the appropriate GA configuration for this problem. 
Before describing these methods, the test suite of task graphs used to evaluate the GA's performance and functionality is presented. Finally, after the test suite has been described and an appropriate configuration obtained, the results for the sequential GA are presented.

Experimental Verification

To verify that the GA created here is a robust algorithm and works for most task graphs, it must be tested on a wide range of graphs of varying size and connectivity. Hence, the experimental verification of the sequential GA is performed using various benchmark graphs described by Kwok and Ahmad (1998). The benchmark graphs are diverse in nature and have variations in parameters such as graph complexity and size. The benchmark graphs are used to test the correctness of the algorithm, and the results are manually verified to confirm that all the precedence and priority constraints are met. These benchmark graphs are of two types: random graphs and peer selected graphs (PSGs).

Random Graphs. These are a set of graphs with random precedence and priority constraints, representing task sets with dissimilar structures. Such graphs play an important role in the evaluation of an algorithm, since it is necessary that the graphs used for testing do not follow a specific pattern that might lead to a bias toward a particular algorithm. To create the random graphs used in this analysis, several graph characteristics are randomized. The maximum order of the task graph, i.e. the number of tasks, is limited to 20 to facilitate manual verification. Next, the priority and execution time for each task are generated randomly. The priorities are selected between 1 and 30 and the execution times between 1 and 7. All the random numbers have been generated using Marsaglia's random number generator algorithm (Marsaglia, Zaman, & Tsang, 1990 and James, 1990). Some of the random graphs generated are shown in Figure 9.

Figure 9. Random graphs R1, R2, and R3

Peer Selected Graphs (PSGs). These are example task graphs used by various researchers and documented in publications. These graphs are usually small in size and can be used to trace the operation of an algorithm by examining the schedule produced (Kwok and Ahmad, 1998). Eight PSGs with varying degrees of connectivity are used. The priorities of the tasks in the task graphs are assigned randomly. Some of the peer selected graphs used to analyze the functionality of the sequential GA are shown in Figure 10.

Figure 10. Peer selected graphs: Graph 1 (Colin & Chretienne, 1991), Graph 2 (Chung & Ranka, 1992), Graph 3 (Kwok & Ahmad, 1998), Graph 4 (Wu & Gajski, 1990), Graph 5 (Yang & Gerasoulis, 1993), Graph 6 (Adam, Chandy & Dickson, 1974), Graphs 7 and 8 (Correa et al., 1999)

Before testing the GA with the above test suite, a GA configuration for which the GA performs optimally must be found. Searching for an optimal configuration can be a very exhaustive and tedious job. The approach taken to determine this configuration is discussed in the next section.
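For illustration, the random graph generation described above can be sketched in C as shown below. The sketch uses the standard rand() generator in place of Marsaglia's algorithm, draws precedence edges only from lower-numbered to higher-numbered tasks so that the result is guaranteed to be acyclic, and assumes an edge probability of 0.2; none of these choices are taken from the generator actually used to build the test suite.

#include <stdlib.h>
#include <time.h>

#define MAX_TASKS 20          /* random graphs are limited to 20 tasks             */
#define EDGE_PROBABILITY 0.2  /* assumed density of precedence edges (not stated)  */

typedef struct {
    int num_tasks;
    int priority[MAX_TASKS];             /* random priority, 1..30                */
    int exec_time[MAX_TASKS];            /* random execution time, 1..7           */
    int prec[MAX_TASKS][MAX_TASKS];      /* prec[i][j] = 1 means Ti precedes Tj   */
} RandomTaskGraph;

/* Generate a random task graph; edges only go from lower- to higher-numbered
   tasks, so the resulting graph is acyclic by construction. */
static void generate_random_graph(RandomTaskGraph *g, int num_tasks)
{
    g->num_tasks = num_tasks;
    for (int i = 0; i < num_tasks; i++) {
        g->priority[i]  = 1 + rand() % 30;
        g->exec_time[i] = 1 + rand() % 7;
        for (int j = 0; j < num_tasks; j++)
            g->prec[i][j] = 0;
    }
    for (int i = 0; i < num_tasks; i++)
        for (int j = i + 1; j < num_tasks; j++)
            if ((double)rand() / RAND_MAX < EDGE_PROBABILITY)
                g->prec[i][j] = 1;
}

int main(void)
{
    srand((unsigned)time(NULL));     /* rand() stands in for Marsaglia's generator */
    RandomTaskGraph g;
    generate_random_graph(&g, 7);    /* e.g., a 7-task graph similar in size to R1-R3 */
    return 0;
}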
Study of GA Configurations To compare performance in terms of execution time for the hardware implementation of a GA-based task scheduler with the software implementation of the scheduler, first a GA configuration for which both the implementations provide good performance results must be found. As mentioned at the start of this chapter, the performance of the GA is very much affected by the values of the GA parameters, namely population size, crossover rate, mutation rate and elitism. A set of unique values for these parameters define a unique GA configuration. The use of a non-optimal configuration for the implementation of a GA might result in a tedious and time-consuming optimization process. Therefore, various tests need to be performed on the sequential GA in order to obtain an optimal GA configuration. This section studies the effect produced by each parameter on the performance of the algorithm, for the test suite described previously. There can exist numerous GA configurations resulting from different combinations of parameter values. Searching for an optimal GA configuration by varying each parameter, for each task graph of the test suite, is a very difficult and exhaustive process. To narrow this search space, the following steps were taken. First, the performance of each GA configuration was observed for only a subset of the test suite 30 which includes PSGs 3, 4, 5 and 8. This subset represents task graphs of varying complexity and size. The execution time is noted for each possible configuration for each of the four graphs. Based on the results obtained for each of the graphs, an optimal GA configuration was picked that shows a relatively shorter execution time. Also, this configuration provides similar results on average for numerous runs of the algorithm. This optimal configuration has the following parameter values: population size of 140, crossover probability of 0.8, mutation probability of 0.6, and elitism of 90 (2/3 population size). The following sub-section discusses the effect of each parameter on the GA’s performance by varying the value of one parameter at a time. Also, in order to get a statistically significant quality assessment of the GA’s performance, the algorithm was executed 20 times and the average execution time is used in the results. A simple non-pipelined version of the GA was implemented in the C programming language. It was executed on a UNIX machine, with an EM64T CPU having a clock frequency of 3.2 GHz and a peak performance of 102.4 GFLOPS. The test results obtained by varying each parameter of the GA configuration are shown and discussed in the following sub-sections. Effect of Altering the Population Size. It is important to choose an appropriate population size during the execution of a GA. Choosing a population size which is too small makes the algorithm prone to the risk of premature convergence to a local optimum, whereas choosing a larger population size might provide a global optimum, but at the expense of a longer CPU computation time since the GA needs to search through a larger solution space. A population size which does not compromise with the solution 31 quality, i.e. the fitness of the solution, and also searches a section of the solution space with a reasonable size, without increasing the computation time, needs to be chosen. Figure 11 shows the average execution time of the GA for each population size for the four PSGs. 
The population size was varied from 20 to the maximum allowed population size of 200 in increments of 20, keeping the other GA parameters constant. The average value of the GA execution time is shown. The execution time of PSG 3 is scaled down by a factor of 50 in order to show the results clearly in the same graph.

Figure 11. Average execution time (msec) vs. population size for PSGs 3, 4, 5, and 8

It can be observed that for large task graphs, such as PSGs 3 and 4, the execution time is high for low population sizes (20-80). It then decreases for mid-range population sizes (80-160), and finally there is an increase in the execution time at the higher end of the population size range (160-200). Although PSGs 3 and 4 may not completely concur with this observation, it was noted for most of the other large task graphs in the test suite. For a low population size, there are not enough individuals present in the population; the GA tends to search around a local optimum, leading to longer execution times. In the middle range, the population size is large enough to make the GA move away from the local optimum and eventually converge to a global optimum, while still being small enough to take little CPU time. Toward the higher end of the population size, the GA has more individuals to evaluate, most of them being multiple copies of the same individual. This redundancy of individuals in the population does not improve the chances of the GA converging to a global optimum, but simply adds computational overhead, thereby producing the same result produced by a medium-sized population in a longer time. With the increase in population size, the portion of the solution space being searched also increases, and the probability of accidentally moving into the section containing the global optimal solution improves; this is depicted by the occasional dips in the graph. For small task graphs such as PSGs 5 and 8, the execution time increases as the population size increases. This is because the solution space for these graphs is small and the GA converges to the global optimum very quickly; the variation over the range of population sizes is small due to the small size of the graphs. As can be observed, population sizes of 100 and 120 provide good execution times for PSGs 3, 5 and 8, and a population size of 140 for PSG 4. A population size of 140 was finally chosen, as it produced good results for most of the other large task graphs in the test suite.

Effect of Crossover Probability. The crossover rate was varied from 0.2 to 1.0 in increments of 0.1, and the average execution time is shown in Figure 12 for the four test graphs. Similar increments have also been used by Zomaya et al. (1999) while evaluating the performance of a GA used for task scheduling. To fit the plot for PSG 3 in the figure, its values have been scaled down by a factor of 10.

Figure 12. Average execution time (msec) vs. crossover rate for PSGs 3, 4, 5, and 8

The results for the PSGs indicate that the execution time is high for low crossover rates (0.2-0.4), then decreases and remains low over the higher range (0.4-1.0).
A low crossover rate produces a higher convergence time, as the GA does not perform frequent recombination of good genetic material between individuals and therefore explores fewer solutions. For most of the task graphs in the test suite, the best value of the crossover rate was found to be in the range of 0.4-0.8, for which the average execution time was low. The execution time increased with further increases in the crossover rate, as a higher crossover probability makes the GA explore a large portion of the solution space swiftly, prolonging its convergence. Hence, a crossover rate of 0.8 was finally chosen to provide sufficient variation in the population for most of the task graphs in the test suite.

Effect of Changing the Mutation Probability. The mutation rate was varied from 0.2 to 1.0 in increments of 0.1 (Zomaya et al., 1999). Figure 13 shows the graph of the average execution time versus mutation rate for some of the task graphs. To fit the plot for PSG 3 in the figure, its values have been scaled down by a factor of 10.

Figure 13. Average execution time (msec) vs. mutation rate for PSGs 3, 4, 5, and 8

It is observed for most of the graphs that the execution time for low mutation rates (0.2-0.5) is high, since the probability of the GA getting stuck at a local optimum or losing good genetic material is higher. In the middle range (0.5-0.8), enough disruption is produced in the population to recover any good material lost during selection and crossover. Also, such a rate ensures that the GA never stagnates at a local optimum. With a further increase (0.8-1.0) in the mutation rate, frequent disruption is produced, causing the GA to hop through various sections of the solution space, never staying long enough in one section to converge to an optimum. This leads to a high execution time. A mutation rate of 0.6, in the mid-range, was finally chosen as it produced low execution times for most of the task graphs for the given GA configuration.

Effect of Changing the Value of Elitism. The sole purpose of introducing elitism is to replicate the previous generation's best solutions in a particular generation, thereby increasing the average fitness of the population. The elitism was varied from 10 to 140 in increments of 10, and the corresponding execution time was recorded and is shown in Figure 14. To fit the data for PSG 3 in the plot, its execution time has been scaled down by a factor of 10.

Figure 14. Average execution time (msec) vs. elitism for PSGs 3, 4, 5, and 8

The execution time is observed to be high for low elitism (10-40) and decreases as elitism increases toward the higher end (40-110), with low computation time at about two thirds of the population size for PSGs 3, 4 and 8. Mid-range (60-110) elitism allows the GA to work on a large portion of good individuals, leading to a faster convergence to the global optimum. Although elitism values of 100 and 110 seem to provide better execution times, an elitism of 90 provided the best result for the given configuration for most of the task graphs in the test suite. Hence, an elitism of 90 has been used while executing this GA.
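A sketch of this elitist replacement step in C is given below. The linear scans, array dimensions, and function names are illustrative only (the internal data structures of the implementation are not described here); the population size and elitism value selected in this chapter are simply plugged in as constants.

#include <string.h>

#define POP_SIZE 140      /* population size chosen above        */
#define ELITE    90       /* elitism value chosen above          */
#define CHROM_LEN 20      /* chromosome length (number of tasks) */

/* Index of the best (lowest fitness) or worst (highest fitness) chromosome
   among those not yet marked as used; lower fitness is better here because
   the scheduler solves a minimization problem. */
static int extreme_index(const double fitness[], const int used[], int find_worst)
{
    int idx = -1;
    for (int i = 0; i < POP_SIZE; i++) {
        if (used[i]) continue;
        if (idx < 0 ||
            ( find_worst && fitness[i] > fitness[idx]) ||
            (!find_worst && fitness[i] < fitness[idx]))
            idx = i;
    }
    return idx;
}

/* Copy the ELITE best chromosomes of the old population over the ELITE weakest
   chromosomes of the new population. */
static void apply_elitism(int old_pop[POP_SIZE][CHROM_LEN], const double old_fit[],
                          int new_pop[POP_SIZE][CHROM_LEN], double new_fit[])
{
    int old_used[POP_SIZE] = {0}, new_used[POP_SIZE] = {0};

    for (int e = 0; e < ELITE; e++) {
        int best  = extreme_index(old_fit, old_used, 0);  /* best of old generation    */
        int worst = extreme_index(new_fit, new_used, 1);  /* weakest of new generation */
        old_used[best] = new_used[worst] = 1;

        memcpy(new_pop[worst], old_pop[best], sizeof(int) * CHROM_LEN);
        new_fit[worst] = old_fit[best];
    }
}

In use, such a routine would be called once per generation, after crossover, mutation, and fitness evaluation have produced the new population.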
It should also be noted that for smaller PSGs, such as graphs 5 and 8, elitism does not produce a significant improvement in the performance of the GA. Now that the optimal configuration has been obtained and justified, the GA was executed on the full set of test graphs to verify its correctness.

Results and Discussions

For both sets of benchmark graphs described in the first section of the chapter, the software implementation of the GA was tested and its output schedules were manually verified to meet all priority and precedence constraints. The test suite is comprised of eight peer selected graphs and 100 randomly generated graphs. The resultant schedules obtained for the random graphs are shown in Table 1 and for the peer selected graphs in Table 2.

Table 1. Task schedules for random graphs

Randomly Generated Graph    Task Schedule
Graph R1                    ABCFEGD
Graph R2                    ABEFCD
Graph R3                    ABDCEFG

Table 2. Task schedules for peer selected graphs

Peer-Selected Graph    Task Schedule
Graph 1                ACBEDHGFI
Graph 2                ABDHCFGEIJK
Graph 3                BCADMFEIGHJLNKOPQ
Graph 4                ADCBEFGIJHKLONPQMR
Graph 5                ACEBDGF
Graph 6                ACDGBFEKIHJL
Graph 7                BACFGEDHI
Graph 8                CBAEDF
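The manual verification described above can also be expressed as a simple automatic check: every task in a schedule must appear after all of its predecessors in the task graph. The C sketch below performs this check; the small graph encoded in the arrays is an illustrative example, not one of the benchmark graphs.

#include <stdio.h>

#define NTASKS 4

/* Illustrative data: a four-task graph and a candidate schedule. */
int schedule[NTASKS]     = {0, 2, 1, 3};              /* tasks in scheduled order   */
int npred[NTASKS]        = {0, 1, 1, 2};              /* predecessor count per task */
int pred[NTASKS][NTASKS] = { {0}, {0}, {0}, {1, 2} }; /* predecessor lists          */

int schedule_is_valid(void)
{
    int position[NTASKS];
    for (int i = 0; i < NTASKS; i++)
        position[schedule[i]] = i;                    /* where each task was placed */

    for (int t = 0; t < NTASKS; t++)
        for (int p = 0; p < npred[t]; p++)
            if (position[pred[t][p]] > position[t])
                return 0;                             /* a predecessor comes too late */
    return 1;
}

int main(void)
{
    printf("schedule %s the precedence constraints\n",
           schedule_is_valid() ? "meets" : "violates");
    return 0;
}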
Despite the advanced techniques developed to improve performance, GAs can still require long computation times, especially for large problems or task graphs. Performance improvements are possible through parallelization and hardware acceleration.

CHAPTER 5

PARALLEL GENETIC ALGORITHMS

The intractability of the scheduling problem has led to the use of heuristics for obtaining sub-optimal solutions in a reasonable amount of time. Although heuristics can produce good solutions, i.e. solutions with high fitness values, their time complexities are fairly high. Many heuristics work well for small task sets, but their performance deteriorates as the problem size and the task set's complexity increase. To reduce the time complexity without compromising the solution quality, a natural approach is to parallelize the scheduling algorithm (Ahmad and Kwok, 1999). In this chapter an overview of various parallel GA (PGA) structures is presented. Then, the parallelization of the sequential GA presented earlier is discussed and the results are presented.

Overview of Parallel GAs

The simple structure and operational principle of the GA inherently make it easy to parallelize, and much research has been devoted to parallelizing GAs. Several parallel GA implementations have been examined, and these approaches can be broadly classified into four categories: synchronous master-slave, semi-synchronous master-slave, asynchronous concurrent, and distributed or network (Goldberg, 1989), shown in Figures 15, 16, and 17.

Figure 15. Master-slave implementation

Figure 16. Asynchronous concurrent implementation

Figure 17. Distributed or network implementation

The master-slave, or global-parallel, GA shown in Figure 15 is one of the most popular approaches for parallelizing a GA. In this approach, the master performs most of the genetic operations, like selection, crossover and mutation; it is the evaluation of the fitness of the chromosomes which is distributed over slave processors. Once the slaves are done evaluating, they return the fitness values to the master. Meanwhile, the master waits on the slaves for completion. The second approach, semi-synchronous master-slave, is similar to the master-slave except that the master does not synchronize its operation with the slaves. Instead, selection and recombination take place as the slaves perform the evaluation. In the asynchronous concurrent GA approach, shown in Figure 16, the GA executes independently on multiple processors, which work on a population present in a shared memory. While executing the GA, these processors make sure that there is no data incoherency. The final approach is the network-based GA shown in Figure 17. In this case, multiple processors evolve local populations. These processors work independently of each other, and synchronization takes place at the end of each generation when each local GA sends its best solution to the other processors (Goldberg, 1989). One variation of this approach, implemented by Solar, Kri and Parada (2001), uses the best solutions sent by the local GAs to a central GA at the end of a generation. The central GA selects the best solution and broadcasts it to the local GAs. Such a GA is often called a distributed GA (Cantu-Paz, 1998, Yoshida and Yasuoka, 1999 and Solar et al., 2001).

Ahmad and Kwok (1999) demonstrate another method for implementing a parallel GA. In this method, the DAG is partitioned into multiple parts. One can partition the DAG either horizontally or vertically into layers of nodes. Horizontal partitioning divides the DAG into layers of nodes such that the paths in the DAG are at the same level, while vertical partitioning is performed by partitioning along the paths. After partitioning, each part is scheduled by a node, and at the end these schedules are combined to generate the final schedule (Ahmad and Kwok, 1999).

Software Implementation of PGA

In this thesis, a synchronous master-slave parallel GA has been implemented by parallelizing the fitness evaluation module to improve the performance of the scheduling algorithm. The evaluation of fitness constitutes the largest part of the GA's execution time. In the master-slave GA, the evaluation of individuals is parallelized by assigning a fraction of the newly generated population to different slave processors. Communication is minimized by grouping all individuals intended for a slave into one transfer. In the semi-synchronous model, each individual is sent to a slave as it is generated after crossover and mutation, without waiting for the whole population to be generated, as is the case in the synchronous master-slave model. Sending each individual in this manner leads to a larger communication cost compared to sending a group of individuals for evaluation. This overhead might overshadow any performance gain that is achieved through parallelism. Hence, the semi-synchronous model has not been implemented. In the synchronous master-slave PGA, the master node executes the sequential GA process of selection and reproduction. When it comes to the evaluation of fitness, however, the master node distributes the fitness evaluations amongst itself and the slave nodes. The master evaluates its fraction of the population and waits on the slaves to return the fitness values for all the other individuals comprising the population before moving on to the next generation if a desired solution has not been obtained. Hence, a master-slave GA is the same as the sequential GA, only with its performance improved through parallelization of the evaluation/fitness function (Cantu-Paz, 1998).
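The synchronous master-slave fitness step just described can be sketched with MPI collectives as follows. The flat-array population layout, the evaluate_fitness() routine, and the assumption that the population size divides evenly among the processes are illustrative simplifications rather than the exact code used in this thesis.

#include <mpi.h>
#include <stdlib.h>

/* Assumed sequential penalty-based evaluation routine. */
extern int evaluate_fitness(const int *chromosome, int numtasks);

/* One synchronous fitness phase: the master scatters equal blocks of
 * chromosomes, every process (master included) evaluates its block, and the
 * master gathers all fitness values before selection begins. */
void parallel_fitness(int *population, int *fitness,
                      int popsize, int numtasks, int nprocs)
{
    int chunk  = popsize / nprocs;          /* popsize chosen divisible by nprocs */
    int *mypop = malloc(chunk * numtasks * sizeof(int));
    int *myfit = malloc(chunk * sizeof(int));

    MPI_Scatter(population, chunk * numtasks, MPI_INT,
                mypop,      chunk * numtasks, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)
        myfit[i] = evaluate_fitness(&mypop[i * numtasks], numtasks);

    MPI_Gather(myfit, chunk, MPI_INT, fitness, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    free(mypop);
    free(myfit);
}

Because the gather completes only after the slowest process has finished, the generation cannot proceed until every node has returned its results, which is the synchronization behavior discussed in the next section.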
Results and Discussions

The PGA implemented here uses the same parameter configuration as the sequential GA; hence, only the effect of parallelization is discussed. The PGA distributes fitness evaluation over the nodes of a cluster, with the master distributing the population over the slave nodes and the slave nodes evaluating their sub-population's fitness and returning the fitness values to the master. The PGA is coded in C using the Message Passing Interface (MPI) standard and is executed on a Rocks cluster with 16 EM64T CPUs, each with a clock frequency of 3.2 GHz, and a combined peak performance of 102.4 GFLOPS. To study the speedup in the execution time of the GA with respect to the number of processors used, a mixed subset of the PSGs described in Chapter 4 is used.

The time taken to evaluate fitness depends on two things. First is the size of the population: more individuals require longer computation time. Second is the quality of the solutions: poor solutions require the fitness module to apply a penalty, thus requiring more computations. Although all fitness modules receive an equal number of individuals to evaluate, the parallel implementation does not ensure that all the modules receive individuals of the same quality. This type of condition is known as load imbalance. Due to the synchronization between master and slaves, the computation time of the overall algorithm is constrained by the slowest evaluating fitness module.

Since the population size is set to 140 for this GA, the numbers of processors used are 1, 2, 4, 5, 7, 10, and 14, which are divisors of 140. These numbers equally divide the population among the processors. The speedup obtained by parallelizing the GA is shown in Figure 18.

Figure 18. Speedup vs. number of processors

As shown in the graph, for most of the tested task graphs the speedup increases as the number of processors increases. After a certain point, the speedup typically remains constant before gradually decreasing. This characteristic can be attributed to load imbalance and to the increase in communication costs incurred while sending subpopulations to slave nodes, which negates any improvement obtained through parallelism. These communication costs increase as the subpopulation size decreases, eventually dominating the computation time. It is also observed that for the smaller PSGs, PSGs 5 and 8, the communication overhead increases so much that the performance of the PGA sinks below that of the sequential GA. A maximum speedup of approximately 3 was obtained using 7 processors for complex task sets. For simple task graphs, the best speedups were obtained using fewer processors, on the order of three. Finally, it is observed that by parallelizing the fitness evaluation a speedup on the order of 1.5-3 can be achieved over the sequential GA, depending on the characteristics of the task graph being scheduled.
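For reference, the speedup plotted in Figure 18 is the sequential execution time divided by the parallel execution time, and each processor receives popsize/p individuals per generation. The sketch below computes both quantities from measured averages supplied by the caller; it assumes the first entry corresponds to the single-processor run and introduces no timing data of its own.

#include <stdio.h>

/* procs[] holds the processor counts used (divisors of 140) and
 * times_msec[] the corresponding measured average execution times, with
 * procs[0] == 1 assumed to be the sequential reference. */
void print_speedup(const int *procs, const double *times_msec, int n)
{
    for (int i = 0; i < n; i++)
        printf("p = %2d  individuals per node = %3d  speedup = %.2f\n",
               procs[i], 140 / procs[i], times_msec[0] / times_msec[i]);
}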
Despite parallelization, a GA may still require long execution times, in some cases several days, to solve a large problem (Vega-Rodriguez, Gutierrez-Gil, Avila-Román, Sánchez Pérez and Gómez Pulido, 2005). As Figure 18 shows, using additional computing resources may not provide improved performance. Thus, better methods to improve the execution time of a GA are being researched. Implementing the GA directly in hardware using reconfigurable hardware devices is one promising approach.

CHAPTER 6

HARDWARE-BASED GA TASK SCHEDULER MODEL

This chapter begins with the motivation behind a hardware-based GA (HGA), followed by a brief description of field programmable gate arrays (FPGAs) and previous work done in the field of HGAs. A conceptual description of the HGA design is then given, including an overview, a detailed description of each module's functionality, and a brief description of how pipelining has been implemented in the design. Finally, the performance results for the HGA are presented and discussed.

Motivation for the HGA System

Sequential GA implementations tend to be slow due to the amount of computation required to efficiently search the large solution spaces associated with task scheduling problems. Parallelizing these GAs can offer only limited speedup due to communication costs, as shown in Figure 18. More substantial performance improvement is possible using hardware acceleration. With the advent of hardware technologies and design tools in the field of reconfigurable devices, there has been a drastic increase in the hardware implementation of applications that were previously executed in a software environment. With a hardware implementation, high performance can be expected with minimal resources at hand.

The advent of such technology features has enabled engineers to think of new ways to approach the research and development of both hardware and software systems. Much work has been done to demonstrate the significant speedup that can be achieved by implementing hardware accelerators for software algorithms. A typical sequential GA is comprised mainly of a set of simple genetic operations and fitness evaluation functions that are executed repetitively until the required solution is obtained. If a particular GA executes for num_gen generations and performs genetic operations on a population of size popsize, then the GA operations are executed num_gen * popsize times. For example, for one particular execution of the GA for PSG 3, the GA executed 454 generations with a population size of 120 until it converged to the global optimum. Thus, the GA operations executed 454 * 120 = 54,480 times. It has been noticed for complex scheduling problems that the population size and the number of generations needed are large, making the number of GA operations executed before the termination condition is reached very large. Therefore, hardware acceleration of these algorithms may provide significant speedup.

Hardware acceleration can be achieved using different technologies. Custom hardware can be used, but it is not desirable due to high non-recurring engineering (NRE) costs and long design times (Vahid and Givargis, 2000). Reconfigurable hardware devices such as FPGAs are popular because of their low cost, flexibility, and speed. Also, standardized hardware description languages (HDLs) and powerful design tools are available for these devices, making the design process easier. The next section provides an overview of FPGAs.

Overview of Field Programmable Gate Arrays

The process of designing digital hardware has changed dramatically over the past few years with the availability of new types of sophisticated programmable logic devices (PLDs). Unlike previous generations of technology, in which board-level designs included large numbers of small scale integrated chips containing basic gates, virtually every digital design produced today consists mostly of high-density devices.
When logic circuits are destined for high-volume systems, they have traditionally been integrated into high-density gate arrays. However, gate arrays suffer from high NRE costs and take too long to manufacture to be viable for prototyping or other low-volume applications. For these reasons, most prototypes, and also many production designs, are now built using PLDs because of features such as instant manufacturing turnaround, low start-up costs, and ease of design changes. The three main categories of PLDs are simple PLDs (SPLDs), complex PLDs (CPLDs) and FPGAs (Brown and Rose, 1996, Brown and Vranesic, 2005 and Vahid and Givargis, 2000).

An FPGA is a programmable logic device that supports the implementation of relatively large logic circuits. FPGAs contain an array of uncommitted circuit elements, called logic blocks, and interconnect resources. Configuration is performed through programming by the end user. The general structure of an FPGA is shown in Figure 19.

Figure 19. Structure of an FPGA

As shown in Figure 19, an FPGA contains three main types of resources: logic blocks, I/O blocks for connecting to the pins of the package, and interconnection wires and switches. The logic blocks, arranged in a two-dimensional array, provide the implementation of the required functions, and the interconnection wires are organized as horizontal and vertical routing channels between the rows and columns of logic blocks. The routing channels contain wires and programmable switches that allow the logic blocks to be interconnected in many ways. The switches may be implemented as pass-transistors controlled by static RAM cells, anti-fuses, or EPROM or E2PROM transistors. There are two basic categories of FPGAs on the market today: static random access memory (SRAM)-based FPGAs and antifuse-based FPGAs. For SRAM-based arrays, Xilinx and Altera are the leading manufacturers. For antifuse-based products the leading manufacturers are Actel, Quicklogic, Cypress, and Xilinx (Brown and Rose, 1996 and Scott, 1994).

Previous Research Work Related to HGA

For years, researchers have undertaken the study of the hardware implementation of schedulers and their performance. These studies have mainly been in the areas of the development of hardware schedulers using VLSI, evolvable hardware, and the implementation of schedulers using FPGAs. Burleson et al. (1999) developed a hardware scheduler (the Spring scheduler) in the form of a coprocessor which was used as a VLSI accelerator for multiprocessing environments (Burleson et al., 1999). The scheduler is used for sophisticated static scheduling as well as on-line scheduling using various algorithms like EDF, HLF or the Spring scheduling algorithm (Niehaus et al., 1993). Yoshida and Yasuoka (1999) proposed a hardware-based GA called the Genetic Algorithm Processor that employs a steady-state GA and pipeline processing. They also implemented a two-level parallelization of parallel and distributed GAs, which was effective in improving performance and convergence (Yoshida and Yasuoka, 1999). Although high performance is provided by such conventional custom hardware schedulers, the lack of flexibility and reconfigurability of such components has prompted many researchers to use reconfigurable devices such as FPGAs for the development of hardware-based schedulers. The use of pre-fabricated reconfigurable devices has allowed rapid implementation of sizeable systems without the need to create custom hardware.
Such hardware provides economies-of-scale production advantages as well as flexibility formerly found in software-based systems (Loo et al., 2003). Scott, Samal and Seth (1995) proposed a general hardware-based GA engine which implements a parallelized simple genetic algorithm. Loo et al. (2003) developed a static GA-based scheduler on a reconfigurable device. Tang and Yip (2004) implemented a general GA using FPGAs with the ability to reconfigure the hardware-based fitness function.

A considerable amount of research has been applied to a sub-category of hardware-based evolutionary algorithms known as evolvable hardware, or evolware. In such hardware systems, the function and architecture of the system can be dynamically changed in self-adaptation to the outside environment (Lei, Ming-Cheng, and Jing-xia, 2002). Evolware consists of a reconfigurable hardware device that can undergo a number of evolutionary cycles to produce a suitable solution to the application at hand. Evolvable hardware has become a rapidly growing field, in which new discoveries are being made at an astounding pace (Abielmona and Groza, 2001). Abielmona and Groza (2001) and Lei et al. (2002) implemented such evolvable hardware using a GA to design new chips and to map a design to a technology so as to minimize the structure of a circuit. Although much research effort has been devoted to hardware schedulers, little has been done in developing hardware-based task schedulers which perform scheduling using a GA.

Basic HGA Model

The HGA fits into a general computing environment in the following way. The front end, which might be a CPU, contains the task graph that needs to be scheduled. Before starting the GA scheduler, the front end writes the task graph into the random access memory provided on the FPGA board. Then, using one of the I/O pins, the front end sends a GO signal to the GA scheduler. The scheduler detects the GO signal and starts reading the parameters and the task graph from memory. It executes until the termination condition has been met. Once the solution has been found, the scheduler signals the front end, indicating that the GA has completed the scheduling operation and that the solution has been written to the on-board memory. The CPU reads the schedule from the memory and writes it to a file.
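The front-end interaction just described can be summarized by the following C sketch. The helper functions write_board_mem(), read_board_mem(), set_go_pin() and done_pin(), as well as the address parameters, are hypothetical placeholders for whatever board-support interface the front end actually uses; they are not part of the design files in Appendix A.

#include <stdint.h>

extern void    write_board_mem(uint16_t addr, uint8_t value);  /* hypothetical */
extern uint8_t read_board_mem(uint16_t addr);                  /* hypothetical */
extern void    set_go_pin(int level);                          /* hypothetical */
extern int     done_pin(void);                                 /* hypothetical */

void schedule_task_graph(const uint8_t *task_graph, int tg_bytes,
                         uint8_t *schedule, int numtasks,
                         uint16_t tg_base, uint16_t sched_base)
{
    /* 1. Write the task graph into the on-board memory. */
    for (int i = 0; i < tg_bytes; i++)
        write_board_mem(tg_base + i, task_graph[i]);

    /* 2. Raise GO on the I/O pin; the scheduler starts reading the graph. */
    set_go_pin(1);

    /* 3. Wait until the scheduler signals that scheduling is complete. */
    while (!done_pin())
        ;

    /* 4. Read back the schedule written to the on-board memory. */
    for (int i = 0; i < numtasks; i++)
        schedule[i] = read_board_mem(sched_base + i);

    set_go_pin(0);      /* release GO before the next task graph */
}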
Overview of the System

The basic simple HGA model is shown in Figure 20. It consists of a shared on-board memory module which stores the input task graph, the output schedule, the GA configuration, and both the old and the new population. A memory controller (MC) module reads the external signals and controls what is read from and written to the memory based on the requests obtained from the other GA modules and the front end.

Figure 20. Simple HGA model

A GA is a randomized technique, and its core of operation is based on the random number generator (RNG) module, which provides random numbers of various sizes to the other GA modules. The initial population is generated by the initial population generator (IPG) module. It uses the random numbers provided by the RNG module and fills the initial population with random strings and priority-sorted strings of tasks. Simultaneously, the read task graph (RTG) module reads the task graph from memory, fills the task table, and sends it to the fitness module (FM). As the initial population is being generated, the random strings produced are evaluated by the FM. After the initial population has been generated, the FM signals the selection module (SM), the crossover and mutation module (CMM), and the population sequencer (PS) module to start the GA. Once the best solution is obtained, the FM signals the MC that the GA needs to be terminated. The MC stops all the modules and sends a DONE signal to the front end. The front end, on receiving the DONE signal, starts reading the solution. Typically, there might exist multiple task graphs in the memory at any given time. The front end maintains a count of the task schedules generated and the task schedules that still need to be generated. Hence, when the front end performs a read operation it knows which task schedule corresponds to which task graph. A detailed description of the individual modules is given in the next sub-section.

Module Description

The modules shown in Figure 20 and coded in Appendix A are mostly based on the operators used in the sequential GA model described in Chapter 3. As shown, these modules form a coarse-grained pipeline and execute concurrently with each other. All these modules have been described using the Very High Speed Integrated Circuit Hardware Description Language (VHDL), which is an IEEE standard, and implemented on an Altera Cyclone FPGA device. Each module can be described as a finite state machine with an asynchronous reset. The modules communicate with each other asynchronously using a handshaking protocol. For example, if a consumer module A wants some service from producer module B, it initiates a request signal to module B. Module B senses the request line and returns an acknowledgement to module A indicating that it is ready to service A's request. Module A then gets the service and lowers its request line. Module B, on completion of the service, lowers its acknowledgement line. Examples of such handshaking might exist between the MC and the IPG or between the IPG and the FM. A detailed description of each module is given below.

On-Board Memory. The memory module used in the HGA represents the on-board SRAM M4K blocks comprising 130 kilobits of memory. The memory is implemented using an Altera synchronous single-port RAM memory module with a width of 8 bits and the capacity to store 16K words. All memory reads and writes are performed through the MC. The memory module is used to store the old and new populations used in the GA along with the input task graphs and the output schedules.

Memory Controller (MC). The MC is the module which acts as the interface between the external environment and the GA scheduler. It is the MC which responds to all the front end signals which initiate and stop the GA scheduler operation. It also provides signals to the front end indicating that scheduling is completed and the output schedule is ready. The MC also responds to all the memory read/write requests from the other GA modules and acts as a memory interface.

Random Number Generator (RNG). The RNG module forms the core of a GA-based scheduler. This module provides the random numbers to three GA modules. It supplies random strings to the IPG module to produce the initial random population. It also provides two random strings to the SM, which uses them to scale down the sum of fitness, which is in turn used to select two parents from the population for recombination. Six random numbers are sent to the CMM: the test values for the crossover and mutation probabilities, two crossover points, and two mutation points.

The RNG module is based on cellular automaton (CA) theory. Cellular automata can be thought of as dynamical systems, discrete in both time and space. They are implemented as an array of cells with homogeneous functionality, constrained to a regular lattice of some dimensionality. In this thesis, a CA in one dimension, called a linear CA, has been implemented. A CA relies on bit-level computation and local interconnection to implement hardware-based random number generators. These generators have long random number sequences, sufficiently long for use in a GA. Statistical tests have shown that CA-based random number generators are far superior to linear feedback shift register-based generators (Shackleford, Tanaka, Carter and Snider, 2002 and Martin, 2002). After receiving an initiate signal from the MC, the RNG module first gets the random seed from memory. The CA used in the RNG is a linear CA with 16 cells which change their states according to rules 90 and 150:

Rule 90:  S_i(next) = S_(i-1) XOR S_(i+1)
Rule 150: S_i(next) = S_(i-1) XOR S_i XOR S_(i+1)

Here S_i(next) is the next state of cell i, and S_(i-1), S_i, and S_(i+1) are the current states of cells i-1, i, and i+1. It has been proven that a 16-cell CA whose cells are updated by the rule sequence 150-150-90-150-90-150-...-90-150 produces a maximum-length cycle (Serra, Slater, Muzio and Miller, 1990).
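To make the update rule concrete, the following C sketch models one step of such a rule-90/150 linear CA, with each bit of a 16-bit word representing one cell and null (zero) boundary cells. The bit mask selecting which cells follow rule 150 is only a placeholder; the particular 90/150 arrangement that yields the maximum-length cycle is the one cited from Serra et al. (1990).

#include <stdint.h>
#include <stdio.h>

/* One CA step: rule 90 cells take left XOR right, rule 150 cells take
 * left XOR self XOR right.  A set bit in rule150_mask marks a rule-150 cell. */
static uint16_t ca_step(uint16_t state, uint16_t rule150_mask)
{
    uint16_t left  = (uint16_t)(state << 1);   /* neighbour i-1, zero at the edge */
    uint16_t right = (uint16_t)(state >> 1);   /* neighbour i+1, zero at the edge */
    return left ^ right ^ (state & rule150_mask);
}

int main(void)
{
    uint16_t state = 0x96CC;    /* the seed 1001011011001100 listed in Appendix A   */
    uint16_t mask  = 0xAAAA;    /* placeholder rule assignment, not the thesis' one */

    for (int i = 0; i < 8; i++) {
        state = ca_step(state, mask);
        printf("0x%04X\n", state);
    }
    return 0;
}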
Initial Population Generator (IPG). The IPG module gets the population size and the chromosome size from memory and uses input from the RNG to generate random strings containing randomly ordered tasks, along with a few ordered strings based on the priority of the tasks in the task graph. The IPG also initializes the fitness value to 50000, which is the maximum fitness. After generating a random individual, the IPG sends it to the FM for evaluation of the random string's fitness. Once the IPG is done generating the random population and the FM is done evaluating it, the FM enables the rest of the GA modules, and normal GA operation is started.

Read Task Graph (RTG). The RTG module reads from memory the task graph for which the schedule is to be obtained and generates the task table containing the various values needed for the fitness evaluation of a schedule by the FM. It is only after the RTG is done generating the task table that the FM is initiated to perform fitness evaluation.

Population Sequencer (PS). The job of the PS module is to scan through the current population and pass the members to the selection module. The PS module acts as an interface between the selection module and the MC; it provides the address of the member to be read to the MC and passes the member and its fitness to the selection module.

Selection Module (SM). The HGA's SM implements the stochastic remainder without replacement selection method. Each time a new member is to be selected, the SM gets a random number from the RNG and scales the sum of fitness. It then signals the PS to cycle through the population and accumulates the fitness values until it reaches the scaled sum-of-fitness value. Once the scaled fitness is reached, the member is stored so that it can be passed to the CMM.
The SM repeats the same procedure for obtaining the next selected member. After obtaining the two parents, it signals the CMM that the parents are ready for recombination. After passing the selected parents to the CMM, the SM resets itself and starts over with the selection process.

Crossover and Mutation Module (CMM). The CMM is based on the PMX and number swapping mutation methods. After receiving a ready signal from the SM, the CMM obtains random strings from the RNG. Using these random numbers, the CMM performs crossover and mutation based on the user-defined crossover and mutation rates. Once the CMM completes its operation, it signals the FM that new offspring are available for fitness evaluation.

Fitness Module (FM). The FM implements the additive and multiplicative penalty function described previously. The penalty function is based on the number of constraints in the scheduling problem that have not been followed by the chromosome and is used to evaluate its fitness. The FM receives inputs from two GA modules: initially from the IPG, while evaluating the fitness of the initial population, and afterwards, during normal GA operation, from the CMM, for evaluating the new members it provides. After evaluating new individuals, the FM sends the new members to the MC, which writes them to memory. The FM also keeps track of whether the best possible solution has been obtained, based upon the pre-defined fitness threshold. If it has been obtained, the FM signals the MC that the GA has obtained the solution and sends the solution to the MC, which then writes it to a separate memory location accessed by the user. The FM also has the responsibility of maintaining the statistics for a given generation, such as the sum of fitness, the average fitness value, the maximum fitness value, and the minimum fitness value. These statistics are used by the SM to select new members. After evaluating a generation, the FM performs elitism, re-evaluates the sum of fitness, and writes it to memory through the MC.

Pipelining

Employing hardware resources provides higher levels of concurrency. The simple nature of GA operations makes it easy to form a coarse-grained pipeline within the hardware implementation, as shown in Figure 21. The PS, SM, CMM and FM modules together form a pipeline.

Figure 21. Coarse-grained pipeline

First, the PS gets the members from the MC and passes them to the SM. The SM, based on a pre-calculated value, continues accepting the incoming members until it has chosen a pair of members. Once this decision has been made, the SM passes the pair to the CMM which, based on the crossover and mutation probabilities, performs crossover and mutation on the selected members. While the CMM is performing the recombination and mutation operations on the selected members, the SM and PS start getting the next set of parents. Once the CMM is done, it passes the newly generated offspring to the FM and starts working on the next set of parents if available. The FM evaluates the fitness and sends the result to the MC, which writes it to memory. This sequence of operations represents the GA pipeline architecture. Populations are created and evaluated one individual at a time. This is in stark contrast to the software implementation, where the entire population is created before any individuals can be evaluated; in that case, the fitness evaluation must idly wait for the population creation to complete. This pipelining provides a significant speedup compared to software implementations.
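For reference, the PMX operation performed by the CMM can be expressed in software as follows. This is a minimal illustration for an integer permutation encoding, assuming task indices 0..n-1 with n at most 64; it is not the VHDL implementation listed in Appendix A.

/* Build one PMX child: copy the matching section from parent p2, fill the
 * remaining positions from parent p1, and resolve conflicts by following the
 * mapping defined by the copied section. */
static void pmx_child(const int *p1, const int *p2, int *child,
                      int n, int cut1, int cut2)
{
    int pos_in_p2[64];                        /* position of each task in p2 */
    for (int i = 0; i < n; i++)
        pos_in_p2[p2[i]] = i;

    for (int i = cut1; i < cut2; i++)         /* matching section from p2    */
        child[i] = p2[i];

    for (int i = 0; i < n; i++) {
        if (i >= cut1 && i < cut2)
            continue;
        int gene = p1[i];
        int j = pos_in_p2[gene];
        while (j >= cut1 && j < cut2) {       /* gene already in the section: */
            gene = p1[j];                     /* take the gene it displaced   */
            j = pos_in_p2[gene];
        }
        child[i] = gene;
    }
}

void pmx_crossover(const int *p1, const int *p2,
                   int *child1, int *child2, int n, int cut1, int cut2)
{
    pmx_child(p1, p2, child1, n, cut1, cut2); /* child1: section from p2 */
    pmx_child(p2, p1, child2, n, cut1, cut2); /* child2: section from p1 */
}

The two crossover points cut1 and cut2 correspond to the two random crossover points supplied to the CMM by the RNG module, while the number swapping mutation exchanges the tasks at the two random mutation points.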
The next section discusses the performance of the HGA with respect to variations in the GA parameters.

Results and Discussions

Similar to the sequential GA, the performance of the HGA is very much affected by the values of the GA parameters. Various tests were performed on the HGA to study the variations in its performance as the values of these parameters change. The HGA was implemented using VHDL on an Altera Cyclone EP1C12Q240 device. The device features and the corresponding resource utilization for each PSG are shown in Table 3 below.

Table 3. Cyclone EP1C12Q240 device features and resource utilization

Device feature    Device value    PSG 5           PSG 8
Logic elements    12060           10218 (84%)     8650 (71%)
RAM blocks        52
Total RAM bits    239616          131072 (54%)    131072 (54%)
User I/O pins     173             4 (2%)          4 (2%)

Due to resource constraints, the tests have been restricted to graphs with eight tasks or fewer. This limits the test suite to PSGs 5 and 8 only. In each test, only one of the GA parameters was changed, and the tests were executed 20 times to get the average performance of the system. To calculate the execution time of the HGA, the number of clock cycles required by the algorithm to generate the schedule was divided by the clock frequency of the hardware. From these tests, the following parameter values were determined: a population size of 140, a crossover probability of 0.8, and a mutation probability of 0.6.

From the execution times obtained using the above method and comparing them with the PGA, it was found that the performance for PSG 5 improved by a factor of 1.4, but the performance for PSG 8 dropped to a factor of 0.75. This performance can be further improved by performing timing optimization on the hardware. Since the Cyclone device used was limited in the number of logic elements, as shown in Table 3, no timing optimization could be performed. It was observed that on a Stratix device, with larger resources, an improvement on the order of 2-3 can be achieved. Another way to obtain higher speedup is through parallelization of the HGA modules. The process of achieving a parallelized version of the HGA is discussed in the next chapter.

CHAPTER 7

PARALLEL HARDWARE-BASED GA TASK SCHEDULER MODEL

The HGA implementation discussed in the previous chapter can be improved with parallelization. It is important to note, however, that the logic resources available within commercial off-the-shelf FPGAs are limited, making it difficult in some applications to configure the FPGA in such a way that all the modules of the GA can be implemented in hardware. Thus, compromises may be necessary where parts of the GA are implemented in hardware and other parts are implemented in software. For instance, it has been noticed that in the GA it is the evaluation of the fitness function which takes the longest time to complete. Therefore, if the hardware resources are limited, the FM can be implemented in hardware and interfaced with the software GA.

PHGA Model

If sufficient hardware resources are available and enough chip area is present, then the entire GA can be implemented in hardware and various HGA modules can be duplicated to produce additional concurrency. Scott et al. (1995) describe various techniques to achieve this additional concurrency. In this thesis, the FM is duplicated, providing parallelization of the fitness evaluation. This approach is taken because there are two chromosomes needing evaluation concurrently and it is the fitness evaluation that consumes the majority of the overall execution time.
The parallel hardware-based GA (PHGA) task scheduler is shown in Figure 22.

Figure 22. PHGA model with two FMs

Two FMs have been implemented. These FMs take in the two new offspring provided by the CMM, evaluate their fitness, and pass the evaluated offspring to the MC. This concurrency reduces the time required to evaluate the fitness of two chromosomes. Combining this parallelization with the existing pipelining, the performance of the GA scheduler can be significantly improved over that of the other implementations. The performance results obtained from this implementation are presented in the next section.

Results and Discussions

The PHGA was implemented using VHDL on an Altera Cyclone EP1C12Q240 device. The improvement in the performance of the GA scheduler is recorded for PSGs 5 and 8, described in Chapter 4. Similar to the HGA, load imbalance due to solution quality will not affect the execution time of the PHGA because of pipelining. Also, the concurrency from the two FMs reduces the overall evaluation time of the individuals by almost half. Additionally, since the nodes coexist in the same reconfigurable hardware device, the communication cost is negligible. These architectural features explain the improved performance seen by this implementation. To measure the speedup for this implementation, tests were performed on PSGs 5 and 8 using the GA configuration of a population size of 140, a crossover rate of 0.8, and a mutation rate of 0.6. It is observed that for PSG 8 the speedup achieved by the PHGA over the HGA was on the order of 2.3, and for PSG 5 it was on the order of 1.6. Table 4 shows the resource utilization of the PHGA for PSGs 5 and 8.

Table 4. Resource utilization for PHGA

Device feature    Device value    PSG 5           PSG 8
Logic elements    12060           10837 (89%)     10340 (85%)
RAM blocks        52
Total RAM bits    239616          131072 (54%)    131072 (54%)
User I/O pins     173             4 (2%)          4 (2%)

CHAPTER 8

CONCLUSIONS AND FUTURE WORK

The goal of this thesis is to apply a GA to the task scheduling problem and to improve the overall execution time of the algorithm using hardware acceleration. This section compares the performance of the different implementations of the scheduler with respect to the software sequential implementation. Finally, the future direction of the research is presented.

Results and Conclusions

Table 5 displays the speedup obtained by each implementation of the scheduler for the test task graphs with respect to the software sequential implementation.

Table 5. Speedup for scheduler implementations

GA Implementation      PSG 3          PSG 4           PSG 5          PSG 8
Software sequential    1.0            1.0             1.0            1.0
PGA                    2.8 (7 FMs)    3.75 (7 FMs)    1.6 (5 FMs)    1.3 (5 FMs)
HGA                    No Result      No Result       2.2            1.0
PHGA                   No Result      No Result       3.4 (2 FMs)    2.3 (2 FMs)

For the software implementations, it can be concluded that a significant speedup, on the order of 2-3, can be achieved by parallelizing the fitness evaluation using 5-7 slave processors. It was noted that for large task graphs, such as PSGs 3 and 4, more nodes are required for achieving high speedup, whereas the same speedup can be achieved using fewer slave nodes for the smaller, simpler task graphs such as PSGs 5 and 8.
For the hardware implementation, it is observed that the sequential HGA did not provide any performance enhancement for PSG 8, for the reasons discussed in Chapter 6, although its performance is still comparable, while a performance enhancement was noted for PSG 5. The PHGA provides high speedups due to the advantages of parallelism and pipelining. FPGAs with more resources can also be used to optimize the algorithm implemented in this thesis to achieve a higher operating frequency. This has been confirmed by testing the same algorithm on a Stratix device, which provided a speedup of 2-3 for PSGs 5 and 8.

Future Work

The PHGA has been shown to provide an improvement over the performance of the software implementations for most of the task graphs used in this thesis. However, different software parallelization schemes are available for parallelizing a GA, as discussed in Chapter 5. Future work might involve comparing the performance of these other schemes. Also, hardware implementations of these schemes can be evaluated with regard to performance and resource utilization. Some of these are described by Scott (1993). Another area for future consideration involves the use of different search techniques which can be incorporated to speed up the penalty function. Implementing these techniques might require extra resources, thus limiting them to reconfigurable devices having a large amount of resources.

Finally, a considerable amount of research is taking place in the field of hybrid algorithms. Hybrid algorithms combine the characteristics of two or more scheduling algorithms. Some popular algorithms used in hybrid techniques are list-based algorithms and simulated annealing. These algorithms are known to converge to a local optimum faster than a GA. Hence, a hybrid algorithm can be implemented where the simulated annealing algorithm performs local searches on sub-populations in slave nodes and sends the results to the GA running on the master node, which in turn uses these solutions to search for the global optimum.

REFERENCES

Abielmona, R., & Groza, V. (2001). Circuit Synthesis Evolution Using a Hardware-Based Genetic Algorithm. Canadian Conference on Electrical and Computer Engineering, 2, 963-968.

Adam, T. L., Chandy, K. M., & Dickson, J. R. (1974). A Comparison of List Schedules for Parallel Processing Systems. Communications of the ACM, 17, 12, 685-690.

Ahmad, I., & Kwok, Y.-K. (1999). On Parallelizing the Multiprocessor Scheduling Problem. IEEE Transactions on Parallel and Distributed Systems, 16, 4, 414-432.

Aporntewan, C., & Chongstitvatana, P. (2001). A Hardware Implementation of the Compact Genetic Algorithm. Proceedings of the IEEE Congress on Evolutionary Computation, 624-629.

Brown, S., & Rose, J. (1996). Architecture of FPGAs and CPLDs: A Tutorial. IEEE Design and Test of Computers, 13, 2, 42-57.

Brown, S., & Vranesic, Z. (2005). Fundamentals of Digital Logic with VHDL Design. New York: McGraw Hill.

Burleson, W., Ko, J., Niehaus, D., Ramamritham, K., Stankovic, J. A., Wallace, G., & Weems, C. (1999). The Spring Scheduling Coprocessor: A Scheduling Accelerator. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7, 1, 38-47.

Cantu-Paz, E. (1998). A Survey of Parallel Genetic Algorithms. Calculateurs Paralleles, 10, 2.

Casavant, T. L., & Kuhl, J. G. (1988). A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems. IEEE Transactions on Software Engineering, 14, 2, 141-154.
Chung, Y. C., & Ranka, S. (1992). Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory Multiprocessors. Proceedings of Supercomputing '92, 512-522.

Coffman Jr., E. G. (1975). Computer and Job Shop Scheduling Theory. New York: John Wiley & Sons Inc.

Colin, J. Y., & Chretienne, P. (1991). C.P.M. Scheduling with Small Computation Delays and Task Duplication. Operations Research, 680-684.

Correa, R. C., Ferreira, A., & Rebreyend, P. (1999). Scheduling Multiprocessor Tasks with Genetic Algorithms. IEEE Transactions on Parallel and Distributed Systems, 10, 8, 825-837.

El-Rewini, H., Lewis, T. G., & Ali, H. H. (1994). Task Scheduling in Parallel and Distributed Systems. New Jersey: Prentice-Hall Inc.

El-Rewini, H., Lewis, T. G., & Ali, H. H. (1995). Task Scheduling in Multiprocessing Systems. IEEE Computer Society Press, 28, 12, 27-37.

Gilles, D. W., & Liu, J. W. S. (1995). Scheduling Tasks with AND/OR Precedence Constraints. SIAM Journal on Computing, 24, 4, 797-810.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Massachusetts: Addison Wesley.

Hou, E. S. H., Ansari, N., & Ren, H. (1994). A Genetic Algorithm for Multiprocessor Scheduling. IEEE Transactions on Parallel and Distributed Systems, 5, 2, 113-120.

James, F. (1990). A Review of Pseudorandom Number Generators. Computer Physics Communications, 60, 329-344.

Joines, J. A., & Houck, C. R. (1994). On the Use of Non-Stationary Penalty Functions to Solve Nonlinear Constrained Optimization Problems with GA's. IEEE Transactions on Evolutionary Computation, 7, 5, 445-455.

Korkmaz, T., Krunz, M., & Tragoudas, S. (2002). An Efficient Algorithm for Finding a Path Subject to Two Additive Constraints. Computer Communications Journal, 25, 3, 225-238.

Kwok, Y.-K., & Ahmad, I. (1996). Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 7, 5, 506-521.

Kwok, Y.-K., & Ahmad, I. (1998). Benchmarking the Task Graph Scheduling Algorithms. Proceedings of the 12th International Parallel Processing Symposium, 531-537.

Lae-Jeong Park, & Cheol-Hoon Park (1995). Application of Genetic Algorithm to Job Shop Scheduling Problems with Active Schedule Constructive Crossover. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 530-535.

Lei, T., Ming-Cheng, Z., & Jing-xia, W. (2002). The Hardware Implementation of a Genetic Algorithm Model with FPGA. Proceedings of the IEEE International Conference on Field-Programmable Technology, 374-377.

Loo, S. M., Wells, B. E., & Winningham, J. D. (2003). A Genetic Algorithm Approach to Static Task Scheduling in a Reconfigurable Hardware Environment. Computers and Their Applications, 36-39.

Mahmood, A. (2000). A Hybrid Genetic Algorithm for Task Scheduling in Multiprocessor Real-Time Systems. Retrieved May 05, 2005, from http://www.ici.ro/ici/revista/sic2000_3/art05.html

Marsaglia, G., Zaman, A., & Tsang, W. W. (1990). Toward a Universal Random Number Generator. Letters in Statistics and Probability, 9, 1, 35-39.

Martin, P. (2002). An Analysis of Random Number Generators for a Hardware Implementation of Genetic Programming Using FPGAs and Handel-C. Technical Report, Department of Computer Science, University of Essex, 1-13.

Niehaus, D., Ramamritham, K., Stankovic, J. A., Wallace, G., Weems, C., Burleson, W., & Ko, J. (1993). The Spring Scheduling Coprocessor: Design, Use and Performance. IEEE Real-Time Systems Symposium, 106-111.
Parker, R. G. (1995). Deterministic Scheduling Theory. London: Chapman and Hall.

Rădulescu, A., & van Gemund, A. J. C. (2002). Fast and Effective Task Scheduling in Heterogeneous Systems. IEEE Transactions on Parallel and Distributed Systems, 13, 3, 260-274.

Ramamritham, K., & Stankovic, J. A. (1994). Scheduling Algorithms and Operating Systems Support for Real-Time Systems. Proceedings of the IEEE, 82, 1, 55-67.

Reade, W. (2002). Optimization with Genetic Algorithms. Retrieved February 25, 2006, from http://engr.smu.edu/~mhd/8331f04/GA.ppt

Scott, S. D. (1994). HGA: A Hardware-Based Genetic Algorithm. Master's thesis, University of Nebraska-Lincoln.

Scott, S. D., Samal, A., & Seth, S. C. (1995). HGA: A Hardware-Based Genetic Algorithm. Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 53-59.

Serra, M., Slater, T., Muzio, J. C., & Miller, D. M. (1990). The Analysis of One-Dimensional Linear Cellular Automata and Their Aliasing Properties. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 9, 7, 767-778.

Shackleford, B., Tanaka, M., Carter, R. J., & Snider, G. (2002). FPGA Implementation of Neighbourhood-of-Four Cellular Automata Random Number Generators. Proceedings of the 10th ACM International Symposium on Field-Programmable Gate Arrays, 106-112.

Solar, M., Kri, F., & Parada, V. (2001). A Parallel Genetic Scheduling Algorithm to Generate High Quality Solutions in a Short Time. 4th Metaheuristics International Conference, 115-119.

Talbi, E.-G., & Muntean, T. (1993). Hill-Climbing, Simulated Annealing and Genetic Algorithms: A Comparative Study. Proceedings of the 26th Hawaii International Conference on Systems Science (HICSS-26), 2, 565-573.

Tang, W., & Yip, L. (2004). Hardware Implementation of Genetic Algorithm Using FPGA. The 47th IEEE International Midwest Symposium on Circuits and Systems, 549-552.

Vahid, F., & Givargis, T. (2000). Embedded System Design: A Unified Hardware/Software Introduction. New York: John Wiley & Sons Inc.

Vega-Rodriguez, M. A., Gutierrez-Gil, R., Avila-Román, J. M., Sánchez Pérez, J. M., & Gómez Pulido, J. A. (2005). Genetic Algorithms Using Parallelism and FPGAs: The TSP as Case Study. Proceedings of the 2005 International Conference on Parallel Processing Workshops, 573-579.

Wang, L., Siegel, H. J., Roychowdhury, V. P., & Maciejewski, A. A. (1997). Task Matching and Scheduling in Heterogeneous Computing Environments Using a Genetic-Algorithm-Based Approach. Journal of Parallel and Distributed Computing, 47, 8-22.

Wu, M. Y., & Gajski, D. D. (1990). Hypertool: A Programming Aid for Message-Passing Systems. IEEE Transactions on Parallel and Distributed Systems, 1, 3, 330-343.

Yang, T., & Gerasoulis, A. (1993). List Scheduling with and without Communication Delays. Parallel Computing Journal, 19, 1321-1344.

Yeniay, O. (2005). Penalty Function Methods for Constrained Optimization with Genetic Algorithms. Mathematical and Computational Applications, 10, 1, 45-56.

Yoshida, N., & Yasuoka, T. (1999). Multi-GAP: Parallel and Distributed Genetic Algorithms in VLSI. Proceedings of the IEEE SMC'99 Conference on Systems, Man and Cybernetics, 5, 571-576.

Zomaya, A. Y., Ward, C., & Macey, B. (1999). Genetic Scheduling for Parallel Processor Systems: Comparative Studies and Performance Issues. IEEE Transactions on Parallel and Distributed Systems, 10, 8, 795-812.
APPENDIX A

This appendix presents the VHDL code for some of the modules that are a part of the HGA and PHGA implementations of the GA-based task scheduler.

A1. GA HEADER

PACKAGE GA_Header IS

  -- Initial Parameters of GA
  constant maxnumtasks    : integer;  -- maximum number of tasks
  constant width_numtasks : integer;  -- width of maximum number of tasks
  constant maxfitness     : integer;  -- maximum fitness
  constant precfitness    : integer;  -- weight for precedence
  constant priofitness    : integer;  -- weight for priority
  constant maxpop         : integer;  -- maximum size of population
  constant maxnumgen      : integer;  -- maximum number of generations
  constant width_membus   : integer;  -- width of memory bus
  constant rngseed        : STD_LOGIC_VECTOR(width_rng DOWNTO 0) := "1001011011001100";  -- random number generator seed
  constant numtasks       : integer;         -- number of tasks in task graph
  constant popsize        : integer := 140;  -- population size
  constant probx          : STD_LOGIC_VECTOR(width_prob DOWNTO 0) := "110010";  -- crossover rate
  constant probm          : STD_LOGIC_VECTOR(width_prob DOWNTO 0) := "100110";  -- mutation rate

  -- Data structure for task string
  type taskn IS ARRAY (0 TO numtasks) OF STD_LOGIC_VECTOR(7 DOWNTO 0);

  -- Data structure for a chromosome
  type chromosome IS record
    tasknum : taskn;
    fit     : STD_LOGIC_VECTOR(15 DOWNTO 0);
  END record;

  -- Data structure for a task
  type task IS record
    tnumber   : STD_LOGIC_VECTOR(7 DOWNTO 0);
    tetime    : STD_LOGIC_VECTOR(7 DOWNTO 0);
    tpriority : STD_LOGIC_VECTOR(7 DOWNTO 0);
    numofpred : STD_LOGIC_VECTOR(7 DOWNTO 0);
    pred      : taskn;
    numofsuc  : STD_LOGIC_VECTOR(7 DOWNTO 0);
    suc       : taskn;
    lpriority : integer range 0 to 21;
    lflag     : STD_LOGIC;
  END record;

  -- Data structure for a task graph
  type taskgraph IS ARRAY (0 TO numtasks) OF task;

END;

package body GA_Header IS
END;

A2.
MEMORY CONTROLLER MODULE ENTITY mem_ctrl IS PORT( -- Basic signals to start GA scheduler clk : IN STD_LOGIC; -- clock signal reset : IN STD_LOGIC; -- reset signal start : IN STD_LOGIC; -- start GA signal init -- Signal to start RTG, IPG, RNG modules : OUT STD_LOGIC; -- initiate module signal datain address dataout rw -- Signals for memory read write : IN STD_LOGIC_VECTOR(7 DOWNTO 0); -- data from mem : OUT STD_LOGIC_VECTOR(13 DOWNTO 0);-- addr for r/w : OUT STD_LOGIC_VECTOR(7 DOWNTO 0); -- data to mem : OUT STD_LOGIC; -- read/write signal reqtg acktg addrTG valout -- Signals for reading Task graph : IN STD_LOGIC; -- request from RTG : OUT STD_LOGIC; -- acknowledge RTG : IN STD_LOGIC_VECTOR(13 DOWNTO 0);-- addr from RTG : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);-- data to RTG -- Signals for writing new evaluated population from FM old_new : IN STD_LOGIC; -- select new population address signal reqFM : IN STD_LOGIC; -- request from FM ackFM : OUT STD_LOGIC; -- acknowledge FM 75 valin addrFM GA_donein reqseq addrseq ackseq seqout : IN STD_LOGIC_VECTOR(7 DOWNTO 0); -- data to be written : IN STD_LOGIC_VECTOR(13 DOWNTO 0); -- address of data : IN STD_LOGIC; -- GA found best schedule signal -- Signals for reading old population for PS module : IN STD_LOGIC; -- request from PS : IN INTEGER RANGE 0 to popsize; -- address of member : OUT STD_LOGIC; -- acknowledge to PS : OUT chromosome; -- sEND member to PS -- Signal for indicating GA is done scheduling GA_done : OUT STD_LOGIC -- GA done scheduling ); END mem_ctrl; ARCHITECTURE behavior OF mem_ctrl IS TYPE states IS(start1, start2, idle, readTG, seqpop, seqpop1, FMinitial, FMwrite, done1, done2, write1, write2, read1, read2, readseq1, readseq2, readseq3, readseq4, clkdone, clkdone1, clkdone2, clkdone3); SIGNAL state: states; constant popbase0 : INTEGER := 1153; constant popbase1 : INTEGER := 7954; constant popbase2 : INTEGER := 32; constant bestaddr : STD_LOGIC_VECTOR(13 DOWNTO 0) := "11111111110000"; BEGIN main: PROCESS(clk,reset,GA_donein,old_new) VARIABLE popbase variable clkcntvec variable chromo variable chromsize variable psizetemp : INTEGER range 0 TO 7955; : STD_LOGIC_VECTOR(31 DOWNTO 0); : chromosome; : integer range 0 to 32; : INTEGER RANGE 0 to popsize; BEGIN IF reset = '0' THEN init <= '0'; GA_done <= '0'; rw <= '0'; flag := '0'; flag1 := '0'; ackFM <= '0'; ackseq <= '0'; clkcnt := 0; chromsize := 0; tempcnt := 0; cnt := 0; state <= start1; 76 ELSIF (clk'EVENT AND clk = '1' AND clk'LAST_VALUE = '0') THEN CASE state IS WHEN start1 => init <= '0'; rw <= '0'; flag := '0'; ackFM <= '0'; acktg <= '0'; ackseq <= '0'; chromsize := 0; IF start = '0' THEN GA_done <= '0'; state <= start2; END IF; WHEN start2 => init <= '0'; GA_done <= '0'; rw <= '0'; flag := '0'; acktg <= '0'; ackFM <= '0'; ackseq <= '0'; chromsize := 0; stcnt := '0'; IF start = '0' THEN init <= '1'; state <= idle; END IF; WHEN idle => flag := '0'; flag1 := '0'; rw <= '0'; IF reqtg = '1' THEN state <= readTG; -- read task graph ELSIF GA_donein = '1' THEN stcnt := '0'; clkcntvec := conv_std_logic_vector(clkcnt,32); state <= clkdone; -- count number of clocks ELSIF reqFM = '1' THEN stcnt := '1'; state <= FMinitial; -- initialize FM ELSIF reqseq = '1' THEN chromsize := 0; state <= seqpop; -- sequence through population END IF; WHEN readTG => 77 IF reqtg = '1' THEN popbase := 0; acktg <= '1'; address <= conv_std_logic_vector(popbase + conv_integer((addrTG)), 14); state <= read1; END IF; WHEN read1 => rw <= '1'; state <= read2; WHEN read2 => valout <= datain; rw <='1'; dataout <= datain; 
acktg <= '0'; state <= idle; WHEN seqpop => IF reqseq = '1' THEN IF old_new = '0' THEN popbase := popbase1; ELSE popbase := popbase0; END IF; ackseq <= '1'; state <= seqpop1; END IF; WHEN seqpop1 => IF reqseq = '0' THEN -- chromosome number to be read available chromsize := 0; psizetemp := addrseq; state <= readseq1; END IF; WHEN readseq1 => IF chromsize < numtasks THEN -- read tasks address <= conv_std_logic_vector((popbase + 34*(psizetemp) + chromsize), 14); flag := '0'; flag1 := '0'; ELSE -- read fitness value IF flag1 = '0' THEN address <= conv_std_logic_vector(popbase + ((32*(psizetemp + 1))+(psizetemp*2)), 14); flag := '1'; ELSE address <= conv_std_logic_vector(popbase + ((32*(psizetemp + 1))+(psizetemp*2)+1), 14); flag := '1'; 78 END IF; END IF; state <= readseq2; WHEN readseq2 => rw <= '1'; state <= readseq3; WHEN readseq3 => IF flag = '0' THEN -- read task numbers chromo.tasknum(chromsize) := datain; rw <= '1'; dataout <= datain; chromsize := chromsize + 1; state <= readseq4; ELSE IF flag1 = '0' THEN -- read upper byte of fitness value chromo.fit(15 DOWNTO 8) := datain; rw <= '1'; dataout <= datain; flag1 := '1'; state <= readseq4; ELSE chromo.fit(7 DOWNTO 0) := datain;-- read lower byte of fitness rw <= '1'; dataout <= datain; seqout <= chromo; flag := '0'; flag1 := '0'; ackseq <= '0'; -- reading chromosome complete state <= idle; END IF; END IF; WHEN readseq4 => rw <= '0'; state <= readseq1; WHEN FMinitial => IF reqFM = '1' THEN IF old_new = '0' THEN popbase := popbase0; ELSE popbase := popbase1; END IF; ackFM <= '1'; state <= FMwrite; END IF; WHEN FMwrite => IF reqFM = '0' THEN 79 address <= conv_std_logic_vector(popbase + conv_integer(('0' & addrFM)), 14); dataout <= valin; state <= write1; END IF; WHEN write1 => rw <= '1'; state <= write2; WHEN write2 => IF flag = '0' THEN rw <= '0'; ackFM <= '0'; state <= idle; ELSE IF cnt = 0 THEN state <= clkdone1; ELSIF cnt = 1 THEN state <= clkdone2; ELSIF cnt = 2 THEN state <= clkdone3; ELSE state <= done1; END IF; END IF; WHEN clkdone => address <= bestaddr; dataout <= clkcntvec(31 DOWNTO 24); flag := '1'; state <= write1; WHEN clkdone1 => address <= bestaddr+1; dataout <= clkcntvec(23 DOWNTO 16); flag := '1'; cnt := 1; state <= write1; WHEN clkdone2 => address <= bestaddr+2; dataout <= clkcntvec(15 DOWNTO 8); flag := '1'; cnt := 2; state <= write1; WHEN clkdone3 => address <= bestaddr+3; dataout <= clkcntvec(7 DOWNTO 0); flag := '1'; 80 cnt := 3; state <= write1; WHEN done1 => rw <= '0'; state <= done2; WHEN done2 => init <= '0'; rw <= '0'; ackFM <= '0'; GA_done <= '1'; flag := '0'; acktg <= '0'; ackseq <= '0'; chromsize := 0; state <= start1; END CASE; IF stcnt /= '0' THEN clkcnt := clkcnt+1; END IF; END IF; END PROCESS main; END behavior; A3. 
RANDOM NUMBER GENERATOR MODULE

ENTITY rand_gen IS PORT(
    clk  : IN STD_LOGIC; -- clock signal
    load : IN STD_LOGIC; -- load signal from MC
    -- Random signals to CMM
    domut    : OUT STD_LOGIC_VECTOR(width_prob DOWNTO 0);
    docross  : OUT STD_LOGIC_VECTOR(width_prob DOWNTO 0);
    mutpt1   : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
    mutpt2   : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
    crosspt1 : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
    crosspt2 : OUT STD_LOGIC_VECTOR(width_numtasks DOWNTO 0);
    -- Random signals to SM
    select1  : OUT STD_LOGIC_VECTOR(width_pop DOWNTO 0);
    select2  : OUT STD_LOGIC_VECTOR(width_pop DOWNTO 0);
    -- Random signal to IPG
    rand_init : OUT INTEGER RANGE 0 TO numtasks);
END rand_gen;

ARCHITECTURE behavior OF rand_gen IS
TYPE states IS (idle, init, toint, tomod);
SIGNAL state : states;
-- pselect1 and pselect2 are driven below but do not appear in the port list of this
-- listing; they are declared here as local signals so the architecture is complete.
SIGNAL pselect1, pselect2 : STD_LOGIC_VECTOR(width_pop DOWNTO 0);
BEGIN
PROCESS(clk, load)
    VARIABLE rn_vec1, rn_vec2 : STD_LOGIC_VECTOR(width_rng DOWNTO 0);
    VARIABLE evenodd : STD_LOGIC;
    VARIABLE rn_int1 : INTEGER RANGE 0 TO 65535;
    VARIABLE b : INTEGER := 0;
    CONSTANT c : INTEGER := numtasks;
    CONSTANT d : REAL := 65535.0;
    VARIABLE rn_int2 : INTEGER RANGE 0 TO 9;
BEGIN
    IF load = '0' THEN
        state <= idle;
    ELSIF rising_edge(clk) THEN
        CASE state IS
            WHEN idle =>
                rn_vec1 := rngseed;
                state <= init;
            WHEN init => -- shift/XOR network producing the next 16-bit random word
                rn_vec2(15) := '0' XOR rn_vec1(15) XOR rn_vec1(14);
                rn_vec2(14) := rn_vec1(15) XOR rn_vec1(14) XOR rn_vec1(13);
                evenodd := '0';
                FOR i IN 13 DOWNTO 1 LOOP
                    rn_vec2(i) := rn_vec1(i+1) XOR rn_vec1(i-1);
                    IF evenodd = '1' THEN
                        rn_vec2(i) := rn_vec2(i) XOR rn_vec1(i);
                    END IF;
                    evenodd := NOT evenodd;
                END LOOP;
                rn_vec2(0) := rn_vec1(1) XOR '0';
                IF evenodd = '1' THEN
                    rn_vec2(0) := rn_vec2(0) XOR rn_vec1(0);
                END IF;
                rn_vec1 := rn_vec2;
                state <= toint;
            WHEN toint => -- convert the random word to an integer
                rn_int1 := 0;
                rn_int2 := 0;
                b := 1;
                FOR i IN 0 TO 15 LOOP
                    IF rn_vec1(i) = '1' THEN
                        rn_int1 := rn_int1 + b;
                    END IF;
                    b := b*2;
                END LOOP;
                state <= tomod;
            WHEN tomod => -- slice bit fields of the random word for the other modules
                rn_int2 := rn_int1 mod c;
                rand_init <= rn_int2;
                domut <= rn_vec1(width_rng DOWNTO (width_rng - width_prob));
                mutpt1 <= conv_std_logic_vector(((conv_integer('0' & (rn_vec1((width_rng - 7) DOWNTO (width_rng - width_numtasks - 7))))) mod numtasks), (width_numtasks + 1));
                mutpt2 <= conv_std_logic_vector(((conv_integer('0' & (rn_vec1((width_rng - 9) DOWNTO (width_rng - width_numtasks - 9))))) mod numtasks), (width_numtasks + 1));
                docross <= rn_vec1(width_rng-2 DOWNTO width_rng-width_prob-2);
                crosspt1 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-8 DOWNTO width_rng-width_numtasks-8)) mod numtasks), width_numtasks+1);
                crosspt2 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-4 DOWNTO width_rng-width_numtasks-4)) mod numtasks), width_numtasks+1);
                pselect1 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-3 DOWNTO width_rng-width_pop-3)) mod popsize), width_pop+1);
                pselect2 <= conv_std_logic_vector((conv_integer(rn_vec1(width_rng-1 DOWNTO width_rng-width_pop-1)) mod popsize), width_pop+1);
                select1 <= conv_std_logic_vector(((conv_integer(rn_vec1(width_rng-5 DOWNTO width_rng-width_pop-5)) mod (popsize-1)) + 1), width_pop+1);
                select2 <= conv_std_logic_vector(((conv_integer(rn_vec1(width_rng-6 DOWNTO width_rng-width_pop-6)) mod (popsize-1)) + 1), width_pop+1);
                state <= init;
        END CASE;
    END IF;
END PROCESS;
END behavior;

A4.
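The rand_gen process above advances its 16-bit word with a fixed shift/XOR network each generation and then slices different bit fields of that word to supply the crossover, mutation, and selection randoms. As a point of comparison only (this is not the network used above), a conventional maximal-length 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11 can be written as:

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY lfsr16 IS PORT(
    clk  : IN  STD_LOGIC;
    load : IN  STD_LOGIC;                      -- asynchronous (re)seed, active low as in rand_gen
    seed : IN  STD_LOGIC_VECTOR(15 DOWNTO 0);  -- must be non-zero
    q    : OUT STD_LOGIC_VECTOR(15 DOWNTO 0));
END lfsr16;

ARCHITECTURE behavior OF lfsr16 IS
    SIGNAL r : STD_LOGIC_VECTOR(15 DOWNTO 0);
BEGIN
    PROCESS(clk, load)
        VARIABLE fb : STD_LOGIC;
    BEGIN
        IF load = '0' THEN
            r <= seed;
        ELSIF rising_edge(clk) THEN
            -- taps 16, 14, 13, 11 (x^16 + x^14 + x^13 + x^11 + 1), a maximal-length polynomial
            fb := r(15) XOR r(13) XOR r(12) XOR r(10);
            r  <= r(14 DOWNTO 0) & fb;
        END IF;
    END PROCESS;
    q <= r;
END behavior;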
READ TASK GRAPH MODULE ENTITY readTG IS PORT( clk : IN STD_LOGIC; -- clock signal -- Signals to and from MC inittg : IN STD_LOGIC; -- initiate RTG 83 ackmc reqmc datain addrmc : IN STD_LOGIC; -- acknowledge from MC : OUT STD_LOGIC; -- request to MC : IN STD_LOGIC_VECTOR(7 DOWNTO 0); -- data from MC : OUT STD_LOGIC_VECTOR(13 DOWNTO 0); -- address of TG -- Signals to and from FM tasktableout : OUT taskgraph; -- task table to FM initFM : OUT STD_LOGIC -- initiate FM); END readTG; ARCHITECTURE behavior OF readTG IS TYPE states IS (idle, read1, readpred, awaitackmem1, awaitackmem2, filltable, loop1, loop2, loop3,loop4,loop5,loop6,done); SIGNAL state: states; SIGNAL tg : taskgraph; BEGIN PROCESS(clk,inittg,tg) variable tempcnt variable temppred,i,j,k variable count1, test2, temp variable count2 variable addr variable data : INTEGER RANGE 0 TO 32; : INTEGER RANGE 0 TO 32; : INTEGER RANGE 0 TO 40; : STD_LOGIC_VECTOR(13 DOWNTO 0); : STD_LOGIC_VECTOR(13 DOWNTO 0); : STD_LOGIC_VECTOR(7 DOWNTO 0); BEGIN IF inittg = '0' THEN reqmc <= '0'; tempcnt := 0; temppred := 0; count1 := 0; count2 := "00000000000000"; initFM <= '0'; state <= idle; ELSIF rising_edge ( CLK) THEN CASE state IS WHEN idle => reqmc <= '0'; tempcnt := 0; temppred := 0; count1 := 0; count2 := "00000000000000"; initFM <= '0'; i := 0; j := 0; k := 0; state <= read1; 84 WHEN read1 => IF tempcnt < numtasks THEN tg(tempcnt).lflag <= '0'; tg(i).lpriority <= 0; addr := count2; addrmc <= addr; reqmc <= '1'; -- request to Mem ctrller state <= awaitackmem1; ELSE tempcnt := 0; i := 0; state <= filltable; END IF; WHEN awaitackmem1 => IF ackmc ='1' THEN reqmc <= '0'; state <= awaitackmem2; END IF; WHEN awaitackmem2 => IF ackmc ='0' THEN IF count1 = 0 THEN tg(tempcnt).tnumber <= datain; count1 := count1 + 1; count2 := count2 + 1; state <= read1; ELSIF count1 = 1 THEN tg(tempcnt).tetime <= datain; count1 := count1 + 1; count2 := count2 + 1; state <= read1; ELSIF count1 = 2 THEN tg(tempcnt).tpriority <= datain; count1 := count1 + 1; count2 := count2 + 1; state <= read1; ELSIF count1 = 3 THEN tg(tempcnt).numofpred <= datain; count1 := count1 + 1; count2 := count2 + 1; temppred := 0; state <= readpred; ELSE tg(tempcnt).pred(temppred) <= datain; count1 := count1 + 1; count2 := count2 + 1; temppred := temppred + 1; state <= readpred; END IF; END IF; 85 WHEN readpred => IF temppred < conv_integer(tg(tempcnt).numofpred)THEN addr := count2; addrmc <= addr; reqmc <= '1'; state <= awaitackmem1; ELSE count1 := 0; tempcnt := tempcnt + 1; state <= read1; END IF; WHEN filltable => IF tempcnt < numtasks THEN i := 0; state <= loop1; ELSE i := 0; state <= loop4; END IF; WHEN loop1 => IF i < numtasks THEN IF tg(i).lflag = '0' THEN IF tg(i).numofpred = "00000000" THEN tg(i).lpriority <= H_P; tg(i).lflag <= '1'; tempcnt := tempcnt + 1; i := i + 1; state <= loop1; ELSE test2 := 1; temp := 0; j := 0; state <= loop2; END IF; ELSE i := i + 1; state <= loop1; END IF; ELSE state <= filltable; END IF; WHEN loop2 => IF j < conv_integer(tg(i).numofpred) THEN k := 0; state <= loop3; ELSE IF temp = conv_integer(tg(i).numofpred) THEN 86 tg(i).lpriority <= H_P - test2; tg(i).lflag <= '1'; tempcnt := tempcnt + 1; i := i + 1; state <= loop1; END IF; i := i + 1; state <= loop1; END IF; WHEN loop3 => IF k < numtasks THEN IF tg(i).pred(j) = tg(k).tnumber THEN IF tg(k).lflag /= '0' THEN IF tg(k).numofpred /= "00000000" THEN IF test2 <= (H_P - tg(i).lpriority) THEN test2 := H_P - tg(k).lpriority + 1; temp := temp + 1; ELSE temp := temp + 1; END IF; ELSE temp := temp + 1; END IF; END IF; END 
IF; k := k+1; state <= loop3; ELSE j:= j+1; state <= loop2; END IF; WHEN loop4 => IF i < numtasks THEN j := 0; tg(i).numofsuc <= "00000000"; state <= loop5; ELSE state <= done; END IF; WHEN loop5 => IF j < numtasks THEN k := 0; state <= loop6; ELSE i := i + 1; state <= loop4; END IF; 87 WHEN loop6 => IF k < conv_integer(tg(j).numofpred) THEN IF tg(i).tnumber = tg(j).pred(k) THEN tg(i).suc(conv_integer(tg(i).numofsuc)) <= tg(j).tnumber; tg(i).numofsuc <= tg(i).numofsuc + 1; k := k + 1; j := j+1 ; state <= loop5; ELSE k := k+1; state <= loop6; END IF; ELSE j := j + 1; state <= loop5; END IF; WHEN done => reqmc <= '0'; tasktableout <= tg; initFM <= '1'; END CASE; END IF; END PROCESS; END behavior; A4. INITIAL POPULATION GENERATOR MODULE ENTITY GA_initdata1 IS PORT( clk : IN STD_LOGIC; -- clock signal initipg : IN STD_LOGIC; -- initiate IPG randint : IN INTEGER RANGE 0 TO numtasks; -- random number from RNG -- Signals to FM ackfm : IN STD_LOGIC; -- acknowledge from FM reqfm : OUT STD_LOGIC; -- request to FM chromoout : OUT chromosome -- member to FM); END GA_initdata1; ARCHITECTURE behavior OF GA_initdata1 IS TYPE states IS (idle, init1,init2,awaitackfm1,awaitackfm2,initrand1, initrand2, done); SIGNAL state: states; BEGIN PROCESS(clk,initipg) variable chromo : chromosome; variable tog : STD_LOGIC := '0'; 88 variable psizetemp : integer RANGE 0 TO 200; variable chromsize : integer RANGE 0 TO 32; variable tempcnt : integer RANGE 0 TO 32; variable index1, index2 : integer RANGE 0 TO 32; variable temp : STD_LOGIC_VECTOR(7 DOWNTO 0); BEGIN IF initipg = '0' THEN reqfm <= '0'; state <= idle; psizetemp := 0; chromsize := 0; tempcnt := 0; ELSIF rising_edge (CLK) THEN CASE state IS WHEN idle => state <= init1; WHEN init1 => IF psizetemp < popsize THEN state <= init2; ELSE state <= done; END IF; WHEN init2 => IF chromsize < numtasks THEN chromo.tasknum(chromsize) := conv_std_logic_vector((chromsize + 1), 8); chromsize := chromsize + 1; state <= init1; ELSE chromo.fit := conv_std_logic_vector(50000,16); tempcnt := 0; state <= initrand1; END IF; WHEN initrand1 => IF tempcnt < numtasks THEN index1 := tempcnt; index2 := randint; state <= initrand2; ELSE reqfm <= '1'; state <= awaitackfm1; END IF; WHEN initrand2 => temp := chromo.tasknum(index1); chromo.tasknum(index1) := chromo.tasknum(index2); 89 chromo.tasknum(index2) := temp; tempcnt := tempcnt+1; state <= initrand1; WHEN awaitackfm1 => IF ackfm ='1' THEN chromoout <= chromo; reqfm <= '0'; state <= awaitackfm2; END IF; WHEN awaitackfm2 => IF ackfm = '0' THEN psizetemp := psizetemp +1; IF psizetemp < popsize THEN state <= init2; ELSE state <= done; END IF; END IF; WHEN done => reqfm <= '0'; END CASE; END IF; END PROCESS; END behavior; A5. 
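The fitness module listed next scores each chromosome by starting two running totals at the constants precfitness and priofitness and subtracting position-weighted penalties as it walks the task ordering: a precedence violation detected at position i costs 150*(numtasks - i - 1) and a priority inversion costs 50*(numtasks - i - 1), so mistakes near the front of the schedule are punished more heavily than mistakes near the end. In the normal case the stored fitness is therefore

    fit = precfitness + priofitness
          - 150 * (sum over precedence-violating positions i of (numtasks - i - 1))
          - 50  * (sum over priority-violating positions i of (numtasks - i - 1));

orderings whose precedence total falls to zero or below priofitness are instead clamped to maxfitness/1000. A violation-free ordering scores 50000 ("1100001101010000"), which is the value the best_done comparison tests for, suggesting that precfitness + priofitness = 50000 in the configuration used here.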
FITNESS MODULE ENTITY GA_Fitness IS PORT( clk : IN STD_LOGIC; -- clock signal -- Signals to and from RTG initfm : IN STD_LOGIC; -- start FM tasktablein : IN taskgraph; -- task graph table -- Signals to and from IPG reqipg : IN STD_LOGIC; -- request from IPG dataipg : IN chromosome; -- data from IPG ackipg : OUT STD_LOGIC; -- acknowledge IPG -- Signals to and from MC ackmc : IN STD_LOGIC; -- acknowledge from MC reqmc : OUT STD_LOGIC; -- request to MC addrmc : OUT STD_LOGIC_VECTOR(13 DOWNTO 0); -- address of data dataout : OUT STD_LOGIC_VECTOR(7 DOWNTO 0); -- data old_new : OUT STD_LOGIC; -- old/new population toggle 90 best_done : OUT STD_LOGIC; -- best schedule obtained -- Signals to and from CMM reqxover : IN STD_LOGIC; -- request from CMM child1 : IN chromosome; -- child1 from CMM child2 : IN chromosome; -- child2 from CMM ackxover : OUT STD_LOGIC; -- acknowledge to CMM -- Signals to and from SM sof : OUT integer RANGE 0 TO 10000000; -- Sum of fitness to SM initGAmod : OUT STD_LOGIC); -- Start PS,SM and CMM modules END GA_Fitness; ARCHITECTURE behavior OF GA_Fitness IS TYPE states IS (idle0, idle1, idle, fm1,lp1,lp2,lp3,lp4,lp5,lp6,lp7,lp8,lp9,lp10,lp11,lp12, fm2, ackmem1_1, ackmem1_2, ackmem2, done1, done2, done3, bestdone); SIGNAL state: states; SIGNAL bestfit : chromosome; SIGNAL best : STD_LOGIC; BEGIN PROCESS(bestfit.fit) BEGIN IF bestfit.fit = "1100001101010000" THEN best <= '1'; ELSE best <= '0'; END IF; END PROCESS; PROCESS(clk,initfm,best) variable individual,ch1,ch2 variable tg variable tog variable psizetemp variable chromsize variable addr variable data variable data1 variable flag,ipgdone variable cnt,cntchild variable RQ variable EQ variable sumfit1, sumfit2 variable i,j,k,l,m,n,p,rqc variable pri, prec variable bestnum BEGIN : chromosome; : taskgraph; : STD_LOGIC ; : integer RANGE 0 TO 200; : integer RANGE 0 TO 32; : STD_LOGIC_VECTOR(13 DOWNTO 0); : STD_LOGIC_VECTOR(7 DOWNTO 0); : STD_LOGIC_VECTOR(15 DOWNTO 0); : STD_LOGIC; : STD_LOGIC; : taskn; : STD_LOGIC_VECTOR((numtasks-1) downto 0); : integer RANGE 0 TO 10000000; : integer RANGE 0 TO 32; : integer RANGE 0 TO 45000; : integer RANGE 0 TO 200; 91 IF initfm = '0' THEN reqmc <= '0'; ackipg <= '0'; best_done <= '0'; bestfit.fit <= "0000000000000000"; psizetemp := 0; chromsize := 0; tog := '0'; old_new <= tog; flag := '0'; cnt := '0'; cntchild := '0'; initGAmod <= '0'; ipgdone := '0'; state <= idle0; ELSIF rising_edge(CLK) THEN CASE state IS WHEN idle0 => tg := tasktablein; state <= idle1; WHEN idle1 => psizetemp := 0; chromsize := 0; ipgdone := '0'; sumfit1 := 0; sumfit2 := 0; state <= idle; WHEN idle => flag := '0'; prec := precfitness; pri := priofitness; cntchild := '0'; IF best = '1' THEN state <= bestdone; ELSIF ipgdone = '0' THEN IF reqipg = '1' THEN IF psizetemp < popsize THEN cntchild := '1'; ackipg <= '1'; state <= fm1; END IF; END IF; ELSE IF reqxover = '1' THEN IF psizetemp < popsize THEN ackxover <= '1'; state <= fm1; 92 END IF; END IF; END IF; WHEN fm1 => flag := '0'; cnt := '0'; IF ipgdone = '0' THEN IF reqipg = '0' THEN individual := dataipg; i := 0; rqc := 0; state <= lp1; END IF; ELSIF reqxover = '0' THEN IF cntchild = '0' THEN ch1 := child1; ch2 := child2; individual := ch1; i := 0; rqc := 0; state <= lp1; ELSE individual := ch2; i := 0; rqc := 0; state <= lp1; END IF; END IF; WHEN lp1 => ackipg <= '0'; IF cntchild = '1' THEN ackxover <= '0'; -- acknowledge that children are copied IF psizetemp = (popsize-1) THEN initGAmod <= '0'; END IF; END IF; IF i <numtasks THEN RQ(i) := "00000000"; EQ (i):= '0'; IF 
tg(i).numofpred = 0 THEN RQ(rqc) := tg(i).tnumber; rqc := rqc + 1; END IF; i := i+1; state <= lp1; ELSE i := 0; state <= lp2; 93 END IF; WHEN lp2 => IF i < numtasks THEN j := 0; state <= lp3; ELSE IF prec /= precfitness THEN IF prec > 0 THEN IF prec < priofitness THEN individual.fit := conv_std_logic_vector((maxfitness/1000), 16); ELSE IF pri < 0 THEN individual.fit := conv_std_logic_vector((prec + pri),16); ELSE individual.fit := conv_std_logic_vector((prec + pri),16); END IF; END IF; ELSE individual.fit := conv_std_logic_vector((maxfitness/1000),16); END IF; ELSE individual.fit := conv_std_logic_vector((prec + pri),16); END IF; chromsize := 0; state <= fm2; END IF; WHEN lp3 => IF j < rqc THEN IF individual.tasknum(i) = RQ(j) THEN k := 0; state <= lp4; ELSE IF j = (rqc-1) THEN EQ(conv_integer(individual.tasknum(i))-1) := '1'; j := numtasks; m := 0; n := 0; state <= lp11; ELSE j := j+1; state <= lp3; END IF; END IF; ELSE i := i + 1; state <= lp2; END IF; 94 WHEN lp4 => IF k < rqc THEN IF RQ(k) /= "00000000" THEN l := 0; state <= lp5; ELSE k := k+1; state <= lp4; END IF; ELSE EQ(conv_integer(individual.tasknum(i))-1) := '1'; RQ(j) := "00000000"; j := numtasks; state <= lp6; END IF; WHEN lp5 => IF l < numtasks THEN IF RQ(k) = tg(l).tnumber THEN IF tg(conv_integer(individual.tasknum(i))-1).tpriority < tg(l).tpriority THEN pri := pri - 50 * (numtasks - i - 1); END IF; END IF; l := l + 1; state <= lp5; ELSE k := k + 1; state <= lp4; END IF; WHEN lp6 => IF conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofsuc) > 0 THEN k := 0; state <= lp7; ELSE j := j+1; state <= lp3; END IF; WHEN lp7 => IF k < conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofsuc) THEN l := 0; state <= lp8; ELSE j := j+1; state <= lp3; END IF; 95 WHEN lp8 => IF l < numtasks THEN IF tg(conv_integer(individual.tasknum(i))-1).suc(k) = tg(l).tnumber THEN IF EQ(l) = '0' THEN IF conv_integer(tg(l).numofpred) > 1 THEN m := 0; n := 0; state <= lp9; ELSE RQ(rqc) := tg(l).tnumber; IF rqc = (numtasks - 1) THEN rqc := 0; ELSE rqc := rqc+1; END IF; END IF; END IF; END IF; l := l+1; state <= lp8; ELSE k := k + 1; state <= lp7; END IF; WHEN lp9 => IF n < conv_integer(tg(l).numofpred) THEN p := 0; state <= lp10; ELSE IF m = conv_integer(tg(l).numofpred) THEN RQ(rqc) := tg(m).tnumber; IF rqc = (numtasks) THEN rqc := 0; ELSE rqc := rqc + 1; END IF; END IF; l := l+1; state <= lp8; END IF; WHEN lp10 => IF p < numtasks THEN IF tg(l).pred(n) = tg(p).tnumber THEN IF EQ(p) = '1' THEN m := m + 1; END IF; END IF; p := p + 1; 96 state <= lp10; ELSE n := n+1; state <= lp9; END IF; WHEN lp11 => IF n < conv_integer(tg(conv_integer(individual.tasknum(i))-1).numofpred) THEN p := 0; state <= lp12; ELSE IF m /= conv_integer(tg(conv_integer(individual.tasknum(i))1).numofpred) THEN prec := prec - 150 * (numtasks - i -1); END IF; state <= lp6; END IF; WHEN lp12 => IF p < numtasks THEN IF tg(conv_integer(individual.tasknum(i))-1).pred(n) = tg(p).tnumber THEN IF EQ(p) = '1' THEN m := m + 1; END IF; END IF; p := p + 1; state <= lp12; ELSE n := n+1; state <= lp11; END IF; WHEN fm2 => IF chromsize < numtasks THEN data := individual.tasknum(chromsize); addr := conv_std_logic_vector((34*(psizetemp) + chromsize), 14); chromsize := chromsize + 1; reqmc <= '1'; state <= ackmem1_1; ELSE data1 := individual.fit; IF individual.fit > bestfit.fit THEN bestnum := psizetemp; bestfit <= individual; END IF; reqmc <= '1'; state <= ackmem1_2; END IF; 97 WHEN ackmem1_1 => IF ackmc ='1' THEN addrmc <= addr; dataout <= data; reqmc <= '0'; state <= ackmem2; END IF; 
WHEN ackmem1_2 => IF ackmc ='1' THEN IF cnt = '0' THEN addr := conv_std_logic_vector(((32*(psizetemp + 1))+ (psizetemp*2)), 14); addrmc <= addr; dataout <= data1(15 DOWNTO 8); cnt := '1'; flag := '0'; reqmc <= '0'; ELSE addr := conv_std_logic_vector(((32*(psizetemp + 1))+ (psizetemp*2) +1), 14); addrmc <= addr; dataout <= data1(7 DOWNTO 0); reqmc <= '0'; chromsize := 0; sumfit1 := sumfit1 + conv_integer('0' & data1); flag := '1'; END IF; state <= ackmem2; END IF; WHEN ackmem2 => IF ackmc = '0' THEN IF flag = '1' THEN -- checks IF fitness lower byte is written psizetemp := psizetemp +1; IF cntchild = '0' THEN flag := '0'; prec := precfitness; pri := priofitness; cntchild := '1'; state <= fm1; ELSE IF psizetemp < popsize THEN state <= idle; ELSE state <= done1; END IF; END IF; ELSE state <= fm2; END IF; 98 END IF; WHEN done1 => reqmc <= '0'; state <= done2; WHEN done2 => reqmc <= '0'; tog := NOT(tog); old_new <= tog; sumfit2 := sumfit1; state <= done3; WHEN done3 => reqmc <= '0'; ipgdone := '1'; sof <= sumfit2; sumfit1 := 0; psizetemp := 0; chromsize := 0; initGAmod <= '1'; state <= idle; WHEN bestdone => reqmc <= '0'; best_done <= '1'; initGAmod <= '0'; END CASE; END IF; END PROCESS; END behavior; A6. POPULATION SEQUENCER MODULE ENTITY GA_popseq IS PORT( clk : IN STD_LOGIC; -- clock signal initseq : IN STD_LOGIC; -- initiate signal from FM -- Signals to and from MC ackmc : IN STD_LOGIC; -- acknowledge from MC datain : IN chromosome; -- member from MC reqmc : OUT STD_LOGIC; -- request to MC addrmc : OUT INTEGER RANGE 0 to popsize; -- address of member -- Signals to SM chromoout : OUT chromosome -- member to SM); END GA_popseq; 99 ARCHITECTURE behavior OF GA_popseq IS TYPE states IS (idle, getmember, getnext); SIGNAL state: states; BEGIN PROCESS(clk,initseq) variable member : chromosome; variable memaddr: INTEGER RANGE 0 to popsize; BEGIN IF initseq ='0' THEN state <= idle; reqmc <= '0'; addrmc <= 0; ELSIF rising_edge (CLK) THEN CASE state IS WHEN idle => memaddr := 0; reqmc <= '1'; state <= getmember; WHEN getmember => IF ackmc = '1' THEN addrmc <= memaddr; reqmc <= '0'; state <= getnext; END IF; WHEN getnext => IF ackmc = '0' THEN member := datain; chromoout <= member; IF memaddr < (popsize-1) THEN memaddr := memaddr + 1; ELSE memaddr := 0; END IF; reqmc <= '1'; state <= getmember; END IF; END CASE; END IF; END PROCESS; END behavior; 100 A7. 
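The selection module listed next implements roulette-wheel selection: it divides the population's fitness total (supplied by the fitness module as sof) by a random scale value from the RNG to form a threshold, then accumulates the fitness of members streamed in by the population sequencer until the running total exceeds that threshold; the member that crosses the threshold becomes a parent. A minimal sketch of that accumulate-until-threshold rule follows (the package, function name, and integer array type are illustrative, not part of the design):

PACKAGE roulette_pkg IS
    TYPE int_array IS ARRAY (NATURAL RANGE <>) OF NATURAL;
    FUNCTION roulette_pick(fits : int_array; threshold : NATURAL) RETURN NATURAL;
END PACKAGE roulette_pkg;

PACKAGE BODY roulette_pkg IS
    -- Return the index of the first member whose cumulative fitness exceeds the
    -- threshold; with a threshold drawn from [0, sum of fitness), a member is
    -- chosen with probability proportional to its fitness.
    FUNCTION roulette_pick(fits : int_array; threshold : NATURAL) RETURN NATURAL IS
        VARIABLE acc : NATURAL := 0;
    BEGIN
        FOR i IN fits'RANGE LOOP
            acc := acc + fits(i);
            IF acc > threshold THEN
                RETURN i;
            END IF;
        END LOOP;
        RETURN fits'RIGHT; -- fall back to the last member when the threshold is at the upper edge
    END FUNCTION;
END PACKAGE BODY roulette_pkg;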
SELECTION MODULE ENTITY GA_selection IS PORT( clk : IN STD_LOGIC; -- clock signal rand1 : IN STD_LOGIC_VECTOR (width_pop DOWNTO 0); -- random select1 from RNG rand2 : IN STD_LOGIC_VECTOR (width_pop DOWNTO 0); -- random select2 from RNG -- Signals from FM initselect : IN STD_LOGIC; -- initiate SM signal from FM sof : IN integer RANGE 0 TO 10000000; -- sum of fitness from FM -- Signals from SM dup : IN STD_LOGIC; -- duplicate member signal from SM datain : IN chromosome; -- member from SM -- Signals to and from CMM ackxover : IN STD_LOGIC; -- acknowledge from CMM reqxover : OUT STD_LOGIC; -- request to CMM parent1 : OUT chromosome; -- parent 1 to CMM parent2 : OUT chromosome -- parent 2 to CMM ); END GA_selection; ARCHITECTURE behavior OF GA_selection IS TYPE states IS (idle, init, getmembers,ackxover1,ackxover2); SIGNAL state: states; SIGNAL fitness: integer RANGE 0 TO 10000000; SIGNAL scale : STD_LOGIC_VECTOR (width_pop DOWNTO 0); SIGNAL scaledfit: integer RANGE 0 TO 10000000; BEGIN PROCESS(fitness,scale) BEGIN scaledfit <= fitness /(conv_integer('0' & scale)); END PROCESS; PROCESS(clk,initselect) variable member : chromosome; variable memaddr,psizecnt : INTEGER RANGE 0 to popsize; variable sum : integer RANGE 0 TO 10000000; variable donea, doneb : STD_LOGIC; variable first,second : STD_LOGIC := '0'; variable rnd1, rnd2 : STD_LOGIC_VECTOR (width_pop DOWNTO 0); variable scaledfit1,scaledfit2: integer RANGE 0 TO 10000000; variable accum1, accum2 : integer RANGE 0 TO 10000000; variable member1, member2: chromosome; 101 BEGIN IF initselect = '0' THEN state <= idle; sum := 0; reqxover <= '0'; ELSIF rising_edge (CLK) THEN CASE state IS WHEN idle => sum := sof; -- get sum of fitness from FM state <= init; WHEN init => -- initialize all initial variables IF second = '0' THEN IF first = '0' THEN rnd1 := rand1; rnd2 := rand2; fitness <= sum; scale <= rnd1; first := '1'; --get scaled fitness for first individual ELSE scaledfit1 := scaledfit; --scaled fitness for 1 fitness <= sum; scale <= rnd2; first := '0'; second := '1';--get scaled fitness for second individual END IF; ELSE scaledfit2 := scaledfit; --scaled fitness for 2 accum1 := 0; accum2 := 0; donea := '0'; doneb := '0'; first := '0'; second := '0'; state <= getmembers; END IF; WHEN getmembers => IF donea = '0' THEN -- got 1st member IF doneb = '0' THEN -- got 2nd member member1 := datain; member2 := datain; first := '1'; ELSE member1 := datain; first := '1'; END IF; ELSE IF doneb = '0' THEN member2 := datain; 102 first := '1'; ELSE -- IF both members have been selected reqxover <= '1'; state <= ackxover1; END IF; END IF; IF first = '1' THEN --IF both members have not been selected accum1 := accum1 + conv_integer('0' & member1.fit); accum2 := accum2 + conv_integer('0' & member2.fit); first := '0'; END IF; IF accum1 > scaledfit1 THEN donea := '1'; ELSIF accum2 > scaledfit2 THEN doneb := '1'; END IF; WHEN ackxover1 => IF ackxover = '1' THEN parent1 <= member1; parent2 <= member2; reqxover <= '0'; state <= ackxover2; END IF; WHEN ackxover2 => IF ackxover = '0' THEN state <= init; END IF; END CASE; END IF; END PROCESS; END behavior; A8. 
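The crossover and mutation module listed next applies partially matched crossover (PMX) between two random cut points when the crossover probability test passes, and swaps two random task positions when the mutation test passes. A short worked example of the exchange-and-repair rule used in the xover2/xover3 states (positions are 0-based, and the parents and cut points are chosen here purely for illustration): with parents 1 2 3 4 5 6 and 3 5 1 6 2 4 and cut points 2 and 3, swapping position 2 gives 1 2 1 4 5 6 and 3 5 3 6 2 4, and repairing the duplicates maps the outside 1 to 3 and the outside 3 to 1, giving 3 2 1 4 5 6 and 1 5 3 6 2 4; swapping position 3 and repairing likewise yields the children 3 2 1 6 5 4 and 1 5 3 4 2 6, both valid permutations.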
CROSSOVER AND MUTATION MODULE ENTITY GA_xover_mut IS PORT( clk : IN STD_LOGIC; -- clock signal -- Signals from RNG rndm : IN STD_LOGIC_VECTOR (width_prob DOWNTO 0); -- probability of mutation rndx : IN STD_LOGIC_VECTOR (width_prob DOWNTO 0); -- probability of crossover xpoint1 : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0); -crossover point 1 103 xpoint2 mpoint1 mpoint2 : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0); -crossover point 2 : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0); -- mutation point 1 : IN STD_LOGIC_VECTOR(width_numtasks DOWNTO 0); -- mutation point 2 -- Signals to and from SM reqSM : IN STD_LOGIC; -- request from SM parent1 : IN chromosome; -- parent 1 parent2 : IN chromosome; -- parent 2 ackSM : OUT STD_LOGIC; -- acknowledge to SM -- Signals to and from FM initxover : IN STD_LOGIC; -- initiate CMM signal from FM ackFM : IN STD_LOGIC; -- acknowledge from FM reqFM : OUT STD_LOGIC; -- request to FM child1 : OUT chromosome; -- child 1 child2 : OUT chromosome -- child 2); END GA_xover_mut; ARCHITECTURE behavior OF GA_xover_mut IS TYPE states IS (idle, getparents, acksel, xover1, lp1, lp2, lp3, lp4, xover2, xover3, mutation,awaitackfm1,awaitackfm2); SIGNAL state: states; BEGIN PROCESS(clk,initxover) variable c1,c2,p1,p2 : chromosome; variable temp1,temp2 : STD_LOGIC_VECTOR(7 DOWNTO 0); variable chk1, chk2 : STD_LOGIC_VECTOR((numtasks-1) DOWNTO 0); variable i,j,k : integer RANGE 0 TO 32; variable cross1, cross2 : STD_LOGIC_VECTOR(width_numtasks DOWNTO 0); variable mut1, mut2, x1, x2 : STD_LOGIC_VECTOR(width_numtasks DOWNTO 0); variable test1, test2 : STD_LOGIC; BEGIN IF initxover ='0' THEN state <= idle; reqFM <= '0'; ackSM <= '0'; i := 0; j := 0; k := 0; ELSIF rising_edge (CLK) THEN CASE state IS WHEN idle => reqFM <= '0'; ackSM <= '0'; state <= getparents; 104 WHEN getparents => IF reqSM = '1' THEN ackSM <= '1'; x1 := xpoint1; x2 := xpoint2; mut1 := mpoint1; mut2 := mpoint2; state <= acksel; END IF; WHEN acksel => IF reqSM = '0' THEN p1 := parent1; p2 := parent2; IF x1 <= x2 THEN IF x1 = x2 THEN cross1 := x1; cross2 := x2; state <= xover1; ELSE cross1 := x1; cross2 := x2; state <= xover1; END IF; ELSE cross1 := x2; cross2 := x1; state <= xover1; END IF; END IF; WHEN xover1 => ackSM <= '0'; chk1 := conv_std_logic_vector(0,numtasks); chk2 := conv_std_logic_vector(0,numtasks); test1 := '0'; test2 := '0'; IF rndx < probx THEN -- perfrom crossover IF cross1 = cross2 THEN -- simple crossover i := 0; state <= lp1; ELSE -- PMX crossover c1 := p1; c2 := p2; c1.fit := conv_std_logic_vector(50000,16); c2.fit := conv_std_logic_vector(50000,16); j := conv_integer('0' & cross1); state <= xover2; END IF; ELSE -- do not perform crossover c1 := p1; 105 c2 := p2; c1.fit := conv_std_logic_vector(50000,16); c2.fit := conv_std_logic_vector(50000,16); state <= mutation; END IF; WHEN lp1 => -- copy tasks from parent to child before crossover point IF i < conv_integer('0' & cross1) THEN c1.tasknum(i) := p1.tasknum(i); chk1(((conv_integer('0' & p1.tasknum(i)))-1)) := '1'; c2.tasknum(i) := p2.tasknum(i); chk2(((conv_integer('0' & p2.tasknum(i)))-1)) := '1'; i := i + 1; state <= lp1; ELSE i := conv_integer('0' & cross1); j := 0; state <= lp2; END IF; WHEN lp2 => -- fill tasks after xover point not encountered be4 IF i < numtasks THEN IF chk1(((conv_integer('0' & p2.tasknum(i)))-1)) = '1' THEN c1.tasknum(i) := conv_std_logic_vector(0,8); ELSE c1.tasknum(i) := p2.tasknum(i); chk1(((conv_integer('0' & p2.tasknum(i)))-1)) := '1'; END IF; IF chk2(((conv_integer('0' & p1.tasknum(i)))-1)) = '1' THEN 
c2.tasknum(i) := conv_std_logic_vector(0,8); ELSE c2.tasknum(i) := p1.tasknum(i); chk2((conv_integer('0' & p1.tasknum(i))-1)) := '1'; END IF; i := i + 1; state <= lp2; ELSE i := conv_integer('0' & cross1); j := 0; state <= lp3; END IF; WHEN lp3 => IF i < numtasks THEN IF c1.tasknum(i) = "00000000" THEN IF j < numtasks THEN IF chk1(j) = '0' THEN c1.tasknum(i) := conv_std_logic_vector((j + 1),8); chk1(j) := '1'; state <= lp3; 106 ELSE j := j + 1; state <= lp3; END IF; ELSE j := 0; state <= lp4; END IF; ELSE j := 0; state <= lp4; END IF; ELSE state <= mutation; END IF; WHEN lp4 => IF c2.tasknum(i) = "00000000" THEN IF j < numtasks THEN IF chk2(j) = '0' THEN c2.tasknum(i) := conv_std_logic_vector((j + 1),8); chk2(j) := '1'; state <= lp4; ELSE j := j + 1; state <= lp4; END IF; ELSE i := i + 1; j := 0; state <= lp3; END IF; ELSE i := i + 1; j := 0; state <= lp3; END IF; WHEN xover2 => IF j <= conv_integer('0'& cross2) THEN temp1 := c1.tasknum(j); c1.tasknum(j) := c2.tasknum(j); c2.tasknum(j) := temp1; k := 0; state <= xover3; ELSE state <= mutation; END IF; 107 WHEN xover3 => IF k < numtasks THEN IF k /= j THEN IF c1.tasknum(j) = c1.tasknum(k) THEN c1.tasknum(k) := c2.tasknum(j); END IF; IF c2.tasknum(j) = c2.tasknum(k) THEN c2.tasknum(k) := c1.tasknum(j); END IF; END IF; k := k +1; ELSE j := j+1; state <= xover2; END IF; WHEN mutation => IF rndm < probm THEN IF mut1 = mut2 THEN reqFM <= '1'; state <= awaitackfm1; ELSE temp1 := c1.tasknum((conv_integer('0' & mut1))); temp2 := c2.tasknum((conv_integer('0' & mut2))); c1.tasknum(conv_integer('0' & mut1)) := c1.tasknum(( conv_integer ('0' & mut2))); c2.tasknum(conv_integer('0' & mut2)) := c2.tasknum((conv_integer ('0' & mut1))); c1.tasknum((conv_integer('0' & mut2))):= temp1; c2.tasknum((conv_integer('0' & mut1))):= temp2 ; reqFM <= '1'; state <= awaitackfm1; END IF; ELSE reqFM <= '1'; state <= awaitackfm1; END IF; WHEN awaitackfm1 => IF ackFM = '1' THEN child1 <= c1; child2 <= c2; reqFM <= '0'; state <= awaitackfm2; END IF; WHEN awaitackfm2 => IF ackFM = '0' THEN state <= getparents; END IF; 108 END CASE; END IF; END PROCESS; END behavior; A9. 
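The PHGA fitness module listed next differs from the HGA version above mainly in that the per-chromosome evaluation loop is duplicated into two independent clocked processes (statefm1 and statefm2), so the two children produced by each crossover can be scored concurrently: the main process hands a chromosome to each evaluator through ch1m2f/ch2m2f, raises st1/st2 to start them, and collects the results from ch1f2m/ch2f2m once dn1/dn2 assert. A minimal sketch of that start/done handshake between a dispatcher and a single evaluator is shown below (the handshake signal names follow the listing, but the entity and the integer payload are simplified placeholders):

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY handshake_demo IS PORT(
    clk, reset : IN  STD_LOGIC;
    job_in     : IN  INTEGER;
    result_out : OUT INTEGER);
END handshake_demo;

ARCHITECTURE behavior OF handshake_demo IS
    SIGNAL st1, dn1    : STD_LOGIC := '0';
    SIGNAL job, result : INTEGER := 0;
BEGIN
    dispatcher: PROCESS(clk, reset)
    BEGIN
        IF reset = '0' THEN
            st1 <= '0';
        ELSIF rising_edge(clk) THEN
            IF st1 = '0' THEN
                job <= job_in;        -- hand the work item to the evaluator
                st1 <= '1';           -- start it
            ELSIF dn1 = '1' THEN
                result_out <= result; -- collect the result
                st1 <= '0';           -- releasing start lets the evaluator rearm
            END IF;
        END IF;
    END PROCESS;

    evaluator: PROCESS(clk, st1)
    BEGIN
        IF st1 = '0' THEN
            dn1 <= '0';               -- idle until started, as in the listing
        ELSIF rising_edge(clk) THEN
            result <= job + 1;        -- placeholder for the real fitness computation
            dn1 <= '1';
        END IF;
    END PROCESS;
END behavior;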
FITNESS MODULES FOR PHGA ENTITY GA_Fitness IS PORT( clk : IN STD_LOGIC; -- clock signal -- Signals to and from RTG initfm : IN STD_LOGIC; -- start FM tasktablein : IN taskgraph; -- task graph table -- Signals to and from IPG reqipg : IN STD_LOGIC; -- request from IPG dataipg : IN chromosome; -- data from IPG ackipg : OUT STD_LOGIC; -- acknowledge IPG -- Signals to and from MC ackmc : IN STD_LOGIC; -- acknowledge from MC reqmc : OUT STD_LOGIC; -- request to MC addrmc : OUT STD_LOGIC_VECTOR(13 DOWNTO 0); -- address of data dataout : OUT STD_LOGIC_VECTOR(7 DOWNTO 0); -- data old_new : OUT STD_LOGIC; -- old/new population toggle best_done : OUT STD_LOGIC; -- best schedule obtained -- Signals to and from CMM reqxover : IN STD_LOGIC; -- request from CMM child1 : IN chromosome; -- child1 from CMM child2 : IN chromosome; -- child2 from CMM ackxover : OUT STD_LOGIC; -- acknowledge to CMM -- Signals to and from SM sof : OUT integer RANGE 0 TO 10000000; -- Sum of fitness to SM initGAmod : OUT STD_LOGIC -- Start PS,SM and CMM modules ); END GA_Fitness; ARCHITECTURE behavior OF GA_Fitness IS TYPE states IS (idle0, idle1, idle, ipgfm1, ipgfm2, xoverfm1, xoverfm2, stfm, prefm2, fm2, ackmem1_1,ackmem1_2,ackmem2, done1, done2, done3, bestdone); TYPE statesfm IS (lp0, lp1,lp2,lp3,lp4,lp5,lp6,lp7,lp8,lp9,lp10,lp11,lp12,donefm); SIGNAL state SIGNAL statefm1,statefm2 SIGNAL bestfit SIGNAL best : states; : statesfm; : chromosome; : STD_LOGIC; 109 SIGNAL ch1m2f, ch2m2f, ch1f2m, ch2f2m : chromosome; SIGNAL tg : taskgraph; SIGNAL st1, st2, dn1, dn2 : STD_LOGIC; BEGIN PROCESS(bestfit.fit) BEGIN IF bestfit.fit = "1100001101010000" THEN best <= '1'; ELSE best <= '0'; END IF; END PROCESS; PROCESS(clk,st1) variable chromo1 variable tg1 variable RQ variable EQ variable i,j,k,l,m,n,p,rqc variable pri, prec : chromosome; : taskgraph; : taskn; : STD_LOGIC_VECTOR((numtasks-1) downto 0); : integer RANGE 0 TO 32; : integer RANGE 0 TO 45000; BEGIN IF st1 = '0' THEN dn1 <= '0'; statefm1 <= lp0; ELSIF rising_edge(CLK) THEN CASE statefm1 IS WHEN lp0 => i := 0; rqc := 0; chromo1 := ch1m2f; statefm1 <= lp1; WHEN lp1 => prec := precfitness; pri := priofitness; IF i <numtasks THEN RQ(i) := "00000000"; EQ(i) := '0'; IF tg(i).numofpred = 0 THEN RQ(rqc) := tg(i).tnumber; rqc := rqc + 1; END IF; i := i+1; statefm1 <= lp1; ELSE i := 0; 110 statefm1 <= lp2; END IF; WHEN lp2 => IF i < numtasks THEN j := 0; statefm1 <= lp3; ELSE IF prec /= precfitness THEN IF prec > 0 THEN IF prec < priofitness THEN chromo1.fit := conv_std_logic_vector((maxfitness/1000), 16); ELSE IF pri < 0 THEN chromo1.fit := conv_std_logic_vector((prec + pri),16); ELSE chromo1.fit := conv_std_logic_vector((prec + pri),16); END IF; END IF; ELSE chromo1.fit := conv_std_logic_vector((maxfitness/1000),16); END IF; ELSE chromo1.fit := conv_std_logic_vector((prec + pri),16); END IF; ch1f2m <= chromo1; statefm1 <= donefm; END IF; WHEN lp3 => IF j < rqc THEN IF chromo1.tasknum(i) = RQ(j) THEN k := 0; statefm1 <= lp4; ELSE IF j = (rqc-1) THEN EQ(conv_integer(chromo1.tasknum(i))-1) := '1'; j := numtasks; m := 0; n := 0; statefm1 <= lp11; ELSE j := j+1; statefm1 <= lp3; END IF; END IF; ELSE i := i + 1; statefm1 <= lp2; END IF; 111 WHEN lp4 => IF k < rqc THEN IF RQ(k) /= "00000000" THEN l := 0; statefm1 <= lp5; ELSE k := k+1; statefm1 <= lp4; END IF; ELSE EQ(conv_integer(chromo1.tasknum(i))-1) := '1'; RQ(j) := "00000000"; j := numtasks; statefm1 <= lp6; END IF; WHEN lp5 => IF l < numtasks THEN IF RQ(k) = tg1(l).tnumber THEN IF tg(conv_integer(chromo1.tasknum(i))-1).tpriority < 
tg(l).tpriority THEN pri := pri - 50 * (numtasks - i - 1); END IF; END IF; l := l + 1; statefm1 <= lp5; ELSE k := k + 1; statefm1 <= lp4; END IF; WHEN lp6 => IF conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofsuc) > 0 THEN k := 0; statefm1 <= lp7; ELSE j := j+1; statefm1 <= lp3; END IF; WHEN lp7 => IF k < conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofsuc) THEN l := 0; statefm1 <= lp8; ELSE j := j+1; statefm1 <= lp3; END IF; 112 WHEN lp8 => IF l < numtasks THEN IF tg(conv_integer(chromo1.tasknum(i))-1).suc(k) = tg(l).tnumber THEN IF EQ(l) = '0' THEN IF conv_integer(tg(l).numofpred) > 1 THEN m := 0; n := 0; statefm1 <= lp9; ELSE RQ(rqc) := tg(l).tnumber; IF rqc = (numtasks - 1) THEN rqc := 0; ELSE rqc := rqc+1; END IF; END IF; END IF; END IF; l := l+1; statefm1 <= lp8; ELSE k := k + 1; statefm1 <= lp7; END IF; WHEN lp9 => IF n < conv_integer(tg(l).numofpred) THEN p := 0; statefm1 <= lp10; ELSE IF m = conv_integer(tg(l).numofpred) THEN RQ(rqc) := tg(m).tnumber; IF rqc = (numtasks) THEN rqc := 0; ELSE rqc := rqc + 1; END IF; END IF; l := l+1; statefm1 <= lp8; END IF; WHEN lp10 => IF p < numtasks THEN IF tg(l).pred(n) = tg(p).tnumber THEN IF EQ(p) = '1' THEN m := m + 1; END IF; END IF; p := p + 1; 113 statefm1 <= lp10; ELSE n := n+1; statefm1 <= lp9; END IF; WHEN lp11 => IF n < conv_integer(tg(conv_integer(chromo1.tasknum(i))-1).numofpred) THEN p := 0; statefm1 <= lp12; ELSE IF m /= conv_integer(tg(conv_integer(chromo1.tasknum(i))1).numofpred) THEN prec := prec - 150 * (numtasks - i -1); END IF; statefm1 <= lp6; END IF; WHEN lp12 => IF p < numtasks THEN IF tg(conv_integer(chromo1.tasknum(i))-1).pred(n) = tg(p).tnumber THEN IF EQ(p) = '1' THEN m := m + 1; END IF; END IF; p := p + 1; statefm1 <= lp12; ELSE n := n+1; statefm1 <= lp11; END IF; WHEN donefm => dn1 <= '1'; END CASE; END IF; END PROCESS; PROCESS(clk,st2) variable chromo2 : chromosome; variable RQ : taskn; variable EQ : STD_LOGIC_VECTOR((numtasks-1) downto 0); variable i, j, k, l, m, n, p, rqc : integer RANGE 0 TO 32; variable pri, prec : integer RANGE 0 TO 45000; BEGIN IF st2 = '0' THEN dn2 <= '0'; 114 statefm2 <= lp0; ELSIF rising_edge(CLK) THEN CASE statefm2 IS WHEN lp0 => i := 0; rqc := 0; chromo2 := ch2m2f; statefm2 <= lp1; WHEN lp1 => prec := precfitness; pri := priofitness; IF i <numtasks THEN RQ(i) := "00000000"; EQ(i) := '0'; IF tg(i).numofpred = 0 THEN RQ(rqc) := tg(i).tnumber; rqc := rqc + 1; END IF; i := i+1; statefm2 <= lp1; ELSE i := 0; statefm2 <= lp2; END IF; WHEN lp2 => IF i < numtasks THEN j := 0; statefm2 <= lp3; ELSE IF prec /= precfitness THEN IF prec > 0 THEN IF prec < priofitness THEN chromo2.fit := conv_std_logic_vector((maxfitness/1000), 16); ELSE IF pri < 0 THEN chromo2.fit := conv_std_logic_vector((prec + pri),16); ELSE chromo2.fit := conv_std_logic_vector((prec + pri),16); END IF; END IF; ELSE chromo2.fit := conv_std_logic_vector((maxfitness/1000),16); END IF; ELSE chromo2.fit := conv_std_logic_vector((prec + pri),16); END IF; ch2f2m <= chromo2; 115 statefm2 <= donefm; END IF; WHEN lp3 => IF j < rqc THEN IF chromo2.tasknum(i) = RQ(j) THEN k := 0; statefm2 <= lp4; ELSE IF j = (rqc-1) THEN EQ(conv_integer(chromo2.tasknum(i))-1) := '1'; j := numtasks; m := 0; n := 0; statefm2 <= lp11; ELSE j := j+1; statefm2 <= lp3; END IF; END IF; ELSE i := i + 1; statefm2 <= lp2; END IF; WHEN lp4 => IF k < rqc THEN IF RQ(k) /= "00000000" THEN l := 0; statefm2 <= lp5; ELSE k := k+1; statefm2 <= lp4; END IF; ELSE EQ(conv_integer(chromo2.tasknum(i))-1) := '1'; RQ(j) := "00000000"; j := numtasks; statefm2 <= 
lp6; END IF; WHEN lp5 => IF l < numtasks THEN IF RQ(k) = tg(l).tnumber THEN IF tg(conv_integer(chromo2.tasknum(i))-1).tpriority < tg(l).tpriority THEN pri := pri - 50 * (numtasks - i - 1); END IF; END IF; l := l + 1; statefm2 <= lp5; 116 ELSE k := k + 1; statefm2 <= lp4; END IF; WHEN lp6 => IF conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofsuc) > 0 THEN k := 0; statefm2 <= lp7; ELSE j := j+1; statefm2 <= lp3; END IF; WHEN lp7 => IF k < conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofsuc) THEN l := 0; statefm2 <= lp8; ELSE j := j+1; statefm2 <= lp3; END IF; WHEN lp8 => IF l < numtasks THEN IF tg(conv_integer(chromo2.tasknum(i))-1).suc(k) = tg(l).tnumber THEN IF EQ(l) = '0' THEN IF conv_integer(tg(l).numofpred) > 1 THEN m := 0; n := 0; statefm2 <= lp9; ELSE RQ(rqc) := tg(l).tnumber; IF rqc = (numtasks - 1) THEN rqc := 0; ELSE rqc := rqc+1; END IF; END IF; END IF; END IF; l := l+1; statefm2 <= lp8; ELSE k := k + 1; statefm2 <= lp7; END IF; 117 WHEN lp9 => IF n < conv_integer(tg(l).numofpred) THEN p := 0; statefm2 <= lp10; ELSE IF m = conv_integer(tg(l).numofpred) THEN RQ(rqc) := tg(m).tnumber; IF rqc = (numtasks) THEN rqc := 0; ELSE rqc := rqc + 1; END IF; END IF; l := l+1; statefm2 <= lp8; END IF; WHEN lp10 => IF p < numtasks THEN IF tg(l).pred(n) = tg(p).tnumber THEN IF EQ(p) = '1' THEN m := m + 1; END IF; END IF; p := p + 1; statefm2 <= lp10; ELSE n := n+1; statefm2 <= lp9; END IF; WHEN lp11 => IF n < conv_integer(tg(conv_integer(chromo2.tasknum(i))-1).numofpred) THEN p := 0; statefm2 <= lp12; ELSE IF m /= conv_integer(tg(conv_integer(chromo2.tasknum(i))1).numofpred) THEN prec := prec - 150 * (numtasks - i -1); END IF; statefm2 <= lp6; END IF; WHEN lp12 => IF p < numtasks THEN IF tg(conv_integer(chromo2.tasknum(i))-1).pred(n) = tg(p).tnumber THEN IF EQ(p) = '1' THEN m := m + 1; END IF; 118 END IF; p := p + 1; statefm2 <= lp12; ELSE n := n+1; statefm2 <= lp11; END IF; WHEN donefm => dn2 <= '1'; END CASE; END IF; END PROCESS; PROCESS(clk,initfm,best) variable individual : chromosome; variable tog : STD_LOGIC ; variable psizetemp : integer RANGE 0 TO 200; variable chromsize,i : integer RANGE 0 TO 32; variable addr : STD_LOGIC_VECTOR(13 DOWNTO 0); variable data : STD_LOGIC_VECTOR(7 DOWNTO 0); variable data1 : STD_LOGIC_VECTOR(15 DOWNTO 0); variable flag,ipgdone : STD_LOGIC; variable cnt,cntchild : STD_LOGIC; variable sumfit1, sumfit2 : integer RANGE 0 TO 10000000; variable bestnum : integer RANGE 0 TO 200; BEGIN IF initfm = '0' THEN reqmc <= '0'; ackipg <= '0'; best_done <= '0'; bestfit.fit <= "0000000000000000"; st1 <= '0'; st2 <= '0'; psizetemp := 0; chromsize := 0; tog := '0'; old_new <= tog; flag := '0'; cnt := '0'; cntchild := '0'; initGAmod <= '0'; ipgdone := '0'; state <= idle0; ELSIF rising_edge(CLK) THEN CASE state IS WHEN idle0 => tg <= tasktablein; 119 state <= idle1; WHEN idle1 => psizetemp := 0; chromsize := 0; ipgdone := '0'; sumfit1 := 0; sumfit2 := 0; state <= idle; WHEN idle => flag := '0'; cntchild := '0'; IF best = '1' THEN state <= bestdone; ELSIF ipgdone = '0' THEN IF reqipg = '1' THEN IF psizetemp < popsize THEN ackipg <= '1'; state <= ipgfm1; END IF; END IF; ELSE IF reqxover = '1' THEN IF psizetemp < popsize THEN ackxover <= '1'; state <= xoverfm1; END IF; END IF; END IF; WHEN ipgfm1 => flag := '0'; cnt := '0'; IF reqipg = '0' THEN cntchild := '1'; ch2m2f <= dataipg; st2 <= '1'; state <= ipgfm2; END IF; WHEN ipgfm2 => flag := '0'; cnt := '0'; IF reqipg = '0' THEN cntchild := '1'; ch2m2f <= dataipg; st2 <= '1'; state <= stfm; END IF; 120 WHEN xoverfm1 
=> flag := '0'; cnt := '0'; IF reqxover = '0' THEN ch1m2f <= child1; ch2m2f <= child2; st1 <= '1'; st2 <= '1'; state <= xoverfm2; END IF; WHEN xoverfm2 => flag := '0'; cnt := '0'; IF reqxover = '0' THEN ch1m2f <= child1; ch2m2f <= child2; st1 <= '1'; st2 <= '1'; state <= stfm; END IF; WHEN stfm => ackipg <= '0'; ackxover <= '0'; IF cntchild = '1' THEN IF psizetemp = (popsize-1) THEN initGAmod <= '0'; END IF; ELSE IF psizetemp = (popsize-2) THEN initGAmod <= '0'; END IF; END IF; state <= prefm2; WHEN prefm2 => chromsize := 0; IF dn1 = '1' THEN IF dn2 = '1' THEN individual := ch2f2m; st2 <= '0'; state <= fm2; ELSE individual := ch1f2m; st1 <= '0'; 121 state <= fm2; END IF; ELSIF dn2 = '1' THEN individual := ch2f2m; st2 <= '0'; state <= fm2; END IF; WHEN fm2 => IF chromsize < numtasks THEN data := individual.tasknum(chromsize); addr := conv_std_logic_vector((34*(psizetemp) + chromsize), 14); chromsize := chromsize + 1; reqmc <= '1'; state <= ackmem1_1; ELSE data1 := individual.fit; IF individual.fit > bestfit.fit THEN bestnum := psizetemp; bestfit <= individual; END IF; reqmc <= '1'; state <= ackmem1_2; END IF; WHEN ackmem1_1 => IF ackmc ='1' THEN addrmc <= addr; dataout <= data; reqmc <= '0'; state <= ackmem2; END IF; WHEN ackmem1_2 => IF ackmc ='1' THEN IF cnt = '0' THEN addr := conv_std_logic_vector(((32*(psizetemp + 1))+ (psizetemp*2)), 14); addrmc <= addr; dataout <= data1(15 DOWNTO 8); cnt := '1'; flag := '0'; reqmc <= '0'; ELSE addr := conv_std_logic_vector(((32*(psizetemp + 1))+ (psizetemp*2)+1), 14); addrmc <= addr; dataout <= data1(7 DOWNTO 0); reqmc <= '0'; chromsize := 0; sumfit1 := sumfit1 + conv_integer('0' & data1); 122 flag := '1'; END IF; state <= ackmem2; END IF; WHEN ackmem2 => IF ackmc = '0' THEN IF flag = '1' THEN -- checks IF fitness lower byte is written psizetemp := psizetemp +1; IF dn1 = '0' THEN IF dn2 = '0' THEN IF psizetemp < popsize THEN state <= idle; ELSE state <= done1; END IF; ELSE state <= prefm2; END IF; ELSIF dn2 = '0' THEN state <= prefm2; END IF; ELSE state <= fm2; END IF; END IF; WHEN done1 => reqmc <= '0'; state <= done2; WHEN done2 => reqmc <= '0'; tog := NOT(tog); old_new <= tog; sumfit2 := sumfit1; state <= done3; WHEN done3 => reqmc <= '0'; ipgdone := '1'; sof <= sumfit2; sumfit1 := 0; psizetemp := 0; chromsize := 0; initGAmod <= '1'; state <= idle; WHEN bestdone => reqmc <= '0'; best_done <= '1'; 123 initGAmod <= '0'; END CASE; END IF; END PROCESS; END behavior;