Evolution of Complex Behavior Controllers using Genetic Algorithms

Kerry A. Gruber
Computer Science Dept., University of Nevada, Reno, Reno, NV 89557
kkgruber@ix.netcom.com, http://www.cs.unr.edu/~gruber

Jason Baurick
Computer Science Dept., University of Nevada, Reno, Reno, NV 89557

Sushil J. Louis
Genetic Algorithm Systems Lab, Computer Science Dept., University of Nevada, Reno, Reno, NV 89557
sushil@cs.unr.edu

Abstract

Development of robotic controllers for complex behavioral tasks is difficult when the exact nature or complexity of an environment is not known in advance. This paper explores the use of genetic algorithms to evolve neural network controllers that exhibit generalized complex behavior. We compare the performance of the evolved controllers to controllers developed by human programmers.

Keywords: Genetic Algorithms, Neural Networks, Simulated Robotics.

1 INTRODUCTION

Lack of knowledge about the actual operating environment at design time makes it difficult to design controllers that perform reliably under changing environmental conditions. This is especially true when the operating environment is expected to be dynamic, or when exact specifications, such as the topology, are unknown. For a controller to perform reliably under such conditions, it must be able to adapt to its environment, or it must be able to generalize complex behaviors applicable to all possible conditions.

Artificial neural networks are an increasingly popular alternative to traditional control strategies, particularly when the exact mechanics of a problem are difficult to define or are non-linear in nature. Neural networks are capable of learning very high-order statistical correlations present in a training environment and, importantly, they provide a powerful and attractive mechanism for generalizing behavior to new environments (Nolfi et al., 1990). However, neural networks are limited by the methods used in training. Traditional approaches to training neural networks are based on a supervised model, using techniques such as backpropagation or conjugate gradient descent. The supervised model relies on a set of examples to provide a mapping from input to output. For the simulation used in this paper, it is possible to design a reaction-based neural network controller under the supervised model, based on desired reactions to specific states. This would, in effect, be the same as programming the controller with a standard rule set. Such rule sets are limited by the programmer's understanding of the environment and the anticipated consequences of predefined reactions.

Neuro-evolution, the use of genetic algorithms (GAs) to train or build neural networks, has been shown to be a promising alternative to classical techniques for the design of complex behavioral systems (Nolfi et al., 1990; Yamauchi and Beer, 1994; Moriarty and Miikkulainen, 1997). The Genetic Algorithm Training of Robot Simulations (GATORS) system uses a GA to train the weights of a neural network that controls the motor functions of robots in a simulated environment. Fitness within the system is based on performance of complex behavioral tasks. Wall following, vacuum cleaning, obstacle avoidance, and prey-capture scenarios are examples of such complex behavior because they are not tied to specific spatial locations (Gomez and Miikkulainen, 1997). A demonstration version of the system is available on the web at http://www.cs.unr.edu/~gruber.
The body of this paper is organized as follows: The next section provides a short introduction to artificial neural networks. (Readers already familiar with this topic are invited to skip to Section 3.) Section 3 describes the simulated environment and agents. The methods employed by the GATORS system are discussed in Section 4. Results of our methods are provided in Section 5. A summary of the conclusions reached is given in Section 6. In Section 7 we outline possible future work related to the GATORS system.

2 BACKGROUND

2.1 NEURAL NETWORKS

A neural network is an artificial representation of the connections and functions occurring within the brain. Artificial neural networks do not approach the complexity of the human brain, but the two share two key similarities. First, the building blocks of both networks are simple computational devices that are highly connected. Second, the connections between neurons determine the function of the network (Hagan et al., 1996).

An artificial neural network is a group of simple interconnected neurons, normally divided into layers (refer to Figure 1). In its simplest form, each neuron consists of a series of weighted connections to the previous layer, a bias input, and an output that feeds the next layer. Each connection has a "weight" associated with it, which is multiplied with the input arriving from the previous layer. The output is simply the sum of the bias and the weighted inputs, passed through an activation function. For a simple two-input neuron, the inputs, weights, and bias form the equation of a line, and the output of the neuron depends on whether the input point lies above or below that line. Each neuron therefore divides the two-dimensional input space in half. This concept extends to N-dimensional space: each neuron separates the input space in half, and the network acts as an effective classifier.

[Figure 1: Artificial Neural Network — input layer, hidden layer, and output layer connected by weights.]
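As a concrete illustration of the neuron model described above, the following minimal Java sketch computes a single neuron's output: the weighted sum of its inputs plus a bias, passed through a sigmoid activation. The sketch is ours, not taken from the GATORS source; all class and method names are illustrative.

```java
// Minimal sketch of the neuron model described above (illustrative only;
// class and method names are our own, not from the GATORS source).
public class Neuron {
    private final double[] weights; // one weight per input connection
    private final double bias;      // bias added to the weighted sum

    public Neuron(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    // Standard sigmoid activation squashes the net input into (0, 1).
    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Output = sigmoid(bias + sum of inputs * weights).
    public double activate(double[] inputs) {
        double net = bias;
        for (int i = 0; i < weights.length; i++) {
            net += weights[i] * inputs[i];
        }
        return sigmoid(net);
    }

    public static void main(String[] args) {
        // A two-input neuron: the weights and bias define a line that
        // splits the input plane in half, as discussed above.
        Neuron n = new Neuron(new double[] {1.0, -1.0}, 0.0);
        System.out.println(n.activate(new double[] {0.9, 0.1})); // point above the line
        System.out.println(n.activate(new double[] {0.1, 0.9})); // point below the line
    }
}
```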
3 SIMULATION

3.1 DESCRIPTION

The simulation is based on those used by Louis, Sun, and Li (Louis and Sun, 1996; Louis and Li, 1996). The simulator is primarily designed to develop and test robots at "vacuum cleaning": covering as large an area as possible in the time allotted. The simulation is designed so that a number of complex behaviors are required in order to achieve acceptable results. Each robot is equipped with a battery that drains as the robot moves. Prey (food) in the simulation emit "sound", and when robots move close enough, the prey are "eaten", recharging the robots' batteries. Once eaten, a prey item cannot be re-eaten for a set amount of time.

Obstacles are randomly placed within the environment for the robots to avoid or negotiate in order to find and capture prey. This increases the complexity of the control task: a controller should not cover the same location over and over, particularly in tight spots where robots tend to bounce between close walls. While negotiating obstacles, a robot may "crash" into one. This is highly undesirable: it causes the robot to stop moving, and it corresponds to a real-world robot hitting an obstacle and possibly suffering damage. The robots must learn to negotiate obstacles by coming close enough to sense them while keeping the number of crashes low.

Table 1 outlines the variables related to the simulation. These values are randomly set prior to a simulation and do not change during the simulation.

Table 1: Summary of Environment Variables

Variable             | Min. Value | Max. Value
Number of Obs.       | 3          | 10
Obs. Area Coverage   | 5%         | 10%
Ear Length           | 1          | 10
Hearing Range        | 20         | 200
Number of Prey       | 1          | 4
Prey Length          | 1          | 10
Prey Sleep Time      | 100        | 1000

3.2 ROBOTS

There are two types of robots in the simulations: predators and prey. In this study, prey do not move, and we are only interested in evolving controllers for predators. Each predator is equipped with a set of seven sensors (refer to Figure 2). Five of the sensors are binary touch sensors, and the remaining two are hearing sensors that return real values.

The touch sensors signal contact with an obstacle and provide obstacle location information. The five binary touch sensors are located at the front, rear, and center of the robot (refer to Figure 2). When the center touch sensor (also called the crash sensor) touches an obstacle, a crash is registered against the robot.

The hearing sensors determine the approximate direction of prey in relation to a robot's current position. They are located on the right and left sides of the robot (refer to Figure 2). The lengths of the hearing sensors vary from 1 to 10 and are randomly set during training. Prey emit sound whose magnitude, as received at the hearing sensors, is proportional to the hearing range of the robot and inversely proportional to the square of the distance to the prey. Sound is cut off outside a robot's specified hearing range. The total sound level received is the sum of the emissions of all prey. Hearing ranges vary from 20 to 200 and are set randomly during training.

[Figure 2: Robot Sensor Locations]

Each robot is equipped with a pair of motors, one on each side. As robots move in the environment, their battery levels decrease in accordance with the distance moved. The maximum move distance during a single step is 4 units in the forward direction and 2 units in reverse. The change in heading angle is proportional to the difference in actuator levels and is limited to a maximum range of ±30°.

3.3 ENVIRONMENT

The environment is a spatially independent grid that may be varied in size from 200x200 to 1000x1000. For testing and training purposes we use a size of 300x300. The simulation randomly places rectangular obstacles into the environment for each generation. The number of obstacles ranges from 3 to 10. The obstacle dimensions are also randomly generated for each generation. The maximum area a single obstacle may cover is limited to 10% of the total environment area, and the minimum to 5%. This assures that obstacles never completely cover the operating region. The environment also maintains a minimum border of 5 units between the environment edge and all obstacles. The border is small enough that a robot cannot fit between the environment's outer edge and an obstacle without sensing one or the other. Figure 3 displays a random environment generated by the simulation.

[Figure 3: Randomly Generated Environment, showing predators, prey, and obstacles.]
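The environment rules just described (3 to 10 rectangular obstacles, each covering between 5% and 10% of the area, none closer than 5 units to the edge) can be sketched as below. This is our own illustrative reconstruction, not the GATORS code; obstacle overlap handling, prey placement, and the particular way dimensions are drawn are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the environment rules in Section 3.3; names and
// structure are our own, not the GATORS implementation. Obstacle overlap
// and prey/predator placement are omitted for brevity.
public class EnvironmentSketch {
    static final int SIZE = 300;   // training environment is 300x300
    static final int BORDER = 5;   // minimum gap between obstacles and the edge

    static class Obstacle {
        final int x, y, w, h;
        Obstacle(int x, int y, int w, int h) { this.x = x; this.y = y; this.w = w; this.h = h; }
    }

    static List<Obstacle> randomObstacles(Random rng) {
        int count = 3 + rng.nextInt(8);            // 3 to 10 obstacles
        double area = (double) SIZE * SIZE;
        List<Obstacle> obstacles = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // Pick a target area between 5% and 10% of the environment,
            // then a random aspect ratio roughly consistent with that area.
            double targetArea = (0.05 + 0.05 * rng.nextDouble()) * area;
            int w = (int) Math.round(Math.sqrt(targetArea) * (0.5 + rng.nextDouble()));
            w = Math.min(w, SIZE - 2 * BORDER);
            int h = Math.min((int) Math.round(targetArea / w), SIZE - 2 * BORDER);
            // Place the obstacle so it stays at least BORDER units from every edge.
            int x = BORDER + rng.nextInt(SIZE - 2 * BORDER - w + 1);
            int y = BORDER + rng.nextInt(SIZE - 2 * BORDER - h + 1);
            obstacles.add(new Obstacle(x, y, w, h));
        }
        return obstacles;
    }

    public static void main(String[] args) {
        for (Obstacle o : randomObstacles(new Random()))
            System.out.printf("obstacle at (%d,%d) size %dx%d%n", o.x, o.y, o.w, o.h);
    }
}
```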
3.4 SIMULATION PROCESS

Our simulations run for 1000 time steps. During a time step, each robot is given a turn to move; the movement order is randomized at each step. The simulation supplies each controller with the current sensor state, made up of the battery level, the hearing sensor magnitudes, and the touch sensor states. No absolute or relative location information is supplied by the simulation.

The simulation moves the robots based on the controller output levels, current position, current heading, and interaction with any obstacles. Angular change is applied prior to displacement. If the crash sensor of a robot crosses the boundary of an obstacle or the environment edge, the robot is placed just inside the point of intersection and a crash is registered against the robot.

Noise may also be present in the simulation, and a noise-level bias determines the overall effect of noise in the system. The bias may be varied from 25 to 75. The noise function returns a value of ±1, depending on whether a random number modulo 100 is above or below the bias level. The value returned is added to the change in robot heading and distance. Bias levels below 50 result in an overall tendency to move left and forward, while levels above 50 result in a tendency to move backward and right. As the value moves farther from the central value of 50, the effect of noise becomes more pronounced.

If a crash is registered, the simulation still decreases the battery level of the robot as if the full move had occurred. If the battery level of a robot reaches 0, the robot is considered "dead" and is no longer allowed to move. Each predator begins with a battery level of 1000.

4 METHODOLOGY

4.1 NEURAL NETWORK

The system utilizes a two-layer feed-forward neural network. The network contains 10 input nodes, with all inputs normalized between 0 and 1. We varied the number of hidden nodes between 1 and 10; the controller presented here uses four hidden nodes, based on experimental results. The system uses four output nodes for motor control stimuli. The activation function is a standard sigmoid, and the outputs are thresholded at a value of 0.5.

The robots cannot effectively capture prey without a minimum of state information. This is due to the presence of virtual food sources as perceived by the robots: Figure 4 shows two circles centered on the hearing sensors, and the magnitude received by the hearing sensors is the same whether the source is located at the actual prey position or at the virtual one. Recurrent networks have been used as an effective strategy for incorporating short-term state information (Gomez and Miikkulainen, 1997). We chose a simpler method, so that only the minimum necessary state information is present. Instead of feeding the hearing levels directly to the network, the levels generate two binary inputs that signify which side the prey is on and whether prey is present. We save these values at each step for use during the subsequent step.

[Figure 4: Virtual Prey Location — the actual and virtual prey positions produce identical hearing-sensor magnitudes.]
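A minimal sketch of the state-input scheme just described: instead of feeding raw hearing magnitudes to the network, two binary values (prey present, and which side it is on) are derived from the left and right hearing levels, and the pair from the previous step is remembered. The names, the exact comparisons, and the assumption that both the current and the saved pair feed the network are our own reading of the text, not the GATORS implementation.

```java
// Sketch of the binary prey-state inputs described in Section 4.1.
// Field/method names and the exact thresholding are our own assumptions.
public class PreyStateInputs {
    private double prevPresent = 0.0;
    private double prevSide = 0.0;

    // Derive the binary "prey present" and "prey side" signals from the
    // current hearing magnitudes, and return them together with the pair
    // saved from the previous step.
    public double[] update(double leftHearing, double rightHearing) {
        double present = (leftHearing + rightHearing) > 0.0 ? 1.0 : 0.0;
        double side = (leftHearing > rightHearing) ? 1.0 : 0.0; // 1.0 = left side louder
        double[] stateInputs = { present, side, prevPresent, prevSide };
        prevPresent = present;
        prevSide = side;
        return stateInputs;
    }

    public static void main(String[] args) {
        PreyStateInputs s = new PreyStateInputs();
        System.out.println(java.util.Arrays.toString(s.update(0.8, 0.1))); // [1, 1, 0, 0]
        System.out.println(java.util.Arrays.toString(s.update(0.0, 0.0))); // [0, 0, 1, 1]
    }
}
```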
4.2 GENETIC ALGORITHM

4.2.1 Encoding

GAs are highly disruptive when used in conjunction with neural networks: combining dissimilar sections from different networks often produces non-functioning networks. In order to minimize disruption, two different encoding strategies are used to form the GA strings. The objective of both strategies is to minimize disruption by placing highly dependent sections of the neural network in close proximity in the encoding. For both strategies, we use a 16-bit binary encoding of each weight and bias value. The input weights to each node are encoded to represent numbers between ±100.0. The bias weights are encoded to represent numbers between ±100.0 × Ni, where Ni is the number of inputs to a given node, so that the biases are able to offset the full range of possible input values.

The first method groups together the input weights and bias associated with each neuron. The objective of this strategy is to keep the full set of weights associated with each node together during mating (refer to Figure 5). The total number of bits required to encode a single network is

Ntot = ((Ni + 1) * Nh + (Nh + 1) * No) * 16,

where Ni is the number of inputs, Nh is the number of hidden nodes, and No is the number of output nodes. Since each network contains ten inputs and four outputs, a network with four hidden nodes requires a 1024-bit string.

For the second strategy, we used a novel approach in which the bias is broken apart: each input to a node is associated with its own part of the bias, and the input weight and its associated bias part are encoded next to each other in the string. The weights are grouped according to the inputs, rather than the nodes, so that all weights associated with a given input are adjacent. The number of bits required for this encoding is

Ntot = (Ni * Nh + Nh * No) * 16 * 2,

where the variables are the same as above. For our network, an encoding with four hidden nodes requires a 1792-bit string.

[Figure 5: Encoding Methods — Method I groups the weights and bias of each node together; Method II interleaves each input weight with its own bias part and groups the weights by input.]
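Each weight is stored on the GA string as a 16-bit value mapped onto the range ±100.0. A minimal sketch of such an encode/decode pair follows; the linear mapping and rounding details are our own assumptions, not the GATORS bit layout.

```java
// Sketch of a 16-bit fixed-range weight encoding as described in Section 4.2.1.
// The linear mapping and rounding details are illustrative assumptions.
public class WeightCoding {
    static final double RANGE = 100.0;       // weights span -100.0 .. +100.0
    static final int LEVELS = (1 << 16) - 1; // 65535 distinct 16-bit codes

    // Map a weight in [-RANGE, +RANGE] to an unsigned 16-bit integer.
    static int encode(double weight) {
        double clipped = Math.max(-RANGE, Math.min(RANGE, weight));
        return (int) Math.round((clipped + RANGE) / (2.0 * RANGE) * LEVELS);
    }

    // Inverse mapping from a 16-bit integer back to a weight value.
    static double decode(int bits) {
        return (bits / (double) LEVELS) * 2.0 * RANGE - RANGE;
    }

    public static void main(String[] args) {
        int bits = encode(37.5);
        System.out.printf("37.5 -> %d -> %.3f%n", bits, decode(bits));
    }
}
```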
4.2.2 Initialization

We use two methods to initialize the GA. The first is simple random generation of the bits. The second is performed differently: we first generate the input weights randomly, and then set the bias for each node based on the sum of its input weights,

Bj = 0.5 * Σi Wij,

where Bj is the bias for an individual node and Wij are the weights on the inputs from layer i to node j. Because the inputs are limited between 0 and 1, this places each node statistically in its operating region. We found that doing so significantly altered the primary characteristics of early generations (refer to Table 2).

Table 2: Initial Characteristics

Behavior             | w/ Op. Pt. Initialization | w/o Op. Pt. Initialization
Viable               | 40                        | 12
Non-Moving           | 10                        | 12
Circular Movement    | 33                        | 66
Straight Line/Crash  | 17                        | 10

4.2.3 Fitness Determination

The fitness of a controller is based on five features: the number of prey consumed, the number of touches, the number of crashes, the distance traveled, and area coverage. Early in the GA's evolution, only a few viable candidates are produced in each generation, and these viable candidates tend to have scores that are disproportionately large relative to the others. When random initialization is used, most of the controllers move in small stationary circles; the others either move in a straight line until they contact an obstacle and repeatedly crash into it, or do not move at all. Only a small number of initial controllers exhibit acceptable characteristics in multiple categories (refer to Table 2). Because the robots that move in circles normally move the full possible distance without eating prey, they tend to overwhelm the population when a scaled sum of the features is used: diversity is quickly lost and premature convergence ensues. If the scale factors are increased to produce good candidates during the early generations, later generations are scored too highly in those categories and, once again, diversity is lost.

Instead of a scaled sum of features, the fitness function uses the average and standard deviation of each feature within a generation to compute the fitness score. The fitness function is

Fi = Wf * Sf * 2^((Xif − μf) / σf),

where Fi is the fitness for the ith chromosome, Wf is a user-settable weight factor for feature f, Sf is a hard-coded scale factor for the feature, Xif is the chromosome's score for the feature, μf is the mean score for the feature in the current generation, and σf is its standard deviation. (For crashes, the exponent terms are reversed.) Using this function, a chromosome with a score one standard deviation above the average receives a fitness which is twice the average. As controllers learn and attain higher scores for a given feature, the fitness function changes with them, keeping all portions of the fitness stable. The hard-coded scale factors were determined by examining the score distributions over 100 generations; they were chosen so that the scores in each category tend to be distributed within the same range of values.
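The sketch below shows one feature's term in the fitness function above, with the exponent reversed for features where less is better (crashes). The weight and scale values, the guard against a zero standard deviation, and the assumption that per-feature terms are combined across features are our own placeholders, not values from the GATORS system.

```java
// Sketch of the relative fitness term from Section 4.2.3; the weight/scale
// values and the small-sigma guard are illustrative placeholders.
public class FitnessSketch {
    // One feature's contribution: Wf * Sf * 2^((x - mean)/stdDev),
    // with the exponent reversed for features where less is better (crashes).
    static double featureFitness(double x, double mean, double stdDev,
                                 double weight, double scale, boolean lowerIsBetter) {
        double sigma = Math.max(stdDev, 1e-6);          // avoid division by zero
        double exponent = lowerIsBetter ? (mean - x) / sigma : (x - mean) / sigma;
        return weight * scale * Math.pow(2.0, exponent);
    }

    public static void main(String[] args) {
        // A score one standard deviation above the generation mean receives
        // twice the weight*scale baseline, as noted in the text.
        System.out.println(featureFitness(11.0, 10.0, 1.0, 1.0, 1.0, false)); // 2.0
        System.out.println(featureFitness(10.0, 10.0, 1.0, 1.0, 1.0, false)); // 1.0
    }
}
```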
4.2.4 Additional Structure

The standard deviation of prey consumption is normally on the order of 1.0, so when a candidate consumes a large number of prey it receives a score significantly larger than the rest of the population, and population diversity is quickly lost even with the fitness function described above. Fitness scaling was therefore introduced to limit the premature convergence associated with a few candidates having very large fitnesses relative to the rest of the population (Goldberg, 1989).

Due to the highly disruptive nature of the GA, we use an elitist selection algorithm to maintain the most fit individuals. Without elitism, the system tends to dismantle viable candidates before they can produce better offspring. During each generation, the 30 fittest chromosomes are kept without crossover or mutation. The crossover rate is set at 100% for the remaining chromosomes.

4.3 FINAL SELECTION

In order to determine the final candidate, the best 20 controllers are scored in 10 random environments, and the candidate with the highest fitness sum is selected. We found that testing the candidates in even 10 environments did not always yield the best controller, at least in our opinion; this is a very subjective determination. For the web applet implementation, the best candidate as determined by the system is used for testing and display. For our own purposes, we hand-selected what we considered to be a good candidate based on objective criteria in specialized test environments. The controller selected was trained in the absence of noise.

4.4 IMPLEMENTATION

The system runs on an Intel-based Beowulf cluster with eight Pentium II 400 MHz machines running Red Hat Linux. The GATORS system uses the Local Area Multicomputer (LAM) implementation of the Message Passing Interface (MPI) for inter-process communication. The system achieves an overall speed-up of about 4 (refer to Figure 6).

[Figure 6: Speed-Up vs. No. of Processors]

The interface to the system is a set of three Java programs that may be run as applications or as applets. A configuration interface is provided for setting the variables associated with the system. The second program allows the user to view the chromosome fitnesses and features independently as the GA progresses. The last program allows testing of the final controller produced by the system. To ensure that no abuse of system resources occurs, the applet version is limited in the number of variables that may be set. The applet version of the system is available for use by the general public at http://www.cs.unr.edu/~gruber; with it, a user may create and test their own robots.

5 RESULTS

We test the hand-selected controller produced by the GATORS system in a competitive environment against three controllers designed by human programmers. Testing takes place by incrementing one of five variables over its allowed range in 100 divisions. For each division, the system runs the controllers through 100 random environments and computes the average scores over the simulations. The system uses the same 100 random environments during each set in the process; the only condition varied in each set of simulations is the variable being studied. Unless specified otherwise, noise was turned off for all tests, and all other variables are randomly selected within their allowed ranges. The variables used for testing are predator ear length, predator hearing range, prey dormancy period, prey eat range, and system noise.

Figure 7 shows the average area coverage of the test unit in competition with the three human-developed controllers. The controller constructed by GATORS achieves an average area coverage that is 33% higher than the best controller produced by the human programmers. Even with a lower consumption rate, it is able to cover a greater area than the others.

[Figure 7: Average Area Coverage over the random environment sets]

The controller's ability to follow walls is exemplified by the number of times it touches obstacles during a simulation. Figure 8 shows that the test unit touches obstacles at about twice the rate of its nearest competitor.

[Figure 8: Average Number of Touches over the random environment sets]

The ability to use energy efficiently is a highly desirable characteristic. The test unit tends to move at a slower rate than the other units and covers less distance, traversing only about 85% of the distance of two of the other controllers (refer to Figure 9). By moving at a slower rate, the unit conserves energy.

[Figure 9: Average Distance Covered over the random environment sets]

The relation of area coverage to prey sleep time provides a good indication of energy conservation. The test unit exhibits a higher independence of sleep time than the other controllers (refer to Figure 10). The test unit appears to survive for longer periods of time, allowing sleeping prey to awaken and be consumed, whereas the other controllers quickly consume any food present and then use up their energy before the prey awaken. The test unit's consumption rate decreases in line with the other controllers as the prey sleep time is lengthened, but, as we observe in Figure 10, its area coverage does not decrease like that of the other controllers. The test unit also has a lower consumption rate than two of the other controllers.

[Figure 10: Area Coverage versus Prey Sleep Time]

The controller performed at only an average rate with respect to noise. Figure 11 shows that the controller is affected by noise as much as the controllers designed by hand. The reason for the reversal of results with respect to the human-designed controllers is that the GATORS controller moves backwards as a rule; this may be because there is only one speed in the reverse direction. The other scoring categories showed similar results in the presence of noise. This is partly because the controller presented here was constructed in environments in which the noise bias was turned off. We found that controllers built without noise exhibited better characteristics in noise-free environments. The controllers built in noisy environments do not perform as well as those built in noiseless environments, but they do perform better than the unit presented here when noise is present.

[Figure 11: Coverage vs. Noise Bias]
6 CONCLUSIONS

The controllers the GA produced exceeded those developed by human programmers in several categories, particularly area coverage and energy conservation. In addition, the evolved controllers exhibited a number of other complex behaviors. The unit studied here is a proficient hunter, able to negotiate obstacles in its path when prey is detected and to capture prey on the other side of those obstacles.

However, the use of a GA to train a neural network controller for complex behavioral functions is a difficult task. This is primarily related to the subjective nature of the scoring procedures used. We found no hard and fast scoring rules which yielded optimal results; since the final performance is subjective in many respects, there may be no single optimum, and scoring is simply a matter of finding desired characteristics for a given circumstance. We found that, in order to get acceptable results, it is necessary to take into account the different stages of development that occur as evolution progresses. In particular, the scoring function used during infancy is of primary importance. Without taking infancy into account, the large differences in fitness between dissimilar categories lead to premature convergence due to a loss of genetic diversity. By using a scoring function based on relative fitness, rather than absolute measurements, we were able to obtain much better results. We also found that, with operating-point initialization, the initial chromosomes exhibited more diverse behaviors than those produced using random initialization.

Noise within the environment made it almost impossible to construct viable controllers in our system. We believe the controllers developed in infancy are unable to cope with the presence of noise, and for that reason the GA is unable to progress normally.

7 FUTURE WORK

The GA functioned at a much higher level using the operating-point initialization functions. Further study into the relationships associated with this type of initialization may prove interesting.
The exact reasons for the increase in performance are not immediately evident. The fitness function also appears to perform better than the scaled sum-of-fitness methods attempted initially. Both methods deserve further examination.

References

D. E. Goldberg (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.

F. Gomez and R. Miikkulainen (1997). Incremental evolution of complex general behavior. Evolutionary Computation 5, 317-342.

M. T. Hagan, H. B. Demuth, and M. Beale (1996). Neural Network Design. Boston, MA: PWS Publishing Co.

J. H. Holland (1973). Genetic algorithms and the optimal allocation of trials. SIAM Journal on Computing 2(2), 88-105.

S. Louis and J. Sun (1996). Designing robust situated agents using genetic algorithms. In Proceedings of the ISCA 5th International Conference on Intelligent Systems, 140-144. ISCA Press.

S. Louis and G. Li (1996). Designing robust situated agents using genetic algorithms. In Proceedings of the ISCA 5th International Conference on Intelligent Systems, 145-149. ISCA Press.

D. E. Moriarty and R. Miikkulainen (1997). Forming neural networks through efficient and adaptive coevolution. Evolutionary Computation 5, 373-399.

S. Nolfi, J. L. Elman, and D. Parisi (1990). Learning and evolution in neural networks. Technical Report 9019, Center for Research in Language, University of California, San Diego.

B. Yamauchi and R. D. Beer (1994). Integrating reactive, sequential, and learning behavior using dynamical neural networks. In D. Cliff, P. Husbands, J. Meyer and S. Wilson (Eds.), From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (SAB94), 382-391. MIT Press.