ECE 517 Final Project
Development of Predator/Prey Behavior via Reinforcement Learning
Alexander Saites
12/5/2011

Abstract
This paper details the use of an artificial neural network to elicit predator- and prey-type behavior from multiple simulated robotic agents. The agents are given their own position along with the positions of the other robots within a small space. They use a neural network to estimate state-action values in order to determine an action to take. Experiments were run using different reward schemes along with different epsilon and decay values. The application was written in C/C++ using Player-Stage, but because results in testing were poor, the design was re-implemented in Matlab and the problem was progressively simplified. After many iterations, the Matlab version seemed to develop predator/prey-like behaviors.

I. Background
Reinforcement learning is an effective way to find good policies for a particular problem; however, state-action spaces grow exponentially with the number of input variables. Thus, tabular methods quickly become unrealistic for many problems. As a result, artificial neural networks can be used to model the relationship between states and actions and their corresponding state-action values. This allows memory requirements to grow linearly with the number of state-action variables. By using this method, Q(s,a) can be estimated and corrected using common temporal-difference methods, such as SARSA.

II. Introduction
In this experiment, multiple agents are placed in a small room and attempt to achieve different goals. "Predator" agents attempt to capture a "prey" by getting within one meter of it. The prey attempts to escape the predators by staying more than one meter away. Each agent is given the exact positions of the other agents along with its own position in the world. At any time, an agent must choose which direction to travel. If the agent attempts to take an illegal action, such as moving into the border, it simply remains in its current state. The goal was to train the agents in either predator or prey behavior, using a neural network to model the state-action values. To accomplish this, SARSA was applied, with a separate feedforward neural network for each agent.

III. Design Objectives and Challenges
The design objective was to create a simple, extensible program that facilitated changes to the source code and to the modeling of the environment. The primary challenge was speed of execution. Originally, the problem was modeled using Player-Stage, an open-source robotics simulator. The advantage of Player-Stage is that it offers an easy way to model robots; however, as a soft real-time system, it becomes unstable when simulation speeds are greater than real time. As a result, experiments had to be run for several hours before results could be evaluated. Later, the problem was redesigned in Matlab to ease execution of experiments and testing.

Technical Approach
In the Player-Stage version, a multi-layer, feed-forward neural network API was written in C. It consists of a directed graph of input, hidden, and output nodes (neurons) connected by weighted synapses. An extra input node with a constant output of 1 is additionally connected to each hidden node to serve as a bias. The input of a hidden node is the weighted sum of the outputs of the input neurons, using the weights of the synapses connecting them to the hidden neuron in question. Its output is then defined by the hyperbolic tangent function.
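The original C API is not reproduced in this report; the following minimal C++ sketch only illustrates the forward pass just described, assuming a single hidden layer and a single linear output node for the estimated state-action value. All names are illustrative.

```cpp
#include <cmath>
#include <vector>

// Sketch of the forward pass: each hidden node sums the weighted outputs of
// the input nodes plus a bias weight (the extra input node fixed at 1) and
// applies tanh; a single linear output node is assumed for the Q estimate.
double forwardPass(const std::vector<double>& inputs,
                   const std::vector<std::vector<double>>& hiddenWeights, // [hidden][input]
                   const std::vector<double>& biasWeights,                // one per hidden node
                   const std::vector<double>& outputWeights)              // [hidden]
{
    std::vector<double> hiddenOut(hiddenWeights.size());
    for (size_t h = 0; h < hiddenWeights.size(); ++h) {
        double sum = biasWeights[h];                  // bias node with constant output 1
        for (size_t i = 0; i < inputs.size(); ++i)
            sum += hiddenWeights[h][i] * inputs[i];   // weighted sum of input outputs
        hiddenOut[h] = std::tanh(sum);                // hyperbolic tangent activation
    }
    double q = 0.0;
    for (size_t h = 0; h < hiddenOut.size(); ++h)
        q += outputWeights[h] * hiddenOut[h];         // linear combination -> Q estimate
    return q;
}
```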
Errors are defined as the difference between the desired and actual output, and they are back-propagated through the network via the standard back-propagation algorithm. Each agent uses its own neural network to learn the state-action values. The inputs to the network were defined in different ways in different experiments. In the Player-Stage approach, the inputs consist of a robot's position, the positions of the other robots (for a predator, only the prey's position is used here; the prey uses the positions of both predators), and the current action it is executing.

In Player-Stage, speeds are set in terms of linear speed and yaw speed. Linear speed may be any positive or negative real value and is interpreted as distance per second. Yaw speed is a value between −π/2 and π/2 and is interpreted as radians per second. The predator has a maximum absolute linear speed of .4 and the prey has a maximum absolute linear speed of .5. Both the linear and yaw speeds are discretized into nine values, giving 81 possible actions. In the Player-Stage version, each robot's code runs as a separate process.

The program works by initializing the robots in the world. An action is taken by setting the linear and angular speeds of the robot. After this, the program determines the new state and the reward. The reward is observed according to one of two reward schemes. In one reward scheme, a predator receives a reward of +1 for coming within a set distance of the prey, and the prey receives a -1 reward for coming within this distance of a predator. In the other scheme, a predator is given a negative reward equal in magnitude to the distance between itself and the prey, and the prey's reward is the positive sum of the distances between itself and the predators. For the purposes of the neural network, all values were normalized to lie between zero and one.

Using the new state, the program determines its next action via an epsilon-greedy method. The optimal action is determined by feeding the neural network the values of the next state and each of the possible actions, and then finding which returns the greatest value. A random number is generated, and if this value is less than ε, a random action that is not the optimal action is chosen instead. The error for the neural network's output is determined by the standard TD error, r + γQ(s′,a′) − Q(s,a), where Q(s,a) is the output of the neural network using the current state and action, and Q(s′,a′) is the output of the network using the newly observed state and newly chosen action. The errors are propagated through the network and the weights are updated accordingly. Finally, the current state is set to the new state and the current action is set to the newly chosen action. ε is then decayed by the decay value µ.

In the Matlab version, one of four actions is chosen: up, down, left, or right. The input to the network is the positions of each of the agents and the current action that agent is executing. The networks of all the agents are controlled from the same script, but are managed independently. Graphical output is displayed periodically by plotting the predator and prey positions in a figure. Other than these differences, the code is roughly identical. The Matlab code uses the University of Tennessee's Machine Intelligence Lab's Artificial Neural Network Toolkit. This control scheme for both programs may be modeled in the following flowchart. The process repeats forever, as the task is continuous.
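As a concrete illustration, the sketch below condenses this control loop into C++. The helper functions (qValue, backpropagate, observeState, observeReward, executeAction) are hypothetical placeholders for the project's neural network API and simulator calls, and the multiplicative form of the ε decay is an assumption.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical interfaces standing in for the project's neural network API
// and simulator calls; they are placeholders, not the actual code.
double qValue(const std::vector<double>& state, int action);   // network forward pass
void   backpropagate(const std::vector<double>& state, int action, double tdError);
std::vector<double> observeState();
double observeReward();
void   executeAction(int action);

// Choose the action with the highest estimated Q value, or, with
// probability epsilon, a random non-optimal action.
int epsilonGreedy(const std::vector<double>& state, int numActions, double epsilon) {
    int best = 0;
    for (int a = 1; a < numActions; ++a)
        if (qValue(state, a) > qValue(state, best)) best = a;
    if (numActions > 1 && (double)rand() / RAND_MAX < epsilon) {
        int a = rand() % numActions;
        while (a == best) a = rand() % numActions;   // exclude the optimal action
        return a;
    }
    return best;
}

// One SARSA step: act, observe, choose the next action, compute the TD error,
// back-propagate it, then advance the state/action pair and decay epsilon.
void sarsaStep(std::vector<double>& state, int& action, int numActions,
               double gamma, double& epsilon, double mu) {
    executeAction(action);
    std::vector<double> nextState = observeState();
    double reward  = observeReward();
    int nextAction = epsilonGreedy(nextState, numActions, epsilon);

    // Standard TD error: r + gamma * Q(s', a') - Q(s, a)
    double tdError = reward + gamma * qValue(nextState, nextAction)
                            - qValue(state, action);
    backpropagate(state, action, tdError);

    state   = nextState;
    action  = nextAction;
    epsilon *= mu;   // assumed multiplicative decay of epsilon by mu
}
```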
Reward Schemes
A. Discrete
The original reward scheme was discrete. If the predator came within some set distance of the prey, the predator received a +1 reward and the prey received a -1 reward. However, when the agents were untrained, this seemed to lead to slow learning, so the continuous method was devised.

B. Continuous
In the continuous case, the magnitude of the reward was equal to the distance between the predator and the prey. For the predator, this value was negative, and for the prey, it was positive. The idea was that the neural network would be able to learn the relationship between the distance between the agents and the resulting reward. This reward scheme consistently produced better results.

Experiments and Results
The first experiments were performed using Player-Stage in a large environment with several rooms. The agents used quite a bit of information to train the neural network. In the original version, inputs to the neural network included the (x,y,θ) pose of the robot, 180 degrees of laser scan directly in front of the robot, and the (x,y) return value from a fiducial finder (each robot had a fiducial: the predator's fiducial returned one value and the prey's fiducial returned another). In addition, the predator robots knew the prey-fiducial return value of the other predator robot. Gamma was set to .8 and epsilon was held constant at .1. To encourage faster learning, the robots were all placed near each other at first. The room and robots can be seen in the figure below.

Figure 1: The robots in Player-Stage. The blue robots are predators and the green robot is the prey.

After running this scenario for 12 hours, the robots did not perform well. If they had, follow-the-leader type behavior should have developed. Instead, the robots generally just spun in circles. It was realized that the problem was effectively a POMDP, as state information about other robots was only available when facing another robot. As such, the problem was simplified. The inputs to the neural network were changed to only the pose of the robot and the (x,y) positions of the other robots. No laser scan or fiducial data was used. Again, the program was run for several hours, but the results were again unimpressive. The robots simply turned in circles, only venturing out of them when the ε-greedy policy caused them to take a random action. Thus, the problem was simplified further. To make the problem easier, a different map, consisting only of a small room, was used. This new map can be seen below.

Figure 2: A smaller room was used.

Despite this constrained space, results were still poor. The robots continued to spin in circles, showing no real signs of learning. Experiments with different gamma values and reward schemes yielded identical results. Clearly, either the theory behind the approach was incorrect, or the problem was still too difficult for the neural net to learn in any reasonable time.

To simplify the problem further, the program was remodeled in Matlab. In this scenario, only one predator and one prey were used. The environment was a 5x5 room in which the predator could move in increments of .4 and the prey could move in increments of .5. The only actions were to move up, down, left, or right. The input to the neural net consisted of the current position of the prey, the current position of the predator, and the current action to take.
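The experiments at this stage were run in Matlab using the toolkit mentioned above; the C++ sketch below merely restates the simplified environment's dynamics for concreteness. It assumes the 5x5 room spans coordinates 0 to 5, and the type and function names are illustrative.

```cpp
#include <cmath>

// Illustrative restatement of the simplified environment (the actual
// experiments were run in Matlab; coordinate range 0..5 is assumed).
struct Position { double x, y; };

enum Action { UP, DOWN, LEFT, RIGHT };

// Move an agent by its step size (.4 for the predator, .5 for the prey);
// an illegal move into the border leaves the agent in its current state.
Position step(Position p, Action a, double stepSize, double roomSize = 5.0) {
    Position next = p;
    switch (a) {
        case UP:    next.y += stepSize; break;
        case DOWN:  next.y -= stepSize; break;
        case LEFT:  next.x -= stepSize; break;
        case RIGHT: next.x += stepSize; break;
    }
    if (next.x < 0.0 || next.x > roomSize || next.y < 0.0 || next.y > roomSize)
        return p;  // illegal action: remain in place
    return next;
}

// Continuous reward scheme: the predator's reward is the negative of the
// distance to the prey, and the prey's reward is the positive distance.
double continuousReward(Position predator, Position prey, bool isPredator) {
    double d = std::hypot(predator.x - prey.x, predator.y - prey.y);
    return isPredator ? -d : d;
}
```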
After letting this run for 100,000 iterations, no clear behavior seemed to have developed. However, after fixing the prey's position to (2.5, 2.5) and allowing the predator to run free, the predator did indeed begin to learn a good strategy.

Figure 3: The predator learns to stay near the prey.

Allowing this same network to train for an additional 100,000 iterations allowed the average reward to grow. It seems the predator had learned to move toward the prey. But could it handle a moving target? After the above experiments, the prey was released from its fixed position. The predator used the same neural net it had before, but the prey started from scratch. The two were pitted against each other, and the prey moved away from the predator while the predator followed.

Figure 4: The predator learns to follow the prey and the prey learns to run away.

Figure 5: The dip in predator rewards shows that the prey is learning.

Generally, the average reward per time step hangs around -2.8 for the predator (2.8 for the prey). It seems the predator backs off from the prey and waits for it to move out of a corner before quickly moving toward it. If these were animals, we might suggest that the predator is luring the prey away.

IV. Conclusions
The continued failure of the agents in this project to learn shows that representation plays a strong role in the speed with which a neural network can be trained. It is critical that neural networks be trained using relevant information from which solutions to a problem can be devised. In the original Player-Stage version of this problem, there was simply too much extraneous input and not enough relevant data for the neural net to approximate the proper function. Worse, the speed at which Player-Stage operates is not conducive to this type of experiment. Neural networks need a great many examples to properly estimate the state-action values. In the Matlab version, it took over 200,000 steps for the predator to learn even basic behavior. In Player-Stage this would take many hours, whereas in Matlab it takes less than 10 minutes.