
ECE 517 Final Project
Development of Predator/Prey Behavior via
Reinforcement Learning
Alexander Saites
12/5/2011
Abstract
This paper details the use of an artificial neural network to elicit predator- and prey-type behavior from
multiple simulated robotic agents. The agents are given their own position along with the positions of
the other robots within a small space. They use a neural network to estimate state-action values in
order to determine an action to take. Experiments were run using different reward schemes along with
different epsilon and decay values. The application was written in C/C++ using Player-Stage, but because
results in testing were poor, the design was re-implemented in Matlab and the problem was
progressively simplified. After many iterations, the Matlab version seemed to develop predator/prey-like
behaviors.
I. Background
Reinforcement learning is an effective way to find good policies for a particular problem;
however, state-action spaces grow exponentially with the number of input variables. Thus, tabular
methods quickly become unrealistic in many problems. As a result, artificial neural networks can be used
to model the relationship between the states and actions and their corresponding state-action values.
This allows memory requirements to grow linearly with the number of state-action variables. By using
this method, Q(s,a) can be estimated and corrected using common temporal-difference methods, such
as SARSA.
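For reference, the SARSA update used throughout this project takes the standard form

Q(s, a) ← Q(s, a) + α[ r + γQ(s′, a′) − Q(s, a) ],

where α is the learning rate and γ the discount factor (standard notation; α is not named explicitly in this report). When Q(s, a) is represented by a neural network, the bracketed temporal-difference error serves as the output error that is back-propagated through the network.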
II. Introduction
In this experiment, multiple agents are placed in a small room and attempt to achieve different
goals. “Predator” agents attempt to capture a “prey” by getting within one meter of the prey. Prey
attempt to escape the predators by staying more than one meter away. Each agent is given the exact
position of other agents along with its own position in the world. At any time, an agent must choose
which direction to travel. If the agent attempts to take an illegal action, such as moving into the border,
it simply remains in its current state. The goal was to train the agents in either predator or prey behavior
using a neural network to model state-action values. To accomplish this, SARSA was applied, using a feed-forward neural network for each agent.
III. Design
Objectives and Challenges
The design objective was to create a simple, extendable program that facilitated changes to the source
code and to the modeling of the environment. The primary challenge was speed of
execution. Originally, the problem was modeled using Player-Stage, an open-source robotics simulator.
The advantage of Player-Stage is that it offers an easy way to model robots; however, as a soft real-time
system, it becomes unstable when simulation speeds are greater than real-time. As a result,
experiments had to be run for several hours before results could be evaluated. Later, the problem was
redesigned in Matlab to ease execution of experiments and testing.
Technical Approach
In the Player-Stage version, a multi-layer, feed-forward neural network API was written in C. It
consists of a directed graph of input, hidden, and output nodes (neurons), connected by weighted
synapses. An extra input node with a constant output of 1 is connected to each hidden node to serve as
a bias. The input of a hidden node is the weighted sum of the input neurons' outputs, using the weights
on the synapses between those input neurons and the hidden node in question. Its output is then defined by the
hyperbolic tangent function. Errors are defined as the difference between desired and actual output,
and they are back-propagated through the network via the standard back-propagation algorithm.
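As an illustration of the computation just described, the following is a minimal sketch of a single hidden neuron's forward pass; it is not the project's actual API, and names such as hidden_output and num_inputs are hypothetical:

#include <math.h>

/* Minimal sketch: a hidden neuron's input is the weighted sum of the
 * input-layer outputs plus a bias weight (an extra input fixed at 1),
 * and its output is the hyperbolic tangent of that sum. */
double hidden_output(const double *inputs, const double *weights,
                     int num_inputs, double bias_weight)
{
    double net = bias_weight * 1.0;        /* bias node always outputs 1 */
    for (int i = 0; i < num_inputs; ++i)
        net += inputs[i] * weights[i];     /* weighted sum of inputs */
    return tanh(net);                      /* squashing activation */
}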
Each agent uses its own neural network to learn the state-action values. The inputs to the
network were defined in different ways in different experiments. In the Player-Stage approach, the
inputs consist of a robot’s position, the position of other robots (for a predator, only the prey’s position
is used here; the prey uses the positions of both predators), and the current action it is executing. In
Player-Stage, speeds are set in terms of linear speed and yaw speed. Linear speed may be any positive
or negative real value and is interpreted as distance per second. Yaw speed is a value between -π/2 and π/2,
and is interpreted as radians per second. The predator has a maximum absolute linear speed of .4 and
the prey has a maximum absolute linear speed of .5. Both of these speeds are discretized as nine values,
giving 81 possible actions.
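As a concrete illustration of this discretization (an assumed, evenly spaced grid; the project code may index actions differently), the 81 candidate actions could be enumerated as follows:

#include <math.h>

/* Sketch: discretize linear speed and yaw speed into nine levels each,
 * giving 9 * 9 = 81 candidate actions. max_linear would be 0.4 for a
 * predator or 0.5 for the prey; yaw speed spans [-pi/2, pi/2].
 * Even spacing is an assumption, not stated in the report. */
#define NUM_LEVELS 9

typedef struct { double linear; double yaw; } Action;

void build_action_set(Action actions[NUM_LEVELS * NUM_LEVELS], double max_linear)
{
    const double max_yaw = asin(1.0);      /* pi / 2 */
    for (int i = 0; i < NUM_LEVELS; ++i) {
        double lin = -max_linear + 2.0 * max_linear * i / (NUM_LEVELS - 1);
        for (int j = 0; j < NUM_LEVELS; ++j) {
            double yaw = -max_yaw + 2.0 * max_yaw * j / (NUM_LEVELS - 1);
            actions[i * NUM_LEVELS + j].linear = lin;
            actions[i * NUM_LEVELS + j].yaw = yaw;
        }
    }
}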
In the Player-Stage version, each robot’s code runs as a separate process. The program works by
initializing the robots in the world. An action is taken by setting the linear and angular speeds of the
robot. After this, the program determines the new state and the reward. The reward is observed
according to one of two reward schemes. In one reward scheme, a predator receives a reward of +1 for
coming within a set distance of the prey and the prey receives a -1 reward for coming within this
distance of the predator. In the other scheme, a predator is given a negative reward equal in magnitude
to the distance between itself and the prey, and the prey’s reward is the positive sum of the distances
between itself and the predators. For the purposes of the neural network, all values were normalized to
lie between zero and one.
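The normalization step might look like the following minimal sketch, assuming simple min-max scaling (the report does not specify the exact scaling used):

/* Sketch: min-max scaling of a raw input (a coordinate, speed, etc.)
 * into [0, 1] before it is fed to the neural network. The particular
 * scaling method is an assumption. */
double normalize(double value, double min_value, double max_value)
{
    return (value - min_value) / (max_value - min_value);
}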
Using the new state, the program determines its next action via an epsilon-greedy method. The
optimal action is determined by feeding the neural network the values of the next state and each of the
possible actions, and then finding which returns the greatest value. A random number is generated, and
if this value is less than ε, a random action that is not the optimal action is chosen instead. The error for
the neural network’s output is determined by the standard TD-error,
r + γQ(s′, a′) − Q(s, a),
where Q(s, a) is the output of the neural network using the current state and action and Q(s′, a′) is the
output of the network using the newly observed state and newly chosen action. The errors are
propagated through the network and the weights are updated accordingly. Finally, the current state is
set to the new state and the current action is set to the newly chosen action. ε is then decayed by the
decay value µ.
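A minimal sketch of this epsilon-greedy selection step is shown below; q_value stands in for a forward pass of the network with the candidate action appended to the state, and all names are hypothetical:

#include <stdlib.h>

/* Sketch: evaluate Q(s, a) for every candidate action, take the greedy
 * action, but with probability epsilon substitute a random action that
 * is not the greedy one, matching the selection rule described above. */
extern double q_value(const double *state, int action);   /* network forward pass */

int choose_action(const double *state, int num_actions, double epsilon)
{
    int best = 0;
    double best_q = q_value(state, 0);
    for (int a = 1; a < num_actions; ++a) {
        double q = q_value(state, a);
        if (q > best_q) { best_q = q; best = a; }
    }
    if ((double)rand() / RAND_MAX < epsilon) {
        int a = rand() % (num_actions - 1);     /* random non-optimal action */
        return (a >= best) ? a + 1 : a;         /* skip over the greedy index */
    }
    return best;
}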
In the Matlab version, one of four actions is chosen: up, down, left, or right. The input to the
network is the position of each agent and the current action that agent is executing. The networks
of all the agents are controlled from the same script but are managed independently. Graphical output
is displayed periodically by plotting the predator and prey positions in a figure. Other than these
differences, the code is roughly identical to the Player-Stage version. The Matlab code uses the
University of Tennessee's Machine Intelligence Lab's
Artificial Neural Network Toolkit.
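For concreteness, the four-action movement rule (with illegal moves leaving the agent in place, as described in the Introduction) could be sketched as follows; the 5x5 room and step sizes of .4 and .5 come from the Matlab experiments described later, and the C form is used only for consistency with the earlier listings:

/* Sketch: apply one of the four Matlab-version actions to an (x, y)
 * position. A move that would leave the room is ignored, so the agent
 * remains in its current state. The [0, 5] coordinate range is an
 * assumption about how the 5x5 room is laid out. */
#define ROOM_SIZE 5.0

typedef enum { UP, DOWN, LEFT, RIGHT } Move;

void apply_move(double *x, double *y, Move m, double step)   /* step: 0.4 or 0.5 */
{
    double nx = *x, ny = *y;
    switch (m) {
    case UP:    ny += step; break;
    case DOWN:  ny -= step; break;
    case LEFT:  nx -= step; break;
    case RIGHT: nx += step; break;
    }
    if (nx >= 0.0 && nx <= ROOM_SIZE && ny >= 0.0 && ny <= ROOM_SIZE) {
        *x = nx;    /* legal move: update the position */
        *y = ny;
    }
}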
This control scheme for both programs may be modeled in the following flowchart:
The process repeats forever, as the task is continuous.
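The loop that the flowchart describes can also be sketched in C; every function and constant here is a placeholder for the corresponding step in the text (GAMMA of .8 matches the value reported in the experiments, while STATE_DIM, MU, and the helper names are hypothetical):

#include <string.h>

#define STATE_DIM   6        /* placeholder: length of the state vector */
#define NUM_ACTIONS 81       /* 4 in the Matlab version */
#define GAMMA       0.8
#define MU          0.999    /* placeholder epsilon-decay value */

/* Placeholder declarations for the steps named in the text. */
extern void   observe_state(double *state);
extern void   execute(int action);
extern double compute_reward(const double *state);
extern double q_value(const double *state, int action);
extern int    choose_action(const double *state, int num_actions, double epsilon);
extern void   backpropagate(double td_error);

void control_loop(double epsilon)
{
    double state[STATE_DIM], next_state[STATE_DIM];
    observe_state(state);
    int action = choose_action(state, NUM_ACTIONS, epsilon);

    for (;;) {                              /* continuing task: no terminal state */
        execute(action);                    /* set speeds or take one discrete step */
        observe_state(next_state);
        double reward = compute_reward(next_state);
        int next_action = choose_action(next_state, NUM_ACTIONS, epsilon);

        /* TD-error r + gamma*Q(s', a') - Q(s, a), back-propagated as the
         * network's output error */
        double td_error = reward
                        + GAMMA * q_value(next_state, next_action)
                        - q_value(state, action);
        backpropagate(td_error);

        memcpy(state, next_state, sizeof state);   /* s <- s' */
        action = next_action;                      /* a <- a' */
        epsilon *= MU;                             /* decay the exploration rate */
    }
}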
Reward Schemes
A. Discrete
The original reward scheme used was discrete. If the predator came within some set distance of the
prey, the predator received a +1 reward and the prey received a -1 reward. However, this seemed to
lead to slow learning while the agents were still untrained, so the continuous method was devised.
B. Continuous
In the continuous case, the reward magnitude was equal to the distance between the predator and the
prey. As described above, for the predator this value was negative and for the prey it was positive, so
the predator is rewarded for closing the distance and the prey for increasing it. The idea was that the
neural network would be able to learn the relationship between the distance between the agents and
the resulting reward. This reward scheme consistently produced better results.
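A minimal sketch of the two schemes, assuming Euclidean distance between agent positions (the distance metric is not stated explicitly in the report):

#include <math.h>

/* Euclidean distance between two agents' (x, y) positions. */
double dist(double x1, double y1, double x2, double y2)
{
    double dx = x1 - x2, dy = y1 - y2;
    return sqrt(dx * dx + dy * dy);
}

/* Discrete scheme: +1 for the predator and -1 for the prey whenever the
 * predator is within the capture distance; zero otherwise (assumed). */
double discrete_reward(double d, int is_predator, double capture_dist)
{
    if (d <= capture_dist)
        return is_predator ? 1.0 : -1.0;
    return 0.0;
}

/* Continuous scheme: the predator's reward is minus its distance to the
 * prey; the prey's reward is the positive distance (summed over all
 * predators when there is more than one). */
double continuous_reward(double d, int is_predator)
{
    return is_predator ? -d : d;
}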
Experiments and Results
The first experiments were performed using Player-Stage in a large environment with several rooms.
The agents used a large amount of information to train the neural network. In the original version, inputs to
the neural network included the (x,y,θ) pose of the robot, 180 degrees of laser scan directly in front of
the robot, and the (x,y) return value from a fiducial finder (each robot had a fiducial: the predator’s
fiducial returned one value and the prey’s fiducial returned another). In addition, the predator robots
knew the prey-fiducial return value of the other predator robot. Gamma was set to .8 and epsilon was
held constant at .1. To encourage faster learning, the robots were all placed near each other at first. The
room and robots can be seen in the figure below.
Figure 1: The robots in Player-Stage. The blue robots are predators and the green robot is the prey.
After running this scenario for 12 hours, the robots did not perform well. Had they learned successfully,
follow-the-leader-type behavior should have developed; instead, the robots generally just spun in circles.
It was realized that the problem was effectively a POMDP, as state information about other robots was
available only when a robot was facing another robot.
As such, the problem was simplified further. The inputs to the neural network were changed to only the
pose of the robot and the (x,y) positions of the other robots. No laser scan or fiducial data was used.
Again, the program was run for several hours, but the results were unimpressive. The robots
simply turned in circles, only venturing outside of that when ε-greedy caused them to take a random
action. Thus, the problem was simplified further.
To make the problem easier, a different map, consisting only of a small room, was used. This new map
can be seen below:
Figure 2: A smaller room was used.
Despite this constrained space, results were still poor. The robots continued to spin in circles,
showing no real signs of learning. Experiments with different gamma values and reward schemes
yielded identical results. Clearly, either the theoretical approach was flawed or the problem was still too
difficult for the neural net to learn in any reasonable time.
To attempt to further simplify the problem, the program was remodeled in Matlab. In this scenario, only
one predator and one prey were used. The environment was a 5x5 room in which the predator could
move in increments of .4 and the prey could move in increments of .5. The only actions were to move
up, down, left, or right. The input to the neural net consisted of the current position of the prey, the
current position of the predator, and the current action to take. After letting this run for 100,000
iterations, no clear behavior seemed to develop. However, after fixing the prey's position to (2.5,
2.5) and allowing the predator to move freely, the predator did indeed seem to begin to learn a good
strategy.
Figure 3: The predator learns to stay near the prey
Training this same network for an additional 100,000 iterations allowed the average reward to
grow; the predator appeared to have learned to move toward the prey. But could it handle a moving target?
After the above experiments, the prey was released from its fixed position. The predator used the same
neural net it had before, but the prey started from scratch. The two were pitted against each other, and
the prey moved away from the predator while the predator followed.
Figure 4: The predator learns to follow the prey and the prey learns to run away
Figure 5: The dip in predator rewards shows that the prey is learning
Generally, the average reward per time step hovers around -2.8 for the predator (and 2.8 for the prey). It
seems the predator backs off from the prey and waits for it to move from a corner before quickly
moving toward it. If these were animals, we might suggest that the predator is luring the prey away.
IV. Conclusions
The continued failure of the agents in this project to learn shows that representation plays a strong
role in the speed with which a neural network can be trained. It is critical that neural networks be
trained using relevant information from which solutions to a problem can be devised. In the original
Player-Stage version of this problem, there was simply too much extraneous input and not enough
relevant data for the neural net to approximate the proper function. Worse, the speed at which
Player-Stage operates is not conducive to this type of experiment. Neural networks need a very large number of examples
to properly estimate the state-action values. In the Matlab version, it took over 200,000 steps for the
predator just to learn basic proper behavior. In Player-Stage this would have taken many hours, whereas in
Matlab it took less than 10 minutes.