Applying Reinforcement Learning Techniques in the “Catch-the-Thief” Domain
Master Thesis
Stefan Stavrev
University of Amsterdam
Master Artificial Intelligence
Track: Gaming
Supervised by:
Ph.D. Shimon A. Whiteson (UvA)
MSc. Harm van Seijen (T.N.O.)
Contents
1 Introduction
2 Background
  2.1 The RL framework
  2.2 Planning and learning
  2.3 Model-free and model-based learning
  2.4 SMDPs and Options
  2.5 Q-learning
  2.6 Value iteration
  2.7 Multi-agent systems
3 Domain description
  3.1 The environment
  3.2 Thief behavior
4 Methods
  4.1 State-space reduction
  4.2 Action-space reduction
  4.3 Algorithms
    4.3.1 Combination of offline Value Iteration and shortest-distance-to-target Deterministic Strategy (CVIDS)
    4.3.2 Options-based Reinforcement Learning Algorithm (ORLA)
    4.3.3 Shortest Distance Deterministic Strategy (SDDS)
5 Experimental results
  5.1 Experimental setup
  5.2 Results
6 Analysis and discussion
  6.1 ORLA-SA vs. CVIDS vs. SDDS
  6.2 ORLA-MA vs. ORLA-SA
7 Related work
8 Conclusions and Future work
  8.1 Conclusions
  8.2 Future work
Acknowledgments
Appendices
  1. Features to states and vice-versa
  2. Actions to Features and vice-versa
  3. Line segment to line segment intersection
  4. Standard Error calculation
Bibliography
1 Introduction
Modern Artificial Intelligence is focused on designing intelligent agents. An agent is a program that can make observations and take decisions based on these observations. An agent is considered intelligent if it can make autonomous decisions and act in a rational way. Learning how to act rationally is not easy. The different machine learning fields have different perspectives on learning: some learn from explicit examples of correct behavior, while others rely on positive and negative reward signals to judge the quality of an agent's actions. The latter field is called Reinforcement Learning (RL). In RL, an agent learns by interacting with its environment. In certain problems there are multiple agents. One such problem is the “catch-the-thief” problem, in which several guards pursue a thief in a simulated environment.
The main difficulty of this domain is scalability, because complexity grows exponentially with the number of agents and the size of the environment. To tackle the scalability problem, we make use of options [3]. Options are an extension of (primitive) actions and may take several timesteps to complete. Options are typically added to the set of available actions to obtain more effective exploration. However, we use them to reduce the size of the state-action space. This is achieved by only letting the agent learn in certain parts of the state-space, while in the rest of the state-space the agent follows a fixed, hand-coded option. The motivation behind this approach is that for large parts of the environment it is often easy to come up with a good policy. We propose to use hand-coded options for those parts and only use learning for the more difficult regions, where constructing a good policy is hard.
The main purpose of this master thesis is to empirically compare
different strategies to use options for dealing with the scalability issue
present in the catch-the-thief problem. We are going to investigate
both single-agent and multi-agent methods. Based on an analysis of
the different methods, hypotheses will be formed about which method
will perform well on which task variation and these hypotheses will be
tested with experiments.
We are mainly interested in the performance in the limit that can be achieved with these methods. Under these settings, scalability plays a role, due to the limited space (RAM) available to store data, and because the total computation time available to reach maximum performance is limited in any practical setting. While we mainly focus on RL methods, for completeness we also compare against a planning method.
The rest of this thesis is organized as follows. First, we present some background knowledge about the reinforcement learning framework in general and the relevant RL techniques that we are going to use in our research (Section 2). In Section 3, we formally introduce our domain and explain the typical difficulties of RL in this domain. Section 4 describes the methods and algorithms that we use, as well as relevant techniques for reducing the total state-action space; in addition, we form hypotheses about the performance and scalability of the proposed methods. Section 5 describes the experiments that we conduct and shows the relevant results. In Section 6 we perform an empirical analysis and discuss the outcome of our experiments. After that, we give a concise overview of existing empirical research in the “catch-the-thief” domain, as well as single-agent and multi-agent perspectives for modeling an agent's behavior (Section 7). In Section 8 we summarize our conclusions and propose extensions for future work.
2 Background
2.1 The RL framework
We already briefly mentioned in the Introduction what an agent is. More formally, an RL agent is "an abstract entity (usually a program) that can make observations, take actions, and receive rewards for the actions taken. Given a history of such interactions, the agent must make the next choice of action so as to maximize the long term sum of rewards. To do this well, an agent may take suboptimal actions which allow it to gather the information necessary to later take optimal or near-optimal actions with respect to maximizing the long term sum of rewards" [44][1].

Software agents can act by following hand-coded rules, or they can learn how to act by employing machine learning algorithms. Reinforcement learning is one such sub-area of machine learning.
Formally, one way to describe a reinforcement learning problem is by a Markov Decision Process (MDP). An MDP is defined as the tuple {S, A, T, R}: S is a set of states, A is a set of actions, T is the transition probability function and R is a reward function. In MDP environments, a learning agent selects and executes an action a_t ∈ A at the current state s_t ∈ S at time t. At time t+1, the agent moves to state s_{t+1} ∈ S and receives a reward r_{t+1}. The agent's goal is to maximize the discounted sum of rewards from time t to ∞, called the return R_t, defined as follows [1]:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}

where γ (0 ≤ γ ≤ 1) is a discount factor, specifying the degree of importance of future rewards.
The agent chooses its actions according to a policy π, which is a mapping from states to actions. Each policy π is associated with a state-value function V^π(s), which predicts the expected return for state s when following policy π:

V^π(s) = E_π[ R_t | s_t = s ]

where E[.] indicates the expected value. The optimal value of a state s, V*(s), is defined as the maximum value over all possible policies:

V*(s) = max_π E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_t = π(s_t) ]    (1)

Related to the state-value function is the action-value function Q^π(s, a), which gives the expected return when taking action a in state s and following policy π thereafter:

Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]

The optimal Q-value of a state-action pair (s, a) is the maximum Q-value over all possible policies:

Q*(s, a) = max_π E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a, a_{t>0} = π(s_t) ]    (2)

An optimal policy π*(s) is a policy whose state-value function is equal to (1).
2.2 Planning and learning
When the model of the environment is available, planning methods [1]
can be used. Examples of such algorithms are Policy Iteration and
Value Iteration. In practice however, a model of the environment is
not always available and has to be learned instead. In this case,
learning methods [1] can be used, such as Monte Carlo, Q-learning
and Sarsa.
2.3 Model-free and model-based learning
There are different ways of learning a good policy. Model-free learning
algorithms directly update state or state-action values using observed
samples. In the limit, some model-free learning methods are guaranteed to find an optimal policy π*(s) [1]. Most model-free methods require only O(|S||A|) space.
In model-based learning [1], on the other hand, a model of the
environment is estimated to determine the transition probability T
between states and the reward function R. Then, T and R are used to
compute the optimal values by means of off-line planning. Examples of
such techniques are Dyna and Prioritized Sweeping [1]. The advantage of employing such methods is that they require fewer samples in order to achieve a good policy. Model-based methods require quadratic memory space: O(|S|²|A|).
The choice between model-free and model-based methods depends on
the complexity of the domain. Using a model requires more memory,
but fewer samples.
2.4 SMDPs and Options
A Semi-Markov Decision Process (SMDP) is an extension of an MDP, appropriate for modeling continuous-time discrete-event systems [32], [33], [34]. It is formally defined as the tuple {S, A, T, R, F}, where S is a finite set of states, A is the set of primitive actions, T is the transition function from a state-action pair to a new state, R is the reward function and F is a function giving the transition-time probabilities for each state-action pair [2]. Discrete SMDPs represent transition distributions as F(s', N | s, a), which specifies the expected number of steps N that action a will take before terminating in state s', starting in state s.
Options are a generalization of primitive actions [3]. Primitive actions always take one timestep to execute, while an option may take two, three or more timesteps to complete. That is why options are appropriate for modeling state transitions that take variable amounts of time, like those in an SMDP. An option is defined as the triplet {I, π, β}, where I ⊆ S is an initiation set, π is the option policy and β is a termination condition. (The termination condition β plays a role similar to the β in β-models [47], but with an opposite sense; that is, β(s) here corresponds to 1 − β(s) in [47].) An option is available in state s_t if and only if s_t ∈ I. If the option is taken, then actions are selected according to π until the option terminates stochastically according to β. In particular, an option executes as follows. First, the next action a_t is selected according to the policy π. The environment then makes a transition to state s_{t+1}, where the option either terminates, with probability β(s_{t+1}), or else continues, determining a_{t+1} according to π(s_{t+1}), possibly terminating in s_{t+2} according to β(s_{t+2}), and so on.
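To make the triplet concrete, the following is a minimal C++ sketch of an option data structure. It is illustrative only (it is not the thesis implementation), and State and Action are assumed to be plain integer identifiers.

```cpp
// A minimal sketch of an option as the triplet {I, pi, beta}.
#include <functional>
#include <set>

using State = int;
using Action = int;

struct Option {
    std::set<State> initiationSet;             // I: states where the option may be started
    std::function<Action(State)> policy;       // pi: action to take while the option runs
    std::function<double(State)> termination;  // beta: probability of terminating in a state

    bool availableIn(State s) const {
        return initiationSet.count(s) > 0;     // the option is available iff s is in I
    }
};
```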
2.5 Q-learning
Q-learning [1][27] is a well-known model-free algorithm, in which an agent iteratively chooses actions according to a policy π and updates state-action values according to the rule:

Q(s, a) ← (1 − α) Q(s, a) + α [R + γ max_{a'} Q(s', a')]

The learning rate α (0 ≤ α ≤ 1) determines to what extent a new sample changes the old Q-value. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information. Obviously, choosing the correct α is important for learning.
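As an illustration of this update rule, here is a minimal C++ sketch of a tabular Q-learning update; the class and member names are illustrative assumptions, not the thesis code.

```cpp
// A minimal sketch of the tabular Q-learning update Q(s,a) <- (1-a)Q(s,a) + a[R + g max Q(s',a')].
#include <vector>
#include <algorithm>

struct QTable {
    std::vector<std::vector<double>> q;  // q[s][a]
    double alpha = 0.1;                  // learning rate
    double gamma = 0.99;                 // discount factor

    QTable(int nStates, int nActions)
        : q(nStates, std::vector<double>(nActions, 0.0)) {}

    // One Q-learning update for the observed transition (s, a, r, sNext).
    void update(int s, int a, double r, int sNext) {
        double maxNext = *std::max_element(q[sNext].begin(), q[sNext].end());
        q[s][a] = (1.0 - alpha) * q[s][a] + alpha * (r + gamma * maxNext);
    }
};
```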
2.6 Value iteration
Value iteration (VI) [1] is a planning algorithm. It uses as input the
transition and reward function, T and R, and computes from this the
optimal value function V* by iteratively improving the value estimates.
Once the optimal value function is found, the optimal policy can easily
be derived. Note that we can use VI in model-based learning to
compute a policy, based on the model estimate.
Below, the pseudo-code for the VI algorithm is given.
loop until policy π is good enough
    for all states s ∈ S
        for all actions a ∈ A
            Q(s, a) ← R(s, a) + γ Σ_{s'} T(s, a, s') V(s')
        end for
        V(s) ← max_a Q(s, a)
    end for
end loop
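The following C++ sketch shows one way the loop above can be implemented for a tabular problem, assuming the caller supplies T as lists of (next state, probability) pairs and R as expected immediate rewards; it is illustrative, not the simulator's implementation.

```cpp
// A minimal sketch of tabular value iteration with a convergence threshold.
#include <vector>
#include <utility>
#include <cmath>
#include <algorithm>

std::vector<double> valueIteration(
    const std::vector<std::vector<std::vector<std::pair<int, double>>>>& T,  // T[s][a] = {(s', p)}
    const std::vector<std::vector<double>>& R,                               // R[s][a]
    double gamma = 0.99, double theta = 1e-6) {
    const int nStates = static_cast<int>(T.size());
    std::vector<double> V(nStates, 0.0);
    double delta;
    do {
        delta = 0.0;
        for (int s = 0; s < nStates; ++s) {
            double best = -1e300;
            for (std::size_t a = 0; a < T[s].size(); ++a) {
                double qsa = R[s][a];
                for (const auto& [sNext, p] : T[s][a])
                    qsa += gamma * p * V[sNext];          // Q(s,a) = R(s,a) + g * sum_s' T(s,a,s') V(s')
                best = std::max(best, qsa);
            }
            delta = std::max(delta, std::fabs(best - V[s]));
            V[s] = best;                                   // V(s) = max_a Q(s,a)
        }
    } while (delta > theta);                               // stop once the value estimates have converged
    return V;
}
```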
2.7 Multi-agent systems
So far, we assumed the existence of only a single agent in an
environment. In practice, however, there can be multiple agents
interacting with each other in different ways in an environment. A
system that consists of a group of such agents is called a Multi-Agent
System (MAS).
Formally, a MAS is modeled as a generalization of an MDP called a stochastic game (SG). An SG is the tuple {S, N, A_i, P(s', ā), r_i}: S is a finite set of states, N is a finite set of n agents, A_i is the set of actions available to agent i, P(s', ā) is the transition probability of reaching state s' after executing the joint action ā (a vector of the individual agents' primitive actions) at state s, and r_i is the individual reward for agent i. The rewards r_i(s, ā) also depend on the joint action.

There are different ways to select a joint action ā:

1) Central control mechanism
   a) The central approach is to combine all agents into a single "big" agent. That way, the multi-agent (MA) problem is transformed into a single-agent problem and we can use regular single-agent MDP techniques to solve it. Using that approach guarantees finding the optimal policy π* in the limit, but convergence can be very slow. For example, if there are 4 agents, each with 4 available actions, the total number of joint actions is 4^4 = 256.

2) Decentralized (distributed) control mechanisms
   a) Another way to compute ā is by using coordination graphs (a graph with a node for each agent and an edge between two agents if they must directly coordinate their actions to optimize some particular Q-value) and explicit communication mechanisms, in order to model dependencies between individual agents. As a consequence of these mechanisms, the joint action is computed in a more efficient way and a computational speed-up is achieved.
   b) Finally, there are multi-agent methods that do not explicitly compute the joint action nor assume any dependence between the agents. Those approaches rely on implicit coordination mechanisms and/or shared value functions in order to achieve coordination between the agents. The total action space then scales linearly with the number of agents. In the example above, 4 agents with 4 available actions still give 4^4 = 256 joint actions, but we only need 4 * 4 = 16 Q-values; the implicit coordination mechanisms make it unnecessary to store 256 Q-values to describe all 256 joint actions.

The main advantages of employing a distributed MAS are computational speedup and a reduction of the number of action values that need to be stored. A disadvantage is a decrease in performance in the limit.
3 Domain description
3.1 The environment
The domain in this thesis is the “catch-the-thief” domain, also known as the “predator-prey” domain: a number of agents, called guards, learn a control policy to catch a rogue entity, called the thief. The guards try to catch the thief as quickly as possible, while preventing it from escaping through one of the exits. In this specific case, the domain is modeled as an MDP with a discrete, finite state-action space. The choice has been made to use a discrete environment because it is easy to implement and analyze. A state consists of a feature vector, where each feature encodes the position of one of the entities (a guard or the thief). There are also obstacles/walls in the environment (Figure 1).

Figure 1: A snapshot of our catch-the-thief domain. Guards are dark-blue boxes, the thief is a red box, exits are green and walls are blue boxes.
The size of the total state space S, in the specific case of a 20 x 20 2D navigational cell-grid, two guards and one thief, and not counting the obstacle cells (137 in our case), is equal to:

[(20 * 20) − 137]^3 = 1.82 * 10^7    (3)

Each guard can take 4 (primitive) actions, corresponding to movements in the "up", "down", "left" and "right" directions. Since there are two guards, the total number of joint actions is equal to 4^2 = 16. As a consequence, the total state-action space S x A is equal to:

1.82 * 10^7 * 16 = 2.91 * 10^8 ≈ 300 million state-action pairs.    (4)

This number is substantial. That is why we are going to look for a way to reduce the total state-action space in the Methods section.
Note that the guards are the entities that are controlled by the algorithm. From the perspective of these guards, the thief is part of the environment. The reward r_i(s) received by each agent after execution of a joint action is equal to:

  +1, if the thief is caught,
  −1, if the thief has escaped,
   0, otherwise.

A thief is caught when it is on the same grid cell of the environment as one of the guards. The thief's behavior is deterministic, but affected by whether or not it observes the guards. In the next section, we explain its behavior in detail.
3.2 Thief behavior
At each timestep, the thief takes a single step in one of four possible
directions: up, down, right or left. As mentioned before, the thief
chooses its actions deterministically. The thief’s first priority is not
being caught by the guards. The thief can observe guards using a
visual system with a sensitive area of 13 x 13 cells centered around it.
Line segment to line segment intersection is used to determine whether the thief sees any of the guards; implementation details can be found in Appendix 3. The system detects whether there are any guards within the sensitive area of the thief. If that is the case, the system reduces the set of available movement directions by removing the one that brings the thief closer to that guard. This process is repeated for all guards. From the set of remaining actions, the thief takes the one that results in the shortest path to one of the exits (ties are settled using a priority order over the actions, in order to keep the thief behavior deterministic). We use the number of cells free of guards and obstacles to measure the proximity of the thief to an exit.
At a single timestep, the following occurs in order:
1. The guards select their movement actions.
2. The guards are moved.
3. If the thief is not caught after the guards are moved, it selects a
movement direction, based on the location of the exits and the new
location of the visible guards, and takes one step in that direction.
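The following C++ sketch illustrates the thief's deterministic action selection as described above. The helper predicates (visibility, "does this move approach a guard?", obstacle-aware distance to the nearest exit) are assumptions passed in by the caller; they stand in for the simulator's own routines and are not the thesis code.

```cpp
// A minimal sketch of the thief's deterministic action selection.
#include <vector>
#include <functional>
#include <algorithm>
#include <limits>

enum Action { Up = 0, Down, Right, Left };   // fixed priority order used for tie-breaking
struct Pos { int x, y; };

Action selectThiefAction(const Pos& thief,
                         const std::vector<Pos>& guards,
                         const std::function<bool(Pos, Pos)>& canSee,                // Appendix 3 visibility test
                         const std::function<bool(Pos, Action, Pos)>& approaches,    // does the move bring the thief closer to the guard?
                         const std::function<int(Pos, Action)>& exitDistance) {      // obstacle-aware steps to the nearest exit after the move
    std::vector<Action> candidates = {Up, Down, Right, Left};
    // For every visible guard, remove the direction that brings the thief closer to it.
    for (const Pos& g : guards) {
        if (!canSee(thief, g)) continue;
        candidates.erase(std::remove_if(candidates.begin(), candidates.end(),
                                        [&](Action a) { return approaches(thief, a, g); }),
                         candidates.end());
    }
    if (candidates.empty()) return Up;   // fall back to the highest-priority action
    // Pick the remaining action with the shortest path to an exit; since candidates
    // keep the enum order, the first minimum found respects the priority tie-break.
    Action best = candidates.front();
    int bestDist = std::numeric_limits<int>::max();
    for (Action a : candidates) {
        int d = exitDistance(thief, a);
        if (d < bestDist) { best = a; bestDist = d; }
    }
    return best;
}
```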
4 Methods
In this section we take a closer look at the algorithms we will use in our experiments. Furthermore, we explain the state-action space reduction techniques that these methods employ.
4.1 State-space reduction
We mentioned in the Introduction that the state-space can be reduced by employing options. In this section we explain in detail how our approach works.

The core concept is to only do learning in certain parts of the environment, while in others hand-coded options are used. The motivation behind this approach is that for certain states it is difficult to come up with a good hand-coded strategy and learning has to be used. On the other hand, there are states where the hand-coded strategy works sufficiently well and learning is practically unnecessary. In principle, the state-space of any MDP can be divided into learning and non-learning subsets.
Before we explain our technique in more detail, we first give a few definitions. Let S denote our total state-space. Then S can be divided into:

- States where we use learning. For these states we estimate Q-values. We call the subset of states where learning occurs QST.
- States where hand-coded options are used. For these states we do not estimate Q-values. We call the subset of states where options are used NQST.

We now give an example of such a division for our “catch-the-thief” domain. We relate the QST and NQST subsets to so-called learning areas (LAs). These LAs are subsets of the floorplan that the guards and the thief move around in. We define the QST for this task as the subset of states where both guards and the thief are in the same LA:

s ∈ QST, if all entities are inside one and the same LA
s ∈ NQST, otherwise
We now compute the reduction of the state-space achieved, using the task parameters. We denote:

- n: the total number of entities (guards plus thieves)
- L1, L2, ..., Lk: the learning areas
- k: the number of learning areas, in our case two, L1 and L2 (Figure 2)
- c1, c2, ..., ck: the number of free cells in each learning area Li
- |S_L|: the size of QST

Figure 2: A representation of the learning areas L1 and L2, as part of the whole shopping mall floorplan.

The number of cells free of obstacles in a learning area, ci, raised to the power of the number of entities n, gives us the number of states per learning area Li. The total size of QST, |S_L|, is then equal to:
|S_L| = Σ_{i=1}^{k} c_i^n    (5)

In our environment: n = 3, S = 1.82 * 10^7, k = 2, c1 = 42 and c2 = 42. Then, according to (5):

|S_L| = 42^3 + 42^3 ≈ 1.48 * 10^5

Dividing |S_L| by S yields the reduction ratio:

|S_L| / S = 1.48 * 10^5 / 1.82 * 10^7 ≈ 0.0081

As we can see, we will be using less than one percent of the original state-space size, which is a significant reduction.
4.2 Action-space reduction
In addition to reducing the state-space, one of the methods that we will use (ORLA-MA, see Section 4.3.2.2) also reduces the action-space. In particular, we employ a multi-agent learning technique called Distributed Value Functions (DVF) [4], which models each agent separately, but the reward is common and each agent shares part of its value function with the other agent:

Q_g(s, a_g) ← (1 − α) Q_g(s, a_g) + α [R + γ^i V'_g]

where

V'_g ← ω V'_g + (1 − ω) V'_{g-}

Here the subscript g denotes the agent we are updating and g- denotes the other agent in the environment; V'_g is agent g's value estimate of the next state, and ω is a weighting factor that determines how the state value of agent g gets updated.

As a result, instead of searching in a 4^2 = 16 action space, we only look into a 4 + 4 = 8 action space, and the total action space is reduced by half. Please note that, since we model the actions of each agent separately, we are not trying to explicitly find the optimal joint action ā. We will discuss the trade-offs (if any) of this approach in the Experiments and Analysis sections.
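A minimal C++ sketch of this kind of DVF update for two agents is given below; the class layout and names are illustrative assumptions, and the thesis implementation is not reproduced here.

```cpp
// A minimal sketch of a two-agent DVF-style update with a shared global reward.
#include <vector>
#include <algorithm>
#include <cmath>

struct DvfAgentPair {
    // q[g][s][a]: Q-table of agent g (g = 0 or 1), 4 primitive actions each.
    std::vector<std::vector<std::vector<double>>> q;
    double alpha = 0.1, gamma = 0.99, omega = 0.8;

    DvfAgentPair(int nStates)
        : q(2, std::vector<std::vector<double>>(nStates, std::vector<double>(4, 0.0))) {}

    static double maxQ(const std::vector<double>& row) {
        return *std::max_element(row.begin(), row.end());
    }

    // Update both agents for a transition that took `steps` primitive timesteps
    // (steps == 1 for plain primitive actions) and accumulated discounted reward R.
    void update(int s, int a0, int a1, double R, int sNext, int steps) {
        double v0 = maxQ(q[0][sNext]);
        double v1 = maxQ(q[1][sNext]);
        double disc = std::pow(gamma, steps);
        // Each agent mixes its own next-state value with the other agent's value.
        q[0][s][a0] = (1 - alpha) * q[0][s][a0] + alpha * (R + disc * (omega * v0 + (1 - omega) * v1));
        q[1][s][a1] = (1 - alpha) * q[1][s][a1] + alpha * (R + disc * (omega * v1 + (1 - omega) * v0));
    }
};
```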
4.3 Algorithms
There are three algorithms that we develop. The first one is a Combination of offline Value Iteration and shortest-distance-to-target Deterministic Strategy (CVIDS). This method uses planning (value iteration) for some parts of the environment and options for controlling the agent in the other parts, effectively reducing the size of the state-space. The second algorithm reduces the state-space by employing temporally extended actions (options) and learning only in some areas of the environment. We refer to the latter method as the Options-based Reinforcement Learning Algorithm, ORLA. Furthermore, in order to achieve action-space reduction, we extend ORLA into a multi-agent system with DVF. For consistency, we will refer to the single-agent version of this algorithm as ORLA-SA and to the multi-agent version as ORLA-MA. The final algorithm is a simple fixed hand-coded policy and is used for benchmarking (SDDS).
4.3.1 Combination of offline Value Iteration and shortest-distance-to-target Deterministic Strategy (CVIDS)
For CVIDS, planning only occurs within a subset of the total space, the
learning areas. On the other hand, in the non-learning area, a fixed
deterministic policy is used, in which each guard takes the shortest
distance to the thief. There is a special abstract state “outside of
learning area” which has a constant Q-value, used for updating the
states that lie just on the border of a learning area.
The problem that we have to solve is now re-defined as an MDP with a
smaller state-space. Note that the optimal solution for this derived
MDP will in general not be optimal in the original MDP. We sacrifice
some performance for computational gain and domain-size reduction.
4.3.2 Options-based Reinforcement Learning Algorithm
(ORLA)
4.3.2.1 ORLA-SA
The second algorithm, ORLA-SA, models the problem as an SMDP [3]. As mentioned earlier, only some states (those in QST) have Q-values, and the generalized Q-learning update rule is used:

Q(s, a) ← (1 − α) Q(s, a) + α [R + γ^i max_{a'} Q(s', a')]    (6)

with

R = Σ_{j=0}^{i-1} γ^j r_j

where i is the number of steps since the last update; a is the action initiated at s, which lasted i steps and terminated in state s', generating a total discounted sum of rewards R; r_j is the reward received after step j; and γ is the discount factor.
Consider for example the transition sequence shown in Figure 3.
After the agent is in state S_C (in QST), it takes an action (the action with the highest Q-value) that leads it to state S_D (in NQST). The reward is collected and i is increased. While in S_D, the agent temporally extends the action taken in S_C with a deterministic action. After another transition, the agent is in state S_E (in QST). The extended action is terminated, and the value of S_C is updated using the state value of S_E, according to (6).

Figure 3: The shaded nodes represent QST, the blank nodes NQST; the arrows show the transition after taking an action a in state s; the number i = 0, 1, 2, ... is the number of timesteps since the last update.
Pseudocode for the algorithm:
observe thief behavior
loop (for n number of episodes)
    if s ∈ QST:
        select action a with highest Q-value
        sLast ← s
        aLast ← a
        R ← 0
        i ← 0
    else
        choose deterministically the shortest-distance action a
    end if
    take action a
    observe reward r and new state s'
    R ← R + γ^i r
    i ← i + 1
    if s' ∈ QST:
        Q(sLast, aLast) ← (1 − α) Q(sLast, aLast) + α [R + γ^i max_{a'} Q(s', a')]
    end if
    s ← s'
end loop
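The following C++ sketch captures the core bookkeeping of the pseudocode above: accumulating the discounted reward while outside QST and performing the option-extended update when a QST state is re-entered. The names and structure are illustrative assumptions, not the thesis code.

```cpp
// A minimal sketch of the option-extended (SMDP-style) Q-update used by ORLA-SA.
#include <vector>
#include <algorithm>
#include <cmath>

struct OrlaSaLearner {
    std::vector<std::vector<double>> q;   // q[s][a], only meaningful for s in QST
    double alpha = 0.1, gamma = 0.99;

    int sLast = -1, aLast = -1;           // last state-action taken inside a learning area
    double R = 0.0;                       // discounted reward accumulated since then
    int i = 0;                            // primitive steps elapsed since then

    OrlaSaLearner(int nStates, int nActions)
        : q(nStates, std::vector<double>(nActions, 0.0)) {}

    // Called when an action is initiated in a QST state.
    void startOption(int s, int a) { sLast = s; aLast = a; R = 0.0; i = 0; }

    // Called after every primitive step; performs the update when re-entering QST.
    void step(double r, int sNext, bool sNextInQst) {
        R += std::pow(gamma, i) * r;      // R <- R + gamma^i * r
        ++i;
        if (sNextInQst && sLast >= 0) {
            double maxNext = *std::max_element(q[sNext].begin(), q[sNext].end());
            q[sLast][aLast] = (1 - alpha) * q[sLast][aLast]
                            + alpha * (R + std::pow(gamma, i) * maxNext);
        }
    }
};
```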
4.3.2.2 ORLA-MA
The algorithm described in Section 4.3.2.1 is easily extended for the DVF approach. Instead of having a global Q-value table, we use local Q-value tables, one per agent. Each agent selects the action with the highest Q-value from its own table. Then both agents perform their individually selected actions, observe the next state and receive a global reward based on their joint action ā and s. We can then re-write ORLA-SA as:
observe thief behavior
loop (for n number of episodes)
    if s ∈ QST:
        for each agent g:
            select action a_g with highest Q-value
            aLast_g ← a_g
        end for
        sLast ← s
        R ← 0
        i ← 0
    else
        for each agent g:
            choose deterministically the shortest-distance action a_g
        end for
    end if
    for each agent g:
        take action a_g
    end for
    observe joint action ā, reward r and new state s'
    R ← R + γ^i r
    i ← i + 1
    if s' ∈ QST:
        for each agent g:
            Q_g(sLast, aLast_g) ← (1 − α) Q_g(sLast, aLast_g)
                + α [R + γ^i (ω max_{a'_g} Q_g(s', a'_g) + (1 − ω) max_{a'_g-} Q_g-(s', a'_g-))]
        end for
    end if
    s ← s'
end loop
where g- denotes the other agent, aLast_g is the last action taken by agent g inside a learning area, and ω is a weighting factor signifying the contribution of the state value of agent g- to that of agent g.
4.3.3 Shortest Distance Deterministic Strategy (SDDS)
The SDDS is an agent strategy, where the guards always take the
shortest distance to the thief, i.e. no learning occurs. The distance is
defined as the number of steps it would take the agent to move to the
current position of the thief, taking obstacles into account. Ties are
settled using a priority order over the actions of the guards.
5 Experimental results
In this section, the algorithms discussed in the previous section are compared and evaluated. The evaluation criterion is the mean return computed after learning has finished. It is calculated by summing the individual returns per evaluation episode, ret_i, and dividing that sum by the total number of evaluation episodes, maxEp:

MeanRet = ( Σ_{i=1}^{maxEp} ret_i ) / maxEp
5.1 Experimental setup
The algorithms were implemented in unmanaged C++; the simulator, which is the main tool for running the experiments, was built in managed C++ (Microsoft .NET, MS Visual Studio 2005), with OpenGL used for the visualization. All experiments were run on a Windows XP 32-bit machine with an Intel dual-core 1.73 GHz CPU, 1 GB of memory and a 128 MB video card.

For the first part of the experiments, for all three algorithms, the guards are initialized at the same positions, (9, 10) and (10, 10), and the thief is initialized at one of twelve pre-defined positions. The discrete set of initialization positions is created on purpose, since we want to do a controlled experiment and confirm the hypothesis that fewer initialization positions cost fewer computational resources. Different initialization positions for the thief result in different state-action pairs being visited during learning. With 12 initialization positions, some state-action pairs will not be learned and hence no additional resources are needed for their updates. There are also two initialization positions (among those 12) for the thief where it is impossible for the guards to catch it before it escapes, no matter how good their strategy is. These positions were included on purpose, in order to:

- observe whether those two initialization positions affect performance in a certain way;
- see what the resulting policy will be (using the simulator) for the learning agent(s).

We will also test how the number of initialization positions affects the total performance of the algorithms by using random initialization for the thief, in addition to using the 12 hand-picked initialization positions.
The total size of the learning areas we define depends on:

- how effective we want our algorithms to be: the bigger the learning areas, the more effective the algorithm, but also the more resources are needed;
- how much computational time (and memory) we want to save.

The more resources we try to save, the less effective our methods will be. One of the goals when running these experiments is to discover and discuss what sizes and positions of the learning area(s) work well and give satisfactory performance. Another goal is to determine whether there are any trade-offs between the algorithms or any specific and interesting learning cases.
Figure 4: A snapshot of our domain-simulator. The Learning areas are
represented by orange cells.
First, we run and evaluate the SDDS algorithm; it does not need time for learning and serves as our baseline. The second algorithm to test is CVIDS. The environment functions T and R are pre-computed. After that, offline value iteration is run on the derived MDP (the learning areas, shown as orange squares) and the Q-values are updated.

The third experiment uses ORLA and has two variations: the single-agent and the multi-agent perspective. The environment is an SMDP in this case, but learning occurs only in the orange areas. Since only ORLA-MA achieves a reduction of the action space, we use the memory resources saved to include more states in QST. We add more states to QST by increasing the size of the learning area: we extend the LA horizontally towards the exits (Figure 4). That way, we expect ORLA-MA to perform better than ORLA-SA. As an additional test, we compare ORLA-SA and ORLA-MA for the same size of learning area, without including any additional states in QST for ORLA-MA.
During evaluation, the action selection for CVIDS and ORLA is the same: select the action with the highest Q-value if s ∈ QST; otherwise, use the option.
Before we run our experiments, we define our global parameters. We use a discount factor of γ = 0.99 and a decaying learning rate α (0 ≤ α ≤ 1). The learning rate is initially set high, because in the beginning of learning new experience is more important; α keeps decreasing with a constant decay rate d (0 ≤ d ≤ 1). Setting d high will make the algorithm converge faster to a policy π, but it may well not be an optimal one (π ≠ π*). Choosing d to be very low, on the other hand, will facilitate convergence, but at the cost of longer learning and higher computation time.

In order to decay α, we use the formula (Figure 5):

α[s][a] = α₀ / ( d (N[s][a] − 1) + 1 )

where α[s][a] is the learning rate for the currently updated state-action pair, α₀ is the initial learning rate (set to 1), d is the decay rate and N[s][a] is the number of times a state-action pair has been visited.

Figure 5: Decaying alpha, with decay rates of 0.9, 0.5 and 0.2, respectively.
We use an ε-greedy exploration policy with a fixed ε (0 ≤ ε ≤ 1), where an agent selects a random action at each turn with probability ε.
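A minimal C++ sketch of the visit-count-based learning-rate decay and the ε-greedy action selection described above is given below; the class layout and the RNG seeding are illustrative assumptions.

```cpp
// A minimal sketch of the decaying learning rate and epsilon-greedy selection.
#include <vector>
#include <random>
#include <algorithm>

struct ExplorationSchedule {
    double alpha0 = 1.0;    // initial learning rate
    double decay = 0.02;    // decay rate d
    double epsilon = 0.1;   // exploration factor

    std::vector<std::vector<int>> visits;   // N[s][a]
    std::mt19937 rng{42};                   // fixed seed, illustrative only

    ExplorationSchedule(int nStates, int nActions)
        : visits(nStates, std::vector<int>(nActions, 0)) {}

    // alpha[s][a] = alpha0 / (d * (N[s][a] - 1) + 1), with the visit count incremented first.
    double alpha(int s, int a) {
        ++visits[s][a];
        return alpha0 / (decay * (visits[s][a] - 1) + 1.0);
    }

    // Epsilon-greedy: a random action with probability epsilon, otherwise the greedy one.
    int selectAction(const std::vector<double>& qRow) {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng) < epsilon) {
            std::uniform_int_distribution<int> pick(0, static_cast<int>(qRow.size()) - 1);
            return pick(rng);
        }
        return static_cast<int>(std::max_element(qRow.begin(), qRow.end()) - qRow.begin());
    }
};
```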
5.2 Results
During preliminary experiments we determined the optimal parameters for each of the algorithms. We then compared the methods using these optimal parameters.

For ORLA-MA, we initially fix the exploration factor and try to estimate the optimal parameter ω for weighting the shared agents' value functions. Varying ω between 0.1 and 0.9 yielded the maximum mean return for ω = 0.8. Trying out different exploration factors ε and learning rates α when ω = 0.8 does not seem to have a direct impact on the performance. This suggests that ω is independent of ε and α, and only depends on the complexity of our domain.

The decay rate d is set initially to 0.09. Experimenting with different values, we determined that d = 0.02 is optimal. Setting it higher makes ORLA-SA and ORLA-MA reach a lower value at the end of the learning period. Setting d lower makes the algorithms achieve slightly better performance at the cost of twice as many learning episodes.

Finally, we try out different (constant) exploration factors ε in the range 0.1 to 0.3. Values higher than 0.3 make learning hard. As a result, a value of ε = 0.12 was obtained as seemingly optimal for ORLA-MA and ε = 0.1 for ORLA-SA.
To summarize, the (optimal) parameters are:

ORLA-SA: α₀ = 1, d = 0.02, ε = 0.1
ORLA-MA: α₀ = 1, d = 0.02, ε = 0.12, ω = 0.8
In our first experiment (Table 1), we compare ORLA-SA to CVIDS and
SDDS. The notation “12 | Rand “ shows the type of initialization for
the thief: 12 hand-picked initialization positions or random
initialization.
                    SDDS             CVIDS            ORLA-SA          ORLA-MA
                    12    | Rand     12    | Rand     12    | Rand     12    | Rand
  Mean return       0.24  | 0.075    0.47  | 0.12     0.69  | 0.2      0.65  | 0.17
  Standard error    0.003 | 0.002    0.002 | 0.002    0.001 | 0.001    0.001 | 0.002

Table 1: A comparison between SDDS, CVIDS, ORLA-SA and ORLA-MA (1000 evaluation runs).
For our second major experiment, we compare ORLA-SA and ORLA-MA. Table 2 shows the comparison. In the middle column, the agents learn in exactly the same learning area (but with a smaller action space for ORLA-MA), while the third column shows the performance when using a bigger learning area for ORLA-MA, keeping the total state-action space approximately the same size as for ORLA-SA.
                                  ORLA-SA            ORLA-MA: same LA    ORLA-MA: equal state-action space
                                  12     | Rand      12     | Rand       12    | Rand
  Number of learning episodes     60000  | 150000    30000  | 70000      80000 | 150000
  Mean return                     0.686  | 0.197     0.645  | 0.165      0.750 | 0.113
  Standard error                  0.0019 | 0.0023    0.0013 | 0.0021     0.001 | 0.002

Table 2: The thief is initialized in the 12 pre-defined positions in the environment and then randomly (1000 evaluation runs).
                                  ORLA-SA    ORLA-MA: same LA    ORLA-MA: equal state-action space
  Number of learning episodes     10 000     10 000              10 000
  Mean return                     0.615      0.643               0.678
  Standard error                  0.003      0.003               0.003

Table 3: A comparison between ORLA-SA and ORLA-MA, using the same number of episodes and the same learning rate (1000 evaluation runs). The thief is initialized in 1 of 12 possible positions.
As we can see in Table 3, the multi-agent case of ORLA converges very quickly to a near-optimal policy. Only after 60 000 episodes is the single-agent ORLA able to achieve slightly better performance (Table 2). Using the same state-action space for ORLA-MA (but a bigger learning area) gives an even bigger performance benefit (Table 3).
6 Analysis and discussion
As expected, the shortest-distance strategy (SDDS) is not always optimal, since the thief can take detours and thereby escape from the chasing guards (Fig. 6).

Figure 6: The thief is trying to escape by taking a detour to the closest exit.
6.1 ORLA-SA vs. CVIDS vs. SDDS
The results of our first experiment (Table 1) show a significant advantage for ORLA-SA over the other two algorithms (CVIDS and SDDS). The low performance of SDDS is easily explained: there is no learning or reasoning involved and the naïve approach does not perform that well. CVIDS achieves higher performance than SDDS, because SDDS is a simple hand-coded strategy and CVIDS is a more sophisticated algorithm. The MDP over which planning is performed, however, is different from the original MDP, and thus CVIDS only achieves an optimum in the derived MDP. Finally, ORLA-SA outperforms both SDDS and CVIDS because it uses learning in an SMDP rather than in a derived MDP. In practice, the choice of algorithm depends on whether a model of the environment is available or not.
6.2 ORLA-MA vs. ORLA-SA
After conducting the experiments and observing the mean return and standard error, we can conclude that ORLA-MA performs better after 10 000 learning episodes (Table 3). In the limit, ORLA-MA achieves a slightly lower performance compared with ORLA-SA, but the convergence speed of ORLA-MA is higher. It is surprising that the multi-agent DVF approach performs that well, since convergence to the optimal policy is not guaranteed [4], [19]. We found that if we use 30 000 episodes for learning, the performance of ORLA-MA is sufficiently good. Up to 50 000 learning episodes, ORLA-MA is still able to outperform ORLA-SA. Increasing the number of learning episodes further, however, does not give additional improvement in ORLA-MA's performance.

When using a bigger LA but approximately the same size state-action space (and 12 initialization positions), it is not surprising that ORLA-MA outperforms ORLA-SA, given that they had similar performance using the same LA, since the ORLA-MA algorithm has more area available for learning. The only noticeable trade-off is that, when using a bigger LA, ORLA-MA needs more samples/training episodes to converge. This result is expected: the subset QST is larger and therefore the algorithm requires more samples. On the other hand, when initializing the thief randomly in the bigger LA, performance decreases for ORLA-MA as well as ORLA-SA. Performance of ORLA-MA is significantly lower in the same-size LA as well (Table 2). Apparently, there are some states that are very difficult for ORLA-MA to learn. Figure 7 identifies and compares the learning performance when encountering such a state.
Figure 7: A comparison during learning between ORLA-SA (red) and ORLA-MA (blue) for a single, particularly difficult initialization position.

We can see a similar relation for an "easy" initialization position (Fig. 8).

Figure 8: A comparison between ORLA-SA (red) and ORLA-MA (blue) for a single, relatively easy initialization position.
Figure 9: Performance (during learning) of ORLA-MA, for the same-size
LA as ORLA-SA and random initialization for the thief.
In the difficult case, the mean return of ORLA-MA is pretty close to that of ORLA-SA. Performance is quite unstable and sometimes the mean return even drops below 0, which means that the thief escapes more often than it is captured. In contrast, in the easy case (Fig. 8) ORLA-MA's mean return is still lower, but it never drops below 0. The interpretation of this behavior is that ORLA-MA manages to capture the thief, but not in the minimal number of steps.

The main reason for the unstable multi-agent behavior is that each agent tries to optimize its Q-table individually, not only by receiving a reward at the end of each episode, but also by minimizing the number of steps taken in the course of an episode. Furthermore, the Q-update for the state value of agent i could use a sub-optimal sample from the contribution of the other agent j. Apparently, there are no such difficult cases among the 12 initialization positions of the thief, since ORLA-MA's performance is nearly optimal compared to ORLA-SA (Table 2). As a result, it seems that even one or two difficult initialization positions may worsen the overall performance (Table 2, Fig. 7, and Fig. 9).
An interesting parameter is the exploration factor ε. By changing it even by a small fraction, ORLA-MA's performance varies dramatically. We can only speculate on the reasons for that influence, since we did not find any theoretical or empirical explanation for this phenomenon. Furthermore, each agent explores independently: the optimal joint action ā* is never used for updating Q-values.

Varying ε in ORLA-SA does not seem to have as big an impact on its performance. A value of ε = 0.1 seems optimal for the experiments. As we know, Q-value updates in this case rely solely on the optimal joint action ā*.
7 Related work
The “catch-the-thief” domain is a variation of the well-known predator-prey domain, first formulated by M. Benda et al. [36] in 1986. In the original domain, 4 predators catch the prey if they surround it from all sides, and the transition model of the prey is stochastic. In the catch-the-thief domain, however, 2 guards try to catch a thief by occupying the same cell as the thief, and the thief moves deterministically. We improve upon previous work by employing methods that reduce the state-action space, thus addressing the problem of scalability.

The predator-prey domain has been studied by Peter Stone and Manuela Veloso [37] and Xiao Song Lu and Daniel Kudenko [40]. Using multi-agent systems to solve the original predator-prey domain has yielded good practical results: a recent multi-agent approach to the pursuit-evasion game (another variation of the predator-prey domain) has been formulated by Bruno Bouzy and Marc Métivier [10]; competition between agents has been researched by Jacob Schrum [14]; and dynamic analysis using ε-greedy exploration and multiple agents has recently been conducted by Eduardo Rodrigues Gomes et al. [22]. During our research it was important to determine what type of decentralized multi-agent system to use and what type of coordination mechanisms. We found insights by reviewing the problems and domains where it is useful to apply MAS and distributed control, which were well studied by Stone and Veloso [37]. Furthermore, cooperation and coordination mechanisms between multiple agents have been researched by numerous scientists [6] [8] [12] [20] [24] [26] [31] [38] [39].
The relation between SMDPs and options is well studied in the paper by Sutton et al. [3]. However, multi-agent approaches using options have not been studied that well, and this fact gave additional motivation for conducting our research. In contrast with sparse cooperative Q-learning and collaborative agents, which still compute the joint action in a more efficient manner, we did not employ any of those mechanisms, since the number of agents in our case is small enough and action decomposition does not make a lot of sense. Instead, we reduced the total Q-value table.
The choice has been made to employ multi-agent Distributed Value Functions [4], since they can effectively reduce the total action space. In addition, this approach is one of a few that reduce the total action space (the others being Independent Learners [19] and a probabilistic approach, Frequency Maximum Q-Value, FMQ [45]). Independent learners do not create an incentive to cooperate, since each agent models the environment separately, completely ignoring the presence of the other learning agents, which affect the transition model as well [19]. As a result, estimating that model becomes extremely hard. The other approach, FMQ [45], makes use of probabilities in order to estimate the model. Despite the fact that it reduces the action space, it needs additional resources for computing and storing probabilities [45][17].
8 Conclusions and Future work
8.1 Conclusions
We have presented, compared and analyzed different RL algorithms and evaluated their performance in a deterministic and discrete environment. We were able to efficiently reduce the state-action space, which poses the typical scalability problem in this learning scenario; using options worked well. Our results demonstrate that learning with multiple agents and distributed value functions yields results very close in performance to the centralized RL approach, keeping in mind that convergence of DVF is not guaranteed. It is interesting to note that there are certain trade-offs when applying DVF. First, learning performance may become unstable if the environment is too complex. We also showed that, given the specific thief behavior described in Section 3.2, initialization plays a significant role in learning and convergence, since some state-action pairs are never visited.
8.2 Future work
There are several interesting extensions of the work described in this thesis. One direction for future studies could be using a continuous environment with a combination of options and some kind of global function approximator. A typical example of such an environment is the helicopter hover domain [46]. Another possibility is using local state representations s_i [4]: instead of modeling the whole environment (i.e. the positions of all entities in the catch-the-thief domain), each agent only uses its own position and that of the thief. That way, additional memory space will be saved and scalability will remain linear with respect to the number of agents. In addition, more interesting experiments can be run (e.g. increasing the number of thieves, guards or the size of the LA). It is also worth mentioning that the framework and simulator developed can easily be extended to work with RL-GLUE [43].
Acknowledgments
This master thesis was a challenging research topic. I will take the
opportunity to express my sincere gratitude to the people that made it
possible.
First of all, I thank my parents for their continuous support during my
studies.
I thank both my supervisors - Ph.D. Shimon A. Whiteson and MSc.
Harm van Seijen for their guidance and academic advice.
I would also like to express special thanks to Prof. Jeff Schneider from
CMU for the useful hints he gave me during the course of my research.
Last but not least, I thank my friends for their support and motivation.
Appendices
1. Features to states and vice-versa
Instead of using a series of features for representing a certain state in the environment, it is much easier to encode those positions as a unique integer number. In the case of 2 features X and Y, the formula is:

s = X * (max number of Y positions) + Y

The reverse encoding is as follows:

X = s / (max number of Y positions), where '/' is integer division, i.e. the result is the whole part of the quotient (e.g. 39 / 10 = 3)
Y = s % (max number of Y positions), where '%' is the modulo operation, i.e. the result is the remainder (e.g. 39 % 10 = 9)
In the general case, we denote:

{X_1, X_2, ..., X_N}: the feature set
N: the number of features
noX_i: the number of feature values of feature X_i
x_i: the feature value of feature X_i

s = A_1 x_1 + A_2 x_2 + ... + A_N x_N

where

A_N = 1 and A_k = Π_{i=k+1}^{N} noX_i, for k = 1 ... N−1

The decoding from state s to feature values x_1 ... x_N is, for i = 1 ... N:

x_i = s / A_i (integer division)
s = s % A_i

Essentially, this encoding/decoding can be viewed as a perfect hashing algorithm, in which the hash key is the string with the (unique) positions and the hash value is the integer to which these positions are mapped.
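A minimal C++ sketch of this encoding and decoding is given below; it uses the equivalent mixed-radix (Horner) form of the formulas above, and the function names are illustrative.

```cpp
// A minimal sketch of the feature-vector <-> state-id encoding described above.
#include <vector>

// noX[i] is the number of values feature i can take; x[i] is the current value of feature i.
int encodeState(const std::vector<int>& x, const std::vector<int>& noX) {
    int s = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        s = s * noX[i] + x[i];      // Horner form of s = A1*x1 + A2*x2 + ... + AN*xN with AN = 1
    return s;
}

std::vector<int> decodeState(int s, const std::vector<int>& noX) {
    std::vector<int> x(noX.size());
    for (int i = static_cast<int>(noX.size()) - 1; i >= 0; --i) {
        x[i] = s % noX[i];          // the remainder recovers the least-significant feature
        s /= noX[i];                // integer division strips it off
    }
    return x;
}
```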
2. Actions to Features and vice-versa
The same algorithm is used as in features to states, with the only
difference that instead of the position tuple (x,y) we now have the
tuple (a1,a2) for the discrete actions of each one of the guards.
3. Line segment to line segment intersection
In order to determine whether the line segment, drawn from the
center of the thief to the center of a guard, intersects with walls in the
environment between them, we use simple 2D linear algebra formulas:
The equations of the lines are:

P_a = P_1 + u_a (P_2 − P_1)
P_b = P_3 + u_b (P_4 − P_3)

Solving for the point where P_a = P_b gives the following two equations in two unknowns (u_a and u_b):

x_1 + u_a (x_2 − x_1) = x_3 + u_b (x_4 − x_3)
y_1 + u_a (y_2 − y_1) = y_3 + u_b (y_4 − y_3)

Solving gives the following expressions for u_a and u_b:

u_a = [(x_4 − x_3)(y_1 − y_3) − (y_4 − y_3)(x_1 − x_3)] / [(y_4 − y_3)(x_2 − x_1) − (x_4 − x_3)(y_2 − y_1)]
u_b = [(x_2 − x_1)(y_1 − y_3) − (y_2 − y_1)(x_1 − x_3)] / [(y_4 − y_3)(x_2 − x_1) − (x_4 − x_3)(y_2 − y_1)]

Substituting either of these into the corresponding line equation gives the intersection point. For example, the intersection point (x, y) is:

x = x_1 + u_a (x_2 − x_1)
y = y_1 + u_a (y_2 − y_1)
These equations apply to lines, but since we want to check whether there is an intersection between line segments, it is only necessary to test whether u_a and u_b lie between 0 and 1. We need them both to lie within the range 0 to 1 (0 < u_a, u_b < 1), so that the intersection point is within both line segments (i.e. they actually intersect).

It is also important to note that the inequalities are strict. If they are not, the line segments may happen to intersect exactly at their beginning or end points, resulting in the thief seeing a guard through a wall (Fig. 10).

We only need 2 dimensions, since the guards move with 2 degrees of freedom at a fixed z position (the ground). In addition, we only need to fail (i.e. to observe no intersection) once to determine that the thief can see one of the guards.

Figure 10: A thief can see the bottom guard if the inequality is not strict (0 ≤ u_a, u_b ≤ 1); according to the formula, there is no intersection, hence no obstacle is present between the guard and the thief.
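A minimal C++ sketch of the intersection test, using the formulas above with strict inequalities, is given below; the names are illustrative, and the parallel-segment case is simply treated as no crossing.

```cpp
// A minimal sketch of the parametric segment-segment intersection test.
struct Point { double x, y; };

// Returns true if segment P1-P2 intersects segment P3-P4 with 0 < ua < 1 and 0 < ub < 1,
// i.e. touching exactly at an endpoint does not count as an intersection.
bool segmentsIntersect(Point p1, Point p2, Point p3, Point p4) {
    double denom = (p4.y - p3.y) * (p2.x - p1.x) - (p4.x - p3.x) * (p2.y - p1.y);
    if (denom == 0.0) return false;   // parallel or collinear segments: treat as no crossing
    double ua = ((p4.x - p3.x) * (p1.y - p3.y) - (p4.y - p3.y) * (p1.x - p3.x)) / denom;
    double ub = ((p2.x - p1.x) * (p1.y - p3.y) - (p2.y - p1.y) * (p1.x - p3.x)) / denom;
    return ua > 0.0 && ua < 1.0 && ub > 0.0 && ub < 1.0;
}
```

The thief would then be considered able to see a guard when the sight segment between them crosses none of the wall segments supplied by the environment.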
4. Standard Error calculation
The standard error of the mean is a statistical measure of how much the sample mean is expected to deviate from the true population mean. It is calculated by dividing the standard deviation σ by the square root of the sample size N:

StdErr = σ / √N

with the standard deviation estimated by:

σ = √( (1/N) Σ_{i=1}^{N} (x_i − x̄)² )

where x_i is the i-th sample and x̄ is the mean of the samples.
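A minimal C++ sketch of this calculation is given below; the function name is illustrative.

```cpp
// A minimal sketch of the standard-error computation: sigma / sqrt(N).
#include <vector>
#include <cmath>

double standardError(const std::vector<double>& samples) {
    if (samples.empty()) return 0.0;
    const double n = static_cast<double>(samples.size());
    double mean = 0.0;
    for (double x : samples) mean += x;
    mean /= n;
    double variance = 0.0;                       // population variance, as in the appendix
    for (double x : samples) variance += (x - mean) * (x - mean);
    variance /= n;
    return std::sqrt(variance) / std::sqrt(n);   // sigma / sqrt(N)
}
```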
Bibliography
[1] Reinforcement Learning: An Introduction, Richard S. Sutton,
Andrew G. Barto, 1998
[2] Advances in neural information processing systems 16, MIT 2004
[3] Between MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning, Richard S. Sutton , Doina
Precup, Satinder Singh, 1998
[4] Distributed Value Functions, J. Schneider, W. Wong, A. Moore, M.
Riedmiller, 1999
[5] Problem Solving With Reinforcement Learning, Gavin
Adrian Rummery, 1995
[6] Cooperative Multi-Agent Learning: The State of the Art, Liviu
Panait, Sean Luke, George Mason University, 2005
[7] Viewing and Camera Control in OpenGL, Niels Joubert, James
Andrews, 2008
[8] Sparse Cooperative Q-learning, Jelle R. Kok, Nikos Vlassis, 2004
[9] Empirical Studies in Action Selection with Reinforcement Learning, Shimon Whiteson, Matthew E. Taylor, Peter Stone, 2007
[10] Multi-Agent Model-Based Reinforcement Learning Experiments in
the Pursuit Evasion Game, Bruno Bouzy and Marc M´etivier, 2008
[11] Insights In Reinforcement Learning, Hado Philip van Hasselt,
2011
[12] Self-Organizing Neural Architectures and Cooperative Learning in
a Multiagent Environment, Dan Xiao and Ah-Hwee Tan, Senior
Member, IEEE, 2007
[13] Design and Control of Large Collections of Learning Agents,
Adrian Agogino, 2001
[14] Competition Between Reinforcement Learning Methods in a
Predator-Prey GridWorld, Jacob Schrum, 2008
[15] Time-based Reward Shaping in Real-Time
Strategy Games, Martin Midtgaard, Lars Vinther, Jeppe R.
Christiansen, Allan M. Christensen, and Yifeng Zeng
Aalborg University, Denmark, 2010
[16] Organizational Learning And Network Adaptation
In Multi-Agent Systems, Matthew E. Gaston, 2005
[17] A Comprehensive Survey of Multi-Agent Reinforcement Learning, Lucian Buşoniu, Robert Babuska, Bart De Schutter, 2008
[18] A Theoretical Analysis of Cooperative Behavior in Multi-agent Q-learning, Ludo Waltman, Uzay Kaymak, 2006
[19] Collaborative Multiagent Reinforcement Learning by Payoff
Propagation, Jelle R. Kok, Nikos Vlassis, 2006
[20] Collaborative Reinforcement Learning of Autonomic Behavior,
Jim Dowling, Raymond Cunningham, Eoin Curran, Vinny Cahill,
2004
[21] Convergence in Multiagent Coordination, Consensus, and
Flocking, Vincent D. Blondel, Julien M. Hendrickx, Alex Olshevsky,
John N. Tsitsiklis, 2005
[22] Dynamic Analysis of Multiagent Q-learning with ε-greedy
Exploration, Eduardo Rodrigues Gomes, Ryszard Kowalczyk,
2009, ICML '09 Proceedings of the 26th Annual International
Conference on Machine Learning
[23] Extending Q-Learning to General Adaptive Multi-Agent Systems
Gerald Tesauro, 2004
[24] Multi Agent Reinforcement Learning: Independent vs Cooperative
Agents, Ming Tan,1993
[25] Dynamic Programming, R. E. Bellman, Princeton University Press,
Princeton, NJ, 1957
[26] Distributed On-Line Learning. Cooperative Multi Agent System,
Ido Levy, 2007
[27] Learning With Delayed Rewards, C.J. C. H. Watkins, Ph.D.thesis,
Cambridge University Psychology Department, 1989
[28] Quicker Q-Learning in Multi-Agent Systems, Adrian K. Agogino,
Kagan Tumer, NASA Ames Research Center, 2005
[29] Reinforcement Learning in Large Multiagent
Systems, Adrian K. Agogino, Kagan Tumer, NASA Ames Research
Center, 2006
[30] Role-based Context-specific Multiagent Q-learning, Jiang Da-Wei,
Wang Shi-Yuan, Dong Yi-Sheng, 2007
[31] Using the Max-Plus Algorithm for Multiagent Decision Making in
Coordination Graphs, Jelle R. Kok, Nikos Vlassis, 2005
[32] Reinforcement learning methods for continuous-time Markov
decision problems, Bradtke, S.J., and Duff, M.O.,1995. Advances
in Neural Information Processing Systems 7:393–400. MIT Press,
Cambridge, MA.
[33] Self-improving factory simulation using continuous-time average-reward reinforcement learning, Mahadevan, S., Marchalleck, N.,
Das, T., Gosavi, A., 1997. Proceedings of the 14th International
Conference on Machine Learning, pp. 202–210.
[34] Hierarchical control and learning for Markov decision processes,
Parr, R, 1998, PhD Thesis, University of California at Berkeley,
[36] Benda, M., Jagannathan, V., Dodhiawala, R. (1986). On
optimal cooperation of knowledge sources - an experimental
investigation. Technical Report BCS-G2010-280, Boeing Advanced
Technology Center, Boeing Computing Services, Seattle,
Washington.
[37] Multiagent Systems: A Survey from a Machine Learning
Perspective, Peter Stone, Manuela Veloso, 2000
[38] Multi-Agent Coordination with Event-based Communication
Pedro V. Teixeira, Dimos V. Dimarogonas, Karl H. Johansson,
Joao Sousa, 2010
[39] QUICR-learning for Multi-Agent Coordination
Adrian K. Agogino, Kagan Tumer, NASA, 2006
[40] Reinforcement Learning in the Sensor-Evader Domain
Xiao Song Lu and Daniel Kudenko, Department of Computer
Science University of York, UK, 1999
[41] Artificial Intelligence- A Modern Approach, S. Russell, P.
Norvig, 2003
[42] On-line q-learning using connectionist systems, G. Rummery, M.
Niranjan, 1994.
[43] http://glue.rl-community.org/wiki/Main_Page
[44] Efficient Exploration in Reinforcement Learning - John
Langford, 2006
[45] Reinforcement learning of coordination in cooperative multi-agent
systems, S. Kapetanakis and D. Kudenko, in Proceedings 18th
National Conference on Artificial Intelligence and 14th Conference
on Innovative Applications of Artificial Intelligence (AAAI/IAAI-02), Menlo Park, US, 28 July – 1 August 2002, pp. 326–331.
[46] RL competition, http://www.rl-competition.org/
[47] TD models: Modeling the world at a mixture of time scales, R.S. Sutton, in: Proc. 12th International Conference on Machine Learning, Morgan Kaufmann, San Mateo, CA, 1995, pp. 531–539.