Speeding up Relational Reinforcement Learning
Through the Use of an Incremental First Order
Decision Tree Learner
Kurt Driessens, Jan Ramon and Hendrik Blockeel
Department of Computer Science, K.U.Leuven
Celestijnenlaan 200A, B-3001 Leuven, Belgium
{kurt.driessens,jan.ramon,hendrik.blockeel}@cs.kuleuven.ac.be
Abstract. Relational reinforcement learning (RRL) is a learning technique that combines standard reinforcement learning with inductive logic programming to enable the learning system to exploit structural knowledge about the application domain.
This paper discusses an improvement of the original RRL. We introduce a fully incremental first order decision tree learning algorithm, TG, and integrate this algorithm into the RRL system to form RRL-TG.
We demonstrate the performance gain on experiments similar to those used to demonstrate the behaviour of the original RRL system.
1 Introduction
Relational reinforcement learning is a learning technique that combines reinforcement learning with relational learning or inductive logic programming. Due to the use of a more expressive representation language to represent states, actions and Q-functions, relational reinforcement learning can potentially be applied to a wider range of learning tasks than conventional reinforcement learning. In particular, relational reinforcement learning allows the use of structural representations, the abstraction from specific goals and the exploitation of results from previous learning phases when addressing new (more complex) situations.
The RRL concept was introduced by Dzeroski, De Raedt and Driessens in [5]. The next sections focus on a few details and shortcomings of the original implementation and on the improvements we suggest. These mainly consist of a fully incremental first order decision tree algorithm that is based on the G-tree algorithm of Chapman and Kaelbling [4].
We start with a discussion of the implementation of the original RRL system. We then introduce the TG-algorithm and discuss its integration into RRL to form RRL-TG. Then an overview of some preliminary experiments is given to compare the performance of the original RRL system with RRL-TG, and we discuss some of the characteristics of RRL-TG.
Initialise Q̂_0 to assign 0 to all (s, a) pairs
Initialise Examples to the empty set
e := 0
while true do
    generate an episode that consists of states s_0 to s_i and actions a_0 to a_{i-1}
        (where a_j is the action taken in state s_j) through the use of a standard
        Q-learning algorithm, using the current hypothesis for Q̂_e
    for j = i-1 to 0 do
        generate example x = (s_j, a_j, q̂_j),
            where q̂_j := r_j + γ max_{a'} Q̂_e(s_{j+1}, a')
        if an example (s_j, a_j, q̂_old) exists in Examples, replace it with x,
        else add x to Examples
    update Q̂_e using TILDE to produce Q̂_{e+1} using Examples
    for j = i-1 to 0 do
        for all actions a_k possible in state s_j do
            if the state-action pair (s_j, a_k) is optimal according to Q̂_{e+1}
            then generate example (s_j, a_k, c) where c = 1
            else generate example (s_j, a_k, c) where c = 0
    update P̂_e using TILDE to produce P̂_{e+1} using these examples (s_j, a_k, c)
    e := e + 1

Fig. 1. The RRL algorithm.
2 Relational Reinforcement Learning
The RRL algorithm consists of two learning tasks. In a first step, the classical Q-learning algorithm is extended by using a relational regression algorithm to represent the Q-function with a logical regression tree, called the Q-tree. In a second step, this Q-tree is used to generate a P-function. This P-function describes the optimality of a given action in a given state, i.e., given a state-action pair, the P-function describes whether this pair is part of an optimal policy. The P-function is represented by a logical decision tree, called the P-tree. More information on logical decision trees (classification and regression) can be found in [1, 2].
2.1 The Original RRL Implementation
Figure 1 presents the original RRL algorithm. The logical regression tree that represents the Q-function in RRL is built starting from a knowledge base which holds correct examples of state, action and Q-function value triplets. To generate the examples for tree induction, RRL starts by running a normal episode, just like a standard reinforcement learning algorithm [10, 9, 6], and stores the states, actions and q-values encountered in a temporary table. At the end of each episode, this table is added to a knowledge base which is then used for the tree induction phase. In the next episode, the q-values predicted by the generated tree are used to calculate the q-values for the new examples, but the Q-tree is generated again from scratch.
The examples used to generate the P-tree are derived from the obtained Q-tree. For each state the RRL system encounters during an episode, it investigates all possible actions and classifies those actions as optimal or not according to the q-values predicted by the learned Q-tree. Each of these classifications, together with the corresponding states and actions, is used as input for the P-tree building algorithm.
Note that old examples are kept in the knowledge base at all times and never deleted. To avoid having too many old (and noisy) examples in the knowledge base, if a state-action pair is encountered more than once, the old example in the knowledge base is replaced with a new one which holds the updated q-value.
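To make the flow of Figure 1 concrete, the following Python sketch shows how one episode could be turned into (state, action, q) examples and how an old example for the same state-action pair is replaced. The simulator, the action generator, the discount factor and the tree-learner interface are hypothetical placeholders rather than the original implementation, and states are assumed to be hashable.

    # Minimal sketch of one iteration of the RRL loop of Figure 1 (assumed
    # helper names; TILDE itself is not shown).
    def rrl_iteration(q_tree, examples, simulate_episode, legal_actions, gamma=0.9):
        # Run one episode, using the current Q-tree as the Q-function hypothesis.
        trace = simulate_episode(q_tree)  # list of (state, action, reward, next_state)
        # Walk the episode backwards and compute the q-value targets.
        for state, action, reward, next_state in reversed(trace):
            q_hat = reward + gamma * max(
                (q_tree.predict(next_state, a) for a in legal_actions(next_state)),
                default=0.0)
            # Replace an earlier example for the same state-action pair, if any.
            examples[(state, action)] = q_hat
        # Rebuild the Q-tree from scratch on all stored examples; this is the
        # step that TG later makes incremental.
        new_q_tree = q_tree.relearn([(s, a, q) for (s, a), q in examples.items()])
        return new_q_tree, examples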
2.2 Problems with the Original RRL
We identify four problems with the original RRL implementation that diminish its performance. First, it needs to keep track of an ever increasing number of examples: for each different state-action pair ever encountered, a Q-value is kept. Second, when a state-action pair is encountered for the second time, the new Q-value needs to replace the old value, which means that the old example needs to be looked up in the knowledge base and replaced. Third, trees are built from scratch after each episode. This step, as well as the example replacement procedure, takes increasingly more time as the set of examples grows. A final point is that the leaves of a tree are supposed to identify clusters of equivalent state-action pairs, "equivalent" in the sense that they all have the same Q-value. When updating the Q-value for one state-action pair in a leaf, the Q-value of all pairs in that leaf should automatically be updated. This is not what happens in the original implementation: an existing (state, action, Q) example only gets an updated Q-value when exactly the same state-action pair is encountered again, not when a state-action pair in the same leaf is.
2.3 Possible Improvements
To solve these problems, a fully incremental induction algorithm is needed. Such an algorithm would remove the need to regenerate the tree when new data becomes available. However, not just any incremental algorithm can be used. Q-values generated with reinforcement learning are usually wrong at the beginning, and the algorithm needs to handle this appropriately.
Moreover, an algorithm that does not require old examples to be kept for later reference no longer generalises from outdated information and thus eliminates the need to replace examples in the knowledge base. In a fully incremental system, the problem of storing different q-values for similar but distinct state-action pairs should also be solved.
In the following section we introduce such an incremental algorithm.
3 The TG Algorithm
3.1 The G-algorithm
We use a learning algorithm that is an extension of the G-algorithm [4]. This is a decision tree learning algorithm that updates its theory incrementally as examples are added. An important feature is that examples can be discarded after they are processed. This avoids using a huge amount of memory to store examples.
At a high level (cf. Figure 2), the G-algorithm (as well as the new TG-algorithm) stores the current decision tree, and for each leaf node the statistics for all tests that could be used to split that leaf further. Each time an example is inserted, it is sorted down the decision tree according to the tests in the internal nodes, and in the leaf the statistics of the tests are updated.
create an empty leaf
while data available do
    sort data down to leaf
    update statistics in leaf
    if split needed then
        grow two empty leaves
endwhile

Fig. 2. The TG-algorithm.
The examples are generated by a simulator of the environment, according to some reinforcement learning strategy. They are tuples (State, Action, QValue).
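The Python sketch below illustrates this insertion loop: an example is sorted down to a leaf, only per-leaf statistics are updated, and the example can then be discarded. The class and function names, the representation of tests as callables, and the omitted split step are simplifications for illustration, not the actual TG data structures.

    # Sketch of incremental insertion in a (T)G-style tree.  Only statistics
    # are stored in the leaves, so examples never need to be kept.
    class Leaf:
        def __init__(self, candidate_tests):
            self.total = [0, 0.0, 0.0]                       # n, sum(q), sum(q^2)
            self.on_test = {t: [0, 0.0, 0.0] for t in candidate_tests}
            self.q_estimate = 0.0                            # value predicted in this leaf

    class Node:
        def __init__(self, test, yes, no):
            self.test, self.yes, self.no = test, yes, no

    def insert_example(tree, example, q_value):
        # Sort the example down to a leaf along the tests in the internal nodes.
        while isinstance(tree, Node):
            tree = tree.yes if tree.test(example) else tree.no
        leaf = tree
        leaf.total = [leaf.total[0] + 1,
                      leaf.total[1] + q_value,
                      leaf.total[2] + q_value * q_value]
        leaf.q_estimate = leaf.total[1] / leaf.total[0]
        # Update the statistics of every candidate test in this leaf.
        for test, (n, s, ss) in leaf.on_test.items():
            if test(example):
                leaf.on_test[test] = [n + 1, s + q_value, ss + q_value * q_value]
        # The example itself can now be discarded; when some test becomes
        # significant, the leaf is replaced by a Node with two fresh, empty leaves.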
3.2 Extending to First Order Logic
We extend the G-algorithm by using a relational representation language for describing the examples and for the tests that can be used in the decision tree. This has several advantages. First, it allows us to model examples that can't be stored in a propositional way. Second, it allows us to model the feature space. Even when it would be possible in theory to enumerate all features, as in the case of the blocks world with a limited number of blocks, a problem is only tractable when a smaller number of features is used. The relational language can be seen as a way to construct useful features. For example, when there are no blocks on some block A, it is not useful to provide a feature that checks which blocks are on A. The use of a relational language also allows us to structure the feature space, as e.g. on(State,block_a,block_b) and on(State,block_c,block_d) are treated in exactly the same way.
The construction of new tests happens through a refinement operator. A more detailed description of this part of the system can be found in [1].
on(State,BlockB,BlockA)
    yes: clear(State,BlockB)
             yes: Qvalue = 0.5
             no:  Qvalue = 0.4
    no:  Qvalue = 0.3

    further candidate tests shown in the figure: on(State,BlockA,BlockB),
    on(State,BlockC,BlockB), ..., clear(State,BlockA)

Fig. 3. A logical decision tree.
Figure 3 gives an example of a logical decision tree. The test in a node should be read as the existentially quantified conjunction of all literals in the nodes on the path from the root of the tree to that node. For instance, the leaf predicting Qvalue = 0.5 in Figure 3 corresponds to the query ∃ BlockA, BlockB : on(State,BlockB,BlockA) ∧ clear(State,BlockB).
The statistics for each leaf consist of the number of examples on which each possible test succeeds, the sum of their Q-values and the sum of their squared Q-values. Moreover, the Q-value to predict in that leaf is stored. This value is obtained from the statistics of the test used to split its parent when the leaf was created. Later, this value is updated as new examples are sorted into the leaf. These statistics are sufficient to compute whether some test is significant, i.e. whether the variance of the Q-values of the examples would be reduced by splitting the node using that particular test. A node is split after some minimal number of examples has been collected and some test becomes significant with high confidence.
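As an illustration, the sketch below computes the variance reduction of a candidate test from exactly these sufficient statistics (count, sum and sum of squares, for all examples in the leaf and for those on which the test succeeds). The concrete significance threshold used by TG is not reproduced here; this only shows the underlying computation.

    # Variance reduction of a candidate test, computed from sufficient statistics.
    def variance(n, s, ss):
        # Population variance from count n, sum s and sum of squares ss.
        return ss / n - (s / n) ** 2 if n > 0 else 0.0

    def variance_reduction(total, succeeded):
        n, s, ss = total                         # all examples sorted into the leaf
        n1, s1, ss1 = succeeded                  # examples on which the test succeeds
        n0, s0, ss0 = n - n1, s - s1, ss - ss1   # failing examples, by subtraction
        if n1 == 0 or n0 == 0:
            return 0.0                           # the test does not split the leaf at all
        before = variance(n, s, ss)
        after = (n1 * variance(n1, s1, ss1) + n0 * variance(n0, s0, ss0)) / n
        return before - after

    # A leaf is split on its best test once enough examples have been collected
    # and the reduction is significant with high confidence.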
In contrast to the propositional system, keeping track of the candidate tests (the refinements of a query) is a non-trivial task. In the propositional case the set of candidate queries consists of the set of all features minus the features that are already tested higher in the tree. In the first order case, the set of candidate queries consists of all possible ways to extend a query. The longer a query is and the more variables it contains, the larger the number of possible ways to bind the variables and the larger the set of candidate tests.
Since a large number of such candidate tests exist, we need to store them as efficiently as possible. To this end we use the query packs mechanism introduced in [3]. A query pack is a set of similar queries structured into a tree; common parts of the queries are represented only once in such a structure. For instance, a set of conjunctions {(p(X), q(X)), (p(X), r(X))} is represented as a term p(X), (q(X); r(X)). This can yield a significant gain in practice. (E.g., assuming a constant branching factor of b in the tree, storing a pack of n queries of length l (n = b^l) takes O(nb/(b-1)) memory instead of O(nl).)
Also, executing a set of queries structured in a pack on the examples requires considerably less time than executing them all separately. An empirical comparison of the speeds with and without packs is given in Section 4.1.
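A minimal sketch of the query-pack idea follows: queries that share a prefix are stored as a tree so that common literals appear only once. Literals are plain strings here and the pack is a nested dictionary; the real packs of [3] are Prolog terms executed by a dedicated engine, so this is only a structural analogy.

    # Build a pack (prefix tree) from a set of conjunctive queries.
    def build_pack(queries):
        pack = {}
        for query in queries:              # each query is a list of literals
            node = pack
            for literal in query:
                node = node.setdefault(literal, {})
        return pack

    # {(p(X), q(X)), (p(X), r(X))} becomes the pack p(X), (q(X); r(X)):
    pack = build_pack([["p(X)", "q(X)"], ["p(X)", "r(X)"]])
    # pack == {'p(X)': {'q(X)': {}, 'r(X)': {}}}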
Even when stored in packs, the queries require much memory. However, the packs in the leaf nodes are very similar, so a further optimisation is to reuse them. When a node is split, the pack for the new right leaf node is the same as the original pack of the node. For the new left sub-node, we currently only reuse the pack if the added test does not introduce new variables, because in that case the query pack in the left leaf node is equal to the pack in the original node (except for the chosen test, which of course cannot be taken again). In further work, we will also reuse query packs in the more difficult case where the added test introduces new variables.
3.3 RRL-TG
To integrate the TG-algorithm into RRL, we removed the calls to the TILDE algorithm and use TG to adapt the Q-tree and P-tree when new experience (in the form of (State, Action, QValue) triplets) becomes available. This solves the problems mentioned in Section 2.2.
- The trees are no longer generated from scratch after each episode.
- Because TG only stores statistics about the examples in the tree and references each example only once (when it is inserted into the tree), there is no longer any need to remember examples, and therefore no need to search for and replace them.
- Since TG begins each new leaf with completely empty statistics, examples have a limited life span, and old (possibly noisy) q-value examples are discarded even if the exact same state-action pair is never encountered twice.
Since the bias used by this incremental algorithm is the same as that of TILDE, the same theories can be learned by TG. Both algorithms search the same hypothesis space, and although TG can be misled in the beginning due to its incremental nature, in the limit the quality of the approximations of the q-values will be the same.
4 Experiments
In this section we compare the performance of the new RRL-TG with the original RRL system. Further comparison with other systems is difficult because there are no other implementations of RRL so far. Also, the number of first order regression techniques is limited, and a full comparison of first order regression is outside the scope of this paper.
In a first step, we reran the experiments described in [5]. The original RRL was tested on the blocks world [8] with three different goals: unstacking all blocks, stacking all blocks and stacking two specific blocks. Blocks can be on the floor or can be stacked on each other. Each state can be described by a set (list) of facts, e.g., s1 = {clear(a), on(a,b), on(b,c), on(c,floor)}. The available actions are then move(x,y) where x ≠ y, x is a block and y is a block or the floor. The unstacking goal is reached when all blocks are on the floor, the stacking goal when only one block is on the floor. The goal of stacking two specific blocks (e.g. a and b) is reached when the fact on(a,b) is part of the state description; note however that RRL learns to stack any two specific blocks by generalising from on(a,b) or on(c,d) to on(X,Y), where X and Y can be substituted by any two blocks. In the first type of experiments, a Q-function was learned for state spaces with 3, 4 and 5 blocks. In the second type, a more general strategy for each goal was learned by using policy learning on state spaces with a varying number of blocks.
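To make this representation concrete, the sketch below encodes a blocks-world state as a mapping from each block to what it rests on and enumerates the move(x,y) actions exactly as defined above. The dict encoding and helper names are illustrative choices for this sketch, not the representation used inside RRL.

    # A state as a set of facts, encoded as a dict: each block maps to what it
    # rests on ('floor' or another block).
    # s1 = {clear(a), on(a,b), on(b,c), on(c,floor)}:
    s1 = {'a': 'b', 'b': 'c', 'c': 'floor'}

    def clear_blocks(on):
        # A block is clear when no other block rests on it.
        return set(on) - {b for b in on.values() if b != 'floor'}

    def actions(on):
        # move(x, y) with x != y, x a block and y a block or the floor.
        return [('move', x, y)
                for x in on
                for y in set(on) | {'floor'}
                if x != y]

    print(clear_blocks(s1))                        # {'a'}
    print(('move', 'a', 'floor') in actions(s1))   # True
    # stacking goal: only one block on the floor; unstacking goal: all blocks on
    # the floor; on(a,b) goal: the fact on(a,b) is part of the state description.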
Afterwards, we discuss some experiments that study the convergence behaviour of the TG-algorithm and the sensitivity of the algorithm to some of its parameters.
4.1 Fixed State Spaces

Fig. 4. The learning curves for fixed numbers of blocks: average reward versus number of episodes for the 'stack', 'unstack' and 'on(a,b)' goals with 3, 4 and 5 blocks.
For the experiments with a fixed number of blocks, the results are shown in Figure 4. The generated policies were tested on 10 000 randomly generated examples. A reward of 1 was given if the goal state was reached in the minimal number of steps. The graphs show the average reward per episode. This gives an indication of how well the Q-function generates the optimal policy. Compared to the original system, the number of episodes needed for the algorithm to converge to the correct policy is much larger. This can be explained by two characteristics of the new system. Where the original system would automatically correct its mistakes when generating the next tree from scratch, the new system has to compensate for the mistakes it makes near the root of the tree, due to faulty Q-values generated at the beginning, by adding extra branches to the tree. To limit this effect, a parameter has been added to the TG-algorithm which specifies the number of examples that need to be collected in a leaf before the leaf can be split. These two factors, the overhead for correcting mistakes and the delay put on splitting leaves, cause the new RRL system to converge more slowly in terms of the number of episodes.
However, when comparing timing results, the new system clearly outperforms the old one, even with the added number of necessary episodes. Table 1 compares the execution times of RRL-TG with the timings of the RRL system given in [5]. For the original RRL system, running experiments in state spaces with more than 5 blocks quickly became impossible. Learning in a state space with 8 blocks, RRL-TG took 6.6 minutes to learn for 1000 episodes, enough to allow it to converge for the stacking goal.
Table 1. Execution time of the RRL algorithm on Sun Ultra 5/270 machines.

                                              3 blocks    4 blocks     5 blocks
Original RRL    Stacking (30 episodes)        6.15 min    62.4 min     306 min
                Unstacking (30 episodes)      8.75 min    not stated   not stated
                On(a,b) (30 episodes)         20 min      not stated   not stated
RRL-TG          Stacking (200 episodes)       19.2 sec    26.5 sec     39.3 sec
without packs   Unstacking (500 episodes)     1.10 min    1.92 min     2.75 min
                On(a,b) (5000 episodes)       25.0 min    57 min       102 min
RRL-TG          Stacking (200 episodes)       8.12 sec    11.4 sec     16.1 sec
with packs      Unstacking (500 episodes)     20.2 sec    35.2 sec     53.7 sec
                On(a,b) (5000 episodes)       5.77 min    7.37 min     11.5 min
4.2 Varying State Spaces
In [5] the strategy used to cope with multiple state spaces was to start learning on the smallest and then expand the state space after a number of learning episodes. This strategy worked for the original RRL system because examples were never erased from the knowledge base, so when changing to a new state space the examples from the previous one would still be used to generate the Q- and P-trees.
However, if we apply the same strategy to RRL-TG, the TG-algorithm first generates a tree for the small state space and, after changing to a larger state space, expands the tree to fit the new state space by adding new branches, which makes the tests in the original tree insignificant and effectively discards what was learned before. It would never generalise over the different state spaces. Instead, we offer a different state space to RRL-TG every episode. This way, the examples the TG-algorithm uses to split leaves come from different state spaces, which allows RRL-TG to generalise over them. In the experiments on the blocks world, we varied the number of blocks between 3 and 5 while learning. To make sure that the examples are spread throughout several state spaces, we set the minimal number of examples needed to split a leaf to 2400, quite a bit higher than for fixed state spaces. As a result, RRL-TG requires a higher number of episodes to converge.
At episode 15 000, P-learning was started. Note that since P-learning depends on the presence of good Q-values, in the incremental tree learning setting it is unwise to start building P-trees from the beginning: the Q-values at that time are misleading, causing suboptimal splits to be inserted into the P-tree early on. Due to the incremental nature of the learning process, these suboptimal splits are not removed afterwards, which in the end leads to a more complex tree that is learned more slowly. By letting P-learning start only after a reasonable number of episodes, this effect is reduced, although not necessarily removed entirely (as Q-learning continues in parallel with P-learning).
Fig. 5. The learning curves for varying numbers of blocks ('stack Var', 'unstack Var' and 'on(a,b) Var'; average reward versus number of episodes). P-learning is started at episode 15 000.
Figure 5 shows the learning curves for the three different goals. The generated policies were tested on 10 000 randomly generated examples with the number of blocks varying between 3 and 10. The first part of each curve indicates how well the policy generated by the Q-tree performs. From episode 15 000 onwards, when P-learning is started, the test results are based on the P-tree instead of the Q-tree; on the graphs this is visible as a sudden drop of performance back to almost zero, followed by a new learning curve (since P-learning starts from scratch). It can be seen that P-learning quickly converges to a better performance level than Q-learning attained. After P-learning starts, the Q-tree is still updated but no longer tested. We can see that although the Q-tree is not able to generalise over unseen state spaces, the P-tree, which is derived from the Q-tree, usually can.
The jump to an accuracy of 1 in the Q-learning phase for stacking the blocks is purely accidental, and disappears when the Q-tree starts to model the encountered state spaces more accurately. The Q-tree generated at that point in time is shown in Figure 6. Although this tree does not represent the correct Q-function in any way, it does, by accident, lead to the correct policy for stacking any number of blocks. Later on in the experiment, the Q-tree is much larger and represents the Q-function much more closely, but it no longer represents an overall optimal policy.
state(S),action_move(X,Y),numberofblocks(S,N)
height(S,Y,H),diff(N,H,D),D<2 ?
yes: qvalue(1.0)
no:  height(S,B,C),height(S,Y,D),D<C ?
     yes: qvalue(0.149220955007456)
     no:  qvalue(0.507386078412001)

Fig. 6. The Q-tree for stacking after 4000 episodes.
The algorithm did not converge for the on(A,B) goal, but we did obtain better performance than the original RRL, probably because we were able to run more learning episodes.
4.3 The Minimal Sample Size
We decided to further explore the behaviour of the RRL-TG algorithm with respect to the delay parameter which specifies the minimal sample size TG needs to split a leaf. To test the effect of the minimal sample size, we ran RRL-TG for a total of 10 000 episodes, starting P-learning after 5000 episodes, and varied the minimal sample size from 50 to 2000 examples. We ran these tests with the unstacking goal.
Figure 7 shows the learning curves for the different settings; Table 2 shows the sizes of the resulting trees after 10 000 episodes (with P-learning starting at episode 5000).
Fig. 7. The learning curves for varying minimal sample sizes (50, 400, 1000 and 2000 examples; average reward versus number of episodes).

The algorithm does not converge for a sample size of 50 examples.
Even after P-learning, optimality is not reached: the algorithm makes too many mistakes by relying on too small a set of examples when choosing a test. An increasing sample size brings slower convergence to the optimal policy. For a sample size of 1000 examples, the dip in the learning curve during P-learning is caused by a change in the Q-function (which is still being learned during the second phase). Looking at the sizes of the generated trees, it is obvious that the generated policies benefit from a larger minimal sample size, at the cost of slower convergence. Although experiments with a smaller minimal sample size have to correct their early mistakes by building larger trees, they often still succeed in generating an optimal policy.
Table 2. The generated tree sizes for varying minimal sample sizes.

              50    400   1000   2000
    Q-tree    31    26    16     9
    P-tree    9     10    8      8
5 Concluding Remarks
This paper described how we upgraded the G-tree algorithm of Chapman and Kaelbling [4] to the new TG-algorithm and, by doing so, greatly improved the speed of the RRL system presented in [5].
We studied the convergence behaviour of RRL-TG with respect to the minimal sample size TG needs to split a leaf. Larger sample sizes mean slower convergence but smaller function representations.
We are planning to apply the RRL algorithm to the Tetris game to study the behaviour of RRL in domains more complex than the blocks world.
Future work will certainly include investigating other representation possibilities for the Q-function. Work towards first order neural networks and Bayesian networks [7] seems to provide promising alternatives.
Further work on improving the TG algorithm will include the use of multiple trees to represent the Q- and P-functions and further attempts to decrease the amount of memory used. Also, some work should be done to automatically find the right moment to start P-learning and the optimal minimal sample size. Both influence the size of the generated policy and the convergence speed of RRL-TG.
Acknowledgements
The authors would like to thank Saso Dzeroski, Luc De Raedt and Maurice Bruynooghe for their suggestions concerning this work. Jan Ramon is supported by the Flemish Institute for the Promotion of Science and Technological Research in Industry (IWT). Hendrik Blockeel is a post-doctoral fellow of the Fund for Scientific Research of Flanders (FWO-Vlaanderen).
References
1. H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1-2):285-297, June 1998.
2. H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, pages 55-63, 1998. http://www.cs.kuleuven.ac.be/~ml/PS/ML98-56.ps.
3. H. Blockeel, B. Demoen, L. Dehaspe, G. Janssens, J. Ramon, and H. Vandecasteele. Executing query packs in ILP. In J. Cussens and A. Frisch, editors, Proceedings of the 10th International Conference on Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, pages 60-77, London, UK, July 2000. Springer.
4. D. Chapman and L. P. Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International Joint Conference on Artificial Intelligence, 1991.
5. S. Dzeroski, L. De Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning, 43:7-52, 2001.
6. L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
7. K. Kersting and L. De Raedt. Bayesian logic programs. In Proceedings of the Tenth International Conference on Inductive Logic Programming, work in progress track, 2000.
8. P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.
9. T. Mitchell. Machine Learning. McGraw-Hill, 1997.
10. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.