Transfer in Reinforcement Learning
via Markov Logic Networks
Lisa Torrey, Jude Shavlik,
Sriraam Natarajan, Pavan Kuppili, Trevor Walker
University of Wisconsin-Madison, USA
Possible Benefits of Transfer in RL
[Figure: learning curves in the target task, plotting performance against training time, comparing with transfer and without transfer]
The RoboCup Domain
2-on-1 BreakAway
3-on-2 BreakAway
Reinforcement Learning
[Diagram: the agent-environment loop; the agent sends an action to the environment, which returns a state and a reward]
States are described by features:
  distance(me, teammate1) = 15
  distance(me, opponent1) = 5
  angle(opponent1, me, teammate1) = 30
  …
Actions are: Move, Pass, Shoot
Rewards are: +1 for scoring, 0 otherwise
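To make the state, action, and reward description above concrete, here is a minimal sketch in Python. The feature keys, action names, and the tabular Q-update are illustrative only, not the actual RoboCup/BreakAway implementation, which learns an approximate Q-function over these features rather than a table.

```python
# Minimal sketch of the state/action/reward setup above. Feature names,
# action names, and the tabular Q-function are illustrative.
from collections import defaultdict

ACTIONS = ["move", "pass", "shoot"]

def example_state():
    # A state is a set of relational features with numeric values.
    return {
        "distance(me,teammate1)": 15,
        "distance(me,opponent1)": 5,
        "angle(opponent1,me,teammate1)": 30,
    }

def reward(scored_goal):
    # +1 for scoring, 0 otherwise.
    return 1.0 if scored_goal else 0.0

# Standard Q-learning update, keyed here by (state, action) for simplicity.
q = defaultdict(float)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.97):
    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
```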
Our Previous Methods
Skill transfer
  Learn a rule for when to take each action
  Use the rules as advice
Macro transfer
  Learn a relational multi-step action plan
  Use the macro to demonstrate
Transfer via Markov Logic Networks
[Diagram: the source-task learner produces a Q-function and data; these are analyzed to learn an MLN Q-function, which is then used to demonstrate for the target-task learner]
Markov Logic Networks
[Diagram: a small Markov network over nodes X, Y, Z, A, B]
A Markov network models a joint distribution
A Markov Logic Network combines probability with logic
  Template: a set of first-order formulas with weights
  Each grounded predicate in a formula becomes a node
  Predicates in a grounded formula are connected by arcs
Probability of a world:  P(world) = (1/Z) exp( Σi WiNi )
Richardson and Domingos, Machine Learning, 2006
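As a small worked illustration of the world-probability formula above, the Python sketch below scores a few hypothetical worlds by exp(Σ WiNi) and normalizes by Z; the weights, worlds, and grounding counts are made up.

```python
import math

# Hedged sketch: toy formulas, weights, and grounding counts, not taken
# from the paper. Each world is scored by exp(sum_i W_i * N_i(world)),
# where N_i is the number of true groundings of formula i in that world.
weights = [0.75, 1.33]

def world_score(counts):
    return math.exp(sum(w * n for w, n in zip(weights, counts)))

# Three candidate worlds with their per-formula grounding counts:
worlds = {"w1": [1, 3], "w2": [1, 0], "w3": [0, 0]}

Z = sum(world_score(counts) for counts in worlds.values())
probs = {name: world_score(counts) / Z for name, counts in worlds.items()}
```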
MLN Q-function
Formula 1:  IF   distance(me, Teammate) < 15
            AND  angle(me, goalie, Teammate) > 45
            THEN Q ∈ (0.8, 1.0)
            W1 = 0.75,  N1 = 1 teammate

Formula 2:  IF   distance(me, GoalPart) < 10
            AND  angle(me, goalie, GoalPart) > 45
            THEN Q ∈ (0.8, 1.0)
            W2 = 1.33,  N2 = 3 goal parts

Probability that Q ∈ (0.8, 1.0):
  exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
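Plugging the slide's weights and counts into this expression gives a concrete probability; a quick check in Python (with a single bin as the query, the two-outcome normalization reduces to a sigmoid):

```python
import math

# Using the slide's values: W1 = 0.75 with N1 = 1 satisfied grounding,
# W2 = 1.33 with N2 = 3 satisfied groundings.
w1, n1 = 0.75, 1
w2, n2 = 1.33, 3

s = w1 * n1 + w2 * n2                  # 0.75 + 3.99 = 4.74
p = math.exp(s) / (1.0 + math.exp(s))  # sigmoid(4.74) ≈ 0.991
print(p)
```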
Grounded Markov Network
[Diagram: the grounded query node Q ∈ (0.8, 1.0) connected to the grounded predicates distance(me, teammate1) < 15, angle(me, goalie, teammate1) > 45, distance(me, goalLeft) < 10, angle(me, goalie, goalLeft) > 45, distance(me, goalRight) < 10, and angle(me, goalie, goalRight) > 45]
Learning an MLN
  Find good Q-value bins using hierarchical clustering
  Learn rules that classify examples into bins using inductive logic programming
  Learn weights for these formulas to produce the final MLN
Binning via Hierarchical Clustering
[Figure: Q-value frequency histograms showing Q-values being merged step by step into a small set of bins]
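A minimal sketch of the binning step, assuming standard agglomerative (Ward) clustering over one-dimensional Q-values via SciPy; the number of bins and the Q-values themselves are illustrative, since the slide does not fix them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hedged sketch: cluster 1-D Q-values hierarchically and cut the tree
# into a small number of bins. The bin count (3) and the random Q-values
# are illustrative, not the paper's actual choices.
q_values = np.random.RandomState(0).uniform(0.0, 1.0, size=200)

Z = linkage(q_values.reshape(-1, 1), method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# Read bin boundaries off as the min/max Q-value within each cluster.
bins = [(q_values[labels == k].min(), q_values[labels == k].max())
        for k in np.unique(labels)]
```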
Classifying Into Bins via ILP
Given examples
  Positive: inside this Q-value bin
  Negative: outside this Q-value bin
The Aleph* ILP learning system finds rules that separate positive from negative
  Builds rules one predicate at a time
  Top-down search through the feature space
* Srinivasan, 2001
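As a rough illustration of building rules one predicate at a time with a top-down search, here is a simplified propositional sketch: greedily add the threshold test that best separates positives from negatives. This is not Aleph itself, which searches over first-order literals.

```python
# Simplified greedy top-down rule construction (a propositional stand-in
# for Aleph's first-order search). A rule is a conjunction of threshold
# tests; each step adds the single test that best separates the classes.
def covers(rule, example):
    return all(example[f] > t if op == ">" else example[f] < t
               for f, op, t in rule)

def learn_rule(positives, negatives, candidate_tests, max_literals=4):
    rule = []
    for _ in range(max_literals):
        best = max(
            candidate_tests,
            key=lambda test: sum(covers(rule + [test], p) for p in positives)
                           - sum(covers(rule + [test], n) for n in negatives))
        rule.append(best)
        negatives = [n for n in negatives if covers(rule, n)]
        if not negatives:          # no negatives left: rule is consistent
            break
    return rule

# A candidate test might look like ("distance(me,GoalPart)", "<", 10).
```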
Learning Formula Weights
Given formulas and examples
  Same examples as for ILP
  ILP rules as the network structure
Alchemy* finds weights that make the probability estimates accurate
  Scaled conjugate-gradient algorithm
* Kok, Singla, Richardson, Domingos, Sumner, Poon and Lowd, 2004-2007
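Alchemy fits the weights with a scaled conjugate-gradient algorithm; as a rough stand-in, the sketch below fits weights for the same kind of logistic model by plain gradient ascent on the log-likelihood, using made-up (counts, in-bin) examples.

```python
import math

# Hedged sketch: plain gradient ascent instead of Alchemy's scaled
# conjugate gradient. The model is P(in bin) = sigmoid(sum_i W_i * N_i);
# each example pairs its per-formula counts with a 0/1 bin label.
examples = [([1, 3], 1), ([1, 0], 1), ([0, 1], 0), ([0, 0], 0)]

weights = [0.0, 0.0]
lr = 0.1
for _ in range(500):
    for counts, label in examples:
        s = sum(w * n for w, n in zip(weights, counts))
        p = 1.0 / (1.0 + math.exp(-s))
        # d(log-likelihood)/dW_i = (label - p) * N_i
        weights = [w + lr * (label - p) * n for w, n in zip(weights, counts)]
```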
Using an MLN Q-function
Bin probabilities from MLN inference:
  Q ∈ (0.8, 1.0):  P1 = 0.75
  Q ∈ (0.5, 0.8):  P2 = 0.15
  Q ∈ (0, 0.5):    P3 = 0.10
Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3]
where E[Q | bin] is the Q-value of the most similar training example in that bin
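With the slide's bin probabilities, the Q-value estimate is just the probability-weighted sum of the per-bin expected values; a minimal check in Python, where the E[Q | bin] values are hypothetical stand-ins:

```python
# Bin probabilities from the slide; the E[Q | bin] values are hypothetical
# stand-ins for "Q-value of the most similar training example in the bin".
bin_probs  = [0.75, 0.15, 0.10]
expected_q = [0.90, 0.60, 0.30]

q_estimate = sum(p * e for p, e in zip(bin_probs, expected_q))
print(q_estimate)   # 0.75*0.90 + 0.15*0.60 + 0.10*0.30 = 0.795
```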
Example Similarity
E[Q | bin] = Q-value of the most similar training example in the bin
Similarity = dot product of example vectors
An example vector records which bin rules the example satisfies
[Table: example vectors over Rule 1, Rule 2, Rule 3, …; one example has entries 1, -1, 1, … and another has 1, 1, -1, …]
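A minimal sketch of the similarity computation, with hypothetical rules, example vectors, and Q-values: each example becomes a +1/-1 vector over the bin's rules, and E[Q | bin] is taken from the training example with the highest dot product.

```python
# Hedged sketch: example vectors, training examples, and Q-values are
# made up. Each vector entry is +1 if the example satisfies that bin
# rule and -1 otherwise; similarity is the dot product.
def similarity(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

new_example     = [1, -1,  1]
training_in_bin = {"ex1": [1,  1, -1], "ex2": [1, -1,  1]}
training_q      = {"ex1": 0.82, "ex2": 0.95}

best = max(training_in_bin,
           key=lambda k: similarity(new_example, training_in_bin[k]))
expected_q_for_bin = training_q[best]    # here ex2 is most similar: 0.95
```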
Experiments
Source task: 2-on-1 BreakAway
  3000 existing games from the learning curve
  Learn MLNs from 5 separate runs
Target task: 3-on-2 BreakAway
  Demonstration period of 100 games
  Continue training up to 3000 games
  Perform 5 target runs for each source run
Discoveries
  Results can vary widely with the source-task chunk from which we transfer
  Most methods use the “final” Q-function from the last chunk
  MLN transfer performs better from chunks halfway through the learning curve
Results in 3-on-2 BreakAway
[Figure: probability of goal vs. training games (0 to 3000) in 3-on-2 BreakAway, comparing MLN Transfer, Macro Transfer, Value-function Transfer, and Standard RL]
Conclusions
  MLN transfer can significantly improve initial target-task performance
  Like macro transfer, it is an aggressive approach for tasks with similar strategies
  It “lifts” transferred information to first-order logic, making it more general for transfer
  Theory refinement in the target task may be viable through MLN revision
Potential Future Work
  Model screening for transfer learning
  Theory refinement in the target task
  Fully relational RL in RoboCup using MLNs as Q-function approximators
Acknowledgements
  DARPA Grant HR0011-07-C-0060
  DARPA Grant FA 8650-06-C-7606
Thank You