Transfer in Reinforcement Learning via Markov Logic Networks
Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker
University of Wisconsin-Madison, USA

Possible Benefits of Transfer in RL
[Figure: target-task learning curves (performance vs. training); the curve with transfer starts higher and improves faster than the curve without transfer]

The RoboCup Domain
- 2-on-1 BreakAway
- 3-on-2 BreakAway

Reinforcement Learning
[Figure: agent-environment loop; the agent observes a state, chooses an action, and receives a reward]
- States are described by features:
    distance(me, teammate1) = 15
    distance(me, opponent1) = 5
    angle(opponent1, me, teammate1) = 30
    ...
- Actions are: Move, Pass, Shoot
- Rewards are: +1 for scoring, 0 otherwise

Our Previous Methods
- Skill transfer
    - Learn a rule for when to take each action
    - Use rules as advice
- Macro transfer
    - Learn a relational multi-step action plan
    - Use the macro to demonstrate

Transfer via Markov Logic Networks
[Diagram: the source-task learner yields a source-task Q-function and data; analyzing these produces an MLN Q-function, which demonstrates behavior to the target-task learner]

Markov Logic Networks
- A Markov network models a joint distribution
- A Markov Logic Network combines probability with logic
    - Template: a set of first-order formulas with weights
    - Each grounded predicate in a formula becomes a node
    - Predicates in a grounded formula are connected by arcs
- Probability of a world: (1/Z) exp( Σi wi ni )
(Richardson and Domingos, ML 2006)

MLN Q-function
- Formula 1:
    IF distance(me, Teammate) < 15
    AND angle(me, goalie, Teammate) > 45
    THEN Q ∈ (0.8, 1.0)
    w1 = 0.75, n1 = 1 (one teammate)
- Formula 2:
    IF distance(me, GoalPart) < 10
    AND angle(me, goalie, GoalPart) > 45
    THEN Q ∈ (0.8, 1.0)
    w2 = 1.33, n2 = 3 (three goal parts)
- Probability that Q ∈ (0.8, 1.0):
    exp(w1 n1 + w2 n2) / (1 + exp(w1 n1 + w2 n2))

Grounded Markov Network
[Figure: grounded network connecting the query node Q ∈ (0.8, 1.0) to the grounded predicates distance(me, teammate1) < 15, angle(me, goalie, teammate1) > 45, distance(me, goalLeft) < 10, angle(me, goalie, goalLeft) > 45, distance(me, goalRight) < 10, and angle(me, goalie, goalRight) > 45]

Learning an MLN
- Find good Q-value bins using hierarchical clustering
- Learn rules that classify examples into bins
  using inductive logic programming
- Learn weights for these formulas to produce the final MLN

Binning via Hierarchical Clustering
[Figure: Q-value histograms (frequency vs. Q-value) showing how hierarchical clustering groups Q-values into bins]

Classifying Into Bins via ILP
- Given examples
    - Positive: inside this Q-value bin
    - Negative: outside this Q-value bin
- The Aleph* ILP learning system finds rules that separate positive from negative examples
    - Builds rules one predicate at a time
    - Top-down search through the feature space
(* Srinivasan, 2001)

Learning Formula Weights
- Given formulas and examples
    - Same examples as for ILP
    - ILP rules as the network structure
- Alchemy* finds weights that make the probability estimates accurate
    - Scaled conjugate-gradient algorithm
(* Kok, Singla, Richardson, Domingos, Sumner, Poon and Lowd, 2004-2007)

Using an MLN Q-function
- The MLN gives a probability for each bin, e.g.:
    Q ∈ (0.8, 1.0): p1 = 0.75
    Q ∈ (0.5, 0.8): p2 = 0.15
    Q ∈ (0, 0.5):   p3 = 0.10
- Q = p1 · E[Q | bin1] + p2 · E[Q | bin2] + p3 · E[Q | bin3]

Example Similarity
- E[Q | bin] = Q-value of the most similar training example in the bin
- Similarity = dot product of example vectors
- An example vector shows which bin rules the example satisfies, e.g.:
    Rule 1:  1    1
    Rule 2: -1    1
    Rule 3:  1   -1
    ...

Experiments
- Source task: 2-on-1 BreakAway
    - 3000 existing games from the learning curve
    - Learn MLNs from 5 separate runs
- Target task: 3-on-2 BreakAway
    - Demonstration period of 100 games
    - Continue training up to 3000 games
    - Perform 5 target runs for each source run

Discoveries
- Results can vary widely with the source-task chunk from which we transfer
- Most methods use the "final" Q-function from the last chunk
- MLN transfer performs better from chunks halfway through the learning curve

Results in 3-on-2 BreakAway
[Figure: probability of goal (0 to 0.6) vs. training games (0 to 3000) for MLN Transfer, Macro Transfer, Value-function Transfer, and Standard RL]

Conclusions
- MLN transfer can significantly improve initial target-task performance
- Like macro transfer, it is an aggressive approach for tasks with
  similar strategies
- It "lifts" transferred information to first-order logic, making it more general for transfer
- Theory refinement in the target task may be viable through MLN revision

Potential Future Work
- Model screening for transfer learning
- Theory refinement in the target task
- Fully relational RL in RoboCup using MLNs as Q-function approximators

Acknowledgements
- DARPA Grant HR0011-07-C-0060
- DARPA Grant FA 8650-06-C-7606

Thank You
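The bin-probability computation in the MLN Q-function section (a log-linear model over weighted formula counts, exp(Σ wi ni) / (1 + exp(Σ wi ni))) can be sketched in Python. The weights and grounding counts are the illustrative values from the poster; the function name `bin_probability` is a hypothetical helper, not part of the authors' code.

```python
import math

def bin_probability(weights, counts):
    """Probability that Q falls in a bin, given the weights w_i of the
    formulas concluding that bin and the counts n_i of true groundings:
    exp(sum w_i * n_i) / (1 + exp(sum w_i * n_i))."""
    total = sum(w * n for w, n in zip(weights, counts))
    return math.exp(total) / (1.0 + math.exp(total))

# Formula 1: w1 = 0.75 with one teammate grounding (n1 = 1)
# Formula 2: w2 = 1.33 with three goal-part groundings (n2 = 3)
p = bin_probability([0.75, 1.33], [1, 3])  # close to 1: both formulas fire
```

With no formulas firing (all counts zero), the probability is 0.5, as expected for an empty log-linear sum.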
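The Q-value estimate described under "Using an MLN Q-function" and "Example Similarity" (bin probabilities weighting E[Q | bin], where E[Q | bin] is the Q-value of the bin's most similar training example under dot-product similarity) can be sketched as follows. All function names, the ±1 rule-satisfaction encoding of example vectors, and the data values are assumptions of this sketch, not the authors' implementation.

```python
def similarity(v1, v2):
    """Dot product of example vectors; each entry is assumed to be +1 if
    the example satisfies the corresponding bin rule and -1 otherwise."""
    return sum(a * b for a, b in zip(v1, v2))

def expected_q(query_vec, bin_examples):
    """E[Q | bin]: Q-value of the training example in this bin whose
    rule-satisfaction vector is most similar to the query's."""
    return max(bin_examples, key=lambda ex: similarity(query_vec, ex[0]))[1]

def mln_q(query_vec, bins):
    """Q = sum over bins of P(bin) * E[Q | bin]."""
    return sum(p * expected_q(query_vec, examples) for p, examples in bins)

# Hypothetical data: three bins with the probabilities from the poster
# (0.75, 0.15, 0.10), each holding (vector, Q-value) training examples.
query = [1, -1, 1]
bins = [
    (0.75, [([1, -1, 1], 0.9), ([1, 1, 1], 0.85)]),
    (0.15, [([-1, -1, 1], 0.6)]),
    (0.10, [([-1, 1, -1], 0.2)]),
]
q = mln_q(query, bins)  # 0.75*0.9 + 0.15*0.6 + 0.10*0.2 = 0.785
```

In the first bin, the query matches ([1, -1, 1], 0.9) exactly (similarity 3), so that example supplies E[Q | bin1].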