Using Advice to Transfer Knowledge Acquired in One Reinforcement Learning Task to Another Lisa Torrey, Trevor Walker, Jude Shavlik University of Wisconsin-Madison, USA Richard Maclin University of Minnesota-Duluth, USA Our Goal Transfer knowledge… … between reinforcement learning tasks … employing SVM function approximators … using advice Transfer Exploit previously learned models Learn first task knowledge acquired Learn related task Improve learning of new tasks with transfer without transfer performance experience Reinforcement Learning +2 0 -1 state…action…reward…new state Q-function: value of taking action from state Policy: take action with max Qaction(state) Advice for Transfer Based on what worked in Task A, I suggest… Task A Solution Task B Learner I’ll try it, but if it doesn’t work I’ll do something else. Advice improves RL performance Advice can be refined or even discarded Transfer Process Task A experience Task A Q-functions Transfer Advice Task B experience Task B Q-functions Advice from user (optional) Mapping from user Task A Task B Advice from user (optional) RoboCup Soccer Tasks KeepAway BreakAway Keep ball from opponents Score a goal [Stone & Sutton, ICML 2001] [Maclin et al., AAAI 2005] RL in RoboCup Tasks KeepAway BreakAway Features (time left) Actions Pass, Hold Rewards Each time step: +1 Pass, Move, Shoot At end: +2, 0, or -1 Transfer Process Task A experience Task A Q-functions Transfer Advice Task B experience Task B Q-functions Mapping from user Task A Task B Approximating Q-Functions Given examples State features Si= <f1 , … , fn> Estimated values y Qaction(Si) Learn linear coefficients y = w1 f1 + … + wn fn + b Non-linearity from Boolean tile features tilei,lower,upper = 1 if lower ≤ fi < upper Support Vector Regression Q-estimate y state S Linear Program minimize ||w||1 + |b| + C||k||1 such that y - k Sw + b y + k Transfer Process Task A experience Task A Q-functions Transfer Advice Task B experience Task B Q-functions Mapping from user Task A Task B Advice Example if distance_to_goal 10 and shot_angle 30 then prefer shoot over all other actions Need only follow advice approximately Add soft constraints to linear program Incorporating Advice Maclin et al., AAAI 2005 if v11 f1 + … + v1n fn d1 … and vm1 f1 + … + vmn fn dn then Qshoot > Qother for all other Advice and Q-functions have same language Linear expressions of features Transfer Process Task A experience Task A Q-functions Transfer Advice Task B experience Task B Q-functions Mapping from user Task A Task B Expressing Policy with Advice Qhold_ball(s) Old Q-functions Qpass_near(s) Qpass_far(s) Advice expressing policy if Qhold_ball(s) > Qpass_near(s) and Qhold_ball(s) > Qpass_far(s) then prefer hold_ball over all other actions Mapping Actions Qhold_ball(s) Old Q-functions Qpass_near(s) Qpass_far(s) hold_ball move pass_near pass_near pass_far Mapping from user Mapped policy if Qhold_ball(s) > Qpass_near(s) and Qhold_ball(s) > Qpass_far(s) then prefer move over all other actions Mapping Features Mapping from user Q-function mapping Qhold_ball(s) = w1 (dist_keeper1)+ w2 (dist_taker2)+ … Q´hold_ball(s) = w1 (dist_attacker1)+ w2 (MAX_DIST)+ … Transfer Example Old model Qx = wx1f1 + wx2f2 + bx Mapped model Q´x = wx1f´1 + wx2f´2 + bx Qy = wy1f1 Q´y = wy1f´1 Qz = if + by wz2f2 + bz Advice Q´x > Q´y if Q´z = + by wz2f´2 + bz Advice (expanded) wx1f´1 + wx2f´2 + bx > wy1f´1 + by and Q´x > Q´z and wx1f´1 + wx2f´2 + bx > wz2 f´2 + bz then prefer x´ then prefer x´ to all other actions Transfer Experiment Between RoboCup subtasks From 3-on-2 KeepAway To 2-on-1 BreakAway Two simultaneous mappings Transfer passing skills Map passing skills to shooting Experiment Mappings Play a moving KeepAway game Pass Pass, Hold Move Pretend teammate is standing in the goal Pass Shoot imaginary teammate Experimental Methodology Averaged over 10 BreakAway runs Transfer: advice from one KeepAway model Control: runs without advice Results Probability (Score Goal) 0.8 0.6 0.4 0.2 0 0 2500 5000 7500 10000 Games Played Analysis Transfer advice helps BreakAway learners Improvement is delayed 7% more likely to score a goal after learning Advantage begins after 2500 games Some advice rules apply rarely Preconditions for shoot advice not often met Related Work: Transfer Remember action subsequences [Singh, ML 1992] Restrict action choices [Sherstov & Stone, AAAI 2005] Transfer Q-values directly in KeepAway [Taylor & Stone, AAMAS 2005] Related Work: Advice “Take action A now” [Clouse & Utgoff, ICML 1992] “In situations S, action A has value X ” [Maclin & Shavlik, ML 1996] “In situations S, prefer action A over B” [Maclin et al., AAAI 2005] Future Work Increase speed of linear-program solving Decrease sensitivity to imperfect advice Extract advice from kernel-based models Help user map actions and features Conclusions Transfer exploits previously learned models to improve learning of new tasks Advice is an appealing way to transfer Linear regression approach incorporates advice straightforwardly Transferring a policy accommodates different reward structures Acknowledgements DARPA grant HR0011-04-1-0007 United States Naval Research Laboratory grant N00173-04-1-G026 Michael Ferris Olvi Mangasarian Ted Wild