Relational Macros for Transfer
in Reinforcement Learning
Lisa Torrey, Jude Shavlik, Trevor Walker
University of Wisconsin-Madison, USA
Richard Maclin
University of Minnesota-Duluth, USA
Transfer Learning Scenario
Agent learns Task A
Agent encounters related Task B
Agent recalls relevant knowledge from Task A
Agent uses this knowledge to learn Task B quickly
Goals of Transfer Learning
[Figure: target-task learning curves, performance vs. training, with transfer and without transfer]
Reinforcement Learning
The agent repeatedly:
- Observes the world state, described by a set of features
- Takes an action
- Receives a reward
It uses the rewards to estimate the Q-values of actions in states.
Policy: choose the action with the highest Q-value in the current state.
(A minimal code sketch of this loop follows.)
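As an illustration, here is a minimal tabular Q-learning version of that loop in Python. The env interface (reset, step, actions), the learning rate, and the exploration rate are assumptions for the sketch; the actual agents use approximate Q-functions over continuous RoboCup features rather than a table.

import random
from collections import defaultdict

# Minimal tabular Q-learning loop (states assumed hashable).
def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    q = defaultdict(float)                    # Q(state, action) estimates
    for _ in range(episodes):
        state = env.reset()                   # observe the world state
        done = False
        while not done:
            # Policy: take the action with the highest Q-value,
            # with occasional random exploration.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)   # take an action, receive a reward
            # Use the reward to update the Q-value estimate for (state, action).
            best_next = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q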
The RoboCup Domain
[Field diagrams: 2-on-1, 3-on-2, and 4-on-3 BreakAway]
Transfer in Reinforcement Learning

Related work:
- Model reuse (Taylor & Stone 2005): copy the Q-function
- Policy reuse (Fernandez & Veloso 2006)
- Option transfer (Perkins & Precup 1999)
- Relational RL (Driessens et al. 2006)
Our previous work:
- Policy transfer (Torrey et al. 2005)
- Skill transfer (Torrey et al. 2006): learn rules that describe when to take individual actions
Now we learn a strategy instead of individual skills.
Representing a Multi-step Strategy

A relational macro is a finite-state machine:
- Nodes represent internal states of the agent, in which limited independent policies apply
- Conditions for transitions and actions are in first-order logic

Example: a two-node macro
    hold ← true
    pass(Teammate) ← isOpen(Teammate)
    Transition from hold to pass when isClose(Opponent); transition back when allOpponentsFar.

Really these are rule sets, not just single rules. The learning agent jumps between players as the ball changes hands.
(A code sketch of such a machine follows.)
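Here is the two-node hold/pass macro written as a small finite-state machine in Python. The Node and Macro classes, the feature names, and the distance thresholds standing in for isClose and allOpponentsFar are all assumptions made for the sketch, not the authors' implementation.

class Node:
    """A macro node: a named internal state with its own rule set."""
    def __init__(self, name, action_rules):
        self.name = name
        self.action_rules = action_rules      # list of (condition, action) pairs

    def choose_action(self, state):
        for condition, action in self.action_rules:
            if condition(state):
                return action(state)
        return None                           # no rule fired: fall back to the RL policy

class Macro:
    """A finite-state machine over nodes, with rule-based transition conditions."""
    def __init__(self, nodes, transitions):
        self.nodes = nodes
        self.transitions = transitions        # {(from_node, to_node): condition}
        self.current = nodes[0]

    def reset(self):
        self.current = self.nodes[0]

    def step(self, state):
        for (src, dst), condition in self.transitions.items():
            if src is self.current and condition(state):
                self.current = dst
                break
        return self.current.choose_action(state)

# The example macro: hold the ball until an opponent is close, then pass to an open teammate.
hold = Node("hold", [(lambda s: True, lambda s: ("hold",))])
pass_node = Node("pass", [(lambda s: s["open_teammates"],
                           lambda s: ("pass", s["open_teammates"][0]))])
macro = Macro([hold, pass_node],
              {(hold, pass_node): lambda s: s["closest_opponent_dist"] < 10,   # isClose(Opponent)
               (pass_node, hold): lambda s: s["closest_opponent_dist"] > 20})  # allOpponentsFar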
Our Proposed Method

1. Learn a relational macro that describes a successful strategy in the source task
2. Execute the macro in the target task to demonstrate the successful strategy
3. Continue learning the target task with standard RL after the demonstration
(This pipeline is sketched in code below.)
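A high-level sketch of these three steps as a pipeline. The three step functions are passed in as placeholders for the components described in the rest of the talk; the game counts follow the experiments reported later.

def transfer_with_macro(source_games, target_env,
                        learn_relational_macro, demonstrate, standard_rl,
                        demo_games=100, total_games=3000):
    macro = learn_relational_macro(source_games)          # 1. ILP over source-task games
    q_init = demonstrate(macro, target_env, demo_games)   # 2. macro demonstration seeds the Q-function
    return standard_rl(target_env, q_init,                # 3. continue with standard RL
                       games=total_games - demo_games)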
Learning a Relational Macro

We use ILP to learn macros:
- Aleph: top-down search within a bottom clause
- Heuristic and randomized search
- Maximize F1 score (scoring sketch below)
We learn a macro in two phases:
1. The action sequence (node structure)
2. The rule sets for actions and transitions
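For concreteness, a sketch of the F1 score used to evaluate a candidate clause against the positive and negative training games it covers. The clause.covers interface is a placeholder; Aleph computes coverage internally.

def f1_score(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def score_clause(clause, positives, negatives):
    tp = sum(1 for g in positives if clause.covers(g))   # positives the clause covers
    fp = sum(1 for g in negatives if clause.covers(g))   # negatives it wrongly covers
    fn = len(positives) - tp                             # positives it misses
    return f1_score(tp, fp, fn)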
Learning Macro Structure

Objective: find an action pattern that separates good and bad games.

macroSequence(Game) ←
    actionTaken(Game, StateA, move, ahead, StateB),
    actionTaken(Game, StateB, pass, _, StateC),
    actionTaken(Game, StateC, shoot, _, gameEnd).

Resulting node structure: move(ahead) → pass(Teammate) → shoot(GoalPart)
(A sketch of matching games against this pattern follows.)
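An illustrative check, mirroring the macroSequence clause, for whether a game's action sequence fits the learned pattern. The (action, argument) list representation of a game is an assumption for the sketch.

def matches_macro_sequence(actions):
    """True if the game ends with consecutive move(ahead), pass(_), shoot(_) actions."""
    if len(actions) < 3:
        return False
    (a1, arg1), (a2, _), (a3, _) = actions[-3:]
    return a1 == "move" and arg1 == "ahead" and a2 == "pass" and a3 == "shoot"

# The scoring game move(ahead), pass(a1), shoot(goalRight) matches the pattern:
print(matches_macro_sequence([("move", "ahead"), ("pass", "a1"), ("shoot", "goalRight")]))  # True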
Learning Macro Conditions

Objective: describe when transitions and actions should be taken.

For the transition from move to pass:
    transition(State) ←
        feature(State, distance(Teammate, goal)) < 15.

For the policy in the pass node:
    action(State, pass(Teammate)) ←
        feature(State, angle(Teammate, me, Opponent)) > 30.

Node structure: move(ahead) → pass(Teammate) → shoot(GoalPart)
(A sketch of evaluating these rules follows.)
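The same two rules, evaluated on a state represented as a dictionary of RoboCup features. The dictionary keys and the concrete state are illustrative.

def move_to_pass_transition(state, teammate):
    # transition(State) <- feature(State, distance(Teammate, goal)) < 15.
    return state[("distance", teammate, "goal")] < 15

def pass_action_applies(state, teammate, opponent):
    # action(State, pass(Teammate)) <- feature(State, angle(Teammate, me, Opponent)) > 30.
    return state[("angle", teammate, "me", opponent)] > 30

state = {("distance", "a1", "goal"): 12, ("angle", "a1", "me", "o1"): 40}
print(move_to_pass_transition(state, "a1"))    # True: teammate a1 is within 15 of the goal
print(pass_action_applies(state, "a1", "o1"))  # True: the passing angle past o1 exceeds 30 degrees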
Examples for Actions

Training examples for the rules in the pass node are drawn from scoring and non-scoring source-task games, for example:
- Game 1 (scoring): move(ahead), pass(a1), shoot(goalRight) → positive example
- Game 2 (non-scoring): move(ahead), pass(a2), shoot(goalLeft) → negative example
- Game 3: move(right), pass(a1)
- Game 4: move(ahead), pass(a1), shoot(goalRight)
Node structure: move(ahead) → pass(Teammate) → shoot(GoalPart)
Examples for Transitions

Training examples for the transition from move to pass are drawn the same way, for example:
- Game 1 (scoring): move(ahead), pass(a1), shoot(goalRight) → positive example (it moved and then passed)
- Game 2 (non-scoring): move(ahead), move(ahead), shoot(goalLeft)
- Game 3: move(ahead), move(ahead), pass(a1), shoot(goalRight) → negative example (it took a second move where it could have passed)
Node structure: move(ahead) → pass(Teammate) → shoot(GoalPart)
(A sketch of this example labeling follows.)
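A hedged sketch of one plausible way to collect these ILP training examples from source-task games, following the labeling shown on the two slides above: passes in scoring games are positive and passes in non-scoring games are negative for the pass node, while for the move-to-pass transition a scoring game is positive where it passed and negative where it kept moving. The (scored, actions) game representation is an assumption, and this is not necessarily the authors' exact labeling scheme.

def pass_action_examples(games):
    """games: iterable of (scored, actions) with actions as (name, argument) tuples."""
    positives, negatives = [], []
    for scored, actions in games:
        for i, (name, arg) in enumerate(actions):
            if name == "pass":
                (positives if scored else negatives).append((actions, i, arg))
    return positives, negatives

def move_to_pass_transition_examples(games):
    positives, negatives = [], []
    for scored, actions in games:
        if not scored:
            continue                      # only scoring games label this transition here
        for i in range(len(actions) - 1):
            if actions[i][0] == "move":
                if actions[i + 1][0] == "pass":
                    positives.append((actions, i + 1))    # moved, then passed
                elif actions[i + 1][0] == "move":
                    negatives.append((actions, i + 1))    # moved again instead of passing
    return positives, negatives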
Transferring a Macro

Demonstration (sketched below):
- Execute the macro strategy to get Q-value estimates
- Infer low Q-values for actions not taken by the macro
- Compute an initial Q-function with these examples
- Continue learning with standard RL

Advantage: potential for a large immediate jump in performance.
Disadvantage: risk that the agent will blindly follow an inappropriate strategy.
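A sketch of the demonstration step: run the macro in the target task, label the action it chooses with a high Q-value and the actions it does not take with low Q-values, then fit an initial Q-function to those examples. The HIGH_Q/LOW_Q constants, the env interface, and fit_q_function are placeholders; the macro object is assumed to expose reset() and step(state) as in the earlier FSM sketch.

HIGH_Q, LOW_Q = 1.0, 0.0    # placeholder target values for demonstrated vs. untaken actions

def demonstrate(macro, env, actions, num_games, fit_q_function):
    examples = []                              # (state, action, target Q-value) triples
    for _ in range(num_games):
        state, done = env.reset(), False
        macro.reset()
        while not done:
            chosen = macro.step(state)         # action dictated by the macro strategy
            for a in actions:
                examples.append((state, a, HIGH_Q if a == chosen else LOW_Q))
            state, _, done = env.step(chosen)
    return fit_q_function(examples)            # initial Q-function; standard RL continues from it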
Experiments

Source task: 2-on-1 BreakAway
- 3000 existing games from the learning curve
- Learn macros from 5 separate runs
Target tasks: 3-on-2 and 4-on-3 BreakAway
- Demonstration period of 100 games
- Continue training up to 3000 games
- Perform 5 target runs for each source run
2-on-1 BreakAway Macro

The learned strategy: pass(Teammate) → move(Direction) → shoot(goalRight) → shoot(goalLeft)
- The learning agent jumped between players at the pass
- In one source run, one of these nodes was absent
- The first shot apparently acts as a leading pass
- The ordering of the two shoot nodes varied across source runs
Results: 2-on-1 to 3-on-2
[Learning curves: probability of goal vs. training games (0 to 3000) in 3-on-2 BreakAway, comparing Standard RL, Model Reuse, Skill Transfer, and Relational Macro]
Results: 2-on-1 to 4-on-3
[Learning curves: probability of goal vs. training games (0 to 3000) in 4-on-3 BreakAway, comparing Standard RL, Model Reuse, Skill Transfer, and Relational Macro]
Conclusions

- This transfer method can significantly improve initial target-task performance
- It can handle new elements being added to the target task, but not new objectives
- It is an aggressive approach that is a good choice for tasks with similar strategies
Future Work

Alternative ways to apply relational macros in the target task:
- Keep the initial benefits
- Alleviate risks when tasks differ more
Alternative ways to make decisions about steps within macros:
- Statistical relational learning techniques
Acknowledgements

- DARPA Grant HR0011-04-1-0007
- DARPA IPTO contract FA8650-06-C-7606
Thank You
Rule scores

- Each transition and action has a set of rules, one or more of which may fire
- If multiple rules fire, we obey the one with the highest score
- The score of a rule is the probability that following it leads to a successful game

Score = (# source-task games that followed the rule and scored) / (# source-task games that followed the rule)

(A sketch of this scoring and selection follows.)
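A sketch of this scoring and rule-selection step. The rule objects (rule.followed_by, rule.fires) and the game attribute scored are illustrative stand-ins for however rules and games are actually represented.

def rule_score(rule, source_games):
    followed = [g for g in source_games if rule.followed_by(g)]
    if not followed:
        return 0.0
    return sum(1 for g in followed if g.scored) / len(followed)   # P(score | followed the rule)

def choose_rule(rules, state, scores):
    fired = [r for r in rules if r.fires(state)]          # rules whose conditions hold
    if not fired:
        return None                                       # no rule fired in this state
    return max(fired, key=lambda r: scores[r])            # obey the highest-scoring rule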