Learning from Observation Using Primitives
Darrin Bentivegna
Bentivegna Thesis, July 2004

Outline
- Motivation
- Test environments
- Learning from observation
- Learning from practice
- Contributions
- Future directions

Motivation
- Reduce the learning time needed by robots.
- Quickly learn skills from observing others.
- Improve performance through practice.
- Adapt to environment changes.
- Create robots that can interact with and learn from humans in a human-like way.

Real-World Marble Maze
[Figure: the marble maze test environment.]

Real-World Air Hockey
[Figure: the air hockey test environment.]

Research Strategy
- Domain knowledge: a library of primitives.
- Manually defining primitives is a natural way to specify domain knowledge.
- The focus of this research is on how to use a fixed library of primitives.

Marble Maze Primitives
- Roll To Corner
- Guide
- Roll Off Wall
- Leave Corner
- Roll From Wall

Primitives in Air Hockey
- Right Bank Shot
- Left Bank Shot
- Straight Shot
- Static Shot
- Defend Goal
- Idle

Take-Home Message
- Learning using primitives greatly speeds up learning and allows robots to perform more complex tasks.
- Memory-based learning makes learning from observation easy.
- I created a way to do memory-based reinforcement learning: the problem is that there is no fixed set of parameters to adjust, so the system learns by adjusting the distance function.
- I present algorithms that learn from both observation and practice.

Observe Critical Events in the Marble Maze
- Raw data.
- Wall contact is inferred from the raw data.

Observe Critical Events in Air Hockey
[Figure: traces of the human paddle movement (paddle X, paddle Y) and puck movement (puck X, puck Y), and the shots made by the human on the table (+x, +y).]

Learning from Observation
- Memory-based learner: learn by storing experiences.
- Primitive selection: k-nearest neighbor.
- Sub-goal generation: kernel regression (distance-weighted averaging) based on remembered primitives of the appropriate type.
- Action generation: learned or fixed policy.

Three-Level Structure
- Primitive selection
- Sub-goal generation
- Action generation

Learning from Observation Framework
[Diagram: the learning-from-observation module feeds all three levels: primitive selection, sub-goal generation, and action generation.]

Observe Primitives Performed by a Human
[Figure: observed primitives plotted on the maze board; legend: ◊ Guide, ○ Roll To Corner, □ Roll Off Wall, * Roll From Wall, X Leave Corner.]

Primitive Database
- Create a data point for each observed primitive:
  - The primitive type performed: TYPE.
  - The state of the environment at the start of the primitive (marble position, marble velocity, and board angles): (M_x, M_y, \dot{M}_x, \dot{M}_y, B_x, B_y).
  - The state of the environment at the end of the primitive: (EM_x, EM_y, E\dot{M}_x, E\dot{M}_y, EB_x, EB_y).
- Each data point maps (M_x, M_y, \dot{M}_x, \dot{M}_y, B_x, B_y) to (TYPE, EM_x, EM_y, E\dot{M}_x, E\dot{M}_y, EB_x, EB_y).

Marble Maze Example
[Diagram: primitive selection, sub-goal generation, and action generation applied to the marble maze.]

Primitive Type Selection
- Look up the primitive type using the environment state as the query: q = (M_x, M_y, \dot{M}_x, \dot{M}_y, B_x, B_y).
- Weighted nearest neighbor: d(x, q) = \sqrt{\sum_j w_j (x_j - q_j)^2}.
- Many ways to select a primitive type:
  - Use the closest point.
  - Use the n nearest points to vote, either by highest frequency or weighted by distance from the query point.
- A minimal sketch of this selection step follows.
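The sketch below illustrates the distance-weighted voting variant of primitive type selection over the observed-primitive database. The record layout, weight values, and k are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch: distance-weighted k-NN selection of a primitive type.
import math
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class PrimitivePoint:
    start_state: tuple   # (Mx, My, dMx, dMy, Bx, By) at the start of the primitive
    prim_type: str       # e.g. "RollToCorner", "Guide", "RollOffWall", ...
    end_state: tuple     # (EMx, EMy, EdMx, EdMy, EBx, EBy) at the end

def weighted_distance(x, q, w):
    """d(x, q) = sqrt(sum_j w_j * (x_j - q_j)^2)."""
    return math.sqrt(sum(wj * (xj - qj) ** 2 for wj, xj, qj in zip(w, x, q)))

def select_primitive_type(query, database, weights, k=5):
    """Vote among the k nearest data points, weighting each vote by 1/d."""
    neighbors = sorted(
        database, key=lambda p: weighted_distance(p.start_state, query, weights))[:k]
    votes = defaultdict(float)
    for p in neighbors:
        d = weighted_distance(p.start_state, query, weights)
        votes[p.prim_type] += 1.0 / (d + 1e-6)   # closer points get larger votes
    return max(votes, key=votes.get)

# Tiny usage example with a single stored Guide primitive.
db = [PrimitivePoint((1.0, 2.0, 0.1, 0.0, 0.0, 0.0), "Guide",
                     (3.0, 2.0, 0.0, 0.0, 0.0, 0.0))]
print(select_primitive_type((1.1, 2.0, 0.1, 0.0, 0.0, 0.0), db, [1.0] * 6, k=1))
```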
Sub-goal Generation
- A locally weighted average over nearby primitives (data points) of the same type.
- A kernel function controls the influence of nearby data points: K(d) = e^{-d^2}.
- \hat{y}(q) = \sum_{i=1}^{n} y_i K(d(x_i, q)) \big/ \sum_{i=1}^{n} K(d(x_i, q)).
- The output is the predicted end state (EM_x, EM_y, E\dot{M}_x, E\dot{M}_y, EB_x, EB_y), which becomes the sub-goal. (A sketch of this step appears at the end of this part.)

Action Generation
- Provides the action (motor command) to perform at each time step.
- Can be implemented with LWR, neural networks, a physical model, etc.

Creating an Action Generation Module (Roll To Corner)
- Record at each time step from the beginning to the end of the primitive:
  - Environment state: (M_x, M_y, \dot{M}_x, \dot{M}_y, B_x, B_y).
  - Actions taken: (B_x, B_y).
  - End state: (EM_x, EM_y, E\dot{M}_x, E\dot{M}_y, EB_x, EB_y).
- Each step provides a training pair: ((M_x, M_y, \dot{M}_x, \dot{M}_y, B_x, B_y), (EM_x, EM_y, E\dot{M}_x, E\dot{M}_y, EB_x, EB_y)) \rightarrow (B_x, B_y).
[Figure: observed environment states from the start to the end of the Roll To Corner primitive, with the incoming velocity vector.]

Transform to a Local Coordinate Frame
- Global information: ((M_x, M_y, \dot{M}_x, \dot{M}_y, B_x, B_y), (EM_x, EM_y, E\dot{M}_x, E\dot{M}_y, EB_x, EB_y)) \rightarrow (B_x, B_y).
- Primitive-specific local information: ((X, \dot{M}_x, B_x, B_y), (E\dot{M}_x, EB_x, EB_y)) \rightarrow (B_x, B_y).
[Figure: the local frame is anchored at a reference point, with +X measuring the distance to the end of the primitive and +Y perpendicular to it.]

Learning the Maze Task from Only Observation

Related Research: Primitive Recognition
- Survey of research in human motion analysis and recognizing human activities from image sequences: Aggarwal and Cai.
- Recognition over time with HMMs: Brand, Oliver, and Pentland.
- Template matching: Davis and Bobick.
- Discovering primitives: Fod, Mataric, and Jenkins.

Related Research: Primitive Selection
- Predefined sequences:
  - Virtual characters: Hodgins et al., Faloutsos et al., and Mataric et al.
  - Mobile robots: Balch et al. and Arkin et al.
- Learning from observation: assembly; Kuniyoshi, Inaba, Inoue, and Kang.
- Using a planning system: assembly; Thomas and Wahl.
- Reinforcement learning: Ryan and Reid.

Related Research: Primitive Execution
- Predefined execution policies:
  - Virtual characters: Mataric et al. and Hodgins et al.
  - Mobile robots: Brooks et al. and Arkin.
- Learning while operating in the environment:
  - Mobile robots: Mahadevan and Connell.
  - Reinforcement learning: Kaelbling, Dietterich, and Sutton et al.
- Learning from observation:
  - Mobile robots: Larson and Voyles; Hugues and Drogoul; Grudic and Lawrence.
  - High-DOF robots: Aboaf et al., Atkeson, and Schaal.

Review
- Learning from observation supplies all three levels: primitive selection, sub-goal generation, and action generation.

Using Only Observed Data
- The learner tries to mimic the teacher.
- It cannot always perform primitives as well as the teacher.
- It sometimes selects the wrong primitive type for the observed state.
- It does not know what to do in states it has not observed.
- It has no way to know that it should try something different.
- Solution: learning from practice.
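The sub-goal generation step referenced earlier reduces to kernel regression over the stored primitives of the selected type. Below is a minimal sketch, assuming the stored points are kept as (start state, end state) pairs and using the Gaussian kernel K(d) = e^{-d^2} from the slides; the function names and example values are illustrative.

```python
# Minimal sketch: sub-goal generation by kernel regression (distance-weighted averaging).
import math

def kernel(d):
    """Gaussian kernel from the slides: K(d) = exp(-d^2)."""
    return math.exp(-d * d)

def weighted_distance(x, q, w):
    """d(x, q) = sqrt(sum_j w_j * (x_j - q_j)^2)."""
    return math.sqrt(sum(wj * (xj - qj) ** 2 for wj, xj, qj in zip(w, x, q)))

def generate_subgoal(query, same_type_points, weights):
    """yhat(q) = sum_i y_i K(d(x_i, q)) / sum_i K(d(x_i, q)),
    where x_i are stored start states and y_i the corresponding end states."""
    num = [0.0] * len(same_type_points[0][1])
    den = 0.0
    for x_i, y_i in same_type_points:
        k = kernel(weighted_distance(x_i, query, weights))
        den += k
        num = [nj + k * yj for nj, yj in zip(num, y_i)]
    return [nj / den for nj in num]   # predicted end state (the sub-goal)

# Example with two stored primitives of the selected type in the 6-D maze state.
points = [((1.0, 2.0, 0.1, 0.0, 0.0, 0.0), (3.0, 2.0, 0.0, 0.0, 0.0, 0.0)),
          ((1.2, 2.1, 0.1, 0.0, 0.0, 0.0), (3.1, 2.2, 0.0, 0.0, 0.0, 0.0))]
subgoal = generate_subgoal((1.1, 2.0, 0.1, 0.0, 0.0, 0.0), points, [1.0] * 6)
```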
Improving Primitive Selection and Sub-goal Generation from Practice
[Diagram: the learning-from-practice module adjusts primitive selection and sub-goal generation; action generation is unchanged.]

Improving Primitive Selection and Sub-goal Generation Through Practice
- Task specification information is needed to create a reward function.
- Learn by adjusting the distance to the query: scale the distance function by the value of using a data point.
  - d(x, q) = \sqrt{\sum_j w_j (x_j - q_j)^2}
  - d'(x, q) = d(x, q) \cdot f(x, q)
  - f(data point location, query location) is related to the Q value, e.g. 1/Q or \exp(-Q).
- Associate scale values with each data point; the scale values must be stored and learned. (A sketch of this update appears at the end of this part.)

Store Values in a Function Approximator
- Look-up table: fixed size.
- Locally Weighted Projection Regression (LWPR), Schaal et al.: create a model for each data point, indexed by the difference between the query point and the data point's state (the delta-state).

Learn Values Using a Reinforcement Learning Strategy
- Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]
- State: the delta-state.
- Action: using this data point.
- Reward assignment:
  - Positive: making progress through the maze.
  - Negative: falling into a hole, going backwards through the maze, and taking time performing the primitive.

Learning the Value of Choosing a Data Point (Simulation)
[Figure: scale values computed over a testing area when the LWPR model associated with an observed Roll Off Wall primitive is queried, shown for two marble positions with the incoming velocity vector as drawn; one position is rated BAD, the other GOOD.]

Maze Learning from Practice (Real World and Simulation)
[Plot: cumulative failures per meter over practice for the observation-only, look-up table, and LWPR learners.]

Learning New Strategies
[Panels: the human demonstration, the behavior after learning from observation, and the behavior after learning from practice.]
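The scaled-distance update referenced above can be sketched as follows. This is a minimal tabular stand-in, assuming f(x, q) = exp(-Q), a coarse discretization of the delta-state, and illustrative learning-rate and discount values; the thesis also stores these values in LWPR models rather than a table.

```python
# Minimal sketch: learning scale values for data points with a one-step Q update,
# then using them to scale the query distance.
import math
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount (assumed values)

class ScaledMemory:
    def __init__(self, weights):
        self.weights = weights
        self.q_values = defaultdict(float)   # key: (data point id, delta-state bin)

    def base_distance(self, x, q):
        """d(x, q) = sqrt(sum_j w_j * (x_j - q_j)^2)."""
        return math.sqrt(sum(wj * (xj - qj) ** 2
                             for wj, xj, qj in zip(self.weights, x, q)))

    def scaled_distance(self, point_id, x, q):
        """d'(x, q) = d(x, q) * f(x, q), with f = exp(-Q): valuable points look closer."""
        key = (point_id, self._bin(x, q))
        return self.base_distance(x, q) * math.exp(-self.q_values[key])

    def update(self, point_id, x, q, reward, best_next_q=0.0):
        """Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
        key = (point_id, self._bin(x, q))
        td_error = reward + GAMMA * best_next_q - self.q_values[key]
        self.q_values[key] += ALPHA * td_error

    def _bin(self, x, q, res=0.5):
        # Coarse discretization of the delta-state for the look-up table.
        return tuple(round((xj - qj) / res) for xj, qj in zip(x, q))

# After the primitive chosen via data point `pid` at query q finishes, call
# mem.update(pid, x, q, reward) with a positive reward for maze progress and
# negative rewards for holes, backward motion, or taking too long.
```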
Learning Action Generation from Practice
[Diagram: the learning-from-practice module now also adjusts action generation.]

Improving Action Generation Through Practice
- The environment changes over time.
- The system must compensate for structural modeling error.
- Not everything can be learned from only observing others.

Knowledge for Making a Hit
- Once the hit location has been determined, the shot involves the target line, target location, absolute post-hit velocity, hit line, hit location, and the path of the incoming puck.
[Diagram: the puck motion (puck movement), impact (puck-paddle collision), and robot motion (paddle placement and movement timing) models connect the target location, outgoing puck velocity, incoming paddle velocity, and robot trajectory.]

Results of Learning Straight Shots (Simulation)
- Observed 44 straight shots made by the human.
- Practiced 100 shots.
- There is too much noise in the hardware sensing, hence the simulation results.
[Plot: target error (meters) vs. number of shots taken, shown as a running average of 5 shots.]

Robot Model (Real World)
[Diagram: target location → puck motion → outgoing puck velocity → impact → incoming paddle velocity → robot → robot trajectory.]

Obtaining Proper Robot Movement
- Six preset robot configurations.
- Compute the joint angles for a paddle position P(x, y) by interpolating between the four surrounding configurations.
- Paddle command: the desired end location and time of the trajectory, (x, y, t).
- The paddle follows a fifth-order polynomial trajectory with zero velocity and acceleration at the start and end. (A small sketch of this profile appears at the end of this document.)

Robot Model
[Diagram: desired state of the puck at hit time → compute the movement command (x, y, t) → pre-set time delay → generate the robot trajectory.]
[Plot: commanded paddle motion in the (x, y) plane from the starting location.]

Robot Movement Errors
[Plot: observed path of the paddle vs. the desired trajectory, with the desired hit location and the location of highest paddle velocity marked.]
- Movement accuracy is determined by many factors:
  - Speed of the movement.
  - Friction between the paddle and the board.
  - Hydraulic pressure applied to the robot.
  - Operating within the designed performance parameters.

Robot Model
[Diagram: target location → puck motion → outgoing puck velocity → impact → incoming paddle velocity → robot → robot trajectory.]
- Learn to properly place the paddle.
- Learn the timing of the paddle.
- The robot observes its own actions:
  - The actual hit point (the point of highest paddle velocity).
  - The time from when the command is given to when the paddle is observed at the hit position.

Improving the Robot Model
[Diagram: desired state of the puck at hit time → compute the movement command → a robot-movement LWPR model adjusts (x, y, t), and timing information from LWPR supplements the pre-set time delay → generate the robot trajectory.]

Using the Improved Robot Model
[Plots, two examples: observed path of the paddle vs. the desired trajectory, with the desired hit location and the location of highest paddle velocity marked.]

Real-World Air Hockey

Major Contributions
- A framework was created as a tool for performing research in learning from observation using primitives.
  - Its flexible structure allows for the use of various learning algorithms.
  - It can also learn from practice.
- Presented learning methods that learn quickly from observed information and can also improve performance through practice.
- Created a unique algorithm that gives a robot the ability to learn the effectiveness of the data points in a database and then use that information to change its behavior as it operates in the environment.
- Presented a method of breaking the learning problem into small learning modules; individual modules have more opportunities to learn and generalize.

Some Future Directions
- Automatically defining primitive types.
- Exploring how to represent learned information so it can be used in other tasks and environments.
- Can robots learn about the world from playing these games?
- Exploring other ways to select primitives and sub-goals.
- Using the observed information to create a planner.
- Investigating methods of exploration at the primitive selection and sub-goal generation levels.
- Simultaneously learning primitive selection and action generation.
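As referenced in the paddle-command slide, the commanded trajectory follows a fifth-order polynomial with zero velocity and acceleration at both ends; with those boundary conditions the polynomial reduces to the standard 10s^3 - 15s^4 + 6s^5 blend. The sketch below illustrates that profile; the function names, sampling interval, and example coordinates are assumptions.

```python
# Minimal sketch: fifth-order (quintic) paddle trajectory with zero boundary
# velocity and acceleration, sampled from a start position to a commanded
# end location over a commanded duration.

def quintic_profile(s):
    """Normalized fifth-order blend: 0 -> 1 with zero end-point velocity/acceleration."""
    return 10 * s**3 - 15 * s**4 + 6 * s**5

def paddle_trajectory(start_xy, goal_xy, duration, dt=0.01):
    """Sample the commanded paddle path from start_xy to goal_xy over `duration` seconds."""
    samples = []
    steps = max(1, round(duration / dt))
    for i in range(steps + 1):
        s = quintic_profile(min(1.0, i * dt / duration))
        samples.append(tuple(a + s * (b - a) for a, b in zip(start_xy, goal_xy)))
    return samples

# Example: move the paddle from (0.2, 0.05) to (0.3, 0.25) in 0.4 s.
path = paddle_trajectory((0.2, 0.05), (0.3, 0.25), duration=0.4)
```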