From: AAAI Technical Report WS-99-15. Compilation copyright © 1999, AAAI (www.aaai.org). All rights reserved. Using Primitives in Learning From Observation: A Preliminary Report Darrin C. Bentivegna and Christopher G. Atkeson College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA30332-0280 ATRHumanInformation Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan dbent@cc.gatech.edu, cga@cc.gatech.edu www.cc.gatech.edu / people /home/ dbent, www.cc.gatech.edu/fac/Chris.Atkeson Abstract A virtual air-hockey game and a virtual and an actual marble-mazegame have been created that allow data to be captured while the gamesare being played. This data is parsed into small parts of the task called primitives and databases of observed primitives are created. A learning agent then uses the database to perform the task. A virtual air-hockeyplayer has learned to select and makeshots. Anagent is being created that will learn howto play the marble-mazegamein a similar manner. Introduction Humanlearning is often accelerated by observing a task being performed or attempted by someone else. If robots can be programmedto use such observations to accelerate learning their usability and functionality will be increased and programming and learning time will be decreased. This paper describes research that explores the use of primitives in learning from observation. Our ultimate goal is to show that the use of primitives accelerates learning, and that the primitives can be automatically learned by observing a teacher’s performance. This paper describes how the parameters of a set of predefined primitives can be learned. Two virtual environments and a physical implementation will be described that are being used for this research. One of the tasks is playing air-hockey. Figure 1 shows a virtual air-hockey game that was created that allows a person to play against a virtual player. The virtual air-hockey player learns which hit to select, how to make the selected hit, and how to move the paddle from observing a human opponent. The methods used to extract this information from captured data and howthe virtual player uses this information will be described. The other task used in this research is a marble maze game that has been implemented in a virtual environment and a physical implementation, figures 2 and 3. The marble maze environment is near the beginning of Copyright (~) 1999, AmericanAssociation for Artificial Intelligence (www.aaai.org).All rights reserved. 4O its development. The construction of and the learning methods to be used in this environment will also be described. These domains were chosen because of the ease with which they can be simulated in virtual environments and because they provide a starting point to obtain more information on learning from observation. The physical environments are also small enough to be operated in a laboratory. Since the basic movementsin these domains are only in two dimensions, motion capture and object manipulation is simplified. A camera based motion capture system can easily be used to collect data in a hardware implementation (Bishop & Spong 1999; Ohshima et al. 1998). A stationary arm or some other similar robotic device can be programmedto play airhockey on an actual table (Spong 1999; Bishop ~ Spong 1999). A marble-maze game (Labyrinth 1999) has been outfitted with stepper motors and a camera so that the movements may be captured and controlled by a computer. Primitives Robots typically must generate commandsto all their actuators at regular intervals. The analog controllers for our seven degree of freedom arm are given desired torques for each joint at roughly 500Hz. Thus, a task with a one second duration is parameterized with 7 * 500 = 3500 parameters. Learning in this high dimensional space can be quite slow or can fail totally. Randomsearch in such a space is hopeless. In addition, since robot movementstake place in real time, learning approaches that require more than hundreds of movements are often not feasible. Special purpose techniques have been developed to deal with this problem, such as trajectory learning (An, Atkeson, &5Hollerbach 1988) and learning from observation (Atkeson & Schaal 1997a; 1997b; Hayes & Demiris 1994; Kuniyoshi, Inaba, & Inoue 1994; Bakker &: Kuniyoshi 1996; Dillmann et al. 1996; Hirzinger 1996; Ikeuchi et al. 1996). It is our hope that primitives can be used to reduce the dimensionality of the learning problem (Arkin 1998; Schmidt 1988). Primitives are solutions to small parts of a task that can be combined to complete the task. A solution to a task may be made up of many primitives. Figure 2: The virtual opponent’s slot while also preventing it from going into your ownslot. Figure 1 shows the virtual air hockey game created that can be played on a computer running OpenInventor and Tcl/TK. The game consists of two paddles, a puck and a board to play on. A human player controls one paddle using a mouse. At the other end is a simulated or virtual player. The code can be obtained at www.cc.gatech.edu/proj ects/Learning_Research/. The movementof the virtual player has very limited physics incorporated into it. The paddle movement is constrained to operate with a velocity limit. Paddle accelerations are not monitored and therefore can be unrealistically large. The virtual player uses only its arm and hand to position the paddle. For a given desired paddle location, the arm and hand are placed to put the paddle in the appropriate location, and resolving any redundancies so as to make the virtual player look "human-like". If the location is not within the limits of the board and the reach of the virtual player the location is adjusted to the closest reachable point. The torso is currently fixed in space but could be programmed to move in a realistic manner. The virtual player’s head movesso that it is always pointing in the direction of its hand, but is irrelevant to the task in this implementation. The paddles and the puck are constrained to stay on the board. There is a small amount of friction between the puck and the board’s surface. There is also energy loss in collisions between the puck and the walls of the board and the paddles. Spin of the puck is ignored in the simulation. The position of the two paddles and the puck, and any collisions occurring within sampling intervals are recorded. Humandomain knowledge was used to define a set of primitives to work with initially. Three hit primitives Figure 1: The virtual air hockey environment. The disc shaped object near the center line is a puck which slides on the table and bounces off the sides, the other two disc shaped objects are the paddles. The far paddle is controlled by the virtual player and the closer paddle is controlled by a human player by moving a mouse. The object of the game is to score points by making the puck hit the opposite goal (the purple/light area at the ends of the board). In the air-hockey environment, for example, there may be primitives for hitting the puck, capturing the puck, and defending the goal. There are many possible primitives, and it is often possible to break a primitive up into smaller primitives. In this research, a human, using domain knowledge, designs the candidate primitives that are to be used. Algorithms are created to segment the observed behavior into primitives and to build a database of these primitives. The agent then uses this database to decide at certain times what primitive to perform and how to perform it. Virtual marble maze environment. Air Hockey Air-hockey is a game played by two people. They use round paddles to hit a fiat round puck across a table. Air is forced up through many tiny holes in the table surface that create a cushion of air for the puck to slide on with relatively little friction. The table has an edge around it that prevents the puck from going off of the table, and the puck bounces off of this edge with little loss of velocity. At each end of the table there is a slot that the puck can fit through. The objective of the game is to hit the puck so that it goes into the 41 Figure 4: Three hit primitives being performed by the virtual player: right, straight, and left. Figure 3: The physical marble maze implementation. are shownin figure 4. The full list of primitives used is: ¯ Left Hit: the player hits the puck and it hits the left wall and then travels toward the opponent’s goal. ¯ Straight Hit: the player hits the puck and it travels toward the opponent’s goal without hitting the side walls. ¯ Right Hit: the player hits the puck and it hits the right wall and then travels toward the opponent’s goal. ¯ No Hit: the player deliberately does not hit the puck. ¯ Prepare: movements made while the puck is on the opposite side from the player. ¯ Multi-Shot: movements made after a shot while the puck is still on the same side. Selecting the appropriate primitive The virtual player must decide which primitive to perform. The prepare primitive is performed whenever the puck is on the side opposite the player. In all the remaining primitives the puck is on the same side as the player, so selecting which of the other primitives to perform requires taking into account the current board state. A database was created to guide selection of primitives, incorporating prior observations of primitives being executed. The context or state in which each primitive has been performed is extracted from the data, and is used by a nearest neighbor lookup process to find the past primitive execution whose context is most similar to the current context. In this implementation a primitive is selected, and then run to completion, before the next primitive is selected and executed. In future implementations we plan to look at systems in which primitives can run concurrently, and interrupt and override other primitives. Critical events are used to segment the data and to decide when a specific primitive is being performed. Critical events are usually rare occurrences. For example, the puck mostly travels in a straight line with a gradually decreasing velocity. Critical events for the puck include collisions, in which the ball speed and direction are rapidly changed. Using critical events, the raw data is segmented into the above primitives. In air hockey, the hit primitives are parameterized by the incoming puck position (where it crossed the center line) and velocity (when it crossed the center line), the hit location, and the outgoing puck velocity and its target. To determine the target of the hit, the puck’s velocity vector after the hit is made is observed and a physical model is used to determine where the puck would hit the back wall if it was not blocked by the opponent. This use of a physical model enables the learning agent to estimate the target being attempted without the shot having to be completed. The accuracy of the physical model can be critical in producing the correct data. Other methods can be used to reduce the reliance on the physical model, such as only considering shots that have actually hit the back wall without having hit any other walls or paddles. To determine which primitive to use the virtual player observes the puck’s position and velocity when it crosses the centerline. It queries the database using a nearest neighbor technique, looking for the previously executed primitive with puck position and velocity closest to the current position and velocity. It then returns that primitive. Obtaining hit parameters The parameters for the hit primitives are the desired hit location, the puck’s desired post-hit velocity, and the target location. Currently these parameters are returned from the nearest neighbor query to the database as part of the selected primitive as explained in the previous section. A future implementation will obtain the parameters by interpolating between parameters of previously executed primitives of the selected type. Finding the right paddle motion The hit parameters encode the motion of the puck before and after the hit. Nowthe virtual player must figure out how to move the paddle to implement this hit. This can be done in many ways. Three methods have been tried; a physical model, neural networks, and kernel regression (Atkeson, Moore, 8z Schaal 1997). The physical model contains a simulation algorithm and computes the required paddle movementsto hit the puck to a desired location with the desired output velocity. The computed movement is the minimummovement needed to obtain the correct hit. Paddle velocity that is perpendicular to the normal of the paddle-puck collision does not affect the puck’s movement. This method ignores puck spin. Using the physical model produces extremely accurate results but does not take into account the information obtained from the observation. The accuracy of the physical model largely determines the results of this method. If a physical model is not available some other method must be used. For the neural network and kernel regression methods, information is extracted from the captured data so as to have the virtual player move the paddle the way that the human moved the paddle to make a shot. A database is again created using critical events and contains the following information: Input: ¯ The XYlocation of the puck when it was hit. ¯ The velocity componentsof the puck when it was hit. ¯ The absolute velocity of the puck just after it is hit. ¯ The position on the back wall that the puck would hit if unobstructed. Output: ¯ The paddle’s velocity components at the time of the collision. ¯ The location of the paddle relative to the puck at the time of contact. This database is used in a learning module that tells the virtual player the paddle’s velocity componentsand relative position that is needed to make the desired shot. This database has been approximated using neural networks and also kernel regression. The query to the learning module is the puck’s velocity and desired hit location, the desired velocity of the puck after it is hit, and the desired location to shoot for on the back wall. The learning module then outputs the information needed by the virtual player to make the shot. The prepare primitive Prepare is the action that is performed by a player when the puck is on the other side of the centerline. In order for the virtual player to learn this behavior from the human player a database is created from the observed data. This database contains the ball’s and the human puck’s position and velocity components during the time whenthe puck is on the other side of the center line. Kernel regression of this data is used to determine what the virtual player will do when the puck is on the other side of the center line. The learning module returns the desired position and velocity of the paddle. Only the desired position is currently being used. The virtual player moves toward this position with a given velocity that is hard-coded. If that point is not reached by the end of the time cycle, the virtual player will ignore it and movetoward the most recent desired point. Multi-Shot primitive If the teacher did not make a hit for the incoming puck parameters the primitive selection query will return the no-hit primitive. In this case something other then a hit must be performed. It may also be that the virtual player attempts a shot but does not make it correctly and the puck does not return to the other side of the board. In these situations the multi-shot primitive is performed. Once the virtual player starts executing this primitive, it will continue until the puck crosses the center line. The database for this primitive contains the velocity and position of the puck whenit is on a player’s side, the position and velocity of the player’s paddle, and the position and velocity of the opponent’s paddle. The position and velocity of the puck and the opponent’s paddle are interpolated using kernel regression. This primitive generates the desired paddle position and velocity for the observed state. Marble Maze Primitive learning is also being explored in the marble maze environment in software, figure 2, and on hardware, figure 3. In the marble maze game a player controls a marble through a maze by tilting the board that the marble is rolling on. The board is tilted using two knobs. There are obstacles, in the form of holes, that the marble may fall into. In both versions the time and the board and ball positions are recorded approximately 30 times a second as a human plays the game. Research in this environment has just begun and this section describes the methodthat will be used for having an agent learn how to play the marble-maze game. RollOffWall(ROW) RollWallStop(RWS) .;.-. Q.... @..... ,C| \ -_., Guide(GID) .:.-’. Land (LND) @ R011 From Wall(RFW) ...--, 2-" 0"" ’"’ I,. _ Figure 5: Primitives used in the marble-maze. Roll OffWall > Land> Roll Wall Stop "::@ .... .. :::;I Figure 6: Connected primitives in the marble-maze. Primitives are extracted from the captured data and a database is created for each primitive. Once again, a human designed the primitives and created an algorithm to find the primitives in the captured data. The following primitives are currently being explored and are shownin figure 5: ¯ Wall Roll Stop (WRS): The ball rolls along a wall and stops when it hits another wall. ¯ Roll Off Wall (ROW):The ball rolls along a wall and then rolls off the end of the wall. ¯ Guide (GID): The ball is rolled from one location another without touching a wall. ¯ Land (LND): The ball lands on a wall. ¯ Roll From Wall (RFW):The ball hits, or is on, a wall and then is maneuveredoff it. Figure 6 shows how three primitives can be chained together. To learn from the captured data, a database will be used in a three step process similar to the one used in the air hockey environment. A database of primitives will be created that contains all previously executed primitives. This database will contain the state of the game at the beginning and end of execution of a primitive. More specifically, the database will contain at least the following information: Inputs: ¯ The XYball position. ¯ The ball’s velocity components. ¯ The board’s angular position Outputs: 44 ¯ The primitive used under these conditions ¯ The primitive parameters such as the length the ball travels during primitive execution and/or the ending velocity of the ball The agent will first query the database to decide which primitive to perform. It will then obtain the parameters needed for the performance of that primitive. Lastly the agent will use the parameters and the database to find out how to perform the selected primitive. To select a primitive to use, the agent will observe the position and velocity of the ball and position of the board and use this as a query into the database. The primitive will be chosen from the database using a single nearest-neighbor approach. The database will then be queried again, focusing only on the primitives of the selected type performed near the given location, to obtain the appropriate parameters for the primitive to be executed. Nowthat the agent knows which primitive to perform and what parameters to use, it needs to know where the board needs to be moved to in order to perform the primitive. The database of the selected primitive will be queried for the needed action at every time step during the primitive execution. This database takes as input the parameters of the primitive that the agent would like to execute and outputs the board movement necessary to successfully perform the primitive. Since the database was created from captured data the board movements generated are based on the movements the human made in that situation. Kernel regression will be used to interpolate data from the database. The number of points and the kernel function that will be used in the regression to obtain desired results will need to be found by experimentation. Discussion There is great deal of work still to do. This section discusses some of the issues that will be addressed soon or may be addressed in the future. How the maze differs from air hockey In air hockey the primitives are indexed using an absolute position, the location on the board. In the marble maze game we have indexed the primitives using a relative position, so as to support generalization. Whether air hockey should use a relative position index or the marble maze game should use &n absolute position index remains to be explored. In air hockey, obtaining the hit parameters and then using this information as a second query to find howto make the hit may be be less effective than looking up how to make the hit in a single query. However, the hit parameters such as target location serve as useful subgoals, and the agent can practice obtaining those subgoals independently from learning from observation. Function approximators and features One issue is finding a good function approximator for the type and distribution of data we typically observe. The number of points and the kernel function to use in kernel regression will be explored. A number of alternative function approximators will be explored. Currently only the nearest data-point is used in deciding which primitive to use and the parameters to use with that primitive. As the research progresses methods to combine a number of nearest neighbors to decide what the correct action should be will be investigated. Other features may be added to the learning method to increase the performance of the agent as it plays the games. The opponent’s paddle position and/or velocity are not being considered when selecting a primitive in the air-hockey domain. The opponents movements may be significant in the way the human moves and should be taken into account the when choosing a primitive. Once the agents are performing at an acceptable level, other ways to learn the task will be explored, such as using reinforcement learning. This will provide a method to compare with learning using primitives. Data can also be captured as the agent attempts to play the game. This data can be parsed in real time for primitives that can be added to the primitive database. Methodsto give the air-hockey virtual player more human-like movementswill also be explored. Primitives The choice of input and output parameters in the creation of the databases affects its performance. The hit database, for example, was originally designed to output the needed position of the paddle for a given position of the puck to perform a hit. Using this implementation the virtual-player consistently made poor shots. Changing to the use of a relative position of the puck and paddle in terms of an angle greatly improved the hit performance. The parameters for the hit primitives are really subgoals for that part of the task. Wecan use the notion of a target for a hit primitive to independently train the hit primitive. Other primitives do not need any parameters other than the current state of the game, since their only objective is to imitate what the humandid in the next time step for the given state of the game, rather than achieve an external result. Future research will clarify the role of subgoals in learning from observation. There is more then one way to implement a primitive. As described earlier, the hit primitive was implemented using a physical model, neural networks and kernel regression. The information needed to perform a primitive can be learned from observed data or from its own trials. In these cases a number of different numerical learning methods may be used. On the other hand it may be that the primitive is very simple and can be hard coded. In the above two environments the state of the game is very simple. In a real game of air hockey the move- ment of the opponent’s body may be significant in determining what moves are made by a player. Discovering what features are relevant or important can be very difficult. For example the head movement may not be important but the eye movement could be very important. It is important that the definition of each primitive support segmentation, selection, parameter generation, and execution. If the primitive is poorly defined it may be difficult to find in the observed data and maybe difficult to segment from other primitives. The primitive should be defined in terms of critical events or certain environment states. Primitives also need to have all the degrees of freedom to perform the task, but not extraneous degrees of freedom. Defining primitives is an iterative process. Once a set of primitives are defined and tried out, they must then be evaluated. Someprimitives may be changed or deleted. Newprimitives may need to be added. The virtual player will play like the teacher, making the same mistakes as the teacher, and may never discover that there may be a better way to perform for an observed state. Whena primitive is found in the training data, the state under which it is performed is recorded. This information allows us to discover what the teacher actually did. It may be that this was not what the teacher had planned to do. If the teacher consistently makeerrors, it will appear that the incorrect primitive should be performed. Automatically finding primitives In our research, humans using knowledge of the domain select the primitives to be used. The selected primitives are basic and hard lines separate one primitive from another. But in reality this is not true and learning primitives is very difficult for the following reasons. ¯ Variability - a primitive may not be performed the same way each time. * Blending/co-articulation - primitives may blend into each other. The line separating one primitive from another may change over time. The way the primitive is performed may also change over time. Perceptual confusion - a primitive may be confused for a different one. Do not have segmented data - the data is not easily segmented into primitives. Conclusions A virtual air-hockey game and a virtual and hardware marble-maze game have been created that allow position data to be captured while the games are being played. Humans, using domain knowledge, select primitives to use and create software needed to parse the captured data and create primitive databases. The agent then uses these databases to perform the task. A virtual air-hockey player has learned a shot strategy, how to hit, and prepare from observing a human. An agent is being created that will learn how to play the marble-maze game in a similar manner. Acknowledgments Support for both investigators was provided by the ATRHuman Information Processing Research Laboratories and by National Science Foundation AwardIIS9711770. The demonstration maze program that is included with OpenInventor was used as a basis for the virtual marble maze program. References An, C. H.; Atkeson, C. G.; and Hollerbach, J. M. 1988. Model-Based Control of a Robot Manipulator. Cambridge, MA:MIT Press. Arkin, R. C. 1998. Behavior-Based Robotics. Cambridge, MA:MIT Press. Atkeson, C. G., and Schaal, S. 1997a. Learning tasks from a single demostration. In Proceedings of the 1997 IEEE International Conference on Robotics and Automation (ICRA97), 1706-1712. Atkeson, C. G., and Schaal, S. 1997b. Robot learning from demonstration. In D. H. Fisher, J., ed., Proceedings of the 1997 International Conference on Machine Learning (ICML97), 12-20. Morgan Kaufmann. Atkeson, C. G.; Moore, A. W.; and Schaal, S. 1997. Locally weighted learning. Artificial Intelligence Review 11:11-73. Bakker, P., and Kuniyoshi, Y. 1996. Robot see, robot do: An overview of robot imitation. In AISB96 Workshop on Learning in Robots and Animals, 3-11. Bishop, B. E., and Spong, M. W. 1999. Vision-based control of an air hockey robot. IEEE Control Systems Magazine 19(3):23-32. Dillmann, R.; Friedrich, H.; Kaiser, M.; and Ude, A. 1996. Integration of symbolic and subsymbolic learning to support robot programming by human demonstration. In Giralt, G., and Hirzinger, G., eds., Robotics Research: The Seventh International Symposium, 296--307. Springer, NY. Hayes, G., and Demiris, J. 1994. A robot controller using learning by imitation. In A. Borkowski and J. L. Crowley (Eds.), Proceedings of the 2nd International Symposium on Intelligent Robotic Systems, 198-204. Hirzinger, G. 1996. Learning and skill acquisition. In Giralt, G., and Hirzinger, G., eds., Robotics Research: The Seventh International Symposium, 277-278. Springer, NY. Ikeuchi, K.; Miura, J.; Suehiro, T.; and Conanto, S. 1996. Designing skills with visual feedback for APO. In Giralt, G., and Hirzinger, G., eds., Robotics Research: The Seventh International Symposium, 308-320. Springer, NY. Kuniyoshi, Y.; Inaba, M.; and Inoue, H. 1994. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. In 46 IEEE Transactions on Robotics and Automation, 799822. Labyrinth. 1999. Labyrinth game obtained from the Pavilion company. Ohshima, T.; Satoh, K.; Yamamoto, H.; and Tamura, H. 1998. AR2hockey: A case study of collaborative augmented reality. In IEEE Virtual Reality Annual International Symposium, 268-275. Schmidt, R. A. 1988. Motor Learning and Control. Champaign, IL: HumanKinetics Publishers. Spong, M. W. 1999. Robotic air hockey. http://cyclops.csl.uiuc.edu.