Action Recognition Through an Action Generation Mechanism

Barış Akgün bakgun@ceng.metu.edu.tr
Doruk Tunaoğlu doruk@ceng.metu.edu.tr
Erol Şahin erol@ceng.metu.edu.tr
Kovan Research Lab., Computer Engineering Department, Middle East Technical University, Ankara / Turkey

Abstract

In this paper, we studied how the actions of others can be recognized by the very same mechanisms that generate similar actions. We used action generation mechanisms to simulate actions and compare the resulting trajectories with the observed one as the action unfolds. Specifically, we modified Dynamic Movement Primitives (DMP) for online action recognition and used them as our action generation mechanism. A human demonstrator applied three different actions on two different objects. Recordings of these actions were used to train DMPs and test our approach. We showed that our system is capable of online action recognition within approximately the first one-third of the observed action, with a success rate of over 90%. The online capabilities of the recognition system were also tested in an interactive game with the robot.

1. Introduction

Action recognition, the understanding of what a human (or another robot) is doing, is an essential competence for a robot to interact and communicate with humans. Nature might have already provided an elegant way for action recognition through mirror neurons. These neurons, discovered in area F5 (and some other areas) of the premotor cortex of macaque monkeys, fire both during the generation and the observation of goal-directed actions (Rizzolatti et al., 1996; Gallese et al., 1996). Grasping is a prominent example of such actions. Grasping-related mirror neurons of a macaque monkey are active both when the monkey grasps a piece of food and when a human or another monkey does the same thing. In this context, goal-directed means that the action is directed towards an object (the food in the previous example). These neurons do not respond when the observed agent tries to grasp the air where there is no object. This dual response characteristic gave rise to the action recognition hypothesis of mirror neurons: action generation and recognition share the very same neural circuitry, and the motor system is activated during action recognition. This paper proposes an online action recognition scheme that uses an action generation mechanism, taking its inspiration from mirror neurons.

2. Related Work

Since the discovery of mirror neurons, there have been various modelling efforts and studies related to integrated action generation-recognition approaches. Some computational models were developed to explain and/or mimic the mirror neuron system, while others were developed to be used directly in robotics (e.g., for action generation). On the robotics side, these models mostly deal with action generation, action learning and imitation, mentioning recognition as a side study. A review of the computational models of imitation learning is given in (Schaal et al., 2003a). In addition, (Oztop et al., 2006) presents a review of imitation methods and their relation to mirror neurons. This literature survey will present computational models that could be used as integrated action generation-recognition mechanisms.

The Recurrent Neural Network with Parametric Biases (RNNPB) is a modified Jordan-type recurrent neural network with special neurons in its input layer that can act both as input and output, depending on the context. These neurons constitute the parametric bias (PB) vector (Tani, 2003).
RNNPBs are used to learn, generate and recognize actions (Tani et al., 2004). PB vectors are calculated per action, but the weights of the network are shared across different actions. For generation, the learned PB vector of the desired action is given as input to the network. However, goal setting is not possible. The recognition mode is similar to the learning mode: PB vectors are calculated from the observed action, and the calculated PB vector is then compared with the existing vectors for recognition. PB vectors are thus used as output in the learning and recognition modes and as input in the generation mode.

The Mental State Inference (MSI) model (Oztop et al., 2005) is a motor control model augmented with an action recognition scheme. The action generation mode incorporates a forward model to compensate for sensory processing delays. This forward model is exploited for action recognition: the actor is assumed to be performing a certain action, and the observer carries out a mental simulation of that action using the forward model. The simulated and observed action trajectories are compared, and the assumed action is updated according to the error.

Modular Selection and Identification for Control (MOSAIC) (Wolpert and Kawato, 1998; Haruno et al., 2001) is an architecture for adaptive motor control. Given a desired trajectory, this architecture creates the necessary motor commands to achieve it. Controller-predictor pairs compete and cooperate to handle a given control task in the action generation mode. The contribution of a pair to the total control is proportional to its predictor's performance, and the predictors' performances are represented with responsibility signals. For action recognition, an observed trajectory is fed to the system as the desired trajectory and the responsibility signals are calculated, which are then compared with the ones in the observer's memory.

The Hierarchical Attentive Multiple Models for Execution and Recognition (HAMMER) architecture was first used for imitation in (Demiris and Hayes, 2002) and for recognition in (Demiris and Khadhouri, 2006). It consists of multiple forward and inverse models, as in MOSAIC. However, only a single pair handles the control task; the pair is selected according to its forward model's prediction performance, called the confidence value. For recognition, the observed trajectory is given to the architecture and the confidence values are calculated. The action with the highest confidence value is the recognized one.

Dynamic Movement Primitives (DMP) (Ijspeert et al., 2001; Schaal et al., 2003b) are nonlinear dynamic systems used for imitation learning, action generation and recognition. Imitation learning consists of estimating the parameters of a function approximator; these parameters are stored to be used during action generation. For action generation, the goal and the corresponding learned parameters are fed to the system. Recognition with DMPs is done by treating the observed action as a new one to be learned (Ijspeert et al., 2003); the calculated parameters are then compared with the ones in the observer's repertoire.

The aforementioned architectures have some drawbacks for action recognition. RNNPBs are limited to offline recognition since PB vectors are calculated iteratively and the whole action needs to be observed. DMPs also need to observe the whole action to calculate its parameters and use them for recognition. Depending on the implementation, MOSAIC can suffer from a similar problem.
The MOSAIC and MSI methods use the instantaneous error to update responsibility signals and action estimates respectively, which is prone to noise. The MOSAIC, HAMMER and MSI architectures do not provide a way for imitation learning, and recognition is done on the assumption that the observer's and the actor's actions are similar. Goal-setting is not available for the RNNPB approach, which limits action recognition to a single goal.

3. System Architecture

In this section, we first describe the DMP framework and then present how we extended DMPs for online action recognition.

3.1 Dynamic Movement Primitives

A DMP generates a trajectory x(t) by defining a differential equation for ẍ(t). There is more than one DMP formulation; the one we are interested in is

ẍ = K(g − x) − Dẋ + K f(s, w),   (1)

where K is the spring constant, D is the damping coefficient, g is the goal position, s is the phase variable, w is the parameter vector of the function approximator f, and x, ẋ and ẍ are respectively the position, velocity and acceleration in the task space. The phase variable s is defined to decay from 1 to 0 in time as:

s(t0) = 1,   (2)
ṡ = −αs,   (3)
⟹ s = e^(−α(t − t0)),   (4)

where α is calculated by forcing s = sf at the end of the demonstrated behavior. Equation 1 defines a DMP for a one-dimensional system. For a multi-dimensional system, a separate DMP is learned for each dimension, bound via the common phase variable s. Note that DMPs can be defined in the joint space (with joint angles being the variables) as well as in the Cartesian space.

We can analyze the behavior of the system in two parts: the canonical part (K(g − x) − Dẋ), which is inspired by mass-spring-damper equations, and the non-linear part (K f(s, w)), which perturbs the mass-spring-damper system. The canonical part has a globally asymptotically stable equilibrium point at {x, ẋ} = {g, 0}. In other words, given any initial condition, the system is guaranteed to converge to g with zero velocity as time goes to ∞. Moreover, if we ensure D = 2√K, the system will be critically damped and will converge to the equilibrium point as quickly as possible without any oscillation. For a multi-dimensional system, the canonical part generates a simple linear trajectory between the starting point and the goal point in the task space. The non-linear part perturbs the canonical part in order to generate complex trajectories. For this, the perturbing function f(s, w) learns to approximate the additional input, on top of the dynamics generated by the canonical part, that is necessary to follow a given trajectory; its parameters w are estimated from demonstration.
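To make the formulation concrete, the following is a minimal sketch, not taken from the authors' implementation, of how a one-dimensional DMP of the form of equation 1 could be integrated numerically. The names (dmp_rollout, forcing), the Euler integration and the step size are our own assumptions; the learned forcing term is simply assumed to be available as a callable.

```python
import numpy as np

def dmp_rollout(x0, g, forcing, K=10.0, duration=2.0, dt=1.0 / 30.0, s_f=0.01):
    """Integrate the DMP of equation 1 with simple Euler steps.

    x0      : initial position
    g       : goal position
    forcing : learned function f(s) for this dimension (returns a scalar)
    K       : spring constant; D = 2*sqrt(K) gives critical damping
    s_f     : value the phase variable should reach at the end of the motion
    """
    D = 2.0 * np.sqrt(K)                 # critical damping, no oscillation
    alpha = -np.log(s_f) / duration      # so that s decays from 1 to s_f (equation 4)
    x, xd, s = x0, 0.0, 1.0
    trajectory = [x]
    for _ in range(int(duration / dt)):
        xdd = K * (g - x) - D * xd + K * forcing(s)   # equation 1
        xd += xdd * dt
        x += xd * dt
        s += -alpha * s * dt                          # equation 3
        trajectory.append(x)
    return np.array(trajectory)

# With a zero forcing term the rollout reduces to the canonical part:
# a critically damped reach from x0 straight to g.
traj = dmp_rollout(x0=0.0, g=1.0, forcing=lambda s: 0.0)
```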
3.2 Learning by Demonstration

For our case, learning by demonstration (LbD) refers to teaching a robot an action by demonstrating that action (as opposed to programming it). The presented recognition scheme requires that the actor and the observer have similar action generation mechanisms, and LbD is used to ensure this. For DMPs, LbD is done by calculating the parameters w from a demonstrated action. A demonstrated action has a different trajectory for each of its dimensions. Given the trajectory x(t) of each dimension, the corresponding w can be calculated as follows. First, the phase variable s(t) is calculated from equation 4, where α is obtained by forcing a final value (sf) on the phase variable. Then, for each of the dimensions, f(t) is calculated as:

f(t) = (ẍ(t) − K(g − x(t)) + Dẋ(t)) / K.   (5)

Since s(t) and f(t) are calculated, and since s(t) is a monotonically decreasing function between 1 and 0, f(s) is also known for s ∈ [0, 1]. Finally, w for each dimension is estimated by linear regression using radial basis functions, where the s values are treated as inputs and the corresponding f(s) values are treated as targets.

3.3 Extending DMPs for Online Recognition

DMPs have some major problems for online recognition, mostly because of the non-linear part:

1. The lack of an online recognition scheme.
2. The use of a phase variable that runs in an open-loop fashion (i.e., not related to x or ẋ).
3. A consequence of the second problem: α in equations 3 and 4 has to be calculated from the end of the motion, which prevents recognition from being done in an online fashion.
4. Another consequence of the second problem: s depends on the starting time of an action (equation 4), which may not be observed or estimated by an observer since it is an internal parameter of the actor.

In order to circumvent these problems, we replaced f(s, w) with f(z, w) in equation 1, where the vector z consists of variables that are calculated from the current state only. We chose to represent our state in Cartesian space instead of joint space. The reasoning behind this is that it is easier to observe the end-effector of the actor in Cartesian space than the angular positions of the actor's joints. We used object-centered Cartesian positions and velocities as the variables of z. The object-centered representation implicitly sets g to zero for all dimensions. Such a choice of hand-object relations is inspired from (Oztop and Arbib, 2002; Oztop et al., 2005). Using hand-object relations for generation and recognition greatly reduces actor-observer differences and allows the seamless use of generation mechanisms in recognition. As stated before, our hand-object relations are defined as the relative position and velocity of the end-effector with respect to the object.

Learning the non-linear part means learning a mapping from z to f, where the elements of f are the f values of each DOF. Given N recorded trajectories xi(t) for an action, each trajectory is translated to an object-centered reference frame. Then, ẋi(t) and ẍi(t) are calculated through simple differentiation, fi(t) is calculated using equation 5, and zi(t) is constructed by concatenating xi(t) and ẋi(t). Finally, a feed-forward neural network is trained by feeding all z values and the corresponding f values as inputs and target values, respectively.

3.4 Recognition

The aim of the presented approach is to recognize the observed action, and the object being acted upon, in an online manner. For recognition, the initial state observations are given to the action generation systems as initial conditions and future trajectories are simulated. These are then compared with the observed trajectory by calculating the cumulative error:

err_i(tc) = Σ_{t=t0}^{tc} ||x_o(t) − x_i(t)||,   (6)

where err_i is the ith behavior's cumulative error, t0 is the observed initial time, tc is the current time, x_o is the observed position and x_i is the ith behavior's simulated position. Then, to have a quantitative measure of similarity between actions, we define the recognition signals as:

rs_i(tc) = e^(−err_i(tc)) / Σ_j e^(−err_j(tc)).   (7)

Recognition signals can be interpreted as the likelihood of the observed motion being the corresponding simulated action. They are similar to the responsibility signals in (Haruno et al., 2001); however, the instantaneous error is used in the responsibility signal calculation, which is prone to noise, and the authors also point this out as problematic. During the recognition phase, if the ratio of the highest recognition signal to the second highest gets above a certain threshold (Υ), the recognition decision is made and the behavior corresponding to the highest signal is chosen.

In case there are multiple objects in the environment, behaviors are simulated for each object, by calculating the hand-object relations accordingly. If we have m behaviors and n objects in the scene, then m × n behaviors are simulated and compared. This also allows us to recognize which object is being acted upon. We do not learn a separate network for the same behavior on different objects: we use the same neural network for learning and generating the trajectory of one behavior on all objects, only changing its inputs according to the object. The flow-chart of the overall recognition architecture is depicted in figure 1.

Figure 1: Recognition architecture flow diagram. The motion capture system tracks the possible objects and the end-effector, from which the state variables are calculated. When recognition starts, the observed state is used as the initial values of the learned behaviors (not shown). The learned behaviors (i.e., generation systems) then simulate future trajectories. As the action unfolds, observed and simulated trajectories are compared and recognition signals are calculated. From these signals, the recognition decision is made according to a threshold.
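As a rough illustration of equations 6 and 7 and of the thresholded decision rule described above, the sketch below accumulates the simulation error of every candidate behavior-object pair as observations arrive, turns the cumulative errors into recognition signals, and decides once the ratio of the two largest signals exceeds Υ. The interfaces (recognize, a candidate simulator exposing step()) are hypothetical and are only meant to convey the structure of the simulate-observe-compare loop.

```python
import numpy as np

def recognize(observed_stream, candidates, Y=1.9):
    """Online recognition over m x n candidate behavior-object simulators.

    observed_stream : iterable of observed end-effector positions (3-vectors)
    candidates      : list of simulators, each initialised from the observed
                      initial state and exposing step() -> next predicted position
    Y               : threshold on the ratio of the two highest recognition signals
    """
    cum_err = np.zeros(len(candidates))
    rs = np.full(len(candidates), 1.0 / len(candidates))   # initial value 1/(m*n)
    for t, x_obs in enumerate(observed_stream):
        for i, cand in enumerate(candidates):
            x_sim = cand.step()                              # simulated position at time t
            cum_err[i] += np.linalg.norm(x_obs - x_sim)      # equation 6
        w = np.exp(-(cum_err - cum_err.min()))               # shifted for numerical stability
        rs = w / w.sum()                                     # equation 7
        first, second = np.sort(rs)[-2:][::-1]
        if second == 0.0 or first / second > Y:
            return int(np.argmax(rs)), t                     # decision made at time step t
    return int(np.argmax(rs)), t                             # action ended before a decision
```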
4. Experimental Framework

4.1 Setup

There are two objects, a motion capture system, an actor (human) and an observer (robot) in our setup. The actor and the observer face each other and the objects hang in between, as shown in figure 2. The motion capture system is used for tracking the centers of the objects and the end-effector of the actor.

We used the Visualeyez VZ 4000 motion capture system (Phoenix Technologies Incorporated, http://www.ptiphoenix.com/VZmodels.php). This device measures the 3D positions of active markers, which are attached to the points of interest. In our setup, the markers are attached to the centers of the objects and to the right wrist (the end-effector) of the actor, as shown in figure 2. The wrist was chosen to minimize occlusion during tracking. The frequency of the sensor is set to 30 Hz per marker.

Figure 2: Experimental setup

We used three reaching behaviors:
• R(X): Reaching object X from the right
• D(X): Reaching object X directly
• L(X): Reaching object X from the left
which can be applied to one of the two objects:
• LO: Left object (with respect to the actor)
• RO: Right object (with respect to the actor)

The use of multiple behaviors on multiple objects aims to show that the proposed method can learn and recognize different behaviors for different goals. These behaviors and objects are depicted in figure 3. The actor can apply one of the behaviors to one of the objects; thus, the recognition decision has to be made among six choices. Note that we do not use a different neural network for every behavior-object pair, but a different one for each behavior. The inputs given to the same network change between objects since the hand-object relations are different.

Figure 3: Depiction of behaviors and approximate geometric information from a top view of the setup

4.2 Experiments

The actor performed 50 repetitions of each behavior on each object (a total of 300 repetitions), and the hand trajectories and the object centers were recorded with the motion capture system. The beginning and the end of the recordings are truncated automatically to capture the start and the end of the actual motion, using a velocity threshold (0.05 m/s). Samples from the recorded trajectories are shown in figure 4. 60% of the processed recordings are used for training and the rest are left for testing. For different threshold (Υ) values, we recorded the decision time for each testing sample and measured the success rates. We selected K = 10 and D = 2√10 empirically. The simulate-observe-compare loop of the approach is run at the same frequency as the motion capture system, 30 Hz.
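The automatic truncation of the recordings mentioned above might look roughly like the following. This is only a sketch under the assumption that each recording is available as an array of 3D wrist positions sampled at 30 Hz; the function name and the off-by-one handling are our own.

```python
import numpy as np

def truncate_recording(positions, dt=1.0 / 30.0, v_threshold=0.05):
    """Trim the idle portions at the beginning and end of a recording.

    positions   : (T, 3) array of end-effector positions sampled at 30 Hz
    v_threshold : speed threshold in m/s below which the hand is considered still
    """
    speeds = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    moving = np.where(speeds > v_threshold)[0]
    if moving.size == 0:
        return positions                      # no motion detected; return unchanged
    start, end = moving[0], moving[-1] + 1
    return positions[start : end + 1]
```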
Figure 4: A subset of recorded human trajectories

5. Results and Discussions

We plot the time evolution of the recognition signals for nine different recordings in figure 5. In all plots, the recognition signals start from the same initial value, 1/6, corresponding to the 6 possible choices (3 actions × 2 objects). As the action unfolds, the recognition signal of the corresponding action goes to one while suppressing the others. Although there may be confusion in the beginning parts of an action (e.g., the top right plot), our system recognizes it correctly as it unfolds.

Figure 5: The time evolution of recognition signals. The x-axis is time in seconds and the y-axis is recognition signal magnitude.

According to the chosen threshold value Υ, there is a trade-off between the decision time and the success rate. To optimize Υ, we recorded success rates and mean decision times (for correct decisions) for different Υ values, which are shown in figure 6. We chose Υ as 1.9 to obtain a 90% recognition rate. On average, the system makes a decision when 33% of the observed action is completed with this chosen Υ value. Figure 7 plots the histogram of the decision times for this Υ value.

Figure 6: Recognition rate and decision time (percentage completed) vs. threshold value Υ. Percentage completed is defined as the decision time divided by the action completion time.

Figure 7: Distribution of decision times (percentage completed) for Υ = 1.9

Table 1 shows the confusion matrix for the chosen Υ value. The cases where the object acted upon is not correctly decided are few: 25% of wrong decisions and 2.5% of all decisions. Since we want to make the recognition before the action is completed, there are confusions between reaching right of the left object and reaching left of the right object (see figure 3). Also, there is a confusion between reaching directly and reaching from the left for the left object. These are expected since human motion has noise and variance between repetitions, and a demonstrator may not give the initial curvature expected from the action every time (see figure 4).

6. Robot Demonstration

To demonstrate that our system works in an online fashion, we tested it in an interactive game with the humanoid robot iCub. The robot's head and eyes have a total of 6 DOF and its arms have 7 DOF each, excluding the hands (Metta et al., 2007). It also has facial expressions, implemented by LEDs and moving eyelids.

Figure 8: Demonstrations with the robot. Each row shows a different demonstration. The first column shows the starting point of the action. The second column shows the point where the system recognizes the action (indicated by the eye-blink). The third column shows the point where the demonstrator finishes his action, and the last column shows the point where the robot finishes its action.
Table 1: Confusion matrix for Υ = 1.9 (recognition rate = 90%)

                         Object      LO                RO
                         Behavior    R    D    L       R    D    L
  Object   Behavior
  LO       R                        17    0    0       0    0    0
           D                         1   15    2       0    0    0
           L                         0    4   18       0    0    0
  RO       R                         0    0    0      18    0    0
           D                         0    0    0       1   20    0
           L                         2    1    0       1    0   20

For our demonstration, the robot was seated in front of the demonstrator, on the other side of the objects, as shown in figure 2. The interactive game is as follows: the actor applies one of the actions to one of the objects. The robot raises its eyebrows and blinks as soon as it recognizes the action, and reacts by turning its head towards the predicted object. The robot also performs a hand-coded counter action, which is defined as bringing its hand to the opposite side of the object. The reason for using hand-coded actions is that action generation is out of the scope of this paper; DMPs have already been shown to have good generation characteristics. Snapshots from a captured video of the game can be seen in figure 8. These figures show that our action recognition method can be used for online interaction with robots.

7. Conclusion

We demonstrated an online recognition architecture which can recognize an action before it is completed. Our recognition architecture is based on a generation system in which we modified DMPs to overcome some of their shortcomings. Our modifications remove the dependence of DMPs on the initial position and time (i.e., internal parameters). Also, we changed the recognition approach of (Ijspeert et al., 2003) to allow for online recognition. We defined recognition signals to provide a quantitative way of measuring the similarity between actions. The temporal profile of our recognition signals is similar to mirror neuron responses, although more experiments and comparisons should be done to claim functional similarity.

Although seemingly similar, our work differs from HAMMER and MOSAIC in major aspects. We are working in trajectory space and our controller and trajectory generator are decoupled; i.e., we do not need to calculate motor commands for recognition. Our architecture has the ability to learn actions by demonstration, which is absent in HAMMER and MOSAIC. Both in HAMMER and MOSAIC, it is assumed that the actor and the observer have similar internal models, which cannot be guaranteed. HAMMER is more suited to recognizing straight-line trajectories; although a solution is proposed in (Demiris and Simmons, 2006) by using via-points, everything needs to be calculated offline. Our recognition metric, the recognition signal, is robust against noise since it uses the cumulative error, unlike MOSAIC and MSI.

Acknowledgements

This research was supported by the European Commission under the ROSSI project (FP7-216125) and by TÜBİTAK under project 109E033. Barış Akgün acknowledges the support of the TÜBİTAK 2228 graduate student fellowship and Doruk Tunaoğlu acknowledges the support of the TÜBİTAK 2210 graduate student fellowship.

References

Demiris, J. and Hayes, G. (2002). Imitation as a dual-route process featuring predictive and learning components; a biologically-plausible computational model. In Dautenhahn, K. and Nehaniv, C., (Eds.), Imitation in Animals and Artifacts. MIT Press.

Demiris, Y. and Khadhouri, B. (2006). Hierarchical attentive multiple models for execution and recognition of actions. Robotics and Autonomous Systems, 54:361–369.

Demiris, Y. and Simmons, G. (2006). Perceiving the unusual: Temporal properties of hierarchical motor representations for action perception. Neural Networks, 19:272–284.
Gallese, V., Fadiga, L., Fogassi, L., and Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119(2):593–609.

Haruno, M., Wolpert, D. M., and Kawato, M. (2001). MOSAIC model for sensorimotor learning and control. Neural Computation, 13(10):2201–2220.

Ijspeert, A., Nakanishi, J., and Schaal, S. (2001). Trajectory formation for imitation with nonlinear dynamical systems. In Proceedings of the 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 752–757.

Ijspeert, A. J., Nakanishi, J., and Schaal, S. (2003). Learning attractor landscapes for learning motor primitives. In Becker, S., Thrun, S., and Obermayer, K., (Eds.), Advances in Neural Information Processing Systems, volume 15, pages 1547–1554. MIT Press.

Metta, G., Sandini, G., Vernon, D., Natale, L., and Nori, F. (2007). The iCub cognitive humanoid robot: An open-system research platform for enactive cognition. In 50 Years of Artificial Intelligence, pages 358–369. Springer Berlin / Heidelberg.

Oztop, E. and Arbib, M. A. (2002). Schema design and implementation of the grasp-related mirror neuron system. Biological Cybernetics, 87:116–140.

Oztop, E., Kawato, M., and Arbib, M. (2006). Mirror neurons and imitation: A computationally guided review. Neural Networks, 19:254–271.

Oztop, E., Wolpert, D., and Kawato, M. (2005). Mental state inference using visual control parameters. Cognitive Brain Research, 22:129–151.

Rizzolatti, G., Fadiga, L., Gallese, V., and Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3:131–141.

Schaal, S., Ijspeert, A., and Billard, A. (2003a). Computational approaches to motor learning by imitation. Philosophical Transactions: Biological Sciences, 358:537–547.

Schaal, S., Peters, J., Nakanishi, J., and Ijspeert, A. (2003b). Learning movement primitives. In International Symposium on Robotics Research. Springer.

Tani, J. (2003). Learning to generate articulated behavior through the bottom-up and the top-down interaction processes. Neural Networks, 16:11–23.

Tani, J., Ito, M., and Sugita, Y. (2004). Self-organization of distributedly represented multiple behavior schemata in a mirror system: reviews of robot experiments using RNNPB. Neural Networks, 17:1273–1289.

Wolpert, D. and Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11:1317–1329.