Action Recognition Through an Action Generation Mechanism

Barış Akgün
bakgun@ceng.metu.edu.tr
Doruk Tunaoğlu
doruk@ceng.metu.edu.tr
Erol Şahin
erol@ceng.metu.edu.tr
Kovan Research Lab.
Computer Engineering Department,
Middle East Technical University,
Ankara / Turkey
Abstract
In this paper, we studied how actions of
others can be recognized by the very same
mechanisms that generate similar actions. We
used action generation mechanisms to simulate actions and compare resulting trajectories with the observed one as the action
unfolds. Specifically, we modified Dynamic
Movement Primitives (DMP) for online action
recognition and used them as our action generation mechanism. A human demonstrator
applied three different actions on two different objects. Recordings of these actions were
used to train DMPs and test our approach.
We showed that our system is capable of online action recognition within approximately
the first one-third of the observed action, with a success rate of over 90%. The online capabilities
of the recognition system were also tested in
an interactive game with the robot.
1. Introduction
Action recognition, the understanding of what a human (or another robot) is doing, is an essential competence for a robot to interact and communicate
with humans. Nature might have already provided
an elegant way for action recognition through mirror neurons. These neurons, discovered in the area
F5 (and some other areas) of the premotor cortex
of macaque monkeys, fire both during the generation and the observation of goal-directed actions
(Rizzolatti et al., 1996, Gallese et al., 1996). Grasping is a prominent example of these actions. Grasping-related mirror neurons of a macaque monkey are
active both when the monkey grasps a piece of food
and when a human or another monkey does the same
thing. In this context, goal-directed means that the action is directed towards an object, the food in the previous example. These neurons do not respond when the observed agent grasps at the air where there is no object.
This dual response characteristic gives rise to the action recognition hypothesis of mirror neurons: action generation and recognition share the very same neural circuitry, and the motor system is activated during action recognition. This paper proposes an online action recognition scheme that uses an action generation mechanism, taking its inspiration from mirror neurons.
2. Related Work
There have been various modelling efforts and studies related to integrated action generation-recognition approaches since the discovery of mirror neurons. Some computational models were developed to explain and/or mimic the mirror neuron system, and others to be used directly in robotics (e.g., for action generation). On the robotics side, these models mostly deal with action generation, action learning and imitation, mentioning recognition only as a side study. A review of computational models of imitation learning is given in (Schaal et al., 2003a). In addition, (Oztop et al., 2006) presents a review of imitation methods and their relations to mirror neurons. This literature survey presents computational models that could be used as integrated action generation-recognition mechanisms.
Recurrent Neural Network with Parametric Biases (RNNPB) is a modified Jordan-type recurrent neural network with special neurons in its input layer that can act both as input and output, depending on context. These neurons constitute the
parametric bias (PB) vector (Tani, 2003). RNNPBs
are used to learn, generate and recognize actions
(Tani et al., 2004). PB vectors are calculated per
action but weights of the network are shared across
different actions. For generation, learned PB vectors
are input to the network according to the desired
action. However, goal setting is not possible. Recognition mode is similar to learning mode; PB vectors
are calculated from the observed action. Then the
calculated PB vector is compared with the existing
vectors for recognition. PB vectors are used as output in the learning and recognition mode and used
as input in the generation mode.
The Mental State Inference (MSI) model (Oztop et al., 2005) is a motor control model augmented with an action recognition scheme. The action generation mode incorporates a forward model to compensate for sensory processing delays, and this forward model is exploited for action recognition. The actor is assumed to perform a certain action, and a mental simulation of that action is carried out by the observer using the forward model. The simulated and observed action trajectories are compared and, according to the error, the assumed action is updated.
Modular Selection and Identification for Control (MOSAIC) (Wolpert and Kawato, 1998, Haruno et al., 2001) is an architecture for adaptive motor control. Given a desired trajectory, this architecture creates the necessary motor commands to achieve it. In action generation mode, controller-predictor pairs compete and cooperate to handle a given control task. The contribution of a pair to the total control is proportional to its predictor's performance. The predictors' performances are represented with responsibility signals. For action recognition, an observed trajectory is fed to the system as the desired trajectory and the responsibility signals are calculated, which are then compared with the ones in the observer's memory.
The Hierarchical Attentive Multiple Models for Execution and Recognition (HAMMER) architecture was first used for imitation in (Demiris and Hayes, 2002) and for recognition in (Demiris and Khadhouri, 2006). It consists of multiple forward and inverse models, as in MOSAIC. However, only a single pair handles the control task, selected according to its forward model's prediction performance, called the confidence value. For recognition, the observed trajectory is given to the architecture and the confidence values are calculated. The action with the highest confidence value is the one recognized.
Dynamic Movement Primitives (DMP) (Ijspeert et al., 2001, Schaal et al., 2003b) are non-linear dynamic systems which are used for imitation learning, action generation and recognition. Imitation learning consists of estimating the parameters of a function approximator, and these parameters are stored to be used during action generation. For action generation, the goal and the corresponding learned parameters are fed to the system. Recognition with DMPs is done by treating the observed action as a new one to be learned (Ijspeert et al., 2003). Then the calculated parameters are compared with the ones in the observer's repertoire.
There are some drawbacks of the aforementioned architectures for action recognition. RNNPBs are limited to offline recognition since PB vectors are calculated iteratively and all of the action needs to be observed. DMPs also need to observe the whole action to calculate its parameters and use them for recognition. Depending on the implementation, MOSAIC can suffer from a similar problem. The MOSAIC and MSI methods use the instantaneous error to update responsibility signals and action estimates respectively, which is prone to noise. The MOSAIC, HAMMER and MSI architectures do not provide a way for imitation learning, and recognition is done on the assumption that the observer's and the actor's actions are similar. Goal-setting is not available in the RNNPB approach, which limits action recognition to a single goal.
3. System Architecture
In this section, we first describe the DMP framework
and then present how we extended DMPs for online
action recognition.
3.1 Dynamical Movement Primitives
A DMP generates a trajectory x(t) by defining a differential equation for ẍ(t). There is more than one DMP formulation; the one we are interested in is

    ẍ = K(g − x) − Dẋ + K f(s, w),    (1)

where K is the spring constant, D is the damping coefficient, g is the goal position, s is the phase variable, w is the parameter vector of the function approximator f, and x, ẋ and ẍ are respectively the position, velocity and acceleration in the task space.
The phase variable s is defined to decay from 1 to 0 in time as:

    s0 = 1,    (2)
    ṡ = −αs,    (3)
    ⟹ s = e^(−α(t − t0)),    (4)

where α is calculated by forcing s = sf at the end of the demonstrated behavior.
Equation 1 defines a DMP for a one-dimensional system. For a multi-dimensional system, a separate DMP is learned for each dimension, bound via the common phase variable s. Note that DMPs can be defined in the joint space (with joint angles as the variables) as well as in the Cartesian space.
We can analyze the behavior of the system in two parts: the canonical part, K(g − x) − Dẋ, which is inspired by mass-spring-damper equations, and the non-linear part, K f(s, w), which perturbs the mass-spring-damper system.
The canonical part has a globally asymptotically stable equilibrium point at {x, ẋ} = {g, 0}. In other words, given any initial condition the system is guaranteed to converge to g with zero velocity as time goes to ∞. Moreover, if we ensure D = 2√K then the system will be critically damped and it will converge to the equilibrium point as quickly as possible without any oscillation.
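As a side note, the critical-damping condition can be checked from the homogeneous part of equation 1; a brief sketch of the standard derivation (not given in the original text):

```latex
% Homogeneous part of equation 1 (f = 0), written for the error e = x - g:
\ddot{e} + D\dot{e} + Ke = 0
\;\Rightarrow\; \lambda^2 + D\lambda + K = 0, \qquad
\lambda_{1,2} = \tfrac{1}{2}\left(-D \pm \sqrt{D^2 - 4K}\right)
% Critical damping: the discriminant vanishes, i.e. D^2 = 4K, hence D = 2\sqrt{K}.
```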
For a multi-dimensional system, the canonical part
generates a simple linear trajectory between the
starting point and the goal point in the task space.
The non-linear part perturbs the canonical part in order to generate complex trajectories. To this end, the perturbing function f(s, w) learns to approximate the additional input, beyond the dynamics generated by the canonical part, that is necessary to follow a given trajectory; its parameters w are estimated from demonstration.
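To make the generation side concrete, the following is a minimal sketch (ours, not the authors' implementation) of integrating a one-dimensional DMP given by equations 1-4 with simple Euler steps; the forcing term f is passed in as a callable, and all numeric choices are illustrative.

```python
import numpy as np

def generate_dmp_trajectory(x0, g, f, K=10.0, duration=2.0, dt=1.0 / 30.0,
                            s_f=0.01):
    """Integrate eq. 1 for one dimension with Euler steps (illustrative only).

    x0: start position, g: goal position,
    f:  callable f(s) approximating the learned forcing term,
    s_f: desired value of the phase variable at the end of the movement.
    """
    D = 2.0 * np.sqrt(K)                 # critical damping, D = 2*sqrt(K)
    alpha = -np.log(s_f) / duration      # from s = exp(-alpha*t), s(duration) = s_f
    x, xd, s = x0, 0.0, 1.0
    trajectory = [x]
    for _ in range(int(duration / dt)):
        xdd = K * (g - x) - D * xd + K * f(s)   # eq. 1
        xd += xdd * dt
        x += xd * dt
        s += -alpha * s * dt                    # eq. 3
        trajectory.append(x)
    return np.array(trajectory)

# With f = 0 the canonical part alone drives x from x0 to g without overshoot:
traj = generate_dmp_trajectory(x0=0.0, g=1.0, f=lambda s: 0.0)
```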
3.2 Learning by Demonstration
For our case, learning by demonstration (LbD) refers
to teaching a robot an action by demonstrating that
action (as opposed to programming it). The presented recognition scheme requires that the actor and
the observer have similar action generation mechanisms and LbD is used to ensure this.
For DMPs, LbD is done by calculating the parameters w from a demonstrated action. A demonstrated action has a different trajectory for each of its dimensions. Given the trajectories x(t) for each of the dimensions, the w of each dimension can be calculated as follows. First, the phase variable s(t) is calculated from equation 4, where α is obtained by forcing a final value sf on the phase variable. Then, for each of the dimensions, f(t) is calculated as:

    f(t) = (ẍ(t) − K(g − x(t)) + Dẋ(t)) / K.    (5)
Since s(t) and f(t) are calculated, and since s(t) is a monotonically decreasing function between 1 and 0, f(s) is also known for s ∈ [0, 1]. Finally, w for each dimension is estimated by applying linear regression with radial basis functions, where the s values are treated as inputs and the corresponding f(s) values are treated as targets.
3.3 Extending DMPs for Online Recognition
There are some major problems with DMPs for online recognition, mostly because of the non-linear part:

1. The lack of an online recognition scheme.

2. The use of a phase variable that runs in an open-loop fashion (i.e., not related to x or ẋ).

3. A consequence of the second problem: α in equations 3 and 4 has to be calculated from the end of the motion, which prevents recognition from being done in an online fashion.

4. Another consequence of the second problem: s depends on the starting time of an action (equation 4), which may not be observed or estimated by an observer since it is an internal parameter of the actor.
In order to circumvent these problems, we replaced f(s, w) with f(z, w) in equation 1, where the vector z consists of variables that are calculated from the current state only. We chose to represent our state in Cartesian space instead of joint space, the reasoning being that it is easier to observe the end-effector of the actor in Cartesian space than the angular positions of his joints.

We used object-centered Cartesian positions and velocities as the variables of z. The object-centered representation implicitly sets g to zero for all dimensions. Such a choice of hand-object relations is inspired by (Oztop and Arbib, 2002, Oztop et al., 2005). Using hand-object relations for generation and recognition greatly reduces actor-observer differences and allows seamless use of the generation mechanisms in recognition. As stated before, our hand-object relations are defined as the relative position and velocity of the end-effector with respect to the object.
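For illustration, the object-centered state z described above could be assembled as in the sketch below (variable names are ours; the object is assumed static, so the relative velocity equals the hand velocity):

```python
import numpy as np

def object_centered_state(hand_pos, hand_vel, object_pos):
    """Build z = [relative position, relative velocity] of the end-effector
    with respect to a (static) object, so that the goal g becomes zero."""
    rel_pos = np.asarray(hand_pos) - np.asarray(object_pos)
    rel_vel = np.asarray(hand_vel)               # object assumed static here
    return np.concatenate([rel_pos, rel_vel])    # 6-D state for 3-D motion
```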
Learning the non-linear part means learning a mapping from z to f, where the elements of f are the f values for each DOF. Given N recorded trajectories xi(t) for an action, each trajectory is translated to an object-centered reference frame. Then, ẋi(t) and ẍi(t) are calculated through simple differentiation, fi(t) is calculated using equation 5, and zi(t) is constructed by concatenating xi(t) and ẋi(t). Finally, a feed-forward neural network is trained by feeding all z values and the corresponding f values as inputs and target values, respectively.
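The paper does not specify the network topology or library, so the following is only a sketch of fitting the z → f mapping with an off-the-shelf feed-forward regressor; the hidden-layer size is a placeholder:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_forcing_network(Z, F, hidden=(30,)):
    """Fit a feed-forward network f(z) per DOF (sketch, not the authors' code).

    Z: array of shape (n_samples, 6)  -- object-centered positions and velocities
    F: array of shape (n_samples, 3)  -- forcing-term targets from eq. 5, per DOF
    """
    net = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000)
    net.fit(Z, F)
    return net       # later: f_hat = net.predict(z.reshape(1, -1))[0]
```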
3.4 Recognition
The aim of the presented approach is to recognize the
observed action, and the object being acted upon, in
an online manner.
For recognition, the initial state observations are given to the action generation systems as initial conditions and future trajectories are simulated. These are then compared with the observed trajectory by calculating the cumulative error

    erri(tc) = Σ_{t=t0}^{tc} ||xo(t) − xi(t)||,    (6)

where erri is the ith behavior's cumulative error, t0 is the observed initial time, tc is the current time, xo is the observed position, and xi is the ith behavior's simulated position.
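In code, equation 6 amounts to summing Euclidean distances between the observed and simulated positions up to the current time step, for each simulated behavior; a minimal sketch:

```python
import numpy as np

def cumulative_error(observed, simulated):
    """Eq. 6: sum of Euclidean distances from t0 up to the current time step.

    observed, simulated: arrays of shape (n_steps_so_far, 3).
    """
    return np.sum(np.linalg.norm(observed - simulated, axis=1))
```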
Figure 1: Recognition architecture flow diagram. The motion capture system tracks the possible objects and the end-effector, from which the state variables are calculated. When recognition starts, the observed state is used as initial values for the learned behaviors (not shown). The learned behaviors (i.e., generation systems) then simulate future trajectories. As the action unfolds, the observed and simulated trajectories are compared and responsibility signals are calculated. From these signals, the recognition decision is made according to a threshold.

Then, to have a quantitative measure of similarity between actions, we define recognition signals as

    rsi(tc) = e^(−erri(tc)) / Σj e^(−errj(tc)).    (7)
Recognition signals can be interpreted as the likelihood that the observed motion is the corresponding simulated action. They are similar to the responsibility signals in (Haruno et al., 2001). However, the responsibility signal calculation uses the instantaneous error, which is prone to noise; the authors themselves point this out as problematic.
During the recognition phase, if the ratio of the highest recognition signal to the second highest rises above a certain threshold (Υ), the recognition decision is made and the behavior corresponding to the highest signal is chosen.
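The recognition signals of equation 7 and the threshold rule above can be sketched as follows (the softmax is shifted by the minimum error for numerical stability, which does not change equation 7; the default threshold of 1.9 is the value chosen later in the paper):

```python
import numpy as np

def recognition_signals(errors):
    """Eq. 7: softmax over negative cumulative errors (shifted for stability)."""
    errors = np.asarray(errors, dtype=float)
    e = np.exp(-(errors - errors.min()))
    return e / e.sum()

def decide(errors, threshold=1.9):
    """Return the index of the recognized hypothesis, or None if undecided."""
    rs = recognition_signals(errors)
    best, second = np.sort(rs)[::-1][:2]
    return int(np.argmax(rs)) if best / second >= threshold else None
```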
In case there are multiple objects in the environment, behaviors are simulated for each object by calculating the hand-object relations accordingly. If we have m behaviors and n objects in the scene, then m × n behaviors are simulated and compared. This also makes it possible to recognize which object is being acted upon.
We do not learn a separate network for the same behavior on different objects. We use the same neural network for learning and generating the trajectory of one behavior on all objects, only changing the inputs according to the object.
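Enumerating the m × n behavior-object hypotheses, each reusing the behavior's single network but a different object frame, might look like this sketch (names are illustrative, not from the paper):

```python
def make_hypotheses(behavior_nets, object_positions):
    """Enumerate the m x n behavior-object hypotheses to be simulated.

    behavior_nets:    dict  behavior_name -> trained forcing network
    object_positions: dict  object_name   -> 3-D object position
    Each hypothesis carries the network and the object frame used to
    transform observations before simulation and comparison.
    """
    return [(b_name, o_name, net, obj_pos)
            for b_name, net in behavior_nets.items()
            for o_name, obj_pos in object_positions.items()]
```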
The flow-chart of the overall recognition architecture is depicted in figure 1.
4. Experimental Framework

4.1 Setup
There are two objects, a motion capture system, an
actor (human) and an observer (robot) in our setup.
The actor and the observer face each other and the
objects are hanging in between as shown in figure 2.
The motion capture system is used for tracking the
centers of the objects and the end-effector of the actor.
We used three reaching behaviors:
• R(X): Reaching object X from right
• D(X): Reaching object X directly
• L(X): Reaching object X from left
which can be applied to one of the two objects:
• LO: Left object (wrt. the actor)
• RO: Right object (wrt. the actor)
The use of multiple behaviors on multiple objects
aimed to show that the proposed method can learn
and recognize different behaviors for different goals.
These behaviors and objects are depicted in figure 3.
The actor can apply one of the behaviors on one of
the objects. Thus, the recognition decision has to be
made among six choices. Note that we do not use a separate neural network for each behavior-object pair, but only one per behavior. The inputs given to the same network change between objects since the hand-object relations are different.
Figure 3: Depiction of behaviors and approximate geometric information from a top view of the setup

We used the Visualeyez VZ 4000 motion capture system (Phoenix Technologies Incorporated, http://www.ptiphoenix.com/VZmodels.php). This device measures the 3D position of active markers, which are attached to the points of interest. In our setup, the markers are attached to the centers of the objects and to the right wrist (the end-effector) of the actor as shown in figure 2. The wrist is chosen to minimize occlusion during tracking. The frequency of the sensor is set to 30 Hz per marker.

Figure 2: Experimental setup

4.2 Experiments

The actor performed 50 repetitions of each behavior on each object (a total of 300 repetitions), and the hand trajectories and the object centers were recorded with the motion capture system. The beginning and the end of the recordings are truncated automatically to capture the start and the end of the actual motion, using a velocity threshold (0.05 m/s). Samples from the recorded trajectories are shown in figure 4. 60% of the processed recordings are used for training and the rest are left for testing. For different threshold (Υ) values, we recorded the decision times for each testing sample and measured the success rates. We selected K = 10 and D = 2√10 empirically. The simulate-observe-compare loop of the approach is run at the same frequency as the motion capture system, 30 Hz.

Figure 4: A subset of recorded human trajectories
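The preprocessing described above could be sketched roughly as follows; the 0.05 m/s velocity threshold and the 60/40 split come from the text, while everything else (function names, shuffling) is our assumption:

```python
import numpy as np

def truncate_motion(x, dt, v_thresh=0.05):
    """Keep only the segment where end-effector speed exceeds v_thresh (m/s)."""
    speed = np.linalg.norm(np.gradient(x, dt, axis=0), axis=1)
    active = np.where(speed > v_thresh)[0]
    return x[active[0]:active[-1] + 1] if len(active) else x

def split_recordings(recordings, train_fraction=0.6, seed=0):
    """Shuffle and split recordings into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(recordings))
    n_train = int(train_fraction * len(recordings))
    train = [recordings[i] for i in idx[:n_train]]
    test = [recordings[i] for i in idx[n_train:]]
    return train, test
```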
5. Results and Discussions
We plotted the time evolution of the recognition signals for nine different recordings in figure 5. In all plots, the recognition signals start from the same initial value, 1/6, corresponding to the 6 possible choices (3 actions × 2 objects). As the action unfolds, the recognition signal of the corresponding action goes to one while suppressing the others. Although there may be confusion in the beginning part of an action (e.g., top right plot), our system recognizes it correctly as it unfolds.
Figure 5: The time evolution of recognition signals. The X-axis is time in seconds and the Y-axis is recognition signal magnitude.

Figure 6: Recognition rate and decision time (percentage completed) vs. threshold value Υ. Percentage completed is defined as the decision time divided by the action completion time.

According to the chosen threshold value Υ, there is a trade-off between the decision time and the success rate. To optimize Υ, we recorded the success rates and mean decision times (for correct decisions) for different Υ values, which are shown in figure 6. We chose Υ as 1.9 to obtain a 90% recognition rate. On average, the system makes a decision when 33% of the observed action is completed with this chosen Υ value. Figure 7 plots the histogram of the decision times for this Υ value.
Figure 7: Distribution of decision times (percentage completed) for Υ = 1.9

Table 1 shows the confusion matrix for the chosen Υ value. The number of cases where the object acted upon is not correctly decided is low: 25% of the wrong decisions and 2.5% of all decisions. Since we want to recognize the action before it is completed, there are confusions between reaching the left object from the right and reaching the right object from the left (see figure 3). Also, there is a confusion between reaching directly and reaching from the left for the left object. These are expected, since human motion has noise and variance between repetitions, and a demonstrator may not produce the initial curvature expected from the action every time (see figure 4).
6. Robot Demonstration
To demonstrate that our system works in an online
fashion, we tested it in an interactive game with the
humanoid robot iCub. The robot’s head and eyes
have a total of 6 DOF and arms have 7 DOF each,
excluding the hands (Metta et al., 2007). It also has
Figure 8: Demonstrations with the robot: Each row shows a different demonstration. The first column shows the
starting point of the actions. The second column shows the point where the system recognizes the action (indicated
by the eye-blink). The third column is the point where the demonstrator finishes his action and the last column is the
point where the robot finishes its action.
Table 1: Confusion matrix for Υ = 1.9 (recognition rate = 90%). Columns give the performed behavior-object pair, rows the recognized one.

                Object
             LO            RO
           R   D   L    R   D   L
   LO  R  17   0   0    0   0   0
       D   1  15   2    0   0   0
       L   0   4  18    0   0   0
   RO  R   0   0   0   18   0   0
       D   0   0   0    1  20   0
       L   2   1   0    1   0  20
For our demonstration, the robot was seated in front of the demonstrator, on the other side of the objects, as shown in figure 2.
The interactive game is as follows: the actor applies one of the actions to one of the objects. The robot immediately raises its eyebrows and blinks when it recognizes the action, and reacts by turning its head to the predicted object. The robot also makes a hand-coded counter action, which is defined as bringing its hand to the opposite side of the object. The reason for using hand-coded actions is that action generation is outside the scope of this paper; DMPs have already been shown to have good generation characteristics. Snapshots from a captured video of the game can be seen in figure 8. These figures show that our action recognition method can be used for online interaction with robots.
7. Conclusion
We demonstrated an online recognition architecture which can recognize an action before it is completed. Our recognition architecture is based on a generation system in which we have modified DMPs to overcome some of their shortcomings. Our modifications remove the dependence of DMPs on the initial position and time (i.e., internal parameters of the actor). Also, we have changed the recognition approach of (Ijspeert et al., 2003) to allow for online recognition. We have defined recognition signals as a quantitative way of measuring the similarity of actions. The temporal profile of our recognition signals is similar to mirror neuron responses, although more experiments and comparisons should be done to claim functional similarity.
Although seemingly similar, our work differs from HAMMER and MOSAIC in major aspects. We work in trajectory space, and our controller and trajectory generator are decoupled; i.e., we do not need to calculate motor commands for recognition. Our architecture has the ability to learn actions by demonstration, which is absent in HAMMER and MOSAIC. In both HAMMER and MOSAIC, it is assumed that the actor and the observer have similar internal models, which cannot be guaranteed. HAMMER is more suited to recognizing straight-line trajectories; although a solution using via-points is proposed in (Demiris and Simmons, 2006), everything needs to be calculated offline. Our recognition metric, the recognition signals, is robust against noise since it uses the cumulative error, unlike MOSAIC and MSI.
Acknowledgements
This research was supported by the European Commission under the ROSSI project (FP7-216125) and by TÜBİTAK under project 109E033. Barış Akgün acknowledges the support of the TÜBİTAK 2228 graduate student fellowship and Doruk Tunaoğlu acknowledges the support of the TÜBİTAK 2210 graduate student fellowship.
References
Demiris, J. and Hayes, G. (2002). Imitation as
a dual-route process featuring predictive and
learning components; a biologically-plausible
computational model. In Dautenhahn, K. and
Nehaniv, C., Imitation in animals and artifacts.
MIT Press.
Demiris, Y. and Khadhouri, B. (2006). Hierarchical attentive multiple models for execution
and recognition of actions. Robotics and Autonomous Systems, 54:361–369.
Demiris, Y. and Simmons, G. (2006). Perceiving
the unusual: Temporal properties of hierarchical motor representations for action perception.
Neural Networks, 19:272–284.
Gallese, V., Fadiga, L., Fogassi, L., and Rizzolatti,
G. (1996). Action recognition in the premotor
cortex. Brain, 119(2):593–609.
Haruno, M., Wolpert, D. M., and Kawato, M. M.
(2001). Mosaic model for sensorimotor learning
and control. Neural Computation, 13(10):2201–
2220.
Ijspeert, A., Nakanishi, J., and Schaal, S. (2001).
Trajectory formation for imitation with nonlinear dynamical systems. Proceedings 2001
IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the Next Millennium
(Cat. No.01CH37180), pages 752–757.
Ijspeert, A. J., Nakanishi, J., and Schaal, S. (2003).
Learning attractor landscapes for learning motor primitives. In Becker, S., Thrun, S., and
Obermayer, K., (Eds.), Advances in Neural Information Processing Systems, volume 15, pages
1547–1554. MIT-Press.
Metta, G., Sandini, G., Vernon, D., Natale, L., and
Nori, F. (2007). The iCub cognitive humanoid robot: An open-system research platform for enactive cognition. In 50 Years of Artificial Intelligence, pages 358–369. Springer Berlin / Heidelberg.
Oztop, E. and Arbib, M. A. (2002). Schema design
and implementation of the grasp-related mirror
neuron system. Biological Cybernetics, 87:116–
140.
Oztop, E., Kawato, M., and Arbib, M. (2006). Mirror neurons and imitation: A computationally
guided review. Neural Networks, 19:254–271.
Oztop, E., Wolpert, D., and Kawato, M. (2005).
Mental state inference using visual control parameters. Cognitive Brain Research, 22:129–151.
Rizzolatti, G., Fadiga, L., Gallese, V., and Fogassi,
L. (1996). Premotor cortex and the recognition of motor actions. Cognitive brain research,
3:131–141.
Schaal, S., Ijspeert, A., and Billard, A. (2003a).
Computational approaches to motor learning by
imitation. Philosophical Transactions: Biological Sciences, 358:537–547.
Schaal, S., Peters, J., Nakanishi, J., and Ijspeert,
A. (2003b). Learning movement primitives. In
International Symposium on Robotics Research.
Springer.
Tani, J. (2003). Learning to generate articulated
behavior through the bottom-up and the top-down interaction processes. Neural Networks,
16:11–23.
Tani, J., Ito, M., and Sugita, Y. (2004). Self-organization of distributedly represented multiple behavior schemata in a mirror system: reviews of robot experiments using RNNPB. Neural Networks, 17:1273–1289.
Wolpert, D. and Kawato, M. (1998). Multiple
paired forward and inverse models for motor
control. Neural Networks, 11:1317–1329.