Towards Action Representation Within the Framework of Conceptual Spaces: Preliminary Results

Cognitive Robotics
AAAI Technical Report WS-12-06
Oliver Beyer, Philipp Cimiano, Sascha Griffiths
CITEC, Bielefeld University
Abstract
We propose an approach for the representation of actions based on the conceptual spaces framework developed by Gärdenfors (2004). Action categories are regarded as properties in the sense of Gärdenfors (2011) and are understood as convex regions in action space. Action categories are mainly described by a force signature that represents the forces that act upon a main trajector involved in the action. This force signature is approximated via a representation that specifies the time-indexed position of the trajector relative to several landmarks. We also present a computational approach to extract such representations from video data. We present results on the Motionese dataset, consisting of videos of parents demonstrating actions on objects to their children. We evaluate the representations on a clustering and a classification task, showing that, while our representations seem to be reasonable, only a handful of actions can be discriminated reliably.
Introduction
Cognitive systems, and robots in particular, need to be able to recognize and reason about actions. This requires an appropriate formalism for the representation of actions which encompasses: i) the participants of that action, ii) its teleological structure including goals and intentions of participants, iii) its spatio-temporal structure, as well as iv) its preconditions and effects on the world. Such a representation would ideally support action recognition, reasoning about the goals of participants, simulation of the action, planning, etc. Most importantly, such a representation should not be specific to a particular action, i.e. a particular instance carried out on a specific object (e.g. representing a specific instance of a 'putting' event), but capture general properties of the action category, e.g. the action category of putting something into something else. Essentially, we require a holistic and gestalt-like representation that allows us to represent the action category in a way that abstracts away from specific participants, specific objects involved, etc.

An appealing theory that can be used to represent such action concepts is the conceptual spaces framework by Gärdenfors (2004). Gärdenfors proposes a geometric framework for the representation of concepts as convex regions in vector space. The strength of this theory is that it is cognitively plausible and lends itself to computational implementation, as it builds on standard vector-based representations and geometric operations that manipulate these representations. Gärdenfors and Warglien (2012) have also presented an approach to represent action concepts in the framework of conceptual spaces. Actions are modelled by adding a force domain to the conceptual space, which allows us to represent dynamic concepts. The force dimension captures the dynamic content of actions and thus forms a crucial component of a representation of action.

In this paper, we present a computational model of action representation based on the conceptual spaces framework of Gärdenfors (2004). The representation focuses exclusively on the spatio-temporal structure of an action by encoding the relative position of a moved trajector over time. This can essentially be regarded as an encoding of the force that acts upon a moved trajector during the action. We represent action categories as vector-based prototypes that define a Voronoi cell, which represents a property in the sense of the conceptual spaces framework. In this paper we formalize this idea and discuss a computational model of this representation of action categories. We present an approach that can extract such representations from video data, and we report preliminary results on two tasks: an unsupervised task consisting of grouping similar actions together, as well as a supervised task in which unseen actions are classified into the appropriate action category. We apply our model to naturalistic data from the Motionese dataset, in which parents demonstrate actions to teach their children (Rohlfing et al. 2006; Vollmer et al. 2010).

The main question we approach in this article is whether vector-based representations of the trajectory are enough to discriminate between different action concepts, considering eight actions found in the Motionese dataset: 'push', 'pull', 'put', 'switch', 'shut', 'close', 'open' and 'place'. Our results show that, while a few classes can be discriminated easily on the basis of this vector-based representation of the trajectory, with a growing number of classes the discriminative power decreases rapidly and substantially.

The structure of the paper is as follows: in the next section, Conceptual Spaces, we give a brief overview of the theory of conceptual spaces, focusing in particular on how action
concepts can be represented in this framework. In Section
Computational Model of Action Spaces we present our
computational model, and in Section Extraction of Action
Representations we present an approach to extract such
representations from video data. In Section Experiments we
present our experiments and our results on a clustering and
classification task. Before concluding, we discuss some related work.
Conceptual Spaces

The conceptual spaces theory by Gärdenfors (2004) proposes a geometric approach to the representation of concepts. Following the insight that many natural categories are convex, it proposes to formalize properties and concepts as convex regions in vector space. The conceptual spaces approach rests on five important notions: quality dimensions, domains, properties, concepts and instances:

• a quality dimension represents some way in which two stimuli can be different (e.g. temperature, weight, etc.); each quality dimension is assumed to have a metric

• a domain encompasses several quality dimensions that form a unit (e.g. the domain of 'color' is composed of three quality dimensions that we can observe: hue, saturation and brightness)

• properties are convex regions in one domain (e.g. the property of being 'red' would be a convex region in the domain of color)

• concepts consist of convex subsets of quality dimensions in different domains, e.g. a 'red circle' involves the convex property 'red' in the domain of color and the convex property 'circle' in the domain of form

• instances are specific vectors (points) in a vector space representing a specific entity, object, action, etc.

The theory of conceptual spaces is appealing for a number of reasons:

1. It is cognitively plausible in the sense that it is compatible with various empirical findings in the cognitive sciences, e.g. with prototype effects (Rosch 1975) and with the fact that many natural categories are convex. This has, for example, been demonstrated empirically for the domain of color (Jäger 2010).

2. It spans multiple levels of representation from subsymbolic to symbolic, thus supporting tasks at different levels of abstraction from recognition to planning. This is a result of adopting geometric, vector-based representations that are to some extent grounded in perceptual and other sensorimotor features. On the other hand, the vector-based representation can be manipulated symbolically by standard operations on vectors, supporting in particular compositionality, as new concepts and properties can be created by combining existing representations.

3. The theory lends itself to computational implementation, as vectors and operations which manipulate these can be implemented straightforwardly.
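The notions above can be made concrete with a small sketch. The following Python fragment is our own illustration and not part of the original model; the prototype coordinates and the radius are invented. It represents the color domain by its three quality dimensions and models the property 'red' as a ball around a prototype point, which is one simple way of obtaining a convex region.

import numpy as np

# The color domain with its three integral quality dimensions (hue, saturation,
# brightness), each normalized to [0, 1] for this toy example.
COLOR_DIMENSIONS = ("hue", "saturation", "brightness")

# A property is a convex region in one domain. A ball around a prototype point
# is trivially convex, so a (prototype, radius) pair serves as a minimal stand-in.
RED_PROTOTYPE = np.array([0.0, 0.9, 0.8])   # hypothetical prototype for 'red'
RED_RADIUS = 0.25

def has_property(point, prototype, radius):
    # True if the instance's value in this domain falls inside the convex region.
    return np.linalg.norm(np.asarray(point) - prototype) <= radius

tomato = [0.02, 0.85, 0.75]   # an instance's value in the color domain
sky = [0.55, 0.60, 0.90]
print(has_property(tomato, RED_PROTOTYPE, RED_RADIUS))   # True
print(has_property(sky, RED_PROTOTYPE, RED_RADIUS))      # False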
In his book, Gärdenfors (2004) argues that action categories can also be regarded as convex regions in space. More details on how actions can be represented in the conceptual spaces framework have been provided by Gärdenfors and Warglien (2012). The main point is that actions are encoded in a force domain, and the representation need not include any more information about the participants in an event or action other than the forces which they contribute to the event. In line with the criterion that concepts need to be convex regions in a space, they suggest that an action category is a convex region in action space. The basic difference between object categories and action categories is that an object category is mainly determined by static quality dimensions and domains, while action categories involve forces in several dimensions that act in specific ways and directions as the action progresses.

Following the approach of Gärdenfors and Warglien, in our model actions are represented through the forces that act on so-called trajectors, which describe the center of focus for a particular action concept. For instance, the action categories 'push' and 'pull' differ in their direction, while the intensity of the forces might be equal. Therefore, in our approach we encode the spatio-temporal relation of the trajector relative to a set of landmarks. As this relation may change over time depending on the applied forces, time can be seen as an additional dimension in our representation. We thus propose a simplified representation of an action category through a prototype that represents the typical spatio-temporal signature of the forces involved in an action that belongs to this category. These forces are essentially encoded in the form of a trajectory that specifies the position of the trajector relative to a set of landmarks at different points in time.

Figure 1: Concept of 'across', which illustrates the trajector moving across the landmark.

Take the example of the action depicted in Figure 1. It depicts three time points during the action of moving a circle (the trajector) across the rectangle (the landmark). The three time points can be characterized as follows: at time point 1, the trajector is to the left of the landmark; at time point 2, the circle is on the landmark; and at time point 3, the trajector is on the right side of the landmark. This example illustrates how the relative positions of the trajector to one or several landmarks can be used to describe the prototypical signature of an action.

Computational Model of Action Spaces

In this section we present our computational model for the representation of action categories, which builds on the conceptual spaces theory of Gärdenfors (2004). Before presenting an approach that extracts the appropriate vector-based
representation of actions from video data, we first formalize
our model.
Formalization of a conceptual space
Following Gärdenfors (2011), we represent action concepts geometrically. A conceptual space, i.e. the space of all possible concepts C, can be formalized as C = D_1 × ... × D_n, where the D_i are so-called domains. A domain is itself a space D_i = D_{i,1} × ... × D_{i,m}. Given a certain domain D_i with m integral quality dimensions, we call a convex subset P ⊆ D_{i,1} × ... × D_{i,m} a property. We assume that a special value ∅ is a member of every domain. If the value of a certain domain is ∅ for a certain instance, this means that this integral domain simply does not apply to the instance in question.

In the domain of colors with three integral dimensions corresponding to HSV values, the property 'red' is a convex subset of the color domain hue × saturation × brightness, which consists of three integral quality dimensions. A concept is a tuple (p_1, p_2, p_3, ..., p_n) where the p_i are convex subsets of D_i. A concept might for instance be a red circle, involving both the color domain as well as the domain of form.
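As a minimal illustration of this formalization (our own sketch; the domain names and values are invented), an instance can be stored as an assignment of values to domains, with None playing the role of the special value ∅ for domains that do not apply:

# Domains of a toy conceptual space C = D_color x D_form x D_force.
DOMAINS = ("color", "form", "force")

# An instance assigns a value to each domain; None stands for the special value
# ∅ and marks domains that do not apply to the instance in question.
red_circle = {"color": (0.0, 0.9, 0.8), "form": "circle", "force": None}

def project(instance, domain):
    # Projection of an instance onto a single domain.
    return instance[domain]

def applicable_domains(instance):
    # Domains whose value is not ∅ for this instance.
    return [d for d in DOMAINS if instance[d] is not None]

print(project(red_circle, "color"))       # (0.0, 0.9, 0.8)
print(applicable_domains(red_circle))     # ['color', 'form']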
Representation of actions

The action depicted in Figure 1 can be described with respect to one landmark. In the general case, we might describe the relative position of the main trajector with respect to several landmarks. In this paper we consider the special case in which there are two landmarks: a source landmark representing the origin of the action and a target landmark representing the end of the trajectory of the main trajector. Nevertheless, our formalization can be straightforwardly extended to the case of several landmarks.

An action category can be seen as a concept in the above sense whereby the only integral dimension relevant for the representation of an action is the domain of force F. Without loss of generality, let D_k be the force domain. An action in our sense is thus an element A = (∅, ∅, ..., f, ..., ∅) ∈ D_1 × ... × D_{k-1} × F × D_{k+1} × ... × D_n, with Π_F(A) = f ∈ F being the projection of A onto the force domain. Hereby, we formalize F as F = P(A_Ls × L_Ls × A_Lg × L_Lg × T), where A_Ls, L_Ls, A_Lg, L_Lg = {0, 1} and T is a set of timestamps, i.e. integers for our purposes. Here A_Ls represents whether the trajector is above the source landmark at a certain time point t, while L_Ls represents whether the trajector is left of the source landmark. Analogously, A_Lg and L_Lg have the same interpretation relative to the target landmark. An example of this representation is given in Figure 2. The left side of the figure shows the relative position of the trajector towards the two landmarks (L_source, L_goal). Every red dot indicates one of the 12 discrete timestamps of the trajector's path. The 12 tuples on the right side of the figure correspond to each of these dots and constitute our representation of the action.

Figure 2: Action representation of the concept 'put', showing the relative distance to the landmarks and the representation.

Given a prototypical instance p_a of an action category c_a, the concept is defined as:

c_a = {ξ_i | ξ_i ∈ F ∧ ∀j ∆(ξ_i, p_a) ≤ ∆(ξ_i, p_j)}

The category c_put, represented by a prototype p_put, is thus the convex subset consisting of all particular actions ξ_i that are more similar to p_put than to the prototype p_j of any other action category. This definition presupposes some notion of distance between instances of actions, which we define below.
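A minimal sketch of this nearest-prototype definition (our own illustration; the distance function delta is left abstract here and is instantiated below by the DTW-based measure):

def assign_category(instance, prototypes, delta):
    # Return the label of the prototype that is closest to the instance under
    # the distance measure delta, i.e. the Voronoi cell the instance falls into.
    return min(prototypes, key=lambda label: delta(instance, prototypes[label]))

# Hypothetical usage with a prototype dictionary such as
# prototypes = {"put": p_put, "pull": p_pull} and a distance function delta:
# category = assign_category(xi_new, prototypes, delta)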
Extraction of Action Representations

In this section we present an approach to extracting action representations as described above from video data depicting 8 different actions carried out by humans on 10 objects. Without prior knowledge about the scene or the relevant objects, it is a difficult task to determine the forces that act on a moved object. Thus, we first determine the object that is moved (the trajector) as well as its trajectory. This is a non-trivial task, as videos often include a lot of motion besides the motion induced by the main trajector. In the next step we then generate a representation of the action on the basis of the trajectory. We make the assumption that there is always only one relevant motion in each of our input videos. In the following we present an approach to extracting the main trajector and its trajectory. As shown in Figure 3, the procedure consists of the following 4 steps:

1. Finding regions of high motion
2. Determining region trajectories
3. Selecting the main trajectory
4. Generating action representations

Figure 3: Phases of the automatic extraction of action representations, from the input video to the action representation.

We describe these steps in more detail in the following.

Finding regions of high motion: We define the video input signal as V = (f_1, ..., f_n), with f_i being the i-th frame of the video. A frame is essentially a two-dimensional image displayed in the video at a certain time. As there is usually more activity in videos besides the main movement, it is a difficult task to detect the trajectory of the main trajector. In order to segment out the main motion from background motion, we identify regions of high motion in the first processing step to reduce our search space. In order to find these regions, we generate the images δ_1, ..., δ_{n-1} by subtracting the gray values pixel by pixel of every two consecutive frames f_i, f_{i+1} from each other:

δ_i = f_i − f_{i+1}

For each frame f_i we then compute maximal sets of contiguous pixels ∪_q (x_q, y_q) such that δ_i(x_q, y_q) ≥ θ, where θ is some empirically determined threshold. This yields a number of high-motion regions (R_1^{f_i}, ..., R_m^{f_i}) for every frame f_i. These regions define the areas where we locally start our search for possible trajectories.
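A rough sketch of this step in Python (our own reconstruction, not the authors' implementation; the threshold value and the use of scipy.ndimage for connected components are assumptions):

import numpy as np
from scipy import ndimage

def high_motion_regions(frame_a, frame_b, theta=25):
    # Difference image between two consecutive grayscale frames (the absolute
    # difference is used here), thresholded with theta, then split into
    # maximal sets of contiguous pixels.
    delta = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    mask = delta >= theta
    labels, n_regions = ndimage.label(mask)
    return [labels == k for k in range(1, n_regions + 1)]

# Hypothetical usage on two consecutive frames of a video:
# regions = high_motion_regions(frames[i], frames[i + 1])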
Determining region trajectories: In order to connect regions in a frame f_i to regions in the following frame f_{i+1} with the goal of computing a trajectory, we apply the optical flow approach (Barron et al. 1994) to every region R_j^{f_i} of each frame f_i in order to estimate the successor region succ(R_j^{f_i}) in frame f_{i+1}, for every region (R_1^{f_i}, ..., R_m^{f_i}) of every frame f_i. Thus, we get a sequence of regions succ(...succ(R_j^{f_i})...) for every region R_j^{f_i} starting in frame f_i. In order to determine a trajectory for such a region sequence, we define a central point cp_j^{f_i} for every region R_j^{f_i}, located at the center of the region in the image. We further define the trajectory T_j as a sequence of the form (cp_j^{f_1}, cp_j^{f_2}, ..., cp_j^{f_n}).
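The paper estimates successor regions with optical flow; as a simplified stand-in (our own sketch, not the authors' implementation), the following links each region to the region in the next frame whose central point is closest and collects the central points into a trajectory:

import numpy as np

def center_point(region_mask):
    # Central point of a region, taken as the centroid of its pixels.
    ys, xs = np.nonzero(region_mask)
    return np.array([xs.mean(), ys.mean()])

def link_trajectory(regions_per_frame, start_frame=0, start_region=0):
    # Follow one region through the video by always picking the successor
    # region with the nearest central point (a simplification of the
    # optical-flow linking used in the paper).
    current = center_point(regions_per_frame[start_frame][start_region])
    trajectory = [current]
    for frame_regions in regions_per_frame[start_frame + 1:]:
        if not frame_regions:
            break
        centers = [center_point(r) for r in frame_regions]
        current = min(centers, key=lambda c: np.linalg.norm(c - current))
        trajectory.append(current)
    return trajectory   # the sequence (cp^{f_1}, cp^{f_2}, ...)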
Selecting the main trajectory: In order to find the main trajectory, we follow the assumption that the trajectory of interest is the longest one. Thus the main trajectory can be described as follows:

T_max = arg max_j |T_j|

with |T_j| = n being the length of the trajectory.
Action representation: We define the location of the trajectory T(f_i) in frame f_i and the position of the two landmarks L_source, L_goal as follows:

T(f_i) = cp_j^{f_i} with cp_j^{f_i} ∈ T_max
L_source = T(f_1)
L_goal = T(f_n)

Based on the main trajectory T_max, we encode the corresponding force f = Π_F(A) ∈ F of the action A as a sequence of vectors (A_Ls, L_Ls, A_Lg, L_Lg, i) ∈ f, with

A_Ls/Lg = 1 if T(f_i) is above L_source/L_goal, 0 if T(f_i) is under L_source/L_goal
L_Ls/Lg = 1 if T(f_i) is left of L_source/L_goal, 0 if T(f_i) is right of L_source/L_goal
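A compact sketch of this encoding step (our own illustration; it assumes image coordinates with the origin in the top-left corner, so 'above' corresponds to a smaller y value):

def encode_action(trajectory):
    # Encode a trajectory (list of (x, y) central points) as the sequence of
    # tuples (A_Ls, L_Ls, A_Lg, L_Lg, t) relative to source and goal landmarks.
    source, goal = trajectory[0], trajectory[-1]   # L_source = T(f_1), L_goal = T(f_n)

    def above(p, landmark):
        return 1 if p[1] < landmark[1] else 0      # smaller y = higher in the image

    def left_of(p, landmark):
        return 1 if p[0] < landmark[0] else 0

    return [(above(p, source), left_of(p, source),
             above(p, goal), left_of(p, goal), t)
            for t, p in enumerate(trajectory, start=1)]

# Hypothetical usage:
# representation = encode_action(link_trajectory(regions_per_frame))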
Similarity between actions

The standard Euclidean distance cannot be used in our case to compare actions, as the trajectories have different lengths. Thus, we make use of dynamic time warping (DTW) to define a distance measure ∆ and thus be able to compare trajectories of variable length. In order to compare two actions ξ_i and ξ_j, we determine the distance ∆(ξ_i, ξ_j) by DTW as follows:

DTW-DISTANCE(Action a1, Action a2)
  DTW[0..n][0..m]
  FOR i := 1 TO m
    DTW[0][i] := infinity
  FOR i := 1 TO n
    DTW[i][0] := infinity
  DTW[0][0] := 0
  FOR i := 1 TO n
    FOR j := 1 TO m
      cost := d(a1[i], a2[j])
      DTW[i][j] := cost + min(DTW[i-1][j], DTW[i][j-1], DTW[i-1][j-1])
  RETURN DTW[n][m]
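For readers who prefer executable code, a direct Python transcription of the pseudocode (our own; the local distance d between two tuples is assumed to be Euclidean here):

import numpy as np

def dtw_distance(a1, a2,
                 d=lambda p, q: float(np.linalg.norm(np.asarray(p, dtype=float) -
                                                     np.asarray(q, dtype=float)))):
    # Dynamic time warping distance between two action representations of
    # possibly different lengths, following the pseudocode above.
    n, m = len(a1), len(a2)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = d(a1[i - 1], a2[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return dtw[n, m]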
Clustering similar actions

We carry out experiments in which we cluster action instances extracted from video data as described above and determine a prototype for every action category as the median element of each cluster. We rely on a k-median clustering approach (Bradley et al. 1997), which is similar in spirit to k-means with the exception of adapting the prototypes to the median instead of the mean. The clustering algorithm is thus as follows (a Python sketch of the procedure is given after the list):

1. Initialize p_i with 1 ≤ i ≤ k to some randomly chosen ξ_i
2. Assign each ξ_i to the closest prototype p, s.t. ∀j ∆(ξ_i, p) ≤ ∆(ξ_i, p_j)
3. For each p_i find the median of the cluster with ξ_median = arg min_{ξ_k ∈ p_i} Σ_{ξ_j ∈ p_i} ∆(ξ_k, ξ_j); set p_i = ξ_median
4. Repeat steps 2-3 until the clustering is stable
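A sketch of this k-median procedure (our own; the initialization and the stopping test are simplified, and delta is the DTW-based distance defined above):

import random

def k_medians(actions, k, delta, max_iter=50):
    # Cluster action instances with the k-median scheme described above, using
    # the distance delta and medians as prototypes.
    prototypes = random.sample(actions, k)          # step 1: random initialization
    assignment = None
    for _ in range(max_iter):
        # step 2: assign every instance to its closest prototype
        new_assignment = [min(range(k), key=lambda c: delta(x, prototypes[c]))
                          for x in actions]
        if new_assignment == assignment:            # step 4: stop once stable
            break
        assignment = new_assignment
        # step 3: the new prototype is the cluster median, i.e. the member with
        # the smallest summed distance to all other members
        for c in range(k):
            members = [x for x, a in zip(actions, assignment) if a == c]
            if members:
                prototypes[c] = min(members,
                                    key=lambda x: sum(delta(x, y) for y in members))
    return prototypes, assignment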
Classifying actions

We further consider the task of classifying unseen actions into their corresponding action category in a supervised fashion. For this we use labelled data in which every action has been manually assigned by annotators to its corresponding category. The training is performed in the following two steps:

1. Group actions according to their labels as follows: S_j = {ξ_i | label(ξ_i) = j}
2. Compute the median p_j for each action class j as described above

The label of a new action ξ_new is then predicted as follows:

class(ξ_new) := label(p_a) where p_a = arg min_{p_k} ∆(ξ_new, p_k)

In essence this corresponds to a 1-nearest-neighbour classifier in which a new action is simply classified into the class of the nearest prototype.

Experiments

Dataset

The data used in our study is the Motionese dataset (Rohlfing et al. 2006; Vollmer et al. 2010). This collection of videos was recorded to investigate explicit tutoring in adult-child interactions. For this purpose, 64 pairs of parents presented a set of 10 objects both to their infants and to
the respective other adult, while also explaining a pre-defined task with the object. The parent and the child faced each other during the task, sitting opposite each other across a table. We selected 8 actions ('put', 'pull', 'open', 'shut', 'switch', 'place', 'close', 'push') that were performed with the 10 objects. The present study uses the material which was filmed from the infants' perspective. It was annotated by selecting segments that depict one of the 8 actions in question. The files were then cut up into those units, and processing was done on these segments.
Clustering Results
In the first phase of our experiment we took 19 examples of each action category (152 in total) to perform a purely unsupervised clustering task as described above. In order to establish that our compact representation does not lose too much information from the source trajectory, we compare our results to a baseline which discriminates between the actions on the basis of the raw trajectory as a substitute for our action representation. To evaluate how well the clusters have been formed, we measured the purity of our clustering, which is defined as follows:

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

with Ω = {ω_1, ω_2, ..., ω_K} being the set of clusters and C = {c_1, c_2, ..., c_J} being the set of classes.
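Purity can be computed directly from a cluster assignment and the gold labels; a small sketch (our own):

from collections import Counter

def purity(clusters, labels):
    # purity(Omega, C) = (1/N) * sum over clusters of the size of the largest
    # class inside the cluster; clusters[i] and labels[i] refer to instance i.
    n = len(labels)
    total = 0
    for cluster_id in set(clusters):
        members = [labels[i] for i in range(n) if clusters[i] == cluster_id]
        total += Counter(members).most_common(1)[0][1]
    return total / n

# Toy example with two clusters over six instances:
print(purity([0, 0, 0, 1, 1, 1],
             ["pull", "pull", "place", "place", "place", "place"]))   # 0.833...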
In order to visualize the impact of the number of action categories on our clustering task, we performed seven experiments with k = 2, ..., 8, always selecting the subset of k action categories leading to the highest purity. We compare our action representation with the pure trajectory of the action in order to quantify how much information is lost. The results are shown in Figure 4.

Figure 4: Results of the clustering.

The results license two observations. The first one is that with an increasing number of classes the purity, which starts off at 94.74%, drops substantially for every category that is added. Table 1 shows the purity as well as the classes considered for each k. While the actions 'pull' and 'place' can be discriminated very reliably with a purity of 94.74%, performance already drops by 16% when adding 'close'. When adding one more class, i.e. 'open', purity drops again by 13% to 65.79%.

The second observation is that the clustering using our action representation performs comparably to our baseline. The maximum distance between the two purity values is 12.03% in the case of only two clusters. This difference constantly decreases with the number of clusters, down to 0.7% in the case of 8 clusters. Our representation thus allows us to discriminate between a few classes of clearly differentiable action categories, but performance drops severely for every additional action class added.
Classification Results
In the second phase of our experiment, the classification task, we performed a 10-fold cross-validation, training on 9 folds and evaluating on the remaining fold. Table 1 shows the resulting accuracy of our system. While our approach yields accuracies of 95% when classifying an action into one of the three most distinguishable classes, the performance drops severely when considering more than 4 action categories. Table 1 also shows the percentage of cases in which convexity is violated for each action category, i.e. the average number of times that an action belonging to action category i is actually closer to the prototype (median) p_j of category j than to p_i. The numbers show that the classes are rather coherent, with violations of convexity in only 5%-10% of the cases.
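The convexity-violation column of Table 1 can be obtained by checking, for every labelled instance, whether some foreign prototype is closer than the prototype of the instance's own class; a sketch of this measure as we read it (our own reconstruction):

def convexity_violations(instances, labels, prototypes, delta):
    # Percentage of instances that are closer to the prototype (median) of some
    # other action category than to the prototype of their own category.
    violations = 0
    for x, own in zip(instances, labels):
        own_dist = delta(x, prototypes[own])
        if any(delta(x, prototypes[other]) < own_dist
               for other in prototypes if other != own):
            violations += 1
    return 100.0 * violations / len(instances)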
k   Action Categories                                    Accuracy of      Purity of    Percentage of
                                                         Classification   Clustering   Convexity Violations
2   pull, place                                          100.00           94.74         3.44
3   pull, place, close                                    95.00           78.95         4.34
4   pull, place, close, open                              66.67           65.79         5.07
5   pull, place, close, open, switch                      57.14           53.68         5.80
6   pull, place, close, open, switch, shut                44.44           45.61         7.25
7   pull, place, close, open, switch, shut, push          40.00           42.86         8.75
8   pull, place, close, open, switch, shut, push, put     35.71           36.84        10.15

Table 1: Accuracy of our classifier, purity of the clustering, and percentage of convexity violations during the classification process, for each number k of action categories.
Discussion
Figure 5: Distance between the prototypes of each action class.

Figure 5 plots the distance between the prototypes of different action categories. We can observe that certain pairs of action categories are very close to each other, e.g. 'push' and 'switch', 'switch' and 'place', 'push' and 'pull', 'push' and 'shut', 'open' and 'pull', and 'open' and 'shut'.
While it is understandable that push, place, switch and shut
can be easily confused because they all involve a forward
movement towards the pushed, switched or shut object, the
closeness between push and pull is certainly surprising as
they involve movements towards and away from the object
respectively. A deeper investigation of the data revealed an
explanation for the fact that ’pull’ and ’push’ are often confused. A pull action typically involves an action of grasping
directed towards the object such that they are difficult to distinguish. This problem is further exacerbated by the fact that
many action videos depicting a pull action are cut off right after the grasping, so that the actual pulling cannot actually be seen in the video, but can be inferred by a human (though not by a machine). Thus, depending on the completeness of the action depicted in the video clip, a pull action might actually be confounded with actions directed towards the object. As described later, this is one of the issues we face concerning our data set. Actions which are very dissimilar are 'pull' and 'place', and 'open' and 'place', which are directed away from ('pull' and 'open') and towards ('place') the object, respectively.
Our results clearly suggest that additional features will
be necessary to distinguish between actions such as ’push’,
’place’, ’switch’ and ’shut’ which all involve a movement
targeted towards the object and can thus not be discriminated
easily only on the basis of trajectory. Features are needed
that represent the specific way in which the object is manipulated as well as a representation of the resultative state of
the object, i.e. moved forward in the case of ’push’, closed
in the case of ’shut’, switched off in the case of ’switch’.
Needless to say, extracting such fine-grained information from perception is a challenge, but one that needs to be addressed to obtain expressive action representations with sufficient discriminatory power.
Thus, while we have shown that our representation is generally reasonable, being able to discriminate between a few easily distinguishable categories, and that the convexity property seems to hold to a large extent, the representation is clearly limited, not being able to reliably discriminate more than a handful of actions. This can be due to several issues:
• Trajectories: The trajectories have been extracted automatically without any prior knowledge. This means that
there are certain heuristics in the extraction process that
could add additional noise to the trajectory, which is the
main element in our process of generating action schemas.
This might be solved by manually annotating the trajectory of the main trajector.
• Annotations: The annotations of the actions depicted in
a given segment are also subject to noise. First, there is
room for interpretation as to what action is actually depicted in a given video segment. This can be addressed
by having several annotators and only considering those
annotations as reliable on which at least two annotators
agree. Further, the temporal extension of the action is also
subject to interpretation, leading to the fact that many actions (see our discussion of ’pull’ above) were not completely visible in the segments specified by our annotators.
This can also be tackled by having several annotators and
taking the longest segment spanning the segments chosen
by different annotators.
• Limits of our representation: As already discussed above, our representation also has clear limitations and would need to be extended to accommodate more complex features such as those mentioned above.
In our future work we will continue to investigate the limits of our representation as well as improve the approach to
extract representations from video data, possibly also creating a gold standard in which the representations are extracted by hand, thus being able to study the performance of
our representation in the case of zero noise.
Related Work
There have been several approaches to action representation
and recognition, in particular in the area of robotics.
Most approaches are task-specific and do not strive for
a general gestalt-based and cognitively inspired representation such as the one we present here. For example,
some models just represent specific actions such as hand
movements (Li et al. 2007) and often even require a specific
hardware configuration such as motion tracking systems
which require markers or data gloves (Lütkebohle et al.
2010).
One promising line of investigation is saliency-based models. The work of Tanaka et al. (2011) is similar to our approach. However, what is noticeable is that their method does not scale up to a large number of actions either: in their approach, just five actions are discriminated. The question is also to what extent such systems can cope with more naturalistic data. The action categories in our approach were fixed by the experimental design adopted in the Motionese data, where parents were asked to demonstrate a number of pre-defined actions on a set of objects to their children. In contrast, Laptev et al. (2008) show how action recognition is possible on realistic video material taken from television, film and personal videos. It would be interesting to compare their approach to ours, as we aim to represent the content of actions.
We have presented a model for action representations that
tries to represent action categories building on the conceptual spaces framework of Gärdenfors (2004) in such a
way that the representation abstracts from specific instances
including specific actors and objects. Ours is the first
implementation known to us of the proposal by Gärdenfors
and Warglien for the representation of action that works
with real naturalistic data. In this sense, we have been the
first to present empirical results with a representation of
action that is in line with the conceptual spaces framework,
exploring the empirical implications of adopting such a
representation. Further work will be devoted to extending
such representations to more holistic representations that
represent i) participants, ii) their goals and intentions,
iii) the spatio-temporal relation between participants and
objects involved in the action etc. Knowledge about the
type of the landmark as well as of the objects manipulated,
their functional and non-functional properties will clearly
also be important. Such a holistic representation will be
ultimately necessary if we want to endow cognitive systems
with the capability of recognizing and understanding the
semantics of actions, as well as being able to reason about
their consequences and implications.
Conclusion

In this paper, we have presented an approach to representing actions based on the conceptual spaces framework developed by Gärdenfors (2004). Action categories are regarded as properties in the sense of Gärdenfors (2011) and are understood as convex regions in action space. Action categories are represented via vector-based prototypes that define a Voronoi cell and thus a convex region in action space. In our approach, action categories are described by a prototypical force signature that represents the forces that act upon a trajector involved in the action. This force signature is approximated via a representation of the time-indexed position of the trajector relative to a set of given landmarks. We also present a computational approach to extract such representations from video data. A DTW-based distance measure is used as metric in action space and to define prototypes as median elements. The prototype is thus not an average vector, but a specific instance that is closest to all other instances of the action category.

We have presented results of an unsupervised clustering as well as a supervised classification task on the Motionese dataset, consisting of videos of parents demonstrating actions on objects to their children. Our results show that, while our representations seem to be reasonable, the representations extracted do not allow us to discriminate between or classify the actions with high accuracy beyond a handful of action categories. While this might be partially due to the noise in trajectory extraction, it also hints at the fact that a mere representation of the force signature is not enough for a cognitive and holistic representation of action. Instead, we think that a representation of action that encompasses i) its participants, ii) the relation of the participant and the object, as well as iii) holistic and prepositional features such as 'the moved object is in the hand of the agent at the beginning of the action' or 'the moved object is in the other object at the end of the action' is crucial. The representation of such more holistic and more cognitively oriented aspects of an action will require the ability to represent and reason about image schemas and other more basic cognitive frames or templates.

Acknowledgements: We thank Katharina Rohlfing and her group for kindly allowing us to use the Motionese dataset.

References

Barron, J.; Fleet, D.; and Beauchemin, S. 1994. Performance of optical flow techniques. International Journal of Computer Vision (IJCV) 12(1):43–77.

Bradley, P.; Mangasarian, O.; and Street, W. 1997. Clustering via concave minimization. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS), 368–374.

Gärdenfors, P. 2004. Conceptual Spaces: The Geometry of Thought. Cambridge, MA: The MIT Press.

Gärdenfors, P. 2011. Semantics based on conceptual spaces. In Banerjee, M., and Seth, A., eds., Logic and Its Applications, volume 6521 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg. 1–11.

Gärdenfors, P., and Warglien, M. 2012. Using conceptual spaces to model actions and events. Journal of Semantics.

Jäger, G. 2010. Natural color categories are convex sets. In Aloni, M.; Bastiaanse, H.; de Jager, T.; and Schulz, K., eds.,
Logic, Language and Meaning, Lecture Notes in Computer
Science. Springer Berlin / Heidelberg.
Laptev, I.; Marszałek, M.; Schmid, C.; and Rozenfeld, B. 2008. Learning realistic human actions from movies. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).
Li, Z.; Wachsmuth, S.; Fritsch, J.; and Sagerer, G. 2007.
View-adaptive manipulative action recognition for robot
companions. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 1028–1033.
Lütkebohle, I.; Peltason, J.; Haschke, R.; Wrede, B.; and
Wachsmuth, S. 2010. The curious robot learns grasping
in multi-modal interaction. Interactive Communication for
Autonomous Intelligent Robots. video submission with abstract.
Rohlfing, K.; Fritsch, J.; Wrede, B.; and Jungmann, T. 2006.
How can multimodal cues from child-directed interaction
reduce learning complexity in robots? Advanced Robotics
20(10):1183–1199.
Rosch, E. 1975. Cognitive representations of semantic
categories. Journal of Experimental Psychology: General
104:192–233.
Tanaka, G.; Nagai, Y.; and Asada, M. 2011. Bottom-up
attention improves action recognition using histograms of
oriented gradients. In Proceedings of the 12th IAPR Conference on Machine Vision Applications, 467–470.
Vollmer, A.-L.; Pitsch, K.; Lohan, K. S.; Fritsch, J.; Rohlfing, K. J.; and Wrede, B. 2010. Developing feedback: How
children of different age contribute to a tutoring interaction
with adults. In Proceedings of the International Conference
on Development and Learning, 76–81.