Cognitive Robotics: AAAI Technical Report WS-12-06

Towards Action Representation Within the Framework of Conceptual Spaces: Preliminary Results

Oliver Beyer, Philipp Cimiano, Sascha Griffiths
CITEC, Bielefeld University

Abstract

We propose an approach for the representation of actions based on the conceptual spaces framework developed by Gärdenfors (2004). Action categories are regarded as properties in the sense of Gärdenfors (2011) and are understood as convex regions in action space. Action categories are mainly described by a force signature that represents the forces that act upon a main trajector involved in the action. This force signature is approximated via a representation that specifies the time-indexed position of the trajector relative to several landmarks. We also present a computational approach to extract such representations from video data. We present results on the Motionese dataset consisting of videos of parents demonstrating actions on objects to their children. We evaluate the representations on a clustering and a classification task, showing that, while our representations seem to be reasonable, only a handful of actions can be discriminated reliably.

Introduction

Cognitive systems, and robots in particular, need to be able to recognize and reason about actions. This requires an appropriate formalism for the representation of actions which encompasses: i) the participants of that action, ii) its teleological structure including goals and intentions of participants, iii) its spatio-temporal structure, as well as iv) its preconditions and effects on the world. Such a representation would ideally support action recognition, reasoning about the goals of participants, simulation of the action, planning, etc. Most importantly, such a representation should not be specific to a particular action, i.e. a particular instance carried out on a specific object (e.g. representing a specific instance of a 'putting' event), but should capture general properties of the action category, e.g. the action category of putting sth. into sth. else. Essentially, we require a holistic and gestalt-like representation that allows us to represent the action category in a way that abstracts away from specific participants, specific objects involved, etc.

An appealing theory that can be used to represent such action concepts is the conceptual spaces framework by Gärdenfors (2004). Gärdenfors proposes a geometric framework for the representation of concepts as convex regions in vector space. The strength of this theory is that it is cognitively plausible and lends itself to computational implementation, as it builds on standard vector-based representations and geometric operations that manipulate these representations. Gärdenfors and Warglien (2012) have also presented an approach to represent action concepts in the framework of conceptual spaces. Actions are modelled by adding a force domain to the conceptual space, which allows us to represent dynamic concepts. The force dimension captures the dynamic content of actions and thus forms a crucial component of a representation of action.

In this paper, we present a computational model of action representation based on the conceptual spaces framework of Gärdenfors (2004). The representation focuses exclusively on the spatio-temporal structure of an action by encoding the relative position of a moved trajector over time. This can essentially be regarded as an encoding of the force that acts upon a moved trajector during the action. We represent action categories as vector-based prototypes that define a Voronoi cell, which represents a property in the sense of the conceptual spaces framework. In this paper we formalize this idea and discuss a computational model of this representation of action categories. We present an approach that can extract such representations from video data and present some preliminary results on two tasks: an unsupervised task consisting of grouping similar actions together, as well as a supervised task in which unseen actions are classified into the appropriate action category. We apply our model to naturalistic data from the Motionese dataset, in which parents demonstrate actions to teach their children (Rohlfing et al. 2006; Vollmer et al. 2010). The main question we approach in this article is whether vector-based representations of the trajectory are enough to discriminate between different action concepts, considering eight actions found in the Motionese dataset: 'push', 'pull', 'put', 'switch', 'shut', 'close', 'open' and 'place'. Our results show that, while a few classes can be discriminated easily on the basis of this vector-based representation of the trajectory, with a growing number of classes the discriminative power decreases rapidly and substantially.

The structure of the paper is as follows: in the next section (Conceptual Spaces) we give a brief overview of the theory of conceptual spaces, focusing in particular on how action concepts can be represented in this framework. In Section Computational Model of Action Spaces we present our computational model, and in Section Extraction of Action Representations we present an approach to extract such representations from video data. In Section Experiments we present our experiments and our results on a clustering and classification task. Before concluding, we discuss some related work.

Conceptual Spaces

The conceptual spaces theory by Gärdenfors (2004) proposes a geometric approach to the representation of concepts. Following the insight that many natural categories are convex, it proposes to formalize properties and concepts as convex regions in vector space. The conceptual spaces approach rests on five important notions. These are quality dimensions, domains, properties, concepts and instances:

• a quality dimension represents some way in which two stimuli can be different (e.g. temperature, weight, etc.); each quality dimension is assumed to have a metric
• a domain encompasses several quality dimensions that form a unit (e.g. the domain of 'color' is composed of three quality dimensions that we can observe: hue, saturation and brightness)
• properties are convex regions in one domain (e.g. the property of being 'red' would be a convex region in the domain of color)
• concepts consist of convex subsets of quality dimensions in different domains, e.g. a 'red circle' involves the convex property 'red' in the domain of color and the convex property 'circle' in the domain of form
• instances are specific vectors (points) in a vector space representing a specific entity, object, action, etc.

The theory of conceptual spaces is appealing for a number of reasons:

1. It is cognitively plausible in the sense that it is compatible with various empirical findings in the cognitive sciences, e.g. with prototype effects (Rosch 1975) and with the fact that many natural categories are convex. This has, for example, been demonstrated empirically for the domain of color (Jäger 2010).
2. It spans multiple levels of representation, from subsymbolic to symbolic, thus supporting tasks at different levels of abstraction, from recognition to planning. This is a result of adopting geometric, vector-based representations that are to some extent grounded in perceptual and other sensorimotor features. On the other hand, the vector-based representation can be manipulated symbolically by standard operations on vectors, supporting in particular compositionality, as new concepts and properties can be created by combining existing representations.
3. The theory lends itself to computational implementation, as vectors and the operations which manipulate them can be implemented straightforwardly.

In his book, Gärdenfors (2004) argues that action categories can also be regarded as convex regions in space. More details on how actions can be represented in the conceptual spaces framework have been provided by Gärdenfors and Warglien (2012). The main point is that actions are encoded in a force domain, and the representation need not include any more information about the participants in an event or action other than the forces which they contribute to the event. In line with the criterion that concepts need to be convex regions in a space, they suggest that an action category is a convex region in action space. The basic difference between object categories and action categories is that an object category is mainly determined by static quality dimensions and domains, while action categories involve forces in several dimensions that act in specific ways and directions as the action progresses.

Following the approach of Gärdenfors and Warglien, in our model actions are represented through the forces that act on so-called trajectors, which describe the center of focus for a particular action concept. For instance, the action categories 'push' and 'pull' differ in their direction, while the intensity of the forces might be equal. Therefore, in our approach we encode the spatio-temporal relation of the trajector relative to a set of landmarks. As this relation may change over time depending on the applied forces, time can be seen as an additional dimension in our representation. We thus propose a simplified representation of an action category through a prototype that represents the typical spatio-temporal signature of the forces involved in an action that belongs to this category. These forces are essentially encoded in the form of a trajectory that specifies the position of the trajector relative to a set of landmarks at different points in time.

Figure 1: Concept of 'across', which illustrates the trajector moving across the landmark.

Take the example of the action depicted in Figure 1. It depicts three time points during the action of moving a circle (the trajector) across the rectangle (the landmark). The three time points can be characterized as follows: at time point 1, the trajector is to the left of the landmark. At time point 2, the circle is on the landmark, while at time point 3, the trajector is on the right side of the landmark. This example illustrates how the relative positions of the trajector to one or several landmarks can be used to describe the prototypical signature of an action.
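Before turning to the formalization, the following toy sketch (Python) illustrates the basic geometric notions introduced above for the static case: a single domain with three quality dimensions, instances as points, and properties as the convex Voronoi cells of prototypes under a Euclidean metric. The prototype coordinates and the function name are purely illustrative assumptions and are not taken from the paper.

    import math

    # One domain (colour) with three quality dimensions: hue, saturation, brightness.
    # The prototype coordinates below are illustrative assumptions.
    prototypes = {
        "red":   (0.00, 0.9, 0.8),
        "green": (0.33, 0.9, 0.8),
        "blue":  (0.66, 0.9, 0.8),
    }

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def property_of(instance):
        # An instance is a point in the domain; it falls into the (convex)
        # Voronoi cell of the nearest prototype.
        return min(prototypes, key=lambda p: euclidean(instance, prototypes[p]))

    print(property_of((0.05, 0.8, 0.7)))  # -> 'red'

The same nearest-prototype scheme is what the action space below uses, only with a DTW-based metric instead of the Euclidean distance.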
Computational Model of Action Spaces

In this section we present our computational model for the representation of action categories, which builds on the conceptual spaces theory of Gärdenfors (2004). Before presenting an approach that extracts the appropriate vector-based representation of actions from video data, we first formalize our model.

Formalization of a conceptual space

In correspondence with Gärdenfors (2011) we represent action concepts geometrically. A conceptual space, i.e. the space of all possible concepts C, can be formalized as C = D_1 × ... × D_n, where the D_i are so-called domains. A domain is itself a space D_i = D_{i,1} × ... × D_{i,m}. Given a certain domain D_i with m integral quality dimensions, we call a convex subset P ⊆ D_{i,1} × ... × D_{i,m} a property. We assume that a special value ∅ is a member of every domain. If the value of a certain domain is ∅ for a certain instance, this means that this integral domain simply does not apply to the instance in question. In the domain of colors with three integral dimensions corresponding to HSV values, the property 'red' is a convex subset of the color domain hue × saturation × brightness, which consists of three integral quality dimensions. A concept is a tuple (p_1, p_2, p_3, ..., p_n) where each p_i is a convex subset of D_i. A concept might be, for instance, a red circle, involving both the color domain as well as the domain of form.

Representation of actions

The action depicted in Figure 1 can be described with respect to one landmark. In the general case, we might describe the relative position of the main trajector with respect to several landmarks. In this paper we thus consider the special case in which there are two landmarks: a source landmark representing the origin of the action and a target landmark representing the end of the trajectory of the main trajector. Nevertheless, our formalization can be straightforwardly extended to the case of several landmarks.

An action category can be seen as a concept in the above sense whereby the only integral dimension relevant for the representation of an action is the domain of force F. Without loss of generality, let D_k be the force domain. An action in our sense is thus an element

A = (∅, ∅, ..., f, ..., ∅) ∈ D_1 × ... × D_{k-1} × F × D_{k+1} × ... × D_n

with Π_F(A) = f ∈ F being the projection of A onto the force domain. Hereby, we formalize F as F = P(A_Ls × L_Ls × A_Lg × L_Lg × T), where A_Ls, L_Ls, A_Lg, L_Lg = {0, 1} and T is a set of timestamps, i.e. integers for our purposes; a force signature f is thus a set of time-indexed tuples. Here A_Ls represents whether the trajector is above the source landmark at a certain timepoint t, while L_Ls represents whether the trajector is left of the source landmark. Analogously, A_Lg and L_Lg have the same interpretation relative to the target landmark.

Figure 2: Action representation of the concept 'put', showing the relative distance to the landmarks and the representation.

An example of this representation is given in Figure 2. The left side of the figure shows the relative position of the trajector towards the two landmarks (L_source, L_goal). Every red dot indicates one of the 12 discrete timestamps of the trajector's path. The 12 tuples on the right side of the figure correspond to these dots and constitute our representation of the action.

Given a prototypical instance p_a of an action category c_a, the concept is defined as:

c_a = {ξ_i | ξ_i ∈ F ∧ ∀j: Δ(ξ_i, p_a) ≤ Δ(ξ_i, p_j)}

The category c_put, represented by a prototype p_put, is thus the convex subset consisting of all particular actions ξ_i that are more similar to p_put than to the prototype p_j of any other action category. This definition presupposes some notion of distance between instances of actions, which we define below.
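As an illustration of this representation, the following sketch (Python) encodes a time-indexed trajector path relative to two landmarks into the set of (A_Ls, L_Ls, A_Lg, L_Lg, t) tuples defined above. The function name, the example values and the image-coordinate convention ('above' means a smaller y value) are our own assumptions, not the authors' code.

    def encode_action(path, l_source, l_goal):
        # path: list of (x, y) trajector positions, one per timestamp.
        # l_source, l_goal: (x, y) positions of the two landmarks.
        # Returns the force signature as a list of (A_Ls, L_Ls, A_Lg, L_Lg, t) tuples.
        signature = []
        for t, (x, y) in enumerate(path, start=1):
            a_ls = 1 if y < l_source[1] else 0   # above the source landmark (image coords)
            l_ls = 1 if x < l_source[0] else 0   # left of the source landmark
            a_lg = 1 if y < l_goal[1] else 0     # above the goal landmark
            l_lg = 1 if x < l_goal[0] else 0     # left of the goal landmark
            signature.append((a_ls, l_ls, a_lg, l_lg, t))
        return signature

    # Example: a trajector moving from the source landmark towards the goal landmark.
    path = [(10, 50), (30, 40), (50, 45), (70, 52)]
    print(encode_action(path, l_source=(10, 50), l_goal=(70, 52)))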
Extraction of Action Representations

In this section we present an approach to extracting action representations as described above from video data depicting 8 different actions carried out by humans on 10 objects. Without prior knowledge about the scene or the relevant objects it is a difficult task to determine the forces that act on a moved object. Thus, we first determine the object that is moved (the trajector) as well as its trajectory. This is a non-trivial task, as videos often include a lot of motion besides the motion induced by the main trajector. In the next step we then generate a representation of the action on the basis of the trajectory. We make the assumption that there is always only one relevant motion in each of our input videos. In the following we present an approach to extracting the main trajector and its trajectory. As shown in Figure 3, the procedure consists of the following four steps:

1. Finding regions of high motion
2. Determining region trajectories
3. Selecting the main trajectory
4. Generating action representations

Figure 3: Phases of the automatic extraction of action representations, from the input video to the action representation.

We describe these steps in more detail in the following.

Finding regions of high motion: We define the video input signal as V = (f_1, ..., f_n), with f_i being the i-th frame of the video. A frame is essentially a two-dimensional image displayed in the video at a certain time. As there is usually more activity in videos besides the main movement, it is a difficult task to detect the trajectory of the main trajector. In order to segment out the main motion from background motion, we identify regions of high motion in the first processing step to reduce our search space. In order to find these regions, we generate the images δ_1, ..., δ_{n-1} by subtracting the gray values pixel by pixel of every two consecutive frames f_i, f_{i+1} from each other:

δ_i = f_i − f_{i+1}

For each frame f_i we then compute maximal sets of contiguous pixels (x_q, y_q) such that δ_i(x_q, y_q) ≥ θ, where θ is some empirically determined threshold. This yields a number of high-motion regions (R_1^{f_i}, ..., R_m^{f_i}) for every frame f_i. These regions define the areas where we locally start our search for possible trajectories.

Determining region trajectories: In order to connect regions in a frame f_i to regions in the following frame f_{i+1}, with the goal of computing a trajectory, we apply the optical flow approach of Barron, Fleet, and Beauchemin (1994) to every region R_j^{f_i} of each frame f_i in order to estimate the successor region succ(R_j^{f_i}) in frame f_{i+1} for every region (R_1^{f_i}, ..., R_m^{f_i}) of every frame f_i. Thus, we get a sequence of regions succ(...succ(R_j^{f_i})...) for every region R_j^{f_i} starting in frame f_i. In order to determine a trajectory for such a region sequence, we define a central point cp_j^{f_i} for every region R_j^{f_i}, located at the center of the region in the image. We further define the trajectory T_j as a sequence of the form (cp_j^{f_1}, cp_j^{f_2}, ..., cp_j^{f_n}).
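A rough sketch of the first two steps is given below (Python with OpenCV and NumPy). It is a simplified stand-in for the procedure described above, not the authors' implementation: the paper relies on the optical flow techniques of Barron, Fleet, and Beauchemin (1994) and links regions from frame to frame, whereas the sketch uses Farnebäck's dense flow as implemented in OpenCV and simply propagates the centre of the largest high-motion region; the threshold value and all names are assumptions.

    import cv2
    import numpy as np

    THETA = 25  # empirically determined gray-value threshold (assumption)

    def largest_motion_region(prev_gray, cur_gray):
        # Centre point of the largest maximal set of contiguous pixels whose
        # gray-value difference between two consecutive frames exceeds THETA.
        delta = cv2.absdiff(prev_gray, cur_gray)
        _, mask = cv2.threshold(delta, THETA, 255, cv2.THRESH_BINARY)
        n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
        if n <= 1:                       # label 0 is the background
            return None
        biggest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        return tuple(centroids[biggest])

    def region_trajectory(frames):
        # Propagate the centre of a high-motion region through the video with
        # dense optical flow (Farneback), yielding one central point per frame.
        grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
        point = largest_motion_region(grays[0], grays[1])
        if point is None:
            return []
        trajectory = [point]
        for prev, cur in zip(grays, grays[1:]):
            flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            x, y = trajectory[-1]
            y_i = int(np.clip(y, 0, prev.shape[0] - 1))
            x_i = int(np.clip(x, 0, prev.shape[1] - 1))
            dx, dy = flow[y_i, x_i]      # displacement at the current centre point
            trajectory.append((x + dx, y + dy))
        return trajectory

In the full procedure, one such candidate trajectory would be computed per high-motion region, and the longest one would then be selected as the main trajectory, as described next.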
Selecting the main trajectory: In order to find the main trajectory, we follow the assumption that the trajectory of interest is the longest one. Thus the main trajectory can be described as follows:

T_max = argmax_j |T_j|

with |T_j| = n being the length of the trajectory.

Action representation: We define the location of the trajectory T(f_i) on frame f_i and the position of the two landmarks L_source, L_goal as follows:

T(f_i) = cp_j^{f_i} with cp_j^{f_i} ∈ T_max
L_source = T(f_1)
L_goal = T(f_n)

Based on the main trajectory T_max, we encode the corresponding force f = Π_F(A) ∈ F of the action A as a sequence of vectors (A_Ls, L_Ls, A_Lg, L_Lg, i) ∈ f, with

A_Ls/Lg = 1 if T(f_i) is above L_source/L_goal, and 0 if T(f_i) is under L_source/L_goal
L_Ls/Lg = 1 if T(f_i) is left of L_source/L_goal, and 0 if T(f_i) is right of L_source/L_goal

Similarity between actions

The standard Euclidean distance cannot be used in our case to compare actions, as the trajectories have different lengths. Thus, we make use of dynamic time warping (DTW) to define a distance measure Δ and thus be able to compare trajectories of variable length. In order to compare two actions ξ_i and ξ_j, we determine the distance Δ(ξ_i, ξ_j) by DTW as follows, with n and m being the lengths of a1 and a2 and d a local distance between two elements of the representations:

    DTW-DISTANCE(Action a1, Action a2)
      FOR i := 1 TO m
        DTW[0][i] := infinity
      FOR i := 1 TO n
        DTW[i][0] := infinity
      DTW[0][0] := 0
      FOR i := 1 TO n
        FOR j := 1 TO m
          cost := d(a1[i], a2[j])
          DTW[i][j] := cost + min(DTW[i-1][j], DTW[i][j-1], DTW[i-1][j-1])
      RETURN DTW[n][m]

Clustering similar actions

We carry out experiments in which we cluster action instances extracted from video data as described above and determine a prototype for every action category as the median element of each cluster. We rely on a k-Median clustering approach (Bradley, Mangasarian, and Street 1997), which is similar in spirit to k-Means with the exception of adapting the prototypes to the median instead of the mean. The clustering algorithm is thus as follows:

1. Initialize p_i with 1 ≤ i ≤ k to some randomly chosen instances ξ_i
2. Assign each ξ_i to the closest prototype p s.t. ∀j: Δ(ξ_i, p) ≤ Δ(ξ_i, p_j)
3. For each p_i, find the median of the cluster, ξ_median = argmin_{ξ_k ∈ p_i} Σ_{ξ_j ∈ p_i} Δ(ξ_k, ξ_j), and set p_i = ξ_median
4. Repeat steps 2-3 until the clustering is stable

Classifying actions

We further consider the task of classifying unseen actions into their corresponding action category in a supervised fashion. For this we use labelled data in which every action has been manually assigned by annotators to its corresponding category. The training is performed by the following two steps:

1. Group actions according to their labels as follows: S_j = {ξ_i | label(ξ_i) = j}
2. Compute the median p_j for each labelled action class j as described above

In essence this corresponds to a 1-nearest-neighbour classifier in which a new action is simply classified into the class of the nearest prototype. The prediction of the label of a new action ξ_new is done as follows:

class(ξ_new) := label(p_a) where p_a = argmin_{p_k} Δ(ξ_new, p_k)
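The distance measure, the median prototypes and the nearest-prototype classifier can be put together as in the following sketch (Python). The DTW part is a direct transcription of the pseudocode above; the local distance d between two tuples is chosen here as the Hamming distance over the four binary components, ignoring the timestamp, which is our own assumption since the paper does not spell d out.

    def dtw_distance(a1, a2, d):
        # Dynamic-time-warping distance between two action signatures of
        # possibly different lengths, following the pseudocode above.
        n, m = len(a1), len(a2)
        INF = float("inf")
        dtw = [[INF] * (m + 1) for _ in range(n + 1)]
        dtw[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = d(a1[i - 1], a2[j - 1])
                dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
        return dtw[n][m]

    def local_d(u, v):
        # Hamming distance over the four binary components (A_Ls, L_Ls, A_Lg, L_Lg);
        # the timestamp component is ignored here (assumption).
        return sum(abs(a - b) for a, b in zip(u[:4], v[:4]))

    def median_prototype(cluster):
        # The instance with minimal summed DTW distance to all other instances.
        return min(cluster,
                   key=lambda xi: sum(dtw_distance(xi, xj, local_d) for xj in cluster))

    def classify(x_new, prototypes):
        # 1-nearest-neighbour classification against the labelled prototypes.
        # prototypes: dict mapping class label -> prototype signature.
        return min(prototypes,
                   key=lambda label: dtw_distance(x_new, prototypes[label], local_d))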
Experiments

Dataset

The data used in our study is the Motionese dataset (Rohlfing et al. 2006; Vollmer et al. 2010). This collection of videos was recorded to investigate explicit tutoring in adult-child interactions. For this purpose, 64 pairs of parents presented a set of 10 objects both to their infants and to the respective other adult, while also explaining a pre-defined task with the object. The parent and the child faced each other during the task, sitting opposite each other across a table. We selected 8 actions ('put', 'pull', 'open', 'shut', 'switch', 'place', 'close', 'push') that were performed with the 10 objects. The present study uses the material which was filmed from the infants' perspective. It was annotated by selecting segments that depict one of the 8 actions in question. The files were then cut up into those units and processing was done on these segments.

Clustering Results

In the first phase of our experiment we took 19 examples of each action category (152 in total) to perform a clustering task (purely unsupervised) as described above. In order to establish that our compact representation does not lose too much information of the source trajectory, we compare our results to a baseline which discriminates between the actions on the basis of the raw trajectory as a substitute for our action representation. To evaluate how well the clusters have been formed, we measured the purity of our clustering, which is defined as follows:

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

with Ω = {ω_1, ω_2, ..., ω_K} being the set of clusters and C = {c_1, c_2, ..., c_J} being the set of classes. In order to visualize the impact of the number of action categories on our clustering task, we performed seven experiments with k = 2, ..., 8, always selecting the subset of k action categories leading to the highest purity. We compare our action representation with the pure trajectory of the action in order to quantify how much information is lost. The results are shown in Figure 4.

Figure 4: Results of the clustering.

The results license two observations. The first one is the fact that, with an increasing number of classes, the purity starts off at 94.74% and drops substantially for every category that is added. Table 1 shows the purity as well as the classes considered for each k. While the actions 'pull' and 'place' can be discriminated very reliably with a purity of 94.74%, when adding 'close' the performance already drops by 16%. When adding one more class, i.e. 'open', purity drops again by 13% to 65.79%.

The second observation is the fact that the clustering using our action representation performs comparably to our baseline. The maximum distance between the two is 12.03%, in the case of only two clusters. This difference constantly decreases with the number of clusters, down to 0.7% in the case of 8 clusters. Our representation thus allows us to discriminate between a few classes of clearly differentiable action categories, but performance drops severely for every additional action class added.

Classification Results

In the second phase of our experiment, the classification task, we performed a 10-fold cross-validation of the accuracy, training on 9 folds and evaluating on the remaining fold. Table 1 shows the resulting accuracy of our system. While our approach yields accuracies of 95% when classifying an action into one of the three most distinguishable classes, the performance drops severely when considering more than 4 action categories. Table 1 also shows the percentage of cases in which convexity is violated for each action category, i.e. the average number of times that an action belonging to action category i is actually closer to the prototype (median) p_j of category j than to p_i. The numbers show that the classes are rather coherent, with violations of convexity in only 5%-10% of the cases.

k | Action Categories                                 | Accuracy of Classification | Purity of Clustering | Percentage of Convexity Violations
2 | pull, place                                       | 100.00 | 94.74 | 3.44
3 | pull, place, close                                | 95.00  | 78.95 | 4.34
4 | pull, place, close, open                          | 66.67  | 65.79 | 5.07
5 | pull, place, close, open, switch                  | 57.14  | 53.68 | 5.80
6 | pull, place, close, open, switch, shut            | 44.44  | 45.61 | 7.25
7 | pull, place, close, open, switch, shut, push      | 40.00  | 42.86 | 8.75
8 | pull, place, close, open, switch, shut, push, put | 35.71  | 36.84 | 10.15

Table 1: Accuracy of our classifier and percentage of the convexity violations during the classification process.
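For reference, the purity measure used above can be computed as in this short sketch (Python); the data layout, with each cluster given as a list of the gold-standard labels of its members, is our own choice.

    from collections import Counter

    def purity(clusters):
        # clusters: list of clusters, each a list of the gold-standard class
        # labels of the instances assigned to that cluster.
        total = sum(len(c) for c in clusters)
        majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
        return majority / total

    # Toy example with two clusters over the labels 'pull' and 'place'.
    print(purity([["pull", "pull", "place"], ["place", "place"]]))  # -> 0.8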
Discussion

Figure 5 plots the distance between the prototypes of different action categories. We can observe that certain pairs of action categories are very close to each other in terms of distance, i.e. 'push' and 'switch', 'switch' and 'place', 'push' and 'pull', 'push' and 'shut', 'open' and 'pull', as well as 'open' and 'shut'. While it is understandable that 'push', 'place', 'switch' and 'shut' can be easily confused, because they all involve a forward movement towards the pushed, switched or shut object, the closeness between 'push' and 'pull' is certainly surprising, as they involve movements towards and away from the object, respectively.

Figure 5: Distance between the prototypes of each action class.

A deeper investigation of the data revealed an explanation for the fact that 'pull' and 'push' are often confused. A pull action typically involves an act of grasping directed towards the object, such that the two are difficult to distinguish. This problem is further exacerbated by the fact that many action videos depicting a pull action are cut off right after the grasping, so that the actual pulling cannot be seen in the video, but can be inferred by a human (though not by a machine). Thus, depending on the completeness of the action depicted in the video clip, a pull action might actually be confounded with actions directed towards the object. As described below, this is one of the issues we face concerning our dataset. Actions which are very dissimilar are 'pull' and 'place' as well as 'open' and 'place', which are directed towards the object (in the case of 'place') and away from it ('pull' and 'open').

Our results clearly suggest that additional features will be necessary to distinguish between actions such as 'push', 'place', 'switch' and 'shut', which all involve a movement targeted towards the object and can thus not be discriminated easily on the basis of the trajectory alone. Features are needed that represent the specific way in which the object is manipulated, as well as a representation of the resultative state of the object, i.e. moved forward in the case of 'push', closed in the case of 'shut', switched off in the case of 'switch'. Needless to say, extracting such fine-grained information from perception is a challenge that nevertheless needs to be addressed to obtain expressive action representations with sufficient discriminatory power.

Thus, while we have shown that our representation is generally reasonable, being able to discriminate between a few easily distinguishable categories, and that the convexity property seems to hold to a large extent, the representation is clearly limited, not being able to reliably discriminate more than a handful of actions. This can be due to several issues:

• Trajectories: The trajectories have been extracted automatically without any prior knowledge. This means that there are certain heuristics in the extraction process that could add additional noise to the trajectory, which is the main element in our process of generating action schemas. This might be solved by manually annotating the trajectory of the main trajector.
• Annotations: The annotations of the actions depicted in a given segment are also subject to noise. First, there is room for interpretation as to what action is actually depicted in a given video segment. This can be addressed by having several annotators and only considering those annotations as reliable on which at least two annotators agree. Further, the temporal extension of the action is also subject to interpretation, leading to the fact that many actions (see our discussion of 'pull' above) were not completely visible in the segments specified by our annotators. This can also be tackled by having several annotators and taking the longest segment spanning the segments chosen by different annotators.
• Limits of our representation: As already discussed above, our representation also has clear limitations and would need to be extended to accommodate more complex features such as those mentioned above.

In our future work we will continue to investigate the limits of our representation as well as improve the approach to extract representations from video data, possibly also creating a gold standard in which the representations are extracted by hand, thus being able to study the performance of our representation in the case of zero noise.

Related Work

There have been several approaches to action representation and recognition, in particular in the area of robotics. Most approaches are task-specific and do not strive for a general gestalt-based and cognitively inspired representation such as the one we present here. For example, some models just represent specific actions such as hand movements (Li et al. 2007), and often even require a specific hardware configuration such as motion tracking systems which require markers or data gloves (Lütkebohle et al. 2010). One promising line of investigation is saliency-based models.
The work of Tanaka et al. (2011) is similar to our approach. However, what is noticeable is that their method does not scale up to a large number of actions either; in their approach, just five actions are discriminated. The question is also how far such systems can cope with more naturalistic data. The action categories in our approach were fixed by the experimental design adopted in the Motionese data, where parents were asked to demonstrate a number of pre-defined actions on a set of objects to their children. In contrast, Laptev et al. (2008) show how action recognition is possible on realistic video material taken from television, film and personal videos. It would be interesting to compare their approach to ours, as we aim to represent the content of actions.

We have presented a model for action representations that tries to represent action categories building on the conceptual spaces framework of Gärdenfors (2004) in such a way that the representation abstracts from specific instances, including specific actors and objects. Ours is the first implementation known to us of the proposal by Gärdenfors and Warglien for the representation of action that works with real naturalistic data. In this sense, we have been the first to present empirical results with a representation of action that is in line with the conceptual spaces framework, exploring the empirical implications of adopting such a representation. Further work will be devoted to extending such representations to more holistic representations that represent i) participants, ii) their goals and intentions, iii) the spatio-temporal relation between participants and objects involved in the action, etc. Knowledge about the type of the landmark as well as of the objects manipulated, and their functional and non-functional properties, will clearly also be important. Such a holistic representation will ultimately be necessary if we want to endow cognitive systems with the capability of recognizing and understanding the semantics of actions, as well as being able to reason about their consequences and implications.

Conclusion

In this paper, we have presented an approach to representing actions based on the conceptual spaces framework developed by Gärdenfors (2004). Action categories are regarded as properties in the sense of Gärdenfors (2011) and are understood as convex regions in action space. Action categories are represented via vector-based prototypes that define a Voronoi cell and thus a convex region in action space. In our approach, action categories are described by a prototypical force signature that represents the forces that act upon a trajector involved in the action. This force signature is approximated via a representation of the time-indexed position of the trajector relative to a set of given landmarks. We also present a computational approach to extract such representations from video data. A DTW-based distance measure is used as metric in action space and to define prototypes as median elements. The prototype is thus not an average vector, but a specific instance that is closest to all other instances of the action category.

We have presented results of an unsupervised clustering as well as a supervised classification task on the Motionese dataset, consisting of videos of parents demonstrating actions on objects to their children. Our results show that, while our representations seem to be reasonable, the representations extracted do not allow us to discriminate between, nor classify, the actions with high accuracy beyond a handful of action categories. While this might be partially due to the noise in trajectory extraction, it also hints at the fact that a mere representation of the force signature is not enough for a cognitive and holistic representation of action. Instead, we think that a representation of action that encompasses i) its participants, ii) the relation of the participant and the object, as well as iii) holistic and prepositional features such as 'the moved object is in the hand of the agent at the beginning of the action' or 'the moved object is in the other object at the end of the action' is crucial. The representation of such more holistic and more cognitively oriented aspects of an action will require the ability to represent and reason about image schemas and other more basic cognitive frames or templates.

Acknowledgements: We thank Katharina Rohlfing and her group for kindly allowing us to use the Motionese dataset.

References

Barron, J.; Fleet, D.; and Beauchemin, S. 1994. Performance of optical flow techniques. International Journal of Computer Vision (IJCV) 12(1):43–77.

Bradley, P.; Mangasarian, O.; and Street, W. 1997. Clustering via concave minimization. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS), 368–374.

Gärdenfors, P. 2004. Conceptual spaces: The geometry of thought. Cambridge, MA: The MIT Press.

Gärdenfors, P. 2011. Semantics based on conceptual spaces. In Banerjee, M., and Seth, A., eds., Logic and Its Applications, volume 6521 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg. 1–11.

Gärdenfors, P., and Warglien, M. 2012. Using conceptual spaces to model actions and events. Journal of Semantics.

Jäger, G. 2010. Natural color categories are convex sets. In Aloni, M.; Bastiaanse, H.; de Jager, T.; and Schulz, K., eds., Logic, Language and Meaning, Lecture Notes in Computer Science. Springer Berlin / Heidelberg.

Laptev, I.; Marszałek, M.; Schmid, C.; and Rozenfeld, B. 2008. Learning realistic human actions from movies. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).

Li, Z.; Wachsmuth, S.; Fritsch, J.; and Sagerer, G. 2007. View-adaptive manipulative action recognition for robot companions. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 1028–1033.

Lütkebohle, I.; Peltason, J.; Haschke, R.; Wrede, B.; and Wachsmuth, S. 2010. The curious robot learns grasping in multi-modal interaction. Interactive Communication for Autonomous Intelligent Robots. Video submission with abstract.

Rohlfing, K.; Fritsch, J.; Wrede, B.; and Jungmann, T. 2006. How can multimodal cues from child-directed interaction reduce learning complexity in robots? Advanced Robotics 20(10):1183–1199.

Rosch, E. 1975. Cognitive representations of semantic categories. Journal of Experimental Psychology: General 104:192–233.

Tanaka, G.; Nagai, Y.; and Asada, M. 2011. Bottom-up attention improves action recognition using histograms of oriented gradients. In Proceedings of the 12th IAPR Conference on Machine Vision Applications, 467–470.

Vollmer, A.-L.; Pitsch, K.; Lohan, K. S.; Fritsch, J.; Rohlfing, K. J.; and Wrede, B. 2010. Developing feedback: How children of different age contribute to a tutoring interaction with adults. In Proceedings of the International Conference on Development and Learning, 76–81.