Knowledge Representation and Reasoning in Robotics: Papers from the AAAI Spring Symposium

An Approach for Scene Interpretation Using Qualitative Descriptors, Semantics, and Domain Knowledge

Zoe Falomir*
Universität Bremen (Germany)

* Correspondence to: Zoe Falomir, Cognitive Systems (CoSy), FB3 - Informatics, Universität Bremen, P.O. Box 330 440, 28334 Bremen, Germany. E-mail: zfalomir@informatik.uni-bremen.de.

Abstract

An approach for cognitive scene interpretation is proposed based on qualitative descriptors, domain knowledge and spatial logics. Qualitative descriptors of shape, colour, topology and location are used for describing any object in the scene. Two kinds of domain knowledge are provided: (i) categorizations of objects according to their qualitative descriptors, and (ii) information about target objects in the scenario. Spatial logics are defined for reasoning about the obtained semantics. A proof of concept is presented in the Interact@Cartesium scenario.

Introduction

In an envisaged future where intelligent robots populate our world, we imagine an everyday phone conversation with the robot left at home, asking "Is everything fine?", which the robot happily confirms. For an intelligent agent to reach such a conclusion, it needs to observe its environment and interpret the information it perceives, so this task reaches far beyond object recognition. Any environment, in particular one populated by humans, is subject to change: cups, toys, or pieces of furniture are frequently moved around; flowers or decorations are put up or discarded; interior furnishings get overhauled once in a while. Due to the variety of possible changes, it is infeasible to use machine learning methods to pre-learn every arrangement that could be considered "normal". Instead, scene understanding is required to interpret observations sensibly. This requires considering the context in order to interpret the meaning of a feature occurrence in the variety of possible situations, as well as reasoning to reach a conclusion on the basis of a manageable knowledge base that can differentiate normal from abnormal situations. In order to create such a system, perception needs to be integrated with abstract reasoning, which, in its full extent, is among the most long-standing and fundamental problems in AI research.

Companion robots and ambient intelligent systems interact with human beings. Those systems usually integrate digital cameras from which they obtain information about the environment. The ideal systems for interacting with people would be those capable of interpreting their environment cognitively, that is, similarly to how people do it. In this way, those systems may align their concepts with human concepts and thus provide common ground for establishing sophisticated communication.

Because digital images represent visual data numerically, most image processing has been successfully carried out by applying mathematical techniques to obtain and describe image content. However, there is a growing trend of works in the literature which extract qualitative/semantic information from images of real scenes and use this information to group images into categories. Some approaches described images semantically using only a word/concept (Oliva and Torralba 2001; Quattoni and Torralba 2009), while other approaches described more than one component semantically (Qayyum and Cohn 2007; Lim and Chan 2012). All these approaches provide evidence for the effectiveness of using qualitative/semantic information to describe images. However, very few approaches describe real images semantically as a set of components arranged in space.

Qualitative Image Descriptions (QIDs) based on visual and spatial features (Falomir et al. 2012) showed high adaptability to different real-world scenarios; here they are extended with semantics, domain knowledge and spatial logics (Kn-QID). The domain knowledge provided allows the system to categorize unknown objects according to their qualitative descriptors, but also to identify target objects and to provide semantics describing their affordances, mobility, etc. Spatial logics are defined for reasoning about the obtained semantics, and a proof of concept is presented in which a change in the scene that is "not ok" is identified and an action to be performed by the robot is involved.

Qualitative Image Descriptors (QIDs)

The QID approach (Falomir et al. 2012) applies a colour segmentation method (Felzenszwalb and Huttenlocher 2004) and then extracts the closed boundary of the relevant regions detected within a digital image. Each object/region extracted is described qualitatively by its shape and its colour. The spatial object description is composed of a topological description and an orientation description. Thus, the complete image is described as a set of qualitative descriptors of objects as:

[[QSD_1, QCD_1, Topology_1, Location_1], ..., [QSD_n, QCD_n, Topology_n, Location_n]]

where n is the total number of objects.

Qualitative Shape Description (QSD)

After analysing the slope of the pixels in the object boundary, the relevant points of the shape are extracted. Each of these points ({P_0, P_1, ..., P_N}) is described by four features:

• the Edge Connection (EC) occurring at P, described as: {line line, line curve, curve line, curve curve, curvature point};
• the Angle (A) at the relevant point P (when P is not a curvature point), described by the qualitative tags: {very acute, acute, right, obtuse, very obtuse};
• the Type of Curvature (TC) at the relevant point P (when P is a curvature point), described qualitatively by the tags: {very acute, acute, semicircular, plane, very plane};
• the Compared Length (L) of the two edges connected by P, described qualitatively by: {much shorter (msh), half length (hl), a bit shorter (absh), similar length (sl), a bit longer (abl), double length (dl), much longer (ml)};
• the Convexity (C) at the relevant point P, described as: {convex, concave}.

Thus, the complete shape of an object is described as a set of qualitative descriptions of points as:

[[EC_1, A_1 | TC_1, L_1, C_1], ..., [EC_m, A_m | TC_m, L_m, C_m]]

where m is the number of relevant points in the object (Falomir et al. 2013a).

Qualitative Colour Description (QCD)

The Red, Green and Blue (RGB) colour channels are translated into Hue, Saturation and Lightness (HSL) coordinates, and from these a Qualitative Colour Reference System (QCRS) is defined, where QCLAB_1..5 refers to the qualitative labels related to colour and QCINT_1..5 refers to the intervals of HSL colour coordinates associated with each colour label:

QCLAB_1 = {black, dark-grey, grey, light-grey, white}
QCINT_1 = {[0, 20), [20, 30), [30, 40), [40, 80), [80, 100) ∈ L / ∀ H ∧ S ∈ [0, 20]}
QCLAB_2 = {red, orange, yellow, green, turquoise, blue, purple, pink}
QCINT_2 = {(335, 360] ∧ [0, 15], (15, 40], (40, 80], (80, 160], (160, 200], (200, 260], (260, 297], (297, 335] ∈ H / S ∈ (50, 100] ∧ L ∈ (40, 55]}
QCLAB_3 = {pale- + QCLAB_2};  QCINT_3 = {∀ H_INT2 / S ∈ (20, 50] ∧ L ∈ (40, 55]}
QCLAB_4 = {light- + QCLAB_2}; QCINT_4 = {∀ H_INT2 / S ∈ (50, 100] ∧ L ∈ (55, 100]}
QCLAB_5 = {dark- + QCLAB_2};  QCINT_5 = {∀ H_INT2 / S ∈ (50, 100] ∧ L ∈ (20, 40]}

The QCRS depends on the vision system, but it is adaptable to other systems and/or scenarios by defining other colour tags and/or other HSL values (Falomir et al. 2013b).
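For illustration, here is a minimal Python sketch, not part of the Kn-QID implementation, that maps an HSL triple to a qualitative colour label following the QCINT intervals above; the function name and the handling of interval boundaries are assumptions.

```python
def qualitative_colour(h, s, l):
    """Map an HSL triple (H in [0,360], S and L in [0,100]) to a qualitative
    colour label, following the QCRS intervals sketched above.
    Minimal illustration; boundary handling in the published QCRS may differ."""
    # Grey scale (QCLAB_1): any hue, low saturation (S in [0, 20])
    if s <= 20:
        greys = [(20, "black"), (30, "dark-grey"), (40, "grey"),
                 (80, "light-grey"), (101, "white")]
        for upper, label in greys:
            if l < upper:
                return label
    # Chromatic hues (QCLAB_2); red wraps around 360
    hues = [(15, "red"), (40, "orange"), (80, "yellow"), (160, "green"),
            (200, "turquoise"), (260, "blue"), (297, "purple"),
            (335, "pink"), (360, "red")]
    base = next(label for upper, label in hues if h <= upper)
    # Saturation/lightness decide the prefix (pale-, light-, dark-, or none)
    if 20 < s <= 50 and 40 < l <= 55:
        return "pale-" + base
    if s > 50 and l > 55:
        return "light-" + base
    if s > 50 and 20 < l <= 40:
        return "dark-" + base
    return base  # S in (50, 100] and L in (40, 55]: the plain colour name

print(qualitative_colour(10, 80, 50))   # -> 'red'
print(qualitative_colour(220, 90, 30))  # -> 'dark-blue'
print(qualitative_colour(0, 10, 90))    # -> 'white'
```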
Topological Description

In order to represent the topological relationships of the objects in the image, the intersection model for region configurations in R² (Egenhofer and Franzosa 1991) is used, which describes the topological situation in space (invariant under translation, rotation and scaling) of an object A with respect to (wrt) another object B (A wrt B) as: TL_AB = {disjoint, touching, completely inside, container}. The TL_AB determines whether an object is completely inside another object or is its container. It also defines the neighbours of an object as all the other objects with the same container, which can be (i) disjoint from the object, if they do not have any edge or vertex in common; or (ii) touching the object, if they have at least one vertex or edge in common or if the Euclidean distance between them is smaller than a certain threshold set by experimentation.

Location Description

For obtaining the location of an object A wrt its container, or the location of an object A wrt an object B that is a neighbour of A, a sector-based model (Hernández 1991) is used. This Location Reference System (LRS) divides the space into nine regions, LRS_LAB = {up, down, left, right, up_left, up_right, down_left, down_right, centre}. To obtain the location of each object wrt another or wrt the image, the centre of the LRS is placed on the centroid of the reference object and its up area is fixed to the upper edge of the image. The location of an object is the union of all the location labels obtained for each of the relevant points of the shape of the object. If an object is located in all the regions of the LRS, it is considered to be in the centre.
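To make the sector-based model concrete, the following minimal Python sketch (names such as location_labels are illustrative, not taken from the Kn-QID implementation) assigns the nine LRS labels to the relevant points of an object relative to a reference centroid, assuming image coordinates with the y axis growing downwards.

```python
def sector_label(px, py, cx, cy, eps=1e-6):
    """Return one of the nine LRS labels for point (px, py) relative to the
    reference centroid (cx, cy). Image convention: y grows downwards, so a
    smaller y means 'up'. 'eps' absorbs points lying exactly on the axes."""
    dx, dy = px - cx, py - cy
    horiz = "" if abs(dx) <= eps else ("right" if dx > 0 else "left")
    vert = "" if abs(dy) <= eps else ("down" if dy > 0 else "up")
    if not horiz and not vert:
        return "centre"
    return f"{vert}_{horiz}" if vert and horiz else (vert or horiz)

def location_labels(points, centroid):
    """Union of sector labels over an object's relevant points; if the object
    spans every peripheral sector, it is considered to be in the centre."""
    peripheral = {"up", "down", "left", "right",
                  "up_left", "up_right", "down_left", "down_right"}
    labels = {sector_label(x, y, *centroid) for x, y in points}
    return {"centre"} if peripheral <= labels else labels

# Example: a small object up and to the right of the reference centroid
print(location_labels([(120, 30), (130, 35)], (100, 50)))  # -> {'up_right'}
```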
Generating Semantics from the QIDs

From the qualitative descriptors obtained, Description Logic (DL) axioms are generated which describe the images (Table 1, α rules), and for each scene, facts are obtained (Table 1, γ rules).

Table 1: (a) Excerpt of the reference conceptualization of the objects in the images and (b) basic image facts for a scene.
α1: Object_type ⊑ ∃has_shape.QSD_type
α2: Object_type ⊑ ∃has_colour.QCD_type
α3: Object_type ⊑ ∃has_location.Scene_type
α4: Object_type ⊑ ∃is_touching.Object_type
γ1: Scene_type : scene1
γ2: Object_type : object1
γ3: Object_type ⊑ {object1, object2, object3, ..., objectn}
γ4: object1 ≉ object2 ≉ ... ≉ objectn
γ5: is_container(scene1, object1)
γ6: is_container(scene1, object2)
γ7: is_container(scene1, object3)

Incorporating Domain Knowledge

The domain knowledge included in this approach consists of: (i) logic definitions to categorize objects based on their qualitative descriptors, (ii) images of 'target' objects which are 'known' by the agent and can be detected using feature-invariant detectors, and (iii) semantics for target objects regarding their use, dynamics, expected location, etc.

Contextual Knowledge for Object Categorization

Taking into account these semantics, contextualized knowledge for the domain has been defined for characterizing: (1) the wall; (2) the floor at the CoSy lab; (3) a poster at the CoSy lab; and (4) a seat at the CoSy lab. For example, an object may be categorized as floor at the CoSy lab if it is blue or black and it is located down in the scene and not up (see Table 2).

Table 2: Domain Knowledge for Object Categorization
β1: Wall ≡ Object_type ⊓ (∃has_colour.{pale_yellow} ⊔ ∃has_colour.{white} ⊔ ∃has_colour.{light_grey})
β2: CoSy_Floor ≡ Object_type ⊓ (∃has_colour.{blue} ⊔ ∃has_colour.{black}) ⊓ ∃is_down.Scene ⊓ ¬(∃is_up.Scene)
β3: CoSy_Poster ≡ Object_type ⊓ ∃has_shape.{quadrilateral} ⊓ (∃has_colour.{white} ⊔ ∃has_colour.{light_grey}) ⊓ ¬(∃is_down.Scene) ⊓ ∃is_up.Scene
β4: CoSy_Seat ≡ Object_type ⊓ (∃has_colour.{orange} ⊔ ∃has_colour.{dark_orange}) ⊓ ∃is_down.Scene ⊓ ¬(∃is_up.Scene)
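As an illustration of how the β rules in Table 2 can be evaluated against the qualitative descriptors of a region, here is a minimal Python sketch; the rule encoding and the names (e.g. categorize) are illustrative assumptions, not the paper's implementation, which relies on DL reasoning.

```python
# Each rule: category name, admissible colours, required/forbidden locations,
# and (optionally) a required shape tag. Mirrors Table 2 (beta rules).
RULES = [
    ("Wall",        {"pale-yellow", "white", "light-grey"}, None,   None,   None),
    ("CoSy_Floor",  {"blue", "black"},                      "down", "up",   None),
    ("CoSy_Poster", {"white", "light-grey"},                "up",   "down", "quadrilateral"),
    ("CoSy_Seat",   {"orange", "dark-orange"},              "down", "up",   None),
]

def categorize(obj):
    """obj: dict with 'colour' (str), 'locations' (set of LRS labels) and
    optionally 'shape' (str). Returns all categories whose rule is satisfied."""
    matches = []
    for name, colours, required, forbidden, shape in RULES:
        if obj["colour"] not in colours:
            continue
        if required and required not in obj["locations"]:
            continue
        if forbidden and forbidden in obj["locations"]:
            continue
        if shape and obj.get("shape") != shape:
            continue
        matches.append(name)
    return matches

# A blue region located only in the lower part of the scene -> CoSy_Floor
print(categorize({"colour": "blue", "locations": {"down", "down_left"}}))
```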
Identifying Target Objects using Feature Detectors

In order to detect a 'target' object in an image, the Kn-QID approach uses the Speeded-Up Robust Features (SURF) invariant descriptor and detector (Bay et al. 2008), which was demonstrated to be the fastest detector in the literature. The matching algorithm selected is the Fast Library for Approximate Nearest Neighbours (FLANN) (Muja and Lowe 2009). Both the SURF and FLANN algorithms are implemented in the Open Computer Vision Library (OpenCV, http://www.opencv.org.cn/opencvdoc/2.3.1/html/).

According to the location of the matching features of the target object in the image, the Kn-QID approach determines whether they lie inside a region extracted by the QID approach by applying Jordan's curve theorem (Courant and Robbins 1996), and thus a correspondence between the target object and a region in the QID is established. In order to avoid false positives, the qualitative features of the objects are also considered in the matching process according to the domain knowledge given. For example, a fire extinguisher will not be matched to an object whose colour is not red. As a result, the region in the image corresponding to the target object is identified by its name.
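The following is a minimal OpenCV sketch of the SURF + FLANN matching step described above; it is not the paper's code. It assumes an OpenCV build with the non-free xfeatures2d module (e.g. opencv-contrib-python), hypothetical image paths, and typical parameter values (Hessian threshold, 0.7 ratio test) that are not taken from the paper.

```python
import cv2

# Hypothetical input images: the known 'target' object and the scene capture
target = cv2.imread("target_cup.png", cv2.IMREAD_GRAYSCALE)
scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# SURF lives in the non-free xfeatures2d module of opencv-contrib
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp_t, des_t = surf.detectAndCompute(target, None)
kp_s, des_s = surf.detectAndCompute(scene, None)

# FLANN matcher with a KD-tree index, as commonly used for SURF descriptors
index_params = dict(algorithm=1, trees=5)  # 1 = FLANN_INDEX_KDTREE
flann = cv2.FlannBasedMatcher(index_params, dict(checks=50))
matches = flann.knnMatch(des_t, des_s, k=2)

# Lowe's ratio test to discard ambiguous matches
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
        good.append(pair[0])

# Scene coordinates of the matched features; Kn-QID would then test whether
# these points fall inside a QID region (point-in-polygon / Jordan curve theorem)
points = [kp_s[m.trainIdx].pt for m in good]
print(f"{len(good)} good matches, first positions: {points[:5]}")
```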
Contextual Knowledge for Target Objects

Apart from the contextual knowledge, other semantics can be added to the target objects according to: (1) whether they have been detected by the feature detectors SURF+FLANN, (2) whether they are static or not, (3) whether they are easy to carry around the scenario (i.e. changed from one place to another frequently), (4) their unexpected locations in the scenario, (5) what they can be used for, etc. (see Table 3).

Table 3: Adding more semantics to target objects.
σ1: TargetObj_type ⊑ name
σ2: TargetObj_type ⊑ ∃is_matched.Object_type
σ4: TargetObj_type ⊑ ∃is_static
σ5: TargetObj_type ⊑ ∃is_easy_to_carry
σ6: TargetObj_type ⊑ ∃has_unexpected_location.Object_type
σ7: TargetObj_type ⊑ ∃has_utility.Affordance_type

Incorporating Spatial Logics

Some spatial logics are provided to the agent for scene interpretation (Table 4). First, a target object would have been moved accidentally if it is a static object, but easy to carry around, and it is not in its expected location in the scene. Similarly, a target object would have fallen down if it is static, easy to carry, and it is located on the floor. Moreover, a target object can be considered disappeared or stolen if it cannot be detected and it is static and not easy to carry around. It can also be deduced that the scene has suffered a small change when the target object has been moved accidentally or has fallen down, and that there is an alarm when the target object has disappeared.

Table 4: Applying Spatial Logics to Objects in the Scene.
δ1: TargetObj_type ⊑ moved ≡ ∃is_static ⊓ ∃is_easy_to_carry ⊓ ∃has_unexpected_location.Object_type
δ2: TargetObj_type ⊑ fall_down ≡ ∃is_static ⊓ ∃is_easy_to_carry ⊓ ∃completely_inside.CoSy_Floor
δ3: TargetObj_type ⊑ missed ≡ ∃is_static ⊓ ¬(∃is_easy_to_carry) ⊓ ¬(∃is_matched.Object_type)
δ4: Scene_type ⊑ is_changed ≡ ∃TargetObj_type ⊑ moved ⊔ ∃TargetObj_type ⊑ fall_down
δ5: Scene_type ⊑ has_alarm ≡ ∃TargetObj_type ⊑ missed
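To make the δ rules concrete, here is a minimal Python sketch of the inference step; the names (TargetObject, interpret_scene) and the boolean encoding are illustrative assumptions, not the paper's DL-based implementation.

```python
from dataclasses import dataclass

@dataclass
class TargetObject:
    name: str
    is_static: bool            # sigma_4: normally stays in place
    is_easy_to_carry: bool     # sigma_5: frequently carried around
    is_matched: bool           # sigma_2: detected by SURF+FLANN in the scene
    in_expected_location: bool
    inside_floor: bool         # topologically completely inside CoSy_Floor

def interpret(obj):
    """Apply the delta rules (Table 4) to one target object."""
    status = set()
    if obj.is_static and obj.is_easy_to_carry and not obj.in_expected_location:
        status.add("moved")        # delta_1
    if obj.is_static and obj.is_easy_to_carry and obj.inside_floor:
        status.add("fall_down")    # delta_2
    if obj.is_static and not obj.is_easy_to_carry and not obj.is_matched:
        status.add("missed")       # delta_3
    return status

def interpret_scene(targets):
    """Scene-level conclusions: a small change (delta_4) or an alarm (delta_5)."""
    statuses = {t.name: interpret(t) for t in targets}
    is_changed = any({"moved", "fall_down"} & s for s in statuses.values())
    has_alarm = any("missed" in s for s in statuses.values())
    return statuses, is_changed, has_alarm

cup = TargetObject("cup", is_static=True, is_easy_to_carry=True,
                   is_matched=True, in_expected_location=False, inside_floor=True)
print(interpret_scene([cup]))
# The cup is inferred as 'moved' and 'fall_down'; is_changed=True, has_alarm=False
```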
A Proof-of-Concept

The scenario can involve a mobile robot incorporating a camera, or an ambient intelligent system such as Interact@Cartesium at the Cartesium building (Universität Bremen), which provides intelligent door tags (computers) installed in the walls next to every office of the CoSy group (see Figure 1). These tags are computers which include video cameras that take pictures of the daily living environment. In such a scenario, an agent can be assigned a 'surveillance' task and given a 'target' object to control.

Figure 1: Scenarios: (a) Pioneer robot at UJI corridors, (b) Cartesium building (the arrows indicate video cameras).

An example of the results provided by the Kn-QID approach is shown in Figure 2. A digital image is segmented, and the relevant regions are extracted automatically and described qualitatively. The target object selected (the cup) is matched to object 20 using SURF and FLANN. After the matching, the corresponding semantic descriptors from the scene are obtained. The target object (the cup) is located inside object 30 in the scene, which is characterized from its qualitative descriptors as CoSy_Floor. Following the spatial logics provided, the agent infers that the cup has fallen down. Note that objects 17 and 18 are also categorized as CoSy seats.

Figure 2: Scene captured by the Interact@Cartesium: a target object, the cup (object 20 in the image), has fallen down (it is inside CoSy_Floor, object 30).

Discussion

The Kn-QID approach uses qualitative descriptors to describe objects which are 'unknown' to the system (i.e. not stored in memory, not seen before in a scenario) and domain knowledge to categorize objects without any previous training. The proposed approach can characterize regions of images as walls, floors, posters and seats under different illumination conditions and from different points of view. Qualitative colours are defined as intervals of HSL values, so different hue values obtain the same colour name, as do different lightness values. Moreover, the definitions for the categorization of objects can include different colour names, which is an advantage when the colour perceived by the cameras (robot webcam and Cartesium video cameras) lies close to the interval limits and is therefore approximated by different names (i.e. white and light grey for the walls, orange and dark orange for the couch). Furthermore, simple spatial logics have been defined to identify "not ok" situations, and a proof-of-concept experiment has been carried out successfully.

As future work, we intend to: (i) create a repository of target objects for benchmarking in this scenario; (ii) incorporate logics for explaining spatio-temporal changes in the scenes; (iii) generate descriptions in natural language for describing the scene, taking into account cognitive studies, for improving human-machine communication; and (iv) extend the semantics for characterizing objects in different kinds of scenarios (e.g. laboratories/classrooms/libraries or outdoor areas) for general indoor semantic localization of robots.

Acknowledgments

This work was supported by the European Commission through FP7 Marie Curie IEF actions under project COGNITIVE-AMI (GA 328763) and by the interdisciplinary Transregional Collaborative Research Center Spatial Cognition SFB/TR 8.

References

Bay, H.; Ess, A.; Tuytelaars, T.; and Van Gool, L. 2008. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3):346-359.
Courant, R., and Robbins, H. 1996. What Is Mathematics? An Elementary Approach to Ideas and Methods. Oxford University Press.
Egenhofer, M. J., and Franzosa, R. 1991. Point-set topological spatial relations. International Journal of Geographical Information Systems 5(2):161-174.
Falomir, Z.; Museros, L.; Gonzalez-Abril, L.; Escrig, M. T.; and Ortega, J. A. 2012. A model for qualitative description of images based on visual and spatial features. Comput. Vis. Image Underst. 116:698-714.
Falomir, Z.; Gonzalez-Abril, L.; Museros, L.; and Ortega, J. 2013a. Measures of similarity between objects from a qualitative shape description. Spat. Cogn. Comput. 13:181-218.
Falomir, Z.; Museros, L.; Gonzalez-Abril, L.; and Sanz, I. 2013b. A model for qualitative colour comparison using interval distances. Displays 34:250-257.
Felzenszwalb, P. F., and Huttenlocher, D. P. 2004. Efficient graph-based image segmentation. Int. J. Comput. Vis. 59(2):167-181.
Hernández, D. 1991. Relative representation of spatial knowledge: The 2-D case. In Mark, D. M., and Frank, A. U., eds., Cognitive and Linguistic Aspects of Geographic Space, NATO Advanced Studies Institute. Dordrecht: Kluwer. 373-385.
Lim, C. H., and Chan, C. S. 2012. A fuzzy qualitative approach for scene classification. In Proc. IEEE Int. Conf. on Fuzzy Systems, Brisbane, Australia, June 10-15, 2012, 1-8.
Muja, M., and Lowe, D. 2009. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, 331-340.
Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision 42(3):145-175.
Qayyum, Z. U., and Cohn, A. G. 2007. Image retrieval through qualitative representations over semantic features. In Proc. 18th British Machine Vision Conf., Warwick, UK, 610-619.
Quattoni, A., and Torralba, A. 2009. Recognizing indoor scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 413-420. Los Alamitos, CA, USA: IEEE Computer Society.