Knowledge Representation and Reasoning in Robotics: Papers from the AAAI Spring Symposium
An Approach for Scene Interpretation
Using Qualitative Descriptors, Semantics, and Domain Knowledge
Zoe Falomir *
Universität Bremen (Germany)
Abstract

An approach for cognitive scene interpretation is proposed based on qualitative descriptors, domain knowledge and spatial logics. Qualitative descriptors of shape, colour, topology and location are used for describing any object in the scene. Two kinds of domain knowledge are provided: (i) categorizations of objects according to their qualitative descriptors, and (ii) information about target objects in the scenario. Spatial logics are defined for reasoning about the obtained semantics. A proof of concept is presented in the Interact@Cartesium scenario.
Introduction
In an envisaged future where intelligent robots populate our world, we imagine an everyday phone conversation with the robot left at home, asking “Is everything fine?”, which the robot happily confirms. For an intelligent agent to be able to reach such a conclusion, the robot needs to observe its environment and interpret the information perceived. This task thus reaches far beyond object recognition. Any environment, in particular one populated by humans, is subject to change: cups, toys, or pieces of furniture are frequently moved around; flowers or decorations are put up or discarded; interior furnishings get overhauled once in a while. Due to the variety of possible changes, it is infeasible to use machine learning methods to pre-learn every arrangement that could be considered “normal”. Instead, scene understanding is required to interpret observations sensibly. This requires considering the context to interpret the meaning of a feature occurrence in the variety of possible situations, and also reasoning to reach a conclusion on the basis of a manageable knowledge base that can differentiate normal from abnormal situations. In order to create such a system, perception needs to be integrated with abstract reasoning, which, in its full extent, is amongst the most long-standing and fundamental problems in AI research.
Companion robots and ambient intelligent systems interact with human beings. Those systems usually integrate digital cameras from which they obtain information about the environment. The ideal systems for interacting with people would be those capable of interpreting their environment cognitively, that is, similarly to how people do it. In this way, those systems may align their concepts with human concepts and thus provide common ground for establishing sophisticated communication.

Because digital images represent visual data numerically, most image processing has been successfully carried out by applying mathematical techniques to obtain and describe image content. However, there is a growing trend of works in the literature which extract qualitative/semantic information from images of real scenes and use this information to group images into categories. Some approaches described images semantically using only a word/concept (Oliva and Torralba 2001; Quattoni and Torralba 2009), and other approaches described more than one component semantically (Qayyum and Cohn 2007; Lim and Chan 2012). All these approaches provide evidence for the effectiveness of using qualitative/semantic information to describe images. However, very few approaches describe real images semantically as a set of components arranged in space. Qualitative Image Descriptions (QIDs) based on visual and spatial features (Falomir et al. 2012) showed high adaptability to different real-world scenarios; here they are extended with semantics, domain knowledge and spatial logics (Kn-QID). The domain knowledge provided allows the system to categorize unknown objects according to their qualitative descriptors, but also to identify target objects and provide semantics describing their affordances, mobility, etc. Spatial logics are defined for reasoning about the obtained semantics, and a proof of concept is presented which identifies a change in the scene which is “not ok” and involves an action to be performed by the robot.
Qualitative Image Descriptors (QIDs)
The QID approach (Falomir et al. 2012) applies a colour segmentation method (Felzenszwalb and Huttenlocher 2004) and then extracts the closed boundary of the relevant regions detected within a digital image. Each object/region extracted is described qualitatively in terms of its shape and its colour. The spatial object description is composed of a topological description and an orientation description.
Thus, the complete image is described as a set of qualitative descriptors of objects as: [[QSD1, QCD1, Topology1, Location1], ..., [QSDn, QCDn, Topologyn, Locationn]], where n is the total number of objects.

Topological Description
In order to represent the topological relationships of the objects in the image, the intersection model for region configurations in ℝ² (Egenhofer and Franzosa 1991) is used, which describes the topological situation in space (invariant under translation, rotation and scaling) of an object A with respect to (wrt) another object B (A wrt B) as: TL_AB = {disjoint, touching, completely_inside, container}.

TL_AB determines if an object is completely inside another or if it is the container of another object. It also defines the neighbours of an object as all the other objects with the same container, which can be (i) disjoint from the object, if they do not have any edge or vertex in common; or (ii) touching the object, if they have at least one vertex or edge in common or if the Euclidean distance between them is smaller than a certain threshold set by experimentation.
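As an illustration, the following Python sketch (not the paper's implementation) computes the TL_AB label of one region wrt another, assuming the regions are available as shapely polygons; the distance threshold is a placeholder for the experimentally set value.

    # Minimal sketch of the four TL_AB relations over shapely polygons.
    from shapely.geometry import Polygon

    TOUCH_THRESHOLD = 2.0  # pixels; set by experimentation in the paper

    def topology_relation(a: Polygon, b: Polygon) -> str:
        """Return the TL_AB label of region a with respect to region b."""
        if a.within(b):
            return "completely_inside"
        if a.contains(b):
            return "container"
        # Regions touch if they share an edge/vertex or are closer
        # than the distance threshold.
        if a.touches(b) or a.distance(b) < TOUCH_THRESHOLD:
            return "touching"
        return "disjoint"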
Qualitative Shape Description (QSD)
After analysing the slope of the pixels in the object boundary, the relevant points of the shape are extracted. Each of these points ({P0, P1, ..., PN}) is described by four features:

• the Edge Connection (EC) occurring at P, described as: {line-line, line-curve, curve-line, curve-curve, curvature-point};
• Angle (A) at the relevant point P (which is not a curvature point), described by the qualitative tags: {very acute, acute, right, obtuse, very obtuse};
• Type of Curvature (TC) at the relevant point P (which is a curvature point), described qualitatively by the tags: {very acute, acute, semicircular, plane, very plane};
• Compared Length (L) of the two edges connected by P, described qualitatively by: {much shorter (msh), half length (hl), a bit shorter (absh), similar length (sl), a bit longer (abl), double length (dl), much longer (ml)};
• Convexity (C) at the relevant point P, described as: {convex, concave}.

Thus, the complete shape of an object is described as a set of qualitative descriptions of points as: [[EC1, A1 | TC1, L1, C1], ..., [ECm, Am | TCm, Lm, Cm]], where m is the number of relevant points in the object (Falomir et al. 2013a).
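By way of illustration, a minimal Python sketch of how the A and L tags could be assigned is given below; the numeric boundaries are assumptions chosen for illustration, since the paper does not list them here.

    def angle_tag(angle_deg: float) -> str:
        """Qualitative tag for the angle at a non-curvature relevant point."""
        if angle_deg < 40:   return "very_acute"
        if angle_deg < 85:   return "acute"
        if angle_deg <= 95:  return "right"
        if angle_deg < 140:  return "obtuse"
        return "very_obtuse"

    def compared_length_tag(edge1: float, edge2: float) -> str:
        """Qualitative tag comparing the two edge lengths joined at P."""
        ratio = edge1 / edge2
        if ratio < 0.4:  return "much_shorter"
        if ratio < 0.6:  return "half_length"
        if ratio < 0.9:  return "a_bit_shorter"
        if ratio <= 1.1: return "similar_length"
        if ratio < 1.9:  return "a_bit_longer"
        if ratio < 2.5:  return "double_length"
        return "much_longer"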
Location Description

To obtain the location of an object A wrt its container, or the location of an object A wrt an object B that is a neighbour of A, a sector-based model (Hernández 1991) is used. This Location Reference System (LRS) divides the space into nine regions: LRS_LAB = {up, down, left, right, up_left, up_right, down_left, down_right, centre}.

To obtain the location of each object wrt another or wrt the image, the centre of the LRS is placed on the centroid of the reference object and its up area is fixed towards the upper edge of the image. The location of an object is the union of all the location labels obtained for each of the relevant points of the shape of the object. If an object is located in all the regions of the LRS, it is considered to be in the centre.
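The following sketch illustrates one way to assign the nine LRS labels, reading the sectors off axis-aligned comparisons against the reference centroid; this is a simplifying assumption (Hernández's model uses angular sectors), and the tolerance band is a placeholder.

    def location_label(px, py, cx, cy, tol=5.0):
        """Label point (px, py) wrt a reference centroid (cx, cy).
        Note: image y grows downwards, so smaller y means 'up'."""
        horiz = "left" if px < cx - tol else "right" if px > cx + tol else ""
        vert = "up" if py < cy - tol else "down" if py > cy + tol else ""
        if vert and horiz:
            return vert + "_" + horiz
        return vert or horiz or "centre"

    def object_location(points, centroid):
        """Union of labels over the object's relevant points."""
        labels = {location_label(x, y, *centroid) for (x, y) in points}
        # An object spread over all nine sectors is considered 'centre'.
        return {"centre"} if len(labels) == 9 else labels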
Qualitative Colour Description (QCD)
The Red, Green and Blue (RGB) colour channels are translated into Hue, Saturation and Lightness (HSL) coordinates, and from these a Qualitative Colour Reference System (QCRS) is defined, where QC_LAB1..5 refers to the qualitative labels related to colour and QC_INT1..5 refers to the intervals of HSL colour coordinates associated with each colour label:
QC_LAB1 = {black, dark-grey, grey, light-grey, white}
QC_INT1 = {[0, 20), [20, 30), [30, 40), [40, 80), [80, 100) ∈ L / ∀ H ∧ S ∈ [0, 20]}
QC_LAB2 = {red, orange, yellow, green, turquoise, blue, purple, pink}
QC_INT2 = {(335, 360] ∧ [0, 15], (15, 40], (40, 80], (80, 160], (160, 200], (200, 260], (260, 297], (297, 335] ∈ H / S ∈ (50, 100] ∧ L ∈ (40, 55]}
QC_LAB3 = {pale- + QC_LAB2}
QC_INT3 = {∀ H_INT2 / S ∈ (20, 50] ∧ L ∈ (40, 55]}
QC_LAB4 = {light- + QC_LAB2}
QC_INT4 = {∀ H_INT2 / S ∈ (50, 100] ∧ L ∈ (55, 100]}
QC_LAB5 = {dark- + QC_LAB2}
QC_INT5 = {∀ H_INT2 / S ∈ (50, 100] ∧ L ∈ (20, 40]}
The QCRS depends on the vision system, but it is adaptable to other systems and/or scenarios by defining other colour tags and/or other HSL values (Falomir et al. 2013b).
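To illustrate, a minimal Python sketch of the resulting colour naming is given below, assuming H ∈ [0, 360] and S, L ∈ [0, 100]; coordinate combinations not covered by the intervals above (e.g. low saturation with lightness outside (40, 55]) are approximated to the nearest defined label.

    GREY_NAMES = ["black", "dark-grey", "grey", "light-grey", "white"]
    GREY_BOUNDS = [20, 30, 40, 80, 101]           # upper L bounds, per QC_INT1
    HUE_NAMES = ["red", "orange", "yellow", "green", "turquoise",
                 "blue", "purple", "pink"]
    HUE_BOUNDS = [15, 40, 80, 160, 200, 260, 297, 335]  # per QC_INT2; >335 wraps to red

    def colour_name(h: float, s: float, l: float) -> str:
        if s <= 20:                               # achromatic: label by lightness
            return next(n for n, b in zip(GREY_NAMES, GREY_BOUNDS) if l < b)
        hue = next((n for n, b in zip(HUE_NAMES, HUE_BOUNDS) if h <= b), "red")
        if 40 < l <= 55:
            return hue if s > 50 else "pale-" + hue
        return ("light-" if l > 55 else "dark-") + hue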
Generating Semantics from the QIDs

From the qualitative descriptors obtained, Description Logic (DL) axioms are generated which describe images (Table 1, α rules), and for each scene, facts are obtained (Table 1, γ rules).

Table 1: (a) Excerpt of the reference conceptualization of the objects in the images and (b) basic image facts for a scene.

α1: Object_type ⊑ ∃has_shape.QSD_type
α2: Object_type ⊑ ∃has_colour.QCD_type
α3: Object_type ⊑ ∃has_location.Scene_type
α4: Object_type ⊑ ∃is_touching.Object_type
γ1: Scene_type : scene1
γ2: Object_type : object1
γ3: Object_type ⊑ {object1, object2, object3, ..., objectn}
γ4: object1 ≉ object2 ≉ ... ≉ objectn
γ5: is_container(scene1, object1)
γ6: is_container(scene1, object2)
γ7: is_container(scene1, object3)
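As a sketch of this step (an illustration, not the paper's generator), the γ facts of Table 1 could be emitted from a scene's QID as follows, assuming each object is represented as a dictionary of descriptors that records its container:

    def scene_facts(scene_id: str, qid: list) -> list:
        """Emit Table 1 gamma-style facts for one scene."""
        facts = [f"Scene_type : {scene_id}"]
        names = [f"object{i + 1}" for i in range(len(qid))]
        facts += [f"Object_type : {n}" for n in names]
        # Pairwise distinctness of objects (rule gamma-4).
        facts.append(" ≉ ".join(names))
        # Containment facts (rules gamma-5..7) for objects whose
        # container is the scene itself.
        facts += [f"is_container({scene_id}, {n})"
                  for n, obj in zip(names, qid)
                  if obj.get("container") == scene_id]
        return facts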
Incorporating Domain Knowledge
The domain knowledge included in this approach consists of: (i) logic definitions to categorize objects based on their qualitative descriptors, (ii) images of ‘target’ objects which are ‘known’ by the agent and can be detected using feature-invariant detectors, and (iii) semantics for target objects regarding their use, dynamics, expected location, etc.
Contextual Knowledge for Object Categorization

Taking these semantics into account, contextualized knowledge for the domain has been defined for characterizing: (1) the wall; (2) the floor at the CoSy lab; (3) a poster at the CoSy lab; and (4) a seat at the CoSy lab. For example, an object may be categorized as floor at the CoSy lab if it is blue or black and it is located down in the scene and not up (see Table 2).
Table 2: Domain Knowledge for Object Categorization.

β1: Wall ≡ Object_type ⊓ (∃has_colour.{pale_yellow} ⊔ ∃has_colour.{white} ⊔ ∃has_colour.{light_grey})
β2: CoSy_Floor ≡ Object_type ⊓ (∃has_colour.{blue} ⊔ ∃has_colour.{black}) ⊓ ∃is_down.Scene ⊓ ¬(∃is_up.Scene)
β3: CoSy_Poster ≡ Object_type ⊓ ∃has_shape.{quadrilateral} ⊓ (∃has_colour.{white} ⊔ ∃has_colour.{light_grey}) ⊓ ¬(∃is_down.Scene) ⊓ ∃is_up.Scene
β4: CoSy_Seat ≡ Object_type ⊓ (∃has_colour.{orange} ⊔ ∃has_colour.{dark_orange}) ⊓ ∃is_down.Scene ⊓ ¬(∃is_up.Scene)
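For illustration, the β2 and β4 definitions could be checked procedurally as in the sketch below, assuming each object is a dictionary holding the qualitative colour name and the set of location labels from its QID (field names are assumptions):

    def is_down_not_up(obj: dict) -> bool:
        """Down in the scene and not up, as required by beta-2 and beta-4."""
        return "down" in obj["location"] and "up" not in obj["location"]

    def is_cosy_floor(obj: dict) -> bool:
        return obj["colour"] in {"blue", "black"} and is_down_not_up(obj)

    def is_cosy_seat(obj: dict) -> bool:
        return obj["colour"] in {"orange", "dark-orange"} and is_down_not_up(obj)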
Identifying Target Objects using Feature Detectors
In order to detect a ‘target’ object in an image, the Kn-QID approach uses the Speeded-Up Robust Features (SURF) invariant descriptor and detector (Bay et al. 2008), which has been shown to be one of the fastest detectors in the literature. The matching algorithm selected is the fast approximate nearest neighbours (FLANN) algorithm (Muja and Lowe 2009). Both the SURF and FLANN algorithms are implemented in the Open Computer Vision Library (OpenCV, http://www.opencv.org.cn/opencvdoc/2.3.1/html/).

According to the location of the matching features of the target object in the image, the Kn-QID approach determines whether they lie inside a region extracted by the QID approach by applying Jordan's curve theorem (Courant and Robbins 1996), and thus a correspondence between the target object and a region in the QID is established. In order to avoid false positives, the qualitative features of the objects are also considered in the matching process according to the domain knowledge given. For example, a fire extinguisher will not be matched to an object whose colour is not red. As a result, the region in the image corresponding to the target object is identified by its name.
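A minimal OpenCV sketch of this detection step is given below; it assumes opencv-contrib-python with the non-free modules enabled (SURF is patented), and the image file names are hypothetical.

    import cv2

    # SURF detector; the Hessian threshold is a typical default, not
    # necessarily the paper's setting.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    target = cv2.imread("target_cup.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
    scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)        # hypothetical file

    kp_t, des_t = surf.detectAndCompute(target, None)
    kp_s, des_s = surf.detectAndCompute(scene, None)

    # FLANN with KD-trees, the standard configuration for float descriptors.
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des_t, des_s, k=2)

    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]

    # Scene coordinates of the matched features; the Kn-QID approach then
    # tests which QID region contains them (Jordan's curve theorem).
    points = [kp_s[m.trainIdx].pt for m in good]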
Contextual Knowledge for Target Objects

Apart from the contextual knowledge, other semantics can be added to the target objects according to: (1) whether they have been detected by the SURF+FLANN feature detectors, (2) whether they are static or not, (3) whether they are easy to carry around the scenario (changed from one place to another frequently), (4) their unexpected locations in the scenario, and (5) what they can be used for, etc. (see Table 3).

Table 3: Adding more semantics to target objects.

σ1: TargetObj_type ⊑ name
σ2: TargetObj_type ⊑ ∃is_matched.Object_type
σ4: TargetObj_type ⊑ ∃is_static
σ5: TargetObj_type ⊑ ∃is_easy_to_carry
σ6: TargetObj_type ⊑ ∃has_unexpected_location.Object_type
σ7: TargetObj_type ⊑ ∃has_utility.Affordance_type

Incorporating Spatial Logics

Some spatial logics are provided to the agent for scene interpretation (Table 4). First, a target object is considered to have been moved accidentally if it is a static object, but easy to carry around, and it is not in its expected location in the scene. Similarly, a target object is considered to have fallen down if it is static, easy to carry, and it is located on the floor. Moreover, a target object may be considered disappeared or stolen if it cannot be detected and it is static and not easy to carry around. It can also be deduced that the scene has suffered a small change when a target object has been moved accidentally or has fallen down, and that there is an alarm when a target object has disappeared.

Table 4: Applying Spatial Logics to Objects in the Scene.

δ1: TargetObj_type ⊑ moved ≡ ∃is_static ⊓ ∃is_easy_to_carry ⊓ ∃has_unexpected_location.Object_type
δ2: TargetObj_type ⊑ fall_down ≡ ∃is_static ⊓ ∃is_easy_to_carry ⊓ ∃completely_inside.CoSy_Floor
δ3: TargetObj_type ⊑ missed ≡ ∃is_static ⊓ ¬(∃is_easy_to_carry) ⊓ ¬(∃is_matched.Object_type)
δ4: Scene_type ⊑ is_changed ≡ ∃TargetObj_type ⊑ moved ⊔ ∃TargetObj_type ⊑ fall_down
δ5: Scene_type ⊑ has_alarm ≡ ∃TargetObj_type ⊑ missed
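As a sketch, the δ rules can be read as plain boolean checks over a target-object record; the field names below are assumptions mirroring the σ semantics of Table 3.

    def moved(t):
        return t["static"] and t["easy_to_carry"] and t["unexpected_location"]

    def fall_down(t):
        return t["static"] and t["easy_to_carry"] and t["inside"] == "CoSy_Floor"

    def missed(t):
        return t["static"] and not t["easy_to_carry"] and not t["matched"]

    def interpret(target):
        if missed(target):
            return "alarm"             # delta-5: object disappeared/stolen
        if moved(target) or fall_down(target):
            return "scene_changed"     # delta-4: small change in the scene
        return "ok"

    # Example: a cup detected inside CoSy_Floor triggers 'scene_changed'.
    cup = {"static": True, "easy_to_carry": True,
           "unexpected_location": False, "inside": "CoSy_Floor", "matched": True}
    assert interpret(cup) == "scene_changed"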
A Proof-of-Concept
The scenario can involve a mobile robot incorporating a camera or an ambient intelligent system such as Interact@Cartesium at the Cartesium building (Universität Bremen), which provides intelligent door tags (computers) installed in the walls next to every office of the CoSy group (see Figure 1). These tags are computers equipped with video cameras which take pictures of the daily living environment. In such a scenario, an agent can be assigned a ‘surveillance’ task and given a ‘target’ object to control.
Figure 1: Scenarios: (a) Pioneer robot at UJI corridors; (b) Cartesium building (the arrows indicate video cameras).
An example of the results provided by the Kn-QID approach is shown in Figure 2. A digital image is segmented, and the relevant regions are extracted automatically and described qualitatively. The target object selected (the cup) is matched to object 20 using SURF and FLANN. After the matching, the corresponding semantic descriptors from the scene are obtained: the target object (the cup) is located inside object 30 in the scene, which is characterized from its qualitative descriptors as CoSy_Floor. Following the spatial logics provided, the agent infers that the cup has fallen down. Note that objects 17 and 18 are also categorized as CoSy_Seats.

Figure 2: Scene captured by Interact@Cartesium: a target object cup (object 20 in the image) has fallen down (it is inside the CoSy_Floor, object 30).

Discussion

The Kn-QID approach uses qualitative descriptors to describe objects which are ‘unknown’ to the system (i.e. not stored in memory, not seen before in a scenario) and domain knowledge to categorize objects without any previous training.

The proposed approach can characterize regions of images as walls, floors, posters and seats, under different illumination conditions and from different points of view. Qualitative colours are defined as intervals of HSL values, so different hue values get the same colour name, as do different lightness values. Moreover, the definitions for the categorization of objects can include different colour names, which is an advantage when the colour perceived by the cameras (robot webcam and Cartesium video cameras) falls near the limits of the intervals and is therefore given different names (i.e. white and light grey for walls, orange and dark orange for the couch).

Furthermore, simple spatial logics have been defined to identify ‘not ok’ situations, and a proof-of-concept experiment has been carried out successfully.

As future work, we intend to: (i) create a repository of target objects for benchmarking in this scenario; (ii) incorporate logics for explaining spatio-temporal changes in the scenes; (iii) generate natural-language descriptions of the scene, taking into account cognitive studies, for improving human-machine communication; and (iv) extend the semantics for characterizing objects in different kinds of scenarios (e.g. laboratories/classrooms/libraries or outdoor areas) for general robot indoor semantic localization.
Acknowledgments
This work was supported by the European Commission through FP7 Marie Curie IEF actions under project COGNITIVE-AMI (GA 328763) and by the interdisciplinary Transregional Collaborative Research Center Spatial Cognition SFB/TR 8.
References
Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. 2008. Speeded-up
robust features (SURF). Comput. Vis. Image Underst. 110(3):346–
359.
Courant, R., and Robbins, H. 1996. What Is Mathematics? An
Elementary Approach to Ideas and Methods. Oxford University
Press.
Egenhofer, M. J., and Franzosa, R. 1991. Point-set topological spatial relations. International Journal of Geographical Information
Systems 5(2):161–174.
Falomir, Z., Museros, L., Gonzalez-Abril, L., Escrig, M. T., and
Ortega, J. A. 2012. A model for qualitative description of images
based on visual and spatial features. Comput. Vis. Image Underst.
116:698–714.
Falomir, Z., Gonzalez-Abril, L., Museros, L., and Ortega, J. 2013a.
Measures of similarity between objects from a qualitative shape
description. Spat. Cogn. Comput. 13:181–218.
Falomir, Z., Museros, L., Gonzalez-Abril, L., and Sanz, I. 2013b.
A model for qualitative colour comparison using interval distances.
Displays 34:250–257.
Felzenszwalb, P. F., and Huttenlocher, D. P. 2004. Efficient graph-based image segmentation. Int. J. Comput. Vis. 59(2):167–181.
Hernández, D. 1991. Relative representation of spatial knowledge:
The 2-D case. In Mark, D. M., and Frank, A. U., eds., Cognitive and Linguistic Aspects of Geographic Space, NATO Advanced
Studies Institute. Dordrecht: Kluwer. 373–385.
Lim, C. H., and Chan, C. S. 2012. A fuzzy qualitative approach
for scene classification. In Proc. IEEE Int. Conf. on Fuzzy Systems,
Brisbane, Australia, June 10-15, 2012, 1–8.
Muja, M., and Lowe, D. 2009. Fast approximate nearest neighbors
with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, 331–340.
Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene:
A holistic representation of the spatial envelope. Int. J. Comput.
Vision 42(3):145–175.
Qayyum, Z. U., and Cohn, A. G. 2007. Image retrieval through
qualitative representations over semantic features. In Proc. 18th
British Machine Vision Conf, Warwick, UK, 610–619.
Quattoni, A., and Torralba, A. 2009. Recognizing indoor scenes.
In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 413–420. Los Alamitos, CA, USA:
IEEE Computer Society.
* Correspondence to: Zoe Falomir, Cognitive Systems (CoSy), FB3 - Informatics, Universität Bremen, P.O. Box 330 440, 28334 Bremen, Germany. E-mail: zfalomir@informatik.uni-bremen.de.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.