Aligned Scene Modeling of a Robot’s Vista Space – An Evaluation

Language-Action Tools for Cognitive Artificial Agents: Papers from the 2011 AAAI Workshop (WS-11-14)
Agnes Swadzba and Sven Wachsmuth
Applied Informatics, Faculty of Technology, Bielefeld University
Universitätsstraße 25, 33613 Bielefeld, Germany
{aswadzba, swachsmu}@techfak.uni-bielefeld.de
Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
Supporting structures like tables and cupboards are one kind of meaningful structure in indoor rooms. A robot needs to know these structures for a natural interaction with the human and the environment. As the bottom-up detection of such structures is a challenging problem, we propose to estimate potential supporting structures from a spatial description like "a bowl on the table". Since language and cognition schematize space in the same way, it is possible to estimate the representation of the space underlying a scene description. To do so, we introduce the aligned modeling approach, which consists of rules transforming a sequence of object relations into a set of trees and a methodology for grounding the abstract representation of the scene layout in the current perception using detectors for small movable objects and an extraction of planar surfaces. An analysis of 30 descriptions shows the robustness of our approach to a variety of description strategies and object detection errors.
Introduction
Spatial awareness and the ability to communicate about the
environment are key capabilities enabling an agent to perform day-to-day navigation tasks. In general, space can be
partitioned along the actions that are required to perceive
it (Kuipers 2000; Montello 1993). The perception of the
large-scale space involves locomotion of an agent while the
vista space can be explored quickly from a single viewpoint, possibly by moving the gaze. Applying this definition to a domestic environment, a complete apartment belongs to the large-scale space and single rooms or room parts to the vista space. Since the ability to navigate and to localize itself is essential for a mobile robot, much research has concentrated on modeling the robot's large-scale space, providing methods for Simultaneous Localization and Mapping
(SLAM) (Thrun, Burgard, and Fox 2000) and motion planning (Philippsen and Siegwart 2003). Analyzing percepts of
the immediate surroundings becomes relevant for a robot if
obstacles must be avoided, e. g., (Yuan et al. 2009), or an
unknown environment has to be explored, e. g., (Topp and
Christensen 2008). But a vista space analysis is even more
important in situations with agent-agent interaction as it is
the space where the interaction takes place. A collaborative interaction of the partners in and with the environment requires that both partners know where they are, what spatial structures they are talking about, and what scene elements they are going to manipulate. Our research focuses on the new area of 3D spatial modeling for human-robot interaction. A commonly used scenario where a robot is expected to acquire spatial knowledge interactively is the so-called "home tour" scenario (COGNIRON 2004). There, a robot is shown around an apartment and taught relevant scene information. In this paper we focus on situations where a room is introduced in detail through a description. We address the question of how relevant supporting structures can be inferred from a depiction and grounded visually, so that the resulting model is aligned to the describer's representation. From psychological research we know that space is encoded in the so-called situation model (Zwaan 1999). Therefore, spatial descriptions play an important role in exchanging concepts about the surroundings. As linguistic experiments suggest that cognition and language schematize the spatial world in the same way (Tversky and Lee 1998; Freksa 2008), language can be seen as a systematic framework for conveying space by selecting or neglecting certain aspects (Talmy 1983). This means that a hearer builds a model from a description which is similar to the model the speaker has built from perception (Waltz 1980). Further, the models of communication partners will be aligned if information about the surroundings is exchanged verbally (Pickering and Garrod 2004). This allows for smooth communication, as similar concepts are represented (Vasudevan et al. 2007). The problem with spatial descriptions is that they are underspecified (Waltz 1980), which means that an agent also needs sensory input to resolve ambiguities in the descriptions. Based on this requirement and the findings about the nature of spatial descriptions, we have proposed a computational model, called the aligned scene model, which enables a robot to infer supporting scene structures from a description and to ground them in its visual perception of the scene (Swadzba et al. 2009). This paper gives an overview of the aligned scene modeling approach for a robot's vista space and provides a detailed analysis of the models generated from a large collection of different descriptions with respect to description strategies and object detection errors.
Related work
This section discusses relevant work on inferring scene models from verbal descriptions, extracting intermediate scene
structures from visual perception, and integrating verbal
and visual scene interpretations. Early systems like the Spatial Analog Model transformed spatial prepositions like "the box is on the table, . . ." into a 3D box model satisfying gravity conditions (Boggess 1979). The Words-into-Pictures approach extends this by constructing objects
from a finite set of geometric primitives and incorporating constraints on orientation and position from object relations into a potential field (Olivier, Maeda, and Tsujii
1994). In contrast to this abstract modeling, recent robotic
methods provide meaningful labeling (e. g., as “wall”) of
3D points using top-down knowledge encoded in ontologies (Nüchter and Hertzberg 2008; Cantzler, Fisher, and
Devy 2002; Grau 1997) or bottom-up information from
point-based features (Munoz et al. 2009; Rusu et al. 2008;
Triebel et al. 2007). So far, the integration of verbal descriptions and visual perception has been studied with single caption words (Jamieson et al. 2010; Wang, Hoiem, and Forsyth 2009) or in simple table-top scenarios. Spatial relations and visual perceptions have been associated in a probabilistic manner using Bayesian inference processes (Wachsmuth and Sagerer 2002), stochastic layers (Mavridis and Roy 2006), or potential fields (Brenner et al. 2007).
The computational model
This section introduces the computational model for acquiring an aligned scene representation which reflects the tutor’s scene representation and the visual reality perceived as
bottom-up extracted 3D planar patches. An abstract representation is inferred from a scene description and grounded
in the visual perception utilizing object detection and surface
extraction results.
From verbal descriptions to set of trees
Studies on the nature of spatial descriptions have revealed
that spatial knowledge is organized hierarchically (Hirtle
and Jonides 1985). This is visible in the partitioning of space
into figures and spatial relations (Tversky and Lee 1998)
and the use of horizontal and vertical planes as reference
frames (Tversky, Kim, and Cohen 1999). As the spatial relations are a prominent part of spatial descriptions we have
conducted an empirical study to learn more about the nature
of these relations. In this study, 30 free-form descriptions of two rooms presented as pictures were collected (→ Figure 1). It can be assumed that depictions of true scenes (Henderson and Hollingworth 1999) are more realistic than descriptions of ersatz scenes (displays of arbitrarily arrayed objects). This allows us to use these descriptions to test our scene modeling on 3D data acquired from the same view of the scene from which the pictures were taken. A detailed analysis of the descriptions shows that spatial room structures like pieces of furniture or other room parts serve as crystallization points for room descriptions (Vorwerg and Rickheit 2009). Typical relations are "the lion is on the chair" or "a toy car is in front of the koala" (→ Figure 1(a)). Objects are put into relation with their supporting room structure or with other objects located on the same room structure. This can be seen quantitatively in Figure 1(c), which shows the usage frequencies of different relation types. The majority of the specified relations are relations between objects on the same super-ordinate structure (B) or relations between an object and the corresponding super-ordinate structure (A, and nearly never R). Therefore, we propose the following terminology:

Definition (orthogonal relation): a relation between an object and its super-ordinate structure (e. g., "a lion on the chair").

Definition (parallel relation): a relation between objects localized on the same structure (e. g., "a car in front of the koala", where both are on the table).

These definitions follow the basic physical fact that, due to gravity, objects are placed on tables, attached to walls, or contained in cupboards (Torralba, Russell, and Yuen 2009; Olivier, Maeda, and Tsujii 1994). The explicit and implicit references to super-ordinate structures reflect the hierarchical character of the underlying situation models, so that tree-like structures are a suitable representation for maintaining such models. Handling relations only based on their assignment to the orthogonal or parallel type allows a scene modeling that is independent of the exact view on the scene and the reference frames used. The challenges faced when transforming a description into a tree representation T are incompleteness, ambiguity, and disordered arrangement. To cope with these problems, we assume that descriptions are sequences of object relations. An existing tree set T_{t-1} is updated by the current relation to T_t according to certain rules. Figure 1(d) introduces our compact illustration of tree sets. The nodes correspond to scene items while the edges indicate the relations between them. Each parent node is the supporting structure for all its children. A tree set is updated by o = obj("o") → n_o, which inserts a node n_o into T for an object o with the label "o"; delete(n) deletes the node n from the tree set T; and child(n_o, n_p) inserts a directed edge from the parent node n_p to the child node n_o, expressing the supporting characteristic of the parent structure for the child object.
A parallel relation, rel_∥(o1, o2), between objects o1 and o2 is handled by the basic rule shown in Figure 2(a). Their supporting structure is inserted as a new node n_p, and n_o1 and n_o2 become its child nodes. If one of the related objects already has a parent, the exception rule shown in Figure 2(b) is applied.
An orthogonal relation, rel_⊥(o1, o2), is handled as shown in Figure 3(a). A directed edge between the nodes n_o1 and n_o2 implements the given relation. In the exceptional case, both trees or subtrees can be fused into one tree with n_o2 as the root node if the existing parent node has an empty label or "o2" as its label (see Figure 3(b)).
Figure 4 shows an example description and its tree set
representation resulting from the sequence-wise processing
of the description with the rules defined above.
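To make the rule set concrete, the following is a minimal Python sketch of how a sequence of parallel and orthogonal relations could update a tree set; the class and method names (TreeSet, rel_par, rel_orth) are illustrative choices, not the implementation used in the paper, and the exception cases are simplified.

```python
# Illustrative sketch of the tree-update rules; "" denotes a virtual (unnamed) patch.
class Node:
    def __init__(self, label=""):
        self.label = label
        self.parent = None
        self.children = []

class TreeSet:
    def __init__(self):
        self.nodes = {}       # object label -> Node (labels assumed distinct)
        self.virtual = []     # virtual parent nodes with empty labels

    def obj(self, label):
        """obj("o"): insert (or fetch) the node n_o for an object label."""
        return self.nodes.setdefault(label, Node(label))

    def child(self, n_o, n_p):
        """child(n_o, n_p): directed edge from parent n_p to child n_o."""
        n_o.parent = n_p
        n_p.children.append(n_o)

    def rel_par(self, o1, o2):
        """Parallel relation: o1 and o2 rest on the same supporting structure."""
        n1, n2 = self.obj(o1), self.obj(o2)
        if n1.parent is None and n2.parent is None:
            n_p = Node("")                 # basic rule: new virtual parent
            self.virtual.append(n_p)
            self.child(n1, n_p)
            self.child(n2, n_p)
        else:                              # exception rule: reuse the existing parent
            n_p = n1.parent or n2.parent
            for n in (n1, n2):
                if n.parent is None:
                    self.child(n, n_p)

    def rel_orth(self, o1, o2):
        """Orthogonal relation: o2 is the supporting structure of o1."""
        n1 = self.obj(o1)
        if n1.parent is None:
            self.child(n1, self.obj(o2))   # basic rule
        elif n1.parent.label in ("", o2):
            n1.parent.label = o2           # exception (simplified): name the virtual parent

# Example relations from the text: "the lion is on the chair",
# "a toy car is in front of the koala".
T = TreeSet()
T.rel_orth("lion", "chair")
T.rel_par("car", "koala")
```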
[Figure 1: (a) playroom scene S1; (b) living room scene S2; (c) histogram of relation-type usage frequencies; (d) tree notation]
Figure 1: The two scenes (a) S1 and (b) S2 used in the study. In two experiments, the subjects were asked to freely describe the photographs. (c) This histogram gives the usage frequency of all relations occurring across all acquired scene descriptions (Vorwerg and Rickheit 2009). (d) shows our compact notation for two trees. All children of a node are listed directly below the parent node.
Figure 2: Visualization of the rules handling parallel relations: (a) the basic parallel rule; (b) the exception from the parallel rule, handling the case that the node n_o1 already has a parent node n_p.
Figure 3: Visualization of the rules handling orthogonal relations: (a) the basic orthogonal rule; (b) the exception from the orthogonal rule, handling the case that the node n_o1 already has a parent node n_p.
Figure 4: Left: example description of the playroom given by subject 3(p) in the pilot study. The red-framed words mark the objects and the green-underlined words the relations; double underlining indicates a parallel relation and single underlining an orthogonal reference. Right: the resulting tree set T representing the description on the left.
Grounding the scene model in the 3D perception
This section presents how the set of trees resulting from
the computation presented in the previous section can be
mapped onto a 3D perception (e. g., a SwissRanger scan) of
the described scene. The challenges are ambiguous labels
like object categories (e. g., “soft toys”) and parent nodes
with an empty label, the so-called “virtual patches”. As
man-made environments mainly consist of planar surfaces
it is reasonable to estimate supporting structures as planar
patches. We propose to compute the patch parameters using the 3D locations of small movable objects attached to
them. We assume that detectors for small compact objects, which are not part of the spatial layout of a room (like "lion", "koala", . . .), are easier to train and detect more robustly than detectors for spatial structures like "chair" or "table". As the development of a robust object detection system is outside the scope of this work, we have manually labeled objects in the SwissRanger amplitude image, simulating the output of
an object detector (→ Figure 5(a)). The following algorithm
consists of four steps. First, the confirmed objects of a parent
node (distinct objects in the scene) are identified. Second,
these objects and the corresponding relation type are used to
estimate the patch parameters of the parent nodes. Third, the
estimated patch is used to resolve existing category labels.
And fourth, ambiguities and errors are handled.
Figure 5: Manually labeled objects and their corresponding 3D convex hulls: (a) manually drawn object boxes with labels; (b) 3D object hulls.
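As a rough illustration of how a labeled 2D box in the amplitude image could be turned into the 3D object representation of Figure 5(b), the sketch below collects the SwissRanger points inside the box, trims outliers, and keeps the convex hull and the 3D location; the array layout and the quantile-based trimming are assumptions, not the procedure used in the paper.

```python
import numpy as np
from scipy.spatial import ConvexHull

def object_from_box(points_xyz, box, keep_quantile=0.95):
    """Build a simple 3D object representation from a manually labeled 2D box.

    points_xyz: (H, W, 3) array of SwissRanger 3D points (assumed layout).
    box: (label, u0, v0, u1, v1) in image coordinates.
    """
    label, u0, v0, u1, v1 = box
    pts = points_xyz[v0:v1, u0:u1].reshape(-1, 3)
    pts = pts[np.isfinite(pts).all(axis=1)]            # drop invalid range readings

    # crude outlier trimming: keep points close to the median object position
    dist = np.linalg.norm(pts - np.median(pts, axis=0), axis=1)
    pts = pts[dist <= np.quantile(dist, keep_quantile)]

    return {"label": label,
            "hull": ConvexHull(pts),                   # 3D convex hull of the object
            "location": pts.mean(axis=0)}              # 3D location of the object
```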
(I) O, the set of objects known to the robot (→ Figure 5(a)), is divided for a certain description (→ Figure 4) into a set of confirmed objects O_con, which are those objects that have been mentioned in the description, and a set of potential objects O_pot holding the remaining objects.
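Step (I) is a plain partition of the robot's known objects by the labels mentioned in the description; a minimal sketch (the dictionary-based object representation is an assumption carried over from the sketch above):

```python
def partition_objects(known_objects, mentioned_labels):
    """Split the known object set O into confirmed objects O_con and potential objects O_pot."""
    o_con = [o for o in known_objects if o["label"] in mentioned_labels]
    o_pot = [o for o in known_objects if o["label"] not in mentioned_labels]
    return o_con, o_pot
```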
(II) For each parent node n_p ∈ T, its potential planar patch P^p_pot is computed using its distinct objects O^p_con ⊂ O_con. A planar patch is described by an orientation and a position in the global coordinate system:

P : n · x − d = 0.   (1)

Knowledge about the type of relation used to assign objects supports the computation of the patch orientation. A patch on which objects are located is estimated as a horizontal patch:

n_p = (0, 1, 0)^T.   (2)

A supporting structure containing objects (indicated by an in-relation) has a vertical orientation:

n_p = a_p × b_p,   (3)
where a_p = (0, 1, 0)^T and b_p is the orientation of a line fitted robustly via RANSAC through the object points projected onto the xz-plane (ground plane). If the relation between child and parent node is not known, the orientation of the 3D arrangement of the confirmed objects with respect to the ground plane determines the orientation of the potential patch (if < 45°: horizontal, else: vertical). The position of the patch is computed from the centroid c_p of the objects' 3D locations: d_p = n_p · c_p. Figure 6 visualizes a horizontal and a vertical potential patch.

Figure 6: Two potential patches: (a) a horizontal plane and (b) a vertical plane. A patch is represented by an ellipse, its normal vector (n_chair, n_" "), and its centroid (c_chair, c_" "). Each object is represented by its convex hull and its 3D location (l_lion, l_robot, l_raven, l_fred, l_pokemon).

(III) Finally, P^p_pot can be utilized to resolve possible category labels. Objects in O_pot having the specified category and a distance to the corresponding patch smaller than a given threshold are assigned to the potential patch. It is assumed that the human tutor has referred to these objects. Figure 7(a) shows the initial scene model of the example description in Figure 4. Patches for "corner", "table", "chair", "cupboard2", and "cupboard3", and two virtual patches with an empty label " " are estimated. The category label "soft toy on the table" is resolved to "kangaroo", "bear", "frog", and "stitch". The category label "games in cupboard3" is resolved to "games1" and "games2".

Figure 7: (a) The initial model: the potential planar patches {P^p_pot}_{p=1...7} computed for the tree set T shown in Figure 4. Patches are visualized as colored ellipses. (b) The final model: the scene model after correcting wrongly assigned objects and fusing virtual patches with named potential patches. Patches are fused if they share the same bottom-up patch.

(IV) The initial scene model contains errors (cupboard3 is too big) and virtual patches (patches with an empty label " ") due to errors in the resolution of category labels and underspecifications in the descriptions. Both problems can be solved by incorporating bottom-up extracted patches {P^i_real}_{i=1...m}.¹ A real patch P^i_real is assigned to a potential patch P^p_pot if they share a similar orientation and position in space. Several real patches can be assigned to one potential patch, and one real patch can be assigned to multiple potential patches. The matching type indicates the status of an initial scene model. If it contains only injective mappings from real patches to potential patches, the model is correct, while a constellation where one real patch P^i_real is assigned to several potential patches with competing labels, e. g., "p1" and "p2", indicates an object mismatch. To correct this, all objects of P^p1_pot and P^p2_pot are checked as to whether they are positioned in/on the real patch P^i_real. P^i_real is assigned to the potential patch holding the biggest percentage of objects lying in/on P^i_real. In the case where a bottom-up patch P^i_real is assigned to a set of potential patches of which exactly one patch has a non-empty label "p", all parent nodes can be merged into one node labeled with "p", removing the virtual patches.

¹ Planar patches can be extracted via region growing utilizing the conormality and coplanarity value of point pairs (Stamos and Allen 2002). Several runs of the RANdom SAmple Consensus (RANSAC) algorithm (Fischler and Bolles 1981) improve the initial regions. A detailed description of the planar patch extraction algorithm can be found in (Swadzba and Wachsmuth 2008).
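The patch estimation of step (II) and the distance test of step (III) can be sketched as follows. The y-axis is taken as the up direction, matching n_p = (0, 1, 0)^T; the RANSAC line fit is replaced by a plain least-squares (SVD) fit, and the distance threshold is an assumed value.

```python
import numpy as np

UP = np.array([0.0, 1.0, 0.0])

def estimate_patch(locations, relation):
    """Estimate a potential patch P: n·x - d = 0 from confirmed object locations.

    locations: (m, 3) array of object 3D locations; relation: 'on', 'in', or None.
    """
    c = locations.mean(axis=0)                       # centroid c_p of the objects
    if relation == 'on':                             # eq. (2): horizontal patch
        n = UP
    elif relation == 'in':                           # eq. (3): vertical patch
        # b_p: direction of the object points projected onto the xz (ground) plane;
        # a least-squares stand-in for the RANSAC line fit used in the paper
        xz = locations[:, [0, 2]] - c[[0, 2]]
        _, _, vt = np.linalg.svd(xz, full_matrices=False)
        b = np.array([vt[0, 0], 0.0, vt[0, 1]])
        n = np.cross(UP, b)
        n /= np.linalg.norm(n)
    else:
        # unknown relation: 45-degree rule on the tilt of the object arrangement
        # (assumes at least three confirmed objects)
        _, _, vt = np.linalg.svd(locations - c, full_matrices=False)
        normal = vt[-1]                              # smallest-variance direction
        tilt = np.degrees(np.arccos(abs(normal[1])))
        n = UP if tilt < 45.0 else normal
    return n, float(n @ c), c                        # d_p = n_p · c_p

def resolve_category(patch, potential_objects, threshold=0.15):
    """Step (III): pick potential objects of the named category that lie close to the
    patch; the threshold is an assumed value in metres."""
    n, d, _ = patch
    return [o for o in potential_objects
            if abs(float(n @ o["location"]) - d) < threshold]
```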
Evaluation
Figure 8 shows the generated aligned scene models based on the 30 descriptions collected in the studies. The room structures are represented by ellipses and are localized on level 1 in the tree sets T, since objects positioned in the leaves of the trees are used to estimate these structures. The histograms in Figure 9 show the distribution of the structure labels found in the models. A structure appearing in many models is a prominent element of the scenario. In general, no differences between the descriptions of the pilot study and the main study can be observed. Each model is aligned to the specific level of detail given in the description. Most of the descriptions are quite detailed, with many relations between objects and room structures. There are only a few descriptions where objects have simply been listed (compare models 1, 2, 5, and 7 in Figure 8(a), where nearly no spatial structures could be inferred). In the playroom scenario (→ Figure 9(a)), the "table" and the "chair" are the most dominant structures. Further, the shelves at the wall are also interesting structures. Half of the participants have given enough information to infer patches for "cupboard2" and "cupboard3". The individual reference to the shelves may be primed by their different colors, as the contrary effect is visible in the descriptions of the living room S2, where the shelves have the same color and are fused into the single structure "cupboard" (→ Figure 9(b)). The other meaningful structures in the living room are "table", "sofa", and "wall". In general, the system fails to compute a potential patch for a spatial structure (indicated by "Cannot compute potential patch for ...") when only the category of the objects located in resp. on the structure is known (as in "there are soft toys in the cupboard"). Without knowing at least one specific object, this category label cannot be resolved. Finally, it can be said that all generated models are an appropriate representation of the respective scenario.
Figure 8: This figure gives an overview of the resulting scene models for all descriptions collected in the pilot and the main study: (a) scene models generated from descriptions of scene S1 collected in the pilot study; (b) scene models generated from descriptions of scene S1 collected in the main study; (c) scene models generated from descriptions of scene S2 collected in the main study. The estimated supporting patches are visualized as ellipses. The convex hulls are displayed only for those objects that have been related to the corresponding scene structure. A few subjects just listed objects without relating them to each other (models 1, 2, 5, and 7 in the pilot study).

Figure 9: This figure shows the distribution of the structure labels appearing in the models generated for the descriptions acquired in our studies (pilot study: playroom (10 participants); main study: playroom, living room (10 participants) → 30 descriptions): (a) structure labels in playroom S1; (b) structure labels in living room S2. Each bar shows the number of models in which the corresponding structure label can be found.
deviation    conormal   coplanar
corner       6.01°      0 mm
table        0°         12 mm
chair        0°         302 mm
cupboard2    0.26°      22 mm
cupboard3    14.53°     10 mm

Figure 10: This figure shows the object set where small and large scaling and translation errors are applied to the 2D object boxes. Small errors mean that an overlap with the ground-truth box still exists, while large errors mean no overlap. Half of the objects are affected by errors. On the right, the aligned scene model computed on the erroneous object set is shown. The table gives the deviation (conormal: orientation; coplanar: position) of this model from the ground-truth model given in Figure 7(b).
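The error injection summarized in Figure 10 can be reproduced in spirit by perturbing the labeled 2D boxes before the 3D hulls are computed; the concrete magnitudes chosen below for "small" (still overlapping) and "large" (non-overlapping) errors are assumptions, not the values used in the experiment.

```python
import random

def perturb_boxes(boxes, error_fraction=0.5, seed=0):
    """Apply scaling, translation, or missing errors to roughly half of the 2D boxes.

    boxes: list of (label, u0, v0, u1, v1); returns a new, possibly shorter, list.
    """
    rng = random.Random(seed)
    out = []
    for label, u0, v0, u1, v1 in boxes:
        if rng.random() > error_fraction:
            out.append((label, u0, v0, u1, v1))        # unaffected object
            continue
        kind = rng.choice(["scale", "translate", "missing"])
        if kind == "missing":
            continue                                   # object not detected at all
        w, h = u1 - u0, v1 - v0
        if kind == "scale":                            # bounding box too small or too big
            s = rng.choice([0.5, 1.5])
            cu, cv = (u0 + u1) / 2.0, (v0 + v1) / 2.0
            u0, u1 = cu - s * w / 2.0, cu + s * w / 2.0
            v0, v1 = cv - s * h / 2.0, cv + s * h / 2.0
        else:                                          # small shift keeps overlap, large removes it
            du = rng.choice([0.3, 1.2]) * w * rng.choice([-1, 1])
            dv = rng.choice([0.3, 1.2]) * h * rng.choice([-1, 1])
            u0, u1, v0, v1 = u0 + du, u1 + du, v0 + dv, v1 + dv
        out.append((label, int(u0), int(v0), int(u1), int(v1)))
    return out
```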
The analysis of this large set of descriptions ensures that we have covered different description strategies. The strength of our approach is that for every description, room structures can be inferred and modeled across different scenes and appearances, and that the level of detail in the model is aligned to the given description, preserving resources.

As assuming perfect object detection is not realistic, we have also examined the influence of typical errors on the model formation process. Figure 10 shows an object set where typical errors like scaling (bounding box is too small or too big), translation (bounding box is misplaced), and missing errors (no bounding box available) have been applied to randomly selected objects. Half of the objects in the set of known objects are affected by the errors. The figure also presents the scene model estimated using this erroneous object set. The table in Figure 10 gives the deviation of this model from the ground-truth model in Figure 7(b). Overall, the patches are estimated robustly. "cupboard2", "table", and "corner" show only minor deviations from the original patches. The twisting of the "cupboard3" patch by 14.53° is acceptable since the overall position of the patch is correct. It is still possible to assign the correct real patch to "cupboard3" because, to account for noise, we allow a deviation of up to 30° between the potential patch normal and the real patch normal. Only the "chair" patch is too big and misplaced. If 3D data of the chair itself could be perceived, step (IV) would detect that the "robot" object is misplaced with respect to the chair, as this step handles mismatched objects arising from errors in resolving category labels. As the computation of the 3D object hulls incorporates the removal of outlier points, it can handle slightly misplaced object boxes. Further, the orientation of the horizontal patches is only influenced by the orientation of the camera, and the RANSAC-based approach for computing the vertical patches can deal with the remaining outliers. Also, missing objects can be compensated for if enough other objects are known for the corresponding parent node. To summarize, our method provides a robust estimation of the patch orientation that can handle object outliers. The computation of the patch position is more sensitive to noise. If enough correct objects are given, a wrong object position can be averaged out. But if only one object is associated with a parent node, a big translation or scaling error will corrupt the potential patch of this node. In general, the corruption affects the position and expansion of a node patch.
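For completeness, the following sketch computes the two deviation measures reported in Figure 10 and the orientation/position test used when assigning real patches to potential patches; the 30° tolerance comes from the text, while the coplanar reading (offset along the reference normal) and the position threshold are assumptions.

```python
import numpy as np

def patch_deviation(n_ref, d_ref, n_est, d_est):
    """Deviation between two planar patches n·x - d = 0 (unit normals assumed).

    Returns (conormal [deg], coplanar [mm]); coplanar is read here as the offset
    between the planes along the reference normal.
    """
    cos_a = np.clip(abs(float(np.dot(n_ref, n_est))), 0.0, 1.0)
    return float(np.degrees(np.arccos(cos_a))), 1000.0 * abs(d_est - d_ref)

def real_patch_matches(n_pot, d_pot, n_real, d_real,
                       max_angle_deg=30.0, max_offset_mm=300.0):
    """A real patch is a candidate for a potential patch if the normals deviate by
    less than ~30 degrees (tolerance from the text) and the planes are close in
    space (the offset threshold is an assumption)."""
    angle, offset = patch_deviation(n_pot, d_pot, n_real, d_real)
    return angle < max_angle_deg and offset < max_offset_mm
```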
Conclusion
In this paper we have analyzed the performance of the aligned scene modeling approach regarding its robustness to different room description strategies by analyzing a large set of descriptions. The computational model is based on the fact that the construction of spatial descriptions is driven by gravity, which means that the descriptions mainly consist of orthogonal relations between objects and their supporting structures and parallel relations between objects located on/in the same structure. Classifying the relations in a description into only these two types allows the design of compact rules that transform relations into an abstract tree representation independent of the order in which the relations are given. The evaluation shows that the strategies developed for grounding the abstract model in the perception of the scene are robust to missing data and object detection errors.

Acknowledgement. This work has been funded by the CRC 673 "Alignment in Communication".
References
Boggess, L. 1979. Spatial Operators in Natural Language Understanding: The Prepositions. In Southeast Regional Conference,
8–11.
Brenner, M.; Hawes, N.; Kelleher, J.; and Wyatt, J. 2007. Mediating Between Qualitative and Quantitative Representations for
Task-Orientated Human-Robot Interaction. In IJCAI, 2072–2077.
Cantzler, H.; Fisher, R. B.; and Devy, M. 2002. Improving Architectural 3D Reconstruction by Plane and Edge Constraining. In
BMVC, 43–52.
COGNIRON. 2004. The Cognitive Robot Companion. (FP6-IST-002020), http://www.cogniron.org.
Fischler, M. A., and Bolles, R. C. 1981. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the
ACM 24(6):381–395.
Freksa, C. 2008. Zur interdiziplinären Erforschung räumlichen
Denkens. Kognitive Psychologie – Ausgewählte Grundlagen- und
Anwendungsbeispiele 87–108.
Grau, O. 1997. A Scene Analysis System for the Generation of 3D
Models. In 3DIM, 221–228.
Henderson, J. M., and Hollingworth, A. 1999. High-level Scene
Perception. Annual Review of Psychology 50:243–271.
Hirtle, S. C., and Jonides, J. 1985. Evidence of Hierarchies in
Cognitive Maps. Memory & Cognition 13:208–217.
Jamieson, M.; Fazly, A.; Stevenson, S.; Dickinson, S.; and
Wachsmuth, S. 2010. Using Language to Learn Structured Appearance Models for Image Annotation. PAMI 32(1):148–164.
Kuipers, B. 2000. The Spatial Semantic Hierarchy. AI
119:191–233.
Mavridis, N., and Roy, D. 2006. Grounded Situation Models for
Robots: Where Words and Percepts Meet. In IROS, 4690–4697.
Montello, D. R. 1993. Scale and Multiple Psychologies of Space. In Spatial Information Theory: A Theoretical Basis for GIS, volume 716 of LNCS, 312–321.
Munoz, D.; Bagnell, J. A.; Vandapel, N.; and Hebert, M. 2009.
Contextual Classification with Functional Max-Margin Markov
Networks. In CVPR, 975–982.
Nüchter, A., and Hertzberg, J. 2008. Towards Semantic Maps for
Mobile Robots. RAS 56(11):915–926.
Olivier, P.; Maeda, T.; and Tsujii, J. 1994. Automatic Depiction of Spatial Descriptions. In NCAI, volume 2, 1405–1410.
Philippsen, R., and Siegwart, R. 2003. Smooth and Efficient Obstacle Avoidance for a Tour Guide Robot. In ICRA, volume 1, 446–451.
Pickering, M. J., and Garrod, S. 2004. Toward a Mechanistic Psychology of Dialogue. Behavioral and Brain Sciences 27(2):169–190.
Rusu, R. B.; Marton, Z. C.; Blodow, N.; and Beetz, M. 2008. Learning Informative Point Classes for the Acquisition of Object Model Maps. In ICARCV, 643–650.
Stamos, I., and Allen, P. K. 2002. Geometry and Texture Recovery of Scenes of Large Scale. CVIU 88(2):94–118.
Swadzba, A., and Wachsmuth, S. 2008. Categorizing Perceptions of Indoor Rooms using 3D Features. In Structural, Syntactic, and Statistical Pattern Recognition, volume 5342 of LNCS, 744–754.
Swadzba, A.; Vorwerg, C.; Wachsmuth, S.; and Rickheit, G. 2009. A Computational Model for the Alignment of Hierarchical Scene Representations in Human-Robot Interaction. In IJCAI, 1857–1863.
Talmy, L. 1983. How Language Structures Space. In Pick, H. L., Jr., and Acredolo, L. P., eds., Spatial Orientation: Theory, Research and Application, 225–282.
Thrun, S.; Burgard, W.; and Fox, D. 2000. A Real-time Algorithm for Mobile Robot Mapping with Applications to Multi-Robot and 3D Mapping. In ICRA, volume 1, 321–328.
Topp, E., and Christensen, H. 2008. Detecting Structural Ambiguities and Transitions during a Guided Tour. In ICRA, 2564–2570.
Torralba, A.; Russell, B. C.; and Yuen, J. 2009. LabelMe: Online Image Annotation and Applications. Technical report, MIT CSAIL.
Triebel, R.; Schmidt, R.; Mozos, Ó. M.; and Burgard, W. 2007. Instance-based AMN Classification for Improved Object Recognition in 2D and 3D Laser Range Data. In IJCAI, 2225–2230.
Tversky, B., and Lee, P. U. 1998. How Space Structures Language. In Spatial Cognition: An Interdisciplinary Approach to Representing and Processing Spatial Knowledge, volume 1404 of LNCS, 157–176.
Tversky, B.; Kim, J.; and Cohen, A. 1999. Mental Models of Spatial Relations and Transformations from Language. In Mental Models in Discourse Processing and Reasoning, 239–258.
Vasudevan, S.; Gächter, S.; Nguyen, V.; and Siegwart, R. 2007. Cognitive Maps for Mobile Robots – An Object based Approach. RAS 55(5):359–371.
Vorwerg, C., and Rickheit, G. 2009. Verbal Room Descriptions Reflect Hierarchical Spatial Models. In Annual Meeting of the American Psychological Society. Poster.
Wachsmuth, S., and Sagerer, G. 2002. Bayesian Networks for Speech and Image Integration. In AAAI Conf. on AI, 300–306.
Waltz, D. L. 1980. Understanding Scene Descriptions as Event Simulations. In Annual Meeting of the Association for Computational Linguistics, 7–11. Association for Computational Linguistics.
Wang, G.; Hoiem, D.; and Forsyth, D. 2009. Building Text Features for Object Image Classification. In CVPR, 1367–1374.
Yuan, F.; Swadzba, A.; Philippsen, R.; Engin, O.; Hanheide, M.; and Wachsmuth, S. 2009. Laser-based Navigation Enhanced with 3D Time-of-Flight Data. In ICRA, 2844–2850.
Zwaan, R. A. 1999. Situation Models: The Mental Leap into Imagined Worlds. Current Directions in Psychological Science 8(1):15–18.