Language-Action Tools for Cognitive Artificial Agents: Papers from the 2011 AAAI Workshop (WS-11-14)

Aligned Scene Modeling of a Robot's Vista Space – An Evaluation

Agnes Swadzba and Sven Wachsmuth
Applied Informatics, Faculty of Technology, Bielefeld University
Universitätsstraße 25, 33613 Bielefeld, Germany
{aswadzba, swachsmu}@techfak.uni-bielefeld.de

Abstract

Supporting structures such as tables and cupboards are one kind of meaningful structure in indoor rooms. A robot needs to know these structures for a natural interaction with the human and the environment. As bottom-up detection of such structures is a challenging problem, we propose to estimate potential supporting structures from a spatial description like "a bowl on the table". As language and cognition schematize space in the same way, it is possible to estimate the representation of the space underlying a scene description. To do so, we introduce the aligned modeling approach, which consists of rules transforming a sequence of object relations into a set of trees and a methodology to ground the abstract representation of the scene layout in the current perception using detectors for small movable objects and an extraction of planar surfaces. An analysis of 30 descriptions shows the robustness of our approach to a variety of description strategies and object detection errors.
Introduction

Spatial awareness and the ability to communicate about the environment are key capabilities enabling an agent to perform day-to-day navigation tasks. In general, space can be partitioned along the actions that are required to perceive it (Kuipers 2000; Montello 1993). The perception of the large-scale space involves locomotion of an agent, while the vista space can be explored quickly from a single view point by merely moving the gaze. Applying this definition to a domestic environment, a complete apartment belongs to the large-scale space and single rooms or room parts to the vista space. Since the ability to navigate and to localize itself is essential for a mobile robot, much research has concentrated on modeling the robot's large-scale space, providing methods for Simultaneous Localization and Mapping (SLAM) (Thrun, Burgard, and Fox 2000) and motion planning (Philippsen and Siegwart 2003). Analyzing percepts of the immediate surroundings becomes relevant for a robot if obstacles must be avoided, e.g., (Yuan et al. 2009), or an unknown environment has to be explored, e.g., (Topp and Christensen 2008). But a vista space analysis is even more important in situations with agent-agent interaction, as it is the space where the interaction takes place. A collaborative interaction of the partners in and with the environment requires that both partners know where they are, what spatial structures they are talking about, and what scene elements they are going to manipulate.

Our research focuses on the new area of 3D spatial modeling for human-robot interaction. A commonly used scenario where a robot is expected to acquire spatial knowledge interactively is the so-called "home tour" scenario (COGNIRON 2004). There, a robot is shown around an apartment and taught relevant scene information. In this paper we focus on situations where a room is introduced in detail through a description. We address the question of how relevant supporting structures can be inferred from a depiction and grounded visually, so that the resulting model is aligned to the describer's representation. From psychological research we know that space is encoded in the so-called situation model (Zwaan 1999). Therefore, spatial descriptions play an important role in exchanging concepts about the surroundings. As linguistic experiments suggest that cognition and language schematize the spatial world in the same way (Tversky and Lee 1998; Freksa 2008), language can be seen as a systematic framework for conveying space by selecting or neglecting certain aspects (Talmy 1983). This means that a hearer builds a model from a description which is similar to the model the speaker has built from perception (Waltz 1980). Further, the models of communication partners will be aligned if information about the surroundings is exchanged verbally (Pickering and Garrod 2004). This allows a smooth communication as similar concepts are represented (Vasudevan et al. 2007). The problem with spatial descriptions is that they are underspecified (Waltz 1980), which means that an agent also needs sensory input to resolve ambiguities in the descriptions. Based on this requirement and the findings about the nature of spatial descriptions, we have proposed a computational model, called the aligned scene model, which enables a robot to infer supporting scene structures from a description and to ground them in its visual perception of the scene (Swadzba et al. 2009). This paper gives an overview of the aligned scene modeling approach for a robot's vista space and a detailed analysis of the models generated from a large collection of different descriptions with respect to description strategies and object detection errors.

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related work

This section discusses relevant work on inferring scene models from verbal descriptions, extracting intermediate scene structures from visual perception, and integrating verbal and visual scene interpretations. Early systems like the Spatial Analog Model transformed spatial prepositions like "the box is on the table, ..." into a 3D box model satisfying gravity conditions (Boggess 1979). The Words-into-Pictures approach extends this by constructing objects from a finite set of geometric primitives and incorporating constraints on orientation and position from object relations into a potential field (Olivier, Maeda, and Tsujii 1994). In contrast to this abstract modeling, recent robotic methods provide meaningful labeling (e.g., as "wall") of 3D points using top-down knowledge encoded in ontologies (Nüchter and Hertzberg 2008; Cantzler, Fisher, and Devy 2002; Grau 1997) or bottom-up information from point-based features (Munoz et al. 2009; Rusu et al. 2008; Triebel et al. 2007). So far, the integration of verbal descriptions and visual perception has been studied with single caption words (Jamieson et al. 2010; Wang, Hoiem, and Forsyth 2009) or simple table-top scenarios. Spatial relations and visual perceptions have been associated in a probabilistic manner using Bayesian inference processes (Wachsmuth and Sagerer 2002), stochastic layers (Mavridis and Roy 2006), or potential fields (Brenner et al. 2007).
The computational model

This section introduces the computational model for acquiring an aligned scene representation which reflects the tutor's scene representation and the visual reality perceived as bottom-up extracted 3D planar patches. An abstract representation is inferred from a scene description and grounded in the visual perception utilizing object detection and surface extraction results.

From verbal descriptions to a set of trees

Studies on the nature of spatial descriptions have revealed that spatial knowledge is organized hierarchically (Hirtle and Jonides 1985). This is visible in the partitioning of space into figures and spatial relations (Tversky and Lee 1998) and in the use of horizontal and vertical planes as reference frames (Tversky, Kim, and Cohen 1999). As spatial relations are a prominent part of spatial descriptions, we have conducted an empirical study to learn more about the nature of these relations. In this study 30 free-form descriptions of two rooms presented as pictures have been collected (→ Figure 1). It can be assumed that depictions of true scenes (Henderson and Hollingworth 1999) are more realistic than descriptions of ersatz scenes (displays of arbitrarily arrayed objects). This allows us to use these descriptions to test our scene modeling on 3D data acquired from the same view on the scene from which the pictures have been taken.

A detailed analysis of the descriptions shows that spatial room structures like pieces of furniture or other room parts serve as crystallization points for room descriptions (Vorwerg and Rickheit 2009). Typical relations are "the lion is on the chair" or "a toy car is in front of the koala" (→ Figure 1(a)). Objects are put into relation with their supporting room structure or with other objects located on the same room structure. This can be seen quantitatively in Figure 1(c), which shows the usage frequencies of the different relation types. The majority of specified relations are relations between objects on the same super-ordinate structure (B) or relations between an object and the corresponding super-ordinate structure (A, nearly never R). Therefore, we propose the following terminology:

Definition. Orthogonal relation: a relation between an object and its super-ordinate structure (e.g., "a lion on the chair").

Definition. Parallel relation: a relation between objects localized on the same structure (e.g., "a car in front of the koala", where both are on the table).

These definitions follow the basic physical fact that, due to gravity, objects are placed on tables, attached to walls, or contained in cupboards (Torralba, Russell, and Yuen 2009; Olivier, Maeda, and Tsujii 1994). The explicit and implicit references to super-ordinate structures reflect the hierarchical character of the underlying situation models, so that tree-like structures are a suitable representation for maintaining such models. Handling relations only based on their assignment to the orthogonal or parallel type allows a scene modeling that is independent of the exact view on the scene and the used reference frames.

The challenges faced when transforming a description into a tree representation T are incompleteness, ambiguity, and disordered arrangement. To cope with these problems, we assume that descriptions are sequences of object relations. An existing tree set T_{t-1} is updated by the current relation to T_t according to certain rules. Figure 1(d) introduces our compact illustration of tree sets. The nodes correspond to scene items while the edges indicate the relations between them. Each parent node is the supporting structure for all its children. A tree set is updated by o = obj("o") → n_o, which inserts a node n_o into T for an object o with the label "o"; delete(n) deletes the node n from the tree set T; and child(n_o, n_p) inserts a directed edge from the parent node n_p to the child node n_o expressing the supporting characteristic of the parent structure for the child object.

A parallel relation, rel_∥(o1, o2), between objects o1 and o2 is handled by the basic rule shown in Figure 2(a). Their supporting structure is inserted as a new node n_p, and n_o1 and n_o2 become its child nodes. If one of the related objects already has a parent, the exception rule of Figure 2(b) is applied. An orthogonal relation, rel_⊥(o1, o2), is handled as shown in Figure 3(a): a directed edge between the nodes n_o1 and n_o2 implements the given relation. In the exceptional case, both trees or subtrees can be fused into one tree with n_o2 as the root node if the existing parent node has an empty label or "o2" as its label (see Figure 3(b)). Figure 4 shows an example description and the tree set representation resulting from the sequence-wise processing of the description with the rules defined above.
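As a minimal illustration, these rules can be sketched in a few lines of Python (the class and method names are hypothetical, and the exception cases are implemented only as far as Figures 2(b) and 3(b) specify them):

```python
class Node:
    def __init__(self, label=""):
        self.label = label            # "" marks a virtual (unnamed) structure
        self.parent = None
        self.children = []

class TreeSet:
    def __init__(self):
        self.nodes = {}               # label -> Node for named scene items

    def obj(self, label):
        """obj("o") -> n_o: fetch or insert the node for an object label."""
        return self.nodes.setdefault(label, Node(label))

    def child(self, n_o, n_p):
        """child(n_o, n_p): directed edge expressing that n_p supports n_o."""
        if n_o.parent is not None:
            n_o.parent.children.remove(n_o)
        n_o.parent = n_p
        n_p.children.append(n_o)

    def rel_parallel(self, o1, o2):
        """o1 and o2 lie on the same (possibly still unknown) structure."""
        n1, n2 = self.obj(o1), self.obj(o2)
        if n1.parent is not None:          # exception rule, Figure 2(b)
            self.child(n2, n1.parent)
        elif n2.parent is not None:
            self.child(n1, n2.parent)
        else:                              # basic rule, Figure 2(a)
            virtual = Node("")             # virtual supporting structure
            self.child(n1, virtual)
            self.child(n2, virtual)

    def rel_orthogonal(self, o1, o2):
        """o2 is the super-ordinate structure supporting o1."""
        n1, n2 = self.obj(o1), self.obj(o2)
        old = n1.parent
        if old is None:                    # basic rule, Figure 3(a)
            self.child(n1, n2)
        elif old is not n2 and old.label in ("", o2):
            for c in list(old.children):   # fuse subtrees, Figure 3(b)
                self.child(c, n2)

# Example (cf. Figure 4): "the lion is on the chair" and
# "a toy car is in front of the koala".
T = TreeSet()
T.rel_orthogonal("lion", "chair")
T.rel_parallel("car", "koala")
```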
Figure 1: The two scenes used in the study: (a) playroom scene S1 and (b) living room scene S2. In two experiments the subjects were asked to freely describe the photographs. (c) Histogram of the usage frequency of all relations occurring across all acquired scene descriptions (Vorwerg and Rickheit 2009). (d) Our compact notation for two trees; all children of a node are listed directly below the parent node.

Figure 2: Visualization of the rules handling parallel relations: (a) the basic rule, (b) handling the case that the node n_o1 already has a parent node n_p.

Figure 3: Visualization of the rules handling orthogonal relations: (a) the basic rule, (b) handling the case that the node n_o1 already has a parent node n_p.

Figure 4: Left: example description of the playroom given by subject 3(p) in the pilot study. The red framed words mark the objects and the green underlined words the relations; double underlining marks a parallel relation and single underlining an orthogonal reference. Right: the resulting tree set T representing the description on the left.

Grounding the scene model in the 3D perception

This section presents how the set of trees resulting from the computation presented in the previous section can be mapped onto a 3D perception (e.g., a SwissRanger scan) of the described scene. The challenges are ambiguous labels like object categories (e.g., "soft toys") and parent nodes with an empty label, the so-called "virtual patches". As man-made environments mainly consist of planar surfaces, it is reasonable to estimate supporting structures as planar patches. We propose to compute the patch parameters from the 3D locations of the small movable objects attached to them. We assume that detectors for small compact objects which are not part of the spatial layout of a room, like "lion", "koala", ..., are easier to train and detect more robustly than detectors for spatial structures like "chair" or "table". As the development of a robust object detection system is out of the scope of this work, we have labeled objects manually in the SwissRanger amplitude image, simulating the output of an object detector (→ Figure 5(a)).
The following algorithm consists of four steps. First, the confirmed objects of a parent node (distinct objects in the scene) are identified. Second, these objects and the corresponding relation type are used to estimate the patch parameters of the parent nodes. Third, the estimated patch is used to resolve existing category labels. And fourth, ambiguities and errors are handled.

Figure 5: Manually labeled objects and their corresponding 3D convex hulls: (a) manually drawn object boxes with labels, (b) 3D object hulls.

(I) O, the set of objects known to the robot (→ Figure 5(a)), is divided for a certain description (→ Figure 4) into a set of confirmed objects O_con, which are those objects that have been mentioned in the description, and a set of potential objects O_pot holding the remaining objects.

(II) For each parent node n_p ∈ T its potential planar patch P_pot^p is computed using its distinct objects O_con^p ⊂ O_con. A planar patch is described by an orientation and a position in the global coordinate system:

    P: n · x − d = 0.   (1)

Knowledge about the type of relation used to assign objects supports the computation of the patch orientation. A patch on which objects are located is estimated as a horizontal patch:

    n_p = (0, 1, 0)^T.   (2)

A supporting structure containing objects (indicated by an in-relation) has a vertical orientation:

    n_p = a_p × b_p,   (3)

where a_p = (0, 1, 0)^T and b_p is the orientation of a line fitted robustly via RANSAC through the object points projected into the xz-plane (ground plane). If the relation between child and parent node is not known, the orientation of the 3D arrangement of the confirmed objects with respect to the ground plane determines the orientation of the potential patch (if < 45°: horizontal, else: vertical). The position of the patch is computed from the centroid c_p of the objects' 3D locations: d_p = n_p · c_p. Figure 6 visualizes a horizontal and a vertical potential patch.

Figure 6: Two potential patches: (a) a horizontal plane and (b) a vertical plane. A patch is represented by an ellipse, its normal vector (n_chair, n), and its centroid (c_chair, c). Each object is represented by its convex hull and its 3D location (l_lion, l_robot, l_raven, l_fred, l_pokemon).
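A minimal sketch of the patch estimation in step (II), assuming the y-axis points upward as in Eq. (2); the function names are hypothetical, and the robust RANSAC line fit used in the system is replaced here by a simple principal-component direction for brevity:

```python
import numpy as np

UP = np.array([0.0, 1.0, 0.0])            # ground plane is the xz-plane

def dominant_direction_xz(locations):
    """Dominant direction of the object locations projected into the
    xz-plane; the system fits this line robustly with RANSAC, here we use
    the principal component of the projected points instead."""
    xz = locations[:, [0, 2]]
    xz = xz - xz.mean(axis=0)
    _, _, vt = np.linalg.svd(xz, full_matrices=False)
    dx, dz = vt[0]
    return np.array([dx, 0.0, dz])

def estimate_patch(locations, relation):
    """Plane parameters (n, d) with n . x - d = 0 (Eq. 1) for one parent
    node, from the 3D locations of its confirmed objects (N x 3 array)."""
    if relation == "on":                   # horizontal patch, Eq. (2)
        n = UP
    elif relation == "in":                 # vertical patch, Eq. (3)
        b = dominant_direction_xz(locations)
        n = np.cross(UP, b)
        n = n / np.linalg.norm(n)
    else:
        # relation unknown: the elevation of the dominant spread direction
        # of the object locations decides between horizontal and vertical
        centered = locations - locations.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        elevation = np.degrees(np.arcsin(abs(vt[0][1])))
        if elevation < 45.0:
            n = UP
        else:
            b = dominant_direction_xz(locations)
            n = np.cross(UP, b)
            n = n / np.linalg.norm(n)
    c = locations.mean(axis=0)             # centroid of the object locations
    d = float(n @ c)                       # patch position, d_p = n_p . c_p
    return n, d
```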
(III) Finally, P_pot^p can be utilized to resolve possible category labels. Objects in O_pot having the specified category and a distance to the corresponding patch smaller than a given threshold are assigned to the potential patch. It is assumed that the human tutor has referred to these objects. Figure 7(a) shows the initial scene model of the example description in Figure 4. Patches for "corner", "table", "chair", "cupboard2", and "cupboard3", and two virtual patches with an empty label " " are estimated. The category label "soft toy on the table" is resolved to "kangaroo", "bear", "frog", and "stitch". The category label "games in cupboard3" is resolved to "games1" and "games2".

(IV) The initial scene model contains errors (cupboard3 is too big) and virtual patches (patches with an empty label " ") due to errors in the resolution of category labels and underspecifications in the descriptions. Both problems can be solved by incorporating bottom-up extracted patches {P_real^i}_{i=1...m}.¹ A real patch P_real^i is assigned to a potential patch P_pot^p if they share a similar orientation and position in space. Several real patches can be assigned to one potential patch and one real patch can be assigned to multiple potential patches. The matching type indicates the status of an initial scene model. If it contains only injective mappings from real patches to potential patches, the model is correct, while a constellation where one real patch P_real^i is assigned to several potential patches with competing labels, e.g., "p1" and "p2", means an object mismatch. To correct this, all objects of P_pot^p1 and P_pot^p2 are checked for whether they are positioned in/on the real patch P_real^i; P_real^i is put to that potential patch holding the biggest percentage of objects lying in/on P_real^i. In the case where a bottom-up patch P_real^i is assigned to a set of potential patches where exactly one patch has a non-empty label "p", all parent nodes can be merged into one node labeled with "p", removing virtual patches.

¹ Planar patches can be extracted via region growing utilizing the conormality and coplanarity values of point pairs (Stamos and Allen 2002). Several runs of the RANdom SAmple Consensus (RANSAC) algorithm (Fischler and Bolles 1981) improve the initial regions. A detailed description of the planar patch extraction algorithm can be found in (Swadzba and Wachsmuth 2008).

Figure 7: (a) Initial model: the potential planar patches {P_pot^p}_{p=1...7} computed for the tree set T shown in Figure 4; patches are visualized as colored ellipses. (b) Final model: the scene model after correcting wrongly assigned objects and fusing virtual patches with named potential patches. Patches are fused if they share the same bottom-up patch.
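The correction in step (IV) can be sketched as follows (hypothetical data structures and helper names; apart from the 30° normal tolerance mentioned in the evaluation, no explicit similarity thresholds are given, so the position tolerance below is an assumed value):

```python
import numpy as np

ANGLE_TOL_DEG = 30.0   # tolerance between potential and real patch normals
OFFSET_TOL_M = 0.10    # assumed tolerance on the position along the normal

def similar(pot, real):
    """Orientation/position similarity of a potential and a real patch.
    Patches are dicts with unit normal 'n', offset 'd', and centroid 'c'."""
    angle = np.degrees(np.arccos(np.clip(abs(pot["n"] @ real["n"]), 0.0, 1.0)))
    offset = abs(pot["n"] @ real["c"] - pot["d"])
    return angle < ANGLE_TOL_DEG and offset < OFFSET_TOL_M

def correct_model(potential, real, fraction_on):
    """Step (IV): assign real patches to similar potential patches, resolve
    competing labels, and fuse virtual patches with labeled ones.
    fraction_on(p, r) is an assumed helper returning the fraction of the
    objects of potential patch p that lie in/on real patch r."""
    for r in real:
        matched = [p for p in potential if similar(p, r)]
        labeled = [p for p in matched if p["label"]]
        if len(labeled) > 1:
            # object mismatch: keep the real patch only at the potential
            # patch holding the largest share of objects lying in/on it
            matched = [max(labeled, key=lambda p: fraction_on(p, r))]
        elif len(labeled) == 1 and len(matched) > 1:
            # exactly one named patch among the matches: merge the virtual
            # patches into the named node, removing them from the model
            for p in matched:
                p["label"] = labeled[0]["label"]
        for p in matched:
            p.setdefault("assigned_real", []).append(r)
    return potential
```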
Evaluation

Figure 8 shows the aligned scene models generated from the 30 descriptions collected in the studies. The room structures are represented by ellipses and are localized on level 1 of the tree sets T, since the objects positioned in the leaves of the trees are used to estimate these structures. The histograms in Figure 9 show the distribution of the structure labels found in the models. A structure appearing in many models is a prominent element of the scenario. In general, no differences between the descriptions of the pilot study and the main study can be observed. Each model is aligned to the specific level of detail given in the description. Most of the descriptions are quite detailed with many relations between objects and room structures. There are only a few descriptions where objects have simply been listed (compare models 1, 2, 5, and 7 in Figure 8(a), where nearly no spatial structures could be concluded). In the playroom scenario (→ Figure 9(a)) the "table" and the "chair" are the most dominant structures. Further, the shelves at the wall are also interesting structures; half of the participants have given enough information to conclude patches for "cupboard2" and "cupboard3". The individual reference to the shelves may be primed by their different colors, as the contrary effect is visible in the descriptions of the living room S2, where the shelves have the same color and are fused into the single structure "cupboard" (→ Figure 9(b)). The other meaningful structures in the living room are "table", "sofa", and "wall". In general, the system fails to compute a potential patch for a spatial structure (indicated by "Cannot compute potential patch for ...") when only the category of the objects located in or on the structure is known (like "there are soft toys in the cupboard"). Without knowing at least one specific object this category label cannot be resolved. Finally, it can be said that all generated models are an appropriate representation of the respective scenario. The analysis of this large set of descriptions ensures that we have covered different description strategies. The strength of our approach is that for every description room structures can be inferred and modeled across different scenes and appearances, and that the level of detail in the model is aligned to the given description, preserving resources.

Figure 8: Overview of the resulting scene models for all descriptions collected in the pilot and the main study: (a) scene models generated from descriptions of scene S1 collected in the pilot study, (b) scene models generated from descriptions of scene S1 collected in the main study, (c) scene models generated from descriptions of scene S2 collected in the main study. The estimated supporting patches are visualized as ellipses. The convex hulls are displayed only for those objects that have been related to the corresponding scene structure. Few subjects have just listed objects without relating them to each other (models 1, 2, 5, and 7 in the pilot study).

Figure 9: Distribution of the structure labels appearing in the models generated for the descriptions acquired in our studies (pilot study: playroom, 10 participants; main study: playroom and living room, 10 participants; → 30 descriptions): (a) structure labels in playroom S1, (b) structure labels in living room S2. Each bar shows the number of models in which the corresponding structure label can be found.

As assuming perfect object detection is not realistic, we have also examined the influence of typical errors on the model formation process. Figure 10 shows an object set where typical errors like scaling (bounding box is too small or too big), translation (bounding box is misplaced), and missing errors (no bounding box available) have been applied to randomly selected objects. Half of the objects in the set of known objects are affected by the errors. The figure also presents the scene model estimated using this erroneous object set. The table in Figure 10 gives the deviation of this model from the ground truth model in Figure 7(b).

Figure 10: The object set where small and large scaling and translation errors are applied to the 2D object boxes. Small errors mean that an overlap with the ground truth box still exists, while large errors mean no overlap. Half of the objects are influenced by errors. On the right, the aligned scene model computed on the erroneous object set is shown. The table gives the deviation (conormal: orientation; coplanar: position) of this model from the ground truth model given in Figure 7(b):

    deviation   corner    table    chair     cupboard2   cupboard3
    conormal    6.01°     0°       0°        0.26°       14.53°
    coplanar    0 mm      12 mm    302 mm    22 mm       10 mm
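The exact computation of the conormal and coplanar deviations is not spelled out; a plausible reading, in the spirit of the conormality and coplanarity measures of Stamos and Allen (2002), is the angle between the estimated and ground-truth patch normals and the offset of the estimated patch centroid from the ground-truth plane:

```python
import numpy as np

def conormal_deviation_deg(n_est, n_gt):
    """Assumed orientation deviation: angle between the two patch normals."""
    cos = np.clip(abs(np.dot(n_est, n_gt)), 0.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

def coplanar_deviation_mm(c_est, n_gt, d_gt):
    """Assumed position deviation: distance of the estimated patch centroid
    c_est from the ground-truth plane n_gt . x - d_gt = 0, in millimeters."""
    return float(abs(np.dot(n_gt, c_est) - d_gt) * 1000.0)
```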
Overall, the patches are estimated robustly. "cupboard2", "table", and "corner" show only minor deviations from the original patches. The twisting of the "cupboard3" patch by 14.53° is acceptable since the overall position of the patch is correct; it is still possible to assign the correct real patch to "cupboard3" because, due to noise, we allow a deviation of up to 30° between the potential patch normal and the real patch normal. Only the "chair" patch is too big and misplaced. If 3D data of the chair itself could be perceived, step (IV) would detect that the "robot" object is misplaced with respect to the chair, as it handles mismatched objects arising from errors when resolving category labels. As the computation of the 3D object hulls incorporates a removal of outlier points, it can handle slightly misplaced object boxes. Further, the orientation of the horizontal patches is only influenced by the orientation of the camera, and the RANSAC-based approach for computing the vertical patches can deal with remaining outliers. Also, missing objects can be compensated for if enough other objects are known for the corresponding parent node. To summarize, our method provides a robust estimation of the patch orientation that can handle object outliers. The computation of the patch position is more sensitive to noise. If enough correct objects are given, a wrong object position can be averaged out. But if only one object is associated with a parent node, a big translation or scaling error will corrupt the potential patch of this node. In general, the corruption affects the position and expansion of a node patch.

Conclusion

In this paper we have analyzed the performance of the aligned scene modeling approach regarding robustness to different room description strategies by analyzing a large set of descriptions. The computational model is based on the fact that the construction of spatial descriptions is driven by gravity, which means that the descriptions mainly consist of orthogonal relations between objects and their supporting structures and parallel relations between objects located on/in the same structure. The classification of the relations in a description into only these two types allows the design of compact rules that can transform relations into an abstract tree representation independently of the order in which the relations are given. The evaluation shows that the strategies developed for grounding the abstract model in the perception of the scene are robust to missing data and object detection errors.

Acknowledgement. The work has been funded by the CRC 673 "Alignment in Communication".
References

Boggess, L. 1979. Spatial Operators in Natural Language Understanding: The Prepositions. In Southeast Regional Conference, 8–11.
Brenner, M.; Hawes, N.; Kelleher, J.; and Wyatt, J. 2007. Mediating Between Qualitative and Quantitative Representations for Task-Orientated Human-Robot Interaction. In IJCAI, 2072–2077.
Cantzler, H.; Fisher, R. B.; and Devy, M. 2002. Improving Architectural 3D Reconstruction by Plane and Edge Constraining. In BMVC, 43–52.
COGNIRON. 2004. The Cognitive Robot Companion. (FP6-IST-002020), http://www.cogniron.org.
Fischler, M. A., and Bolles, R. C. 1981. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24(6):381–395.
Freksa, C. 2008. Zur interdisziplinären Erforschung räumlichen Denkens. Kognitive Psychologie – Ausgewählte Grundlagen- und Anwendungsbeispiele, 87–108.
Grau, O. 1997. A Scene Analysis System for the Generation of 3D Models. In 3DIM, 221–228.
Henderson, J. M., and Hollingworth, A. 1999. High-level Scene Perception. Annual Review of Psychology 50:243–271.
Hirtle, S. C., and Jonides, J. 1985. Evidence of Hierarchies in Cognitive Maps. Memory & Cognition 13:208–217.
Jamieson, M.; Fazly, A.; Stevenson, S.; Dickinson, S.; and Wachsmuth, S. 2010. Using Language to Learn Structured Appearance Models for Image Annotation. PAMI 32(1):148–164.
Kuipers, B. 2000. The Spatial Semantic Hierarchy. AI 119:191–233.
Mavridis, N., and Roy, D. 2006. Grounded Situation Models for Robots: Where Words and Percepts Meet. In IROS, 4690–4697.
Montello, D. R. 1993. Scale and Multiple Psychologies of Space. In Spatial Information Theory: A Theoretical Basis for GIS, volume 716 of LNCS, 312–321.
Munoz, D.; Bagnell, J. A.; Vandapel, N.; and Hebert, M. 2009. Contextual Classification with Functional Max-Margin Markov Networks. In CVPR, 975–982.
Nüchter, A., and Hertzberg, J. 2008. Towards Semantic Maps for Mobile Robots. RAS 56(11):915–926.
Olivier, P.; Maeda, T.; and Tsujii, J. 1994. Automatic Depiction of Spatial Descriptions. In NCAI, volume 2, 1405–1410.
Philippsen, R., and Siegwart, R. 2003. Smooth and Efficient Obstacle Avoidance for a Tour Guide Robot. In ICRA, volume 1, 446–451.
Pickering, M. J., and Garrod, S. 2004. Toward a Mechanistic Psychology of Dialogue. Behavioral and Brain Sciences 27(2):169–190.
Rusu, R. B.; Marton, Z. C.; Blodow, N.; and Beetz, M. 2008. Learning Informative Point Classes for the Acquisition of Object Model Maps. In ICARCV, 643–650.
Stamos, I., and Allen, P. K. 2002. Geometry and Texture Recovery of Scenes of Large Scale. CVIU 88(2):94–118.
Swadzba, A., and Wachsmuth, S. 2008. Categorizing Perceptions of Indoor Rooms using 3D Features. In Structural, Syntactic, and Statistical Pattern Recognition, volume 5342 of LNCS, 744–754.
Swadzba, A.; Vorwerg, C.; Wachsmuth, S.; and Rickheit, G. 2009. A Computational Model for the Alignment of Hierarchical Scene Representations in Human-Robot Interaction. In IJCAI, 1857–1863.
Talmy, L. 1983. How Language Structures Space. In Pick, H. L., Jr., and Acredolo, L. P., eds., Spatial Orientation: Theory, Research and Application, 225–282.
Thrun, S.; Burgard, W.; and Fox, D. 2000. A Real-time Algorithm for Mobile Robot Mapping with Applications to Multi-Robot and 3D Mapping. In ICRA, volume 1, 321–328.
Topp, E., and Christensen, H. 2008. Detecting Structural Ambiguities and Transitions during a Guided Tour. In ICRA, 2564–2570.
Torralba, A.; Russell, B. C.; and Yuen, J. 2009. LabelMe: Online Image Annotation and Applications. Technical report, MIT CSAIL.
Triebel, R.; Schmidt, R.; Mozos, Ó. M.; and Burgard, W. 2007. Instance-based AMN Classification for Improved Object Recognition in 2D and 3D Laser Range Data. In IJCAI, 2225–2230.
Tversky, B., and Lee, P. U. 1998. How Space Structures Language. In Spatial Cognition: An Interdisciplinary Approach to Representing and Processing Spatial Knowledge, volume 1404 of LNCS, 157–176.
Tversky, B.; Kim, J.; and Cohen, A. 1999. Mental Models of Spatial Relations and Transformations from Language. In Mental Models in Discourse Processing and Reasoning, 239–258.
Vasudevan, S.; Gächter, S.; Nguyen, V.; and Siegwart, R. 2007. Cognitive Maps for Mobile Robots – An Object-based Approach. RAS 55(5):359–371.
Vorwerg, C., and Rickheit, G. 2009. Verbal Room Descriptions Reflect Hierarchical Spatial Models. In Annual Meeting of the American Psychological Society. Poster.
Wachsmuth, S., and Sagerer, G. 2002. Bayesian Networks for Speech and Image Integration. In AAAI Conf. on AI, 300–306.
Waltz, D. L. 1980. Understanding Scene Descriptions as Event Simulations. In Annual Meeting of the Association for Computational Linguistics, 7–11.
Wang, G.; Hoiem, D.; and Forsyth, D. 2009. Building Text Features for Object Image Classification. In CVPR, 1367–1374.
Yuan, F.; Swadzba, A.; Philippsen, R.; Engin, O.; Hanheide, M.; and Wachsmuth, S. 2009. Laser-based Navigation Enhanced with 3D Time-of-Flight Data. In ICRA, 2844–2850.
Zwaan, R. A. 1999. Situation Models: The Mental Leap into Imagined Worlds. Current Directions in Psychological Science 8(1):15–18.