Intelligent Robotic Systems: Papers from the AAAI 2013 Workshop

Grounded Spatial Language - An Integrated AI Research Program

Michael Spranger
Sony Computer Science Laboratories
3-14-13, Higashigotanda, Shinagawa-ku, Tokyo, Japan 141-0022

This question, however, goes deeper. We have to understand the particular role of spatial subsystems such as projective relations (front/back/left/right) and proximal relations (near/far) in association with other spatial systems such as toponyms.

3. How is it acquired? To answer this question, we need to identify learning operators and the structure of learning situations.

Abstract

This paper details recent progress in modelling and understanding the processing, acquisition and evolution of grounded spatial language. It summarises key insights from the project and gives an overview of what integrated A.I. research programs can achieve today.

Spatial language is one of the pivotal achievements of human intelligence. It shows human ingenuity through a staggering amount of cross-cultural variation (Evans and Levinson 2009). Spatial language is ubiquitous and affects all aspects of cognition (Li and Gleitman 2002; Majid et al. 2004). Moreover, spatial language is a metaphoric source of structure for many other subsystems of language, such as temporal language (Boroditsky 2000). All of these points make spatial language interesting from the viewpoint of linguistics and psychology. But spatial language also provides a huge opportunity for research on integrated Artificial Intelligence. Spatial language is an integrated phenomenon: it necessarily requires thinking about routine language processing involving perception as well as semantic and syntactic processing. Language is also a cultural phenomenon. The languages of the world are very different. Yet humans are generally able to learn languages fast (especially in ontogeny) and adjust to different languages.
Lastly, humans actively shape and change language so as to reflect changes in the ecosystem as well as at the cultural level (e.g. ideology and technology). This paper surveys recent research on locative spatial language that integrates various aspects of language to build a unified theory. The research aims can be summarised along well-established principles for scientific explanation defined in biology, such as Tinbergen's four principles (Tinbergen 1963). For language in general and spatial language in particular, we need to answer the following questions.

4. How does it evolve? This question is primarily directed at language change. The spatial systems in use today are different from the ones used by our ancestors. To understand these processes, we need to understand the mechanisms behind the invention and alignment of language.

We explore these questions by building real-time artificial systems and testing our hypotheses in synthetic experiments. That is, we are interested in identifying concrete, implementable mechanisms that give rise to the phenomenon under investigation. For question 1, we built a large-scale reconstruction of German locative spatial language including conceptualisation and linguistic processing. Question 2 is answered in experiments where agents are given parts of the reconstructed system in order to identify the exact function of spatial subsystems. Questions 3 and 4 are treated in experiments where agents start with little or no spatial language and have to acquire German locative language or evolve spatial language on their own. For space reasons, we focus only on questions 1 and 4 in this paper; the other questions are answered in more detail in the papers cited.

Spatial Language Games

We frame our research along the basic experimental methodology of grounded evolutionary language games (Steels 2012), which allows us to explore and validate hypotheses about language.
Language games are routinised interactions between two or more robots that try to talk about objects or events in their environment. Figure 1 shows the environment in which two robots interact. Both robots are equipped with a vision system that singles out and tracks objects (Spranger 2008; Spranger, Loetzsch, and Steels 2012). The environment contains four types of objects: blocks (colored circles), boxes (rectangles), robots (arrows) and geocentric markers (marker on the wall,

1. What is the mechanism? This question targets the processing of spatial language and requires a mechanistic account of how people conceptualise reality and how spatial grammar and lexicons function.

2. What is the function? Spatial language has a particular role in human communication in that it allows individuals to refer to specific objects using spatial relations.

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

same, the game is a success and the speaker signals this outcome to the hearer.

7. If the game is a failure, the speaker points to the topic T he had originally chosen.

To succeed in these interactions, robots run software systems that allow them to perceive and conceptualise reality and to engage in linguistic interactions, together with mechanisms for non-linguistic feedback (such as pointing). Moreover, these interactions can fail. An agent might not know a particular category or might be unable to conceptualise the scene. Invention, adoption and alignment operators orchestrate the adaptation of agent-internal conceptual and linguistic inventories and the overall development of the population.

[Figure 1 labels: speaker, hearer, robot-1, robot-2, box-1, obj-249, obj-252, obj-253, obj-265, obj-266, obj-268.]
Figure 1: Robot setup for researching spatial language. The world model extracted by the left robot is shown on the left; the world model extracted by the other robot is shown on the right.
[Figure 2 labels: speaker, hearer, joint attention, world, sensorimotor systems, world model, goal, action, conceptualisation, interpretation, reference, meaning, production, parsing, utterance.]
Figure 2: Semiotic cycle underlying spatial language games.

Conceptualization

One of the most important insights one gets when dealing with natural language is the importance of conceptualisation. The meaning of a sentence is not just a factual statement that can be true or false. A sentence asks the listener to imagine or construe the world in a particular way. This might involve immediately experienced reality, but it is not at all constrained to direct links with the physical world. Conceptualisation affects all aspects of cognition. For instance, spatial categories are conceptualised in different ways to construct new conceptual spaces (Gärdenfors 2004) from sensory information. Here are two examples that use the spatial category “vorne” (front):

(1) der vordere Block
    the.NOM front.ADJ.NOM block.NOM
    ‘The front block’

blue line in Figure 1). The vision system extracts the objects from the environment and computes a number of raw, continuous-valued features such as x, y, width and height, as well as color values in the YCrCb color space.

In every game, two agents randomly drawn from a population interact: one acts as the speaker, the other as the hearer. The spatial language game uses the following game script, assuming a population P of agents and a world consisting of a set of individual objects (Figure 2 shows the processing flow).

1. The speaker selects an object out of the context, further called the topic T.
2. The speaker tries to find a spatial conceptualisation of the scene that discriminates the topic. This can involve different spatial relations, landmarks and perspectives.
3. The speaker expresses the meaning given his knowledge of the language (lexicon and grammar). This knowledge is possibly incomplete, and the utterance may involve new words or concepts.
4.
The hearer parses the utterance and interprets the meaning. He handles ambiguities, unknown words and missing information by integrating contextual information into the processing.
5. The hearer points to the object he takes to be the topic.
6. The speaker checks whether the hearer selected the same object as the one he had originally chosen. If they are the

(2) der Block vorne
    the block.NOM in the front.ADV
    ‘The front block’

Underlying these examples are different conceptualisation strategies involving the same spatial relation. In the first sentence, the category is conceptualised as a filter that identifies the frontmost object from a group of objects of the same class (blocks) (Tenbrink and Moratz 2003). In the second sentence, the category is conceptualised as a region (Tenbrink 2007).

Old ideas about the semantics of sentences lend themselves well to the modelling and processing of these phenomena. In particular, the 1970s A.I. movement of procedural semantics (Winograd 1971) is helpful for modelling the computational nature of conceptualisation because it understands sentences as programs. An utterance such as “der Block links von der Kiste von dir aus” (the block left of the box from your perspective) conveys detailed instructions to the listener as to how to conceptualize reality in order to identify the phrase’s referent. We therefore model the semantics of such a phrase as a set of procedures, i.e. a program, which consists of a number of operations and categories. Figure 3 shows a procedural semantics representation of the phrase.
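The idea that a phrase is a program can be illustrated with a minimal Python sketch. It is not IRL and not the author's implementation: the class `Obj`, the function names, the coordinates and the crude half-plane test for "left" are all assumptions made for illustration, whereas the actual system works on graded regions computed from robot perception.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    name: str
    kind: str             # "block", "box" or "robot"
    x: float
    y: float
    heading: float = 0.0  # orientation in radians; only meaningful for robots

def apply_class(objs, kind):
    """Filter a set of objects down to one object class."""
    return [o for o in objs if o.kind == kind]

def apply_selector_unique(objs):
    """Return the single remaining object, or None if the set is not a singleton."""
    return objs[0] if len(objs) == 1 else None

def geometric_transform(objs, viewer):
    """Re-express positions in the viewer's frame (x ahead, y to the viewer's left)."""
    c, s = math.cos(-viewer.heading), math.sin(-viewer.heading)
    moved = []
    for o in objs:
        dx, dy = o.x - viewer.x, o.y - viewer.y
        moved.append(Obj(o.name, o.kind, c * dx - s * dy, s * dx + c * dy))
    return moved

def region_left_of(objs, landmark):
    """Crude stand-in for a 'left' region: further to the viewer's left than the landmark."""
    return [o for o in objs if o.name != landmark.name and o.y > landmark.y]

def interpret(context, perspective_role):
    """'der Block links von der Kiste von dir aus' as a chain of operations."""
    viewer = apply_selector_unique([o for o in context if o.name == perspective_role])
    shifted = geometric_transform(context, viewer)        # take the given perspective
    landmark = apply_selector_unique(apply_class(shifted, "box"))
    region = region_left_of(shifted, landmark)            # relative reading of 'links'
    return apply_selector_unique(apply_class(region, "block"))

# Hypothetical scene: the hearer faces along +y, so its left is the -x direction.
scene = [
    Obj("hearer", "robot", 0.0, 0.0, heading=math.pi / 2),
    Obj("box-1", "box", 0.0, 2.0),
    Obj("block-a", "block", -1.0, 2.0),
    Obj("block-b", "block", 1.0, 2.0),
]
topic = interpret(scene, "hearer")
print(topic.name)
```

Each function plays the role of one cognitive operation, and the body of `interpret` is the program that the phrase conveys. The sketch assumes the scene contains exactly one box and one robot with the requested role; a fuller version would have to handle failing selectors.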
The structure in Figure 3 consists of a set of cognitive operations that involve, for example, the construction of regions, the identification of landmarks, the application of perspective transformations and so on. The program also contains references to spatial categories, selectors and other semantic entities that are processed by cognitive operations.

[Figure 3 network: operations apply-selector, apply-class, apply-spatial-region, construct-region-lateral, geometric-transform, identify-discourse-participant and get-context, with bindings for selector (unique), object-class (block, box), lateral-category (left), f-o-r (relative) and discourse-role (hearer).]
Figure 3: IRL-network representing the meaning of the phrase “der Block links von der Kiste von dir aus” (the block left of the box from your perspective).

[Figure 4 diagram: utterance, parse, meaning (intrinsic), meaning (relative), find-topic, interpretation; candidate topics with scores obj-252 (0.26), obj-249 (0.08), obj-253 (0.51).]
Figure 4: Interpretation of Example 3 in the setup depicted in Figure 1. In parsing, Fluid Construction Grammar (a grammar formalism used in the experiments) finds two possible interpretations (relative and intrinsic). IRL is then called on each of these interpretations separately and recovers two possible conceptualizations for the relative reading. All three possible interpretations and their corresponding topics are scored based on similarity (see Figure 5 for a depiction of the similarity functions). The hearer then decides that obj-253 is the best interpretation.
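The hearer's decision described in the Figure 4 caption can be sketched in a few lines of Python. The record type and function names are hypothetical. The topics and scores are the ones reported for Figures 4 and 5; pairing each score with a reading follows our reading of those two figures and should be taken as an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interpretation:
    reading: str       # "intrinsic" or "relative"
    perspective: str   # whose viewpoint is used; "-" for the intrinsic reading
    topic: str         # object the interpretation evaluates to
    similarity: float  # how well the topic fits the described location

def needs_disambiguation(candidates):
    """Disambiguation only matters if the readings pick out different objects."""
    return len({c.topic for c in candidates}) > 1

def choose_interpretation(candidates):
    """Pick the reading whose topic is most similar to the described location."""
    return max(candidates, key=lambda c: c.similarity)

candidates = [
    Interpretation("intrinsic", "-", "obj-252", 0.26),
    Interpretation("relative", "hearer", "obj-249", 0.08),
    Interpretation("relative", "speaker", "obj-253", 0.51),
]
best = choose_interpretation(candidates)
print(best.topic, needs_disambiguation(candidates))
```

A plain maximum over similarity is the simplest possible decision rule; the scoring in the actual system may combine more information than a single number per candidate.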
To represent semantics, we have developed a formalism called Incremental Recruitment Language (IRL) (Spranger et al. 2012), which captures the procedural nature of semantic structure and provides tools for autonomous agents to conceptualise and interpret reality automatically, as well as to learn and evolve these procedural structures. Most importantly, though, IRL connects the discrete sensorimotor spaces spawned by the pattern matching and feature detection algorithms dominant in robot vision and sensorimotor control with the symbolic, high-level reasoning world of linguistic processing. To achieve this, procedures can be implemented using state-of-the-art mechanisms from machine learning and feature detection.

potential are semantically ambiguous utterances (Spranger and Loetzsch 2011). We will focus here briefly on how integrated systems can deal with ambiguous utterances. Many German locative phrases are semantically ambiguous. Let us consider the following two examples from German.

(3) der Block vor der Kiste
    the.NOM block.NOM front.PREP the.DAT box.DAT.FEM
    ‘The block in front of the box’

(4) der Block vor der Kiste von dir aus
    the.NOM block.NOM front.PREP the.DAT box.DAT from.PREP your.DAT perspective
    ‘The block in front of the box from your perspective’

Example 3 is ambiguous with respect to how the landmark object, in this case the left box, is conceptualized. The phrase can have an intrinsic or a relative reading, which are two different ways of construing the coordinate system of the landmark. The second example does not have this problem. The perspective marker clearly signals a relative reading of the phrase. Interestingly, this fact can only be established after parsing the complete phrase. To illustrate this dependency, consider Example 4, which is not semantically ambiguous (with respect to intrinsic and relative readings) because it

Linguistic Processing

Language is an inferential coding system (Sperber and Wilson 1986).
Information is often left out or ambiguous. People often speak ungrammatically. Language only encodes evidence of the intention of the speaker and never provides enough information. In order to decipher the information the speaker wants to convey, listeners have to integrate the noisy and error-prone information conveyed in the utterance with the noisy and error-prone information coming from other sensory systems. To deal with these inherent “deficiencies”, linguistic processing has to be tightly integrated with other subsystems such as conceptualisation. The sensorimotor system and the conceptual frameworks can help in understanding problematic utterances. One example where this plays out its full

man locative grammar given to the interacting agents is a success and the hearer agent is able to correctly identify the topic.

Importantly, in all three cases agents rely on the power of integrated conceptualisation, syntactic and semantic processing.

Figure 5: Possible interpretations of Example 3 in the setup depicted in Figure 1. From left to right: (1) intrinsic interpretation, (2) relative from the perspective of the hearer, and (3) relative from the perspective of the speaker. All of these interpretations have different topics. The intrinsic interpretation evaluates to object obj-252, the relative interpretation from the perspective of the hearer to obj-249, and the relative interpretation from the perspective of the speaker to obj-253. The darker (more red) the shading, the higher the similarity of the interpretation with that location in space.

Evolution

Language is a complex adaptive system (Beckner et al. 2009) that changes on short and long timescales and is shaped by autonomous agents in cultural populations. Truly intelligent systems actively participate in changing the ecosystem and (for humans) the cultural sphere. This means we have to understand and implement systems that participate in the cultural negotiation of language.
Cultural evolution is best understood in terms of biological evolutionary theory, the most comprehensive theory about the organisation and emergence of complexity in distributed systems. However, the cultural layer is obviously different from biological evolution. We are not dealing with organisms and genes but rather with symbolic communication. Therefore, the application of biological concepts to cultural phenomena is metaphorical, and theories and concepts have to be adapted to the phenomenon at hand. Nevertheless, key principles can be directly applied.

Speakers face communicative problems, for instance the need to talk about a specific object in the environment. Faced with this, the speaker might invent a new word or category, conceptualisation strategy or grammatical construction. The hearer then has to interpret the intention of the speaker (possibly using extra-linguistic feedback). He can then learn something about the parts of the utterance he has not understood or was unable to parse or interpret. The first step in this process introduces variation in the language. That is, the speaker changes the inventory of the language. However, invention always happens in a single interaction between subgroups of the population (often only two agents of a larger population). Therefore, agents need additional mechanisms for orchestrating the coordination of joint inventories.

One such coordination mechanism is alignment. Language users adapt to their communication partners on all levels of processing (Garrod and Pickering 2009): phonetic, syntactic, semantic and conceptual. For spatial language, agents are, for instance, more likely to use a landmark introduced earlier in the dialogue. The same phenomenon occurs across many dialogues (Garrod and Doherty 1994; Barr 2004) and forms the basis of language change beyond immediate interactions of peers.

features a perspective marker in the end. Suppose that a speaker uttered this phrase in the spatial scene shown in Figure 1.
In this scene there are at least three possible conceptualizations that are compatible with the information conveyed in the utterance. One is the intrinsic interpretation. The other two are variants of the relative interpretation. Relative conceptualizations of spatial scenes depend on perspective. The scene has two robots, both of which could in principle be used as the perspective. IRL recovers all three conceptualizations of the scene. The hearer can then choose which of the interpretations is the best one (see Figure 5 for a depiction of the three possible interpretations and Figure 4 for an overview of processing). The final decision is based on the discriminative power of the three possible interpretations. In this particular configuration all three interpretations lead to different results. This is not always the case.

There are three ways of dealing with semantic ambiguity.

• The speaker detects through re-entrance that the phrase would be ambiguous and chooses to avoid the problem by expressing himself differently. We achieve this by using reversible processing systems. Agents can both parse and produce utterances using the same inventories.

• In some scenes, even though a phrase might be highly ambiguous with many different interpretations, all of these interpretations refer to the same object. In this case disambiguation becomes unnecessary. Examples are certain vertical relations for which intrinsic, absolute and relative interpretations often overlap (Carlson 1999). Our robots can test for this case through the integration of linguistic processing with sensorimotor systems. Possible candidates for interpretation are tested against the context and possibly rejected or considered equal.

• The speaker relies on the interpretation power of the hearer. For this particular scene interaction with a Ger-

Category Alignment

Using this approach, we carried out experiments.
In these experiments robots start with little or no given knowledge and evolve linguistic and conceptual knowledge. Agents are given various invention and alignment operators that orchestrate the development of individuals and the overall coordination of the group in a distributed system. For instance, the following operators are sufficient for organising lexicon evolution. See (Spranger 2012a) for technical details.

[Figure 6 plot: x-axis, number of interactions; left y-axis, communicative success and interpretation similarity; right y-axis, number of categories and constructions; curves: communicative success, # categories, # constructions, interpretation similarity.]
Figure 6: Results for a formation experiment in which agents develop a projective category system.

[Figure 7 diagram: a diagnostics and repairs layer (invent-category, invent-construction, restart) above a processing layer (conceptualize, produce), connected by success and failure transitions.]
Figure 7: Meta-layer

Increasing Complexity

Similar experiments have been carried out for other aspects of spatial language, including the evolution of landmarking systems (Spranger 2012b), grammar evolution (Spranger and Steels 2012) and the evolution of conceptualisation strategies (Spranger 2013). These experiments gradually increase the complexity of the syntactic and semantic representations evolved by populations. In lexicon studies agents typically evolve particular types of category systems, such as proximal (near/far). In follow-up studies agents additionally negotiate which type of category system should be used. That is, the population autonomously decides whether to build a projective category system (front/back) or an absolute one (north/south). Lastly, agents can freely evolve how complex their syntactic representations have to be. Agents might develop grammatical markings or decide to stay lexical only.
Invention: Speaker cannot find a discriminating spatial category in production
• Diagnostic: When the speaker cannot conceptualize a meaning (step 2 of the spatial language game fails).
• Repair: The speaker constructs a spatial relation R based on the topic. The new category is necessarily based on the distance or angle observed for the topic object. Additionally, the speaker invents a new construction (form-meaning mapping) associating R with the topic direction or angle.

Adoption: Hearer encounters unknown spatial term s
• Diagnostic: When the hearer does not know a term (step 4 fails).
• Repair: The hearer signals failure and the speaker points to the topic T. The hearer then constructs a spatial relation R based on the relevant strategy and the topic pointed at. Additionally, the hearer invents a new construction associating R with s.

Meta-level Architecture

A key result from these experiments is the development of a meta-level architecture that integrates across conceptual, semantic and syntactic processing. The architecture allows agents to 1) operate diverse invention and learning operators, 2) try out (simulate) the effects of operators in the current interaction, and 3) orchestrate the choosing or storing of the results of learning and invention.

The meta-layer is split into two components. 1) Diagnostics hook into routine processing and try to identify problems. 2) Repair strategies extend routine processing when problems have been detected. They try to solve problems when information becomes available for doing so. For instance, a learner confronted with a new word has to wait for the speaker to point to the object he had in mind. It is only after he has received this additional information that he can learn the meaning of the new word. Therefore, the diagnosis of a problem and its repair have to be separated. Figure 7 shows the flow of processing when the speaker invents a new spatial relation.
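The separation of diagnosis and repair can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the architecture used in the experiments: `Agent`, `Problem`, the function names, the tolerance threshold, the angle values and the word "wabodu" are all hypothetical, and a category is reduced to a single prototype angle.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    categories: dict = field(default_factory=dict)  # category name -> prototype angle (radians)
    lexicon: dict = field(default_factory=dict)     # word -> category name

@dataclass
class Problem:
    kind: str       # "no-category" (speaker) or "unknown-word" (hearer)
    word: str = ""

def diagnose_conceptualisation(agent, topic_angle, tolerance=0.5):
    """Diagnostic for step 2: report a problem if no known category fits the topic."""
    fits = any(abs(a - topic_angle) <= tolerance for a in agent.categories.values())
    return None if fits else Problem("no-category")

def diagnose_parsing(agent, word):
    """Diagnostic for step 4: report a problem if the word is unknown."""
    return None if word in agent.lexicon else Problem("unknown-word", word)

def repair(agent, topic_angle, word):
    """Repair: build a category from the observed topic angle and link it to a word."""
    name = f"category-{len(agent.categories) + 1}"
    agent.categories[name] = topic_angle
    agent.lexicon[word] = name
    return name

speaker, hearer = Agent(), Agent()
topic_angle = 1.2  # hypothetical angle of the topic, in radians

# Speaker: the repair can run immediately, after which processing restarts.
if diagnose_conceptualisation(speaker, topic_angle):
    repair(speaker, topic_angle, word="wabodu")  # invented word

# Hearer: the problem is diagnosed during parsing but cannot be repaired yet.
pending = diagnose_parsing(hearer, "wabodu")

# Later in the game the speaker points to the topic; only now can the hearer repair.
pointed_angle = 1.2  # hypothetical angle of the object pointed at
if pending:
    repair(hearer, pointed_angle, pending.word)
```

The point of the sketch is the timing: the speaker's diagnostic and repair run back to back, while the hearer has to keep the diagnosed problem around until pointing supplies the missing information.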
When a new spatial relation is invented, the speaker immediately tries out whether it can solve his problem in conceptualisation. This way the speaker avoids complications with other relations already in his inventory. After that, the speaker checks whether he needs to invent a new word for the spatial relation. In each case, he restarts processing at an appropriate point.

Category alignment: After each interaction, the participants update their internal representations. Successfully used categories and words are rewarded; unsuccessful ones are punished. Moreover, agents change the category representations to better reflect the information they received from the current interaction.

Figure 6 details the dynamics of a population that is evolving a spatial lexicon. It summarises 25 experimental runs (on grounded data). In each run the population starts with an empty lexicon and no communicative success. Agents engage in 10,000 spatial language games. Over time the population becomes increasingly successful in performing spatial language games on the objects in their environment. Agents invent a lexicon of spatial categories, as can be seen in the development of the number of spatial categories in the population. Agents also align their spatial lexicons, which is signified by an increase in interpretation similarity, a measure of how similar categories are across agents.

Discussion

Boroditsky, L. 2000. Metaphoric structuring: Understanding time through spatial metaphors. Cognition 75(1):1–28.
Carlson, L. A. 1999. Selecting a reference frame. Spatial Cognition and Computation 1(4):365–379.
Evans, N., and Levinson, S. C. 2009. The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32(05):429–448.
Gärdenfors, P. 2004. Conceptual spaces: The geometry of thought. MIT Press.
Garrod, S., and Doherty, G. 1994.
Conversation, co-ordination and convention: an empirical investigation of how groups establish linguistic conventions. Cognition 53(3):181–215.
Garrod, S., and Pickering, M. 2009. Joint action, interactive alignment, and dialog. Topics in Cognitive Science 1(2):292–304.
Li, P., and Gleitman, L. 2002. Turning the tables: language and spatial reasoning. Cognition 83(3):265–294.
Majid, A.; Bowerman, M.; Kita, S.; Haun, D.; and Levinson, S. C. 2004. Can language restructure cognition? The case for space. Trends in Cognitive Sciences 8(3):108–114.
Sperber, D., and Wilson, D. 1986. Relevance: Communication and cognition. Harvard University Press.
Spranger, M., and Loetzsch, M. 2011. Syntactic Indeterminacy and Semantic Ambiguity: A Case Study for German Spatial Phrases. In Steels, L., ed., Design Patterns in Fluid Construction Grammar, volume 11 of Constructional Approaches to Language. John Benjamins. 265–298.
Spranger, M., and Steels, L. 2012. Emergent Functional Grammar for Space. In Steels, L., ed., Experiments in Cultural Language Evolution. John Benjamins. 207–232.
Spranger, M.; Pauw, S.; Loetzsch, M.; and Steels, L. 2012. Open-ended Procedural Semantics. In Steels, L., and Hild, M., eds., Language Grounding in Robots. Springer. 153–172.
Spranger, M.; Loetzsch, M.; and Steels, L. 2012. A Perceptual System for Language Game Experiments. In Steels, L., and Hild, M., eds., Language Grounding in Robots. Springer. 89–110.
Spranger, M. 2008. World models for grounded language games. German diplom thesis, Humboldt-Universität zu Berlin.
Spranger, M. 2012a. The co-evolution of basic spatial terms and categories. In Steels, L., ed., Experiments in Cultural Language Evolution. John Benjamins. 111–141.
Spranger, M. 2012b. Potential stages in the cultural evolution of spatial language. In The Evolution of Language: Proceedings of the 9th International Conference (EVOLANG9).
Spranger, M. 2013. Evolving grounded spatial language strategies. KI - Künstliche Intelligenz 1–10.
Steels, L. 2012. Evolutionary language games as a paradigm for integrated AI research. In 2012 AAAI Spring Symposium Series, Designing Intelligent Robots.
Tenbrink, T., and Moratz, R. 2003. Group-based spatial reference in linguistic human-robot interaction. In Proceedings of EuroCogSci’03, The European Cognitive Science Conference, 325–330. Lawrence Erlbaum.
Tenbrink, T. 2007. Space, time, and the use of language: An investigation of relationships, volume 36 of Cognitive Linguistics Research. Berlin, DE: Walter de Gruyter.
Tinbergen, N. 1963. On aims and methods of ethology. Zeitschrift für Tierpsychologie 20(4):410–433.
Winograd, T. 1971. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. Ph.D. Dissertation, Massachusetts Institute of Technology.

A.I. has brought about many diverse techniques. Often these techniques solve particular problems well, and people then go on to apply them to every potential problem they can find. This has led to an explosion of machine learning techniques, such as Bayesian reasoning, applied to various problems including language processing. Probabilistic techniques are immensely important and have led to enormous progress. However, the dominant experience when building integrated A.I. systems that produce and parse, learn and interact in the real world is that there is no single mechanism that solves all problems. For instance, in language processing probabilistic techniques are useful if large data sets are available for offline training of very specific tasks. More open tasks, such as interactive systems, on the other hand also require the production of language as well as one-shot learning from minimal data input. Another example is dynamical systems theory and neural networks, which have been extremely influential in recent years. However, solutions involving neural networks often turn out to be complex even for simple tasks, non-scalable and hard to maintain.
We have learnt from our research that we can only tackle the hard problems in A.I. through an integrated approach that exploits and combines the state of the art in different areas of A.I. and blends it with novel, and sometimes old and forgotten, ideas.

Another important lesson that we took from the study of language, and of complex adaptive systems in general, is the importance of biological theories as the prime scientific framework for understanding the emergence of complexity. A.I. can benefit tremendously from the study of autonomous biological systems, both in scientific methodology and in concrete theories. Biology has come up with, and is currently developing, important tools for measuring complexity and for tracing the origins of complexity for specific traits and behaviours. A.I. should focus not only on achieving intelligence but also on the artificial evolution of intelligence. This is of course something pursued in A.I., but from our stance it receives far too little attention.

The research presented in this paper spans different areas of Artificial Intelligence. On the one hand, it integrates novel ideas to answer scientific questions about language processing, acquisition and evolution. On the other hand, while building these theories and hypotheses we are building artificial systems that are not only scientific tools but also implement the ideas and therefore lead to real-world interactive systems. We are currently applying our insights, software architectures and software systems in real-world applications and industrial products.

References

Barr, D. J. 2004. Establishing conventional communication systems: Is common knowledge necessary? Cognitive Science 28(6):937–962.
Beckner, C.; Blythe, R.; Bybee, J.; Christiansen, M. H.; Croft, W.; Ellis, N. C.; Holland, J.; Ke, J.; Larsen-Freeman, D.; and Schoenemann, T. 2009. Language is a complex adaptive system: Position paper. Language Learning 59:1–26.