MEETING ON BUILDING COGNITIVE SYSTEMS
LUXEMBOURG, JULY 2, 2002

Contributors
HILARY BUXTON, UNIV. OF SUSSEX (UK)
JAN-OLOF EKLUNDH, KTH (S)
GÖSTA GRANLUND, LINKÖPING UNIV. (S)
BERNHARD NEBEL, ALBERT-LUDWIGS-UNIVERSITÄT FREIBURG (D)
SAJIT RAO, UNIV. OF GENOA (I)
DAVID VERNON, CAPTEC (IRL)

EC members
HORST FORSTER
GIOVANNI BATISTA VARILE
COLETTE MALONEY (RAPPORTEUR)

1. INTRODUCTION

One could claim that ubiquitous computing is already here – each year we produce more than one processor chip for each person on the planet and the growth rate of chip production is greater than the population growth rate. It might be harder to make a similar claim about ambient intelligence.

Central to the ambient intelligence vision is the ability of computationally empowered devices to interconnect with each other and with us. Using sensors to provide a window from the world of interconnected computation into the real physical world, these devices will sense the world around us and respond by interacting with the world or by communicating with us. These devices will need to combine visual and auditory capabilities to sense what is happening in the world and to understand and engage in dialogue with humans – preferably without the need for keyboards and mice. This is not just a matter of understanding commands, but also of understanding context.

How do we build systems that can interact with the real world in an intelligent and reasoned way? Until now we have built systems that can operate in restricted domains or in carefully controlled environments, i.e., in artificially constrained worlds, where models can be constructed with sufficient complexity to allow algorithms to perform well. At the other extreme, we have been able to build systems that do not rely on explicit models, but rather react to the real world. Such systems require the programmer to anticipate all possible situations that might be encountered.
Systems that respond purposefully to the real world, rather than just reacting, must operate between these extremes. Such systems must combine reaction with the ability to explicitly represent heuristics, and to learn and reason in pursuit of goal-directed behaviour. Construction of these ‘cognitive’ systems will require:

- integration of technologies that have evolved from several disciplines in order to provide intelligent reasoning capabilities and the versatility that is needed to interpret and interact with the real world [1];
- integration of information from multiple sensors and multiple cues in order to allow constraints imposed by the real world (biological limitations, physical laws, ...), including information about context, to provide the robustness needed for practical applications.

However, constructing and maintaining a coherent world model from the contributions of a variety of sensors in a perceptual system is as yet a largely unsolved problem.

The goal of developing cognitive systems, i.e., systems that can perceive and act, reason and learn, and that are capable of interpretation and interaction in the real-world environment, is not new. What makes this goal worth pursuing now is that the computing power available today can support the development of systems that can operate under real-time constraints [2]. A technology of cognitive systems will enable progress in recognition and categorization of objects, interpretation of activity and behaviour, visual guidance and navigation, and speech communication with systems.

The goal of this workshop was to review the current status of research, to identify challenges and opportunities for progress, and to recommend where future research efforts could focus. The timeframe is from now until 2012.

2. STATUS OF RESEARCH TODAY

2.1 Cognitive systems

Perception is fundamental to cognitive systems. Perception provides information about the environment in which the cognitive system exists.
Visual perception is a particularly powerful sensing modality with many uses. However, several other perceptual channels are of interest, including auditory and tactile perception and chemical senses such as taste and smell.

Cognitive systems will be characterised by their ability to learn adaptively in real time from the perceptual input in order to perform specific goal-directed tasks. This ability to acquire new knowledge and adapt existing knowledge to new circumstances provides a means of dealing with the unrestricted real-world environment and of using generalised concepts across application domains. Acquisition of knowledge – or learning – by autonomous interactions with the environment will enable cognitive systems to perform tasks in ways that were not conceived of in their design.

Many researchers advocate that cognitive systems should be physically embodied and derive information from several perceptual modalities. Such systems must be able to act on the world. Most importantly, they should develop as complete systems.

Cognitive systems will be adaptive and anticipatory, robust and autonomous, interactive and dynamic. They will be diverse in form and function. They are expected to play a key enabling role in applications ranging from image interpretation (e.g. in medical or aerospace domains), behavioural interpretation (e.g. in crowd surveillance or traffic monitoring) and human-machine interaction (speech and activity recognition) to autonomous mobile robots working in remote or hazardous environments. Of particular interest are, on the one hand, systems that will allow humans to interact with machines in a more natural way and, on the other, systems that will support humans in performing tasks that are tedious, difficult or beyond their capabilities.

[1] computer vision, natural language processing, artificial intelligence, mathematics, neuroscience, robotics, …
[2] cf. Meeting on Cognitive Vision Systems, June 21, 2000
2.2 Perception

The purpose of perceptual processing is to produce a response. The response may be an action upon the environment. It may be to reconfigure the system’s internal models of interaction according to the context (current state of the environment). Or it may be to generate, in a subsequent step, a generalised symbolic representation, which will allow the functional context to be communicated. The functional context is important as we rarely use representations in an intentional vacuum – we always have goals.

Representations of context must go beyond mapping of percepts to linguistic descriptions, as a purely descriptive basis for understanding will not lead to the development of cognitive systems. Rather, the representation must be grounded in perception. Cognitive systems must thus be able to act as well as perceive, and they must be developed in a full perception-action feedback cycle.

It is expected that high performance systems for interpretation of static imagery would also be developed as cognitive systems in a perception-action feedback process. Thus cognitive systems do not necessarily have to perform physical actions in the external world at run-time, but may operate off-line. The output need not be a physical action; it can instead be a communication of the intended actions to another system. Thus, cognitive systems can also be useful in applications which do not require advanced mechanical manipulators. One such important application field is activity interpretation for human-machine interaction.

2.3 Outlook

Current successful applications of perceptual systems are primarily in well understood worlds with limited complexity. Applications involving visual perception today include video sequence analysis, visual surveillance, man-machine interaction, and visual inspection. For the most part, these applications are achieved using pattern recognition techniques with little or no cognitive ability.
Research in psychophysics and neuroscience today devotes considerable effort to problems in object recognition and categorization. The AI community is oriented towards higher level symbolic processing such as reasoning and planning. In the learning community there is limited attention to vision applications, partly due to the limited capacity of today’s learning architectures. Progress will require an integration of insights from these disciplines, which have so far acted in relative isolation.

3. SIGNIFICANT ADVANCES / BREAKTHROUGHS OVER THE NEXT 5 – 10 YEARS

Context determines how to interpret sensory data. The interpretation of a percept, whether as a set of pixels or some other sensory data, depends on context. The representation and acquisition of context is an extremely important issue. The extension of current recognition techniques to interpretation is not likely to work without the use of context to limit and guide interpretation.

The real vision/perception/cognition problem is not to generate descriptions of shape or models for this, but to robustly map percepts into action, function and behaviour. We need to classify things according to what they can be used for or which goals they can help us achieve. This is necessarily context- or application-specific.

A significant but necessary advance with respect to present day capabilities will be the recognition of objects under relatively unrestricted conditions, such as variable illumination, pose, scale, orientation against a structured background and partial occlusion.

Greatly improved systems for speech recognition and synthesis will be of utmost importance to support the development as well as the operation of applications in cognitive vision. The interfacing of language systems to cognitive systems is an important research challenge.
In general terms this implies the removal or the insertion of detailed, system-specific context, to produce or receive symbols sufficiently invariant to be communicable with another system.

Efficient training environments will be needed for increasingly complex systems, as training will supplement algorithmic prescription. This will necessarily require the construction of complete (probably hybrid symbolic-perceptual) systems to facilitate the study of learning and emergent behaviour. One could argue that machines with the semantics of a human must be trained as if they were human.

A key issue will be to achieve behavioural plasticity, i.e., the ability of an embodied system to do a task which it was not explicitly programmed to do.

4. MAIN DIFFICULTIES / CHALLENGES

Many basic problems remain. From a computational perspective:

- closing the loop in realistic test cases, i.e. building complete systems that can deal with non-trivial cases;
- developing the underlying semantics for action (grounding language in perception);
- combining perceptual and symbolic processes for interpretation of events and generation of new behaviour;
- speed (of processing, memory access, learning, overall system).

From the perspective of managing complexity and achieving a balance between distributed and centralised control:

- information representations which are sufficiently adaptable (e.g. generic vs specific) to allow effective communication to be established between system parts;
- obtaining a coherent global behaviour from the interactions of all the system parts (ultimately hand-tuned and unscalable).

5. TARGETS FOR R + D IN EUROPE

Computer vision has emerged as a well defined domain with roots in geometry, statistics, signal processing and informatics. Cognitive science is an experimental field and uses methods from psychophysics and neuroscience. Part of cognition deals with the processes going from percepts to actions, which require numerical representations.
Much of the exploration of human cognition does, however, involve what should be viewed as symbolic levels of representation and processing, such as categorization, reasoning and use of language. A skillful integration of these very different domains is essential.

Targets for interdisciplinary research include:

- investigation of evolutionary computation and machine learning as ways to implement cognition;
- work on extending learning algorithms;
- on-line (incremental) learning of conceptual models to allow systems to adapt continuously in real time to task and environment;
- studying the interplay of overt and covert attention for vision tasks (biological vision);
- developing the mathematics of interaction invariance to model attention and cognition;
- multimodal interaction, i.e., between symbolic speech structure and cognitive object structure (psycholinguistics).

A major target is to develop systems whose performance is extendable: systems that can do more than “canned” predefined tasks. Operation in relatively unconstrained environments will require systems to be capable of handling new task specifications rapidly or on the fly. This is accomplished by learning. More efficient information representations are needed to facilitate learning:

- locality of information representations to allow faster convergence in learning;
- representation of confidence or certainty;
- representation and handling of sparse and incomplete data.

It is worth stressing that work should be performed in a systems perspective: there should be a system that "perceives". Inspiration for how this can be done can be obtained both from computer and systems science, and from biology. Two specific targets in this context:

- designing a system with well-developed memory capacities (short-term, long-term, iconic, associative, ...) that can be used in multiple applications;
- use and assessment of context and situations as well as system drives and tasks.
In summary, the key to progress lies in the ability of systems to acquire information about the world through sensory channels and, by combining the perceptual input with computation, to extract the knowledge needed to perform tasks. New behaviour must be driven by knowledge acquired through interaction. This in turn requires a developmental paradigm for managing emergent behaviour. The only cognitive systems that we know develop as complete systems. We need to build systems!