Published in Proc. of Int. Conf. on Intelligent Robots and Systems (IROS) 2004

Core Technologies for Service Robotics

Niklas Karlsson, Mario E. Munich, Luis Goncalves, Jim Ostrowski, Enrico Di Bernardo, and Paolo Pirjanian
Evolution Robotics, Inc., Pasadena, California, USA
Email: {niklas,mario,luis,jim,enrico,paolo}@evolution.com

Abstract— Service robotics products are becoming a reality. This paper describes three core technologies that enable the next generation of service robots. The technologies are low-cost, make use of inexpensive hardware, and support a short time-to-market for product development. The first technology is an object recognition system, which the robot can use to interact with its environment. The second technology is a vision-based navigation system (vSLAM™), which can simultaneously build a map and localize the robot within it. The third technology is a flexible and rich software platform (ERSP™) that assists developers in the rapid design and prototyping of robotics applications.

I. INTRODUCTION

Products such as Sony's entertainment robot Aibo™, Electrolux's robotic vacuum cleaner Trilobite™, iRobot's Roomba™, and prototypes from numerous robotics research laboratories across the world forecast a rapid development of service robots over the next several years. In fact, the momentum created by these early products, combined with recent advances in computing and sensor technology, could lead to the creation of a major industry for service robots.

The applications of service robotics range from existing products, where robotic technologies are incorporated to enhance and improve product functionality, to new products, where robotic technologies enable completely new functionality. An example of the first case is vacuum cleaners, where the enhancement is reflected in a previously unseen degree of autonomy. An example of the second case is companionship robots. Other examples can be found in interactive entertainment and edutainment.

A necessary element in making the service robotics industry grow is the development of core robotic technologies that deliver flexibility and autonomy with low-cost hardware components. At Evolution Robotics (ER) we develop such technologies, and they are available as individual components at a cost suitable for the consumer market. This allows the robotics community to focus on value-added application development rather than on solving the core robotics problems, which in turn enables the industry to proliferate.

A major challenge in providing core technologies for service robots is that the technologies must be low-cost yet provide rich and flexible features. In particular, the technologies should enable a short time-to-market. At the same time, they must provide the robots with "intelligent" features, meaning that they should give the robots the ability to interact with the environment and to operate somewhat autonomously. For example, a robotic vacuum cleaner that retails at 199 USD can only cost about 60 USD in bill of materials. These 60 USD must cover all hardware components and assembly costs, and typically also packaging, manuals, and shipping. This constraint greatly limits the choice of sensors, actuators, and the amount of computation available for the autonomous capabilities of the product. Due to this cost constraint, most current robotic products that retail at a price acceptable to consumers provide very limited autonomous functionality.
For example, robotic vacuum cleaners such as the Roomba perform random navigation for floor coverage and rely on tactile sensing for collision detection and avoidance. However, random navigation for floor cleaning is not efficient, since it cannot guarantee complete coverage from wall to wall and from room to room within a reasonable power budget. Random coverage also cannot guarantee that the robot will be able to return to a home location, e.g., a docking station for self-charging. Localization can enable a robot to perform complete coverage on a close-to-optimal power budget, navigate from room to room to perform its cleaning task, and finally return to its docking station for recharging. Combining cost constraints with requirements on product durability, reliability, and power consumption, and with expectations of autonomous capabilities, makes creating successful products a major challenge.

This paper is organized as follows. Section II highlights some of the requirements placed on technologies used for service robotics. Each requirement is then addressed by proposing an appropriate core technology. Section III describes ER's object recognition system. This object recognition system is one component of a novel navigation system, which is described in Section IV. A competitive set of core technologies for service robotics also requires a rich and flexible software architecture; Section V describes such an architecture. A few examples of product concepts enabled by the proposed core technologies are described in Section VI. Finally, Section VII offers some concluding remarks.

II. REQUIREMENTS

The service robotics market imposes a complex set of requirements on potential products: it is a consumer market, so robots should be inexpensive; the market is highly competitive, so products should have a short time-to-market. At the same time, robots need to provide services in an "intelligent" way, so they should be autonomous, adaptive, and able to interact with the environment. This paper describes a set of solutions for these requirements that have been developed at Evolution Robotics, Inc.

The price sensitivity of the market creates the need for low-cost technologies. This is achieved first of all by basing the technologies primarily on hardware components that are relatively inexpensive, for example, cameras and IR sensors. Indeed, a web camera is two orders of magnitude less expensive than, e.g., a SICK™ laser range sensor.

In order for a robot to qualify as a service robot, it must also possess a high level of autonomy. It must be able to operate within its environment with a minimum amount of user interaction. This is achieved by providing a robust navigation system with both mapping and localization capabilities. Furthermore, to serve its purpose, a service robot must have the ability to interact with and make inferences about its environment. A strong object recognition system satisfies both of these requirements. By extracting relevant information from vision data, the service robot can make informed and intelligent decisions based on interaction with the environment and the people in it.

To be competitive, the core technologies for service robotics must also offer a short time-to-market: the time from a conceptual idea to a final product must be minimized. This translates into a requirement on the software platform used for development.
The software platform should provide an architecture that is rich and flexible. The architecture should be designed in a way that allows the developer to reconfigure the hardware without rewriting more than a very small amount of code. Some of the above requirements also imply that the core technologies must enable adaptivity of the system. For example, the navigation system must be able to deal with a changing environment.

III. ER VISION

The ER Vision object recognition system offered by Evolution Robotics is versatile and works robustly with low-cost cameras. It addresses the greatest challenge for object recognition systems: developing algorithms that reliably and efficiently perform recognition in realistic settings with inexpensive hardware and limited computing power.

The object recognition system can be used in applications such as navigation, manipulation, human-robot interaction, and security. It can be used in mobile robots to support navigation, localization, mapping, and visual servoing. It can also be used in machine vision applications for object identification and hand-eye coordination. Other applications include entertainment and education. The object recognition system provides a critical building block in applications such as reading children's books aloud, or automatically identifying and retrieving information about a painting or sculpture in a museum.

The object recognition system specializes in recognizing planar, textured objects. However, three-dimensional objects composed of planar, textured structures, or of slightly curved components, are also reliably recognized. Three-dimensional deformable objects, such as human faces, cannot be handled in a robust manner. The system can be trained to recognize objects using a single, low-cost camera. Its main strength lies in its ability to provide reliable recognition in realistic environments where lighting can change dramatically.

The following two steps are involved in using the Evolution Robotics object recognition algorithm:

Training: Training is accomplished by capturing one or more images of an object from various viewpoints. For planar objects, only the front and rear views are necessary. For 3-D objects, several views, covering all facets of the object, are necessary.

Recognition: The algorithm extracts up to 1,000 local, unique features for each object. A small subset of those features and their interrelation identifies the object. The object's name and its full pose up to scale with respect to the camera are the results of recognition.

The object recognition algorithm is based on extracting salient features from an image of an object [6]. Each feature is uniquely described by the texture of a small window of pixels around it. The model of an object consists of the coordinates of all these features along with each feature's texture description. When the algorithm attempts to recognize objects in a new image, it first finds features in the new image. It then tries to associate features in the new image with all the features in the database of models. This matching is based on the similarity of the feature texture. If many features in the new image have good matches to the same database model, that potential object match is refined. The refinement process involves the computation of an affine transform between the new image and the database model, so that the relative position of the features is preserved through the transformation. The algorithm outputs all object matches for which the optimized affine transform results in a small root-mean-square pixel error between the features found in the new image and the corresponding affine-transformed features of the original model.
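The matching and refinement steps can be summarized with the following sketch. It is not the ER Vision implementation, only a minimal illustration under simple assumptions: feature descriptors are fixed-length numpy vectors, feature coordinates are Nx2 arrays, and the helper names (match_features, fit_affine, rms_error) are hypothetical.

```python
import numpy as np

def match_features(new_desc, model_desc, max_dist=0.6):
    """Associate each feature in the new image with its most similar
    model feature, based on descriptor (texture) similarity."""
    matches = []
    for i, d in enumerate(new_desc):
        dists = np.linalg.norm(model_desc - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches

def fit_affine(src_pts, dst_pts):
    """Least-squares affine transform mapping model coordinates to
    image coordinates: [x', y'] = A [x, y] + t."""
    n = len(src_pts)
    M = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for k, ((x, y), (xp, yp)) in enumerate(zip(src_pts, dst_pts)):
        M[2 * k] = [x, y, 0, 0, 1, 0]
        M[2 * k + 1] = [0, 0, x, y, 0, 1]
        b[2 * k], b[2 * k + 1] = xp, yp
    params, *_ = np.linalg.lstsq(M, b, rcond=None)
    return params[:4].reshape(2, 2), params[4:]

def rms_error(A, t, src_pts, dst_pts):
    """Root-mean-square pixel error of the affine fit."""
    proj = np.asarray(src_pts) @ A.T + t
    return float(np.sqrt(np.mean(np.sum((proj - dst_pts) ** 2, axis=1))))
```

In practice the match set would also need outlier rejection (for example, a RANSAC-style sampling step) before the least-squares fit; the sketch omits this for brevity.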
Figure 1 shows the main characteristics of the recognition algorithm. The first row displays the two objects to be recognized, and the other rows present recognition results under different conditions. The main characteristics of the object recognition algorithm are summarized in the following list:

a) Invariant to rotation and affine transformations: The object recognition system recognizes objects even if they are rotated upside down (rotation invariance) or placed at an angle with respect to the optical axis (affine invariance). See the second and third rows of Figure 1.

b) Invariant to changes in scale: Objects can be recognized at different distances from the camera, depending on the size of the objects and the camera resolution. Recognition works reliably from distances of several meters. See the second and third rows of Figure 1.

Fig. 1. Examples of object recognition under a variety of conditions. The first row presents the objects to be recognized; the other rows display recognition results.

c) Invariant to changes in lighting: The object recognition system can handle changes in illumination ranging from natural to artificial indoor lighting. The system is insensitive to artifacts caused by reflections or back-lighting. See the fourth row of Figure 1.

d) Invariant to occlusions: The object recognition system can reliably recognize objects that are partially blocked by other objects, and objects that are only partially in the camera's view. The amount of occlusion tolerated is typically between 50% and 90%, depending on the object and the camera quality. See the fifth row of Figure 1.

e) Reliable recognition: The object recognition system has an 80-95% success rate in uncontrolled settings. A 95-100% recognition rate is achieved in controlled settings.

The recognition speed is a logarithmic function of the number of objects in the database; i.e., it is proportional to log(N), where N is the number of objects in the database. The object library can therefore store hundreds or even thousands of objects without a significant increase in computational requirements. The recognition frame rate depends on CPU power and image resolution. For example, the recognition algorithm runs at 5 frames per second (fps) at an image resolution of 320x240 on an 850 MHz Pentium III processor, and at 3 fps at 80x66 on a 100 MHz MIPS-based 32-bit processor. Reducing the image resolution decreases the image quality and, ultimately, the recognition rate. However, the object recognition system degrades gracefully with decreasing image quality. Each object model requires about 40 KB of memory.
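As a purely illustrative summary of the training and recognition workflow (not the ERSP Vision API), the sketch below reuses the hypothetical match_features, fit_affine, and rms_error helpers from the earlier listing; detect_features stands in for a salient-feature extractor in the spirit of [6], and the two thresholds are assumed values.

```python
MAX_RMS_ERROR = 3.0   # assumed pixel-error threshold, not a published figure
MIN_MATCHES = 10      # assumed minimum number of matched features

def detect_features(img):
    """Placeholder: return (Nx2 keypoint coordinates, NxD descriptors).
    A real system would use a salient-feature detector (cf. [6])."""
    raise NotImplementedError

def train(images_by_name):
    """Build a model database with one entry per training view of an object."""
    db = []
    for name, images in images_by_name.items():
        for img in images:
            pts, desc = detect_features(img)
            db.append({"name": name, "pts": pts, "desc": desc})
    return db

def recognize(img, db):
    """Return (object name, affine pose) for every model that fits well."""
    pts, desc = detect_features(img)
    results = []
    for model in db:
        matches = match_features(desc, model["desc"])
        if len(matches) < MIN_MATCHES:
            continue
        src = model["pts"][[j for _, j in matches]]
        dst = pts[[i for i, _ in matches]]
        A, t = fit_affine(src, dst)
        if rms_error(A, t, src, dst) < MAX_RMS_ERROR:
            results.append((model["name"], (A, t)))
    return results
```

Note that this sketch scans the model database linearly; the actual system indexes features so that recognition time grows only logarithmically with the number of stored objects.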
IV. VSLAM

Evolution Robotics' Visual Simultaneous Localization and Mapping (vSLAM™) system is robust and requires no initial map. It is the core navigation functionality of the Evolution Robotics Software Platform (ERSP™), and it enables simultaneous localization and mapping using a low-cost camera and wheel encoders as the only sensors. Because a web camera is used instead of a SICK laser range sensor or another expensive range-measuring device, the vSLAM technology offers a dramatic cost reduction. With vSLAM, a service robot can operate in a previously unknown environment, building a map and localizing itself in that map.

Relying on visual measurements rather than range measurements when building maps allows vSLAM to deal with cluttered spaces easily. Range-based SLAM techniques, which rely on laser scanners and similar devices, often fail in cluttered environments. The vSLAM algorithm handles dynamic changes in the environment well, including lighting changes and moving objects and people, simply because it possesses such a wealth of information about the created landmarks.

Figure 2 illustrates the inputs and outputs of vSLAM. The inputs to the algorithm are dead-reckoning information (for example, wheel encoder odometry) and images acquired by a camera. The main components of vSLAM are Visual Localization, SLAM, and the Landmark Database.

Fig. 2. Block diagram of vSLAM.

In the Visual Localization module, the object recognition system described in Section III is used to define new natural landmarks and to recognize previously defined ones. The landmarks are used for localization of the robot. In particular, when a previously defined landmark is recognized, a relative measurement of the robot pose is computed with respect to the pose of the robot when the landmark was first defined. Some related work is described in [6], [8], [11].

In the SLAM module, the relative measurements are fused with odometry data to update the robot pose and the poses of all the landmarks. For related work, see for example [3], [5], [7], [9], [10]. The SLAM algorithm is designed to track multiple hypotheses and to maximize robustness to dramatic disturbances such as kidnapping, the scenario where the robot is lifted and moved to a new location without notification.

In the Landmark Database module, information on the landmarks in the vSLAM map is stored and updated. For example, the unique landmark identifiers are recorded, as well as estimates of the landmark poses and uncertainty measures of those estimates.

When entering a new area or environment, the robot can either load an existing map of that area or immediately start creating a new map with a new set of landmarks. During its operation, the robot will attempt to create new visual landmarks. A visual landmark is a collection of unique features extracted from an image. Each landmark is associated with an estimate of where the robot was located (the x, y, and heading coordinates) when the landmark was created. As the robot traverses the environment, it continues creating new landmarks and correcting its odometry as necessary, using these new landmarks and the corresponding map. vSLAM constantly improves the robot's knowledge about its environment. For example, new landmarks are automatically created if the robot has been moving for too long without recognizing any known landmarks in its database. This enables the robot to constantly refine its understanding of its environment and increase its confidence in where it is located.

In order to understand the limitations of vSLAM, it is necessary to understand the different steps of the vSLAM algorithm (a code sketch of this loop follows the list):

1) Traverse the environment, collect odometry, and acquire an image.
2) Process the image to extract visual features.
3) If the number of visual features is sufficiently large and the features are distinct, then:
   a) Determine whether the features correspond to a previously stored landmark.
   b) If the features do correspond to a previously stored landmark, then:
      i) Compute a visual measurement of where the robot is located now, relative to where it was located when the landmark was created.
      ii) Based on the visual measurement, the current map, and the collected odometry, improve the localization of the robot.
      iii) Based on the visual measurement, the localization, and the collected odometry, refine the estimated locations of the landmarks.
   c) If the features do not correspond to a previously stored landmark, then try to create a new landmark. In order to create a new landmark, some of the same visual features must be detected in three consecutive image frames.
      i) Assign a unique landmark number to the features, and add this landmark to the previously created landmarks.
      ii) Initialize the landmark pose based on odometry information.
4) Return to step 1.
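The following is a minimal, hypothetical sketch of this loop, not the actual vSLAM or ERSP API. All helper calls (propagate_odometry, extract_features, match_landmark, the SLAM-filter updates, and create_landmark) are placeholders, the feature-count threshold is an assumed value, and the three-consecutive-frames rule is simplified to a counter.

```python
from dataclasses import dataclass, field

MIN_FEATURES = 30   # assumed threshold for "sufficiently many" features

@dataclass
class VSlamState:
    pose: tuple                                      # (x, y, heading) estimate
    landmark_db: list = field(default_factory=list)  # created landmarks
    pending: list = field(default_factory=list)      # frames awaiting landmark creation

def vslam_step(state, odometry_delta, image):
    """One pass through steps 1-4 of the loop above (placeholder helpers)."""
    state.pose = propagate_odometry(state.pose, odometry_delta)     # step 1
    features = extract_features(image)                              # step 2
    if len(features) < MIN_FEATURES:                                 # step 3
        return state                       # pose remains odometry-only
    landmark = match_landmark(features, state.landmark_db)          # step 3a
    if landmark is not None:                                         # step 3b
        meas = relative_pose_measurement(features, landmark)        #   (i)
        state.pose = slam_update_pose(state, meas, odometry_delta)  #   (ii)
        slam_update_landmarks(state, meas, odometry_delta)          #   (iii)
    else:                                                            # step 3c
        state.pending.append(features)
        if len(state.pending) >= 3:   # simplification of the 3-frame rule
            state.landmark_db.append(                                #   (i)
                create_landmark(state.pending, state.pose))          #   (ii)
            state.pending.clear()
    return state                                                     # step 4
```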
Note that if the environment does not contain a sufficient number of visual features, then steps 3a through 3c will never be executed. Hence, no landmarks will be created and no visual measurements will be computed. vSLAM will still produce pose estimates, but those estimates will be exactly what is obtained from wheel odometry. This scenario is rarely experienced, but it is most likely to happen in environments that are free from furniture, decorations, and texture.

The vSLAM algorithm is typically good at detecting and correcting for kidnapping. Note that dramatic slippage and collisions may be treated as kidnapping events and must be handled by any reliable navigation technology. vSLAM recovers quickly from kidnapping once the map has become dense with landmarks. Early in the mapping procedure, it is very important to avoid kidnapping and disturbances in general.

Figure 3 shows the result of vSLAM after the robot has traveled around a typical two-bedroom apartment. The robot was driven along a reference path (this path is unknown to the SLAM algorithm). The vSLAM algorithm builds a map consisting of landmarks, marked with blue circles in the figure. The corrected robot path, which uses a combination of visual features and odometry, provides a robust and accurate position estimate for the robot, as shown by the red path in the figure.

Fig. 3. Example result of SLAM using vSLAM. Red path (darker gray): vSLAM estimate of the robot trajectory. Green path (lighter gray): odometry estimate of the robot trajectory. Blue circles: vSLAM landmarks created during operation.

The green path (odometry only) is obviously incorrect, since, according to this path, the robot traverses walls and furniture. The red path (the vSLAM-corrected path), on the other hand, consistently follows the reference path.

Based on experiments in various typical home environments on different floor surfaces (carpet and hardwood floor), and using different image resolutions (Low: 320 x 280, High: 640 x 480), the following robot localization accuracy was achieved:

Floor surface         Carpet   Carpet   Hardwood   Hardwood
Image resolution      Low      High     Low        High
Median error (cm)     13.6     12.9     12.2       11.5
Std dev error (cm)    9.4      10       7.5        8.3
Median error (deg)    4.79     5.97     5.28       3.47
Std dev error (deg)   7.99     7.03     6.62       5.41

V. EVOLUTION ROBOTICS SOFTWARE ARCHITECTURE

The Evolution Robotics Software Platform provides basic components and tools for rapid development and prototyping of robotics applications. The software architecture, which is the infrastructure and one of the main components of ERSP, addresses the issue of short product development cycles (time-to-market). The design of ERSP is modular and allows applications to use only the required portions of the architecture.
Figure 4 shows a diagram of the software structure and the relationships among the software, the operating system, and applications.

Fig. 4. ERSP structure and relation to application development.

There are five main blocks in the diagram. Three of them (Applications, OS & Drivers, and 3rd Party Software) are components external to ERSP. The other two blocks correspond to subsets of ERSP: the core technology libraries (left-hand block) and the implementation libraries (center block). The core technology libraries consist of the following components:

• Architecture: The Architecture component provides a set of Application Programmer's Interfaces (APIs) for integrating all the software components with each other and with the robot hardware. The infrastructure consists of APIs to interact with the hardware, to build task-achieving modules that can make decisions and control the robot, to orchestrate the coordination and execution of these modules, and to control access to system resources. The three primary components of the Architecture are the Hardware Abstraction Layer (HAL), the Behavior Execution Layer (BEL), and the Task Execution Layer (TEL).

• Vision: The Vision APIs correspond to the object recognition algorithm described in Section III.

• Navigation: The Navigation APIs provide mechanisms for controlling the movement of the robot. These APIs provide access to modules for mapping and localization (described in Section IV), exploration, path planning, obstacle avoidance, occupancy grid mapping, target following, and teleoperation control.

• Interaction: The Interaction APIs support building user interfaces for applications, including graphical user interfaces, voice recognition, and speech synthesis.

The architecture is composed of three layers and corresponds to a mixed architecture in which the first two layers follow a behavior-based philosophy [1], [2] and the third layer incorporates a deliberative stage for planning and sequencing [4]. The implementation libraries use the interfaces defined by the architecture to implement the three layers for particular hardware and purposes.

The first layer, HAL, provides interfaces to the hardware devices and hides low-level operating system (OS) dependencies. This ensures portability of the Architecture and of application programs to other robots and computing environments. At the lowest level, HAL interfaces with device drivers, which communicate with the hardware devices through a communication bus. The descriptions of the resources, devices, and buses, their specifications, and the corresponding drivers are managed through configuration files based on the Extensible Markup Language (XML).

The second layer, BEL, provides infrastructure for the development of modular robotic competencies, known as behaviors, for achieving tasks with a tight feedback loop, such as following a trajectory, tracking a person, or avoiding an object. Behaviors are the basic, reusable building blocks from which robotic applications are built. BEL also provides techniques for coordinating the activities of behaviors, for conflict resolution, and for resource scheduling. Behaviors are organized into a behavior network to achieve more complex goals, such as avoiding obstacles while following a target. Behavior networks are executed synchronously at a rate set by the user.
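As an illustration only (this is not the ERSP BEL API), the following sketch mimics a behavior network in which a sensory behavior feeds an actuation behavior and a manager executes them synchronously in dependency order; the sensor reading is simulated.

```python
import random

class Behavior:
    """Illustrative behavior: declares input/output ports and a compute step."""
    inputs, outputs = (), ()
    def compute(self, data):
        raise NotImplementedError

class RangeSensorBehavior(Behavior):
    outputs = ("range",)
    def compute(self, data):
        # Stand-in for a real IR range reading obtained through HAL.
        return {"range": random.uniform(0.1, 1.0)}

class AvoidObstacleBehavior(Behavior):
    inputs = ("range",)
    outputs = ("velocity",)
    def compute(self, data):
        # Stop if an obstacle is closer than 0.3 m, otherwise cruise.
        return {"velocity": 0.0 if data["range"] < 0.3 else 0.2}

class BehaviorManager:
    """Runs a behavior network synchronously at a fixed rate; sensory
    behaviors (producers) are ordered before actuation behaviors (consumers)."""
    def __init__(self, behaviors, rate_hz=10.0):
        # Crude stand-in for a topological sort of the port dependencies.
        self.behaviors = sorted(behaviors, key=lambda b: len(b.inputs))
        self.period = 1.0 / rate_hz

    def run_once(self):
        data = {}
        for b in self.behaviors:
            data.update(b.compute(data))
        return data

network = BehaviorManager([AvoidObstacleBehavior(), RangeSensorBehavior()])
print(network.run_once())   # e.g. {'range': 0.72, 'velocity': 0.2}
```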
BEL defines a common and uniform interface for all behaviors, the basic protocols for interaction among behaviors, and the order of execution of the behaviors in a behavior network. The user has the flexibility of selecting the communication ports and the data transfer protocols between behaviors, as well as deciding the internal computation of each behavior. A typical behavior network contains a number of sensory behaviors and a number of actuation behaviors. The outputs of the sensory behaviors are fed as inputs to the actuation behaviors, which then control the robot according to the sensory data. By analyzing the flow of behavior inputs and outputs, the Behavior Manager automatically orders the execution of the behaviors in the network so that sensory behaviors execute before actuation behaviors. The coordination of behaviors is transparent to the user.

Finally, the third layer, TEL, provides infrastructure for developing goal-oriented tasks, along with mechanisms for coordinating task execution. Tasks can run in sequence or in parallel. Execution of tasks can also be triggered by user-defined events. Events are conditions or predicates defined on values of variables within BEL or TEL. Complex events can be defined by logical expressions of basic events. While behaviors are highly reactive and are appropriate for creating robust control loops, tasks are a way to express higher-level execution knowledge and to coordinate the actions of behaviors. Tasks can run asynchronously using event triggers or synchronously with other tasks. Time-critical modules such as obstacle avoidance are typically implemented in the BEL, while tasks implement functionality that is not required to run at a fixed execution rate. Tight feedback loops that control the actions of the robot according to perceptual stimuli (presence of obstacles, detection of a person, etc.) are typically implemented in the BEL; behaviors tend to be synchronous and highly data-driven. TEL is more appropriate for complex control flow that depends on context and on conditions that can arise asynchronously; tasks tend to be asynchronous and highly event-driven. A small sketch of event-triggered tasks is given below.
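The sketch below is purely illustrative (it is not the ERSP TEL API): an event is modeled as a predicate over shared state, complex events are built from logical expressions of basic events, and a task runs when its trigger fires.

```python
class Event:
    """An event is a predicate over shared state."""
    def __init__(self, predicate):
        self.predicate = predicate
    def __and__(self, other):
        # Complex events are logical expressions of basic events.
        return Event(lambda s: self.predicate(s) and other.predicate(s))

class Task:
    """A goal-oriented task that runs when its trigger event fires."""
    def __init__(self, action, trigger=None):
        self.action, self.trigger = action, trigger
    def maybe_run(self, state):
        if self.trigger is None or self.trigger.predicate(state):
            self.action(state)

# Example: dock for recharging once the battery is low and coverage is done.
battery_low = Event(lambda s: s["battery"] < 0.2)
coverage_done = Event(lambda s: s["coverage"] > 0.95)
dock_task = Task(action=lambda s: s.update(goal="dock"),
                 trigger=battery_low & coverage_done)

state = {"battery": 0.15, "coverage": 0.97}
dock_task.maybe_run(state)   # sets state["goal"] = "dock"
```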
VI. PRODUCT CONCEPTS

Product conceptualization is the step that takes a given technology and designs an application around it. This section describes a few product concepts enabled by the technologies described in this paper:

• the use of object recognition for command and control of robots, and
• the use of vSLAM for efficient vacuuming by robotic vacuum cleaners.

An important capability of a service robot is its ability to interact with people in a natural, friendly, and robust manner. Traditionally, speech-based interaction has been the natural choice in a variety of designs. However, the shortcomings of speech recognition in terms of robustness and of accuracy with respect to user location render this technology unreliable for robotic applications. Therefore, we propose the use of vision-based object recognition to communicate commands to the robot. Commands are issued by showing predefined cards to the robot. The robustness of the recognition algorithm described above makes it an excellent technology for reliable recognition of the cards. The idea of commanding a robot with cards was initially prototyped on Evolution Robotics' ER2™ robots. It is now productized in the latest version of Sony's AIBO™ robotic dog, the ERS-7. This robot is equipped with a set of LEDs to provide visual feedback to the user: each time the robot recognizes a card, a visual pattern is displayed and a sound is played.

A second important capability of a service robot is its ability to navigate in an unknown environment. For example, an autonomous vacuum cleaner must be able to estimate its own position in order to sweep the floor efficiently. Due to the complexity of the problem, most autonomous vacuum cleaners today avoid dealing with the localization problem. Instead, they typically operate under some random-walk strategy. This is necessarily an inefficient way of achieving coverage, since many spots will be covered multiple times before the last spot is covered. There is also no mechanism to guarantee that all of the environment is ever covered. The use of vSLAM for mapping and localization in support of efficient sweeping has been prototyped on various robot platforms at Evolution Robotics and has been demonstrated to dramatically improve coverage and efficiency. Experiments have shown that vSLAM offers an increased degree of autonomy and "intelligence" to vacuum cleaners and to service robots in general.

VII. CONCLUSIONS

We have described three core technologies developed by Evolution Robotics that we believe constitute key components of the next generation of service robots. They satisfy the typical requirements on service robotics: low cost, short time-to-market, and the reliability to perform in real-world settings. The basic requirement is that these technologies must be low-cost in order to be employed in service robots. At the same time, they must enable the products to have a short time-to-market and provide the products with "intelligent" features. The proposed technologies have been developed at Evolution Robotics; the cost requirement is met by building on low-cost hardware components such as cameras and IR sensors.

• ER Vision: The object recognition system gives the robot the ability to interact with and make inferences about its environment.

• vSLAM: The vision-based navigation system makes it possible for a robot to simultaneously build a map and localize itself in the map. In particular, the vSLAM technology does not require an initial map; it builds a map from scratch.

• ERSP: The proposed software architecture addresses the issue of short product development cycles. The Evolution Robotics Software Platform provides basic components and tools for rapid development and prototyping of robotics applications. The design of ERSP is modular and allows applications to use only the required portions of the architecture. It is designed in a way that allows the developer to reconfigure the hardware without rewriting more than a very small amount of code.

REFERENCES

[1] Arkin, R. C., Behavior-Based Robotics, MIT Press, 1998.
[2] Brooks, R. A., A Robust Layered Control System for a Mobile Robot, IEEE Journal of Robotics and Automation, March 1986.
[3] Fox, D., Burgard, W., and Thrun, S., Markov localization for mobile robots in dynamic environments, Journal of Artificial Intelligence Research, 11, pp. 391-427, 1999.
[4] Gat, E., On Three-Layer Architectures, in D. Kortenkamp et al., eds., AI and Mobile Robots, AAAI Press, 1998.
[5] Gutmann, J.-S. and Konolige, K., Incremental Mapping of Large Cyclic Environments, Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), San Francisco, CA, 2000.
[6] Lowe, D., Local feature view clustering for 3D object recognition, Proc. of the 2001 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Kauai, Hawaii, USA, December 2001.
[7] Roumeliotis, S. I. and Bekey, G. A., Bayesian estimation and Kalman filtering: A unified framework for mobile robot localization, Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 2985-2992, San Francisco, CA, 2000.
[8] Se, S., Lowe, D., and Little, J., Local and global localization for mobile robots using visual landmarks, Proc. of the 2001 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Maui, Hawaii, USA, October 2001.
[9] Thrun, S., Fox, D., and Burgard, W., A probabilistic approach to concurrent mapping and localization for mobile robots, Machine Learning, 31:29-53, 1998.
[10] Thrun, S., Robotic Mapping: A Survey, Technical Report CMU-CS-02-111, Carnegie Mellon University, Pittsburgh, PA, USA, February 2002.
[11] Wolf, J., Burgard, W., and Burkhardt, H., Robust vision-based localization for mobile robots using an image retrieval system based on invariant features, Proc. of the 2002 IEEE Int. Conf. on Robotics and Automation (ICRA), Washington, DC, USA, May 2002.