Dialog with Robots: Papers from the AAAI Fall Symposium (FS-10-05)

Toward Integrating Natural-HRI into Spoken Dialog

Takayuki Kanda
ATR Intelligent Robotics and Communication Laboratory
2-2-2 Hikaridai, Keihanna Science City, Kyoto, Japan
kanda@atr.jp

Abstract

This paper summarizes our previous works on modeling non-verbal behaviors for natural human-robot interaction (HRI) and discusses a path for integrating them into spoken dialogs. While some non-verbal behaviors can be considered "optional" elements to be added to a spoken dialog, others substantially require a harmonized plan that simultaneously considers both the spoken dialog and the non-verbal behavior. The paper discusses such unique HRI features.

Introduction

In HRI, studies have often focused on the use of the robot's physical entity, e.g., its human-like body properties. Studies have revealed the importance of a robot's non-verbal behaviors (Breazeal, et al., 2005). Proxemics (position and distance) has also been considered for social situations (Nakauchi and Simmons, 2000; Dautenhahn, et al., 2006). Gaze has been used to provide feedback (Nakano, et al., 2003), to maintain engagement (Sidner, et al., 2004; Rich, et al., 2010), to adjust the conversation flow (Mutlu, et al., 2006; Mutlu, et al., 2009a), and to communicate the robot's attention (Mutlu, et al., 2009b). Pointing is another useful gesture for indicating a conversation's target objects (Kuzuoka, et al., 2000; Scassellati, 2002; Okuno, et al., 2009). These findings are devoted to making interaction with robots as natural as interaction among humans, i.e., natural-HRI.

In contrast, few studies address the integration of natural-HRI and spoken dialogs. It remains quite challenging both to interpret the user's actions and to express the robot's message. Regarding interpretation capability, working in noisy environments is already a major issue (Roy, et al., 2000; Ishi, et al., 2008). People talk casually (Kriz, et al., 2010), which requires advancements in speech recognition techniques (Cantrell, et al., 2010). Pioneering work has been conducted on cognitive architectures (Kramer, et al., 2007; Trafton, et al., 2008). Regarding expression capability, so far only a few pioneering studies have integrated robot non-verbal behaviors into spoken dialog architectures (Nakano, et al., 2005; Nishimura, et al., 2007; Spexard, et al., 2007; Salem, et al., 2009). Often in such architectures, dialog planners generate multimodal expressions in which non-verbal behaviors are embedded in the appropriate parts of a spoken dialog, while planning is conducted for the spoken dialog itself. In other words, in this approach, non-verbal behaviors are considered "optional" elements to be added to spoken dialogs. This view seems appropriate for many non-verbal behaviors, particularly those used in virtual agents (Prendinger, et al., 2004; Vilhjalmsson, et al., 2007). However, since robots are physically co-located with the people with whom they are interacting, many non-verbal behaviors require substantial integration when a system plans the robot's spoken dialog. This paper introduces our previous works on modeling non-verbal behaviors for HRI and discusses three features, initiation, attention, and position, which must be further studied to integrate natural-HRI into spoken dialog planning.

Modeling non-verbal behaviors for natural-HRI

This section summarizes our previous works on modeling non-verbal behaviors for natural human-robot interaction (HRI).
Since the details are reported elsewhere, we only briefly introduce them here to provide concrete examples for the discussion of connecting such natural-HRI studies with a spoken dialog system.

Deictic interaction

When we talk about objects, we often engage in deictic interaction, using reference terms such as this and that together with pointing gestures. Such deictic interaction often happens when we start to talk about new things that are outside our shared attention (McNeill, 1987). In (Sugiyama, et al., 2006), we modeled the use of the Japanese reference terms kore, sore, and are, learning the model from human behaviors in deictic interactions. Data collection was conducted with 10 pairs of subjects who performed deictic interactions in various spatial configurations. In the analysis of the interaction, the boundary shape of each reference term's applicable area is represented as a function of the speaker-listener distance, body orientation (e.g., facing each other or aligned), and the object's location (Fig. 1, left, shows a brief summary of this reference-term model). In addition, whether a pointing gesture alone can specify the target is modeled as the pointing space model (Fig. 1, right). When the pointing does not specify the target, the robot adds an adjective to further specify the object. Fig. 2 shows a scene of deictic interaction with the robot.

Fig. 1 Reference-term and pointing space models
Fig. 2 Scene of deictic interaction

Further study addressed deictic interaction for spaces or regions. People point at regions, for example, in such conversational phrases as "please clean here" and "let's play there." In contrast to deictic interaction for objects, such regions are not clearly visible, so our study started by modeling the concept of regions. (Hato, et al., 2010) revealed that people generally share the concept of regions and that region boundaries consist of the edges of visual scenes as well as the shape of the space. Fig. 3 shows the concept of regions retrieved in an environment. The study also modeled how the robot should perform pointing gestures toward a region (Fig. 4) and demonstrated that when the robot possesses the concept of regions, it performs deictic interaction better.

Fig. 3 Concept of regions
Fig. 4 Pointing to a region

Proxemics

During conversations, people adjust their distance to others based on their relationship and the situation. This phenomenon is known as proxemics (Hall, 1966). For example, personal conversations often happen within 0.45-1.2 m. When people talk about objects, they form an O-shaped space that surrounds the target object (Kendon, 1990). By doing so, each participant can look at the target object as well as the other people in the conversation. In (Yamaoka, et al., 2010), we reproduced this natural interaction with a robot that behaved as an information-presenter. With a motion-capture system, we measured detailed constraints and parameters from human behaviors in interactions about a target object. For example, we found that the most typical distance between the presenter and the target object was approx. 1.0 m, and that the distance between the presenter and the listener was approx. 1.1 m. More importantly, we found that both the listener and the target object should simultaneously be within 150º of the presenter's field of view. Fig. 5 shows a rough image of the best position computed with this method. Fig. 6 shows a scene where the robot moves to such a location to indicate the computer on the desk.

Fig. 5 Model of location to indicate target object
Fig. 6 Robot moving to location to indicate computer on desk
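To make the role of these measured parameters concrete, the following is a minimal sketch, in Python, of how a system might select a standing position under the constraints described above (approx. 1.0 m to the target object, approx. 1.1 m to the listener, and both within a 150º field of view). The constant names, the scoring rule, and the candidate-based search are illustrative assumptions, not the formulation published in (Yamaoka, et al., 2010).

```python
import math

# Hypothetical sketch in the spirit of the proximity-control model
# (Yamaoka, et al., 2010); constants and scoring are illustrative only.

TYPICAL_OBJECT_DIST = 1.0    # typical presenter-to-object distance (m)
TYPICAL_LISTENER_DIST = 1.1  # typical presenter-to-listener distance (m)
FIELD_OF_VIEW_DEG = 150.0    # listener and object must both fit in this angle

def angle_between(origin, a, b):
    """Angle (degrees) at `origin` between the directions toward a and b."""
    ax, ay = a[0] - origin[0], a[1] - origin[1]
    bx, by = b[0] - origin[0], b[1] - origin[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        return 0.0  # degenerate case: candidate coincides with a target
    cos = max(-1.0, min(1.0, (ax * bx + ay * by) / norm))
    return math.degrees(math.acos(cos))

def position_score(candidate, listener, target_object):
    """Smaller is better; None if the field-of-view constraint is violated."""
    if angle_between(candidate, listener, target_object) > FIELD_OF_VIEW_DEG:
        return None  # the robot could not see both at the same time
    d_obj = math.dist(candidate, target_object)
    d_lis = math.dist(candidate, listener)
    return abs(d_obj - TYPICAL_OBJECT_DIST) + abs(d_lis - TYPICAL_LISTENER_DIST)

def best_position(candidates, listener, target_object):
    """Pick the feasible candidate closest to the typical distances."""
    scored = [(position_score(c, listener, target_object), c) for c in candidates]
    feasible = [(s, c) for s, c in scored if s is not None]
    return min(feasible)[1] if feasible else None

# Example: listener at the origin, target object 1.5 m away on the x-axis.
print(best_position([(0.5, 1.0), (0.75, -0.9), (3.0, 3.0)],
                    listener=(0.0, 0.0), target_object=(1.5, 0.0)))
```

In practice, the candidates would come from a map of reachable, unoccluded positions; the point of the sketch is only that the measured distances and the field-of-view requirement act as constraints on where the robot may stand while it speaks.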
Implicit shift of attention

In a series of studies (Yamaoka, et al., 2009), we further modeled the relationship between position and attention. Fig. 7 shows two people implicitly sharing attention. In this scene, they are looking at posters in the room. In the left-most picture, the man in yellow looked back and started moving toward the poster behind him. He did not say anything, but the person in blue followed him and stood where he could share attention with him (right-most picture). They did not point at the poster to share attention; instead, by moving to appropriate locations, they implicitly shared it.

Fig. 7 Implicit shift of attention

(Yamaoka, et al., 2009) further modeled this attention-shift interaction. An attention-shift action is detected by recognizing a person's turning behavior, and the next focus of attention is then estimated from the person's gaze as well as the target being approached. Fig. 8 shows a scene where the robot engages in such interaction.

Fig. 8 Robot's implicit shift of attention

Approaching

Since robots are mobile, sometimes a robot initiates interaction. In a shopping mall, we studied approaching behaviors for advertising purposes (Satake, et al., 2009). First, we found that a robot that simply approached people was often ignored, because people were not aware that it had approached them (Fig. 9). The study therefore incorporates a technique that predicts people's walking direction (Kanda, et al., 2008) (Fig. 10), so that the robot approaches people from the front (Fig. 11).

Fig. 9 Unaware of approaching robot
Fig. 10 Prediction of people's walking direction
Fig. 11 Scene of approaching

Toward integrating natural-HRI and dialog planning

We previously studied non-verbal behaviors as independent phenomena, without integrating them into a larger architecture that simultaneously plans spoken dialog. Although we are working on such integration using a simple markup language (Shi, et al., 2010) that treats non-verbal behaviors as "optional" elements, we have started to recognize difficulties that lie beyond such simple integration. Here we discuss three important features that must be considered: initiation, attention, and position.

Initiation

If robots are mobile, they often don't have a choice of location for initiating dialogs. A dialog starts when the robot and a user meet. Suppose a situation where we create a robot that simply talks about the information below:

Robot: Welcome. Please look at this new product.

Seemingly, this does not need much dialog planning. However, once an initiation scene is included, this simple interaction becomes complex. Perhaps the robot just meets a user in front of the product. For such cases, the above utterance is completely appropriate. But if the robot meets a user at a location where the product is not visible, how can it initiate the dialog? Once customers encounter the robot, they believe that engagement (Sidner, et al., 2004) is formed with the robot. After that moment, do they believe that the robot should speak to them? Or would they assume that the robot is ignoring them if it does not talk to them? If they believe they were ignored, initiating interaction later will be difficult. Overall, we expect a conversation such as the following:

Robot: Welcome. Today, I'd like to introduce a new product to you. Please follow me. (The robot walks to the product.) Please take a look at this!

To generate such interaction, we seemingly need a substantial integration of dialog and behavioral plans.
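To illustrate what such a harmonized initiation plan could look like, here is a small, hypothetical sketch in which the choice of opening depends on whether the product is visible from the meeting point. The DialogAct structure, the plan_initiation function, and the motion labels are assumptions made for this example only; the paper does not define such an interface.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DialogAct:
    utterance: str
    motion: Optional[str] = None  # symbolic motion executed with the utterance

def plan_initiation(product_visible: bool) -> List[DialogAct]:
    """Choose an opening that keeps the dialog plan and the motion plan consistent."""
    if product_visible:
        # The product is in view at the meeting point, so a deictic opening works at once.
        return [DialogAct("Welcome. Please look at this new product.",
                          motion="point_at_product")]
    # The product is out of sight: the opening itself must account for the move,
    # so the user is neither left waiting in silence nor made to feel ignored.
    return [
        DialogAct("Welcome. Today, I'd like to introduce a new product to you. "
                  "Please follow me.", motion="guide_to_product"),
        DialogAct("Please take a look at this!", motion="point_at_product"),
    ]

# Example: the robot meets the user where the product cannot be seen.
for act in plan_initiation(product_visible=False):
    print(f"[{act.motion}] {act.utterance}")
```

The point of the sketch is that the opening utterance and the approach motion are chosen together, rather than attaching a gesture to an utterance that was planned independently.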
Attention

When a dialog is conducted in a physically co-located environment, we often engage in deictic interaction, pointing at objects in the environment and using reference terms. This affects how a robot plans its dialog. As modeled in (Yamaoka, et al., 2009), our attention is often implicitly shared. If a robot is going to talk about a product on a desk and the listener's attention is not yet on the product, the robot should start the dialog with a deictic interaction such as "look at this" (with pointing). On the other hand, if the listener's attention is already on the product, as in the example in Fig. 6, such pointing is socially awkward; instead, the robot should start mentioning product details, e.g., "let me point out a couple of nice features of this product." Perspective-taking (Trafton, et al., 2008; Berlin, et al., 2006; Marin-Urias, et al., 2009) is probably related to this issue, since different speaker and listener viewpoints change the way of speaking.

Position

Suppose a robot is a shopkeeper in a cell phone shop. In the conversation below, what should the robot do non-verbally?

Person: I'd like to buy a cell phone. Any recommendations?
Robot: Sure. Look at this (cell phone A).
Person: No, I'd like a much smaller one.
Robot: How about this one (cell phone B)?
Person: That seems too expensive.
Robot: This (cell phone C) is small and less expensive.

Certainly, we expect a pointing gesture to accompany each spoken this. To support this, (Yamaoka, et al., 2010) provides a method to compute the appropriate location from which a robot can talk about and point at an object. Thus, one might simply apply a strategy of synchronizing utterances and motions, as in typical virtual agents (Vilhjalmsson, et al., 2007). But a question arises: it usually takes 10-30 seconds for a robot to move between locations, even in small rooms. Should a robot always go to the best location? If so, is it acceptable to let users wait in silence?

The timing of the response must also be considered. Although users seem willing to accept delayed responses from a robot (Shiwa, et al., 2009), they expect the robot to respond within a few seconds. If the target is visible, we might prefer quick continuation of the conversation without delay, even though the robot is at a less appropriate location. Sometimes the robot will move anyway, when the target object is not visible or when we intend the listener to pay greater attention to the target. Here, similar to the case of encounters, we need a harmonized plan that considers both HRI and dialog. Users are apparently unwilling to wait in silence for more than 10 seconds. If the robot needs time, the system must provide an alternative dialog plan. For example, when the robot is going to discuss cell phone B in the conversation above but the phone is not visible, it would need to generate a dialog like this:

Person: No, no. I'd prefer a much smaller one.
Robot: Then, well, (while walking to a different location where they can see phone B) let me recommend an alternative that is much smaller than the previous one. (They arrive where phone B can be seen.) How about this one?
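One way to read these timing constraints is as a small decision rule that a dialog planner could consult before every utterance that refers to an object: speak immediately, move first, or talk while moving. The sketch below is a hypothetical illustration; the thresholds merely echo the rough figures quoted above (a response within a few seconds, roughly 10 seconds as the limit for silent waiting, 10-30 seconds of travel time) and are not an algorithm proposed in the paper.

```python
# Hypothetical decision rule for realizing an utterance that refers to an object,
# given the estimated travel time to the best speaking position and whether the
# object is already visible to the user.

ACCEPTABLE_DELAY_S = 3.0   # users expect a response "within a few seconds"
MAX_SILENT_WAIT_S = 10.0   # users are unwilling to wait in silence longer than this

def realize_reference(travel_time_s: float, object_visible: bool) -> str:
    if object_visible and travel_time_s > ACCEPTABLE_DELAY_S:
        # Prefer quick continuation from a less appropriate position.
        return "speak_now_with_pointing"
    if travel_time_s <= ACCEPTABLE_DELAY_S:
        # The move finishes before the response would feel late.
        return "move_then_speak"
    if travel_time_s <= MAX_SILENT_WAIT_S:
        # Borderline: acknowledge the user first ("Then, well, ..."), then move.
        return "acknowledge_then_move_then_speak"
    # A long move: interleave an alternative utterance with the walking,
    # as in the cell-phone B dialog above.
    return "speak_alternative_while_moving"

print(realize_reference(travel_time_s=20.0, object_visible=False))
```

Even this toy rule shows why the dialog plan and the motion plan cannot be computed independently: the wording of the next utterance depends on how long the robot will need to reach the position from which it can point.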
Summary and Discussion

This paper seeks discussion that leads to the integration of natural-HRI into spoken dialog systems. We introduced our previous studies on modeling non-verbal behaviors for natural-HRI. When we consider integrating them into spoken dialog systems, we face difficulties arising from three features: initiation, attention, and position. Since these are probably just the tip of the iceberg, much greater study will be required when we start to explore the field of "dialog with robots."

We have started to recognize that dialog planning cannot be simple for natural-HRI, because a dialog plan is not independent of a robot's motion plan. In particular, the time required for the robot to move affects how the system plans the dialog. If the robot is required to move to an appropriate location for an utterance, some utterances will take much longer than others.

A robot also needs a perception of the world that matches that of the user. Deictic interaction is common in daily conversation. When dialog happens in physically co-located situations, we humans have difficulty resisting the urge to point at the target object being discussed and to use such deictic terms as this and that. For such deictic interaction, a robot needs to understand the locations of objects and other referents, such as the concept of regions. It also needs to understand people's attention, especially because sustained attention displayed by body orientation is highly visible, and it would be considered awkward if the robot ignored it.

Acknowledgements

We wish to thank Prof. Ishiguro, Dr. Hagita, Prof. Imai, Dr. Yamaoka, Dr. Satake, Dr. Sugiyama, Mr. Okuno, Mr. Shiwa, and Mr. Hato for the modeling studies reported in this paper. We also thank Mr. Shimada and Mr. Shi for their help in discussions and suggestions. This research was supported by the Ministry of Internal Affairs and Communications of Japan.

References

Rich, C., Ponsler, B., Holroyd, A. and Sidner, C. L., 2010, Recognizing Engagement in Human-Robot Interaction, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2010), pp. 375-382.

Mutlu, B., Forlizzi, J. and Hodgins, J., 2006, A Storytelling Robot: Modeling and Evaluation of Human-like Gaze Behavior, IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids 2006), pp. 518-523.

Mutlu, B., Shiwa, T., Kanda, T., Ishiguro, H. and Hagita, N., 2009a, Footing in Human-Robot Conversations: How Robots Might Shape Participant Roles Using Gaze Cues, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2009), pp. 61-68.

Mutlu, B., Yamaoka, F., Kanda, T., Ishiguro, H. and Hagita, N., 2009b, Nonverbal Leakage in Robots: Communication of Intentions through Seemingly Unintentional Behavior, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2009), pp. 69-76.

Kuzuoka, H., Oyama, S., Yamazaki, K., Suzuki, K. and Mitsuishi, M., 2000, GestureMan: A Mobile Robot that Embodies a Remote Instructor's Actions, ACM Conf. on Computer-Supported Cooperative Work (CSCW 2000), pp. 155-162.

Scassellati, B., 2002, Theory of Mind for a Humanoid Robot, Autonomous Robots, vol. 12, pp. 13-24.

Breazeal, C., Kidd, C. D., Thomaz, A. L., Hoffman, G. and Berlin, M., 2005, Effects of nonverbal communication on efficiency and robustness in human-robot teamwork, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2005), pp. 383-388.

Okuno, Y., Kanda, T., Imai, M., Ishiguro, H. and Hagita, N., 2009, Providing Route Directions: Design of Robot's Utterance, Gesture, and Timing, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2009), pp. 53-60.

Nakauchi, Y. and Simmons, R., 2000, A Social Robot that Stands in Line, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2000), pp. 357-364.

Roy, N., Pineau, J. and Thrun, S., 2000, Spoken Dialogue Management Using Probabilistic Reasoning, Annual Meeting of the Association for Computational Linguistics (ACL 2000), pp. 93-100.
Dautenhahn, K., Walters, M. L., Woods, S., Koay, K. L., Nehaniv, C. L., Sisbot, E. A., Alami, R. and Siméon, T., 2006, How May I Serve You? A Robot Companion Approaching a Seated Person in a Helping Context, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2006), pp. 172-179.

Ishi, C. T., Matsuda, S., Kanda, T., Jitsuhiro, T., Ishiguro, H., Nakamura, S. and Hagita, N., 2008, A Robust Speech Recognition System for Communication Robots in Noisy Environments, IEEE Transactions on Robotics, vol. 24, pp. 759-763.

Nakano, Y. I., Reinstein, G., Stocky, T. and Cassell, J., 2003, Towards a Model of Face-to-Face Grounding, Annual Meeting of the Association for Computational Linguistics (ACL 2003), pp. 553-561.

Kriz, S., Anderson, G. and Trafton, J. G., 2010, Robot-Directed Speech: Using Language to Assess First-Time Users' Conceptualizations of a Robot, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2010), pp. 267-274.

Sidner, C. L., Kidd, C. D., Lee, C. and Lesh, N., 2004, Where to Look: A Study of Human-Robot Engagement, Int. Conf. on Intelligent User Interfaces (IUI 2004), pp. 78-84.

Cantrell, R., Scheutz, M., Schermerhorn, P. and Wu, X., 2010, Robust Spoken Instruction Understanding for HRI, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2010), pp. 275-282.

Kramer, J., Scheutz, M. and Schermerhorn, P., 2007, "Talk to me!": Enabling Communication between Robotic Architectures and their Implementing Infrastructures, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2007), pp. 3044-3049.

Hall, E. T., 1966, The Hidden Dimension, Doubleday.

Kendon, A., 1990, Spatial Organization in Social Encounters: the F-formation System, in Conducting Interaction: Patterns of Behavior in Focused Encounters, A. Kendon ed., Cambridge University Press, pp. 209-238.

Trafton, J. G., Bugajska, M. D., Fransen, B. R. and Ratwani, R. M., 2008, Integrating Vision and Audition within a Cognitive Architecture to Track Conversations, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2008), pp. 201-208.

Yamaoka, F., Kanda, T., Ishiguro, H. and Hagita, N., 2010, A Model of Proximity Control for Information-Presenting Robots, IEEE Transactions on Robotics, vol. 26, pp. 187-195.

Nakano, M., Hasegawa, Y., Nakadai, K., Nakamura, T., Takeuchi, J., Torii, T., Tsujino, H., Kanda, N. and Okuno, H. G., 2005, A Two-Layer Model for Behavior and Dialogue Planning in Conversational Service Robots, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2005), pp. 3329-3335.

Yamaoka, F., Kanda, T., Ishiguro, H. and Hagita, N., 2009, Developing a Model of Robot Behavior to Identify and Appropriately Respond to Implicit Attention-Shifting, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2009), pp. 133-140.

Nishimura, Y., Minotsu, S., Dohi, H., Ishizuka, M., Nakano, M., Funakoshi, K., Takeuchi, J., Hasegawa, Y. and Tsujino, H., 2007, A Markup Language for Describing Interactive Humanoid Robot Presentations, Int. Conf. on Intelligent User Interfaces (IUI 2007), pp. 333-336.

Satake, S., Kanda, T., Glas, D. F., Imai, M., Ishiguro, H. and Hagita, N., 2009, How to Approach Humans?: Strategies for Social Robots to Initiate Interaction, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2009), pp. 109-116.

Spexard, T. P., Hanheide, M. and Sagerer, G., 2007, Human-Oriented Interaction With an Anthropomorphic Robot, IEEE Transactions on Robotics, vol. 23, pp. 852-862.
Kanda, T., Glas, D. F., Shiomi, M., Ishiguro, H. and Hagita, N., 2008, Who will be the customer?: A social robot that anticipates people's behavior from their trajectories, Int. Conf. on Ubiquitous Computing (UbiComp 2008), pp. 380-389.

Salem, M., Kopp, S., Wachsmuth, I. and Joublin, F., 2009, Towards Meaningful Robot Gesture, Cognitive Systems Monographs, vol. 6, pp. 173-182.

Shi, C., Kanda, T., Shimada, M., Yamaoka, F., Ishiguro, H. and Hagita, N., 2010, Easy Use of Communicative Behaviors in Social Robots, submitted to IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2010).

Prendinger, H., Descamps, S. and Ishizuka, M., 2004, MPML: a markup language for controlling the behavior of life-like characters, Journal of Visual Languages & Computing, vol. 15, pp. 183-203.

Berlin, M., Gray, J., Thomaz, A. L. and Breazeal, C., 2006, Perspective Taking: An Organizing Principle for Learning in Human-Robot Interaction, National Conf. on Artificial Intelligence (AAAI 2006), pp. 1444-1450.

Vilhjalmsson, H., Cantelmo, N., Cassell, J., Chafai, N. E., Kipp, M., Kopp, S., Mancini, M., Marsella, S., Marshall, A. N., Pelachaud, C., Ruttkay, Z., Thórisson, K. R., van Welbergen, H. and van der Werf, R. J., 2007, The Behavior Markup Language: Recent Developments and Challenges, Int. Conf. on Intelligent Virtual Agents (IVA 2007), pp. 99-111.

Marin-Urias, L. F., Sisbot, E. A., Pandey, A. K., Tadakuma, R. and Alami, R., 2009, Towards Shared Attention through Geometric Reasoning for Human Robot Interaction, IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids 2009).

Shiwa, T., Kanda, T., Imai, M., Ishiguro, H. and Hagita, N., 2009, How Quickly Should a Communication Robot Respond? Delaying Strategies and Habituation Effects, International Journal of Social Robotics, vol. 1, pp. 141-155.

McNeill, D., 1987, Psycholinguistics: A New Approach, HarperCollins College Div.

Sugiyama, O., Kanda, T., Imai, M., Ishiguro, H. and Hagita, N., 2006, Humanlike conversation with gestures and verbal cues based on a three-layer attention-drawing model, Connection Science, vol. 18, pp. 379-402.

Hato, Y., Satake, S., Kanda, T., Imai, M. and Hagita, N., 2010, Pointing to Space: Modeling of Deictic Interaction Referring to Regions, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI 2010), pp. 301-308.