Dialog with Robots: Papers from the AAAI Fall Symposium (FS-10-05)
Toward Integrating Natural-HRI into Spoken Dialog
Takayuki Kanda
ATR Intelligent Robotics and Communication Laboratory
2-2-2 Hikaridai, Keihanna Science City, Kyoto, Japan
kanda@atr.jp
Abstract
This paper summarizes our previous work on modeling non-verbal behaviors for natural human-robot interaction (HRI) and discusses a path for integrating them into spoken dialogs. While some non-verbal behaviors can be considered “optional” elements to be added to a spoken dialog, others substantially require a harmonized plan that simultaneously considers both the spoken dialog and the non-verbal behavior. The paper discusses such unique HRI features.
Introduction
In HRI, studies have often focused on the use of the robot’s
physical entity, e.g., human-like body properties. Studies
have revealed the importance of a robot’s non-verbal
behaviors (Breazeal, et al., 2005). Proxemics (position and
distance) has also been considered in social situations
(Nakauchi and Simmons, 2000; Dautenhahn, et al., 2006).
Gaze has been used to provide feedback (Nakano, et al.,
2003), to maintain engagement (Sidner, et al., 2004; Rich, et al., 2010), to adjust the conversation flow (Mutlu, et al., 2006; Mutlu, et al., 2009a), and to attract the robot’s attention (Mutlu, et al., 2009b). Pointing is another useful
gesture to indicate a conversation’s target objects
(Kuzuoka, et al., 2000; Scassellati, 2002; Okuno, et al.,
2009). These findings are devoted to making interaction with robots as natural as interaction among humans, i.e., natural-HRI.
In contrast, few studies address the integration of natural-HRI and spoken dialogs. It remains quite challenging to
interpret the user’s action and to express the robot’s
message. Regarding interpretation capability, working in
noisy environments is already a big issue (Roy, et al.,
2000; Ishi, et al., 2008). People talk casually (Kriz, et al.,
2010), which requires advancements in speech recognition
techniques (Cantrell, et al., 2010). Pioneering work has also been conducted on cognitive architectures (Kramer, et al.,
2007; Trafton, et al., 2008).
Regarding expression capability, so far, only a few
pioneering studies have been conducted to integrate robot
non-verbal behaviors into spoken dialog architecture
(Nakano, et al., 2005; Nishimura, et al., 2007; Spexard, et
al., 2007; Salem, et al., 2009). Often in such architectures, dialog planners generate multimodal expressions where non-verbal behaviors are embedded in the appropriate parts of a spoken dialog, while planning is conducted for spoken dialogs. In other words, in this approach, non-verbal behaviors are considered “optional” elements to be added to spoken dialogs. This view seems appropriate for many non-verbal behaviors, particularly those used in virtual agents (Prendinger, et al., 2004; Vilhjalmsson, et al., 2007). However, since robots are physically co-located with the people with whom they are interacting, many non-verbal behaviors require substantial integration when a system plans the robot’s spoken dialog. This paper introduces our previous work on modeling non-verbal behaviors for HRI and discusses three features, initiation, attention, and position, which must be further studied to integrate natural-HRI into spoken dialog planning.
Modeling non-verbal behaviors for natural-HRI
This section summarizes our previous work on modeling non-verbal behaviors for natural human-robot
interaction (HRI). Since the details are reported elsewhere,
we only briefly introduce them here to provide concrete
discussion examples for connecting such natural-HRI
studies and a spoken dialog system.
Deictic interaction
When we talk about objects, we often engage in deictic
interaction, using such reference terms as this and that with
pointing gestures. Such deictic interaction often happens
when we start to talk about new things that are outside our
shared attention (McNeill, 1987).
In (Sugiyama, et al., 2006), we modeled the use of
Japanese reference terms, kore, sore, and are, and learned
from human behaviors in deictic interactions. Data
collection was conducted with 10 pairs of subjects who performed deictic interactions in various spatial configurations. From this analysis, the boundary shape of each reference term is represented as a function of the speaker-listener distance, body orientation (e.g., faced or aligned),
and the object’s location (Fig. 1, left, shows a brief
summary of this reference-term model). In addition,
whether the pointing can specify the target is modeled as
the pointing space model (Fig. 1, right). When the pointing
does not specify the target, the robot adds an adjective to further specify the object. Fig. 2 shows a scene of deictic interaction with the robot.

Fig. 1 Reference-term and pointing space models

Fig. 2 Scene of deictic interaction
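As a rough, hypothetical illustration of how such a model might be consulted at utterance-generation time, the Python sketch below selects a reference term from a purely distance-based rule and checks a cone-shaped pointing space; the thresholds, the helper names (choose_reference_term, pointing_specifies), and the adjective fallback are illustrative assumptions, not the parameters reported in (Sugiyama, et al., 2006).

```python
import math

# Illustrative thresholds only; the actual model in (Sugiyama, et al., 2006)
# depends on speaker-listener distance, body orientation, and object location.
KORE_RADIUS = 1.0         # "kore": object near the speaker (assumed value, meters)
SORE_RADIUS = 1.0         # "sore": object near the listener (assumed value, meters)
POINTING_CONE_DEG = 15.0  # half-angle of the assumed pointing space

def choose_reference_term(speaker, listener, obj):
    """Pick a Japanese reference term from a simplified spatial rule."""
    if math.dist(speaker, obj) <= KORE_RADIUS:
        return "kore"   # close to the speaker
    if math.dist(listener, obj) <= SORE_RADIUS:
        return "sore"   # close to the listener
    return "are"        # far from both

def pointing_specifies(speaker, obj, distractors):
    """True if no distractor falls inside the pointing cone toward obj."""
    def angle_to(p):
        return math.atan2(p[1] - speaker[1], p[0] - speaker[0])
    target_dir = angle_to(obj)
    cone = math.radians(POINTING_CONE_DEG)
    return all(abs((angle_to(d) - target_dir + math.pi) % (2 * math.pi) - math.pi) > cone
               for d in distractors)

def deictic_phrase(speaker, listener, obj, distractors, adjective="red"):
    term = choose_reference_term(speaker, listener, obj)
    # When pointing alone cannot single out the target, add an adjective,
    # mirroring the robot behavior described above.
    if pointing_specifies(speaker, obj, distractors):
        return f"{term} (pointing)"
    return f"{term} {adjective} one (pointing)"

if __name__ == "__main__":
    speaker, listener = (0.0, 0.0), (2.0, 0.0)
    target, others = (0.5, 0.2), [(3.0, 1.5), (0.6, 0.25)]
    print(deictic_phrase(speaker, listener, target, others))
```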
Further study was conducted on deictic interaction for spaces or regions. People point at regions in such conversational phrases as “please clean here” and “let’s play there.” In contrast to deictic interaction for objects, such regions are not themselves visible, so our study started by modeling the concept of regions.
(Hato, et al., 2010) revealed that people generally share the concept of regions, and that their boundaries consist of the edges of visual scenes as well as the shape of the space. Fig. 3 shows the concept of regions retrieved in an environment. The study also modeled how the robot should perform pointing gestures (Fig. 4) and demonstrated that when the robot possesses the concept of regions, it performs deictic interaction better.

Fig. 3 Concept of regions

Fig. 4 Pointing to a region

Proxemics

During conversations, people adjust their distance to others based on their relationship and the situation. This phenomenon is known as proxemics (Hall, 1966). For example, personal conversations often happen within 0.45-1.2 m. When people talk about objects, they form an O-shaped space that surrounds the target object (Kendon, 1990). By doing so, each participant can look at the target object as well as the other people in the conversation.
In (Yamaoka, et al., 2010), we reproduced this natural interaction with a robot, where the robot behaved as an information presenter. With a motion-capture system, we measured detailed constraints and parameters from human behaviors in interactions about a target object. For example, we found that the most typical distance between the target object and the presenter was approximately 1.0 m and that the distance between the presenter and the listener was approximately 1.1 m. More importantly, we found that both the listener and the target object should simultaneously be within 150° of the presenter’s field of view. Fig. 5 shows a rough image of the best position computed from this method. Fig. 6 shows a scene where the robot moves to the location to indicate the computer on the desk.

Fig. 5 Model of location to indicate target object

Fig. 6 Robot moving to location to indicate computer on desk
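To make the reported constraints concrete, the sketch below scores candidate standing positions against the measured distances and the 150° field-of-view condition and picks the lowest-cost one by brute-force grid search; the quadratic cost, the weights, the penalty value, and the search procedure are assumptions for illustration rather than the optimization used in (Yamaoka, et al., 2010).

```python
import math

TARGET_DIST = 1.0    # typical presenter-to-object distance (m), from the study
LISTENER_DIST = 1.1  # typical presenter-to-listener distance (m), from the study
FOV_DEG = 150.0      # listener and object should both fit in this field of view

def position_cost(presenter, listener, obj):
    """Lower is better. Weights and penalty are illustrative assumptions."""
    cost = ((math.dist(presenter, obj) - TARGET_DIST) ** 2
            + (math.dist(presenter, listener) - LISTENER_DIST) ** 2)
    # Angular separation between the directions to the listener and to the object;
    # if it exceeds the field of view, the presenter cannot see both at once.
    a_obj = math.atan2(obj[1] - presenter[1], obj[0] - presenter[0])
    a_lis = math.atan2(listener[1] - presenter[1], listener[0] - presenter[0])
    spread = abs((a_obj - a_lis + math.pi) % (2 * math.pi) - math.pi)
    if math.degrees(spread) > FOV_DEG:
        cost += 100.0
    return cost

def best_presenting_position(listener, obj, step=0.1, radius=2.5):
    """Brute-force grid search around the object for the lowest-cost position."""
    best, best_cost = None, float("inf")
    steps = round(2 * radius / step)
    for i in range(steps + 1):
        for j in range(steps + 1):
            p = (obj[0] - radius + i * step, obj[1] - radius + j * step)
            c = position_cost(p, listener, obj)
            if c < best_cost:
                best, best_cost = p, c
    return best

if __name__ == "__main__":
    listener, computer = (2.0, 0.0), (0.0, 0.0)
    print("candidate presenting position:", best_presenting_position(listener, computer))
```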
Implicit shift of attention

In a series of studies (Yamaoka, et al., 2009), we further modeled the relationship between position and attention. Fig. 7 shows two people implicitly sharing attention. In this scene, they are looking at posters in the room. In the left-most picture, the man in yellow looked back and started moving toward the poster behind him. The person in yellow did not say anything, but the person in blue followed him and stood where he could share attention with the person in yellow (right-most picture). They did not point at the poster to share attention; simply by moving to that location, they implicitly shared attention.

Fig. 7 Implicit shift of attention

(Yamaoka, et al., 2009) further modeled this attention-shift interaction. An attention-shift action is detected by recognizing a person’s turning behavior, and the next attention target is then estimated from the person’s gaze as well as the target he or she is approaching. Fig. 8 shows a scene where the robot engages in such interaction.

Fig. 8 Robot’s implicit shift of attention
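A minimal sketch of this kind of rule is given below, assuming the robot can observe a person's body orientation, head direction, and velocity: a sharp body rotation flags an attention-shift action, and the next attention target is guessed by combining agreement with the gaze direction and with the motion direction. The threshold and the scoring function are illustrative assumptions, not the model of (Yamaoka, et al., 2009).

```python
import math

TURN_THRESHOLD_DEG = 60.0   # assumed: a body rotation this large counts as "turning"

def detect_turn(prev_body_yaw, body_yaw):
    """Flag an attention-shift action when the body orientation changes sharply."""
    diff = abs((body_yaw - prev_body_yaw + math.pi) % (2 * math.pi) - math.pi)
    return math.degrees(diff) >= TURN_THRESHOLD_DEG

def estimate_next_attention(person_pos, head_yaw, velocity, objects):
    """Score candidate objects by agreement with gaze (head) and motion direction."""
    def ang(v):
        return math.atan2(v[1], v[0])
    move_yaw = ang(velocity) if any(velocity) else head_yaw
    best, best_score = None, -float("inf")
    for name, pos in objects.items():
        to_obj = (pos[0] - person_pos[0], pos[1] - person_pos[1])
        gaze_err = abs((ang(to_obj) - head_yaw + math.pi) % (2 * math.pi) - math.pi)
        move_err = abs((ang(to_obj) - move_yaw + math.pi) % (2 * math.pi) - math.pi)
        score = math.cos(gaze_err) + math.cos(move_err)  # illustrative combination
        if score > best_score:
            best, best_score = name, score
    return best

if __name__ == "__main__":
    objects = {"poster_A": (0.0, 3.0), "poster_B": (4.0, 0.0)}
    if detect_turn(prev_body_yaw=0.0, body_yaw=math.radians(170)):
        target = estimate_next_attention((1.0, 0.0), math.radians(90), (-0.2, 0.5), objects)
        print("shift attention to:", target)  # robot would move to share this attention
```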
Approaching

Since robots are mobile, sometimes the robot itself initiates the interaction. In a shopping mall, we studied approaching behaviors for advertising purposes (Satake, et al., 2009). First, we found that a robot that simply approached people was often ignored because people were not aware that it had approached them (Fig. 9). The study therefore incorporates a technique that predicts people’s walking direction (Kanda, et al., 2008) (Fig. 10), so that the robot approaches people from the front (Fig. 11).

Fig. 9 Unaware of approaching robot

Fig. 10 Prediction of people’s walking direction

Fig. 11 Scene of approaching
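The sketch below illustrates the underlying idea under simple assumptions: a constant-velocity prediction stands in for the anticipation technique of (Kanda, et al., 2008), and the robot's approach goal is placed on the predicted path, ahead of the person, so that the encounter happens from the front.

```python
def predict_position(position, velocity, horizon_s):
    """Constant-velocity prediction (a stand-in for the learned anticipation model)."""
    return (position[0] + velocity[0] * horizon_s,
            position[1] + velocity[1] * horizon_s)

def frontal_approach_goal(position, velocity, horizon_s=4.0, standoff_m=1.5):
    """Meet the person on their predicted path, offset further along their walking
    direction so the robot ends up in front of, and facing, the approaching person."""
    future = predict_position(position, velocity, horizon_s)
    speed = (velocity[0] ** 2 + velocity[1] ** 2) ** 0.5
    if speed < 1e-6:
        return future  # person is standing still; approach directly
    ux, uy = velocity[0] / speed, velocity[1] / speed  # unit walking direction
    return (future[0] + ux * standoff_m, future[1] + uy * standoff_m)

if __name__ == "__main__":
    person_pos, person_vel = (0.0, 0.0), (0.8, 0.0)  # walking along +x at 0.8 m/s
    print("robot approach goal (in front of the person):",
          frontal_approach_goal(person_pos, person_vel))
```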
Toward integrating natural-HRI and dialog planning

We previously studied non-verbal behaviors as independent phenomena without integrating them into a larger architecture that simultaneously plans spoken dialog. Although we are working on such integration using a simple markup language (Shi, et al., 2010) that addresses non-verbal behaviors as “optional” elements, we have
started to recognize difficulties beyond such simple integration. Here we discuss three important features that must be considered beyond simple integration.

Position

Suppose a robot is a shopkeeper in a cell phone shop. In the conversation below, what should the robot do non-verbally?

Person: I’d like to buy a cell phone. Any recommendations?
Robot: Sure. Look at this (cell phone A).
Person: No, I’d like a much smaller one.
Robot: How about this one (cell phone B)?
Person: That seems too expensive.
Robot: This (cell phone C) is small and less expensive.
Initiation
If robots are mobile, they often don’t have a choice of
location to initiate dialogs. A dialog starts when the robot
and a user meet. Suppose a situation where we create a
robot that simply talks about the information below:
Robot: Welcome. Please look at this new product.
Certainly, we expect a pointing gesture to accompany the
spoken this. To do so, (Yamaoka, et al., 2010) provides a method to compute the appropriate location for a robot to talk and point at an object. Thus, one might simply apply a strategy of synchronizing utterances and motions, as is common for virtual agents (Vilhjalmsson, et al., 2007). But a question arises: it usually takes 10-30 seconds for a robot to move between locations, even in small rooms. Should a robot always go to the best location? If so, is it acceptable to let users wait in silence?
The timing of the response must also be considered.
Although users seem willing to accept delayed responses
from a robot (Shiwa, et al., 2009), they expect the robot to
respond within a few seconds. If the target is visible, we
might prefer quick continuation of the conversation
without delay even though the robot is at a less appropriate
location. Sometimes the robot will move anyway, when the target object is not visible or when we intend the listener to pay greater attention to the target.
Here, similar to the case of encounters, we need a
harmonized plan that considers both HRI and dialog. Users
are apparently unwilling to wait in silence for more than 10
seconds. If the robot needs time, the system must provide an alternative dialog plan. For example, in the cell-phone dialog above, when the robot is going to discuss cell phone B but it is not visible, it would need to generate a dialog like this:
Person: No, no. I’d prefer a much smaller one.
Robot: Then, well,
(while walking to a different location where they can
see phone B)
let me recommend an alternative that is much smaller
than the previous one.
(They arrive where phone B can be seen)
How about this one?
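A minimal sketch of such a harmonized plan is shown below: when the referenced product is not visible from the current position, the planner emits a transitional utterance to be spoken while the robot moves, and only then completes the recommendation, so the user is never left waiting in silence. The visibility flag, travel-time estimate, robot speed, and canned phrases are all illustrative assumptions.

```python
SILENCE_TOLERANCE_S = 10.0   # users appear unwilling to wait in silence longer than this
ROBOT_SPEED_MPS = 0.5        # assumed walking speed of the robot

def travel_time(current, goal):
    dist = ((goal[0] - current[0]) ** 2 + (goal[1] - current[1]) ** 2) ** 0.5
    return dist / ROBOT_SPEED_MPS

def plan_recommendation(robot_pos, product, visible_from_here):
    """Return a list of (action, content) steps harmonizing speech and motion."""
    steps = []
    if visible_from_here:
        # Target already visible: answer immediately with a deictic utterance.
        steps.append(("say", f"How about this one ({product['name']})?"))
        steps.append(("point", product["name"]))
        return steps
    # Target not visible: speak a transition while moving so the user never
    # waits in silence, then finish the utterance on arrival.
    eta = travel_time(robot_pos, product["viewpoint"])
    steps.append(("say", "Then, well, let me recommend an alternative that is much smaller."))
    steps.append(("move", product["viewpoint"]))
    if eta > SILENCE_TOLERANCE_S:
        steps.append(("say", "It is just over here."))  # filler to bridge a long move
    steps.append(("say", f"How about this one ({product['name']})?"))
    steps.append(("point", product["name"]))
    return steps

if __name__ == "__main__":
    phone_b = {"name": "cell phone B", "viewpoint": (4.0, 2.0)}
    for action, content in plan_recommendation((0.0, 0.0), phone_b, visible_from_here=False):
        print(action, "->", content)
```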
Seemingly, such an exchange does not require much dialog planning.
However, once an initiation scene is included, this simple
interaction becomes complex. Perhaps the robot just meets
a user in front of the product. For such cases, the above
utterance is completely appropriate. But if the robot meets
a user at a location where the product is not visible, how
can it initiate the dialog?
Once customers encounter the robot, they believe that
engagement (Sidner, et al., 2004) is formed with the robot.
After that moment, we wonder, do they believe that the
robot should speak to them? Or would they assume that the
robot is ignoring them if it does not talk to them? But if
they believe they were ignored, initiating interaction later
will be difficult.
Overall, we expect such conversations:
Robot: Welcome. Today, I’d like to introduce a new
product to you.
Please follow me. (Robot walks to the product.)
Please take a look at this!
To generate such interaction, we seemingly need a
substantial integration of dialog and behavioral plans.
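The following sketch illustrates that choice of initiation strategy under an assumed (and deliberately crude) visibility check: if the robot meets the user where the product can be seen, it opens with the deictic utterance above; otherwise it first establishes engagement and leads the user to the product before pointing.

```python
def product_visible(user_pos, product_pos, max_range_m=3.0, occluded=False):
    """Crude stand-in for a real visibility check (range only, optional occlusion flag)."""
    dist = ((product_pos[0] - user_pos[0]) ** 2 + (product_pos[1] - user_pos[1]) ** 2) ** 0.5
    return (not occluded) and dist <= max_range_m

def initiation_plan(user_pos, product_pos, occluded=False):
    """Choose how to open the interaction depending on where the robot met the user."""
    if product_visible(user_pos, product_pos, occluded=occluded):
        return [("say", "Welcome. Please look at this new product."),
                ("point", "new product")]
    # Product not visible from the encounter location: establish engagement first,
    # lead the user to the product, then produce the deictic utterance.
    return [("say", "Welcome. Today, I'd like to introduce a new product to you."),
            ("say", "Please follow me."),
            ("lead_user_to", product_pos),
            ("say", "Please take a look at this!"),
            ("point", "new product")]

if __name__ == "__main__":
    for step in initiation_plan(user_pos=(0.0, 0.0), product_pos=(6.0, 1.0)):
        print(step)
```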
Attention
When a dialog is conducted in a physically co-located environment, we often engage in deictic interaction, pointing at objects in the environment and using reference terms. This affects how a robot plans its dialog. As modeled in (Yamaoka, et al., 2009), our attention is often implicitly shared. If a robot is going to talk about a product on a desk and the listener’s attention is not yet on the product, the robot should start the dialog with such deictic interaction
as “look at this” (with pointing). On the other hand, if the
listener’s attention is already on the product, as in the
example in Fig. 6, such pointing is socially awkward;
instead, the robot should start mentioning such product
details as “let me point out a couple of nice features of this
product.” Perspective-taking (Trafton, et al., 2008; Berlin,
et al., 2006; Marin-Urias, et al., 2009) is probably related
to this issue, where different speaker and listener
viewpoints change the way of speaking.
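As a small illustration, the sketch below chooses the opening move from an assumed estimate of the listener's current attention target (how that estimate is obtained, e.g., from gaze or body orientation, is outside the sketch): pointing plus "look at this" when attention is elsewhere, and a detail-oriented opening when attention is already shared.

```python
def opening_utterance(listener_attention_target, product_name):
    """Start deictically only when the listener's attention is not yet on the product."""
    if listener_attention_target != product_name:
        # Attention elsewhere: establish shared attention with a deictic act.
        return [("say", "Look at this."), ("point", product_name)]
    # Attention already shared: pointing would be socially awkward, so go
    # straight to the details, as in the example of Fig. 6.
    return [("say", f"Let me point out a couple of nice features of this {product_name}.")]

if __name__ == "__main__":
    print(opening_utterance("window", "computer"))    # listener looking elsewhere
    print(opening_utterance("computer", "computer"))  # attention already on the product
```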
Summary and Discussion
This paper seeks to stimulate discussion that leads to the integration of natural-HRI into spoken dialog systems. We introduced our previous studies on modeling non-verbal behaviors for natural-HRI. When we considered integrating them into spoken dialog systems, we faced difficulties arising from three features: initiation, attention, and position. Since these are probably just the tip of the iceberg, much greater study will be required when we start to explore the field of “dialog with robots.”
We have started to recognize that dialog planning cannot be simple for natural-HRI, because a dialog plan is not independent of a robot’s motion plan. In particular, the
time required for the robot to move affects how the system
plans the dialog. If the robot is required to move to an
appropriate location for the utterance, some utterances will
take much longer than others.
A robot also needs a perception of the world that equals
that of the user. Deictic interaction is common in daily
conversation. When dialog happens in physically co-located situations, we humans have difficulty resisting the urge to point at the target object being discussed and to use such deictic terms as this and that. For such deictic interaction, a robot needs to understand the locations of objects as well as other spatial notions such as the concept of regions. It also needs to understand people’s attention, especially since sustained attention displayed by body orientation is highly visible and would be considered awkward if it were ignored by the robot.
Acknowledgements

We wish to thank Prof. Ishiguro, Dr. Hagita, Prof. Imai, Dr. Yamaoka, Dr. Satake, Dr. Sugiyama, Mr. Okuno, Mr. Shiwa, and Mr. Hato for the modeling studies reported in this paper. We also thank Mr. Shimada and Mr. Shi for their help in discussions and suggestions. This research was supported by the Ministry of Internal Affairs and Communications of Japan.

References

Berlin, M., Gray, J., Thomaz, A. L. and Breazeal, C., 2006, Perspective Taking: An Organizing Principle for Learning in Human-Robot Interaction, National Conf. on Artificial Intelligence (AAAI2006), pp. 1444-1450.
Breazeal, C., Kidd, C. D., Thomaz, A. L., Hoffman, G. and Berlin, M., 2005, Effects of nonverbal communication on efficiency and robustness in human-robot teamwork, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS2005), pp. 383-388.
Cantrell, R., Scheutz, M., Schermerhorn, P. and Wu, X., 2010, Robust Spoken Instruction Understanding for HRI, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2010), pp. 275-282.
Dautenhahn, K., Walters, M. L., Woods, S., Koay, K. L., Nehaniv, C. L., Sisbot, E. A., Alami, R. and Siméon, T., 2006, How May I Serve You? A Robot Companion Approaching a Seated Person in a Helping Context, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2006), pp. 172-179.
Hall, E. T., 1966, The Hidden Dimension.
Hato, Y., Satake, S., Kanda, T., Imai, M. and Hagita, N., 2010, Pointing to Space: Modeling of Deictic Interaction Referring to Regions, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2010), pp. 301-308.
Ishi, C. T., Matsuda, S., Kanda, T., Jitsuhiro, T., Ishiguro, H., Nakamura, S. and Hagita, N., 2008, A Robust Speech Recognition System for Communication Robots in Noisy Environments, IEEE Transactions on Robotics, vol. 24, pp. 759-763.
Kanda, T., Glas, D. F., Shiomi, M., Ishiguro, H. and Hagita, N., 2008, Who will be the customer?: A social robot that anticipates people's behavior from their trajectories, Int. Conf. on Ubiquitous Computing (UbiComp2008), pp. 380-389.
Kendon, A., 1990, Spatial Organization in Social Encounters: the F-formation System, in Conducting Interaction: Patterns of Behavior in Focused Encounters, A. Kendon ed., Cambridge University Press, pp. 209-238.
Kramer, J., Scheutz, M. and Schermerhorn, P., 2007, "Talk to me!": Enabling Communication between Robotic Architectures and their Implementing Infrastructures, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS2007), pp. 3044-3049.
Kriz, S., Anderson, G. and Trafton, J. G., 2010, Robot-Directed Speech: Using Language to Assess First-Time Users' Conceptualizations of a Robot, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2010), pp. 267-274.
Kuzuoka, H., Oyama, S., Yamazaki, K., Suzuki, K. and Mitsuishi, M., 2000, GestureMan: A Mobile Robot that Embodies a Remote Instructor's Actions, ACM Conference on Computer-Supported Cooperative Work (CSCW2000), pp. 155-162.
Marin-Urias, L. F., Sisbot, E. A., Pandey, A. K., Tadakuma, R. and Alami, R., 2009, Towards Shared Attention through Geometric Reasoning for Human Robot Interaction, IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids 2009).
McNeill, D., 1987, Psycholinguistics: a new approach, HarperCollins College Div.
Mutlu, B., Forlizzi, J. and Hodgins, J., 2006, A Storytelling Robot: Modeling and Evaluation of Human-like Gaze Behavior, IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids'06), pp. 518-523.
Mutlu, B., Shiwa, T., Kanda, T., Ishiguro, H. and Hagita, N., 2009a, Footing In Human-Robot Conversations: How Robots Might Shape Participant Roles Using Gaze Cues, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2009), pp. 61-68.
Mutlu, B., Yamaoka, F., Kanda, T., Ishiguro, H. and Hagita, N., 2009b, Nonverbal Leakage in Robots: Communication of Intentions through Seemingly Unintentional Behavior, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2009), pp. 69-76.
Nakano, M., Hasegawa, Y., Nakadai, K., Nakamura, T., Takeuchi, J., Torii, T., Tsujino, H., Kanda, N. and Okuno, H. G., 2005, A Two-Layer Model for Behavior and Dialogue Planning in Conversational Service Robots, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS2005), pp. 3329-3335.
Nakano, Y. I., Reinstein, G., Stocky, T. and Cassell, J., 2003, Towards a Model of Face-to-Face Grounding, Annual Meeting of the Association for Computational Linguistics (ACL 2003), pp. 553-561.
Nakauchi, Y. and Simmons, R., 2000, A Social Robot that Stands in Line, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS2000), pp. 357-364.
Nishimura, Y., Minotsu, S., Dohi, H., Ishizuka, M., Nakano, M., Funakoshi, K., Takeuchi, J., Hasegawa, Y. and Tsujino, H., 2007, A Markup Language for Describing Interactive Humanoid Robot Presentations, International Conference on Intelligent User Interfaces (IUI 2007), pp. 333-336.
Okuno, Y., Kanda, T., Imai, M., Ishiguro, H. and Hagita, N., 2009, Providing Route Directions: Design of Robot's Utterance, Gesture, and Timing, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2009), pp. 53-60.
Prendinger, H., Descamps, S. and Ishizuka, M., 2004, MPML: a markup language for controlling the behavior of life-like characters, Journal of Visual Languages & Computing, vol. 15, pp. 183-203.
Rich, C., Ponsler, B., Holroyd, A. and Sidner, C. L., 2010, Recognizing Engagement in Human-Robot Interaction, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2010), pp. 375-382.
Roy, N., Pineau, J. and Thrun, S., 2000, Spoken Dialogue Management Using Probabilistic Reasoning, Annual Meeting of the Association for Computational Linguistics (ACL 2000), pp. 93-100.
Salem, M., Kopp, S., Wachsmuth, I. and Joublin, F., 2009, Towards Meaningful Robot Gesture, Cognitive Systems Monographs, vol. 6, pp. 173-182.
Satake, S., Kanda, T., Glas, D. F., Imai, M., Ishiguro, H. and Hagita, N., 2009, How to Approach Humans?: Strategies for Social Robots to Initiate Interaction, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2009), pp. 109-116.
Scassellati, B., 2002, Theory of Mind for a Humanoid Robot, Autonomous Robots, vol. 12, pp. 13-24.
Shi, C., Kanda, T., Shimada, M., Yamaoka, F., Ishiguro, H. and Hagita, N., 2010, Easy Use of Communicative Behaviors in Social Robots, submitted to IROS2010.
Shiwa, T., Kanda, T., Imai, M., Ishiguro, H. and Hagita, N., 2009, How Quickly Should a Communication Robot Respond? Delaying Strategies and Habituation Effects, International Journal of Social Robotics, vol. 1, pp. 141-155.
Sidner, C. L., Kidd, C. D., Lee, C. and Lesh, N., 2004, Where to Look: A Study of Human-Robot Engagement, International Conference on Intelligent User Interfaces (IUI 2004), pp. 78-84.
Spexard, T. P., Hanheide, M. and Sagerer, G., 2007, Human-Oriented Interaction With an Anthropomorphic Robot, IEEE Transactions on Robotics, vol. 23, pp. 852-862.
Sugiyama, O., Kanda, T., Imai, M., Ishiguro, H. and Hagita, N., 2006, Humanlike conversation with gestures and verbal cues based on a three-layer attention-drawing model, Connection Science, vol. 18, pp. 379-402.
Trafton, J. G., Bugajska, M. D., Fransen, B. R. and Ratwani, R. M., 2008, Integrating Vision and Audition within a Cognitive Architecture to Track Conversations, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2008), pp. 201-208.
Vilhjalmsson, H., Cantelmo, N., Cassell, J., Chafai, N. E., Kipp, M., Kopp, S., Mancini, M., Marsella, S., Marshall, A. N., Pelachaud, C., Ruttkay, Z., Thórisson, K. R., van Welbergen, H. and van der Werf, R. J., 2007, The Behavior Markup Language: Recent Developments and Challenges, Int. Conf. on Intelligent Virtual Agents, pp. 99-111.
Yamaoka, F., Kanda, T., Ishiguro, H. and Hagita, N., 2009, Developing a Model of Robot Behavior to Identify and Appropriately Respond to Implicit Attention-Shifting, ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI2009), pp. 133-140.
Yamaoka, F., Kanda, T., Ishiguro, H. and Hagita, N., 2010, A Model of Proximity Control for Information-Presenting Robots, IEEE Transactions on Robotics, vol. 26, pp. 187-195.