Dialog with Robots: Papers from the AAAI Fall Symposium (FS-10-05)

A Toolkit for Exploring the Role of Voice in Human-Robot Interaction

Z. Henkel (Texas A&M University, zmhenkel@tamu.edu), V. Groom (Stanford University, vgroom@stanford.edu), V. Srinivasan (Texas A&M University, vasant.s@gmail.com), R. Murphy (Texas A&M University, murphy@cse.tamu.edu), and C. Nass (Stanford University, nass@stanford.edu)

Motivation

As part of the "Survivor Buddy" project, we have created an open-source speech translator toolkit that allows written or spoken input from multiple independent controllers to be rendered as a single synthetic voice, as a distinct synthetic voice for each controller, or as each controller's unaltered natural voice. The human controllers can work over the Internet or be physically co-located with the Survivor Buddy robot. The toolkit is expected to be of use for exploring voice in human-robot interaction in general.

The Survivor Buddy project is motivated by our prior work, which suggests that a trapped victim of a disaster, or another dependent human, will treat a rescue robot as a social medium and that the choice of robotic voice will be important. The robot will be both a medium to the "outside" world and a local, independent entity devoted to the victim (e.g., a buddy). One function of the medium is to provide two-way audio communication between the survivor and the emergency response personnel, but more interesting capabilities emerge by fully exploiting web applications. A web-enabled robot with an LCD screen could permit the survivor to videoconference with responders (or family), watch live TV or movies, or listen to music. Because the robot would represent multiple people and perform multiple roles, the choice of voice might influence trust and likability and enhance cognitive function.

The potential significance of the choice of robotic voice in a social medium was highlighted by our recent experiences with "SciGirls." In an episode of "SciGirls," a Public Broadcasting Service (PBS) science reality show, four middle-school girls hypothesized that an extroverted Survivor Buddy would be better liked than an introverted one, used a text-to-speech interface to generate two synthetic voices for the robot with the pitches and frequencies associated with introversion and extroversion, then tested their hypothesis by having 12 of their friends play tic-tac-toe with the robot. Voice appeared more important than non-verbal cues: Fig. 1 shows the Survivor Buddy's webcam view of the SciGirls' friends fixated on the stationary robot head rather than on the external loudspeakers located almost 90° from the actual source of the voice.

Figure 1: View from the Survivor Buddy webcam, with a subpicture showing the robot in orange and the speakers circled in red.

The fixation on the robot rather than on the source of the voice is not surprising given the primacy of the face over other communication channels in coordinating the give-and-take of conversation (Knapp and Hall 1978). However, Survivor Buddy did not have a face, just a screen with a logo. If voice is more important than non-verbal cues, it may be a cost-effective way to make existing non-anthropomorphic robots socially consistent, or to reduce costs in new robots. On the other hand, the information conveyed in game playing may be comparatively simple and the environmental conditions favorable; less structured conditions, such as a disaster with background noise and high stress, may require non-verbal affect to amplify the intent of the agent.
Voice Toolkit

In order to explore questions about the relative contribution of voice versus non-verbal cues, and about the appropriate type of voice for a social medium, we have built a speech translation toolkit. The toolkit allows natural voice communication and also extends existing voice recognition and text-to-speech systems into a framework of network services. By wrapping recognition and generation abilities as network services, users can choose the best tool available regardless of their platform or installed libraries. Wrapping a service is a simple process that involves fashioning a program with a network layer around the desired framework; a minimal sketch of such a wrapper appears at the end of this section. For example, the base implementation provides recognition services based on Microsoft's Speech APIs and CMU's Sphinx-4, with generation services built on FreeTTS or Apple's built-in operating-system voices. The network wrapper gives connected clients the full functionality of the underlying voice framework.

The toolkit is easy to deploy, using peer-to-peer communication between the involved parties along with access to a central server for a directory of connected clients and services. The client software, written in Java, is platform independent, easily distributable, and divided into three distinct clients distinguished by their roles. It has the following functions:

• Natural-voice pass-through between all connected parties. The number of connected parties is limited by bandwidth; usually up to three can communicate easily. Stream rates can be adjusted when bandwidth is limited. Currently the Speex codec is used, allowing single streams to reach usable rates as low as 20 kilobits per second; the toolkit specifications also allow alternate codecs to be implemented.

• Voice alteration via real-time effects on the voice stream. Because the toolkit provides byte-level access to the audio streams, we are able to employ digital signal processing techniques to control attributes such as reverb or echo (a sketch of a simple echo effect appears at the end of this section).

• Voice standardization via voice recognition followed by synthesis. A controller can choose a synthetic voice and speak in that voice via a speech recognition process followed by a speech generation process.

• Text-to-speech with many choices of voices and options. For example, the FreeTTS package provides the voices kevin16, mbrola1, mbrola2, and mbrola3, as well as the option to create voices with FestVox. Other packages, even ones as simple as built-in operating-system voices, provide varying voices as well. Packages like the CSLU Toolkit, which allows tweaking of voice parameters to create new and unique voices, may also be supported through the tuning-parameters feature.

• Quick playback of prerecorded audio files. This is useful for controllers who may need to follow a coaching guide in order to handle the situation in the best way possible; prerecorded files with the most frequently needed phrases, in the best voices, can aid the controller in conversing with a victim.

As shown in Fig. 2, the toolkit is used by three software clients and a central server, which together form the social medium testbed. Following (Riddle, Murphy, and Burke 2005), the testbed assumes that there is a victim, a robot (Survivor Buddy), remote controllers (such as structural or medical specialists or family members), and a facilitator (responder) coordinating the interactions.

Figure 2: Clients using the toolkit for the Survivor Buddy project.

The robot client is managed remotely and supports simple listening and responding. The guest client provides an interface designed to allow a remote user to communicate with facilitators and the victim at controlled intervals. The facilitator client acts as a gatekeeper between the other clients, as well as participating in the communications: it allows management of which clients can speak with the victim and who can log in to the system in general. Additionally, the facilitator client is able to decide who may access specific data on the central server. For example, the central server may collect and distribute information about the robot's proximity to a victim; this data may be relevant and useful to a facilitator, but not to a guest.

The central server software serves as a data communication mechanism between the software clients, manages services on the network (such as recognition and generation), and maintains a directory of users and permissions for logging into the system. The server is also used to coordinate communication through firewalls via UDP hole-punching techniques. Once a client connects to the central server, it is given a list of the other connected clients, so each client needs to know only the central server's address ahead of time.
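To make the service-wrapping idea concrete, the following is a minimal sketch, not the toolkit's actual code or wire protocol: it wraps FreeTTS's kevin16 voice in a line-oriented TCP service that speaks each received line aloud. The class name SpeechServiceWrapper and port 9000 are illustrative choices, and a full wrapper would stream the synthesized audio back to the client rather than playing it locally.

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch only: wraps the FreeTTS kevin16 voice as a
// line-oriented network service. The class name, port, and protocol
// are hypothetical, not the toolkit's actual implementation.
public class SpeechServiceWrapper {
    public static void main(String[] args) throws Exception {
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        if (voice == null) {
            System.err.println("kevin16 not found; check the FreeTTS classpath.");
            return;
        }
        voice.allocate(); // load the voice's resources once, up front
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()))) {
                    // Speak each line of text the client sends.
                    String line;
                    while ((line = in.readLine()) != null) {
                        voice.speak(line);
                    }
                }
            }
        } finally {
            voice.deallocate();
        }
    }
}

A client on any platform can then use the generation service simply by opening a TCP connection and writing text, with no speech libraries of its own installed.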
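Similarly, the byte-level stream access makes real-time effects such as the echo mentioned above straightforward. The sketch below is an illustration under stated assumptions, mono 16-bit little-endian PCM and a fixed feedback gain of one half, rather than the toolkit's actual effect chain.

// Illustrative sketch of a real-time feedback echo applied in place to
// buffers of mono 16-bit little-endian PCM audio; the delay length and
// the feedback gain of one half are arbitrary choices.
public final class EchoEffect {
    private final short[] delayLine; // circular buffer of past output samples
    private int pos;

    public EchoEffect(int delaySamples) {
        if (delaySamples <= 0) {
            throw new IllegalArgumentException("delaySamples must be positive");
        }
        this.delayLine = new short[delaySamples];
    }

    // Applies the echo in place to one buffer of 16-bit LE PCM audio.
    public void process(byte[] pcm) {
        for (int i = 0; i + 1 < pcm.length; i += 2) {
            // Decode one little-endian 16-bit sample.
            short sample = (short) ((pcm[i] & 0xFF) | (pcm[i + 1] << 8));
            // Mix in the delayed output at half amplitude, clamping to
            // the 16-bit range to avoid wrap-around distortion.
            int mixed = sample + (delayLine[pos] >> 1);
            mixed = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, mixed));
            // Feed the result back into the delay line and advance it.
            delayLine[pos] = (short) mixed;
            pos = (pos + 1) % delayLine.length;
            // Re-encode the sample in place.
            pcm[i] = (byte) mixed;
            pcm[i + 1] = (byte) (mixed >> 8);
        }
    }
}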
Current Work

The toolkit will be used in experiments starting in Fall 2010 that will explore the appropriate identity (loyalty, expression, locus of control) and type of voice (the controller's natural voice, a synthetic voice for each controller, or a single synthetic voice) for a robot serving as a social medium. The expanded toolkit will also permit the association of scripts, policies, priorities, and decision-making aids with different controller roles. Code is available upon request.

Acknowledgments

This work was supported in part by NSF Grant IIS-0905485, "The Social Medium is the Message," and by an HRI grant from Microsoft External Research.

References

Knapp, M., and Hall, J. 1978. Nonverbal Communication in Human Interaction. Holt, Rinehart and Winston.

Riddle, D.; Murphy, R.; and Burke, J. 2005. Robot-assisted medical reachback: Using shared visual information. In RO-MAN 2005: IEEE International Workshop on Robot and Human Interactive Communication, 635-642.