WHICH CAME FIRST: THE USER OR THE INTERFACE?
Keeping the human in the loop from design phase to finished product

Dennis Perzanowski
Code 5512
Naval Research Laboratory
Washington, DC 20375
dennisp@aic.nrl.navy.mil

Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We believe that cognitively modeled robotic systems will be easier to use. Just as communication between humans is facilitated by shared systems (linguistic, social, and cognitive), human-robot interaction can be similarly facilitated. Without cognitive capabilities, communication between a human and a robot is more difficult and perhaps subject to failure. We have therefore been designing and implementing a cognitively modeled human-robot interface. Based on a pilot study of that interface, we discuss the importance of human subject studies in verifying the various claims of a robotics interface that purports to be cognitively modeled.

I. Introduction

A word of caution is needed up front. In lieu of a formal paper presenting and analyzing results, the following position paper is offered to outline the reasons for a formal report on a human subject study concerning human-robot interaction. Some preliminary results and analyses have already been presented elsewhere [Perzanowski et al., 2003], and we are currently analyzing the results of the more extensive study conducted recently.

The current paper may sound suspiciously like a production guide for product development. To some extent, this is not far from the truth. The following is offered with some hindsight in interface design and development and in product testing in the form of human subject studies, but mostly it is the result of some serious self-evaluation. (The opinions expressed in this paper are solely those of the author and do not necessarily reflect those of the author's co-workers and co-researchers. Errors of omission or commission are his alone.)

One might ask what a human subject interaction study has to do with cognitive science and robotics. The answer is rather straightforward: human subject studies are needed to verify the various claims of a robotics interface that purports to be cognitively modeled. However, this is somewhat of a mixed metaphor: we are talking about human cognitive capabilities, and yet we keep throwing in the term "robot" or "robotics."

We believe that robotic systems that are cognitively modeled will be easier to use. Of course, using the term "easier" in the arena of human-computer interaction is like waving a red cloth in a bullring. However, if we can abstract away from what the specifics of "easier" might entail and simply generalize on the process that enables humans to communicate with one another, we claim that it is shared systems (linguistic, social, and cognitive, among others perhaps) that facilitate human communication. We claim that robots that think and act more like humans will likewise be easier to interact with because they possess human cognitive capabilities and exhibit human behaviors where and when appropriate. Furthermore, without making this an argument about a Chinese Room [Searle, 1984], we simply claim that without cognitive capabilities, communication between a human and a robot is more difficult and perhaps subject to failure. Cooperative and collaborative interactions may suffer because humans have difficulty interacting with machines, rather than with agents that embody cognitive skills akin to their own.
Our story goes something like this: one person may be able to help another individual solve a problem because they can see the problem from the other person's point of view, share similar mental representations of events and objects, and communicate in shared modes. We therefore wish to transfer these cognitive processes and representations to the realm of human-robot interaction.

Too often, it seems, interfaces are designed by engineers and computer scientists who think they understand why and how people interact with computer or robot interfaces. Interfaces are designed in laboratory environments, logically planned, intelligently designed, and skillfully implemented, with little or no actual user input other than what the designers think actual human users might want or use. Granted, engineers and computer scientists are human users too, but many people for whom some interfaces are designed are not scientifically trained, mathematically oriented, or savvy in the ways of the CPU. Many users of our interfaces will be, dare we say, ordinary folk who may or may not be computer literate, although more and more people have become so in the past decade. The advent of widespread computer literacy notwithstanding, we can still maintain the position that the "ordinary" user must never be lost sight of. Granted, a military robot must be a comrade in arms familiar with all of the skills commensurate with the art and science of warfare, but a nursebot or a companion robot will probably have to be a bit more "down home" and personal. Furthermore, even the most militaristic robot, the most medically astute nursebot, or the most chatty home companion robot should be able to communicate with the human user in a way that the user finds natural and habitable, and it should behave in ways that the user finds natural and customary. In other words, the user should be happy to get a correct response or reaction, not surprised by the method or mode of the response.

The user classification scale, therefore, may be a sliding one, moving gradually between higher levels of expertise and almost a kind of shared ignorance coupled with a willingness to learn. Designers of interfaces must always keep an eye on that scale, which reflects the abilities of the user. People who design interfaces for physically disabled or challenged individuals would never think of designing an interface for such a person without having that individual in the loop during the design and implementation phase. However, engineers and computer scientists who design robot interfaces frequently forget the client and assume that, simply because they themselves are representatives of a class of users, their actions and interactions will suffice to produce an adequate interface for everyone. This is not to say that there isn't a wealth of human factors and human-computer interaction research and literature on the subject. The focus in this position paper is on the design and implementation phase of the interface, not on any post hoc investigation.

Sadly, but honestly, a large portion of the current author's research in human-robot interaction lacks this pre-design and implementation survey of human user preferences. However, a certain turn of events altered this history recently when questions regarding the incorporation of cognitive skills and behaviors were introduced into the research [Trafton et al., forthcoming].
We had put together a multimodal interface to a mobile robot that incorporated speech and natural language understanding, as well as natural and symbolic gesturing (the latter being our nomenclature for interactions with a Personal Digital Assistant display and touch screen) [Perzanowski et al., 1998, 2002, 2003; Skubic et al., 2004]. Our emphasis was on establishing and maintaining "natural" communications between humans and robots, as embodied by natural speech and gestures. We were quite content with demonstrating our multimodal interface, which showed off our talents and accomplishments quite nicely. Of course, it couldn't have done otherwise! After all, we had designed the interface; we knew how to use it; we knew its strengths and weaknesses inherently and could selectively exercise it accordingly. However, when it came to putting the interface in the hands of real users, we discovered that we really didn't know or understand how the typical human user approached and wanted to use our interface. We therefore conducted a pilot study [Perzanowski et al., 2003] and, based on the analysis therein, were somewhat amazed by the results.

II. Of Things Proustian and Wellsian
(Proustian: Proust, M., Remembrance of Things Past, 1913. Wellsian: Wells, H.G., The Shape of Things to Come, 1933.)

A few years ago, at the 2000 AAAI Spring Symposium, we participated in a workshop entitled "My Dinner with R2D2" and basically came away with the idea [Perzanowski et al., 2000] that the best menu for human-robot interaction could be provided by taking a caviar-and-champagne approach: give the human user the best tools with a wide range of choices and capabilities. We built the interface and reported on it in numerous places, and then decided to conduct a human subject study [Perzanowski et al., 2003] to see just how people used what we had designed and implemented. Granted, subjects of the pilot study used a version of the interface that we had been using, so it wasn't as if we were asking these people to design the interface for us.

The interface employed both a touch screen with live video feedback (a robot's-eye view) and a plan, or mapped, view of the robot's environment (Figure 1).

Figure 1. Touch screen display of robot's-eye view (left) and mapped representation (right) of the environment. (The large dot on the pillar in the left display indicates where participants touched the screen.)

Whenever a subject touched the video display, a large red dot appeared to provide feedback to the user. Natural language capabilities were also provided. In the map representation of the room, the robot was indicated by a red circle, with its orientation shown by a thin radiating line inside the circle. Certain objects throughout the room were labeled in this view. When subjects touched this display, a large blue "X" appeared in the map view.

With these modes of interaction, subjects were asked to get a mobile robot in a remote location to find a hidden object: an iridescent yellow sign with the word FOO written on it in block letters. In order to compensate for any deficiencies in our modules and software, as well as to provide the cognitive capabilities and behavior we wished to incorporate in the future, we made this a "Wizard of Oz" study. The subjects, of course, were not told that they would be interacting with two humans controlling the robot, its speech, its actions, and its understanding. Instead, they were simply informed that they would be interacting with a rather intelligent mobile robot.
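As a purely illustrative aside, the short sketch below mimics the touch-feedback behavior described above: a stand-in "video view" that echoes each touch with a large red dot and records the selected coordinates. It is not the implementation used in the study; all widget names, sizes, and the use of Python/tkinter are assumptions made for the example.

```python
import tkinter as tk


class TouchFeedbackView(tk.Canvas):
    """Hypothetical stand-in for the video view: a mouse click plays the
    role of a screen touch, and a large red dot confirms the selection."""

    DOT_RADIUS = 12  # large enough to be obvious at a glance (invented value)

    def __init__(self, master, **kwargs):
        super().__init__(master, **kwargs)
        self.last_touch = None                    # most recent (x, y) selection
        self.bind("<Button-1>", self._on_touch)   # left click stands in for a touch

    def _on_touch(self, event):
        # Remove the previous marker so only the current selection is shown.
        self.delete("touch_marker")
        r = self.DOT_RADIUS
        self.create_oval(event.x - r, event.y - r, event.x + r, event.y + r,
                         fill="red", outline="", tags="touch_marker")
        self.last_touch = (event.x, event.y)
        # A real interface would forward these coordinates to the robot
        # (or, in a Wizard-of-Oz study, to the wizard's monitor).
        print(f"touch at {self.last_touch}")


if __name__ == "__main__":
    root = tk.Tk()
    root.title("Video view (illustrative)")
    TouchFeedbackView(root, width=640, height=480, bg="black").pack()
    root.mainloop()
```

The map view could be handled analogously, drawing a blue "X" instead of a red dot and reporting map rather than image coordinates.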
More specific details, including the five-part experimental design in which subjects were tested, are available elsewhere (Perzanowski et al., to appear 2004).

To our surprise, people did not avail themselves of the rich modes of interaction we thought we had made available to them. Instead, they took rather conservative approaches to their modes of interaction. Sentences tended to be short, simple commands, such as "Go here" or "Move forward one meter," accompanied, if needed, by gestures on the screen. We had expected that, especially with the video feed, people would jump at the opportunity to talk about the objects viewed, telling the robot to "Move to the left of the box behind the pillar." Instead of coming to the interface table with yens and appetites for foie gras, people were in actuality taking a more peanut-butter-and-jelly approach. This is not to be construed as a pejorative comment on our part about the kinds of interactions people exhibited, but rather as a reaction to the completely antithetical opinion we had held about how they might use our interface.

In the pre-test briefing, we somewhat emphasized the robust capabilities of the system, telling the subjects that the robot understood English and that they could interact with the screen freely by touching it. We found that, instead of interacting with the robot with both guns loaded, so to speak, attacking the problem from a very high level, and experimenting to see what the capabilities of the system were, people came to the interface experience rather conservatively and at a rather low level of interaction, starting for the most part with simple commands and, more importantly, staying at this low level throughout the task.

We were surprised by this result because we felt that it did not correspond to the ways in which people interact with other people. If a person meets another unknown individual, granted, the initial utterances and interactions may be short, simple, and halting. But as the two become familiar with each other, communication usually becomes more complex. This was not what we found in our pilot study. People stayed at the initial level of communication throughout. Most of their commands in this task, for example, could be characterized as nothing more than verbal joysticking (euphemistically also known as voice-driven teleoperation). Gestures, an option presented to them, tended to be minimal. In fact, it seemed that the more loquacious subjects in our study relied less on gestures and simply uttered more commands. We had expected, for example, that providing people with a rather robust interface would enable sophisticated and complicated interactions with complicated locative and spatial referencing, both linguistic and gestural. For the most part, however, very little, if any, of this was exhibited. Instead, people seemed to be very narrowly focused and simplistic in their commands and interactions. Nothing in our experience had told us that we should expect what we discovered.

When we presented the preliminary results of the pilot study, conference attendees suggested that perhaps the design of the experiment had contributed to the kinds of interactions we had elicited. It was suggested that we alter people's workloads so that their attention had to be directed and redirected during the experiment. Perhaps then we would elicit the kinds of responses we were looking for. However, this suggestion only reinforced our belief that our initial approach was correct.
We wanted to get people to interact in their own natural ways without "doctoring" the environment of the experiment. This is not to say that the latter is not a valid experimental design, but we did not feel that it would be appropriate here. Therefore, given the surprising preliminary results of the pilot study, we conducted a more rigorous study involving 25 subjects. At the time of the writing of this article, we are analyzing the results of this second study and hope to present some of the conclusions drawn from this analysis at this symposium for further discussion.

One general conclusion, however, already seems to be emerging. As stated earlier, we conducted a "Wizard of Oz" experiment. In actuality, there were two wizards: one acted as the navigational system of the robot, watching for users' gestures on a monitor mounted on top of the robot and manually joysticking the robot around the environment. The other wizard acted as the voice of the robot, providing verbal feedback to the user after the user's various commands or interactions. For example, after a user told the robot to move to a particular location, the robot verbally responded, "I made it to the goal."

Prior to the actual experiment, we looked around for a plausible robot voice for providing feedback to the user; however, all of the voices we surveyed on the internet sounded either somewhat artificial or even toylike. We therefore opted for a voice modulator purchased inexpensively in a second-hand store. The modulator was used to distort the human voice sufficiently that it did not sound quite human. Furthermore, the wizard providing the voice was familiar with the interface and with human voice attributes, so he gave the same responses to all subjects for consistency and maintained a somewhat monotonous tone of voice, trying to emulate speech systems that do not employ speech contours and the other phonetic and phonological characteristics of natural-sounding human speech. We feared that anything that sounded too natural would be a dead give-away that there really wasn't a smart robot with which the subjects were interacting, but rather another human.

After completion of the human subject experiments and an initial pass of data analysis, it became evident that the rich interactions we had anticipated with our "robust" interface were not forthcoming. We had expected, or hoped for, utterances rich with spatial and locative information and gestures, given what we thought were robust natural language and visual interactive modalities. We wondered what could be contributing to the lack of more complicated and richer interactions from our users. While this needs to be investigated further, it seems that the voice of the robot used by one of the wizards may have biased our subjects toward short, simple utterances, and perhaps influenced them to shy away from more complicated utterances and even more complicated gestures. We suspect that the wizard's voice may have "dumbed down" the experiment. Because the voice of the robot sounded monotonous and the robot's utterances tended to be short, the subjects may have gotten the impression that the robot was not quite as smart as we had told them it was. Instead, hearing a monotonous, curt robot may have caused the subjects to react in kind. Such a finding is not unusual, considering how people are affected by another person's linguistic capabilities in conversational environments.
People talking with children and individuals speaking with non-native speakers or second-language learners exhibit similar qualities. More importantly, however, we have seen the effect that the inclusion of natural language, and its seeming robustness or lack thereof, can have on an individual's use of the interface in general.

III. Conclusion

We have discussed two issues affecting human-robot interaction here. Just as human communication is facilitated by shared cognitive systems, we believe cognitively modeled human-robot interfaces will facilitate human-robot interactions. However, based on our experiences with designing and implementing a cognitively modeled human-robot interface and a subsequent pilot study testing the modalities of that interface, we stress the importance of human subject studies from the outset in the modeling of the interface. Moreover, if one is going to incorporate natural language into the interface, designers have to be keenly aware of how the use of natural language may affect other modalities and interactions. If used, the natural language mode can have far-reaching consequences for the other modalities of the interface. Farther-reaching, perhaps, than the designers may have anticipated.

While we require additional experimentation to verify our tentative conclusion, we think we may have found what caused certain unexpected and seemingly adverse effects in our multimodal interface; namely, the uncharacteristic verbal joysticking of a supposedly robust robot, rather than the richer and more diverse interactions we anticipated our multimodal interface would facilitate. Thus, short, simple, monotonous interchanges may have adverse effects on the entire interface. There may be a direct correlation between the robustness of the natural language interface and the kinds of interactions users will exhibit in the other interactive modalities of an interface.

Acknowledgments

This research has been funded by the DARPA IPTO MARS Program (MIPR #04-L697), the ONR Intelligent Systems Program (Work Order #N0001404WX20210), and an NRL Research Option (Work Certificate #IT-015-09-4C & 4D).

References

Perzanowski, D., Brock, D., Adams, W., Bugajska, M., Thomas, S., Blisard, S., Schultz, A., Trafton, J.G., and Mintz, F. (to appear 2004) "Toward Multimodal Human-Robot Cooperation and Collaboration," Proceedings of the First Intelligent Systems Technical Conference, American Institute of Aeronautics and Astronautics.

Perzanowski, D., Brock, D., Blisard, S., Adams, W., Bugajska, M., Schultz, A., Trafton, G., and Skubic, M. (October 2003) "Finding the FOO: A Pilot Study for a Multimodal Interface," Proceedings of the IEEE Systems, Man, and Cybernetics Conference, Washington, DC, pp. 3218-3223.

Perzanowski, D., Schultz, A., Adams, W., Bugajska, M., Marsh, E., Trafton, G., Brock, D., Skubic, M., and Abramson, M. (2002) "Communicating with Teams of Cooperative Robots." In Multi-Robot Systems: From Swarms to Intelligent Automata. Kluwer: The Netherlands, pp. 185-193.

Perzanowski, D., Schultz, A., Marsh, E., and Adams, W. (March 2000) "Two Ingredients for My Dinner with R2D2: Integration and Adjustable Autonomy," Papers from the 2000 AAAI Spring Symposium Series, Menlo Park, CA: AAAI Press, pp. 45-50.

Perzanowski, D., Schultz, A., and Adams, W. (September 1998) "Integrating Natural Language and Gesture in a Robotics Domain," Proceedings of the IEEE International Symposium on Intelligent Control: ISIC/CIRA/ISAS Joint Conference, Gaithersburg, MD: National Institute of Standards and Technology, pp. 247-252.
Skubic, M., Perzanowski, D., Blisard, S., Schultz, A.C., Adams, W., and Bugajska, M. (May 2004) "Spatial Language for Human-Robot Dialogs," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, no. 2, pp. 154-167.

Trafton, J.G., Schultz, A.C., Perzanowski, D., Adams, W., Bugajska, M.D., Cassimatis, N.L., and Brock, D.P. (forthcoming) "Children and robots learning to play hide and seek."