Development and Application of Heuristic Evaluation to Human-Robot Interaction

Edward Clarkson
edward.clarkson@cc.gatech.edu

INTRODUCTION

The attention paid to human-robot interaction (HRI) issues has grown dramatically as robotic systems have become more capable and as human contact with those systems has become more commonplace. Along with the development of robotic interfaces, there has been an increased interest in the evaluation of these systems. Evaluation of HRI systems has taken place by a variety of methods, but has primarily taken the form of quantitative lab and field studies [10] [16] [23] [24] [26] [31]. These techniques, though important, are much better suited to summative evaluations of well-developed systems. In contrast, the human-computer interaction (HCI) field has long made use of discount evaluation techniques, which are very well-suited to formative studies; such studies are ideal for guiding the development of HRI systems. One such discount technique is heuristic evaluation [19], which is based on the inspection of a computing system by expert reviewers according to a set of guidelines (the ‘heuristics’). Following the process used by Nielsen in his development of what has become the standard set for traditional desktop applications, other researchers have created new heuristic sets suited to new application domains [4] [17]. This paper discusses our work toward the development of a heuristic set for human-monitored HRI systems, including an initial list of guidelines and preliminary results from a pilot study using them. We motivate the development of HRI heuristics first by discussing related work (including an overview of heuristic evaluation) before detailing the heuristic development.

HUMAN-ROBOT INTERACTION EVALUATION METHODS AND METRICS

A consistent theme throughout most of the literature on HRI systems is the relative scarcity of work on evaluation of those systems, especially before 2000.
Georgia Tech’s Skills Impact Study notes that most researchers have paid “lip service” to evaluation but contributed few actual experimental studies [13], and [31] makes similar statements, noting that scant work has gone into assuring that HRI displays and interaction controls are at all intuitive for their users. Some researchers have recently come to recognize the need for evaluation in general and for developing evaluation guidelines in particular. One of the recommendations put forth as part of a case study on urban search and rescue (USAR) at the World Trade Center site [7] was for additional research in perceptual user interfaces. A recent DARPA/NSF report [6] also proposes research in evaluation methodologies and metrics as a productive direction for future research, citing the need for evaluation methods and metrics that can be used to measure the development of human-robot teams. Scholtz has proposed both a framework of interaction roles and a few broad guidelines for HRI development [27]. Her interaction roles identify different ways in which humans relate to HRI systems: supervisor, operator, mechanic, teammate, peer and bystander. Partly because the field lacks well-established HRI evaluation metrics (discussed below), system evaluations have been limited primarily to empirical testing methods. These analyses have taken the form of field [23] [26] [31] and lab studies [10] [16] [24] of operational HRI systems. In either case, most studies focus on systems designed primarily for teleoperation [16] [24] [26] [31], with a few exceptions such as robot planning software [10]. The issue of evaluation metrics is currently a problematic area for HRI as a field [34]. Qualitative questionnaires, interviews and observational coding procedures have yielded useful results, but comparing such work across domains or applications can be problematic.
The list of other (generally quantitative) techniques that have been applied is relatively short, and the metrics employed have likewise been for the most part problem- or domain-specific. Domain-dependent metrics (in both field and lab studies) have included task completion time, error frequency, and other ad hoc gauges particular to the task environment [31]. The NASA Task Load Index (NASA-TLX) measurement scale has also been used as a quantitative measure of operator mental workload during robot teleoperation and partially autonomous operation [16] [26]. It measures cognitive load across six subscales (mental demands, physical demands, temporal demands, performance, effort and frustration) and weights each factor to arrive at an overall workload measurement. However, these weights are determined separately for each task in question, so the TLX metric is again difficult to compare directly across domains.

Some work has been attempted on domain-independent metrics. Notably, Olsen and Goodrich have proposed neglect tolerance (NT) and robot attention demand (RAD) as particularly important aspects of HRI, and developed a number of metrics that can be used to gauge a human-robot interaction design [25] [38]. Their methods for measuring concepts like NT and RAD are fairly high-level and harder to use in practice, though they do provide useful illustrations of techniques that can improve metric scores for HRI systems.

FORMATIVE AND DISCOUNT EVALUATION

A common thread (with a few exceptions, e.g., [23]) in the HRI evaluation literature is the focus on formal summative studies and techniques. Summative evaluations are applied to an implemented design product to judge how well it has met design goals; this is in contrast to formative evaluations, which are applied to designs or prototypes with the intention of guiding the design or implementation itself.
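For concreteness, the two workload/attention metrics discussed above can be sketched in a few lines of code. The TLX computation is a weighted mean of the six subscale ratings; the attention-demand function below is a simplified rendering of Olsen and Goodrich's idea, treating interaction effort and neglect tolerance as times. All numbers and the simplified RAD formula are illustrative inventions, not the cited authors' exact procedures.

```python
def tlx_workload(ratings, weights):
    """NASA-TLX overall workload: weighted mean of the six subscale
    ratings (0-100), with weights from the 15 pairwise comparisons."""
    total = sum(weights.values())  # 15 in the standard procedure
    return sum(ratings[s] * weights[s] for s in ratings) / total

def robot_attention_demand(interaction_effort, neglect_tolerance):
    """Fraction of operator time a robot demands: time spent interacting
    over the total interact-then-neglect cycle. Its inverse bounds how
    many such robots one operator could serve."""
    return interaction_effort / (interaction_effort + neglect_tolerance)

# Invented ratings and weights for a single teleoperation task.
ratings = {"mental": 80, "physical": 30, "temporal": 60,
           "performance": 40, "effort": 70, "frustration": 55}
weights = {"mental": 5, "physical": 1, "temporal": 3,
           "performance": 2, "effort": 3, "frustration": 1}

print(round(tlx_workload(ratings, weights), 1))  # -> 63.7
print(robot_attention_demand(10.0, 30.0))        # 10 s of tending buys 30 s autonomy -> 0.25
```

Because the TLX weights are re-derived for each task, two workload scores from different tasks are not directly comparable, which is exactly the cross-domain problem noted above.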
Note that this distinction concerns only when in the development process a technique is applied and not the method itself. However, many evaluation techniques are clearly better suited to one type of application or the other. We have already noted that research is needed into techniques that can be used to judge the progress of HRI systems [6]. We can consider progress in the macro sense (the advancement of HRI as a field) or micro (the development of an individual project), but both interpretations are important to HRI. In the current state of affairs, there are relatively few tools to judge HRI systems adequately. We can make a few comments about the current situation: the focus on summative applications comes at the expense of formative evaluations in the development of most HRI systems (along with other principles of user-centered design [1]). Largely contributing to this paucity is a lack of evaluation methods that are both suited to formative studies and have been successfully demonstrated on HRI applications. Discount evaluation techniques are methods that are designed explicitly for low cost in terms of manpower and time. Because of these properties, discount evaluations are popular for formative evaluation and are most often applied as such. One such approach whose popularity has grown rapidly in the HCI community is heuristic evaluation (HE)1. HE is a type of usability inspection method, which is a class of techniques involving evaluators examining an interface with the purpose of identifying usability problems. This class of methods has the advantage of being applicable to a wide range of prototypes, from detailed design specifications to fully functioning systems. HE was developed by Nielsen and Molich [19] and has been empirically validated [20]. In accordance with its discount label, it requires only a few (three to five) evaluators who are not necessarily HCI (or HRI) experts (though it is more effective with training). 
The principle behind HE is that individual inspectors of a system do a relatively poor job, finding a fairly small percentage of the total number of known usability problems. However, Nielsen has shown that there is wide variance in the problems found between evaluators, so the results of a small group of evaluators can be aggregated to uncover a much larger number of bugs [19]. Briefly, the HE process consists of the following steps [21]:

The group that desires a heuristic evaluation (such as a design team) performs preparatory work, generally in the form of:
1. Creation of problem report templates for use by the evaluators.
2. Customization of heuristics to the specific interface being evaluated. Depending on what kind of information the design team is trying to gain, only certain heuristics may be relevant to that goal. In addition, since canonical heuristics are (intentionally) generalized, heuristic descriptions given to the evaluators can include references and examples taken from the system in question.

Next, a small group of evaluators (Nielsen recommends three to five) is assembled to perform the HE. These evaluators do not need any domain knowledge of usability or interface design. Woolrych and Cockton dispute the exact number, cautioning that five may not be sufficient for some groups of evaluators or interfaces, but do not provide a specific recommendation [36]; [17] addressed the criticisms in [36] by using eight inspectors. In any case, more than five is not detrimental to the evaluation, and Nielsen asserts that this range is optimal from a cost-benefit perspective.

Each evaluator independently assesses the system in question and judges its compliance with a set of usability guidelines (the heuristics) provided for them.2 After the results of each assessment have been recorded, the evaluators can convene to aggregate their results and assign severity ratings to the various usability issues.
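The reporting and aggregation steps, and the reasoning behind the three-to-five recommendation, can be sketched in code. Below, a problem report is a small record naming the violated heuristics, aggregation is a union of reports across evaluators (with mean severity), and Nielsen and Landauer's discovery model 1 − (1 − λ)^n shows the diminishing returns of adding evaluators. The report fields, example problems, and the per-evaluator detection rate λ are all illustrative assumptions, not data from the cited studies.

```python
from collections import defaultdict

def make_report(description, violated, severity):
    """One evaluator's problem report: which heuristics were violated
    (by name) and a severity rating on Nielsen's 0-4 scale."""
    return {"description": description, "violated": set(violated),
            "severity": severity}

def aggregate(reports_by_evaluator):
    """Union the per-evaluator reports into one synthesized problem list,
    keyed by description, averaging severity across evaluators."""
    merged = defaultdict(list)
    for reports in reports_by_evaluator:
        for r in reports:
            merged[r["description"]].append(r)
    return {desc: {"violated": set().union(*(r["violated"] for r in rs)),
                   "severity": sum(r["severity"] for r in rs) / len(rs),
                   "found_by": len(rs)}
            for desc, rs in merged.items()}

def proportion_found(n, rate=0.15):
    """Expected fraction of problems found by n independent evaluators,
    each finding `rate` of them: 1 - (1 - lambda)^n."""
    return 1 - (1 - rate) ** n

# Two hypothetical evaluators with one overlapping finding.
evals = [
    [make_report("no feedback on command receipt",
                 {"Visibility of system status"}, 3)],
    [make_report("no feedback on command receipt",
                 {"Visibility of system status"}, 4),
     make_report("error codes are numeric",
                 {"Help users recognize errors"}, 2)],
]
merged = aggregate(evals)
print(merged["no feedback on command receipt"]["found_by"])  # -> 2
print(round(proportion_found(5), 2))                         # -> 0.56
```

With an assumed per-evaluator rate of 0.15, three evaluators find about 39% of problems and five about 56%, consistent with the proportions typically reported for HE; the curve's flattening beyond five evaluators is the cost-benefit argument behind Nielsen's recommendation.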
Alternatively, an observer (the experimenter organizing and conducting the evaluation) can perform the aggregation and rating assignment.

1 More extensive notes on heuristic evaluation are available at: http://www.useit.com/papers/heuristic/.
2 See Appendix A for Nielsen’s canonical heuristic list.

HE has been shown to find 40–60% of usability problems with just three to five evaluators (hence Nielsen’s recommendation), and a case study showed a cost-benefit ratio of 1:48 (as cited in [21]). For those reasons, among others, HE has proven very popular in both industry and research. Of course, the results of an HE are highly subjective and probably not repeatable. It should be emphasized, however, that repeatability is not a goal for HE; its purpose is to provide a proven evaluation framework that is easy to teach, learn and perform. Its value thus comes from its low cost, its straightforward applicability early in the design process, and the fact that even those with little usability experience can administer a useful HE study. These features, which make it such a popular HCI method, also make it useful for gauging HRI systems.

HEURISTIC EVALUATION AS APPLIED TO HRI

The obvious problem with applying heuristic evaluation to HRI is the invalidity of existing heuristics for the domain. Aside from this issue, there is the question of whether the method as a whole is suitable to HRI at all. Even if HRI has a great deal in common with mainstream HCI, there are certainly clear distinctions between the fields. Scholtz states that HRI is “fundamentally different” from normal HCI in several aspects, and Yanco, et al. acknowledge HE as a useful HCI method but reject its applicability to HRI because Nielsen’s heuristics are not appropriate to the domain. A number of issues have been cited as differentiating factors between HRI and HCI/HMI, including complex control systems, the existence of autonomy and cognition, dynamic operating environments, varied interaction roles, multi-agent and multi-operator schemes, and the embodied nature of HRI systems.

The inapplicability of Nielsen’s heuristics to HRI is of course not in debate, since the development of suitable heuristics is the purpose of this work. However, some could argue that one or more of the factors differentiating HRI and HCI calls into question the validity of the HE process itself. We have mentioned that examples of adapting Nielsen’s heuristics to other problem domains already exist. Baker et al. adapted heuristics for use in the evaluation of CSCW applications [4]. Similarly, Mankoff et al. produced a heuristic list for ambient displays [17].3 In each case, the development of the new heuristic sets was prompted by the unsuitability of Nielsen’s canonical set to another domain, and the new lists were developed and validated using similar methodologies, each based on Nielsen’s original work.

3 See Appendix A.

The presence of these works validates our own process in several ways. First, their presence implies the creation of heuristics in a new problem domain (outside traditional HCI interfaces) is both feasible and constructive. Second, the CSCW and ambient domains also bear similarities to some aspects of HRI—specifically the presence of autonomous multi-agent systems and physical embodiment. If HE has been successfully applied to those domains, we should be confident that objections to HE in HRI along those lines are not compelling. Scholtz’s interaction roles are a more interesting objection, but in the vast majority of cases HRI systems involve only the operator and bystander roles, and for the purposes of evaluation we can consider the bystander as part of the environment. Even in the worst case, heuristic sets could be developed for each interaction role; for the present we will concern ourselves mainly with the operator role.

The other HCI/HMI and HRI differences mentioned above are not substantial impediments to the aims of this work, either.
What is unique about HRI is not that it is different from HCI/HMI in these ways, since many other fields have used HCI/HMI or human factors principles and possess any number of the same features. HRI is unique only because it exhibits all of these distinctions. Since no single differentiating factor appears significant enough to prevent HE from adapting to it, we argue that HE can adapt to all of these differences at once.

HRI HEURISTICS DEVELOPMENT

There are a number of bases from which to develop potential HRI heuristics: Nielsen’s canonical list [21], HRI guidelines suggested by Scholtz [28] (as well as the implications of her interaction roles), and elements of the ambient and CSCW heuristics [4] [17]. The concepts of situational awareness [37] and of neglect tolerance and attention demand [38] should be considered; Sheridan’s challenges for human-robot communication [29] can also be taken as issues to be satisfied by an HRI system. These lists and the overall body of work in HRI provide the basis for heuristics applicable to both multi-operator and multi-agent systems; however, for this work we have limited our focus to single-operator, single-agent settings. The reason is partly to narrow the problem focus and partly because the issues faced by systems where the human:robot ratio is not 1:1 are to a large degree orthogonal to the base 1:1 case. In addition, validating single human/robot systems is a wise first step, since lessons learned during this work can be applied to further development.

We tried to base our initial list on the distinctive characteristics of HRI, with the intention that it should ideally be pertinent to systems ranging from normal windowed software to less traditional interfaces such as that of the iRobot Roomba vacuum cleaner. We also wanted our list to apply equally well to purely teleoperated and to monitored autonomous systems.
This is a feasible goal, as in some ways, what makes a robotic interface (no matter what the form) effective is no different from what makes anything else usable, be it a door handle or a piece of software [22]. To this end, we identified issues relevant to the user that are common to all of these situations. Norman emphasizes a device’s ability to communicate its state to the user as an important characteristic for usability [22]. Applied to a robot, an interface then should make evident various aspects of the robot’s status—what is its pose? What is its current task or goal? What does it know about its environment, and what does it not know? Parallel to the issue of what information should be communicated is how it is communicated. The potential complexity of robotic sensor data or behavior parameters is such that careful attention is due to what exactly the user needs from that data, and to designing an interface that communicates that data in the most useful format. Many of these questions have been considered as part of the heuristic sets mentioned previously, and we leveraged that experience by taking elements in whole and in part from those lists to form our own attempt at an HRI heuristic set. The original source or sources of each heuristic before adaptation are also listed:

1. Sufficient information design (Scholtz, Nielsen). The interface should be designed to convey “just enough” information: enough that the human can determine whether intervention is needed, and not so much that it causes overload.
2. Visibility of system status (Nielsen). The system should always keep users informed about what is going on, through appropriate feedback within reasonable time. The system should convey its world model to the user so that the user has a full understanding of the world as it appears to the system.
3. Appropriate information presentation (Scholtz). The interface should present sensor information that is clear, easily understood, and in the form most useful to the user. The system should utilize the principle of recognition over recall.
4. Match between system and the real world (Nielsen, Scholtz). The language of the interaction between the user and the system should be in terms of words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.
5. Synthesis of system and interface (None). The interface and system should blend together so that the interface is an extension of the system itself. The interface should facilitate efficient and effective communication between system and user and vice versa.
6. Help users recognize, diagnose, and recover from errors (Nielsen, Scholtz). System malfunctions should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution. The system should present enough information about the task environment that the user can determine whether some aspect of the world has contributed to the problem.
7. Flexibility of interaction architecture (Scholtz). If the system will be used over a lengthy period of time, the interface should support the evolution of system capabilities, such as sensor and actuator capacity, behavior changes and physical alteration.
8. Aesthetic and minimalist design (Nielsen, Mankoff, Norman). The system should not contain information that is irrelevant or rarely needed. The physical embodiment of the system should be pleasing in its intended setting.

Heuristics 1, 2, and 3 all deal with the handling of information in an HRI interface, reflecting the importance of data processing in HRI tasks. Heuristic 8 signifies the potential importance of emotional responses to robotic systems.
Heuristic 5 is indicative of an interface’s ability to immerse the user in the system, making operation easier and more intuitive. Heuristics 4, 5 and 6 all deal with the form communication takes between the user and system and vice versa. Finally, heuristic 7 reflects the variable nature of HRI platforms. Since the heuristics are intended for HRI systems, they focus only on the characteristics distinct to HRI. Many human-robot interfaces (especially those that are for the most part traditional desktop software applications) can and do have usability problems associated with ‘normal’ HCI issues (e.g., widget size or placement), but those problems can be addressed by traditional HCI evaluations.

HRI HEURISTICS DEVELOPMENT AND USE

After creating our initial list through brainstorming and informed synthesis of existing work, we followed a methodology similar to [4] [14] and [17] and conducted an initial study using the new heuristics. This study had a primary and a secondary purpose: first, to uncover problems with and provide feedback on our initial list so that we could revise the heuristics as appropriate; second, to investigate what differences, if any, exist between evaluators with backgrounds in HCI and those with backgrounds in robotics.

Research Method

Before subject recruitment, we first had to determine the object of the evaluation—that is, the robotic system to evaluate and the context in which to evaluate it. Because the focus of this work is on the development of heuristics and not the development of robotic systems, we used an existing hardware platform: the Mobile Robot Lab’s4 Segway RMP,5 which has been modified from the stock model to include a number of additional capabilities. The primary modifications consist of a SICK laser rangefinder, a color streaming video camera, an on-board computer to support the sensors, and a protective PVC pipe frame with kill switches (see Fig. 1).
Aside from its relative novelty, we were drawn to the Segway as our platform choice because we expected its unique auto-balancing design could present some interesting issues for the evaluators to consider.

Figure 1 – Modified Segway platform.

4 http://www.cc.gatech.edu/ai/robot-lab/
5 http://www.segway.com/segway/rmp/

Our second major decision was what evaluation environment to use for the platform we had chosen. Unlike most of traditional HCI, the suitability of a robotic system and interface depends heavily on the context in which it is used: an interface used in a command center will have very different requirements from one used in combat; likewise, a system designed explicitly for entertainment is not likely to be useful as a health care service robot. Because the suitability of a robot depends so much on its intended task environment, we needed one that was relatively precise. Specifying an environment as, for example, ‘reconnaissance’ is not sufficient, since details like possible terrain, operator location, possibility of enemy forces, etc. are essential to judging a system’s efficacy for reconnaissance-style tasks. It is clear that the more realistically (and the more completely) we delineate the task environment, the better we can evaluate robotic systems operating in it. But defining such a framework is not a simple undertaking, nor is it one related to the focus of this work. Consequently, instead of outlining some contrived circumstance in which to assess our system, we chose to make use of a task environment that has already been developed and described—the Rescue Robot League6 held at various worldwide locations as part of the RoboCup competition. If anything, this environment is almost too well-specified, since it includes clear rules for scoring the performance of a robotic contestant. Though an explicit score is not to be expected in a real-world environment, using our heuristics in the USAR competition setting is still natural—prospective student entrant groups to the competition are exactly the kinds of people who could benefit most from a cheap, quick evaluation method.

6 http://www.isd.mel.nist.gov/projects/USAR/competitions.htm

After we determined the object of our test for the HRI heuristics, we recruited 10 volunteers from the Georgia Tech College of Computing to serve as the heuristic analysts. By design, all of the volunteers were graduate students with specialties in either HCI or robotics, with exactly half in each area. There was also an exactly even split along gender lines. The experimental procedure was similar to the standard HE process already described. We held an introductory meeting with all of the evaluators, at which we handed out information packets containing all the materials necessary for their assessments. We also conducted a short 7-point Likert scale demographics survey for each inspector. The documents in the information packet included notes on our modified Segway robotic platform and its software interface, the evaluation context (i.e., the RoboCup Rescue competition description and rules), descriptions of the heuristics themselves, and copies of problem report forms. We went over these documents to answer evaluator questions, and then showed the evaluators a brief live demonstration of teleoperating the robot. After this demonstration, the evaluators were instructed to fill out report forms for all usability problems they perceived, indicating which heuristics were violated. The evaluators were given a week to complete their appraisals and return their problem reports.

DISCUSSION

Analysis Methodology

The method we have described follows the example of [4] more than [17] in that we are not comparing our heuristics against another set (such as Nielsen’s). For our primary research purpose, we would like to answer two questions:

1. How do inspectors perform using the HRI heuristic set to evaluate usability problems?
2. How widely does the performance of individual evaluators vary?

Our procedure for answering these questions will again be similar to [4] (which was based in turn on Nielsen’s work [19], [20]): distill the raw problem reports into a synthesized list suitable for comparison across evaluators. That information will allow us to determine more easily which heuristics were most or least useful, what kinds of problems were not easily explained by our heuristics (components of our first question), and how problem reports differed from inspector to inspector (our second question).

For our secondary purpose (investigating differences between HCI- and robotics-focused researchers), we also propose the following research hypotheses. We will refer to evaluators who identified themselves as specializing in HCI and robotics as conditions ‘HCI’ and ‘IS’, respectively.

1. The number and severity of problems found by evaluators in condition HCI will not be significantly different from those found in condition IS.
2. The problems found in condition HCI will be more likely to focus on software than hardware issues; likewise, problems found in condition IS will be more likely to concern hardware than software.

We should emphasize once more that this work concerns only the development and quality of our heuristics, not the system we chose as our evaluation subject. As such, our analysis methods do not focus on the usability of the system at all, except as it relates to how our heuristics were used.

Results

Our analysis of the evaluator data is ongoing. The demographics survey showed a mean age of 28.47 and an average of 3.4 years of graduate schooling.

7 One subject did not report an age; another reported an age range. This figure is the average of the 8 other ages and the midpoint of the reported range.

Though our data analysis is not complete, we can speculate on a number of issues and critique our methodology thus far based on our experience:

Another task environment may have improved the quality of our study. The Segway is an intriguing platform, and one that has been considered for (outdoor) USAR applications [35]. However, it is not well-suited to indoor USAR applications, for a variety of reasons. Our intention was to present an HRI system with obvious room for improvement to provide fertile ground for our inspectors, but our initial impression was that the mismatch of platform and environment proved a confusing distraction for our participants. An outdoor arena and competition is in development,8 however, and may provide grounds for future work.

We did not incorporate as much of the existing work in HRI into our heuristics as we should have. This assertion is, at this point, not backed by anything other than experience. But notably, none of our heuristics specifically addresses situational awareness, which has been recognized as a critical factor for HRI (especially teleoperated) systems. Likewise, Olsen and Goodrich’s metrics were not consulted as extensively as merited.

We made the decision not to customize the heuristics in our study to the evaluation system. In most HE studies, this would have been an error, but since we are attempting to examine the general heuristics themselves, we made the judgment not to make any modifications to the guidelines. It remains to be seen whether this was an appropriate decision, although we suspect the addition of some examples or clarifying text for each heuristic might have been useful.

Similarly, we also suspect our heuristics are not targeted specifically enough to HRI. This is partly a result of our attempt to have broadly applicable heuristics, and partly a result of the author’s (and the field’s) evolving knowledge of what specific information to include.
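Once the problem reports are synthesized, hypotheses like those stated above can be checked with a simple permutation test on per-evaluator problem counts; the sketch below uses invented placeholder counts, not our study's actual data.

```python
import random

def permutation_test(a, b, trials=10000, seed=0):
    """Two-sided permutation test for a difference in group means:
    repeatedly reshuffle the pooled counts into two groups and see how
    often the shuffled difference exceeds the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical problem counts for the five evaluators in each condition.
hci = [12, 9, 14, 10, 11]
is_ = [8, 13, 10, 9, 12]
p = permutation_test(hci, is_)
print(p > 0.05)  # with these overlapping counts the difference is not significant
```

A permutation test is attractive here because the samples are tiny (five per condition) and no normality assumption is needed; any domain-appropriate statistic could replace the difference in means.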
We expect the next iteration of the heuristics to improve dramatically as we (the author and the field) gain additional experience with HRI evaluation. Along with the results of our data analysis, these issues will be important to consider for the next iteration of our heuristics. Additionally, many of the issues discussed at the IEEE/IFRR Summer Robotics School 2004 (SRS 2004) on Human-Robot Interaction are directly applicable to this work and to improving the heuristics themselves. Though SRS 2004 and the other issues mentioned here provide an ample base from which to construct a modified heuristic list, we will refrain from doing so here. Aside from potentially coloring our interpretation of the evaluator reports and comments, it is more sensible to wait until all possible data is available before making judgments about what changes are necessary.

8 See http://www.isd.mel.nist.gov/projects/USAR/

CONCLUSIONS AND FUTURE WORK

We have summarized current and past work in HRI system development and evaluation, and examined heuristic evaluation (HE), a usability inspection method from HCI. We have found that there is indeed a need for formative and discount evaluation techniques such as HE, but that current heuristic sets are not adequate for application to HRI systems. However, previous work has validated the concept of adapting HE to new problem domains, and those domains share some of the differences between traditional HCI and HRI systems, indicating that adapting a set of heuristics for HRI is a fruitful endeavor.

To that end, we have proposed an initial set of heuristics intended for single-operator, single-agent human-robot interaction systems. We have set aside other task environments to simplify the development process, since the validation and focus of heuristics for multi-operator or multi-agent systems differ significantly from the base case.

We have also outlined the procedure we followed for an initial study of the effectiveness of our heuristics. Our analysis of the data gathered will guide the evolution of the heuristics toward a more stable list of guidelines. This will likely require an additional, larger study to validate that the heuristics enable 3-5 inspectors to find 40-60% of known usability issues.

Looking beyond the immediate task of data analysis, the heuristics need to be exercised in more realistic settings. Using robotics competitions such as RoboCup to locate suitable evaluation subjects (motivated groups with complex HRI systems) is a promising avenue [31]. Most desirable would be the application of HE and our heuristics to a real-world, truly formative study; applications to systems as in [31] would be valuable, but systems already in competition are relatively far along in the development process.

Other possible future investigations include the use of different heuristic sets for each of Scholtz's interaction roles [27], and widening our scope to include multi-operator and multi-agent systems. As a larger goal, we hope the adaptation of this HCI evaluation technique will synergistically increase the use of other HCI methodologies (such as user-centered design): the presence of a validated formative discount technique for HRI systems should encourage more iterative design processes and the adaptation of other HCI methods to HRI.

ACKNOWLEDGEMENTS

Thanks to Yoichiro Endo for reviewing an early version of this report, and to the volunteer evaluation participants.

REFERENCES

[1] Adams, J. Critical Considerations for Human-Robot Interface Development. In Proceedings of the 2002 AAAI Fall Symposium on Human-Robot Interaction, November 2002.
[2] Agah, A. Human interactions with intelligent systems: research taxonomy. Computers and Electrical Engineering, vol. 27, pp. 71-107, 2001.
[3] Ali, K. Multiagent Telerobotics: Matching Systems to Tasks. Ph.D. dissertation, Georgia Tech College of Computing, 1999.
[4] Baker, K., Greenberg, S. and Gutwin, C. Empirical development of a heuristic evaluation methodology for shared workspace groupware. In Proceedings of CSCW 2002, pp. 96-105. ACM Press, 2002.
[5] Breazeal (Ferrell), C. A Motivational System for Regulating Human-Robot Interaction. In Proceedings of AAAI-98, Madison, WI, 1998.
[6] Burke, J., Murphy, R.R., Rogers, E., Scholtz, J., and Lumelsky, V. Final Report for the DARPA/NSF Interdisciplinary Study on Human-Robot Interaction. Also appears as IEEE Systems, Man and Cybernetics Part A, Vol. 34, No. 2, May 2004.
[7] Casper, J., and Murphy, R. Human-Robot Interactions during the Robot-Assisted Urban Search and Rescue Response at the World Trade Center. IEEE Transactions on Systems, Man and Cybernetics Part B, Vol. 33, No. 3, pp. 367-385, June 2003.
[8] Collet, T. and MacDonald, B. User Oriented Interfaces for Human-Robot Interaction.
[9] Dudek, G., Jenkin, M., Milios, E. and Wiles, D. A Taxonomy for Multi-Agent Robotics. Autonomous Robots, Vol. 3, pp. 375-397, 1996.
[10] Endo, Y., MacKenzie, D.C., and Arkin, R. Usability Evaluation of High-Level User Assistance for Robot Mission Specification. IEEE Transactions on Systems, Man, and Cybernetics Part C, Vol. 34, No. 2, May 2004 (in press).
[11] Fong, T., Thorpe, C., and Baur, C. Collaboration, dialogue and human-robot interaction. In Proceedings of the 10th International Symposium on Robotics Research, Lorne, Victoria, Australia. Springer, 2001.
[12] Fong, T., Nourbakhsh, I. and Dautenhahn, K. A survey of socially interactive robots.
Special issue on Socially Interactive Robots, Robotics and Autonomous Systems, 42 (3-4), 2002.
[13] Georgia Tech College of Computing and Georgia Tech Research Institute. Real-time Cooperative Behavior for Tactical Mobile Robot Teams; Skills Impact Study for Tactical Mobile Robot Operational Units. DARPA Study, 2000.
[14] Greenberg, S., Fitzpatrick, G., Gutwin, C. and Kaplan, S. Adapting the Locales framework for heuristic evaluation of groupware. In Proceedings of OZCHI'99, 28-30, 1999.
[15] Haigh, K. and Yanco, H. Automation as Caregiver: A Survey of Issues and Technologies. In Proceedings of the AAAI-2002 Workshop on Automation as Caregiver: The Role of Intelligent Technology in Elder Care, Edmonton, Alberta, 2002.
[16] Johnson, C., Adams, J. and Kawamura, K. Evaluation of an Enhanced Human-Robot Interface. In Proceedings of the 2003 IEEE International Conference on Systems, Man, and Cybernetics, 2003.
[17] Mankoff, J., Dey, A.K., Hsieh, G., Kientz, J., Ames, M. and Lederer, S. Heuristic evaluation of ambient displays. CHI 2003, ACM Conference on Human Factors in Computing Systems, CHI Letters 5(1), pp. 169-176, 2003.
[18] McDonald, M. Active Research Topics in Human Machine Interfaces. Sandia National Laboratory Report, December 2000.
[19] Molich, R., and Nielsen, J. Improving a human-computer dialogue. Communications of the ACM, 33(3), pp. 338-348, March 1990.
[20] Nielsen, J. Enhancing the explanatory power of usability heuristics. In Proceedings of the CHI 94 Conference on Human Factors in Computing Systems, Boston, MA, 1994.
[28] Scholtz, J. Evaluation methods for human-system performance of intelligent systems. In Proceedings of the 2002 Performance Metrics for Intelligent Systems (PerMIS) Workshop, 2002.
[29] Sheridan, T. Eight ultimate challenges of human-robot communication. In Proceedings of the IEEE International Workshop on Robot-Human Interactive Communication (RO-MAN), pp. 9-14, 1997.
[21] Nielsen, J. How to Conduct a Heuristic Evaluation.
http://www.useit.com/papers/heuristic/heuristic_evaluation.html.
[22] Norman, D. The Design of Everyday Things. Doubleday, New York, 1990.
[23] Nourbakhsh, I., Bobenage, J., Grange, S., Lutz, R., Meyer, R. and Soto, A. An affective mobile robot educator with a full-time job. Artificial Intelligence, 114(1-2), pp. 95-124, 1999.
[24] Olivares, R., Zhou, C., Adams, J. and Bodenheimer, B. Interface Evaluation for Mobile Robot Teleoperation. In Proceedings of the ACM Southeast Conference (ACMSE03), pp. 112-118, Savannah, GA, March 2003.
[25] Olsen, D. and Goodrich, M. Metrics for Evaluating Human-Robot Interactions. In Proceedings of PerMIS 2003, September 2003.
[26] Schipani, S. An Evaluation of Operator Workload During Partially-Autonomous Vehicle Operation.
[30] Taylor, R., Jensen, P., Whitcomb, L., Barnes, A., Kumar, R., Stoianovici, D., Gupta, P., Wang, Z., deJuan, E. and Kavoussi, L. A Steady-Hand Robotic System for Microsurgical Augmentation. International Journal of Robotics Research, 18(12), pp. 1201-1210, December 1999.
[31] Yanco, H., Drury, J. and Scholtz, J. Beyond Usability Evaluation: Analysis of Human-Robot Interaction at a Major Robotics Competition. Journal of Human-Computer Interaction, to appear, 2004.
[32] Yanco, H. and Drury, J. A Taxonomy for Human-Robot Interaction. In Proceedings of the AAAI Fall Symposium on Human-Robot Interaction, AAAI Technical Report FS-02-03, pp. 111-119, 2002.
[33] Horvitz, E. Principles of Mixed-Initiative User Interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 159-166. ACM Press, 1999.
[34] Scholtz, J. HRI Evaluation. Lecture at the IEEE/IFRR Summer Robotics School 2004, Volterra, Italy.
[35] Arkin, R. Personal communication, July 2004.
[36] Woolrych, A. and Cockton, G. Why and When Five Test Users Aren't Enough. In Proceedings of IHM-HCI, 2001.
[37] Endsley, M., Sollenberger, R. and Stein, E. Situation Awareness: A Comparison of Measures.
In Proceedings of the Human Performance, Situation Awareness and Automation Conference, October 2000.
[27] Scholtz, J. Theory and Evaluation of Human-Robot Interaction. In Proceedings of the Hawaii International Conference on System Science (HICSS), Waikoloa, Hawaii, 2003.
[38] Goodrich, M. and Olsen, D. Seven Principles of Efficient Interaction. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, October 5-8, 2003, pp. 3943-3948.

APPENDIX A
OTHER HEURISTIC SETS

Nielsen's Canonical Heuristics [21]

Visibility of system status: The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.

Match between system and the real world: The system should speak the users' language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.

User control and freedom: Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.

Consistency and standards: Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.

Error prevention: Even better than good error messages is a careful design which prevents a problem from occurring in the first place.

Recognition rather than recall: Make objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.

Flexibility and efficiency of use: Accelerators -- unseen by the novice user -- may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users.
Allow users to tailor frequent actions.

Aesthetic and minimalist design: Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.

Help users recognize, diagnose, and recover from errors: Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.

Help and documentation: Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large.

Ambient Display Heuristics [17]

Sufficient information design: The display should be designed to convey "just enough" information. Too much information cramps the display, and too little makes the display less useful.

Consistent and intuitive mapping: Ambient displays should add minimal cognitive load. Cognitive load may be higher when users must remember what states or changes in the display mean. The display should be intuitive.

Match between system and real world: The system should speak the users' language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.

Visibility of state: An ambient display should make the states of the system noticeable. The transition from one state to another should be easily perceptible.

Aesthetic and pleasing design: The display should be pleasing when it is placed in the intended setting.

Useful and relevant information: The information should be useful and relevant to the users in the intended setting.

Visibility of system status: The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
User control and freedom: Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.

Easy transition to more in-depth information: If the display offers multi-leveled information, the display should make it easy and quick for users to find out more detailed information.

"Peripherality" of display: The display should be unobtrusive and remain so unless it requires the user's attention. The user should be able to easily monitor the display.

Error prevention: Even better than good error messages is a careful design which prevents a problem from occurring in the first place.

Flexibility and efficiency of use: Accelerators -- unseen by the novice user -- may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.

CSCW/Groupware Heuristics [4]

Provide the means for intentional and appropriate verbal communication: The prevalent form of communication in most groups is verbal conversation. The mechanism by which we gather information from verbal exchanges has been coined intentional communication and is used to establish a common understanding of the task at hand. Intentional communication usually happens in one of three ways:
1. People talk explicitly about what they are doing and where they are working within a shared workspace.
2. People overhear others' conversations.
3. People listen to the running commentary that others tend to produce alongside their actions.

Provide the means for intentional and appropriate gestural communication: Explicit gestures are used alongside verbal exchanges to carry out intentional communication as a means to directly support the conversation and convey task information. Intentional gestural communication can take many forms.
Illustration occurs when speech is acted out or emphasized, e.g., signifying distances with a gap between your hands. Emblems occur when actions replace words, such as a nod of the head indicating 'yes'. Deictic reference or deixis happens when people reference workspace objects with a combination of intentional gestures and voice communication, e.g., pointing to an object and saying "this one".

Provide consequential communication of an individual's embodiment: A person's body interacting with a physical workspace is a continuous and immediate information source with many degrees of freedom. In these settings, bodily actions unintentionally "give off" awareness information about what's going on, who is in the workspace, where they are, and what they are doing. This visible activity of unintentional body language and actions is fundamental for creating and sustaining teamwork, and includes:
1. Actions coupled with the workspace, including gaze awareness (where someone is looking), seeing someone move towards an object, and hearing sounds as people go about their activities.
2. Actions coupled to conversation: the subtle cues picked up from our conversational partners that help us continually adjust our verbal behaviour. Cues may be visual (e.g., facial expressions) or verbal (e.g., intonation and pauses). These provide conversational awareness that helps us maintain a sense of what is happening in a conversation.

Provide consequential communication of shared artifacts (i.e., artifact feedthrough): Consequential communication also involves information unintentionally given off by physical artifacts as they are manipulated. This information is called feedback when it informs the person manipulating the artifact, and feedthrough when it informs others who are watching. Seeing and hearing an artifact as it is being handled helps to determine what others are doing to it.
Identifying the person manipulating the artifact helps to make sense of the action and to mediate interactions.

Provide protection: Concurrent activity is common in shared workspaces, where people can act in parallel and simultaneously manipulate shared objects. Concurrent access of this nature is beneficial; however, it also introduces the potential for conflict. People should be protected from inadvertently interfering with work that others are doing, or altering or destroying work that others have done. To avoid conflicts, people naturally anticipate each other's actions and act based on their predictions of what others will do. Therefore, collaborators must be able to keep an eye on their own work, noticing what effects others' actions could have and taking action to prevent certain activities. People also follow social protocols for mediating their interactions and minimizing conflicts. Of course, there are situations where conflict can occur (e.g., accidental interference). People are also capable of repairing the negative effects of conflicts and consider it part of the natural dialog.

Manage the transitions between tightly and loosely coupled collaboration: Coupling is the degree to which people are working together. It is also the amount of work that one person can do before they require discussion, instruction, information, or consultation with another. People continually shift back and forth between loosely and tightly coupled collaboration, moving fluidly between individual and group work. To manage these transitions, people need to maintain awareness of others: where they are working and what they are doing. This allows people to recognize when tighter coupling could be appropriate, e.g., when they see an opportunity to collaborate or assist others, when they need to plan their next activity, or when they have reached a stage in their task that requires another's involvement.
Support people with the coordination of their actions: An integral part of face-to-face collaboration is how group members mediate their interactions by taking turns and negotiating the sharing of the common workspace. People organize their actions to help avoid conflict and efficiently complete the task at hand. Coordinating actions involves making some tasks happen in the right order and at the right time while meeting the task's constraints. Within a shared workspace, coordination can be accomplished via explicit communication and the way objects are shared. At the fine-grained level, awareness helps coordinate people's actions as they work with shared objects, e.g., awareness helps people work effectively even when collaborating over a very small workspace. On a larger scale, groups regularly reorganize their division of labour based on what the other participants are doing and have done, what they are still going to do, and what is left to do in the task.

Facilitate finding collaborators and establishing contact: Most meetings are informal: unscheduled, spontaneous or one-person initiated. In everyday life, these meetings are facilitated by physical proximity, since co-located individuals can maintain awareness of who is around. People frequently come in contact with one another through casual interactions (e.g., bumping into people in hallways) and are able to initiate and conduct conversations effortlessly. While conversations may not be lengthy, much can occur, such as coordinating actions and exchanging information. In electronic communities, the lack of physical proximity means that other mechanisms are necessary to support awareness and informal encounters.

APPENDIX B
SURVEY DOCUMENTS

HRI Heuristic Evaluation Case Study – Demographics Survey

Gender: Male / Female
Age: __________________
What is your area of specialization in the College of Computing?
_______________________________________________________

How many years of graduate study in the College of Computing have you completed? 1 / 2 / 3 / 4 / 5 / 6 / 7+

Please rate your familiarity with the following areas of study:
(1 – not at all familiar with the term; 7 – capable of lecturing on the subject)

Human-Computer Interaction 1 / 2 / 3 / 4 / 5 / 6 / 7
Heuristic Evaluation 1 / 2 / 3 / 4 / 5 / 6 / 7
Intelligent Systems 1 / 2 / 3 / 4 / 5 / 6 / 7
Robotics 1 / 2 / 3 / 4 / 5 / 6 / 7
Human-Robot Interaction 1 / 2 / 3 / 4 / 5 / 6 / 7

Have you participated in or conducted a heuristic evaluation before (circle both if applicable)? No / Participated / Conducted

Evaluation Context

The system description you have been given is that of a partially implemented hardware/software solution; portions that are not fully integrated or implemented in the current interface are noted as such for clarity. The goal of this evaluation is to examine the suitability of the proposed human-robot system for a particular task environment: urban search and rescue (USAR). You may note that the system as described could have trouble meeting some of the basic requirements of the USAR competition task. This is expected; simply do your best to enumerate the specific characteristics of the system (hardware or software) that are at fault. This task environment is intended to mirror that of the AAAI RoboCup competitions. We have reproduced much of the content of http://www.isd.mel.nist.gov/projects/USAR/competitions.htm below to illustrate the details.

Competition Overview

The goal of the urban search and rescue (USAR) robot competitions is to increase awareness of the challenges involved in search and rescue applications, provide objective evaluation of robotic implementations in representative environments, and promote collaboration between researchers.
It requires robots to demonstrate their capabilities in mobility, sensory perception, planning, mapping, and practical operator interfaces, while searching for simulated victims in unstructured environments. As robot teams begin demonstrating repeated successes against the obstacles posed in the arenas, the level of difficulty will be increased accordingly so that the arenas provide a stepping-stone from the laboratory to the real world. Meanwhile, the yearly competitions will provide direct comparison of robotic approaches, objective performance evaluation, and a public proving ground for field-able robotic systems that will ultimately be used to save lives.

Competition Vision

When disaster happens, minimize risk to search and rescue personnel, while increasing victim survival rates, by fielding teams of collaborative robots which can:
- Autonomously negotiate compromised and collapsed structures
- Find victims and ascertain their conditions
- Produce practical maps of their locations
- Deliver sustenance and communications
- Identify hazards
- Emplace sensors (acoustic, thermal, hazmat, seismic, etc.)
- Provide structural shoring
...allowing human rescuers to quickly locate and extract victims.

It is ideal for the robots to be capable of all the tasks outlined in the vision, with the first three directly encouraged in the current performance metric (see Fig 5). The remaining tasks will be emphasized in future versions of the performance metric.

Rescue Scenario

A building has partially collapsed due to earthquake. The Incident Commander in charge of rescue operations at the disaster site, fearing secondary collapses from aftershocks, has asked for teams of robots to immediately search the interior of the building for victims. The mission for the robots and their operators is to find victims, determine their situation, state, and location, and then report back their findings in a map of the building and a victim data sheet.
The section near the building entrance appears relatively intact, while the interior of the structure exhibits increasing degrees of collapse. Robots must negotiate the lightly damaged areas prior to encountering more challenging obstacles and rubble. The robots are considered expendable in case of difficulty.

USAR Arenas and Victims

The rescue arenas constructed to host these competitions are based on the Reference Test Arenas for Urban Search and Rescue Robots developed by the U.S. National Institute of Standards and Technology (NIST). Three arenas, named yellow, orange, and red to indicate their increasing levels of difficulty, form a continuum of challenges for the robots. A maze of walls, doors, and elevated floors provides various tests for robot navigation and mapping capabilities. Variable flooring, overturned furniture, and problematic rubble provide obvious physical obstacles. Sensory obstacles, intended to confuse specific robot sensors and perception algorithms, provide additional challenges. Intuitive operator interfaces and robust sensory fusion algorithms are highly encouraged to reliably negotiate the arenas and locate victims.

Figure 2: Rescue Robot League Arenas: a) yellow, b) orange, and c) red

The objective for each robot in the competition, and the incentive to traverse every corner of each arena, is to find simulated victims. Each simulated victim is a clothed mannequin emitting body heat and other signs of life, including motion (shifting, waving), sound (moaning, yelling, tapping), and/or carbon dioxide to simulate breathing. Particular combinations of these sensor signatures imply the victim's state: unconscious, semi-conscious, or aware.

Figure 3: Simulated Victim Signs of Life

Each victim is placed in a particular rescue situation: surface, trapped, void, or entombed. The victims are distributed throughout the environment in roughly the same situational percentages found in actual earthquake statistics.
Each victim also displays an identification tag that is usually placed in hard-to-reach areas around the victim, requiring advanced robot mobility to identify. Once a victim is found, the robot(s) must determine the victim's location, situation, state, and tag, and then report their findings on a human-readable map.

Figure 4: Simulated Victim Situations: a) surface, b) trapped, c) void, and d) entombed

Rules and Scoring Metric

The competition rules and scoring metric both focus on the basic USAR tasks of identifying live victims, determining victim condition, providing accurate victim location, and enabling victim recovery, all without causing damage to the environment. The teams compete in several missions lasting twenty minutes, with the winner achieving the highest cumulative score from all missions (the lowest score is dropped). The performance metric used for scoring encourages identification of detailed victim information through multiple sensors, along with generation of easily understandable and accurate maps of the environment. It also encourages teams to minimize the number of operators, which may be achieved through better operator interfaces and/or autonomous behaviors that allow effective control of multiple robots. Meanwhile, the metric discourages uncontrolled bumping behaviors that may cause secondary collapses or further injure victims. Finally, an arena weighting accounts for the difference in difficulty negotiating each of the three arenas; the more difficult the arena, the higher the arena weighting (score) for each victim found.

Figure 5: Performance metric used for scoring

System Description and Characteristics

Hardware

The mobile robot in this system is based on the Segway RMP, a robot based on the Segway HT platform (see http://www.segway.com/segway/rmp/). This platform is a two-wheeled, self-balancing base with connections for additional sensors and processing capabilities.
Some of its specifications (as taken from the above web site):
- 100 lb. payload
- zero turning radius
- 8 mph top speed
- 8-10 mile range

The particular robotic system you are examining has been modified from the stock system (shown at left). An onboard computer has been added to control various instruments, including sensor instruments and wireless networking for remote control. This hardware is mounted within a metal frame, inside which a spring-mounted sheet provides shock absorption for some of the more delicate components. Kill switches have also been mounted at the ends of a PVC pipe frame. The switches are triggered when the machine's motor cannot compensate for extreme pitch angles (caused by terrain type or slope, or sudden velocity changes). Front and rear views of the modified platform are shown below.

Fig 2: Above left: front ¾ view. Above right: rear ¾ view. (Labeled components: SICK laser rangefinder, camera, kill switches.)

The sensors include a SICK laser rangefinder and a streaming video camera. The laser rangefinder is configurable to have a range of 8 meters with 1 mm accuracy or an 80 meter range with 1 cm accuracy. It returns 361 distance measures along the 180° scan arc at a rate of about 5 scans per second. The SICK is currently mounted at a front-facing downward angle for the purposes of analyzing terrain features, but for the purposes of this evaluation it should be considered to be mounted parallel to the floor. It is also currently set to the 8 m / 1 mm accuracy configuration. The camera is an analog color NTSC video capture device connected to a video digitizer/compression unit that streams compressed video wirelessly to a decompression unit, which in turn can provide a real-time video feed to a display unit. The camera is on a fixed-forward mount, and the effective resolution after digitization is 640x480.

Operation and Software

The operator interfaces with the robot hardware through part of the MissionLab software developed at the Mobile Robot Lab.
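To make the scan data described above concrete: 361 readings across a 180° arc means one reading per 0.5°, and each reading is a range that can be projected into a point in the robot's frame. The sketch below illustrates that geometry; it is not MissionLab code, and the function name, frame convention (index 0 at the robot's right, x forward), and millimetre units (the 8 m / 1 mm mode) are assumptions for illustration.

```python
import math

def scan_to_points(ranges_mm):
    """Convert 361 SICK range readings (mm) to (x, y) points in metres.

    Assumed convention: index 0 points to the robot's right (-90 deg),
    index 360 to its left (+90 deg), with x forward and y to the left.
    """
    assert len(ranges_mm) == 361
    points = []
    for i, r in enumerate(ranges_mm):
        angle = math.radians(-90.0 + 0.5 * i)  # 0.5 deg per step over 180 deg
        r_m = r / 1000.0                       # mm -> m in the 8 m / 1 mm mode
        points.append((r_m * math.cos(angle), r_m * math.sin(angle)))
    return points
```

A reading of 1000 mm straight ahead (index 180) maps to the point one metre in front of the robot, which is the kind of projection an interface needs in order to draw the scan as a 2D outline of the surroundings.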
The basic interface includes a representation of the current joystick position and a real-time visual representation of the SICK laser scanner data (see Fig. 3 below). The figure shows data from a pair of SICK scanners mounted back-to-back, giving a 360° view of the surroundings; this setup is not implemented on the evaluation system.

Fig 3: SICK laser scanner data (360° view).

The operational setup is flexible, but generally relies on a dual-monitor configuration, with one monitor displaying only the video feed and the other displaying the data from the laser scanner and the joystick feedback window. The operator controls the robot via a standard analog PC joystick. Movement is only performed if the stick is moved while the front trigger button is also depressed. The analog readings mean that the movement speed is proportional to the deflection of the joystick from the center position (i.e., moving the stick slightly forward moves the robot forward slowly; moving the stick all the way forward moves the robot forward at its maximum speed).

Fig 4: Sample operational setup.

Conclusion

If you have any further questions, including questions about the capabilities or configuration of the software or hardware presented, or confusion about the system as it should be evaluated, please do not hesitate to contact me: edward.clarkson@cc.gatech.edu
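The trigger-gated, proportional control scheme described in the Operation and Software section can be sketched as follows. This is an illustrative model, not MissionLab code: the function name is hypothetical, and the 8 mph cap is taken from the platform specifications quoted earlier.

```python
# Sketch of the joystick mapping described above: no motion unless the
# front trigger is held (a dead-man behaviour), and speed proportional
# to stick deflection from the centre position.

MAX_SPEED_MPH = 8.0  # Segway RMP top speed from the specifications above

def command_speed(deflection, trigger_held):
    """Map stick deflection in [-1.0, 1.0] to a speed command in mph.

    Returns 0 unless the front trigger button is depressed.
    """
    if not trigger_held:
        return 0.0
    deflection = max(-1.0, min(1.0, deflection))  # clamp noisy readings
    return deflection * MAX_SPEED_MPH
```

Under this mapping, half deflection with the trigger held commands half of the platform's top speed, and releasing the trigger stops the robot regardless of stick position.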