WHICH CAME FIRST: THE USER OR THE INTERFACE?
Keeping the human in the loop from design phase to finished product
Dennis Perzanowski
Code 5512
Naval Research Laboratory
Washington, DC 20375
dennisp@aic.nrl.navy.mil
Abstract
We believe that cognitively modeled robotic systems will be
easier to use. Just as humans communicate with one
another, facilitated by shared systems (linguistic, social, and
cognitive), human-robot interaction can be similarly
facilitated. Without cognitive capabilities, communication
between a human and a robot is more difficult and perhaps
subject to failure. Therefore, we have been
involved in designing and implementing a cognitively
modeled human-robot interface. Based on a pilot study of
our interface, we discuss the importance of human subject
studies needed to verify the various claims of a robotics
interface that purports to be cognitively modeled.
I. Introduction
A word of caution is needed up front. In lieu of a formal
paper presenting and analyzing results, this position paper
outlines the reasons for a formal report on a human subject
study concerning human-robot
interaction. Some preliminary results and analyses were
already presented elsewhere [Perzanowski, et al., 2003],
and we are currently analyzing the results of the more
extensive study conducted recently. The current paper may
sound suspiciously like a production guide for product
development; to some extent, this is not far from the truth.
What follows is offered with some hindsight on interface
design and development and on product testing in the form
of human subject studies, but mostly it is the result of some
serious self-evaluation. (The opinions expressed in this paper
are solely those of the author and do not necessarily reflect
those of the author’s co-workers and co-researchers; errors of
omission or commission are his alone.)
One might ask what a human subject interaction study
has to do with cognitive science and robotics. The answer
is rather straightforward: human subject studies are needed
to verify the various claims of a robotics interface that
purports to be cognitively modeled. However, this is
somewhat of a mixed metaphor: we are talking about
human cognitive capabilities, and yet we keep throwing in
the terms robot and robotics.
We believe that robotic systems that are cognitively
modeled will be easier to use. Of course using the term
“easier” in the arena of human-computer interaction is like
waving a red cloth in a bullring. However, if we can
abstract away from what the specifics of “easier” might
entail, and simply generalize on the process that enables
humans to communicate with one another, we claim that it
is the shared systems (linguistic, social, and cognitive,
among others perhaps) that facilitate human
communication. We claim that robots that think and act
more like humans will likewise be easier to interact with
because they possess human cognitive capabilities and
exhibit human behaviors where and when appropriate.
Furthermore, without making this an argument about
a Chinese Room [Searle, 1984], we simply claim that
without cognitive capabilities, communication between a
human and a robot is more difficult and perhaps
subject to failure. Cooperative and collaborative
interactions may suffer because humans have difficulty
interacting with machines, rather than with agents that
embody cognitive skills akin to their own. Our story goes
something like this: one person may be able to help
another individual solve a problem because they can see
the problem from the other person’s point of view, share
similar mental representations of events and objects, and
communicate in shared modes. Therefore, we
wish to transfer these cognitive processes and
representations to the realm of human-robot interaction.
Too often, it seems, interfaces are designed by engineers
and computer scientists who think they understand why
and how people interact with computer/robot interfaces.
Interfaces are designed in laboratory environments,
logically planned, intelligently designed, skillfully
implemented, with little or no actual user input, other than
what the designers think actual human users might want or
use. Granted, engineers and computer scientists are human
users too, but many people for whom some interfaces are
designed are not scientifically trained, mathematically
oriented, or savvy in the ways of the CPU. Many users of
our interfaces will be, dare we say, ordinary folk who may
or may not be computer literate, although more and more
people have become so in the past decade. But the
advent of widespread computer literacy notwithstanding, we
can still maintain the position that the “ordinary” user must
never be lost sight of. Granted, a military robot must be a
comrade in arms familiar with all of the skills that are
commensurate with the art and science of warfare, but a
nursebot or a companion robot will probably have to be a
bit more “down home” and personal. Furthermore, even
the most militaristic robot, the most medically astute
nursebot, or the most chatty home companion robot should
have the ability to communicate appropriately with the
human user in a way that the human user finds natural and
habitable, and it should behave in ways that the human user
finds natural and customary. In other words, the user
should be happy to get a correct response or reaction, not
surprised by the method or mode of the response.
The user classification scale, therefore, may be a sliding
one, ranging gradually from high levels of expertise down to
something like shared ignorance coupled with a willingness
to learn. However, designers of those interfaces must always
have an eye on that scale which reflects the abilities of the
user. People who design interfaces for physically disabled
or challenged individuals would never think of designing
an interface for such a person without having that
individual in the loop in the design and implementation
phase. However, engineers and computer scientists who
design robot interfaces frequently forget the client and
think that simply because they themselves are
representatives of a class of users, their actions and
interactions will suffice to produce an adequate interface
for everyone.
This is not to say that there isn’t a wealth of human
factors and human-computer interaction research and
literature on the subject. The focus in this position paper is
on the design and implementation phase of the interface,
not on any post hoc investigation. Sadly, but honestly, a
large portion of the current author’s research in human-robot
interaction lacks this pre-design-and-implementation
survey of human user preferences. However, a certain turn
of events altered this history recently when questions
regarding the incorporation of cognitive skills and
behaviors were introduced into the research [Trafton, et al.,
forthcoming].
We had put together a multimodal interface to a mobile
robot that incorporated speech and natural language
understanding, and natural and symbolic gesturing (the latter
being our nomenclature for interactions with a Personal
Digital Assistant display and with a touch screen)
[Perzanowski, et al., 1998, 2002, 2003; Skubic, et al.,
2004]. Our emphasis was on establishing and maintaining
“natural” communications between humans and robots as
embodied by natural speech and gestures. We were quite
content with demonstrating our multimodal interface,
which showed off our talents and accomplishments quite
nicely. Of course, it couldn’t have done otherwise! After
all, we had designed the interface; we knew how to use it;
we knew inherently its strengths and weaknesses and could
selectively exercise it accordingly. However, when it came
to putting the interface in the hands of real users, we
discovered that we really didn’t know or understand how
the typical human user approached and wanted to use our
interface. We, therefore, conducted a pilot study
[Perzanowski, et al., 2003], and based on the analysis
therein, were somewhat amazed by the results.
II. Of Things Proustian and Wellsian
(Proust, M., Remembrance of Things Past, 1913; Wells, H.G., The Shape of Things to Come, 1933.)
A few years ago, at the 2000 AAAI Spring Symposium, we
participated in a workshop entitled “My Dinner with
R2D2,” and basically came away with the idea
[Perzanowski, et al., 2000] that the best menu for human-robot interaction could be provided by taking a caviar-and-champagne approach: give the human user the best tools
with a wide range of choices and capabilities. We built the
interface and reported on it in numerous places, and then
decided to conduct a human subject study [Perzanowski, et
al., 2003] to see just how people used what we had
designed and implemented.
Granted, subjects of the pilot study used a version of the
interface that we had been using, so it wasn’t as if we were
asking these people to design the interface for us. The
interface employed both a touch screen with live video
feedback (a robot’s eye view) and a plan or mapped view
of the robot’s environment (Figure 1).
Figure 1. Touch screen display of robot’s eye view (left) and mapped
representation (right) of environment. (The large dot on the pillar in the
left display indicates where participants touched the screen.)
Whenever a subject touched the video display, a large
red dot appeared to provide feedback to the user. Natural
language capabilities were also provided. In the map
representation of the room, the robot was indicated by a red
circle, with its orientation shown by a thin radiating line
inside the circle. Certain objects throughout the room were
labeled in this view. When subjects touched this display, a
large blue “X” appeared in the map view.
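For readers who want a concrete picture of the feedback behavior just described, the following sketch mocks it up in Python/tkinter. It is purely illustrative: the window layout, names, and geometry are invented for this example, and it is not the interface we actually built, which combined live video, mapping, speech, and gesture recognition.

# Illustrative sketch only: a minimal mock-up of the two-panel touch display
# described above (red-dot feedback on the "robot's eye" view, blue "X" on the
# map view, robot drawn as a red circle with a thin heading line). This is a
# hypothetical stand-in, not the authors' actual system.
import math
import tkinter as tk

root = tk.Tk()
root.title("Mock two-panel touch display")

# Left panel stands in for the live video ("robot's eye") view.
video = tk.Canvas(root, width=320, height=240, bg="gray70")
video.grid(row=0, column=0, padx=4, pady=4)

# Right panel stands in for the mapped (plan) view of the room.
map_view = tk.Canvas(root, width=320, height=240, bg="white")
map_view.grid(row=0, column=1, padx=4, pady=4)

def draw_robot(canvas, x, y, heading_deg, radius=10):
    """Draw the robot as a red circle with a thin radiating line for heading."""
    canvas.create_oval(x - radius, y - radius, x + radius, y + radius,
                       outline="red", width=2)
    hx = x + radius * math.cos(math.radians(heading_deg))
    hy = y - radius * math.sin(math.radians(heading_deg))
    canvas.create_line(x, y, hx, hy, fill="red")

def on_video_touch(event):
    """Touching the video view produces a large red dot as feedback."""
    r = 8
    video.create_oval(event.x - r, event.y - r, event.x + r, event.y + r,
                      fill="red", outline="")

def on_map_touch(event):
    """Touching the map view produces a large blue 'X' as feedback."""
    s = 8
    map_view.create_line(event.x - s, event.y - s, event.x + s, event.y + s,
                         fill="blue", width=3)
    map_view.create_line(event.x - s, event.y + s, event.x + s, event.y - s,
                         fill="blue", width=3)

video.bind("<Button-1>", on_video_touch)
map_view.bind("<Button-1>", on_map_touch)

draw_robot(map_view, 160, 120, heading_deg=45)
root.mainloop()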
With these modes of interaction, subjects were asked to
get a mobile robot in a remote location to find a hidden
object: an iridescent yellow sign with the word FOO
written on it in block letters. In order to compensate for
any deficiencies in any of our modules and software, as
well as to provide the cognitive capabilities and behavior
we wished to incorporate in the future, we made this a
“Wizard of Oz” study. The subjects, of course, were not
told that they would be interacting with two humans
controlling the robot, its speech, its actions, and its
understanding. Instead, they were simply informed that
they would be interacting with a rather intelligent mobile
robot. More specific details, including the five-part
experimental design in which subjects were tested, are
available elsewhere [Perzanowski, et al., to appear 2004].
To our surprise, people didn’t avail themselves of the
rich modes of interaction which we thought we had made
available to them. Instead, they took rather conservative
approaches to their modes of interactions. Sentences
tended to be short, simple commands, such as “Go here” or
“Move forward one meter,” accompanied, if needed, by
gestures on the screen. We had expected that, especially
with the video feed, people would jump at the opportunity
to talk about the objects viewed, telling the robot to “Move
to the left of the box behind the pillar.” Instead of coming
to the interface table with yens and appetites for foie gras,
people were in actuality taking a more peanut-butter-and-jelly
approach. This is not to be construed as a pejorative
comment on our part about the kinds of interactions people
exhibited, but rather as a reaction to the completely
antithetical opinion we had had about how they might use
our interface.
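As a rough illustration of the distinction we are drawing between the short commands we actually observed and the spatially rich commands we had expected, the hypothetical Python fragment below simply flags utterances that contain a locative relation. The relation list and the function are invented for this example; they are not part of our interface, whose actual spatial-language coverage is described in Skubic, et al. (2004).

# Illustrative sketch only (not part of the authors' system): a crude way to
# separate the short "verbal joysticking" commands we observed ("Go here",
# "Move forward one meter") from the spatially rich commands we had expected
# ("Move to the left of the box behind the pillar"), by checking for
# locative relations in the utterance.

# Hypothetical list of spatial relations; real coverage was broader.
SPATIAL_RELATIONS = [
    "left of", "right of", "behind", "in front of", "between", "near", "next to",
]

def is_spatially_rich(utterance: str) -> bool:
    """Return True if the command contains at least one locative relation."""
    text = utterance.lower()
    return any(rel in text for rel in SPATIAL_RELATIONS)

examples = [
    "Go here",                                       # verbal joysticking
    "Move forward one meter",                        # verbal joysticking
    "Move to the left of the box behind the pillar", # spatially rich
]
for cmd in examples:
    label = "spatially rich" if is_spatially_rich(cmd) else "verbal joysticking"
    print(f"{cmd!r}: {label}")

Coding transcripts along even so crude a dimension is one way the conservatism we observed in subjects' use of the language modality could be quantified.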
In the pre-test briefing, we somewhat emphasized the
robust capabilities of the system, telling the subjects that
the robot understood English and that they could interact with
the screen freely by touching it. We found that, instead of
interacting with the robot with both guns loaded, so to
speak, attacking the problem from a very high level, and
experimenting to see what the capabilities of the system
were, people came to the interface experience rather
conservatively and at a rather low level of interaction. For
the most part they started at the bottom level with simple
commands and, more importantly, stayed at this low
level throughout the task. We were surprised by this result
because we felt that it did not correspond to the ways in
which people interact with other people.
If a person meets another unknown individual, granted,
the initial utterances and interactions may be short, simple,
and halting. But as the two become familiar with each other,
communication usually becomes more complex. This was
not what we found in our pilot study. People stayed on the
initial level of communication throughout. Most of their
commands in this task, for example, could be characterized
as nothing more than verbal joysticking (euphemistically
also known as voice-driven teleoperation). Gestures, an
option presented to them, tended to be minimal. In fact, it
seemed that the more loquacious subjects in our study
relied less on gestures and just uttered more commands.
We expected, for example, that providing people with a
rather robust interface would enable sophisticated
interactions involving complex locative and spatial
referencing, both linguistic and gestural. For
the most part, however, very little, if any, of this was
exhibited. Instead, people seemed to be very narrowly
focused and simplistic in their commands and interactions.
Nothing in our experience had told us that we should
expect what we discovered.
When we presented the preliminary results of the pilot
study, conference attendees suggested that perhaps the
design of the experiment had contributed to the kinds of
interactions we had elicited. It was suggested that we alter
people’s workloads so that their attention had to be
directed and redirected during the experiment. Perhaps
then we would elicit the kinds of responses we were
looking for. However, this suggestion only reinforced our
confidence in our initial approach. We wanted to get
people to interact in their own natural ways without
“doctoring” the environment of the experiment. This is not
to say that the latter is not a valid experimental design, but
we did not feel that it would be appropriate here.
Therefore, given the surprising preliminary results of the
pilot study, we conducted a more rigorous study involving
25 subjects. At the time of the writing of this article, we
are analyzing the results of this second study and hope to
present some of the conclusions drawn from this analysis at
this symposium for further discussion. One general
conclusion, however, seems already to be emerging.
As stated earlier, we conducted a “Wizard of Oz”
experiment. In actuality, there were two wizards: one
acted as the navigational system of the robot, watching for
the user’s gestures on a monitor mounted on top of the robot,
and manually joysticking the robot around the
environment. The other wizard acted as the voice of the
robot, providing verbal feedback to the user after the user’s
various commands or interactions. For example, after a
user told the robot to move to a particular location, the
robot verbally responded “I made it to the goal.”
Prior to the actual experiment, we looked around for a
plausible robot voice to be used for providing feedback to
the user; however, all of the ones that we surveyed on the
internet sounded either somewhat artificial or even toy-like.
We, therefore, opted for a voice modulator that was
purchased inexpensively in a second-hand store. The
modulator was used to distort the human voice sufficiently
so that it didn’t sound quite human. Furthermore, one
wizard was familiar with the interface and with human
voice attributes, so he maintained the same responses for
consistency across all subjects, and also maintained a
somewhat monotonous tone of voice, trying to emulate
other speech systems which do not employ speech contours
and other phonetic and phonological characteristics of
natural-sounding human speech. We feared that anything
that sounded too natural would be a dead give-away that
there really wasn’t a smart robot with which the subjects
were interacting, but rather another human.
After completion of the human subject experiments and
an initial pass of data analysis, it became evident that the
rich interactions that we anticipated with our “robust”
interface were not forthcoming. We had expected or hoped
for utterances rich with spatial and locative information
and gestures, given what we thought were robust natural
language and visual interactive modalities. We wondered
what could be contributing to the lack of more complicated
and richer interactions from our users. While this needs to
be investigated further, it seems that the voice of the robot
used by one of the wizards may have biased our subjects to
use short, simple utterances, and perhaps influenced them
to shy away from more complicated utterances and even
more complicated gestures. We suspect that the wizard’s
voice may have “dumbed down” the experiment.
Because the voice of the robot sounded monotonous and
the robot’s utterances tended to be short, the subjects may
have gotten the impression that the robot was not quite as
smart as we had told them it was. Instead, hearing a
monotonous, curt robot may have caused the subjects to
react in kind. Such a finding is not unusual considering
how people are affected by another person’s linguistic
capabilities in conversational environments. People talking
with children, or individuals speaking with non-native
speakers or second-language learners, exhibit similar
qualities. More importantly, however, we have seen the
effect that the inclusion of natural language, and its seeming
robustness or lack thereof, can have on an individual’s use
of the interface in general.
III. Conclusion
We have discussed two issues affecting human-robot
interaction here. Just as human communication is
facilitated by shared cognitive systems, we believe
cognitively modeled human-robot interfaces will facilitate
human-robot interactions. However, based on our
experiences with designing and implementing a cognitively
modeled human-robot interface and a subsequent pilot
study testing the modalities of the interface, we stress the
importance of human-subject studies from the outset in the
modeling of the interface. Moreover, if one is going to
incorporate natural language into the interface, designers
have to be keenly aware of how the use of natural language
may affect other modalities and interactions. If used, the
natural language mode can have far-reaching consequences
on other modalities of the interface, further, perhaps, than
the designers may have anticipated. While we require
additional experimentation to verify our tentative
conclusion, we think we may have found what caused certain
unexpected and seemingly adverse effects in our
multimodal interface; namely, the uncharacteristic verbal
joysticking of a supposedly robust robot, rather than the
richer and more diverse interactions we anticipated our
multimodal interface would facilitate. Thus, short, simple,
monotonous interchanges may have adverse effects on the
entire interface. There may be a direct correlation between
the robustness of the natural language interface and the
kinds of interactions users will exhibit in other interactive
modalities of an interface.
Acknowledgments
This research has been funded by the DARPA IPTO
MARS Program (MIPR #04-L697), the ONR Intelligent
Systems Program (Work Order #N0001404WX20210), and
an NRL Research Option (Work Certificate #IT-015-09-4C
& 4D).
References
Perzanowski, D., Brock, D., Adams, W., Bugajska, M., Thomas,
S., Blisard, S., Schultz, A., Trafton, J.G., and Mintz, F., (to
appear 2004), “Toward Multimodal Human-Robot Cooperation
and Collaboration,” Proceedings of the First Intelligent Systems
Technical Conference, American Institute of Aeronautics and
Astronautics.
Perzanowski, D., Brock, D., Blisard, S., Adams, W., Bugajska,
M., Schultz, A., Trafton, G., Skubic, M., (October 2003) “Finding
the FOO: A Pilot Study for a Multimodal Interface,” Proceedings
of the IEEE Systems, Man, and Cybernetics Conference,
Washington, DC. pp. 3218-3223.
Perzanowski, D., Schultz, A., Adams, W., Bugajska, M., Marsh,
E., Trafton, G., Brock, D., Skubic, M., and Abramson, M. (2002)
“Communicating with Teams of Cooperative Robots.” In Multi-Robot Systems: From Swarms to Intelligent Automata. Kluwer:
The Netherlands, pp. 185-193.
Perzanowski, D., Schultz, A., Marsh, E., and Adams, W., (March
2000) "Two Ingredients for My Dinner with R2D2: Integration
and Adjustable Autonomy," Papers from the 2000 AAAI Spring
Symposium Series, Menlo Park, CA: AAAI Press, pp. 45-50.
Perzanowski, D., Schultz, A., and Adams, W., (September
1998) “Integrating Natural Language and Gesture in a Robotics
Domain,” Proceedings of the IEEE International Symposium on
Intelligent Control: ISIC/CIRA/ISAS Joint Conference,
Gaithersburg, MD: National Institute of Standards and
Technology, pp. 247-252.
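Searle, J.R., (1984) Minds, Brains and Science. Cambridge, MA: Harvard University Press.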
Skubic, M., Perzanowski, D., Blisard, S., Schultz, A.C., Adams,
W., Bugajska, M., (May 2004), “Spatial Language for Human-Robot Dialogs,” IEEE Transactions on Systems, Man, and
Cybernetics: Part C: Applications and Reviews, vol. 34, no. 2, pp. 154-167.
Trafton, J.G., Schultz, A.C., Perzanowski, D., Adams, W.,
Bugajska, M.D., Cassimatis, N.L., and Brock, D.P.,
(forthcoming), “Children and robots learning to play hide and
seek.”