Voice vs. Finger: A Comparative Study of the Use of
Speech or Telephone Keypad for Navigation
Jennifer Lai
IBM Corporation/ T.J. Watson Research Center
30 Saw Mill River Road
Hawthorne, NY. 10532, USA
+1 914-784-6515
lai@watson.ibm.com

Kwan Min Lee
Department of Communication
Stanford University
Stanford, CA 94305-2050, USA
+1 650-497-7357
kmlee@stanford.edu
ABSTRACT
In this paper, we describe the empirical findings from a
user study (N=16) that compares the use of touch-tone and
speech modalities to navigate within a telephone-based
message retrieval system. Unlike previous studies
comparing these two modalities, the speech system used
was a working natural language system. Contrary to
findings in earlier studies, results indicate that in spite of
occasionally low accuracy rates, a majority of users
preferred interacting with the system by speech. The
interaction with the speech modality was rated as being
more satisfying, more entertaining, and more natural than
the touch-tone modality.
Keywords
Speech User Interfaces, Voice User Interfaces (VUIs),
Natural Language, Keypad Input, DTMF, Navigation
INTRODUCTION
As Voice User Interfaces (VUIs) become more common,
users of these systems have sometimes experienced the
discomfort of needing to interact with a computer by voice,
over a phone line, in a public setting: “NO, play the NEXT
message …… NEEXXXT MESSSSAGE”. This can cause
observers to stare, some with horror, some with sympathy,
wondering either how anyone could be so uncivil to a coworker, or how this unfortunate, bedraggled businessman
could have been saddled with such an incompetent
assistant.
VUIs, as the name would imply, rely on the use of one’s
voice, usually both for data input (e.g. in response to the
question “what city please”) and for navigation within the
application (e.g. “open my calendar please”). [For a
discussion of voice user interfaces see 1,15]. As designers
and users of VUIs, we too have occasionally found
ourselves in similar situations, wishing for silent and 100%
accurate interaction.
While embarrassment can be the humorous side of using
speech-based interfaces in public places, there are more
serious reasons for needing a silent mode of interacting with a speech system, such as a need for privacy (e.g. in an airport) or a need to be considerate of other people (e.g. in a meeting).
Additionally, subscribers who use a cell phone to call a
speech-based system are subject to the vagaries of varying
levels of cell coverage. When the cell signal is weak, there is a large negative impact on the accuracy of the speech recognition, making the system unusable with speech.
Prior to the more generalized availability of speech
recognition software and development toolkits, most
telephone-based interfaces relied on touch-tone input using
the telephone keypad, or Dual Tone Multiple Frequency
(DTMF). Voice processing systems, which first started to
appear in the 1980s, used DTMF for input and recorded
human speech as output [2]. The fixed set of twelve keys
(ten digits as well as the # and * keys) on the keypad lent
itself to the construction of applications that present the
caller with lists of options (e.g. “press one for Sales”),
commonly referred to as menus. Since that time, menu-driven Interactive Voice Response (IVR) applications have
become a pervasive phenomenon in the United States [9].
As an input mechanism, DTMF has the advantage of being
both instantaneous and 100% accurate. This is in contrast to
speech recognition, which is neither. Processing of the
speech can cause delays for the caller, and variability of the
spoken input will prevent speech from being 100% accurate
in the foreseeable future. However, given the limitation of
using only twelve keys for input, many users of IVR
systems report feelings of “voice mail jail” [10], often
causing them to hang up in frustration.
Given the tradeoff between unconstrained and natural
input, which could be erroneously interpreted, and highly
constrained but accurate input, we wondered which
modality users of a telephone-based message retrieval
system would prefer. This paper reports on an experiment
which compared using either a DTMF key or natural
speech to request a function within a telephone based
application.
PRIOR WORK
While we were not able to find any studies that compared
use of a natural language system with touch-tone input,
there are several studies that compare use of a simple
speech input system and DTMF [5,8,11].
Delogu et al. [5] found no difference in terms of task completion time and number of turns per task when
comparing DTMF input for an IVR system with three
different types of speech input. The three types they
experimented with were simple digit recognition, simple
command recognition, and a natural command recognition
system that recognized relatively complete sentences. The
digit system only recognized single digits 1 through 9,
while the simple command system recognized a short list of
words such as “next,” “skip,” “yes,” or “no.” However, the analysis of user attitudes toward the different systems indicated that users preferred the natural command recognition system over the DTMF-based system for input. DTMF input was preferred to both the simple digit and the simple command recognition systems. It should be noted that the
voice systems used by Delogu et al. did not involve any
true speech recognition technology. They used a Wizard of
Oz method to compensate for technological barriers. In
addition, there was no statistical analysis of the attitudinal
survey data reported in their paper. Instead, they only
provided the proportions for three questions (1. which of
the three prototypes will be better accepted in the market
place?; 2. Which is the most enjoyable system?; 3. Which
system will you prefer to use in a real service?)
Similar to the above study, Foster et al. [8] also found that users preferred connected word (CW) speech input, in which users say a string of words following a system prompt without any pause required between the words, and DTMF input to the isolated word (IW) speech input modality, in which users say only a single word after a prompt. In addition, they reported an interaction effect between the cognitive abilities (spatial and verbal abilities) of users and their attitudes toward the different modalities
tested. Users with high cognitive abilities significantly
preferred DTMF over CW and IW input. While the
interaction effect between users’ spatial ability and their
preference for DTMF can easily be explained by the
positive effect of high spatial ability on mental mapping of
DTMF options, the positive effect of verbal skills on the
DTMF preference cannot be explained very well.
Finally, Goldstein et al. [11] reported no difference in task completion times between a DTMF-based hierarchical-structure navigation system and a flexible-structure voice
navigation system. With the hierarchical system, the users
needed to first select a general option and then proceed to
more specific choices. In the flexible-structure system, the
users could move directly from one function to another.
Such flexible-structure systems have a larger vocabulary and are more error prone. Unlike Foster et al.'s [8] finding on the
interaction effect between spatial ability and attitudes
toward different modalities, they found no difference in
subjective measures. However, they found a significant
interaction effect with regard to task completion times.
Users with high spatial ability finished tasks more quickly
when they used the flexible-structure voice navigation
system. In contrast, low spatial ability users finished tasks
more quickly when they used the DTMF based
hierarchical-structure navigation system. This study also relied on a Wizard of Oz methodology, in which participants believe they are interacting with a machine while in reality a human controls the interaction with the participant.
In a study reported by Karis [12], thirty-two subjects
performed call management tasks twice, once using speech
and once with touch-tone. The tasks involved interacting
with a “single telephone number” service that included a
variety of call management features.
The majority of
subjects (58.1%) preferred using DTMF rather than speech,
and tasks were completed faster using touch tone, although
both the type of task and whether the subject had a quick
reference guide influenced task completion times. In this
study there appeared to be only a loose relationship
between accuracy of the speech recognition, ratings of
acceptability, and preference choices. Some subjects who
experienced low system recognition performance still said
they preferred to interact via speech rather than touch-tone.
Although not directly related to the comparison between a
speech-based system and a DTMF one, a market report by
Nuance [13] shows users’ general attitude toward their
current voice navigation system which has a much larger
vocabulary than the ones tested above. Even though
Nuance systems do not use true natural language, the report
nevertheless provides useful data for the prediction of how
people will evaluate large vocabulary speech systems when
compared to other methods such as speaking with a human
operator or using touch tone. In the report, 80% of users
said they were either as satisfied or more satisfied using a
speech system than they had been using a touch-tone
system. Interestingly, only 68% of users agreed
either strongly or somewhat strongly with the sentence
comparing two different modalities—“I like speaking my
responses better than pushing buttons.” In addition, slightly
more than half of the queried users (64%) responded that
their utterances were understood either very well or
somewhat well by the system.
In summary, the referenced studies showed that the DTMF
systems were preferred to isolated-word or simple digit
recognition-based systems. When compared to connected-
word systems, DTMF systems were rated either similarly
or better, especially by users with high spatial abilities. Low
spatial ability users were able to finish the task more
quickly with a DTMF-based hierarchical system than with
a speech-based flexible system, whereas high spatial ability
users finished tasks more quickly with speech rather than
with the DTMF.
These results are surprisingly contrary to the widespread
belief that most users would prefer a more natural form of
interaction (such as speech) to a more artificial one (such as
the keypad). We believe that one of the primary
explanations for these results is the constrained nature of the speech systems tested. Fay [7] and Karis [12] also mention
that users may actually prefer touch-tone based systems to
voice command systems, due at least in part to
the limitation of speech recognition technology.
Without digressing too far into the differences of the
various underlying speech technologies that can be used in
speech-based systems (for an authoritative description of
these, see Schmandt [15]), speech systems can be either
command based (single word input), grammar based
(recognizing specifically predefined sentences) or natural
language (NL) based. Natural language recognition refers
to systems that act on unconstrained speech [3]. In this
case, the user does not need to know a set of predefined
phrases to interact with the system successfully. NL
systems are often trained on large amounts of statistically analyzed data and, once trained, can understand a phrasing of a request that they have never seen before. This important
move from speech “recognition” (the simple matching of
an acoustic signal to a word) to speech “understanding”
(extracting the meaning behind the words), often stated as
the holy grail of speech recognition, is working in limited
domains today [4]. The Mobile Assistant is one such
example of a true NL system.
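To make the distinction concrete, the following minimal sketch (our own illustration, not the implementation used in the Mobile Assistant; all phrases and intent labels are hypothetical) contrasts a grammar-based recognizer, which accepts only predefined sentences, with a toy statistically trained classifier that can map an unseen phrasing onto a known request:

    # Illustrative sketch only: grammar matching vs. a toy trained intent classifier.

    GRAMMAR = {
        "play my voicemail messages": "PLAY_VOICEMAIL",
        "what is on my calendar today": "READ_CALENDAR",
    }

    def grammar_understand(utterance):
        """A grammar-based system accepts only the sentences it was written to expect."""
        return GRAMMAR.get(utterance.lower().strip())

    # Toy "training data": example phrasings observed for each intent (hypothetical).
    TRAINING = {
        "PLAY_VOICEMAIL": ["play voicemail", "read me my voice messages", "any voicemail"],
        "READ_CALENDAR": ["what is on my calendar", "do i have meetings today"],
    }

    def nl_understand(utterance):
        """Pick the intent whose training phrasings share the most words with the input.
        A phrasing never seen in training can still land on the right intent."""
        words = set(utterance.lower().split())
        def best_overlap(intent):
            return max(len(words & set(p.split())) for p in TRAINING[intent])
        return max(TRAINING, key=best_overlap)

    print(grammar_understand("could you play my voice messages"))  # None: not in the grammar
    print(nl_understand("could you play my voice messages"))       # PLAY_VOICEMAIL

Production NL systems replace the word-overlap heuristic with statistical models trained on large amounts of annotated data, but the contrast in behavior is the same.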
THE MOBILE ASSISTANT
This experiment was conducted within the framework of a
working application called the Mobile Assistant (MA). It is
a system that gives users ubiquitous access to unified
messages (email, voice mail and faxes) and calendar
information from a telephone. Calendar and messages can
be accessed in one of three modalities:
1) From a desktop computer, in a combination of audio and visual modes. This is similar to the standard
configuration today, except that voicemail is
received as an audio attachment and can be
listened to from the inbox. For internal calls, the
identity of the caller is listed in the header
information of the message. Email messages can
be created using the MA system and are also
received as audio attachments.
2) From a SmartPhone (a cell phone with a multi-line
display and a web browser) in a silent visual
mode. In this case, users connect over the network
and can read their email and calendar entries on
the phone’s display. Notifications of the arrival of
urgent email messages and voicemail are usually
sent to this phone; however, the user can tailor which notifications they receive and which email-addressable device they want the notifications sent
to. Voicemail messages cannot be accessed in the
silent visual mode since we do not transcribe them
and thus they must be accessed by calling in to the
system and listening to them.
3) From any phone using speech technologies (both
recognition and synthesis) in an auditory mode. In
this situation the users speak their requests for
information, which are interpreted by the Mobile Assistant. Examples of requests are: “do I have any
messages from John?”, “what’s on my calendar
next Friday at 10:00 a.m.?”, or “play my
voicemail messages”. The MA replies to their
queries and reads them the requested messages or
calendar entries.
The focus of the research has been on supporting the
pressing communication needs of mobile workers and
overcoming technological hurdles such as high accuracy
speech recognition in noisy environments, natural language
understanding and optimal message presentation on a
variety of devices and modalities. This system is currently
being used by over 150 users at IBM Research to access
their business data.
The component of the system that was targeted by this
study is the third component, which allows users to access
messages in an auditory mode using speech technologies.
METHODOLOGY
Experimental Design
We employed a within-subject design in order to maximize
the power of the statistical tests. All participants
experienced both the speech only and the DTMF only
condition. They were asked to complete identical tasks for
both input modalities. We created two test email accounts
for the experiment, and populated the inboxes with
messages. The order of modality and test account used was
counterbalanced to eliminate any possible order effect.
Each account had fifteen email messages. To eliminate any
possible bias which might have been due to any
particularities of an account, we systematically rotated the
assignment of an account to a tested modality. (See Table 1
for the order and assignments). Unlike most of the earlier
studies mentioned, we did not use a Wizard of Oz
methodology but instead used a real working speech
system, which employs both recognition algorithms and
Natural Language Understanding models to interpret the
user’s requests.
          First session          Second session
User 1    Speech / Account 1     DTMF / Account 2
User 2    DTMF / Account 1       Speech / Account 2
User 3    Speech / Account 2     DTMF / Account 1
User 4    DTMF / Account 2       Speech / Account 1

Table 1: Order of modalities and accounts used for the experiment
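For readers who prefer to see the counterbalancing spelled out, the short sketch below reproduces the four order/account combinations of Table 1 (a sketch of our own; we assume the remaining participants cycled through the same four patterns):

    # Illustrative sketch of the counterbalanced assignment in Table 1.
    modality_orders = [("Speech", "DTMF"), ("DTMF", "Speech")]
    account_orders = [("Account 1", "Account 2"), ("Account 2", "Account 1")]

    assignments = []
    for accounts in account_orders:
        for modalities in modality_orders:
            # Pair the first modality with the first account, the second with the second.
            assignments.append(list(zip(modalities, accounts)))

    for i, (first, second) in enumerate(assignments, start=1):
        print(f"User {i}: {first[0]} / {first[1]}, then {second[0]} / {second[1]}")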
Participants
A total of 16 participants (eight females and eight males)
were recruited from the IBM research center in Hawthorne,
New York. We recruited participants with a wide range in
ages (from late teens to over sixty) because our target user
population also varies greatly in age. All participants
except one were naïve users with regard to speech systems.
None of the participants had any experience using the
Mobile Assistant. While participants volunteered their time
for the sake of science, they were given a parting gift of
either a hat or a pen as a token of our appreciation. All
participants were videotaped with informed consent and
debriefed at the end of the experiment session.
Apparatus
DTMF modality
In this condition the participants were instructed that their only mode of interaction with the system could be through the telephone keypad. Thus to log on, when prompted by the system, they used the keypad to enter the telephone number for the account and then entered the six digit password. Each of the 12 telephone keypad buttons had a function (see Table 2 for the list of keys and their corresponding functions). The list for the function mapping was given to the participants on the same page as the task list. They were told that they could refer to the mapping as often as they wanted to. They were also given a sample interaction on the same page: “To listen to the second email message in your inbox, press 5 and after listening to the message press the # key to hear the next message.”

DTMF Key         Function
* (star key)     Interrupt system output
# (pound key)    Next message / Next day
0                Cancel current request
1                Yes
2                No
3                Play phonemail
4                Play today's calendar
5                Play first email msg.
6                Delete a message
7                Repeat
8                Reply to a message
9                Forward a message

Table 2: List of DTMF keys used in the experiment and their corresponding functions
In both conditions the system output was spoken, using
synthesized speech. The participants could interrupt the
spoken output at any time by using the star (*) key. The
pound (#) key is mapped to the “Next” function and is
context-dependent. If the user has just listened to today’s
calendar (invoked with the 4 key) the pound key will play
the calendar entries for the following day. On the other
hand, if the user has just heard an email message, the pound
key will cause the next message to be played.
Since a real speech application was used for this
experiment, we had to deal with the fact that the user is
presented with confirmation dialogs at different points in
the interaction. For example when the user asks to delete a
message, the system confirms the operation first: “are you
sure you want to delete the message from Jane Doe?” Thus
the 1 and 2 keys were used to handle replies to the
confirmation dialogs.
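To illustrate how the key mapping and the context-dependent pound key fit together, here is a minimal sketch (not the Mobile Assistant's actual code; the state handling is deliberately simplified) of a dispatcher for the functions listed in Table 2:

    # Hypothetical sketch of a DTMF dispatcher with a context-dependent '#' key.
    class DtmfSession:
        def __init__(self):
            self.context = None  # "calendar" after key 4, "email" after key 5

        def handle_key(self, key):
            fixed = {
                "*": "interrupt output",
                "0": "cancel current request",
                "1": "confirm (yes)",
                "2": "reject (no)",
                "3": "play phonemail",
                "6": "delete current message",
                "7": "repeat",
                "8": "reply to current message",
                "9": "forward current message",
            }
            if key == "4":
                self.context = "calendar"
                return "play today's calendar"
            if key == "5":
                self.context = "email"
                return "play first email message"
            if key == "#":
                # 'Next' depends on what was heard last: next day vs. next message.
                return "play next day" if self.context == "calendar" else "play next message"
            return fixed.get(key, "unrecognized key")

    session = DtmfSession()
    print(session.handle_key("4"))  # play today's calendar
    print(session.handle_key("#"))  # play next day
    print(session.handle_key("5"))  # play first email message
    print(session.handle_key("#"))  # play next message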
NL-based speech modality
In the speech condition, the participants were instructed that their only mode of interaction with the system could be with speech. Thus to log on, when prompted by the system, they spoke the name of the test account and then spoke the six digit password.
Because the system accepts natural language input, it was
not necessary (nor would it be feasible) to define for the
participants everything that they could say to the system.
However, in an effort to balance both conditions, the
participants were given sample phrases to give them an
idea of the types of things that can be said. The sample
phrases included on the task description were:
- How many messages did I receive yesterday?
- Do I have any messages from Peggy Jones?
- What's on my calendar next Wednesday?
Procedure
The participants took the study one at a time in a usability
lab. They were told that the purpose of the study was to
examine certain usability issues related to the use of a
phone-based universal messaging and calendar system
called the Mobile Assistant. Upon arrival in the lab, the
participants were seated and given a booklet with the
instructions on the first page. The instruction page
described the purpose of the study and had a list of the
required tasks. It also had the phone number to call to reach
the system, the name of the test account they should use,
and the password. In the case of the DTMF condition, the
instruction page also had the function mapping and sample
interaction. In the case of the speech-only condition, this
page had the sample phrases that could be spoken to the
system.
The participants were instructed that their task was to
interact with the Mobile Assistant on the telephone to
manage several email and calendar tasks.
The
experimenter assured the participants that all the data
collected would be confidential. The experimenter showed
the task list to the participant and talked about the sample
interaction that was given to make sure the participant
understood what was involved. Then the participant was
shown the questionnaire that needed completing after the
first condition. The experimenter then explained that they
would be asked to complete the same tasks in the
alternate condition and showed them the instruction page
and questionnaire for the second condition.
After the experimenter left the room, the participants dialed
the number for the Mobile Assistant. They used the
telephone on the table in front of them. The speech output
from the system played through a set of speakers as well as
the handset so that the system’s output could be captured in
the videotapes.
Task Description
Participants were asked to complete the following tasks in
both conditions. The only difference from one condition to
the other was the inbox that they were logging in to (and
thus the messages contained in it) and the method of
interaction used.
Task 1: Log on
Task 2: Listen to the fourth message in the mail box
Task 3: Find out if they received a message from a particular person. If so, listen to it.
Task 4: Reply to the above message.
Task 5: Delete the message.
Task 6: Find out what time they are meeting with David Jones on Friday.
Both test inboxes were balanced for number of messages,
type of message and total number of words for all
messages. Also, each message was approximately the same
length as the message in the same position in the other test
inbox. Thus the first message in test inbox 1 had about the
same number of words as the first message in test inbox 2,
as did the second message, the third message, and so on.
In the third task, the participants were asked to determine if
they had received a message from a particular person (one
of the experimenters and authors of this paper). In both
inboxes, this message was always in the middle of the list
of messages, which was the ninth message down. The
messages were comparable as shown below.
Ninth message in inbox 1:

Hello,
I would like to have your input for the workshop proposal that we discussed at lunch the other day. While the deadline is not until sometime in September, it also coincides with the papers deadline so it would be great if we could get this work out of the way. We really only need about 2 pages for the write up. Please let me know what time would be convenient to meet next week. How about a working lunch?
Many thanks,
Jennifer

Ninth message in inbox 2:

Hi Jacob,
I was given the job of compiling the highlights reports for this month and would very much appreciate it if you could send me your input by the end of the day today. Since you have been out of the office on vacation for three out of the four weeks this month, I would totally understand if you did not have much to contribute. Either way, please send me your input as soon as possible.
Many thanks in advance,
Jennifer
We tried to define the tasks in such a way that half of them
would be well suited to a sequential traversal of
information and thus favor the DTMF condition, and the
remaining half would be better suited to random access of
information (and thus favor the speech condition). For
example, we believed that the sixth task, which asked them
to find out what time they were meeting with David on
Friday, favored random access of data. With speech, the
user could simply ask the system “what meetings do I have
on Friday” and listen until the meeting with David was
mentioned. With DTMF, the user had to first play today’s
calendar, interrupt the listing with the star key (or listen to
the entire day), and then press the pound key to get the next
day’s listing. We expected that Task 1 (logging on), Task 4
(reply), and Task 5 (delete) would favor the DTMF condition,
whereas Task 2 (listen to the fourth message), Task 3 (find
a message from a particular person), and Task 6 (find the
meeting time with David) would favor the Speech
condition.
Measures
According to ETSI (the European Telecommunications
Standards Institute), usability in telephony applications is
defined as the level of effectiveness, efficiency and
satisfaction with which a specific user achieves specific
goals in a particular environment [6]. Effectiveness is
defined here as how well a goal is achieved in a sense of
absolute quality; efficiency as the amount of resources and
effort that are used to achieve the specific goal; and
satisfaction as the degree that users are satisfied with a
specific system.
In this study, we examined all three elements of usability.
We measured the effectiveness of a system by calculating a
success rate for each user task. The amount of time to finish
each task was used as a proxy measure for efficiency. User
satisfaction was evaluated through a series of survey
questions asked immediately following the use of each
modality.
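Concretely, the effectiveness and efficiency measures reduce to a per-task success rate and a mean completion time computed over successful attempts only (as reported later in Tables 4 and 5). A minimal sketch, with invented trial records, of how such measures can be tallied:

    # Hypothetical sketch: success rate per task, and mean time over successful attempts.
    # Each record is (task, succeeded, seconds); the values are invented for illustration.
    trials = [
        ("Task 1", True, 38), ("Task 1", True, 45), ("Task 1", False, 90),
        ("Task 2", True, 52), ("Task 2", False, 70), ("Task 2", True, 61),
    ]

    tasks = sorted({t for t, _, _ in trials})
    for task in tasks:
        records = [(ok, sec) for t, ok, sec in trials if t == task]
        successes = [sec for ok, sec in records if ok]
        rate = len(successes) / len(records)
        mean_time = sum(successes) / len(successes) if successes else None
        print(f"{task}: success rate {rate:.0%}, mean time of successes {mean_time} sec")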
In order to measure user satisfaction, we administered two types of questions. The first type asked users to evaluate the interaction with the system that they had just used. The second type asked users to evaluate the system itself, regardless of their evaluation of the interaction. We speculated that the evaluation of the system and the evaluation of the interaction could be different. That is, it would be possible for a user to evaluate the state-of-the-art NL-based speech system very positively due to the novelty effect (the cool factor). This same user, however, might evaluate his speech-based interaction with the system negatively, because he had difficulty accomplishing the required task.
Immediately after completing all tasks, participants were
asked to evaluate their interaction with the system by
indicating how well certain adjectives described the
interaction, on a scale of 1 to 10 (1 = “Describes Very
Poorly”, 10 = “Describes Very Well”). Four adjectives were used to create an index we called “interaction satisfaction”: comfortable, exhausting (reverse coded), frustrating (reverse coded), and satisfying. Five adjectives were used to create an “interaction entertainment” index: boring (reverse coded), cool, entertaining, fun, and interesting. Finally, another four adjectives were used to form an “interaction naturalness” index: artificial (reverse coded), natural, repetitive (reverse coded), and strained (reverse coded).
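For completeness, the sketch below shows how a reverse-coded item on a 1 to 10 scale is typically flipped (the score becomes 11 minus the score) and how the Cronbach's alpha reliabilities reported in Table 3 are conventionally computed; the ratings shown are invented for illustration and are not our data.

    # Hypothetical sketch: building a reverse-coded index and computing Cronbach's alpha.
    # Each list holds one invented rating per participant for one adjective.
    import statistics

    def reverse(score, scale_max=10, scale_min=1):
        """Flip a reverse-coded item on a 1-10 scale (e.g., 'frustrating')."""
        return scale_max + scale_min - score

    def cronbach_alpha(items):
        """items: list of per-item score lists (same participants, same order)."""
        k = len(items)
        totals = [sum(scores) for scores in zip(*items)]
        item_var = sum(statistics.pvariance(scores) for scores in items)
        return (k / (k - 1)) * (1 - item_var / statistics.pvariance(totals))

    comfortable = [7, 5, 8, 6, 4]
    exhausting  = [3, 6, 2, 5, 7]   # reverse coded before use
    frustrating = [4, 6, 3, 5, 8]   # reverse coded before use
    satisfying  = [6, 4, 7, 6, 3]

    items = [comfortable,
             [reverse(s) for s in exhausting],
             [reverse(s) for s in frustrating],
             satisfying]

    index_per_participant = [sum(scores) / len(scores) for scores in zip(*items)]
    print("interaction satisfaction index:", index_per_participant)
    print("Cronbach's alpha:", round(cronbach_alpha(items), 2))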
After evaluating the interaction, participants then evaluated their general impression of the system in the same way as above. Three indices were created with regard to the evaluation of the system:
1) system entertainment: consisting of boring [reverse coded], cool, entertaining, and fun;
2) system satisfaction: consisting of comfortable, frustrating [reverse coded], satisfying, and reliable;
3) system easiness: consisting of easy, complicated [reverse coded], confusing [reverse coded], intuitive, and user-friendly.
The reliability for each index is shown in Table 3.

Index                        Cronbach's alpha
Interaction satisfaction     Speech: .77   DTMF: .81
Interaction entertainment    Speech: .84   DTMF: .79
Interaction naturalness      Speech: .84   DTMF: .77
System satisfaction          Speech: .82   DTMF: .83
System entertainment         Speech: .92   DTMF: .75
System easiness              Speech: .87   DTMF: .68

Table 3. Reliability of each index

Lastly, after completing both conditions and their associated questionnaires, users were asked to write the answers to the following questions:
- Which of the two navigation methods did you prefer?
- Why?
- What would it take (either changes to the system or circumstances of use) to get you to use the navigation method that you least preferred?
RESULTS
Effectiveness of a system
Table 4 shows the success rate for each task with each of
the two input modalities.
Task                                                                                         Speech   DTMF
Task 1: Log on                                                                                69%     100%
Task 2: Listen to the fourth message in the mail box                                          56%      75%
Task 3: Find out if they received a message from a particular person. If so, listen to it.   56%      63%
Task 4: Reply to the above message.                                                           44%      56%
Task 5: Delete the message.                                                                   56%      50%
Task 6: Find out what time they are meeting with David Jones on Friday.                      69%      38%

Table 4. Success rate for each task with each modality
Efficiency of the system
Table 5 shows the average number of seconds to complete
each task with both modalities. For each task, only the
times the user succeeded at the task were used to calculate
the average.
Task                                                                                         Speech    DTMF
Task 1: Log on                                                                               42 sec.   22 sec.
Task 2: Listen to the fourth message in the mail box                                         56 sec.   47 sec.
Task 3: Find out if they received a message from a particular person. If so, listen to it.  19 sec.   46 sec.
Task 4: Reply to the above message.                                                          10 sec.    6 sec.
Task 5: Delete the message.                                                                  20 sec.    9 sec.
Task 6: Find out what time they are meeting with David Jones on Friday.                      44 sec.  102 sec.

Table 5. Average times (in seconds) for task completion
User satisfaction
As discussed before, we used six indices to measure user satisfaction in greater detail. For the statistical analyses, we used a repeated measures ANOVA with modality as the repeated factor. There was no between-subjects factor.
Interaction satisfaction
Participants evaluated their interaction via the speech modality (M = 5.16; S.D. = 1.99) as more satisfying than their interaction with the DTMF modality (M = 3.84; S.D. = 1.81), F(1, 15) = 4.35, p < .055, η² = .23.
Interaction entertainment
Participants evaluated their interaction via the speech modality (M = 5.79; S.D. = 1.83) as more entertaining than their interaction with the DTMF modality (M = 3.69; S.D. = 1.47), F(1, 15) = 12.77, p < .01, η² = .46.
Interaction naturalness
Participants evaluated their interaction via the speech modality (M = 4.89; S.D. = 2.36) as more natural than their interaction with the DTMF modality (M = 3.65; S.D. = 1.63), F(1, 15) = 4.12, p < .06, η² = .22.
Figure 1 summarizes the mean values for the three indices used to evaluate the interaction.

Figure 1. Mean values for user evaluation of the interaction

System satisfaction
The speech modality (M = 4.95; S.D. = 1.97) and the DTMF modality (M = 4.41; S.D. = 1.67) did not differ significantly with regard to the level of satisfaction with the system, F(1, 15) = .97, p < .34, η² = .061.
System entertainment
Participants evaluated the entertainment value of the system more positively with the speech modality (M = 6.06; S.D. = 2.03) than with the DTMF modality (M = 3.54; S.D. = 1.39), F(1, 15) = 17.27, p < .001, η² = .54.
System easiness
Participants evaluated the system as being easier when using the speech modality (M = 6.06; S.D. = 2.07) than when using the DTMF modality (M = 4.41; S.D. = 1.77), F(1, 15) = 9.05, p < .01, η² = .38.

Figure 2. Mean values for user evaluation of the system
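As an aside on the analysis, with only two levels of the repeated factor (speech vs. DTMF) the repeated measures ANOVA reported above is equivalent to a paired comparison, with F(1, 15) equal to the square of the paired t statistic. A minimal sketch with invented scores (not our data):

    # Hypothetical sketch: paired comparison of per-participant index scores for two
    # modalities. With two repeated-measures levels, F(1, n-1) equals the squared
    # paired t statistic. The scores below are invented for illustration.
    import statistics
    import math

    speech = [6.2, 4.8, 5.5, 7.0, 3.9, 6.1, 5.0, 4.4]
    dtmf   = [4.1, 3.9, 4.8, 5.2, 3.0, 4.5, 3.8, 3.6]

    diffs = [s - d for s, d in zip(speech, dtmf)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    F = t ** 2

    print(f"t({n - 1}) = {t:.2f},  F(1, {n - 1}) = {F:.2f}")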
Modality Preference
In response to the question “which of the two navigation
methods did you prefer?” 69% of participants (N = 11)
chose speech as their preferred modality, whereas only
25% of participants (N = 4) selected DTMF. One user did
not show any preference for a particular modality. When
asked why, participants who chose the speech modality
mostly indicated that it is because using speech is easy,
intuitive, flexible, and fun. The other main reason cited was
that it frees up their hands and enables them to do multiple
tasks at the same time. The dominant reason for those who
preferred the DTMF modality was that it is less error-prone
than speech. One participant wrote that she prefers DTMF,
simply because “she can interrupt the system more easily”.
When asked what would be required to make the participant want to use their least preferred modality, we were expecting to see things like a need for hands-free usage for those who had preferred DTMF, and a need for silent and
private interaction (for the participants that preferred
speech). While one participant mentioned the usefulness of
DTMF in a noisy environment, most responded by saying
they would like to use a combination of modalities.
DISCUSSION
Unlike previous findings, our results indicate that users
prefer the spoken interaction to the DTMF interaction for
the NL based message retrieval system. This was in spite of
the fairly high error rates experienced with the NL
recognition technology. While it is not uncommon for a
grammar-based system to get accuracy levels in the mid-nineties for spoken phrases that are part of the known
vocabulary, the Mobile Assistant has an accuracy level
which ranges between 75 and 80% depending on the task.
The big advantage of an NL system is that it is highly usable
by first-time and novice users, and this might have been a
factor in the results.
Another factor that might have contributed to the
preference for speech is the trade-off that users make
between the advantage of using a speech-based system
(such as the ability to control a device in a hands-free
mode) and the disadvantage of dealing with recognition
errors. In the domain of messaging, the need for hands-free
control is well noted when attempting to access messages
from a cellular car phone. There is also the fun factor to
consider. When interviewing a participant as to why he
preferred the speech modality when his experience with the
speech system had been rather dismal (to our observation),
he replied “Well, I guess speech is just more fun.”
Lastly, one can speculate that it may have seemed more
natural for users to speak to the system since the system
was speaking to them. As mentioned earlier, in both
conditions, the system spoke to the participants using
synthetic speech. The system is designed to be contrite and
apologetic if it does not understand what the user is saying.
It also makes (rather feeble) attempts at humor, and it is always polite. Perhaps, in keeping with the media equation theory [14], these rather anthropomorphic characteristics of
the system contributed to the findings.
ACKNOWLEDGMENTS
We thank David Wood for his critical role in the vision and
implementation of the Mobile Assistant project, Marisa
Viveros for her leadership and support of this and other
research projects, and all the kind participants who took
part in the experiment and gave us their feedback. We also
thank the other MA team members for their valuable work
on the project since without them it never could have
happened!
REFERENCES
1. Ballentine, B., Morgan D. How to Build a Speech
Recognition Application: A Style Guide for Telephony
Dialogues. Published by Enterprise Integration Group,
Inc., San Ramon, California, 1999.
2. Ballentine, B. Re-engineering the Speech Menu: A
Device Approach to Interactive List-Selection. In
Gardner-Bonneau (ed.) Human Factors and Voice
Interactive Systems. Kluwer Academic Publishers, 1999
3. Boyce, S. Natural Spoken Dialogue Systems for
Telephony Applications. In Communications of the
ACM, September 2000, Vol. 43, Number 9
4. Davies, K. et al. The IBM Conversational Telephony
System for Financial Applications. In Proceedings of
Eurospeech ’99 . Budapest, Hungary, Sept. 1999
5. Delogu, C., Di Carlo, A., Rotundi, P., & Sartori, D.
A comparison between DTMF and ASR IVR services
through objective and subjective evaluation FUB report:
5D01398. Proceedings of "IVTTA'98", Turin,
September 1998, pp. 145-150
6. European Telecommunications Standards Institute
(ETSI). Human Factors (HF), Guide for usability
evaluations. ETSI Technical Report, ETR 095, 1993.
7. Fay,D. Interfaces to automated telephone services: Do
users prefer touchtone or automatic speech recognition?
In Proceedings of the 14th International Symposium on
Human Factors in Telecommunications (pp. 339-349).
Darmstadt, Germany: R. v. Decker’s Verlag. 1993
8. Foster, J.C., McInnes, F.R., Jack, M.A., Love, S.,
Dutton, R. T., Nairn, I.A., White, L.S. An experimental
evaluation of preference for data entry method in
automated telephone services. Behaviour & Information Technology, 17 (2), 82-92, 1998.
9. Gardner-Bonneau, D, Guidelines for Speech Enabled
IVR Application Design. In Gardner-Bonneau (ed.)
Human Factors and Voice Interactive Systems. Kluwer
Academic Publishers, 1999
10. Greve, F. Dante’s 8th circle of hell: Voice mail. St. Paul
Pioneer Press, 1996.
11. Goldstein, M., Bretan, I., Sallnas, E.-L. & Bjork, H.
Navigational abilities in audial voice-controlled
dialogue structures. Behaviour & Information
Technology, 18 (2), 83-95.
12. Karis, D. Speech recognition systems: performance,
preference, and design. In 16th International
Symposium on Human Factors in Telecommunications
1997, pp. 65-72.
13. Nuance. Market Research: 2000 Speech User
Scorecard. Menlo Park, CA: Nuance. 2000.
14. Reeves, B. and Nass, C. The media equation: How
people treat computers, television, and new media like
real people and places. Cambridge University Press,
New York, 1996.
15. Schmandt, C. Voice Communication with Computers:
Conversational Systems. Van Nostrand Reinhold, New
York, 1994