
From: AAAI Technical Report SS-95-06. Compilation copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Evaluating the Effectiveness of Dialogue for an
Automated Spoken Questionnaire
Stephen Sutton, Brian Hansen, Terri Lander, David G. Novick, Ronald Cole
Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology
20000 NW Walker Rd.
P.O. Box 91000
Portland, OR 97291-1000
Email: sutton@cse.ogi.edu
Abstract

We present and apply an empirical methodology for evaluating the effectiveness of dialogues in spoken language systems. This methodology is suitable in particular for evaluation of dialogue-based systems that collect information from the user, such as an automated spoken questionnaire. Our method for assessing effectiveness involves coding answers from users for responsiveness. For this effort, we developed a behavioral coding scheme tailored to the requirements of automated spoken questionnaires interacting via the telephone. The codes cover a range of behavior from "Concise" to "No response." We have used this evaluation methodology in the development of an automated spoken questionnaire. In connection with this project, we collected over 4,000 telephone calls responding to the questionnaire. A sample of the calls was transcribed and coded using our behavioral coding scheme. We then used the data from the codes to choose among alternative protocols for the dialogue and to evaluate differences in system voice, such as natural versus synthetic and male versus female. In particular, we illustrate the utility of our methodology by testing the hypothesis that a synthesized system voice would elicit more constrained user responses than a human voice, and report the evaluation results.

Introduction

In this paper, we present an evaluation methodology that was used in the development of a prototype automated spoken questionnaire (ASQ) for the Year 2000 Census (Cole et al., 1994). Our motivation arises from the need for an evaluation metric and a systematic procedure with which to measure the effectiveness of a dialogue. This, in turn, would provide a basis for making dialogue design decisions, both in terms of establishing an appropriate system protocol and exploring the effects of varying the system's voice (e.g., human or synthesized).

Briefly, the prototype ASQ is designed to capture census information from a telephone dialogue. Typical information includes name, sex, date of birth, marital status, origin and race.1 The nature of interaction is predominantly single-initiative and is led by the system. Information is elicited through a series of questions and follow-up questions (or sub-dialogues). For example, a person's marital status is determined as follows:

• Have you ever been married? Please say yes or no.

If "yes" then:

• Which of the following best describes your marital status: now married, widowed, divorced or separated?

1. The specific information requirements were laid out by the U.S. Bureau of the Census in accordance with the contents of the written Census questionnaire.

The project is currently entering an evaluation phase in which the system will be assessed as part of the larger 1995 Census Test. Approximately 200,000 callers from three selected test cities will be given the option of providing Census information by filling out (and returning by mail) a Census questionnaire, or via the telephone through interaction with a human operator or the ASQ.

Despite the many advances in spoken language system technology, state-of-the-art systems must still find ways to limit the expected vocabulary of user responses, especially in a task such as the Census where the system is faced with a wide variety of regional accents. The problem is further exacerbated by the periodic nature of the census task: the user cannot be expected to learn the capabilities of the system given that they interact with the system once every ten years. These factors require that our systems elicit only the most concise and recognizable answers to questions, and that we empirically refine the questions in order to elicit those concise answers. There is a danger, however, in making conciseness our sole criterion; a protocol that relied entirely on yes/no answers would be easy for the recognizer but frustrating for users. Thus, the design process involves exploring the space of possible dialogues in search of one that is relatively natural without compromising the overall effectiveness of the system.
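To make the protocol space being explored concrete, the following sketch shows one plausible representation of a single-initiative question with a constrained answer vocabulary and a gated follow-up sub-dialogue, mirroring the marital-status example above. The Question class and the ask callback are our hypothetical constructions, not the ASQ's actual implementation.

# A minimal sketch (not the authors' implementation) of a single-initiative
# protocol: each question carries the vocabulary the recognizer expects and
# the follow-up questions that a given answer triggers.
from dataclasses import dataclass, field

@dataclass
class Question:
    prompt: str                                    # text spoken by the system
    vocabulary: list                               # answers the recognizer listens for
    followups: dict = field(default_factory=dict)  # answer -> [Question, ...]

marital = Question(
    prompt="Have you ever been married? Please say yes or no.",
    vocabulary=["yes", "no"],
    followups={"yes": [Question(
        prompt=("Which of the following best describes your marital status: "
                "now married, widowed, divorced or separated?"),
        vocabulary=["now married", "widowed", "divorced", "separated"])]},
)

def run(question, ask):
    """Ask one question, then any sub-dialogue its answer triggers."""
    answer = ask(question.prompt, question.vocabulary)
    answers = {question.prompt: answer}
    for follow in question.followups.get(answer, []):
        answers.update(run(follow, ask))
    return answers

# Example: run(marital, ask=lambda prompt, vocab: input(prompt + " "))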
What do we mean by effectiveness? The census task
has a clearly defined objective, namely to elicit the
requisite information from callers. Thus, we choose the
responsiveness of callers’ utterances as our primary
evaluation metric. To this end, we developed a behavioral coding scheme for characterizing callers' utterances. We report on the replicability of coding results by multiple judges and illustrate the utility of the coding scheme by testing the hypothesis that a synthesized system voice would elicit more constrained user responses than a human voice.
Behavioral coding
Behavioral coding is an analytical technique for classifying users' responses according to type. The behavioral coding scheme provides a set of features that characterize aspects of the response deemed useful for ASQ evaluation purposes. Particular emphasis was placed on capturing the responsiveness of utterances.
The behavioral coding scheme is an adaptation of a coding scheme used by the Bureau of the Census (Esposito & Rothgeb, 1992) for the purposes of pretesting a questionnaire. It was intended to identify particular interviewer and respondent behavior indicative of problems with a question or sections of the questionnaire. We made a number of key modifications to improve its suitability for use with an ASQ. For instance, the original scheme included a single class for "adequate answers," which we expanded into three separate classes that were significant for expected accuracy of recognition. Also, the single class for "inadequate answers" was expanded into two classes that would distinguish utterances useful for dialogue design. The precise nature of these new classes will be introduced shortly. These extensions provide additional detail useful for improving the system.

The behavioral coding scheme consists of eleven classes. Each class has an associated code which is used during transcription to label subjects' responses. Broadly speaking, the classes can be organized into four groups: (1) adequate answers, (2) inadequate answers, (3) "meta" codes, and (4) multiple codes. In the following sections, we describe the general characteristics of each group and individual classes within a group. Table 1 gives a summary of the behavioral coding scheme, including example behavioral code classifications.2

2. See Lander (1994) for a more detailed account of the behavioral coding scheme, including labeling conventions.

Code | Response Class            | Description                                                                     | Example
AA1  | Adequate Answer 1         | Answer is concise and responsive.                                               | S: Have you ever been married?  U: Yes
AA2  | Adequate Answer 2         | Answer is usable but not concise.                                               | S: Have you ever been married?  U: No I haven't
AA3  | Adequate Answer 3         | Answer is responsive but not usable.                                            | S: Have you ever been married?  U: Unfortunately
IA1  | Inadequate Answer 1       | Answer does not appear to be responsive.                                        | S: What is your sex, female or male?  U: Neither
IA2  | Inadequate Answer 2       | User says nothing at all.                                                       | S: What is your sex, female or male?  U: <silence>
QA   | Qualified Answer          | User expresses uncertainty in an otherwise adequate answer.                     | S: What year were you born?  U: Nineteen fifty five I think
RC   | Request for Clarification | User requests clarification as to the meaning of a concept or survey question. | S: Are you black, white or other?  U: What do you mean?
IN   | Interruption              | User interrupts the speaking of the question.                                   | S: What year were you born?  U: *teen fifty five
DK   | Don't Know                | User responds "I don't know" or some other equivalent formulation.              | S: Are you black, white or other?  U: I'm not sure
RF   | Refusal                   | User refuses to answer.                                                         | S: What year were you born?  U: I'm not telling you
O    | Other                     | User behavior that is not captured by the codes listed above.                   | S: What year were you born?  U: Thirty two <noise>

Table 1: Summary of behavioral coding scheme
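Restating Table 1 compactly, the eleven codes and their four groups can be captured in a small lookup structure. This is merely our restatement of the table, not code from the ASQ project.

# Our compact restatement of Table 1: eleven codes in four groups.
BEHAVIOR_CODES = {
    "adequate answers":   ["AA1", "AA2", "AA3"],
    "inadequate answers": ["IA1", "IA2"],
    "meta codes":         ["RC", "DK", "RF"],
    "multiple codes":     ["QA", "IN", "O"],
}
assert sum(len(codes) for codes in BEHAVIOR_CODES.values()) == 11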
Adequate answers
Adequate answers are responses that contain the required information. Adequacy is a measure of responsiveness with respect to the intent of the question rather than the intent of the answer. That is, even though the user may have attempted to provide a responsive answer, if she misunderstood the question and gave the wrong information it would be regarded as "inadequate" according to this definition.

We distinguish among three levels of adequate answers. Adequate answer 1 (AA1) is a concise response in which the answer contains precisely the information sought. Adequate answer 2 (AA2) is a response considered usable but not concise; an AA2 contains the sought-after information along with additional, usually predictable, words. An Adequate answer 3 (AA3) is responsive but is problematic in some aspect. Typical AA3 characteristics include:
• Verbosity. The speaker is verbose in an unpredictable way, such as "Yeah, we were married once but then I left her" in response to the question "Have you ever been married?"

• Truncation. The speaker gets interrupted by the system before she is done speaking. For instance, the speaker may begin to say "thirty four" but get cut off after uttering "Thirty-fo*".

• Additional information. The user provides more information than was specifically requested. For example, the user says both first name and last name in response to the request "Please say your last name."

• Self correction. The person provides multiple answers, as a result of correcting an apparent mistake, such as "yes, I mean no."

• Paraphrasing. The response is phrased in such a way that the meaning is still apparent given adequate knowledge and inference capabilities. For example, the response "I'm sure not female" in response to the question "What is your sex, female or male?"
From an ASQ perspective, AA1 responses consist entirely of target words that appear in a minimal recognition vocabulary. These are the preferred kind of response. AA2 responses consist of target words in a predictable context and require word-spotting speech recognition capabilities. AA3 responses typically require natural language processing, although the tendency for unpredictable words may exceed speech recognition capabilities and as such may often be unusable.
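As an illustration of the recognition burden each class implies (our sketch, not project code), a classifier over a minimal target vocabulary might look like the following. Distinguishing AA3 from inadequate answers would in reality require semantics, which the sketch deliberately leaves out.

# Illustrative sketch (ours, not the authors') of the recognition burden the
# three adequate-answer classes imply. AA1: the utterance is exactly a target
# phrase from the minimal vocabulary. AA2: a target word occurs amid extra,
# predictable words (word spotting). Everything else responsive would need
# natural language processing (AA3) -- or another code entirely, which this
# sketch cannot tell apart without semantics.
def adequacy_class(utterance, vocabulary):
    words = utterance.lower().split()
    if " ".join(words) in vocabulary:
        return "AA1"                  # exactly a target phrase
    if any(word in vocabulary for word in words):
        return "AA2"                  # target word plus predictable context
    return "AA3 or other"             # needs NLP or a different code

print(adequacy_class("yes", {"yes", "no"}))           # -> AA1
print(adequacy_class("no i haven't", {"yes", "no"}))  # -> AA2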
Inadequate answers
Inadequate answers are responses that do not contain the required information. We distinguish between two kinds of inadequate response. Inadequate answer 1 (IA1) is a case where the user says something other than the information required. Responses falling into this class may arise for a number of reasons, including misunderstanding of the question and general uncooperative behavior. Inadequate answer 2 (IA2) is a case where the user says nothing at all. There are a number of possible reasons for observing such behavior, including users choosing not to respond, users not realizing they were expected to respond, or users having hung up.

From an ASQ perspective, the challenge is to distinguish IA1 responses from other classes of response. Of particular concern is the problem of false positives: misrecognizing an inadequate response as an adequate response. IA2 responses present less of a problem than IA1 responses, although high levels of background noise can cause difficulties. Also, it is useful to monitor the drop-out rate of callers since this can provide some indication of user frustration levels.
Meta codes
In many ways, meta codes are related to the group of inadequate answers considered earlier. What sets them apart, however, is some apparent reason for their inadequacy. We distinguish between three kinds of meta responses. Request for clarification (RC) responses are where the user, instead of providing an answer, seeks to clarify the question. There are various kinds of clarification requests possible, including general request for repetition, specific request for repetition, specific request for confirmation, and specific request for specification (Lloyd, 1992). Don't know (DK) responses are cases where the user indicates that the information is not available. Refusal (RF) responses are where the user explicitly declines to provide the requested information.
Multiple codes
Three codes are intended for use in certain restricted combinations with codes already described. The Qualified answer (QA) code signifies uncertainty in the response. It should be used in conjunction with the AA3 code, such that an otherwise concise response (normally coded as AA1) becomes "AA3+QA" when qualified.
The Interruption (IN) code is used to label responses that interrupt system prompts. In systems that do not support overlapping speech, a user's interruption will result in part or all of the response being cut off. Note that this code is intended for use when the user interrupts the system rather than the other way around. The IN code is used in conjunction with either the AA3 or IA1 codes depending on the level of intelligibility of the response.
The Other (O) code denotes "other respondent
behavior" and is used to classify behaviors such as
extraneous speech not directed at the computer, laughs,
coughs, sneezes and excessive background noise.
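Since multiple codes such as "AA3+QA" must later be broken into their constituent parts (as in the agreement analysis below), a labeling pipeline might split composite labels as follows. The "+" separator comes from the paper's own notation; the helper itself is hypothetical.

# Hypothetical helper for breaking a composite label such as "AA3+QA" into
# its constituent codes (the "+" notation is the paper's; the helper is ours).
def split_codes(label):
    return [code.strip() for code in label.split("+")]

assert split_codes("AA3+QA") == ["AA3", "QA"]
assert split_codes("AA1") == ["AA1"]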
Code | Labeler 1 | Labeler 2 | Agreement
AA1  | 1036      | 1065      | 1029
AA2  | 53        | 37        | 34
AA3  | 50        | 34        | 29
IA1  | 1         | 4         | 1
IA2  | 1         | 0         | 0
QA   | 0         | 2         | 0
RC   | 0         | 0         | 0
IN   | 14        | 0         | 0
DK   | 0         | 0         | 0
RF   | 0         | 0         | 0
O    | 59        | 31        | 18

Table 2: Agreement statistics
Evaluation
We conducted a data collection effort in which callers interacted with an early system prototype that incorporated recognition capabilities only at decision points in the dialogue. At all other places, the user's response was simply recorded for analysis by a human transcriber. This configuration enabled us to evaluate alternative protocols and to experiment with varying speaker characteristics (e.g., synthesized or human voice, male or female). At the same time, it enabled us to collect data to train task-specific speech recognizers. The data collection effort yielded approximately 4,000 calls. In the following sections, we describe two experiments that make use of these data and demonstrate the practicality of the behavioral coding scheme.
Inter-rater agreement
We conducted an experiment to assess the inter-rater agreement of two labelers with respect to the behavior codes. The study involved a random data sample containing 1141 utterances (100 calls) from the human/male condition. The overall level of agreement was measured at 91.3%. Table 2 provides a summary of agreement for each behavior code. It shows the number of times each labeler assigned a given category and the number of times the two labelers were in agreement. These data reflect multiple codes having been broken down into their constituent parts. It should be noted that these results are based on human judgments only. It is possible to perform automated screening of various kinds to filter out obvious errors. The fact that the level of agreement is quite high is in part due to the relatively high proportion of responsive utterances, which in turn is a reflection of the success of our efforts during earlier rounds of protocol design.
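For reference, the percent-agreement figure and the per-code tallies of Table 2 can be computed with a few lines of code. This is a sketch under our own assumptions about the data layout (two parallel lists of per-utterance labels, composite codes already split apart), not the tooling used in the study.

# Sketch of the agreement statistics, assuming two parallel lists of
# per-utterance code labels with composite codes already split apart.
def percent_agreement(labels_a, labels_b):
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

def per_code_table(labels_a, labels_b):
    """code -> [times labeler A used it, times labeler B used it, agreements]"""
    table = {}
    for a, b in zip(labels_a, labels_b):
        for code in {a, b}:
            row = table.setdefault(code, [0, 0, 0])
            row[0] += (code == a)
            row[1] += (code == b)
            row[2] += (a == b == code)
    return table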
An analysis of the cases of disagreement revealed various probable causes. The labelers had different levels of experience with using the behavioral coding scheme. One labeler was very experienced and had been directly involved in designing the coding conventions. The other labeler had only a moderate amount of experience. In addition to misunderstandings of the coding conventions and basic mistakes, there were perceptual differences as well. Some of the observed cases of perceptual differences were (1) uncertainty whether the caller was addressing the computer or someone else in the room; (2) ambiguity in the information given--in one case, the caller gave two names in response to a last name question and it is not clear whether these are both last names or whether these are a first and a last name; (3) disagreements arising from difficulties hearing and interpreting what the caller is saying; and (4) disagreements arising from difficulty in coding subjective judgements of levels of background noise.
Human versus synthetic speech
We conducted an experiment to compare the effects of human speech versus synthetic speech on the user's behavior. In one condition, the ASQ used a recorded human voice for system prompts. In the other condition, it used a commercially available speech synthesizer, DECtalk. We considered the distribution of behavior codes associated with individual questions. In particular, this initial analysis examines the distribution of the AA1, AA1&AA2 and AA1&AA2&AA3 classes. These classes represent answers that are concise, concise or nearly so, and responsive, respectively.

Question       | AA1 (human/synth) | AA1&AA2 (human/synth) | AA1&AA2&AA3 (human/synth)
name           | 91.3 / 92.1       | 96.3 / 95.5           | 99.7 / 99.6
sex            | 97.3 / 96.9       | 98.5 / 98.4           | 99.7 / 99.4
marital status | 91.9 / 93.7       | 93.1 / 95.6*          | 99.1 / 99.7*
date of birth  | 91.8* / 90.1      | 95.5 / 94.8           | 99.6 / 99.6
origin         | 97.5 / 99.0*      | 98.1 / 99.4*          | 98.9 / 99.7
race           | 91.2 / 90.1       | 95.5 / 94.0           | 99.3 / 98.7

Table 3: Percentage of responsive answers
Our initial hypothesis of how the conditions might influence a user's behavior is that answers would be more constrained in the synthesized case than in the human case. That is, we expected to see a higher proportion of AA1 responses in the synthesized case. Our reasoning was that we expected that people would be more careful and conscious of their responses when it is apparent they are conversing with a computer.
Table 3 contains a summary of the results. The table shows the percentage of responses in each class (AA1, AA1&AA2, AA1&AA2&AA3) for each question. Questions with sub-questions have been combined by taking the weighted average of the results from the sub-questions. Cases where one condition shows a significant increase in responsiveness are marked by an asterisk (*).
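The paper does not name the significance test behind the asterisks; a two-proportion z-test over the underlying response counts is one conventional choice, sketched here purely as an illustration under that assumption.

# The paper does not state its significance test; a two-proportion z-test
# over the underlying response counts is one conventional choice and is
# sketched here purely as an illustration.
from math import erf, sqrt

def two_proportion_p(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation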
The results provide inconclusive support for our
hypothesis. The synthesized condition appears to show
no notable increase in the proportion of AA1 answers. Likewise, the AA1&AA2 and AA1&AA2&AA3 cases provide little in the way of conclusive evidence to suggest an effect.
There is a certain amount of variability in these results, including cases where the human condition appears more responsive than the synthesized condition. A possible factor that may contribute to this variability is the "delivery" of each question. That is, prosodically "marked" questions are likely to influence the conciseness of the user's answers and so may account for the unexpected results. This claim will require further analysis to substantiate.
Even though this analysis of the effects of human versus synthetic speech provides inconclusive support for our hypothesis, it has served to demonstrate the utility of the behavioral coding scheme for evaluating alternatives in dialogue.
Discussion
Our behavioral coding scheme provides a useful metric for performing an objective evaluation of user behavior. Such an evaluation procedure can be complemented with subjective user feedback to provide an empirically-based evaluation of a protocol. Furthermore, when incorporated into an iterative design process for refining protocols, the behavioral coding scheme provides a powerful tool for developing an effective ASQ. In particular, the behavioral coding scheme aids ASQ development in a number of respects, including:
• Performance. The coding scheme provides an indication of how well the system can do. System evaluation should take into account the number of adequate answers (AA1, AA2 and AA3), not just the overall number of responses.

• Effort allocation. The coding scheme suggests where to expend effort: which questions of the protocol would benefit most from refinement, and also which aspects of the system (e.g., speech recognition, natural language processing or dialogue) most warrant improvement.

• Protocol development. The coding scheme, when used in an iterative design process, helps guide the protocol refinement and improve responsiveness.

• Training. The coding scheme helps identify data most suitable for training speech recognizers.

• Validation. The coding scheme can be used as a cross-check for validating transcriptions and so helps to eliminate transcription errors.

• Language. The coding scheme is usable across different languages. For instance, we made use of the same scheme when developing a prototype Census ASQ for Spanish.
The behavioral coding scheme described is intended
for summarizing user responses only. It would be
possible to extend the scheme to cover aspects of
interviewer (or system) behavior. Although it was
designed in the context of the Census task, it should be
possible to adapt it easily for use in other spoken
language system tasks.
Conclusion
We have presented a coding scheme for classifying the behavior of users interacting with an ASQ. This coding scheme provides a basis for evaluating the effectiveness of ASQ dialogues based on quantifying the responsiveness of users. We have made extensive use of the coding scheme in the Census task for designing the dialogue of a prototype ASQ and measuring the system's effectiveness. This paper has provided a detailed account of the behavioral coding scheme and discussed some of its uses. We reported on an experiment in which we measured the inter-rater agreement between two labelers, with respect to the behavioral coding scheme, to be 91.3%. Also, we demonstrated the utility of the coding scheme in testing the hypothesis that synthesized speech would elicit more concise responses than recorded human speech when used for system prompts. Our results provided inconclusive support for this hypothesis.
References
Cole, R. A., Novick, D. G., Fanty, M., Vermeulen, P., Sutton, S., Burnett, D., and Schalkwyk, J. 1994. A prototype voice-response questionnaire for the U.S. Census. Proceedings of the International Conference on Spoken Language Processing, 683-686. Yokohama, Japan.
Esposito, J., and Rothgeb, J. 1992. Behavior coding
manual: CATI/CAPI overlap. Draft report. Bureau of
Labor Statistics/Bureau of the Census.
Lander, T. 1994. Behavior Code Guide. Internal
report. Center for Spoken Language Understanding,
Oregon Graduate Institute.
Lloyd, P. 1992. The role of clarification requests in
children’s communication of route directions by
telephone. Discourse Processes, 15: 357-374.