Error Detection in Human-Machine Interaction Dan Bohus

DoD Group, Oct 2002
Errors in Spoken-Language Interfaces

Speech recognition is problematic:
- Input signal quality
- Accents, non-native speakers
- Spoken-language disfluencies: stutters, false starts, /mm/, /um/
Typical word error rates in SDS: 10-30%
Systems today lack the ability to recover gracefully from errors
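
To make the word error rate figure concrete: WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch; the function name `word_error_rate` is an assumption made for illustration, and the sample utterance is the "Hamilton, Ontario" turn from the example dialog in this deck.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (HAMILTON -> HILTON) out of two reference words.
    print(word_error_rate("hamilton ontario", "hilton ontario"))  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs more words than were spoken.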

An example
S: Are you a registered user ?
U: No I'm not. No [NO I'M NOT NO]
S: What is your full name ?
U: Adam Schumacher [ATHENS IN AKRON]
S: There is an Athens in Georgia and in Greece. Which destination did you want ?
U: Neither [NEITHER]
S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want ?
U: Georgia [GEORGIA]
S: A flight from Athens... Where do you want to go ?
U: Start over [START OVER]
S: Do you really want to start over ?
U: Yes [YES]
S: What city are you leaving from ?
U: Hamilton, Ontario [HILTON ONTARIO]
S: Sorry, I'm not sure I understood what you said. Where are you leaving from ?
U: Hamilton [HILTON]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from ?
U: Toronto [TORONTO]
Pathway to a solution

Make systems aware of unreliability in their inputs
- Confidence scores

Develop a model which learns to optimally choose between several prevention/repair strategies
- Identify strategies
- Express them in a computable manner
- Develop the model
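
One simple way to picture the link between confidence scores and strategy choice is a fixed-threshold mapping. The sketch below is only an illustration: the strategy names, the thresholds (0.3, 0.6, 0.9) and the function `choose_strategy` are assumptions, not values from the talk. The pathway above aims at a model that learns this choice rather than one that hard-codes it.

```python
from enum import Enum


class Strategy(Enum):
    ACCEPT = "accept without verification"
    IMPLICIT_CONFIRM = "implicit verification"
    EXPLICIT_CONFIRM = "explicit verification"
    REJECT = "reject and ask again"


def choose_strategy(confidence: float,
                    reject_below: float = 0.3,
                    explicit_below: float = 0.6,
                    implicit_below: float = 0.9) -> Strategy:
    """Map a confidence score in [0, 1] to a strategy (hypothetical thresholds)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < reject_below:
        return Strategy.REJECT
    if confidence < explicit_below:
        return Strategy.EXPLICIT_CONFIRM
    if confidence < implicit_below:
        return Strategy.IMPLICIT_CONFIRM
    return Strategy.ACCEPT


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.8, 0.95):
        print(score, choose_strategy(score).value)
```

A learned model would replace the fixed thresholds with a policy estimated from data, for example from the cost of each strategy when the recognition was in fact right or wrong.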
Papers

Error Detection in Spoken Human-Machine Interaction
[E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Problem Spotting in Human-Machine Interaction
[E. Krahmer, M. Swerts, M. Theune, M. Weegels]

The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates
[E. Krahmer, M. Swerts, M. Theune, M. Weegels]
Goals

[Let’s look at dialog on page 2]
(1) Analysis of the positive and negative cues we use in response to implicit and explicit verification questions
(2) Explore the possibilities of spotting errors on-line

Explicit vs. Implicit

Explicit
- Presumably easier for the system to verify
  - But there’s evidence that it’s not as easy …
- Leads to more turns, less efficiency, frustration

Implicit
- Efficiency
- But induces a higher cognitive burden, which can result in more confusion
- ~ Systems don’t deal very well with it…
Clark & Schaefer

Grounding model
- Presentation phase
- Acceptance phase

Various indicators
- Go ON / YES
- Go BACK / NO

Can we detect them reliably (when following implicit and explicit verification questions) ?
Positive and Negative Cues
Positive               Negative
Short turns            Long turns
Unmarked word order    Marked word order
Confirm                Disconfirm
Answer                 No answer
No corrections         Corrections
No repetitions         Repetitions
New info               No new info
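
As a rough illustration of how some of these cues could be turned into features, the sketch below derives them from a single system/user exchange. Everything here is an assumption made for the example: the English cue word lists stand in for the Dutch originals, the slot dictionaries presuppose that the user turn has already been interpreted, and marked word order is left out because it needs syntactic analysis.

```python
from dataclasses import dataclass

CONFIRM_WORDS = {"yes", "yeah", "right", "correct"}      # assumed stand-ins
DISCONFIRM_WORDS = {"no", "nope", "wrong", "incorrect"}  # assumed stand-ins


@dataclass
class TurnCues:
    n_words: int
    empty: bool
    confirm: bool
    disconfirm: bool
    repeated: bool   # user repeats a value the system just mentioned
    corrected: bool  # user gives a different value for a mentioned slot
    new_info: bool   # user fills a slot the system did not mention


def extract_cues(system_slots: dict, user_words: list, user_slots: dict) -> TurnCues:
    """Compute positive/negative cue features for one user turn."""
    words = [w.lower() for w in user_words]
    shared = set(system_slots) & set(user_slots)
    return TurnCues(
        n_words=len(words),
        empty=not words,
        confirm=any(w in CONFIRM_WORDS for w in words),
        disconfirm=any(w in DISCONFIRM_WORDS for w in words),
        repeated=any(user_slots[s] == system_slots[s] for s in shared),
        corrected=any(user_slots[s] != system_slots[s] for s in shared),
        new_info=any(s not in system_slots for s in user_slots),
    )


if __name__ == "__main__":
    # System verified origin=Hilton; user answers "No, Hamilton".
    print(extract_cues({"origin": "Hilton"}, ["No", "Hamilton"],
                       {"origin": "Hamilton"}))
```

The repeated, corrected and new-info features depend on knowing what the user actually said, so they are the ones a live system can least rely on.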
Experimental Setup / Data
120 dialogs: Dutch SDS providing train timetable information
- 487 utterances
  - 44 (~10%) not used
    - Users accepting a wrong result
    - Barge-in
    - Users starting their own contribution
  - Left 443 resulting adjacent S/U utterances
Results – Nr of words
             Explicit    Implicit
~Problems    1.68        3.21
Problems     3.44        7.12
Results – Empty turns (%)
             Explicit    Implicit
~Problems    0%          3.4%
Problems     2.6%        10.3%
Results – Marked word order %
             Explicit    Implicit
~Problems    3.3%        1.2%
Problems     4.4%        26.9%
Results – Yes/No
                     ~Problems    Problems
Explicit    Yes      92.8%        6.1%
            No       0%           56.6%
            Other    7.1%         37.1%
Implicit    Yes      0%           0%
            No       0%           15.4%
            Other    100% ?       84.6%
Results – Repeated/Corrected/New
                         ~Problems    Problems
Explicit    Repeated     8.5%         23.9%
            Corrected    0%           72.6%
            New          11.4%        12.4%
Implicit    Repeated     2.4%         61.0%
            Corrected    0%           92.3%
            New          53.6%        36.5%
First conclusion

People use more negative cues when there are problems

And even more so for implicit confirmations (vs. explicit ones)
How well can you classify

Using individual features
- Look at precision/recall
- Explicit: absence of confirmation
- Implicit: non-zero number of corrections

Multiple features
- Used memory-based learning
- 97% accuracy (majority baseline 68%)
- Confirm + Correct is winning, although individually less good
- This is overall, right ? How about for explicit vs. implicit ?
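
For readers unfamiliar with the terms above, here is a minimal sketch of the three ingredients: precision/recall for a single cue, a majority baseline, and memory-based classification, shown here as plain k-nearest-neighbour lookup with an overlap distance. The toy instances and feature names are made up, and the slides do not describe the learner or settings actually used in the paper, so this should be read as an illustration of the idea only.

```python
from collections import Counter


def precision_recall(gold, predicted, positive="problem"):
    """Precision and recall of `positive` given parallel label lists."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def overlap_distance(a, b):
    """Number of feature positions on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(memory, instance, k=1):
    """Majority label among the k stored instances closest to `instance`.

    Distance ties are broken by storage order (sorted() is stable).
    """
    nearest = sorted(memory, key=lambda item: overlap_distance(item[0], instance))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


if __name__ == "__main__":
    # Toy instances (made up): (confirm, correction, long_turn) -> label.
    memory = [
        ((1, 0, 0), "ok"),
        ((1, 0, 1), "ok"),
        ((0, 1, 1), "problem"),
        ((0, 1, 0), "problem"),
        ((0, 0, 0), "ok"),
    ]
    test = [((1, 0, 0), "ok"), ((0, 1, 1), "problem"), ((0, 0, 1), "problem")]
    gold = [label for _, label in test]
    predicted = [knn_classify(memory, features) for features, _ in test]
    print(predicted, precision_recall(gold, predicted))
    print(majority_baseline([label for _, label in memory]))
```

The majority baseline is the yardstick for the accuracy figure: a classifier is only interesting to the extent that it beats always guessing the most frequent class.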

BUT !!!

How many of these features are available on-line?
Positive               Negative
Short turns            Long turns
Unmarked word order    Marked word order
Confirm                Disconfirm
Answer                 No answer
No corrections ?       Corrections ?
No repetitions ?       Repetitions ?
New info ?             No new info ?
What else can we throw at it ?
- Prosody (next paper)
- Lexical information
- Acoustic confidence scores
  - Maybe also of previous utterances
- Repetitions/Corrections/New info on transcript ?
…
…

Papers

Error Detection in Spoken Human-Machine Interaction
[E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Problem Spotting in Human-Machine Interaction
[E. Krahmer, M. Swerts, M. Theune, M. Weegels]

The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates
[E. Krahmer, M. Swerts, M. Theune, M. Weegels]
Goals

Investigate the prosodic correlates of disconfirmations
- Is this slightly different than before ? (i.e. now looking at any corrections? Answer: No)
- Looked at prosody on “NO” as a go_on vs. a go_back:
  - Do you want to fly from Pittsburgh ?
  - Shall I summarize your trip ?
Human-human
- Higher pitch range, longer duration
- Preceded by a longer delay
- High H% boundary tone

Expected to see the same behavior for disconfirmations in human-machine interaction
Prosodic correlates
Features         POSITIVE (‘go on’)    NEGATIVE (‘go back’)
Boundary tone    Low                   High
Duration         Short                 Long
Delay            Short                 Long
Pause            Short                 Long
Pitch range      Low                   High

- Yes, the correlations are there as expected
Perceptual analysis
- Took 40 “No” from No+stuff, 20 go_on and 20 go_back (note that some features are lost this way…)
- Forced-choice randomized task, w/ no feedback; 25 native speakers of Dutch
- Results
  - 17 go_on correctly identified above chance
  - 15 go_back correctly identified above chance; but also 1 incorrectly identified above chance.
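
The slides do not say how "above chance" was determined. One common choice for a two-alternative forced-choice task is a one-sided binomial test against p = 0.5. The sketch below assumes that test and a 0.05 significance level (both are assumptions, not details from the paper) and computes how many of the 25 listeners would have to agree on a label for a single stimulus to count as identified above chance.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def min_agreeing_listeners(n: int, alpha: float = 0.05) -> int:
    """Smallest k whose one-sided tail probability under chance is below alpha."""
    for k in range(n + 1):
        if binomial_tail(n, k) < alpha:
            return k
    raise ValueError("no k reaches the requested significance level")


if __name__ == "__main__":
    k = min_agreeing_listeners(25)
    print(k, round(binomial_tail(25, k), 4))
```

Under these assumptions the threshold comes out at 18 of 25 listeners; a different test or significance level would shift it.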
Discussion

Q1: Blurred relationships …
- Confidence annotation
- Go_on / Go_back signal
  - Is that the same as corrections ?
  - Is that the most general case for responses to implicit/explicit verifications, or should we have a separate detector ?

Q2: What other features could we throw at these problems ? What are the “most juicy” ones ?
Discussion

Q3: For implicit confirms, are these different in terms of induced response behavior ?
- When do you want to leave Pittsburgh ?
- Travelling from Pittsburgh … when do you want to leave ?
- When do you want to leave from Pittsburgh to Boston ?