Error Detection and Correction in SDS
Julia Hirschberg
CS 4706
Today
• Avoiding errors
• Detecting errors
  – From the user side: what cues does the user provide to indicate an error?
  – From the system side: how likely is it that the system made an error?
• Dealing with Errors: what can the system do when it thinks an error has occurred?
• Evaluating SDS: evaluating ‘problem’ dialogues
Avoiding misunderstandings
• The problem
• By imitating human performance:
  – Timing and grounding (Clark ’03)
  – Confirmation strategies
  – Clarification and repair subdialogues
Learning from Human Behavior: Features in Repetition Corrections (KTH)
[Bar chart: percentage of all repetitions, for adults vs. children, that are more clearly articulated, increased in loudness, or shift focus.]
Learning from Human Behavior (Krahmer et al ’01)
• ‘Go on’ and ‘go back’ signals in grounding situations (implicit/explicit verification)
• Positive cues: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
• Negative cues: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
• Hypotheses supported, but…
  – Can these cues be identified automatically?
  – How might they affect the design of SDS?
Systems Have Trouble Knowing When They’ve Made a Mistake
• Hard for humans to correct system misconceptions (Krahmer et al ’99)
  User: I want to go to Boston.
  System: What day do you want to go to Baltimore?
• Easier: answering explicit requests for confirmation or responding to ASR rejections
  System: Did you say you want to go to Baltimore?
  System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
• But constant confirmation or over-cautious rejection lengthens dialogue and decreases user satisfaction
…And Systems Have Trouble Recognizing User Corrections
• Probability of recognition failure increases after a misrecognition (Levow ’98)
• Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation) → more ASR error (Wade et al ’92, Oviatt et al ’96, Swerts & Ostendorf ’97, Levow ’98, Bell & Gustafson ’99)
Can Prosodic Information Help Systems Perform Better?
• If errors occur where speaker turns are prosodically ‘marked’…
  – Can we recognize turns that will be misrecognized by examining their prosody?
  – Can we modify our dialogue and recognition strategies to handle corrections more appropriately?
Approach
• Collect corpus from an interactive voice response system
• Identify speaker ‘turns’:
  – incorrectly recognized (misrecognitions)
  – where speakers first become aware of an error (aware sites)
  – that correct misrecognitions (corrections)
• Identify prosodic features of turns in each category and compare to other turns
• Use machine learning techniques to train a classifier to make these distinctions automatically
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [correction, aware site]
TOOT Dialogues
• Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan ’99)
  – type of initiative (system, user, mixed)
  – type of confirmation (explicit, implicit, none)
  – adaptability condition
• Subjects
  – 39 summer students
  – 16/23 (F/M)
  – 20/19 (native/non-native speakers)
• Platform: combined over-the-phone ASR and TTS (Kamm et al ’97) with web access to train information
• Task: find train information for 4 scenarios
• Corpus for current study: 2328 speaker turns, 52 dialogues
• Misrecognitions
  – Overall word accuracy: 61%
  – Overall concept accuracy (CA): 71%
  – e.g., “I want to go to Boston from Philadelphia” (2 domain concepts) recognized as “I want to go to Boston” (1 concept) = 50% CA
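To make the CA measure concrete, here is a minimal Python sketch. The keyword "parser" is a hypothetical stand-in for TOOT's actual semantic grammar, not part of the original system.

```python
# Minimal sketch of concept accuracy (CA) as defined above. The keyword
# "parser" below is a hypothetical stand-in for TOOT's real semantic grammar.
CITIES = {"boston", "philadelphia", "new york"}

def extract_concepts(utterance: str) -> set:
    """Toy parser: each known city mentioned counts as one domain concept."""
    text = utterance.lower()
    return {city for city in CITIES if city in text}

def concept_accuracy(reference: str, asr_hypothesis: str) -> float:
    """Fraction of the reference turn's domain concepts preserved by ASR."""
    ref, hyp = extract_concepts(reference), extract_concepts(asr_hypothesis)
    return len(ref & hyp) / len(ref) if ref else 1.0

# Reproduces the slide's example: 1 of 2 concepts survives -> CA = 0.5
print(concept_accuracy("I want to go to Boston from Philadelphia",
                       "I want to go to Boston"))
```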
A Successful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P M
S: I heard you say “go to New York City from Philadelphia on Sunday approximately at ten thirty p.m.” Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
U: Yes.
S: I am going to get the train schedule for you....
Are Misrecognitions, Aware Turns, Corrections Measurably Different from Other Turns?
• For each type of turn:
  – For each speaker, for each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
  – Perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns)
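A sketch of that per-speaker paired t-test in Python, assuming each turn is a record with a speaker ID, a feature value, and a misrecognition flag (the field names here are illustrative, not from the corpus):

```python
# Paired t-test over per-speaker means, as described above.
from collections import defaultdict
from statistics import mean
from scipy.stats import ttest_rel

def paired_speaker_ttest(turns, feature):
    groups = defaultdict(lambda: {"rec": [], "misrec": []})
    for t in turns:
        groups[t["speaker"]]["misrec" if t["misrec"] else "rec"].append(t[feature])
    # One (misrec mean, rec mean) pair per speaker who has both turn types.
    misrec_means = [mean(g["misrec"]) for g in groups.values() if g["rec"] and g["misrec"]]
    rec_means = [mean(g["rec"]) for g in groups.values() if g["rec"] and g["misrec"]]
    return ttest_rel(misrec_means, rec_means)  # t statistic and p value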
How: Prosodic Features Examined per Turn
• Raw prosodic/acoustic features
  – F0 maximum and mean (pitch excursion/range)
  – RMS maximum and mean (amplitude)
  – total duration
  – duration of preceding silence
  – amount of silence within turn
  – speaking rate (estimated from syllables of recognized string per second)
• Normalized versions of each feature (compared to first turn in task, to previous turn in task, and as Z scores)
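A sketch of those normalizations for one feature stream; the slide does not say whether "compared to" means a ratio or a difference, so the ratios below are an assumption:

```python
# Normalized versions of one prosodic feature (e.g., F0 max) over a single
# speaker's turns in task order. Ratio normalization is an assumption; the
# Z score is computed over that speaker's turns in the task.
from statistics import mean, stdev

def normalize(values):
    m = mean(values)
    s = stdev(values) if len(values) > 1 else 0.0
    return [{
        "raw": v,
        "vs_first_turn": v / values[0],                       # vs. first turn in task
        "vs_prev_turn": v / values[i - 1] if i > 0 else None, # vs. previous turn
        "z_score": (v - m) / s if s else 0.0,
    } for i, v in enumerate(values)]
```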
Distinguishing Correct Recognitions from Misrecognitions (NAACL ’00)
• Misrecognitions differ prosodically from correct recognitions in:
  – F0 maximum (higher)
  – RMS maximum (louder)
  – turn duration (longer)
  – preceding pause (longer)
  – tempo (slower)
• Effect holds up across speakers and even when hyperarticulated turns are excluded
WER-Based Results
Misrecognitions are higher in pitch, louder, longer, with more preceding pause and less internal silence:

Feature         T stat   Mean Misrec - Mean Rec   p
F0 Max          5.78     25.84 Hz                 0.000
F0 Mean         1.52     1.56 Hz                  0.140
RMS Max         2.52     150.56                   0.020
RMS Mean        -1.82    -25.05                   0.080
Duration        9.94     2.13 sec                 0.000
Prior Pause     5.586    0.29 sec                 0.000
Tempo           -4.71    0.54 sps                 0.000
% Sil in Turn   -1.48    -0.02%                   0.150
Predicting Turn Types Automatically
• Ripper (Cohen ’96) automatically induces rule sets for predicting turn types
  – greedy search guided by a measure of information gain
  – input: vectors of feature values
  – output: ordered rules for predicting the dependent variable and (cross-validated) scores for each rule set
• Independent variables:
  – all prosodic features, raw and normalized
  – experimental conditions (adaptability of system, initiative type, confirmation style, subject, task)
  – gender, native/non-native status
  – ASR recognized string, grammar, and acoustic confidence score
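Ripper itself is not in today's standard toolkits, so here is a sketch of the same setup with a scikit-learn decision tree swapped in as the rule learner; the feature values and labels are invented for illustration, not taken from the TOOT corpus:

```python
# Stand-in for the Ripper experiment: one feature vector per turn, a
# rule-like learner, and a cross-validated error estimate.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.array([
    # [f0_max, rms_max, duration_s, prior_pause_s, tempo_sps, asr_conf]
    [310.0, 72.0, 4.1, 0.9, 2.8, -3.2],
    [240.0, 60.0, 1.2, 0.2, 4.0, -1.1],
    [295.0, 70.0, 3.5, 0.7, 2.5, -4.5],
    [250.0, 62.0, 1.5, 0.3, 3.8, -0.9],
])
y = np.array([1, 0, 1, 0])  # 1 = misrecognized turn (WER-defined)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Cross-validated error, analogous to the estimated errors in the next table.
print(1 - cross_val_score(tree, X, y, cv=2).mean())
```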
ML Results: WER-defined Misrecognition

Feature Set                            Estimated Error (SE)
Baseline                               32.35% (NA)
Prosody, ASR Conf, String, Grammar     8.64% (0.53)
ASR Conf, String, Grammar              14.83% (0.81)
ASR String                             18.00% (0.86)
ASR Conf                               18.91% (1.00)
Prosody                                19.20% (0.80)
Duration                               20.92% (0.85)
Best Rule-Set for Predicting WER
Using prosody, ASR confidence, ASR string, ASR grammar:

if (conf <= -2.85) ^ (duration >= 1.27) then F
if (conf <= -4.34) then F
if (tempo <= 0.81) then F
if (conf <= -4.09) then F
if (conf <= -2.46) ^ (str contains “help”) then F
if (conf <= -2.47) ^ (ppau >= 0.77) ^ (tempo <= 0.25) then F
if (str contains “nope”) then F
if (dur >= 1.71) ^ (tempo <= 1.76) then F
else T
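The same rule set transcribed into a plain Python function (F = predicted misrecognition); the variable names follow the slide's abbreviations:

```python
# The slide's best rule set as code. conf = ASR acoustic confidence,
# duration and ppau (preceding pause) in seconds, tempo in syllables/sec,
# asr_str = the recognized string.
def predict_correct(conf, duration, tempo, ppau, asr_str):
    """Return False (F) if the turn is predicted to be misrecognized."""
    if conf <= -2.85 and duration >= 1.27: return False
    if conf <= -4.34: return False
    if tempo <= 0.81: return False
    if conf <= -4.09: return False
    if conf <= -2.46 and "help" in asr_str: return False
    if conf <= -2.47 and ppau >= 0.77 and tempo <= 0.25: return False
    if "nope" in asr_str: return False
    if duration >= 1.71 and tempo <= 1.76: return False
    return True  # T: predicted correct recognition
```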
Error Handling Strategies
• If systems can recognize their own recognition failures, how should they inform the user that they don’t understand (Goldberg et al ’03)?
  – System rephrasing vs. repetition vs. statement of not understanding
  – Apologies
• What behaviors might these produce?
  – Hyperarticulation
  – User frustration
  – User repetition vs. rephrasing
• What lessons do we learn?
  – When users are frustrated, they are generally harder to recognize accurately
  – Once users are misrecognized, they tend to be misrecognized more often and become increasingly frustrated
  – Apologies combined with rephrasing of system prompts tend to decrease frustration and improve WER: don’t just repeat!
  – Users are better recognized when they rephrase their input
Recognizing ‘Problematic’ Dialogues
• Hastie et al., “What’s the Trouble?” ACL 2002
• How to define a dialogue as problematic?
  – User satisfaction is low
  – Task is not completed
• How to recognize one?
  – Train on a corpus of recorded dialogues (1242 DARPA Communicator dialogues)
  – Predict:
    • User Satisfaction
    • Task Completion (0, 1, 2)
– User Satisfaction features: [feature list not preserved in this copy]
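As a rough illustration of the prediction task (not Hastie et al.'s actual model or feature set), a three-class classifier over per-dialogue features might look like this:

```python
# Illustrative sketch of predicting Task Completion (0, 1, 2) from
# per-dialogue features. Features, values, and model are all invented;
# Hastie et al. used a different learner and feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([
    # [num_turns, mean_asr_confidence, num_reprompts]
    [12, -1.0, 0],
    [34, -3.2, 5],
    [15, -1.4, 1],
    [40, -3.9, 7],
])
y = np.array([2, 0, 2, 1])  # task completion label per dialogue

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[25, -2.5, 3]]))  # predict completion for a new dialogue
```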
Results
[Results figure not preserved in this copy.]
Next Class
• Speech data mining
• HW3c due