Spoken Dialogue Systems CS 4705

Spoken Dialogue Systems
CS 4705
Boston Directions Corpus
• Erratum from last week: speakers did give
directions to a silent confederate, a Harvard
student
Talking to a Machine….and
(often) Getting an Answer
• Today’s spoken dialogue systems make it possible
to accomplish real tasks without talking to a
person
– Could Eliza do this?
– What do today’s systems do better?
– Do they actually embody human intelligence?
• Key advances
– Stick to goal-directed interactions in a limited domain
– Prime users to adopt the vocabulary you can recognize
– Partition the interaction into manageable stages
– Judicious use of system vs. mixed initiative
Today
• Utterances
• Turn-taking
• Initiative Strategies
• Grounding
• Confirmation Strategies
• The Waxholm Project: Word and Topic Prediction
• Evaluating Spoken Dialogue Systems
• Predicting System Errors and User Corrections
• More Examples
Dialogue vs. Monologue
• Monologue and dialogue both involve interpreting
– Information status
– Coherence issues
– Reference resolution
– Speech acts, implicature, intentionality
• Dialogue involves managing
– Turn-taking
– Grounding and repairing misunderstandings
– Initiative and confirmation strategies
Segmenting Speech into Utterances
• What is an ‘utterance’?
– Why is EOU detection harder than EOS?
– How does speech differ from text?
– A single syntactic sentence may span several turns
A: We've got you on USAir flight 99
B: Yep
A: leaving on December 1.
– Multiple syntactic sentences may occur in a single turn
A: We've got you on USAir flight 99 leaving on December 1. Do
you need a rental car?
– Intonational definitions: intonational phrase, breath
group, intonation unit
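A naive baseline makes the difficulty concrete: segmenting on silence alone conflates pauses with utterance boundaries. The sketch below is illustrative only; the word/timestamp input format and the 0.5 s threshold are assumptions, not from the slides.

```python
# Illustrative pause-based end-of-utterance (EOU) baseline.
# Assumes ASR output as (word, start_sec, end_sec) tuples; this format
# and the threshold are assumptions for the sketch.

def segment_by_pause(words, pause_threshold=0.5):
    """Group words into pseudo-utterances at long silences."""
    utterances, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold:
            utterances.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances

asr_output = [("we've", 0.0, 0.2), ("got", 0.2, 0.4), ("you", 0.4, 0.5),
              ("on", 0.5, 0.6), ("usair", 0.6, 1.0), ("flight", 1.0, 1.3),
              ("ninety", 1.3, 1.6), ("nine", 1.6, 1.8),
              ("leaving", 2.9, 3.2), ("on", 3.2, 3.3), ("december", 3.3, 3.8)]
print(segment_by_pause(asr_output))
# A long pause splits one syntactic sentence into two segments, and a short
# pause at a real boundary can merge two sentences: pauses alone are not EOUs.
```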
Turns and Utterances
• Dialogue is characterized by turn-taking: who
should talk next, and when they should talk
• How do we identify turns in recorded speech?
– Little speaker overlap (around 5% in English, although this
depends on the domain)
– But little silence between turns either
• How do we know when a speaker is giving up or
taking a turn? Holding the floor? How do we
know when a speaker is interruptable?
Simplified Turn-Taking Rule (Sacks et al)
• At each transition-relevance place (TRP) of each
turn:
– If current speaker has selected A as next speaker, then A
must speak next
– If current speaker does not select next speaker, any
other speaker may take next turn
– If no one else takes next turn, the current speaker may
take next turn
• TRPs are where the structure of the language allows speaker
shifts to occur (see the code sketch at the end of this slide)
• Adjacency pairs set up next speaker expectations
– GREETING/GREETING
– QUESTION/ANSWER
– COMPLIMENT/DOWNPLAYER
– REQUEST/GRANT
• ‘Significant silence’ is dispreferred
A: Is there something bothering you or not? (1.0s)
A: Yes or no? (1.5s)
A: Eh?
B: No.
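The simplified rule above can be read as a small decision procedure applied at each TRP. Below is a minimal sketch; the function name and data representation are assumptions made for illustration, not from Sacks et al.

```python
# A minimal, illustrative encoding of the simplified turn-taking rule
# applied at a transition-relevance place (TRP).

def next_speaker_at_trp(current, selected_next=None, willing_takers=()):
    """Return who speaks next at a TRP.

    current        -- the current speaker
    selected_next  -- speaker selected by the current turn (e.g. the
                      addressee of a question), or None
    willing_takers -- other speakers who self-select at this TRP
    """
    if selected_next is not None:           # rule 1: selected party must speak
        return selected_next
    if willing_takers:                       # rule 2: any other may self-select
        return willing_takers[0]
    return current                           # rule 3: current speaker may continue

# Example: a question addressed to B selects B as next speaker.
print(next_speaker_at_trp(current="A", selected_next="B"))      # -> B
print(next_speaker_at_trp(current="A", willing_takers=("C",)))  # -> C
print(next_speaker_at_trp(current="A"))                         # -> A
```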
Intonational Cues to Turntaking
• Continuation rise (L-H%) holds the floor
• H-H% requests a response
– L*H-H% (ynq contour)
– H* H-H% (highrise question contour)
• Intonational contours signal dialogue acts in
adjacency pairs
Timing and Turntaking
• How should we time responses in a SDS?
– Japanese studies of aizuchi (backchannels) (Koiso et al
‘98, Takeuchi et al ‘02) in natural speech
– Lexical information: particles ne and ka ending the
preceding turn, or (in telephone shopping) product
names
– Length of the preceding utterance, f0, loudness, and the
following pause are even more important in predicting turn-taking
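As a rough illustration of how such cues could be combined in an SDS, the heuristic below scores a candidate turn boundary from a final particle, pause length, and loudness. The features, thresholds, and weights are assumptions for this sketch, not the models reported by Koiso et al. or Takeuchi et al.

```python
# Illustrative heuristic for timing responses/backchannels; all thresholds
# and weights are invented for this sketch.

def respond_decision(final_particle, pause_sec, end_loudness_db, mean_loudness_db):
    """Decide whether to take the turn, backchannel, or keep listening."""
    score = 0.0
    if final_particle in ("ne", "ka"):          # turn-final particles invite response
        score += 2.0
    if pause_sec > 0.4:                          # a longer pause suggests a TRP
        score += 1.5
    if end_loudness_db < mean_loudness_db - 3:   # trailing off in loudness
        score += 1.0
    if score >= 3.0:
        return "take turn"
    if score >= 2.0:
        return "backchannel"
    return "keep listening"

print(respond_decision("ka", pause_sec=0.6, end_loudness_db=55, mean_loudness_db=60))
# -> "take turn"
```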
Turntaking and Initiative Strategies
• System Initiative
S: Please give me your arrival city name.
U: Baltimore.
S: Please give me your departure city name….
• User Initiative
S: How may I help you?
U: I want to go from Boston to Baltimore on November 8.
• `Mixed’ initiative
S: How may I help you?
U: I want to go to Boston.
S: What day do you want to go to Boston?
Grounding (Clark & Shaefer ‘89)
• Conversational participants don’t just take turns
speaking….they try to establish common ground
(or mutual belief)
• The hearer H must ground the speaker S's utterances by making it
clear whether or not understanding has occurred
• How do hearers do this?
S: I can upgrade you to an SUV at that rate.
– Continued attention
(U gazes appreciatively at S)
– Relevant next contribution
U: Do you have a RAV4 available?
– Acknowledgement/backchannel
U: Ok/Mhmmm/Great!
– Demonstration/paraphrase
U: An SUV.
– Display/repetition
U: You can upgrade me to an SUV at the same rate?
– Request for repair
U: I beg your pardon?
Detecting Grounding Behavior
• Evidence of system misconceptions reflected in user
responses (Krahmer et al ‘99, ‘00)
– Responses to incorrect verifications
• contain more words (or are empty)
• show marked word order (especially after implicit verifications)
• contain more disconfirmations, more repeated/corrected info
– ‘No’ after incorrect verifications vs. other ynq’s
• has higher boundary tone
• wider pitch range
• longer duration
• longer pauses before and after
• more additional words after it
• User information state reflected in response
(Shimojima et al ’99, ‘01)
– Echoic responses repeat prior information – as
acknowledgment or request for confirmation
S1: Then go to Keage station.
S2: Keage.
– Experiment:
• Identify ‘degree of integration’ and prosodic features
(boundary tone, pitch range, tempo, initial pause)
• Perception studies to elicit ‘integration’ effect
– Results: fast tempo, little pause and low pitch signal
high integration
Grounding and Confirmation Strategies
U: I want to go to Baltimore.
• Explicit
S: Did you say you want to go to Baltimore?
• Implicit
S: Baltimore. (H* L- L%)
S: Baltimore? (L* H- H%)
S: What time do you want to leave Baltimore?
• No confirmation
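One common way to operationalize the choice among these strategies, which the slide itself does not prescribe, is to pick a strategy from the recognizer's confidence score. The thresholds and prompt wording below are assumptions for this sketch.

```python
# Illustrative confidence-based choice among confirmation strategies.

def confirmation_prompt(slot_value, asr_confidence):
    """Pick explicit, implicit, or no confirmation for a recognized slot value."""
    if asr_confidence < 0.5:       # low confidence: confirm explicitly
        return f"Did you say you want to go to {slot_value}?"
    if asr_confidence < 0.8:       # medium: confirm implicitly in the next question
        return f"What time do you want to leave {slot_value}?"
    return None                    # high confidence: no confirmation, just proceed

print(confirmation_prompt("Baltimore", 0.42))
print(confirmation_prompt("Baltimore", 0.71))
print(confirmation_prompt("Baltimore", 0.93))
```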
The Waxholm Project at KTH
• tourist information
• Stockholm archipelago
• time-tables, hotels, hostels, camping
and dining possibilities.
• mixed initiative dialogue
• speech recognition
• multimodal synthesis
• graphic information
• pictures, maps, charts and time-tables
• Demos at
http://www.speech.kth.se/multimodal
Dialogue control state prediction
• Dialogue grammar specified by a number of states
• Each state associated with an action (database search, system question, …)
• Probable state determined from semantic features
• Transition probabilities from one state to another
• Dialogue control design tool with a graphical interface
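A minimal sketch of this idea: combine the transition probability from the previous state with the probability of each state given the observed semantic features, and pick the most probable next state. The state names, feature names, and probabilities below are invented for illustration, not Waxholm's.

```python
# Illustrative next-state prediction for a dialogue control grammar.
# All states, features, and probabilities are made up for the sketch.

TRANSITION = {                       # P(next_state | previous_state)
    "ask_date":  {"db_search": 0.6, "ask_date": 0.2, "clarify": 0.2},
    "db_search": {"show_result": 0.8, "clarify": 0.2},
}

STATE_GIVEN_FEATURE = {              # P(state | semantic feature)
    "QUEST-WHEN": {"db_search": 0.7, "ask_date": 0.2, "clarify": 0.1},
    "PLACE":      {"db_search": 0.6, "show_result": 0.3, "clarify": 0.1},
}

def predict_state(prev_state, features):
    """Score candidate states by transition probability times feature evidence."""
    scores = {}
    for state, p_trans in TRANSITION.get(prev_state, {}).items():
        p = p_trans
        for f in features:
            p *= STATE_GIVEN_FEATURE.get(f, {}).get(state, 0.05)
        scores[state] = p
    return max(scores, key=scores.get) if scores else "clarify"

print(predict_state("ask_date", ["QUEST-WHEN", "PLACE"]))   # -> db_search
```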
Waxholm Topics
TIME_TABLE Task: get a time-table.
Example: När går båten? (When does the boat leave?)
SHOW_MAP Task: get a chart or a map displayed.
Example: Var ligger Vaxholm? (Where is Vaxholm located?)
EXIST Task: display lodging and dining possibilities.
Example: Var finns det vandrarhem? (Where are there hostels?)
OUT_OF_DOMAIN Task: the subject is out of the domain.
Example: Kan jag boka rum? (Can I book a room?)
NO_UNDERSTANDING Task: no understanding of user intentions.
Example: Jag heter Olle. (My name is Olle.)
END_SCENARIO Task : end a dialog.
Example: Tack. (Thank you.)
Topic selection
The most probable topic is chosen from the semantic features F found in the utterance:
  topic = argmax_i P(t_i | F)
[Table: estimated P(topic | feature) for the semantic features OBJECT, QUEST-WHEN,
QUEST-WHERE, FROM-PLACE, AT-PLACE, TIME, PLACE, OOD, END, HOTEL, HOSTEL, ISLAND,
PORT, and MOVE against the topics TIME_TABLE, SHOW_MAP, FACILITY (EXIST),
NO_UNDERSTANDING, and OUT_OF_DOMAIN, with example values]
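A sketch of the selection rule follows, assuming (as the table suggests) per-feature topic probabilities that are combined naively into P(t_i | F). The probability values below are placeholders, not the Waxholm table entries.

```python
# Illustrative topic selection: t_hat = argmax_i P(t_i | F), with P(t_i | F)
# approximated by multiplying per-feature estimates P(t_i | f).

P_TOPIC_GIVEN_FEATURE = {
    "QUEST-WHEN": {"TIME_TABLE": 0.7, "SHOW_MAP": 0.1, "EXIST": 0.1,
                   "OUT_OF_DOMAIN": 0.05, "NO_UNDERSTANDING": 0.05},
    "FROM-PLACE": {"TIME_TABLE": 0.6, "SHOW_MAP": 0.2, "EXIST": 0.1,
                   "OUT_OF_DOMAIN": 0.05, "NO_UNDERSTANDING": 0.05},
}
TOPICS = ["TIME_TABLE", "SHOW_MAP", "EXIST", "OUT_OF_DOMAIN", "NO_UNDERSTANDING"]

def select_topic(features):
    """Return the topic maximizing the (naively combined) probability."""
    best_topic, best_p = None, -1.0
    for topic in TOPICS:
        p = 1.0
        for f in features:
            p *= P_TOPIC_GIVEN_FEATURE.get(f, {}).get(topic, 1.0 / len(TOPICS))
        if p > best_p:
            best_topic, best_p = topic, p
    return best_topic

# "När går båten?" might yield features like QUEST-WHEN and FROM-PLACE:
print(select_topic(["QUEST-WHEN", "FROM-PLACE"]))   # -> TIME_TABLE
```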
Topic prediction results
[Bar chart: % topic prediction errors for all raw data (12.7–12.9%), utterances
without extra-linguistic sounds (8.5–8.8%), and utterances with a complete parse
(2.9–3.1%), shown with and without "no understanding" utterances excluded]
User answers to questions?
The answers to the question:
“What weekday do you want to go?”
(Vilken veckodag vill du åka?)
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go a Friday (jag vill åka en fredag)
• – are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Examples of questions
and answers
Hur ofta åker du utomlands på semestern? (How often do you go abroad on vacation?)
Hur ofta reser du utomlands på semestern? (How often do you travel abroad on vacation?)
jag åker en gång om året kanske (I go maybe once a year)
jag åker ganska sällan utomlands på semester (I go abroad on vacation quite rarely)
jag åker nästan alltid utomlands under min semester (I almost always go abroad during my vacation)
jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on vacation about 2 times a year)
jag åker utomlands nästan varje år (I go abroad almost every year)
jag åker utomlands på semestern varje år (I go abroad on vacation every year)
jag åker utomlands ungefär en gång om året (I go abroad about once a year)
jag är nästan aldrig utomlands (I am almost never abroad)
en eller två gånger om året (once or twice a year)
en gång per semester (once per vacation)
kanske en gång per år (maybe once a year)
ungefär en gång per år (about once a year)
åtminståne en gång om året (at least once a year)
nästan aldrig (almost never)
jag reser en gång om året utomlands (I travel abroad once a year)
jag reser inte ofta utomlands på semester det blir mera i arbetet (I don't often travel abroad on vacation, it's more for work)
jag reser reser utomlands på semestern vartannat år (I travel abroad on vacation every other year)
jag reser utomlands en gång per semester (I travel abroad once per vacation)
jag reser utomlands på semester ungefär en gång per år (I travel abroad on vacation about once a year)
jag brukar resa utomlands på semestern åtminståne en gång i året (I usually travel abroad on vacation at least once a year)
en gång per år kanske (once a year maybe)
en gång vart annat år (once every other year)
varje år (every year)
vart tredje år ungefär (about every third year)
nu för tiden inte så ofta (not so often these days)
varje år brukar jag åka utomlands (every year I usually go abroad)
Results
[Pie chart of answer types: reuse 52%, other 24%, ellipsis 18%, no reuse and no answer (4% and 2%)]
The Waxholm system
[Screenshot of a Waxholm dialogue: the user asks about boats from Stockholm to Vaxholm and about hotels and restaurants; the system answers and displays time-tables, maps, and hotel information]
How do we evaluate Dialogue Systems?
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected
both by what gets accomplished by the user and
the dialogue agent and how it gets accomplished
– Maximize task success
– Minimize costs: efficiency measures and qualitative measures
What metrics should we use?
• Efficiency of the Interaction:User Turns,
System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections,
Time Out Prompts, Help Requests, Barge-Ins,
Mean Recognition Score (concept accuracy),
Cancellation Requests
• User Satisfaction
• Task Success: perceived completion,
information extracted
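For concreteness, below is a small sketch of computing some of these metrics from a logged dialogue. The log record format is an assumption made for this sketch, not part of the PARADISE definition.

```python
# Illustrative extraction of efficiency/quality metrics from a dialogue log.

def interaction_metrics(log):
    """log: list of dicts like {"speaker": "user"/"system", "event": ..., "t": seconds}."""
    return {
        "user_turns":     sum(1 for e in log if e["speaker"] == "user"),
        "system_turns":   sum(1 for e in log if e["speaker"] == "system"),
        "elapsed_time":   log[-1]["t"] - log[0]["t"] if log else 0.0,
        "asr_rejections": sum(1 for e in log if e.get("event") == "asr_reject"),
        "timeouts":       sum(1 for e in log if e.get("event") == "timeout"),
        "help_requests":  sum(1 for e in log if e.get("event") == "help"),
        "barge_ins":      sum(1 for e in log if e.get("event") == "barge_in"),
    }

log = [{"speaker": "system", "event": "prompt", "t": 0.0},
       {"speaker": "user", "event": "asr_reject", "t": 4.2},
       {"speaker": "system", "event": "prompt", "t": 6.0},
       {"speaker": "user", "event": "utterance", "t": 9.5}]
print(interaction_metrics(log))
```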
User Satisfaction:
Sum of Many Measures
• Was Annie easy to understand
in this conversation? (TTS
Performance)
• In this conversation, did Annie
understand what you said?
(ASR Performance)
• In this conversation, was it
easy to find the message you
wanted? (Task Ease)
• Was the pace of interaction with
Annie appropriate in this
conversation? (Interaction Pace)
• In this conversation, did you
know what you could say at
each point of the dialog?
(User Expertise)
• How often was Annie sluggish
and slow to reply to you in this
conversation? (System
Response)
• Did Annie work the way you
expected her to in this
conversation? (Expected
Behavior)
• From your current experience
with using Annie to get your
email, do you think you'd use
Annie regularly to access your
mail when you are away from
your desk? (Future Use)
Performance Model
• Weights trained for each independent factor via
multiple regression modeling: how much does
each contribute to User Satisfaction?
• Result useful for system development
– Making predictions about system modifications
– Distinguishing ‘good’ dialogues from ‘bad’ dialogues
• But … can we also tell on-line when a dialogue is
‘going wrong’?
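A minimal sketch of the regression step: regress user satisfaction on normalized task success and cost factors with ordinary least squares to obtain the weights. The per-dialogue data below are invented; z-score normalization of each factor is included, as in PARADISE.

```python
# Illustrative PARADISE-style weight estimation via multiple linear regression.
import numpy as np

# Per-dialogue factors: [task_success(kappa), user_turns, asr_rejections]
X = np.array([[0.9, 8, 0], [0.7, 12, 1], [0.4, 20, 4],
              [0.95, 6, 0], [0.5, 15, 3]], float)
y = np.array([4.5, 3.8, 2.1, 4.8, 2.6])          # user satisfaction ratings

Xz = (X - X.mean(axis=0)) / X.std(axis=0)         # normalize each factor
A = np.column_stack([np.ones(len(y)), Xz])        # add intercept
weights, *_ = np.linalg.lstsq(A, y, rcond=None)   # multiple linear regression

print("intercept:", weights[0])
print("task success weight:", weights[1])
print("cost weights (turns, rejections):", weights[2:])
# Performance for a new dialogue = weighted sum of its normalized factors.
```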
Identifying Misrecognitions, Awares and User
Corrections Automatically (Hirschberg,
Litman & Swerts)
• Collect corpus from interactive voice response
system
• Identify speaker ‘turns’
– that are incorrectly recognized
– where speakers first become aware of an error
– that correct misrecognitions
• Identify prosodic features of turns in each
category and compare to other turns
• Use Machine Learning techniques to train a
classifier to make these distinctions automatically
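As an illustration of that last step only (not the authors' actual feature set or learner), the sketch below trains a simple classifier on a few prosodic features per turn; a decision tree stands in for the rule learner used in this line of work, and the data are invented.

```python
# Illustrative classifier for flagging misrecognized turns from prosodic features.
from sklearn.tree import DecisionTreeClassifier

# Per-turn features: [f0_max, rms_max, duration_sec, preceding_pause_sec, tempo]
X = [[220, 0.70, 2.1, 0.9, 3.1],
     [180, 0.45, 1.2, 0.2, 4.0],
     [250, 0.80, 2.8, 1.1, 2.7],
     [175, 0.40, 1.0, 0.1, 4.2],
     [240, 0.75, 2.5, 0.8, 2.9],
     [185, 0.50, 1.3, 0.3, 3.9]]
y = [1, 0, 1, 0, 1, 0]   # 1 = turn was misrecognized, 0 = recognized correctly

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict([[230, 0.72, 2.3, 0.7, 3.0]]))   # -> likely misrecognized
```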
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT.
How may I help you?
User: Hello. I would like trains from Philadelphia to New York
leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site; correction]
Results
• Error in predicting misrecognized turns reduced to 8.64%
• Error in predicting ‘awares’: 12%
• Error in predicting corrections: 18-21%
Features in repetition corrections (KTH)
[Bar chart: percentage of all repetitions for adults vs. children, by feature:
more clearly articulated, increased loudness, shifting of focus]
The August system
[Screenshot of an August dialogue: the animated agent answers questions about
August Strindberg, the population of Stockholm, and KTH’s Department of Speech,
Music and Hearing]
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error handling
behaviour
– Setup: the user speaks → ASR → the operator reads the recognition
output; the operator speaks → vocoder → the user listens
Non-Understanding Error Recovery (Skantze
’03)
• Humans tend not to signal non-understanding:
– O: Do you see a wooden house in front of you?
– U: I pass the wooden house now (ASR: YES CROSSING ADDRESS NOW)
– O: Can you see a restaurant sign?
• This leads to
– Increased experience of task success
– Faster recovery from non-understanding
Conclusions
• Spoken dialogue systems present new problems, but also new possibilities
– Recognizing speech introduces a new source of errors
– The speech stream provides additional information about users’
intended meanings and emotional state (grounding of information,
speech acts, reaction to system errors)
• Why spoken dialogue systems rather than web-based interfaces?