Spoken Dialogue Systems Julia Hirschberg LSA07 353 7/15/2016

Today
• Learning from human-human dialogues
• Issues in SDS
The Waxholm Project at KTH
• tourist information
• Stockholm archipelago
• time-tables, hotels, hostels, camping and dining possibilities
• mixed initiative dialogue
• speech recognition
• multimodal synthesis
• graphic information
• pictures, maps, charts and time-tables
• Demos at
http://www.speech.kth.se/multimodal
The Waxholm system
[Figure: animated example dialogue with the Waxholm system (boats to Waxholm, hotels, restaurants, with tables and a map shown on screen); the overlaid utterance text is not recoverable.]
Today
• Some Swedish examples
• Controlling the dialogue flow
– State prediction
• Controlling lexical choice
• Learning from human-human dialogue
– User feedback
• Evaluating systems
Dialogue control state prediction
• Dialog grammar specified by a number of states
• Each state associated with an action
– database search, system question, …
• Probable state determined from semantic features
• Transition probabilities from one state to another
• Dialog control design tool with a graphic interface
Waxholm Topics
TIME_TABLE Task: get a time-table.
Example: När går båten? (When does the boat leave?)
SHOW_MAP Task: get a chart or a map displayed.
Example: Var ligger Vaxholm? (Where is Vaxholm located?)
EXIST Task: display lodging and dining possibilities.
Example: Var finns det vandrarhem? (Where are there hostels?)
OUT_OF_DOMAIN Task: the subject is out of the domain.
Example: Kan jag boka rum? (Can I book a room?)
NO_UNDERSTANDING Task: no understanding of user intentions.
Example: Jag heter Olle. (My name is Olle.)
END_SCENARIO Task: end a dialog.
Example: Tack. (Thank you.)
Topic selection
FEATURES (rows) and TOPIC EXAMPLES (columns)

Feature       TIME_TABLE  SHOW_MAP  FACILITY  NO_UNDERSTANDING  OUT_OF_DOMAIN  END
OBJECT        .062        .312      .073      .091              .067           .091
QUEST-WHEN    .188        .031      .024      .091              .067           .091
QUEST-WHERE   .062        .688      .390      .091              .067           .091
FROM-PLACE    .250        .031      .024      .091              .067           .091
AT-PLACE      .062        .219      .293      .091              .067           .091

Further features: TIME, PLACE, OOD, END, HOTEL, HOSTEL, ISLAND, PORT, MOVE
[Values for these features, nine per line in extraction order; the cell alignment is not fully recoverable:]
.312 .091 .062 .062 .062 .062 .333 .125 .875
.031 .200 .031 .031 .031 .031 .556 .750 .031
.024 .500 .122 .024 .488 .122 .062 .244 .098
.091 .091 .091 .091 .091 .091 .091 .091 .091
.067 .067 .933 .067 .067 .067 .067 .067 .067
.091 .091 .091 .909 .091 .091 .091 .091 .091

Selected topic: argmax_i { p(t_i | F) }
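The argmax rule on the topic-selection slide can be sketched as follows. This is a minimal illustration, not the Waxholm implementation. It assumes the slide's numbers are per-topic probabilities of each feature being present, that features are independent given the topic, and that all topics are equally likely a priori. The slide states none of these. Only the first five features and three topics are used, and the function names are hypothetical.

```python
from math import prod

# Assumed reading of the slide: P[topic][feature] = p(feature present | topic).
P = {
    "TIME_TABLE": {"OBJECT": .062, "QUEST-WHEN": .188, "QUEST-WHERE": .062,
                   "FROM-PLACE": .250, "AT-PLACE": .062},
    "SHOW_MAP":   {"OBJECT": .312, "QUEST-WHEN": .031, "QUEST-WHERE": .688,
                   "FROM-PLACE": .031, "AT-PLACE": .219},
    "FACILITY":   {"OBJECT": .073, "QUEST-WHEN": .024, "QUEST-WHERE": .390,
                   "FROM-PLACE": .024, "AT-PLACE": .293},
}

def topic_posteriors(present):
    """Return p(topic | F) for the set of features found in the utterance.

    Assumes conditional independence of features and a uniform topic prior.
    """
    likelihood = {
        topic: prod(p if feat in present else 1.0 - p for feat, p in probs.items())
        for topic, probs in P.items()
    }
    total = sum(likelihood.values())
    return {topic: value / total for topic, value in likelihood.items()}

def select_topic(present):
    """Pick the topic t_i that maximizes p(t_i | F)."""
    posteriors = topic_posteriors(present)
    return max(posteriors, key=posteriors.get)

# An utterance like "När går båten?" might yield a QUEST-WHEN feature.
print(select_topic({"QUEST-WHEN", "FROM-PLACE"}))
```

Under these assumptions a "when" question with a departure place is assigned to TIME_TABLE, while a lone QUEST-WHERE feature favours SHOW_MAP.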
Topic prediction results
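[Bar chart: % errors (axis 0 to 15) for three conditions (raw data, no extra linguistic sounds, complete parse), each with two series: all utterances and “no understanding” excluded. Bar values as extracted: 12.9, 8.8; 12.7, 8.5; 3.1, 2.9.]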
User answers to questions?
The answers to the question:
“What weekday do you want to go?”
(Vilken veckodag vill du åka?)
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go a Friday (jag vill åka en fredag)
• - are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Examples of questions
and answers
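Hur ofta åker du utomlands på semestern? (How often do you go abroad on holiday?)
jag åker en gång om året kanske (I go once a year, maybe)
jag åker ganska sällan utomlands på semester (I quite rarely go abroad on holiday)
jag åker nästan alltid utomlands under min semester (I almost always go abroad during my holiday)
jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on holiday about twice a year)
jag åker utomlands nästan varje år (I go abroad almost every year)
jag åker utomlands på semestern varje år (I go abroad on holiday every year)
jag åker utomlands ungefär en gång om året (I go abroad about once a year)
jag är nästan aldrig utomlands (I am almost never abroad)
en eller två gånger om året (once or twice a year)
en gång per semester (once per holiday)
kanske en gång per år (maybe once a year)
ungefär en gång per år (about once a year)
åtminståne en gång om året (at least once a year)
nästan aldrig (almost never)

Hur ofta reser du utomlands på semestern? (How often do you travel abroad on holiday?)
jag reser en gång om året utomlands (I travel abroad once a year)
jag reser inte ofta utomlands på semester det blir mera i arbetet (I don't often travel abroad on holiday, it is more for work)
jag reser reser utomlands på semestern vartannat år (I travel travel abroad on holiday every other year)
jag reser utomlands en gång per semester (I travel abroad once per holiday)
jag reser utomlands på semester ungefär en gång per år (I travel abroad on holiday about once a year)
jag brukar resa utomlands på semestern åtminståne en gång i året (I usually travel abroad on holiday at least once a year)
en gång per år kanske (once a year, maybe)
en gång vart annat år (once every other year)
varje år (every year)
vart tredje år ungefär (about every third year)
nu för tiden inte så ofta (not so often nowadays)
varje år brukar jag åka utomlands (every year I usually go abroad)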
Results
• reuse 52%
• other 24%
• ellipsis 18%
• no reuse 4%
• no answer 2%
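A crude way to approximate these categories automatically is to check whether an answer repeats the verb used in the system's question. The sketch below is only an illustration. The verb-form lists, the label names and the whitespace tokenization are assumptions, and it does not reproduce the annotation scheme behind the reported percentages (it cannot separate ellipsis from other answers).

```python
# Hypothetical verb forms for the two question variants (åker vs. reser).
VERB_FORMS = {
    "åker": {"åker", "åka"},
    "reser": {"reser", "resa"},
}

def classify_answer(question_verb, answer):
    """Label one answer by whether it reuses the question's verb.

    Returns "no answer", "reuse", "no reuse" (the competing verb was used)
    or "ellipsis or other" (neither verb appears).
    """
    tokens = set(answer.lower().split())
    if not tokens:
        return "no answer"
    if tokens & VERB_FORMS[question_verb]:
        return "reuse"
    competing = set().union(
        *(forms for verb, forms in VERB_FORMS.items() if verb != question_verb)
    )
    if tokens & competing:
        return "no reuse"
    return "ellipsis or other"

# Answers taken from the examples above.
print(classify_answer("åker", "jag åker en gång om året kanske"))
print(classify_answer("reser", "varje år brukar jag åka utomlands"))
print(classify_answer("åker", "nästan aldrig"))
```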
The August system
[Figure: animated example exchange with the August system (questions about Strindberg, Stockholm and KTH); the overlaid utterance text is not recoverable.]
Evidence from Human Performance
• Users provide explicit positive and negative
feedback
• Corpus-based vs. laboratory experiments –
do these tell us different things?
Adapt – demonstration
of ”complete” system
Feedback and ‘Grounding’: Bell & Gustafson ’00
• Positive and negative
– Previous corpora: August system
• 18% of users gave positive or negative feedback in a subcorpus
• Push-to-talk
• Corpus: Adapt system
– 50 dialogues, 33 subjects, 1845 utterances
– Feedback utterances labeled with
• Positive or negative
• Explicit or implicit
• Attention/Attitude
• Results:
– 18% of utterances contained feedback
– 94% of users provided feedback
– 65% positive, 2/3 explicit, equal amounts of attention vs. attitude
– Large variation
• Some subjects provided feedback at almost every turn
• Some never did
• Utility of study:
– Use positive feedback to model the user
better (preferences)
– Use negative feedback in error detection
The HIGGINS domain
[Figure: This is a 3D test environment]
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error handling behaviour
[Diagram: User speaks → ASR → Operator reads; Operator speaks → Vocoder → User listens]
Non-Understanding Error Recovery
(Skantze ’03)
• Humans tend not to signal non-understanding:
– O: Do you see a wooden house in front of you?
– U: ASR: YES CROSSING ADDRESS NOW (I pass the wooden house now)
– O: Can you see a restaurant sign?
• This leads to
– Increased experience of task success
– Faster recovery from non-understanding
Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected
both by what gets accomplished by the user and
the dialogue agent and how it gets
accomplished
• Maximize Task Success
• Minimize Costs
– Efficiency Measures
– Qualitative Measures
Task Success
• Task goals seen as Attribute-Value Matrix (AVM)
ELVIS e-mail retrieval task (Walker et al ’97)
“Find the time and place of your meeting with Kim.”
Attribute             Value
Selection Criterion   Kim or Meeting
Time                  10:30 a.m.
Place                 2D516
• Task success defined by match between AVM values at end of dialogue and “true” values for AVM
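The AVM comparison can be illustrated with a small sketch. It scores task success as the fraction of attributes whose final value matches the reference. This is a simplification chosen for illustration; a chance-corrected agreement statistic is not shown. The function name and the "observed" dialogue values are hypothetical.

```python
def task_success(reference_avm, dialogue_avm):
    """Fraction of reference attributes matched by the dialogue's final AVM.

    Attributes missing from the dialogue AVM count as mismatches.
    """
    matches = sum(
        1 for attribute, value in reference_avm.items()
        if dialogue_avm.get(attribute) == value
    )
    return matches / len(reference_avm)

# "True" values from the ELVIS example on the slide.
reference = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}

# Hypothetical outcome of one dialogue: the user got the room wrong.
observed = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D518",
}

print(task_success(reference, observed))
```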
Metrics
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged; ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User
Satisfaction as a function of Task Success and
Costs; test for significant predictive factors
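The regression step can be sketched with ordinary least squares. This is a minimal illustration in plain Python. The data are invented, the predictors are assumed to be already normalized so that no intercept is needed, and the significance testing mentioned above is not shown. The function name is hypothetical.

```python
def fit_linear(X, y):
    """Least-squares weights w for y ≈ X·w, via the normal equations.

    Solves (XᵀX) w = Xᵀy with Gauss-Jordan elimination and partial pivoting.
    Assumes the columns of X are linearly independent.
    """
    n = len(X[0])
    # Augmented matrix [XᵀX | Xᵀy].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Invented data: columns are task success, recognition score, elapsed time.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
# Invented User Satisfaction scores, generated from weights 0.4, 0.5, -0.2.
y = [0.4, 0.5, -0.2, 0.9, 0.2]

weights = fit_linear(X, y)
print(weights)
```

On real data the fitted weights indicate how strongly each factor contributes to User Satisfaction; a statistics package would normally be used for the fit and the significance tests.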
User Satisfaction:
Sum of Many Measures
• Was Annie easy to understand
in this conversation? (TTS
Performance)
• In this conversation, did Annie
understand what you said?
(ASR Performance)
• In this conversation, was it
easy to find the message you
wanted? (Task Ease)
• Was the pace of interaction
with Annie appropriate in this
conversation? (Interaction
Pace)
• In this conversation, did you
know what you could say at
each point of the dialog?
(User Expertise)
• How often was Annie sluggish
and slow to reply to you in this
conversation? (System
Response)
• Did Annie work the way you
expected her to in this
conversation? (Expected
Behavior)
• From your current experience
with using Annie to get your
email, do you think you'd use
Annie regularly to access your
mail when you are away from
your desk? (Future Use)
Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
– COMP: User perception of task completion (task success)
– MRS: Mean recognition score (cost)
– ET: Elapsed time (cost)
– Help: Help requests (cost)
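Applying one of these performance functions to a new dialogue can be sketched as below, using the ELVIS and TOOT weights from the slide. The sketch assumes each metric is z-score normalized over a set of dialogues before weighting, which the slide does not state. The metric values and function names are hypothetical.

```python
from statistics import mean, pstdev

# Weights copied from the ELVIS and TOOT performance functions.
WEIGHTS = {
    "ELVIS": {"COMP": 0.21, "MRS": 0.47, "ET": -0.15},
    "TOOT": {"COMP": 0.35, "MRS": 0.45, "ET": -0.14},
}

def z_scores(values):
    """Normalize raw metric values to zero mean and unit variance (assumed step)."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]

def predicted_satisfaction(system, metrics):
    """Weighted sum of normalized metrics for one dialogue."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS[system].items())

# Hypothetical elapsed times (seconds) for three dialogues, then normalized.
elapsed = z_scores([120.0, 180.0, 240.0])

# Hypothetical normalized metrics for the first dialogue.
dialogue = {"COMP": 1.0, "MRS": 0.5, "ET": elapsed[0]}
print(predicted_satisfaction("ELVIS", dialogue))
```

Because elapsed time carries a negative weight, the shortest dialogue (negative z-score) gets a small boost in predicted satisfaction.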
Performance Model
• Perceived task completion and mean recognition
score are consistently significant predictors of
User Satisfaction
• Performance model useful for system
development
– Making predictions about system
modifications
– Distinguishing ‘good’ dialogues from ‘bad’
dialogues
• But can we also tell on-line when a dialogue is
‘going wrong’?
Next Class
• Turn-taking (J&M, Link to conversational
analysis description, Beattie on Margaret
Thatcher)