Dialogue Systems Julia Hirschberg CS 4705 7/15/2016

Today
• Examples from English and Swedish
• Controlling the dialogue flow
– State prediction
• Influencing user behavior
– Entrainment
• Learning from human-human dialogue
– User feedback
• Role of ‘personality’ in SDS
• Evaluating SDS
Issues in Designing SDS
• Coverage: functionality and vocabulary
• Dialogue control:
– System
– User
– Mixed initiative
• Confirmation strategies:
– Explicit
– Implicit
– None
The Waxholm Project at KTH
• Tourist information
• Stockholm archipelago
• time-tables, hotels, hostels, camping
and dining possibilities.
• Mixed initiative dialogue
• speech recognition
• multimodal synthesis
• Graphic information
• pictures, maps, charts and time-tables
• Demos at
http://www.speech.kth.se/multimodal
The Waxholm system
[Animated example dialogue: the user asks about boats, hotels, and restaurants in Waxholm; the system asks where the user wants to go from and on which day, and shows tables and a map. The full transcript is not recoverable from this copy.]
Dialogue Control: Predicting Dialogue State
• Dialog grammar specified by a number of states
– Each state associated with an action
– Database search, system question… …
– Probable state determined from semantic
features
• Train transition probabilities from state to state
• Dialog control design tool with a graphic
interface
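The transition-probability idea above can be sketched as follows. This is a minimal illustration, not the Waxholm implementation: it counts state-to-state transitions in logged dialogues and normalizes them per source state. The state names and the toy dialogues are hypothetical.

```python
from collections import Counter, defaultdict


def train_transitions(dialogues):
    """Estimate P(next_state | state) by relative frequency."""
    counts = defaultdict(Counter)
    for states in dialogues:
        # Count every adjacent pair of states in the sequence.
        for prev, nxt in zip(states, states[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for prev, c in counts.items()
    }


# Hypothetical logged state sequences (illustrative only).
dialogues = [
    ["ASK_DESTINATION", "ASK_DAY", "DB_SEARCH", "END"],
    ["ASK_DESTINATION", "DB_SEARCH", "END"],
    ["ASK_DESTINATION", "ASK_DAY", "DB_SEARCH", "END"],
]

probs = train_transitions(dialogues)
# Most probable successor of a given state.
most_likely = max(probs["ASK_DESTINATION"], key=probs["ASK_DESTINATION"].get)
print(most_likely, probs["ASK_DESTINATION"])
```

A real system would presumably also smooth the counts so that unseen transitions do not get probability zero.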
Waxholm Topics
TIME_TABLE Task: get a time-table.
Example: När går båten? (When does the boat leave?)
SHOW_MAP Task: get a chart or a map displayed.
Example: Var ligger Vaxholm? (Where is Vaxholm located?)
EXIST Task: display lodging and dining possibilities.
Example: Var finns det vandrarhem? (Where are there hostels?)
OUT_OF_DOMAIN Task: the subject is out of the domain.
Example: Kan jag boka rum? (Can I book a room?)
NO_UNDERSTANDING Task: no understanding of user intentions.
Example: Jag heter Olle. (My name is Olle.)
END_SCENARIO Task: end a dialog.
Example: Tack. (Thank you.)
Topic selection
Features (F): OBJECT, QUEST-WHEN, QUEST-WHERE, FROM-PLACE, AT-PLACE, TIME, PLACE, OOD, END, HOTEL, HOSTEL, ISLAND, PORT, MOVE
Topics (t_i): TIME_TABLE, SHOW_MAP, FACILITY, NO_UNDERSTANDING, OUT_OF_DOMAIN, END
[Table: one probability for each feature and topic. The cell values are scrambled in this copy and cannot be reliably reassigned to cells.]
Selected topic: $\arg\max_i \{\, p(t_i \mid F) \,\}$
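One way to read the topic selection rule argmax_i p(t_i | F) is sketched below. It assumes a naive independence model over binary semantic features with uniform topic priors, which the slide does not confirm. The probabilities are hypothetical and are not the values from the slide's table.

```python
import math

# Hypothetical P(feature present | topic); NOT the values from the slide.
FEATURE_GIVEN_TOPIC = {
    "TIME_TABLE": {"QUEST-WHEN": 0.70, "QUEST-WHERE": 0.05, "HOTEL": 0.05},
    "SHOW_MAP": {"QUEST-WHEN": 0.05, "QUEST-WHERE": 0.70, "HOTEL": 0.05},
    "FACILITY": {"QUEST-WHEN": 0.05, "QUEST-WHERE": 0.40, "HOTEL": 0.50},
}


def topic_posteriors(observed, table=FEATURE_GIVEN_TOPIC):
    """Return p(t_i | F), assuming independent features and uniform priors."""
    scores = {}
    for topic, probs in table.items():
        log_p = 0.0
        for feature, p in probs.items():
            # A present feature contributes p, an absent one contributes 1 - p.
            log_p += math.log(p if feature in observed else 1.0 - p)
        scores[topic] = math.exp(log_p)
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def select_topic(observed):
    """Pick the topic with the highest posterior probability."""
    posteriors = topic_posteriors(observed)
    return max(posteriors, key=posteriors.get)


print(select_topic({"QUEST-WHEN"}))
print(select_topic({"QUEST-WHERE", "HOTEL"}))
```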
Topic prediction results
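[Bar chart: % errors in topic prediction under three conditions (raw data, no extra-linguistic sounds, complete parse), shown for all utterances and with “no understanding” excluded. Values in the chart: 12.9, 8.8, 12.7, 8.5, 3.1, 2.9; the assignment of values to bars is not recoverable from this copy.]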
Entrainment (Adaptation, Accommodation,
Alignment)
• Hypothesis: over time, people tend to adapt
their communicative behavior to that of their
conversational partner
• Issues
– What are the dimensions of entrainment?
– How rapidly do people adapt?
– Does entrainment occur (on the human
side) in human/computer conversations?
– Can this be used to the system’s
advantage? The user’s?
Varieties of Entrainment…
• Lexical: speaker (S) and hearer (H) tend over time to adopt the same
method of referring to items in a discourse
A: It’s that thing that looks like a harpsichord.
B: So the harpsichord-looking thing…
....
B: The harpsichord…
• Phonological
– Word pronunciation: voiced/voiceless /t/ in better
• Acoustic/Prosodic
– Speaking rate, pitch range, choice of contour
• Discourse/dialogue/social
– Marking of topic shift, turn-taking
The Vocabulary Problem
• Furnas et al ’87: the probability that 2 subjects will
produce the same name for a command for common
computer operations varied from .07 to .18
– Remove a file: remove, delete, erase, kill, omit,
destroy, lose, change, trash
– With 20 synonyms for a single command, the
likelihood that 2 people will choose the same one was
80%
– With 25 commands, the likelihood that 2 people who
choose the same term think it means the same
command was 15%
• How can people possibly communicate?
– They collaborate on choice of referring expressions
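To make the agreement figures above concrete: if two users choose a term independently from the same distribution, the chance that they pick the same term is the sum of the squared term probabilities. The sketch below uses hypothetical usage frequencies for "remove a file"; they are not data from Furnas et al.

```python
def agreement_probability(term_probs):
    """P(two independent users pick the same term) = sum of squared probabilities."""
    return sum(p * p for p in term_probs.values())


# Hypothetical usage frequencies (illustrative only; they sum to 1).
term_probs = {
    "remove": 0.2,
    "delete": 0.2,
    "erase": 0.2,
    "kill": 0.1,
    "omit": 0.1,
    "destroy": 0.1,
    "trash": 0.1,
}

print(round(agreement_probability(term_probs), 3))
```

Even with a few popular terms, agreement stays low under these assumed frequencies, which is the motivation for collaborating on referring expressions.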
Early Studies of Priming Effects
• Hypothesis: Users will tend to use the
vocabulary and syntax the system uses
– Evidence from data collections in the field
• Systems should take advantage of this proclivity
to prime users to speak in ways that the system
can recognize well
User Responses to Vaxholm
The answers to the question:
“What weekday do you want to go?”
(Vilken veckodag vill du åka?)
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go a Friday (jag vill åka en fredag)
• - are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)
Verb Priming: How often do you go abroad on holiday?
The question was asked with one of two verbs, åker (‘go’) or reser (‘travel’); the responses are listed under each prompt.
Prompt: Hur ofta åker du utomlands på semestern?
jag åker en gång om året kanske
jag åker ganska sällan utomlands på semester
jag åker nästan alltid utomlands under min semester
jag åker ungefär 2 gånger per år utomlands på semester
jag åker utomlands nästan varje år
jag åker utomlands på semestern varje år
jag åker utomlands ungefär en gång om året
jag är nästan aldrig utomlands
en eller två gånger om året
en gång per semester
kanske en gång per år
ungefär en gång per år
åtminståne en gång om året
nästan aldrig
Prompt: Hur ofta reser du utomlands på semestern?
jag reser en gång om året utomlands
jag reser inte ofta utomlands på semester det blir mera i arbetet
jag reser reser utomlands på semestern vartannat år
jag reser utomlands en gång per semester
jag reser utomlands på semester ungefär en gång per år
jag brukar resa utomlands på semestern åtminståne en gång i året
en gång per år kanske
en gång vart annat år
varje år
vart tredje år ungefär
nu för tiden inte så ofta
varje år brukar jag åka utomlands
Results
• reuse: 52%
• other: 24%
• ellipse: 18%
• no reuse: 4%
• no answer: 2%
Lexical Entrainment in Referring
Expressions
• Choice of Referring Expressions: Informativeness vs.
availability (basic level or not) vs. saliency vs. recency
• Gricean prediction
– People use descriptions that minimally but effectively
distinguish among items in the discourse
• Garrod & Anderson ’87 Output/Input Principle
– Conversational partners formulate their current
utterance according to the model used to interpret
their partner’s most recent utterance
• Clark, Brennan, et al’s Conceptual Pacts
– People make Conceptual Pacts wrt appropriate
referring expressions made with particular
conversational partners
– They are loath to abandon these even when shorter
expressions possible
Entrainment in Spontaneous Speech
S13: the orange M&M looking kind of scared and then a
one on the bottom left and a nine on the bottom right
S12: alright I have the exact same thing I just had it's an
M&M looking scared that's orange
S13: yeah the scared M&M guy yeah
S12: framed mirror and the scared M&M on the lower right
S13: and it's to the right of the scared M&M guy
S13: yeah and the iron should be on the same line as the
frightened M&M kind of like an L
S12: to the left of the scared M&M to the right of the onion
and above the iron
Extraterrestrial vs Alien I
s11: okay in the middle of the card I have an
extraterrestrial figure…
s11: okay middle of the card I have the
extraterrestrial
…
s10: I've got the blue lion with the extraterrestrial
on the lower right
s11: okay I have the extraterrestrial now and then
I have the eye at the bottom right corner
s10: my extraterrestrial's gone
Extraterrestrial vs. Alien II
S03: okay I have a blue lion and then the extraterrestrial at
the lower right corner
S11: mm I'll pass I have the alien with an eye in the lower
right corner
S03: um I have just the alien so I guess I'll match that
-----------------------------------------------------------------------------
S10: yes now I've got that extraterrestrial with the yellow
lion and the money
…
S12: oh now I have the blue lion in the center with our little
alien buddy in the right hand corner
S10: with the alien buddy so I'm gonna match him with the
single blue lion okay I've got our alien with the eye in
the corner
Timing and Voice Quality
• Guitar & Marchinkoski ’01:
– How early do we start to adapt to others’ speech?
– Do children adapt their speaking rate to their mother’s
speech?
• Study:
– 6 mothers spoke with their own (normally speaking)
3-yr-olds (3M, 3F)
– Mothers’ rates significantly reduced (B) or not (A) in
A-B-A-B design
• Results:
– 5/6 children reduced their rates when their mothers
spoke more slowly
• Sherblom & La Riviere ’87: How are speech timing and
voice quality affected by a non-familiar conversational
partner?
• Study:
– 65 pairs of undergraduates asked to discuss a
‘problem situation’ together
– Utter a single sentence before and after the
conversation
– Sentences compared for speaking rate, utterance
length and vocal jitter
• Results:
– Substantial influence of partner on all 3 measures
– Interpersonal uncertainty and differences in arousal
influenced degree of adaptation
Amplitude and Response Latency
• Coulston et al ’02:
– Do humans adapt to the behavior of non-human
partners?
– Do children speak more loudly to a loud animated
character?
• Study:
– 24 7-10-yr olds interacted with an extroverted, loud
animated character and with an introverted, soft
character (TTS voices)
– Multiple tasks using different amplitude ranges
– Human/TTS amplitudes and latencies compared
• Results:
– 79-94% of children adapted their amplitude, bidirectionally
– Also adapted their response latencies (mean 18.4%), bidirectionally
Social Status and Entrainment
• Azuma ’97: Do speakers adapt to the style of other
social classes?
• Study: Emperor Hirohito visits the countryside
– Corpus-based study of speech style of Japanese
Emperor Hirohito during chihoo jyunkoo (‘visits to
countryside’), 1946-54
– Published transcripts of speeches
– Findings:
• Emperor Hirohito converged his speech style to that of
listeners lower in social status
– Choice of verb-forms, pronouns no longer those of person with
highest authority
– Perceived as like those of a (low-status) mother
Socio-Cultural Influences and Entrainment
• Co-teachers adapt teaching styles (Roth ’05)
– Social context
• High school in NE with predominantly African-American student body
• Cristobal: Cuban-African-American teacher
• Chris: new Italian-American teacher
– Adaptation of Chris to Cristobal
• Catch phrases (e.g. right!, really really hot) and
their production: pitch and intensity contours
• Pitch ‘matching’ across speakers
– Mimesis vs entrainment
Conclusions for SDS
• Systems can make use of user tendency to
entrain to system vocabulary
• Should systems also entrain to users?
– CMU’s Let’s Go adapts confirmation prompts
to non-native speech: Finds closest match to
user input in system vocabulary
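The closest-match idea above can be illustrated with a small sketch. The slide does not say how Let's Go computes the match, so this is an assumption: it uses character-level similarity from Python's standard `difflib` module. The vocabulary, prompts, and function names are hypothetical.

```python
import difflib

# Hypothetical system vocabulary (illustrative only).
SYSTEM_VOCABULARY = ["downtown", "airport", "forbes avenue", "murray avenue"]


def closest_system_term(user_input, vocabulary=SYSTEM_VOCABULARY, cutoff=0.6):
    """Return the vocabulary entry most similar to the user input, or None."""
    matches = difflib.get_close_matches(
        user_input.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def confirmation_prompt(user_input):
    """Confirm using the system's own term, so the user can entrain to it."""
    term = closest_system_term(user_input)
    if term is None:
        return "Sorry, I did not understand. Where do you want to go?"
    return f"Going to {term}. Is that right?"


print(confirmation_prompt("forbs avenue"))
```

A deployed system would more plausibly match on recognized words or phones than on spelling; the spelling-based match only keeps the sketch self-contained.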
Evidence from Human Performance
• Users provide explicit positive and negative
feedback
• Corpus-based vs. laboratory experiments –
do these tell us different things?
The August system
[Animated example dialogue with the August system, touching on Strindberg, Stockholm, and KTH. The full transcript is not recoverable from this copy.]
Adapt – demonstration
of ”complete” system
Feedback and ‘Grounding’: Bell & Gustafson ’00
• Positive and negative
– Previous corpora: August system
• 18% of users gave positive or negative feedback in subcorpus
• Push-to-talk
• Corpus: Adapt system
– 50 dialogues, 33 subjects, 1845 utterances
– Feedback utterances labeled with
• Positive or negative
• Explicit or implicit
• Attention/Attitude
• Results:
– 18% of utterances contained feedback
– 94% of users provided feedback
– 65% positive, 2/3 explicit, equal amounts of
attention vs. attitude
– Large variation
• Some subjects provided feedback at almost every turn
• Some never did
• Utility of study:
– Use positive feedback to model the user
better (preferences)
– Use negative feedback in error detection
The HIGGINS domain
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the
immediate surroundings.
[Figure: a 3D test environment]
Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error
handling behaviour
[Diagram: the user speaks and the operator reads the ASR output; the operator speaks and the user listens through a vocoder.]
Non-Understanding Error Recovery
(Skantze ’03)
• Humans tend not to signal non-understanding:
O: Do you see a wooden house in front of you?
U: ASR: YES CROSSING ADDRESS NOW
(I pass the wooden house now)
O: Can you see a restaurant sign?
• This leads to
– Increased experience of task success
– Faster recovery from non-understanding
Personality and Computer Systems
• Early PC-era reports that significant others were jealous
of the time their partners spent with their computers.
• Reeves & Nass, The Media Equation: How People Treat
Computers, Television, and New Media Like Real People
and Places, 1996
– Evolution explains the anthropomorphization of the PC
• Humans evolved over millions of years without media
• Proper response to any stimulus was critical to survival
• Human psychology and physiological responses well
developed before media invented
• Ergo, our bodies and minds react to media, immediately and
fundamentally, as if they were real
People See ‘Personality’ Everywhere
• Humans assess personality of another (human or otherwise) quickly,
with minimal clues
• Perceived computer personality strongly affects how we evaluate
the computer and information it provides
• Experiments:
– Created “dominant” and “submissive” computer interfaces and
asked subjects to use them to solve hypothetical problems
– Max (dominant) used assertive language, showed higher
confidence in the information displayed (via a numeric scale),
always presented its own analysis of the problem first
– Linus (submissive) phrased information more tentatively, rated its
own information at lower confidence levels, and allowed human
to discuss problem first
– Each used alternately by people whose personalities previously
identified as being either dominant or submissive
User Reactions
• Users described Max and Linus in human terms:
aggressive, assertive, authoritative vs. shy, timid,
submissive
– Users correctly identified machines more like
themselves
– Users rated machines more like themselves as better
computers even though the content received was exactly the
same.
– Users rated their own performance better when
machine’s personality matched theirs
• People more frank when rating a computer if
questionnaire presented on another machine
• Subjects thought highly of computers that praised them,
even if praise clearly undeserved
Personality in SDS
• Mairesse & Walker ’07 PERSONAGE
(PERSONAlity GEnerator)
– ‘Big 5’ personality trait model: extroversion,
neuroticism, agreeableness,
conscientiousness, openness to experience
– Attempts to generate “extroverted” language
based on traits associated with extroversion in
psychology literature
– Demo: find your personality type
Conclusions for SDS
• Systems can be designed to convey different
personalities
• Can they recognize users’ personalities and
entrain to them?
• Should they?
Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected
both by what gets accomplished by the user and
the dialogue agent and how it gets
accomplished
• Maximize task success
• Minimize costs
– Efficiency measures
– Qualitative measures
Task Success
• Task goals seen as Attribute-Value Matrix (AVM)
ELVIS e-mail retrieval task (Walker et al ’97)
“Find the time and place of your meeting with Kim.”
Attribute            Value
Selection Criterion  Kim or Meeting
Time                 10:30 a.m.
Place                2D516
• Task success defined by match between AVM
values at end of dialogue with “true” values for AVM
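The AVM comparison above can be sketched as a simple attribute-by-attribute match. This is an illustration only: PARADISE itself uses a chance-corrected agreement statistic rather than the raw fraction computed here, and the observed values below are hypothetical.

```python
def avm_match(observed, reference):
    """Fraction of reference attributes whose observed value equals the true value."""
    hits = sum(
        1 for attribute, value in reference.items()
        if observed.get(attribute) == value
    )
    return hits / len(reference)


# "True" values for the task, taken from the example AVM above.
reference = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}

# Hypothetical values the user reported at the end of a dialogue.
observed = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D517",  # wrong room, so this attribute does not match
}

print(avm_match(observed, reference))
```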
Metrics
• Efficiency of the Interaction: User Turns, System
Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out
Prompts, Help Requests, Barge-Ins, Mean
Recognition Score (concept accuracy), Cancellation
Requests
• User Satisfaction
• Task Success: perceived completion, information
extracted
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically
logged; ASR accuracy, barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User
Satisfaction as a function of Task Success and
Costs; test for significant predictive factors
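The regression step in the last bullet can be sketched as ordinary least squares. This is a minimal pure-Python illustration, not the PARADISE tooling: the data are synthetic, generated from assumed weights so that the fit can be checked, and the significance testing mentioned above is not included.

```python
def fit_linear(X, y):
    """Least squares via the normal equations; returns [intercept, w1, ..., wk]."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("singular design matrix")
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Synthetic dialogues: (task completion, mean recognition score, elapsed time).
X = [(1, 0.9, 2.0), (0, 0.5, 5.0), (1, 0.7, 3.0),
     (0, 0.8, 4.0), (1, 0.6, 6.0), (0, 0.4, 1.0)]
# Assumed "true" weights, used only to generate user satisfaction scores.
TRUE_WEIGHTS = [1.0, 0.5, 0.3, -0.2]
y = [TRUE_WEIGHTS[0] + sum(w * v for w, v in zip(TRUE_WEIGHTS[1:], row)) for row in X]

weights = fit_linear(X, y)
print([round(w, 3) for w in weights])
```

With real survey data the fit would not be exact, and each coefficient would need a significance test before being kept in the performance function.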
User Satisfaction:
Sum of Many Measures
• Was Annie easy to understand
in this conversation? (TTS
Performance)
• In this conversation, did Annie
understand what you said?
(ASR Performance)
• In this conversation, was it
easy to find the message you
wanted? (Task Ease)
• Was the pace of interaction
with Annie appropriate in this
conversation? (Interaction
Pace)
• In this conversation, did you
know what you could say at
each point of the dialog?
(User Expertise)
• How often was Annie sluggish
and slow to reply to you in this
conversation? (System
Response)
• Did Annie work the way you
expected her to in this
conversation? (Expected
Behavior)
• From your current experience
with using Annie to get your
email, do you think you'd use
Annie regularly to access your
mail when you are away from
your desk? (Future Use)
Performance Functions from Three Systems
• ELVIS User Sat. = .21 * COMP + .47 * MRS - .15 * ET
• TOOT User Sat. = .35 * COMP + .45 * MRS - .14 * ET
• ANNIE User Sat. = .33 * COMP + .25 * MRS + .33 * Help
– COMP: User perception of task completion
(task success)
– MRS: Mean recognition score (cost)
– ET: Elapsed time (cost)
– Help: Help requests (cost)
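A fitted performance function can be applied to new dialogues to compare them. The sketch below uses the ELVIS and TOOT coefficients listed above. The input measures are hypothetical and are assumed to be already normalized to a common scale, which the slide does not state.

```python
# Coefficients copied from the ELVIS and TOOT performance functions above.
PERFORMANCE_FUNCTIONS = {
    "ELVIS": {"COMP": 0.21, "MRS": 0.47, "ET": -0.15},
    "TOOT": {"COMP": 0.35, "MRS": 0.45, "ET": -0.14},
}


def predicted_satisfaction(system, measures):
    """Weighted sum of the measures using the system's performance function."""
    weights = PERFORMANCE_FUNCTIONS[system]
    return sum(weight * measures[name] for name, weight in weights.items())


# Hypothetical normalized measures for two dialogues (illustrative only).
good_dialogue = {"COMP": 1.0, "MRS": 0.5, "ET": -1.0}
bad_dialogue = {"COMP": -1.0, "MRS": -0.5, "ET": 1.0}

for name in PERFORMANCE_FUNCTIONS:
    print(name,
          round(predicted_satisfaction(name, good_dialogue), 3),
          round(predicted_satisfaction(name, bad_dialogue), 3))
```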
Performance Model
• Perceived task completion and mean recognition
score are consistently significant predictors of
User Satisfaction
• Performance model useful for system
development
– Making predictions about system
modifications
– Distinguishing ‘good’ dialogues from ‘bad’
dialogues
• But can we also tell on-line when a dialogue is
‘going wrong’?
Next
• Generation: Summarization