Conversational Workshop

Spoken Dialog with Humans and Machines
Context and Prosody in the Interpretation of
Cue Phrases in Dialogue
Julia Hirschberg
Columbia University and KTH
11/22/07
In collaboration with

Agustín Gravano, Stefan Benus, Héctor Chávez,
Shira Mitchell, and Lauren Wilcox

With thanks to Gregory Ward and Elisa Sneed
German
2
Managing Conversation




How do speakers indicate conversational structure
in human/human dialogue?
How do they communicate varying levels of
attention, agreement, acknowledgment?
What role does lexical choice play in these
communicative acts? Phonetic realization?
Prosodic variation? Prior context?
Can human/human behavior be modeled in
Spoken Dialogue Systems?
3
Cue Phrases/Discourse Markers/Cue Words/
Discourse Particles/Clue Words


Linguistic expressions that can be employed
 to convey information about the discourse
structure, or
 to make a semantic (literal?) contribution.
Examples:
 now, well, so, alright, and, okay, first, on the other
hand, by the way, for example, …
4
Some Examples

that’s pretty much okay

Speaker 1: between the yellow mermaid and
the whale
Speaker 2: okay
Speaker 1: and it is

okay we gonna be placing the blue moon
5
A Problem for Spoken Dialogue systems

How do speakers produce and hearers interpret
such potentially ambiguous terms?
 How important is acoustic/prosodic information?
 Phonetic variation?
 Discourse context?
6
Research Goals



Learn which features best characterize the
different functions of single affirmative cue words.
Determine how these can be identified
automatically.
Important in Spoken Dialogue Systems:
 Understand user input.
 Produce output appropriately.
7
Overview





Previous research
The Columbia Games Corpus
 Collection paradigm
 Annotations
Perception Study of Okays
 Experimental design
 Analysis and results
Machine Learning Experiments on Okay
Future work: Entrainment and Cue Phrases
8
Previous Work




General studies
 Schiffrin ’82, ’87; Reichman ’85; Grosz & Sidner ’86
Cues to cue phrase disambiguation
 Hirschberg & Litman ’87, ’93; Hockey ’93; Litman
’94
Cues to Dialogue Act identification
 Jurafsky et al ’98; Rosset & Lamel ’04
Contextual cues to the production of backchannels
 Ward & Tsukahara ’00; Sanjanhar & Ward ’06
9
The Columbia Games Corpus
Collection



12 spontaneous task-oriented dyadic conversations
in Standard American English (9h 8m speech)
2 subjects playing a series of computer games, no
eye contact (45m 39s mean session time)
 2 sessions per subject, w/different partners
Several types of games, designed to vary the way
discourse entities became old, or ‘given’ in the
discourse to study variation in intonational
realization of information status
10
Cards Game #1
Player 1 (Describer) 
Player 2 (Searcher) 
• Short monologues
• Vary frequency and order of
occurrence of objects on the cards.
11
Cards Game #2
Player 1 (Describer) 
Player 2 (Searcher) 
• Dialogue
• Vary frequency and order of
occurrence of objects on the cards
across speakers.
12
Objects Game

Follower must place the target object where it
appears on the Describer’s screen solely via the
description provided (4h 19m)
Describer:
Follower:
13
The Columbia Games Corpus
Recording and Logging


Recorded on separate channels in soundproof
booth, digitized and downsampled to 16k
All user and system behaviors logged
14
The Columbia Games Corpus
Annotation







Orthographic transcription and alignment (~73k
words).
Laughs, coughs, breaths, smacks, throat-clearings.
Self-repairs.
Intonation, using ToBI conventions.
Function (10 categories) of affirmative cue words
(alright, mm-hm, okay, right, uh-huh, yeah, yes,
…).
Question form and function.
Turn-taking behaviors.
15
Perception Study
Selection of Materials
Acknowledgment / Agreement
Speaker 1: yeah um there's like there's/ some space there's
Speaker 2: okay I think I got it
Backchannel


okay
Cue beginning discourse segment
Speaker 1: but it's gonna be below the onion
Speaker 2: okay
Speaker 1: okay alright I'll try it okay
Speaker 2: okay the owl is blinking
18
Perception Study
Experiment Design




54 instances of ‘okay’ (18 for each function).
2 tokens for each ‘okay’:
Isolated condition: Only the word ‘okay’.
Contextualized condition: 2 full speaker turns:
 The turn containing the target ‘okay’; and
 The previous turn by the other speaker.
speakers
okay
contextualized ‘okay’
19
Perception Study
Experiment Design



1/3 each: 3 labelers agreed, 2…, none
Two conditions:
 Part 1: 54 isolated tokens
 Part 2: 54 contextualized tokens
Subjects asked to classify each token of ‘okay’ as:
 Acknowledgment / Agreement, or
 Backchannel, or
 Cue beginning discourse segment.
20
Perception Study
Definitions Given to the Subjects



Acknowledge/Agreement:
 The function of okay that indicates “I believe what
you said” and/or “I agree with what you say”.
Backchannel:
 The function of okay in response to another
speaker's utterance that indicates only “I’m still
here” or “I hear you and please continue”.
Cue beginning discourse segment
 The function of okay that marks a new segment of
a discourse or a new topic. This use of okay could
be replaced by now.
21
Perception Study
Subjects and Procedure


Subjects:
 20 paid subjects (10 female, 10 male).
 Ages between 20 and 60.
 Native speakers of English.
 No hearing problems.
GUI on a laboratory workstation with
headphones.
22
Results: Inter-Subject Agreement

Kappa measure of agreement with respect to
chance (Fleiss ’71)
                           Isolated Condition   Contextualized Condition
Overall                    .120                 .294
Ack / Agree vs. Other      .089                 .227
Backchannel vs. Other      .118                 .164
Cue beginning vs. Other    .157                 .497
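The agreement figures above are Fleiss' kappa values. As a minimal sketch of how such a value can be computed (not the authors' script), the function below takes a table where `counts[i][j]` is the number of subjects who assigned token i to category j. The names `fleiss_kappa` and `one_vs_rest` are illustrative, and the collapsing step for the "vs. Other" rows is an assumption about how those rows were derived.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table counts[item][category] of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement: mean over items of the proportion of agreeing rater pairs.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance from the category shares.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def one_vs_rest(counts: Sequence[Sequence[int]], category: int) -> List[List[int]]:
    """Collapse the table to two columns: the chosen category vs. everything else."""
    return [[row[category], sum(row) - row[category]] for row in counts]


# Toy data (hypothetical): 3 raters, 3 categories, 4 tokens.
toy = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 1, 2]]
overall = fleiss_kappa(toy)
first_vs_other = fleiss_kappa(one_vs_rest(toy, 0))
```

Values near 0 indicate agreement at chance level and 1 indicates perfect agreement, which is why the isolated-condition scores (.089 to .157) read as weak agreement.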
23
Results: Cues to Interpretation

Phonetic transcription of okay:

Isolated Condition
Strong correlation for realization of initial vowel
 Backchannel
 Ack/Agree, Cue Beginning

Contextualized Condition
No strong correlations found for phonetic variants.
24
Results: Cues to Interpretation
                Isolated Condition           Contextualized Condition
Ack / Agree     Shorter /k/                  Shorter latency between turns
                                             Shorter pause before okay
Backchannel     Higher final pitch slope     Higher final pitch slope
                Longer 2nd syllable          More words by S2 before okay
                Lower intensity              Fewer words by S1 after okay
Cue beginning   Lower final pitch slope      Lower final pitch slope
                Lower overall pitch slope    Longer latency between turns
                                             More words by S1 after okay
S1 = Utterer of the target ‘okay’. S2 = The other speaker.
25
Results: Cues to Interpretation
Phrase-final intonation (ToBI)
(Both isolated and contextualized conditions.)
H-H%  Backchannel
H-L%
L-H%  Ack/Agree, Backchannel
L-L%  Ack/Agree, Cue beginning
26
Perception Study: Conclusions


Agreement:
 Availability of context improves inter-subject
agreement.
 Cue beginnings easier to disambiguate than the
other two functions.
Cues to interpretation:
 Contextual features override word features
 Exception: Final pitch slope of okay in both
conditions.
27
Machine Learning Experiments: Okay


Can we identify the different functions of okay in
our larger corpus reliably?
What features perform best?
 How do these compare to those that predict human
judgments?
28
Method


ML Algorithm
 JRip: Weka’s implementation of the propositional
rule learner Ripper (Cohen ’95).
 We also tried J4.8, Weka’s implementation of the
decision tree learner C4.5 (Quinlan ’93, ’96), with
similar results.
10-fold cross validation in all experiments.
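The experiments themselves were run in Weka. Purely as an illustration of what 10-fold cross validation does, here is a minimal sketch in plain Python: every item is held out for testing exactly once while the other nine folds are used for training. The function name `k_fold_splits` and the fixed seed are assumptions, and this sketch does not stratify by class.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(
    n_items: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        yield train, test


# Hypothetical usage: 25 labeled tokens, 10 folds.
for train, test in k_fold_splits(25, k=10):
    pass  # train a classifier on `train`, then score it on `test`
```

The reported error rate would then be the error over all held-out predictions pooled across the ten folds.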
29
Units of Analysis

IPU (Inter-pausal unit)
 Maximal sequence of words delimited by pause >
50ms.

Conversational Turn
 Maximal sequence of IPUs by the same speaker,
with no contribution from the other speaker.
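The two definitions above can be sketched as follows, assuming time-aligned words with start and end times in seconds. The `Word` record and both function names are hypothetical. The turn grouping simply orders IPUs by start time and ignores overlapping speech, which is a simplification.

```python
from dataclasses import dataclass
from typing import List

PAUSE_THRESHOLD_S = 0.050  # a pause longer than 50 ms ends an IPU


@dataclass
class Word:
    speaker: str
    text: str
    start: float  # seconds
    end: float    # seconds


def split_into_ipus(
    words: List[Word], threshold: float = PAUSE_THRESHOLD_S
) -> List[List[Word]]:
    """Group one speaker's time-ordered words into inter-pausal units."""
    ipus: List[List[Word]] = []
    for word in words:
        if ipus and word.start - ipus[-1][-1].end <= threshold:
            ipus[-1].append(word)  # gap too short to count as a pause
        else:
            ipus.append([word])    # first word, or a pause above the threshold
    return ipus


def group_into_turns(ipus: List[List[Word]]) -> List[List[List[Word]]]:
    """Merge consecutive IPUs by the same speaker into turns."""
    turns: List[List[List[Word]]] = []
    for ipu in sorted(ipus, key=lambda unit: unit[0].start):
        if turns and turns[-1][-1][0].speaker == ipu[0].speaker:
            turns[-1].append(ipu)
        else:
            turns.append([ipu])
    return turns


# Hypothetical alignment for two speakers.
a_words = [
    Word("A", "okay", 0.00, 0.30),
    Word("A", "the", 0.32, 0.40),
    Word("A", "owl", 0.90, 1.20),
]
b_words = [Word("B", "mm-hm", 1.50, 1.80)]
all_ipus = split_into_ipus(a_words) + split_into_ipus(b_words)
turns = group_into_turns(all_ipus)
```

Here speaker A produces two IPUs (the 0.5 s gap before "owl" exceeds the threshold) that together form a single turn, followed by a one-IPU turn from speaker B.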
30
Experimental features

Text-based features (from transcriptions)

Word ident, POS tags (auto); position of word in IPU / turn
 IPU, turn length in words; prev turn same spkr?
Timing features (from time alignment)
 Word / IPU / turn duration; amount of spkr overlap
 Time to word beg/end in IPU, turn

Acoustic features




{min, mean, max, stdev} x {pitch, intensity}
Slope of pitch, stylized pitch, and intensity, over the whole
word, and over its last 100, 200, 300ms.
Acoustic features from last IPU of prior speaker’s turn.
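As a rough illustration of the acoustic features listed above, the sketch below computes the four summary statistics and a least-squares slope over a whole track and over its final window. It assumes a pitch (or intensity) track already extracted as parallel lists of frame times in seconds and values. The function names, the toy contour, and the choice of a least-squares fit are all assumptions, not the authors' extraction pipeline.

```python
import statistics
from typing import Dict, Sequence


def summary_stats(values: Sequence[float]) -> Dict[str, float]:
    """min, mean, max and sample standard deviation of a track."""
    return {
        "min": min(values),
        "mean": statistics.fmean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }


def slope(times: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of values over time (units of value per second)."""
    t_mean = statistics.fmean(times)
    v_mean = statistics.fmean(values)
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    denominator = sum((t - t_mean) ** 2 for t in times)
    return numerator / denominator


def final_slope(
    times: Sequence[float], values: Sequence[float], window_s: float
) -> float:
    """Slope over the last window_s seconds of the track."""
    cutoff = times[-1] - window_s
    kept = [(t, v) for t, v in zip(times, values) if t >= cutoff]
    return slope([t for t, _ in kept], [v for _, v in kept])


# Hypothetical rise-fall pitch contour in Hz, one frame every 100 ms.
frame_times = [0.0, 0.1, 0.2, 0.3, 0.4]
pitch_hz = [100.0, 110.0, 120.0, 110.0, 100.0]

stats = summary_stats(pitch_hz)
whole_word = slope(frame_times, pitch_hz)
last_250ms = final_slope(frame_times, pitch_hz, 0.25)
```

On this toy contour the whole-word slope is flat while the final-window slope is falling, which shows why slopes over the last 100, 200, and 300 ms can carry information that the whole-word slope misses.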
31
Results: Classification of individual words

Classification of each individual word into its most common functions.
 alright: Ack/Agree, Cue Begin, Other
 mm-hm: Ack/Agree, Backchannel
 okay: Ack/Agree, Backchannel, Cue Begin, Ack+CueBegin, Ack+CueEnd, Other
 right: Ack/Agree, Check, Literal Modifier
 yeah: Ack/Agree, Backchannel
32
Results: Classification of ‘okay’
Feature Set            Error   F-Measure
                       Rate    Ack/Agree   Backchannel   Cue Begin   Ack/Agree +   Ack/Agree +
                                                                     Cue Begin     Cue End
Majority label count           1137        121           548         68            232
Text-based             31.7    .76         .16           .77         .09           .33
Acoustic               40.2    .69         .24           .64         .03           .25
Text-based + Timing    25.6    .79         .31           .82         .18           .67
Full set               25.5    .80         .46           .83         .21           .66
Baseline (1)           48.3    .00 for all but Ack/Agree (.68)
                               .68         .00           .00         .00           .00
Human labelers (2)     14.0    .89         .78           .94         .56           .73
(1) Majority class baseline: ACK/AGREE.
(2) Calculated wrt each labeler’s agreement with the majority labels.
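For readers who want to reproduce the two kinds of score in this table, here is a minimal sketch of the error rate, the per-class F-measure, and a majority-class baseline. The function names and the toy labels are hypothetical, and the F-measure is assumed to be the usual harmonic mean of precision and recall.

```python
from collections import Counter
from typing import List, Sequence


def error_rate(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted label differs from the gold label."""
    wrong = sum(1 for g, p in zip(gold, predicted) if g != p)
    return wrong / len(gold)


def f_measure(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """Harmonic mean of precision and recall for one class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    if tp == 0:
        return 0.0  # class never predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_baseline(gold: Sequence[str]) -> List[str]:
    """Predict the most frequent gold label for every token."""
    label, _ = Counter(gold).most_common(1)[0]
    return [label] * len(gold)


# Toy labels (hypothetical): 6 Ack/Agree, 3 Cue Begin, 1 Backchannel.
gold_labels = ["ack"] * 6 + ["cue"] * 3 + ["bc"]
baseline = majority_baseline(gold_labels)
baseline_error = error_rate(gold_labels, baseline)
baseline_f_ack = f_measure(gold_labels, baseline, "ack")
baseline_f_cue = f_measure(gold_labels, baseline, "cue")
```

A majority-class predictor has recall 1 on that class and precision equal to the class's share of the data. If the error rate column is in percent, the baseline's 48.3 implies a share of about .517, and 2(.517)/(1 + .517) is about .68, which agrees with the Baseline row and explains its .00 scores elsewhere.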
34
Conclusions: ML Experiments

Context and timing features
 Like perception in context results: timing



Pause after okay, not before
# of succeeding words
Acoustic features impoverished
 No phonetic features
 No pitch slope
 But ToBI labels (where available) didn’t help
35
Future Work


Experiments with full ToBI labeling
 Other features
Lexical, Acoustic-Prosodic, and Discourse
Entrainment and Dis-Entrainment
 Positive correlations for affirmative cue words


Affirmative cue word entrainment and game scores
Affirmative cue word entrainment and overlaps and
interruptions in turn-taking
36
Tack! (Thank you!)
Other Work

Benus et al, 2007
 “The prosody of backchannels in American
English”, ICPhS 2007, Saarbrücken, Germany,
August 2007.

Gravano et al, 2007
 “Classification of discourse functions of
affirmative words in spoken dialogue”, Interspeech
2007, Antwerp, Belgium, August 2007.
38
Importance for Spoken Dialogue Systems


Convey ambiguous terms with the intended
meaning
Interpret the user’s input correctly
39
Experiment Design



Goal: Study the relation between the downstepped contour and
 Information status
 Syntactic position
 Discourse position
Spontaneous speech
Both monologue and dialogue
40
Experiment Design




Three computer games.
Two players, each on a different computer.
They collaborate to perform a common task.
Totally unrestricted speech.
41
Objects Game
Player 1 (Describer) 
Player 2 (Searcher) 
• Dialogue
• Vary target and surrounding objects
(subject and object position).
42
Games Session





Repeat 3 times:
 Cards Game #1
 Cards Game #2
Short break (optional)
Repeat 3 times:
 Objects Game
Each subject participated in 2 sessions.
12 sessions
43
Subjects

Postings:
 Columbia’s webpage for temporary job ads.
 Craigslist



http://www.craigslist.org
Category: Gigs  Event gigs
Problem:
 People are unreliable
 ~50% did not show up, or cancelled with short notice.
44
Subjects

Possible solutions:

Give precise instructions to e-mail ALL required info:




Name, native speaker?, hearing impairments?, etc.
Ask for a phone number.
Call them and explain why it is so important for us that they
show up (or cancel with adequate notice).
Increase the pay after each session.

Example: $5, $10, $15 instead of $10, $10, $10.
45
Recording

Sound-proof booth
 2 subjects + 1 or 2 confederates.
 Head-mounted mics.
 Digital Audio Tape (DAT): one channel per speaker.

Wav files
 One mono file per speaker.
 Sample rate: 48000
 Downsampled to 16000 (but kept original files!)
 ~20 hours of speech  2.8 GB (16k)
46
Logs


Log everything the subjects do to a text file.
Example:
17:03:55:234   BEGIN_EXECUTION
17:04:04:868   NEXT_TURN
17:04:31:837   RESULTS 97 points awarded.
17:04:38:426   NEXT_TURN
17:05:03:873   RESULTS 92 points awarded.
...
Later, this may be used (e.g.) to divide each session into
smaller tasks or conversations.
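One way such a log could be used to cut a session into tasks is sketched below. It assumes each log line holds an `HH:MM:SS:mmm` timestamp followed by an event name and optional detail, and that a task runs from a `NEXT_TURN` event to the next `RESULTS` event. Both the line layout and that task definition are assumptions made for illustration, as are the function names.

```python
import re
from datetime import timedelta
from typing import List, Optional, Sequence, Tuple

# Assumed layout: "HH:MM:SS:mmm EVENT optional detail"
LOG_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):(\d{3})\s+(\S+)\s*(.*)$")


def parse_line(line: str) -> Optional[Tuple[timedelta, str, str]]:
    """Return (time since midnight, event name, detail), or None if unparseable."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(match.group(i)) for i in range(1, 5))
    stamp = timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis
    )
    return stamp, match.group(5), match.group(6)


def split_into_tasks(lines: Sequence[str]) -> List[Tuple[timedelta, timedelta]]:
    """Pair each NEXT_TURN with the following RESULTS as one (start, end) task."""
    tasks: List[Tuple[timedelta, timedelta]] = []
    start: Optional[timedelta] = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank lines and elisions such as "..."
        stamp, event, _detail = parsed
        if event == "NEXT_TURN":
            start = stamp
        elif event == "RESULTS" and start is not None:
            tasks.append((start, stamp))
            start = None
    return tasks


sample_log = [
    "17:03:55:234 BEGIN_EXECUTION",
    "17:04:04:868 NEXT_TURN",
    "17:04:31:837 RESULTS 97 points awarded.",
    "17:04:38:426 NEXT_TURN",
    "17:05:03:873 RESULTS 92 points awarded.",
    "...",
]
tasks = split_into_tasks(sample_log)
durations = [(end - begin).total_seconds() for begin, end in tasks]
```

The resulting time spans could then be mapped onto the audio, since the recordings and the logs share a session timeline once a common start point is known.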
47