On Speaker-Specific Prosodic Models
for Automatic Dialog Act Segmentation
of Multi-Party Meetings
Jáchym Kolář 1,2   Elizabeth Shriberg 1,3   Yang Liu 1,4
1 International Computer Science Institute, Berkeley, USA
2 University of West Bohemia in Pilsen, Czech Republic
3 SRI International, USA
4 University of Texas at Dallas, USA
Why automatic DA
segmentation?
• Standard STT systems output a raw stream of
words leaving out structural information such
as sentence and Dialog Act (DA) boundaries
• Problems for human readability
• Problems when applying downstream natural
language processing techniques requiring
formatted input
Goal and Task Definition
• Goal: Dialog Act (DA) segmentation of meetings
• Task definition:
• 2-way classification in which each inter-word boundary is labeled as a within-DA boundary or a boundary between DAs
• e.g. “no jobs are still running ok”
3 DAs: “No.” + “Jobs are still running.” + “OK.”
• Evaluation metric – "Boundary error rate":
$E = \frac{\#\,\text{incorrectly classified word boundaries}}{\#\,\text{words}}$
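As a concrete illustration of this metric (not part of the original slides), here is a minimal Python sketch; the function name and the labeling convention (one label per word for the boundary that follows it, 1 = DA boundary, 0 = within-DA, including the final boundary) are assumptions of this sketch.

# Minimal sketch, assuming one label per word for the boundary that
# follows it: 1 = DA boundary, 0 = within-DA boundary.
def boundary_error_rate(reference, hypothesis):
    """Fraction of word boundaries whose label is classified incorrectly."""
    assert len(reference) == len(hypothesis)
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

# "no jobs are still running ok" -> "No." + "Jobs are still running." + "OK."
ref = [1, 0, 0, 0, 1, 1]   # DA boundaries after "no", "running", "ok"
hyp = [0, 0, 0, 0, 1, 1]   # a system that misses the boundary after "no"
print(boundary_error_rate(ref, hyp))   # 1 error / 6 words ~ 0.167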
Approach:
Explore Speaker-Specific Prosody
• Past work has used both lexical and prosodic features, but has collapsed over speakers
• Speakers appear to differ, however, in both feature types, especially in spontaneous speech
• Meeting applications: the speaker is often known or at least recorded on one channel, and often participates in ongoing meetings → a good opportunity for modeling
• Speaker adaptation has been used successfully in the cepstral domain for ASR
• This study takes a first look specifically at prosodic features for the DA boundary task
Three Questions
1) Do individual speakers benefit from modeling
more than simply pause information?
2) Do individual speakers differ enough from
the overall speaker model to benefit from a
prosodic model trained on only their speech?
3) How do speakers differ in terms of prosody
usage in marking DA boundaries?
Data and Experimental Setup
• ICSI meeting corpus – multichannel conversational
speech annotated for DAs
• Baseline speaker-independent model trained on 567k
words
• For speaker-specific experiments –
20 most frequent speakers in terms of total words
(7.5k – 165k words)
• 17 males, 3 females
• 12 natives, 8 nonnatives
Data and Experimental Setup II.
• Each speaker’s data: ~70% training, ~30%
testing
• Jackknife instead of separate development
set
→ using 1st half of test data to tune weights
for the 2nd half and vice versa
• Tested on forced alignments rather than on
ASR hypotheses
Prosodic Features and Classifiers
• Features: 32 for each interword boundary
• Pause – (after current, previous, and following word)
• Duration – (phone-normalized durations of vowels, final rhymes, and words; no raw durations)
• Pitch – (F0 min, max, mean, slopes, and differences and ratios across word boundaries; raw values + PWL-stylized contour)
• Energy – (max, min, and mean frame-level RMS values, both raw and normalized)
• Classifiers: CART-style decision trees with
ensemble bagging
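A minimal sketch of this classifier setup, using scikit-learn's bagging over decision trees as a stand-in for the CART-style trees with ensemble bagging named above; the feature matrix, labels, and hyperparameters below are illustrative placeholders, not the actual configuration used in the experiments.

# Illustrative sketch: bagged CART-style trees over 32-dimensional prosodic
# feature vectors (one vector per inter-word boundary); values are placeholders.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))        # placeholder prosodic feature vectors
y = rng.integers(0, 2, size=1000)      # 0 = within-DA, 1 = DA boundary

clf = BaggingClassifier(DecisionTreeClassifier(min_samples_leaf=50),
                        n_estimators=25, random_state=0)
clf.fit(X, y)
posteriors = clf.predict_proba(X)[:, 1]  # P(DA boundary | prosodic features)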
Pause-only vs. Richer Set of
Prosodic Features
• Compare the speaker-independent (SI) model using pause features only (SI-Pau) with the SI model using all 32 prosodic features (SI-All)
• SI-All significantly better for 19 of 20
speakers
• Relative error rate reduction by prosody not
correlated with the amount of training data
Pause-only vs. Rich Prosody:
Relative Error Reduction
[Bar chart: relative error reduction [%] of SI-All over SI-Pau for speakers 1–20, with nonnative speakers indicated; y-axis from -2% to 14%]
Speaker-Independent (SI) vs.
Speaker-Dependent (SD) Models
• We compare SI, SD, and interpolated SI+SD
models
• SI+SD defined as:
$P_{SI+SD}(X) = \lambda \, P_{SI}(X) + (1 - \lambda) \, P_{SD}(X)$
• Significantly improved result would suggest
prosodic marking of boundaries differs from
baseline SI model
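A minimal sketch of the interpolation defined above, assuming p_si and p_sd are arrays of posterior boundary probabilities produced by the SI and SD models for the same word boundaries; the grid search over the weight stands in for the jackknife tuning described earlier (the weight chosen on one half of the test data is applied to the other half, and vice versa). Function names are illustrative.

import numpy as np

def interpolate(p_si, p_sd, lam):
    # P_{SI+SD}(X) = lam * P_SI(X) + (1 - lam) * P_SD(X)
    return lam * p_si + (1.0 - lam) * p_sd

def boundary_error_rate(posteriors, labels, threshold=0.5):
    return float(np.mean((posteriors >= threshold).astype(int) != labels))

def tune_lambda(p_si, p_sd, labels, grid=np.linspace(0.0, 1.0, 21)):
    # Choose the weight that minimizes boundary error rate on held-out data
    # (one jackknife half); it is then applied to the other half.
    errors = [boundary_error_rate(interpolate(p_si, p_sd, lam), labels)
              for lam in grid]
    return float(grid[int(np.argmin(errors))])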
Effects of Adding SD Information
• SD models much smaller than SI model; as expected SI
better than SD alone for most subjects
(though for some SD better!)
• For many subjects, no gain from adding SD information (no speaker-specific information, or not enough data?)
• For 7 of 20 speakers, however, SD or SI+SD is better than SI, with 5 improvements statistically significant
• Improvement by SD not correlated with amount of data,
error rate, chance error, proficiency in English, or gender
• SD often helps in “unusual” prosody situations – hesitation,
lip smack, long pause, emotions
• SD helps more in preventing false alarms than misses
Audio Examples: SD Helps
Example of preventing a FALSE ALARM:
“and another thing that we did also is that |FA| we have all this
training data … ”
SD does not false alarm after 2nd “that” because it ‘knows’ this
nonnative speaker has limited F0 range and often falls in pitch
before hesitations

Example of preventing a MISS:
“this is one |.| and I think that's just fine |.|”
SD finds DA boundary after “one”, despite the short pause,
probably based on the speaker’s prototypical pitch reset
Feature Usage,
Natives vs. Nonnatives
• Feature usage – how many times a feature is queried in the tree, weighted by the number of samples it affects (see the sketch below)
• 5 groups of features:
• Pause at boundary
• Near pause
• Duration
• Pitch
• Energy
• Compare the SD feature usage of improved speakers
with the SI distribution
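A rough sketch of how such a usage statistic could be computed for a single fitted scikit-learn decision tree (an approximation of the statistic described above, not the original implementation): each splitting node's feature is credited with the number of samples reaching that node, and the totals are normalized. Summing usage over the features in each of the five groups would then give the per-group percentages used in the comparison.

import numpy as np

def feature_usage(fitted_tree, n_features):
    # Credit each splitting node's feature with the number of samples that
    # reach that node, then normalize so the usage values sum to 1.
    tree = fitted_tree.tree_
    usage = np.zeros(n_features)
    for node in range(tree.node_count):
        feat = tree.feature[node]
        if feat >= 0:                  # internal node (leaves have feat < 0)
            usage[feat] += tree.n_node_samples[node]
    return usage / usage.sum()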
Feature Usage: Natives vs. Nonnatives
[Two bar charts of feature usage [%] over the groups PAUSE, DURATION, PITCH, ENERGY, NEAR PAU (y-axes 0–50%): top panel NATIVES (SI, me011, fe016), bottom panel NONNATIVES (SI, mn015, mn007, mn005, fn002)]
Summary
• Prosodic features beyond pause provide an improvement for 19 of 20 frequent speakers
• For ~30% of the speakers studied, simply interpolating the large SI prosodic model with a small SD model yielded an improvement
• Amount of data, error rate, chance error, proficiency in English, and gender are not correlated with the improvement from SD modeling
• Some interesting observations – nonnative speakers differ from natives in feature usage patterns; SD information helps in "unusual" prosody situations and in preventing false alarms
Conclusions and Future Work
• Results are interesting and suggestive, but as of yet
inconclusive
• SD prosody modeling significantly benefits some
speakers, but predicting who they will be is still an
open question
• Many issues still to address, especially joint modeling with lexical features and a better integration approach
• The approach is interesting to explore for other domains, such as broadcast news, where segmentation is important and some speakers occur repeatedly