Emotional Speech
CS 4706
Julia Hirschberg (thanks to Jackson Liscombe and Lauren Wilcox for some slides)
Outline
• Why study emotional speech?
• Why is modeling emotional speech so difficult?
• Production and perception studies
• Voice Quality features: the holy grail
Why study emotional speech?
• Recognition
– Customer-care centers
– Tutoring systems
– Automated agents (Wildfire)
• Generation
– Characteristics of ‘emotional speech’ little
understood, so hard to produce: …a voice
that sounds friendly, sympathetic,
authoritative….
– TTS systems
– Games
Emotion in Spoken Dialogue Systems
• Batliner, Huber, Fischer, Spilker, Nöth (2003)
– Verbmobil (Wizard of Oz scenarios)
• Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
– DARPA Communicator
• Liscombe, Riccardi, Hakkani-Tür (2005)
– “How May I Help You?” call center
• Lee, Narayanan (2004)
– Speechworks call-center
• Liscombe, Hirschberg, Venditti (2005)
– ITSpoke Tutoring System (physics)
Why is emotional speech so hard to model?
• Colloquial definitions of speakers and listeners ≠
technical definitions
• Utterances may convey multiple emotions
simultaneously
• Result:
– Human consensus low
– Hard to get reliable training data
Spontaneous Corpora
• Unconstrained
– [Campbell, 2003] [Roach, 2000]
– [Cowie et al., 2001]
• Call centers
– [Vidrascu & Devillers, 2005] [Ang et al., 2002]
– [Litman and Forbes-Riley, 2004] [Batliner et
al., 2003]
– [Lee & Narayanan, 2005]
• Meetings
– [Wrede and Shriberg, 2003]
Acted Corpora
happy
sad
angry
confident
frustrated
friendly
interested
anxious
bored
encouraging
LDC Emotional Prosody and Transcripts corpus
• Semantically neutral (dates and numbers)
• 8 actors
• 15 emotions
Are Emotions Mutually Exclusive?
• User study to classify tokens from LDC Emotional
Prosody corpus
• 10 emotions only:
– Positive: confident, encouraging, friendly, happy,
interested
– Negative: angry, anxious, bored, frustrated, sad
• Example
Emotion Intercorrelations
[Table: pairwise correlations among the ten emotions (sad, angry, bored, frustrated, anxious, friendly, confident, happy, interested, encouraging); values shown range from -0.47 to 0.77 (p < 0.001)]
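Intercorrelations of this kind could be computed as a Pearson correlation between the ratings that two emotions receive over the same set of tokens. The sketch below is only an illustration: the `pearson` helper, the 0-4 rating scale, and every rating value are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings (0-4 scale) of five tokens; NOT data from the study.
ratings = {
    "happy":    [4, 3, 0, 1, 2],
    "friendly": [4, 2, 0, 1, 3],
    "sad":      [0, 1, 4, 3, 1],
}

for a, b in [("happy", "friendly"), ("happy", "sad")]:
    print(f"{a} vs {b}: r = {pearson(ratings[a], ratings[b]):+.2f}")
```

With these made-up ratings the two positive emotions correlate positively and happy/sad correlate negatively, which is the pattern the slide reports.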
Results
• Emotions are heavily correlated
– Positive with positive
– Negative with negative
• Emotions are non-exclusive
• Can they be clustered empirically?
– Activation
– Valence
Different Valence/Activation
Global Pitch Statistics
Different Valence/Same Activation
Identifying Emotions
• Automatic Acoustic-prosodic
[Davitz, 1964] [Huttar, 1968]
– Global characterization
• pitch
• loudness
• speaking rate
• Intonational Contours
[Mozziconacci & Hermes, 1999]
• Spectral Tilt
[Banse & Scherer, 1996] [Ang et al., 2002]
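A global characterization of pitch, loudness, and speaking rate can be sketched as a few summary statistics per utterance. The example below is a minimal sketch under stated assumptions: the function names, the convention that an F0 value of 0 marks an unvoiced frame, and all input values are hypothetical, and a real system would take F0 and energy from a pitch tracker.

```python
from math import sqrt
from statistics import mean, median, pstdev

def global_pitch_stats(f0_hz):
    """Summarize a pitch contour; zeros are treated as unvoiced frames (assumption)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return {
        "mean": mean(voiced),
        "median": median(voiced),
        "min": min(voiced),
        "max": max(voiced),
        "range": max(voiced) - min(voiced),
        "stdev": pstdev(voiced),
    }

def rms_energy(samples):
    """Root-mean-square amplitude, a simple stand-in for loudness."""
    return sqrt(sum(s * s for s in samples) / len(samples))

def speaking_rate(n_syllables, duration_s):
    """Syllables per second."""
    return n_syllables / duration_s

# Hypothetical utterance: F0 in Hz per frame, raw samples, syllable count.
stats = global_pitch_stats([0, 180, 200, 220, 0, 210, 190])
print(stats)
print(rms_energy([3, -3, 3, -3]), speaking_rate(12, 3.0))
```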
Machine Learning Experiment
• RIPPER 90/10 split
• Binary classification for each emotion
• Results
– 62% average baseline
– 75% average accuracy
– Acoustic-prosodic features for activation
– /H-L%/ for negative; /L-L%/ for positive
– Spectral tilt for valence?
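The experimental setup (one binary classifier per emotion, a 90/10 split, and a baseline to beat) can be sketched as follows. This does not implement RIPPER; it only shows the one-vs-rest relabeling, a random 90/10 split, and a majority-class baseline. The helper names, the random shuffling, and the labels are assumptions made for illustration.

```python
import random
from collections import Counter

def one_vs_rest(labels, target):
    """Turn multi-class labels into binary labels: target emotion vs. everything else."""
    return [lab == target for lab in labels]

def split_90_10(items, seed=0):
    """Shuffle and split into 90% train / 10% test (random split is an assumption)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.9 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def majority_baseline(binary_labels):
    """Accuracy obtained by always predicting the more frequent class."""
    counts = Counter(binary_labels)
    return max(counts.values()) / len(binary_labels)

# Hypothetical token labels; NOT data from the corpus.
labels = ["happy"] * 3 + ["sad"] * 3 + ["angry"] * 4
binary = one_vs_rest(labels, "angry")
train, test = split_90_10(binary)
print(len(train), len(test), majority_baseline(binary))
```

A classifier is only informative for an emotion if its accuracy exceeds this baseline, which is why the results report both numbers.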
Accuracy Distinguishing One Emotion from the Rest
Emotion       Baseline   Accuracy
angry         69.32%     77.27%
confident     75.00%     75.00%
happy         57.39%     80.11%
interested    69.89%     74.43%
encouraging   52.27%     72.73%
sad           61.93%     80.11%
anxious       55.68%     71.59%
bored         66.48%     78.98%
friendly      59.09%     73.86%
frustrated    59.09%     73.86%
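As a quick check, the per-emotion figures on this slide can be averaged and compared. The sketch below copies the baseline and accuracy values from the slide; the variable names are illustrative only. The averages come out to roughly 62.6% and 75.8%, which is consistent with the 62% and 75% quoted earlier if those were truncated rather than rounded.

```python
# (baseline %, accuracy %) per emotion, copied from the slide.
results = {
    "angry":       (69.32, 77.27),
    "confident":   (75.00, 75.00),
    "happy":       (57.39, 80.11),
    "interested":  (69.89, 74.43),
    "encouraging": (52.27, 72.73),
    "sad":         (61.93, 80.11),
    "anxious":     (55.68, 71.59),
    "bored":       (66.48, 78.98),
    "friendly":    (59.09, 73.86),
    "frustrated":  (59.09, 73.86),
}

avg_baseline = sum(base for base, _ in results.values()) / len(results)
avg_accuracy = sum(acc for _, acc in results.values()) / len(results)

# Absolute gain over the baseline, in percentage points, per emotion.
gains = {emotion: acc - base for emotion, (base, acc) in results.items()}
best = max(gains, key=gains.get)

print(f"average baseline: {avg_baseline:.2f}%")
print(f"average accuracy: {avg_accuracy:.2f}%")
print(f"largest gain: {best} (+{gains[best]:.2f} points)")
```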
A Call Center Application
• AT&T’s “How May I Help You?” system
• Customers often angry and frustrated
HMIHY Example
Very Frustrated
Somewhat Frustrated
Pitch, Energy and Rate
[Figure: z-scores (y-axis, -2 to 2) of Median Pitch, Mean Energy, and Speaking Rate for Positive, Frustrated, and Angry utterances]
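The figure reports features as z-scores, which put pitch, energy, and rate on a common scale. A minimal sketch of the transformation is below; normalizing over one speaker's utterances is an assumption, and the `z_scores` helper and the pitch values are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Express each value as its distance from the mean in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: no spread to scale by
    return [(v - mu) / sigma for v in values]

# Hypothetical median pitch (Hz) of three utterances from one speaker.
median_pitch = [180, 200, 220]
z = z_scores(median_pitch)
print([round(value, 2) for value in z])
```

A positive z-score means the utterance is above that speaker's average for the feature, and a negative one means it is below.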
Features
• Automatic Acoustic-prosodic
• Contextual
[Cauldwell, 2000]
• Lexical
[Schröder, 2003] [Brennan, 1995]
• Pragmatic
[Ang et al., 2002] [Lee & Narayanan, 2005]
Results
Feature Set      Accuracy   Rel. Improv. over Baseline
Majority Class   73.1%      -----
pros+lex         76.1%      -----
pros+lex+da      77.0%      1.2%
all              79.0%      3.8%
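The relative-improvement column can be reproduced from the accuracies. The figures are consistent with improvement measured against the pros+lex accuracy (76.1%) rather than against the majority class; that reading is inferred from the numbers, and the function name below is illustrative.

```python
def relative_improvement(accuracy, baseline):
    """Relative gain of accuracy over baseline, in percent."""
    return 100.0 * (accuracy - baseline) / baseline

baseline = 76.1  # pros+lex accuracy, assumed to be the reference point
for name, accuracy in [("pros+lex+da", 77.0), ("all", 79.0)]:
    print(f"{name}: {relative_improvement(accuracy, baseline):.1f}%")
```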
Tutoring Systems Should Respond to Uncertainty
• SCoT [Pon-Barry et al. 2006]
– Responding to uncertainty
• Active listening
• Hinting vs. paraphrasing
– Features examined
• Latency
• Filled pauses
• Hedges
– Performance metric
• Learning gain
– But no improvement by responding to
uncertainty
What does uncertainty sound like?
[pr01_sess00_prob58]
Uncertainty in ITSpoke
um <sigh> I don’t even think I have an idea
here ...... now .. mass isn’t weight ......
mass is ................ the .......... space that
an object takes up ........ is that mass?
[71-67-1:92-113]
ITSpoke Experiment
• Human-Human Corpus
• AdaBoost(C4.5) 90/10 split in WEKA
• Classes: Uncertain vs Certain vs Neutral
• Results:
Features            Accuracy
Baseline            66%
Acoustic-prosodic   75%
+ contextual        76%
+ breath-groups     77%
ITSpoke Results
Emotion     Precision   Recall   F-measure
certain     0.611       0.602    0.606
uncertain   0.515       0.393    0.446
neutral     0.846       0.891    0.868

                Classified as
Emotion label   certain   uncertain   neutral
certain         80        11          42
uncertain       26        35          28
neutral         25        22          384
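The precision, recall, and F-measure figures follow from the confusion matrix when rows are read as the emotion label and columns as the classifier output. The sketch below recomputes them from the counts on this slide; the function and variable names are illustrative.

```python
names = ["certain", "uncertain", "neutral"]

# Rows: emotion label; columns: classified as (counts copied from the slide).
confusion = [
    [80, 11, 42],
    [26, 35, 28],
    [25, 22, 384],
]

def per_class_metrics(matrix, class_names):
    """Return {class: (precision, recall, f_measure)} from a confusion matrix."""
    metrics = {}
    for i, name in enumerate(class_names):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_positives / predicted
        recall = true_positives / actual
        f_measure = 2 * precision * recall / (precision + recall)
        metrics[name] = (precision, recall, f_measure)
    return metrics

metrics = per_class_metrics(confusion, names)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(names))) / total

for name, (p, r, f) in metrics.items():
    print(f"{name:10s} P={p:.3f} R={r:.3f} F={f:.3f}")
print(f"overall accuracy: {accuracy:.3f}")
```

The row totals also show why uncertain is the hardest class: it has the fewest examples (89 of 653), while neutral accounts for 431.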
Voice Quality and Emotion
• Perceptual coloring
– Derived from a variety of laryngeal and
supralaryngeal features
– modal, creaky, whispered, harsh, breathy, ...
• Correlates with emotion
– Laver ’80, Scherer ’86, Murray & Arnott ’93, Laukkanen ’96, Johnstone & Scherer ’99, Gobl & Chasaide ’03, Fernandez ’00
Phonation Gestures
• Adductive tension: interarytenoid muscles adduct the arytenoid cartilages
• Medial compression: adductive force on vocal processes; adjustment of ligamental glottis
• Longitudinal tension: tension of vocal folds
Modal Voice
• “Neutral” mode
• Muscular adjustments moderate
• Vibration of vocal folds periodic, full closing of
glottis, no audible friction
• Frequency of vibration and loudness in low to
mid range for conversational speech
Tense Voice
• Very strong tension of
vocal folds, very high
tension in vocal tract
Whispery Voice
• Very low adductive
tension
• Medial compression
moderately high
• Longitudinal tension
moderately high
• Little or no vocal fold
vibration
• Turbulence
generated by friction
of air in and above
larynx
Creaky Voice
• Vocal fold vibration at low
frequency, irregular
• Low tension (only
ligamental part of glottis
vibrates)
• The vocal folds strongly
adducted
• Longitudinal tension
weak
• Moderately high medial
compression
Breathy Voice
• Tension low
– Minimal adductive
tension
– Weak medial
compression
• Medium longitudinal
vocal fold tension
• Vocal folds do not come
together completely,
leading to frication
Estimating Voice Quality
• Estimate with respect to a controlled neutral quality
– But how do we know the control is truly “neutral”?
– Must match the natural laryngeal behavior to laboratory “neutral”
• Our knowledge of models of vocal fold movements may
be inadequate for describing real phonation
• Known relationships between acoustic signal and voice
source are complex
– Can only observe the behavior of voicing indirectly, so estimates are prone to error
– Direct source data obtained by invasive techniques
which may interfere with signal
Next Class
• Deceptive Speech