Cues to Emotion: Language Suzanne Yuen Monday Oct 5, 2009 COMS 6998

Overview
• Two-Stream Emotion Recognition for Call
Center Monitoring
• Voice Quality and f0 Cues for Affect
Expression: Implications for Synthesis
Two-Stream Emotion Recognition for
Call Center Monitoring
• Background: To aid supervisors in the evaluation of
agents at call centers*
• Objective: To present a two stream processing
technique to detect strong emotion
• Previous Work:
– Fernandez categorized affect into four main components:
intonation, loudness, rhythm, and voice quality
– Yang studied feature selection methods in text
categorization and suggested that information gain should
be used
– Petrushin and Yacoub examined agitation and calm states
in people-machine interaction
*A typical medium-sized call center receives about 100,000 calls per day
Two-Stream Recognition
Acoustic Stream
Extracted features based on pitch and
energy
•Trained on 900 calls, ~60hrs of speech
•Vocabulary system of more than 10,000 words
•TF-IDF scheme = Term Frequency –
Inverse Document Frequency
Semantic Stream
•Performed speech-to-text conversion
•Text classification algorithms
identified phrases such as “pleasure,”
“thanks,” “useless,” & “disgusting.”
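As a rough illustration of the TF-IDF weighting named on this slide, the sketch below scores the words of a few toy utterances. The utterances, the function name `tf_idf`, and the choice of length-normalized term frequency with a natural-log IDF are assumptions made for illustration; the slides do not say which TF-IDF variant the paper uses.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumption: tf = count / document length, idf = ln(N / df).
    Other TF-IDF variants exist and may be what the paper used.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical utterances, not taken from the paper's data.
utterances = [
    "thanks it was a pleasure",
    "this service is useless",
    "thanks this is fine",
]
scores = tf_idf(utterances)
```

In this toy example, "useless" occurs in only one utterance and so receives a higher weight than "this", which occurs in two. A word that occurs in every utterance receives a weight of zero.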
Implementation
• Method:
– Two streams analyzed separately:
• speech utterance/acoustic features
• spoken text/semantics/speech recognition of conversation
– Confidence levels of two streams combined
– Examined 3 emotions
• Neutral
• Hot-anger
• Happy
• Tested two data sets:
– LDC data
– 20 real-world call-center calls
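The slide says only that the confidence levels of the two streams are combined; it does not give the combination rule. The sketch below assumes a simple weighted average of per-emotion confidences followed by picking the highest fused score. The function name `combine_streams`, the parameter `acoustic_weight`, and the example confidences are all hypothetical.

```python
EMOTIONS = ("neutral", "hot-anger", "happy")

def combine_streams(acoustic, semantic, acoustic_weight=0.5):
    """Fuse per-emotion confidences from the two streams.

    Assumption: a weighted average; the paper's actual rule may differ.
    Returns (best_emotion, fused_confidences).
    """
    if not 0.0 <= acoustic_weight <= 1.0:
        raise ValueError("acoustic_weight must be in [0, 1]")
    fused = {
        e: acoustic_weight * acoustic[e] + (1.0 - acoustic_weight) * semantic[e]
        for e in EMOTIONS
    }
    # The emotion with the highest fused confidence wins.
    best = max(fused, key=fused.get)
    return best, fused

# Made-up confidences for a single utterance.
acoustic_conf = {"neutral": 0.5, "hot-anger": 0.4, "happy": 0.1}
semantic_conf = {"neutral": 0.2, "hot-anger": 0.7, "happy": 0.1}
label, fused = combine_streams(acoustic_conf, semantic_conf)
```

With these made-up numbers the acoustic stream alone would pick "neutral", while the fused scores pick "hot-anger", which is the kind of disagreement a two-stream design is meant to resolve.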
Two-Stream: Conclusion
•Table 2 suggested that two-stream analysis is more accurate than
acoustic or semantic analysis alone
•Recognition on LDC data was significantly higher than on real-world data
•Neutral emotion was recognized less accurately
•Combining the two streams improved identification of “happy” and
“anger” emotions (~20%)
•Low acoustic-stream accuracy may be attributed to the length of
sentences in real-world data: speakers typically do not exhibit
emotions strongly across long sentences
Discussion
• Gupta analyzed three emotions (happy, neutral, hot-anger): Why
break it down into these categories? Implications? Can this
technique be applied to a wider range of emotions? For other
applications?
• Speech-to-text conversion may not transcribe the complete conversation.
Would further examination greatly improve results? What are the pros and
cons?
• The pitch range was 50-400 Hz, so the research may not be applicable
outside this range. Do you think it is necessary to examine other
frequencies?
• In this paper, the TF-IDF (Term Frequency – Inverse Document
Frequency) technique is used to classify utterances. Accuracy for
acoustics only is about 55%. Previous research suggests that
alternative techniques may be better. Would implementing them give better
results? What are the pros and cons of using the TF-IDF technique?
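To make the 50-400 Hz pitch-range question above concrete, the sketch below estimates f0 with a plain autocorrelation search restricted to that range. This is a common textbook method swapped in for illustration; the slides do not say how the paper extracts pitch. The function name `estimate_f0`, its parameters, and the synthetic test tone are assumptions.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate f0 in Hz from the autocorrelation peak.

    Only lags corresponding to f0_min..f0_max are searched, so
    pitches outside that range cannot be reported.
    Returns None if no lag has positive correlation (e.g. silence).
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Synthetic 200 Hz tone, 0.1 s at 8 kHz (illustrative, not real speech).
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
f0 = estimate_f0(tone, rate)
```

Because the search never considers lags shorter than `sample_rate / f0_max`, a tone above 400 Hz cannot be reported correctly by this sketch. Any extractor restricted to the same range would share that limitation.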
Voice Quality and f0 Cues for Affect
Expression: Implications for Synthesis
• Previous work:
– 1995: Mozziconacci suggested that VQ combined with
f0 could create affect
– 2002: Gobl suggested that synthesized stimuli with VQ can
add affective coloring. The study suggested that “VQ + f0”
stimuli are more affective than “f0 only”
– 2003: Gobl tested VQ with a large f0 range but did not
examine the contribution of affect-related f0 contours
• Objective: To examine the effects of VQ and f0 on
affect expression
Voice Quality and f0 Cues for Affect
Expression: Implications for Synthesis
• 3 series of stimuli of the Swedish utterance “ja adjo”:
– Stimuli exemplifying VQ
– Stimuli with modal voice quality with different affect-related f0 contours
– Stimuli combining both
• Tested parameters exemplifying 5 voice qualities (VQ):
– Modal voice
– Breathy voice
– Whispery voice
– Lax-creaky voice
– Tense voice
• 15 synthesized stimuli test samples (see Table 1)
What is Voice Quality?
Phonation Gestures
• Derived from a variety of laryngeal and supralaryngeal features
• Adductive tension: interarytenoid muscles adduct the arytenoid cartilages
• Medial compression: adductive force on vocal processes; adjustment of
ligamental glottis
• Longitudinal tension: tension of vocal folds
Tense Voice
• Very strong tension of
vocal folds, very high
tension in vocal tract
Whispery Voice
• Very low adductive
tension
• Medial compression
moderately high
• Longitudinal tension
moderately high
• Little or no vocal fold
vibration
• Turbulence
generated by friction
of air in and above
larynx
Creaky Voice
• Vocal fold vibration at low
frequency, irregular
• Low tension (only
ligamental part of glottis
vibrates)
• The vocal folds strongly
adducted
• Longitudinal tension
weak
• Moderately high medial
compression
Breathy Voice
• Tension low
– Minimal adductive
tension
– Weak medial
compression
• Medium longitudinal
vocal fold tension
• Vocal folds do not come
together completely,
leading to frication
Modal Voice
• “Neutral” mode
• Muscular adjustments moderate
• Vibration of vocal folds periodic, full closing of
glottis, no audible friction
• Frequency of vibration and loudness in low to
mid range for conversational speech
Voice Quality and f0 Cues for Affect
Expression: Implications for Synthesis
• Six sub-tests with 20 native speakers of Hiberno-English.
• Rated on 12 different affective attributes (six opposing pairs):
– Sad – happy
– Intimate – formal
– Relaxed – stressed
– Bored – interested
– Apologetic – indignant
– Fearless – scared
• Participants were asked to mark their response on a scale
[Figure: example response scale with the labels “Intimate”, “Formal”, and “No affective load”]
Voice Quality and f0 Test: Conclusion
• Categorized results into 4 groups. No simple one-to-one mapping
between quality and affect
• “Happy” was the most difficult to synthesize
• Suggested that, in addition to f0, VQ should be used to synthesize
affectively colored speech. VQ appears to be crucial for expressive
synthesis
Voice Quality and f0 Test: Discussion
• If the scale runs from 1 to 7, then 3.5 should be “neutral”; however, most
ratings are less than 2. Do the conclusions (see Fig 2) seem strong?
• In terms of VQ and f0, the groupings in Fig 2 seem to suggest that
certain affects are closely related. What are the implications of this?
For example, are happy and indignant affects closer than relaxed or
formal? Do you agree?
• Do you consider an intimate voice more “breathy” or “whispery?”
Does your intuition agree with the paper?
• Yanushevskaya found that VQ accounts for the highest affect
ratings overall. How can the range of voice quality be compared with
that of frequency? Do you think they are comparable? Is there a different
way to describe these qualities?
Questions?