Emotional Speech
Julia Hirschberg
CS 6998
7/15/2016

Today
- Defining emotional speech
- Emotional categories
- Eliciting judgments
- Producing emotional speech
- Detecting emotional speech
- A Subclass: Deceptive speech

Cowie ‘00
- Is there a good theoretical or practical definition of emotional speech?
  - “Full-blown” emotion vs. emotional state
  - Cause and effect descriptions
  - Primary and secondary (second order)
  - Everyday descriptions
- Representations
  - Biological
  - Dimensions in continuous space, e.g.
    - Valence: positive or negative
    - Activation level: how disposed to take action
  - Structural models: different ways of appraising the situation that evokes the emotion
    - e.g. positive or negative? Does the situation help the agent achieve his/her goals?
  - Timing as a key variable
    - sadness vs. grief vs. depression vs. gloominess
- How are emotions expressed?
  - Display rules?
  - In speech?
  - Mixing
  - Simulation

Schroeder ‘01: Emotion in Synthesis
- How is a given emotion expressed in speech?
  - What are the properties of the emotion to be expressed?
  - How are they related to those of other emotions?
- What kind of synthesizer works best?
  - Formant
  - Diphone
  - Unit selection
- Prosody rules: what to modify?
- How do we evaluate the results?
  - Forced choice
  - Free response
  - Recognition rate
  - Perceived naturalness

Ten Bosch ‘00: Emotion Recognition
- How hard is the problem?
- Is ‘standard’ ASR technology well-suited to it?
  - Acoustic and language models target short local events
  - Feature extraction normalizes/excludes e.g. pitch, rate, amplitude -- why?
- Interaction: emotional speech and ASR performance
- Synthesis needs one good example but...
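A minimal sketch of the amplitude part of the question above. It illustrates one commonly given reason, offered here as an assumption rather than as the slide's own answer: recognition features are normalized so that the same words spoken louder or softer look alike, which also discards loudness as an emotional cue. The frame values and the helper names `log_energy` and `mean_normalize` are invented for illustration.

```python
import math

def log_energy(frames):
    """Log of the mean squared amplitude of each frame (a list of samples)."""
    return [math.log(sum(s * s for s in f) / len(f)) for f in frames]

def mean_normalize(values):
    """Subtract the utterance-level mean from every frame value."""
    mu = sum(values) / len(values)
    return [v - mu for v in values]

# Toy frames (made-up sample values, not real speech).
quiet = [[0.1, -0.2, 0.15], [0.3, -0.25, 0.2], [0.05, 0.1, -0.05]]
loud = [[4.0 * s for s in frame] for frame in quiet]  # same frames, 4x gain

# Before normalization, every frame differs by the constant log(4**2).
raw_gap = [l - q for l, q in zip(log_energy(loud), log_energy(quiet))]

# After mean normalization the gain difference disappears.
norm_gap = [l - q for l, q in zip(mean_normalize(log_energy(loud)),
                                  mean_normalize(log_energy(quiet)))]
```

The constant offset in `raw_gap` is exactly the kind of loudness information an emotion detector might want to keep, while `norm_gap` shows it is gone after normalization.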

Ang et al
- Challenges:
  - Use output from ASR system
  - Use automatic prosodic features
  - Find good speaker normalization
  - Combine with lexical features
- Pioneered approach of “direct modeling”: no use of intermediate phonological units
- Applications: detecting frustration, disappointment/tiredness, amusement/surprise
- Results: prediction comparable to human accuracy, 70-75%

Method: Prosodic Models
- Extract pitch from signal
- Speech recognizer outputs word and phone alignments (duration features)
- Utterance-level features extracted (e.g., max speaker-normalized pitch in the longest phone-normalized vowel, etc.)
- Decision trees created to provide posterior probabilities of emotion classes given features
- Feature selection from development test set
- Separate test set used for evaluation

Prosodic Features
- Duration features
  - Phone / Vowel / Syllable Durations
  - Normalized by Phone/Vowel Means, Speaker
- Speaking rate features (vowels/time)
- Pause features
  - Speech to pause ratio, number of long pauses
  - Maximum pause length
- Energy features (RMS energy)
- Pitch features
  - Used pitch stylization algorithm (Sonmez et al.)
  - LTM model of F0 to estimate speaker range
  - Pitch ranges, slopes, locations of interest
- Spectral tilt features
- Other (non-prosodic) features
  - Position of utterance in dialog
  - Repeat or correction

Emotion in Deception
- Motivation: why might such cues exist?
- Deception evokes emotion in deceivers (e.g. Ekman ‘85-92)
  - Fear of discovery: higher pitch, faster, louder, pauses, disfluencies, indirect speech
  - Elation at successful deceiving: higher pitch, faster, louder, greater elaboration

Acoustic/Prosodic/Lexical Cues
- Are deceivers less forthcoming?
  - Shorter speech with fewer details
- Are lies less compelling than truths?
  - Less plausible, logical, more discrepancies
  - Less verbal and vocal ‘involvement’
  - Less verbal ‘immediacy’: more passives, negations, indirect speech
  - More uncertainty (subjective)
  - More repetitions
- Are liars less positive, pleasant?
  - More negative statements, complaints
- Are liars more tense?
  - Nervous overall
  - Vocal tension
  - High pitch
- Do lies contain fewer ‘imperfections’?
  - Fewer self-repairs
  - Fewer admissions of forgetfulness
  - Fewer scene descriptions, details
  - More mention of peripheral events or relationships

Current State-of-the-Art
- No single cue to deceptive speech: most studied are visual
- Other acoustic/prosodic features proposed, but evidence mixed so far
  - Loudness/intensity
  - Speaking rate
  - Response latency
  - Disfluencies
- No attested method to detect deception automatically using acoustic/prosodic/lexical cues
  - All current findings are descriptive, suggestive
  - All proposed methods require human intervention

Our Approach
- Elicit deceptive and non-deceptive corpus
  - Motivation: Identity-relevant (self-image) and instrumental (monetary) incentives
  - “Real” deception vs. acted
  - Good recording conditions
  - Tasks/interview paradigm
- Transcription/annotation
- Acoustic/prosodic/lexical analysis to identify features of interest, test validity of paradigm
- Automatic feature extraction and analysis to train models of deceptive and non-deceptive speech

Corpus Collection
- Subjects asked to perform tasks for comparison with target profile of 25 top entrepreneurs
- Performance manipulated to produce performance same as/differing from target
- Monetary incentive to convince an interviewer they matched target
- Recorded interview/interrogation
  - Biographical information (t/f)
  - “Big lie” on task performance
  - “Local lie”: Pedal indicators of t/f for each answer

Collection
- To date: 15 subjects, totaling ~3h of subject speech
- Planned: 7-8 hours of subject speech

Results of Prosodic/Acoustic Analysis
- On Arizona Mock Theft data subset:
  - 32 interviews/72m, required segmentation, recording issues (50/160m more being segmented)
  - Significant pitch feature differences between deceptive and non-deceptive speech, but...
    - Highly motivated speakers lower pitch when lying
    - Low motivation speakers raise pitch when lying
    - Males lower pitch when lying
    - Females raise pitch when lying
- On Columbia corpus:
  - Preliminary analyses of 8 speakers for ‘local’ t/f
  - Significant differences in pitch range for six subjects, but differ from Mock Theft wrt gender
- Lexical findings: Preliminary analyses on Columbia data using LIWC show negative words more prevalent in deceptive speech
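
The per-speaker pitch comparisons reported in these results, together with the speaker normalization named earlier as a challenge, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the analysis actually run on either corpus: the segment tuples, the `T`/`F` labels, the F0 values, and the function names are all invented, and the toy numbers are chosen only to show that speakers can shift pitch in opposite directions when lying.

```python
from statistics import mean, pstdev

def speaker_zscores(f0_by_speaker):
    """Z-normalize each F0 value against that speaker's own mean and std."""
    out = {}
    for spk, values in f0_by_speaker.items():
        mu, sd = mean(values), pstdev(values)
        # A speaker with constant F0 gets all zeros instead of a division error.
        out[spk] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def pitch_shift_when_lying(segments):
    """Per speaker: mean normalized pitch of 'F' segments minus 'T' segments.

    `segments` is a list of (speaker, label, mean_f0_hz) tuples (hypothetical
    format), where label is 'T' for truthful and 'F' for deceptive.
    """
    f0_by_speaker = {}
    for spk, _, f0 in segments:
        f0_by_speaker.setdefault(spk, []).append(f0)
    z = speaker_zscores(f0_by_speaker)

    # Walk the segments again in the same order to pair z-scores with labels.
    position = {spk: 0 for spk in f0_by_speaker}
    by_label = {}
    for spk, label, _ in segments:
        by_label.setdefault((spk, label), []).append(z[spk][position[spk]])
        position[spk] += 1

    return {spk: mean(by_label[(spk, "F")]) - mean(by_label[(spk, "T")])
            for spk in f0_by_speaker}

# Invented toy values, not data from the Mock Theft or Columbia corpora.
segments = [
    ("s1", "T", 120.0), ("s1", "T", 124.0), ("s1", "F", 112.0), ("s1", "F", 116.0),
    ("s2", "T", 200.0), ("s2", "T", 204.0), ("s2", "F", 214.0), ("s2", "F", 218.0),
]
shifts = pitch_shift_when_lying(segments)
```

A negative entry in `shifts` means that speaker's normalized pitch was lower in deceptive segments, a positive entry means it was higher. Pooling the two toy speakers without normalization would blur these opposite shifts, which is one reason a per-speaker view is useful here.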