Vocal Emotion Recognition with Cochlear Implants
Xin Luo, Qian-Jie Fu, John J. Galvin III
Presentation by Archie Archibong

What is the Cochlear Implant?
• The cochlear implant (CI) is a hearing prosthesis that has restored hearing sensation to many deafened individuals.

How does it work?
• Contemporary CI devices use spectrally based speech-processing strategies in which the temporal envelope is extracted from a number of frequency analysis bands and used to modulate pulse trains of current delivered to the appropriate electrodes.
• Video aid: http://www.youtube.com/watch?v=SmNpP2fr57A

Introduction to the Study
• Abstract: Spoken language gives speakers the power to relay linguistic information as well as to convey other important prosodic cues (speech rhythm, intonation).
• Prosodic speech cues can convey information about the speaker's emotion, and recognizing the speaker's emotion can greatly contribute to speech understanding.
• Previous studies show that CI users struggle to recognize emotional speech.

Introduction to the Study
• This study investigated CI users' ability to recognize vocal emotions in acted emotional speech, given CI patients' limited access to pitch information and spectro-temporal fine-structure cues.
• Vocal emotion recognition was also tested in normal-hearing (NH) subjects listening to unprocessed speech and to speech processed by acoustic CI simulations. (Different amounts of spectral resolution and temporal information were tested to examine the relative contributions of spectral and temporal cues to vocal emotion recognition.)

Methodology & Procedure
– Subjects: 6 CI users and 6 normal-hearing (NH) listeners with an even gender breakdown (all native English speakers).
– All NH subjects had pure-tone thresholds better than 20 dB HL at octave frequencies from 125 to 8000 Hz in both ears (essentially, they could hear just fine).
– All CI users were post-lingually deafened; 5 of the 6 CI subjects had at least a year of experience with their device.
– The three devices used: Nucleus-22, Nucleus-24, and Freedom.

Methodology & Procedure
– An emotional speech database was recorded for the study in which one male and one female speaker each produced 5 simple sentences according to 5 target emotions (neutral, anxious, happy, sad, and angry).
– The same sentences were used to convey all target emotions in order to minimize contextual cues, forcing the listener to focus on the acoustic cues.
– Sentences used:

Methodology & Procedure
• The CI subjects' vocal emotion recognition was tested using only the unprocessed speech.
• The NH subjects' vocal emotion recognition was tested both with the unprocessed speech and with speech processed by acoustic sine-wave vocoder CI simulations. (The results show the reasoning behind, and the importance of, testing both for NH listeners.)
• Image: example of 4-channel speech processing for the CI simulation (a minimal code sketch of this kind of processing follows the Methodology & Procedure slides).

Methodology & Procedure
• Subjects were seated in a double-walled, sound-treated booth and listened to the stimuli presented in the free field over a single loudspeaker.
• Interesting: the presentation level (65 dBA) was calibrated according to the average power of the "angry" emotion produced by the male talker.
• In each trial a sentence was randomly selected (without replacement) from the stimulus set and presented to the subject; subjects responded by clicking on one of the five response choices shown on screen, labeled by emotion (e.g., "neutral" or "angry").
• No feedback or training was provided. Responses were collected and scored in terms of percent correct.
• At least two runs were completed for each experimental condition.
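
Aside: Sketch of the Sine-Wave Vocoder CI Simulation
• A minimal, illustrative Python sketch of the kind of processing described above (band-split the speech, extract each band's temporal envelope with a low-pass filter, and use it to modulate a sine carrier at the band's center frequency). The channel spacing, filter orders, and analysis frequency range here are assumptions chosen for illustration, not the study's exact parameters.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sine_vocoder(signal, fs, n_channels=4, env_cutoff=50.0,
                 f_lo=200.0, f_hi=7000.0):
    """Crude sine-wave vocoder CI simulation (illustrative parameters).

    1. Split the input into log-spaced analysis bands.
    2. Extract each band's temporal envelope (rectify + low-pass).
    3. Modulate a sine carrier at each band's center frequency and sum.
    """
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)   # band edges (assumed log spacing)
    t = np.arange(len(signal)) / fs
    env_sos = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
    out = np.zeros(len(signal), dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, signal)                    # analysis band
        env = np.clip(sosfiltfilt(env_sos, np.abs(band)), 0.0, None)  # temporal envelope
        carrier = np.sin(2 * np.pi * np.sqrt(lo * hi) * t)      # sine at band center
        out += env * carrier
    return out
```

• For example, sine_vocoder(x, fs, n_channels=16, env_cutoff=500.0) would approximate the simulation condition whose performance came closest to that with unprocessed speech.
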
Results
• Unprocessed speech: vocal emotion recognition performance for CI users (filled squares) and NH listeners (filled triangles).
• Mean NH performance (across subjects) with the CI simulations is shown as a function of the number of spectral channels.

Results
• NH: with unprocessed speech, mean NH performance was 90% correct.
• CI: with unprocessed speech, mean CI performance was just 45% correct.
• To note: although there was large inter-subject variability in each subject group, CI performance was significantly better than chance level.

Results
• NH: with the acoustic CI simulations, NH subjects' vocal emotion recognition performance was significantly affected by both the number of spectral channels and the temporal envelope filter cutoff frequency.
• With the 50-Hz envelope filter, performance significantly improved when the number of spectral channels was increased (with the exceptions of 1 to 2 and 4 to 8 channels).
• With the 500-Hz envelope filter, performance significantly improved only when the number of spectral channels was increased from 1 to more than 1, and from 2 to 16.

Results
• NH performance with unprocessed speech was significantly better than performance in any of the CI simulations, with the exception of the 16-channel simulation with the 500-Hz envelope filter. Why is that?

Conclusions
• Results showed that both spectral and temporal cues contributed significantly to performance. With the 50-Hz envelope filter, performance generally improved as the number of spectral channels was increased from 1 to 2, and then from 2 to 16. For all but the 16-channel simulations, increasing the envelope filter cutoff frequency from 50 Hz to 500 Hz significantly improved performance (a short sketch of why the cutoff matters follows these conclusions).
• CI users' vocal emotion recognition performance was statistically similar to that of NH subjects listening to 1-8 spectral channels with the 50-Hz envelope filter, and to 1 channel with the 500-Hz envelope filter.
• This suggests that even though spectral cues may contribute more strongly to recognition of linguistic information, temporal cues may contribute more strongly to recognition of the emotional content coded in spoken language.
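
Aside: Why the Envelope Filter Cutoff Matters
• A minimal, illustrative sketch of the temporal cue at stake: with a 500-Hz envelope filter, the extracted envelope keeps fluctuations at the voice fundamental (periodicity/pitch cues), whereas a 50-Hz filter smooths them away. The synthetic harmonic complex standing in for voiced speech and the 220-Hz fundamental are assumptions for illustration, not the study's stimuli.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
f0 = 220.0  # assumed voice fundamental for illustration

# Crude stand-in for voiced speech: a harmonic complex at f0.
voiced = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, 20)) / 20.0

def envelope(x, cutoff):
    """Rectify and low-pass to extract the temporal envelope."""
    sos = butter(4, cutoff, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, np.abs(x))

for cutoff in (50.0, 500.0):
    env = envelope(voiced, cutoff)
    # Residual fluctuation left in the envelope (relative depth):
    depth = (env.max() - env.min()) / env.mean()
    print(f"{cutoff:>5.0f} Hz cutoff: relative envelope fluctuation ~ {depth:.2f}")
```

• The larger residual fluctuation printed for the 500-Hz cutoff is the periodicity cue that likely underlies the performance gain reported above when the cutoff was raised from 50 Hz to 500 Hz.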