Vocal Emotion Recognition with
Cochlear Implants
Xin Luo, Qian-Jie Fu, John J. Galvin III
Presentation By Archie Archibong
What is the Cochlear Implant
• The cochlear implant (CI) is an electronic
hearing device that has restored hearing
sensation to many deafened individuals.
How does it work?
• Contemporary CI devices use spectrally based
speech-processing strategies in which the
temporal envelope is extracted from a number
of frequency analysis bands and used to
modulate pulse trains of current delivered to
the appropriate electrodes (a single-channel
sketch appears below the video link).
• Video aid:
http://www.youtube.com/watch?v=SmNpP2fr57A
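• Below is a minimal, hypothetical Python sketch of the per-channel processing just described (band-pass filter, envelope extraction, pulse-train modulation). The filter orders, band edges, envelope cutoff, and pulse rate are illustrative assumptions, not the parameters of any specific device.

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

fs = 16000                      # sampling rate in Hz (assumed)
t = np.arange(fs) / fs          # 1 s time axis
speech = np.random.randn(fs)    # stand-in for a speech waveform

# 1. Band-pass one frequency analysis band (e.g., 500-1000 Hz).
sos_band = butter(4, [500, 1000], btype="bandpass", fs=fs, output="sos")
band = sosfilt(sos_band, speech)

# 2. Extract the temporal envelope: rectify, then low-pass filter.
sos_env = butter(4, 200, btype="lowpass", fs=fs, output="sos")
envelope = np.clip(sosfiltfilt(sos_env, np.abs(band)), 0.0, None)

# 3. Modulate a fixed-rate pulse train with the envelope; in a real
#    device this would set the current delivered to one electrode.
pulse_rate = 900                # pulses per second (assumed)
pulses = np.floor(t * pulse_rate) != np.floor((t - 1 / fs) * pulse_rate)
stimulus = pulses.astype(float) * envelope
```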
Introduction to the Study
• Abstract: Spoken language gives speakers the power to
relay linguistic information as well as to convey
important prosodic cues (speech rhythm, intonation).
• Prosodic speech cues can convey information about
the speaker’s emotion, and recognizing the speaker’s
emotion can greatly contribute to speech
understanding.
• Studies show that CI users struggle to recognize
emotional speech.
Introduction to the Study
• This study investigated CI users’ ability to recognize
vocal emotions in acted emotional speech, given CI
patients’ limited access to pitch information and
spectro-temporal fine-structure cues.
• Vocal emotion recognition was also tested in NH
subjects listening to unprocessed speech and to speech
processed by acoustic CI simulations. (Different
amounts of spectral resolution and temporal
information were tested to examine the relative
contributions of spectral and temporal cues to vocal
emotion recognition.)
Methodology & Procedure
– Subjects: 6 CI users and 6 normal-hearing (NH) listeners,
with an even gender breakdown (all native English speakers).
– All NH subjects had pure-tone thresholds better than
20 dB HL at octave frequencies from 125 to 8000 Hz in
both ears (essentially, they could hear just fine).
– All CI users were post-lingually deafened; 5 of the 6 CI
subjects had at least a year of experience with their device.
– The three devices used: Nucleus-22, Nucleus 24, and
Freedom
Methodology & Procedure
– An emotional speech database was recorded for the study in
which one male and one female speaker each produced 5 simple
sentences according to 5 target emotions (neutral,
anxious, happy, sad, and angry).
– The same sentences were used to convey each target emotion
in order to minimize contextual cues (forcing the listener to
focus on the acoustic cues).
– Sentences used: (listed on the original slide and not
reproduced here; the stimulus set is sketched below)
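• A small sketch of the stimulus set implied above (2 talkers × 5 sentences × 5 emotions = 50 tokens). The sentence texts are hypothetical placeholders, since the actual sentences are not reproduced here.

```python
from itertools import product

TALKERS = ["male", "female"]
SENTENCES = [f"sentence_{i}" for i in range(1, 6)]  # placeholder texts
EMOTIONS = ["neutral", "anxious", "happy", "sad", "angry"]

# Every talker produces every sentence in every target emotion.
stimuli = list(product(TALKERS, SENTENCES, EMOTIONS))
assert len(stimuli) == 50
```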
Methodology & Procedure
• The CI subjects’ vocal emotion recognition was tested using
unprocessed speech only.
• The NH subjects’ vocal emotion recognition was tested both with
unprocessed speech and with speech processed by acoustic
sine-wave vocoder CI simulations. (Testing both gives a ceiling
for NH listeners and lets the simulations isolate the effect of
CI processing from hearing ability.)
• Image: example of 4-channel speech processing for the CI
simulation (a vocoder sketch follows):
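• A minimal sine-wave vocoder sketch of the acoustic CI simulation: speech is split into N analysis bands, each band's envelope is extracted with a chosen low-pass cutoff (e.g., 50 or 500 Hz), and the envelope modulates a sine carrier at the band's center frequency. Logarithmic band spacing, the band-edge frequencies, and 4th-order Butterworth filters are assumptions; the study's actual corner frequencies are not given here.

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def sine_vocode(speech, fs, n_channels=4, env_cutoff=50.0,
                f_lo=200.0, f_hi=7000.0):
    """Return a sine-wave-vocoded version of `speech`."""
    t = np.arange(len(speech)) / fs
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)  # log-spaced bands
    sos_env = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
    out = np.zeros(len(speech))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos_band = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos_band, speech)
        env = np.clip(sosfiltfilt(sos_env, np.abs(band)), 0.0, None)
        carrier = np.sin(2 * np.pi * np.sqrt(lo * hi) * t)  # geometric center
        out += env * carrier
    return out

# e.g., the 4-channel, 50-Hz-envelope condition shown in the image:
# simulated = sine_vocode(speech, fs=16000, n_channels=4, env_cutoff=50.0)
```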
Methodology & Procedure
• Subjects were seated in a double-walled, sound-treated booth and
listened to the stimuli presented in free field over a single
loudspeaker.
• Interesting: The presentation level (65 dBA) was calibrated
according to the average power of the “angry” emotion produced
by the male talker.
• In each trial, a sentence was randomly selected (without
replacement) from the stimulus set and presented to the subject;
subjects responded by clicking one of the five labeled response
choices shown on screen (e.g., “neutral” or “angry”).
• No feedback or training was provided. Responses were collected
and scored in terms of percent correct, with at least two runs
for each experimental condition (the trial loop is sketched below).
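• A sketch of the trial loop just described: sentences are drawn without replacement, the subject picks one of the five emotion labels, and responses are scored as percent correct. The stimulus list and the get_response() helper are hypothetical placeholders.

```python
import random

EMOTIONS = ["neutral", "anxious", "happy", "sad", "angry"]

def run_block(stimuli, get_response):
    """stimuli: list of (sound, true_emotion) pairs; returns % correct."""
    order = random.sample(stimuli, len(stimuli))  # without replacement
    n_correct = 0
    for sound, true_emotion in order:
        # play(sound)  # presented at 65 dBA in free field
        choice = get_response(EMOTIONS)           # subject clicks a label
        n_correct += (choice == true_emotion)     # no feedback is given
    return 100.0 * n_correct / len(order)
```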
Results
• Unprocessed speech: vocal emotion recognition performance for CI
users (filled squares) and NH listeners (filled triangles).
• Mean NH performance (across subjects) with the CI simulations is
shown as a function of the number of spectral channels.
Results
• NH: With unprocessed speech, mean NH
performance was 90% correct.
• CI: With unprocessed speech, mean
performance was just 45% correct.
• To note: although there was large inter-subject
variability in each subject group, CI
performance was significantly better than
chance level (a sketch of one chance-level check follows).
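• With five response alternatives, chance is 20% correct. A one-sided binomial test is one common way to check a score against chance; the study's actual statistics are not detailed here, and the trial count below is an assumption.

```python
from scipy.stats import binomtest

n_trials = 50                        # assumed number of trials per run
n_correct = round(0.45 * n_trials)   # roughly the 45% mean CI score
result = binomtest(n_correct, n_trials, p=0.20, alternative="greater")
print(f"p = {result.pvalue:.4f}")    # small p => better than chance
```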
Results
• NH: With acoustic CI simulations, NH subjects’ vocal
emotion recognition performance was significantly
affected by both the number of spectral channels and
the temporal envelope filter cutoff frequency.
• With the 50-Hz envelope filter, performance
significantly improved as the number of spectral
channels was increased (except from 1 to 2
and from 4 to 8 channels).
• With the 500-Hz envelope filter, performance
significantly improved only when the number of
spectral channels was increased from 1 to more
than 1, and from 2 to 16.
Results
• NH performance with unprocessed speech
was significantly better than performance in
any of the CI simulations, with the exception of
the 16-channel simulation with the 500-Hz
envelope filter. Why is that?
Conclusions
• Results showed that both spectral and temporal cues significantly
contributed to performance. With the 50-Hz envelope filter,
performance generally improved as spectral resolution was
increased from 1 to 2 channels, and then from 2 to 16
channels. For all but the 16-channel simulations, increasing the
envelope filter cutoff frequency from 50 Hz to 500 Hz significantly
improved performance.
• CI users’ vocal emotion recognition performance was statistically
similar to that of NH subjects listening to 1-8 spectral channels with
the 50-Hz envelope filter, and to 1 channel with the 500-Hz
envelope filter.
• This suggests that even though spectral cues may contribute more
strongly to recognition of linguistic information, temporal cues may
contribute more strongly to recognition of emotional content coded
in spoken language.