Advances in Cochlear Implants
edited by I.J. Hochmair-Desoyer and E.S. Hochmair
Manz, Wien © 1994
A High Spectral Transmission Coding Strategy for a
Multi-Electrode Cochlear Implant
Norbert Dillier, WaiKong Lai, Hans Bögli
Dept. of Otorhinolaryngology, Head and Neck Surgery, University Hospital
CH-8091 Zürich, Switzerland
Abstract - Previous studies [1] have indicated that a strategy which presents spectral speech information from many narrow
frequency bands at a maximally high stimulation rate (CIS, Continuous Interleaved Stimulation) can produce improved
consonant identification compared to a similar strategy (PES, Pitch Excited Stimulation) which uses the fundamental
frequency F0 to determine the rate at which selected spectral peaks are presented. The main drawback of CIS was its rather
poor voice pitch discrimination. Hybrid PES/CIS coding strategies were then developed to improve voice quality in CIS-like
strategies [6]. One hybrid, INT1V, excites the lowest frequency active electrode in a voiced segment at the F0 rate (PES
activity) while the remaining (maximally five) active electrodes are stimulated using CIS. For unvoiced segments, all active
electrodes are stimulated using CIS. A further strategy uses F0 to determine the rate at which spectral information is
presented (the stimulation period) for voiced segments. Unvoiced segments are presented at a random rate between 150 and
250 Hz. In contrast to PES, which transmits only the spectral peaks, additional spectral information is encoded during the
remainder of the stimulation period at a maximally high stimulation rate. This is the High Spectral Transmission (HST)
strategy.
Five regular cochlear implant users participated in a comparative study involving a speaker identification test
and a consonant rhyme test. Voice pitch discrimination results were very good with HST and good but relatively poorer with
INT1V. Both HST and INT1V resulted in consonant identification performance comparable to that of the unmodified CIS strategy,
with all three strategies being generally better than PES. Percentage correct scores for individual subjects indicated that HST
was also consistently better than INT1V. The improved speech discrimination performance with HST is encouraging and will
be further evaluated.
I. INTRODUCTION
Major research and development efforts to restore auditory sensations and speech recognition for profoundly deaf
subjects have been devoted in recent years to signal processing strategies for cochlear implants. A number of
technological and electrophysiological constraints imposed by the anatomical and physiological conditions of the human
auditory system have to be considered [5]. One basic working hypothesis for cochlear implants is the idea that the
natural firing pattern of the auditory nerve should be as closely approximated by electrical stimulation as possible. The
central processor (the human brain) would then be able to utilize natural ("prewired" as well as learned) analysis modes
for auditory perception. An alternative hypothesis is the Morse code idea, which is based on the assumption that the
central processor would be flexible enough to interpret any transmitted stimulus sequence after proper training and
habituation.
Neither hypothesis has ever really been tested, for practical reasons. On the one hand it is not possible to reproduce the
activity of 30,000 individual nerve fibers with current electrode technology. In fact, it is even questionable whether
it is possible to reproduce the detailed activity of a single auditory nerve fiber via artificial stimulation. There are a
number of fundamental physiological differences in firing patterns of acoustically versus electrically excited neurons
which are hard to overcome. Spread of excitation within the cochlea and current summation are other major problems of
most electrode configurations. On the other hand the coding and transmission of spoken language requires a much larger
communication channel bandwidth and more sophisticated processing than a Morse code for written text. Practical
experiences with cochlear implants in the past indicate that some natural relationships (such as growth of loudness and
voice pitch variations) should be maintained in the encoding process. One might therefore conceive a third, more realistic,
hypothesis about as follows: Signal processing for cochlear implants should carefully select a subset of the total
information contained in the sound signal and transform these elements into those physical stimulation parameters which
can generate distinctive perceptions for the listener.
Fig. 1 Sonagram of the German word "Schein". Horizontal axis: time from 0 to 600 msec; vertical axis:
frequencies from 0 to 5 kHz.
An example of the complex structure of speech sounds is given in Fig. 1 which displays the sonagram of the German
word "Schein". Temporal as well as spectral properties of the speech signal are revealed by this analysis. The fricative
high frequency consonant /sch/ can be clearly distinguished from the vowel portion. Note the formant transitions during
the diphthong and towards the nasal /n/. The time signal at the top shows the aperiodic waveform during the unvoiced
consonant and the periodic pattern caused by the fundamental frequency of the voiced segment of the utterance.
[Fig. 2 panels: MSP (top left), CIS-NA (top right), PES (middle left), CIS-WF (middle right), INT1V (bottom left),
HST (bottom right).]
Fig. 2 "Electrodograms" of the German word "Schein" for 6 different processing strategies (see text for explanations).
Many researchers have designed and evaluated different systems varying the number of electrodes and the amount of
specific speech feature extraction and mapping transformations used [2]. Recently, Wilson et al. [10] have reported
astonishing improvements in speech test performance when they provided their subjects with high-rate pulsatile
stimulation patterns rather than analog broadband signals. They attributed this effect partly to the decreased current
summation obtained by non-simultaneous stimulation of different electrodes (which might otherwise have stimulated
partly the same nerve fibers and thus interacted in a nonlinear fashion) and partly to a fundamentally different and maybe
more natural firing pattern due to an extremely high stimulation rate. Skinner et al. [8] also found significantly higher
scores on word and sentence tests in quiet and noise with the new multipeak digital speech coding strategy of the mini
speech processor (MSP) as compared to the formerly used F0F1F2-strategy of the Nucleus-WSP (wearable speech processor).
Von Wallenberg and Battmer [9] found that good performers (group I-subjects) improved consonant identification scores
only after six months of processor use whereas moderate performers (group II-subjects) showed an immediate significant
improvement after 1 month. The electrode activation pattern of the MSP is displayed in Fig. 2 (top left) for the same word
as in Fig. 1. The processor was programmed in bipolar mode using thresholds of hearing and comfortable listening level
(T- and C-levels) of one of the implantees who participated in this study.
The comparisons of processors and strategies described above indicate the potential gains which may be obtained by
optimizing signal processing schemes for existing implanted devices. The present study was conducted in order to
explore new ideas and concepts of multichannel pulsatile speech encoding for users of the Clark/Nucleus cochlear
prosthesis. Similar methods and tools can however be utilized to investigate alternative coding schemes for other implant
systems. Portions of the results have been presented previously [1,3].
II. SIGNAL PROCESSING STRATEGIES
A cochlear implant digital speech processor (CIDSP) for the Nucleus 22-channel cochlear prosthesis has been designed
using a single chip digital signal processor (TMS320C25, Texas Instruments, [4]). For laboratory experiments the CIDSP
was incorporated in a general purpose computer which provided interactive parameter control, graphical display of
input/output and buffers and offline speech file processing facilities. The experiments described in this paper were all
conducted using the laboratory version of CIDSP.
Speech signals were processed as follows: after analog low-pass filtering (5 kHz) and analog-to-digital conversion (10
kHz), preemphasis and Hanning windowing (12.8 ms, shifted by 6.4 ms or less per analysis frame) were applied and the
power spectrum was calculated via fast Fourier transform (FFT); specified speech features such as formants and voice pitch
were extracted and transformed according to the selected encoding strategy; finally the stimulus parameters (electrode
position, stimulation mode, pulse amplitude and duration) were generated and transmitted via inductive coupling to the
implanted receiver. In addition to the generation of stimulus parameters for the cochlear implant, an acoustic signal based
on a perceptive model of auditory nerve stimulation was output simultaneously.
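The analysis front end described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the CIDSP implementation: the 10 kHz sampling rate, the 12.8 ms (128-sample) Hanning window and the 6.4 ms (64-sample) shift come from the text, whereas the pre-emphasis coefficient (0.95), the function names, and the use of a direct DFT in place of an optimized FFT are assumptions made for the example.

```python
import cmath
import math

FS = 10_000          # sampling rate in Hz (from the text)
FRAME_LEN = 128      # 12.8 ms at 10 kHz (from the text)
FRAME_SHIFT = 64     # 6.4 ms at 10 kHz (from the text)


def preemphasize(x, coeff=0.95):
    """First-order high-frequency pre-emphasis; coeff is an assumed value."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def hanning(n):
    """Symmetric Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    """Power spectrum for bins 0..N/2 via a direct DFT (stand-in for the FFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec


def analyze(signal):
    """Return one power spectrum per 12.8 ms frame, shifted by 6.4 ms."""
    x = preemphasize(signal)
    win = hanning(FRAME_LEN)
    frames = []
    for start in range(0, len(x) - FRAME_LEN + 1, FRAME_SHIFT):
        seg = [x[start + i] * win[i] for i in range(FRAME_LEN)]
        frames.append(power_spectrum(seg))
    return frames
```

With these settings each spectral bin is FS / FRAME_LEN = 78.125 Hz wide, so a tone at 781.25 Hz falls exactly on bin 10 of every frame.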
Several processing strategies were implemented on this system: The first approach (PES, Pitch Excited Sampler) is based
on the maximum peak channel vocoder concept whereby the time-averaged spectral energies of a number of frequency
bands (approximately third-octave bands) are transformed into appropriate electrical stimulation parameters for up to
22 electrodes (Fig. 2, middle left). The pulse rate at any given electrode is controlled by the voice pitch of the input
speech signal. A pitch extractor algorithm calculates the autocorrelation function of a lowpass-filtered segment of the
speech signal and searches for a peak within a specified time lag interval. A random pulse rate of about 150 to 250 Hz
is used for unvoiced speech portions.
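A pitch extractor of the kind described for PES might look like the sketch below. The text specifies only an autocorrelation peak search within a lag interval and a random rate of about 150 to 250 Hz for unvoiced portions; the search range (80 to 400 Hz), the voicing threshold (0.5) and all function names are assumptions, and the low-pass filtering step is omitted for brevity.

```python
import math
import random

FS = 10_000  # sampling rate in Hz (from the text)


def autocorr(x, lag):
    """Mean lagged product of x (autocorrelation estimate at one lag)."""
    n = len(x) - lag
    return sum(x[i] * x[i + lag] for i in range(n)) / n


def estimate_f0(segment, f_min=80.0, f_max=400.0, voicing_threshold=0.5):
    """Return F0 in Hz, or None if the segment is judged unvoiced.

    f_min, f_max and voicing_threshold are assumed values, not from the paper.
    """
    r0 = autocorr(segment, 0)
    if r0 <= 0.0:
        return None
    lag_lo = int(FS / f_max)
    lag_hi = min(int(FS / f_min), len(segment) - 1)
    # Search for the autocorrelation peak within the allowed lag interval.
    best_lag = max(range(lag_lo, lag_hi + 1), key=lambda lag: autocorr(segment, lag))
    if autocorr(segment, best_lag) / r0 < voicing_threshold:
        return None
    return FS / best_lag


def pulse_rate(segment, rng=random):
    """Pitch-synchronous rate for voiced input, random 150-250 Hz otherwise."""
    f0 = estimate_f0(segment)
    return f0 if f0 is not None else rng.uniform(150.0, 250.0)
```

For example, `pulse_rate` applied to a 25.6 ms segment of a 125 Hz periodic signal returns a rate close to 125 pulses per second, while a silent segment falls back to the random rate.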
The second approach (CIS, Continuous Interleaved Sampler) uses a stimulation pulse rate which is independent of the
fundamental frequency of the input signal. The algorithm continuously scans all frequency bands and samples their
energy levels (Fig. 2, top and middle right). As only one electrode can be stimulated at any instant of time, the rate of
stimulation is limited by the required stimulus pulse widths (determined individually for each subject) and the time to
transmit additional stimulus parameters. As the information about the electrode number, the stimulation mode, the pulse
amplitude and width is encoded by high frequency bursts (2.5 MHz) of different durations, the total transmission time
for a specific stimulus depends on all of these parameters. This transmission time can be minimized by choosing the
shortest possible pulse width combined with the maximal amplitude.
In order to achieve maximally high stimulation rates for those portions of the speech input signals which are assumed
to be most important for intelligibility, several modifications of the basic CIS-strategy were designed, of which only the
two most promising (CIS-NA, Fig. 2, top right, and CIS-WF, Fig. 2, middle right) will be considered in the following. The
analysis of the short time spectra was performed either for a large number of narrow frequency bands (corresponding
directly to the number of available electrodes) or for a small number (typically 6) of wide frequency bands analogous to
the approach suggested by Wilson et al. [10]. The frequency bands were logarithmically spaced from 200 to 5000 Hz in both
cases. Spectral energy within any of these frequency bands was mapped to stimulus amplitude at a selected electrode as
follows: all narrow band analysis channels whose values exceeded a noise cut level (NCL) were used for CIS-NA whereas all
wide band analysis channels irrespective of NCL were mapped to preselected fixed electrodes for CIS-WF. Both
schemes are supposed to minimize electrode interactions by preserving maximal spatial distances between subsequently
stimulated electrodes. The first scheme (CIS-NA) emphasizes spectral resolution while the second (CIS-WF) optimizes
fine temporal resolution. In both the PES- and the CIS-strategies a high-frequency preemphasis was applied whenever
a spectral gravity measure exceeded a preset threshold.
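The band layout and channel selection described above can be illustrated with a short sketch. Only the logarithmic spacing from 200 to 5000 Hz and the noise-cut-level rule for CIS-NA are taken from the text; the function names are hypothetical, and `interleaved_order` is merely one simple heuristic for keeping successively stimulated electrodes far apart, since the actual stimulation order is not specified here.

```python
def log_band_edges(n_bands, f_lo=200.0, f_hi=5000.0):
    """Edges (Hz) of n_bands logarithmically spaced bands between f_lo and f_hi."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [f_lo * ratio ** i for i in range(n_bands + 1)]


def select_cis_na(energies, noise_cut_level):
    """CIS-NA style selection: keep band indices whose energy exceeds the NCL."""
    return [b for b, e in enumerate(energies) if e > noise_cut_level]


def interleaved_order(channels):
    """Assumed heuristic: alternate between the lower and upper half of the
    sorted channel list so that consecutive stimuli are spatially distant."""
    s = sorted(channels)
    half = (len(s) + 1) // 2
    lower, upper = s[:half], s[half:]
    order = []
    for i in range(half):
        order.append(lower[i])
        if i < len(upper):
            order.append(upper[i])
    return order
```

For the wide band case with 6 bands, `log_band_edges(6)` gives a constant ratio of 25^(1/6), about 1.71, between neighbouring edges, so the fourth edge lies at 1000 Hz.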
In first experiments with these new strategies it could be shown that both types of strategies were able to provide
additional useful information to users of the Nucleus cochlear implant in comparison to the standard MSP in some of
the test conditions. The PES strategy resulted in somewhat lower consonant identification performance than the
CIS-strategies. However, the subjective quality of the processed speech and the user's ability to distinguish between
different voices was higher with the PES-strategy than with the CIS-methods. Thus it seemed logical to search for algorithms
which would combine the respective advantages of the two strategies. One of these hybrid PES/CIS coding strategies,
called INT1V (integrated hybrid strategy with one voice excited stimulation channel, Fig. 2, bottom left),
excites the lowest frequency active electrode in a voiced segment at the F0 rate (PES activity) while the remaining
(maximally five) active electrodes are stimulated using CIS. For unvoiced segments, all active electrodes are stimulated
using CIS. A further strategy uses F0 to determine the rate at which spectral information is presented (the stimulation
period) for voiced segments. Unvoiced segments are presented at a random rate between 150 and 250 Hz. In contrast
to PES, which transmits only the spectral peaks, additional spectral information is encoded during the remainder of the
stimulation period at a maximally high stimulation rate. This was called the High Spectral Transmission (HST, Fig. 2,
bottom right) strategy.
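One possible reading of the HST timing is sketched below, purely as an illustration. The text states only that the stimulation period follows F0 and that the remainder of the period carries additional spectral information at a maximally high rate; the slot duration of 400 microseconds, the choice to cycle repeatedly through the remaining channels, and the function name are all assumptions.

```python
def hst_period_schedule(f0_hz, peak_channels, other_channels, slot_us=400):
    """Return (time_us, channel) pairs for one stimulation period of 1/f0_hz.

    slot_us (time needed per stimulus) is an assumed value, and cycling through
    other_channels for the rest of the period is an assumed interpretation.
    """
    period_us = round(1_000_000 / f0_hz)
    n_slots = period_us // slot_us
    schedule = []
    for slot in range(n_slots):
        if slot < len(peak_channels):
            # Pitch-synchronous part: spectral peaks at the start of the period.
            ch = peak_channels[slot]
        elif other_channels:
            # Remainder of the period: further channels at the maximal rate.
            ch = other_channels[(slot - len(peak_channels)) % len(other_channels)]
        else:
            break
        schedule.append((slot * slot_us, ch))
    return schedule
```

With F0 = 125 Hz the period is 8 ms, which under the assumed slot duration leaves room for 20 stimuli: the peak channels first, then the additional channels until the next pitch period begins.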
III. SUBJECTS
Evaluation experiments have been conducted with five postlingually deaf adult (age 16 to 50 years) cochlear implant users
to date. All subjects were experienced users of their speech processors. The time since implantation ranged from 12
months (KW) to over 10 years (UT, single channel extracochlear implantation in 1980, reimplanted after device failure
in 1987). The subjects showed good sentence identification (80 to 95 % correct responses) and number recognition (40 to
95 % correct responses) performance, minor open speech discrimination in monosyllabic word tests (5 to 20 % correct
responses, all tests presented via computer, hearing-alone) and limited use of the telephone. Two series of tests were
carried out with 5 subjects participating in each series. The first series comprised a female/male distinction sentence
test (20 sentences spoken by 2 male and 2 female speakers). The second series comprised a four alternative forced choice
minimal pair test with consonants in medial position (CM2) and a four alternative forced choice minimal pair test with
vowels in medial position (VM2). All subjects were regular users of the MSP. Confusion matrices were pooled over the five
subjects for the 6 processing conditions and the 2 different speech tests and information transmission analysis was
performed for the 8 resulting matrices. As the main effects were seen in the consonant results only these data will be
shown below.
The same measurement procedure to determine thresholds of hearing (T-levels) and comfortable listening (C-levels) used
for fitting the MSP was also used for the CIDSP-strategies. Only minimal exposure to the new processing strategies was
possible due to time restrictions. After about 5 to 10 minutes of listening to ongoing speech, one or two blocks of a
20-item 2-digit numbers test were carried out. There was no feedback given during the test trials. All test items were
presented by a second computer which also recorded the subjects' responses entered via touch screen terminal (for
multiple choice tests) or keyboard (numbers tests and monosyllable word tests). The computer program automatically
generated the confusion matrices and calculated the transmitted information for the selected phonological feature set
according to the procedures described by Miller and Nicely [7]. Matrices could be analyzed individually for every
subject or pooled across a number of subjects. Speech signals were either presented via loudspeaker in a sound treated
room (when patients were tested with their wearable speech processors) or processed by the CIDSP in real time and fed
directly to the transmitting coil at the subject's head. Different speakers were used for the ongoing speech, the numbers
test and the actual speech tests respectively.
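The information transmission analysis of Miller and Nicely [7] can be sketched as follows. The relative transmitted information is the mutual information between stimulus and response, divided by the stimulus entropy; for a feature analysis the phoneme confusion matrix is first collapsed onto the feature classes. The function names and the feature labels in the usage note are hypothetical, and the sketch assumes square matrices with stimuli as rows and responses as columns.

```python
import math


def pool(matrices):
    """Element-wise sum of equally sized confusion matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) for j in range(cols)] for i in range(rows)]


def collapse(matrix, feature_of):
    """Collapse a phoneme confusion matrix onto feature classes.

    feature_of[i] is the class label of phoneme i (hypothetical labels).
    """
    classes = sorted(set(feature_of))
    idx = {c: k for k, c in enumerate(classes)}
    out = [[0] * len(classes) for _ in classes]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            out[idx[feature_of[i]]][idx[feature_of[j]]] += count
    return out


def relative_transmitted_information(matrix):
    """Transmitted information T(x;y) divided by the stimulus entropy H(x)."""
    total = sum(sum(row) for row in matrix)
    p_in = [sum(row) / total for row in matrix]
    p_out = [sum(matrix[i][j] for i in range(len(matrix))) / total
             for j in range(len(matrix[0]))]
    t = 0.0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if count:
                p = count / total
                t += p * math.log2(p / (p_in[i] * p_out[j]))
    h_in = -sum(p * math.log2(p) for p in p_in if p > 0)
    return t / h_in if h_in > 0 else 0.0
```

A perfectly diagonal matrix yields a relative transmission of 1, and a matrix in which responses are independent of the stimuli yields 0. A voicing analysis, for instance, would call `collapse` with a label such as "voiced" or "unvoiced" per phoneme before computing the ratio.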
IV. RESULTS AND DISCUSSION
Results of the female/male discrimination test are shown in Fig. 3. It can be noted that all five subjects scored at or
below chance level for the CIS-NA strategy and nearly perfectly for all other strategies. Two subjects (EM and TH)
scored only about 30 % with the MSP but 100 % with the new DSP-strategies. Three subjects (UT, KW, HS) scored
between 55 and 65 % for the integrated hybrid strategy (INT1V) whereas they scored between 90 and 100 % for the
other strategies. Thus it appears as if the HST strategy was best able to preserve the speech quality features of the
pitch-synchronous PES-strategy while generating a high continuous stimulus rate at the same time.
[Fig. 3 plot: Female-male discrimination. Vertical axis: % correct responses (chance level corrected), 0 to 100.
Subjects: EM, TH, UT, KW, HS. Legend: MSP, PES, CIS-NA, HST, INT1V.]
Fig. 3 Percentage correct scores for the female/male voice
discrimination test
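The scores in the figures are reported as chance level corrected. The exact correction is not spelled out in the text; a commonly used form, assumed here, rescales the raw percentage so that chance performance maps to 0 % and perfect performance to 100 %. The function name is hypothetical.

```python
def chance_corrected(percent_correct, n_alternatives):
    """Rescale a raw percentage so that chance = 0 and perfect = 100.

    This standard correction is an assumption; the paper does not state
    which formula was used.
    """
    chance = 100.0 / n_alternatives
    return 100.0 * (percent_correct - chance) / (100.0 - chance)
```

Under this assumption a raw score of 65 % in the two-alternative female/male test corresponds to a corrected score of 30 %, and a raw score of 25 % in a four alternative forced choice test corresponds to 0 %.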
The results of the minimal pair speech tests shown in Fig. 4 confirm the general pattern of earlier experiments
[1,3] although the differences between the MSP- and the new CIDSP-strategies in consonant tests were not as
large as with the logatomes used previously. Every subject had performed 100 trials (4 blocks of 25 trials)
per condition for both the consonant (CM2) and vowel (VM2) minimal pair test. While it was found that for all
five subjects the performance with CIS-NA, HST and also PES was better than with the MSP, it can be seen
that the variation between subjects was rather large. The integrated hybrid INT1V did result in improved performance
for two subjects (EM, UT) and in worse performance for two other subjects (KW, HS).
Results of the information transmission analysis for the pooled confusion matrices of the consonant minimal pair
test are shown in Fig. 5. It should be noted that the minimal pair tests CM2 and VM2 contain more variation of the
speech material than the logatome tests C12 and VO8 which were used previously for information transmission analysis.
Thus this analysis can only provide some indications and maybe show some trends but should not be overestimated. As
every subject had performed 100 trials per condition, every pooled confusion matrix added up to 500 entries. As can
be seen in Fig. 5 the total overall results for consonant tests with the subjects' own wearable speech processor were
significantly lower than with the new CIDSP-strategies. The pitch-synchronous coding (PES) resulted in worse
performance compared to the coding without explicit pitch extraction (CIS-NA) whereas the hybrid strategies resulted
in nearly the same performance as CIS-NA. This was not only true for the overall information but for all the features
analyzed except for sonorance (SON) and place (PLC), which were perceived equally well (but not better) with the
PES-strategy as with the high-rate strategies. Major improvements in consonant identification can be seen for the sibilance
(SIB) and frication (FRI) features which are closely related to each other and which mainly distinguish the high
frequency phonemes /s/ and /f/ from the rest. Some improvement was also evident in the place of articulation feature
(PLC). PES however produced clearly lower voicing information transmission than the CIS-strategies and was not
superior to MSP in contrast to the previous C12-test. HST however was superior in voicing information transmission.
[Fig. 4 plot: Consonant Test CM2. Vertical axis: % correct responses (chance level corrected), 0 to 100.
Subjects: EM, TH, UT, KW, HS. Legend: MSP, PES, CIS-NA, HST, INT1V.]
[Fig. 5 plot: Consonant Test CM2. Vertical axis: % transmitted information, 0 to 100.
Features: TOTAL, VOI, NAS, SON, SIB, FRI, PLC. Legend: MSP, PES, CIS-NA, HST, INT1V.]
Fig. 4 Four alternative forced choice bisyllabic minimal
pair test with consonants in medial position
Fig. 5 Information transmission analysis of pooled
confusion matrices: 5 subjects, 5 processing conditions
Vowel identification scores, on the other hand, were not improved by modifications of the signal processing strategy.
The transmission of first formant information (F01) was even worse for all CIDSP-strategies compared to the MSP
which probably is related to the lower number of electrodes assigned to the first formant region in the standard
CIDSP-mapping. Second formant information however was transmitted equally well with the CIS-NA as with the MSP strategy
which can be explained by the similar spectral resolution of these two strategies.
When the actual stimulus rates were examined for the five different strategies it was found that on the average the
CIS- and HST-strategies generated about four or five times higher rates than the MSP and that PES and INT1V were
somewhat in between. These analyses will be carried out in more detail in the future for different speech segments.
V. CONCLUSIONS
The above speech test results are still preliminary due to the small number of subjects and test conditions. It is however
quite promising that new signal processing strategies can improve speech discrimination considerably during acute
laboratory experiments. Consonant identification apparently may be enhanced by more detailed temporal information
and specific speech feature transformations.
While some of the new high-rate coding strategies produced a reduced voice quality it was possible to improve these
aspects by designing hybrid strategies. Voice pitch discrimination results were particularly good with a high spectral
transmission (HST) strategy which used maximally high stimulation rates at pitch-synchronous intervals and somewhat
poorer with an integrated hybrid strategy (INT1V) which encoded pitch on only one electrode channel and used
continuous high rate stimulation on all other channels. Both HST and INT1V resulted in consonant identification
performance comparable to that of the unmodified CIS strategy, with all three strategies being generally better than PES.
Percentage correct scores for individual subjects indicated that HST was also consistently better than INT1V. The
improved speech discrimination performance with HST is encouraging and will be further evaluated.
Although many aspects of speech encoding can be efficiently studied using a laboratory digital signal processor it would
be desirable to allow subjects more time for adjustment to a new coding strategy. Several days or weeks of habituation
are sometimes required until a new mapping can be fully exploited. Thus for scientific as well as practical purposes the
further miniaturization of wearable DSPs will be of great importance.
ACKNOWLEDGEMENTS
This study was supported by the Swiss National Research Foundation (Grant no. 4018-10864 and 4018-10865) and Cochlear AG
(Basel).
REFERENCES
1. Bögli, H. and Dillier, N. Digital Speech Processor for the Nucleus 22-Channel Cochlear Implant. Proc.IEEE EMBS
13/4:1901-1902, 1991.
2. Clark, G.M., Tong, Y.C. and Patrick, J.F. Cochlear Prostheses, Edinburgh, London, Melbourne, New York: Churchill
Livingstone, 1990. pp. 1-264.
3. Dillier, N., Bögli, H. and Spillmann, T. Digital Speech Processing for Cochlear Implants. ORL 54:299-307, 1992.
4. Dillier, N., Senn, C., Schlatter, T. and Stöckli, M. Wearable digital speech processor for cochlear implants using a
TMS320C25. Acta Otolaryngol (Stockh) Suppl. 469:120-127, 1990.
5. Evans, E.F. How to Provide Speech Through an Implant Device. Dimensions of the Problem: An overview. In: Cochlear
Implants, edited by Schindler, R.A. and Merzenich, M.M. 1985, p. 167-183.
6. Lai, W.K., Dillier, N. and Bögli, H. A hybrid coding strategy for a multichannel cochlear implant. Proc. of the Symp.
on Cochlear Implant: New Perspectives, Toulouse, France, 1992. In print.
7. Miller, G.A. and Nicely, P.E. An analysis of perceptual confusions among some English consonants. J.Acoust.Soc.Am.
27:338-352, 1955.
8. Skinner, M.W., Holden, L.K., Holden, T.A., Dowell, R.C. et al. Performance of postlingually deaf adults with the
wearable speech processor (WSP III) and mini speech processor (MSP) of the Nucleus Multi-Electrode Cochlear Implant.
Ear Hear. 12/1:3-22, 1991.
9. Von Wallenberg, E.L. and Battmer, R.D. Comparative speech recognition results in eight subjects using two different
coding strategies with the Nucleus 22 channel cochlear implant. Brit.J.Audiol. 25:371-380, 1991.
10. Wilson, B.S., Lawson, D.T., Finley, C.C. and Wolford, R.D. Coding strategies for multichannel cochlear prostheses.
Am.J.Otol. 12, Suppl. 1:55-60, 1991.