Advances in Cochlear Implants, edited by I.J. Hochmair-Desoyer and E.S. Hochmair, Manz, Wien © 1994

A High Spectral Transmission Coding Strategy for a Multi-Electrode Cochlear Implant

Norbert Dillier, WaiKong Lai, Hans Bögli
Dept. of Otorhinolaryngology, Head and Neck Surgery, University Hospital, CH-8091 Zürich, Switzerland

Abstract - Previous studies [1] have indicated that a strategy which presents spectral speech information from many narrow frequency bands at a maximally high stimulation rate (CIS, Continuous Interleaved Stimulation) can produce improved consonant identification compared to a similar strategy (PES, or Pitch Excited Stimulation) which uses the fundamental frequency F0 to determine the rate at which selected spectral peaks are presented. The main drawback of CIS was its rather poor voice pitch discrimination. Hybrid PES/CIS coding strategies were then developed to improve voice quality in CIS-like strategies [6]. One hybrid, INT1V, excites the lowest-frequency active electrode in a voiced segment at the F0 rate (PES activity) while the remaining (maximally five) active electrodes are stimulated using CIS. For unvoiced segments, all active electrodes are stimulated using CIS. A further strategy uses F0 to determine the rate at which spectral information is presented (the stimulation period) for voiced segments. Unvoiced segments are presented at a random rate between 150 and 250 Hz. In contrast to PES, which transmits only the spectral peaks, additional spectral information is encoded during the remainder of the stimulation period at a maximally high stimulation rate. This is the High Spectral Transmission (HST) strategy. Five regular cochlear implant users participated in a comparative study involving a speaker identification test and a consonant rhyme test. Voice pitch discrimination results were very good with HST and good but relatively poorer with INT1V. Both HST and INT1V resulted in consonant identification performances comparable to the unmodified CIS strategy, with all three strategies being generally better than PES. Percentage correct scores for individual subjects indicated that HST was also consistently better than INT1V. The improved speech discrimination performance with HST is encouraging and will be further evaluated.

I. INTRODUCTION

Major research and development efforts to restore auditory sensations and speech recognition for profoundly deaf subjects have been devoted in recent years to signal processing strategies for cochlear implants. A number of technological and electrophysiological constraints imposed by the anatomical and physiological conditions of the human auditory system have to be considered [5]. One basic working hypothesis for cochlear implants is the idea that the natural firing pattern of the auditory nerve should be approximated by electrical stimulation as closely as possible. The central processor (the human brain) would then be able to utilize natural ("prewired" as well as learned) analysis modes for auditory perception. An alternative hypothesis is the Morse code idea, which is based on the assumption that the central processor would be flexible enough to interpret any transmitted stimulus sequence after proper training and habituation. Both hypotheses have never really been tested for practical reasons. On the one hand it is not possible to reproduce the activity of 30,000 individual nerve fibers with current electrode technology.
In fact, it is even questionable whether it is possible to reproduce the detailed activity of a single auditory nerve fiber via artificial stimulation. There are a number of fundamental physiological differences in firing patterns of acoustically versus electrically excited neurons which are hard to overcome. Spread of excitation within the cochlea and current summation are other major problems of most electrode configurations. On the other hand, the coding and transmission of spoken language requires a much larger communication channel bandwidth and more sophisticated processing than a Morse code for written text. Practical experience with cochlear implants in the past indicates that some natural relationships (such as growth of loudness and voice pitch variations) should be maintained in the encoding process. One might therefore conceive a third, more realistic, hypothesis about as follows: signal processing for cochlear implants should carefully select a subset of the total information contained in the sound signal and transform these elements into those physical stimulation parameters which can generate distinctive perceptions for the listener.

Fig. 1 Sonagram of the German word "Schein". Horizontal axis: time from 0 to 600 ms; vertical axis: frequencies from 0 to 5 kHz.

An example of the complex structure of speech sounds is given in Fig. 1, which displays the sonagram of the German word "Schein". Temporal as well as spectral properties of the speech signal are revealed by this analysis. The fricative high-frequency consonant /sch/ can be clearly distinguished from the vowel portion. Note the formant transitions during the diphthong and towards the nasal /n/. The time signal at the top shows the aperiodic waveform during the unvoiced consonant and the periodic pattern caused by the fundamental frequency of the voiced segment of the utterance.

Fig. 2 "Electrodograms" of the German word "Schein" for 6 different processing strategies (see text for explanations).

Many researchers have designed and evaluated different systems varying the number of electrodes and the amount of specific speech feature extraction and mapping transformations used [2]. Recently, Wilson et al. [10] have reported astonishing improvements in speech test performance when they provided their subjects with high-rate pulsatile stimulation patterns rather than analog broadband signals.
They attributed this effect partly to the decreased current summation obtained by non-simultaneous stimulation of different electrodes (which might otherwise have stimulated partly the same nerve fibers and thus interacted in a nonlinear fashion) and partly to a fundamentally different and maybe more natural firing pattern due to an extremely high stimulation rate. Skinner et al. [8] also found significantly higher scores on word and sentence tests in quiet and noise with a new multipeak digital speech coding strategy as compared to the formerly used F0F1F2 strategy of the Nucleus WSP (wearable speech processor). Von Wallenberg and Battmer [9] found that good performers (group I subjects) improved consonant identification scores only after six months of processor use, whereas moderate performers (group II subjects) showed an immediate significant improvement after one month. The electrode activation pattern of the MSP is displayed in Fig. 2 (top left) for the same word as in Fig. 1. The processor was programmed in bipolar mode using the thresholds of hearing and comfortable listening levels (T- and C-levels) of one of the implantees who participated in this study.

The comparisons of processors and strategies described above indicate the potential gains which may be obtained by optimizing signal processing schemes for existing implanted devices. The present study was conducted in order to explore new ideas and concepts of multichannel pulsatile speech encoding for users of the Clark/Nucleus cochlear prosthesis. Similar methods and tools can, however, be utilized to investigate alternative coding schemes for other implant systems. Portions of the results have been presented previously [1,3].

II. SIGNAL PROCESSING STRATEGIES

A cochlear implant digital speech processor (CIDSP) for the Nucleus 22-channel cochlear prosthesis has been designed using a single-chip digital signal processor (TMS320C25, Texas Instruments, [4]). For laboratory experiments the CIDSP was incorporated in a general purpose computer which provided interactive parameter control, graphical display of input/output buffers and offline speech file processing facilities. The experiments described in this paper were all conducted using the laboratory version of the CIDSP. Speech signals were processed as follows: after analog low-pass filtering (5 kHz) and analog-to-digital conversion (10 kHz), preemphasis and Hanning windowing (12.8 ms, shifted by 6.4 ms or less per analysis frame) were applied and the power spectrum was calculated via fast Fourier transform (FFT); specified speech features such as formants and voice pitch were extracted and transformed according to the selected encoding strategy; finally the stimulus parameters (electrode position, stimulation mode, pulse amplitude and duration) were generated and transmitted via inductive coupling to the implanted receiver. In addition to the generation of stimulus parameters for the cochlear implant, an acoustic signal based on a perceptive model of auditory nerve stimulation was output simultaneously.
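As an illustration of this analysis front end, the following Python sketch computes framewise band energies from a digitized speech signal. It is a minimal sketch only: the sampling rate, window length, frame shift and the logarithmic band spacing from 200 to 5000 Hz follow the description above, whereas the pre-emphasis coefficient and the choice of 22 analysis bands are assumptions made for the example, not parameters taken from the CIDSP implementation.

import numpy as np

def band_energies(x, fs=10000, n_bands=22, f_lo=200.0, f_hi=5000.0):
    # Framewise band energies: 12.8 ms Hanning window, shifted by 6.4 ms.
    frame_len = int(0.0128 * fs)                      # 128 samples at 10 kHz
    shift = frame_len // 2                            # 6.4 ms frame shift
    x = np.append(x[0], x[1:] - 0.95 * x[:-1])        # pre-emphasis (coefficient assumed)
    win = np.hanning(frame_len)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)     # logarithmically spaced band edges
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    frames = []
    for start in range(0, len(x) - frame_len + 1, shift):
        spec = np.abs(np.fft.rfft(win * x[start:start + frame_len])) ** 2
        frames.append([spec[(freqs >= lo) & (freqs < hi)].sum()
                       for lo, hi in zip(edges[:-1], edges[1:])])
    return np.array(frames)                           # shape: (n_frames, n_bands)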
Several processing strategies were implemented on this system. The first approach (PES, Pitch Excited Sampler) is based on the maximum peak channel vocoder concept, whereby the time-averaged spectral energies of a number of frequency bands (approximately third-octave bands) are transformed into appropriate electrical stimulation parameters for up to 22 electrodes (Fig. 2, middle left). The pulse rate at any given electrode is controlled by the voice pitch of the input speech signal. A pitch extractor algorithm calculates the autocorrelation function of a lowpass-filtered segment of the speech signal and searches for a peak within a specified time lag interval. A random pulse rate of about 150 to 250 Hz is used for unvoiced speech portions.

The second approach (CIS, Continuous Interleaved Sampler) uses a stimulation pulse rate which is independent of the fundamental frequency of the input signal. The algorithm continuously scans all frequency bands and samples their energy levels (Fig. 2, top and middle right). As only one electrode can be stimulated at any instant of time, the rate of stimulation is limited by the required stimulus pulse widths (determined individually for each subject) and the time needed to transmit additional stimulus parameters. As the information about the electrode number, the stimulation mode, the pulse amplitude and width is encoded by high-frequency bursts (2.5 MHz) of different durations, the total transmission time for a specific stimulus depends on all of these parameters. This transmission time can be minimized by choosing the shortest possible pulse width combined with the maximal amplitude. In order to achieve maximally high stimulation rates for those portions of the speech input signals which are assumed to be most important for intelligibility, several modifications of the basic CIS strategy were designed, of which only the two most promising, CIS-NA (Fig. 2, top right) and CIS-WF (Fig. 2, middle right), will be considered in the following. The analysis of the short-time spectra was performed either for a large number of narrow frequency bands (corresponding directly to the number of available electrodes) or for a small number (typically 6) of wide frequency bands, analogous to the approach suggested by Wilson et al. [10]. The frequency bands were logarithmically spaced from 200 to 5000 Hz in both cases. Spectral energy within any of these frequency bands was mapped to stimulus amplitude at a selected electrode as follows: all narrow-band analysis channels whose values exceeded a noise cut level (NCL) were used for CIS-NA, whereas all wide-band analysis channels, irrespective of the NCL, were mapped to preselected fixed electrodes for CIS-WF. Both schemes are supposed to minimize electrode interactions by preserving maximal spatial distances between subsequently stimulated electrodes. The first scheme (CIS-NA) emphasizes spectral resolution while the second (CIS-WF) optimizes fine temporal resolution. In both the PES and the CIS strategies a high-frequency preemphasis was applied whenever a spectral gravity measure exceeded a preset threshold.
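To make the CIS-NA channel selection concrete, the following sketch maps one frame of narrow-band energies to stimulation electrodes and amplitudes. It is illustrative only: the dB-valued noise cut level relative to the frame maximum, the linear mapping into the T- to C-level range and the simple alternating stimulation order are assumptions made for the example; the actual CIDSP mapping functions are not specified in this paper.

import numpy as np

def cis_na_frame(band_energies, t_levels, c_levels, ncl_db=30.0):
    # band_energies: one analysis frame, one narrow band per electrode
    # t_levels, c_levels: per-electrode threshold and comfortable listening levels
    levels_db = 10.0 * np.log10(band_energies + 1e-12)
    floor_db = levels_db.max() - ncl_db               # noise cut level (assumed convention)
    active = np.where(levels_db > floor_db)[0]        # only bands above the NCL are stimulated
    rel = (levels_db[active] - floor_db) / ncl_db     # 0..1 within the assumed dynamic range
    amplitudes = t_levels[active] + rel * (c_levels[active] - t_levels[active])
    # Pulses are delivered sequentially; visiting every other active electrode first
    # keeps successively stimulated electrodes spatially far apart (assumed ordering).
    idx = np.concatenate([np.arange(0, len(active), 2), np.arange(1, len(active), 2)])
    return active[idx], amplitudes[idx]

For CIS-WF, the (typically six) wide-band channels would be mapped to their preselected fixed electrodes in the same way, but without the NCL test.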
In the first experiments with these new strategies it could be shown that both types of strategies were able to provide additional useful information to users of the Nucleus cochlear implant in comparison to the standard MSP in some of the test conditions. The PES strategy resulted in somewhat lower consonant identification performance than the CIS strategies. However, the subjective quality of the processed speech and the users' ability to distinguish between different voices were higher with the PES strategy than with the CIS methods. Thus it seemed logical to search for algorithms which would combine the respective advantages of the two strategies. One of these hybrid PES/CIS coding strategies, called INT1V (integrated hybrid strategy with one voice-excited stimulation channel, Fig. 2, bottom left), excites the lowest-frequency active electrode in a voiced segment at the F0 rate (PES activity) while the remaining (maximally five) active electrodes are stimulated using CIS. For unvoiced segments, all active electrodes are stimulated using CIS. A further strategy uses F0 to determine the rate at which spectral information is presented (the stimulation period) for voiced segments. Unvoiced segments are presented at a random rate between 150 and 250 Hz. In contrast to PES, which transmits only the spectral peaks, additional spectral information is encoded during the remainder of the stimulation period at a maximally high stimulation rate. This was called the High Spectral Transmission (HST, Fig. 2, bottom right) strategy.

III. SUBJECTS

Evaluation experiments have been conducted with five postlingually deaf adult (age 16 - 50 years) cochlear implant users to date. All subjects were experienced users of their speech processors. The time since implantation ranged from 12 months (KW) to over 10 years (UT, single-channel extracochlear implantation in 1980, reimplanted after device failure in 1987), with good sentence identification (80 to 95 % correct responses) and number recognition (40 to 95 % correct responses) performance, minor open speech discrimination in monosyllabic word tests (5 to 20 % correct responses, all tests presented via computer, hearing alone) and limited use of the telephone. Two series of tests were carried out with 5 subjects participating in each series. The first series comprised a female/male distinction sentence test (20 sentences spoken by 2 male and 2 female speakers). The second series comprised a four-alternative forced-choice minimal pair test with consonants in medial position (CM2) and a four-alternative forced-choice minimal pair test with vowels in medial position (VM2). All subjects were regular users of the MSP. Confusion matrices were pooled over the five subjects for the 6 processing conditions and the 2 different speech tests and information transmission analysis was performed for the 8 resulting matrices. As the main effects were seen in the consonant results, only these data will be shown below. The same measurement procedure to determine thresholds of hearing (T-levels) and comfortable listening (C-levels) used for fitting the MSP was also used for the CIDSP strategies. Only minimal exposure to the new processing strategies was possible due to time restrictions. After about 5 to 10 minutes of listening to ongoing speech, one or two blocks of a 20-item two-digit numbers test were carried out. There was no feedback given during the test trials. All test items were presented by a second computer which also recorded the subjects' responses entered via a touch screen terminal (for multiple choice tests) or keyboard (numbers tests and monosyllabic word tests). The computer program automatically generated the confusion matrices and calculated the transmitted information for the selected phonological feature set according to the procedures described by Miller and Nicely [7]. Matrices could be analyzed individually for every subject or pooled across a number of subjects. Speech signals were either presented via loudspeaker in a sound-treated room (when patients were tested with their wearable speech processors) or processed by the CIDSP in real time and fed directly to the transmitting coil at the subject's head. Different speakers were used for the ongoing speech, the numbers test and the actual speech tests, respectively.
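The feature information transmission measure applied to these pooled matrices can be sketched as follows. This is a minimal Python illustration of the relative transmitted information of Miller and Nicely [7]; the phoneme-to-feature assignment in the usage example is hypothetical and not taken from the actual test material.

import numpy as np

def transmitted_information(confusions):
    # Relative transmitted information (%) of a stimulus/response confusion matrix.
    p = confusions / confusions.sum()                 # joint probabilities p(stimulus, response)
    px, py = p.sum(axis=1), p.sum(axis=0)             # marginal probabilities
    mask = p > 0
    mutual = np.sum(p[mask] * np.log2(p[mask] / np.outer(px, py)[mask]))
    h_stim = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return 100.0 * mutual / h_stim

def collapse_to_feature(confusions, feature_of):
    # Pool a phoneme confusion matrix over the categories of one phonological feature.
    values = sorted(set(feature_of))
    m = np.zeros((len(values), len(values)))
    for i, fi in enumerate(feature_of):
        for j, fj in enumerate(feature_of):
            m[values.index(fi), values.index(fj)] += confusions[i, j]
    return m

# Hypothetical usage: voicing feature for the stimuli /p t k b d g/.
# voicing = [0, 0, 0, 1, 1, 1]
# print(transmitted_information(collapse_to_feature(pooled_matrix, voicing)))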
IV. RESULTS AND DISCUSSION

Results of the female/male discrimination test are shown in Fig. 3. It can be noted that all five subjects scored at or below chance level for the CIS-NA strategy and nearly perfectly for all other strategies. Two subjects (EM and TH) scored only about 30 % with the MSP but 100 % with the new DSP strategies. Three subjects (UT, KW, HS) scored between 55 and 65 % for the integrated hybrid strategy (INT1V) whereas they scored between 90 and 100 % for the other strategies. Thus it appears as if the HST strategy was best able to preserve the speech quality features of the pitch-synchronous PES strategy while generating a high continuous stimulus rate at the same time.

Fig. 3 Percentage correct scores (chance level corrected) for the female/male voice discrimination test.

The results of the minimal pair speech tests shown in Fig. 4 confirm the general pattern of earlier experiments [1,3], although the differences between the MSP and the new CIDSP strategies in consonant tests were not as large as with the logatomes used previously. Every subject had performed 100 trials (4 blocks of 25 trials) per condition for both the consonant (CM2) and vowel (VM2) minimal pair tests. While it was found that for all five subjects the performance with CIS-NA, HST and also PES was better than with the MSP, it can be seen that the variation between subjects was rather large. The integrated hybrid INT1V resulted in improved performance for two subjects (EM, UT) and in worse performance for two other subjects (KW, HS). Results of the information transmission analysis for the pooled confusion matrices of the consonant minimal pair test are shown in Fig. 5. It should be noted that the minimal pair tests CM2 and VM2 contain more variation of the speech material than the logatome tests C12 and VO8 which were used previously for information transmission analysis. Thus this analysis can only provide some indications and maybe show some trends, but should not be overestimated. As every subject had performed 100 trials per condition, every pooled confusion matrix added up to 500 entries. As can be seen in Fig. 5, the total overall results for consonant tests with the subjects' own wearable speech processor were significantly lower than with the new CIDSP strategies. The pitch-synchronous coding (PES) resulted in worse performance compared to the coding without explicit pitch extraction (CIS-NA), whereas the hybrid strategies resulted in nearly the same performance as CIS-NA. This was true not only for the overall information but for all the features analyzed, except for sonorance (SON) and place (PLC), which were equally well (but not better) perceived with the PES strategy as with the high-rate strategies. Major improvements in consonant identification can be seen for the sibilance (SIB) and frication (FRI) features, which are closely related to each other and which mainly distinguish the high-frequency phonemes /s/ and /f/ from the rest. Some improvement was also evident in the place of articulation feature (PLC). PES, however, produced clearly lower voicing information transmission than the CIS strategies and was not superior to the MSP, in contrast to the previous C12 test. HST, however, was superior in voicing information transmission.
Fig. 4 Four-alternative forced-choice bisyllabic minimal pair test (CM2) with consonants in medial position: percentage correct responses (chance level corrected) per subject.

Fig. 5 Information transmission analysis of pooled confusion matrices (5 subjects, 5 processing conditions): percentage transmitted information for the total matrix and for the features voicing (VOI), nasality (NAS), sonorance (SON), sibilance (SIB), frication (FRI) and place (PLC).

Vowel identification scores, on the other hand, were not improved by modifications of the signal processing strategy. The transmission of first formant information (F1) was even worse for all CIDSP strategies compared to the MSP, which is probably related to the lower number of electrodes assigned to the first formant region in the standard CIDSP mapping. Second formant information, however, was equally well transmitted with the CIS-NA as with the MSP strategy, which can be explained by the similar spectral resolution of these two strategies. When the actual stimulus rates were examined for the five different strategies, it was found that on average the CIS and HST strategies generated about four or five times higher rates than the MSP and that PES and INT1V were somewhat in between. These analyses will be carried out in more detail in the future for different speech segments.

V. CONCLUSIONS

The above speech test results are still preliminary due to the small number of subjects and test conditions. It is, however, quite promising that new signal processing strategies can improve speech discrimination considerably during acute laboratory experiments. Consonant identification apparently may be enhanced by more detailed temporal information and specific speech feature transformations. While some of the new high-rate coding strategies produced a reduced voice quality, it was possible to improve these aspects by designing hybrid strategies. Voice pitch discrimination results were particularly good with a high spectral transmission (HST) strategy which used maximally high stimulation rates at pitch-synchronous intervals, and somewhat poorer with an integrated hybrid strategy (INT1V) which encoded pitch on only one electrode channel and used continuous high-rate stimulation on all other channels. Both HST and INT1V resulted in consonant identification performances comparable to the unmodified CIS strategy, with all three strategies being generally better than PES. Percentage correct scores for individual subjects indicated that HST was also consistently better than INT1V. The improved speech discrimination performance with HST is encouraging and will be further evaluated. Although many aspects of speech encoding can be efficiently studied using a laboratory digital signal processor, it would be desirable to allow subjects more time for adjustment to a new coding strategy. Several days or weeks of habituation are sometimes required until a new mapping can be fully exploited. Thus, for scientific as well as practical purposes, the further miniaturization of wearable DSPs will be of great importance.

ACKNOWLEDGEMENTS

This study was supported by the Swiss National Research Foundation (Grants no. 4018-10864 and 4018-10865) and Cochlear AG (Basel).

REFERENCES

1. Bögli, H. and Dillier, N. Digital Speech Processor for the Nucleus 22-Channel Cochlear Implant. Proc. IEEE EMBS 13/4:1901-1902, 1991.
2. Clark, G.M., Tong, Y.C. and Patrick, J.F. Cochlear Prostheses. Edinburgh, London, Melbourne, New York: Churchill Livingstone, 1990, pp. 1-264.
3. Dillier, N., Bögli, H. and Spillmann, T. Digital Speech Processing for Cochlear Implants. ORL 54:299-307, 1992.
4. Dillier, N., Senn, C., Schlatter, T. and Stöckli, M. Wearable digital speech processor for cochlear implants using a TMS320C25. Acta Otolaryngol (Stockh) Suppl. 469:120-127, 1990.
5. Evans, E.F. How to Provide Speech Through an Implant Device. Dimensions of the Problem: An Overview. In: Cochlear Implants, edited by Schindler, R.A. and Merzenich, M.M., 1985, pp. 167-183.
6. Lai, W.K., Dillier, N. and Bögli, H. A hybrid coding strategy for a multichannel cochlear implant. Proc. of the Symp. on Cochlear Implant: New Perspectives, Toulouse, France, 1992. In print.
7. Miller, G.A. and Nicely, P.E. An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am. 27:338-352, 1955.
8. Skinner, M.W., Holden, L.K., Holden, T.A., Dowell, R.C. et al. Performance of postlingually deaf adults with the wearable speech processor (WSP III) and mini speech processor (MSP) of the Nucleus Multi-Electrode Cochlear Implant. Ear Hear. 12/1:3-22, 1991.
9. Von Wallenberg, E.L. and Battmer, R.D. Comparative speech recognition results in eight subjects using two different coding strategies with the Nucleus 22 channel cochlear implant. Brit. J. Audiol. 25:371-380, 1991.
10. Wilson, B.S., Lawson, D.T., Finley, C.C. and Wolford, R.D. Coding strategies for multichannel cochlear prostheses. Am. J. Otol. 12, Suppl. 1:55-60, 1991.