Real-time Implementation of Multi-band Frequency Compression for Listeners with Moderate Sensorineural Impairment
Prem C. Pandey, Pandurangarao N. Kulkarni, Nitya Tiwari
{pcpandey, nitya} @ ee.iitb.ac.in, pnk_bewoor @ yahoo.com
EE Dept., IIT Bombay
March 2013

Outline
1. Introduction
2. Signal processing
3. Real-time implementation
4. Results
5. Conclusion

1. Introduction

Speech production mechanism
Excitation source & filter model
• Excitation: voiced/unvoiced glottal, frication
• Filtering: vocal tract filter

Speech segments
• Words
• Syllables
• Phonemes
• Sub-phonemic segments
Phonemes: basic speech units
• Vowels: pure vowels, diphthongs
• Consonants: semivowels, stops, fricatives, affricates, nasals
Example stimuli: /aba/, /ada/, /apa/, /aga/

Phonemic features
• Modes of excitation
  – Glottal: unvoiced (aspiration, constriction at the glottis), voiced (vibration of the vocal cords)
  – Frication: unvoiced (constriction in the vocal tract), voiced (constriction in the vocal tract & glottal vibration)
• Movement of articulators
  – Continuant (steady-state vocal tract configuration): vowels, nasal stops, fricatives
  – Non-continuant (changing vocal tract): diphthongs, semivowels, oral stops (plosives)
• Place of articulation (place of maximum constriction in the vocal tract): bilabial, labio-dental, linguo-dental, alveolar, palatal, velar, glottal
• Changes in voicing frequency (F0)
Supra-segmental features
• Intonation
• Rhythm

Hearing mechanism
Peripheral auditory system
• External ear (sound collection): pinna, auditory canal
• Middle ear (impedance matching): ear drum, middle ear bones
• Inner ear (analysis and transduction): cochlea
• Auditory nerve (transmission of neural impulses)
Central auditory system (information interpretation)

Figure: the auditory system; tonotopic map of the cochlea
Hearing impairment

Types of hearing loss
• Conductive loss
• Central loss
• Sensorineural loss
• Functional loss

Sensorineural hearing loss
• Elevated hearing thresholds: reduced intelligibility, as speech components become inaudible
• Reduced dynamic range & loudness recruitment (abnormal loudness growth): distortion of loudness relationships among speech components
• Increased temporal masking: poor detection of acoustic events
• Increased spectral masking (due to widening of auditory filters): reduced frequency selectivity and reduced ability to sense spectral shapes of speech sounds
>> Poor intelligibility and degraded perception of speech

Signal processing in hearing aids
Currently available
• Frequency-selective amplification: improves audibility but may not improve intelligibility in the presence of noise
• Automatic volume control
• Multichannel dynamic range compression (settable attack time, release time, and compression ratios): compresses the natural dynamic range into the listener's reduced dynamic range
Under investigation
• Improvement of consonant-to-vowel ratio (CVR): for reducing the effects of increased temporal masking
• Techniques for reducing the effects of increased spectral masking: binaural dichotic presentation, spectral contrast enhancement, multi-band frequency compression
• Noise suppression

Signal processing for reducing the effects of increased intra-speech spectral masking
• Binaural hearing (moderate bilateral loss)
  – Binaural dichotic presentation (Lunner et al. 1993, Kulkarni et al. 2012)
• Monaural hearing
  – Spectral contrast enhancement (Yang et al. 2003): errors in identifying peaks and an increased dynamic range >> degraded speech perception
  – Multi-band frequency compression (Arai et al. 2004, Kulkarni et al. 2012)
Multi-band frequency compression
Presentation of speech energy in relatively narrow bands, to avoid masking by adjacent spectral components and without causing perceptible distortions.

Research objective
Implementation of multi-band frequency compression for real-time operation, for use in hearing aids
• Reduction in computational requirement for implementation on a DSP chip, without degradation of speech quality
• Low algorithmic and computational delay for real-time implementation on a DSP board with a 16-bit fixed-point processor

2. Signal Processing

Signal processing for multi-band frequency compression
Processing steps
• Segmentation and spectral analysis
• Spectral modification
• Resynthesis
Multi-band frequency compression (Arai et al. 2004)
• Processing using auditory critical bandwidth
• Compressed magnitude & original phase spectrum >> complex spectrum
Multi-band frequency compression (Kulkarni et al. 2009, 2012)
• Investigations using different bandwidths, segmentations & frequency mappings
• Processing using the complex spectrum
• Reduced computation and processing artifacts

Multi-band frequency compression (Kulkarni et al. 2009, 2012)
Signal processing
• Segmentation using an analysis window with 50% overlap
  – Fixed-frame (FF) segmentation: window length L = 20 ms
  – Pitch-synchronous (PS) segmentation: L = 2 pitch periods (voiced) / previous L (unvoiced)
• Zero padding >> N-point frame
• Complex spectra by N-point DFT
• Analysis bands for compression: fixed bandwidth, one-third octave, or auditory critical band (ACB)
• Mappings for spectral modification: sample-to-sample, spectral sample superposition, or spectral segment
• Resynthesis by the overlap-add method
Figure: frequency mapping with compression factor α = 0.6 (Kulkarni et al. 2012)
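As an illustrative sketch of the fixed-frame processing chain above (windowed segmentation with 50% overlap, zero-padded DFT, band-wise spectral modification, overlap-add resynthesis): the function names, band edges, and window below are placeholders of ours, and the mapping shown is a simple sample-to-sample pull toward each band centre, not the full spectral segment mapping of the reported system.

```python
import numpy as np

def compress_spectrum(X, alpha=0.6, band_edges=(1, 20, 45, 80, 130, 257)):
    """Toy band-wise frequency compression of a half-spectrum X.

    Each sample k within a band takes its value from the sample 1/alpha
    times farther from the band centre, pulling spectral energy toward
    the centre.  The reported system instead uses 18 auditory-critical-band
    segments and spectral segment mapping (Kulkarni et al. 2012).
    """
    Y = X.copy()
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        kc = (lo + hi) // 2                        # band-centre sample
        for k in range(lo, hi):
            src = int(round(kc + (k - kc) / alpha))
            Y[k] = X[min(max(src, lo), hi - 1)]    # clamp source to the band
    return Y

def mbfc_fixed_frame(x, L=200, S=100, N=512, alpha=0.6):
    """Fixed-frame MBFC: window, zero-padded DFT, modify, overlap-add."""
    win = np.hanning(L)                            # placeholder analysis window
    y = np.zeros(len(x) + N)
    for start in range(0, len(x) - L + 1, S):      # 50% overlap for S = L/2
        X = np.fft.rfft(x[start:start + L] * win, N)
        y[start:start + N] += np.fft.irfft(compress_spectrum(X, alpha), N)
    return y[:len(x)]
```

With α = 1 the mapping is the identity and the overlap-add output approximately reconstructs the input, which is a convenient sanity check before listening tests.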
Details of signal processing
Analysis bands for compression
• 18 bands based on ACB
Mapping for spectral modification
• Spectral segment mapping: no irregular variations in the spectrum

Spectral segment mapping (Kulkarni et al. 2012): the k-th sample of the modified spectrum Y is obtained by summing the input spectral samples X over the segment [a, b] mapped from it:
  a = k_ic − (k_ic − (k − 0.5))/α,  b = a + 1/α
  Y(k) = (m − a) X(m) + Σ_{j = m+1}^{n−1} X(j) + (b − n + 1) X(n),  with m = ⌈a⌉, n = ⌈b⌉
where k_ic is the reference (centre) sample of band i and α is the compression factor.

Resynthesis
• Resynthesis by N-point IDFT
• Reconstruction by the overlap-add method
Figure: unprocessed and processed (α = 0.6) spectra of vowel /i/ and broadband noise

Evaluation and results
• Modified Rhyme Test (MRT): best results with PS segmentation, ACB, spectral segment mapping & compression factor of 0.6
• Normal-hearing subjects with loss simulated by additive noise
  – 17% improvement in recognition scores
  – 0.88 s decrease in response time
• Subjects with moderate sensorineural loss
  – 16.5% improvement in recognition scores
  – 0.89 s decrease in response time

Real-time implementation requirements
Comparison of processing schemes with different segmentation
• FF processing: fixed-frame segmentation
  – Discontinuities masked by 50% overlap-add
  – Perceptible distortions in the form of another superimposed pitch, related to the shift duration
• PS processing: pitch-synchronous segmentation
  – Less perceptible distortions
  – Pitch estimation (large computational requirement)
  – Not suitable for music
Modified segmentation scheme for real-time implementation
• Lower perceptual distortions
• Lower computational requirement

Modified multi-band frequency compression (LSEE processing)
• Griffin and Lim's method of signal estimation from modified STFT
• Minimizes the mean square error between the estimated and modified STFT
• Multiplication by the analysis window before overlap-add
• Window requirement: Σ_m w²(mS + n) = 1
• For partial overlap, only a few windows meet this requirement
• Fixed window length L and shift S, with S = L/4 (75% overlap)
• Modified Hamming window:
  w(n) = [p + q cos(2π(n + 0.5)/L)] / √(4p² + 2q²), where p = 0.54 and q = −0.46

Investigations using offline implementation
• Comparison of analysis-synthesis techniques for multi-band frequency compression
  – Pitch-synchronous implementation
  – LSEE method based implementation
• Evaluation
  – Informal listening
  – Objective evaluation using the PESQ measure (0 – 4.5)
• Test material: sentence "Where were you a year ago?" from a male speaker, vowels /a/-/i/-/u/, music
• Results: PESQ scores of 4.3 – 3.7 (sentence/vowels) over the compression factor range 1 – 0.6
Figure: comparison of analysis-synthesis methods: (a) unprocessed, (b) FF, (c) PS, and (d) LSEE. Processing with spectral segment mapping, auditory critical bandwidth, α = 0.6.

Processed outputs from offline implementation
• Test signals: vowels /a i u/, "Where were you a year ago?", music clip
• Processing: fixed-frame MFC, pitch-synchronous MFC, and fixed-frame LSEE, each with α = 1, 0.8, 0.6
• Conclusion: LSEE processing is suited for non-speech and audio applications as well

3. Real-time Implementation

16-bit fixed-point DSP: TI TMS320C5515
• 16 MB memory space: 320 KB on-chip RAM with 64 KB DARAM, 128 KB on-chip ROM
• Three 32-bit programmable timers, 4 DMA controllers each with 4 channels
• FFT hardware accelerator (8 to 1024-point FFT)
• Max. clock speed: 120 MHz
DSP board: eZdsp
• 4 MB on-board NOR flash for user program
• Codec TLV320AIC3204: stereo ADC & DAC, 16/20/24/32-bit quantization, 8 – 192 kHz sampling
• Development environment for C: TI's Code Composer Studio (CCStudio), ver. 4.0
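The overlap condition Σ_m w²(mS + n) = 1 for the modified Hamming window above can be checked numerically. A small sketch, using the 260-sample window and S = L/4 shift of the implementation:

```python
import numpy as np

# Modified Hamming window (p = 0.54, q = -0.46), scaled so that the
# squared windows shifted by S = L/4 sum to one (75% overlap).
L = 260                      # 26 ms at 10 kHz sampling
S = L // 4                   # 65-sample shift
p, q = 0.54, -0.46
n = np.arange(L)
w = (p + q * np.cos(2 * np.pi * (n + 0.5) / L)) / np.sqrt(4 * p**2 + 2 * q**2)

# LSEE window requirement: sum_m w^2(mS + n) = 1 for every n in one shift
cola = sum(w[m * S:(m + 1) * S] ** 2 for m in range(4))
print(np.allclose(cola, 1.0))   # -> True
```

The condition holds exactly because the four L/4-shifted copies of cos(2π(n + 0.5)/L) and cos(4π(n + 0.5)/L) cancel, leaving the constant 4p² + 2q², which the scaling factor normalizes to one.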
Signal acquisition & processing
Implementation details
• Sampling rate: 10 kHz
• ADC & DAC quantization: 16-bit
• FFT length N: 1024 / 512
• Analysis-synthesis: LSEE-based fixed frame, 260-sample (26 ms) window, 75% overlap
• Mapping for spectral modification: spectral segment mapping with auditory critical bands
• Input samples, spectral values & processed output: 32 bits each, with 16-bit real and imaginary parts

Data transfer and buffering operations (S = L/4)
• DMA cyclic buffers
  – 5-block input buffer and 2-block output buffer (each block of S samples)
• Four pointers: current input block, current output block, just-filled input block, write-to output block (pointers incremented cyclically on DMA interrupt)
• Efficient realization of 75% overlap & zero padding
• Total delay
  – Algorithmic delay: L = 4S samples (26 ms)
  – Computational delay: L/4 = S samples (6.5 ms)

4. Results

Comparison of processed output from the offline implementation & the DSP board
• Test material: sentence "Where were you a year ago?", vowels /a/-/i/-/u/, music
• Evaluation methods: PESQ measure (0 – 4.5), informal listening
• Results
  – Lowest clock for satisfactory operation: 20 MHz / 12 MHz (N = 1024 / 512)
  – No perceptible change in output with N = 512; processing capacity used ≈ 1/10 of the capacity at the highest clock (120 MHz)
  – Informal listening: processed output from the DSP similar to the corresponding output from the offline implementation
  – PESQ scores over the compression factor range 0.6 – 1: 2.5 – 3.4 for the sentence & 3.0 – 3.4 for the vowels
  – Total processing delay: 35 ms

Processed outputs from the DSP board (fixed-frame LSEE MFC, α = 1, 0.8, 0.6): vowels /a/-/i/-/u/, "Where were you a year ago?", music clip
5. Conclusion

• Multi-band frequency compression using LSEE processing
  – Processed speech output perceptually similar to that from the earlier established pitch-synchronous processing
  – Suitable for speech as well as non-speech audio signals
  – Well suited for real-time processing due to the use of fixed-frame segmentation
• Implementation for real-time operation using the 16-bit fixed-point processor TI TMS320C5515: processing capacity used ≈ 1/10, delay = 35 ms
• Further work: combining multi-band frequency compression with other types of processing for use in hearing aids
  – Frequency-selective gain
  – Multi-band dynamic range compression (settable gain and compression ratios)
  – Noise reduction
  – CVR modification

THANK YOU

Abstract
Widening of auditory filters in persons with sensorineural hearing impairment leads to increased spectral masking and degraded speech perception. Multi-band frequency compression of the complex spectral samples using pitch-synchronous processing has been reported to increase speech perception by persons with moderate sensorineural loss. It is shown that implementation of multi-band frequency compression using fixed-frame processing along with least-squares-error-based signal estimation reduces the processing delay, and the speech output is perceptually similar to that from pitch-synchronous processing. The processing is implemented on a DSP board based on the 16-bit fixed-point processor TMS320C5515, and real-time operation is achieved using about one-tenth of its computing capacity.

References
[1] H. Levitt, J. M. Pickett, and R. A. Houde (eds.), Sensory Aids for the Hearing Impaired. New York: IEEE Press, 1980.
[2] B. C. J. Moore, An Introduction to the Psychology of Hearing. London, UK: Academic, 1997, pp. 66–107.
[3] J. M. Pickett, The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology.
Boston, Mass.: Allyn Bacon, 1999, pp. 289–323.
[4] H. Dillon, Hearing Aids. New York: Thieme Medical, 2001.
[5] T. Lunner, S. Arlinger, and J. Hellgren, "8-channel digital filter bank for hearing aid use: preliminary results in monaural, diotic, and dichotic modes," Scand. Audiol. Suppl., vol. 38, pp. 75–81, 1993.
[6] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, "Binaural dichotic presentation to reduce the effects of spectral masking in moderate bilateral sensorineural hearing loss," Int. J. Audiol., vol. 51, no. 4, pp. 334–344, 2012.
[7] T. Baer, B. C. J. Moore, and S. Gatehouse, "Spectral contrast enhancement of speech in noise for listeners with sensorineural hearing impairment: effects on intelligibility, quality, and response times," Int. J. Rehab. Res., vol. 30, no. 1, pp. 49–72, 1993.
[8] J. Yang, F. Luo, and A. Nehorai, "Spectral contrast enhancement: Algorithms and comparisons," Speech Commun., vol. 39, no. 1–2, pp. 33–46, 2003.
[9] T. Arai, K. Yasu, and N. Hodoshima, "Effective speech processing for various impaired listeners," in Proc. 18th Int. Cong. Acoust., Kyoto, Japan, 2004, pp. 1389–1392.
[10] K. Yasu, M. Hishitani, T. Arai, and Y. Murahara, "Critical-band based frequency compression for digital hearing aids," Acoust. Sci. Technol., vol. 25, no. 1, pp. 61–63, 2004.
[11] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, "Multi-band frequency compression for reducing the effects of spectral masking," Int. J. Speech Tech., vol. 10, no. 4, pp. 219–227, 2009.
[12] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, "Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss," Speech Commun., vol. 54, no. 3, pp. 341–350, 2012.
[13] E. Zwicker, "Subdivision of the audible frequency range into critical bands (Frequenzgruppen)," J. Acoust. Soc. Am., vol. 33, no. 2, p. 248, 1961.
[14] D. G. Childers and H. T.
Hu, "Speech synthesis by glottal excited linear prediction," J. Acoust. Soc. Am., vol. 96, no. 4, pp. 2026–2036, 1994.
[15] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, 1984.
[16] X. Zhu, G. T. Beauregard, and L. L. Wyse, "Real-time signal estimation from modified short-time Fourier transform magnitude spectra," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 5, pp. 1645–1653, 2007.
[17] Texas Instruments Inc., TMS320C5515 Fixed-Point Digital Signal Processor. 2011, [online] Available: http://focus.ti.com/lit/ds/symlink/tms320c5515.pdf.
[18] Texas Instruments Inc., TLV320AIC3204 Ultra Low Power Stereo Audio Codec. 2008, [online] Available: http://focus.ti.com/lit/ds/symlink/tlv320aic3204.pdf.