Eurospeech 2001 - Scandinavia

Formant Estimation using Gammachirp Filterbank

Kaïs OUNI, Zied LACHIRI and Noureddine ELLOUZE
Laboratoire des Systemes et Traitement du Signal (LSTS), ENIT, BP 37, Le Belvédère 1002, Tunis, Tunisie
E-mails: kais.ouni@enit.rnu.tn, zied.lachiri@enit.rnu.tn, N.ellouze@enit.rnu.tn

Abstract

This paper proposes a new method for formant estimation based on the decomposition of the speech signal on a basis of gammachirp functions. It is a spectral analysis method performed by a gammachirp filterbank. An approach similar to modified spectrum estimation, which yields a smoothed, averaged spectrum, is adopted: instead of the uniform window commonly used in short-time Fourier analysis, a bank of gammachirp filters is applied to the signal, and a temporal average of the estimated spectra is then computed to obtain a single spectrum highlighting the formant structure. The approach is validated by applying it to synthesized vowels; the formants are detected with good accuracy compared with the values used in synthesis. The same analysis is also applied to natural vowels. All results are compared with three traditional methods, namely LPC, cepstral and short-time Fourier analysis, as well as with the same analysis performed with a gammatone filterbank. Formant tracking shows that the proposed method, based on gammachirp filters, gives a correct estimation of the formants compared with the traditional methods.

1. Introduction

Over the past forty years, and especially in the last decade, computational models of the peripheral auditory system have gained popularity in speech processing. These models have shown that, in complex speech processing applications, classical spectral analysis can be modified to advantage by adding properties of the auditory system [3][10]. The basic premise of this approach is that understanding cochlear function and central auditory processing, such as the remarkable ability of the human auditory system to detect, separate and recognize speech, will provide new insights into speech processing and will motivate novel approaches to the analysis and robust recognition of acoustic patterns [11][15].
The adoption of such auditory processes has usually led to significant improvements in performance over systems using traditional approaches such as LPC, spectral and cepstral methods [1][12]. The auditory path of a sound can be summarized as follows. When sound pressure waves impinge upon the eardrum in the outer ear, they cause vibrations that are transmitted via the stapes, at the oval window, to the fluids of the cochlea in the inner ear [2][3]. These pressure waves in turn produce mechanical displacements of the basilar membrane. The amplitude and time course of these vibrations directly reflect the amplitude and frequency content of the sound stimulus [13][15]. The mechanical displacement at any given place along the basilar membrane can be viewed as the output of a band-pass filter whose frequency response has a resonance peak at a frequency characteristic of that place. Filters with a so-called gammatone impulse response are widely used to model the cochlear filterbank [8][16]. The gammatone function has also been used to characterize the spectral analysis of the cochlear filters at moderate levels [14]. It is distinguished by a spectral bandwidth that depends on the centre frequency of its corresponding cochlear filter, measured in Equivalent Rectangular Bandwidth (ERB) [2][13]. The ERB is related to the psychophysical critical band: the critical bandwidth is about 50-100 Hz at low frequencies and grows gradually to around 20 percent of the frequency at high frequencies. Recently, Irino [8] proposed an excellent candidate for an asymmetric, level-dependent cochlear filter: the gammachirp filter. It is an extension of the gammatone filter with an additional chirp term that produces an asymmetric amplitude spectrum. In this work, we use the gammachirp filter to estimate formants with an approach similar to modified spectrum estimation. In the following sections, we present a time-frequency study of the gammachirp function; we then introduce the gammachirp spectrum estimation approach and its validation for formant estimation.

2. The Gammachirp Filter

The gammachirp filter is a good approximation to the frequency-selective behaviour of the cochlea. It is an auditory filter that introduces the asymmetry and level-dependent characteristics of the cochlear filters, and it can be considered a generalization and improvement of the gammatone filter. The gammachirp filter is defined in the temporal domain by the real part of the complex function g_c(t) [8][9]:

  g_c(t) = A_{n,B} \, t^{n-1} \exp(-2\pi B t) \exp\big(j 2\pi f_0 t + j c \ln(t) + j\varphi\big)    (1)

with B = b \cdot \mathrm{ERB}(f_0) and

  \mathrm{ERB}(f_0) = 24.7 + 0.108 f_0  (in Hz)    (2)

which is the equivalent rectangular bandwidth at frequency f_0. Here n is the filter order, f_0 is the modulation (centre) frequency, A_{n,B} is a normalization constant, b is a parameter defining the envelope of the gamma distribution, and c is a parameter controlling the chirp rate.

2.1. Energy

The energy of the impulse response g_c(t) is obtained with the following expression:

  E_{n,B} = \|g_c\|^2 = \langle g_c, g_c \rangle = \int_{-\infty}^{+\infty} |g_c(t)|^2 \, dt    (3)

  E_{n,B} = A_{n,B}^2 \, \frac{\Gamma(2n-1)}{(4\pi B)^{2n-1}}    (4)

where \Gamma denotes the gamma function. Thus, for energy normalization, the normalization constant must be:

  A^{E}_{n,B} = \sqrt{\frac{(4\pi B)^{2n-1}}{\Gamma(2n-1)}}    (5)

Figure 2 shows examples of energy-normalized gammachirp filters.

Figure 2: Example of gammachirp filters centred on different frequencies.
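As an illustration of Eqs. (1), (2) and (5), the following minimal Python sketch generates the real part of a discrete-time, energy-normalized gammachirp. It is our own illustration, not the authors' implementation; the function name gammachirp, the default sampling rate of 11025 Hz and the window length n_samples are assumptions, and the normalization constant is the continuous-time one of Eq. (5).

    # Illustrative sketch (not the authors' code): energy-normalized gammachirp
    # impulse response of Eq. (1), with B = b.ERB(f0) as in Eq. (2) and the
    # normalization constant of Eq. (5).
    import math
    import numpy as np

    def gammachirp(f0, fs=11025, n=3, b=1.0, c=1.5, phi=0.0, n_samples=512):
        """Real part of the energy-normalized complex gammachirp centred at f0 (Hz)."""
        B = b * (24.7 + 0.108 * f0)                       # B = b.ERB(f0), Eq. (2)
        A = math.sqrt((4.0 * math.pi * B) ** (2 * n - 1)  # continuous-time constant,
                      / math.gamma(2 * n - 1))            # Eq. (5)
        t = np.arange(1, n_samples + 1) / fs              # t > 0, so ln(t) is defined
        envelope = A * t ** (n - 1) * np.exp(-2.0 * math.pi * B * t)
        phase = 2.0 * math.pi * f0 * t + c * np.log(t) + phi
        return envelope * np.cos(phase)                   # real part of Eq. (1)

    # Example: one filter of the bank, centred at 1 kHz
    g = gammachirp(1000.0)

The default values n = 3, b = 1 and c = 1.5 follow the choices made in Sections 2.4 and 3.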
2.2. Frequency response

The Fourier transform of g_c(t) is:

  G_c(f) = \int_{-\infty}^{+\infty} g_c(t) \exp(-j 2\pi f t) \, dt = \frac{A_{n,B} \, \Gamma(n + jc) \, e^{j\varphi}}{\big(2\pi [B + j(f - f_0)]\big)^{n}}    (6)

The frequency response magnitude, when the amplitude is normalized, is then given by:

  |G_c(f)| = \frac{e^{c\theta}}{\big(2\pi \sqrt{B^2 + (f - f_0)^2}\big)^{n}}    (7)

with

  \theta = \arctan\!\left(\frac{f - f_0}{B}\right)    (8)

and the peak frequency of the amplitude spectrum is shifted from f_0 by cB/n [8].

2.3. Time-Frequency spread

Time and frequency energy concentrations are restricted by the Heisenberg uncertainty principle. The average location of a one-dimensional wave function g_c \in L^2(\mathbb{R}) is:

  u = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} t \, |g_c(t)|^2 \, dt    (9)

and the average frequency is:

  \xi = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} f \, |G_c(f)|^2 \, df    (10)

The variances around these average values are, respectively:

  \sigma_t^2 = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} (t - u)^2 \, |g_c(t)|^2 \, dt    (11)

and:

  \sigma_f^2 = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} (f - \xi)^2 \, |G_c(f)|^2 \, df    (12)

In the case of the energy-normalized gammachirp function we obtain:

  u = \frac{2n-1}{4\pi B}    (13)

  \xi = f_0    (14)

  \sigma_t^2 = \frac{2n-1}{(4\pi B)^2}    (15)

  \sigma_f^2 = \frac{B^2}{2n-3}    (16)

We remark that these values are the same as in the case of the gammatone function. The temporal and frequency variances of g_c satisfy the Heisenberg uncertainty principle for n > 3/2:

  \sigma_t^2 \, \sigma_f^2 = \frac{2n-1}{2n-3} \cdot \frac{1}{(4\pi)^2} \ge \frac{1}{(4\pi)^2}    (17)

2.4. The Equivalent Rectangular Bandwidth (ERB)

The ERB of a band-pass filter is defined as the width of a rectangular filter with the same peak gain and the same impulse response energy:

  \mathrm{ERB} \cdot |G_c(f_0)|^2 = E_{n,B}    (18)

From (4) and (7) it follows that:

  \mathrm{ERB} \cdot \frac{A_{n,B}^2 \, \Gamma^2(n)}{(2\pi B)^{2n}} = E_{n,B}    (19)

Then, for the energy-normalized gammachirp:

  \mathrm{ERB} = \frac{2\pi B \, \Gamma(2n-1)}{2^{2n-1} \, \Gamma^2(n)}    (20)

Comparing the ERB to the root mean square frequency spread \sigma_f, by inserting its expression from (16), it follows that:

  \frac{\mathrm{ERB}}{\sigma_f} = \frac{2\pi \sqrt{2n-3} \, \Gamma(2n-1)}{2^{2n-1} \, \Gamma^2(n)}    (21)

It can be seen that for n = 3, \mathrm{ERB} \approx 2\sigma_f. For this reason, we take this value of n in (1).

3. Gammachirp Spectrum Estimation

In this work, a gammachirp spectrum estimation is performed to estimate formants in a multiresolution, auditory context. An approach similar to modified spectrum estimation, which yields a smoothed, averaged spectrum, is adopted. Instead of a uniform window such as the Hamming or Blackman window, which is commonly used in short-time Fourier analysis and has a constant time-frequency resolution, the gammachirp function, whose time-frequency resolution is not constant but keeps a constant quotient Q = f_0/B, is used as the analysis window, with the channels placed on a linear frequency scale. The modified estimator has the following expression [12]:

  R_x(f) = \frac{1}{K} \sum_{k=1}^{K} R_{x_k}(f)    (22)

where K = L/M is the number of sections, L is the length of the signal and M is the length of each section, and

  R_{x_k}(f) = \frac{1}{MP} \left| \sum_{l=0}^{M-1} x_k(l) \, w(l) \, \exp(-j 2\pi f l) \right|^2    (23)

is the corresponding simple spectrum estimator on section k, where

  P = \frac{1}{M} \sum_{l=0}^{M-1} |w(l)|^2    (24)

is the normalization factor introduced to make the estimator asymptotically unbiased. The window w(l) in (23) is then replaced by the energy-normalized gammachirp g_c(l) of (1), centred at f_0. The modified estimator becomes:

  R_{x_k}(f_0) = \left| \sum_{l=0}^{M-1} x_k(l) \, g_c(l) \, \exp(-j 2\pi f_0 l) \right|^2    (25)

A bank of 512 channels placed on a linear frequency scale is then computed using expression (25), and each channel is corrected in its peak frequency by cB/n (with c = 1.5 and b = 1), to obtain one spectrum highlighting the formant structure.
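The averaged estimator of Eqs. (22)-(25) can be sketched as follows. This is an illustrative implementation under our own assumptions, not the authors' code: it reuses the gammachirp window sketched in Section 2, takes sections of M = 512 samples, places the 512 channels linearly up to half the sampling rate, and writes the complex exponential of Eq. (25) with frequencies normalized by the sampling rate; the function name gammachirp_spectrum is hypothetical.

    # Illustrative sketch of the gammachirp spectrum: Eq. (25) on each section of
    # M samples, averaged over the K = L/M sections as in Eq. (22), with the
    # channel peak frequencies corrected by c*B/n as described in the text.
    import numpy as np

    def gammachirp_spectrum(x, fs=11025, n_channels=512, M=512, n=3, b=1.0, c=1.5):
        K = len(x) // M                                      # number of sections, K = L/M
        f_centres = np.linspace(0.0, fs / 2.0, n_channels)   # linear frequency placement
        l = np.arange(M)
        spectrum = np.zeros(n_channels)
        for i, f0 in enumerate(f_centres):
            # energy-normalized gammachirp window centred at f0 (sketch in Section 2)
            g = gammachirp(f0, fs=fs, n=n, b=b, c=c, n_samples=M)
            kernel = g * np.exp(-2j * np.pi * f0 * l / fs)   # g_c(l) exp(-j 2 pi f0 l / fs)
            # Eq. (25) on each section, then the average of Eq. (22)
            R = [abs(np.dot(x[k * M:(k + 1) * M], kernel)) ** 2 for k in range(K)]
            spectrum[i] = np.mean(R)
        B = b * (24.7 + 0.108 * f_centres)                   # correct each channel's peak
        return f_centres + c * B / n, spectrum               # frequency by c*B/n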
4. Validation

The approach described above is validated, for formant estimation, by applying it to the synthesized vowels /a/, /i/ and /u/ in noise-free conditions, with an 11025 Hz sampling rate and a pitch frequency of 100 Hz. The resulting temporal averages of the estimated spectra are compared to the spectra given by three classical methods commonly used for formant detection. The first is the LPC method, based on a model of the vocal tract with 17 linear prediction coefficients. The second is the smoothed cepstral method, which smooths the cepstral coefficients in order to smooth the estimated spectra. The third is the most classical one, the short-time Fourier spectrum. Table 1 shows the values used in synthesis, and Figures 4, 5 and 6 show the gammachirp temporal-average spectra compared to the LPC, smoothed-cepstrum and short-time Fourier spectra.

Table 1: The different parameters given in synthesis.

            /a/        /i/        /u/
  Pitch     100 Hz     100 Hz     100 Hz
  F1        730 Hz     270 Hz     300 Hz
  F2        1090 Hz    2290 Hz    870 Hz
  F3        2440 Hz    3010 Hz    2240 Hz

Figure 4: The gammachirp spectrum of the synthesized vowel /a/ compared to the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 5: The gammachirp spectrum of the synthesized vowel /i/ compared to the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 6: The gammachirp spectrum of the synthesized vowel /u/ compared to the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

In the same way, the analysis is applied to the natural vowels /a/, /i/ and /u/, spoken by a male speaker in environmental noise and recorded at 11025 Hz. All results are compared to the formants given by the same three traditional methods. Formant tracking shows that this method gives a correct estimation of the formants compared to the traditional methods. It also detects the beginning of voiced speech sounds through the first harmonics of the pitch, which can be used for voiced-speech detection. Moreover, it yields different formant bandwidths, which means that this auditory approach gives, in addition to the positions of the formants, the bandwidths as perceived by the auditory system. It also gives different ratios between the relative formant amplitudes (F1/F2 and F2/F3) from those obtained with the three classical methods. Figures 7, 8 and 9 illustrate the comparison between the gammachirp spectra obtained for the natural vowels /a/, /i/ and /u/ and those obtained by the LPC, smoothed-cepstrum and short-time Fourier methods.

Figure 7: The gammachirp spectrum of the natural vowel /a/ compared to the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 8: The gammachirp spectrum of the natural vowel /i/ compared to the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 9: The gammachirp spectrum of the natural vowel /u/ compared to the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.
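To make the comparison with Table 1 concrete, the following sketch reads formant candidates off the averaged gammachirp spectrum by picking its strongest well-separated local maxima. This post-processing step is our own illustration and is not described in the paper; the function name formant_candidates, the dB conversion and the minimum separation min_sep_hz are assumptions, and it relies on the gammachirp_spectrum sketch of Section 3.

    # Illustrative formant picking (our own post-processing, not the paper's):
    # keep the n_formants strongest local maxima of the averaged spectrum,
    # enforcing a minimum separation between the retained peaks.
    import numpy as np

    def formant_candidates(freqs, spectrum, n_formants=3, min_sep_hz=200.0):
        log_s = 10.0 * np.log10(spectrum + 1e-12)         # dB envelope
        peaks = [i for i in range(1, len(log_s) - 1)
                 if log_s[i] > log_s[i - 1] and log_s[i] >= log_s[i + 1]]
        peaks.sort(key=lambda i: log_s[i], reverse=True)  # strongest peaks first
        chosen = []
        for i in peaks:
            if all(abs(freqs[i] - freqs[j]) > min_sep_hz for j in chosen):
                chosen.append(i)
            if len(chosen) == n_formants:
                break
        return sorted(freqs[i] for i in chosen)           # candidates, low to high

    # e.g. for a synthesized /a/ signal x sampled at 11025 Hz:
    # f, S = gammachirp_spectrum(x)
    # print(formant_candidates(f, S))   # to be compared with the /a/ values in Table 1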
5. Conclusion

In this work, we present a spectral analysis for formant estimation performed by a gammachirp filterbank. This spectral analysis is based on an auditory-modified spectral estimation. The method is validated by applying it, for formant detection, to both synthesized and natural vowels. The obtained spectra are compared to the LPC, smoothed-cepstrum and short-time Fourier spectra, and the comparison shows that the method gives a correct formant estimation. It also detects voiced speech sounds and exhibits some spectral characteristics that differ from those of the classical methods. Finally, the method appears to estimate the formants with less variance; to test the validity of this assumption, we intend to apply it to a large database in future work.

6. References

[1] Rabiner, L. R. and Schafer, R. W., "Digital Processing of Speech Signals", Prentice-Hall, Englewood Cliffs, NJ, 1978.
[2] Zwicker, E. and Feldtkeller, R., "Psychoacoustique", Masson, 1981.
[3] Flanagan, J. L., "Speech Analysis, Synthesis and Perception", Second Edition, Springer-Verlag, Berlin, 1972.
[4] Nakagawa, S., Shikano, K. and Tohkura, Y., "Speech, Hearing and Neural Network Models", Ohmsha / IOS Press, 1995.
[5] Mallat, S., "A Wavelet Tour of Signal Processing", Second Edition, Academic Press, 1998.
[6] Patterson, R. D., "Auditory filter shapes derived with noise stimuli", J. Acoust. Soc. Amer., Vol. 59, 1976, pp. 640-654.
[7] Irino, T. and Kawahara, H., "Signal reconstruction from modified auditory wavelet transform", IEEE Transactions on Signal Processing, Vol. 41, No. 12, December 1993.
[8] Irino, T., "A gammachirp function as an optimal auditory filter with the Mellin transform", Proc. IEEE ICASSP 96, Atlanta, May 7-10, 1996, pp. 981-984.
[9] Irino, T. and Unoki, M., "An analysis/synthesis auditory filterbank based on an IIR implementation of the gammachirp", accepted to J. Acoust. Soc. Jpn.
[10] Lyon, R. F., "All-pole models of auditory filtering", in Diversity in Auditory Mechanics, Lewis et al. (eds.), World Scientific Publishing, Singapore, 1997, pp. 205-211.
[11] Johannesma, P. I. M., "The pre-response stimulus ensemble of neurons in the cochlear nucleus", Symposium on Hearing Theory, 1972, pp. 58-69.
[12] Greenberg, S., "The ear as a speech analyzer", Journal of Phonetics, 16, 1988, pp. 139-149.
[13] Shamma, S., "The acoustic features of speech sounds in a model of auditory processing: vowels and voiceless fricatives", Journal of Phonetics, 16, 1988, pp. 77-91.
[14] Ghitza, O., "Auditory models and human performance in tasks related to speech coding and speech recognition", IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, Part II, January 1994, pp. 115-131.
[15] Van Compernolle, D., "Development of a Computational Auditory Model", IPO Technical Report, Instituut voor Perceptie Onderzoek, Eindhoven, February 4, 1991.
[16] Yang, X., Wang, K. and Shamma, S. A., "Auditory representations of acoustic signals", Technical Research Report TR 91-16r1, University of Maryland, College Park.
[17] Solbach, L., "An Architecture for Robust Partial Tracking and Onset Localization in Single Channel Audio Signal Mixes", Doctor-Engineer Dissertation, Distributed Systems Department, Technical University of Hamburg-Harburg, Germany, 1998.
[18] Cohen, L., "The scale representation", IEEE Transactions on Signal Processing, 41, December 1993, pp. 3275-3292.