Eurospeech 2001 - Scandinavia
Formant Estimation using Gammachirp Filterbank
Kaïs OUNI, Zied LACHIRI and Noureddine ELLOUZE
Laboratoire des Systemes et Traitement du Signal ( LSTS )
ENIT, BP 37, Le Belvédère 1002, Tunis, Tunisie
E-mails : kais.ouni@enit.rnu.tn ; zied.lachiri@enit.rnu.tn & N.ellouze@enit.rnu.tn
Abstract
This paper proposes a new method for formant estimation based on a decomposition of the speech signal over a set of gammachirp functions. It is a spectral analysis method performed by a gammachirp filterbank. An approach similar to modified spectrum estimation, which yields a smoothed and averaged spectrum, is adopted: instead of the uniform window commonly used in short-time Fourier analysis, a bank of gammachirp filters is applied to the signal. A temporal average of the estimated spectra is then computed to obtain a single spectrum highlighting the formant structure. The approach is validated on synthesized vowels; the formants are detected with good accuracy compared with the values used in synthesis. The same analysis is also applied to natural vowels. All results are compared with three traditional methods, LPC, cepstral and spectral analysis, and with the same analysis performed with a gammatone filterbank. The formant tracking shows that this method, which is based on gammachirp filters, gives a correct estimation of the formants compared with traditional methods.
1. Introduction

Over the past forty years, and especially in the last decade, computational models of the peripheral auditory system have gained popularity in speech processing. These models have shown that in complex speech processing applications, classical spectral analysis can be modified to one's advantage by adding properties of the auditory system [3][10]. The basic premise of this approach is that understanding cochlear function and central auditory processing, such as the remarkable abilities of the human auditory system to detect, separate and recognize speech, will provide new insights into speech processing and will motivate novel approaches to the problems of analysis and robust recognition of acoustic patterns [11][15]. The adoption of such auditory processes has usually led to significant improvements in performance over systems using traditional approaches such as LPC, spectral and cepstral methods [1][12].

The auditory path of a sound is as follows. When sound pressure waves impinge upon the eardrum, in the outer ear, they cause vibrations that are transmitted via the stapes, at the oval window, to the fluids of the cochlea in the inner ear [2][3]. These pressure waves in turn produce mechanical displacements of the basilar membrane. The amplitude and time course of these vibrations directly reflect the amplitude and frequency content of the sound stimulus [13][15]. The mechanical displacement at any given place of the basilar membrane can be viewed as the output of a band-pass filter whose frequency response has a resonance peak at a frequency characteristic of that place. Filters with a so-called gammatone impulse response are widely used to model the cochlear filterbank [8][16]. The gammatone function has also been used to characterize the spectral analysis of the cochlear filters at moderate levels [14]. It is distinguished by a spectral bandwidth that depends on the centre frequency of its corresponding cochlear filter, measured in Equivalent Rectangular Bandwidth (ERB) [2][13]. The ERB is related to the psychophysical critical band. The critical bandwidth is about 50-100 Hz at low frequencies and grows gradually to around 20 percent of the centre frequency at high frequencies. Recently, Irino [8] proposed an excellent candidate for an asymmetric, level-dependent cochlear filter: the gammachirp filter. It is an extension of the gammatone filter with an additional chirp term that produces an asymmetric amplitude spectrum. In this work, we use the gammachirp filter to estimate formants with an approach similar to modified spectrum estimation.

In the following sections, we present a time-frequency study of the gammachirp function. Then, we introduce the gammachirp spectrum estimation approach and its validation for formant estimation.
2. The Gammachirp Filter
The gammachirp filter is a good approximation to the frequency-selective behaviour of the cochlea. It is an auditory filter that introduces the asymmetric and level-dependent characteristics of the cochlear filters, and it can be considered as a generalization and improvement of the gammatone filter. The gammachirp filter is defined in the time domain by the real part of the complex function g_c(t) [8][9]:
g_c(t) = A_{n,B} \, t^{n-1} \exp(-2\pi B t) \, \exp\bigl(j 2\pi f_0 t + j c \ln(t) + j\varphi\bigr)    (1)

with

B = b \cdot \mathrm{ERB}(f_0), \quad \mathrm{ERB}(f_0) = 24.7 + 0.108 f_0 \ \text{(Hz)}    (2)

which is the equivalent rectangular bandwidth at frequency f_0; n is the filter order, f_0 is the modulation frequency, A_{n,B} is a normalization constant, b is a parameter defining the envelope of the gamma distribution, and c is a parameter for the chirp rate.
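As a rough illustration of equation (1), the following Python sketch generates a sampled gammachirp impulse response; the function name, parameter values and sampling rate are our own assumptions, not taken from the paper.

import numpy as np

def gammachirp(f0, fs=11025, n=3, b=1.0, c=1.5, dur=0.05):
    # Real part of the complex gammachirp of equation (1), sampled at fs.
    # A_{n,B} is left at 1 here; energy normalization follows equation (5).
    erb = 24.7 + 0.108 * f0                        # ERB(f0), equation (2)
    B = b * erb
    t = np.arange(1, int(dur * fs) + 1) / fs       # start at t > 0 so ln(t) is defined
    env = t ** (n - 1) * np.exp(-2 * np.pi * B * t)
    carrier = np.exp(1j * (2 * np.pi * f0 * t + c * np.log(t)))
    return np.real(env * carrier)

# Example: impulse response of the channel centred at 1 kHz
g = gammachirp(1000.0)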
2.1. Energy
The energy of the impulse response g_c(t) is given by:

E_{n,B} = \|g_c\|^2 = \langle g_c, g_c \rangle = \int_{-\infty}^{+\infty} |g_c(t)|^2 \, dt    (3)

E_{n,B} = A_{n,B}^2 \, \frac{\Gamma(2n-1)}{(4\pi B)^{2n-1}}    (4)

where \Gamma(\cdot) denotes the gamma function. Thus, for energy normalization the normalization constant must be:

A^{E}_{n,B} = \sqrt{\frac{(4\pi B)^{2n-1}}{\Gamma(2n-1)}}    (5)

Figure 2 shows examples of energy-normalized gammachirp filters.

Figure 2: Example of gammachirp filters centred on different frequencies.
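A quick numerical check of equations (3)-(5), under the same assumed parameter values as the sketch above: scaling the complex gammachirp by A^E_{n,B} should give unit energy.

from math import gamma
import numpy as np

n, b, c, f0 = 3, 1.0, 1.5, 1000.0
B = b * (24.7 + 0.108 * f0)
t = np.linspace(1e-6, 0.2, 400001)                 # dense grid; the envelope has decayed well before 0.2 s
dt = t[1] - t[0]
g = t ** (n - 1) * np.exp(-2 * np.pi * B * t) * np.exp(1j * (2 * np.pi * f0 * t + c * np.log(t)))

A_E = np.sqrt((4 * np.pi * B) ** (2 * n - 1) / gamma(2 * n - 1))   # equation (5)
E = np.sum(np.abs(A_E * g) ** 2) * dt                              # discretized equation (3)
print(E)                                                           # should be very close to 1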
2.2. Frequency response

The Fourier transform of g_c(t) is:

G_c(f) = \int_{-\infty}^{+\infty} g_c(t) \exp(-j 2\pi f t) \, dt = \frac{A_{n,B} \, \Gamma(n + jc) \, e^{j\varphi}}{\bigl(2\pi [B + j(f - f_0)]\bigr)^n}    (6)

so the magnitude of the frequency response, when the amplitude is normalized, is given by:

|G_c(f)| = \frac{e^{c\theta}}{\bigl(2\pi \sqrt{B^2 + (f - f_0)^2}\,\bigr)^n}    (7)

with

\theta = \arctan\!\left(\frac{f - f_0}{B}\right)    (8)

The peak frequency of the amplitude spectrum is shifted from f_0 by cB/n [8].
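To illustrate equations (7) and (8), the small sketch below (same assumed parameters as before) evaluates the closed-form magnitude response on a frequency grid and locates its peak, which should sit near f_0 + cB/n.

import numpy as np

f0, n, b, c = 1000.0, 3, 1.0, 1.5
B = b * (24.7 + 0.108 * f0)

f = np.linspace(0.0, 3000.0, 30001)
theta = np.arctan2(f - f0, B)                                       # equation (8)
mag = np.exp(c * theta) / (2 * np.pi * np.hypot(B, f - f0)) ** n    # equation (7)

f_peak = f[np.argmax(mag)]
print(f_peak, f0 + c * B / n)   # the two values should agree closely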
2.3. Time-Frequency spread

Time and frequency energy concentrations are restricted by the Heisenberg uncertainty principle. The average location of a one-dimensional wave function g_c \in L^2(\mathbb{R}) is:

u = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} t \, |g_c(t)|^2 \, dt    (9)

and the average momentum is:

\xi = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} f \, |G_c(f)|^2 \, df    (10)

The variances around these average values are respectively:

\sigma_t^2 = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} (t - u)^2 \, |g_c(t)|^2 \, dt    (11)

and:

\sigma_f^2 = \frac{1}{E_{n,B}} \int_{-\infty}^{+\infty} (f - \xi)^2 \, |G_c(f)|^2 \, df    (12)

In the case of the energy-normalized gammachirp function we obtain:

u = \frac{2n-1}{4\pi B}    (13)

\xi = f_0    (14)

\sigma_t^2 = \frac{2n-1}{(4\pi B)^2}    (15)

\sigma_f^2 = \frac{B^2}{2n-3}    (16)

We remark that these values are the same as for the gammatone function. The temporal and frequency variances of g_c satisfy the Heisenberg uncertainty principle for n > 3/2:

\sigma_t^2 \, \sigma_f^2 = \frac{2n-1}{2n-3} \cdot \frac{1}{(4\pi)^2} \geq \frac{1}{(4\pi)^2}    (17)
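The closed-form spreads (13)-(17) can be sanity-checked numerically against the definitions (9)-(12); the short sketch below (again with our assumed parameter values) does this for the temporal variance and for the Heisenberg product.

import numpy as np

n, b, f0 = 3, 1.0, 1000.0
B = b * (24.7 + 0.108 * f0)
t = np.linspace(1e-6, 0.2, 400001)
w = t ** (2 * n - 2) * np.exp(-4 * np.pi * B * t)   # |g_c(t)|^2 up to a constant factor
w /= np.sum(w)                                      # treat it as a probability density

u = np.sum(t * w)                                   # discretized equation (9)
var_t = np.sum((t - u) ** 2 * w)                    # discretized equation (11)
print(var_t, (2 * n - 1) / (4 * np.pi * B) ** 2)    # the two values should agree, equation (15)

sigma_f2 = B ** 2 / (2 * n - 3)                     # equation (16)
print(var_t * sigma_f2, 1 / (4 * np.pi) ** 2)       # product stays above the Heisenberg bound (17)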
2.4. The Equivalent Rectangular Bandwidth (ERB)

The ERB of a band-pass filter is defined as the width of the rectangular filter with the same peak gain and the same impulse response energy:

\mathrm{ERB} \cdot |G_c(f_0)|^2 = E_{n,B}    (18)

From (4) and (7) it follows that:

\mathrm{ERB} \cdot \frac{A_{n,B}^2 \, \Gamma^2(n)}{(2\pi B)^{2n}} = E_{n,B}    (19)

so, for the energy-normalized gammachirp:

\mathrm{ERB} = \frac{2\pi B \, \Gamma(2n-1)}{2^{2n-1} \, \Gamma^2(n)}    (20)

Comparing the ERB with the root mean square frequency spread \sigma_f by inserting its expression from (16), it follows that:

\frac{\mathrm{ERB}}{\sigma_f} = \frac{2\pi \sqrt{2n-3} \, \Gamma(2n-1)}{2^{2n-1} \, \Gamma^2(n)}    (21)

It can be seen that for n = 3, ERB ≈ 2\sigma_f. For this reason, we take this value of n in (1).
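Under the same assumptions, a one-line check of equations (20), (16) and (21): for n = 3 the ratio ERB/σ_f comes out close to 2, which is the reason given for choosing n = 3.

import numpy as np
from math import gamma

n, B = 3, 1.0 * (24.7 + 0.108 * 1000.0)                                   # b = 1, f0 = 1 kHz
erb = 2 * np.pi * B * gamma(2 * n - 1) / (2 ** (2 * n - 1) * gamma(n) ** 2)   # equation (20)
sigma_f = B / np.sqrt(2 * n - 3)                                          # from equation (16)
print(erb / sigma_f)   # ~2.04 for n = 3, i.e. ERB is approximately 2 * sigma_f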
3. Gammachirp Spectrum Estimation

In this work, a gammachirp spectrum estimation is performed to estimate formants in a multiresolution and auditory context. An approach similar to modified spectrum estimation, which yields a smoothed and averaged spectrum, is adopted. Instead of a uniform window such as the Hamming or Blackman window, which is commonly used in short-time Fourier analysis and has a constant time-frequency resolution, the gammachirp function, which does not have a constant time-frequency resolution but a constant quotient Q = f_0/B, is used as the speech-analysis window, centred on a linear frequency scale. The modified estimator has the following expression [12]:

R_x(f) = \frac{1}{K} \sum_{k=1}^{K} R_{x_k}(f)    (22)

where K = L/M is the number of sections, L is the length of the signal and M is the length of each section;

R_{x_k}(f) = \frac{1}{MP} \left| \sum_{l=0}^{M-1} x_k(l) \, w(l) \, \exp(-j 2\pi f l) \right|^2    (23)

is the expression of the simple spectrum estimator for section k, where:

P = \frac{1}{M} \sum_{l=0}^{M-1} |w(l)|^2    (24)

is the normalization factor introduced to make the estimator asymptotically unbiased. The window w(l) in (23) is then replaced by the energy-normalized gammachirp g_c(l) of (1) centred at f_0, and the modified estimator becomes:

R_{x_k}(f_0) = \left| \sum_{l=0}^{M-1} x_k(l) \, g_c(l) \, \exp(-j 2\pi f_0 l) \right|^2    (25)

A bank of 512 channels on a linear frequency scale is then computed using expression (25), and each channel is corrected in its peak frequency by cB/n (with c = 1.5 and b = 1), to obtain a single spectrum highlighting the formant structure.
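A possible end-to-end sketch of this estimator is given below. It is meant to convey the structure of equations (22) and (25) only; the channel grid, section length M and helper names are our own assumptions, not the paper's exact implementation.

import numpy as np
from math import gamma

def gammachirp_window(f0, M, fs, n=3, b=1.0, c=1.5):
    # Real gammachirp of equation (1), M samples long, scaled by the
    # normalization constant of equation (5) (which applies to the complex envelope).
    B = b * (24.7 + 0.108 * f0)                                        # equation (2)
    t = np.arange(1, M + 1) / fs
    A = np.sqrt((4 * np.pi * B) ** (2 * n - 1) / gamma(2 * n - 1))     # equation (5)
    return A * t ** (n - 1) * np.exp(-2 * np.pi * B * t) * np.cos(
        2 * np.pi * f0 * t + c * np.log(t))

def gammachirp_spectrum(x, fs=11025, M=512, n_channels=512, fmax=5000.0):
    # Temporal average (equation 22) of gammachirp-windowed estimates (equation 25).
    freqs = np.linspace(50.0, fmax, n_channels)        # assumed linear channel grid
    l = np.arange(M)
    windows = [gammachirp_window(f0, M, fs) for f0 in freqs]
    carriers = [np.exp(-2j * np.pi * f0 * l / fs) for f0 in freqs]
    K = max(1, len(x) // M)                            # number of sections, K = L/M
    spec = np.zeros(n_channels)
    for k in range(K):
        xk = x[k * M:(k + 1) * M]
        for i in range(n_channels):
            # equation (25): windowed estimate of section k at the channel frequency
            spec[i] += np.abs(np.sum(xk * windows[i] * carriers[i])) ** 2
    # Each channel's peak frequency can additionally be corrected by c*B/n, as done in the paper.
    return freqs, spec / K                             # equation (22)

For a vowel frame, the broad peaks of the returned spectrum should fall near the formant frequencies (compare with Table 1 for the synthesized vowels).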
4. Validation

This approach is validated by applying it, for formant estimation, to the noise-free synthesized vowels /a/, /i/ and /u/, with an 11025 Hz sampling rate and a 100 Hz pitch frequency. The obtained temporal averages of the estimated spectra are compared with the spectra given by three classical methods commonly used for formant detection. The first is the LPC method, based on a model of the vocal tract with 17 linear predictive coefficients. The second is the smoothed cepstral method, which smooths the cepstral coefficients in order to smooth the estimated spectra. The third is the most classical one, the short-time Fourier spectrum. Table 1 shows the values used in synthesis, and Figures 4, 5 and 6 show the gammachirp temporal-average spectra compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.
Table 1: Parameters used in synthesis.

Vowel   Pitch     F1        F2        F3
/a/     100 Hz    730 Hz    1090 Hz   2440 Hz
/i/     100 Hz    270 Hz    2290 Hz   3010 Hz
/u/     100 Hz    300 Hz    870 Hz    2240 Hz
In the same way, this analysis is applied to the natural vowels /a/, /i/ and /u/, spoken by a male speaker in ambient noise and recorded at 11025 Hz. All the results are compared with the formants given by the same three traditional methods. The formant tracking shows that this method gives a correct estimation of the formants compared with the traditional methods. It also detects the beginning of voiced speech through the first pitch harmonics, which can be used as a voiced-speech detection method. Moreover, it yields different formant bandwidths; this means that this auditory approach gives, in addition to the positions of the formants, the bandwidths as perceived by the auditory system. It also yields ratios between the relative formant amplitudes (F1/F2 and F2/F3) that differ from those given by the three classical methods.

Figures 7, 8 and 9 illustrate the comparison between the gammachirp spectra obtained for the natural vowels /a/, /i/ and /u/ and those obtained with the gammatone, LPC, smoothed-cepstrum and short-time Fourier methods.
Figure 4: The gammachirp spectrum of the synthesized vowel /a/ compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 5: The gammachirp spectrum of the synthesized vowel /i/ compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 6: The gammachirp spectrum of the synthesized vowel /u/ compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 7: The gammachirp spectrum of the natural vowel /a/ compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 8: The gammachirp spectrum of the natural vowel /i/ compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.

Figure 9: The gammachirp spectrum of the natural vowel /u/ compared with the gammatone, LPC, smoothed-cepstrum and short-time Fourier spectra.
5. Conclusion
In this work, we present a spectral analysis for formant estimation performed by a gammachirp filterbank. This spectral analysis is based on an auditory-modified spectral estimation. The method is validated by applying it, for formant detection, to both synthesized and natural vowels. The obtained spectra are compared with the LPC, smoothed-cepstrum and short-time Fourier spectra, and the comparison shows that this method gives a correct formant estimation. It also detects voiced speech sounds and presents some spectral characteristics that differ from those of the classical methods. Finally, this method appears to estimate the formants with less variance; to test the validity of this assumption, it will be applied to a large database in future work.
6. References
[1] Rabiner, L. R. and Schafer, R. W., "Digital signal
processing of speech signals", Englewood Cliffs, NJ :
Prentice-Hall, 1978.
[2] Zwicker, E. and Feldtkeller, R., “ Psychoacoustique”,
Masson, 1981.
[3] Flanagan, J. L., “Speech analysis, synthesis and perception”, Second Edition, Springer-Verlag, Berlin, 1972.
[4] Nakagawa, S., Shikano, K., and Tohkura, Y., “ Speech,
hearing and neural network models”, Edition Ohmsha
IOS Press, 1995.
[5] Mallat, S., “A wavelet tour of signal processing”, Second Edition, Academic Press, 1998.
[6] Patterson, R. D., “Auditory filter shapes derived with noise stimuli”, J. Acoust. Soc. Amer., Vol. 59, 1976, pp. 640-654.
[7] Irino, T. and Kawahara, H., “Signal reconstruction from
modified auditory wavelet transform”, IEEE Transactions
on Signal Processing, Vol. 41, No. 12, December 1993.
[8] Irino, T., “A Gammachirp function as an optimal auditory filter with the Mellin transform”, IEEE ICASSP 96, pp. 981-984, Atlanta, May 7-10, 1996.
[9] Irino, T. and Unoki M. “An Analysis/Synthesis Auditory
Filterbank Based on an IIR Implementation of the
Gammachirp”, Accepted to J. Acoust. Soc. Jpn.
[10] Lyon, R. F., “All Poles Models of Auditory Filtering”, Diversity in Auditory Mechanics, Lewis et al. (eds.), World Scientific Publishing, Singapore, 1997, pp. 205-211.
[11] Johannesma, P. I. M.,“The pre-response stimulus
ensemble of neurons in the cochlear nucleus”,
Symposium of hearing theory, 1972, pp. 58-69.
[12] Greenberg, S., “The ear as a speech analyzer”, Journal of Phonetics, 16, 1988, pp. 139-149.
[13] Shamma, S., “The acoustic features of speech sounds in a model of auditory processing: Vowels and voiceless fricatives”, Journal of Phonetics, 16, 1988, pp. 77-91.
[14] Ghitza, O., “Auditory Models and Human Performance in
Tasks Related to Speech Coding and Speech
Recognition ”, IEEE Transactions on Speech and Audio
Processing, Vol. 2, NO. 1, Part II, January 1994, pp.
115-131.
[15] D. V. Compernolle, “Development of a Computational Auditory Model”, IPO Technical Report, Instituut voor Perceptie Onderzoek, Eindhoven, February 4, 1991.
[16] Yang, X., Wang, K. and Shamma, S. A., “Auditory
Representations of Acoustic Signals ”, Technical
Research Report, TR 91-16r1, University of Maryland,
College Park.
[17] Solbach, L., “An Architecture for Robust Partial Tracking and Onset Localization in Single Channel Audio Signal Mixes”, Distributed Systems Department, Technical University of Hamburg-Harburg, Germany, Doctor-Engineer Dissertation, 1998.
[18] L. Cohen, “ The scale representation”, IEEE Transaction
on Signal Processing, 41, pp. 3275-3292, December
1993.