Joint Estimation of Spectral Envelope and Fundamental Frequency for Speech Signals

International Journal of Engineering Trends and Technology (IJETT) – Volume 13 Number 3 – Jul 2014
Chaithanya V1, Vidya Sagar K.N2, Pallaviram Sure3
1 PG Scholar, ECE Department, REVA Institute of Technology and Management, Bangalore, Karnataka, India
2,3 Faculty, ECE Department, REVA Institute of Technology and Management, Bangalore, Karnataka, India
Abstract—Speech synthesis applications demand effective modeling of speech signals. Such speech modeling techniques require the squared magnitude response (spectral envelope) of the vocal tract and the fundamental frequency of its input excitation (equivalently, its pitch period) to be estimated accurately. Since the spectral envelope and the fundamental frequency are inter-related, an ideal estimation is a joint estimation of both rather than two independent estimations. In this paper, such a joint estimator of the spectral envelope and the fundamental frequency is introduced and developed. It is a parametric source-filter model, built upon the concept of Gaussian Mixture Models (GMM), and it is iterative in nature. The estimator performance is evaluated using different verification test cases and a few synthesized speech signals. In all cases, the results show that the joint estimator is capable of estimating both the fundamental frequency and the spectral envelope accurately.
Keywords—fundamental frequency, spectral envelope, joint estimation, speech synthesis, GMM.
I. INTRODUCTION
Speech synthesis applications play a significant role in many entertainment productions such as gaming and animation. To synthesize speech, that is, to produce speech signals artificially, the basic requirement is the ability to model speech signals. A basic speech production model consists of a source signal passing through a linear filter to produce speech. Other applications of speech modelling include speech compression, speech recognition, speech coding, voice analysis and speech enhancement. In speech processing, the estimation of the fundamental frequency (F0) and the spectral envelope plays a crucial role. This paper targets the joint estimation of the fundamental frequency and the spectral envelope.
The fundamental frequency is the perceived pitch frequency of a sound signal. The spectral envelope is a smooth curve in the frequency-amplitude plane that follows the squared Fourier magnitude spectrum; the broad peaks of this envelope are called formants. Different models have been suggested in the literature to estimate the spectral envelope and the fundamental frequency. For example, the linear predictive coding (LPC) method of digital signal processing was developed for speech transmission in [2] and is widely used in speech compression. LPC mainly targets spectral envelope estimation by representing each time-domain sample of a signal as a linear combination of a few preceding samples. Though LPC is a good estimator of the spectral envelope, its performance degrades with increasing pitch, and the relation of the model order to the fundamental frequency is not well defined.
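To make the LPC idea concrete, a minimal sketch of autocorrelation-method LPC envelope estimation is given below. It is only illustrative: the function name, model order p, FFT length and the analysis window are arbitrary assumptions, not values used in this paper.

```python
# Sketch of LPC spectral-envelope estimation (autocorrelation method): each
# sample is predicted from a linear combination of the p preceding samples,
# and gain/|A(w)|^2 serves as the all-pole spectral envelope.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_envelope(frame, p=12, nfft=1024):
    frame = frame * np.hamming(len(frame))                  # analysis window
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:p], r[1:p + 1])                   # normal equations R a = r
    A = np.concatenate(([1.0], -a))                         # prediction-error filter A(z)
    gain = r[0] - a @ r[1:p + 1]                            # residual energy
    return gain / np.maximum(np.abs(np.fft.rfft(A, nfft)) ** 2, 1e-12)

# usage: envelope = lpc_envelope(voiced_frame, p=12) for any short voiced frame
```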
The cepstrum is a method of speech analysis based on a spectral representation of the signal [3]. The cepstrum method estimates the spectral envelope by low-pass filtering the log-amplitude spectrum, which is treated as a signal. Since it retains only the slow spectral fluctuations and smooths the detail, the result depends strongly on the pitch, and the method estimates only the spectral envelope. The discrete cepstrum method [4] is an extension of the cepstrum method that estimates cepstral coefficients from discrete points in the frequency-amplitude plane; it generates a smoothly interpolated curve that tries to link the existing local peaks. Another method, discrete all-pole modelling [5], overcomes the limitations of LPC and gives better all-pole spectral envelopes.
However, all of the above speech modelling methods estimate the pitch period and the spectral envelope independently. Since the spectral envelope estimate changes as the pitch period changes [1], a better approach is to estimate F0 and the spectral envelope jointly and iteratively. The rest of the paper is organized as follows. The joint estimation methodology, along with the iterative algorithm, is discussed in Section II. The simulation results and various verification test cases are discussed in Section III. The paper is concluded in Section IV.
II. METHODOLOGY
A. Speech Spectrum Modeling
The vocal tract impulse response h(t) is excited by the excitation s(t), producing a short-time segment of a speech signal y(t), as shown in Fig. 1. The corresponding model is given in (1), where w(t) represents a window function.

Fig. 1: Speech spectrum model

y(t) = (s(t) * h(t)) w(t)                                                           (1)
Different window types, such as the Hamming, rectangular, Hanning, Blackman and Gaussian windows, can be assumed. Taking the Fourier transform on both sides of (1) gives (2).

Y(ω) = (S(ω) H(ω)) * W(ω)                                                           (2)
The excitation signal s(t) is assumed to be an impulse sequence with pitch period T, given by (3); its Fourier transform, given in (4), is again an impulse train. In (4), µ = 2π/T is the F0 parameter.
s(t) = Σ_{n=−N/2}^{N/2} δ(t − nT)                                                   (3)

S(ω) = √µ Σ_{n=−N/2}^{N/2} δ(ω − nµ)                                                (4)

Correspondingly, the product of S(ω) and the vocal tract frequency response H(ω), convolved with the window frequency response W(ω), gives the complex spectrum Y(ω) of a short-time segment of any voiced speech signal, as in (5).

Y(ω) = √µ Σ_n H(nµ) W(ω − nµ)                                                       (5)

Approximating the power spectrum of y(t) to model the speech signal, we rewrite (5) as (6).

|Y(ω)|² ≅ µ Σ_n |H(nµ)|² |W(ω − nµ)|²                                               (6)

Assuming a Gaussian window function w(t), |W(ω)|² is a Gaussian function given by (7).

|W(ω)|² = (1/(√(2π) σ)) exp(−ω²/(2σ²))                                              (7)

Using the Gaussian Mixture Model (GMM) concept, the spectral envelope function |H(ω)|² is written as in (8). In (8), η, wm, µm and σm are the model parameters.

|H(ω)|² ≡ η Σ_{m=1}^{M} (wm/(√(2π) σm)) exp(−(ω − µm)²/(2σm²))                      (8)

Substituting (7) and (8) in (6), the speech power spectrum can be modelled [1] as in (9). In (9), µ and σ are also model parameters.

|Y(ω)|² = Σ_n Σ_{m=1}^{M} yn,m(ω), where
yn,m(ω) = (η µ wm/(2π σ σm)) exp(−(nµ − µm)²/(2σm²)) exp(−(ω − nµ)²/(2σ²))          (9)

B. Parameter Estimation

Estimation of the total of six model parameters η, wm, µm, σm, µ and σ is carried out iteratively, as described here. Let |F(ω)|² denote the observed short-time power spectrum. To obtain the model parameters, we minimize a distortion measure, Csiszar's I-divergence, between |F(ω)|² and the model |Y(ω)|². Following [1], this requires maximization of the term given in (10); the update equations for all six model parameters are then obtained by equating the partial derivative of (10) with respect to each model parameter to zero. Observe that (10) incorporates a set of weighting functions λn,m(ω), which in principle may be freely chosen subject to a normalization constraint. In the iterative algorithm they are updated using (11), because in each iteration yn,m(ω) is determined by the current model parameter values.

∫ |F(ω)|² Σ_n Σ_m λn,m(ω) log[ yn,m(ω)/λn,m(ω) ] dω − ∫ Σ_n Σ_m yn,m(ω) dω          (10)

λn,m(ω) = yn,m(ω) / Σ_{n'} Σ_{m'} yn',m'(ω)                                         (11)

Parameters µ and σ: by maximizing (10) with respect to these parameters, the update equations (12) and (13) are obtained. Their computation requires the quantities a, b1 to bM, c1 to cM and d given in (14), which are organized in the matrix equation (15); solving (15) therefore involves one matrix inversion and one multiplication.

Parameters wm, µm, σm and η: these model parameters are calculated on similar lines by maximizing (10), giving the closed-form updates (16), (17), (18) and (19). In this derivation the restriction ∀ω: Σ_n Σ_m λn,m(ω) = 1 holds good. Here F = ∫ |F(ω)|² dω, the index t refers to the iteration cycle, and M and N are the numbers of Gaussians used in the envelope and excitation approximations respectively.
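The distortion measure itself can be stated compactly in code. The sketch below is a minimal implementation of Csiszar's I-divergence between an observed power spectrum and the model power spectrum, assuming both are sampled on a common uniform frequency grid; the function name and grid-spacing argument are illustrative choices.

```python
# Csiszar's I-divergence between the observed power spectrum |F(w)|^2 and the
# model power spectrum |Y(w)|^2 of (9); this is the distortion measure that
# the joint estimator minimizes.
import numpy as np

def i_divergence(F2, Y2, d_omega=1.0, eps=1e-12):
    F2 = np.maximum(np.asarray(F2, dtype=float), eps)    # observed spectrum samples
    Y2 = np.maximum(np.asarray(Y2, dtype=float), eps)    # model spectrum samples
    return float(np.sum(F2 * np.log(F2 / Y2) - F2 + Y2) * d_omega)
```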
C. Algorithm of the Joint Estimator
1. Given data: a short-term speech signal whose power spectrum |Y(ω)|² (denoted |F(ω)|² in (10) and (11)) is computed.
2. Initialization: initialize arbitrary values for {µ, σ, η, ∪m {wm, µm, σm}} and choose N and M.
3. For iteration num = 1 to 100 (say):
   a. Calculate F from the power spectrum.
   b. Substituting the current parameter values, find yn,m(ω) using (9).
   c. Compute λn,m(ω) using (11).
   d. Calculate a, b1 to bM, c1 to cM and d using (14).
   e. Frame the matrix equation (15) from (14), and compute the new σ and µ.
   f. Compute the new wm, µm, σm (for all m = 1 to M) and η using (16) to (19).
4. Exit condition: the model parameters converge.
5. Substitute the final model parameters in (8) to obtain the estimated spectral envelope |Ĥ(ω)|². The parameter µ is directly the pitch (fundamental) frequency.
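The structure of this loop can be sketched in code. In the sketch below, the model components of (9) and the weights of (11) are implemented as written, but the parameter updates are deliberately simplified stand-ins (λ-weighted spectral moments and a harmonic-average F0 estimate) rather than the exact closed forms (12)-(19); the window width σ is held fixed. All variable names and the choice N = 30 are illustrative assumptions.

```python
# Illustrative structure of the joint-estimation loop (Section II-C). Steps b
# and c implement (9) and (11); the updates marked "stand-in" are simplified
# moment-based substitutes for (12)-(19), used here only to show the loop.
import numpy as np

N_HARM = 30                                                 # N, number of harmonics kept

def components(omega, mu, sigma, eta, w, mu_m, sig_m):
    """y_{n,m}(omega) of (9), for positive harmonics n = 1..N_HARM."""
    n = np.arange(1, N_HARM + 1)[:, None, None]             # (N,1,1)
    wm, mm, sm = w[None, :, None], mu_m[None, :, None], sig_m[None, :, None]
    om = omega[None, None, :]                                # (1,1,K)
    amp = eta * mu * wm / (2.0 * np.pi * sigma * sm)
    return amp * np.exp(-(n * mu - mm) ** 2 / (2 * sm ** 2)
                        - (om - n * mu) ** 2 / (2 * sigma ** 2))

def joint_estimate(F2, omega, mu, sigma, eta, w, mu_m, sig_m, iters=100):
    n = np.arange(1, N_HARM + 1)[:, None, None]
    for _ in range(iters):
        y = components(omega, mu, sigma, eta, w, mu_m, sig_m)           # step b, eq (9)
        lam = y / np.maximum(y.sum(axis=(0, 1), keepdims=True), 1e-12)  # step c, eq (11)
        r = lam * F2[None, None, :]                                     # weighted observed spectrum
        mass = r.sum(axis=(0, 2)) + 1e-12                               # per-component m
        mu = float((r * (omega / n)).sum() / r.sum())                   # stand-in F0 update
        mu_m = (r * omega).sum(axis=(0, 2)) / mass                      # stand-in for envelope means
        sig_m = np.sqrt((r * (omega - mu_m[None, :, None]) ** 2).sum(axis=(0, 2)) / mass)
        w = mass / mass.sum()                                           # stand-in for weights
        eta *= F2.sum() / max(y.sum(), 1e-12)                           # match overall energy
        # sigma (the window width) is held fixed in this simplified sketch
    return mu, sigma, eta, w, mu_m, sig_m
```

With F2 set to an observed short-time power spectrum sampled on the grid omega and rough initial parameter values, the loop alternates the weighting step and the parameter updates until the values stop changing, as in the exit condition above.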
D. Performance Analysis of Joint Estimator

The actual spectral envelope and the estimated spectral envelope should differ as little as possible. The measure used to evaluate the estimator performance is the spectral distortion (SD), defined in (20).

SD = Σ_i ( |H(ωi)| − |Ĥ(ωi)| )²                                                     (20)

In (20), i refers to the frequency-bin index, |H(ωi)| is the true spectral envelope and |Ĥ(ωi)| is the spectral envelope estimate.
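A direct implementation of (20) is a one-line sum; the sketch below assumes the true and estimated envelopes are already sampled on a common frequency grid.

```python
# Spectral distortion (20): the summed squared difference between the true and
# the estimated spectral envelope over the frequency bins i.
import numpy as np

def spectral_distortion(H_true, H_est):
    H_true = np.asarray(H_true, dtype=float)
    H_est = np.asarray(H_est, dtype=float)
    return float(np.sum((H_true - H_est) ** 2))
```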
III. RESULTS AND DISCUSSION

The speech modelling used in this paper comprises Gaussian Mixture Models for the spectral envelope and the power spectrum, as given in (8) and (9) respectively, with six model parameters in total. Choosing arbitrary values for the model parameters, with M = 5 and N = 30, four known spectral envelopes |H(ω)|² are generated, and by substituting them in (9) the corresponding power spectra are also generated. To verify the correctness of the algorithm and its MATLAB code, each of the four power spectra is provided to the joint estimation algorithm of Section II, and the spectral envelope |Ĥ(ω)|² is estimated in each case. The estimated and the corresponding true spectral envelopes are compared graphically, and the SD is calculated in each case. The results for these four test cases are shown in Fig. 2 to Fig. 5, the actual and estimated parameters are compared in Table I to Table IV, and the SD value is indicated at the bottom of each table.

Fig. 2: Spectral envelope for test case 1 (normalized power density vs. frequency in rad/sec, showing the power spectrum, the actual spectral envelope and the estimated spectral envelope).

TABLE I. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 1

Model parameters | Actual values | Estimated values
µ  | 0.6283 | 0.3135
σ  | 0.06   | 0.0595
σm | 0.628, 0.691, 0.565, 0.628, 0.502 | 0.45, 0.65, 0.56, 0.42, 0.34
µm | 0.8, 1.88, 4.7, 5.65, 5.9         | 0.83, 1.87, 4.80, 5.52, 5.84
wm | 0.2, 0.2, 0.1, 0.2, 0.3           | 0.18, 0.25, 0.16, 0.18, 0.20
η  | 1      | 0.994

SD = 5.0339e-005
Fig. 3: Spectral envelope for test case 2 (normalized power density vs. frequency in rad/sec).

TABLE II. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 2

Model parameters | Actual values | Estimated values
µ  | 0.5654 | 0.5087
σ  | 0.05   | 0.1243
σm | 1.256, 1.131, 0.37, 0.69, 0.43 | 0.72, 0.57, 0.27, 0.77, 0.27
µm | 0.628, 0.942, 4.08, 4.39, 5.96 | 1.23, 0.86, 4.19, 4.30, 5.84
wm | 0.3, 0.15, 0.25, 0.1, 0.3      | 0.14, 0.17, 0.18, 0.25, 0.24
η  | 1      | 0.9984

SD = 6.7369e-005

Fig. 4: Spectral envelope for test case 3 (normalized power density vs. frequency in rad/sec).

TABLE III. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 3

Model parameters | Actual values | Estimated values
µ  | 0.5026 | 0.2512
σ  | 0.0560 | 0.0560
σm | 0.691, 1.256, 0.502, 0.816, 0.565 | 0.54, 0.60, 0.25, 0.81, 0.55
µm | 0.94, 1.25, 4.08, 4.39, 5.96      | 0.96, 2.17, 3.77, 4.73, 5.82
wm | 0.2, 0.25, 0.05, 0.15, 0.35       | 0.31, 0.15, 0.07, 0.21, 0.24
η  | 1      | 0.9840

SD = 1.2880e-004

Fig. 5: Spectral envelope for test case 4 (normalized power density vs. frequency in rad/sec).

TABLE IV. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 4

Model parameters | Actual values | Estimated values
µ  | 0.4398 | 0.3005
σ  | 0.0430 | 0.1537
σm | 0.75, 0.31, 0.56, 1.82, 0.62 | 0.61, 0.34, 0.72, 0.54, 0.42
µm | 3.45, 2.51, 5.34, 1.57, 3.89 | 1.06, 2.51, 3.55, 4.20, 5.42
wm | 0.15, 0.25, 0.1, 0.25, 0.35  | 0.395, 0.091, 0.273, 0.161, 0.079
η  | 1      | 1.003

SD = 5.8781e-005
Next, a short-term speech signal y(t) is generated as shown in Fig. 6. Here s(t) is an impulse train with a pitch period of 100 samples. The vocal tract frequency response is created using the GMM model in (8). The output of this vocal tract, when excited by the impulse-train input, is passed through a Gaussian window, and the resulting time-domain signal is the generated short-term speech signal y(t). Its power spectrum is computed and provided to the joint estimator of Section II. The actual and estimated spectral envelopes are compared in Fig. 7, the model parameters are compared in Table V, and the corresponding SD is shown below Table V.

Another short-term speech signal y(t) is generated as in Fig. 6, but with the vocal tract modelled by the all-pole filter in (21). The input excitation has a pitch frequency of 100 Hz. The power spectrum of the generated speech signal is computed and provided to the joint estimator of Section II. The estimated model parameters and the obtained spectral envelope are shown in Table VI and Fig. 8 respectively, and the corresponding SD is shown below Table VI.
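A minimal sketch of this generation procedure is given below. The sampling rate and frame length are illustrative assumptions; the envelope parameters are taken from the "actual" column of Table V below (interpreted here as the component widths σm, centres µm and weights wm, in Hz), with the pitch period set to 100 samples as stated above.

```python
# Generating a short-term speech signal as in Fig. 6: an impulse train with a
# pitch period of 100 samples excites a vocal-tract filter whose squared
# magnitude response follows the GMM envelope (8); the output is
# Gaussian-windowed and its power spectrum feeds the joint estimator.
import numpy as np

fs = 8000                                            # sampling rate in Hz (assumed)
L = 4096                                             # frame length (assumed)
t = np.arange(L)

s = np.zeros(L)                                      # excitation: impulse train
s[::100] = 1.0                                       # pitch period T = 100 samples

f = np.fft.rfftfreq(L, d=1.0 / fs)                   # one-sided frequency grid (Hz)
eta = 1.0
w_m  = np.array([0.2, 0.2, 0.1, 0.2, 0.3])           # Table V "actual" values
mu_m = np.array([100.0, 300.0, 750.0, 900.0, 950.0])
sg_m = np.array([100.0, 110.0, 90.0, 100.0, 80.0])
H2 = eta * np.sum(w_m / (np.sqrt(2 * np.pi) * sg_m)
                  * np.exp(-(f[:, None] - mu_m) ** 2 / (2 * sg_m ** 2)), axis=1)

y = np.fft.irfft(np.fft.rfft(s) * np.sqrt(H2), L)    # excite the vocal tract
y *= np.exp(-0.5 * ((t - L / 2) / (L / 8)) ** 2)     # Gaussian analysis window
power_spectrum = np.abs(np.fft.rfft(y)) ** 2         # input to the joint estimator
```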
Fig. 6: Generation of speech signal.

Fig. 7: Spectral envelope estimation for speech signal 1 (normalized power density vs. frequency, showing the power spectrum, the actual spectral envelope and the estimated spectral envelope).

TABLE V. ACTUAL, ESTIMATED VALUES OF FOUR MODEL PARAMETERS

Model parameters | Actual values | Estimated values
σm | 100, 110, 90, 100, 80   | 71.7, 105.30, 87.12, 72.67, 54.39
µm | 100, 300, 750, 900, 950 | 118.5, 300.5, 743.82, 853.7, 928
wm | 0.2, 0.2, 0.1, 0.2, 0.3 | 0.185, 0.246, 0.125, 0.177, 0.265
η  | 1 | 0.9970

SD = 9.0675e-005

H(z) = ∏_{k=1}^{5} (1 − qk z⁻¹) / ∏_{k=1}^{5} (1 − pk z⁻¹)                          (21)

In (21), q1 = … = q5 = 0, p1 = p2* = 0.4225 + 0.7529j, p3 = p4* = −0.5026 + 0.5976j and p5 = 0.6602 are used.
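For reference, the squared magnitude response of (21) can be evaluated directly from the listed poles; a minimal sketch follows, with frequencies in rad/sample and the zeros at the origin as stated above.

```python
# Frequency response of the all-pole vocal tract (21): with the zeros q_k = 0,
# H(z) = 1 / prod_k (1 - p_k z^-1), using the pole values listed above.
import numpy as np
from scipy.signal import freqz

poles = np.array([0.4225 + 0.7529j, 0.4225 - 0.7529j,
                  -0.5026 + 0.5976j, -0.5026 - 0.5976j,
                  0.6602])
a = np.poly(poles).real                               # denominator coefficients of H(z)
w, H = freqz(b=[1.0], a=a, worN=1024)                 # w in rad/sample
envelope = np.abs(H) ** 2                             # squared magnitude envelope
```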
Fig. 8: Spectral envelope estimation for speech signal 2 (normalized power density vs. frequency, showing the power spectrum, the actual spectral envelope and the estimated spectral envelope).

TABLE VI. ESTIMATED VALUES OF SIX MODEL PARAMETERS

Model parameters | Actual values | Estimated values
µ  | NA | 99.8373
σ  | NA | 9.9428
σm | NA | 87.14, 125.6, 81.95, 54.33, 88.73
µm | NA | 144.6, 256, 636.9, 807.51, 886.19
wm | NA | 0.327, 0.172, 0.082, 0.193, 0.225
η  | NA | 0.9881

SD = 0.0167
IV. CONCLUSION
Speech synthesis applications demand speech signal models capable of producing artificial speech. For such models, a joint estimator of the fundamental frequency and the spectral envelope, based on a parametric source-filter model, has been discussed and applied in this paper. Known speech signals were generated and their spectral envelopes and fundamental frequencies were estimated using the joint estimator; the results show that it indeed provides good estimates of both. For any recorded speech signal, the estimator can be applied to obtain the model parameters, after which the speech can be reproduced using a summation of Gabor functions. Comparing the recorded and reproduced speech signals for audibility and clarity forms the future scope of this work.
REFERENCES

[1] H. Kameoka, "Speech spectrum modeling for joint estimation of spectral envelope and fundamental frequency," IEEE Trans. Audio, Speech, and Language Processing, vol. 18, Aug. 2010.
[2] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Amer., vol. 55, 1974.
[3] A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. Audio and Electroacoustics, vol. AU-16, Jun. 1968.
[4] O. Cappé and E. Moulines, "Regularized estimation of cepstrum envelope from discrete frequency points," in Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, 1995.
[5] A. El-Jaroudi and J. Makhoul, "Discrete all-pole modeling," IEEE Trans. Signal Processing, vol. 39, no. 2, pp. 411-423, Feb. 1991.
[6] D. Giacobello et al., "Sparse linear predictors for speech processing," in Proc. Interspeech, 2008.
[7] R. Badeau and B. David, "Weighted maximum likelihood autoregressive and moving average spectrum modeling," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP'08), 2008, pp. 3761-3764.
[8] W. J. Hess, "Pitch and voice determination," in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds. New York: Marcel Dekker, 1992, pp. 3-48.
[9] K. Lange, D. Hunter, and I. Yang, "Optimization transfer using surrogate objective functions," J. Comput. Graph. Statist., vol. 9, no. 1, pp. 1-20, 2000.