International Journal of Engineering Trends and Technology (IJETT) – Volume 13 Number 3 – Jul 2014

Joint Estimation of Spectral Envelope and Fundamental Frequency for Speech Signals
Joint Estimator for Speech Signal Modeling

Chaithanya V1, Vidya Sagar K.N2, Pallaviram Sure3
1 PG Scholar, ECE Department, REVA Institute of Technology and Management, Bangalore, Karnataka, India
2,3 Faculty, ECE Department, REVA Institute of Technology and Management, Bangalore, Karnataka, India

Abstract—Speech synthesis applications demand effective modeling of speech signals. Such speech modeling techniques require the squared magnitude response (spectral envelope) of the vocal tract and the pitch period (fundamental frequency) of its input excitation to be estimated accurately. Since the spectral envelope and the fundamental frequency are inter-related, an ideal estimation requires a joint estimation of both rather than two independent estimations. In this paper, such a joint estimator of the spectral envelope and the fundamental frequency is introduced and developed. It is a parametric source-filter model, built upon the concept of Gaussian Mixture Models (GMM), and is iterative in nature. The estimator performance is evaluated using different verification test cases and a few synthesized speech signals. In all the cases, the results show that the joint estimator is capable of estimating both the fundamental frequency and the spectral envelope accurately.

Keywords—fundamental frequency, spectral envelope, joint estimation, speech synthesis, GMM.

I. INTRODUCTION

Speech synthesis applications play a major role in many entertainment productions such as gaming and animation. To synthesize speech, that is, to produce speech signals artificially, the basic requirement is to be able to model the speech signal. The basic speech production model consists of a source signal passing through a linear filter to produce speech. Other applications of speech modelling include speech compression, speech recognition, speech coding, voice analysis and speech enhancement. In speech processing, the estimation of the fundamental frequency (F0) and of the spectral envelope plays a crucial role. This paper targets the joint estimation of the fundamental frequency and the spectral envelope.

The fundamental frequency is the perceived frequency of a sound signal. A spectral envelope is a curve in the frequency-amplitude plane drawn over the squared Fourier magnitude spectrum; the peaks in the power spectrum are called formants. Different models have been suggested in the literature to estimate the spectral envelope and the fundamental frequency. For example, linear predictive coding (LPC), a digital signal processing method developed for speech transmission [2], is used in speech compression. LPC targets mainly spectral envelope estimation by representing each time-domain sample of a signal as a linear combination of a few preceding samples. Though LPC is a good estimator of the spectral envelope, its performance degrades as the pitch increases, and the choice of model order is not properly related to the fundamental frequency. The cepstrum is a method of speech analysis based on a spectral representation of the signal [3]. The cepstrum method estimates the spectral envelope by low-pass filtering the log-amplitude spectrum, which is interpreted as a signal. Since it retains only the slow fluctuations and smooths the spectrum, it is not suited to pitch estimation and therefore estimates only the spectral envelope.
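For concreteness, the following is a minimal sketch of the cepstrum-based envelope smoothing described above, written in Python/NumPy for illustration (the paper itself refers to a MATLAB implementation). The frame construction, FFT size and lifter cutoff n_ceps are illustrative assumptions rather than values taken from the cited works.

```python
import numpy as np

def cepstral_envelope(x, n_fft=1024, n_ceps=30):
    """Smooth spectral envelope via low-pass liftering of the log-amplitude
    spectrum (classic cepstrum method, cf. [3])."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), n_fft))
    log_spec = np.log(spec + 1e-12)          # log-amplitude spectrum, treated as a signal
    ceps = np.fft.irfft(log_spec)            # real cepstrum
    lifter = np.zeros_like(ceps)
    lifter[:n_ceps] = 1.0                    # keep only the slow spectral fluctuations
    lifter[-(n_ceps - 1):] = 1.0             # symmetric half of the low-pass lifter
    smooth_log = np.fft.rfft(ceps * lifter).real
    return np.exp(smooth_log)                # smoothed magnitude envelope

if __name__ == "__main__":
    # Synthetic voiced-like frame: harmonics of a 150 Hz fundamental at 8 kHz
    fs = 8000
    t = np.arange(512) / fs
    x = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 10))
    env = cepstral_envelope(x)
    print(env.shape)
```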
The discrete cepstrum method [4] is an extension of the cepstrum method and estimates cepstral coefficients. Here, the spectral envelope is computed from discrete points in the frequency-amplitude plane. The discrete cepstrum generates a smoothly interpolated curve which tries to link the existing local peaks. Another method, discrete all-pole modelling [5], overcomes the limitations of LPC and gives better all-pole spectral envelopes. However, all the above existing methods of speech modelling involve independent estimations of the pitch period and the spectral envelope. As the pitch period changes, the spectral envelope also changes [1]. For a better estimation, therefore, both F0 and the spectral envelope need to be estimated jointly and iteratively.

The rest of the paper is organized as follows. The joint estimation methodology is discussed in Section II along with the iterative algorithm. The simulation results and various verification test cases are discussed in Section III. The paper is concluded in Section IV.

II. METHODOLOGY

A. Speech Spectrum Modeling

The vocal tract impulse response h(t) is excited by the excitation s(t), producing a short-time segment of a speech signal y(t), as shown in Fig. 1. The corresponding model is given in (1), where w(t) represents a window function.

Fig. 1: Speech spectrum model (excitation s(t) driving the vocal tract h(t), followed by windowing with w(t))

y(t) = (s(t) * h(t)) w(t)    (1)

Different types of windows, such as the Hamming, rectangular, Hanning, Blackman and Gaussian windows, can be assumed. Taking the Fourier transform on both sides of (1) gives (2).

Y(ω) = (S(ω)H(ω)) * W(ω)    (2)

The excitation signal s(t) is assumed to be an impulse sequence with pitch period T, given by (3); its Fourier transform, given in (4), is again an impulse train. In (4), µ = 2π/T is the F0 parameter.

s(t) = Σ_{n=−N/2}^{N/2} δ(t − nT)    (3)

S(ω) = √µ Σ_{n=−N/2}^{N/2} δ(ω − nµ)    (4)

Correspondingly, the product of S(ω) and the vocal tract frequency response H(ω), convolved with the window frequency response W(ω), gives the complex spectrum Y(ω) of a short-time segment of a voiced speech signal as in (5).

Y(ω) = √µ Σ_n H(nµ) W(ω − nµ)    (5)

Approximating the power spectrum of y(t) to model the speech signal, we rewrite (5) as (6).

|Y(ω)|² ≅ µ Σ_n |H(nµ)|² |W(ω − nµ)|²    (6)

Assuming a Gaussian window function w(t), |W(ω)|² is a Gaussian function given by (7).

|W(ω)|² = (1/√(2πσ²)) exp(−ω²/(2σ²))    (7)

Using the Gaussian Mixture Model (GMM) concept, the spectral envelope function |H(ω)|² is written as in (8). In (8), η, w_m, ρ_m and τ_m are model parameters.

|H(ω)|² ≡ η Σ_{m=1}^{M} (w_m / √(2πτ_m²)) exp(−(ω − ρ_m)² / (2τ_m²))    (8)

Substituting (7) and (8) in (6), the speech power spectrum can be modelled [1] as in (9). In (9), µ and σ are also model parameters, and y_n,m(ω) denotes the contribution of the n-th harmonic and the m-th envelope Gaussian.

|Y(ω)|² = Σ_n Σ_m y_n,m(ω),
y_n,m(ω) = (µ η w_m / (2π σ τ_m)) exp(−(nµ − ρ_m)² / (2τ_m²)) exp(−(ω − nµ)² / (2σ²))    (9)
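The model in (7)-(9) can be illustrated with a short sketch that evaluates the GMM envelope at the harmonic frequencies nµ and broadens each harmonic with the Gaussian window. It is written in Python/NumPy for illustration, assumes the notation introduced above (η, w_m, ρ_m, τ_m, µ, σ), and uses only positive harmonic indices n = 1, …, N as a simplification. The parameter values in the example are in the range used later for the verification test cases.

```python
import numpy as np

def gmm_envelope(omega, eta, w, rho, tau):
    """Spectral envelope |H(w)|^2 as a scaled Gaussian mixture, eq. (8)."""
    omega = np.atleast_1d(omega)[:, None]
    comps = w / np.sqrt(2 * np.pi * tau**2) * np.exp(-(omega - rho)**2 / (2 * tau**2))
    return eta * comps.sum(axis=1)

def model_power_spectrum(omega, eta, w, rho, tau, mu, sigma, N=30):
    """Model |Y(w)|^2 of eq. (9): harmonics at n*mu carry the envelope value
    |H(n*mu)|^2 and are broadened by the Gaussian window |W(w - n*mu)|^2."""
    n = np.arange(1, N + 1)
    env_at_harm = gmm_envelope(n * mu, eta, w, rho, tau)            # |H(n*mu)|^2
    win = np.exp(-(omega[:, None] - n * mu)**2 / (2 * sigma**2))    # |W(w - n*mu)|^2
    win /= np.sqrt(2 * np.pi * sigma**2)
    return mu * (win * env_at_harm).sum(axis=1)

# Example: five envelope Gaussians and a fundamental of about 0.63 rad/sec
omega = np.linspace(0, 2 * np.pi, 2048)
w   = np.array([0.628, 0.691, 0.565, 0.628, 0.502])
rho = np.array([0.8, 1.88, 4.7, 5.65, 5.9])
tau = np.array([0.2, 0.2, 0.1, 0.2, 0.3])
Y2 = model_power_spectrum(omega, eta=1.0, w=w, rho=rho, tau=tau, mu=0.6283, sigma=0.06)
```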
B. Parameter Estimation

Estimation of the total of six model parameters η, {w_m}, {ρ_m}, {τ_m}, µ and σ can be carried out iteratively as described here. To obtain the model parameters, we proceed by minimizing a distortion measure, Csiszar's I-divergence, between the observed power spectrum |F(ω)|² and the model in (9). This requires maximization of the term given in (10). The update equations for all six model parameters can then be obtained by equating the partial derivative of (10) with respect to each model parameter to zero. Observe that (10) incorporates the weighting functions λ_n,m(ω) of (11), which are to be chosen subject to the restriction ∀ω: Σ_n Σ_m λ_n,m(ω) = 1. In the iterative algorithm the weighting functions are updated using (11), because in each iteration y_n,m(ω) is determined by the current model parameter values.

Q = Σ_n Σ_m ∫ [ λ_n,m(ω) |F(ω)|² log( y_n,m(ω) / λ_n,m(ω) ) − y_n,m(ω) ] dω    (10)

λ_n,m(ω) = y_n,m(ω) / Σ_n' Σ_m' y_n',m'(ω)    (11)

Parameters µ and ρ_m: Maximizing (10) with respect to these parameters yields the coupled update equations (12) and (13). Their computation requires the quantities a, b_m, c_m and d defined in (14), which are λ_n,m(ω)-weighted integrals of the observed power spectrum |F(ω)|² over ω and over the harmonic index n. These can be organized into the matrix equation (15), so that the new values of µ and ρ_1, …, ρ_M are obtained with one matrix inversion and one multiplication.

Parameters w_m, τ_m, σ and η: These model parameters are calculated along similar lines by maximizing (10); their solutions are the closed-form expressions (16), (17), (18) and (19), in which each parameter is re-estimated from λ_n,m(ω)-weighted moments of |F(ω)|², and the gain η matches the total model power to F. In these expressions F = ∫ |F(ω)|² dω, the index t refers to the iteration cycle, and M and N are the numbers of Gaussians used in the envelope and excitation approximations respectively.

C. Algorithm of the joint estimator

1. Given data: a short-term speech signal whose power spectrum |Y(ω)|² is computed.
2. Initialization: initialize arbitrary values for {µ, σ, η, ∪_m {w_m, ρ_m, τ_m}}; choose N and M.
3. For iteration num = 1 to 100 (say):
   a. Calculate F from the power spectrum.
   b. Substituting the current parameters, find y_n,m(ω) using (9).
   c. Compute λ_n,m(ω) using (11).
   d. Calculate a, b_1 to b_M, c_1 to c_M and d using (14).
   e. Frame the matrix equation (15) and compute the new ρ_m and µ.
   f. Compute the new w_m and τ_m for all m = 1 to M, together with σ and η, using (16) to (19).
4. Exit condition: the model parameters converge.
5. Substitute the final model parameters in (8) to find the estimated spectral envelope |Ĥ(ω)|². The parameter µ is directly the pitch frequency.

D. Performance Analysis of Joint Estimator

The error between the actual spectral envelope and the estimated spectral envelope should be very small. The measure used to evaluate the estimator performance is the spectral distortion (SD), defined in (20), where i refers to the frequency-bin index, |H(ω_i)| is the true spectral envelope and |Ĥ(ω_i)| is the spectral envelope estimate.

SD = Σ_i ( |H(ω_i)| − |Ĥ(ω_i)| )²    (20)
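The loop of Section II-C can be illustrated with the following simplified sketch, again in Python/NumPy. The responsibilities follow (11), but the exact update equations (12)-(19) are replaced here by λ-weighted moment updates, so this is an EM-flavoured stand-in that shows the structure of the iteration rather than the paper's derived formulas; the function name, initialization and numerical safeguards are illustrative assumptions.

```python
import numpy as np

def joint_estimate(F, omega, M=5, N=30, n_iter=100, mu0=0.5, sigma0=0.05, seed=0):
    """Simplified joint estimator sketch: F is the observed power spectrum
    |F(w)|^2 sampled on the angular-frequency grid `omega`; returns the six
    model parameters (mu, sigma, eta, w, rho, tau)."""
    rng = np.random.default_rng(seed)
    mu, sigma, eta = mu0, sigma0, 1.0
    w, tau = np.full(M, 1.0 / M), np.full(M, 0.2)
    rho = rng.uniform(omega.min(), omega.max(), M)
    dw = omega[1] - omega[0]
    Ftot = F.sum() * dw                       # F = integral of |F(w)|^2 dw
    n = np.arange(1, N + 1)
    eps = 1e-30
    for _ in range(n_iter):
        # y_{n,m}(w), eq. (9): envelope Gaussian at n*mu times window Gaussian
        env = w / np.sqrt(2 * np.pi * tau**2) * np.exp(
            -((n[:, None] * mu - rho)**2) / (2 * tau**2))              # (N, M)
        win = np.exp(-((omega[:, None] - n * mu)**2) / (2 * sigma**2)) \
            / np.sqrt(2 * np.pi * sigma**2)                            # (K, N)
        y = mu * eta * win[:, :, None] * env[None, :, :]               # (K, N, M)
        lam = y / (y.sum(axis=(1, 2), keepdims=True) + eps)            # eq. (11)
        R = lam * F[:, None, None] * dw        # lambda-weighted observed power
        mass_nm = R.sum(axis=0)                # power assigned to component (n, m)
        mass_n, mass_m = mass_nm.sum(axis=1), mass_nm.sum(axis=0)
        # F0: weighted least-squares fit of per-harmonic centroids c_n ~ n*mu
        c_n = (R * omega[:, None, None]).sum(axis=(0, 2)) / (mass_n + eps)
        mu = (mass_n * n * c_n).sum() / ((mass_n * n**2).sum() + eps)
        # envelope parameters as weighted moments of the harmonic frequencies
        rho = (mass_nm * (n[:, None] * mu)).sum(axis=0) / (mass_m + eps)
        tau = np.sqrt((mass_nm * (n[:, None] * mu - rho)**2).sum(axis=0)
                      / (mass_m + eps)) + 1e-3
        w = mass_m / (mass_m.sum() + eps)
        # window width from the spread of power around each harmonic
        sigma = np.sqrt((R * (omega[:, None, None] - (n * mu)[None, :, None])**2).sum()
                        / (R.sum() + eps)) + 1e-4
        # crude gain re-fit: match total model power to the observed total power
        # (uses the envelope evaluated at the start of this iteration)
        eta = Ftot / (mu * env.sum() + eps)
    return dict(mu=mu, sigma=sigma, eta=eta, w=w, rho=rho, tau=tau)

# Typical use: F = |Y(w)|^2 of a short-time frame, omega = grid on which F is sampled,
# with mu0 set near the expected fundamental frequency on that grid.
```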
III. RESULTS AND DISCUSSION

The speech modelling used in this paper comprises Gaussian Mixture Models for the spectral envelope and the power spectrum, as given in equations (8) and (9) respectively, and the overall model has six model parameters. Choosing arbitrary values for the model parameters, with M = 5 and N = 30, four known spectral envelopes |H(ω)|² are generated. Substituting them in (9), the corresponding power spectra |Y(ω)|² are also generated. To verify the correctness of the algorithm and its MATLAB code, each of the four power spectra is provided to the joint estimation algorithm of Section II, and the spectral envelope |Ĥ(ω)|² is estimated in all four cases. The estimated and the corresponding true spectral envelopes are compared graphically, and the SD is calculated in each case. The results for these four test cases are shown in Fig. 2 to Fig. 5; the actual and estimated parameters are compared in Table I to Table IV, and the SD value is indicated at the bottom of each table.

Fig. 2: Spectral envelope for test case 1 (power spectrum with actual and estimated spectral envelopes; normalized power density versus frequency in rad/sec)

TABLE I. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 1

Model parameter | Actual values | Estimated values
µ   | 0.6283 | 0.3135
σ   | 0.06   | 0.0595
w_m | 0.628, 0.691, 0.565, 0.628, 0.502 | 0.45, 0.65, 0.56, 0.42, 0.34
ρ_m | 0.8, 1.88, 4.7, 5.65, 5.9 | 0.83, 1.87, 4.80, 5.52, 5.84
τ_m | 0.2, 0.2, 0.1, 0.2, 0.3 | 0.18, 0.25, 0.16, 0.18, 0.20
η   | 1 | 0.994
SD = 5.0339e-005

Fig. 3: Spectral envelope for test case 2 (same axes and legend as Fig. 2)

TABLE II. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 2

Model parameter | Actual values | Estimated values
µ   | 0.5654 | 0.5087
σ   | 0.05   | 0.1243
w_m | 1.256, 1.131, 0.37, 0.72, 0.69 | 0.57, 0.27, 0.77, 0.27, 0.43
ρ_m | 0.628, 0.942, 4.08, 4.39, 5.96 | 1.23, 0.86, 4.19, 4.30, 5.84
τ_m | 0.3, 0.15, 0.25, 0.1, 0.3 | 0.14, 0.17, 0.18, 0.25, 0.24
η   | 1 | 0.9984
SD = 6.7369e-005

Fig. 4: Spectral envelope for test case 3 (same axes and legend as Fig. 2)

TABLE III. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 3

Model parameter | Actual values | Estimated values
µ   | 0.5026 | 0.2512
σ   | 0.0560 | 0.0560
w_m | 0.691, 1.256, 0.502, 0.816, 0.565 | 0.54, 0.60, 0.25, 0.96, 0.55
ρ_m | 0.94, 1.25, 4.08, 4.39, 5.96 | 0.81, 2.17, 3.77, 4.73, 5.82
τ_m | 0.2, 0.25, 0.05, 0.15, 0.35 | 0.31, 0.15, 0.07, 0.21, 0.24
η   | 1 | 0.9840
SD = 1.2880e-004

Fig. 5: Spectral envelope for test case 4 (same axes and legend as Fig. 2)

TABLE IV. ACTUAL, ESTIMATED MODEL PARAMETERS FOR TEST CASE 4

Model parameter | Actual values | Estimated values
µ   | 0.4398 | 0.3005
σ   | 0.0430 | 0.1537
w_m | 0.75, 0.31, 0.56, 1.82, 0.62 | 0.61, 0.34, 0.72, 0.54, 0.42
ρ_m | 3.45, 2.51, 5.34, 1.57, 3.89 | 1.06, 2.51, 3.55, 4.20, 5.42
τ_m | 0.15, 0.25, 0.1, 0.25, 0.35 | 0.091, 0.273, 0.161, 0.079, 0.395
η   | 1 | 1.003
SD = 5.8781e-005
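The comparison carried out for each test case can be reproduced in outline as follows: both parameter sets are substituted into (8) and the spectral distortion of (20) is computed on a frequency grid. The sketch below (Python/NumPy) uses the Table I values; whether and how the envelopes are normalized before comparison, and whether the τ_m values are standard deviations, are assumptions made here, so the printed value need not reproduce the SD quoted in the table.

```python
import numpy as np

def envelope(omega, eta, w, rho, tau):
    """|H(w)|^2 as the scaled Gaussian mixture of eq. (8)."""
    comps = w / np.sqrt(2 * np.pi * tau**2) * np.exp(-(omega[:, None] - rho)**2 / (2 * tau**2))
    return eta * comps.sum(axis=1)

def spectral_distortion(H2_true, H2_est):
    """Sum of squared differences between true and estimated envelope magnitudes, eq. (20),
    averaged over the frequency bins."""
    return np.mean((np.sqrt(H2_true) - np.sqrt(H2_est))**2)

omega = np.linspace(0, 2 * np.pi, 1024)
# Actual and estimated parameter sets of test case 1 (Table I)
H_act = envelope(omega, 1.0,
                 np.array([0.628, 0.691, 0.565, 0.628, 0.502]),
                 np.array([0.8, 1.88, 4.7, 5.65, 5.9]),
                 np.array([0.2, 0.2, 0.1, 0.2, 0.3]))
H_est = envelope(omega, 0.994,
                 np.array([0.45, 0.65, 0.56, 0.42, 0.34]),
                 np.array([0.83, 1.87, 4.80, 5.52, 5.84]),
                 np.array([0.18, 0.25, 0.16, 0.18, 0.20]))
print("SD =", spectral_distortion(H_act, H_est))
```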
Next, a short-term speech signal y(t) is generated as shown in Fig. 6. Here s(t) is an impulse train with a pitch period of 100, and the vocal tract frequency response is created using the GMM model in (8). The output of this vocal tract, when excited by the impulse-train input, is passed through a Gaussian window, and the resultant time-domain signal is the generated short-term speech signal y(t). The corresponding power spectrum is computed and provided to the joint estimator of Section II. The actual and estimated spectral envelopes are compared in Fig. 7, the model parameters are compared in Table V, and the corresponding SD is shown below Table V.

Fig. 6: Generation of speech signal (impulse-train excitation, vocal tract filter and Gaussian window)

TABLE V. ACTUAL, ESTIMATED VALUES OF FOUR MODEL PARAMETERS

Model parameter | Actual values | Estimated values
w_m | 100, 110, 90, 100, 80 | 71.7, 105.30, 87.12, 72.67, 54.39
ρ_m | 100, 300, 750, 900, 950 | 118.5, 300.5, 743.82, 853.7, 928
τ_m | 0.2, 0.2, 0.1, 0.2, 0.3 | 0.185, 0.246, 0.125, 0.177, 0.265
η   | 1 | 0.9970
SD = 9.0675e-005

Fig. 7: Spectral envelope estimation for speech signal 1 (power spectrum with actual and estimated spectral envelopes)

Another short-term speech signal y(t) is generated as in Fig. 6, but the vocal tract is modelled using the all-pole filter in (21). The input excitation has a pitch frequency of 100 Hz. In this case the generated speech signal's power spectrum is computed and provided to the joint estimator of Section II. The estimated model parameters and the obtained spectral envelope are shown in Table VI and Fig. 8 respectively, and the corresponding SD is shown below Table VI.

H(z) = Π_{k=1}^{5} (1 − b_k z⁻¹) / Π_{k=1}^{5} (1 − a_k z⁻¹)    (21)

In (21), b_1 = … = b_5 = 0, a_1 = a_2* = 0.4225 + 0.7529j, a_3 = a_4* = −0.5026 + 0.5976j and a_5 = 0.6602 are used.

Fig. 8: Spectral envelope estimation for speech signal 2 (power spectrum with actual and estimated spectral envelopes)

TABLE VI. ESTIMATED VALUES OF SIX MODEL PARAMETERS

Model parameter | Actual values | Estimated values
µ   | NA | 99.8373
σ   | NA | 9.9428
w_m | NA | 87.14, 125.6, 81.95, 54.33, 88.73
ρ_m | NA | 144.6, 256, 636.9, 807.51, 886.19
τ_m | NA | 0.327, 0.172, 0.082, 0.193, 0.225
η   | NA | 0.9881
SD = 0.0167
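The generation of the synthetic speech signals described above can be sketched as follows (Python with SciPy for illustration): an impulse-train excitation is filtered by the all-pole vocal tract of (21) and the result is Gaussian-windowed, as in Fig. 6, before its power spectrum is handed to the estimator. The sampling rate, signal length and window width are illustrative assumptions, and the pitch period is taken as 100 samples, following the description of speech signal 1.

```python
import numpy as np
from scipy import signal

# All-pole vocal tract of eq. (21): zeros at the origin, the five poles given in the text
poles = [0.4225 + 0.7529j, 0.4225 - 0.7529j,
         -0.5026 + 0.5976j, -0.5026 - 0.5976j, 0.6602]
a = np.poly(poles).real          # denominator coefficients of H(z)
b = [1.0]                        # numerator (all b_k = 0, i.e. zeros at z = 0)

# Impulse-train excitation; 100-sample pitch period per the paper's description
T0, L = 100, 2048                # signal length L is an illustrative assumption
s = np.zeros(L)
s[::T0] = 1.0

# Vocal-tract filtering followed by a Gaussian analysis window (Fig. 6)
speech = signal.lfilter(b, a, s)
win = signal.windows.gaussian(L, std=L / 6)
y = speech * win

# Short-term power spectrum handed to the joint estimator
Y2 = np.abs(np.fft.rfft(y))**2
```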
IV. CONCLUSION

Speech synthesis applications demand that speech signal models be devised so as to produce artificial speech signals. In such speech models, a joint estimator of both the fundamental frequency and the spectral envelope, based on a parametric source-filter model of speech, is discussed and applied in this paper. Some known speech signals are generated, and their spectral envelopes and fundamental frequencies are estimated using the joint estimator. The results show that the joint estimator indeed provides good estimates of the spectral envelope and the fundamental frequency. For any recorded speech signal, the estimator can be applied to obtain the model parameters, and the speech can then be reproduced using a summation series of Gabor functions. The comparison of recorded and reproduced speech signals for audibility and clarity forms the future scope of this paper.

REFERENCES

[1] H. Kameoka, "Speech spectrum modeling for joint estimation of spectral envelope and fundamental frequency," IEEE Trans. Audio, Speech, and Language Processing, vol. 18, Aug. 2010.
[2] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Amer., vol. 55, 1974.
[3] A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. Audio and Electroacoustics, vol. AU-16, Jun. 1968.
[4] O. Cappé and E. Moulines, "Regularized estimation of cepstrum envelope from discrete frequency points," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1995.
[5] A. El-Jaroudi and J. Makhoul, "Discrete all-pole modeling," IEEE Trans. Signal Processing, vol. 39, no. 2, pp. 411-423, Feb. 1991.
[6] D. Giacobello et al., "Sparse linear predictors for speech processing," in Proc. Interspeech, 2008.
[7] R. Badeau and B. David, "Weighted maximum likelihood autoregressive and moving average spectrum modeling," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP'08), 2008, pp. 3761-3764.
[8] W. J. Hess, "Pitch and voice determination," in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds. New York: Marcel Dekker, 1992, pp. 3-48.
[9] K. Lange, D. Hunter and I. Yang, "Optimization transfer using surrogate objective functions," J. Comput. Graph. Statist., vol. 9, no. 1, pp. 1-20, 2000.