EEL6586 Project
Speech Analysis and Synthesis Using a Sinusoidal Model
Yin Zhu, Xiaochuan Guo

1. Introduction
There are many ways to model a speech signal. One common family of approaches represents speech as the result of passing a glottal excitation through a time-varying linear filter that models the characteristics of the vocal tract. Another family uses a set of narrowband signals to represent the speech signal. The phase vocoder may be the earliest of these; in it, the absolute phase information is lost. Multiband excitation (MBE) synthesizes the voiced signal using sine waves, and some researchers have also used sine waves to represent the LPC residual. In our project, we represent the speech signal by a linear combination of sinusoids with time-varying amplitudes, phases, and frequencies. The parameters of these sine waves are interpolated to obtain a maximally smooth output, and high-quality speech is reconstructed.

2. The Sinusoidal Model
The speech production model is illustrated in Figure 1. In LPC, the excitation signal, e(t), is usually represented as a periodic pulse train during voiced speech and as a noise-like signal during unvoiced speech.

[Figure 1: Speech production sinusoidal model]

In the sinusoidal model, the binary voiced/unvoiced excitation is replaced by a sum of sine waves. The motivation for this sine-wave representation is that the voiced excitation, when perfectly periodic, can be represented by a Fourier series decomposition in which each harmonic component corresponds to a single sine wave. Unvoiced excitation can also be represented by a set of sine waves, provided their frequencies are 'close enough'. Passing these sine waves through the time-varying vocal tract yields the sinusoidal representation of the speech waveform:

$$s(n) = \sum_{l=1}^{L} A_l \cos(\omega_l n + \theta_l)$$

where $A_l$ and $\theta_l$ are the amplitude and phase of the sine-wave component associated with the frequency track $\omega_l$, and $L$ is the number of sine waves.

3. Estimation of Speech Parameters
In sinusoidal analysis-synthesis, the sine-wave parameters must be extracted so that they approximate the original speech as closely as possible. The optimal solution to this parameter estimation problem is difficult to obtain. One heuristic solution is to extract the sine-wave parameters from the short-time Fourier transform (STFT): if the signal is purely periodic, the amplitude, phase, and frequency can be obtained directly from the STFT. For voiced sounds, the periodogram peaks reveal the underlying sine waves: the locations of the selected peaks give the sine-wave frequencies, and the peak values give the sine-wave amplitudes. When the speech is not perfectly voiced, the periodogram still exhibits a multiplicity of peaks, and their frequencies can be used to identify the underlying sine-wave structure. From the Karhunen-Loève expansion of noise-like signals, a sinusoidal representation is valid provided the frequencies are 'close enough' that the ensemble power spectral density changes slowly over consecutive frequencies. A 20 ms analysis window provides a sufficiently dense frequency sampling to satisfy this constraint.
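To make the peak-picking step concrete, here is a minimal Python sketch of extracting the sine-wave amplitudes, frequencies, and phases from one windowed frame. It is illustrative rather than our actual analysis code: the Hamming window and 1024-point DFT follow Section 6, but the simple local-maximum peak test and the `top_k` pruning of weak peaks are assumptions of this sketch.

```python
import numpy as np

def analyze_frame(frame, fs, n_fft=1024, top_k=80):
    """Extract sine-wave parameters (amplitude, frequency, phase)
    from one speech frame by picking peaks of the STFT magnitude.

    A minimal sketch of the peak picking in Sections 3 and 6; the
    local-maximum test and top_k pruning are illustrative choices.
    """
    w = np.hamming(len(frame))
    spectrum = np.fft.rfft(frame * w, n_fft)
    mag = np.abs(spectrum)

    # A bin is a peak if it exceeds both of its neighbors.
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]

    # Keep only the strongest peaks to bound the number of sine waves L.
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:top_k]
    peaks = np.array(sorted(peaks))

    amps = mag[peaks] * 2.0 / np.sum(w)   # compensate for window gain
    freqs = peaks * fs / n_fft            # bin index -> Hz
    phases = np.angle(spectrum[peaks])    # known only modulo 2*pi
    return amps, freqs, phases
```

Note that the phases returned here are known only modulo $2\pi$, which is why the cubic phase unwrapping of Section 5 is needed at synthesis time.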
4. Sine Wave Track
As speech evolves from frame to frame, different sets of these parameters are obtained. The locations of the peaks change as the pitch changes, and there are rapid changes in both the location and the number of peaks in rapidly varying regions of speech, such as at voiced/unvoiced transitions.

[Figure 2: Wave track — frequency tracks at the boundary of the 1st and 2nd frames; frequency versus frame index, with a frame-k peak frequency $\omega_n^k$ matched to a frame-(k+1) peak frequency $\omega_m^{k+1}$]

To define a sine-wave track in which parameters are matched across frames, the concepts of 'birth' and 'death' are introduced. As shown in Figure 2, if the frequencies in successive frames fall within some matching interval, they are assigned to the same wave track; otherwise, a wave track is either 'born' or 'dies'.

5. Parameter Interpolation
After the frequency matching, the amplitude, frequency, and phase extracted from the STFT of frame k are associated with a corresponding set of parameters for frame k+1. Letting $\{A_l^k, \omega_l^k, \theta_l^k\}$ and $\{A_l^{k+1}, \omega_l^{k+1}, \theta_l^{k+1}\}$ denote the successive sets of parameters for the l-th frequency track, the amplitude interpolation is simply linear (dropping the track index l for brevity):

$$\hat{A}(n) = A^k + (A^{k+1} - A^k)\,\frac{n}{T}$$

where $T$ is the frame length. The phase interpolation is more complicated because the phase obtained from the STFT is known only modulo $2\pi$. Thus, phase unwrapping must be performed to ensure that the frequency tracks are 'maximally smooth' across frame boundaries. A cubic polynomial is used to interpolate the phase,

$$\tilde{\theta}(n) = \zeta + \gamma n + \alpha n^2 + \beta n^3,$$

where a term $2\pi M$, with integer $M$, accounts for the phase unwrapping. The constraints are that the cubic phase function and its derivative equal the STFT phase and frequency at the frame boundaries: $\tilde{\theta}(0) = \theta^k$ and $\tilde{\theta}'(0) = \omega^k$ give $\zeta = \theta^k$ and $\gamma = \omega^k$, while $\tilde{\theta}(T) = \theta^{k+1} + 2\pi M$ and $\tilde{\theta}'(T) = \omega^{k+1}$ yield

$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} 3/T^2 & -1/T \\ -2/T^3 & 1/T^2 \end{bmatrix} \begin{bmatrix} \theta^{k+1} - \theta^k - \omega^k T + 2\pi M \\ \omega^{k+1} - \omega^k \end{bmatrix}$$

where $T$ is the frame length. The phase unwrapping parameter $M$ is then chosen to make the unwrapped phase maximally smooth.

6. Sinusoidal Analysis and Synthesis
A block diagram of the analysis is given in Figure 3. The speech is passed through a Hamming window whose length is at least two and a half times the pitch period, which ensures that the harmonic structure is resolved accurately. A 1024-point DFT is taken, and the peaks of the DFT magnitude are picked. The amplitudes, phases, and frequencies at these peaks are transmitted.

[Figure 3: Sinusoidal analysis]

[Figure 4: Sinusoidal synthesis]

Figure 4 shows the synthesis diagram. As described in the previous section, the phases and frequencies are interpolated along the wave tracks, and the output speech is the sum of the resulting sine waves.

7. Experimental Results
At the first stage, we did not use phase interpolation. The resulting sound is highly intelligible, but a roughness can be heard due to the sudden changes of the sine-wave parameters at the frame boundaries. After parameter interpolation, the synthesized sound is almost indistinguishable from the original. As the figure above shows, parameter interpolation smooths the speech signal at the frame boundaries. As a result, high-quality speech is synthesized.
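To illustrate the interpolation and synthesis of Sections 5 and 6, the sketch below implements the cubic phase interpolation and the per-track sine-wave generation in Python. It is a minimal sketch rather than our actual synthesizer: the function names are ours, frequencies are assumed to be in radians per sample, and $M$ is chosen in closed form as the integer closest to the value that makes the unwrapped end phase agree with integrating the average of the two boundary frequencies, consistent with the 'maximally smooth' criterion.

```python
import numpy as np

def cubic_phase(theta_k, omega_k, theta_k1, omega_k1, T):
    """Cubic phase interpolation of Section 5 for one frequency track.

    theta_k, theta_k1 : STFT phases at frames k and k+1 (modulo 2*pi)
    omega_k, omega_k1 : peak frequencies in radians per sample
    T                 : frame length in samples
    Returns the instantaneous phase theta(n) for n = 0 .. T-1.
    """
    # Unwrapping integer M: closest integer to the value that makes the
    # end phase match the phase predicted by the average frequency.
    m_star = (theta_k + omega_k * T - theta_k1
              + (omega_k1 - omega_k) * T / 2.0) / (2.0 * np.pi)
    M = np.round(m_star)

    # Solve the 2x2 system of Section 5 for alpha(M) and beta(M).
    rhs1 = theta_k1 - theta_k - omega_k * T + 2.0 * np.pi * M
    rhs2 = omega_k1 - omega_k
    alpha = 3.0 / T**2 * rhs1 - 1.0 / T * rhs2
    beta = -2.0 / T**3 * rhs1 + 1.0 / T**2 * rhs2

    n = np.arange(T)
    return theta_k + omega_k * n + alpha * n**2 + beta * n**3

def synthesize_track(A_k, A_k1, theta):
    """Linear amplitude interpolation (Section 5) applied to one track;
    the synthesizer of Section 6 sums such tracks over all matched peaks."""
    n = np.arange(len(theta))
    A = A_k + (A_k1 - A_k) * n / len(theta)
    return A * np.cos(theta)
```

The output frame is then the sum of `synthesize_track` over all live tracks. One natural convention for the 'birth' and 'death' events of Section 4, assumed here since the report does not spell it out, is to ramp a newborn track in from zero amplitude and to ramp a dying track out to zero.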