
EEL6586 Project
Speech Analysis and Synthesis Using Sinusoidal Model
Yin Zhu, Xiaochuan Guo
1. Introduction
There are many ways to model a speech signal. Many approaches represent
speech as the result of passing a glottal excitation through a time-varying linear filter that models the characteristics of the vocal tract.
A number of approaches use a set of narrow-band signals to represent
the speech signal. The phase vocoder may be the first of these; in it, the
absolute phase information is lost. Multiband excitation (MBE) synthesizes
voiced speech using sine waves. Some researchers have also used sine waves
to represent the LPC residual.
In our project, we represent the speech signal as a linear combination of
sinusoids with time-varying amplitudes, phases and frequencies. The
parameters of these sine waves are interpolated to obtain a maximally smooth
output, from which high-quality speech is reconstructed.
2. The Sinusoidal Model
The speech production model is illustrated in Fig. 1. In LPC, the excitation
signal, e(t), is usually represented as a periodic pulse train during voiced
speech and as a noise-like signal during unvoiced speech.
Figure 1 Speech production sinusoidal model
In the sinusoidal model, the binary voiced-unvoiced excitation is replaced by
a sum of sine waves. The motivation for this sine wave representation is that
the voiced excitation, when perfectly periodic, can be represented by a
Fourier series decomposition in which each harmonic component
corresponds to a single sine wave. An unvoiced excitation
can also be represented by a set of sine waves, provided their
frequencies are ‘close enough’.
Passing these sine waves through the time-varying vocal tract results in the
sinusoidal representation for the speech waveform, which is given by:
$$s(n) = \sum_{l=1}^{L} A_l \cos(\omega_l n + \theta_l)$$
where $A_l$ and $\theta_l$ represent the amplitude and phase of the sine wave
component associated with the frequency track $\omega_l$, and $L$ is the number of
sine waves.
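As a minimal sketch of the sum above for fixed (not yet time-varying) parameters, the model can be evaluated directly. The function name `synthesize_sinusoids` and the use of NumPy are assumptions made for illustration; frequencies are taken in radians per sample.

```python
import numpy as np

def synthesize_sinusoids(amps, freqs, phases, num_samples):
    """Evaluate s(n) = sum_l A_l * cos(omega_l * n + theta_l), n = 0..num_samples-1.

    freqs are in radians per sample (an assumption of this sketch).
    """
    n = np.arange(num_samples)
    # Shape (L, 1) so that each sine wave broadcasts against the time axis.
    amps = np.asarray(amps, dtype=float)[:, None]
    freqs = np.asarray(freqs, dtype=float)[:, None]
    phases = np.asarray(phases, dtype=float)[:, None]
    return np.sum(amps * np.cos(freqs * n + phases), axis=0)
```

For example, `synthesize_sinusoids([1.0, 0.5], [0.1, 0.2], [0.0, 0.0], 160)` returns 160 samples of a two-component signal.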
3. Estimation of Speech Parameters
In sinusoidal analysis-synthesis, the parameters of the sine waves should be
extracted so that they approximate the original speech as closely as possible. The
optimal solution to this parameter estimation problem is difficult to obtain.
One heuristic solution is to extract the sine wave parameters from the short-time
Fourier Transform (STFT). If the signal is purely periodic, the amplitude,
phase and frequency can be obtained from the STFT. For voiced speech, each
periodogram peak corresponds to an underlying sine wave: the
locations of the selected peaks give the sine wave frequencies and the peak
values give the sine wave amplitudes.
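The peak-picking step can be sketched as follows. This is an illustrative outline, not the project's code: the function name `pick_peaks`, the `max_peaks` limit and the amplitude normalisation by the window sum are assumptions, while the Hamming window and 1024-point DFT follow the analysis description later in this report.

```python
import numpy as np

def pick_peaks(frame, fft_size=1024, max_peaks=40):
    """Return (amps, freqs, phases) at the largest local maxima of the DFT magnitude.

    freqs are in radians per sample; phases refer to the first sample of the frame.
    """
    window = np.hamming(len(frame))
    spectrum = np.fft.rfft(frame * window, fft_size)  # zero-padded to fft_size
    mag = np.abs(spectrum)

    # Interior bins that are larger than the left neighbour and not smaller
    # than the right neighbour are treated as peaks.
    k = np.arange(1, len(mag) - 1)
    bins = k[(mag[k] > mag[k - 1]) & (mag[k] >= mag[k + 1])]

    # Keep only the strongest peaks, then restore ascending frequency order.
    strongest = np.argsort(mag[bins])[::-1][:max_peaks]
    bins = np.sort(bins[strongest])

    amps = 2.0 * mag[bins] / np.sum(window)  # undo window gain and the 1/2 of cos
    freqs = 2.0 * np.pi * bins / fft_size
    phases = np.angle(spectrum[bins])
    return amps, freqs, phases
```

The frequency resolution of this sketch is limited to one DFT bin; refining the peak location (for example by interpolating between bins) is left out to keep it short.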
When the speech is not perfectly voiced, the periodogram will still have
multiple peaks, and their frequencies can be used to identify the
underlying sine wave structure. From the Karhunen-Loeve expansion of
noise-like signals, a sinusoidal representation is valid provided the
frequencies are ‘close enough’ that the ensemble power spectral density
changes slowly over consecutive frequencies. A 20 ms window
provides a sufficiently dense sampling to satisfy the necessary constraints.
4. Sine Wave Track
As speech evolves from frame to frame, different sets of these parameters
will be obtained. The locations of the peaks will change as the pitch changes,
and there will be rapid changes in both the location and the number of peaks
corresponding to rapidly varying regions of speech, such as at voiced and
unvoiced transitions.
[Plot: frequency tracks on the boundary of the 1st and 2nd frames. Horizontal axis: Frames; vertical axis: Frequency. Marked frequencies: $\omega_n^k$ and $\omega_m^{k+1}$.]
Figure 2 Wave Track
To define a sine wave track in which the parameters are matched, the
concept of ‘birth’ and ‘death’ is introduced. As in Figure 2, if successive
frequencies fall within some matching interval, they are considered to belong to the
same wave track. Otherwise, a wave track is either ‘born’ or ‘dies’.
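The matching rule can be illustrated with a simplified greedy nearest-neighbour pass. This is a sketch under stated assumptions rather than the matching procedure used in the project: the function name `match_tracks`, the parameter `delta` (the half-width of the matching interval) and the greedy order are all hypothetical choices.

```python
def match_tracks(freqs_k, freqs_k1, delta):
    """Greedy matching of peak frequencies between frame k and frame k+1.

    Returns (matches, deaths, births):
      matches: list of (i, j) pairs, peak i in frame k continues as peak j in frame k+1
      deaths:  indices in frame k with no partner within delta (track dies)
      births:  indices in frame k+1 left unmatched (new track is born)
    """
    matches, deaths = [], []
    used = set()
    for i, f in enumerate(freqs_k):
        # Unclaimed peaks of frame k+1 that fall inside the matching interval.
        candidates = [
            (abs(g - f), j)
            for j, g in enumerate(freqs_k1)
            if j not in used and abs(g - f) <= delta
        ]
        if candidates:
            _, j = min(candidates)  # closest frequency wins
            matches.append((i, j))
            used.add(j)
        else:
            deaths.append(i)
    births = [j for j in range(len(freqs_k1)) if j not in used]
    return matches, deaths, births
```

With `match_tracks([100, 200, 300], [105, 290, 400], 20)`, the tracks at 100 and 300 continue, the track at 200 dies, and a new track is born at 400. A greedy pass can make a locally good but globally poor assignment, so a fuller implementation would resolve competing candidates before committing to a match.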
5. Parameters Interpolation
After frequency matching, the amplitude, phase and frequency extracted
from the STFT of frame k are associated with a corresponding set of parameters
for frame k+1. Letting $\{A(l,k), \theta(l,k), \omega(l,k)\}$ and $\{A(l,k+1), \theta(l,k+1), \omega(l,k+1)\}$
denote the successive sets of parameters for the lth frequency
track, the amplitude is interpolated linearly:
$$A(n) = A(k) + \big(A(k+1) - A(k)\big)\,\frac{n}{T}$$
where T is the frame length and n = 0, ..., T-1 is the sample index within the frame.
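A short sketch of the linear amplitude interpolation for one track over one frame; the function name `interpolate_amplitude` is an assumption, and the sample index is taken to run from 0 to T-1.

```python
import numpy as np

def interpolate_amplitude(a_k, a_k1, frame_len):
    """A(n) = A(k) + (A(k+1) - A(k)) * n / T for n = 0..T-1."""
    n = np.arange(frame_len)
    return a_k + (a_k1 - a_k) * n / frame_len
```

For instance, `interpolate_amplitude(1.0, 2.0, 4)` gives the values 1.0, 1.25, 1.5 and 1.75; the value 2.0 is reached at the first sample of the next frame.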
The phase interpolation is more complicated because the phase from the STFT is
obtained modulo $2\pi$. Thus, phase unwrapping must be performed to make
sure that the frequency tracks are ‘maximally smooth’ across frame
boundaries. A cubic polynomial is used to interpolate the phase,
$$\theta(n) = \zeta + \gamma n + \alpha n^2 + \beta n^3, \qquad \theta(T) = \theta(k+1) + 2\pi M$$
where the term $2\pi M$ (M an integer) accounts for the phase unwrapping. From the
constraints that the cubic phase function and its derivative equal the phase
and frequency of the STFT at the frame boundaries, $\zeta = \theta(k)$, $\gamma = \omega(k)$, and the remaining parameters are obtained:
$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} 3/T^2 & -1/T \\ -2/T^3 & 1/T^2 \end{bmatrix} \begin{bmatrix} \theta(k+1) - \theta(k) - \omega(k)T + 2\pi M \\ \omega(k+1) - \omega(k) \end{bmatrix}$$
where T is the frame length. The phase unwrapping parameter M is then
chosen to make the unwrapped phase maximally smooth.
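The cubic phase interpolation can be sketched as below. The report does not say how M is found; this sketch assumes the nearest-integer rule commonly given for this cubic-phase scheme, which picks the M that minimises the integrated squared second derivative of the phase. The function names are hypothetical, and frequencies are in radians per sample.

```python
import numpy as np

def cubic_phase_coeffs(theta_k, omega_k, theta_k1, omega_k1, T):
    """Return (alpha, beta, M) for theta(n) = theta_k + omega_k*n + alpha*n^2 + beta*n^3."""
    # Assumed rule for M: nearest integer to the real-valued smoothest solution.
    x = ((theta_k + omega_k * T - theta_k1)
         + (omega_k1 - omega_k) * T / 2.0) / (2.0 * np.pi)
    M = int(np.round(x))

    # Right-hand side of the 2x2 system, then the two matrix rows written out.
    d_theta = theta_k1 - theta_k - omega_k * T + 2.0 * np.pi * M
    d_omega = omega_k1 - omega_k
    alpha = 3.0 / T**2 * d_theta - d_omega / T
    beta = -2.0 / T**3 * d_theta + d_omega / T**2
    return alpha, beta, M

def interpolate_phase(theta_k, omega_k, theta_k1, omega_k1, T):
    """Unwrapped phase at samples n = 0..T-1 of the frame."""
    alpha, beta, _ = cubic_phase_coeffs(theta_k, omega_k, theta_k1, omega_k1, T)
    n = np.arange(T, dtype=float)
    return theta_k + omega_k * n + alpha * n**2 + beta * n**3
```

As a sanity check, when the frequency is constant and the measured end phase equals the start phase advanced by omega*T (modulo 2π), both alpha and beta come out as zero and the interpolated phase is a straight line.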
6. Sinusoidal Analysis and Synthesis
A block diagram for analysis is given in Figure 3.
The speech is multiplied by a Hamming window whose length is at least
two and a half times the pitch period, so that the harmonic structure is
resolved accurately. A 1024-point DFT is taken and the peaks of the DFT magnitude
are picked. The amplitudes, phases and frequencies at these peaks are transmitted.
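To make the window-length rule concrete, the following sketch computes the shortest window that spans two and a half pitch periods. The function name and the example values (a 100 Hz pitch at an 8 kHz sampling rate) are assumptions for illustration; the report does not state its sampling rate.

```python
import math

def analysis_window_length(pitch_hz, sample_rate_hz, periods=2.5):
    """Smallest window length in samples covering at least `periods` pitch periods."""
    return math.ceil(periods * sample_rate_hz / pitch_hz)
```

Under these assumed values, `analysis_window_length(100.0, 8000.0)` is 200 samples (25 ms), which fits inside the 1024-point DFT with zero-padding.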
Figure 3 Sinusoidal Analysis
Figure 4 Sinusoidal Synthesis
Figure 4 shows the synthesis diagram. As stated in the previous section, phase
and frequency interpolation is applied to the wave tracks. The output speech is
the sum of these sine waves.
7. Experimental Results
In the first stage, we did not use phase interpolation. The sound is highly
intelligible. However, roughness is audible due to the sudden changes
in the sine wave parameters at the frame boundaries. After parameter
interpolation, the synthesized sound is almost indistinguishable from the
original sound.
From the figure above, it is clear that parameter interpolation smooths the
speech signal at the frame boundaries. As a result, high-quality speech is
synthesized.