ENEE624 Advanced Digital Signal Processing
Project 2: Linear Prediction and Synthesis
Submitted by
Narayanan Ramanathan
&
Nagaraj P Anthapadhmanabhan
PROBLEM 1
OBJECTIVE: To determine the optimal segment length of the speech signal, called the frame
length, over which the speech signal can be assumed to be stationary.
SOLUTION:
An important requirement, when we use the linear prediction technique, is that the
signal should be stationary over the prediction period. Speech is a non-stationary
process, but it can be approximated to be locally stationary. As a consequence of this,
the predictor coefficients must be estimated from short segments (frames) of the speech
over which the speech signal can be assumed to be stationary.
Physical Model:
The physical model for speech generation is explained below:
Fig 1.1: Physical model of speech generation
Speech production involves three processes: generation of the sound excitation,
articulation by the vocal tract, and radiation from the lips and/or nostrils. The spectrum
of the excitation is shaped by the vocal tract tube, which has a frequency response that
contains some resonant frequencies called formants. Changing the shape of the vocal
tract changes its frequency response and results in the generation of various sounds.
The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms).
Mathematical Model:
A mathematical model for speech can be obtained using the linear predictive model.
This simplified all-pole (AR) model is a natural representation of non-nasal voiced
sounds, but for nasals and fricative sounds, theoretically a model with poles and zeros
is required. However, when the order is high enough, the all-pole model provides a
good representation for almost all sounds.
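For reference (standard linear-prediction notation; these symbols are not taken from the report's figures), the order-p all-pole model writes each speech sample as a weighted sum of past samples plus an excitation term:

x(n) = a_1 x(n-1) + a_2 x(n-2) + ... + a_p x(n-p) + e(n)

so the synthesis filter has the all-pole transfer function H(z) = 1 / (1 - a_1 z^-1 - ... - a_p z^-p), whose poles give the formant resonances mentioned above.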
The Reflection Coefficients we calculate using this technique will represent the vocal
tract’s response and will remain the same as long as it has the same shape. During this
period, the speech can be assumed to be stationary.
Trade-offs:
There are certain trade-offs we need to consider when we choose the length of the
frame.


If the frame length is small, the prediction error is reduced, but the compression
is also decreased, as we now have a greater number of speech segments to model
and hence more Reflection Coefficients to represent the speech signal.
On the other hand, for larger frame lengths the accuracy decreases, but we can
achieve greater compression.
We plotted the Prediction Error against the frame length for a model of order 10. The
plot is shown below:
Fig 1.2: Prediction Error Vs. Frame Length (for order 10)
Considering these trade-offs, a frame length of 20 ms to 30 ms (160 to 240 samples) may
be considered optimal. We have done our analysis using the Federal Standard, which is
approximately 236 samples.
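The sweep behind Fig 1.2 can be reproduced along the following lines. This is a minimal Python/NumPy sketch (not the code used to generate the figure), assuming x is a 1-D array of speech samples at 8 kHz and using the autocorrelation (Toeplitz normal-equation) solution for an order-10 predictor.

# Sketch: average prediction-error power vs. frame length for an order-10 LPC fit.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_error_power(frame, order=10):
    """Prediction-error power of an autocorrelation-method LPC fit of the given order."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation, lags 0..N-1
    if r[0] == 0:                                                  # all-zero (silent) frame: special case
        return 0.0
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])     # solve R a = r
    return (r[0] - a @ r[1:order + 1]) / len(frame)                # average residual power

def error_vs_frame_length(x, lengths=range(80, 401, 8), order=10):
    """Average error power over all frames, for each candidate frame length."""
    results = []
    for L in lengths:
        frames = [x[i:i + L] for i in range(0, len(x) - L + 1, L)]
        results.append(np.mean([lpc_error_power(np.asarray(f, float), order) for f in frames]))
    return list(lengths), results

Plotting the two returned sequences against each other gives a curve of the same kind as Fig 1.2.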
Fig 1.3: Spectrogram plot for frame length = 236
The spectrogram plot for a frame length of 236 indicates that the energy distribution
remains approximately constant within each observed frame. This indicates that, for a
frame length of 236, the stationarity assumption holds.
PROBLEM 2
OBJECTIVE: To build a linear predictive model based on the lattice structure and to
determine the optimal order and reflection coefficients. To show through simulations
the difference between the true signal and the one obtained from the model.
SOLUTION:
Speech signals can be modeled as an AR process (all-pole model) when the order is high
enough. We use lattice structures to model the speech signal.
Fig 2.1: (a) Forward (analysis) lattice filter; (b) Inverse (synthesis) lattice filter
x(n) : Speech signal
g_i(n) : Backward Prediction Error
f_i(n) : Forward Prediction Error
K_i : Reflection Coefficients
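As an illustration of the analysis lattice in Fig 2.1(a), the sketch below runs the order-recursive update f_i(n) = f_{i-1}(n) + K_i g_{i-1}(n-1), g_i(n) = K_i f_{i-1}(n) + g_{i-1}(n-1). This is one common sign convention (texts differ), and it assumes the reflection coefficients k are already known; it is an illustrative Python sketch, not the code used in the project.

# Sketch: order-p analysis lattice producing forward/backward prediction errors.
import numpy as np

def analysis_lattice(x, k):
    """x: speech frame; k: reflection coefficients K_1..K_p. Returns f_p(n), g_p(n)."""
    f = np.asarray(x, dtype=float).copy()               # f_0(n) = x(n)
    g = f.copy()                                        # g_0(n) = x(n)
    for ki in k:                                        # one lattice stage per coefficient
        g_delayed = np.concatenate(([0.0], g[:-1]))     # g_{i-1}(n-1)
        f_new = f + ki * g_delayed                      # forward prediction error f_i(n)
        g_new = g_delayed + ki * f                      # backward prediction error g_i(n)
        f, g = f_new, g_new
    return f, g

The modularity mentioned below is visible here: raising the order by one simply appends one more stage to the loop, without recomputing the earlier stages.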
Practically, an important advantage of this model is that it is modular; the
computations do not have to be redone each time the order is changed.

As the order of the model is increased, the prediction error decreases and eventually
reaches a point beyond which there is no significant decrease in error power with a
further increase in order. At the same time, a higher order means a greater number of
coefficients to be transmitted, and hence lower compression.
So there is a trade-off in choosing the order.
Prediction error Vs. Order plot:
In our project, we plotted the Prediction Error Power for different orders. The plot is
shown below:
Fig: 2.2 Prediction Error Power Vs. Order
We observed that the plot flattens out when the order is around 10.
Akaike Information Criterion:
In order to also take into consideration the penalties in increasing the order, we use the
Akaike Information Criterion (AIC).
AIC(M) = N log( E{ |f_M(n)|^2 } ) + 2M

where
M = prediction order
N = frame length
E{ |f_M(n)|^2 } = prediction error power
2M = penalty term
The plot of AIC values for different orders is shown below. The optimum order would
be that for which the AIC value is least. We observe that the minimum is obtained
when the order is 10. So, we choose the order for our model as 10.
Fig 2.3: AIC (M) Vs. Order (M)
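A sketch of this order selection is given below (an illustrative Python/SciPy sketch, not the project code). The error power E{ |f_M(n)|^2 } is taken from the autocorrelation-method fit, and AIC(M) = N log(error power) + 2M is minimized over the candidate orders; the frame is assumed to be non-silent so the error power stays positive.

# Sketch: choose the prediction order by minimizing AIC(M) = N*log(error power) + 2M.
import numpy as np
from scipy.linalg import solve_toeplitz

def best_order_by_aic(frame, max_order=20):
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    r = np.correlate(frame, frame, mode="full")[N - 1:]      # autocorrelation, lags 0..N-1
    aic = {}
    for M in range(1, max_order + 1):
        a = solve_toeplitz((r[:M], r[:M]), r[1:M + 1])       # order-M predictor coefficients
        err_power = (r[0] - a @ r[1:M + 1]) / N              # E{ |f_M(n)|^2 }
        aic[M] = N * np.log(err_power) + 2 * M
    best = min(aic, key=aic.get)                             # order with the smallest AIC
    return best, aic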
Reflection Coefficients:
Since each frame of the speech signal has a separate model, the reflection
coefficients differ from frame to frame. However, when the speech is produced by a
single speaker, there are often cases where the coefficients of adjacent segments differ
very little. This will be used later on to increase the compression.
The Reflection Coefficients for the first segment are listed below (for order = 10
and frame length = 236 samples):
0.2871, 0.0737, -0.1500, 0.3108, -0.1988, 0.3157, -0.1581, 0.1863, 0.0721, 0.0987
Simulations:
We modeled the speech signal for:
Frame Length=236 samples, Order=10
Reconstructed Signals:
The signal obtained from the above models and the actual signal are plotted below:
Fig 2.4: (a) Signal obtained from the model; (b) Actual speech signal
Forward Prediction Error:
The Forward Prediction Error is plotted below for the above models:
Fig 2.5 (a): Error = input - output
Fig 2.5 (b): Forward Prediction Error
Average Error Power:
E[ (input - output)^2 ] = 3.9273e-31
Note: There may be a situation where, for a particular voice sequence and a Frame size,
all the values in the input sequence of that frame are zero. This situation was
encountered for the voice sequence considered in this project, for a frame size of 176.
Since the autocorrelation matrix of the input is a zero matrix, this situation must be
dealt with as a special case. The AR parameters for this frame will be zero.
Signal obtained from model:
http://www.glue.umd.edu/~ramanath/Problem2.wav
PROBLEM 3
OBJECTIVE: To develop a scheme for the transmission of the model coefficients and the
residual, and to find out the best compression ratio possible such that the receiver can
decode a speech with a good index of performance.
SOLUTION:
The speech is modeled using a Linear Predictive model of order 10 for each segment.
To reconstruct the original speech from the model, the receiver needs to know only the
order, the frame length, the AR Parameters (or Reflection Coefficients) for each
segment, and the Residue.
We have developed three schemes:
1. Forward prediction errors are transmitted as the residue and AR parameters are
transmitted for all frames of the signal (Good Quality / Less Compression).
2. Forward prediction errors are transmitted as the residue, but AR parameters are
transmitted only for a few frames of the signal (Lesser Quality / More
Compression).
3. Pitch information (an impulse train) is sent as the residue. AR parameters are
sent for all frames (Low quality / Very high compression).
Scheme I: Transmission of all the AR Parameters (Residue: f_M)
The bit representation scheme for the order, AR parameters and the residue is given
below:

Order: Since the order will be in the range of 8-15 for speech (for reasons stated in
the previous section), we require only 4 bits for the order. The binary
representation of the order is sent just once for the whole signal, as the order
remains the same for all frames.

Frame Length: The frame length (frame duration in samples) will be between 160 and
255, for reasons stated in Section 1. The 8-bit binary equivalent of the frame length
is transmitted to the receiver.

AR Parameters: The AR parameters determine the vocal tract transfer function. They
also determine the locations of the poles, so the stability of the system depends
strongly on the accurate transmission of these parameters. If these values are
approximated too coarsely, one or more of the poles may move outside the unit circle.
So we cannot afford much loss of detail while transmitting the AR parameters.
For linear prediction of order 10, there are 11 AR parameters, but only the last 10
need to be sent for each frame, as the first one is always 1. So a total of
10 x N(frames) AR parameters is transmitted. If the AR parameters lie in some range
[min, max], the min value is subtracted from the AR parameters (shifting) and the
shifted parameters are scaled to integers in the range (0, 255).
The 8-bit binary equivalent of each is then transmitted to the receiver.
To retrieve the original AR parameters, the receiver needs to know
the min value,
the multiplication factor, and
the shifted and scaled AR parameters.
For transmitting the min value, we send the 8-bit binary equivalent of
round(min x 100). The number of bits allocated to each field is given below:
min Value : 8 bits
Multiplication Factor : 7 bits
AR Parameters (shifted and scaled) : 8 bits (for each parameter)
Total bits = N(frames) x 10 x 8
As an illustration:
Encoding:
Let the AR Parameters range from -1.99 to 1.81.
min val = -1.99
Subtract this value from all the AR Parameters; the AR Parameters now range from 0 to 3.80.
Mul Factor = 50
Multiply all the AR Parameters by the Mul Factor and round them to the nearest
integer; the AR Parameters now take integer values in the range 0 to 190.
Encode the AR Parameters using 8-bit uniform quantization.
Decoding:
Decode the AR Parameters from the bit stream (8 bits per value).
Divide the AR Parameters by the Mul Factor.
Add min Val to all the AR Parameters.
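The shift-and-scale quantization described above can be sketched as follows (the function names are ours, and the sketch assumes, as in the illustration, that the spread of the parameters times the multiplication factor stays within 0-255). The receiver simply inverts the two steps, with a residual quantization error of at most 1/(2 x Mul Factor).

# Sketch: 8-bit shift-and-scale quantization of the AR parameters (helper names are ours).
import numpy as np

def encode_ar(ar, mul_factor=50):
    """ar: 1-D array of AR parameters (without the leading 1)."""
    min_val = ar.min()                                         # e.g. -1.99 in the illustration
    q = np.round((ar - min_val) * mul_factor).astype(int)      # integers in 0..255, sent as 8-bit codes
    return min_val, mul_factor, q

def decode_ar(min_val, mul_factor, q):
    """Invert the shift-and-scale step to recover approximate AR parameters."""
    return q / mul_factor + min_val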

Residue: The forward prediction errors are transmitted to the receiver as the
residue. The table below indicates the values to which f_M is mapped. Then, based
on the probability distribution of f_M, the Huffman codes are generated. The
probabilities for all ranges and the bit representations are given in the table
below. Since the residual values are white-noise-like, the losses incurred by
this coding scheme are acceptable.
Range | Mapped Value | No. of Occurrences | Probability | Huffman Code
0.15 < f_M <= 0.55 | 0.2 | 64 | 0.003 | 001010
0.06 < f_M <= 0.15 | 0.1 | 312 | 0.015 | 0011
0.008 < f_M <= 0.06 | 0.01 | 4611 | 0.22 | 000
-0.008 <= f_M <= 0.008 | 0 | 10488 | 0.51 | 1
-0.06 <= f_M < -0.008 | -0.01 | 4934 | 0.24 | 01
-0.15 <= f_M < -0.06 | -0.1 | 298 | 0.014 | 00100
-0.46 <= f_M < -0.15 | -0.2 | 61 | 0.003 | 001011
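A Huffman code such as the one in the table can be generated from the occurrence counts with a standard priority-queue construction. The sketch below is our own illustrative Python helper, not the code used for the report; the 0/1 labels it assigns may differ from the table, but the codeword lengths should agree with the counts shown above.

# Sketch: build a Huffman code from symbol counts and encode a mapped residue sequence.
import heapq
from itertools import count

def huffman_code(counts):
    """counts: dict symbol -> number of occurrences. Returns dict symbol -> bit string."""
    tie = count()                                   # tie-breaker so heapq never compares the dicts
    heap = [(c, next(tie), {s: ""}) for s, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c0, _, a = heapq.heappop(heap)              # two least probable groups
        c1, _, b = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in a.items()} # prepend a bit to every codeword in each group
        merged.update({s: "1" + w for s, w in b.items()})
        heapq.heappush(heap, (c0 + c1, next(tie), merged))
    return heap[0][2]

# Counts taken from the table above (mapped residue value -> occurrences).
counts = {0.2: 64, 0.1: 312, 0.01: 4611, 0.0: 10488, -0.01: 4934, -0.1: 298, -0.2: 61}
code = huffman_code(counts)
encode = lambda mapped_values: "".join(code[v] for v in mapped_values)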
Frame Structure:
Order : 4 bits
Frame Length : 8 bits
Mul. Factor : 7 bits
Min. Value : 8 bits
Residue (f_M values) : Huffman coded (explained above)
AR Parameters : N(frames) x 10 x 8 bits
Fig 3.1: Frame structure when AR Parameters are transmitted for all frames.
Simulation:
For an order=10 and frame length=236 we obtained the following results for the given
speech signal.
Bit Stream Size = 4 + 8 + 7 + 8 + 37677 + (88 x 10 x 8) = 44744 bits
Compression Factor = (20768 x 8) / Bit Stream Size = (20768 x 8) / 44744 = 3.7132
Error Power = 0.0038
Speech signal obtained from model:
http://glue.umd.edu/~ramanath/p3.wav
Scheme II: Transmitting only a few AR coefficients (Residue: f_M)
In this approach, if the sets of AR coefficients of adjacent frames are comparable,
only one set of AR coefficients is transmitted. The scheme adopted to implement this
idea is described below.
It was noted that when the first three AR coefficients of adjacent frames are
comparable, the remaining AR coefficients of those frames are comparable as well.
Since the first AR coefficient of each frame is one, we compare the second and the
third AR coefficients of adjacent frames.
We employ a bit called the Window Bit.
The encoding and decoding procedures for the Order, Frame Length, Mul. Factor, Min
Value and the Residue (f_M values) remain the same as described in the earlier
sections. In this section we concentrate on the transmission and reception of the
AR parameters alone.
At the Transmitter's side:
Criterion:
Let M and M+1 denote adjacent windows (frames).
If |AR_M(2) - AR_{M+1}(2)| < 0.1 and |AR_M(3) - AR_{M+1}(3)| < 0.1
    then (Window Bit)_{M+1} = 1
    else (Window Bit)_{M+1} = 0
If (Window Bit)_{M+1} = 0
    then append (Window Bit)_{M+1} to the encoded AR_{M+1} coefficients and transmit the resulting bit stream
    else transmit (Window Bit)_{M+1} = 1 only.
At the Receiver's side:
Criterion:
The receiver first decodes all the residual values (the number of residual values to
be decoded is known to the receiver).
The receiver then treats the next 80 bits as the 10 AR coefficients of the first frame
and decodes them.
The next bit encountered is the Window Bit.
If (Window Bit = 1)
    AR coefficients of the second frame = AR coefficients of the first frame.
Else
    The next 80 bits correspond to the 10 AR coefficients of the second frame.
This procedure is repeated to decode the AR coefficients of all the frames.
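The window-bit logic on both sides can be sketched as follows. This is an illustrative Python sketch with our own names; encode_ar80 and decode_ar80 stand for the 80-bit AR-parameter encoding and decoding of Scheme I, and the AR vectors include the leading 1 at index 0 so that indices 1 and 2 are the second and third coefficients compared above (Criterion 1).

# Sketch: window-bit scheme for skipping the AR coefficients of similar adjacent frames.
def encode_window_bits(ar_frames, encode_ar80, tol=0.1):
    bits = encode_ar80(ar_frames[0])                 # the first frame is always sent (80 bits)
    for prev, cur in zip(ar_frames, ar_frames[1:]):
        similar = abs(prev[1] - cur[1]) < tol and abs(prev[2] - cur[2]) < tol
        # Window bit 1: reuse the previous frame's coefficients; 0: 80 coefficient bits follow.
        bits += "1" if similar else "0" + encode_ar80(cur)
    return bits

def decode_window_bits(bits, n_frames, decode_ar80):
    frames, pos = [decode_ar80(bits[:80])], 80       # first frame: always 80 bits
    for _ in range(n_frames - 1):
        flag, pos = bits[pos], pos + 1               # read the window bit
        if flag == "1":
            frames.append(frames[-1])                # same as the previous frame
        else:
            frames.append(decode_ar80(bits[pos:pos + 80]))
            pos += 80
    return frames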
Frame Structure:
Order : 4 bits
Frame Length : 8 bits
Mul. Factor : 7 bits
Min. Value : 8 bits
Residue (f_M values) : Huffman coded (explained above)
AR Parameters : per frame, a window bit followed (when the bit is 0) by 80 bits of AR
coefficients, e.g. 0 | 80 bits AR Coefficients | 1 | 0 | 80 bits AR Coefficients | ...
When the window bit is 1, it indicates that the AR Coefficients of the next frame are
the same as those of the previous frame.
The above method is applied with two criteria:
Criterion 1: The AR coefficients are not sent if both the second and the third AR
parameters of adjacent frames differ by less than 0.1.
Criterion 2: The constraint above is relaxed slightly: the AR coefficients are not sent
if either the second AR parameters of adjacent frames or the third AR parameters of
adjacent frames differ by less than 0.1.
Simulations:
Criterion | Size of Bitstream | Compression | Avg. Error Power
1 | 44352 | 20768 x 8 / 44352 = 3.7460 | 0.0041
2 | 41952 | 20768 x 8 / 41952 = 3.9603 | 0.0067
Speech Signals obtained from the model :
Criterion 1: http://www.glue.umd.edu/~ramanath/Problem3_Scheme2_version1.wav
Criterion 2: http://www.glue.umd.edu/~ramanath/Problem3_Scheme2_version2.wav
Scheme III: All AR parameters transmitted (Residue: impulse train)
There are two types of speech: Voiced and Unvoiced. Unvoiced speech can be modeled
by an all-pole filter excited by white noise, but voiced speech requires a different model
due to its nearly periodic nature. In order to compensate for its periodicity, voiced
speech is modeled by an impulse train (which contains the pitch information or the
periodicity information) and white noise. We only need to send the power for the white
noise part and an effective encoding scheme can be employed for the impulse train.
Hence, we can achieve very high compression, but the quality of the reconstructed
signal is low.
Operation:
The block diagram of this scheme is shown below.
Fig 3.2: Block diagram of Scheme III. At the transmitter, the speech signal passes
through the LP analysis filter to produce the residual signal, from which the signal
containing the pitch information and the power of the white-noise-like part are
extracted. At the receiver, a white noise signal of that power together with the pitch
signal drives the LP synthesis filter to produce the reconstructed speech signal.
At the transmitter, the speech signal is first passed through the Linear Predictor to
obtain the residual signal (the forward prediction errors). From this, the pitch
information is extracted by keeping only those values of f_M whose absolute value
exceeds the threshold; for the values that fall within the threshold, the pitch
information is zero. So the signal containing the pitch information, p(n), is an
impulse train. The part of f_M that falls within the threshold resembles a white noise
signal, v(n). We transmit only the power of this white noise to the receiver.
At the receiver, white Gaussian noise of the given power is generated. This is
then added to the impulse train that contains the pitch information. This signal is
used to regenerate the speech using the AR parameters.
In this project, we set the threshold to 0.06. For all n for which |f_M(n)| < 0.06,
the impulse train has p(n) = 0; for all n for which |f_M(n)| >= 0.06, the white noise
signal has v(n) = 0. At all other n, each signal takes the same value as f_M(n).
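A sketch of this split (and of the excitation rebuilt at the receiver) is given below; an illustrative Python sketch with our own names, assuming the 0.06 threshold and taking the transmitted "power" to be the mean square of the below-threshold samples.

# Sketch: split the residual f_M(n) into a pitch impulse train p(n) and a noise part v(n).
import numpy as np

def split_residual(fM, threshold=0.06):
    fM = np.asarray(fM, dtype=float)
    big = np.abs(fM) >= threshold
    p = np.where(big, fM, 0.0)                                  # impulse train: pitch information
    noise_power = np.mean(fM[~big] ** 2) if (~big).any() else 0.0   # power of the white-noise-like part v(n)
    return p, noise_power                                       # only p(n) and this power are transmitted

def rebuild_excitation(p, noise_power, seed=0):
    """Receiver side: white Gaussian noise of the given power added to the impulse train."""
    noise = np.random.default_rng(seed).normal(0.0, np.sqrt(noise_power), size=len(p))
    return p + noise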
Bit Representations:
In this case we need to transmit the order, the frame length, the AR parameters, the
impulse train and the power of the white noise. The encoding scheme for the order,
frame length and AR parameters remains the same as in Scheme I. We use 8 bits for
transmitting the power. The pitch information, which consists of a series of impulses,
contains many long runs of zeros. To code this efficiently, we use a combination of
Huffman coding and run-length coding. The table below shows the ranges of values of
p(n), the number of occurrences, and the corresponding Huffman codes.
Range | Mapped Value | No. of Occurrences | Huffman Code
0.15 < p(n) <= 0.55 | 0.25 | 64 | 0001
0.06 < p(n) <= 0.15 | 0.1 | 312 | 01
p(n) = 0 | 0 | 20033 | 1
-0.15 <= p(n) < -0.06 | -0.1 | 298 | 001
-0.46 <= p(n) < -0.15 | -0.25 | 61 | 0000
In addition, whenever p(n) = 0 we allocate 8 bits to denote the number of zeros that
follow; this is run-length coding. Thus the number of bits needed to represent the
residue is greatly reduced, resulting in very high compression, as sketched below. The
trade-off is that the reconstructed signal is of low quality.
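Combining the Huffman codewords for p(n) with an 8-bit run length after every zero codeword might look as follows. This is our own illustrative sketch: huff maps each mapped value to its codeword from the table, each zero codeword is followed by 8 bits giving how many zeros (including the current one) it covers, and runs longer than 255 are split across several codewords.

# Sketch: Huffman + run-length coding of the mapped impulse train p(n).
def encode_impulse_train(p_mapped, huff):
    """p_mapped: list of mapped values (0, +/-0.1, +/-0.25); huff: value -> codeword string."""
    bits, i = "", 0
    while i < len(p_mapped):
        v = p_mapped[i]
        bits += huff[v]                                          # Huffman codeword for this value
        if v == 0:
            run = 1                                              # count this zero and the ones that follow
            while i + run < len(p_mapped) and p_mapped[i + run] == 0 and run < 255:
                run += 1
            bits += format(run, "08b")                           # 8-bit run length
            i += run
        else:
            i += 1
    return bits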
Frame Structure:
Order : 4 bits
Frame Length : 8 bits
Mul. Factor : 7 bits
Min. Value : 8 bits
Residue (impulse train) : Huffman and run-length coded
WGN Power : 8 bits
AR Parameters : N(frames) x 10 x 8 bits
Fig: Residue (impulse train), white Gaussian noise, and white Gaussian noise + impulse train
Simulations:
Size of bitstream = 4 + 8 + 7 + 8 + 6581 + 8 + (88 x 10 x 8) = 13656 bits
Compression = 20768 x 8 / 13656 = 12.1664
Avg. Error Power = 0.0058
Speech obtained from the model:
http://www.glue.umd.edu/~ramanath/Problem3_Scheme3.wav
Performance Index:
We rate the above schemes according to the following criteria:
                 | Scheme I | Scheme II | Scheme III
Avg. Error Power | 0.0038   | 0.0041    | 0.0058
Compression      | 3.7132   | 3.7460    | 12.1664
Avg. Error Power rank: Scheme I > Scheme II > Scheme III
Compression rank: Scheme III > Scheme II > Scheme I
Audio Clarity: this index is subjective and is obtained by listening to the signal
obtained from each of the models.
Audio Clarity rank: Scheme I > Scheme II > Scheme III
PROBLEM 4
OBJECTIVE: To develop a transmission scheme in the presence of noise (SNR=20 dB)
and a channel impulse response.
SOLUTION: Here we must also account for the channel response and additive noise, as the
channel is no longer ideal.
Assumptions:
Channel response: h(1) = 1, h(50) = 0.3, h(100) = 0.1; zero for all other values.
The model parameter values transmitted in the system are assumed not to be affected by
the channel response; we assume our system is robust enough to handle this.
The residual values are convolved with the given channel impulse response, and
zero-mean white Gaussian noise (SNR = 20 dB) is added to them. Thus the encoding
scheme adopted in Problem 3 may no longer perform well.
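The assumed channel and noise can be simulated along the following lines; an illustrative Python sketch (our own function name), using 0-based indexing so the report's h(1), h(50), h(100) become taps 0, 49, 99, and defining the 20 dB SNR relative to the power of the channel output.

# Sketch: pass the residual through the assumed channel and add white Gaussian noise at 20 dB SNR.
import numpy as np

def channel(residual, snr_db=20.0, seed=0):
    h = np.zeros(100)
    h[0], h[49], h[99] = 1.0, 0.3, 0.1                       # h(1)=1, h(50)=0.3, h(100)=0.1 (1-based in the report)
    y = np.convolve(np.asarray(residual, float), h)          # channel filtering
    noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10.0))  # noise power for the given SNR
    y = y + np.random.default_rng(seed).normal(0.0, np.sqrt(noise_power), size=len(y))
    return y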
A new encoding scheme is adopted to ensure better results:
The number of levels used in encoding f_M is increased to 9. The f_M values are mapped
as indicated in the table below. Based on the number of occurrences of each value, a
Huffman code is generated for all the values for near-optimal encoding. The Huffman
codes are listed in the table below.
Range | Mapped Value | No. of Occurrences | Probability | Huffman Code
0.15 < f_M <= 0.56 | 0.2 | 72 | 0.0035 | 001001
0.1 < f_M <= 0.15 | 0.12 | 89 | 0.0043 | 001011
0.03 < f_M <= 0.1 | 0.02 | 1485 | 0.0719 | 000
0.007 < f_M <= 0.03 | 0.015 | 5192 | 0.25 | 01
-0.007 <= f_M <= 0.007 | 0 | 7015 | 0.3378 | 11
-0.03 <= f_M < -0.007 | -0.015 | 5356 | 0.2579 | 10
-0.1 <= f_M < -0.03 | -0.02 | 1402 | 0.0675 | 0011
-0.15 <= f_M < -0.1 | -0.12 | 83 | 0.004 | 001010
-0.48 <= f_M < -0.15 | -0.2 | 64 | 0.0031 | 001000
Frame Structure:
Order : 4 bits
Frame Length : 8 bits
Mul. Factor : 7 bits
Min. Value : 8 bits
Residue (f_M values) : Huffman coded (explained above)
AR Parameters : N(frames) x 10 x 8 bits
Simulations:
Size of Bit stream = 54917
Compression = 20768 x 8 / 54917 = 3.024
Avg. Error Power = 0.0060
Using the same Huffman encoding scheme as in Scheme I of Problem 3 gives an Avg.
Error Power of 0.0076, which is unacceptable as the speech is not comprehensible.
Signal obtained from model:
http://glue.umd.edu/~ramanath/Problem4.wav
PROBLEM 5
Discussions and Conclusions:
1. Speech is produced by the shaping of air by the vocal tract. Different sounds are
produced by a change in shape of the vocal tract. The AR parameters (or
Reflection Coefficients) obtained from the linear prediction model, thus represent
the transfer function of the vocal tract.
2. Speech is a non-stationary process, but can be approximated to be locally
stationary during the period for which the shape of the vocal tract does not
change. This period is approximately 10 ms to 100 ms. Considering trade-offs, a
frame length of 20 ms to 30 ms may be considered optimal.
3. For linear prediction of speech an order of 10-12 is sufficient.
4. The forward prediction errors have a white-Gaussian-noise-like property for unvoiced
speech inputs. For voiced inputs, the forward prediction errors can be separated into
an impulse train and a white-noise-like signal.
5. The AR parameters (or Reflection coefficients) have a greater significance than
the residue as they affect the location of poles of the vocal tract transfer function
(and hence the stability). We, therefore, cannot afford much loss in detail during
the encoding and transmission of the AR parameters.
6. On the other hand, it is not absolutely necessary to send the AR parameters of every
frame for a good reconstruction of the voice. A scheme in which only some of the AR
parameters are transmitted can be developed; this increases the compression, but the
quality of the reconstructed voice deteriorates.
7. The residual errors can be quantized coarsely to obtain a compressed version of
the speech signal. Huffman coding can be employed in encoding the residual
error values. Run-length coding of residual errors will not help us in achieving
good compression due to the uncorrelated nature of the residual errors.
8. On the other hand, run-length coding comes in handy in Scheme III of Problem 3 due
to the large number of zeros to be coded.
9. A noisy channel forces a decrease in compression to obtain a quality of reconstructed
voice comparable to that obtained with an ideal channel. Robust error correction codes
should be employed whenever the channel is noisy.
10. Greater compression can be achieved using techniques like Code Excited Linear
Prediction (CELP).
11. A very coarse version of the voice can be reconstructed by transmitting just the
AR coefficients and the power of the forward prediction errors for each frame of the
input voice. White Gaussian noise of matched variance is generated at the receiver to
reconstruct the input signal.
12. The Performance index can be defined in terms of the Average Error in output,
the compression value, clarity of voice etc.
So the best prediction scheme would depend on the requirements of the application.
References:
1. D.G. Manolakis et al, “Statistical and Adaptive Signal Processing”
2. L. R. Rabiner and R. W. Schafer, “Digital Processing of Speech Signals”
3. S. Haykin, "Adaptive Filter Theory"
4. http://www.isip.msstate.edu/publications/courses/ece_4773/lectures/v1.0/lecture_30.pdf
5. http://www.data-compression.com/speech.html