ENEE624 Advanced Digital Signal Processing
Project 2: Linear Prediction and Synthesis
Submitted by Narayanan Ramanathan & Nagaraj P Anthapadhmanabhan

PROBLEM 1

OBJECTIVE: To determine the optimal segment length of the speech signal, called the frame length, over which the speech signal can be treated as a stationary process.

SOLUTION: An important requirement of the linear prediction technique is that the signal be stationary over the prediction period. Speech is a non-stationary process, but it can be approximated as locally stationary. Consequently, the predictor coefficients must be estimated from short segments (frames) of the speech over which the signal can be assumed to be stationary.

Physical Model: The physical model for speech generation is shown below.

Fig 1.1: Physical model of speech generation

Speech production involves three processes: generation of the sound excitation, articulation by the vocal tract, and radiation from the lips and/or nostrils. The spectrum of the excitation is shaped by the vocal tract tube, whose frequency response contains resonant frequencies called formants. Changing the shape of the vocal tract changes its frequency response and produces different sounds. The shape of the vocal tract changes relatively slowly, on the scale of 10 ms to 100 ms.

Mathematical Model: A mathematical model for speech can be obtained using linear prediction. This simplified all-pole (AR) model is a natural representation of non-nasal voiced sounds; for nasal and fricative sounds, a model with both poles and zeros is theoretically required. When the order is high enough, however, the all-pole model provides a good representation of almost all sounds. The reflection coefficients calculated with this technique represent the vocal tract's response and remain the same as long as the vocal tract keeps the same shape. During this period, the speech can be assumed to be stationary.
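The frame-based processing described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the 20768-sample signal length matches the signal analyzed later in this report, but the random content is only a placeholder for real speech samples.

```python
import numpy as np

def split_into_frames(signal, frame_length):
    """Split a 1-D signal into non-overlapping frames, zero-padding the tail."""
    n_frames = int(np.ceil(len(signal) / frame_length))
    padded = np.zeros(n_frames * frame_length)
    padded[:len(signal)] = signal
    return padded.reshape(n_frames, frame_length)

# Placeholder random signal standing in for the speech samples
rng = np.random.default_rng(0)
signal = rng.standard_normal(20768)

# 236-sample frames, the Federal Standard length adopted in this report
frames = split_into_frames(signal, 236)
print(frames.shape)  # (88, 236)
```

Each row of `frames` is then modeled independently, under the local-stationarity assumption, in the sections that follow.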
Trade-offs: There are trade-offs to consider when choosing the frame length. If the frame length is small, the prediction error is reduced, but so is the compression, since there are more speech segments to model and hence more reflection coefficients needed to represent the signal. Conversely, larger frame lengths give greater compression at the cost of accuracy. We plotted the prediction error against the frame length for a model of order 10:

Fig 1.2: Prediction Error vs. Frame Length (for order 10)

Considering these trade-offs, a frame length of 20 ms to 30 ms (160 to 240 samples) may be considered optimal. We carried out our analysis using the Federal Standard frame length of approximately 236 samples.

Fig 1.3: Spectrogram plot for frame length = 236

The spectrogram for a frame length of 236 shows that the energy distribution remains constant within each observed frame, indicating that the stationarity assumption holds for this frame length.

PROBLEM 2

OBJECTIVE: To build a linear predictive model based on the lattice structure and to determine the optimal order and reflection coefficients; to show through simulations the difference between the true signal and the one obtained from the model.

SOLUTION: Speech signals can be modeled as an AR process (all-pole model) when the order is high enough. We use lattice structures to model the speech signal.
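A minimal sketch of how the reflection coefficients, the per-order prediction error powers, and from them the AIC used below can be computed is given here. It uses the Levinson-Durbin recursion on a biased autocorrelation estimate, which is equivalent to building the lattice one stage at a time; the AR(2) test signal and all variable names are illustrative assumptions, not the project's actual speech data.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation lags r[0..order].
    Returns the AR coefficients, the reflection coefficients K_1..K_M,
    and the prediction error power after each stage."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    ks = np.zeros(order)
    errs = np.zeros(order)
    for m in range(1, order + 1):
        acc = r[m]
        for i in range(1, m):
            acc += a[i] * r[m - i]
        k = -acc / err                        # reflection coefficient K_m
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]   # order-update of the predictor
        a[m] = k
        err *= 1.0 - k * k                    # error power never increases
        ks[m - 1] = k
        errs[m - 1] = err
    return a, ks, errs

# Illustrative AR(2) test signal: x[n] = 0.9 x[n-1] - 0.5 x[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.9 * x[n - 1] - 0.5 * x[n - 2] + e[n]

order = 10
# Biased autocorrelation estimate (positive semidefinite, so |K_m| <= 1)
r = np.array([x[:len(x) - l] @ x[l:] for l in range(order + 1)]) / len(x)
_, ks, errs = levinson_durbin(r, order)

# AIC(M) = N log E{|f_M(n)|^2} + 2M, minimized over the order M
aic = len(x) * np.log(errs) + 2 * np.arange(1, order + 1)
best_order = int(np.argmin(aic)) + 1
```

The modularity claimed for the lattice is visible here: each pass of the loop adds one stage, reusing all previous computations, so raising the order never requires starting over.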
Fig 2.1: (a) Forward (analysis) filter; (b) Inverse (synthesis) filter

Notation:
    x(n)   : speech signal
    g_i(n) : backward prediction error
    f_i(n) : forward prediction error
    K_i    : reflection coefficients

Practically, an important advantage of this model is that it is modular: the computations do not have to be redone each time the order is changed. As the order of the model increases, the prediction error decreases and eventually reaches a point beyond which there is no significant decrease in error power. At the same time, a higher order means more coefficients to transmit, and hence lower compression, so there is a trade-off in choosing the order.

Prediction Error vs. Order: We plotted the prediction error power for different orders:

Fig 2.2: Prediction Error Power vs. Order

We observed that the plot flattens out when the order is around 10.

Akaike Information Criterion: To also take into account the penalty of increasing the order, we use the Akaike Information Criterion (AIC):

    AIC(M) = N log E{|f_M(n)|^2} + 2M

where
    M             = prediction order
    N             = frame length
    E{|f_M(n)|^2} = prediction error power
    2M            = penalty term

The plot of AIC values for different orders is shown below; the optimum order is the one for which the AIC value is least. We observe that the minimum is obtained at order 10, so we choose 10 as the order of our model.

Fig 2.3: AIC(M) vs. Order (M)

Reflection Coefficients: Since each frame of the speech signal has a separate model, the reflection coefficients differ from frame to frame. But when the speech is produced by a single speaker, there are cases where the coefficients of adjacent segments differ very little; this is used later to increase the compression. The reflection coefficients for the first segment are listed below (order = 10, frame length = 236 samples).
    0.2871   0.0737   -0.1500   0.3108   -0.1988
    0.3157  -0.1581    0.1863   0.0721    0.0987

Simulations: We modeled the speech signal with frame length = 236 samples and order = 10.

Reconstructed Signals: The signal obtained from the above model and the actual signal are plotted below:

Fig 2.4: (a) Signal obtained from model; (b) Actual speech signal

Forward Prediction Error: The forward prediction error is plotted below for the above model:

Fig 2.5: (a) Error = Input - Output; (b) Forward Prediction Error

Average Error Power: E{(input - output)^2} = 3.9273e-31

Note: There may be a situation where, for a particular voice sequence and frame size, all the values of the input sequence in a frame are zero. This situation was encountered for the voice sequence considered in this project at a frame size of 176. Since the autocorrelation matrix of the input is then a zero matrix, this situation must be handled as a special case: the AR parameters for such a frame are set to zero.

Signal obtained from model: http://www.glue.umd.edu/~ramanath/Problem2.wav

PROBLEM 3

OBJECTIVE: To develop a scheme for the transmission of the model coefficients and the residual, and to find the best compression ratio possible such that the receiver can decode the speech with a good index of performance.

SOLUTION: The speech is modeled using a linear predictive model of order 10 for each segment. To reconstruct the original speech from the model, the receiver needs to know only the order, the frame length, the AR parameters (or reflection coefficients) for each segment, and the residue. We have developed three schemes:
1. Forward prediction errors are transmitted as the residue, and AR parameters are transmitted for all frames of the signal (good quality / less compression).
2. Forward prediction errors are transmitted as the residue, but AR parameters are transmitted only for a few frames of the signal (lesser quality / more compression).
3. Pitch information (an impulse train) is sent as the residue.
AR parameters are sent for all frames (low quality / very high compression).

Scheme I: Transmission of all the AR parameters (residue: f_M)

The bit representation scheme for the order, the AR parameters, and the residue is given below.

Order: Since the order will be in the range 8-15 for speech (for the reasons stated in the previous section), only 4 bits are required. The binary representation of the order is sent just once for the whole signal, as the order remains the same for all frames.

Frame Length: The frame length (duration of a frame in samples) will be between 160 and 255, for the reasons stated in Section 1. The 8-bit binary equivalent of the frame length is transmitted to the receiver.

AR Parameters: The AR parameters determine the vocal tract transfer function. They also determine the locations of the poles, so the stability of the system depends strongly on their accurate transmission: if these values are approximated too coarsely, one or more poles may move outside the unit circle. We therefore cannot afford much loss of detail when transmitting the AR parameters. For linear prediction of order 10 there are 11 AR parameters, but only the last 10 need to be sent for each frame, as the first is always 1. So a total of 10 x N(frames) AR parameters must be transmitted. If the AR parameters lie in some range [min, max], the min value is subtracted from the AR parameters (shifting) and the results are scaled to integers in the range (0, 255). The 8-bit binary equivalent of each is then transmitted to the receiver. To retrieve the original AR parameters, the receiver needs to know:
- the min value,
- the multiplication factor,
- the shifted and scaled AR parameters.
For transmitting the min value, we send the 8-bit binary equivalent of round(min x 100).
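The shift-and-scale round trip described above can be sketched as follows. The parameter values here are illustrative, and as an assumption the multiplication factor is derived from the parameter range rather than chosen as a fixed value such as the 50 used in the illustration below.

```python
import numpy as np

def encode_ar(params):
    """Shift the parameters by their minimum and scale them to 8-bit
    integers in the range 0..255."""
    mn = params.min()
    shifted = params - mn
    factor = 255.0 / shifted.max()               # multiplication factor
    q = np.round(shifted * factor).astype(np.uint8)
    return q, mn, factor

def decode_ar(q, mn, factor):
    """Invert the scaling and shifting to recover the parameters."""
    return q.astype(float) / factor + mn

params = np.array([1.0, -0.287, 0.15, -0.31, 0.2])  # illustrative values
q, mn, factor = encode_ar(params)
rec = decode_ar(q, mn, factor)
# The round-trip error is bounded by half a quantization step, 0.5/factor
```

Because pole locations, and hence stability, depend on these values, the full 8 bits per parameter are retained rather than quantizing them as coarsely as the residue.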
The number of bits allocated to each field is given below:

    min value                          : 8 bits
    Multiplication factor              : 7 bits
    AR parameters (shifted and scaled) : 8 bits each

Total bits for the AR parameters = N(frames) x 10 x 8.

As an illustration:

Encoding:
- Let the AR parameters range from -1.99 to 1.81, so min val = -1.99.
- Subtract this value from all the AR parameters; they now range from 0 to 3.80.
- With mul factor = 50, multiply all the AR parameters by the mul factor and round to the nearest integer; the AR parameters now lie in 0..190.
- Encode the AR parameters using 8-bit uniform quantization.

Decoding:
- Decode the AR parameters from the bit stream (8 bits per value).
- Divide the AR parameters by the mul factor.
- Add min val to all the AR parameters.

Residue: The forward prediction errors are transmitted to the receiver as the residue. The table below indicates the values to which f_M is mapped. Based on the probability distribution of f_M, Huffman codes are then generated; the probabilities for all ranges and the bit representations are given in the table. Since the residual values are white-noise-like, the losses incurred by this coding scheme are acceptable.

    Range                     Mapped Value   Occurrences   Probability   Huffman Code
     0.15  <= f_M <  0.55        0.2              64          0.003        001010
     0.06  <= f_M <  0.15        0.1             312          0.015        0011
     0.008 <= f_M <  0.06        0.01           4611          0.22         000
    -0.008 <= f_M <  0.008       0             10488          0.51         1
    -0.06  <= f_M < -0.008      -0.01           4934          0.24         01
    -0.15  <= f_M < -0.06       -0.1             298          0.014        00100
    -0.46  <= f_M < -0.15       -0.2              61          0.003        001011

Frame Structure:

    Order                : 4 bits
    Frame Length         : 8 bits
    Mul. Factor          : 7 bits
    Min. Value           : 8 bits
    Residue (f_M values) : Huffman coded (explained above)
    AR Parameters        : N(frames) x 10 x 8 bits

Fig 3.1: Frame structure when AR parameters are transmitted for all frames.

Simulation: For order = 10 and frame length = 236 we obtained the following results for the given speech signal.
Bit Stream Size = 4 + 8 + 7 + 8 + 37677 + (88 x 10 x 8) = 44744 bits

Compression Factor = (20768 x 8) / Bit Stream Size = (20768 x 8) / 44744 = 3.7132
Error Power = 0.0038

Speech signal obtained from model: http://glue.umd.edu/~ramanath/p3.wav

Scheme II: Transmitting only a few AR coefficients (residue: f_M)

In this approach, if the sets of AR coefficients of adjacent frames are comparable, only one set is transmitted. It was noted that when the first three AR coefficients of adjacent frames are comparable, the remaining AR coefficients of those frames are comparable as well. Since the first AR coefficient of each frame is 1, we compare the second and third AR coefficients of adjacent frames, and we employ a flag bit called the Window Bit. The encoding and decoding of the order, frame length, mul. factor, min value, and residue (f_M values) remain as described in the earlier sections; here we concentrate on the transmission and reception of the AR parameters alone.

At the transmitter (M and M+1 denote adjacent windows):

    If |AR_M(2) - AR_{M+1}(2)| < 0.1 and |AR_M(3) - AR_{M+1}(3)| < 0.1
        Window Bit_{M+1} = 1
    Else
        Window Bit_{M+1} = 0

If Window Bit_{M+1} = 0, the window bit is appended to the encoded AR_{M+1} coefficients and this bit stream is transmitted; otherwise only Window Bit_{M+1} = 1 is transmitted.

At the receiver: The receiver first decodes all the residual values (the number of residual values to be decoded is known to the receiver). It then reads the next 80 bits as the 10 AR coefficients of the first frame and decodes them. The next bit encountered is the Window Bit. If Window Bit = 1, the AR coefficients of the second frame are the same as those of the first frame.
Otherwise, the next 80 bits correspond to the 10 AR coefficients of the second frame. This procedure is repeated to decode all the AR coefficients.

Frame Structure:

    Order                : 4 bits
    Frame Length         : 8 bits
    Mul. Factor          : 7 bits
    Min. Value           : 8 bits
    Residue (f_M values) : Huffman coded (explained above)
    AR Parameters        : [0][80-bit AR coefficients] [1] [0][80-bit AR coefficients] ...

A window bit of 1 indicates that the AR coefficients of the next frame are the same as those of the previous frame.

The above method is applied with two criteria:

Criterion 1: The AR coefficients are not sent if both the second and the third AR parameters of adjacent frames differ by less than 0.1.

Criterion 2: The constraint above is relaxed a little: the AR coefficients are not sent if either the second AR parameters of adjacent frames or the third AR parameters of adjacent frames differ by less than 0.1.

Simulations:

    Criterion           1                            2
    Size of bitstream   44352                        41952
    Compression         20768 x 8 / 44352 = 3.7460   20768 x 8 / 41952 = 3.9603
    Avg. Error Power    0.0041                       0.0067

Speech signals obtained from the model:
Criterion 1: http://www.glue.umd.edu/~ramanath/Problem3_Scheme2_version1.wav
Criterion 2: http://www.glue.umd.edu/~ramanath/Problem3_Scheme2_version2.wav

Scheme III: All AR parameters transmitted (residue: impulse train)

There are two types of speech: voiced and unvoiced. Unvoiced speech can be modeled by an all-pole filter excited by white noise, but voiced speech requires a different model because of its nearly periodic nature. To capture this periodicity, voiced speech is modeled by an impulse train (which carries the pitch, or periodicity, information) plus white noise. Only the power of the white-noise part needs to be sent, and an efficient encoding scheme can be employed for the impulse train. Hence very high compression can be achieved, but the quality of the reconstructed signal is low.
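The residual split that Scheme III relies on can be sketched as below, using the 0.06 threshold adopted in this project. The synthetic residual here (Gaussian noise with a pulse every 100 samples) is an illustrative stand-in for an actual forward prediction error sequence.

```python
import numpy as np

def split_residual(f, threshold=0.06):
    """Split the residual into an impulse train p(n) carrying the pitch
    pulses and a white-noise-like remainder v(n); only the power of v(n)
    needs to be transmitted."""
    p = np.where(np.abs(f) >= threshold, f, 0.0)
    v = np.where(np.abs(f) < threshold, f, 0.0)
    return p, v

rng = np.random.default_rng(1)
f = rng.normal(scale=0.02, size=1000)  # noise-like part of the residual
f[::100] = 0.3                         # synthetic pitch pulses
p, v = split_residual(f)
noise_power = float(np.mean(v ** 2))   # the single scalar sent for v(n)
```

At the receiver, white Gaussian noise of this power is regenerated and added to the decoded impulse train before LP synthesis, as described in the next section.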
Operation: The block diagram of this scheme is shown below.

Fig: Block diagram of Scheme III. Transmitter: speech signal -> LP analysis filter -> residual signal, from which the pitch information and the power of the white-noise-like part are extracted. Receiver: white noise signal + signal containing pitch information -> LP synthesis filter -> reconstructed speech signal.

At the transmitter, the speech signal is first passed through the linear predictor to obtain the residual signal (the forward prediction errors). From this, the pitch information is extracted by keeping only those values of f_M whose absolute value is not within the threshold; for values within the threshold, the pitch information is zero. The signal containing the pitch information, p(n), is thus an impulse train. The part of f_M that falls within the threshold resembles a white noise signal, v(n); only the power of this white noise is transmitted to the receiver. At the receiver, white Gaussian noise of the given power is generated and added to the impulse train containing the pitch information; this signal is used to regenerate the speech using the AR parameters.

In this project we set the threshold to 0.06. So for all n with |f_M(n)| < 0.06, the impulse train p(n) = 0, and for all n with |f_M(n)| >= 0.06, the white noise signal v(n) = 0; at every other point, p(n) and v(n) take the values of f_M(n).

Bit Representations: In this case we need to transmit the order, the frame length, the AR parameters, the impulse train, and the power of the white noise. The encoding scheme for the order, frame length, and AR parameters remains the same as in Scheme I, and 8 bits are used for transmitting the power. The pitch information, a series of impulses, contains many runs of zeros; to code it efficiently, we use a combination of Huffman coding and run-length coding. The table below shows the ranges of values of p(n), the number of occurrences, and the corresponding Huffman codes.
    Range                     Mapped Value   Occurrences   Huffman Code
     0.15 <= p(n) <  0.55        0.25             64          0001
     0.06 <= p(n) <  0.15        0.1             312          01
     p(n) = 0                    0             20033          1
    -0.15 <= p(n) < -0.06       -0.1             298          001
    -0.46 <= p(n) < -0.15       -0.25             61          0000

In addition, whenever p(n) = 0 we allocate 8 bits to denote the number of zeros that follow; this is the run-length coding. The number of bits needed to represent the residue is thereby reduced tremendously, resulting in very high compression, at the cost of a reconstructed signal of low quality.

Frame Structure:

    Order                   : 4 bits
    Frame Length            : 8 bits
    Mul. Factor             : 7 bits
    Min. Value              : 8 bits
    Residue (impulse train) : Huffman and run-length coded
    WGN Power               : 8 bits
    AR Parameters           : N(frames) x 10 x 8 bits

Fig: Residue components: impulse train; white Gaussian noise; white Gaussian noise + impulse train.

Simulations:

    Size of bitstream = 4 + 8 + 7 + 8 + 6581 + 8 + (88 x 10 x 8) = 13656
    Compression       = 20768 x 8 / 13656 = 12.1664
    Avg. Error Power  = 0.0058

Speech obtained from the model: http://www.glue.umd.edu/~ramanath/Problem3_Scheme3.wav

Performance Index: We rate the above schemes according to the following criteria:

                        Scheme I   Scheme II   Scheme III
    Avg. Error Power    0.0038     0.0041      0.0058
    Compression         3.7132     3.7460      12.1664

Avg. Error Power rank: Scheme I > Scheme II > Scheme III
Compression rank: Scheme III > Scheme II > Scheme I
Audio clarity (subjective, obtained by listening to the signal from each model): Scheme I > Scheme II > Scheme III

PROBLEM 4

OBJECTIVE: To develop a transmission scheme in the presence of noise (SNR = 20 dB) and a channel impulse response.

SOLUTION: Here we must also account for the channel response and additive noise, as the channel is no longer ideal.

Assumptions:
- Channel response: h(1) = 1, h(50) = 0.3, h(100) = 0.1; zero for all other values.
- The autocorrelation values transmitted in the system are not affected by the channel response; we assume our system is robust enough to handle this.
The residual values are convolved with the given channel's impulse response, and zero-mean white Gaussian noise (SNR: 20 dB) is added to them. The encoding scheme adopted in Problem 3 may therefore no longer perform well, so a new encoding scheme is adopted to ensure better results: the number of levels used in encoding f_M is increased to 9. The f_M values are mapped as indicated in the table below, and based on the number of occurrences of each value a Huffman code is generated for optimal encoding:

    Range                      Mapped Value   Occurrences   Probability   Huffman Code
     0.15  <= f_M <  0.56         0.2              72          0.0035       001001
     0.1   <= f_M <  0.15         0.12             89          0.0043       001011
     0.03  <= f_M <  0.1          0.02           1485          0.0719       000
     0.007 <= f_M <  0.03         0.015          5192          0.25         01
    -0.007 <= f_M <  0.007        0              7015          0.3378       11
    -0.03  <= f_M < -0.007       -0.015          5356          0.2579       10
    -0.1   <= f_M < -0.03        -0.02           1402          0.0675       0011
    -0.15  <= f_M < -0.1         -0.12             83          0.004        001010
    -0.48  <= f_M < -0.15        -0.2              64          0.0031       001000

Frame Structure:

    Order                : 4 bits
    Frame Length         : 8 bits
    Mul. Factor          : 7 bits
    Min. Value           : 8 bits
    Residue (f_M values) : Huffman coded (explained above)
    AR Parameters        : N(frames) x 10 x 8 bits

Simulations:

    Size of bit stream = 54917
    Compression        = 20768 x 8 / 54917 = 3.024
    Avg. Error Power   = 0.0060

Using the same Huffman encoding scheme as in Scheme I of Problem 3 gives an avg. error power of 0.0076, which is unacceptable, as the speech is not comprehensible.

Signal obtained from model: http://glue.umd.edu/~ramanath/Problem4.wav

PROBLEM 5

Discussions and Conclusions:

1. Speech is produced by the shaping of air by the vocal tract; different sounds are produced by changes in its shape. The AR parameters (or reflection coefficients) obtained from the linear prediction model thus represent the transfer function of the vocal tract.

2. Speech is a non-stationary process, but it can be approximated as locally stationary during the period for which the shape of the vocal tract does not change.
This period is approximately 10 ms to 100 ms. Considering the trade-offs, a frame length of 20 ms to 30 ms may be considered optimal.

3. For linear prediction of speech, an order of 10-12 is sufficient.

4. The forward prediction errors have white-Gaussian-noise-like properties for unvoiced speech inputs. For voiced inputs, the forward prediction errors can be separated into an impulse train and a signal with white-noise properties.

5. The AR parameters (or reflection coefficients) are more significant than the residue, as they determine the locations of the poles of the vocal tract transfer function (and hence the stability). We therefore cannot afford much loss of detail in the encoding and transmission of the AR parameters.

6. On the other hand, it is not absolutely necessary to send the AR parameters of every frame for a good reconstruction of the voice. A scheme in which only some of the AR parameters are sent can be developed; this increases the compression, but the quality of the reconstructed voice deteriorates.

7. The residual errors can be quantized coarsely to obtain a compressed version of the speech signal, and Huffman coding can be employed to encode the residual error values. Run-length coding of the residual errors does not yield good compression because of their uncorrelated nature.

8. Run-length coding does come in handy in Scheme III of Problem 3, however, because of the large number of zeros to be coded.

9. A noisy channel forces a decrease in compression to obtain reconstructed voice of quality comparable to that over an ideal channel. Robust error correction codes should be employed whenever the channel is noisy.

10. Greater compression can be achieved using techniques like Code Excited Linear Prediction (CELP).

11.
A very coarse version of the voice can be reconstructed by transmitting just the AR coefficients and the power of the forward prediction errors for each frame of the input voice; white Gaussian noise of matched variance is generated at the receiver to reconstruct the input signal.

12. The performance index can be defined in terms of the average error in the output, the compression value, the clarity of the voice, and so on. The best prediction scheme therefore depends on the requirements of the application.