Modelling long-term correlations in broadband speech and audio pulse coders

F. Riera-Palou, A.C. den Brinker and A.J. Gerrits

A new idea to model long-term correlations in broadband speech and audio coders based on regular pulse excitation is presented. It involves expanding the excitation sequence with additional pulses targeting this type of correlation while being bit rate efficient. Comparison with the standard method (long-term prediction) highlights the advantages of the proposed technique.

Introduction: Linear prediction-based narrowband (8 kHz sampling rate) speech coders exploit the inter-sample correlation (i.e. redundancy) to achieve a high coding efficiency [1]. To this end, the input signal is typically decorrelated in two consecutive steps: first, a linear prediction (LP) analysis filter eliminates most of the short-term correlation; then, a long-term prediction (LTP) analysis filter minimises the long-term correlation. Bit rate reductions are achieved by modelling this decorrelated signal by an excitation signal, which is computed using an analysis-by-synthesis (AbS) procedure like regular pulse excitation (RPE) [2]. This excitation consists of a pulse sequence that, after being filtered with the LTP and LP synthesis filters, produces a close replication of the original while requiring only a relatively low bit rate. Several attempts have been made, with limited success, to extend the use of these techniques to broadband (32 kHz sampling) speech and audio signals [3]. In this context, however, LTP is rarely used since its performance is greatly impaired by the higher bandwidth of the processed signals. Consequently, long-term correlations, e.g. due to speaker pitch, are still present during the AbS excitation modelling stage leading to a significant drop in the coding quality.
In this Letter, a modified RPE technique is presented that is capable of accurately modelling a broadband signal containing long-term correlations with a minimal increase in bit rate.

Signal decorrelation and modelling: Fig. 1 shows the analysis-by-synthesis (AbS) scheme usually present in linear prediction-based coders. The original signal, $s(n)$, is passed through a linear prediction analysis filter with transfer function $A(z)$, resulting in the residual signal $r_{LP}(n)$. The presence of long-term correlation, due for instance to voiced speech segments, is often revealed as pulse train-like structures in $r_{LP}(n)$. To get rid of this type of correlation, the residual $r_{LP}(n)$ is further filtered using a long-term predictor analysis filter with transfer function $P(z)$, resulting in a second residual signal, $r_{LTP}(n)$. A pulse sequence, $x(n)$, is generated in an AbS manner involving LTP and LP synthesis, perceptual weighting and minimisation, typically using the least-squares (LS) technique. In Fig. 1, the LTP and LP synthesis filters are denoted by $1/P(z)$ and $1/A(z)$, respectively, and the perceptual filter by $W(z)$. The cascade of these three filters is denoted as $H(z)$. The signal $e_p(n)$ represents the error being minimised.

Fig. 1 Analysis-by-synthesis scheme for narrowband speech coding

Assuming a frame length of $N$ samples, the vector $e_p(k)$, consisting of $N$ successive error samples over frame $k$, can be expressed as:

$$e_p(k) = e_{s_{in}}(k) + H(k)\, e_r(k) \qquad (1)$$

where $e_{s_{in}}(k)$ is a vector corresponding to the response of the filter $H(z)$ due to its initial filter states and truncated to $N$, and $e_r(k)$ is the vector containing the difference between the $N$ samples from the LTP residual, $r_{LTP}(k)$, and its corresponding excitation $x(k)$. $H(k)$ is an $N \times N$ matrix representing the impulse response (truncated to $N$) of the filter $H(z)$. When using RPE, only $J$ equidistant nonzero values are allowed per frame (decimation $N/J$). The $J$ pulse amplitudes minimising $\|e_p(k)\|_2^2$ are then given by [2]:

$$x_p(k) = \left(M(k)^t H(k)^t H(k) M(k)\right)^{-1} \left(M(k)^t H(k)^t\right) \left(e_{s_{in}}(k) + H(k)\, r_{LTP}(k)\right) \qquad (2)$$

where $M(k)$ is an $N \times J$ location matrix signalling which positions in the excitation sequence are nonzero. Different location matrices corresponding to different grid positions (RPE offsets) are tested and the one with minimum $\|e_p(k)\|_2^2$ is selected. The full $N$-sample excitation used at the decoder to drive the LTP and LP synthesis filters, $x(k)$, can be represented as:

$$x(k) = M(k)\, g_{RPE}\, Q\{x_p(k)\} \qquad (3)$$

where $Q\{x_p(k)\}$ denotes the pulses after quantisation and $g_{RPE}$ a gain associated with the excitation. To achieve an attractive bit rate (40–45 kbit/s), we use decimation 2 and 3-level pulse quantisation with an associated gain per frame (240 samples, 5.4 ms) in our prototype broadband (44.1 kHz sampling) pulse coder.

Addressed problem: The LTP gain for broadband signals is low due to the presence of high frequencies obscuring the signal periodicity present at low frequencies (e.g. due to speaker’s pitch). Long LTP filters (more than three taps) can attain higher gains but exacerbate the stability problems in the LTP synthesis, introducing the need for complex stabilisation procedures [4]. Nevertheless, long-term correlation is still present in the LP residual, appearing as periodic pulse-like trains in $r_{LP}(n)$. Frames containing pulse-like structures generate sets of RPE pulses with large dynamic range, resulting in poor compromise excitations when coarsely quantised.

RPE with extra pulses: We propose to skip the LTP and instead provide the RPE excitation with additional degrees of freedom suitable to model effectively pulse-like trains (i.e. long-term correlations) in the LP residual $r_{LP}(k)$. To this end, the RPE excitation for a frame is complemented with $R$ additional independent pulses with free gains and positions, resulting in an excitation of the form:

$$x_{ext}(k) = M(k)\, g_{RPE}\, Q\{x_p(k)\} + \sum_{i=1}^{R} g_i(k)\, \delta(d_i(k)) \qquad (4)$$

where $M(k)\, g_{RPE}\, Q\{x_p(k)\}$ represents the (quantised) RPE component, $\delta(d_i(k))$ corresponds to an $N$-length vector with a unit-amplitude pulse located at position $d_i(k)$ and zeros elsewhere, and $g_i(k)$ denotes a gain. These gains are quantised independently and more finely than the RPE pulses. Limiting the number of extra pulses to just two makes the extra bit rate rather small (comparable to the LTP bit rate). Given the frame duration of 5.4 ms, the two extra pulses per frame allow pulse trains with frequencies of up to 370 Hz (i.e. most of the human pitch range) to be modelled. Note that (4) can be seen as the combination of RPE and multi-pulse excitation [1].

The computation of the optimum RPE excitation and additional pulses for each LP residual frame, $r_{LP}(k)$, is computationally very complex as all combinations of RPE sequences and extra pulse positions should be examined. To lower the computational burden, the extra pulses are restricted to lie on the RPE grid. This amounts to performing a conventional RPE search where, for each RPE candidate, its two largest pulses are quantised separately and more finely than the rest of the pulses, resulting in the algorithm:

    For each possible RPE offset $j$ do
        $x_p(j)$ ← Compute RPE pulses using (2)
        $d_1(j), d_2(j)$ ← Extract positions of the two largest pulses in $x_p(j)$
        $g_{RPE}(j), g_1(j), g_2(j)$ ← Compute gains using LS on (1) and quantise them
        $E(j)$ ← Construct $x_{ext}(k)$ using (4) and evaluate its associated error
    end
    $j^{opt}, g_{RPE}^{opt}, g_1^{opt}, g_2^{opt}, x_p^{opt}, d_1^{opt}, d_2^{opt}$ ← Parameters with lowest $E(j)$

The RPE sequence and extra pulse gains, $g_{RPE}(j)$, $g_1(j)$, $g_2(j)$, are computed using LS on the error given by (1) over the processed frame.
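One possible reading of this grid-constrained search is sketched below in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rpe_extra_pulses`, `filt`, `lstsq`), the 3-level quantiser rule (sign of each pulse, zeroed below half the largest remaining magnitude) and the use of unquantised gains are all hypothetical choices. The argument `target` stands for the vector $e_{s_{in}}(k) + H(k)\, r_{LP}(k)$, and `h` for the impulse response of $H(z)$ truncated to the frame length.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(n, p):
    """N-length vector with a unit pulse at position p (delta(d_i) in (4))."""
    v = [0.0] * n
    v[p] = 1.0
    return v

def filt(h, x):
    """Apply H(k): convolve x with h, truncated to len(x); needs len(h) >= len(x)."""
    return [sum(h[n - m] * x[m] for m in range(n + 1)) for n in range(len(x))]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def lstsq(cols, target):
    """Weights w minimising ||target - sum_i w_i cols_i||^2 (normal equations)."""
    gram = [[dot(u, v) for v in cols] for u in cols]
    return solve(gram, [dot(u, target) for u in cols])

def rpe_extra_pulses(h, target, decimation, n_extra=2):
    """Return (error, offset, excitation) of the best grid offset."""
    n = len(target)
    best = None
    for j in range(decimation):
        pos = list(range(j, n, decimation))
        # Unquantised RPE pulses on this grid, as in (2).
        xp = lstsq([filt(h, unit(n, p)) for p in pos], target)
        # The largest pulses become the extra pulses (kept on the RPE grid).
        order = sorted(range(len(pos)), key=lambda i: -abs(xp[i]))
        extra, rest = order[:n_extra], order[n_extra:]
        # Assumed 3-level quantiser for the remaining pulses: {-1, 0, +1}.
        thr = 0.5 * max(abs(xp[i]) for i in rest)
        shape = [0.0] * n
        for i in rest:
            if xp[i] != 0 and abs(xp[i]) >= thr:
                shape[pos[i]] = 1.0 if xp[i] > 0 else -1.0
        basis = [shape] + [unit(n, pos[i]) for i in extra]
        basis = [b for b in basis if any(b)]  # drop an all-zero RPE shape
        # Joint LS gains (g_RPE, g_1, g_2); gain quantisation is omitted here.
        gains = lstsq([filt(h, b) for b in basis], target)
        x_ext = [sum(g * b[i] for g, b in zip(gains, basis)) for i in range(n)]
        err = sum((t - y) ** 2 for t, y in zip(target, filt(h, x_ext)))
        if best is None or err < best[0]:
            best = (err, j, x_ext)
    return best
```

Calling `rpe_extra_pulses(h, target, 2)` on a 240-sample frame would correspond to the decimation 2 setting quoted above. The per-offset cost is dominated by the LS solve of (2), which a conventional RPE search performs anyway, so the sketch is consistent with the remark that the extra pulses add little complexity.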
The grid-constrained search is based on the idea that the (two) largest RPE pulses are the ones contributing the most to the error minimisation. This algorithm, as shown by the results, has proved effective in modelling long-term correlations. Also, the encoding of the extra pulses’ positions requires a lower bit rate since they are constrained to the RPE grid. Note that the complexity of this algorithm is only marginally higher than that of conventional RPE.

Results: A comparison of the new technique and long-term prediction (operating at similar bit rate) is presented using a voiced fragment of a male speech excerpt sampled at 44.1 kHz. The first processing step in both cases was a 40th-order LP filter. The plot in Fig. 2a shows the LP analysis filter output, $r_{LP}(n)$. Long-term correlation in this signal is clearly indicated by the presence of a pulse-like periodicity. The plot in Fig. 2b displays the reconstruction error, i.e. the difference between the original and decoded signals, when using a third-order LTP and modelling the resulting residual $r_{LTP}(n)$ using RPE with decimation 2 and 3-level pulse quantisation. If required, the LTP was stabilised according to [4]. Despite the gain provided by the LTP, around 2.5 dB over the whole excerpt, the reconstruction error still exhibits some periodicity that mimics that of $r_{LP}(n)$. In particular, we note that the error is larger in the regions corresponding to pulses in $r_{LP}(n)$. This suggests that the frames containing these pulses cannot be accurately modelled by the RPE stage, presumably due to their large dynamic range.

The plot in Fig. 2c shows the resulting excitation signal when the residual $r_{LP}(n)$ is directly fed to a pulse excitation stage making use of RPE with two extra pulses per frame. As with the LTP, the RPE part was computed using decimation 2 and 3-level quantisation. From the resulting excitation, it can be clearly seen that the extra pulses are mainly used to model the spikes in $r_{LP}(n)$. The plot in Fig. 2d shows the reconstruction error when using RPE with extra pulses. No periodicity can be observed in the reconstruction error. Moreover, the total error using the new technique is lower than when using the LTP (i.e. the new technique provides a larger gain). Listening to both reconstructed signals, the RPE with extra pulses version sounds closer to the original than the LTP version. In particular, an apparent loss of presence due to a poor modelling of the speaker’s pitch could be noticed when using the LTP. Experiments using a variety of speech and audio material show that the extra pulses can also help in modelling other phenomena such as transients.

Fig. 2 Performance comparison of long-term predictor and RPE with extra pulses

Conclusion: A new method to model long-term correlations in RPE-based broadband audio and speech coders has been proposed. The technique consists of extending an RPE excitation with two extra pulses with free gains. A computationally and bit rate efficient way to do this is by setting the extra pulses on the RPE grid. Results show that the new method outperforms conventional LTP.

© IEE 2005
20 December 2004
Electronics Letters online no: 20058338
doi: 10.1049/el:20058338

F. Riera-Palou, A.C. den Brinker and A.J. Gerrits (Philips Research Laboratories, Prof. Holstlaan 4 (WO-02), 5656 AA Eindhoven, The Netherlands)
E-mail: f.riera-palou@philips.com

References

1 Kleijn, W.B., and Paliwal, K. (Eds): ‘Speech coding and synthesis’ (Elsevier, Amsterdam, 1995)
2 Kroon, P., Deprettere, E.F., and Sluijter, R.J.: ‘Regular-pulse excitation - a novel approach to effective and efficient multipulse coding of speech’, IEEE Trans. Acoust. Speech Signal Process., 1986, 34, (5), pp. 1054–1063
3 Singhal, S.: ‘High quality audio coding using multipulse LPC’. Proc. IEEE ICASSP, April 1990, pp. 1101–1104
4 Ramachandran, R.P., and Kabal, P.: ‘Stability and performance of pitch filters in speech coders’, IEEE Trans. Acoust. Speech Signal Process., 1987, 35, (7), pp. 937–945

ELECTRONICS LETTERS 14th April 2005 Vol. 41 No. 8