Modelling long-term correlations in broadband speech and audio

Modelling long-term correlations in
broadband speech and audio pulse coders
F. Riera-Palou, A.C. den Brinker and A.J. Gerrits
A new idea to model long-term correlations in broadband speech and
audio coders based on regular pulse excitation is presented. It involves
expanding the excitation sequence with additional pulses targeting this
type of correlation while being bit rate efficient. Comparison with the
standard method (long-term prediction) highlights the advantages of
the proposed technique.
Introduction: Linear prediction-based narrowband (8 kHz sampling
rate) speech coders exploit the inter-sample correlation (i.e. redundancy) to achieve a high coding efficiency [1]. To this end, the input
signal is typically decorrelated in two consecutive steps: first, a linear
prediction (LP) analysis filter eliminates most of the short-term
correlation; then, a long-term prediction (LTP) analysis filter minimises the long-term correlation. Bit rate reductions are achieved by
modelling this decorrelated signal by an excitation signal, which is
computed using an analysis-by-synthesis (AbS) procedure like regular
pulse excitation (RPE) [2]. This excitation consists of a pulse
sequence that, after being filtered with the LTP and LP synthesis
filters, produces a close replication of the original while requiring only
a relatively low bit rate.
Several attempts have been made, with limited success, to extend the
use of these techniques to broadband (≥32 kHz sampling) speech and
audio signals [3]. In this context, however, LTP is rarely used since its
performance is greatly impaired by the higher bandwidth of the
processed signals. Consequently, long-term correlations, e.g. due to
speaker pitch, are still present during the AbS excitation modelling
stage leading to a significant drop in the coding quality. In this Letter a
modified RPE technique is presented that is capable of modelling
accurately a broadband signal containing long-term correlations with
a minimal increase in bit rate.
where $e_{s_{in}}(k)$ is a vector corresponding to the response of the filter H(z)
due to its initial filter states and truncated to N, er(k) is the vector
containing the difference between the N samples from the LTP residual,
rLTP(k), and its corresponding excitation x(k). H(k) is an $N \times N$ matrix
representing the impulse response (truncated to N) of the filter H(z).
When using RPE, only J equidistant nonzero values are allowed per
frame (decimation N/J). The J pulse amplitudes minimising $\|e_p(k)\|_2^2$
are then given by [2]:
$$x_p(k) = \left(M(k)^t H(k)^t H(k) M(k)\right)^{-1} M(k)^t H(k)^t \left(e_{s_{in}}(k) + H(k)\, r_{LTP}(k)\right) \tag{2}$$
where M(k) is an $N \times J$ location matrix signalling which positions in
the excitation sequence are nonzero. Different location matrices corresponding to different grid positions (RPE offsets) are tested and the one
with minimum $\|e_p(k)\|_2^2$ is selected. The full N-sample excitation used
at the decoder to drive the LTP and LP synthesis filters, x(k), can be
represented as:
$$x(k) = M(k)\, g_{RPE}\, Q\{x_p(k)\} \tag{3}$$
where Q{xp(k)} denotes the pulses after quantisation and gRPE a gain
associated with the excitation. To achieve an attractive bit rate
(40–45 kbit/s), we use a decimation factor of 2 and 3-level pulse quantisation
with an associated gain per frame (240 samples, 5.4 ms) in our
prototype broadband (44.1 kHz sampling) pulse coder.
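The offset search built on (2) can be illustrated with a minimal sketch in plain Python. It assumes a toy 6-sample frame and a toy impulse response in place of the coder's actual H(z). The names `dot`, `solve`, `rpe_search` and `target` are hypothetical, with `target` standing for the term e_sin(k) + H(k) rLTP(k). Quantisation and the gain gRPE are left out.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def rpe_search(H, target, decimation):
    """Try every grid offset; return (offset, pulse amplitudes, error energy) of the best one."""
    n = len(target)
    best = None
    for offset in range(decimation):
        grid = list(range(offset, n, decimation))
        cols = [[H[r][c] for r in range(n)] for c in grid]    # columns of H M
        gram = [[dot(ci, cj) for cj in cols] for ci in cols]  # M^t H^t H M
        rhs = [dot(ci, target) for ci in cols]                # M^t H^t target
        xp = solve(gram, rhs)                                 # eq. (2), unquantised
        err = [t - sum(col[r] * a for col, a in zip(cols, xp))
               for r, t in enumerate(target)]
        energy = dot(err, err)
        if best is None or energy < best[2]:
            best = (offset, xp, energy)
    return best

# Toy data (assumption): decaying impulse response, pulses placed on the odd grid.
n = 6
h = [0.5 ** i for i in range(n)]
H = [[h[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]
x_true = [0.0, 3.0, 0.0, -2.0, 0.0, 1.0]
target = [dot(row, x_true) for row in H]

offset, pulses, energy = rpe_search(H, target, decimation=2)
print(offset, [round(p, 6) for p in pulses], energy)
```

The normal equations are solved directly here to mirror the form of (2); a production coder would more likely use a numerically robust least-squares routine.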
Addressed problem: The LTP gain for broadband signals is low due to
the presence of high frequencies obscuring the signal periodicity
present at low frequencies (e.g. due to speaker’s pitch). Long LTP
filters (more than three taps) can attain higher gains but exacerbate the
stability problems in the LTP synthesis, introducing the need for
complex stabilisation procedures [4]. Nevertheless, long-term correlation is still present in the LP residual, appearing as periodic pulse-like trains in rLP(n). Frames containing pulse-like structures generate
sets of RPE pulses with large dynamic range, resulting in poor
compromise excitations when coarsely quantised.
RPE with extra pulses: We propose to skip the LTP and instead
provide the RPE excitation with additional degrees of freedom
suited to effectively modelling pulse-like trains (i.e. long-term correlations) in the LP residual rLP(k). To this end, the RPE excitation for a
frame is complemented with R additional independent pulses with
free gains and positions, resulting in an excitation of the form:
$$x_{ext}(k) = M(k)\, g_{RPE}\, Q\{x_p(k)\} + \sum_{i=1}^{R} g_i(k)\, \delta(d_i(k)) \tag{4}$$
Fig. 1 Analysis-by-synthesis scheme for narrowband speech coding
Signal decorrelation and modelling: Fig. 1 shows the analysis-by-synthesis (AbS) scheme usually present in linear prediction-based coders. The original signal, s(n), is passed through a linear
prediction analysis filter with transfer function A(z), resulting in the
residual signal rLP(n). The presence of long-term correlation, due for
instance to voiced speech segments, is often revealed as pulse train-like structures in rLP(n). To remove this type of correlation, the
residual rLP(n) is further filtered using a long-term predictor analysis
filter with transfer function P(z), resulting in a second residual signal,
rLTP(n). A pulse sequence, x(n), is generated in an AbS manner
involving LTP and LP synthesis, perceptual weighting and minimisation, typically using the least-squares (LS) technique. In Fig. 1, the
LTP and LP synthesis filters are denoted by 1/P(z) and 1/A(z),
respectively, and the perceptual filter by W(z). The cascade of these
three filters is denoted as H(z). The signal ep(n) represents the error
being minimised.
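To make the role of H(z) concrete, the following minimal sketch in plain Python computes the truncated impulse response of a rational filter and arranges it as a lower-triangular matrix, so that filtering a frame from zero initial state becomes a matrix-vector product. The helper names (`impulse_response`, `convolution_matrix`, `matvec`) and the first-order toy filter are assumptions made for illustration; the coefficients of the actual cascade W(z)/(P(z)A(z)) are not given in the text.

```python
def impulse_response(b, a, n):
    """First n samples of the impulse response of B(z)/A(z), assuming a[0] == 1."""
    h = []
    for i in range(n):
        acc = b[i] if i < len(b) else 0.0
        # Recursive part: subtract a[j] * h[i - j] for the available past outputs.
        for j in range(1, min(i, len(a) - 1) + 1):
            acc -= a[j] * h[i - j]
        h.append(acc)
    return h

def convolution_matrix(h):
    """Lower-triangular Toeplitz matrix with entry (r, c) equal to h[r - c]."""
    n = len(h)
    return [[h[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]

def matvec(H, x):
    return [sum(hrc * xc for hrc, xc in zip(row, x)) for row in H]

# Toy stand-in for the cascade (assumption): H(z) = 1 / (1 - 0.5 z^-1), frame length 4.
h = impulse_response([1.0], [1.0, -0.5], 4)
H = convolution_matrix(h)

# Filtering a unit pulse at position 0 reproduces the impulse response itself.
response = matvec(H, [1.0, 0.0, 0.0, 0.0])
print(h, response)
```

The contribution of the filter's initial states is not covered by this product, which is why the text treats it as a separate vector.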
Assuming a frame length of N samples, the vector ep(k), consisting of
N successive error samples over frame k, can be expressed as:
$$e_p(k) = e_{s_{in}}(k) + H(k)\, e_r(k) \tag{1}$$
ELECTRONICS LETTERS 14th April 2005
where M(k)gRPE Q{xp(k)} represents the (quantised) RPE component,
$\delta(d_i(k))$ corresponds to an N-length vector with a unit-amplitude pulse
located at position di(k) and zeros elsewhere, and gi(k) denotes a gain.
These gains are quantised independently and more finely than the RPE
pulses. Limiting the number of extra pulses to just two makes the extra
bit rate rather small (comparable to the LTP bit rate). Given the frame
duration of 5.4 ms, the two extra pulses per frame allow pulse trains
with frequencies of up to 370 Hz (i.e. most of the human pitch range) to
be modelled. Note that (4) can be seen as the combination of RPE and
multi-pulse excitation ([1]).
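The 370 Hz figure can be checked with a short worked calculation. It assumes one extra pulse per pitch period, so that R extra pulses per frame can follow a pulse train with at most R periods per frame; the symbols f_s (sampling rate) and f_max are introduced here for illustration only.

```latex
f_{\max} = \frac{R\, f_s}{N}
         = \frac{2 \times 44100\ \text{Hz}}{240}
         = 367.5\ \text{Hz} \approx 370\ \text{Hz}
```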
The computation of the optimum RPE excitation and additional
pulses for each LP residual frame, rLP(k), is computationally very
complex as all combinations of RPE sequences and extra pulse
positions should be examined. To lower the computational burden,
the extra pulses are restricted to lie on the RPE grid. This amounts to
performing a conventional RPE search where for each RPE candidate,
its two largest pulses are quantised separately and more finely than the
rest of the pulses resulting in the algorithm:
For each possible RPE offset j do
    xp(j) ← Compute RPE pulses using (2)
    d1(j), d2(j) ← Extract positions of the two largest pulses in xp(j)
    gRPE(j), g1(j), g2(j) ← Compute gains using LS on (1) and quantise them
    E(j) ← Construct xext(k) using (4) and evaluate its associated error
end
jopt, gRPEopt, g1opt, g2opt, xpopt, d1opt, d2opt ← Parameters with lowest E(j)
Vol. 41 No. 8
The gains of the RPE sequence and of the extra pulses, gRPE( j), g1( j), g2( j), are
computed using LS on the error given by (1) over the processed frame.
This strategy is based on the idea that the (two) largest RPE pulses are
the ones contributing the most to the error minimisation. This
algorithm, as shown by the results, has proved effective in modelling
long-term correlations. Also, the encoding of the extra pulses’ positions
requires a lower bit rate since they are constrained to the RPE grid. Note
that the complexity of this algorithm is only marginally higher than that
of conventional RPE.
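The grid-constrained search described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The 3-level quantiser (sign of the pulse if its magnitude reaches half the largest remaining pulse, zero otherwise) is hypothetical. The two largest pulses are assumed to be removed from the ternary sequence. Gain quantisation is omitted, and `target` stands for e_sin(k) + H(k) rLP(k). All function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def least_squares(cols, target):
    """Coefficients minimising ||target - sum_j c[j] * cols[j]||^2 via normal equations."""
    gram = [[dot(ci, cj) for cj in cols] for ci in cols]
    return solve(gram, [dot(ci, target) for ci in cols])

def rpe_extra_pulse_search(H, target, decimation, threshold=0.5):
    n = len(target)

    def column(c):
        return [H[r][c] for r in range(n)]

    best = None
    for offset in range(decimation):
        grid = list(range(offset, n, decimation))
        xp = least_squares([column(c) for c in grid], target)   # eq. (2)
        order = sorted(range(len(grid)), key=lambda j: abs(xp[j]), reverse=True)
        d1, d2 = grid[order[0]], grid[order[1]]                 # two largest pulses
        rest = order[2:]
        peak = max((abs(xp[j]) for j in rest), default=0.0)
        q = [0.0] * n                                           # hypothetical 3-level quantiser
        for j in rest:
            if peak > 0 and abs(xp[j]) >= threshold * peak:
                q[grid[j]] = 1.0 if xp[j] > 0 else -1.0
        hq = [dot(row, q) for row in H]
        basis = [hq, column(d1), column(d2)] if any(q) else [column(d1), column(d2)]
        coeffs = least_squares(basis, target)                   # LS gains (unquantised)
        err = [t - sum(cf * b[r] for cf, b in zip(coeffs, basis))
               for r, t in enumerate(target)]
        gains = coeffs if any(q) else [0.0] + coeffs            # [gRPE, g1, g2]
        energy = dot(err, err)
        if best is None or energy < best["energy"]:
            best = {"offset": offset, "d": (d1, d2), "gains": gains, "energy": energy}
    return best

# Toy frame (assumption): two large spikes plus two small pulses on the odd grid.
n = 8
h = [0.5 ** i for i in range(n)]
H = [[h[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]
x_true = [0.0, 5.0, 0.0, 0.5, 0.0, -4.0, 0.0, -0.5]
target = [dot(row, x_true) for row in H]
best = rpe_extra_pulse_search(H, target, decimation=2)
print(best)
```

In this toy frame the two spikes are picked up as the extra pulses and the remaining pulses are covered by a single shared gain, which is the behaviour the text attributes to the extra degrees of freedom.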
RPE stage, presumably, due to their large dynamic range. The plot
in Fig. 2c shows the resulting excitation signal when the residual
rLP(n) is directly fed to a pulse excitation stage making use of RPE
with two extra pulses per frame. As with the LTP, the RPE part was
computed using a decimation factor of 2 and 3-level quantisation. From the
resulting excitation, it can be clearly seen that the extra pulses are
mainly used to model the spikes in rLP(n). The plot in Fig. 2d shows
the reconstruction error when using RPE with extra pulses. No
periodicity can be observed in the reconstruction error. Moreover, the
total error using the new technique is lower than when using the
LTP (i.e. the new technique provides a larger gain). Listening to both reconstructed
signals, the RPE with extra pulses version sounds closer to the
original than the LTP version. In particular, an apparent loss of
presence due to a poor modelling of the speaker’s pitch could be
noticed when using the LTP. Experiments using a variety of speech
and audio material show that the extra pulses can also help in
modelling other phenomena such as transients.
Conclusion: A new method to model long-term correlations in
RPE-based broadband audio and speech coders has been proposed.
The technique consists of extending an RPE excitation with two
extra pulses with free gains. A computationally and bit
rate efficient way to do this is by setting the extra pulses on the
RPE grid. Results show that the new method outperforms conventional LTP.
Fig. 2 Performance comparison of long-term predictor and RPE with extra
pulses
Results: A comparison of the new technique and long-term prediction (operating at similar bit rate) is presented using a voiced
fragment of a male speech excerpt sampled at 44.1 kHz. The first
processing step in both cases was a 40th-order LP analysis filter.
The plot in Fig. 2a shows the LP analysis filter output, rLP(n). Long-term correlation in this signal is clearly indicated by the presence of
a pulse-like periodicity. The plot in Fig. 2b displays the reconstruction error, i.e. the difference between the original and decoded
signals, when using a third-order LTP and modelling the resulting
residual rLTP(n) using RPE with a decimation factor of 2 and 3-level pulse
quantisation. If required, the LTP was stabilised according to [4].
Despite the gain provided by the LTP, around 2.5 dB over the whole
excerpt, the reconstruction error still exhibits some periodicity that
mimics that of rLP(n). In particular, we note that the error is larger in
the regions corresponding to pulses in rLP(n). This suggests that the
frames containing these pulses cannot be accurately modelled by the
© IEE 2005
Electronics Letters online no: 20058338
doi: 10.1049/el:20058338
20 December 2004
F. Riera-Palou, A.C. den Brinker and A.J. Gerrits (Philips Research
Laboratories, Prof. Holstlaan 4 (WO-02), 5656 AA Eindhoven, The
Netherlands)
E-mail: f.riera-palou@philips.com
References
1 Kleijn, W.B., and Paliwal, K. (Eds): ‘Speech coding and synthesis’ (Elsevier, Amsterdam, 1995)
2 Kroon, P., Deprettere, E.F., and Sluijter, R.J.: ‘Regular-pulse excitation - a novel approach to effective and efficient multipulse coding of speech’, IEEE Trans. Acoust. Speech Signal Process., 1986, 34, (5), pp. 1054–1063
3 Singhal, S.: ‘High quality audio coding using multipulse LPC’. Proc. IEEE ICASSP, April 1990, pp. 1101–1104
4 Ramachandran, R.P., and Kabal, P.: ‘Stability and performance of pitch filters in speech coders’, IEEE Trans. Acoust. Speech Signal Process., 1987, 35, (7), pp. 937–945