Pitch Prefiltering and Postfiltering Techniques for Improving the Audio
Quality of the IETF Opus Mode 1 Codec (the Original CELT Codec)
Juin-Hwey (Raymond) Chen
Broadcom Corporation
5300 California Avenue
Irvine, California 92612
USA
1. Introduction
This document describes pitch prefiltering and postfiltering techniques for improving the output audio
quality of the IETF Opus mode 1 codec (the original CELT codec) when the input audio signal is periodic
or nearly periodic. Such techniques are found to provide a fairly substantial improvement to the output
audio quality of the CELT codec for nearly periodic audio signals produced by certain solo music
instruments such as a trumpet, especially for low-delay modes with smaller codec frame sizes.
An inherent limitation of low-delay transform audio codecs employing small transform window sizes is
that the frequency resolution of such transforms is insufficient to resolve the pitch harmonics of some of
the nearly periodic segments of music and speech signals. As a result, such low-delay transform codecs
tend to give more audible coding distortion when encoding nearly periodic music and speech signals,
even though the coding performance may be fine for other non-periodic signals. Increasing the transform
window size will reduce such coding distortion for nearly periodic music and speech signals, but will also
have the undesirable effect of increasing the coding delay. The pitch prefiltering and postfiltering
techniques introduced in this document have the potential to reduce such coding distortion significantly for
nearly periodic music and speech signals without increasing the coding delay and with only a slight
increase in the codec complexity and encoding bit-rate (around 2 kb/s, or less with a range coder).
This document gives a detailed technical description of such pitch prefiltering and postfiltering
techniques, with the goal of enabling CELT codec developers to implement and integrate these techniques
in the current C code for CELT to evaluate the technical merits of these techniques. If such evaluation
confirms our preliminary findings of substantial audio quality improvement as expected, then it is
proposed that such pitch prefiltering and postfiltering techniques be incorporated into Mode 1 of the IETF
Opus codec to ensure a more uniform audio quality level across a wider variety of different audio signals.
2. Detailed Description of the Techniques
The pitch prefiltering is a pre-processing technique applied to the input audio signal before the CELT
encoder. The pitch postfiltering is a post-processing technique applied to the output decoded audio signal
of the CELT decoder. Figure 1 shows a simplified high-level block diagram of such an arrangement.
[Figure 1: block diagram. Input Audio Signal -> Pitch Prefilter -> CELT Audio Encoder -> Compressed
Audio Bit-stream -> CELT Audio Decoder -> Pitch Postfilter -> Output Audio Signal]
Fig. 1 Simplified high-level block diagram of a system incorporating pitch prefiltering and postfiltering
The pitch prefilter adaptively boosts the frequency components in the spectral valleys between pitch
harmonics when the input signal exhibits significant pitch periodicity. The effect is essentially adaptive
comb filtering. The prefiltered version of the input audio signal is then encoded by the CELT encoder and
decoded by the CELT decoder as usual. The decoded audio signal is then passed through a corresponding
pitch postfilter, which in the ideal case is an exact inverse filter of the pitch prefilter for that frame of the
audio signal. Thus, the pitch postfilter attenuates the inter-harmonic spectral valleys. The combined
effect of the pitch prefilter and pitch postfilter in the system of Fig. 1 is to shape the spectrum of the
coding noise such that the noise spectrum in the inter-harmonic spectral valley regions is attenuated
relative to that of a system without the pitch prefilter and postfilter.
To allow the pitch postfilter to be an exact inverse filter of the pitch prefilter, it is necessary to transmit
the pitch period and the pitch filter tap(s) of the pitch prefilter from the encoder side to the decoder side.
Fig. 2 depicts a more detailed block diagram showing pitch parameter estimation, quantization,
transmission, and decoding. The main audio signal path and the main blocks in Fig. 1 are shown with
thick lines.
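[Figure 2: block diagram. Encoder side: the Input Audio Signal feeds the Pitch Parameter Estimator and
the Pitch Pre-filter; the Pitch Parameters pass through the Pitch Parameter Quantizer, which supplies the
Quantized Pitch Parameters to the Pitch Pre-filter and the Compressed Pitch Parameters Bit-stream to the
Bit-stream Multiplexer or Joint Coder; the Pitch Pre-filter output is coded by the CELT Audio Encoder into
the Compressed Audio Bit-stream, which also enters the Bit-stream Multiplexer or Joint Coder. Decoder
side: the Bit-stream De-multiplexer or Joint Decoder recovers the Compressed Pitch Parameters Bit-stream,
which the Pitch Parameter Decoder turns into the Quantized Pitch Parameters, and the Compressed Audio
Bit-stream, which the CELT Audio Decoder decodes; the Pitch Post-filter uses both to produce the Output
Audio Signal.]
Fig. 2 A more detailed block diagram of a system incorporating pitch prefiltering and postfiltering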
It should be noted that to achieve the desired noise spectral shaping effect of attenuating the spectral
valleys between pitch harmonics adequately, it is essential for the pitch parameter estimator to extract the
true pitch period rather than the integer multiples of the true pitch period. Any low-complexity pitch
estimator can be used here to extract the pitch period at a rate of about one pitch period every 5 ms or so,
as long as it can extract the true pitch period (rather than integer multiples of it) at least the vast majority
of the time.
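The requirement to find the true pitch period rather than a multiple of it can be illustrated with a
minimal sketch. This is not the estimator used in our simulations; it is one simple heuristic, assuming a
brute-force normalized-correlation search. The function name estimate_pitch, the lag range, and the 0.9
tolerance are all hypothetical choices made for illustration.

```python
import math

def estimate_pitch(x, start, L, p_min, p_max, tolerance=0.9):
    """Pick the smallest lag that is a local correlation peak close to the best score.

    x holds the signal; the analysis window is x[start:start+L] and at least
    p_max samples of history must precede it. Preferring the smallest strong
    peak avoids returning 2p or 3p when p is the true period.
    """
    scores = {}
    for p in range(p_min, p_max + 1):
        num = e_cur = e_past = 0.0
        for n in range(start, start + L):
            num += x[n] * x[n - p]
            e_cur += x[n] * x[n]
            e_past += x[n - p] * x[n - p]
        den = math.sqrt(e_cur * e_past)
        scores[p] = num / den if den > 0.0 else 0.0

    best = max(scores.values())
    # Only interior lags can be tested for being a local maximum.
    for p in range(p_min + 1, p_max):
        is_peak = scores[p] >= scores[p - 1] and scores[p] >= scores[p + 1]
        if is_peak and scores[p] >= tolerance * best:
            return p
    # Fallback (hypothetical): plain argmax if no interior peak qualifies.
    return max(scores, key=scores.get)

# Synthetic nearly periodic test signal with a period of 50 samples.
x = [math.sin(2 * math.pi * n / 50)
     + 0.5 * math.sin(4 * math.pi * n / 50 + 0.3)
     + 0.3 * math.sin(6 * math.pi * n / 50 + 1.0) for n in range(400)]
estimated = estimate_pitch(x, start=200, L=200, p_min=20, p_max=160)
```

In this example the lags 50, 100, and 150 all score essentially the same, so a plain maximum search could
return any of them; taking the smallest qualifying peak selects the true period.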
The pitch parameter estimator in Fig. 2 also needs to calculate the pitch filter tap(s) to be used by the
pitch prefilter and postfilter. In its simplest form, a single non-zero filter tap at a bulk delay of the pitch
period is used for both the pitch prefilter and postfilter. Although this is simple, the downside is that the
difference between peaks and valleys in the frequency responses of such single-tap pitch filters remains
constant across the frequency range. Thus, for those audio signals where the pitch harmonic peaks are
prominent only in low frequencies, such single-tap pitch filters may introduce more periodicity in the
high-frequency region than in the original input audio signal. To avoid this problem, one can use multiple
filter taps (for example, 2, 3, 4, or 5 taps) around the bulk delay of the pitch period. By properly choosing
the filter taps, one can reduce the difference between the peaks and valleys of the frequency response as
the frequency increases. It is certainly also possible to use sub-band decomposition and apply pitch
prefiltering and postfiltering only to the lower band, although doing so inevitably incurs additional
filtering delay and complexity.
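The effect of multiple taps can be checked numerically. The sketch below compares a single tap with a
three-tap filter centered on the pitch period. The tap weights (0.25, 0.5, 0.25) scaled by b are an
assumption chosen only for illustration, as are the function name comb_gain and the values of fs, p, and b.

```python
import cmath
import math

def comb_gain(freq_hz, fs, p, taps):
    """Magnitude of H(z) = 1 - sum_k taps[k] * z^-(p + k - center) at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    center = len(taps) // 2
    acc = 0j
    for k, t in enumerate(taps):
        acc += t * cmath.exp(-1j * w * (p + k - center))
    return abs(1.0 - acc)

fs, p, b = 48000, 100, 0.6          # pitch frequency fs / p = 480 Hz
single = [b]                         # one tap at delay p
triple = [0.25 * b, 0.5 * b, 0.25 * b]  # taps at delays p-1, p, p+1

low_h, high_h = 480.0, 19200.0       # 1st and 40th pitch harmonics
single_low = comb_gain(low_h, fs, p, single)
single_high = comb_gain(high_h, fs, p, single)
triple_low = comb_gain(low_h, fs, p, triple)
triple_high = comb_gain(high_h, fs, p, triple)
```

With these assumed weights the single-tap filter has the same notch depth (gain 1 - b) at both
harmonics, while the three-tap filter keeps nearly the same depth at the low harmonic and is almost flat
at the high one.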
In our preliminary simulations, even a simple single-tap pitch prefilter and postfilter can achieve
significant audio quality improvement for many audio signals. The following description will concentrate
on such a single-tap pitch prefilter and postfilter. The description can easily be generalized to the
multi-tap filter case.
The pitch prefilter can take several possible forms. It can be an all-zero filter, an all-pole filter, or a
pole-zero filter. At its simplest, the pitch prefilter can be implemented as an all-zero Finite Impulse Response
(FIR) filter with a single filter tap at a bulk delay of the pitch period. More specifically, let b denote the
filter tap weight and let p denote the pitch period in samples, i.e., the time period by which the nearly
periodic input audio signal approximately repeats its waveform. Then the relationship between the input
signal sample s(n) and the output signal sample d(n) at time index n is defined by the following difference
equation.
$$d(n) = s(n) - b\, s(n-p)$$
Such an all-zero FIR filter has a transfer function of
$$H_{pre}(z) = \frac{D(z)}{S(z)} = 1 - b\, z^{-p}.$$
Typically, the tap weight is chosen to be $0 \le b \le 1$, with b = 0 when there is not sufficient periodicity
detected in the input audio signal. The more periodic the input audio signal, the closer b is to 1. The
frequency response of such a filter $H_{pre}(z)$ has equally-spaced downward spikes located at the harmonic
frequencies of the pitch frequency (Fs / p) Hz, where Fs is the sampling rate of the input audio signal in
Hz.
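As a minimal sketch of the single-tap all-zero prefilter, assuming zero signal history before the first
sample, the difference equation and its frequency response can be written as follows. The function names
and the values of fs, p, and b are illustrative assumptions.

```python
import cmath
import math

def pitch_prefilter(s, p, b):
    """All-zero single-tap prefilter d(n) = s(n) - b * s(n - p), zero initial history."""
    d = []
    for n in range(len(s)):
        past = s[n - p] if n >= p else 0.0
        d.append(s[n] - b * past)
    return d

def prefilter_gain(freq_hz, fs, p, b):
    """Magnitude of H_pre(z) = 1 - b * z^-p evaluated on the unit circle."""
    w = 2.0 * math.pi * freq_hz / fs
    return abs(1.0 - b * cmath.exp(-1j * w * p))

fs, p, b = 48000, 100, 0.6   # pitch frequency fs / p = 480 Hz
gain_at_harmonic = prefilter_gain(480.0, fs, p, b)  # downward spike, equals 1 - b
gain_at_valley = prefilter_gain(240.0, fs, p, b)    # midway between harmonics, equals 1 + b
```

The gain is 1 - b at each pitch harmonic and 1 + b halfway between harmonics, which is the valley boost
described above.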
The pitch postfilter is the exact inverse filter of the pitch prefilter. Denote its input signal as
$\tilde{d}(n)$ and its output signal as $\tilde{s}(n)$ at time index n; then the input-output relationship
of the pitch postfilter is given by
$$\tilde{s}(n) = \tilde{d}(n) + b\, \tilde{s}(n-p)$$
Such a pitch postfilter has a transfer function of
$$H_{post}(z) = \frac{\tilde{S}(z)}{\tilde{D}(z)} = \frac{1}{1 - b\, z^{-p}}$$
This filter has a frequency response that is the mirror image of the prefilter's frequency response about
the horizontal axis (on a logarithmic magnitude scale), with upward spikes located at the harmonic
frequencies of the pitch frequency (Fs / p) Hz.
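The recursive nature of the all-pole postfilter is easiest to see in code. The following is a minimal
sketch, assuming zero output history before the first sample; the function name pitch_postfilter and the
example values are illustrative.

```python
def pitch_postfilter(d, p, b):
    """All-pole single-tap postfilter: out(n) = d(n) + b * out(n - p), zero initial history."""
    out = []
    for n in range(len(d)):
        past = out[n - p] if n >= p else 0.0  # feedback from the filter's own output
        out.append(d[n] + b * past)
    return out

# The impulse response repeats every p samples with weights 1, b, b^2, ...
impulse = [1.0] + [0.0] * 9
response = pitch_postfilter(impulse, 3, 0.5)
```

Because each output sample feeds back after p samples, any feature of the output, including a
discontinuity, recurs one pitch period later with weight b.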
One can use an all-pole pitch prefilter in the form of
$$H_{pre}(z) = \frac{1}{1 + a\, z^{-p}}$$
and a corresponding all-zero pitch postfilter in the form of
$$H_{post}(z) = 1 + a\, z^{-p}$$
Furthermore, one can even use pole-zero filters for both the pitch prefilter and the pitch postfilter, in the
forms of
$$H_{pre}(z) = \frac{1 - b\, z^{-p}}{1 + a\, z^{-p}}$$
and
$$H_{post}(z) = \frac{1 + a\, z^{-p}}{1 - b\, z^{-p}}$$
respectively. These last pole-zero forms give more control over the shape of the
frequency response around each pitch harmonic, although at the cost of more computational
complexity.
To keep the pitch prefilter and postfilter low in complexity and conceptually simple, the rest of this
document will concentrate on the first example above, where $H_{pre}(z) = 1 - b\, z^{-p}$ and
$H_{post}(z) = 1 / (1 - b\, z^{-p})$. It is again relatively easy to generalize or extend to other forms of
pitch filters.
Note that if there is no quantization applied on the pitch prefilter output signal d(n), then
$\tilde{d}(n) = d(n)$, and thus from the two difference equations above, it follows that
$$\tilde{s}(n) = d(n) + b\, \tilde{s}(n-p) = s(n) - b\, s(n-p) + b\, \tilde{s}(n-p)$$
By ensuring that the set of {p, b} is identical in the pitch prefilter and the pitch postfilter, and by ensuring
that the signal arrays $\{s(n)\}$ and $\{\tilde{s}(n)\}$ start with the same initial condition, the second and
third terms on the right side of the last equal sign in the equation above will exactly cancel each other out,
resulting in $\tilde{s}(n) = s(n)$, that is, perfect reconstruction.
Of course, with the quantization effect introduced by the CELT encoder and CELT decoder, this perfect
reconstruction property is lost. However, if the quantization error is relatively small, i.e., the
signal-to-coding-noise ratio is reasonably high, then the output signal of the pitch postfilter will still be
reasonably close to the input signal of the pitch prefilter.
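The reconstruction behavior can be checked with a short round-trip sketch. It assumes zero initial
history on both sides, and it stands in for the CELT encoder and decoder with small uniform random noise
added to the prefilter output. That noise model is purely an assumption for illustration and does not
represent actual CELT coding noise.

```python
import math
import random

def pitch_prefilter(s, p, b):
    """d(n) = s(n) - b * s(n - p), zero initial history."""
    return [s[n] - b * (s[n - p] if n >= p else 0.0) for n in range(len(s))]

def pitch_postfilter(d, p, b):
    """out(n) = d(n) + b * out(n - p), zero initial history."""
    out = []
    for n in range(len(d)):
        out.append(d[n] + b * (out[n - p] if n >= p else 0.0))
    return out

p, b = 40, 0.6
s = [math.sin(2 * math.pi * n / p) + 0.2 * math.sin(0.37 * n) for n in range(480)]

# No quantization: the postfilter undoes the prefilter.
clean = pitch_postfilter(pitch_prefilter(s, p, b), p, b)
clean_err = max(abs(x - y) for x, y in zip(clean, s))

# Stand-in for coding noise (assumption): uniform noise of amplitude q_max.
rng = random.Random(1)
q_max = 1e-3
noisy_d = [v + rng.uniform(-q_max, q_max) for v in pitch_prefilter(s, p, b)]
noisy = pitch_postfilter(noisy_d, p, b)
noisy_err = max(abs(x - y) for x, y in zip(noisy, s))
```

In this sketch the output error obeys e(n) = q(n) + b e(n - p), so it cannot exceed q_max / (1 - b); the
output stays close to the input as long as the added noise is small.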
For the pitch filter tap b, it is reasonable to make it proportional to a parameter that measures the
correlation between the adjacent pitch cycle waveforms, such as the cosine of the angle between the
vector of the current frame of audio signal samples and the vector of the audio samples that are one pitch
period earlier. Specifically, let L be the length of the frame and let time index n = 1, 2, …, L correspond
to the current frame, then, the normalized correlation c, which is the cosine of the angle described above,
is calculated as
$$c = \frac{\sum_{n=1}^{L} s(n)\, s(n-p)}{\sqrt{\sum_{n=1}^{L} s^2(n)\, \sum_{n=1}^{L} s^2(n-p)}}$$
To reduce the complexity, the normalized correlation may be approximated by the optimal tap weight of
the single-tap pitch predictor, calculated as
$$c \approx \frac{\sum_{n=1}^{L} s(n)\, s(n-p)}{\sum_{n=1}^{L} s^2(n-p)}$$
The pitch prefilter and postfilter tap b can then be obtained as
$$b = \begin{cases} b_{max} & \text{if } c \ge 1 \\ b_{max}\, c & \text{if } T \le c < 1 \\ 0 & \text{if } c < T \end{cases},$$
where a reasonable value of bmax may be in the range of 0.4 to 0.9, while a reasonable value of the
threshold T may be around 0.6, although a threshold of 0 will work, too.
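A minimal sketch of the tap calculation follows, assuming 0-based indexing with the current frame
starting at index start, so that x[start - p] onward supplies the samples one pitch period earlier. The
function names and the default values b_max = 0.7 and T = 0.6 are illustrative picks from the ranges
mentioned above.

```python
import math

def normalized_correlation(x, start, L, p):
    """Cosine of the angle between x[start:start+L] and the segment p samples earlier."""
    num = e_cur = e_past = 0.0
    for n in range(start, start + L):
        num += x[n] * x[n - p]
        e_cur += x[n] * x[n]
        e_past += x[n - p] * x[n - p]
    den = math.sqrt(e_cur * e_past)
    return num / den if den > 0.0 else 0.0

def pitch_tap(c, b_max=0.7, T=0.6):
    """Map the correlation c to the filter tap b using the three-way rule."""
    if c >= 1.0:
        return b_max
    if c >= T:
        return b_max * c
    return 0.0

# A perfectly periodic signal gives c close to 1 and therefore b close to b_max.
x = [math.sin(2 * math.pi * n / 40) for n in range(200)]
c = normalized_correlation(x, start=40, L=120, p=40)
b = pitch_tap(c)
```

Weakly periodic frames (c below T) yield b = 0, which turns both filters into pass-through for that frame.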
Note that if the frame size is very small (such as 2.5 ms), the summation over such a short period may not
give a reliable result. In this case, it may be beneficial to use a longer summation window by including
some of the signal samples in previous frames.
One practical problem when applying such a pitch prefilter and pitch postfilter is that when the pitch period
p and the filter tap b change at the frame boundary, there is often a waveform discontinuity in the output
signals of such filters, and this will cause an audible click and can cause undesirable effects in the audio
encoder. This problem can be avoided by applying an overlap-add method as described in U.S. Patent
No. 7,353,168. Specifically, at the beginning of the current frame and with the filter memory set at the
value left after filtering the last sample of the last frame, two filtering operations are performed for the
first K samples of the current frame, where the value of K is usually chosen to correspond to 2.5 ms or
longer. The first filtering operation is performed with the filter parameters (the pitch period p and pitch
filter tap b) of the last frame, and the second filtering operation is performed with the filter parameters of
the current frame. Note that both filtering operations should start with the same filter memory that was
left after filtering the last sample of the last frame. A fade-out window (can be a downward-sloping
triangular window) of K samples is applied to the output signal of the first filtering operation, while a
fade-in window (can be an upward-sloping triangular window) of K samples is applied to the output
signal of the second filtering operation. The sum of the fade-in and fade-out windows is unity at every
one of the K samples. The two windowed filter output signals are added together and used as the final
filter output signal. It is assumed that K ≤ L. If K < L, then from (K+1)-th sample to the last (L-th) sample
of the current frame, only one filtering operation is performed using the filter parameters of the current
frame. Such an overlap-add filtering method ensures a smooth waveform transition and eliminates
waveform discontinuities at frame boundaries.
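For the all-zero prefilter, the overlap-add procedure above can be sketched as follows. The filter memory
of an FIR filter is simply the past input, so both filtering operations read the same history. The
function names and the particular triangular window wi(n) = (n + 1) / (K + 1) are assumptions; any
fade-in and fade-out pair that sums to unity would serve.

```python
def fade_windows(K):
    """Triangular fade-in and fade-out windows whose sum is 1 at every sample."""
    wi = [(n + 1) / (K + 1) for n in range(K)]
    wo = [1.0 - w for w in wi]
    return wi, wo

def prefilter_frame_ola(history, frame, p_prev, b_prev, p, b, K):
    """Prefilter one frame, cross-fading from last frame's (p_prev, b_prev) to (p, b).

    history: past input samples, at least max(p_prev, p) long.
    frame:   the L input samples of the current frame, with K <= L.
    """
    x = list(history) + list(frame)
    off = len(history)
    wi, wo = fade_windows(K)
    out = []
    for n in range(len(frame)):
        cur = frame[n] - b * x[off + n - p]            # current-frame parameters
        if n < K:
            old = frame[n] - b_prev * x[off + n - p_prev]  # last-frame parameters
            out.append(wo[n] * old + wi[n] * cur)
        else:
            out.append(cur)
    return out

# Constant input: output glides from 1 - b_prev = 0.5 toward 1 - b = 1.0 over K samples.
faded = prefilter_frame_ola([1.0] * 10, [1.0] * 8, 4, 0.5, 5, 0.0, 4)
```

With K = 4 in this toy case the first four outputs step through 0.6, 0.7, 0.8, 0.9 instead of jumping
directly from 0.5 to 1.0.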
Note that with such an overlap-add filtering approach, the perfect reconstruction property for the
non-overlap-add version of the simple pitch prefilter and pitch postfilter as described earlier no longer holds
true. In fact, it can be shown that to maintain the perfect reconstruction property, the parallel filtering and
overlap-add of the two filtered output signals should be performed not for the entire all-pole pitch
postfilter $H_{post}(z) = 1 / (1 - b\, z^{-p})$, but only for the all-zero FIR filter $b\, z^{-p}$ in the
feedback branch of the all-pole filter $H_{post}(z)$. For the pitch prefilter $H_{pre}(z) = 1 - b\, z^{-p}$,
applying the overlap-add filtering approach to the entire $H_{pre}(z)$ filter is mathematically equivalent to
applying the overlap-add filtering approach only to the all-zero FIR filter $b\, z^{-p}$ in the feed-forward
branch of the all-zero filter $H_{pre}(z)$.
The all-zero pitch prefilter with overlap-add is relatively straightforward to implement. On the other
hand, due to the recursive nature of all-pole filtering, the all-pole pitch postfilter needs to be handled with
care, especially when the pitch period is smaller than the overlap-add length K. The most important thing
to note is that the two filtering operations cannot be implemented independently of each other for the entire
K samples and then windowed and overlap-added together. This is because the waveform discontinuity at
the beginning of the current frame resulting from such independent filtering will be repeated before the
K-sample overlap-add period is over, and therefore the overlap-add operation will not be able to smooth
out such repeated waveform discontinuities after the beginning of the current frame.
The correct way to implement the overlap-add approach is to perform the overlap-add of the two filter
output signals sample by sample. Then, the waveform discontinuity at the beginning of the frame is
already smoothed out by the overlap-add operation by the time the filtering operation reaches one pitch
period into the frame, so there will not be a repeated waveform discontinuity there.
Specifically, let the time index n for the current frame be from 1 to L, and let $w_i(n)$ and $w_o(n)$ be the
fade-in window sample and fade-out window sample at time index n, respectively. In addition, let $p_0$
and $b_0$ be the pitch period and the pitch postfilter tap of the previous frame, respectively. Then, the
all-pole pitch postfiltering with overlap-add should be performed sample-by-sample for the first K samples of
the current frame by the following pseudo-code.
For n from 1 to K, calculate the pitch postfilter output sample as
$$\tilde{s}(n) = \tilde{d}(n) + w_o(n)\, b_0\, \tilde{s}(n-p_0) + w_i(n)\, b\, \tilde{s}(n-p)$$
end
After filtering the first K samples, if L > K, then the filtering from the (K+1)-th sample to the L-th sample
is just simple all-pole filtering using the difference equation
$$\tilde{s}(n) = \tilde{d}(n) + b\, \tilde{s}(n-p).$$
In our simulations, K is chosen to correspond to 2.5 ms, or 120 samples at 48 kHz sampling rate.
By doing this, we don’t need to change the overlap-add length K for all four possible frame sizes of the
Opus mode 1 CELT codec, since the smallest frame size is 2.5 ms.
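The sample-by-sample overlap-add postfilter, and the fact that it inverts the overlap-add prefilter when
no quantization is applied, can be sketched as follows. The window shape, function names, test signal,
and parameter values are assumptions for illustration; K = 120 matches the 2.5 ms overlap at 48 kHz, and
both pitch periods are deliberately shorter than K.

```python
import math

def fade_windows(K):
    """Triangular fade-in and fade-out windows that sum to 1 (shape is an assumption)."""
    wi = [(n + 1) / (K + 1) for n in range(K)]
    return wi, [1.0 - w for w in wi]

def prefilter_frame_ola(history, frame, p_prev, b_prev, p, b, K):
    """All-zero prefilter with the cross-fade applied to the feed-forward branch."""
    x, off = list(history) + list(frame), len(history)
    wi, wo = fade_windows(K)
    out = []
    for n in range(len(frame)):
        if n < K:
            ff = wo[n] * b_prev * x[off + n - p_prev] + wi[n] * b * x[off + n - p]
        else:
            ff = b * x[off + n - p]
        out.append(frame[n] - ff)
    return out

def postfilter_frame_ola(out_history, d_frame, p_prev, b_prev, p, b, K):
    """All-pole postfilter with the cross-fade applied to the feedback branch only."""
    y, off = list(out_history), len(out_history)
    wi, wo = fade_windows(K)
    for n in range(len(d_frame)):
        # y already holds this frame's earlier outputs, so the feedback is
        # smoothed sample by sample even when p or p_prev is smaller than K.
        if n < K:
            fb = wo[n] * b_prev * y[off + n - p_prev] + wi[n] * b * y[off + n - p]
        else:
            fb = b * y[off + n - p]
        y.append(d_frame[n] + fb)
    return y[off:]

K, L, H = 120, 240, 64
s = [math.sin(0.17 * n) + 0.3 * math.sin(0.05 * n) for n in range(H + L)]
hist, frame = s[:H], s[H:]
d = prefilter_frame_ola(hist, frame, 37, 0.5, 41, 0.7, K)
rec = postfilter_frame_ola(hist, d, 37, 0.5, 41, 0.7, K)
max_err = max(abs(a - c) for a, c in zip(rec, frame))
```

Here the postfilter starts from the same history as the prefilter input, which corresponds to the
matching initial condition required for perfect reconstruction.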
In our simulations, we found that even if we do not apply a pitch prefilter and only apply a pitch
postfilter, the resulting output audio quality is still improved noticeably compared with no postfiltering,
although the improvement is somewhat less than the case when both the pitch prefilter and pitch postfilter
are used. Therefore, it is possible to achieve audio quality enhancement “for free” without spending
additional bits to transmit the pitch period and pitch filter tap(s) from the encoder side to the decoder side.
The only price paid is some additional complexity. In this case, the pitch period and the pitch postfilter
tap(s) will need to be derived locally from the decoded audio signal.
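A postfilter-only arrangement might derive the tap at the decoder as in the sketch below. It uses the
single-tap predictor approximation of c described earlier, computed on decoded samples, and assumes the
pitch period p has already been estimated locally. The function names, b_max = 0.5, and T = 0.6 are
hypothetical, and the test signals are synthetic stand-ins for decoded audio.

```python
import math
import random

def local_tap(decoded, start, L, p, b_max=0.5, T=0.6):
    """Derive the postfilter tap from decoded samples only (no transmitted parameters)."""
    num = den = 0.0
    for n in range(start, start + L):
        num += decoded[n] * decoded[n - p]
        den += decoded[n - p] * decoded[n - p]
    c = num / den if den > 0.0 else 0.0
    if c >= 1.0:
        return b_max
    return b_max * c if c >= T else 0.0

def postfilter_frame(d_frame, out_history, p, b):
    """out(n) = d(n) + b * out(n - p); out_history holds earlier postfilter outputs."""
    y, off = list(out_history), len(out_history)
    for n in range(len(d_frame)):
        y.append(d_frame[n] + b * y[off + n - p])
    return y[off:]

# Stand-ins for decoded audio: a periodic tone and an aperiodic noise segment.
tone = [math.sin(2 * math.pi * n / 40) for n in range(240)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(240)]

b_tone = local_tap(tone, start=40, L=200, p=40)    # strongly periodic
b_noise = local_tap(noise, start=40, L=200, p=40)  # not periodic
enhanced = postfilter_frame(tone[40:], tone[:40], 40, b_tone)
```

The periodic segment receives a tap near b_max, while the aperiodic segment receives b = 0 and is left
unchanged by the postfilter.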
3. Conclusion
This document gives sufficient technical details to allow one to implement the pitch prefiltering and
postfiltering techniques for the IETF Opus mode 1 CELT codec to improve its output audio quality for
certain audio signals with a high degree of periodicity. Special care needs to be taken when implementing
an all-pole pitch filter with overlap-add. These pitch prefiltering and postfiltering techniques give
substantial audio quality improvement for many audio signals we tried that were problematic for the
CELT codec to code well; thus, these pitch filtering techniques have the potential to make CELT more
robust to different types of audio signals. The price paid is a small increase in bit-rate and codec
complexity.