Pitch Prefiltering and Postfiltering Techniques for Improving the Audio Quality of the IETF Opus Mode 1 Codec (the Original CELT Codec)

Juin-Hwey (Raymond) Chen
Broadcom Corporation
5300 California Avenue
Irvine, California 92612 USA

1. Introduction

This document describes pitch prefiltering and postfiltering techniques for improving the output audio quality of the IETF Opus mode 1 codec (the original CELT codec) when the input audio signal is periodic or nearly periodic. Such techniques are found to provide a fairly substantial improvement to the output audio quality of the CELT codec for nearly periodic audio signals produced by certain solo music instruments such as a trumpet, especially for low-delay modes with smaller codec frame sizes.

An inherent limitation of low-delay transform audio codecs employing small transform window sizes is that the frequency resolution of such transforms is insufficient to resolve the pitch harmonics of some of the nearly periodic segments of music and speech signals. As a result, such low-delay transform codecs tend to give more audible coding distortion when encoding nearly periodic music and speech signals, even though the coding performance may be fine for other non-periodic signals. Increasing the transform window size will reduce such coding distortion for nearly periodic music and speech signals, but will also have the undesirable effect of increasing the coding delay. The pitch prefiltering and postfiltering techniques introduced in this document have the potential to reduce such coding distortion significantly for nearly periodic music and speech signals without increasing the coding delay and with only a slight increase in the codec complexity and encoding bit-rate (around 2 kb/s, or less with a range coder).
This document gives a detailed technical description of such pitch prefiltering and postfiltering techniques, with the goal of enabling CELT codec developers to implement and integrate these techniques in the current C code for CELT to evaluate the technical merits of these techniques. If such evaluation confirms our preliminary findings of substantial audio quality improvement as expected, then it is proposed that such pitch prefiltering and postfiltering techniques be incorporated into Mode 1 of the IETF Opus codec to ensure a more uniform audio quality level across a wider variety of different audio signals.

2. Detailed Description of the Techniques

The pitch prefiltering is a pre-processing technique applied to the input audio signal before the CELT encoder. The pitch postfiltering is a post-processing technique applied to the output decoded audio signal of the CELT decoder. Figure 1 shows a simplified high-level block diagram of such an arrangement.

[Fig. 1 Simplified high-level block diagram of a system incorporating pitch prefiltering and postfiltering. Signal path: Input Audio Signal, Pitch Prefilter, CELT Audio Encoder, Compressed Audio Bit-stream, CELT Audio Decoder, Pitch Postfilter, Output Audio Signal.]

The pitch prefilter adaptively boosts the frequency components in the spectral valleys between pitch harmonics when the input signal exhibits significant pitch periodicity. The effect is essentially adaptive comb filtering. The prefiltered version of the input audio signal is then encoded by the CELT encoder and decoded by the CELT decoder as usual. The decoded audio signal is then passed through a corresponding pitch postfilter, which in the ideal case is an exact inverse filter of the pitch prefilter for that frame of the audio signal. Thus, the pitch postfilter attenuates the inter-harmonic spectral valleys. The combined effect of the pitch prefilter and pitch postfilter in the system of Fig. 1 is to shape the spectrum of the coding noise such that the noise spectrum in the inter-harmonic spectral valley regions is attenuated relative to that of a system without the pitch prefilter and postfilter.

To allow the pitch postfilter to be an exact inverse filter of the pitch prefilter, it is necessary to transmit the pitch period and the pitch filter tap(s) of the pitch prefilter from the encoder side to the decoder side. Fig. 2 depicts a more detailed block diagram showing pitch parameter estimation, quantization, transmission, and decoding. The main audio signal path and the main blocks in Fig. 1 are shown with thick lines.

[Fig. 2 A more detailed block diagram of a system incorporating pitch prefiltering and postfiltering. Encoder-side blocks: Pitch Parameter Estimator, Pitch Parameter Quantizer, Pitch Prefilter, CELT Audio Encoder, Bit-stream Multiplexer or Joint Coder. Decoder-side blocks: Bit-stream De-multiplexer or Joint Decoder, Pitch Parameter Decoder, CELT Audio Decoder, Pitch Postfilter. Signals: Input Audio Signal, Pitch Parameters, Quantized Pitch Parameters, Compressed Pitch Parameters Bit-stream, Compressed Audio Bit-stream, Output Audio Signal.]

It should be noted that to achieve the desired noise spectral shaping effect of attenuating the spectral valleys between pitch harmonics adequately, it is essential for the pitch parameter estimator to extract the true pitch period rather than the integer multiples of the true pitch period. Any low-complexity pitch estimator can be used here to extract the pitch period at a rate of about one pitch period every 5 ms or so, as long as it can extract the true pitch period (rather than integer multiples of it) at least the vast majority of the time. The pitch parameter estimator in Fig. 2 also needs to calculate the pitch filter tap(s) to be used by the pitch prefilter and postfilter.
In its simplest form, a single non-zero filter tap at a bulk delay of the pitch period is used for both the pitch prefilter and postfilter. Although this is simple, the downside is that the difference between peaks and valleys in the frequency responses of such single-tap pitch filters remains constant across the frequency range. Thus, for those audio signals where the pitch harmonic peaks are prominent only in low frequencies, such single-tap pitch filters may introduce more periodicity in the high-frequency region than is present in the original input audio signal. To avoid this problem, one can use multiple filter taps (for example, 2, 3, 4, or 5 taps) around the bulk delay of the pitch period. By properly choosing the filter taps, one can reduce the difference between the peaks and valleys of the frequency response as the frequency increases. It is certainly also possible to use sub-band decomposition and apply pitch prefiltering and postfiltering only to the lower band, although doing so inevitably incurs additional filtering delay and complexity.

In our preliminary simulations, even a simple single-tap pitch prefilter and postfilter can achieve significant audio quality improvement for many audio signals. The following description will concentrate on such a single-tap pitch prefilter and postfilter. The description can easily be generalized to the multi-tap filter case.

The pitch prefilter can take several possible forms. It can be an all-zero filter, an all-pole filter, or a pole-zero filter. At its simplest, the pitch prefilter can be implemented as an all-zero Finite Impulse Response (FIR) filter with a single filter tap at a bulk delay of the pitch period.
More specifically, let $b$ denote the filter tap weight and let $p$ denote the pitch period in samples, i.e., the time period by which the nearly periodic input audio signal approximately repeats its waveform. Then the relationship between the input signal sample $s(n)$ and the output signal sample $d(n)$ at time index $n$ is defined by the following difference equation.

$$d(n) = s(n) - b\, s(n-p)$$

Such an all-zero FIR filter has a transfer function of

$$H_{pre}(z) = \frac{D(z)}{S(z)} = 1 - b\, z^{-p}.$$

Typically, the tap weight is chosen to be $0 \le b \le 1$, with $b = 0$ when there is not sufficient periodicity detected in the input audio signal. The more periodic the input audio signal, the closer $b$ is to 1. The frequency response of such a filter $H_{pre}(z)$ has equally-spaced downward spikes located at the harmonic frequencies of the pitch frequency $F_s / p$ Hz, where $F_s$ is the sampling rate of the input audio signal in Hz.

The pitch postfilter is the exact inverse filter of the pitch prefilter. Denote its input signal as $\tilde{d}(n)$ and its output signal as $\tilde{s}(n)$ at time index $n$; then the input-output relationship of the pitch postfilter is given by

$$\tilde{s}(n) = \tilde{d}(n) + b\, \tilde{s}(n-p).$$

Such a pitch postfilter has a transfer function of

$$H_{post}(z) = \frac{\tilde{S}(z)}{\tilde{D}(z)} = \frac{1}{1 - b\, z^{-p}}.$$

On a logarithmic magnitude scale, this filter has a frequency response that is the mirror image of the prefilter response about the horizontal axis, with upward spikes located at the harmonic frequencies of the pitch frequency $F_s / p$ Hz.

One can use an all-pole pitch prefilter in the form of

$$H_{pre}(z) = \frac{1}{1 + a\, z^{-p}}$$

and a corresponding all-zero pitch postfilter in the form of

$$H_{post}(z) = 1 + a\, z^{-p}.$$

Furthermore, one can even use pole-zero filters for both the pitch prefilter and the pitch postfilter, in the forms of

$$H_{pre}(z) = \frac{1 - b\, z^{-p}}{1 + a\, z^{-p}} \quad \text{and} \quad H_{post}(z) = \frac{1 + a\, z^{-p}}{1 - b\, z^{-p}},$$

respectively. These last forms of pole-zero filters give more control over the shape of the frequency response around each pitch harmonic, although at a cost of more computational complexity.
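For the first filter pair above (the all-zero prefilter and its all-pole inverse), the difference equations can be sketched in Python as follows. This is only an illustrative sketch, not code from the CELT sources: the function names, the zero initial filter history, and the values of `fs`, `p`, and `b` are assumptions chosen for the example.

```python
import cmath
import math

def pitch_prefilter(s, p, b):
    """All-zero prefilter d(n) = s(n) - b*s(n-p), assuming zero history before n = 0."""
    return [s[n] - b * (s[n - p] if n >= p else 0.0) for n in range(len(s))]

def pitch_postfilter(d, p, b):
    """All-pole postfilter s~(n) = d~(n) + b*s~(n-p), assuming zero history before n = 0."""
    out = []
    for n, x in enumerate(d):
        out.append(x + b * (out[n - p] if n >= p else 0.0))
    return out

def prefilter_gain(f_hz, fs_hz, p, b):
    """Magnitude of H_pre(z) = 1 - b*z^-p on the unit circle at frequency f_hz."""
    w = 2.0 * math.pi * f_hz / fs_hz
    return abs(1.0 - b * cmath.exp(-1j * w * p))

fs, p, b = 48000.0, 100, 0.5            # assumed illustrative values
f0 = fs / p                             # pitch frequency Fs/p (480 Hz here)
gain_at_harmonic = prefilter_gain(3.0 * f0, fs, p, b)   # downward spike, about 1 - b
gain_in_valley = prefilter_gain(3.5 * f0, fs, p, b)     # boosted valley, about 1 + b

# With no codec in between, the postfilter undoes the prefilter.
signal = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(480)]
restored = pitch_postfilter(pitch_prefilter(signal, p, b), p, b)
max_error = max(abs(x - y) for x, y in zip(signal, restored))
```

The two gain values illustrate the comb shape: the prefilter magnitude drops to about $1 - b$ at a pitch harmonic and rises to about $1 + b$ midway between two harmonics, while `max_error` stays at the level of floating-point rounding.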
To keep the pitch prefilter and postfilter low in complexity and conceptually simple, the rest of this document will concentrate on the first example above, where

$$H_{pre}(z) = 1 - b\, z^{-p} \quad \text{and} \quad H_{post}(z) = \frac{1}{1 - b\, z^{-p}}.$$

It is again relatively easy to generalize or extend to other forms of pitch filters.

Note that if there is no quantization applied on the pitch prefilter output signal $d(n)$, then $\tilde{d}(n) = d(n)$, and thus from the two difference equations above, it follows that

$$\tilde{s}(n) = d(n) + b\, \tilde{s}(n-p) = s(n) - b\, s(n-p) + b\, \tilde{s}(n-p).$$

By ensuring that the set of $\{p, b\}$ is identical in the pitch prefilter and the pitch postfilter, and by ensuring that the signal arrays $\{s(n)\}$ and $\{\tilde{s}(n)\}$ start with the same initial condition, the second and the third terms on the right side of the last equal sign in the equation above will exactly cancel each other out, resulting in $\tilde{s}(n) = s(n)$, that is, perfect reconstruction.

Of course, with the quantization effect introduced by the CELT encoder and CELT decoder, such perfect reconstruction property is lost. However, if the quantization error is relatively small, i.e., the signal-to-coding-noise ratio is reasonably high, then the output signal of the pitch postfilter will still be reasonably close to the input signal of the pitch prefilter.

For the pitch filter tap $b$, it is reasonable to make it proportional to a parameter that measures the correlation between the adjacent pitch cycle waveforms, such as the cosine of the angle between the vector of the current frame of audio signal samples and the vector of the audio samples that are one pitch period earlier.
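As a rough illustration of this idea, the Python sketch below computes the cosine of the angle between the current frame and the samples one pitch period earlier, then maps it to a tap weight. The function names are hypothetical, and the cap `b_max = 0.7` and threshold `threshold = 0.6` are assumed values picked for the example; the mapping (zero below the threshold, proportional to the correlation above it, capped at `b_max`) is one simple choice under those assumptions.

```python
import math

def normalized_correlation(cur, prev):
    """Cosine of the angle between the current frame and the frame one pitch period earlier."""
    num = sum(x * y for x, y in zip(cur, prev))
    den = math.sqrt(sum(x * x for x in cur) * sum(y * y for y in prev))
    return num / den if den > 0.0 else 0.0

def pitch_tap(c, b_max=0.7, threshold=0.6):
    """Map correlation c to a tap: 0 at or below the threshold, else b_max*c capped at b_max."""
    if c <= threshold:
        return 0.0
    return b_max * min(c, 1.0)

# A perfectly periodic test signal with a pitch period of p samples.
p, frame_len = 100, 240
s = [math.sin(2.0 * math.pi * n / p) for n in range(p + frame_len)]
cur, prev = s[p:], s[:frame_len]        # current frame and the samples p earlier
c_periodic = normalized_correlation(cur, prev)
b_periodic = pitch_tap(c_periodic)

# Two orthogonal vectors: no correlation, so the filter is switched off.
b_unrelated = pitch_tap(normalized_correlation([1.0, -1.0, 1.0, -1.0],
                                               [1.0, 1.0, 1.0, 1.0]))
```

For the periodic signal the correlation is essentially 1, so the tap sits at the cap; for the uncorrelated pair the tap is 0 and the prefilter and postfilter reduce to a pass-through.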
Specifically, let $L$ be the length of the frame and let time index $n = 1, 2, \ldots, L$ correspond to the current frame. Then the normalized correlation $c$, which is the cosine of the angle described above, is calculated as

$$c = \frac{\sum_{n=1}^{L} s(n)\, s(n-p)}{\sqrt{\sum_{n=1}^{L} s^2(n) \sum_{n=1}^{L} s^2(n-p)}}.$$

To reduce the complexity, the normalized correlation may be approximated by the optimal tap weight of the single-tap pitch predictor, calculated as

$$c = \frac{\sum_{n=1}^{L} s(n)\, s(n-p)}{\sum_{n=1}^{L} s^2(n-p)}.$$

The pitch prefilter and postfilter tap $b$ can then be obtained as

$$b = \begin{cases} b_{max} & \text{if } c \ge 1 \\ b_{max}\, c & \text{if } T < c < 1 \\ 0 & \text{if } c \le T \end{cases}$$

where a reasonable value of $b_{max}$ may be in the range of 0.4 to 0.9, while a reasonable value of the threshold $T$ may be around 0.6, although a threshold of 0 will work, too. Note that if the frame size is very small (such as 2.5 ms), the summation over such a short period may not give a reliable result. In this case, it may be beneficial to use a longer summation window by including some of the signal samples in previous frames.

One practical problem when applying such a pitch prefilter and pitch postfilter is that when the pitch period $p$ and the filter tap $b$ change at the frame boundary, there is often a waveform discontinuity in the output signals of such filters. This causes an audible click and can cause undesirable effects in the audio encoder. This problem can be avoided by applying an overlap-add method as described in U.S. Patent No. 7,353,168. Specifically, at the beginning of the current frame and with the filter memory set at the value left after filtering the last sample of the last frame, two filtering operations are performed for the first $K$ samples of the current frame, where the value of $K$ is usually chosen to correspond to 2.5 ms or longer. The first filtering operation is performed with the filter parameters (the pitch period $p$ and pitch filter tap $b$) of the last frame, and the second filtering operation is performed with the filter parameters of the current frame.
Note that both filtering operations should start with the same filter memory that was left after filtering the last sample of the last frame. A fade-out window (which can be a downward-sloping triangular window) of $K$ samples is applied to the output signal of the first filtering operation, while a fade-in window (which can be an upward-sloping triangular window) of $K$ samples is applied to the output signal of the second filtering operation. The sum of the fade-in and fade-out windows is unity at every one of the $K$ samples. The two windowed filter output signals are added together and used as the final filter output signal. It is assumed that $K \le L$. If $K < L$, then from the $(K+1)$-th sample to the last ($L$-th) sample of the current frame, only one filtering operation is performed using the filter parameters of the current frame. Such an overlap-add filtering method ensures a smooth waveform transition and eliminates waveform discontinuities at frame boundaries.

Note that with such an overlap-add filtering approach, the perfect reconstruction property for the non-overlap-add version of the simple pitch prefilter and pitch postfilter as described earlier no longer holds true. In fact, it can be shown that to maintain the perfect reconstruction property, the parallel filtering and overlap-add of the two filtered output signals should be performed not for the entire all-pole pitch postfilter $H_{post}(z) = \frac{1}{1 - b\, z^{-p}}$, but only for the all-zero FIR filter $b\, z^{-p}$ in the feedback branch of the all-pole filter $H_{post}(z)$. For the pitch prefilter $H_{pre}(z) = 1 - b\, z^{-p}$, applying the overlap-add filtering approach to the entire $H_{pre}(z)$ filter is mathematically equivalent to applying the overlap-add filtering approach only to the all-zero FIR filter $b\, z^{-p}$ in the feed-forward branch of the all-zero filter $H_{pre}(z)$.

The all-zero pitch prefilter with overlap-add is relatively straightforward to implement.
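A minimal Python sketch of the all-zero prefilter with overlap-add is given below, applying the cross-fade to the feed-forward term $b\, z^{-p}$. The function name, the argument layout, and the particular triangular window $w_i = (i+1)/(K+1)$ with $w_o = 1 - w_i$ are assumptions made for illustration; any pair of fade-in and fade-out windows that sums to unity would fit the description.

```python
def prefilter_frame_ola(s, start, frame_len, prev_params, cur_params, k):
    """Prefilter one frame of s that begins at index `start`.

    prev_params and cur_params are (pitch period, tap) pairs for the previous
    and current frame. The list s must hold at least max(pitch period) samples
    of input history before `start`.
    """
    p0, b0 = prev_params
    p, b = cur_params
    if start < max(p0, p):
        raise ValueError("not enough signal history before the frame")
    out = []
    for i in range(frame_len):
        n = start + i
        if i < k:
            w_in = (i + 1) / (k + 1)    # assumed upward-sloping triangular fade-in
            w_out = 1.0 - w_in          # fade-out; the two windows sum to unity
            tap_term = w_out * b0 * s[n - p0] + w_in * b * s[n - p]
        else:
            tap_term = b * s[n - p]     # plain filtering with the current parameters
        out.append(s[n] - tap_term)
    return out

# Assumed example: 240-sample frames, overlap-add over the first 120 samples.
s = [float((7 * n) % 13) for n in range(600)]
frame = prefilter_frame_ola(s, 240, 240, (90, 0.4), (100, 0.8), 120)
steady = prefilter_frame_ola(s, 240, 240, (100, 0.8), (100, 0.8), 120)
```

When the parameters do not change between frames (`steady`), the cross-fade has no effect and the result matches the plain difference equation. Because the prefilter only reads past input samples, the two filtering operations never interact, which is why this side is the easy one.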
On the other hand, due to the recursive nature of all-pole filtering, the all-pole pitch postfilter needs to be handled with care, especially when the pitch period is smaller than the overlap-add length $K$. The most important thing to note is that the two filtering operations cannot be implemented independently of each other for the entire $K$ samples and then windowed and overlap-added together. This is because the waveform discontinuity at the beginning of the current frame resulting from such independent filtering will be repeated before the $K$-sample overlap-add period is over, and therefore the overlap-add operation will not be able to smooth out such repeated waveform discontinuities after the beginning of the current frame.

The correct way to implement the overlap-add approach is to perform the overlap-add of the two filtering output signals sample by sample. Then, the waveform discontinuity at the beginning of the frame is already smoothed out by the overlap-add operation by the time the filtering operation reaches one pitch period into the frame, so there will not be a repeated waveform discontinuity there.

Specifically, let the time index $n$ for the current frame be from 1 to $L$, and let $w_i(n)$ and $w_o(n)$ be the fade-in window sample and fade-out window sample at time index $n$, respectively. In addition, let $p_0$ and $b_0$ be the pitch period and the pitch postfilter tap of the previous frame, respectively. Then, the all-pole pitch postfiltering with overlap-add should be performed sample by sample for the first $K$ samples of the current frame by the following pseudo-code.

For n from 1 to K, calculate the pitch postfilter output sample as

$$\tilde{s}(n) = \tilde{d}(n) + w_o(n)\, b_0\, \tilde{s}(n - p_0) + w_i(n)\, b\, \tilde{s}(n - p)$$

end

After filtering the first $K$ samples, if $L > K$, then the filtering from the $(K+1)$-th sample to the $L$-th sample is just simple all-pole filtering using the difference equation

$$\tilde{s}(n) = \tilde{d}(n) + b\, \tilde{s}(n-p).$$
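The pseudo-code above can be turned into a short runnable Python sketch. The version below is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, the triangular window $w_i = (i+1)/(K+1)$ with $w_o = 1 - w_i$ is an assumed choice, and a matching prefilter is included only so that the round trip can be checked without a codec in between. The pitch period of the current frame (40 samples) is deliberately smaller than $K$ (120 samples), which is the case that needs care.

```python
import math

def crossfaded_tap_term(x, n, i, k, prev_params, cur_params):
    """Cross-faded b*z^-p branch at absolute index n (offset i into the frame)."""
    p0, b0 = prev_params
    p, b = cur_params
    if i < k:
        w_in = (i + 1) / (k + 1)        # assumed triangular fade-in; fade-out is 1 - w_in
        return (1.0 - w_in) * b0 * x[n - p0] + w_in * b * x[n - p]
    return b * x[n - p]

def prefilter_frame(s, start, frame_len, k, prev_params, cur_params):
    """All-zero prefilter: the cross-faded branch reads past input samples."""
    return [s[start + i] - crossfaded_tap_term(s, start + i, i, k, prev_params, cur_params)
            for i in range(frame_len)]

def postfilter_frame(d, history, k, prev_params, cur_params):
    """All-pole postfilter, overlap-added sample by sample.

    `history` holds earlier postfilter outputs and is extended in place, so
    each new sample can feed back on already cross-faded outputs of this frame.
    """
    start = len(history)
    if start < max(prev_params[0], cur_params[0]):
        raise ValueError("not enough output history before the frame")
    for i, x in enumerate(d):
        fb = crossfaded_tap_term(history, start + i, i, k, prev_params, cur_params)
        history.append(x + fb)
    return history[start:]

# Round trip without a codec: same parameters and same initial condition.
s = [math.sin(0.05 * n) + 0.3 * math.sin(0.31 * n) for n in range(480)]
prev_params, cur_params, k = (50, 0.4), (40, 0.8), 120
d = prefilter_frame(s, 240, 240, k, prev_params, cur_params)
restored = postfilter_frame(d, list(s[:240]), k, prev_params, cur_params)
max_error = max(abs(x - y) for x, y in zip(s[240:], restored))
```

Because the cross-fade is applied only to the feedback branch and the output buffer is updated one sample at a time, `max_error` stays at rounding level, consistent with the perfect reconstruction property discussed earlier.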
In our simulations, $K$ is chosen to correspond to 2.5 ms, or 120 samples at a 48 kHz sampling rate. By doing this, we do not need to change the overlap-add length $K$ for all four possible frame sizes of the Opus mode 1 CELT codec, since the smallest frame size is 2.5 ms.

In our simulations, we found that even if we do not apply a pitch prefilter and only apply a pitch postfilter, the resulting output audio quality is still improved noticeably compared with no postfiltering, although the improvement is somewhat less than in the case when both the pitch prefilter and pitch postfilter are used. Therefore, it is possible to achieve audio quality enhancement “for free” without spending additional bits to transmit the pitch period and pitch filter tap(s) from the encoder side to the decoder side. The only price paid is some additional complexity. In this case, the pitch period and the pitch postfilter tap(s) will need to be derived locally from the decoded audio signal.

3. Conclusion

This document gives sufficient technical details to allow one to implement the pitch prefiltering and postfiltering techniques for the IETF Opus mode 1 CELT codec to improve its output audio quality for certain audio signals with a high degree of periodicity. Special care needs to be taken when implementing an all-pole pitch filter with overlap-add. These pitch prefiltering and postfiltering techniques give substantial audio quality improvement for many of the audio signals we tried that were problematic for the CELT codec to code well. These pitch filtering techniques therefore have the potential to make CELT more robust to different types of audio signals. The price paid is a small increase in bit-rate and codec complexity.