EE5359 MULTIMEDIA PROCESSING INTERIM REPORT
Study and implementation of G.719 audio codec and performance analysis of G.719 with AAC (advanced audio coding) and HE-AAC (high efficiency advanced audio coding) audio codecs.
Student: Yashas Prakash
Student ID: 1000803680
Instructor: Dr. K. R. Rao
E-mail: yashas.prakash@mavs.uta.edu
Date: 04-25-2012

List of acronyms
AAC - Advanced audio coding
ATSC - Advanced television systems committee
AES - Audio Engineering Society
EBU - European broadcasting union
FLVQ - Fast lattice vector quantization
HE-AAC - High efficiency advanced audio coding
HRQ - Higher rate lattice vector quantization
IMDCT - Inverse modified discrete cosine transform
ISO - International organization for standardization
ITU - International telecommunication union
JAES - Journal of the Audio Engineering Society
LC - Low complexity
LRQ - Lower rate lattice vector quantization
LFE - Low frequency enhancement
LTP - Long term prediction
MDCT - Modified discrete cosine transform
MPEG - Moving picture experts group
QMF - Quadrature mirror filter
SBR - Spectral band replication
SMR - Signal-to-mask ratio
SRS - Sample rate scalable
TDA - Time-domain aliasing
WMOPS - Weighted million operations per second

SUBJECTIVE PERFORMANCE OF G.719
Subjective tests for the ITU-T G.719 Optimization/Characterization phase were performed from mid-February through early April 2008 by independent listening laboratories in American English. According to a test plan designed by ITU-T Q7/SG12 experts, the joint candidate codec was evaluated in two experiments:
Experiment 1: Speech (clean, reverberant, and noisy)
Experiment 2: Mixed content and music
Mixed content items are representative of advertisements, film trailers, news with jingles and music with announcements, and contain speech, music, and noise. Each experiment used the "triple stimulus/hidden reference/double blind" test method described in ITU-R Recommendation BS.1116-1. A standard MPEG audio codec, LAME MP3 version 3.97 as found on the LAME website, was used as the reference codec in the subjective tests. The ITU-T requirement was that the G.719 candidate codec at 32, 48, and 64 kbps be proven "Not Worse Than" the reference codec at 40, 56, and 64 kbps, respectively, with a 95% statistical confidence level. In addition, the G.719 candidate codec at 64 kbps was also tested against the G.722.1C codec at 48 kbps for Experiment 2.
The subjective test results for the G.719 codec are shown in Figures 6-8. Statistical analysis of the results showed that the G.719 codec met all performance requirements specified for the subjective Optimization/Characterization test. For Experiment 1 the G.719 codec was better than the reference codec at all bit rates. For Experiment 2 the G.719 codec was better than the reference codec at the lowest bit rate for all items, and at the two other bit rates for most of the items.
An additional subjective listening test for the G.719 codec was conducted later to evaluate the quality of the codec at rates higher than those described in the ITU-T test plan. Because the quality expectation of the codec at these high rates is high, a pre-selection of critical items, for which the quality at the lower bit rate range was most degraded, was conducted prior to testing. The test results are shown in Figure 8. They show that transparency was reached for critical material at 128 kbps.

Figure 6: Subjective test results, experiment 1 [7]
Figure 7: Subjective test results, experiment 2 [7]
Figure 8: Additional subjective tests [7]
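As an illustration of what a "Not Worse Than" criterion at 95% confidence can look like in practice, the sketch below runs a one-sided, paired non-inferiority check on listening scores. This is not the official ITU-T analysis: the scores are invented, and the margin and exact statistical procedure are assumptions made only for this example.

```python
# Illustration only: NOT the official ITU-T analysis. Scores, margin and the
# chosen test are assumptions made for this sketch.
import numpy as np
from scipy import stats

def not_worse_than(candidate, reference, margin=0.0, alpha=0.05):
    """One-sided, paired non-inferiority check: 'not worse than' is accepted if
    the lower (1 - alpha) confidence bound on mean(candidate - reference)
    stays above -margin."""
    diff = np.asarray(candidate, float) - np.asarray(reference, float)
    n = diff.size
    mean = diff.mean()
    sem = diff.std(ddof=1) / np.sqrt(n)
    lower = mean - stats.t.ppf(1.0 - alpha, df=n - 1) * sem
    return lower >= -margin, mean, lower

# hypothetical per-item mean scores for a candidate and a reference codec
cand = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
ref  = [4.0, 3.8, 4.2, 4.1, 4.0, 3.9, 4.0, 3.9]
ok, mean_diff, lower_bound = not_worse_than(cand, ref, margin=0.1)
print(f"mean difference {mean_diff:+.3f}, 95% lower bound {lower_bound:+.3f}, "
      f"'not worse than' accepted: {ok}")
```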
Algorithmic efficiency
The G.719 codec has a low complexity and a low algorithmic delay [1]. The delay is determined by the frame size of 20 milliseconds and the look-ahead of one frame used to form the transform blocks. Hence, the algorithmic delay of the G.719 codec is 40 milliseconds. The algorithmic delays of comparable codecs such as 3GPP eAAC+ [14] and 3GPP AMR-WB+ are significantly higher. For AMR-WB+ the algorithmic delay for mono coding is between 77.5 and 227.6 ms, depending on the internal sampling frequency. For eAAC+ the algorithmic delay is 323 ms for mono coding at 32 kbps and a 48 kHz sampling rate.
In Table 1 the average and worst-case complexity of G.719 is expressed in weighted million operations per second (WMOPS). The figures are based on complexity reports using the basic operators of the ITU-T STL2005 Software Tool Library v2.2 [7]. For comparison, the complexity of three comparable audio codecs, eAAC+, AMR-WB+ and ITU-T G.722.1C [8], the low-complexity super-wideband (14 kHz) codec that G.719 was developed from, is shown in Table 2. The memory requirements of G.719 are presented in Table 3. The delay and complexity measures show that the G.719 codec is very efficient in terms of complexity and algorithmic delay, especially when compared to eAAC+ and AMR-WB+.

Frame buffering and windowing with overlap
A time-limited block of the input audio signal can be seen as windowed with a rectangular window. The windowing, which is a multiplication in the time domain, becomes a convolution in the frequency domain and results in a large frequency spread for this window. In addition, the sampling theorem states that the maximum frequency that can be correctly represented in discrete time is the Nyquist frequency, i.e. half of the sampling rate; otherwise aliasing occurs. For example, in a signal sampled at 48 kHz a frequency of 25 kHz, i.e. 1 kHz above the Nyquist frequency of 24 kHz, will be analyzed as 23 kHz due to the aliasing. Due to the large frequency spread of the rectangular window, the frequency analysis can be contaminated by the aliasing. In order to reduce the frequency spread and suppress the aliasing effect, windows without sharp discontinuities can be used. Two examples are the sine and the Hann windows, defined in [17], which compared to the rectangular window have a larger attenuation of the side lobes but also a wider main lobe. This is illustrated in Figure 9, where the shapes of the windows and the corresponding frequency spectra can be observed. Consequently, there has to be a trade-off between the possible aliasing and the frequency resolution.

Figure 9: Three window functions and their corresponding frequency spectra. The windows are 1920 samples long at a sampling rate of 48 kHz [17]

In the synthesis of the analyzed and encoded blocks of a processed audio signal, the window effects have to be cancelled. For example, the inverse window function could be applied to the coded time-domain blocks, but artefacts are then likely to be audible near the block edges due to discontinuities and amplification of the coding errors. In order to reduce the block artefacts, overlap-add techniques are commonly used [17].
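To make the main-lobe/side-lobe trade-off concrete, the following Python sketch compares the rectangular, sine and Hann windows for a 1920-sample block at 48 kHz, as in Figure 9. It is an illustration only: the window definitions follow the usual textbook forms, and the reported figures are rough estimates from a zero-padded FFT.

```python
import numpy as np

FS = 48_000
N2 = 1920                       # analysis block of 2N samples (two 20 ms frames)
NFFT = 1 << 16                  # dense zero-padded spectrum
n = np.arange(N2)

windows = {
    "rectangular": np.ones(N2),
    "sine":        np.sin(np.pi * (n + 0.5) / N2),
    "hann":        0.5 - 0.5 * np.cos(2 * np.pi * n / N2),
}

def spectrum_db(w):
    """Zero-padded magnitude spectrum in dB, normalized to its peak."""
    W = np.abs(np.fft.rfft(w, NFFT))
    return 20 * np.log10(W / W.max() + 1e-12)

for name, w in windows.items():
    S = spectrum_db(w)
    # walk down the main lobe to its first null, then take the largest side lobe
    k = 1
    while k + 1 < len(S) and S[k + 1] <= S[k]:
        k += 1
    main_lobe_hz = k * FS / NFFT            # approximate main-lobe half-width
    side_lobe_db = S[k + 1:].max()
    print(f"{name:12s} main-lobe half-width ~ {main_lobe_hz:5.1f} Hz, "
          f"highest side lobe ~ {side_lobe_db:6.1f} dB")
```

For these parameters the rectangular window gives the narrowest main lobe (roughly 25 Hz half-width) but side lobes only about 13 dB down, while the sine and Hann windows trade a wider main lobe for roughly 23 dB and 31 dB of side-lobe attenuation, respectively.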
In ITU-T G.719 the blocks of two consecutive frames are windowed with a sine window of length 2N = 1920 samples that is defined by

w(n) = sin( π (n + 1/2) / (2N) ), n = 0, 1, ..., 2N − 1.

The signals are processed with an overlap of 50% between consecutive blocks. The windowed signal of each block is given by

x_w(n) = w(n) x(n), n = 0, 1, ..., 2N − 1,

where x(n) denotes the 2N buffered input samples of the block.

Figure 10: G.719 buffering, windowing and transformation of an audio signal [17]

Figure 10 shows the buffering and windowing with the overlap of N = 960 samples between the blocks of length 2N. The blocks are time-domain aliased (TDA) into sequences of length N that are transformed using the type-IV discrete cosine transform (DCT-IV). The information from the transient detector is not used in the buffering, the windowing or the TDA, but only for the DCT-IV, which implies that there is a common buffering and windowing for the stationary and transient modes. The combination of the TDA and the DCT-IV is the MDCT, which is further presented in the following section.

Modified Discrete Cosine Transform
The MDCT is used in G.719 to transform the buffered and windowed signal blocks into a frequency representation. The transform comprises time-domain aliasing (TDA), which means that the signal blocks of 2N = 1920 samples are folded (aliased) into blocks of N = 960 samples. The time-domain aliased signal of each block is then represented by N coefficients of cosine basis functions. Due to the TDA it is not possible to reconstruct the time-domain signal from an individual MDCT spectrum, but the framework of overlapped signal blocks enables perfect reconstruction. The 50% overlap and the properties of the windows are essential for the reconstruction, where the TDA is cancelled by overlap-adding consecutive inverse-transformed MDCT spectra. The conditions for time-domain aliasing cancellation (TDAC) ensure perfect reconstruction with the overlap-add technique.
The signal blocks are overlapped in order to avoid block artefacts. The number of frequency coefficients per time unit is thereby increased in comparison to transformation of non-overlapped blocks, which would imply an increased bit rate for coding the spectra. However, due to the TDA in the MDCT, the number of coefficients, and thereby the bit rate, is reduced again by the corresponding factor of the overlap. This, in combination with the real-valued frequency coefficients, makes the MDCT competitive for audio coding, giving a compact representation of the signal. The MDCT spectrum X_MDCT(k) of the windowed signal x_w(n) is by definition obtained as

X_MDCT(k) = Σ_{n=0}^{2N−1} x_w(n) cos( (π/N)(n + 1/2 + N/2)(k + 1/2) ), k = 0, 1, ..., N − 1,

up to the normalization factor used in the standard.

Transient mode transformation
In the transient mode of G.719 the time-aliased signal block is reversed in time and divided into four sub-frames. The reversal re-creates the temporal coherence of the input signal that was destroyed by the TDA. The first and the last sub-frames are windowed by half sine windows with a quarter of zero padding, while the second and third sub-frames are windowed with the ordinary sine window, as illustrated in Figure 11. The overlap between the windowed sub-frames is 50% and each segment is MDCT transformed, i.e. time aliased and DCT-IV transformed, which results in sub-spectra of length N/4. Thus the total length of the four sub-spectra is N frequency coefficients, i.e. the transform lengths are equal in the stationary and the transient modes of G.719.

Figure 11: Windowing of sub-frames in the transient mode [1]
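The following Python sketch illustrates the stationary-mode analysis/synthesis chain described above: sine windowing, MDCT, IMDCT and overlap-add reconstruction. It uses the direct (textbook) form of the MDCT rather than the TDA-plus-DCT-IV factorization used in G.719, and the scaling is chosen only so that the round trip reconstructs the signal; the frame length and window match the stationary mode (N = 960 at 48 kHz).

```python
import numpy as np

N = 960                                     # 20 ms frame at 48 kHz
n = np.arange(2 * N)
w = np.sin(np.pi * (n + 0.5) / (2 * N))     # sine window over the 2N-sample block
# cosine basis shared by the forward and inverse transforms (direct MDCT form)
B = np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, np.arange(N) + 0.5))

def mdct(block):
    """MDCT of one 2N-sample block -> N coefficients (analysis window applied here)."""
    return (w * block) @ B

def imdct(coeffs):
    """IMDCT of N coefficients -> windowed 2N-sample block (still time-aliased)."""
    return (2.0 / N) * w * (B @ coeffs)

# 50% overlapped analysis/synthesis: the time-domain aliasing of each block is
# cancelled by the overlap-add of neighbouring blocks (TDAC).
rng = np.random.default_rng(0)
x = rng.standard_normal(10 * N)
y = np.zeros_like(x)
for start in range(0, len(x) - 2 * N + 1, N):
    y[start:start + 2 * N] += imdct(mdct(x[start:start + 2 * N]))

# everything except the first and last frame is reconstructed
print("max reconstruction error:", np.max(np.abs(x[N:-N] - y[N:-N])))
```

Running the sketch prints a reconstruction error at the level of numerical round-off, which reflects the TDAC property of the sine window under 50% overlap.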
Perceptual coding
In G.719 the MDCT spectra are perceptually encoded based on a psycho-acoustical model. The model describes the human hearing system and is used in order to introduce coding errors that are not audible. In Figure 12 the principle of the perceptual coder is illustrated. The MDCT spectrum of the transformed windowed time-domain signal is split into 44 sub-vectors that approximate the frequency resolution of the ear by increasing sub-vector lengths with increasing frequency. The sub-vector spectra are quantized and coded based on the sub-vector energies, or norms, that are weighted according to the psycho-acoustical model. The coding procedure is similar for the two time-resolution modes in G.719, but for the transient mode the spectral coefficients of the four sub-frames are interleaved before coding to preserve the coherence of the signal in the time domain.

Figure 12: Block diagram of the perceptual encoder based on MDCT domain masking [17]

The norm of each sub-vector is estimated and quantized with a uniform logarithmic scalar quantizer with 40 levels spaced 3 dB apart. The MDCT spectra are normalized with the quantized norms in order to reduce the amount of information needed to describe the spectra. The quantized norms are both differentially and Huffman encoded [3] before they are transmitted to the decoder, where they are used to de-normalize the decoded MDCT spectra. In the next step of encoding, bits are iteratively allocated to each sub-vector as a function of the quantized sub-vector norms. The goal of the bit allocation is to distribute the available bits in a way that maximizes the subjective quality at a given data rate, i.e. a given number of bits. Therefore the quantized norms of the sub-vectors are perceptually weighted to account for psycho-acoustical masking and threshold effects. For each iteration in the allocation of bits, the sub-vector with the largest weighted norm is found and one bit is assigned to each MDCT coefficient in that sub-vector. The corresponding norm is then decreased by 6 dB and the procedure repeats until all available bits are assigned (a simplified sketch of this allocation loop is given below). When a sub-vector has been assigned 9 bits per coefficient, its norm is set to minus infinity so that no more bits are allocated to that sub-vector. Based on the allocated bits, the normalized sub-vectors are lattice vector quantized and Huffman coded. More information about the vector quantization can be found in [1] for G.719 specifically.
In the stationary mode, the level of the non-coded spectral coefficients in the sub-vectors assigned zero bits is estimated, quantized and included in the bit stream for frequencies below the so-called transition frequency. The quantization indices of the norms, the encoded sub-vector spectra and the estimated noise level form the encoded bit stream. In addition, information about, for example, the coding mode (stationary or transient) and whether Huffman coding is used is added to the bit stream that is transmitted to the G.719 decoder.

Implementation of the encoder:

Implementation of the decoder:

Comparison of the decoded signal with the original raw file at the same bit rate.

Interim report status: The G.719 encoder and decoder were successfully implemented, and complexity with respect to performance was analyzed at bit rates of 32, 48 and 64 kbps.
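As a complement to the perceptual coding description above, the following Python sketch implements a simplified version of the greedy bit-allocation loop: one bit per coefficient is given to the sub-vector with the largest weighted norm, that norm is lowered by 6 dB, and a sub-vector is frozen once it reaches 9 bits per coefficient. The sub-vector lengths, norms, bit budget and stopping rule below are made up for illustration; G.719's actual 44-sub-vector layout and perceptual weighting follow the standard and its reference code.

```python
# Simplified, hypothetical sketch of the greedy bit-allocation loop; not the
# G.719 reference implementation.
import numpy as np

def allocate_bits(weighted_norms_db, subvector_lengths, bit_budget,
                  step_db=6.0, max_bits_per_coeff=9):
    """Repeatedly give one bit per coefficient to the sub-vector with the
    largest weighted norm, then lower that norm by 6 dB."""
    norms = np.asarray(weighted_norms_db, dtype=float).copy()
    lengths = np.asarray(subvector_lengths)
    bits = np.zeros(len(norms), dtype=int)      # bits per coefficient
    remaining = bit_budget
    while True:
        i = int(np.argmax(norms))
        cost = int(lengths[i])                  # one bit for every coefficient
        if norms[i] == -np.inf or cost > remaining:
            break                               # simplified stopping rule
        bits[i] += 1
        remaining -= cost
        norms[i] -= step_db                     # 6 dB per assigned bit
        if bits[i] >= max_bits_per_coeff:
            norms[i] = -np.inf                  # stop feeding this sub-vector
    return bits, remaining

# hypothetical example: 8 sub-vectors instead of 44, arbitrary weighted norms
lengths = [8, 8, 8, 16, 16, 24, 32, 32]
norms_db = [30.0, 27.0, 24.0, 22.0, 18.0, 15.0, 12.0, 9.0]
bits, left = allocate_bits(norms_db, lengths, bit_budget=400)
print("bits per coefficient:", bits, "unused bits:", left)
```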
References:
[1] M. Xie, P. Chu, A. Taleb and M. Briand, "A new low-complexity full band (20 kHz) audio coding standard for high-quality conversational applications", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 265-268, Oct. 2009.
[2] A. Taleb and S. Karapetkov, "The first ITU-T standard for high-quality conversational fullband audio coding", IEEE Communications Magazine, vol. 47, pp. 124-130, Oct. 2009.
[3] J. Wang, B. Chen, H. He, S. Zhao and J. Kuang, "An adaptive window switching method for ITU-T G.719 transient coding in TDA domain", IEEE International Conference on Wireless, Mobile and Multimedia Networks, pp. 298-301, Jan. 2011.
[4] J. Wang, N. Ning, X. Ji and J. Kuang, "Norm adjustment with segmental weighted SMR for ITU-T G.719 audio codec", IEEE International Conference on Multimedia and Signal Processing, vol. 2, pp. 282-285, May 2011.
[5] K. Brandenburg and M. Bosi, "Overview of MPEG audio: current and future standards for low-bit-rate audio coding", JAES, vol. 45, pp. 4-21, Jan./Feb. 1997.
[6] ATSC A/52B, Digital Audio Compression Standard: http://www.atsc.org/cms/standards/a_52b.pdf
[7] F. Henn, R. Böhm and S. Meltzer, "Spectral band replication technology and its application in broadcasting", International Broadcasting Convention, 2003.
[8] M. Dietz and S. Meltzer, "CT-AACPlus - a state of the art audio coding scheme", Coding Technologies, EBU Technical Review, July 2002.
[9] ISO/IEC IS 13818-7, "Information technology - Generic coding of moving pictures and associated audio information - Part 7: Advanced audio coding (AAC)", 1997.
[10] M. Bosi and R. E. Goldberg, "Introduction to digital audio coding standards", Norwell, MA: Kluwer, 2003.
[11] H. S. Malvar, "Signal processing with lapped transforms", Norwood, MA: Artech House, 1992.
[12] D. Meares, K. Watanabe and E. Scheirer, "Report on the MPEG-2 AAC stereo verification tests", ISO/IEC JTC1/SC29/WG11, Feb. 1998.
[13] SUPER (c) v.2012.build.50: a simplified universal player, encoder and renderer; a graphical user interface to FFmpeg, MEncoder, MPlayer, x264, Musepack, Shorten audio, True Audio, WavPack, the libavcodec library and the Theora/Vorbis real producer plugin: www.erightsoft.com
[14] T. Ogunfunmi and M. Narasimha, "Principles of speech coding", Boca Raton, FL: CRC Press, 2010.
[15] P. Ekstrand, "Bandwidth extension of audio signals by spectral band replication", IEEE Workshop on Model Based Processing and Coding of Audio, pp. 53-58, Nov. 2002.
[16] T. Johnson, "Stereo coding for ITU-T G.719 codec", Uppsala University, May 2011.