EE5359 MULTIMEDIA PROCESSING PROJECT PROPOSAL

Title: Study and implementation of the G.719 audio codec and performance analysis of G.719 against the AAC (advanced audio coding) and HE-AAC (high efficiency advanced audio coding) audio codecs.

Student: Yashas Prakash
Student ID: 1000803680
E-mail: yashas.prakash@mavs.uta.edu
Instructor: Dr. K. R. Rao
Date: 03-27-2012

Abstract:

This project describes a low-complexity full-band (20 kHz) audio coding algorithm which has recently been standardized by the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) as Recommendation G.719 [1]. In this project, G.719-encoded audio is stored in the 3GP container format. The algorithm is designed to provide 20 Hz - 20 kHz audio bandwidth at a 48 kHz sampling rate, operating at 32 - 128 kbit/s. The codec features very high audio quality and low computational complexity and is suitable for applications such as videoconferencing, teleconferencing, and streaming audio over the Internet [1]. This technology, while achieving exceptionally low complexity and a small memory footprint, delivers high full-band audio quality, making the codec a strong choice for communication devices of all kinds, from large telepresence systems to small low-power devices for mobile communication [2]. A comparison with the widely used AAC and HE-AAC [9] audio codecs is carried out in terms of performance, reliability, memory requirements and applications. A Windows Media Audio file is encoded to the 3GP, AAC and HE-AAC formats using the SUPER (c) [13] software, and the different coding schemes are tested for performance, encoding and decoding durations, memory requirements and compression ratios.
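The comparison metrics named in the abstract (compression ratio and average bitrate) reduce to simple arithmetic on file sizes. The sketch below uses hypothetical helper names and example numbers of my own, not output from any cited tool:

```python
def compression_ratio(original_bytes, encoded_bytes):
    """How many times smaller the encoded file is than the PCM original."""
    return original_bytes / encoded_bytes

def bitrate_kbps(encoded_bytes, duration_s):
    """Average bitrate of the encoded file in kbit/s."""
    return encoded_bytes * 8 / duration_s / 1000

# 10 s of 48 kHz, 16-bit mono PCM:
pcm_bytes = 48000 * 2 * 10            # 960,000 bytes
coded_bytes = 64000 // 8 * 10         # a hypothetical 64 kbit/s encoding
print(compression_ratio(pcm_bytes, coded_bytes))   # 12.0
print(bitrate_kbps(coded_bytes, 10))               # 64.0
```

Encoding and decoding durations can be measured the same way by timing the encoder invocation and dividing the output size by the clip duration.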
List of acronyms

AAC - Advanced audio coding
AES - Audio Engineering Society
ATSC - Advanced Television Systems Committee
EBU - European Broadcasting Union
FLVQ - Fast lattice vector quantization
HE-AAC - High efficiency advanced audio coding
HRQ - Higher rate lattice vector quantization
IMDCT - Inverse modified discrete cosine transform
ISO - International Organization for Standardization
ITU - International Telecommunication Union
JAES - Journal of the Audio Engineering Society
LC - Low complexity
LFE - Low frequency enhancement
LRQ - Lower rate lattice vector quantization
LTP - Long term prediction
MDCT - Modified discrete cosine transform
MPEG - Moving Picture Experts Group
SBR - Spectral band replication
SMR - Signal-to-mask ratio
SRS - Sample rate scalable

An Overview of G.719 Audio Codec:

In the hands-free videoconferencing and teleconferencing markets, there is strong and increasing demand for audio coding that provides the full human auditory bandwidth of 20 Hz to 20 kHz [1]. This is because:

- Conferencing systems are increasingly used for elaborate presentations, often including music and sound effects (e.g. animal sounds, musical instruments, vehicles or nature sounds) which occupy a wider audio band than speech.
- Presentations involve remote education in music, playback of audio and video from DVDs and VCRs, audio/video clips from PCs, and elaborate audio-visual presentations from, for example, PowerPoint [1].
- Users perceive the bandwidth of 20 Hz to 20 kHz as the ultimate goal for audio bandwidth. The resulting market pressure is causing a shift in this direction, now that sufficient IP (Internet Protocol) bit rate and audio coding technology are available to deliver it.

As with any audio codec for hands-free videoconferencing use, the requirements include [1]:

- Low latency (to support natural conversation)
- Low complexity (free cycles for video and other processing reduce cost)
- High quality on all signal types [1].
Block diagram of the G.719 encoder:

Figure 1: Block diagram of the G.719 encoder [1].

In Figure 1, the input signal sampled at 48 kHz is processed through a transient detector. Depending on the detection of a transient, indicated by the flag IsTransient, either a high or a low frequency resolution transform is applied to the input signal frame. The adaptive transform is based on a modified discrete cosine transform (MDCT) in the case of stationary frames [1]. For transient frames, the MDCT is modified to obtain a higher temporal resolution without the need for additional delay and with very little overhead in complexity. Transient frames have a temporal resolution equivalent to 5 ms frames [1].

The MDCT is defined as:

    y(k) = sum_{n=0}^{2N-1} x(n) cos[(pi/N)(n + 1/2 + N/2)(k + 1/2)],   k = 0, 1, ..., N-1

where y(k) is the k-th transform coefficient of the input frame and x(n) is the time-domain aliased signal derived from the input signal.

Block diagram of the G.719 decoder:

Figure 2: Block diagram of the G.719 decoder [1].

A block diagram of the G.719 decoder is shown in Figure 2. The transient flag is decoded first; it indicates the frame configuration, i.e. stationary or transient. The spectral envelope is then decoded, and the same bit-exact norm adjustment and bit-allocation algorithms are used at the decoder to re-compute the bit allocation, which is essential for decoding the quantization indices of the normalized transform coefficients [1]. After decoding the transform coefficients, the non-coded transform coefficients (those allocated zero bits) in the low frequencies are regenerated using a spectral-fill codebook built from the decoded transform coefficients [2].

TRANSFORM COEFFICIENT QUANTIZATION:

Each band consists of one or more vectors of 8-dimensional transform coefficients, and the coefficients are normalized by the quantized norm. All 8-dimensional vectors belonging to one band are assigned the same number of bits for quantization. A fast lattice vector quantization (FLVQ) scheme is used to quantize the normalized coefficients in 8 dimensions.
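The MDCT/IMDCT pair described above can be prototyped directly from the definition. The following sketch is a direct-form O(N^2) implementation for illustration only (not the optimized transform of the G.719 reference code); it also demonstrates time-domain aliasing cancellation between two 50%-overlapped frames using a sine window:

```python
import math

def mdct(x):
    """Direct-form MDCT per the definition above: 2N samples -> N coefficients."""
    N = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(y):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    N = len(y)
    return [(2.0 / N) * sum(y[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]

def windowed_round_trip(frame, w):
    """Analysis window -> MDCT -> IMDCT -> synthesis window."""
    coeffs = mdct([wi * s for wi, s in zip(w, frame)])
    return [wi * s for wi, s in zip(w, imdct(coeffs))]

# Time-domain aliasing cancellation: the sine window satisfies the
# Princen-Bradley condition, so overlap-adding two adjacent windowed
# round trips reconstructs the overlapped samples exactly.
N = 4
w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
x = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.2, -0.9, 1.5, 0.4, -1.8, 0.6]
a = windowed_round_trip(x[0:2 * N], w)
b = windowed_round_trip(x[N:3 * N], w)
mid = [a[N + m] + b[m] for m in range(N)]  # overlap-add recovers x[4:8]
```

Each half-frame of the IMDCT output is an aliased mix of the input with its time reverse; the aliasing cancels only across overlapped frames, which is why the transform adds no coding redundancy despite the 50% overlap.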
In FLVQ the quantizer comprises two sub-quantizers: a D8-based higher rate lattice vector quantizer (HRQ) and an RE8-based lower rate lattice vector quantizer (LRQ). HRQ is a multi-rate quantizer designed to quantize the transform coefficients at rates from 2 up to 9 bits per coefficient, and its codebook is based on the so-called Voronoi code for the D8 lattice [4]. D8 is a well-known lattice defined as:

    D8 = { x = (x_1, ..., x_8) in Z^8 : x_1 + x_2 + ... + x_8 is even }

where Z^8 is the lattice consisting of all points with integer coordinates. It can be seen that D8 consists of the points having integer coordinates with an even sum. The codebook of HRQ is constructed from a finite region of the D8 lattice and is not stored in memory; the code words are generated by a simple algebraic method, and a fast quantization algorithm is used.

Figure 3: Observed spectra of different sounds (voiced speech, unvoiced speech and pop music) over different audio bandwidths [3].

Figure 3 illustrates how, for some signals, a large portion of the energy lies beyond the wideband frequency range. While the use of wideband speech codecs primarily addresses the requirement of intelligibility, the perceived naturalness and experienced quality of speech can be further enhanced by providing a larger acoustic bandwidth [3]. This is especially true in applications such as teleconferencing, where a high-fidelity representation of both speech and natural sounds enables a much higher degree of naturalness and spontaneity. The logical step toward the sense of "being there" is the coding and rendering of super-wideband signals with an acoustic bandwidth of 14 kHz. The response of the ITU-T to this increased need for naturalness was the standardization of the G.722.1 Annex C extension in 2005 [2]. More recently, this has also led the ITU-T to start work on extensions of the G.718, G.729.1, G.722 and G.711.1 codecs to provide super-wideband telephony as extension layers on top of these wideband core codecs [3].
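The fast quantization mentioned above exploits the structure of D8: the nearest lattice point can be found by rounding every coordinate to the nearest integer and then fixing the parity of the sum. A minimal sketch of this classic Conway-Sloane rounding procedure (function name and example vector are my own, not taken from the G.719 code):

```python
import math

def nearest_d8(x):
    """Nearest D8 lattice point (integer vector with even coordinate sum):
    round every coordinate, then, if the sum is odd, re-round the coordinate
    with the largest rounding error in the opposite direction."""
    g = [math.floor(v + 0.5) for v in x]       # nearest integer, ties upward
    if sum(g) % 2:                             # odd sum: not in D8 yet
        i = max(range(len(x)), key=lambda j: abs(x[j] - g[j]))
        g[i] += 1 if x[i] > g[i] else -1       # cheapest parity fix
    return g

# Hypothetical coefficient vector: the initial rounding has an odd sum,
# so coordinate 5 (error 0.45, the largest) is re-rounded from -1 to 0.
print(nearest_d8([0.7, 0.2, -0.1, 1.3, 0.0, -0.55, 2.2, 0.4]))
# [1, 0, 0, 1, 0, 0, 2, 0]
```

Because the search is pure arithmetic, no codebook needs to be stored, which is exactly why the HRQ codebook can live implicitly in a finite region of the lattice.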
An overview of MPEG - Advanced Audio Coding

The advanced audio coding (AAC) scheme was a joint development by Dolby, Fraunhofer, AT&T, Sony and Nokia [9]. It is a digital audio compression scheme for medium to high bit rates which is not backward compatible with previous MPEG audio standards. AAC encoding follows a modular approach, and the standard defines four profiles which can be chosen based on factors such as the complexity of the bit stream to be encoded and the desired performance and output:

- Low complexity (LC)
- Main profile (MAIN)
- Sample-rate scalable (SRS)
- Long term prediction (LTP)

AAC provides excellent audio quality and is suitable for low bit rate, high quality audio applications. The MPEG-AAC audio coder uses the AAC scheme. HE-AAC, also known as AACPlus, is a low bit rate audio coder: an AAC LC coder enhanced with spectral band replication (SBR) technology. AAC is a second-generation coding scheme used for stereo and multichannel signals. Compared to earlier perceptual coders, AAC provides more flexibility and uses more coding tools [12]. The coding efficiency is enhanced by the following tools, which help attain higher quality at lower bit rates [12]:

- Higher frequency resolution, with the number of spectral lines increased from 576 up to 1024.
- Improved joint stereo coding. The bit rate can frequently be reduced owing to the flexibility of mid/side coding and intensity coding.
- Huffman coding [12] applied to the coder partitions.

An overview of spectral band replication technology in the AACPlus audio codec

Spectral band replication (SBR) is a new audio coding tool that significantly improves the coding gain of perceptual coders and speech coders. Currently, three different audio coders have shown a vast improvement in combination with SBR: MPEG-AAC, MPEG Layer II and MPEG Layer III (mp3), all three being parts of the open ISO-MPEG standard.
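The mid/side coding mentioned above replaces the left/right channel pair with their half-sum and half-difference. For correlated channels the side signal carries very little energy and can be coded with far fewer bits, while the transform stays exactly invertible. A minimal sketch of the idea (illustrative only, not the AAC bitstream syntax):

```python
def ms_encode(left, right):
    """Mid/side transform: mid = half-sum, side = half-difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse transform: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Strongly correlated stereo channels: the side signal is an order of
# magnitude smaller than either channel.
left_ch = [0.50, 0.40, -0.20, 0.10]
right_ch = [0.48, 0.41, -0.22, 0.09]
mid, side = ms_encode(left_ch, right_ch)
print(side)  # small values on the order of 0.01
```

An AAC encoder switches between left/right and mid/side coding per scale-factor band, which is the "flexibility" that lets the bit rate drop when the channels are similar.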
The combination of AAC and SBR is used in the standardized Digital Radio Mondiale (DRM) system, and SBR is currently also being standardized within MPEG-4 [15].

Block diagram of the SBR encoder:

Figure 4: Block diagram of the SBR encoder [15].

The basic layout of the SBR encoder is shown in Figure 4. The input signal is initially fed to a down-sampler, which supplies the core encoder with a time-domain signal having half the sampling frequency of the input signal. In parallel, the input signal is fed to a 64-channel analysis QMF bank, whose outputs are complex-valued sub-band signals. The sub-band signals are fed to an envelope estimator and various detectors, and the outputs from the detectors and the envelope estimator are assembled into the SBR data stream. The data is subsequently coded using entropy coding and, in the case of multichannel signals, also channel-redundancy coding. The coded SBR data and a bitrate control signal are then supplied to the core encoder. The SBR encoder interacts closely with the core encoder: information is exchanged between the two systems in order to, for example, determine the optimal cutoff frequency between the core coder and the SBR band. The core coder finally multiplexes the SBR data stream into the combined bitstream [15].

Block diagram of the SBR decoder:

Figure 6: Block diagram of the SBR decoder [15].

Figure 6 illustrates the layout of the SBR-enhanced decoder. The received bitstream is divided into two parts: the core coder bitstream and the SBR data stream. The core bitstream is decoded by the core decoder, and the output audio signal, typically of lowpass character, is forwarded to the SBR decoder together with the SBR data stream. The core audio signal, sampled at half the frequency of the original signal, is first filtered in the analysis QMF bank. The filter bank splits the time-domain signal into 32 sub-band signals. The output from the filter bank, i.e.
the sub-band signals, are complex-valued and thus oversampled by a factor of two compared to a regular QMF bank [15].

Proposal:

In this project, the working of the G.719 codec is comprehensively studied and analyzed, the codec is implemented, and a comparative analysis of G.719 against the AAC and HE-AAC audio codecs is carried out, with results, in terms of performance, reliability, memory requirements, compression ratios and applications.

References:

[1] M. Xie, P. Chu, A. Taleb and M. Briand, "A new low-complexity full band (20 kHz) audio coding standard for high-quality conversational applications", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 265-268, Oct. 2009.

[2] A. Taleb and S. Karapetkov, "The first ITU-T standard for high-quality conversational fullband audio coding", IEEE Communications Magazine, vol. 47, pp. 124-130, Oct. 2009.

[3] J. Wang, B. Chen, H. He, S. Zhao and J. Kuang, "An adaptive window switching method for ITU-T G.719 transient coding in TDA domain", IEEE International Conference on Wireless, Mobile and Multimedia Networks, pp. 298-301, Jan. 2011.

[4] J. Wang, N. Ning, X. Ji and J. Kuang, "Norm adjustment with segmental weighted SMR for ITU-T G.719 audio codec", IEEE International Conference on Multimedia and Signal Processing, vol. 2, pp. 282-285, May 2011.

[5] K. Brandenburg and M. Bosi, "Overview of MPEG audio: current and future standards for low-bit-rate audio coding", JAES, vol. 45, pp. 4-21, Jan./Feb. 1997.

[6] A/52B: ATSC Digital Audio Compression Standard: http://www.atsc.org/cms/standards/a_52b.pdf

[7] F. Henn, R. Böhm and S. Meltzer, "Spectral band replication technology and its application in broadcasting", International Broadcasting Convention, 2003.

[8] M. Dietz and S. Meltzer, "CT-AACPlus - a state of the art audio coding scheme", Coding Technologies, EBU Technical Review, July 2002.
[9] ISO/IEC IS 13818-7, "Information technology - Generic coding of moving pictures and associated audio information - Part 7: Advanced audio coding (AAC)", 1997.

[10] M. Bosi and R. E. Goldberg, "Introduction to digital audio coding standards", Norwell, MA: Kluwer, 2003.

[11] H. S. Malvar, "Signal processing with lapped transforms", Norwood, MA: Artech House, 1992.

[12] D. Meares, K. Watanabe and E. Scheirer, "Report on the MPEG-2 AAC stereo verification tests", ISO/IEC JTC1/SC29/WG11, Feb. 1998.

[13] SUPER (c) v.2012.build.50: a simplified universal player, encoder and renderer; a graphical user interface to FFmpeg, MEncoder, MPlayer, x264, Musepack, Shorten audio, True Audio, WavPack, the libavcodec library and the Theora/Vorbis RealProducer plugin: www.erightsoft.com

[14] T. Ogunfunmi and M. Narasimha, "Principles of speech coding", Boca Raton, FL: CRC Press, 2010.

[15] P. Ekstrand, "Bandwidth extension of audio signals by spectral band replication", IEEE Workshop on Model Based Processing and Coding of Audio, pp. 53-58, Nov. 2002.