EE5359 MULTIMEDIA PROCESSING INTERIM REPORT
Study and implementation of G.719 audio codec and performance analysis of
G.719 with AAC (advanced audio codec) and HE-AAC (high efficiency-advanced
audio codec) audio codecs.
Student: Yashas Prakash
Student ID: 1000803680
Instructor: Dr. K. R. Rao
E-mail: yashas.prakash@mavs.uta.edu
Date: 04-25-2012.
List of acronyms
AAC - Advanced audio coding
ATSC - Advanced television systems committee
AES - Audio Engineering Society
EBU - European broadcasting union
FLVQ - Fast lattice vector quantization
HE-AAC - High efficiency advanced audio coding
HRQ - Higher rate lattice vector quantization
IMDCT - Inverse modified discrete cosine transform
ISO - International organization for standardization
ITU - International telecommunication union
JAES - Journal of the Audio Engineering Society
LC - Low complexity
LRQ - Lower rate lattice vector quantization
LFE - Low frequencies enhancement
LTP - Long term prediction
MDCT - Modified discrete cosine transform
MPEG - Moving picture experts group
QMF - Quadrature mirror filter
SBR - Spectral band replication
SMR - Symbolic music representation
SRS - Sample rate scalable
TDA - Time domain aliased
WMOPS - Weighted millions operations per second
SUBJECTIVE PERFORMANCE OF G.719
Subjective tests for the ITU-T G.719 Optimization/Characterization phase were performed from
mid-February through early April 2008 by independent listening laboratories in American
English. According to a test plan designed by ITU-T Q7/SG12 experts, the joint candidate codec
was evaluated in two experiments:
• Experiment 1: Speech (clean, reverberant, and noisy)
• Experiment 2: Mixed content and music
Mixed content items are representative of advertisements, film trailers, news with jingles, and
music with announcements, and contain speech, music, and noise. Each experiment used the “triple
stimulus/hidden reference/double blind test method” described in ITU-R Recommendation
BS.1116-1. A standard MPEG audio codec, LAME MP3 version 3.97 as found on the LAME
website, was used as the reference codec in the subjective tests. The ITU-T requirement was that
the G.719 candidate codec at 32, 48, and 64 kbps be proven “Not Worse Than” the reference
codec at 40, 56, and 64 kbps, respectively, with a 95% statistical confidence level. In addition,
the G.719 candidate codec at 64 kbps was also tested against the G.722.1C codec at 48 kbps in
Experiment 2. The subjective test results for the G.719 codec are shown in Figures 6-8.
Statistical analysis of the results showed that the G.719 codec met all performance requirements
specified for the subjective Optimization/Characterization test. In Experiment 1 the G.719
codec was better than the reference codec at all bit rates. In Experiment 2 the G.719 codec was
better than the reference codec at the lowest bit rate for all items and at the two other bit rates
for most of the items. An additional subjective listening test for the G.719 codec was conducted
later to evaluate the quality of the codec at rates higher than those described in the ITU-T test
plan. Because the quality expectation of the codec at these rates is high, a pre-selection of
critical items, for which the quality at the lower bit rate range was most degraded, was conducted
prior to testing. The test results are shown in Figure 8. They indicate that transparency was
reached for critical material at 128 kbps.
Figure 6: subjective test results experiment 1 [7]
Figure 7: subjective test results experiment 2 [7]
Figure 8: additional subjective tests [7]
Algorithmic efficiency
The G.719 codec has low complexity and a low algorithmic delay [1]. The delay depends on
the frame size of 20 milliseconds and the look-ahead of one frame used to form the transform
blocks. Hence, the algorithmic delay of the G.719 codec is 40 milliseconds. The algorithmic
delays of comparable codecs such as 3GPP eAAC+ [14] and 3GPP AMR-WB+ are significantly
higher. For AMR-WB+ the algorithmic delay for mono coding is between 77.5 and 227.6 ms,
depending on the internal sampling frequency. For eAAC+ the algorithmic delay is 323 ms for
mono coding at 32 kbps and a 48 kHz sampling rate. In Table 1 the average and worst-case
complexity of G.719 is expressed in Weighted Millions Operations Per Second (WMOPS). The
figures are based on complexity reports using the basic operators of the ITU-T STL2005 Software
Tool Library v2.2 [7]. For comparison, the complexity of three comparable audio codecs,
eAAC+, AMR-WB+ and ITU-T G.722.1C [8] (the low-complexity super-wideband (14 kHz) codec
from which G.719 was developed), is shown in Table 2. The memory requirements of G.719
are presented in Table 3. The delay and complexity measures show that the G.719 codec is very
efficient in terms of complexity and algorithmic delay, especially when compared to eAAC+ and
AMR-WB+.
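The delay figure quoted above follows from simple arithmetic, sketched below. The function and constant names are illustrative only; the 48 kHz sampling rate, 20 ms frame and one-frame look-ahead are the values given in this report.

```python
SAMPLE_RATE_HZ = 48_000   # sampling rate used in this report
FRAME_MS = 20             # G.719 frame size in milliseconds
LOOKAHEAD_FRAMES = 1      # one frame of look-ahead to form the transform block


def frame_length_samples(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of samples in one frame (the N used in the transform sections)."""
    return sample_rate_hz * frame_ms // 1000


def algorithmic_delay_ms(frame_ms=FRAME_MS, lookahead_frames=LOOKAHEAD_FRAMES):
    """Delay from buffering the current frame plus the look-ahead frames."""
    return frame_ms * (1 + lookahead_frames)


if __name__ == "__main__":
    print(frame_length_samples(), "samples per frame")
    print(algorithmic_delay_ms(), "ms algorithmic delay")
```

This counts only buffering delay; processing time and transmission delay are outside the definition of algorithmic delay.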
Frame buffering and windowing with overlap
A time-limited block of the input audio signal can be seen as windowed with a rectangular
window. Windowing is a multiplication in the time domain, which becomes a convolution in the
frequency domain and, for this window, results in a large frequency spread. In addition, the
sampling theorem states that the maximum frequency that can be correctly represented in discrete
time is the Nyquist frequency, i.e. half of the sampling rate; above it, aliasing occurs. For
example, in a signal sampled at 48 kHz a frequency of 25 kHz, i.e. 1 kHz above the Nyquist
frequency of 24 kHz, will be analyzed as 23 kHz due to aliasing. Because of the large frequency
spread of the rectangular window, the frequency analysis can be contaminated by aliasing. To
reduce the frequency spread and suppress the aliasing effect, windows without sharp
discontinuities can be used. Two examples are the sine and the Hann windows, defined in [17],
which compared to the rectangular window have a larger attenuation of the side lobes but
also a wider main lobe. This is illustrated in Figure 9, where the shapes of the windows and the
corresponding frequency spectra can be observed. In conclusion, there has to be a trade-off
between the possible aliasing and the frequency resolution.
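The 25 kHz example above can be checked with a few lines of code. This is a minimal sketch of spectral folding for a real-valued signal; the function name is hypothetical.

```python
def aliased_frequency(f_hz, fs_hz):
    """Frequency at which a real tone of f_hz appears after sampling at fs_hz."""
    f = f_hz % fs_hz              # the sampled spectrum is periodic in fs
    if f > fs_hz / 2:             # above Nyquist: fold back around fs/2
        f = fs_hz - f
    return f


if __name__ == "__main__":
    # A 25 kHz tone sampled at 48 kHz lies 1 kHz above the Nyquist frequency.
    print(aliased_frequency(25_000, 48_000))
```

Frequencies at or below the Nyquist frequency are returned unchanged, which is the alias-free case.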
Figure 9: Three window functions and their corresponding frequency spectrum. The windows are
1920 samples long at a sampling rate of 48 kHz [17]
In the synthesis of the analysed and encoded blocks of a processed audio signal, the window
effects have to be cancelled. For example, the inverse window function could be applied to the
coded time-domain blocks, but there is a high risk of audible artefacts near the
block edges due to discontinuities and amplification of the coding errors. To reduce these
block artefacts, overlap-add techniques are commonly used [17].
In ITU-T G.719 the blocks of two consecutive frames are windowed with a sine window of
length 2N = 1920 samples that is defined by:
The signals are processed with an overlap in the data of 50% between consecutive blocks.
The windowed signal of each block is given by:
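As a sketch of the two definitions referred to above, and assuming the sine window takes the form commonly used with the MDCT, the window and the windowed block can be written as follows. The symbols $w[n]$ for the window and $x[n]$ for the samples of the current block of length $2N$ are notational assumptions.

```latex
w[n] = \sin\!\left(\frac{\pi}{2N}\left(n + \tfrac{1}{2}\right)\right),
\qquad n = 0, \dots, 2N - 1

x_w[n] = w[n]\, x[n],
\qquad n = 0, \dots, 2N - 1
```

With this form, $w^2[n] + w^2[n + N] = 1$ for $n = 0, \dots, N - 1$, which is the property that overlap-add reconstruction with 50% overlap relies on.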
Figure 10: G.719 buffering, windowing and transformation of an audio signal [17]
Figure 10 shows the buffering and windowing with the overlap of N = 960 samples between the
blocks of length 2N. The blocks are Time-Domain Aliased (TDA) into blocks of length N that
are transformed using the type-IV Discrete Cosine Transform (DCT-IV). The information from the
transient detector is not used in the buffering, the windowing or the TDA, but only for the DCT-IV,
which implies that the buffering and windowing are common to the stationary and transient
modes. The combination of the TDA and the DCT-IV is the MDCT, which is presented in
the following section.
Modified Discrete Cosine Transform
The MDCT is used in G.719 to transform the buffered and windowed signal blocks into a
frequency representation. The transform comprises Time-Domain Aliasing (TDA), which means
that the signal blocks of 2N = 1920 samples are folded (aliased) into blocks of N = 960 samples.
These time-domain aliased signals of each block are then represented by N coefficients of cosine
basis functions. Due to the TDA it is not possible to reconstruct the time-domain signals from
individual MDCT spectra, but the framework of overlapped signal blocks enables perfect
reconstruction. The 50% overlap and the properties of the windows are essential for the
reconstruction, where the TDA is cancelled by overlap-adding consecutive inverse-transformed
MDCT spectra. These are the conditions for Time-Domain Aliasing Cancellation (TDAC)
and perfect reconstruction with the overlap-add technique. The signal blocks are overlapped
in order to avoid block artefacts. With a general overlapped transform, the number of frequency
coefficients per time unit would thereby increase compared with the transformation of
non-overlapped blocks, and the bit rate needed to code the spectra would increase accordingly.
However, because the TDA in the MDCT folds each block of 2N samples into N coefficients, the
bit rate is reduced by the factor corresponding to the overlap. This, in
combination with the real-valued frequency coefficients, makes the MDCT competitive for audio
coding, with a compact representation of the signals. The MDCT spectrum X_MDCT[k] of the windowed
signal x_w[n] is by definition obtained as:
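To make the TDAC property concrete, the sketch below uses the textbook MDCT, X[k] = sum over n = 0..2N-1 of x_w[n] cos((pi/N)(n + 1/2 + N/2)(k + 1/2)), together with a sine window, and checks that overlap-add returns the input. The scaling convention (2/N in the inverse), the direct O(N^2) evaluation and all function names are assumptions made for illustration; G.719 itself uses N = 960 and a fast algorithm, and its normalization may differ.

```python
import math


def sine_window(N):
    """Sine window of length 2N (assumed form, power complementary)."""
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]


def mdct(block):
    """Direct MDCT: 2N windowed samples -> N real coefficients."""
    N = len(block) // 2
    n0 = 0.5 + N / 2
    return [
        sum(block[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
            for n in range(2 * N))
        for k in range(N)
    ]


def imdct(spectrum):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples (2/N scaling)."""
    N = len(spectrum)
    n0 = 0.5 + N / 2
    return [
        2.0 / N * sum(spectrum[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                      for k in range(N))
        for n in range(2 * N)
    ]


def analyse_and_resynthesise(signal, N):
    """Window, MDCT, IMDCT, window again, then overlap-add with hop size N."""
    if N % 2 != 0 or len(signal) % N != 0:
        raise ValueError("N must be even and len(signal) a multiple of N")
    w = sine_window(N)
    # Pad with N zeros on each side so every sample is covered by two blocks.
    padded = [0.0] * N + list(signal) + [0.0] * N
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - N, N):
        block = [w[n] * padded[start + n] for n in range(2 * N)]
        aliased = imdct(mdct(block))      # contains time-domain aliasing
        for n in range(2 * N):
            out[start + n] += w[n] * aliased[n]   # aliasing cancels in the sum
    return out[N:-N]


if __name__ == "__main__":
    N = 8  # small block size so the direct transform stays fast
    signal = [math.sin(0.3 * n) + 0.1 * n for n in range(4 * N)]
    rebuilt = analyse_and_resynthesise(signal, N)
    print(max(abs(a - b) for a, b in zip(signal, rebuilt)))  # numerically tiny
```

A single inverse-transformed block does not equal its input; only the sum of two overlapping blocks does, which is the point made in the paragraph above.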
Transient mode transformation
In the transient mode of G.719 the time-aliased signal block Q x_w is reversed in time and
divided into four sub-frames. The reversal re-creates the temporal coherence of the input signal
that was destroyed by the TDA. The first and the last sub-frames are windowed by half sine
windows with a fourth of zero padding, while the second and third sub-frames are windowed with
the ordinary sine window, as illustrated in Figure 11. The overlap between the windowed
sub-frames is 50% and each segment is MDCT transformed, i.e. time-aliased and DCT-IV
transformed, which results in sub-spectra of length N/4. Thus the total length of the four
sub-spectra is N frequency coefficients, i.e. the transform lengths are equal in the stationary and the
transient modes of G.719.
Figure 11: Windowing of sub-frames in the transient mode [1].
Perceptual coding
In G.719 the MDCT spectra are perceptually encoded based on a psychoacoustical model. The
model describes the human hearing system and is used in order to introduce coding errors that
are not audible. In Figure 12 the principle of the perceptual coder is illustrated. The MDCT
spectrum of the transformed windowed time-domain signal is split into 44 sub-vectors that
approximate the frequency resolution of the ear by increasing sub-vector lengths with increasing
frequency. The sub-vector spectra are quantized and coded based on the sub-vector energies, or
norms, that are weighted according to the psychoacoustical model. The coding procedure is
similar for the two time-resolution modes in G.719, but for the transient mode the spectral
coefficients of the four sub-frames are interleaved before coding to preserve the coherence of the
signal in the time domain.
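As an illustration of the interleaving step mentioned above, the sketch below assumes a simple round-robin ordering, in which coefficients with the same index from the four sub-spectra are placed next to each other. This ordering and the function names are assumptions made for illustration; the exact ordering used by G.719 is defined in the Recommendation and may differ.

```python
def interleave_subspectra(subspectra):
    """Round-robin interleave equally long sub-spectra into one vector."""
    length = len(subspectra[0])
    if any(len(s) != length for s in subspectra):
        raise ValueError("all sub-spectra must have the same length")
    return [s[k] for k in range(length) for s in subspectra]


def deinterleave(vector, count):
    """Inverse of interleave_subspectra for `count` sub-spectra."""
    return [vector[i::count] for i in range(count)]


if __name__ == "__main__":
    # Four toy sub-spectra of length 2 (G.719 would use length N/4 = 240).
    subs = [[0, 1], [10, 11], [20, 21], [30, 31]]
    mixed = interleave_subspectra(subs)
    print(mixed)
    print(deinterleave(mixed, 4) == subs)
```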
Figure 12: Block diagram of the perceptual encoder based on MDCT domain masking [17]
The norm of each sub-vector is estimated and quantized with a uniform logarithmic scalar
quantizer with 40 levels in steps of 3 dB. The MDCT spectra are normalized with the quantized
norms in order to reduce the amount of information needed to describe the spectra. The
quantized norms are both differentially and Huffman encoded [3] before they are transmitted to
the decoder, where they are used to de-normalize the decoded MDCT spectra.
In the next step of encoding, bits are iteratively allocated to each sub-vector as a function of
the quantized sub-vector norms. The goal of the bit allocation is to distribute the available bits
so that the maximum subjective quality is obtained at a given data rate, i.e. a given number of bits.
Therefore the quantized norms of the sub-vectors are perceptually weighted to account for
psychoacoustical masking and threshold effects. In each iteration of the bit allocation, the
sub-vector with the largest weighted norm is found and one bit is assigned to each MDCT
coefficient in the corresponding sub-vector. The corresponding norm is decreased by 6 dB and
the procedure repeats until all available bits are assigned. When a sub-vector has been assigned 9
bits per coefficient, its norm is set to minus infinity so that no more bits are allocated to that
sub-vector.
Given the allocated bits, the normalized sub-vectors are lattice vector quantized and
Huffman coded. More information about the vector quantization in G.719 can be found in [1].
In the stationary mode the level of the non-coded spectral coefficients in the sub-vectors
assigned zero bits is estimated, quantized and included in the bit stream for
frequencies below the so-called transition frequency. The quantization indices of the norms, the
encoded sub-vector spectra and the estimated noise level form the encoded bit stream. In
addition, side information, for example the coding mode (stationary or transient) and whether
Huffman coding is used, is added to the bit stream that is transmitted to the G.719 decoder.
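The iterative bit allocation described above can be sketched as a greedy loop. This is a simplified illustration, not the procedure of the Recommendation: the function name, the inputs (perceptually weighted norms in dB and sub-vector lengths) and the rule for a sub-vector that no longer fits the remaining budget are assumptions.

```python
import math

MAX_BITS_PER_COEFF = 9   # cap stated in the text
STEP_DB = 6.0            # norm reduction per allocated bit


def allocate_bits(weighted_norms_db, lengths, total_bits):
    """Greedy allocation: one bit per coefficient to the largest weighted norm.

    Returns (bits per coefficient for each sub-vector, unused bits).
    """
    norms = list(weighted_norms_db)
    bits = [0] * len(norms)
    remaining = total_bits
    while True:
        i = max(range(len(norms)), key=lambda j: norms[j])
        if norms[i] == -math.inf:        # every sub-vector is closed
            break
        if lengths[i] > remaining:       # assumption: close what cannot be paid for
            norms[i] = -math.inf
            continue
        bits[i] += 1
        remaining -= lengths[i]          # one bit for each coefficient
        norms[i] -= STEP_DB
        if bits[i] == MAX_BITS_PER_COEFF:
            norms[i] = -math.inf         # no further bits for this sub-vector
    return bits, remaining


if __name__ == "__main__":
    # Toy example with three sub-vectors; G.719 uses 44 sub-vectors per frame.
    bits, unused = allocate_bits([30.0, 18.0, 6.0], [8, 8, 16], total_bits=40)
    print(bits, unused)
```

Lowering the norm by 6 dB per bit reflects that each extra bit per coefficient reduces the quantization noise by roughly that amount, so the loop keeps spending bits where the weighted noise would otherwise be largest.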
Implementation of the encoder:
Implementation of the decoder:
Comparison of the decoded signal with the original raw file at the same bit rate.
Interim report: The G.719 encoder and decoder were successfully implemented, and complexity with
respect to performance at bit rates of 32, 48 and 64 kbps was analyzed.
References:
[1] M. Xie, P. Chu, A. Taleb and M. Briand, " A new low-complexity full band (20kHz)
audio coding standard for high-quality conversational applications", IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics, pp.265-268, Oct. 2009.
[2] A. Taleb and S. Karapetkov, " The first ITU-T standard for high-quality
conversational fullband audio coding ", IEEE communications magazine, vol.47, pp.124-130, Oct. 2009.
[3] J. Wang, B. Chen, H. He, S. Zhao and J. Kuang, " An adaptive window switching
method for ITU-T G.719 transient coding in TDA domain", IEEE International
Conference on Wireless, Mobile and Multimedia Networks, pp.298-301, Jan. 2011.
[4] J. Wang, N. Ning, X. Ji and J. Kuang, " Norm adjustment with segmental weighted
SMR for ITU-T G.719 audio codec ", IEEE International Conference on Multimedia and
Signal Processing, vol.2, pp.282-285, May 2011.
[5] K. Brandenburg and M. Bosi, “ Overview of MPEG audio: current and future
standards for low-bit-rate audio coding ” JAES, vol.45, pp.4-21, Jan/Feb. 1997.
[6] ATSC A/52B Digital Audio Compression Standard:
http://www.atsc.org/cms/standards/a_52b.pdf
[7] F. Henn , R. Böhm and S. Meltzer, “ Spectral band replication technology and its
application in broadcasting ”, International broadcasting convention, 2003.
[8] M. Dietz and S. Meltzer, “ CT-AACPlus – a state of the art audio coding scheme ”,
Coding Technologies, EBU Technical review, July 2002.
[9] ISO/IEC IS 13818-7, “ Information technology – Generic coding of moving pictures
and associated audio information Part 7: advanced audio coding (AAC) ”, 1997.
[10] M. Bosi and R. E. Goldberg, “ Introduction to digital audio coding standards ”,
Norwell, MA, Kluwer, 2003.
[11] H. S. Malvar, “ Signal processing with lapped transforms ”, Artech House,
Norwood, MA, 1992.
[12] D. Meares, K. Watanabe and E. Scheirer, “ Report on the MPEG-2 AAC stereo
verification tests ”, ISO/IEC JTC1/SC29/WG11, Feb. 1998.
[13] Super (c) v.2012.build.50: A simplified universal player encoder and renderer, A
graphic user interface to FFmpeg, Mencoder, Mplayer, x264, Musepack,
Shorten audio, True audio, Wavpack, Libavcodec library and
Theora/vorbis real producers plugin: www.erightsoft.com
[14] T. Ogunfunmi and M. Narasimha, “ Principles of speech coding ”, Boca Raton, FL:
CRC Press, 2010.
[15] P. Ekstrand, " Bandwidth extension of audio signals by spectral band replication ",
IEEE, Workshop on model based processing and coding of audio, pp.53-58, Nov. 2002.
[16] T. Johnson, " Stereo coding for ITU-T G.719 codec ", Uppsala university, May
2011.