ISO/IEC MPEG-4 High-Definition Scalable
Advanced Audio Coding*
RALF GEIGER,1 RONGSHAN YU,2** JÜRGEN HERRE,1 SUSANTO RAHARDJA,2
(ralf.geiger@iis.fraunhofer.de) (rzyu@dolby.com) (juergen.herre@iis.fraunhofer.de) (rsusanto@i2r.a-star.edu.sg)
SANG-WOOK KIM,3 XIAO LIN,2 AND MARKUS SCHMIDT1
(sangwookkim@samsung.com) (linxiao@fortemedia.com.cn) (markus.schmidt@iis.fraunhofer.de)

1 Fraunhofer IIS, Erlangen, Germany
2 Institute for Infocomm Research, Singapore
3 Samsung Electronics, Suwon, Korea
Recently the MPEG Audio standardization group has successfully concluded the standardization process on technology for lossless coding of audio signals. A summary of the scalable lossless coding (SLS) technology, one of the results of this standardization work, is given. MPEG-4 scalable lossless coding provides a fine-grain scalable lossless extension of the well-known MPEG-4 AAC perceptual audio coder up to fully lossless reconstruction at word lengths and sampling rates typically used for high-resolution audio. The underlying innovative technology is described in detail and its performance is characterized for lossless and near-lossless representation, both in conjunction with an AAC coder and as a stand-alone compression engine. A number of application scenarios for the new technology are discussed.
0 INTRODUCTION
Perceptual coding of high-quality audio signals has experienced a tremendous evolution over the past two decades, both in terms of research progress and in worldwide deployment in products. Examples of successful applications include portable music players, Internet audio, audio for digital media (such as VCD and DVD), and digital broadcasting. Several phases of international standardization have been conducted successfully [1]–[3]. Many recent research and standardization efforts focus on achieving good sound quality at even lower bit rates [4]–[9] to accommodate storage and transmission channels with limited capacity (such as terrestrial and satellite broadcasting and third-generation mobile telecommunication). Nevertheless, for other scenarios with higher transmission bandwidth available, there is a general trend toward providing the consumer with a media experience of extremely high fidelity [10], as is frequently associated with the terms "high definition" and "high resolution" and with dedicated media types, such as DVD-Audio [11], SACD [12], HD DVD [11], or Blu-ray Disc [13]. In the realm of audio, this is achieved by employing lossless formats with high resolution (word length) and/or high sampling rate.

*Manuscript received 2006 March 23; revised 2006 October 24 and November 27.
**Currently with Dolby Laboratories, San Francisco, CA.
There exist several proprietary lossless formats, the
most prominent of which are MLP Lossless [14], [15],
DTS-HD [16], and Apple Lossless [17]. Furthermore, several freeware coding systems have gained some prominence through the Internet. Among them are FLAC [18],
Monkey’s Audio [19], and OptimFROG [20].
On the other end of the bit-rate scale, approaches for scalable perceptual audio coding have been developed in recent years. Within the International Telecommunication Union (ITU), scalability in wide-band audio coding has been adopted only recently [21]. Within MPEG-4 Audio [3], scalability was developed and adopted somewhat earlier [22]–[24]. This includes the realm of speech coding, where scalability is provided both for harmonic vector excitation coding (HVXC) [25], [26] and code-excited linear prediction (CELP) [27], [28].
In this context the ISO/MPEG Audio standardization group decided to start a new work item to explore technology for lossless and near-lossless coding of audio signals by issuing a call for proposals for relevant technology in 2002 [29]. Three specifications emerged from this call
as amendments to the MPEG-4 Audio standard. First, the standard on lossless coding of 1-bit oversampled signals [30] specifies the lossless compression of highly oversampled 1-bit sigma-delta-modulated audio signals as they are stored on SACD media under the name Direct Stream Digital (DSD) [31]. Second, the audio lossless coding (ALS) specification [32] describes technology for the lossless coding of PCM-coded signals at sampling rates up to 192 kHz as well as floating-point audio.
This paper provides an overview of the third descendant, the scalable lossless coding (SLS) specification [33],
which extends traditional methods for perceptual audio
coding toward lossless coding of high-resolution audio
signals in a scalable way. Specifically, it allows scaling up from a perceptually coded representation (MPEG-4 advanced audio coding, AAC) to a fully lossless representation with high definition, including a wide range of intermediate near-lossless representations. The paper starts
with an explanation of the general concept, discusses the
underlying novel technology components, and characterizes the codec in terms of compression performance and
complexity. Finally a number of application scenarios are
briefly outlined.
1 CONCEPT AND TECHNOLOGY OVERVIEW
This section provides an overview of the principle and
the structure of the scalable lossless coding technology
and its combination with the AAC perceptual audio coder.
This combination will be referred to as high-definition
advanced audio coding (HD-AAC) in the following.
1.1 General System Structure
Fig. 1 shows a very high-level view of the structure of
the HD-AAC encoder. The input signal is first coded by an
AAC (“core layer”) encoder. The SLS algorithm uses the
output to enhance the system’s performance toward lossless coding, resulting in an enhancement layer. The two
layers of information are subsequently multiplexed into
one high-definition bit stream. Decoding of an HD-AAC
bit stream is illustrated in Fig. 2. From the HD-AAC bit
stream, decoders are able either to decode the perceptually
coded AAC part only, or to use the additional SLS infor-
mation to produce losslessly/near-losslessly coded audio
for high-definition applications.
1.2 AAC Background
Since the MPEG-4 SLS coder has been designed to
operate as an enhancement to MPEG-4 AAC [3], [34], its
structure is closely related to that of the underlying AAC
core coder [35]. This section sketches the architectural
features of MPEG-4 AAC as they are relevant to the SLS
enhancement technology.
The underlying AAC codec provides efficient perceptual audio coding with high quality and achieves broadcast
quality at a bit rate of about 64 kbps per channel [36]. Fig.
3 gives a very concise view of the AAC encoder’s structure. The audio signal is processed in a blockwise spectral
representation using the modified discrete cosine transform (MDCT) [37]. The resulting 1024 spectral values are
quantized and coded considering the required accuracy as
demanded by the perceptual model. This is done to minimize the perceptibility of the introduced quantization distortion by exploiting masking effects. Several neighboring
spectral values are grouped into so-called scale factor
bands, sharing the same scale factor for quantization. Prior
to the quantization/coding (Q/C) tool, a number of processing tools operate on the spectral coefficients in order
to improve coding performance for certain situations. The
most important tools are the following.
• The temporal noise shaping (TNS) tool [38] carries out
predictive filtering across frequency in order to achieve
a temporal shaping of the quantization noise according
to the signal envelope and in this way optimize temporal
masking of the quantization distortion.
• The M/S stereo coding tool [39] provides sum/
difference coding of channel pairs, exploits interchannel
redundancy for near-monophonic signals, and avoids
binaural unmasking.
Fig. 1. HD-AAC encoder structure.
1.3 HD-AAC/SLS Enhancement
The SLS scalable lossless enhancement works on top of
this AAC architecture. The structures of an HD-AAC encoder and decoder are shown in Figs. 4 and 5.
In the encoder the audio signal is converted into a spectral
representation using the integer modified discrete cosine
transform (IntMDCT) [40], [41]. This transform represents
an invertible integer approximation of the MDCT and is
well-suited for lossless coding in the frequency domain.
Other AAC coding tools, such as mid/side coding or temporal noise shaping, are also considered and performed on the
IntMDCT spectral coefficients in an invertible integer fashion, thus maintaining the similarity between the spectral values used in the AAC coder and in the lossless enhancement.
Fig. 2. HD-AAC decoder structure.
Fig. 3. Structure of AAC encoder (simplified).
The link between the perceptual core and the scalable
lossless enhancement layer of the coder [42] is provided
by an error-mapping process. The error-mapping process
removes the information that has already been coded in the
perceptual (AAC) path from the IntMDCT spectral coefficients such that only the resulting (IntMDCT) residuals
are coded in the enhancement encoder, and in this way the
coding efficiency benefits from the underlying AAC layer.
The error-mapping process also preserves the probability
distribution skew of the original IntMDCT coefficients and
thus permits an efficient encoding of the residual by means
of two bit-plane coding processes, namely, bit-plane Golomb
coding (BPGC) [43] and context-based arithmetic coding
(CBAC) [44], and a so-called low-energy-mode encoder,
all of which will be described later in further detail.
By using the bit-plane coding process the lossless enhancement is performed in a fine-grain scalable way. A
lossless reconstruction is obtained if all the bit planes of
the residual are coded, transmitted, and decoded completely. If only some of the bit planes are decoded, a lossy reconstruction of the signal is obtained, with a quality between that of the AAC layer and the lossless reconstruction. In order to achieve optimal perceptual quality at intermediate
bit rates, the bit-plane coding is started from the most
significant bit (MSB) for all scale-factor bands, and progresses toward the least significant bit (LSB) for all bands
(see Fig. 6). In this way the bit-plane coding process preserves the overall spectral shape of the quantization noise,
as it results from the noise-shaping process of the AAC
perceptual coder, and thus takes advantage of the AAC
perceptual model.
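As a concrete illustration of this scan order, the following Python sketch (purely illustrative; the function and variable names are not taken from the standard) refines the magnitudes of all bands together, one bit plane at a time from the MSB downward, and stops when a hypothetical bit budget is exhausted, which mimics truncating the enhancement layer:

```python
import numpy as np

def bitplane_scan(residual, max_bits):
    """Illustrative MSB-to-LSB bit-plane scan over all bands.

    residual : 1-D array of non-negative integer residual magnitudes
    max_bits : bit budget; scanning stops when it is exhausted,
               which models truncation of the enhancement layer.
    Returns the partially refined magnitudes.
    """
    M = int(np.max(residual)).bit_length()        # most significant bit plane
    decoded = np.zeros_like(residual)
    bits_spent = 0
    for j in range(M - 1, -1, -1):                # MSB first, LSB last
        for k in range(len(residual)):            # all bands refined together
            if bits_spent >= max_bits:
                return decoded
            decoded[k] |= (residual[k] >> j & 1) << j
            bits_spent += 1
    return decoded

res = np.array([37, 5, 0, 12, 3])
print(bitplane_scan(res, max_bits=12))            # coarse but uniform refinement
```

Because every band receives the same number of bit planes before the budget runs out, a truncated stream leaves a residual quantization error whose spectral shape still follows the AAC noise shaping.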
The following sections will describe the principles underlying the design of SLS in greater detail by focusing on
its IntMDCT filter bank, error-mapping strategy, and entropy coding parts.
2 FILTER BANK
2.1 IntMDCT
The IntMDCT, as introduced in [40], is an invertible
integer approximation of the MDCT, which is obtained by
utilizing the “lifting scheme” [45] or “ladder network”
[46]. It enables efficient lossless coding of audio signals
[40] by means of entropy coding of integer spectral coefficients. Furthermore, as the IntMDCT closely approximates the behavior of the MDCT, it allows the strategies of perceptual and lossless audio coding in the frequency domain to be combined into a common framework [42].
Fig. 7 illustrates the close relationship between IntMDCT
and MDCT for a small audio segment by displaying the
respective magnitude spectra of both filter banks. The difference between MDCT and IntMDCT values is visible as
a small noise floor that is typically much lower than the
error introduced by perceptual coding. Thus the IntMDCT allows the quantization error of an MDCT-based perceptual codec to be coded efficiently in the frequency domain.
Fig. 4. Structure of HD-AAC encoder.
Fig. 5. Structure of HD-AAC decoder.
The following section provides some detail on the derivation of the IntMDCT from the MDCT.
2.2 Decomposition of MDCT
The MDCT, defined by

$$X(m) = \sqrt{\frac{2}{N}} \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\!\left(\frac{(2k+1+N)(2m+1)\pi}{4N}\right), \qquad m = 0, \ldots, N-1 \qquad (1)$$
with the time-domain input x(k) and the windowing function w(k), can be decomposed into two blocks, namely, windowing/time-domain aliasing (TDA) and the discrete cosine transform of type IV (DCT-IV). This is illustrated in Fig. 8 for both the forward and the inverse MDCT.
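For concreteness, Eq. (1) can be evaluated directly as a matrix-vector product. The following Python sketch is illustrative only (it has quadratic complexity, whereas practical implementations use fast algorithms) and computes the N MDCT values of one block of 2N windowed samples:

```python
import numpy as np

def mdct(x, w):
    """Direct evaluation of Eq. (1): N output values from 2N windowed samples.

    x : block of 2N time-domain samples
    w : window of length 2N (e.g., a sine window)
    """
    N = len(x) // 2
    k = np.arange(2 * N)
    m = np.arange(N)
    cos_kernel = np.cos(np.pi * np.outer(2 * m + 1, 2 * k + 1 + N) / (4 * N))
    return np.sqrt(2.0 / N) * cos_kernel @ (w * x)

N = 8
n = np.arange(2 * N)
w = np.sin(np.pi * (n + 0.5) / (2 * N))        # sine window
x = np.random.randn(2 * N)
print(mdct(x, w).shape)                         # (8,)
```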
In the forward IntMDCT, the windowing/TDA block is
calculated by 3N/2 so-called lifting steps:
"
x!k"
# $
" #$
#
"
x!N − 1 − k"
×
×
1
!
1 −
w!N − 1 − k" − 1
w!k"
0
0
−w!k" 1
x!k"
x!N − 1 − k"
1
1 −
w!N − 1 − k" − 1
w!k"
0
1
k = 0, . . . ,
,
%
%
N
− 1.
2
(2)
After each lifting step a rounding operation is applied to
stay in the integer domain. Every lifting step can be inverted by simply adding the subtracted value.
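A minimal sketch of this mechanism, assuming generic integer inputs and an arbitrary lifting coefficient (not the window-derived coefficients of Eq. (2)), shows why the rounding does not destroy invertibility:

```python
def lift_forward(a, b, coeff):
    """One rounded lifting step: b is updated by a rounded multiple of a."""
    return a, b + int(round(coeff * a))

def lift_inverse(a, b, coeff):
    """Exact inverse: subtract the identically rounded value again."""
    return a, b - int(round(coeff * a))

a, b = 123, -45                    # integer input pair
coeff = 0.7071                     # an arbitrary lifting coefficient
a1, b1 = lift_forward(a, b, coeff)
a2, b2 = lift_inverse(a1, b1, coeff)
assert (a2, b2) == (a, b)          # perfect reconstruction despite rounding
```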
Fig. 6. Bit-plane scan process in SLS.
2.3 Integer DCT-IV
For the IntMDCT, the DCT-IV is calculated in an invertible integer fashion, called the integer DCT-IV. The
multidimensional lifting (MDL) scheme [41], [47] is applied in order to reduce the required rounding operations
in the invertible integer approximation as much as possible
and in this way minimize the approximation error noise
floor (see Fig. 7). The following block matrix decomposition for an invertible matrix T and the identity matrix I
shows the basic principle of the MDL scheme,
" # " #" #" #
Fig. 7. IntMDCT and MDCT magnitude spectra.
T
0
0
T −1
=
−I
0
T −1 I
I
−T
0 I
0 I
I
T −1
.
(3)
The three blocks in this decomposition are the so-called
MDL steps. Similar to the conventional lifting steps, they
can be transferred to invertible integer mappings by rounding the floating-point values after being processed by T or
T−1, and they can be inverted by subtracting the values that
have been added.
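The following Python sketch illustrates the three rounded MDL steps of Eq. (3), using an orthonormal DCT-IV (which is its own inverse) as T. It only demonstrates the block-lifting principle; the normative SLS integer DCT-IV adds further lifting stages and the noise shaping described in Section 2.4.

```python
import numpy as np
from scipy.fft import dct

def T(v):
    """Orthonormal DCT-IV, which is its own inverse (T == T^-1)."""
    return dct(v, type=4, norm="ortho")

def rnd(v):
    return np.round(v).astype(np.int64)

def mdl_forward(x1, x2):
    # Three MDL steps of Eq. (3); output is (-T x2, T^-1 x1) up to rounding noise.
    x2 = x2 + rnd(T(x1))
    x1 = x1 - rnd(T(x2))
    x2 = x2 + rnd(T(x1))
    return x1, x2

def mdl_inverse(x1, x2):
    # Undo the steps in reverse order by subtracting the identical rounded values.
    x2 = x2 - rnd(T(x1))
    x1 = x1 + rnd(T(x2))
    x2 = x2 - rnd(T(x1))
    return x1, x2

rng = np.random.default_rng(0)
a = rng.integers(-1000, 1000, 64)
b = rng.integers(-1000, 1000, 64)
y1, y2 = mdl_forward(a.copy(), b.copy())
r1, r2 = mdl_inverse(y1, y2)
assert np.array_equal(r1, a) and np.array_equal(r2, b)   # lossless round trip
```

The round trip is exact for arbitrary integer inputs because the inverse subtracts exactly the values that were added.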
Fig. 8. MDCT and inverse MDCT by windowing/TDA and DCT-IV.
When coding stereo signals, this decomposition is used
to obtain an integrated calculation of the M/S matrix and
the integer DCT-IV for the left and right channels. The
number of required rounding operations is 3N per channel
pair, or 3N/2 per channel, which is the same number as for
the windowing/TDA stage. As a whole, this so-called stereo IntMDCT requires only three rounding operations per
sample, including M/S processing. Concerning mono signals, the same structure can be used, but has to be extended by some additional lifting steps to obtain the integer DCT-IV of one block; see [47]. This mono IntMDCT
requires four rounding operations per sample.
2.4 Noise Shaping
The lossless coding efficiency of the IntMDCT is further improved by utilizing a noise-shaping technique, introduced in [48]. In the lifting steps where time-domain
signals are processed, the rounding operations include an
error feedback mechanism to provide a spectral shaping of
the approximation noise. This approximation noise affects
the lossless coding efficiency mainly in the high-frequency region, where audio signals usually carry only a small amount of energy, especially at sampling rates of 96 kHz and above. Hence the low-pass characteristic of the approximation noise improves the lossless coding efficiency. A first-order noise-shaping filter is employed in
the three stages of lifting steps in the windowing/TDA
processing and in the first rounding stage of the integer
DCT-IV processing. Fig. 9 compares the resulting approximation error between the IntMDCT values and the
MDCT values rounded to integer, when the IntMDCT
operates both with and without noise shaping.
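The mechanism can be sketched as a first-order error feedback around the rounding operation. The feedback coefficient c below is illustrative, not the normative filter coefficient; its sign and magnitude determine how the rounding error is spectrally shaped:

```python
import numpy as np

def round_with_error_feedback(x, c=1.0):
    """Round a real-valued sequence while feeding back the previous rounding error.

    Relative to plain rounding, the quantization error is spectrally shaped
    by (1 - c*z^-1); the choice of c controls the shaping.
    """
    y = np.empty_like(x, dtype=np.int64)
    err = 0.0
    for n, v in enumerate(x):
        v_comp = v + c * err          # compensate with previous rounding error
        y[n] = np.round(v_comp)
        err = v_comp - y[n]           # error to be fed back at the next sample
    return y

x = np.random.randn(16) * 10
print(round_with_error_feedback(x))
```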
3 ERROR MAPPING AND
RESIDUAL CALCULATION
The objective of the error-mapping/residual calculation
stage is to produce an integer enhancement signal that
enables lossless reconstruction of the audio signal while consuming as few bits as possible. Rather than encoding all IntMDCT coefficients c[k] directly in the lossless enhancement layer, it is more efficient to make use of the information that has already been coded by the AAC layer. This is achieved by the error-mapping process, which produces a deterministic residual signal between the IntMDCT spectral values and their counterparts from the AAC layer.

Fig. 9. Mean-squared approximation error of stereo IntMDCT (including M/S) with and without noise shaping.
In order to produce a residual signal e[k] with the smallest possible entropy given the core AAC quantized value i[k], it is sufficient in most cases to use the minimum mean square error (MMSE) residual, obtained by subtracting from the IntMDCT coefficient c[k] its MMSE reconstruction given i[k], ĉ[k] = E{c[k] | i[k]}, that is,

$$e[k] = c[k] - \hat{c}[k]. \qquad (4)$$

Here E{·} denotes the expectation operation. However, in SLS a somewhat different approach is adopted, where the residual signal is given by

$$e[k] = \begin{cases} c[k] - \operatorname{thr}(i[k]), & i[k] \neq 0 \\ c[k], & i[k] = 0 \end{cases} \qquad (5)$$
where thr (i[k]) is the next quantized value closer to zero
with respect to i[k], and is calculated via table lookup and
linear interpolation to ensure the deterministic behavior
necessary for lossless coding. This error-mapping process
is illustrated in Fig. 10, where two different cases are
shown. In the first case the IntMDCT coefficients c[k]
belong to a scale-factor band that has been quantized and
coded at the AAC encoder (significant band). For these
coefficients the residual coefficients are obtained by subtracting the quantization thresholds from their corresponding c[k], resulting in a residual spectrum with reduced
amplitude. In the other case c[k] belongs to a band that is
not coded or has been quantized to zero in the AAC encoder (insignificant band). In this case the residual spectrum is simply the IntMDCT spectrum itself.
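A compact sketch of Eq. (5), with a toy threshold function standing in for the table lookup and interpolation used in the standard:

```python
def error_map(c, i, thr):
    """Residual per Eq. (5): subtract the quantization threshold in significant
    bands (i[k] != 0), pass the IntMDCT value through otherwise.

    c   : list of integer IntMDCT coefficients
    i   : list of AAC quantized values for the same coefficients
    thr : callable mapping an AAC quantized value to the threshold next
          closer to zero (a stand-in for the table lookup in the standard)
    """
    return [ck - thr(ik) if ik != 0 else ck for ck, ik in zip(c, i)]

def toy_thr(ik):
    """Toy threshold: a uniform quantizer with step size 8, purely illustrative."""
    return 8 * ik

c = [35, -20, 3, 0]
i = [4, -2, 0, 0]
print(error_map(c, i, toy_thr))   # [3, -4, 3, 0]
```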
Fig. 10. Illustration of error-mapping process.

This error-mapping process offers many advantages that permit better coding efficiency in the enhancement layer. First, as illustrated in Fig. 11, we notice that if the amplitude of c[k] is distributed exponentially, the amplitude of e[k] will likewise be distributed approximately exponentially. Thus it can be coded very efficiently by using the BPGC/CBAC coding process described in the next section. Secondly, for a significant band we find the following property for the coefficients c[k]:
$$|\operatorname{thr}(i[k])| \le |c[k]| < |\operatorname{thr}(i[k])| + \Delta(i[k]) \qquad (6)$$

where

$$\Delta(i[k]) = \operatorname{thr}(|i[k]| + 1) - \operatorname{thr}(|i[k]|) \qquad (7)$$
is the quantization step size of i[k]. Clearly, the magnitude of e[k] is bounded by the value of Δ(i[k]), and its sign is also identical to that of i[k] for i[k] ≠ 0. As a result, no
additional data transmission is needed for conveying the
sign information. This is referred to as “implicit signaling”
in the SLS specification.
This implicit signaling mechanism assumes that the input to the SLS layer and the AAC core quantization value
are “well synchronized.” This may not be true in all cases,
given that AAC encoders have all the freedom to optimize
the encoding process and thus may produce an i[k] that
does not satisfy Eq. (6). Thus SLS also includes an “explicit signaling” mechanism, which employs an MMSE
error mapping as defined in Eq. (4), where ĉ[k] is approximated by the AAC inverse quantization value. In this case
all the side information necessary for decoding e[k] is
signaled explicitly from the encoder to the decoder.
4 ENTROPY CODING
To achieve best efficiency in the compression of the enhancement layer information, several methods for entropy
coding of the IntMDCT residual information are employed.
4.1 Combined BPGC/CBAC Coding
The bit-plane Golomb coding (BPGC) process used in SLS is basically a bit-plane coding scheme in which the bit-plane symbols are coded arithmetically with a structural frequency assignment rule.

Fig. 11. Distribution of residual signal from error-mapping process.

Considering an input
data vector e = {e[0], . . . , e[N − 1]}, where N is the dimension of e, each element e[k] in e is first represented in binary format as

$$e[k] = (2s[k] - 1) \sum_{j=0}^{M-1} b[k, j]\, 2^{j}, \qquad k = 0, \ldots, N-1 \qquad (8)$$
which consists of a sign symbol

$$s[k] = \begin{cases} 1, & e[k] \geq 0 \\ 0, & e[k] < 0 \end{cases}, \qquad k = 0, \ldots, N-1 \qquad (9)$$

and bit-plane symbols b[k, j] ∈ {0, 1}, j = 0, . . . , M − 1, where M is the most significant bit plane of e, satisfying 2^(M−1) ≤ max{|e[k]|} < 2^M, k = 0, . . . , N − 1. The bit-plane symbols are then scanned from MSB to LSB over all the elements in e and coded by using an arithmetic code with a
structural frequency assignment Q_L(j) given by

$$Q_L(j) = \begin{cases} \dfrac{1}{2}, & j < L \\[6pt] \dfrac{1}{1 + 2^{\,2^{\,j-L}}}, & j \geq L \end{cases} \qquad (10)$$

where the "lazy-plane" parameter L can be selected using the adaptation rule

$$L = \min\left\{ L' \in \mathbb{Z} \;\middle|\; 2^{L'+1} N \geq A \right\} \qquad (11)$$

and A is the absolute sum of the data vector e.
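The following sketch, assuming a plain Python list of integer residuals, computes the lazy-plane parameter of Eq. (11) by direct search and evaluates the frequency assignment of Eq. (10) for a few bit planes:

```python
def lazy_plane(e):
    """Smallest integer L with 2**(L+1) * N >= sum(|e|), per Eq. (11)."""
    N, A = len(e), sum(abs(v) for v in e)
    L = -16                 # arbitrary lower bound for the search
    while 2 ** (L + 1) * N < A:
        L += 1
    # Note: for near-silent bands the search ends at L <= 0, which is the
    # condition for switching to the low-energy mode of Section 4.2.
    return L

def q_prob(j, L):
    """Probability of a '1' bit in plane j, per Eq. (10)."""
    return 0.5 if j < L else 1.0 / (1.0 + 2.0 ** (2 ** (j - L)))

e = [37, 5, 0, 12, 3, 1, 0, 2]
L = lazy_plane(e)
print(L, [round(q_prob(j, L), 3) for j in range(6)])
```

The printed probabilities stay at 0.5 below the lazy plane and become strongly skewed for the most significant planes, which is exactly the behavior the arithmetic coder exploits.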
Although this BPGC coding process delivers excellent
compression performance for data that stem from an independent and identically distributed (iid) source with
Laplacian distribution [43], it lacks the capability to exploit the statistical dependencies between data samples that may exist in certain sources for better compression performance. These correlations can be captured
very effectively by incorporating context modeling technology into the BPGC coding process, where the frequency assignment for arithmetic coding of bit-plane symbols is not only dependent on the distance of the current bit
plane from the lazy-plane parameter as in the frequency
assignment rule [Eq. (10)], but also on other possible elements that may affect the probability distribution of these
bit-plane symbols.
In the context of lossless coding of IntMDCT spectral
data for audio, elements that possibly affect the distribution of the bit-plane symbols include the frequency locations of the IntMDCT spectral data, the magnitude of the
adjacent spectral lines, and the status of the AAC core
quantizer. In order to capture these correlations, several
contexts are used in the context-based arithmetic code
(CBAC) of SLS. The guiding principle in selecting these contexts is to find those that are most strongly correlated with the distribution of the bit-plane symbols.
In CBAC, three types of contexts are used, namely, the
frequency band (FB) context, the distance to lazy (D2L)
context, and the significant state (SS) context. The detailed
context assignments are summarized in the following.
Context 1: Frequency Band (FB) It is found in [44]
that the probability distribution of bit-plane symbols of
IntMDCT varies for different frequency bands. Therefore
in CBAC the IntMDCT spectral data are classified into
three different FB contexts, namely, low band (0–4 kHz),
midband (4–11 kHz), and high band (above 11 kHz).
Context 2: Distance to Lazy (D2L) The D2L context is
defined as the distance of the current bit plane j to the
BPGC lazy-plane parameter L, as defined in the following
equation:
$$\mathrm{D2L} = \begin{cases} 3 - j + L, & j - L \geq -2 \\ 6, & \text{otherwise.} \end{cases} \qquad (12)$$
This is motivated by the BPGC frequency assignment rule, Eq. (10), which is based on the fact that the skew of the probability distribution of bit-plane symbols from a source with a (near) Laplacian distribution tends to decrease as the D2L value decreases. To reduce the total number of D2L contexts, all the bit planes with j − L < −2 are grouped into one context in which all the bit-plane symbols are coded with probability 0.5.
Context 3: Significant State (SS) The SS context attempts to capture the factors that correlate with the distribution of the magnitude of the IntMDCT residual in one
place. These include the magnitude of the adjacent
IntMDCT spectral lines and the quantization interval of
the AAC core quantizer if it has been quantized previously
in the core encoder. Further detail on the SS context can be
found in [49].
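A sketch of how the first two context types could be computed (the SS context, which depends on neighboring spectral lines and the core quantizer state, is omitted). The numeric context labels here are arbitrary and not the normative indices:

```python
def fb_context(freq_hz):
    """Frequency-band context: low (<4 kHz), mid (4-11 kHz), high (>11 kHz)."""
    if freq_hz < 4000:
        return 0
    return 1 if freq_hz < 11000 else 2

def d2l_context(j, L):
    """Distance-to-lazy context per Eq. (12)."""
    return 3 - j + L if j - L >= -2 else 6

# Example: a bit-plane symbol at 6 kHz, two planes above the lazy plane.
print(fb_context(6000), d2l_context(j=5, L=3))   # 1 1
```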
4.2 Low-Energy-Mode Coding
The BPGC/CBAC coding process described earlier
works well for sources with (near) Laplacian distribution,
which usually is the case for most audio signals [50].
However, it was also found that for some music items
there are time/frequency (T/F) regions with very low energy levels where the IntMDCT spectral data are in fact
dominated by the rounding errors of the IntMDCT (see
Fig. 7) with a distribution that is substantially different
from Laplacian. In order to encode those low-energy regions efficiently, the BPGC/CBAC coding process is replaced by low-energy-mode coding, as shown in the
following.
The low-energy-mode coding is invoked for scale factor
bands for which the BPGC parameter L is smaller than or
equal to 0. Then the amplitude of the residual spectral data
e[k] is first converted into a unary binary string b = {b[0], b[1], . . . , b[pos], . . .}, as illustrated in Table 1, with M being the maximum bit plane. It can be seen that the probability distribution of these symbols is a function of the position pos and the distribution of e[k]:

$$\Pr\{b[pos] = 1\} = \Pr\{e[k] > pos \mid e[k] \geq pos\} \qquad (13)$$

where 0 ≤ pos < 2^M. Each b[pos] is then coded arithmetically, depending on its position pos and the BPGC parameter L, with a trained frequency table.
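The binarization of Table 1 is a unary code with a terminating zero for all amplitudes except the largest; a minimal sketch:

```python
def binarize_low_energy(amplitude, M):
    """Unary binarization of Table 1: 'amplitude' ones followed by a
    terminating zero, except that the largest value 2**M - 1 has no terminator."""
    if amplitude == 2 ** M - 1:
        return [1] * amplitude
    return [1] * amplitude + [0]

for a in range(4):
    print(a, binarize_low_energy(a, M=2))
# 0 [0]
# 1 [1, 0]
# 2 [1, 1, 0]
# 3 [1, 1, 1]
```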
4.3 Smart Decoding
Due to their fine-grain scalability, HD-AAC bit streams
can be truncated at any bit rate lower than what would be
needed for a fully lossless reconstruction to produce near-lossless representations of the audio signal. In the context
of arithmetic decoding, the smart decoding method provides a way to optimally decode such a truncated bit
stream. It decodes additional symbols in the absence of
incoming bits when a decoding buffer still contains meaningful information for arithmetic decoding in the CBAC/
BPGC mode and/or low-energy mode. Decoding continues up to the point where no ambiguity exists in determining a symbol [51].
5 OTHER CODING TOOLS
As counterparts to the underlying AAC perceptual audio coder, SLS provides a number of integerized versions
of AAC coding tools.
5.1 Integer M/S
In the AAC codec the M/S tool allows choosing individually between mid/side and left/right coding modes on a scale-factor-band basis. As shown in Section 2.3, the stereo IntMDCT can provide either a global left/right or a global mid/side spectral representation. In order to make the integer spectral values fit the spectral values from the AAC core on a scale-factor-band basis, an invertible integer version of the M/S mapping is used. It is based on a lifting decomposition of the normalized M/S matrix, that is, a rotation by π/4:

$$\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1-\sqrt{2} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \dfrac{1}{\sqrt{2}} & 1 \end{pmatrix} \begin{pmatrix} 1 & 1-\sqrt{2} \\ 0 & 1 \end{pmatrix}. \qquad (14)$$
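A sketch of the corresponding rounded lifting steps, assuming plain integer left/right inputs; the normative tool additionally has to mirror the per-band M/S decisions of the AAC core:

```python
import math

def int_ms_forward(l, r):
    """Invertible integer M/S via three rounded lifting steps of Eq. (14)."""
    a = 1.0 - math.sqrt(2.0)
    l = l + round(a * r)
    r = r + round(l / math.sqrt(2.0))
    l = l + round(a * r)
    return l, r            # approximately (l-r)/sqrt(2), (l+r)/sqrt(2)

def int_ms_inverse(l, r):
    """Exact inverse: subtract the identically rounded values in reverse order."""
    a = 1.0 - math.sqrt(2.0)
    l = l - round(a * r)
    r = r - round(l / math.sqrt(2.0))
    l = l - round(a * r)
    return l, r

L0, R0 = 1000, 998
m, s = int_ms_forward(L0, R0)
assert int_ms_inverse(m, s) == (L0, R0)   # exact reconstruction
```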
5.2 Integer TNS
When the temporal noise-shaping (TNS) tool is used in
the AAC core, the resulting MDCT spectral values deviate
from the IntMDCT spectral values. In order to compensate
for this, the same TNS filter as in the AAC core is applied
to the integer spectral values in the lossless enhancement.
To assure lossless operation, the TNS filter is converted to
a deterministic invertible integer filter.
Table 1. Binarization of IntMDCT error spectrum in low-energy mode.

    Amplitude of e[k]     Binary string {b[pos]}, pos = 0, 1, 2, . . .
    0                     0
    1                     1 0
    2                     1 1 0
    ...                   ...
    2^M − 2               1 1 ... 1 0
    2^M − 1               1 1 ... 1 1
6 BIT STREAM MULTIPLEXING
From a mechanical point of view, the SLS coded data,
including the core layer AAC bit stream and the enhancement bit stream, can be carried in multiple elementary
streams (ES) in an MPEG-4 system [52]. As shown in Fig.
12, the AAC bit stream is carried in the so-called base
layer ES, and the enhancement bit stream is carried in one
or more enhancement layer ESs. Each ES is thus composed of a sequence of access units (AUs), where one AU
contains one audio frame from the AAC bit stream or the
enhancement bit stream.
From an application point of view, such a bit stream
structure provides great flexibility in constructing either a
large-step scalable system or a fine-grain scalable system
with SLS. For example, in a scalable audio streaming application, the server stores multiple SLS ESs at predefined
bit rates, each assigned with a different stream priority.
During the streaming, the ESs are transmitted in the order
of their stream priority, and ESs with lower stream priority
are dropped by either the streaming server or the network
gateway whenever the transmission bandwidth is insufficient to stream a full-rate lossless bit stream. Alternatively
it is also possible to implement a lightweight bit stream
truncation algorithm in the streaming server or in the multimedia gateway to truncate the enhancement stream AUs
directly according to the available bandwidth to achieve fine-granular bit-rate scalability.
7 MODES OF OPERATION
So far the HD-AAC coder has been described as a combination of a regular AAC core layer coder (such as low
complexity AAC) and an SLS enhancement layer. This
section introduces further features offered by the SLS enhancement layer that concern its combination with the
AAC coder, and its ability to be used in combination with
other types of AAC-based codecs, such as AAC scalable
or AAC/BSAC.
7.1 Oversampling Mode
In the context of high-definition audio applications it is
frequently desirable to achieve lossless signal reconstruction at sampling rates of 96 kHz, or even 192 kHz. While
the MPEG-4 AAC coder supports these sampling rates, it
typically achieves best coding efficiency for high-quality
perceptual coding at sampling rates between 32 and 48
kHz. Thus in order to allow both efficient core layer coding at common rates and lossless reconstruction at higher
rates, MPEG-4 SLS includes an additional feature called
“oversampling mode.” This refers to the possibility of letting the lossless enhancement operate at a sampling rate
higher than that of the AAC core codec. The ratio between
the SLS sampling rate and the AAC sampling rate is called
“oversampling factor” and can be either 1, 2, or 4. For
example, the lossless enhancement can operate at a rate of
192 kHz, whereas the AAC core operates at 48 kHz, see
Table 2.
The mapping between the two coding layers is achieved
by using a time-aligned framing and a correspondingly
longer IntMDCT in the lossless enhancement. For example, an IntMDCT of size 4096 is used in the case of oversampling by a factor of 4, and the 1024 MDCT values from the AAC core are mapped to the lower 1024 IntMDCT values. In addition to allowing for optimum AAC performance, the lossless performance is improved when using longer IntMDCT transforms. (A transform length of 2048 or 4096 provides better lossless performance for stationary signals than a transform length of 1024.)

Table 2. Example combinations of sampling rates for AAC core and lossless enhancement.

                     AAC @ 48 kHz    AAC @ 96 kHz    AAC @ 192 kHz
    SLS @ 48 kHz          ×
    SLS @ 96 kHz          ×               ×
    SLS @ 192 kHz         ×               ×                ×

Fig. 12. Structure of MPEG-4 SLS bit stream.
7.2 Combination with MPEG-4 Scalable AAC
In order to account for varying or unknown transmission capacity, the MPEG-4 AAC codec also provides several scalable coding modes [3], [34]. MPEG-4 scalable
AAC allows perceptual audio coding with one or more
AAC mono or stereo layers and can be combined with an
SLS enhancement layer. This results in an overall coder
that offers fine-grain scalability in the range between lossless reconstruction and the AAC representation, and scalability in several steps within the perceptually coded AAC
representation.
7.3 Combination with MPEG-4 AAC/BSAC
Similar to the combination with scalable AAC, the SLS
enhancement can also be operated on top of MPEG-4
AAC/BSAC as a core layer coder. This provides a finegrain scalable representation both between lossless and
perceptually coded audio and within the perceptually
coded range. The latter has a granularity of 1 kbps per
channel.
7.4 Stand-Alone Operation
Finally the SLS lossless enhancement can also operate
as a straightforward stand-alone codec, without any underlying core codec. Nonetheless, this operation mode offers both full lossless coding capability and fine-grain
scalability.
8 PERFORMANCE
This section quantifies the performance of the HD-AAC
codec in various operation modes. While the compression
ratio can be seen as the sole measure of merit for lossless
operation scenarios, an evaluation of performance for
near-lossless operation requires audio quality measurements at various data rates and operating points.
8.1 Lossless Compression Performance
This section reports the performance of HD-AAC in
terms of its ability for lossless compression of various
audio material. As a figure of merit, the compression ratio
is defined as
$$\text{compression ratio} = \frac{\text{original file size}}{\text{compressed file size}}. \qquad (15)$$
Tables 3 and 4 show the lossless compression performance for two major sets of test material: the MPEG-4 lossless audio coding test set (donated by Matsushita Corporation and containing in part recordings performed by the New York Symphonic Ensemble) and a set of commercial CD recordings. For both sets the compression results are given for an HD-AAC configuration with an AAC core layer running at 128 kbit/s stereo plus SLS enhancement, and for an SLS stand-alone configuration.
As could be expected from theory, it can be observed in
Table 3 that an increase in word length reduces the average compression ratio (due to the fact that the least significant bits of the PCM codewords are more random and
thus less compressible). On the other hand, increasing the
sampling rate improves compression because of the increased correlation between adjacent samples (assuming
sound material with typical high-frequency characteristics).
For the MPEG-4 lossless audio test set, an average compression ratio of 2:1 can be achieved easily at a sampling
rate of 48 kHz and 16-bit word length. This is competitive
with the best of other known lossless compression systems
[53]. It can also be observed that for the AAC-based mode
an additional bit rate of only 30–40 kbps is required compared to the stand-alone mode for lossless representation.
This reduces the bit-rate consumption by 90–100 kbps compared to simulcast solutions that transmit both a 128-kbps AAC bit stream and a stand-alone SLS lossless bit stream simultaneously.
8.2 Near-Lossless Compression Performance
The bit-plane coding of residual spectral values (that is, of the AAC quantization error) allows the initial AAC quantization to be refined successively as more bits from the SLS enhancement layer are decoded. With each additional decoded bit plane the quantization error is reduced by 6 dB. Consequently an increasing safety margin with respect to audibility is added as the bit rate increases.
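The 6-dB figure corresponds to halving the residual quantization step size with each additional bit plane:

$$\Delta \mathrm{SNR} = 20 \log_{10} 2 \approx 6.02\ \text{dB per bit plane}.$$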
Table 3. Lossless compression results for MPEG-4 lossless audio test set.

                       SLS + AAC @ 128 kbps/stereo             SLS stand-alone
                       (AAC @ 48 kHz sampling rate)
                       Compression    Average bit              Compression    Average bit
                       ratio          rate (kbps)              ratio          rate (kbps)
    48 kHz/16 bit      2.09           735                      2.20           698
    48 kHz/24 bit      1.55           1490                     1.58           1454
    96 kHz/24 bit      2.09           2201                     2.13           2160
    192 kHz/24 bit     2.60           3543                     2.63           3509
    Overall            2.08           1992                     2.12           1955
8.2.1 Evaluation of Near-Lossless Audio Quality
While it may seem sufficient for most purposes to provide perceptually transparent reproduction of audio signals
by using conventional perceptual audio coders (such as
AAC at a sufficient bit rate), there are applications that
demand still higher audio quality. This is especially the
case for professional audio production facilities, such as
archiving and broadcasting, in which audio signals may
undergo many cycles of encoding/decoding (tandem coding) before being delivered to the consumer. This leads to
an accumulation of introduced coding distortion and may
lead to unacceptable final audio quality [54], unless substantial headroom toward audibility is provided by each
coding step, for example, by using coding algorithms with
very high quality or bit rate.
ITU-R BS.1548-1 [55] defines the requirements for audio coding systems for digital broadcasting, assuming a
codec chain consisting of so-called contribution, distribution, and emission codecs. According to this recommendation, and based on ITU-R BS.1116-1 [56], audio codecs
for contribution and distribution should fulfill the following requirements:
The quality of sound reproduced after a reference contribution/
distribution cascade [. . .] should be subjectively indistinguishable from the source for most types of audio programme material. Using the triple stimuli double blind with hidden reference
test, described in Recommendation ITU-R BS.1116 [. . .], this
requires mean scores generally higher than 4.5 in the impairment 5-grade scale, for listeners at the reference listening position. The worst rated item should not be graded lower than 4.
In accordance with these recommendations, tests were
run on signals encoded and decoded with HD-AAC. Instead
of running numerous listening tests for subjective quality
assessment at individual operating points, the evaluation
employed the PEAQ measurement (BS.1387-1) [57],
which provides methods for objective measurements of
perceived audio quality in scenarios that are normally assessed by ITU-R BS.1116 testing. The most essential results can be seen in Figs. 13–15; see also [58].
The graphs show the estimated subjective sound quality
expressed as objective difference grade (ODG) values,
which were computed by a PEAQ measurement. The
evaluation procedure consists of multiple cycles of tandem
coding/decoding with up to 16 cycles. The standard set of
critical MPEG-4 audio items for perceptual audio coding
evaluations was used. ODG values of 0, −1, −2, −3, −4
correspond to a subjective audio quality of “indistinguishable from original,” “perceptible but not annoying,” “slightly
annoying,” “annoying,” and “very annoying,” respectively.
Fig. 13 shows the achieved ODG values as a function of
tandem cycles for a traditional AAC coder running at a bit
rate of 128 kbps/stereo. As expected, it can be observed
that the audio quality degrades significantly with an increasing number of tandem cycles, depending on the test
item. For this reason tandem coding is not a recommended
practice for such coders.
Fig. 14 displays the corresponding tandem coding results for the HD-AAC combination running at 512 kbps/
stereo (AAC at 128 kbps + SLS enhancement at 384 kbps).
It can be noted that the audio quality remains consistently
at a very high level, even after a total of 16 tandem cycles.
This illustrates the high robustness of the HD-AAC representation against tandem coding. According to these
measurements, the aforementioned BS.1548-1 audio quality requirement is fulfilled with a considerable safety margin. Furthermore, when placed in tandem with AAC (such as for final audio distribution over a narrow-band channel), the resulting audio quality is not degraded significantly by the preceding HD-AAC tandem cascade. Further details can be found in [59].

Table 4. Lossless compression results for commercial CD test set.

                                                                             Compression ratio
    CD Items (16 bit/44.1 kHz)                                          SLS+AAC @ 128 kbps/Stereo   SLS Stand-Alone
    ACDC—Highway to Hell (Sony 80206)                                   1.31                        1.36
    Avril Lavigne—Let Go (Arista 14740)                                 1.36                        1.41
    Backstreet Boys—Greatest Hits Chapter One (Jive 41779)              1.39                        1.45
    Brian Setzer—The Dirty Boogie (Interscope 90183)                    1.43                        1.49
    Cowboy Junkies—Trinity Session (RCA-8568)                           1.93                        2.04
    Grieg—Peer Gynt, von Karajan (DG 439010)                            2.63                        2.83
    Jennifer Warnes—Famous Blue Raincoat (BMG 258418)                   2.07                        2.20
    Marlboro Music Festival—DISC A (Bridge 9108)                        2.23                        2.35
    Marlboro Music Festival—DISC B (Bridge 9108)                        2.20                        2.33
    Nirvana—Nirvana (Interscope 493523)                                 1.50                        1.56
    Philip Jones—40 Famous Marches, CD1 (Decca 416241)                  1.99                        2.11
    Philip Jones—40 Famous Marches, CD2 (Decca 416241)                  2.00                        2.11
    Pink Floyd—Dark Side of the Moon (Capitol 46001)                    1.76                        1.85
    Rebecca Pidgeon—The Raven (Chesky 115)                              1.88                        1.97
    Ricky Martin (Sony 69891)                                           1.33                        1.38
    Schubert Piano Trio in E-flat (Sony 48088)                          2.74                        2.90
    Spaniels—The Very Best Of (Collectables 7243)                       2.41                        2.62
    Steeleye Span—Below the Salt (Shanachie 79039)                      1.85                        1.95
    Suzanne Vega—Solitude Standing (A&M 5136)                           1.74                        1.83
    Westminster Concert Bell Choir—Christmas Bells (Gothic Records 49055)  2.55                     2.71
    Overall                                                             1.85                        1.94
Fig. 13. Test results: AAC tandem coding.
Fig. 14. Test results: HD-AAC tandem coding.

8.2.2 Stand-Alone SLS Operation
The SLS codec can also operate as a stand-alone lossless codec when the AAC core codec is not used, sometimes also referred to as the "noncore mode." Despite its
simple structure (only IntMDCT and BPGC/CBAC modules are used), this mode allows efficient lossless coding
[59]. Furthermore, fine-grain scalability by truncated bitplane coding is also possible in this mode. Given that the
stand-alone SLS codec does not include any perceptual
model to estimate masking thresholds, it is interesting to
investigate the audio quality resulting from a truncation of
the SLS bit stream.
Due to the behavior of bit-plane coding in this mode, a
constant signal-to-noise ratio is achieved in each scale-factor band. With each additional bit plane the signal-to-noise ratio improves by 6 dB. While this behavior does not
allow SLS to compete with efficient perceptual codecs at
low bit rates (for example, AAC at 128 kbps/stereo), this
simple approach works quite well at higher bit rates in the
near-lossless range. Fig. 15 shows tandem coding results
for the stand-alone SLS codec operating at 512 kbps/
stereo. It reaches about the same near-lossless audio quality as the AAC-based HD-AAC mode discussed in the
previous section.
At a constant bit rate of 768 kbps most test items still
require the truncation of some coder frames. Nevertheless
the corresponding PEAQ measurements indicate that no
degradation of subjective audio quality occurs in this tandem coding scenario for both the AAC-based mode and
the stand-alone mode; see [59]. This provides an interesting operating point for HD-AAC modes, corresponding to
a guaranteed 2:1 compression. While other stand-alone
lossless codecs can also provide an average compression
of 2:1 for suitable test material, their peak compression
performance can be much lower, depending on the audio
material to be encoded. In contrast, HD-AAC is able to
guarantee a certain compression ratio while providing
lossless or near-lossless signal representation, depending
on the input signal.
9 DECODER COMPLEXITY
The computational complexity for SLS decoding can be
evaluated by counting the total number of standard instructions (multiplications, additions, bit shifts, comparisons, memory transfers) required for performing the decoding process on a generic 32-bit fixed-point CPU.
The main components contributing to the computational
complexity of SLS are:
1) IntMDCT filter bank
2) Bit-plane arithmetic decoder
3) AAC Huffman decoding
4) AAC + SLS inverse error mapping
5) Integer M/S stereo coding
6) Unpacking of tables.
Items 3) to 5) are only required in the AAC-based mode,
item 6) only if the necessary tables are not precomputed.
9.1 Number of Instructions
Table 5 lists the number of instructions required for decoding in the AAC-based mode, with the AAC core operating at 64 kbps per channel. Table 6 shows the corresponding numbers for the SLS in stand-alone mode (without the
AAC core layer). For both tables, values are provided for
both implementations with and without prior unpacking of the tables.
Fig. 15. Test results: stand-alone SLS tandem coding.
9.2 ROM Requirements
For an implementation of SLS in stand-alone mode, a
ROM size of 4 kbytes is required. For the AAC-based
mode, the ROM requirement is 45 kbytes. As can be seen
from Tables 5 and 6, a tradeoff between ROM requirement
and number of instructions can be made by precomputing
the necessary table values. More details on SLS computational complexity can be found in [59]. The computational complexity of AAC decoding is analyzed in [60].
10 APPLICATIONS
As the primary functionality of HD-AAC audio coding
is lossless audio coding, it can be used in applications that
require bit-exact reconstruction, such as studio operation,
music disc delivery, or audio archiving. Due to its inherent
scalability, HD-AAC audio coding technology in fact fits
into virtually every application that requires audio compression. Several potential application scenarios are listed
here.
Studio Operations HD-AAC audio coding technology
is useful for the storage of audio at various points in studio
operations such as recording, editing, mixing, and premastering, as studio procedures are designed to preserve the
highest levels of quality. The scalability of the SLS layer
also provides a nice solution to situations in which the bandwidth is not sufficient to support fully lossless quality.
Archival Application Archives of sound recordings
are very common in studios, record labels, libraries, and so
on. These archives are tremendously large, so compression is essential. In addition, the scalability of the SLS technology makes it possible to extract lower bit-rate versions of the archive's lossless audio items at any time, to allow applications such as remote data browsing.
Broadcast Contribution/Distribution Chain In a
broadcast environment HD-AAC audio coding technology
could be used in all stages comprising archiving, contribution/distribution, and emission. In the broadcast chain
one main feature of the technology can be used: In every
stage where lower bit rates are required, the bit stream is
merely truncated, and no reencoding is therefore required.
Consumer Disc-Based Delivery HD-AAC technology
can also be used in consumer disc-based delivery of music
content. It enables the music disc to deliver both lossless
and lossy audio on the same medium.
Internet Delivery of Audio In such an application scenario the available transmission bandwidth can vary dramatically across different access network technologies and
over time. As a result, the same audio content at a variety
of bit rates and qualities may need to be kept ready at the
server side. HD-AAC technology provides a “one-file”
solution for such a requirement.
Audio Streaming HD-AAC technology delivers the
vital bit-rate scalability for streaming applications on
channels with variable quality of service (QoS) conditions.
Examples of this kind of streaming application include
Internet audio streaming and multicast streaming applications that feed several channels of differing capacity.
Digital Home The idea of the digital home is to create an open and transparent home network platform that enables consumers to easily create, use, manage, and share digital content such as audio, video, or images. In a typical setup for audio, the user can download the HD-AAC coded bit streams in lossless quality from the service provider and archive them on the home music server. These bit streams are then streamed, or downloaded, to different audio terminals at differing quality for playback.

Table 5. Maximum numbers of INT32 operations per sample for SLS decoding with AAC core.

    Frame Length    Muls     Adds/Subs    Shifts    Ors      Negs     Movs     Combined (all ops = 1 cycle)

    Tables preunpacked
    4096 or 512     19.50    93.46        34.50     72.60    34.26    16.54    270.86
    2048 or 256     20.25    91.22        31.50     73.54    32.25    16.29    265.05
    1024 or 128     18.00    88.99        28.50     68.50    30.85    15.99    250.83

    Tables unpacked in place
    4096 or 512     19.50    123.21       34.50     88.01    34.26    16.54    316.10
    2048 or 256     20.25    117.47       31.50     85.54    32.25    16.29    303.30
    1024 or 128     18.00    105.24       28.50     75.00    30.85    15.99    273.58

Table 6. Maximum number of INT32 operations per sample for SLS stand-alone decoding.

    Frame Length    Muls     Adds/Subs    Shifts    Ors      Negs     Movs     Combined (all ops = 1 cycle)

    Tables preunpacked
    4096 or 512     18.00    86.63        34.50     67.10    29.93    9.54     245.70
    2048 or 256     18.75    84.39        31.50     68.04    27.92    9.29     239.89
    1024 or 128     16.50    82.16        28.50     63.00    26.52    8.99     225.67

    Tables unpacked in place
    4096 or 512     18.00    116.38       34.50     82.60    29.93    9.54     290.05
    2048 or 256     18.75    110.64       31.50     80.04    27.92    9.29     278.14
    1024 or 128     16.50    98.41        28.50     69.50    26.52    8.99     248.42
11 CONCLUSIONS
The new ISO/MPEG specification for scalable lossless
coding extends the well-known perceptual coding scheme
AAC toward lossless and near-lossless operation, and in
this way enables its use in the context of high-definition
applications. The HD-AAC scheme offers competitive
lossless compression rates at all relevant operating points
(word length and sampling rate). For distribution on bandwidth-limited channels a perceptually coded compatible
AAC bit stream can simply be extracted from the composite HD-AAC stream. Alternatively, the SLS part can
also be used as a simple and versatile stand-alone compression engine. In both cases the fidelity of the signal
representation can be scaled with fine granularity within a
wide range of near-lossless representations. This enables lossless or near-lossless transmission of high-definition
audio with a guaranteed maximum rate. We anticipate that
this flexibility will make HD-AAC the technology of
choice for many applications that call for both very high
audio quality and delivery over a wide range of transmission channels.
12 ACKNOWLEDGMENT
The authors would like to thank all their colleagues at
the MPEG audio subgroup who supported the lossless
standardization activity, especially Takehiro Moriya
(NTT) for inspiring these standardization activities, Tilman Liebchen (Technical University of Berlin) for chairing the ad hoc group on lossless audio coding, and Yuriy
Reznik (Real Networks/Qualcomm) for his thorough complexity evaluations.
13 REFERENCES
[1] ISO/IEC 11172-3, “Coding of Moving Pictures and
Associated Audio for Digital Storage Media at up to about
1.5 Mbit/s—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (1992).
[2] ISO/IEC 13818-3, “Information Technology—
Generic Coding of Moving Pictures and Associated Audio—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (1994).
[3] ISO/IEC 14496-3:2001, “Coding of Audio-Visual
Objects—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (2001).
[4] ISO/IEC 14496-3:2001/Amd.1:2003, “Coding of
Audio-Visual Objects—Part 3: Audio, Amendment 1:
Bandwidth Extension,” International Standards Organization, Geneva, Switzerland (2003).
[5] M. Dietz, L. Liljeryd, K. Kjörling, and O. Kunz,
“Spectral Band Replication—A Novel Approach in Audio
Coding,” presented at the 112th Convention of the Audio
Engineering Society, J. Audio Eng. Soc. (Abstracts), vol.
50, pp. 509, 510 (2002 June), convention paper 5553.
[6] ISO/IEC 14496-3:2001/Amd.2:2004, “Coding of
Audio-Visual Objects—Part 3: Audio, Amendment 2:
Parametric Coding for High Quality Audio,” International
Standards Organization, Geneva, Switzerland (2004).
[7] C. den Brinker, E. Schuijers, and W. Oomen, “Parametric Coding for High-Quality Audio,” presented at the
112th Convention of the Audio Engineering Society, J.
Audio Eng. Soc. (Abstracts), vol. 50, p. 510 (2002 June),
convention paper 5554.
[8] ISO/IEC FCD 23003-1, “MPEG-D (MPEG Audio
Technologies)—Part 1: MPEG Surround” International
Standards Organization, Geneva, Switzerland (2006).
[9] J. Breebaart, J. Herre, C. Faller, J. Roeden, F. Myburg, S. Disch, H. Purnhagen, G. Hotho, M. Neusinger, K.
Kjoerling, and W. Oomen, “MPEG Spatial Audio Coding/
MPEG Surround: Overview and Current Status,” presented at the 119th Convention of the Audio Engineering
Society, J. Audio Eng. Soc. (Abstracts), vol. 53, p. 1228
(2005 Dec.), convention paper 6599.
[10] “Special Issue: High-Resolution Audio,” J. Audio
Eng. Soc., vol. 52, pp. 116–260 (2004 Mar.).
[11] “DVD Forum,” http://www.dvdforum.org/
forum.shtml (2004).
[12] Royal Philips Electronics, “Super Audio CD Systems,” http://www.licensing.philips.com/information/
sacd/ (2006).
[13] Blu-Ray Disc Association, “Blu-Ray Disc,” http://
www.blu-raydisc.com/ (2006).
[14] Dolby Laboratories Inc., “MLP Lossless” http://
www.dolby.com/consumer/technology/mlp_lossless.html
(2006).
[15] M. A. Gerzon, P. G. Craven, J. R. Stuart, M. J.
Law, and R. J. Wilson, “The MLP Lossless Compression
System,” in Proc 17th AES Conf. (Florence, Italy, 1999
Sept.), pp. 61–75.
[16] DTS Inc., “DTS HD,” http://www.dtsonline.com/
consumer/dtshd.php (2006).
[17] Apple Computer Inc., “Apple Quicktime,” http://
www.apple.com/downloads/macosx/apple/quicktime651
.html (2006).
[18] J. Coalson, “FLAC—Free Lossless Audio Codec,”
http://flac.sourceforge.net (2006).
[19] M. T. Ashland, “Monkey’s Audio—A Fast and
Powerful Lossless Audio Compressor,” http://www
.monkeysaudio.com (2004).
[20] F. Ghido, “OptimFROG,” http://www.losslessaudio
.org (2006).
[21] ITU-T G.729.1, “G.729 Based Embedded Variable
Bit-Rate Coder: An 8–32 kbit/s Scalable Wideband Coder
Bitstream Interoperable with G.729,” International Telecommunication Union, Geneva, Switzerland (2006).
[22] B. Grill, "A Bit-Rate Scalable Perceptual Coder for MPEG-4 Audio," presented at the 103rd Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p. 1005 (1997 Nov.), preprint 4620.
[23] J. Herre, E. Allamanche, K. Brandenburg, M.
Dietz, B. Teichmann, B. Grill, A. Jin, T. Moriya, N.
Iwakami, T. Norimatsu, M. Tsushima, and T. Ishikawa,
“The Integrated Filterbank-Based Scalable MPEG-4 Audio Coder,” presented at the 105th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts),
vol. 46, p. 1039 (1998 Nov.), preprint 4810.
[24] S. H. Park, Y. B. Kim, S. W. Kim, and Y. S. Seo,
“Multi-Layer Bit-Sliced Bit-Rate Scalable Audio Coding,”
presented at the 103rd Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p.
1005 (1997 Nov.), preprint 4520.
[25] M. Nishiguchi, “MPEG-4 Speech Coding,” in
Proc. 17th AES Conf. (Florence, Italy, 1999 Sept.), pp.
139–146.
[26] M. Nishiguchi, A. Inoue, Y. Maeda, and J. Matsumoto, “Parametric Speech Coding—HVXC at 2.0–4.0
kbps,” presented at the IEEE Workshop on Speech Coding, Porvoo, Finland (1999 June).
[27] M. S. Schroeder and B. S. Atal, “Code-Excited
Linear Prediction (CELP): High-Quality Speech at Very
Low Bit Rates,” in Proc. IEEE ICASSP (Tampa, FL, 1985
Mar.), pp. 937–940.
[28] T. Nomura, M. Iwadare, M. Serizawa, and K.
Ozawa, “A Bitrate and Bandwidth Scalable CELP Coder,”
in Proc. IEEE ICASSP (Seattle, WA, 1998 May), pp.
341–344.
[29] ISO/IEC JTC1/SC29/WG11, “Final Call for Proposals on MPEG-4 Lossless Audio Coding,” MPEG2002/
N5208, Shanghai, China (2002 Oct.).
[30] ISO/IEC 14496-3:2001/Amd.6:2005, “Coding of
Audio-Visual Objects—Part 3: Audio, Amendment 6:
Lossless Coding of Oversampled Audio,” International
Standards Organization, Geneva, Switzerland (2005)
[31] E. Knapen, D. Reefman, E. Janssen, and F. Bruekers, “Lossless Compression of 1-Bit Audio,” J. Audio Eng.
Soc., vol. 52, pp. 190–199 (2004 Mar.).
[32] ISO/IEC 14496-3:2005/Amd.2:2006, “Coding of
Audio-Visual Objects—Part 3: Audio, Amendment 2: Audio Lossless Coding (ALS), New Audio Profiles and
BSAC Extensions,” International Standards Organization,
Geneva, Switzerland (2006).
[33] ISO/IEC 14496-3:2005/Amd.3:2006, “Coding of
Audio-Visual Objects—Part 3: Audio, Amendment 3:
Scalable Lossless Coding (SLS),” International Standards
Organization, Geneva, Switzerland (2006).
[34] J. Herre and H. Purnhagen, “General Audio Coding,” in The MPEG-4 Book, IMSC Multimedia Ser., F.
Pereira and T. Ebrahimi, Eds. (Prentice-Hall, Englewood
Cliffs, NJ, 2002).
[35] M. Bosi, K. Brandenburg, S. Quackenbush, L.
Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y. Oikawa, “ISO/IEC MPEG-2 Advanced Audio Coding,” J. Audio Eng. Soc., vol. 45, pp. 789–814
(1997 Oct.).
[36] ISO/IEC JTC1/SC29/WG11, “Report on the
MPEG-2 AAC Stereo Verification Tests,” MPEG1998/
N2006, San Jose, CA (1998 Feb.).
[37] J. Princen, A. Johnson, and A. Bradley, “Subband/
Transform Coding Using Filter Bank Designs Based on
Time Domain Aliasing Cancellation,” in Proc IEEE
ICASSP (Dallas, TX, 1987), pp. 2161–2164.
[38] J. Herre and J. D. Johnston, “Enhancing the Performance of Perceptual Audio Coders by Using Temporal
Noise Shaping (TNS),” presented at the 101st Convention
of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.), preprint 4384.
[39] J. D. Johnston, J. Herre, M. Davis, and U. Gbur,
“MPEG-2 NBC Audio—Stereo and Multichannel Coding
Methods,” presented at the 101st Convention of the Audio
Engineering Society, J. Audio Eng. Soc. (Abstracts), vol.
44, p. 1175 (1996 Dec.), preprint 4383.
[40] R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, “Audio Coding Based on Integer Transforms,” presented at the 111th Convention of the Audio Engineering
Society, J. Audio Eng. Soc. (Abstracts), vol. 49, p. 1230
(2001 Dec.), convention paper 5471.
[41] R. Geiger, Y. Yokotani, and G. Schuller, “Improved Integer Transforms for Lossless Audio Coding,” in
Proc. 37th Asilomar Conf. on Signals, Systems and Computers (Pacific Grove, CA, 2003 Nov.).
[42] R. Geiger, J. Herre, J. Koller, and K. Brandenburg,
“IntMDCT—A Link between Perceptual and Lossless Audio Coding,” in Proc. IEEE ICASSP (Orlando, FL, 2002
May).
[43] R. Yu, C. C. Ko, S. Rahardja, and X. Lin, “Bit-Plane Golomb Code for Sources with Laplacian Distributions,” in Proc. IEEE ICASSP (Hong Kong, China, 2003
Apr.), pp. 277–280.
[44] R. Yu, X. Lin, S. Rahardja, C. C. Ko, and H.
Huang, “Improving Coding Efficiency for MPEG-4 Audio
Scalable Lossless Coding,” in Proc. IEEE ICASSP (Philadelphia, PA, 2005 May).
[45] I. Daubechies and W. Sweldens, “Factoring Wavelet Transforms into Lifting Steps,” Tech. Rep., Bell Laboratories, Lucent Technologies (1996).
[46] F. Bruekers and A. van den Enden, “New Networks for Perfect Inversion and Perfect Reconstruction,” IEEE J. Selected Areas Comm., vol. 10, pp. 130–137 (1992 Jan.).
[47] R. Geiger, Y. Yokotani, G. Schuller, and J. Herre,
“Improved Integer Transforms Using Multi-Dimensional
Lifting,” in Proc. IEEE ICASSP (Montreal, Canada, 2004
May).
[48] Y. Yokotani, R. Geiger, G. Schuller, S. Oraintara,
and K. R. Rao, “Improved Lossless Audio Coding Using
the Noise-Shaped IntMDCT,” presented at the IEEE 11th
DSP Workshop, Taos Ski Valley, NM (2004 Aug.).
[49] R. Yu, X. Lin, S. Rahardja, and H. Huang, “Proposed Core Experiment for Improving Coding Efficiency in MPEG-4 Audio Scalable Coding (SLS),” ISO/IEC JTC1/SC29/WG11, M10683, Munich, Germany (2004 Mar.).
[50] R. Yu, X. Lin, S. Rahardja, and C. C. Ko, “A
Statistics Study of the MDCT Coefficient Distribution for
Audio,” in Proc. ICME (Taipei, Taiwan, 2004 June).
[51] K. H. Choo, E. Oh, J. H. Kim, and C. Y. Son, “Enhanced Performance in the Functionality of Fine Grain Scalability,” presented at the 119th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 53, pp. 1227, 1228 (2005 Dec.), convention paper 6597.
[52] ISO/IEC 14496-1:2004, “Coding of Audio-Visual
Objects—Part 1: Systems,” International Standards Organization, Geneva, Switzerland (2004).
[53] M. Hans and R. W. Schafer, “Lossless Compression of Digital Audio,” IEEE Signal Process. Mag., vol.
18, pp. 21–32 (2001 July).
[54] AES Technical Committee of Coding of Audio
Signals, “Perceptual Audio Coders: What to Listen For,”
CD-ROM with tutorial information and audio examples,
Audio Engineering Society, New York (2001).
[55] ITU-R BS.1548-1, “User Requirements for Audio
Coding Systems for Digital Broadcasting,” International
Telecommunication Union, Geneva, Switzerland
(2001–2002).
[56] ITU-R BS.1116-1, “Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems,” International Telecommunication Union, Geneva, Switzerland (1994).
[57] ITU-R BS.1387-1, “Method for Objective Measurements of Perceived Audio Quality,” International
Telecommunication Union, Geneva, Switzerland (1998).
[58] R. Geiger, M. Schmidt, J. Herre, and R. Yu,
“MPEG-4 SLS—Lossless and Near-Lossless Audio Coding Based on MPEG-4 AAC,” presented at the International Symposium on Communications, Control and Signal Processing, Marrakech, Morocco (2006 Mar.).
[59] ISO/IEC JTC1/SC29/WG11, “Verification Report
on MPEG-4 SLS,” MPEG2005/N7687, Nice, France
(2005 Oct.).
[60] ISO/IEC JTC1/SC29/WG11, “Revised Report on
Complexity of MPEG-2 AAC Tools,” MPEG1999/N2957
(Melbourne, Australia, 1999 Oct.), http://www.chiariglione.org/mpeg/working_documents/mpeg-02/audio/AAC_tool_complexity(rev).zip
THE AUTHORS
Ralf Geiger received a diploma degree in mathematics
from the University of Regensburg, Regensburg, Germany, in 1997.
In 1998 he joined the Audio/Multimedia Department at
the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany. From 2000 to 2004 he was with the
Fraunhofer Institute for Digital Media Technology
(IDMT), Ilmenau, Germany. In 2005 he returned to Fraunhofer IIS.
Mr. Geiger is working on the development and standardization of perceptual and lossless audio coding
schemes. He is a coeditor of the ISO/IEC standard
“MPEG-4 Scalable Lossless Coding (SLS).”
●
Rongshan Yu received a B.Eng. degree from Shanghai
Jiaotong University, Shanghai, P. R. China, in 1995, and
M. Eng. and Ph.D. degrees from the National University
of Singapore in 2000 and 2005, respectively.
He was with the Centre for Signal Processing, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, from 1999 to 2001, and with the Institute for Infocomm Research (I2R), A*STAR, Singapore,
from 2001 to 2005. He is currently with Dolby Laboratories, San Francisco, CA, USA. His research interests include audio coding, data compression, and digital signal
processing.
Dr. Yu received the Tan Kah Kee Young Inventors’
Gold Award (Open Category) in 2003. He has been working on the development and standardization of MPEG
lossless audio coding schemes and coedited the ISO/IEC
standard “MPEG-4 Scalable Lossless Coding (SLS).”
●
Jürgen Herre joined the Fraunhofer Institute for Integrated Circuits (IIS) in Erlangen, Germany, in 1989. Since
then he has been involved in the development of perceptual coding algorithms for high-quality audio, including
the well-known ISO/MPEG-Audio Layer III coder (aka
MP3). In 1995 he joined Bell Laboratories for a postdoctoral term, working on the development of MPEG-2 advanced audio coding (AAC). Since the end of 1996 he has
been back at Fraunhofer, working on the development of
advanced multimedia technology, including MPEG-4,
MPEG-7, and secure delivery of audiovisual content, currently as the chief scientist for the audio/multimedia activities at Fraunhofer IIS, Erlangen.
Dr. Herre is a fellow of the Audio Engineering Society,
cochair of the AES Technical Committee on Coding of
Audio Signals, and vice chair of the AES Technical Council. He also is an IEEE senior member, served as an associate editor of the IEEE Transactions on Speech and
Audio Processing, and is an active member of the MPEG
audio subgroup. Outside his professional life he is a dedicated amateur musician.
●
Susanto Rahardja received a Ph.D. degree in electrical
and electronic engineering from the Nanyang Technological University, Singapore.
He joined the Centre for Signal Processing, Nanyang
Technological University, in 1996 and he has been a faculty member at its School of Electrical and Electronic
Engineering since 2001. In 2002 he joined the Institute for
Infocomm Research (I2R) and is currently the director of
its Media Division. He oversees research in
signal processing (audio coding, video/image processing),
media analysis (text/speech, image, video), media security
(biometrics, computer vision, and surveillance), and sensor networks.
Dr. Rahardja has published more than 150 papers in international journals and conference proceedings in the areas of digital
communications, signal processing, and logic synthesis.
He was the recipient of the IEE Hartree Premium Award
for best journal paper published in the IEE Proceedings in
2001 and the Tan Kah Kee Young Inventors’ Gold Award
(Open Category) in 2003. He has served in various IEEE
and SPIE-related professional activities in the area of multimedia and is actively involved in technical committees
of the IEEE Circuits and Systems and Signal Processing
societies. He is an associate editor of the Journal of Visual Communication and Image Representation. Since 2002 he has been an active participant in MPEG and has been involved in the development of MPEG-4 Audio, where he is one of the contributors to MPEG-4 SLS and ALS.
●
Sang-Wook Kim received a B.A. degree in electrical
engineering from Yonsei University, Korea, in 1989.
He completed his master’s degree in 1991 at the Korea
Advanced Institute of Science and Technology while researching digital signal processing.
●
Xiao Lin received a Ph.D. degree from the Electronics
and Computer Science Department of the University of
Southampton, Southampton, UK, in 1993.
He worked at the Centre for Signal Processing,
Nanyang Technological University, Singapore, as a research fellow and senior research fellow for about five
years. Subsequently he joined DeSOC Technology as a
technical director and then the Institute for Infocomm Research in 2002. There he was a member of technical staff,
lead scientist, and principal scientist and managed the Media Processing Department until 2006. He is now with
Fortemedia Inc. as a senior director. He actively participated in international standards such as MPEG-4,
JPEG2000, and JVT.
Dr. Lin is a senior member of the IEEE.
●
Markus Schmidt received a Dipl.-Ing. degree in media
technology from the Technical University of Ilmenau,
Germany, in 2004. During his studies he spent a year at
the University of Strathclyde in Glasgow, Scotland.
After completing an internship at the Fraunhofer Institute
for Digital Media Technology (IDMT) in Ilmenau, he
joined the Fraunhofer Institute for Integrated Circuits
(IIS), Erlangen, Germany, in 2005. His research interests include low-delay and lossless audio coding schemes
and their implementation in real-time environments.