ISO/IEC MPEG-4 High-Definition Scalable Advanced Audio Coding*

RALF GEIGER,1 (ralf.geiger@iis.fraunhofer.de), RONGSHAN YU,2** (rzyu@dolby.com), JÜRGEN HERRE,1 (juergen.herre@iis.fraunhofer.de), SUSANTO RAHARDJA,2 (rsusanto@i2r.a-star.edu.sg), SANG-WOOK KIM,3 (sangwookkim@samsung.com), XIAO LIN,2 (linxiao@fortemedia.com.cn), AND MARKUS SCHMIDT1 (markus.schmidt@iis.fraunhofer.de)

1 Fraunhofer IIS, Erlangen, Germany
2 Institute for Infocomm Research, Singapore
3 Samsung Electronics, Suwon, Korea

*Manuscript received 2006 March 23; revised 2006 October 24 and November 27.
**Currently with Dolby Laboratories, San Francisco, CA.

Recently the MPEG Audio standardization group successfully concluded the standardization process on technology for lossless coding of audio signals. A summary of the scalable lossless coding (SLS) technology, one of the results of this standardization work, is given. MPEG-4 scalable lossless coding provides a fine-grain scalable lossless extension of the well-known MPEG-4 AAC perceptual audio coder up to fully lossless reconstruction at word lengths and sampling rates typically used for high-resolution audio. The underlying innovative technology is described in detail, and its performance is characterized for lossless and near-lossless representation, both in conjunction with an AAC coder and as a stand-alone compression engine. A number of application scenarios for the new technology are discussed.

0 INTRODUCTION

Perceptual coding of high-quality audio signals has experienced a tremendous evolution over the past two decades, both in terms of research progress and in worldwide deployment in products. Examples of successful applications include portable music players, Internet audio, audio for digital media (such as VCD and DVD), and digital broadcasting. Several phases of international standardization have been conducted successfully [1]–[3]. Many recent research and standardization efforts focus on achieving good sound quality at even lower bit rates [4]–[9] to accommodate storage and transmission channels with limited capacity (such as terrestrial and satellite broadcasting and third-generation mobile telecommunication).

Nevertheless, for other scenarios with higher transmission bandwidth available, there is a general trend toward providing the consumer with a media experience of extremely high fidelity [10], as it is frequently associated with the terms "high definition" and "high resolution" and dedicated media types, such as DVD-Audio [11], SACD [12], HD-DVD [11], or Blu-Ray Disc [13]. In the realm of audio this is achieved by employing lossless formats with high resolution (word length) and/or high sampling rate. There exist several proprietary lossless formats, the most prominent of which are MLP Lossless [14], [15], DTS-HD [16], and Apple Lossless [17]. Furthermore, several freeware coding systems have gained some prominence through the Internet. Among them are FLAC [18], Monkey's Audio [19], and OptimFROG [20].

On the other end of the bit-rate scale, approaches for scalable perceptual audio coding have been developed in recent years. Within the International Telecommunication Union (ITU), scalability in wide-band audio coding has been adopted only recently [21]. Within MPEG-4 Audio [3], scalability was developed and adopted somewhat earlier [22]–[24].
This includes the realm of speech coding, where scalability is provided both for harmonic vector excitation coding (HVXC) [25], [26] and code-excited linear prediction (CELP) [27], [28].

In this context the ISO/MPEG Audio standardization group decided to start a new work item to explore technology for lossless and near-lossless coding of audio signals by issuing a call for proposals for relevant technology in 2002 [29]. Three specifications emerged from this call as amendments to the MPEG-4 Audio standard. First, the standard on lossless coding of 1-bit oversampled signals [30] specifies the lossless compression of highly oversampled 1-bit sigma-delta-modulated audio signals as they are stored on the SACD media under the name Direct Stream Digital (DSD) [31]. Second, the audio lossless coding (ALS) specification [32] describes technology for the lossless coding of PCM-coded signals at sampling rates up to 192 kHz as well as floating-point audio. This paper provides an overview of the third of these specifications, the scalable lossless coding (SLS) specification [33], which extends traditional methods for perceptual audio coding toward lossless coding of high-resolution audio signals in a scalable way. Specifically, it allows scaling up from a perceptually coded representation (MPEG-4 advanced audio coding, AAC) to a fully lossless representation with high definition, including a wide range of intermediate near-lossless representations. The paper starts with an explanation of the general concept, discusses the underlying novel technology components, and characterizes the codec in terms of compression performance and complexity. Finally, a number of application scenarios are briefly outlined.

1 CONCEPT AND TECHNOLOGY OVERVIEW

This section provides an overview of the principle and the structure of the scalable lossless coding technology and its combination with the AAC perceptual audio coder. This combination will be referred to as high-definition advanced audio coding (HD-AAC) in the following.

1.1 General System Structure

Fig. 1 shows a very high-level view of the structure of the HD-AAC encoder. The input signal is first coded by an AAC ("core layer") encoder. The SLS algorithm uses the AAC encoder's output to extend the system's performance toward lossless coding, resulting in an enhancement layer. The two layers of information are subsequently multiplexed into one high-definition bit stream. Decoding of an HD-AAC bit stream is illustrated in Fig. 2. From the HD-AAC bit stream, decoders are able either to decode the perceptually coded AAC part only, or to use the additional SLS information to produce losslessly/near-losslessly coded audio for high-definition applications.

Fig. 1. HD-AAC encoder structure.

Fig. 2. HD-AAC decoder structure.

1.2 AAC Background

Since the MPEG-4 SLS coder has been designed to operate as an enhancement to MPEG-4 AAC [3], [34], its structure is closely related to that of the underlying AAC core coder [35]. This section sketches the architectural features of MPEG-4 AAC as they are relevant to the SLS enhancement technology.

The underlying AAC codec provides efficient perceptual audio coding with high quality and achieves broadcast quality at a bit rate of about 64 kbps per channel [36]. Fig. 3 gives a very concise view of the AAC encoder's structure. The audio signal is processed in a blockwise spectral representation using the modified discrete cosine transform (MDCT) [37]. The resulting 1024 spectral values are quantized and coded considering the required accuracy as demanded by the perceptual model. This is done to minimize the perceptibility of the introduced quantization distortion by exploiting masking effects. Several neighboring spectral values are grouped into so-called scale factor bands, sharing the same scale factor for quantization.

Fig. 3. Structure of AAC encoder (simplified).
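As a concrete illustration of this quantization step, the following Python sketch applies a simplified version of the non-uniform quantizer commonly used in AAC encoders to one scale-factor band. The rounding offset 0.4054 follows the informative reference encoder; the scale-factor offset and the rate/distortion loops that actually select the scale factors are omitted. This is an illustrative sketch, not the normative AAC algorithm.

```python
def quantize_band(coeffs, scale_factor):
    """Simplified AAC-style non-uniform quantization of one scale-factor band.
    All coefficients of the band share the same scale factor."""
    step = 2.0 ** (0.25 * scale_factor)
    return [int((abs(x) / step) ** 0.75 + 0.4054) * (1 if x >= 0 else -1) for x in coeffs]

def dequantize_band(quantized, scale_factor):
    """Decoder-side inverse: |q|^(4/3), scaled by the band's scale factor."""
    step = 2.0 ** (0.25 * scale_factor)
    return [(abs(q) ** (4.0 / 3.0)) * step * (1 if q >= 0 else -1) for q in quantized]
```

A larger scale factor enlarges the quantization step of the whole band, which is how the perceptual model controls the amount of quantization noise injected into each band.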
Prior to the quantization/coding (Q/C) tool, a number of processing tools operate on the spectral coefficients in order to improve coding performance for certain situations. The most important tools are the following.

• The temporal noise shaping (TNS) tool [38] carries out predictive filtering across frequency in order to achieve a temporal shaping of the quantization noise according to the signal envelope and in this way optimize temporal masking of the quantization distortion.

• The M/S stereo coding tool [39] provides sum/difference coding of channel pairs, exploits interchannel redundancy for near-monophonic signals, and avoids binaural unmasking.

1.3 HD-AAC/SLS Enhancement

The SLS scalable lossless enhancement works on top of this AAC architecture. The structures of an HD-AAC encoder and decoder are shown in Figs. 4 and 5. In the encoder the audio signal is converted into a spectral representation using the integer modified discrete cosine transform (IntMDCT) [40], [41]. This transform represents an invertible integer approximation of the MDCT and is well suited for lossless coding in the frequency domain. Other AAC coding tools, such as mid/side coding or temporal noise shaping, are also considered and performed on the IntMDCT spectral coefficients in an invertible integer fashion, thus maintaining the similarity between the spectral values used in the AAC coder and in the lossless enhancement.

Fig. 4. Structure of HD-AAC encoder.

Fig. 5. Structure of HD-AAC decoder.

The link between the perceptual core and the scalable lossless enhancement layer of the coder [42] is provided by an error-mapping process. The error-mapping process removes the information that has already been coded in the perceptual (AAC) path from the IntMDCT spectral coefficients, such that only the resulting (IntMDCT) residuals are coded in the enhancement encoder; in this way the coding efficiency benefits from the underlying AAC layer. The error-mapping process also preserves the probability distribution skew of the original IntMDCT coefficients and thus permits an efficient encoding of the residual by means of two bit-plane coding processes, namely, bit-plane Golomb coding (BPGC) [43] and context-based arithmetic coding (CBAC) [44], and a so-called low-energy-mode encoder, all of which will be described later in further detail.

By using the bit-plane coding process the lossless enhancement is performed in a fine-grain scalable way. A lossless reconstruction is obtained if all the bit planes of the residual are coded, transmitted, and decoded completely. If only parts of the bit planes are decoded, a lossy reconstruction of the signal is obtained with a quality between that of the AAC layer and lossless reconstruction. In order to achieve optimal perceptual quality at intermediate bit rates, the bit-plane coding is started from the most significant bit (MSB) for all scale-factor bands and progresses toward the least significant bit (LSB) for all bands (see Fig. 6). In this way the bit-plane coding process preserves the overall spectral shape of the quantization noise, as it results from the noise-shaping process of the AAC perceptual coder, and thus takes advantage of the AAC perceptual model.

Fig. 6. Bit-plane scan process in SLS.
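The scan order of Fig. 6 can be pictured with the following sketch (a conceptual illustration only; the function names are ours, and the entropy coding applied to the emitted symbols is described in Section 4). Bit planes of the residual magnitudes are visited from the most significant plane of the whole frame downward, with one pass over all scale-factor bands per plane, so truncating the resulting symbol stream at any point leaves every band refined to roughly the same depth.

```python
def bitplane_scan(residual_bands):
    """residual_bands: list of integer residual lists, one per scale-factor band.
    Yields (band, coefficient index, plane, bit) from the MSB plane of the whole
    frame down to plane 0, visiting all bands at each plane."""
    top = max((abs(c).bit_length() for band in residual_bands for c in band), default=0)
    for plane in range(top - 1, -1, -1):          # MSB -> LSB
        for b, band in enumerate(residual_bands):
            for k, c in enumerate(band):
                yield b, k, plane, (abs(c) >> plane) & 1

# Truncating the emitted symbol stream after any number of symbols yields a
# coarser, but spectrally balanced, approximation of the residual.
```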
The following sections will describe the principles underlying the design of SLS in greater detail by focusing on its IntMDCT filter bank, error-mapping strategy, and entropy coding parts.

2 FILTER BANK

2.1 IntMDCT

The IntMDCT, as introduced in [40], is an invertible integer approximation of the MDCT, which is obtained by utilizing the "lifting scheme" [45] or "ladder network" [46]. It enables efficient lossless coding of audio signals [40] by means of entropy coding of integer spectral coefficients. Furthermore, as the IntMDCT closely approximates the behavior of the MDCT, it allows the strategies of perceptual and lossless audio coding in the frequency domain to be combined into a common framework [42].

Fig. 7 illustrates the close relationship between IntMDCT and MDCT for a small audio segment by displaying the respective magnitude spectra of both filter banks. The difference between MDCT and IntMDCT values is visible as a small noise floor that is typically much lower than the error introduced by perceptual coding. Thus the IntMDCT allows the quantization error of an MDCT-based perceptual codec to be coded efficiently in the frequency domain.

Fig. 7. IntMDCT and MDCT magnitude spectra.

The following section provides some detail on the derivation of the IntMDCT from the MDCT.

2.2 Decomposition of MDCT

The MDCT, defined by

X(m) = \sqrt{\frac{2}{N}} \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\left( \frac{(2k+1+N)(2m+1)\pi}{4N} \right), \qquad m = 0, \ldots, N-1    (1)

with the time-domain input x(k) and the windowing function w(k), can be decomposed into two blocks, namely, windowing and time-domain aliasing (TDA) and a discrete cosine transform of type IV (DCT-IV). This is illustrated in Fig. 8 for both the forward MDCT and the inverse MDCT.

Fig. 8. MDCT and inverse MDCT by windowing/TDA and DCT-IV.

In the forward IntMDCT the windowing/TDA block is calculated by 3N/2 so-called lifting steps:

\begin{pmatrix} x(k) \\ x(N-1-k) \end{pmatrix} \leftarrow
\begin{pmatrix} 1 & \frac{1 - w(N-1-k)}{w(k)} \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ -w(k) & 1 \end{pmatrix}
\begin{pmatrix} 1 & \frac{1 - w(N-1-k)}{w(k)} \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} x(k) \\ x(N-1-k) \end{pmatrix}, \qquad k = 0, \ldots, \frac{N}{2}-1.    (2)

After each lifting step a rounding operation is applied to stay in the integer domain. Every lifting step can be inverted exactly by applying the same rounded update with the opposite sign.

2.3 Integer DCT-IV

For the IntMDCT the DCT-IV is calculated in an invertible integer fashion, called the integer DCT-IV. The multidimensional lifting (MDL) scheme [41], [47] is applied in order to reduce the required rounding operations in the invertible integer approximation as much as possible and in this way minimize the approximation error noise floor (see Fig. 7). The following block matrix decomposition for an invertible matrix T and the identity matrix I shows the basic principle of the MDL scheme:

\begin{pmatrix} T & 0 \\ 0 & T^{-1} \end{pmatrix} =
\begin{pmatrix} -I & 0 \\ T^{-1} & I \end{pmatrix}
\begin{pmatrix} I & -T \\ 0 & I \end{pmatrix}
\begin{pmatrix} 0 & I \\ I & T^{-1} \end{pmatrix}.    (3)

The three blocks in this decomposition are the so-called MDL steps. Similar to the conventional lifting steps, they can be transferred to invertible integer mappings by rounding the floating-point values after being processed by T or T^{-1}, and they can be inverted by subtracting the values that have been added.
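As a minimal numerical illustration of a single MDL step (not the normative SLS integer DCT-IV, which chains several such steps according to Eq. (3) and additionally applies noise shaping, see Section 2.4), the following sketch shows that the rounding inside an MDL step does not compromise invertibility:

```python
import numpy as np

def dct_iv_matrix(n):
    """Orthonormal DCT-IV matrix, used here as the invertible transform T."""
    k = np.arange(n)
    return np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[:, None] + 1) * (2 * k[None, :] + 1) / (4 * n))

def mdl_step(x1, x2, T):
    """One multidimensional lifting step: the second integer vector is updated by a
    rounded transform of the first; the first vector passes through unchanged."""
    return x1, x2 + np.round(T @ x1).astype(np.int64)

def mdl_step_inverse(y1, y2, T):
    """Exact inverse: subtract the identical rounded value again."""
    return y1, y2 - np.round(T @ y1).astype(np.int64)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = dct_iv_matrix(8)
    x1 = rng.integers(-1000, 1000, 8)
    x2 = rng.integers(-1000, 1000, 8)
    y1, y2 = mdl_step(x1, x2, T)
    z1, z2 = mdl_step_inverse(y1, y2, T)
    assert np.array_equal(z1, x1) and np.array_equal(z2, x2)  # lossless despite rounding
```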
When coding stereo signals, this decomposition is used to obtain an integrated calculation of the M/S matrix and the integer DCT-IV for the left and right channels. The number of required rounding operations is 3N per channel pair, or 3N/2 per channel, which is the same number as for the windowing/TDA stage. As a whole, this so-called stereo IntMDCT requires only three rounding operations per sample, including M/S processing. For mono signals the same structure can be used, but it has to be extended by some additional lifting steps to obtain the integer DCT-IV of one block; see [47]. This mono IntMDCT requires four rounding operations per sample.

2.4 Noise Shaping

The lossless coding efficiency of the IntMDCT is further improved by utilizing a noise-shaping technique, introduced in [48]. In the lifting steps where time-domain signals are processed, the rounding operations include an error feedback mechanism to provide a spectral shaping of the approximation noise. This approximation noise affects the lossless coding efficiency mainly in the high-frequency region, where audio signals usually carry only a small amount of energy, especially at sampling rates of 96 kHz and above. Hence the low-pass characteristic of the approximation noise improves the lossless coding efficiency. A first-order noise-shaping filter is employed in the three stages of lifting steps in the windowing/TDA processing and in the first rounding stage of the integer DCT-IV processing. Fig. 9 compares the resulting approximation error between the IntMDCT values and the MDCT values rounded to integer when the IntMDCT operates with and without noise shaping.

Fig. 9. Mean-squared approximation error of stereo IntMDCT (including M/S) with and without noise shaping.

3 ERROR MAPPING AND RESIDUAL CALCULATION

The objective of the error-mapping/residual calculation stage is to produce an integer enhancement signal that enables lossless reconstruction of the audio signal while consuming as few bits as possible. Rather than encoding all IntMDCT coefficients c[k] directly in the lossless enhancement layer, it is more efficient to make use of the information that has already been coded by the AAC layer. This is achieved by the error-mapping process, which produces a deterministic residual signal between the IntMDCT spectral values and their counterparts from the AAC layer.

In order to produce a residual signal e[k] with the smallest possible entropy, given the core AAC quantized value i[k], it is sufficient in most cases to use the minimum mean square error (MMSE) residual obtained by subtracting from the IntMDCT coefficient c[k] its MMSE reconstruction \hat{c}[k] = E\{c[k] \mid i[k]\}, that is,

e[k] = c[k] - \hat{c}[k].    (4)

Here E\{\cdot\} denotes the expectation operation. However, in SLS a somewhat different approach is adopted, where the residual signal is given by

e[k] = \begin{cases} c[k] - \mathrm{thr}(i[k]), & i[k] \neq 0 \\ c[k], & i[k] = 0 \end{cases}    (5)

where thr(i[k]) is the next quantized value closer to zero with respect to i[k], and is calculated via table lookup and linear interpolation to ensure the deterministic behavior necessary for lossless coding.

This error-mapping process is illustrated in Fig. 10, where two different cases are shown. In the first case the IntMDCT coefficients c[k] belong to a scale-factor band that has been quantized and coded at the AAC encoder (significant band). For these coefficients the residual coefficients are obtained by subtracting the quantization thresholds from their corresponding c[k], resulting in a residual spectrum with reduced amplitude. In the other case c[k] belongs to a band that is not coded or has been quantized to zero in the AAC encoder (insignificant band). In this case the residual spectrum is simply the IntMDCT spectrum itself.

Fig. 10. Illustration of error-mapping process.
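A compact sketch of the mapping of Eq. (5) and its decoder-side inverse is given below. The threshold function thr() is passed in as a deterministic callable, standing in for the table lookup and interpolation used in SLS; the uniform-step toy threshold in the example is purely hypothetical.

```python
def error_map(c, i, thr):
    """Encoder side, Eq. (5): keep only the offset above the quantization threshold
    for coefficients the AAC core quantized to a nonzero index."""
    return [ck - thr(ik) if ik != 0 else ck for ck, ik in zip(c, i)]

def inverse_error_map(e, i, thr):
    """Decoder side: adding the same deterministic threshold restores c[k] exactly."""
    return [ek + thr(ik) if ik != 0 else ek for ek, ik in zip(e, i)]

# Toy threshold with a uniform step of 8 (a stand-in for the AAC threshold table).
toy_thr = lambda ik: 8 * (abs(ik) - 1) * (1 if ik > 0 else -1)

c = [35, -20, 3, 0]           # IntMDCT coefficients
i = [4, -2, 0, 0]             # AAC core quantization indices
e = error_map(c, i, toy_thr)  # -> [11, -12, 3, 0]
assert inverse_error_map(e, i, toy_thr) == c
```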
This error-mapping process offers many advantages that permit better coding efficiency in the enhancement layer. First, as illustrated in Fig. 11, we notice that if the amplitude of c[k] is distributed exponentially, the amplitude of e[k] will likewise be distributed approximately exponentially. Thus it can be coded very efficiently by using the BPGC/CBAC coding process described in the next section.

Fig. 11. Distribution of residual signal from error-mapping process.

Second, for a significant band we find the following property for the coefficients c[k]:

|\mathrm{thr}(i[k])| \le |c[k]| < |\mathrm{thr}(i[k])| + \Delta(i[k])    (6)

where

\Delta(i[k]) = \mathrm{thr}(|i[k]| + 1) - \mathrm{thr}(|i[k]|)    (7)

is the quantization step size of i[k]. Clearly, the magnitude of e[k] is bounded by the value of \Delta(i[k]), and its sign is identical to that of i[k] for i[k] ≠ 0. As a result, no additional data transmission is needed for conveying the sign information. This is referred to as "implicit signaling" in the SLS specification. This implicit signaling mechanism assumes that the input to the SLS layer and the AAC core quantization value are "well synchronized." This may not be true in all cases, given that AAC encoders have all the freedom to optimize the encoding process and thus may produce an i[k] that does not satisfy Eq. (6). Thus SLS also includes an "explicit signaling" mechanism, which employs an MMSE error mapping as defined in Eq. (4), where \hat{c}[k] is approximated by the AAC inverse quantization value. In this case all the side information necessary for decoding e[k] is signaled explicitly from the encoder to the decoder.

4 ENTROPY CODING

To achieve best efficiency in the compression of the enhancement layer information, several methods for entropy coding of the IntMDCT residual information are employed.

4.1 Combined BPGC/CBAC Coding

The bit-plane Golomb code (BPGC) coding process used in SLS is basically a bit-plane coding scheme in which the bit-plane symbols are coded arithmetically with a structural frequency assignment rule. Considering an input data vector e = {e[0], . . . , e[N − 1]}, where N is the dimension of e, each element e[k] of e is first represented in a binary format as

e[k] = (2s[k] - 1) \sum_{j=0}^{M-1} b[k, j] \cdot 2^{j}, \qquad k = 0, \ldots, N-1    (8)

which consists of a sign symbol

s[k] = \begin{cases} 1, & e[k] \ge 0 \\ 0, & e[k] < 0 \end{cases}, \qquad k = 0, \ldots, N-1    (9)

and bit-plane symbols b[k, j] ∈ {0, 1}, j = 0, . . . , M − 1, where M is the MSB of e, which satisfies 2^{M-1} \le \max_{k} |e[k]| < 2^{M}, k = 0, . . . , N − 1. The bit-plane symbols are then scanned from MSB to LSB over all the elements in e and coded using an arithmetic code with the structural frequency assignment Q_L(j) given by

Q_L(j) = \begin{cases} \dfrac{1}{2}, & j < L \\ \dfrac{1}{1 + 2^{2^{\,j-L}}}, & j \ge L \end{cases}    (10)

where the "lazy-plane" parameter L can be selected using the adaptation rule

L = \min\{ L' \in \mathbb{Z} \mid 2^{L'+1} N \ge A \}    (11)

and A is the absolute sum of the data vector e.
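To make the adaptation rule of Eq. (11) and the frequency assignment of Eq. (10) concrete, the following sketch derives the lazy-plane parameter L from the absolute sum A of a residual vector and lists the bit-plane symbols in MSB-to-LSB order, together with the probability an arithmetic coder would be given for each symbol. It is a simplified illustration of the BPGC rules as reconstructed above, not the normative SLS syntax; the arithmetic coder itself is omitted.

```python
import math

def lazy_parameter(e):
    """Eq. (11): smallest integer L with 2**(L+1) * N >= A, where A = sum of |e[k]|."""
    n, a = len(e), sum(abs(v) for v in e)
    if a == 0:
        return 0  # assumption: an all-zero band would be handled by the low-energy mode
    return math.ceil(math.log2(a / n)) - 1

def symbol_probability(j, L):
    """Eq. (10): probability assigned to a '1' at bit plane j for lazy parameter L."""
    if j < L:
        return 0.5
    d = j - L
    return 1.0 / (1.0 + 2.0 ** (2.0 ** d)) if d < 6 else 0.0  # negligible beyond d = 5

def bpgc_symbols(e):
    """Yield (k, j, bit, probability) from the most significant plane downward.
    Bands with L <= 0 would use the low-energy mode instead (Section 4.2)."""
    L = lazy_parameter(e)
    M = max((abs(v).bit_length() for v in e), default=0)
    for j in range(M - 1, -1, -1):          # MSB -> LSB
        for k, v in enumerate(e):
            yield k, j, (abs(v) >> j) & 1, symbol_probability(j, L)

if __name__ == "__main__":
    for k, j, bit, p in bpgc_symbols([3, -1, 0, 7, 2, -4, 1, 0]):
        print(f"coef {k}, plane {j}: bit={bit}, P(1)~{p:.4f}")
```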
Although this BPGC coding process delivers excellent compression performance for data that stem from an independent and identically distributed (iid) source with Laplacian distribution [43], it lacks the capability to exploit the statistical dependencies between data samples that may exist in certain sources and could be used to achieve better compression performance. These correlations can be captured very effectively by incorporating context modeling technology into the BPGC coding process, where the frequency assignment for arithmetic coding of bit-plane symbols depends not only on the distance of the current bit plane from the lazy-plane parameter, as in the frequency assignment rule of Eq. (10), but also on other elements that may affect the probability distribution of these bit-plane symbols. In the context of lossless coding of IntMDCT spectral data for audio, elements that possibly affect the distribution of the bit-plane symbols include the frequency locations of the IntMDCT spectral data, the magnitudes of the adjacent spectral lines, and the status of the AAC core quantizer. In order to capture these correlations, several contexts are used in the context-based arithmetic code (CBAC) of SLS. The guiding principle in selecting these contexts is to find those that are most strongly correlated with the distribution of the bit-plane symbols.

In CBAC three types of contexts are used, namely, the frequency band (FB) context, the distance to lazy (D2L) context, and the significant state (SS) context. The detailed context assignments are summarized in the following.

Context 1: Frequency Band (FB)

It is found in [44] that the probability distribution of bit-plane symbols of the IntMDCT varies for different frequency bands. Therefore in CBAC the IntMDCT spectral data are classified into three different FB contexts, namely, low band (0–4 kHz), midband (4–11 kHz), and high band (above 11 kHz).

Context 2: Distance to Lazy (D2L)

The D2L context is defined as the distance of the current bit plane j to the BPGC lazy-plane parameter L:

\mathrm{D2L} = \begin{cases} 3 - j + L, & j - L \ge -2 \\ 6, & \text{otherwise.} \end{cases}    (12)

This is motivated by the BPGC frequency assignment rule, Eq. (10), which reflects the fact that the skew of the probability distribution of bit-plane symbols from a source with a (near) Laplacian distribution tends to decrease as the current bit plane approaches the lazy plane. To reduce the total number of D2L contexts, all the bit planes with j − L < −2 are grouped into one context in which all the bit-plane symbols are coded with probability 0.5.

Context 3: Significant State (SS)

The SS context attempts to capture, in one place, the factors that correlate with the distribution of the magnitude of the IntMDCT residual. These include the magnitudes of the adjacent IntMDCT spectral lines and the quantization interval of the AAC core quantizer if the coefficient has been quantized previously in the core encoder. Further detail on the SS context can be found in [49].
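The following schematic sketch shows how FB and D2L context indices along the lines described above could be derived. The band limits follow the text, and d2l_context mirrors Eq. (12) as reconstructed here; the actual SLS context tables, the treatment of extreme plane indices, and the SS context are more elaborate (see [49]).

```python
def fb_context(bin_index: int, n_bins: int, sample_rate: float) -> int:
    """Frequency-band context: 0 = low (0-4 kHz), 1 = mid (4-11 kHz), 2 = high (>11 kHz)."""
    freq_hz = bin_index * (sample_rate / 2.0) / n_bins   # approximate bin frequency
    if freq_hz < 4000.0:
        return 0
    return 1 if freq_hz < 11000.0 else 2

def d2l_context(j: int, L: int) -> int:
    """Distance-to-lazy context per Eq. (12); planes far below the lazy plane share one context."""
    return 3 - j + L if j - L >= -2 else 6

# A CBAC-style coder would then select its frequency table with a key such as
# (fb_context(k, 1024, 48000.0), d2l_context(j, L), ss_context), where ss_context
# summarizes neighboring magnitudes and the core quantizer state (see [49]).
```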
4.2 Low-Energy-Mode Coding

The BPGC/CBAC coding process described earlier works well for sources with a (near) Laplacian distribution, which usually is the case for most audio signals [50]. However, it was also found that for some music items there are time/frequency (T/F) regions with very low energy levels in which the IntMDCT spectral data are in fact dominated by the rounding errors of the IntMDCT (see Fig. 7), with a distribution that is substantially different from Laplacian. In order to encode those low-energy regions efficiently, the BPGC/CBAC coding process is replaced by low-energy-mode coding, as described in the following.

Low-energy-mode coding is invoked for scale-factor bands for which the BPGC parameter L is smaller than or equal to 0. The amplitude of the residual spectral data e[k] is first converted into a unary binary string b = {b[0], b[1], . . . , b[pos], . . .}, as illustrated in Table 1, with M being the maximum bit plane. It can be seen that the probability distribution of these symbols is a function of the position pos and of the distribution of e[k]:

\Pr\{b[\mathrm{pos}] = 1\} = \Pr\{e[k] > \mathrm{pos} \mid e[k] \ge \mathrm{pos}\}    (13)

where 0 ≤ pos < 2^M. Each b[pos] is then coded arithmetically, depending on its position pos and the BPGC parameter L, with a trained frequency table.

Table 1. Binarization of IntMDCT error spectrum in low-energy mode.

Amplitude of e[k]    Binary string b[0] b[1] b[2] . . .
0                    0
1                    1 0
2                    1 1 0
. . .                . . .
2^M − 2              1 1 1 . . . 1 0
2^M − 1              1 1 1 . . . 1 1

4.3 Smart Decoding

Due to their fine-grain scalability, HD-AAC bit streams can be truncated at any bit rate lower than what would be needed for a fully lossless reconstruction in order to produce near-lossless representations of the audio signal. In the context of arithmetic decoding, the smart decoding method provides a way to optimally decode such a truncated bit stream. It decodes additional symbols in the absence of incoming bits as long as the decoding buffer still contains meaningful information for arithmetic decoding in the CBAC/BPGC mode and/or low-energy mode. Decoding continues up to the point where no ambiguity exists in determining a symbol [51].

5 OTHER CODING TOOLS

As counterparts to the tools of the underlying AAC perceptual audio coder, SLS provides a number of integerized versions of AAC coding tools.

5.1 Integer M/S

In the AAC codec the M/S tool allows choosing individually between mid/side and left/right coding on a scale-factor-band basis. The stereo IntMDCT described in Section 2.3, in contrast, provides either a global left/right or a global mid/side spectral representation. In order to make the integer spectral values fit the spectral values from the AAC core on a scale-factor-band basis, an invertible integer version of the M/S mapping is used. It is based on a lifting decomposition of the normalized M/S matrix, that is, a rotation by π/4:

\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} =
\begin{pmatrix} 1 & 1 - \sqrt{2} \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ \frac{1}{\sqrt{2}} & 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 - \sqrt{2} \\ 0 & 1 \end{pmatrix}.    (14)

5.2 Integer TNS

When the temporal noise-shaping (TNS) tool is used in the AAC core, the resulting MDCT spectral values deviate from the IntMDCT spectral values. In order to compensate for this, the same TNS filter as in the AAC core is applied to the integer spectral values in the lossless enhancement. To ensure lossless operation, the TNS filter is converted to a deterministic, invertible integer filter.

6 BIT STREAM MULTIPLEXING

From a mechanical point of view, the SLS coded data, including the core layer AAC bit stream and the enhancement bit stream, can be carried in multiple elementary streams (ESs) in an MPEG-4 system [52]. As shown in Fig. 12, the AAC bit stream is carried in the so-called base layer ES, and the enhancement bit stream is carried in one or more enhancement layer ESs. Each ES is thus composed of a sequence of access units (AUs), where one AU contains one audio frame from the AAC bit stream or the enhancement bit stream.

Fig. 12. Structure of MPEG-4 SLS bit stream.
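The layering can be pictured with the following sketch of a frame container and a rate-constrained truncation of the enhancement part. The class and function names are illustrative and do not reflect the MPEG-4 Systems syntax.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AccessUnit:
    """One audio frame split into a core (AAC) part and a truncatable SLS part.
    Illustrative container only, not the MPEG-4 Systems access-unit syntax."""
    core: bytes          # AAC frame carried in the base layer ES
    enhancement: bytes   # fine-grain scalable SLS frame carried in enhancement ES(s)

def truncate_to_rate(aus: List[AccessUnit], target_kbps: float,
                     frames_per_second: float = 46.875) -> List[AccessUnit]:
    """Drop trailing enhancement bytes so each frame fits a per-frame byte budget.
    46.875 frames/s corresponds to 1024-sample frames at 48 kHz; the core layer
    is never truncated."""
    budget = int(target_kbps * 1000 / 8 / frames_per_second)
    out = []
    for au in aus:
        keep = max(0, budget - len(au.core))
        out.append(AccessUnit(au.core, au.enhancement[:keep]))
    return out
```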
From an application point of view, such a bit stream structure provides great flexibility in constructing either a large-step scalable system or a fine-grain scalable system with SLS. For example, in a scalable audio streaming application the server stores multiple SLS ESs at predefined bit rates, each assigned a different stream priority. During streaming the ESs are transmitted in the order of their stream priority, and ESs with lower stream priority are dropped by either the streaming server or the network gateway whenever the transmission bandwidth is insufficient to stream a full-rate lossless bit stream. Alternatively, it is also possible to implement a lightweight bit stream truncation algorithm in the streaming server or in the multimedia gateway to truncate the enhancement stream AUs directly according to the available bandwidth and thus achieve fine-granular bit-rate scalability.

7 MODES OF OPERATION

So far the HD-AAC coder has been described as a combination of a regular AAC core layer coder (such as low-complexity AAC) and an SLS enhancement layer. This section introduces further features offered by the SLS enhancement layer that concern its combination with the AAC coder, and its ability to be used in combination with other types of AAC-based codecs, such as scalable AAC or AAC/BSAC.

7.1 Oversampling Mode

In the context of high-definition audio applications it is frequently desirable to achieve lossless signal reconstruction at sampling rates of 96 kHz or even 192 kHz. While the MPEG-4 AAC coder supports these sampling rates, it typically achieves its best coding efficiency for high-quality perceptual coding at sampling rates between 32 and 48 kHz. Thus, in order to allow both efficient core layer coding at common rates and lossless reconstruction at higher rates, MPEG-4 SLS includes an additional feature called "oversampling mode." This refers to the possibility of letting the lossless enhancement operate at a sampling rate higher than that of the AAC core codec. The ratio between the SLS sampling rate and the AAC sampling rate is called the "oversampling factor" and can be either 1, 2, or 4. For example, the lossless enhancement can operate at a rate of 192 kHz, whereas the AAC core operates at 48 kHz; see Table 2.

Table 2. Example combinations of sampling rates for AAC core and lossless enhancement.

                 SLS @ 48 kHz    SLS @ 96 kHz    SLS @ 192 kHz
AAC @ 48 kHz          ×               ×                ×
AAC @ 96 kHz                          ×                ×
AAC @ 192 kHz                                          ×

The mapping between the two coding layers is achieved by using a time-aligned framing and a correspondingly longer IntMDCT in the lossless enhancement. For example, an IntMDCT of the size of 4096 spectral values is used in the case of oversampling by a factor of 4, and the 1024 MDCT values from the AAC core are mapped to the lower 1024 IntMDCT values. In addition to allowing for optimum AAC performance, the lossless performance is improved when using longer IntMDCT transforms. (A transform length of 2048 or 4096 provides a better lossless performance for stationary signals than a transform length of 1024.)
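A sketch of the core-to-enhancement bin mapping implied by the oversampling mode is given below; the function and variable names are illustrative.

```python
def map_core_to_enhancement(core_mdct, oversampling_factor):
    """Place the AAC core's 1024 MDCT values at the lowest bins of the longer
    IntMDCT frame used by the SLS layer (1024 * factor bins). The remaining
    high-frequency bins have no core counterpart and behave like insignificant
    bands, i.e. their residual is the IntMDCT value itself."""
    if oversampling_factor not in (1, 2, 4):
        raise ValueError("the oversampling factor can be 1, 2, or 4")
    n_enh = 1024 * oversampling_factor
    mapped = [0] * n_enh
    mapped[:1024] = list(core_mdct[:1024])
    return mapped
```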
7.2 Combination with MPEG-4 Scalable AAC

In order to account for varying or unknown transmission capacity, the MPEG-4 AAC codec also provides several scalable coding modes [3], [34]. MPEG-4 scalable AAC allows perceptual audio coding with one or more AAC mono or stereo layers and can be combined with an SLS enhancement layer. This results in an overall coder that offers fine-grain scalability in the range between lossless reconstruction and the AAC representation, and scalability in several steps within the perceptually coded AAC representation.

7.3 Combination with MPEG-4 AAC/BSAC

Similar to the combination with scalable AAC, the SLS enhancement can also be operated on top of MPEG-4 AAC/BSAC as a core layer coder. This provides a fine-grain scalable representation both between lossless and perceptually coded audio and within the perceptually coded range. The latter has a granularity of 1 kbps per channel.

7.4 Stand-Alone Operation

Finally, the SLS lossless enhancement can also operate as a straightforward stand-alone codec, without any underlying core codec. Nonetheless, this operation mode offers both full lossless coding capability and fine-grain scalability.

8 PERFORMANCE

This section quantifies the performance of the HD-AAC codec in various operation modes. While the compression ratio can be seen as the sole measure of merit for lossless operation scenarios, an evaluation of performance for near-lossless operation requires audio quality measurements at various data rates and operating points.

8.1 Lossless Compression Performance

This section reports the performance of HD-AAC in terms of its ability to compress various audio material losslessly. As a figure of merit, the compression ratio is defined as

\text{compression ratio} = \frac{\text{original file size}}{\text{compressed file size}}.    (15)

Tables 3 and 4 show the lossless compression performance for two major sets of test material: the MPEG-4 lossless audio coding test set (donated by Matsushita Corporation and containing in part recordings performed by the New York Symphonic Ensemble) and a set of commercial CD recordings. For both sets the compression results are given for an HD-AAC configuration with an AAC core layer running at 128 kbit/s stereo plus SLS enhancement, and for an SLS stand-alone configuration.

As could be expected from theory, it can be observed in Table 3 that an increase in word length reduces the average compression ratio (due to the fact that the least significant bits of the PCM codewords are more random and thus less compressible). On the other hand, increasing the sampling rate improves compression because of the increased correlation between adjacent samples (assuming sound material with typical high-frequency characteristics). For the MPEG-4 lossless audio test set an average compression ratio of 2:1 can be achieved easily at a sampling rate of 48 kHz and 16-bit word length. This is competitive with the best of other known lossless compression systems [53]. It can also be observed that for the AAC-based mode an additional bit rate of only 30–40 kbps is required compared to the stand-alone mode for lossless representation. This reduces the bit-rate consumption by 90–100 kbps compared to simulcast solutions that transmit both a 128-kbps AAC bit stream and a stand-alone SLS lossless bit stream simultaneously.

Table 3. Lossless compression results for MPEG-4 lossless audio test set.

                    SLS + AAC @ 128 kbps/stereo            SLS stand-alone
                    (AAC @ 48 kHz sampling rate)
                    Compression    Average bit             Compression    Average bit
                    ratio          rate (kbps)             ratio          rate (kbps)
48 kHz/16 bit       2.09           735                     2.20           698
48 kHz/24 bit       1.55           1490                    1.58           1454
96 kHz/24 bit       2.09           2201                    2.13           2160
192 kHz/24 bit      2.60           3543                    2.63           3509
Overall             2.08           1992                    2.12           1955
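As a quick plausibility check of Table 3 (a back-of-envelope calculation assuming two-channel PCM), the uncompressed rate of a configuration divided by the listed compression ratio reproduces the listed average bit rate:

```python
def pcm_rate_kbps(sample_rate_hz, bits, channels=2):
    """Uncompressed PCM bit rate in kbps."""
    return sample_rate_hz * bits * channels / 1000.0

print(pcm_rate_kbps(48_000, 16) / 2.09)   # ~735 kbps, matching Table 3 (SLS + AAC)
print(pcm_rate_kbps(96_000, 24) / 2.09)   # ~2205 kbps vs. 2201 kbps listed
```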
8.2 Near-Lossless Compression Performance

The bit-plane coding of the residual spectral values (that is, of the AAC quantization error) allows the initial AAC quantization to be refined successively as more bits from the SLS enhancement layer are decoded. With each additional decoded bit plane the quantization error is reduced by 6 dB. Consequently an increasing safety margin with respect to audibility is added as the bit rate increases.

8.2.1 Evaluation of Near-Lossless Audio Quality

While it may seem sufficient for most purposes to provide perceptually transparent reproduction of audio signals by using conventional perceptual audio coders (such as AAC at a sufficient bit rate), there are applications that demand still higher audio quality. This is especially the case for professional audio production facilities, such as archiving and broadcasting, in which audio signals may undergo many cycles of encoding/decoding (tandem coding) before being delivered to the consumer. This leads to an accumulation of coding distortion and may result in unacceptable final audio quality [54], unless substantial headroom toward audibility is provided by each coding step, for example, by using coding algorithms with very high quality or bit rate.

ITU-R BS.1548-1 [55] defines the requirements for audio coding systems for digital broadcasting, assuming a codec chain consisting of so-called contribution, distribution, and emission codecs. According to this recommendation, and based on ITU-R BS.1116-1 [56], audio codecs for contribution and distribution should fulfill the following requirement:

"The quality of sound reproduced after a reference contribution/distribution cascade [...] should be subjectively indistinguishable from the source for most types of audio programme material. Using the triple stimuli double blind with hidden reference test, described in Recommendation ITU-R BS.1116 [...], this requires mean scores generally higher than 4.5 in the impairment 5-grade scale, for listeners at the reference listening position. The worst rated item should not be graded lower than 4."

In accordance with these recommendations, tests were run on signals encoded and decoded with HD-AAC. Instead of running numerous listening tests for subjective quality assessment at individual operating points, the evaluation employed the PEAQ measurement (BS.1387-1) [57], which provides methods for objective measurement of perceived audio quality in scenarios that are normally assessed by ITU-R BS.1116 testing.

The most essential results can be seen in Figs. 13–15; see also [58]. The graphs show the estimated subjective sound quality expressed as objective difference grade (ODG) values, which were computed by a PEAQ measurement. The evaluation procedure consists of multiple cycles of tandem coding/decoding, with up to 16 cycles. The standard set of critical MPEG-4 audio items for perceptual audio coding evaluations was used. ODG values of 0, −1, −2, −3, and −4 correspond to a subjective audio quality of "indistinguishable from original," "perceptible but not annoying," "slightly annoying," "annoying," and "very annoying," respectively.

Fig. 13 shows the achieved ODG values as a function of tandem cycles for a traditional AAC coder running at a bit rate of 128 kbps/stereo. As expected, it can be observed that the audio quality degrades significantly with an increasing number of tandem cycles, depending on the test item. For this reason tandem coding is not a recommended practice for such coders.
Fig. 14 displays the corresponding tandem coding results for the HD-AAC combination running at 512 kbps/stereo (AAC at 128 kbps + SLS enhancement at 384 kbps). It can be noted that the audio quality remains consistently at a very high level, even after a total of 16 tandem cycles. This illustrates the high robustness of the HD-AAC representation against tandem coding. According to these measurements, the aforementioned BS.1548-1 audio quality requirement is fulfilled with a considerable safety margin. Furthermore, when placed in tandem with AAC (such as for final audio distribution over a narrow-band channel), the resulting audio quality is not degraded significantly by the preceding HD-AAC tandem cascade. Further details can be found in [59].

Fig. 13. Test results: AAC tandem coding.

Fig. 14. Test results: HD-AAC tandem coding.

Table 4. Lossless compression results for commercial CD test set.

                                                                         Compression ratio
CD item (16 bit/44.1 kHz)                                                SLS + AAC @       SLS
                                                                         128 kbps/stereo   stand-alone
ACDC—Highway to Hell (Sony 80206)                                        1.31              1.36
Avril Lavigne—Let Go (Arista 14740)                                      1.36              1.41
Backstreet Boys—Greatest Hits Chapter One (Jive 41779)                   1.39              1.45
Brian Setzer—The Dirty Boogie (Interscope 90183)                         1.43              1.49
Cowboy Junkies—Trinity Session (RCA-8568)                                1.93              2.04
Grieg—Peer Gynt, von Karajan (DG 439010)                                 2.63              2.83
Jennifer Warnes—Famous Blue Raincoat (BMG 258418)                        2.07              2.20
Marlboro Music Festival—DISC A (Bridge 9108)                             2.23              2.35
Marlboro Music Festival—DISC B (Bridge 9108)                             2.20              2.33
Nirvana—Nirvana (Interscope 493523)                                      1.50              1.56
Philip Jones—40 Famous Marches, CD1 (Decca 416241)                       1.99              2.11
Philip Jones—40 Famous Marches, CD2 (Decca 416241)                       2.00              2.11
Pink Floyd—Dark Side of the Moon (Capitol 46001)                         1.76              1.85
Rebecca Pidgeon—The Raven (Chesky 115)                                   1.88              1.97
Ricky Martin (Sony 69891)                                                1.33              1.38
Schubert Piano Trio in E-flat (Sony 48088)                               2.74              2.90
Spaniels—The Very Best Of (Collectables 7243)                            2.41              2.62
Steeleye Span—Below the Salt (Shanachie 79039)                           1.85              1.95
Suzanne Vega—Solitude Standing (A&M 5136)                                1.74              1.83
Westminster Concert Bell Choir—Christmas Bells (Gothic Records 49055)    2.55              2.71
Overall                                                                  1.85              1.94

8.2.2 Stand-Alone SLS Operation

The SLS codec can also operate as a stand-alone lossless codec when the AAC core codec is not used, sometimes also referred to as the "noncore mode." Despite its simple structure (only the IntMDCT and BPGC/CBAC modules are used), this mode allows efficient lossless coding [59]. Furthermore, fine-grain scalability by truncated bit-plane coding is also possible in this mode.

Given that the stand-alone SLS codec does not include any perceptual model to estimate masking thresholds, it is interesting to investigate the audio quality resulting from a truncation of the SLS bit stream. Due to the behavior of bit-plane coding in this mode, a constant signal-to-noise ratio is achieved in each scale-factor band. With each additional bit plane the signal-to-noise ratio improves by 6 dB. While this behavior does not allow SLS to compete with efficient perceptual codecs at low bit rates (for example, AAC at 128 kbps/stereo), this simple approach works quite well at higher bit rates in the near-lossless range. Fig. 15 shows tandem coding results for the stand-alone SLS codec operating at 512 kbps/stereo. It reaches about the same near-lossless audio quality as the AAC-based HD-AAC mode discussed in the previous section.
Fig. 15. Test results: stand-alone SLS tandem coding.

At a constant bit rate of 768 kbps most test items still require the truncation of some coder frames. Nevertheless the corresponding PEAQ measurements indicate that no degradation of subjective audio quality occurs in this tandem coding scenario, for both the AAC-based mode and the stand-alone mode; see [59]. This provides an interesting operating point for the HD-AAC modes, corresponding to a guaranteed 2:1 compression. While other stand-alone lossless codecs can also provide an average compression of 2:1 for suitable test material, their worst-case compression can be much lower, depending on the audio material to be encoded. In contrast, HD-AAC is able to guarantee a certain compression ratio while providing a lossless or near-lossless signal representation, depending on the input signal.

9 DECODER COMPLEXITY

The computational complexity of SLS decoding can be evaluated by counting the total number of standard instructions (multiplications, additions, bit shifts, comparisons, memory transfers) required for performing the decoding process on a generic 32-bit fixed-point CPU. The main components contributing to the computational complexity of SLS are:

1) IntMDCT filter bank
2) Bit-plane arithmetic decoder
3) AAC Huffman decoding
4) AAC + SLS inverse error mapping
5) Integer M/S stereo coding
6) Unpacking of tables.

Items 3) to 5) are only required in the AAC-based mode, and item 6) only if the necessary tables are not precomputed.

9.1 Number of Instructions

Table 5 lists the number of instructions required for decoding in the AAC-based mode, with the AAC core operating at 64 kbps per channel. Table 6 shows the corresponding numbers for SLS in stand-alone mode (without the AAC core layer). For both tables, values are provided for implementations with and without pre-unpacked tables.
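To put these instruction counts into perspective, a rough estimate can be derived as follows, assuming that the "combined" figures of Table 5 are worst-case instructions per output sample and channel and that each instruction takes one cycle:

```python
# Worst-case combined instruction counts from Table 5 (AAC-based mode, frame length 4096/512).
OPS_PER_SAMPLE_PREUNPACKED = 270.86   # tables pre-unpacked
OPS_PER_SAMPLE_IN_PLACE    = 316.10   # tables unpacked in place

def mips_estimate(ops_per_sample, sample_rate_hz=48_000):
    """Million instructions per second for one channel at the given sampling rate."""
    return ops_per_sample * sample_rate_hz / 1e6

print(mips_estimate(OPS_PER_SAMPLE_PREUNPACKED))  # ~13.0 MIPS per channel at 48 kHz
print(mips_estimate(OPS_PER_SAMPLE_IN_PLACE))     # ~15.2 MIPS per channel at 48 kHz
```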
In addition, the scalability of the SLS technology facilitates the possibility that lower bitrate versions of the archive’s lossless audio items can be extracted at any time to allow applications such as remote data browsing. Broadcast Contribution/Distribution Chain In a broadcast environment HD-AAC audio coding technology could be used in all stages comprising archiving, contribution/distribution, and emission. In the broadcast chain one main feature of the technology can be used: In every stage where lower bit rates are required, the bit stream is merely truncated, and no reencoding is therefore required. Consumer Disc-Based Delivery HD-AAC technology can also be used in consumer disc-based delivery of music content. It enables the music disc to deliver both lossless and lossy audio on the same medium. Internet Delivery of Audio In such an application scenario the available transmission bandwidth can vary dramatically across different access network technologies and over time. As a result, the same audio content at a variety of bit rates and qualities may need to be kept ready at the server side. HD-AAC technology provides a “one-file” solution for such a requirement. Audio Streaming HD-AAC technology delivers the vital bit-rate scalability for streaming applications on channels with variable quality of service (QoS) conditions. Examples for this kind of streaming applications include Internet audio streaming and multicast streaming applications that feed several channels of differing capacity. Digital Home The idea of the digital home is to create an open and transparent home network platform that enables consumers to easily create, use, manage, and share digital content such as audio, video, or image. In a typical Table 5. Maximum numbers of INT32 operations per sample for SLS decoding with AAC core. Frame Length Muls Adds/Subs 4096 or 512 2048 or 256 1024 or 128 19.50 20.25 18.00 93.46 91.22 88.99 19.50 20.25 18.00 123.21 117.47 105.24 4096 or 512 2048 or 256 1024 or 128 Shifts Negs Movs Combined All ! 1 Cycle Tables Preunpacked 34.50 72.60 31.50 73.54 28.50 68.50 34.26 32.25 30.85 16.54 16.29 15.99 270.86 265.05 250.83 34.26 32.25 30.85 16.54 16.29 15.99 316.10 303.30 273.58 Ors Tables Unpacked in Place 34.50 88.01 31.50 85.54 28.50 75.00 Table 6. Maximum number of INT32 operations per sample for SLS stand-alone decoding. Shifts Negs Movs Combined All ! 1 cycle 86.63 84.39 82.16 Tables Preunpacked 34.50 67.10 31.50 68.04 28.50 63.00 29.93 27.92 26.52 9.54 9.29 8.99 245.70 239.89 225.67 116.38 110.64 98.41 Tables Unpacked in Place 34.50 82.6 31.50 80.04 28.50 69.50 29.93 27.92 26.52 9.54 9.29 8.99 290.05 278.14 248.42 Frame Length Muls Adds/Subs 4096 or 512 2048 or 256 1024 or 128 18 18.75 16.5 4096 or 512 2048 or 256 1024 or 128 18.00 18.75 16.50 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February Ors 39 GEIGER ET AL. setup for audio, the user can download the HD-AAC coded bit streams in lossless quality from the service provider and archive them on the home music server. These bit streams are then streamed, or downloaded to different audio terminals at differing quality for playback. 11. CONCLUSIONS The new ISO/MPEG specification for scalable lossless coding extends the well-known perceptual coding scheme AAC toward lossless and near-lossless operation, and in this way enables its use in the context of high-definition applications. The HD-AAC scheme offers competitive lossless compression rates at all relevant operating points (word length and sampling rate). 
For distribution on bandwidth-limited channels a perceptually coded compatible AAC bit stream can simply be extracted from the composite HD-AAC stream. Alternatively, the SLS part can also be used as a simple and versatile stand-alone compression engine. In both cases the fidelity of the signal representation can be scaled with fine granularity within a wide range of near lossless representations. This enables lossless or near lossless transmission of high-definition audio with a guaranteed maximum rate. We anticipate that this flexibility will make HD-AAC the technology of choice for many applications that call for both very high audio quality and delivery over a wide range of transmission channels. 12 ACKNOWLEDGMENT The authors would like to thank all their colleagues at the MPEG audio subgroup who supported the lossless standardization activity, especially Takehiro Moriya (NTT) for inspiring these standardization activities, Tilman Liebchen (Technical University of Berlin) for chairing the ad hoc group on lossless audio coding, and Yuriy Reznik (Real Networks/Qualcomm) for his thorough complexity evaluations. 13 REFERENCES [1] ISO/IEC 11172-3, “Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (1992). [2] ISO/IEC 13818-3, “Information Technology— Generic Coding of Moving Pictures and Associated Audio—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (1994). [3] ISO/IEC 14496-3:2001, “Coding of Audio-Visual Objects—Part 3: Audio,” International Standards Organization, Geneva, Switzerland (2001). [4] ISO/IEC 14496-3:2001/Amd.1:2003, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 1: Bandwidth Extension,” International Standards Organization, Geneva, Switzerland (2003). [5] M. Dietz, L. Liljeryd, K. Kjörling, and O. Kunz, “Spectral Band Replication—A Novel Approach in Audio 40 PAPERS Coding,” presented at the 112th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 50, pp. 509, 510 (2002 June), convention paper 5553. [6] ISO/IEC 14496-3:2001/Amd.2:2004, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 2: Parametric Coding for High Quality Audio,” International Standards Organization, Geneva, Switzerland (2004). [7] C. den Brinker, E. Schuijers, and W. Oomen, “Parametric Coding for High-Quality Audio,” presented at the 112th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 50, p. 510 (2002 June), convention paper 5554. [8] ISO/IEC FCD 23003-1, “MPEG-D (MPEG Audio Technologies)—Part 1: MPEG Surround” International Standards Organization, Geneva, Switzerland (2006). [9] J. Breebaart, J. Herre, C. Faller, J. Roeden, F. Myburg, S. Disch, H. Purnhagen, G. Hotho, M. Neusinger, K. Kjoerling, and W. Oomen, “MPEG Spatial Audio Coding/ MPEG Surround: Overview and Current Status,” presented at the 119th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 53, p. 1228 (2005 Dec.), convention paper 6599. [10] “Special Issue: High-Resolution Audio,” J. Audio Eng. Soc., vol. 52, pp. 116–260 (2004 Mar.). [11] “DVD Forum,” http://www.dvdforum.org/ forum.shtml (2004). [12] Royal Philips Electronics, “Super Audio CD Systems,” http://www.licensing.philips.com/information/ sacd/ (2006). [13] Blu-Ray Disc Association, “Blu-Ray Disc,” http:// www.blu-raydisc.com/ (2006). 
[14] Dolby Laboratories Inc., “MLP Lossless” http:// www.dolby.com/consumer/technology/mlp_lossless.html (2006). [15] M. A. Gerzon, P. G. Craven, J. R. Stuart, M. J. Law, and R. J. Wilson, “The MLP Lossless Compression System,” in Proc 17th AES Conf. (Florence, Italy, 1999 Sept.), pp. 61–75. [16] DTS Inc., “DTS HD,” http://www.dtsonline.com/ consumer/dtshd.php (2006). [17] Apple Computer Inc., “Apple Quicktime,” http:// www.apple.com/downloads/macosx/apple/quicktime651 .html (2006). [18] J. Coalson, “FLAC—Free Lossless Audio Codec,” http://flac.sourceforge.net (2006). [19] M. T. Ashland, “Monkey’s Audio—A Fast and Powerful Lossless Audio Compressor,” http://www .monkeysaudio.com (2004). [20] F. Ghido, “OptimFROG,” http://www.losslessaudio .org (2006). [21] ITU-T G.729.1, “G.729 Based Embedded Variable Bit-Rate Coder: An 8–32 kbit/s Scalable Wideband Coder Bitstream Interoperable with G.729,” International Telecommunication Union, Geneva, Switzerland (2006). [22] B. Grill, “A Bit-Rate Scalable Perceptual Coder for MPEG-4 Audio presented at the 103rd Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 15, p. 1005 (1997 Nov.), preprint 4620. [23] J. Herre, E. Allamanche, K. Brandenburg, M. J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS Dietz, B. Teichmann, B. Grill, A. Jin, T. Moriya, N. Iwakami, T. Norimatsu, M. Tsushima, and T. Ishikawa, “The Integrated Filterbank-Based Scalable MPEG-4 Audio Coder,” presented at the 105th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 46, p. 1039 (1998 Nov.), preprint 4810. [24] S. H. Park, Y. B. Kim, S. W. Kim, and Y. S. Seo, “Multi-Layer Bit-Sliced Bit-Rate Scalable Audio Coding,” presented at the 103rd Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p. 1005 (1997 Nov.), preprint 4520. [25] M. Nishiguchi, “MPEG-4 Speech Coding,” in Proc. 17th AES Conf. (Florence, Italy, 1999 Sept.), pp. 139–146. [26] M. Nishiguchi, A. Inoue, Y. Maeda, and J. Matsumoto, “Parametric Speech Coding—HVXC at 2.0–4.0 kbps,” presented at the IEEE Workshop on Speech Coding, Porvoo, Finland (1999 June). [27] M. S. Schroeder and B. S. Atal, “Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates,” in Proc. IEEE ICASSP (Tampa, FL, 1985 Mar.), pp. 937–940. [28] T. Nomura, M. Iwadare, M. Serizawa, and K. Ozawa, “A Bitrate and Bandwidth Scalable CELP Coder,” in Proc. IEEE ICASSP (Seattle, WA, 1998 May), pp. 341–344. [29] ISO/IEC JTC1/SC29/WG11, “Final Call for Proposals on MPEG-4 Lossless Audio Coding,” MPEG2002/ N5208, Shanghai, China (2002 Oct.). [30] ISO/IEC 14496-3:2001/Amd.6:2005, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 6: Lossless Coding of Oversampled Audio,” International Standards Organization, Geneva, Switzerland (2005) [31] E. Knapen, D. Reefman, E. Janssen, and F. Bruekers, “Lossless Compression of 1-Bit Audio,” J. Audio Eng. Soc., vol. 52, pp. 190–199 (2004 Mar.). [32] ISO/IEC 14496-3:2005/Amd.2:2006, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 2: Audio Lossless Coding (ALS), New Audio Profiles and BSAC Extensions,” International Standards Organization, Geneva, Switzerland (2006). [33] ISO/IEC 14496-3:2005/Amd.3:2006, “Coding of Audio-Visual Objects—Part 3: Audio, Amendment 3: Scalable Lossless Coding (SLS),” International Standards Organization, Geneva, Switzerland (2006). [34] J. Herre and H. Purnhagen, “General Audio Coding,” in The MPEG-4 Book, IMSC Multimedia Ser., F. Pereira and T. 
Ebrahimi, Eds. (Prentice-Hall, Englewood Cliffs, NJ, 2002). [35] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y. Oikawa, “ISO/IEC MPEG-2 Advanced Audio Coding,” J. Audio Eng. Soc., vol. 45, pp. 789–814 (1997 Oct.). [36] ISO/IEC JTC1/SC29/WG11, “Report on the MPEG-2 AAC Stereo Verification Tests,” MPEG1998/ N2006, San Jose, CA (1998 Feb.). [37] J. Princen, A. Johnson, and A. Bradley, “Subband/ Transform Coding Using Filter Bank Designs Based on J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February ISO/IEC MPEG-4 SCALABLE AAC Time Domain Aliasing Cancellation,” in Proc IEEE ICASSP (Dallas, TX, 1987), pp. 2161–2164. [38] J. Herre and J. D. Johnston, “Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS),” presented at the 101st Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.), preprint 4384. [39] J. D. Johnston, J. Herre, M. Davis, and U. Gbur, “MPEG-2 NBC Audio—Stereo and Multichannel Coding Methods,” presented at the 101st Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 44, p. 1175 (1996 Dec.), preprint 4383. [40] R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, “Audio Coding Based on Integer Transforms,” presented at the 111th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 49, p. 1230 (2001 Dec.), convention paper 5471. [41] R. Geiger, Y. Yokotani, and G. Schuller, “Improved Integer Transforms for Lossless Audio Coding,” in Proc. 37th Asilomar Conf. on Signals, Systems and Computers (Pacific Grove, CA, 2003 Nov.). [42] R. Geiger, J. Herre, J. Koller, and K. Brandenburg, “IntMDCT—A Link between Perceptual and Lossless Audio Coding,” in Proc. IEEE ICASSP (Orlando, FL, 2002 May). [43] R. Yu, C. C. Ko, S. Rahardja, and X. Lin, “BitPlane Golomb Code for Sources with Laplacian Distributions,” in Proc. IEEE ICASSP (Hong Kong, China, 2003 Apr.), pp. 277–280. [44] R. Yu, X. Lin, S. Rahardja, C. C. Ko, and H. Huang, “Improving Coding Efficiency for MPEG-4 Audio Scalable Lossless Coding,” in Proc. IEEE ICASSP (Philadelphia, PA, 2005 May). [45] I. Daubechies, and W. Sweldens, “Factoring Wavelet Transforms into Lifting Steps,” Tech. Rep. Bell Laboratories, Lucent Technologies (1996). [46] F. Bruekers and A. Enden, “New Networks for Perfect Inversion and Perfect Reconstruction,” IEEE J. Selected Areas Comm., vol. 10, pp. 130–137 (1992 Jan). [47] R. Geiger, Y. Yokotani, G. Schuller, and J. Herre, “Improved Integer Transforms Using Multi-Dimensional Lifting,” in Proc. IEEE ICASSP (Montreal, Canada, 2004 May). [48] Y. Yokotani, R. Geiger, G. Schuller, S. Oraintara, and K. R. Rao, “Improved Lossless Audio Coding Using the Noise-Shaped IntMDCT,” presented at the IEEE 11th DSP Workshop, Taos Ski Valley, NM (2004 Aug). [49] R. Yu, X. Lin, S. Rahardja, and H. Haibin, “Proposed Core Experiment for Improving Coding Efficiency in MPEG-4 Audio Scalable Coding (SLS),” ISO/IEC JTC1/SC29/WG11, M10683, Munich, Germany (2004 Mar.) [50] R. Yu, X. Lin, S. Rahardja, and C. C. Ko, “A Statistics Study of the MDCT Coefficient Distribution for Audio,” in Proc. ICME (Taipei, Taiwan, 2004 June). [51] K. H. Choo, E. Oh, J. H. Kim, and C. Y. Son, “Enhanced Performance in the Functionality of Fine Grain Scalability,” presented at the 119th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), 41 GEIGER ET AL. PAPERS vol. 53, pp. 
[52] ISO/IEC 14496-1:2004, "Coding of Audio-Visual Objects—Part 1: Systems," International Standards Organization, Geneva, Switzerland (2004).
[53] M. Hans and R. W. Schafer, "Lossless Compression of Digital Audio," IEEE Signal Processing Mag., vol. 18, pp. 21–32 (2001 July).
[54] AES Technical Committee on Coding of Audio Signals, "Perceptual Audio Coders: What to Listen For," CD-ROM with tutorial information and audio examples, Audio Engineering Society, New York (2001).
[55] ITU-R BS.1548-1, "User Requirements for Audio Coding Systems for Digital Broadcasting," International Telecommunication Union, Geneva, Switzerland (2001–2002).
[56] ITU-R BS.1116-1, "Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems," International Telecommunication Union, Geneva, Switzerland (1994).
[57] ITU-R BS.1387-1, "Method for Objective Measurements of Perceived Audio Quality," International Telecommunication Union, Geneva, Switzerland (1998).
[58] R. Geiger, M. Schmidt, J. Herre, and R. Yu, "MPEG-4 SLS—Lossless and Near-Lossless Audio Coding Based on MPEG-4 AAC," presented at the International Symposium on Communications, Control and Signal Processing, Marrakech, Morocco (2006 Mar.).
[59] ISO/IEC JTC1/SC29/WG11, "Verification Report on MPEG-4 SLS," MPEG2005/N7687, Nice, France (2005 Oct.).
[60] ISO/IEC JTC1/SC29/WG11, "Revised Report on Complexity of MPEG-2 AAC Tools," MPEG1999/N2957, Melbourne, Australia (1999 Oct.), http://www.chiariglione.org/mpeg/working_documents/mpeg-02/audio/AAC_tool_complexity(rev).zip

THE AUTHORS

Ralf Geiger received a diploma degree in mathematics from the University of Regensburg, Regensburg, Germany, in 1997. In 1998 he joined the Audio/Multimedia Department at the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany. From 2000 to 2004 he was with the Fraunhofer Institute for Digital Media Technology (IDMT), Ilmenau, Germany. In 2005 he returned to Fraunhofer IIS. Mr. Geiger is working on the development and standardization of perceptual and lossless audio coding schemes. He is a coeditor of the ISO/IEC standard "MPEG-4 Scalable Lossless Coding (SLS)."
●
Rongshan Yu received a B.Eng. degree from Shanghai Jiaotong University, Shanghai, P. R. China, in 1995, and M.Eng. and Ph.D. degrees from the National University of Singapore in 2000 and 2005, respectively. He was with the Centre for Signal Processing, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, from 1999 to 2001, and with the Institute for Infocomm Research (I2R), A*STAR, Singapore, from 2001 to 2005. He is currently with Dolby Laboratories, San Francisco, CA, USA. His research interests include audio coding, data compression, and digital signal processing. Dr. Yu received the Tan Kah Kee Young Inventors' Gold Award (Open Category) in 2003. He has been working on the development and standardization of MPEG lossless audio coding schemes and coedited the ISO/IEC standard "MPEG-4 Scalable Lossless Coding (SLS)."
●
Jürgen Herre joined the Fraunhofer Institute for Integrated Circuits (IIS) in Erlangen, Germany, in 1989. Since then he has been involved in the development of perceptual coding algorithms for high-quality audio, including the well-known ISO/MPEG-Audio Layer III coder (aka MP3).
In 1995 he joined Bell Laboratories for a postdoctoral term, working on the development of MPEG-2 advanced audio coding (AAC). Since the end of 1996 he has been back at Fraunhofer, working on the development of advanced multimedia technology, including MPEG-4, MPEG-7, and secure delivery of audiovisual content, currently as the chief scientist for the audio/multimedia activities at Fraunhofer IIS, Erlangen. Dr. Herre is a fellow of the Audio Engineering Society, cochair of the AES Technical Committee on Coding of Audio Signals, and vice chair of the AES Technical Council. He is also an IEEE senior member, served as an associate editor of the IEEE Transactions on Speech and Audio Processing, and is an active member of the MPEG audio subgroup. Outside his professional life he is a dedicated amateur musician.
●
Susanto Rahardja received a Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore. He joined the Centre for Signal Processing, Nanyang Technological University, in 1996 and has been a faculty member at its School of Electrical and Electronic Engineering since 2001. In 2002 he joined the Institute for Infocomm Research (I2R) and is currently the director of its Media Division, overseeing research on signal processing (audio coding, video/image processing), media analysis (text/speech, image, video), media security (biometrics, computer vision, and surveillance), and sensor networks. Dr. Rahardja has published more than 150 papers in international journals and at conferences in the areas of digital communications, signal processing, and logic synthesis. He was the recipient of the IEE Hartree Premium Award for the best journal paper published in the IEE Proceedings in 2001 and the Tan Kah Kee Young Inventors' Gold Award (Open Category) in 2003. He has served in various IEEE- and SPIE-related professional activities in the area of multimedia and is actively involved in technical committees of the IEEE Circuits and Systems and Signal Processing societies. He is an associate editor of the Journal of Visual Communication and Image Representation. Since 2002 he has been an active participant in MPEG and has been involved in the development of MPEG-4 Audio, where he is one of the contributors to MPEG-4 SLS and ALS.
●
Sang-Wook Kim received a B.A. degree in electrical engineering from Yonsei University, Korea, in 1989. He completed his master's degree in 1991 at the Korea Advanced Institute of Science and Technology, where he researched digital signal processing.
●
Xiao Lin received a Ph.D. degree from the Electronics and Computer Science Department of the University of Southampton, Southampton, UK, in 1993. He worked at the Centre for Signal Processing, Nanyang Technological University, Singapore, as a research fellow and senior research fellow for about five years. He subsequently joined DeSOC Technology as a technical director and then, in 2002, the Institute for Infocomm Research, where he was a member of technical staff, lead scientist, and principal scientist and managed the Media Processing Department until 2006. He is now a senior director with Fortemedia Inc. He actively participated in international standardization activities such as MPEG-4, JPEG2000, and JVT. Dr. Lin is a senior member of the IEEE.
●
Markus Schmidt received a Dipl.-Ing. degree in media technology from the Technical University of Ilmenau, Germany, in 2004.
During his studies he spent a year at the University of Strathclyde in Glasgow, Scotland. After completing an internship at the Fraunhofer Institute for Digital Media Technology (IDMT) in Ilmenau, he joined the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany, in 2005. His research interests include low-delay and lossless audio coding schemes and their implementation in real-time environments.