Fundamentals of Multimedia
Ze-Nian Li, Mark S. Drew
Simon Fraser University, Canada

Chapter 4. Audio and Video Compression
A Guide to the MPEG-1 Audio Standard

Big Picture of the MPEG-1 Audio Standard
The Psychoacoustic Models 1 and 2
The Layer I Scheme
The Layer II Scheme and Enhancements
The Layer III Scheme and Enhancements
Conclusion and Prospects of the MPEG Audio Standard

I. Big Picture of the MPEG-1 Audio Standard

1. Rationale for MPEG-1 Audio Compression
The rationale for lossy audio compression can be expressed in one sentence: quantization inevitably generates quantization noise, but this noise can be rendered inaudible by exploiting the masking effect. Masking is a perceptual property of the human auditory system: the presence of a strong audio signal makes weaker audio signals in its temporal or spectral neighborhood imperceptible, as shown in Fig 1.

Fig 1 Example of frequency-domain masking

2. Basic Scheme
i) The noise spectrum is shaped by decomposing the audio signal into 32 equal-width subbands (Layer III further refines this decomposition).
ii) The signal in each subband is adaptively quantized and encoded such that the noise introduced stays below the masking threshold.
iii) The masking threshold function is obtained from Psychoacoustic Model 1 or Model 2.

Fig 2 A block diagram of the MPEG-1 audio encoder

3. Compression Mechanism
There are three layers of compression: Layers I, II, and III. Each is a distinct compression scheme with enhanced output audio quality and increasing complexity.
Relationship between layers and psychoacoustic models: Model 1 is valid for Layers I and II and can also be used for Layer III, but Layer III is normally based on Model 2.

II. The Psychoacoustic Models 1 and 2

1. General Function of the Psychoacoustic Models
a) Computation of the masking curve according to the frequency-masking properties of the human ear. This requires a very accurate spectral analysis of the input signal.
b) From the masking curve, a set of masking thresholds is derived, one for each subband. Each threshold determines the maximum quantization-noise energy acceptable in that subband (below this level the noise will not be perceived).
c) For low bit rates, instead of requiring the quantization noise to be below the masking thresholds, the psychoacoustic model uses an iterative algorithm that allocates more bits to the subbands where increased resolution provides the greatest benefit.

2. Implementation of Psychoacoustic Model 1
a) The spectrum of the input signal is computed with an FFT of length 512 samples for Layer I and 1024 samples for Layer II.

Fig 3 Fourier power spectrum of an audio signal

b) The sound pressure level (SPL) in each subband is computed.
c) The threshold in quiet (absolute threshold) is also provided, as shown in Fig 4.

Fig 4 Absolute threshold of hearing for Model 1 in Layer I

d) The tonal and nontonal components are extracted from the FFT power spectrum, since both influence the masking threshold within a critical band. For calculating the global masking threshold, Model 1
(1) identifies tonal components by locating local maxima (peaks) among neighboring spectral lines, and
(2) sums the remaining spectral values into a single nontonal component (placed close to the geometric mean frequency) for each critical band.
A minimal sketch of step (1) follows below.

Fig 5 The local maxima of (a) tonal and (b) nontonal components on the Fourier power spectrum
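To make step (1) concrete, here is a minimal Python sketch of tonal-masker detection, assuming a power spectrum X already expressed in dB. The 7 dB criterion and the general shape of the test come from the standard; the exact neighbor-offset ranges are tabulated in ISO/IEC 11172-3, so the ranges used here are only illustrative.

import numpy as np

def tonal_components(X, delta=7.0):
    """Identify tonal maskers in a 512-point FFT power spectrum X (dB).

    A line k is a local maximum if X[k] > X[k-1] and X[k] >= X[k+1];
    it is declared tonal if it exceeds its neighbors at frequency-dependent
    offsets by at least `delta` dB.  The offset ranges below are illustrative;
    the exact ranges are tabulated in the standard.
    """
    tonal = []
    for k in range(3, 250):
        if not (X[k] > X[k - 1] and X[k] >= X[k + 1]):
            continue
        if k < 63:
            offsets = (-2, 2)
        elif k < 127:
            offsets = (-3, -2, 2, 3)
        else:
            offsets = (-6, -5, -4, -3, -2, 2, 3, 4, 5, 6)
        if all(X[k] - X[k + j] >= delta for j in offsets):
            # SPL of the tonal masker: power sum of the peak and its two neighbors
            spl = 10 * np.log10(10**(X[k - 1] / 10) + 10**(X[k] / 10)
                                + 10**(X[k + 1] / 10))
            tonal.append((k, spl))
    return tonal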
e) Decimation: a procedure that reduces the number of maskers considered for the determination of the global masking threshold. Only the tonal or nontonal components that exceed the absolute threshold are kept. In addition, of any components lying within 0.5 Bark of each other, only the one with the highest power is kept; the weaker ones are removed from the list of tonal or nontonal components.

Fig 6 Decimated list of (a) tonal components and (b) nontonal components

f) Individual masking thresholds of both tonal and nontonal components are obtained by adding the masking index and the masking function to the strength of the masking component (tonal or nontonal). For a tonal masker at spectral line j, the masking threshold LTtm(j, i) it induces at line i is given by

LTtm(j, i) = Xtm(j) + avtm(z(j)) + vf[z(i) - z(j), Xtm(j)]

z(.) - the function mapping spectral frequency to critical band rate (Bark)
Xtm(j) - the strength of the tonal component at frequency index j
avtm(z(j)) - the tonal masking index (provided in the standard)
vf[z(i) - z(j), Xtm(j)] - the masking function (provided in the standard)

Similarly, for a nontonal masking component:

LTnm(j, i) = Xnm(j) + avnm(z(j)) + vf[z(i) - z(j), Xnm(j)]

g) The global masking threshold LTg(i) at frequency sample i is derived by summing, in the power domain, the individual masking thresholds LTtm(j, i) and LTnm(j, i) and the threshold in quiet LTq(i):

LTg(i) = 10 log10 [ 10^(LTq(i)/10) + SUM_j 10^(LTtm(j, i)/10) + SUM_j 10^(LTnm(j, i)/10) ]

Next, the minimum masking threshold is determined for each subband as the minimum of all the global masking threshold values falling into that subband.

Fig 7 (a) Global masking thresholds; (b) minimum global masking thresholds

h) The minimum global masking threshold LTmin(n) in subband n is used for determining the signal-to-mask ratio (SMR):

SMRsb(n) = Lsb(n) - LTmin(n)

where Lsb(n) is the signal level in subband n.

Fig 8 Signal-to-mask ratio

Bit allocation in Layers I and II is based on the SMR for each subband (a sketch of the threshold summation and SMR computation follows below).
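The sketch below transcribes the LTg power-domain summation and the per-subband SMR directly from the formulas above. The helper names and the 8-lines-per-subband assumption (256 useful lines of a 512-point FFT spread over 32 subbands) are ours, not the standard's.

import numpy as np

def global_threshold(LTq, LTtm, LTnm):
    """Combine the threshold in quiet LTq (dB, one value per frequency sample)
    with the individual tonal thresholds LTtm and nontonal thresholds LTnm
    (each a list of dB arrays over the same frequency samples) by summing
    the corresponding powers -- a direct transcription of the LTg formula.
    """
    power = 10.0 ** (np.asarray(LTq) / 10.0)
    for LT in list(LTtm) + list(LTnm):
        power += 10.0 ** (np.asarray(LT) / 10.0)
    return 10.0 * np.log10(power)

def smr_per_subband(Lsb, LTg, lines_per_subband=8):
    """SMRsb(n) = Lsb(n) - LTmin(n): signal level minus the minimum global
    masking threshold among the frequency samples of subband n."""
    LTg = np.asarray(LTg)
    smr = []
    for n, L in enumerate(Lsb):
        lo = n * lines_per_subband
        LTmin = LTg[lo:lo + lines_per_subband].min()
        smr.append(L - LTmin)
    return smr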
3. Psychoacoustic Model 2
Model 2 can be used by Layers I and II, but is now mainly used by Layer III. More constraints are included.

a) Why Model 2?
Model 1 selects the minimum masking threshold within each subband. This approach works well for the lower-frequency subbands, where a subband is wide relative to a critical band. It can be inaccurate for the higher-frequency subbands, because a critical band in the high-frequency range spans several subbands (as shown in Fig 9).

Fig 9 Nonlinear critical bands measured by Scharf

Inaccuracies arise because Model 1 concentrates all nontonal components within each critical band into a single value at a single frequency. A subband that lies within a wide critical band but far from the concentrated nontonal component therefore does not get an accurate assessment of its nontonal masking. Model 2 differs on this point: it selects the minimum of the masking thresholds covered by a subband only where the subband is wide relative to the critical band, and it uses the average of the masking thresholds covered by the subband where the subband is narrow relative to the critical band. Model 2 thus achieves the same accuracy for the high-frequency subbands as for the lower-frequency ones, because it does not concentrate the nontonal components.

b) Model 2's improvements and adaptations for Layer III
i) Model 2 never actually separates tonal and nontonal components. Instead, the spectral data are transformed to a "partition" domain, designed to provide a resolution of either one frequency line or 1/3 of a critical band, whichever is wider. In each partition, a tonality index is computed as a function of frequency. This index measures whether the component is more tone-like or noise-like, and ultimately determines the amount of masking.
ii) The size of the FFT and of the Hanning window can be varied. In practice, Layer III computes the model twice in parallel, with FFTs of 192 samples (short block) and 576 samples (long block).
iii) Instead of the masking function of Model 1, a spreading function between neighboring critical bands is used. It reflects the fact that a sound stimulus leaves a trace of aftereffects that die out gradually (forward masking) and that the masking threshold can be changed by a subsequent stimulus (backward masking).
iv) The masking threshold at a given partition equals the convolved partitioned energy spectrum (obtained by mapping the auditory power spectrum into the partition domain and convolving it with the spreading function) multiplied by an attenuation factor.
v) The SMR is computed for either the subbands (in Layers I and II) or the scale factor bands (in Layer III). The SMRn sent to the coder is given by:

SMRn = 10 log10 (EPARTn / NPARTn)

n - the index of the coder partition
EPARTn - the energy in the scale factor band
NPARTn - the noise level in the scale factor band

III. The Layer I Scheme
Layer I is the basic coding algorithm. It codes audio in frames of 384 samples by grouping 12 samples from each of the 32 subbands. A Layer I frame contains only three kinds of components, as shown in Fig 10.

Fig 10 The data bit-stream structure of Layer I

Bit allocation indicates the number of bits used to code the 12 samples in a subband. The scale factor is a multiplier that sizes the samples to fully use the range of the quantizer.

1. Subband filtering
A polyphase implementation of the analysis filter bank is used, as shown in Fig 11 (a minimal code sketch follows after the next subsection):
a) 32 new audio samples are shifted into a 512-sample X buffer;
b) Windowing for time-domain aliasing cancellation (TDAC): z(i) = C(i) x(i), i = 0, 1, ..., 511;
c) Partial calculation: the windowed vector is subsampled at every 64th position and the resulting 8 samples are summed: Y(k) = SUM_{j=0..7} z(k + 64j), k = 0, ..., 63;
d) Matrixing: the partial sums are filtered by the 32-subband matrixing operation s(i) = SUM_{k=0..63} M(i, k) Y(k), with M(i, k) = cos[(2i + 1)(k - 16) pi / 64];
e) Output: 32 subband samples s(i), i = 0, ..., 31.

Fig 11 Polyphase implementation of the analysis filter bank

2. Psychoacoustic modeling
Either Model 1 or Model 2 can be used to determine the psychoacoustic parameters. Model 1, which requires an FFT of 512 samples, is sufficient for Layer I. The SMR is determined from the psychoacoustic model.
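Here is a minimal sketch of steps b) through e) of the polyphase analysis filter bank, assuming the 512-tap analysis window C tabulated in the standard is available as an array; only the structure of the computation is shown.

import numpy as np

def analysis_filterbank(x_buffer, C):
    """One pass of the Layer I/II polyphase analysis filter bank.

    x_buffer : the 512 most recent input samples (the X buffer)
    C        : the 512-tap analysis window from the standard
    Returns the 32 subband samples s(0..31).
    """
    # b) windowing: z(i) = C(i) * x(i)
    z = C * x_buffer
    # c) partial calculation: Y(k) = sum_{j=0}^{7} z(k + 64 j), k = 0..63
    Y = z.reshape(8, 64).sum(axis=0)
    # d) matrixing: s(i) = sum_k M(i,k) Y(k), M(i,k) = cos((2i+1)(k-16)pi/64)
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)
    return M @ Y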
3. Scale factor
The maximum of the absolute values of the 12 samples in each subband is determined. The next larger value in a lookup table is found, and its index is coded as the scale factor for those 12 samples. The scale factor section thus contains up to 32 six-bit indices into the scale factor table. The decoder multiplies the decoded quantizer output by the scale factor to recover the quantized subband values (see the second sketch at the end of this section).

4. Bit allocation
The basic concept of the bit allocation procedure is to maximize the minimum mask-to-noise ratio (MNR) over all subbands, under the constraint that the number of bits used must not exceed the number of bits available. The number of bits available to encode a frame, Bf, is determined from the bit rate, the frame length (384 samples for Layer I), and the sampling rate fs:

Bf = bit rate * 384 / fs (bits/frame)

The bit allocation procedure is an iterative process that starts with zero bits allocated. The algorithm computes the MNR for each subband:

MNR = SNR - SMR (dB)

where SNR is given in the standard and SMR is provided by the psychoacoustic model. It then finds the subband with the lowest MNR whose bit allocation has not yet reached its maximum limit, increases that subband's bit allocation by one level, and subtracts the number of additional bits required from the available bit pool. The process is repeated until all the available bits have been used or all the subbands have reached their maximum limit (a sketch follows at the end of this section).

5. Quantization and encoding
Each subband sample Si is normalized by dividing it by the scale factor scf and quantized using the following formula:

Sqi = A (Si / scf) + B

where A and B are constants given in the standard for each quantizer, Sqi is the quantized sample in the subband, and the N most significant bits needed to encode the quantizer's steps are transmitted.

6. Bit-stream formatting
The encoded subband information is multiplexed into frames. Bit-stream formatting is the final procedure and performs no additional coding. A frame is composed of an integer number of slots so that the mean bit rate can be adjusted. In Layer I a slot equals 32 bits, while in Layers II and III a slot equals 8 bits. The number of slots in a frame is obtained by dividing the total number of bits available (Bf) by the number of bits in a slot:

number of slots = Bf / 32 for Layer I
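The greedy bit-allocation loop of subsection 4 can be sketched as follows. The cost model is deliberately simplified (a real encoder also accounts for the bit-allocation codes and for grouped quantization), and snr_table stands in for the SNR table of the standard; any monotone table exercises the loop.

def allocate_bits(smr, snr_table, bit_pool):
    """Greedy Layer I bit allocation maximizing the minimum MNR.

    smr       : list of SMR values (dB), one per subband
    snr_table : snr_table[a] = SNR (dB) of allocation level a; in Layer I
                level a > 0 means a+1 bits per sample, level 0 means no bits
    bit_pool  : Bf minus header and side-information bits
    """
    MAX_LEVEL = 14                       # the 15-bit quantizer is the ceiling
    alloc = [0] * len(smr)

    def cost(a):                         # extra bits to go from level a to a+1
        per_sample = 2 if a == 0 else 1  # the first step buys a 2-bit quantizer
        scf = 6 if a == 0 else 0         # a newly coded subband needs a scale factor
        return 12 * per_sample + scf     # 12 samples per subband per frame

    while True:
        live = [n for n in range(len(smr))
                if alloc[n] < MAX_LEVEL and cost(alloc[n]) <= bit_pool]
        if not live:
            break
        # the subband with the lowest mask-to-noise ratio gets the next bits
        worst = min(live, key=lambda n: snr_table[alloc[n]] - smr[n])
        bit_pool -= cost(alloc[worst])
        alloc[worst] += 1
    return alloc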
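And here is a sketch of scale-factor selection (subsection 3) and quantization (subsection 5) under stated assumptions: the scale factor table is reproduced only by its 2^(1/3) progression, and A and B follow the pattern of the standard's tables rather than being copied from them.

SCF_TABLE = [2.0 ** (1 - i / 3.0) for i in range(63)]  # illustrative: the real
# 63-entry table in the standard also follows a 2^(1/3) progression from 2.0

def scale_factor_index(samples):
    """Index of the smallest table value >= max |sample| of the 12-sample block."""
    peak = max(abs(s) for s in samples)
    for idx in range(len(SCF_TABLE) - 1, -1, -1):
        if SCF_TABLE[idx] >= peak:
            return idx       # table is decreasing, so this is the tightest fit
    return 0

def quantize_block(samples, nbits):
    """Quantize 12 subband samples with an nbits quantizer: normalize by the
    scale factor, apply Sq = A*(S/scf) + B, and map onto 2**nbits codes
    (the standard keeps the nbits most significant bits)."""
    steps = 2 ** nbits - 1
    A = steps / 2 ** nbits               # e.g. 3/4 for nbits = 2
    B = -1.0 / 2 ** nbits                # e.g. -1/4 for nbits = 2
    idx = scale_factor_index(samples)
    scf = SCF_TABLE[idx]
    codes = []
    for s in samples:
        q = A * (s / scf) + B            # q lies in [-1, 1)
        code = int((q + 1.0) / 2.0 * 2 ** nbits)
        codes.append(min(code, steps))
    return idx, codes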
IV. The Layer II Scheme and Enhancements
Layer II follows basically the same rules for coding and decoding as Layer I. The main difference: Layer II exploits the correlation between consecutive groups of subband samples and carries information for 1152 samples (3 * 12 * 32) per frame. The significant savings in bit rate result from increasing the frame size to 36 samples per subband, so that the fixed side information is amortized over three times as many samples. Finally, the Layer II bit stream adds scale factor select information.

Fig 12 The data bit-stream structure of Layer II (upper) and the structure of Layer II subband samples (lower)

Both Psychoacoustic Models 1 and 2 can be used; either provides the SMR for every subband.

1. Coding of scale factors
The same analysis and synthesis filters as in Layer I are used, and the technique for calculating the scale factor is also the same. The difference: a frame corresponds to 36 (3 * 12) subband samples per subband (12 granules, as in Fig 12) and contains up to three scale factors per subband (one per 12 consecutive samples).
2. The bit allocation, quantization, and encoding of Layer II are basically the same as in Layer I.

V. The Layer III Scheme and Enhancements
Layer III coding is much more sophisticated than Layers I and II, in the following senses:
i) Additional frequency resolution is provided by a hybrid filter bank.
ii) A nonuniform quantizer is applied.
iii) Entropy coding (Huffman coding) is introduced.
iv) Iteration loops for psychoacoustic modeling and bit allocation are elaborated.
Fig 13 shows a block diagram of the hybrid filter bank (the equal-bandwidth filter bank used by Layers I and II, followed by an MDCT).

Fig 13 A hybrid filter bank for MPEG audio Layer III

Layer III specifies two different MDCT block lengths, a long block of 36 samples or a short block of 12 samples, as a trade-off between time and frequency resolution. The short block improves the time resolution to cope with transients (abruptly changing portions of the signal). Switching between long and short blocks is not instantaneous: transition windows (long-to-short and short-to-long) are applied, as shown in Fig 14.

Fig 14 Long-block (36 samples) and short-block (12 samples) window responses

The decision whether the filter bank should switch to short or long windows is derived from the masking thresholds by estimating the psychoacoustic entropy (PE): if PE >= 1800, the block needs to be shorter. In this way the block length tracks the activity of the signal for efficient coding.

Major enhancements and distinctions of Layer III over Layers I and II:

1. Aliasing reduction: only long blocks are input to the aliasing-reduction procedure. The MDCT produces 18 coefficients from the 36 input samples of each subband. Between each two adjacent sets of 18 coefficients, butterfly operations are performed as shown in Fig 15, where i represents the distance from the last line of the previous block to the first line of the current block. Eight butterflies are defined, with different weighting factors csi and cai.

Fig 15 The aliasing butterfly for the Layer III encoder and decoder

2. Noise allocation: the bit allocation process used in Layers I and II only allocates the available bits and merely approximates the amount of noise caused by quantization. Layer III introduces a noise-allocation iteration loop: the inner loop quantizes the input vector and increases the quantizer step size until the output vector can be coded with the available number of bits; the outer loop calculates the distortion in each scale factor band (a set of frequency lines scaled by one scale factor). If the allowed distortion is exceeded, the scale factor band (critical band) is amplified and the inner loop is run again.

3. Nonuniform quantization: the quantizer follows a power law, raising input samples to the 3/4 power to provide a more consistent SNR over the quantizer's range. The decoder raises the Huffman-decoded output samples to the 4/3 power.

4. Bit reservoir: the coded data do not necessarily fit into a fixed-length frame; slots are still used for adjusting the bit rate. Bits are saved in the reservoir when fewer bits than average are needed to code a granule, and if bits are left over in a frame, the remaining space can be used by the data of following frames. The encoder can thus donate bits to, or borrow bits from, the reservoir when appropriate. The fill state of the bit reservoir is conveyed in the side information of the bit stream.

VI. Conclusion and Prospects of the MPEG Audio Standard
The MPEG audio coding algorithm is the first international standard for the compression of audio signals. It can be applied to streams that combine audio and video or to audio-only streams. The standard achieves compression by exploiting the spectral and temporal masking effects of the ear: it uses subband coding and psychoacoustic models to eliminate information that is irrelevant from a human sound-perception viewpoint.

The MPEG audio standard has continued to evolve into more sophisticated editions since its adoption as a standard by ISO/IEC at the end of 1992. MPEG-2 is an extension of MPEG-1 with added features: multichannel input (5.1 channels), multilingual audio support (up to 8 commentary channels), lower bit rates (down to 8 kbit/s), and additional sampling rates (16, 22.05, and 24 kHz, in addition to MPEG-1's 32, 44.1, and 48 kHz). MPEG AAC has since been developed and is expected to be adopted as an international standard; it will constitute the kernel of the forthcoming MPEG-4 audio standard. Work on MPEG-4 is expected to be completed by the end of 1998.

http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap4/mpeg-audio/chap4.html