
Fundamentals of Multimedia
Ze-Nian Li, Mark S. Drew
Simon Fraser University, Canada.
Chapter 4.
Audio and Video Compression
A Guide to MPEG-1 Audio Standard
Big Picture about MPEG-1 Audio Standard
The Psychoacoustic Model 1 and Model 2
The Layer I Scheme
The Layer II Scheme and Enhancement
The Layer III Scheme and Enhancement
Conclusion and Prospects of the MPEG Audio Standard
I. Big Picture About MPEG-1 Audio Standard
1. Rationale for MPEG-1 Audio Compression
The rationale of lossy audio compression can be expressed in one sentence: audio
quantization inevitably generates quantization noise, but this noise can be rendered
imperceptible by exploiting the masking effect. Masking is a perceptual property of the
human auditory system: the presence of a strong audio signal makes weaker audio
signals in its temporal or spectral neighborhood imperceptible, as shown in Fig 1.
Figure 1. Example of frequency-domain masking
2. Basic Scheme
i) The noise spectrum is shaped by decomposing the audio signal into 32 equal-width
subbands (different for Layer III).
ii) Signals in each subband are adaptively quantized and encoded such that the noise
introduced is below the masking threshold.
iii) The masking threshold function is obtained from Psychoacoustic Model 1 or Model 2.
Figure 2. A block diagram of MPEG-1 audio encoder
3. Compression Mechanism
There are three layers of compression: Layers I, II, and III. Each successive layer is a
distinct compression scheme that delivers better output audio quality at the cost of
increased complexity.
Relationship between layers and psychoacoustic models: Model 1 applies to Layers I
and II and can also be used for Layer III, but Layer III is primarily based on Model 2.
II. The Psychoacoustic Model 1 and Model 2
1. General Function of Psychoacoustic Models
a) Computation of the masking curve according to the frequency-masking properties of
the human ear. This requires a very accurate spectral analysis of the input signal.
b) From the masking curve, a set of masking thresholds is derived for each subband.
Each threshold determines the maximum quantization-noise energy acceptable in that
subband (i.e., below this level the noise will not be perceived).
c) For low bit rates, instead of requiring the quantization noise to be below the masking
thresholds, the psychoacoustic model uses an iterative algorithm that allocates more
bits to the subbands where increased resolution provides the greatest benefit.
2. Implementation of Psychoacoustic Model 1
a) The spectrum of the input signal is computed, with an FFT length of 512 samples
for Layer I and 1024 samples for Layer II.
Fig 3 Fourier power spectrum of an audio signal
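A minimal sketch of this spectral-analysis step in Python (the Hann window and the
+96 dB normalization are illustrative assumptions; the standard specifies its own
window and scaling):

    import numpy as np

    def power_spectrum_db(frame):
        # One analysis frame: 512 samples (Layer I) or 1024 samples (Layer II)
        n = len(frame)
        window = np.hanning(n)                   # reduce spectral leakage
        spectrum = np.fft.rfft(frame * window)   # one-sided FFT
        power = (np.abs(spectrum) ** 2) / n
        # Map to dB; the +96 offset pins full scale near 96 dB SPL
        # (an illustrative normalization, not the standard's exact scaling).
        return 10.0 * np.log10(power + 1e-12) + 96.0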
b) The sound pressure level (SPL) in each subband is computed.
c) The threshold in quiet (absolute threshold) is also provided, as shown in Fig 4.
Figure 4 Absolute threshold of hearing for Model 1 in Layer I
d) The tonal and nontonal components are extracted from the FFT power spectrum,
since they influence the masking threshold in the critical band. For calculating the
global masking threshold, Model 1 (1) identifies tonal components by determining the
local maxima (peaks) among the neighboring spectral lines, and (2) sums the remaining
spectral values into a single nontonal component per critical band, placed at the spectral
line closest to the geometric mean of that band. A sketch of the peak-picking step is
given below Fig 5.
Fig 5 The local maxima of tonal components (a) and nontonal components (b) on the
Fourier power spectrum
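A minimal sketch of the peak-picking idea, assuming a simplified rule (a spectral line
is tonal if it is a local maximum and exceeds the lines two bins away by 7 dB); the
standard applies the 7 dB test over wider, frequency-dependent neighborhoods:

    def find_tonal_components(spl_db, margin_db=7.0):
        # Return indices of spectral lines judged tonal (local maxima that
        # stand out from their neighborhood by margin_db).
        tonal = []
        for k in range(2, len(spl_db) - 2):
            is_peak = spl_db[k] > spl_db[k - 1] and spl_db[k] >= spl_db[k + 1]
            # Simplified two-bin neighborhood; the standard uses wider,
            # frequency-dependent search ranges.
            stands_out = (spl_db[k] - spl_db[k - 2] >= margin_db and
                          spl_db[k] - spl_db[k + 2] >= margin_db)
            if is_peak and stands_out:
                tonal.append(k)
        return tonal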
e) Decimation - a procedure to reduce the number of maskers considered in
determining the global masking threshold. Only the tonal or nontonal components that
exceed the absolute threshold are kept. In addition, a component is removed from the
list of tonal or nontonal components if a stronger component lies within a distance of
less than 0.5 Bark.
Fig 6 Decimated list of tonal components (a) and nontonal components (b)
f) Individual masking thresholds of both tonal and nontonal components are obtained
by adding the masking index and the masking function to the strength of the masking
component (tonal or nontonal).
For a tonal component j, the masking threshold LTtm(j, i) at critical band rate z(i) is
given by
LTtm(j, i) = Xtm(j) + avtm(z(j)) + vf[z(i) - z(j), Xtm(j)]
where
z(j) - the function mapping spectral frequency to critical band rate
Xtm(j) - the strength of the tonal component at frequency index j
avtm(z(j)) - the tonal masking index (provided in the standard)
vf[z(i) - z(j), Xtm(j)] - the masking function (provided in the standard)
Similarly, for a nontonal masking component:
LTnm(j, i) = Xnm(j) + avnm(z(j)) + vf[z(i) - z(j), Xnm(j)]
The global masking threshold LTG(i) for a frequency sample i is derived by summing
the powers corresponding to the individual masking thresholds LTtm(j, i) and
LTnm(j, i) and the threshold in quiet LTq(i):
LTG(i) = 10 log10 [ 10^(LTq(i)/10) + Σj 10^(LTtm(j, i)/10) + Σj 10^(LTnm(j, i)/10) ]
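A minimal sketch of this power-domain summation (the array shapes are assumptions
of the sketch):

    import numpy as np

    def global_masking_threshold(lt_q, lt_tm, lt_nm):
        # All inputs in dB. lt_q: (n_lines,) threshold in quiet;
        # lt_tm: (n_tonal, n_lines); lt_nm: (n_nontonal, n_lines).
        power = (10.0 ** (lt_q / 10.0)
                 + (10.0 ** (lt_tm / 10.0)).sum(axis=0)
                 + (10.0 ** (lt_nm / 10.0)).sum(axis=0))
        return 10.0 * np.log10(power)   # back to dB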
g) Next, the minimum masking threshold function is determined for each subband
from the minimum of all the global masking thresholds contributing to that subband.
Fig 7 (a): global masking thresholds; (b): minimum global masking thresholds
h) The minimum global masking threshold LTmin(n) in subband n is used to
determine the signal-to-mask ratio (SMR), which is given by
SMRsb(n) = Lsb(n) - LTmin(n)
where Lsb(n) is the level of the signal component in subband n.
Fig 8 Signal-to-mask ratio
Bit allocation in Layers I and II is based on the SMR for each subband.
3. Psychoacoustic Model 2
Model 2 can be used by Layers I and II, but is now mainly used by Layer III. More
constraints are included.
a) Why Model 2?
Model 1 selects the minimum masking threshold within each subband. This approach
works well for the lower-frequency subbands, where a subband is wide relative to a
critical band, but it can be inaccurate for the higher-frequency subbands, because
critical bands in the high-frequency range span several subbands (as shown in Fig 9).
Fig 9 Nonlinear critical bands measured by Scharf
The inaccuracy arises because Model 1 concentrates all nontonal components within
each critical band into a single value at a single frequency. A subband that lies within a
wide critical band but far from the concentrated nontonal component will not get an
accurate assessment of its nontonal masking.
Model 2 differs on this point. It selects the minimum of the masking thresholds
covered by a subband only where the subband is wide relative to the critical band;
where the subband is narrow relative to the critical band, it uses the average of the
masking thresholds covered by the subband. Model 2 therefore achieves the same
accuracy for the high-frequency subbands as for the lower-frequency ones, because it
does not concentrate the nontonal components.
b) Model 2's improvements and adaptations for Layer III
i) Model 2 never actually separates tonal and nontonal components. Instead, the
spectral data are transformed to a "partition" domain, designed to provide a resolution
of either one frequency line or 1/3 of a critical band, whichever is wider.
In each partition, a tonality index is computed as a function of frequency. This index
measures whether a component is more tone-like or noise-like, and ultimately
determines the amount of masking (a simplified sketch is given below).
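As a rough illustration only, the sketch below estimates tonality from the spectral
flatness measure. This is a simplification: the actual Model 2 derives its tonality index
from an unpredictability measure, predicting each line's magnitude and phase from the
two previous analysis frames.

    import numpy as np

    def tonality_index(power):
        # Spectral-flatness proxy in [0, 1]: 1 = tone-like, 0 = noise-like.
        p = np.asarray(power, dtype=float) + 1e-12
        sfm_db = 10.0 * np.log10(np.exp(np.mean(np.log(p))) / np.mean(p))
        return float(min(sfm_db / -60.0, 1.0))   # -60 dB flatness ~ pure tone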
ii) The sizes of the FFT and the Hanning window can be varied. In practice, Layer III
computes the model twice in parallel, with FFTs of 192 samples (short block) and 576
samples (long block).
iii) Instead of using the masking function of Model 1, a spreading function is applied
between neighboring critical bands. The function reflects the fact that a sound stimulus
leaves a trace of aftereffects that die out gradually (forward masking), and that a
stimulus can even raise the masking threshold for sounds that precede it (backward
masking).
iv) The masking threshold at any given partition equals the convolved partitioned
energy spectrum (obtained by mapping the audio power spectrum into the partition
domain and convolving it with the spreading function), multiplied by an attenuation
factor.
v) The SMR is computed for either the subbands (in Layers I and II) or the scale factor
bands (in Layer III). The SMRn sent to the coder is given by
SMRn = 10 log10 (EPARTn / NPARTn)
where EPARTn is the energy in the scale factor band, NPARTn is the noise level in the
scale factor band, and n is the index of the coder partition.
III. The Layer I Scheme
Layer I is the basic coding algorithm. It codes audio in frames of 384 samples,
grouping 12 samples from each of the 32 subbands. A Layer I frame contains only
three kinds of components, as shown in Fig 10.
Fig 10 The data bit-stream structure of Layer I
Bit allocation indicates the number of bits used to code the 12 samples in a subband.
The scale factor is a multiplier that scales the samples to fully use the range of the
quantizer.
1. Subband filtering
A polyphase implementation of the analysis filter bank is shown in Fig 11.
a) 32 audio samples are shifted into a 512-sample X buffer.
b) Windowing for time-domain aliasing cancellation (TDAC):
z(i) = C(i) x(i), i = 0, 1, ..., 511
c) Partial calculation: the windowed data vector is subsampled at every 64th sample
and the 8 samples are summed (formula a), then filtered by 32-subband matrixing
(formula b):
(a) y(k) = Σj z(k + 64j), j = 0, ..., 7; k = 0, 1, ..., 63
(b) s(i) = Σk M(i, k) y(k), with M(i, k) = cos[(2i + 1)(k - 16)π/64], i = 0, ..., 31
d) Output: 32 subband samples s(i).
Fig 11 Polyphase implementation of the analysis filter bank
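A compact sketch of these steps, assuming the 512 analysis-window coefficients C
from the standard's table are available (buffer ordering is simplified here; the standard
keeps the newest samples at the front of X):

    import numpy as np

    def analysis_filterbank(x_buffer, C):
        # x_buffer: the 512 most recent input samples; C: 512 window coefficients
        z = C * x_buffer                          # windowing, z(i) = C(i) x(i)
        y = z.reshape(8, 64).sum(axis=0)          # partial sums y(k) = sum_j z(k + 64 j)
        k = np.arange(64)
        i = np.arange(32).reshape(-1, 1)
        M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)   # matrixing coefficients
        return M @ y                              # 32 subband samples s(i)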
2. Psychoacoustic modeling
Either Model 1 or Model 2 can be used to determine the psychoacoustic parameters.
Model 1 is sufficient for Layer I, which requires an FFT of 512 samples. The SMR is
determined from the psychoacoustic model.
3. Scale factor
The maximum of the absolute values of the 12 samples in a subband is determined.
The next-largest value in a lookup table is found, and its index is coded as the scale
factor for the 12 samples. The scale factor section contains a total of 32 six-bit indices
into the scale factor table. The decoder later multiplies the decoded quantizer output by
the scale factor to recover the quantized subband values.
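A minimal sketch of scale factor selection. The 63-entry table of values 2^(1 - k/3)
matches the standard's tabulated progression, but treat the details here as illustrative:

    import numpy as np

    # 63 scale factors, 2^(1 - k/3), largest first (as tabulated in the standard)
    SCALE_FACTORS = 2.0 ** (1.0 - np.arange(63) / 3.0)

    def choose_scale_factor(samples):
        # Pick the smallest tabulated scale factor >= max |sample| of the group.
        peak = np.max(np.abs(samples))
        candidates = np.nonzero(SCALE_FACTORS >= peak)[0]
        index = candidates[-1] if candidates.size else 0
        return index, SCALE_FACTORS[index]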
4. Bit allocation
The basic idea of the bit allocation procedure is to maximize the minimum MNR over
all subbands, under the constraint that the number of bits used does not exceed the
number of bits available.
The number of bits available to encode a frame, Bf, is determined from the bit rate and
the sampling rate fs: Bf = (bit rate × 384) / fs (bits/frame), since a Layer I frame spans
384 samples. The bit allocation procedure is an iterative process that starts with zero
bits allocated.
The algorithm computes the MNR for each subband, which is given by
MNR = SNR - SMR (dB)
where the SNR is given in the standard and the SMR is provided by the psychoacoustic
model.
It then finds the subband with the lowest MNR whose bit allocation has not reached its
maximum limit, increases that subband's bit allocation by one level, and subtracts the
number of additional bits required from the available bit pool.
The process is repeated until all the available bits have been used or all the subbands
have reached their maximum limit.
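A minimal sketch of this greedy loop. It assumes a table snr_db[b] giving the SNR for
allocation level b (the standard tabulates these per quantizer) and simplifies the cost of
one refinement step to a constant bits_per_level; in the standard the cost depends on
the step and includes scale factor bits when a subband is first allocated:

    def allocate_bits(smr_db, snr_db, bits_per_level, bit_pool, max_level):
        # Greedy allocation that raises the lowest-MNR subband first.
        n = len(smr_db)
        level = [0] * n
        while bit_pool >= bits_per_level:
            # Subbands that can still be given more bits
            cand = [i for i in range(n) if level[i] < max_level]
            if not cand:
                break
            # MNR = SNR - SMR; refine the subband where it is worst
            worst = min(cand, key=lambda i: snr_db[level[i]] - smr_db[i])
            level[worst] += 1
            bit_pool -= bits_per_level
        return level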
5. Quantization and encoding
Each subband sample Si is normalized by dividing its value by the scale factor scf and
quantized using the formula
Sqi = A (Si / scf) + B
where A and B are constants given by the standard and Sqi is the quantized sample in
the subband. The N most significant bits of Sqi are encoded, N being the number of
bits allocated to the subband.
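A sketch of this quantization step. The closed forms A = 1 - 2^-N and B = -2^-N
reproduce the standard's tabulated coefficients for a (2^N - 1)-step quantizer, but
should be treated as an assumption of this sketch:

    def quantize_subband_sample(s, scf, n_bits):
        # A and B for a (2^N - 1)-step quantizer (assumed closed form)
        a = 1.0 - 2.0 ** -n_bits
        b = -(2.0 ** -n_bits)
        q = a * (s / scf) + b                          # q lies in [-1, 1)
        code = int((q + 1.0) / 2.0 * (1 << n_bits))    # take the N MSBs
        return code ^ (1 << (n_bits - 1))              # invert MSB, per the standard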
6. Bit-stream formatting
The encoded subband information is multiplexed into frame units. The bit-stream
formatting is performed as a final procedure with no additional coding. A frame is
composed of an integer number of slots, which allows the mean bit rate to be adjusted.
In Layer I a slot equals 32 bits, while in Layers II and III a slot equals 8 bits. The
number of slots in a frame is obtained by dividing the total number of bits available
(Bf) by the number of bits in a slot, i.e.,
Number of slots = Bf / 32 for Layer I
IV. The Layer II Scheme and Enhancement
Layer II follows basically the same rules for coding and decoding as Layer I.
The main differences: Layer II introduces correlation between subbands and carries
information for 1152 samples (3 × 12 × 32) per frame. Layer II achieves significant
savings in bit rate by increasing the frame size to 36 samples per subband (as samples
per frame increase, the frame rate, and hence the relative header overhead, decreases).
Finally, the Layer II bit stream adds scale factor select information (SCFSI).
Fig 12 The data bit-stream structure of Layer II (upper) and the structure of Layer II
subband samples (lower)
Both psychoacoustic models 1 and 2 can be used. Either model provides the SMR for
every subband.
1. Coding scale factors
The same analysis and synthesis filters as in Layer I are used for Layer II, and the
technique for calculating the scale factor is also the same.
The difference: a frame corresponds to 36 (3 × 12) subband samples per subband
(12 granules, as in Fig 12) and contains 3 scale factors per subband (one scale factor
per 12 consecutive samples), as sketched below.
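A minimal sketch of this grouping, reusing the choose_scale_factor helper sketched in
the Layer I section (the SCFSI logic that decides how many of the three factors are
actually transmitted is omitted):

    def layer2_scale_factors(subband_samples):
        # subband_samples: array of shape (32, 36), one row per subband
        factors = []
        for band in subband_samples:
            # One scale factor per group of 12 consecutive samples
            factors.append([choose_scale_factor(band[g * 12:(g + 1) * 12])[0]
                            for g in range(3)])
        return factors   # 32 x 3 table of scale factor indices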
2. The bit allocation, quantization, and encoding in Layer II are basically the same as
in Layer I.
V. The Layer III Scheme and Enhancement
Coding in Layer III is much more sophisticated than in Layers I and II, in the
following senses:
i) Additional frequency resolution is provided by the use of a hybrid filter bank.
ii) A nonuniform quantizer is applied.
iii) Entropy coding (Huffman coding) is introduced.
iv) The iteration loops for psychoacoustic modeling and bit allocation are elaborated.
Fig 13 shows a block diagram of the hybrid filter bank (the equal-width subband filter
bank used for Layers I and II, followed by an MDCT).
Fig 13 A hybrid filter bank for MPEG audio layer III
Layer III specifies two different MDCT block lengths - a long block of 36 samples or
a short block of 12 samples - to trade off time resolution against frequency resolution.
The short block length improves the time resolution to cope with transients (abruptly
changing portions of the signal).
Switching between long and short blocks is not instantaneous: transition windows
(long-to-short and short-to-long) are applied, as shown in Fig 14.
Fig 14 Long-block (36 samples) and short-block (12 samples) window responses
The decision whether the filter bank should switch to short or long windows is derived
from the masking threshold, by estimating the perceptual entropy (PE). If PE >= 1800,
the block needs to be shorter. In this way the block length adapts to the signal's
activity for efficient coding.
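A minimal sketch of the switching decision as a small state machine. The window-type
names and the rule that transitions pass through start/stop windows follow common
MP3 encoder practice and should be read as assumptions, not quotes from the standard:

    # Window types: 'long', 'start' (long-to-short), 'short', 'stop' (short-to-long)
    def next_window(current, pe, pe_threshold=1800.0):
        attack = pe >= pe_threshold          # transient detected in this granule
        if current in ('long', 'stop'):
            return 'start' if attack else 'long'
        if current == 'start':
            return 'short'                   # a start window must lead into shorts
        # current == 'short'
        return 'short' if attack else 'stop'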
The major enhancements and distinctions of Layer III over Layers I and II
1. Aliasing reduction: only the long blocks are input to the aliasing-reduction
procedure. The MDCT yields 18 coefficients from 36 input samples in each subband.
Between each two adjacent sets of 18 coefficients, butterfly operations are performed
as shown in Fig 15, where i represents the distance from the last line of the previous
block to the first line of the current block. Eight butterflies are defined, with different
weighting factors csi and cai.
Fig 15 The aliasing butterfly for the Layer III encoder and decoder
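A sketch of the butterflies, forming csi = 1/sqrt(1 + ci^2) and cai = ci/sqrt(1 + ci^2)
from the standard's eight aliasing-reduction coefficients ci (the coefficient list below
is reproduced from memory and should be checked against the standard; the decoder
applies the inverse rotation):

    import math

    # Aliasing-reduction coefficients c_i from the standard
    C = [-0.6, -0.535, -0.33, -0.185, -0.095, -0.041, -0.0142, -0.0037]
    CS = [1.0 / math.sqrt(1 + c * c) for c in C]
    CA = [c / math.sqrt(1 + c * c) for c in C]

    def alias_reduce(x):
        # x: flat list of 32 * 18 = 576 long-block MDCT coefficients
        for sb in range(1, 32):                 # 31 subband boundaries
            for i in range(8):                  # 8 butterflies per boundary
                lo = 18 * sb - 1 - i            # last lines of the lower subband
                hi = 18 * sb + i                # first lines of the upper subband
                a, b = x[lo], x[hi]
                x[lo] = a * CS[i] - b * CA[i]
                x[hi] = b * CS[i] + a * CA[i]
        return x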
2. Noise allocation: the bit allocation process used in Layers I and II only allocates the
available bits and merely approximates the amount of noise caused by quantization.
Layer III instead introduces a noise allocation iteration loop (a sketch follows):
The inner loop quantizes the input vector and increases the quantizer step size until the
output vector can be coded with the available number of bits.
The outer loop calculates the distortion in each scale factor band (a set of frequency
lines scaled by one scale factor, approximating a critical band). If the allowed
distortion is exceeded, that scale factor band is amplified and the inner loop is
activated again.
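A minimal sketch of the two nested loops, with the 3/4-power nonuniform quantizer
described in the next item used inside the inner loop. Here count_bits stands in for
Huffman coding of the quantized values and allowed_distortion for the thresholds
from the psychoacoustic model; both are assumptions of this sketch:

    import numpy as np

    def inner_loop(x, step, bit_budget, count_bits):
        # Increase the quantizer step size until the spectrum fits the budget.
        while True:
            q = np.floor((np.abs(x) / step) ** 0.75 + 0.4054).astype(int)  # 3/4-power law
            if count_bits(q) <= bit_budget:
                return q, step
            step *= 2 ** 0.25              # coarser quantization, fewer bits

    def outer_loop(x, allowed_distortion, sfb_bounds, bit_budget, count_bits):
        # Amplify scale factor bands whose quantization noise is audible.
        x = np.asarray(x, dtype=float)
        gains = np.ones_like(x)
        while True:
            q, step = inner_loop(x * gains, 1.0, bit_budget, count_bits)
            recon = np.sign(x) * (q ** (4.0 / 3.0)) * step / gains   # decoder side
            over = []
            for n, (lo, hi) in enumerate(sfb_bounds):
                noise = np.sum((x[lo:hi] - recon[lo:hi]) ** 2)
                if noise > allowed_distortion[n]:
                    over.append((lo, hi))
            if not over:
                return q
            for lo, hi in over:            # amplify the offending bands and retry
                gains[lo:hi] *= 2 ** 0.25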
3. Nonuniform quantization: the quantizer follows a power law - it raises input samples
to the 3/4 power to provide a more consistent SNR over the range of the quantizer. The
decoder raises the Huffman-decoded output samples to the 4/3 power.
4. Bit reservoir: the coded data bit stream does not necessarily fit into a fixed-length
frame. Slots are still used for adjusting the bit rate; bits are saved in the reservoir when
fewer bits are needed to code a granule. If bits are saved in a frame, the remaining
space in the frame can be used for the next frame's data.
The encoder can therefore donate bits to, or borrow bits from, the reservoir as
appropriate. The number of bits in the bit reservoir is given by side information in the
data bit stream.
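A minimal sketch of the reservoir bookkeeping. The cap max_reservoir and the helper
names are illustrative assumptions; the standard expresses the reservoir through the
main_data_begin pointer carried in the side information:

    def update_reservoir(reservoir, frame_bits, used_bits, max_reservoir=4088):
        # Donate unused frame bits to the reservoir, capped at its maximum size.
        reservoir += frame_bits - used_bits
        return min(max(reservoir, 0), max_reservoir)

    def budget_for_granule(mean_bits, reservoir, demand):
        # Borrow from the reservoir when the psychoacoustic demand is high.
        extra = min(reservoir, max(0, demand - mean_bits))
        return mean_bits + extra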
VI. Conclusion and Prospects of the MPEG Audio Standard
The MPEG audio coding algorithm is the first international standard for the
compression of audio signals. It can be applied to streams that combine both audio and
video information, or to audio-only streams. The MPEG audio compression standard
achieves compression by exploiting the spectral and temporal masking effects of the
ear. It uses subband coding and psychoacoustic models to eliminate information that is
irrelevant from a human sound-perception viewpoint.
The MPEG audio standard has also kept evolving into more sophisticated editions
since its adoption as a standard by ISO/IEC at the end of 1992. MPEG-2 is an
extension of MPEG-1 with some added features: multichannel input (5.1 channels),
multilingual audio support (up to 8 commentary channels), lower bit rates (down to
8 kbits/s), and additional sampling rates (32, 44.1, 48, 16, 22.05, and 24 kHz). MPEG
AAC has now been developed and is expected to be adopted as an international
standard; it will constitute the kernel of the forthcoming MPEG-4 audio standard.
Work on MPEG-4 is expected to be completed by the end of 1998.
http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap4/mpeg-audio/chap4.html