Media Processing – Audio Part
Dr Wenwu Wang
Centre for Vision Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html
Approximate outline
 Week 6: Fundamentals of audio
 Week 7: Audio acquiring, recording, and standards
 Week 8: Audio processing, coding, and standards
 Week 9: Audio production and reproduction
 Week 10: Audio perception and audio quality assessment
Audio coding and standards
Concepts and topics to be covered:
 Spectral analysis of audio
 DFT, DCT, and MDCT
 Subband analysis of audio
 PQMF
 Audio coding methods
 Lossless and lossy coding
 Coding standards
 MPEG-1, -2, and -4
 Coding principles of MPEG-1
Fourier transform
 Motivation for using Fourier transform (FT)

The waveform (a general time-domain representation) gives some
indication of the dynamics and periodicity of audio, but it does not
clearly show, for example, how the signal's energy is distributed over frequency.

The Fourier transform provides an alternative representation of the signal,
suitable for displaying other signal characteristics, such as its
frequency content and harmonics.
 Definition
 The Fourier transform of a continuous signal $x(t)$ is computed as:
$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$$
where $\omega$ is the angular frequency: $\omega = 2\pi f$
 Inverse Fourier transform:
$$x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega$$
Discrete-time Fourier transform (DTFT)
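 The Fourier transform of a discrete-time signal $x[n] = x(nT)$ is computed as:
$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$
where $\omega$ is the angular frequency: $\omega = 2\pi f$ (with $f$ normalised by the sampling rate $1/T$)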
 Inverse discrete-time Fourier transform:
$$x[n] = \frac{1}{2\pi} \int_{2\pi} X(\omega)\, e^{j\omega n}\, d\omega$$
(the integral is taken over any interval of length $2\pi$)
Discrete Fourier transform (DFT)
 The Fourier transform of a digital signal $x[n]$ of length $N$ is computed as:
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega_k n}$$
where $\omega_k$ is the angular frequency: $\omega_k = \frac{2\pi k}{N}, \quad k = 0, 1, \ldots, N-1$
 Inverse discrete Fourier transform:
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j\omega_k n}, \quad n = 0, 1, \ldots, N-1$$
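The DFT pair above can be checked numerically with a few lines of Python. This is only a minimal sketch using the standard library; the function names `dft` and `idft` and the test signal are illustrative choices, not part of the lecture material.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 0.0, -1.0]   # arbitrary short test signal
X = dft(x)
x_rec = idft(X)             # should reproduce x up to rounding error
```

Note that `X[0]` is simply the sum of the samples, and that the inverse transform returns complex numbers whose imaginary parts are zero up to rounding.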
Power spectral density (PSD)
 PSD is defined as the magnitude squared of the DFT of the signal:
$$P[k] = |X[k]|^2$$
 Examples:
PSD of a vowel spoken by a male speaker
PSD of a fricative spoken by a male speaker
Fast Fourier transform (FFT)
 The FFT is a fast algorithm for computing the DFT. A typical (radix-2) FFT algorithm consists
of three conceptual parts:

Shuffling (bit reversal): decomposing the N-point input into N one-point signals, reordered by bit-reversed index

Performing N one-point DFTs (the DFT of a single sample is the sample itself)

Merging the N one-point DFTs into one N-point DFT using “butterfly”
merging equations (this requires N to be an integer power of 2)
 The computational complexities of the FFT and the direct DFT are respectively:
 FFT: $O(N \log N)$
 DFT: $O(N^2)$
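The three conceptual parts can be sketched as an iterative radix-2 FFT. This is an illustrative implementation written for readability rather than speed; the names `fft_radix2` and `dft_direct` are hypothetical, and the direct DFT is included only as a reference for comparison.

```python
import cmath

def dft_direct(x):
    """Reference O(N^2) DFT used to check the FFT result."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_radix2(x):
    """Iterative decimation-in-time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of 2")
    bits = N.bit_length() - 1
    # Part 1: shuffle the input into bit-reversed order.
    X = [0j] * N
    for i in range(N):
        rev = int(format(i, "b").zfill(bits)[::-1], 2)
        X[rev] = complex(x[i])
    # Part 2: each single sample is already its own one-point DFT.
    # Part 3: merge with butterflies, doubling the DFT size at each stage.
    size = 2
    while size <= N:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)   # twiddle-factor increment
        for start in range(0, N, size):
            w = 1 + 0j
            for k in range(half):
                a = X[start + k]
                b = X[start + k + half] * w
                X[start + k] = a + b
                X[start + k + half] = a - b
                w *= step
        size *= 2
    return X

x = [1, 2, 3, 4, 0, -1, -2, -3]
X_fast = fft_radix2(x)
X_slow = dft_direct(x)
```

The outer `while` loop runs log2(N) times and each stage touches all N samples once, which is where the O(N log N) cost comes from.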
Short-time Fourier transform (STFT)
 The STFT (sometimes called the short-term FT) can be computed as an N-point
windowed DFT as follows (note that we only consider the discrete form here,
and in practice the FFT is usually used to compute the DFT in each frame):
$$x[k, m] = \sum_{n=0}^{N-1} x[m\delta + n]\, w(n)\, e^{-j\omega_k n}$$
where
$\omega_k$ - the discrete angular frequency: $\omega_k = \frac{2\pi k}{N}, \quad k = 0, 1, \ldots, N-1$
$m$ - the time-frame index
$\delta$ - the hop size
$w(n)$ - a window function, such as the rectangular or Hann window
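A minimal sketch of the windowed DFT per frame, assuming a Hann window and using a direct DFT for clarity (a real implementation would call an FFT). The function names, the `hop` parameter name, and the test tone are illustrative assumptions.

```python
import cmath
import math

def hann(N):
    # Hann window: w(n) = 0.5 - 0.5*cos(2*pi*n/N)
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def stft(x, N, hop):
    """frames[m][k] = sum_n x[m*hop + n] * w(n) * exp(-j*2*pi*k*n/N)."""
    w = hann(N)
    frames = []
    for start in range(0, len(x) - N + 1, hop):
        seg = [x[start + n] * w[n] for n in range(N)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)
        ])
    return frames

N, hop = 32, 8
# Test tone with exactly 4 cycles per N samples, so its energy sits at bin k = 4.
x = [math.cos(2 * math.pi * 4 * n / N) for n in range(128)]
frames = stft(x, N, hop)
peak_bins = [max(range(N // 2 + 1), key=lambda k: abs(frame[k])) for frame in frames]
```

With a hop of 8 and a window of 32, consecutive frames overlap by 75%, and every frame of this stationary tone peaks at the same bin.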
Spectrogram
 The spectrogram of a speech signal can be computed as the magnitude squared of the
STFT:
$$\mathrm{spectrogram}\{x[n]\} = |x[k, m]|^2$$
 An example:
Spectrogram of a female speaker uttering “warm cloak”
Spectrogram (cont.)
 Each vertical line of the spectrogram describes the frequency-dependent
power distribution of the signal over a short segment (or
window) of the speech signal, i.e. the PSD of the segment.
 The width of the window is N, and the gap between consecutive
windows is the hop size $\delta$.
 Each horizontal line of the spectrogram represents the power
distribution within a particular frequency band as a function of time.
 The spectrogram shows the time-frequency spectral distribution of
power within the signal.
 The spectrogram is much better suited than the waveform to
displaying speech structures, e.g. harmonics, the energy balance of
frequency components, formants, etc.
 The time and frequency resolution of the spectrogram are inversely
proportional.
Spectrogram – resolution issues
 The STFT has a fixed resolution that depends on the selection of the
window size.
 A wider window gives better frequency resolution (frequency
components close together can be separated) but poorer time
resolution (the time at which frequencies change), and vice versa.
 We use the example from Wikipedia to demonstrate this: a signal is
composed of four sinusoidal segments joined in sequence, with frequencies of 10, 25,
50 and 100 Hz respectively, each 5 seconds long. The
sampling frequency of the signal is 400 Hz.
 Multi-resolution analysis tools exist that do not suffer from this
problem, such as wavelet transform.
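The trade-off can be made concrete with a little arithmetic: for a window of N samples at sampling rate fs, the DFT bin spacing is fs/N, while the window spans N/fs seconds. The sketch below uses the 400 Hz sampling rate from the example; the four window durations are illustrative choices assumed here, not values taken from the slide.

```python
fs = 400  # sampling frequency in Hz, as in the example above

resolution = {}
for window_ms in (25, 125, 375, 1000):      # assumed window durations
    N = fs * window_ms // 1000              # window length in samples
    delta_f = fs / N                        # DFT bin spacing in Hz
    resolution[window_ms] = (N, delta_f)

for window_ms, (N, delta_f) in resolution.items():
    print(f"{window_ms:5d} ms window: N = {N:3d} samples, bin spacing = {delta_f:.2f} Hz")
```

A 25 ms window cannot separate the 10 Hz and 25 Hz components (bin spacing 40 Hz) but localises the changes between segments well; a 1000 ms window does the opposite.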
Spectrogram resolution issues (cont.)
Different time-frequency resolutions for the same signal, resulting from the different window
sizes used in generating the STFT. (Source: Wikipedia.)
Spectrogram resolution issues (cont.)
 Although a long window can give higher frequency resolution, it would
be misleading if we use too long a window, as the spectral
characteristics would change over the duration of the windowed
segment.
 How long a window should we choose so that the spectral
characteristics do not change (dramatically)? This question relates to
the concept of “stationarity”.
 In practice, a speech segment with a length of around 20-30 ms is usually
regarded as “quasi-stationary” (very little change in spectral
characteristics). This is because the speech units (phonemes) occur at
a rate of 4-5 per second for average speech, although more rapid
changes can occur from one steady state to another.
 To ensure smooth transitions of the energy distribution from frame to
frame, the windows are usually chosen to be overlapping, with a typical
hop size of 5 ms.
Windowing and overlapping in
spectrogram
 By choosing the window length appropriately, the assumption of
stationarity (quasi-stationarity) within the windowed speech holds approximately.
 However, the DFT implicitly treats the segment as periodic: when copies of the segment are appended one after another, there
may still be sharp discontinuities in the waveform at the boundaries (see
the figure below).
 These discontinuities introduce spurious components spread across the
spectrum, known as spectral leakage.
Spectral leakage:
(a) a sinusoidal audio segment
(b) its periodic extension.
Windowing and overlapping in
spectrogram (cont.)
 To reduce spectral leakage, we can multiply the segment by a window
function that approaches zero at its ends, such as the Hann window shown
below. This effectively attenuates the discontinuities at the two boundaries
of the window, and therefore reduces the leakage.
The waveform and amplitude spectrum of the Hann window function
 In practice, short segments can be appended with zeros to the required
length, known as zero-padding.
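The effect of the window on leakage can be illustrated with a sinusoid that does not complete a whole number of cycles in the segment. The sketch below is a small experiment under assumed parameters (N = 64, 10.5 cycles per segment, and bin 30 as a "far away" bin); the names are illustrative.

```python
import cmath
import math

def dft_magnitude(x):
    """Magnitude of the direct DFT of x."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)))
            for k in range(N)]

N = 64
# 10.5 cycles per segment: the periodic extension jumps at the segment boundary.
segment = [math.cos(2 * math.pi * 10.5 * n / N) for n in range(N)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

rect_spectrum = dft_magnitude(segment)                                  # rectangular window
hann_spectrum = dft_magnitude([s * w for s, w in zip(segment, hann)])   # Hann window

far_bin = 30  # a bin far from the sinusoid's frequency (between bins 10 and 11)
print(rect_spectrum[far_bin], hann_spectrum[far_bin])
```

The Hann-windowed segment has far less energy at the distant bin, at the cost of a wider main lobe around the sinusoid's true frequency.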
Various window functions
Source: Kondoz (2001)
Time plots of various window
functions
Source: Kondoz (2001)
Frequency response of various
window functions
Source: Kondoz (2001)
Short-time spectral analysis
using DFT
 Effect of window type on voiced speech with a window length of 220 samples.
(a) and (b) are time and frequency plots of speech using a rectangular
window, and (c) and (d) are time and frequency plots of speech using
a Hamming window.
Source: Kondoz (2001)
Short-time spectral analysis
using DFT
 Effect of window type on unvoiced speech with a window length of 220
samples. (a) and (b) are time and frequency plots of speech using a
rectangular window, and (c) and (d) are time and frequency plots of speech
using a Hamming window.
Source: Kondoz (2001)
Short-time spectral analysis
using DFT
 Effect of window type on voiced speech with a window length of 40 samples.
(a) and (b) are time and frequency plots of speech using a rectangular
window, and (c) and (d) are time and frequency plots of speech using
a Hamming window.
Source: Kondoz (2001)
Discrete Cosine Transform (DCT)
 One definition of the DCT (the type-II DCT, DCT-II) is shown below:
Source: Wikipedia
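For reference, a commonly quoted unnormalised form of the DCT-II is given here. It is stated as an assumption about which definition is meant, written in the $x[n]$, $X[k]$ notation used for the DFT:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right], \quad k = 0, 1, \ldots, N-1$$

Unlike the DFT, the transform is real-valued for real input, and scaling conventions differ between references.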
MDCT
 An advantage of Modified DCT (MDCT) is that it allows for a 50% overlap
between blocks without increasing the data rate.
 The MDCT is an example of a class of transforms called Time Domain
Aliasing Cancellation (TDAC). In particular, MDCT is sometimes referred to
as oddly-stacked TDAC (OTDAC).
 Unlike the DFT, inverting a single block of these transforms does not recover the original signal.
Instead, it yields a signal in which each half of the block is mixed with a time-reversed copy of itself
(“time-domain aliasing”). When the overlapping halves of adjacent inverse-transformed blocks
are added together, the aliasing terms cancel. As a result, the input signal is perfectly
reconstructed.
MDCT (cont.)
Analysis: from time to frequency
Synthesis: from frequency to time
For the signal to be perfectly reconstructed after the synthesis
process, the windows should satisfy the following condition:
where i is the index of blocks (or short-time frames), subscript a means
analysis, and s means synthesis. $n_0 = (N/2 + 1)/2$.
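Time-domain aliasing cancellation can be demonstrated numerically. The sketch below rests on stated assumptions rather than on the slide's own equations: the forward transform is taken as X[k] = sum_n w[n] x[n] cos(2*pi/N * (n + n0) * (k + 1/2)) for k = 0..N/2-1, the inverse is scaled by 4/N, the same sine window is used for analysis and synthesis, and n0 = (N/2 + 1)/2 as stated above. All function names are hypothetical.

```python
import math

def sine_window(N):
    # Sine window; satisfies w[n]^2 + w[n + N/2]^2 = 1
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]

def mdct(block, w):
    """N windowed input samples -> N/2 coefficients."""
    N = len(block)
    n0 = (N / 2 + 1) / 2
    return [sum(w[n] * block[n] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                for n in range(N))
            for k in range(N // 2)]

def imdct(X, w):
    """N/2 coefficients -> N windowed, time-aliased output samples."""
    N = 2 * len(X)
    n0 = (N / 2 + 1) / 2
    return [w[n] * (4 / N) * sum(X[k] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                                 for k in range(N // 2))
            for n in range(N)]

def mdct_roundtrip(x, N):
    """Analyse 50%-overlapping blocks, invert each, and overlap-add."""
    w = sine_window(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, N // 2):
        out = imdct(mdct(x[start:start + N], w), w)
        for n in range(N):
            y[start + n] += out[n]
    return y

N = 16
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(40)]
y = mdct_roundtrip(x, N)
# Samples covered by two overlapping blocks are reconstructed exactly.
max_error = max(abs(a - b) for a, b in zip(x[N // 2:-N // 2], y[N // 2:-N // 2]))
```

Each block produces only N/2 coefficients for N input samples, so the 50% overlap does not increase the data rate. The first and last half-blocks are covered by a single block only and therefore still contain aliasing.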
MDCT (cont.)
 Responses of the MDCT filter bank (cosine window function):
Source: Bosi & Goldberg (2002)
Subband analysis of audio signals
 General subband analysis framework for audio coding:
Source: Bosi & Goldberg (2002)
Pseudo quadrature mirror filter
(PQMF) as a subband analysis tool
 The PQMF filter bank employs the following analysis h and synthesis g
filters respectively, where k is the frequency index and n is the time index.
See its general form below
PQMF filter bank
 An example of subband analysis of speech:
Source: Kondoz (2001)
PQMF filter bank (cont.)
 The PQMF filter bank in MPEG audio standards employs the following
analysis and synthesis filters respectively, where k is the frequency index
and n is the time index.
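The idea of cosine modulation can be sketched as follows. The analysis filter form h_k[n] = h[n] cos((2k + 1)(n - 16) pi / 64) is the one usually quoted for the 32-band MPEG audio filter bank and is stated here as an assumption. The prototype used below is a generic Hann-windowed sinc low-pass filter, a stand-in chosen for illustration and NOT the prototype coefficient table from the standard; all names are hypothetical.

```python
import cmath
import math

M = 32    # number of subbands
L = 512   # prototype filter length

def prototype_lowpass(length, cutoff):
    """Hann-windowed sinc low-pass filter (illustrative stand-in prototype)."""
    centre = (length - 1) / 2
    h = []
    for n in range(length):
        t = n - centre
        ideal = cutoff / math.pi if t == 0 else math.sin(cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
        h.append(ideal * window)
    return h

def analysis_filter(h, k):
    # Assumed form: h_k[n] = h[n] * cos((2k + 1) * (n - 16) * pi / 64)
    return [h[n] * math.cos((2 * k + 1) * (n - 16) * math.pi / (2 * M))
            for n in range(len(h))]

def gain(h, omega):
    """Magnitude response of FIR filter h at angular frequency omega (rad/sample)."""
    return abs(sum(h[n] * cmath.exp(-1j * omega * n) for n in range(len(h))))

h = prototype_lowpass(L, math.pi / (2 * M))   # cutoff at half a subband width
h1 = analysis_filter(h, 1)                    # filter for subband k = 1

centre_band1 = (1 + 0.5) * math.pi / M        # centre frequency of subband 1
centre_band5 = (5 + 0.5) * math.pi / M        # centre frequency of subband 5
print(gain(h1, centre_band1), gain(h1, centre_band5))
```

Modulating the low-pass prototype by a cosine shifts its passband to the centre of subband k, so the same prototype serves all 32 bands.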
PQMF filter bank (cont.)
 MPEG Audio PQMF prototype filter impulse response $h[n]$, and the modulated filters $h_k[n]$ for
k = 0 and k = 1.
Source: Bosi & Goldberg (2002)
PQMF filter bank (cont.)
 Frequency response of the prototype filter (unit: Fs/64) used in MPEG
audio standards.
Source: Bosi & Goldberg (2002)
PQMF filter bank (cont.)
 Frequency response of the first four bands of the MPEG audio coding
standards (unit: Fs/64) is shown in the figure below.
Source: Bosi & Goldberg (2002)
Audio coding methods
 Lossless coding
 Based on statistical relation between symbols within the data
 Entropy coding such as Huffman coding, arithmetic coding etc.
 The original signals can be perfectly reconstructed.
 Lossy coding
 Based on the perceptual modelling of audio signals (such as
psychoacoustic models of hearing), some redundant information within
audio signals can be removed without affecting their perceptual quality.
 Usually done in transform domain followed by quantisation.
 The original signals cannot be perfectly reconstructed.
Lossless coding: Huffman coding
 Huffman coding is a variable length coding method for coding symbols
based on the probabilities of each symbol’s occurrence.
 Consider a 2-bit quantised signal that has the codes [00], [01], [10], [11],
and suppose that, in the signal to be encoded, the probabilities of
these symbols' occurrence are 70%, 15%, 10%, 5% respectively. The original
bit rate is 2 bits per sample, and after entropy coding with codeword lengths of 1, 2, 3 and 3 bits, the average bit
rate becomes 1.45 (= 70%*1 + 15%*2 + 10%*3 + 5%*3) bits per sample.
Source: Bosi & Goldberg (2002)
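The codeword lengths in this example can be reproduced with a short Huffman construction. This is a minimal sketch that computes only the code lengths (not the bit patterns); the function name and data layout are illustrative.

```python
import heapq

def huffman_code_lengths(probabilities):
    """Return {symbol: codeword length} for a Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree)
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probabilities}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside them
        # moves one level deeper, i.e. gains one bit.
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probabilities = {"00": 0.70, "01": 0.15, "10": 0.10, "11": 0.05}
lengths = huffman_code_lengths(probabilities)
average_bits = sum(probabilities[s] * lengths[s] for s in probabilities)
print(lengths, average_bits)
```

The most probable symbol receives a 1-bit codeword and the two least probable symbols receive 3-bit codewords, giving the 1.45 bits per sample quoted above.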
Lossy coding: psychoacoustic model
 Psychoacoustic principles and models, in particular, frequency and
temporal masking, have been used as a basis in producing perceptually
lossless audio quality in lossy audio coding algorithms.
Source: Bosi & Goldberg (2002)
Frequency masking
Lossy coding: psychoacoustic model
Temporal masking
Source: Bosi & Goldberg (2002)
MPEG-1 audio coding standard
 MPEG: the Moving Picture Experts Group (MPEG) was established at the end of the 1980s within the joint technical
committee on information technology (JTC 1), sponsored by the
International Organisation for Standardisation (ISO) and the International
Electrotechnical Commission (IEC). Its aim is to develop standards for the coded representation of moving pictures,
associated audio, and their combination.
 MPEG-1 Audio was the first international standard specifying the digital
format for high-quality audio with the aim of reducing the data rate while
maintaining CD-like quality. Prior to this standard, standardisation
efforts either targeted speech-only applications or provided only
medium-quality audio performance. The adoption of MPEG-1 enables
compressed high-quality audio in a wide range of applications, from
digital broadcasting to internet applications.
MPEG standards – a brief history
 The MPEG standardisation effort started in 1988.
 The MPEG-1 standard [ISO/IEC 11172], for coding of synchronised video and
audio at a total data rate of about 1.5 Mb/s, was finalised in 1992.
 The MPEG-2 standard [ISO/IEC 13818], for coding of synchronised video and
audio at a total data rate of about 10 Mb/s, was finalised in 1994.
 The effort for an MPEG-3 standard, i.e. coding of synchronised video and
audio at a total data rate of about 40 Mb/s, was dropped in 1993, after
being considered redundant as its attributes were already incorporated in
the MPEG-2 standard.
 MPEG-4 [ISO/IEC 14496], which addresses audio-visual coding at very low
data rates with additional functionalities, such as scalability, 3-D, and
synthetic/natural hybrid coding, was finalised in 1998.
 MPEG-7 [ISO/IEC 15938] addresses the description of multimedia content
for multimedia database search, finalised in 2001.
MPEG audio standards
 MPEG Audio is usually used as a stand-alone standard; however, it is one part
of a multi-part standard, where “part 1” describes the system structure,
“part 2” describes video coding, and “part 3” describes audio coding.
 MPEG-1 Audio
 Defines the coding/decoding of high-quality audio signals for storage media.
 Standardises only the bitstream and decoder specifications, but not
the encoder. This allows interoperability between different
implementations, and lets manufacturers retain control of the core
intellectual property of their coding systems.
 Aims to support one or two main channels.
 Input and outputs are compatible with existing PCM standards such as
the CD and the digital audio tape formats.
 Sampling rate at 32 kHz, 44.1 kHz, and 48 kHz
Source: Bosi & Goldberg (2002)
MPEG audio standards (cont.)
 MPEG-2 Audio
 Extends MPEG-1 Audio to multiple channels.
 Adds lower sampling rates than MPEG-1: 16 kHz, 22.05 kHz and 24 kHz.
 Motivated by emerging internet applications.
 Defines higher-quality multichannel audio than is achievable with
MPEG-1 extensions.
 MPEG-2 AAC (NBC, non-backward-compatible) shows comparable or
better audio quality than MPEG-2 Layer II BC (backward-compatible).
MPEG audio standards (cont.)
 MPEG-4 Audio
 Aims to provide high coding efficiency, with data rates (200 b/s to
64 kb/s) lower than those defined in MPEG-1/2.
 Accommodates speech coding technology and error protection.
 Includes content-based interactivity such as flexible access and
manipulation, for example, pitch modifications.
 Allows universal access, such as access to a subset of data or
scalability.
 Supports synthetic audio and speech, e.g. in structured audio and
text-to-speech interfaces.
 Accommodates additional post-processing effects, such as
reverberation, 3D, etc.
MPEG-1 Audio
 The MPEG-1 standard part 3, i.e. [ISO/IEC 11172-3], specifies the audio
part of the MPEG-1 standard.
 It includes the syntax of the audio coded bitstream and a description of
decoding process, which ensures interoperability between different
systems.
 It also provides reference software modules and a set of test vectors
for assessing the compliance of the decoder.
 It does not define the encoder, which is left to the designer of the
system to decide.
 It describes perceptual audio coding algorithms for general audio
signals, unlike speech codecs, where a specific source model is
applied.
MPEG-1 Audio – main features
 It supports sampling rates at 32, 44.1 and 48 kHz, one or two channels
(including dual monophonic mode for two independent channels and a
stereo mode for stereophonic channels).
 Data rates vary between 32 and 224 kb/s per channel, allowing for
compression ratios from 2.7:1 to 24:1, depending on the sampling rate.
 It specifies three different layers, Layer I, Layer II, and Layer III, which offer
increasingly higher audio quality at slightly increased complexity.
 Layer I is the simplest layer and operates at data rates between 32 and
224 kb/s per channel (preferred rate above 128 kb/s). It finds
applications in e.g. the digital compact cassette (DCC) at 192 kb/s per
channel.
 Layer II is of medium complexity and operates preferably between 32 and
192 kb/s per channel (providing very good quality at 128 kb/s).
Applications of Layer II include digital audio broadcasting (DAB).
 Layer III has the highest quality with increased complexity. The data
rates vary between 32 and 160 kb/s per channel. Applications include
ISDN and internet transmissions. A modified MPEG Layer III format at
lower sampling frequencies has become the well-known MP3 format.
MPEG-1 Audio coding: main
building blocks of encoder
 The encoder includes a time to frequency mapping stage followed by a bit
(or noise) allocation stage. The psychoacoustic model is used to determine
the precision of the allocation stage. The bitstream formatting stage
interleaves the representation of the quantised data with side information
and optional ancillary data.
Source: Bosi & Goldberg (2002)
MPEG-1 Audio coding: main
building blocks of decoder
 The decoder interprets the bitstream, restores the quantised spectral
components of the signal and reconstructs the time domain representation
of the audio signal by frequency to time mapping.
Source: Bosi & Goldberg (2002)
MPEG-1 Audio: coding options of
layer I and II
 In both Layer I and II, T-F mapping is performed by a 32-band PQMF, whose output
is then scaled and quantised with a uniform midtread quantiser whose
precision is determined by the output of the psychoacoustic model based
on a 512-point (Layer I) or 1024-point (Layer II) FFT analysis. To reduce the data
rate, group coding of consecutive quantised samples is applied in Layer II.
Source: Bosi & Goldberg (2002)
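A uniform midtread quantiser has an odd number of levels, so that zero is itself a reconstruction level. The sketch below is a generic illustration with 2^bits - 1 levels over [-x_max, x_max]; the function name, the normalisation to x_max = 1, and the choice of 3 bits are assumptions, not details taken from the standard.

```python
import math

def midtread_quantise(x, bits, x_max=1.0):
    """Uniform midtread quantiser with 2**bits - 1 levels.

    Returns (index, reconstructed value). Zero input maps to index 0.
    """
    levels = 2 ** bits - 1                # odd number of levels
    step = 2 * x_max / levels             # quantiser step size
    max_index = (levels - 1) // 2
    index = math.floor(x / step + 0.5)    # round to the nearest level
    index = max(-max_index, min(max_index, index))   # clip overload
    return index, index * step

bits = 3
step = 2 / (2 ** bits - 1)
samples = [i / 50 for i in range(-50, 51)]           # scaled samples in [-1, 1]
errors = [abs(x - midtread_quantise(x, bits)[1]) for x in samples]
print(midtread_quantise(0.0, bits), max(errors), step / 2)
```

Within the range [-x_max, x_max] the quantisation error never exceeds half a step, and increasing `bits` for a subband (as directed by the psychoacoustic model) shrinks the step and hence the noise in that band.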
MPEG-1 Audio: coding options of
layer III
 In Layer III, the output of the PQMF is fed to an MDCT stage. The filter
bank is adaptive, rather than static (as in Layer I & II), and its output is scaled and
non-uniformly quantised with a midtread quantiser. Noiseless coding, such
as Huffman coding, is also employed. Side information includes bit
allocation and control parameters.
Source: Bosi & Goldberg (2002)
Time-frequency mapping in layer III:
analysis filterbank
 After the 32-band PQMF filter, the subband samples are overlapped by
50%, multiplied by a sine window, and then processed by the MDCT.
 The MDCT output is multiplied by coefficients to reduce the aliasing effects
caused by the PQMF and its overlapping bands.
Source: Bosi & Goldberg (2002)
Time-frequency mapping in layer III:
synthesis filterbank
 In the decoder, the inverse aliasing reduction process is applied prior to the
IMDCT (inverse MDCT). Without aliasing reduction, a pure sine wave after
passing through the PQMF/MDCT filterbank can present a spurious
component as high as -12 dB with respect to the original signal.
Source: Bosi & Goldberg (2002)
Time-frequency mapping in layer III:
block switching
 The block size processed by the Layer III filter bank is 32*36 = 1152
time samples, which leads to a frequency resolution of about 41.67 Hz at a 48 kHz
sampling rate, and hence is good for performing bit allocation based on the
psychoacoustic model.
 However, for transient signals, such a long block size can result in
unmasked temporal noise, such as pre-echo. Hence, a shorter block size
of 32*12 = 384 time samples is used to improve the time resolution
and hence reduce the temporal spreading of quantisation noise for sharp
attacks.
 Two transition blocks, long-to-short and short-to-long, having the same size
as the long block are employed.
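The block-size figures can be checked with simple arithmetic. The sketch below only restates the numbers from the text at a 48 kHz sampling rate; the variable names are illustrative.

```python
SUBBANDS = 32
fs = 48_000                            # sampling rate in Hz

long_block = SUBBANDS * 36             # time samples per long block
short_block = SUBBANDS * 12            # time samples per short block

long_resolution_hz = fs / long_block   # frequency resolution, long block
short_resolution_hz = fs / short_block # frequency resolution, short block
long_ms = 1000 * long_block / fs       # time span of a long block
short_ms = 1000 * short_block / fs     # time span of a short block

print(long_block, round(long_resolution_hz, 2), long_ms)
print(short_block, round(short_resolution_hz, 2), short_ms)
```

The short block spans a third of the time of the long block, so quantisation noise from a sharp attack is spread over 8 ms instead of 24 ms, at the cost of a three times coarser frequency resolution.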
Time-frequency mapping in layer III:
window sequence in block switching
Source: Bosi & Goldberg (2002)
Time-frequency mapping in layer III:
window sequence in block switching
 The mixed block mode ensures high frequency resolution at low
frequencies and high time resolution at high frequencies.
Source: Bosi & Goldberg (2002)
Time-frequency mapping in layer III
as compared with that in layer I & II
 The hybrid filter bank in Layer III has advantages such as high frequency
resolution, a dynamic, adaptive tradeoff between time and frequency
resolution, and full compatibility with Layer I & II.
 The disadvantages include potential aliasing effects exposed by the MDCT
and long impulse response filters.
 The complexity of Layer III filter bank is increased with respect to the
complexity of Layers I & II.
Psychoacoustic models in
MPEG audio
Source: Bosi & Goldberg (2002)
References
 Marina Bosi and Richard E. Goldberg, “Introduction to Digital Audio Coding
and Standards”, Springer, 2002.
 Ahmet Kondoz, “Digital Speech Coding for Low Bit Rate Communication
Systems”, Wiley, 2001.