Lecture 7: Audio compression and coding

advertisement
EE E6820: Speech & Audio Processing & Recognition
Lecture 7:
Audio compression and coding
Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/∼dpwe/e6820
March 31, 2009
1
Information, Compression & Quantization
2
Speech coding
3
Wide-Bandwidth Audio Coding
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
1 / 37
Outline
1
Information, Compression & Quantization
2
Speech coding
3
Wide-Bandwidth Audio Coding
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
2 / 37
Compression & Quantization
How big is audio data? What is the bitrate?
Fs frames/second (e.g. 8000 or 44100)
x C samples/frame (e.g. 1 or 2 channels)
x B bits/sample (e.g. 8 or 16)
→ Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps)
bits / frame
I
CD Audio 1.4 Mbps
32
8
Mobile
!13 Kbps
8000
44100
frames / sec
Telephony 64 Kbps
How to reduce?
I
I
I
lower sampling rate → less bandwidth (muffled)
lower channel count → no stereo image
lower sample size → quantization noise
Or: use data compression
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
3 / 37
Data compression:
Redundancy vs. Irrelevance
Two main principles in compression:
I
I
remove redundant information
remove irrelevant information
Redundant information is implicit in remainder
e.g. signal bandlimited to 20kHz,
but sample at 80kHz
→ can recover every other sample by interpolation:
I
sample
In a bandlimited signal, the red samples
can be exactly recovered by interpolating
the blue samples
time
Irrelevant info is unique but unnecessary
I
e.g. recording a microphone signal at 80 kHz sampling rate
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
4 / 37
Irrelevant data in audio coding
For coding of audio signals,
irrelevant means perceptually insignificant
I
an empirical property
Compact Disc standard is adequate:
I
I
44 kHz sampling for 20 kHz bandwidth
16 bit linear samples for ∼ 96 dB peak SNR
Reflect limits of human sensitivity:
I
I
I
20 kHz bandwidth, 100 dB intensity
sinusoid phase, detail of noise structure
dynamic properties - hard to characterize
Problem: separating salient & irrelevant
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
5 / 37
Quantization
Represent waveform with discrete levels
Q[x]
5
4
3
2
1
0
-1
-2
-3
-4
-5
-5 -4 -3 -2 -1 0 1 2 3 4 5
6
x[n]
Q[x[n]]
4
2
0
-2
error e[n] = x[n] - Q[x[n]]
0
5
10
15
20
25
30
35
40
x
Equivalent to adding error e[n]:
x[n] = Q [x[n]] + e[n]
e[n] ∼ uncorrelated, uniform white noise
p(e[n])
→ variance σe2 =
-D/
2
D2
12
+D/
2
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
6 / 37
Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D
I
I
max signal range (x) = −(2B−1 ) · D . . . (2B−1 − 1) · D
quantization noise (e) = −D/2 . . . D/2
→ Best signal-to-noise ratio (power)
SNR = E [x 2 ]/E [e 2 ]
= (2B )2
.. or, in dB, 20 · log10 2 · B ≈ 6 · B dB
0
level / dB
Quantized at 7 bits
-20
-40
-60
-80
0
1000
E6820 (Ellis & Mandel)
2000
3000
4000
5000
6000
7000 freq / Hz
L7: Audio compression and coding
March 31, 2009
7 / 37
Redundant information
Redundancy removal is lossless
Signal correlation implies redundant information
I
I
e.g. if x[n] = x[n − 1] + v [n]
x[n] has a greater amplitude range
→ uses more bits than v [n]
sending v [n] = x[n] − x[n − 1] can reduce amplitude, hence
bitrate
x[n] - x[n-1]
I
but: ‘white noise’ sequence has no redundancy ...
Problem: separating unique & redundant
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
8 / 37
Optimal coding
Shannon information:
An unlikely occurrence is more ‘informative’
p(A) = 0.5
p(B) = 0.5
ABBBBAAABBABBABBABB
p(A) = 0.9
p(B) = 0.1
AAAAABBAAAAAABAAAAB
A, B equiprobable
A is expected;
B is ‘big news’
Information in bits I = − log2 (probability )
I
clearly works when all possibilities equiprobable
Optimal bitrate → av.token length = entropy H = E [I ]
I
.. equal-length tokens are equally likely
How to achieve this?
I
I
I
transform signal to have uniform pdf
nonuniform quantization for equiprobable tokens
variable-length tokens → Huffman coding
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
9 / 37
Quantization for optimum bitrate
Quantization should reflect pdf of signal:
p(x < x0)
p(x = x0)
x'
1.0
0.8
0.6
0.4
0.2
-0.02 -0.015 -0.01 -0.005
I
0
0.005 0.01
0.015 0.02
0.025
x
0
cumulative pdf p(x < x0 ) maps to uniform x 0
Or, codeword length per Shannon log2 (p(x)):
-0.02
-0.01
0
0.01
0.02
0.03
I
p(x)
Shannon info / bits Codewords
111111111xx
111101xx
111100xx
101xx
100xx
0xx
110xx
1110xx
111110xx
1111110xx
111111100xx
111111101xx
111111110xx
0
2
4
6
8
Huffman coding: tree-structured decoder
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
10 / 37
Vector Quantization
Quantize mutually dependent values in joint space:
3
x1
2
1
0
-1
-2
x2
-6
-4
-2
0
2
4
May help even if values are largely independent
I
larger space x1,x2 is easier for Huffman
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
11 / 37
Compression & Representation
As always, success depends on representation
Appropriate domain may be ‘naturally’ bandlimited
I
I
e.g. vocal-tract-shape coefficients
can reduce sampling rate without data loss
In right domain, irrelevance is easier to ‘get at’
I
e.g. STFT to separate magnitude and phase
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
12 / 37
Aside: Coding standards
Coding only useful if recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
I
I
FS1015e: LPC-10 2.4 Kbps
FS1016: 4.8 Kbps CELP
ITU G.x series (also H.x for video)
I
I
G.726 ADPCM
G.729 Low delay CELP
MPEG
I
I
I
MPEG-Audio layers 1,2,3 (mp3)
MPEG 2 Advanced Audio Codec (AAC)
MPEG 4 Synthetic-Natural Hybrid Codec
More recent ‘standards’
I
I
proprietary: WMA, Skype...
Speex ...
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
13 / 37
Outline
1
Information, Compression & Quantization
2
Speech coding
3
Wide-Bandwidth Audio Coding
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
14 / 37
Speech coding
Standard voice channel:
I
I
analog: 4 kHz slot (∼ 40 dB SNR)
digital: 64 Kbps = 8 bit µ-law x 8 kHz
How to compress?
Redundant
I
signal assumed to be a single voice,
not any possible waveform
Irrelevant
I
need code only enough for intelligibility, speaker identification
(c/w analog channel)
Specifically, source-filter decomposition
I
vocal tract & f0 change slowly
Applications:
I
I
live communications
offline storage
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
15 / 37
Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
I
I
filterbank breaks into spectral bands
transmit slowly-changing energy in each band
Encoder
Bandpass
filter 1
Decoder
Smoothed
energy
Downsample
& encode
E1
Bandpass
filter 1
Output
Input
Bandpass
filter N
Smoothed
energy
Voicing
analysis
I
Downsample
& encode
EN
Bandpass
filter 1
V/UV
Pitch
Pulse
generator
Noise
source
10-20 bands, perceptually spaced
Downsampling?
Excitation?
I
I
pitch / noise model
or: baseband + ‘flattening’...
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
16 / 37
LPC encoding
The classic source-filter model
Encoder
Decoder
Filter coefficients {ai}
|1/A(ej!)|
Input
s[n]
Represent
& encode
f
LPC
analysis
Represent
& encode
Residual
e[n]
Excitation
generator
^
e[n]
All-pole
filter
H(z) =
t
Output
^
s[n]
1
1 - "aiz-i
Compression gains:
I
I
filter parameters are ∼slowly changing
excitation can be represented many ways
20 ms
Filter
parameters
Excitation/pitch
parameters
5 ms
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
17 / 37
Encoding LPC filter parameters
For ‘communications quality’:
I
I
I
8 kHz sampling (4 kHz bandwidth)
∼10th order LPC (up to 5 pole pairs)
update every 20-30 ms → 300 - 500 param/s
Representation & quantization
ai
I
I
I
{ai } - poor distribution,
can’t interpolate
reflection coefficients {ki }:
guaranteed stable
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
-1
-0.5
0
0.5
1
1.5
2
ki
-2
-1.5
fLi
LSPs - lovely!
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Bit allocation (filter):
I
I
GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
18 / 37
Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
P(z) = A(z) + z −p−1 · A(z −1 )
Q(z) = A(z) − z −p−1 · A(z −1 )
I
I
I
I
z = e jω → z −1 = e −jω → |A(z)| = |A(z −1 )| on u.circ.
P(z), Q(z) have (interleaved) zeros when
−1
∠{A(z)} = ±∠{z −p−1 A(zQ
)}
reconstruct P(z), Q(z) = i (1 − ζi z −1 ) etc.
A(z) = [P(z) + Q(z)]/2
A(z-1) = 0
A(z) = 0
Q(z) = 0
P(z) = 0
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
19 / 37
Encoding LPC excitation
Excitation already better than raw signal:
5000
0
-5000
100
Original signal
0
-100 LPC residual
1.3 1.35
1.4 1.45
I
1.5
1.55
1.6
1.65
time / s
1.7
save several bits/sample, but still > 32 Kbps
Crude model: U/V flag + pitch period
I
∼ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps
Pitch period values
16 ms frame boundaries
100
50
0
-50
1.3
1.35
1.4
1.45
1.5
1.55
1.6
1.65
1.7
1.75
time / s
Band-limit then re-extend (RELP)
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
20 / 37
Encoding excitation
Something between full-quality residual (32 Kbps)
and pitch parameters (1.4 kbps)?
‘Analysis by synthesis’ loop:
Input
Filter coefficients
A(z)
LPC
analysis
s[n]
Excitation code
Synthetic
excitation
control
c[n]
Pitch-cycle
predictor
+
b·z–NL
‘Perceptual’
weighting
W(z)= A(z)
A(z/!)
MSError
minimization
^
x[n]
LPC filter
1
A(z)
–
+
^
x[n]
*ha[n] - s[n]
Excitation encoding
‘Perceptual’ weighting discounts peaks:
gain / dB
20
10
0
W(z)= A(z)
A(z/!)
1/A(z/!)
-10
-20
0
E6820 (Ellis & Mandel)
1/A(z)
500
1000
1500
2000
2500
L7: Audio compression and coding
3000
3500
freq / Hz
March 31, 2009
21 / 37
Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
15
original LPC
residual
10
5
0
-5
mi
multi-pulse
excitation
5
0
-5
I
0
20
ti
40
60
80
100
120
time / samps
encode as N × (ti , mi ) pairs
Greedy algorithm places one pulse at a time:
A(z) X (z)
Epcp =
− S(z)
A(z/γ) A(z)
X (z)
R(z)
=
−
A(z/γ) A(z/γ)
I
I
R(z) is residual of target waveform after inverse-filtering
cross-correlate hγ and r ∗ hγ , iterate
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
22 / 37
CELP
Represent excitation with codebook
e.g. 512 sparse excitation vectors
Codebook
Search
index
000
001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
0
I
excitation
60 time / samples
linear search for minimum weighted error?
FS1016 4.8 Kbps CELP (30ms frame = 144 bits):
10 LSPs
4x4 + 6x3 bits = 34 bits
Pitch delay
4 x 7 bits =
28 bits
Pitch gain
4 x 5 bits =
20 bits
I 138 bits
Codebk index 4 x 9 bits =
36 bits
Codebk gain
4 x 5 bits =
20 bits
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
23 / 37
Aside: CELP for nonspeech?
CELP is sometimes called a ‘hybrid’ coder:
I
I
originally based on source-filter voice model
CELP residual is waveform coding (no model)
freq / kHz
Original (mrZebra-8k)
4
3
2
CELP does not
break with multiple
voices etc.
1
0
5 kbps buzz-hiss
4
3
2
I
just does the
best it can
1
0
4.8 kbps CELP
4
3
2
1
0
0
1
2
3
4
5
6
7
8
time / sec
LPC filter models vocal tract;
also matches auditory system?
I
i.e. the ‘source-filter’ separation is good for
relevance as well as redundancy?
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
24 / 37
Outline
1
Information, Compression & Quantization
2
Speech coding
3
Wide-Bandwidth Audio Coding
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
25 / 37
Wide-Bandwidth Audio Coding
Goals:
I
I
transparent coding i.e. no perceptible effect
general purpose - handles any signal
Simple approaches (redundancy removal)
I
Adaptive Differential PCM (ADPCM)
X[n]
+
–
Xp[n]
I
+
D[n]
Adaptive
quantizer
C[n] = Q[D[n]]
Predictor
Xp[n] = F[X'[n-i]]
C[n]
X'[n]
+
+
+
D'[n]
Dequantizer
D'[n] = Q-1[C[n]]
as prediction gets smarter, becomes LPC
e.g. shorten - lossless LPC encoding
Larger compression gains needs irrelevance
I
hide quantization noise with psychoacoustic masking
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
26 / 37
Noise shaping
Plain Q-noise sounds like added white noise
I
I
actually, not all that disturbing
.. but worst-case for exploiting masking
1
Have Q-noise scale with
signal level
mu-law
quantization
0.8
0.6
I
I
i.e. quantizer step gets
larger with amplitude
mu(x)
0.4
minimum distortion for
some center-heavy pdf
0.2
x - mu(x)
0
0
0.2
0.4
0.6
0.8
1
Or: put Q-noise around peaks in spectrum
key to getting benefit of perceptual masking
level / dB
I
20
Signal
0
Linear Q-noise
-20
-40
-60
E6820 (Ellis & Mandel)
0
Transform Q-noise 1500
2000
2500
3000
L7: Audio compression and coding
3500
freq / Hz
March 31, 2009
27 / 37
Subband coding
Idea: Quantize separately in separate bands
Encoder
Bandpass
f
Input
Decoder
Downsample
M
Analysis
filters
Q-1[•]
M
M
f
Reconstruction
filters
(M channels)
f
I
Quantize
Q[•]
Q-1[•]
Q[•]
Output
+
M
Q-noise stays within band, gets masked
gain / dB
‘Critical sampling’ → 1/M of spectrum per band
I
0
-10
-20
-30
-40
-50
0
alias energy
0.1
0.2
0.3
0.4
0.5
0.6
0.7
normalized freq 1
some aliasing inevitable
Trick is to cancel with alias of adjacent band
→ ‘quadrature-mirror’ filters
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
28 / 37
MPEG-Audio (layer I, II)
Basic idea: Subband coding
plus psychoacoustic masking model
to choose dynamic Q-levels in subbands
Scale indices
Scale
normalize
Input
32 band
polyphase
filterbank
Format Bitstream
& pack
bitstream
Data frame
Scale
normalize
Psychoacoustic
masking
model
I
I
I
I
Quantize
Control & scales
Quantize
32 chans
x 36 samples
Per-band
masking margins
24 ms
22 kHz ÷ 32 equal bands = 690 Hz bandwidth
8 / 24 ms frames = 12 / 36 subband samples
fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp)
scale factors are like LPC envelope?
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
29 / 37
MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
Procedure
I
pick ‘tonal peaks’ in NB FFT
spectrum
I
remaining energy → ‘noisy’
peaks
I
apply nonlinear ‘spreading
function’
I
sum all masking & threshold in
power domain
E6820 (Ellis & Mandel)
SPL / dB
I
noise energy masks ∼10 dB better than tones
masking level nonlinear in frequency & intensity
complex, dynamic sounds not well understood
SPL / dB
I
SPL / dB
I
60
50
40
30
20
10
0
-10
60
50
40
30
20
10
0
-10
60
50
40
30
20
10
0
-10
Tonal
peaks
Non-tonal
energy
Signal
spectrum
1
3
5
7
9
11
13
15
17
19
21
23
25
23
25
23
25
Masking
spread
1
3
5
7
9
11
13
15
17
19
21
Resultant
masking
1
3
L7: Audio compression and coding
5
7
9
11 13 15
freq / Bark
17
19
21
March 31, 2009
30 / 37
MPEG Bit allocation
Result of psychoacoustic model is
maximum tolerable noise per subband
SNR ~ 6·B
level
Masking tone
Masked
threshold
Safe noise level
freq
Quantization noise
Subband N
I
safe noise level → required SNR → bits B
Bit allocation procedure (fixed bit rate):
I
I
I
pick channel with worst noise-masker ratio
improve its quantization by one step
repeat while more bits available for this frame
Bands with no signal above masking curve can be skipped
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
31 / 37
MPEG Audio Layer III
‘Transform coder’ on top of ‘subband coder’
Line
Subband
575
31
Filterbank
MDCT
32 Subbands
0
0
Window
Switching
Coded
Audio Signal
192 kbit/s
Distortion
Control Loop
Nonuniform
Quantization
Rate
Control Loop
Huffman
Encoding
Coding of
Sideinformation
Bitstream Formatting
CRC-Check
Digital Audio
Signal (PCM)
(768 kbit/s)
32 kbit/s
Psychoacoustic
Model
FFT
1024 Points
External Control
Blocks of 36 subband time-domain samples become 18 pairs
of frequency-domain samples
I
I
I
more redundancy in spectral domain
finer control e.g. of aliasing, masking
scale factors now in band-blocks
Fixed Huffman tables optimized for audio data
Power-law quantizer
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
32 / 37
Adaptive time window
Time window relies on temporal masking
I
single quantization level over 8-24 ms window
‘Nightmare’ scenario:
Pre-echo distortion
I
‘backward masking’ saves in most cases
Adaptive switching of time window:
normal
window level
1
0.6
0.4
0.2
0
0
E6820 (Ellis & Mandel)
transition short
0.8
20
40
60
80
100 time / ms
L7: Audio compression and coding
March 31, 2009
33 / 37
The effects of MP3
Before & after:
freq / kHz
Josie - direct from CD
20
15
10
5
0
freq / kHz
After MP3 encode (160 kbps) and decode
20
15
10
5
0
freq / kHz
Residual (after aligning 1148 sample delay)
20
15
10
5
0
0
I
I
I
2
4
6
8
10
time / sec
chop off high frequency (above 16 kHz)
occasional other time-frequency ‘holes’
quantization noise under signal
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
34 / 37
MP3 & Beyond
MP3 is ‘transparent’ at ∼ 128 Kbps for stereo
(11x smaller than 1.4 Mbps CD rate)
I
only decoder is standardized:
better psychological models → better encoders
MPEG2 AAC
I
I
I
rebuild of MP3 without backwards compatibility
30% better (stereo at 96 Kbps?)
multichannel etc.
MPEG4-Audio
I
I
wide range of component encodings
MPEG Audio, LSPs, ...
SAOL
I
I
I
‘synthetic’ component of MPEG-4 Audio
complete DSP/computer music language!
how to encode into it?
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
35 / 37
Summary
For coding, every bit counts
I
I
take care over quantization domain & effects
Shannon limits...
Speech coding
I
I
LPC modeling is old but good
CELP residual modeling can go beyond speech
Wide-band coding
I
I
noise shaping ‘hides’ quantization noise
detailed psychoacoustic models are key
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
36 / 37
References
Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and
Perception of Speech and Music. Wiley, July 1999. ISBN 0471351547.
D. Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995.
Karlheinz Brandenburg. MP3 and AAC explained. In Proc. AES 17th International
Conference on High Quality Audio Coding, pages 1–12, 1999.
T. Painter and A. Spanias. Perceptual coding of digital audio. Proc. IEEE, 80(4):
451–513, Apr. 2000.
E6820 (Ellis & Mandel)
L7: Audio compression and coding
March 31, 2009
37 / 37
Download