Audio Coding and MP3
Wolfgang Leister
contributions by:
Torbjørn Ekman
Norsk Regnesentral
What is Sound?
• Sound waves: 20 Hz - 20 kHz
• Speed: 331.3 m/s (air)
• Wavelength: 16.6 m - 1.66 cm
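The wavelength range follows from the relation wavelength = speed / frequency. A minimal sketch, assuming the speed of sound given on the slide; the function name `wavelength_m` is only illustrative:

```python
SPEED_OF_SOUND_M_PER_S = 331.3  # value from the slide (air)

def wavelength_m(frequency_hz, speed_m_per_s=SPEED_OF_SOUND_M_PER_S):
    """Return the wavelength in metres for a frequency in hertz."""
    return speed_m_per_s / frequency_hz

# Lower and upper limit of the audible range
for f in (20.0, 20_000.0):
    print(f"{f:8.0f} Hz -> {wavelength_m(f):.4f} m")
```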
Norsk Regnesentral
26-Feb- 03
Wolfgang Leister
Analogue audio
• frequencies: 20 Hz - 20 kHz
• mono: x(t) scalar
• stereo: $\mathbf{x}(t) = \begin{pmatrix} x_r(t) \\ x_l(t) \end{pmatrix}$
Audio Compression
• small files, low data rate at transmission
• reconstruction must be (as far as possible) equal to the original signal
• redundancy (lossless coding)
• irrelevancy (do not code what you cannot hear)
Data rates
Quality     Sample rate [kHz]  Bit/Sample  Channels  Data rate [kb/s]  Frequency [Hz]
Telephone    8.000              8          Mono        64.0            200-3400
MW          11.025              8          Mono        88.2
UKW (FM)    22.050             16          Stereo     705.6
CD          44.100             16          Stereo    1411.2            20-20000
DAT         48.000             16          Stereo    1536.0            20-20000
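The data rates of uncompressed audio follow from sample rate × bits per sample × channels. A small sketch that recomputes them; the dictionary `FORMATS` and the function name are illustrative only:

```python
def pcm_data_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in kb/s (1 kb = 1000 bit)."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

# (sample rate in Hz, bits per sample, channels)
FORMATS = {
    "Telephone": (8_000, 8, 1),
    "MW": (11_025, 8, 1),
    "UKW": (22_050, 16, 2),
    "CD": (44_100, 16, 2),
    "DAT": (48_000, 16, 2),
}

for name, params in FORMATS.items():
    print(f"{name:10s} {pcm_data_rate_kbps(*params):8.1f} kb/s")
```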
Dynamics compression
• A-Law
  $$S' = \begin{cases} \operatorname{sign}(S) \cdot \dfrac{A \cdot |S|}{1 + \ln A} & \text{for } |S| \le \dfrac{1}{A} \\[2ex] \operatorname{sign}(S) \cdot \dfrac{1 + \ln(A \cdot |S|)}{1 + \ln A} & \text{else} \end{cases}$$
• µ-Law
  $$S' = \operatorname{sign}(S) \cdot \dfrac{\ln(1 + \mu \cdot |S|)}{\ln(1 + \mu)}, \quad \mu = 255$$
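Both companding curves can be tried out directly. The sketch below assumes input samples normalised to [-1, 1], uses the standard µ-law form ln(1 + µ|S|) / ln(1 + µ), and takes A = 87.6, a commonly used value that the slide itself does not specify:

```python
import math

A = 87.6     # assumption: commonly used A-law parameter, not given on the slide
MU = 255.0   # value given on the slide

def sign(s):
    """Return -1, 0 or 1 depending on the sign of s."""
    return (s > 0) - (s < 0)

def a_law(s, a=A):
    """A-law compression of a sample s in [-1, 1]."""
    x = abs(s)
    if x <= 1.0 / a:
        return sign(s) * a * x / (1.0 + math.log(a))
    return sign(s) * (1.0 + math.log(a * x)) / (1.0 + math.log(a))

def mu_law(s, mu=MU):
    """mu-law compression of a sample s in [-1, 1]."""
    return sign(s) * math.log(1.0 + mu * abs(s)) / math.log(1.0 + mu)

for s in (0.0, 0.01, 0.1, 0.5, 1.0):
    print(f"S={s:5.2f}  A-law={a_law(s):.3f}  mu-law={mu_law(s):.3f}")
```

Small amplitudes are stretched and large ones compressed, so quiet passages get relatively more of the available quantisation levels.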
Masking
• Threshold for human ear
• Threshold changes:
  • neighbouring frequencies (Example 0.5, 1, 4, 8 kHz)
  • in time
Sampling
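• When x(t) is bandwidth limited, i.e. its spectrum satisfies
  $$|f| > \omega \;\Rightarrow\; X(f) = 0$$
• then
  $$x(t) = \sum_{n=-\infty}^{\infty} x[n] \, g(t - n \cdot \Delta t)$$
• with
  $$x[n] = x(n \cdot \Delta t), \qquad \Delta t = \frac{1}{f_s} < \frac{1}{2\omega}, \qquad g(t) = \frac{\sin(2\pi\omega t)}{2\pi\omega t}$$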
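The interpolation sum can be checked numerically. This is a minimal sketch under stated assumptions: the infinite sum is truncated to a finite block of samples, and the kernel is matched to the sampling rate, g(t) = sinc(f_s · t), which is the slide's g(t) in the limiting case Δt = 1/(2ω). The sampling rate and test tone are arbitrary illustrative values:

```python
import math

def sinc(u):
    """Normalised sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    if u == 0.0:
        return 1.0
    return math.sin(math.pi * u) / (math.pi * u)

def reconstruct(samples, fs, t):
    """Evaluate x(t) = sum_n x[n] * g(t - n/fs) with g(t) = sinc(fs * t)."""
    return sum(x_n * sinc(t * fs - n) for n, x_n in enumerate(samples))

FS = 8000.0   # sampling rate in Hz (illustrative)
F0 = 440.0    # test tone, well below FS / 2
samples = [math.sin(2.0 * math.pi * F0 * n / FS) for n in range(2000)]

t_mid = 1000.5 / FS   # halfway between two sample instants
approx = reconstruct(samples, FS, t_mid)
exact = math.sin(2.0 * math.pi * F0 * t_mid)
print(f"reconstructed {approx:+.4f}, exact {exact:+.4f}")
```

Because the sum is truncated, the value between samples is only approximately equal to the original signal; at the sample instants the reconstruction returns the samples themselves.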
Quantisation
• $x \rightarrow Q(x)$
• k bits ⇒ $L = 2^k$ representations
• $\{y_1, \ldots, y_n\}$
• $|x - y_i| \le |x - y_j| \;\Rightarrow\; Q(x) = y_i$
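The nearest-neighbour rule can be sketched in a few lines. The uniform placement of the representation levels on [-1, 1] is an assumption made for illustration; the rule itself works for any set of levels:

```python
def make_levels(k, lo=-1.0, hi=1.0):
    """Return L = 2**k uniformly spaced representation levels in [lo, hi]."""
    count = 2 ** k
    step = (hi - lo) / count
    return [lo + (i + 0.5) * step for i in range(count)]

def quantise(x, levels):
    """Q(x) = y_i such that |x - y_i| <= |x - y_j| for all j."""
    return min(levels, key=lambda y: abs(x - y))

levels = make_levels(3)          # 3 bits -> 8 levels
print(levels)
print(quantise(0.3, levels))     # nearest level to 0.3
```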
PCM = Pulse Code Modulation
• Sampling: $\{x(t)\} \rightarrow \{x[n]\}$
  Quantisation: $\{x[n]\} \rightarrow \{Q(x[n])\}$
  Coding: $Q(\{x[n]\}) \rightarrow \{n_i\}$
• Play: $y(t) = \sum_i Q(x[n_i]) \cdot g(t - n_i \cdot \Delta t)$
• redundancy
• irrelevancy
Stereo CD Audio
• Data rate:
  $$2 \cdot 16\,\text{bit} \cdot 44.1 \cdot 10^{3}\,\text{s}^{-1} = 1411.2 \cdot 10^{3}\,\frac{\text{bit}}{\text{s}}$$
MPEG compression factors
• MPEG 1 Audio: PCM 32, 44.1, 48 kHz, max 448 kBit/s
• MPEG 2 Audio: PCM 16, 22.05, 24, 32, 44.1, 48 kHz, max 384 kBit/s
MPEG Audio Layer I,II,III
• Layer I
• Layer II ⇒ Digital TV
• Layer III ⇒ MP3
MP3 - MPEG 1 Audio Layer 3
• Sampling: 16 kHz - 48 kHz
• Bit rate: 32 kb/s - 192 kb/s (CD Audio: 44.1 kHz, 1411 kb/s)
• www.iis.fhg.de/amm/gallery/index.html
• Karlheinz Brandenburg: “MP3 and AAC explained”
  http://www.exp-math.uni-essen.de/~dreibh/diplom/bra99.pdf
perceptual encoding / decoding
Filterbank
Ideal sub-band coder
• impossible: ideal sub-band coder
  $$H_m(f) = \begin{cases} 1 & \text{for } f \in D_m,\; m = 1, \ldots, M \\ 0 & \text{else} \end{cases}$$
• downsampling ⇒ aliasing
• possible: “nearly perfect”
Downsampling
• from $M \cdot f_s$ back to $f_s$
• sub-bandwidth B, upper frequency is multiple of B
• can sample at $f_s = 2B$ (instead of $f_s = 2M \cdot B$)
$$x_m[n] \;\rightarrow\; \downarrow M \;\rightarrow\; y_m[k], \qquad y_m[k] = x_m[k \cdot M]$$
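Downsampling by M keeps every M-th sample of a sub-band signal. A minimal sketch of y_m[k] = x_m[k · M]; the function name is illustrative:

```python
def downsample(x, m):
    """Keep every m-th sample: y[k] = x[k * m]."""
    count = (len(x) + m - 1) // m   # number of output samples
    return [x[k * m] for k in range(count)]

x = list(range(10))
print(downsample(x, 4))   # samples at indices 0, 4 and 8
```

With the 32 sub-bands used in MPEG audio, M = 32, so the total number of samples across all sub-bands stays equal to the number of input samples.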
Filterbank in MPEG-1 audio layer 1-3
• Polyphase filterbank
• 32 subbands
• 512 tap FIR-filters
• 80 additions and multiplications per output
• Equal width
• Not perfect reconstruction
• Frequency overlap
A closer look
• Each subband overlaps its adjacent bands at the 3 dB point.
• The leakage to the other bands is small.
• The total response almost adds up to one (0 dB).
White noise
• White noise run through the filterbank.
• The samples from each band are played in the order of the subbands.
• The subsampled filtered sequence.
• The samples from each band are played in the order of the subbands.
• The reconstruction error is -84 dB.
Nonideal filterbanks
$$Y(e^{j\omega}) = X(e^{j\omega}) \underbrace{\frac{1}{M} \sum_{k=0}^{M-1} H_k^R(e^{j\omega}) H_k^A(e^{j\omega})}_{\approx 1} + \sum_{n=1}^{M-1} X\!\left(e^{j\left(\omega - \frac{2\pi n}{M}\right)}\right) \underbrace{\frac{1}{M} \sum_{k=0}^{M-1} H_k^R(e^{j\omega}) H_k^A\!\left(e^{j\left(\omega - \frac{2\pi n}{M}\right)}\right)}_{\approx 0}$$
• In a perfect filterbank the first part is the only part.
• The second part consists of the aliasing terms.
• The filterbank is designed so that the aliasing is small.
Tubthumper, a time domain view
The red line is the reconstruction error after splitting the
signal in subbands, down sampling and applying the synthesis
filterbank. The reconstruction error is –84 dB and sounds like
Tubthumper, frequency view
Subband                  1     2     4     8     16    32
Center frequency [kHz]   0.3   1.0   2.4   5.2   10.7  21.7
[Figure panels: "No subsampling" and "Subsampled 32 times"]
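The centre frequencies are consistent with 32 equal-width subbands covering 0 Hz to half the sample rate. The sketch below assumes a sample rate of 44.1 kHz, which the slide does not state, and counts subbands from 1 as in the table:

```python
SAMPLE_RATE_HZ = 44_100   # assumption: CD sample rate, not stated on the slide
NUM_SUBBANDS = 32

def subband_centre_khz(band, fs=SAMPLE_RATE_HZ, m=NUM_SUBBANDS):
    """Centre frequency in kHz of subband `band` (1-based, equal widths)."""
    width_hz = fs / (2 * m)              # about 689 Hz per subband
    return (band - 0.5) * width_hz / 1000.0

centres = [round(subband_centre_khz(b), 1) for b in (1, 2, 4, 8, 16, 32)]
print(centres)
```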
Filterbank MPEG
[Figure: the polyphase filterbank delivers blocks of 12 samples for each band (band 1, band 2, ..., band 31). A Layer I frame holds 384 samples; a Layer II/III frame holds 1152 samples.]
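The frame sizes follow from the filterbank layout: 32 subbands with 12 samples each give 384 samples, and three such groups give 1152. A short sketch that also converts frame sizes into durations; the 48 kHz sample rate in the example is just one of the rates MPEG-1 supports:

```python
SUBBANDS = 32
SAMPLES_PER_BAND = 12

LAYER_1_FRAME = SUBBANDS * SAMPLES_PER_BAND   # one block of 12 samples per band
LAYER_2_3_FRAME = 3 * LAYER_1_FRAME           # three such blocks

def frame_duration_ms(samples, sample_rate_hz):
    """Duration of a frame in milliseconds."""
    return 1000.0 * samples / sample_rate_hz

for name, size in (("Layer I", LAYER_1_FRAME), ("Layer II/III", LAYER_2_3_FRAME)):
    print(f"{name:12s} {size:5d} samples, {frame_duration_ms(size, 48_000):.1f} ms at 48 kHz")
```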
Critical Bands
• Heinrich Barkhausen (1881-1956)
• psycho-acoustic
• width measured in bark
Frequency f (in Hz) expressed in bark:
$$z(f) = \begin{cases} f / 100 & \text{for } f < 500 \\ 9 + 4 \cdot \log_2(f / 1000) & \text{else} \end{cases}$$
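The frequency-to-bark mapping can be written as a small function. The sketch assumes f in Hz and a base-2 logarithm, the choice for which both branches meet at 500 Hz (5 bark):

```python
import math

def bark(f_hz):
    """Approximate critical-band rate in bark for a frequency in Hz."""
    if f_hz < 500.0:
        return f_hz / 100.0
    return 9.0 + 4.0 * math.log2(f_hz / 1000.0)

for f in (100, 500, 1000, 2000, 4000, 16000):
    print(f"{f:6d} Hz -> {bark(f):5.1f} bark")
```

Below 500 Hz the scale is linear in frequency; above it, each doubling of frequency adds 4 bark.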
MPEG - Sub bands
• Layer I: 32 bands, 625 Hz each, Fourier transform
• Layer II: 32 bands, three frames, time masking
• Layer III: Division according to critical bands
MPEG masking
• Psycho-acoustic model
• masking of neighbouring bands
• signals are coded when above masking threshold
• MUSICAM (Masking-pattern adapted Universal Subband Integrated Coding and Multiplexing)
• Layer I: simplified, Layer II: entirely, Layer III: with other methods
Example: Masking MPEG Audio
band      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
level     1   8  12  10   6   2  10  60  35  20  15   2   3   5   3   1
masking   ?   ?   ?   ?   ?   ?  12   x  15   ?   ?   ?   ?   ?   ?   ?
coding    ?   ?   ?   ?   ?   ?   -   x   x   ?   ?   ?   ?   ?   ?   ?
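The decision in this example can be sketched as a comparison of each band's level against the masking threshold caused by its loud neighbour. The reading used here is an assumption: the strong signal in band 8 (level 60) produces masking thresholds of 12 in band 7 and 15 in band 9, and a band is coded only if its level exceeds its threshold:

```python
# Levels of bands 1..16 as listed in the example
LEVELS = [1, 8, 12, 10, 6, 2, 10, 60, 35, 20, 15, 2, 3, 5, 3, 1]

# Masking thresholds caused by band 8 (assumed reading of the example)
MASKING = {7: 12, 9: 15}

def coding_decisions(levels, masking):
    """Return {band: True if the band must be coded} for the masked bands."""
    return {band: levels[band - 1] > threshold
            for band, threshold in masking.items()}

for band, coded in coding_decisions(LEVELS, MASKING).items():
    print(f"band {band}: level {LEVELS[band - 1]}, "
          f"threshold {MASKING[band]}, coded: {coded}")
```

Band 7 (level 10) lies below its threshold and is dropped, while band 9 (level 35) lies above its threshold and is coded.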
MPEG-1 Layer 3 encoder
MP3
• Filter bank - sub bands
• Series MDCT
• fine grain frequency resolution
• non-uniform quantisation
• perception model
• Huffman coding
MP3 (vs. Layer I/II)
• modified DCT (Series MDCT vs. FFT)
• critical bands
• Huffman coding
• entropy reduction
• dynamics compression
• difference and sum of stereo signals
MPEG Audio Layer I,II,III
• Layer I: 19 ms delay, FFT, 384 samples, frequency masking, equal bands
• Layer II: 35 ms delay, FFT, 1152 samples, frequency masking, time simulated, equal bands
• Layer III: 59 ms delay, DCT, 1152 samples, frequency and time masking, bands as in bark scale
MPEG Layer I, II, III
Format            Subj. quality  Bandwidth [kb/s]  Compression  1 min audio
Audio CD          CD             1400              1:1          10.58 MB
MPEG1 Layer I     CD              384              3.6:1         2.88 MB
MPEG1 Layer II    CD              256              5.5:1         1.92 MB
MPEG1 Layer III   CD              128              11:1           962 kB
MPEG2 Layer III   Radio            64              22:1           481 kB
MPEG2 Layer III   Telephone        16              88:1           120 kB
CS-ACELP          Speech          5.30             264:1           40 kB
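The storage and compression columns can be approximated from the bit rate alone. The sketch assumes 1 kb = 1000 bit, 1 kB = 1000 bytes, and a CD reference rate of 1411.2 kb/s; the table's own figures are rounded and some appear to be derived from slightly different base values, so small deviations are expected:

```python
CD_RATE_KBPS = 1411.2   # 2 channels * 16 bit * 44.1 kHz

def size_bytes(bitrate_kbps, seconds=60):
    """Storage needed for `seconds` of audio at the given bit rate."""
    return bitrate_kbps * 1000 * seconds / 8

def compression_ratio(bitrate_kbps, reference_kbps=CD_RATE_KBPS):
    """Compression factor relative to uncompressed CD audio."""
    return reference_kbps / bitrate_kbps

for rate in (384, 256, 128, 64, 16, 5.3):
    print(f"{rate:6.1f} kb/s: {size_bytes(rate) / 1000:8.1f} kB per minute, "
          f"about {compression_ratio(rate):.1f}:1")
```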
MPEG-2 AAC
Audio Formats
• PCM - Pulse Code Modulation
  ITU G.711; speech data 4 kHz bandwidth, 64 kb/s data rate
• ADPCM (Adaptive Differential PCM)
  ITU G.726, G.727; 16, 24, 32, 40 kBit/s. Standard for CCITT G.721
• SB-ADPCM (Sub-Band ADPCM)
  ISDN, G.722; 7 kHz bandwidth in 64 kBit/s streams
Audio Formats
• AIFF - Audio Interchange File Format
  Apple (extension from IFF by Electronic Arts)
• Wave (by Microsoft and IBM)
  Part of RIFF (Resource Interchange File Format)
• NeXT/Sun Audio File Format
  ! big endian
Proprietary Audio Formats
• AT&T Proprietary Compression Algorithm
• EPAC (Bell Labs)
• Microsoft Windows Media Audio (WMA)
• AC-3 Audio Code No. 3 - Dolby Digital Surround
Speech compression formats
• GSM 06-10:
  160 13-bit values are compressed into 260 Bit (33 Byte); 8000 samples/s result in a data rate of 1650 Byte/s
• CELP (Code Excited Linear Prediction):
  analytical model
• LD-CELP (Low Delay CELP): G.728
• LPC-10E (Linear Prediction Coder (Enhanced)): military coder, analytical model, 2.4 kBit/s, understandable, but low quality.
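The GSM 06-10 figures can be verified with a few lines of arithmetic. The constant names are illustrative; the numbers are those given on the slide:

```python
SAMPLE_RATE_HZ = 8000      # samples per second
SAMPLES_PER_FRAME = 160    # 13-bit input values per frame
BITS_PER_SAMPLE = 13
FRAME_BITS = 260           # compressed frame size in bits
FRAME_BYTES = 33           # 260 bits rounded up to whole bytes

frames_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_FRAME
byte_rate = frames_per_second * FRAME_BYTES
input_bits = SAMPLES_PER_FRAME * BITS_PER_SAMPLE
ratio = input_bits / FRAME_BITS

print(f"{frames_per_second} frames/s, {byte_rate} Byte/s, "
      f"{input_bits} -> {FRAME_BITS} bit per frame ({ratio:.0f}:1)")
```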
End of Part
Thank you for your attention!