Audio Coding and MP3 Wolfgang Leister contributions by: Torbjørn Ekman Norsk Regnesentral What is Sound? n n n Sound waves: 20Hz - 20kHz Speed: 331.3 m/s (air) Wavelength: 165 cm - 1.65 cm Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 1 Analogue audio n n n frequencies: 20Hz - 20kHz mono: x(t) scalar stereo: x r (t ) x(t ) = x l (t ) Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Audio Compression n n n n small files, low data rate at transmission reconstruction must be (as much as possible) equal to original signal redundancy (lossless coding) irrelevancy (do not code what you cannot hear) Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 2 Data rates Quality Telephone MW UKW CD DAT Sample Rate Bit/Sample Channels 8.000 8 Mono 11.025 8 Mono 22.050 16 Stereo 44.100 16 Stereo 48.000 16 Stereo Data Rate kb/s 64,00 88,00 705,60 1411,00 1536,00 Frequency 200-3400 20-20000 20-20000 Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Dynamics compression n A-Law A ⋅ abs( S ) sign ( S ) ⋅ 1 + ln A S'= sign ( S ) ⋅ 1 + ln( A ⋅ abs(S )) 1 + ln A n for abs(S) ≤ 1 A else µ-Law S ' = sign( S ) ⋅ 1 + ln( 1 + µ ⋅ abs( S )) , µ = 255 ln( 1 + µ ) Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 3 Masking Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Masking n n Threshold for human ear Threshold changes: n neighbouring frequencies (Example 0.5, 1, 4, 8 kHz) n in time Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 4 Sampling • When x(t) is bandwidth limited: f >ω • then ⇒ x (t ) = x( f ) = 0 ∞ ∑ x[n ]g (t − n ⋅ ∆t ) n = −∞ • with ∆t = x[n] = x (n ⋅ ∆t) 1 1 < f s 2ω g( t) = sin(2πω t) 2πω t Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Quantisation n x → Q(x ) n k bits ⇒ L = 2 k representations n {y1 , K, yn } n x − yi ≤ x − y j ⇒ Q( x) = y i Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 5 PCM = Pulse Code Modulation n Sampling: Quantisation: Coding: n Play: n n {x(t )} → {x[n]} {x[n ]} → {Q( x[n])} Q({x[n ]}) → {ni } redundancy irrelevancy y (t ) = ∑ Q (x [ni ]) ⋅ g (t − ni ⋅ ∆t ) Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Stereo CD Audio n Data rate: 2 ⋅16 bit ⋅ 44.1 ⋅ 10-3 s −1 = 1411 .2 ⋅10 3 bit s Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 6 MPEG compression factors n n MPEG 1 Audio: PCM 32, 44.1, 48 kHz, max 448 kBit/s MPEG 2 Audio: PCM 16, 22.05, 24, 32, 44.1, 48 kHz, max 384 KBit/s Norsk Regnesentral 26-Feb- 03 Wolfgang Leister MPEG Audio Layer I,II,III n n n Layer I Layer II ⇒ Digital TV Layer III ⇒ MP3 Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 7 MP3 - MPEG 1 Audio Layer 3 n n Sampling: 16 kHz - 48 kHz Bit rate: 32 kb/s - 192 kb/s (CD Audio: 44.1 kHz, 1411 kb/s) n n www.iis.fhg.de/amm/gallery/index.html Karlheinz Brandenburg: “MP3 and AAC explained” http://www.exp-math.uni-essen.de/~dreibh/diplom/bra99.pdf Norsk Regnesentral 26-Feb- 03 Wolfgang Leister perceptual encoding / decoding Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 8 Filterbank Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Ideal sub-band coder n n n impossible: ideal sub-band coder downsampling ⇒ aliasing possible: “nearly perfect” 1 for f ∈ Dm , m = 1, K, M ( f ) = Hm 0 else Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 9 Downsampling n from M ⋅ f s back to f s n sub-bandwidth B, upper frequency is multiple of B n can sample at f s = 2 B (instead of f s = 2 M ⋅ B ) xm [n] ym [k ] ↓M y m [k ] = x m [k ⋅ M ] Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Filterbank in MPEG-1 audio layer 1-3 n n n n Polyphase filterbank 32 subbands 512 tap FIR-filters 80 + and * per output n n n Equal width Not perfect reconstruction Frequency overlap Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 10 A closer look n n n The subbands overlap at 3 dB to the adjacent bands. The leakage to the other bands is small. The total response almost adds up to one (0 dB). Norsk Regnesentral 26-Feb- 03 Wolfgang Leister White noise n n The white noise run through the filterbank. The samples from each band are played in the order of the subbands. n n n The subsampled filtered sequence. The samples from each band are played in the order of the subbands. The reconstruction error is –84 dB. Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 11 Nonideal filterbanks 1 M −1 Y (e jω ) = X (e j ω ) ∑ H kR (e jω )H kA (e jω ) + M4 k =4 0 4 1 424444 3 M −1 ∑X (e n =1 2πn j ω− M ≈1 2πn j ω − 1 M −1 ) ∑HkR (ej ω )HkA (e M ) M4 k=4 0 44 1 4244444 3 ≈0 n n In a perfect filterbank the first part is the only part. n The second part consists of the aliasing terms. The filterbank is designed so that the aliasing is small. Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Tubthumper, a time domain view The red line is the reconstruction error after splitting the signal in subbands, down sampling and applying the synthesis filterbank. The reconstruction error is –84 dB and sounds like Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 12 Tubthumper, frequency view Subband Center frequency 1 2 4 8 16 32 0.3 1.0 2.4 5.2 10.7 21.7 [kHz] No subsampling Subsampled 32 times Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Filterbank MPEG polyphase filterbank 12 samples 12 samples 12 samples band 1 band 2 ... Layer I frame 384 samples band 31 Layer II/III frame 1152 samples Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 13 Critical Bands n n n Heinrich Barkhausen (1881-1956) psycho-acoustic width measured in bark for f < 500 f / 100 1 bark = 9 + 4 ⋅ log( f / 1000 ) else Norsk Regnesentral 26-Feb- 03 Wolfgang Leister MPEG - Sub bands n n n Layer I: 32 bands, 625 Hz each, Fourier transform Layer II: 32 bands, three frames, time masking Layer III: Division according to critical bands Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 14 MPEG masking n n n n n Psycho-acoustic model masking of neighbouring bands signals are coded when above masking threshold MUSICAM (Masking-pattern adapted Universal Subband Integrated Coding and Multiplexing) Layer I: simplified, Layer II: entirely, Layer III: with other methods Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Example: Masking MPEG Audio band level 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 8 12 10 6 2 10 60 35 20 15 2 3 5 3 1 masking ? ? ? ? ? ? 12 x 15 ? ? ? ? ? ? ? coding ? ? ? ? ? ? ? ? ? ? ? ? - x x ? Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 15 MPEG-1 Layer 3 encoder Norsk Regnesentral 26-Feb- 03 Wolfgang Leister MP3 n n n n n n Filter bank - sub bands Series MDCT fine grain frequency resolution non-uniform quantisation perception model Huffman coding Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 16 MP3 (vs. Layer I/II) n n n n n n modified DCT (Series MDCT vs. FFT) critical bands Huffman coding entropy reduction dynamics compression difference and sum of stereo signals Norsk Regnesentral 26-Feb- 03 Wolfgang Leister MPEG Audio Layer I,II,III n n n Layer I: 19 ms delay, FFT, 384 samples, frequency masking, equal bands Layer II: 35 ms delay, FFT, 1152 samples, frequency masking, time simulated, equal bands Layer III: 59 ms delay, DCT, 1152 samples, frequency and time masking, bands as in bark scale Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 17 MPEG Layer I, II, III subj. quality Audio CD MPEG1 Layer I MPEG1 Layer II MPEG1 Layer III MPEG2 Layer III MPEG2 Layer III CS-ACELP CD CD CD CD Radio Telephone Speech bandwidth compression 1400 384 256 128 64 16 5,30 1:1 3.6:1 5.5:1 11:1 22:1 88:1 264:1 1 min audio 10.58 MB 2.88 MB 1.92 MB 962 kB 481 kB 120 kB 40 kB Norsk Regnesentral 26-Feb- 03 Wolfgang Leister MPEG-2 AAC Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 18 Audio Formats n PCM - Pulse Code Modulation ITU G.711; speech data 4kHz bandwidth, 64 kb/s data rate n ADPCM (Adaptive Differential PCM) ITU G.726, G.727; 16, 24, 32, 40 kBit/s. Standard for CCITT G.721 n SB-ADPCM (Sub-Band ADPCM) ISDN, G.722; 7 kHz bandwidth in 64 kBit/s streams Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Audio Formats n AIFF - Audio Interchange File Format Apple n (extension from IFF by Electronic Arts) Wave (by Microsoft and IBM) Part of RIFF (Resource Interchange File Format) n NeXT/Sun Audio File Format ! big endian Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 19 Proprietary Audio Formats n n n n AT&T Proprietary Compression Algorithm EPAC (Bell Labs) Microsoft Windows Media Audio (WMA) AC-3 Audio Code No. 3 - Dolby Digital Surround Norsk Regnesentral 26-Feb- 03 Wolfgang Leister Speech compression formats n GSM 06-10: n CELP (Code Excited Linear Prediction): 160 13-bit values in 260 Bit (33 Byte) are compressed; 8000 samples/s result in data rate of 1650 Byte/s analytical model n n LD-CELP (Low Delay CELP): G.728 LPC-10E (Linear Prediction Coder (Enhanced): military coder, analytical model, 2.4 kBit/s understandable, but low quality. Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 20 End of Part Thank you for your attention! Norsk Regnesentral 26-Feb- 03 Wolfgang Leister 21