SMS047 - 2006 - #12 Frank Sjöberg - Signalbehandling

Psycho-acoustics
• Simple objective quality measures such as mean-square error are not very useful for efficient audio coding.
• A very important part of modern audio coding methods is the psycho-acoustic model.
• Psycho-acoustic models try to model the human perception of sound:
  – Absolute threshold of hearing (ATH)
  – Masking principles
  – Critical bandwidth, Bark scale

The auditory system
• The human auditory system is a very complex and highly non-linear system.
• It is not known exactly how it works; many questions remain open.
• The dynamic range is more than 130 dB.
• We are able to perceive sound between 20 Hz and 20 kHz. The upper limit decreases with age, and many people cannot hear above 16 kHz.
• The ear is most sensitive between 1 and 5 kHz.
• The cochlea is often modelled as a bank of highly overlapping, asymmetrical, non-linear filters. (It is often said that the cochlea, or the basilar membrane in the cochlea, performs a frequency-to-place transformation.)
• The bandwidth of each filter is often called the critical bandwidth. It increases with frequency.
• Different critical bands correspond to different regions in the cochlea.

Critical bandwidth
• Critical bandwidth can loosely be defined as the bandwidth at which subjective responses of the hearing system change abruptly.
  – The perceived loudness of narrowband noise (with constant SPL) is approximately constant as long as the bandwidth is smaller than the critical bandwidth.
  – Masking effects are constant within the critical bandwidth (a critical band).

Absolute Threshold of Hearing
• We cannot hear sounds below a certain sound pressure level. This is referred to as the absolute threshold of hearing (ATH). With $f$ in Hz, it is approximated by

$$\mathrm{ATH}(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^2} + 10^{-3}\,(f/1000)^4 \quad \text{(dB SPL)}$$
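The ATH approximation can be evaluated directly. The following is a minimal Python sketch that transcribes the formula term by term; the function name `ath_db_spl` is chosen here for illustration and does not come from the slides.

```python
import math


def ath_db_spl(f_hz: float) -> float:
    """Approximate absolute threshold of hearing (dB SPL) at f_hz (Hz)."""
    khz = f_hz / 1000.0
    return (
        3.64 * khz ** -0.8                          # rise towards low frequencies
        - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)   # dip around 3.3 kHz
        + 1e-3 * khz ** 4                           # steep rise towards high frequencies
    )


if __name__ == "__main__":
    for f in (100, 1000, 3300, 10000, 16000):
        print(f"{f:6d} Hz: {ath_db_spl(f):6.1f} dB SPL")
```

The dip of the middle term near 3.3 kHz reflects the statement that the ear is most sensitive between 1 and 5 kHz: the threshold is lowest there.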
[Figure: absolute threshold of hearing; sound pressure level, SPL (dB), from −10 to 100, versus frequency (Hz) on a logarithmic axis from 10^2 to 10^4.]

Critical bandwidth
• Approximate model of the critical bandwidth, with $f$ in Hz:

$$\mathrm{BW}_c(f) = 25 + 75\left[1 + 1.4\,(f/1000)^2\right]^{0.69} \quad \text{(Hz)}$$

[Figure: critical bandwidth (Hz), from 0 to 5000, versus frequency (Hz), from 0 to 2 × 10^4.]

Bark scale
• The Bark scale is a nonlinear frequency scale, related to the critical bandwidth and to how the inner ear processes a signal.
• One critical band has a width of 1 Bark.

$$z(f) = 13\,\tan^{-1}(0.00076\,f) + 3.5\,\tan^{-1}\!\left[\left(\frac{f}{7500}\right)^{2}\right] \quad \text{(Bark)}$$

[Figure: Bark value, from 0 to 25, versus frequency (Hz); one panel with a linear axis from 0 to 2 × 10^4 and one with a logarithmic axis from 10^2 to 10^4.]

Masking principles
• Masking is the effect whereby a strong sound can mask weaker sounds, tones or noise.
• We distinguish between four cases of masking:
  – Tone-masking-noise (TMN).
  – Noise-masking-tone (NMT).
  – Noise-masking-noise (NMN).
  – Tone-masking-tone (TMT).
• The first two cases have been studied the most, but cases 1 and 3 are the most relevant for audio coding.
• The masking threshold is measured as a signal-to-masker ratio (SMR).
• The masking effect is not limited to a critical band; it spreads into neighboring bands.

Tone-masking-noise
• Two tones separated by $f_d$ Hz mask narrowband noise.
• The masking effect is fairly constant as long as the tones are separated by less than the critical bandwidth.
• The SMR threshold is typically in the range 20-30 dB.
• When the distance between the maskers exceeds the critical bandwidth, the masking threshold goes down.
[Figure: two tone maskers separated by $f_d$ with narrowband noise in between; sound pressure level versus frequency, with the SMR marked.]
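Whether two maskers lie within one critical band can be estimated from the two frequency-scale approximations given earlier. The sketch below is one possible Python transcription. The Bark formula is assumed to be the common Zwicker-style approximation with the constants 0.00076 and 3.5, and `within_one_critical_band` is a hypothetical helper introduced only for this example.

```python
import math


def critical_bandwidth_hz(f_hz: float) -> float:
    """Approximate critical bandwidth (Hz) around centre frequency f_hz (Hz)."""
    khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69


def hz_to_bark(f_hz: float) -> float:
    """Map frequency (Hz) to the Bark scale (assumed Zwicker-style formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def within_one_critical_band(f1_hz: float, f2_hz: float) -> bool:
    """Hypothetical helper: True if the two frequencies are under 1 Bark apart."""
    return abs(hz_to_bark(f1_hz) - hz_to_bark(f2_hz)) < 1.0


if __name__ == "__main__":
    for f in (100, 1000, 5000, 10000):
        print(f"{f:6d} Hz: BW_c = {critical_bandwidth_hz(f):7.1f} Hz, "
              f"z = {hz_to_bark(f):5.2f} Bark")
    print(within_one_critical_band(1000, 1100))
    print(within_one_critical_band(1000, 2000))
```

Using a 1 Bark distance as the test follows from the statement that one critical band has a width of 1 Bark; a real psycho-acoustic model would typically work with band edges rather than a pairwise distance.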
Noise-masking-tone
• A noise-masking-tone test can be made with two narrowband noise maskers with a pure tone in between.
• The SMR for noise-masking-tone is around 5 dB.
• Noise is a better masker than a pure tone.
• Noise-masking-noise has not been studied as much, but it appears to give similar results.
[Figure: two narrowband noise maskers separated by $f_d$ with a pure tone in between; sound pressure level versus frequency, with the SMR marked.]

Spreading function
• The spreading function describes how masking extends into neighboring critical bands. It drops by around 10 dB/Bark on the high-frequency side of the masker and by around 25 dB/Bark on the low-frequency side.

$$\mathrm{SpFn}(z) = 15.81 + 7.5\,(z + 0.474) - 17.5\,\sqrt{1 + (z + 0.474)^2} \quad \text{(dB)}$$

[Figure: spreading function; spreading (dB) versus Bark, from −5 to 5, and spreading (dB) versus frequency (Hz), from 0 to 2 × 10^4.]

Temporal masking
• The masking effect also spreads in time.
• Pre-masking occurs about 5 ms before the start of a strong stimulus.
• Post-masking can occur up to 200 ms after a strong stimulus.
[Figure: pre-masking, simultaneous masking and post-masking regions; horizontal axis from 0 to 600, vertical axis from 0 to 70.]

Modified DCT – MDCT
• The modified DCT (MDCT) is a critically sampled lapped transform based on the DCT type IV:

$$X_{\mathrm{DCT\text{-}IV}}[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} x[n] \cos\left[\frac{\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{1}{2}\right)\right]$$

• The normalized MDCT of block number $i$ is defined as

$$X_i[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{2N-1} w[n]\,x_i[n] \cos\left[\frac{\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{N+1}{2}\right)\right], \quad 0 \le k \le N-1,$$

  where $w[n]$ is a window function and each block $x_i[n] = x[iN + n]$ overlaps $N$ samples with the previous block.
• The inverse MDCT is defined as

$$y_i[n] = w[n] \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} X_i[k] \cos\left[\frac{\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{N+1}{2}\right)\right], \quad 0 \le n \le 2N-1.$$
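The transform definitions can be written out as direct sums. The sketch below is a deliberately slow O(N²) Python version meant only to make the index ranges concrete. The names `dct_iv`, `mdct_cos`, `mdct` and `imdct` are chosen for this example, and the factor sqrt(2/N) is an assumption: it is the orthonormal convention, under which a window with w²[n] + w²[n+N] = 1 gives unit gain after overlap-add.

```python
import math


def dct_iv(x):
    """Orthonormal DCT-IV of a length-N sequence, as a direct sum."""
    n_len = len(x)
    scale = math.sqrt(2.0 / n_len)
    return [
        scale * sum(x[n] * math.cos(math.pi / n_len * (k + 0.5) * (n + 0.5))
                    for n in range(n_len))
        for k in range(n_len)
    ]


def mdct_cos(n_half, k, n):
    """Cosine kernel shared by the forward and inverse MDCT (N = n_half)."""
    return math.cos(math.pi / n_half * (k + 0.5) * (n + (n_half + 1) / 2.0))


def mdct(block, window):
    """Windowed MDCT: 2N input samples -> N coefficients."""
    n_half = len(block) // 2
    scale = math.sqrt(2.0 / n_half)  # assumed normalization (orthonormal form)
    return [
        scale * sum(window[n] * block[n] * mdct_cos(n_half, k, n)
                    for n in range(2 * n_half))
        for k in range(n_half)
    ]


def imdct(coeffs, window):
    """Windowed inverse MDCT: N coefficients -> 2N time-aliased samples."""
    n_half = len(coeffs)
    scale = math.sqrt(2.0 / n_half)  # same assumed normalization as mdct()
    return [
        window[n] * scale * sum(coeffs[k] * mdct_cos(n_half, k, n)
                                for k in range(n_half))
        for n in range(2 * n_half)
    ]


if __name__ == "__main__":
    block = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 1.0]  # 2N = 8 samples
    rect = [1.0] * len(block)                             # no windowing
    coeffs = mdct(block, rect)
    print(len(coeffs))          # N coefficients for 2N samples
    print(imdct(coeffs, rect))  # contains time-domain aliasing, not the input
```

A single inverse-transformed block does not return the input: with a rectangular window the first half comes out as x[n] − x[N−1−n] and the second half as x[n] + x[3N−1−n]. This aliasing is what the overlap between consecutive blocks removes.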
Each inverse-transformed block $y_i[n]$ covers $0 \le n \le 2N-1$, so consecutive blocks overlap by $N$ samples. We also have to add the overlapping parts to cancel the aliasing:

$$\hat{x}_i[n] = y_i[n] + y_{i-1}[n+N], \quad 0 \le n \le N-1.$$

MDCT-windows
• The window function must satisfy the following condition:

$$w^2[n] + w^2[n+N] = 1$$

• One example is the sine window used in MPEG-2 Layer 3:

$$w[n] = \sin\left[\frac{\pi}{2N}\left(n + \frac{1}{2}\right)\right]$$

• Another window is the Kaiser-Bessel derived (KBD) window, used in MPEG-4 AAC:

$$w[n] = \sqrt{\frac{\sum_{k=0}^{n} v[k]}{\sum_{k=0}^{N} v[k]}} \quad \text{for } 0 \le n \le N-1,$$

  with the second half mirrored, $w[n] = w[2N-1-n]$. Here $v[k]$ is the Kaiser window

$$v[k] = \frac{I_0\!\left(\pi\alpha\sqrt{1 - (2k/N - 1)^2}\right)}{I_0(\pi\alpha)},$$

  $I_0[\cdot]$ is the zeroth-order modified Bessel function of the first kind, and $\alpha$ is a constant that determines the shape of the window. $\alpha$ is typically chosen between 3 and 6.
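The window condition and the overlap-add step can be checked numerically. The sketch below builds both windows in plain Python and runs an MDCT/IMDCT round trip. All function names are chosen for this example. It assumes the unit-sum form of the window condition, w²[n] + w²[n+N] = 1, together with a sqrt(2/N) factor in both transform directions, and it evaluates the Bessel function with a truncated power series rather than a library call.

```python
import math


def sine_window(n_half):
    """Sine window of length 2N."""
    return [math.sin(math.pi / (2 * n_half) * (n + 0.5)) for n in range(2 * n_half)]


def bessel_i0(x, terms=40):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 1.0, 1.0
    for m in range(1, terms):
        term *= (x / (2.0 * m)) ** 2
        total += term
    return total


def kbd_window(n_half, alpha=4.0):
    """Kaiser-Bessel derived window of length 2N."""
    # Kaiser kernel v[k], k = 0..N; the 1/I0(pi*alpha) factor cancels in the ratio.
    v = [bessel_i0(math.pi * alpha
                   * math.sqrt(max(0.0, 1.0 - (2.0 * k / n_half - 1.0) ** 2)))
         for k in range(n_half + 1)]
    total = sum(v)
    first_half, running = [], 0.0
    for n in range(n_half):
        running += v[n]
        first_half.append(math.sqrt(running / total))
    return first_half + first_half[::-1]  # second half mirrors the first


def kernel(n_half, k, n):
    """MDCT cosine kernel."""
    return math.cos(math.pi / n_half * (k + 0.5) * (n + (n_half + 1) / 2.0))


def mdct(block, w):
    n_half = len(block) // 2
    scale = math.sqrt(2.0 / n_half)  # assumed normalization
    return [scale * sum(w[n] * block[n] * kernel(n_half, k, n)
                        for n in range(2 * n_half)) for k in range(n_half)]


def imdct(coeffs, w):
    n_half = len(coeffs)
    scale = math.sqrt(2.0 / n_half)  # assumed normalization
    return [w[n] * scale * sum(coeffs[k] * kernel(n_half, k, n)
                               for k in range(n_half)) for n in range(2 * n_half)]


def overlap_add_roundtrip(x, w):
    """MDCT -> IMDCT over blocks with hop N, summing the overlapping halves."""
    n_half = len(w) // 2
    out = [0.0] * len(x)
    for start in range(0, len(x) - 2 * n_half + 1, n_half):
        y = imdct(mdct(x[start:start + 2 * n_half], w), w)
        for n, value in enumerate(y):
            out[start + n] += value
    return out


if __name__ == "__main__":
    N = 8
    x = [math.sin(0.3 * n) + 0.1 * n for n in range(4 * N)]
    for name, w in (("sine", sine_window(N)), ("KBD", kbd_window(N))):
        x_hat = overlap_add_roundtrip(x, w)
        err = max(abs(a - b) for a, b in zip(x[N:3 * N], x_hat[N:3 * N]))
        print(f"{name}: max interior error = {err:.2e}")
```

Only the interior samples are compared, because the first and last N samples are covered by a single block and therefore still contain uncancelled aliasing.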