Media Processing – Audio Part
Dr Wenwu Wang
Centre for Vision, Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Approximate outline
Week 6: Fundamentals of audio
Week 7: Audio acquisition, recording, and standards
Week 8: Audio processing, coding, and standards
Week 9: Audio production and reproduction
Week 10: Audio perception and audio quality assessment

Audio coding and standards
Concepts and topics to be covered:
Spectral analysis of audio: DFT, DCT, and MDCT
Subband analysis of audio: PQMF
Audio coding methods: lossless and lossy coding
Coding standards: MPEG-1, -2, and -4
Coding principles of MPEG-1

Fourier transform
Motivation for using the Fourier transform (FT): the waveform (a general time-domain representation) gives some indication of the dynamics and periodicity of audio, but beyond this it reveals little about, for example, the signal's frequency distribution. The Fourier transform provides an alternative representation of the signal, suitable for displaying other speech characteristics, such as its frequency content, harmonics, etc.
Definition: the Fourier transform of a continuous signal x(t) is computed as
$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$$
where $\omega$ is the angular frequency: $\omega = 2\pi f$.
Inverse Fourier transform:
$$x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega$$

Discrete-time Fourier transform (DTFT)
The Fourier transform of a discrete-time signal $x[n] = x(nT)$, where T is the sampling period, is computed as
$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$
where $\omega$ is the angular frequency: $\omega = 2\pi f$.
Inverse discrete-time Fourier transform (the integral is over any interval of length $2\pi$):
$$x[n] = \frac{1}{2\pi} \int_{2\pi} X(\omega)\, e^{j\omega n}\, d\omega$$

Discrete Fourier transform (DFT)
The Fourier transform of a digital signal x[n] of length N is computed as
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega_k n}$$
where $\omega_k$ is the discrete angular frequency: $\omega_k = \frac{2\pi k}{N}$, $k = 0, 1, \dots, N-1$.
Inverse discrete Fourier transform:
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j\omega_k n}, \quad n = 0, 1, \dots, N-1$$

Power spectral density (PSD)
The PSD is defined as the magnitude squared of the DFT of the signal:
$$P[k] = |X[k]|^2$$
Examples: PSD of a vowel spoken by a male speaker; PSD of a fricative spoken by a male speaker.

Fast Fourier transform (FFT)
The FFT is a fast computation of the DFT. The typical FFT algorithm consists of three conceptual parts:
Shuffling (bit reversal): shuffling the N-dimensional input into N one-dimensional signals
Performing N one-point DFTs
Merging the N one-point DFTs into one N-point DFT using "butterfly" merging equations (requiring N to be an integral power of 2)
The computational complexities of the FFT and DFT are, respectively: FFT: $O(N \log N)$; DFT: $O(N^2)$.

Short-time Fourier transform (STFT)
The STFT (sometimes called the short-term FT) can be computed as an N-point windowed DFT as follows (note that we consider only the discrete form here; in practice the FFT is usually used to compute the DFT in each frame):
$$X[k, m] = \sum_{n=0}^{N-1} x[m\Delta + n]\, w(n)\, e^{-j\omega_k n}$$
where
$\omega_k$ - the discrete angular frequency: $\omega_k = \frac{2\pi k}{N}$, $k = 0, 1, \dots, N-1$
$m$ - the time-frame index
$\Delta$ - the hop size
$w(n)$ - a window function, such as the rectangular or Hann window

Spectrogram
The spectrogram of a speech signal can be computed as the magnitude squared of the STFT:
$$\text{spectrogram}\{x[n]\} = |X[k, m]|^2$$
An example: spectrogram of a female speaker uttering "warm cloak".
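As a minimal illustration of the DFT, the PSD, and the FFT speed-up, the NumPy sketch below compares a direct $O(N^2)$ evaluation of the DFT definition with np.fft.fft; the test tone frequencies, sampling rate, and variable names are illustrative choices, not from the slides.

```python
import numpy as np

fs = 8000                                  # sampling rate (Hz), illustrative
t = np.arange(0, 0.064, 1 / fs)           # N = 512 samples, a power of 2
x = np.sin(2 * np.pi * 125 * t) + 0.5 * np.sin(2 * np.pi * 375 * t)

# DFT computed directly from the definition: X[k] = sum_n x[n] e^{-j 2 pi k n / N}
N = len(x)
n = np.arange(N)
X_direct = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])  # O(N^2)

X_fft = np.fft.fft(x)                     # FFT: the same result in O(N log N)
assert np.allclose(X_direct, X_fft)

P = np.abs(X_fft) ** 2                    # PSD: magnitude squared of the DFT
freqs = np.fft.fftfreq(N, d=1 / fs)      # frequency axis in Hz
print(freqs[np.argmax(P[:N // 2])])      # strongest component: 125.0 Hz
```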
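A spectrogram itself can be sketched with scipy.signal.stft, here applied to the four-sinusoid signal used in the resolution discussion that follows (assuming, as in the Wikipedia example, that the four tones are concatenated in time); the two nperseg values are illustrative and preview the window-size trade-off:

```python
import numpy as np
from scipy.signal import stft

fs = 400                                   # sampling rate of the slides' example
x = np.concatenate([np.sin(2 * np.pi * f * np.arange(0, 5, 1 / fs))
                    for f in (10, 25, 50, 100)])   # four tones, 5 s each

# Short window: good time resolution, poor frequency resolution.
f1, t1, Z1 = stft(x, fs=fs, window='hann', nperseg=32)
# Long window: good frequency resolution, poor time resolution.
f2, t2, Z2 = stft(x, fs=fs, window='hann', nperseg=512)

S1, S2 = np.abs(Z1) ** 2, np.abs(Z2) ** 2  # spectrograms = |STFT|^2
print(f1[1] - f1[0], f2[1] - f2[0])        # bin spacing: 12.5 Hz vs ~0.78 Hz
```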
Spectrogram (cont.)
Each vertical line of the spectrogram describes the frequency-dependent power distribution of the signal over a short segment (or window) of the speech signal, i.e. the PSD of that segment. The width of the window is N, and the gap between consecutive windows is the hop size $\Delta$.
Each horizontal line of the spectrogram represents the power distribution within a particular frequency band as a function of time.
The spectrogram thus shows the time-frequency distribution of power within the signal. It is much better suited than the waveform to displaying speech structures, e.g. harmonics, the energy balance of frequency components, formants, etc.
The time and frequency resolution of the spectrogram are inversely proportional.

Spectrogram – resolution issues
The STFT has a fixed resolution that depends on the choice of window size. A wider window gives better frequency resolution (frequency components close together can be separated) but poorer time resolution (the time at which frequencies change), and vice versa.
We use the example from Wikipedia to demonstrate this: a signal composed of 4 sinusoidal components, with frequencies of 10, 25, 50, and 100 Hz respectively, each lasting 5 seconds. The sampling frequency of the signal is 400 Hz.
Multi-resolution analysis tools exist that do not suffer from this problem, such as the wavelet transform.

Spectrogram resolution issues (cont.)
Figure: different time-frequency resolutions for the same signal, obtained by using different window sizes in generating the STFT. (Source: Wikipedia.)

Spectrogram resolution issues (cont.)
Although a long window can give higher frequency resolution, too long a window would be misleading, as the spectral characteristics can change over the duration of the windowed segment.
How long a window should we choose so that the spectral characteristics do not change (dramatically)? This question relates to the concept of "stationarity". In practice, a speech segment of around 20-30 ms is usually regarded as "quasi-stationary" (very little change in spectral characteristics). This is because speech units (phonemes) occur at a rate of about 4-5 per second in average speech, although more rapid changes can occur from one steady state to another.
To ensure smooth transitions of the energy distribution from frame to frame, the windows are usually chosen to overlap, with a typical hop size of 5 ms.

Windowing and overlapping in spectrogram
By choosing the window length appropriately, the assumption of stationarity (quasi-stationarity) within the windowed speech is almost true. However, when appending copies of the segment one after another, there may still be sharp discontinuities in the waveform at the boundaries (see the figure below). These discontinuities introduce spurious high-frequency components spread across the spectrum, known as spectral leakage.
Figure: spectral leakage. (a) A sinusoidal audio segment; (b) its periodic extension.

Windowing and overlapping in spectrogram (cont.)
To reduce spectral leakage, we can multiply the segment by a window function that approaches zero at its ends, such as the Hann window shown below. This effectively attenuates the discontinuities at the two boundaries of the window, and therefore reduces the leakage.
Figure: the waveform and amplitude spectrum of the Hann window function.
In practice, short segments can be appended with zeros to reach the required length, known as zero-padding.

Various window functions
Source: Kondoz (2001)

Time plots of various window functions
Source: Kondoz (2001)

Frequency response of various window functions
Source: Kondoz (2001)
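A minimal NumPy sketch of the leakage effect: the tone frequency is deliberately chosen off a DFT bin centre so that the rectangular (untapered) segment leaks, while a Hann taper pushes the far-field leakage floor down by many tens of dB. All parameters and the leakage-floor measurement are illustrative.

```python
import numpy as np

fs, N = 8000, 512
n = np.arange(N)
x = np.sin(2 * np.pi * 210.5 * n / fs)         # tone not centred on a DFT bin

rect = np.abs(np.fft.rfft(x))                  # rectangular window (no taper)
hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hann-tapered segment

# Leakage floor far from the tone, relative to the spectral peak, in dB:
far = slice(200, 257)                          # bins around 3.1-4 kHz
print(20 * np.log10(rect[far].max() / rect.max()))  # high leakage floor
print(20 * np.log10(hann[far].max() / hann.max()))  # far lower with Hann
```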
Short-time spectral analysis using DFT
Effect of window type on voiced speech with a 220-sample window length: (a) and (b) are time and frequency plots of speech using a rectangular window, and (c) and (d) are time and frequency plots of speech using a Hamming window. Source: Kondoz (2001)

Short-time spectral analysis using DFT
Effect of window type on unvoiced speech with a 220-sample window length: (a) and (b) are time and frequency plots of speech using a rectangular window, and (c) and (d) are time and frequency plots of speech using a Hamming window. Source: Kondoz (2001)

Short-time spectral analysis using DFT
Effect of window type on voiced speech with a 40-sample window length: (a) and (b) are time and frequency plots of speech using a rectangular window, and (c) and (d) are time and frequency plots of speech using a Hamming window. Source: Kondoz (2001)

Discrete Cosine Transform (DCT)
A common definition of the DCT (the DCT-II) is:
$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right], \quad k = 0, 1, \dots, N-1$$
Source: Wikipedia

MDCT
An advantage of the modified DCT (MDCT) is that it allows a 50% overlap between blocks without increasing the data rate.
The MDCT is an example of a class of transforms called time-domain aliasing cancellation (TDAC) transforms. In particular, the MDCT is sometimes referred to as oddly-stacked TDAC (OTDAC). These transforms do not invert like the DFT to recover the original signal; rather, each block inverts to a signal that has the adjacent blocks' signal mixed into it. When the overlapped inverse blocks are added, the effect of "time-domain aliasing", i.e. the mixing of adjacent blocks of data, is removed. As a result, the input signal is perfectly reconstructed.

MDCT (cont.)
Analysis: from time to frequency. Synthesis: from frequency to time.
For the signal to be perfectly reconstructed after the synthesis process, the windows should satisfy the following condition (together with a symmetry condition on the windows that makes the time-domain alias terms cancel):
$$w_a^{(i)}\!\left[n + \tfrac{N}{2}\right] w_s^{(i)}\!\left[n + \tfrac{N}{2}\right] + w_a^{(i+1)}[n]\, w_s^{(i+1)}[n] = 1, \quad n = 0, 1, \dots, \tfrac{N}{2} - 1$$
where i is the index of blocks (or short-time frames), subscript a means analysis, and s means synthesis. The MDCT phase term is $n_0 = (N/2 + 1)/2$. A common choice satisfying these conditions is the sine window, $w[n] = \sin\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)\right]$, used for both analysis and synthesis.

MDCT (cont.)
Responses of the MDCT filter bank (cosine window function). Source: Bosi & Goldberg (2002)
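The NumPy sketch below implements one common MDCT convention (the cosine kernel with the phase term $n_0 = (N/2+1)/2$ above) with a sine window, and verifies perfect reconstruction by overlap-add; the block length, the normalisation factor, and all names are illustrative choices, not those of any particular codec.

```python
import numpy as np

def mdct_basis(M):
    # MDCT kernel: cos[(pi/M)(n + 1/2 + M/2)(k + 1/2)], block length N = 2M,
    # i.e. the cosine above with phase term n0 = (N/2 + 1)/2.
    n = np.arange(2 * M)[:, None]
    k = np.arange(M)[None, :]
    return np.cos(np.pi / M * (n + 0.5 + M / 2) * (k + 0.5))

def mdct_roundtrip(x, M=64):
    # Sine window: satisfies the perfect-reconstruction condition above
    # when used as both the analysis and the synthesis window.
    w = np.sin(np.pi / (2 * M) * (np.arange(2 * M) + 0.5))
    C = mdct_basis(M)
    y = np.zeros(len(x))
    for start in range(0, len(x) - 2 * M + 1, M):   # 50% overlapped blocks
        X = (w * x[start:start + 2 * M]) @ C        # MDCT: 2M samples -> M coeffs
        # Each windowed IMDCT block alone contains time-domain aliasing;
        # the overlap-add with its neighbours cancels it (TDAC).
        y[start:start + 2 * M] += w * (2.0 / M) * (C @ X)
    return y

rng = np.random.default_rng(0)
M = 64
x = rng.standard_normal(8 * M)
y = mdct_roundtrip(x, M)
# Interior samples reconstruct perfectly; the first and last M samples have
# no overlapping partner block and so are not fully reconstructed.
print(np.max(np.abs(y[M:-M] - x[M:-M])))            # ~1e-14
```

Note that each hop of M new samples produces exactly M coefficients, so despite the 50% overlap the transform remains critically sampled, which is the "without increasing the data rate" advantage stated above.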
Subband analysis of audio signals
General subband analysis framework for audio coding. Source: Bosi & Goldberg (2002)

Pseudo quadrature mirror filter (PQMF) as a subband analysis tool
The PQMF filter bank employs M analysis filters h_k and synthesis filters g_k, where k is the frequency (subband) index and n is the time index. Their general form is a cosine modulation of a lowpass prototype filter h[n] of length L:
$$h_k[n] = h[n] \cos\left[\frac{\pi}{M}\left(k + \frac{1}{2}\right)\left(n - \frac{L-1}{2}\right) + \phi_k\right], \quad g_k[n] = h_k[L - 1 - n]$$
where the phases $\phi_k$ are chosen so that the aliasing between adjacent subbands is cancelled.

PQMF filter bank
An example of subband analysis of speech. Source: Kondoz (2001)

PQMF filter bank (cont.)
The PQMF filter bank in the MPEG audio standards has M = 32 subbands and a 512-tap prototype filter h[n]. It employs the following analysis and synthesis filters respectively, where k is the frequency index and n is the time index:
$$h_k[n] = h[n] \cos\left[\frac{(2k+1)(n-16)\pi}{64}\right], \quad g_k[n] = h[n] \cos\left[\frac{(2k+1)(n+16)\pi}{64}\right]$$

PQMF filter bank (cont.)
MPEG Audio PQMF prototype filter impulse response h[n], and h_k[n] for k = 0 and k = 1. Source: Bosi & Goldberg (2002)

PQMF filter bank (cont.)
Frequency response of the prototype filter (unit: Fs/64) used in the MPEG audio standards. Source: Bosi & Goldberg (2002)

PQMF filter bank (cont.)
Frequency response of the first four bands of the MPEG audio coding standards (unit: Fs/64). Source: Bosi & Goldberg (2002)

Audio coding methods
Lossless coding
Based on the statistical relations between symbols within the data.
Entropy coding, such as Huffman coding, arithmetic coding, etc.
The original signal can be perfectly reconstructed.
Lossy coding
Based on perceptual modelling of audio signals (such as psychoacoustic models of hearing): perceptually redundant information within the audio signal can be removed without affecting its perceived quality.
Usually done in a transform domain, followed by quantisation.
The original signal cannot be perfectly reconstructed.

Lossless coding: Huffman coding
Huffman coding is a variable-length coding method that assigns codewords to symbols based on the probability of each symbol's occurrence.
Consider a 2-bit quantised signal with the codes [00], [01], [10], [11], and suppose that in the signal to be encoded the probabilities of the symbols' occurrence are 70%, 15%, 10%, and 5% respectively. The original bit rate was 2 bits per sample; after entropy coding, the average bit rate becomes 1.45 (= 70%*1 + 15%*2 + 10%*3 + 5%*3) bits per sample.
Source: Bosi & Goldberg (2002)
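A minimal sketch that reproduces the slide's arithmetic: it builds Huffman codeword lengths for the four-symbol example using only the standard library's heapq and recovers the 1.45 bits/sample average. Tracking codeword lengths instead of building an explicit code tree is an implementation shortcut, not part of the slide's method.

```python
import heapq

def huffman_code_lengths(probs):
    # Build a Huffman code by repeatedly merging the two least probable
    # nodes; each merge adds one bit to the codeword length of every
    # symbol beneath the merged node. The integer field breaks ties so
    # that equal probabilities never trigger a list comparison.
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, i2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, i2, syms1 + syms2))
    return lengths

probs = {'00': 0.70, '01': 0.15, '10': 0.10, '11': 0.05}
lengths = huffman_code_lengths(probs)
print(lengths)                                    # {'00': 1, '01': 2, '10': 3, '11': 3}
print(sum(probs[s] * lengths[s] for s in probs))  # 1.45 bits/sample, down from 2
```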
Lossy coding: psychoacoustic model
Psychoacoustic principles and models, in particular frequency and temporal masking, have been used as the basis for producing perceptually lossless audio quality in lossy audio coding algorithms.
Frequency masking. Source: Bosi & Goldberg (2002)

Lossy coding: psychoacoustic model (cont.)
Temporal masking. Source: Bosi & Goldberg (2002)

MPEG-1 audio coding standard
MPEG: the Moving Picture Experts Group, within the joint technical committee on information technology (JTC 1) sponsored by the International Organisation for Standardisation (ISO) and the International Electrotechnical Commission (IEC), was established at the end of the 1980s with the aim of developing standards for the coded representation of moving pictures, associated audio, and their combination.
MPEG-1 Audio was the first international standard specifying a digital format for high-quality audio, with the aim of reducing the data rate while maintaining CD-like quality. Prior to this standard, standardisation efforts targeted either speech-only applications or only medium-quality audio performance.
The adoption of MPEG-1 enabled the use of compressed high-quality audio in a wide range of applications, from digital broadcasting to internet applications.

MPEG standards – a brief history
The MPEG standardisation effort started in 1988.
The MPEG-1 standard [ISO/IEC 11172], coding of synchronised video and audio at a total data rate of about 1.5 Mb/s, was finalised in 1992.
The MPEG-2 standard [ISO/IEC 13818], coding of synchronised video and audio at a total data rate of about 10 Mb/s, was finalised in 1994.
The effort towards an MPEG-3 standard, i.e. coding of synchronised video and audio at a total data rate of about 40 Mb/s, was dropped in 1993, being considered redundant as its intended attributes were already incorporated in the MPEG-2 standard.
The MPEG-4 standard [ISO/IEC 14496], addressing audio-visual coding at very low data rates with additional functionalities such as scalability, 3-D, and synthetic/natural hybrid coding, was finalised in 1998.
MPEG-7 [ISO/IEC 15938], addressing the description of multimedia content for multimedia database search, was finalised in 2001.

MPEG audio standards
MPEG Audio is often used as a stand-alone standard; it is, however, part of a multi-part standard, where “part 1” describes the system structure, “part 2” describes video coding, and “part 3” the audio coding.
MPEG-1 Audio
Defines coding/decoding of high-quality audio signals for storage media.
Standardises only the bitstream and decoder specifications, not the encoder. This allows interoperability between different implementations, while letting manufacturers retain control over the core intellectual property of their coding systems.
Supports one or two main channels.
Inputs and outputs are compatible with existing PCM standards such as the CD and the digital audio tape formats.
Sampling rates of 32 kHz, 44.1 kHz, and 48 kHz.
Source: Bosi & Goldberg (2002)

MPEG audio standards (cont.)
MPEG-2 Audio
Extends MPEG-1 Audio to multiple channels.
Supports lower sampling rates than MPEG-1, at 16 kHz, 22.05 kHz, and 24 kHz, motivated by the emerging internet applications.
Defines higher-quality multichannel audio than achievable with the MPEG-1 extensions: MPEG-2 AAC (NBC, non-backward-compatible) shows comparable or better audio quality than MPEG-2 Layer II BC (backward-compatible).

MPEG audio standards (cont.)
MPEG-4 Audio
Aims to provide high coding efficiency, with data rates (200 b/s to 64 kb/s) lower than those defined in MPEG-1/2.
Accommodates speech coding technology and error protection.
Includes content-based interactivity, such as flexible access and manipulation, for example pitch modification.
Allows universal access, such as access to a subset of the data, or scalability.
Supports synthetic audio and speech, e.g. structured audio and text-to-speech interfaces.
Accommodates additional post-processing effects, such as reverberation, 3-D audio, etc.

MPEG-1 Audio
Part 3 of the MPEG-1 standard, i.e. [ISO/IEC 11172-3], specifies the audio part of the MPEG-1 standard.
It includes the syntax of the coded audio bitstream and a description of the decoding process, which ensures interoperability between different systems. It also provides reference software modules and a set of test vectors for assessing the compliance of a decoder. It does not define the encoder, which is left to the designer of the system.
It describes perceptual audio coding algorithms for general audio signals, unlike speech codecs, where a specific source model is applied.

MPEG-1 Audio – main features
It supports sampling rates of 32, 44.1, and 48 kHz, and one or two channels (including a dual monophonic mode for two independent channels and a stereo mode for stereophonic channels). Data rates vary between 32 and 224 kb/s per channel, allowing compression ratios between 2.7:1 and 24:1, depending on the sampling rate.
It specifies three layers, Layer I, Layer II, and Layer III, which offer increasingly higher audio quality at increased complexity.
Layer I is the simplest layer and operates at data rates between 32 and 224 kb/s per channel (preferred rates above 128 kb/s). It finds application in, e.g., the digital compact cassette (DCC) at 192 kb/s per channel.
Layer II is of moderate complexity and operates preferably between 32 and 192 kb/s per channel (providing very good quality at 128 kb/s). Applications of Layer II include digital audio broadcasting (DAB).
Layer III has the highest quality, at increased complexity. Its data rates vary between 32 and 160 kb/s per channel. Applications include ISDN and internet transmission. A modified MPEG Layer III format at lower sampling frequencies became the well-known MP3 format.

MPEG-1 Audio coding: main building blocks of encoder
The encoder includes a time-to-frequency mapping stage followed by a bit (or noise) allocation stage. The psychoacoustic model is used to determine the precision of the allocation stage. The bitstream formatting stage interleaves the representation of the quantised data with side information and optional ancillary data.
Source: Bosi & Goldberg (2002)
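The sketch below is a schematic of this generic perceptual-encoder block structure, not the MPEG-1 algorithm itself: the filter bank, psychoacoustic model, and bit-allocation stages are toy stand-ins, and all function names, the FFT-based band split, and the 6 dB-per-bit rule of thumb are illustrative assumptions.

```python
import numpy as np

def encode_frame(seg, n_bands=32):
    # Time-to-frequency mapping (toy stand-in for the 32-band PQMF):
    # split the frame's FFT bins into 32 equal bands of 12 lines each.
    bands = np.fft.fft(seg)[:n_bands * 12].reshape(n_bands, 12)

    # Psychoacoustic model (toy stand-in): a signal-to-mask ratio per band,
    # here simply the band power relative to the strongest band.
    power = np.mean(np.abs(bands) ** 2, axis=1) + 1e-12
    smr_db = np.maximum(10 * np.log10(power / power.max()) + 60, 0)

    # Bit allocation: quantiser precision per band in proportion to its SMR
    # (each bit of precision buys roughly 6 dB of signal-to-noise ratio).
    bits = np.minimum(np.ceil(smr_db / 6), 15).astype(int)

    # Scale and quantise each band; np.round gives a midtread characteristic
    # (zero is a reproduction level).
    scale = np.abs(bands).max(axis=1) + 1e-12
    quantised = [np.round(bands[b] / scale[b] * (2 ** bits[b] - 1))
                 for b in range(n_bands)]

    # The bitstream formatting stage would interleave the quantised data
    # with the side information (bits, scale) and any ancillary data.
    return bits, scale, quantised

rng = np.random.default_rng(1)
bits, scale, quantised = encode_frame(rng.standard_normal(384))
print(bits)  # more bits allocated where the toy model demands more precision
```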
MPEG-1 Audio coding: main building blocks of decoder
The decoder interprets the bitstream, restores the quantised spectral components of the signal, and reconstructs the time-domain representation of the audio signal by frequency-to-time mapping.
Source: Bosi & Goldberg (2002)

MPEG-1 Audio: coding options of Layers I and II
In both Layers I and II, T-F mapping is performed by a 32-band PQMF, whose subband samples are scaled and quantised with a uniform midtread quantiser. The quantiser precision is determined by the output of the psychoacoustic model, based on a 512-point (Layer I) or 1024-point (Layer II) FFT analysis. To reduce the data rate, group coding of consecutive quantised samples is applied in Layer II.
Source: Bosi & Goldberg (2002)

MPEG-1 Audio: coding options of Layer III
In Layer III, the output of the PQMF is fed to an MDCT stage. The filter bank is adaptive, rather than static (as in Layers I and II), and its output is scaled and quantised with a non-uniform midtread quantiser. Noiseless coding, such as Huffman coding, is also employed. Side information includes bit allocation and control parameters.
Source: Bosi & Goldberg (2002)

Time-frequency mapping in Layer III: analysis filterbank
After the 32-band PQMF filter, the subband samples are overlapped by 50%, multiplied by a sine window, and then processed by an MDCT. The MDCT outputs are multiplied by coefficients that reduce the aliasing effects caused by the PQMF's overlapping bands.
Source: Bosi & Goldberg (2002)

Time-frequency mapping in Layer III: synthesis filterbank
In the decoder, the inverse aliasing-reduction process is applied prior to the IMDCT (inverse MDCT). Without aliasing reduction, a pure sine wave passing through the PQMF/MDCT filterbank can present a spurious component as high as -12 dB with respect to the original signal.
Source: Bosi & Goldberg (2002)

Time-frequency mapping in Layer III: block switching
The block size processed by the Layer III filter bank is 32 * 36 = 1152 time samples, which leads to a frequency resolution of about 41.66 Hz at a 48 kHz sampling rate, and hence is good for performing bit allocation based on the psychoacoustic model. However, for transient signals, such a long block size can result in unmasked temporal noise, such as pre-echo. Hence, a shorter block size of 32 * 12 = 384 time samples is used to improve the time resolution, thereby reducing the temporal spreading of quantisation noise around sharp attacks. Two transition blocks, long-to-short and short-to-long, having the same size as the long block, are also employed.

Time-frequency mapping in Layer III: window sequence in block switching
Source: Bosi & Goldberg (2002)

Time-frequency mapping in Layer III: window sequence in block switching (cont.)
The mixed block mode ensures high frequency resolution at low frequencies and high time resolution at high frequencies.
Source: Bosi & Goldberg (2002)

Time-frequency mapping in Layer III as compared with that in Layers I & II
The hybrid filter bank in Layer III has advantages such as high frequency resolution, a dynamic, adaptive trade-off between time and frequency resolution, and full compatibility with Layers I & II. Its disadvantages include potential aliasing effects introduced by the MDCT and long impulse-response filters. The complexity of the Layer III filter bank is also increased with respect to that of Layers I & II.
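As a quick arithmetic check of the block-switching figures above (assuming the usual Layer III arrangement of 18 MDCT lines per subband in a long block, i.e. 576 spectral lines in total):

$$\Delta f = \frac{f_s/2}{32 \times 18} = \frac{24\,000\ \text{Hz}}{576} \approx 41.7\ \text{Hz},
\qquad
\Delta t_{\text{long}} = \frac{1152}{48\,000\ \text{Hz}} = 24\ \text{ms},
\qquad
\Delta t_{\text{short}} = \frac{384}{48\,000\ \text{Hz}} = 8\ \text{ms}$$

so switching to short blocks trades a threefold loss in frequency resolution for a threefold gain in time resolution around transients.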
Psychoacoustic models in MPEG audio
Source: Bosi & Goldberg (2002)

References
Marina Bosi and Richard E. Goldberg, “Introduction to Digital Audio Coding and Standards”, Springer, 2002.
Ahmet Kondoz, “Digital Speech Coding for Low Bit Rate Communication Systems”, Wiley, 2001.