Media Processing – Audio Part
Dr Wenwu Wang
Centre for Vision, Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Approximate outline
Week 6: Fundamentals of audio
Week 7: Audio acquisition, recording, and standards
Week 8: Audio processing, coding, and standards
Week 9: Audio perception and audio quality assessment
Week 10: Audio production and reproduction

Speech codec, audio coding quality evaluation, and audio perception
Concepts and topics to be covered:
Speech coding: waveform coders, vocoders, and hybrid coders; frequency-domain and time-domain coders
Audio file formats: digital container formats
Audio quality measurement: subjective assessment (listening tests); objective assessment (perceptual objective measurement)
Objective perceptual measurements: masked threshold, internal representation, PEAQ, PESQ
Audio perception: loudness perception, pitch perception, space perception, timbre perception

Speech coding strategies
Speech coding schemes can be broadly divided into three main categories: vocoders, waveform coders, and hybrid coders. The aim is to analyse the signal, remove the redundancies, and efficiently code the non-redundant parts of the signal in a perceptually acceptable manner.
SBC = subband coding, ATC = adaptive transform coding, MBE = multiband excitation, APC = adaptive predictive coding, RELP = residual-excited linear predictive coding (LPC), MPLPC = multi-pulse LPC, CELP = code-excited LPC, SELP = self-excited LPC.
Source: Kondoz, 2001

Waveform coders
Such coders attempt to preserve the general shape of the signal waveform, so they are not speech specific. They generally operate on a sample-by-sample basis. Their performance is usually measured by SNR, since quantisation is the major source of distortion. They usually operate above 16 kb/s. For example, the first speech coding standard, PCM, operates at 64 kb/s, and a later standard, adaptive differential PCM (ADPCM), operates at 32 kb/s.

Voice coders (vocoders)
A vocoder consists of an analyser and a synthesiser. The analyser extracts from the original speech a set of parameters of a speech production model, which are then transmitted. The synthesiser then reconstructs the speech from the transmitted parameters. The synthesised speech is often crude: vocoders are very speech specific and make no attempt to preserve the speech waveform. Vocoders typically operate at rates below 4.8 kb/s. Their quality is usually measured subjectively, using the mean opinion score (MOS) test or the diagnostic acceptability measure (DAM), which covers the perceptual quality of both the signal and the background (e.g. intelligibility, pleasantness, and overall acceptability). Such coders are mainly targeted at non-commercial applications, e.g. secure military systems.

Hybrid coders
The hybrid schemes attempt to combine the advantages of waveform coders and vocoders. They can be broadly categorised into frequency-domain and time-domain methods. The basic idea of frequency-domain coding is to divide the speech spectrum into frequency bands or components using a filter bank or a block transform analysis. After encoding and decoding, these components are used to resynthesise the input waveform by either filter bank summation or an inverse block transform. Time-domain coding is usually motivated by linear prediction: the statistical characteristics of speech signals can be modelled very accurately by a source-filter model, which assumes that speech is produced by filtering an excitation signal with a linear, time-varying filter. For voiced speech the excitation signal is a periodic impulse train; for unvoiced speech it is a random noise signal.
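To make the source-filter idea concrete, the following is a minimal linear-prediction analysis/synthesis sketch in Python. It is only a sketch: the synthetic "voiced" test signal, the 30 ms frame, and the order-10 predictor are illustrative assumptions, not part of any particular coding standard.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients by the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Solve the normal equations R a = r (Levinson-Durbin would be faster).
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - a1 z^-1 - ... - ap z^-p

fs = 8000
excitation = np.zeros(fs)
excitation[::80] = 1.0                                 # 100 Hz impulse train ("voiced" source)
speech = lfilter([1.0], [1.0, -1.3, 0.9], excitation)  # fixed resonator as the "vocal tract"

a = lpc_coefficients(speech[:240])             # analyse one 30 ms frame
residual = lfilter(a, [1.0], speech[:240])     # inverse filtering recovers the excitation
resynth = lfilter([1.0], a, residual)          # synthesis filter rebuilds the frame
print(np.max(np.abs(resynth - speech[:240])))  # ~0: analysis/synthesis is consistent
```

In a hybrid coder such as CELP, it is the quantised excitation together with the predictor parameters, rather than the waveform itself, that is transmitted.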
Hybrid coders (cont)
An example of a frequency-domain hybrid coder: a typical sub-band coder (broad-band analysis). Source: Kondoz, 2001

Hybrid coders (cont)
An example of a frequency-domain hybrid coder: an adaptive transform coder (narrow-band analysis), in which a different bit depth can be applied to each sub-band. Source: Kondoz, 2001

Hybrid coders (cont)
An example of a time-domain hybrid coder: an adaptive predictive coder. Source: Kondoz, 2001

Quality of speech coding schemes
(Figure: Source: Kondoz, 2001)

Digital speech coding standards

Difference between audio codecs and audio file formats
A codec is an algorithm that performs the encoding and decoding of the raw audio data. The audio data itself is usually stored in a file with a specific audio file format. There are three major kinds of file format:
Uncompressed audio formats, such as WAV, AIFF, AU, or raw PCM.
Lossless compressed audio formats, such as FLAC, WavPack, Apple Lossless, MPEG-4 ALS, Windows Media Audio (WMA) Lossless.
Lossy compressed audio formats, such as MP3, Ogg Vorbis, AAC, WMA Lossy.

Difference between audio codecs and audio file formats (cont)
Most audio file formats support only one type of audio data (created with one audio coder); however, there are multimedia digital container formats (such as AVI) that may contain multiple types of audio and video data. A digital container format is a meta-file format in which different types of data elements and metadata coexist in a single computer file. Formats exclusive to audio include, e.g., WAV and XMF. Formats that can contain multiple types of data include, e.g., Ogg and MP4.

Coding dilemma
Practical audio codec design is always a trade-off between two important factors:
Data rate and system complexity limitations
Audio quality

Objective quality measurement of coded audio
Traditional objective measures: the quality of audio is measured using objective performance indices in which psychoacoustic effects are ignored, e.g.:
Signal-to-noise ratio (SNR)
Total block distortion (TBD)
Perceptual objective measures: the quality of audio is predicted using a specific model of hearing.

Subjective quality measurement of coded audio
Human listening tests: when a highly accurate assessment is needed, formal listening tests are required to judge the perceptual quality of audio.

Experiment of the "13 dB miracle"
J. Johnston and K. Brandenburg, then at Bell Labs, presented two signals with the same SNR of 13 dB: one had white noise added, while the other had injected noise that was perceptually shaped (so that the distortion was partially or completely masked by the signal components). Despite the identical SNR, the perceived quality was very different: the latter was judged to be a good-quality signal and the former to be rather annoying. (A numerical sketch of this effect follows below.)

Factors to consider in assessing audio coder quality
Audio material: different material stresses different aspects of a coder. For example, transient sounds can be used to test a coder's ability to code transient signals.
Data rate: decreasing the data rate is likely to reduce the quality of a codec, so it is only meaningful to compare the quality of audio codecs with the data rate taken into account.
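As a numerical sketch of the "13 dB miracle" above, the Python fragment below builds two degraded versions of a signal with exactly the same 13 dB SNR: one with white noise and one with spectrally shaped noise of equal power. The SNR cannot tell them apart, even though their audibility can differ greatly. The 440 Hz tone, the first-order shaping filter, and the levels are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def snr_db(clean, degraded):
    """Global SNR in dB between a reference signal and a degraded version of it."""
    noise = degraded - clean
    return 10 * np.log10(np.sum(clean**2) / np.sum(noise**2))

fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)

rng = np.random.default_rng(0)
white = rng.standard_normal(fs)
white *= np.sqrt(np.sum(signal**2) / np.sum(white**2) / 10**(13 / 10))  # force 13 dB SNR

shaped = lfilter([1.0], [1.0, -0.95], white)             # crude spectral shaping
shaped *= np.sqrt(np.sum(white**2) / np.sum(shaped**2))  # rescale to the same noise power

print(snr_db(signal, signal + white))   # 13.0 dB
print(snr_db(signal, signal + shaped))  # also 13.0 dB, yet it can sound quite different
```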
Impairments versus transparency
The audio quality of a coding system can be assessed in terms of impairment: the perceived difference between the output of a system under test and a known reference signal. The coding system under test is said to be transparent when even listeners who are expert at identifying impairments cannot distinguish between the reference and the test signals. To determine whether, or how nearly, a coding system is transparent, we can present both the test and reference signals to listeners in random order and ask them to pick out the test signal. If the listeners are wrong roughly 50% of the time, the system is transparent. (A simple statistical reading of this criterion is sketched after the test method below.)

Coding margin
The coding margin is a measure of how far a coder is from the onset of audible impairments. It can be estimated using listening tests in which the data rate of the coder is gradually reduced up to the point at which listeners, with the reference signals also present, begin to detect the test signal with statistically significant accuracy. In practice, people are interested in the degree of impairment when the coder is below the region of transparency. In most cases a coder in or near the transparent region (i.e. with very small impairments) is preferred. The well-known five-grade impairment scale and the formal listening test process discussed below were designed by the ITU-R for such situations.

Listening tests for audio codec quality assessment
Main features of a listening test for coded audio with small impairments (described in more detail in the standard [ITU-R BS.1116]):
Five-grade impairment scale
Test method
Training and grading
Expert listeners and critical material
Listening conditions
Data analysis of listening results

Five-grade impairment scale
According to the standard ITU-R BS.562-3, any perceived difference between the reference signal and the output of the system under test should be interpreted as a perceptual impairment, measured on a discrete five-grade scale. Source: Bosi & Goldberg, 2002

Five-grade impairment scale (cont)
(Table: correspondence between the five-grade impairment scale and the five-grade quality scale. Source: Bosi & Goldberg, 2002)

Five-grade impairment scale (cont)
For convenience of data analysis, the subjective difference grade (SDG) is usually used. The SDG is the difference between the listener's grade for the coded signal and that for the reference signal, i.e. SDG = grade of coded signal - grade of reference signal. The SDG is negative when the listener successfully distinguishes the reference from the coded signal, and positive when the listener erroneously identifies the coded signal as the reference. Source: Bosi & Goldberg, 2002

Test method
The most widely accepted method for testing coding systems with small impairments is the so-called "double-blind, triple-stimulus with hidden reference" method.
Triple stimulus: the listener is presented with three signals: the reference signal and the test signals A and B. One of the test signals is identical to the reference signal.
Double blind: neither the listener nor the test administrator knows beforehand which test signal is which. Test signals A and B are assigned randomly by an entity other than the test administrator.
Hidden reference: the hidden reference (the test signal that is identical to the reference signal) provides an easy means of checking that the listener is not consistently making mistakes.
This test method has been employed worldwide, and it provides a very sensitive, accurate, and stable way of assessing small impairments in coded audio.
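The "roughly 50%" transparency criterion above can be made precise with a binomial test on the listeners' responses. A minimal sketch (the trial counts are invented for illustration):

```python
from scipy.stats import binomtest

# 100 double-blind trials; the coded signal was identified correctly 54 times.
result = binomtest(54, n=100, p=0.5, alternative="greater")
print(result.pvalue)  # a large p-value (here ~0.24) gives no evidence that the
                      # impairments are audible, i.e. consistent with transparency
```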
Training and grading
A listening test usually consists of two phases: a training phase and a formal grading phase.
Training phase: carried out prior to the formal grading phase, this allows the listening panel to become familiar with the test environment, the grading process, and the codec impairments. It can substantially reduce the effect of so-called informational masking, the phenomenon whereby the threshold of a complex maskee masked by a complex masker can decrease by on the order of 40 dB after training. Note that a small unfamiliar distortion is much more difficult to assess than a small familiar one.
Grading phase: in this phase, the listener is presented with a grading sheet; see the example below, used in the development of the MPEG AAC coder.

Training and grading (cont)
(Figure: example of a grading sheet from a listening test. Source: Bosi & Goldberg, 2002)

Expert listeners and critical material
An expert listener is one who has recent and extensive experience of assessing impairments of the type being studied in the test. The expert listening panel is typically selected using pre-screening (e.g. an audiometric test) and post-screening (e.g. determining whether the listener can consistently identify the hidden reference) procedures.
Critical material should be sought for each codec to be tested, even though it is impossible to compile a complete list of difficult material for perceptual audio codecs. Such material may include synthetic signals that deliberately break the system under test, or any potential broadcast material that stresses the coding system under test.

Listening conditions
The listening conditions and the equipment need to be specified precisely so that others can reliably reproduce the test. The listening conditions include the characteristics of the listening room (such as its geometry, reverberation time, early reflections, background noise, etc.), the characteristics and arrangement of the loudspeakers in the room, and the reference listening area. (See the multichannel loudspeaker configuration in [ITU-R BS.1116].) Source: Bosi & Goldberg, 2002

Data analysis
The ANOVA (analysis of variance) method is most commonly used for analysing the test results, and the SDG is an appropriate basis for a detailed statistical analysis (a small worked sketch follows after the MUSHRA description below). The resolution achieved by the listening test is reflected in the confidence interval, which contains the SDG values with a specified degree of confidence 1 - α, where α is the probability that inaudible differences are labelled as audible. The figure below shows an example of formal listening test results from [ISO/IEC MPEG N1420]. Source: Bosi & Goldberg, 2002

MUSHRA method
MUSHRA (Multiple Stimulus with Hidden Reference and Anchors) is recommended in [ITU-R BS.1534] to provide guidelines for the assessment of audio systems of intermediate quality, i.e. for ranking systems in the region far from transparency. In this case, the seven-grade comparison scale is recommended. The anchor(s), typically a low-pass-filtered version of the reference signal, are meant as an aid in weighting the relative annoyance of the various artefacts. Source: Bosi & Goldberg, 2002
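The worked sketch promised above: given per-listener grades on the five-grade scale (the numbers below are invented for illustration), the SDGs and a 95% confidence interval for their mean can be computed as follows.

```python
import numpy as np
from scipy import stats

# Hypothetical grades from 12 listeners on the five-grade impairment scale.
ref_grades   = np.array([5.0, 4.8, 5.0, 4.9, 5.0, 5.0, 4.7, 5.0, 4.9, 5.0, 5.0, 4.8])
coded_grades = np.array([4.6, 4.5, 4.9, 4.2, 4.8, 4.4, 4.3, 4.7, 4.5, 4.6, 4.9, 4.1])

sdg = coded_grades - ref_grades   # SDG = grade of coded signal - grade of reference
mean = sdg.mean()
# Half-width of the 95% confidence interval (alpha = 0.05), Student's t distribution.
half = stats.t.ppf(0.975, len(sdg) - 1) * sdg.std(ddof=1) / np.sqrt(len(sdg))
print(f"mean SDG = {mean:.2f}, 95% CI = [{mean - half:.2f}, {mean + half:.2f}]")
```

If the confidence interval excludes zero, the impairment is audible at the chosen significance level; an interval that straddles zero is consistent with transparency.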
Advantages and problems of formal subjective listening tests
Advantage:
Good reliability
Disadvantages:
High cost
Time consuming

Objective perceptual measurements of audio quality
Aim: to predict the basic audio quality using objective measurements based on psychoacoustic principles.
PEAQ (Perceptual Evaluation of Audio Quality), adopted in [ITU-R BS.1387], is based on a refinement of generally accepted psychoacoustic models, together with new cognitive components accounting for the higher-level processes involved in the judgement of audio quality.

Two basic approaches used in objective perceptual measurements
The masked threshold method (based on estimating the masked threshold using an accurate model of masking)
The internal representation method (based on estimating the excitation patterns produced in the cochlea of the human ear)
(Figure: the masked threshold method and the internal representation method. Source: Bosi & Goldberg, 2002)

PEAQ
PEAQ has two versions: basic (using only the DFT) and advanced (using both the DFT and a filter bank). The basic version is fast and suitable for real-time applications, while the advanced version is computationally more expensive but provides more accurate results. In the advanced version, the peripheral ear is modelled both through a DFT and through a bank of forty pairs of linear-phase filters whose centre frequencies and bandwidths correspond to those of the auditory filters. The model output values (MOVs) are based partly on the masked threshold method and partly on the internal representation method. The MOVs include the partial loudness of linear and nonlinear distortions, noise-to-mask ratios, alterations of temporal envelopes, harmonic errors, the probability of error detection, and the proportion of signal frames containing audible distortions. The selected MOVs are mapped to an objective difference grade (ODG) via an artificial neural network. The ODG is a prediction of the SDG; the correlation between SDG and ODG has proved to be very good, with no statistically significant difference between them.

PEAQ (cont)
(Figure: psychoacoustic model of the advanced version of PEAQ. Source: Bosi & Goldberg, 2002)

Coding artifacts
Pre-echo: for sharp transient signals, pre-echo is caused by the spreading of quantisation noise into a time region where it is not masked. It can be reduced by block switching.
Aliasing: may occur when PQMF or MDCT filter banks are combined with coarse quantisation, but it is not a problem under normal conditions.
Birdies: may occur at low data rates, when the bit allocation changes from block to block in the highest frequency bands, causing some spectral coefficients to appear and disappear.
Reverberation: may occur when a large block size is employed for the filter bank at low data rates.
Multichannel artefacts: loss of, or shifts in, the stereo image can introduce artefacts, which are related to binaural masking.

PESQ
PESQ (Perceptual Evaluation of Speech Quality), described in [ITU-T Rec. P.862] (approved in 2001), is a family of algorithms for the objective measurement of speech quality that predicts the results of subjective listening tests on telephony systems. PESQ uses a sensory model to compare the original, unprocessed signal with the degraded signal from the network or network element. The resulting quality score is analogous to the subjective mean opinion score (MOS) measured in listening tests according to ITU-T P.800. PESQ takes into account coding distortions, errors, packet loss, fixed and variable delay, and filtering in analogue network components. Its user interfaces are designed to provide simple access to this powerful algorithm, either directly from an analogue connection or from speech files recorded elsewhere. (A minimal usage sketch follows below.)
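A minimal usage sketch, assuming the third-party `pesq` package from PyPI (a wrapper around the ITU-T reference implementation); `ref.wav` and `deg.wav` are hypothetical 16 kHz mono speech files holding the reference and the coded/transmitted version of the same utterance:

```python
from scipy.io import wavfile
from pesq import pesq  # pip install pesq (third-party wrapper of the P.862 code)

fs, reference = wavfile.read("ref.wav")  # clean reference speech (16 kHz mono assumed)
_, degraded = wavfile.read("deg.wav")    # the same utterance after coding/transmission

# 'wb' selects the wideband mode (P.862.2); the result is a MOS-like score.
print(pesq(fs, reference, degraded, 'wb'))
```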
Audio Perception
Loudness perception
Pitch perception
Space perception
Timbre perception

Inner Ear Function
The inner ear consists of the cochlea, which has a snail-like structure.
o It transforms mechanical vibrations into movement of the basilar membrane, which is then converted into nerve firings by the organ of Corti (consisting of a number of hair cells).
o The basilar membrane carries out a frequency analysis of input sounds: it responds best to high frequencies at the (narrow and thin) base end, and to low frequencies at the (wide and thick) apex end.

Inner Ear Function
(Figure: (a) the spiral nature of the cochlea; (b) the cochlea unrolled; (c) vertical cross-section through the cochlea; (d) detailed view of the cochlea tube. From: Howard & Angus, 1996)

Loudness Perception
The ear's sensitivity to sounds of different frequencies varies over a wide range of sound pressure level (SPL). The minimum SPL that the human hearing system can detect, around 4 kHz, is approximately $10^{-5}$ Pa, while the maximum SPL (i.e. the threshold of pain) is 20 Pa. For convenience, in practice SPL is usually expressed in decibels (dB) relative to the reference pressure $P_r = 2 \times 10^{-5}$ Pa:
$$\mathrm{dB(SPL)} = 20 \log_{10}\left(\frac{P_m}{P_r}\right)$$
where $P_m$ is the measured sound pressure. For example, the threshold of hearing at 1 kHz is $P_m = 2 \times 10^{-5}$ Pa, which in dB equals
$$20 \log_{10}\frac{2 \times 10^{-5}}{2 \times 10^{-5}} = 0\ \mathrm{dB(SPL)},$$
while the threshold of pain is 20 Pa, which in dB equals
$$20 \log_{10}\frac{20}{2 \times 10^{-5}} = 120\ \mathrm{dB(SPL)}.$$

Loudness Perception (cont.)
The perceived loudness of an acoustic sound is related to its amplitude (though not by a simple one-to-one relationship), as well as to the context and nature of the sound. Because the sensitivity of our hearing system varies with frequency, a sound with a larger pressure amplitude can be heard as quieter than a sound with a lower pressure amplitude (for example, if they are at different frequencies). [Recall the equal-loudness contours of the human auditory system shown in the first lecture.]

Demos for Loudness Perception
Resources: Audio Box CD from Univ. of Victoria
Decibels vs loudness:
Starting with a 440 Hz tone (i.e. note A4), the level is reduced by 1 dB at each step
Starting with a 440 Hz tone (i.e. note A4), the level is reduced by 3 dB at each step
Starting with a 440 Hz tone (i.e. note A4), the level is reduced by 5 dB at each step
Intensity vs loudness:
Various frequencies played at a constant SPL
A reference tone is played and then the same tone is played 5 dB higher; followed by the reference tone and then the tone 8 dB higher; and finally the reference tone and then the tone 10 dB higher

Pitch Perception
What is pitch? Pitch
• is "the attribute of auditory sensation in terms of which sounds may be ordered on a musical scale extending from low to high" (American Standard Association, 1960);
• is a "subjective" attribute and cannot be measured directly. A specific pitch value is therefore usually given as the frequency of a pure tone that has the same subjective pitch as the sound. In other words, the measurement of pitch requires a human listener (the "subject") to make a perceptual judgement. This is in contrast to the laboratory measurement of, for example, the fundamental frequency of a complex tone, which is an "objective" measurement (Howard & Angus, 1996);
• is related to the repetition rate of the waveform of a sound; it therefore corresponds to the frequency of a pure tone and to the fundamental frequency of a complex tone. In general, sounds having a periodic acoustic pressure variation with time are perceived as pitched sounds, while sounds with non-periodic acoustic pressure waveforms are perceived as non-pitched (Howard & Angus, 1996). (An objective fundamental-frequency estimate is sketched below.)
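Since pitch corresponds closely to the repetition rate of the waveform, an objective estimate of the fundamental frequency can be read off the autocorrelation of the signal; pitch itself, being subjective, still needs a listener. A minimal sketch (the complex test tone and the 50 to 500 Hz search range are illustrative choices):

```python
import numpy as np

def f0_autocorr(x, fs, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency from the first autocorrelation peak."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # autocorrelation at lags 0, 1, ...
    lo, hi = int(fs / fmax), int(fs / fmin)           # lag range for plausible pitches
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

fs = 8000
t = np.arange(fs // 2) / fs   # half a second of signal
# A complex tone: 220 Hz fundamental plus two weaker harmonics.
x = np.sin(2*np.pi*220*t) + 0.5*np.sin(2*np.pi*440*t) + 0.25*np.sin(2*np.pi*660*t)
print(f0_autocorr(x, fs))     # ~220 Hz
```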
Existing Pitch Perception Theories
'Place' theory: a spectral analysis is performed on the stimulus in the inner ear; different frequency components of the input sound excite different places, or positions, along the basilar membrane, and hence neurones with different centre frequencies.
'Temporal' theory: pitch corresponds to the time pattern of the neural impulses evoked by the stimulus. Nerve firings tend to occur at a particular phase of the stimulating waveform, and thus the intervals between successive neural impulses approximate integer multiples of the period of the stimulating waveform.
'Contemporary' theory: neither theory alone is perfect for explaining the mechanism of human pitch perception; a combination of both benefits the analysis of pitch perception (Moore, 1982).

Demos for Pitch Perception
Resources: Audio Box CD from Univ. of Victoria
These three demos show how pitch is perceived for signals of different durations. In each track, bursts of sound of different durations are played; a different pitch is used in each of the three tracks.

Space Perception
Sound localisation refers to judgements of the direction and distance of a sound source, usually achieved through the use of two ears (binaural hearing), and based on:
interaural time difference
interaural intensity difference
Although binaural hearing is crucial for sound localisation, monaural perception is similarly effective in some cases, such as the detection of signals in quiet, intensity discrimination, and frequency discrimination.

Interaural Time Difference (ITD)
(Figure: Howard & Angus, 1996)
The ITD is given by
$$\Delta t = \frac{r(\theta + \sin\theta)}{c}$$
where $\Delta t$ is the ITD (in s), $r$ is half the distance between the ears (in m), $\theta$ is the angle of arrival of the sound from the median (in radians), and $c$ is the speed of sound (in m/s). From this equation it can be shown that the maximum ITD occurs at 90 degrees and, for the average head diameter, is approximately $6.73 \times 10^{-4}$ s, i.e. 673 µs. (A numerical sketch of these quantities follows after the IID discussion below.)

ITD and IPD
The ear appears to use the interaural phase difference (IPD) caused by the ITD between the two ear signals to resolve the sound direction. The phase difference is given by
$$\Delta\phi = \frac{2\pi f r(\theta + \sin\theta)}{c}$$
where $\Delta\phi$ is the phase difference between the two ears (in radians), $f$ is the frequency (in Hz), $r$ is half the distance between the ears (in m), $\theta$ is the angle of arrival of the sound from the median (in radians), and $c$ is the speed of sound (in m/s). When the phase difference is greater than 180 degrees, there is an unresolvable ambiguity in the sound direction, as the angle could equally be to the left or to the right.

Interaural Intensity Difference (IID)
Owing to the shading effect of the head, the intensity of the sound reaching each ear also differs. This difference is called the interaural intensity difference (IID). When the sound source is on the median plane, the sound level at each ear is equal; as the source moves away from the median plane, the level at one ear progressively reduces while the level at the other increases. The shading effect of the head is difficult to calculate; however, experiments seem to show that the intensity ratio between the two ears varies sinusoidally from 0 dB up to 20 dB with the sound direction angle, for various frequencies. The shading effect is not significant unless the head is about one third of a wavelength in size or larger. For a head with a diameter of 18 cm, this corresponds to a minimum frequency (Howard & Angus, 1996) of
$$f_{\min} = \frac{1}{3}\frac{c}{d} = \frac{1}{3} \times \frac{344\ \mathrm{m/s}}{0.18\ \mathrm{m}} \approx 637\ \mathrm{Hz}.$$
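A small numerical sketch of the ITD and IPD formulas above, in Python; the value $r = 0.09$ m for half the inter-ear distance is an assumed average, and 344 m/s is used for the speed of sound as in the $f_{\min}$ example:

```python
import numpy as np

C = 344.0  # speed of sound (m/s)
R = 0.09   # half the distance between the ears (m), an assumed average

def itd(theta):
    """Interaural time difference (s) for arrival angle theta (rad from the median)."""
    return R * (theta + np.sin(theta)) / C

def ipd(theta, f):
    """Interaural phase difference (rad) at frequency f (Hz)."""
    return 2 * np.pi * f * itd(theta)

print(itd(np.pi / 2) * 1e6)         # ~673 microseconds: the maximum ITD, at 90 degrees
print(ipd(np.pi / 2, 743) / np.pi)  # ~1.0: at ~743 Hz the IPD reaches 180 degrees,
                                    # beyond which the direction becomes ambiguous
```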
Shading Effect in IID
(Figure: Howard & Angus, 1996)

Timbre Perception
According to the American Standard Association, timbre is "that attribute of auditory sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar". Musically, it is "the quality of a musical note which distinguishes different types of musical instruments". It can also be defined as "everything that is not loudness, pitch, or spatial perception":
• Loudness <-> amplitude (frequency dependent)
• Pitch <-> fundamental frequency
• Spatial perception <-> IID, IPD
• Timbre <-> ???

Physical Parameters
Timbre relates to:
• the static spectrum (e.g. the harmonic content of the spectrum)
• the envelope of the spectrum (e.g. the peaks in the LPC spectrum, which correspond to formants)
• the dynamic (time-evolving) spectrum
• phase
• …

Static Spectrum
(Figure)

Spectrum Envelope
The spectral envelopes of the flute (upper figure) and the piano (lower figure) suggest that the envelope differs between musical instruments.

Dynamic Spectrum
(Figure: the spectral envelope of a trumpet sound evolving over time)

Demos for Timbre Perception
Resources: Audio Box CD from Univ. of Victoria
Examples of differences in timbre

A Music Demo for Auditory Transduction
The perceptual mechanism of the auditory system: http://www.youtube.com/watch?v=46aNGGNPm7s

References
M. Bosi and R.E. Goldberg, "Introduction to Digital Audio Coding and Standards", Springer, 2002.
A. Kondoz, "Digital Speech Coding for Low Bit Rate Communication Systems", Wiley, 2001.
D.M. Howard and J.A.S. Angus, "Acoustics and Psychoacoustics" (4th edition), Focal Press, 2009.