Speech and Audio Processing and Coding (cont.)

Dr Wenwu Wang
Centre for Vision, Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Timbre Perception
(Ack. S. Zielinski)

What is Timbre?
According to the American Standards Association, timbre is “that attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar”. Musically, it is “the quality of a musical note which distinguishes different types of musical instruments.” It can also be defined as “everything that is not loudness, pitch or spatial perception”.
• Loudness <-> Amplitude (frequency dependent)
• Pitch <-> Fundamental frequency
• Spatial perception <-> IID, IPD (interaural intensity and phase differences)
• Timbre <-> ???

Physical Parameters
Timbre relates to:
• Static spectrum (e.g. harmonic content of the spectrum)
• Envelope of the spectrum (e.g. the peaks in the LPC spectrum, which correspond to formants)
• Dynamic spectrum (time evolving)
• Phase
• …

Static Spectrum

Spectrum Envelope
Formants affect the sensation of timbre.

Spectrum Envelope (cont)
Formants determine not only timbre but also the recognition of vowels.

Spectrum Envelope (cont)
This figure shows what the spectral envelope of a trumpet sound looks like.

Spectrum Envelope (cont)
The spectral envelopes of the flute (upper figure) and the piano (lower figure) show that the envelope differs from one musical instrument to another.

Dynamic Spectrum
This figure shows what the spectral envelope of a trumpet sound looks like.

Phase
The two magnitude spectra above are identical, while the corresponding waveforms are very different. The timbres of the two sounds are almost identical, so phase affects timbre only to a very small extent. This also suggests that human hearing is largely insensitive to phase differences.

Demos for Timbre Perception
Resources: Audio Box CD from Univ. of Victoria
Examples of differences in timbre.

Auditory Masking

What is masking?
Masking: one sound is made inaudible by another.
• Simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) that is present at the same time. It is also known as frequency masking or spectral masking: two sounds that share the same frequency band, such as tones at 440 Hz and 450 Hz, can be perceived clearly when presented separately but not when presented simultaneously.
• Non-simultaneous masking refers to the situation where one sound (the signal) is made inaudible by another sound (the masker) that precedes or follows the signal. In other words, the two sounds are not present at the same time.

What is masking? (cont)

Simultaneous Masking
• On-frequency masking: the masker and the signal are within the same auditory filter band, and the louder sound masks the quieter one.
• Off-frequency masking: the masker and the signal are in different frequency bands. The masking effect is weaker than in on-frequency masking.
(Source: figures from Wikipedia, 2010)

Simultaneous Masking (cont)
• To achieve the same masking effect as in on-frequency masking, the masker level needs to be greater in off-frequency masking.
• In off-frequency masking, the amount by which the masker raises the threshold of the signal is much smaller than in on-frequency masking. However, it does have some masking effect on the signal, as shown in the figure above.
(Source: figures from Wikipedia, 2010)

Demos for Simultaneous Masking (Frequency Domain Masking)
Resources: Audio Box CD from Univ. of Victoria
• A single tone is played, followed by the same tone and a higher-frequency tone. The higher-frequency tone is reduced in intensity first by 12 dB, then in steps of 5 dB. The sequence is repeated twice. The second time, the frequency separation between the tones is increased.
• Pure tones mask higher frequencies better than lower frequencies. This demo tries to mask high frequencies.
• Pure tones mask higher frequencies better than lower frequencies. This demo tries to mask low frequencies.
• This demo shows that a tone of greater intensity masks a broader range of tones than a tone of lower intensity. A single tone is played, followed by the same tone and a higher-frequency tone. The higher-frequency tone is reduced in intensity first by 10 dB, then in steps of 3 dB. The sequence is repeated twice, the second time with the intensity of the single tone increased by 28 dB.

The Amount of Masking
In the example above, the amount of masking is 16 dB, which is the difference between the masked threshold and the unmasked threshold. Note that the threshold for a masked signal is raised compared with the threshold for the same signal when it is not masked (for example, when it is heard in a quiet environment).
(Source: figures from Wikipedia, 2010)

Masking Explains the Frequency Resolution of the Auditory System
Frequency selectivity, also known as frequency resolution, refers to the ability of the human auditory system to separate the different frequency components of a complex sound. Recall the concept of the critical bandwidth: two sounds with sufficiently different frequencies (pitches) can be heard as two separate tones. This is achieved by the filtering process of the cochlea, where the complex sound is band-pass filtered and decomposed into individual frequency components (sinusoids), which are then coded independently in the auditory nerve.
Masking is usually used to quantify and characterise the frequency resolution of the auditory system. The auditory system cannot separate two frequencies if the sound at one frequency is masked by the sound at the other. Masking therefore explains the limits of the frequency resolution of the human auditory system.
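
As a rough illustration of the critical-bandwidth idea recalled above, the following sketch checks whether two tones are likely to fall within one auditory filter. It assumes the Glasberg and Moore approximation for the equivalent rectangular bandwidth, ERB(f) = 24.7(4.37 f/1000 + 1) Hz, which is not given in these slides. The function names and the simple criterion (frequency separation smaller than the ERB at the mean frequency) are hypothetical simplifications, not a model taken from the lecture.

```python
def erb_hz(f_hz: float) -> float:
    """Equivalent rectangular bandwidth (Hz) of the auditory filter at f_hz.

    Assumption: Glasberg and Moore approximation, not taken from the slides.
    """
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def share_auditory_filter(f1_hz: float, f2_hz: float) -> bool:
    """Crude test: are the two tones closer than one ERB at their mean frequency?"""
    centre_hz = 0.5 * (f1_hz + f2_hz)
    return abs(f1_hz - f2_hz) < erb_hz(centre_hz)


if __name__ == "__main__":
    # The 440 Hz / 450 Hz pair is the on-frequency example used in the slides;
    # the 440 Hz / 880 Hz pair is an illustrative off-frequency case.
    for f1, f2 in [(440.0, 450.0), (440.0, 880.0)]:
        same = share_auditory_filter(f1, f2)
        print(f"{f1:.0f} Hz and {f2:.0f} Hz share a filter: {same}")
```

Under these assumptions, the 440 Hz and 450 Hz tones fall within one filter, so strong on-frequency masking is expected, whereas 440 Hz and 880 Hz are resolved into different filters.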
Use Masking to Estimate the Critical Band
The original experiment by Fletcher (1940) measured the threshold for detecting a sinusoidal signal as a function of the bandwidth of a band-pass noise masker.
Conditions:
• The noise was centred at the signal frequency.
• The noise power density was constant.
Findings:
• At first, the threshold increases as the noise bandwidth increases.
• However, it flattens off with further increases in noise bandwidth.
• This is explained by the critical bandwidth: once the noise bandwidth exceeds the bandwidth of the auditory filter, the threshold ceases to increase even though the total noise power keeps increasing.
The power-spectrum model of masking assumes (Moore, 1995):
• The auditory system contains a bank of linear, overlapping band-pass filters.
• To detect a signal, the listener uses one filter with a centre frequency close to that of the signal.
• The signal is masked only by the noise components that pass through that auditory filter.
• The threshold corresponds to a certain signal-to-noise (masker) ratio at the filter output.

Psychophysical Tuning Curves
Psychophysical tuning curves (PTCs) are a method for estimating the shape of the auditory filter. The PTCs above were determined in simultaneous masking, using sinusoidal signals at 10 dB SPL. For each curve, the diamond below it shows the frequency and the level of the signal. The masker was a sinusoid that had a fixed starting phase relationship to the signal. The masker level required for threshold (i.e. the level that just masks the signal) is plotted as a function of masker frequency on a logarithmic scale. The dashed line represents the absolute threshold for the signal. Figure from (Moore, 1995).

Shape of Auditory Filter
The shape of the auditory filter centred at 1 kHz, plotted for input sound levels ranging from 20 to 90 dB SPL/ERB. The output level of the filter is plotted as a function of frequency. On the low-frequency side, the filter becomes progressively less sharply tuned with increasing sound level.
On the high-frequency side, the sharpness of tuning increases slightly with increasing sound level. At moderate sound levels the filter is approximately symmetric on the linear frequency scale used. Figure from (Moore, 1995).

Bark Scale
Proposed in 1961 by Eberhard Zwicker and named after Heinrich Barkhausen, who proposed the first subjective measurement of loudness. The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing. The successive band edges are (in Hz): 20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500.
$$\text{Bark} = 13\arctan(0.00076 f) + 3.5\arctan\left((f/7500)^2\right)$$
where f is the frequency in Hz.

Non-Simultaneous Masking
[Figure: timing diagrams of the masking tone and the masked tone along the time axis, separated by a gap T]
• Forward masking: the masking tone precedes the masked tone. The gap T must be shorter than 20-30 ms.
• Backward masking: the masking tone follows the masked tone. The gap T cannot be more than 10 ms.

Forward Masking
The left figure shows the amount of forward masking of a 2 kHz signal as a function of the time delay between the end of the noise masker and the signal. Each curve represents a different noise level. The results for each spectrum level fall on a straight line when the signal delay is plotted on a logarithmic scale. The right figure shows the same thresholds plotted as a function of the masker level. The slopes of these growth-of-masking functions decrease with increasing signal delay. Figures from (Moore, 1995).

Forward Masking (cont)
Forward masking is greater the closer in time the signal is to the masker. Increments in masker level do not produce equal increments in the amount of forward masking, i.e. the slope of the growth-of-masking function is less than 1. This is in contrast to simultaneous masking, where the slope is close to 1.

PTCs Comparisons
Comparison of the psychophysical tuning curves determined by simultaneous masking (triangles) and forward masking (squares).
The masker frequency is expressed as its deviation from the centre frequency divided by the centre frequency. The unit for the centre frequency is kHz. Figures from (Moore et al., 1984).

Demos for Non-simultaneous Masking (Time Domain Masking)
Resources: Audio Box CD from Univ. of Victoria
• Forward masking: a masking tone is played, followed after a 100 ms delay by a tone one semitone lower. Two tones can be heard even though the second tone is decreased in 3 dB steps.
• Forward masking: a masking tone is played, followed after a 10 ms delay by a tone one semitone lower. Masking occurs in this demo. How many steps are audible before the second tone is masked?
• Backward masking: the initial tone is masked by the one that follows. The time delay is 100 ms.
• Backward masking: the initial tone is masked by the one that follows. The time delay is decreased but is still more than 10 ms.
• Backward masking: the initial tone is masked by the one that follows. The time delay is below 10 ms. Masking occurs. How many steps are audible?

Examples of Modern Audio Formats
• MP3: MPEG-1 or MPEG-2 Audio Layer 3 (or III) is a patented lossy audio codec. It is a common audio format for consumer audio storage, as well as a standard of digital audio compression for the transfer and playback of music on digital audio players.
• Ogg Vorbis: a lossy audio codec developed by the Xiph.Org Foundation (formerly Xiphophorus company). Free and open source.
• AAC: Advanced Audio Coding, an audio compression format specified by MPEG-2 and MPEG-4, and successor to MPEG-1’s “MP3” format.
• WMA: Windows Media Audio, an audio codec developed by Microsoft.
• MPEG-1 Audio Layer II or MPEG-2 Audio Layer II (MP2): a lossy audio compression format defined by ISO/IEC 11172-3 alongside MPEG-1 Audio Layer I and MPEG-1 Audio Layer III (MP3). While MP3 is much more popular for PC and internet applications, MP2 remains a dominant standard for audio broadcasting.
• ATRAC: Adaptive Transform Acoustic Coding is a family of proprietary audio compression algorithms developed by Sony. ATRAC allowed a relatively small disc such as the MiniDisc to have the same running time as a CD while storing audio information with minimal loss in perceptible quality.

Auditory Scene Analysis

Demos for Sequential Organisation
Resources: Audio Box CD from Univ. of Victoria
• In this demo, the sound is perceived as a single stream of notes: C4 G4 F4 B3.
• As the notes are sped up, rhythmic beats played as a melody begin to be heard. The auditory system is now hearing two groups of two notes.
• If the time delay is decreased further, we no longer hear a melody; we hear only the rhythmic beats. The auditory system is now hearing four groups of one note each.

Demo for Speech Segregation
Resources: Audio Box CD from Univ. of Victoria
• This demo begins with the two melodies “Camptown Races” and “Yankee Doodle” played at the same pitch. Each time the interleaved melody is played, one of the songs is shifted in pitch until eventually the two melodies become distinguishable.
• This demo plays the two melodies at the same pitch but with different timbres. The two melodies are distinguishable instantly.
• This demo adjusts the amplitude of the two songs while leaving the pitch constant.

Segregation of a melody from interfering tones
Track 1 in Bregman’s ASA Demonstration

Segregation of a melody from interfering tones
Track 5 in Bregman’s ASA Demonstration

Segregation of high notes from low ones in a sonata by Telemann
Track 6 in Bregman’s ASA Demonstration

Streaming in African xylophone music
Track 7 in Bregman’s ASA Demonstration

Effects of a timbre difference between the two parts in African xylophone music
Track 9 in Bregman’s ASA Demonstration

Stream segregation of vowels and diphthongs
Track 11 in Bregman’s ASA Demonstration

Stream segregation of high and low bands of noise
Track 14 in Bregman’s ASA Demonstration

Apparent Continuity
Track 28 in Bregman’s ASA Demonstration

Perceptual continuation
Track 29 in Bregman’s ASA Demonstration