The oscillation analysis is patented and all rights belong to OrtoSense.

The Dynamic Ear
Frank Uldall Leonhard, PhD
September 2006

Summary
Until now, the accepted acoustic model of human sound perception has been based on spectrum analysis (the Fourier transform), although it leaves some phenomena unexplained. Among others: how can a deep male voice be identified over a telephone when the pitch is below 100 Hz and the telephone has a lower cut-off frequency of 300 Hz, and how are vowels identified when the formants are displaced considerably between small children and grown men? Another acoustic model, based on interference analysis, is therefore proposed. In this model the dynamics of the instantaneous energy are analysed in one domain and the oscillations in another. This paper discusses the basic principles of the model and describes listening experiments which strongly indicate that the model agrees with human auditory perception.

1. Introduction
The human auditory system is a sound and speech analyser with excellent performance. Today's computational models lag far behind, and it would be very useful to find an approach that reflects human perception better. This would lead to a better understanding of hearing impairment and of how to remedy it. It would also improve speech recognition systems, which today are very sensitive and unstable.

The traditional auditory model is based on a cochlear band-pass filter bank. The frequency bands are normally given by the Bark scale or the Mel frequency scale. A concise definition of the Bark scale is provided by Zwicker (1961). Today it is assumed that the basilar membrane in the cochlea performs a frequency-to-place transformation with a high frequency resolution. The spectral representation is then analysed by the auditory nerve system and the brain.

The fundamental facts about auditory perception are based primarily on listening tests in which sinusoids are used.
This is the case for, among others, the definition of the standardised sensitivity curves and the masking effect of frequencies. However, a sinusoid is a special case and not a very common sound in real life. The auditory system originally developed as an alarm system in which pulses in the sound picture are very important. When enemies or wild predators try to sneak up and attack, they may be revealed by accidentally stepping on twigs that break or by dropping something on the ground. The sounds of breaking twigs and of things hitting the ground are pulses. The ear has therefore developed to be very sensitive to pulses.

It is incontestable that the ear has a high frequency resolution, at least in the lower frequency range. Sensitivity to pulses and a high frequency resolution are, however, in conflict. To get a high frequency resolution it is necessary to analyse over a longer period in a narrow-band filter, whereas to analyse a pulse it is necessary to have a high time resolution, so the filter has to be broad-band. The sound must therefore be analysed in more than one process.

A quite different model is therefore proposed, which consists of two analyses: an oscillation analysis and a dynamic energy analysis. Both analyses are based on a band-pass filter bank pre-analysis performed by the basilar membrane. In contrast to the traditional models, however, the filters are broad-band, with a bandwidth of about one octave. The real frequency analysis must be done by the auditory nerve system or by the brain.

The model drawn up in this paper is based primarily on listening tests, but studies of auditory nerve fibres point in the same direction, e.g. as described by Delgutte (1980).

2. Sound signals
The background for sound and speech analysis has been the assumption that the signals can, to a good approximation, be considered steady state for periods of 20-40 ms, and the spectrum calculated for such a period has been the key to auditory perception. However, this assumption does not hold in practice.
Sound and speech signals are dynamic signals in which pulses play an important role. Pulses are typically transient responses: an abrupt force acts on a material and the material responds. This transient response has the shape of a pulse containing damped resonance frequencies. An example is a twig that breaks. The pulse may contain several damped resonance frequencies, and it is necessary to separate them sufficiently. This may be the task of the basilar membrane.

Fig. 1 shows 10 ms of voiced speech filtered in a filter bank. The signal is a female voice pronouncing “a” as in “have”. The figure shows the outputs of one band-pass filter in the low frequency range and one in the high frequency range. The output of the high frequency band-pass filter in particular shows that there are two pulses in the 10 ms signal, with a period of about 5 ms. The pulses are created by the vocal cords exciting the vocal tract with a period of about 5 ms.

3. Dynamic energy analysis
The envelope of the output signal of a band-pass filter expresses the dynamic energy in the frequency band given by that filter. Fig. 2 shows the dynamic energy of the output signal of the high frequency band-pass filter shown in Fig. 1. The dynamic energy is measured by full-wave rectification followed by low-pass filtering. The features of interest are the rise and fall times, the magnitude, and the maximum slope of the leading and trailing edges of the pulses. The periodicity of the dynamic energy is also an important feature.

The definition of an edge is based on the slope of the dynamic energy. The start and the end of the edge are defined relative to the maximum slope: the edge starts where the slope, before the point of maximum slope, is e.g. 20 % of the maximum slope, and it ends where the slope, after the point of maximum slope, has fallen again to e.g. 20 % of the maximum slope.
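As an illustration of the measurement just described, the following Python sketch rectifies a signal, smooths it with a low-pass filter, and locates a leading edge with the 20 % slope rule. The filter type (two cascaded one-pole stages), the 1 kHz cut-off, the test burst, and all function names are assumptions made for this example; the paper does not specify them.

```python
import math

def dynamic_energy(signal, fs, cutoff_hz=1000.0, stages=2):
    """Full-wave rectify, then low-pass with cascaded one-pole filters."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out = [abs(x) for x in signal]          # full-wave rectification
    for _ in range(stages):                 # assumed smoothing filter
        state, smoothed = 0.0, []
        for v in out:
            state += alpha * (v - state)
            smoothed.append(state)
        out = smoothed
    return out

def leading_edge(energy, fraction=0.2):
    """Return (start, steepest, end) indices of the steepest rising edge.

    Index i of the slope refers to the step from sample i to sample i + 1.
    The edge covers the samples around the point of maximum slope where
    the slope stays above `fraction` times that maximum.
    """
    slope = [b - a for a, b in zip(energy, energy[1:])]
    steepest = max(range(len(slope)), key=slope.__getitem__)
    threshold = fraction * slope[steepest]
    start = steepest
    while start > 0 and slope[start - 1] > threshold:
        start -= 1
    end = steepest
    while end < len(slope) - 1 and slope[end + 1] > threshold:
        end += 1
    return start, steepest, end

# Hypothetical test signal: about 5 ms of silence, then a damped 3 kHz burst.
fs = 44_100
onset = 220
burst = [0.0] * onset + [
    math.exp(-n / (0.002 * fs)) * math.sin(2.0 * math.pi * 3000.0 * n / fs)
    for n in range(3 * onset)
]
energy = dynamic_energy(burst, fs)
start, steepest, end = leading_edge(energy)
print("edge starts %.2f ms after onset" % (1000.0 * (start - onset) / fs))
print("rise time %.2f ms" % (1000.0 * (end - start) / fs))
```

The trailing edge can be found in the same way by searching for the most negative slope. A single one-pole stage would leave the carrier ripple directly visible in the slope, which is why two stages are used here.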
The energy of a single sinusoid is constant, but a physical system such as the one described above will not measure a perfectly constant energy level. The rectification and low-pass filtering cause small ripples in the energy level. Fig. 3 shows an example in which the energy of a sinusoid is measured, and small ripples can be seen in the energy.

Two signals will now be compared. Signal1 is a pure sinusoid, and Signal2 is a signal created by means of a high Q filter with one resonance frequency and an impulse train as excitation signal. The output signal will be very close to a sinusoid at the resonance frequency, but the signal will be periodic with the period of the pulse train. An impulse train does not fulfil the Nyquist criterion. To get Signal2 as smooth as possible it is therefore best to choose the frequency and the period of the pulse train so that the sampling frequency is very close to a multiple of both the resonance frequency and the pulse rate. The sampling frequency is 44,100 Hz. The frequency is therefore chosen as 802 Hz and the period of the pulse train as 2.494 ms, which corresponds to 401 Hz. The impulse response of the filter is

$$h(t) = e^{-t/\tau} \sin(2\pi f t) \qquad (1)$$

where $f$ is the resonance frequency and the time constant $\tau$ is equal to 100 ms.

Fig. 4 shows Signal2, and at first glance the signal looks like the pure sinusoid. Listening to the two signals shows, however, that they are different. The sound of a pure sinusoid is in fact annoying and “sterile” to listen to, while Signal2 sounds like an organ pipe with a pitch of 401 Hz.

The energy of Signal2 fluctuates slightly, with ripples. The ripples of the dynamic energy are greatest in the frequency band from 1 to 2 kHz, which is just above the 802 Hz tone. Fig. 5 shows the band-pass filtered Signal2 and the measured dynamic energy in that frequency band. It can be seen from the band-pass filtered signal that the amplitude varies with time, and so does the measured energy. The magnitude of the ripples is normalised to the power of the signals.
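The construction of Signal2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used for the paper: the two-pole recursion is one possible discrete realisation of the impulse response (1), with sampled impulse response r^n sin(w n), and the pulse period is rounded to 110 samples (44,100 / 401 = 109.98). The function names are hypothetical. The sketch shows that the output settles into a tone that repeats with the pulse period while its amplitude decays slightly between one impulse and the next.

```python
import math

FS = 44_100      # sampling frequency from the text, Hz
F_RES = 802.0    # resonance frequency from the text, Hz
TAU = 0.100      # time constant from the text, s
PERIOD = 110     # pulse period in samples (assumed rounding of FS / 401 Hz)

def signal2(n_samples, fs=FS, f_res=F_RES, tau=TAU, period=PERIOD):
    """Impulse train through a resonator with response r**n * sin(w * n)."""
    r = math.exp(-1.0 / (tau * fs))   # decay per sample
    w = 2.0 * math.pi * f_res / fs    # resonance in radians per sample
    a1, a2, b1 = 2.0 * r * math.cos(w), -r * r, r * math.sin(w)
    y1 = y2 = x1 = 0.0                # y[n-1], y[n-2], x[n-1]
    out = []
    for n in range(n_samples):
        y = a1 * y1 + a2 * y2 + b1 * x1
        out.append(y)
        y2, y1 = y1, y
        x1 = 1.0 if n % period == 0 else 0.0   # unit impulse every period
    return out

def decay_within_period(signal, impulse_index, period=PERIOD):
    """Peak amplitude in the second half of a pulse period over the first."""
    seg = signal[impulse_index + 1 : impulse_index + 1 + period]
    half = period // 2
    first = max(abs(v) for v in seg[:half])
    second = max(abs(v) for v in seg[half:])
    return second / first

tone = signal2(FS)   # one second, long enough to reach the steady state
print("second-half / first-half peak amplitude:",
      decay_within_period(tone, 390 * PERIOD))
```

The ratio is slightly below one, which is the small pulse-synchronous fluctuation that separates this tone from a pure sinusoid.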
The ripples, or pulses, of Signal1 are measured at $4.6 \times 10^{-8}$, while the pulses of Signal2 are measured at $4.5 \times 10^{-7}$. The pulses of Signal2 are thus nearly 10 times greater than those of Signal1. The period of the pulses in the dynamic energy also differs between the two signals: in Signal1 the period corresponds to 802 Hz, while in Signal2 it corresponds to 401 Hz. Although the pulses are 10 times greater in Signal2 than in Signal1, they are still small, which gives an impression of how sensitive the ear is to pulses. Signal2 has a small fundamental at 401 Hz in its spectrum, but the great sensitivity to pulses suggests that the ear perceives the pitch of the sound as the periodicity of the pulses in the dynamic energy, and not as the fundamental in a spectrum.

Signal2 is very tonal, but for many sounds this is not the case, e.g. most vowels. As the damping increases, the tonality decreases. Signal3 is generated in the same way as Signal2, but the time constant is only 0.5 ms (Figs. 6 and 7), and it is much less tonal and much more speech-like than Signal2. The normalised ripples of Signal3 are measured at $1.6 \times 10^{-4}$, which is nearly 3500 times greater than in the sinusoid.

For many years it has been a mystery how humans perceive the fundamental, or pitch, of a sound. One of the mysteries has been how the pitch of a deep male voice can be perceived over a telephone. The pitch of a male voice can be lower than 100 Hz, a traditional telephone has a lower cut-off frequency of 300 Hz, and for mobile phones it is higher. Nevertheless, there is no problem in perceiving a deep male voice over a telephone. The classic answer to the mystery is that the brain is able to reconstruct the fundamental, but a more obvious explanation is that the periodicity of the dynamic energy gives the pitch.

Signal4 is the word “key” pronounced by a male speaker. The pitch of the voiced vowel in the word is about 130 Hz. Fig. 8 shows a 15 ms segment of the vowel. Fig. 9 shows the spectrum of the vowel.
The strongest frequency component is at 260 Hz, and this is not the pitch but its first overtone (the second harmonic). A band-pass filter from 2 to 4 kHz is then used to filter the signal, giving Signal5. Fig. 10 shows the spectrum of Signal5. As can be seen, the frequencies below 400 Hz are suppressed by more than 70 dB compared with the frequencies above 2 kHz. In spite of that, there is no problem in perceiving the pitch. The dynamic energy gives a more understandable picture. Fig. 11 shows the band-pass filtered signal and its dynamic energy, and it is obvious that the dynamic energy retains the periodicity of the pitch.

It has also been known for many years that interference between two frequencies appears as a pitch. Signal6 is a sum of two frequencies, 1000 and 1200 Hz. A sum of two frequencies can be rewritten as a product as follows:

$$\sin(2\pi f_1 t) + \sin(2\pi f_2 t) = 2 \sin(\pi (f_1 + f_2) t) \cos(\pi (f_1 - f_2) t) \qquad (2)$$

This can be interpreted as a modulation of an oscillation. Fig. 12 shows a signal with a 1000 Hz and a 1200 Hz component of equal amplitude, together with its dynamic energy. Ideally the energy would fluctuate between zero and double the magnitude of the energy of the individual frequencies, but it is measured by full-wave rectification and low-pass filtering and will therefore not reach the ideal maximum and minimum.

The signal is modulated by the frequency $|f_1 - f_2|/2$ and the oscillation is $(f_1 + f_2)/2$. The shape of the dynamic energy is ideally $|\cos(\pi (f_1 - f_2) t)|$, which repeats every $1/|f_1 - f_2|$ seconds. The pitch is therefore $|f_1 - f_2|$ and not $|f_1 - f_2|/2$.

Listening to Signal6 makes it obvious that it has a pitch of 200 Hz although it has no fundamental at all. As a reference, Signal7 is a signal with a 1000 Hz and a 1220 Hz component of equal amplitude. It is easy to hear that Signal6 has a deeper pitch than Signal7. The periodicity of the dynamic energy must therefore be of decisive significance for perceiving the pitch.

4. Oscillation analysis
The outputs from the filter bank shown in Fig.
1 retain information not only about the dynamic energy but also about the oscillations. These could be analysed by means of a frequency analysis, but it is more likely that the ear performs an oscillation analysis in the time domain. An oscillation analysis is a histogram of peak-to-peak values, accumulated over a certain time frame, as a function of the oscillation period. Instead of peak-to-peak values, the peak values of the half-wave rectified signal might be accumulated (Seneff, 1988).

An oscillation analysis has some advantages over a frequency analysis: no smoothing time window such as Hamming or Hanning is needed, and the spectrum of the damping function is not mixed up with the resonance frequency, as it is in a frequency analysis.

The question now is whether the ear analyses the oscillation or not. If the signal has two damped resonance frequencies in the same frequency band of the filter bank, the output of the band-pass filter has an oscillation at the mean of the two frequencies, according to eq. 2.

Five signals are generated by means of five filters, each with two damped resonance frequencies. An impulse train with a period of 5 ms is used as excitation signal. The duration of each signal is 2 seconds.

Signal      Time constant (ms)   Frequency 1 (Hz)   Frequency 2 (Hz)
DFSignal1   0.5                  1000               1000
DFSignal2   0.7                   800               1200
DFSignal3   1.0                   700               1300
DFSignal4   0.5                   800                800
DFSignal5   0.5                  1200               1200

The signals DFSignal1.WAV, DFSignal2.WAV, and DFSignal3.WAV all have oscillations of 1000 Hz. The time constant is increased as the distance between the frequencies is increased; this is necessary to keep the amount of oscillation in the signals at the same level. The signals DFSignal4.WAV and DFSignal5.WAV are made as reference signals. Fig. 13 shows a segment of DFSignal1 and Fig. 14 shows a segment of DFSignal2.
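One way to make the oscillation histogram concrete is the Python sketch below. It is an interpretation rather than the patented method: each pair of adjacent extrema contributes its peak-to-peak value to the bin for twice the time between them, which is an assumed way of assigning an oscillation period. The 48 kHz sampling rate is also an assumption (chosen so that 5 ms is exactly 240 samples), as are the function names. The synthesis follows the table above for four of the signals, with each pulse exciting two damped sinusoids.

```python
import math
from collections import defaultdict

FS = 48_000  # assumed sampling rate; 5 ms is then exactly 240 samples

def double_resonance_train(tau_ms, f1, f2, duration_s=0.1, period_s=0.005, fs=FS):
    """Impulse train exciting two damped resonances f1 and f2 (Hz)."""
    n_total = round(duration_s * fs)
    n_period = round(period_s * fs)
    tau = tau_ms / 1000.0
    out = [0.0] * n_total
    for start in range(0, n_total, n_period):
        # Three pulse periods are enough for the response to die away.
        for n in range(start, min(n_total, start + 3 * n_period)):
            t = (n - start) / fs
            out[n] += math.exp(-t / tau) * (
                math.sin(2.0 * math.pi * f1 * t) + math.sin(2.0 * math.pi * f2 * t)
            )
    return out

def oscillation_histogram(signal, fs=FS):
    """Accumulate peak-to-peak values against oscillation period in ms."""
    extrema = [
        i for i in range(1, len(signal) - 1)
        if (signal[i] - signal[i - 1]) * (signal[i + 1] - signal[i]) < 0
    ]
    hist = defaultdict(float)
    for a, b in zip(extrema, extrema[1:]):
        period_ms = 2.0 * (b - a) / fs * 1000.0   # adjacent extrema: half a period
        hist[round(period_ms, 3)] += abs(signal[b] - signal[a])
    return dict(hist)

def dominant_period_ms(signal, fs=FS):
    hist = oscillation_histogram(signal, fs)
    return max(hist, key=hist.get)

# Rows taken from the table: (time constant in ms, frequency 1, frequency 2).
rows = {
    "DFSignal1": (0.5, 1000, 1000),
    "DFSignal2": (0.7, 800, 1200),
    "DFSignal4": (0.5, 800, 800),
    "DFSignal5": (0.5, 1200, 1200),
}
for name, (tau_ms, f1, f2) in rows.items():
    period = dominant_period_ms(double_resonance_train(tau_ms, f1, f2))
    print(name, "dominant oscillation period (ms):", period)
```

The resolution of the period estimate is limited to two sample intervals because the extrema are not interpolated. Accumulating the peak values of the half-wave rectified signal instead, as the text mentions, would change only the weighting inside `oscillation_histogram`.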
A listening test shows that DFSignal1, DFSignal2, and DFSignal3 sound very similar, while DFSignal4 has a deeper sound colour and DFSignal5 a higher sound colour than DFSignal1, DFSignal2, and DFSignal3.

A similar test was made with higher frequencies.

Signal      Time constant (ms)   Frequency 1 (Hz)   Frequency 2 (Hz)
DFSignal6   0.25                 2000               2000
DFSignal7   0.50                 1400               2600
DFSignal8   0.25                 1400               1400
DFSignal9   0.25                 2600               2600

The result was the same. DFSignal6.WAV and DFSignal7.WAV sound very similar, while DFSignal8.WAV sounds deeper and DFSignal9.WAV sounds higher than DFSignal6 and DFSignal7.

The conclusion is that the ear analyses the oscillations and not the spectrum when the signal is broad-spectrum and contains damped resonance frequencies. This is important for speech recognition. The resonance frequencies in speech signals (also called formants) have been regarded as the fingerprint of the vowels, but the fact that the ear uses the oscillations means that it is not the formants themselves but the oscillations generated by the interference between the formants that are the fingerprint of the vowels.

Conclusion
The experiments with dynamic energy show that the ear is very sensitive to the dynamics of the energy. A signal generated by means of a high Q filter and an impulse train has only very small fluctuations in its energy. Nevertheless, the signal sounds clearly different from a sinusoid, and it has a pitch that corresponds to the period of the impulse train. When a high Q filter is used the sound is very tonal, but when a low Q filter is used it is not.

When listening to two frequencies, e.g. 1000 and 1200 Hz, a pitch of 200 Hz is heard. This indicates that the pitch is perceived by means of the period of the dynamic energy and not by means of the fundamental, because the signal does not have a fundamental at 200 Hz, but the energy has a period corresponding to 200 Hz.
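The two-tone case can be checked numerically with a short Python sketch. It measures the dynamic energy of a 1000 Hz plus 1200 Hz signal by full-wave rectification and a one-pole low-pass filter, and then takes the autocorrelation lag at which the energy repeats best. The autocorrelation step, the 16 kHz sampling rate, the 400 Hz cut-off, the search range, and the function names are all assumptions made for this illustration; they are not taken from the experiments.

```python
import math

FS = 16_000  # assumed sampling rate; 5 ms is then exactly 80 samples

def two_tone(f1, f2, duration_s=0.1, fs=FS):
    """Sum of two sinusoids of equal amplitude."""
    return [
        math.sin(2.0 * math.pi * f1 * n / fs) + math.sin(2.0 * math.pi * f2 * n / fs)
        for n in range(round(duration_s * fs))
    ]

def dynamic_energy(signal, fs=FS, cutoff_hz=400.0):
    """Full-wave rectification followed by a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    state, out = 0.0, []
    for x in signal:
        state += alpha * (abs(x) - state)
        out.append(state)
    return out

def energy_periodicity_hz(energy, fs=FS, skip_s=0.02, window_s=0.05,
                          min_hz=125.0, max_hz=600.0):
    """Frequency whose period gives the largest autocorrelation of the energy."""
    skip, window = round(skip_s * fs), round(window_s * fs)
    steady = energy[skip:]                     # drop the filter start-up
    mean = sum(steady) / len(steady)
    e = [v - mean for v in steady]
    best_lag, best_score = None, float("-inf")
    for lag in range(int(fs / max_hz), int(fs / min_hz) + 1):
        score = sum(e[n] * e[n + lag] for n in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return fs / best_lag

pitch = energy_periodicity_hz(dynamic_energy(two_tone(1000.0, 1200.0)))
print("periodicity of the dynamic energy (Hz):", pitch)
```

Calling `two_tone(1000.0, 1220.0)` instead gives the reference signal with the slightly larger frequency difference.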
The experiments with oscillations show that if two damped frequencies are not too far from each other, the ear perceives the oscillation arising from the interference between the two frequencies and not the two frequencies themselves. This is an important phenomenon in relation to speech recognition, especially vowel recognition. Traditionally the formants, which are damped resonance frequencies in speech signals, are regarded as the fingerprint of the vowels. This result indicates instead that it is the oscillations arising from the interference between the formants that are the fingerprint.

The findings in these experiments suggest that the basilar membrane acts as a broad-band band-pass filter bank and that the auditory nerve system and the brain perform the detailed signal processing.

References
Delgutte, B. (1980) Representation of speech-like sounds in the discharge patterns of auditory-nerve fibres, Journal of the Acoustical Society of America, 68, 843-857.
Leonhard, F. U. (2003) Method for analysing signals conditions containing pulses, patent application WO2005015543, date of priority August 6, 2003.
Leonhard, F. U. (1993) Method and System for detecting and generating transient condition in auditory signals, patent EP0737351, date of priority April 22, 1993.
Seneff, S. (1988) A joint synchrony/mean-rate model of auditory speech processing, Journal of Phonetics, 16, 55-76.
Zwicker, E. (1961) Subdivision of the audible frequency range into critical bands, Journal of the Acoustical Society of America, 33, 248-249.

Figures
Fig. 1. Separation of the oscillations by means of a filter bank with N band-pass filters. [Diagram: input, filters BP1 to BPN, outputs; time span 10 ms.]
Fig. 2. Dynamic energy of the output of the high frequency band-pass filter. [Annotations: maximum slope, magnitude, rise time; time span 10 ms.]
Fig. 3. The energy of a sinusoid measured by means of full-wave rectification and low-pass filtering. The energy contains ripple. [Panels: amplitude; dynamic energy.]
Fig. 4.
An 802 Hz tone generated by a high Q filter with an impulse train as excitation signal. The period of the pulse train is 2.494 ms. The time constant of the filter is 100 ms. [Panel: amplitude.]
Fig. 5. The upper signal is the signal shown in Fig. 4 filtered with a band-pass filter. The energy contains ripple with the periodicity of the pulse train. [Panels: amplitude, band-passed 1-2 kHz; dynamic energy.]
Fig. 6. An 802 Hz tone generated by a low Q filter with an impulse train as excitation signal. The period of the pulse train is 2.494 ms. The time constant of the filter is 0.5 ms. [Panel: amplitude.]
Fig. 7. The measured dynamic energy of the signal shown in Fig. 6. [Panel: dynamic energy.]
Fig. 8. A segment of the vowel in the word “key” pronounced by a male speaker (Signal4). The pitch is about 130 Hz. [Panel: amplitude; time span 15 ms.]
Fig. 9. The spectrum of the signal shown in Fig. 8.
Fig. 10. The spectrum of the signal shown in Fig. 8 filtered by a band-pass filter from 2 to 4 kHz.
Fig. 11. The upper signal is the signal shown in Fig. 8 filtered with a band-pass filter. The dynamic energy clearly has a periodicity corresponding to the pitch of the signal shown in Fig. 8. [Panels: amplitude, band-passed 2-4 kHz; dynamic energy, pitch = 130 Hz; time span 15 ms.]
Fig. 12. The upper signal contains two frequencies. The dynamic energy fluctuates as a consequence of the interference between the two frequencies. [Panels: amplitude; dynamic energy, pitch = 200 Hz; time span 10 ms.]
Fig. 13. A segment of a pulse train. The pulses consist of two damped frequencies. The time constant is 0.5 ms and both frequencies are 1000 Hz. [Time span 10 ms.]
Fig. 14. A segment of a pulse train. The pulses consist of two damped frequencies. The time constant is 0.7 ms and the frequencies are 800 and 1200 Hz. [Time span 10 ms.]