The Dynamic Ear

The oscillation analysis is patented and all rights belong to OrtoSense.
Frank Uldall Leonhard, PhD
September 2006
Summary
Until now, the assumed acoustic model for human sound perception has been based
on spectrum analysis (Fourier transformation), although it leaves some phenomena
unexplained. Among others: how can a deep male voice be identified through a
phone, even if the pitch is below 100 Hz and the phone has a lower cut-off frequency
of 300 Hz, and how are vowels identified when the formants are displaced a good deal
between small children and grown men?
Therefore another acoustic model is proposed, based on interference analysis. In this
model the dynamics of the instantaneous energy are analysed in one domain and the
oscillations in another. In this paper the basic principles of the model are discussed
and listening experiments are described, which strongly indicate that the model is
in agreement with human auditory perception.
1. Introduction
The human auditory system is a sound and speech analyser with excellent
performance. The computational models of today lag far behind in performance,
and it would be very useful to find an approach that reflects human perception
much better. This would lead to a better understanding of hearing impairment
and how to remedy it. It would also improve speech recognition systems, which today
are very sensitive and unstable.
The traditional auditory model is based on a cochlear band-pass filter bank. The
frequency bands are normally given by a Bark scale or a Mel frequency scale. A concise
definition of the Bark scale is provided by Zwicker (1961). Today it is assumed that
the basilar membrane in the cochlea performs a frequency-to-place transformation with
a high frequency resolution. The spectral representation is then analysed by the
auditory-nerve system and the brain.
The fundamental facts about auditory perception are primarily based on
listening tests in which sinusoids are used. This is, among others, the case for the
definition of the standardised sensitivity curves and the masking effect of frequencies.
However, a sinusoid is a special case and not a very common sound in real life.
The auditory system originally developed as an alarm system, in which pulses
in the sound picture are very important. When enemies or wild predators try to sneak
up on you to attack, they may be revealed by accidentally stepping on twigs
that break or by dropping something on the ground. The sounds of twigs breaking and
of things hitting the ground are pulses. The ear has therefore developed to be very
sensitive to pulses.
It is incontestable that the ear has a high frequency resolution, at least in the lower
frequency range. Sensitivity to pulses and a high frequency resolution are, however,
in conflict. To get a high frequency resolution it is necessary to analyse for a
longer period in a narrow-banded filter, and to analyse a pulse it is necessary to have
a high time resolution, so the filter has to be broad-banded. The sound must
therefore be analysed in more than one process.
A quite different model is therefore proposed, which consists of 2 analyses: an
oscillation analysis and a dynamic energy analysis. Both analyses are based on a
band-pass filter bank pre-analysis performed by the basilar membrane. But in contrast
to the traditional models, the filters are broad-banded, about one octave wide. The
actual frequency analysis must be done by the auditory-nerve system or by the brain.
The model drawn up in this paper is primarily based on listening tests, but studies of
auditory-nerve fibres point in the same direction, e.g. as described by Delgutte (1980).
2. Sound signals
The basis for sound and speech analysis has been that the signals can, to a good
approximation, be considered steady state for periods of 20-40 ms, and the
spectrum calculated for that period has been the key to auditory perception. However,
this assumption does not hold in practice. Sound and speech signals are dynamic
signals in which pulses play an important role. Pulses are typically transient
responses: an abrupt force acts on a material and the material responds. This response
is a transient in the shape of a pulse that contains damped resonance frequencies.
An example could be a twig that breaks. The pulse might contain several damped
resonance frequencies, and it is necessary to have a sufficient separation of the
damped frequencies. This might be the task of the basilar membrane. Fig. 1 shows
10 ms of voiced speech filtered in a filter bank. The signal is a female speaker
pronouncing “a” as in “have”. The figure shows the outputs of a band-pass filter in
the low and in the high frequency range. The output of the high-frequency band-pass
filter in particular shows that there are two pulses in the 10 ms signal, with a period
of about 5 ms. The pulses are created by the vocal cords exciting the vocal tract with
a period of about 5 ms.
3. Dynamic energy analysis
The envelope of the output signal of a band-pass filter is an expression of the
dynamic energy in the frequency band given by that filter. Fig. 2 shows the dynamic
energy of the output signal of the high-frequency band-pass filter shown
in fig. 1. The dynamic energy is measured by full-wave rectification followed by
low-pass filtering. The features of interest of the dynamic energy are the rise and
fall times, the magnitude, and the maximum slope of the leading and trailing edges of
the pulses. The periodicity of the dynamic energy is also an important feature. The
definition of an edge is based on the slope of the dynamic energy. The start and end of
the edge are defined relative to the maximum slope: the edge starts where the slope,
before its maximum, reaches e.g. 20 % of the maximum slope, and ends where the slope,
after its maximum, has fallen to e.g. 20 % of the maximum slope.
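The measurement and the edge definition described above can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: the moving-average low-pass filter, its 1 ms window, and the names `dynamic_energy` and `leading_edge` are assumptions made for the example. Only the rectify-then-low-pass structure and the 20 % slope threshold come from the text.

```python
import numpy as np

def dynamic_energy(x, fs, window_ms=1.0):
    """Full-wave rectification followed by a moving-average low-pass filter.

    The moving average and its window length are assumptions of this sketch.
    """
    win = max(1, int(round(window_ms * 1e-3 * fs)))
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def leading_edge(energy, fs, fraction=0.2):
    """Locate the leading edge around the point of maximum slope.

    The edge covers the contiguous run of samples around the maximum slope
    whose slope exceeds `fraction` of that maximum.
    """
    slope = np.diff(energy) * fs          # slope in energy units per second
    k = int(np.argmax(slope))             # sample of maximum slope
    threshold = fraction * slope[k]
    start = k
    while start > 0 and slope[start - 1] > threshold:
        start -= 1
    end = k
    while end < len(slope) - 1 and slope[end + 1] > threshold:
        end += 1
    return {
        "start": start,
        "end": end + 1,
        "rise_time": (end + 1 - start) / fs,
        "max_slope": slope[k],
        "magnitude": energy[end + 1] - energy[start],
    }
```

A typical call would be `leading_edge(dynamic_energy(x, fs), fs)` on the output `x` of one band-pass filter. The trailing edge can be found the same way by applying the function to the negated energy.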
The energy of a single sinusoid is constant, but a physical system as described
above will not measure a 100 % constant energy level. The rectification and low-pass
filtering cause small ripples on the energy level. Fig. 3 shows an example in which
the energy of a sinusoid is measured, and it can be seen that there are small ripples
in the energy.
Now 2 signals will be compared. Signal1 is a pure sinusoid, and Signal2 is a signal
created by means of a high-Q filter with 1 resonance frequency and an impulse train as
excitation signal. The output signal will be very close to a sinusoid at the
resonance frequency, but the signal will be periodic with the period of the pulse train.
An impulse train does not fulfil the Nyquist criterion. To get Signal2 as smooth as
possible it is therefore best if the resonance frequency and the frequency of the pulse
train are chosen so that the sampling frequency is very close to a multiple of both. The
sampling frequency is 44,100 Hz. The resonance frequency is therefore chosen as 802 Hz
(about 55 samples per period) and the period of the pulse train as 2.494 ms (110
samples), which corresponds to 401 Hz.
The impulse response for the filter is
$$h(t) = e^{-t/\tau} \sin(2\pi f t) \qquad (1)$$
and the time constant $\tau$ is equal to 100 ms.
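As an illustration of how such a signal could be generated, the sketch below sums one copy of the impulse response in eq. 1 for every impulse of the train, which is the same as convolving the impulse train with the response. The function name `damped_resonator_train`, the unit impulse amplitude, the 1 s duration, and the use of exactly 110 samples per pulse period (about 2.494 ms at 44.1 kHz) are assumptions of this example. The paper does not state how the filter was implemented.

```python
import numpy as np

def damped_resonator_train(f_res, tau, period_samples, fs=44100, duration=1.0):
    """Excite the impulse response h(t) = exp(-t/tau) * sin(2*pi*f_res*t)
    with a unit impulse every `period_samples` samples."""
    n = int(round(fs * duration))
    t = np.arange(n) / fs
    h = np.exp(-t / tau) * np.sin(2 * np.pi * f_res * t)
    y = np.zeros(n)
    for start in range(0, n, period_samples):
        # Each impulse adds one shifted copy of the impulse response.
        y[start:] += h[: n - start]
    return y

# High-Q case described in the text: 802 Hz resonance, time constant 100 ms.
signal2 = damped_resonator_train(802.0, 0.100, 110)
# Low-Q variant with the same resonance and a 0.5 ms time constant.
signal3 = damped_resonator_train(802.0, 0.0005, 110)
```

With the long time constant, roughly 40 pulse periods overlap at any instant, which is why the result looks almost like a steady 802 Hz sinusoid. With the short time constant each pulse has nearly died out before the next one arrives.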
Fig. 4 shows Signal2, and at first glance the signal looks like a pure sinusoidal tone.
Listening to the two signals shows, however, that they are different. The sound of a
pure sinusoid is in fact annoying and “sterile” to listen to, while Signal2 sounds like
an organ pipe with a pitch of 401 Hz. The energy of Signal2 fluctuates slightly, with
ripples. The ripples of the dynamic energy are greatest in the frequency band from 1
to 2 kHz, which is just above the 802 Hz tone. Fig. 5 shows Signal2 after band-pass
filtering and the measured dynamic energy in that frequency band. It can be seen
from the band-pass filtered signal that the amplitude varies with time, and so
does the measured energy.
The magnitude of the ripples is normalised to the power of the signals. The ripples
or pulses of Signal1 are measured as $4.6 \times 10^{-8}$, while the pulses of Signal2
are measured as $4.5 \times 10^{-7}$. The pulses of Signal2 are thus nearly 10 times
greater than those of Signal1. The period of the pulses in the dynamic energy also
differs between the 2 signals. In Signal1 the period corresponds to 802 Hz, while it
corresponds to 401 Hz in Signal2. Although the pulses are 10 times greater in Signal2
than in Signal1, they are small, which gives an impression of how sensitive the ear is
to pulses. Signal2 has a small fundamental at 401 Hz in the spectrum, but the great
sensitivity to pulses suggests that the ear perceives the pitch of the sound as the
periodicity of pulses in the dynamic energy, and not as the fundamental in a spectrum.
Signal2 is very tonal, but there are many sounds for which this is not the case, e.g.
most of the vowels. When the damping increases, the tonality decreases. Signal3 is
generated in the same way as Signal2, but the time constant is only 0.5 ms, and it is
much less tonal and much more speech-like than Signal2. The normalised ripples of
Signal3 are measured as $1.6 \times 10^{-4}$. This is nearly 3500 times greater than
in the sinusoid.
For many years it has been a mystery how humans perceive the fundamental
or pitch of a sound. One of the mysteries has been how the pitch of a deep male
voice can be perceived through a phone. The pitch of a male voice can be lower than
100 Hz, and a traditional phone has a lower cut-off frequency of 300 Hz; for
mobile phones it is higher. Nevertheless, there is no problem in perceiving a deep
male voice through a phone. The classic answer to that mystery is that the brain is
able to reconstruct the fundamental, but a more obvious explanation is that the
periodicity of the dynamic energy gives the pitch.
Signal4 is the word “key” pronounced by a male speaker. The pitch of the voiced vowel
in the word is about 130 Hz. Fig. 8 shows a 15 ms segment of the vowel. Fig. 9 shows
the spectrum of the vowel. The strongest frequency component is at 260 Hz, and this is
not the pitch but its first overtone (the second harmonic). A band-pass filter from 2
to 4 kHz is then used to filter the signal. The filtered signal is Signal5. Fig. 10
shows the spectrum of Signal5. As can be seen, the low frequencies below 400 Hz are
suppressed by more than 70 dB compared to the high frequencies above 2 kHz. Despite
that, there is no problem in perceiving the pitch.
The dynamic energy gives a more understandable picture. Fig. 11 shows the band-pass
filtered signal and the dynamic energy, and it is obvious that the dynamic
energy retains the periodicity of the pitch.
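The effect can be reproduced with a synthetic stand-in for the vowel. The sketch below is an illustration under stated assumptions, not the recording used in the paper: the test signal is a sum of equal-amplitude harmonics of 130 Hz, the 2-4 kHz band-pass filter is an ideal one applied by zeroing FFT bins, the low-pass filter is a 1 ms moving average, and the periodicity is read from the autocorrelation of the dynamic energy.

```python
import numpy as np

fs = 44100                      # sampling frequency in Hz
n = fs // 5                     # 0.2 s, so every harmonic of 130 Hz falls on an FFT bin
t = np.arange(n) / fs
f0 = 130.0                      # pitch of the synthetic "voice"

# Stand-in for the vowel: harmonics 1..40 of 130 Hz with equal amplitude.
x = np.sum([np.cos(2 * np.pi * k * f0 * t) for k in range(1, 41)], axis=0)

# Ideal 2-4 kHz band-pass filter, applied by zeroing FFT bins outside the band.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(n, d=1 / fs)
spectrum[(freqs < 2000) | (freqs > 4000)] = 0
band = np.fft.irfft(spectrum, n)

# Dynamic energy: full-wave rectification and a 1 ms moving average.
win = int(round(0.001 * fs))
energy = np.convolve(np.abs(band), np.ones(win) / win, mode="same")
energy = energy - energy.mean()

# Autocorrelation of the dynamic energy (via FFT, zero-padded to avoid wrap-around).
power = np.abs(np.fft.rfft(energy, 2 * n)) ** 2
autocorr = np.fft.irfft(power)[:n]

# Strongest periodicity between 80 and 400 Hz.
lo, hi = int(fs / 400), int(fs / 80)
lag = lo + int(np.argmax(autocorr[lo:hi]))
pitch = fs / lag
print(f"periodicity of the dynamic energy: {pitch:.1f} Hz")
```

The band-passed signal contains nothing below 2 kHz, yet the periodicity found in its dynamic energy is close to 130 Hz, limited only by the whole-sample resolution of the lag.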
It has also been known for many years that interference between two frequencies
is perceived as a pitch. Signal6 is a sum of 2 frequencies, 1000 and 1200 Hz. A sum of
2 sinusoids can be rewritten as a product as follows.
$$\sin(2\pi f_1 t) + \sin(2\pi f_2 t) = 2 \sin(\pi (f_1 + f_2) t) \cos(\pi (f_1 - f_2) t) \qquad (2)$$
This can be interpreted as a modulation of an oscillation. Fig. 12 shows a signal with
a 1000 and a 1200 Hz component of equal amplitude. It also shows the dynamic
energy. Ideally the energy will fluctuate between zero and double the magnitude of the
energy of the frequencies, but it is measured by means of full-wave rectification and
low-pass filtering, and will therefore not reach the ideal maximum and minimum.
The signal is modulated by the frequency $(f_1 - f_2)/2$ and the oscillation is
$(f_1 + f_2)/2$. The shape of the dynamic energy is ideally the rectified modulation,
$|\cos(\pi (f_1 - f_2) t)|$, which repeats twice per modulation period. The pitch is
therefore $f_1 - f_2$ and not $(f_1 - f_2)/2$.
By listening to Signal6 it is obvious that it has a pitch of 200 Hz, although it has no
fundamental at all. As a reference, Signal7 is a signal with a 1000 and a 1220 Hz
component of equal amplitude. It is easy to hear that Signal6 has a deeper pitch than
Signal7. Therefore the periodicity of the dynamic energy must have decisive
significance for perceiving the pitch.
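As a worked instance of eq. 2, the frequencies of the two test signals can be substituted directly. This is only a sketch of the arithmetic, with the envelope taken as the rectified modulation term:

```latex
\begin{align*}
\text{1000 Hz + 1200 Hz:}\quad
  & \sin(2\pi \cdot 1000\,t) + \sin(2\pi \cdot 1200\,t)
    = 2 \sin(2\pi \cdot 1100\,t)\,\cos(2\pi \cdot 100\,t), \\
  & \text{envelope } 2\,|\cos(2\pi \cdot 100\,t)|
    \text{ repeats every } 5\ \text{ms} \;\Rightarrow\; 200\ \text{Hz}. \\[4pt]
\text{1000 Hz + 1220 Hz:}\quad
  & \sin(2\pi \cdot 1000\,t) + \sin(2\pi \cdot 1220\,t)
    = 2 \sin(2\pi \cdot 1110\,t)\,\cos(2\pi \cdot 110\,t), \\
  & \text{envelope } 2\,|\cos(2\pi \cdot 110\,t)|
    \text{ repeats every } 1/220\ \text{s} \approx 4.55\ \text{ms} \;\Rightarrow\; 220\ \text{Hz}.
\end{align*}
```

The carrier changes only from 1100 to 1110 Hz between the two signals, whereas the envelope periodicity changes from 200 to 220 Hz. The difference in perceived pitch is consistent with the latter.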
4. Oscillation analysis
The outputs of the filter bank shown in fig. 1 carry information not only about the
dynamic energy but also about the oscillations. These could be analysed by means of a
frequency analysis, but it is more likely that the ear makes an oscillation analysis in
the time domain. An oscillation analysis is a histogram of peak-to-peak values,
accumulated over a certain time frame, as a function of the oscillation period. Instead
of peak-to-peak values, it might be the peak values of the half-wave rectified signal
that are accumulated (Seneff, 1988).
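One possible reading of this definition is sketched below. It is an interpretation for illustration only, not the patented method: the oscillation period is taken as the spacing between successive local maxima, the peak-to-peak value as the maximum minus the lowest sample before the next maximum, and the histogram uses 0.1 ms bins centred on multiples of 0.1 ms. The function name `oscillation_histogram` and all parameter values are assumptions.

```python
import numpy as np

def oscillation_histogram(x, fs, bin_width=1e-4, max_period=5e-3):
    """Accumulate peak-to-peak values as a function of oscillation period.

    Returns the bin centres (in seconds) and the accumulated peak-to-peak
    values for one time frame `x` sampled at `fs` Hz.
    """
    x = np.asarray(x, dtype=float)
    # Indices of local maxima (strictly rising before, not rising after).
    peaks = np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])) + 1
    # Bin edges placed so that the bins are centred on multiples of bin_width.
    edges = np.arange(bin_width / 2, max_period + bin_width, bin_width)
    hist = np.zeros(len(edges) - 1)
    for a, b in zip(peaks[:-1], peaks[1:]):
        period = (b - a) / fs                  # time between successive maxima
        peak_to_peak = x[a] - x[a:b].min()     # maximum minus the following minimum
        k = np.searchsorted(edges, period, side="right") - 1
        if 0 <= k < len(hist):
            hist[k] += peak_to_peak
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, hist
```

For a steady 1000 Hz sinusoid, all of the accumulated peak-to-peak value ends up in the bin centred on 1 ms. No time window is applied to the frame, in line with the advantage claimed for the method in the text.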
An oscillation analysis has some advantages compared to a frequency analysis. No
smoothing time window such as Hamming or Hanning is needed, and the spectrum of the
damping function is not mixed up with the resonance frequency, as is the case in a
frequency analysis.
The question now is whether the ear analyses the oscillation or not. If the signal has 2
damped resonance frequencies in the same frequency band of the filter bank,
then the output of the band-pass filter has an oscillation at the mean frequency of
the two frequencies, according to eq. 2.
Five signals are generated by means of 5 filters with 2 damped resonance
frequencies. An impulse train with a period of 5 ms is used as excitation signal. The
duration of the signals is 2 seconds.
Signal      Time constant (ms)   Frequency 1 (Hz)   Frequency 2 (Hz)
DFSignal1   0.5                  1000               1000
DFSignal2   0.7                  800                1200
DFSignal3   1.0                  700                1300
DFSignal4   0.5                  800                800
DFSignal5   0.5                  1200               1200
The signals DFSignal1, DFSignal2, and DFSignal3 all have oscillations of 1000 Hz.
The time constant is increased when the distance between the frequencies is
increased. This is necessary to keep the amount of oscillation in the signals at the
same level. The signals DFSignal4 and DFSignal5 are made as reference signals.
Fig. 13 shows a segment of DFSignal1 and fig. 14 shows a segment of DFSignal2.
A listening test shows that DFSignal1, DFSignal2, and DFSignal3 sound very similar,
while DFSignal4 has a sound colour that is deeper and DFSignal5 has a sound
colour that is higher than DFSignal1, DFSignal2, and DFSignal3.
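For readers who want to reproduce the listening test, the sketch below generates the five signals from the tabulated parameters. Several details are assumptions of this example and not stated in the paper: the two resonances are summed with equal amplitude and share one time constant, the sampling frequency is 48 kHz (so that the 5 ms pulse period is a whole number of samples), each pulse is truncated after 20 time constants, and the names `DF_SIGNALS`, `df_pulse`, and `df_signal` are hypothetical.

```python
import numpy as np

# (time constant in s, frequency 1 in Hz, frequency 2 in Hz), from the table above.
DF_SIGNALS = {
    "DFSignal1": (0.5e-3, 1000.0, 1000.0),
    "DFSignal2": (0.7e-3, 800.0, 1200.0),
    "DFSignal3": (1.0e-3, 700.0, 1300.0),
    "DFSignal4": (0.5e-3, 800.0, 800.0),
    "DFSignal5": (0.5e-3, 1200.0, 1200.0),
}

def df_pulse(tau, f1, f2, fs, n_tau=20):
    """Response to one impulse: two damped resonances with a common decay."""
    t = np.arange(int(round(n_tau * tau * fs))) / fs
    return np.exp(-t / tau) * (np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t))

def df_signal(tau, f1, f2, fs=48000, period=0.005, duration=2.0):
    """Repeat the pulse every `period` seconds for `duration` seconds."""
    n = int(round(duration * fs))
    step = int(round(period * fs))
    pulse = df_pulse(tau, f1, f2, fs)
    y = np.zeros(n + len(pulse))
    for start in range(0, n, step):
        y[start:start + len(pulse)] += pulse
    return y[:n]

signals = {name: df_signal(*params) for name, params in DF_SIGNALS.items()}
for name, (tau, f1, f2) in DF_SIGNALS.items():
    # Oscillation frequency predicted by eq. 2: the mean of the two resonances.
    print(name, "oscillation from eq. 2:", (f1 + f2) / 2, "Hz")
```

By eq. 2, each pulse of DFSignal2 equals a 1000 Hz oscillation multiplied by a 200 Hz cosine and the exponential decay, which is why the first three signals share the 1000 Hz oscillation even though their spectra differ.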
A similar test was made with higher frequencies.
Signal      Time constant (ms)   Frequency 1 (Hz)   Frequency 2 (Hz)
DFSignal6   0.25                 2000               2000
DFSignal7   0.50                 1400               2600
DFSignal8   0.25                 1400               1400
DFSignal9   0.25                 2600               2600
The result was the same. DFSignal6 and DFSignal7 sound very similar, while
DFSignal8 sounds deeper and DFSignal9 sounds higher than DFSignal6 and DFSignal7.
The conclusion is that the ear analyses the oscillations and not the spectrum when
the signal has a broad spectrum and contains damped resonance frequencies. This finding
is important for speech recognition. The resonance frequencies in speech signals
(also called formants) have been regarded as the fingerprint of the vowels, but the
fact that the ear uses the oscillations means that it is not the formants but the
oscillations generated by the interference between the formants that are the
fingerprint of the vowels.
5. Conclusion
The experiments with dynamic energy show that the ear is very sensitive to the
dynamics of the energy. A signal generated by means of a high-Q filter and an impulse
train has only very small fluctuations in the energy. Nevertheless, the signal sounds
clearly different from a sinusoid, and it has a pitch that corresponds to the
period of the impulse train. When a high-Q filter is used, the sound is very tonal, but
if a low-Q filter is used, it is not. When listening to 2 frequencies, e.g. 1000 and
1200 Hz, a pitch of 200 Hz is heard. This indicates that the pitch is perceived by
means of the period of the dynamic energy and not by means of the fundamental, because
the signal does not have a fundamental at 200 Hz, but the energy has a period
corresponding to 200 Hz.
The experiments with oscillations show that if 2 damped frequencies are not too far
from each other, the ear perceives the oscillation arising from the interference
between the 2 frequencies and not the 2 frequencies themselves. This is an important
phenomenon in relation to speech recognition, especially vowel recognition.
Traditionally the formants, which are damped resonance frequencies in speech signals,
are regarded as the fingerprint of the vowels. But this result indicates that it is the
oscillations arising from the interference between the formants that are the
fingerprint.
The findings in these experiments suggest that the basilar membrane acts as a
broad-banded band-pass filter bank, and that the auditory-nerve system and the brain
perform the detailed signal processing.
References
Delgutte, B. (1980) Representation of speech-like sounds in the discharge patterns of
auditory-nerve fibres, Journal of the Acoustical Society of America, 68, 843-857.
Leonhard, F. U. (2003) Method for analysing signals conditions containing pulses,
patent application WO2005015543, date of priority August 6, 2003.
Leonhard, F. U. (1993) Method and System for detecting and generating transient
condition in auditory signals, patent EP0737351, date of priority April 22, 1993.
Seneff, S. (1988) A joint synchrony/mean-rate model of auditory speech processing,
Journal of Phonetics, 16, 55-76.
Zwicker, E. (1961) Subdivision of the audible frequency range into critical bands,
Journal of the Acoustical Society of America, 33, 248-249.
Figures
Fig. 1. Separation of the oscillations by means of a filter bank with N band-pass
filters (input, BP1 ... BPN, output; time scale 10 ms).
Fig. 2. Dynamic energy of the output of the high-frequency band-pass filter (annotations:
maximum slope, magnitude, rise time; time scale 10 ms).
Fig. 3. The energy of a sinusoid measured by means of full-wave rectification and
low-pass filtering (traces: amplitude, dynamic energy). The energy contains ripple.
Fig. 4. An 802 Hz tone generated by a high-Q filter with an impulse train as excitation
signal. The period of the pulse train is 2.494 ms. The time constant of the filter is
100 ms.
Fig. 5. The upper signal is the signal shown in fig. 4 filtered with a 1-2 kHz band-pass
filter, shown together with its dynamic energy. The energy contains ripple with the
periodicity of the pulse train.
Fig. 6. An 802 Hz tone generated by a low-Q filter with an impulse train as excitation
signal. The period of the pulse train is 2.494 ms. The time constant of the filter is
0.5 ms.
Fig. 7. The measured dynamic energy of the signal shown in fig. 6.
Fig. 8. A segment of the vowel in the word “key” pronounced by a male speaker (Signal4).
The pitch is about 130 Hz (time scale 15 ms).
Fig. 9. The spectrum of the signal shown in fig. 8.
Fig. 10. The spectrum of the signal shown in fig. 8 filtered by a band-pass filter from
2 to 4 kHz.
Fig. 11. The upper signal is the signal shown in fig. 8 filtered with a 2-4 kHz band-pass
filter, shown together with its dynamic energy (time scale 15 ms). The dynamic energy
clearly has a periodicity corresponding to the 130 Hz pitch of the signal shown in
fig. 8.
Fig. 12. The upper signal contains 2 frequencies, shown together with its dynamic energy
(pitch = 200 Hz; time scale 10 ms). The dynamic energy fluctuates as a consequence of
the interference between the 2 frequencies.
Fig. 13. A segment of a pulse train (time scale 10 ms). The pulses consist of 2 damped
frequencies. The time constant is 0.5 ms and both frequencies are 1000 Hz.
Fig. 14. A segment of a pulse train (time scale 10 ms). The pulses consist of 2 damped
frequencies. The time constant is 0.7 ms and the frequencies are 800 and 1200 Hz.