Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999
PITCH ESTIMATION USING MULTIPLE INDEPENDENT
TIME-FREQUENCY WINDOWS
Anssi Klapuri
Signal Processing Laboratory, Tampere University of Technology
P.O.Box 553, FIN-33101 Tampere, Finland
klap@cs.tut.fi
ABSTRACT
A system for the detection of the pitch of musical sounds over a
wide pitch range and in diverse conditions is presented. The system
is built upon a pitch model that calculates independent pitch
estimates in separate time-frequency windows and then combines
them to yield a single estimate of the pitch. Both psychoacoustic
and computational experiments were carried out to determine the
optimal sizes of the elementary windows. The robustness of the
system in wide-band additive noise and under the interference of
another harmonic sound is demonstrated. An extension of the
algorithm to the multi-pitch case is described, and simulation
results for two-voice polyphonies are presented.¹

1. INTRODUCTION

Recent psychoacoustic knowledge suggests that the human auditory
system estimates pitch at local regions of the frequency spectrum,
and then integrates the results [1,2]. In [3] we have proposed an
approach in which independent estimates of the pitch are calculated
in separate time-frequency windows, and the results are then
combined to yield a single estimate. An algorithm was described
that implements this idea.

In this paper, we first consider the selection of the elementary
time-frequency windows in such a way that the performance of the
algorithm is optimized. Then we construct and evaluate a system
which applies the designed model to the estimation of the pitch of
musical sounds over a wide pitch range. The robustness of the
system in wide-band additive noise and under the interference of
another harmonic sound is demonstrated. An extension of the
algorithm to the detection of multiple pitches is described, and
preliminary simulations for two-voice polyphonies are provided.
Although the algorithm is based on band-wise processing, it can be
efficiently implemented and requires only one discrete Fourier
transform per time frame.

2. PITCH MODEL USED

The system presented in this paper is built upon a pitch model that
we originally described in [3]. Here we can only briefly revisit
the principles of the model itself. The details and a comparison to
other models of perception ([2,5]) can be found in that reference.

2.1. Pitch estimation in a time-frequency window

Pitch is commonly understood as a phenomenon that groups together
spectral components according to certain harmonic relations between
them. We will first describe a procedure that implements the
logical rules governing the associations between partials. Then we
present a harmonic summation model that is applied to calculate the
likelihood for a pitch candidate from the parameters of the
associated partials.

In the case of real, non-ideal physical vibrators, the frequency
partials of the sound are not in exact integral ratios. For
vibrating strings, for example, the frequencies of the partials obey

    f_n = n · f_1 · [1 + (n² − 1)B]^(1/2)        (1)

where B is an inharmonicity factor [4]. If B is zero, the sound is
ideally harmonic, but for the piano, for example, B is large enough
to shift the 17th frequency partial to the position of the 18th
partial of an ideal string. Thus the partials cannot be assumed to
be found at harmonic spectral positions. Building pitch calculations
upon the intervals between partials has proved more successful
[6,7,2], but even the interval cannot be assumed constant.

A fundamental idea in our system is to calculate completely
independent estimates of the pitch at distinct frequency bands, and
then to integrate the results in a manner that takes the
inharmonicity of the source into account. This solves the problem,
since the spectral intervals can be assumed to be piece-wise
constant over narrow enough bands. It also provides robustness in
the cases where only a fragment of the whole frequency range can be
used. For algorithm details, see the reference [3].

After the logical grouping of partials, another problem is to find
a quantitative model which calculates a pitch likelihood measure
from the parameters of the associated partials. A harmonic
summation model was developed that is based on the model of
loudness perception proposed by Moore et al. [5]. The model was
simplified, and then modified and parametrized for our purpose.
However, the pitch algorithm still gives approximations of the
loudness of detected sounds as a side-product. Musical instruments
were used to train the parameters of the model.

An overview of the pitch calculations is as follows. First, a
discrete Fourier transform X(f) is calculated for a windowed time
domain signal. Then a logarithmic intensity spectrum is calculated
as I(f) = log(|X(f)| · f). At the frequency band of interest, the
mean of the I(f) values is calculated and subtracted from I(f).
Resulting negative intensities are set to zero. For the used sizes
of logarithmically distributed bands, this operation significantly
enhanced the performance of the algorithm in noise. After this, the
intensities of the associated series of equidistant partials at the
band are linearly summed, and weighted by a factor that
approximates an integral over the frequency band in equivalent
rectangular bandwidth (ERB) scale units [3].
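The band-wise calculation of Section 2.1 — log intensity spectrum, band-mean subtraction with clipping, and linear summation over a candidate's equidistant partials — might be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: the ERB-scale weighting factor is omitted, and the window and band parameters are placeholders.

```python
import numpy as np

def bandwise_pitch_likelihood(x, fs, f_lo, f_hi, f0_candidates):
    """Band-wise likelihood sketch: I(f) = log(|X(f)| * f), subtract
    the band mean, clip negatives, then linearly sum the intensities
    at each candidate's (ideally harmonic) partial positions."""
    X = np.fft.rfft(x * np.hamming(len(x)))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    band = (freqs >= f_lo) & (freqs < f_hi)
    fb = freqs[band]
    I = np.log(np.abs(X[band]) * fb + 1e-12)  # epsilon avoids log(0)
    I -= I.mean()        # subtract the mean intensity of the band...
    I[I < 0] = 0.0       # ...and set negative intensities to zero
    likelihood = {}
    for f0 in f0_candidates:
        # Sum intensities at the partial positions k*f0 inside the band.
        ks = range(max(1, int(np.ceil(f_lo / f0))), int(f_hi / f0) + 1)
        likelihood[f0] = sum(I[np.argmin(np.abs(fb - k * f0))] for k in ks)
    return likelihood
```

In the full model this sum is further weighted to approximate an integral over the band on the ERB scale, and the computation is repeated independently for every band.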
¹ This work was supported in part by Nokia Oyj Foundation.

2.2. Integration of the band-wise pitch estimates

In Figure 1, the calculated pitch likelihood vectors for different
bands are illustrated. In the final phase, the pitch estimates from
different frequency bands are drawn together to yield a global
pitch estimate. A procedure was designed which scans through the
most promising pitch values and matches a curve across the
band-wise likelihood vectors in such a way that the inharmonicity
of the source is taken into account. In the top panels of Figure 1,
this means drawing a slightly bending curve from left to right over
the likelihood vectors. In quantitative terms, linear summation is
a relevant way to aggregate the likelihoods, because this extends
the principles lent from loudness coding to cross-band processing
[3].

Figure 1: Band-wise calculated pitch likelihood vectors for two
piano tones, C2 (65 Hz) and A#4 (466 Hz). Bottom panels: spectra
and the bands at which the likelihoods were calculated. Top panels:
pitch likelihood vectors.

3. SELECTION OF TIME-FREQUENCY WINDOWS

The presented pitch model is based on independent estimates in
separate time-frequency windows, which are then combined into a
single pitch estimate. Obviously, it is important to find
reasonable sizes for the elementary time-frequency windows in order
to optimize the performance of the algorithm.

3.1. Bandwidth selection

Several different considerations call for a logarithmic
distribution of frequency bands. It is needed to reproduce the
phenomena in human pitch perception, to retain the association to
loudness coding, and to be able to cope with a wide pitch range.
The quantitative question concerning the width of the bands is not
as trivial. The selection of the bandwidths merely determines the
accuracy of the band-wise linear approximation of the non-linearity
in Equation 1, but does not change the model itself. However, there
are certain factors that set bounds for the selection of elementary
bandwidths. On one hand, the bands cannot be arbitrarily wide,
since the intervals between harmonics are not constant over wide
bands if a sound exhibits inharmonicities. On the other hand, the
bands cannot be arbitrarily narrow, since in the cases where only
one harmonic is at a band, the maximum likelihood is assigned to
all pitch candidates that are determined to have only one harmonic
at that band. This creates noise in the system and just postpones
the problem to the integration stage.

It is well known that the human auditory system bases its pitch
perception both on the spectral positions of frequency partials and
on the intervals between them. The pitch of a sound does not change
if its even numbered harmonics are removed. However, if the odd
harmonics are removed, the pitch doubles, although the spectral
period is the same as in the previous case. Thus the spectral place
of the harmonics does matter. This is not the case when only higher
harmonics are involved.

We carried out a psychoacoustic experiment to determine the limit
up to which the human auditory system discerns the spectral places
of harmonic partials. A total of nine subjects were presented with
a pair of harmonic sounds, the first composed of a set of
successive odd harmonics, and the latter of a set of even
harmonics. The subjects were asked which one of the sounds was
higher. In the cases where the series of partials were among the
lower harmonics, the subjects perceived a clear octave difference.
This suggests that the perception was affected by the spectral
place information. At the higher end, the octave difference
disappeared and other aspects started to dominate.

The results of the listening test are presented in Figure 2. Two
aspects were found to affect the selection of the higher sound: the
odd/even factor, and the starting position, i.e., the number of
harmonics that were dropped from the beginning. The latter factor
was found to be surprisingly strong. The effect of the spectral
place information on pitch perception decreases as more and more
harmonics are dropped from the lower end, and disappears around
7–10, as can be seen in both panels of Figure 2.

Figure 2: Left: percentage of subjects that selected the even
series as higher in pitch. Right: percentage of subjects that
selected the upper-beginning series as higher, although it was odd.
Horizontal axes: number of harmonics dropped from the beginning
(0 to 20).

In our pitch algorithm, the spectral place of partials dominates in
the case that only one partial is determined to reside at a band,
and spectral intervals dominate otherwise. Using the results of the
psychoacoustic experiment, we calculated a bandwidth that would
make our algorithm most resemble human perception. The experimental
data would suggest approximately third-octave bands, which in turn
comes close to the critical band. However, to make a good
compromise from the viewpoint of the other factors mentioned in the
beginning of this section, we settled on an elementary bandwidth of
2/3 octave, with bands ranging from 50 Hz up to 6 kHz. For this
selection, inharmonicity was still quite well handled.
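The logarithmically distributed 2/3-octave analysis bands from 50 Hz to 6 kHz can be generated as in the sketch below. The exact edge placement used in the original system is not specified here, so this is only one plausible construction.

```python
def band_edges(f_min=50.0, f_max=6000.0, octaves_per_band=2.0 / 3.0):
    """Logarithmically distributed band edges: each band spans a fixed
    fraction of an octave, starting at f_min and extending upward
    until the whole range up to f_max is covered."""
    step = 2.0 ** octaves_per_band   # frequency ratio spanned by one band
    edges = [f_min]
    while edges[-1] < f_max:
        edges.append(edges[-1] * step)
    return edges
```

With these defaults each band is a factor of 2^(2/3) ≈ 1.59 wide, and eleven bands cover the 50 Hz to 6 kHz range.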
3.2. Time window length
Tuning the size of the time window is more straightforward than
that of the bandwidth. In order to optimize the time resolution and
causality, we try to find the shortest possible window for which
the algorithm still works without problems. We used a set of
musical instrument samples and sung vowels to evaluate the performance of the algorithm for different time window lengths (see
Table 1). The whole pitch range of each instrument was used.
Marimba and vibes were first included, too, but it turned out that
the spectrum of vibrating bars is too exotic to be well enough
handled by the algorithm [4]. The sampling rate was fixed to
44.1 kHz, and the size of the used Hamming window ranged from 1024
to 8192 samples. The pitch was calculated in a single window,
300 ms after the onset of the sound.

Table 1: Experiment material
    singing, female         piano          trumpet
    singing, male           violin         flute
    (soprano, alto, tenor,  cello          oboe
    baritone and bass)      double bass    B-flat clarinet
    saxophones              tuba           bass clarinet

In Figure 3, we have presented the percentage of correct pitch
detections as a function of the pitch of the note, all instruments
included. As expected, the performance for low-pitched sounds
starts to degrade when the length of the window gets shorter. A
window length of 2048 samples suffices to achieve robust pitch
detections down to 65 Hz. Doubling this size pushes the limit a bit
lower, but below 50 Hz the sounds themselves get irregular enough
to make the algorithm fail. For a Hamming window, 90 % of its mass
is contained in 2/3 of its length. Thus we can calculate the
effective length of the used Hamming window to be
2/3 · (2048/44100) s = 31 ms. This is by far good enough for
vibrato and glissando in music. In order to attain robustness
against irregularities in the sounds themselves, a single
4096-sample window is used in the simulations presented later in
this paper.

The same time window length has to be used throughout the bands,
since we found out that the pitch of the lowest sounds could often
be detected even at the highest time-frequency windows. This has an
interesting positive consequence for the computational complexity
of the algorithm: the band-wise processing can be achieved by
calculating a single fast Fourier transform in each time frame, and
then processing the spectrum band by band.

Figure 3: Percentage of correct pitch detections as a function of
the pitch (Hz) of the sound, using time window lengths of 4096,
2048, and 1024 samples. Horizontal axis: C1 (33 Hz) to C8
(4186 Hz).

4. VALIDATION EXPERIMENTS

To validate the performance of the presented algorithm, we applied
it to the detection of the pitch of musical sounds, as listed in
Table 1. The whole pitch range of each instrument was used. Male
and female singing together ranged from 65 Hz to 660 Hz. In this
setting, more simulation cases hit the mid-range, since most
instruments occupy that pitch range. Contrariwise, only the piano
and double bass were available for fundamental frequencies below
65 Hz, for example.

All simulations were performed without prior knowledge of the
instruments used or the pitch range involved. A constant set of
algorithm parameters was used all the time, and results are
presented for single time frames, without including higher level
processing over multiple time windows. A correct detection of the
pitch was defined to be an estimate which deviates less than a
semitone from the correct pitch. Octave errors are not accepted.

Results for the clean case, i.e., without noise, were presented in
Figure 3. For the set of instruments used, pitch was detected
practically without errors between 65 Hz and 2000 Hz. The number of
simulation cases outside these limits is not sufficient, since only
a couple of instruments occupy the widest pitch range. Some
differences between instrument classes could be observed. The wind
instruments represented an easy case. Inharmonicities were not
observed in this class of instruments. The results for singing
resembled those for wind instruments. The piano, cello and double
bass were among the most difficult ones.

4.1. Robustness in noise

The robustness of the algorithm in noise was tested by repeating
the above simulations with samples that were contaminated by
wide-band additive noise. The perceptual loudness of the noise
signals was scaled to equal the loudness of each unique sample. The
scaling was done automatically by implementing the loudness model
of Moore et al. [5]. This signal-to-noise ratio represents very
poor recording or transmission conditions.

Results for the signals that were contaminated by white and pink
noise are given in Figure 4. The algorithm exhibits significant
robustness to noise. At the higher end, the number of harmonic
partials gets smaller, and the performance degrades. Additional
higher-level processing, for example taking the median of the
estimates in three successive time frames, would still improve the
performance in noise.

Figure 4: Performance in additive white and pink noise. Horizontal
axis: C1 (33 Hz) to C8 (4186 Hz).

As we have earlier demonstrated in [3], the band-wise processing
approach makes the algorithm immune to arbitrarily strong but
band-limited noise. Experiments with extremely strong band-limited
noise and with missing frequency bands gave results that were
almost identical to the clean case. A significant degradation
starts to take place if the eliminated band is widened to comprise
several octaves.
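The correctness criterion used in these evaluations — an estimate within one semitone of the reference, with octave errors counted as errors — can be written out as a small helper. This is an illustrative formulation, not code from the paper.

```python
import math

def is_correct_detection(f_est, f_ref, tol_semitones=1.0):
    """True if the pitch estimate deviates from the reference by less
    than tol_semitones. An octave error is about 12 semitones off, so
    it is automatically rejected, matching the evaluation criterion."""
    deviation = 12.0 * math.log2(f_est / f_ref)  # signed semitone distance
    return abs(deviation) < tol_semitones
```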
5. MULTI-PITCH DETECTION
Most of the research on pitch estimation has taken place in the
speech processing domain. This is also the reason why the detection
of multiple pitches has been an almost neglected problem. Some
proposals have been presented that utilize auditory modeling [9,8]
or a set of instrument models [10]. It is generally admitted that the
single pitch detection methods are not appropriate as such to the
detection of multiple pitches. This is especially true for musical
signals, where very often the frequency relations of simultaneous
sounds either make them appear as a single coherent sound, or a
non-existent sound arises because of the joint effect of the
others.

We generated two-voice complexes by letting one sound go through
the pitch range of each particular instrument, and randomizing the
other sound from the pitch range of the same instrument. However,
the pitch of the latter sound was restricted to be between 55 Hz
and 2100 Hz. The loudnesses of the two sounds were equalized to the
same level.

Multi-pitch detection was performed by applying the presented
algorithm iteratively two times, and by removing the frequency
partials of the first detected sound in between. This can be done,
since the spectral positions of the partials can be calculated by
substituting the f0 and B of the sound into Equation 1. This
iterative approach was first proposed in [11]. Since the presented
algorithm yields loudness estimates as a side-product, we call this
loudest-first scheduling. Some special applications, like finding
the bass lines in music, may require lowest-first scheduling.

One addition had to be made to make the algorithm applicable to
multi-pitch detection. In the case of simple harmonic relations,
where the fundamental frequencies of the two sounds matched two low
harmonics of a non-existent root sound, the root sound was
erroneously detected. This problem was solved by the following
method. We implemented a mechanism which is able to check the
existence of the fundamental partial, and of the series of every
nth harmonic partial h_{ni}, where n takes the prime number values
{2,3,5,7}. Three or more among {h_1, h_{2i}, h_{3i}, h_{5i},
h_{7i}} had to be found in order for a pitch candidate to be
accepted. This cannot be caused by any combination of two higher
sounds. This addition did not have any practical effect in the
monophonic case, but was nevertheless included in all the
calculations of the earlier presented simulation results. This was
done in order to use one single algorithm and parameter set in all
simulations.

Results for two-voice polyphonies are presented in Figure 5. What
was somewhat surprising is that after the 'root sound' problem was
solved, the detections were almost one hundred percent correct at
the first iteration. Harmonic interference of the other sound did
not cause octave errors or prevent detections. Results for the
second iteration were also good, but not quite as successful. This
indicates that the spectral subtraction does not work perfectly. A
significant portion of the errors at the second iteration took
place for combinations in which the randomly generated sounds were
in simple integer relations, i.e., the fundamental frequency of one
sound was an integer multiple of the other. In this case all the
harmonics of the higher sound perfectly match every nth harmonic of
the lower one, and removal of the partials of the first detected
lower sound removed all the partials of the higher one, too.

The results for multi-pitch detection are still preliminary,
because the work with the subtraction and iteration is still
unfinished. This is also the reason why results for richer
polyphonies are not presented. However, the presented multi-pitch
detection algorithm exhibits several desirable features. It
operates over a wide pitch range, is not limited to pre-defined
sounds, and is able to handle harmonically related fundamental
frequencies without problems. We are currently working to solve the
integer relation problem, and on adding the masking phenomenon of
human hearing.

Figure 5: Percentage of correct pitch detections for iterative
multi-pitch detection in two-voice polyphonies: first detection
correct vs. both detections correct. Horizontal axis: C1 (33 Hz)
to C8 (4186 Hz).

6. SUMMARY
A system for the detection of the pitch of musical sounds was
designed and evaluated. The system works over a wide pitch range
and was shown to be robust in noise and in the presence of other harmonic sounds. An algorithm for single time frames was presented
and analyzed. Imposing higher-level logic on several time frames
is expected to further enhance the performance. The algorithm
can also be relatively efficiently implemented.
An iterative approach to multi-pitch detection was proposed
by applying the presented algorithm several times and by removing the frequency partials of the detected sounds in between. This
was validated to be an efficient method, provided that the spectral
removal procedure is carefully implemented.
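The iterative multi-pitch scheme — estimate a pitch, cancel the detected sound's partials at the positions given by Equation 1, and estimate again — can be sketched roughly as below. Removal by zeroing a fixed neighborhood of bins is an illustrative assumption here, not the paper's exact subtraction procedure.

```python
import numpy as np

def partial_frequencies(f1, B, f_max):
    """Partial positions from Equation 1:
    f_n = n * f1 * (1 + (n**2 - 1) * B) ** 0.5, up to f_max."""
    freqs = []
    n = 1
    while True:
        fn = n * f1 * (1.0 + (n * n - 1) * B) ** 0.5
        if fn > f_max:
            return freqs
        freqs.append(fn)
        n += 1

def remove_partials(X, fs, n_fft, f1, B, halfwidth_bins=2):
    """Zero a small neighborhood around each predicted partial of the
    first-detected sound, so the pitch algorithm can be re-run on the
    residual spectrum."""
    Y = X.copy()
    df = fs / n_fft                       # frequency resolution of the DFT
    for fn in partial_frequencies(f1, B, fs / 2.0):
        k = int(round(fn / df))
        Y[max(k - halfwidth_bins, 0):min(k + halfwidth_bins + 1, len(Y))] = 0.0
    return Y
```

As noted above, when one fundamental is an integer multiple of the other, this removal also erases every partial of the higher sound, which is exactly the failure mode observed at the second iteration.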
7. REFERENCES
[1] Bregman, “Auditory Scene Analysis,” MIT Press, 1990.
[2] Meddis, Hewitt, “Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: pitch identification,”
J. Acoust. Soc. Am., vol. 89, pp. 2866-2882, June 1991.
[3] Klapuri, “Wide-band pitch estimation for natural sound
sources with inharmonicities,” 106th Audio Engineering
Society Convention, München, Germany, 1999.
[4] Fletcher, Rossing, “The Physics of Musical Instruments” (2nd
edition), Springer-Verlag New York, Inc., 1998.
[5] Moore, Glasberg, Baer, “A Model for the Prediction of
Thresholds, Loudness, and Partial Loudness,” J. Audio Eng.
Soc., vol. 45, No. 4, April 1997.
[6] Lahat, Niederjohn, Krubsack, “Spectral autocorrelation
method for measurement of the fundamental frequency of
noise-corrupted speech,” IEEE Trans. on Acoustics, Speech
and Signal Processing, No. 6, June 1987.
[7] Kunieda, Shimamura, Suzuki, “Robust method of measurement of fundamental frequency by ACLOS -autocorrelation
of log spectrum,” IEEE Trans. on Acoustics, Speech and Signal Processing, 1996.
[8] Karjalainen, Tolonen, “Multi-pitch and Periodicity Analysis
Model for Sound Separation and Auditory Scene Analysis,”
In proceedings of the International Conference on Acoustics,
Speech and Signal Processing, ICASSP, 1999.
[9] Martin, “Automatic Transcription of Simple Polyphonic
Music: Robust Front End Processing,” MIT Media Laboratory Perceptual Computing Section Technical Report, 1996.
[10] Kashino, Tanaka, “A sound source separation system with
the ability of automatic tone modeling,” Proceedings of the
International Computer Music Conference, 1993.
[11] Klapuri, “Number Theoretical Means of Resolving a Mixture
of Several Harmonic Sounds,” In proceedings of the European
Signal Processing Conference, 1998.