Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999

PITCH ESTIMATION USING MULTIPLE INDEPENDENT TIME-FREQUENCY WINDOWS

Anssi Klapuri
Signal Processing Laboratory, Tampere University of Technology
P.O.Box 553, FIN-33101 Tampere, Finland
klap@cs.tut.fi

ABSTRACT

A system for the detection of the pitch of musical sounds over a wide pitch range and in diverse conditions is presented. The system is built upon a pitch model that calculates independent pitch estimates in separate time-frequency windows and then combines them to yield a single estimate of the pitch. Both psychoacoustic and computational experiments were carried out to determine the optimal sizes of the elementary windows. The robustness of the system in wide-band additive noise and under the interference of another harmonic sound is demonstrated. An extension of the algorithm to the multi-pitch case is described, and simulation results for two-voice polyphonies are presented. (This work was supported in part by Nokia Oyj Foundation.)

1. INTRODUCTION

Recent psychoacoustic knowledge suggests that the human auditory system estimates pitch at local regions of the frequency spectrum and then integrates the results [1,2]. In [3] we proposed an approach in which independent estimates of the pitch are calculated in separate time-frequency windows, and the results are then combined to yield a single estimate. An algorithm was described that implements this idea.

In this paper, we first consider how to select the elementary time-frequency windows so that the performance of the algorithm is optimized. We then construct and evaluate a system that employs the designed model to estimate the pitch of musical sounds over a wide pitch range. The robustness of the system in wide-band additive noise and under the interference of another harmonic sound is demonstrated. An extension of the algorithm to the detection of multiple pitches is described, and preliminary simulations for two-voice polyphonies are provided. Although the algorithm is based on band-wise processing, it can be implemented efficiently and requires only one discrete Fourier transform per time frame.

2. PITCH MODEL USED

The system presented in this paper is built upon a pitch model that we originally described in [3]. Here we can only briefly revisit the principles of the model itself. The details and a comparison to other models of perception ([2,5]) can be found in that reference.

2.1. Pitch estimation in a time-frequency window

Pitch is commonly understood as a phenomenon that groups together spectral components by regarding certain harmonic relations between them. We first describe a procedure that implements the logical rules governing the associations between partials. We then present a harmonic summation model that is applied to calculate the likelihood of a pitch candidate from the parameters of the associated partials.

In the case of real, non-ideal physical vibrators, the frequency partials of a sound are not in exact integral ratios. For vibrating strings, for example, the frequencies of the partials obey

    f_n = n · f_1 · [1 + (n² − 1)B]^(1/2)    (1)

where B is an inharmonicity factor [4]. If B is zero, the sound is ideally harmonic, but for the piano, for example, B is large enough to shift the 17th frequency partial to the position of the 18th partial of an ideal string. Thus the partials cannot be assumed to lie at harmonic spectral positions. Building pitch calculations upon the intervals between partials has proved more successful [6,7,2], but even the intervals cannot be assumed constant.

A fundamental idea in our system is to calculate completely independent estimates of the pitch at distinct frequency bands, and then to integrate the results in a manner that takes the inharmonicity of the source into account. This solves the problem, since the spectral intervals can be assumed piece-wise constant over narrow enough bands. It also provides robustness in cases where only a fragment of the whole frequency range can be used. For algorithm details, see [3].

After the logical grouping of partials, the remaining problem is to find a quantitative model that calculates a pitch likelihood measure from the parameters of the associated partials. A harmonic summation model was developed that is based on the model of loudness perception proposed by Moore et al. [5]. The model was simplified, and then modified and parametrized for our purpose; nevertheless, the pitch algorithm still gives approximations of the loudness of the detected sounds as a side-product. Musical instrument sounds were used to train the parameters of the model.

An overview of the pitch calculations is as follows. First, a discrete Fourier transform X(f) is calculated for a windowed time domain signal. Then a logarithmic intensity spectrum is calculated as I(f) = log(|X(f)| · f). At the frequency band of interest, the mean of the I(f) values is calculated and subtracted from I(f), and the resulting negative intensities are set to zero. For the used sizes of logarithmically distributed bands, this operation significantly enhanced the performance of the algorithm in noise. After this, the intensities of the associated series of equidistant partials at the band are linearly summed and weighted by a factor that approximates an integral over the frequency band in equivalent rectangular bandwidth (ERB) scale units [3].

2.2. Integration of the band-wise pitch estimates

In Figure 1, the pitch likelihood vectors calculated at different bands are illustrated for two piano tones. In the final phase, a global estimate is formed from the band-wise results.
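The band-wise computation outlined in Section 2.1 (windowed DFT, log-intensity spectrum, band-wise mean subtraction with clipping, linear summation of the intensities at equidistant partials) can be sketched as follows. This is a simplified illustration, not the paper's implementation: all names are ours, and the ERB-scale weighting factor and the inharmonicity-aware partial placement of Equation 1 are omitted.

```python
import numpy as np

def band_likelihood(x, sr, f_lo, f_hi, f0, n_fft=4096):
    """Toy likelihood of pitch candidate f0 within one frequency band.

    Follows the overview of Section 2.1 in simplified form: compute the
    log-intensity spectrum I(f) = log(|X(f)| * f), subtract the band mean
    and clip negatives, then linearly sum the intensities found at the
    equidistant partial positions n * f0 inside the band.
    """
    w = np.hamming(len(x))
    X = np.fft.rfft(x * w, n_fft)
    f = np.fft.rfftfreq(n_fft, 1.0 / sr)
    I = np.log(np.abs(X) * f + 1e-12)        # log intensity spectrum
    band = (f >= f_lo) & (f < f_hi)
    Ib = I[band] - I[band].mean()            # subtract the band mean ...
    Ib = np.clip(Ib, 0.0, None)              # ... and zero the negatives
    fb = f[band]
    total = 0.0
    n = max(1, int(np.ceil(f_lo / f0)))
    while n * f0 < f_hi:                     # partials of f0 inside the band
        k = np.argmin(np.abs(fb - n * f0))   # nearest DFT bin to the partial
        total += Ib[k]
        n += 1
    return total
```

For a clean sinusoid the band likelihood is, as expected, larger for candidates whose partial grid hits the spectral peak than for candidates that miss it.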
The likelihood vectors calculated at different frequency bands are drawn together to yield a global pitch estimate. A procedure was designed that scans through the most promising pitch values and matches a curve across the band-wise likelihood vectors in such a way that the inharmonicity of the source is taken into account. In the top panels of Figure 1, this corresponds to drawing a slightly bending curve from left to right over the likelihood vectors. In quantitative terms, linear summation is a relevant way to aggregate the likelihoods, because it extends the principles lent from loudness coding to cross-band processing [3].

Figure 1: Band-wise calculated pitch likelihood vectors for two piano tones, C2 (65 Hz) and A#4 (466 Hz). Bottom: the spectra and the bands at which the likelihoods were calculated. Top: the pitch likelihood vectors.

3. SELECTION OF TIME-FREQUENCY WINDOWS

The presented pitch model is based on independent estimates in separate time-frequency windows, which are then combined into a single pitch estimate. Obviously, it is important to find reasonable sizes for the elementary time-frequency windows in order to optimize the performance of the algorithm.

3.1. Bandwidth selection

Several different considerations call for a logarithmic distribution of the frequency bands. It is needed to reproduce the phenomena of human pitch perception, to retain the association with loudness coding, and to cope with a wide pitch range. The quantitative question concerning the width of the bands is less trivial. The selection of the bandwidths merely determines the accuracy of the band-wise linear approximation of the non-linearity in Equation 1, and does not change the model itself. However, certain factors set bounds for the selection of the elementary bandwidths. On one hand, the bands cannot be arbitrarily wide, since the intervals between harmonics are not constant over wide bands if a sound exhibits inharmonicity. On the other hand, the bands cannot be arbitrarily narrow, since when only one harmonic resides at a band, the maximum likelihood is assigned to all pitch candidates that are determined to have only one harmonic at that band. This creates noise in the system and merely postpones the problem to the integration stage.

It is well known that the human auditory system bases its pitch perception both on the spectral positions of frequency partials and on the intervals between them. The pitch of a sound does not change if its even-numbered harmonics are removed. However, if the odd harmonics are removed, the pitch doubles although the spectral period is the same as in the previous case. Thus the spectral place of the harmonics does matter. This is not the case when only higher harmonics are involved.

We carried out a psychoacoustic experiment to determine the limit up to which the human auditory system discerns the spectral places of harmonic partials. A total of nine subjects were presented with a pair of harmonic sounds, the first composed of a set of successive odd harmonics, and the latter of a set of even harmonics. The subjects were asked which of the two sounds was higher. When the series of partials were among the lower harmonics, the subjects perceived a clear octave difference, which suggests that the perception was affected by spectral place information. At the higher end, the octave difference disappeared and other aspects started to dominate. The results of the listening test are presented in Figure 2. Two aspects were found to affect the selection of the higher sound: the odd/even factor, and the starting position, i.e., the number of harmonics that were dropped from the beginning. The latter factor was found to be surprisingly strong.

Figure 2: Left: percentage of subjects that selected the even series as higher in pitch. Right: percentage of subjects that selected the upper-beginning series as higher, although it was odd.

The effect of spectral place information on pitch perception decreases as more and more harmonics are dropped from the lower end, and disappears around harmonics 7-10, as can be seen in both panels of Figure 2. In our pitch algorithm, the spectral place of partials dominates when only one partial is determined to reside at a band, and spectral intervals dominate otherwise. Using the results of the psychoacoustic experiment, we calculated a bandwidth that would make our algorithm resemble human perception most closely. The experimental data would suggest approximately third-octave bands, which comes close to the critical band. However, as a compromise with the other factors mentioned at the beginning of this section, we settled on an elementary bandwidth of 2/3 octave, with bands ranging from 50 Hz up to 6 kHz. For this selection, inharmonicity was still handled quite well.

3.2. Time window length

Tuning the size of the time window is more straightforward than selecting the bandwidth. In order to optimize time resolution and causality, we try to find the shortest possible window for which the algorithm still works without problems. We used a set of musical instrument samples and sung vowels to evaluate the performance of the algorithm for different time window lengths (see Table 1). The whole pitch range of each instrument was used. Marimba and vibes were first included, too, but had to be left out.
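The 2/3-octave band layout settled on in Section 3.1 can be written out as a quick sketch. The exact edge placement is our assumption, since the paper only fixes the bandwidth (2/3 octave) and the overall range (50 Hz to 6 kHz):

```python
import numpy as np

def band_edges(f_min=50.0, f_max=6000.0, octaves=2.0 / 3.0):
    """One plausible logarithmic layout of 2/3-octave band edges.

    Each edge is a factor of 2**(2/3) above the previous one; the last
    band is truncated at f_max. This is an illustrative layout, not the
    paper's exact band table.
    """
    edges = [f_min]
    while edges[-1] * 2.0 ** octaves < f_max:
        edges.append(edges[-1] * 2.0 ** octaves)
    edges.append(f_max)
    return np.array(edges)

edges = band_edges()
# widths in octaves; all full bands are exactly 2/3 octave wide
widths = np.log2(edges[1:] / edges[:-1])
```

With these defaults the layout spans log2(6000/50) ≈ 6.9 octaves, giving eleven bands of which only the last is narrower than 2/3 octave.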
The spectrum of vibrating bars turned out to be too exotic to be handled well enough by the algorithm [4]. The sampling rate was fixed to 44.1 kHz, and the size of the Hamming window used ranged from 1024 to 8192 samples. The pitch was calculated in a single window, 300 ms after the onset of the sound.

Table 1: Experiment material
  singing, female and male (soprano, alto, tenor, baritone and bass)
  piano, violin, cello, double bass
  trumpet, flute, oboe, B-flat clarinet, bass clarinet, saxophones, tuba

In Figure 3, the percentage of correct pitch detections is presented as a function of the pitch of the note, all instruments included. As expected, the performance for low-pitched sounds starts to degrade when the length of the window gets shorter. A window length of 2048 samples suffices to achieve robust pitch detection down to 65 Hz. Doubling this size pushes the limit a little lower, but below 50 Hz the sounds themselves become irregular enough to make the algorithm fail. For a Hamming window, 90 % of its mass is contained in 2/3 of its length, so the effective length of the 2048-sample window is (2/3) · (2048/44100) s ≈ 31 ms. This is well sufficient for following vibrato and glissando in music. In order to attain robustness against irregularities in the sounds themselves, a single 4096-sample window is used in the simulations presented later in this paper.

The same time window length has to be used throughout the bands, since we found that the pitch of the lowest sounds could often be detected even in the highest time-frequency windows. This has an interesting positive consequence for the computational complexity of the algorithm: the band-wise processing can be realized by calculating a single fast Fourier transform in each time frame, and then processing the spectrum band by band.

Figure 3: Percentage of correct pitch detections as a function of the pitch (Hz) of the sound (C1-C8, 33-4186 Hz), for time window lengths of 1024, 2048 and 4096 samples.

4. VALIDATION EXPERIMENTS

To validate the performance of the presented algorithm, we applied it to the detection of the pitch of the musical sounds listed in Table 1. The whole pitch range of each instrument was used; male and female singing together ranged from 65 Hz to 660 Hz. In this setting most simulation cases hit the mid-range, since most instruments occupy that pitch range. Contrariwise, only the piano and the double bass, for example, were available for fundamental frequencies below 65 Hz. All simulations were performed without prior knowledge of the instruments used or of the pitch range involved. A constant set of algorithm parameters was used throughout, and results are presented for single time frames, without higher-level processing over multiple time windows. A correct detection of the pitch was defined as an estimate that deviates less than a semitone from the correct pitch; octave errors are not accepted.

Results for the clean case, i.e., without noise, are presented in Figure 3. For the set of instruments used, pitch was detected practically without errors between 65 Hz and 2000 Hz. The number of simulation cases outside these limits is not sufficient, since only a couple of instruments occupy the widest pitch ranges. Some differences between instrument classes could be observed. The wind instruments represented an easy case, and no inharmonicities were observed in this class; the results for singing resembled those for the wind instruments. The piano, cello and double bass were among the most difficult.

4.1. Robustness in noise

The robustness of the algorithm in noise was investigated by repeating the above simulations with samples contaminated by wide-band additive noise. The perceptual loudness of the noise signal was scaled to equal the loudness of each individual sample; the scaling was done automatically using an implementation of the loudness model of Moore et al. [5]. This signal-to-noise ratio represents very poor recording or transmission conditions. Results for signals contaminated by white and pink noise are given in Figure 4. The algorithm exhibits significant robustness to noise. At the higher end of the pitch range the number of harmonic partials gets smaller, and the performance degrades.

Figure 4: Performance in additive white and pink noise (C1-C8, 33-4186 Hz).

Additional higher-level processing, for example taking the median of the estimates in three successive time frames, would further improve the performance in noise. As we have demonstrated earlier in [3], the band-wise processing approach makes the algorithm immune to arbitrarily strong but band-limited noise. Experiments with extremely strong band-limited noise and with missing frequency bands gave results almost identical to the clean case. A significant degradation starts to take place only if the eliminated band is widened to comprise several octaves.

5. MULTI-PITCH DETECTION

Most of the research on pitch estimation has taken place in the speech processing domain. This is also the reason why the detection of multiple pitches has remained an almost neglected problem. Some proposals have been presented that utilize auditory modeling [9,8] or a set of instrument models [10]. It is generally admitted that single-pitch detection methods are not appropriate as such for the detection of multiple pitches. This is especially true for musical signals, where very often the frequency relations of simultaneous sounds either make them appear as a single coherent sound, or a non-existent sound arises because of the joint effect of the others.
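The difficulty with harmonically related sounds can be made concrete. For ideal harmonic sounds (B = 0 in Equation 1), every partial of a voice an octave above another coincides with an even-numbered partial of the lower voice, so the upper voice contributes no spectral positions of its own. A small sketch, in which the example frequencies and the 6 kHz upper limit are our choices:

```python
# For ideal harmonic sounds, the partial set of a tone at 2*f0 is a
# subset of the partial set of a tone at f0: the upper octave is
# spectrally invisible in the mixture.
def partials(f0, f_max=6000.0):
    """Set of ideal harmonic partial frequencies n*f0 up to f_max."""
    return {round(n * f0, 6) for n in range(1, int(f_max / f0) + 1)}

low = partials(110.0)    # A2
high = partials(220.0)   # A3, one octave above
assert high <= low       # every partial of the upper voice is shared
```

For non-octave relations (e.g. a fifth, 330 Hz against 220 Hz) only some partials coincide, which is why such combinations are merely difficult rather than indistinguishable.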
We generated two-voice complexes by letting one sound go through the pitch range of each particular instrument, and randomizing the other sound from the pitch range of the same instrument. The pitch of the latter sound was, however, restricted to lie between 55 Hz and 2100 Hz. The loudness of the two sounds was equalized to the same level. Multi-pitch detection was performed by applying the presented algorithm iteratively two times, and by removing the frequency partials of the first detected sound in between. This can be done, since the spectral positions of the partials can be calculated by substituting the f0 and B of the sound into Equation 1. This iterative approach was first proposed in [11]. Since the presented algorithm yields loudness estimates as a side-product, we call this loudest-first scheduling. Some special applications, like finding bass lines in music, may require lowest-first scheduling.

One addition had to be made to make the algorithm applicable to multi-pitch detection. In the case of simple harmonic relations, where the fundamental frequencies of the two sounds matched two low harmonics of a non-existent root sound, the root sound was erroneously detected. This problem was solved as follows. We implemented a mechanism that checks the existence of the fundamental partial h_1 and of the series of every nth harmonic partial, h_ni (i = 1, 2, ...), where n takes the prime values {2,3,5,7}. Three or more among {h_1, h_2i, h_3i, h_5i, h_7i} had to be found for a pitch candidate to be accepted; such a pattern cannot be produced by any combination of two higher sounds. This addition did not have any practical effect in the monophonic case, but it was nevertheless included in all the calculations of the simulation results presented earlier, in order to use one single algorithm and parameter set in all simulations.

Results for two-voice polyphonies are presented in Figure 5. Somewhat surprisingly, after the 'root sound' problem was solved, the detections were almost one hundred percent correct at the first iteration. Harmonic interference of the other sound did not cause octave errors or prevent detections. Results for the second iteration were also good, but not quite as successful, which indicates that the spectral subtraction does not work perfectly. A significant portion of the errors at the second iteration occurred for combinations in which the randomly generated sounds were in simple integer relations, i.e., the fundamental frequency of one sound was an integer multiple of that of the other. In this case all the harmonics of the higher sound perfectly match every nth harmonic of the lower one, and removing the partials of the first detected lower sound removed all the partials of the higher one, too.

Figure 5: Percentage of correct pitch detections for iterative multi-pitch detection in two-voice polyphonies (C1-C8, 33-4186 Hz): first detection correct, and both detections correct.

The results for multi-pitch detection are still preliminary, because the work on the subtraction and iteration is still unfinished. This is also the reason why results for richer polyphonies are not presented. However, the presented multi-pitch detection algorithm exhibits several desirable features: it operates over a wide pitch range, is not limited to pre-defined sounds, and is able to handle harmonically related fundamental frequencies without problems. We are currently working to solve the integer relation problem and to add the masking phenomenon of human hearing to the model.

6. SUMMARY

A system for the detection of the pitch of musical sounds was designed and evaluated. The system works over a wide pitch range and was shown to be robust in noise and in the presence of other harmonic sounds. An algorithm for single time frames was presented and analyzed.
Imposing higher-level logic on several time frames is expected to further enhance the performance. The algorithm can also be implemented relatively efficiently. An iterative approach to multi-pitch detection was proposed, in which the presented algorithm is applied several times and the frequency partials of the detected sounds are removed in between. This was validated to be an efficient method, provided that the spectral removal procedure is carefully implemented.

7. REFERENCES

[1] Bregman, "Auditory Scene Analysis," MIT Press, 1990.
[2] Meddis, Hewitt, "Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification," J. Acoust. Soc. Am., vol. 89, pp. 2866-2882, June 1991.
[3] Klapuri, "Wide-band pitch estimation for natural sound sources with inharmonicities," 106th Audio Engineering Society Convention, München, Germany, 1999.
[4] Fletcher, Rossing, "The Physics of Musical Instruments" (2nd edition), Springer-Verlag New York, Inc., 1998.
[5] Moore, Glasberg, Baer, "A model for the prediction of thresholds, loudness, and partial loudness," J. Audio Eng. Soc., vol. 45, no. 4, April 1997.
[6] Lahat, Niederjohn, Krubsack, "A spectral autocorrelation method for measurement of the fundamental frequency of noise-corrupted speech," IEEE Trans. on Acoustics, Speech and Signal Processing, no. 6, June 1987.
[7] Kunieda, Shimamura, Suzuki, "Robust method of measurement of fundamental frequency by ACLOS - autocorrelation of log spectrum," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1996.
[8] Karjalainen, Tolonen, "Multi-pitch and periodicity analysis model for sound separation and auditory scene analysis," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.
[9] Martin, "Automatic transcription of simple polyphonic music: Robust front end processing," MIT Media Laboratory Perceptual Computing Section Technical Report, 1996.
[10] Kashino, Tanaka, "A sound source separation system with the ability of automatic tone modeling," Proc. International Computer Music Conference, 1993.
[11] Klapuri, "Number theoretical means of resolving a mixture of several harmonic sounds," Proc. European Signal Processing Conference (EUSIPCO), 1998.