Multiple Fundamental Frequency Estimation by Summing Harmonic Amplitudes

Anssi Klapuri
Institute of Signal Processing, Tampere University of Technology
Korkeakoulunkatu 1, 33720 Tampere, Finland
anssi.klapuri@tut.fi

Abstract

This paper proposes a conceptually simple and computationally efficient fundamental frequency (F0) estimator for polyphonic music signals. The studied class of estimators calculates the salience, or strength, of an F0 candidate as a weighted sum of the amplitudes of its harmonic partials. A mapping from the Fourier spectrum to an "F0 salience spectrum" is found by optimization using generated training material. Based on the resulting function, three different estimators are proposed: a "direct" method, an iterative estimation and cancellation method, and a method that estimates multiple F0s jointly. The latter two performed as well as a considerably more complex reference method. The number of concurrent sounds is estimated along with their F0s.

Keywords: F0 estimation, pitch, music transcription.

1. Introduction

Pitch information is an essential part of almost all Western music, but extracting the pitch content automatically from recorded audio signals is difficult. Whereas the spectrogram of a music signal can be calculated straightforwardly using the short-time Fourier transform, computing a "piano roll" representation which shows the polyphonic pitch content as a function of time is non-trivial, and systems that attempt it tend to be very complex.

F0 estimation in polyphonic music has been addressed by many authors. Kashino and Tanaka extracted sinusoid tracks from a music signal and grouped them into sound sources based on acoustic features and musical information [1]. De Cheveigné [2], Tolonen and Karjalainen [3], and Klapuri [5] proposed methods based on modeling the human auditory system. Goto [6] and Davy and Godsill [7] employed a parametric signal model and statistical methods. Smaragdis and Brown [8] and Abdallah and Plumbley [9] proposed unsupervised learning techniques to resolve sound mixtures, and Poliner and Ellis [10] introduced a classification approach to the problem.

Here we study a certain type of F0 estimator, where an input signal is first spectrally flattened ("whitened") in order to suppress timbral information, and then the salience, or strength, of an F0 candidate is calculated as a weighted sum of the amplitudes of its harmonic partials. More exactly, the salience s(τ) of a period candidate τ is calculated as

    s(\tau) = \sum_{m=1}^{M} g(\tau, m) \, |Y(f_{\tau,m})|,    (1)

where f_{\tau,m} = m f_s / \tau is the frequency of the m-th harmonic partial of the F0 candidate f_s/\tau, f_s is the sampling rate, and the function g(τ, m) defines the weight of partial m of period τ in the sum. Y(f) is the short-time Fourier transform of the whitened time-domain signal. (Defining s(τ) in terms of the power spectrum instead of the magnitude spectrum would have certain analytical advantages, but it led to clearly inferior F0 estimation results despite extensive investigation.) The spectral whitening is a straightforward pre-processing operation explained in Sect. 2.1. For convenience, we write s(τ) as a function of the fundamental period τ instead of the F0 (= f_s/τ).
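As a concrete illustration (not taken from the paper), a minimal Python sketch of (1) follows. It assumes the whitened short-time spectrum is available as a callable Y(f), for instance an interpolator over FFT bins, and that a weight function g(τ, m) is supplied; the number of partials M = 20 is an assumed value.

```python
# A minimal sketch of Eq. (1); Y, g, and M = 20 are assumptions,
# not the paper's implementation.
def salience(tau, Y, g, fs, M=20):
    """Salience s(tau) of period candidate tau (in samples)."""
    s = 0.0
    for m in range(1, M + 1):
        f = m * fs / tau              # frequency of the m-th harmonic partial
        s += g(tau, m) * abs(Y(f))    # weighted harmonic amplitude
    return s
```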
The basic idea of (1) is intuitively appealing, since pitch perception is closely related to the time-domain periodicity of sounds, and the Fourier theorem states that a periodic signal can be represented with spectral components at integer multiples of the inverse of the period. Indeed, formulas and principles resembling (1) have been used for F0 estimation by a number of authors, under different names and in different variants. Already in the 1960s and 70s, Schroeder introduced the frequency histogram and Noll the harmonic sum spectrum (see [11, p. 414]). De Cheveigné [2] discusses harmonic selection methods and, more recently, Walmsley [12] uses the name harmonic transform for a similar technique.

The question of an optimal mapping from the Fourier spectrum to an "F0 salience spectrum" is closely related to these methods. Here, the function g(τ, m) is learned by a brute-force optimization using a large amount of training material. Based on this, a parametric form is proposed for g(τ, m).

In this paper, three different methods based on (1) are proposed. The first and simplest is a "direct" method based on evaluating s(τ) for a range of values of τ and picking the desired number of highest local maxima in it. The second method represents an iterative estimation and cancellation approach, where the maximum of s(τ) is used to estimate one F0, which is then cancelled from the mixture before estimating the next one. The third method estimates all F0s jointly. For the latter two methods, a technique is proposed for estimating the number of sounds in the mixture. The iterative method admits a very fast implementation which does not require evaluating s(τ) for all period candidates τ. The three methods are evaluated using mixtures of musical instrument sounds, and the results are compared with three reference methods [3], [4], and [5].

2. Proposed methods

This section describes the proposed methods in detail.

2.1. Spectral whitening

One of the big challenges in F0 estimation is to make systems robust to different sound sources. A way to achieve this is to suppress timbral information prior to the actual F0 estimation. This can be done by estimating the rough spectral energy distribution (which largely defines the timbre of a sound) and then flattening it entirely or partly by inverse filtering. This process is called spectral whitening, and there are several ways of doing it (see e.g. [3]). Here a frequency-domain technique is employed which is easy to implement and leads to good results in practice.

First, the discrete Fourier transform X(k) of the input signal x(n) is calculated in an analysis frame that is Hanning-windowed and zero-padded to twice its length before the transform. Then a bandpass filterbank is simulated in the frequency domain. Center frequencies c_b [Hz] of the subbands are distributed uniformly on the critical-band scale, c_b = 229 \times (10^{(b+1)/21.4} - 1), and each subband b = 1, ..., 30 has a triangular power response H_b(k) that extends from c_{b-1} to c_{b+1} and is zero elsewhere (see Fig. 1).

Figure 1. Responses H_b(k) applied in spectral whitening.

Standard deviations σ_b within the subbands b are calculated by applying the responses H_b(k) in the frequency domain:

    \sigma_b = \Big( \frac{1}{K} \sum_k H_b(k) |X(k)|^2 \Big)^{1/2},    (2)

where K is the length of the Fourier transform. Next, bandwise compression coefficients \gamma_b = \sigma_b^{\nu - 1} are calculated, where ν = 0.33 is a parameter determining the amount of spectral whitening applied. The coefficients γ_b are linearly interpolated between the center frequencies c_b to obtain compression coefficients γ(k) for all frequency bins k. Finally, a whitened spectrum Y(k) is obtained by weighting the spectrum of the input signal by the compression coefficients, Y(k) = γ(k)X(k).
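The whitening procedure maps directly to code. Below is a sketch under stated assumptions: the frame x has already been cut from the signal, the real FFT is used, bands with zero energy are given a zero coefficient, and the coefficients are extrapolated flat outside the outermost center frequencies (the paper does not specify the extrapolation).

```python
import numpy as np

def whiten(x, fs, nu=0.33, n_bands=30):
    """Spectral whitening of Sect. 2.1 (a sketch, not the reference code)."""
    # Hanning window; zero-pad to twice the frame length before the DFT
    K = 2 * len(x)
    X = np.fft.rfft(x * np.hanning(len(x)), K)
    freqs = np.arange(len(X)) * fs / K

    # Center frequencies on the critical-band scale; b = 0 and
    # b = n_bands + 1 only delimit the outermost triangles
    b = np.arange(n_bands + 2)
    c = 229.0 * (10.0 ** ((b + 1) / 21.4) - 1.0)

    gamma_b = np.zeros(n_bands)
    for i in range(1, n_bands + 1):
        # Triangular power response H_b(k) from c[b-1] to c[b+1]
        H = np.interp(freqs, [c[i - 1], c[i], c[i + 1]], [0.0, 1.0, 0.0])
        sigma = np.sqrt(np.sum(H * np.abs(X) ** 2) / K)        # Eq. (2)
        gamma_b[i - 1] = sigma ** (nu - 1.0) if sigma > 0 else 0.0

    # Interpolate the compression coefficients between center frequencies
    gamma = np.interp(freqs, c[1:-1], gamma_b)
    return gamma * X                    # whitened spectrum Y(k)
```

Since ν < 1, the coefficient σ_b^{ν−1} is large for weak bands and small for strong ones, which is exactly what flattens the spectral energy distribution.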
2.2. Calculation of the salience function in practice

Calculating s(τ) using (1) directly requires evaluating Y(f) at arbitrary frequencies f, which is computationally inefficient. Use of the fast Fourier transform becomes possible by replacing Y(f) in (1) with its discrete version Y(k) and by approximating s(τ) by

    \hat{s}(\tau) = \sum_{m=1}^{M} g(\tau, m) \max_{k \in \kappa_{\tau,m}} |Y(k)|,    (3)

where the set κ_{τ,m} defines a range of frequency bins in the vicinity of the m-th overtone partial of the F0 candidate f_s/τ. More exactly,

    \kappa_{\tau,m} = [\langle mK/(\tau + \Delta\tau/2)\rangle, \ldots, \langle mK/(\tau - \Delta\tau/2)\rangle],    (4)

where ⟨·⟩ denotes rounding to the nearest integer. It is clear that ŝ(τ) ≈ s(τ) when Δτ → 0. In practice, however, it is useful to set Δτ according to the spacing between successive period candidates τ in order to ensure that every spectral component k belongs to the range κ_{τ,m} of at least one period candidate τ when m is fixed. Here we use the value Δτ = 0.5, that is, the spacing between fundamental period candidates τ is half the sampling interval. (In Sect. 2.4, where the fast algorithm is presented, the sampling of τ has only a minor effect on computational efficiency, and therefore very dense sampling can be implemented. In practice, Δτ = 0.5 suffices.)
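A sketch of (3)-(4) in the same vein, assuming Y is the full-length whitened DFT (so K = len(Y)) and that the weight function g is supplied; skipping partials at or above the Nyquist bin is a choice of mine rather than the paper's.

```python
import numpy as np

def salience_hat(tau, Y, g, M=20, delta_tau=0.5):
    """Approximate salience of Eq. (3); Y is the whitened length-K DFT."""
    K = len(Y)
    mag = np.abs(Y)
    s = 0.0
    for m in range(1, M + 1):
        # Bin range kappa_{tau,m} of Eq. (4) around the m-th partial
        k_lo = int(round(m * K / (tau + delta_tau / 2)))
        k_hi = min(int(round(m * K / (tau - delta_tau / 2))), K // 2)
        if k_lo > k_hi:
            break                      # partial at or above the Nyquist bin
        s += g(tau, m) * np.max(mag[k_lo:k_hi + 1])
    return s
```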
2.3. Optimization of the weight function

A remaining task is to optimize the function g(τ, m) so as to minimize the F0 estimation error rate of the system. For this purpose, we generated training material consisting of random mixtures of musical instrument sounds with their reference F0 data. The database from which the samples were drawn is described in more detail in Sect. 3. The mixtures were generated by first allotting an instrument and then a random sound from its playing range, limiting F0s between 40 Hz and 2100 Hz. This two-stage randomization was repeated until the desired number of sounds was obtained, and the sounds were then mixed with equal mean-square levels. One thousand mixtures each of one, two, four, and six sounds were generated, totalling 4000 training instances.

F0 estimation was performed simply by picking the P highest local maxima of the function ŝ(τ). The number of F0s in each mixture, P, was given to the estimator along with a 93 ms analysis frame. The multiple-F0 estimation error rate is defined as the proportion of reference F0s that were not correctly found. In predominant-F0 estimation, the task is to find only one F0 in each mixture. In this case, the maximum of ŝ(τ) was taken and judged correct if it matched any of the reference F0s in the mixture. A correct F0 estimate was defined to deviate less than 3% from the reference. The criterion to be minimized in the optimization was the average of the multiple-F0 and predominant-F0 estimation error rates in different polyphonies.

Two different factorized forms of g(τ, m) were studied:

    g(\tau, m) = g_1(\tau)\, g_2(m),    (5)
    g(\tau, m) = g_1(\tau)\, g_3(f_{\tau,m}).    (6)

Reducing the two-parameter function g(τ, m) to a product of two marginal functions makes the optimization task easier and is likely to lead to a result that generalizes better to previously unseen test cases.

Let us first consider the form given by (5). The function g_1(τ) was parametrized by interpolating between ten "anchor points" which were distributed roughly as a geometric series between the fundamental frequencies 30 Hz and 2500 Hz. Similarly, the function g_2(m) was parametrized by distributing ten anchor points as a geometric series between the 1st and the 21st harmonic, and g_2(m) was then linearly interpolated between these. The optimization was done by initializing the amplitudes of the anchor points to unity and then updating them cyclically, one at a time, so as to minimize the F0 estimation error rate.

Figure 2 shows the learned functions g_1(τ) and g_2(m), together with the resulting F0 estimation error rates (for the training data). The learned g_1(τ) is more or less a linear function of F0 (that is, f_s/τ), whereas g_2(m) converged roughly to the 1/m shape, however with smaller weights for the lowest even-numbered harmonics (see Fig. 2). The predominant-F0 estimation accuracy is good, but multiple-F0 estimation leaves room for improvement.

Figure 2. The learned functions g_1(τ) and g_2(m) are shown in the left and middle panels, respectively. For clarity, g_1(τ) is drawn as a function of F0 (f_s/τ) instead of τ. The right panel shows the resulting error rates for multiple-F0 estimation (black) and predominant-F0 estimation (white).

Optimization for the factorization in (6) was done in a similar manner. The function g_3(f_{τ,m}) was parametrized by interpolating it linearly between 13 anchor points that were distributed roughly as a geometric series between 30 Hz and 7 kHz. The optimization was again carried out by updating the amplitudes of the anchor points cyclically, one at a time, so as to minimize the F0 estimation error rate. The best result was obtained by starting the optimization from the configuration g(τ, m) = 1/m, which is obtained by initializing the anchor points as g_1(τ) = f_s/τ and g_3(f_{τ,m}) = 1/f_{τ,m}. Figure 3 shows the learned functions g_1(τ) and g_3(f_{τ,m}), and the resulting F0 estimation error rates. As can be seen, the functions g_1(τ) and g_3(f_{τ,m}) do not drift very far from their initial shapes, and the error rates are about the same as those achieved with the previous factorization.

Figure 3. The learned functions g_1(τ) and g_3(f_{τ,m}) are drawn with a solid line in the left and middle panels, respectively. The dashed lines show the corresponding parametric models. The right panel shows F0 estimation error rates before the parametric modeling.

The latter form (6) is interesting because it allows a simple implementation where the spectrum Y(k) is first filtered using the response g_3(f_{τ,m}), then ŝ(τ) is computed without any weights, and in the end ŝ(τ) is weighted with g_1(τ). The latter factorization (6) was taken into use. To get rid of the large number of free parameters (the anchor points), the function g_1(τ) is modeled as a linear function of fundamental frequency, g_1(τ) = f_s/τ + α, and the function g_3(f_{τ,m}) is modeled as an inverse of the frequency f_{τ,m} plus a moderation term β, g_3(f_{τ,m}) = 1/(f_{τ,m} + β). The dashed lines in Figure 3 show the modeled functions. As a result, the function g(τ, m) can finally be written as

    g(\tau, m) = \frac{f_s/\tau + \alpha}{m f_s/\tau + \beta},    (7)

where the parameters α and β are given in Sect. 3.
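For reference, (7) is a one-liner. The defaults below plug in the 93 ms frame values of α and β reported in Sect. 3; the sampling rate of 44.1 kHz is an assumption of mine.

```python
def g(tau, m, fs=44100.0, alpha=52.0, beta=320.0):
    """Partial weight of Eq. (7); alpha, beta per Sect. 3 (93 ms frame)."""
    f0 = fs / tau                       # F0 of the period candidate (Hz)
    return (f0 + alpha) / (m * f0 + beta)
```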
2.4. Iterative estimation and cancellation

The "direct" F0 estimator described above suffers from the problem that a single F0 in a sound mixture produces several peaks in ŝ(τ), and although the maximum of ŝ(τ) is a robust indicator of one of the true F0s, the second- or third-highest peak is often due to the same sound, located at a τ that is half or twice the position of the highest peak. Multiple-F0 estimation accuracy can be improved by an iterative estimation and cancellation scheme where each detected sound is cancelled from the mixture and ŝ(τ) is updated accordingly before estimating the next F0. The basic cancellation mechanism described here is similar to that presented in [5], except that here a fast algorithm is described for finding the maximum of ŝ(τ) and a technique is proposed for estimating the number of sounds in the mixture.

Let us first look at an efficient way of finding the maximum of ŝ(τ). Somewhat surprisingly, the global maximum of ŝ(τ) and the corresponding value of τ can be found with a fast algorithm that does not require evaluating ŝ(τ) for all τ. This is another motivation for the iterative estimation and cancellation approach, where only the maximum of ŝ(τ) is needed at each iteration.

Let us denote the minimum and maximum fundamental periods of interest by τmin and τmax, respectively, and the required precision of sampling τ by τprec. A fast search for the maximum of ŝ(τ) can be implemented by repeatedly splitting the range [τmin, τmax] into smaller "blocks", computing an upper bound smax(q) for the salience within each block q, and continuing by splitting the block with the highest smax(q). Let us denote the number of blocks by Q and the lower and upper limits of block q by τlow(q) and τup(q), respectively. The index of the highest-salience block is denoted by qbest. The algorithm starts with a single block with lower and upper limits at τmin and τmax, and then repeatedly splits the best block into two halves, as detailed in Algorithm 1. (It is even more efficient to start with √((τmax − τmin)/τprec) blocks.) As a result, it returns the maximum of ŝ(τ) and the corresponding value of τ.

Algorithm 1: Fast search of the maximum of ŝ(τ)
 1   Q ← 1
 2   τlow(1) ← τmin; τup(1) ← τmax
 3   qbest ← 1
 4   while τup(qbest) − τlow(qbest) > τprec do
 5       # Split the best block and compute new limits
 6       Q ← Q + 1
 7       τlow(Q) ← (τlow(qbest) + τup(qbest)) / 2
 8       τup(Q) ← τup(qbest)
 9       τup(qbest) ← τlow(Q)
10       # Compute new saliences for the two block halves
11       for q ∈ {qbest, Q} do
12           Calculate smax(q) using Equations (3)-(4) with g(τ, m) = (f_s/τlow(q) + α) / (m f_s/τup(q) + β),
13               τ = (τlow(q) + τup(q)) / 2,
14               Δτ = τup(q) − τlow(q)
15       end
16       # Search again for the best block
17       qbest ← arg max_{q ∈ [1,Q]} smax(q)
18   end
19   Return τ̂ = (τlow(qbest) + τup(qbest)) / 2 and ŝ(τ̂) = smax(qbest)

On lines 12-14 of the algorithm, in order to obtain an upper bound for the salience ŝ(τ) within the range [τlow(q), τup(q)], Equation (3) is evaluated using the given values of g(τ, m), τ, and Δτ. Splitting a block later on can only decrease the value of smax(q) computed for the new block halves.

Algorithm 1 is important for two reasons. First, it allows searching for the maximum of ŝ(τ) efficiently even when the required sampling density of τ is very high. Secondly, increasing the sampling density of τ has the consequence that all the sets κ_{τ,m} in (3) contain exactly one frequency bin, in which case the non-linear maximization operation vanishes and ŝ(τ) becomes a linear function of the magnitude spectrum |Y(k)|, making it analytically more tractable.
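The following is a transcription of Algorithm 1 as a sketch, reusing salience_hat from the earlier code; block_upper_bound evaluates lines 12-14, and the more efficient multi-block initialization mentioned above is omitted.

```python
import numpy as np

def block_upper_bound(Ymag, t_low, t_up, fs, alpha=52.0, beta=320.0, M=20):
    """smax(q): Eq. (3) with the block-wise weight of line 12, tau at the
    block center, and delta_tau equal to the block width."""
    g_block = lambda tau, m: (fs / t_low + alpha) / (m * fs / t_up + beta)
    return salience_hat((t_low + t_up) / 2, Ymag, g_block,
                        M=M, delta_tau=t_up - t_low)

def fast_max_salience(Ymag, fs, tau_min, tau_max, tau_prec):
    """Branch-and-bound search for the maximum of s-hat (Algorithm 1)."""
    blocks = [(tau_min, tau_max)]
    smax = [block_upper_bound(Ymag, tau_min, tau_max, fs)]
    best = 0
    while blocks[best][1] - blocks[best][0] > tau_prec:
        t_low, t_up = blocks[best]
        t_mid = (t_low + t_up) / 2.0
        # Split the best block into two halves and rescore both
        blocks[best] = (t_low, t_mid)
        smax[best] = block_upper_bound(Ymag, t_low, t_mid, fs)
        blocks.append((t_mid, t_up))
        smax.append(block_upper_bound(Ymag, t_mid, t_up, fs))
        best = int(np.argmax(smax))     # search again for the best block
    t_low, t_up = blocks[best]
    return (t_low + t_up) / 2.0, smax[best]   # tau-hat and s-hat(tau-hat)
```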
The iterative estimation and cancellation proceeds as follows:

1. A residual spectrum YR(k) is initialized to equal Y(k), and a spectrum of detected sounds YD(k) to zero.

2. A fundamental period τ̂ is estimated using YR(k) and Algorithm 1. The maximum of ŝ(τ) determines τ̂.

3. The harmonic partials of τ̂ are located in YR(k) at the bins ⟨mK/τ̂⟩. The frequency and amplitude of each partial are estimated and used to calculate its magnitude spectrum at the few surrounding frequency bins. The magnitude spectrum of the m-th partial is weighted by g(τ̂, m) and added to the corresponding position of the spectrum of detected sounds, YD(k).

4. The residual spectrum is recalculated as YR(k) ← max(0, Y(k) − d YD(k)), where d controls the amount of subtraction.

5. If there are sounds remaining in YR(k), return to Step 2.

Note that the purpose of the cancellation is to suppress the harmonic and subharmonic peaks of τ̂ in ŝ(τ). This should be done in such a way that the residual spectrum YR(k) is not corrupted too much to detect the remaining sounds at the coming iterations. These conflicting requirements are effectively met by weighting the partials of a detected sound by g(τ, m) in Step 3 before adding them to YD(k). In practice this means that the higher partials are not entirely cancelled from the mixture, since g(τ, m) ≈ 1/m. The parameter d ≈ 1 together with g(τ, m) defines the amount of subtraction.

The function g(τ, m) was re-optimized for the iterative method using a similar optimization scheme as described above. Despite the double role of g(τ, m) here (affecting both the salience and the cancellation), the obtained functions g_1(τ), g_2(m), and g_3(f_{τ,m}) were very similar to those shown in Figs. 2-3, and the model (7) remains suitable.

When the number of sounds in the mixture is not given, it has to be estimated. This task, polyphony estimation, is accomplished by stopping the iteration when a newly detected sound τ̂_j at iteration j no longer increases the quantity

    S(j) = \frac{1}{j^{\gamma}} \sum_{i=1}^{j} \hat{s}(\hat{\tau}_i),    (8)

where γ = 0.70 was found empirically. (Note that S(j) would be monotonically decreasing for γ = 1, i.e. the average of the ŝ(τ̂_i), and monotonically increasing for γ = 0, i.e. their sum.) The value of j maximizing (8) is taken as the estimated polyphony P̂.
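As a rough illustration, the sketch below wires the pieces together, reusing g, salience_hat, and fast_max_salience from the earlier code. Step 3 is simplified to a single bin per partial rather than the window-spectrum translation described above, and the loop scans a fixed number of iterations while tracking the maximum of (8); cancel_spectrum, iterative_f0s, and max_poly are names of mine, not the paper's.

```python
import numpy as np

def cancel_spectrum(YR, tau, M=20):
    """Step 3 stand-in: g-weighted partial magnitudes at bins <mK/tau>.
    (The paper spreads each partial over a few surrounding bins.)"""
    K = len(YR)
    YD = np.zeros_like(YR)
    for m in range(1, M + 1):
        k = int(round(m * K / tau))
        if k > K // 2:
            break
        YD[k] = g(tau, m) * YR[k]
    return YD

def iterative_f0s(Ymag, fs, tau_min, tau_max, tau_prec,
                  d=0.89, gamma=0.70, max_poly=6):
    """Ymag: magnitude of the whitened spectrum |Y(k)|."""
    YR = Ymag.copy()                      # residual spectrum (Step 1)
    YD = np.zeros_like(Ymag)              # spectrum of detected sounds
    taus, saliences, best = [], [], (-np.inf, 0)
    for j in range(1, max_poly + 1):
        tau, s = fast_max_salience(YR, fs, tau_min, tau_max, tau_prec)  # Step 2
        taus.append(tau)
        saliences.append(s)
        YD += cancel_spectrum(YR, tau)                # Step 3
        YR = np.maximum(0.0, Ymag - d * YD)           # Step 4
        S = sum(saliences) / j ** gamma               # Eq. (8)
        if S > best[0]:
            best = (S, j)                             # track the maximum of S(j)
    return taus[:best[1]]                 # periods up to the estimated P-hat
```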
2.5. Joint estimation of multiple F0s

The described iterative multiple-F0 estimator is efficient and produces good results, but it also leaves a couple of open questions. How much does the iterative search algorithm affect the result? Is it possible to compute the saliences of the found sounds so that the order of detecting them does not affect the result? This section describes a joint estimator which answers these questions.

First, the salience function ŝ(τ) is calculated according to (3). Then, the I highest local maxima of ŝ(τ) are chosen as candidate fundamental period values τ_i, i = 1, ..., I. For each candidate i, the following quantities are computed:

• Frequency bins of the harmonic partials k_{i,m}, where m is the harmonic index and k_{i,m} corresponds to the maximum of |Y(k)| in the range κ_{τ_i,m} (see (4)).

• A candidate spectrum Z_i(k), an estimate of the spectrum of candidate i, calculated by translating the spectrum of the window function (Hanning) to the positions k_{i,m} and adding them to Z_i(k) after scaling by (d/2) g(τ_i, m), where d is the cancellation parameter from Step 4 of the iterative method.

Let us denote by P the number of simultaneous F0s to estimate and by \mathcal{I} a set of P different candidate indices i (there are \binom{I}{P} different possibilities). The joint estimation then consists of finding the index set \mathcal{I} that maximizes

    G(\mathcal{I}) = \sum_{i \in \mathcal{I}} \sum_{m} g(\tau_i, m)\,|Y(k_{i,m})| \prod_{j \in \mathcal{I} \setminus i} \big(1 - Z_j(k_{i,m})\big),    (9)

where Z_j(k) ≤ 1 because (d/2) g(τ, m) ≤ 1. By comparison with (1), it can be seen that the above goodness measure implements a similar harmonic summing model, with the difference that the salience contribution of sound i is reduced by "inhibition" (cancellation) from the other sounds j in \mathcal{I}, as determined by their estimated spectra Z_j(k). In fact, the above model is a very close equivalent of the iterative method presented above, the difference being that here the estimation is performed jointly instead of iteratively. The reason why the parameter d is halved when calculating Z_j(k) is that here all sounds inhibit all others, whereas in the iterative method only sounds detected at earlier iterations inhibit (through cancellation) those detected later.

A problem with (9) is that evaluating G(\mathcal{I}) for all \binom{I}{P} different index combinations \mathcal{I} is computationally impractical. A reasonably efficient implementation is possible by making use of a lower bound \tilde{G}(\mathcal{I}) of G(\mathcal{I}). By writing out the product in (9), it is easy to see that G(\mathcal{I}) \geq \tilde{G}(\mathcal{I}), where

    \tilde{G}(\mathcal{I}) = \sum_{i \in \mathcal{I}} \sum_{m} g(\tau_i, m)\,|Y(k_{i,m})| \Big[1 - \sum_{j \in \mathcal{I} \setminus i} Z_j(k_{i,m})\Big]
                           = \sum_{i \in \mathcal{I}} \hat{s}(\tau_i) - \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{I} \setminus i} \mathrm{Inh}(i, j),    (10)

where the "inhibition" Inh(i, j) is a non-symmetric function

    \mathrm{Inh}(i, j) = \sum_{m} g(\tau_i, m)\,|Y(k_{i,m})|\,Z_j(k_{i,m}).    (11)

From the computational viewpoint, the advantage of (10) is that the values ŝ(τ_i) and Inh(i, j) can be precomputed, which makes evaluating (10) for different index combinations \mathcal{I} an easy operation. Another crucial factor is that, due to the sparseness of Z_j(k), the lower bound \tilde{G}(\mathcal{I}_c) is actually an accurate estimate of G(\mathcal{I}_c), so that G(\mathcal{I}_c) \approx \tilde{G}(\mathcal{I}_c).

The algorithm for finding a set \mathcal{I} which maximizes (9) is:

1. Initialize I different sets \mathcal{I}_c with the individual candidates τ_i, i = 1, ..., I, so that set number c is initialized with \mathcal{I}_c ← {c} and \tilde{G}(\mathcal{I}_c) ← ŝ(τ_c).

2. Generate I × I new combinations by extending all the existing combinations with all the individual candidates τ_i. The goodness measures of the extended combinations can be computed recursively as

    \tilde{G}(\mathcal{I}_c \cup i) = \tilde{G}(\mathcal{I}_c) + \hat{s}(\tau_i) - \sum_{j \in \mathcal{I}_c} \big[\mathrm{Inh}(i, j) + \mathrm{Inh}(j, i)\big].

3. Sort the I × I extended sets in descending order of goodness and retain only the I best combinations with distinct goodness measures. The latter prevents choosing different permutations of the same set.

4. If the polyphony P has not been reached, return to Step 2.

5. Evaluate the exact goodness measure (9) for the I best combinations and output the combination with the highest value.

If the polyphony P is not given, it can be estimated by evaluating the exact goodness measure (9) for the I best combinations between Steps 3 and 4, and by storing the best j-size combination \mathcal{I}_{best(j)}. Extending the sets is continued as long as the measure

    S(j) = G(\mathcal{I}_{\mathrm{best}(j)}) / j^{\gamma}    (12)

increases, where the parameter γ = 0.73 was found empirically.
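A sketch of this search follows: with the saliences ŝ(τ_i) and the matrix Inh(i, j) of (11) precomputed (passed in as s_hat and inhib, names of mine), combinations are grown with the recursion of Step 2 and pruned to the I best at each round. Duplicate sets are removed here by keying on the sorted index tuple, which plays the role of the distinct-goodness check in Step 3; the exact re-scoring with (9) in Step 5 is omitted.

```python
def joint_f0s(s_hat, inhib, P, beam=100):
    """s_hat: saliences of the I candidates; inhib: I-by-I matrix of
    Inh(i, j) from Eq. (11). Returns the best P-candidate index set."""
    I = len(s_hat)
    # Step 1: one single-candidate set per candidate
    combos = [((c,), s_hat[c]) for c in range(I)]
    for _ in range(P - 1):
        extended = {}
        for members, G in combos:
            for i in range(I):
                if i in members:
                    continue
                key = tuple(sorted(members + (i,)))
                # Step 2: recursive update of the lower bound G-tilde
                extended[key] = G + s_hat[i] - sum(
                    inhib[i][j] + inhib[j][i] for j in members)
        # Step 3: keep only the `beam` best distinct combinations
        combos = sorted(extended.items(), key=lambda kv: -kv[1])[:beam]
    # Step 5 (exact re-scoring with Eq. (9)) is omitted in this sketch
    return max(combos, key=lambda kv: kv[1])[0]
```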
An advantage of the joint estimation method is that, contrary to the iterative system, the order of detecting the sounds does not affect the result. A drawback is that the joint estimator is computationally less efficient, as it requires evaluating the function ŝ(τ) for all τ, and therefore Algorithm 1 cannot be used. Finding the optimal combination \mathcal{I} in the above algorithm is still quite efficient when 50-100 candidates τ_i are selected from ŝ(τ), which is roughly the number needed to retain the true periods among them. In the simulations, we used I = 100.

3. Evaluation

Simulation experiments were carried out to evaluate the proposed estimators. These were compared with the reference methods [3] and [5], which are based on auditory models, and the reference method [4], which is based on spectral techniques. The test data consisted of random mixtures of musical instrument samples with F0s between 40 Hz and 2100 Hz, generated in the same way as in Sect. 2.3 but, of course, randomizing new test cases here. (For the reference method [3], F0s were restricted between 40 Hz and 530 Hz, since the method is not very robust for F0s higher than this.) The acoustic database, however, was the same, and consisted of samples from the McGill University Master Samples collection, the University of Iowa website, IRCAM Studio Online, and of recordings for the acoustic guitar. In total, there were 2842 samples from 32 musical instruments.

As estimation of the number of concurrent sounds is very difficult in itself, we evaluate F0 estimation and polyphony estimation separately. The parameter values α, β, and d were the same for all three proposed methods: 27 Hz, 320 Hz, and 1.0, respectively, for the 46 ms analysis frame, and 52 Hz, 320 Hz, and 0.89, respectively, for the 93 ms frame.

Figure 4. Multiple-F0 estimation and predominant-F0 estimation results in 46 ms and 93 ms analysis frames. Reading left to right, each stack of six thin bars corresponds to the error rates of the direct (d), iterative (i), joint (j), and reference methods [3], [4], and [5] in a certain polyphony.

Figure 4 shows the F0 estimation results of the proposed and the reference methods.
Here the number of concurrent sounds (polyphony) was given as side-information to the estimators. The error rates are practically the same for the proposed iterative and joint methods and the reference method [5], and these three outperform the methods [3] and [4]. This is a very nice result, since the best reference method [5] involves the computation of an auditory model, including e.g. Fourier transforms at 70 subbands. The proposed methods are considerably simpler and computationally more efficient. In monophonic cases (polyphony 1), about 50% of the errors are caused by F0s between 40 and 65 Hz.

The lower panels of Figure 4 show predominant-F0 estimation accuracies. Here the error rates are practically the same for the proposed direct and iterative methods and for the reference method. The accuracy of the joint method, however, is clearly better in high polyphonies.

Figure 5 illustrates the results of polyphony estimation for the iterative method and a 93 ms analysis frame. The results for the joint method were very similar and are not shown. The asterisk indicates the true polyphony in each panel, and the bars show a histogram of the estimates. The results are not fully satisfactory, and it seems that robust estimation of the number of sounds requires more than one analysis frame.

Figure 5. Histograms of polyphony estimates for the iterative method and a 93 ms analysis frame. The asterisks indicate the true polyphony (1, 2, 4, and 6, from left to right).

4. Conclusions

The principle of summing harmonic amplitudes as given by (1) is very simple, yet it suffices for predominant-F0 estimation in polyphonic signals provided that the weights g(τ, m) of the different partials and periods are appropriate. In multiple-F0 estimation, both the iterative and the joint estimator were successful, but the iterative method admits a fast implementation and is therefore more appealing. The joint estimator, in turn, achieves better predominant-F0 estimation. Both methods can be seen to implement the model embodied in the goodness measure (9), which is very simplistic considering the wide range of instruments and F0 values addressed.

References

[1] K. Kashino and H. Tanaka, "A sound source separation system with the ability of automatic tone modeling," in International Computer Music Conference, (Tokyo, Japan), 1993.
[2] A. de Cheveigné, "Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model for auditory processing," Journal of the Acoustical Society of America, vol. 93, no. 6, 1993.
[3] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, 2000.
[4] A. P. Klapuri, "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, 2003.
[5] A. P. Klapuri, "A perceptually motivated multiple-F0 estimation method for polyphonic music signals," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), 2005.
[6] M. Goto, "A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Communication, vol. 43, no. 4, pp. 311-329, 2004.
[7] M. Davy, S. Godsill, and J. Idier, "Bayesian analysis of Western tonal music," Journal of the Acoustical Society of America, vol. 119, no. 4, pp. 2498-2517, 2006.
[8] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), 2003.
[9] S. A. Abdallah and M. D. Plumbley, "Polyphonic transcription by non-negative sparse coding of power spectra," in International Conference on Music Information Retrieval, (Barcelona, Spain), pp. 318-325, Oct. 2004.
[10] G. E. Poliner and D. P. W. Ellis, "A classification approach to melody transcription," in 6th International Conference on Music Information Retrieval, (London, UK), 2005.
[11] W. J. Hess, Pitch Determination of Speech Signals. Berlin, Heidelberg: Springer, 1983.
[12] P. Walmsley, Signal Separation of Musical Instruments: Simulation-based Methods for Musical Signal Decomposition and Transcription. PhD thesis, Department of Engineering, University of Cambridge, Sept. 2000.