INSTRUMENT IDENTIFICATION IN POLYPHONIC MUSIC
Jana Eggink and Guy J. Brown
University of Sheffield

Instrument Identification
• GMMs trained with spectral features, in combination with a missing data approach
• features dominated by energy from a non-target tone are excluded from the recognition process

[System diagram: sampled audio signal → Fourier analysis → spectral features, and sampled audio signal → F0 analysis → feature mask; spectral features and feature mask feed the GMM classifier, which outputs the instrument class]

Example Features with Mask
[Figure: spectral features (frequency features × time frames) for a target tone (violin D4), a non-target tone (oboe G#4), and their mixture, each shown without and with the feature mask]

Evaluation
• trained on realistic monophonic phrases and isolated notes
• tested on examples not included in the training material
Results:
• monophonic: realistic phrases 88% correct, isolated notes 66%
• a priori masks (2 concurrent sounds): realistic phrases 74%, isolated notes 62%
• pitch-based masks (2 concurrent sounds): isolated notes 47%

GMMs with Missing Features
The probability density function (pdf) of an observed D-dimensional feature vector x is modelled as:

p(x) = \sum_{i=1}^{N} p_i \, \phi_i(x; \mu_i, \Sigma_i)

Assuming feature independence, this can be rewritten as:

p(x) = \sum_{i=1}^{N} p_i \prod_{j=1}^{D} \phi_i(x_j; m_{ij}, \sigma^2_{ij})

Approximating the pdf from the reliable data x_r only leads to:

p(x_r) = \sum_{i=1}^{N} p_i \prod_{j \in M'} \phi_i(x_j; m_{ij}, \sigma^2_{ij})

where N = number of Gaussians in the mixture model, p_i = mixture weight, \phi_i = Gaussian density, \mu_i = mean vector, \Sigma_i = covariance matrix, m_{ij} = mean, \sigma^2_{ij} = variance, M' = subset of reliable features in mask M.

Bounded Marginalisation
• even unreliable features hold some information, as the 'true' energy value cannot be higher than the observed energy value
• for unreliable features x_u:
 - use the observed energy as an upper limit
 - integrate over all possible feature values below that limit
• use reliable features x_r as before
• compute the pdf as a product of the activations based on reliable and unreliable features (a code sketch follows the Increasing Polyphony section below):

p(x_r, x_u) = \sum_{i=1}^{N} p_i \prod_{j \in M'} \phi_i(x_j; m_{ij}, \sigma^2_{ij}) \prod_{j \notin M'} \int_{-\infty}^{x_j} \phi_i(\theta; m_{ij}, \sigma^2_{ij}) \, d\theta

Results using Bounded Marginalisation
• no significant improvement for mixtures of two instrument sounds
• strongly improves results for mixtures of monophonic sound files and white noise
• results can probably be explained in terms of different energy distributions:
 - energy of harmonic tones is concentrated in their partials => strong increase of energy in a small number of features
 - energy of white noise is distributed across the whole spectrum => small increase of energy in a large number of features
• bounded marginalisation seems to improve results mainly when the difference between observed and 'true' feature values is small

HMMs
• HMMs to capture additional temporal information, especially from the onset of tones (which seems important to human listeners)
• one HMM for every F0 of every instrument
• HMMs did not work better than GMMs; no clear association of onsets with specific states
• possible causes: wrong features (too coarse), or too much variation during the onset

Increasing Polyphony
• mixtures of 3 and 4 instruments with equal rms power, resulting in a negative SNR for the target tone
• results remain poor even with a priori masks
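A minimal sketch of the scoring rule above, assuming per-feature (diagonal) Gaussians as implied by the independence assumption; the function name, array layout, and use of numpy/scipy are illustrative choices, not the authors' implementation. Reliable features contribute their Gaussian density; unreliable features contribute the integral up to the observed value, which is the Gaussian CDF at that upper limit.

```python
import numpy as np
from scipy.stats import norm

def missing_data_log_likelihood(x, reliable, weights, means, variances):
    """Score one feature vector against a diagonal-covariance GMM
    using the missing-data approach with bounded marginalisation.

    x         : observed feature vector, shape (D,)
    reliable  : boolean mask, True where a feature is in M' (reliable)
    weights   : mixture weights p_i, shape (N,)
    means     : component means m_ij, shape (N, D)
    variances : component variances sigma^2_ij, shape (N, D)
    """
    std = np.sqrt(variances)
    # Reliable features: ordinary Gaussian log-density phi_i(x_j).
    log_pdf = norm.logpdf(x, loc=means, scale=std)   # shape (N, D)
    # Unreliable features: integrate phi_i from -inf up to the observed
    # energy, i.e. the Gaussian log-CDF at the observed upper limit.
    log_cdf = norm.logcdf(x, loc=means, scale=std)   # shape (N, D)
    per_feature = np.where(reliable, log_pdf, log_cdf)
    # Product over features and weighted sum over components, in log domain.
    per_component = np.log(weights) + per_feature.sum(axis=1)
    peak = per_component.max()
    return peak + np.log(np.sum(np.exp(per_component - peak)))
```

Plain marginalisation (reliable features only) corresponds to replacing the log-CDF term with 0; classification then picks the instrument class whose GMM gives the highest score.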
Identifying Solo Instruments with Accompaniment
• identify the most prominent instrument
• instrument or voice?
• useful for:
 - automatic indexing
 - information retrieval
 - listening to a representative soundclip
 - preprocessing for lyrics recognition

Existing Systems
Locating singing voice segments within music signals (Berenzweig & Ellis, 2001)
• assumption: singing voice, even with instrumental background, is closer to speech than purely instrumental music
• ANN trained on normal speech to distinguish between 56 phones
• classifiers trained on the posterior probability features generated by the ANN did not perform better than classifiers trained directly on the cepstral features
• binary decision (singing voice present?): ~80% correct
• no attempt at speech recognition

Our Work
• same system as for mixtures of two sounds
• a priori masks for mixtures of monophonic instrument phrases and unrelated orchestra/piano pieces give good results
• but: reliable identification of all F0s within a harmonically rich orchestra/piano piece is currently not possible, although it is needed for the missing feature approach

Peak-based Recognition
• partials cause spectral peaks, and solo instruments are generally louder than the accompaniment
• use only spectral peaks for recognition; the most prominent F0 and its corresponding harmonic overtone series are easier to identify than all background F0s
• same system as before, with recognition based on peaks only: not very good
• trained on peaks only, by setting all features without the presence of a partial to 0:
 - recognition results for monophonic examples are lower than with full spectral features
 - but relatively stable in the presence of accompaniment

New Features
• peak-based training and recognition seems promising for solo instrument + accompaniment
• but: better features are needed
• new features: exact frequency and power of the first 15 harmonics (set to 0 if no harmonic is found), plus deltas and delta-deltas (see the extraction sketch after Future Work)
• one GMM trained for every F0 of every instrument
• slightly better than the old system for isolated tones (C4-C5), 67% correct
• some problems with realistic phrases, apparently due to errors in F0 estimation; slow pieces with clear, easily identifiable F0s work better than fast pieces with short notes

Polyphonic Recognition
• limited results so far, but looks promising
• identification of the most prominent harmonic series is not easy

Future Work
• include models trained on singing voice
• integrate multisource decoding (Jon Barker, Helsinki meeting)
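To make the new feature set concrete: a sketch of extracting the frequency and power of the first 15 harmonics from one magnitude-spectrum frame, given an estimated F0. The function name, the search-window width, and the simple peak picking are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def harmonic_features(spectrum, freqs, f0, n_harmonics=15, tol=0.03):
    """Frequency and power of the first n_harmonics of f0.

    spectrum : magnitude spectrum of one analysis frame, shape (K,)
    freqs    : centre frequency of each FFT bin in Hz, shape (K,)
    f0       : estimated fundamental frequency in Hz
    tol      : half-width of the search window around each expected
               harmonic, as a fraction of f0

    Features stay 0 where no harmonic is found, matching the
    description above.
    """
    features = np.zeros(2 * n_harmonics)
    for h in range(1, n_harmonics + 1):
        expected = h * f0
        in_window = np.abs(freqs - expected) < tol * f0
        if not in_window.any():
            continue  # e.g. harmonic above the Nyquist frequency
        bins = np.flatnonzero(in_window)
        peak = bins[np.argmax(spectrum[bins])]
        features[2 * (h - 1)] = freqs[peak]              # exact frequency
        features[2 * (h - 1) + 1] = spectrum[peak] ** 2  # power
    return features
```

Deltas and delta-deltas would then be frame-to-frame differences (and differences of those differences) of these vectors across time.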