INSTRUMENT IDENTIFICATION IN POLYPHONIC MUSIC
Jana Eggink and Guy J. Brown
University of Sheffield
Instrument Identification
• GMMs trained with spectral features, in combination with a missing data approach
• features dominated by energy from a non-target tone are excluded from the recognition process (see the mask sketch after the diagram below)
[System diagram: sampled audio signal → Fourier analysis → spectral features, and → F0 analysis → feature mask; spectral features and feature mask feed the GMM classifier, which outputs the instrument class]
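The slides do not spell out how the feature mask is built; below is a minimal Python sketch of one way a pitch-based mask could be derived from estimated F0s, marking features that fall on harmonics of a non-target tone (and not of the target tone) as unreliable. The function name, tolerance, and harmonic count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pitch_based_mask(freqs, target_f0, other_f0s, n_harmonics=20, tol=0.03):
    """Hypothetical sketch: a feature is reliable if it lies on a harmonic
    of the target tone, or on no harmonic of any non-target tone.
    freqs: centre frequencies (Hz) of the spectral features."""
    def near_harmonic(f, f0):
        h = np.round(f / f0)                        # nearest harmonic number
        return (h >= 1) & (h <= n_harmonics) & (np.abs(f - h * f0) < tol * f)
    target_hit = near_harmonic(freqs, target_f0)
    other_hit = np.zeros_like(target_hit)
    for f0 in other_f0s:
        other_hit |= near_harmonic(freqs, f0)
    # exclude features dominated by a non-target tone's harmonics
    return target_hit | ~other_hit

freqs = np.linspace(50, 8000, 128)                  # feature centre frequencies
mask = pitch_based_mask(freqs, target_f0=293.7,     # violin D4
                        other_f0s=[415.3])          # oboe G#4
```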
Example Features with Mask
[Figure: spectrograms (frequency features vs. time frames) of the target tone (violin D4), the non-target tone (oboe G#4), and their mixture, each shown with and without the feature mask applied]
Evaluation
• trained on realistic monophonic phrases and isolated notes
• tested on examples not included in the training material
results:
• monophonic: realistic phrases 88% correct, isolated notes 66%
• a priori masks (2 concurrent sounds): realistic phrases 74%, isolated notes 62%
• pitch-based masks (2 concurrent sounds): isolated notes 47%
GMMs with Missing Features
The probability density function (pdf) of an observed D-dimensional feature vector x is modelled as:

$$p(x) = \sum_{i=1}^{N} p_i \, \Phi_i(x; \mu_i, \Sigma_i)$$

Assuming feature independence, this can be rewritten as:

$$p(x) = \sum_{i=1}^{N} p_i \prod_{j=1}^{D} \Phi_i(x_j; m_{ij}, \sigma^2_{ij})$$

Approximating the pdf from the reliable data x_r only leads to:

$$p(x_r) = \sum_{i=1}^{N} p_i \prod_{j \in M'} \Phi_i(x_j; m_{ij}, \sigma^2_{ij})$$

N = number of Gaussians in the mixture model, p_i = mixture weight, Φ_i = univariate Gaussian, μ_i = mean vector, m_ij = mean, Σ_i = covariance matrix, σ_ij = standard deviation, M' = subset of reliable features in mask M
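As a concrete illustration, here is a minimal numpy/scipy sketch of this marginalised likelihood for a diagonal-covariance GMM; the function name and array layout are assumptions made for the example, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def gmm_loglik_reliable(x, mask, weights, means, stds):
    """Log-likelihood of one feature vector x under a diagonal-covariance
    GMM, using only the reliable features (the set M' where mask is True).
    weights: (N,), means/stds: (N, D), x/mask: (D,)."""
    r = mask
    # per-component log of prod_{j in M'} Phi(x_j; m_ij, sigma_ij)
    comp = norm.logpdf(x[r], means[:, r], stds[:, r]).sum(axis=1)
    # log of sum_i p_i * exp(comp_i)
    return np.logaddexp.reduce(np.log(weights) + comp)
```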
Bounded Marginalisation
• even unreliable features hold some information, as the 'true' energy value cannot be higher than the observed energy value
• for unreliable features x_u:
- use the observed energy as an upper limit
- integrate over all possible feature values below that limit
• use reliable features x_r as before
• compute the pdf as a product of the activation based on reliable and unreliable features:

$$p(x) = \sum_{i=1}^{N} p_i \prod_{j \in M'} \Phi_i(x_j; m_{ij}, \sigma^2_{ij}) \prod_{j \notin M'} \int_{-\infty}^{x_j} \Phi_i(\theta; m_{ij}, \sigma^2_{ij}) \, d\theta$$
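A sketch of the bounded version, extending the marginalised likelihood above: the integral up to the observed value is the Gaussian cdf, so each unreliable feature contributes a logcdf term (again an illustrative implementation under the same assumed array layout, not the authors' code).

```python
import numpy as np
from scipy.stats import norm

def gmm_loglik_bounded(x, mask, weights, means, stds):
    """Bounded marginalisation: reliable features (mask=True) use the
    Gaussian pdf as before; unreliable features use the cdf up to the
    observed energy, i.e. the integral over all values below that limit."""
    r, u = mask, ~mask
    comp = norm.logpdf(x[r], means[:, r], stds[:, r]).sum(axis=1)
    comp += norm.logcdf(x[u], means[:, u], stds[:, u]).sum(axis=1)
    return np.logaddexp.reduce(np.log(weights) + comp)
```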
Results using Bounded Marginalisation
• no significant improvement for mixtures of two instrument sounds
• strongly improves results for mixtures of monophonic sound files and white noise
• results can probably be explained in terms of different energy distributions:
• energy of harmonic tones is concentrated in their partials
=> strong increase of energy in a small number of features
• energy of white noise is distributed across the whole spectrum
=> small increase of energy in a large number of features
• bounded marginalisation seems to improve results mainly when the difference between observed and 'true' feature values is small
HMMs
• HMMs used to capture additional temporal information, especially from the onset of tones (which seems important to humans)
• one HMM for every F0 of every instrument
• HMMs did not work better than GMMs; no clear association of onsets with specific states
• possible reasons: wrong features (too coarse), or too much change during the onset
Increasing Polyphony
• mixtures of 3 and 4 instruments with equal RMS power, resulting in a negative SNR for the target tone
• results pretty bad even with a priori masks
Identifying Solo Instruments with Accompaniment
• identify the most prominent instrument
• instrument or voice?
• useful for
• automatic indexing
• information retrieval
• listening to a representative sound clip
• preprocessing for lyrics recognition
Existing Systems
Locating singing voice segments within music signals (Berenzweig & Ellis, 2001)
• assumption: singing voice, even with instrumental background, is closer to speech than purely instrumental music
• ANN trained on normal speech to distinguish between 56 phones
• classifiers trained on posterior probability features generated by the ANN did not perform better than classifiers trained directly on the cepstral features
• binary decision (singing voice present?): ~80% correct
• no attempt at speech recognition
Our Work
• same system as for mixtures of two sounds
• a priori masks for mixtures of monophonic instrument phrases and unrelated orchestra/piano pieces give good results
• but: reliable identification of all F0s within a harmonically rich orchestra/piano piece is currently not possible, yet it is needed for the missing feature approach
Peak-based Recognition
• partials cause spectral peaks, and solo instruments are generally louder than the accompaniment
• use only spectral peaks for recognition; the most prominent F0 and its corresponding harmonic overtone series are easier to identify than all background F0s
• same system as before, recognition based on peaks only: not very good
• trained on peaks only by setting all features without the presence of a partial to 0 (a sketch follows this list):
• recognition results for monophonic examples lower than with spectral features
• but relatively stable in the presence of accompaniment
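A minimal sketch of the peak-only idea: keep each local maximum of the spectral feature vector and zero the remaining features, mirroring training on peaks by setting non-partial features to 0. The peak criterion here is an assumption for illustration.

```python
import numpy as np

def peak_only_features(spectrum):
    """Keep only local maxima of the spectral feature vector and set all
    other features to 0, mimicking peak-based training and recognition."""
    s = np.asarray(spectrum, dtype=float)
    peaks = np.zeros(s.shape, dtype=bool)
    peaks[1:-1] = (s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])
    return np.where(peaks, s, 0.0)
```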
New Features
• peak-based training and recognition seems promising for solo instrument + accompaniment
• but: better features needed
• new features: exact frequency and power of the first 15 harmonics (set to 0 if no harmonic is found), plus deltas and delta-deltas (sketched below)
• one GMM trained for every F0 of every instrument
• slightly better than the old system for isolated tones (C4-C5), 67% correct
• some problems with realistic phrases, apparently due to errors in F0 estimation; slow pieces with clear and easily identifiable F0s work better than fast pieces with short notes
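The slides do not give the extraction procedure; the following sketch shows one plausible way to compute such a feature vector (exact frequency and power of the first 15 harmonics, 0 where no peak is found, plus deltas). The search tolerance and feature layout are assumptions.

```python
import numpy as np

def harmonic_features(spectrum, freqs, f0, n_harm=15, tol=0.03):
    """For each of the first n_harm harmonics of the estimated F0, store
    the exact frequency and power of the strongest spectral bin near the
    predicted harmonic, or 0 if none lies within the tolerance.
    spectrum/freqs: power spectrum and its bin frequencies (Hz) as arrays."""
    feat = np.zeros(2 * n_harm)
    for h in range(1, n_harm + 1):
        target = h * f0
        win = np.abs(freqs - target) < tol * target
        if win.any():
            k = np.argmax(np.where(win, spectrum, -np.inf))
            feat[2 * (h - 1)] = freqs[k]          # exact harmonic frequency
            feat[2 * (h - 1) + 1] = spectrum[k]   # power at that harmonic
    return feat

def deltas(frames):
    """First-order frame-to-frame differences over a (time, features)
    array; applying this twice yields the delta-deltas."""
    frames = np.asarray(frames, dtype=float)
    return np.diff(frames, axis=0, prepend=frames[:1])
```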
Polyphonic Recognition
• limited results so far, but looks promising
• identification of the most prominent harmonic series is not easy
Future Work:
• include models trained on singing voice
• integrate multisource decoding (Jon Barker, Helsinki meeting)