www.kdd.uncc.edu
Polyphonic music information retrieval
based on multi-label cascade classification system
http://www.mir.uncc.edu
presented by
Zbigniew W. Ras
University of North Carolina, Charlotte, NC
College of Computing and Informatics
Student: Wenxin Jiang
Advisor: Dr. Zbigniew W. Ras
Survey of MIR - http://mirsystems.info/
43 MIR systems
Most are based on pitch estimation for melody and rhythm matching.
This presentation will focus on timbre estimation.
MIRAI - Musical Database (mostly MUMS)
[music pieces played by 59 different music instruments]
Goal: Design and Implement a System for
Automatic Indexing of Music by Instruments
(objective task) and Emotions (subjective task)
Outcome:
Musical Database
[music pieces indexed by instruments and emotions].
The resulting database will be represented as an FS-tree,
guaranteeing efficient storage and retrieval.
Automatic Indexing of Music
What is needed?
Database of monophonic and polyphonic music signals and
their descriptions in terms of new features (including temporal)
in addition to the standard MPEG7 features.
These signals are labeled by instruments and emotions
forming additional features called decision features.
Why is it needed?
To build classifiers for automatic indexing of musical
sound by instruments and emotions.
MIRAI - Cooperative Music Information Retrieval
System based on Automatic Indexing
[Flowchart: the user's query goes through a query adapter to the indexed
audio database (music objects with instruments and durations); an empty
answer is routed back to the query adapter for reformulation]
Raw data - signal representation
Binary file, PCM:
- sampling rate: 44.1 kHz
- 16 bits per sample
- 2,646,000 values/min.
PCM (Pulse Code Modulation) - the most straightforward mechanism to store audio.
Analog audio is sampled and the individual samples are stored sequentially in binary format.
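A minimal sketch of what this raw representation looks like in practice, assuming Python's standard wave module and a hypothetical mono 16-bit, 44.1 kHz file:

```python
import wave
import numpy as np

# Reading raw PCM samples from a WAV file (file name hypothetical; a mono
# 16-bit 44.1 kHz recording is assumed, matching the slide).
with wave.open("example.wav", "rb") as wav:
    rate = wav.getframerate()                 # 44100 samples per second
    raw = wav.readframes(wav.getnframes())

# 16-bit PCM: each sample is a signed little-endian integer.
samples = np.frombuffer(raw, dtype=np.int16)

# One minute of mono audio at 44.1 kHz gives 44100 * 60 = 2,646,000 values.
print(rate * 60)
```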
Challenges to applying KDD in MIR
The nature and types of raw data:

Data source      | Organization | Volume     | Type                  | Quality
Traditional data | Structured   | Modest     | Discrete, categorical | Clean
Audio data       | Unstructured | Very large | Continuous, numeric   | Noisy
Feature extraction
[Diagram: amplitude values at each sample point (lower-level raw data form)
-> feature extraction -> higher-level, manageable representations
-> feature database -> traditional pattern recognition
(classification, clustering, regression)]
MPEG7 features
[Diagram: signal -> Hamming window -> STFT (NFFT FFT points) -> power spectrum;
from the signal envelope: Log Attack Time, Temporal Centroid;
from the power spectrum: Spectral Centroid, harmonic peaks detection ->
fundamental frequency and the instantaneous Harmonic Spectral
Centroid / Spread / Deviation / Variation]
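A minimal sketch of the windowing/STFT stage of this pipeline, computing one of the listed descriptors (the spectral centroid) with NumPy; the frame length and sampling rate are assumptions:

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """One step of the pipeline above: Hamming window -> FFT -> power
    spectrum -> spectral centroid. Frame length and rate are assumptions."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.sum(freqs * power) / np.sum(power)

# Synthetic 440 Hz tone, one 4096-sample frame at 44.1 kHz.
t = np.arange(4096) / 44100.0
print(spectral_centroid(np.sin(2 * np.pi * 440.0 * t), 44100))  # ~440 Hz
```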
Derived Database

Extended MPEG7 features:
Feature                    | Total
Harmonic Upper Limit       | 1
Harmonic Ratio             | 1
Basis Functions            | 190
Log Attack Time            | 1
Temporal Centroid          | 1
Spectral Centroid          | 1
Spectrum Centroid/Spread I | 2
Harmonic Parameters        | 4
Flatness                   | 24x4
Total                      | 297

Other features & new features:
Feature                     | Durations | Sub-Total | Total
Tristimulus Parameters      | 4         | 10        | 40
Spectrum Centroid/Spread II | 4         | 2         | 8
Flux                        | 4         | 1         | 4
Roll Off                    | 4         | 1         | 4
Zero Crossing               | 4         | 1         | 4
MFCC                        | 4         | 4x13      | 208
Spectrum Centroid/Spread I  | 3         | 2         | 6
Harmonic Parameters         | 3         | 4         | 12
Flatness                    | 3         | 4x24      | 288
Durations                   | 3         | 1         | 3
Total                       |           |           | 577
Hierarchical Classification
Schema I
Schema II - Hornbostel/Sachs
[Tree diagram: families Idiophone, Membranophone, Aerophone, Chordophone;
recoverable Aerophone branches include Lip Vibration (C Trumpet, French
Horn, Tuba), Single Reed, Side (Flute, Alto Flute), plus Free and Whip
nodes; Bassoon and Oboe appear as double-reed leaves]
Schema III - Play Methods
[Tree diagram: play methods Blow, Bowed, Muted, Picked, Pizzicato, Shaken,
with leaves such as Alto Flute, Flute, Piccolo and Bassoon]
Database Table

Obj | Classification Attributes CA1 ... CAn | Hornbostel Sachs                 | Play Method
1   | 0.22 ... 0.28                         | [Aerophone, Side, Alto Flute]    | [Blown, Alto Flute]
2   | 0.31 ... 0.77                         | [Idiophone, Concussion, Bell]    | [Concussive, Bell]
3   | 0.05 ... 0.21                         | [Chordophone, Composite, Cello]  | [Bowed, Cello]
4   | 0.12 ... 0.11                         | [Chordophone, Composite, Violin] | [Martele, Violin]

(Hornbostel Sachs and Play Method are the decision attributes.)
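A small sketch of how such hierarchical decision values could be stored and queried per level, using the rows of the table above; the helper itself is hypothetical:

```python
# Hierarchical decision values from the table above, stored as label paths.
# The helper below is hypothetical, showing how a cascade classifier could
# read off the decision label at any level of the hierarchy.
objects = {
    1: ["Aerophone", "Side", "Alto Flute"],
    2: ["Idiophone", "Concussion", "Bell"],
    3: ["Chordophone", "Composite", "Cello"],
    4: ["Chordophone", "Composite", "Violin"],
}

def labels_at_level(objs, level):
    """Decision label of each object at one level (0 = instrument family)."""
    return {obj: path[level] for obj, path in objs.items() if level < len(path)}

print(labels_at_level(objects, 0))  # {1: 'Aerophone', 2: 'Idiophone', ...}
print(labels_at_level(objects, 2))  # {1: 'Alto Flute', 2: 'Bell', ...}
```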
Example
[Tree diagrams: values of attributes c and d form two-level hierarchies;
Level I nodes c[1], c[2] with children c[2,1], c[2,2] at Level II, and
d[1], d[2], d[3] with children d[3,1], d[3,2]]

X  | a    | b    | c      | d
x1 | a[1] | b[2] | c[1]   | d[3]
x2 | a[1] | b[1] | c[1]   | d[3,1]
x3 | a[1] | b[2] | c[2,2] | d[1]
x4 | a[2] | b[2] | c[2]   | d[1]

a, b, c are classification attributes; d is the decision attribute.
Classification
- 90% training, 10% testing
- 10 folds
- Hierarchical (Schema I) vs. non-hierarchical
- Compare different classifiers (see the sketch below):
  - J48 tree
  - Naïve Bayesian
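A hedged sketch of this protocol with scikit-learn, where DecisionTreeClassifier stands in for WEKA's J48 and GaussianNB for the Naïve Bayesian classifier; the data here is synthetic placeholder data, not the MIRAI feature database:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data: 577 features per object, as in the derived
# database; labels stand in for instrument families.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 577))
y = rng.integers(0, 4, size=300)

# 10-fold cross-validation: each fold trains on 90% and tests on 10%.
for clf in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    scores = cross_val_score(clf, X, y, cv=10)
    print(type(clf).__name__, round(scores.mean(), 4))
```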
Results of the non-hierarchical classification

Features | J48-Tree | Naïve Bayesian
All      | 70.4923% | 68.5647%
MPEG7    | 65.7256% | 56.9824%
Results of the hierarchical classification (Schema I) with MPEG7 features

Node       | J48-Tree | Naïve Bayesian
Family     | 86.434%  | 64.7041%
No-pitch   | 73.7299% | 66.2949%
Percussion | 85.2484% | 84.9379%
String     | 72.4272% | 61.8447%
Wind       | 67.8133% | 67.8133%
Results of the hierarchical classification (Schema I) with all features

Node       | J48-Tree | Naïve Bayesian
Family     | 91.726%  | 72.6868%
No-pitch   | 77.943%  | 75.2169%
Percussion | 86.0465% | 88.3721%
String     | 76.669%  | 66.6021%
Woodwind   | 75.761%  | 78.0158%
Classification Results (J48-Tree)

              | With new features   | Without new features
Instrument    | Accuracy | Recall   | Accuracy | Recall
Con-clarinet  | 100.0    | 60.0     | 83.3     | 100.0
Electric bass | 100.0    | 73.3     | 93.3     | 93.3
Flute         | 100.0    | 50.0     | 60.0     | 75.0
Steel Drums   | 100.0    | 66.7     | 50.0     | 66.7
Tuba          | 100.0    | 100.0    | 100.0    | 85.7
Vibraphone    | 87.5     | 93.3     | 78.6     | 73.3
Cello         | 87.0     | 95.2     | 86.7     | 61.9
Violin        | 84.0     | 77.8     | 66.7     | 59.3
Piccolo       | 83.3     | 50.0     | 60.0     | 60.0
Marimba       | 82.4     | 87.5     | 83.3     | 93.8
C trumpet     | 81.3     | 76.5     | 87.5     | 82.4
Alto Flute    | 80.0     | 80.0     | 80.0     | 80.0
English Horn  | 80.0     | 57.1     | 42.9     | 42.9
Polyphonic sounds – how to handle?
1. Single-label classification based on sound separation
2. Multi-labeled classifiers
Problems?
[Sound Separation Flowchart: polyphonic sound -> segmentation -> get frame ->
feature extraction -> classifier -> get instrument -> sound separation]
Problem: information loss during the signal subtraction.
This presentation will focus on timbre estimation in polyphonic sounds
and designing multi-labeled classifiers
Timbre-relevant descriptors:
- Spectrum Centroid, Spread
- Spectrum Flatness Band Coefficients
- Harmonic Peaks
- Mel frequency cepstral coefficients (MFCC)
- Tristimulus
- Sub-pattern of a single instrument in a mixture
Timbre estimation based on multi-label classifier
[Flowchart: polyphonic sound -> 40 ms segmentation -> feature extraction
(acoustic descriptors) -> single-label database -> classifier ->
instrument candidates with confidences, e.g. Candidate 1: 70%,
Candidate 2: 50%, ..., Candidate N: 10%]
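A sketch of how such a per-frame candidate list with confidences might be produced, assuming a scikit-learn-style classifier exposing predict_proba; the 0.4 threshold mentioned later in the slides is used as the default:

```python
import numpy as np

def candidate_labels(classifier, frame_features, threshold=0.4):
    """Rank instrument candidates for one frame by classifier confidence,
    keeping those above the threshold. Works with any scikit-learn-style
    classifier exposing predict_proba (an assumption about the setup)."""
    probs = classifier.predict_proba(np.asarray([frame_features]))[0]
    candidates = [(label, p) for label, p in zip(classifier.classes_, probs)
                  if p >= threshold]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```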
Flowchart of multi-label classification system
[Flowchart: polyphonic sound -> get frame -> feature extraction ->
perform multiple classifying -> multiple labels Ci, ..., Cj ->
voting process based on context -> get final winners, repeated until
all frames are estimated]
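One plausible reading of the voting step, sketched in Python: per-frame candidate confidences are summed across the analysis window and the top labels win. The exact weighting the system uses is not specified on the slide, so this is an assumption:

```python
from collections import Counter

def vote_over_frames(frame_candidates, n_winners=2):
    """Sum candidate confidences over all frames of the analysis window
    and return the top labels as final winners."""
    totals = Counter()
    for candidates in frame_candidates:       # one candidate list per frame
        for label, confidence in candidates:
            totals[label] += confidence
    return [label for label, _ in totals.most_common(n_winners)]

print(vote_over_frames([[("flute", 0.7), ("violin", 0.5)],
                        [("flute", 0.6), ("cello", 0.4)]]))
# ['flute', 'violin']
```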
Timbre Estimation Results based on different methods
[Instruments - 45, Training Data (TD) - 2917 single instr. sounds from MUMS, Testing
on 308 mixed sounds randomly chosen from TD, window size – 1 sec, frame size –
120ms, hop size – 40ms, MFCC extracted from each frame (following MPEG-7)]
Separation vs. non-separation; single label vs. multiple labels:
[Bar chart comparing recall for separation (SP) vs. non-separation (NS)
runs with 1, 2, and 4 labels; values as in the table below]
Experiment # | Pitch-based | Sound Separation | N(Labels) max | Recall | Precision | F-score
1            | Yes         | Yes              | 1             | 54.55% | 39.2%     | 45.60%
2            | Yes         | Yes              | 2             | 61.20% | 38.1%     | 46.96%
3            | Yes         | No               | 2             | 64.28% | 44.8%     | 52.81%
4            | Yes         | No               | 4             | 67.69% | 37.9%     | 48.60%
5            | Yes         | No               | 8             | 68.3%  | 36.9%     | 47.91%
A confidence threshold of 0.4 controls the total number of estimations for each index window.
Compressed representations of the signal: Harmonic Peaks, Mel Frequency
Cepstral Coefficients (MFCC), Spectral Flatness, ...
Irrelevant information (inharmonic frequencies or partials) is removed.
Violin and viola have similar MFCC patterns, as do double bass and guitar,
so they are difficult to distinguish in polyphonic sounds.
More information from the raw signal is needed.
Short-Term Power Spectrum - a low-level representation of the signal (calculated by STFT).
Spectrum slice - 0.12 seconds long.
The power-spectrum patterns of flute and trombone remain visible in the mixture.
Experiment:
Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).
Training set:
Power spectra from 3323 frames, extracted by STFT from 26 single-instrument
sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet,
E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn,
vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet,
bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.
Testing set:
Fifty-two audio files, each mixed (using Sound Forge) from two of these 26
single-instrument sounds.
Classifiers:
(1) KNN with Euclidean distance (spectrum-match-based classification);
(2) Decision Tree (multi-label classification based on previously extracted features).
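A sketch of classifier (1), spectrum match via KNN with Euclidean distance, using scikit-learn; the synthetic arrays below are placeholders for the training frames described above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Spectrum match as KNN with Euclidean distance over short-term power
# spectra. Synthetic placeholder data stands in for the 3323 training
# frames and 26 instrument labels described above.
rng = np.random.default_rng(0)
train_spectra = rng.random((3323, 2048))       # one power spectrum per frame
train_labels = rng.integers(0, 26, size=3323)  # one of 26 instruments

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(train_spectra, train_labels)

test_frame = rng.random((1, 2048))             # one 0.12 s frame's spectrum
print(knn.predict(test_frame))
```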
Timbre Pattern Match Based on Power Spectrum
Spectrum-based vs. feature-based:
[Bar chart: Feature-based 64.28%; Spectrum Match, k=1: 79.41%;
Spectrum Match, k=5: 82.43%; Spectrum Match, k=5 (without percussion): 87.10%]
Experiment # | Description                                         | Recall | Precision | F-score
1            | Feature-based + Decision Tree (n=2)                 | 64.28% | 44.8%     | 52.81%
2            | Spectrum Match + KNN (k=1; n=2)                     | 79.41% | 50.8%     | 61.96%
3            | Spectrum Match + KNN (k=5; n=2)                     | 82.43% | 45.8%     | 58.88%
4            | Spectrum Match + KNN (k=5; n=2), without percussion | 87.1%  |           |

n - number of labels assigned to each frame; k - parameter for KNN.
Hierarchical structure
[Tree diagram with instrument leaves such as Flute, English Horn, Viola, Violin]
Instrument-granularity classifiers are trained at
each level of the hierarchical tree (Hornbostel/Sachs).
Modules of the cascade classifier for single instrument estimation - Hornbostel/Sachs
[Diagram, pitch 3B example: the product of the cascade's per-level
confidences, 96.02% × 98.94% = 95.00%, exceeds the 91.80% confidence of
the direct, non-cascade classifier]
New Experiment:
Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).
Training set:
2762 frames extracted from the following instrument sounds:
electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet,
E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn,
vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet,
bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.
Classifiers - WEKA:
(1) KNN with Euclidean distance (spectrum-match-based classification);
(2) Decision Tree (classification based on previously extracted features).
Confidence:
ratio of correctly classified instances to the total number of instances.
Classification on different Feature Groups

Group | Feature description                                        | KNN    | Decision Tree
A     | 33 Spectrum Flatness Band Coefficients                     | 99.23% | 94.69%
B     | 13 MFCC coefficients                                       | 98.19% | 93.57%
C     | 28 Harmonic Peaks                                          | 86.60% | 91.29%
D     | 38 Spectrum projection coefficients                        | 47.45% | 31.81%
E     | Log spectral centroid, spread, flux, rolloff, zerocrossing | 99.34% | 99.77%

(Values are classifier confidences.)
Feature and classifier selection at each level of cascade system
Top level: Band Coefficients + KNN

Level 1 node      | Feature           | Classifier
chordophone       | Band Coefficients | KNN
aerophone         | MFCC coefficients | KNN
idiophone         | Band Coefficients | KNN

Level 2 node      | Feature           | Classifier
chrd_composite    | Band Coefficients | KNN
aero_double-reed  | MFCC coefficients | KNN
aero_lip-vibrated | MFCC coefficients | KNN
aero_side         | MFCC coefficients | KNN
aero_single-reed  | Band Coefficients | Decision Tree
idio_struck       | Band Coefficients | KNN
Classification on the combination
of different feature groups
Classification based on KNN
Classification based on Decision Tree
From these two experiments, we see that:
1) The KNN classifier works better with feature vectors
such as spectral flatness coefficients,
projection coefficients and MFCC.
2) The decision tree works better with harmonic peaks
and statistical features.
Simply adding more features together does not improve
the classifiers and sometimes even worsens classification
results (such as adding harmonic peaks to other feature groups).
Feature and classifier selection at each level of Cascade
System - Hornbostel/Sachs hierarchical tree
Feature and classifier selection at top level
Feature and classifier selection at second level
Feature and classifier selection at third level
Feature and Classifier Selection

Node        | Feature               | Classifier
chordophone | Flatness coefficients | KNN
aerophone   | MFCC coefficients     | KNN
idiophone   | Flatness coefficients | KNN

Feature and Classifier Selection Table for Level 1
Node              | Feature               | Classifier
chrd_composite    | Flatness coefficients | KNN
aero_double-reed  | MFCC coefficients     | KNN
aero_lip-vibrated | MFCC coefficients     | KNN
aero_side         | MFCC coefficients     | KNN
aero_single-reed  | Flatness coefficients | Decision Tree
idio_struck       | Flatness coefficients | KNN

Feature and Classifier Selection Table for Level 2
HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS
Common methods to calculate the distance or similarity between clusters:
single linkage (nearest neighbor), complete linkage (furthest neighbor),
unweighted pair-group method using arithmetic averages (UPGMA),
weighted pair-group method using arithmetic averages (WPGMA),
unweighted pair-group method using the centroid average (UPGMC),
weighted pair-group method using the centroid average (WPGMC),
and Ward's method.
Most common distance functions: Euclidean, Manhattan, Canberra
(examines the sum of a series of fractional differences between the
coordinates of a pair of objects), Pearson correlation coefficient (PCC),
which measures the degree of association between objects, and Spearman's
rank correlation coefficient.
Clustering algorithm: HCLUST (agglomerative hierarchical clustering) -
R package.
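The slides use R's hclust; an equivalent agglomerative clustering sketch with SciPy, where the feature matrix shape and cluster count are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Feature matrix: one row per 0.12 s frame (shape is an assumption).
frames = np.random.default_rng(0).normal(size=(100, 33))

# Ward linkage with Euclidean distance (the Pearson/Kendall metrics from
# the slides would need a precomputed condensed distance matrix instead).
tree = linkage(frames, method="ward", metric="euclidean")

# Cut the dendrogram into w clusters; each frame gets a cluster ID.
cluster_ids = fcluster(tree, t=37, criterion="maxclust")
```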
Testing Datasets (MFCC, flatness coefficients, harmonic peaks):
The middle C pitch group, which contains 46 different musical sound objects.
Each sound object is segmented into multiple 0.12 s frames and
each frame is stored as an instance in the testing dataset.
There are 2884 frames in total.
We also extract three different features (MFCC, flatness coefficients,
and harmonic peaks) from those sound objects.
Each feature produces one dataset of 2884 frames for clustering.
Clustering:
When the algorithm finishes the clustering process,
a cluster ID is assigned to each frame.
Contingency table derived from the clustering result:

             | Cluster 1 | ... | Cluster j | ... | Cluster n
Instrument 1 | X11       | ... | X1j       | ... | X1n
Instrument i | Xi1       | ... | Xij       | ... | Xin
Instrument n | Xn1       | ... | Xnj       | ... | Xnn
Evaluation result of the Hclust algorithm (the 14 results with the highest
score among 126 experiments):

Feature               | Method   | Metric    | α     | w  | Score
Flatness Coefficients | ward     | pearson   | 87.3% | 37 | 32.30
Flatness Coefficients | ward     | euclidean | 85.8% | 37 | 31.74
Flatness Coefficients | ward     | manhattan | 85.6% | 36 | 30.83
mfcc                  | ward     | kendall   | 81.0% | 36 | 29.18
mfcc                  | ward     | pearson   | 83.0% | 35 | 29.05
Flatness Coefficients | ward     | kendall   | 82.9% | 35 | 29.03
mfcc                  | ward     | euclidean | 80.5% | 35 | 28.17
mfcc                  | ward     | manhattan | 80.1% | 35 | 28.04
mfcc                  | ward     | spearman  | 81.3% | 34 | 27.63
Flatness Coefficients | ward     | spearman  | 83.7% | 33 | 27.62
Flatness Coefficients | ward     | maximum   | 86.1% | 32 | 27.56
mfcc                  | ward     | maximum   | 79.8% | 34 | 27.12
Flatness Coefficients | mcquitty | euclidean | 88.9% | 30 | 26.67
mfcc                  | average  | manhattan | 87.3% | 30 | 26.20

w - number of clusters; α - average clustering accuracy over all instruments;
score = α × w.
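A sketch of this evaluation computed from the contingency table; the exact definition of per-instrument accuracy is an assumption (each instrument judged by its dominant cluster):

```python
import numpy as np

def clustering_score(contingency):
    """alpha = average per-instrument clustering accuracy (here: each
    instrument's frames judged by its dominant cluster, an assumed
    definition), w = number of clusters, score = alpha * w."""
    table = np.asarray(contingency, dtype=float)
    alpha = (table.max(axis=1) / table.sum(axis=1)).mean()
    w = table.shape[1]
    return alpha, w, alpha * w

print(clustering_score([[9, 1], [2, 8]]))  # (0.85, 2, 1.7)
```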
Clustering result from the Hclust algorithm with the Ward linkage method and the Pearson
distance measure; flatness coefficients are used as the selected feature.
"ctrumpet" and "bachtrumpet" are clustered in the same group. "ctrumpet_harmonStemOut" is
clustered into a group of its own instead of merging with "ctrumpet". Bassoon appears as the
sibling of the regular French horn. "French horn muted" is clustered in a different group,
together with "English Horn" and "Oboe".
Comparison between non-cascade classification and cascade
classification with different hierarchical schemas

Experiment | Classification method | Description      | Recall | Precision | F-Score
1          | non-cascade           | Feature-based    | 64.3%  | 44.8%     | 52.81%
2          | non-cascade           | Spectrum-Match   | 79.4%  | 50.8%     | 61.96%
3          | cascade               | Hornbostel/Sachs | 75.0%  | 43.5%     | 55.06%
4          | cascade               | play method      | 77.8%  | 53.6%     | 63.47%
5          | cascade               | machine learned  | 87.5%  | 62.3%     | 72.78%
We evaluate the classification system on mixture
sounds that contain two single-instrument sounds.
We also create 49 polyphonic sounds by randomly
selecting three different single-instrument sounds and
mixing them together.
We then test those three-instrument mixtures with the
five different classification methods (experiments 2 to 6)
described in the previous two-instrument
mixture experiments. Single-label classification based on
the sound separation method is also tested on the mixtures
(experiment 1).
KNN (k=3) is used as the classifier for each experiment.
Classification results of 3-instrument mixtures with different algorithms

Exp# | Classifier                | Method                                     | Recall | Precision | F-Score
1    | Non-Cascade               | Single-label based on sound separation     | 31.48% | 43.06%    | 36.37%
2    | Non-Cascade               | Feature-based multi-label classification   | 69.44% | 58.64%    | 63.59%
3    | Non-Cascade               | Spectrum-Match multi-label classification  | 85.51% | 55.04%    | 66.97%
4    | Cascade (Hornbostel)      | multi-label classification                 | 64.49% | 63.10%    | 63.79%
5    | Cascade (play method)     | multi-label classification                 | 66.67% | 55.25%    | 60.43%
6    | Cascade (machine learned) | multi-label classification                 | 63.77% | 69.67%    | 66.59%
User entering query
He is looking for a particular piece of music:
Mozart, 40th Symphony.
The user is not satisfied and enters a new query:
"Yes, but I'm sad today; play the same song but make it sadder."
Modified Mozart, 40th Symphony
- Action Rules System
An action rule is defined as a term
[(ω) ∧ (α → β)] → (ϕ → ψ)
where:
- ω is a conjunction of fixed condition features shared by both groups,
- (α → β) represents proposed changes in values of flexible features,
- (ϕ → ψ) is the desired effect of the action.

Information System:
A  | B  | D
a1 | b2 | d1
a2 | b2 |
a2 | b2 | d2
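A toy sketch of matching one action rule against objects of this information system; the rule encoding and the matching helper are illustrative assumptions, not the system's actual representation:

```python
# Toy encoding of one action rule over the information system above:
# [(B = b2) ∧ (A: a1 -> a2)] -> (D: d1 -> d2). The dict layout and the
# matching helper are illustrative assumptions, not the system's API.
rule = {
    "fixed": {"B": "b2"},            # ω: condition shared by both groups
    "flexible": {"A": ("a1", "a2")}, # α -> β: proposed change
    "effect": ("D", "d1", "d2"),     # ϕ -> ψ: desired effect
}

def rule_applies(obj, rule):
    """True if the object satisfies ω and the 'before' side of α."""
    fixed_ok = all(obj.get(k) == v for k, v in rule["fixed"].items())
    flex_ok = all(obj.get(k) == before
                  for k, (before, _) in rule["flexible"].items())
    return fixed_ok and flex_ok

x = {"A": "a1", "B": "b2", "D": "d1"}
if rule_applies(x, rule):
    attr, before, after = rule["effect"]
    print(f"changing A a1->a2 should move {attr} from {before} to {after}")
```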
"Action Rules Discovery without pre-existing classification rules",
Z.W. Ras, A. Dardzinska, Proceedings of RSCTC 2008 Conference, in Akron, Ohio,
LNAI 5306, Springer, 2008, 181-190
http://www.cs.uncc.edu/~ras/Papers/Ras-Aga-AKRON.pdf
WWW.MIR.UNCC.EDU
- Auto-indexing system for musical instruments
- Intelligent query answering system for music instruments
Available for download.