www.kdd.uncc.edu
CCI, UNC-Charlotte
Research sponsored by NSF grants IIS-0414815, IIS-0968647
http://www.mir.uncc.edu
Presented by
Alicja Wieczorkowska (Polish-Japanese Institute of IT, Warsaw, Poland)
Krzysztof Marasek (Polish-Japanese Institute of IT, Warsaw, Poland)
My former PhD students:
Elzbieta Kubera (Maria Curie-Sklodowska University, Lublin, Poland)
Rory Lewis (University of Colorado at Colorado Springs, USA)
Wenxin Jiang (Fred Hutchinson Cancer Research Center in Seattle, USA)
Xin Zhang (University of North Carolina, Pembroke, USA)
My current PhD student:
Amanda Cohen-Mostafavi (University of North Carolina, Charlotte, USA)
MIRAI - Musical Database (mostly MUMS)
[music pieces played by 57 different music instruments]
Goal: design and implement a system for automatic indexing of music by instruments (an objective task) and emotions (a subjective task).
Outcome: a musical database represented as an FS-tree, guaranteeing efficient storage and retrieval [music pieces indexed by instruments and emotions].
MIRAI - Musical Database
[music pieces played by 57+ different music instruments (see below) and described by over 910 attributes]
Alto Flute, Bach-trumpet, bass-clarinet, bassoon, bass-trombone, Bb trumpet, b-flat clarinet, cello, cello-bowed, cello-martele, cello-muted, cello-pizzicato, contrabassclarinet, contrabassoon, crotales, c-trumpet, ctrumpet-harmonStemOut, doublebass-bowed, doublebass-martele, doublebass-muted, doublebass-pizzicato, eflatclarinet, electric-bass, electric-guitar, englishhorn, flute, frenchhorn, frenchHorn-muted, glockenspiel, marimba-crescendo, marimba-singlestroke, oboe, piano-9ft, piano-hamburg, piccolo, piccolo-flutter, saxophone-soprano, saxophone-tenor, steeldrums, symphonic, tenor-trombone, tenor-trombone-muted, tuba, tubular-bells, vibraphone-bowed, vibraphone-hardmallet, viola-bowed, viola-martele, viola-muted, viola-natural, viola-pizzicato, violin-artificial, violin-bowed, violin-ensemble, violin-muted, violin-natural-harmonics, xylophone.
What is needed?
A database of monophonic and polyphonic music signals and their descriptions in terms of new features (including temporal ones), in addition to the standard MPEG7 features.
These signals are labeled by instruments and emotions, forming additional features called decision features.
Why is it needed?
To build classifiers for automatic indexing of musical sound by instruments and emotions.
[System architecture diagram: the user submits a query through a Query Adapter to the Indexed Audio Database (music objects indexed by instruments, durations, …); an "Empty Answer?" check routes unanswered queries back through the Query Adapter.]
PCM (Pulse Code Modulation) – the most straightforward mechanism for storing audio: analog audio is sampled and the individual samples are stored sequentially in binary format.
PCM parameters: sampling rate 44.1 kHz, 16 bits per sample, i.e., 2,646,000 values/min per channel.
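A minimal sketch (our illustration) of reading raw PCM with Python's standard wave module and reproducing the values-per-minute figure above; "example.wav" is a hypothetical file name.

```python
import wave

# Read raw PCM samples from a WAV file using only the standard library.
with wave.open("example.wav", "rb") as wav:
    rate = wav.getframerate()               # e.g., 44100 Hz
    bits = wav.getsampwidth() * 8           # e.g., 2 bytes -> 16 bits/sample
    pcm = wav.readframes(wav.getnframes())  # sequential binary samples

# 44,100 samples/s * 60 s = 2,646,000 values per minute (per channel)
print(rate, bits, rate * 60)
```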
The nature and types of raw data:

| Data source      | Organization | Volume     | Type                  | Quality |
| Traditional data | Structured   | Modest     | Discrete, categorical | Clean   |
| Audio data       | Unstructured | Very large | Continuous, numeric   | Noisy   |
Feature extraction turns the lower-level raw data (amplitude values at each sample point) into a manageable form; the resulting feature database holds higher-level representations suited to traditional pattern recognition: classification, clustering, regression.
[Feature extraction flowchart: signal → Hamming window → STFT (NFFT FFT points) → power spectrum; the signal envelope yields Log Attack Time and Temporal Centroid; harmonic peak detection and fundamental frequency estimation yield Spectral Centroid, Instantaneous Harmonic Spectral Spread, Instantaneous Harmonic Spectral Centroid, Instantaneous Harmonic Spectral Deviation, and Instantaneous Harmonic Spectral Variation.]
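A minimal sketch (our illustration; nfft and the 40 ms hop are assumed values) of the windowing/STFT/power-spectrum stage of this flowchart, with spectral centroid as one example derived feature:

```python
import numpy as np

# Frame the signal with a Hamming window and compute per-frame power spectra.
def power_spectrum_frames(x, nfft=4096, hop=1764):
    win = np.hamming(nfft)                      # Hamming window
    frames = [x[i:i + nfft] * win
              for i in range(0, len(x) - nfft, hop)]
    spec = np.abs(np.fft.rfft(frames, n=nfft))  # STFT magnitudes
    return spec ** 2                            # power spectrum per frame

# Spectral centroid: power-weighted mean frequency of each frame.
def spectral_centroid(power, sr=44100, nfft=4096):
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sr)
    return (power * freqs).sum(axis=1) / power.sum(axis=1)

x = np.random.randn(44100)          # stand-in for one second of real audio
p = power_spectrum_frames(x)
print(spectral_centroid(p).shape)   # one centroid value per frame
```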
MPEG7 features:
Spectrum Centroid, Spectrum Spread, Spectrum Flatness, Spectrum Basic Functions, Spectrum Projection Functions, Log Attack Time, Harmonic Peaks, …

Non-MPEG7 features & new temporal features:
Roll-Off; Flux; Mel frequency cepstral coefficients (MFCC); Tristimulus and similar parameters (contents of odd and even partials – Od, Ev); mean frequency deviation for low partials; changing ratios of spectral spread; changing ratios of spectral centroid.
New Temporal Features – S’(i), C’(i), S’’(i), C’’(i)
S’(i) = [S(i+1) – S(i)]/S(i) ; C’(i) = [C(i+1) – C(i)]/C(i), where S(i+1), S(i) and C(i+1), C(i) are the spectral spread and spectral centroid of two consecutive frames, frame i+1 and frame i.
The changing ratios of spectral spread and spectral centroid for two consecutive frames are treated as the first derivatives of the spectral spread and spectral centroid.
Following the same method, we calculate the second derivatives:
S’’(i) = [S’(i+1) – S’(i)]/S’(i) ; C’’(i) = [C’(i+1) – C’(i)]/C’(i)
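A minimal sketch of these changing-ratio features (the S and C values below are made-up illustrative data):

```python
import numpy as np

# v'(i) = [v(i+1) - v(i)] / v(i): one value per consecutive frame pair.
def changing_ratio(v):
    return (v[1:] - v[:-1]) / v[:-1]

S = np.array([810.0, 840.0, 835.0, 880.0])    # spectral spread per frame
C = np.array([1200.0, 1230.0, 1210.0, 1300.0])  # spectral centroid per frame

S1, C1 = changing_ratio(S), changing_ratio(C)    # first derivatives S', C'
S2, C2 = changing_ratio(S1), changing_ratio(C1)  # second derivatives S'', C''
print(S1, S2)
```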
Remark: the sequence [S(i), S(i+1), S(i+2), …, S(i+k)] can be approximated by a polynomial p(x) = a0 + a1*x + a2*x^2 + a3*x^3 + …; the coefficients a0, a1, a2, a3, … serve as new features.
Experiment with WEKA: 19 instruments [flute, piano, violin, saxophone, vibraphone, trumpet, marimba, french horn, viola, bassoon, clarinet, cello, trombone, accordion, guitar, tuba, english horn, oboe, double bass];
J48 with a 0.25 confidence factor for tree pruning and a minimum of 10 instances per leaf;
KNN with 3 neighbors and Euclidean distance as the similarity function.
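For readers without WEKA, a rough scikit-learn analogue of this setup (an assumption on our part: sklearn's CART tree is similar to, but not identical with, J48/C4.5, and exposes no confidence-factor parameter):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Approximate counterparts of the WEKA classifiers above.
tree = DecisionTreeClassifier(min_samples_leaf=10)  # ~ min instances per leaf
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# Example usage (X_train, y_train, X_test, y_test assumed to exist):
# tree.fit(X_train, y_train); print(tree.score(X_test, y_test))
# knn.fit(X_train, y_train);  print(knn.score(X_test, y_test))
```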
Classification confidence with temporal features:

| Experiment | Features               | Classifier    | Confidence |
| 1          | S, C                   | Decision Tree | 80.47%     |
| 2          | S, C, S’, C’           | Decision Tree | 83.68%     |
| 3          | S, C, S’, C’, S’’, C’’ | Decision Tree | 84.76%     |
| 4          | S, C                   | KNN           | 80.31%     |
| 5          | S, C, S’, C’           | KNN           | 84.07%     |
| 6          | S, C, S’, C’, S’’, C’’ | KNN           | 85.51%     |
[Confusion matrices (left: Experiment 1; right: Experiment 3); correctly classified instances are highlighted in green, incorrectly classified ones in yellow.]
[Figure: precision of the decision tree for each of the 19 instruments (flute, piano, violin, saxophone, vibraphone, trumpet, marimba, french horn, viola, bassoon, clarinet, cello, trombone, accordion, guitar, tuba, english horn, oboe, double bass).]
[Figure: recall of the decision tree for each instrument.]
[Figure: F-score of the decision tree for each instrument.]
Polyphonic sounds – how to handle?
1. Single-label classification based on sound separation
2. Multi-label classifiers

[Sound separation flowchart: polyphonic sound → segmentation → get frame → feature extraction → classifier → get instrument → sound separation, repeated on the residual signal. Problem: information loss during the signal subtraction.]
Timbre estimation in polyphonic sounds and designing multi-label classifiers.
Timbre-relevant descriptors extracted from each frame:
Spectrum Centroid and Spread; Spectrum Flatness Band Coefficients; Harmonic Peaks; Mel-frequency cepstral coefficients (MFCC); Tristimulus.
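A minimal sketch of extracting several of these descriptors, assuming the librosa library; "sound.wav" is a hypothetical file:

```python
import librosa

# Load audio and compute a few of the timbre-relevant descriptors above.
y, sr = librosa.load("sound.wav", sr=44100, mono=True)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # MFCC per frame
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectrum centroid
flatness = librosa.feature.spectral_flatness(y=y)         # spectral flatness
print(mfcc.shape, centroid.shape, flatness.shape)
```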
Timbre estimation based on a multi-label classifier:
[Flowchart: 40 ms window segmentation → get frame → feature extraction → classifier outputs timbre descriptors and instrument candidates with confidences (e.g., candidate 1: 70%, candidate 2: 50%, …).]
Timbre estimation results based on different methods
[Instruments – 45; training data (TD) – 2917 single-instrument sounds from MUMS; testing on 308 mixed sounds randomly chosen from TD; window size – 1 s; frame size – 120 ms; hop size – 40 ms (~25 frames); Mel-frequency cepstral coefficients (MFCC) extracted from each frame.]
Single-label vs. multi-label; separation vs. non-separation:

| Experiment # | Pitch-based | Sound separation | N(Labels) max | Recall | Precision | F-score |
| 1            | Yes         | Yes              | 1             | 54.55% | 39.2%     | 45.60%  |
| 2            | Yes         | Yes              | 2             | 61.20% | 38.1%     | 46.96%  |
| 3            | Yes         | No               | 2             | 64.28% | 44.8%     | 52.81%  |
| 4            | Yes         | No               | 4             | 67.69% | 37.9%     | 48.60%  |
| 5            | Yes         | No               | 8             | 68.3%  | 36.9%     | 47.91%  |
A threshold of 0.4 controls the total number of estimations for each index window.
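A minimal sketch of such threshold-controlled multi-label estimation (our illustration; `model` stands for any classifier that outputs per-instrument confidences, e.g. one with a scikit-learn-style predict_proba):

```python
# Keep every instrument whose confidence clears the threshold, so the
# threshold controls how many labels each frame receives.
def estimate_instruments(model, frame_features, labels, threshold=0.4):
    probs = model.predict_proba([frame_features])[0]
    return [(lab, p) for lab, p in zip(labels, probs) if p >= threshold]

# e.g. -> [("flute", 0.70), ("trombone", 0.52)]
```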
[Flowchart: polyphonic sound (window) → get frame → feature extraction → classifiers → multiple labels C_i, …, C_j.]
Compressed representations of the signal: Harmonic Peaks, Mel Frequency Cepstral Coefficients (MFCC), Spectral Flatness, …
Irrelevant information (inharmonic frequencies or partials) is removed.
Violin and viola have similar MFCC patterns; the same holds for double bass and guitar. This makes them difficult to distinguish in polyphonic sounds.
More information from the raw signal is needed.
Short-term power spectrum – a low-level representation of the signal (calculated by STFT); each spectrum slice is 0.12 seconds long.
The power spectrum patterns of flute and trombone can be seen in the mixture.
Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).
Training set:
Power spectra from 3323 frames, extracted by STFT from 26 single-instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.
Testing set:
Fifty-two audio files, each mixed (using Sound Forge) from two of these 26 single-instrument sounds.
Classifiers:
(1) KNN with Euclidean distance (spectrum-match based classification);
(2) Decision Tree (multi-label classification based on previously extracted features).
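A minimal sketch of the spectrum-match variant (our illustration; the random data stands in for real power spectra):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Spectrum match: each training row is a short-term power spectrum vector
# labeled with its instrument; KNN with Euclidean distance matches a test
# spectrum against the stored training spectra.
rng = np.random.default_rng(0)
X_train = rng.random((3323, 2048))                 # power spectrum per frame
y_train = rng.choice(["flute", "trombone"], 3323)  # instrument labels

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict(rng.random((1, 2048))))          # label of nearest spectra
```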
Timbre pattern match based on power spectrum: spectrum-based vs. feature-based

| Experiment # | Description                                                   | Recall | Precision | F-score |
| 1            | Feature-based + Decision Tree (n=2)                          | 64.28% | 44.8%     | 52.81%  |
| 2            | Spectrum Match + KNN (k=1; n=2)                              | 79.41% | 50.8%     | 61.96%  |
| 3            | Spectrum Match + KNN (k=5; n=2)                              | 82.43% | 45.8%     | 58.88%  |
| 4            | Spectrum Match + KNN (k=5; n=2), without percussion instruments | 87.1% | –        | –       |

n – number of labels assigned to each frame; k – parameter for KNN.
[Hornbostel-Sachs hierarchy diagram: top level – idiophone, membranophone, aerophone, chordophone; aerophone subclasses include lip vibration (C trumpet, French horn, tuba), single reed, free, and side (flute, alto flute, piccolo); oboe, bassoon, and whip also appear among the nodes. A parallel play-method tree includes blown, bowed, muted, picked, pizzicato, shaken, …]
Sample decision table (classification attributes CA_1, …, CA_n; decision attributes: Hornbostel-Sachs classification and play method):

| Obj | CA_1 | … | CA_n | Hornbostel-Sachs                  | Play Method        |
| 1   | 0.22 | … | 0.28 | [Aerophone, Side, Alto Flute]     | [Blown, Alto Flute] |
| 2   | 0.31 | … | 0.77 | [Idiophone, Concussion, Bell]     | [Concussive, Bell]  |
| 3   | 0.05 | … | 0.21 | [Chordophone, Composite, Cello]   | [Bowed, Cello]      |
| 4   | 0.12 | … | 0.11 | [Chordophone, Composite, Violin]  | [Martele, Violin]   |
Hierarchical attributes (Level I / Level II): classification attribute c has Level I values c[1], c[2], with c[2] refined into c[2,1], c[2,2] at Level II; decision attribute d has Level I values d[1], d[2], d[3], with d[3] refined into d[3,1], d[3,2].

| X  | a    | b    | c      | d      |
| x1 | a[1] | b[2] | c[1]   | d[3]   |
| x2 | a[1] | b[1] | c[1]   | d[3,1] |
| x3 | a[1] | b[2] | c[2,2] | d[1]   |
| x4 | a[2] | b[2] | c[2]   | d[1]   |

Here a, b, c are classification attributes and d is the decision attribute (Hornbostel/Sachs).
We do not include membranophones: instruments in this family usually do not produce harmonic sounds, so they require special techniques to be identified.
Modules of the cascade classifier for single-instrument estimation – Hornbostel/Sachs.
Example (pitch 3B module): 96.02% * 98.94% = 95.00% > 91.80%.
Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).
Training set:
2762 frames extracted from the following instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.
Classifiers – WEKA:
(1) KNN with Euclidean distance (spectrum-match based classification);
(2) Decision Tree (classification based on previously extracted features).
Confidence – the ratio of correctly classified instances to the total number of instances.
| Group | Feature description                                          | KNN confidence | Decision Tree confidence |
| A     | 33 spectrum flatness band coefficients                       | 99.23%         | 94.69%                   |
| B     | 13 MFCC coefficients                                         | 98.19%         | 93.57%                   |
| C     | 28 harmonic peaks                                            | 86.60%         | 91.29%                   |
| D     | 38 spectrum projection coefficients                          | 47.45%         | 31.81%                   |
| E     | Log spectral centroid, spread, flux, roll-off, zero-crossing | 99.34%         | 99.77%                   |
Feature and classifier selection at each level of the cascade system (root: KNN + band coefficients):

Level 1:
| Node        | Feature           | Classifier |
| chordophone | Band Coefficients | KNN        |
| aerophone   | MFCC coefficients | KNN        |
| idiophone   | Band Coefficients | KNN        |

Level 2:
| Node             | Feature           | Classifier    |
| chrd_composite   | Band Coefficients | KNN           |
| aero_double-reed | MFCC coefficients | KNN           |
| aero_lip-vibrated| MFCC coefficients | KNN           |
| aero_side        | MFCC coefficients | KNN           |
| aero_single-reed | Band Coefficients | Decision Tree |
| idio_struck      | Band Coefficients | KNN           |

[Charts: classification based on KNN vs. classification based on Decision Tree.]
From these two experiments, we see that:
1) The KNN classifier works better with feature vectors such as spectral flatness coefficients, projection coefficients, and MFCC.
2) The decision tree works better with harmonic peaks and statistical features.
Simply adding more features together does not improve the classifiers and sometimes even worsens classification results (e.g., adding harmonic peaks to other feature groups).
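A minimal sketch (our illustration, not the authors' code) of how the cascade ties these per-node selections together: classify the Hornbostel-Sachs family first, then apply that family's own classifier and feature subset.

```python
# frame: 1-D numpy array holding all extracted features for one analysis frame;
# root_cols / node_cols: index arrays selecting each model's feature subset.
def cascade_classify(frame, root_model, root_cols, node_models):
    family = root_model.predict([frame[root_cols]])[0]   # e.g. "aerophone"
    node_model, node_cols = node_models[family]          # per-node choice
    return family, node_model.predict([frame[node_cols]])[0]

# node_models might map "aerophone" -> (knn_on_mfcc, mfcc_cols) and
# "chordophone" -> (knn_on_band_coeffs, band_cols), mirroring the tables.
```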
HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS
Seven common methods to calculate the distance or similarity between clusters: single linkage (nearest neighbor), complete linkage (furthest neighbor), unweighted pair-group method using arithmetic averages (UPGMA), weighted pair-group method using arithmetic averages (WPGMA), unweighted pair-group method using the centroid average (UPGMC), weighted pair-group method using the centroid average (WPGMC), and Ward's method.
Six most common distance functions: Euclidean, Manhattan, Canberra (the sum of a series of fractional differences between the coordinates of a pair of objects), Pearson correlation coefficient (PCC, which measures the degree of association between objects), Spearman's rank correlation coefficient, and Kendall (which counts the number of pairwise disagreements between two lists).
Clustering algorithm – HCLUST (agglomerative hierarchical clustering) – R package.
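For a Python analogue of this setup, scipy's agglomerative clustering offers the same linkage family (average = UPGMA, weighted = WPGMA, centroid = UPGMC, median = WPGMC, plus single, complete, and ward); a minimal sketch with stand-in data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The random matrix stands in for frame-level feature data (e.g., 13 MFCCs).
X = np.random.rand(200, 13)

Z = linkage(X, method="ward", metric="euclidean")      # agglomerative tree
cluster_ids = fcluster(Z, t=40, criterion="maxclust")  # cut into <= 40 clusters
print(cluster_ids[:10])                                # cluster ID per frame
```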
Testing datasets (MFCC, flatness coefficients, harmonic peaks):
The middle C pitch group, which contains 46 different musical sound objects.
Each sound object is segmented into multiple 0.12 s frames, and each frame is stored as an instance in the testing dataset; there are 2884 frames in total.
This dataset is represented by 3 different sets of features (MFCC, flatness coefficients, and harmonic peaks).
Total number of experiments = 3 × 7 × 6 = 126 (feature sets × linkage methods × distance functions).
Clustering: when the algorithm finishes the clustering process, a cluster ID is assigned to each frame.
Contingency table derived from the clustering result:

|              | Cluster 1 | … | Cluster j | … | Cluster n |
| Instrument 1 | X_11      | … | X_1j      | … | X_1n      |
| …            | …         | … | …         | … | …         |
| Instrument i | X_i1      | … | X_ij      | … | X_in      |
| …            | …         | … | …         | … | …         |
| Instrument n | X_n1      | … | X_nj      | … | X_nn      |
Evaluation of the Hclust algorithm (the 14 results with the highest score among the 126 experiments):

| Feature               | Method   | Metric    | α     | w  | Score |
| Flatness Coefficients | ward     | pearson   | 87.3% | 37 | 32.30 |
| Flatness Coefficients | ward     | euclidean | 85.8% | 37 | 31.74 |
| Flatness Coefficients | ward     | manhattan | 85.6% | 36 | 30.83 |
| mfcc                  | ward     | kendall   | 81.0% | 36 | 29.18 |
| mfcc                  | ward     | pearson   | 83.0% | 35 | 29.05 |
| Flatness Coefficients | ward     | kendall   | 82.9% | 35 | 29.03 |
| mfcc                  | ward     | euclidean | 80.5% | 35 | 28.17 |
| mfcc                  | ward     | manhattan | 80.1% | 35 | 28.04 |
| mfcc                  | ward     | spearman  | 81.3% | 34 | 27.63 |
| Flatness Coefficients | ward     | spearman  | 83.7% | 33 | 27.62 |
| Flatness Coefficients | ward     | maximum   | 86.1% | 32 | 27.56 |
| mfcc                  | ward     | maximum   | 79.8% | 34 | 27.12 |
| Flatness Coefficients | mcquitty | euclidean | 88.9% | 30 | 26.67 |
| mfcc                  | average  | manhattan | 87.3% | 30 | 26.20 |

w – number of clusters; α – average clustering accuracy over all instruments; score = α × w.
Clustering result from the Hclust algorithm with the Ward linkage method and the Pearson distance measure, with flatness coefficients as the selected feature:
“ctrumpet” and “bachtrumpet” are clustered in the same group, while “ctrumpet_harmonStemOut” forms its own group instead of merging with “ctrumpet”. Bassoon appears as the sibling of the regular French horn, and “French horn muted” is clustered in a different group together with “English horn” and “oboe”.
Looking for the optimal [classification method, data representation] in monophonic music
[Middle C pitch group – 46 different musical sound objects]

| Experiment | Classification method | Description      | Recall | Precision | F-score |
| 1          | non-cascade           | feature-based    | 64.3%  | 44.8%     | 52.81%  |
| 2          | non-cascade           | spectrum-match   | 79.4%  | 50.8%     | 61.96%  |
| 3          | cascade               | Hornbostel/Sachs | 75.0%  | 43.5%     | 55.06%  |
| 4          | cascade               | play method      | 77.8%  | 53.6%     | 63.47%  |
| 5          | cascade               | machine learned  | 87.5%  | 62.3%     | 72.78%  |
Looking for the optimal [classification method, data representation] in polyphonic music
[Middle C pitch group – 46 different musical sound objects]
Testing data: 49 polyphonic sounds created by selecting three different single-instrument sounds from the training database and mixing them together.
This set of sounds is used to test our five different arrangements of [classification method, data representation].
KNN (k=3) is used as the classifier in each experiment.
| Exp # | Classifier               | Method                                    | Recall | Precision | F-score |
| 1     | Non-cascade              | Single-label based on sound separation    | 31.48% | 43.06%    | 36.37%  |
| 2     | Non-cascade              | Feature-based multi-label classification  | 64.49% | 63.10%    | 63.79%  |
| 3     | Non-cascade              | Spectrum-match multi-label classification | 69.44% | 58.64%    | 63.59%  |
| 4     | Cascade (Hornbostel)     | Multi-label classification                | 85.51% | 55.04%    | 66.97%  |
| 5     | Cascade (play method)    | Multi-label classification                | 66.67% | 55.25%    | 60.43%  |
| 6     | Cascade (machine learned)| Multi-label classification                | 69.67% | 63.77%    | 66.59%  |
A user enters a query, looking for a particular piece of music: Mozart, 40th Symphony.
The user is not satisfied and enters a new query: “Yes, but I’m sad today; play the same song but make it sadder.”
The Action Rules System responds with a modified Mozart 40th Symphony.
An action rule is defined as a term [(ω) ∧ (α → β)] → (ϕ → ψ), where ω is a conjunction of fixed condition features shared by both groups, (α → β) describes the proposed changes in the values of flexible features, and (ϕ → ψ) is the desired effect of the action.

Information system:
| A  | B  | D  |
| a1 | b2 | d1 |
| a2 | b2 |    |
| a2 | b2 | d2 |
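For illustration (our own example built from the table above, not from the source): taking B as the fixed attribute and A as flexible, a candidate action rule over this system is

[(B, b2)] ∧ [(A, a1 → a2)] → [(D, d1 → d2)]

i.e., for objects sharing B = b2, changing A from a1 to a2 is expected to move the decision D from d1 to d2.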
Meta-actions based decision system S(d) = (X, A ∪ {d}, V), with A = {A1, A2, …, Am} and meta-actions M1, M2, M3, M4, …, Mn.

Influence matrix (rows: meta-actions; columns: attributes):

|    | A1  | A2  | A3  | A4  | … | Am  |
| M1 | E11 | E12 | E13 | E14 | … | E1m |
| M2 | E21 | E22 | E23 | E24 | … | E2m |
| M3 | E31 | E32 | E33 | E34 | … | E3m |
| M4 | E41 | E42 | E43 | E44 | … | E4m |
| …  | …   | …   | …   | …   | … | …   |
| Mn | En1 | En2 | En3 | En4 | … | Enm |

If E31 = [a1, a1’], E32 = [a2, a2’], and E34 = [a4, a4’], then the candidate action rule

r = [(A1, a1 → a1’) ∧ (A2, a2 → a2’) ∧ (A4, a4 → a4’)] → (d, d1 → d1’)

is supported and covered by the meta-action M3.
"Action Rules Discovery without pre-existing classification rules",
Z.W. Ras, A. Dardzinska, Proceedings of RSCTC 2008 Conference, in Akron, Ohio,
LNAI 5306, Springer, 2008, 181-190 http://www.cs.uncc.edu/~ras/Papers/Ras-Aga-AKRON.pdf
Since the window attenuates the signal at both edges, information would be lost (the effective frequency spectrum narrows); to preserve this information, consecutive analysis frames overlap in time.
Empirical experiments show the best overlap is two-thirds of the window size.
[Figure: overlapping analysis windows along the time axis; Hamming windowing reduces spectral leakage.]
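A minimal sketch of overlapping Hamming-window framing with the two-thirds overlap mentioned above (window size 5292 samples ≈ 0.12 s at 44.1 kHz, an assumed value matching the frame size used earlier):

```python
import numpy as np

# Frame a signal with overlapping Hamming windows; a hop of one third of
# the window size gives the two-thirds overlap between consecutive frames.
def hamming_frames(x, window_size=5292, overlap=2 / 3):
    hop = int(window_size * (1 - overlap))   # e.g., 1764 samples
    win = np.hamming(window_size)
    return np.array([x[i:i + window_size] * win
                     for i in range(0, len(x) - window_size + 1, hop)])

x = np.random.randn(44100)   # stand-in for one second of audio
frames = hamming_frames(x)
print(frames.shape)          # (number_of_frames, window_size)
```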