Knowledge Discovery Based Music Information Retrieval


www.kdd.uncc.edu

CCI, UNC-Charlotte

Music Information Retrieval based on multi-label cascade classification system

Research sponsored by NSF grants IIS-0414815 and IIS-0968647

http://www.mir.uncc.edu

Presented by

Zbigniew W. Ras

Collaborators:

Alicja Wieczorkowska (Polish-Japanese Institute of IT, Warsaw, Poland)

Krzysztof Marasek (Polish-Japanese Institute of IT, Warsaw, Poland)

My former PhD students:

Elzbieta Kubera (Maria Curie-Sklodowska University, Lublin, Poland)

Rory Lewis (University of Colorado at Colorado Springs, USA)

Wenxin Jiang (Fred Hutchinson Cancer Research Center in Seattle, USA)

Xin Zhang (University of North Carolina, Pembroke, USA)

My current PhD student:

Amanda Cohen-Mostafavi (University of North Carolina, Charlotte, USA)

MIRAI - Musical Database (mostly MUMS)

[music pieces played by 57 different music instruments]

Goal: design and implement a system for automatic indexing of music by instruments (an objective task) and by emotions (a subjective task).

Outcome: a musical database represented as an FS-tree, guaranteeing efficient storage and retrieval [music pieces indexed by instruments and emotions].

MIRAI - Musical Database

[music pieces played by 57+ different music instruments (see below) and described by over 910 attributes]

Alto Flute, Bach-trumpet, bass-clarinet, bassoon, bass-trombone, Bb trumpet, b-flat clarinet, cello, cello-bowed, cello-martele, cello-muted, cello-pizzicato, contrabassclarinet, contrabassoon, crotales, c-trumpet, ctrumpet-harmonStemOut, doublebass-bowed, doublebass-martele, doublebass-muted, doublebass-pizzicato, eflatclarinet, electric-bass, electric-guitar, englishhorn, flute, frenchhorn, frenchHorn-muted, glockenspiel, marimba-crescendo, marimba-singlestroke, oboe, piano-9ft, piano-hamburg, piccolo, piccolo-flutter, saxophone-soprano, saxophone-tenor, steeldrums, symphonic, tenor-trombone, tenor-trombone-muted, tuba, tubular-bells, vibraphone-bowed, vibraphone-hardmallet, viola-bowed, viola-martele, viola-muted, viola-natural, viola-pizzicato, violin-artificial, violin-bowed, violin-ensemble, violin-muted, violin-natural-harmonics, xylophone.

Automatic Indexing of Music

What is needed?

A database of monophonic and polyphonic music signals and their descriptions in terms of new features (including temporal ones), in addition to the standard MPEG-7 features.

These signals are labeled by instruments and emotions, forming additional features called decision features.

Why is it needed?

To build classifiers for automatic indexing of musical sound by instruments and emotions.

MIRAI - Cooperative Music Information Retrieval System based on Automatic Indexing

[Flowchart: a user query goes through a query adapter to the indexed audio database (music objects indexed by instruments and durations); if the answer is empty, the adapter modifies the query and tries again.]

Raw data - signal representation

PCM (Pulse Code Modulation) is the most straightforward mechanism for storing audio: the analog signal is sampled and the individual samples are stored sequentially in a binary file.

At a sampling rate of 44.1 kHz with 16 bits per sample, one minute of audio yields 2,646,000 values per channel.

Challenges to applying KDD in MIR

The nature and types of raw data:

Data source        Organization   Volume       Type                    Quality
Traditional data   Structured     Modest       Discrete, categorical   Clean
Audio data         Unstructured   Very large   Continuous, numeric     Noisy

Feature extraction

Raw data consists of amplitude values at each sample point. Feature extraction turns this lower-level raw data into a manageable feature database: higher-level representations on which traditional pattern recognition methods (classification, clustering, regression) can operate.

MPEG-7 features

[Flowchart: the signal is split into Hamming-windowed frames; the STFT (NFFT FFT points) yields the power spectrum. From the signal envelope come Log Attack Time and Temporal Centroid. From the power spectrum, harmonic peaks detection and fundamental frequency estimation feed Spectral Centroid and the instantaneous Harmonic Spectral Centroid, Spread, Deviation, and Variation.]

Derived Database

MPEG-7 features:

Spectrum Centroid, Spectrum Spread, Spectrum Flatness, Spectrum Basis Functions, Spectrum Projection Functions, Log Attack Time, Harmonic Peaks, ...

Non-MPEG-7 features and new temporal features:

Roll-Off, Flux, Mel frequency cepstral coefficients (MFCC), Tristimulus and similar parameters (contents of odd and even partials - Od, Ev), mean frequency deviation for low partials, changing ratios of spectral spread, and changing ratios of spectral centroid.

New Temporal Features - S'(i), C'(i), S''(i), C''(i)

S'(i) = [S(i+1) - S(i)]/S(i) ; C'(i) = [C(i+1) - C(i)]/C(i), where S(i+1), S(i) and C(i+1), C(i) are the spectral spread and spectral centroid of two consecutive frames, frame i+1 and frame i.

The changing ratios of spectral spread and spectral centroid over two consecutive frames are treated as the first derivatives of the spectral spread and spectral centroid.

Following the same method we calculate the second derivatives:

S''(i) = [S'(i+1) - S'(i)]/S'(i) ; C''(i) = [C'(i+1) - C'(i)]/C'(i)
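A minimal sketch of these definitions in Python (numpy assumed; S and C are toy per-frame values, not real measurements):

```python
import numpy as np

def change_ratio(x):
    """x'(i) = (x(i+1) - x(i)) / x(i), applied frame by frame."""
    x = np.asarray(x, dtype=float)
    return np.diff(x) / x[:-1]

S = np.array([1200.0, 1250.0, 1180.0, 1300.0, 1275.0])  # spectral spread per frame
C = np.array([820.0, 845.0, 800.0, 870.0, 860.0])       # spectral centroid per frame

S1, C1 = change_ratio(S), change_ratio(C)    # S'(i), C'(i)
S2, C2 = change_ratio(S1), change_ratio(C1)  # S''(i), C''(i)
```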

Remark:

The sequence [S(i), S(i+1), S(i+2), ..., S(i+k)] can be approximated by a polynomial p(x) = a0 + a1*x + a2*x^2 + a3*x^3 + ...; the coefficients a0, a1, a2, a3, ... become new features.

Experiment with WEKA: 19 instruments [flute, piano, violin, saxophone, vibraphone, trumpet, marimba, french-horn, viola, bassoon, clarinet, cello, trombone, accordion, guitar, tuba, english-horn, oboe, double-bass].

J48 with a 0.25 confidence factor for tree pruning and a minimum of 10 instances per leaf;

KNN with 3 neighbors and Euclidean distance as the similarity function.
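Outside WEKA, a roughly equivalent setup can be sketched with scikit-learn; note its CART tree is not identical to J48 (C4.5) and has no confidence-factor pruning, so only the 10-instances-per-leaf and k=3 settings carry over. X_train and y_train are assumed to hold the feature vectors and instrument labels.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Approximate J48 settings: minimum of 10 instances per leaf
tree = DecisionTreeClassifier(min_samples_leaf=10)

# KNN with 3 neighbors and Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# tree.fit(X_train, y_train); knn.fit(X_train, y_train)
```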

Classification confidence with temporal features

Experiment   Features                   Classifier      Confidence
1            S, C                       Decision Tree   80.47%
2            S, C, S', C'               Decision Tree   83.68%
3            S, C, S', C', S'', C''     Decision Tree   84.76%
4            S, C                       KNN             80.31%
5            S, C, S', C'               KNN             84.07%
6            S, C, S', C', S'', C''     KNN             85.51%

[Confusion matrices: left from Experiment 1, right from Experiment 3. Correctly classified instances are highlighted in green, incorrectly classified instances in yellow.]

[Bar chart: precision of the decision tree for each instrument (flute, piano, violin, saxophone, vibraphone, trumpet, marimba, french horn, viola, bassoon, clarinet, cello, trombone, accordion, guitar, tuba, english horn, oboe, double bass).]

[Bar chart: recall of the decision tree for each instrument (same 19 instruments).]

[Bar chart: F-score of the decision tree for each instrument (same 19 instruments).]


Polyphonic sounds - how to handle?

1. Single-label classification based on sound separation

2. Multi-labeled classifiers

[Sound separation flowchart: the polyphonic sound is segmented; each frame goes through feature extraction and a classifier to get one instrument, whose signal is then subtracted and the loop repeats. Problem: information loss during the signal subtraction.]

Timbre estimation in polyphonic sounds and designing multi-labeled classifiers

Timbre-relevant descriptors:

Spectrum Centroid and Spread

Spectrum Flatness Band Coefficients

Harmonic Peaks

Mel frequency cepstral coefficients (MFCC)

Tristimulus

These features capture the sub-pattern of a single instrument in a mixture.

Timbre estimation based on a multi-label classifier

[Flowchart: 40 ms window segmentation; for each frame, feature extraction produces timbre descriptors, and the classifier returns instrument candidates with confidences, e.g. candidate 1 at 70%, candidate 2 at 50%, ...]

Timbre Estimation Results based on different methods

[Instruments - 45; Training Data (TD) - 2917 single-instrument sounds from MUMS; testing on 308 mixed sounds randomly chosen from TD; window size - 1 s; frame size - 120 ms; hop size - 40 ms (~25 frames); Mel-frequency cepstral coefficients (MFCC) extracted from each frame.]

Single Label vs. Multiple Label, Separation vs. Non-Separation

[Bar chart: 1 label + separation 54.55%; 2 labels + separation 61.20%; 2 labels, no separation 64.28%; 4 labels, no separation 67.69%.]

Experiment   Pitch-based   Sound Separation   N(Labels) max   Recall   Precision   F-score
1            Yes           Yes                1               54.55%   39.2%       45.60%
2            Yes           Yes                2               61.20%   38.1%       46.96%
3            Yes           No                 2               64.28%   44.8%       52.81%
4            Yes           No                 4               67.69%   37.9%       48.60%
5            Yes           No                 8               68.3%    36.9%       47.91%

A threshold of 0.4 controls the total number of estimations for each index window.
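A sketch of this thresholding step (assuming a scikit-learn-style classifier exposing predict_proba; names are illustrative):

```python
import numpy as np

def multilabel_estimate(clf, frame_features, classes, threshold=0.4):
    """Return all instruments whose confidence clears the threshold."""
    probs = clf.predict_proba([frame_features])[0]  # one confidence per instrument
    keep = np.flatnonzero(probs >= threshold)
    return sorted(((classes[i], probs[i]) for i in keep),
                  key=lambda pair: pair[1], reverse=True)
```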

[Flowchart: a window of the polyphonic sound is framed; feature extraction feeds the classifiers, which output multiple labels Ci, ..., Cj.]

Compressed representations of the signal - Harmonic Peaks, Mel Frequency Cepstral Coefficients (MFCC), Spectral Flatness, ... - remove irrelevant information (inharmonic frequencies or partials).

Violin and viola have similar MFCC patterns, as do double-bass and guitar, which makes them difficult to distinguish in polyphonic sounds. More information from the raw signal is needed.

Short-Term Power Spectrum - a low-level representation of the signal (calculated by STFT); each spectrum slice is 0.12 seconds long.

[Figure: the power spectrum patterns of flute and trombone remain visible in the spectrum of their mixture.]
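A numpy-only sketch of one such spectrum slice (the signal array and 44.1 kHz rate are assumptions):

```python
import numpy as np

def power_spectrum_slice(signal, start, fs=44100, dur=0.12):
    """Hamming-window a 0.12 s slice and return its power spectrum."""
    n = int(fs * dur)                      # 0.12 s -> 5292 samples at 44.1 kHz
    frame = signal[start:start + n] * np.hamming(n)
    spectrum = np.fft.rfft(frame)          # one STFT column
    return (np.abs(spectrum) ** 2) / n
```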

Experiment:

Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).

Training set: power spectra from 3323 frames, extracted by STFT from 26 single-instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.

Testing set: fifty-two audio files, each created (using Sound Forge) by mixing two of these 26 single-instrument sounds.

Classifiers:

(1) KNN with Euclidean distance (spectrum-match-based classification);

(2) Decision Tree (multi-label classification based on previously extracted features).

Timbre Pattern Match Based on Power Spectrum: Spectrum-based vs. Feature-based

Experiment   Description                                            Recall   Precision   F-score
1            Feature-based + Decision Tree (n=2)                    64.28%   44.8%       52.81%
2            Spectrum Match + KNN (k=1; n=2)                        79.41%   50.8%       61.96%
3            Spectrum Match + KNN (k=5; n=2)                        82.43%   45.8%       58.88%
4            Spectrum Match + KNN (k=5; n=2), without percussion    87.1%    -           -

n - number of labels assigned to each frame; k - parameter for KNN.
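The F-score column is the harmonic mean of precision and recall; a quick check of experiment 3:

```python
def f_score(recall, precision):
    return 2 * precision * recall / (precision + recall)

print(f_score(0.8243, 0.458))  # ~0.5888, matching the 58.88% entry
```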

Schema I - Hornbostel-Sachs

[Tree: instruments divide into Idiophone, Membranophone, Aerophone, and Chordophone; aerophone branches include Lip Vibration (C Trumpet, French Horn, Tuba), Single Reed, Free, and Side (Flute, Alto Flute); Oboe, Bassoon, and Whip also appear as leaves.]

Schema II - Play Methods

[Tree: play methods include Blown (Alto Flute, ..., Flute, Piccolo, Bassoon), Bowed, Muted, ..., Picked, Pizzicato, Shaken.]

Decision Table

Obj   Classification Attributes (CA1 ... CAn)   Decision: Hornbostel-Sachs          Decision: Play Method
1     0.22 ... 0.28                             [Aerophone, Side, Alto Flute]       [Blown, Alto Flute]
2     0.31 ... 0.77                             [Idiophone, Concussion, Bell]       [Concussive, Bell]
3     0.05 ... 0.21                             [Chordophone, Composite, Cello]     [Bowed, Cello]
4     0.12 ... 0.11                             [Chordophone, Composite, Violin]    [Martele, Violin]

Example

[Hierarchical attributes: at Level I the decision classes are d[1], d[2], d[3]; at Level II, d[3] splits into d[3,1] and d[3,2]. The classification attribute c is hierarchical too: c[1], c[2], with c[2] splitting into c[2,1] and c[2,2].]

X    a      b      c        d
x1   a[1]   b[2]   c[1]     d[3]
x2   a[1]   b[1]   c[1]     d[3,1]
x3   a[1]   b[2]   c[2,2]   d[1]
x4   a[2]   b[2]   c[2]

a, b, c are classification attributes; d is the decision attribute.

Instrument granularity: classifiers are trained at each level of the hierarchical tree (Hornbostel/Sachs).

We do not include membranophones because instruments in this family usually do not produce harmonic sounds, so special techniques would be needed to identify them.

Modules of the cascade classifier for single-instrument estimation - Hornbostel/Sachs

Pitch 3B: 96.02% * 98.94% = 95.00% > 91.80%
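A sketch of the cascade idea (node names and the node_classifiers mapping are hypothetical): each node of the hierarchy gets its own classifier, and per-level confidences multiply, as in the 96.02% * 98.94% = 95.00% figure above.

```python
def cascade_classify(x, node_classifiers, root="root"):
    """Route a sample down the hierarchy, one classifier per node."""
    node, confidence = root, 1.0
    while node in node_classifiers:          # stop when we reach a leaf (instrument)
        clf, children = node_classifiers[node]
        probs = clf.predict_proba([x])[0]
        best = probs.argmax()
        confidence *= probs[best]            # cascade confidence is the product
        node = children[best]
    return node, confidence
```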

New Experiment:

Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).

Training set: 2762 frames extracted from the following instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.

Classifiers - WEKA:

(1) KNN with Euclidean distance (spectrum-match-based classification);

(2) Decision Tree (classification based on previously extracted features).

Confidence - the ratio of correctly classified instances to the total number of instances.

Classification on different Feature Groups

Group   Feature description                                           KNN       Decision Tree
A       33 Spectrum Flatness Band Coefficients                        99.23%    94.69%
B       13 MFCC coefficients                                          98.19%    93.57%
C       28 Harmonic Peaks                                             86.60%    91.29%
D       38 Spectrum projection coefficients                           47.45%    31.81%
E       Log spectral centroid, spread, flux, rolloff, zerocrossing    99.34%    99.77%

Feature and classifier selection at each level of the cascade system

Root: KNN + Band Coefficients

Node         chordophone         aerophone           idiophone
Feature      Band Coefficients   MFCC coefficients   Band Coefficients
Classifier   KNN                 KNN                 KNN

Node         chrd_composite      aero_double-reed    aero_lip-vibrated   aero_side           aero_single-reed    idio_struck
Feature      Band Coefficients   MFCC coefficients   MFCC coefficients   MFCC coefficients   Band Coefficients   Band Coefficients
Classifier   KNN                 KNN                 KNN                 KNN                 Decision Tree       KNN

Classification on the combination of different feature groups

[Two charts: classification based on KNN; classification based on Decision Tree.]

From these two experiments we see that:

1) The KNN classifier works better with feature vectors such as spectral flatness coefficients, projection coefficients, and MFCC.

2) The decision tree works better with harmonic peaks and statistical features.

Simply adding more features together does not improve the classifiers and sometimes even worsens classification results (e.g., adding harmonic peaks to other feature groups).

HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS

Seven common methods to calculate the distance or similarity between clusters: single linkage (nearest neighbor), complete linkage (furthest neighbor), unweighted pair-group method using arithmetic averages (UPGMA), weighted pair-group method using arithmetic averages (WPGMA), unweighted pair-group method using the centroid average (UPGMC), weighted pair-group method using the centroid average (WPGMC), and Ward's method.

Six most common distance functions: Euclidean, Manhattan, Canberra (the sum of a series of fractional differences between coordinates of a pair of objects), Pearson correlation coefficient (PCC, which measures the degree of association between objects), Spearman's rank correlation coefficient, and Kendall's tau (which counts the number of pairwise disagreements between two lists).

Clustering algorithm - HCLUST (agglomerative hierarchical clustering), from the R package.
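A hedged sketch of the same pipeline in SciPy (the slides used R's hclust; X is an assumed frames-by-features matrix): Ward linkage expects raw Euclidean observations, while other metrics go through a precomputed distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(2884, 33)           # e.g. 33 flatness coefficients per frame

Z_ward = linkage(X, method="ward")     # Ward linkage with Euclidean distance
D = pdist(X, metric="correlation")     # 1 - Pearson correlation between frames
Z_corr = linkage(D, method="average")  # UPGMA on the precomputed distances

cluster_ids = fcluster(Z_ward, t=37, criterion="maxclust")  # cut into 37 clusters
```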

Testing Datasets

The middle C pitch group, which contains 46 different musical sound objects. Each sound object is segmented into multiple 0.12 s frames, and each frame is stored as an instance in the testing dataset - 2884 frames in total.

This dataset is represented by 3 different sets of features (MFCC, flatness coefficients, and harmonic peaks).

Total number of experiments = 3 × 7 × 6 = 126.

Clustering: when the algorithm finishes the clustering process, a cluster ID is assigned to each frame.

Contingency table derived from the clustering result:

               Cluster 1   ...   Cluster j   ...   Cluster n
Instrument 1   X11         ...   X1j         ...   X1n
...
Instrument i   Xi1         ...   Xij         ...   Xin
...
Instrument n   Xn1         ...   Xnj         ...   Xnn

Evaluation of the Hclust algorithm: the 14 results with the highest score among the 126 experiments.

Feature                 Method     Metric      α       w    score
Flatness Coefficients   ward       pearson     87.3%   37   32.30
Flatness Coefficients   ward       euclidean   85.8%   37   31.74
Flatness Coefficients   ward       manhattan   85.6%   36   30.83
mfcc                    ward       kendall     81.0%   36   29.18
mfcc                    ward       pearson     83.0%   35   29.05
Flatness Coefficients   ward       kendall     82.9%   35   29.03
mfcc                    ward       euclidean   80.5%   35   28.17
mfcc                    ward       manhattan   80.1%   35   28.04
Flatness Coefficients   ward       spearman    81.3%   34   27.63
mfcc                    ward       spearman    83.7%   33   27.62
Flatness Coefficients   ward       maximum     86.1%   32   27.56
mfcc                    ward       maximum     79.8%   34   27.12
Flatness Coefficients   mcquitty   euclidean   88.9%   30   26.67
mfcc                    average    manhattan   87.3%   30   26.20

w - number of clusters; α - average clustering accuracy over all instruments; score = α × w.
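A sketch of this evaluation (assuming α is each instrument's fraction of frames falling in its best cluster, averaged over instruments; that reading of the slide's definition is an assumption):

```python
import numpy as np

def clustering_score(instruments, cluster_ids):
    """Build the contingency table and return (alpha, w, score)."""
    insts = sorted(set(instruments))
    clusters = sorted(set(cluster_ids))
    table = np.zeros((len(insts), len(clusters)))
    for inst, cl in zip(instruments, cluster_ids):
        table[insts.index(inst), clusters.index(cl)] += 1
    # per-instrument accuracy = frames in its dominant cluster / all its frames
    alpha = float(np.mean(table.max(axis=1) / table.sum(axis=1)))
    w = len(clusters)
    return alpha, w, alpha * w
```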

Clustering result from the Hclust algorithm with the Ward linkage method and the Pearson distance measure; flatness coefficients are the selected feature.

"ctrumpet" and "bachtrumpet" are clustered in the same group, while "ctrumpet_harmonStemOut" forms its own group instead of merging with "ctrumpet". Bassoon appears as a sibling of the regular French horn, and "French horn muted" is clustered in a different group together with "English horn" and "oboe".

Looking for the optimal [classification method × data representation] in monophonic music

[Middle C pitch group - 46 different musical sound objects]

Experiment   Classification method   Description        Recall   Precision   F-score
1            non-cascade             Feature-based      64.3%    44.8%       52.81%
2            non-cascade             Spectrum-Match     79.4%    50.8%       61.96%
3            cascade                 Hornbostel/Sachs   75.0%    43.5%       55.06%
4            cascade                 play method        77.8%    53.6%       63.47%
5            cascade                 machine learned    87.5%    62.3%       72.78%

Looking for the optimal [classification method × data representation] in polyphonic music

[Middle C pitch group - 46 different musical sound objects]

Testing data: 49 polyphonic sounds created by selecting three different single-instrument sounds from the training database and mixing them together. This set of sounds is used to test again our five different arrangements of [classification method × data representation]. KNN (k=3) is used as the classifier in each experiment.

Exp   Classifier                  Method                                       Recall   Precision   F-score
1     Non-cascade                 Single-label based on sound separation       31.48%   43.06%      36.37%
2     Non-cascade                 Feature-based multi-label classification     64.49%   63.10%      63.79%
3     Non-cascade                 Spectrum-Match multi-label classification    69.44%   58.64%      63.59%
4     Cascade (Hornbostel)        Multi-label classification                   85.51%   55.04%      66.97%
5     Cascade (play method)       Multi-label classification                   66.67%   55.25%      60.43%
6     Cascade (machine learned)   Multi-label classification                   63.77%   69.67%      66.59%

WWW.MIR.UNCC.EDU

Automatic indexing system for musical instruments; intelligent query answering system for music instruments.

A user enters a query looking for a particular piece of music: Mozart, 40th Symphony.

The user is not satisfied and enters a new query: "Yes, but I'm sad today - play the same song but make it sadder."

The system returns a modified Mozart 40th Symphony - the Action Rules System.

Action Rule

An action rule is defined as a term [(ω) ∧ (α → β)] → (φ → ψ), where ω is a conjunction of fixed condition features shared by both groups, (α → β) represents the proposed changes in values of flexible features, and (φ → ψ) is the desired effect of the action.

Information System:

A    B    D
a1   b2   d1
a2   b2
a2   b2   d2

Action Rules Discovery

Meta-actions M1, M2, ..., Mn; meta-actions-based decision system S(d) = (X, A ∪ {d}, V), with A = {A1, A2, ..., Am}.

Influence Matrix E (rows - meta-actions, columns - attributes):

       A1    A2    A3    A4    ...   Am
M1     E11   E12   E13   E14   ...   E1m
M2     E21   E22   E23   E24   ...   E2m
M3     E31   E32   E33   E34   ...   E3m
M4     E41   E42   E43   E44   ...   E4m
...
Mn     En1   En2   En3   En4   ...   Enm

If E31 = [a1 → a1'], E32 = [a2 → a2'], and E34 = [a4 → a4'], then the candidate action rule

r = [(A1, a1 → a1') ∧ (A2, a2 → a2') ∧ (A4, a4 → a4')] → (d, d1 → d1')

is supported and covered by M3.
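An illustrative sketch (all data structures hypothetical) of composing a candidate action rule from one influence-matrix row:

```python
def candidate_rule(meta_action_row, decision_change):
    """meta_action_row: {attribute: (old, new)}, e.g. the M3 row of E."""
    premise = [(attr, old, new)
               for attr, (old, new) in sorted(meta_action_row.items())]
    return premise, decision_change

E3 = {"A1": ("a1", "a1'"), "A2": ("a2", "a2'"), "A4": ("a4", "a4'")}
r = candidate_rule(E3, ("d", "d1", "d1'"))
# r encodes [(A1, a1 -> a1') ^ (A2, a2 -> a2') ^ (A4, a4 -> a4')] => (d, d1 -> d1')
```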

"Action Rules Discovery without pre-existing classification rules",

Z.W. Ras, A. Dardzinska, Proceedings of RSCTC 2008 Conference, in Akron, Ohio,

LNAI 5306, Springer, 2008, 181-190 http://www.cs.uncc.edu/~ras/Papers/Ras-Aga-AKRON.pdf

Windowing

Since the window attenuates the signal at both edges, it leads to information loss due to the narrowing of the frequency spectrum. To preserve this information, consecutive analysis frames overlap in time; empirical experiments show the best overlap is two-thirds of the window size.
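A sketch of such overlapped framing (numpy assumed; a two-thirds overlap makes the hop one third of the window):

```python
import numpy as np

def overlapped_frames(signal, win=5292, overlap=2/3):
    """Split a signal into Hamming-windowed frames with the given overlap."""
    hop = max(1, int(win * (1 - overlap)))   # 2/3 overlap -> hop = win // 3
    window = np.hamming(win)
    starts = range(0, len(signal) - win + 1, hop)
    return np.array([signal[s:s + win] * window for s in starts])
```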

[Figure: overlapping analysis frames over time; Hamming window spectral leakage.]