Music Information Retrieval for Jazz
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,thierry}@ee.columbia.edu
1. Music Information Retrieval
2. Automatic Tagging
3. Musical Content
4. Future Work

MIR for Jazz - Dan Ellis · http://labrosa.ee.columbia.edu/ · 2012-11-15 · 1/29
Machine Listening
• Extracting useful information from sound
◦ ... like (we) animals do
[Figure: machine listening tasks arranged by Task (Detect, Classify, Describe) vs. Domain (Speech, Music, Environmental Sound): VAD, speech/music detection, ASR, music transcription, automatic narration, emotion, music recommendation, environment awareness, "Sound Intelligence"]
1. The Problem
• We have a lot of music. Can computers help?
• Applications
◦ archive organization
◦ music recommendation
◦ musicological insight?
Music Information Retrieval (MIR)
• Small field that has grown since ~2000
◦ musicologists, engineers, librarians
◦ significant commercial interest
• MIR as musical analog of text IR
◦ find stuff in large archives
• Popular tasks
◦ genre classification
◦ chord, melody, full transcription
◦ music recommendation
• Annual evaluations
◦ "standard" test corpora - pop
2. Automatic Tagging
• Statistical Pattern Recognition: finding matches to training examples
[Figure: classification pipeline: sensor signal → pre-processing/segmentation → segment → feature extraction → feature vector → classification → post-processing → class]
• Need:
◦ feature design
◦ labeled training examples
• Applications
◦ Genre ◦ Instrumentation ◦ Artist ◦ Studio ...
Features: MFCC
• Mel-Frequency Cepstral Coefficients
◦ the standard features from speech recognition
[Figure: MFCC computation chain: sound waveform → FFT X[k] (spectra) → mel-scale frequency warp of log|X[k]| (audspec) → IFFT (cepstra) → truncate → MFCCs]
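The chain above can be sketched in plain numpy. This is a minimal illustration for intuition, not the talk's implementation; the frame, FFT, and filterbank sizes are arbitrary choices:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(0.0, mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
        if r > c:
            fb[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
    return fb

def mfcc(x, sr, n_mfcc=13, n_fft=512, hop=256, n_mels=26):
    # short-time magnitude spectra (Hann-windowed frames)
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # mel warp + log, then DCT-II along the mel axis, truncated
    logmel = np.log(spec @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    k = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * k + 1) / (2 * n_mels))
    return logmel @ basis.T            # (n_frames, n_mfcc)
```

Each row of the result is one short-time frame's truncated cepstrum.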
Representing Audio
• MFCCs are short-time features (25 ms)
• Sound is a "trajectory" in MFCC space
[Figure: spectrogram of VTS_04_0001 (0-9 s), the corresponding 20-dimensional MFCC features over time, and the resulting 20×20 MFCC covariance matrix]
• Describe whole track by its statistics
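"Describe whole track by its statistics" is literally a mean and covariance over the MFCC trajectory. A minimal sketch, where `mfccs` is any (n_frames, n_dims) feature array:

```python
import numpy as np

def track_summary(mfccs):
    """Fixed-size description of a track, independent of its length:
    the per-dimension mean and full covariance of its MFCC vectors."""
    mu = mfccs.mean(axis=0)
    cov = np.cov(mfccs, rowvar=False)   # (n_dims, n_dims)
    return mu, cov
```

Two tracks can then be compared by comparing these fixed-size summaries rather than their raw frames.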
MFCCs for Music
• Can resynthesize MFCCs by shaping noise
◦ gives an idea about the information retained
[Figure: "Freddie Freeloader" (0-60 s): original spectrogram, mel-quefrency MFCCs, and noise-excited resynthesis]
Ground Truth
Mandel & Ellis '08
• MajorMiner: free-text tags for 10 s clips
◦ 400 users, 7500 unique tags, 70,000 taggings
• Example: drum, bass, piano, jazz, slow, instrumental, saxophone, soft, quiet, club, ballad, smooth, soulful, easy_listening, swing, improvisation, 60s, cool, light
Classification
[Figure: tagging pipeline: sound → chop into 10 s blocks → MFCC (20 dims) → mean, Δ, Δ², covariance (60 dims) → standardize across train set (μ, Σ; 399 samples) → one-vs-all SVM (C, γ) → average precision against ground truth]
• MFCC features + human ground truth + standard machine learning tools
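A hedged sketch of the one-vs-all setup: the talk trains per-tag SVMs, while here regularized least-squares scorers stand in, since the shape of the solution (one binary scorer per label over standardized clip features) is the point:

```python
import numpy as np

def one_vs_all_fit(X, y, n_classes, reg=1e-3):
    """Train one linear scorer per class by regularized least squares
    (a stand-in for per-tag SVMs; different criterion, same structure)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append bias term
    T = np.where(y[:, None] == np.arange(n_classes), 1.0, -1.0)
    return np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ T)

def one_vs_all_predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).argmax(axis=1)                     # highest scorer wins
```

For real tags (which are not mutually exclusive), each scorer would instead be thresholded independently.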
Classification Results
• Classifiers trained from top 50 tags
[Figure: spectrogram of "01 Soul Eyes" (0-320 s) and time-varying classifier scores for the top 50 tags, ordered from drum, guitar, male, synth, rock, electronic, pop, vocal, bass, female, dance, techno, piano, jazz, hip_hop, rap, slow, beat, voice, 80s, electronica, instrumental, fast, saxophone, keyboard ... down to punk, horns, singing, drum_bass, trance, club, _90s]
3. Musical Content
• MFCCs (and speech recognizers) don't respect pitch
◦ pitch is important
◦ visible in spectrogram
[Figure: log-frequency spectrogram, 46-54 s]
• Pitch-related tasks
◦ note transcription
◦ chord transcription
◦ matching by musical content ("cover songs")
Note Transcription
Poliner & Ellis '05, '06, '07
[Figure: processing stages: feature representation → feature vector → classification posteriors → HMM smoothing]
• Training data and features:
◦ MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 mins of training audio)
◦ normalized magnitude STFT
• Classification:
◦ N binary SVMs (one for each note)
◦ independent frame-level classification on a 10 ms grid
◦ distance to class boundary as posterior
• Temporal smoothing:
◦ two-state (on/off) independent HMM for each note; parameters learned from training data
◦ find Viterbi sequence for each note
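The per-note temporal smoothing can be sketched as a two-state Viterbi pass over the frame posteriors. Here the self-transition probability is an assumed constant; the talk learns these parameters from training data:

```python
import numpy as np

def smooth_note(posteriors, p_stay=0.9):
    """Viterbi smoothing of one note's frame-wise 'on' posteriors
    with a two-state (off/on) HMM. Returns the 0/1 state path."""
    n = len(posteriors)
    logA = np.log(np.array([[p_stay, 1 - p_stay],
                            [1 - p_stay, p_stay]]))
    # emission log-likelihoods for states off(0) / on(1)
    obs = np.log(np.stack([1 - posteriors, posteriors], axis=1) + 1e-12)
    delta = obs[0].copy()
    back = np.zeros((n, 2), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + logA          # (previous, current)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs[t]
    path = np.zeros(n, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n - 2, -1, -1):              # trace back best path
        path[t] = back[t + 1, path[t + 1]]
    return path
```

The effect is that isolated single-frame detections are suppressed while sustained runs survive.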
Polyphonic Transcription
• Real music excerpts + ground truth (MIREX 2007)
• Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
[Figure: bar chart (0-1.25) of Precision, Recall, Acc, Etot, Esubs, Emiss, Efa per system]
• Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onset/offset
[Figure: bar chart (0-1.25) of Precision, Recall, Ave. F-measure, Ave. Overlap per system]
Chroma Features
Fujishima 1999; Warren et al. 2003
• Idea: project onto 12 semitones regardless of octave
◦ maintains main "musical" distinction
◦ invariant to musical equivalence
◦ no need to worry about harmonics?
[Figure: spectrogram (FFT bins) vs. chroma representation (pitch classes A-G), ~0-8 s]

C(b) = Σ_{k=0..N} B(12 log2(k/k0) − b) W(k) |X[k]|

W(k) is a weighting, B(b) selects every bin mapping to chroma class b (~ mod 12)
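The C(b) sum above amounts to binning STFT magnitude by pitch class. A minimal sketch (here `f0` plays the role of k0 and is set to A0 = 27.5 Hz, and the weighting W is omitted):

```python
import numpy as np

def chroma(x, sr, n_fft=2048, hop=512, f0=27.5):
    """Project STFT magnitude onto 12 pitch classes (class 0 = A,
    since the reference f0 is tuned to A0)."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    S = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    valid = freqs > 0
    # round(12*log2(f/f0)) mod 12 implements the B(.) selector
    bins = np.round(12 * np.log2(freqs[valid] / f0)).astype(int) % 12
    C = np.zeros((len(frames), 12))
    for b in range(12):
        C[:, b] = S[:, valid][:, bins == b].sum(axis=1)
    return C
```

Note that the harmonics of a note fall into other chroma bins, which is why the slide's "no need to worry about harmonics?" carries a question mark.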
Chroma Resynthesis
Ellis & Poliner 2007
• Chroma describes the notes in an octave ... but not the octave
• Can resynthesize by presenting all octaves ... with a smooth envelope
◦ "Shepard tones" - octave is ambiguous

yb(t) = Σ_{o=1..M} W(o + b/12) cos(2π · 2^(o + b/12) w0 t)

[Figure: the 12 Shepard-tone spectra (level/dB, 0-2.5 kHz) and a spectrogram of the Shepard-tone resynthesis (0-10 s)]
◦ endless sequence illusion
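The resynthesis formula maps directly to code. A sketch, where the Gaussian envelope W's center and width are arbitrary assumptions, not the talk's values:

```python
import numpy as np

def shepard_tone(b, sr=8000, dur=0.5, f0=27.5, n_oct=8):
    """Synthesize chroma class b (0-11) as a Shepard tone: cosines at
    2**(o + b/12) * f0 over several octaves under a smooth envelope,
    so the octave is ambiguous."""
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros_like(t)
    for o in range(n_oct):
        f = f0 * 2.0 ** (o + b / 12.0)
        if f >= sr / 2.0:
            break                      # stay below Nyquist
        w = np.exp(-0.5 * ((o + b / 12.0 - n_oct / 2.0) / 1.5) ** 2)
        y += w * np.cos(2 * np.pi * f * t)
    return y / np.abs(y).max()
```

Stepping b from 0 to 11 repeatedly produces the endless-sequence illusion mentioned on the slide.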
Chroma Example
• Simple Shepard-tone resynthesis
◦ can also reimpose broad spectrum from MFCCs
[Figure: "Freddie Freeloader" (0-60 s): original spectrogram, chroma (C-A), and resynthesis spectrograms]
Onset Detection
Bello et al. 2005
• Simplest thing is energy envelope

e(n0) = Σ_{n=−W/2..W/2} w[n] |x(n + n0)|²

• emphasis on high frequencies? f · |X(f, t)|
[Figure: spectrograms |X(f,t)| and high-frequency-weighted f·|X(f,t)| (level/dB) for Harnoncourt and Maracatu excerpts, 0-10 s]
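A minimal onset-strength sketch combining the energy-envelope idea with the high-frequency weighting f·|X(f,t)| suggested above. This is a spectral-flux-style variant with arbitrary parameters, not the talk's exact detector:

```python
import numpy as np

def onset_envelope(x, sr, n_fft=512, hop=128):
    """High-frequency-weighted magnitude spectrogram, followed by a
    half-wave-rectified frame-to-frame difference: one onset-strength
    value per frame step."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    S = np.abs(np.fft.rfft(frames, axis=1))
    S *= np.fft.rfftfreq(n_fft, 1.0 / sr)     # f * |X(f,t)| weighting
    flux = np.diff(S, axis=0)
    return np.maximum(flux, 0).sum(axis=1)    # keep only energy increases
```

This envelope is the input to the tempo and beat-tracking stages on the next slides.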
Tempo Estimation
• Beat tracking (may) need global tempo period τ
◦ otherwise problem is not "optimal substructure"
• Pick peak in onset envelope autocorrelation
◦ after applying "human preference" window
◦ check for sub-beat
[Figure: onset strength envelope (8-12 s), raw autocorrelation, and windowed autocorrelation showing primary and secondary tempo-period peaks (lag 0-4 s)]
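Peak-picking the windowed autocorrelation can be sketched as below. The log-Gaussian "human preference" window around 120 BPM is an assumed form, not the talk's exact weighting:

```python
import numpy as np

def estimate_tempo(onset_env, fps, bpm_center=120.0, bpm_spread=0.9):
    """Tempo (BPM) from the autocorrelation of an onset envelope,
    weighted toward humanly-preferred tempi. fps = envelope frame rate."""
    env = onset_env - onset_env.mean()
    ac = np.correlate(env, env, mode='full')[len(env) - 1:]   # lags >= 0
    lags = np.arange(1, len(ac))
    bpm = 60.0 * fps / lags
    # log-Gaussian preference window around bpm_center
    weight = np.exp(-0.5 * (np.log2(bpm / bpm_center) / bpm_spread) ** 2)
    best = lags[np.argmax(ac[1:] * weight)]
    return 60.0 * fps / best
```

The slide's "check for sub-beat" would compare this peak against the peak at half its lag before committing.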
Beat Tracking by Dynamic Programming
• To optimize

C({ti}) = Σ_{i=1..N} O(ti) + α Σ_{i=2..N} F(ti − ti−1, τp)

◦ define C*(t) as best score up to time t
◦ then build up recursively (with traceback P(t)):

C*(t) = O(t) + max_τ {αF(t − τ, τp) + C*(τ)}
P(t) = argmax_τ {αF(t − τ, τp) + C*(τ)}

◦ final beat sequence {ti} is best C* + back-trace
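The recursion maps to a short dynamic program. In this sketch F is taken as a negative squared-log deviation from the ideal beat spacing and α is an assumed weight; the search over τ is restricted to half to twice the tempo period:

```python
import numpy as np

def beat_track(onset_env, period, alpha=100.0):
    """DP beat tracking: C*(t) = O(t) + max_tau {alpha*F(t-tau) + C*(tau)}.
    period = tempo period in envelope frames; returns beat frame indices."""
    n = len(onset_env)
    C = onset_env.astype(float)         # C*(t), initialized to O(t)
    P = np.full(n, -1)                  # traceback pointers
    for t in range(1, n):
        lo = max(0, t - 2 * period)
        hi = max(lo + 1, t - period // 2)
        taus = np.arange(lo, hi)
        trans = -alpha * np.log((t - taus) / period) ** 2   # F penalty
        scores = trans + C[taus]
        best = scores.argmax()
        if scores[best] > 0:            # only link if it helps the score
            C[t] += scores[best]
            P[t] = taus[best]
    beats = [int(np.argmax(C))]
    while P[beats[-1]] >= 0:
        beats.append(int(P[beats[-1]]))
    return beats[::-1]
```

Because each C*(t) depends only on earlier values, one left-to-right pass plus a back-trace recovers the globally best beat sequence.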
Beat Tracking Results
• Prefers drums & steady tempo
[Figure: "Soul Eyes": tracked tempo period (period / 4 ms samples, 0-900) over time (0-25 s)]
Beat-Synchronous Chroma
• Record one chroma vector per beat
◦ compact representation of harmonies
[Figure: "Freddie Freeloader": original spectrogram (0-60 s), beat-synchronous chroma (C-B, 0-130 beats), Shepard-tone resynthesis, and resynthesis + MFCC spectral envelope]
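Recording one chroma vector per beat is just per-segment averaging of the frame-level chroma between consecutive beat times (a sketch):

```python
import numpy as np

def beat_sync_chroma(C, beats):
    """Collapse frame-level chroma C (n_frames, 12) to one vector per
    beat by averaging the frames between consecutive beat indices."""
    return np.array([C[a:b].mean(axis=0)
                     for a, b in zip(beats[:-1], beats[1:])])
```

The result is both far more compact and aligned to the musical time base, which is what makes the chord modeling on the next slides tractable.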
Chord Recognition
• Beat-synchronous chroma look like chords
[Figure: beat-synchronous chroma (0-20 s) with chord templates marked: C-E-G, A-C-E, B-D-G, A-C-D-F, ...]
◦ can we transcribe them?
• Two approaches
◦ manual templates (prior knowledge)
◦ learned models (from training data)
Chord Recognition System
Sheh & Ellis 2003
• Analogous to speech recognition
◦ Gaussian models of features for each chord
◦ Hidden Markov Models for chord transitions
[Figure: system diagram: test audio → beat track → 100-1600 Hz BPF → beat-synchronous chroma → HMM Viterbi → chord labels; training audio → 25-400 Hz BPF → chroma + labels → root normalize → 24 Gaussian models (unnormalized per root, C maj ... c min); label transitions counted → 24×24 transition matrix]
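The "manual templates" alternative mentioned two slides back can be sketched in a few lines; the system on this slide instead learns Gaussian models per chord and smooths with an HMM:

```python
import numpy as np

# binary chord templates on the 12 chroma bins (root at index 0)
MAJ = np.zeros(12); MAJ[[0, 4, 7]] = 1.0     # root, major third, fifth
MIN = np.zeros(12); MIN[[0, 3, 7]] = 1.0     # root, minor third, fifth

def chord_templates():
    """24 templates: 12 major then 12 minor, by rotating the root."""
    return np.array([np.roll(t, r) for t in (MAJ, MIN) for r in range(12)])

def recognize(chroma_vec):
    """Nearest-template chord label for one beat-chroma vector."""
    T = chord_templates()
    scores = T @ chroma_vec / (np.linalg.norm(T, axis=1) + 1e-9)
    i = int(scores.argmax())
    names = 'C C# D D# E F F# G G# A A# B'.split()
    return names[i % 12] + ('' if i < 12 else ':min')
```

Applied per beat, this gives a (noisy) chord sequence; the Viterbi smoothing in the diagram plays the same role for chords that the per-note HMM played for transcription.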
Chord Recognition Results
• Often works:
[Figure: "Let It Be": audio spectrogram, beat-synchronous chroma (0-18 s), recognized chords (C, G, a, F, C, G, F, C, A:min, G, ...) vs. ground-truth chords (C, G, A:min/b7, F:maj7, C, F:maj6, ...)]
• But only about 60% of the time
What did the models learn?
• Chord model centers (means)
indicate chord ‘templates’:
[Figure: "PCP_ROT family model means (train18)": mean chroma profiles (0-0.4) for DIM, DOM7, MAJ, MIN, MIN7 models across chroma bins C-B (for C-root chords)]
Chords for Jazz
• How many types?
[Figure: "Freddie Freeloader": log-frequency spectrogram (0-60 s), beat-synchronous chroma (C-B), chord likelihoods + Viterbi path over 24 chord classes, and chord-based chroma reconstruction (0-120 beats)]
Future Work
• Matching items
◦ cover songs / standards
◦ similar instruments, styles
[Figure: beat-chroma matrices (beats 120-170) for "Between the Bars" by Elliott Smith and by Glenn Phillips (aligned at −17 beats, transposed 2 semitones), and their pointwise product]
• Analyzing musical content
◦ solo transcription & modeling
◦ musical structure
• And so much more...
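The transposition search behind the cover-song matching shown above can be sketched as scoring the pointwise product over all 12 chroma rotations (the beat-offset search in the figure is omitted here for brevity):

```python
import numpy as np

def cover_score(A, B):
    """Transposition-invariant match between two beat-chroma matrices
    (12, n_beats): try all 12 rotations of B and keep the best
    pointwise-product score."""
    n = min(A.shape[1], B.shape[1])
    return max(float((A[:, :n] * np.roll(B, r, axis=0)[:, :n]).sum())
               for r in range(12))
```

A full system would also search beat offsets and normalize per-beat energy, but rotation over the chroma axis is what makes the match key-independent.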
Summary
• Finding Musical Similarity at Large Scale
[Figure: summary diagram: music audio → low-level features → tempo and beat, key and chords, melody and notes → classification and similarity (browsing, discovery, production) and music structure discovery (modeling, generation, curiosity)]