Extracting and Using
Music Audio Information
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/
1. Motivation: Music Collections
2. Music Information
3. Music Similarity
4. Music Structure Discovery
LabROSA Overview
[Diagram: LabROSA research areas: Information Extraction, Recognition, Separation, and Retrieval, applied to Speech, Music, and Environment audio, built on Machine Learning and Signal Processing]
1. Managing Music Collections
• A lot of music data available
e.g. 60 GB of MP3 ≈ 1000 hours of audio, 15k tracks
• Management challenge
how can computers help?
• Application scenarios
personal music collection
discovering new music
“music placement”
Learning from Music
• What can we infer from 1000 h of music?
common patterns
sounds, melodies, chords, form
what is and what isn't music
[Figure: "Scatter of PCA(3:6) of 12x16 beatchroma"]
• Applications
data-driven musicology?
modeling/description/coding
computer-generated music
curiosity...
The Big Picture
[Diagram: Music audio → low-level features → melody and notes, key and chords, tempo and beat → Classification and Similarity (browsing, discovery, production) and Music Structure Discovery (modeling, generation, curiosity) .. so far]
2. Music Information
• How to represent music audio?
• Audio features
spectrogram, MFCCs, bases
[Figure: spectrogram of a music excerpt, 0-4 kHz, 0-5 s]
• Musical elements
notes, beats, chords, phrases
requires transcription
• Or something in between?
optimized for a certain task?
Transcription as Classification
Poliner & Ellis '05, '06, '07
• Exchange signal models for data
treat transcription as a pure classification problem:
feature representation → feature vector → classification posteriors → HMM smoothing
Training data and features:
• MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 min of training audio)
• Normalized magnitude STFT
Classification:
• N binary SVMs (one for each note)
• Independent frame-level classification on a 10 ms grid
• Distance to class boundary as posterior
Temporal smoothing:
• Two-state (on/off) independent HMM for each note; parameters learned from training data
• Find the Viterbi sequence for each note
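As a concrete illustration, here is a minimal sketch of this classify-then-smooth pipeline. It assumes hypothetical arrays X (frames x features from the normalized magnitude STFT) and Y (frames x notes, binary piano-roll labels); scikit-learn's LinearSVC stands in for the actual SVMs, and a hand-rolled two-state Viterbi does the HMM smoothing.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_note_svms(X, Y):
        """One binary SVM per note; Y[:, k] is the on/off label track for note k."""
        return [LinearSVC(C=1.0).fit(X, Y[:, k]) for k in range(Y.shape[1])]

    def viterbi_smooth(score, p_stay=0.95):
        """Two-state (off/on) Viterbi over per-frame note scores.
        score: distance to the SVM boundary, squashed to a pseudo-posterior."""
        post = 1.0 / (1.0 + np.exp(-score))           # P(on | frame)
        emit = np.stack([1.0 - post, post])           # 2 x T emission likelihoods
        trans = np.array([[p_stay, 1 - p_stay],
                          [1 - p_stay, p_stay]])
        T = emit.shape[1]
        delta = np.log(emit[:, 0] + 1e-12)
        psi = np.zeros((2, T), dtype=int)
        logA = np.log(trans)
        for t in range(1, T):
            cand = delta[:, None] + logA              # 2 x 2: predecessor x state
            psi[:, t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + np.log(emit[:, t] + 1e-12)
        path = np.zeros(T, dtype=int)
        path[-1] = delta.argmax()
        for t in range(T - 2, -1, -1):                # backtrace
            path[t] = psi[path[t + 1], t + 1]
        return path                                   # 0/1 note activity per frame

    def transcribe(svms, X):
        return np.stack([viterbi_smooth(s.decision_function(X)) for s in svms], axis=1)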
Polyphonic Transcription
• Real music excerpts + ground truth
MIREX 2007
Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
[Bar chart: Precision, Recall, Acc, Etot, Esubs, Emiss, Efa per system]
Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onset/offset
[Bar chart: Precision, Recall, Ave. F-measure, Ave. Overlap per system]
Beat Tracking
Ellis '06, '07
• Goal: one feature vector per 'beat' (tatum)
for tempo normalization, efficiency
• "Onset Strength Envelope"
O(t) = Σ_f max(0, Δ_t log |X(t, f)|), summed over mel frequency bands
[Figure: mel spectrogram (40 bands) and onset strength envelope, 0-15 s]
• Autocorrelation + window → global tempo estimate
[Figure: windowed autocorrelation vs. lag (4 ms samples, 0-1000), peak marked at 168.5 BPM]
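The onset strength formula above can be sketched in a few lines of numpy. The triangular mel filterbank here is a crude placeholder for the real 40-band mapping, and the parameter values (e.g. a hop giving the 4 ms grid at 8 kHz) are only indicative.

    import numpy as np

    def onset_strength(x, sr=8000, n_fft=256, hop=32, n_mels=40):
        """O(t) = sum over mel bands of max(0, delta_t log|X(t, f)|).
        hop=32 at sr=8000 gives one value per 4 ms, as on the lag axis above."""
        win = np.hanning(n_fft)
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        spec = np.abs(np.fft.rfft(frames * win, axis=1))         # T x F
        # crude triangular mel filterbank (placeholder for a proper one)
        freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
        mel = 2595 * np.log10(1 + freqs / 700.0)
        edges = np.linspace(mel.min(), mel.max(), n_mels + 2)
        fb = np.maximum(0, 1 - np.abs(mel[None, :] - edges[1:-1, None]) /
                        (edges[1] - edges[0]))                    # n_mels x F
        melspec = np.log(spec @ fb.T + 1e-10)                    # T x n_mels
        diff = np.diff(melspec, axis=0)                          # first difference in time
        return np.maximum(0, diff).sum(axis=1)                   # onset envelope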
Beat Tracking
• Dynamic Programming finds beat times {t_i}
optimizes Σ_i O(t_i) + Σ_i W((t_{i+1} − t_i − τ_p)/β)
where O(t) is the onset strength envelope (local score)
W(·) is a log-Gaussian window (transition cost)
τ_p is the default beat period from the measured tempo, β its spread
incrementally find the best predecessor at every time
backtrace from the largest final score to get the beats
[Figure: accumulating the best score C*(t) from O(t) and windowed predecessor scores at candidate lags τ]
C*(t) = γ O(t) + (1 − γ) max_τ {W((τ − τ_p)/β) C*(t − τ)}
P(t) = t − argmax_τ {W((τ − τ_p)/β) C*(t − τ)}
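A compact rendering of this dynamic program, assuming an onset envelope O on the 4 ms grid and a target period tau_p in samples; gamma trades the local score against the log-Gaussian spacing window, following the recurrence above. A sketch, not the exact published tracker.

    import numpy as np

    def track_beats(O, tau_p, beta=0.7, gamma=0.5):
        """DP beat tracker: C[t] = gamma*O[t] + (1-gamma)*max_tau W(tau)*C[t-tau]."""
        T = len(O)
        C = np.zeros(T)                # best cumulative score with a beat at t
        P = np.zeros(T, dtype=int)     # best predecessor beat time
        lags = np.arange(int(0.5 * tau_p), int(2 * tau_p))     # plausible spacings
        W = np.exp(-0.5 * (np.log(lags / tau_p) / beta) ** 2)  # log-Gaussian window
        for t in range(T):
            ok = lags <= t
            if ok.any():
                scores = W[ok] * C[t - lags[ok]]
                best = scores.argmax()
                C[t] = gamma * O[t] + (1 - gamma) * scores[best]
                P[t] = t - lags[ok][best]
            else:
                C[t] = gamma * O[t]
        beats = [int(C.argmax())]      # backtrace from the largest final score
        while P[beats[-1]] > 0:
            beats.append(int(P[beats[-1]]))
        return beats[::-1]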
Beat Tracking
• DP will bridge gaps (non-causal)
there is always a best path ...
[Figure: Alanis Morissette - "All I Want" - Bark-band spectrogram with a gap, plus tracked beats, 182-192 s]
• 2nd place in MIREX 2006 Beat Tracking
compared to McKinney & Moelants human tapping data
[Figure: test 2 (Bragg) - Bark-band spectrogram with McKinney & Moelants per-subject tap times, 0-15 s]
Chroma Features
• Chroma features convert spectral energy into musical weights in a canonical octave
i.e. 12 semitone bins
[Figure: instantaneous-frequency chroma of a piano chromatic scale, chroma bins A-G vs. time]
• Can resynthesize as "Shepard Tones"
all octaves at once
[Figure: 12 Shepard-tone spectra (level/dB vs. frequency, 0 to -60 dB); spectrograms of the piano chromatic scale and its Shepard-tone resynthesis, 0-4 kHz, 0-10 s]
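For illustration, a bare-bones chroma computation that folds STFT magnitudes into 12 semitone bins, assuming 440 Hz tuning; the instantaneous-frequency variant in the figure sharpens the per-bin frequency estimates but folds them the same way.

    import numpy as np

    def chroma_from_spectrum(spec, sr, n_fft):
        """Fold STFT magnitudes (T x F) into 12 semitone bins per frame."""
        freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
        valid = freqs > 25.0                          # ignore sub-audio bins
        # semitone number relative to A440, folded into one octave
        semis = 12 * np.log2(freqs[valid] / 440.0)
        bins = np.round(semis).astype(int) % 12
        chroma = np.zeros((spec.shape[0], 12))
        for b in range(12):
            chroma[:, b] = spec[:, valid][:, bins == b].sum(axis=1)
        return chroma / (chroma.max(axis=1, keepdims=True) + 1e-10)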
Key Estimation
Ellis ICASSP '07
• Covariance of chroma reflects key
• Normalize by transposing for best fit
single Gaussian model of one piece
find ML rotation of other pieces
model all transposed pieces
iterate until convergence
[Figure: aligned chroma distributions for Beatles tracks (Taxman, Eleanor Rigby, I'm Only Sleeping, Love You To, She Said She Said, Good Day Sunshine, And Your Bird Can Sing, Yellow Submarine) and the aligned global model, chroma bins A-G]
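One way to sketch this alignment loop: fit a single Gaussian to one piece's chroma frames, rotate every other piece to its maximum-likelihood transposition, refit the model on everything, and repeat. This is a simplified reading of the slide, not the exact ICASSP '07 code.

    import numpy as np

    def gaussian_loglik(X, mu, cov):
        """Total log-likelihood of chroma frames X (T x 12), up to a constant."""
        d = X - mu
        icov = np.linalg.inv(cov + 1e-6 * np.eye(12))
        _, logdet = np.linalg.slogdet(cov + 1e-6 * np.eye(12))
        return -0.5 * (np.einsum('ij,jk,ik->i', d, icov, d) + logdet).sum()

    def align_keys(pieces, n_iter=5):
        """pieces: list of (T_i x 12) chroma arrays. Returns semitone rotations."""
        rots = [0] * len(pieces)
        mu, cov = pieces[0].mean(0), np.cov(pieces[0].T)     # seed: first piece
        for _ in range(n_iter):
            for i, X in enumerate(pieces):                   # best rotation per piece
                scores = [gaussian_loglik(np.roll(X, r, axis=1), mu, cov)
                          for r in range(12)]
                rots[i] = int(np.argmax(scores))
            aligned = np.vstack([np.roll(X, r, axis=1)       # refit on all pieces
                                 for X, r in zip(pieces, rots)])
            mu, cov = aligned.mean(0), np.cov(aligned.T)
        return rots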
Chord Transcription
Sheh & Ellis ’03
• “Real Books” give chord transcriptions
but no exact timing
.. just like speech transcripts
# The Beatles - A Hard Day's Night
#
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
Bm Em Bm G Em C D G Cadd9 G F6 G Cadd9 G
F6 G C D G C9 G D
G C7 G F6 G C7 G F6 G C D G C9 G Bm Em Bm
G Em C D
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
C9 G Cadd9 Fadd9
• Use EM to simultaneously learn and align chord models
[Figure: EM training loop (speech example): model inventory (ae1, ae2, dh1, ...); labelled training data ("dh ax k ae t", "s ae t aa n"); uniform initialization of alignments; E-step: probabilities p(q_i^n | X_1^N, Θ_old) of the unknowns; M-step: maximize parameters, Θ: max E[log p(X, Q | Θ)]; repeat until convergence]
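A toy "hard-EM" version of this idea: alternate between aligning each song's known chord sequence to its chroma frames with a DP pass, and refitting per-chord models (means only here) from the aligned frames. The names and the isotropic-Gaussian scoring are illustrative simplifications of Sheh & Ellis's HMM system.

    import numpy as np

    def align_sequence(frames, chord_means, seq):
        """Monotonic DP alignment of frames (T x 12) to a chord label sequence."""
        T, S = len(frames), len(seq)
        ll = np.stack([-((frames - chord_means[c]) ** 2).sum(1) for c in seq], 1)
        D = np.full((T, S), -np.inf)
        D[0, 0] = ll[0, 0]
        for t in range(1, T):
            for s in range(min(t + 1, S)):         # stay on a chord or advance
                stay = D[t - 1, s]
                move = D[t - 1, s - 1] if s > 0 else -np.inf
                D[t, s] = max(stay, move) + ll[t, s]
        path = np.zeros(T, dtype=int)              # backtrace (must end on last chord)
        s = S - 1
        for t in range(T - 1, 0, -1):
            path[t] = s
            if s > 0 and D[t - 1, s - 1] >= D[t - 1, s]:
                s -= 1
        path[0] = 0
        return [seq[i] for i in path]               # chord label per frame

    def hard_em(songs, n_iter=5):
        """songs: list of (chroma_frames, chord_sequence). Returns chord -> mean."""
        chords = sorted({c for _, seq in songs for c in seq})
        means = {c: np.random.rand(12) for c in chords}
        for _ in range(n_iter):
            buckets = {c: [] for c in chords}
            for frames, seq in songs:               # E-step: hard alignment
                for f, c in zip(frames, align_sequence(frames, means, seq)):
                    buckets[c].append(f)
            for c in chords:                        # M-step: refit chord means
                if buckets[c]:
                    means[c] = np.mean(buckets[c], axis=0)
        return means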
Chord Transcription
• Frame-level accuracy
    Feature    Recog.   Alignment
    MFCC       8.7%     22.0%
    PCP_ROT    21.7%    76.0%
(random ~3%)
MFCCs are poor (can overtrain)
PCPs better (ROT helps generalization)
[Figure: Beatles - Beatles For Sale - "Eight Days a Week" (4096-pt frames): pitch-class intensity vs. time, 16.27-24.84 s, with true, aligned, and recognized chord sequences (E, G, D, Bm, Am, Em7, ...)]
• Needed more training data...
3. Music Similarity
• The most central problem...
motivates extracting musical information
supports real applications (playlists, discovery)
• But do we need content-based similarity?
compete with collaborative filtering
compete with fingerprinting + metadata
• Maybe ... for the Future of Music
connect listeners directly to musicians
Discriminative Classification
Mandel & Ellis '05
• Classification as a proxy for similarity
• Distribution models...
[Diagram: training MFCCs for each artist → GMMs; test song → minimum KL divergence → artist label]
• vs. SVM
[Diagram: training MFCCs → song features per artist → DAG SVM → artist label for test song]
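The distribution-model branch can be sketched with one Gaussian per artist and the closed-form KL divergence between Gaussians; the slide's GMM version has no closed-form KL, so a single full-covariance Gaussian is a common simplification.

    import numpy as np

    def fit_gaussian(mfccs):
        """mfccs: T x D frames -> (mean, covariance)."""
        return mfccs.mean(0), np.cov(mfccs.T)

    def kl_gauss(p, q):
        """KL( N_p || N_q ) in closed form."""
        (mu0, S0), (mu1, S1) = p, q
        d = len(mu0)
        iS1 = np.linalg.inv(S1)
        diff = mu1 - mu0
        _, ld0 = np.linalg.slogdet(S0)
        _, ld1 = np.linalg.slogdet(S1)
        return 0.5 * (np.trace(iS1 @ S0) + diff @ iS1 @ diff - d + ld1 - ld0)

    def nearest_artist(test_mfccs, artist_models):
        """Pick the artist whose Gaussian is closest in symmetrized KL."""
        g = fit_gaussian(test_mfccs)
        return min(artist_models, key=lambda a: kl_gauss(g, artist_models[a])
                                              + kl_gauss(artist_models[a], g))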
Segment-Level Features
Mandel & Ellis '07
• Statistics of spectra and envelope define a point in feature space
for SVM classification, or Euclidean similarity...
[Poster excerpt: "LabROSA's audio music similarity and classification system", Michael I Mandel and Daniel P W Ellis, LabROSA · Dept of Electrical Engineering · Columbia University, New York, {mim,dpwe}@ee.columbia.edu]
1. Feature Extraction
[Feat]ures are the same as Mandel and Ellis (2005): mean and covariance of MFCCs
[Rhythm fea]tures are similar to those in Rauber et al. (2002)
[Audio is c]ut into overlapping 10-second clips
mel spectrum of each clip, averaging clips' features
mel frequencies together to get magnitude bands for low, ...
2. Similarity and Classification
• We use a DAG-SVM for n-way classification of songs (Platt et al., 2000)
• The distance between songs is the Euclidean distance between their feature vectors
• During testing, some songs were a top similarity match for many songs
• Re-normalizing each song's feature vector avoided this problem
3. Results
[Bar charts: Audio Music Similarity and Audio Classification scores; see next slide]
MIREX'07 Results
• One system for similarity and classification
[Bar charts: Audio Music Similarity (Greater0, Psum, Fine, WCsum, SDsum, Greater1) for systems PS, GT, LB, CB1, TL1, ME, TL2, CB2, CB3, BK1, PC, BK2; Audio Classification (Genre ID Hierarchical, Genre ID Raw, Mood ID, Composer ID, Artist ID) for systems IMsvm, MEspec, ME, TL, GT, IMknn, KL, CL, GH]
PS = Pohle, Schnitzer; GT = George Tzanetakis; LB = Barrington, Turnbull, Torres, Lanckriet; CB = Christoph Bastuck; TL = Lidy, Rauber, Pertusa, Iñesta; ME = Mandel, Ellis; BK = Bosteels, Kerre; PC = Paradzinets, Chen
IM = IMIRSEL M2K; ME = Mandel, Ellis; TL = Lidy, Rauber, Pertusa, Iñesta; GT = George Tzanetakis; KL = Kyogu Lee; CL = Laurier, Herrera; GH = Guaus, Herrera
Active-Learning Playlists
• SVMs are well suited to "active learning"
solicit labels on items closest to the current boundary (see the sketch below)
• Automatic player with "skip"
skip = ground-truth data collection
active-SVM automatic playlist generation
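A sketch of the active-SVM loop, assuming a feature matrix X for the collection and a user who answers keep-or-skip (simulated here by a label_fn callback); scikit-learn's SVC supplies the boundary distances. All names are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    def active_playlist(X, label_fn, seed_pos, seed_neg, rounds=10):
        """Grow a labeled set by querying the items closest to the SVM boundary."""
        labeled = {seed_pos: 1, seed_neg: 0}
        for _ in range(rounds):
            idx = np.array(list(labeled))
            clf = SVC(kernel='rbf', gamma='scale').fit(
                X[idx], [labeled[i] for i in idx])
            dist = np.abs(clf.decision_function(X))
            dist[idx] = np.inf                      # don't re-query labeled items
            query = int(dist.argmin())              # most uncertain item
            labeled[query] = label_fn(query)        # user plays it: keep=1, skip=0
        clf = SVC(kernel='rbf', gamma='scale').fit(
            X[list(labeled)], [labeled[i] for i in labeled])
        return np.argsort(-clf.decision_function(X))  # most 'keep'-like first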
Cover Song Detection
Ellis & Poliner '07
• "Cover Songs" = reinterpretation of a piece
different instrumentation, character
no match with "timbral" features
[Figure: spectrograms of "Let It Be", verse 1: The Beatles vs. Nick Cave, 0-4 kHz, 0-10 s]
• Need a different representation!
beat-synchronous chroma features
[Figure: beat-synchronous chroma features (bins A-G) for both versions, 25 beats]
Beat-Synchronous Chroma Features
• Beat + chroma features / 30 ms frames
→ average chroma within each beat
compact; sufficient?
[Figure: onset envelope with detected beats; chroma vs. time in frames; chroma vs. time in beats]
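The beat-averaging step is tiny; a sketch assuming chroma (T x 12 on the 30 ms frame grid) and beats (frame indices from the beat tracker):

    import numpy as np

    def beat_sync(chroma, beats):
        """Average chroma frames between consecutive beats -> one vector per beat."""
        bounds = np.concatenate([[0], beats, [len(chroma)]])
        return np.stack([chroma[a:b].mean(axis=0)
                         for a, b in zip(bounds[:-1], bounds[1:]) if b > a])

The result is the compact tempo-normalized matrix (12 x n_beats) used for matching on the next slide.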
Matching: Global Correlation
• Cross-correlate entire beat-chroma matrices
... at all possible transpositions
implicit combination of match quality and duration
[Figure: beat-chroma matrices for Elliott Smith and Glen Phillips, "Between the Bars" (~500 beats @ 281 BPM), and their cross-correlation vs. skew in beats (-500 to +400) and semitones (±5)]
• One good matching fragment is sufficient...?
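The matching step, sketched: correlate two beat-chroma matrices over all 12 rotations and all skews, and keep the peak. Inputs are assumed per-column normalized; a production version would use FFT-based correlation, but this shows the logic.

    import numpy as np

    def cover_score(A, B):
        """A, B: (12 x n_beats) beat-chroma matrices.
        Peak cross-correlation over transposition and beat skew."""
        best = 0.0
        n = A.shape[1] + B.shape[1] - 1
        for rot in range(12):                       # all semitone transpositions
            Br = np.roll(B, rot, axis=0)
            xc = np.zeros(n)
            for c in range(12):                     # sum per-chroma-row correlations
                xc += np.correlate(A[c], Br[c], mode='full')
            best = max(best, xc.max())
        return best                                 # higher = more cover-like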
MIREX 06 Results
• Cover song contest
30 songs x 11 versions of each (!)
(data has not been disclosed)
8 systems compared (4 cover song + 4 similarity)
• Found 761/3300 = 23% recall
next best: 11%
guess: 3%
[Figure: MIREX 06 Cover Song Results: # true covers retrieved in top 10, per query song (30 rows, one per song set) per system (CS, DE, KL1, KL2, KWL, KWT, LR, TP; cover-song vs. similarity systems)]
Cross-Correlation Similarity
• Use the cover-song approach to find similarity
e.g. a similar note/instrumentation sequence may sound very similar to judges
• Numerous variants
try on chroma (melody/harmony) and MFCCs (timbre)
try full search (xcorr) or landmarks (indexable)
compare to random, segment-level stats
• Evaluate by subjective tests
modeled after MIREX similarity
Cross-Correlation Similarity
• Human web-based judgments
binary judgments for speed
6 users x 30 queries x 10 candidate returns
• Landmark variant: use the BIC method [8] to find the boundary in the feature matrix that maximizes the fit of separate Gaussians on either side compared to a single Gaussian, i.e. the time that splits the array into maximally dissimilar halves; evaluating only at that single skew may miss some matching fragments, but such simplicity may be the only option for databases consisting of millions of tracks
Table 1. Results of the subjective similarity evaluation. Counts are the number of times the best hit returned by each algorithm was rated as similar by a human rater. Each algorithm provided one return for each of 30 queries, and was judged by 6 raters, hence the counts are out of a maximum possible of 180.
    Algorithm                        Similar count
    (1) Xcorr, chroma                48/180 = 27%
    (2) Xcorr, MFCC                  48/180 = 27%
    (3) Xcorr, combo                 55/180 = 31%
    (4) Xcorr, combo + tempo         34/180 = 19%
    (5) Xcorr, combo at boundary     49/180 = 27%
    (6) Baseline, MFCC               81/180 = 45%
    (7) Baseline, rhythmic           49/180 = 27%
    (8) Baseline, combo              88/180 = 49%
    Random choice 1                  22/180 = 12%
    Random choice 2                  28/180 = 16%
Baselines: (6) MFCC mean and covariance (as in [2]); (7) subband rhythmic features (modulation spectra, similar to [10]); (8) a simple summation of the normalized scores under these two measures; plus two randomly-chosen clips for comparison
• Cross-correlation inferior to baseline...
... but is getting somewhere, even with 'landmark' ...
Cross-Correlation Similarity
• Results are not overwhelming
.. but database is only a few thousand clips
"Anchor Space"
Berenzweig & Ellis '03
• Acoustic features describe each song
.. but from a signal, not a perceptual, perspective
.. and not the differences between songs
• Use genre classifiers to define new space
prototype genres are "anchors"
[Diagram: audio input (class i, class j) → anchor classifiers p(a1|x), p(a2|x), ... p(an|x) → n-dimensional vectors in "Anchor Space" → GMM modeling → similarity computation (KL-d, EMD, etc.)]
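Converting frames to anchor space is just stacking classifier posteriors; here logistic-regression "anchors" stand in for the genre classifiers (all names illustrative, assuming frame features X and genre labels for the anchor training set).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_anchors(X, genre_labels):
        """One probabilistic one-vs-rest classifier per anchor genre."""
        y = np.array(genre_labels)
        return {g: LogisticRegression(max_iter=1000).fit(X, y == g)
                for g in sorted(set(genre_labels))}

    def to_anchor_space(anchors, frames):
        """frames (T x D) -> (T x n_anchors) vectors of posteriors p(a_i | x)."""
        return np.column_stack([anchors[g].predict_proba(frames)[:, 1]
                                for g in sorted(anchors)])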
"Anchor Space"
• Frame-by-frame high-level categorizations
compare to raw features?
[Figure: Madonna vs. Bowie frames as Anchor Space Features (Electronica vs. Country posteriors) and as Cepstral Features (third vs. fifth cepstral coefficients)]
properties in distributions? dynamics?
'Playola' Similarity Browser
[Screenshot: the Playola similarity browser interface]
Ground-truth Data
Ellis et al. '02
• Hard to evaluate Playola's 'accuracy'
user tests...
ground truth?
• "Musicseer" online survey/game:
ran for 9 months in 2002
> 1,000 users, > 20k judgments
http://labrosa.ee.columbia.edu/projects/musicsim/
"Semantic Bases"
• Describe a segment in human-relevant terms
e.g. anchor space, but more so
• Need ground truth...
what words do people use?
• MajorMiner game:
400 users
7,500 unique tags
70,000 taggings
2,200 10-sec clips used
• Train classifiers...
4. Music Structure Discovery
• Use the many examples to map out the "manifold" of music audio
... and hence define the subset that is music
[Figure: "32 GMMs on 1000 MFCC20s": log-likelihood (x 10^4) of each artist's test tracks under 18 artist models (aerosmith, beatles, bryan_adams, creedence_clearwater_revival, dave_matthews_band, depeche_mode, fleetwood_mac, garth_brooks, genesis, green_day, madonna, metallica, pink_floyd, queen, rolling_stones, roxette, tina_turner, u2)]
• Problems
alignment/registration of data
factoring & abstraction
separating parts?
Eigenrhythms: Drum Pattern Space
Ellis & Arroyo ‘04
• Pop songs built on repeating “drum loop”
variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture variations?
by analyzing lots of (MIDI) data, or from audio
• Applications
music categorization
“beat box” synthesis
insight
Aligning the Data
• Need to align patterns prior to modeling...
tempo (stretch): by inferring BPM & normalizing
downbeat (shift): correlate against 'mean' template
Eigenrhythms (PCA)
• Need 20+ Eigenvectors for good coverage
of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract
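The eigen-analysis itself is plain PCA, assuming patterns is a hypothetical (n_songs x 1200) array of flattened, tempo- and downbeat-aligned drum maps:

    import numpy as np

    def eigenrhythms(patterns, n_components=20):
        """PCA over aligned drum patterns: mean pattern + principal variations."""
        mean = patterns.mean(axis=0)
        U, s, Vt = np.linalg.svd(patterns - mean, full_matrices=False)
        basis = Vt[:n_components]                  # the 'eigenrhythms'
        coords = (patterns - mean) @ basis.T       # each song as ~20 numbers
        return mean, basis, coords

    # resynthesis from a point in eigen-space (the 'beat box' idea):
    # pattern = mean + coords[0] @ basis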
Posirhythms (NMF)
[Figure: six NMF bases ("Posirhythms" 1-6), each showing hi-hat (HH), snare (SN), and bass drum (BD) weight (±0.1) across a 4-beat pattern (beats @ 120 BPM)]
• Nonnegative: only adds beat-weight
• Capturing some structure
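The nonnegative counterpart swaps PCA for NMF with the standard multiplicative updates (Lee & Seung), so the bases can only add beat-weight; a minimal sketch assuming a nonnegative matrix V of stacked aligned patterns:

    import numpy as np

    def nmf(V, k=6, n_iter=200, eps=1e-9):
        """Factor V (n_patterns x dims) ~= W @ H with W, H >= 0."""
        rng = np.random.default_rng(0)
        n, d = V.shape
        W = rng.random((n, k)) + eps               # per-pattern activations
        H = rng.random((k, d)) + eps               # 'posirhythm' bases
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative updates keep
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # everything nonnegative
        return W, H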
Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
Melody Clustering
• Goal: find 'fragments' that recur in melodies
.. across a large music database
.. trade data for model sophistication
[Pipeline: training data → melody extraction → 5-second fragments → VQ clustering → top clusters]
• Data sources
pitch tracker, or MIDI training data
• Melody fragment representation
DCT(1:20) - removes average, smooths detail (see the sketch below)
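The fragment representation and VQ stage might look like this: low-order DCT coefficients of each 5-second pitch contour (dropping the DC term removes the average; truncation smooths detail), then k-means as the vector quantizer. Exact coefficient counts and cluster sizes are assumptions.

    import numpy as np
    from scipy.fftpack import dct
    from sklearn.cluster import KMeans

    def fragment_features(contours, n_coef=20):
        """contours: (n_fragments x T) pitch tracks -> 20 low-order non-DC DCTs."""
        C = dct(contours, type=2, norm='ortho', axis=1)
        return C[:, 1:n_coef + 1]                  # skip coef 0: removes the average

    def cluster_fragments(contours, k=100):
        feats = fragment_features(contours)
        km = KMeans(n_clusters=k, n_init=10).fit(feats)
        return km                                  # km.cluster_centers_: common contours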
Melody Clustering
• Clusters match underlying contour:
• Some interesting matches:
e.g. Pink + Nsync
Beat-Chroma Fragment Codebook
• Idea: find the very popular music fragments
e.g. perfect cadence, rising melody, ...?
• Clustering a large enough database should reveal these
but: registration of phrase boundaries, transposition
• Need to deal with really large datasets
e.g. 100k+ tracks, multiple landmarks in each
but: Locality Sensitive Hashing can help - quickly finds 'most' points within a certain radius (sketch below)
• Experiments in progress...
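A toy random-projection LSH conveys the idea, assuming unit-normalized beat-chroma fragment vectors; real systems use multiple tables and tuned key lengths, and always verify candidates with the true distance.

    import numpy as np
    from collections import defaultdict

    class LSHIndex:
        """Hash points so that near neighbors usually share a bucket key."""
        def __init__(self, dim, n_bits=16, seed=0):
            self.planes = np.random.default_rng(seed).normal(size=(n_bits, dim))
            self.buckets = defaultdict(list)

        def _key(self, v):
            return tuple((self.planes @ v) > 0)    # sign pattern = bucket key

        def add(self, idx, v):
            self.buckets[self._key(v)].append(idx)

        def query(self, v):
            return self.buckets[self._key(v)]      # candidates; verify with true distance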
Conclusions
[Diagram: Music audio → low-level features → melody and notes, key and chords, tempo and beat → Classification and Similarity (browsing, discovery, production) and Music Structure Discovery (modeling, generation, curiosity)]
• Lots of data
+ noisy transcription
+ weak clustering
→ musical insights?