Extracting Information
from Music Audio
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/
1. Learning from Music
2. Melody Extraction
3. Music Similarity
Music Info Extraction - Ellis
2005-09-20
p. 1 /25
1. Learning from Music
• A lot of music data available
e.g. 60G of MP3
≈ 1000 hr of audio, 15k tracks
• What can we do with it?
implicit definition of ‘music’
• Quality vs. quantity
Speech recognition lesson:
10x data, 1/10th annotation, twice as useful
• Motivating Applications
music similarity / classification
computer (assisted) music generation
insight into music
Ground Truth Data
• A lot of unlabeled music data available
manual annotation is much rarer
[figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav, 0–7000 Hz over 0:02–0:18, hand-annotated with segment labels (mus, vox)]
• Unsupervised structure discovery possible
.. but labels help to indicate what you want
• Weak annotation sources
mus
artist-level descriptions
symbol sequences without timing (MIDI)
errorful transcripts
• Evaluation requires ground truth
limiting factor in Music IR evaluations?
Talk Roadmap
[diagram: Music audio feeds four numbered threads — (1) Anchor models → Semantic bases → (4) Similarity/recommend'n; (2) Melody extraction → Fragment clustering; (3) Drums extraction / Event extraction → Eigenrhythms → Synthesis/generation (?)]
2. Melody Transcription
with Graham Poliner
• Audio → Score very desirable
for data compression, searching, learning
• Full solution is elusive
signal separation of overlapping voices
music constructed to frustrate!
• Simplified problem:
“Dominant Melody” at each time frame
[spectrogram: frequency 0–4000 Hz vs. time 0–5 s, with the dominant melody to be tracked]
Conventional Transcription
• Pitched notes have harmonic spectra
→ transcribe by searching for harmonics
e.g. sinusoid modeling + grouping:
group by onset & common harmonicity
- find sets of sinusoid tracks that start around the same time
- + stable harmonic pattern
- break tracks at single frequencies: need to detect each new ‘onset’
- pass on to constraint-based filtering...
• Explicit expert-derived knowledge
[figure: sinusoid tracks over a 0–3000 Hz spectrogram, 0–4 s, grouped by common onset]
Transcription as Classification
• Signal models typically used for transcription
harmonic spectrum, superposition
• But ... trade domain knowledge for data
transcription as pure classification problem:
Audio
Trained
classifier
p("C0"|Audio)
p("C#0"|Audio)
p("D0"|Audio)
p("D#0"|Audio)
p("E0"|Audio)
p("F0"|Audio)
single N-way discrimination for “melody”
per-note classifiers for polyphonic transcription
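A minimal sketch of the idea, with synthetic data standing in for real audio: a trained classifier maps raw spectral frames to note labels with no explicit harmonic model. The toy frame generator, note names, and all parameters here are illustrative assumptions, not the actual system.

```python
# Sketch: melody transcription as frame-level classification.
# Synthetic "spectrogram" frames are generated for two notes; a classifier
# learns to label frames with no hand-coded harmonic-search rules.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_bins = 64

def harmonic_frame(f0_bin):
    """Toy spectral frame: energy at f0 and its harmonics, plus noise."""
    frame = rng.normal(0, 0.05, n_bins)
    for h in (1, 2, 3):
        if h * f0_bin < n_bins:
            frame[h * f0_bin] += 1.0 / h
    return frame

# Label 0 = "C", label 1 = "E" (arbitrary note names for the sketch)
X = np.array([harmonic_frame(5) for _ in range(100)] +
             [harmonic_frame(8) for _ in range(100)])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(probability=True).fit(X, y)
print(clf.predict([harmonic_frame(8)])[0])  # classify an unseen frame
```

With `probability=True`, `clf.predict_proba` gives the per-note posteriors p(note|audio) pictured above; per-note one-vs-rest classifiers would extend this toward polyphony.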
Melody Transcription Features
• Short-time Fourier Transform Magnitude
(Spectrogram)
• Standardize over a 50-point frequency window
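The two steps can be sketched as follows; the frame size, hop, window width, and test signal are illustrative assumptions, not the exact settings used here:

```python
# Sketch of the feature pipeline: STFT magnitude, then local standardization
# across a 50-bin frequency window so the classifier sees relative spectral
# shape rather than absolute level.
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram via windowed FFT frames."""
    win = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(win * x[i:i + n_fft]))
              for i in range(0, len(x) - n_fft, hop)]
    return np.array(frames).T          # (freq_bins, time_frames)

def standardize_freq(S, width=50):
    """Zero-mean, unit-variance over a sliding window along frequency."""
    out = np.empty_like(S)
    n = S.shape[0]
    for k in range(n):
        lo, hi = max(0, k - width // 2), min(n, k + width // 2)
        mu, sd = S[lo:hi].mean(axis=0), S[lo:hi].std(axis=0) + 1e-9
        out[k] = (S[k] - mu) / sd
    return out

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
S = standardize_freq(stft_mag(x))
print(S.shape)
```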

Training Data
• Need {data, label} pairs for classifier training
• Sources:
pre-mixing multitrack recordings + hand-labeling?
synthetic music (MIDI) + forced-alignment?
[figure: spectrograms (freq / kHz vs. time / sec) of isolated and mixed tracks with aligned melody labels]
Melody Transcription Results
• Trained on 17 examples
.. plus transpositions out to +/- 6 semitones
SMO SVM (Weka)
• Tested on ISMIR MIREX 2005 set
includes foreground/background detection
Example...
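One way to picture the transposition augmentation above: on a log-frequency (semitone-axis) spectrogram, transposing by k semitones is just a row shift, so each labeled excerpt yields 13 training examples. A toy sketch — the array shapes and `transpose` helper are assumptions, not the actual training code:

```python
# Transposition as data augmentation: shift spectrogram rows and note
# labels together by k semitones, k = -6 .. +6.
import numpy as np

def transpose(S, labels, k):
    """Shift a semitone-axis spectrogram and its note labels by k semitones."""
    S_k = np.roll(S, k, axis=0)
    if k > 0:
        S_k[:k] = 0          # zero the rows that wrapped around
    elif k < 0:
        S_k[k:] = 0
    return S_k, labels + k

S = np.random.rand(84, 100)            # 84 semitone bins x 100 frames
labels = np.full(100, 60)              # MIDI note 60 at every frame
augmented = [transpose(S, labels, k) for k in range(-6, 7)]
print(len(augmented))
```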
Melody Clustering
• Goal: Find ‘fragments’ that recur in melodies
.. across large music database
.. trade data for model sophistication
Training data → Melody extraction → 5-second fragments → VQ clustering → Top clusters
• Data sources
pitch tracker, or MIDI training data
• Melody fragment representation
DCT(1:20) - removes average, smoothes detail
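A sketch of the fragment pipeline under stated assumptions (synthetic rising/falling contours stand in for extracted melodies; scipy's DCT and scikit-learn's KMeans serve as the VQ step):

```python
# 5-second pitch contours -> first 20 DCT coefficients (dropping
# coefficient 0 removes the average pitch; truncation smooths detail)
# -> k-means vector quantization to find recurring fragment shapes.
import numpy as np
from scipy.fftpack import dct
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

def fragment_features(contour):
    """DCT(1:20) of a pitch contour: transposition-invariant, smoothed."""
    return dct(contour, norm='ortho')[1:21]

frames = 250                            # ~5 s at 50 frames/s
rising = [np.linspace(60, 72, frames) + rng.normal(0, 0.5, frames)
          for _ in range(20)]
falling = [np.linspace(72, 60, frames) + rng.normal(0, 0.5, frames)
           for _ in range(20)]
X = np.array([fragment_features(c) for c in rising + falling])

vq = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

On this toy data the two clusters recover the rising and falling contour shapes regardless of absolute pitch, which is the point of dropping the DC term.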
Melody Clustering Results
• Clusters match underlying contour:
• Some interesting
matches:
e.g. Pink + Nsync
3. Music Similarity
with Mike Mandel and Adam Berenzweig
• Can we predict which songs “sound alike” to a listener?
.. based on the audio waveforms?
many aspects to subjective similarity
• Applications
query-by-example
automatic playlist generation
discovering new music
• Problems
the right representation
modeling individual similarity
Music Similarity Features
• Need “timbral” features:
Mel-Frequency Cepstral Coeffs (MFCCs)
[diagram: audio → auditory-like frequency warping → log-domain magnitude → discrete cosine transform (orthogonalization) → MFCCs]
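A from-scratch sketch of those three stages — mel warping, log compression, DCT — with simplified filterbank details that are assumptions rather than the exact MFCC recipe used here:

```python
# Minimal MFCC for one frame: power spectrum -> triangular mel filterbank
# (auditory-like warping) -> log -> DCT (orthogonalization).
import numpy as np
from scipy.fftpack import dct

def mel(f):  return 2595 * np.log10(1 + f / 700.0)
def imel(m): return 700 * (10 ** (m / 2595.0) - 1)

def mfcc_frame(frame, sr=16000, n_mels=40, n_ceps=13):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    fbank = np.zeros(n_mels)
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        # triangular filter rising lo->mid, falling mid->hi
        w = np.maximum(0, np.minimum((freqs - lo) / (mid - lo + 1e-9),
                                     (hi - freqs) / (hi - mid + 1e-9)))
        fbank[i] = (w * spec).sum()
    return dct(np.log(fbank + 1e-9), norm='ortho')[:n_ceps]

tone = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
c = mfcc_frame(tone)
print(c.shape)
```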
 
Timbral Music Similarity
• Measure similarity of feature distribution
i.e. collapse across time to get density p(xi)
compare by e.g. KL divergence
• e.g. Artist Identification
learn artist model p(xi | artist X) (e.g. as GMM)
classify unknown song to closest model
[diagram: training songs → MFCCs → per-artist GMMs; test song → MFCCs → KL divergence to each artist model → min-KL artist]
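A minimal sketch of the scheme with synthetic "MFCC" frames: one GMM per artist, and the test song is assigned to the model with the highest average log-likelihood (the model-dependent term of the KL divergence between song and artist distributions). The artist names and feature distributions are synthetic assumptions:

```python
# Artist ID by density modeling: fit a GMM to each artist's frames,
# then score an unknown song under every model and take the best.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
d = 13                                   # MFCC dimensionality

artist_frames = {
    'artist1': rng.normal(0.0, 1.0, (500, d)),
    'artist2': rng.normal(3.0, 1.0, (500, d)),
}
models = {a: GaussianMixture(n_components=4, random_state=0).fit(f)
          for a, f in artist_frames.items()}

test_song = rng.normal(3.0, 1.0, (200, d))   # frames drawn near artist2
scores = {a: m.score(test_song) for a, m in models.items()}
print(max(scores, key=scores.get))
```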
“Anchor Space”
• Acoustic features describe each song
.. but from a signal, not a perceptual, perspective
.. and not the differences between songs
• Use genre classifiers to define new space
prototype genres are “anchors”
[diagram: audio input (class i, class j) → GMM modeling → posteriors p(a1|x), p(a2|x), …, p(an|x), an n-dimensional vector in “Anchor Space”; song models compared by KL-d, EMD, etc.]
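A sketch of the anchor-space conversion: per-genre GMMs ("anchors") turn each feature frame x into a posterior vector [p(a1|x) … p(an|x)]. The two anchors and their training data are synthetic assumptions:

```python
# Convert frames to anchor space: log-likelihood under each anchor GMM,
# then Bayes' rule with equal priors to get per-frame genre posteriors.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
anchors = {
    'electronica': GaussianMixture(2, random_state=0).fit(rng.normal(0, 1, (300, 5))),
    'country':     GaussianMixture(2, random_state=0).fit(rng.normal(4, 1, (300, 5))),
}

def to_anchor_space(frames):
    """Posterior over anchor genres for each frame (equal priors)."""
    logp = np.column_stack([m.score_samples(frames) for m in anchors.values()])
    logp -= logp.max(axis=1, keepdims=True)      # stabilize the softmax
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

song = rng.normal(4, 1, (10, 5))                 # frames near "country"
A = to_anchor_space(song)
print(A.shape)
```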
Anchor Space
• Frame-by-frame high-level categorizations
compare to raw features?
[scatter plots: Madonna vs. Bowie frames plotted in anchor space (Electronica vs. Country) and as raw cepstral features (third vs. fifth cepstral coefficients)]
properties in distributions? dynamics?
‘Playola’ Similarity Browser
Ground-truth data
• Hard to evaluate Playola’s ‘accuracy’
user tests...
ground truth?
• “Musicseer” online survey:
ran for 9 months in 2002
> 1,000 users, > 20k judgments
http://labrosa.ee.columbia.edu/projects/musicsim/
Evaluation
• Compare classifier measures against Musicseer subjective results
“triplet” agreement percentage
Top-N ranking agreement score:
s_i = ( Σ_{r=1}^{N} α_r α_{c_r} ) / ( Σ_{r=1}^{N} α_r² ), with rank weight α_r = (1/2)^{r/3}
First-place agreement percentage
[bar chart: top-rank agreement % (simple significance test) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK over data sets SrvKnw 4789×3.58, SrvAll 6178×8.93, GamKnw 7410×3.96, GamAll 7421×8.92]
Using SVMs for Artist ID
• Support Vector Machines (SVMs) find
hyperplanes in a high-dimensional space
rely only on a matrix of
distances between points
much ‘smarter’ than
nearest-neighbor/overlap
want diversity of reference
vectors...
(w  x) + b = + 1
(w x) + b = –1
x1
x2
yi = – 1
y i = +1
w
(w x) + b = 0
Music Info Extraction - Ellis
2005-09-20
p. 21/25
Song-Level SVM Artist ID
• Instead of one model per artist/genre,
use every training song as an ‘anchor’
then SVM finds best support for each artist
[diagram: training songs → MFCCs → song features; distances D to the songs of Artist 1, Artist 2, … feed a DAG SVM that assigns the test song's artist]
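A rough sketch of the song-as-anchor idea, simplifying each song's MFCC distribution to its mean and using a plain linear SVM rather than the DAG SVM: the feature vector for a song is its distance to every training song. All data here is synthetic:

```python
# Song-level artist ID: every training song is an anchor, and a song's
# feature vector is its distance to each anchor; the SVM then picks
# which anchors best support each artist.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

def song_summary(frames):
    """Collapse a song's MFCC frames to a single vector (mean here)."""
    return frames.mean(axis=0)

# 10 synthetic songs per artist, 100 frames x 13 "MFCC" dims each
train_songs = ([rng.normal(0, 1, (100, 13)) for _ in range(10)] +
               [rng.normal(2, 1, (100, 13)) for _ in range(10)])
y = np.array([0] * 10 + [1] * 10)          # artist labels
anchor_songs = np.array([song_summary(s) for s in train_songs])

def distance_features(song):
    """Feature vector = distance from this song to every training song."""
    return np.linalg.norm(anchor_songs - song_summary(song), axis=1)

X = np.array([distance_features(s) for s in train_songs])
clf = SVC(kernel='linear').fit(X, y)

test_song = rng.normal(2, 1, (100, 13))    # new song by the second artist
print(clf.predict([distance_features(test_song)])[0])
```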
Artist ID Results
• ISMIR/MIREX 2005 also evaluated Artist ID
• 148 artists, 1800 files (split train/test)
from ‘uspop2002’
• Song-level SVM clearly dominates
using only MFCCs!
Results of the formal MIREX 2005 Audio Artist ID evaluation (USPOP2002), from http://www.music-ir.org/evaluation/mirex-results/audio-artist/:

Rank  Participant  Raw Accuracy  Normalized  Runtime / s
1     Mandel       68.3%         68.0%       10240
2     Bergstra     59.9%         60.9%       86400
3     Pampalk      56.2%         56.0%       4321
4     West         41.0%         41.0%       26871
5     Tzanetakis   28.6%         28.5%       2443
6     Logan        14.8%         14.8%       ?
7     Lidy         Did not complete

References
- Jean-Julien Aucouturier and Francois Pachet. Improving timbre similarity: How high's the sky? Journal of Negative Results in Speech and Audio Sciences, 2004.
- John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems, 2000.
Playlist Generation
• SVMs are well suited to “active learning”
solicit labels on items closest to current boundary
• Automatic player
with “skip”
= Ground truth
data collection
active-SVM
automatic playlist
generation
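A toy sketch of the skip-as-label loop: each judgment updates an SVM, and the next song queried is the unlabeled one nearest the decision boundary. The "user preference" here is a synthetic rule invented for illustration:

```python
# Active learning with an SVM player: fit on the labels so far, then
# query the most uncertain (smallest-margin) unlabeled song.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
songs = rng.normal(0, 1, (200, 8))              # song feature vectors
likes = (songs[:, 0] > 0).astype(int)           # hidden preference rule

# Seed with a handful of judgments guaranteed to include both classes
labeled = sorted({int(np.argmax(likes)), int(np.argmin(likes)), *range(8)})
for _ in range(20):
    clf = SVC(kernel='linear').fit(songs[labeled], likes[labeled])
    pool = [i for i in range(len(songs)) if i not in labeled]
    margins = np.abs(clf.decision_function(songs[pool]))
    labeled.append(pool[int(np.argmin(margins))])  # query most uncertain

acc = float((clf.predict(songs) == likes).mean())
print(acc)
```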
Conclusions
[roadmap diagram repeated: Music audio → Melody / Drums / Event extraction → Fragment clustering, Eigenrhythms → Synthesis/generation (?); Anchor models → Semantic bases → Similarity/recommend'n]
• Lots of data
+ noisy transcription
+ weak clustering
musical insights?