Extracting Information from Music Audio
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/

1. Motivation: Learning from Music
2. Notes Extraction
3. Drum Pattern Modeling
4. Music Similarity
LabROSA Overview
[Diagram: LabROSA research areas: information extraction, recognition, separation, and retrieval, applied to speech, music, and environmental audio, built on machine learning and signal processing.]
1. Learning from Music
• A lot of music data available
  e.g. 60 GB of MP3 ≈ 1000 hr of audio, 15k tracks
• What can we do with it?
  implicit definition of 'music'
• Quality vs. quantity
  Speech recognition lesson: 10x data with 1/10th the annotation is twice as useful
• Motivating applications
  music similarity (recommendation, playlists)
  computer (assisted) music generation
  insight into music
Ground Truth Data
[Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav, 0-7000 Hz over ~18 s, with hand-marked 'vox' / 'mus' segment labels.]
• A lot of unlabeled music data available
  manual annotation is expensive and rare
• Unsupervised structure discovery possible
  .. but labels help to indicate what you want
• Weak annotation sources
  artist-level descriptions
  symbol sequences without timing (MIDI)
  errorful transcripts
• Evaluation requires ground truth
  limiting factor in Music IR evaluations?
Talk Roadmap
[Diagram: music audio feeds four threads: (1) semantic bases, (2) melody extraction → event extraction, (3) drums extraction → eigenrhythms / fragment clustering → synthesis/generation (?), (4) anchor models → similarity/recommendation.]
2. Notes Extraction
with Graham Poliner
• Audio → Score very desirable
  for data compression, searching, learning
• Full solution is elusive
  signal separation of overlapping voices
  music constructed to frustrate!
• Maybe simplify the problem:
  "Dominant Melody" at each time frame
[Figure: spectrogram, frequency 0-4000 Hz vs. time 0-5 s.]
Conventional Transcription
• Pitched notes have harmonic spectra
  → transcribe by searching for harmonics
  e.g. sinusoid modeling + grouping:
  group by onset & common harmonicity: find sets of sinusoid tracks that start around the same time with a stable harmonic pattern
  break tracks: need to detect new 'onsets' at single frequencies
  pass on to constraint-based filtering...
• Explicit expert-derived knowledge
[Figure: sinusoid tracks over a spectrogram (0-3000 Hz, 0-4 s), grouped by common onset time.]
Transcription as Classification
• Signal models typically used for transcription
  harmonic spectrum, superposition
• But ... trade domain knowledge for data
  transcription as a pure classification problem (sketched below):
  [Diagram: Audio → trained classifier → p("C0"|Audio), p("C#0"|Audio), p("D0"|Audio), ...]
  single N-way discrimination for "melody"
  per-note classifiers for polyphonic transcription
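A minimal sketch of the classification view, assuming frame-level spectral features and per-frame note labels stored in hypothetical .npy files; the all-pairs (one-vs-one) SVM mirrors the Weka setup mentioned on the results slide, not necessarily the exact system:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: X is (n_frames, n_bins) spectral features,
# y is the per-frame melody note label (e.g. a MIDI note number),
# with a reserved label (e.g. -1) for unvoiced frames.
X_train = np.load("melody_features.npy")   # placeholder path
y_train = np.load("melody_labels.npy")     # placeholder path

# One N-way discrimination for "melody": an SVM with a one-vs-one
# (all-pairs) decomposition over note classes.
clf = SVC(kernel="linear", decision_function_shape="ovo")
clf.fit(X_train, y_train)

# At test time each frame independently gets a note label;
# no harmonic model or source separation is involved.
X_test = np.load("melody_features_test.npy")
melody = clf.predict(X_test)               # one note label per frame
```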
Melody Transcription Features
• Short-time Fourier Transform magnitude (spectrogram)
• Standardize over a 50-point frequency window (sketch below)
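A sketch of this feature pipeline under stated assumptions (FFT size, hop, and the exact normalization scheme are illustrative): STFT magnitude, with each bin standardized by the mean and deviation over a 50-bin frequency window centered on it.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import uniform_filter1d

def melody_features(x, sr, n_fft=1024, hop=256, win_bins=50):
    """STFT magnitude, standardized over a local frequency window."""
    _, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z)                      # (n_bins, n_frames) magnitude
    # Local mean and std over a win_bins-wide window along frequency.
    mu = uniform_filter1d(S, size=win_bins, axis=0)
    var = uniform_filter1d(S**2, size=win_bins, axis=0) - mu**2
    sd = np.sqrt(np.maximum(var, 1e-12))
    return ((S - mu) / sd).T           # (n_frames, n_bins) features
```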
Training Data
• Need {data, label} pairs for classifier training
• Sources:
  pre-mixing multitrack recordings + hand-labeling?
  synthetic music (MIDI) + forced-alignment?
[Figure: spectrogram (0-2 kHz, 0-3.5 s) of a training excerpt with aligned note labels.]
Melody Transcription Results
• Trained on 17 examples
  .. plus transpositions out to +/- 6 semitones
  all-pairs SVMs (Weka)
• Tested on ISMIR MIREX 2005 set
  includes foreground/background (voiced/unvoiced) detection

Table 1: Results of the formal MIREX 2005 Audio Melody Extraction evaluation from http://www.music-ir.org/evaluation/mirex-results/audio-melody/. Results marked * are not directly comparable to the others because those systems did not perform voiced/unvoiced detection. Results marked † are artificially low due to an unresolved algorithmic issue.

Rank | Participant | Overall Accuracy | Voicing d′ | Raw Pitch | Raw Chroma | Runtime / s
1    | Dressler    | 71.4%            | 1.85       | 68.1%     | 71.4%      | 32
2    | Ryynänen    | 64.3%            | 1.56       | 68.6%     | 74.1%      | 10970
3    | Poliner     | 61.1%            | 1.56       | 67.3%     | 73.4%      | 5471
3    | Paiva 2     | 61.1%            | 1.22       | 58.5%     | 62.0%      | 45618
5    | Marolt      | 59.5%            | 1.06       | 60.1%     | 67.1%      | 12461
6    | Paiva 1     | 57.8%            | 0.83       | 62.7%     | 66.7%      | 44312
7    | Goto        | 49.9%*           | 0.59*      | 65.8%     | 71.8%      | 211
8    | Vincent 1   | 47.9%*           | 0.23*      | 59.8%     | 67.6%      | ?
9    | Vincent 2   | 46.4%*           | 0.86*      | 59.6%     | 71.1%      | 251
10   | Brossier    | 3.2%*†           | 0.14*†     | 3.9%†     | 8.1%†      | 41

• Raw pitch and chroma scores give another picture: effectively a three-way tie for top performance with the two top-ranked systems
Example...
Polyphonic Transcription
• Train SVM detectors for every piano note (see the sketch below)
  same features & classifier but different labels
  88 separate detectors, independent smoothing
• Use MIDI syntheses, player piano recordings
  about 30 min training data
[Figure: transcription of Bach 847 (Disklavier): per-pitch detector outputs, A1-A6, over 0-9 s, level / dB.]
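A minimal sketch of the per-note detector bank, assuming the same frame features as the melody task and an 88-column binary piano roll Y for training; the median-filter smoothing length is illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter
from sklearn.svm import LinearSVC

N_NOTES = 88  # piano range A0..C8

def train_note_detectors(X, Y):
    """One independent binary SVM per piano note.
    X: (n_frames, n_feats) features; Y: (n_frames, 88) binary piano roll."""
    return [LinearSVC().fit(X, Y[:, k]) for k in range(N_NOTES)]

def transcribe(detectors, X, smooth=5):
    """Frame-wise detection, then independent temporal smoothing per note."""
    roll = np.stack([d.predict(X) for d in detectors], axis=1)
    return median_filter(roll, size=(smooth, 1))  # smooth along time only
```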
Piano Transcription Results
• Significant improvement from the classifier; frame-level accuracy results:

Table 1: Frame-level transcription results.
Algorithm           | Errs  | False Pos | False Neg | d′
SVM                 | 43.3% | 27.9%     | 15.4%     | 3.44
Klapuri & Ryynänen  | 66.6% | 28.1%     | 38.5%     | 2.71
Marolt              | 84.6% | 36.5%     | 48.1%     | 2.35

• Overall accuracy is a frame-level version of the metric proposed by Dixon in [Dixon, 2000], defined as

  Acc = N / (FP + FN + N)

  where N is the number of correctly transcribed frames, FP is the number of unvoiced frames transcribed as voiced, and FN is the number of voiced frames transcribed as unvoiced (a small worked example follows).
[Figure: breakdown of classification error % into false positives and false negatives as a function of the number of notes present.]
http://labrosa.ee.columbia.edu/projects/melody/
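A small worked version of the Dixon frame-accuracy metric; the counts below are illustrative, not the paper's numbers:

```python
def frame_accuracy(n_correct, false_pos, false_neg):
    """Dixon (2000) frame-level accuracy: Acc = N / (FP + FN + N)."""
    return n_correct / (false_pos + false_neg + n_correct)

# Illustrative counts only:
print(frame_accuracy(n_correct=5670, false_pos=790, false_neg=540))
```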
3. Eigenrhythms: Drum Pattern Space
with John Arroyo
• Pop songs built on repeating “drum loop”
variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture variations?
by analyzing lots of (MIDI) data, or from audio
• Applications
music categorization
“beat box” synthesis
insight
Aligning the Data
• Need to align patterns prior to modeling... (see the sketch below)
  tempo (stretch): infer BPM & normalize
  downbeat (shift): correlate against a 'mean' template
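A minimal alignment sketch under stated assumptions: each pattern is an (instrument × time) onset grid already resampled to a fixed number of ticks per pattern (the tempo normalization), and the downbeat shift is found by circular cross-correlation against the mean template:

```python
import numpy as np

def align_patterns(patterns, n_ticks=400):
    """Circularly shift each drum pattern to best match a mean template.
    patterns: list of (n_instruments, n_ticks) onset-strength arrays,
    already tempo-normalized to n_ticks per pattern."""
    template = np.mean(patterns, axis=0)
    aligned = []
    for p in patterns:
        # Score every circular shift; keep the one maximizing correlation
        # with the template, summed over the BD/SN/HH rows.
        scores = [np.sum(template * np.roll(p, s, axis=1))
                  for s in range(n_ticks)]
        aligned.append(np.roll(p, int(np.argmax(scores)), axis=1))
    return aligned
```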
Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage
  of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract beat-weight (see the sketch below)
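A sketch of the eigen-analysis, assuming the aligned patterns are flattened to 1200-dimensional vectors (3 instruments × 400 ticks); sklearn's PCA stands in for whatever SVD routine was actually used:

```python
import numpy as np
from sklearn.decomposition import PCA

# patterns: (100, 1200) matrix of aligned, flattened drum patterns
patterns = np.random.rand(100, 1200)   # placeholder for real data

pca = PCA(n_components=20)             # 20+ components for good coverage
weights = pca.fit_transform(patterns)  # per-pattern coordinates
eigenrhythms = pca.components_         # (20, 1200) signed basis patterns

# Any pattern is approximated as mean + weighted sum of eigenrhythms,
# where the weights (and bases) can be negative as well as positive:
recon = pca.mean_ + weights @ eigenrhythms
print("explained variance:", pca.explained_variance_ratio_.sum())
```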
Posirhythms (NMF)
[Figure: six NMF basis patterns, 'Posirhythm 1' through 'Posirhythm 6', each an HH/SN/BD grid over 4 beats (@ 120 BPM, 400 samples).]
• Nonnegative: only adds beat-weight
• Capturing some structure (see the sketch below)
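A corresponding NMF sketch with the same assumed data layout as the PCA example; sklearn's NMF is an assumption of convenience, not necessarily the original factorization code:

```python
import numpy as np
from sklearn.decomposition import NMF

patterns = np.random.rand(100, 1200)    # placeholder: aligned patterns, >= 0

nmf = NMF(n_components=6, init="nndsvd", max_iter=500)
activations = nmf.fit_transform(patterns)   # (100, 6) nonnegative weights
posirhythms = nmf.components_               # (6, 1200) nonnegative bases

# Unlike PCA, the reconstruction only *adds* beat-weight:
recon = activations @ posirhythms
```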
Eigenrhythms for Classification
• Projections into eigenspace / LDA space
  [Figure: training patterns scattered in the PCA(1,2) projection (16% correct) and the LDA(1,2) projection (33% correct), colored by genre: blues, country, disco, hiphop, house, newwave, pop, punk, rnb, rock.]
• 10-way genre classification (nearest neighbor), as sketched below:
  PCA3: 20% correct
  LDA4: 36% correct
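A sketch of the classification experiment, assuming per-pattern vectors and genre labels; the slide's 'LDA4' is read here as a 4-dimensional LDA projection, and the arrays are placeholders:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

patterns = np.random.rand(100, 1200)          # placeholder aligned patterns
genres = np.random.randint(0, 10, size=100)   # placeholder 10-way labels

# Project into a low-dimensional discriminative space, then nearest neighbor.
lda = LinearDiscriminantAnalysis(n_components=4).fit(patterns, genres)
coords = lda.transform(patterns)

knn = KNeighborsClassifier(n_neighbors=1)
acc = cross_val_score(knn, coords, genres, cv=5).mean()
print(f"10-way genre accuracy: {acc:.0%}")
```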
Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
4. Music Similarity
with Mike Mandel and Adam Berenzweig
• Can we predict which songs "sound alike" to a listener?
  .. based on the audio waveforms?
  many aspects to subjective similarity
• Applications
  query-by-example
  automatic playlist generation
  discovering new music
• Problems
  the right representation
  modeling individual similarity
Music Similarity Features
• Need "timbral" features: Mel-Frequency Cepstral Coefficients (MFCCs)
  log-domain spectrogram
  → auditory-like frequency warping → Mel-frequency spectrogram
  → discrete cosine transform (orthogonalization) → Mel-frequency cepstral coefficients
  (a sketch follows)
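A minimal version of this pipeline using librosa (an assumption of convenience; any MFCC implementation follows the same steps), with a placeholder filename:

```python
import librosa

# Load audio and compute 20 MFCCs per frame: mel-warped log spectrogram
# followed by a discrete cosine transform across the mel bands.
y, sr = librosa.load("song.mp3", sr=22050, mono=True)  # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)     # (20, n_frames)
```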
Timbral Music Similarity
• Measure similarity of feature distributions
  i.e. collapse across time to get a density p(x_i)
  compare by e.g. KL divergence
• e.g. artist identification (sketched below)
  learn artist model p(x_i | artist X) (e.g. as a GMM)
  classify an unknown song to the closest model
[Diagram: training MFCCs → per-artist GMMs; a test song is compared to each model by KL divergence and assigned to the minimum.]
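A sketch of the GMM approach, assuming per-song MFCC matrices grouped by artist; choosing the model with the highest test-song likelihood stands in for the minimum-KL match shown in the diagram:

```python
from sklearn.mixture import GaussianMixture

def train_artist_models(mfccs_by_artist, n_components=16):
    """Fit one GMM over all training frames of each artist.
    mfccs_by_artist: dict artist -> (n_frames, n_mfcc) array."""
    return {artist: GaussianMixture(n_components, covariance_type="diag")
                    .fit(frames)
            for artist, frames in mfccs_by_artist.items()}

def classify_song(models, song_mfccs):
    """Assign the song to the artist model with highest mean log-likelihood."""
    return max(models,
               key=lambda a: models[a].score_samples(song_mfccs).mean())
```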
“Anchor Space”
• Acoustic features describe each song
  .. but from a signal, not a perceptual, perspective
  .. and not the differences between songs
• Use genre classifiers to define a new space (a sketch of the mapping follows)
  prototype genres are "anchors"
[Diagram: audio input → anchor classifiers (GMM modeling) → posteriors p(a1|x), p(a2|x), ..., p(an|x) form an n-dimensional vector in "anchor space"; per-song modeling in that space, then similarity computation by KL divergence, EMD, etc.]
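A sketch of mapping frames into anchor space, assuming one GMM per anchor genre (trained as in the artist-ID example) and equal priors, so the posterior is a softmax over per-anchor log-likelihoods:

```python
import numpy as np

def to_anchor_space(anchor_models, mfccs):
    """Map frames to posteriors over anchor genres (equal priors assumed).
    anchor_models: dict genre -> fitted GaussianMixture."""
    logliks = np.stack([m.score_samples(mfccs)
                        for m in anchor_models.values()], axis=1)
    # Softmax over per-anchor log-likelihoods = p(anchor | frame)
    logliks -= logliks.max(axis=1, keepdims=True)
    probs = np.exp(logliks)
    return probs / probs.sum(axis=1, keepdims=True)  # (n_frames, n_anchors)
```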
Anchor Space
• Frame-by-frame high-level categorizations
  compare to raw features?
  [Figure: frames of madonna and bowie tracks scattered in anchor-space features (Electronica vs. Country posteriors) and in cepstral features (third vs. fifth cepstral coefficients).]
  properties in distributions? dynamics?

‘Playola’ Similarity Browser
Ground-truth Data
• Hard to evaluate Playola's 'accuracy'
  user tests... ground truth?
• "Musicseer" online survey:
  ran for 9 months in 2002
  > 1,000 users, > 20k judgments
  http://labrosa.ee.columbia.edu/projects/musicsim/
Evaluation
• Compare classifier measures against Musicseer subjective results
  "triplet" agreement percentage
  Top-N ranking agreement score (computed in the sketch below):
    s_i = (1/Z) Σ_{r=1}^{N} α_r α_{c_r},   α_r = (1/2)^{r/3},   Z = Σ_r α_r²
  first-place agreement percentage
[Figure: top-rank agreement % (with a simple significance test) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK, over four data subsets: SrvKnw (4789×3.58), SrvAll (6178×8.93), GamKnw (7410×3.96), GamAll (7421×8.92).]
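A small sketch of the ranking-agreement score as reconstructed above (the exact weight definition is an assumption): a measure's top-N list is rewarded when the items it ranks highly are also ranked highly by the subjective data.

```python
import numpy as np

def ranking_agreement(measure_top_n, subjective_rank, n=5):
    """Top-N ranking agreement with exponentially decaying rank weights.
    measure_top_n: the measure's top-n items, best first.
    subjective_rank: dict item -> subjective rank (1 = best)."""
    alpha = lambda r: 0.5 ** (r / 3)       # assumed decay, see formula above
    z = sum(alpha(r) ** 2 for r in range(1, n + 1))
    s = sum(alpha(r) * alpha(subjective_rank.get(item, np.inf))
            for r, item in enumerate(measure_top_n, start=1))
    return s / z                           # items absent subjectively score 0
```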
Using SVMs for Artist ID
• Support Vector Machines (SVMs) find maximum-margin hyperplanes in a high-dimensional space
  rely only on a matrix of distances between points (see the sketch below)
  much 'smarter' than nearest-neighbor/overlap
  want diversity of reference vectors...
[Diagram: standard SVM margin picture: separating hyperplane (w·x) + b = 0 with margins (w·x) + b = ±1, support vectors x1, x2, labels y_i = ±1, normal vector w.]
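A sketch of an SVM driven purely by a pairwise distance matrix, which is the property the slide highlights; turning distances into a kernel via exp(-γD) is one common choice, assumed here rather than taken from the talk:

```python
import numpy as np
from sklearn.svm import SVC

def svm_from_distances(D_train, y_train, D_test, gamma=0.05):
    """Train/test an SVM given only pairwise distance matrices.
    D_train: (n_train, n_train); D_test: (n_test, n_train)."""
    clf = SVC(kernel="precomputed")
    clf.fit(np.exp(-gamma * D_train), y_train)   # distances -> kernel
    return clf.predict(np.exp(-gamma * D_test))
```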
Song-Level SVM Artist ID
• Instead of one model per artist/genre, use every training song as an 'anchor' (see the sketch below)
  then the SVM finds the best support for each artist
[Diagram: training MFCCs → per-song features → distances D to every training song → multiclass DAG SVM over artists; a test song is classified the same way.]
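A sketch of the song-level setup, assuming each song is summarized by MFCC statistics (mean plus covariance entries); sklearn's one-vs-one SVC stands in for the DAG-SVM of Platt et al. cited below:

```python
import numpy as np
from sklearn.svm import SVC

def song_feature(mfcc):
    """Summarize a song (n_frames, n_mfcc) by its mean and covariance."""
    mu = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False)
    iu = np.triu_indices_from(cov)            # upper triangle only
    return np.concatenate([mu, cov[iu]])

def train_artist_svm(song_mfccs, artists):
    X = np.stack([song_feature(m) for m in song_mfccs])
    return SVC(kernel="rbf", decision_function_shape="ovo").fit(X, artists)
```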
Artist ID Results
• ISMIR/MIREX 2005 also evaluated artist ID
• 148 artists, 1800 files (split train/test) from 'uspop2002'
• Song-level SVM clearly dominates, using only MFCCs!

Table 4: Results of the formal MIREX 2005 Audio Artist ID evaluation (USPOP2002) from http://www.music-ir.org/evaluation/mirex-results/audio-artist/.

Rank | Participant | Raw Accuracy | Normalized | Runtime / s
1    | Mandel      | 68.3%        | 68.0%      | 10240
2    | Bergstra    | 59.9%        | 60.9%      | 86400
3    | Pampalk     | 56.2%        | 56.0%      | 4321
4    | West        | 41.0%        | 41.0%      | 26871
5    | Tzanetakis  | 28.6%        | 28.5%      | 2443
6    | Logan       | 14.8%        | 14.8%      | ?
7    | Lidy        | Did not complete

Reference: John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12, 2000.
Playlist Generation
• SVMs are well suited to "active learning"
  solicit labels on items closest to the current boundary (see the sketch below)
• Automatic player with "skip"
  skips = ground-truth data collection
  active-SVM automatic playlist generation
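A sketch of the active-learning step, assuming a fitted binary SVM over candidate songs ('play' vs. 'skip'): query the unlabeled item nearest the decision boundary, whose label is most informative.

```python
import numpy as np
from sklearn.svm import SVC

def next_query(clf: SVC, X_unlabeled):
    """Pick the unlabeled song closest to the SVM decision boundary."""
    margin = np.abs(clf.decision_function(X_unlabeled))
    return int(np.argmin(margin))   # index of the song to play next
```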
Conclusions
[Diagram: the talk roadmap revisited: music audio feeding melody, drums, and event extraction; anchor models; fragment clustering; eigenrhythms; semantic bases; similarity/recommendation; and synthesis/generation.]
• Lots of data
  + noisy transcription
  + weak clustering
  → musical insights?