Pattern Recognition Applied to Music Signals JHU CLSP Summer School

JHU CLSP Summer School
Pattern Recognition
Applied to Music Signals
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York
July 1st, 2003
Dan Ellis
Pattern Recognition
2003-07-01 - 1
Music Content Analysis
• Music contains information at many levels
  - what is it?
• We’d like to get this information out automatically
  - fine-level transcription of events
  - broad-level classification of pieces
• Information extraction can be framed as pattern classification / recognition, or machine learning
  - build systems based on (labeled) training data
Music analysis
• What information can we get from music?
[Figure: spectrogram of a music excerpt, frequency 0-4000 Hz vs. time 0-5 s]
• Score recovery
  - extract the ‘performance’
• Instrument identification
• Ensemble performance
  - ‘gestalts’: chords, tone colors
• Broader timescales
  - phrasing & musical structure
  - artist / genre clustering and classification
Outline
1 Music Content Analysis
2 Classification and Features
  - classification
  - spectrograms
  - cepstra
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
Classification and Features
• Classification means finding categorical (discrete) labels for real-world (continuous) observations
[Figure: spectrogram (f/Hz vs. time/s) of vowels ‘ay’ and ‘ao’, and the same tokens scattered in F1/Hz vs. F2/Hz space]
• Problems
  - parameter tuning
  - feature overlap
Classification system parts
Sensor → signal → [Pre-processing / segmentation (e.g. STFT, locate vowels)] → segment → [Feature extraction (e.g. formant extraction)] → feature vector → [Classification] → class → [Post-processing (context constraints, costs/risk)]
• The right features are critical
  - they place an upper bound on classifier performance
  - they should make important aspects visible
  - invariance under irrelevant modifications
The Spectrogram
• Short-time Fourier transform:
  X[k, m] = Σ_{n=0}^{N−1} x[n] · w[n − mL] · exp(−j 2πk (n − mL) / N)
• Plot STFT X[k, m] as a grayscale image:
[Figure: STFT magnitude as grayscale images, freq / Hz 0-4000 vs. time / s, intensity -50 to +10 dB]
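The STFT formula above can be sketched directly in NumPy; this is an illustrative version (the Hann window, N = 256, and L = 128 are assumptions, not the exact settings behind the figure):

```python
import numpy as np

def stft(x, N=256, L=128):
    """X[k, m] = sum_{n=0}^{N-1} x[n] w[n - mL] exp(-j 2 pi k (n - mL) / N)."""
    w = np.hanning(N)                        # analysis window w[n]
    n_frames = 1 + (len(x) - N) // L         # hop L samples between frames
    X = np.empty((N // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        # rfft keeps only the non-negative frequency bins k = 0..N/2
        X[:, m] = np.fft.rfft(x[m * L:m * L + N] * w)
    return X

# A 440 Hz tone sampled at 8 kHz should peak in the bin nearest 440 Hz.
sr = 8000
t = np.arange(sr) / sr
X = stft(np.sin(2 * np.pi * 440 * t))
peak_bin = int(np.abs(X[:, 10]).argmax())    # bin spacing = sr / N = 31.25 Hz
```

Plotting 20·log10|X| as an image gives the grayscale spectrogram shown on the slide.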
Cepstra
• Spectrograms are good for visualization; cepstra are preferred for classification
  - DCT of the log spectrum: c_k = idft(log |X[k, m]|)
• Cepstra capture coarse spectral information in fewer dimensions with less correlation:
[Figure: auditory spectrum vs. cepstral coefficients over ~150 frames, their covariance matrices, and an example joint distribution of coefficients (10, 15)]
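As a rough sketch of the cepstral computation (the plain real cepstrum of the STFT magnitude; the auditory-spectrum front end in the figure is not modeled here):

```python
import numpy as np

def cepstra(X, n_coef=13):
    """c_k = idft(log |X[k, m]|), keeping only the low-order coefficients."""
    logmag = np.log(np.abs(X) + 1e-10)   # small floor avoids log(0)
    c = np.fft.irfft(logmag, axis=0)     # inverse DFT along the frequency axis
    return c[:n_coef]                    # low quefrency = coarse spectral envelope

rng = np.random.default_rng(0)
frames = rng.standard_normal((256, 5))   # 5 illustrative frames of length 256
C = cepstra(np.fft.rfft(frames, axis=0))
```

Truncating to the first 13 coefficients is what discards fine spectral detail and decorrelates the features.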
Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
  - Priors and posteriors
  - Bayesian classifier
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
Statistical Pattern Recognition
• Observations are random variables whose distribution depends on the class:
  class ω_i (hidden, discrete) → observation x (continuous), via p(x|ω_i); inference runs back via Pr(ω_i|x)
• Source distributions p(x|ω_i)
  - reflect variability in the feature
  - reflect noise in the observation
  - generally have to be estimated from data (rather than known in advance)
[Figure: overlapping class-conditional densities p(x|ω_i) for classes ω_1..ω_4 along x]
Priors and posteriors
• Bayesian inference can be interpreted as updating prior beliefs with new information x.
  Bayes’ Rule:
  Pr(ω_i | x) = p(x | ω_i) · Pr(ω_i) / Σ_j p(x | ω_j) · Pr(ω_j)
  (posterior = likelihood × prior probability, normalized by the ‘evidence’ p(x) = Σ_j p(x|ω_j) Pr(ω_j))
• The posterior is the prior scaled by the likelihood and normalized by the evidence (so Σ posteriors = 1)
• Objection: priors are often unknown
  - but omitting them amounts to assuming they are all equal
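Numerically, Bayes’ rule is a one-liner; the priors and likelihoods below are made-up values purely for illustration:

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])         # Pr(omega_i), assumed known here
likelihoods = np.array([0.1, 0.4, 0.4])    # p(x | omega_i) at one observation x
evidence = (likelihoods * priors).sum()    # p(x), the normalizer
posteriors = likelihoods * priors / evidence
```

Note how the weak prior on class 3 is overcome by its larger likelihood, while class 1 keeps some posterior mass from its strong prior.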
Bayesian (MAP) classifier
• The optimal classifier is ω̂ = argmax_{ω_i} Pr(ω_i | x), but we don’t know Pr(ω_i | x)
• We can model the conditional distributions p(x | ω_i), then use Bayes’ rule to find the MAP class:
[Diagram: labeled training examples {x_n, ω_{x_n}} are sorted according to class; the conditional pdf p(x|ω_1) is estimated from the examples of class ω_1, and likewise for each class]
• Or, we can model Pr(ω_i | x) directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ω_i)
  - a discriminative model
Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
  - Gaussians
  - Gaussian mixtures
  - Multi-layer perceptrons (MLPs)
  - Training and test data
5 Singing Detection
Gaussian Mixtures and Neural Nets
• Gaussians as parametric distribution models:
  p(x | ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp(−½ (x − µ_i)^T Σ_i^{−1} (x − µ_i))
  described by a d-dimensional mean vector µ_i and a d × d covariance matrix Σ_i
[Figure: 2-D Gaussian pdf shown as a surface and as contours]
• Classify by maximizing the log likelihood, i.e.
  argmax_{ω_i} [ −½ (x − µ_i)^T Σ_i^{−1} (x − µ_i) − ½ log |Σ_i| + log Pr(ω_i) ]
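The MAP rule above transcribes directly into code, assuming full-covariance Gaussians with known parameters (the two-class example values are illustrative):

```python
import numpy as np

def gaussian_loglik(x, mu, Sigma):
    """log N(x; mu, Sigma) for a d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)   # stable log |Sigma|
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * logdet - 0.5 * d * np.log(2 * np.pi))

def map_class(x, mus, Sigmas, priors):
    """argmax_i [ log p(x|omega_i) + log Pr(omega_i) ]."""
    scores = [gaussian_loglik(x, mu, S) + np.log(p)
              for mu, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))

# Illustrative two-class example: unit-covariance Gaussians at the origin and at (4, 4).
mus = [np.zeros(2), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]
label = map_class(np.array([3.9, 4.1]), mus, Sigmas, [0.5, 0.5])
```

With equal priors and equal covariances, this reduces to nearest-mean classification.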
Gaussian Mixture models (GMMs)
• A weighted sum of Gaussians can fit any pdf, i.e.
  p(x) ≈ Σ_k c_k p(x | m_k)
  with weights c_k and Gaussian components p(x | m_k)
  - each observation from a random single Gaussian?
[Figure: original data, the fitted Gaussian components, and the resulting mixture surface]
• Find the c_k and m_k parameters via EM
  - easy if we knew which m_k generated each x
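A minimal 1-D EM sketch for the mixture above: the E-step soft-assigns each point to a component (standing in for “knowing which m_k generated each x”), and the M-step re-estimates weighted statistics. The quantile initialization and fixed iteration count are arbitrary choices, not from the slide:

```python
import numpy as np

def fit_gmm_1d(x, K=2, n_iter=50):
    """EM for p(x) ~ sum_k c_k N(x; mu_k, var_k)."""
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)   # spread initial means over the data
    var = np.full(K, x.var())
    c = np.full(K, 1.0 / K)                         # mixture weights c_k
    for _ in range(n_iter):
        # E-step: responsibility of component k for each point
        p = c * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates, as if the assignments were known
        Nk = r.sum(axis=0)
        c, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return c, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.5, 200), rng.normal(5.0, 0.5, 200)])
c, mu, var = fit_gmm_1d(x)
```

On well-separated data like this, the recovered means land near the true cluster centers 0 and 5.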
GMM examples
• Vowel data fit with different mixture counts:
[Figure: F1/F2 vowel data fit with 1-4 Gaussians; 1 Gauss logp(x)=-1911, 2 Gauss logp(x)=-1864, 3 Gauss logp(x)=-1849, 4 Gauss logp(x)=-1840]
Neural networks
• Don’t model the distributions p(x | ω_i); instead, model the posteriors Pr(ω_i | x) directly
• Sums over nonlinear functions of sums → a large range of decision surfaces
• e.g. multi-layer perceptron (MLP) with 1 hidden layer:
  y_k = F[ Σ_j w_jk · F[ Σ_i w_ij · x_i ] ]
[Diagram: input layer x_1..x_3, weights w_ij into sigmoid hidden units F[·] giving h_1, h_2, weights w_jk into output layer y_1, y_2]
• Train the weights w_ij with back-propagation
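The forward pass of the 1-hidden-layer MLP is just two matrix products through the nonlinearity; a sketch with bias terms omitted and illustrative (all-zero) weights:

```python
import numpy as np

def F(a):
    """Sigmoid nonlinearity applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_ih, W_ho):
    """y_k = F[ sum_j w_jk F[ sum_i w_ij x_i ] ]."""
    h = F(W_ih @ x)      # hidden-layer activations h_j
    return F(W_ho @ h)   # output-layer activations y_k

x = np.array([0.5, -1.0, 2.0])   # 3 inputs
W_ih = np.zeros((2, 3))          # 2 hidden units (illustrative weights)
W_ho = np.zeros((2, 2))          # 2 outputs
y = mlp_forward(x, W_ih, W_ho)
```

Back-propagation trains W_ih and W_ho by applying the chain rule through these same two layers.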
Neural net example
• 2 input units (normalized F1, F2)
• 5 hidden units, 3 output units (“U”, “O”, “A”)
[Diagram: 2:5:3 network mapping F1, F2 to “U”, “O”, “A”]
• Sigmoid nonlinearity:
  F[x] = 1 / (1 + e^{−x})  ⇒  dF/dx = F(1 − F)
[Figure: sigm(x) and its derivative d/dx sigm(x) over x = -5..5]
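The identity dF/dx = F(1 − F) can be checked numerically against a central difference:

```python
import numpy as np

def F(x):
    """F[x] = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.7, 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)   # central-difference slope at x
analytic = F(x) * (1 - F(x))                # closed-form derivative
```

This cheap closed form (reuse the already-computed activation) is part of why the sigmoid was convenient for back-propagation.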
Neural net training
[Figure: mean squared error of the 2:5:3 net over 100 training epochs, with decision contours after 10 and after 100 iterations]
Aside: Training and test data
• A rich model can learn every training example (overtraining)
[Figure: error rate vs. training time/parameters; training-data error keeps falling while test-data error turns back up (overfitting)]
• But the goal is to classify new, unseen data, i.e. generalization
  - sometimes a ‘cross-validation’ set is used to decide when to stop training
• For evaluation results to be meaningful:
  - don’t test with training data!
  - don’t train on test data (even indirectly...)
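The “don’t test on training data” rule amounts to keeping the index sets disjoint; a minimal sketch (the 80/20 split and 100 examples are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(100)                 # shuffle 100 example indices
train_idx, test_idx = idx[:80], idx[80:]   # fit on train_idx, evaluate ONLY on test_idx
```

A cross-validation set would be a third disjoint slice, used to decide when to stop training.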
Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
  - Motivation
  - Features
  - Classifiers
Singing Detection
(Berenzweig et al. ’01)
• Can we automatically detect when singing is present?
[Figure: spectrogram of aimee.wav, 0-7 kHz over ~28 s, hand-labeled with alternating ‘mus’ and ‘vox’ segments]
  - for further processing (lyrics recognition?)
  - as a song signature?
  - as a basis for classification?
Singing Detection: Requirements
• Labeled training examples
  - 60 x 15 sec. radio excerpts
  - hand-marked sung phrases
• Labeled test data
  - several complete tracks from CDs, hand-labelled
[Figure: spectrograms of training excerpts trn/mus/3, 8, 19, 28 (0-10 kHz, ~13 s) with hand-labeled vox regions]
• Feature choice
  - Mel-frequency cepstral coefficients (MFCCs): popular for speech; maybe for sung voice too?
  - separation of voices? temporal dimension?
• Classifier choice
  - MLP neural net
  - GMMs for singing / music
  - SVM?
GMM System
• Separate models for p(x|singing) and p(x|no singing), combined via a likelihood ratio test:
[Diagram: MFCC calculation (C0-C12) on the music feeds GMM1 (trained on singing) and GMM2 (trained on no singing); the decision is log p(X|“singing”) / p(X|“not”) compared to a threshold]
• How many Gaussians for each?
  - say 20; depends on the data & complexity
• What kind of covariance?
  - diagonal (spherical?)
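The decision stage reduces to thresholding a per-frame log-likelihood ratio; a hedged sketch in which the frame scores stand in for outputs of the two trained GMMs:

```python
import numpy as np

def llr_decision(logp_sing, logp_not, threshold=0.0):
    """Label a frame 'singing' when log p(x|singing) - log p(x|not) exceeds the threshold."""
    return (np.asarray(logp_sing) - np.asarray(logp_not)) > threshold

# Two illustrative frames: the first favors the singing model, the second does not.
labels = llr_decision([-10.0, -12.0], [-11.0, -9.0])
```

Raising the threshold trades missed singing for fewer false alarms, which is how the operating points in the results slides arise.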
GMM Results
• Raw and smoothed results (best FA = 84.9%):
[Figure: Aimee Mann “Build That Wall” with hand-labeled vox; 26-d, 20-mixture GMM log-likelihood ratio on 16 ms frames (FA = 65.8% @ thr = 0), and the same smoothed by a 61-point (1 sec) Hann window (FA = 84.9% @ thr = -0.8)]
• The MLP has the advantage of discriminant training
• Each GMM trains only on its subset of the data
  → faster to train? (2 x 10 min vs. 20 min)
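The 1-second smoothing used above is a moving average under a Hann window; an illustrative version:

```python
import numpy as np

def smooth(scores, n=61):
    """Smooth per-frame scores with an n-point Hann window (n=61 ~ 1 s at 16 ms frames)."""
    w = np.hanning(n)
    w /= w.sum()                                   # normalize for unit gain
    return np.convolve(scores, w, mode="same")     # output has the input's length

s = smooth(np.ones(200))   # a constant score stays ~constant away from the edges
```

Smoothing suppresses frame-level flips in the likelihood ratio, which is what lifts the frame accuracy from 65.8% to 84.9%.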
MLP Neural Net
• Directly estimate p(singing | x):
[Diagram: MFCC calculation (C0-C12) on the music feeds an MLP with outputs “singing” and “not singing”]
  - the net has 26 inputs (+∆), 15 HUs, 2 o/ps (26:15:2)
• How many hidden units?
  - depends on the amount of data and boundary complexity
• Feature context window?
  - useful in speech
• Delta features?
  - useful in speech
• Training parameters...
MLP Results
• Raw net outputs on a CD track (FA = 74.1%):
[Figure: Aimee Mann “Build That Wall” with hand-labeled vox; 26:15:1 netlab p(singing) on 16 ms frames (FA = 74.1% @ thr = 0.5)]
• Smoothed for continuity (best FA = 90.5%):
[Figure: smoothed p(singing) over the same 30 s]
Artist Classification
(Berenzweig et al. 2002)
• Artist label as an available stand-in for genre
• Train an MLP to classify frames among 21 artists
• Using only “voice” segments, song-level accuracy improves from 56.7% to 64.9%:
[Figure: per-frame artist posteriors vs. time for Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee) and Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval), over the 21 artists: Michael Penn, The Roots, The Moles, Eric Matthews, Arto Lindsay, Oval, Jason Falkner, Built to Spill, Beck, XTC, Wilco, Aimee Mann, The Flaming Lips, Mouse on Mars, Dj Shadow, Richard Davies, Cornelius, Mercury Rev, Belle & Sebastian, Sugarplastic, Boards of Canada; ‘true voice’ segments marked]
Summary
• Music content analysis: pattern classification
• Basic machine learning methods: neural nets, GMMs
• Singing detection: a classic application
  but... the time dimension?
References
A.L. Berenzweig and D.P.W. Ellis (2001), “Locating Singing Voice Segments within Music Signals”, Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001.
http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

R.O. Duda, P.E. Hart, and D.G. Stork (2001), Pattern Classification, 2nd Ed., Wiley, 2001.

E. Scheirer and M. Slaney (1997), “Construction and evaluation of a robust multifeature speech/music discriminator”, Proc. IEEE ICASSP, Munich, April 1997.
http://www.ee.columbia.edu/~dpwe/e6820/papers/ScheiS97-mussp.pdf