JHU CLSP Summer School
Pattern Recognition Applied to Music Signals

1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York
July 1st, 2003

1 Music Content Analysis
• Music contains information at many levels
  - what is it?
• We'd like to get this information out automatically
  - fine-level transcription of events
  - broad-level classification of pieces
• Information extraction can be framed as pattern classification / recognition, or machine learning
  - build systems based on (labeled) training data

Music analysis
• What information can we get from music?
  [figure: spectrogram of a music excerpt, 0-4 kHz over ~5 s]
• Score recovery
  - extract the 'performance'
• Instrument identification
• Ensemble performance
  - 'gestalts': chords, tone colors
• Broader timescales
  - phrasing & musical structure
  - artist / genre clustering and classification

Outline
1 Music Content Analysis
2 Classification and Features
  - classification
  - spectrograms
  - cepstra
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

2 Classification and Features
• Classification means: finding categorical (discrete) labels for real-world (continuous) observations
  [figure: vowel example: spectrogram (f/Hz vs. time/s) and a scatter of F1/Hz vs. F2/Hz for the classes 'ay', 'ao', 'x']
• Problems
  - parameter tuning
  - feature overlap

Classification system parts
• Processing chain:
  sensor → (signal) → pre-processing / segmentation → (segment) → feature extraction → (feature vector) → classification → (class) → post-processing
  - pre-processing / segmentation: e.g. STFT, locate vowels
  - feature extraction: e.g. formant extraction
  - post-processing: context constraints, costs / risk
• The right features are critical
  - they place an upper bound on classifier performance
  - should make the important aspects visible
  - invariance under irrelevant modifications

The Spectrogram
• Short-time Fourier transform:
  X[k,m] = \sum_{n=0}^{N-1} x[n]\, w[n-mL]\, e^{-j 2\pi k (n-mL)/N}
• Plot the STFT magnitude |X[k,m]| as a grayscale image (intensity in dB)
  [figure: spectrogram detail (freq/Hz vs. time/s, intensity/dB) and the full-excerpt spectrogram, 0-4 kHz]

Cepstra
• Spectrograms are good for visualization; cepstra are preferred for classification
  - inverse DFT (in practice a DCT) of the log-magnitude STFT: c_n = \mathrm{idft}(\log|X[k,m]|)
• Cepstra capture the coarse spectral information in fewer dimensions, with less correlation between them
  [figure: auditory-spectrum vs. cepstral-coefficient features, their covariance matrices, and an example joint distribution of coefficients (10, 15)]
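As a concrete illustration of the last two slides, here is a minimal Python sketch (not code from the course) that computes a log-magnitude STFT and low-order cepstral coefficients using only numpy/scipy; the window length, hop, number of coefficients, and the function names are illustrative assumptions, and the MFCCs used later in the tutorial would additionally apply a mel filterbank before the final transform.

```python
# Sketch: spectrogram (log-magnitude STFT) and real cepstrum, per the formulas above.
# N (window), L (hop), and n_coef are illustrative choices, not the slides' settings.
import numpy as np
from scipy.fft import rfft, irfft

def stft(x, N=512, L=256):
    """Frames of length N with hop L, Hann-windowed, transformed by an FFT."""
    w = np.hanning(N)
    n_frames = (len(x) - N) // L + 1
    return np.array([rfft(x[m*L:m*L+N] * w) for m in range(n_frames)])

def cepstra(X, n_coef=13):
    """Cepstral coefficients: inverse DFT of the log-magnitude spectrum."""
    logmag = np.log(np.abs(X) + 1e-10)        # avoid log(0)
    c = irfft(logmag, axis=1)                 # real cepstrum, one row per frame
    return c[:, :n_coef]                      # keep only the coarse (low-order) terms

# Tiny usage example: a 440 Hz tone at 16 kHz, just to exercise the functions
sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
X = stft(x)
spectrogram_dB = 20 * np.log10(np.abs(X) + 1e-10)
C = cepstra(X)
print(spectrogram_dB.shape, C.shape)
```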
Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
  - priors and posteriors
  - Bayesian classifier
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

3 Statistical Pattern Recognition
• Observations are random variables whose distribution depends on the class:
  - hidden, discrete class ωi → continuous observation x, linked by p(x|ωi) and Pr(ωi|x)
• Source distributions p(x|ωi)
  - reflect variability in the feature
  - reflect noise in the observation
  - generally have to be estimated from data (rather than known in advance)
  [figure: overlapping class-conditional densities p(x|ωi) for classes ω1 ... ω4 along a single feature axis x]

Priors and posteriors
• Bayesian inference can be interpreted as updating prior beliefs with new information x.
• Bayes' rule:
  \Pr(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,\Pr(\omega_i)}{\sum_j p(x \mid \omega_j)\,\Pr(\omega_j)}
  - likelihood p(x|ωi), prior probability Pr(ωi), 'evidence' p(x) = Σj p(x|ωj)·Pr(ωj), posterior probability Pr(ωi|x)
• The posterior is the prior scaled by the likelihood and normalized by the evidence (so the posteriors sum to 1)
• Objection: priors are often unknown
  - but omitting them amounts to assuming they are all equal

Bayesian (MAP) classifier
• The optimal classifier is
  \hat{\omega} = \arg\max_{\omega_i} \Pr(\omega_i \mid x)
  but we don't know Pr(ωi|x)
• Can model the conditional distributions p(x|ωi), then use Bayes' rule to find the MAP class
  - labeled training examples {xn, ωxn} → sort according to class → estimate the conditional pdf for each class, e.g. p(x|ω1)
• Or, can model the posterior directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ωi|x)
  - a discriminative model

Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
  - Gaussians
  - Gaussian mixtures
  - multi-layer perceptrons (MLPs)
  - training and test data
5 Singing Detection

4 Gaussian Mixtures and Neural Nets
• Gaussians as parametric distribution models:
  p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)
  - described by a d-dimensional mean vector µi and a d × d covariance matrix Σi
  [figure: 2-D Gaussian pdf surface and its elliptical contours]
• Classify by maximizing the log likelihood, i.e.
  \hat{\omega} = \arg\max_{\omega_i} \left[ -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2} \log|\Sigma_i| + \log \Pr(\omega_i) \right]

Gaussian Mixture Models (GMMs)
• A weighted sum of Gaussians can fit any pdf:
  p(x) \approx \sum_k c_k\, p(x \mid m_k)
  with weights ck and Gaussian components p(x|mk)
  - each observation is drawn from a single, randomly chosen Gaussian
  [figure: original data, the individual Gaussian components, and the resulting mixture surface]
• Find the ck and mk parameters via EM
  - this would be easy if we knew which mk generated each x

GMM examples
• Vowel data fit with different mixture counts; the data log-likelihood improves as components are added:
  - 1 Gaussian: log p(x) = -1911
  - 2 Gaussians: log p(x) = -1864
  - 3 Gaussians: log p(x) = -1849
  - 4 Gaussians: log p(x) = -1840
  [figure: fitted mixture contours over the F1/F2 vowel scatter for each case]
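Putting the last few slides together, here is a minimal sketch (illustrative, not the course's code) of a Bayesian (MAP) classifier that models each class-conditional density p(x|ωi) with a GMM fit by EM, using scikit-learn's GaussianMixture; the mixture count, covariance type, the GMMClassifier wrapper, and the synthetic two-class data are all assumptions made for the example.

```python
# Sketch: per-class GMMs fit by EM, combined with class priors via Bayes' rule,
# then classification by argmax_i [ log p(x|class_i) + log Pr(class_i) ].
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    def __init__(self, n_components=4, covariance_type="diag"):
        self.n_components = n_components
        self.covariance_type = covariance_type

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_, self.log_priors_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]                                 # training examples of this class
            self.models_[c] = GaussianMixture(
                self.n_components, covariance_type=self.covariance_type).fit(Xc)
            self.log_priors_[c] = np.log(len(Xc) / len(X))  # Pr(class) from label counts
        return self

    def predict(self, X):
        # log p(x|class) + log Pr(class) for each class, then the argmax (MAP class)
        scores = np.stack([self.models_[c].score_samples(X) + self.log_priors_[c]
                           for c in self.classes_], axis=1)
        return self.classes_[np.argmax(scores, axis=1)]

# Usage on synthetic 2-D "formant-like" data (two vowel classes)
rng = np.random.default_rng(0)
X0 = rng.normal([300, 800], [50, 100], size=(200, 2))
X1 = rng.normal([700, 1200], [60, 120], size=(200, 2))
X, y = np.vstack([X0, X1]), np.array([0] * 200 + [1] * 200)
clf = GMMClassifier().fit(X, y)
print("training accuracy:", (clf.predict(X) == y).mean())
```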
Neural networks
• Don't model the distributions p(x|ωi); instead, model the posteriors Pr(ωi|x) directly
• Sums over nonlinear functions of sums → a large range of decision surfaces
• e.g. a multi-layer perceptron (MLP) with one hidden layer:
  y_k = F\left[ \sum_j w_{jk} \cdot F\left[ \sum_i w_{ij}\, x_i \right] \right]
  [figure: network diagram: inputs x1 .. x3, hidden units h1, h2 with weights wij and nonlinearity F[·], outputs y1, y2 with weights wjk]
• Train the weights wij, wjk with back-propagation

Neural net example
• 2 input units (normalized F1, F2)
• 5 hidden units, 3 output units ("U", "O", "A")
• Sigmoid nonlinearity:
  F[x] = \frac{1}{1 + e^{-x}} \quad\Rightarrow\quad \frac{dF}{dx} = F(1 - F)
  [figure: the sigmoid and its derivative]

Neural net training
  [figure: mean-squared error vs. training epoch for the 2:5:3 net, and the decision contours over the vowel data after 10 and after 100 iterations]
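To make the back-propagation step concrete, here is a minimal numpy sketch (illustrative; not the netlab code behind the plots) of a 2:5:3 sigmoid MLP trained on mean-squared error; the learning rate, weight initialization, and toy vowel-like data are assumptions.

```python
# Sketch: one-hidden-layer sigmoid MLP (2:5:3) trained by back-propagation on MSE.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 2, 5, 3
W1 = rng.normal(0, 0.5, (n_in, n_hid))     # input -> hidden weights w_ij
W2 = rng.normal(0, 0.5, (n_hid, n_out))    # hidden -> output weights w_jk
lr = 0.5                                   # learning rate (illustrative)

def forward(X):
    H = sigmoid(X @ W1)                    # hidden activations
    Y = sigmoid(H @ W2)                    # outputs y_k = F[ sum_j w_jk F[ sum_i w_ij x_i ] ]
    return H, Y

def train_epoch(X, T):
    global W1, W2
    H, Y = forward(X)
    dY = (Y - T) * Y * (1 - Y)             # output delta, using dF/dx = F(1 - F)
    dH = (dY @ W2.T) * H * (1 - H)         # hidden delta (back-propagated error)
    W2 -= lr * H.T @ dY / len(X)
    W1 -= lr * X.T @ dH / len(X)
    return np.mean((Y - T) ** 2)           # mean-squared error for this epoch

# Toy data: three vowel-like clusters in normalized (F1, F2), with 1-of-3 targets
means = [[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]]
X = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in means])
T = np.repeat(np.eye(3), 50, axis=0)
for epoch in range(100):
    mse = train_epoch(X, T)
print("final MSE:", mse)
```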
Aside: Training and test data
• A rich model can learn every training example (overtraining / overfitting)
  [figure: error rate vs. training time or number of parameters; training-set error keeps falling while test-set error eventually rises again]
• But the goal is to classify new, unseen data, i.e. generalization
  - sometimes a 'cross-validation' set is used to decide when to stop training
• For evaluation results to be meaningful:
  - don't test with training data!
  - don't train on test data (even indirectly...)

Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
  - motivation
  - features
  - classifiers

5 Singing Detection (Berenzweig & Ellis 2001)
• Can we automatically detect when singing is present?
  [figure: spectrogram of a ~30 s pop excerpt (aimee.wav) with alternating 'mus' and 'vox' segments marked]
  - for further processing (lyrics recognition?)
  - as a song signature?
  - as a basis for classification?

Singing Detection: Requirements
• Labeled training examples
  - 60 × 15 sec radio excerpts
  - hand-mark the sung phrases
• Labeled test data
  - several complete tracks from CDs, hand-labelled
  [figure: four training excerpts with the hand-labelled 'vox' regions]
• Feature choice
  - Mel-frequency cepstral coefficients (MFCCs): popular for speech; maybe for the singing voice too?
  - separation of voices? temporal dimension?
• Classifier choice
  - MLP neural net
  - GMMs for singing / music
  - SVM?

GMM System
• Separate models for p(x|singing) and p(x|no singing), combined via a likelihood ratio test:
  music → MFCC calculation (C0 ... C12) → GMM1: p(X|"singing") and GMM2: p(X|"no singing") → log [ p(X|"singing") / p(X|"not") ] → singing?
• How many Gaussians for each?
  - say 20; depends on the data & complexity
• What kind of covariance?
  - diagonal (spherical?)

GMM Results
• Raw and smoothed results (best frame accuracy FA = 84.9%)
  [figure: Aimee Mann, "Build That Wall": spectrogram with hand-labelled vox; raw log-likelihood ratio from 26-dim, 20-mixture GMMs on 16 ms frames (FA = 65.8% at threshold 0); the same ratio smoothed by a 61-point (1 sec) Hann window (FA = 84.9% at threshold -0.8)]
• The MLP has the advantage of discriminant training
• Each GMM trains only on its own subset of the data → faster to train? (2 × 10 min vs. 20 min)

MLP Neural Net
• Directly estimate p(singing | x):
  music → MFCC calculation (C0 ... C12) → MLP → "singing" / "not singing"
  - the net has 26 inputs (features + deltas), 15 hidden units, 2 outputs (26:15:2)
• How many hidden units?
  - depends on the amount of data and the complexity of the boundary
• Feature context window?
  - useful in speech
• Delta features?
  - useful in speech
• Training parameters...

MLP Results
• Raw net outputs on a CD track (FA = 74.1%)
  [figure: Aimee Mann, "Build That Wall": spectrogram with hand-labelled vox and raw p(singing) from the net on 16 ms frames (FA = 74.1% at threshold 0.5)]
• Smoothed for continuity: best FA = 90.5%
  [figure: smoothed p(singing) over the same excerpt]

Artist Classification (Berenzweig et al. 2002)
• Artist label as an available stand-in for genre
• Train an MLP to classify frames among 21 artists
• Using only the "voice" segments, song-level accuracy improves from 56.7% to 64.9%
  [figure: frame-level artist posteriors over time for two tracks (Track 117: Aimee Mann, dynvox = Aimee, unseg = Aimee; Track 4: Arto Lindsay, dynvox = Arto, unseg = Oval), plotted against the 21-artist set (Michael Penn, The Roots, Oval, Beck, XTC, Wilco, Aimee Mann, Mercury Rev, Belle & Sebastian, Boards of Canada, ...)]

Summary
• Music content analysis: pattern classification
• Basic machine learning methods: neural nets, GMMs
• Singing detection: a classic application, but... what about the time dimension?

References
A.L. Berenzweig and D.P.W. Ellis (2001), "Locating Singing Voice Segments within Music Signals", Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf
R.O. Duda, P.E. Hart, and D.G. Stork (2001), Pattern Classification, 2nd ed., Wiley, 2001.
E. Scheirer and M. Slaney (1997), "Construction and evaluation of a robust multifeature speech/music discriminator", Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e6820/papers/ScheiS97-mussp.pdf