Music Information Retrieval for Jazz
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,thierry}@ee.columbia.edu
http://labrosa.ee.columbia.edu/
2012-11-15

1. Music Information Retrieval
2. Automatic Tagging
3. Musical Content
4. Future Work

Machine Listening
• Extracting useful information from sound ... like (we) animals do
  [Diagram: tasks (Detect, Classify, Describe) crossed with domains (Speech, Music, Environmental Sound), with examples such as VAD, speech/music detection, ASR, music transcription, emotion, music recommendation, environment awareness, automatic narration, and "Sound Intelligence"]

1. The Problem
• We have a lot of music. Can computers help?
• Applications
  ◦ archive organization
  ◦ music recommendation
  ◦ musicological insight?

Music Information Retrieval (MIR)
• Small field that has grown since ~2000
  ◦ musicologists, engineers, librarians
  ◦ significant commercial interest
• MIR as the musical analog of text IR: find stuff in large archives
• Popular tasks
  ◦ genre classification
  ◦ chord, melody, full transcription
  ◦ music recommendation
• Annual evaluations
  ◦ "standard" test corpora - pop

2. Automatic Tagging
• Statistical pattern recognition: finding matches to training examples
  [Block diagram: sensor signal -> pre-processing/segmentation -> segment -> feature extraction -> feature vector -> classification -> class -> post-processing]
• Need
  ◦ feature design
  ◦ labeled training examples
• Applications
  ◦ genre
  ◦ instrumentation
  ◦ artist
  ◦ studio ...

Features: MFCC
• Mel-Frequency Cepstral Coefficients: the standard features from speech recognition
  [Pipeline: sound -> FFT -> spectra |X[k]| -> Mel-scale frequency warp -> log -> audspec -> IFFT -> cepstra -> truncate -> MFCCs]

Representing Audio
• MFCCs are short-time features (25 ms)
• Sound is a "trajectory" in MFCC space
• Describe a whole track by its statistics
  [Figure: spectrogram, MFCC features, and MFCC covariance matrix for an example track]

MFCCs for Music
• Can resynthesize MFCCs by shaping noise
  ◦ gives an idea of the information retained
  [Figure: "Freddie Freeloader", original vs. MFCC-resynthesis spectrograms]

Ground Truth (Mandel & Ellis '08)
• MajorMiner: free-text tags for 10 s clips
  ◦ 400 users, 7500 unique tags, 70,000 taggings
• Example: drum, bass, piano, jazz, slow, instrumental, saxophone, soft, quiet, club, ballad, smooth, soulful, easy_listening, swing, improvisation, 60s, cool, light

Classification
[Pipeline: sound -> chop into 10 s blocks -> MFCC (20 dims) -> mean, Δ, Δ², and covariance (60 dims) -> standardize across the train set (399 samples) -> one-vs-all SVM (C, γ) -> average precision; ground-truth tags supply the labels]
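As a rough sketch of the MFCC front end on the "Features: MFCC" slide (FFT, mel warp, log, inverse transform, truncate), the snippet below uses librosa's built-in MFCC routine; the 25 ms window and 20 coefficients follow the slides, while the 10 ms hop and sample rate are illustrative choices, not settings stated in the talk.

    import librosa

    def mfcc_features(path, sr=22050, n_mfcc=20):
        # Load audio and compute MFCCs over 25 ms windows with a 10 ms hop
        y, sr = librosa.load(path, sr=sr)
        n_fft = int(0.025 * sr)
        hop = int(0.010 * sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=n_fft, hop_length=hop)  # shape (n_mfcc, n_frames)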
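And a minimal sketch of the tagging pipeline just above: per-clip statistics of the MFCCs (means of MFCC, Δ, Δ² plus the covariance) feeding one-vs-all SVMs, assuming scikit-learn. The function names, kernel choice, and hyperparameters are illustrative; only the overall layout follows the slide.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier

    def clip_vector(mfcc):
        """Summarize one 10 s clip of MFCCs (n_coef x n_frames) by its statistics."""
        d1 = np.diff(mfcc, axis=1)           # delta
        d2 = np.diff(mfcc, n=2, axis=1)      # delta-delta
        means = np.concatenate([mfcc.mean(axis=1), d1.mean(axis=1), d2.mean(axis=1)])  # 60 dims
        cov = np.cov(mfcc)                   # MFCC covariance matrix
        iu = np.triu_indices_from(cov)       # keep the upper triangle only
        return np.concatenate([means, cov[iu]])

    def train_taggers(X, Y):
        """X: one feature row per clip; Y: binary clip-by-tag indicator matrix."""
        scaler = StandardScaler().fit(X)     # standardize across the training set
        clf = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma='scale', probability=True))
        clf.fit(scaler.transform(X), Y)      # one SVM per tag
        return scaler, clf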
• In short: MFCC features + human ground truth + standard machine learning tools

Classification Results
• Classifiers trained from the top 50 tags
  [Figure: per-tag classifier outputs over time for "Soul Eyes", with tags ranging from drum, guitar, male, synth, rock, pop, ... through jazz, piano, saxophone, ... to club, trance, _90s]

3. Musical Content
• MFCCs (and speech recognizers) don't respect pitch
  ◦ but pitch is important, and visible in the spectrogram
• Pitch-related tasks
  ◦ note transcription
  ◦ chord transcription
  ◦ matching by musical content ("cover songs")

Note Transcription (Poliner & Ellis '05, '06, '07)
• Training data and features
  ◦ MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 min of training audio)
  ◦ normalized magnitude STFT
• Classification
  ◦ N binary SVMs (one for each note)
  ◦ independent frame-level classification on a 10 ms grid
  ◦ distance to the class boundary as posterior
• Temporal smoothing
  ◦ two-state (on/off) independent HMM for each note; parameters learned from training data
  ◦ find the Viterbi sequence for each note
  [Pipeline: feature representation -> feature vector -> classification -> posteriors -> HMM smoothing]

Polyphonic Transcription
• Real music excerpts + ground truth (MIREX 2007)
• Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
  [Bar chart: precision, recall, accuracy, Etot, Esubs, Emiss, Efa]
• Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onset/offset
  [Bar chart: precision, recall, average F-measure, average overlap]

Chroma Features (Fujishima 1999; Warren et al. 2003)
• Idea: project onto 12 semitones regardless of octave
  ◦ maintains the main "musical" distinctions
  ◦ invariant to octave equivalence
  ◦ no need to worry about harmonics?
  [Figure: spectrogram vs. chroma representation]

  C(b) = \sum_{k=0}^{N} B(12 \log_2(k/k_0) - b) \, W(k) \, |X[k]|

  where W(k) is a spectral weighting and B(·) selects the bins falling on pitch class b, mod 12.

Chroma Resynthesis (Ellis & Poliner 2007)
• Chroma describes the notes in an octave ... but not the octave
• Can resynthesize by presenting all octaves ... with a smooth envelope
  ◦ "Shepard tones": the octave is ambiguous (endless-sequence illusion)
  [Figure: Shepard tone spectra and Shepard tone resynthesis]

  y_b(t) = \sum_{o=1}^{M} W\!\left(o + \tfrac{b}{12}\right) \cos\!\left(2\pi \, 2^{\,o + b/12} \, \omega_0 t\right)

Chroma Example
• Simple Shepard-tone resynthesis
  ◦ can also reimpose the broad spectrum from MFCCs
  [Figure: "Freddie Freeloader", chroma and resynthesized spectrograms]

Onset Detection (Bello et al. 2005)
• Simplest thing is the energy envelope

  e(n_0) = \sum_{n=-W/2}^{W/2} w[n] \, |x(n + n_0)|^2

• emphasis on high frequencies?
  [Figure: Harnoncourt and Maracatu excerpts, spectrogram |X(f,t)| vs. frequency-weighted f·|X(f,t)|]
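A direct numpy rendering of the chroma projection C(b) above: each FFT bin k is mapped to a pitch class via 12·log2(k/k0) and its weighted magnitude is accumulated mod 12. A minimal sketch; the reference pitch f_ref and the Gaussian spectral weighting W(k) are illustrative choices.

    import numpy as np

    def chroma_frame(X, sr, n_fft, f_ref=27.5):
        """X: magnitude spectrum of one frame (length n_fft//2 + 1)."""
        C = np.zeros(12)
        freqs = np.arange(1, len(X)) * sr / n_fft            # skip the DC bin
        semis = 12.0 * np.log2(freqs / f_ref)                # semitones above the reference (A0)
        bins = np.round(semis).astype(int) % 12              # B(.): nearest pitch class, mod 12
        W = np.exp(-0.5 * ((freqs - 1000.0) / 1500.0) ** 2)  # broad spectral weighting W(k)
        np.add.at(C, bins, W * X[1:])                        # accumulate weighted magnitudes
        return C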
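The chroma resynthesis equation y_b(t) above can be sketched the same way: each pitch class b is rendered at every octave o under a smooth spectral envelope W, so the octave stays ambiguous. The raised-cosine envelope and the 55 Hz reference below are illustrative, not the talk's exact settings.

    import numpy as np

    def shepard_tone(b, dur=1.0, sr=22050, f0=55.0, n_oct=7):
        """Shepard tone for pitch class b (0..11): all octaves under a smooth envelope."""
        t = np.arange(int(dur * sr)) / sr
        y = np.zeros_like(t)
        for o in range(1, n_oct + 1):
            pos = o + b / 12.0
            W = 0.5 - 0.5 * np.cos(2 * np.pi * pos / n_oct)       # smooth envelope over octaves
            y += W * np.cos(2 * np.pi * (2.0 ** pos) * f0 * t)    # 2^(o + b/12) * f0
        return y / np.max(np.abs(y))

    def render_chroma(chroma_vec, **kw):
        """Mix Shepard tones weighted by a 12-element chroma vector."""
        return sum(c * shepard_tone(b, **kw) for b, c in enumerate(chroma_vec))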
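For the onset envelope, a minimal scipy sketch along the lines of the slide: a short-time magnitude envelope (optionally weighted by frequency, as in f·|X(f,t)|, to emphasize high frequencies) followed by a half-wave-rectified first difference. Window sizes and the weighting are illustrative.

    import numpy as np
    from scipy.signal import stft

    def onset_envelope(y, sr, freq_weight=True):
        f, t, Z = stft(y, fs=sr, nperseg=1024, noverlap=768)
        S = np.abs(Z)
        if freq_weight:
            S = S * f[:, None]                    # f * |X(f,t)|: emphasize high frequencies
        env = S.sum(axis=0)                       # magnitude-based envelope per frame
        flux = np.maximum(0.0, np.diff(env))      # rises in the envelope mark onsets
        return flux, t[1:]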
Tempo Estimation
• Beat tracking (may) need a global tempo period τ_p
  ◦ otherwise the problem lacks "optimal substructure"
• Pick a peak in the onset-envelope autocorrelation
  ◦ after applying a "human preference" window
  ◦ check for sub-beat
  [Figure: onset strength envelope (part), raw autocorrelation, and windowed autocorrelation with primary and secondary tempo periods]

Beat Tracking by Dynamic Programming
• To optimize

  C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \alpha \sum_{i=2}^{N} F(t_i - t_{i-1}, \tau_p)

  define C*(t) as the best score up to time t, then build it up recursively (with traceback P(t)):

  C^*(t) = O(t) + \max_\tau \{ \alpha F(t - \tau, \tau_p) + C^*(\tau) \}
  P(t) = \arg\max_\tau \{ \alpha F(t - \tau, \tau_p) + C^*(\tau) \}

  The final beat sequence {t_i} is the best C* plus the back-trace.

Beat Tracking Results
• Prefers drums & steady tempo
  [Figure: "Soul Eyes" onset envelope and tracked beats]

Beat-Synchronous Chroma
• Record one chroma vector per beat
  ◦ compact representation of harmonies
  [Figure: "Freddie Freeloader", spectrogram, beat-synchronous chroma, resynthesis, and resynthesis + MFCC]

Chord Recognition
• Beat-synchronous chroma look like chords
  ◦ e.g. C-E-G, B-D-G, A-C-E, A-C-D-F ...
  ◦ can we transcribe them?
• Two approaches
  ◦ manual templates (prior knowledge)
  ◦ learned models (from training data)

Chord Recognition System (Sheh & Ellis 2003)
• Analogous to speech recognition
  ◦ Gaussian models of features for each chord
  ◦ Hidden Markov Models for chord transitions
  [Pipeline: test audio -> beat track -> chroma from 100-1600 Hz and 25-400 Hz bandpass filters -> beat-synchronous chroma features -> HMM Viterbi -> chord labels. Training: labels -> root normalize -> 24 Gaussian models -> unnormalize; count transitions -> 24x24 transition matrix; resample.]

Chord Recognition
• Often works
  [Figure: "Let It Be", beat-synchronous chroma with recognized vs. ground-truth chords (C, G, A:min, F, C, G, F, C, A:min/b7, F:maj7, F:maj6 ...)]
• But only about 60% of the time

What did the models learn?
• Chord model centers (means) indicate chord "templates"
  [Figure: model means for the DIM, DOM7, MAJ, MIN, and MIN7 families, for C-root chords]
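A minimal numpy sketch of the tempo estimator described above: autocorrelate the onset envelope and pick the peak after weighting by a log-Gaussian "human preference" window. The 120 BPM center and one-octave width are illustrative, not the talk's exact values.

    import numpy as np

    def estimate_tempo(onset_env, frame_rate, bpm_center=120.0, octave_width=1.0):
        ac = np.correlate(onset_env, onset_env, mode='full')[len(onset_env) - 1:]
        lags = np.arange(1, len(ac))                          # lags in frames
        bpm = 60.0 * frame_rate / lags                        # candidate tempi
        prefer = np.exp(-0.5 * (np.log2(bpm / bpm_center) / octave_width) ** 2)
        best = 1 + np.argmax(prefer * ac[1:])                 # skip lag 0
        return 60.0 * frame_rate / best                       # tempo in BPM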
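The dynamic-programming recursion for C*(t) can likewise be written out directly. This sketch uses a squared-log deviation penalty for F (a common choice; the talk does not spell F out), and the weight alpha and search range are illustrative.

    import numpy as np

    def track_beats(onset_env, tau_p, alpha=100.0):
        """onset_env: onset strength per frame; tau_p: target beat period in frames."""
        n = len(onset_env)
        cstar = onset_env.copy()               # C*(t), initialized to O(t)
        backlink = -np.ones(n, dtype=int)      # traceback P(t)
        lo, hi = int(round(0.5 * tau_p)), int(round(2.0 * tau_p))
        for t in range(lo, n):
            taus = np.arange(max(0, t - hi), t - lo + 1)
            F = -(np.log((t - taus) / tau_p)) ** 2        # transition score F(t - tau, tau_p)
            scores = alpha * F + cstar[taus]
            best = np.argmax(scores)
            cstar[t] = onset_env[t] + scores[best]        # C*(t) = O(t) + max{...}
            backlink[t] = taus[best]
        # back-trace from the best-scoring frame to recover the beat sequence {t_i}
        beats = [int(np.argmax(cstar))]
        while backlink[beats[-1]] >= 0:
            beats.append(backlink[beats[-1]])
        return np.array(beats[::-1])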
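And a stripped-down sketch of the chord decoder: per-beat chroma is scored against per-chord templates and smoothed with a Viterbi pass over a chord transition matrix. This replaces the system's per-chord Gaussians and learned transition counts with cosine-similarity scores and a simple "sticky" transition matrix, so it is an illustration of the idea, not the actual system.

    import numpy as np

    def viterbi_chords(beat_chroma, templates, self_bias=0.9):
        """beat_chroma: (n_beats, 12); templates: (n_chords, 12) chord means/templates."""
        n_beats, _ = beat_chroma.shape
        n_chords = templates.shape[0]
        # frame scores: cosine similarity between each beat's chroma and each template
        bc = beat_chroma / (np.linalg.norm(beat_chroma, axis=1, keepdims=True) + 1e-9)
        tp = templates / (np.linalg.norm(templates, axis=1, keepdims=True) + 1e-9)
        loglik = np.log(1e-6 + bc @ tp.T)
        # transition matrix: sticky self-transitions, uniform otherwise
        trans = np.full((n_chords, n_chords), (1 - self_bias) / (n_chords - 1))
        np.fill_diagonal(trans, self_bias)
        logtrans = np.log(trans)
        # Viterbi decoding
        delta = loglik[0].copy()
        psi = np.zeros((n_beats, n_chords), dtype=int)
        for t in range(1, n_beats):
            scores = delta[:, None] + logtrans
            psi[t] = np.argmax(scores, axis=0)
            delta = scores[psi[t], np.arange(n_chords)] + loglik[t]
        path = [int(np.argmax(delta))]
        for t in range(n_beats - 1, 0, -1):
            path.append(psi[t][path[-1]])
        return np.array(path[::-1])    # chord index per beat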
Chords for Jazz
• How many types?
  [Figure: "Freddie Freeloader", log-frequency spectrogram, beat-synchronous chroma, chord likelihoods with Viterbi path, and chord-based chroma reconstruction]

Future Work
• Matching items
  ◦ cover songs / standards
  ◦ similar instruments, styles
  [Figure: "Between the Bars", Elliott Smith vs. the Glenn Phillips cover: beat-synchronous chroma, aligned by transposing 2 semitones and shifting -17 beats; pointwise product over the 12 x 300 patch sums to 19.34]
• Analyzing musical content
  ◦ solo transcription & modeling
  ◦ musical structure
• And so much more...

Summary
• Finding Musical Similarity at Large Scale
  ◦ Music audio -> low-level features -> melody and notes, key and chords, tempo and beat
  ◦ Classification and similarity: browsing, discovery, production
  ◦ Music structure discovery: modeling, generation, curiosity
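Returning to the cover-song matching idea on the Future Work slide, a minimal numpy sketch: compare two beat-synchronous chroma matrices by a normalized pointwise product over all 12 transpositions and a range of beat offsets (as in the "-17 beats, transpose 2" alignment shown). The offset search range and normalization are illustrative.

    import numpy as np

    def cover_match_score(chroma_a, chroma_b, max_offset=40):
        """chroma_a, chroma_b: (12, n_beats) beat-synchronous chroma.
        Returns (score, transposition in semitones, beat offset)."""
        best = (-np.inf, 0, 0)
        for semis in range(12):
            rb = np.roll(chroma_b, semis, axis=0)            # transpose by rotating pitch classes
            for off in range(-max_offset, max_offset + 1):
                if off >= 0:
                    a, b = chroma_a[:, off:], rb
                else:
                    a, b = chroma_a, rb[:, -off:]
                n = min(a.shape[1], b.shape[1])
                if n <= 0:
                    continue
                score = np.sum(a[:, :n] * b[:, :n]) / n      # normalized pointwise product
                if score > best[0]:
                    best = (score, semis, off)
        return best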