Extracting Information from Music Audio
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/
2005-10-26

1. Learning from Music
2. Melody Extraction
3. Drum Pattern Modeling
4. Music Similarity

LabROSA Overview
[Diagram: LabROSA project areas spanning Information Extraction, Machine Learning, and Signal Processing, applied to Music (eigenrhythms), Meetings (turns), the Environment (personal audio), and Speech (FDLP).]

1. Learning from Music
• A lot of music data available
  e.g. 60 GB of MP3 ≈ 1000 hr of audio, 15k tracks
• What can we do with it?
  an implicit definition of 'music'
• Quality vs. quantity
  speech recognition lesson: 10x the data with 1/10th the annotation is twice as useful
• Motivating applications
  music similarity / classification
  computer-(assisted) music generation
  insight into music

Ground Truth Data
[Figure: spectrogram of aimee.wav (0-7000 Hz, 0:02-0:18) with hand-marked 'vox' and 'mus' segments.]
• A lot of unlabeled music data is available
  manual annotation is much rarer
• Unsupervised structure discovery is possible
  .. but labels help to indicate what you want
• Weak annotation sources
  artist-level descriptions
  symbol sequences without timing (MIDI)
  errorful transcripts
• Evaluation requires ground truth
  the limiting factor in Music IR evaluations?

Talk Roadmap
[Diagram: music audio feeds (1) semantic bases, (2) melody extraction, and (3) drums/event extraction; these lead to anchor models, fragment clustering, and eigenrhythms, which support (4) similarity/recommendation and, eventually, synthesis/generation.]

2. Melody Transcription
with Graham Poliner
• Audio → Score is very desirable
  for data compression, searching, learning
• A full solution is elusive
  signal separation of overlapping voices
  music is constructed to frustrate it!
• Simplified problem: the "Dominant Melody" at each time frame
  [Figure: spectrogram, frequency 0-4000 Hz over 0-5 s, with the dominant melody highlighted.]

Conventional Transcription
• Pitched notes have harmonic spectra
  → transcribe by searching for harmonics
  e.g. sinusoid modeling + grouping
• Group by common onset & harmonicity
  find sets of tracks that start around the same time
  + a stable harmonic pattern
  break tracks - need to detect a new 'onset' at single frequencies
  pass on to constraint-based filtering...
• Explicit, expert-derived knowledge
  [Figure: sinusoid tracks in a time-frequency plot, 0-3000 Hz over ~4 s.]

Transcription as Classification
• Signal models are typically used for transcription
  harmonic spectrum, superposition
• But ... trade domain knowledge for data
  transcription as a pure classification problem:
  Audio → trained classifier → p("C0"|Audio), p("C#0"|Audio), p("D0"|Audio), p("D#0"|Audio), p("E0"|Audio), p("F0"|Audio), ...
  a single N-way discrimination for "melody" (see the sketch below)
  per-note classifiers for polyphonic transcription

Training Data
• Need {data, label} pairs for classifier training
• Sources:
  pre-mixing multitrack recordings + hand-labeling?
  synthetic music (MIDI) + forced alignment?
  [Figure: example training spectrograms, freq / kHz vs. time / sec.]
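The classification framing above lends itself to a compact sketch. The following is a minimal illustration only, not the system from the talk: it assumes librosa and scikit-learn, uses raw log-spectral frames as features with per-frame note labels (0 for "no melody"), and substitutes a linear SVM for whatever classifier the full system uses; all file names are placeholders.

```python
# Minimal sketch: dominant-melody transcription as per-frame N-way classification.
# librosa and scikit-learn are assumed; features, labels, and file names are illustrative.
import numpy as np
import librosa
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def frame_features(path, sr=8000, n_fft=1024, hop=256):
    """Return log-magnitude STFT frames (time x freq) as classifier features."""
    y, sr = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return librosa.amplitude_to_db(S).T          # one row per analysis frame

# Hypothetical training material: audio files paired with one label per analysis
# frame (a MIDI note number, or 0 for background), e.g. from multitrack recordings
# or MIDI forced alignment as suggested on the Training Data slide.
train_files = [("train01.wav", "train01_labels.npy")]   # placeholder names

X = np.vstack([frame_features(a) for a, _ in train_files])
y = np.concatenate([np.load(l) for _, l in train_files])  # must align with X rows

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X, y)

# Transcribe a new track: one note decision (or 0) per time frame.
melody = clf.predict(frame_features("test.wav"))
```

The deck's own system (next slide) uses a Weka SMO SVM trained on 17 examples plus transpositions out to +/- 6 semitones; the pipeline above just illustrates the data flow of the classification framing.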
Melody Transcription Results
• Trained on 17 examples
  .. plus transpositions out to +/- 6 semitones
  SMO SVM (Weka)
• Tested on the ISMIR MIREX 2005 set
  includes foreground/background detection
• Example...

Melody Clustering
• Goal: find 'fragments' that recur in melodies
  .. across a large music database
  .. trade data for model sophistication
  Training data → melody extraction → 5-second fragments → VQ clustering → top clusters
• Data sources
  pitch tracker, or MIDI training data
• Melody fragment representation
  DCT(1:20) - removes the average, smooths detail

Melody clustering results
• Clusters match the underlying contour
• Some interesting matches: e.g. Pink + Nsync

3. Eigenrhythms: Drum Pattern Space
with John Arroyo
• Pop songs are built on a repeating "drum loop"
  variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture the variations?
  by analyzing lots of (MIDI) data, or from audio
• Applications
  music categorization
  "beat box" synthesis
  insight

Aligning the Data
• Need to align patterns prior to modeling...
  tempo (stretch): by inferring BPM & normalizing
  downbeat (shift): correlate against a 'mean' template

Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract

Posirhythms (NMF)
[Figure: six NMF basis patterns ("Posirhythms" 1-6), each showing hi-hat (HH), snare (SN), and bass drum (BD) weight across the beat pattern.]
• Nonnegative: only adds beat-weight
• Capturing some structure

Eigenrhythms for Classification
• Projections in eigenspace / LDA space
  PCA(1,2) projection (16% corr) vs. LDA(1,2) projection (33% corr)
  [Figure: 2-D scatter of 10 genres - blues, country, disco, hiphop, house, newwave, pop, punk, rnb, rock - in the PCA and LDA projections.]
• 10-way genre classification (nearest neighbor):
  PCA3: 20% correct
  LDA4: 36% correct

Eigenrhythm BeatBox
• Resynthesize rhythms from the eigen-space

4. Music Similarity
with Mike Mandel and Adam Berenzweig
• Can we predict which songs "sound alike" to a listener?
  .. based on the audio waveforms?
  many aspects to subjective similarity
• Applications
  query-by-example
  automatic playlist generation
  discovering new music
• Problems
  the right representation
  modeling individual similarity

Timbral Music Similarity
• Measure the similarity of feature distributions
  i.e. collapse across time to get a density p(x_i)
  compare by e.g. KL divergence
• e.g. Artist identification
  learn an artist model p(x_i | artist X) (e.g. as a GMM)
  classify an unknown song to the closest model (a code sketch follows below)
  [Diagram: training songs per artist → MFCCs → GMMs; a test song is assigned to the artist whose model is closest in KL divergence.]
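A minimal sketch of the recipe on this slide, assuming librosa and scikit-learn: per-artist GMMs are fit to MFCC frames, and an unknown song is assigned to the artist whose model is closest under a Monte-Carlo estimate of KL divergence (the exact KL between GMMs has no closed form). Model sizes, feature settings, and file names are illustrative, not those of the actual system.

```python
# Minimal sketch of timbral similarity: per-artist GMMs over MFCC frames, compared
# with a sampled KL-divergence approximation. File names are placeholders.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # frames x coeffs

def fit_gmm(frames, k=8):
    return GaussianMixture(n_components=k, covariance_type="diag").fit(frames)

def kl_estimate(p, q, n=2000):
    """Monte-Carlo estimate of KL(p || q) between two fitted GMMs."""
    x, _ = p.sample(n)
    return np.mean(p.score_samples(x) - q.score_samples(x))

# Hypothetical training sets: a few songs per artist.
artist_songs = {"artist1": ["a1_song1.wav", "a1_song2.wav"],
                "artist2": ["a2_song1.wav", "a2_song2.wav"]}
artist_models = {a: fit_gmm(np.vstack([mfcc_frames(f) for f in songs]))
                 for a, songs in artist_songs.items()}

# Classify an unknown song to the closest artist model.
song_model = fit_gmm(mfcc_frames("unknown.wav"))
best = min(artist_models, key=lambda a: kl_estimate(song_model, artist_models[a]))
print("closest artist:", best)
```

A common simplification of the same idea is to skip the song-side GMM and simply score the test song's frames under each artist model, picking the artist with the highest average log-likelihood.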
"Anchor Space"
• Acoustic features describe each song
  .. but from a signal, not a perceptual, perspective
  .. and not the differences between songs
• Use genre classifiers to define a new space
  prototype genres are the "anchors"
  [Diagram: audio input → per-anchor GMM classifiers → posteriors p(a1|x), p(a2|x), ..., p(an|x) → an n-dimensional vector in "Anchor Space"; two songs' anchor-space distributions are compared by KL divergence, EMD, etc.]

Anchor Space
• Frame-by-frame high-level categorizations
  compare to raw features?
  [Figure: Madonna vs. Bowie frames plotted in cepstral space (third vs. fifth cepstral coefficient) and in anchor space (Electronica vs. Country posteriors).]
• Properties in the distributions? dynamics?

'Playola' Similarity Browser
[Screenshot of the Playola browser interface.]

Ground-truth data
• Hard to evaluate Playola's 'accuracy'
  user tests... ground truth?
• "Musicseer" online survey:
  ran for 9 months in 2002
  > 1,000 users, > 20k judgments
  http://labrosa.ee.columbia.edu/projects/musicsim/

Evaluation
• Compare classifier measures against the Musicseer subjective results
  "triplet" agreement percentage
  Top-N ranking agreement score:
    s_i = Σ_{r=1..N} α_r^r · α_c^{k_r},  with α_r = (1/2)^{1/3} and α_c = α_r^2
  first-place agreement percentage
  top-rank agreement test - simple significance
  [Figure: bar chart of agreement percentages for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK on the survey and game data sets (SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92).]

Using SVMs for Artist ID
• Support Vector Machines (SVMs) find hyperplanes in a high-dimensional space
  rely only on the matrix of distances between points
  much 'smarter' than nearest-neighbor/overlap
  want diversity in the reference vectors...
  [Diagram: two classes y_i = ±1 separated by the hyperplane (w·x) + b = 0, with margin hyperplanes (w·x) + b = +1 and (w·x) + b = -1.]

Song-Level SVM Artist ID
• Instead of one model per artist/genre, use every training song as an 'anchor'
  then the SVM finds the best support for each artist
  [Diagram: training songs → MFCCs → song-level features → distances D to all training songs → DAG SVM → artist label; a test song is classified the same way.]

Artist ID Results
• ISMIR/MIREX 2005 also evaluated Artist ID
  148 artists, 1800 files (split train/test) from 'uspop2002'
• The song-level SVM clearly dominates, using only MFCCs!

  Results of the formal MIREX 2005 Audio Artist ID evaluation (USPOP2002),
  from http://www.music-ir.org/evaluation/mirex-results/audio-artist/:

  Rank  Participant   Raw Accuracy   Normalized   Runtime / s
  1     Mandel        68.3%          68.0%        10240
  2     Bergstra      59.9%          60.9%        86400
  3     Pampalk       56.2%          56.0%        4321
  4     West          41.0%          41.0%        26871
  5     Tzanetakis    28.6%          28.5%        2443
  6     Logan         14.8%          14.8%        ?
  7     Lidy          Did not complete

Playlist Generation
• SVMs are well suited to "active learning"
  solicit labels on the items closest to the current boundary
• An automatic player with "skip" =
  ground truth data collection
  + active-SVM automatic playlist generation
  (a sketch of this loop follows below)
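A minimal sketch of the active-SVM player idea, under the assumption that each song is already summarized by a fixed feature vector: "skip" supplies a negative label, a completed listen a positive one, the SVM is refit, and the next song played is the one nearest the current decision boundary, so each play also refines the model. The function names and the feedback hook are hypothetical, not the actual Playola implementation.

```python
# Minimal sketch of active-SVM label collection / playlist generation.
# Feature vectors, seeds, and the feedback function are placeholders.
import numpy as np
from sklearn.svm import SVC

def next_query(clf, X, unlabeled):
    """Index of the unlabeled song closest to the SVM decision boundary."""
    margins = np.abs(clf.decision_function(X[unlabeled]))
    return unlabeled[int(np.argmin(margins))]

def run_session(X, get_feedback, seeds, n_rounds=20):
    """X: songs x features; get_feedback(i) -> +1 (listened) or -1 (skipped).

    seeds must contain at least one positive and one negative example,
    e.g. {3: +1, 17: -1}, so the first SVM fit has two classes.
    """
    labeled = dict(seeds)
    unlabeled = [i for i in range(len(X)) if i not in labeled]
    clf = SVC(kernel="rbf", gamma="scale")
    for _ in range(n_rounds):
        idx = list(labeled)
        clf.fit(X[idx], [labeled[i] for i in idx])
        i = next_query(clf, X, unlabeled)      # play the most informative song
        labeled[i] = get_feedback(i)           # skip => -1, full listen => +1
        unlabeled.remove(i)
    return clf, labeled
```

For generating a playlist rather than just collecting labels, the same classifier could instead rank the remaining songs by decision_function and queue the most confidently positive ones.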
Conclusions
[Diagram: the talk roadmap revisited - music audio → melody, drums, and event extraction → fragment clustering, eigenrhythms, anchor models, and semantic bases → similarity/recommendation and, perhaps, synthesis/generation.]
• Lots of data + noisy transcription + weak clustering → musical insights?