What can we Learn from Large Music Databases?
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu

Outline:
1. Learning Music
2. Music Similarity
3. Melody, Drums, Event Extraction
4. Conclusions

Learning from Music
• A lot of music data is available
  e.g. 60 GB of MP3 ≈ 1000 hr of audio / 15k tracks
• What can we do with it?
  an implicit definition of 'music'
• Quality vs. quantity
  the speech recognition lesson: 10x the data with 1/10th the annotation is twice as useful
• Motivating applications
  music similarity / classification
  computer-(assisted) music generation
  insight into music

Ground Truth Data
• A lot of unlabeled music data is available, but manual annotation is much rarer
  [Figure: spectrogram of a pop excerpt (aimee.wav), 0-7 kHz over ~18 s, with hand-marked 'vox' and 'mus' segments]
• Unsupervised structure discovery is possible
  .. but labels help to indicate what you want
• Weak annotation sources
  artist-level descriptions
  symbol sequences without timing (MIDI)
  errorful transcripts
• Evaluation requires ground truth
  the limiting factor in Music IR evaluations?

Talk Roadmap
[Diagram: music audio feeds four chains - (1) anchor models and semantic bases → similarity/recommendation; (2) melody extraction → fragment clustering; (3) drums extraction → eigenrhythms; (4) event extraction → synthesis/generation?]

1. Music Similarity Browsing (with Adam Berenzweig)
• Musical information overload
  record companies filter/categorize music
  an automatic system would be less odious
• Connecting audio and preference
  map to a 'semantic space'?
  [Diagram: audio input → per-anchor GMM models (classes 1..n) → frame posteriors p(a1|x) .. p(an|x) → n-dimensional vector in "Anchor Space" → similarity computation (KL-d, EMD, etc.)]

Anchor Space
• Frame-by-frame high-level categorizations
  compare to raw features?
  [Figure: Madonna vs. Bowie frames scattered in cepstral space (third vs. fifth cepstral coefficients) and in anchor space (Electronica vs. Country posteriors)]
• properties in the distributions? dynamics?

'Playola' Similarity Browser
[Screenshot: the Playola similarity-browser interface]

Semantic Bases (with Brian Whitman)
• What should the 'anchor' dimensions be?
  hand-chosen genres? ✗
  better to choose them automatically somehow
• "Community metadata": use the Web to get words/phrases ..
  .. that are informative about artists
  .. and that can be predicted from audio
• Refine classifiers to below artist level, e.g. by EM?

Excerpt, on scoring the per-frame term classifiers: a is the number of frames in which a term classifier positively agrees with the truth value (both classifier and truth say a frame is 'funky,' for example); b is the number of frames in which the term classifier indicates a negative term association but the truth value indicates a positive one (the classifier says a frame is not 'funky,' but truth says it is); c is the number of times the term classifier predicts a positive association where the truth is negative; and d is the number of times the classifier and truth agree on a negative association. We wish to maximize a and d as correct classifications; by contrast, random guessing by the classifier would give the same ratio of classifier labels regardless of ground truth, i.e. a/b ≈ c/d. With N = a + b + c + d, the K-L distance measures how far the observed counts depart from that chance behavior.

  adj Term     K-L bits  |  np Term                K-L bits
  aggressive   0.0034    |  reverb                 0.0064
  softer       0.0030    |  the noise              0.0051
  synthetic    0.0029    |  new wave               0.0039
  punk         0.0024    |  elvis costello         0.0036
  sleepy       0.0022    |  the mud                0.0032
  funky        0.0020    |  his guitar             0.0029
  noisy        0.0020    |  guitar bass and drums  0.0027
  angular      0.0016    |  instrumentals          0.0021
  acoustic     0.0015    |  melancholy             0.0020
  romantic     0.0014    |  three chords           0.0019

  Table 2. Selected top-performing models of adjective and noun-phrase terms used to predict new reviews of music, with their corresponding bits of information from the K-L distance measure.
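The "K-L bits" in the excerpt can be computed directly from the four contingency counts defined above. A minimal Python sketch, under the assumption that the K-L distance here is the divergence of the observed classifier/truth counts from the chance (independent) model, i.e. the mutual information of the two binary variables; the function name and the example counts are illustrative, not from the paper:

import math

def kl_bits(a, b, c, d):
    """KL divergence (in bits) of the observed classifier/truth
    contingency table from the chance model in which classifier
    output is independent of truth.
    a: both positive                 b: classifier neg, truth pos
    c: classifier pos, truth neg     d: both negative"""
    n = a + b + c + d
    total = 0.0
    # each cell's joint count with its classifier and truth marginals
    for joint, cls_marg, truth_marg in ((a, a + c, a + b),
                                        (b, b + d, a + b),
                                        (c, a + c, c + d),
                                        (d, b + d, c + d)):
        if joint:
            total += (joint / n) * math.log2(joint * n / (cls_marg * truth_marg))
    return total

# A chance-level classifier (a/b == c/d) scores exactly 0 bits:
print(kl_bits(10, 10, 10, 10))   # -> 0.0
print(kl_bits(40, 10, 15, 935))  # an informative term model scores higher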
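Before moving on, the anchor-space pipeline above (anchor GMMs → per-frame posteriors → distribution comparison) can also be made concrete. This is a hedged reconstruction rather than the original code: the sklearn GMMs, the equal-prior posterior computation, and the symmetrized KL between diagonal Gaussians fitted to each track's anchor-space cloud (one of the "KL-d, EMD, etc." options) are all assumptions:

import numpy as np
# anchor_gmms below are assumed to be trained sklearn GaussianMixture
# models, one per anchor class (e.g. genre).

def anchor_posteriors(frames, anchor_gmms):
    """Map feature frames (n_frames x n_dims, e.g. MFCCs) to
    anchor-class posteriors p(a_i|x), assuming equal anchor priors."""
    loglik = np.stack([g.score_samples(frames) for g in anchor_gmms], axis=1)
    loglik -= loglik.max(axis=1, keepdims=True)    # numerical stability
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)  # n_frames x n_anchors

def gauss_kl(m1, v1, m2, v2):
    """KL(N1 || N2) for diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def track_distance(frames_a, frames_b, anchor_gmms):
    """Symmetrized KL between Gaussian summaries of two tracks'
    anchor-space clouds; smaller means more similar."""
    pa = anchor_posteriors(frames_a, anchor_gmms)
    pb = anchor_posteriors(frames_b, anchor_gmms)
    ma, va = pa.mean(axis=0), pa.var(axis=0) + 1e-9
    mb, vb = pb.mean(axis=0), pb.var(axis=0) + 1e-9
    return gauss_kl(ma, va, mb, vb) + gauss_kl(mb, vb, ma, va)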
2. Transcription as Classification (with Graham Poliner)
• Signal models are typically used for transcription
  harmonic spectrum, superposition
• But ... trade domain knowledge for data
  treat transcription as a pure classification problem:
  audio → trained classifier → p("C0"|audio), p("C#0"|audio), p("D0"|audio), p("D#0"|audio), p("E0"|audio), p("F0"|audio), ...
  a single N-way discrimination for "melody"
  per-note classifiers for polyphonic transcription

Classifier Transcription Results
• Trained on MIDI syntheses (32 songs)
  SMO SVM (Weka)
• Tested on the ISMIR MIREX 2003 set
  with foreground/background separation
• Frame-level pitch concordance:

                    system   "jazz3"
  overall (fg+bg)   71.5%    44.3%
  just fg           56.1%    45.4%

Forced-Alignment of MIDI (with Rob Turetsky)
• MIDI is a handy description of music
  notes, instruments, tracks
  .. can drive synthesis
• Align MIDI 'replicas' to the audio to get ground truth
  estimate the time-warp relation
  [Figure: "Don't You Want Me" (Human League), verse 1 - spectrograms of the original and the MIDI replica (0-4 kHz, ~17-28 s) above the aligned MIDI note piano-roll]

3. Melody Clustering (with Graham Poliner)
• Goal: find 'fragments' that recur in melodies
  .. across a large music database
  .. trade data for model sophistication
  [Pipeline: training data → melody extraction → 5-second fragments → VQ clustering → top clusters]
• Data sources
  a pitch tracker, or MIDI training data
• Melody fragment representation
  DCT(1:20) - removes the average, smooths away detail

Melody Clustering Results
• Clusters match the underlying contour
• Finds some similarities: e.g. Pink + Nsync

4. Eigenrhythms: Drum Pattern Space (with John Arroyo)
• Pop songs are built on a repeating "drum loop"
  variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture the variations?
  by analyzing lots of (MIDI) data, or from audio
• Applications
  music categorization
  "beat box" synthesis
  insight

Aligning the Data
• Need to align patterns prior to modeling...
  tempo (stretch): infer BPM & normalize
  downbeat (shift): correlate against a 'mean' template

Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract beat-weight (see the sketch below)
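Before the nonnegative variant on the next slide, here is a minimal sketch of the eigen-analysis step, assuming the patterns have already been tempo- and downbeat-aligned into fixed-length vectors as just described. The shapes and component count follow the slide; the data is a random stand-in:

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 100 aligned drum patterns, each flattened to a 1200-dim
# vector (e.g. bass/snare/hat rows over a fixed time grid).
patterns = np.random.rand(100, 1200)

pca = PCA(n_components=20)             # slide: 20+ eigenvectors for good coverage
weights = pca.fit_transform(patterns)  # per-pattern eigenrhythm coordinates

# Reconstruct one pattern from the mean rhythm plus its top-k eigenrhythms;
# PCA components can both add and subtract beat-weight.
k = 5
approx = pca.mean_ + weights[0, :k] @ pca.components_[:k]
print("variance captured by 20 components:", pca.explained_variance_ratio_.sum())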
Posirhythms (NMF)
[Figure: six nonnegative 'posirhythm' bases - hi-hat (HH), snare (SN), bass drum (BD) rows over a four-beat, 400-sample grid]
• Nonnegative: only adds beat-weight
• Capturing some structure

Eigenrhythms for Classification
• Projections in eigenspace / LDA space
  [Figure: PCA(1,2) projection (16% corr) and LDA(1,2) projection (33% corr) of patterns from ten genres: blues, country, disco, hiphop, house, newwave, pop, punk, rnb, rock]
• 10-way genre classification (nearest neighbor):
  PCA3: 20% correct
  LDA4: 36% correct

Eigenrhythm BeatBox
• Resynthesize rhythms from points in eigen-space
  [Screenshot: the BeatBox interface]

5. Event Extraction
• Music often contains many repeated events
  notes, drum sounds
  but: usually overlapped...
• Vector quantization finds common patterns
  [Diagram: loop over training data - event dictionary → find alignments → combine & re-estimate]
  open issues: representation, aligning/matching, how much coverage is required?

Drum Track Extraction (with Ron Weiss, after Yoshii et al. '04)
• Initialize the dictionary with bass drum and snare templates
• Match only on a few spectral peaks
  narrowband energy is the most likely to avoid overlap
• Median-filter to re-estimate each template
  .. after normalizing amplitudes
  can pick up partials from common notes

Generalized Event Detection (with Michael Mandel)
• Based on 'Shazam' audio fingerprints (Wang '03)
  [Figure: two spectrograms (frequency vs. time) with paired landmark peaks marked]
  the relative timing of F1-F2-ΔT triples discriminates pieces
  narrowband features avoid collisions (again)
• Fingerprint events, not recordings (see the sketch below)
  choose the top triples, look for repeats
  rank reduction of the triples × time matrix
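A minimal sketch of the landmark-triple idea in Python; peak picking is assumed to have happened elsewhere, and the fan-out and target-zone parameters are illustrative choices rather than Wang's published settings:

from collections import Counter

def landmark_triples(peaks, fan_out=3, max_dt=64):
    """Form (f1, f2, dt) hash triples from time-sorted spectral peaks.
    peaks: iterable of (time_frame, freq_bin); each anchor peak is
    paired with up to fan_out later peaks within max_dt frames."""
    peaks = sorted(peaks)
    triples = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            triples.append(((f1, f2, dt), t1))  # hash plus anchor time
            paired += 1
            if paired >= fan_out:
                break
    return triples

# Fingerprinting events rather than recordings: triples that repeat
# many times within one track are candidates for a recurring event.
def repeated_hashes(triples, min_count=4):
    counts = Counter(h for h, _ in triples)
    return {h: n for h, n in counts.items() if n >= min_count}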
Event Detection Results
• Procedure
  find hash triples
  cluster them
  patterns in hash co-occurrence = events?
  [Figure: hash co-occurrence matrix]

Conclusions
[Diagram: the talk roadmap again - music audio → anchor models / semantic bases → similarity/recommendation; melody extraction → fragment clustering; drums extraction → eigenrhythms; event extraction → synthesis/generation?]
• Lots of data + noisy transcription + weak clustering → musical insights?

Approaches to Chord Transcription (with Alex Sheh)
• Note transcription, then note→chord rules
  like labeling the chords in MIDI transcripts
• Spectrum→chord rules
  i.e. find harmonic peaks, use knowledge of the likely notes in each chord
• Trained classifier
  don't use any "expert knowledge"
  instead, learn the patterns from labeled examples
• Train ASR HMMs with chords ≈ words (see the sketches at the end)

Chord Sequence Data Sources
• All we need are the chord sequences for our training examples
  Hal Leonard "Paperback Song Series", manually retyped for 20 songs:
  "Beatles for Sale", "Help", "Hard Day's Night"
  [Figure: the retyped chord chart for The Beatles' "A Hard Day's Night" (G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G ... Bm Em Bm G Em C D ...) above a 0-5 kHz spectrogram of the first 25 s]
• Hand-align chords for 2 test examples

Chord Results
• Recognition is weak, but forced alignment is OK
  Frame-level accuracy (random ≈ 3%):

  Feature    Recognition   Alignment
  MFCC       8.7%          22.0%
  PCP_ROT    21.7%         76.0%

  MFCCs are poor (they can overtrain)
  PCPs are better (ROT helps generalization)
  [Figure: true, aligned, and recognized chord labels (E, G, D, Bm, Em7, Am, ...) over 16.3-24.8 s of "Eight Days a Week" (Beatles for Sale)]

What did the models learn?
• The chord model centers (means) indicate chord 'templates':
  [Figure: PCP_ROT family model means (trained on 18 songs) for DIM, DOM7, MAJ, MIN, and MIN7, plotted over pitch classes C..B for C-root chords]
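The learned means suggest a simple template reading of chord recognition: keep one root-normalized model per chord family and rotate it to every root, as the PCP_ROT feature does during training. A minimal sketch with idealized binary templates (not the trained means plotted above); the chroma/PCP frames are assumed computed elsewhere:

import numpy as np

# Idealized C-root templates over pitch classes C..B; binary sketches,
# not the learned model means.
TEMPLATES = {
    "maj":  [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # C E G
    "min":  [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # C Eb G
    "dom7": [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # C E G Bb
}

def classify_pcp(pcp):
    """Label a 12-dim pitch-class profile with (root, family) by cosine
    similarity against all 12 rotations of each family template, the
    same root-normalization trick as PCP_ROT."""
    pcp = np.asarray(pcp, float)
    pcp = pcp / (np.linalg.norm(pcp) + 1e-9)
    best, best_score = None, -np.inf
    for family, tmpl in TEMPLATES.items():
        t = np.asarray(tmpl, float)
        t /= np.linalg.norm(t)
        for root in range(12):
            score = pcp @ np.roll(t, root)
            if score > best_score:
                best, best_score = (root, family), score
    return best

# A noisy G-major frame classifies as (7, 'maj'): root 7 semitones above C.
frame = np.roll([1, 0, 0, 0, 0.9, 0, 0, 0.8, 0, 0, 0, 0], 7)
print(classify_pcp(frame))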
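Finally, to connect back to the "chords ≈ words" recipe: a hedged sketch of the HMM side using hmmlearn as a stand-in for the ASR toolkit, with one Gaussian state per chord over PCP frames. The chord inventory, the flat initialization, and the random stand-in features are all assumptions:

import numpy as np
from hmmlearn import hmm

N_CHORDS = 24  # e.g. 12 roots x {maj, min}; the real system used more families

# One state per chord label with diagonal-Gaussian emissions over 12-dim
# PCP frames; EM refines means, variances, and transitions, mirroring
# the ASR-style training described above.
model = hmm.GaussianHMM(n_components=N_CHORDS, covariance_type="diag",
                        n_iter=20, init_params="mc", params="stmc")
model.startprob_ = np.full(N_CHORDS, 1.0 / N_CHORDS)
model.transmat_ = np.full((N_CHORDS, N_CHORDS), 1.0 / N_CHORDS)

pcp_frames = np.random.rand(500, 12)  # stand-in for real PCP features
model.fit(pcp_frames)                 # EM training
logprob, chord_path = model.decode(pcp_frames, algorithm="viterbi")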