ELEN E4896 MUSIC SIGNAL PROCESSING Lecture 12: Alignment and Matching 1. Music Alignment 2. Cover Song Detection 3. Echo Nest Analyze Dan Ellis Dept. Electrical Engineering, Columbia University dpwe@ee.columbia.edu E4896 Music Signal Processing (Dan Ellis) http://www.ee.columbia.edu/~dpwe/e4896/ 2013-04-15 - 1 /22 1. Music Alignment • Often have Kurth et al., 2007 versions of the same music with unmatched time axes different performances performance vs. score • Various applications for aligning them synchronizing different tracks (with TSM) synchronized score display ground truth transcriptions E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 2 /22 The Similarity Matrix Foote 1999 48 time / beats • Point-to-point comparison of sequences yj (k)|2 or normalized inner product (cosine distance) xi (k)yj (k) 2 2 k |xi (k)| k |yj (k)| k 5 4 3 2 1 B A G E D C dcos (i, j) = 1 6 32 k |xi (k) j B A G E D C E4896 Music Signal Processing (Dan Ellis) dij 16 deuc (i, j) = Let It Be - The Beatles e.g. Euclidean distance Euclidean Distance Let It Be - Nick Cave 16 32 i 48 time / beats 2013-04-15 - 3 /22 Dynamic Programming Bellman 1957 • Find best path combining local + transitions works for any kind of similarity matrix Local costs dij ; C* ; paths j Allowable transitions {ik, jk} T(1,0) = 0.1 T(1,1) = 0.0 T(0,1) = 0.1 {ik-1, jk-1} Finds path {ik, jk} to minimize cost ... Cimax ,jmax = 2.8 2.5 2.3 2.0 1.7 dij 8 2.7 1.6 7 2.1 2.2 2.0 1.8 1.5 1.3 1.2 1.6 6 1.4 1.5 1.3 1.2 1.0 0.9 1.3 1.5 5 1.2 1.2 1.0 1.0 0.7 0.9 1.2 1.5 1.5 0.9 4 0.8 0.8 0.7 0.6 0.7 1.2 1.8 ik 1 , jk jk 1) ... recursively Ci,j = 0.6 3 0.5 0.5 0.4 0.5 0.9 1.5 2.2 2.7 2 0.3 0.3 0.4 0.6 1.0 1.5 2.1 2.6 1 0.1 0.4 0.6 0.8 1.1 1.6 2.3 2.9 0.4 1 min 0.7 2.4 k x,y={(1,1),(1,0),(0,1)} 0.8 0.5 d(ik , jk ) +T (ik 1 d(i, j) + T (x, y) + Ci E4896 Music Signal Processing (Dan Ellis) 2 3 4 5 6 7 8 i 0.3 0.2 0.1 0 x,j y 2013-04-15 - 4 /22 Audio-to-Audio Alignment • Dynamic programming to get time mapping + phase vocoder time scaling 500 450 400 350 300 250 200 150 100 50 50 E4896 Music Signal Processing (Dan Ellis) 100 150 200 250 300 350 400 2013-04-15 - 5 /22 Audio-Score Alignment • Aligning a score representation (e.g. MIDI) is a proxy for polyphonic transcription Let It Be + aligned MIDI labels 1000 freq / Hz 800 600 400 200 0 0 2 4 E4896 Music Signal Processing (Dan Ellis) 6 8 10 12 time / sec 14 16 18 20 22 2013-04-15 - 6 /22 Peak Structure Distance Orio & Schwartz 2001 • How do we match spectra to score notes? Synthesized audio freq / kHz MIDI “Piano roll” note synthesize audio from MIDI & compare audio? “Peak Structure distance”: k M [k]|X[k]| dpsd = 1 is energy where we expect? k |X[k]| C6 C5 C4 C3 C2 E4896 Music Signal Processing (Dan Ellis) freq / bins freq / bins “Peak Structure” = energy blw mask 50 100 150 200 250 300 350 400 1 time / frames 0.5 0 Predicted spectrum = mask M[k] 0 0 5 10 15 20 time / sec 80 60 40 20 80 60 40 20 0 0 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 450 time / frames 2013-04-15 - 7 /22 2. Cover Song Detection • Musicians are fond of ‘cover versions’ usually alter melody, harmony, instrumentation, rhythm, style can be hard to spot even for a human! • Can try to match via alignment .. with some threshold on best alignment cost? E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 8 /22 Smith-Waterman Local Alignment different number, ordering of verse/ chorus/brige want to find any large aligned regions Beatles vs. Carol Woods − cosine dist 200 180 160 140 120 100 80 60 40 20 50 100 • “Local alignment” measure Si,j = max max{0, s(i, j) x,y Smith−Waterman cd/2,.96.1.2 time / beats • Cover version may have different form 150 P (x, y) + Si 50 100 time / beats x,j y } want largest score S* similarity s(i, j) must exceed penalty P(x,y) on avg. (e.g. 0.96 for diagonal, 1.2 for off-diagonal) E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 9 /22 Local Alignment Cover Detection Serrà & Gòmez, 2008 • Smith-Waterman needs predictable values use binary similarity based on best transposition Euclidean E4896 Music Signal Processing (Dan Ellis) Binary Non-cover 2013-04-15 - 10/22 Cross-correlation Covers System • DP is good for time-warping, but expensive beat-timing is tempo independent (if it works) simply cross-correlate beat-chroma patches? chroma bins Query how big are the pieces? G E D C A 100 200 extract 300 400 beats 500 how do we combine individual scores? also expensive cross-correlate chroma bins Candidate G E D C A 100 200 E4896 Music Signal Processing (Dan Ellis) 300 400 500 beats 2013-04-15 - 11/22 Global Cross-Correlation Ellis & Poliner, 2007 • Cross-correlate entire beat-chroma matrices ... at all possible transpositions (circular) implicit combination of match quality and duration chroma bins Elliott Smith - Between the Bars G E D C A skew / semitones chroma bins 100 200 300 400 Glen Phillips - Between the Bars 500 beats @281 BPM G E D C A Cross-correlation +5 0 -5 -500 • One good matching fragment is sufficient...? -400 -300 E4896 Music Signal Processing (Dan Ellis) -200 -100 0 100 200 300 400 skew / beats 2013-04-15 - 12/22 Filtered Cross-Correlation • Raw correlation not as important as precise local match skew / semitones looking for large contrast at ±1 beat skew i.e. high-pass filter Cross-correlation +5 0 -5 -500 -400 -300 -200 -100 0 100 200 Cross-correlation @ skew = +2 semitones 0.6 300 400 skew / beats 300 400 skew / beats raw 0.4 0.2 filtered 0 -500 -400 -300 -200 E4896 Music Signal Processing (Dan Ellis) -100 0 100 200 2013-04-15 - 13/22 Cover Song Results • 23 Covers found in 8700 song ‘uspop2002’ Cover Songs - dpwe23 - 12/23 correct Take_Me_To_The_River/annie_lennox Let_It_Be/nick_cave I_Love_You/faith_hill I_Can_t_Get_No_Satisfaction/rolling_stones Hush/milli_vanilli Grand_Illusion/styx Gold_Dust_Woman/sheryl_crow God_Only_Knows/brian_wilson Query Faith/limp_bizkit Enjoy_The_Silence/tori_amos Day_Tripper/cheap_trick Come_Together/beatles Cocaine/nazareth Claudette/roy_orbison Cecilia/simon_and_garfunkel Caroline_No/brian_wilson Blue_Collar_Man/styx Between_The_Bars/glen_phillips Before_You_Accuse_Me/eric_clapton America/simon_and_garfunkel All_Along_The_Watchtower/dave_matthews_band Addicted_To_Love/tina_turner Abracadabra/sugar_ray Ab Ad Al Am Be Be Bl Ca Ce Cl Co Co Da En Fa Go Go Gr Hu I_ I_ Le Ta Test popular ‘decoys’ – normalization issues E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 14/22 Analyzing Cover Song Correlation • Look inside global cross-correlation to find matching fragments... xcorr = t f (C1(t, f)⋅C2(t, f)) - view along time chroma Let It Be / Beatles (beats 11-441) G F D C chroma A 50 100 150 200 250 Let It Be / Nick Cave (beats 13-443) 50 100 150 200 50 100 150 200 300 350 400 time / beats 250 300 350 400 time / beats 250 300 350 400 time / beats G F D C A 0.4 0.2 0 -0.2 0 E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 15/22 Cover Song False Alarm • Correlation can be weak “Cocaine” (Clapton) vs. “Satisfaction” (Stones) chroma Eric Clapton - Cocaine - beats 17:1027 G F D C A 100 200 300 400 500 600 700 800 900 1000 chroma Rolling Stones - Satisfaction - beats 1:1011 G F D C A 100 200 300 400 500 600 700 800 900 1000 100 200 300 400 500 600 700 800 900 1000 2 1 0 -1 -2 0 E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 16/22 3. Echo Nest Analyze • Web service to provide beat, chroma, ... analysis (and much more) register for free API key freq / Hz TRKUYPW128F92E1FC0 - Tori Amos - Smells Like Teen Spirit Original 761 240 B A G http:// developer.echonest.c om/account/register/ EN Features E D C freq / Hz upload MP3, get back XML with analysis data 2416 2416 Resynth 761 240 0 E4896 Music Signal Processing (Dan Ellis) 2 4 6 8 10 12 time / sec 14 2013-04-15 - 17/22 EN Analyze Usage • Matlab wrapper function E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 18/22 Million Song Dataset (MSD) Thierry Bertin-Mahieux • Commercial-scale dataset available to MIR researchers 1M pop songs 250 GB of features (6 years of listening) • EN Analyze features +... Lyrics, Tags, Covers, Listeners ... http://labrosa.ee.columbia.edu/millionsong E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 19/22 MSD Metadata EN Metadata Last.fm Tags artist: release: title: id: key: mode: loudness: tempo: time_signature: duration: sample_rate: audio_md5: 7digitalid: familiarity: year: 100.0 – cover 57.0 – covers 43.0 – female vocalists 42.0 – piano 34.0 – alternative 14.0 – singer-songwriter 11.0 – acoustic 8.0 – tori amos 7.0 – beautiful 6.0 – rock 6.0 – pop 6.0 – Nirvana 6.0 – female vocalist 6.0 – 90s 5.0 – out of genre covers 'Tori Amos' 'LIVE AT MONTREUX' 'Smells Like Teen Spirit' 'TRKUYPW128F92E1FC0' 5 0 -16.6780 87.2330 4 216.4502 22050 '8' 5764727 0.8500 1992 SHS Covers %5489,4468, Smells TRTUOVJ128E078EE10 TRFZJOZ128F4263BE3 TRJHCKN12903CDD274 TRELTOJ128F42748B7 TRJKBXL128F92F994D TRIHLAW128F429BBF8 TRKUYPW128F92E1FC0 5.0 4.0 4.0 4.0 4.0 3.0 3.0 3.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 – – – – – – – – – – – – – – – cover songs soft rock nirvana cover Mellow alternative rock chick rock Ballad Awesome Covers melancholic k00l ch1x indie female vocalistist female cover song american MxM Lyric Bag-of-Words Like Teen Spirit Nirvana Weird Al Yankovic Pleasure Beach The Flying Pickets Rhythms Del Mundo feat. Shanade The Bad Plus Tori Amos E4896 Music Signal Processing (Dan Ellis) 12 11 10 9 7 6 6 6 hello i a and it are we now 6 6 6 4 4 4 3 3 here us entertain the feel yeah to my 3 3 3 3 3 3 3 3 is with oh out an light less danger 2013-04-15 - 20/22 Summary • Music Alignment Dynamic Programming finds correspondence • Cover Songs DP, or cross-correlation for efficiency • EN Analyze Web service to analyze audio E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 21/22 References • • • • • • R. Bellman, Dynamic Programming, Princeton University Press. 1957. D. Ellis and G. Poliner, “Identifying Cover Songs With Chroma Features and Dynamic Programming Beat Tracking,” Proc. ICASSP-07, Hawai'i, pp. IV-1429-1432, 2007. J. Foote, “Visualizing Music and Audio using Self-Similarity,” In Proc. ACM Multimedia, Orlando, pp. 77-80, 1999. Frank Kurth, Meinard Müller, Christian Fremerey,Yoon ha Chang, and Michael Clausen, “Automated synchronization of scanned sheet music with audio recordings,” Proc. 8th International Conference on Music Information Retrieval (ISMIR),Vienna, pp. 261-266, 2007. N. Orio & D. Schwarz, “Alignment of monophonic and polyphonic music to a score,” Proc. Int. Comp. Music Conf., Havana, pp.155-158, 2001. J. Serrà, E. Gómez, P. Herrera, X. Serra, “Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification,” IEEE Trans. on Audio, Speech and Lang. Proc., 16(6), pp. 1138-1151, 2008. E4896 Music Signal Processing (Dan Ellis) 2013-04-15 - 22/22