ELEN E4896 MUSIC SIGNAL PROCESSING
Lecture 13: Audio Fingerprinting

1. The Fingerprinting Problem
2. Frame-Based Approach
3. Landmark Approach

Dan Ellis
Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu
E4896 Music Signal Processing (Dan Ellis) http://www.ee.columbia.edu/~dpwe/e4896/ 2013-04-22 - 1/18

1. The Fingerprinting Problem
• Audio Fingerprinting: Known-Item search for the exact same performance (no “cover versions”) despite differences in audio channel, encoding, noise etc.
• Applications
  media monitoring
  metadata reconciliation
  “what’s that song?”

A Simple Fingerprint
• “Fingerprint” is a compact record sufficient to uniquely identify an example
  difficulty depends on item density, noise
  hash functions?
[Figure: plots with axes x1 and x2]

Fingerprinting Challenges
• Immunity to channel (speaker/mic), added noise
  the “coffeeshop” problem
• Recognize fragments from anywhere in the track
  the shorter the better
• Large corpus of reference items
• False alarm vs. false reject
[Figure: two spectrograms; freq / kHz vs. time / sec]

2. Frame-Based Approaches
• Standard audio-processing paradigm
  chop-up waveform into frames
  each frame → feature vector
  match on a sequence of feature vectors
[Block diagram labels: Reference items, Analysis, DB; Query, Analysis, Match]
• Challenges
  make the features invariant to channel variations
  make features insensitive to timing skew / offset
  computational efficiency

Channel Immunity (Haitsma & Kalker 2003)
• Audio matching should be invariant to
  lossy encoding (low-bitrate MP3)
  dynamic range compression (per band?)
  added noise (quantization, environment noise)
• Local threshold bitmask “hash”, with $X(t, f)$ the auditory magnitude spectrogram:
  $B(t, f) = 1$ iff $X(t, f) + X(t+1, f+1) > X(t, f+1) + X(t+1, f)$
[Figure: Auditory Spectrogram and Binarized Hash; frequency channel vs. time frame]

Timing Skew
• What happens if reference frames are out of sync with test frames?
  make frame length much longer than “hop”
  → features are very smooth in time
[Figure: reference frames 1 to 4 and query frames 1 to 4 along the time axis]

Retrieval & Matching (Haitsma & Kalker 2003)
• Matching is by Hamming Distance between query & ref
  use 256 x 32 bit frames (3 sec @ 11.6 ms frames)
  10k tracks ~ 200M frames
• Only check near exact match of one 32-bit word
  hash table index of occurrences of all $2^{32}$ = 4G values
  repeat for all 256 columns
  (can also test all 32 one-bit differences)

False Alarms vs. False Reject
• One 32-bit hash
  a = Pr(same hash | same audio) ~ 0.5 (dep. noise?)
  b = Pr(same hash | different audio) ~ $1/2^{32}$ (or $33/2^{32} \sim 1/2^{27} = 10^{-8}$ for 1-bit diffs)
  Pr(false match in L frames) = $1 - (1 - b)^L$ (~ $Lb$)
  L = 200M → Pr(false match) = 0.78
• K 32-bit hashes (e.g. K = 8)
  Pr(all match | diff audio) = $b^K \sim 10^{-65}$
  Pr(all match | same audio) = $a^K \sim 1/256$
• K matches out of N (e.g. 4 of 16 – Binomial)
  Pr(false reject) = Pr(B(16, a) < 4) = 0.0106
  Pr(false accept) = Pr(B(16, b) ≥ 4) ~ $10^{-29}$ (~ $b^4$)
[Figure: log10 Prob vs. # matches out of 16; curves log10(Pr(false reject)) and log10(Pr(false accept))]

3. Landmark Approach (Wang 2003, 2006)
• Idea: Use structures in audio as time reference instead of arbitrary time frames
  eliminates “framing errors”
• Another idea: Use individual, spectrally-local structures as component hashes
  robustness to missing frequency bands
• Use time-frequency peaks
  i.e. onsets of specific harmonics
  highest energy → most robust to noise
  the Shazam algorithm

Shazam Landmarks
• Find local peaks in spectrogram
  but only ~256 different frequencies - common
• Join them into pairs
  look for 2nd peak within some window
  hash {start freq, end freq, time diff} = 256 x 256 x 64 = 4M distinct patterns
  build index: [ hash → {track_ID, offset_sec} ]
[Figure: Phone ring - Shazam fingerprint; freq / kHz vs. time / s, level / dB]

Selecting Peaks
• Landmark Density controls match chance, required query size
  control by density of peaks
• Pick peaks based on local decaying surface
  width, rate of decay → peak density

Shazam Matching
• Any subset of hashes is sufficient
• Check temporal sequence for ref items with multiple hits
• Nearby pairs of peaks → hashes
• Each query hash → list of matching ref items
[Figure: panels "Query audio" and "Match: 05−Full Circle at 0.032 sec"; freq / Hz vs. time / sec]

Shazam Decoy
• Only a tiny part of the signal needs to be preserved
[Figure: panels "Query audio" and "Match: 05−Full Circle at 0 sec"]

Error Analysis
• ~20 hashes / sec → ~4000 hashes / track
  10k tracks → 40M hash entries (160MB)
  20 bit space
  → $10^6$ distinct hashes
  b = Pr(hash | wrong audio) = $1 - (1 - 10^{-6})^{4000}$ = 0.4%
  a = Pr(hash | right audio) = 0.1 ?
• 5 sec query → K = 100 hashes (Binomial)
  Pr(N chance matches) = $\binom{K}{N} b^N (1-b)^{K-N}$
  N = 6 → Pr ~ $10^9 \cdot 3 \times 10^{-15} \cdot 0.7 = 2.5 \times 10^{-6}$
  Pr(true matches < N) = $\sum_{n<N} \binom{K}{n} a^n (1-a)^{K-n}$ ~ 6%
[Figure: log10 Prob vs. # matches out of 100; curves log10(Pr(false reject)) and log10(Pr(false accept))]

Diagnostic Uses
• Recurring Landmarks show reused audio
[Figure: "Matching landmarks: 01−Taxman vs. 01−Taxman"; axes labeled Time / s in 01−Taxman]
[Figure: "Matching landmarks: 03 Rusted Pipe vs. Rusted Pipe"; time in Rusted Pipe − time in 03 Rusted Pipe / s vs. time in 03 Rusted Pipe / s]

Summary
• Fingerprinting
  Match same recording despite channel
• Frame-based
  Local features with fuzzy match
• Landmark based
  Highly tolerant of added noise

References
• J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system with an efficient search strategy,” J. New Music Research 32(2), 211-221, 2003.
• A. Wang, “An Industrial-Strength Audio Search Algorithm,” Proc. Int. Symp. on Music Info. Retrieval, 7-13, 2003.
• A. Wang, “The Shazam music recognition service,” Comm. ACM 49(8), 44-48, 2006.
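
Appendix sketch: the false alarm / false reject figures quoted in the two error-analysis slides (frame-based and landmark) can be checked numerically. This is a minimal sketch, assuming hash matches are independent so that the Binomial model from the slides applies; the function names are illustrative and do not come from the lecture.

```python
from math import comb


def binom_pmf(k, n, p):
    """Pr(exactly k successes in n independent trials of probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pr_fewer_than(k_min, n, p):
    """Pr(fewer than k_min successes in n trials), i.e. Pr(B(n, p) < k_min)."""
    return sum(binom_pmf(k, n, p) for k in range(k_min))


def pr_any_match(b, trials):
    """Pr(at least one chance match in `trials` independent comparisons)."""
    return 1 - (1 - b) ** trials


def frame_based():
    # One 32-bit hash, also accepting the 32 one-bit variants: b = 33 / 2^32.
    b = 33 / 2**32
    false_match = pr_any_match(b, 200_000_000)   # L = 200M reference frames
    false_reject = pr_fewer_than(4, 16, 0.5)     # need 4 of 16, a = 0.5
    return false_match, false_reject


def landmark_based():
    # Chance that a given hash occurs somewhere in a wrong track:
    # 4000 hashes per track drawn from about 10^6 distinct values.
    b = pr_any_match(1e-6, 4000)
    chance_6 = binom_pmf(6, 100, b)              # exactly 6 chance hits of K = 100
    false_reject = pr_fewer_than(6, 100, 0.1)    # a = 0.1, threshold N = 6
    return b, chance_6, false_reject


if __name__ == "__main__":
    print("frame-based (false match, false reject):", frame_based())
    print("landmark (b, Pr(6 chance matches), false reject):", landmark_based())
```

Under these assumptions the results land on the slides' rounded values for Pr(false match), both false-reject probabilities, and b, and within the same order of magnitude for Pr(6 chance matches), where the slide multiplies rounded factors.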