Audio Fingerprinting
J.-S. Roger Jang (張智星)
jang@mirlab.org
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan University

Intro to Audio Fingerprinting (AFP)
- Goal: identify a noisy version of a given audio clip
- Also known as "query by exact example": no "cover versions"
- Can also be used to align two different-speed audio clips of the same source
  - Dan Ellis used AFP for aligned annotation of the Beatles dataset

AFP Challenges
- Music variations: encoding/compression (e.g., MP3)
- Channel variations: speakers & microphones, room acoustics
- Environmental noise
- Efficiency (6M tags/day for Shazam)
- Database collection (15M tracks for Shazam)

AFP Applications
- Commercial applications of AFP
  - Music identification & purchase
  - Royalty assignment (over radio)
  - TV show or commercial ID (over TV)
  - Copyright violation detection (over the web)
- Major commercial players: Shazam, SoundHound, IntoNow, Viggle…

Company: Shazam
- Facts: first commercial AFP product; since 2002, UK
- Technology: audio fingerprinting
- Founder: Avery Wang (PhD at Stanford, 1994)

Company: SoundHound
- Facts: first product with multi-modal music search; AKA midomi
- Technologies: audio fingerprinting, query by singing/humming, speech recognition
- Founder: Keyvan Mohajer (PhD at Stanford, 2007)

Two Stages in AFP
- Offline: feature extraction; hash table construction for songs in the database (inverted indexing)
- Online: feature extraction; hash table search; ranked list of the retrieved songs/music

Robust Feature Extraction
- Various kinds of features for AFP, all aiming at invariance along time and frequency
  - Landmarks (pairs of local maxima)
  - Wavelets
  - …
- Extensive tests are required to choose the best features

Representative Approaches to AFP
- Philips: J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR, 2002
- Shazam: A. Wang, "An industrial-strength audio search algorithm", ISMIR, 2003
- Google: S. Baluja and M. Covell, "Content fingerprinting using wavelets", Euro. Conf. on Visual Media Production, 2006
- V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR, 2011

Philips: Thresholding as Features
- Observation: for a magnitude spectrum S(t, f), the sign of energy differences is robust to various operations
  - Lossy encoding, range compression, added noise
- Fingerprint F(t, f):
  F(t, f) = 1 if S(t, f) − S(t, f+1) − (S(t+1, f) − S(t+1, f+1)) > 0
  F(t, f) = 0 otherwise
  (Source: Dan Ellis)

Philips: Thresholding as Features (II)
- Robust to low-bitrate MP3 encoding: the fingerprint after MP3 encoding differs only slightly from the original (BER = 0.078)
- Sensitive to "frame time difference", so the hop size is kept small
- A minimal code sketch of this fingerprint follows.
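The sketch below implements the sign-of-energy-differences fingerprint described above in MATLAB (a minimal sketch, not the Philips production code; the matrix S and the row/column layout are assumptions, with frames as rows and frequency bands as columns):

```matlab
% Philips-style binary fingerprint: the energy difference along
% frequency, differenced again along time, keeping only the sign
% (a minimal sketch; S is assumed to be an nFrame-by-nBand
% band-energy matrix).
function F = philipsFingerprint(S)
  dFreq = S(:, 1:end-1) - S(:, 2:end);          % difference along frequency
  dTime = dFreq(1:end-1, :) - dFreq(2:end, :);  % difference of differences along time
  F = dTime > 0;                                % keep only the sign bits
end
```

Two fingerprints of the same size can then be compared by their bit error rate, e.g. ber = mean(F1(:) ~= F2(:)), in the spirit of the BER = 0.078 comparison above.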
Philips: Robustness of Features
- BER of the features after various operations is generally low
- High only for speed and time-scale changes, which are unlikely to occur under query by example

Philips: Search Strategies
- Via hashing
- Inverted indexing

Shazam's Method
- Take advantage of local music structure
  - Find salient peaks on the spectrogram
  - Pair peaks to form landmarks for comparison
- Efficient search by hash tables
  - Use positions of landmarks as hash keys
  - Use song ID and offset time as hash values
  - Use time constraints to find matched landmarks

Database Preparation
- Compute the spectrogram; perform mean subtraction & high-pass filtering
- Detect salient peaks
  - Find an initial threshold
  - Update the threshold along time
- Pair salient peaks to form landmarks
  - Define a target zone
  - Form landmarks and save them to a hash table

Query Match
- Identify landmarks
- Find matched landmarks
  - Retrieve landmarks from the hash table
  - Keep only time-consistent landmarks
- Rank the database items
  - Via matched landmark counts
  - Via other confidence measures

Shazam: Landmarks as Features
- Pair each salient peak of the spectrogram with the peaks in its target zone to form landmarks
- Landmark: [t1, f1, t2, f2]
- 24-bit hash key: f1 (9 bits), Δf = f2 − f1 (8 bits), Δt = t2 − t1 (7 bits)
- Hash value: song ID and the landmark's start time t1
(Figure: salient peaks of a spectrogram; Avery Wang, 2003)

How to Find Salient Peaks
- We need peaks that are salient along both the frequency and time axes
  - Frequency axis: Gaussian local smoothing
  - Time axis: decaying threshold over time

How to Find Initial Threshold?
- Goal: to suppress neighboring peaks
- Ideas
  - Find the local maxima of the magnitude spectra of the initial 10 frames
  - Superimpose a Gaussian on each local maximum
  - Take the max of all the Gaussians as the initial threshold
(Figure: envelope generation example based on "Bad Romance", produced by envelopeGen.m; curves show the original signal, the positive local maxima, and the final output.)

How to Update the Threshold along Time?
- Decay the threshold
- Find local maxima larger than the threshold → salient peaks
- Define the new threshold as the max of the old threshold and the Gaussians passing through the newly found local maxima

How to Control the No. of Salient Peaks?
- To decrease the number of salient peaks
  - Perform forward and backward sweeps to find peaks that are salient along both directions
  - Use Gaussians with a larger standard deviation
  - …

Time-decaying Thresholds
(Figure: forward-pass and backward-pass peak picking with time-decaying thresholds, frequency index vs. frame index; produced by landmarkFind01.m.)
A sketch of the forward pass is shown below.
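The following is a minimal MATLAB sketch of the forward-pass procedure just described, not the actual landmarkFind01.m; the parameters sigma and decay, the function name, and the exact way the initial envelope is built from the first 10 frames are illustrative assumptions:

```matlab
% Forward-pass salient peak picking with a time-decaying threshold
% (a sketch; S is assumed to be an nBand-by-nFrame magnitude
% spectrogram, and sigma/decay are illustrative parameters).
function peaks = salientPeakFind(S)
  [nBand, nFrame] = size(S);
  sigma = 8;                                   % assumed Gaussian width
  decay = 0.998;                               % assumed per-frame decay
  g = exp(-((1:nBand)' - (1:nBand)).^2 / (2*sigma^2));  % Gaussian centered at each band
  thr = max(S(:, 1:min(10, nFrame)), [], 2);   % band-wise max over the first 10 frames
  thr = max(g .* thr', [], 2);                 % superimpose Gaussians -> initial threshold
  peaks = zeros(0, 2);                         % rows: [frameIndex, bandIndex]
  for t = 1:nFrame
    s = S(:, t);
    isMax = [false; s(2:end-1) > s(1:end-2) & s(2:end-1) >= s(3:end); false];
    idx = find(isMax & s > thr);               % local maxima above the threshold
    peaks = [peaks; repmat(t, numel(idx), 1), idx]; %#ok<AGROW>
    thr = thr * decay;                         % decay the threshold...
    for k = idx.'                              % ...then raise it at each new peak
      thr = max(thr, s(k) * g(:, k));
    end
  end
end
```

The backward pass is the same loop run over the frames in reverse order; intersecting the two peak sets keeps only peaks that are salient in both directions, as in the slide above.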
How to Pair Salient Peaks?
- A target zone is created right after each salient peak
- The leading peak is paired with each peak in its target zone to form landmarks
- Each landmark is denoted by [t1, f1, t2, f2]

Salient Peaks and Landmarks
(Figure: peak picking after forward smoothing, with matched landmarks shown in green. Source: Dan Ellis)

Time Skew
- The query's frame boundaries are generally out of sync with the reference's frame boundaries, so the two frame grids are time-skewed
- Possible solutions: increase the frame size, or repeat landmark extraction at several time shifts
(Figure: misaligned reference and query frame grids illustrating time skew.)

To Avoid Time Skew
- Query landmarks are extracted at various time shifts
- Example with 4 shifts of step = hop/4: LM set 1, LM set 2, LM set 3, LM set 4 → union & unique → query landmark set

Landmarks for Hash Table Access
- Convert each landmark to a hash key (and value)
- Landmark from the database (songId, [t1, f1, t2, f2]) → hash table creation
  - 24-bit hash key = f1 (9 bits) + Δf (8 bits) + Δt (7 bits)
  - 32-bit hash value = songId (18 bits) + t1 (14 bits)
- Landmark from the query → hash table lookup
  - Use f1, Δf, and Δt to generate the hash key for hash table lookup
  - Use t1 to find matched (time-consistent) landmarks
- A sketch of this bit packing is shown below.
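A minimal MATLAB sketch of the key/value packing above follows (the function name is illustrative; the fields are assumed to be non-negative integers within their bit budgets, differences are wrapped with mod, and range checks are omitted):

```matlab
% Pack a landmark into the 24-bit key / 32-bit value layout above
% (a sketch; assumes f1 < 2^9, songId < 2^18, t1 < 2^14).
function [key, value] = lm2hash(songId, t1, f1, t2, f2)
  df = mod(f2 - f1, 2^8);                % wrap the frequency difference into 8 bits
  dt = mod(t2 - t1, 2^7);                % wrap the time difference into 7 bits
  key   = f1*2^15 + df*2^7 + dt;         % f1 (9 bits) | df (8 bits) | dt (7 bits)
  value = songId*2^14 + mod(t1, 2^14);   % songId (18 bits) | t1 (14 bits)
end
```

At query time the same key computation is applied to each query landmark; each retrieved value is split back into songId = floor(value/2^14) and t1 = mod(value, 2^14).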
Parameters in Our Implementation
- Landmarks
  - Sample rate = 8000 Hz; frame size = 1024; overlap = 512
  - Frame rate = 8000/512 = 15.625 frames/sec
  - Landmark rate ≈ 400 LM/sec
- Hash table
  - Hash key size = 2^24 = 16.78M
  - Max song ID = 2^18 = 262K
  - Max start time = 2^14/frameRate = 17.48 minutes
- Our implementation is based on Dan Ellis' work: Robust Landmark-Based Audio Fingerprinting, http://labrosa.ee.columbia.edu/matlab/fingerprint

Structure of Hash Table
- Each hash key indexes a bucket of hash values; a collision happens when landmarks share the same hash key, i.e., the same (f1, Δf, Δt)

Hash Table Lookup
- The query's hash keys (computed from its landmarks) index into the hash table, and the hash values stored there are returned as the retrieved landmarks
(Figure: hash table with keys 0 … 2^24−1; query hash keys such as 8002 and 15007 index buckets of hash values that yield the retrieved landmarks.)

How to Find Query Offset Time?
- The offset time of the query can be derived from each retrieved-and-matched landmark:
  query offset time = database LM start time − query LM start time

Find Matched Landmarks
- Start-time plot for landmarks
  - X axis: start time of the database LM; Y axis: start time of the query LM
  - Query offset time ≈ x − y
- Example: a query LM starting at t = 9.5 sec retrieves 3 LMs from the hash table, but only the one consistent with the query offset time is matched

Find Matched Landmarks (II)
- We can determine the offset time by plotting the histogram of start-time differences (x − y): time-consistent landmarks pile up in a single bin
(Figure: start-time plot and histogram of start-time differences; Avery Wang, 2003)

Matched Landmark Count
- To find the matched (time-consistent) landmark count of a song, collect all retrieved landmarks from that song and histogram their offset times. Example for song 2286:

  Retrieved landmarks:
    Song ID | Offset time | Hash value
    2046    | 6925        | 485890
    2286    | 555         | 485890
    2286    | 795         | 485890
    2286    | 1035        | 485890
    2286    | 2715        | 384751
    2286    | 555         | 384751
    2286    | 556         | 963157
    …       | …           | …

  Histogram of offset time for song 2286:
    Offset time | Count
    555±1       | 18
    795±1       | 1
    1035±1      | 1
    2715±1      | 1

  → Matched landmark count of song 2286 = 18

Final Ranking
- A common way to obtain the final ranking is to sort songs by matched landmark count
- Counts can also be converted into scores between 0 and 100

  Song ID | Matched landmark count | Offset time
  2286    | 18                     | 555±1
  2746    | 13                     | 5002±1
  2255    | 9                      | 1681±1
  2033    | 5                      | 2347±1
  2019    | 4                      | 527±1
  …       | …                      | …

A sketch of the offset-histogram counting above is shown below.
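This MATLAB sketch counts time-consistent matches for one candidate song as described above (an illustrative helper, not part of the toolbox; dbT1 and queryT1 are assumed to be equal-length vectors holding, for each retrieved landmark pair, the database and query start times):

```matlab
% Histogram the start-time differences for one song and take the
% largest bin as the matched landmark count (a minimal sketch).
function [count, offset] = matchedLmCount(dbT1, queryT1)
  d = dbT1(:) - queryT1(:);      % offset implied by each retrieved pair
  u = unique(d);                 % candidate query offsets
  votes = sum(d == u.', 1);      % votes per candidate offset
  [count, i] = max(votes);       % matched landmark count = largest bin
  offset = u(i);                 % estimated query offset time
end
```

The ±1 tolerance used in the tables above could be added by also accumulating the votes of the two neighboring offsets before taking the max.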
Matched Landmarks vs. Noise
(Figure: spectrograms with matched landmarks for the original clip and three noisy versions, Noisy01–Noisy03; frequency index vs. frame index. Run goLmVsNoise.m in the AFP toolbox to recreate this example.)

Optimization Strategies for AFP
- Several ways to optimize AFP
  - Strategy for query landmark extraction
  - Confidence measure
  - Incremental retrieval
  - Better use of the hash table
  - Re-ranking for better performance

Strategy for LM Extraction (1)
- Goal: to trade computation for accuracy
- Steps
  1. Construct a classifier to determine whether a 10-sec query is a "hit" or a "miss"
  2. Increase the landmark count of "miss" queries for better accuracy
- Pipeline: query → classifier → regular LM extraction ("hit") or dense LM extraction ("miss") → AFP engine → retrieved songs

Strategy for LM Extraction (2)
- Classifier construction
  - Training data: "hit" and "miss" queries
  - Classifier: SVM
  - Features: mean volume; standard deviation of volume; standard deviation of the absolute sum of high-order differences; …
- Requirements
  - Fast in evaluation: simple or readily available features, efficient classifier
  - Adaptive: effective threshold for detecting miss queries

Strategy for LM Extraction (3)
- To increase landmarks for "miss" queries
  - Use more time-shifted copies of the query for LM extraction (our test compares 4 shifts vs. 8 shifts)
  - Decay the thresholds more rapidly to reveal more salient peaks
  - …

Strategy for LM Extraction (4)
- Song database: 44.1 kHz, 16 bits; 1500 songs
  - 1000 songs (30 seconds each) from the GTZAN dataset
  - 500 songs (3–5 minutes each) from our own collection of English/Chinese songs
- Datasets: 10-sec clips recorded by mobile phones
  - Training data: 1412 clips (1223:189)
  - Test data: 1062 clips

Strategy for LM Extraction (5)
(Figure: AFP accuracy vs. computing time.)

Confidence Measure (1)
- Confusion matrix:

                  | Predicted No | Predicted Yes
  Groundtruth No  | c00          | c01
  Groundtruth Yes | c10          | c11

- Performance indices
  - False acceptance rate: FAR = c01 / (c00 + c01)
  - False rejection rate: FRR = c10 / (c10 + c11)

Confidence Measure (2)
- Factors for a confidence measure: matched landmark count, landmark count, salient peak count, …
- How to use these factors
  - Take a value of the factor and use it as a threshold
  - Normalize the threshold by dividing it by the query duration
  - Vary the threshold to identify FAR & FRR

Dataset for Confidence Measure
- Song database: 44.1 kHz, 16 bits
  - 1000 songs (30 seconds each) from the GTZAN dataset
  - 16284 songs (3–5 minutes each) from our own collection of English songs
- Datasets: 10-sec clips recorded by mobile phones
  - In the database: 1062 clips
  - Not in the database: 1412 clips

Confidence Measure (3)
(Figure: DET (Detection Error Tradeoff) curve.)
- Accuracy vs. tolerance of matched landmarks (no OOV queries):

  Tolerance | Accuracy
  ±0        | 79.19%
  ±1        | 79.66%
  ±2        | 79.57%

Incremental Retrieval
- Goal: take additional query input if the confidence measure is not high enough
- Implementation issues
  - Use only the forward mode for landmark extraction (no. of landmarks ↗ ⇒ computation time ↗)
  - Use statistics of matched landmarks to restrict the number of extracted landmarks for comparison

Hash Table Optimization
- Possible directions for hash table optimization
  - To increase song capacity: use 20 bits for songId → song capacity = 2^20 = 1M, max start time = 2^12/frameRate = 4.37 minutes; longer songs are split into shorter segments
  - To increase efficiency: apply the 80-20 rule → put the 20% most likely songs in fast memory and the 80% less likely songs in slow memory
  - To avoid collisions: better hashing strategies

Re-ranking for Better Performance
- Features that can be used to re-rank the matched songs
  - Matched landmark count
  - Matched frequency count 1
  - Matched frequency count 2
  - …

Our AFP Engine
- Music database: 260K tracks currently, 1M tracks in the future
- Driving forces
  - Fundamental issues in computer science (hashing, indexing…)
  - Requests from local companies
- Methods: landmarks as features (Shazam's method); speedup by GPU
- Platform: single CPU + 3 GPUs

Specs of Our AFP Engine
- Platform
  - OS: CentOS 6
  - CPU: Intel Xeon X5670, six cores, 2.93 GHz
  - Memory: 96 GB
- Database: please refer to this page.

Experiments
- Accuracy test corpora
  - Database: 2550 tracks
  - Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
- Accuracy vs. duration
  - 5-sec clips: 161/275 = 58.5%
  - 10-sec clips: 121/136 = 89.0%
  - 15-sec clips: 88/90 = 97.8%
  - 20-sec clips: 65/66 = 98.5%
(Figures: computing time vs. duration; accuracy vs. computing time.)

MATLAB Prototype for AFP
- Toolboxes: audio fingerprinting, SAP, utility
- Dataset: Russian songs
- Instructions
  - Download the toolboxes
  - Modify afpOptSet.m (in the audio fingerprinting toolbox) to add the toolbox paths
  - Run goDemo.m

Demos of Audio Fingerprinting
- Commercial apps: Shazam, SoundHound
- Our demo: http://mirlab.org/demo/audioFingerprinting

QBSH vs. AFP

  Aspect     | QBSH                             | AFP
  Goal       | MIR                              | MIR
  Feature    | Pitch (perceptible, small data)  | Landmarks (not perceptible, big data)
  Method     | LS (linear scaling)              | Matched LM count
  Database   | Harder to collect, small storage | Easier to collect, large storage
  Bottleneck | CPU/GPU-bound                    | I/O-bound

Conclusions for AFP
- Conclusions
  - Landmark-based methods are effective
  - Machine learning is indispensable for further improvement
- Future work: scale up
  - Shazam: 15M tracks in database, 6M tags/day
  - Our goal: 1M tracks with a single PC and GPU; 10M tracks with cloud computing over 10 PCs

References (I)
- D. Ellis, "Robust Landmark-Based Audio Fingerprinting", http://labrosa.ee.columbia.edu/matlab/fingerprint/
- A. Wang (Shazam), "An Industrial-Strength Audio Search Algorithm", ISMIR, 2003.
- A. Wang, "The Shazam Music Recognition Service", Comm. ACM 49(8), 44–48, 2006.
- J. Haitsma and T. Kalker (Philips), "A Highly Robust Audio Fingerprinting System", ISMIR, 2002.
- J. Haitsma and T. Kalker, "A Highly Robust Audio Fingerprinting System with an Efficient Search Strategy", J. New Music Research 32(2), 211–221, 2003.

References (II)
- S. Baluja and M. Covell (Google), "Content Fingerprinting Using Wavelets", Proc. CVMP, 2006.
- V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-by-Example Applications", ISMIR, 2011.
- Y. Ke, D. Hoiem, and R. Sukthankar, "Computer Vision for Music Identification", CVPR, 2005.