Audio Fingerprinting
J.-S. Roger Jang (張智星)
jang@mirlab.org
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan University

Intro to Audio Fingerprinting (AFP)
- Goal: identify a noisy version of a given audio clip
- Also known as "query by exact example": no "cover versions"
- Can also be used to align two different-speed audio clips of the same source
  - Dan Ellis used AFP for aligned annotation of the Beatles dataset

AFP Challenges
- Music variations: encoding/compression (e.g., MP3)
- Channel variations: speakers & microphones, room acoustics
- Environmental noise
- Efficiency (6M tags/day for Shazam)
- Database collection (15M tracks for Shazam)

AFP Applications
- Commercial applications of AFP
  - Music identification & purchase
  - Royalty assignment (over radio)
  - TV show or commercial ID (over TV)
  - Copyright violation detection (over the web)
- Major commercial players: Shazam, SoundHound, IntoNow, Viggle…

Company: Shazam
- Facts: first commercial AFP product; since 2002, UK
- Technology: audio fingerprinting
- Founder: Avery Wang (PhD at Stanford, 1994)

Company: SoundHound
- Facts: first product with multi-modal music search; AKA midomi
- Technologies: audio fingerprinting, query by singing/humming, speech recognition
- Founder: Keyvan Mohajer (PhD at Stanford, 2007)

Two Stages in AFP
- Offline: feature extraction; hash table construction for songs in the database (inverted indexing)
- Online: feature extraction; hash table search; ranked list of the retrieved songs/music

Robust Feature Extraction
- Various kinds of features for AFP, all aiming at invariance along time and frequency
  - Landmarks (pairs of local maxima)
  - Wavelets
  - …
- Extensive tests are required to choose the best features

Representative Approaches to AFP
- Philips: J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR, 2002
- Shazam: A. Wang, "An industrial-strength audio search algorithm", ISMIR, 2003
- Google: S. Baluja and M. Covell, "Content fingerprinting using wavelets", Euro. Conf. on Visual Media Production, 2006
- V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR, 2011

Philips: Thresholding as Features
- Observation: for a magnitude spectrum S(t, f), the sign of energy differences is robust to various operations
  - Lossy encoding, range compression, added noise
- Fingerprint F(t, f):
  F(t, f) = 1 if S(t, f) − S(t, f+1) − (S(t+1, f) − S(t+1, f+1)) > 0
  F(t, f) = 0 otherwise
  (Source: Dan Ellis)

Philips: Thresholding as Features (II)
- Robust to low-bitrate MP3 encoding: the fingerprint after MP3 encoding differs only slightly from the original (BER = 0.078)
- Sensitive to "frame time difference", so the hop size is kept small
- A minimal code sketch of this fingerprint follows.
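The sketch below implements the sign-of-energy-differences fingerprint described above in MATLAB (a minimal sketch, not the Philips production code; the matrix S and the row/column layout are assumptions, with frames as rows and frequency bands as columns):

```matlab
% Philips-style binary fingerprint: the energy difference along
% frequency, differenced again along time, keeping only the sign
% (a minimal sketch; S is assumed to be an nFrame-by-nBand
% band-energy matrix).
function F = philipsFingerprint(S)
  dFreq = S(:, 1:end-1) - S(:, 2:end);          % difference along frequency
  dTime = dFreq(1:end-1, :) - dFreq(2:end, :);  % difference of differences along time
  F = dTime > 0;                                % keep only the sign bits
end
```

Two fingerprints of the same size can then be compared by their bit error rate, e.g. ber = mean(F1(:) ~= F2(:)), in the spirit of the BER = 0.078 comparison above.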
Philips: Robustness of Features
- BER of the features after various operations is generally low
- High only for speed and time-scale changes, which are unlikely to occur under query by example

Philips: Search Strategies
- Via hashing
- Inverted indexing

Shazam's Method
- Take advantage of local music structure
  - Find salient peaks on the spectrogram
  - Pair peaks to form landmarks for comparison
- Efficient search by hash tables
  - Use positions of landmarks as hash keys
  - Use song ID and offset time as hash values
  - Use time constraints to find matched landmarks

Database Preparation
- Compute the spectrogram; perform mean subtraction & high-pass filtering
- Detect salient peaks
  - Find an initial threshold
  - Update the threshold along time
- Pair salient peaks to form landmarks
  - Define a target zone
  - Form landmarks and save them to a hash table

Query Match
- Identify landmarks
- Find matched landmarks
  - Retrieve landmarks from the hash table
  - Keep only time-consistent landmarks
- Rank the database items
  - Via matched landmark counts
  - Via other confidence measures

Shazam: Landmarks as Features
- Pair each salient peak of the spectrogram with the peaks in its target zone to form landmarks
- Landmark: [t1, f1, t2, f2]
- 24-bit hash key: f1 (9 bits), Δf = f2 − f1 (8 bits), Δt = t2 − t1 (7 bits)
- Hash value: song ID and the landmark's start time t1
(Figure: salient peaks of a spectrogram; Avery Wang, 2003)

How to Find Salient Peaks
- We need peaks that are salient along both the frequency and time axes
  - Frequency axis: Gaussian local smoothing
  - Time axis: decaying threshold over time

How to Find Initial Threshold?
- Goal: to suppress neighboring peaks
- Ideas
  - Find the local maxima of the magnitude spectra of the initial 10 frames
  - Superimpose a Gaussian on each local maximum
  - Take the max of all the Gaussians as the initial threshold
(Figure: envelope generation example based on "Bad Romance", produced by envelopeGen.m; curves show the original signal, the positive local maxima, and the final output.)

How to Update the Threshold along Time?
- Decay the threshold
- Find local maxima larger than the threshold → salient peaks
- Define the new threshold as the max of the old threshold and the Gaussians passing through the newly found local maxima

How to Control the No. of Salient Peaks?
- To decrease the number of salient peaks
  - Perform forward and backward sweeps to find peaks that are salient along both directions
  - Use Gaussians with a larger standard deviation
  - …

Time-decaying Thresholds
(Figure: forward-pass and backward-pass peak picking with time-decaying thresholds, frequency index vs. frame index; produced by landmarkFind01.m.)
A sketch of the forward pass is shown below.
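The following is a minimal MATLAB sketch of the forward-pass procedure just described, not the actual landmarkFind01.m; the parameters sigma and decay, the function name, and the exact way the initial envelope is built from the first 10 frames are illustrative assumptions:

```matlab
% Forward-pass salient peak picking with a time-decaying threshold
% (a sketch; S is assumed to be an nBand-by-nFrame magnitude
% spectrogram, and sigma/decay are illustrative parameters).
function peaks = salientPeakFind(S)
  [nBand, nFrame] = size(S);
  sigma = 8;                                   % assumed Gaussian width
  decay = 0.998;                               % assumed per-frame decay
  g = exp(-((1:nBand)' - (1:nBand)).^2 / (2*sigma^2));  % Gaussian centered at each band
  thr = max(S(:, 1:min(10, nFrame)), [], 2);   % band-wise max over the first 10 frames
  thr = max(g .* thr', [], 2);                 % superimpose Gaussians -> initial threshold
  peaks = zeros(0, 2);                         % rows: [frameIndex, bandIndex]
  for t = 1:nFrame
    s = S(:, t);
    isMax = [false; s(2:end-1) > s(1:end-2) & s(2:end-1) >= s(3:end); false];
    idx = find(isMax & s > thr);               % local maxima above the threshold
    peaks = [peaks; repmat(t, numel(idx), 1), idx]; %#ok<AGROW>
    thr = thr * decay;                         % decay the threshold...
    for k = idx.'                              % ...then raise it at each new peak
      thr = max(thr, s(k) * g(:, k));
    end
  end
end
```

The backward pass is the same loop run over the frames in reverse order; intersecting the two peak sets keeps only peaks that are salient in both directions, as in the slide above.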
How to Pair Salient Peaks?
- A target zone is created right after each salient peak
- The leading peak is paired with each peak in its target zone to form landmarks
- Each landmark is denoted by [t1, f1, t2, f2]

Salient Peaks and Landmarks
(Figure: peak picking after forward smoothing, with matched landmarks shown in green. Source: Dan Ellis)

Time Skew
- The query's frame boundaries are generally out of sync with the reference's frame boundaries, so the two frame grids are time-skewed
- Possible solutions: increase the frame size, or repeat landmark extraction at several time shifts
(Figure: misaligned reference and query frame grids illustrating time skew.)

To Avoid Time Skew
- Query landmarks are extracted at various time shifts
- Example with 4 shifts of step = hop/4: LM set 1, LM set 2, LM set 3, LM set 4 → union & unique → query landmark set

Landmarks for Hash Table Access
- Convert each landmark to a hash key (and value)
- Landmark from the database (songId, [t1, f1, t2, f2]) → hash table creation
  - 24-bit hash key = f1 (9 bits) + Δf (8 bits) + Δt (7 bits)
  - 32-bit hash value = songId (18 bits) + t1 (14 bits)
- Landmark from the query → hash table lookup
  - Use f1, Δf, and Δt to generate the hash key for hash table lookup
  - Use t1 to find matched (time-consistent) landmarks
- A sketch of this bit packing is shown below.
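A minimal MATLAB sketch of the key/value packing above follows (the function name is illustrative; the fields are assumed to be non-negative integers within their bit budgets, differences are wrapped with mod, and range checks are omitted):

```matlab
% Pack a landmark into the 24-bit key / 32-bit value layout above
% (a sketch; assumes f1 < 2^9, songId < 2^18, t1 < 2^14).
function [key, value] = lm2hash(songId, t1, f1, t2, f2)
  df = mod(f2 - f1, 2^8);                % wrap the frequency difference into 8 bits
  dt = mod(t2 - t1, 2^7);                % wrap the time difference into 7 bits
  key   = f1*2^15 + df*2^7 + dt;         % f1 (9 bits) | df (8 bits) | dt (7 bits)
  value = songId*2^14 + mod(t1, 2^14);   % songId (18 bits) | t1 (14 bits)
end
```

At query time the same key computation is applied to each query landmark; each retrieved value is split back into songId = floor(value/2^14) and t1 = mod(value, 2^14).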
Parameters in Our Implementation
- Landmarks
  - Sample rate = 8000 Hz; frame size = 1024; overlap = 512
  - Frame rate = 8000/512 = 15.625 frames/sec
  - Landmark rate ≈ 400 LM/sec
- Hash table
  - Hash key size = 2^24 = 16.78M
  - Max song ID = 2^18 = 262K
  - Max start time = 2^14/frameRate = 17.48 minutes
- Our implementation is based on Dan Ellis' work: Robust Landmark-Based Audio Fingerprinting, http://labrosa.ee.columbia.edu/matlab/fingerprint

Structure of Hash Table
- Each hash key indexes a bucket of hash values; a collision happens when landmarks share the same hash key, i.e., the same (f1, Δf, Δt)

Hash Table Lookup
- The query's hash keys (computed from its landmarks) index into the hash table, and the hash values stored there are returned as the retrieved landmarks
(Figure: hash table with keys 0 … 2^24−1; query hash keys such as 8002 and 15007 index buckets of hash values that yield the retrieved landmarks.)

How to Find Query Offset Time?
- The offset time of the query can be derived from each retrieved-and-matched landmark:
  query offset time = database LM start time − query LM start time

Find Matched Landmarks
- Start-time plot for landmarks
  - X axis: start time of the database LM; Y axis: start time of the query LM
  - Query offset time ≈ x − y
- Example: a query LM starting at t = 9.5 sec retrieves 3 LMs from the hash table, but only the one consistent with the query offset time is matched

Find Matched Landmarks (II)
- We can determine the offset time by plotting the histogram of start-time differences (x − y): time-consistent landmarks pile up in a single bin
(Figure: start-time plot and histogram of start-time differences; Avery Wang, 2003)

Matched Landmark Count
- To find the matched (time-consistent) landmark count of a song, collect all retrieved landmarks from that song and histogram their offset times. Example for song 2286:

  Retrieved landmarks:
    Song ID | Offset time | Hash value
    2046    | 6925        | 485890
    2286    | 555         | 485890
    2286    | 795         | 485890
    2286    | 1035        | 485890
    2286    | 2715        | 384751
    2286    | 555         | 384751
    2286    | 556         | 963157
    …       | …           | …

  Histogram of offset time for song 2286:
    Offset time | Count
    555±1       | 18
    795±1       | 1
    1035±1      | 1
    2715±1      | 1

  → Matched landmark count of song 2286 = 18

Final Ranking
- A common way to obtain the final ranking is to sort songs by matched landmark count
- Counts can also be converted into scores between 0 and 100

  Song ID | Matched landmark count | Offset time
  2286    | 18                     | 555±1
  2746    | 13                     | 5002±1
  2255    | 9                      | 1681±1
  2033    | 5                      | 2347±1
  2019    | 4                      | 527±1
  …       | …                      | …

A sketch of the offset-histogram counting above is shown below.
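This MATLAB sketch counts time-consistent matches for one candidate song as described above (an illustrative helper, not part of the toolbox; dbT1 and queryT1 are assumed to be equal-length vectors holding, for each retrieved landmark pair, the database and query start times):

```matlab
% Histogram the start-time differences for one song and take the
% largest bin as the matched landmark count (a minimal sketch).
function [count, offset] = matchedLmCount(dbT1, queryT1)
  d = dbT1(:) - queryT1(:);      % offset implied by each retrieved pair
  u = unique(d);                 % candidate query offsets
  votes = sum(d == u.', 1);      % votes per candidate offset
  [count, i] = max(votes);       % matched landmark count = largest bin
  offset = u(i);                 % estimated query offset time
end
```

The ±1 tolerance used in the tables above could be added by also accumulating the votes of the two neighboring offsets before taking the max.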
Matched Landmarks vs. Noise
(Figure: spectrograms with matched landmarks for the original clip and three noisy versions, Noisy01–Noisy03; frequency index vs. frame index. Run goLmVsNoise.m in the AFP toolbox to recreate this example.)

Optimization Strategies for AFP
- Several ways to optimize AFP
  - Strategy for query landmark extraction
  - Confidence measure
  - Incremental retrieval
  - Better use of the hash table
  - Re-ranking for better performance

Strategy for LM Extraction (1)
- Goal: to trade computation for accuracy
- Steps
  1. Construct a classifier to determine whether a 10-sec query is a "hit" or a "miss"
  2. Increase the landmark count of "miss" queries for better accuracy
- Pipeline: query → classifier → regular LM extraction ("hit") or dense LM extraction ("miss") → AFP engine → retrieved songs

Strategy for LM Extraction (2)
- Classifier construction
  - Training data: "hit" and "miss" queries
  - Classifier: SVM
  - Features: mean volume; standard deviation of volume; standard deviation of the absolute sum of high-order differences; …
- Requirements
  - Fast in evaluation: simple or readily available features, efficient classifier
  - Adaptive: effective threshold for detecting miss queries

Strategy for LM Extraction (3)
- To increase landmarks for "miss" queries
  - Use more time-shifted copies of the query for LM extraction (our test compares 4 shifts vs. 8 shifts)
  - Decay the thresholds more rapidly to reveal more salient peaks
  - …

Strategy for LM Extraction (4)
- Song database: 44.1 kHz, 16 bits; 1500 songs
  - 1000 songs (30 seconds each) from the GTZAN dataset
  - 500 songs (3–5 minutes each) from our own collection of English/Chinese songs
- Datasets: 10-sec clips recorded by mobile phones
  - Training data: 1412 clips (1223:189)
  - Test data: 1062 clips

Strategy for LM Extraction (5)
(Figure: AFP accuracy vs. computing time.)

Confidence Measure (1)
- Confusion matrix:

                  | Predicted No | Predicted Yes
  Groundtruth No  | c00          | c01
  Groundtruth Yes | c10          | c11

- Performance indices
  - False acceptance rate: FAR = c01 / (c00 + c01)
  - False rejection rate: FRR = c10 / (c10 + c11)

Confidence Measure (2)
- Factors for a confidence measure: matched landmark count, landmark count, salient peak count, …
- How to use these factors
  - Take a value of the factor and use it as a threshold
  - Normalize the threshold by dividing it by the query duration
  - Vary the threshold to identify FAR & FRR

Dataset for Confidence Measure
- Song database: 44.1 kHz, 16 bits
  - 1000 songs (30 seconds each) from the GTZAN dataset
  - 16284 songs (3–5 minutes each) from our own collection of English songs
- Datasets: 10-sec clips recorded by mobile phones
  - In the database: 1062 clips
  - Not in the database: 1412 clips

Confidence Measure (3)
(Figure: DET (Detection Error Tradeoff) curve.)
- Accuracy vs. tolerance of matched landmarks (no OOV queries):

  Tolerance | Accuracy
  ±0        | 79.19%
  ±1        | 79.66%
  ±2        | 79.57%

Incremental Retrieval
- Goal: take additional query input if the confidence measure is not high enough
- Implementation issues
  - Use only the forward mode for landmark extraction (no. of landmarks ↗ ⇒ computation time ↗)
  - Use statistics of matched landmarks to restrict the number of extracted landmarks for comparison

Hash Table Optimization
- Possible directions for hash table optimization
  - To increase song capacity: use 20 bits for songId → song capacity = 2^20 = 1M, max start time = 2^12/frameRate = 4.37 minutes; longer songs are split into shorter segments
  - To increase efficiency: apply the 80-20 rule → put the 20% most likely songs in fast memory and the 80% less likely songs in slow memory
  - To avoid collisions: better hashing strategies

Re-ranking for Better Performance
- Features that can be used to re-rank the matched songs
  - Matched landmark count
  - Matched frequency count 1
  - Matched frequency count 2
  - …

Our AFP Engine
- Music database: 260K tracks currently, 1M tracks in the future
- Driving forces
  - Fundamental issues in computer science (hashing, indexing…)
  - Requests from local companies
- Methods: landmarks as features (Shazam's method); speedup by GPU
- Platform: single CPU + 3 GPUs

Specs of Our AFP Engine
- Platform
  - OS: CentOS 6
  - CPU: Intel Xeon X5670, six cores, 2.93 GHz
  - Memory: 96 GB
- Database: please refer to this page.

Experiments
- Accuracy test corpora
  - Database: 2550 tracks
  - Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
- Accuracy vs. duration
  - 5-sec clips: 161/275 = 58.5%
  - 10-sec clips: 121/136 = 89.0%
  - 15-sec clips: 88/90 = 97.8%
  - 20-sec clips: 65/66 = 98.5%
(Figures: computing time vs. duration; accuracy vs. computing time.)

MATLAB Prototype for AFP
- Toolboxes: audio fingerprinting, SAP, utility
- Dataset: Russian songs
- Instructions
  - Download the toolboxes
  - Modify afpOptSet.m (in the audio fingerprinting toolbox) to add the toolbox paths
  - Run goDemo.m

Demos of Audio Fingerprinting
- Commercial apps: Shazam, SoundHound
- Our demo: http://mirlab.org/demo/audioFingerprinting

QBSH vs. AFP

  Aspect     | QBSH                             | AFP
  Goal       | MIR                              | MIR
  Feature    | Pitch (perceptible, small data)  | Landmarks (not perceptible, big data)
  Method     | LS (linear scaling)              | Matched LM count
  Database   | Harder to collect, small storage | Easier to collect, large storage
  Bottleneck | CPU/GPU-bound                    | I/O-bound

Conclusions for AFP
- Conclusions
  - Landmark-based methods are effective
  - Machine learning is indispensable for further improvement
- Future work: scale up
  - Shazam: 15M tracks in database, 6M tags/day
  - Our goal: 1M tracks with a single PC and GPU; 10M tracks with cloud computing over 10 PCs

References (I)
- D. Ellis, "Robust Landmark-Based Audio Fingerprinting", http://labrosa.ee.columbia.edu/matlab/fingerprint/
- A. Wang (Shazam), "An Industrial-Strength Audio Search Algorithm", ISMIR, 2003.
- A. Wang, "The Shazam Music Recognition Service", Comm. ACM 49(8), 44–48, 2006.
- J. Haitsma and T. Kalker (Philips), "A Highly Robust Audio Fingerprinting System", ISMIR, 2002.
- J. Haitsma and T. Kalker, "A Highly Robust Audio Fingerprinting System with an Efficient Search Strategy", J. New Music Research 32(2), 211–221, 2003.

References (II)
- S. Baluja and M. Covell (Google), "Content Fingerprinting Using Wavelets", Proc. CVMP, 2006.
- V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-by-Example Applications", ISMIR, 2011.
- Y. Ke, D. Hoiem, and R. Sukthankar, "Computer Vision for Music Identification", CVPR, 2005.