Slides

advertisement
Audio Fingerprinting
J.-S. Roger Jang (張智星)
jang@mirlab.org
http://mirlab.org/jang
MIR Lab, CSIE Dept.
National Taiwan University
1
Intro to Audio Fingerprinting (AFP)
Goal
Identify a noisy version of a given audio clips
Also known as…
“Query by exact example”  no “cover versions”
Can also be used to…
Align two different-speed audio clips of the same
source
Dan Ellis used AFP for aligned annotation on Beatles
dataset
AFP Challenges
Music variations
Encoding/compression (MP3 encoding, etc)
Channel variations
Speakers & microphones, room acoustics
Environmental noise
Efficiency (6M tags/day for Shazam)
Database collection (15M tracks for Shazam)
AFP Applications
Commercial applications of AFP
Music identification & purchase
Royalty assignment (over radio)
TV shows or commercials ID (over TV)
Copyright violation: Youtube
Major commercial players
Shazam, Soundhound, Intonow, Viggle…
Company: Shazam
Facts
First commercial product of AFP
Since 2002, UK
Technology
Audio fingerprinting
Founder
Avery Wang (PhD at Standard, 1994)
Company: Soundhound
Facts
First product with multi-modal music search
AKA: midomi
Technologies
Audio fingerprinting
Query by singing/humming
Speech recognition
Multi-modal retrieval
Founder
Keyvan Mohajer (PhD at Stanford, 2007)
Two Stages in AFP
Offline
Feature extraction
Hash table construction
for songs in database
Inverted indexing
Online
Feature extraction
Hash table search
Ranked list of the
retrieved songs/music
Robust Feature Extraction
Various kinds of features for AFP
Invariance along time and frequency
Landmark of a pair of local maxima
Wavelets
…
Extensive test required for choosing the best
features
Representative Approaches to AFP
Philips
J. Haitsma and T.
Kalker, “A highly robust
audio fingerprinting
system”, ISMIR 2002.
Shazam
A.Wang, “An industrialstrength audio search
algorithm”, ISMIR 2003
Google
S. Baluja and M. Covell,
“Content fingerprinting using
wavelets”, Euro. Conf. on
Visual Media Production,
2006.
V. Chandrasekhar, M. Sharifi,
and D. A. Ross, “Survey and
evaluation of audio
fingerprinting schemes for
mobile query-by-example
applications”, ISMIR 2011
Philips: Thresholding as Features
Observation
Magnitude spectrum S(t, f)
The sign of energy
differences is robust to
various operations
Lossy encoding
Range compression
Added noise
Thresholding as Features
F (t , f )  1, if
S (t , f )  S (t  1, f  1) 
S (t , f  1)  S (t  1, f )
Fingerprint F(t, f)
(Source: Dan Ellis)
Philips: Thresholding as Features (II)
Robust to low-bitrate
MP3 encoding (see the
right)
Sensitive to “frame time
difference”  Hop size
is kept small!
Original
fingerprinting
Fingerprinting after
MP3 encoding
BER=0.078
Philips: Robustness of Features
BER of the features
after various operations
General low
High for speed and timescale changes (which is
not likely to occur under
query by example)
Philips: Search Strategies
Via hashing
Inverted indexing
Shazam’s Method
Ideas
Take advantage of music local structures
Find salient peaks on spectrogram
Pair peaks to form landmarks for comparison
Efficient search by hash tables
Use positions of landmarks (excluding t1) as hash keys
Use song ID and start time as hash values
Use time constraints to find matched landmarks
Database Preparation
Compute spectrogram
Perform mean subtraction & high-pass filtering
Detect salient peaks
Find initial threshold
Update the threshold along time
Pair salient peaks to form landmarks
Define target zone
Form landmarks and save them to a hash table
Query Match
Identify landmarks
Find matched landmarks
Retrieve landmarks from the hash table
Keep only time-consistent landmarks
Rank the database items
Via matched landmark counts
Via other confidence measures
Shazam: Landmarks as Features
Pair peaks in
target zone to
form landmarks
Spectrogram
• Landmark: [t1, f1, t2, f2]
• 24-bit hash key:
• f1: 9 bits
• Δf = f2-f1: 8 bits
• Δt = t2-t1: 7 bits
• Hash value:
• Song ID
• Landmark’s
start time t1
Salient peaks of
spectrogram
(Avery Wang, 2003)
How to Find Salient Peaks
We need to find peaks that are salient along
both frequency and time axes
Frequency axis: Gaussian local smoothing
Time axis: Decaying threshold over time
How to Find Initial Threshold?
Goal
Example
To suppress neighboring
peaks
Ideas
Find the local max. of
mag. spectra of initial 10
frames
Superimpose a Gaussian
on each local max.
Find the max. of all
Gaussians
Based on Bad Romance
envelopeGen.m
4
Original signal
Positive local maxima
Final output
3.5
3
2.5
2
1.5
1
0.5
0
50
100
150
200
250
How to Update the Threshold along Time?
Decay the threshold
Find local maxima larger
than the threshold 
salient peaks
Define the new threshold
as the max of the old
threshold and the
Gaussians passing
through the active local
maxima
How to Control the No. of Salient peaks?
To decrease the no. of salient peaks
Perform forward and backward sweep to find
salient peaks along both directions
Use Gaussians with larger standard deviation
…
Time-decaying Thresholds
landmarkFind01.m
Forward pass
250
5
200
Forward:
Freq index
4
150
3
100
2
50
1
200
400
600
Frame index
800
1000
1200
Backward pass
250
5
Backward:
Freq index
200
4
150
3
100
2
50
1
200
400
600
Frame index
800
1000
1200
How to Pair Salient Peaks?
Target zone
A target zone is created
right following each
salient peaks
The leading peak are
paired with each peak in
the target zone to form
landmarks.
Each landmark is
denoted by [t1, f1, t2, f2].
Peak 1
Peak 2
Salient Peaks and Landmarks
Peak picking after
forward smoothing
Matched landmarks (green)
(Source: Dan Ellis)
Time Skew
Out of sync of frame
boundary
time
Reference
frames
Query
frames
1
Increase frame size
Repeated LM extraction
1
2
1
Solution
3
2
4
2
1
3
2
4
3
4
1
time skew!
3
5
2
5
4
3
4
1
2
3
4
1
2
3
4
1
2
3
4
To Avoid Time Skew
To avoid time skew, query landmarks are
extracted at various time shifts
Example of 4 shifts of step = hop/4
LM set 1
LM set 2
LM set 3
LM set 4
Union &
unique
Query
landmark set
Landmarks for Hash Table Access
Convert each landmark to hash key/value
Landmark from the database  hash table creation
songId, [t1, f1, t2, f2]
24-bit hash key = f1 (9 bits) + ∆f (8 bits) + ∆t (7 bits)
32-bit hash value = songId (18 bits) + t1 (14 bits)
Landmark from the query  hash table lookup
Use f1, t2, f2 to generate hash key for hash table lookup
Use t1 (start time) to find matched (time-consistent)
landmarks
Parameters in Our Implementation
Landmarks
Sample rate = 8000 Hz
Frame size = 1024
Overlap = 512
Frame rate = 15.625
Landmark rate = ~400
LM/sec
Hash table
Hash key size = 2^24 =
16.78M
Max song ID = 2^18 =
262 K
Max start time =
2^14/frameRate = 17.48
minutes
Our implementation is based on Dan Ellis’ work:
Robust Landmark-Based Audio Fingerprinting, http://labrosa.ee.columbia.edu/matlab/fingerprint
Structure of Hash Table
Collision happens when LMs have the same [f1, t2, f2]
Hash Table Lookup
Query (hash keys from landmarks)
Hash table
…
…
Generated from f1, t2, f2
…
8002
15007
0
1
9753
1432
1232
41
10002
19662
653
677
1461
438
142
486
997
73
1977
…
8002
65
… 15007 … 224-2
224-1
Hash
keys
436
Hash
values
Retrieved
landmarks
How to Find Query Offset Time?
Offset time of query can be derived by…
A landmark is Represented
by its first peak
Database LM start time
Retrieved and
matched LM
Retrieved
Database
landmarks
Query
landmarks
Query offset time
Query LM start time
Find Matched Landmarks
Start time plot for landmarks
A given LM starting at
9.5 sec retrieves 3 LMs
X axis: start time of database LMs
in the hash table
Y axis: start time of query LMs
Query offset time = x-y
t=9.5 sec
Query offset time
But only this
one is matched!
Find Matched Landmarks
We can determine the offset time by plotting
histogram of start time difference (x-y):
Start time plot
Histogram of
start time
difference
(x-y)
(Avery Wang, 2003)
Matched Landmark Count
To find matched (time-consistent) landmark count of a
song:
All retrieved landmarks
LM from
the same
song 2286
Song ID
Offset time
Hash value
2046
6925
485890
2286
555
485890
2286
795
485890
2286
1035
485890
2286
2715
384751
2286
555
384751
2286
556
963157
…
…
…
Histogram of
Offset time of a song
Offset
time
Count
18
1
1
1
…
…
Matched
landmark
count of
song 2286
Final Ranking
A common way to have the final ranking
Based on each song’s matched landmark count
Can also be converted into scores between 0~100
Song ID
Matched
landmark
count
2286
18
2746
13
2255
9
2033
5
2019
4
…
…
Offset time
…
Freq index
Matched Landmarks vs. Noise
250
200
150
100
50
0
-5
-10
Freq index
200
Freq index
600
Frame index
800
1000
1200
250
200
150
100
50
0
-5
200
400
600
Frame index
800
1000
1200
250
200
150
100
50
400
600
Frame index
800
1000
400
600
Frame index
800
1000
Noisy02
1200
250
200
150
100
50
200
Noisy01
-10
4
2
0
-2
-4
200
Freq index
400
Original
1200
4
2
0
-2
-4
-6
Noisy03
Run goLmVsNoise.m in AFP toolbox to create this example.
Optimization Strategies for AFP
Several ways to optimize AFP
Strategy for query landmark extraction
Confidence measure
Incremental retrieval
Better use of the hash table
Re-ranking for better performance
Strategy for LM Extraction (1)
10-sec
query
Goal
To trade computation for
accuracy
Steps:
1. Use a classifier to predict if
a query is likely to be a
“hit” or a “miss”.
2. Increase the landmark
counts of “miss” queries
for better accuracy
“hit”
Classifier
Regular LM
extraction
“miss”
Dense LM
extraction
AFP engine
Retrieved
songs
Strategy for LM Extraction (2)
Classifier construction
Training data: “hit” and
“miss” queries
Classifier: SVM
Features
mean volume
standard deviation of
volume
standard deviation of
absolute sum of highorder difference
…
Requirement
Fast in evaluation
Simple or readily
available features
Efficient classifier
Adaptive
Effective threshold for
detecting miss queries
Strategy for LM Extraction (3)
To increase landmarks for “miss” queries
Use more time-shifted query for LM extraction
Our test takes 4 shifts vs. 8 shifts
Decay the thresholds more rapidly to reveal more
salient peaks
…
Strategy for LM Extraction (4)
Song database
44.1kHz, 16-bits
1500 songs
1000 songs (30 seconds)
from GTZAN dataset
500 songs (3~5 minutes)
from our own collection
of English/Chinese songs
Datasets
10-sec clips recorded by
mobile phones
Training data
1412 clips (1223:189)
Test data
1062 clips
Strategy for LM Extraction (5)
AFP accuracy vs. computing time
Confidence Measure (1)
Confusion matrix
Performance indices
False acceptance rate
Predicted
No
Groundtruth
No
Yes
FAR 
Yes
c01
c00  c01
False rejection rate
C00
C01
C10
C11
FRR 
c10
c10  c11
Confidence Measure (2)
Factors for confidence
measure
Matched landmark count
Landmark count
Salient peak count
…
How to use these
factors
Take a value of the
factor and used it as a
threshold
Normalize the threshold
by dividing it by query
duration
Vary the threshold to
identify FAR & FRR
Dataset for Confidence Measure
Song database
44.1kHz, 16-bits
1500 songs
1000 songs (30 seconds)
from GTZAN dataset
16284 songs (3~5
minutes) from our own
collection of English
songs
Datasets
10-sec clips recorded by
mobile phones
In the database
1062 clips
Not in the database
1412 clips
Confidence Measure (3)
DET (Detection Error
Tradeoff) Curve
Accuracy vs. tolerance
No OOV queries
Toleranace of
matched
landmarks
Accuracy
79.19%
79.66%
79.57%
Incremental Retrieval
Goal
Take additional query input if the confidence
measure is not high enough
Implementation issues
Use only forward mode for landmark extraction
 no. of landmarks ↗  computation time ↗
Use statistics of matched landmarks to restricted
the number of extracted landmarks for comparison
Hash Table Optimization
Possible directions for hash table optimization
To increase song capacity  20 bits for songId
Song capacity = 2^20 = 1 M
Max start time = 2^12/frameRate = 4.37 minutes 
Longer songs are split into shorter segments
To increase efficiency  80-20 rule
Put 20% of the most likely songs to fast memory
Put 80% of the lese likely songs to slow memory
To avoid collision  better hashing strategies
Re-ranking for Better Performance
Features that can be used to rank the matched
songs
Matched landmark count
Matched frequency count 1
Matched frequency count 2
…
Our AFP Engine
Music database
260k tracks currently
1M tracks in the future
Driving forces
Fundamental issues in
computer science
(hashing, indexing…)
Requests from local
companies
Methods
Landmarks as feature
(Shazam’s method)
Speedup by GPU
Platform
Single CPU + 3 GPUs
Specs of Our AFP Engine
Platform
OS: CentOS 6
CPU: Intel Xeon x5670 six cores 2.93GHz
Memory: 96GB
Database
Please refer to this page.
Experiments
Accuracy test
Corpora
Database: 2550 tracks
Test files: 5 mobilerecorded songs chopped
into segments of 5, 10,
15, and 20 seconds
Accuracy vs. duration
5-sec clips: 161/275=58.6%
10-sec clips: 121/136=89.0%
15-sec clips: 88/90=97.8%
20-sec clips: 65/66=98.5%
Computing time. vs. duration
Accuracy vs. computing time
MATLAB Prototype for AFP
Toolboxes
Audio fingerprinting
SAP
Utility
Dataset
Russian songs
Instruction
Download the toolboxes
Modify afpOptSet.m (in
the audio fingerprinting
toolbox) to add toolbox
paths
Run goDemo.m.
Demos of Audio Fingerprinting
Commercial apps
Shazam
Soundhound
Our demo
http://mirlab.org/demo/audioFingerprinting
QBSH vs. AFP
QBSH
Goal: MIR
Feature: Pitch
Perceptible
Small data size
Method: LS
Database
Harder to collect
Small storage
Bottleneck
CPU/GPU-bound
AFP
Goal: MIR
Features: Landmarks
Not perceptible
Big data size
Method: Matched LM
Database
Easier to collect
Large storage
Bottleneck
I/O-bound
Conclusions For AFP
Conclusions
Landmark-based methods are effective
Machine learning is indispensable for further
improvement.
Future work: Scale up
Shazam: 15M tracks in database, 6M tags/day
Our goal:
1M tracks with a single PC and GPU
10M tracks with cloud computing of 10 PC
References (I)
 Robust Landmark-Based Audio Fingerprinting, Dan Ellis,
http://labrosa.ee.columbia.edu/matlab/fingerprint/
 Avery Wang (Shazam)
“An Industrial-Strength Audio Search Algorithm”, ISMIR, 2003
“The Shazam music recognition service”, Comm. ACM 49(8), 44-48,,
2006..
 J. Haitsma and T. Kalker (Phlillips)
“A highly robust audio fingerprinting system”, ISMIR, 2002
“A highly robust audio fingerprinting system with an efficient search
strategy,” J. New Music Research 32(2), 211-221, 2003.
References (II)
 Google:
“Content Fingerprinting Using Wavelets”, Baluja, Covell., Proc.
CVMP , 2006
“Survey and Evaluation of Audio Fingerprinting Schemes for Mobile
Query-by-Example Applications”, Vijay Chandrasekhar, Matt Sharifi,
David A. Ross, ISMIR, 2011
 “Computer Vision for Music Identification”, Y. Ke, D.
Hoiem, and R. Sukthankar, CVPR, 2005
Download