ELEN E4896 MUSIC SIGNAL PROCESSING
Lecture 13: Audio Fingerprinting
1. The Fingerprinting Problem
2. Frame-Based Approach
3. Landmark Approach
Dan Ellis
Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu
http://www.ee.columbia.edu/~dpwe/e4896/
2013-04-22
1. The Fingerprinting Problem
• Audio Fingerprinting: known-item search for the exact same performance
  (no “cover versions”), despite differences in audio channel, encoding, noise, etc.
• Applications
media monitoring
metadata reconciliation
“what’s that song?”
A Simple Fingerprint
• “Fingerprint” is a compact record sufficient to uniquely identify an example
  difficulty depends on item density, noise
  [Figure: example items as points in a 3-D feature space (axes x1, x2, ...)]
hash functions?
Fingerprinting Challenges
• Immunity to channel (speaker/mic), added noise
  the “coffeeshop” problem
  [Figure: two spectrograms, freq / kHz vs. time / sec (0 to 5 s)]
• Recognize fragments from anywhere in the track
  the shorter the better
• Large corpus of reference items
• False alarm vs. false reject
2. Frame-Based Approaches
• Standard audio-processing paradigm (a minimal sketch follows below)
  chop up the waveform into frames
  each frame → feature vector
  match on a sequence of feature vectors
  [Diagram: Reference items → Analysis → DB; Query → Analysis → Match (against DB)]
• Challenges
make the features invariant to channel variations
make features insensitive to timing skew / offset
computational efficiency
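To make the frame-based paradigm concrete, here is a minimal sketch in Python (numpy assumed). The frame length, hop, band count, and log-band-energy feature are illustrative choices roughly in the spirit of Haitsma & Kalker 2003, not parameters stated on this slide.

import numpy as np

def frame_features(x, sr, frame_sec=0.37, hop_sec=0.0116, n_bands=33):
    """Chop waveform x (sampled at sr) into long, heavily overlapped frames and
    return one coarse feature vector (log band energies) per frame."""
    frame_len = int(frame_sec * sr)
    hop = int(hop_sec * sr)
    win = np.hanning(frame_len)
    n_frames = max(0, (len(x) - frame_len) // hop + 1)
    # log-spaced band edges in Hz; the 300 Hz - 2 kHz range is an assumption
    edges = np.logspace(np.log10(300.0), np.log10(2000.0), n_bands + 1)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    feats = np.zeros((n_frames, n_bands))
    for t in range(n_frames):
        spec = np.abs(np.fft.rfft(win * x[t * hop : t * hop + frame_len])) ** 2
        for b in range(n_bands):
            in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
            feats[t, b] = np.log(spec[in_band].sum() + 1e-10)
    return feats  # shape (n_frames, n_bands): the sequence matched against the DB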
Channel Immunity
Haitsma & Kalker 2003
• Audio matching should be invariant to
lossy encoding (low-bitrate MP3)
dynamic range compression (per band?)
added noise (quantization, environment noise)
• Auditory magnitude spectrogram X(t, f)
  local threshold → bitmask “hash”:
  B(t, f) = 1 iff X(t, f) + X(t+1, f+1) > X(t, f+1) + X(t+1, f)
  (a minimal code sketch follows the figure below)
  [Figure: “Auditory Spectrogram” and “Binarized Hash” panels, frequency channel vs. time frame]
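A minimal sketch of the bitmask computation defined above, assuming X is an (n_frames, n_bands) array of band energies such as the output of the frame_features() sketch earlier; the 32-bit word packing anticipates the search scheme two slides below.

import numpy as np

def bitmask_hash(X):
    """B(t, f) = 1 iff X(t, f) + X(t+1, f+1) > X(t, f+1) + X(t+1, f),
    i.e. the sign of a local time/frequency difference of band energies."""
    return (X[:-1, :-1] + X[1:, 1:]) > (X[:-1, 1:] + X[1:, :-1])

def pack_words(B, n_bits=32):
    """Pack the first n_bits columns of each row of B into one integer word."""
    bits = B[:, :n_bits].astype(np.uint64)
    weights = np.array([1 << i for i in range(bits.shape[1])], dtype=np.uint64)
    return (bits * weights).sum(axis=1)  # one "sub-fingerprint" word per frame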
Timing Skew
• What happens if reference frames are out of sync with test frames?
  make the frame length much longer than the “hop”
  [Diagram: reference frames 1-4 vs. query frames 1-4 along the time axis, shown aligned and with a time offset]
  → features are very smooth in time
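As a rough worked example (the 0.37 s frame length is an illustrative assumption chosen to pair with the 11.6 ms hop quoted on the next slide, not a figure from this slide): the test framing can be off from the nearest reference frame by at most half a hop,

  worst-case misalignment = hop / 2 = 11.6 ms / 2 ≈ 5.8 ms ≈ 1.6% of a 0.37 s frame,

so consecutive frames share almost all of their samples and the per-frame features change only slightly under any skew.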
Retrieval & Matching
Haitsma & Kalker 2003
• Matching is by Hamming distance between query & ref
  use 256 x 32-bit frames (3 sec @ 11.6 ms frames)
  10k tracks ~ 200M frames
• Only check near exact matches of one 32-bit word (see the sketch below)
  hash table index of occurrences of all 2^32 = 4G values
  repeat for all 256 columns
  (can also test all 32 one-bit differences)
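A minimal sketch of this search strategy, assuming the packed 32-bit words from the pack_words() sketch earlier; the 256-word block and the bit-error-rate threshold are illustrative values, not a definitive implementation.

from collections import defaultdict

def build_index(tracks):
    """tracks: dict track_id -> sequence of 32-bit sub-fingerprint words."""
    index = defaultdict(list)                 # word -> [(track_id, frame_pos), ...]
    for tid, words in tracks.items():
        for pos, w in enumerate(words):
            index[int(w)].append((tid, pos))
    return index

def hamming(a, b):
    return bin(int(a) ^ int(b)).count("1")

def search(query, index, tracks, block=256, max_ber=0.35):
    """query: ~256 words (~3 s).  Exact single-word hits propose candidate
    alignments; each candidate is verified by Hamming distance over the block."""
    best = None
    for qpos, w in enumerate(query):
        for tid, rpos in index.get(int(w), []):
            start = rpos - qpos
            ref = tracks[tid]
            if start < 0 or start + block > len(ref):
                continue
            errs = sum(hamming(q, r)
                       for q, r in zip(query[:block], ref[start:start + block]))
            ber = errs / (32.0 * block)       # bit error rate over the block
            if ber <= max_ber and (best is None or ber < best[2]):
                best = (tid, start, ber)
    return best                               # (track_id, frame offset, BER) or None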
False Alarms vs. False Reject
• One 32-bit hash
  a = Pr(same hash | same audio) ~ 0.5 (dep. noise?)
  b = Pr(same hash | different audio) ~ 1/2^32
  (or 33/2^32 ~ 1/2^27 ≈ 10^-8 for 1-bit diffs)
  Pr(false match in L frames) = 1 - (1 - b)^L (~ Lb)
  L = 200M → Pr(false match) = 0.78
• K 32-bit hashes (e.g. K = 8)
  Pr(all match | diff audio) = b^K ~ 10^-65
  Pr(all match | same audio) = a^K ~ 1/256
• K matches out of N (e.g. 4 of 16 – Binomial)
  Pr(false reject) = Pr(B(16, a) < 4) = 0.0106
  Pr(false accept) = Pr(B(16, b) ≥ 4) ~ 10^-29 (~ b^4)
  (checked numerically below)
  [Figure: log10 Pr(false reject) and log10 Pr(false accept) vs. # matches out of 16]
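A quick numerical check of the K-of-N figures above (scipy assumed; a and b taken from the slide):

from scipy.stats import binom

a = 0.5            # Pr(hash matches | same audio)
b = 33 / 2**32     # Pr(hash matches | different audio), 1-bit-diff version
N, K = 16, 4       # declare a match if at least K of N frame hashes agree

print(binom.cdf(K - 1, N, a))   # Pr(false reject) = Pr(B(16, a) < 4) ~ 0.0106
print(binom.sf(K - 1, N, b))    # Pr(false accept) = Pr(B(16, b) >= 4) ~ 6e-30, the slide's ~10^-29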
3. Landmark Approach
Wang 2003, 2006
• Idea: use structures in the audio as the time reference
  instead of arbitrary time frames
  eliminates “framing errors”
• Another idea: use individual, spectrally-local structures as component hashes
  robustness to missing frequency bands
• Use time-frequency peaks
  i.e. onsets of specific harmonics
  highest energy → most robust to noise
  the Shazam algorithm
Shazam Landmarks
• Find local peaks in the spectrogram
  but with only ~256 different frequencies, any single peak is common
• Join them into pairs (see the sketch below)
  look for a 2nd peak within some window
  [Figure: “Phone ring - Shazam fingerprint” spectrogram, freq / kHz vs. time / s, level / dB]
hash {start freq, end freq, time diff}
= 256 x 256 x 64 = 4M distinct patterns
build index: [ hash → {track_ID, offset_sec} ]
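A minimal sketch of forming the {start freq, end freq, time diff} hashes from a list of spectrogram peaks. The target-window sizes, the peak-list format, and the exact bit packing are illustrative assumptions; Wang's published constants are not reproduced here.

from collections import defaultdict

def landmark_hashes(peaks, max_dt=63, max_df=32):
    """peaks: list of (frame, freq_bin), freq_bin in 0..255, sorted by frame.
    Pair each peak with later peaks in a small target window and pack
    {start freq, end freq, time diff} into a single ~22-bit integer."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break                              # peaks are time-sorted
            if dt >= 1 and abs(f2 - f1) <= max_df:
                h = (f1 << 14) | (f2 << 6) | dt    # 8 + 8 + 6 bits = 4M patterns
                hashes.append((h, t1))
    return hashes

def build_landmark_index(tracks):
    """tracks: dict track_ID -> peak list.  Returns hash -> [(track_ID, offset), ...]."""
    index = defaultdict(list)
    for tid, peaks in tracks.items():
        for h, t1 in landmark_hashes(peaks):
            index[h].append((tid, t1))
    return index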
Selecting Peaks
• Landmark density controls match chance, required query size
  control by density of peaks
• Pick peaks based on a local decaying surface (sketched below)
  its width and rate of decay → peak density
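A minimal sketch of peak selection with a local decaying surface. The Gaussian spread across frequency, the exponential decay over time, and both parameter values are illustrative assumptions standing in for whatever shape the real detector uses; widening the spread or slowing the decay lowers the peak density.

import numpy as np

def select_peaks(S, spread_bins=30.0, decay=0.998):
    """S: magnitude spectrogram, shape (n_freq, n_frames).
    Returns a list of (frame, freq_bin) peaks that poke above a masking
    surface which is raised around each accepted peak and decays over time."""
    n_freq, n_frames = S.shape
    bins = np.arange(n_freq)
    thresh = S[:, : min(10, n_frames)].max(axis=1)   # initialize from early frames
    peaks = []
    for t in range(n_frames):
        col = S[:, t]
        interior = (col[1:-1] > col[:-2]) & (col[1:-1] >= col[2:])
        is_max = np.concatenate(([False], interior, [False]))
        for f in np.where(is_max & (col > thresh))[0]:
            peaks.append((t, int(f)))
            # raise the surface around the accepted peak (Gaussian spread in frequency)
            thresh = np.maximum(thresh,
                                col[f] * np.exp(-0.5 * ((bins - f) / spread_bins) ** 2))
        thresh = thresh * decay      # let the surface decay so later, quieter peaks can pass
    return peaks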
Shazam Matching
• Nearby pairs of peaks → hashes
• Each query hash → list of matching ref items
• Any subset of hashes is sufficient
• Check temporal sequence for ref items with multiple hits (see the sketch below)
  [Figure: “Query audio” and “Match: 05−Full Circle at 0.032 sec” spectrograms, freq / Hz vs. time / sec]
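A minimal sketch of the matching step, using the hash and index formats from the landmark sketch above: every query hash votes for a (track, reference time - query time) pair, and a true match piles up many votes at one consistent offset, which is exactly the temporal-sequence check.

from collections import Counter

def match_query(query_hashes, index):
    """query_hashes: list of (hash, query_frame);
    index: hash -> [(track_ID, ref_frame), ...] as built above."""
    votes = Counter()
    for h, qt in query_hashes:
        for tid, rt in index.get(h, []):
            votes[(tid, rt - qt)] += 1       # vote for a (track, time-offset) pair
    if not votes:
        return None
    (tid, offset), count = votes.most_common(1)[0]
    return tid, offset, count                # best track, its offset (frames), #votes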
Shazam Decoy
• Only a tiny part of the signal needs to be preserved
  [Figure: “Query audio” and “Match: 05−Full Circle at 0 sec” spectrograms]
Error Analysis
• ~20 hashes / sec → ~4000 hashes / track
  10k tracks → 40M hash entries (160 MB)
  20-bit space → 10^6 distinct hashes
  b = Pr(hash | wrong audio) = 1 - (1 - 10^-6)^4000 = 0.4%
  a = Pr(hash | right audio) = 0.1 ?
• 5 sec query → K = 100 hashes (Binomial)
  Pr(N chance matches) = C(K, N) b^N (1 - b)^(K-N)
  N = 6 → Pr ~ 10^9 · 3x10^-15 · 0.7 = 2.5x10^-6
  Pr(true matches < N) = Σ_{n<N} C(K, n) a^n (1 - a)^(K-n) ~ 6%
  (recomputed below)
  [Figure: log10 Pr(false reject) and log10 Pr(false accept) vs. # matches out of 100]
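The same figures can be recomputed directly under the slide's stated assumptions (scipy assumed); small differences from the slide's rounded values are expected.

from scipy.stats import binom

b = 1 - (1 - 1e-6) ** 4000          # Pr(a given query hash hits a wrong track)
a, K, N = 0.1, 100, 6

print(b)                            # ~0.004, i.e. 0.4%
print(binom.pmf(N, K, b))           # Pr(exactly 6 chance matches) ~ 3e-6 (slide: 2.5e-6)
print(binom.cdf(N - 1, K, a))       # Pr(true matches < 6) ~ 0.058, the slide's ~6%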
Diagnostic Uses
• Recurring landmarks show reused audio
  [Figure: matching-landmark plots for “01−Taxman vs. 01−Taxman” and “03 Rusted Pipe vs. Rusted Pipe”: time offset of each match vs. time in the track]
Summary
• Fingerprinting
  Match the same recording despite channel differences
• Frame-based
  Local features with fuzzy match
• Landmark-based
  Highly tolerant of added noise
References
• J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system with an efficient search strategy,” J. New Music Research 32(2), 211-221, 2003.
• A. Wang, “An Industrial-Strength Audio Search Algorithm,” Proc. Int. Symp. on Music Info. Retrieval, 7-13, 2003.
• A. Wang, “The Shazam music recognition service,” Comm. ACM 49(8), 44-48, 2006.