Distance Functions and Metric Learning: Part 1
Michael Werman, Ofir Pele
The Hebrew University of Jerusalem
12/1/2011 version. Updated version:
http://www.cs.huji.ac.il/~ofirpele/DFML_ECCV2010_tutorial/
Rough Outline
• Bin-to-Bin Distances.
• Cross-Bin Distances:
– Quadratic-Form (aka Mahalanobis) / Quadratic-Chi.
– The Earth Mover’s Distance.
• Perceptual Color Differences.
• Hands-On Code Example.
Distance?

Metric
• Non-negativity: D(P,Q) ≥ 0
• Identity of indiscernibles: D(P,Q) = 0 iff P = Q
• Symmetry: D(P,Q) = D(Q,P)
• Subadditivity (triangle inequality): D(P,Q) ≤ D(P,K) + D(K,Q)
Pseudo-Metric (aka Semi-Metric)
• Non-negativity: D(P,Q) ≥ 0
• Identity of indiscernibles changed to: D(P,Q) = 0 if P = Q
• Symmetry: D(P,Q) = D(Q,P)
• Subadditivity (triangle inequality): D(P,Q) ≤ D(P,K) + D(K,Q)
Metric
• Non-negativity: D(P,Q) ≥ 0
• Identity of indiscernibles: D(P,Q) = 0 iff P = Q
An object is most similar to itself.
Metric
• Symmetry: D(P,Q) = D(Q,P)
• Subadditivity (triangle inequality): D(P,Q) ≤ D(P,K) + D(K,Q)
Useful for many algorithms.
Minkowski-Form Distances
L_p(P,Q) = (Σ_i |P_i − Q_i|^p)^(1/p)
Minkowski-Form Distances
L1(P,Q) = Σ_i |P_i − Q_i|
L2(P,Q) = √(Σ_i (P_i − Q_i)²)
L∞(P,Q) = max_i |P_i − Q_i|
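A minimal MATLAB sketch of these three distances (the histograms P and Q are illustrative row vectors):

% Minkowski-form distances between two histograms (row vectors).
P = [0.1 0.4 0.3 0.2];
Q = [0.3 0.3 0.2 0.2];
d1   = sum(abs(P - Q));        % L1: sum of absolute differences
d2   = sqrt(sum((P - Q).^2));  % L2: Euclidean distance
dinf = max(abs(P - Q));        % L-infinity: largest single-bin difference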
Kullback-Leibler Divergence
KL(P,Q) = Σ_i P_i log(P_i / Q_i)
• Information-theoretic origin.
• Non-symmetric.
• What if Q_i = 0?
Jensen-Shannon Divergence
JS(P,Q) = ½ KL(P,M) + ½ KL(Q,M),   M = ½ (P + Q)

JS(P,Q) = ½ Σ_i P_i log(2P_i / (P_i + Q_i)) + ½ Σ_i Q_i log(2Q_i / (P_i + Q_i))
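A small MATLAB sketch of JS using the closed form above (my own illustration; with the convention 0·log 0 = 0, zero bins simply drop out):

function d = js_div(P, Q)
  % Jensen-Shannon divergence between two histograms (row vectors).
  M  = (P + Q) / 2;
  tP = P > 0;   % bins where P contributes (0*log(0) taken as 0)
  tQ = Q > 0;   % bins where Q contributes
  d  = 0.5 * sum(P(tP) .* log(P(tP) ./ M(tP))) + ...
       0.5 * sum(Q(tQ) .* log(Q(tQ) ./ M(tQ)));
end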
Jensen-Shannon Divergence
JS(P,Q) = ½ KL(P,M) + ½ KL(Q,M),   M = ½ (P + Q)
• Information-theoretic origin.
• Symmetric.
• √JS is a metric.
Jensen-Shannon Divergence
• Using a Taylor expansion and some algebra:
JS(P,Q) = Σ_{n=1..∞} [1 / (2n(2n−1))] Σ_i (P_i − Q_i)^{2n} / (P_i + Q_i)^{2n−1}
        = ½ Σ_i (P_i − Q_i)² / (P_i + Q_i) + (1/12) Σ_i (P_i − Q_i)⁴ / (P_i + Q_i)³ + …
χ² Histogram Distance
χ²(P,Q) = ½ Σ_i (P_i − Q_i)² / (P_i + Q_i)
• Statistical origin.
• Experimentally, results are very similar to JS.
• Reduces the effect of large bins.
• √χ² is a metric.
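A corresponding MATLAB sketch (illustrative; empty bin pairs are skipped, since the summand there is 0/0, taken as 0):

function d = chi2_dist(P, Q)
  % Chi-squared histogram distance between two histograms (row vectors).
  s  = P + Q;
  nz = s > 0;   % skip bins where P_i + Q_i == 0 (avoid 0/0)
  d  = 0.5 * sum((P(nz) - Q(nz)).^2 ./ s(nz));
end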
χ² Histogram Distance
[Figure: χ²(·,·) < χ²(·,·) for two example histogram pairs; images omitted.]
χ² Histogram Distance
[Figure: L1(·,·) > L1(·,·) for the same histogram pairs; images omitted.]
χ² Histogram Distance
χ²(P,Q) = ½ Σ_i (P_i − Q_i)² / (P_i + Q_i)
• Experimentally better than L2.
Bin-to-Bin Distances: L1, L2, χ²
• Bin-to-bin distances such as L2 are sensitive to quantization:
[Figure: L1(·,·) > L1(·,·) for two example histogram pairs; images omitted.]
Bin-to-Bin Distances
• Fewer bins: more robustness, less distinctiveness.
• More bins: less robustness, more distinctiveness.
[Figure: L1(·,·) > L1(·,·); images omitted.]
Bin-to-Bin Distances
• Fewer bins: more robustness, less distinctiveness.
• More bins: less robustness, more distinctiveness.
Can we achieve robustness and distinctiveness?
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q)) = √(Σ_ij (P_i − Q_i)(P_j − Q_j) A_ij)
• A_ij is the similarity between bins i and j.
• If A is the inverse of the covariance matrix, QF is called the Mahalanobis distance.
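A direct MATLAB sketch of QF (illustrative; the max with 0 guards against tiny negative values from round-off, as in the QC code later in this tutorial):

function d = qf_dist(P, Q, A)
  % Quadratic-Form distance; A is the bin-similarity matrix.
  D = P - Q;                    % row vector of bin differences
  d = sqrt(max(D * A * D', 0)); % (P-Q)^T A (P-Q), clipped at 0
end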
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
With A = I:
= √(Σ_i (P_i − Q_i)²) = L2(P,Q)
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
• Does not reduce the effect of large bins.
• Alleviates the quantization problem.
• Linear-time computation in the number of non-zero A_ij.
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
• If A is positive-semidefinite, QF is a pseudo-metric.
Pseudo-Metric (aka Semi-Metric)
• Non-negativity: D(P,Q) ≥ 0
• Identity of indiscernibles changed to: D(P,Q) = 0 if P = Q
• Symmetry: D(P,Q) = D(Q,P)
• Subadditivity (triangle inequality): D(P,Q) ≤ D(P,K) + D(K,Q)
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
• If A is positive-definite, QF is a metric.
The Quadratic-Form Histogram Distance
With A = WᵀW:
QF_A(P,Q) = √((P − Q)ᵀ Wᵀ W (P − Q)) = L2(WP, WQ)
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q)) = L2(WP, WQ)
We assume there is a linear transformation that makes bins independent.
There are cases where this is not true, e.g. COLOR.
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
• Converting distance to similarity (Hafner et al. 95):
A_ij = 1 − D_ij / max_ij(D_ij)
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
• Converting distance to similarity (Hafner et al. 95):
A_ij = e^(−α D(i,j) / max_ij(D(i,j)))
If α is large enough, A will be positive-definite.
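Both conversions are one-liners in MATLAB (a sketch; the ground-distance matrix Dg and the value of alpha are illustrative):

Dg    = abs((1:5)' - (1:5));     % example ground-distance matrix
mx    = max(Dg(:));
A_lin = 1 - Dg ./ mx;            % linear conversion (Hafner et al. 95)
alpha = 10;                      % larger alpha pushes A toward identity
A_exp = exp(-alpha .* Dg ./ mx); % exponential conversion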
The Quadratic-Form Histogram Distance
QF_A(P,Q) = √((P − Q)ᵀ A (P − Q))
• Learning the similarity matrix: part 2 of this tutorial.
The Quadratic-Chi Histogram Distance
QC^A_m(P,Q) = √( Σ_ij [(P_i − Q_i)(P_j − Q_j) A_ij] / [(Σ_c (P_c + Q_c) A_ci)^m (Σ_c (P_c + Q_c) A_cj)^m] )
• A_ij is the similarity between bins i and j.
• Generalizes QF and χ².
• Reduces the effect of large bins.
• Alleviates the quantization problem.
• Linear-time computation in the number of non-zero A_ij.
The Quadratic-Chi Histogram Distance
QC^A_m(P,Q) = √( Σ_ij [(P_i − Q_i)(P_j − Q_j) A_ij] / [(Σ_c (P_c + Q_c) A_ci)^m (Σ_c (P_c + Q_c) A_cj)^m] )
• Non-negative if A is positive-semidefinite.
• Symmetric.
• Triangle inequality: unknown.
• If we define 0/0 = 0 and 0 ≤ m < 1, QC is continuous.
Similarity-Matrix-Quantization-Invariant Property
[Figure: D(·,·) = D(·,·) for two quantizations of the same histograms; images omitted.]
Sparseness-Invariant
[Figure: adding empty bins does not change the distance: D(·,·) = D(·,·); images omitted.]
The Quadratic-Chi Histogram Distance Code
function dist= QuadraticChi(P,Q,A,m)
Z= (P+Q)*A;
% 1 can be any number as Z_i==0 iff D_i=0
Z(Z==0)= 1;
Z= Z.^m;
D= (P-Q)./Z;
% max is redundant if A is
% positive-semidefinite
dist= sqrt( max(D*A*D',0) );
The Quadratic-Chi Histogram Distance Code
Time complexity: O(#{A_ij ≠ 0})
function dist= QuadraticChi(P,Q,A,m)
Z= (P+Q)*A;
% 1 can be any number as Z_i==0 iff D_i=0
Z(Z==0)= 1;
Z= Z.^m;
D= (P-Q)./Z;
% max is redundant if A is
% positive-semidefinite
dist= sqrt( max(D*A*D',0) );
The Quadratic-Chi Histogram Distance Code
What about sparse (e.g. BoW) histograms?
function dist= QuadraticChi(P,Q,A,m)
Z= (P+Q)*A;
% 1 can be any number as Z_i==0 iff D_i=0
Z(Z==0)= 1;
Z= Z.^m;
D= (P-Q)./Z;
% max is redundant if A is
% positive-semidefinite
dist= sqrt( max(D*A*D',0) );
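A hypothetical usage example for the QuadraticChi function above (the histograms, similarity matrix and m are illustrative; A here is built by thresholding a 1D ground distance and happens to be positive-definite):

P  = [0.0 0.5 0.5 0.0];
Q  = [0.5 0.5 0.0 0.0];
Dg = abs((1:4)' - (1:4));   % ground distance between bin centers
A  = 1 - min(Dg, 2) ./ 2;   % thresholded similarity, 1 on the diagonal
m  = 0.9;                   % normalization factor, 0 <= m < 1
dist = QuadraticChi(P, Q, A, m)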
The Quadratic-Chi Histogram Distance Code
What about sparse (e.g. BoW) histograms?
P − Q = [0 0 0 −3 0 0 0 4 5 0 0 0 0]
Time complexity: O(SK)
The Quadratic-Chi Histogram Distance Code
What about sparse (e.g. BoW) histograms?
P − Q = [0 0 0 −3 0 0 0 4 5 0 0 0 0]
Time complexity: O(SK), where S = #(P ≠ 0) + #(Q ≠ 0)
The Quadratic-Chi Histogram Distance Code
What about sparse (e.g. BoW) histograms?
P − Q = [0 0 0 −3 0 0 0 4 5 0 0 0 0]
Time complexity: O(SK), where K is the average number of non-zero entries in each row of A.
The Earth Mover’s Distance

The Earth Mover’s Distance
• The Earth Mover’s Distance is defined as the minimal cost that must be paid to transform one histogram into the other, where there is a “ground distance” between the basic features that are aggregated into the histogram.
The Earth Mover’s Distance
[Figure: two histogram pairs, one shown as ≠ and one as =; images omitted.]
The Earth Mover’s Distance
EMD_D(P,Q) = min_{F={F_ij}} Σ_ij F_ij D_ij
s.t.  Σ_j F_ij = P_i,   Σ_i F_ij = Q_j,   F_ij ≥ 0,   Σ_i P_i = Σ_j Q_j = 1
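For small histograms this transportation LP can be solved directly; a brute-force MATLAB sketch using linprog from the Optimization Toolbox (illustrative only, not the fast algorithms discussed later):

function emd = emd_lp(P, Q, Dg)
  % EMD between normalized histograms P, Q (row vectors, equal mass).
  % Dg(i,j) is the ground distance between bins i and j.
  n   = numel(P);
  f   = Dg(:);                          % cost of each flow variable F(i,j)
  Aeq = [kron(ones(1,n), speye(n));     % row sums:    sum_j F(i,j) = P_i
         kron(speye(n), ones(1,n))];    % column sums: sum_i F(i,j) = Q_j
  beq = [P(:); Q(:)];
  lb  = zeros(n*n, 1);                  % F(i,j) >= 0
  F   = linprog(f, [], [], Aeq, beq, lb, []);
  emd = f' * F;                         % total transportation cost
end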
The Earth Mover’s Distance
EMD_D(P,Q) = min_{F={F_ij}} (Σ_ij F_ij D_ij) / (Σ_ij F_ij)
s.t.  Σ_j F_ij ≤ P_i,   Σ_i F_ij ≤ Q_j,   Σ_ij F_ij = min(Σ_i P_i, Σ_j Q_j),   F_ij ≥ 0
The Earth Mover’s Distance
• Pele and Werman 08 – ÊMD, a new EMD definition.
Definition:
ÊMD^C_D(P,Q) = min_{F={F_ij}} Σ_ij F_ij D_ij + |Σ_i P_i − Σ_j Q_j| × C
s.t.  Σ_j F_ij ≤ P_i,   Σ_i F_ij ≤ Q_j,   Σ_ij F_ij = min(Σ_i P_i, Σ_j Q_j),   F_ij ≥ 0
EMD
[Figure: demander/supplier flow network with D_ij = 0 edges; Σ demander ≤ Σ supplier; image omitted.]

ÊMD
[Figure: demander/supplier flow network with D_ij = C edges; Σ demander ≤ Σ supplier; image omitted.]
When to Use ÊMD
• When the total mass of two histograms is important.
[Figure: EMD(·,·) = EMD(·,·); images omitted.]
When to Use ÊMD
• When the total mass of two histograms is important.
[Figure: EMD(·,·) = EMD(·,·), but ÊMD(·,·) < ÊMD(·,·); images omitted.]
When to Use ÊMD
• When the difference in total mass between histograms is a distinctive cue.
[Figure: EMD(·,·) = EMD(·,·) = 0; images omitted.]
When to Use ÊMD
• When the difference in total mass between histograms is a distinctive cue.
[Figure: EMD(·,·) = EMD(·,·) = 0, but ÊMD(·,·) < ÊMD(·,·); images omitted.]
When to Use ÊMD
• If the ground distance is a metric:
– EMD is a metric only for normalized histograms.
– ÊMD is a metric for all histograms (C ≥ ½ Δ).
ÊMD – a Natural Extension of L1
• ÊMD = L1 if:
D_ij = { 0 if i = j; 2 otherwise }   and C = 1:
ÊMD^C_D(P,Q) = min_{F={F_ij}} Σ_ij F_ij D_ij + |Σ_i P_i − Σ_j Q_j| × C
The Earth Mover’s Distance Complexity Zoo
• General ground distance: O(N³ log N) – Orlin 88.
• L1, normalized 1D histograms: O(N) – Werman, Peleg and Rosenfeld 85.
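The O(N) case is easy to state in code: for normalized 1D histograms with the linear ground distance |i − j|, the EMD equals the L1 distance between the cumulative histograms. A one-line MATLAB sketch:

% EMD for normalized 1D histograms with ground distance |i-j|:
% the L1 distance between the two cumulative histograms.
emd_1d = @(P, Q) sum(abs(cumsum(P) - cumsum(Q)));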
The Earth Mover’s Distance Complexity Zoo
• L1, normalized 1D cyclic histograms: O(N) – Pele and Werman 08 (Werman, Peleg, Melter and Kong 86).
The Earth Mover’s Distance Complexity Zoo
• L1, Manhattan grids: O(N² log N (D + log N)) – Ling and Okada 07.
The Earth Mover’s Distance Complexity Zoo
• What about N-dimensional histograms with cyclic dimensions?
•
L1
general histograms
O(N 2 log2D¡1 N )
– Gudmundsson, Klein, Knauer and Smid 07
p1
p2
p3
p4
p5
p6
p7
p8
l2
l1
l2`
The Earth Mover’s Distance Complexity Zoo
• min(L1, 2), 1D linear/cyclic histograms over {1, 2, …, Δ}: O(N) – Pele and Werman 08.
The Earth Mover’s Distance Complexity Zoo
• min(L1, 2), general histograms over {1, 2, …, Δ}^D: O(N² K log N), where K is the number of edges with cost 1 – Pele and Werman 08.
The Earth Mover’s Distance Complexity Zoo
• Any thresholded distance: O(N² log N (K + log N)) – Pele and Werman 09, where K is the number of edges with cost different from the threshold. With K = O(log N) this is O(N² log² N).
Thresholded Distances
• EMD with a thresholded ground distance is not an approximation of EMD with the original ground distance.
• It actually gives better performance.
The Flow Network Transformation
[Figure: original network vs. simplified network; images omitted.]
The Flow Network Transformation
Flowing the Monge sequence (if the ground distance is a metric, zero-cost edges are a Monge sequence).
The Flow Network Transformation
Removing empty bins and their edges.
The Flow Network Transformation
We actually finished here….
Combining Algorithms
• EMD algorithms can be combined.
• For example, thresholded L1:
The Earth Mover’s Distance Approximations

The Earth Mover’s Distance Approximations
• Charikar 02, Indyk and Thaper 03 – approximated EMD over {1, …, Δ}^d by embedding it into the L1 norm.
Time complexity: O(T N d log Δ)
Distortion (in expectation): O(d log Δ)
The Earth Mover’s Distance Approximations
• Grauman and Darrell 05 – Pyramid Match Kernel (PMK): same as Indyk and Thaper, replacing L1 with histogram intersection.
• PMK approximates EMD with partial matching.
• PMK is a Mercer kernel.
• Time complexity & distortion – same as Indyk and Thaper (proved in Grauman and Darrell 07).
The Earth Mover’s Distance Approximations
• Lazebnik, Schmid and Ponce 06 – used PMK in the spatial domain (SPM).
The Earth Mover’s Distance Approximations
[Figure: spatial pyramid matching over levels 0, 1, 2; level histograms weighted by ×1/8, ×1/4, ×1/2; image omitted.]
The Earth Mover’s Distance Approximations
• Shirdhonkar and Jacobs 08 – approximated EMD using the sum of absolute values of the weighted wavelet coefficients of the difference histogram.
[Figure: P, Q → difference → wavelet transform (levels j = 0, 1) → absolute values, weighted by ×2^(−2×0), ×2^(−2×1); image omitted.]
The Earth Mover’s Distance Approximations
• Khot and Naor 06 – any embedding of the EMD over the d-dimensional Hamming cube into L1 must incur a distortion of Ω(d).
• Andoni, Indyk and Krauthgamer 08 – for sets with cardinalities upper bounded by a parameter s, the distortion reduces to O(log s log d).
• Naor and Schechtman 07 – any embedding of the EMD over {0, 1, …, Δ}² must incur a distortion of Ω(√(log Δ)).
Robust Distances
• Very high distances come from outliers and should all count as the same difference.
Robust Distances
• With colors, the natural choice.
[Figure: CIEDE2000 distance from blue, y-axis 0–120; image omitted.]
Robust Distances
ΔE00(blue, red) = 56
ΔE00(blue, yellow) = 102
Robust Distances - Exponent
• Usually a negative exponent is used:
• Let d(a,b) be a distance measure between two features a, b.
• The negative-exponent distance is: d_e(a,b) = 1 − e^(−d(a,b)/σ)
Robust Distances - Exponent
• The exponent is used because it is (Ruzon and Tomasi 01): robust, smooth, monotonic, and a metric.
• Input is always discrete anyway…
Robust Distances - Thresholded
• Let d(a,b) be a distance measure between two features a, b.
• The thresholded distance with a threshold t > 0 is: d_t(a,b) = min(d(a,b), t).
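Both robust variants are trivial to apply; a MATLAB sketch (sigma and t are illustrative values):

sigma = 5;  t = 10;
d_exp = @(d) 1 - exp(-d ./ sigma);  % negative-exponent distance
d_thr = @(d) min(d, t);             % thresholded distance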
Thresholded Distances
• Thresholded metrics are also metrics (Pele and Werman ICCV 2009).
• Better results.
• The Pele and Werman ICCV 2009 algorithm computes EMD with thresholded ground distances much faster.
• A thresholded distance corresponds to a sparse similarity matrix, A_ij = 1 − D_ij / max_ij(D_ij) → faster QC / QF computation.
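A sketch of why thresholding helps QC/QF (an illustrative 1D example): bins farther apart than the threshold get similarity 0, so A can be stored and multiplied as a sparse matrix.

n  = 8;  t = 2;
Dg = min(abs((1:n)' - (1:n)), t);  % thresholded 1D ground distance
A  = sparse(1 - Dg ./ t);          % similarity is 0 outside the band
nnz(A)                             % only O(n*t) non-zero entries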
Thresholded Distances
• Thresholded vs. exponent:
– Fast computation of cross-bin distances with a thresholded ground distance.
– The exponent changes small distances, which can be a problem (e.g. for color differences).
Thresholded Distances
• Color distance should be thresholded (robust).
[Figure: distance from blue for ΔE00, min(ΔE00, 10), 10(1 − e^(−ΔE00/4)) and 10(1 − e^(−ΔE00/5)); image omitted.]
Thresholded Distances
Exponent changes small distances.
[Figure: zoom-in on small distances from blue for ΔE00, min(ΔE00, 10), 10(1 − e^(−ΔE00/4)) and 10(1 − e^(−ΔE00/5)); image omitted.]
A Ground Distance for SIFT
The ground distance between two SIFT bins (x_i, y_i, o_i) and (x_j, y_j, o_j):
d_R = ||(x_i, y_i) − (x_j, y_j)||_2 + min(|o_i − o_j|, M − |o_i − o_j|)
d_T = min(d_R, T)
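A MATLAB sketch of this ground distance (the function name is illustrative; M is the number of orientation bins and T the threshold):

function d = sift_ground_dist(xi, yi, oi, xj, yj, oj, M, T)
  % Thresholded ground distance between SIFT bins (x, y, o).
  spatial = hypot(xi - xj, yi - yj);              % ||(xi,yi)-(xj,yj)||_2
  orient  = min(abs(oi - oj), M - abs(oi - oj));  % cyclic orientation diff
  d       = min(spatial + orient, T);             % threshold for robustness
end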
A Ground Distance for Color Images
The ground distance between two L*a*b* image bins (x_i, y_i, L_i, a_i, b_i) and (x_j, y_j, L_j, a_j, b_j) we use is:
d_cT = min(||(x_i, y_i) − (x_j, y_j)||_2 + ΔE00((L_i, a_i, b_i), (L_j, a_j, b_j)), T)
Perceptual Color Differences

Perceptual Color Differences
• The Euclidean distance on L*a*b* space is widely considered perceptually uniform.
[Figure: ||Lab||₂ distance from blue, y-axis 0–300; image omitted.]
Perceptual Color Differences
[Figure: zoom-in on ||Lab||₂ distance from blue – purples come before blues; image omitted.]
Perceptual Color Differences
• ΔE00 on L*a*b* space is better. Luo, Cui and Rigg 01; Sharma, Wu and Dalal 05.
[Figure: ΔE00 distance from blue; image omitted.]
Perceptual Color Differences
• ΔE00 on L*a*b* space is better.
[Figure: ||Lab||₂ vs. ΔE00 distance from blue; images omitted.]
Perceptual Color Differences
• ΔE00 on L*a*b* space is better.
• But it still has major problems:
ΔE00(blue, red) = 56
ΔE00(blue, yellow) = 102
• Color distance should be thresholded (robust).
Perceptual Color Differences
• Color distance should be saturated (robust).
[Figure: distance from blue for ΔE00, min(ΔE00, 10), 10(1 − e^(−ΔE00/4)) and 10(1 − e^(−ΔE00/5)); image omitted.]
Thresholded Distances
Exponent changes small distances.
[Figure: zoom-in on small distances from blue for the same four curves; image omitted.]
Perceptual Color Descriptors

Perceptual Color Descriptors
• 11 basic color terms. Berlin and Kay 69:
white, black, purple, red, green, pink, yellow, orange, blue, grey, brown.
Perceptual Color Descriptors
• 11 basic color terms. Berlin and Kay 69.
[Image copyright Eric Rolph. Taken from: http://upload.wikimedia.org/wikipedia/commons/5/5c/Double-alaskan-rainbow.jpg]
Perceptual Color Descriptors
• How to give each pixel an “11-colors” description?
Perceptual Color Descriptors
• Learning Color Names from Real-World Images, J. van de Weijer, C. Schmid, J. Verbeek. CVPR 2007.
[Figure: chip-based vs. real-world training data; images omitted.]
Perceptual Color Descriptors
• Learning Color Names from Real-World Images,
J. van de Weijer, C. Schmid, J. Verbeek CVPR 2007.
• For each color, returns a probability distribution over the 11 basic colors:
white, black, purple, red, green, pink, yellow, orange, blue, grey, brown.
Perceptual Color Descriptors
• Applying Color Names to Image Description,
J. van de Weijer, C. Schmid ICIP 2007.
• Outperformed state-of-the-art color descriptors.
Perceptual Color Descriptors
• Using illumination invariants – black, gray and white become the same.
• “Too much invariance” happens in other cases (Local features and kernels for classification of texture and object categories: an in-depth study – Zhang, Marszalek, Lazebnik and Schmid, IJCV 2007; Learning the discriminative power-invariance trade-off – Varma and Ray, ICCV 2007).
• To conclude: don’t solve imaginary problems.
Perceptual Color Descriptors
• This method is still not perfect.
• The 11-color vector for purple (255, 0, 255) is:
0 0 0 0 0 1 0 0 0 0 0
• But real-world images contain no such over-saturated colors.
Open Questions
• EMD variant that reduces the effect of large bins.
• Learning the ground distance for EMD.
• Learning the similarity matrix and normalization factor for QC.
Hands-On Code Example
http://www.cs.huji.ac.il/~ofirpele/FastEMD/code/
http://www.cs.huji.ac.il/~ofirpele/QC/code/
Tutorial:
http://www.cs.huji.ac.il/~ofirpele/DFML_ECCV2010_tutorial/
Or search for “Ofir Pele” and download.