Distance Functions and Metric Learning: Part 1
Michael Werman, Ofir Pele
The Hebrew University of Jerusalem
12/1/2011 version. Updated version: http://www.cs.huji.ac.il/~ofirpele/DFML_ECCV2010_tutorial/

Rough Outline
• Bin-to-Bin Distances.
• Cross-Bin Distances:
– Quadratic-Form (aka Mahalanobis) / Quadratic-Chi.
– The Earth Mover's Distance.
• Perceptual Color Differences.
• Hands-On Code Example.

Distance ? Metric
A metric satisfies:
• Non-negativity: $D(P,Q) \ge 0$
• Identity of indiscernibles: $D(P,Q) = 0$ iff $P = Q$ (an object is most similar to itself).
• Symmetry: $D(P,Q) = D(Q,P)$
• Subadditivity (triangle inequality): $D(P,Q) \le D(P,K) + D(K,Q)$, which is useful for many algorithms.

Pseudo-Metric (aka Semi-Metric)
Same as a metric, except that the identity of indiscernibles is weakened to:
• $D(P,Q) = 0$ if $P = Q$
(so two distinct objects may be at distance zero).

Minkowski-Form Distances
$$L_p(P,Q) = \Big(\sum_i |P_i - Q_i|^p\Big)^{1/p}$$
Common special cases:
$$L_1(P,Q) = \sum_i |P_i - Q_i| \qquad L_2(P,Q) = \sqrt{\sum_i (P_i - Q_i)^2} \qquad L_\infty(P,Q) = \max_i |P_i - Q_i|$$

Kullback-Leibler Divergence
$$KL(P,Q) = \sum_i P_i \log \frac{P_i}{Q_i}$$
• Information-theoretic origin.
• Not symmetric.
• Undefined when some $Q_i = 0$.

Jensen-Shannon Divergence
$$JS(P,Q) = \frac{1}{2}KL(P,M) + \frac{1}{2}KL(Q,M), \qquad M = \frac{1}{2}(P+Q)$$
Equivalently:
$$JS(P,Q) = \frac{1}{2}\sum_i P_i \log\frac{2P_i}{P_i+Q_i} + \frac{1}{2}\sum_i Q_i \log\frac{2Q_i}{P_i+Q_i}$$
• Information-theoretic origin.
• Symmetric.
• $\sqrt{JS}$ is a metric.
• Using a Taylor expansion and some algebra:
$$JS(P,Q) = \sum_{n=1}^{\infty} \frac{1}{2n(2n-1)} \sum_i \frac{(P_i-Q_i)^{2n}}{(P_i+Q_i)^{2n-1}} = \frac{1}{2}\sum_i \frac{(P_i-Q_i)^2}{P_i+Q_i} + \frac{1}{12}\sum_i \frac{(P_i-Q_i)^4}{(P_i+Q_i)^3} + \cdots$$

$\chi^2$ Histogram Distance
$$\chi^2(P,Q) = \frac{1}{2}\sum_i \frac{(P_i-Q_i)^2}{P_i+Q_i}$$
• Statistical origin.
• Experimentally, results are very similar to JS.
• Reduces the effect of large bins.
• $\sqrt{\chi^2}$ is a metric.
• Experimentally better than $L_2$.
(Slide examples, images omitted: histogram pairs on which $\chi^2$ and $L_1$ rank similarity differently.)

Bin-to-Bin Distances
• Bin-to-bin distances such as $L_1$, $L_2$, and $\chi^2$ are sensitive to quantization: a small shift of mass to a neighboring bin can change the distance a lot (slide examples omitted).
• Fewer bins: more robustness, less distinctiveness.
• More bins: less robustness, more distinctiveness.
• Can we achieve both robustness and distinctiveness?

The Quadratic-Form Histogram Distance
$$QF^A(P,Q) = \sqrt{(P-Q)^T A (P-Q)} = \sqrt{\sum_{ij} (P_i-Q_i)(P_j-Q_j)A_{ij}}$$
• $A_{ij}$ is the similarity between bins $i$ and $j$.
• If $A$ is the inverse of the covariance matrix, QF is called the Mahalanobis distance.
• With $A = I$:
$$QF^I(P,Q) = \sqrt{\sum_i (P_i-Q_i)^2} = L_2(P,Q)$$
• Does not reduce the effect of large bins.
• Alleviates the quantization problem.
• Linear time computation in the number of non-zero $A_{ij}$.
• If $A$ is positive-semidefinite, QF is a pseudo-metric.
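As a minimal sketch in the style of the QuadraticChi code shown later (our illustration, not part of the tutorial's released code), QF can be computed directly from the definition; P and Q are 1xN row vectors and A is the NxN bin-similarity matrix:

function dist= QuadraticForm(P,Q,A)
% computes QF^A(P,Q) = sqrt( (P-Q) * A * (P-Q)' )
D= P-Q;
% max guards against tiny negative values caused by round-off
% when A is only positive-semidefinite
dist= sqrt( max(D*A*D',0) );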
The Quadratic-Form Histogram Distance
• Recall the pseudo-metric definition above: the identity of indiscernibles is weakened to $D(P,Q) = 0$ if $P = Q$.
• If $A$ is positive-definite, QF is a metric.
• Writing $A = W^T W$:
$$QF^A(P,Q) = \sqrt{(P-Q)^T W^T W (P-Q)} = L_2(WP, WQ)$$
So we assume there is a linear transformation that makes the bins independent. There are cases where this is not true, e.g. COLOR.
• Converting a distance $D$ to a similarity matrix (Hafner et al. 95):
$$A_{ij} = 1 - \frac{D_{ij}}{\max_{ij}(D_{ij})} \qquad \text{or} \qquad A_{ij} = e^{-\alpha \frac{D(i,j)}{\max_{ij} D(i,j)}}$$
If $\alpha$ is large enough, $A$ will be positive-definite.
• Learning the similarity matrix: part 2 of this tutorial.

The Quadratic-Chi Histogram Distance
$$QC^A_m(P,Q) = \sqrt{\sum_{ij} \frac{(P_i-Q_i)(P_j-Q_j)A_{ij}}{\big(\sum_c (P_c+Q_c)A_{ci}\big)^m \big(\sum_c (P_c+Q_c)A_{cj}\big)^m}}$$
• $A_{ij}$ is the similarity between bins $i$ and $j$.
• Generalizes QF and $\chi^2$.
• Reduces the effect of large bins.
• Alleviates the quantization problem.
• Linear time computation in the number of non-zero $A_{ij}$.
• Non-negative if $A$ is positive-semidefinite.
• Symmetric; whether the triangle inequality holds is unknown.
• If we define $\frac{0}{0} = 0$ and take $0 \le m < 1$, QC is continuous.
• Similarity-matrix-quantization-invariant (slide illustration omitted).
• Sparseness-invariant: adding empty bins to both histograms does not change the distance (slide illustration omitted).

The Quadratic-Chi Histogram Distance Code

function dist= QuadraticChi(P,Q,A,m)
Z= (P+Q)*A;
% 1 can be any number as Z_i==0 iff D_i==0
Z(Z==0)= 1;
Z= Z.^m;
D= (P-Q)./Z;
% max is redundant if A is
% positive-semidefinite
dist= sqrt( max(D*A*D',0) );

• Time complexity: $O(\#\{A_{ij} \neq 0\})$.
• What about sparse (e.g. bag-of-words) histograms? Only the non-zero entries of $P-Q$ matter, e.g.
$$P - Q = [\,0\ \ 0\ \ 0\ \ {-3}\ \ 0\ \ 0\ \ 0\ \ 4\ \ 5\ \ 0\ \ 0\ \ 0\ \ 0\,]$$
• Time complexity: $O(SK)$, where $S = \#(P \neq 0) + \#(Q \neq 0)$ and $K$ is the average number of non-zero entries in each row of $A$.
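To wrap up the Quadratic-Chi section, a small usage sketch (ours, with made-up numbers): with $A = I$ and $m = 0.5$ the distance reduces to $\sqrt{2\chi^2}$, which is easy to check against the QuadraticChi code above:

P= [0.1 0.4 0.5];
Q= [0.5 0.3 0.2];
A= eye(3);                           % no cross-bin similarity
qc= QuadraticChi(P,Q,A,0.5);
chi2= 0.5*sum( ((P-Q).^2) ./ (P+Q) );
% qc equals sqrt(2*chi2) up to round-off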
The Earth Mover's Distance
(Slide illustrations omitted: example image/histogram pairs motivating EMD.)

For normalized histograms ($\sum_i P_i = \sum_j Q_j = 1$):
$$EMD_D(P,Q) = \min_{F=\{F_{ij}\}} \sum_{i,j} F_{ij} D_{ij} \quad \text{s.t.} \quad \sum_j F_{ij} = P_i, \quad \sum_i F_{ij} = Q_j, \quad F_{ij} \ge 0$$
(For small histograms this linear program can be solved directly with a generic solver; see the sketch at the end of this section.)

For unnormalized histograms:
$$EMD_D(P,Q) = \frac{\min_F \sum_{i,j} F_{ij} D_{ij}}{\sum_{i,j} F_{ij}} \quad \text{s.t.} \quad \sum_j F_{ij} \le P_i, \quad \sum_i F_{ij} \le Q_j, \quad \sum_{i,j} F_{ij} = \min\Big(\sum_i P_i, \sum_j Q_j\Big), \quad F_{ij} \ge 0$$

$\widehat{EMD}$
• Pele and Werman 08: $\widehat{EMD}$, a new EMD definition:
$$\widehat{EMD}^C_D(P,Q) = \min_F \sum_{i,j} F_{ij} D_{ij} + \Big|\sum_i P_i - \sum_j Q_j\Big| \cdot C$$
subject to the same constraints:
$$\sum_j F_{ij} \le P_i, \quad \sum_i F_{ij} \le Q_j, \quad \sum_{i,j} F_{ij} = \min\Big(\sum_i P_i, \sum_j Q_j\Big), \quad F_{ij} \ge 0$$
• In flow-network terms (slides show supplier/demander networks with $\sum$ demander $\le \sum$ supplier): EMD lets the surplus supplier mass flow at cost $D_{ij} = 0$, while $\widehat{EMD}$ charges it cost $D_{ij} = C$.

When to Use $\widehat{EMD}$
• When the total mass of the two histograms is important: on the slide examples, EMD rates both pairs equally while $\widehat{EMD}$ rates the pair with matching total mass closer.
• When the difference in total mass between histograms is a distinctive cue: on the slide examples, EMD is 0 for both pairs despite the mass difference, while $\widehat{EMD}$ distinguishes them.
• If the ground distance is a metric, EMD is a metric only for normalized histograms, while $\widehat{EMD}$ is a metric for all histograms (provided $C \ge \frac{1}{2}\Delta$, where $\Delta$ is the maximal ground distance).

$\widehat{EMD}$: a Natural Extension of $L_1$
• $\widehat{EMD} = L_1$ if $C = 1$ and
$$D_{ij} = \begin{cases} 0 & \text{if } i = j \\ 2 & \text{otherwise} \end{cases}$$

The Earth Mover's Distance Complexity Zoo
• General ground distance: $O(N^3 \log N)$ (Orlin 88).
• $L_1$, normalized 1D histograms: $O(N)$ (Werman, Peleg and Rosenfeld 85).
• $L_1$, normalized 1D cyclic histograms: $O(N)$ (Pele and Werman 08; Werman, Peleg, Melter and Kong 86).
• $L_1$, Manhattan grids: $O(N^2 \log N (D + \log N))$ (Ling and Okada 07).
• What about N-dimensional histograms with cyclic dimensions?
• $L_1$, general histograms: $O(N^2 \log^{2D-1} N)$ (Gudmundsson, Klein, Knauer and Smid 07).
• $\min(L_1, 2)$ over $\{1, 2, \dots, \Delta\}$, 1D linear/cyclic histograms: $O(N)$ (Pele and Werman 08).
• $\min(L_1, 2)$ over $\{1, 2, \dots, \Delta\}^D$, general histograms: $O(N^2 K \log(\frac{N}{K}))$, where $K$ is the number of edges with cost 1 (Pele and Werman 08).
• Any thresholded distance: $O(N^2 \log N (K + \log N))$, where $K$ is the number of edges with cost different from the threshold (Pele and Werman 09). For $K = O(\log N)$ this is $O(N^2 \log^2 N)$.

Thresholded Distances
• EMD with a thresholded ground distance is not an approximation of EMD.
• It has better performance.
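As promised above, a minimal sketch (ours, for illustration only) showing how the normalized-histogram EMD linear program can be solved directly with MATLAB's linprog (Optimization Toolbox); in practice, dedicated min-cost-flow solvers such as the FastEMD code linked at the end of the tutorial are much faster:

function emd= EMD_lp(P,Q,D)
% P,Q: 1xN normalized histograms (each sums to 1); D: NxN ground distances
N= numel(P);
f= D(:);                             % cost per flow variable, x = F(:)
Ar= repmat(eye(N),1,N);              % Ar*x = row sums of F    (must equal P')
Ac= kron(eye(N),ones(1,N));          % Ac*x = column sums of F (must equal Q')
% minimize f'*x subject to the flow constraints and x >= 0
x= linprog(f, [], [], [Ar; Ac], [P(:); Q(:)], zeros(N*N,1), []);
emd= f'*x;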
The Flow Network Transformation
The original flow network is simplified in steps (slides show the original vs. simplified networks):
• Flowing the Monge sequence (if the ground distance is a metric, the zero-cost edges are a Monge sequence).
• Removing empty bins and their edges.
• We actually finished here....

Combining Algorithms
• EMD algorithms can be combined. For example, thresholded $L_1$.

The Earth Mover's Distance Approximations
• Charikar 02, Indyk and Thaper 03: approximated EMD over $\{1, \dots, \Delta\}^d$ by embedding it into the $L_1$ norm. Time complexity: $O(TNd\log\Delta)$. Distortion (in expectation): $O(d\log\Delta)$.
• Grauman and Darrell 05: the Pyramid Match Kernel (PMK), same as Indyk and Thaper, replacing $L_1$ with histogram intersection.
– PMK approximates EMD with partial matching.
– PMK is a Mercer kernel.
– Time complexity and distortion: same as Indyk and Thaper (proved in Grauman and Darrell 07).
• Lazebnik, Schmid and Ponce 06: used PMK in the spatial domain (SPM). (Slide: a three-level spatial pyramid, levels 0, 1, 2, with per-level weights 1/8, 1/4, 1/2.)
• Shirdhonkar and Jacobs 08: approximated EMD using the sum of absolute values of the weighted wavelet coefficients of the difference histogram. (Slide: the difference of $P$ and $Q$ is wavelet-transformed; the absolute coefficient values at levels $j = 0, 1$ are weighted by $2^{-2\cdot 0}$ and $2^{-2\cdot 1}$ and summed.)

Lower bounds:
• Khot and Naor 06: any embedding of the EMD over the $d$-dimensional Hamming cube into $L_1$ must incur a distortion of $\Omega(d)$.
• Andoni, Indyk and Krauthgamer 08: for sets with cardinalities upper bounded by a parameter $s$, the distortion reduces to $O(\log s \log d)$.
• Naor and Schechtman 07: any embedding of the EMD over $\{0, 1, \dots, \Delta\}^2$ must incur a distortion of $\Omega(\sqrt{\log\Delta})$.

Robust Distances
• Beyond some point, a larger distance should not matter: for outliers, "very different" is the same difference.
• With colors, this is the natural choice. (Slide: CIEDE2000 distance from blue; plot omitted.)
• Example: $\Delta E_{00}(\text{blue}, \text{red}) = 56$ and $\Delta E_{00}(\text{blue}, \text{yellow}) = 102$, yet both are simply very different from blue.

Robust Distances - Exponent
• Usually a negative exponent is used. Let $d(a,b)$ be a distance measure between two features $a, b$. The negative-exponent distance is:
$$d_e(a,b) = 1 - e^{-\frac{d(a,b)}{\sigma}}$$
• The exponent is used because it is robust, smooth, monotonic, and a metric (Ruzon and Tomasi 01). But the input is always discrete anyway...

Robust Distances - Thresholded
• Let $d(a,b)$ be a distance measure between two features $a, b$. The thresholded distance with a threshold of $t > 0$ is:
$$d_t(a,b) = \min(d(a,b), t)$$

Thresholded Distances
• Thresholded metrics are also metrics (Pele and Werman ICCV 2009).
• Better results.
• The Pele and Werman ICCV 2009 algorithm computes EMD with thresholded ground distances much faster.
• A thresholded distance corresponds to a sparse similarity matrix $A_{ij} = 1 - \frac{D_{ij}}{\max_{ij}(D_{ij})}$, which gives faster QC/QF computation.
• Thresholded vs. exponent:
– Fast computation of cross-bin distances with a thresholded ground distance.
– The exponent changes small distances, which can be a problem (e.g. color differences).
• Color distance should be thresholded (robust).
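A tiny sketch (ours) of the two robust transforms just defined, applied to a range of raw distances; it makes the point of the plots in the next section numerically: the exponent already changes small distances, while the threshold leaves them intact:

d= 0:0.5:20;                         % raw distances
t= 10;                               % threshold
d_thr= min(d, t);                    % thresholded: identical to d below t
d_exp= t*(1 - exp(-d/4));            % negative exponent, scaled to saturate at t
% at d = 2: d_thr = 2, but d_exp is about 3.9, so small distances change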
Thresholded Distances
(Plots omitted: distance from blue under $\Delta E_{00}$, $\min(\Delta E_{00}, 10)$, $10(1 - e^{-\Delta E_{00}/4})$, and $10(1 - e^{-\Delta E_{00}/5})$; zooming in on small distances shows that the exponent changes them while the threshold does not.)

A Ground Distance for SIFT
The ground distance between two SIFT bins $(x_i, y_i, o_i)$ and $(x_j, y_j, o_j)$ (a MATLAB transcription appears at the end of this section):
$$d_R = \|(x_i, y_i) - (x_j, y_j)\|_2 + \min(|o_i - o_j|,\ M - |o_i - o_j|)$$
$$d_T = \min(d_R, T)$$

A Ground Distance for Color Images
The ground distance between two L*a*b* image bins $(x_i, y_i, L_i, a_i, b_i)$ and $(x_j, y_j, L_j, a_j, b_j)$ we use is:
$$d_{cT} = \min\big(\|(x_i, y_i) - (x_j, y_j)\|_2 + \Delta_{00}((L_i, a_i, b_i), (L_j, a_j, b_j)),\ T\big)$$

Perceptual Color Differences
• The Euclidean distance on L*a*b* space is widely considered perceptually uniform, but it has problems: sorting colors by $\|Lab\|_2$ distance from blue puts purples before blues (plots omitted).
• $\Delta E_{00}$ on L*a*b* space is better (Luo, Cui and Rigg 01; Sharma, Wu and Dalal 05; plots omitted).
• But it still has major problems: $\Delta E_{00}(\text{blue}, \text{red}) = 56$ while $\Delta E_{00}(\text{blue}, \text{yellow}) = 102$.
• So the color distance should be saturated/thresholded (robust); as the plots above show, the exponent changes small distances.

Perceptual Color Descriptors
• 11 basic color terms (Berlin and Kay 69): white, black, purple, red, green, pink, yellow, orange, blue, grey, brown.
(Rainbow image copyright by Eric Rolph. Taken from: http://upload.wikimedia.org/wikipedia/commons/5/5c/Double-alaskan-rainbow.jpg)
• How to give each pixel an "11-colors" description?
• Learning Color Names from Real-World Images, J. van de Weijer, C. Schmid, J. Verbeek, CVPR 2007: color names are learned from real-world images rather than chip-based data; for each color the method returns a probability distribution over the 11 basic colors.
• Applying Color Names to Image Description, J. van de Weijer, C. Schmid, ICIP 2007: outperformed state-of-the-art color descriptors.
• Using illumination invariants, black, gray and white become the same. "Too much invariance" happens in other cases (Local features and kernels for classification of texture and object categories: An in-depth study, Zhang, Marszalek, Lazebnik and Schmid, IJCV 2007; Learning the discriminative power-invariance trade-off, Varma and Ray, ICCV 2007).
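A direct MATLAB transcription (ours, not from the tutorial's released code) of the SIFT ground distance defined earlier in this section; bi and bj are [x y o] bin coordinates, M is the number of orientation bins, and T is the threshold:

function d= siftGroundDist(bi, bj, M, T)
% d_R = spatial L2 distance + cyclic orientation distance
dSpatial= norm(bi(1:2) - bj(1:2));
dOrient= min(abs(bi(3)-bj(3)), M - abs(bi(3)-bj(3)));
% d_T = min(d_R, T)
d= min(dSpatial + dOrient, T);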
• To conclude: don't solve imaginary problems.

Perceptual Color Descriptors
• This method is still not perfect. The 11-color vector for purple (255, 0, 255) is:
[0 0 0 0 0 1 0 0 0 0 0]
• But in real-world images there are no such over-saturated colors.

Open Questions
• An EMD variant that reduces the effect of large bins.
• Learning the ground distance for EMD.
• Learning the similarity matrix and normalization factor for QC.

Hands-On Code Example
http://www.cs.huji.ac.il/~ofirpele/FastEMD/code/
http://www.cs.huji.ac.il/~ofirpele/QC/code/
Tutorial: http://www.cs.huji.ac.il/~ofirpele/DFML_ECCV2010_tutorial/
Or search for "Ofir Pele".