Similarity-based Classifiers: Problems and Solutions

Classifying based on similarities
Van Gogh or Monet?
[Figure: example paintings labeled Van Gogh and Monet, plus an unlabeled painting to classify]

The Similarity-based Classification Problem
Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$ (paintings), $y_i \in \mathcal{G}$ (painter), $i = 1, \ldots, n$
Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$
Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \ \ldots \ y_n]^T$
Test similarities: $s = [\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T$ and $\psi(x, x)$
Problem: estimate the class label $\hat{y}$ for a test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Examples of Similarity Functions
Computational biology
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)
Computer vision
– Tangent distance (Duda et al., 2001)
– Earth mover's distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)
Information retrieval
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)

Approaches to Similarity-based Classification
Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Can we treat similarities as kernels?
Kernels are inner products in some Hilbert space.
Example inner product: $\langle x, z \rangle = x^T z$.
Properties of an inner product:
– conjugate symmetric
– linear: $\langle a x, z \rangle = a \langle x, z \rangle$ for real $a$
– positive definite: $\langle x, x \rangle > 0$ unless $x = 0$
An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.
Inner products are similarities. Are our notions of similarity always inner products? No!

Example: Amazon similarity
$\Omega$ = space of all books, $\psi(A, B)$ = % of customers who buy book A after viewing book B on Amazon (96 books).
[Figure: 96 x 96 Amazon similarity matrix S]
Inner product-like?
Asymmetric! $\psi(\text{HTF}, \text{Bishop}) = 3$, but $\psi(\text{Bishop}, \text{HTF}) = 8$.
Not PSD! [Figure: eigenvalue spectrum of S, showing negative eigenvalues]

Well, let's just make S be a kernel matrix
First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, so that $S = U \Lambda U^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.
Clip: $S_{\mathrm{clip}} = U\, \mathrm{diag}(\max(\lambda_1, 0), \ldots, \max(\lambda_n, 0))\, U^T$
$S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.
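A minimal NumPy sketch of the clip operation described above; the function name and the tiny 2 x 2 test matrix are illustrative choices, not from the talk. The flip and shift variants on the following slides differ only in how the eigenvalues are transformed.

```python
import numpy as np

def make_psd_clip(S):
    """Sketch of the 'clip' spectrum modification: symmetrize S, then
    zero out its negative eigenvalues. The result is the PSD matrix
    closest to the symmetrized S in Frobenius norm."""
    S = 0.5 * (S + S.T)                 # symmetrize
    lam, U = np.linalg.eigh(S)          # S = U diag(lam) U^T
    lam_clip = np.maximum(lam, 0.0)     # clip negative eigenvalues to zero
    return (U * lam_clip) @ U.T         # U diag(lam_clip) U^T

# Usage with a small asymmetric, indefinite similarity matrix:
S = np.array([[5.0, 8.0], [3.0, 5.0]])
S_clip = make_psd_clip(S)
print(np.linalg.eigvalsh(S_clip))       # all eigenvalues are now >= 0
```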
[Figure: S_clip is the projection of S onto the PSD cone]

Well, let's just make S be a kernel matrix
First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, so that $S = U \Lambda U^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.
Flip: $S_{\mathrm{flip}} = U\, \mathrm{diag}(|\lambda_1|, \ldots, |\lambda_n|)\, U^T$ (similar effect: $S_{\mathrm{new}} = S^T S$)
Shift: $S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$
Flip, clip, or shift? The best bet is clip.

Learn the best kernel matrix for the SVM (Luss & d'Aspremont, NIPS 2007; Chen et al., ICML 2009):
$$\min_{K \succeq 0}\ \min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$

Approaches to Similarity-based Classification
Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Let the similarities to the training samples be features
Let $[\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T \in \mathbb{R}^n$ be the feature vector for $x$.
– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008):
$$\min_{\alpha}\ \frac{1}{2} \|y - S\alpha\|_2^2 + \epsilon \|\alpha\|_1 + \gamma \|\alpha\|_1$$
Does this work asymptotically? Our results suggest you need to choose a slowly growing subset of the n similarity features.

Results (% test error):

                                     Amazon47   Aural Sonar  Caltech101  Face Rec   Mirex   Voting (VDM)
# samples                               204        100          8677        945      3090       435
# classes                                47          2           101        139        10         2
SVM-kNN (clip) (Zhang et al., 2006)    17.56      13.75         36.82       4.23     61.25      5.23
SVM (clip)                             81.24      13.00         33.49       4.18     57.83      4.89
SVM sim-as-feature (linear)            76.10      14.25         38.18       4.29     55.54      5.40
SVM sim-as-feature (RBF)               75.98      14.25         38.16       3.92     55.72      5.52
P-SVM                                  70.12      14.25         34.23       4.05     63.81      5.34

Approaches to Similarity-based Classification
Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Weighted Nearest-Neighbors
Take a weighted vote of the k nearest neighbors:
$$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^k w_i\, I\{y_i = g\}$$
An algorithmic parallel of the exemplar model of human learning.
For $w_i \geq 0$ and $\sum_i w_i = 1$, this gives a class posterior estimate:
$$\hat{P}(Y = g \mid X = x) = \sum_{i=1}^k w_i\, I\{y_i = g\}$$
Good for asymmetric costs, for interpretation, and for system integration.

Design Goals for the Weights (Chen et al., JMLR 2009)
Design Goal 1 (Affinity): $w_i$ should be an increasing function of $\psi(x, x_i)$.
Design Goal 2 (Diversity): $w_i$ should be a decreasing function of $\psi(x_i, x_j)$.
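A small sketch of the weighted vote and posterior estimate above; the function name, class names, and the uniform placeholder weights are illustrative only, and the weight vector could come from any of the schemes discussed next.

```python
import numpy as np

def weighted_knn_predict(weights, neighbor_labels, classes):
    """Weighted k-NN vote: p_hat[g] = sum_i w_i * 1{y_i == g}.
    With nonnegative weights summing to one, p_hat is a class
    posterior estimate; the prediction is its argmax."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(neighbor_labels)
    p_hat = np.array([weights[labels == g].sum() for g in classes])
    return classes[int(np.argmax(p_hat))], p_hat

# Usage: 4 neighbors, uniform (placeholder) weights.
classes = np.array(["VanGogh", "Monet"])
labels = ["VanGogh", "Monet", "VanGogh", "VanGogh"]
w = np.full(4, 0.25)
y_hat, p_hat = weighted_knn_predict(w, labels, classes)
print(y_hat, p_hat)   # VanGogh [0.75 0.25]
```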
Linear Interpolation Weights
Linear interpolation weights meet these goals:
$$\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0,\ \sum_i w_i = 1$$
[Figure: when x lies inside the convex hull of the neighbors $x_1, \ldots, x_4$ the solution is non-unique; when x lies outside, there is no solution]

LIME weights
Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):
$$\begin{aligned} \min_w \quad & \Big\| \sum_{i=1}^k w_i x_i - x \Big\|^2 + \lambda \sum_{i=1}^k w_i \log w_i \\ \text{subject to} \quad & \sum_{i=1}^k w_i = 1, \ \ w_i \geq 0,\ i = 1, \ldots, k. \end{aligned}$$
Maximum entropy pushes the weights toward being equal.
Maximum entropy gives an exponential-form solution, is consistent (Friedlander & Gupta, IEEE IT 2005), and averages out noise.

Kernelize Linear Interpolation (Chen et al., JMLR 2009)
Let $X = [x_1, \ldots, x_k]$, rewrite the LIME problem with matrices, and change to a ridge regularizer:
$$\begin{aligned} \min_w \quad & \frac{1}{2} w^T X^T X w - x^T X w + \frac{\lambda}{2} w^T w \\ \text{subject to} \quad & w \geq 0, \ \mathbf{1}^T w = 1. \end{aligned}$$
The ridge term regularizes the variance of the weights.
Only inner products are needed, so they can be replaced with a kernel or with similarities!
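The LIME objective above is a small constrained problem; the following sketch (not from the slides) solves it numerically with SciPy's SLSQP solver, adding a tiny epsilon inside the log so that $w_i \log w_i$ stays finite at the simplex boundary. The function name and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def lime_weights(X, x, lam=1.0):
    """LIME weights: minimize ||X w - x||^2 + lam * sum_i w_i log w_i
    over the probability simplex. X is (d, k) with one neighbor per
    column; x is the (d,) test point."""
    k = X.shape[1]
    eps = 1e-12  # keeps w * log(w) finite at w = 0

    def objective(w):
        r = X @ w - x
        return r @ r + lam * np.sum(w * np.log(w + eps))

    res = minimize(objective,
                   x0=np.full(k, 1.0 / k),                  # start at uniform weights
                   bounds=[(0.0, 1.0)] * k,                 # w_i >= 0 (and <= 1)
                   constraints=[{"type": "eq",
                                 "fun": lambda w: np.sum(w) - 1.0}],
                   method="SLSQP")
    return res.x

# Usage: 3 neighbors in the plane around a test point.
X = np.array([[0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
print(lime_weights(X, x=np.array([0.4, 0.3]), lam=0.1))
```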
KRI Weights Satisfy the Design Goals
Kernel ridge interpolation (KRI) weights:
$$\begin{aligned} \min_w \quad & \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w \\ \text{subject to} \quad & w \geq 0, \ \mathbf{1}^T w = 1. \end{aligned}$$
Affinity: $s = [\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T$, so $w_i$ is high if $\psi(x, x_i)$ is high.
Diversity: $\frac{1}{2} w^T S w = \frac{1}{2} \sum_{i,j} \psi(x_i, x_j) w_i w_j$, which discourages concentrating weight on neighbors that are similar to each other.
Make $S$ PSD and the problem is a QP with box constraints; it can be solved with SMO.

Remove the constraints on the weights:
$$\arg\min_w\ \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w = (S + \lambda I)^{-1} s$$
This can be shown to be equivalent to local ridge regression: the KRR weights.

Weighted k-NN: Example 1
$$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$$
[Figure: KRI weights, $\arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w$, and KRR weights, $(S + \lambda I)^{-1} s$, plotted as functions of $\lambda$]

Weighted k-NN: Example 2
$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$$
[Figure: KRI and KRR weights vs. $\lambda$; the mutually similar neighbors $x_2$ and $x_3$ share weight, while the more diverse $x_1$ and $x_4$ receive larger weights]

Weighted k-NN: Example 3
$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$$
[Figure: KRI and KRR weights vs. $\lambda$; $x_2$, the most similar neighbor, receives the largest weight]

Results (% test error):

                                  Amazon47   Aural Sonar  Caltech101  Face Rec   Mirex   Voting
# samples                            204        100          8677        945      3090     435
# classes                             47          2           101        139        10       2
LOCAL
  k-NN                              16.95      17.00         41.55       4.23     61.21    5.80
  affinity k-NN                     15.00      15.00         39.20       4.23     61.15    5.86
  KRI k-NN (clip)                   17.68      14.00         30.13       4.15     61.20    5.29
  KRR k-NN (pinv)                   16.10      15.25         29.90       4.31     61.18    5.52
  SVM-KNN (clip)                    17.56      13.75         36.82       4.23     61.25    5.23
GLOBAL
  SVM sim-as-kernel (clip)          81.24      13.00         33.49       4.18     57.83    4.89
  SVM sim-as-feature (linear)       76.10      14.25         38.18       4.29     55.54    5.40
  SVM sim-as-feature (RBF)          75.98      14.25         38.16       3.92     55.72    5.52
  P-SVM                             70.12      14.25         34.23       4.05     63.81    5.34
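A small sketch of the two weighting schemes compared above: the closed-form KRR weights and the constrained KRI weights. The KRI problem is solved here with the generic CVXPY modeling package rather than the SMO solver mentioned on the slides, S is assumed to have already been clipped to be PSD, and the function names are mine.

```python
import numpy as np
import cvxpy as cp

def krr_weights(S, s, lam):
    """KRR weights: the unconstrained minimizer (S + lam*I)^{-1} s."""
    return np.linalg.solve(S + lam * np.eye(len(s)), s)

def kri_weights(S, s, lam):
    """KRI weights: minimize 0.5 w'Sw - s'w + 0.5*lam*||w||^2
    subject to w >= 0 and sum(w) = 1. Assumes S is symmetric PSD
    (e.g., already clipped)."""
    k = len(s)
    w = cp.Variable(k)
    objective = 0.5 * cp.quad_form(w, S) - s @ w + 0.5 * lam * cp.sum_squares(w)
    cp.Problem(cp.Minimize(objective), [w >= 0, cp.sum(w) == 1]).solve()
    return w.value

# Example 1 from the slides: S = 5*I, s = (4, 3, 2, 1).
S = 5.0 * np.eye(4)
s = np.array([4.0, 3.0, 2.0, 1.0])
print(krr_weights(S, s, lam=1.0))
print(kri_weights(S, s, lam=1.0))
```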
Approaches to Similarity-based Classification
Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Generative Classifiers
Model the probability of what you see given each class:
– linear discriminant analysis
– quadratic discriminant analysis
– Gaussian mixture models ...
Pro: produces class probabilities.
Our goal: model $P(T(s) \mid g)$, where $T(s)$ is a vector of class-descriptive statistics of $s$.
We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \ldots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class.

Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)
Model $P(T(s) \mid g)$:
– Assume the $G$ similarities are class-conditionally independent.
– Estimate $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is exponential.
– Reduce model bias by applying the model locally (local SDA).
– Reduce estimation variance by regularizing over localities.
Regularized local SDA performance: competitive.
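The slides describe SDA at a high level; the following is a simplified illustration of that flavor, with several assumptions of my own: each class centroid is taken to be the training sample with the largest total within-class similarity, each class-conditional similarity $\psi(x, \mu_h)$ is modeled as an exponential distribution matched to its empirical mean (the maximum-entropy distribution on $[0, \infty)$ with that mean), and the centroid similarities are treated as class-conditionally independent. This is not the authors' exact SDA implementation.

```python
import numpy as np

def fit_sda(S, y):
    """Simplified SDA fit from an n x n training similarity matrix S and labels y.
    Returns, for each class g: its centroid index, its prior, and the
    per-centroid exponential means E[psi(x, mu_h) | g]."""
    classes = np.unique(y)
    centroids = {g: np.where(y == g)[0][np.argmax(S[np.ix_(y == g, y == g)].sum(axis=1))]
                 for g in classes}
    priors = {g: np.mean(y == g) for g in classes}
    means = {g: {h: S[y == g, centroids[h]].mean() for h in classes} for g in classes}
    return classes, centroids, priors, means

def sda_predict(s, classes, centroids, priors, means):
    """Classify from the test similarity vector s (similarities to the n training
    samples) by maximizing the product of exponential likelihoods times the prior."""
    def log_post(g):
        lp = np.log(priors[g])
        for h in classes:
            m = max(means[g][h], 1e-12)
            t = s[centroids[h]]              # psi(x, mu_h)
            lp += -np.log(m) - t / m         # log of the Exp(mean=m) density at t
        return lp
    return classes[int(np.argmax([log_post(g) for g in classes]))]
```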
Some Conclusions
– Performance depends heavily on the oddities of each dataset.
– Weighted k-NN with affinity-diversity weights works well.
– Preliminary: regularized local SDA works well. Probabilities are useful.
– Local models are useful:
  - less approximating
  - hard to model the entire space; underlying manifold?
  - always feasible

Lots of Open Questions
– Making S PSD
– Fast k-NN search for similarities
– Similarity-based regression
– Relationship with learning on graphs
– Try it out on real data
– Fusion with Euclidean features (see our FUSION 2009 papers)
– Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
"Similarity-based Classification" by Chen et al., JMLR 2009

Training and Test Consistency
For a test sample $x$, given $s = [\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T$, shall we classify $x$ as $\hat{y} = \mathrm{sgn}((c^\star)^T s + b^\star)$?
No! If a training sample were used as a test sample, its class could change!

Data Sets
[Figures: similarity matrices and eigenvalue spectra for the Amazon, Aural Sonar, Protein, Voting, Yeast-5-7, and Yeast-5-12 data sets]

SVM Review
Empirical risk minimization (ERM) with regularization:
$$\min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2$$
Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$. [Figure: hinge loss vs. 0-1 loss as a function of $y f(x)$]
SVM primal:
$$\begin{aligned} \min_{c, b, \xi} \quad & \frac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c \\ \text{subject to} \quad & \mathrm{diag}(y)(K c + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \ \xi \geq 0. \end{aligned}$$

Learning the Kernel Matrix
Find the best K for classification, regularized toward S:
$$\min_{K \succeq 0}\ \min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$
An SVM that learns the full kernel matrix:
$$\begin{aligned} \min_{c, b, \xi, K} \quad & \frac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \\ \text{subject to} \quad & \mathrm{diag}(y)(K c + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \ \xi \geq 0, \ \ K \succeq 0. \end{aligned}$$

Related Work
SVM dual:
$$\begin{aligned} \max_{\alpha} \quad & \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha \\ \text{subject to} \quad & y^T \alpha = 0, \ \ 0 \leq \alpha \leq C \mathbf{1}. \end{aligned}$$
Robust SVM (Luss & d'Aspremont, 2007):
$$\begin{aligned} \max_{\alpha} \quad & \min_{K \succeq 0}\ \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2 \\ \text{subject to} \quad & y^T \alpha = 0, \ \ 0 \leq \alpha \leq C \mathbf{1}. \end{aligned}$$
"This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K."

Related Work
Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T \alpha = 0,\ 0 \leq \alpha \leq C \mathbf{1}\}$ and rewrite the robust SVM as
$$\max_{\alpha \in \mathcal{A}}\ \min_{K \succeq 0}\ \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2$$
Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and let $f(\mu, \nu)$ be a function on $M \times N$ that is quasiconcave in $\mu$, quasiconvex in $\nu$, upper semicontinuous in $\mu$ for each $\nu \in N$, and lower semicontinuous in $\nu$ for each $\mu \in M$. Then
$$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$$
By Sion's minimax theorem, the robust SVM is equivalent to (zero duality gap):
$$\min_{K \succeq 0}\ \max_{\alpha \in \mathcal{A}}\ \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2$$
Compare with
$$\min_{K \succeq 0}\ \min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$
[Figure: saddle-point illustration of the primal value $L(x, \lambda^\star)$, i.e. $f(x)$, versus the dual value $L(x^\star, \lambda)$, i.e. $g(\lambda)$]

Learning the Kernel Matrix
It is not trivial to directly solve:
$$\begin{aligned} \min_{c, b, \xi, K} \quad & \frac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \\ \text{subject to} \quad & \mathrm{diag}(y)(K c + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \ \xi \geq 0, \ \ K \succeq 0. \end{aligned}$$
Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then
$$\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$$
if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^\dagger z \geq 0$.
Let $z = K c$, and notice that $c^T K c = z^T K^\dagger z$ since $K K^\dagger K = K$.

However, the problem can be expressed as a convex conic program:
$$\begin{aligned} \min_{z, b, \xi, K, u, v} \quad & \frac{1}{n} \mathbf{1}^T \xi + \eta\, u + \gamma\, v \\ \text{subject to} \quad & \mathrm{diag}(y)(z + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \ \xi \geq 0, \\ & \begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0, \ \ \|K - S\|_F \leq v. \end{aligned}$$
– We can recover the optimal $c^\star$ by $c^\star = (K^\star)^\dagger z^\star$.

Learning the Spectrum Modification
Concerns about learning the full kernel matrix:
– Though the problem is convex, the number of variables is $O(n^2)$.
– The flexibility of the model may lead to overfitting.
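As a rough illustration (my own sketch, not the authors' implementation), the conic program above can be written almost verbatim in CVXPY by declaring the whole $(n{+}1) \times (n{+}1)$ block matrix as a single PSD variable and reading $K$, $z$, and $u$ off its blocks. The function name, the toy similarity matrix, and the default values of eta and gamma are assumptions; in practice the regularization weights would be chosen by cross-validation.

```python
import numpy as np
import cvxpy as cp

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    """Sketch of the convex conic program for jointly learning K (regularized
    toward S in Frobenius norm) and the SVM, using the generalized Schur
    complement: M = [[K, z], [z^T, u]] >= 0 encodes u >= c^T K c with z = K c."""
    n = len(y)
    M = cp.Variable((n + 1, n + 1), PSD=True)   # block matrix [[K, z], [z^T, u]]
    K, z, u = M[:n, :n], M[:n, n], M[n, n]
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    v = cp.Variable(nonneg=True)

    constraints = [cp.multiply(y, z + b) >= 1 - xi,   # diag(y)(Kc + b1) >= 1 - xi
                   cp.norm(K - S, "fro") <= v]
    objective = cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v)
    cp.Problem(objective, constraints).solve()

    K_star = K.value
    c_star = np.linalg.pinv(K_star) @ z.value         # recover c* = (K*)^dagger z*
    return K_star, c_star, b.value

# Tiny usage example with a toy 4 x 4 similarity matrix and labels in {-1, +1}.
S = np.array([[5., 4., 1., 0.],
              [4., 5., 0., 1.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])
y = np.array([1., 1., -1., -1.])
K_star, c_star, b_star = learn_kernel_svm(S, y)
print(np.sign(K_star @ c_star + b_star))              # training decision signs
```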