Support Vector Machine
Rong Jin
(Core slides adapted from Andrew W. Moore, copyright © 2001, 2003.)

Linear Classifiers
[Figure: linearly separable data from two classes, marked +1 and -1.]
• How do we find a linear decision boundary that separates the data points of the two classes?
• Many boundaries separate the data perfectly. Any of these would be fine... but which is best?

Classifier Margin
• Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Maximum Margin
• The maximum margin linear classifier is the linear classifier with the maximum margin.
• This is the simplest kind of SVM (called a linear SVM).

Why Maximum Margin?
1. If we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
2. There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
3. Empirically it works very, very well.

Estimate the Margin
• What is the distance from a point x to the line w·x + b = 0?

    d(x) = |w·x + b| / ||w||

• What, then, is the classification margin for w·x + b = 0? With labels y_i in {+1, -1}, it is the distance from the boundary to the closest correctly scaled point:

    margin = min_i  y_i (w·x_i + b) / ||w||

Maximize the Classification Margin
• Maximizing the margin directly is awkward because (w, b) can be rescaled freely. Fix the scale by requiring min_i y_i (w·x_i + b) = 1; the margin is then 1/||w||, and maximizing it is equivalent to:

    minimize    (1/2) ||w||^2
    subject to  y_i (w·x_i + b) >= 1   for all i

Maximum Margin as Optimization
• This is a quadratic programming problem:
  • quadratic objective function,
  • linear equality and inequality constraints,
  • a well-studied problem in operations research.

Quadratic Programming (general form)
Find

    arg min_u  c + d^T u + (1/2) u^T R u        (quadratic criterion)

subject to n linear inequality constraints

    a_11 u_1 + a_12 u_2 + ... + a_1m u_m <= b_1
    a_21 u_1 + a_22 u_2 + ... + a_2m u_m <= b_2
    :
    a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m <= b_n

and subject to e additional linear equality constraints

    a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
    a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
    a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

Linearly Inseparable Case
[Figure: two overlapping classes; no line separates them perfectly.]
• This is going to be a problem! What should we do?
• Relax the constraints: allow each point a slack ξ_i >= 0, requiring only y_i (w·x_i + b) >= 1 - ξ_i.
• Penalize the relaxation: add C Σ_i ξ_i to the objective, giving

    minimize    (1/2) ||w||^2 + C Σ_i ξ_i
    subject to  y_i (w·x_i + b) >= 1 - ξ_i,  ξ_i >= 0

• Still a quadratic programming problem.

Support Vector Machine vs. Regularized Logistic Regression
• At the optimum, ξ_i = max(0, 1 - y_i (w·x_i + b)), so the SVM minimizes the hinge loss plus (1/2)||w||^2. Regularized logistic regression is the same template with the hinge loss replaced by the logistic loss log(1 + exp(-y_i (w·x_i + b))).

Dual Form of SVM
• Introducing Lagrange multipliers α_i for the constraints gives the dual problem:

    maximize    Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    subject to  Σ_i α_i y_i = 0,  0 <= α_i <= C

• The weight vector is w = Σ_i α_i y_i x_i; only the support vectors get α_i > 0.
• How to decide b? Any support vector x_k with 0 < α_k < C sits exactly on a margin plane (w·x_k + b = +1 for positives, w·x_k + b = -1 for negatives), so b = y_k - w·x_k.
[Figure: boundary w·x + b = 0 with margin planes w·x + b = +1 and w·x + b = -1.]

Suppose We're in 1 Dimension
• What would SVMs do with data on a line (coordinate x, origin at x = 0)?
• Not a big surprise: the boundary is the point midway between the innermost positive and negative examples, and the positive and negative "planes" are just those two points.
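The deck itself contains no code, but the soft-margin problem above is small enough to prototype directly. Below is a minimal sketch (mine, not the original authors'): rather than invoking a QP solver, it minimizes the equivalent unconstrained objective (1/2)||w||^2 + C Σ_i max(0, 1 - y_i(w·x_i + b)) by subgradient descent. All function names, hyperparameters, and the toy data are illustrative.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM by subgradient descent on the primal:
    minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)             # y_i (w.x_i + b)
        viol = margins < 1                    # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: two Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```

Solving the dual QP instead (as packages such as SVM-light do) yields the α_i directly and, as the following slides show, is what makes kernels possible.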
Harder 1-Dimensional Dataset
• Now the negative points sit between the positive points (data still on a line through x = 0): no single threshold on x separates the classes.
• The fix: map each point to a higher-dimensional feature vector, e.g. z = (x, x^2). In the (x, x^2) plane the two classes become linearly separable, and the linear SVM machinery applies unchanged.

Common SVM Basis Functions
• Polynomial terms of x of degree 1 to q.
• Radial (Gaussian) basis functions.

Quadratic Basis Functions
For an input x = (x_1, ..., x_m), the quadratic basis vector is

    Φ(x) = ( 1,                                            constant term
             √2 x_1, ..., √2 x_m,                          linear terms
             x_1^2, ..., x_m^2,                            pure quadratic terms
             √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_{m-1} x_m ) quadratic cross-terms

• Number of terms (assuming m input dimensions): (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m^2/2 for large m.

Dual Form of SVM (with basis functions)
• Replace each x_i by Φ(x_i). The dual problem touches the data only through dot products, which become Φ(x_i)·Φ(x_j), and the classifier becomes sign(Σ_i α_i y_i Φ(x_i)·Φ(x) + b).

Quadratic Dot Products
Expanding the dot product of two quadratic basis vectors:

    Φ(a)·Φ(b) = 1 + 2 Σ_i a_i b_i + Σ_i a_i^2 b_i^2 + 2 Σ_i Σ_{j>i} a_i a_j b_i b_j

Just out of casual, innocent interest, let's look at another function of a and b:

    (a·b + 1)^2 = (a·b)^2 + 2 a·b + 1
                = (Σ_i a_i b_i)^2 + 2 Σ_i a_i b_i + 1
                = Σ_i Σ_j a_i a_j b_i b_j + 2 Σ_i a_i b_i + 1
                = Σ_i a_i^2 b_i^2 + 2 Σ_i Σ_{j>i} a_i a_j b_i b_j + 2 Σ_i a_i b_i + 1
                = Φ(a)·Φ(b)

• So a dot product in the ~m^2/2-dimensional quadratic feature space costs only O(m) arithmetic. (A numeric check of this identity appears below, after the logistic-regression sketch.)

Kernel Trick
• In the dual, replace every dot product x_i·x_j with a kernel function K(x_i, x_j). The classifier becomes f(x) = sign(Σ_i α_i y_i K(x_i, x) + b), and Φ never has to be written down explicitly.

SVM Kernel Functions
• Polynomial kernel function: K(x, y) = (x·y + 1)^q.
• Radial basis kernel function (a universal kernel): K(x, y) = exp(-||x - y||^2 / (2σ^2)).

Kernel Tricks
• We are replacing the dot product with a kernel function.
• Not all functions are kernel functions: a candidate must correspond to a dot product in some feature space. How do we check?

Mercer's condition: to expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Φ(x)·Φ(y), k(x, y) has to be a positive semi-definite function; that is, for any function f(x) for which ∫ f^2(x) dx is finite, the following inequality holds:

    ∫∫ f(x) k(x, y) f(y) dx dy >= 0

Kernel Tricks (why bother)
• They introduce nonlinearity into the model.
• They are computationally efficient: one kernel evaluation stands in for a dot product in a huge (even infinite-dimensional) feature space.

Nonlinear Kernel (I) and (II)
[Figures: nonlinear decision boundaries produced by kernel SVMs on 2-d datasets.]

Reproducing Kernel Hilbert Space (RKHS)
• Eigen-decomposition of the kernel: k(x, y) = Σ_i γ_i φ_i(x) φ_i(y), with γ_i >= 0.
• Elements of the space H: functions of the form f(x) = Σ_i w_i φ_i(x).
• Reproducing property: ⟨f, k(x, ·)⟩_H = f(x).

Representer Theorem
• The minimizer over H of a regularized empirical risk can be written as a kernel expansion over the training examples:

    f(x) = Σ_{i=1}^n α_i k(x_i, x)

Kernelize Logistic Regression
• How can we introduce nonlinearity into the logistic regression model?
• Use the representer theorem: replace the linear score w·x with f(x) = Σ_i α_i k(x_i, x) and learn the coefficients α_i, giving p(y | x) = 1 / (1 + exp(-y f(x))). A sketch follows.
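A sketch of the kernelized logistic regression just described, again mine rather than the deck's: it assumes an RBF kernel and labels in {-1, +1}, and runs plain gradient descent on the regularized negative log-likelihood of f(x) = Σ_i α_i k(x_i, x). Function names and step sizes are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_kernel_logreg(X, y, sigma=1.0, lam=0.1, lr=0.05, epochs=500):
    """Fit alpha for f(x) = sum_i alpha_i k(x_i, x) by gradient descent on
    sum_i log(1 + exp(-y_i f(x_i))) + (lam/2) alpha^T K alpha."""
    K = rbf_kernel(X, X, sigma)
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(y * (K @ alpha)))   # p_i = prob. assigned to the wrong class
        alpha -= lr * (-K @ (y * p) + lam * (K @ alpha))
    return alpha

def predict_kernel_logreg(alpha, X_train, X_new, sigma=1.0):
    """Scores f(x) for new points; sign(f) is the predicted label."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha
```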
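And, as promised, a quick numeric check of the quadratic dot-product identity (a·b + 1)^2 = Φ(a)·Φ(b), with Φ built explicitly as on the Quadratic Basis Functions slide (the helper name is made up):

```python
import numpy as np

def quadratic_phi(x):
    """Explicit quadratic basis: constant, sqrt(2)*linear, pure
    quadratic, and sqrt(2)*cross terms, as on the slide."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), rng.normal(size=5)
print((a @ b + 1) ** 2)                     # kernel form: O(m) work
print(quadratic_phi(a) @ quadratic_phi(b))  # explicit form: O(m^2) work, same value
```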
Diffusion Kernel
• A kernel function describes the correlation, or similarity, between two data points.
• Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition. How can we generate a kernel function from this similarity function? A graph-theory approach...
• Create a graph over all the training examples: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y).
• Graph Laplacian: L_ij = s(x_i, x_j) for i ≠ j, with diagonal L_ii = -Σ_{j≠i} s(x_i, x_j) so that each row sums to zero. This matrix is negative semi-definite.
• Consider a simple Laplacian, and then L^2, L^3, ... What do these matrices represent? Entry (i, j) of L^k aggregates the edge weights along paths of length k between vertices i and j: similarity diffusing through intermediate points.

A diffusion kernel:

    K = e^{βL} = I + βL + (β^2/2!) L^2 + (β^3/3!) L^3 + ...

Diffusion Kernel: Properties
• Positive definite: if L = U Λ U^T, then K = U e^{βΛ} U^T has eigenvalues e^{βλ_i} > 0, so K is a legitimate kernel even though s itself was not.
• How to compute the diffusion kernel? Through exactly that eigen-decomposition of L. (A numeric sketch appears at the end of these notes.)

Doing Multi-class Classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done?
• Answer: with output arity N, learn N SVMs:
    SVM 1 learns "Output == 1" vs. "Output != 1"
    SVM 2 learns "Output == 2" vs. "Output != 2"
    :
    SVM N learns "Output == N" vs. "Output != N"
• To predict the output for a new input, predict with each SVM and choose the class whose SVM puts the prediction the furthest into the positive region. (A sketch of this one-vs-rest scheme also appears at the end of these notes.)

Kernel Learning
• What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels?

References
• An excellent tutorial on VC-dimension and support vector machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM bible (not for beginners, myself included): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
• Software: SVM-light, http://svmlight.joachims.org/, free download.
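The diffusion-kernel sketch promised earlier. It assumes the exponential construction K = e^{βL} described above (the Kondor-Lafferty diffusion kernel); the 3-point similarity matrix and the value of β are invented for illustration.

```python
import numpy as np

def diffusion_kernel(S, beta=0.5):
    """K = exp(beta * L), where L is the graph Laplacian built from a
    symmetric, non-negative similarity matrix S (zero diagonal):
    off-diagonal L_ij = S_ij, diagonal chosen so each row sums to zero."""
    L = S - np.diag(S.sum(axis=1))            # negative semi-definite
    evals, evecs = np.linalg.eigh(L)          # L = U diag(evals) U^T
    return evecs @ np.diag(np.exp(beta * evals)) @ evecs.T

# Three points with an arbitrary similarity that need not obey Mercer.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.5],
              [0.1, 0.5, 0.0]])
K = diffusion_kernel(S)
print(np.linalg.eigvalsh(K))   # all eigenvalues positive: K is a valid kernel
```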
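Finally, the one-vs-rest scheme from the multi-class slide. This sketch repeats the subgradient trainer from the first code block so it stays self-contained; everything here is illustrative rather than the deck's own recipe.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Same soft-margin subgradient sketch as in the first code block."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w, b

def one_vs_rest_fit(X, y, classes):
    """One SVM per class k, trained on 'Output==k' vs. 'Output!=k'."""
    return {k: train_linear_svm(X, np.where(y == k, 1, -1)) for k in classes}

def one_vs_rest_predict(models, X):
    """Pick the class whose SVM pushes the point furthest into the
    positive region, i.e. the largest score w.x + b."""
    classes = list(models)
    scores = np.vstack([X @ w + b for (w, b) in models.values()])
    return np.array([classes[i] for i in scores.argmax(axis=0)])
```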