Support Vector Machine
Rong Jin
(Core slides adapted from Andrew W. Moore, copyright © 2001, 2003.)

Linear Classifiers
[Figure: linearly separable data from two classes, marked +1 and -1.]
• How do we find a linear decision boundary that separates the data points of the two classes?
• Many boundaries separate the data perfectly. Any of these would be fine... but which is best?

Classifier Margin
• Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Maximum Margin
• The maximum margin linear classifier is the linear classifier with the maximum margin.
• This is the simplest kind of SVM (called a linear SVM).

Why Maximum Margin?
1. If we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
2. There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
3. Empirically it works very, very well.

Estimate the Margin
• What is the distance from a point x to the line w·x + b = 0?

    d(x) = |w·x + b| / ||w||

• What, then, is the classification margin for w·x + b = 0? With labels y_i in {+1, -1}, it is the distance from the boundary to the closest correctly scaled point:

    margin = min_i  y_i (w·x_i + b) / ||w||

Maximize the Classification Margin
• Maximizing the margin directly is awkward because (w, b) can be rescaled freely. Fix the scale by requiring min_i y_i (w·x_i + b) = 1; the margin is then 1/||w||, and maximizing it is equivalent to:

    minimize    (1/2) ||w||^2
    subject to  y_i (w·x_i + b) >= 1   for all i

Maximum Margin as Optimization
• This is a quadratic programming problem:
  • quadratic objective function,
  • linear equality and inequality constraints,
  • a well-studied problem in operations research.

Quadratic Programming (general form)
Find

    arg min_u  c + d^T u + (1/2) u^T R u        (quadratic criterion)

subject to n linear inequality constraints

    a_11 u_1 + a_12 u_2 + ... + a_1m u_m <= b_1
    a_21 u_1 + a_22 u_2 + ... + a_2m u_m <= b_2
    :
    a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m <= b_n

and subject to e additional linear equality constraints

    a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
    a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
    a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

Linearly Inseparable Case
[Figure: two overlapping classes; no line separates them perfectly.]
• This is going to be a problem! What should we do?
• Relax the constraints: allow each point a slack ξ_i >= 0, requiring only y_i (w·x_i + b) >= 1 - ξ_i.
• Penalize the relaxation: add C Σ_i ξ_i to the objective, giving

    minimize    (1/2) ||w||^2 + C Σ_i ξ_i
    subject to  y_i (w·x_i + b) >= 1 - ξ_i,  ξ_i >= 0

• Still a quadratic programming problem.

Support Vector Machine vs. Regularized Logistic Regression
• At the optimum, ξ_i = max(0, 1 - y_i (w·x_i + b)), so the SVM minimizes the hinge loss plus (1/2)||w||^2. Regularized logistic regression is the same template with the hinge loss replaced by the logistic loss log(1 + exp(-y_i (w·x_i + b))).

Dual Form of SVM
• Introducing Lagrange multipliers α_i for the constraints gives the dual problem:

    maximize    Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    subject to  Σ_i α_i y_i = 0,  0 <= α_i <= C

• The weight vector is w = Σ_i α_i y_i x_i; only the support vectors get α_i > 0.
• How to decide b? Any support vector x_k with 0 < α_k < C sits exactly on a margin plane (w·x_k + b = +1 for positives, w·x_k + b = -1 for negatives), so b = y_k - w·x_k.
[Figure: boundary w·x + b = 0 with margin planes w·x + b = +1 and w·x + b = -1.]

Suppose We're in 1 Dimension
• What would SVMs do with data on a line (coordinate x, origin at x = 0)?
• Not a big surprise: the boundary is the point midway between the innermost positive and negative examples, and the positive and negative "planes" are just those two points.
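The deck itself contains no code, but the soft-margin problem above is small enough to prototype directly. Below is a minimal sketch (mine, not the original authors'): rather than invoking a QP solver, it minimizes the equivalent unconstrained objective (1/2)||w||^2 + C Σ_i max(0, 1 - y_i(w·x_i + b)) by subgradient descent. All function names, hyperparameters, and the toy data are illustrative.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM by subgradient descent on the primal:
    minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)             # y_i (w.x_i + b)
        viol = margins < 1                    # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: two Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```

Solving the dual QP instead (as packages such as SVM-light do) yields the α_i directly and, as the following slides show, is what makes kernels possible.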
Harder 1-Dimensional Dataset
• Now the negative points sit between the positive points (data still on a line through x = 0): no single threshold on x separates the classes.
• The fix: map each point to a higher-dimensional feature vector, e.g. z = (x, x^2). In the (x, x^2) plane the two classes become linearly separable, and the linear SVM machinery applies unchanged.

Common SVM Basis Functions
• Polynomial terms of x of degree 1 to q.
• Radial (Gaussian) basis functions.

Quadratic Basis Functions
For an input x = (x_1, ..., x_m), the quadratic basis vector is

    Φ(x) = ( 1,                                            constant term
             √2 x_1, ..., √2 x_m,                          linear terms
             x_1^2, ..., x_m^2,                            pure quadratic terms
             √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_{m-1} x_m ) quadratic cross-terms

• Number of terms (assuming m input dimensions): (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m^2/2 for large m.

Dual Form of SVM (with basis functions)
• Replace each x_i by Φ(x_i). The dual problem touches the data only through dot products, which become Φ(x_i)·Φ(x_j), and the classifier becomes sign(Σ_i α_i y_i Φ(x_i)·Φ(x) + b).

Quadratic Dot Products
Expanding the dot product of two quadratic basis vectors:

    Φ(a)·Φ(b) = 1 + 2 Σ_i a_i b_i + Σ_i a_i^2 b_i^2 + 2 Σ_i Σ_{j>i} a_i a_j b_i b_j

Just out of casual, innocent interest, let's look at another function of a and b:

    (a·b + 1)^2 = (a·b)^2 + 2 a·b + 1
                = (Σ_i a_i b_i)^2 + 2 Σ_i a_i b_i + 1
                = Σ_i Σ_j a_i a_j b_i b_j + 2 Σ_i a_i b_i + 1
                = Σ_i a_i^2 b_i^2 + 2 Σ_i Σ_{j>i} a_i a_j b_i b_j + 2 Σ_i a_i b_i + 1
                = Φ(a)·Φ(b)

• So a dot product in the ~m^2/2-dimensional quadratic feature space costs only O(m) arithmetic. (A numeric check of this identity appears below, after the logistic-regression sketch.)

Kernel Trick
• In the dual, replace every dot product x_i·x_j with a kernel function K(x_i, x_j). The classifier becomes f(x) = sign(Σ_i α_i y_i K(x_i, x) + b), and Φ never has to be written down explicitly.

SVM Kernel Functions
• Polynomial kernel function: K(x, y) = (x·y + 1)^q.
• Radial basis kernel function (a universal kernel): K(x, y) = exp(-||x - y||^2 / (2σ^2)).

Kernel Tricks
• We are replacing the dot product with a kernel function.
• Not all functions are kernel functions: a candidate must correspond to a dot product in some feature space. How do we check?

Mercer's condition: to expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Φ(x)·Φ(y), k(x, y) has to be a positive semi-definite function; that is, for any function f(x) for which ∫ f^2(x) dx is finite, the following inequality holds:

    ∫∫ f(x) k(x, y) f(y) dx dy >= 0

Kernel Tricks (why bother)
• They introduce nonlinearity into the model.
• They are computationally efficient: one kernel evaluation stands in for a dot product in a huge (even infinite-dimensional) feature space.

Nonlinear Kernel (I) and (II)
[Figures: nonlinear decision boundaries produced by kernel SVMs on 2-d datasets.]

Reproducing Kernel Hilbert Space (RKHS)
• Eigen-decomposition of the kernel: k(x, y) = Σ_i γ_i φ_i(x) φ_i(y), with γ_i >= 0.
• Elements of the space H: functions of the form f(x) = Σ_i w_i φ_i(x).
• Reproducing property: ⟨f, k(x, ·)⟩_H = f(x).

Representer Theorem
• The minimizer over H of a regularized empirical risk can be written as a kernel expansion over the training examples:

    f(x) = Σ_{i=1}^n α_i k(x_i, x)

Kernelize Logistic Regression
• How can we introduce nonlinearity into the logistic regression model?
• Use the representer theorem: replace the linear score w·x with f(x) = Σ_i α_i k(x_i, x) and learn the coefficients α_i, giving p(y | x) = 1 / (1 + exp(-y f(x))). A sketch follows.
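A sketch of the kernelized logistic regression just described, again mine rather than the deck's: it assumes an RBF kernel and labels in {-1, +1}, and runs plain gradient descent on the regularized negative log-likelihood of f(x) = Σ_i α_i k(x_i, x). Function names and step sizes are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_kernel_logreg(X, y, sigma=1.0, lam=0.1, lr=0.05, epochs=500):
    """Fit alpha for f(x) = sum_i alpha_i k(x_i, x) by gradient descent on
    sum_i log(1 + exp(-y_i f(x_i))) + (lam/2) alpha^T K alpha."""
    K = rbf_kernel(X, X, sigma)
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(y * (K @ alpha)))   # p_i = prob. assigned to the wrong class
        alpha -= lr * (-K @ (y * p) + lam * (K @ alpha))
    return alpha

def predict_kernel_logreg(alpha, X_train, X_new, sigma=1.0):
    """Scores f(x) for new points; sign(f) is the predicted label."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha
```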
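And, as promised, a quick numeric check of the quadratic dot-product identity (a·b + 1)^2 = Φ(a)·Φ(b), with Φ built explicitly as on the Quadratic Basis Functions slide (the helper name is made up):

```python
import numpy as np

def quadratic_phi(x):
    """Explicit quadratic basis: constant, sqrt(2)*linear, pure
    quadratic, and sqrt(2)*cross terms, as on the slide."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), rng.normal(size=5)
print((a @ b + 1) ** 2)                     # kernel form: O(m) work
print(quadratic_phi(a) @ quadratic_phi(b))  # explicit form: O(m^2) work, same value
```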
Diffusion Kernel
• A kernel function describes the correlation, or similarity, between two data points.
• Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition. How can we generate a kernel function from this similarity function? A graph-theory approach...
• Create a graph over all the training examples: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y).
• Graph Laplacian: L_ij = s(x_i, x_j) for i ≠ j, with diagonal L_ii = -Σ_{j≠i} s(x_i, x_j) so that each row sums to zero. This matrix is negative semi-definite.
• Consider a simple Laplacian, and then L^2, L^3, ... What do these matrices represent? Entry (i, j) of L^k aggregates the edge weights along paths of length k between vertices i and j: similarity diffusing through intermediate points.

A diffusion kernel:

    K = e^{βL} = I + βL + (β^2/2!) L^2 + (β^3/3!) L^3 + ...

Diffusion Kernel: Properties
• Positive definite: if L = U Λ U^T, then K = U e^{βΛ} U^T has eigenvalues e^{βλ_i} > 0, so K is a legitimate kernel even though s itself was not.
• How to compute the diffusion kernel? Through exactly that eigen-decomposition of L. (A numeric sketch appears at the end of these notes.)

Doing Multi-class Classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done?
• Answer: with output arity N, learn N SVMs:
    SVM 1 learns "Output == 1" vs. "Output != 1"
    SVM 2 learns "Output == 2" vs. "Output != 2"
    :
    SVM N learns "Output == N" vs. "Output != N"
• To predict the output for a new input, predict with each SVM and choose the class whose SVM puts the prediction the furthest into the positive region. (A sketch of this one-vs-rest scheme also appears at the end of these notes.)

Kernel Learning
• What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels?

References
• An excellent tutorial on VC-dimension and support vector machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM bible (not for beginners, myself included): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
• Software: SVM-light, http://svmlight.joachims.org/, free download.
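The diffusion-kernel sketch promised earlier. It assumes the exponential construction K = e^{βL} described above (the Kondor-Lafferty diffusion kernel); the 3-point similarity matrix and the value of β are invented for illustration.

```python
import numpy as np

def diffusion_kernel(S, beta=0.5):
    """K = exp(beta * L), where L is the graph Laplacian built from a
    symmetric, non-negative similarity matrix S (zero diagonal):
    off-diagonal L_ij = S_ij, diagonal chosen so each row sums to zero."""
    L = S - np.diag(S.sum(axis=1))            # negative semi-definite
    evals, evecs = np.linalg.eigh(L)          # L = U diag(evals) U^T
    return evecs @ np.diag(np.exp(beta * evals)) @ evecs.T

# Three points with an arbitrary similarity that need not obey Mercer.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.5],
              [0.1, 0.5, 0.0]])
K = diffusion_kernel(S)
print(np.linalg.eigvalsh(K))   # all eigenvalues positive: K is a valid kernel
```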
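Finally, the one-vs-rest scheme from the multi-class slide. This sketch repeats the subgradient trainer from the first code block so it stays self-contained; everything here is illustrative rather than the deck's own recipe.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Same soft-margin subgradient sketch as in the first code block."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w, b

def one_vs_rest_fit(X, y, classes):
    """One SVM per class k, trained on 'Output==k' vs. 'Output!=k'."""
    return {k: train_linear_svm(X, np.where(y == k, 1, -1)) for k in classes}

def one_vs_rest_predict(models, X):
    """Pick the class whose SVM pushes the point furthest into the
    positive region, i.e. the largest score w.x + b."""
    classes = list(models)
    scores = np.vstack([X @ w + b for (w, b) in models.values()])
    return np.array([classes[i] for i in scores.argmax(axis=0)])
```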