Support Vector Machines

Binary classification
Given training data (x_i, y_i) for i = 1 ... N, with x_i ∈ R^d and y_i ∈ {−1, 1}, learn a classifier f(x) such that
  f(x_i) ≥ 0 if y_i = +1, and f(x_i) < 0 if y_i = −1,
i.e. y_i f(x_i) > 0 for a correct classification.

Linear separability
[Figure: example datasets that are linearly separable and not linearly separable]

Linear classifiers
A linear classifier has the form f(x) = w^T x + b
• w is known as the weight vector and is the normal to the line; b is the bias
• in 2D the discriminant is a line
[Figure: the line f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]

Linear classifiers
A linear classifier has the form f(x) = w^T x + b
• in 3D the discriminant is a plane, and in nD it is a hyperplane
• for a K-NN classifier it was necessary to 'carry' the training data
• for a linear classifier, the training data is used to learn w and then discarded; only w is needed for classifying new data

The Perceptron Classifier (Frank Rosenblatt, 1928–1971)
Given linearly separable data x_i labelled into two categories y_i ∈ {−1, 1}, find a weight vector w such that the discriminant function
  f(x_i) = w^T x_i + b
separates the categories for i = 1, ..., N
• how can we find this separating hyperplane?

The Perceptron Algorithm
Write the classifier as f(x_i) = w̃^T x̃_i + w_0 = w^T x_i, where w = (w̃, w_0) and x_i = (x̃_i, 1)
• Initialize w = 0
• Cycle through the data points {x_i, y_i}
  • if x_i is misclassified (i.e. sign(f(x_i)) ≠ y_i) then update w ← w + η y_i x_i, for some learning rate η > 0
• Until all the data is correctly classified
[Figure: illustration of two steps of the Perceptron learning algorithm (Bishop, Fig. 4.7); green points are misclassified]

What is the best w?
• maximum margin solution: most stable under perturbations of the inputs

Support Vector Machine (linearly separable data)
  f(x) = Σ_i α_i y_i (x_i^T x) + b, where the sum runs over the support vectors
[Figure: linearly separable data with the separating hyperplane w^T x + b = 0, its support vectors, and its distance b/||w|| from the origin]

SVM – sketch derivation
• Since w^T x + b = 0 and c(w^T x + b) = 0 define the same plane, we have the freedom to choose the normalization of w
• Choose the normalization such that w^T x_+ + b = +1 and w^T x_− + b = −1 for the positive and negative support vectors respectively
• Then the margin is given by
  (w / ||w||)^T (x_+ − x_−) = (w^T x_+ − w^T x_−) / ||w|| = 2 / ||w||
[Figure: the planes w^T x + b = +1, w^T x + b = 0, w^T x + b = −1, with support vectors on the margin; Margin = 2 / ||w||]

SVM – Optimization
• Learning the SVM can be formulated as an optimization:
  max_w 2 / ||w|| subject to w^T x_i + b ≥ 1 if y_i = +1, and w^T x_i + b ≤ −1 if y_i = −1, for i = 1 ... N
• Or equivalently
  min_w ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for i = 1 ... N
• This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum

The problem statement
Given a set of training data {(x_i, d_i), i = 1, ..., N} (labels now written d_i ∈ {−1, 1}), minimize
  Φ(w) = (1/2) w^T w
subject to the constraint that
  d_i (w^T x_i + b) ≥ 1, i = 1, ..., N

Looks like a job for LAGRANGE MULTIPLIERS!
  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i [ d_i (w^T x_i + b) − 1 ]
Setting the derivatives to zero, ∇_w J = 0 and ∂J/∂b = 0, so that
  w = Σ_{i=1}^N α_i d_i x_i    (3)
and
  Σ_{i=1}^N α_i d_i = 0    (4)

Now for the DUAL PROBLEM
Expanding the Lagrangian,
  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i
Note that from (4) the third term is zero. Using Eq. (3):
  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j
Let's enjoy the moment!
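Before restating the dual formally, here is a quick numerical sanity check of conditions (3) and (4). This is a sketch, not part of the original derivation: it assumes NumPy and scikit-learn are available, uses a made-up toy dataset, and approximates the hard-margin problem above with a very large C.

```python
# Sketch: check (3) w = sum_i alpha_i d_i x_i and (4) sum_i alpha_i d_i = 0
# on toy linearly separable data. Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, labels d_i in {-1, +1}
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
d = np.hstack([-np.ones(20), np.ones(20)])

# A very large C approximates the hard-margin SVM described above
svm = SVC(kernel="linear", C=1e6).fit(X, d)

alpha_d = svm.dual_coef_.ravel()   # products alpha_i * d_i, support vectors only
sv = svm.support_vectors_          # the corresponding x_i

# (3): the weight vector is a linear combination of the support vectors
print(np.allclose(alpha_d @ sv, svm.coef_.ravel()))  # expected: True

# (4): the multipliers, weighted by the labels, sum to zero
print(np.isclose(alpha_d.sum(), 0.0))                # expected: True
```

The check also illustrates a point that recurs below: only the support vectors (the points with α_i > 0) contribute to w.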
DUAL PROBLEM
  max_α Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j
subject to the constraints
  Σ_{i=1}^N α_i d_i = 0 and α_i ≥ 0, i = 1, ..., N
This is easier to solve than the original. Furthermore, it only depends on the training samples {(x_i, d_i), i = 1, ..., N}.
Once you have the α_i's, get w from
  w = Σ_{i=1}^N α_i d_i x_i
and b from a support vector x_s that has d_s = 1:
  b = 1 − w^T x_s
Almost done ...

KERNEL FUNCTIONS
Now the big bonus occurs because all the machinery we have developed will work if we map the points x_i to a higher-dimensional space, provided we observe certain conventions.
Let φ(x_i) be a function that does the mapping. So the new hyperplane is
  Σ_i w_i φ_i(x) + b = 0
For simplicity in notation, define φ(x) = (φ_0(x), φ_1(x), φ_2(x), ..., φ_{m1}(x)), where m1 is the new dimension size and by convention φ_0(x) = 1.
Then all the work we did with x works with φ(x). The only issue is that instead of x_i^T x_j we have a kernel function K(x_i, x_j), where
  K(x_i, x_j) = φ(x_i)^T φ(x_j)
and kernel functions need to have certain nice properties. :)

Examples
Polynomials: K(x_i, x_j) = (x_i^T x_j + 1)^p
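As a usage sketch (not from the slides), the polynomial kernel above maps directly onto scikit-learn's SVC: its 'poly' kernel is (gamma · x_i^T x_j + coef0)^degree, so gamma = 1 and coef0 = 1 reproduce (x_i^T x_j + 1)^p. The toy circular dataset below is an assumption, chosen so that a linear classifier fails but a degree-2 polynomial kernel can fit the boundary.

```python
# Sketch: training an SVM with the polynomial kernel K(x_i, x_j) = (x_i^T x_j + 1)^p.
# Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Toy data that is not linearly separable: label points by distance from the origin
X = rng.uniform(-1.0, 1.0, (200, 2))
y = np.where((X ** 2).sum(axis=1) > 0.5, 1, -1)

# gamma=1.0, coef0=1.0, degree=2 gives exactly the kernel (x_i^T x_j + 1)^2
svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(X, y)

print("training accuracy:", svm.score(X, y))
print("number of support vectors:", len(svm.support_vectors_))
```

Nothing in the dual changes except that x_i^T x_j is replaced by K(x_i, x_j), which is exactly why the same machinery carries over to the mapped space.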