Linear hyperplanes as classifiers
Usman Roshan

Hyperplane separators
[Figure: a separating hyperplane with normal vector w; a point x, its projection x_p onto the plane, and its distance r from the plane]

Nearest mean as hyperplane separator
[Figure: class means m1 and m2; the nearest-mean decision boundary is the hyperplane through m1 + (m2 - m1)/2, perpendicular to the line joining the means]

Separating hyperplanes
Perceptron
Gradient descent
Perceptron training
Perceptron training by gradient descent
Obtaining probability from hyperplane distances

Multilayer perceptrons
• Many perceptrons with a hidden layer
• Can solve XOR and model non-linear functions
• Leads to a non-convex optimization problem solved by back propagation

Back propagation
• Illustration of back propagation
  – http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
• Many local minima

Training issues for multilayer perceptrons
• Convergence rate
  – Momentum
• Adaptive learning
• Overtraining
  – Early stopping

Separating hyperplanes
• For two sets of points there are many hyperplane separators
• Which one should we choose for classification?
• In other words, which one is most likely to produce the least error?

Separating hyperplanes
• The best hyperplane is the one that maximizes the minimum distance of all training points to the plane (Learning with Kernels, Scholkopf and Smola, 2002)
• Its expected error is at most the fraction of misclassified points plus a complexity term (Learning with Kernels, Scholkopf and Smola, 2002)

Margin of a plane
• We define the margin as the minimum distance of the training points to the plane (the distance to the closest point)
• The optimally separating plane is the one with the maximum margin

Optimally separating hyperplane
[Figure: two classes separated by the optimal hyperplane with normal vector w]

Optimally separating hyperplane
• How do we find the optimally separating hyperplane?
• Recall the distance of a point to the plane defined earlier

Distance of a point to the separating plane
• The distance r of a point x to the plane is given by

  r = \frac{w^T x + w_0}{\|w\|}

  or, taking the side of the plane into account,

  r y = \frac{w^T x + w_0}{\|w\|}

  where y is -1 if the point is on the left side of the plane and +1 otherwise.

Support vector machine: optimally separating hyperplane
• The distance of a point x (with label y) to the hyperplane is given by

  \frac{y(w^T x + w_0)}{\|w\|}

• We want this to be at least some value r:

  \frac{y(w^T x + w_0)}{\|w\|} \ge r

• By scaling w we can obtain infinitely many solutions, so we require r\|w\| = 1.
• Minimizing ||w|| then maximizes the distance, which gives us the SVM optimization problem.

Support vector machine: optimally separating hyperplane
• SVM optimization criterion:

  \min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + w_0) \ge 1 \ \text{for all } i

• We can solve this with Lagrange multipliers, which tells us that

  w = \sum_i \alpha_i y_i x_i

• The x_i for which \alpha_i is non-zero are called support vectors.

Support vector machine: optimally separating hyperplane
[Figure: the maximum-margin hyperplane with normal w; the closest points on each side lie at distance 1/||w||, so the margin width is 2/||w||]
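A minimal sketch of the optimization above, not taken from the slides: it uses scikit-learn's SVC (a wrapper around LIBSVM, mentioned later under SVM software) with a very large C to approximate the hard-margin problem on a made-up separable toy set, then reads off w = \sum_i \alpha_i y_i x_i, the support vectors, and the margin 1/||w||. The data, parameter values, and choice of library are illustrative assumptions.

```python
# A minimal sketch, not from the slides: SVC with a very large C approximates
#   min (1/2)||w||^2  subject to  y_i (w^T x_i + w_0) >= 1 for all i
# on a made-up linearly separable toy set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, -2.0], size=(20, 2)),
               rng.normal(loc=[2.0, 2.0], size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]                  # w = sum_i alpha_i y_i x_i over the support vectors
w0 = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)  # distance from the plane to the closest point

# Signed distance of each training point: y_i (w^T x_i + w_0) / ||w||
dist = y * (X @ w + w0) / np.linalg.norm(w)

print("support vector indices:", clf.support_)   # points with non-zero alpha_i
print("margin 1/||w||:", margin)
print("closest point distance:", dist.min())     # approximately equal to the margin
```

Running this, the smallest value of y_i(w^T x_i + w_0)/||w|| over the training set comes out approximately equal to 1/||w||, since the constraint holds with equality at the support vectors.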
Inseparable case
• What if there is no separating hyperplane? For example, the XOR function.
• One solution: consider all hyperplanes and select the one with the minimal number of misclassified points
• Unfortunately this is NP-complete (see the paper by Ben-David, Eiron, and Long on the course website)
• It is even NP-complete to approximate within a polynomial factor (Learning with Kernels, Scholkopf and Smola, and the paper on the website)

Inseparable case
• But if we measure error as the sum of the distances of misclassified points to the plane, then we can solve for a support vector machine in polynomial time
• Roughly speaking, the margin error bound theorem applies (Theorem 7.3, Scholkopf and Smola)
• Note that the total distance error can be considerably larger than the number of misclassified points

Optimally separating hyperplane with errors
[Figure: separating hyperplane with normal w; some points fall on the wrong side of the plane or inside the margin]

Support vector machine: optimally separating hyperplane
• In practice we allow for error terms in case there is no separating hyperplane:

  \min_{w, w_0, \xi_i} \left( \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \right) \quad \text{subject to} \quad y_i(w^T x_i + w_0) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \text{for all } i

SVM software
• Plenty of SVM software out there. Two popular packages:
  – SVM-light
  – LIBSVM

Kernels
• What if no separating hyperplane exists?
• Consider the XOR function.
• In a higher-dimensional space we can find a separating hyperplane
• Example with SVM-light

Kernels
• The solution to the SVM is obtained by applying the KKT conditions (a generalization of Lagrange multipliers). The problem to solve becomes the dual, maximized over the \alpha_i:

  L_d = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j

  subject to \sum_i \alpha_i y_i = 0 and 0 \le \alpha_i \le C

Kernels
• The previous problem can in turn be solved, again with the KKT conditions.
• The dot product can be replaced by a matrix K(i, j) = x_i^T x_j, or more generally by any positive definite kernel matrix K:

  L_d = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)

  subject to \sum_i \alpha_i y_i = 0 and 0 \le \alpha_i \le C

Kernels
• With the kernel approach we can avoid explicit calculation of features in high dimensions
• How do we find the best kernel?
• Multiple Kernel Learning (MKL) solves for K as a linear combination of base kernels
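To make the kernel step concrete, here is a small sketch that is not from the slides: it uses scikit-learn's SVC with a precomputed kernel matrix as a stand-in for the SVM solvers named above, and the kernel choice and C value are illustrative assumptions. On the XOR points, the linear kernel K(i, j) = x_i^T x_j cannot separate the classes, but replacing the dot product with a degree-2 polynomial kernel makes them separable in the implicit higher-dimensional feature space.

```python
# A minimal sketch, not from the slides: the XOR labels cannot be fit with the
# linear kernel K(i, j) = x_i^T x_j, but become separable when the dot product
# is replaced by the degree-2 polynomial kernel (x_i^T x_j + 1)^2.
# SVC with kernel="precomputed" stands in for SVM-light/LIBSVM; C is illustrative.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])          # XOR labels

# Linear kernel matrix: the dual uses K(i, j) = x_i^T x_j.
K_lin = X @ X.T
lin = SVC(kernel="precomputed", C=10.0).fit(K_lin, y)
print("linear kernel accuracy:",
      (lin.predict(K_lin) == y).mean())    # below 1.0: XOR is not linearly separable

# Degree-2 polynomial kernel: an implicit map to a higher-dimensional space
# (features proportional to 1, x1, x2, x1^2, x2^2, x1*x2) where a separating
# hyperplane does exist.
K_poly = (X @ X.T + 1.0) ** 2
poly = SVC(kernel="precomputed", C=10.0).fit(K_poly, y)
print("polynomial kernel accuracy:",
      (poly.predict(K_poly) == y).mean())  # 1.0
```

Once a kernel matrix K is in hand, the dual above is solved exactly as in the linear case; this is also the form in which MKL combines several base kernel matrices into one.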