Support Vector Machine (SVM)
Based on the Nello Cristianini presentation: http://www.support-vector.net/tutorial.html

Basic Idea
• Use a Linear Learning Machine (LLM).
• Overcome the linearity constraint: map non-linearly to a higher dimension.
• Select between hyperplanes: use the margin as a test.
• Generalization depends on the margin.

General idea
[Figure: the original problem and the transformed (embedded) problem]

Kernel Based Algorithms
• Two separate components:
• Learning algorithm: runs in an embedded space.
• Kernel function: performs the embedding.

Basic Example: Kernel Perceptron
• Hyperplane classification:
  f(x) = <w, x> + b = <w', x'>
  h(x) = sign(f(x))
• Perceptron algorithm:
  Sample: (xi, ti), ti ∈ {-1, +1}
  IF ti <wk, xi> < 0 THEN /* error */
    wk+1 = wk + ti xi
    k = k + 1

Recall
• Margin of hyperplane w:
  D(w) = min_{(xi,ti) ∈ S} ti <w, xi> / ||w||
• Mistake bound:
  M ≤ ( xmax / D(w) )^2, where xmax = max_i ||xi||

Observations
• The solution is a linear combination of the inputs: w = Σi ai ti xi, where ai ≥ 0.
• Mistake driven: only points on which we make a mistake have influence!
• Support vectors: the points with non-zero ai.

Dual representation
• Rewrite the basic function:
  f(x) = <w, x> + b = Σi ai ti <xi, x> + b
  w = Σi ai ti xi
• Change the update rule:
  IF tj ( Σi ai ti <xi, xj> + b ) < 0 THEN aj = aj + 1
• Observation: the data appears only inside inner products!

Limitations of the Perceptron
• Only linear separations.
• Only converges for linearly separable data.
• Only defined on vectorial data.

The idea of a Kernel
• Embed the data in a different space.
• Possibly of higher dimension.
• Linearly separable in the new space.
[Figure: original problem vs. transformed problem]

Kernel Mapping
• Need only to compute inner products.
• Mapping: M(x)
• Kernel: K(x, y) = <M(x), M(y)>
• Dimensionality of M(x): unimportant!
• Need only to compute K(x, y).
• Using it in the embedded space: replace <x, y> by K(x, y).

Example
• x = (x1, x2), z = (z1, z2), K(x, z) = (<x, z>)^2
• (<x, z>)^2 = (x1 z1 + x2 z2)^2
  = x1^2 z1^2 + x2^2 z2^2 + 2 x1 z1 x2 z2
  = < [x1^2, x2^2, √2 x1 x2], [z1^2, z2^2, √2 z1 z2] >
  = <M(x), M(z)>

Polynomial Kernel
[Figures: original problem, transformed problem, kernel matrix]

Examples of Basic Kernels
• Polynomial: K(x, z) = (<x, z>)^d
• Gaussian: K(x, z) = exp{ -||x - z||^2 / 2σ^2 }

Kernels: Closure Properties
• K(x, z) = K1(x, z) + c   (c ≥ 0)
• K(x, z) = c · K1(x, z)   (c > 0)
• K(x, z) = K1(x, z) · K2(x, z)
• K(x, z) = K1(x, z) + K2(x, z)
• Create new kernels using basic ones!

Support Vector Machines
• Linear Learning Machines (LLM).
• Use the dual representation.
• Work in the kernel-induced feature space:
  f(x) = Σi ai ti K(xi, x) + b
• Which hyperplane to select?

Generalization of SVM
• PAC theory: error = O( VCdim / m )
• Problem: VCdim >> m
• No preference between consistent hyperplanes.

Margin based bounds
• H: basic hypothesis class
• conv(H): finite convex combinations of H
• D: distribution over X × {+1, -1}
• S: sample of size m over D

Margin based bounds
• THEOREM: for every f in conv(H), with probability 1 - δ over the sample S,
  PrD[ y f(x) ≤ 0 ] ≤ PrS[ y f(x) ≤ θ ] + O( (1/√m) · sqrt( (log m · log|H|) / θ^2 + log(1/δ) ) )

Maximal Margin Classifier
• Maximizes the margin.
• Minimizes the overfitting due to margin selection.
• Increases the margin rather than reducing the dimensionality.

SVM: Support Vectors
[Figure: support vectors of the separating hyperplane]

Margins
• Geometric margin: min_i ti f(xi) / ||w||
• Functional margin: min_i ti f(xi)

Main trick in SVM
• Insist on a functional margin of at least 1; support vectors have margin exactly 1.
• Then the geometric margin = 1 / ||w||.
• Proof: the functional margin is min_i ti f(xi) = 1, so the geometric margin is min_i ti f(xi) / ||w|| = 1 / ||w||.

SVM criteria
• Find a hyperplane (w, b)
• that minimizes ||w||^2 = <w, w> (equivalently, maximizes the geometric margin 1 / ||w||)
• subject to: for all i, ti ( <w, xi> + b ) ≥ 1.

Quadratic Programming
• Quadratic goal function.
• Linear constraints.
• Unique optimum (the problem is convex).
• Polynomial-time algorithms.
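To make the dual representation concrete, here is a minimal NumPy sketch of the kernel perceptron described above, using the degree-2 polynomial kernel from the Example slide. The function names (poly2_kernel, kernel_perceptron, predict), the bias update, and the "≤ 0" mistake test (so the all-zero start counts as a mistake) are my own additions, not part of the tutorial.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel from the Example slide: K(x, z) = (<x, z>)^2."""
    return np.dot(x, z) ** 2

def kernel_perceptron(X, t, kernel, epochs=100):
    """Dual (kernel) perceptron: one coefficient a_i per training point.
    f(x) = sum_i a_i t_i K(x_i, x) + b ; on a mistake on x_j, set a_j = a_j + 1."""
    m = len(X)
    a = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for j in range(m):
            f_j = sum(a[i] * t[i] * kernel(X[i], X[j]) for i in range(m)) + b
            if t[j] * f_j <= 0:     # mistake (<= so the all-zero start triggers updates)
                a[j] += 1.0         # dual update from the slide: a_j = a_j + 1
                b += t[j]           # simple bias update (my addition, not in the slide)
                mistakes += 1
        if mistakes == 0:           # no mistakes: separated in the embedded space
            break
    return a, b

def predict(X_train, t_train, a, b, kernel, x):
    """h(x) = sign( sum_i a_i t_i K(x_i, x) + b )."""
    f = sum(a[i] * t_train[i] * kernel(X_train[i], x)
            for i in range(len(X_train))) + b
    return np.sign(f)

if __name__ == "__main__":
    # XOR-like data: not linearly separable in the input space, but separable
    # after the implicit degree-2 embedding M(x) = [x1^2, x2^2, sqrt(2) x1 x2].
    X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
    t = np.array([1, 1, -1, -1])
    a, b = kernel_perceptron(X, t, poly2_kernel)
    print("dual coefficients a:", a)
    print("predictions:", [predict(X, t, a, b, poly2_kernel, x) for x in X])

    # Sanity check of the Example slide: (<x,z>)^2 equals <M(x), M(z)>.
    M = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
    x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
    assert np.isclose(poly2_kernel(x, z), np.dot(M(x), M(z)))
```

Run on the XOR-like data above, the dual perceptron converges even though the points are not linearly separable in the input space, precisely because the kernel works in the embedded space.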
Dual Problem
• Maximize:
  W(a) = Σi ai - (1/2) Σi,j ai aj ti tj K(xi, xj)
• Subject to:
  Σi ai ti = 0
  ai ≥ 0

Applications: Text
• Classify a text into given categories: sports, news, business, science, …
• Feature space: bag of words, a huge sparse vector!

Applications: Text
• Practicalities:
  Mw(x) = tfw · log(idfw) / K
  tfw = term frequency of w in the text
  idfw = inverse document frequency of w
  idfw = (# documents) / (# documents containing w)
• The inner product <M(x), M(z)> is computed over sparse vectors.
• SVM: finds a hyperplane in "document space".
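The dual problem above is exactly the kind of quadratic program referred to on the Quadratic Programming slide. Below is a rough sketch of how one might hand it to an off-the-shelf QP solver; cvxopt is assumed here, the function name svm_dual and the 1e-6 support-vector threshold are my own choices, and this hard-margin dual is only bounded when the data are separable in the kernel feature space.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, t, kernel):
    """Solve the hard-margin SVM dual from the 'Dual Problem' slide:
       maximize  W(a) = sum_i a_i - 1/2 sum_{i,j} a_i a_j t_i t_j K(x_i, x_j)
       subject to sum_i a_i t_i = 0 and a_i >= 0,
    rewritten as the minimization 1/2 a'Pa + q'a that cvxopt's qp() expects.
    X: array of inputs, t: numpy array of labels in {-1, +1}."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    P = matrix(np.outer(t, t) * K)                # quadratic term t_i t_j K(x_i, x_j)
    q = matrix(-np.ones(m))                       # linear term: maximize sum_i a_i
    G = matrix(-np.eye(m))                        # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(t.reshape(1, -1).astype(float))    # equality constraint sum_i a_i t_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    a = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    sv = a > 1e-6                                 # support vectors: non-zero a_i
    k = int(np.argmax(sv))                        # any support vector k has t_k f(x_k) = 1
    bias = t[k] - sum(a[i] * t[i] * K[i, k] for i in range(m))
    return a, bias, sv

def decision(X_train, t_train, a, bias, kernel, x):
    """f(x) = sum_i a_i t_i K(x_i, x) + b."""
    return sum(a[i] * t_train[i] * kernel(X_train[i], x)
               for i in range(len(X_train))) + bias
```

For the text application, a common off-the-shelf route is a tf-idf bag-of-words representation feeding a linear-kernel SVM, for example with scikit-learn as sketched below; note that scikit-learn's tf-idf weighting and normalization differ slightly from the tfw · log(idfw) / K formula on the slide, and the tiny corpus is invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny invented corpus; a real task would use categories such as
# sports, news, business, science as on the slide.
docs = [
    "the team won the match with a late goal",
    "the coach praised the striker after the game",
    "the central bank raised interest rates today",
    "stock markets fell after the earnings report",
]
labels = ["sports", "sports", "business", "business"]

# Bag-of-words tf-idf features (huge, sparse vectors in general) feeding a
# linear SVM: the SVM finds a separating hyperplane in "document space".
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
clf.fit(docs, labels)
print(clf.predict(["the striker scored a goal in the match",
                   "interest rates moved the stock markets"]))
```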