Machine Learning 4771, Columbia University
Instructor: Tony Jebara

Topic 6
• Review: Support Vector Machines
• Primal & Dual Solution
• Non-separable SVMs
• Kernels
• SVM Demo

Review: SVM
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk $J(\theta)$
• Assume first that the 2-class data is linearly separable: we have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,+1\}$, and the classifier is $f(x;\theta) = \mathrm{sign}(w^T x + b)$
• The decision boundary or hyperplane is given by $w^T x + b = 0$
• Note: we can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error $R_{emp}(\theta) = 0$
• We want the widest or thickest one (max margin); it is also unique!

Support Vector Machines
• Define: H+ = positive margin hyperplane, H- = negative margin hyperplane, q = distance from the decision plane to the origin
• Without loss of generality (since we can scale b & w) define:
  $H: w^T x + b = 0$, $H_+: w^T x + b = +1$, $H_-: w^T x + b = -1$
• Distance of H to the origin: $q = \min_x \|x - 0\|$ subject to $w^T x + b = 0$, i.e. solve $\min_x \tfrac{1}{2}\|x - 0\|^2 - \lambda (w^T x + b)$
  1) gradient: $\frac{\partial}{\partial x}\left(\tfrac{1}{2} x^T x - \lambda(w^T x + b)\right) = x - \lambda w = 0 \Rightarrow x = \lambda w$
  2) plug into the constraint: $w^T(\lambda w) + b = 0 \Rightarrow \lambda = -\frac{b}{w^T w}$
  3) solution: $\hat{x} = -\frac{b}{w^T w} w$
  4) distance: $q = \|\hat{x} - 0\| = \frac{|b|}{\|w\|}$

Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta) = 0$ are thus:
  $w^T x_i + b \geq +1 \;\; \forall y_i = +1$ and $w^T x_i + b \leq -1 \;\; \forall y_i = -1$
• Or more simply: $y_i(w^T x_i + b) - 1 \geq 0$
• The margin of the SVM is $m = d_+ + d_-$
• Distances to the origin: $H: q = \frac{|b|}{\|w\|}$, $H_+: q_+ = \frac{|b-1|}{\|w\|}$, $H_-: q_- = \frac{|-1-b|}{\|w\|}$
• Therefore $d_+ = d_- = \frac{1}{\|w\|}$ and the margin is $m = \frac{2}{\|w\|}$
• We want to maximize the margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM problem: $\min \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \geq 0$
• This is a quadratic program!
• Can plug this into a matlab function called "qp()", done!

SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem: $L_P = \min \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \geq 0$
• This is a quadratic program: a quadratic cost function with multiple linear inequalities (these carve out a convex hull)
• Subtract from the cost each inequality times a Lagrange multiplier $\alpha_i$, then take derivatives with respect to w & b:
  $L_P = \min_{w,b} \max_{\alpha \geq 0} \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
  $\frac{\partial}{\partial w} L_P = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
  $\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0$
• Plug back in to get the dual: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
• Also have the constraints $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
• The above $L_D$ must be maximized! (a minimal quadprog sketch of this dual follows below)
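To make the dual concrete, here is a minimal MATLAB sketch (not from the slides) of solving the separable-case dual with quadprog, the solver mentioned later in the demo; the variable names X, y, alpha and the 1e-6 tolerance are our own choices.

```matlab
% Hard-margin SVM dual via quadprog (minimal sketch, assumed variable names)
% X is an N x D data matrix, y is an N x 1 vector of labels in {-1,+1}
N = size(X, 1);
H = (y * y') .* (X * X');          % H_ij = y_i y_j x_i' x_j
f = -ones(N, 1);                   % quadprog minimizes 0.5*a'*H*a + f'*a
Aeq = y';  beq = 0;                % equality constraint: sum_i alpha_i y_i = 0
lb = zeros(N, 1);  ub = [];        % alpha_i >= 0 (no upper bound: separable case)
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);

w = X' * (alpha .* y);             % w = sum_i alpha_i y_i x_i
sv = find(alpha > 1e-6);           % support vectors: nonzero alphas (tolerance is arbitrary)
b = mean(y(sv) - X(sv, :) * w);    % KKT: b_i = y_i - w' x_i, averaged over the SVs
```

If the data is not linearly separable, this QP is unbounded (the alphas blow up), which is exactly what motivates the slack variables introduced next.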
SVM Dual Solution Properties
• We have the dual convex program:
  $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
• Solve for N alphas (one per data point) instead of D entries of w
• Still convex (a qp), so the solution is unique and gives the alphas
• The alphas can be used to get w: $w = \sum_i \alpha_i y_i x_i$
• Support vectors have non-zero alphas (shown with thicker circles in the figure) and all live on the margin: $w^T x_i + b = \pm 1$
• The solution is sparse: most alphas are 0, and these are the non-support vectors. The SVM ignores them: if they move (without crossing the margin) or are deleted, the SVM doesn't change (it stays robust)

SVM Dual Solution Properties
• (Figure: primal & dual illustration of the constraints $y_i(w^T x_i + b) \geq 1$)
• Recall we could get w from the alphas: $w = \sum_i \alpha_i y_i x_i$
• Or use the alphas as is: $f(x) = \mathrm{sign}(x^T w + b) = \mathrm{sign}\left(\sum_i \alpha_i y_i x^T x_i + b\right)$
• Karush-Kuhn-Tucker conditions (KKT) solve for the value of b: on the margin (for nonzero alphas) we have $w^T x_i + b = y_i$, so using the known w compute $b_i = y_i - w^T x_i$ for each support vector ($\forall i: \alpha_i > 0$), then set $b = \mathrm{average}(b_i)$
• Sparsity (few nonzero alphas) is useful for several reasons
• It means the SVM only uses some of the training data to learn
• It should help improve its ability to generalize to test data
• The final learned classifier is computationally faster to evaluate

Non-Separable SVMs
• What happens when the data is non-separable?
• There is no solution and the convex hull shrinks to nothing
• Not all constraints can be satisfied; their alphas go to ∞
• Instead of perfectly classifying each point, $y_i(w^T x_i + b) \geq 1$, we "relax" the problem with (positive) slack variables $\xi_i$ that allow data to (sometimes) fall on the wrong side, for example: $w^T x_i + b \geq -0.03$ if $y_i = +1$
• New constraints:
  $w^T x_i + b \geq +1 - \xi_i$ if $y_i = +1$, where $\xi_i \geq 0$
  $w^T x_i + b \leq -1 + \xi_i$ if $y_i = -1$, where $\xi_i \geq 0$
• But too much $\xi_i$ means too much slack, so penalize it:
  $L_P: \min \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y_i(w^T x_i + b) - 1 + \xi_i \geq 0$

Non-Separable SVMs
• This new problem is still convex, still qp()!
• The user chooses the scalar C (or cross-validates it), which controls how much slack $\xi_i$ to allow (how non-separable the data is) and how robust the SVM is to outliers or bad points on the wrong side
• The Lagrangian trades off a large margin, low slack, points on the right side, and $\xi_i$ positivity:
  $L_P: \min \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 + \xi_i \right] - \sum_i \beta_i \xi_i$
• $\frac{\partial}{\partial w} L_P$ and $\frac{\partial}{\partial b} L_P$ are as before, and $\frac{\partial}{\partial \xi_i} L_P = C - \alpha_i - \beta_i = 0$ gives $\alpha_i = C - \beta_i$; but $\alpha_i, \beta_i \geq 0$, therefore $0 \leq \alpha_i \leq C$
• Can now write the dual problem (to maximize):
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
• Same dual as before, but the alphas can't grow beyond C

Non-Separable SVMs
• As we try to enforce a classification for a non-separable data point, its Lagrange multiplier alpha keeps growing endlessly
• Clamping alpha so it stops growing at C makes the SVM "give up" on those non-separable points
• The dual program is now the box-constrained one above
• Solve as before with the extra constraints that the alphas are positive AND less than C; this gives the alphas, and from the alphas get $w = \sum_i \alpha_i y_i x_i$
• Karush-Kuhn-Tucker conditions (KKT) solve for the value of b on the margin: for the support vectors whose alphas are nonzero AND not equal to C, assume $\xi_i = 0$ and use the formula $y_i(w^T x_i + b_i) - 1 + \xi_i = 0$ to get $b_i$, then $b = \mathrm{average}(b_i)$ (see the soft-margin sketch below)
• Mechanical analogy: support vector forces & torques
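To show how little changes in the non-separable case, here is a minimal MATLAB sketch (again not from the slides) of the soft-margin dual; only the upper bound C on the alphas and the choice of which support vectors to average for b differ from the separable sketch above. The value of C, the tolerance, and the variable names are our own assumptions.

```matlab
% Soft-margin SVM dual via quadprog (minimal sketch, assumed variable names)
% X is N x D, y is N x 1 in {-1,+1}, C is the user-chosen slack penalty
N = size(X, 1);
H = (y * y') .* (X * X');               % H_ij = y_i y_j x_i' x_j
f = -ones(N, 1);
Aeq = y';  beq = 0;                     % sum_i alpha_i y_i = 0
lb = zeros(N, 1);  ub = C * ones(N, 1); % box constraint: 0 <= alpha_i <= C
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);

w = X' * (alpha .* y);                  % w = sum_i alpha_i y_i x_i
tol = 1e-6;                             % arbitrary numerical tolerance
margin_sv = find(alpha > tol & alpha < C - tol);  % on-margin SVs: 0 < alpha < C
b = mean(y(margin_sv) - X(margin_sv, :) * w);     % KKT with xi_i = 0
```

Points whose alpha hits the ceiling C are the ones the SVM gives up on; they may lie inside the margin or on the wrong side, so they are excluded when averaging b.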
Nonlinear SVMs
• What if the problem is not linear?
• We can use our old trick…
• Map d-dimensional x data from the input space L to a high-dimensional Hilbert feature space H via basis functions Φ(x): $x_i \to \Phi(x_i)$
• For example, for a quadratic classifier: $\Phi(x) = \begin{bmatrix} x \\ \mathrm{vec}(x x^T) \end{bmatrix}$
• Call the phi's feature vectors computed from the original x inputs
• Replace all x's in the SVM equations with phi's
• Now solve the following learning problem:
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$ s.t. $\alpha_i \in [0, C]$, $\sum_i \alpha_i y_i = 0$
• Which gives a nonlinear classifier in the original space:
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \phi(x)^T \phi(x_i) + b\right)$

Kernels (see http://www.youtube.com/watch?v=3liCbRZPrZA)
• One important aspect of SVMs: all the math involves only inner products between the phi features!
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \phi(x)^T \phi(x_i) + b\right)$
• Replace all inner products with a general kernel function
• Mercer kernel: accepts 2 inputs and outputs a scalar via
  $k(x, x') = \langle \phi(x), \phi(x') \rangle = \phi(x)^T \phi(x')$ for finite $\phi$, or $\int_t \phi(x,t)\,\phi(x',t)\,dt$ otherwise
• Example: quadratic polynomial $\phi(x) = \begin{bmatrix} x_1^2 & \sqrt{2}\,x_1 x_2 & x_2^2 \end{bmatrix}^T$
  $k(x, x') = \phi(x)^T \phi(x') = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2$

Kernels
• Sometimes many Φ(x) will produce the same k(x,x')
• Sometimes k(x,x') is computable but the features are huge or infinite!
• Example: polynomials. With an explicit polynomial mapping, the feature space Φ(x) is huge: for d-dimensional data and a p-th order polynomial, $\dim(H) = \binom{d+p-1}{p}$; images of size 16x16 with p=4 have dim(H) ≈ 183 million
• But can equivalently just use the kernel $k(x, y) = (x^T y)^p$. By the multinomial theorem,
  $k(x, x') = (x^T x')^p = \left(\sum_i x_i x_i'\right)^p \propto \sum_r \frac{p!}{r_1! r_2! r_3! \cdots (p - r_1 - r_2 - \cdots)!} \left(x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\right)\left(x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}\right) \propto \sum_r \left(w_r x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\right)\left(w_r x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}\right) \propto \phi(x)^T \phi(x')$
  where $w_r$ is the weight on each term. Equivalent!

Kernels
• Replace each $x_i^T x_j \to k(x_i, x_j)$, for example:
  P-th order polynomial kernel: $k(x, x') = (x^T x' + 1)^p$
  RBF kernel (infinite!): $k(x, x') = \exp\left(-\frac{1}{2\sigma^2}\|x - x'\|^2\right)$
  Sigmoid (hyperbolic tan) kernel: $k(x, x') = \tanh(\kappa\, x^T x' - \delta)$
• Using kernels we get the generalized inner product SVM:
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ s.t. $\alpha_i \in [0, C]$, $\sum_i \alpha_i y_i = 0$
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i k(x_i, x) + b\right)$
• Still a qp solver, just use the Gram matrix K (positive definite), $K_{ij} = k(x_i, x_j)$, e.g. for three points:
  $K = \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & k(x_1,x_3) \\ k(x_1,x_2) & k(x_2,x_2) & k(x_2,x_3) \\ k(x_1,x_3) & k(x_2,x_3) & k(x_3,x_3) \end{bmatrix}$

Kernelized SVMs
• Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + 1)^p$
• Radial basis function kernel: $k(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$
• Least-squares, logistic regression, and the perceptron are also kernelizable

SVM Demo
• SVM Demo by Steve Gunn: http://www.isis.ecs.soton.ac.uk/resources/svminfo/
• In svc.m replace [alpha lambda how] = qp(…); with [alpha lambda how] = quadprog(H,c,[],[],A,b,vlb,vub,x0); This replaces the old Matlab command qp (quadratic programming) with the newer quadprog for more recent Matlab versions (a kernelized quadprog sketch follows below)
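Combining the kernel trick with the quadprog call used in the demo, here is a minimal MATLAB sketch (not part of Steve Gunn's toolbox) that trains a soft-margin SVM with an RBF kernel and classifies a new point; sigma, C, xnew, the tolerance, and all variable names are our own assumptions.

```matlab
% Kernelized soft-margin SVM with an RBF kernel (minimal sketch, assumed names)
% X is N x D training data, y is N x 1 in {-1,+1}; sigma and C are user choices
N = size(X, 1);
sq = sum(X.^2, 2);
D2 = bsxfun(@plus, sq, sq') - 2 * (X * X');   % pairwise ||x_i - x_j||^2
K = exp(-D2 / (2 * sigma^2));                 % Gram matrix K_ij = k(x_i, x_j)

H = (y * y') .* K;                            % same dual, x_i'x_j -> k(x_i, x_j)
f = -ones(N, 1);
alpha = quadprog(H, f, [], [], y', 0, zeros(N, 1), C * ones(N, 1));

tol = 1e-6;                                   % arbitrary numerical tolerance
sv  = find(alpha > tol);                      % support vectors
msv = find(alpha > tol & alpha < C - tol);    % on-margin SVs used for b
b = mean(y(msv) - K(msv, sv) * (alpha(sv) .* y(sv)));

% Classify a new point xnew (1 x D): f(x) = sign(sum_i alpha_i y_i k(x_i, x) + b)
kx = exp(-sum(bsxfun(@minus, X(sv, :), xnew).^2, 2) / (2 * sigma^2));
ypred = sign(kx' * (alpha(sv) .* y(sv)) + b);
```

Swapping in the polynomial or sigmoid kernel only changes the two lines that build K and kx; the QP and the decision rule stay the same.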