Machine Learning 4771, Columbia University
Instructor: Tony Jebara

Topic 6
• Review: Support Vector Machines
• Primal & Dual Solution
• Non-separable SVMs
• Kernels
• SVM Demo

Review: SVM
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk $J(\theta)$
• Assume first that the 2-class data is linearly separable: we have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,+1\}$, and the classifier is $f(x;\theta) = \mathrm{sign}(w^T x + b)$
• The decision boundary or hyperplane is given by $w^T x + b = 0$
• Note: we can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error $R_{emp}(\theta) = 0$
• We want the widest or thickest one (max margin); it is also unique!

Support Vector Machines
• Define: H+ = positive margin hyperplane, H- = negative margin hyperplane, q = distance from the decision plane to the origin
• Without loss of generality (since we can scale b & w) define:
  $H: w^T x + b = 0$, $H_+: w^T x + b = +1$, $H_-: w^T x + b = -1$
• Distance of H to the origin: $q = \min_x \|x - 0\|$ subject to $w^T x + b = 0$, i.e. solve $\min_x \tfrac{1}{2}\|x - 0\|^2 - \lambda (w^T x + b)$
  1) gradient: $\frac{\partial}{\partial x}\left(\tfrac{1}{2} x^T x - \lambda(w^T x + b)\right) = x - \lambda w = 0 \Rightarrow x = \lambda w$
  2) plug into the constraint: $w^T(\lambda w) + b = 0 \Rightarrow \lambda = -\frac{b}{w^T w}$
  3) solution: $\hat{x} = -\frac{b}{w^T w} w$
  4) distance: $q = \|\hat{x} - 0\| = \frac{|b|}{\|w\|}$

Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta) = 0$ are thus:
  $w^T x_i + b \geq +1 \;\; \forall y_i = +1$ and $w^T x_i + b \leq -1 \;\; \forall y_i = -1$
• Or more simply: $y_i(w^T x_i + b) - 1 \geq 0$
• The margin of the SVM is $m = d_+ + d_-$
• Distances to the origin: $H: q = \frac{|b|}{\|w\|}$, $H_+: q_+ = \frac{|b-1|}{\|w\|}$, $H_-: q_- = \frac{|-1-b|}{\|w\|}$
• Therefore $d_+ = d_- = \frac{1}{\|w\|}$ and the margin is $m = \frac{2}{\|w\|}$
• We want to maximize the margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM problem: $\min \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \geq 0$
• This is a quadratic program!
• Can plug this into a matlab function called "qp()", done!

SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem: $L_P = \min \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \geq 0$
• This is a quadratic program: a quadratic cost function with multiple linear inequalities (these carve out a convex hull)
• Subtract from the cost each inequality times a Lagrange multiplier $\alpha_i$, then take derivatives with respect to w & b:
  $L_P = \min_{w,b} \max_{\alpha \geq 0} \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
  $\frac{\partial}{\partial w} L_P = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
  $\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0$
• Plug back in to get the dual: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
• Also have the constraints $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
• The above $L_D$ must be maximized! (a minimal quadprog sketch of this dual follows below)
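To make the dual concrete, here is a minimal MATLAB sketch (not from the slides) of solving the separable-case dual with quadprog, the solver mentioned later in the demo; the variable names X, y, alpha and the 1e-6 tolerance are our own choices.

```matlab
% Hard-margin SVM dual via quadprog (minimal sketch, assumed variable names)
% X is an N x D data matrix, y is an N x 1 vector of labels in {-1,+1}
N = size(X, 1);
H = (y * y') .* (X * X');          % H_ij = y_i y_j x_i' x_j
f = -ones(N, 1);                   % quadprog minimizes 0.5*a'*H*a + f'*a
Aeq = y';  beq = 0;                % equality constraint: sum_i alpha_i y_i = 0
lb = zeros(N, 1);  ub = [];        % alpha_i >= 0 (no upper bound: separable case)
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);

w = X' * (alpha .* y);             % w = sum_i alpha_i y_i x_i
sv = find(alpha > 1e-6);           % support vectors: nonzero alphas (tolerance is arbitrary)
b = mean(y(sv) - X(sv, :) * w);    % KKT: b_i = y_i - w' x_i, averaged over the SVs
```

If the data is not linearly separable, this QP is unbounded (the alphas blow up), which is exactly what motivates the slack variables introduced next.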
SVM Dual Solution Properties
• We have the dual convex program:
  $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
• Solve for N alphas (one per data point) instead of D entries of w
• Still convex (a qp), so the solution is unique and gives the alphas
• The alphas can be used to get w: $w = \sum_i \alpha_i y_i x_i$
• Support vectors have non-zero alphas (shown with thicker circles in the figure) and all live on the margin: $w^T x_i + b = \pm 1$
• The solution is sparse: most alphas are 0, and these are the non-support vectors. The SVM ignores them: if they move (without crossing the margin) or are deleted, the SVM doesn't change (it stays robust)

SVM Dual Solution Properties
• (Figure: primal & dual illustration of the constraints $y_i(w^T x_i + b) \geq 1$)
• Recall we could get w from the alphas: $w = \sum_i \alpha_i y_i x_i$
• Or use the alphas as is: $f(x) = \mathrm{sign}(x^T w + b) = \mathrm{sign}\left(\sum_i \alpha_i y_i x^T x_i + b\right)$
• Karush-Kuhn-Tucker conditions (KKT) solve for the value of b: on the margin (for nonzero alphas) we have $w^T x_i + b = y_i$, so using the known w compute $b_i = y_i - w^T x_i$ for each support vector ($\forall i: \alpha_i > 0$), then set $b = \mathrm{average}(b_i)$
• Sparsity (few nonzero alphas) is useful for several reasons
• It means the SVM only uses some of the training data to learn
• It should help improve its ability to generalize to test data
• The final learned classifier is computationally faster to evaluate

Non-Separable SVMs
• What happens when the data is non-separable?
• There is no solution and the convex hull shrinks to nothing
• Not all constraints can be satisfied; their alphas go to ∞
• Instead of perfectly classifying each point, $y_i(w^T x_i + b) \geq 1$, we "relax" the problem with (positive) slack variables $\xi_i$ that allow data to (sometimes) fall on the wrong side, for example: $w^T x_i + b \geq -0.03$ if $y_i = +1$
• New constraints:
  $w^T x_i + b \geq +1 - \xi_i$ if $y_i = +1$, where $\xi_i \geq 0$
  $w^T x_i + b \leq -1 + \xi_i$ if $y_i = -1$, where $\xi_i \geq 0$
• But too much $\xi_i$ means too much slack, so penalize it:
  $L_P: \min \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y_i(w^T x_i + b) - 1 + \xi_i \geq 0$

Non-Separable SVMs
• This new problem is still convex, still qp()!
• The user chooses the scalar C (or cross-validates it), which controls how much slack $\xi_i$ to allow (how non-separable the data is) and how robust the SVM is to outliers or bad points on the wrong side
• The Lagrangian trades off a large margin, low slack, points on the right side, and $\xi_i$ positivity:
  $L_P: \min \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 + \xi_i \right] - \sum_i \beta_i \xi_i$
• $\frac{\partial}{\partial w} L_P$ and $\frac{\partial}{\partial b} L_P$ are as before, and $\frac{\partial}{\partial \xi_i} L_P = C - \alpha_i - \beta_i = 0$ gives $\alpha_i = C - \beta_i$; but $\alpha_i, \beta_i \geq 0$, therefore $0 \leq \alpha_i \leq C$
• Can now write the dual problem (to maximize):
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
• Same dual as before, but the alphas can't grow beyond C

Non-Separable SVMs
• As we try to enforce a classification for a non-separable data point, its Lagrange multiplier alpha keeps growing endlessly
• Clamping alpha so it stops growing at C makes the SVM "give up" on those non-separable points
• The dual program is now the box-constrained one above
• Solve as before with the extra constraints that the alphas are positive AND less than C; this gives the alphas, and from the alphas get $w = \sum_i \alpha_i y_i x_i$
• Karush-Kuhn-Tucker conditions (KKT) solve for the value of b on the margin: for the support vectors whose alphas are nonzero AND not equal to C, assume $\xi_i = 0$ and use the formula $y_i(w^T x_i + b_i) - 1 + \xi_i = 0$ to get $b_i$, then $b = \mathrm{average}(b_i)$ (see the soft-margin sketch below)
• Mechanical analogy: support vector forces & torques
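To show how little changes in the non-separable case, here is a minimal MATLAB sketch (again not from the slides) of the soft-margin dual; only the upper bound C on the alphas and the choice of which support vectors to average for b differ from the separable sketch above. The value of C, the tolerance, and the variable names are our own assumptions.

```matlab
% Soft-margin SVM dual via quadprog (minimal sketch, assumed variable names)
% X is N x D, y is N x 1 in {-1,+1}, C is the user-chosen slack penalty
N = size(X, 1);
H = (y * y') .* (X * X');               % H_ij = y_i y_j x_i' x_j
f = -ones(N, 1);
Aeq = y';  beq = 0;                     % sum_i alpha_i y_i = 0
lb = zeros(N, 1);  ub = C * ones(N, 1); % box constraint: 0 <= alpha_i <= C
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);

w = X' * (alpha .* y);                  % w = sum_i alpha_i y_i x_i
tol = 1e-6;                             % arbitrary numerical tolerance
margin_sv = find(alpha > tol & alpha < C - tol);  % on-margin SVs: 0 < alpha < C
b = mean(y(margin_sv) - X(margin_sv, :) * w);     % KKT with xi_i = 0
```

Points whose alpha hits the ceiling C are the ones the SVM gives up on; they may lie inside the margin or on the wrong side, so they are excluded when averaging b.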
Nonlinear SVMs
• What if the problem is not linear?
• We can use our old trick…
• Map d-dimensional x data from the input space L to a high-dimensional Hilbert feature space H via basis functions Φ(x): $x_i \to \Phi(x_i)$
• For example, for a quadratic classifier: $\Phi(x) = \begin{bmatrix} x \\ \mathrm{vec}(x x^T) \end{bmatrix}$
• Call the phi's feature vectors computed from the original x inputs
• Replace all x's in the SVM equations with phi's
• Now solve the following learning problem:
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$ s.t. $\alpha_i \in [0, C]$, $\sum_i \alpha_i y_i = 0$
• Which gives a nonlinear classifier in the original space:
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \phi(x)^T \phi(x_i) + b\right)$

Kernels (see http://www.youtube.com/watch?v=3liCbRZPrZA)
• One important aspect of SVMs: all the math involves only inner products between the phi features!
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \phi(x)^T \phi(x_i) + b\right)$
• Replace all inner products with a general kernel function
• Mercer kernel: accepts 2 inputs and outputs a scalar via
  $k(x, x') = \langle \phi(x), \phi(x') \rangle = \phi(x)^T \phi(x')$ for finite $\phi$, or $\int_t \phi(x,t)\,\phi(x',t)\,dt$ otherwise
• Example: quadratic polynomial $\phi(x) = \begin{bmatrix} x_1^2 & \sqrt{2}\,x_1 x_2 & x_2^2 \end{bmatrix}^T$
  $k(x, x') = \phi(x)^T \phi(x') = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2$

Kernels
• Sometimes many Φ(x) will produce the same k(x,x')
• Sometimes k(x,x') is computable but the features are huge or infinite!
• Example: polynomials. With an explicit polynomial mapping, the feature space Φ(x) is huge: for d-dimensional data and a p-th order polynomial, $\dim(H) = \binom{d+p-1}{p}$; images of size 16x16 with p=4 have dim(H) ≈ 183 million
• But can equivalently just use the kernel $k(x, y) = (x^T y)^p$. By the multinomial theorem,
  $k(x, x') = (x^T x')^p = \left(\sum_i x_i x_i'\right)^p \propto \sum_r \frac{p!}{r_1! r_2! r_3! \cdots (p - r_1 - r_2 - \cdots)!} \left(x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\right)\left(x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}\right) \propto \sum_r \left(w_r x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\right)\left(w_r x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}\right) \propto \phi(x)^T \phi(x')$
  where $w_r$ is the weight on each term. Equivalent!

Kernels
• Replace each $x_i^T x_j \to k(x_i, x_j)$, for example:
  P-th order polynomial kernel: $k(x, x') = (x^T x' + 1)^p$
  RBF kernel (infinite!): $k(x, x') = \exp\left(-\frac{1}{2\sigma^2}\|x - x'\|^2\right)$
  Sigmoid (hyperbolic tan) kernel: $k(x, x') = \tanh(\kappa\, x^T x' - \delta)$
• Using kernels we get the generalized inner product SVM:
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ s.t. $\alpha_i \in [0, C]$, $\sum_i \alpha_i y_i = 0$
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i k(x_i, x) + b\right)$
• Still a qp solver, just use the Gram matrix K (positive definite), $K_{ij} = k(x_i, x_j)$, e.g. for three points:
  $K = \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & k(x_1,x_3) \\ k(x_1,x_2) & k(x_2,x_2) & k(x_2,x_3) \\ k(x_1,x_3) & k(x_2,x_3) & k(x_3,x_3) \end{bmatrix}$

Kernelized SVMs
• Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + 1)^p$
• Radial basis function kernel: $k(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$
• Least-squares, logistic regression, and the perceptron are also kernelizable

SVM Demo
• SVM Demo by Steve Gunn: http://www.isis.ecs.soton.ac.uk/resources/svminfo/
• In svc.m replace [alpha lambda how] = qp(…); with [alpha lambda how] = quadprog(H,c,[],[],A,b,vlb,vub,x0); This replaces the old Matlab command qp (quadratic programming) with the newer quadprog for more recent Matlab versions (a kernelized quadprog sketch follows below)
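Combining the kernel trick with the quadprog call used in the demo, here is a minimal MATLAB sketch (not part of Steve Gunn's toolbox) that trains a soft-margin SVM with an RBF kernel and classifies a new point; sigma, C, xnew, the tolerance, and all variable names are our own assumptions.

```matlab
% Kernelized soft-margin SVM with an RBF kernel (minimal sketch, assumed names)
% X is N x D training data, y is N x 1 in {-1,+1}; sigma and C are user choices
N = size(X, 1);
sq = sum(X.^2, 2);
D2 = bsxfun(@plus, sq, sq') - 2 * (X * X');   % pairwise ||x_i - x_j||^2
K = exp(-D2 / (2 * sigma^2));                 % Gram matrix K_ij = k(x_i, x_j)

H = (y * y') .* K;                            % same dual, x_i'x_j -> k(x_i, x_j)
f = -ones(N, 1);
alpha = quadprog(H, f, [], [], y', 0, zeros(N, 1), C * ones(N, 1));

tol = 1e-6;                                   % arbitrary numerical tolerance
sv  = find(alpha > tol);                      % support vectors
msv = find(alpha > tol & alpha < C - tol);    % on-margin SVs used for b
b = mean(y(msv) - K(msv, sv) * (alpha(sv) .* y(sv)));

% Classify a new point xnew (1 x D): f(x) = sign(sum_i alpha_i y_i k(x_i, x) + b)
kx = exp(-sum(bsxfun(@minus, X(sv, :), xnew).^2, 2) / (2 * sigma^2));
ypred = sign(kx' * (alpha(sv) .* y(sv)) + b);
```

Swapping in the polynomial or sigmoid kernel only changes the two lines that build K and kx; the QP and the decision rule stay the same.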