Support Vector Machines
Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm    awm@cs.cmu.edu    412-268-7599
Nov 23rd, 2001
Copyright © 2001, 2003, Andrew W. Moore

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Linear Classifiers (Slides 2-6)
  f(x,w,b) = sign(w . x - b)
Given a dataset of points labelled +1 and -1, how would you classify this data with a linear boundary? Many different lines separate the two classes perfectly. Any of these would be fine.. ..but which is best?

Classifier Margin (Slide 7)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin (Slides 8-9)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, the Linear SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.

Why Maximum Margin? (Slide 10)
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
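The decision rule above is just a dot product and a threshold. Here is a minimal NumPy sketch of f(x,w,b) = sign(w . x - b); the particular w, b, and points are made-up values for illustration, not from the slides.

```python
import numpy as np

def linear_classify(X, w, b):
    """Return +1/-1 labels using the rule f(x,w,b) = sign(w . x - b)."""
    scores = X @ w - b            # signed (unnormalised) distance from the boundary
    return np.where(scores >= 0, 1, -1)

# Hypothetical 2-D example: any choice of w, b defines one candidate separating line.
X = np.array([[2.0, 3.0], [-1.0, -2.0], [0.5, 1.5]])
w = np.array([1.0, 1.0])
b = 0.5
print(linear_classify(X, w, b))   # -> [ 1 -1  1]
```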
Specifying a line and margin (Slides 11-12)
The classifier boundary sits between two parallel planes, the Plus-Plane and the Minus-Plane, with a "Predict Class = +1" zone on one side and a "Predict Class = -1" zone on the other. How do we represent this mathematically? ...in m input dimensions?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as..  +1 if w . x + b >= 1
               -1 if w . x + b <= -1
               Universe explodes if -1 < w . x + b < 1

Computing the margin width (Slides 13-17)
M = Margin Width. How do we compute M in terms of w and b?
• Claim: the vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w . (u - v)? And so of course the vector w is also perpendicular to the Minus Plane.
• Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint).
• Let x+ be the closest plus-plane point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in the direction of w.
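A small numeric check of the two ideas above, using made-up values of w and b: the three prediction zones defined by the plus/minus planes, and the claim that w . (u - v) = 0 for any two points u, v on the Plus-Plane.

```python
import numpy as np

# Hypothetical plane parameters.
w = np.array([3.0, 4.0])
b = -2.0

def zone(x):
    s = w @ x + b
    if s >= 1:  return "+1 zone"
    if s <= -1: return "-1 zone"
    return "universe explodes (inside the margin)"

print(zone(np.array([1.0, 0.5])))   # w.x + b = 3.0  -> "+1 zone"
print(zone(np.array([0.0, 0.0])))   # w.x + b = -2.0 -> "-1 zone"

# Two points u, v on the Plus-Plane { x : w.x + b = +1 } differ by a vector lying
# in the plane; w . (u - v) = 0 confirms w is perpendicular to it.
u = np.array([1.0, 0.0])            # w.u + b = 1
v = np.array([-3.0, 3.0])           # w.v + b = 1
print(w @ (u - v))                  # 0.0
```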
Computing the margin width, continued (Slides 18-20)
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
It's now easy to get M in terms of w and b:
  w . (x- + λ w) + b = 1
  => w . x- + b + λ w.w = 1
  => -1 + λ w.w = 1
  => λ = 2 / (w.w)
  M = |x+ - x-| = |λ w| = λ |w| = λ sqrt(w.w) = 2 sqrt(w.w) / (w.w) = 2 / sqrt(w.w)

Learning the Maximum Margin Classifier (Slide 21)
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

Learning via Quadratic Programming (Slide 22)
QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Quadratic Programming (Slides 23-24)
Find  arg max over u of  c + d^T u + (u^T R u) / 2      (quadratic criterion)
Subject to n linear inequality constraints
  a11 u1 + a12 u2 + ... + a1m um <= b1
  a21 u1 + a22 u2 + ... + a2m um <= b2
  :
  an1 u1 + an2 u2 + ... + anm um <= bn
and e additional linear equality constraints
  a(n+1)1 u1 + a(n+1)2 u2 + ... + a(n+1)m um = b(n+1)
  a(n+2)1 u1 + a(n+2)2 u2 + ... + a(n+2)m um = b(n+2)
  :
  a(n+e)1 u1 + a(n+e)2 u2 + ... + a(n+e)m um = b(n+e)
There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they're very fiddly... you probably don't want to write one yourself.)
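To make the general QP form concrete, here is a tiny instance posed with the cvxopt package (an assumed off-the-shelf solver; any QP package would do). cvxopt minimizes (1/2) u'Pu + q'u, so to maximize c + d'u + (1/2) u'Ru we pass P = -R and q = -d; the numbers R = -I, d = (1,1) and the two constraints are invented purely for illustration.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

R_ = -np.eye(2)                 # quadratic part: maximize u1 + u2 - (u1^2 + u2^2)/2
d = np.array([1.0, 1.0])
G = np.array([[1.0, 1.0]])      # one inequality constraint: u1 + u2 <= 1
h = np.array([1.0])
A = np.array([[1.0, -1.0]])     # one equality constraint:   u1 - u2  = 0
b = np.array([0.0])

sol = solvers.qp(matrix(-R_), matrix(-d), matrix(G), matrix(h), matrix(A), matrix(b))
print(np.array(sol['x']).ravel())   # -> approximately [0.5, 0.5]
```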
Learning the Maximum Margin Classifier (Slides 25-26)
Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width M = 2 / sqrt(w.w)
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?  Minimize w.w
How many constraints will we have?  R
What should they be?
  w . xk + b >= 1   if yk = 1
  w . xk + b <= -1  if yk = -1

Uh-oh! (Slides 27-31)
This is going to be a problem! What should we do when the data are not linearly separable?
Idea 1: Find minimum w.w while minimizing the number of training set errors.
  Problemette: two things to minimize makes for an ill-defined optimization.
Idea 1.1: Minimize w.w + C (#train errors), where C is a tradeoff parameter.
  There's a serious practical problem that's about to make us reject this approach. Can you guess what it is? It can't be expressed as a Quadratic Programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So... any other ideas?
Idea 2.0: Minimize w.w + C (distance of error points to their correct place).

Learning Maximum Margin with Noise (Slides 32-33)
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width M = 2 / sqrt(w.w)
Assume R datapoints, each (xk, yk) where yk = +/- 1, and give each point a slack εk (the ε2, ε7, ε11 in the original figure are the distances of the straying points to their correct zones).
What should our quadratic optimization criterion be?
  Minimize (1/2) w.w + C Σk εk
How many constraints will we have?  R
What should they be?
  w . xk + b >= 1 - εk   if yk = 1
  w . xk + b <= -1 + εk  if yk = -1
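A minimal sketch of this soft-margin idea using scikit-learn's SVC, which optimizes essentially this criterion (1/2 w.w + C Σ εk) for a linear kernel; the toy dataset is made up, and the slacks εk are recomputed by hand from the fitted w and b.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, deliberately non-separable data: one +1 point sits inside the -1
# cluster and vice versa, so some slack eps_k must be paid.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0],     # +1 points (last one noisy)
              [-1.0, -1.0], [-2.0, -3.0], [2.5, 2.5]])  # -1 points (last one noisy)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)   # C trades margin width against total slack
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Slack of each datapoint under the fitted (w, b):
#   eps_k = max(0, 1 - y_k (w . x_k + b))
# eps_k = 0 means the point is in its correct zone; eps_k > 0 costs C * eps_k.
eps = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("margin width =", 2.0 / np.sqrt(w @ w))
print("slacks =", np.round(eps, 3))
```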
Counting variables (Slide 34)
(m = # input dimensions, R = # records.)
Our original (noiseless data) QP had m+1 variables: w1, w2, ..., wm, and b.
Our new (noisy data) QP has m+1+R variables: w1, w2, ..., wm, b, ε1, ε2, ..., εR.

There's a bug in this QP. Can you spot it? (Slides 35-36)
Nothing stops a slack εk from going negative, so we must also require the slacks to be non-negative. That gives 2R constraints in total:
  w . xk + b >= 1 - εk   if yk = 1
  w . xk + b <= -1 + εk  if yk = -1
  εk >= 0                for all k

An Equivalent QP (Slides 37-39)
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.
Maximize  Σk αk - (1/2) Σk Σl αk αl Qkl    where Qkl = yk yl (xk . xl)
Subject to these constraints:  0 <= αk <= C for all k,  and  Σk αk yk = 0
Then define:
  w = Σk αk yk xk
  b = yK (1 - εK) - xK . w    where K = arg maxk αk
Then classify with:  f(x,w,b) = sign(w . x - b)
Datapoints with αk > 0 will be the support vectors, so the sum defining w only needs to be over the support vectors.
Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly.
• Because of further jaw-dropping developments you're about to learn.
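A sketch, assuming scikit-learn, of reading this dual solution off a trained linear SVM: dual_coef_ holds αk yk for the support vectors, so w = Σk αk yk xk can be rebuilt from the support vectors alone and compared against the solver's own coef_. The four datapoints are made up.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])  # made-up data
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=10.0).fit(X, y)

alpha_y = clf.dual_coef_.ravel()          # alpha_k * y_k, one value per support vector
w_from_dual = alpha_y @ clf.support_vectors_
print(w_from_dual, clf.coef_.ravel())     # the two vectors should agree
print(clf.support_)                       # indices of datapoints with alpha_k > 0
```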
Suppose we're in 1-dimension (Slides 40-41)
What would SVMs do with this data? Not a big surprise: the maximum margin boundary is a point between the two classes (around x = 0), with a positive "plane" and a negative "plane" on either side.

Harder 1-dimensional dataset (Slides 42-44)
Now one class sits in the middle with the other class on both sides. That's wiped the smirk off SVM's face. What can be done about this?
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:
  zk = ( xk , xk^2 )

Common SVM basis functions (Slide 45)
• zk = ( polynomial terms of xk of degree 1 to q )
• zk = ( radial basis functions of xk ):  zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )
• zk = ( sigmoid functions of xk )
This is sensible. Is that the end of the story? No... there's one more trick!

Quadratic Basis Functions (Slide 46)
  Φ(x) = ( 1,                                       constant term
           √2 x1, √2 x2, ..., √2 xm,                linear terms
           x1^2, x2^2, ..., xm^2,                   pure quadratic terms
           √2 x1 x2, √2 x1 x3, ..., √2 x1 xm,
           √2 x2 x3, ..., √2 xm-1 xm )              quadratic cross-terms
Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m^2/2.
You may be wondering what those √2's are doing. You should be happy that they do no harm; you'll find out why they're there soon.

QP with basis functions (Slides 47-48)
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.
Maximize  Σk αk - (1/2) Σk Σl αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
Subject to these constraints:  0 <= αk <= C for all k,  and  Σk αk yk = 0
Then define:
  w = Σ(k s.t. αk > 0) αk yk Φ(xk)
  b = yK (1 - εK) - Φ(xK) . w    where K = arg maxk αk
Then classify with:  f(x,w,b) = sign(w . φ(x) - b)
We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications, so the whole thing costs R^2 m^2 / 4. Yeeks! ...or does it?

Quadratic Dot Products (Slide 49)
  Φ(a) . Φ(b) = 1 + 2 Σi ai bi + Σi ai^2 bi^2 + Σi Σ(j>i) 2 ai aj bi bj
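Going back to the harder 1-d dataset above, here is a tiny NumPy illustration (with made-up points) of the zk = (xk, xk^2) lift: no single threshold on x separates the classes, but after the lift a linear boundary does; the particular line x^2 = 4 below is just one hypothetical separator an LSVM could find.

```python
import numpy as np

# Made-up "harder 1-d dataset": -1 points in the middle, +1 points on both sides,
# so no threshold on x alone separates the classes.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])

# Lift each point with the basis function z_k = (x_k, x_k^2).
Z = np.column_stack([x, x ** 2])

# In the lifted space a linear rule works: sign(z[1] - 4) matches every label,
# i.e. the boundary is the horizontal line x^2 = 4.
print(np.all(np.sign(Z[:, 1] - 4.0) == y))   # True
```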
Quadratic Dot Products, continued (Slides 50-51)
Just out of casual, innocent interest, let's look at another function of a and b:
  (a.b + 1)^2 = (a.b)^2 + 2 a.b + 1
              = (Σi ai bi)^2 + 2 Σi ai bi + 1
              = Σi Σj ai bi aj bj + 2 Σi ai bi + 1
              = Σi (ai bi)^2 + 2 Σi Σ(j>i) ai bi aj bj + 2 Σi ai bi + 1
Compare with  Φ(a) . Φ(b) = 1 + 2 Σi ai bi + Σi ai^2 bi^2 + Σi Σ(j>i) 2 ai aj bi bj.
They're the same! And this is only O(m) to compute!

QP with Quadratic basis functions (Slide 52)
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.
Maximize  Σk αk - (1/2) Σk Σl αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
Subject to these constraints:  0 <= αk <= C for all k,  and  Σk αk yk = 0
We must still do R^2/2 dot products to get this matrix ready, but each dot product now requires only m additions and multiplications. Then define w and b as before and classify with f(x,w,b) = sign(w . φ(x) - b).

Higher Order Polynomials (Slide 53)
  φ(x)                                Cost to build Qkl      Cost if         φ(a).φ(b)    Cost to build Qkl   Cost if
                                      matrix traditionally   100 inputs                   matrix sneakily     100 inputs
  Quadratic: all m^2/2 terms          m^2 R^2 / 4            2,500 R^2       (a.b+1)^2    m R^2 / 2           50 R^2
    up to degree 2
  Cubic:     all m^3/6 terms          m^3 R^2 / 12           83,000 R^2      (a.b+1)^3    m R^2 / 2           50 R^2
    up to degree 3
  Quartic:   all m^4/24 terms         m^4 R^2 / 48           1,960,000 R^2   (a.b+1)^4    m R^2 / 2           50 R^2
    up to degree 4

QP with Quintic basis functions (Slides 54-55)
Using the quintic kernel (a.b+1)^5, we must still do R^2/2 dot products to get the Qkl matrix ready, but in 100-d each dot product now needs only about 103 operations instead of 75 million. But there are still worrying things lurking away. What are they?
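A quick numeric confirmation of the identity above, with an explicit quadratic feature map (the √2 factors are exactly what make the two sides match); the random vectors are arbitrary.

```python
import numpy as np
from itertools import combinations

def quad_phi(v):
    """Explicit quadratic expansion: 1, sqrt(2)*v_i, v_i^2, sqrt(2)*v_i*v_j (i<j)."""
    cross = [np.sqrt(2.0) * v[i] * v[j] for i, j in combinations(range(len(v)), 2)]
    return np.concatenate([np.array([1.0]), np.sqrt(2.0) * v, v ** 2, np.array(cross)])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

print(quad_phi(a) @ quad_phi(b))   # explicit O(m^2) dot product
print((a @ b + 1.0) ** 2)          # kernel shortcut, O(m) -- same number
```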
QP with Quintic basis functions, continued (Slides 55-58)
• The fear of overfitting with this enormous number of terms. The use of Maximum Margin magically makes this not a problem (more on this on the next slide).
• The evaluation phase (doing a set of predictions on a test set) will be very expensive. Why? Because each w . φ(x) would seem to need 75 million operations. What can be done? Expand w in terms of the support vectors:
    w . Φ(x) = Σ(k s.t. αk > 0) αk yk Φ(xk) . Φ(x)
             = Σ(k s.t. αk > 0) αk yk (xk . x + 1)^5
  Only S m operations (S = # support vectors), with b = yK (1 - εK) - Φ(xK) . w, where K = arg maxk αk, as before. Then classify with f(x,w,b) = sign(w . φ(x) - b).
When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long."
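A sketch of that S·m evaluation trick: classify using only the support vectors and the kernel (xk . x + 1)^5, never forming φ(x) explicitly. The αk yk values, b, and the points below are hypothetical stand-ins for what a QP solver would return.

```python
import numpy as np

def svm_predict(x, support_X, support_alpha_y, b, degree=5):
    """Kernelised evaluation of f(x) = sign(w . phi(x) - b) without forming phi:
    w . phi(x) = sum over support vectors of alpha_k y_k (x_k . x + 1)^degree,
    which costs only S*m operations (S = number of support vectors)."""
    wphi = support_alpha_y @ (support_X @ x + 1.0) ** degree
    return np.sign(wphi - b)

# Hypothetical stand-ins for the QP's output (alpha_k * y_k per support vector).
support_X = np.array([[1.0, 2.0], [-1.0, 0.5]])
support_alpha_y = np.array([0.3, -0.3])
b = 0.0
print(svm_predict(np.array([0.5, 0.5]), support_X, support_alpha_y, b))
```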
Why SVMs don't overfit as much as you'd think (Slide 59)
Andrew's opinion of why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters, α1, α2, ..., αR, and usually most are set to zero by the Maximum Margin. Asking for small w.w is like "weight decay" in Neural Nets, like the Ridge Regression parameters in Linear Regression, and like the use of Priors in Bayesian Regression, all designed to smooth the function and reduce overfitting.

SVM Kernel Functions (Slide 60)
• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function.
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function.
• Radial-Basis-style Kernel Function:  K(a,b) = exp( - (a - b)^2 / (2 σ^2) )
• Neural-net-style Kernel Function:    K(a,b) = tanh( κ a.b - δ )
σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM (*see last lecture).

VC-dimension of an SVM (Slide 61)
• Very very very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension as ⌈ Diameter / Margin ⌉, where
  • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and
  • Margin is the smallest margin we'll let the SVM use.
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation.

SVM Performance (Slide 62)
• Anecdotally they work very very well indeed.
• Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
• Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
• Despite this, some practitioners (including your lecturer) are a little skeptical.

Doing multi-class classification (Slide 63)
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output==1" vs "Output != 1"
  • SVM 2 learns "Output==2" vs "Output != 2"
  • :
  • SVM N learns "Output==N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (see the sketch below).

References (Slide 64)
• An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
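Returning to the multi-class recipe above, here is a minimal sketch assuming scikit-learn's SVC and a made-up 3-class dataset: train N binary machines and pick the class whose machine pushes the input furthest into its positive region.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 3-class dataset: three Gaussian blobs, labels in {0, 1, 2}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat([0, 1, 2], 20)

machines = []
for c in range(3):                       # SVM c learns "Output == c" vs "Output != c"
    clf = SVC(kernel='rbf', C=1.0).fit(X, np.where(labels == c, 1, -1))
    machines.append(clf)

def predict(x):
    # Pick the class whose SVM puts x furthest into the positive region.
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in machines]
    return int(np.argmax(scores))

print(predict(np.array([2.9, 0.1])))     # expect class 1
```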
What You Should Know (Slide 65)
• Linear SVMs.
• The definition of a maximum margin classifier.
• What QP can do for you (but, for this class, you don't need to know how it does it).
• How Maximum Margin can be turned into a QP problem.
• How we deal with noisy (non-separable) data.
• How we permit non-linear boundaries.
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms.

Copyright © 2001, 2003, Andrew W. Moore