Classification IV
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

Overview
- Support Vector Machines

Linear Classifier
A linear classifier assigns a label to a sample $x$ according to
    $f(x, w, b) = \mathrm{sign}(g(x)) = \mathrm{sign}(w \cdot x + b)$, where $w \cdot x = \sum_{i=1}^{n} w_i x_i$.
The hyperplane $w \cdot x + b = 0$ separates the region where $w \cdot x + b > 0$ from the region where $w \cdot x + b < 0$.
Just in case: for any two points $x_1, x_2$ on the hyperplane, $w \cdot x_1 + b = 0$ and $w \cdot x_2 + b = 0$, so $w \cdot (x_1 - x_2) = 0$; the weight vector $w$ is orthogonal to the hyperplane.

Distance to Hyperplane
Let $g(x) = w \cdot x + b$ and write $x = x' + M \frac{w}{\|w\|}$, where $x'$ is the projection of $x$ onto the hyperplane and $M = \|x - x'\|$ is the distance from $x$ to it. Since $w \cdot x' + b = 0$,
    $g(x) = w \cdot \left(x' + M \frac{w}{\|w\|}\right) + b = (w \cdot x' + b) + M \frac{w \cdot w}{\|w\|} = M \|w\|$,
so the distance from $x$ to the hyperplane is $M = \frac{|g(x)|}{\|w\|}$. In particular, the distance from the origin to the hyperplane is $\frac{|b|}{\|w\|}$.

Selection of Classifiers
Which classifier is the best? All of the candidate boundaries have the same training error; the real question is how well they generalize.

Unknown Samples
[Figure: classifiers A and B applied to unseen samples.] Classifier B divides the space more consistently (it is not biased towards either class).

Margins
[Figure: a separating hyperplane with its margin; the samples lying on the margin boundaries are the support vectors.]
The margin of a linear classifier is the width by which the boundary could be increased before hitting a data point. Intuitively, it is safer to choose a classifier with a larger margin:
- A wider margin gives a wider buffer zone for mistakes.
- The hyperplane is decided by only a few data points, the support vectors; the others can be discarded.
Selecting the classifier with the maximum margin gives the Linear Support Vector Machine (LSVM), which works very well in practice. How can the margin be specified formally?

Margin Width
Scale $w$ and $b$ so that the closest positive sample $x^+$ and the closest negative sample $x^-$ satisfy $w \cdot x^+ + b = +1$ and $w \cdot x^- + b = -1$. The margin width is then
    $M = \frac{2}{\|w\|}$.

Objective Function
Correctly classify all data points:
    $w \cdot x_i + b \ge +1$ if $y_i = +1$
    $w \cdot x_i + b \le -1$ if $y_i = -1$
or, equivalently, $y_i (w \cdot x_i + b) - 1 \ge 0$.
Maximizing the margin $M = \frac{2}{\|w\|}$ is the same as minimizing $\frac{1}{2} w^T w$, which gives a quadratic optimization problem:
    Minimize  $t(w) = \frac{1}{2} w \cdot w$
    Subject to  $y_i (w \cdot x_i + b) \ge 1$ for all $i$.

Lagrange Multipliers
Introduce multipliers $\alpha_i \ge 0$ and form the primal Lagrangian
    $L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$.
Setting the derivatives to zero gives
    $\frac{\partial L_P}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{l} \alpha_i y_i x_i$,
    $\frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_{i=1}^{l} \alpha_i y_i = 0$.

Dual Problem
Substituting these conditions back into $L_P$ gives the dual
    $L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j = \sum_{i} \alpha_i - \frac{1}{2} \alpha^T H \alpha$,  where  $H_{ij} = y_i y_j \, x_i \cdot x_j$,
    subject to  $\sum_{i} \alpha_i y_i = 0$  and  $\alpha_i \ge 0$ for all $i$.
This is a quadratic problem again!

Solutions of w & b
Support vectors are the samples with positive $\alpha_i$; the decision function is
    $g(x) = \sum_{i=1}^{l} \alpha_i y_i \, x_i \cdot x + b$.
For any support vector $x_s$ we have $y_s (x_s \cdot w + b) = 1$, i.e. $y_s \left( \sum_{m \in S} \alpha_m y_m \, x_m \cdot x_s + b \right) = 1$. Multiplying both sides by $y_s$ and using $y_s^2 = 1$ gives
    $b = y_s - \sum_{m \in S} \alpha_m y_m \, x_m \cdot x_s$.
In practice $b$ is averaged over all $N_s$ support vectors:
    $b = \frac{1}{N_s} \sum_{s \in S} \left( y_s - \sum_{m \in S} \alpha_m y_m \, x_m \cdot x_s \right)$.

An Example
Training set: $x_1 = (1, 1)$ with $y_1 = +1$ and $x_2 = (0, 0)$ with $y_2 = -1$. The matrix $H_{ij} = y_i y_j \, x_i \cdot x_j$ is
    $H = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}$,
so the dual is $L_D = \alpha_1 + \alpha_2 - \alpha_1^2$ subject to $\alpha_1 - \alpha_2 = 0$. Hence $\alpha_1 = \alpha_2$ and $L_D = 2\alpha_1 - \alpha_1^2$, which is maximized at $\alpha_1 = \alpha_2 = 1$. Then
    $w = \sum_{i} \alpha_i y_i x_i = 1 \cdot 1 \cdot [1, 1] + 1 \cdot (-1) \cdot [0, 0] = [1, 1]$,
    $b = y_1 - w \cdot x_1 = 1 - 2 = -1$,
    $g(x) = w \cdot x + b = x_1 + x_2 - 1$  (writing $x = (x_1, x_2)$),
and the margin is $M = \frac{2}{\|w\|} = \frac{2}{\sqrt{2}} = \sqrt{2}$.

Soft Margin
[Figure: samples $e_2$, $e_7$ and $e_{11}$ fall inside the margin or on the wrong side of the boundary.]
When the data are noisy, slack variables $\xi_i \ge 0$ relax the constraints:
    $y_i (w \cdot x_i + b) \ge 1 - \xi_i$,  $\xi_i \ge 0$,
and the objective becomes
    $t(w) = \frac{1}{2} w \cdot w + C \sum_{i} \xi_i$,
where $C$ controls the trade-off between a wide margin and few margin violations. The primal Lagrangian is
    $L_P = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} \mu_i \xi_i$.
Setting the derivatives to zero gives
    $\frac{\partial L_P}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{l} \alpha_i y_i x_i$,
    $\frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_{i=1}^{l} \alpha_i y_i = 0$,
    $\frac{\partial L_P}{\partial \xi_i} = 0 \Rightarrow C = \alpha_i + \mu_i$.
The dual keeps the same form as before,
    $L_D = \sum_{i} \alpha_i - \frac{1}{2} \alpha^T H \alpha$,
now subject to $0 \le \alpha_i \le C$ and $\sum_{i} \alpha_i y_i = 0$.
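The numbers in the worked example above can be reproduced with a few lines of code. The sketch below is only an illustration, assuming Python with NumPy and scikit-learn (neither of which is prescribed by the lecture); a very large $C$ is used so that the soft-margin solver behaves essentially like the hard-margin formulation.

```python
# Minimal sketch (not part of the lecture): reproduce the two-point
# example numerically.  A very large C approximates the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0],    # positive sample, y = +1
              [0.0, 0.0]])   # negative sample, y = -1
y = np.array([1, -1])

clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                      # expected: [1, 1]
b = clf.intercept_[0]                 # expected: -1
margin = 2.0 / np.linalg.norm(w)      # expected: 2 / sqrt(2) = sqrt(2)

print("w =", w, " b =", b, " margin =", margin)
print("support vectors:", clf.support_vectors_)  # both points are support vectors
print("y_i * alpha_i:", clf.dual_coef_)          # magnitudes should be 1, as derived by hand
# Decision function g(x) = w.x + b = x1 + x2 - 1, matching the example above.
```

Both training points come out as support vectors with $\alpha_1 = \alpha_2 = 1$, in agreement with the hand calculation.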
Non-linear SVMs
[Figure: one-dimensional samples around $x = 0$ that no single threshold on $x$ can separate become linearly separable after mapping $x \to (x, x^2)$.]

Feature Space
A non-linear mapping $\Phi: x \to \varphi(x)$ can turn a non-linear decision boundary in the input space into a linear one in the feature space. For example, the circle $x_1^2 + x_2^2 = r^2$ in the $(x_1, x_2)$ plane becomes a straight line in the $(x_1^2, x_2^2)$ plane.
[Figure: a data set that is not linearly separable in the input space becomes linearly separable after the mapping $\Phi: x \to \varphi(x)$.]

Quadratic Basis Functions
One such mapping for $x = (x_1, \ldots, x_m)$ is
    $\varphi(x) = \left( 1,\; \sqrt{2}\,x_1, \ldots, \sqrt{2}\,x_m,\; x_1^2, \ldots, x_m^2,\; \sqrt{2}\,x_1 x_2, \sqrt{2}\,x_1 x_3, \ldots, \sqrt{2}\,x_{m-1} x_m \right)$,
consisting of a constant term, linear terms, pure quadratic terms and quadratic cross-terms. The number of terms is
    $\binom{m+2}{2} = \frac{(m+2)(m+1)}{2} \approx \frac{m^2}{2}$.

Calculation of Φ(xi)·Φ(xj)
For two samples $a$ and $b$, multiplying the feature vectors term by term gives
    $\varphi(a) \cdot \varphi(b) = 1 + 2 \sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2 \sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i a_j b_i b_j$.

It turns out …
    $(a \cdot b + 1)^2 = (a \cdot b)^2 + 2\, a \cdot b + 1 = \left( \sum_{i=1}^{m} a_i b_i \right)^2 + 2 \sum_{i=1}^{m} a_i b_i + 1$
    $= \sum_{i=1}^{m} \sum_{j=1}^{m} a_i b_i a_j b_j + 2 \sum_{i=1}^{m} a_i b_i + 1 = \sum_{i=1}^{m} (a_i b_i)^2 + 2 \sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i b_i a_j b_j + 2 \sum_{i=1}^{m} a_i b_i + 1$,
which is exactly $\varphi(a) \cdot \varphi(b)$. Therefore
    $K(a, b) = (a \cdot b + 1)^2 = \varphi(a) \cdot \varphi(b)$,
but evaluating the explicit inner product costs $O(m^2)$ while evaluating the kernel costs only $O(m)$.

Kernel Trick
The linear classifier relies only on dot products between vectors, $x_i \cdot x_j$. If every data point is mapped into a high-dimensional space via some transformation $\Phi: x \to \varphi(x)$, the dot product becomes $\varphi(x_i) \cdot \varphi(x_j)$. A kernel function is a function that corresponds to an inner product in some expanded feature space:
    $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$.
Example: for $x = [x_1, x_2]$ and $K(x_i, x_j) = (1 + x_i \cdot x_j)^2$,
    $K(x_i, x_j) = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
    $= [1,\; x_{i1}^2,\; \sqrt{2}\,x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\,x_{i1},\; \sqrt{2}\,x_{i2}] \cdot [1,\; x_{j1}^2,\; \sqrt{2}\,x_{j1} x_{j2},\; x_{j2}^2,\; \sqrt{2}\,x_{j1},\; \sqrt{2}\,x_{j2}]$
    $= \varphi(x_i) \cdot \varphi(x_j)$,  where  $\varphi(x) = [1,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2]$.

Kernels
    Polynomial:  $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$
    Gaussian:  $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$
    Hyperbolic tangent:  $K(x_i, x_j) = \tanh(x_i \cdot x_j + c)$

String Kernel
Kernels can also measure the similarity between non-vector data such as text strings (e.g. "car" vs. "custard"). For example, a string kernel gives $K(\mathrm{car}, \mathrm{cat}) = 4$, which can be normalized using the self-similarities $K(\mathrm{car}, \mathrm{car})$ and $K(\mathrm{cat}, \mathrm{cat})$.

Solutions of w & b
With a kernel, the weight vector lives in the feature space:
    $w = \sum_{i=1}^{l} \alpha_i y_i \varphi(x_i)$,
    $w \cdot \varphi(x_j) = \sum_{i=1}^{l} \alpha_i y_i \varphi(x_i) \cdot \varphi(x_j) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x_j)$,
    $b = \frac{1}{N_s} \sum_{s \in S} \left( y_s - \sum_{m \in S} \alpha_m y_m \varphi(x_m) \cdot \varphi(x_s) \right) = \frac{1}{N_s} \sum_{s \in S} \left( y_s - \sum_{m \in S} \alpha_m y_m K(x_m, x_s) \right)$.
The decision function becomes
    $g(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$,
in place of the linear form $g(x) = w \cdot x + b = \sum_{i=1}^{l} \alpha_i y_i \, x_i \cdot x + b$. Note that $w$ itself never has to be computed explicitly.

Decision Boundaries
[Figure: non-linear decision boundaries produced by kernel SVMs.]

More Maths …

SVM Roadmap
- Linear classifier → maximum margin → linear SVM
- Noise → soft margin
- Non-linear problem → replace $a \cdot b$ with $\Phi(a) \cdot \Phi(b)$
- High computational cost → kernel trick: $K(a, b) = \Phi(a) \cdot \Phi(b)$

Reading Materials
Text book:
- Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
Online resources:
- http://www.kernel-machines.org/
- http://www.support-vector-machines.org/
- http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
- http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- A list of papers uploaded to the web learning portal
- Wikipedia & Google

Review
- What is the definition of the margin of a linear classifier?
- Why do we want to maximize the margin?
- What is the mathematical expression for the margin?
- How is the SVM objective function solved?
- What are support vectors?
- What is the soft margin?
- How does the SVM solve non-linear problems?
- What is the so-called "kernel trick"?
- What are the commonly used kernels?

Next Week's Class Talk
Volunteers are required for next week's class talk.
Topic: SVM in Practice
Hints:
- Applications
- Demos
- Multi-class problems
- Software (a very popular toolbox: LIBSVM)
- Any other interesting topics beyond this lecture
Length: 20 minutes plus question time.
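As a quick numerical companion to the "It turns out …" derivation and the kernel-trick review question above, the sketch below (assuming Python with NumPy, which the lecture does not prescribe) builds the explicit quadratic feature map from the Quadratic Basis Functions slide and checks that $\varphi(a) \cdot \varphi(b) = (a \cdot b + 1)^2$. It could also serve as a small warm-up for the class-talk demo.

```python
# Minimal sketch (not part of the lecture): numerically verify the kernel
# trick for the quadratic basis functions, i.e. phi(a).phi(b) == (a.b + 1)^2.
import numpy as np

def phi(x):
    """Explicit quadratic feature map: constant term, sqrt(2)*linear terms,
    pure quadratic terms, and sqrt(2)*cross terms (as on the slides)."""
    m = len(x)
    features = [1.0]
    features += [np.sqrt(2) * x[i] for i in range(m)]
    features += [x[i] ** 2 for i in range(m)]
    features += [np.sqrt(2) * x[i] * x[j]
                 for i in range(m) for j in range(i + 1, m)]
    return np.array(features)

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

explicit = phi(a) @ phi(b)        # builds O(m^2) features, then a dot product
via_kernel = (a @ b + 1.0) ** 2   # O(m) work, no feature map needed

print(explicit, via_kernel)               # the two values agree
print(np.isclose(explicit, via_kernel))   # True (up to rounding)
```

The explicit map produces on the order of $m^2/2$ features, while the kernel evaluation touches each of the $m$ input components only once; this saving is exactly what the kernel trick provides. For the practical demo itself, LIBSVM (or a wrapper around it such as scikit-learn's SVC) already implements the kernels listed above.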