DDA3020 Machine Learning
Lecture 07: Support Vector Machine II
Jicong Fan, School of Data Science, CUHK-SZ
October 17/22, 2022

Outline
1 Lagrange duality and KKT conditions
2 Optimizing SVM by Lagrange duality
3 SVM with slack variables
4 SVM with kernels
5 Others

Lagrange duality

Given a general minimization problem
\min_{w \in \mathbb{R}^n} f(w) \quad \text{subject to} \quad h_i(w) \le 0,\ i = 1,\dots,m; \quad \ell_j(w) = 0,\ j = 1,\dots,r

Note that here w denotes the argument we aim to optimize, rather than a data point.

The Lagrangian function:
L(w, u, v) = f(w) + \sum_{i=1}^{m} u_i h_i(w) + \sum_{j=1}^{r} v_j \ell_j(w)

The Lagrange dual function:
g(u, v) = \min_{w \in \mathbb{R}^n} L(w, u, v)

The dual problem:
\max_{u \in \mathbb{R}^m,\, v \in \mathbb{R}^r} g(u, v) \quad \text{subject to} \quad u \ge 0

KKT conditions

For the general problem above, the Karush-Kuhn-Tucker (KKT) conditions are:
0 \in \partial f(w) + \sum_{i=1}^{m} u_i\, \partial h_i(w) + \sum_{j=1}^{r} v_j\, \partial \ell_j(w)  (stationarity)
u_i \cdot h_i(w) = 0 \ \text{for all } i  (complementary slackness)
h_i(w) \le 0,\ \ell_j(w) = 0 \ \text{for all } i, j  (primal feasibility)
u_i \ge 0 \ \text{for all } i  (dual feasibility)

Reference: S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press, Chapter 5.

Optimization of SVM

The objective function of the support vector machine is
\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \forall i  (1)

It can be transformed to
\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad 1 - y_i(w^\top x_i + b) \le 0,\ \forall i  (2)
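To make problem (2) concrete, here is a minimal numerical sketch that solves the hard-margin primal directly with a generic convex solver. The use of cvxpy and the four toy data points are assumptions for illustration, not part of the lecture.

```python
# Minimal sketch: solving the hard-margin primal (2) with cvxpy on toy separable data.
# cvxpy and the toy data are assumptions, not part of the lecture.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy points
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]   # y_i (w^T x_i + b) >= 1 for all i
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print("w =", w.value, " b =", b.value)
```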
The Lagrangian function of problem (2) is
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \big(1 - y_i(w^\top x_i + b)\big), \quad \alpha_i \ge 0.

The primal and dual optimal solutions should satisfy the KKT conditions:

Stationarity:
\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{m} \alpha_i y_i x_i  (3)
\frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{m} \alpha_i y_i = 0  (4)

Feasibility:
\alpha_i \ge 0, \quad 1 - y_i(w^\top x_i + b) \le 0, \quad \forall i  (5)

Complementary slackness:
\alpha_i \big(1 - y_i(w^\top x_i + b)\big) = 0, \quad \forall i  (6)

Substituting the stationarity conditions into the Lagrangian, we have
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i \Big(\sum_j \alpha_j y_j x_j\Big)^{\!\top} x_i - b\sum_i \alpha_i y_i  (7)
= \frac{1}{2}\|w\|^2 + \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j - b\sum_i \alpha_i y_i  (8)
= \frac{1}{2}\|w\|^2 + \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j  (9)  [using \sum_i \alpha_i y_i = 0]
= \sum_i \alpha_i - \frac{1}{2}\|w\|^2  (10)  [using \|w\|^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j]
= \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j.  (11)

Then we obtain the following dual problem:
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j  (12)
\text{s.t.}\ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i  (13)

It can be solved by any off-the-shelf optimization solver. Substituting the solved \alpha back into the stationarity condition, we obtain the primal solution w:
w = \sum_{i=1}^{m} \alpha_i y_i x_i  (14)

Solution interpretation: the primal solution w and the dual solution \alpha must also satisfy the other KKT conditions,
Feasibility: \alpha_i \ge 0, \quad 1 - y_i(w^\top x_i + b) \le 0, \quad \forall i
Complementary slackness: \alpha_i \big(1 - y_i(w^\top x_i + b)\big) = 0, \quad \forall i

Comparing these conditions, for each x_i:
- If 1 - y_i(w^\top x_i + b) < 0, then \alpha_i = 0.
- If 1 - y_i(w^\top x_i + b) = 0, then \alpha_i \ge 0.

If \alpha_i = 0, then x_i does not contribute to w, i.e., to the SVM classifier. The data points with \alpha_i > 0 construct the classifier; they are called support vectors and lie on the hyperplanes y_i(w^\top x_i + b) = 1. We define the support set as S = \{i \mid \alpha_i > 0\}. This is why the method is called the support vector machine.
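For comparison, a minimal sketch of solving the dual (12)-(13) on the same assumed toy data with cvxpy, then recovering w via (14) and the support set S; the small ridge and the tolerance are numerical conveniences, not part of the derivation.

```python
# Minimal sketch: solving the hard-margin dual (12)-(13) with cvxpy,
# then recovering w via (14) and the support set S. Toy data as before.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T                                  # Gram matrix of inner products x_i^T x_j
Q = np.outer(y, y) * K + 1e-9 * np.eye(m)    # PSD by construction; tiny ridge for numerics
alpha = cp.Variable(m)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                              # w = sum_i alpha_i y_i x_i, eq. (14)
S = np.where(a > 1e-6)[0]                    # support set S = {i : alpha_i > 0}, up to tolerance
print("w =", w, " support set:", S)
# the bias b is determined from the support vectors in the next step of the lecture
```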
Optimization of SVM

The remaining issue is how to determine the bias parameter b.

For any support vector x_j, j \in S, we have
y_j(w^\top x_j + b) = 1, \quad \forall j \in S  (15)
\Rightarrow\ y_j\Big(\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b\Big) = 1, \quad \forall j \in S  (16)

Multiplying both sides by y_j and using y_j \cdot y_j = 1, we have
\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b = y_j, \quad \forall j \in S  (17)
\Rightarrow\ b = \frac{1}{|S|}\sum_{j \in S}\Big(y_j - \sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j\Big)  (18)

Prediction using SVM

Given the optimized parameters \{\alpha, w, b\} and a new data point x, its prediction is
w^\top x + b = \sum_{i=1}^{m} \alpha_i y_i x_i^\top x + \frac{1}{|S|}\sum_{j \in S}\Big(y_j - \sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j\Big)  (19)

- If w^\top x + b > 0, the predicted class of x is +1; otherwise it is -1.
- The prediction is correct if and only if y(w^\top x + b) > 0, where y is the true label.
- Note that the prediction for new data depends on inner products with the training data; this observation is the key to deriving the kernel SVM later.

SVM with slack variables

In the above derivation, we assume that all primal constraints y_i(w^\top x_i + b) \ge 1, \forall i can be satisfied, i.e., that the training data is linearly separable. However, samples of different classes sometimes overlap (the data is nonseparable). In that case some constraints are violated and no feasible solution exists.

To handle such data, we introduce slack variables \xi_i \ge 0:
- We allow some error for each training point, i.e., y_i(w^\top x_i + b) \ge 1 - \xi_i, \forall i, instead of y_i(w^\top x_i + b) \ge 1, \forall i.
- But we want these errors \xi_i to be small.
- Please plot the corresponding hinge loss with slack variables (a sketch follows below).
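A minimal plotting sketch answering the prompt above, assuming matplotlib is available; it shows the hinge loss max(0, 1 - y(w^T x + b)), which is exactly the optimal value of the slack variable \xi_i.

```python
# Minimal sketch of the hinge loss max(0, 1 - y f(x)); at the optimum the slack
# variable equals xi_i = max(0, 1 - y_i (w^T x_i + b)). matplotlib is assumed available.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-2, 3, 200)          # z = y * (w^T x + b), the signed margin
hinge = np.maximum(0.0, 1.0 - z)     # hinge loss = slack value
plt.plot(z, hinge)
plt.xlabel("y (w^T x + b)")
plt.ylabel("hinge loss = slack")
plt.title("Hinge loss with slack variables")
plt.show()
```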
With the slack variables, the SVM is formulated as follows:
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \xi_i  (20)
\text{s.t.}\ 1 - \xi_i - y_i(w^\top x_i + b) \le 0, \quad -\xi_i \le 0, \quad \forall i

Its Lagrangian function is
L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} \alpha_i\big(1 - \xi_i - y_i(w^\top x_i + b)\big) + \sum_{i=1}^{m} \mu_i(-\xi_i),
with \alpha_i, \mu_i \ge 0, \forall i.

The primal and dual optimal solutions should satisfy the KKT conditions:

Stationarity:
\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{m} \alpha_i y_i x_i  (21)
\frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{m} \alpha_i y_i = 0  (22)
\frac{\partial L}{\partial \xi_i} = 0 \ \Rightarrow\ \alpha_i = C - \mu_i, \ \forall i  (23)

Feasibility:
\alpha_i \ge 0, \quad 1 - \xi_i - y_i(w^\top x_i + b) \le 0, \quad \xi_i \ge 0, \quad \mu_i \ge 0, \quad \forall i  (24)

Complementary slackness:
\alpha_i\big(1 - \xi_i - y_i(w^\top x_i + b)\big) = 0, \quad \mu_i \xi_i = 0, \quad \forall i  (25)

Substituting all stationarity conditions into the Lagrangian to eliminate the primal variables, we have
L(\alpha, \mu) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i\big(1 - y_i(w^\top x_i + b)\big) + \sum_i (C - \alpha_i - \mu_i)\xi_i = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j,  (26)
where the \xi terms vanish by (23) and the remaining terms simplify exactly as in (7)-(11).

Then we obtain the following dual problem:
\max_{\alpha,\mu}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j  (27)
\text{s.t.}\ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad \mu_i \ge 0, \quad \alpha_i = C - \mu_i, \ \forall i  (28)

Using \alpha_i = C - \mu_i to eliminate \mu, we obtain a simpler dual problem:
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j  (29)
\text{s.t.}\ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ \forall i  (30)

Note that the only change in the dual problem is the constraint 0 \le \alpha_i \le C, which was \alpha_i \ge 0 in the dual problem of the standard SVM.
Reference: https://nianlonggu.com/2019/06/07/tutorial-on-SVM/

Solution interpretation: the solution \alpha_i falls into three cases, \alpha_i = 0, 0 < \alpha_i < C, and \alpha_i = C.
- \alpha_i = 0: the corresponding data point is correctly classified and does not contribute to the classifier; it lies outside the margin.
- 0 < \alpha_i < C: in this case \mu_i > 0 (since \alpha_i = C - \mu_i), and \mu_i \xi_i = 0 implies \xi_i = 0. The corresponding data point is correctly classified, contributes to the classifier, and lies on the margin.
- \alpha_i = C: in this case \mu_i = 0, so \xi_i may be positive. The corresponding data point contributes to the classifier and lies inside the margin.
  - If \xi_i \le 1, the point is still correctly classified (it does not cross the decision boundary).
  - If \xi_i > 1, the point is misclassified (it crosses the decision boundary).

How do we determine the bias parameter b? We define M = \{i \mid 0 < \alpha_i < C\}. Since 0 < \alpha_i < C implies \xi_i = 0, for any support vector x_j, j \in M, we have
y_j(w^\top x_j + b) = 1, \quad \forall j \in M  (31)
\Rightarrow\ y_j\Big(\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b\Big) = 1, \quad \forall j \in M  (32)

Using y_j \cdot y_j = 1, we have
\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b = y_j, \quad \forall j \in M  (33)
\Rightarrow\ b = \frac{1}{|M|}\sum_{j \in M}\Big(y_j - \sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j\Big)  (34)

Note that averaging over all (margin) support vectors, rather than using a single support vector, makes the solution for b more numerically stable.

Optimization of SVM

Why do we prefer to optimize the dual problem rather than the primal problem directly? There are two main advantages:
- By examining the dual form of the optimization problem, we gain significant insight into the structure of the problem.
- The entire algorithm can be written in terms of inner products between input feature vectors. In the following, we exploit this property to apply kernels to the classification problem; the resulting algorithm will be able to learn efficiently in very high-dimensional feature spaces.

Reference: https://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-wh
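Before moving on to kernels, a small numpy sketch of the bookkeeping above: categorizing points by their \alpha_i value and computing b as in (34). It assumes \alpha has already been obtained from the box-constrained dual (29)-(30) by some QP solver; the tolerance is an implementation convenience.

```python
# Minimal sketch: categorizing points by alpha and computing b as in (34),
# assuming alpha has already been obtained from the soft-margin dual (29)-(30).
import numpy as np

def soft_margin_summary(X, y, alpha, C, tol=1e-6):
    w = (alpha * y) @ X                                            # w = sum_i alpha_i y_i x_i
    margin_sv = np.where((alpha > tol) & (alpha < C - tol))[0]     # M = {i : 0 < alpha_i < C}
    bound_sv = np.where(alpha >= C - tol)[0]                       # alpha_i = C (inside margin)
    non_sv = np.where(alpha <= tol)[0]                             # alpha_i = 0 (outside margin)
    b = np.mean(y[margin_sv] - X[margin_sv] @ w)                   # eq. (34); assumes M is non-empty
    return w, b, margin_sv, bound_sv, non_sv
```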
Kernels

In the above derivation, the SVM decision boundary is linear. For data that is not linearly separable (e.g., XOR-type data), how can we use SVM? Recall that polynomial regression can handle data that is not linearly separable.

SVM with polynomial hypothesis function

Predict y = 1 if w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2 + \cdots \ge 0. As introduced before, one can choose a high-order polynomial hypothesis function to handle data that is not linearly separable:
f_{w,b}(x) = w^\top [1;\ x_1;\ x_2;\ x_1 x_2;\ x_1^2;\ x_2^2;\ \cdots]  (35)

However, in many real problems, such as image classification, the dimensionality of the original features |x| is already very high. Consequently, the dimensionality of the high-order polynomial features becomes far too high, causing heavy computational cost or overfitting. To tackle this difficulty, we introduce kernels.

Kernels

Given a new data point x, compute its new features based on proximity to landmarks l^{(1)}, l^{(2)}, l^{(3)}, here using the Gaussian kernel:
f_1 = \text{similarity}(x, l^{(1)}) = \exp\!\Big(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\Big)  (36)
f_2 = \text{similarity}(x, l^{(2)}) = \exp\!\Big(-\frac{\|x - l^{(2)}\|^2}{2\sigma^2}\Big)  (37)
f_3 = \text{similarity}(x, l^{(3)}) = \exp\!\Big(-\frac{\|x - l^{(3)}\|^2}{2\sigma^2}\Big)  (38)
Then we have a new representation [f_1; f_2; f_3] for the data point x.

Kernel and similarity:
f_1 = \text{similarity}(x, l^{(1)}) = \exp\!\Big(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\Big)  (39)
- If x \approx l^{(1)}, then f_1 \approx 1.
- If x is far from l^{(1)}, then f_1 \approx 0.

Prediction with kernels: predict "1" if w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 \ge 0. For example, with w_0 = -0.5, w_1 = 1, w_2 = 1, w_3 = 0, we predict "1" whenever f_1 + f_2 - 0.5 \ge 0. Consequently, we obtain a non-linear decision boundary.

Given x: f_i = \text{similarity}(x, l^{(i)}) = \exp\!\big(-\|x - l^{(i)}\|^2 / (2\sigma^2)\big). How do we obtain the landmark points? We can set all training data points as landmarks.
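A minimal numpy sketch of the Gaussian similarity features (36)-(38); the landmark coordinates and \sigma are placeholders for illustration.

```python
# Minimal sketch: Gaussian similarity features f_i = exp(-||x - l^(i)||^2 / (2 sigma^2)),
# with a few toy landmarks; in practice every training point can be used as a landmark.
import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    d2 = np.sum((landmarks - x) ** 2, axis=1)     # squared distances ||x - l^(i)||^2
    return np.exp(-d2 / (2.0 * sigma ** 2))       # one feature per landmark

landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])    # toy landmarks l^(1..3)
print(gaussian_features(np.array([0.1, 0.0]), landmarks))      # close to l^(1): f_1 near 1
```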
SVM with Kernels

We first define the kernel
k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)  (40)

Using this kernel to replace x_i^\top x_j, we obtain the following dual problem:
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)  (41)
\text{s.t.}\ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ \forall i  (42)

The solution for b becomes
b = \frac{1}{|S|}\sum_{j \in S}\Big(y_j - \sum_{i=1}^{m} \alpha_i y_i k(x_i, x_j)\Big)  (43)

The prediction for a new data point x becomes
f(x) = w^\top \phi(x) + b = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b  (44)

Since \alpha is sparse, the above classifier is also called a sparse kernel classifier.

Widely used kernels:
- Polynomial kernel: k(x, x_i) = \big(1 + x^\top x_i / \sigma^2\big)^p, \ p > 0  (45)
- Radial basis function (RBF) kernel: k(x, x_i) = \exp\!\big(-\|x - x_i\|^2 / (2\sigma^2)\big)  (46)
- Sigmoidal kernel: k(x, x_i) = \dfrac{1}{1 + \exp\!\big(-(x^\top x_i + b)/\sigma^2\big)}  (47)

Figure: Decision boundaries produced by SVM with a 2nd-order polynomial kernel (top-left), a 3rd-order polynomial kernel (top-right), and an RBF kernel (bottom).

Multi-class SVM

SVM is designed for binary classification: f(x) > 0 \Rightarrow x \in Class 1; f(x) \le 0 \Rightarrow x \in Class 2. To classify multiple classes, y \in \{1, 2, \dots, K\}, we use the one-vs-rest approach, which reduces a K-class problem to K binary classification problems. Many SVM packages already have built-in multi-class classification functionality. Otherwise, use the one-vs-rest (one-vs-all) method: train K SVMs, one to distinguish y = k from the rest, for k = 1, 2, \dots, K, obtaining (w^{(1)}, b^{(1)}), \dots, (w^{(K)}, b^{(K)}). Predict the label of x as
\arg\max_{k \in \{1,2,\dots,K\}} {w^{(k)}}^{\!\top} x + b^{(k)}

SVM vs. Logistic regression

SVM: hinge loss
\text{loss}(f(x_i), y_i) = \big[1 - y_i(w^\top x_i + b)\big]_+
Logistic regression: log loss (negative log conditional likelihood)
\text{loss}(f(x_i), y_i) = -\log P(y_i \mid x_i, w, b) = \log\!\big(1 + e^{-y_i(w^\top x_i + b)}\big)

Logistic regression (LR) vs. SVM: let n = |x| be the number of features and m = |D_{train}| the number of training data points.
- If n is large (relative to m), the data is likely (close to) linearly separable; use LR or SVM without a kernel.
- If n is small and m is intermediate, the data may not be linearly separable; use SVM with a Gaussian kernel.
- If n is small and m is large, create/add more features to make the data more separable, and then use LR or SVM without a kernel. Why not choose SVM with a kernel in this case? (Hint: the kernel matrix is m x m, so training a kernel SVM becomes expensive when m is large.)
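Collecting equations (44) and (46) from this section, a minimal sketch of the kernel-SVM decision function; \alpha and b are assumed to have been obtained already from the kernel dual (41)-(42).

```python
# Minimal sketch of prediction with an RBF kernel, eq. (44) with the kernel (46);
# alpha and b are assumed to come from solving the kernel dual (41)-(42).
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, X_train, y_train, alpha, b, sigma=1.0):
    s = sum(alpha[i] * y_train[i] * rbf_kernel(X_train[i], x, sigma)
            for i in range(len(alpha)))
    return 1 if s + b > 0 else -1          # predict +1 if f(x) > 0, otherwise -1
```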
Software of SVM

Using SVM in practice

You are not required to implement SVM yourself; there are many well-implemented software packages, such as libsvm. When you use such software to learn an SVM model, you need to specify:
- The parameter C (i.e., the trade-off hyperparameter for the slack variables).
- The choice of kernel:
  - Linear kernel.
  - Gaussian kernel; in this case you must also set the kernel width (i.e., the variance of the Gaussian), and you should perform feature scaling before using the Gaussian kernel.

Reading material

References:
- Andrew Ng's note on SVM: https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf
- Chapter 7.1 of Bishop's book
- KKT conditions: https://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/12-kkt.pdf

More variants of SVM:
- Semi-supervised SVM
- Structured SVM
- SVM with latent variables
- SVM for regression

Summary of SVM

What you need to know:
- Lagrange duality and KKT conditions
- Support vector machine:
  - Derivation of the large margin
  - Derivation of the hinge loss
  - Optimization using the dual problem and KKT conditions
  - SVM with slack variables
  - SVM with kernels
- Relationship between SVM and logistic regression
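Finally, to tie the points from "Using SVM in practice" together, a minimal scikit-learn sketch with feature scaling and an RBF kernel; the toy dataset and the values of C and gamma are placeholders that one would normally tune by cross-validation.

```python
# A minimal sketch of training an RBF-kernel SVM with feature scaling,
# assuming scikit-learn is available; C and gamma are placeholder values.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)    # toy nonseparable data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),                      # feature scaling before the Gaussian kernel
    SVC(kernel="rbf", C=1.0, gamma=0.5),   # C: slack trade-off; gamma = 1 / (2 * sigma^2)
)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```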