Introduction to Machine Learning
CSE474/574: Support Vector Machines
Varun Chandola <chandola@buffalo.edu>

Outline

Contents
1 Support Vector Machines
  1.1 SVM Learning
  1.2 Solving the SVM Optimization Problem
2 Quadratic Optimization and Lagrange Multipliers
  2.1 Karush-Kuhn-Tucker Conditions
  2.2 Kernel SVM
  2.3 Support Vectors
  2.4 Optimization Constraints
3 Linear Classifiers and Loss Functions
  3.1 Regularizers
  3.2 Approximate Regularization

1 Support Vector Machines

1.1 SVM Learning

• SVM learning task as an optimization problem
• Find w and b that give zero training error
• Maximize the margin (= 2/||w||)
• Same as minimizing ||w||

Optimization Formulation

  minimize_{w,b}  ||w||^2 / 2
  subject to  y_n (w^T x_n + b) ≥ 1,  n = 1, ..., N.

• Optimization with N linear inequality constraints

A Different Interpretation of the Margin

• What impact does the margin have on w?
• Large margin ⇒ small ||w||
• Small ||w|| ⇒ regularized/simple solutions
• Simple solutions ⇒ better generalizability (Occam's Razor)
• Computational learning theory provides a formal justification [1]

1.2 Solving the SVM Optimization Problem

Optimization Formulation

  minimize_{w,b}  ||w||^2 / 2
  subject to  y_n (w^T x_n + b) ≥ 1,  n = 1, ..., N.
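The hard-margin constraints and the margin width 2/||w|| can be checked numerically; this is a minimal sketch, with the hyperplane (w, b) and the two training points assumed purely for illustration:

```python
import math

# Hard-margin SVM constraints: y_n * (w . x_n + b) >= 1 for all n.
# Geometric margin of the separating hyperplane: 2 / ||w||.
# Hypothetical hyperplane and data, assumed for illustration.
w = [1.0, 1.0]
b = -3.0
data = [((1.0, 1.0), -1), ((2.0, 2.0), +1)]

def functional_margin(w, b, x, y):
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Every point must lie on or outside the margin boundaries.
assert all(functional_margin(w, b, x, y) >= 1 for x, y in data)

margin = 2 / math.sqrt(sum(wi * wi for wi in w))
print(round(margin, 4))  # 2/sqrt(2) ~ 1.4142
```

Here both points satisfy their constraint with equality, i.e. they lie exactly on the margin boundaries.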
• A hyperplane-based classifier defined by w and b
• Like the perceptron
• Find the hyperplane with the maximum separation margin on the training data
• Assume that the data is linearly separable (we will relax this later)
  – Zero training error (loss)

SVM Prediction Rule

  y = sign(w^T x + b)

SVM Learning

• Input: training data {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}
• Objective: learn w and b that maximize the margin
• There is a quadratic objective function to minimize and N constraints
• "Off-the-shelf" packages: quadprog (MATLAB), CVXOPT
• Is that the best way?

2 Quadratic Optimization and Lagrange Multipliers

• A tool for solving constrained optimization problems of differentiable functions

  minimize_{x,y}  f(x, y) = 2 - x^2 - 2y^2
  subject to  h(x, y) = x + y - 1 = 0.

• A Lagrange multiplier (β) lets you combine the objective and the constraint into a single function, the Lagrangian:

  L(x, y, β) = f(x, y) - β h(x, y) = 2 - x^2 - 2y^2 - β(x + y - 1)

Setting the gradient to 0 with respect to x, y, and β gives the optimal values:

  ∂L/∂x = -2x - β = 0
  ∂L/∂y = -4y - β = 0
  ∂L/∂β = x + y - 1 = 0

Handling Inequality Constraints

  minimize_{x,y}  f(x, y) = x^3 + y^2
  subject to  g(x) = x^2 - 1 ≥ 0.

• Inequality constraints are transferred as constraints on the Lagrange multiplier α:
  – g(x) ≥ 0 ⇒ α ≥ 0
  – g(x) ≤ 0 ⇒ α ≤ 0
  – g(x) = 0 ⇒ α is unconstrained

Solution

1. Write the objective as a Lagrangian:

  L(x, y, α) = f(x, y) - α g(x) = x^3 + y^2 - α(x^2 - 1)

2. Solve for the gradient of the Lagrangian:

  ∂L/∂x = 3x^2 - 2αx = 0
  ∂L/∂y = 2y = 0
  ∂L/∂α = x^2 - 1 = 0

Multiple Constraints

  minimize_{x,y}  f(x, y) = x^2 + y^2
  subject to  h_1(x, y) = x + 1 = 0
              h_2(x, y) = y + 1 = 0.

• With multiple equality constraints, the Lagrangian uses one multiplier per constraint:

  L(x, y, β) = f(x, y) - Σ_i β_i h_i(x, y)
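For the equality-constrained example above, solving the three gradient equations by hand gives β = -4/3, x = 2/3, y = 1/3 (these values are worked out here, not stated in the notes); a quick numerical check confirms the stationary point:

```python
# Verify the stationary point of the Lagrangian for
#   f(x, y) = 2 - x^2 - 2y^2  subject to  h(x, y) = x + y - 1 = 0.
# From dL/dx = -2x - beta = 0 and dL/dy = -4y - beta = 0:
#   x = -beta/2, y = -beta/4; substituting into x + y = 1 gives beta = -4/3.
beta = -4.0 / 3.0
x, y = -beta / 2, -beta / 4

assert abs(-2 * x - beta) < 1e-12   # dL/dx = 0
assert abs(-4 * y - beta) < 1e-12   # dL/dy = 0
assert abs(x + y - 1) < 1e-12       # constraint h(x, y) = 0 holds
print(x, y, 2 - x**2 - 2 * y**2)    # stationary value of f is 4/3
```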
minimize x3 + y 2 • Inequality constraints are transferred as constraints on the Lagrangian, α x,y,β ∂L = −2x − β = 0 ∂x f (x, y) = subject to minimize L(x, y, β) = f (x, y) − βg(x, y) L(x, y, β) = 2 − x2 − 2y 2 − β(x + y − 1) 4 βi hi (x, y) i Furthermore we require that: α≥0 From above equations we get y = 0, x = ±1 and α = ± 23 . But since α ≥ 0, hence α = 32 . This gives x = 1, y = 0, and f = 1. Handling Both Constraints The Lagrangian can be written as: minimize w subject to gi (w) ≤ 0 and hi (w) = 0 L(x, y, β) = f (x, y) − β1 h1 (x, y) − −β2 h2 (x, y) = x2 + y 2 − β1 (x + 1) − β2 (y + 1) Setting the gradient of the Lagrangian to zero, we obtain: ∂ L(x, y, β) = 2x − β1 = 0 ∂x ∂ L(x, y, β) = 2y − β2 = 0 ∂y ∂ L(x, y, β) = x + 1 = 0 ∂β1 ∂ L(x, y, β) = y + 1 = 0 ∂β2 f (w) Generalized Lagrangian L(w, α, β) = f (w) + k X αi gi (w) + i=1 l X βi hi (w) i=1 Primal and Dual Formulations Primal Optimization • Let θP be defined as: θP (w) = max L(w, α, β) α,β:αi ≥0 Solving the above four equations we get, x = −1, y = −1, β1 = −2, β2 = −2, and f = 2. i = 1, . . . , k i = 1, . . . , l. 2. Quadratic Optimization and Lagrange Multipliers • One can prove that the optimal value for the original constrained problem is same as: 5 2. Quadratic Optimization and Lagrange Multipliers Optimization Formulation kwk2 w,b 2 subject to yn (w> xn + b) ≥ 1, n = 1, . . . , N. minimize p∗ = minθP (w) = min max L(w, α, β) w w α,β:αi ≥0 Consider θP (w) = = A Toy Example max L(w, α, β) α,β:αi ≥0 max f (w) + α,β:αi ≥0 k X 6 αi gi (w) + i=1 l X • x ∈ <2 βi hi (w) • Two training points: x1 , y1 = (1, 1), −1 i−1 It is easy to show that if any constraints are not satisfied, i.e., if either gi (w) > 0 or hi (w) 6= 0, then θP (w) = ∞. 
Which means that:

  θ_P(w) = { f(w)  if the primal constraints are satisfied
           { ∞     otherwise

Primal and Dual Formulations (II)

Dual Optimization

• Consider θ_D, defined as:

  θ_D(α, β) = min_w L(w, α, β)

• The dual optimization problem can be posed as:

  d* = max_{α,β: α_i ≥ 0} θ_D(α, β) = max_{α,β: α_i ≥ 0} min_w L(w, α, β)

Is d* = p*?

• Note that d* ≤ p*
• The "max min" of a function is always less than or equal to its "min max"
• When will they be equal?

Relation Between Primal and Dual

• In general, d* ≤ p*
• For d* = p*, certain conditions should hold
• These are known as the Karush-Kuhn-Tucker (KKT) conditions
• For d* = p* = L(w*, α*, β*):

  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂β_i = 0,  i = 1, ..., l
  α_i* g_i(w*) = 0,  i = 1, ..., k
  g_i(w*) ≤ 0,  i = 1, ..., k
  α_i* ≥ 0,  i = 1, ..., k

A Toy Example

• x ∈ R^2
• Two training points: x_1, y_1 = (1, 1), -1 and x_2, y_2 = (2, 2), +1
• Find the best hyperplane w = (w_1, w_2)

Optimization problem for the toy example:

  minimize_w  f(w) = ||w||^2 / 2
  subject to  g_1(w, b) = y_1(w^T x_1 + b) - 1 ≥ 0
              g_2(w, b) = y_2(w^T x_2 + b) - 1 ≥ 0.

Substituting the actual values for x_1, y_1 and x_2, y_2:

  minimize_w  f(w) = ||w||^2 / 2
  subject to  g_1(w, b) = -(w^T x_1 + b) - 1 ≥ 0
              g_2(w, b) = (w^T x_2 + b) - 1 ≥ 0.

The above problem can also be written as:

  minimize_{w_1,w_2,b}  f(w_1, w_2) = (w_1^2 + w_2^2) / 2
  subject to  g_1(w_1, w_2, b) = -(w_1 + w_2 + b) - 1 ≥ 0
              g_2(w_1, w_2, b) = (2w_1 + 2w_2 + b) - 1 ≥ 0.

To solve the toy optimization problem, we rewrite it in Lagrangian form:

  L(w_1, w_2, b, α) = (w_1^2 + w_2^2)/2 + α_1(w_1 + w_2 + b + 1) - α_2(2w_1 + 2w_2 + b - 1)

Setting ∇L = 0, we get:

  ∂L/∂w_1 = w_1 + α_1 - 2α_2 = 0
  ∂L/∂w_2 = w_2 + α_1 - 2α_2 = 0
  ∂L/∂b = α_1 - α_2 = 0
  ∂L/∂α_1 = w_1 + w_2 + b + 1 = 0
  ∂L/∂α_2 = 2w_1 + 2w_2 + b - 1 = 0

Solving the above equations, we get w_1 = w_2 = 1 and b = -3.
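The toy-example solution can be checked directly against the five gradient equations; α_1 = α_2 = 1 follows from α_1 = α_2 together with w_1 + α_1 - 2α_2 = 0:

```python
# Verify the toy-example solution w = (1, 1), b = -3 against the
# stationarity conditions derived above.
w1, w2, b = 1.0, 1.0, -3.0
a1 = a2 = 1.0

assert w1 + a1 - 2 * a2 == 0
assert w2 + a1 - 2 * a2 == 0
assert a1 - a2 == 0
assert w1 + w2 + b + 1 == 0          # g1 active: (1, 1) lies on the margin
assert 2 * w1 + 2 * w2 + b - 1 == 0  # g2 active: (2, 2) lies on the margin
print(w1, w2, b)
```

Both constraints are active at the optimum, so both training points are support vectors.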
Back to SVM Optimization

Optimization Formulation

  minimize_{w,b}  ||w||^2 / 2
  subject to  y_n (w^T x_n + b) ≥ 1,  n = 1, ..., N.

Rewriting as a (primal) Lagrangian, introducing Lagrange multipliers α_n, n = 1, ..., N:

  minimize_{w,b}  L_P(w, b, α) = ||w||^2 / 2 + Σ_{n=1}^{N} α_n {1 - y_n(w^T x_n + b)}
  subject to  α_n ≥ 0,  n = 1, ..., N.

Solving the Lagrangian

• Set the gradient of L_P to 0:

  ∂L_P/∂w = 0 ⇒ w = Σ_{n=1}^{N} α_n y_n x_n
  ∂L_P/∂b = 0 ⇒ Σ_{n=1}^{N} α_n y_n = 0

• Substitute into L_P to get the dual L_D:

Dual Lagrangian Formulation

  maximize_α  L_D(α) = Σ_{n=1}^{N} α_n - (1/2) Σ_{m,n=1}^{N} α_m α_n y_m y_n (x_m^T x_n)
  subject to  Σ_{n=1}^{N} α_n y_n = 0,  α_n ≥ 0,  n = 1, ..., N.

• The dual Lagrangian is a quadratic programming problem in the α_n's
  – Use "off-the-shelf" solvers
• Having found the α_n's:

  w = Σ_{n=1}^{N} α_n y_n x_n

• What will be the bias term b?

Investigating the Karush-Kuhn-Tucker Conditions

• For the primal and dual formulations
• We can optimize the dual formulation (as shown earlier)
• The solution should satisfy the Karush-Kuhn-Tucker (KKT) conditions:

  ∂L_P(w, b, α)/∂w = w - Σ_{n=1}^{N} α_n y_n x_n = 0    (1)
  ∂L_P(w, b, α)/∂b = -Σ_{n=1}^{N} α_n y_n = 0           (2)
  y_n {w^T x_n + b} - 1 ≥ 0                              (3)
  α_n ≥ 0                                                (4)
  α_n (y_n {w^T x_n + b} - 1) = 0                        (5)

• Use KKT condition #5
• For α_n > 0:

  y_n {w^T x_n + b} - 1 = 0

• Which means that:

  b = - ( max_{n: y_n = -1} w^T x_n + min_{n: y_n = 1} w^T x_n ) / 2

2.2 Kernel SVM

Observation 1: Dot Product Formulation

• All training examples (the x_n's) occur only in dot/inner products
• Also recall the prediction using SVMs:

  y* = sign(w^T x* + b)
     = sign((Σ_{n=1}^{N} α_n y_n x_n)^T x* + b)
     = sign(Σ_{n=1}^{N} α_n y_n (x_n^T x*) + b)

• Replace the dot products with kernel functions
  – Kernel or non-linear SVM
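The dot-product form of the prediction can be sketched with the toy problem's values (α_1 = α_2 = 1 and b = -3, as derived from its stationarity equations); with a linear kernel, the dual-form prediction must agree with sign(w^T x* + b):

```python
# Dual-form prediction: y* = sign(sum_n alpha_n y_n (x_n . x*) + b).
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

X = [(1.0, 1.0), (2.0, 2.0)]
Y = [-1, +1]
alpha = [1.0, 1.0]
b = -3.0
# Recover w = sum_n alpha_n y_n x_n from the dual variables.
w = [sum(alpha[n] * Y[n] * X[n][d] for n in range(2)) for d in range(2)]

def predict_dual(x):
    s = sum(a * y * dot(xn, x) for a, y, xn in zip(alpha, Y, X)) + b
    return 1 if s > 0 else -1

def predict_primal(x):
    return 1 if dot(w, x) + b > 0 else -1

# The two forms agree on arbitrary test points.
for x in [(0.5, 0.5), (3.0, 3.0), (1.0, 1.0), (2.0, 2.0)]:
    assert predict_dual(x) == predict_primal(x)
print(predict_dual((3.0, 3.0)), predict_dual((0.5, 0.5)))  # 1 -1
```

Replacing `dot` with a kernel function k(x_n, x*) gives the kernel-SVM prediction rule.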
2.3 Support Vectors

Observation 2: Most α_n's are 0

• KKT condition #5: α_n (y_n {w^T x_n + b} - 1) = 0
• If x_n is not on the margin, then y_n {w^T x_n + b} > 1 ⇒ α_n = 0
• α_n ≠ 0 only for x_n on the margin
• These are the support vectors
• Only these are needed for prediction

[Figure: the separating hyperplane w^T x + b = 0, the margin boundaries w^T x + b = ±1, and the margin width 2/||w||; the support vectors lie on the margin boundaries.]

One can see from the prediction equation that:

  y* = sign(Σ_{n=1}^{N} α_n y_n (x_n^T x*) + b)

In the summation, the entries for x_n that do not lie on the margin contribute nothing to the sum, because α_n = 0 for those x_n's. Hence we only need the examples with non-zero α_n to make a prediction.

What have we seen so far?

• For linearly separable data, SVM learns a weight vector w
• It maximizes the margin
• SVM training is a constrained optimization problem
  – Each training example should lie outside the margin
  – N constraints

2.4 Optimization Constraints

• It is OK to have some misclassified training examples
  – Some ξ_n's will be non-zero
• Minimize the number of such examples
  – Minimize Σ_{n=1}^{N} ξ_n
• Optimization problem for the non-separable case:

  minimize_{w,b}  f(w, b) = ||w||^2 / 2 + C Σ_{n=1}^{N} ξ_n
  subject to  y_n(w^T x_n + b) ≥ 1 - ξ_n,  ξ_n ≥ 0,  n = 1, ..., N.

• We cannot go for zero training error
• We still learn a maximum-margin hyperplane
• C controls the trade-off between the margin and the margin errors
• What is the role of C?
  1. Allow some examples to be misclassified
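Given a hyperplane, the smallest feasible slack for each example is ξ_n = max(0, 1 - y_n(w^T x_n + b)), i.e. its hinge loss; this sketch evaluates the slacks and the soft-margin objective on an assumed hyperplane and data set:

```python
# Slack variables for the soft-margin constraints
# y_n (w . x_n + b) >= 1 - xi_n; hyperplane and data assumed for illustration.
w, b = [1.0, 1.0], -3.0
data = [((1.0, 1.0), -1),   # on the margin      -> xi = 0
        ((2.0, 2.0), +1),   # on the margin      -> xi = 0
        ((2.5, 0.9), +1),   # inside the margin  -> 0 < xi < 1
        ((1.0, 1.5), +1)]   # misclassified      -> xi > 1

def slack(w, b, x, y):
    return max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))

C = 1.0
objective = (0.5 * sum(wi * wi for wi in w)
             + C * sum(slack(w, b, x, y) for x, y in data))
print([round(slack(w, b, x, y), 2) for x, y in data])  # [0.0, 0.0, 0.6, 1.5]
```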
  2. Allow some examples to fall inside the margin

• Similar optimization procedure as for the separable case (QP for the dual)
• The weights have the same expression:

  w = Σ_{n=1}^{N} α_n y_n x_n

• All training examples still appear only in dot products (kernelizable)

Introducing Slack Variables

• Separable case: to ensure zero training loss, the constraint was

  y_n(w^T x_n + b) ≥ 1  ∀n = 1, ..., N

• Non-separable case: relax the constraint to

  y_n(w^T x_n + b) ≥ 1 - ξ_n  ∀n = 1, ..., N

• ξ_n is called a slack variable (ξ_n ≥ 0)
• For a misclassification, ξ_n > 1
• The support vectors are slightly different:
  1. Points on the margin (ξ_n = 0)
  2. Points inside the margin but on the correct side (0 < ξ_n < 1)
  3. Points on the wrong side of the hyperplane (ξ_n ≥ 1)

C dictates whether we focus more on maximizing the margin or on reducing the training error.

• Training time for SVM training is O(N^3)
• Many faster but approximate approaches exist
  – Approximate QP solvers
  – Online training
• SVMs can be extended in different ways:
  1. Non-linear boundaries (kernel trick)
  2. Multi-class classification
  3. Regression (Support Vector Regression)

3 Linear Classifiers and Loss Functions

• Linear binary classification can be written as a general optimization problem:

  min_{w,b} L(w, b) = min_{w,b} Σ_{n=1}^{N} I(y_n(w^T x_n + b) < 0) + λ R(w, b)

• I is an indicator function (1 if its argument is true, 0 otherwise)
• Objective function = loss function + λ · regularizer
• The objective function wants to fit the training data well and have a simpler solution

3.1 Regularizers

• Recall the optimization problem for linear classification:

  min_{w,b} L(w, b) = min_{w,b} Σ_{n=1}^{N} I(y_n(w^T x_n + b) < 0) + λ R(w, b)

• What is the role of the regularizer term?
  – Ensure simplicity
• Ideally we want most entries of w to be zero
• Why?
• Desired minimization:

  R(w, b) = Σ_{d=1}^{D} I(w_d ≠ 0)

• This is NP-hard

The reason we want most entries of the weight vector w to be 0 is so that the prediction depends on only a few features.
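The 0-1 training loss from the general objective above simply counts misclassified examples; a minimal sketch, with the parameters and data assumed for illustration:

```python
# 0-1 loss term of the general objective:
# L(w, b) = sum_n I[y_n (w . x_n + b) < 0] + lambda * R(w, b).
def zero_one_loss(w, b, data):
    return sum(1 for x, y in data
               if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 0)

data = [((1.0, 1.0), -1), ((2.0, 2.0), +1), ((0.0, 0.0), +1)]
w, b = [1.0, 1.0], -3.0
print(zero_one_loss(w, b, data))  # third point is misclassified -> 1
```

Because this count changes in jumps of 1, small changes in w and b can leave it flat or flip it abruptly, which is why surrogate losses are used in practice.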
This ensures that changes in x_d for the zeroed-out features do not change the prediction, giving a simpler model (higher bias, lower variance).

• A combinatorial optimization problem
• NP-hard: no polynomial-time algorithm is known
• The 0-1 loss function is non-smooth and non-convex
  – Small changes in w, b can change the loss by a lot

3.2 Approximate Regularization

• Norm-based regularization:
  – l2 squared norm: ||w||_2^2 = Σ_{d=1}^{D} w_d^2
  – l1 norm: ||w||_1 = Σ_{d=1}^{D} |w_d|
  – lp norm: ||w||_p = (Σ_{d=1}^{D} w_d^p)^{1/p}
• The norm becomes non-convex for p < 1
• The l1 norm gives the best results (sparsest solutions)
• The l2 norm is the easiest to deal with
• Different linear classifiers use different approximations to the 0-1 loss
  – Also known as surrogate loss functions
  – Squared loss
  – Hinge loss (Support Vector Machines, Perceptrons)
  – Log loss (Logistic Regression)

References

[1] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
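As a closing illustration, the regularizer norms and surrogate losses above can be evaluated directly; the particular weight vector and margin value here are assumed, and the log loss is taken base-2 so that it upper-bounds the 0-1 loss (a common convention, not stated in the notes):

```python
import math

# Norms used as regularizers, and surrogate losses evaluated on the
# signed margin m = y * (w . x + b); illustrative values only.
def l2_sq(w):  return sum(wd * wd for wd in w)
def l1(w):     return sum(abs(wd) for wd in w)
def lp(w, p):  return sum(abs(wd) ** p for wd in w) ** (1.0 / p)

def zero_one(m): return 1.0 if m < 0 else 0.0
def hinge(m):    return max(0.0, 1.0 - m)               # SVM
def log_loss(m): return math.log2(1.0 + math.exp(-m))   # logistic regression

w = [3.0, -4.0]
assert l2_sq(w) == 25.0 and l1(w) == 7.0
assert abs(lp(w, 2) - 5.0) < 1e-12

# Both surrogates upper-bound the 0-1 loss on a misclassified point:
m = -0.5
assert hinge(m) >= zero_one(m) and log_loss(m) >= zero_one(m)
print(l2_sq(w), l1(w), hinge(m))  # 25.0 7.0 1.5
```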