Introduction to Machine Learning
CSE474/574: Support Vector Machines

Varun Chandola <chandola@buffalo.edu>

1 Support Vector Machines

1.1 SVM Learning

• SVM learning task as an optimization problem
• Find w and b that give zero training error
• Maximize the margin (= 2/‖w‖)
• Same as minimizing ‖w‖

Optimization Formulation

minimize_{w,b} ‖w‖^2 / 2
subject to y_n (w^T x_n + b) ≥ 1, n = 1, …, N.
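The constraints above can be checked mechanically. The following sketch (the data points and hyperplane values are made up for illustration, not taken from the notes) tests feasibility of a candidate (w, b) and computes the margin width 2/‖w‖:

```python
# Check the SVM feasibility constraints y_n (w^T x_n + b) >= 1 and the
# resulting margin width 2 / ||w|| for a candidate hyperplane.
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_constraints(X, y, w, b):
    """True iff every training point lies on or outside the margin."""
    return all(yn * (dot(w, xn) + b) >= 1 for xn, yn in zip(X, y))

def margin_width(w):
    """Geometric margin 2 / ||w|| of the hyperplane w.x + b = 0."""
    return 2.0 / math.sqrt(dot(w, w))

X = [(1.0, 1.0), (2.0, 2.0)]
y = [-1, +1]
w, b = (1.0, 1.0), -3.0      # separates the two points with margin

print(satisfies_constraints(X, y, w, b))  # True
print(margin_width(w))                    # 2/sqrt(2), roughly 1.414
```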
Contents

1 Support Vector Machines
  1.1 SVM Learning
  1.2 Solving SVM Optimization Problem
2 Quadratic Optimization and Lagrange Multipliers
  2.1 Karush-Kuhn-Tucker Conditions
  2.2 Kernel SVM
  2.3 Support Vectors
  2.4 Optimization Constraints
3 Linear Classifiers and Loss Function
  3.1 Regularizers
  3.2 Approximate Regularization
• Optimization with N linear inequality constraints

A Different Interpretation of Margin

• What impact does the margin have on w?
• Large margin ⇒ small ‖w‖
• Small ‖w‖ ⇒ regularized/simple solutions
• Simple solutions ⇒ better generalizability (Occam's Razor)
• Computational Learning Theory provides a formal justification [1]
1.2 Solving SVM Optimization Problem

SVM Learning

• Input: training data {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}
• Objective: learn w and b that maximize the margin
• A hyperplane based classifier defined by w and b
• Like the perceptron
• Find the hyperplane with maximum separation margin on the training data
  – Zero training error (loss)
• Assume that the data is linearly separable (will relax this later)

SVM Prediction Rule

y = sign(w^T x + b)

Optimization Formulation

minimize_{w,b} ‖w‖^2 / 2
subject to y_n (w^T x_n + b) ≥ 1, n = 1, …, N.

• There is a quadratic objective function to minimize and N constraints
• "Off-the-shelf" packages: quadprog (MATLAB), CVXOPT
• Is that the best way?
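The prediction rule is simple to state in code. A minimal sketch, assuming w and b have already been produced by some trained SVM (the values below are illustrative only):

```python
# SVM prediction rule: y = sign(w^T x + b).
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_predict(w, b, x):
    """Return the predicted label in {-1, +1}."""
    return 1 if dot(w, x) + b >= 0 else -1

w, b = (1.0, 1.0), -3.0   # illustrative hyperplane, not learned here
print(svm_predict(w, b, (2.0, 2.0)))  # 1
print(svm_predict(w, b, (1.0, 1.0)))  # -1
```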
2 Quadratic Optimization and Lagrange Multipliers

• A tool for solving constrained optimization problems of differentiable functions

Consider the following example:

minimize_{x,y} f(x, y) = 2 − x^2 − 2y^2
subject to h(x, y) = x + y − 1 = 0.

• A Lagrange multiplier (β) lets you combine the two equations into one:

minimize_{x,y,β} L(x, y, β) = f(x, y) − βh(x, y)

The Lagrangian in the above example becomes:

L(x, y, β) = 2 − x^2 − 2y^2 − β(x + y − 1)

Setting the gradient to 0 with respect to x, y, and β will give us the optimal values:

∂L/∂x = −2x − β = 0
∂L/∂y = −4y − β = 0
∂L/∂β = x + y − 1 = 0

Multiple Constraints

minimize_{x,y} f(x, y) = x^2 + y^2
subject to h_1(x, y) = x + 1 = 0
           h_2(x, y) = y + 1 = 0.

The Lagrangian can be written as:

L(x, y, β) = f(x, y) − Σ_i β_i h_i(x, y)
           = x^2 + y^2 − β_1(x + 1) − β_2(y + 1)

Setting the gradient of the Lagrangian to zero, we obtain:

∂L/∂x = 2x − β_1 = 0
∂L/∂y = 2y − β_2 = 0
∂L/∂β_1 = x + 1 = 0
∂L/∂β_2 = y + 1 = 0

Solving the above four equations, we get x = −1, y = −1, β_1 = −2, β_2 = −2, and f = 2.

Handling Inequality Constraints

minimize_{x,y} f(x, y) = x^3 + y^2
subject to g(x) = x^2 − 1 ≥ 0.

• Inequality constraints are transferred as constraints on the Lagrange multiplier α:

g(x) ≥ 0 ⇒ α ≥ 0
g(x) ≤ 0 ⇒ α ≤ 0
g(x) = 0 ⇒ α is unconstrained

Solution 1. Writing the objective as a Lagrangian:

L(x, y, α) = f(x, y) − αg(x, y)
           = x^3 + y^2 − α(x^2 − 1)

Solving for the gradient of the Lagrangian gives us:

∂L/∂x = 3x^2 − 2αx = 0
∂L/∂y = 2y = 0
∂L/∂α = x^2 − 1 = 0

Furthermore, we require that α ≥ 0. From the above equations we get y = 0, x = ±1, and α = ±3/2. Since α ≥ 0, we have α = 3/2. This gives x = 1, y = 0, and f = 1.

Handling Both Constraints

minimize_w f(w)
subject to g_i(w) ≤ 0, i = 1, …, k
and h_i(w) = 0, i = 1, …, l.

Generalized Lagrangian

L(w, α, β) = f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{i=1}^l β_i h_i(w)
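The worked examples can be verified numerically. In the sketch below, the solution of the equality-constrained example (x = 2/3, y = 1/3, β = −4/3) is derived by hand from its three gradient equations and is my addition, not stated in the notes; the inequality-example solution (x = 1, y = 0, α = 3/2) is taken from the text. The code only checks that each candidate satisfies the stationarity equations:

```python
# Numerically verify the stationary points of the two worked Lagrangian
# examples by plugging them into the gradient equations.
def close(a, b, tol=1e-9):
    return abs(a - b) < tol

# Example 1 (equality constraint): L = 2 - x^2 - 2y^2 - beta*(x + y - 1).
x, y, beta = 2.0 / 3.0, 1.0 / 3.0, -4.0 / 3.0   # hand-derived candidate
assert close(-2 * x - beta, 0)   # dL/dx = 0
assert close(-4 * y - beta, 0)   # dL/dy = 0
assert close(x + y - 1, 0)       # constraint holds

# Example 2 (inequality constraint): L = x^3 + y^2 - alpha*(x^2 - 1),
# with alpha >= 0. The text finds x = 1, y = 0, alpha = 3/2, f = 1.
x, y, alpha = 1.0, 0.0, 1.5
assert close(3 * x**2 - 2 * alpha * x, 0)   # dL/dx = 0
assert close(2 * y, 0)                      # dL/dy = 0
assert close(x**2 - 1, 0)                   # constraint active
assert alpha >= 0
print("both stationary points verified")
```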
Primal and Dual Formulations

Primal Optimization

• Let θ_P be defined as:

θ_P(w) = max_{α,β: α_i ≥ 0} L(w, α, β)

• One can prove that the optimal value for the original constrained problem is the same as:

p* = min_w θ_P(w) = min_w max_{α,β: α_i ≥ 0} L(w, α, β)
Consider:

θ_P(w) = max_{α,β: α_i ≥ 0} L(w, α, β)
       = max_{α,β: α_i ≥ 0} [ f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{i=1}^l β_i h_i(w) ]

It is easy to show that if any constraint is not satisfied, i.e., if either g_i(w) > 0 or h_i(w) ≠ 0, then θ_P(w) = ∞. Which means that:

θ_P(w) = f(w) if the primal constraints are satisfied, and ∞ otherwise.

Primal and Dual Formulations (II)

Dual Optimization

• Consider θ_D, defined as:

θ_D(α, β) = min_w L(w, α, β)

• The dual optimization problem can be posed as:

d* = max_{α,β: α_i ≥ 0} θ_D(α, β) = max_{α,β: α_i ≥ 0} min_w L(w, α, β)

Is d* = p*?
• Note that d* ≤ p*
• The "max min" of a function is always less than or equal to the "min max"
• When will they be equal?

A Toy Example

• x ∈ ℝ^2
• Two training points: x_1, y_1 = (1, 1), −1 and x_2, y_2 = (2, 2), +1
• Find the best hyperplane w = (w_1, w_2)

Optimization problem for the toy example:

minimize_w f(w) = (1/2)‖w‖^2
subject to g_1(w, b) = y_1(w^T x_1 + b) − 1 ≥ 0
           g_2(w, b) = y_2(w^T x_2 + b) − 1 ≥ 0.

Substituting the actual values for x_1, y_1 and x_2, y_2:

minimize_w f(w) = (1/2)‖w‖^2
subject to g_1(w, b) = −(w^T x_1 + b) − 1 ≥ 0
           g_2(w, b) = (w^T x_2 + b) − 1 ≥ 0.

The above problem can also be written as:

minimize_{w_1,w_2,b} f(w_1, w_2) = (1/2)(w_1^2 + w_2^2)
subject to g_1(w_1, w_2, b) = −(w_1 + w_2 + b) − 1 ≥ 0
           g_2(w_1, w_2, b) = (2w_1 + 2w_2 + b) − 1 ≥ 0.
Relation between primal and dual

• In general d* ≤ p*
• For d* = p*, certain conditions should hold
• Known as the Karush-Kuhn-Tucker (KKT) conditions
• For d* = p* = L(w*, α*, β*):

∂/∂w L(w*, α*, β*) = 0
∂/∂β_i L(w*, α*, β*) = 0,  i = 1, …, l
α_i* g_i(w*) = 0,          i = 1, …, k
g_i(w*) ≤ 0,               i = 1, …, k
α_i* ≥ 0,                  i = 1, …, k

To solve the toy optimization problem, we rewrite it in the Lagrangian form:

L(w_1, w_2, b, α) = (1/2)(w_1^2 + w_2^2) + α_1(w_1 + w_2 + b + 1) − α_2(2w_1 + 2w_2 + b − 1)

Setting ∇L = 0, we get:

∂/∂w_1 L(w_1, w_2, b, α) = w_1 + α_1 − 2α_2 = 0
∂/∂w_2 L(w_1, w_2, b, α) = w_2 + α_1 − 2α_2 = 0
∂/∂b   L(w_1, w_2, b, α) = α_1 − α_2 = 0
∂/∂α_1 L(w_1, w_2, b, α) = w_1 + w_2 + b + 1 = 0
∂/∂α_2 L(w_1, w_2, b, α) = 2w_1 + 2w_2 + b − 1 = 0

Solving the above equations, we get w_1 = w_2 = 1 and b = −3.
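The toy solution can be checked by substitution. The multiplier values α_1 = α_2 = 1 follow from the first and third gradient equations (the text leaves them implicit):

```python
# Verify w1 = w2 = 1, b = -3 against the five gradient equations of the
# toy-example Lagrangian, with alpha1 = alpha2 = 1.
def close(a, b, tol=1e-9):
    return abs(a - b) < tol

w1, w2, b = 1.0, 1.0, -3.0
a1, a2 = 1.0, 1.0

assert close(w1 + a1 - 2 * a2, 0)          # dL/dw1
assert close(w2 + a1 - 2 * a2, 0)          # dL/dw2
assert close(a1 - a2, 0)                   # dL/db
assert close(w1 + w2 + b + 1, 0)           # dL/dalpha1
assert close(2 * w1 + 2 * w2 + b - 1, 0)   # dL/dalpha2

# Both constraints are active (= 0): both points are support vectors.
g1 = -(w1 + w2 + b) - 1
g2 = (2 * w1 + 2 * w2 + b) - 1
print(g1, g2)  # 0.0 0.0
```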
Back to SVM Optimization

Optimization Formulation

minimize_{w,b} ‖w‖^2 / 2
subject to y_n (w^T x_n + b) ≥ 1, n = 1, …, N.

• Introducing Lagrange multipliers α_n, n = 1, …, N, and rewriting as a (primal) Lagrangian:

minimize_{w,b,α} L_P(w, b, α) = ‖w‖^2 / 2 + Σ_{n=1}^N α_n {1 − y_n(w^T x_n + b)}
subject to α_n ≥ 0, n = 1, …, N.

2.1 Karush-Kuhn-Tucker Conditions

∂/∂w L_P(w, b, α) = w − Σ_{n=1}^N α_n y_n x_n = 0    (1)
∂/∂b L_P(w, b, α) = −Σ_{n=1}^N α_n y_n = 0           (2)
y_n {w^T x_n + b} − 1 ≥ 0                            (3)
α_n ≥ 0                                              (4)
α_n (y_n {w^T x_n + b} − 1) = 0                      (5)
Investigating the Karush-Kuhn-Tucker Conditions

• For the primal and dual formulations, we can optimize the dual formulation (as shown earlier)
• The solution should satisfy the Karush-Kuhn-Tucker (KKT) conditions

Solving the Lagrangian

• Set the gradient of L_P to 0:

∂L_P/∂w = 0 ⇒ w = Σ_{n=1}^N α_n y_n x_n
∂L_P/∂b = 0 ⇒ Σ_{n=1}^N α_n y_n = 0

• Substituting in L_P to get the dual L_D:

maximize_α L_D(α) = Σ_{n=1}^N α_n − (1/2) Σ_{m,n=1}^N α_m α_n y_m y_n (x_m^T x_n)
subject to Σ_{n=1}^N α_n y_n = 0, α_n ≥ 0, n = 1, …, N.

• The dual Lagrangian is a quadratic programming problem in the α_n's
  – Use "off-the-shelf" solvers
• Having found the α_n's:

w = Σ_{n=1}^N α_n y_n x_n

• What will be the bias term b?
  – Use KKT condition #5: for α_n > 0,
    (y_n {w^T x_n + b} − 1) = 0
  – Which means that:

b = −( max_{n: y_n = −1} w^T x_n + min_{n: y_n = 1} w^T x_n ) / 2

2.2 Kernel SVM

Observation 1: Dot Product Formulation

• All training examples (x_n's) occur in dot/inner products in the dual Lagrangian formulation
• Also recall the prediction using SVMs:

y* = sign(w^T x* + b)
   = sign((Σ_{n=1}^N α_n y_n x_n)^T x* + b)
   = sign(Σ_{n=1}^N α_n y_n (x_n^T x*) + b)

• Replace the dot products with kernel functions
  – Kernel or non-linear SVM
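A sketch of the dual-form prediction with a pluggable kernel. The α and b values reuse the toy example above; the RBF kernel and its gamma parameter are standard choices added for illustration, not detailed in the notes:

```python
# Dual-form prediction: y* = sign(sum_n alpha_n y_n k(x_n, x*) + b).
# With the linear kernel this equals the dot-product prediction rule;
# swapping in another kernel (e.g. RBF) gives a non-linear SVM.
import math

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    sq = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq)

def dual_predict(X, y, alpha, b, x_star, kernel):
    s = sum(a_n * y_n * kernel(x_n, x_star)
            for a_n, y_n, x_n in zip(alpha, y, X))
    return 1 if s + b >= 0 else -1

X = [(1.0, 1.0), (2.0, 2.0)]
y = [-1, +1]
alpha = [1.0, 1.0]   # from the toy example
b = -3.0

print(dual_predict(X, y, alpha, b, (2.5, 2.5), linear_kernel))  # 1
print(dual_predict(X, y, alpha, b, (0.5, 0.5), linear_kernel))  # -1
```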
2.3 Support Vectors

Observation 2: Most α_n's are 0

• KKT condition #5:

α_n (y_n {w^T x_n + b} − 1) = 0

• If x_n is not on the margin:

y_n {w^T x_n + b} > 1 ⇒ α_n = 0
• α_n ≠ 0 only for x_n on the margin
• These are the support vectors
• Only need these for prediction

[Figure: a maximum-margin hyperplane w · x + b = 0 with margin boundaries w · x + b = +1 and w · x + b = −1; the margin width is 2/‖w‖ and the support vectors lie on the boundaries.]

One can see from the prediction equation that:

y* = sign(Σ_{n=1}^N α_n y_n (x_n^T x*) + b)

In the summation, the entries for x_n that do not lie on the margin make no contribution to the sum, because α_n for those x_n's is 0. Hence we only need the input examples with non-zero α_n to compute the prediction.
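This sparsity can be exploited directly: after training, everything but the support vectors can be discarded. A sketch with illustrative values (the α vector below is made up so that two of five points are support vectors):

```python
# Only the support vectors (alpha_n > 0) contribute to the prediction
# sum, so the rest of the training set can be dropped after training.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

X = [(1.0, 1.0), (2.0, 2.0), (0.0, 0.0), (3.0, 3.0), (0.5, 0.5)]
y = [-1, +1, -1, +1, -1]
alpha = [1.0, 1.0, 0.0, 0.0, 0.0]   # illustrative; two support vectors
b = -3.0

# Keep only the support vectors.
svs = [(a, yn, xn) for a, yn, xn in zip(alpha, y, X) if a > 0]
print(len(svs))  # 2

def predict(x_star):
    s = sum(a * yn * dot(xn, x_star) for a, yn, xn in svs)
    return 1 if s + b >= 0 else -1

print(predict((3.0, 3.0)))  # 1
```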
2.4 Optimization Constraints

What have we seen so far?

• For linearly separable data, SVM learns a weight vector w that maximizes the margin
• SVM training is a constrained optimization problem
  – Each training example should lie outside the margin
  – N constraints
• All training examples occur in dot products (kernelizable)
• How do you set up the optimization for SVM training when the data is not linearly separable?
• Cannot go for zero training error

Introducing Slack Variables

• Separable case: to ensure zero training loss, the constraint was

y_n (w^T x_n + b) ≥ 1  ∀n = 1 … N

• Non-separable case: relax the constraint

y_n (w^T x_n + b) ≥ 1 − ξ_n  ∀n = 1 … N

• ξ_n is called a slack variable (ξ_n ≥ 0)
• For a misclassification, ξ_n > 1
• It is OK to have some misclassified training examples
  – Some ξ_n's will be non-zero
• Minimize the number of such examples
  – Minimize Σ_{n=1}^N ξ_n

• Optimization problem for the non-separable case:

minimize_{w,b} f(w, b) = ‖w‖^2 + C Σ_{n=1}^N ξ_n
subject to y_n (w^T x_n + b) ≥ 1 − ξ_n, ξ_n ≥ 0, n = 1, …, N.

• Still learn a maximum margin hyperplane
• What is the role of C?
  1. Allow some examples to be misclassified
  2. Allow some examples to fall inside the margin
• C controls the trade-off between the margin and the margin error
  – C dictates whether we focus more on maximizing the margin or on reducing the training error
• Similar optimization procedure as for the separable case (QP for the dual)
• Weights have the same expression:

w = Σ_{n=1}^N α_n y_n x_n

• Support vectors are slightly different:
  1. Points on the margin (ξ_n = 0)
  2. Inside the margin but on the correct side (0 < ξ_n < 1)
  3. On the wrong side of the hyperplane (ξ_n ≥ 1)
• Training time for SVMs is O(N^3)
• Many faster but approximate approaches exist
  – Approximate QP solvers
  – Online training
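The slack values and the soft-margin objective above can be computed directly. The data and hyperplane in this sketch are illustrative (a third point is deliberately placed on the wrong side):

```python
# Slack values xi_n = max(0, 1 - y_n (w^T x_n + b)) and the soft-margin
# objective ||w||^2 + C * sum_n xi_n.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(X, y, w, b):
    return [max(0.0, 1.0 - yn * (dot(w, xn) + b)) for xn, yn in zip(X, y)]

def soft_margin_objective(X, y, w, b, C):
    return dot(w, w) + C * sum(slacks(X, y, w, b))

X = [(1.0, 1.0), (2.0, 2.0), (2.5, 2.5)]
y = [-1, +1, -1]                 # third point is on the wrong side
w, b = (1.0, 1.0), -3.0

xi = slacks(X, y, w, b)
print(xi)   # [0.0, 0.0, 3.0] -> xi_n > 1 means misclassified
print(soft_margin_objective(X, y, w, b, C=1.0))  # 2.0 + 3.0 = 5.0
```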
• SVMs can be extended in different ways:
  1. Non-linear boundaries (kernel trick)
  2. Multi-class classification
  3. Regression (Support Vector Regression)

3 Linear Classifiers and Loss Function

• Linear binary classification can be written as a general optimization problem:

min_{w,b} L(w, b) = min_{w,b} Σ_{n=1}^N I(y_n(w^T x_n + b) < 0) + λR(w, b)

• I is an indicator function (1 if its argument is true, 0 otherwise)
• Objective function = Loss function + λ Regularizer
• The objective function wants to fit the training data well and have a simple solution
• The 0-1 loss function is non-smooth and non-convex
  – Small changes in w, b can change the loss by a lot
• Different linear classifiers use different approximations to the 0-1 loss
  – Also known as surrogate loss functions
  – Squared loss
  – Log loss (Logistic Regression)
  – Hinge loss (Support Vector Machines)
  – Perceptron loss (Perceptrons)

3.1 Regularizers

• Recall the optimization problem for linear classification:

min_{w,b} L(w, b) = min_{w,b} Σ_{n=1}^N I(y_n(w^T x_n + b) < 0) + λR(w, b)

• What is the role of the regularizer term?
  – Ensure simplicity
• Ideally we want most entries of w to be zero
• Why? The reason we want most entries of the weight vector w to be 0 is so that the prediction depends only on a few features. Changes in x_d for the remaining features will then not change the prediction, hence higher bias.
• Desired minimization:

R(w, b) = Σ_{d=1}^D I(w_d ≠ 0)

• This is a combinatorial optimization problem
  – NP-hard; no polynomial time algorithm

3.2 Approximate Regularization

• Norm based regularization
  – l2 squared norm: ‖w‖_2^2 = Σ_{d=1}^D w_d^2
  – l1 norm: ‖w‖_1 = Σ_{d=1}^D |w_d|
  – lp norm: ‖w‖_p = (Σ_{d=1}^D w_d^p)^(1/p)
• The norm becomes non-convex for p < 1
• The l1 norm gives the best results
• The l2 norm is the easiest to deal with
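The surrogate losses can be compared as functions of the signed margin m = y(w^T x + b). The exact functional forms below (e.g. squared loss written as (1 − m)^2) are one common convention and an assumption on my part; the notes only name the losses:

```python
# Surrogate losses for the 0-1 loss, as functions of the signed margin
# m = y (w^T x + b): hinge (SVM), log loss (logistic regression),
# and squared loss.
import math

def zero_one(m):  return 1.0 if m < 0 else 0.0
def hinge(m):     return max(0.0, 1.0 - m)          # SVM
def log_loss(m):  return math.log(1.0 + math.exp(-m))  # logistic regression
def squared(m):   return (1.0 - m) ** 2             # one common convention

for m in (-1.0, 0.5, 2.0):
    print(m, zero_one(m), hinge(m), round(log_loss(m), 3), squared(m))
```

Note how every surrogate upper-bounds or smooths the 0-1 loss near m = 0, which is what makes the optimization tractable.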
References

[1] V. Vapnik. Statistical Learning Theory. Wiley, 1998.