Machine Learning 4771
Instructor: Tony Jebara, Columbia University
Topic 6
• Review: Support Vector Machines
• Primal & Dual Solution
• Non-separable SVMs
• Kernels
• SVM Demo
Review: SVM
• Support vector machines are (in the simplest case)
linear classifiers that do structural risk minimization (SRM)
• Directly maximize margin to reduce guaranteed risk J(θ)
• Assume first the 2-class data is linearly separable:
  have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,+1\}$
• Classifier: $f(x;\theta) = \mathrm{sign}(w^T x + b)$
• Decision boundary or hyperplane given by $w^T x + b = 0$
• Note: can scale w & b while keeping same boundary
• Many solutions exist which have empirical error Remp(θ)=0
• Want widest or thickest one (max margin), also it’s unique!
Support Vector Machines
• The decision boundary is the hyperplane $w^T x + b = 0$
• Define:
  $H_+$ = positive margin hyperplane
  $H_-$ = negative margin hyperplane
  $q$ = distance from decision plane to origin
  $q = \min_x \|x - 0\|$ subject to $w^T x + b = 0$
  Equivalently, drop the constraint and minimize the Lagrangian $\min_x \tfrac{1}{2}\|x - 0\|^2 - \lambda\,(w^T x + b)$:
  1) grad: $\frac{\partial}{\partial x}\left(\tfrac{1}{2}x^T x - \lambda\,(w^T x + b)\right) = x - \lambda w = 0 \;\Rightarrow\; x = \lambda w$
  2) plug into constraint: $w^T(\lambda w) + b = 0 \;\Rightarrow\; \lambda = -\frac{b}{w^T w}$
  3) Sol'n: $\hat{x} = -\frac{b}{w^T w}\,w$
  4) distance: $q = \|\hat{x} - 0\| = \frac{|b|}{\|w\|}$
  5) Define, without loss of generality (since we can scale $b$ & $w$):
     $H \;\rightarrow\; w^T x + b = 0$
     $H_+ \rightarrow w^T x + b = +1$
     $H_- \rightarrow w^T x + b = -1$
  (a quick numerical check of steps 3 and 4 follows below)
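• A quick numerical sanity check in Matlab (a minimal sketch; w and b are arbitrary values chosen only for illustration):
% Distance from the hyperplane w'x + b = 0 to the origin (illustrative values)
w = [3; 4];  b = 2;
xhat  = -(b / (w'*w)) * w;           % closest point on the hyperplane, step 3)
q     = norm(xhat);                  % equals abs(b)/norm(w) = 2/5, step 4)
check = w'*xhat + b;                 % equals 0, so xhat does lie on the hyperplane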
Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta)=0$ are thus:
  $w^T x_i + b \ge +1 \quad \forall y_i = +1$
  $w^T x_i + b \le -1 \quad \forall y_i = -1$
• Or more simply: $y_i(w^T x_i + b) - 1 \ge 0$
• The margin of the SVM is: $m = d_+ + d_-$
  (the distances from the decision boundary $H$ to $H_+$ and to $H_-$)
• Distance to origin:
  $H \rightarrow q = \frac{|b|}{\|w\|}$, $\quad H_+ \rightarrow q_+ = \frac{|b-1|}{\|w\|}$, $\quad H_- \rightarrow q_- = \frac{|-1-b|}{\|w\|}$
• Therefore: $d_+ = d_- = \frac{1}{\|w\|}$ and margin $m = \frac{2}{\|w\|}$
• Want to max margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM Problem: $\min \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$
• This is a quadratic program!
• Can plug this into a matlab function called "qp()", done! (a sketch follows below)
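• For example (a minimal sketch, not the course code), the primal QP can be handed to Matlab's quadprog over the stacked variable [w; b], assuming a data matrix X (N-by-D) and labels y (N-by-1 with entries ±1):
% Primal SVM as a QP in z = [w; b]:  min 0.5*z'*H*z  s.t.  y_i(w'x_i + b) >= 1
[N, D] = size(X);
H    = blkdiag(eye(D), 0);           % quadratic cost on w only; b is unpenalized
f    = zeros(D+1, 1);
A    = -[y .* X, y];                 % one inequality row per point (implicit expansion)
bvec = -ones(N, 1);                  % so that A*z <= bvec encodes y_i(w'x_i + b) >= 1
z    = quadprog(H, f, A, bvec);      % requires the Optimization Toolbox
w    = z(1:D);  b = z(D+1);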
SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem: $L_P: \min \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$
• This is a quadratic program, a quadratic cost function with multiple linear inequalities
  (these carve out a convex hull)
• Subtract from the cost each inequality times an $\alpha$ Lagrange multiplier, take derivatives of w & b:
  $L_P = \min_{w,b}\, \max_{\alpha \ge 0}\; \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\left(y_i(w^T x_i + b) - 1\right)$
  $\frac{\partial}{\partial w} L_P = w - \sum_i \alpha_i y_i x_i = 0 \;\rightarrow\; w = \sum_i \alpha_i y_i x_i$
  $\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0$
• Plug back in, dual: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
• Also have constraints: $\sum_i \alpha_i y_i = 0$ & $\alpha_i \ge 0$
• Above $L_D$ must be maximized! convex duality... also qp() (see the sketch below)
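• A minimal sketch of the dual in Matlab's quadprog (assuming the same hypothetical X and y as before; quadprog minimizes, so the sign of $L_D$ is flipped):
% Dual SVM as a QP over alpha (separable case: alpha >= 0 with no upper bound)
[N, D] = size(X);
Q     = (y*y') .* (X*X');            % Q_ij = y_i y_j x_i' x_j
f     = -ones(N, 1);                 % minimize 0.5*a'*Q*a - sum(a)
Aeq   = y';  beq = 0;                % equality constraint sum_i alpha_i y_i = 0
lb    = zeros(N, 1);                 % alpha_i >= 0
alpha = quadprog(Q, f, [], [], Aeq, beq, lb, []);
w     = X' * (alpha .* y);           % recover w = sum_i alpha_i y_i x_i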
SVM Dual Solution Properties
• We have the dual convex program:
  $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ & $\alpha_i \ge 0$
• Solve for N alphas (one per data point) instead of D w's
• Still convex (qp) so unique solution, gives alphas
• Alphas can be used to get w: $w = \sum_i \alpha_i y_i x_i$
• Support Vectors: have non-zero alphas (shown with thicker circles in the figure); all live on the margin: $w^T x_i + b = \pm 1$
• Solution is sparse, most alphas = 0; these are the non-support vectors.
  The SVM ignores them: if they move (without crossing the margin) or if they are deleted, the SVM doesn't change (stays robust)
SVM Dual Solution Properties
• Primal & Dual Illustration: the margin constraints $w^T x_i + b \ge +1$ / $\le -1$
• Recall we could get w from the alphas: $w = \sum_i \alpha_i y_i x_i$
• Or could use them as is: $f(x) = \mathrm{sign}(x^T w + b) = \mathrm{sign}\left(\sum_i \alpha_i y_i\, x^T x_i + b\right)$
• Karush-Kuhn-Tucker Conditions (KKT): solve for the value of b.
  On the margin (for nonzero alphas) we have $w^T x_i + b = y_i$;
  using the known w, compute b for each support vector:
  $b_i = y_i - w^T x_i \;\; \forall i: \alpha_i > 0$, then $b = \mathrm{average}(b_i)$ (a small sketch follows below)
• Sparsity (few nonzero alphas) is useful for several reasons
• Means SVM only uses some of the training data to learn
• Should help improve its ability to generalize to test data
• Computationally faster when using the final learned classifier
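• A hedged sketch of this recipe, continuing the hypothetical dual variables above (Xtest is an assumed M-by-D matrix of new points):
% Recover b from the KKT conditions, then classify new points
tol   = 1e-6;                        % numerical threshold for "nonzero" alpha
sv    = find(alpha > tol);           % indices of the support vectors
b     = mean(y(sv) - X(sv,:) * w);   % b_i = y_i - w'x_i, averaged over support vectors
ypred = sign(Xtest * w + b);         % linear SVM predictions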
Non-Separable SVMs
• What happens when the data are non-separable?
• There is no solution and the convex hull shrinks to nothing
• Not all constraints can be resolved; their alphas go to ∞
• Instead of perfectly classifying each point with $y_i(w^T x_i + b) \ge 1$,
  we "relax" the problem with (positive) slack variables $\xi_i$,
  allowing data to (sometimes) fall on the wrong side, for example:
  $w^T x_i + b \ge -0.03$ if $y_i = +1$
• New constraints:
  $w^T x_i + b \ge +1 - \xi_i$ if $y_i = +1$, where $\xi_i \ge 0$
  $w^T x_i + b \le -1 + \xi_i$ if $y_i = -1$, where $\xi_i \ge 0$
• But too much $\xi_i$ means too much slack, so penalize them:
  $L_P: \min \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y_i(w^T x_i + b) - 1 + \xi_i \ge 0$
Non-Separable SVMs
• This new problem is still convex, still qp()!
• User chooses the scalar C (or cross-validates it), which controls how much slack $\xi_i$ to use (how non-separable) and how robust the fit is to outliers or bad points on the wrong side
• Lagrangian with multipliers $\alpha_i$ (for classification) and $\beta_i$ (for $\xi_i \ge 0$):
  $L_P: \min \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left(y_i(w^T x_i + b) - 1 + \xi_i\right) - \sum_i \beta_i\,\xi_i$
  (the four terms: large margin, low slack, on the right side, and $\xi_i$ positivity)
  $\frac{\partial}{\partial w} L_P$ and $\frac{\partial}{\partial b} L_P$ as before...
  $\frac{\partial}{\partial \xi_i} L_P = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \alpha_i = C - \beta_i$, but $\alpha_i$ & $\beta_i \ge 0$, $\;\therefore\; 0 \le \alpha_i \le C$
• Can now write the dual problem (to maximize):
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
• Same dual as before but the alphas can't grow beyond C (see the sketch below)
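• In the hypothetical quadprog sketch from before, the only change is an upper bound on alpha (C = 10 is an arbitrary illustrative choice, normally cross-validated):
% Soft-margin dual: same QP, but each alpha is now clamped to the box [0, C]
C     = 10;
ub    = C * ones(N, 1);
alpha = quadprog(Q, f, [], [], Aeq, beq, lb, ub);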
Non-Separable SVMs
• As we try to enforce a classification for a non-separable data point, its Lagrange multiplier alpha keeps growing endlessly
• Clamping alpha so it stops growing at C makes the SVM "give up" on those non-separable points
• The dual program is now solved as before, with the extra constraints that the alphas are positive AND less than C... this gives the alphas... from the alphas get $w = \sum_i \alpha_i y_i x_i$
• Karush-Kuhn-Tucker Conditions (KKT): solve for the value of b using the support vectors on the margin, i.e. those with alphas that are nonzero AND not equal to C; for these, assume $\xi_i = 0$ and use the formula $y_i(w^T x_i + b_i) - 1 + \xi_i = 0$ to get $b_i$, then $b = \mathrm{average}(b_i)$ (sketched below)
• Mechanical analogy: support vector forces & torques
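• Continuing the same hypothetical sketch, b is now recovered from the margin support vectors only:
% Recover b using only margin support vectors (0 < alpha < C), where xi_i = 0
msv = find(alpha > tol & alpha < C - tol);
b   = mean(y(msv) - X(msv,:) * w);   % from y_i(w'x_i + b) - 1 + xi_i = 0 with xi_i = 0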
Nonlinear SVMs
• What if the problem is not linear?
• We can use our old trick…
• Map d-dimensional x data from L-space to high dimensional
H (Hilbert) feature-space via basis functions Φ(x)
• For example, a quadratic classifier maps $x_i \rightarrow \Phi(x_i)$ via
  $\Phi(x) = \begin{bmatrix} x \\ \mathrm{vec}(x\,x^T) \end{bmatrix}$   (a small sketch follows below)
• Call the phi's feature vectors, computed from the original x inputs
• Replace all x's in the SVM equations with phi's
• Now solve the following learning problem:
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j)$  s.t. $\alpha_i \in [0,C]$, $\sum_i \alpha_i y_i = 0$
• Which gives a nonlinear classifier in the original space:
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i\, \phi(x)^T \phi(x_i) + b\right)$
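• A small illustrative sketch of this explicit quadratic map in Matlab (the anonymous function Phi is hypothetical, not part of any toolbox):
% Explicit quadratic feature map Phi(x) = [x; vec(x*x')]
Phi   = @(x) [x; reshape(x*x', [], 1)];
phi_x = Phi([3; -1]);                % a 2-D point maps to a 2 + 4 = 6 dimensional feature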
Kernels
(see http://www.youtube.com/watch?v=3liCbRZPrZA)
• One important aspect of SVMs: all the math involves only inner products between the phi features!
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i\, \phi(x)^T \phi(x_i) + b\right)$
• Replace all inner products with a general kernel function
• Mercer kernel: accepts 2 inputs and outputs a scalar via:
  $k(x, x') = \langle \phi(x), \phi(x') \rangle = \begin{cases} \phi(x)^T \phi(x') & \text{for finite } \phi \\ \int_t \phi(x,t)\,\phi(x',t)\,dt & \text{otherwise} \end{cases}$
• Example: quadratic polynomial $\phi(x) = \begin{bmatrix} x_1^2 & \sqrt{2}\,x_1 x_2 & x_2^2 \end{bmatrix}^T$
  $k(x, x') = \phi(x)^T \phi(x') = x_1^2 x_1'^2 + 2\,x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2$
  (a quick numerical check of this identity follows below)
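• A quick numerical check of the identity above (illustrative 2-D points, chosen arbitrarily):
% Explicit feature-space inner product vs. the kernel evaluated in the original space
phi2 = @(x) [x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2];
x = [2; 1];  xp = [-1; 3];
lhs = phi2(x)' * phi2(xp);           % explicit phi(x)'*phi(x')
rhs = (x' * xp)^2;                   % kernel (x'x')^2
% lhs and rhs agree: both equal (2*(-1) + 1*3)^2 = 1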
Kernels
• Sometimes, many Φ(x) will produce the same k(x,x')
• Sometimes k(x,x') is computable even though the features are huge or infinite!
• Example: polynomials.
  With an explicit polynomial mapping, the feature space Φ(x) is huge:
  for d-dimensional data and a p-th order polynomial, $\dim(H) = \binom{d+p-1}{p}$;
  images of size 16x16 with p=4 have dim(H) ≈ 183 million (see the sketch below).
  But we can equivalently just use the kernel $k(x,y) = (x^T y)^p$:
  $k(x, x') = (x^T x')^p = \left(\sum_i x_i x_i'\right)^p$
  $\propto \sum_r \frac{p!}{r_1!\,r_2!\,r_3!\cdots(p - r_1 - r_2 - \cdots)!}\,\left(x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\right)\left(x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}\right)$   (Multinomial Theorem)
  $\propto \sum_r \left(w_r\, x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\right)\left(w_r\, x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}\right) = \phi(x)^T \phi(x')$   ($w_r$ = weight on term)
  Equivalent!
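• A short Matlab illustration of the dimension count versus the kernel shortcut (using the 16x16, p=4 numbers from above):
% Explicit p-th order polynomial features of d-dimensional data are huge...
d = 256;  p = 4;                     % e.g. 16x16 images, 4th-order polynomial
dimH = nchoosek(d + p - 1, p);       % = 183,181,376, about 183 million
% ...but the kernel never builds that vector: one O(d) inner product and a power
x = randn(d,1);  z = randn(d,1);
kval = (x' * z)^p;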
Kernels
• Replace each $x_i^T x_j \rightarrow k(x_i, x_j)$, for example:
  P-th Order Polynomial Kernel: $k(x,x') = (x^T x' + 1)^p$
  RBF Kernel (infinite!): $k(x,x') = \exp\left(-\frac{1}{2\sigma^2}\|x - x'\|^2\right)$
  Sigmoid (hyperbolic tan) Kernel: $k(x,x') = \tanh(\kappa\, x^T x' - \delta)$
• Using kernels we get the generalized inner product SVM:
  $L_D: \max \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$  s.t. $\alpha_i \in [0,C]$, $\sum_i \alpha_i y_i = 0$
  $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i\, k(x_i, x) + b\right)$
• Still a qp solver, just use the Gram matrix K (positive definite), as sketched below:
  $K = \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & k(x_1,x_3) \\ k(x_1,x_2) & k(x_2,x_2) & k(x_2,x_3) \\ k(x_1,x_3) & k(x_2,x_3) & k(x_3,x_3) \end{bmatrix} \qquad K_{ij} = k(x_i, x_j)$
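• A hedged sketch of the kernelized version, continuing the earlier hypothetical variables, with the RBF kernel as an example (sigma = 1 and xnew are arbitrary choices for illustration):
% Kernelized dual: swap X*X' for a Gram matrix K and reuse the same QP
sigma = 1;                                        % RBF bandwidth
sqd   = sum(X.^2,2) + sum(X.^2,2)' - 2*(X*X');    % pairwise squared distances
K     = exp(-sqd / (2*sigma^2));                  % K_ij = k(x_i, x_j)
alpha = quadprog((y*y').*K, f, [], [], Aeq, beq, lb, ub);
msv   = find(alpha > tol & alpha < C - tol);
b     = mean(y(msv) - K(msv,:) * (alpha .* y));   % KKT as before, now with the kernel
% Predict one new point xnew (D-by-1): f(x) = sign( sum_i alpha_i y_i k(x_i,x) + b )
kx = exp(-sum((X - xnew').^2, 2) / (2*sigma^2));
fx = sign(sum(alpha .* y .* kx) + b);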
Kernelized SVMs
• Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + 1)^p$
• Radial basis function kernel: $k(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$
  (figures: example decision boundaries with the Polynomial Kernel and the RBF kernel)
• Least-squares, logistic-regression, and the perceptron are also kernelizable
SVM Demo
• SVM Demo by Steve Gunn:
http://www.isis.ecs.soton.ac.uk/resources/svminfo/
• In svc.m replace
  [alpha lambda how] = qp(…);
  with
  [alpha lambda how] = quadprog(H,c,[],[],A,b,vlb,vub,x0);
  This replaces the old Matlab command qp (quadratic programming) with its equivalent in more recent Matlab versions.