Support Vector Machines

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Nov 23rd, 2001
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers

    f(x,w,b) = sign(w . x - b)

[Figure: a two-class dataset (one marker denotes +1, the other denotes -1) fed through a classifier: input x and parameters α go into f, which outputs yest. Successive slides draw several different candidate separating lines through the same data.]

How would you classify this data? Any of these would be fine.. ..but which is best?
Classifier Margin

    f(x,w,b) = sign(w . x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin

    f(x,w,b) = sign(w . x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.

Support Vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin

[Figure: the "Predict Class = +1" zone, the Plus-Plane, the Classifier Boundary, the Minus-Plane, and the "Predict Class = -1" zone.]

• How do we represent this mathematically?
• …in m input dimensions?
Specifying a line and margin

[Figure: the three parallel lines wx+b = +1 (Plus-Plane), wx+b = 0 (Classifier Boundary) and wx+b = -1 (Minus-Plane), with the "Predict Class = +1" zone above and the "Predict Class = -1" zone below.]

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Classify as..  +1                 if w . x + b >= 1
               -1                 if w . x + b <= -1
               Universe explodes  if -1 < w . x + b < 1
Computing the margin width

[Figure: the lines wx+b = +1, wx+b = 0 and wx+b = -1, with M = Margin Width marked as the distance between the Plus-Plane and the Minus-Plane, and points x+ and x- on those planes.]

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Claim: The vector w is perpendicular to the Plus Plane. Why?
  Let u and v be two vectors on the Plus Plane. What is w . ( u - v ) ?
  And so of course the vector w is also perpendicular to the Minus Plane.
• Let x- be any point on the minus plane.
• Let x+ be the closest plus-plane-point to x-. (Any location in R^m: not necessarily a datapoint.)
• Claim: x+ = x- + λ w for some value of λ. Why?
  The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in direction w.

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M

It's now easy to get M in terms of w and b:

    w . (x- + λ w) + b = 1
    =>  w . x- + b + λ w.w = 1
    =>  -1 + λ w.w = 1
    =>  λ = 2 / (w.w)

    M = |x+ - x-| = |λ w| = λ |w| = λ √(w.w)
      = 2 √(w.w) / (w.w)
      = 2 / √(w.w)
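The closed form M = 2 / √(w.w) is easy to sanity-check numerically. A minimal sketch in plain NumPy (the weight vector and bias below are made-up values, not from the slides):

```python
import numpy as np

# A hypothetical weight vector and bias for a 2-d linear classifier.
w = np.array([3.0, 4.0])
b = -2.0

# Margin width from the derivation above: M = 2 / sqrt(w . w)
M = 2.0 / np.sqrt(w @ w)
print(M)  # 2 / 5 = 0.4

# Sanity check: distance between the planes w.x + b = +1 and w.x + b = -1.
x_minus = np.array([0.0, (-1 - b) / w[1]])   # a point on the minus-plane
x_plus = x_minus + (2.0 / (w @ w)) * w       # x+ = x- + lambda w
assert np.isclose(w @ x_plus + b, 1.0)
assert np.isclose(np.linalg.norm(x_plus - x_minus), M)
```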
Learning the Maximum Margin Classifier

    M = Margin Width = 2 / √(w.w)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
Quadratic Programming

Find
    arg max over u of   c + d^T u + u^T R u / 2        (quadratic criterion)

Subject to n additional linear inequality constraints:
    a11 u1 + a12 u2 + ... + a1m um <= b1
    a21 u1 + a22 u2 + ... + a2m um <= b2
      :
    an1 u1 + an2 u2 + ... + anm um <= bn

And subject to e additional linear equality constraints:
    a(n+1)1 u1 + a(n+1)2 u2 + ... + a(n+1)m um = b(n+1)
    a(n+2)1 u1 + a(n+2)2 u2 + ... + a(n+2)m um = b(n+2)
      :
    a(n+e)1 u1 + a(n+e)2 u2 + ... + a(n+e)m um = b(n+e)
Quadratic Programming

Find
    arg max over u of   c + d^T u + u^T R u / 2        (quadratic criterion)

subject to the same n linear inequality constraints and e additional linear equality constraints as above.

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they're very fiddly… you probably don't want to write one yourself.)
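As an aside (not part of the original slides), here is a hedged sketch of handing a tiny QP in this standard form to an off-the-shelf solver. It assumes the cvxopt package is available, and maximizes c + d^T u + u^T R u / 2 by minimizing its negation:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

# Maximize  c + d'u + u'Ru/2  subject to  G u <= h.
# cvxopt minimizes (1/2) u'P u + q'u, so pass P = -R and q = -d
# (R must be negative definite for the maximization to be well posed).
R = np.array([[-2.0, 0.0],
              [0.0, -2.0]])
d = np.array([1.0, 1.0])
G = np.array([[1.0, 1.0]])      # single inequality constraint: u1 + u2 <= 1
h = np.array([1.0])

sol = solvers.qp(matrix(-R), matrix(-d), matrix(G), matrix(h))
u = np.array(sol['x']).ravel()
print(u)                        # constrained optimum, here u1 = u2 = 0.5
```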
Learning the Maximum Margin Classifier

    M = 2 / √(w.w)

Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?
    Minimize w.w

How many constraints will we have?  R
What should they be?
    w . xk + b >= 1    if yk = 1
    w . xk + b <= -1   if yk = -1
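A minimal sketch of this exact primal QP on a toy separable dataset (again assuming cvxopt; the data points are invented for illustration):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

# Toy separable data: R datapoints (x_k, y_k) with y_k = +/- 1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R_pts, m = X.shape

# Variables u = (w_1, ..., w_m, b).  Minimizing w.w is (1/2) u'P u with
# P = diag(2, ..., 2, 0); the R constraints y_k (w.x_k + b) >= 1 become
# -y_k (w.x_k + b) <= -1, i.e. G u <= h.
P = np.diag(np.r_[2.0 * np.ones(m), 0.0])
q = np.zeros(m + 1)
G = -y[:, None] * np.c_[X, np.ones(R_pts)]
h = -np.ones(R_pts)

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
u = np.array(sol['x']).ravel()
w, b = u[:m], u[m]
print(w, b, "margin width:", 2.0 / np.sqrt(w @ w))
```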
Uh-oh!

This is going to be a problem! What should we do?

[Figure: a dataset (denotes +1 / denotes -1) that is not linearly separable.]

Idea 1:
    Find minimum w.w, while minimizing number of training set errors.
    Problemette: Two things to minimize makes for an ill-defined optimization.
Uh-oh!

This is going to be a problem! What should we do?

Idea 1.1:
    Minimize  w.w + C (#train errors)
    where C is a tradeoff parameter.

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
    It can't be expressed as a Quadratic Programming problem. Solving it may be too slow.
    (Also, doesn't distinguish between disastrous errors and near misses.)

So… any other ideas?
Uh-oh!

This is going to be a problem! What should we do?

Idea 2.0:
    Minimize  w.w + C (distance of error points to their correct place)
Learning Maximum Margin with Noise

    M = 2 / √(w.w)

Given a guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?
Learning Maximum Margin with Noise

[Figure: the same margin picture, now with slack distances ε2, ε7 and ε11 marked for the points that lie on the wrong side of their zone.]

Given a guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  ½ w.w + C Σ_{k=1..R} εk

How many constraints will we have?  R
What should they be?
    w . xk + b >= 1 - εk    if yk = 1
    w . xk + b <= -1 + εk   if yk = -1

(m = # input dimensions, R = # records.)
Our original (noiseless data) QP had m+1 variables: w1, w2, … wm, and b.
Our new (noisy data) QP has m+1+R variables: w1, w2, … wm, b, ε1, ε2, … εR.
Learning Maximum Margin with Noise

There's a bug in the QP above. Can you spot it?

Nothing stops the εk from going negative, so we need R extra constraints, giving 2R in all:

    Minimize  ½ w.w + C Σ_{k=1..R} εk

    subject to, for k = 1 .. R:
        w . xk + b >= 1 - εk    if yk = 1
        w . xk + b <= -1 + εk   if yk = -1
        εk >= 0
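As a practical aside (not in the original slides), a library such as scikit-learn solves exactly this soft-margin problem, with C playing the same tradeoff role. A minimal sketch (note that scikit-learn writes the decision function as w . x + b rather than w . x - b):

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is installed

# Noisy, non-separable toy data, invented for illustration.
rng = np.random.default_rng(0)
X = np.r_[rng.normal(+2.0, 1.5, size=(50, 2)), rng.normal(-2.0, 1.5, size=(50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]

# kernel='linear' gives the linear soft-margin SVM; C multiplies the sum of
# slacks epsilon_k in the objective above.
clf = SVC(kernel='linear', C=1.0).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]
print("w =", w, " b =", b, " margin width =", 2.0 / np.linalg.norm(w))
```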
An Equivalent QP

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} αk  -  ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl      where Qkl = yk yl (xk . xl)

Subject to these constraints:
    0 <= αk <= C   for all k
    Σ_{k=1..R} αk yk = 0

Then define:
    w = Σ_{k=1..R} αk yk xk
    b = yK (1 - εK) - xK . w     where K = arg max_k αk

Then classify with:
    f(x,w,b) = sign(w . x - b)

Datapoints with αk > 0 will be the support vectors ..so this sum only needs to be over the support vectors.
An Equivalent QP

Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly
• Because of further jaw-dropping developments you're about to learn.
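A hedged sketch (not from the slides) of handing this dual to a QP package, again assuming cvxopt. One small deviation: b is taken from a support vector with 0 < αk < C, where εk = 0, rather than from the arg-max-αk rule above:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

def linear_svm_dual(X, y, C=1.0):
    """Maximize sum_k alpha_k - 0.5 sum_kl alpha_k alpha_l Q_kl
    subject to 0 <= alpha_k <= C and sum_k alpha_k y_k = 0.
    y must be a float array of +/-1.0 labels."""
    R = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_kl = y_k y_l (x_k . x_l)
    G = np.r_[-np.eye(R), np.eye(R)]                 # -alpha <= 0 and alpha <= C
    h = np.r_[np.zeros(R), C * np.ones(R)]
    sol = solvers.qp(matrix(Q), matrix(-np.ones(R)),
                     matrix(G), matrix(h),
                     matrix(y.reshape(1, -1)), matrix(0.0))
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ X                              # w = sum_k alpha_k y_k x_k
    sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))  # a free support vector
    b = y[sv] - X[sv] @ w                            # epsilon = 0 at that point
    return w, b, alpha
```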
Suppose we're in 1-dimension

What would SVMs do with this data?

[Figure: 1-d datapoints on the x-axis around x = 0, one class to the left and one to the right.]

Not a big surprise: the boundary is the point midway between the closest pair of opposite-class points, with the Positive "plane" and Negative "plane" on either side.
Harder 1-dimensional dataset

That's wiped the smirk off SVM's face. What can be done about this?

[Figure: 1-d datapoints around x = 0 with one class sandwiched between the two halves of the other, so no single threshold separates them.]

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

    zk = ( xk , xk² )

[Figure: the same data mapped into the (x, x²) plane, where it becomes linearly separable.]
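A tiny sketch of that mapping on an invented dataset in the spirit of the figure; after the (x, x²) expansion a linear rule separates the classes:

```python
import numpy as np

# Invented 1-d data: the +1 class sits between the two halves of the -1 class,
# so no single threshold on x separates them.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Basis expansion z_k = (x_k, x_k^2): in this 2-d space the classes are
# linearly separable, e.g. by the line x^2 = 4.
Z = np.c_[x, x**2]
w, b = np.array([0.0, -1.0]), 4.0
print(np.sign(Z @ w + b) == y)        # all True: separable after the mapping
```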
Common SVM basis functions

zk = ( polynomial terms of xk of degree 1 to q )
zk = ( radial basis functions of xk ):   zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )
zk = ( sigmoid functions of xk )

This is sensible. Is that the end of the story? No…there's one more trick!

Quadratic Basis Functions

Φ(x) = (  1,                                         ← constant term
          √2·x1, √2·x2, …, √2·xm,                    ← linear terms
          x1², x2², …, xm²,                          ← pure quadratic terms
          √2·x1x2, √2·x1x3, …, √2·x1xm,
          √2·x2x3, …, √2·xm-1xm  )                   ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2.

You may be wondering what those √2's are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon.
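For concreteness (not in the original slides), a short sketch that builds Φ(x) explicitly; the term count matches (m+2)(m+1)/2:

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Quadratic basis expansion: constant term, sqrt(2) * linear terms,
    pure squares, and sqrt(2) * cross-terms."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.r_[1.0, np.sqrt(2) * x, x**2, cross]

x = np.array([3.0, 1.0, 2.0])
print(len(phi_quadratic(x)))   # (m+2)(m+1)/2 = 10 terms for m = 3
```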
QP with basis functions

Maximize   Σ_{k=1..R} αk  -  ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl      where Qkl = yk yl (Φ(xk) . Φ(xl))

Subject to these constraints:
    0 <= αk <= C   for all k
    Σ_{k=1..R} αk yk = 0

Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w     where K = arg max_k αk

Then classify with:
    f(x,w,b) = sign(w . φ(x) - b)

We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R² m² / 4. Yeeks!
…or does it?
Quadratic Dot Products

Writing out Φ(a) • Φ(b), the dot product of two quadratic basis vectors, term by term:

    Φ(a) • Φ(b) = 1 + Σ_{i=1..m} 2 ai bi + Σ_{i=1..m} ai² bi² + Σ_{i=1..m} Σ_{j=i+1..m} 2 ai aj bi bj

Just out of casual, innocent, interest, let's look at another function of a and b:

    (a.b + 1)²  =  (a.b)² + 2 a.b + 1
                =  ( Σ_{i=1..m} ai bi )² + 2 Σ_{i=1..m} ai bi + 1
                =  Σ_{i=1..m} Σ_{j=1..m} ai bi aj bj + 2 Σ_{i=1..m} ai bi + 1
                =  Σ_{i=1..m} (ai bi)² + 2 Σ_{i=1..m} Σ_{j=i+1..m} ai bi aj bj + 2 Σ_{i=1..m} ai bi + 1

They're the same! And this is only O(m) to compute!
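A quick numerical check of the identity, reusing the quadratic basis expansion sketched earlier:

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Quadratic basis expansion from the earlier sketch."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.r_[1.0, np.sqrt(2) * x, x**2, cross]

a, b = np.array([1.0, -2.0, 0.5]), np.array([3.0, 0.0, 2.0])
print(phi_quadratic(a) @ phi_quadratic(b), (a @ b + 1.0) ** 2)   # identical values
```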
QP with Quadratic basis functions

Maximize   Σ_{k=1..R} αk  -  ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl      where Qkl = yk yl (Φ(xk) . Φ(xl))

Subject to these constraints:
    0 <= αk <= C   for all k
    Σ_{k=1..R} αk yk = 0

Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w     where K = arg max_k αk

Then classify with:
    f(x,w,b) = sign(w . φ(x) - b)

We must do R²/2 dot products to get this matrix ready, but each dot product now requires only m additions and multiplications.
Higher Order Polynomials

Polynomial | φ(x)                           | Cost to build Qkl matrix traditionally | Cost if 100 inputs | φ(a).φ(b) | Cost to build Qkl matrix sneakily | Cost if 100 inputs
-----------|--------------------------------|----------------------------------------|--------------------|-----------|-----------------------------------|-------------------
Quadratic  | All m²/2 terms up to degree 2  | m² R² / 4                              | 2,500 R²           | (a.b+1)²  | m R² / 2                          | 50 R²
Cubic      | All m³/6 terms up to degree 3  | m³ R² / 12                             | 83,000 R²          | (a.b+1)³  | m R² / 2                          | 50 R²
Quartic    | All m⁴/24 terms up to degree 4 | m⁴ R² / 48                             | 1,960,000 R²       | (a.b+1)⁴  | m R² / 2                          | 50 R²
QP with Quintic basis functions

Maximize   Σ_{k=1..R} αk  -  ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl      where Qkl = yk yl (Φ(xk) . Φ(xl))

Subject to these constraints:
    0 <= αk <= C   for all k
    Σ_{k=1..R} αk yk = 0

Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w     where K = arg max_k αk

Then classify with:
    f(x,w,b) = sign(w . φ(x) - b)

We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?

• The fear of overfitting with this enormous number of terms.
  The use of Maximum Margin magically makes this not a problem.

• The evaluation phase (doing a set of predictions on a test set) will be very expensive. Why? Because each w . φ(x) (see below) needs 75 million operations. What can be done?

      w . Φ(x) = Σ_{k s.t. αk > 0} αk yk Φ(xk) . Φ(x)
               = Σ_{k s.t. αk > 0} αk yk (xk . x + 1)⁵

  Only Sm operations (S = #support vectors).

(When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long.")
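A small sketch of that evaluation trick (the trained α's, support vectors and b are assumed to come from the QP above):

```python
import numpy as np

def svm_decision(x, support_X, support_y, alpha, b, degree=5):
    """Classify sign(w . phi(x) - b) without ever forming phi explicitly:
    sum the kernel (x_k . x + 1)^degree over the S support vectors only,
    which is Sm work instead of ~75 million operations in 100-d."""
    k = (support_X @ x + 1.0) ** degree     # (x_k . x + 1)^5 for each support vector
    return np.sign(np.sum(alpha * support_y * k) - b)
```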
QP with Quintic basis functions

Andrew's opinion of why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters (α1, α2, …, αR), and usually most are set to zero by the Maximum Margin.

Asking for small w.w is like "weight decay" in Neural Nets and like Ridge Regression parameters in Linear regression and like the use of Priors in Bayesian Regression---all designed to smooth the function and reduce overfitting.
SVM Kernel Functions

• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
      K(a,b) = exp( -(a - b)² / (2σ²) )
• Neural-net-style Kernel Function:
      K(a,b) = tanh( κ a.b - δ )

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM* (*see last lecture).
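The three kernels above written out as plain functions (the default parameter values are placeholders; σ, κ, δ and the degree d would come from model selection):

```python
import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(a, b) - delta)
```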
VC-dimension of an SVM

• Very very very loosely speaking there is some theory which under some different assumptions puts an upper bound on the VC dimension as
      ⌈ Diameter / Margin ⌉
  where
  • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set.
  • Margin is the smallest margin we'll let the SVM use.
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation.
SVM Performance

• Anecdotally they work very very well indeed.
• Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
• Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
• Despite this, some practitioners (including your lecturer) are a little skeptical.
Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
    • SVM 1 learns "Output==1" vs "Output != 1"
    • SVM 2 learns "Output==2" vs "Output != 2"
      :
    • SVM N learns "Output==N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
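A minimal sketch of that one-vs-rest scheme (assuming scikit-learn; the class labels and data are whatever your problem provides):

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is installed

def one_vs_rest_fit(X, y):
    """One SVM per class: "Output==n" vs "Output != n"."""
    return {n: SVC(kernel='linear').fit(X, np.where(y == n, 1, -1))
            for n in np.unique(y)}

def one_vs_rest_predict(models, x):
    # Pick the class whose SVM puts the prediction furthest into the positive region.
    scores = {n: m.decision_function(x.reshape(1, -1))[0] for n, m in models.items()}
    return max(scores, key=scores.get)
```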
References

• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
What You Should Know

• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms