Support Vector Machine
Rong Jin
(Many slides © 2001, 2003, Andrew W. Moore)
Linear Classifiers
[Figure: training data from two classes, labeled +1 and −1]
• How do we find a linear decision boundary that separates the data points of the two classes?
Linear Classifiers
Any of these would be fine...
...but which is best?
Classifier Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).
Why Maximum Margin?
1. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
2. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
3. Empirically it works very, very well.
Estimate the Margin
[Figure: a point x and its perpendicular distance to the line w·x + b = 0]
• What is the expression for the distance from a point x to the line w·x + b = 0?
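For reference, the standard answer, using the Euclidean norm of w:

$$ d(\mathbf{x}) = \frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\lVert\mathbf{w}\rVert} $$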
Estimate the Margin
• What is the classification margin for the classifier w·x + b = 0?
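A standard way to write the margin of a labeled example (xᵢ, yᵢ), and of the whole training set, is:

$$ \gamma_i = \frac{y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}, \qquad \gamma = \min_{i}\, \gamma_i $$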
Maximize the Classification Margin
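In standard form, maximizing the margin can be posed as follows; rescaling w and b so that the closest points satisfy yᵢ(w·xᵢ + b) = 1 gives the equivalent quadratic problem on the right:

$$ \max_{\mathbf{w},\,b}\; \min_{i}\; \frac{y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert} \qquad\Longleftrightarrow\qquad \min_{\mathbf{w},\,b}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 \;\;\text{subject to}\;\; y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\;\; i=1,\dots,n $$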
Maximum Margin
This is a quadratic programming (QP) problem:
• Quadratic objective function
• Linear equality and inequality constraints
• A well-studied problem in operations research (OR)
Quadratic Programming

Find
$$ \arg\min_{\mathbf{u}} \;\; c + \mathbf{d}^{T}\mathbf{u} + \frac{\mathbf{u}^{T} R\, \mathbf{u}}{2} \qquad \text{(quadratic criterion)} $$

subject to n additional linear inequality constraints
$$ a_{11}u_1 + a_{12}u_2 + \dots + a_{1m}u_m \le b_1 $$
$$ a_{21}u_1 + a_{22}u_2 + \dots + a_{2m}u_m \le b_2 $$
$$ \vdots $$
$$ a_{n1}u_1 + a_{n2}u_2 + \dots + a_{nm}u_m \le b_n $$

and subject to e additional linear equality constraints
$$ a_{(n+1)1}u_1 + a_{(n+1)2}u_2 + \dots + a_{(n+1)m}u_m = b_{n+1} $$
$$ a_{(n+2)1}u_1 + a_{(n+2)2}u_2 + \dots + a_{(n+2)m}u_m = b_{n+2} $$
$$ \vdots $$
$$ a_{(n+e)1}u_1 + a_{(n+e)2}u_2 + \dots + a_{(n+e)m}u_m = b_{n+e} $$
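A minimal sketch of solving a QP in exactly this form, assuming the cvxopt package is available; the two-variable problem and its constraints below are illustrative only:

    # Generic QP:  minimize (1/2) u^T R u + d^T u
    #              subject to  G u <= h  (inequalities) and  A u = b  (equalities)
    import numpy as np
    from cvxopt import matrix, solvers

    solvers.options["show_progress"] = False

    R = matrix(np.eye(2))                  # quadratic term (positive semi-definite)
    d = matrix(np.zeros(2))                # linear term
    G = matrix(np.array([[-1.0, -1.0]]))   # -u1 - u2 <= -1  encodes  u1 + u2 >= 1
    h = matrix(np.array([-1.0]))
    A = matrix(np.array([[1.0, -1.0]]))    # u1 - u2 = 0  (one equality constraint)
    b = matrix(np.array([0.0]))

    sol = solvers.qp(R, d, G, h, A, b)     # returns a dict; the minimizer is sol["x"]
    print(np.array(sol["x"]).ravel())      # expected: [0.5, 0.5]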
Linearly Inseparable Case
[Figure: data from two classes that no straight line can separate]
This is going to be a problem!
What should we do?
Linearly Inseparable Case
• Relax the constraints
• Penalize the relaxation
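A standard way to write the relaxed ("soft-margin") problem introduces a slack variable ξᵢ for each violated constraint and a penalty parameter C:

$$ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to}\qquad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 $$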
Linearly Inseparable Case
[Figure: points labeled 1, 2, 3 that violate the margin and require slack]
Still a quadratic programming problem
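A minimal sketch using scikit-learn's SVC with a linear kernel; its parameter C plays exactly the role of the relaxation penalty above (the toy data is made up for illustration):

    import numpy as np
    from sklearn.svm import SVC

    # Two overlapping Gaussian clouds labeled +1 / -1 (illustrative only).
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
    y = np.array([+1] * 20 + [-1] * 20)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)   # larger C = heavier penalty on slack

    print(clf.coef_, clf.intercept_)              # the learned w and b
    print(len(clf.support_))                      # only these points define the boundary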
Linearly Inseparable Case
Compare two approaches to penalizing margin violations: the Support Vector Machine and regularized logistic regression.
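The contrast can be summarized by the two standard objectives, which differ only in how margin violations are penalized (hinge loss versus logistic loss):

$$ \text{SVM:}\qquad \min_{\mathbf{w},\,b}\;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}\max\!\big(0,\; 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\big) $$

$$ \text{Regularized logistic regression:}\qquad \min_{\mathbf{w},\,b}\;\; \tfrac{\lambda}{2}\lVert\mathbf{w}\rVert^2 + \sum_{i=1}^{n}\log\!\big(1 + e^{-y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}\big) $$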
Dual Form of SVM
How do we decide b?
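For reference, the standard dual of the soft-margin problem, and the usual way w and b are recovered from its solution:

$$ \max_{\boldsymbol{\alpha}}\;\; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j \qquad \text{subject to}\qquad \sum_{i=1}^{n}\alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C $$

$$ \mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i\,\mathbf{x}_i, \qquad b = y_k - \mathbf{w}\cdot\mathbf{x}_k \;\;\text{for any support vector } \mathbf{x}_k \text{ with } 0 < \alpha_k < C $$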
Dual Form of SVM
wx b 1
denotes +1
denotes -1
w
w  x  b  1
Copyright © 2001, 2003,
Andrew W. Moore
Suppose we’re in 1-dimension
What would SVMs
do with this data?
[Figure: a one-dimensional dataset of positive and negative points along the x-axis]
Suppose we’re in 1-dimension
Not a big surprise
[Figure: the maximum-margin separator for the 1-D data, with the positive "plane" on one side and the negative "plane" on the other]
Harder 1-dimensional Dataset
[Figure: a 1-D dataset that no single threshold can separate]

Harder 1-dimensional Dataset
[Figure: the same data mapped to a higher-dimensional feature space (e.g., adding x² as a feature), where it becomes linearly separable]
Common SVM Basis Functions
• Polynomial terms of x of degree 1 to q
• Radial (Gaussian) basis functions
Quadratic Basis Functions

$$ \Phi(\mathbf{x}) = \Big(\, 1,\;\; \sqrt{2}\,x_1,\, \sqrt{2}\,x_2,\, \dots,\, \sqrt{2}\,x_m,\;\; x_1^2,\, x_2^2,\, \dots,\, x_m^2,\;\; \sqrt{2}\,x_1 x_2,\, \sqrt{2}\,x_1 x_3,\, \dots,\, \sqrt{2}\,x_1 x_m,\, \sqrt{2}\,x_2 x_3,\, \dots,\, \sqrt{2}\,x_{m-1} x_m \,\Big)^{T} $$

The components are the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms.

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m²/2
Dual Form of SVM
In the dual form, training and classification depend on the data only through dot products of the transformed inputs, Φ(xᵢ)·Φ(xⱼ).
Quadratic Dot Products

$$ \Phi(\mathbf{a}) \cdot \Phi(\mathbf{b}) \;=\; 1 \;+\; \sum_{i=1}^{m} 2\,a_i b_i \;+\; \sum_{i=1}^{m} a_i^2 b_i^2 \;+\; \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j\, b_i b_j $$
Quadratic Dot Products

Just out of casual, innocent interest, let's look at another function of a and b:

$$ (\mathbf{a}\cdot\mathbf{b} + 1)^2 \;=\; (\mathbf{a}\cdot\mathbf{b})^2 + 2\,\mathbf{a}\cdot\mathbf{b} + 1 \;=\; \Big(\sum_{i=1}^{m} a_i b_i\Big)^{2} + 2\sum_{i=1}^{m} a_i b_i + 1 $$

$$ =\; \sum_{i=1}^{m} a_i^2 b_i^2 \;+\; \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j\, b_i b_j \;+\; 2\sum_{i=1}^{m} a_i b_i \;+\; 1 \;=\; \Phi(\mathbf{a})\cdot\Phi(\mathbf{b}) $$

So the dot product of the quadratic features can be computed in O(m) time as (a·b + 1)², without ever writing out Φ explicitly.
Kernel Trick
Replace every dot product xᵢ·xⱼ in the dual with a kernel function K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ); training and classification then operate implicitly in the feature space defined by Φ.
SVM Kernel Functions
• Polynomial kernel function
• Radial basis kernel function (universal kernel)
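In their standard forms (the degree d and width σ are hyperparameters):

$$ K_{\text{poly}}(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y} + 1)^{d}, \qquad K_{\text{RBF}}(\mathbf{x},\mathbf{y}) = \exp\!\Big(\!-\frac{\lVert\mathbf{x}-\mathbf{y}\rVert^{2}}{2\sigma^{2}}\Big) $$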
Kernel Tricks
• Replacing dot product with a kernel function
• Not all functions are kernel functions
• Given a candidate function, how can we tell whether it is a valid kernel?
Kernel Tricks
Mercer's condition: for a kernel function k(x, y) to be expandable as a dot product, i.e. k(x, y) = Φ(x)·Φ(y), it must be a positive semi-definite function; that is, for every function f(x) with ∫ f²(x) dx finite, the following inequality holds:

$$ \int\!\!\int k(\mathbf{x}, \mathbf{y})\, f(\mathbf{x})\, f(\mathbf{y})\; d\mathbf{x}\, d\mathbf{y} \;\ge\; 0 $$
Kernel Tricks
• Introducing nonlinearity into the model
• Computationally efficient
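A minimal sketch of the kernel trick in practice, again assuming scikit-learn; changing the kernel argument changes the implicit feature space without changing the training algorithm (the ring-shaped toy data is illustrative):

    import numpy as np
    from sklearn.svm import SVC

    # Not linearly separable: an inner cloud labeled -1 surrounded by an outer ring labeled +1.
    rng = np.random.RandomState(0)
    theta = rng.uniform(0, 2 * np.pi, 50)
    X = np.vstack([rng.randn(50, 2) * 0.3,
                   np.column_stack([3 * np.cos(theta), 3 * np.sin(theta)])])
    y = np.array([-1] * 50 + [+1] * 50)

    for kernel, params in [("poly", {"degree": 2}), ("rbf", {"gamma": 0.5})]:
        clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
        print(kernel, clf.score(X, y))   # both kernels separate what no line can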
Nonlinear Kernel (I)
[Figure: an example nonlinear kernel and the decision boundary it produces]

Nonlinear Kernel (II)
[Figure: a second example nonlinear kernel and its decision boundary]
Reproducing Kernel Hilbert Space (RKHS)
• Reproducing Kernel Hilbert Space H
• Eigen decomposition:
• Elements of space H:
• Reproducing property
Reproducing Kernel Hilbert Space (RKHS)
• Representer theorem
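For reference, the standard statements behind these bullets (notation is generic and may differ from the original slides):

$$ k(\mathbf{x},\mathbf{y}) = \sum_{i} \lambda_i\, \phi_i(\mathbf{x})\, \phi_i(\mathbf{y}) \qquad \text{(Mercer eigendecomposition)} $$

$$ f \in \mathcal{H}:\;\; f(\mathbf{x}) = \sum_{i} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) \ \text{(and limits of such sums)}, \qquad \langle f,\, k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = f(\mathbf{x}) \quad \text{(reproducing property)} $$

Representer theorem: the minimizer over $\mathcal{H}$ of a regularized empirical risk $\sum_{i=1}^{n} \ell\big(y_i, f(\mathbf{x}_i)\big) + \lambda \lVert f \rVert_{\mathcal{H}}^{2}$ can be written as $f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x})$.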
Kernelize Logistic Regression
• How can we introduce nonlinearity into the
logistic regression model?
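One standard answer is kernel logistic regression: by the representer theorem, restrict f to the form below and fit the coefficients α by regularized maximum likelihood:

$$ f(\mathbf{x}) = \sum_{j=1}^{n} \alpha_j\, k(\mathbf{x}_j, \mathbf{x}), \qquad \min_{\boldsymbol{\alpha}}\;\; \sum_{i=1}^{n} \log\!\Big(1 + e^{-y_i f(\mathbf{x}_i)}\Big) + \frac{\lambda}{2} \sum_{i,j} \alpha_i \alpha_j\, k(\mathbf{x}_i, \mathbf{x}_j) $$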
Diffusion Kernel
• A kernel function describes the correlation or similarity between two data points
• Suppose we are given a similarity function s(x, y) that is:
  • non-negative and symmetric
  • but does not obey Mercer's condition
• How can we generate a valid kernel function from this similarity function?
A graph-theory approach …
Diffusion Kernel
• Create a graph over all the training examples
  • Each vertex corresponds to a data point
  • The weight of each edge is the similarity s(x, y)
• Form the graph Laplacian, which is negative semi-definite
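One common convention, consistent with the "negative semi-definite" property noted above, writes the Laplacian entries directly in terms of the similarity weights:

$$ L_{ij} = \begin{cases} s(\mathbf{x}_i, \mathbf{x}_j) & i \neq j \\ -\sum_{k \neq i} s(\mathbf{x}_i, \mathbf{x}_k) & i = j \end{cases} $$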
Diffusion Kernel
Consider a simple Laplacian L.
Now consider L², L³, …
What do these matrices represent? They lead to a diffusion kernel.
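The standard definition exponentiates the Laplacian, accumulating the contributions of all powers Lᵏ; the parameter β > 0 controls how far similarity diffuses through the graph:

$$ K = e^{\beta L} = \sum_{k=0}^{\infty} \frac{\beta^{k}}{k!}\, L^{k} $$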
Diffusion Kernel: Properties
• Positive definite
• How do we compute the diffusion kernel?
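A minimal sketch, assuming NumPy/SciPy and the Laplacian convention given earlier; S is any symmetric, non-negative similarity matrix with zero diagonal:

    import numpy as np
    from scipy.linalg import expm   # dense matrix exponential

    def diffusion_kernel(S: np.ndarray, beta: float = 1.0) -> np.ndarray:
        # Return K = exp(beta * L) for the Laplacian L = S - diag(row sums of S).
        L = S - np.diag(S.sum(axis=1))          # negative semi-definite graph Laplacian
        return expm(beta * L)                   # symmetric positive definite for beta > 0

    # Equivalent route via eigendecomposition (L is symmetric):
    #   L = U diag(lam) U^T   =>   K = U diag(exp(beta * lam)) U^T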
Doing Multi-class Classification
• SVMs can only handle two-class outputs (i.e. a
categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
SVM 1 learns “Output==1” vs “Output != 1”
SVM 2 learns “Output==2” vs “Output != 2”
:
SVM N learns “Output==N” vs “Output != N”
• Then, to predict the output for a new input, run all N SVMs and predict the class whose SVM puts the input furthest into the positive region (see the sketch below).
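A minimal sketch of this one-vs-rest scheme, assuming scikit-learn; its OneVsRestClassifier applies the same "furthest into the positive region" rule via the per-class decision values:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)                     # a 3-class example dataset
    ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

    scores = ovr.decision_function(X[:5])                 # one score per class per sample
    print(np.argmax(scores, axis=1))                      # class whose SVM is most positive
    print(ovr.predict(X[:5]))                             # the same rule, applied internally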
Kernel Learning
What if we have multiple candidate kernel functions?
• Which one is the best?
• How can we combine multiple kernels?
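One common answer (multiple kernel learning) is to learn a convex combination of the candidate kernels jointly with the classifier; any non-negative combination of valid kernels is itself a valid kernel:

$$ K(\mathbf{x},\mathbf{y}) = \sum_{k=1}^{p} \mu_k\, K_k(\mathbf{x},\mathbf{y}), \qquad \mu_k \ge 0, \quad \sum_{k=1}^{p} \mu_k = 1 $$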
References
• An excellent tutorial on VC-dimension and
Support Vector Machines:
C.J.C. Burges. A tutorial on support vector machines for
pattern recognition. Data Mining and Knowledge
Discovery, 2(2):955-974, 1998.
http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible: (Not for beginners
including myself)
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998
• Software: SVM-light,
http://svmlight.joachims.org/, free download