Classification IV
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Overview
Support Vector Machines
Linear Classifier
$$f(\mathbf{x}, \mathbf{w}, b) = \mathrm{sign}(g(\mathbf{x})) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b), \qquad \mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{n} w_i x_i$$

The hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ splits the input space into the region where $\mathbf{w} \cdot \mathbf{x} + b > 0$ and the region where $\mathbf{w} \cdot \mathbf{x} + b < 0$.

Just in case: for any two points $\mathbf{x}_1$, $\mathbf{x}_2$ on the hyperplane,
$$\mathbf{w} \cdot \mathbf{x}_1 + b = \mathbf{w} \cdot \mathbf{x}_2 + b \;\Rightarrow\; \mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 0,$$
so $\mathbf{w}$ is normal to the hyperplane.
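Below is a minimal sketch of this decision rule in NumPy; the weight vector, bias, and test point are made-up values for illustration only.

```python
import numpy as np

def g(x, w, b):
    """Signed score of the linear classifier: g(x) = w . x + b."""
    return np.dot(w, x) + b

def predict(x, w, b):
    """Class label f(x, w, b) = sign(w . x + b), returned as +1 or -1."""
    return 1 if g(x, w, b) >= 0 else -1

# Hypothetical weights and a test point, purely for illustration.
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.5])
print(predict(x, w, b))  # -> 1, since 2*1 - 1*0.5 - 0.5 = 1.0 > 0
```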
Distance to Hyperplane
$$g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

Let $\mathbf{x}'$ be the projection of $\mathbf{x}$ onto the hyperplane, so that $g(\mathbf{x}') = 0$ and $\mathbf{x} = \mathbf{x}' + \lambda \mathbf{w}$ for some scalar $\lambda$. Then

$$g(\mathbf{x}) = \mathbf{w} \cdot (\mathbf{x}' + \lambda \mathbf{w}) + b = \mathbf{w} \cdot \mathbf{x}' + b + \lambda\, \mathbf{w} \cdot \mathbf{w} = \lambda\, \mathbf{w} \cdot \mathbf{w}$$

The distance from $\mathbf{x}$ to the hyperplane is therefore

$$M = \|\mathbf{x} - \mathbf{x}'\| = |\lambda| \, \|\mathbf{w}\| = \frac{|g(\mathbf{x})|}{\mathbf{w} \cdot \mathbf{w}} \, \|\mathbf{w}\| = \frac{|g(\mathbf{x})|}{\|\mathbf{w}\|}$$

In particular, the distance from the origin to the hyperplane is $|b| / \|\mathbf{w}\|$.
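A quick numeric check of the distance formula; w, b, and the test points are arbitrary values chosen so the results are easy to verify by hand.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w . x + b = 0: |g(x)| / ||w||."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
print(distance_to_hyperplane(np.array([0.0, 0.0]), w, b))  # |b| / ||w|| = 1.0
print(distance_to_hyperplane(np.array([1.0, 2.0]), w, b))  # |3 + 8 - 5| / 5 = 1.2
```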
Selection of Classifiers
(Figure: several linear classifiers that all separate the training data.)
Which classifier is the best?
All have the same training error.
How about generalization?
Unknown Samples
(Figure: unknown samples falling near the boundaries of two classifiers, A and B.)
Classifier B divides the space more consistently (unbiased).
Margins
(Figure: a maximum-margin separating hyperplane; the data points lying on the margin boundaries are the support vectors.)
Margins
- The margin of a linear classifier is defined as the width by which the boundary could be increased before hitting a data point.
- Intuitively, it is safer to choose a classifier with a larger margin.
  - Wider buffer zone for mistakes
- The hyperplane is determined by only a few data points.
  - Support vectors!
  - The others can be discarded.
- Select the classifier with the maximum margin.
  - Linear Support Vector Machine (LSVM)
  - Works very well in practice.
- How do we specify the margin formally?
Margins
(Figure: margin width M between the closest points $\mathbf{x}^+$ and $\mathbf{x}^-$ of the two classes.)

The plus-plane $\mathbf{w} \cdot \mathbf{x} + b = +1$ and the minus-plane $\mathbf{w} \cdot \mathbf{x} + b = -1$ each lie at distance $1/\|\mathbf{w}\|$ from the decision boundary, so the margin width is

$$M = \frac{2}{\|\mathbf{w}\|}$$
Objective Function
- Correctly classify all data points:

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \quad \text{if } y_i = +1$$
$$\mathbf{w} \cdot \mathbf{x}_i + b \le -1 \quad \text{if } y_i = -1$$
$$\Rightarrow \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0$$

- Maximize the margin:

$$\max M = \frac{2}{\|\mathbf{w}\|} \quad \Longleftrightarrow \quad \min \frac{1}{2} \mathbf{w}^T \mathbf{w}$$

- Quadratic optimization problem:
  - Minimize $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$
  - Subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
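As an illustration only, this primal problem can be handed to a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP method on a tiny made-up separable dataset; real SVM training uses specialised QP solvers such as SMO.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up linearly separable dataset (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def objective(params):
    w = params[:-1]
    return 0.5 * np.dot(w, w)                    # (1/2) w^T w

def constraint(params, i):
    w, b = params[:-1], params[-1]
    return y[i] * (np.dot(w, X[i]) + b) - 1.0    # must be >= 0

cons = [{'type': 'ineq', 'fun': constraint, 'args': (i,)} for i in range(len(y))]
res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method='SLSQP', constraints=cons)

w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```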
Lagrange Multipliers
Primal Lagrangian (multipliers $\alpha_i \ge 0$):

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i\, y_i(\mathbf{w} \cdot \mathbf{x}_i + b) + \sum_{i=1}^{l} \alpha_i$$

Setting the derivatives to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$$

Dual problem: maximize

$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j
      = \sum_i \alpha_i - \frac{1}{2} \boldsymbol{\alpha}^T H \boldsymbol{\alpha},
\quad \text{where } H_{ij} = y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$$

$$\text{subject to } \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \ge 0$$

A quadratic problem again!
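The dual can be attacked the same way. A hedged sketch, again with a made-up toy dataset: build H, then minimize the negative of L_D under the equality constraint and non-negativity bounds.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable dataset (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
l = len(y)

H = (y[:, None] * y[None, :]) * (X @ X.T)   # H_ij = y_i y_j x_i . x_j

def neg_dual(alpha):
    # Minimize the negative of L_D = sum(alpha) - 0.5 * alpha^T H alpha.
    return 0.5 * alpha @ H @ alpha - alpha.sum()

constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * l                               # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(l), method='SLSQP',
               bounds=bounds, constraints=constraints)
alpha = res.x
print("alpha =", np.round(alpha, 4))   # non-zero entries mark the support vectors
```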
Solutions of w & b
Support vectors: samples with positive $\alpha_i$.

$$g(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b \quad \text{(only inner products are needed)}$$

For any support vector $\mathbf{x}_s$, with $S$ the set of support-vector indices:

$$y_s(\mathbf{x}_s \cdot \mathbf{w} + b) = 1
\;\Rightarrow\;
y_s\Big(\sum_{m \in S} \alpha_m y_m\, \mathbf{x}_m \cdot \mathbf{x}_s + b\Big) = 1$$

Multiplying both sides by $y_s$ and using $y_s^2 = 1$:

$$b = y_s - \sum_{m \in S} \alpha_m y_m\, \mathbf{x}_m \cdot \mathbf{x}_s$$

Averaging over all $N_s$ support vectors gives a more stable estimate:

$$b = \frac{1}{N_s} \sum_{s \in S} \Big( y_s - \sum_{m \in S} \alpha_m y_m\, \mathbf{x}_m \cdot \mathbf{x}_s \Big)$$
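Continuing the dual-solver sketch, w and b can be recovered from the dual solution as follows; the 1e-6 threshold used to pick out support vectors is an arbitrary illustrative choice.

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-6):
    """Recover w = sum_i alpha_i y_i x_i and b averaged over the support vectors."""
    w = (alpha * y) @ X
    support = np.where(alpha > tol)[0]          # indices with alpha_i > 0
    b = np.mean([y[s] - np.sum(alpha[support] * y[support] * (X[support] @ X[s]))
                 for s in support])
    return w, b, support

# Usage (with alpha, X, y from the previous sketch):
# w, b, sv = recover_w_b(alpha, X, y)
```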
An Example
Two training points in the plane: $\mathbf{x}_1 = (1, 1)$ with $y_1 = +1$ and $\mathbf{x}_2 = (0, 0)$ with $y_2 = -1$.

Equality constraint: $\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \alpha_1 - \alpha_2 = 0 \;\Rightarrow\; \alpha_1 = \alpha_2$.

$$H = \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}
    = \begin{bmatrix} y_1 y_1\, \mathbf{x}_1 \cdot \mathbf{x}_1 & y_1 y_2\, \mathbf{x}_1 \cdot \mathbf{x}_2 \\
                      y_2 y_1\, \mathbf{x}_2 \cdot \mathbf{x}_1 & y_2 y_2\, \mathbf{x}_2 \cdot \mathbf{x}_2 \end{bmatrix}
    = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}$$

$$L_D = \sum_{i=1}^{2} \alpha_i - \frac{1}{2} [\alpha_1\;\; \alpha_2]\, H \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}
     = 2\alpha_1 - \alpha_1^2 \;\Rightarrow\; \alpha_1 = 1,\; \alpha_2 = 1$$

$$\mathbf{w} = \sum_{i=1}^{2} \alpha_i y_i \mathbf{x}_i = 1 \cdot 1 \cdot [1, 1] + 1 \cdot (-1) \cdot [0, 0] = [1, 1]$$

$$b = -\mathbf{w} \cdot \mathbf{x}_1 + 1 = -2 + 1 = -1$$

$$g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = x_1 + x_2 - 1, \qquad
M = \frac{2}{\|\mathbf{w}\|} = \frac{2}{\sqrt{2}} = \sqrt{2}$$
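A short numerical check of this worked example; it only re-does the arithmetic above.

```python
import numpy as np

X = np.array([[1.0, 1.0], [0.0, 0.0]])
y = np.array([+1.0, -1.0])
alpha = np.array([1.0, 1.0])                 # the dual solution derived above

w = (alpha * y) @ X                          # -> [1., 1.]
b = y[0] - np.dot(w, X[0])                   # -> -1.0
margin = 2.0 / np.linalg.norm(w)             # -> sqrt(2) ~ 1.414

print(w, b, margin)
print(np.sign(X @ w + b) == y)               # both training points classified correctly
```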
Soft Margin
(Figure: slack variables $\xi_i$ for points that violate the margin.)

Allow margin violations with slack variables $\xi_i$:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

$$\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_i \xi_i$$

Primal Lagrangian (with multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$):

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i
     - \sum_{i=1}^{l} \alpha_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \big]
     - \sum_{i=1}^{l} \mu_i \xi_i$$
Soft Margin
Setting the derivatives of $L_P$ to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0
\qquad
\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; C = \alpha_i + \mu_i$$

The dual has the same form as before; only the constraint on $\alpha_i$ changes:

$$L_D = \sum_i \alpha_i - \frac{1}{2} \boldsymbol{\alpha}^T H \boldsymbol{\alpha}
\qquad \text{s.t. } 0 \le \alpha_i \le C \;\text{ and }\; \sum_i \alpha_i y_i = 0$$
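In the dual-solver sketch above, soft margin only changes the bounds on alpha from (0, None) to (0, C). In practice a dedicated solver is used; the snippet below is a hedged illustration with scikit-learn's LIBSVM-based SVC on made-up data containing one point that sits close to the other class.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one point close to the other class (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.6], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1, +1, +1, -1, -1])

# Smaller C -> wider margin but more slack allowed; larger C -> fewer violations.
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vector indices:", clf.support_)
```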
Non-linear SVMs
(Figure: a one-dimensional dataset that is not linearly separable on the $x$ axis becomes separable after mapping each point to $(x, x^2)$.)
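A tiny sketch of that idea with made-up 1-D data: points that interleave on the line become linearly separable once each x is mapped to (x, x²).

```python
import numpy as np

# Made-up 1-D data: the negative class sits between the two positive clusters.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([+1,   +1,   -1,  -1,  -1,  +1,  +1])

# Map each point to the 2-D feature vector (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the new space the plane x2 = 1.5 (i.e. w = (0, 1), b = -1.5) separates the classes.
w, b = np.array([0.0, 1.0]), -1.5
print(np.sign(phi @ w + b) == y)   # all True
```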
Feature Space
(Figure: the circular boundary $x_1^2 + x_2^2 = r^2$ in the original $(x_1, x_2)$ space becomes a linear boundary in the mapped $(x_1^2, x_2^2)$ space under $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$.)
Feature Space
(Figure: a general non-linear mapping $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$ turns a curved decision boundary in the input space into a linear one in the feature space.)
Quadratic Basis Functions
$$\varphi(\mathbf{x}) =
\begin{bmatrix}
1 \\
\sqrt{2}\, x_1 \\ \sqrt{2}\, x_2 \\ \vdots \\ \sqrt{2}\, x_m \\
x_1^2 \\ x_2^2 \\ \vdots \\ x_m^2 \\
\sqrt{2}\, x_1 x_2 \\ \sqrt{2}\, x_1 x_3 \\ \vdots \\ \sqrt{2}\, x_1 x_m \\
\sqrt{2}\, x_2 x_3 \\ \vdots \\ \sqrt{2}\, x_2 x_m \\ \vdots \\ \sqrt{2}\, x_{m-1} x_m
\end{bmatrix}
\qquad
\begin{array}{l}
\text{constant term} \\
\text{linear terms} \\
\text{pure quadratic terms} \\
\text{quadratic cross-terms}
\end{array}$$

Number of terms: $\dbinom{m+2}{2} = \dfrac{(m+2)(m+1)}{2} \approx \dfrac{m^2}{2}$
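A direct, deliberately naive implementation of this feature map, to make the roughly m²/2 size of φ(x) concrete (the function name quadratic_phi is my own):

```python
import numpy as np

def quadratic_phi(x):
    """Explicit quadratic basis: [1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i < j)]."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    features = [1.0]
    features += list(np.sqrt(2.0) * x)                      # linear terms
    features += list(x ** 2)                                # pure quadratic terms
    features += [np.sqrt(2.0) * x[i] * x[j]                 # cross-terms, i < j
                 for i in range(m) for j in range(i + 1, m)]
    return np.array(features)

x = np.array([1.0, 2.0, 3.0])
print(len(quadratic_phi(x)))   # (m+2)(m+1)/2 = 10 terms for m = 3
```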
Calculation of Φ(xi )·Φ(xj )
In the dual, every dot product $\mathbf{x}_i \cdot \mathbf{x}_j$ is replaced by $\varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$. Computing it term by term for the quadratic basis:

$$\varphi(\mathbf{a}) \cdot \varphi(\mathbf{b})
= 1 + 2 \sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2 \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} a_i a_j b_i b_j$$

(constant term, linear terms, pure quadratic terms, and quadratic cross-terms, respectively)
It turns out …
$$\varphi(\mathbf{a}) \cdot \varphi(\mathbf{b})
= 1 + 2 \sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2 \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} a_i a_j b_i b_j$$

$$(\mathbf{a} \cdot \mathbf{b} + 1)^2
= (\mathbf{a} \cdot \mathbf{b})^2 + 2\, \mathbf{a} \cdot \mathbf{b} + 1
= \Big( \sum_{i=1}^{m} a_i b_i \Big)^2 + 2 \sum_{i=1}^{m} a_i b_i + 1$$

$$= \sum_{i=1}^{m} \sum_{j=1}^{m} a_i b_i a_j b_j + 2 \sum_{i=1}^{m} a_i b_i + 1
= \sum_{i=1}^{m} (a_i b_i)^2 + 2 \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} a_i b_i a_j b_j + 2 \sum_{i=1}^{m} a_i b_i + 1$$

$$\Rightarrow\quad K(\mathbf{a}, \mathbf{b}) = (\mathbf{a} \cdot \mathbf{b} + 1)^2 = \varphi(\mathbf{a}) \cdot \varphi(\mathbf{b})$$

Evaluating $K(\mathbf{a}, \mathbf{b})$ directly costs $O(m)$, whereas computing $\varphi(\mathbf{a}) \cdot \varphi(\mathbf{b})$ explicitly costs $O(m^2)$.
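A quick numerical check of this identity, reusing the quadratic_phi sketch from the quadratic-basis slide on arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

kernel_value = (np.dot(a, b) + 1.0) ** 2                       # O(m)
explicit_value = np.dot(quadratic_phi(a), quadratic_phi(b))    # O(m^2)

print(np.isclose(kernel_value, explicit_value))                # True
```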
Kernel Trick
- The linear classifier relies on dot products between vectors: $\mathbf{x}_i \cdot \mathbf{x}_j$.
- If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$, the dot product becomes $\varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$.
- A kernel function is a function that corresponds to an inner product in some expanded feature space: $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$.
- Example: $\mathbf{x} = [x_1, x_2]$ and $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i \cdot \mathbf{x}_j)^2$:

$$K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i \cdot \mathbf{x}_j)^2
= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$$

$$= [1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2}] \cdot
   [1,\; x_{j1}^2,\; \sqrt{2}\, x_{j1} x_{j2},\; x_{j2}^2,\; \sqrt{2}\, x_{j1},\; \sqrt{2}\, x_{j2}]$$

$$= \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j), \quad \text{where } \varphi(\mathbf{x}) = [1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2]$$
Kernels
$$\text{Polynomial:}\quad K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$$

$$\text{Gaussian:}\quad K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right)$$

$$\text{Hyperbolic tangent:}\quad K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\mathbf{x}_i \cdot \mathbf{x}_j + c)$$
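Hedged one-line implementations of these kernels; the default parameter values (d = 2, sigma = 1, c = 0) are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    """(x_i . x_j + 1)^d"""
    return (np.dot(xi, xj) + 1.0) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    """exp(-||x_i - x_j||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(xi, xj, c=0.0):
    """tanh(x_i . x_j + c)"""
    return np.tanh(np.dot(xi, xj) + c)
```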
String Kernel
Similarity between text strings, e.g. car vs. custard:

$$K(\text{car}, \text{cat}) = 4, \qquad K(\text{car}, \text{car}) = K(\text{cat}, \text{cat}) = 6$$
Solutions of w & b
$$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i\, \varphi(\mathbf{x}_i)$$

$$\mathbf{w} \cdot \varphi(\mathbf{x}_j) = \sum_{i=1}^{l} \alpha_i y_i\, \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)
= \sum_{i=1}^{l} \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}_j)$$

$$b = \frac{1}{N_s} \sum_{s \in S} \Big( y_s - \sum_{m \in S} \alpha_m y_m\, \varphi(\mathbf{x}_m) \cdot \varphi(\mathbf{x}_s) \Big)
  = \frac{1}{N_s} \sum_{s \in S} \Big( y_s - \sum_{m \in S} \alpha_m y_m\, K(\mathbf{x}_m, \mathbf{x}_s) \Big)$$

Linear SVM: $g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = \sum_{i=1}^{l} \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b$

Kernel SVM: $g(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b$

So $\mathbf{w}$ never needs to be computed explicitly; only kernel evaluations are required.
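A sketch of the resulting kernel decision function; it assumes the dual coefficients alpha, labels y, training points X, bias b, and a kernel such as gaussian_kernel from the previous sketch are already available.

```python
def decision_function(x, X, y, alpha, b, kernel):
    """g(x) = sum_i alpha_i y_i K(x_i, x) + b over the (support) training points."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def predict(x, X, y, alpha, b, kernel):
    """Class label sign(g(x)), returned as +1 or -1."""
    return 1 if decision_function(x, X, y, alpha, b, kernel) >= 0 else -1
```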
Decision Boundaries
More Maths …
SVM Roadmap
- Linear Classifier + maximum margin → Linear SVM
- Linear SVM + noise → Soft Margin
- Non-linear problem → map the data: a·b → Φ(a)·Φ(b)
- High computational cost → Kernel Trick: K(a, b) = Φ(a)·Φ(b)
Reading Materials
- Text Book
  - Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
- Online Resources
  - http://www.kernel-machines.org/
  - http://www.support-vector-machines.org/
  - http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
  - http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - A list of papers uploaded to the web learning portal
  - Wikipedia & Google
Review
- What is the definition of margin in a linear classifier?
- Why do we want to maximize the margin?
- What is the mathematical expression of the margin?
- How do we solve the objective function in SVM?
- What are support vectors?
- What is the soft margin?
- How does SVM solve non-linear problems?
- What is the so-called "kernel trick"?
- What are the commonly used kernels?
Next Week’s Class Talk
- Volunteers are required for next week's class talk.
- Topic: SVM in Practice
- Hints:
  - Applications
  - Demos
  - Multi-class problems
  - Software
    - A very popular toolbox: LIBSVM
  - Any other interesting topics beyond this lecture
- Length: 20 minutes plus question time