 

Support Vector Machines (an Overview/Synopsis/Amplification of Module 33)
It turns out that a mathematically coherent way of using the support vector classifier material and
a kernel is this. One might pick $K(x,z)$ and decide to make "voting functions" based primarily
on linear combinations of slices of the kernel at observed inputs, i.e. on
$$g_{\beta}(x) = \sum_{i=1}^{N} \beta_i K(x, x_i) = \beta' k(x)$$
for
$$k(x) = \left( K(x, x_1), K(x, x_2), \ldots, K(x, x_N) \right)'$$
That is, one is making $N$ new "features" from the training cases and the kernel function, and
considering the linear function space defined by those. BUT now rather than simply operating
on these new features using regular $\mathbb{R}^N$ inner products
$$\langle k(x), k(z) \rangle = \sum_{i=1}^{N} K(x, x_i) K(z, x_i)$$
one substitutes a ("reproducing kernel Hilbert space") inner product defined by
$$K(x, z)$$
That is, one charges ahead formally using values of the kernel in place of inner products in an
"$N$-feature" optimization problem and is led to an "optimal" voting function
$$\hat{\beta}_0 + g_{\hat{\beta}}(x)$$
(i.e. is led to values of a constant $\hat{\beta}_0$ and a vector $\hat{\beta} \in \mathbb{R}^N$) and a corresponding classifier
$$\hat{f}(x) = \mathrm{sign}\left( \hat{\beta}_0 + g_{\hat{\beta}}(x) \right)$$
Interestingly enough, typically relatively few of the entries of $\hat{\beta}$ are non-zero (these correspond
to "support vectors" in the $N$-dimensional feature space).
One sense in which this is mathematically coherent is that the program just described produces a
solution to the following optimization problem (for some appropriate $\lambda > 0$ corresponding
appropriately to the budget $C$):

Minimize over choices of $\beta \in \mathbb{R}^N$ and $\beta_0 \in \mathbb{R}$ the quantity
$$\sum_{i=1}^{N} \left[ 1 - y_i \left( \beta_0 + \beta' k(x_i) \right) \right]_{+} + \frac{\lambda}{2} \beta' \mathbf{K} \beta$$
for $\mathbf{K} = \left( K(x_i, x_j) \right)_{N \times N}$.
In fact, more is true. The program just described produces a solution to what appears to be a
harder function-space optimization problem. That is, (for some appropriate $\lambda > 0$ corresponding
appropriately to the budget $C$) the program produces a real $\beta_0$ and a function $g$ in the space of
(potentially infinite) linear combinations of all slices of $K$ with inner product defined by
$$\left\langle K(\cdot, x), K(\cdot, z) \right\rangle = K(x, z)$$
optimizing
$$\sum_{i=1}^{N} \left[ 1 - y_i \left( \beta_0 + g(x_i) \right) \right]_{+} + \frac{\lambda}{2} \left\| g \right\|^2$$
where $\left\| g \right\|^2$ is the (reproducing kernel Hilbert space) squared norm of $g$ derived from the
inner product.
In this optimization problem, the kernel provides the space of functions (potentially infinite
linear combinations of slices of it) AND the inner products (its values). The weight $\lambda$ governs
how much the linear combination of slices of the kernel is allowed to deviate from 0 (and hence how
much the voting function, which also includes the constant $\beta_0$, is allowed to vary across inputs).
Use of the "hinge loss" 1  yh  x   for voting function h  x  in this formulation is
mathematically convenient (far more convenient than using 0-1 loss error rate). Motivations for
why it is sensible to use this loss include the following:
1. For $(x, y) \sim P$ it's fairly easy to argue that the function $h$ optimizing the "expected
hinge loss"
$$E \left[ 1 - y h(x) \right]_{+}$$
is
$$h^{\mathrm{opt}}(u) = \mathrm{sign}\left( P\left[ y = 1 \mid x = u \right] - \frac{1}{2} \right)$$
which is also the optimal classifier under 0-1 loss. So the penalized function optimization
problems stated above may produce voting functions approximating the optimal 0-1
loss classifier. (A small numerical check of this fact appears after this list.)
2. It is the case that for all real $u$
$$I\left[ u \le 0 \right] \le \left[ 1 - u \right]_{+}$$
So, using classifier $f(x) = \mathrm{sign}\left( h(x) \right)$ (that makes sense only when $h(x)$ is not 0 …
some arbitrary choice must be made in that case), provided the probability is 0 that
$h(x) = 0$, the 0-1 loss error rate for $f$ is
$$E\, I\left[ f(x) \ne y \right] = E\, I\left[ y h(x) \le 0 \right] \le E \left[ 1 - y h(x) \right]_{+}$$
That is, we also see that the hinge loss penalty used in the function optimization
problem is an empirical version of an upper bound on the 0-1 loss error rate. Thus its
use may be relevant to choice of a 0-1 loss classifier. (The inequality is also verified on a
grid in the sketch after this list.)
3. Recalling the original form of the optimization problem leading to the support vector
classifier, the quantity $\left[ 1 - y_i \left( \beta_0 + \beta' x_i \right) \right]_{+}$ is the fraction of the margin $M$ by which input $x_i$
is allowed to violate its cushion around (the hyperplane that is) the classification
boundary. (Cases with $x_i$ on the "right" side of their cushion don't get penalized at all.
Ones with this value equal to 1 are on the classification boundary. Ones with values of
this larger than 1 correspond to points mis-classified by the voting function.) The sum
of such terms is a penalty on the total of fractional violations of the cushion. (Of
course, dividing the whole optimization criterion by $N$ makes this an average
fractional violation.) So, the hinge loss has direct relevance to the basic "large margin"
motivation of the support vector analysis.
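Points 1 and 2 can be checked numerically. Under the assumption that $y \in \{-1, +1\}$ with $p = P[y = 1 \mid x = u]$, the expected hinge loss of using the value $h$ at $u$ is $p[1-h]_+ + (1-p)[1+h]_+$; the sketch below minimizes this over a grid of $h$ values and confirms the minimizer is $\mathrm{sign}(p - 1/2)$, and then verifies the inequality $I[u \le 0] \le [1-u]_+$ on a grid. The particular probabilities and grids are arbitrary illustrative choices.

```python
import numpy as np

hinge = lambda u: np.maximum(1.0 - u, 0.0)

# Point 1: pointwise expected hinge loss p*[1-h]_+ + (1-p)*[1+h]_+,
# minimized over a grid of candidate values h.
h_grid = np.linspace(-3, 3, 1201)
for p in (0.2, 0.7):
    risk = p * hinge(h_grid) + (1 - p) * hinge(-h_grid)
    print(f"p = {p}: minimizing h = {h_grid[np.argmin(risk)]:+.2f},",
          f"sign(p - 1/2) = {np.sign(p - 0.5):+.0f}")

# Point 2: the hinge loss upper-bounds the 0-1 loss, I[u <= 0] <= [1 - u]_+.
u = np.linspace(-2, 2, 401)
print("bound holds everywhere on the grid:", np.all((u <= 0) <= hinge(u)))
```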