From Support Vector Classifiers to Support Vector Machines via Kernels
A version of the support vector classifier optimization problem is, for some positive $C$, to minimize over $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$

$$\frac{1}{2}\left\|\beta\right\|^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\left(x_i'\beta + \beta_0\right) + \xi_i \ge 1 \;\;\forall i \text{ for some } \xi_i \ge 0$$

where ultimately the SV classifier is

$$\hat{f}(x) = \operatorname{sign}\left[x'\beta + \beta_0\right]
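As a concrete illustration of this first formulation, here is a minimal sketch that solves the soft-margin primal directly as a quadratic program using the cvxpy modeling library. The toy data, variable names, and value of $C$ are assumptions for illustration, not part of these notes.

```python
import numpy as np
import cvxpy as cp

# Toy two-class training data in R^p with labels y_i in {-1, +1} (assumed for illustration)
rng = np.random.default_rng(0)
N, p = 40, 2
X = np.vstack([rng.normal(-1.0, 1.0, (N // 2, p)), rng.normal(1.0, 1.0, (N // 2, p))])
y = np.concatenate([-np.ones(N // 2), np.ones(N // 2)])

C = 1.0  # complexity parameter

beta = cp.Variable(p)
beta0 = cp.Variable()
xi = cp.Variable(N)

# (1/2)||beta||^2 + C * sum(xi)  subject to  y_i (x_i' beta + beta_0) + xi_i >= 1,  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ beta + beta0) + xi >= 1, xi >= 0]
cp.Problem(objective, constraints).solve()

beta_hat, beta0_hat = beta.value, beta0.value
f_hat = np.sign(X @ beta_hat + beta0_hat)  # the fitted SV classifier on the training inputs
```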
Or equivalently, the optimization problem is, for some positive $C^{*}$, to maximize over $u \in \mathbb{R}^p$ with $\left\|u\right\| = 1$ and $\beta_0 \in \mathbb{R}$

$$M \quad \text{subject to} \quad y_i\left(x_i'u + \beta_0\right) \ge M\left(1 - \xi_i\right) \;\;\forall i \text{ for some } \xi_i \ge 0 \text{ with } \sum_{i=1}^{N}\xi_i \le C^{*}$$

where ultimately the SV classifier is

$$\hat{f}(x) = \operatorname{sign}\left[x'u + \beta_0\right]$$

For a given $C$ there is a corresponding $C^{*}$, and the optimizing $u$ in the second formulation is the unit vector corresponding to the optimizing $\beta$ in the first formulation, $u = \beta/\left\|\beta\right\|$, while the $\beta_0$ in the second formulation is the $\beta_0$ in the first divided by $\left\|\beta\right\|$ from the first.
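Continuing the sketch above, this correspondence is easy to check numerically; the names beta_hat and beta0_hat refer to the hypothetical primal solution computed earlier.

```python
norm_beta = np.linalg.norm(beta_hat)
u_hat = beta_hat / norm_beta          # optimizing unit vector of the second formulation
beta0_star = beta0_hat / norm_beta    # its intercept
M_hat = 1.0 / norm_beta               # the margin, as discussed below
# sign[x'u_hat + beta0_star] and sign[x'beta_hat + beta0_hat] give the same classifications
```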
The margin (defining the "cushion" around the classification boundary) is $M$ in the second formulation and $1/\left\|\beta\right\|$ in the first. Training points that lie on their margin or on the "wrong side" of it are "support vectors," ones with optimized $\xi_i = 0$ being "on the margin" around the classification boundary, ones with $\xi_i > 0$ being on the "wrong side" of the appropriate margin, and ones with $\xi_i > 1$ indicating that the training set point $x_i$ is misclassified by the SV classifier. Both $C$ and $C^{*}$ are complexity parameters, small $C$ and large $C^{*}$ corresponding to low complexity classifiers.
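These geometric facts can be read off a fitted classifier. A minimal sketch, again assuming the hypothetical solution beta_hat, beta0_hat and the toy data from the earlier snippets:

```python
margins = y * (X @ beta_hat + beta0_hat)   # y_i (x_i' beta + beta_0)
slacks = np.maximum(0.0, 1.0 - margins)    # optimized xi_i
M = 1.0 / np.linalg.norm(beta_hat)         # width of the cushion around the boundary

support = margins <= 1.0 + 1e-8            # on the margin or on the wrong side of it
on_margin = np.isclose(slacks, 0.0) & support
misclassified = slacks > 1.0               # equivalently, sign(x_i' beta + beta_0) != y_i
```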
The notion of a non-negative definite kernel $K(x, z)$ and an implicit abstract linear space to which $\mathbb{R}^p$ is mapped via some (abstract) function $T(x)$, and in which the inner product of $T(x)$ and $T(z)$ is $K(x, z)$, provides a very flexible generalization of support vector classifiers.
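For instance, one standard choice (not singled out in these notes, so the particular kernel and bandwidth below are assumptions) is the Gaussian kernel $K(x, z) = \exp\left(-\gamma\left\|x - z\right\|^{2}\right)$. The sketch builds the $N \times N$ Gram matrix for the toy data and checks numerically that it is non-negative definite.

```python
def gaussian_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), a non-negative definite kernel."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

# Gram matrix with (i, j) entry K(x_i, x_j) for the training inputs
K_mat = np.array([[gaussian_kernel(xi_, xj_) for xj_ in X] for xi_ in X])

# Non-negative definiteness: all eigenvalues are >= 0 up to roundoff
print(np.linalg.eigvalsh(K_mat).min() >= -1e-10)
```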
One might declare that what is sought is some element of the abstract space of the form

$$S = \sum_{i=1}^{N}\alpha_i T\left(x_i\right)$$

(i.e. some constants $\alpha_1, \alpha_2, \ldots, \alpha_N$) and some constant $\beta_0$ that minimize, for some positive $C$,

$$\frac{1}{2}\left\|S\right\|^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\left(\left\langle T\left(x_i\right), S\right\rangle + \beta_0\right) + \xi_i \ge 1 \;\;\forall i \text{ for some } \xi_i \ge 0$$

where ultimately the classifier will be of the form

$$\hat{f}(x) = \operatorname{sign}\left[\left\langle T(x), S\right\rangle + \beta_0\right]$$

(The inner product here is the inner product in the abstract space corresponding to $K(x, z)$ and the norm is the norm in that space produced by the inner product.) This amounts to saying that we'll think of solving the support vector classifier problem in the abstract space and then effectively doing classification by moving the data to that space via the (abstract) transformation $T(x)$.
The elements of this are easily expressed in terms of the kernel. First,

$$\left\|S\right\|^{2} = \left\langle S, S\right\rangle = \left\langle \sum_{i=1}^{N}\alpha_i T\left(x_i\right), \sum_{i'=1}^{N}\alpha_{i'} T\left(x_{i'}\right)\right\rangle = \sum_{i, i'}\alpha_i\alpha_{i'}\left\langle T\left(x_i\right), T\left(x_{i'}\right)\right\rangle = \sum_{i, i'}\alpha_i\alpha_{i'} K\left(x_i, x_{i'}\right) = \boldsymbol{\alpha}'\mathbf{K}\boldsymbol{\alpha}$$

for $\boldsymbol{\alpha} = \left(\alpha_1, \alpha_2, \ldots, \alpha_N\right)' \in \mathbb{R}^N$ and $\mathbf{K}$ the $N \times N$ matrix with $i, j$ entry $K\left(x_i, x_j\right)$. Then

$$\left\langle T(x), S\right\rangle = \sum_{i}\alpha_i K\left(x, x_i\right) = \left(K\left(x, x_1\right), K\left(x, x_2\right), \ldots, K\left(x, x_N\right)\right)\boldsymbol{\alpha} = \boldsymbol{k}(x)'\boldsymbol{\alpha}$$

for an $N$-vector-valued function $\boldsymbol{k}$ with $l$th entry $\boldsymbol{k}(x)_l = K\left(x, x_l\right)$.
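These kernel expressions are easy to verify numerically. A small sketch using the linear kernel $K(x, z) = x'z$, for which $T$ is simply the identity, so $S = \sum_i \alpha_i x_i$ is an ordinary vector and both sides of each identity can be computed explicitly; the particular $\boldsymbol{\alpha}$ and test point are assumptions, and the toy data from the earlier snippets are reused.

```python
# Linear kernel K(x, z) = x'z, so T(x) = x and S = sum_i alpha_i x_i
K_lin = X @ X.T                          # N x N matrix with (i, j) entry K(x_i, x_j)
alpha = rng.normal(size=N)               # arbitrary coefficients for the check
S_vec = X.T @ alpha                      # S = sum_i alpha_i T(x_i)

# ||S||^2 = alpha' K alpha
print(np.allclose(S_vec @ S_vec, alpha @ K_lin @ alpha))

# <T(x), S> = k(x)' alpha, with k(x)_l = K(x, x_l)
x_new = rng.normal(size=p)
k_of_x = X @ x_new                       # vector of K(x, x_l), l = 1, ..., N
print(np.allclose(x_new @ S_vec, k_of_x @ alpha))
```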
So the optimization problem is, for some positive $C$, to minimize (the quadratic in $\boldsymbol{\alpha}$ plus a penalty on the slack variables $\xi_i$)

$$\frac{1}{2}\boldsymbol{\alpha}'\mathbf{K}\boldsymbol{\alpha} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to the constraint} \quad y_i\left(\boldsymbol{k}\left(x_i\right)'\boldsymbol{\alpha} + \beta_0\right) + \xi_i \ge 1 \;\;\forall i \text{ for some } \xi_i \ge 0$$

which is essentially of the same form as the optimization problem corresponding to a support vector classifier, where now the "data vectors" are the $\boldsymbol{k}\left(x_i\right)$ (rather than the $x_i$ as in the untransformed problem). Upon solving the optimization problem (that is, finding a support vector classifier in the abstract space based on a solution of a dual problem in $\mathbb{R}^N$), the ultimate corresponding classifier in the original variable $x$ is

$$\hat{f}(x) = \operatorname{sign}\left(\boldsymbol{k}(x)'\boldsymbol{\alpha} + \beta_0\right) = \operatorname{sign}\left[\sum_{i}\alpha_i K\left(x, x_i\right) + \beta_0\right]$$

and one has built a "voting function" (a surrogate for the likelihood ratio or conditional probability of class $1$ given $x$, if the classifier is any good) using a linear combination of slices of the kernel function taken at the input vectors $x_i$ in the training data set.
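A minimal sketch of this kernelized problem, again using cvxpy and the Gaussian Gram matrix K_mat built earlier; the small ridge added before the Cholesky factorization, used only to write $\frac{1}{2}\boldsymbol{\alpha}'\mathbf{K}\boldsymbol{\alpha}$ as a sum of squares, is a numerical convenience and an assumption, not part of these notes.

```python
R = np.linalg.cholesky(K_mat + 1e-8 * np.eye(N)).T   # K ~ R'R, so alpha'K alpha = ||R alpha||^2

alpha_v = cp.Variable(N)
b0 = cp.Variable()
xi_v = cp.Variable(N)

# (1/2) alpha'K alpha + C sum(xi)  s.t.  y_i (k(x_i)'alpha + beta_0) + xi_i >= 1, xi_i >= 0
obj = cp.Minimize(0.5 * cp.sum_squares(R @ alpha_v) + C * cp.sum(xi_v))
cons = [cp.multiply(y, K_mat @ alpha_v + b0) + xi_v >= 1, xi_v >= 0]
cp.Problem(obj, cons).solve()

alpha_hat, b0_hat = alpha_v.value, b0.value

def voting_function(x):
    """sum_i alpha_i K(x, x_i) + beta_0; its sign is the SVM classification of x."""
    k_x = np.array([gaussian_kernel(x, xi_) for xi_ in X])
    return k_x @ alpha_hat + b0_hat
```

Classification of a new point x is then np.sign(voting_function(x)).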
In this formulation, the margin is $M = 1/\sqrt{\boldsymbol{\alpha}'\mathbf{K}\boldsymbol{\alpha}}$ in the abstract space around the set of points with inner product with $S$ equal to $-\beta_0$. In terms of points in $\mathbb{R}^p$, the two surfaces representing points $M$ (in the abstract space) away from the classification boundary

$$\left\{x \in \mathbb{R}^p \;\middle|\; \sum_{i}\alpha_i K\left(x, x_i\right) + \beta_0 = 0\right\}$$

are of the forms

$$\left\{x \in \mathbb{R}^p \;\middle|\; \sum_{i}\alpha_i K\left(x, x_i\right) + \beta_0 = 1\right\} \quad \text{and} \quad \left\{x \in \mathbb{R}^p \;\middle|\; \sum_{i}\alpha_i K\left(x, x_i\right) + \beta_0 = -1\right\}$$
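For a two-dimensional problem these surfaces are easy to visualize by contouring the voting function at the levels $-1$, $0$, and $1$; a sketch assuming the voting_function defined above and matplotlib:

```python
import matplotlib.pyplot as plt

# Evaluate the voting function on a grid covering the training data
g1, g2 = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100))
vals = np.array([voting_function(np.array([a, b])) for a, b in zip(g1.ravel(), g2.ravel())])

# Classification boundary (solid) and the two margin surfaces (dashed)
plt.contour(g1, g2, vals.reshape(g1.shape), levels=[-1, 0, 1],
            colors=["gray", "black", "gray"], linestyles=["--", "-", "--"])
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```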
Training vectors $x_i \in \mathbb{R}^p$ correspond to support vectors in the abstract space if they are "on the wrong side" of the appropriate margin. Points $x_i$ with $y_i = -1$ are support vectors if

$$\sum_{i'}\alpha_{i'} K\left(x_i, x_{i'}\right) + \beta_0 > -1$$

and points $x_i$ with $y_i = 1$ are support vectors if

$$\sum_{i'}\alpha_{i'} K\left(x_i, x_{i'}\right) + \beta_0 < 1$$

(These support vectors turn out to correspond to the indices $i$ with optimized $\alpha_i \neq 0$.)
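Continuing the sketch, these criteria can be checked directly against the fitted voting function and the optimized coefficients; the tolerance used to call a coefficient "nonzero" is an assumption, and agreement is only approximate because of solver precision and ties exactly on the margin.

```python
fitted = np.array([voting_function(xi_) for xi_ in X])   # sum_i' alpha_i' K(x_i, x_i') + beta_0

sv_neg = (y == -1) & (fitted > -1)      # class -1 points on the wrong side of their margin
sv_pos = (y == 1) & (fitted < 1)        # class +1 points on the wrong side of their margin
support_vectors = sv_neg | sv_pos

# Fraction of training points where the geometric criterion agrees with alpha_i being nonzero
print(np.mean(support_vectors == (np.abs(alpha_hat) > 1e-6)))
```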