Kernels
Usman Roshan
CS 675 Machine Learning
Feature space representation
• Consider two classes shown below
• Data cannot be separated by a hyperplane
Feature space representation
• Suppose we square each coordinate
• In other words $(x_1, x_2) \Rightarrow (x_1^2, x_2^2)$
• Now the data are well separated
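A minimal NumPy sketch of this idea on hypothetical data: points labeled by whether they fall inside a circle are not linearly separable in $(x_1, x_2)$, but after squaring each coordinate a single linear rule separates them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))              # hypothetical 2-d points
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)    # class 1 = inside a circle

# Original space: no single hyperplane separates the two classes.
# Squared space: (x1, x2) -> (x1^2, x2^2); there the classes are split
# by the line z1 + z2 = 0.5.
Z = X**2
pred = (Z[:, 0] + Z[:, 1] < 0.5).astype(int)
print("accuracy of a linear rule in squared space:", (pred == y).mean())
```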
Feature spaces/Kernel trick
• Using a linear classifier (nearest means or
SVM) we solve a non-linear problem simply by
working in a different feature space.
• With kernels
– we don’t have to make the new feature space
explicit.
– we can implicitly work in a different space and
efficiently compute dot products there.
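A small sketch of what "implicitly working in a different space" means, using a degree-2 polynomial kernel in two dimensions as an illustrative choice: the kernel value $(x^T z + 1)^2$ equals the dot product of explicit feature vectors $\phi(x)$ and $\phi(z)$, but the kernel never constructs $\phi$.

```python
import numpy as np

def phi(v):
    """One explicit degree-2 feature map for a 2-d vector."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = (x @ z + 1)**2        # kernel: O(d) work, feature space never built
k_explicit = phi(x) @ phi(z)       # same value via the explicit 6-d map
print(k_implicit, k_explicit)      # both 4.0
```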
Support vector machine
• Consider the hard margin SVM optimization
$$\min_{w, w_0} \; \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w^T x_i + w_0) \ge 1 \text{ for all } i$$
• Solve by applying KKT. Think of KKT as a tool
for constrained convex optimization.
• Form Lagrangian
$$L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right)$$
where the $\alpha_i$ are Lagrange multipliers
Support vector machine
• KKT says the optimal w and w0 are given by
the saddle point solution
$$\min_{w, w_0} \max_{\alpha} \; L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right)$$
subject to $\alpha_i \ge 0$
• And the KKT conditions $\frac{\partial L_p}{\partial w} = 0$ and $\frac{\partial L_p}{\partial w_0} = 0$ imply that
$$w = \sum_i \alpha_i y_i x_i \quad \text{and} \quad 0 = \sum_i \alpha_i y_i$$
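A quick empirical check of these conditions, assuming scikit-learn is available: for a linear-kernel SVC, dual_coef_ stores the products $\alpha_i y_i$ over the support vectors, so $w$ rebuilt as $\sum_i \alpha_i y_i x_i$ should match the learned coefficient vector.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for the support vectors (y in {-1, +1})
w_from_alphas = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_alphas, clf.coef_[0]))   # w = sum_i alpha_i y_i x_i
print(abs(clf.dual_coef_[0].sum()) < 1e-8)        # sum_i alpha_i y_i = 0
```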
Support vector machine
• We obtain the dual by substituting $w = \sum_i \alpha_i y_i x_i$ back into the primal Lagrangian (the dual is maximized; the substitution is worked out below)
$$\max_\alpha \; L_d = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$
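Filling in the substitution step mentioned above: plugging $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ into $L_p$ eliminates $w$ and $w_0$,
$$
\begin{aligned}
L_p &= \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j
   - \sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j
   - w_0 \sum_i \alpha_i y_i + \sum_i \alpha_i \\
 &= \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j = L_d ,
\end{aligned}
$$
since the $w_0$ term vanishes.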
SVM and kernels
• We can rewrite the dual in a compact form:
$$\max_\alpha \; L_d = \alpha^T e - \frac{1}{2}\alpha^T G(K)\alpha$$
subject to $\alpha^T y = 0$ and $0 \le \alpha \le C$
where $G(i,j) = y_i y_j K(i,j)$, $K(i,j) = x_i^T x_j$, $\alpha[i] = \alpha_i$, and $y[i] = y_i$
Optimization
• The SVM dual is thus a quadratic program that can be solved by any quadratic program solver (a sketch with a generic solver appears below).
• Platt's Sequential Minimal Optimization (SMO) algorithm offers a simple, SVM-specific solution to the dual.
• The idea is to perform coordinate ascent, selecting two variables at a time to optimize.
• Let’s look at some kernels.
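Before that, a minimal sketch (assuming the cvxopt package is available) of handing the dual above to a generic quadratic program solver; the variable names and toy data are illustrative only.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_alphas(K, y, C=1.0):
    """Solve  max_a  e^T a - 1/2 a^T G(K) a
       s.t.   y^T a = 0,  0 <= a <= C
       by handing the equivalent minimization to cvxopt's QP solver."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Gmat = np.outer(y, y) * K                        # G(i,j) = y_i y_j K(i,j)
    P = matrix(Gmat)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # encodes -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])

# Toy usage: linear kernel on four hypothetical points, labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(svm_dual_alphas(X @ X.T, y, C=10.0))
```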
Example kernels
• Polynomial kernels of degree d give a feature
space with higher order non-linear terms
K ( xi , x j )  ( x x j 1)
T
i
d
• Radial basis kernel gives infinite dimensional
space (Taylor series)
$$K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$$
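A short NumPy sketch computing both kernels as Gram matrices over a small random data set (the degree and $\sigma$ values are arbitrary illustrative choices):

```python
import numpy as np

def polynomial_kernel(X, degree=3):
    """K(i, j) = (x_i^T x_j + 1)^d"""
    return (X @ X.T + 1.0) ** degree

def rbf_kernel(X, sigma=1.0):
    """K(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2))"""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))
print(polynomial_kernel(X, degree=2).shape, rbf_kernel(X, sigma=0.5).shape)
```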
Example kernels
• Empirical kernel map
– Define a set of reference vectors $m_j$ for $j = 1, \dots, M$
– Define a score $s(x_i, m_j)$ between $x_i$ and $m_j$
– Then $\phi(x_i) = [s(x_i, m_1), \dots, s(x_i, m_M)]$
– And $K(x_i, x_j) = \phi^T(x_i)\,\phi(x_j)$
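A sketch of the empirical kernel map with an RBF similarity playing the role of the score $s(x_i, m_j)$; the score function and the reference vectors are free choices, so this is only one possibility.

```python
import numpy as np

def empirical_map(X, refs, sigma=1.0):
    """phi(x_i) = [s(x_i, m_1), ..., s(x_i, m_M)] with an RBF score."""
    d2 = ((X[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))            # shape (n, M)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
refs = X[:3]                                       # M = 3 reference vectors
Phi = empirical_map(X, refs)
K = Phi @ Phi.T                                    # K(i, j) = phi(x_i)^T phi(x_j)
print(K.shape)
```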
Example kernels
• Bag of words
– Given two documents D1 and D2, we define the kernel K(D1, D2) as the number of words the two documents have in common
– To prove this is a kernel, first fix a large set of words Wi. Define the mapping Φ(D1) as a high-dimensional binary vector where Φ(D1)[i] is 1 if the word Wi is present in the document and 0 otherwise. Then K(D1, D2) = ΦT(D1)Φ(D2), so K is a valid kernel.
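A small sketch of this kernel and of the explicit map used in the argument; here the word set W is just the union of the two documents' vocabularies.

```python
def bow_kernel(d1, d2):
    """K(D1, D2) = number of distinct words the documents share."""
    return len(set(d1.split()) & set(d2.split()))

def bow_phi(doc, vocab):
    """Phi(D)[i] = 1 if word vocab[i] occurs in D, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

d1, d2 = "the cat sat on the mat", "the dog sat outside"
vocab = sorted(set(d1.split()) | set(d2.split()))
explicit = sum(a * b for a, b in zip(bow_phi(d1, vocab), bow_phi(d2, vocab)))
print(bow_kernel(d1, d2), explicit)   # both 2 ("the" and "sat")
```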
SVM and kernels
• What if we make the kernel matrix K a variable
and optimize the dual
$$\min_K \max_\alpha \; L_d = \alpha^T e - \frac{1}{2}\alpha^T G(K)\alpha$$
subject to $\alpha^T y = 0$ and $0 \le \alpha \le C$
where $G(i,j) = y_i y_j K(i,j)$, $\alpha[i] = \alpha_i$, and $y[i] = y_i$
• But now there is no way to tie the kernel
matrix to the training data points.
SVM and kernels
• To tie the kernel matrix to training data we
assume that the kernel to be determined is a
linear combination of some existing base
kernels.
$$K = \sum_i \mu_i K_i$$
with combination weights $\mu_i$.
• Now we have a problem that is not a
quadratic program anymore.
• Instead we have a semi-definite program (Lanckriet et al. 2002).
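The semi-definite program itself does not fit in a short snippet, but a sketch of the underlying idea, with fixed hand-picked weights (which MKL would instead learn) and an SVM that accepts a precomputed kernel (scikit-learn's SVC assumed):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=6, random_state=0)

K_linear = X @ X.T
K_rbf = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)

mu = [0.3, 0.7]                        # weights; MKL would learn these
K = mu[0] * K_linear + mu[1] * K_rbf   # combined kernel, still PSD

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print(clf.score(K, y))                 # training accuracy on the toy data
```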
Theoretical foundation
• Recall the margin error theorem (Theorem 7.3 from Learning with Kernels)
Theoretical foundation
• The kernel analogue of Theorem 7.3, from Lanckriet et al. 2002:
How does MKL work in practice?
• Gönen and Alpaydın, JMLR, 2011
• Datasets:
– Digit recognition
– Internet advertisements
– Protein folding
• Form kernels with different sets of features
• Apply SVM with various kernel learning
algorithms.
How does MKL work in practice?
Result tables from Gönen and Alpaydın, JMLR, 2011
How does MKL work in practice?
• MKL better than a single kernel
• The mean kernel $K = \frac{1}{k}\sum_{i=1}^{k} K_i$ is hard to beat
• Non-linear MKL looks promising