1. Stat 231. Lecture 6.

Stat 231. A.L. Yuille. Fall 2004






Hyperplanes with Features.
The “Kernel Trick”.
Mercer’s Theorem.
Kernels for Discrimination, PCA, Support Vectors.
Read 5.11 in Duda, Hart, and Stork.
Or, better, 12.3 in Hastie, Tibshirani, and Friedman.
2. Beyond Linear Classifiers







Increase the dimensions of the data using Feature Vectors.
Search for a separating hyperplane in feature space.
Logical X-OR.
XOR requires a decision rule that is impossible with a linear classifier in (x_1, x_2).
Define a new feature, e.g. x_1 x_2, giving φ(x) = (x_1, x_2, x_1 x_2).
Then XOR is solved by a hyperplane in feature space.
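As a small illustration (a sketch, not from the lecture: the feature map and weights below are just one possible choice), the added feature x_1 x_2 makes XOR linearly separable in ±1 coding:

```python
import numpy as np

# XOR in +/-1 coding: the label is +1 exactly when x1 and x2 agree.
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = np.array([1, -1, -1, 1])

def phi(x):
    """Illustrative feature map (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

# No line w.x + b = 0 separates y in the original plane, but in feature
# space the hyperplane with w = (0, 0, 1), b = 0 does.
w = np.array([0., 0., 1.])
preds = np.sign([w @ phi(x) for x in X])
print(preds)   # [ 1. -1. -1.  1.], matching y
```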
3. Which Feature Vectors?



With sufficient feature vectors we can perform any classification
using the linear separation algorithms applied in feature space.
Two Problems:
1. How to select the features?
2. How to achieve generalization and prevent overfitting?
The Kernel Trick simplifies both problems. (But we won't
address (2) for a few lectures.)
4. The Kernel Trick



Kernel Trick: define the kernel K(x, y) = φ(x) · φ(y).
Claim: linear separation algorithms in feature space depend only on K(x_i, x_j).
Claim: we can use all results from linear separation (previous two lectures)
by replacing all dot products φ(x_i) · φ(x_j) with the kernel K(x_i, x_j).
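A minimal sketch of this claim (the polynomial kernel below is just an illustrative choice): the Gram matrix K_ij = K(x_i, x_j) is computed directly from the kernel, and the feature vectors φ(x_i) are never formed.

```python
import numpy as np

def polynomial_kernel(x, y, degree=2):
    """Illustrative kernel K(x, y) = (1 + x . y)^degree."""
    return (1.0 + np.dot(x, y)) ** degree

def gram_matrix(X, kernel):
    """K[i, j] = K(x_i, x_j): everything a kernelized algorithm needs."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[0., 1.], [1., 0.], [1., 1.]])
print(gram_matrix(X, polynomial_kernel))
```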
5. The Kernel Trick

Hyperplanes in feature space are surfaces for which w · φ(x) + b = 0,
with associated classifier f(x) = sign(w · φ(x) + b).
Determine the classifier that maximizes the margin, as in the previous lecture, replacing x by φ(x).
The dual problem depends on the data only through the dot products φ(x_i) · φ(x_j).
Replace them by K(x_i, x_j).
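A short sketch, assuming scikit-learn is available (not part of the lecture): a maximum-margin classifier can be trained from the Gram matrix alone by passing kernel='precomputed', which makes concrete the claim that the dual only needs K(x_i, x_j).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels, not linearly separable

def gaussian_kernel(a, b, sigma=1.0):
    """Illustrative kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

# Gram matrix K_ij = K(x_i, x_j); the features phi(x_i) never appear.
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

# The soft-margin dual is solved from K alone.
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))   # training accuracy
```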


The Kernel Trick


Solve the dual problem to obtain the optimal multipliers α_i, which depend only on K.
Then the solution is the classifier f(x) = sign( Σ_i α_i y_i K(x_i, x) + b ).
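To make the form of the solution concrete, here is a small sketch with made-up dual variables (in practice the α_i and b come from solving the dual): it evaluates f(x) = sign(Σ_i α_i y_i K(x_i, x) + b) using only kernel evaluations against the training points.

```python
import numpy as np

def kernel_classifier(alphas, ys, X_train, b, kernel):
    """Return f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    def f(x):
        s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, X_train))
        return np.sign(s + b)
    return f

# Toy usage; the dual variables here are invented for illustration only.
X_train = np.array([[0., 0.], [1., 1.]])
ys = np.array([-1., 1.])
alphas = np.array([0.5, 0.5])
gaussian = lambda u, v: np.exp(-np.sum((u - v) ** 2))
f = kernel_classifier(alphas, ys, X_train, b=0.0, kernel=gaussian)
print(f(np.array([0.9, 1.1])), f(np.array([0.1, -0.1])))   # 1.0 -1.0
```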
6. Learning with Kernels
All the material in the previous lecture can be adapted directly
by replacing the dot product x_i · x_j with the kernel K(x_i, x_j).




Margins, Support Vectors, Primal and Dual Problems.
Just specify the kernel; don’t bother with the features.
The kernel trick depends on the quadratic nature of the learning
problem. It can be applied to other quadratic problems, e.g. PCA.
7. Example Kernels



Popular kernels include the polynomial kernel K(x, y) = (1 + x · y)^d
and the Gaussian kernel K(x, y) = exp(-||x - y||² / (2σ²)).
Constants: the degree d and the width σ are chosen by the user.
What conditions, if any, must we put on kernels to ensure that
they can be derived from features?
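A sketch of these kernels in code, together with a quick numerical check that touches the question above: on a random sample, the Gram matrix of each kernel has no significantly negative eigenvalues, which is the finite-sample version of the positive-definiteness condition discussed on the next slide.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree=3):
    return (1.0 + np.dot(x, y)) ** degree

# Check that the Gram matrices are (numerically) positive semi-definite.
X = np.random.default_rng(1).normal(size=(30, 4))
for kernel in (gaussian_kernel, polynomial_kernel):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    print(kernel.__name__, np.linalg.eigvalsh(K).min() >= -1e-6)
```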
8. Kernels, Mercer’s Theorem


For a finite dataset {x_1, ..., x_n}, express the kernel as an n × n matrix K
with components K_ij = K(x_i, x_j).
The matrix is symmetric and positive definite, with eigenvalues λ_m and eigenvectors e_m.
Then K_ij = Σ_m λ_m e_m(i) e_m(j).
Feature vectors: φ(x_i) = (√λ_1 e_1(i), ..., √λ_n e_n(i)), so that φ(x_i) · φ(x_j) = K_ij.
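A sketch of this finite-sample construction in NumPy (the Gaussian kernel is just an example): eigendecompose the Gram matrix and read off feature vectors whose dot products reproduce K.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
# Gaussian Gram matrix K_ij = exp(-||x_i - x_j||^2 / 2).
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

lam, E = np.linalg.eigh(K)                  # K = E diag(lam) E^T
Phi = E * np.sqrt(np.clip(lam, 0.0, None))  # row i is phi(x_i) = (sqrt(lam_m) e_m(i))_m

print(np.allclose(Phi @ Phi.T, K))          # features reproduce the kernel: True
```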
9. Kernels, Mercer’s Theorem

Mercer’s Theorem extends this result using Functional Analysis (F.A.).
Most results in Linear Algebra can be extended to F.A. (matrices with infinitely many dimensions).
E.g. we define eigenfunctions e_m of the kernel by ∫ K(x, y) e_m(y) dy = λ_m e_m(x),
requiring finite ∫ e_m(x)² dx.
Provided the kernel K(x, y) is positive definite, the features are φ_m(x) = √λ_m e_m(x).
Almost any kernel is okay.
10. Kernel Examples.

Figure of kernel discrimination.
11. Kernel PCA


The kernel trick can be applied to any quadratic problem.
PCA: seek the eigenvectors and eigenvalues of the covariance matrix C = (1/n) Σ_i x_i x_iᵀ,
where, w.l.o.g., the data have zero mean: Σ_i x_i = 0.
In feature space, replace x_i by φ(x_i), so C = (1/n) Σ_i φ(x_i) φ(x_i)ᵀ.
All non-zero eigenvectors are of the form v = Σ_i α_i φ(x_i).
The eigenvalue problem reduces to solving K α = n λ α for the Gram matrix K.
Then the projection onto a component is v · φ(x) = Σ_i α_i K(x_i, x).
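A sketch of kernel PCA along these lines (kernel and data are illustrative): center the Gram matrix, which enforces the zero-mean assumption in feature space, eigendecompose it, and read off the projections of the training points onto the leading components.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_pca(X, kernel, n_components=2):
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Center K: equivalent to subtracting the mean of the phi(x_i) in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecomposition; eigh returns eigenvalues in ascending order, so reverse.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Projection of training point i onto component m is sqrt(eigval_m) * alpha_m[i].
    return eigvecs[:, :n_components] * np.sqrt(np.clip(eigvals[:n_components], 0.0, None))

X = np.random.default_rng(0).normal(size=(20, 2))
print(kernel_pca(X, gaussian_kernel).shape)   # (20, 2)
```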
12. Summary.




The Kernel Trick allows us to do linear separation in feature space.
Just specify the kernel; no need to explicitly specify the features.
Replace the dot product with the kernel.
This allows classifications that are impossible using linear separation on the original features.