
PATTERN RECOGNITION
AND MACHINE LEARNING
CHAPTER 6: KERNEL METHODS
Kernel methods (1)
In chapters 3 and 4 we dealt with linear parametric models of the form
y(x, w) = w^T φ(x),
where φ_j(x) are known as basis functions and φ(x) is a fixed
nonlinear feature space mapping.
These models can be re-cast into an equivalent ‘dual
representation’ in which the prediction is based on a linear
combination of kernel functions evaluated at the training data points.
Dual representation (1)
Let’s consider a linear regression model whose parameters
are determined by minimizing a regularized sum-of-squares
error function
J w  
1
2
N
 f
 w   x  
2
T
n
i 1

2
w w,   0
T
If we set the gradient of J(w) with respect to w equal to
zero, the solution for w takes the following form
w = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T t
If we substitute it into the model we obtain the following
prediction
y(x, w) = w^T \phi(x) = t^T \Phi \left( \lambda I + \Phi^T \Phi \right)^{-1} \phi(x)
Dual representation (2)
Let’s define the Gram matrix K = \Phi \Phi^T, which is an N \times N
symmetric matrix with elements
K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)
In terms of the Gram matrix the prediction can be written (using the identity
\Phi (\lambda I_M + \Phi^T \Phi)^{-1} = (\Phi \Phi^T + \lambda I_N)^{-1} \Phi) as
y(x, w) = w^T \phi(x) = t^T \left( \Phi \Phi^T + \lambda I_N \right)^{-1} \Phi \, \phi(x) = k(x)^T \left( K + \lambda I_N \right)^{-1} t
where k(x) is the vector with elements k_n(x) = k(x_n, x)
So now the prediction is expressed entirely in terms of the
kernel function
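As an illustration (not part of the original slides), here is a minimal NumPy sketch of this dual prediction, y(x) = k(x)^T (K + λI)^{-1} t, assuming a user-supplied kernel function and illustrative toy data:
```python
import numpy as np

def dual_predict(X, t, x_new, kernel, lam=1.0):
    """Dual prediction y(x_new) = k(x_new)^T (K + lam*I)^(-1) t."""
    N = X.shape[0]
    # Gram matrix K_nm = k(x_n, x_m)
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    # Vector k(x_new) with elements k(x_n, x_new)
    k_vec = np.array([kernel(X[n], x_new) for n in range(N)])
    # Solve (K + lam*I) a = t rather than forming an explicit inverse
    a = np.linalg.solve(K + lam * np.eye(N), t)
    return k_vec @ a

# Toy example with a quadratic kernel k(x, z) = (x^T z)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
t = X[:, 0] ** 2 + 0.1 * rng.normal(size=20)
quad = lambda x, z: (x @ z) ** 2
print(dual_predict(X, t, np.array([0.5, -0.3]), quad, lam=0.1))
```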
Kernel trick
If we have an algorithm formulated in such a
way that the input vector x enters only in
the form of scalar products, then we can
replace the scalar product with some other
choice of kernel.
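A hedged sketch of the trick in action: squared distances in feature space can be evaluated from kernel values alone, since ‖φ(x) − φ(z)‖² = k(x, x) − 2k(x, z) + k(z, z), so any distance-based algorithm can swap in a different kernel without ever computing φ explicitly.
```python
import numpy as np

def feature_space_sq_dist(x, z, kernel):
    """||phi(x) - phi(z)||^2 computed purely from kernel evaluations."""
    return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)

# With the linear kernel this reduces to the ordinary squared distance;
# swapping in a Gaussian kernel changes the implicit feature space
# without phi ever being computed.
linear = lambda x, z: x @ z
gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(feature_space_sq_dist(x, z, linear))  # equals ||x - z||^2 = 9.25
print(feature_space_sq_dist(x, z, gauss))
```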
Advantages of the dual representation
1. We have to invert an N×N matrix, whereas in the
original parameter space formulation we had to invert an
M×M matrix in order to determine w. This can be very
important if N is significantly smaller than M.
2. As the dual representation is expressed entirely in terms
of the kernel function k(x,x’), we can work directly in terms
of kernels, which allows us to use feature spaces of high,
even infinite, dimensionality.
3. Kernel functions can be defined not only over vectors
of real numbers but also over objects as diverse
as graphs, sets, strings, and text documents.
Constructing kernels (1)
A kernel is valid if it corresponds to a scalar product in some
(perhaps infinite-dimensional) feature space
Three approaches to constructing kernels:
1. Choose a feature space mapping φ(x) and use
k(x, z) = \phi(x)^T \phi(z) = \sum_{i=1}^{M} \phi_i(x) \, \phi_i(z)
2. Construct the kernel function directly, for example (checked numerically below)
k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
= (x_1^2, \sqrt{2} \, x_1 x_2, x_2^2)(z_1^2, \sqrt{2} \, z_1 z_2, z_2^2)^T
= \phi(x)^T \phi(z)
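A quick numerical check of this expansion (an illustrative sketch with arbitrary test vectors, not from the slides):
```python
import numpy as np

def phi(x):
    """Explicit feature map corresponding to the 2-D quadratic kernel."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.5, -0.7])
z = np.array([0.3, 2.0])

lhs = (x @ z) ** 2        # kernel evaluated directly
rhs = phi(x) @ phi(z)     # scalar product in the explicit feature space
print(np.isclose(lhs, rhs))  # True
```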
Constructing kernels (2)
3. Build new kernels out of simpler valid kernels, using the fact that,
for example, sums, products, positive scalings, and exponentials of
valid kernels are again valid kernels.
Constructing kernels (3)
A necessary and sufficient condition for a
function k(x,x’) to be a valid kernel is that the
Gram matrix K should be positive
semidefinite for all possible choices of the
set {x_n}
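As an illustration of this condition (a sketch only; checking one random set {x_n} is a necessary test, not a proof of validity), the minimum eigenvalue of the Gram matrix can be inspected numerically:
```python
import numpy as np

def gram(X, kernel):
    """Gram matrix K_nm = k(x_n, x_m) over the rows of X."""
    return np.array([[kernel(xn, xm) for xm in X] for xn in X])

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))

gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
neg_sq_dist = lambda x, z: -np.sum((x - z) ** 2)   # not a valid kernel

for name, k in [("gaussian", gauss), ("neg. sq. distance", neg_sq_dist)]:
    min_eig = np.linalg.eigvalsh(gram(X, k)).min()
    # A valid kernel gives min_eig >= 0 (up to round-off);
    # the invalid one produces clearly negative eigenvalues.
    print(name, "min eigenvalue:", min_eig)
```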
Some kernels worth mentioning (1)
• Linear kernel: k(x, x’) = x^T x’
• Gaussian kernel: k(x, x’) = exp(−‖x − x’‖^2 / 2σ^2)
• Kernel for sets, e.g. k(A_1, A_2) = 2^{|A_1 ∩ A_2|} (sketched in code below)
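Minimal sketches of these kernels (the 2^{|A1 ∩ A2|} form of the subset kernel and the kernel width are illustrative choices):
```python
import numpy as np

# Linear kernel: k(x, x') = x^T x'
def linear_kernel(x, xp):
    return x @ xp

# Gaussian kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

# A simple kernel over sets: k(A1, A2) = 2^{|A1 ∩ A2|},
# the number of subsets shared by the two sets
def set_kernel(A1, A2):
    return 2.0 ** len(set(A1) & set(A2))

print(linear_kernel(np.array([1.0, 2.0]), np.array([3.0, -1.0])))    # 1.0
print(gaussian_kernel(np.array([1.0, 2.0]), np.array([3.0, -1.0])))
print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))                  # 4.0
```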
Some kernels worth mentioning (2)
• Kernel for probabilistic generative models
• Hidden Markov models
• Fisher kernel
Radial Basis Function Networks
To specify a regression model based on a linear
combination of fixed basis functions we
should choose the particular form of those
functions.
One possible choice is to use radial basis
functions, where each basis function
depends only on the radial distance from a
certain centre μ_j, so that φ_j(x) = h(‖x − μ_j‖).
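A short sketch of a network of Gaussian radial basis functions fitted by least squares (the grid of centres, the width s, and the toy data are illustrative assumptions):
```python
import numpy as np

def rbf_design_matrix(X, centres, s=1.0):
    """Phi[n, j] = exp(-||x_n - mu_j||^2 / (2 s^2))."""
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * s ** 2))

# Fit the linear-in-parameters model y(x) = w^T phi(x) by least squares
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

centres = np.linspace(-3.0, 3.0, 10)[:, None]   # fixed centres on a grid
Phi = rbf_design_matrix(X, centres, s=0.8)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(Phi @ w)    # fitted values at the training inputs
```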
Nadaraya-Watson model (1)
We want to find the regression function y(x) = E[t | x], using a Parzen
density estimator to model the joint distribution p(x, t) from the training data
It can be shown that
y(x) = \sum_n k(x, x_n) \, t_n
where the weights
k(x, x_n) = \frac{g(x - x_n)}{\sum_m g(x - x_m)}
sum to one, and g(x) = \int f(x, t) \, dt is obtained by integrating the Parzen
component function f over the target variable (assuming the components
have zero mean in t)
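A minimal sketch of the resulting Nadaraya-Watson estimator with Gaussian components (the bandwidth h and the toy data are illustrative choices):
```python
import numpy as np

def nadaraya_watson(x, X_train, t_train, h=0.5):
    """y(x) = sum_n k(x, x_n) t_n with weights that sum to one."""
    g = np.exp(-((x - X_train) ** 2) / (2.0 * h ** 2))  # Gaussian components
    k = g / g.sum()                                     # normalise over the data
    return k @ t_train

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=30)
t_train = np.sin(2 * np.pi * X_train) + 0.1 * rng.normal(size=30)
print(nadaraya_watson(0.25, X_train, t_train, h=0.1))  # roughly sin(pi/2) = 1
```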
Nadaraya-Watson model (2)
Gaussian processes (1)
Let’s apply kernels to probabilistic discriminative models
Instead of defining a prior on the parameter vector w we define a
prior probability over functions y(x) directly
Consider again the linear model y(x) = w^T φ(x) with a Gaussian prior p(w) = N(w | 0, α^{-1} I)
Gaussian processes (2)
The vector y of function values y(x_n) = w^T φ(x_n) at the training points
is a linear combination of Gaussian distributed
variables and hence is itself Gaussian, with zero mean and covariance
cov[y] = α^{-1} Φ Φ^T = K
where K is the Gram matrix with elements
K_{nm} = k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^T \phi(x_m)
So the marginal distribution p(y) is defined entirely by the Gram
matrix, so that p(y) = N(y | 0, K)
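A sketch of drawing sample functions y ~ N(0, K) from this prior on a grid of inputs (the Gaussian kernel, the grid, and the small jitter term added for numerical stability are illustrative choices):
```python
import numpy as np

def gp_prior_samples(x_grid, kernel, n_samples=3, jitter=1e-6):
    """Draw y ~ N(0, K), where K_nm = k(x_n, x_m) on the given grid."""
    K = np.array([[kernel(xn, xm) for xm in x_grid] for xn in x_grid])
    # Small jitter keeps the Cholesky factorisation numerically stable
    L = np.linalg.cholesky(K + jitter * np.eye(len(x_grid)))
    z = np.random.default_rng(0).normal(size=(len(x_grid), n_samples))
    return L @ z    # each column is one sample of the function values y

gauss = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
x_grid = np.linspace(0.0, 5.0, 100)
print(gp_prior_samples(x_grid, gauss).shape)   # (100, 3)
```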
Gaussian processes (3)
A Gaussian process is defined as a probability distribution over
functions y(x) such that the set of values of y(x) evaluated
at an arbitrary set of points {x} jointly has a Gaussian
distribution.
This distribution is specified completely by the second-order
statistics, the mean and the covariance.
The mean by symmetry is taken to be zero.
The covariance is given by the kernel function: E[y(x_n) y(x_m)] = k(x_n, x_m)
Gaussian processes (4)
• Gaussian kernel: k(x, x’) = exp(−‖x − x’‖^2 / 2σ^2)
• Exponential kernel: k(x, x’) = exp(−θ |x − x’|)
Gaussian processes for regression (1)
Let’s consider target values that include Gaussian noise on the
function values, t_n = y_n + ε_n, so that p(t_n | y_n) = N(t_n | y_n, β^{-1}),
where β is a hyperparameter representing the precision of
the noise.
Noting that p(t | y) = N(t | y, β^{-1} I_N) and p(y) = N(y | 0, K),
we can find the marginal
distribution
p(t) = \int p(t | y) \, p(y) \, dy = N(t | 0, C)
where C has elements C_{nm} = k(x_n, x_m) + β^{-1} δ_{nm}
Gaussian processes for regression (2)
Our goal is to find the conditional distribution p(t_{N+1} | t)
The joint distribution is given by p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1})
where
C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}, \quad k_n = k(x_n, x_{N+1}), \quad c = k(x_{N+1}, x_{N+1}) + β^{-1}
Using the results from Chapter 2 we see that p(t_{N+1} | t) is a
Gaussian distribution with mean and covariance given by
m(x_{N+1}) = k^T C_N^{-1} t, \qquad σ^2(x_{N+1}) = c − k^T C_N^{-1} k
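A compact NumPy sketch of these predictive equations (the Gaussian kernel, noise precision β, and toy data are illustrative assumptions):
```python
import numpy as np

def gp_predict(X, t, x_new, kernel, beta=25.0):
    """Predictive mean m = k^T C_N^{-1} t and variance c - k^T C_N^{-1} k."""
    N = len(X)
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    C = K + np.eye(N) / beta                      # C_N = K + beta^{-1} I
    k = np.array([kernel(xn, x_new) for xn in X])
    c = kernel(x_new, x_new) + 1.0 / beta
    mean = k @ np.linalg.solve(C, t)
    var = c - k @ np.linalg.solve(C, k)
    return mean, var

gauss = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, size=25)
t = np.sin(X) + 0.2 * rng.normal(size=25)
print(gp_predict(X, t, 1.0, gauss))   # mean should be close to sin(1.0)
```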
Gaussian processes for regression (3)
Gaussian processes for regression (4)
Automatic relevance determination
Gaussian processes for classification (1)
Our goal is to model the posterior probabilities of
the target variable for a new input vector, given
a set of training data.
These probabilities must lie in the interval (0, 1),
whereas a Gaussian process model makes
predictions that lie on the entire real axis.
To adapt Gaussian processes to classification we transform
the output of the Gaussian process using an
appropriate nonlinear activation function.
Gaussian processes for classification (2)
Let’s define a Gaussian process over a function a(x) and apply a logistic sigmoid
transformation to the output, y = σ(a(x)), so that y ∈ (0, 1)
Gaussian processes for classification (3)
We need to determine the predictive distribution p(t_{N+1} = 1 | t_N)
So we introduce a Gaussian process prior over the
vector a of function values, which in turn defines a non-Gaussian
process over t
For a two-class problem the required predictive
distribution is given by
p(t_{N+1} = 1 | t_N) = \int p(t_{N+1} = 1 | a_{N+1}) \, p(a_{N+1} | t_N) \, da_{N+1}
where p(t_{N+1} = 1 | a_{N+1}) = σ(a_{N+1})
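This integral has no closed form. As a hedged illustration (simple Monte Carlo, not one of the approximation schemes listed on the next slide), it can be estimated once a Gaussian approximation to p(a_{N+1} | t_N), with mean μ and variance σ², is available:
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_probability(mu, var, n_samples=100_000, seed=0):
    """Estimate p(t = 1) = E[sigma(a)] with a ~ N(mu, var) by Monte Carlo."""
    a = np.random.default_rng(seed).normal(mu, np.sqrt(var), size=n_samples)
    return sigmoid(a).mean()

# Averaging over the uncertainty in a pulls the probability towards 0.5
print(predictive_probability(mu=1.0, var=1e-4))   # ~ sigmoid(1.0) = 0.73
print(predictive_probability(mu=1.0, var=4.0))    # noticeably closer to 0.5
```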
Gaussian processes for classification (4)
Approaches to obtaining a Gaussian approximation to the posterior:
1. Variational inference
2. Expectation propagation
3. Laplace approximation