Basics of Kernel Methods in
Statistical Learning Theory
Mohammed Nasser
Professor
Department of Statistics
Rajshahi University
E-mail: mnasser.ru@gmail.com
Contents
Glimpses of Historical Development
Definition and Examples of Kernel
Some Mathematical Properties of Kernels
Construction of Kernels
Heuristic Presentation of Kernel Methods
Meaning of Kernels
Mercer Theorem and Its Latest Development
Direction of Future Development
Conclusion
Computer Scientists’ Contribution to
Statistics: Kernel Methods
Vladimir Vapnik
Jerome H. Friedman
Early History
 In 1900 Karl Pearson published his famous article on goodness of fit, judged one of the twelve best scientific articles of the twentieth century.
 In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the properties that
- a solution exists,
- the solution is unique,
- the solution depends continuously on the data, in some reasonable topology
(a well-posed problem).
Early History
In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively. He expressed his belief in the further development of statistics, but without proposing any alternative.
 During the sixties and seventies, Tukey, Huber and Hampel tried to develop Robust Statistics in order to remove the ill-posedness of classical statistics.
 Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the model.
 The onslaught of data mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive.
Let us see what kernel methods (KM) present…
Recent History
Support Vector Machines (SVM), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then.
 Result: a class of algorithms for Pattern Recognition
(Kernel Machines)
Now: a large and diverse community, from machine
learning, optimization, statistics, neural networks,
functional analysis, etc
Centralized website: www.kernel-machines.org
First textbook (2000): see www.support-vector.net
 Now (2012): At least twenty books of different tastes are available in the international market.
The book “The Elements of Statistical Learning” (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.
More History
David Hilbert used the German word ‘kern’ in his first paper on integral equations (Hilbert 1904).
The mathematical result underlying the kernel trick,
Mercer's theorem, is almost a century old (Mercer
1909). It tells us that any `reasonable' kernel function
corresponds to some feature space.
 The question of which kernels can be used to compute distances in feature spaces was answered by Schoenberg (1938).
The methods for representing kernels in linear spaces
were first studied by Kolmogorov (1941) for a countable
input domain.
The method for representing kernels in linear spaces
for the general case was developed by Aronszajn
(1950).
Dunford and Schwartz (1963) showed that Mercer's theorem also holds true for general compact spaces.
More History
The use of Mercer's theorem for interpreting kernels as inner
products in a feature space was introduced into machine
learning by Aizerman, Braverman and Rozonoer (1964)
Berg, Christensen and Ressel (1984) published a good
monograph on the theory of kernels.
Saitoh (1988) showed the connection between positivity (a ‘positive matrix’ as defined in Aronszajn (1950)) and the positive semi-definiteness of all kernel matrices formed on finite sets of points.
Reproducing kernels were extensively used in machine
learning and neural networks by Poggio and Girosi, see for
example Poggio and Girosi (1990), a paper on radial basis
function networks.
The theory of kernels was used in approximation and
regularization theory, and the first chapter of Spline Models
for Observational Data (Wahba 1990) gave a number of
theoretical results on kernel functions.
Kernel methods: Heuristic View
The common characteristic (structure) among the following statistical methods?
1. Principal Components Analysis
2. (Ridge) regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
5. Singular value decomposition
6. Independent component analysis
Their kernel counterparts: KPCA, SVR, KFDA, KCCA, KICA
We consider linear combinations of the input vector:
f(x) = wᵀx
We make use of the concepts of length and dot product available in Euclidean space.
Kernel methods: Heuristic View
• Linear learning typically has nice properties
– Unique optimal solutions, fast learning algorithms
– Better statistical analysis
• But one big problem
– Insufficient capacity
That means that in many data sets it fails to detect non-linear relationships among the variables.
• The other demerit
– Cannot handle non-vectorial data
Data
Vectors
Collections of features
e.g. height, weight, blood pressure, age, . . .
Can map categorical variables into vectors
Matrices
Images, Movies
Remote sensing and satellite data
(multispectral)
Strings
Documents
Gene sequences
Structured Objects
XML documents
Graphs
Kernel methods: Heuristic View
[Figure: heterogeneous genome-wide data — mRNA expression data, hydrophobicity data, protein–protein interaction data, sequence data (gene, protein)]
Kernel methods: Heuristic View
[Figure: a feature map φ sends data from the original (input) space to the feature space]
Definition of Kernels
Definition: A finitely positive semi-definite function k : X × X → ℝ is a symmetric function of its arguments for which every matrix K formed by restriction to a finite subset of points is positive semi-definite, i.e. αᵀKα ≥ 0 for all α.
 It is a generalized dot product
It is not generally bilinear
But it obeys the Cauchy–Schwarz inequality
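As a quick numerical illustration of this definition (a hypothetical sketch using NumPy and the Gaussian kernel listed later in the slides): build the kernel matrix on a small finite set of points and check symmetry and positive semi-definiteness via the eigenvalues.

import numpy as np

def gaussian_kernel(x, y, c=1.0):
    # k(x, y) = exp(-||x - y||^2 / c)
    return np.exp(-np.sum((x - y) ** 2) / c)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # five hypothetical points in R^3

# Restriction of k to this finite set: the kernel (Gram) matrix
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))           # symmetry
print(np.linalg.eigvalsh(K))         # eigenvalues all >= 0 (up to rounding),
                                     # hence alpha^T K alpha >= 0 for every alpha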
Kernel Methods: Basic Ideas
Proper kernel
k(x, y) = ⟨φ(x), φ(y)⟩ is always a kernel. When is the converse true?
Theorem (Aronszajn, 1950): A function k : X × X → ℝ can be written as k(x, y) = ⟨φ(x), φ(y)⟩, where φ is a feature map x ↦ φ(x) ∈ F, if and only if k(x, y) satisfies the finitely positive semi-definiteness property.
We can now check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without the need to know the feature map! If the map itself is needed, we may take the help of Mercer's theorem.
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel algorithm.
Some kernels:
- Gaussian: k(x, y) = exp(−‖x − y‖² / c)
- Polynomial: k(x, y) = (⟨x, y⟩ + θ)ᵈ
- Sigmoid: k(x, y) = tanh(κ⟨x, y⟩ + θ)
- k(x, y) = 1 / (‖x − y‖² + c²)
some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
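A minimal NumPy sketch of the kernels listed above; the parameter names (c, theta, d, kappa) are illustrative choices, not fixed by the slides. (The tanh kernel is positive semi-definite only for some parameter settings.)

import numpy as np

def gaussian(x, y, c=1.0):
    # k(x, y) = exp(-||x - y||^2 / c)
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

def polynomial(x, y, theta=1.0, d=2):
    # k(x, y) = (<x, y> + theta)^d
    return (np.dot(x, y) + theta) ** d

def sigmoid(x, y, kappa=1.0, theta=0.0):
    # k(x, y) = tanh(kappa <x, y> + theta)
    return np.tanh(kappa * np.dot(x, y) + theta)

def inverse_quadric(x, y, c=1.0):
    # k(x, y) = 1 / (||x - y||^2 + c^2)
    return 1.0 / (np.linalg.norm(x - y) ** 2 + c ** 2)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian(x, y), polynomial(x, y), sigmoid(x, y), inverse_quadric(x, y))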
Kernel Construction
The set of kernels forms a closed convex cone
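Concretely, the cone statement implies (together with other standard closure properties) that kernels can be combined freely, for example:

\[
a\,k_1 + b\,k_2 \ (a, b \ge 0), \qquad k_1\, k_2 , \qquad \lim_{n\to\infty} k_n \ \text{(pointwise)}
\quad \text{are kernels whenever } k_1, k_2, k_n \text{ are.}
\]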
Reproducing Kernel Hilbert Space
• Reproducing kernel Hilbert space (RKHS)
X: a set. A Hilbert space H consisting of functions on X is called a reproducing kernel Hilbert space (RKHS) if the evaluation functional
  e_x : H → ℝ, f ↦ f(x)
is continuous for each x ∈ X.
– A Hilbert space H consisting of functions on X is an RKHS if and only if there exists k(·, x) ∈ H (the reproducing kernel) such that
  ⟨k(·, x), f⟩_H = f(x) for all f ∈ H and x ∈ X (by Riesz's lemma).
Reproducing Kernel Hilbert Space II
Theorem (construction of RKHS)
If k : X × X → ℝ is positive definite, there uniquely exists an RKHS H_k on X such that
(1) k(·, x) ∈ H_k for all x ∈ X,
(2) the linear hull of {k(·, x) | x ∈ X} is dense in H_k,
(3) k(·, x) is a reproducing kernel of H_k, i.e. ⟨k(·, x), f⟩_{H_k} = f(x) for all f ∈ H_k and x ∈ X.
At this moment we put no structure on X. To obtain better properties of the members g of H_k we have to put extra structure on X and assume additional properties of k.
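As a sanity check of the reproducing property on the dense linear hull from (2): for f = Σᵢ aᵢ k(·, xᵢ),

\[
\langle k(\cdot, x), f \rangle_{H_k}
= \sum_i a_i \,\langle k(\cdot, x), k(\cdot, x_i) \rangle_{H_k}
= \sum_i a_i\, k(x, x_i)
= f(x).
\]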
Classification
Y = g(X),  g : X → Y
X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Y is discrete:
– {0,1}: binary
– {1,…,k}: multi-class
– tree, etc.: structured
Classification
X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Algorithms: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest.
Kernel trick: connects such algorithms to general input spaces X.
Regression
Y = g(X),  g : X → Y
X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Y is continuous (ℝ, ℝᵈ) – not always.
Regression
X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Algorithms: Perceptron, Normal Regression, GLM, Support Vector Regression.
Kernel trick: connects such algorithms to general input spaces X.
Kernel Methods: Heuristic View
Steps for Kernel Methods:
1. DATA MATRIX (traditional or non-traditional data)
2. Kernel matrix K = [k(xᵢ, xⱼ)], a positive semi-definite matrix (why p.s.d.? which k?)
3. Pattern function f(x) = Σᵢ αᵢ k(xᵢ, x)
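A minimal sketch of the last step of this pipeline, assuming the dual coefficients αᵢ have already been produced by some kernel algorithm (they are placeholders here), and using a Gaussian kernel for illustration:

import numpy as np

def gaussian(x, y, c=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

def pattern_function(x_new, X_train, alpha, kernel=gaussian):
    # f(x) = sum_i alpha_i k(x_i, x)
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
alpha = np.array([0.5, -0.2, 0.1])   # hypothetical dual coefficients
print(pattern_function(np.array([1.0, 0.5]), X_train, alpha))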
Kernel methods: Heuristic View
[Figure: a feature map φ sends data from the original (input) space to the feature space]
Kernel Methods: Basic Ideas
The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space.
The expectation is that the feature space has a much higher dimension than the input space. The feature space has an inner product, with
k(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩.
Kernel methods: Heuristic View
Form of functions
• So kernel methods use linear functions in a feature space:
• For regression this could be the function f(x) = ⟨w, φ(x)⟩
• For classification we additionally require thresholding, e.g. taking sign(f(x))
Kernel methods: Heuristic View
Feature spaces
φ : x ↦ φ(x), ℝᵈ → F, a non-linear mapping to F, where F may be
1. a high-dimensional space,
2. an infinite-dimensional countable space (ℓ²),
3. a function space (Hilbert space).
Example: φ(x, y) = (x², y², √2·xy)
Kernel methods: Heuristic View
Example
• Consider the mapping of x = (x₁, x₂) to the quadratic monomials of its coordinates (cf. the previous slide).
• Let us consider a linear equation in this feature space:
  a·x₁² + 0·x₁x₂ + 0·x₂x₁ + b·x₂² = c
• We actually have an ellipse – i.e. a non-linear shape in the input space.
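Using the three-dimensional map φ(x, y) = (x², y², √2·xy) from the previous slide, a direct computation shows why linear functions in such a quadratic feature space are conics (e.g. ellipses) in the input space:

\[
\langle \phi(u), \phi(v) \rangle
= u_1^2 v_1^2 + u_2^2 v_2^2 + 2\,u_1 u_2 v_1 v_2
= \langle u, v \rangle^2 ,
\qquad
\langle w, \phi(x) \rangle = w_1 x_1^2 + w_2 x_2^2 + \sqrt{2}\, w_3\, x_1 x_2 = c ,
\]

which is a conic section in (x₁, x₂), an ellipse for suitable choices of w and c.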
Kernel methods: Heuristic View
Ridge Regression (duality)
Problem (target + regularization):
  min_w Σᵢ (yᵢ − wᵀxᵢ)² + λ‖w‖²
Solution:
  w = (XᵀX + λI_d)⁻¹ Xᵀ y        (d×d inverse)
    = Xᵀ (XXᵀ + λI_n)⁻¹ y        (n×n inverse)
    = Xᵀ (G + λI)⁻¹ y,           where G_ij = ⟨xᵢ, xⱼ⟩ is the inner product of the observations.
So w = Σᵢ αᵢ xᵢ is a linear combination of the data, with α = (G + λI)⁻¹ y, and
  f(x) = wᵀx = Σᵢ αᵢ ⟨xᵢ, x⟩.
Dual Representation
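A compact NumPy sketch of this dual solution, i.e. kernel ridge regression on toy data (the Gaussian kernel, the data and λ are illustrative choices):

import numpy as np

def gaussian(x, y, c=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)

lam = 0.1
G = np.array([[gaussian(xi, xj) for xj in X] for xi in X])   # Gram matrix G_ij
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)         # alpha = (G + lambda I)^{-1} y

def f(x_new):
    # f(x) = sum_i alpha_i k(x_i, x)
    return sum(a * gaussian(xi, x_new) for a, xi in zip(alpha, X))

print(f(np.array([0.3])))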
Kernel methods: Heuristic View
Kernel trick
Note: In the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace x by φ(x); then
  G_ij = ⟨xᵢ, xⱼ⟩  becomes  G_ij = ⟨φ(xᵢ), φ(xⱼ)⟩ = k(xᵢ, xⱼ).
If we use algorithms that only depend on the Gram matrix G, then we never have to know (compute) the actual features φ(x).
Gist of Kernel methods
Choice of a Kernel Function.
Through the choice of a kernel function we choose a Hilbert space.
We then apply the linear method in this new space, without increasing computational complexity, using the mathematical niceties of this space.
Kernels to Similarity
• Intuition of kernels as similarity measures:
  k(x, x̃) = ( ‖φ(x)‖² + ‖φ(x̃)‖² − d(φ(x), φ(x̃))² ) / 2
• When the diagonal entries of the kernel Gram matrix are constant, kernels are directly related to similarities.
– For example the Gaussian kernel
  K_G(x, x̃) = exp( −‖x − x̃‖² / (2σ²) )
– In general, it is useful to think of a kernel as a similarity measure.
Kernels to Distance
• Distance between two points x₁ and x₂ in feature space:
  d(x₁, x₂) = ‖φ(x₁) − φ(x₂)‖ = √( k(x₁, x₁) + k(x₂, x₂) − 2k(x₁, x₂) )
• Distance between a point x₁ and a set S = {x₁, …, xₙ} in feature space (the distance from φ(x₁) to the mean of the mapped points of S):
  d(x₁, S)² = k(x₁, x₁) − (2/n) Σᵢ k(x₁, xᵢ) + (1/n²) Σᵢ Σⱼ k(xᵢ, xⱼ)
Kernel methods: Heuristic View
[Figure: heterogeneous genome-wide data — mRNA expression data, hydrophobicity data, protein–protein interaction data, sequence data (gene, protein)]
Similarity to Kernels
How can we make it positive semi-definite if it is not?
  k₃ₓ₃ =
  [ 1.0  0.5  0.3 ]
  [ 0.5  1.0  0.6 ]
  [ 0.3  0.6  1.0 ]
From Similarity Scores to Kernels
Removal of negative eigenvalues
Form the similarity matrix S, where the (i, j)-th entry of S denotes the similarity between the i-th and j-th data points. S is symmetric, but in general not positive semi-definite, i.e. S may have negative eigenvalues.
  S = UΣUᵀ, where Σ = diag(λ₁, λ₂, …, λₙ) and λ₁ ≥ λ₂ ≥ … ≥ λᵣ ≥ 0 > λᵣ₊₁ ≥ … ≥ λₙ.
  Set K = UΣ̃Uᵀ, where Σ̃ = diag(λ₁, λ₂, …, λᵣ, 0, …, 0).
From Similarity Scores to Kernels
[Figure: two tables of similarity scores, rows x₁, …, xₙ against columns t₁, …, tₙ]
Kernels as Measures of Function Regularity
Empirical risk functional:
  R_{L,Pₙ}(g) = ∫_{X×Y} L(x, y, g(x)) dPₙ(x, y)
             = ∫_X [ ∫_Y L(x, y, g(x)) dPₙ(y | x) ] dPₙ,X(x)
             = (1/n) Σᵢ₌₁ⁿ L(xᵢ, yᵢ, g(xᵢ))
→ Problems of empirical risk minimization
What Can We Do?
 We can restrict the set of functions over which we minimize the empirical risk functional (structural risk minimization), or
 modify the criterion to be minimized, e.g. by adding a penalty for ‘complicated’ functions (regularization). We can also combine the two.
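Combining the two options gives the regularized criterion that the later RKHS slides solve for the squared loss; in general form (with λ > 0 a regularization parameter):

\[
\min_{f \in H_k}\; \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, f(x_i)) \;+\; \lambda\, \|f\|_{H_k}^{2}, \qquad \lambda > 0 .
\]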
Best Approximation
[Figure: best approximation — f ∈ H is projected orthogonally onto the subspace M, giving f̂]
Best approximation
• Assume M is finite dimensional with basis {k₁, …, k_m}, i.e. f̂ = a₁k₁ + … + a_m k_m.
• Orthogonality of f − f̂ to M gives m conditions (i = 1, …, m):
  ⟨kᵢ, f − (a₁k₁ + … + a_m k_m)⟩ = 0,
  i.e. ⟨kᵢ, f⟩ − a₁⟨kᵢ, k₁⟩ − … − a_m⟨kᵢ, k_m⟩ = 0.
RKHS approximation
With kᵢ = k(·, xᵢ), the m conditions become:
  yᵢ − a₁⟨kᵢ, k₁⟩ − … − a_m⟨kᵢ, k_m⟩ = 0.
We can then estimate the parameters using a = K⁻¹y.
In practice K can be ill-conditioned, so we instead minimise
  min_{f ∈ H_k} Σᵢ ( f(xᵢ) − yᵢ )² + λ‖f‖²_{H_k},   λ > 0,
which gives a = (K + λI)⁻¹ y.
Approximation vs estimation
[Figure: the target space contains the true function; the hypothesis space contains the best possible estimate and the actual estimate — approximation vs estimation error]
How to choose kernels?
• There is no absolute rule for choosing the right kernel,
adapted to a particular problem.
• Kernel should capture the desired similarity.
– Kernels for vectors: Polynomial and Gaussian kernel
– String kernel (text documents)
– Diffusion kernel (graphs)
– Sequence kernel (protein, DNA, RNA)
Kernel Selection
• Ideally select the optimal kernel based on our prior
knowledge of the problem domain.
• In practice, consider a family of kernels defined in a way that again reflects our prior expectations.
• Simple way: require only a limited amount of additional information from the training data.
• Elaborate way: Combine label information
Future Development
Mathematics:
 Generalization of Mercer Theorem for pseudo metric
spaces
 Development of mathematical tools for multivariate
regression
Statistics:
Application of kernels in multivariate data depth
 Application of ideas of robust statistics
Application of these methods in circular data
 They can be used to study nonlinear time series
Acknowledgement
Jieping Ye
Department of Computer Science and
Engineering
Arizona State University
http://www.public.asu.edu/~jye02
• http://www.kernel-machines.org/
– Papers, software, workshops, conferences, etc.
Thank You
Thank You