Kernel Methods
Dept. Computer Science & Engineering,
Shanghai Jiao Tong University
Outline
• One-Dimensional Kernel Smoothers
• Local Regression
• Local Likelihood
• Kernel Density Estimation
• Naive Bayes
• Radial Basis Functions
• Mixture Models and EM
One-Dimensional Kernel Smoothers
• k-NN: $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$ (a sketch follows below)
• The 30-NN curve is bumpy, since $\hat{f}(x)$ is discontinuous in $x$.
• The average changes in a discrete way, leading to a discontinuous $\hat{f}(x)$.
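A minimal sketch of this running-mean estimator, assuming NumPy; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def knn_smoother(x_train, y_train, x0, k=30):
    """Estimate f(x0) as the average response of the k nearest neighbors of x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]  # indices of the k closest training points
    return y_train[idx].mean()
```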
One-Dimensional Kernel Smoothers
• Nadaraya-Watson kernel-weighted average:
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
• Epanechnikov quadratic kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
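A minimal sketch of the Nadaraya-Watson estimator with the Epanechnikov kernel, assuming NumPy (names illustrative):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) on |t| <= 1, and 0 outside."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x_train, y_train, x0, lam=0.2):
    """Kernel-weighted average of the y_i around x0 with bandwidth lam."""
    w = epanechnikov(np.abs(x_train - x0) / lam)
    return np.sum(w * y_train) / np.sum(w)
```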
One-Dimensional Kernel Smoothers
• More general kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
– $h_\lambda(x_0)$: width function that determines the width of the neighborhood at $x_0$.
– For the quadratic kernel, $h_\lambda(x_0) = \lambda$ (constant width), so the bias of the estimate tends to be constant.
– For the k-NN kernel, $\lambda$ is replaced by the neighbor count $k$ and $h_k(x_0) = |x_0 - x_{[k]}|$, where $x_{[k]}$ is the $k$-th closest $x_i$ to $x_0$; here the variance tends to be constant.
– The Epanechnikov kernel has compact support.
One-Dimensional Kernel Smoothers
• Three popular kernels for local smoothing share the form
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$
• The Epanechnikov and tri-cube kernels have compact support, but the tri-cube kernel has two continuous derivatives at the boundary of its support.
• The Gaussian kernel has infinite support (sketches of all three follow below).
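The three kernels just named, as a sketch in NumPy (the Epanechnikov form repeats the earlier snippet for completeness):

```python
import numpy as np

def epanechnikov(t):
    """Compact support, quadratic: 3/4 (1 - t^2) on |t| <= 1."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """Compact support, with two continuous derivatives at the boundary."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    """Infinite support: the standard normal density."""
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
```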
Local Linear Regression
• Boundary issue
– Kernel smoothers can be badly biased on the boundaries because of the asymmetry of the kernel in that region.
– Fitting a local linear model removes this bias to first order.
Local Linear Regression
• Locally weighted linear regression makes a first-order correction (a sketch follows below).
• Solve a separate weighted least squares problem at each target point $x_0$:
$$\min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,[y_i - \alpha(x_0) - \beta(x_0) x_i]^2$$
• The estimate: $\hat{f}(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$
• Let $b(x)^T = (1, x)$, let $B$ be the $N \times 2$ regression matrix with $i$-th row $b(x_i)^T$, and let $W(x_0) = \mathrm{diag}(K_\lambda(x_0, x_i))$, an $N \times N$ matrix. Then
$$\hat{f}(x_0) = b(x_0)^T \big(B^T W(x_0) B\big)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
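A sketch of the closed-form local linear fit above, assuming NumPy; the weight row trick avoids forming the diagonal matrix explicitly:

```python
import numpy as np

def local_linear(x_train, y_train, x0, lam=0.2):
    """Local linear regression at x0: b(x0)^T (B^T W B)^{-1} B^T W y."""
    t = np.abs(x_train - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)           # Epanechnikov weights K_lam(x0, x_i)
    B = np.column_stack([np.ones_like(x_train), x_train])  # N x 2 matrix with rows b(x_i)^T
    BtW = B.T * w                                          # B^T W(x0) without building diag(w)
    alpha, beta = np.linalg.solve(BtW @ B, BtW @ y_train)  # (alpha_hat(x0), beta_hat(x0))
    return alpha + beta * x0                               # f_hat(x0)
```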
Local Linear Regression
• The weights $l_i(x_0)$ combine the weighting kernel $K_\lambda(x_0, \cdot)$ and the least squares operations; they define the equivalent kernel.
Local Linear Regression
• Expand $E\hat{f}(x_0)$ using the linearity of local regression and a series expansion of the true function $f$ around $x_0$:
$$E\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0) f(x_i) = f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
• For local linear regression, $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$.
• Hence the bias $E\hat{f}(x_0) - f(x_0)$ depends only on quadratic and higher-order terms in the expansion of $f$.
Local Polynomial Regression
• Fit local polynomials of any degree $d$:
$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d}\; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j\right]^2$$
$$\hat{f}(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^j$$
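The same machinery extends to degree $d$ by adding polynomial columns, as in this sketch (NumPy assumed, names illustrative):

```python
import numpy as np

def local_poly(x_train, y_train, x0, d=2, lam=0.2):
    """Local polynomial regression of degree d at x0."""
    t = np.abs(x_train - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)      # Epanechnikov weights
    B = np.vander(x_train, d + 1, increasing=True)    # rows (1, x_i, x_i^2, ..., x_i^d)
    BtW = B.T * w
    coef = np.linalg.solve(BtW @ B, BtW @ y_train)    # (alpha_hat, beta_1_hat, ..., beta_d_hat)
    return np.polyval(coef[::-1], x0)                 # evaluate the local polynomial at x0
```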
Local Polynomial Regression
• The bias has only components of degree $d+1$ and higher.
• The reduction in bias is paid for with increased variance:
$$\mathrm{var}(\hat{f}(x_0)) = \sigma^2 \|l(x_0)\|^2, \qquad \|l(x_0)\| \text{ increases with } d$$
Selecting the Width of the Kernel
• In the kernel $K_\lambda$, $\lambda$ is a parameter that controls the kernel width:
– For a kernel with compact support, $\lambda$ is the radius of the support region.
– For the Gaussian kernel, $\lambda$ is the standard deviation.
– For the k-nearest-neighbor method, $\lambda$ is the fraction $k/N$.
• The window width induces a bias-variance tradeoff (a selection sketch follows below):
– A narrow window gives high variance but low bias in the estimated mean.
– A wide window gives low variance but high bias in the estimated mean.
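Cross-validation is one standard way to navigate this tradeoff; it is not spelled out on the slide, so the following leave-one-out sketch is an assumption. It works with any of the smoothers sketched earlier that match the call signature `smoother(x_tr, y_tr, x0, lam)`:

```python
import numpy as np

def loocv_bandwidth(x, y, smoother, lambdas):
    """Pick the bandwidth with the smallest leave-one-out squared prediction error."""
    errors = []
    for lam in lambdas:
        sse = 0.0
        for i in range(len(x)):
            keep = np.arange(len(x)) != i                          # hold out point i
            sse += (y[i] - smoother(x[keep], y[keep], x[i], lam)) ** 2
        errors.append(sse / len(x))
    return lambdas[int(np.argmin(errors))]
```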
Structured Local Regression
• Structured kernels:
$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\, (x - x_0)}{\lambda}\right)$$
– Introduce structure by imposing appropriate restrictions on $A$ (a sketch follows below).
• Structured regression function:
$$f(X_1, X_2, \dots, X_p) = \alpha + \sum_{j} g_j(X_j) + \sum_{k < l} g_{kl}(X_k, X_l) + \cdots$$
– Introduce structure by eliminating some of the higher-order terms.
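A sketch of a structured kernel. The slides do not fix a profile $D$ here, so taking $D(t) = e^{-t/2}$ (a Gaussian-type choice) is an assumption made for illustration:

```python
import numpy as np

def structured_kernel(x0, x, A, lam=1.0):
    """K_{lam,A}(x0, x) = D((x - x0)^T A (x - x0) / lam) with assumed D(t) = exp(-t/2).

    Shrinking an eigenvalue of A downweights distance along the corresponding
    eigenvector, stretching the kernel's neighborhood in that direction.
    """
    d = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    return np.exp(-0.5 * d @ A @ d / lam)
```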
Local Likelihood & Other Models
• Any parametric model can be made local:
– Parameter associated with $y_i$: $\theta_i = \theta(x_i) = x_i^T \beta$
– Log-likelihood: $l(\beta) = \sum_{i=1}^{N} l(y_i, x_i^T \beta)$
– A likelihood for the model $\theta(X)$ local to $x_0$:
$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l\big(y_i, x_i^T \beta(x_0)\big)$$
– A varying coefficient model $\theta(z)$:
$$l(\theta(z_0)) = \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, l\big(y_i, \eta(x_i, \theta(z_0))\big), \qquad \text{e.g. } \eta(x, \theta) = x^T \theta$$
Local Likelihood & Other Models
• Logistic regression:
$$\Pr(G = j \mid X = x) = \frac{\exp(\beta_{j0} + \beta_j^T x)}{1 + \sum_{k=1}^{J-1} \exp(\beta_{k0} + \beta_k^T x)}$$
– Local log-likelihood for the $J$-class model (a two-class sketch follows below):
$$\sum_{i=1}^{N} K_\lambda(x_0, x_i)\Big\{\beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\Big[1 + \sum_{k=1}^{J-1} \exp\big(\beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0)\big)\Big]\Big\}$$
– Centering the local regressions at $x_0$ gives
$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\exp(\hat\beta_{j0}(x_0))}{1 + \sum_{k=1}^{J-1} \exp(\hat\beta_{k0}(x_0))}$$
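A two-class sketch of the local fit above, via kernel-weighted Newton steps on centered inputs (NumPy assumed; `y` coded 0/1; the small ridge term is an ad hoc numerical safeguard):

```python
import numpy as np

def local_logistic(x, y, x0, lam=1.0, n_iter=25):
    """Local logistic regression at x0; returns the estimate of Pr(G=1 | x0)."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)     # Epanechnikov weights
    B = np.column_stack([np.ones_like(x), x - x0])   # centered design: (1, x_i - x0)
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-B @ beta))          # fitted probabilities
        grad = B.T @ (w * (y - p))                   # kernel-weighted score
        hess = (B.T * (w * p * (1 - p))) @ B         # kernel-weighted information
        beta += np.linalg.solve(hess + 1e-8 * np.eye(2), grad)
    # Because the fit is centered at x0, the estimate there uses only the intercept.
    return 1.0 / (1.0 + np.exp(-beta[0]))
```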
Kernel Density Estimation
• A natural local estimate:
$$\hat{f}_X(x_0) = \frac{\#\{x_i \in \mathcal{N}(x_0)\}}{N\lambda}$$
• The smooth Parzen estimate (a one-dimensional sketch follows below):
$$\hat{f}_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
– For the Gaussian kernel, $K_\lambda(x_0, x_i)/\lambda = \phi_\lambda(x_i - x_0)$, the Gaussian density with mean zero and standard deviation $\lambda$.
– The estimate then becomes (in $p$ dimensions)
$$\hat{f}_X(x_0) = \frac{1}{N} \sum_{i=1}^{N} \phi_\lambda(x_i - x_0) = \frac{1}{N (2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} \exp\!\left(-\tfrac{1}{2}\big(\|x_i - x_0\| / \lambda\big)^2\right)$$
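A one-dimensional sketch of the Parzen estimate with the Gaussian kernel (NumPy assumed):

```python
import numpy as np

def parzen_density(x_train, x0, lam=1.0):
    """Smooth Parzen estimate f_hat(x0) = (1/N) sum_i phi_lam(x0 - x_i)."""
    z = (x0 - x_train) / lam
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)   # standard normal density at z
    return phi.mean() / lam                          # rescale: phi_lam(u) = phi(u/lam)/lam
```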
Kernel Density Estimation
• A kernel density estimate for systolic blood pressure. The density estimate at each point is the average contribution from each of the kernels at that point.
Kernel Density Classification
• Bayes' theorem combines nonparametric class-density estimates $\hat{f}_j$ with class-prior estimates $\hat\pi_j$:
$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j \hat{f}_j(x_0)}{\sum_{k=1}^{J} \hat\pi_k \hat{f}_k(x_0)}$$
• The estimate for CHD uses the tri-cube kernel with k-NN bandwidth.
Kernel Density Classification
• The population class densities and the posterior probabilities.
Naïve Bayes
• The naïve Bayes model assumes that, given a class $G = j$, the features $X_k$ are independent:
$$f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k)$$
– $\hat{f}_{jk}(X_k)$ is a kernel density estimate, or a Gaussian, for coordinate $X_k$ in class $j$.
– If $X_k$ is categorical, use a histogram.
• The logit transform then takes a generalized additive form (a sketch follows below):
$$\log\frac{\Pr(G = \ell \mid X)}{\Pr(G = J \mid X)} = \log\frac{\pi_\ell f_\ell(X)}{\pi_J f_J(X)} = \log\frac{\pi_\ell \prod_{k=1}^{p} f_{\ell k}(X_k)}{\pi_J \prod_{k=1}^{p} f_{Jk}(X_k)} = \log\frac{\pi_\ell}{\pi_J} + \sum_{k=1}^{p} \log\frac{f_{\ell k}(X_k)}{f_{Jk}(X_k)} = \alpha_\ell + \sum_{k=1}^{p} g_{\ell k}(X_k)$$
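A sketch of naive Bayes with per-coordinate Gaussian-kernel density estimates, assuming NumPy (names illustrative):

```python
import numpy as np

def naive_bayes_posterior(X, g, x0, lam=1.0):
    """Posterior Pr(G = j | x0) from class priors and product-of-KDE densities."""
    classes = np.unique(g)
    scores = []
    for j in classes:
        Xj = X[g == j]
        score = len(Xj) / len(X)                  # class prior pi_j
        for k in range(X.shape[1]):               # independence across coordinates
            z = (x0[k] - Xj[:, k]) / lam
            kde = np.mean(np.exp(-0.5 * z**2)) / (lam * np.sqrt(2 * np.pi))
            score *= kde                          # estimate of f_jk(x0_k)
        scores.append(score)
    scores = np.array(scores)
    return classes, scores / scores.sum()         # normalized posteriors
```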
Radial Basis Function & Kernel
• Radial basis functions combine the local behavior of kernel methods with the flexibility of basis expansions:
$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^{M} D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right) \beta_j$$
– Each basis element is indexed by a location (prototype) parameter $\xi_j$ and a scale parameter $\lambda_j$.
– A popular choice for $D$ is the standard Gaussian density function.
Radial Basis Function & Kernel
• For simplicity, focus on least squares methods for regression, and use the Gaussian kernel.
• The RBF network model:
$$\min_{\{\lambda_j, \xi_j, \beta_j\}_{1}^{M}}\; \sum_{i=1}^{N} \left[ y_i - \beta_0 - \sum_{j=1}^{M} \beta_j \exp\!\left(-\frac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2}\right) \right]^2$$
• A common simplification is to estimate the $\lambda_j, \xi_j$ separately from the $\beta_j$ (a two-stage sketch follows below).
• An undesirable side effect is the creation of holes: regions of $\mathbb{R}^p$ where none of the kernels has appreciable support.
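A sketch of the simplified two-stage fit: the centers $\xi_j$ and width $\lambda$ are fixed first (choosing centers by clustering, e.g. k-means, is a common heuristic, not the slides' prescription), then the $\beta_j$ are found by ordinary least squares:

```python
import numpy as np

def fit_rbf(X, y, centers, lam=1.0):
    """Least squares fit of beta_0 and beta_j with centers xi_j and width lam fixed."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # squared distances to centers
    H = np.exp(-d2 / lam**2)                                        # N x M Gaussian basis matrix
    H = np.column_stack([np.ones(len(X)), H])                       # prepend the intercept column
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return beta                                                     # (beta_0, beta_1, ..., beta_M)

def predict_rbf(X, centers, beta, lam=1.0):
    """Evaluate the fitted RBF expansion at new points."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return beta[0] + np.exp(-d2 / lam**2) @ beta[1:]
```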
Radial Basis Function & Kernel
• Renormalized radial basis functions:
$$h_j(x) = \frac{D(\|x - \xi_j\| / \lambda)}{\sum_{k=1}^{M} D(\|x - \xi_k\| / \lambda)}$$
• The Nadaraya-Watson estimate can be written as an expansion in renormalized radial basis functions:
$$\hat{f}(x_0) = \sum_{i=1}^{N} y_i\, \frac{K_\lambda(x_0, x_i)}{\sum_{i'=1}^{N} K_\lambda(x_0, x_{i'})} = \sum_{i=1}^{N} y_i\, h_i(x_0)$$
• Gaussian radial basis functions with fixed width can leave holes; renormalized Gaussian radial basis functions produce basis functions similar in some respects to B-splines.
Mixture Models & EM
• Gaussian mixture model:
$$f(x) = \sum_{m=1}^{M} \alpha_m\, \phi(x;\, \mu_m, \Sigma_m)$$
– The $\alpha_m$ are mixture proportions, with $\sum_{m=1}^{M} \alpha_m = 1$.
• EM algorithm for a two-component mixture:
– Given $x_1, x_2, \dots, x_N$, the log-likelihood
$$l(\theta; x) = \sum_{i=1}^{N} \log\left[\pi \phi_1(x_i) + (1 - \pi)\, \phi_2(x_i)\right]$$
is hard to maximize directly, because the sum sits inside the logarithm.
– Suppose instead we observed latent binary $z_i$, with $z_i = 1 \Rightarrow x_i \sim \phi_1$ and $z_i = 0 \Rightarrow x_i \sim \phi_2$. The complete-data log-likelihood
$$L(x, z; \theta) = \sum_{i:\, z_i = 1} \log\left[\pi \phi_1(x_i)\right] + \sum_{i:\, z_i = 0} \log\left[(1 - \pi)\, \phi_2(x_i)\right]$$
is easy to maximize.
Mixture Models & EM
• Given $\theta^0$, compute the expected complete-data log-likelihood
$$Q(\theta; \theta^0) = E\left[ L(x, z; \theta) \mid \theta^0, x \right]$$
and maximize it over $\theta$.
• In the example, the E-step computes the responsibilities
$$E(z_i \mid x_i, \hat\theta) = \frac{\hat\pi\, \hat\phi_1(x_i)}{\hat\pi\, \hat\phi_1(x_i) + (1 - \hat\pi)\, \hat\phi_2(x_i)} = w_i,$$
and the M-step maximizes
$$Q(\theta; \hat\theta) = \sum_{i=1}^{N} \left[ w_i \log\big(\pi \phi_1(x_i)\big) + (1 - w_i) \log\big((1 - \pi)\, \phi_2(x_i)\big) \right]$$
(a sketch of the full iteration follows below).
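A sketch of the resulting EM iteration for the two-component univariate example, assuming NumPy; the starting values are an ad hoc choice:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a two-component univariate Gaussian mixture."""
    phi = lambda u, m, s: np.exp(-0.5 * ((u - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    pi, mu1, mu2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()  # crude initialization
    for _ in range(n_iter):
        # E-step: responsibilities w_i = E(z_i | x_i, theta)
        a = pi * phi(x, mu1, s1)
        b = (1 - pi) * phi(x, mu2, s2)
        w = a / (a + b)
        # M-step: weighted maximum-likelihood updates of all parameters
        pi = w.mean()
        mu1 = np.sum(w * x) / w.sum()
        mu2 = np.sum((1 - w) * x) / (1 - w).sum()
        s1 = np.sqrt(np.sum(w * (x - mu1) ** 2) / w.sum())
        s2 = np.sqrt(np.sum((1 - w) * (x - mu2) ** 2) / (1 - w).sum())
    return pi, (mu1, s1), (mu2, s2)
```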
Mixture Models & EM
• Application of mixtures to the heart disease risk factor study.
Mixture Models & EM
• Mixture model used for classification of the simulated data.