2005-04-15
Supplemental notes, BIOINF 2054/BIOSTAT 2018
Statistical Foundations for Bioinformatics Data Mining
Target readings: Hastie, Tibshirani & Friedman, Chapter 6 (Kernel Smoothing).
Recall kernel-weighted averaging:
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
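A minimal sketch of this estimator in Python (the Epanechnikov kernel and all function names here are illustrative choices, not from the notes):

```python
import numpy as np

def epanechnikov(u):
    """D(u) = 0.75 (1 - u^2) for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kernel_average(x0, x, y, lam):
    """Kernel-weighted average f_hat(x0) with fixed bandwidth lam."""
    w = epanechnikov((x - x0) / lam)      # K_lambda(x0, x_i)
    return np.sum(w * y) / np.sum(w)      # normalized weighted average of y_i

# Toy usage: smooth noisy samples of sin(x).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)
print(kernel_average(np.pi / 2, x, y, lam=0.5))   # should be near sin(pi/2) = 1
```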
Example: the kNN algorithm, with
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right), \qquad D(u) = I(|u| \le 1)$$
and the width is data-driven:
h ( x0)  hk ( x0) | x0  x0( k ) |
where x0( k ) is the kth nearest neighbor.
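A sketch of the data-driven width, reusing the boxcar $D$ above (function names again illustrative):

```python
import numpy as np

def knn_width(x0, x, k):
    """h_k(x0) = |x0 - x_(k)|: distance to the k-th nearest neighbor of x0."""
    return np.sort(np.abs(x - x0))[k - 1]

def knn_average(x0, x, y, k):
    """kNN smoother: boxcar kernel D(u) = I(|u| <= 1) with adaptive width."""
    h = knn_width(x0, x, k)
    w = (np.abs(x - x0) <= h).astype(float)
    return np.sum(w * y) / np.sum(w)
```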
Q: How do the variance and bias vary with the data density?
Local linear regression
Replace the one-df (locally constant) fit at $x_0$ by a two-df (locally linear) fit:
$$\mathrm{RSS}(x_0, \alpha, \beta) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[\,y_i - \alpha - \beta x_i\,\big]^2$$
$$\hat{f}(x_0) = (1\;\; x_0)\,\operatorname*{arg\,min}_{\alpha,\beta}\,\mathrm{RSS}(x_0, \alpha, \beta)$$
i.e. $\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)\,x_0$, where $(\hat{\alpha}(x_0), \hat{\beta}(x_0))^T$ is the minimizer.
So there’s a different regression fit for each point x0 .
Local linear regression does “automatic kernel carpentry”.
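A minimal sketch of the local fit, under the same illustrative assumptions as above (Epanechnikov weights):

```python
import numpy as np

def local_linear(x0, x, y, lam):
    """f_hat(x0) = (1, x0) @ argmin_{a,b} sum_i K_lam(x0, x_i) (y_i - a - b x_i)^2."""
    u = (x - x0) / lam
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)   # K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])               # B = (1  X)
    K = np.diag(w)                                          # K_lambda(x0) = diag(...)
    coef = np.linalg.solve(B.T @ K @ B, B.T @ K @ y)        # (alpha_hat, beta_hat)
    return np.array([1.0, x0]) @ coef
```

Note the solve step is exactly the closed form $(1\;\;x_0)(\mathbf{B}^T\mathbf{K}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{K}\mathbf{y}$ derived next.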
What does this mean?
$$\hat{f}(x_0) = (1\;\; x_0)\,(\mathbf{B}^T \mathbf{K} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{K}\, \mathbf{y} = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where $\mathbf{K} = \mathbf{K}_\lambda(x_0) = \mathrm{diag}(K_\lambda(x_0, x_i))$ and $\mathbf{B} = (\mathbf{1}\;\; \mathbf{X})$ is the $N \times 2$ regression matrix.
Think of this as:
$$\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i = \sum_{i=1}^{N} K^*_\lambda(x_0, x_i)\, y_i$$
I.e. a kernel-weighted average, but the shape of the kernel
now depends on x0 , in a “good” way. See Fig 6.4.
Note that
$$\sum_{i=1}^{N} l_i(x_0) = 1 \qquad\text{and}\qquad \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0,$$
because
$$l(x_0)^T (\mathbf{1}\;\;\mathbf{X}) = (1\;\; x_0)\,(\mathbf{B}^T\mathbf{K}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{K}\mathbf{B} = (1\;\; x_0):$$
the first column gives $\sum_i l_i(x_0) = 1$ and the second gives $\sum_i l_i(x_0)\, x_i = x_0$.
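These identities are easy to check numerically; a sketch, under the same illustrative assumptions as before:

```python
import numpy as np

def equivalent_kernel(x0, x, lam):
    """The weights l_i(x0) = (1, x0) (B^T K B)^{-1} B^T K, as a length-N vector."""
    u = (x - x0) / lam
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    B = np.column_stack([np.ones_like(x), x])
    K = np.diag(w)
    return np.array([1.0, x0]) @ np.linalg.solve(B.T @ K @ B, B.T @ K)

x = np.linspace(0, 1, 50)
l = equivalent_kernel(0.3, x, lam=0.2)
print(np.sum(l))               # 1.0 (up to rounding)
print(np.sum((x - 0.3) * l))   # 0.0 (up to rounding)
```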
So
$$\mathrm{E}\,\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\,\mathrm{E}\,y_i = \sum_{i=1}^{N} l_i(x_0)\, f(x_i)$$
$$= \sum_{i=1}^{N} l_i(x_0)\left[\, f(x_0) + (x_i - x_0)\, f'(x_0) + O\!\big((x_i - x_0)^2\big) \right]$$
$$= f(x_0) + \sum_{i=1}^{N} l_i(x_0)\, O\!\big((x_i - x_0)^2\big),$$
so the constant and linear terms are reproduced exactly and the bias is of second order only.
Extensions & variations of kernel methods:
Local polynomial regression.
Local multiple regression in $\mathbb{R}^p$.
Structured local regression in $\mathbb{R}^p$.
Local likelihood.
Relationship to kernel density estimation
Forget about y for a moment. Estimate
$$\hat{f}_X(x_0) = \frac{\#\{x_i \in \mathcal{N}(x_0)\}}{N \cdot \{\text{width of } \mathcal{N}(x_0)\}}$$
or better:
$$\hat{f}_X(x_0) = \frac{1}{N\lambda}\sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
Now, do this conditional on group membership $y$, then estimate the regression as
$$\widehat{\Pr}(G = j \mid X = x_0) = \frac{\hat{\pi}_j\, \hat{f}_{X;j}(x_0)}{\sum_{k=1}^{J} \hat{\pi}_k\, \hat{f}_{X;k}(x_0)}$$

Conditioning      Linear methods            Kernel method
on X              Logistic regression       Kernel smoothing
on Y              Discriminant analysis     Kernel density estimation
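A sketch of this density-based classifier, reusing the hypothetical parzen_density() above; the priors $\hat{\pi}_j$ are estimated by class proportions:

```python
import numpy as np

def classify(x0, x, g, lam):
    """Posterior Pr(G = j | X = x0) via class-conditional density estimates."""
    classes = np.unique(g)
    prior = np.array([np.mean(g == j) for j in classes])                    # pi_hat_j
    dens = np.array([parzen_density(x0, x[g == j], lam) for j in classes])  # f_hat_{X;j}(x0)
    post = prior * dens
    return classes, post / post.sum()                                       # Bayes' rule
```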
With kernel methods, "the model is the entire training data set, and the fitting is done at evaluation or prediction time" (p. 190).