Kernel Density Estimation

Introduction to Nonparametric Statistics
Kernel Density Estimation
Nonparametric Statistics

Fewer restrictive assumptions about data
and underlying probability distributions.

Population distributions may be skewed
and multi-modal.
Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) is a non-parametric technique
for density estimation in which a known density function (the
kernel) is averaged across the observed data points to create a
smooth approximation.
Density Estimation and Histograms
Let b denote the bin-width. The histogram estimate at a point x, from a random sample of size n, is given by
\[ \hat{f}_H(x; b) = \frac{\text{number of observations in bin containing } x}{n b} \]
Two choices have to be made when constructing a histogram:
• Positioning of the bin edges
• Bin-width
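As a rough illustration of the histogram estimator and of the first choice above, the following R sketch evaluates the estimate by hand at one point under two bin-edge positions. The helper name hist_density, its origin argument, and the left-closed bins [a, b) are assumptions made for illustration; they are not part of the slides.

```r
# Hypothetical helper: histogram density estimate at a point x0.
# 'b' is the bin-width; 'origin' fixes where the bin edges sit.
hist_density <- function(x0, data, b, origin = 0) {
  n     <- length(data)
  k     <- floor((x0 - origin) / b)        # index of the bin containing x0
  lower <- origin + k * b
  upper <- lower + b
  count <- sum(data >= lower & data < upper)  # observations in [lower, upper)
  count / (n * b)
}

x <- c(3, 4.5, 5.0, 8, 9)
est_a <- hist_density(5.5, x, b = 2, origin = 2)  # bin [4, 6) holds 4.5 and 5.0
est_b <- hist_density(5.5, x, b = 2, origin = 3)  # bin [5, 7) holds only 5.0
c(est_a, est_b)
```

With the same bin-width, moving the edges by one unit halves the estimate at x = 5.5, which is the sensitivity that kernel smoothing is meant to remove.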
KDE – Smoothing the Histogram
Let X 1 ,, X n be a random sample taken from a continuous,
univariate density f. The kernel density estimator is given by,
fˆ ( x; h) 
n
1
K{( x  X i ) h}

n h i 1
 K is a function satisfying  K ( x) dx  1
 The function K is referred to as the kernel.
 h is a positive number, usually called the bandwidth or
window width.
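The estimator can be evaluated directly from its definition. The sketch below assumes a Gaussian kernel and an arbitrary bandwidth h = 1; the helper name kde_at is hypothetical. It also checks numerically that the estimate inherits the unit area of the kernel.

```r
# Hypothetical helper: Gaussian-kernel density estimate at a single point x0,
# i.e. (1 / (n h)) * sum_i K((x0 - X_i) / h) with K = dnorm.
kde_at <- function(x0, data, h) {
  sum(dnorm((x0 - data) / h)) / (length(data) * h)
}

x <- c(3, 4.5, 5.0, 8, 9)
f5 <- kde_at(5, x, h = 1)   # estimate at x = 5

# Integrate the estimate over a range wide enough to cover the tails
area <- integrate(function(u) sapply(u, kde_at, data = x, h = 1),
                  lower = -10, upper = 25)$value
area
```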
Kernels
Commonly used kernels (refer to Table 2.1, Wand and Jones, page 31):
• Gaussian
• Epanechnikov
• Rectangular
• Triangular
• Biweight
• Uniform
• Cosine

… most unimodal densities perform about the same as each other when used as a kernel.

• Typically K is chosen to be a unimodal PDF.
• Use the Gaussian kernel.

Wand, M.P. and Jones, M.C. (1995), Kernel Smoothing, Monographs on Statistics and Applied Probability 60, Chapman and Hall/CRC, 212 pp.
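To make the list concrete, the sketch below writes several of these kernels as R functions and checks that each integrates to 1. The formulas are the common textbook forms on the support [-1, 1] and are an assumption here, since the slides do not reproduce Table 2.1; in particular, the cosine kernel is taken to be (pi/4) cos(pi u / 2), which is only one of the definitions in use.

```r
# Kernels as functions of the standardised distance u (assumed textbook forms)
kernels <- list(
  gaussian     = function(u) dnorm(u),
  epanechnikov = function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0),
  uniform      = function(u) ifelse(abs(u) <= 1, 0.5, 0),
  triangular   = function(u) ifelse(abs(u) <= 1, 1 - abs(u), 0),
  biweight     = function(u) ifelse(abs(u) <= 1, (15 / 16) * (1 - u^2)^2, 0),
  cosine       = function(u) ifelse(abs(u) <= 1, (pi / 4) * cos(pi * u / 2), 0)
)

# Integrate each kernel over its support: the real line for the Gaussian,
# [-1, 1] for the compactly supported kernels
areas <- sapply(names(kernels), function(nm) {
  lim <- if (nm == "gaussian") Inf else 1
  integrate(kernels[[nm]], -lim, lim)$value
})
round(areas, 4)
```

R's built-in density() accepts a kernel argument with similar names, but it rescales each kernel so that the bandwidth is its standard deviation, so its bandwidths are not directly comparable with the unscaled forms above.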
KDE – Based on Five Observations
Kernel density estimate constructed
using five observations with the
kernel chosen to be the N(0,1)
density.
x=c(3, 4.5, 5.0, 8, 9)
[Figure: "Density of X". x-axis label: N = 5, Bandwidth = 1.195; y-axis: Density.]
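A figure of this kind can be reproduced, approximately, with R's built-in density(). The bandwidth 1.195 is copied from the figure label; the slides do not say how it was chosen, so passing it explicitly is an assumption of this sketch.

```r
x <- c(3, 4.5, 5.0, 8, 9)

# Gaussian kernel with the bandwidth reported under the figure
d <- density(x, bw = 1.195, kernel = "gaussian")

plot(d, main = "Density of X")
rug(x)   # mark the five observations on the x-axis
```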
Histogram - Positioning of Bin Edges
x=c(3, 4.5, 5.0, 8, 9)
hist(x,right=T,freq=F), R-default: bins (a,b], right closed (left-open)
hist(x,right=F,freq=F): bins [a,b), left closed (right-open)
[Figure: two panels, one per call, each titled "Histogram of x". x-axis: x; y-axis: Density.]
Area=1
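The effect of the edge convention can be inspected without plotting. In the sketch below the breaks 2, 4, ..., 10 are passed explicitly; they are an assumption chosen to match the axis ticks in the figure rather than something stated on the slide.

```r
x <- c(3, 4.5, 5.0, 8, 9)
edges <- seq(2, 10, by = 2)   # assumed bin edges

h_right <- hist(x, breaks = edges, right = TRUE,  plot = FALSE)  # bins (a, b]
h_left  <- hist(x, breaks = edges, right = FALSE, plot = FALSE)  # bins [a, b)

h_right$counts   # the observation 8 is counted in (6, 8]
h_left$counts    # the observation 8 is counted in the last bin instead

# Either way the bars enclose unit area
sum(h_right$density * diff(h_right$breaks))
sum(h_left$density  * diff(h_left$breaks))
```

Only the observation lying exactly on an edge moves, yet with five data points that is enough to empty one bin and double another.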
Histogram - Bin Width
hist(x,breaks=5,right=F,prob=T)
hist(x,breaks=2,right=F,prob=T)
[Figure: two panels, one per call, each titled "Histogram of x". x-axis: x; y-axis: Density.]
Area=1
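A short sketch of the same comparison without the plots. Note that a single number passed as breaks is treated by hist() as a suggestion only, so the bin edges actually used are read back from the returned object; the helper name hist_area is hypothetical.

```r
x <- c(3, 4.5, 5.0, 8, 9)

# Hypothetical helper: total area of the histogram bars
hist_area <- function(h) sum(h$density * diff(h$breaks))

h5 <- hist(x, breaks = 5, right = FALSE, plot = FALSE)  # narrower bins
h2 <- hist(x, breaks = 2, right = FALSE, plot = FALSE)  # wider bins

h5$breaks; h5$density
h2$breaks; h2$density

c(hist_area(h5), hist_area(h2))   # both areas equal 1
```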
KDE – Numerical Implementation
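kde <- function(x, h, npt = 100)
{
  # Evaluation grid: the data range padded by 10% on each side
  r <- max(x) - min(x); xmax <- max(x) + 0.1*r; xmin <- min(x) - 0.1*r
  n <- length(x)
  xgrid <- seq(from = xmin, to = xmax, length.out = npt)
  f <- numeric(npt)
  for (i in seq_len(npt)) {
    tmp <- numeric(n)
    for (ii in seq_len(n)) {
      z <- (xgrid[i] - x[ii]) / h   # scaled distance from grid point to observation
      tmp[ii] <- dnorm(z)           # Gaussian kernel K(z)
    }
    f[i] <- sum(tmp)                # sum of the n kernel contributions
  }
  f <- f / (n * h)                  # scale by 1/(nh)
  # Adds the curve to an existing plot, so call plot() or hist() first
  lines(xgrid, f, col = "grey")
} # end function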
\[ \hat{f}(x; h) = \frac{1}{n h} \sum_{i=1}^{n} K\{(x - X_i)/h\} \]
Variable description (formula = code)
x = xgrid
X = x
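The double loop over grid points and observations can also be written without explicit loops. The following is a sketch of one vectorised alternative, not the implementation used in the slides; the function name kde_grid is hypothetical, and it returns the grid and the estimate instead of drawing on an existing plot.

```r
# Hypothetical vectorised variant: same grid and Gaussian kernel,
# with the two loops replaced by outer() and rowSums().
kde_grid <- function(x, h, npt = 100) {
  r     <- diff(range(x))
  xgrid <- seq(min(x) - 0.1 * r, max(x) + 0.1 * r, length.out = npt)
  z     <- outer(xgrid, x, "-") / h      # npt-by-n matrix of (xgrid[i] - x[ii]) / h
  list(x = xgrid, y = rowSums(dnorm(z)) / (length(x) * h))
}

x   <- c(3, 4.5, 5.0, 8, 9)
est <- kde_grid(x, h = 1.195)
plot(est$x, est$y, type = "l", xlab = "x", ylab = "Density")
```

Returning the values makes the estimate easy to check or reuse, at the cost of holding an npt-by-n matrix in memory.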
Bandwidth Estimators

Optimal Smoothing

Normal Optimal Smoothing

Cross-validation

Plug-in bandwidths
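Base R ships selectors that correspond, loosely, to the last three items above. The mapping in this sketch is an assumption: bw.nrd() is a normal-reference rule, bw.ucv() is unbiased (least-squares) cross-validation, and bw.SJ() is the Sheather-Jones plug-in method. The simulated sample is hypothetical and is used only because these selectors are unreliable on very small data sets.

```r
set.seed(1)
x <- rnorm(200)   # hypothetical sample from N(0, 1)

bw <- c(
  normal_reference = bw.nrd(x),  # 1.06 * min(sd, IQR / 1.34) * n^(-1/5)
  cross_validation = bw.ucv(x),  # unbiased cross-validation
  plug_in          = bw.SJ(x)    # Sheather-Jones plug-in
)
bw

# Any of these values can be passed on to density()
d <- density(x, bw = bw[["plug_in"]])
```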