Introduction to Nonparametric Statistics
Kernel Density Estimation

Nonparametric Statistics
- Fewer restrictive assumptions about the data and the underlying probability distributions.
- Population distributions may be skewed and multi-modal.

Kernel Density Estimation (KDE)
Kernel density estimation (KDE) is a nonparametric technique for density estimation in which a known density function (the kernel) is averaged across the observed data points to create a smooth approximation.

Density Estimation and Histograms
Let $b$ denote the bin width. The histogram estimate at a point $x$ from a random sample of size $n$ is given by
$$\hat{f}_H(x; b) = \frac{\text{number of observations in bin containing } x}{n b}$$
Two choices have to be made when constructing a histogram:
- Positioning of the bin edges
- Bin width

KDE – Smoothing the Histogram
Let $X_1, \ldots, X_n$ be a random sample taken from a continuous, univariate density $f$. The kernel density estimator is given by
$$\hat{f}(x; h) = \frac{1}{n h} \sum_{i=1}^{n} K\{(x - X_i)/h\}$$
- $K$ is a function satisfying $\int K(x)\,dx = 1$. The function $K$ is referred to as the kernel.
- $h$ is a positive number, usually called the bandwidth or window width.

Kernels
- Gaussian
- Epanechnikov
- Rectangular
- Triangular
- Biweight
- Uniform
- Cosine
Refer to Table 2.1 in Wand and Jones, page 31. … most unimodal densities perform about the same as each other when used as a kernel.
Typically $K$ is chosen to be a unimodal PDF. Use the Gaussian kernel.

Wand, M.P. and M.C. Jones (1995), Kernel Smoothing, Monographs on Statistics and Applied Probability 60, Chapman and Hall/CRC, 212 pp.

KDE – Based on Five Observations
Kernel density estimate constructed using five observations, with the kernel chosen to be the N(0,1) density.
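The histogram estimator and the kernel density estimator defined above can both be evaluated directly from their formulas. The R sketch below does this for the five observations used on this slide. The helper names `hist_at` and `kde_at`, the bin edges `seq(2, 10, by = 2)`, and the bandwidth `h = 1` are illustrative assumptions, not values taken from the slides.

```r
# Five observations used in the slides
x <- c(3, 4.5, 5.0, 8, 9)

# Histogram estimate at a point x0: (count in the bin containing x0) / (n * b).
# Assumes equal-width, left-closed bins [a, b); 'breaks' holds the bin edges.
hist_at <- function(x0, data, breaks) {
  b <- diff(breaks)[1]                 # common bin width
  bin <- findInterval(x0, breaks)      # index of the bin containing x0
  sum(findInterval(data, breaks) == bin) / (length(data) * b)
}

# Kernel density estimate with a Gaussian kernel:
# f_hat(x0; h) = 1 / (n * h) * sum_i K((x0 - X_i) / h), with K = dnorm
kde_at <- function(x0, data, h) {
  sapply(x0, function(p) sum(dnorm((p - data) / h)) / (length(data) * h))
}

breaks <- seq(2, 10, by = 2)           # illustrative bin edges (assumption)
h <- 1                                 # illustrative bandwidth (assumption)

f_hist <- hist_at(4.2, x, breaks)      # histogram estimate at x0 = 4.2
f_kde  <- kde_at(4.2, x, h)            # kernel estimate at the same point

# Evaluate the KDE on a grid and check that it integrates to roughly 1
xgrid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)
fhat  <- kde_at(xgrid, x, h)
area  <- sum(fhat) * (xgrid[2] - xgrid[1])
```

Because the kernel itself integrates to 1, the estimate does too, up to the small tail mass that falls outside the evaluation grid. The histogram value depends on where the bin edges sit, while the kernel estimate depends only on the data and `h`.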
x = c(3, 4.5, 5.0, 8, 9)
[Figure: "Density of X". Kernel density estimate; y-axis: Density; N = 5, Bandwidth = 1.195]

Histogram - Positioning of Bin Edges
x = c(3, 4.5, 5.0, 8, 9)
[Figure: two panels, each titled "Histogram of x"; x-axis: x; y-axis: Density; Area = 1]
- Left panel: hist(x, right = TRUE, freq = FALSE), the R default, with bins (a, b] right-closed (left-open)
- Right panel: hist(x, right = FALSE, freq = FALSE), with bins [a, b) left-closed (right-open)

Histogram - Bin Width
[Figure: two panels, each titled "Histogram of x"; x-axis: x; y-axis: Density; Area = 1]
- Left panel: hist(x, breaks = 5, right = FALSE, probability = TRUE)
- Right panel: hist(x, breaks = 2, right = FALSE, probability = TRUE)
When breaks is a single number, R treats it as a suggestion and chooses "pretty" break points, so the number of bins drawn can differ from the number requested.

KDE – Numerical Implementation

    kde <- function(x, h, npt = 100) {
      # Evaluation grid: the data range extended by 10% on each side
      r <- max(x) - min(x)
      xmin <- min(x) - 0.1 * r
      xmax <- max(x) + 0.1 * r
      n <- length(x)
      xgrid <- seq(from = xmin, to = xmax, length.out = npt)
      f <- numeric(npt)
      for (i in seq_len(npt)) {
        # Sum the Gaussian kernels centred at each observation, evaluated at xgrid[i]
        z <- (xgrid[i] - x) / h
        f[i] <- sum(dnorm(z))
      }
      f <- f / (n * h)
      # Add the estimate to an existing plot, e.g. one drawn by hist()
      lines(xgrid, f, col = "grey")
    } # end function

$$\hat{f}(x; h) = \frac{1}{n h} \sum_{i=1}^{n} K\{(x - X_i)/h\}$$
Variable description: $x$ in the formula corresponds to xgrid in the code, and $X_i$ corresponds to x.

Bandwidth Estimators
- Optimal Smoothing
- Normal Optimal Smoothing
- Cross-validation
- Plug-in bandwidths
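
As a hedged illustration of the bandwidth estimators listed above, base R's stats package provides several selectors. Matching them to the slide's terms is an assumption on my part: `bw.nrd` is taken here as a normal-reference rule, `bw.ucv` as cross-validation, and `bw.SJ` as a plug-in bandwidth. The simulated bimodal sample is also an assumption, chosen only because the selectors behave more sensibly on a larger sample than on five points.

```r
set.seed(1)
# Illustrative bimodal sample (assumption; not data from the slides)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))

h_rot <- bw.nrd0(x)  # Silverman's rule of thumb, the default in density()
h_nrd <- bw.nrd(x)   # normal reference rule with the factor 1.06
h_ucv <- bw.ucv(x)   # unbiased (least-squares) cross-validation
h_sj  <- bw.SJ(x)    # Sheather-Jones plug-in bandwidth

bandwidths <- c(nrd0 = h_rot, nrd = h_nrd, ucv = h_ucv, SJ = h_sj)
print(round(bandwidths, 3))

# Kernel density estimates under two of the selected bandwidths
d_nrd <- density(x, bw = h_nrd, kernel = "gaussian")
d_sj  <- density(x, bw = h_sj, kernel = "gaussian")
```

The two rule-of-thumb selectors differ only by a constant factor (0.9 versus 1.06) applied to the same scale estimate, so they always rank the same way. The cross-validation and plug-in selectors use the shape of the data, which matters most for multi-modal densities like this one.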