Histograms
[Figure: "Histogram of exp" with bin widths h = 0.1, 0.5, and 3; x-axis: exp (0 to 5), y-axis: Density (0.0 to 1.0)]
Theoretically
The simplest form of histogram uses equal-width bins
$$B_j = [(j-1)h,\ jh), \qquad j = 1, 2, \ldots$$
The density on $B_j$ is estimated by $\hat f_n(x) = \nu_j/(nh)$, where $\nu_j$ is the number of observations falling in $B_j$.
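A minimal R sketch of such a histogram (the Exp(1) sample and h are illustrative, matching the figures):

```r
# Histogram density estimate with fixed bin width h
# (sample and h are illustrative)
set.seed(1)
exp_sample <- rexp(100)
h <- 0.5
breaks <- seq(0, max(exp_sample) + h, by = h)   # bins B_j = [(j-1)h, jh)
hist(exp_sample, breaks = breaks, freq = FALSE,
     main = "Histogram of exp", xlab = "exp")
```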
Some asymptotics
Fact: If $X \sim \mathrm{Po}(\mu)$ then for large $\mu$, $X$ is approximately $N(\mu, \mu)$; equivalently, $\sqrt{X}$ is approximately $N(\sqrt{\mu},\ 1/4)$.
Suppose we have m bins in a histogram. Then
$$\left( \left( \max\left\{ \sqrt{\hat f_n(x)} - c,\ 0 \right\} \right)^2,\ \left( \sqrt{\hat f_n(x)} + c \right)^2 \right)$$
is approximately a 1-α CI for f(x), where
$$c = \frac{z_{\alpha/(2m)}}{2\sqrt{nh}}.$$
Risk
When looking at parametric estimators we often compare the MSE. When estimating a function, we want the estimator to be good everywhere, so we integrate the mean squared error. The loss function is the integrated squared error
$$L(f, \hat f) = \int \left( \hat f(x) - f(x) \right)^2 dx,$$
and the risk is its expectation,
$$R(f, \hat f) = \mathbb{E}\, L(f, \hat f) = \int \mathrm{mse}\!\left( \hat f(x) \right) dx.$$
Pick h to minimize the risk.
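When the true f is known, the risk can be approximated by Monte Carlo; a sketch assuming an Exp(1) truth and an illustrative grid of bin widths:

```r
# Monte Carlo estimate of the risk of histogram estimators of the
# Exp(1) density over a grid of bin widths h (setup is illustrative)
set.seed(1)
grid <- seq(0.01, 6, by = 0.01)          # integration grid
hs <- c(0.1, 0.25, 0.5, 1, 2)
risk_hat <- sapply(hs, function(h) {
  ise <- replicate(200, {
    x <- rexp(100)
    breaks <- seq(0, max(x) + h, by = h)
    hst <- hist(x, breaks = breaks, plot = FALSE)
    # evaluate the histogram estimate on the grid
    f_hat <- hst$density[findInterval(grid, breaks, rightmost.closed = TRUE)]
    f_hat[is.na(f_hat)] <- 0             # grid points beyond the last bin
    sum((f_hat - dexp(grid))^2) * 0.01   # Riemann sum of the squared error
  })
  mean(ise)                              # average ISE approximates the risk
})
setNames(risk_hat, hs)
```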
Density estimation
Estimate the cdf F(x) by the empirical cdf $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \le x)$. Since $f = F'$, a difference quotient gives a density estimate:
$$\hat f(x) = \frac{F_n(x+h) - F_n(x-h)}{2h}.$$
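A sketch of this estimator via R's ecdf() (sample and h are illustrative):

```r
# Density estimate from a difference quotient of the empirical cdf
set.seed(1)
x <- rexp(100)
Fn <- ecdf(x)
h <- 0.25
f_hat <- function(t) (Fn(t + h) - Fn(t - h)) / (2 * h)
curve(f_hat, from = 0, to = 6, n = 501, ylab = "Density")
curve(dexp, add = TRUE, lty = 2)   # true Exp(1) density for comparison
```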
Histogram confidence set revisited
We have, writing $\nu_j$ for the count in bin $B_j$,
$$\nu_j \approx n p_j + \sqrt{n p_j}\, Z_j, \qquad p_j = \int_{B_j} f(u)\, du,$$
where $Z_1, \ldots, Z_m \sim N(0,1)$. The histogram estimates a discretized version of f, say
$$\bar f(x) = \frac{p_j}{h} \quad \text{for } x \in B_j.$$
Let
$$\ell(x) = \left( \max\left\{ \sqrt{\hat f_n(x)} - c,\ 0 \right\} \right)^2 \quad \text{and} \quad u(x) = \left( \sqrt{\hat f_n(x)} + c \right)^2.$$
Denote by $z_{\alpha/(2m)}$ the upper $\alpha/(2m)$ standard normal quantile, and set $c = z_{\alpha/(2m)} / (2\sqrt{nh})$. Use the square-root transform, under which $\sqrt{\hat f_n(x)}$ is approximately $N\big( \sqrt{\bar f(x)},\ 1/(4nh) \big)$, and a Bonferroni bound over the m bins to get
$$P\left( \ell(x) \le \bar f(x) \le u(x) \ \text{for all } x \right) \ge 1 - \alpha \quad \text{(approximately)}.$$
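A sketch of the resulting band for an exponential sample, under the construction above (sample, h, and α are illustrative):

```r
# Approximate 1 - alpha confidence band for the discretized density
set.seed(1)
x <- rexp(100); n <- length(x)
h <- 0.5
breaks <- seq(0, max(x) + h, by = h)
m <- length(breaks) - 1                  # number of bins
hst <- hist(x, breaks = breaks, plot = FALSE)
f_hat <- hst$density
alpha <- 0.05
c_alpha <- qnorm(1 - alpha / (2 * m)) / (2 * sqrt(n * h))
lower <- pmax(sqrt(f_hat) - c_alpha, 0)^2
upper <- (sqrt(f_hat) + c_alpha)^2
plot(hst, freq = FALSE, ylim = c(0, max(upper)),
     main = "Histogram of exp", xlab = "exp")
segments(breaks[-length(breaks)], lower, breaks[-1], lower, col = "blue")
segments(breaks[-length(breaks)], upper, breaks[-1], upper, col = "blue")
```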
Confidence band for the exponential histogram
[Figure: "Histogram of exp" with the confidence band overlaid; x-axis: exp (0 to 5), y-axis: Density (0.0 to 1.0)]
The exponential sample
[Figure: "Empirical density" of the exponential sample; x-axis: x (0 to 6), y-axis: Density (0.0 to 1.0)]
Smoothing
The idea of smoothing is to replace an observation at x with a smooth local kernel function K(x) ≥ 0. The kernels should satisfy
$$\int K(x)\, dx = 1, \qquad \int x K(x)\, dx = 0, \qquad \int x^2 K(x)\, dx < \infty.$$
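The resulting kernel density estimate is $\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right)$; a direct implementation sketch with a Gaussian kernel:

```r
# Kernel density estimate built directly from the definition
set.seed(1)
x <- rexp(100); n <- length(x)
h <- 0.4
K <- dnorm                          # Gaussian kernel
f_hat <- function(t)
  sapply(t, function(ti) sum(K((ti - x) / h)) / (n * h))
curve(f_hat, from = 0, to = 6, n = 401, ylab = "Density")
```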
Kernels
[Figure: four standard kernels plotted on (-3, 3) with y-axis (0, 1): Gaussian, Epanechnikov, Biweight, and Rectangular]
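A sketch reproducing the four panels from the standard kernel formulas:

```r
# The four kernels from the figure, plotted on (-3, 3)
kernels <- list(
  Gaussian     = function(u) dnorm(u),
  Epanechnikov = function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0),
  Biweight     = function(u) ifelse(abs(u) <= 1, 15/16 * (1 - u^2)^2, 0),
  Rectangular  = function(u) ifelse(abs(u) <= 1, 0.5, 0))
par(mfrow = c(2, 2))
for (nm in names(kernels))
  curve(kernels[[nm]](x), from = -3, to = 3, ylim = c(0, 1),
        main = nm, ylab = "")
par(mfrow = c(1, 1))
```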
Kernel density estimates
[Figure: kernel density estimates of the exponential sample using the Gaussian, Epanechnikov, Biweight, and Rectangular kernels; N = 100, bandwidth = 0.4082 in each panel]
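The panels can be reproduced with R's density(), which accepts all four kernels by name (the sample is illustrative):

```r
# Kernel density estimates with four kernels, default bandwidth
set.seed(1)
x <- rexp(100)
par(mfrow = c(2, 2))
for (k in c("gaussian", "epanechnikov", "biweight", "rectangular"))
  plot(density(x, kernel = k), main = k)
par(mfrow = c(1, 1))
```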
Choice of kernel and bandwidth
The kernel is not very important (but better if it is smooth). The bandwidth matters a lot. Standard methods:
(a) Based on f being Gaussian:
$h = 0.9\, \hat\sigma\, n^{-1/5}$ (R default, Silverman's rule)
$h = 1.06\, \hat\sigma\, n^{-1/5}$ (Scott's rule)
(b) Based on estimating $f''$ (Sheather and Jones).
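All four selectors compared in the next figure ship with base R's stats package; a sketch comparing them on one sample:

```r
# R's bandwidth selectors on the same sample
set.seed(1)
x <- rexp(100)
c(silverman = bw.nrd0(x),   # 0.9 * min(sd, IQR/1.34) * n^(-1/5), R default
  scott     = bw.nrd(x),    # 1.06 * min(sd, IQR/1.34) * n^(-1/5)
  bcv       = bw.bcv(x),    # biased cross-validation
  sj        = bw.SJ(x))     # Sheather-Jones
```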
Bandwidth differences
[Figure: density estimates of the exponential sample (N = 100) under four bandwidth selectors: Scott's rule of thumb (bandwidth 0.4807), Silverman's rule of thumb (0.4082), biased cross-validation (0.5171), and Sheather and Jones (0.2131)]
Mexican stamps
An 1872 stamp series issued by Mexico. The thickness of the paper affects the value of these stamps.
Why clusters?
There were at least two different paper providers (handmade paper).
The amount of paper in a stack was determined by weight, so the manufacturer would have some extra-thick or extra-thin sheets sitting around to get the weight right.
Our data set has 485 thickness determinations from a stamp collection.
Histogram and density
We are hunting for bumps in the density (clusters of paper types).
[Figure: "Histogram of thickness" with a density estimate overlaid; x-axis: thickness (0.06 to 0.14), y-axis: Density (0 to 40)]
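A sketch of the bump hunt, assuming the 485-observation thickness data available as stamp in the bootstrap package (any vector of thicknesses would do):

```r
# Bump hunting on stamp thickness (data set name is an assumption:
# the `stamp` data frame from the `bootstrap` package)
library(bootstrap)
thickness <- stamp$Thickness
hist(thickness, breaks = 40, freq = FALSE,
     main = "Histogram of thickness", xlab = "thickness")
d <- density(thickness, bw = "SJ")    # Sheather-Jones bandwidth
lines(d, col = "blue")
# Count the modes: local maxima of the estimated density
sum(diff(sign(diff(d$y))) == -2)
```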
Possible model
If there are M bumps, consider a mixture of normals:
$$f(x) = \sum_{j=1}^{M} w_j\, \phi\!\left( \frac{x - \mu_j}{\sigma_j} \right) \frac{1}{\sigma_j}, \qquad w_j \ge 0,\ \sum_{j=1}^{M} w_j = 1.$$
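A sketch of evaluating such a mixture; the weights, means, and standard deviations below are hypothetical placeholders on the thickness scale, not fitted values:

```r
# Normal mixture density (parameters are hypothetical, not fitted)
dmix <- function(x, w, mu, sigma)
  rowSums(sapply(seq_along(w),
                 function(j) w[j] * dnorm(x, mu[j], sigma[j])))
w     <- c(0.4, 0.35, 0.25)
mu    <- c(0.072, 0.080, 0.100)
sigma <- c(0.003, 0.004, 0.006)
curve(dmix(x, w, mu, sigma), from = 0.05, to = 0.14,
      ylab = "Density", xlab = "thickness")
```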
Assumptions matter!
Izenman & Sommer (J Amer Stat Assoc, 1988) find 7 modes using a nonparametric approach, and 3 using a parametric normal mixture model.
Other authors find between 2 and 10 modes in the same data set.
We cannot simply go back and look at the stamps: the collection has been sold.