A cross-validation method for choosing the
pilot bandwidth in kernel density estimation
J.E. Chacón, J. Montanero, A.G. Nogales and P. Pérez
Dpto. de Matemáticas, Universidad de Extremadura, Avda. de Elvas, s/n, E-06071
Badajoz, Spain.
jechacon@unex.es, jmf@unex.es, nogales@unex.es, paloma@unex.es
Summary. Bootstrap bandwidth selection in kernel density estimation requires the
use of a pilot bandwidth. Typically, this pilot bandwidth is chosen according to some
asymptotic criterion, which also needs another pilot bandwidth for this second stage.
In contrast, our proposal is based on a non-asymptotic minimum variance unbiased
estimator of the error criterion for the pilot bandwidth.
Key words: kernel density estimation, bandwidth selection, bootstrap, pilot bandwidth, cross-validation
1 Introduction
The problem considered here is that of estimating a density function $f$ from a real-valued sample $X_1, \dots, X_n$ drawn from the probability distribution with density $f$. We are going to consider here the kernel density estimator
$$f_{n,K,h}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i),$$
where the kernel $K$ is an integrable symmetric function with $\int K = 1$, the bandwidth $h$ is a positive real number and we have used the rescaled-kernel notation $K_h(x) = K(x/h)/h$. We are interested in an automatic way of choosing the bandwidth $h$, as this parameter controls the amount of smoothing in the kernel estimator (see, e.g., [Sil86]). The $L_2$-error criterion will be used; that is, we will measure the error of the estimate $f_{n,K,h}$ through the mean integrated squared error (MISE), defined by
$$\mathrm{MISE}_n(h) = E \int [f_{n,K,h}(x) - f(x)]^2 \, dx.$$
Thus, henceforth we will assume that both f and K are square integrable functions.
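The estimator and the squared-error criterion above can be sketched in a few lines. The following is only an illustration under stated assumptions: the kernel is taken to be Gaussian, the integral is approximated by a Riemann sum on a uniform grid, and the names `kde`, `gaussian_kernel` and `integrated_squared_error` are hypothetical, not part of the original text.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Evaluate f_{n,K,h}(x) = (1/n) * sum_i K_h(x - X_i), with K_h(u) = K(u/h)/h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    if h <= 0:
        raise ValueError("the bandwidth h must be positive")
    u = (x[:, None] - sample[None, :]) / h   # shape (len(x), n)
    return kernel(u).mean(axis=1) / h

def integrated_squared_error(sample, h, true_density, grid):
    """Riemann approximation of int [f_{n,K,h}(x) - f(x)]^2 dx on a uniform grid."""
    diff = kde(grid, sample, h) - true_density(grid)
    return float(np.sum(diff**2) * (grid[1] - grid[0]))

# One simulated sample from a standard Gaussian density f.
rng = np.random.default_rng(0)
sample = rng.standard_normal(100)
grid = np.linspace(-6.0, 6.0, 1201)
ise = integrated_squared_error(sample, 0.4, gaussian_kernel, grid)
```

Averaging `ise` over many simulated samples would give a Monte Carlo approximation of $\mathrm{MISE}_n(h)$ at the chosen bandwidth.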
Our goal is to propose a new bandwidth selector; that is, an estimator of the
optimal bandwidth $h_{0n} := \operatorname{argmin}_{h>0} \mathrm{MISE}_n(h)$. To this end, we must propose an estimator $\widehat{\mathrm{MISE}}_n(h)$ of the MISE function and then define the selector as the minimizer of this criterion function $\widehat{\mathrm{MISE}}_n(h)$.
It is easy to show (see [Cha04]) that the MISE function can be written as
$$\mathrm{MISE}_n(h) = R(f) + \frac{R(K)}{nh} + R_{\tilde K,h}(f),$$
where we have used the notations
$$R(K) = \int K^2, \qquad R(f) = \int f^2, \qquad R_{\tilde K,h}(f) = \int (\tilde K_h * f) f, \qquad \tilde K = \frac{n-1}{n}(K * K) - 2K,$$
with $*$ standing for the convolution operator. Therefore, the optimal bandwidth can be defined as the minimizer of the function
$$M_n(h) = \mathrm{MISE}_n(h) - R(f) = \frac{R(K)}{nh} + R_{\tilde K,h}(f),$$
which depends on the unknown density $f$ only through the functional $R_{\tilde K,h}(f)$.
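As a worked illustration of the function $M_n(h)$, the sketch below evaluates it in closed form in the special case $f = K = \phi$ (standard Gaussian), which is an assumption made only for this example. It relies on the standard identity $\phi_a * \phi_b = \phi_{\sqrt{a^2+b^2}}$; the names `phi0` and `M_n_gaussian` are hypothetical, and the grid search is a crude stand-in for a proper numerical minimizer.

```python
import numpy as np

def phi0(s):
    """Value at 0 of the N(0, s^2) density: phi_s(0) = 1 / (s * sqrt(2*pi))."""
    return 1.0 / (s * np.sqrt(2.0 * np.pi))

def M_n_gaussian(h, n):
    """M_n(h) = R(K)/(nh) + R_{K~,h}(f) when f = K = phi (standard Gaussian).

    With K~ = ((n-1)/n) K*K - 2K and phi_a * phi_b = phi_{sqrt(a^2+b^2)}:
      R(K)                        = 1 / (2*sqrt(pi)),
      int (phi_{sqrt(2)h} * f) f  = phi_{sqrt(2h^2+2)}(0),
      int (phi_h * f) f           = phi_{sqrt(h^2+2)}(0).
    """
    h = np.asarray(h, dtype=float)
    r_k = 1.0 / (2.0 * np.sqrt(np.pi))
    r_tilde = ((n - 1) / n * phi0(np.sqrt(2 * h**2 + 2))
               - 2.0 * phi0(np.sqrt(h**2 + 2)))
    return r_k / (n * h) + r_tilde

# Grid search for the exact optimal bandwidth h_{0n} in this Gaussian case.
n = 100
h_grid = np.linspace(0.05, 1.5, 2901)
values = M_n_gaussian(h_grid, n)
h_opt = h_grid[np.argmin(values)]
mise_opt = values.min() + 1.0 / (2.0 * np.sqrt(np.pi))  # add back R(f)
```

Because $R(f)$ does not depend on $h$, minimizing `M_n_gaussian` and minimizing the MISE give the same bandwidth.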
2 Bootstrap bandwidth selection
The bootstrap approach consists of estimating $M_n(h)$ by
$$M^*_{n,L,g}(h) := \frac{R(K)}{nh} + R_{\tilde K,h}(f_{n,L,g}),$$
where $f_{n,L,g}$ is another kernel estimator, with possibly different kernel $L$ and pilot bandwidth $g$ (see, e.g., [Cao90] or [CMN04] for a general formulation of the bootstrap method), and taking the bootstrap bandwidth selector as the minimizer of the criterion $M^*_{n,L,g}(h)$. However, this approach raises the new problem of choosing the pilot bandwidth $g$. So, following the steps for selecting $h$, we must evaluate the mean square error (MSE) function
$$\mathrm{MSE}_n(h, g) = E\{M^*_{n,L,g}(h) - M_n(h)\}^2$$
and try to estimate the optimal pilot bandwidth, $g_{1n}(h)$, which minimizes $\mathrm{MSE}_n(h, g)$ as a function of $g$. Notice that, using this methodology, the pilot bandwidth must necessarily be local, as $M_n \notin L_p$ for any $p \ge 1$.
The plug-in estimator of the functional $R_{\tilde K,h}(f)$ can be written as
$$R_{\tilde K,h}(f_{n,L,g}) = \frac{1}{n^2} \sum_{i,j=1}^{n} (\tilde K_h * \bar L_g)(X_i - X_j),$$
with $\bar L = L * L$. It includes the non-random term $(\tilde K_h * \bar L_g)(0)/n = R_{\tilde K,h}(L_g)/n$. As $E[(\tilde K_h * \bar L_g)(X_1 - X_2)] = R_{\tilde K,h}(L_g * f)$, we have that
$$E[R_{\tilde K,h}(f_{n,L,g})] = \frac{n-1}{n} R_{\tilde K,h}(L_g * f) + \frac{1}{n} R_{\tilde K,h}(L_g).$$
This suggests replacing the bootstrap estimator $M^*_{n,L,g}$ by its no-diagonal version $M^{**}_{n,L,g}(h) = R(K)/(nh) + T_{n,L,g}(h)$, where
$$T_{n,L,g}(h) = \frac{1}{n(n-1)} \sum_{i \ne j} (\tilde K_h * \bar L_g)(X_i - X_j),$$
which satisfies $E[T_{n,L,g}(h)] = R_{\tilde K,h}(L_g * f)$; moreover, $T_{n,L,g}(h)$ is the minimum variance unbiased estimator (MVUE) of $R_{\tilde K,h}(L_g * f)$ for every fixed $h > 0$. Related techniques for a similar problem may be found in [HM87] and [JS91].
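The difference between the plug-in functional and its no-diagonal version can be made concrete with a small sketch. It assumes $K = L = \phi$ (an assumption for this example only), in which case $(\tilde K_h * \bar L_g)$ is a combination of two Gaussian densities with variances $2h^2 + 2g^2$ and $h^2 + 2g^2$. The function names are hypothetical.

```python
import numpy as np

def phi(x, s):
    """Density of N(0, s^2) evaluated at x."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def phi_hg(x, h, g, n):
    """(K~_h * Lbar_g)(x) for K = L = phi, with K~ = ((n-1)/n) K*K - 2K."""
    return ((n - 1) / n * phi(x, np.sqrt(2 * h**2 + 2 * g**2))
            - 2.0 * phi(x, np.sqrt(h**2 + 2 * g**2)))

def plug_in_and_no_diagonal(sample, h, g):
    """Return (R_{K~,h}(f_{n,L,g}), T_{n,L,g}(h)) for Gaussian K and L."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    vals = phi_hg(x[:, None] - x[None, :], h, g, n)  # all pairs (i, j)
    total = vals.sum()
    diagonal = np.trace(vals)                         # n * (K~_h * Lbar_g)(0)
    plug_in = total / n**2
    no_diag = (total - diagonal) / (n * (n - 1))
    return plug_in, no_diag

rng = np.random.default_rng(1)
sample = rng.standard_normal(50)
h, g = 0.5, 0.3
plug_in, no_diag = plug_in_and_no_diagonal(sample, h, g)
```

The two quantities differ only through the non-random diagonal term: `plug_in` equals `(n-1)/n * no_diag + phi_hg(0, h, g, n)/n`, which mirrors the expectation identity stated above.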
We will denote the optimal pilot bandwidth for $M^{**}_{n,L,g}(h)$ as
$$g_{2n}(h) = \operatorname{argmin}_{g>0} \mathrm{MMSE}_n(h, g),$$
where
$$\mathrm{MMSE}_n(h, g) = E\{M^{**}_{n,L,g}(h) - M_n(h)\}^2.$$
Thus, to choose the pilot bandwidth $g_{2n}(h)$ we must give an estimator of $\mathrm{MMSE}_n(h, g)$. As usual, it is possible to give a squared bias-variance decomposition of the error, $\mathrm{MMSE}_n(h, g) = B_n^2(h, g) + V_n(h, g)$. Next we provide exact expressions for these two terms.
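Theorem 1. If we denote
$$\psi \equiv \psi_{h,g} = \tilde K_h * \bar L_g - \tilde K_h, \qquad \varphi \equiv \varphi_{h,g} = \tilde K_h * \bar L_g,$$
then the bias and variance functions can be expressed as
$$B_n^2(h, g) = \{E[\psi(X_1 - X_2)]\}^2,$$
$$V_n(h, g) = \frac{4(n-2)}{n(n-1)}\,\xi_1 + \frac{2}{n(n-1)}\,\xi_2 - \frac{4n-6}{n(n-1)}\,\xi_0,$$
where
$$\xi_0 = \{E[\varphi(X_1 - X_2)]\}^2, \qquad \xi_1 = E[\varphi(X_1 - X_2)\,\varphi(X_2 - X_3)], \qquad \xi_2 = E[\varphi(X_1 - X_2)^2].$$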
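The variance expression in Theorem 1 can be checked against the classical variance formula for a U-statistic of order 2 (see [Lee90]). The short derivation below is only a sketch of that check; the symbols $\zeta_1$ and $\zeta_2$ are introduced here for illustration and do not appear in the original text. It uses that $\varphi$ is symmetric, which follows from the symmetry of $K$ and $L$, and that $M^{**}_{n,L,g}(h)$ is random only through $T_{n,L,g}(h)$.

$$
\begin{aligned}
T_{n,L,g}(h) &= \binom{n}{2}^{-1} \sum_{i<j} \varphi(X_i - X_j), \\
\operatorname{Var}[T_{n,L,g}(h)] &= \binom{n}{2}^{-1} \{ 2(n-2)\,\zeta_1 + \zeta_2 \}, \\
\zeta_1 &= \operatorname{Cov}[\varphi(X_1 - X_2), \varphi(X_2 - X_3)] = \xi_1 - \xi_0, \\
\zeta_2 &= \operatorname{Var}[\varphi(X_1 - X_2)] = \xi_2 - \xi_0, \\
V_n(h, g) &= \frac{2}{n(n-1)} \{ 2(n-2)(\xi_1 - \xi_0) + (\xi_2 - \xi_0) \}
= \frac{4(n-2)}{n(n-1)}\,\xi_1 + \frac{2}{n(n-1)}\,\xi_2 - \frac{4n-6}{n(n-1)}\,\xi_0 .
\end{aligned}
$$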
$B_n^2(h, g)$ is a regular statistical functional of order 4 (see [Lee90], p. 2). Therefore, its MVUE is given by
$$U_n^{[4]} = \binom{n}{4}^{-1} \sum_{(n,4)} \Psi^{[4]}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}),$$
where the sum $\sum_{(n,k)}$ is taken over all subsets $1 \le i_1 < \dots < i_k \le n$ of $\{1, 2, \dots, n\}$ and where $\Psi^{[4]}(x_1, x_2, x_3, x_4)$ denotes the symmetrization of $\psi(x_1 - x_2)\psi(x_3 - x_4)$; that is,
$$\Psi^{[4]}(x_1, x_2, x_3, x_4) = \frac{1}{4!} \sum_{\tau \in S_4} \psi(x_{\tau(1)} - x_{\tau(2)})\,\psi(x_{\tau(3)} - x_{\tau(4)}),$$
with $S_4$ being the group of permutations of 4 elements.
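A direct, brute-force transcription of this definition may help to fix ideas. The sketch below takes the function $\psi$ as an arbitrary callable; it is only practical for small $n$, since the number of subsets grows like $n^4$. The function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, factorial

def symmetrized_kernel(psi, xs):
    """Psi^[4](x1..x4): average of psi(x_t1 - x_t2) * psi(x_t3 - x_t4) over S_4."""
    total = 0.0
    for t in permutations(range(4)):
        total += psi(xs[t[0]] - xs[t[1]]) * psi(xs[t[2]] - xs[t[3]])
    return total / factorial(4)

def u_statistic_order4(sample, psi):
    """U_n^[4]: average of Psi^[4] over all 4-element subsets of the sample."""
    n = len(sample)
    if n < 4:
        raise ValueError("at least 4 observations are needed")
    total = 0.0
    for idx in combinations(range(n), 4):
        total += symmetrized_kernel(psi, [sample[i] for i in idx])
    return total / comb(n, 4)

# Toy usage with an arbitrary even function in place of psi.
sample = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5]
u4 = u_statistic_order4(sample, lambda d: d * d)
```

Averaging the unsymmetrized product $\psi(x_{i_1} - x_{i_2})\psi(x_{i_3} - x_{i_4})$ over all ordered 4-tuples of distinct indices gives the same value, which is often simpler to code.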
As $\xi_0$, $\xi_1$ and $\xi_2$ are also regular statistical functionals, it is possible to give their MVUEs, $V_n^{[4]}$, $U_n^{[3]}$ and $U_n^{[2]}$, respectively, in a similar way. Collecting all the MVUEs we get the following result:
Theorem 2. The MVUE of $\mathrm{MMSE}_n(h, g)$ for every fixed $h, g > 0$ is given by
$$\mathrm{MCV}_n(h, g) = U_n^{[4]} + \frac{4(n-2)}{n(n-1)}\,U_n^{[3]} + \frac{2}{n(n-1)}\,U_n^{[2]} - \frac{4n-6}{n(n-1)}\,V_n^{[4]}.$$
We call this procedure “a cross-validation method” due to the analogy with the fact that the well-known cross-validation criterion for selecting $h$ is the MVUE of the function $M_n(h)$ (see [Cha04], Chapter 4). However, it is clear that the cross-validation sense is lost in this case.
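The criterion of Theorem 2 can be assembled by brute force for small samples. The sketch below is a minimal illustration, not an efficient implementation: it assumes $K = L = \phi$ (Gaussian kernels), and it computes each U-statistic as the average of the unsymmetrized kernel over all ordered tuples of distinct observations, which coincides with the symmetrized definition. All function names are hypothetical.

```python
from itertools import permutations
import numpy as np

def phi(x, s):
    """Density of N(0, s^2) evaluated at x."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def make_kernels(h, g, n):
    """Return (psi, varphi) of Theorem 1, assuming Gaussian K and L."""
    c = (n - 1) / n
    def k_tilde_h(x):
        return c * phi(x, np.sqrt(2.0) * h) - 2.0 * phi(x, h)
    def varphi(x):
        return (c * phi(x, np.sqrt(2 * h**2 + 2 * g**2))
                - 2.0 * phi(x, np.sqrt(h**2 + 2 * g**2)))
    def psi(x):
        return varphi(x) - k_tilde_h(x)
    return psi, varphi

def u_stat(sample, kernel, order):
    """Average of `kernel` over all ordered tuples of distinct observations."""
    vals = [kernel(*(sample[i] for i in idx))
            for idx in permutations(range(len(sample)), order)]
    return float(np.mean(vals))

def mcv(sample, h, g):
    """Brute-force MCV_n(h, g); cost grows like n^4, so keep n small."""
    n = len(sample)
    psi, vphi = make_kernels(h, g, n)
    u4 = u_stat(sample, lambda a, b, c, d: psi(a - b) * psi(c - d), 4)
    u3 = u_stat(sample, lambda a, b, c: vphi(a - b) * vphi(b - c), 3)
    u2 = u_stat(sample, lambda a, b: vphi(a - b) ** 2, 2)
    v4 = u_stat(sample, lambda a, b, c, d: vphi(a - b) * vphi(c - d), 4)
    denom = n * (n - 1)
    return (u4 + 4 * (n - 2) / denom * u3 + 2 / denom * u2
            - (4 * n - 6) / denom * v4)

rng = np.random.default_rng(0)
sample = list(rng.standard_normal(10))
value = mcv(sample, h=0.5, g=0.4)
```

For realistic sample sizes the order-4 sums would have to be reorganized in terms of pairwise sums; that optimization is outside the scope of this sketch.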
3 Exact calculations
We will give here an exact expression for the function $\mathrm{MMSE}_n(h, g)$ when $f = K = L = \phi$, the density of a standard Gaussian distribution; that is,
$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad x \in \mathbb{R}.$$
Using Theorem 1 above, to give an exact expression for $\mathrm{MMSE}_n(h, g)$ we only need to calculate $B_n(h, g)$ and $\xi_i$ for $i = 0, 1, 2$. If we denote
$$C_0(h, g, n) = R_{\tilde K,h}(L_g * f),$$
$$C_1(h, g, s_1, s_2) = E\big[(\phi_{s_1 h} * \phi_{g\sqrt{2}})(X_1 - X_2)\,(\phi_{s_2 h} * \phi_{g\sqrt{2}})(X_2 - X_3)\big],$$
$$C_2(h, g, s_1, s_2) = E\big[(\phi_{s_1 h} * \phi_{g\sqrt{2}})(X_1 - X_2)\,(\phi_{s_2 h} * \phi_{g\sqrt{2}})(X_1 - X_2)\big],$$
then
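$$B_n^2(h, g) = [C_0(h, g, n) - C_0(h, 0, n)]^2, \qquad \xi_0 = C_0(h, g, n)^2,$$
$$\xi_i = \sum_{j,k=1}^{2} \alpha_j \alpha_k\, C_i(h, g, \sqrt{j}, \sqrt{k})$$
for $i = 1, 2$ and $\alpha_1 = -2$, $\alpha_2 = (n-1)/n$, these being the weights of $\phi_h$ and $\phi_{\sqrt{2}h}$ in $\tilde K_h = \frac{n-1}{n}\phi_{\sqrt{2}h} - 2\phi_h$. As it is possible to obtain explicit formulae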
for the functions $C_0$, $C_1$, $C_2$ using the results in [Ald95], we can compute the exact value of $\mathrm{MMSE}_n(h, g)$ in this case. Figure 1 shows the function $\mathrm{MMSE}_n(h, g)$ for sample size $n = 100$.
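The exact computation can be sketched as follows. The closed forms used here are not quoted from [Ald95]; they are derived for this illustration from standard Gaussian identities (convolution of Gaussian densities, the product of two Gaussian densities, and the value at the origin of a bivariate normal density), and should be read as assumptions of the sketch. The weights are taken from the definition of $\tilde K$, namely $-2$ for $\phi_h$ and $(n-1)/n$ for $\phi_{\sqrt{2}h}$, and the sums run over all pairs of components. Function names are hypothetical.

```python
import numpy as np

def phi0(var):
    """Value at 0 of the N(0, var) density (argument is a variance)."""
    return 1.0 / np.sqrt(2.0 * np.pi * var)

def components(h, g, n):
    """Weights and variances of varphi = sum_j alpha_j * (Gaussian density)."""
    alphas = (-2.0, (n - 1) / n)
    variances = (h**2 + 2 * g**2, 2 * h**2 + 2 * g**2)
    return alphas, variances

def c0(h, g, n):
    """C_0(h, g, n) = E[varphi(X1 - X2)], where X1 - X2 ~ N(0, 2)."""
    alphas, variances = components(h, g, n)
    return sum(a * phi0(v + 2.0) for a, v in zip(alphas, variances))

def xi1(h, g, n):
    """E[varphi(X1 - X2) varphi(X2 - X3)].

    (X1 - X2, X2 - X3) is bivariate normal with covariance [[2, -1], [-1, 2]];
    each term is a bivariate normal density evaluated at the origin.
    """
    alphas, variances = components(h, g, n)
    return sum(aj * ak / (2.0 * np.pi * np.sqrt((2 + vj) * (2 + vk) - 1.0))
               for aj, vj in zip(alphas, variances)
               for ak, vk in zip(alphas, variances))

def xi2(h, g, n):
    """E[varphi(X1 - X2)^2], via phi_a(x) phi_b(x) = phi_{sqrt(a^2+b^2)}(0) phi_c(x),
    with c^2 = a^2 b^2 / (a^2 + b^2)."""
    alphas, variances = components(h, g, n)
    return sum(aj * ak * phi0(vj + vk) * phi0(vj * vk / (vj + vk) + 2.0)
               for aj, vj in zip(alphas, variances)
               for ak, vk in zip(alphas, variances))

def mmse(h, g, n):
    """Exact MMSE_n(h, g) = B_n^2 + V_n in the Gaussian case (Theorem 1)."""
    bias2 = (c0(h, g, n) - c0(h, 0.0, n)) ** 2
    x0 = c0(h, g, n) ** 2
    denom = n * (n - 1)
    var = (4 * (n - 2) / denom * xi1(h, g, n) + 2 / denom * xi2(h, g, n)
           - (4 * n - 6) / denom * x0)
    return bias2 + var

# Crude grid search for the local pilot bandwidth at one value of h.
n, h = 100, 0.45
g_grid = np.linspace(0.01, 2.0, 400)
g_best = g_grid[np.argmin([mmse(h, g, n) for g in g_grid])]
```

Repeating the grid search over a range of $h$ values would trace out an approximation to the function $g_{2n}(h)$ discussed next.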
Once we have an explicit formula for $\mathrm{MMSE}_n(h, g)$ we can find $g_{2n}(h)$ numerically in the Gaussian case. This exact optimal pilot bandwidth exhibits some surprising properties:
1. Figure 2a shows $g_{2n}(h)$ as a function of $h$ for $n = 100$ and it is clear that $g_{2n}(h) \not\to 0$ as $h \to 0$, contrary to the behavior exhibited by most existing methods for choosing the pilot bandwidth, as for instance the Sheather-Jones method ([SJ91]); see also [JMS96].
2. However, Figure 2b shows the values of the sequence $\{g_{2n}(h_{0n})\}$ for $n$ up to $10^7$ and it clearly indicates that $g_{2n}(h_{0n}) \to 0$ as $n \to \infty$.
Fig. 1. MMSE function in the Gaussian case
Fig. 2a. The function g2,100 (h)
Fig. 2b. The sequence {g2n (h0n )}
3. For asymptotic results, it is important to decide which of the two sequences, $\{h_{0n}\}$ and $\{g_{2n}(h_{0n})\}$, is asymptotically bigger. The common assumption is that $h_{0n}/g_{2n}(h_{0n}) \to c$ for some $0 \le c < \infty$ as $n \to \infty$ (see, e.g., [HMP92]). However, Fig. 3 shows that $h_{0n}/g_{2n}(h_{0n}) \to \infty$ as $n \to \infty$ in the Gaussian case.
4 Simulations
We define the cross-validation pilot bandwidth selector, $G_{CV}(h)$, as the value of $g$ which minimizes the criterion $\mathrm{MCV}_n(h, g)$ that appears in Theorem 2 above. Using this pilot bandwidth selector we can give a new bandwidth selector for $h$,
$$H_{ND} = \operatorname{argmin}_{h>0} M^{**}_{n,L,G_{CV}(h)}(h),$$
where the subscript ND is a reminder that the diagonal terms are not used in the criterion $M^{**}_{n,L,g}(h)$.
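The two-level minimization behind $H_{ND}$ can be sketched as a nested grid search. The sketch assumes that the two criteria are available as callables; `mcv` and `m_no_diag` are hypothetical placeholders for $\mathrm{MCV}_n(h, g)$ and $M^{**}_{n,L,g}(h)$, and the use of finite grids instead of a continuous optimizer is a simplification made here.

```python
import numpy as np

def select_h_nd(mcv, m_no_diag, h_grid, g_grid):
    """Nested grid search for H_ND.

    mcv(h, g)       -- hypothetical callable returning MCV_n(h, g)
    m_no_diag(h, g) -- hypothetical callable returning M**_{n,L,g}(h)
    Returns the selected h and the local pilot bandwidth used for it.
    """
    best_h, best_g, best_val = None, None, np.inf
    for h in h_grid:
        # Local pilot bandwidth G_CV(h): minimizer of g -> MCV_n(h, g).
        g_cv = min(g_grid, key=lambda g: mcv(h, g))
        val = m_no_diag(h, g_cv)
        if val < best_val:
            best_h, best_g, best_val = h, g_cv, val
    return best_h, best_g

# Toy criteria, used only to show the calling convention.
toy_mcv = lambda h, g: (g - 2.0 * h) ** 2
toy_m = lambda h, g: (h - 0.5) ** 2
h_nd, g_used = select_h_nd(toy_mcv, toy_m, [0.25, 0.5, 0.75], [0.5, 1.0, 1.5])
```

Note that the pilot bandwidth is recomputed for every candidate $h$, which reflects the local character of the pilot bandwidth in this method.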
Fig. 3. The sequence {h0n /g2n (h0n )}
Completely analogous calculations can be made to give $H_D$, the minimizer of the criterion $M^*_{n,L,g}(h)$ with an appropriate pilot bandwidth $g$ chosen by cross-validation.
Clearly, further theoretical work must be done before the selectors $H_{ND}$ and $H_D$ can be used in practice. The surprising properties of the previous section and the asymptotic rates of convergence of the methods should be studied in detail. However, for the sake of completeness, we provide the results of a brief simulation study investigating the small sample performance of the two new bandwidth selectors, comparing them with the cross-validation and Sheather-Jones bandwidth selectors for $h$ (see, e.g., [JMS96]). For sample size $n = 100$, we have used the 15 “test densities” that appear in [MW92]. Figure 4 shows, for every selector $H$ in the study, the log of the average value of the relative squared error $\{(H - h_{0n})/h_{0n}\}^2$ for every density in the Marron-Wand set of test densities.
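The performance measure can be written down directly. In this minimal sketch the function name is hypothetical and the natural logarithm is assumed, since the base of the log is not stated in the text.

```python
import numpy as np

def log_mean_relative_squared_error(selected, h_opt):
    """Log of the average of {(H - h_0n) / h_0n}^2 over simulated samples.

    selected -- bandwidths H chosen by one selector, one per simulated sample
    h_opt    -- the MISE-optimal bandwidth h_0n for the density at hand
    """
    selected = np.asarray(selected, dtype=float)
    rel = (selected - h_opt) / h_opt
    return float(np.log(np.mean(rel**2)))  # natural log (assumption)

score = log_mean_relative_squared_error([1.1, 0.9], 1.0)
```

Smaller (more negative) values indicate a selector whose bandwidths are closer to $h_{0n}$ on average.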
In view of the results of the simulation study, some conclusions may be obtained:
1. The performance of the two new selectors, $H_{ND}$ and $H_D$, is often quite similar, although the selector $H_D$ including the diagonals usually outperforms the no-diagonal version.
2. The cross-validation selector $H_{CV}$ appears to be the most robust selector, as it never breaks down completely for any of the 15 densities. Moreover, $H_{CV}$ performs quite well for the “difficult-to-estimate” densities #10–#15.
3. Overall, we could say that $H_{SJ}$ is the “winner” of the simulation study; its performance is particularly good for those densities close to Gaussian. This is not surprising, as the Sheather-Jones method uses a Gaussian reference density at its final stage.
4. Apart from the close-to-Gaussian densities, the performance of the new selectors introduced here is comparable to that of the Sheather-Jones method, with the advantage that the new selectors do not use an arbitrary reference density at any stage.
Fig. 4. Results of the simulation study
Acknowledgements. This research has been supported by Spanish Ministerio de Ciencia y Tecnología project MTM2005-06348.
References
[Ald95] Aldershof, B., Marron, J.S., Park, B.U., Wand, M.P.: Facts about the Gaussian probability density function. Applicable Analysis, 59, 289–306 (1995)
[Cao90] Cao-Abad, R.: Aplicaciones y nuevos resultados del método bootstrap en la estimación no paramétrica de curvas. MA Thesis, Universidad de Santiago de Compostela (1990)
[Cha04] Chacón, J.E.: Estimación de densidades: algunos resultados exactos y asintóticos. MA Thesis, Universidad de Extremadura (2004)
[CMN04] Chacón, J.E., Montanero, J., Nogales, A.G., Pérez, P.: Two statistical experiments for bootstrapping. Far East Journal of Theoretical Statistics, 12, 191–200 (2004)
[HM87] Hall, P., Marron, J.S.: Estimation of integrated squared density derivatives. Statistics and Probability Letters, 6, 109–115 (1987)
[HMP92] Hall, P., Marron, J.S., Park, B.U.: Smoothed cross-validation. Probability Theory and Related Fields, 92, 1–20 (1992)
[JMS96] Jones, M.C., Marron, J.S., Sheather, S.J.: Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11, 337–381 (1996)
[JS91] Jones, M.C., Sheather, S.J.: Using non-stochastic terms to advantage in kernel-based estimation of integrated squared derivatives. Statistics and Probability Letters, 11, 511–514 (1991)
[Lee90] Lee, A.J.: U-Statistics: Theory and Practice. Marcel Dekker (1990)
[MW92] Marron, J.S., Wand, M.P.: Exact mean integrated square error. Annals of Statistics, 20, 712–736 (1992)
[SJ91] Sheather, S.J., Jones, M.C.: A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society Ser. B, 53, 683–690 (1991)
[Sil86] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall (1986)