A cross-validation method for choosing the pilot bandwidth in kernel density estimation

J.E. Chacón, J. Montanero, A.G. Nogales and P. Pérez

Dpto. de Matemáticas, Universidad de Extremadura, Avda. de Elvas, s/n, E-06071 Badajoz, Spain. jechacon@unex.es, jmf@unex.es, nogales@unex.es, paloma@unex.es

Summary. Bootstrap bandwidth selection in kernel density estimation requires the use of a pilot bandwidth. Typically, this pilot bandwidth is chosen according to some asymptotic criterion, which in turn needs another pilot bandwidth at this second stage. In contrast, our proposal is based on a non-asymptotic minimum variance unbiased estimator of the error criterion for the pilot bandwidth.

Key words: kernel density estimation, bandwidth selection, bootstrap, pilot bandwidth, cross-validation

1 Introduction

The problem is that of estimating a density function $f$ using a real-valued sample $X_1, \dots, X_n$ from the probability distribution with density $f$. We consider here the kernel density estimator

\[ f_{n,K,h}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i), \]

where the kernel $K$ is an integrable symmetric function with $\int K = 1$, the bandwidth $h$ is a positive real number and we have used the rescaled-kernel notation $K_h(x) = K(x/h)/h$.

We are interested in an automatic way of choosing the bandwidth $h$, as this parameter controls the amount of smoothing in the kernel estimator (see, e.g., [Sil86]). The $L_2$-error criterion will be used; that is, we will measure the error of the estimate $f_{n,K,h}$ through the mean integrated squared error (MISE), defined by

\[ \mathrm{MISE}_n(h) = E \int [f_{n,K,h}(x) - f(x)]^2 \, dx. \]

Thus, henceforth we will assume that both $f$ and $K$ are square integrable functions.

Our goal is to propose a new bandwidth selector; that is, an estimator of the optimal bandwidth $h_{0n} := \operatorname{argmin}_{h>0} \mathrm{MISE}_n(h)$. To this end, we must propose an estimator $\widehat{\mathrm{MISE}}_n(h)$ of the MISE function and then define the selector as the minimizer of this criterion function $\widehat{\mathrm{MISE}}_n(h)$.

It is easy to show (see [Cha04]) that the MISE function can be written as

\[ \mathrm{MISE}_n(h) = R(f) + \frac{R(K)}{nh} + R_{\tilde K,h}(f), \]

where we have used the notations

\[ R(K) = \int K^2, \qquad R(f) = \int f^2, \qquad \tilde K = \frac{n-1}{n}(K * K) - 2K, \qquad R_{\tilde K,h}(f) = \int (\tilde K_h * f) f, \]

with $*$ standing for the convolution operator. Therefore, the optimal bandwidth can be defined as the minimizer of the function

\[ M_n(h) = \mathrm{MISE}_n(h) - R(f) = \frac{R(K)}{nh} + R_{\tilde K,h}(f), \]

which depends on the unknown density $f$ only through the functional $R_{\tilde K,h}(f)$.

2 Bootstrap bandwidth selection

The bootstrap approach consists of estimating $M_n(h)$ by

\[ M^{*}_{n,L,g}(h) := \frac{R(K)}{nh} + R_{\tilde K,h}(f_{n,L,g}), \]

where $f_{n,L,g}$ is another kernel estimator, with possibly different kernel $L$ and pilot bandwidth $g$ (see, e.g., [Cao90] or [CMN04] for a general formulation of the bootstrap method), and taking the bootstrap bandwidth selector as the minimizer of the criterion $M^{*}_{n,L,g}(h)$.

However, this approach raises the new problem of choosing the pilot bandwidth $g$. So, following the steps for selecting $h$, we must evaluate the mean squared error (MSE) function

\[ \mathrm{MSE}_n(h,g) = E\{M^{*}_{n,L,g}(h) - M_n(h)\}^2 \]

and try to estimate the optimal pilot bandwidth, $g_{1n}(h)$, which minimizes $\mathrm{MSE}_n(h,g)$ as a function of $g$. Notice that, using this methodology, the pilot bandwidth must necessarily be local (that is, it depends on $h$), as $M_n \notin L_p$ for any $p \ge 1$.

The plug-in estimator of the functional $R_{\tilde K,h}(f)$ can be written as

\[ R_{\tilde K,h}(f_{n,L,g}) = \frac{1}{n^2} \sum_{i,j=1}^{n} (\tilde K_h * \bar L_g)(X_i - X_j), \]

with $\bar L = L * L$. It includes the non-random term $(\tilde K_h * \bar L_g)(0)/n = R_{\tilde K,h}(L_g)/n$. As $E[(\tilde K_h * \bar L_g)(X_1 - X_2)] = R_{\tilde K,h}(L_g * f)$, we have that

\[ E[R_{\tilde K,h}(f_{n,L,g})] = \frac{n-1}{n} R_{\tilde K,h}(L_g * f) + \frac{1}{n} R_{\tilde K,h}(L_g). \]

This suggests replacing the bootstrap estimator $M^{*}_{n,L,g}$ by its no-diagonal version $M^{**}_{n,L,g}(h) = R(K)/(nh) + T_{n,L,g}(h)$, where

\[ T_{n,L,g}(h) = \frac{1}{n(n-1)} \sum_{i \ne j} (\tilde K_h * \bar L_g)(X_i - X_j), \]

which satisfies $E[T_{n,L,g}(h)] = R_{\tilde K,h}(L_g * f)$; moreover, $T_{n,L,g}(h)$ is the minimum variance unbiased estimator (MVUE) of $R_{\tilde K,h}(L_g * f)$ for every fixed $h > 0$. Related techniques may be found in [HM87] and [JS91] for a similar problem.

We will denote the optimal pilot bandwidth for $M^{**}_{n,L,g}(h)$ as $g_{2n}(h) = \operatorname{argmin}_{g>0} \mathrm{MMSE}_n(h,g)$, where

\[ \mathrm{MMSE}_n(h,g) = E\{M^{**}_{n,L,g}(h) - M_n(h)\}^2. \]

Thus, to choose the pilot bandwidth $g_{2n}(h)$ we must give an estimator of $\mathrm{MMSE}_n(h,g)$. As usual, it is possible to give a squared bias-variance decomposition of the error,

\[ \mathrm{MMSE}_n(h,g) = B_n^2(h,g) + V_n(h,g). \]

Next we provide exact expressions for these two terms.

Theorem 1. If we denote

\[ \psi \equiv \psi_{h,g} = \tilde K_h * \bar L_g - \tilde K_h, \qquad \varphi \equiv \varphi_{h,g} = \tilde K_h * \bar L_g, \]

then the bias and variance functions can be expressed as

\[ B_n^2(h,g) = \{E[\psi(X_1 - X_2)]\}^2, \]

\[ V_n(h,g) = \frac{4(n-2)}{n(n-1)} \xi_1 + \frac{2}{n(n-1)} \xi_2 - \frac{4n-6}{n(n-1)} \xi_0, \]

where

\[ \xi_0 = \{E[\varphi(X_1 - X_2)]\}^2, \qquad \xi_1 = E[\varphi(X_1 - X_2)\varphi(X_2 - X_3)], \qquad \xi_2 = E[\varphi(X_1 - X_2)^2]. \]

$B_n^2(h,g)$ is a regular statistical functional of order 4 (see [Lee90], p. 2). Therefore, its MVUE is given by

\[ U_n^{[4]} = \binom{n}{4}^{-1} \sum_{(n,4)} \Psi^{[4]}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}), \]

where the sum $\sum_{(n,k)}$ is taken over all subsets $1 \le i_1 < \dots < i_k \le n$ of $\{1, 2, \dots, n\}$ and where $\Psi^{[4]}(x_1, x_2, x_3, x_4)$ denotes the symmetrization of $\psi(x_1 - x_2)\psi(x_3 - x_4)$; that is,

\[ \Psi^{[4]}(x_1, x_2, x_3, x_4) = \frac{1}{4!} \sum_{\tau \in S_4} \psi(x_{\tau(1)} - x_{\tau(2)}) \psi(x_{\tau(3)} - x_{\tau(4)}), \]

with $S_4$ being the group of permutations of 4 elements.

As $\xi_0$, $\xi_1$ and $\xi_2$ are also regular statistical functionals, it is possible to give their MVUEs, $V_n^{[4]}$, $U_n^{[3]}$ and $U_n^{[2]}$, respectively, in a similar way. Collecting all the MVUEs we get the following result:

Theorem 2. The MVUE of $\mathrm{MMSE}_n(h,g)$ for every fixed $h, g > 0$ is given by

\[ \mathrm{MCV}_n(h,g) = U_n^{[4]} + \frac{4(n-2)}{n(n-1)} U_n^{[3]} + \frac{2}{n(n-1)} U_n^{[2]} - \frac{4n-6}{n(n-1)} V_n^{[4]}. \]

We call this procedure a “cross-validation method” by analogy with the fact that the well-known cross-validation criterion for selecting $h$ is the MVUE of the function $M_n(h)$ (see [Cha04], Chapter 4). However, it is clear that the usual cross-validation interpretation is lost in this case.

3 Exact calculations

We will give here an exact expression for the function $\mathrm{MMSE}_n(h,g)$ when $f = K = L = \phi$, the density of a standard Gaussian distribution; that is,

\[ \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \qquad x \in \mathbb{R}. \]

Using Theorem 1 above, to give an exact expression for $\mathrm{MMSE}_n(h,g)$ we only need to calculate $B_n(h,g)$ and $\xi_i$ for $i = 0, 1, 2$. If we denote

\[ C_0(h,g,n) = R_{\tilde K,h}(L_g * f), \]

\[ C_1(h,g,s_1,s_2) = E\big[(\phi_{s_1 h} * \phi_{g\sqrt{2}})(X_1 - X_2)\,(\phi_{s_2 h} * \phi_{g\sqrt{2}})(X_2 - X_3)\big], \]

\[ C_2(h,g,s_1,s_2) = E\big[(\phi_{s_1 h} * \phi_{g\sqrt{2}})(X_1 - X_2)\,(\phi_{s_2 h} * \phi_{g\sqrt{2}})(X_1 - X_2)\big], \]

then

\[ B_n^2(h,g) = [C_0(h,g,n) - C_0(h,0,n)]^2, \qquad \xi_0 = C_0(h,g,n)^2, \]

\[ \xi_i = \sum_{\substack{j,k=1 \\ j \ge k}}^{2} \alpha_j \alpha_k \, C_i(h,g,\sqrt{j},\sqrt{k}) \]

for $i = 1, 2$, with $\alpha_1 = -2$ and $\alpha_2 = (n-1)/n$, the coefficients of $K$ and $K * K$ in $\tilde K$.

As it is possible to obtain explicit formulae for the functions $C_0$, $C_1$, $C_2$ using the results in [Ald95], we can compute the exact value of $\mathrm{MMSE}_n(h,g)$ in this case. Figure 1 shows the function $\mathrm{MMSE}_n(h,g)$ for sample size $n = 100$.

Once we have an explicit formula for $\mathrm{MMSE}_n(h,g)$ we can find $g_{2n}(h)$ numerically in the Gaussian case. This exact optimal pilot bandwidth exhibits some surprising properties:

1. Figure 2a shows $g_{2n}(h)$ as a function of $h$ for $n = 100$, and it is clear that $g_{2n}(h) \not\to 0$ as $h \to 0$, contrary to the behavior exhibited by most existing methods for choosing the pilot bandwidth, such as the Sheather-Jones method ([SJ91]); see also [JMS96].
2. However, Figure 2b shows the values of the sequence $\{g_{2n}(h_{0n})\}$ for $n$ up to $10^7$, and it clearly indicates that $g_{2n}(h_{0n}) \to 0$ as $n \to \infty$.
3. For asymptotic results, it is important to decide which of the two sequences, $\{h_{0n}\}$ and $\{g_{2n}(h_{0n})\}$, is asymptotically bigger. The common assumption is that $h_{0n}/g_{2n}(h_{0n}) \to c$ for some $0 \le c < \infty$ as $n \to \infty$ (see, e.g., [HMP92]). However, Fig. 3 shows that $h_{0n}/g_{2n}(h_{0n}) \to \infty$ as $n \to \infty$ in the Gaussian case.

Fig. 1. MMSE function in the Gaussian case
Fig. 2a. The function $g_{2,100}(h)$
Fig. 2b. The sequence $\{g_{2n}(h_{0n})\}$
Fig. 3. The sequence $\{h_{0n}/g_{2n}(h_{0n})\}$

4 Simulations

We define the cross-validation pilot bandwidth selector, $G_{CV}(h)$, as the value of $g$ which minimizes the criterion $\mathrm{MCV}_n(h,g)$ that appears in Theorem 2 above. Using this pilot bandwidth selector we can give a new bandwidth selector for $h$,

\[ H_{ND} = \operatorname{argmin}_{h>0} M^{**}_{n,L,G_{CV}(h)}(h), \]

where the subscript ND is a reminder that the diagonal terms are not used in the criterion $M^{**}_{n,L,g}(h)$.

Completely analogous calculations can be made to give $H_D$, the minimizer of the criterion $M^{*}_{n,L,g}(h)$ with an appropriate pilot bandwidth $g$ chosen by cross-validation.

Clearly, further theoretical work must be done before the selectors $H_{ND}$ and $H_D$ can be used in practice. The surprising properties described in the previous section and the asymptotic rates of convergence of the methods should be studied in detail. However, for the sake of completeness, we provide the results of a brief simulation study investigating the small-sample performance of the two new bandwidth selectors, comparing them with the cross-validation and Sheather-Jones bandwidth selectors for $h$, denoted $H_{CV}$ and $H_{SJ}$ (see, e.g., [JMS96]).

For sample size $n = 100$, we have used the 15 “test densities” that appear in [MW92]. Figure 4 shows, for every selector $H$ in the study, the log of the average value of the relative squared error $\{(H - h_{0n})/h_{0n}\}^2$ for every density in the Marron-Wand set of test densities.
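As an illustration only, the no-diagonal criterion $M^{**}_{n,L,g}(h)$ minimized by $H_{ND}$ has a closed form when $K = L = \phi$, because Gaussian convolutions simply add variances. The sketch below assumes Gaussian kernels and takes the pilot bandwidth $g$ as a fixed, user-supplied value instead of computing $G_{CV}(h)$ from $\mathrm{MCV}_n(h,g)$. The function names, the search grid and the value $g = 0.3$ are hypothetical choices made for this example and are not part of the method described above.

```python
import math
import random


def gauss_pdf(x, s):
    """Density of N(0, s^2) at x, i.e. phi_s(x) in the rescaled-kernel notation."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def no_diagonal_criterion(sample, h, g):
    """M**_{n,L,g}(h) = R(K)/(nh) + T_{n,L,g}(h), assuming K = L = phi."""
    n = len(sample)
    r_k = 1.0 / (2.0 * math.sqrt(math.pi))  # R(phi), the integral of phi^2
    # With Lbar_g = phi_{g*sqrt(2)}, Gaussian convolutions add variances:
    #   (K*K)_h * Lbar_g = phi_{sqrt(2h^2 + 2g^2)}
    #   K_h * Lbar_g     = phi_{sqrt(h^2 + 2g^2)}
    s_kk = math.sqrt(2.0 * h * h + 2.0 * g * g)
    s_k = math.sqrt(h * h + 2.0 * g * g)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # pairs i < j; symmetry doubles the sum below
            d = sample[i] - sample[j]
            total += (n - 1) / n * gauss_pdf(d, s_kk) - 2.0 * gauss_pdf(d, s_k)
    t = 2.0 * total / (n * (n - 1))  # sum over i != j, divided by n(n-1)
    return r_k / (n * h) + t


def select_bandwidth(sample, g, h_grid):
    """Grid minimizer of the criterion for a fixed pilot bandwidth g (assumption)."""
    return min(h_grid, key=lambda h: no_diagonal_criterion(sample, h, g))


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.05 * k for k in range(2, 41)]  # candidate h values from 0.10 to 2.00
h_nd = select_bandwidth(data, g=0.3, h_grid=grid)
print(h_nd)
```

With $g = 0$ the smoothing by $\bar L_g$ disappears and the expression reduces to the usual least-squares cross-validation criterion for $h$, which gives a convenient sanity check. A grid search is used here only for simplicity; any one-dimensional minimizer could be substituted.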
In view of the results of the simulation study, some conclusions may be drawn:

1. The performance of the two new selectors, $H_{ND}$ and $H_D$, is often quite similar, although the selector $H_D$, which includes the diagonal terms, usually outperforms the no-diagonal version.
2. The cross-validation selector $H_{CV}$ appears to be the most robust selector, as it never breaks down completely for any of the 15 densities. Moreover, $H_{CV}$ performs quite well for the “difficult-to-estimate” densities #10–#15.
3. Overall, we could say that $H_{SJ}$ is the “winner” of the simulation study; its performance is particularly good for those densities close to Gaussian. This is not surprising, as the Sheather-Jones method uses a Gaussian reference density at its final stage.
4. Apart from the close-to-Gaussian densities, the performance of the new selectors introduced here is comparable to that of the Sheather-Jones method, with the advantage that the new selectors do not use an arbitrary reference density at any stage.

Fig. 4. Results of the simulation study

Acknowledgements. This research has been supported by the Spanish Ministerio de Ciencia y Tecnología project MTM2005-06348.

References

[Ald95] Aldershof, B., Marron, J.S., Park, B.U., Wand, M.P.: Facts about the Gaussian probability density function. Applicable Analysis, 59, 289–306 (1995)
[Cao90] Cao-Abad, R.: Aplicaciones y nuevos resultados del método bootstrap en la estimación no paramétrica de curvas. MA Thesis, Universidad de Santiago de Compostela (1990)
[Cha04] Chacón, J.E.: Estimación de densidades: algunos resultados exactos y asintóticos. MA Thesis, Universidad de Extremadura (2004)
[CMN04] Chacón, J.E., Montanero, J., Nogales, A.G., Pérez, P.: Two statistical experiments for bootstrapping. Far East Journal of Theoretical Statistics, 12, 191–200 (2004)
[HM87] Hall, P., Marron, J.S.: Estimation of integrated squared density derivatives.
Statistics and Probability Letters, 6, 109–115 (1987)
[HMP92] Hall, P., Marron, J.S., Park, B.U.: Smoothed cross-validation. Probability Theory and Related Fields, 92, 1–20 (1992)
[JMS96] Jones, M.C., Marron, J.S., Sheather, S.J.: Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11, 337–381 (1996)
[JS91] Jones, M.C., Sheather, S.J.: Using non-stochastic terms to advantage in kernel-based estimation of integrated squared derivatives. Statistics and Probability Letters, 11, 511–514 (1991)
[Lee90] Lee, A.J.: U-Statistics: Theory and Practice. Marcel Dekker (1990)
[MW92] Marron, J.S., Wand, M.P.: Exact mean integrated square error. Annals of Statistics, 20, 712–736 (1992)
[SJ91] Sheather, S.J., Jones, M.C.: A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society Ser. B, 53, 683–690 (1991)
[Sil86] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall (1986)