Robust kernel estimator for densities of unknown smoothness

First draft. Preliminary and incomplete.

By Yulia Kotlyarova and Victoria Zinde-Walsh*
Department of Economics, McGill University and CIREQ

February 26, 2004

Abstract

Results on nonparametric kernel estimators of density differ according to the assumed degree of density smoothness. While most asymptotic normality results assume that the density function is smooth, there are cases where non-smooth density functions may be of interest. We provide asymptotic results for kernel estimation of a non-smooth continuous density and derive a new result: the limit joint distribution of kernel density estimators corresponding to different bandwidths and kernel functions. Using the results on the joint distribution of kernel density estimators, we construct an estimator that optimally combines estimators for different bandwidth/kernel combinations to protect against the negative consequences of errors in assumptions about the order of smoothness. The results of a Monte Carlo experiment confirm the usefulness of the combined estimator. We demonstrate that while in the standard normal case the combined estimator has relatively higher MSE than the standard kernel estimator, both estimators are highly accurate. On the other hand, for a non-smooth density where the MSE gets very large, the combined estimator provides uniformly better results than the standard estimator.

* The support of the Social Sciences and Humanities Research Council of Canada (SSHRC) and the Fonds québécois de la recherche sur la société et la culture (FRQSC) is gratefully acknowledged.

Robust density estimator

Corresponding author:
Victoria Zinde-Walsh
Department of Economics, McGill University
855 Sherbrooke Street West, Montreal, Quebec H3A 2T7, Canada
Email: victoria.zinde-walsh@mcgill.ca

1 Introduction

Results on nonparametric kernel estimators of density differ according to the assumed degree of density smoothness.
Continuity of density implies that kernel estimators are consistent; finite sample biases and variances are known. While most asymptotic normality results assume that the density function is smooth, there are cases where non-smooth density functions may be of interest. We provide asymptotic results for kernel estimation of a non-smooth continuous density and derive a new result: the limit joint distribution of kernel density estimators corresponding to different bandwidths and kernel functions. Because of the nonparametric rates of convergence, each such estimator may provide additional information. The examination of the joint distribution is similar to theorems about the joint distributions of other estimators that converge at nonparametric rates, such as the smoothed maximum score estimator (Kotlyarova and Zinde-Walsh, 2003) or the smoothed least median of squares estimator (Zinde-Walsh, 2002).

If second order or higher order derivatives of the density exist, a bandwidth that provides an optimal convergence rate can be found for a kernel of sufficiently high order. If, however, there is no certainty that the smoothness assumptions hold, under- and especially over-smoothing is likely. Oversmoothing leads to asymptotic bias and makes the estimator concentrate around the wrong value. Undersmoothing yields a consistent estimator but increases the mean squared error. If there are no grounds on which to assume smoothness of the density, the chosen rate for the bandwidth may be in error and the estimator will suffer from the problems associated with under- or over-smoothing.

Using the results on the joint distribution of kernel density estimators, we construct an estimator that optimally combines estimators for different bandwidth/kernel combinations to protect against the negative consequences of errors in assumptions about the order of smoothness. The results of a Monte Carlo experiment confirm the usefulness of the combined estimator in finite samples.
We demonstrate that while in the standard normal case the combined estimator has relatively larger MSE than the standard kernel estimator, both estimators are highly accurate. On the other hand, for a non-smooth density where the MSE gets very large, the combined estimator provides uniformly better results than the standard estimator.

The paper is organized as follows. Section 2 provides the definitions, assumptions and known results for the kernel density estimator. Section 3 provides asymptotic results under weak (only continuity, no smoothness) assumptions for the kernel density estimator, as well as for the joint limit process for several estimators. The new combined estimator is defined in Section 4, where we discuss how to construct it (selection of bandwidths, smoothing kernels, estimation of the MSE of a linear combination) and evaluate its performance in a Monte Carlo experiment. Appendix A provides the proofs of the results in Section 3.

2 Definitions, notation, assumptions, known results

Consider a random variable x and the corresponding distribution function F(x). If a density f(x) exists, then F(x) = \int_{-\infty}^{x} f(t) dt.

Assumption 1.
(a) (x_i), i = 1, ..., n, is a random sample of x.
(b) A density function f(x) exists and is continuous at x.

To estimate the density we utilize kernel functions, but we do not restrict kernels to symmetric or nonnegative density functions; as will be clear later, this may give us some extra flexibility.

Assumption 2.
(a) The kernel smoothing function \psi is a continuously differentiable function;
(b) \int \psi(w) dw = 1;
(c) The bandwidth parameter satisfies \sigma_n \to 0 and n\sigma_n \to \infty.

The kernel density estimator is defined as

\hat{f}(x) = \frac{1}{n\sigma_n} \sum_{i=1}^{n} \psi\left(\frac{x - x_i}{\sigma_n}\right).   (1)

We review here some of the known results; up to the difference in notation we refer to the summary in Pagan and Ullah (1999).
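As an illustration, estimator (1) can be coded directly. The following is a minimal sketch (our illustration, not the authors' code), using the Gaussian kernel as one admissible choice of \psi:

```python
import math
import random

def gaussian_kernel(w):
    # Standard normal density: a valid smoothing function under Assumption 2
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)

def kde(x, sample, sigma, psi=gaussian_kernel):
    # Estimator (1): (1 / (n * sigma)) * sum_i psi((x - x_i) / sigma)
    n = len(sample)
    return sum(psi((x - xi) / sigma) for xi in sample) / (n * sigma)

# Example: estimate the standard normal density at x = 0 from n = 2000 draws
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
fhat = kde(0.0, sample, sigma=0.3)
```

For a sample from N(0, 1), `fhat` should be close to \phi(0) \approx 0.399, up to finite-sample bias and sampling noise.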
Our main aim here is to emphasize the dependence of various results on the smoothness properties of the density function and to highlight results that do not require more than our Assumptions 1, 2.

The expressions for bias and variance in Theorem 2.1 of Pagan and Ullah hold (in our notation):

Bias(\hat{f}) = \int \psi(w) [f(\sigma w + x) - f(x)] dw;   (2)

Var(\hat{f}) = (n\sigma)^{-1} \int \psi(w)^2 f(\sigma w + x) dw - n^{-1} \left[ \int \psi(w) f(\sigma w + x) dw \right]^2.   (3)

Theorem 2.5 holds and implies that MSE(\hat{f}) \to 0. If additionally \int |f(x)| dx < \infty, the kernel estimator is uniformly asymptotically unbiased (Theorem 2.4), and with n\sigma(\log n)^{-1} \to \infty it is strongly consistent (Theorem 2.7). With uniform continuity of the density (plus assumptions A12, A14) Theorem 2.8 establishes weak uniform consistency of the estimator. Theorem 2.9 holds under Assumptions 1, 2 and provides

(n\sigma)^{1/2} (\hat{f} - E\hat{f}) \to_d N\left(0, f(x) \int \psi(w)^2 dw\right).   (4)

For the asymptotic normality result in Theorem 2.10 the existence of continuous second order derivatives of the density function is assumed; then the sharp rate of bandwidth \sigma = cn^{-1/5} is optimal for a second-order kernel, and the problem of efficient estimation is to find an appropriate c. If higher order derivatives of the density exist, further improvements in efficiency can be obtained by using a higher order kernel to reduce the bias. For a non-smooth density precise rate results are not available, but one can still establish asymptotic normality given a suitable rate for the bandwidth; we provide that in the next section.

3 Asymptotic results for the kernel density estimator

3.1 Asymptotic results for the density estimator when density is continuous

Without differentiability of the density, the sharp conditions on the optimal rate of the estimator do not hold. To express the conditions under which we can state asymptotic results we define

A(\psi, \sigma_n, x) = \int [f(\sigma_n w + x) - f(x)] \psi(w) dw;   (5)

this is of course just the expression for the bias of the kernel density estimator.
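The finite-sample bias and variance in (2)-(3) can be evaluated by numerical integration. The sketch below (our illustration, not the authors' code) checks (2) against an exact benchmark: for f = \psi = \phi (standard normal), E\hat{f}(x) is the N(0, 1 + \sigma^2) density at x, so the bias at x = 0 equals 1/\sqrt{2\pi(1+\sigma^2)} - 1/\sqrt{2\pi}.

```python
import math

def phi(u):
    # standard normal density
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def midpoint(g, lo, hi, grid=4000):
    # composite midpoint rule; very accurate for Gaussian-type integrands
    h = (hi - lo) / grid
    return sum(g(lo + (k + 0.5) * h) for k in range(grid)) * h

def bias(f, psi, sigma, x):
    # Eq. (2): int psi(w) * (f(sigma * w + x) - f(x)) dw
    return midpoint(lambda w: psi(w) * (f(sigma * w + x) - f(x)), -10.0, 10.0)

def variance(f, psi, sigma, x, n):
    # Eq. (3): (n*sigma)^(-1) int psi(w)^2 f(sigma*w + x) dw
    #          - n^(-1) * (int psi(w) f(sigma*w + x) dw)^2
    i1 = midpoint(lambda w: psi(w) ** 2 * f(sigma * w + x), -10.0, 10.0)
    i2 = midpoint(lambda w: psi(w) * f(sigma * w + x), -10.0, 10.0)
    return i1 / (n * sigma) - i2 ** 2 / n

# Exact benchmark for f = psi = phi at x = 0
sigma = 0.3
exact_bias = 1.0 / math.sqrt(2.0 * math.pi * (1.0 + sigma ** 2)) - phi(0.0)
approx_bias = bias(phi, phi, sigma, 0.0)
```

Note that the bias here is negative: smoothing a unimodal density pulls the estimate at the mode downward.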
Under Assumption 1(b), A(\psi, \sigma_n, x) converges to 0. Under more stringent differentiability assumptions a sharp rate for A(\psi, \sigma_n, x) could be determined, but we do not make such assumptions.

Theorem 1. Under Assumptions 1-2, if \sigma_n is such that as n \to \infty

(a) n^{1/2} \sigma_n^{1/2} A(\psi, \sigma_n, x) \to 0, then n^{1/2} \sigma_n^{1/2} (\hat{f}(x) - f(x)) \to_d N(0, f(x) \int \psi(w)^2 dw);

(b) n^{1/2} \sigma_n^{1/2} A(\psi, \sigma_n, x) \to A(\psi), where 0 < |A(\psi)| < \infty, then n^{1/2} \sigma_n^{1/2} (\hat{f}(x) - f(x)) \to_d N(A(\psi), f(x) \int \psi(w)^2 dw);

(c) n^{1/2} \sigma_n^{1/2} |A(\psi, \sigma_n, x)| \to \infty, then |A(\psi, \sigma_n, x)|^{-1} [\hat{f}(x) - f(x) - A(\psi, \sigma_n, x)] = o_p(1).

Proof in Appendix A.

Thus for case (a) (undersmoothing) we obtain a limit normal distribution, and for (b) and (c) the estimator is asymptotically biased. Without making assumptions about the degree of smoothness of the density, all that is known is that for some rate of \sigma_n \to 0 there is undersmoothing: no asymptotic bias and a limiting Gaussian distribution; and for some slower convergence rate of \sigma_n there is oversmoothing. Existence of an optimal rate depends on convergence properties of A(\psi, \sigma_n, x) that cannot be asserted without strengthening the assumptions.

3.2 The joint limit process for density estimators

Assume that \hat{f}(x, (\sigma_n, \psi)) represents the estimator when the function \psi and bandwidth \sigma_n are utilized. Consider a number of bandwidths \sigma_n: \sigma_{n1} < \sigma_{n2} < ... < \sigma_{nm}. Assume that \sigma_{ni} for i \le m' corresponds to undersmoothing (part (a) of Theorem 1), while \sigma_{ni} for m'' < i \le m corresponds to oversmoothing (part (c) of Theorem 1). If an optimal rate exists (e.g. for an h times continuously differentiable density and using an h-order kernel the optimal rate is O(n^{-1/(2h+1)}); see, e.g., Pagan and Ullah, p. 30), then one could have m'' \ge m' + 1, with \sigma_{ni} for i = m' + 1, ..., m'' corresponding to the optimal rate. We combine each \sigma_{ni} with each smoothing function \psi_j from some set of functions that satisfy Assumption 2, j = 1, ..., l, and consider A(\psi_j, \sigma_i, x).
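As a numerical illustration of Theorem 1 (our sketch, not part of the paper's experiments): for a density that is Lipschitz but not differentiable at x, the bias term (5) shrinks only at rate O(\sigma_n), so condition (a) amounts to n\sigma_n^3 \to 0. The Laplace density at its kink x = 0 is a convenient test case:

```python
import math

def phi(u):
    # standard normal density, used here as the kernel psi
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def laplace(u):
    # density 0.5 * exp(-|u|): continuous and Lipschitz, not differentiable at 0
    return 0.5 * math.exp(-abs(u))

def A(psi, f, sigma, x, grid=4000, width=10.0):
    # Bias term (5): int (f(sigma * w + x) - f(x)) * psi(w) dw, midpoint rule
    h = 2.0 * width / grid
    return sum((f(sigma * (-width + (k + 0.5) * h) + x) - f(x))
               * psi(-width + (k + 0.5) * h)
               for k in range(grid)) * h

# Halving sigma roughly halves A at the kink: A is O(sigma), not O(sigma^2)
a1 = A(phi, laplace, 0.10, 0.0)
a2 = A(phi, laplace, 0.05, 0.0)
ratio = a1 / a2
```

At a point where the density is twice differentiable, the same computation with a second-order kernel would give a ratio close to 4 instead, since there A(\psi, \sigma, x) = O(\sigma^2).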
Define

\eta(\sigma_i, \psi_j) =
  n^{1/2} \sigma_i^{1/2} (\hat{f}(x) - f(x))  for i = 1, ..., m';
  n^{1/2} \sigma_i^{1/2} ((\hat{f}(x) - f(x)) - A(\psi_j, \sigma_i, x))  for i = m' + 1, ..., m'';
  |A(\psi_j, \sigma_i, x)|^{-1} [\hat{f}(x) - f(x) - A(\psi_j, \sigma_i, x)]  for i = m'' + 1, ..., m.

Let \tau_{\psi_i \psi_j} \equiv \int \psi_i(w) \psi_j(w) dw and \delta = (\tau_{\psi_1 \psi_1}, ..., \tau_{\psi_l \psi_l}).

Theorem 2. Suppose that Assumptions 1, 2 hold for each bandwidth \sigma_i, 1 \le i \le m, and for each \psi_j, 1 \le j \le l, and that the functions \{\psi_j\}_{j=1}^{l} form a linearly independent set.(1)

(a) If each \sigma_1, ..., \sigma_{m'} (m' \le m) satisfies condition (a) of Theorem 1, then

(\eta(\sigma_1, \psi_1), ..., \eta(\sigma_1, \psi_l), ..., \eta(\sigma_{m'}, \psi_1), ..., \eta(\sigma_{m'}, \psi_l))' \to_d N(0, \Psi f(x)),

where the lm' \times lm' matrix \Psi has elements

{\Psi}_{ij} = \tau_{\psi_i \psi_j} if \sigma_i = \sigma_j; \sqrt{d} \int \psi_i(w) \psi_j(dw) dw if \sigma_i/\sigma_j = d < \infty; 0 if \sigma_i/\sigma_j \to 0 or \sigma_i/\sigma_j \to \infty.

(b) If each \sigma_{m'+1}, ..., \sigma_{m''} (m' \le m'' \le m) satisfies condition (b) of Theorem 1, then

(\eta(\sigma_{m'+1}, \psi_1), ..., \eta(\sigma_{m'+1}, \psi_l), ..., \eta(\sigma_{m''}, \psi_1), ..., \eta(\sigma_{m''}, \psi_l))' \to_d N(0, \Psi f(x)),

where the l(m'' - m') \times l(m'' - m') matrix \Psi has elements {\Psi}_{ij} = \sqrt{d} \int \psi_i(w) \psi_j(dw) dw, with \sigma_i/\sigma_j = d < \infty.

(c) If each \sigma_{m''+1}, ..., \sigma_m (m'' \le m) satisfies condition (c) of Theorem 1, then

(\eta(\sigma_{m''+1}, \psi_1), ..., \eta(\sigma_{m''+1}, \psi_l), ..., \eta(\sigma_m, \psi_1), ..., \eta(\sigma_m, \psi_l))' \to_p 0.

(d) Cov(\eta(\sigma_{i_1}, \psi_{j_1}), \eta(\sigma_{i_2}, \psi_{j_2})) \to 0 for 1 \le i_1 \le m'' and m'' + 1 \le i_2 \le m, and any j_1, j_2.

(1) If some linear combination of the smoothing kernels \psi_j is zero, then the joint distribution is degenerate.

Thus, if the bandwidths approach 0 at different rates or \int \psi_i(w) \psi_j(w) dw = 0, the corresponding estimators \hat{f}(x, (\sigma, \psi)) are asymptotically independent. This is a consequence of the fact that only a small fraction of observations have any effect on the estimator; therefore reweighting observations with different kernel functions can produce estimators with independent limit processes.
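The covariance structure in Theorem 2 is easy to evaluate for concrete kernels. For Gaussian \psi_i = \psi_j = \phi, the entry \sqrt{d} \int \psi_i(w) \psi_j(dw) dw has the closed form \sqrt{d}/\sqrt{2\pi(1+d^2)}, which the following sketch (our illustration, not the authors' code) verifies numerically; at d = 1 it reduces to \int \phi(w)^2 dw = 1/(2\sqrt{\pi}):

```python
import math

def phi(u):
    # standard normal density
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def psi_entry(d, psi_i=phi, psi_j=phi, grid=4000, width=12.0):
    # Theorem 2 limit covariance factor sqrt(d) * int psi_i(w) psi_j(d * w) dw
    # for two estimators whose bandwidth ratio converges to d (0 < d < infinity)
    h = 2.0 * width / grid
    return math.sqrt(d) * sum(
        psi_i(-width + (k + 0.5) * h) * psi_j(d * (-width + (k + 0.5) * h))
        for k in range(grid)) * h

entry = psi_entry(2.0)
closed_form = math.sqrt(2.0) / math.sqrt(2.0 * math.pi * (1.0 + 4.0))
```

The factor falls as d moves away from 1, consistent with part (d): estimators whose bandwidths shrink at very different rates become asymptotically independent.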
3.3 Extension to multivariate density estimation

Results similar to Theorems 1 and 2 can be obtained for kernel estimators of multivariate density functions. TO BE COMPLETED

4 The combined estimator

As the results in Section 3 show, finding an optimal rate for a density estimator may be problematic. Here we use the results of Theorem 2 to construct a new combined estimator that optimally combines several bandwidths and smoothing functions in the sample instead of focussing on a single bandwidth. Although efficiency may suffer in straightforward cases when an optimal rate can be found, the Monte Carlo experiments show that the combined estimator provides remarkably robust performance over a variety of cases. Section 4.1 defines the combined estimator. Section 4.2 addresses practical issues of construction of the combined estimator. Section 4.3 discusses performance in a Monte Carlo experiment.

4.1 Definition of the combined estimator

Suppose that bandwidths \sigma_{n1} < \sigma_{n2} < ... < \sigma_{nm} represent sequences of rates where \sigma_{n1} corresponds to undersmoothing and \sigma_{nm} to oversmoothing; some optimal rate may or may not exist. For a set of smoothing functions \psi_1, ..., \psi_l, Theorem 2 indicates the structure of the joint limit distribution of \hat{f}(\sigma_{ni}, \psi_j).

Consider a linear combination \hat{f}(\{a_{ij}\}) = \sum a_{ij} \hat{f}(\sigma_{ni}, \psi_j) with \sum a_{ij} = 1. Assume that the biases, variances and covariances for all \hat{f}(\sigma_{ni}, \psi_j) are known. Then one could find weights \{a_{ij}\} that minimize the mean squared error MSE(\hat{f}(\{a_{ij}\})). Each individual \hat{f}(\sigma_{ni}, \psi_j) is included, thus the minimized MSE cannot be above the MSE for any individual (\sigma_{ni}, \psi_j).

To determine the weights in practice we need to estimate the biases and covariances of all \hat{f}(\sigma_{ni}, \psi_j). Denote estimated biases and covariances by "hats". Then

\widehat{MSE}(\hat{f}(\{a_{ij}\})) = \sum a_{i_1 j_1} a_{i_2 j_2} \{\widehat{bias}(\hat{f}(\sigma_{i_1}, \psi_{j_1})) \widehat{bias}(\hat{f}(\sigma_{i_2}, \psi_{j_2})) + \widehat{Cov}(\hat{f}(\sigma_{i_1}, \psi_{j_1}), \hat{f}(\sigma_{i_2}, \psi_{j_2}))\},

and the combined estimator is

\hat{f}_c = \hat{f}(\{\hat{a}_{ij}\}), where \{\hat{a}_{ij}\} = \arg\min \widehat{MSE}(\hat{f}(\{a_{ij}\})).   (6)

4.2 Construction of the combined estimator

4.2.1 Estimation of variances and biases

Consistent estimators for biases and covariances can be obtained by various procedures, e.g. by bootstrap. Note that for i = 1 and j = 1, ..., l, all \hat{f}(\sigma_{n1}, \psi_j) are "undersmoothed" and thus asymptotically unbiased: E\hat{f}(\sigma_{n1}, \psi_j) = f; we can write bias(\hat{f}(\sigma_{ni}, \psi_j)) = E(\hat{f}(\sigma_{ni}, \psi_j)) - E(\hat{f}(\sigma_{n1}, \psi_j)). Then by bootstrap

\widehat{bias}(\hat{f}(\sigma_{ni}, \psi_j)) = B^{-1} \sum_{s=1}^{B} \hat{f}_s(\sigma_{ni}, \psi_j) - l^{-1} B^{-1} \sum_{j=1}^{l} \sum_{s=1}^{B} \hat{f}_s(\sigma_{n1}, \psi_j),

where B is the number of bootstrap samples. Analogously to Hall's (1992) bootstrap estimator of variance we construct an estimator of covariance,

\widehat{Cov}(\hat{f}(\sigma_{i_1}, \psi_{j_1}), \hat{f}(\sigma_{i_2}, \psi_{j_2})) = B^{-1} \sum_{s=1}^{B} \left(\hat{f}_s(\sigma_{i_1}, \psi_{j_1}) - B^{-1} \sum \hat{f}_s(\sigma_{i_1}, \psi_{j_1})\right) \left(\hat{f}_s(\sigma_{i_2}, \psi_{j_2}) - B^{-1} \sum \hat{f}_s(\sigma_{i_2}, \psi_{j_2})\right).

In our Monte Carlo experiment we used less computationally intensive estimators. We estimate variances using the asymptotic formula \hat{f}(x) \int \psi(w)^2 dw / (n\sigma) and then approximate covariances by

\widehat{Cov}(\hat{f}(\psi_i, \sigma_j), \hat{f}(\psi_s, \sigma_t)) = \frac{\sqrt{d_{jt} \widehat{Var}(\psi_i, \sigma_j) \widehat{Var}(\psi_s, \sigma_t)}}{\sqrt{\delta_i \delta_s}} \int \psi_i(w) \psi_s(d_{jt} w) dw,

where d_{jt} = \sigma_j/\sigma_t and \delta_i = \tau_{\psi_i \psi_i}. This result follows from Theorem 2.

The estimators with the smallest bandwidth (undersmoothing) are asymptotically unbiased. To find individual biases, we can subtract the average of the estimators with the smallest bandwidth from the actual estimators:

\widehat{bias}(\hat{f}(\psi, \sigma_j)) = \hat{f}(\psi, \sigma_j) - \hat{f}(\cdot, \sigma_1).

4.2.2 Selection of functions and bandwidths

In our Monte Carlo experiment we use the Gaussian kernel \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, which corresponds to h = 2. We could utilize even lower order kernels, since if the density is not differentiable there is nothing to be gained from using kernels of order 2; on the other hand, if the density satisfies a Lipschitz condition, a second order kernel would provide an advantage (this follows from the fact that then A(\psi, \sigma, x) = o(\sigma)).
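Given estimated biases and a covariance matrix, the minimization in (6) is a quadratic program with the single constraint \sum a_{ij} = 1, so (ignoring any sign restrictions on the weights) the minimizer has the closed form a = M^{-1}\mathbf{1}/(\mathbf{1}' M^{-1}\mathbf{1}), where M is the estimated-MSE matrix with entries bias_k bias_l + Cov_{kl}. The following sketch illustrates this step with hypothetical inputs; the paper does not spell out a particular solver:

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (fine for small systems)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def combined_weights(biases, cov):
    # Minimize a' M a subject to sum(a) = 1, where M_kl = bias_k * bias_l + Cov_kl
    # is the estimated MSE matrix of the stacked estimators; the Lagrangian
    # first-order condition gives a = M^{-1} 1 / (1' M^{-1} 1).
    k = len(biases)
    M = [[biases[p] * biases[q] + cov[p][q] for q in range(k)] for p in range(k)]
    u = solve(M, [1.0] * k)
    s = sum(u)
    return [x / s for x in u]

# Hypothetical estimated biases and covariances for three (sigma, psi) pairs
b = [0.0, 0.01, 0.05]
C = [[4e-4, 1e-4, 0.0],
     [1e-4, 2e-4, 5e-5],
     [0.0, 5e-5, 1e-4]]
w = combined_weights(b, C)
```

Since each individual estimator corresponds to a unit-weight vector satisfying the constraint, the minimized estimated MSE is never above that of any single (\sigma, \psi) pair, as noted in Section 4.1.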
The largest bandwidth in the set is a standard "optimal" bandwidth. Following Pagan and Ullah (1999) we determine it as 0.79Rn^{-1/5}, where R denotes the interquartile range. If this bandwidth belongs to a truly optimal function/bandwidth combination, then as the sample size increases it should yield the fastest convergence rate. Otherwise, it will correspond to oversmoothing. The lowest bandwidth represents undersmoothing. It is chosen on the basis of the empirical distribution of |x_i - x|. We find the quantile level \alpha at which the quantile of this distribution equals the largest bandwidth, and obtain the other bandwidths as the (i\alpha/m)-th quantiles of |x_i - x|, i = 1, ..., m.

4.2.3 Estimation procedure for the combined estimator

The entire procedure for the combined estimator includes the following steps: (i) find the density estimator using a standard bandwidth; (ii) construct the empirical distribution of |x_i - x| and determine m bandwidths; (iii) find the density estimators for all smoothing functions and bandwidths; (iv) estimate the biases and the covariance matrix; and (v) find the optimal weights for the linear combination and compute (6).

4.3 Performance of the combined estimator

If an optimal function/bandwidth pair is included among the (\sigma_{ni}, \psi_j) and all the biases and variances are consistently estimated, the true MSE of the combined estimator at its worst will approach the MSE of the optimal estimator; it will eventually be smaller when the "optimal" procedure actually selects an inappropriately large \sigma. Our Monte Carlo study provides finite sample confirmation of these relations.

4.3.1 DGP and estimation of the density and combined estimator

We consider three different density functions. For the first model we use the standard normal distribution: f_1(x) = \phi(x). Its density is infinitely differentiable and very smooth; thus, the density estimator evaluated at the "optimal" bandwidth should be a good choice.
The properties of kernel density estimators for this case are well established, and it is important to see how the combined estimator will perform. In the second model we consider the normal mean mixture f_2(x) = 0.5\phi(x + 1.5) + 0.5\phi(x - 1.5) (see Sheather and Jones, 1991). This density is also infinitely differentiable; however, it is bimodal and more wiggly than the standard normal density. The third model contains a non-smooth density that satisfies the Lipschitz condition everywhere except x = -2. Thus, we expect the combined estimator to perform well outside of a small neighbourhood of -2, where the density does not satisfy Assumption 1:

f_3(x) =
  5.25 - 5x  if x \in [0.95, 1.05],
  0.5  if x \in [0, 0.95),
  0.5 + 5x  if x \in [-0.1, 0),
  -\frac{1}{38} - \frac{10}{38}x  if x \in [-2, -0.1),
  0  otherwise.

The sample size used in the experiments is N = 2000; 5000 replications per model have been performed. We work with five bandwidths and one smoothing kernel, the standard normal density. The combined estimator is constructed as described in Section 4.2.

4.3.2 Summary of the results

The results are presented in Figs. 1-3. For each model, we plot the MSE of a simple estimator with the standard bandwidth and the MSE of the combined estimator over the range of values of x with increments \Delta x = 0.05. In Fig. 1 (the standard normal density) the simple estimator performs uniformly better than the combined one. On average, the MSE of the combined estimator is 4 times larger than the MSE of the standard estimator. Both MSEs, however, are small in absolute terms.

In Fig. 2 (the mixed normal density) we observe that the standard estimator has a very definite oscillating pattern of the MSE. The peaks correspond to the points of local extrema of the density function. The smallest values of the MSE are achieved on the segments of the density function that can be well approximated by a straight line.
The combined estimator, on the contrary, has a relatively stable MSE, which is positioned between the two extremes of the standard estimator's MSE.

In Fig. 3 (the non-smooth density) the combined estimator clearly dominates the standard estimator in precision. For both estimators, the points where the density is not smooth (x = -2, -0.1, 0, 0.95, 1.05) cause substantial increases in the MSE, but non-smoothness affects the standard estimator over larger intervals. At x = -2 the density is discontinuous; therefore the combined estimator is not expected to perform well there either.

The Monte Carlo experiments demonstrate that in the absence of information about the smoothness of the density the combined estimator provides more reliable results than the standard "optimal" estimator. When using the combined estimator, we may lose some efficiency in cases of smooth densities. Since such densities are usually estimated very precisely, the difference in the MSEs of the two estimators is not large in absolute terms. At the same time, when the density is not smooth, the standard estimator can be seriously biased, and the improvement offered by the combined estimator is very significant.

Appendix A. Proofs of the Theorems

Proof of Theorem 1. From definition (5) we have that

(n\sigma)^{1/2} [\hat{f}(x) - f(x)] - (n\sigma)^{1/2} [\hat{f}(x) - E\hat{f}(x)] = (n\sigma)^{1/2} A(\psi, \sigma_n, x).

Under condition (a) of the Theorem the right-hand side is o(1); thus using (4) proves (a). From (b),

(n\sigma)^{1/2} [\hat{f}(x) - f(x)] - (n\sigma)^{1/2} [\hat{f}(x) - E\hat{f}(x)] - A(\psi) = o_p(1),

and (b) follows from (4). For (c) we get from (4) that

|(n\sigma)^{1/2} A(\psi, \sigma_n, x)|^{-1} \{(n\sigma)^{1/2} [\hat{f}(x) - f(x)] - (n\sigma)^{1/2} A(\psi, \sigma_n, x)\} = o_p(1),

since the term in braces equals (n\sigma)^{1/2}[\hat{f}(x) - E\hat{f}(x)] = O_p(1) by (4), while (n\sigma)^{1/2}|A(\psi, \sigma_n, x)| \to \infty; this is exactly |A(\psi, \sigma_n, x)|^{-1}[\hat{f}(x) - f(x) - A(\psi, \sigma_n, x)] = o_p(1), which proves (c). ∎

Proof of Theorem 2. To prove Theorem 2, all that is required in addition to the results in Theorem 1 is to consider the covariances between the \eta(\sigma, \psi).
For (a), recall that E\eta(\sigma_i, \psi_j) \to 0, and

E(\eta(\sigma_{i_1}, \psi_{j_1}) \eta(\sigma_{i_2}, \psi_{j_2}))
= n(\sigma_{i_1}\sigma_{i_2})^{1/2} E\left[\left(\frac{1}{n\sigma_{i_1}} \sum \psi_{j_1}\left(\frac{x_i - x}{\sigma_{i_1}}\right) - f(x)\right)\left(\frac{1}{n\sigma_{i_2}} \sum \psi_{j_2}\left(\frac{x_i - x}{\sigma_{i_2}}\right) - f(x)\right)\right]

= n(\sigma_{i_1}\sigma_{i_2})^{1/2} \frac{1}{n^2\sigma_{i_1}\sigma_{i_2}} \sum E\psi_{j_1}\left(\frac{x_i - x}{\sigma_{i_1}}\right)\psi_{j_2}\left(\frac{x_i - x}{\sigma_{i_2}}\right)
+ n(\sigma_{i_1}\sigma_{i_2})^{1/2} \left[\frac{1}{n^2\sigma_{i_1}\sigma_{i_2}} \sum_{l \ne m} E\psi_{j_1}\left(\frac{x_l - x}{\sigma_{i_1}}\right) E\psi_{j_2}\left(\frac{x_m - x}{\sigma_{i_2}}\right) - f(x)\frac{1}{n\sigma_{i_1}} \sum E\psi_{j_1}\left(\frac{x_l - x}{\sigma_{i_1}}\right) - f(x)\frac{1}{n\sigma_{i_2}} \sum E\psi_{j_2}\left(\frac{x_l - x}{\sigma_{i_2}}\right) + f(x)^2\right]

= (\sigma_{i_1}\sigma_{i_2})^{-1/2} \int \psi_{j_1}\left(\frac{w - x}{\sigma_{i_1}}\right) \psi_{j_2}\left(\frac{w - x}{\sigma_{i_2}}\right) f(w) dw + o(1),

where the last equality follows from the fact that \frac{1}{\sigma_i} E\psi_j\left(\frac{x_l - x}{\sigma_i}\right) - f(x) = o(1). Finally, substitute \sigma_{i_2}/\sigma_{i_1} = d and v = \frac{w - x}{\sigma_{i_2}} in the last line; then

E(\eta(\sigma_{i_1}, \psi_{j_1}) \eta(\sigma_{i_2}, \psi_{j_2})) = d^{1/2} \int \psi_{j_1}(dv) \psi_{j_2}(v) dv \cdot f(x) + o(1).

Part (a) follows. Part (b) is obtained similarly by noting that condition (b) implies that 0 < \sigma_{i_2}/\sigma_{i_1} = d < \infty when m' \le i_1, i_2 < m''. Part (c) follows from (c) of Theorem 1. For (d) the covariances are zero because the rate \sigma_{i_1}/\sigma_{i_2} \to 0. ∎

5 References

[1] Kotlyarova, Y. and V. Zinde-Walsh (2003) Improving the Efficiency and Robustness of the Smoothed Maximum Score Estimator, McGill University and CIREQ working paper.

[2] Pagan, A. and A. Ullah (1999) Nonparametric Econometrics, Cambridge University Press.

[3] Sheather, S. and M. Jones (1991) A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation, Journal of the Royal Statistical Society, Series B, 53, 683-690.

[4] Zinde-Walsh, V. (2002) Asymptotic Theory for some High Breakdown Point Estimators, Econometric Theory, 18, 1172-1196.