Asymptotic normality of plug-in level set estimates (expanded version)

David M. Mason* and Wolfgang Polonik†

University of Delaware and University of California, Davis

October 20, 2008

Abstract

We establish the asymptotic normality of the $G$-measure of the symmetric difference between the level set and a plug-in-type estimator of it formed by replacing the density in the definition of the level set by a kernel density estimator. Our proof will highlight the efficacy of Poissonization methods in the treatment of large sample theory problems of this kind.

Short title: Level set estimation

AMS 2000 Subject Classifications: 60F05, 60F15, 62E20, 62G07.

Keywords: Central limit theorem, kernel density estimator, level set estimation

* Research partially supported by NSF Grant DMS-0503908
† Research partially supported by NSF Grant DMS-0706971

1 Introduction

Let $f$ be a Lebesgue density on $\mathbb{R}^d$, $d \geq 1$. Define the level set of $f$ at level $c \geq 0$ as

$$C(c) = \{x : f(x) \geq c\}.$$

In this paper we are concerned with the estimation of $C(c)$ for a given level $c$. Such level sets play a crucial role in various scientific fields, and their estimation has received significant recent interest in statistics and machine learning/pattern recognition (see below for more details). Theoretical research on this topic is mainly concerned with rates of convergence of level set estimators. While such results are interesting, their practical usefulness is limited: the available results permit neither statistical inference nor quantitative statements about the contour sets themselves. The contribution of this paper constitutes a significant step forward in this direction, since we establish the asymptotic normality of a class of level set estimators $C_n(c)$ formed by replacing $f$ by a kernel density estimator $f_n$ in the definition of $C(c)$, in a sense that we shall soon make precise.

Here is our setup. Let $X_1, X_2, \ldots$ be i.i.d. with distribution function $F$ and density $f$, and consider the kernel density estimator of $f$ based on $X_1, \ldots, X_n$, $n \geq 1$,

$$f_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h_n^{1/d}}\Big), \quad x \in \mathbb{R}^d,$$

where $K$ is a kernel and $h_n > 0$ is a smoothing parameter. Consider the plug-in estimator

$$C_n(c) = \{x : f_n(x) \geq c\}.$$

Let $G$ be a positive measure dominated by Lebesgue measure $\lambda$. Our interest is to establish the asymptotic normality of

$$d_G(C_n(c), C(c)) := G(C_n(c)\, \Delta\, C(c)) = \int_{\mathbb{R}^d} |I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, dG(x), \qquad (1.1)$$

where $A \Delta B = (A \setminus B) \cup (B \setminus A)$ denotes the set-theoretic symmetric difference of two sets. Of particular interest is $G$ being the Lebesgue measure $\lambda$, as well as $G = H$ with $H$ denoting the measure having Lebesgue density $|f(x) - c|$. The latter corresponds to the so-called excess risk, which is used frequently in the classification literature, i.e.

$$d_H(C_n(c), C(c)) = \int_{C_n(c)\, \Delta\, C(c)} |f(x) - c|\, dx. \qquad (1.2)$$

It is well known that under mild conditions $d_\lambda(C_n(c), C(c)) \to 0$ in probability as $n \to \infty$, and rates of convergence have also been derived (cf. Baillo et al. (2000, 2001), Cuevas et al. (2000), Baillo (2003), Baillo and Cuevas (2004)). Even more is known. Cadre (2006) derived assumptions under which for some $\mu_G > 0$ we have

$$\sqrt{n h_n}\; d_G(C_n(c), C(c)) \to \mu_G \quad \text{in probability as } n \to \infty. \qquad (1.3)$$

However, asymptotic normality of $d_G(C_n(c), C(c))$ has not yet been considered.
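To make the objects above concrete, the following is a minimal numerical sketch (ours, not part of the paper) for $d = 1$: it builds $f_n$ from a rescaled Epanechnikov kernel supported on $[-1/2, 1/2]$ (consistent with assumption (K.i) in Section 2.4 below) and approximates (1.1) with $G = \lambda$ by a Riemann sum on a grid. All names (`epan`, `kde`) and the bandwidth choice are illustrative assumptions, not prescriptions from the text.

```python
# Sketch: kernel density estimator f_n, plug-in level set C_n(c), and a grid
# approximation of d_lambda(C_n(c), C(c)) = lambda(C_n(c) Delta C(c)) from (1.1).
import numpy as np

def epan(t):
    # Epanechnikov kernel rescaled to have support [-1/2, 1/2], cf. (K.i)
    return np.where(np.abs(t) <= 0.5, 1.5 * (1.0 - (2.0 * t) ** 2), 0.0)

def kde(x, data, h):
    # f_n(x) = (n h)^{-1} sum_i K((x - X_i)/h); for d = 1, h^{1/d} = h
    return epan((x[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
n, c = 2000, 0.15
X = rng.standard_normal(n)                       # true f: standard normal density
h = n ** (-1 / 3)                                # an admissible bandwidth choice
grid = np.linspace(-4.0, 4.0, 2001)
dx = grid[1] - grid[0]
f = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)  # true density on the grid
fn = kde(grid, X, h)
# Riemann-sum approximation of the integral in (1.1) with G = lambda:
d_lam = np.abs((fn >= c).astype(float) - (f >= c).astype(float)).sum() * dx
print(f"approximate lambda(C_n(c) Delta C(c)) = {d_lam:.4f}")
```

The quantity printed here is exactly the random variable whose centered and rescaled version the result below shows to be asymptotically normal.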
Our main result says that under suitable regularity conditions there exist a normalizing sequence $\{a_{n,G}\}$ and a constant $0 < \sigma_G^2 < \infty$ such that

$$a_{n,G}\{d_G(C_n(c), C(c)) - E d_G(C_n(c), C(c))\} \to_d \sigma_G Z, \quad \text{as } n \to \infty, \qquad (1.4)$$

where $Z$ denotes a standard normal random variable. In the important special cases of $G = \lambda$ the Lebesgue measure, and $G = H$, we shall see that under suitable regularity conditions

$$a_{n,\lambda} = \Big(\frac{n}{h_n}\Big)^{1/4}, \quad \text{and} \qquad (1.5)$$

$$a_{n,H} = n^{3/4}\, h_n^{1/4}, \qquad (1.6)$$

respectively.

In the next subsection we shall discuss further related work and relevant literature. In Section 2 we formulate our main result, provide some heuristics for its validity, discuss a possible statistical application and then present the proof of our result. We end Section 2 with an example and some proposals to estimate the limiting variance $\sigma_G^2$. A number of the technical details are relegated to the Appendix.

1.1 Related work and literature

Before we present our results in detail, we shall extend our overview of the literature on level set estimation to include regression level set estimation (with classification as a special case) as well as density level set estimation.

Observe that there exists a close connection between level set estimation and binary classification. The optimal (Bayes) classifier corresponds to a level set $C_\psi(0) = \{x : \psi(x) \geq 0\}$ of $\psi = p f - (1 - p) g$, where $f$ and $g$ denote the Lebesgue densities of two underlying class distributions $F$ and $G$ and $p \in [0, 1]$ defines the prior probability for $f$. If an observation $X$ falls into $\{x : \psi(x) \geq 0\}$, then it is classified by the optimal classifier as coming from $F$, otherwise as coming from distribution $G$. Hall and Kang (2005) derive large sample results for this optimal classifier that are very closely related to Cadre's result (1.3). In fact, if $\mathrm{Err}(C)$ denotes the probability of a misclassification of a binary classifier given by a set $C$, then Hall and Kang derive rates of convergence results for the quantity $\mathrm{Err}(\widehat C(0)) - \mathrm{Err}(C_\psi(0))$, where $\widehat C$ is the plug-in classifier given by $\widehat C(0) = \{x : p f_n(x) - (1 - p) g_n(x) \geq 0\}$, with $f_n$ and $g_n$ denoting the kernel estimators for $f$ and $g$, respectively. It turns out that

$$\mathrm{Err}(\widehat C(0)) - \mathrm{Err}(C_\psi(0)) = \int_{\widehat C(0)\, \Delta\, C_\psi(0)} |\psi(x)|\, dx.$$

The latter quantity is of exactly the form (1.2). The only difference is that the function $\psi$ is not a probability density, but a (weighted) difference of two probability densities. Similarly, the plug-in estimate is a weighted difference of kernel estimates. Though the results presented here do not directly apply to this situation, the methodology used to prove them can be adapted to it in a more or less straightforward manner.

Hartigan (1975) introduced a notion of clustering via maximally connected components of density level sets. For more on this approach to clustering see Stuetzle (2003), and for an interesting application of this clustering approach to astronomical sky surveys refer to Jang (2006). Klemelä (2004, 2006a,b) applies a similar point of view to develop methods for visualizing multivariate density estimates. Goldenshluger and Zeevi (2004) use level set estimation in the context of the Hough transform, which is a well-known computer vision algorithm. Certain problems in flow cytometry involve the statistical problem of estimating a level set for a difference of two probability densities (Roederer and Hardy, 2001; see also Wand, 2005).
Further relevant applications include the detection of minefields based on aerial observations, the analysis of seismic data, as well as certain issues in image segmentation; see Huo and Lu (2004) and references therein. Another application of level set estimation is anomaly detection or novelty detection. For instance, Theiler and Cai (2003) describe how level set estimation and anomaly detection go together in the context of multispectral image analysis, where anomalous locations (pixels) correspond to unusual spectral signatures in these images. Further areas of anomaly detection include intrusion detection (e.g. Fan et al. 2001, Yeung and Chow 2002), anomalous jet engine vibrations (e.g. Nairac et al. 1997, Desforges et al. 1998, King et al. 2002), medical imaging (e.g. Gerig et al. 2001, Prastawa et al. 2003) and EEG-based seizure analysis (Gardner et al. 2006). For a recent review of this area see Markou and Singh (2003).

The above list of applications of level set estimation clearly motivates the need to understand the statistical properties of level set estimators. For this reason there has been a lot of recent investigation into this area. Relevant published work (not yet mentioned above) includes Hartigan (1987), Polonik (1995), Cavalier (1997), Tsybakov (1997), Walther (1997), Baillo et al. (2000, 2001), Cuevas et al. (2000), Baillo (2003), Baillo and Cuevas (2004), Tsybakov (2004), Steinwart et al. (2004, 2005), Gayraud and Rousseau (2005), Willett and Nowak (2005, 2006), Cuevas et al. (2006), Scott and Davenport (2006), Scott and Nowak (2006), Vert and Vert (2006), and Rigollet and Vert (2008).

Finally we mention a problem closely related to that of level set estimation: the estimation of the support of a density, when the support is assumed to be bounded. It turns out that the methods of estimation and the techniques used to study the asymptotic properties of the estimator are very similar to those of level set estimation. Refer especially to Biau et al. (2008) and the references therein.

2 Main result

The rates of convergence in our main result depend on a regularity parameter $1/\gamma_g$ that describes the behavior of the slope of $g$ at the boundary set $\beta(c) = \{x \in \mathbb{R}^d : f(x) = c\}$ (see assumption (G) below). In the important special case of $G = \lambda$ the slope of $g$ is zero, and this implies $1/\gamma_g = 0$ (or $\gamma_g = \infty$). For $G = H$ our assumptions imply that the slope of $g$ close to the boundary is bounded away from zero and infinity, which says that $1/\gamma_g = 1$.

Here is our main result. The indicated assumptions are quite technical to state, and therefore for the sake of convenience they are formulated in Subsection 2.4 below. In particular, the integer $k \geq 1$ that appears in the statement of our theorem is defined in (B.ii).

Theorem. Under assumptions (D.i)-(D.ii), (K.i)-(K.ii), (G), (H) and (B.i)-(B.ii), we have as $n \to \infty$ that

$$a_{n,G}\{d_G(C_n(c), C(c)) - E d_G(C_n(c), C(c))\} \to_d \sigma_G Z, \qquad (2.1)$$

where $Z$ denotes a standard normal random variable, and

$$a_{n,G} = \Big(\frac{n}{h_n}\Big)^{1/4} \big(\sqrt{n h_n}\big)^{1/\gamma_g}. \qquad (2.2)$$

The constant $0 < \sigma_G^2 < \infty$ is defined as in (2.57) in the case $d \geq 2$ and $k = 1$; as in (2.61) in the case $d \geq 2$ and $k \geq 2$; and as in (2.62) in the case $d = 1$ and $k \geq 2$. (The case $d = 1$ and $k = 1$ cannot occur under our assumptions.)

Remark 1. Write $\delta_n(c) = a_{n,G}\{d_G(C_n(c), C(c)) - E d_G(C_n(c), C(c))\}$. A slight extension of the proof of our theorem shows that if $c_1, \ldots, c_m$, $m \geq 1$, are distinct positive numbers, each of which satisfies the assumptions of the theorem, then

$$(\delta_n(c_1), \ldots, \delta_n(c_m)) \to_d (\sigma_1 Z_1, \ldots, \sigma_m Z_m),$$
where $Z_1, \ldots, Z_m$ are independent standard normal random variables and $\sigma_1, \ldots, \sigma_m$ are as defined in the proof of the theorem.

Remark 2. In Subsection 2.7 we provide an example in which the variance $\sigma_G^2$ does have a closed form convenient for calculation. Since such a closed form cannot be given in general, Subsection 2.7 also discusses some methods to estimate $\sigma_G^2$ from the data.

2.1 Heuristics

Before we continue with our exposition, we shall provide some heuristics to indicate why $a_n = (n/h_n)^{1/4}$ is the correct normalizing factor in (1.5), i.e. we consider the case $G = \lambda$, or $\gamma_g = \infty$. This should help the reader to understand why our theorem is true. It is well known that under certain regularity conditions we have

$$\sqrt{n h_n}\,\{f_n(x) - f(x)\} = O_P(1), \quad \text{as } n \to \infty.$$

Therefore the boundary of the set $C_n(c)$ can be expected to fluctuate in a band $B$ with a width (roughly) of the order $O_P\big(\tfrac{1}{\sqrt{n h_n}}\big)$ around the boundary set $\beta(c) = \{x : f(x) = c\}$. For notational simplicity we shall write $\beta = \beta(c)$. Partitioning $B$ into $N = O\big(\tfrac{1}{\sqrt{n h_n}\; h_n}\big) = O\big(\tfrac{1}{\sqrt{n h_n^3}}\big)$ regions $R_k$, $k = 1, \ldots, N$, of Lebesgue measure $\lambda(R_k) = h_n$, we can approximate $d_\lambda(C_n(c), C(c))$ as

$$d_\lambda(C_n(c), C(c)) \approx \sum_{k=1}^{N} \int_{R_k} |I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, dx =: \sum_{k=1}^{N} Y_{n,k}.$$

Here we use the fact that the band $B$ has width $\tfrac{1}{\sqrt{n h_n}}$. Writing

$$Y_{n,k} = \int_{R_k} \Lambda_n(x)\, dx \quad \text{with} \quad \Lambda_n(x) = |I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|, \qquad (2.3)$$

we see that

$$\mathrm{Var}(Y_{n,k}) = \int_{R_k}\int_{R_k} \mathrm{cov}(\Lambda_n(x), \Lambda_n(y))\, dx\, dy = O\big((\lambda(R_k))^2\big) = O(h_n^2),$$

where the $O$-terms turn out to be exact. Further, due to the nature of the kernel density estimator, the variables $Y_{n,k}$ can be assumed to behave asymptotically like independent variables, since we can choose the regions $R_k$ to be disjoint. Hence, the variance of $d_\lambda(C_n(c), C(c))$ can be expected to be of the order $N h_n^2 = \big(\tfrac{h_n}{n}\big)^{1/2}$, which motivates the normalizing factor $a_n = (n/h_n)^{1/4}$.

2.2 A connection to Lp-rates of convergence of kernel density estimates

The following discussion of $L_p$-rates, $p \geq 1$, of convergence of kernel density estimates implicitly provides another heuristic for our result. Consider the case $G = H_{p-1}$, where $H_{p-1}$ denotes the measure with Radon-Nikodym derivative $h_{p-1}(x) = |f(x) - c|^{p-1}$ with $p \geq 1$. Note that $H_2 = H$ with $H$ from above. Then we have the identity

$$\int_0^\infty H_{p-1}(C_n(c)\, \Delta\, C(c))\, dc = \frac{1}{p} \int_{\mathbb{R}^d} |f_n(x) - f(x)|^p\, dx, \quad p \geq 1. \qquad (2.4)$$

The proof is straightforward (see Detail 1 in the Appendix). The case $p = 1$ gives the geometrically intuitive relation

$$\int_0^\infty \lambda(C_n(c)\, \Delta\, C(c))\, dc = \int_0^\infty \int_{C_n(c)\Delta C(c)} dx\, dc = \int_{\mathbb{R}^d} |f_n(x) - f(x)|\, dx.$$

Assuming $f$ to be bounded, we split up the vertical axis into successive intervals $\Delta(k)$, $k = 1, \ldots, N$, of length $\approx \tfrac{1}{\sqrt{n h_n}}$ with midpoints $c_k$. Approximate the integral (2.4) by

$$\frac{1}{p}\int_{\mathbb{R}^d} |f_n(x) - f(x)|^p\, dx = \int_0^\infty H_{p-1}(C_n(c)\,\Delta\, C(c))\, dc \approx \sum_{k=1}^{N} \int_{\Delta(k)} H_{p-1}(C_n(c)\,\Delta\, C(c))\, dc \approx \frac{1}{\sqrt{n h_n}} \sum_{k=1}^{N} H_{p-1}(C_n(c_k)\,\Delta\, C(c_k)).$$

Utilizing the $1/\sqrt{n h_n}$-rate of $f_n(x)$, we see that the last sum consists of (roughly) independent random variables. Assuming further that the variance of each (or of most) of these random variables is of the same order $a_{n,p}^{-2} = \big(\tfrac{n}{h_n}\big)^{-1/2} (n h_n)^{-(p-1)}$ (to obtain this, apply our theorem with $\gamma_g = 1/(p-1)$), and noting that the number of intervals is of the order $\sqrt{n h_n}$, we obtain that the variance of the sum is of the order

$$\frac{a_{n,p}^{-2}}{\sqrt{n h_n}} = \Big(\frac{1}{n h_n^{1 - 1/p}}\Big)^{p}.$$

In other words, the normalizing factor of the $L_p$-norm of the kernel density estimator in $\mathbb{R}^d$ can be expected to be $\big(n h_n^{1 - 1/p}\big)^{p/2} = (n h_n)^{p/2}\, h_n^{-1/2}$. In the case $p = 2$ this gives the normalizing factor $(n h_n)\, h_n^{-1/2} = n h_n^{1/2}$, and this coincides with the results of Rosenblatt (1975). In the special case $d = 2$ these rates can also be found in Horvath (1991).
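The scaling suggested by the two heuristics above can be checked numerically. The following Monte Carlo sketch (ours; the kernel, level, bandwidth rule and replication counts are arbitrary choices) estimates the standard deviation of $d_\lambda(C_n(c), C(c))$ for several $n$ and multiplies it by $(n/h_n)^{1/4}$; if the heuristic is right, the products should be roughly constant in $n$.

```python
# Sketch: check that sd(d_lambda) shrinks like (h_n/n)^{1/4}, i.e. that
# (n/h_n)^{1/4} * sd(d_lambda) stays roughly constant as n grows.
import numpy as np

def epan(t):
    return np.where(np.abs(t) <= 0.5, 1.5 * (1.0 - (2.0 * t) ** 2), 0.0)

def d_lambda(sample, c, grid, f_grid):
    n, h = len(sample), len(sample) ** (-1 / 3)
    fn = epan((grid[:, None] - sample[None, :]) / h).sum(axis=1) / (n * h)
    return np.abs((fn >= c) ^ (f_grid >= c)).sum() * (grid[1] - grid[0])

rng = np.random.default_rng(1)
c = 0.15
grid = np.linspace(-5.0, 5.0, 1001)
f_grid = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
for n in (500, 2000, 8000):
    h = n ** (-1 / 3)
    d = np.array([d_lambda(rng.standard_normal(n), c, grid, f_grid)
                  for _ in range(200)])
    print(n, (n / h) ** 0.25 * d.std(ddof=1))   # roughly constant across n
```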
2.3 Possible application to online testing

Suppose that when a certain industrial process is working properly it produces items, which may be considered as i.i.d. $\mathbb{R}^d$-valued random variables $X_1, X_2, \ldots$ with a known density function $f$. On the basis of a sample of size $n$ taken from time to time from the production, we can measure the deviation of the sample $X_1, \ldots, X_n$ from the desired distribution by looking at the discrepancy between $\lambda(C_n(c)\,\Delta\, C(c))$ and its expected value $E\lambda(C_n(c)\,\Delta\, C(c))$. The value $c$ may be chosen so that

$$P\{X \in C(c)\} = \int_{\{f(x) \geq c\}} f(x)\, dx = \alpha,$$

some typical values being $\alpha = .90$, $.95$ and $.99$. We may decide to shut down the process and look for production errors if

$$\sigma_\lambda^{-1}\Big(\frac{n}{h_n}\Big)^{1/4} \big|\lambda(C_n(c)\,\Delta\, C(c)) - E\lambda(C_n(c)\,\Delta\, C(c))\big| > 1.96. \qquad (2.5)$$

Otherwise, as long as the estimated level set $C_n(c)$ does not deviate too much from the target level set $C(c)$, in which fraction $\alpha$ of the data should lie if the process is functioning properly, we do not disrupt production. Our central limit theorem tells us that for large enough sample sizes $n$ the probability of the event in (2.5) would be around $.05$, should, in fact, the process be working as it should. Thus, using this decision rule, we make a type I error with probability roughly $.05$ if we decide to stop production when it is actually working fine. Sometimes one might want to replace the $C_n(c)$ in the first $\lambda(C_n(c)\,\Delta\, C(c))$ in (2.5) by $C_n(c_n)$, where

$$\int_{\{f_n(x) \geq c_n\}} f_n(x)\, dx = \alpha.$$

A mechanical engineering application where this approach seems to be of some value is described in Desforges et al. (1998). This application considers gearbox fault data, the collection of which is described in that paper. In fact, two classes of data were collected, corresponding to two states: a gear in good condition, and a gear in bad condition, respectively. Desforges et al. indicate a data analysis approach based on kernel density estimation to recognize the faulty condition. The idea is to calculate a kernel density estimator $g_m$ based on the data $X_1, \ldots, X_m$ from the gear in good condition; this estimator is then evaluated at the data $Y_1, \ldots, Y_n$ that are sampled under a bad gear condition. Desforges et al. then examine the level sets of $g_m$ in which the faulty data lie. One of their ideas is to use $\frac{1}{n}\sum_{i=1}^{n} g_m(Y_i)$ to detect the faulty condition. Their methodology is ad hoc in nature, and no statistical inference procedure is proposed.

Our test procedure could be applied as follows by using $f = g_m$ (i.e. we are conditioning on $X_1, \ldots, X_m$). Set $C(c) := \{x : g_m(x) \geq c\}$ for an appropriate value of $c$, and find the corresponding set $C_n(c)$ based on $Y_1, \ldots, Y_n$. Then check whether (2.5) holds. If yes, then we can conclude at significance level $.05$ that the $Y_1, \ldots, Y_n$ stem from a different distribution than the $X_1, \ldots, X_m$. Observe that in this set-up we calculate $E\lambda(C_n(c)\,\Delta\, C(c))$ as well as $\sigma_\lambda^2$ by using $g_m$ as the underlying density. (In practice we may have to estimate these two quantities. See Subsection 2.7.) How this approach would work in practice is the subject of a separate paper.
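As an illustration of how the decision rule (2.5) might be deployed, here is a simulation sketch (ours). Since $E\lambda(C_n(c)\,\Delta\, C(c))$ and $\sigma_\lambda$ are rarely available in closed form, the sketch calibrates the centering and the scale $a_n^{-1}\sigma_\lambda$ of the statistic by Monte Carlo under the known in-control density $f$; all numerical choices are assumptions for the example only.

```python
# Sketch of the monitoring rule (2.5): flag a sample when the standardized
# symmetric-difference statistic exceeds the normal critical value 1.96.
import numpy as np

def epan(t):
    return np.where(np.abs(t) <= 0.5, 1.5 * (1.0 - (2.0 * t) ** 2), 0.0)

def d_lambda(sample, c, grid, f_grid):
    n, h = len(sample), len(sample) ** (-1 / 3)
    fn = epan((grid[:, None] - sample[None, :]) / h).sum(axis=1) / (n * h)
    return np.abs((fn >= c) ^ (f_grid >= c)).sum() * (grid[1] - grid[0])

rng = np.random.default_rng(2)
n, c = 2000, 0.15
grid = np.linspace(-5.0, 5.0, 1001)
f_grid = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)    # in-control density f
# Monte Carlo calibration under f of the mean and of the scale of the statistic:
cal = np.array([d_lambda(rng.standard_normal(n), c, grid, f_grid)
                for _ in range(200)])
center, scale = cal.mean(), cal.std(ddof=1)

def out_of_control(sample):
    return abs(d_lambda(sample, c, grid, f_grid) - center) / scale > 1.96

print(out_of_control(rng.standard_normal(n)))         # in control: usually False
print(out_of_control(0.5 + rng.standard_normal(n)))   # mean shift: usually True
```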
2.4 Assumptions and notation

Assumptions on the density $f$.

(D.i) $f$ is in $C^2(\mathbb{R}^d)$ and its partial derivatives of order 1 and 2 are bounded;

(D.ii) $\inf_{x \in \mathbb{R}^d} f(x) < c < \sup_{x \in \mathbb{R}^d} f(x)$.

Notice that (D.i) implies the existence of positive constants $M$ and $A$ such that

$$\sup_x f(x) \leq M < \infty \qquad (2.6)$$

and

$$\frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} \sup_{x \in \mathbb{R}^d} \Big|\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\Big| =: A < \infty. \qquad (2.7)$$

(Condition (D.i) implies that $f$ is uniformly continuous on $\mathbb{R}^d$, from which (2.6) follows.)

Assumptions on $K$.

(K.i) $K$ is a probability density having support contained in the closed ball of radius $1/2$ centered at zero, and is bounded by a constant $\kappa$;

(K.ii) $\sum_{i=1}^{d} \int_{\mathbb{R}^d} t_i\, K(t)\, dt = 0$.

Observe that (K.i) implies that

$$\int_{\mathbb{R}^d} |t|^2\, |K(t)|\, dt = \kappa_1 < \infty. \qquad (2.8)$$

Assumptions on the boundary $\beta = \{x : f(x) = c\}$ for $d \geq 2$.

(B.i) For all $y = (y_1, \ldots, y_d) \in \beta$,

$$f'(y) = f'(y_1, \ldots, y_d) = \Big(\frac{\partial f(y_1, \ldots, y_d)}{\partial y_1}, \ldots, \frac{\partial f(y_1, \ldots, y_d)}{\partial y_d}\Big) \neq 0.$$

Define

$$I_d = \begin{cases} [0, 2\pi), & d = 2,\\ [0, \pi]^{d-2} \times [0, 2\pi), & d > 2. \end{cases}$$

The $(d-1)$-sphere $S^{d-1} = \{x \in \mathbb{R}^d : |x| = 1\}$ can be parameterized (e.g. see Lang, 1997) by $x(\theta) = (x_1(\theta), \ldots, x_d(\theta))$, $\theta \in I_d$, where

$$x_1(\theta) = \cos(\theta_1),\quad x_2(\theta) = \sin(\theta_1)\cos(\theta_2),\quad x_3(\theta) = \sin(\theta_1)\sin(\theta_2)\cos(\theta_3),\quad \ldots,$$
$$x_{d-1}(\theta) = \sin(\theta_1)\cdots\sin(\theta_{d-2})\cos(\theta_{d-1}),\quad x_d(\theta) = \sin(\theta_1)\cdots\sin(\theta_{d-2})\sin(\theta_{d-1}).$$

(B.ii) We assume that the boundary $\beta$ can be written as

$$\beta = \bigcup_{j=1}^{k} \beta_j, \quad \text{with } \inf\{|x - y| : x \in \beta_j,\, y \in \beta_l\} > 0 \text{ if } j \neq l,$$

where each $\beta_j$ is parameterized by a function $y(\theta)$ (depending on $j$) of the above parameterization $x(\theta)$ of $S^{d-1}$, which is 1-1 on $J_d$, the interior of $I_d$, with

$$\frac{\partial y(\theta)}{\partial \theta_i} = \Big(\frac{\partial y_1(\theta)}{\partial \theta_i}, \ldots, \frac{\partial y_d(\theta)}{\partial \theta_i}\Big) \neq 0, \quad \theta \in J_d.$$

We further assume that for each $j = 1, \ldots, k$ and $i = 1, \ldots, d$, the function $\frac{\partial y(\theta)}{\partial \theta_i}$ is continuous and uniformly bounded on $J_d$.

Assumptions on the boundary $\beta$ for $d = 1$.

(B.i) $\inf_{1 \leq i \leq k} |f'(z_i)| =: \rho_0 > 0$;

(B.ii) $\beta = \{z_1, \ldots, z_k\}$, $k \geq 2$.

(Condition (B.i) and $f \in C^2(\mathbb{R})$ imply that the case $k = 1$ cannot occur when $d = 1$.)

Assumptions on $G$.

(G) The measure $G$ has a bounded continuous Radon-Nikodym derivative $g$ w.r.t. Lebesgue measure $\lambda$, and there exists a constant $0 < \gamma_g \leq \infty$ such that the following holds. In the case $d \geq 2$ there exists a function $g^{(1)}(\cdot, \cdot)$ bounded on $I_d \times S^{d-1}$ such that for each $j = 1, \ldots, k$, for some $c_j \geq 0$,

$$\sup_{|z|=1}\, \sup_{\theta \in I_d} \Big|\frac{g(y(\theta) + a z)}{a^{1/\gamma_g}} - c_j\, g^{(1)}(\theta, z)\Big| = o(1), \quad \text{as } a \searrow 0,$$

with $0 < \sup_{|z|=1}\sup_{\theta \in I_d} |g^{(1)}(\theta, z)| < \infty$, where $y(\theta)$ is the parameterization pertaining to $\beta_j$, and with at least one of the $c_j$ strictly positive. In the case $d = 1$ there exists a function $g^{(1)}(\cdot)$ with $0 < |g^{(1)}(z_j)| < \infty$, $j = 1, \ldots, k$, such that for each $j = 1, \ldots, k$, for some $c_j \geq 0$,

$$\sup_{|z|=1} \Big|\frac{g(z_j + a z)}{a^{1/\gamma_g}} - c_j\, g^{(1)}(z_j)\Big| = o(1), \quad \text{as } a \searrow 0,$$

with at least one of the $c_j$ strictly positive. By convention, in the above statements $\frac{1}{\infty} = 0$.

Assumptions on $h_n$.

(H) As $n \to \infty$, $\sqrt{n h_n^{1 + 2/d}} \to \gamma$ with $0 \leq \gamma < \infty$, and $n h_n/\log n \to \infty$, where $\gamma = 0$ in the case $d = 1$.

Discussion of the assumptions and some implications

Discussion of assumption (G). Measures $G$ of particular interest that satisfy assumption (G) are given by $g(x) = |f(x) - c|^p$ with $p \geq 0$, and also by $g(x) = f(x)$. The latter of course leads to the $F$-measure of the symmetric difference. The former has connections to the $L_p$-norm of the kernel density estimator (see the discussion in Subsection 2.2). As pointed out in the Introduction, the choice $p = 1$ is closely connected to the excess risk from the classification literature. The choice $p = 0$ yields the Lebesgue measure of the symmetric difference. Assumptions (B.i) and (D.i) imply that (G) holds for $|f(x) - c|^p$ with $1/\gamma_g = p$. For the choice $g = f$ we have $1/\gamma_g = 0$ (notice that by (D.ii) we have $c > 0$).

Discussion of smoothness assumptions on $f$.
Our smoothness assumptions on $f$ imply that $f$ has a $\gamma$-exponent with $\gamma = 1$ at the level $c$, i.e. we have

$$F\{x \in \mathbb{R}^d : |f(x) - c| \leq \epsilon\} \leq C\,\epsilon.$$

(This fact follows from the Lebesgue-Besicovitch theorem; e.g. see Cadre, 2006.) This type of assumption is common in the literature on level set estimation. It was first used by Polonik (1995) in the context of density level set estimation.

Implications of (B) in the case $d \geq 2$. In the following we shall record some conventions and implications of assumption (B), which are needed in the proof of our theorem. Using the notation introduced in assumption (B), we define $\frac{\partial y(\theta)}{\partial \theta_i}$ for points on the boundary of $I_d$ to be the limit taken from points in $J_d$. In this way, we see that each vector $\frac{\partial y(\theta)}{\partial \theta_i}$ is continuous and bounded on the closure $\bar I_d$ of $I_d$.

Notice in the case $d \geq 2$ that for each $j = 1, \ldots, k$ and $i = 1, \ldots, d - 1$,

$$\frac{d f(y(\theta))}{d\theta_i} = \frac{\partial f(y(\theta))}{\partial y_1}\,\frac{\partial y_1(\theta)}{\partial \theta_i} + \cdots + \frac{\partial f(y(\theta))}{\partial y_d}\,\frac{\partial y_d(\theta)}{\partial \theta_i} = 0, \qquad (2.9)$$

where $y(\theta)$ is the parameterization pertaining to $\beta_j$. This implies that the unit vector

$$u(\theta) = (u_1(\theta), \ldots, u_d(\theta)) := \frac{f'(y(\theta))}{|f'(y(\theta))|} \qquad (2.10)$$

is normal to the tangent space of $\beta_j$ at $y(\theta)$. From assumption (B.ii) we infer that $\beta$ is compact, which when combined with (B.i) says that

$$\inf_{(y_1, \ldots, y_d) \in \beta} |f'(y_1, \ldots, y_d)| =: \rho_0 > 0. \qquad (2.11)$$

In turn, assumptions (D.i), (B.i) and (B.ii), when combined with (2.11), imply that for each $1 \leq i \leq d - 1$ the vector

$$\frac{\partial u(\theta)}{\partial \theta_i} = \Big(\frac{\partial u_1(\theta)}{\partial \theta_i}, \ldots, \frac{\partial u_d(\theta)}{\partial \theta_i}\Big)$$

is uniformly bounded on $I_d$. Consider for each $j = 1, \ldots, k$, with $y(\theta)$ being the parameterization pertaining to $\beta_j$, the absolute value of the determinant

$$\left|\det\begin{pmatrix} \partial y(\theta)/\partial\theta_1 \\ \vdots \\ \partial y(\theta)/\partial\theta_{d-1} \\ u(\theta)\end{pmatrix}\right| =: \iota(\theta). \qquad (2.12)$$

We can infer from (B.ii) that we have

$$\sup_{\theta \in I_d} \iota(\theta) < \infty. \qquad (2.13)$$

2.5 Proof of the Theorem in the case $d \geq 2$

We shall only present a detailed proof for the case $k = 1$. However, at the end we shall describe how the proof of the general $k \geq 2$ case goes. Thus for ease of notation we shall drop the subscript $j$ in the above assumptions. Also we shall assume $c_1 = 1$ in assumption (G). We shall first show that, with a suitably defined sequence of centerings $b_n$, we have

$$(n/h_n)^{1/4}\big(\sqrt{n h_n}\big)^{1/\gamma_g}\, \{d_G(C_n(c), C(c)) - b_n\} \to_d \sigma Z \qquad (2.14)$$

for some $\sigma^2 > 0$. (For the sake of notational convenience, we write in the proof $\sigma^2 = \sigma_G^2$.) From this result we shall infer that our central limit theorem (2.1) holds. The asymptotic variance $\sigma^2$ will be defined in the course of the proof. It finally appears in (2.57) below.

Theorem 1 of Einmahl and Mason (2005) implies that when $h_n$ satisfies (H) and $f$ is bounded, for some constant $\gamma_1 > 0$,

$$\limsup_{n \to \infty} \sqrt{\frac{n h_n}{\log n}}\, \sup_{x \in \mathbb{R}^d} |f_n(x) - E f_n(x)| \leq \gamma_1, \quad \text{a.s.} \qquad (2.15)$$

It is shown in (A.2) of Detail 2 in the Appendix that under the assumptions (D), (K) and (H), for some $\gamma_2 > 0$,

$$\sup_{n \geq 2} \sqrt{n h_n}\, \sup_{x \in \mathbb{R}^d} |E f_n(x) - f(x)| \leq \gamma_2. \qquad (2.16)$$

Set, with $\varsigma > \sqrt 2 \vee \gamma_1$,

$$E_n = \Big\{x : |f(x) - c| \leq \frac{\varsigma\sqrt{\log n}}{\sqrt{n h_n}}\Big\}. \qquad (2.17)$$

We see by (1.1), (2.15) and (2.16) that with probability 1, for all large enough $n$,

$$G(C_n(c)\,\Delta\, C(c)) = \int_{E_n} |I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx =: L_n(c). \qquad (2.18)$$

It turns out that rather than considering the truncated quantity $L_n(c)$ directly, it is more convenient to first study a Poissonized version of $L_n(c)$ formed by replacing $f_n(x)$ by

$$\pi_n(x) = \frac{1}{n h_n}\sum_{i=1}^{N_n} K\Big(\frac{x - X_i}{h_n^{1/d}}\Big),$$

where $N_n$ is a mean $n$ Poisson random variable independent of $X_1, X_2, \ldots$ (When $N_n = 0$ we set $\pi_n(x) = 0$.) Notice that $E\pi_n(x) = E f_n(x)$.
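The effect of Poissonization is easy to simulate. The following sketch (ours) draws $N_n \sim \mathrm{Poisson}(n)$ independently of the data and evaluates $\pi_n$ at a fixed point; the Monte Carlo average illustrates the identity $E\pi_n(x) = E f_n(x)$ stated above. The kernel and all numerical values are illustrative assumptions.

```python
# Sketch: the Poissonized estimator pi_n(x) keeps the normalization n*h but
# sums over a Poisson(n) number of observations; its mean equals E f_n(x).
import numpy as np

def epan(t):
    return np.where(np.abs(t) <= 0.5, 1.5 * (1.0 - (2.0 * t) ** 2), 0.0)

rng = np.random.default_rng(3)
n, x0 = 2000, 0.3
h = n ** (-1 / 3)

def pi_n():
    Nn = rng.poisson(n)                # N_n ~ Poisson(n), independent of the X_i
    X = rng.standard_normal(Nn)        # X_1, ..., X_{N_n} i.i.d. f
    return epan((x0 - X) / h).sum() / (n * h)   # note: n, not N_n, in front

reps = np.array([pi_n() for _ in range(2000)])
# E pi_n(x0) = E f_n(x0), both close to f(x0) = 0.381... up to smoothing bias:
print(reps.mean())
```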
We shall make repeated use of the fact, following from the assumption that $K$ has support contained in the closed ball of radius $1/2$ centered at zero, that $\pi_n(x)$ and $\pi_n(y)$ are independent whenever $|x - y| > h_n^{1/d}$.

Here is the Poissonized version of $L_n(c)$ that we shall treat first. Define

$$\Pi_n(c) = \int_{E_n} |I\{\pi_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx. \qquad (2.19)$$

Our goal is to infer a central limit theorem for $L_n(c)$, and thus for $G(C_n(c)\,\Delta\, C(c))$, from a central limit theorem for $\Pi_n(c)$. Set

$$\Delta_n(x) = |I\{\pi_n(x) \geq c\} - I\{f(x) \geq c\}|. \qquad (2.20)$$

The first item on this agenda is to verify that $(n/h_n)^{1/4}(\sqrt{n h_n})^{1/\gamma_g}$ is the correct sequence of norming constants. To do this we must analyze the exact asymptotic behavior of the variance of $\Pi_n(c)$. We see that

$$\mathrm{Var}(\Pi_n(c)) = \mathrm{Var}\Big(\int_{E_n} \Delta_n(x)\, dG(x)\Big) = \int_{E_n}\int_{E_n} \mathrm{cov}(\Delta_n(x), \Delta_n(y))\, dG(x)\, dG(y).$$

Let

$$Y_n(x) = \Big[\sum_{j \leq N_1} K\Big(\frac{x - X_j}{h_n^{1/d}}\Big) - E K\Big(\frac{x - X}{h_n^{1/d}}\Big)\Big] \Big/ \sqrt{E K^2\Big(\frac{x - X}{h_n^{1/d}}\Big)},$$

where $N_1$ is a mean 1 Poisson random variable independent of $X, X_1, X_2, \ldots$, and let $Y_n^{(1)}(x), \ldots, Y_n^{(n)}(x)$ be i.i.d. $Y_n(x)$. Clearly

$$\Big(\frac{\pi_n(x) - E\pi_n(x)}{\sqrt{\mathrm{Var}(\pi_n(x))}},\, \frac{\pi_n(y) - E\pi_n(y)}{\sqrt{\mathrm{Var}(\pi_n(y))}}\Big) =_d \Big(\frac{\sum_{i=1}^{n} Y_n^{(i)}(x)}{\sqrt n},\, \frac{\sum_{i=1}^{n} Y_n^{(i)}(y)}{\sqrt n}\Big) =: (\bar\pi_n(x), \bar\pi_n(y)).$$

Set

$$c_n(x) = \frac{\sqrt{n h_n}\,(c - E f_n(x))}{\sqrt{\frac{1}{h_n}\, E K^2\big(\frac{x - X}{h_n^{1/d}}\big)}} = \frac{\sqrt{n h_n}\,\big(c - \int K(y)\, f(x - y h_n^{1/d})\, dy\big)}{\sqrt{\frac{1}{h_n}\, E K^2\big(\frac{x - X}{h_n^{1/d}}\big)}}.$$

Since $K$ has support contained in the closed ball of radius $1/2$ around zero, which implies that $\Delta_n(x)$ and $\Delta_n(y)$ are independent whenever $|x - y| > h_n^{1/d}$, we have

$$\mathrm{Var}\Big(\int_{E_n} \Delta_n(x)\, dG(x)\Big) = \int_{E_n}\int_{E_n} I(|x - y| \leq h_n^{1/d})\, \mathrm{cov}(\Delta_n(x), \Delta_n(y))\, dG(x)\, dG(y),$$

where now we write

$$\Delta_n(x) = \Bigg| I\{\bar\pi_n(x) \geq c_n(x)\} - I\Bigg\{0 \geq \frac{\sqrt{n h_n}\,(c - f(x))}{\big[\frac{1}{h_n} E K^2\big(\frac{x - X}{h_n^{1/d}}\big)\big]^{1/2}}\Bigg\}\Bigg|.$$

The change of variables $y = x + t h_n^{1/d}$, $t \in B$, with

$$B = \{t : |t| \leq 1\}, \qquad (2.21)$$

gives

$$\mathrm{Var}\Big(\int_{E_n} \Delta_n(x)\, dG(x)\Big) = h_n \int_{E_n}\int_B g_n(x, t)\, dt\, dx, \qquad (2.22)$$

where

$$g_n(x, t) = I_{E_n}(x)\, I_{E_n}(x + t h_n^{1/d})\, \mathrm{cov}\big(\Delta_n(x), \Delta_n(x + t h_n^{1/d})\big)\, g(x)\, g(x + t h_n^{1/d}). \qquad (2.23)$$

For ease of notation let $a_n = a_{n,G} = \big(\frac{n}{h_n}\big)^{1/4}\big(\sqrt{n h_n}\big)^{1/\gamma_g}$. We intend to prove that

$$\lim_{n \to \infty} a_n^2\, \mathrm{Var}\Big(\int_{E_n}\Delta_n(x)\, dG(x)\Big) = \lim_{n \to \infty} a_n^2\, h_n \int_{E_n}\int_B g_n(x, t)\, dt\, dx = \lim_{n \to \infty} (n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{E_n}\int_B g_n(x, t)\, dt\, dx$$
$$= \lim_{\tau \to \infty}\lim_{n \to \infty} (n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n(\tau)}\int_B g_n(x, t)\, dt\, dx =: \sigma^2 < \infty, \qquad (2.24)$$

where

$$D_n(\tau) := \Big\{z : z = y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}},\; \theta \in I_d,\; |s| \leq \tau\Big\}.$$

The set $D_n(\tau)$ forms a band around the surface $\beta$ of thickness $\frac{2\tau}{\sqrt{n h_n}}$. Recall the definition of $B$ in (2.21). Since $\beta$ is a closed submanifold of $\mathbb{R}^d$ without boundary, the Tubular Neighborhood Theorem (see Theorem 11.4 on page 93 of Bredon (1993)) says that for all $\delta > 0$ sufficiently small, for each $x \in \beta + \delta B$ there is a unique $\theta \in I_d$ and $|s| \leq \delta$ such that $x = y(\theta) + s\, u(\theta)$. This, in turn, implies that for all $\delta > 0$ sufficiently small,

$$\{y(\theta) + s\, u(\theta) : \theta \in I_d \text{ and } |s| \leq \delta\} = \beta + \delta B.$$

In particular, we see by using (H) that for all large enough $n$,

$$D_n(\tau) = \beta + \frac{\tau}{\sqrt{n h_n}}\, B, \quad \text{where } B = \{z : |z| \leq 1\}. \qquad (2.25)$$

Moreover, it says that $x = y(\theta) + s\, u(\theta)$, $\theta \in I_d$ and $|s| \leq \delta$, is a well-defined parameterization of $\beta + \delta B$, and it validates the change of variables in the integrals below.

We now turn to the proof of (2.24). Let

$$\rho_n(x, x + t h_n^{1/d}) = \mathrm{Cov}\big(\bar\pi_n(x), \bar\pi_n(x + t h_n^{1/d})\big) = \frac{h_n^{-1}\, E\Big[K\big(\frac{x - X}{h_n^{1/d}}\big)\, K\big(\frac{x - X}{h_n^{1/d}} + t\big)\Big]}{\sqrt{h_n^{-1} E K^2\big(\frac{x - X}{h_n^{1/d}}\big)\; h_n^{-1} E K^2\big(\frac{x - X}{h_n^{1/d}} + t\big)}}.$$

It is routine to show that for each $\theta \in I_d$, $|s| \leq \tau$, $x = y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}$ and $t \in B$ we have, as $n \to \infty$,

$$\rho_n(x, x + t h_n^{1/d}) = \rho_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}},\; y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}} + t h_n^{1/d}\Big) \to \rho(t),$$

where

$$\rho(t) := \frac{\int_{\mathbb{R}^d} K(u)\, K(u + t)\, du}{\int_{\mathbb{R}^d} K^2(u)\, du}.$$

(See Detail 4 in the Appendix.) Notice that $\rho(t) = \rho(-t)$.
One can then infer by the central limit theorem that for each $\theta \in I_d$, $|s| \leq \tau$ and $t \in B$,

$$\big(\bar\pi_n(x),\, \bar\pi_n(x + t h_n^{1/d})\big) = \Big(\bar\pi_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}\Big),\; \bar\pi_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}} + t h_n^{1/d}\Big)\Big) \to_d \Big(Z_1,\; \rho(t) Z_1 + \sqrt{1 - \rho^2(t)}\, Z_2\Big), \qquad (2.26)$$

where $Z_1$ and $Z_2$ are independent standard normal random variables. We also get, by using our assumptions and straightforward Taylor expansions, that for $|s| \leq \tau$, $u = s\, u(\theta)$, $x = y(\theta) + \frac{u}{\sqrt{n h_n}}$ and $\theta \in I_d$,

$$c_n(x) = c_n\Big(y(\theta) + \frac{u}{\sqrt{n h_n}}\Big) = \frac{\sqrt{n h_n}\Big(c - E f_n\big(y(\theta) + \frac{u}{\sqrt{n h_n}}\big)\Big)}{\sqrt{\frac{1}{h_n} E K^2\Big(\frac{y(\theta) + \frac{u}{\sqrt{n h_n}} - X}{h_n^{1/d}}\Big)}} \to -\frac{f'(y(\theta)) \cdot u}{\sqrt{f(y(\theta))}\,\|K\|_2} = -\frac{s\,|f'(y(\theta))|}{\sqrt c\,\|K\|_2} =: c(s, \theta, 0), \qquad (2.27)$$

and similarly, since $\sqrt{n h_n^{1 + 2/d}} \to \gamma$,

$$c_n(x + t h_n^{1/d}) \to -\frac{f'(y(\theta)) \cdot u}{\sqrt{f(y(\theta))}\,\|K\|_2} - \frac{\gamma\, f'(y(\theta)) \cdot t}{\sqrt{f(y(\theta))}\,\|K\|_2} = -\frac{s\,|f'(y(\theta))|}{\sqrt c\,\|K\|_2} - \frac{\gamma\, f'(y(\theta)) \cdot t}{\sqrt c\,\|K\|_2} =: c(s, \theta, \gamma t).$$

We also have

$$\frac{\sqrt{n h_n}\,(c - f(x))}{\sqrt{\frac{1}{h_n} E K^2\big(\frac{x - X}{h_n^{1/d}}\big)}} \to c(s, \theta, 0) \quad\text{and}\quad \frac{\sqrt{n h_n}\,\big(c - f(x + t h_n^{1/d})\big)}{\sqrt{\frac{1}{h_n} E K^2\big(\frac{x - X}{h_n^{1/d}}\big)}} \to c(s, \theta, \gamma t).$$

(See Detail 5 in the Appendix.) Hence by (2.26) and (G), for $y(\theta) \in \beta$,

$$(n h_n)^{1/\gamma_g}\, g_n(x, t) = (n h_n)^{1/\gamma_g}\, g_n\Big(y(\theta) + \frac{u}{\sqrt{n h_n}},\, t\Big)$$
$$\to \mathrm{cov}\Big(\big|I\{Z_1 \geq c(s, \theta, 0)\} - I\{0 \geq c(s, \theta, 0)\}\big|,\; \big|I\{\rho(t) Z_1 + \sqrt{1 - \rho^2(t)}\, Z_2 \geq c(s, \theta, \gamma t)\} - I\{0 \geq c(s, \theta, \gamma t)\}\big|\Big)$$
$$\times\; |s|^{1/\gamma_g}\, g^{(1)}(\theta, u(\theta))\; |s\, u(\theta) + \gamma t|^{1/\gamma_g}\, g^{(1)}\Big(\theta, \frac{s\, u(\theta) + \gamma t}{|s\, u(\theta) + \gamma t|}\Big) =: \Gamma(\theta, s, t). \qquad (2.28)$$

Using the change of variables

$$x_1 = y_1(\theta) + \frac{s\, u_1(\theta)}{\sqrt{n h_n}}, \ldots, x_d = y_d(\theta) + \frac{s\, u_d(\theta)}{\sqrt{n h_n}}, \qquad (2.29)$$

we get

$$\int_{D_n(\tau)}\int_B g_n(x, t)\, dt\, dx = \int_{-\tau}^{\tau}\int_{I_d}\int_B g_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}},\, t\Big)\, |J_n(\theta, s)|\, dt\, d\theta\, ds,$$

where

$$|J_n(\theta, s)| = \left|\det\begin{pmatrix} \frac{\partial y(\theta)}{\partial\theta_1} + \frac{s}{\sqrt{n h_n}}\frac{\partial u(\theta)}{\partial\theta_1} \\ \vdots \\ \frac{\partial y(\theta)}{\partial\theta_{d-1}} + \frac{s}{\sqrt{n h_n}}\frac{\partial u(\theta)}{\partial\theta_{d-1}} \\ \frac{u(\theta)}{\sqrt{n h_n}} \end{pmatrix}\right|. \qquad (2.30)$$

Clearly, with $\iota(\theta)$ as in (2.12),

$$\sqrt{n h_n}\,|J_n(\theta, s)| \to \iota(\theta). \qquad (2.31)$$

Under our assumptions, $\sqrt{n h_n}\,|J_n(\theta, s)|$ is uniformly bounded in $n \geq 1$ and $(\theta, s) \in I_d \times [-\tau, \tau]$. Also, by using (G) we see that for all $n$ large enough, $(n h_n)^{1/\gamma_g} g_n$ is bounded on $D_n(\tau) \times B$. Thus, since $(n h_n)^{1/\gamma_g} g_n$ and $\sqrt{n h_n}\,|J_n|$ are eventually bounded on the appropriate domains, and (2.28) and (2.31) hold, we get by the dominated convergence theorem and (G) that

$$(n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n(\tau)}\int_B g_n(x, t)\, dt\, dx = (n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{-\tau}^{\tau}\int_{I_d}\int_B g_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}},\, t\Big)\,|J_n(\theta, s)|\, dt\, d\theta\, ds$$
$$\to \int_{-\tau}^{\tau}\int_{I_d}\int_B \Gamma(\theta, s, t)\,\iota(\theta)\, dt\, d\theta\, ds, \quad \text{as } n \to \infty. \qquad (2.32)$$

We claim that as $\tau \to \infty$ we have

$$\int_{-\tau}^{\tau}\int_{I_d}\int_B \Gamma(\theta, s, t)\,\iota(\theta)\, dt\, d\theta\, ds \to \int_{-\infty}^{\infty}\int_{I_d}\int_B \Gamma(\theta, s, t)\,\iota(\theta)\, dt\, d\theta\, ds =: \sigma^2 < \infty \qquad (2.33)$$

and

$$\lim_{\tau \to \infty}\limsup_{n \to \infty}\,(n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n^C(\tau)\cap E_n}\int_B g_n(x, t)\, dt\, dx = 0, \qquad (2.34)$$

which in light of (2.32) implies that the limit in (2.24) is equal to $\sigma^2$ as defined in (2.33).

First we show (2.33). Consider

$$\Gamma^+(\tau) := \int_0^{\tau}\int_{I_d}\int_B \Gamma(\theta, s, t)\,\iota(\theta)\, dt\, d\theta\, ds.$$

We shall show existence and finiteness of the limit $\lim_{\tau \to \infty}\Gamma^+(\tau)$. Similar arguments apply to

$$\lim_{\tau \to \infty}\Gamma^-(\tau) := \lim_{\tau \to \infty}\int_{-\tau}^{0}\int_{I_d}\int_B \Gamma(\theta, s, t)\,\iota(\theta)\, dt\, d\theta\, ds < \infty.$$

Observe that when $s \geq 0$,

$$|I\{Z_1 \geq c(s, \theta, 0)\} - I\{0 \geq c(s, \theta, 0)\}| = I\{Z_1 < c(s, \theta, 0)\},$$

and with $\Phi$ denoting the cdf of a standard normal distribution we write

$$E\, I\{Z_1 < c(s, \theta, 0)\} = \Phi(c(s, \theta, 0)).$$

Hence, by taking into account (2.28), the assumed finiteness of $\sup_{|z|=1}\sup_{\theta} g^{(1)}(\theta, z)$, and using the elementary inequality

$$|\mathrm{cov}(X, Y)| \leq 2 E|X|, \quad \text{whenever } |Y| \leq 1,$$

we get for all $s \geq 0$ and some $c_1 > 0$ that
$$|\Gamma(\theta, s, t)| \leq c_1\, |s|^{1/\gamma_g}\big(|s|^{1/\gamma_g} + \gamma^{1/\gamma_g}\big)\,\Phi(c(s, \theta, 0)). \qquad (2.35)$$

The lower bound (2.11) implies the existence of a constant $\tilde c > 0$ such that

$$\Phi(c(s, \theta, 0)) = \Phi\Big(-\frac{s\,|f'(y(\theta))|}{\sqrt c\,\|K\|_2}\Big) \leq \Phi(-\tilde c\, s).$$

Together with (2.35) and (2.13), it follows that for some $c > 0$ we have

$$\lim_{\tau \to \infty}\Gamma^+(\tau) \leq c\,\lim_{\tau \to \infty}\int_0^{\tau} |s|^{1/\gamma_g}\big(|s|^{1/\gamma_g} + \gamma^{1/\gamma_g}\big)\,(1 - \Phi(\tilde c\, s))\, ds < \infty.$$

Similarly,

$$|\Gamma^+(\infty) - \Gamma^+(\tau)| \leq c\int_{\tau}^{\infty} |s|^{1/\gamma_g}\big(|s|^{1/\gamma_g} + \gamma^{1/\gamma_g}\big)\,(1 - \Phi(\tilde c\, s))\, ds \to 0, \quad \text{as } \tau \to \infty.$$

This validates claim (2.33).

Next we turn to the proof of (2.34). Recall the definition of $g_n(x, t)$ in (2.23). Notice that for all $n$ large enough we have

$$(n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n^C(\tau)\cap E_n}\int_B g_n(x, t)\, dt\, dx \leq \sqrt{n h_n}\int_{D_n^C(\tau)\cap E_n}\int_B \mathrm{cov}\big(\Delta_n(x), \Delta_n(x + t h_n^{1/d})\big)\,(n h_n)^{1/\gamma_g}\, g(x)\, g(x + t h_n^{1/d})\, dt\, dx \qquad (2.36)$$

$$\leq \sqrt{n h_n}\int_{D_n^C(\tau)\cap E_n}\int_B \big[\mathrm{Var}(\Delta_n(x))\big]^{1/2}\,(n h_n)^{1/\gamma_g}\, g(x)\, g(x + t h_n^{1/d})\, dt\, dx. \qquad (2.37)$$

The last inequality uses the fact that $\Delta_n(x + t h_n^{1/d}) \leq 1$ and thus $\mathrm{Var}(\Delta_n(x + t h_n^{1/d})) \leq 1$. Applying the inequality

$$|I\{a \geq b\} - I\{0 \geq b\}| \leq I\{|a| \geq |b|\}, \qquad (2.38)$$

we obtain

$$\mathrm{Var}(\Delta_n(x)) \leq E\big(\Delta_n(x)\big)^2 \leq E\big(I\{|\pi_n(x) - f(x)| \geq |c - f(x)|\}\big)^2 = P\{|\pi_n(x) - f(x)| \geq |c - f(x)|\}.$$

Thus we get that

$$\sqrt{n h_n}\int_{D_n^C(\tau)\cap E_n}\int_B \sqrt{\mathrm{Var}(\Delta_n(x))}\,(n h_n)^{1/\gamma_g}\, g(x)\, g(x + t h_n^{1/d})\, dt\, dx$$
$$\leq \sqrt{n h_n}\int_{D_n^C(\tau)\cap E_n}\int_B \sqrt{P\{|\pi_n(x) - f(x)| \geq |c - f(x)|\}}\,(n h_n)^{1/\gamma_g}\, g(x)\, g(x + t h_n^{1/d})\, dt\, dx. \qquad (2.39)$$

We must bound the probability inside the integral. For this purpose we need a lemma.

Lemma 2.1. Let $Y, Y_1, Y_2, \ldots$ be i.i.d. with mean $\mu$ and bounded by $0 < M < \infty$. Independent of $Y_1, Y_2, \ldots$, let $N_n$ be a Poisson random variable with mean $n$. For any $v \geq 2(30e)^2 E Y^2$ and with $d = 30 e M$ (here $d$ denotes a constant of the lemma, not the dimension), we have for all $\lambda > 0$,

$$P\Big\{\sum_{i=1}^{N_n} Y_i - n\mu \geq \lambda\Big\} \leq \exp\Big(-\frac{\lambda^2/2}{n v + d\lambda}\Big). \qquad (2.40)$$

Proof. Let $N$ be a Poisson random variable with mean 1 independent of $Y_1, Y_2, \ldots$, and let

$$\omega = \sum_{i=1}^{N} Y_i.$$

Clearly, if $\omega_1, \ldots, \omega_n$ are i.i.d. $\omega$, then

$$\sum_{i=1}^{N_n} Y_i - n\mu =_d \sum_{i=1}^{n}(\omega_i - \mu).$$

Our aim is to use Bernstein's inequality to prove (2.40). Notice that for any integer $r \geq 2$,

$$E|\omega - \mu|^r = E\Big|\sum_{i=1}^{N} Y_i - \mu\Big|^r. \qquad (2.41)$$

At this point we need the following fact, which is Lemma 2.3 of Giné, Mason and Zaitsev (2003).

Fact 1. If, for each $n \geq 1$, $\zeta, \zeta_1, \zeta_2, \ldots, \zeta_n, \ldots$ are independent identically distributed random variables, $\zeta_0 = 0$, and $\eta$ is a Poisson random variable with mean $\gamma > 0$ and independent of the variables $\{\zeta_i\}_{i=1}^{\infty}$, then, for every $p \geq 2$,

$$E\Big|\sum_{i=0}^{\eta} \zeta_i - \gamma E\zeta\Big|^p \leq \Big(\frac{15 p}{\log p}\Big)^p \max\Big[\big(\gamma E\zeta^2\big)^{p/2},\; \gamma E|\zeta|^p\Big]. \qquad (2.42)$$

Applying inequality (2.42) to (2.41) gives, for $r \geq 2$,

$$E|\omega - \mu|^r = E\Big|\sum_{i=1}^{N} Y_i - \mu\Big|^r \leq \Big(\frac{15 r}{\log r}\Big)^r \max\Big[\big(E Y^2\big)^{r/2},\; E|Y|^r\Big].$$

Now

$$\max\Big[\big(E Y^2\big)^{r/2},\; E|Y|^r\Big] \leq \max\Big[E Y^2\,\big(E Y^2\big)^{r/2 - 1},\; E Y^2\, M^{r-2}\Big] \leq E Y^2\, M^{r-2}.$$

Moreover, since $\log 2 \geq 1/2$, we get $E|\omega - \mu|^r \leq (30 r)^r\, E Y^2\, M^{r-2}$. By Stirling's formula (see page 864 of Shorack and Wellner (1986)), $r^r \leq e^r\, r!$. Thus

$$E|\omega - \mu|^r \leq (30 e)^r\, r!\, E Y^2\, M^{r-2} \leq \frac{v}{2}\, r!\, d^{r-2},$$

where $v \geq 2(30 e)^2 E Y^2$ and $d = 30 e M$. Thus, by Bernstein's inequality (see page 855 of Shorack and Wellner (1986)), we get (2.40). □

Here is how Lemma 2.1 is used. Let $Y_i = K\big(\frac{x - X_i}{h_n^{1/d}}\big)$. Since by assumption both $K$ and $f$ are bounded, and $K$ has support contained in the closed ball of radius $1/2$ around zero, we obtain that for some $D_0 > 0$ and all $n \geq 1$,

$$\sup_{x \in \mathbb{R}^d} E\, K^2\Big(\frac{x - X}{h_n^{1/d}}\Big) \leq D_0\, h_n.$$

Consider $z \geq a/\sqrt{n h_n}$ for some $a > 0$.
With this choice, and since by (A.1) of Detail 2 in the Appendix $\sup_x |E f_n(x) - f(x)| \leq A_1 h_n^{2/d} \leq \frac{a}{2\sqrt{n h_n}}$ for $n$ large enough by using (H), we have

$$P\{\pi_n(x) - f(x) \geq z\} = P\{\pi_n(x) - E f_n(x) \geq z - (E f_n(x) - f(x))\} \leq P\Big\{\pi_n(x) - E f_n(x) \geq z - \frac{a}{2\sqrt{n h_n}}\Big\} \leq P\Big\{\pi_n(x) - E f_n(x) \geq \frac z2\Big\}$$

for $z \geq a/\sqrt{n h_n}$ and $n$ large enough. We then get from inequality (2.40) that for $n \geq 1$ and all $z > 0$, for some constants $D_1$ and $D_2$,

$$P\Big\{\pi_n(x) - E f_n(x) \geq \frac z2\Big\} = P\Big\{\sum_{i=1}^{N_n} K\Big(\frac{x - X_i}{h_n^{1/d}}\Big) - n E K\Big(\frac{x - X}{h_n^{1/d}}\Big) \geq \frac{n h_n z}{2}\Big\} \leq \exp\Big(-\frac{(n h_n)^2 z^2}{D_1 n h_n + D_2 n h_n z}\Big) = \exp\Big(-\frac{n h_n z^2}{D_1 + D_2 z}\Big).$$

We see that for some $a > 0$, for all $z \geq a/\sqrt{n h_n}$ and $n$ large enough,

$$\frac{n h_n z^2}{D_1 + D_2 z} \geq \sqrt{n h_n}\, z.$$

Observe that for $0 \leq z \leq a/\sqrt{n h_n}$,

$$\exp(a)\exp\big(-\sqrt{n h_n}\, z\big) \geq \exp(a)\exp(-a) = 1 \geq P\{\pi_n(x) - f(x) \geq z\}.$$

Therefore, by setting $A = \exp(a)$, we get for all large enough $n \geq 1$, $z > 0$ and $x$,

$$P\{\pi_n(x) - f(x) \geq z\} \leq A\exp\big(-\sqrt{n h_n}\, z\big).$$

In the same way, for all large enough $n \geq 1$, $z > 0$ and $x$,

$$P\{\pi_n(x) - f(x) \leq -z\} \leq A\exp\big(-\sqrt{n h_n}\, z\big).$$

Notice these inequalities imply that, for a suitably enlarged constant $A$, for all large enough $n \geq 1$ and $x$,

$$\sqrt{P\{|\pi_n(x) - f(x)| \geq |c - f(x)|\}} \leq A\exp\Big(-\frac{\sqrt{n h_n}\,|c - f(x)|}{2}\Big). \qquad (2.43)$$

Returning to the proof of (2.34), from (2.36), (2.37), (2.39) and (2.43) we get for all large enough $n \geq 1$,

$$(n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n^C(\tau)\cap E_n}\int_B g_n(x, t)\, dt\, dx \leq \sqrt{n h_n}\int_{D_n^C(\tau)\cap E_n}\int_B \varphi_n(x, t)\, dt\, dx,$$

where

$$\varphi_n(x, t) = A\exp\Big(-\frac{\sqrt{n h_n}\,|c - f(x)|}{2}\Big)\,(n h_n)^{1/\gamma_g}\, g(x)\, g(x + t h_n^{1/d}).$$

Our assumptions imply that for some $0 < \eta < 1$, for all $1 \leq |s| \leq \varsigma\sqrt{\log n}$ and $n$ large,

$$\frac{\sqrt{n h_n}}{2}\,\Big| c - f\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}\Big)\Big| \geq \eta\,|s|.$$

(See Detail 3 in the Appendix.) Using the change of variables (2.29), we get for all $\tau > 1$,

$$\int_{D_n^C(\tau)\cap E_n}\int_B \varphi_n(x, t)\, dt\, dx = \int_{\tau \leq |s| \leq \varsigma\sqrt{\log n}}\int_{I_d}\int_B \varphi_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}},\, t\Big)\,|J_n(\theta, s)|\, dt\, d\theta\, ds,$$

which by our assumptions (refer to the remarks after equation (2.31) and assumption (G)) is, for some constant $C > 0$ and all large enough $\tau$ and $n$,

$$\leq \frac{C}{\sqrt{n h_n}}\int_{\tau \leq |s| \leq \varsigma\sqrt{\log n}}\int_{I_d} |s|^{2/\gamma_g}\exp\Big(-\frac{\sqrt{n h_n}}{2}\Big|c - f\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}\Big)\Big|\Big)\, d\theta\, ds \leq \frac{C}{\sqrt{n h_n}}\int_{\tau \leq |s| \leq \varsigma\sqrt{\log n}}\int_{I_d}\exp\Big(-\frac{\eta\,|s|}{2}\Big)\, d\theta\, ds,$$

where in the last step we used (A.4) together with $|s|^{2/\gamma_g} e^{-\eta|s|} \leq C' e^{-\eta|s|/2}$ and enlarged $C$. Thus

$$\int_{D_n^C(\tau)\cap E_n}\int_B \varphi_n(x, t)\, dt\, dx \leq \frac{4\pi^{d-1}\, C\exp(-\eta\tau/2)}{\eta\,\sqrt{n h_n}}. \qquad (2.44)$$

Therefore, after inserting all of the above bounds, we get

$$(n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n^C(\tau)\cap E_n}\int_B g_n(x, t)\, dt\, dx \leq \frac{4\pi^{d-1}\, C\exp(-\eta\tau/2)}{\eta},$$

and hence we readily conclude that (2.34) holds. Putting everything together, we get that as $n \to \infty$,

$$a_n^2\,\mathrm{Var}(\Pi_n(c)) \to \sigma^2, \qquad (2.45)$$

with $\sigma^2$ defined as in (2.33). For future use, we point out that we can infer by (2.24), (2.34) and (2.45) that for all $\varepsilon > 0$ there exist a $\tau_0$ and an $n_0 \geq 1$ such that for all $\tau \geq \tau_0$ and $n \geq n_0$,

$$|\sigma_n^2(\tau) - \sigma^2| < \varepsilon, \qquad (2.46)$$

where

$$\sigma_n^2(\tau) = \mathrm{Var}\Big(\Big(\frac{n}{h_n}\Big)^{1/4}\big(\sqrt{n h_n}\big)^{1/\gamma_g}\int_{D_n(\tau)}\Delta_n(x)\, g(x)\, dx\Big). \qquad (2.47)$$

Our next goal is to de-Poissonize by applying the following version of a theorem in Beirlant and Mason (1995).

Lemma 2.2. Let $N_{1,n}$ and $N_{2,n}$ be independent Poisson random variables with $N_{1,n}$ being Poisson$(n\beta_n)$ and $N_{2,n}$ being Poisson$(n(1 - \beta_n))$, where $\beta_n \in (0, 1)$. Denote $N_n = N_{1,n} + N_{2,n}$ and set

$$U_n = \frac{N_{1,n} - n\beta_n}{\sqrt n} \quad\text{and}\quad V_n = \frac{N_{2,n} - n(1 - \beta_n)}{\sqrt n}.$$

Let $\{S_n\}_{n=1}^{\infty}$ be a sequence of random variables such that

(i) for each $n \geq 1$, the random vector $(S_n, U_n)$ is independent of $V_n$,

(ii) for some $\sigma^2 < \infty$, $S_n \to_d \sigma Z$, as $n \to \infty$,

(iii) $\beta_n \to 0$, as $n \to \infty$.

Then, for all $x$,

$$P\{S_n \leq x \mid N_n = n\} \to P\{\sigma Z \leq x\}.$$

The proof follows along the same lines as Lemma 2.4 in Beirlant and Mason (1995). Consult Detail 6 of the Appendix for a proof.
We shall now use this de-Poissonization lemma to complete the proof of our theorem. Recall the definitions of $L_n(c)$ and $\Pi_n(c)$ in (2.18) and (2.19), respectively. Noting that $D_n(\tau) \subset E_n$ for all large enough $n \geq 1$, we see that

$$a_n\big(L_n(c) - E\Pi_n(c)\big) = a_n\int_{D_n(\tau)}\big\{|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}| - E\Delta_n(x)\big\}\, g(x)\, dx$$
$$+\; a_n\int_{D_n^C(\tau)\cap E_n}\big\{|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}| - E\Delta_n(x)\big\}\, g(x)\, dx =: T_n(\tau) + R_n(\tau).$$

We can control the $R_n(\tau)$ piece of this sum using the following inequality, which follows from Lemma 2.3 below:

$$E(R_n(\tau))^2 \leq 2 a_n^2\,\mathrm{Var}\Big(\int_{D_n^C(\tau)\cap E_n}|I\{\pi_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx\Big) = 2(n h_n)^{\frac12 + \frac{1}{\gamma_g}}\int_{D_n^C(\tau)\cap E_n}\int_B g_n(x, t)\, dt\, dx, \qquad (2.48)$$

which goes to zero as $n \to \infty$ and $\tau \to \infty$, as we proved in (2.34).

The needed inequality is a special case of the following result in Giné, Mason and Zaitsev (2003). We say that a set $D$ is a (commutative) semigroup if it has a commutative and associative operation, in our case sum, with a zero element. If $D$ is equipped with a $\sigma$-algebra $\mathcal D$ for which the sum, $+ : (D \times D, \mathcal D \otimes \mathcal D) \mapsto (D, \mathcal D)$, is measurable, then we say that $(D, \mathcal D)$ is a measurable semigroup.

Lemma 2.3. Let $(D, \mathcal D)$ be a measurable semigroup; let $Y_0 = 0 \in D$ and let $Y_i$, $i \in \mathbb N$, be independent identically distributed $D$-valued random variables; for any given $n \in \mathbb N$, let $\eta$ be a Poisson random variable with mean $n$ independent of the sequence $\{Y_i\}$; and let $B \in \mathcal D$ be such that $P\{Y_1 \in B\} \leq 1/2$. If $G : D \mapsto \mathbb R$ is non-negative and $\mathcal D$-measurable, then

$$E\, G\Big(\sum_{i=0}^{n} I(Y_i \in B)\, Y_i\Big) \leq 2\, E\, G\Big(\sum_{i=0}^{\eta} I(Y_i \in B)\, Y_i\Big). \qquad (2.49)$$

Next we consider $T_n(\tau)$. Observe that

$$\big(S_n(\tau) \mid N_n = n\big) =_d \frac{T_n(\tau)}{\sigma_n(\tau)}, \qquad (2.50)$$

where as above $N_n$ denotes a Poisson random variable with mean $n$,

$$S_n(\tau) = \frac{a_n\int_{D_n(\tau)}\{\Delta_n(x) - E\Delta_n(x)\}\, g(x)\, dx}{\sigma_n(\tau)},$$

and $\sigma_n^2(\tau)$ is defined as in (2.47). We shall apply Lemma 2.2 to $S_n(\tau)$ with

$$N_{1,n} = \sum_{i=1}^{N_n} I\big\{X_i \in D_n\big(\tau + h_n^{1/d}\sqrt{n h_n}\big)\big\}, \qquad N_{2,n} = \sum_{i=1}^{N_n} I\big\{X_i \notin D_n\big(\tau + h_n^{1/d}\sqrt{n h_n}\big)\big\}$$

and

$$\beta_n = P\big\{X_i \in D_n\big(\tau + h_n^{1/d}\sqrt{n h_n}\big)\big\}.$$

We first need to verify that, as $n \to \infty$,

$$S_n(\tau) = \frac{a_n\int_{D_n(\tau)}\{\Delta_n(x) - E\Delta_n(x)\}\, g(x)\, dx}{\sigma_n(\tau)} \to_d Z.$$

To show this we require the following special case of Theorem 1 of Shergin (1990).

Fact 2. (Shergin (1990)) Let $\{X_{i,n} : i \in \mathbb Z^d\}$ denote a triangular array of mean zero $m$-dependent random fields, and let $J_n \subset \mathbb Z^d$ be such that

(i) $\mathrm{Var}\big(\sum_{i \in J_n} X_{i,n}\big) \to 1$, as $n \to \infty$, and

(ii) for some $2 < s < 3$, $\sum_{i \in J_n} E|X_{i,n}|^s \to 0$, as $n \to \infty$.

Then $\sum_{i \in J_n} X_{i,n} \to_d Z$, where $Z$ is a standard normal random variable.

We use Shergin's result as follows. Under our regularity conditions, for each $\tau > 0$ there exist positive constants $d_1, \ldots, d_5$ such that for all large enough $n$,

$$|D_n(\tau)| \leq \frac{d_1}{\sqrt{n h_n}}; \qquad (2.51)$$

$$d_2 \leq \sigma_n(\tau) \leq d_3. \qquad (2.52)$$

Clearly (2.52) follows from (2.46), and (2.51) is shown in Detail 7 in the Appendix. There it is also shown that for each such integer $n \geq 1$ there exists a partition $\{R_i, i \in J_n \subset \mathbb Z^d\}$ of $D_n(\tau)$ such that for each $i \in J_n$,

$$|R_i| \leq d_4\, h_n, \qquad (2.53)$$

where

$$|J_n| =: m_n \leq \frac{d_5}{\sqrt{n h_n^3}}. \qquad (2.54)$$

Define

$$X_{i,n} = \frac{a_n\int_{R_i}\{\Delta_n(x) - E\Delta_n(x)\}\, g(x)\, dx}{\sigma_n(\tau)}, \quad i \in J_n.$$

It follows from Detail 7 in the Appendix that $X_{i,n}$ can be extended to a 1-dependent random field on $\mathbb Z^d$. Notice that by (G) there exists a constant $A > 0$ such that for all $x \in D_n(\tau)$,

$$|g(x)| \leq A\,\big(\sqrt{n h_n}\big)^{-1/\gamma_g}.$$

Recalling that $a_n = a_{n,G} = \big(\frac{n}{h_n}\big)^{1/4}\big(\sqrt{n h_n}\big)^{1/\gamma_g}$, we thus obtain for all $i \in J_n$,

$$|X_{i,n}| \leq \frac{2 A\,|R_i|\, a_n\,\big(\sqrt{n h_n}\big)^{-1/\gamma_g}}{\sigma_n(\tau)} \leq \frac{2 d_4 A\, h_n}{d_2}\Big(\frac{n}{h_n}\Big)^{1/4} = \frac{2 A d_4}{d_2}\,\big(n h_n^3\big)^{1/4}.$$

Therefore,

$$\sum_{i \in J_n} E|X_{i,n}|^{5/2} \leq m_n\Big(\frac{2 A d_4}{d_2}\Big)^{5/2}\big(n h_n^3\big)^{5/8} \leq d_5\Big(\frac{2 A d_4}{d_2}\Big)^{5/2}\big(n h_n^3\big)^{1/8}.$$

This bound, when combined with (H) (which for $d \geq 2$ implies $n h_n^3 \to 0$, since $n h_n^3 = (n h_n^{1+2/d})\, h_n^{2 - 2/d}$), shows that as $n \to \infty$,

$$\sum_{i \in J_n} E|X_{i,n}|^{5/2} \to 0,$$
which by the Shergin fact (with $s = 5/2$) yields

$$S_n(\tau) = \sum_{i \in J_n} X_{i,n} \to_d Z.$$

Thus, using (2.50) and $\beta_n = P\{X_i \in D_n(\tau + h_n^{1/d}\sqrt{n h_n})\} \to 0$, Lemma 2.2 implies that

$$\frac{T_n(\tau)}{\sigma_n(\tau)} \to_d Z. \qquad (2.55)$$

Putting everything together, we get from (2.48) that

$$\lim_{\tau \to \infty}\limsup_{n \to \infty} E(R_n(\tau))^2 = 0,$$

and from (2.46) that

$$\lim_{\tau \to \infty}\limsup_{n \to \infty}|\sigma_n^2(\tau) - \sigma^2| = 0,$$

which in combination with (2.55) implies that

$$a_n\big(L_n(c) - E\Pi_n(c)\big) \to_d \sigma Z, \qquad (2.56)$$

where

$$\sigma^2 = \int_{-\infty}^{\infty}\int_{I_d}\int_B \Gamma(\theta, s, t)\,\iota(\theta)\, dt\, d\theta\, ds, \qquad (2.57)$$

with $\Gamma(\theta, s, t)$ as defined in (2.28). Since by Lemma 2.3

$$E\big(a_n(L_n(c) - E\Pi_n(c))\big)^2 \leq 2\,\mathrm{Var}(a_n\Pi_n(c))$$

and

$$\mathrm{Var}(a_n\Pi_n(c)) \to \sigma^2 < \infty,$$

we can conclude that

$$a_n\big(E L_n(c) - E\Pi_n(c)\big) \to 0,$$

and thus

$$a_n\big(L_n(c) - E L_n(c)\big) \to_d \sigma Z.$$

This gives

$$a_n\Big(G(C_n(c)\,\Delta\, C(c)) - \int_{E_n} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, dG(x)\Big) \to_d \sigma Z, \qquad (2.58)$$

which is (2.14). In light of (2.58), and keeping in mind that

$$E\, G(C_n(c)\,\Delta\, C(c)) = \int_{\mathbb R^d} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx,$$

we see that to complete the proof of (2.1) it remains to show that

$$a_n\int_{E_n^c} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx \to 0. \qquad (2.59)$$

We shall begin by bounding $E|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|$, $x \in E_n^c$. Applying inequality (2.38) with $a = f_n(x) - f(x)$ and $b = c - f(x)$, we have for $x \in E_n^c$,

$$E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}| \leq E\, I\{|f_n(x) - f(x)| \geq |c - f(x)|\} = P\{|f_n(x) - f(x)| \geq |c - f(x)|\}$$
$$\leq P\{|f_n(x) - E f_n(x)| \geq |c - f(x)| - |f(x) - E f_n(x)|\},$$

which by recalling the definition of $E_n$ in (2.17) is

$$\leq P\Big\{|f_n(x) - E f_n(x)| \geq \frac{\varsigma(\log n)^{1/2}}{(n h_n)^{1/2}} - |f(x) - E f_n(x)|\Big\}.$$

This last bound is, in turn, by (A.1),

$$\leq P\Big\{|f_n(x) - E f_n(x)| \geq \frac{\varsigma(\log n)^{1/2}}{(n h_n)^{1/2}} - A_1 h_n^{2/d}\Big\}.$$

Thus for all large enough $n$, uniformly in $x \in E_n^c$,

$$E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}| \leq P\Bigg\{|f_n(x) - E f_n(x)| \geq \frac{\varsigma(\log n)^{1/2}}{(n h_n)^{1/2}}\Bigg(1 - \frac{A_1\sqrt{n h_n^{1 + 2/d}}\; h_n^{1/d}}{\varsigma\sqrt{\log n}}\Bigg)\Bigg\},$$

which by (H) is, for all large enough $n$, uniformly in $x \in E_n^c$,

$$\leq P\Big\{|f_n(x) - E f_n(x)| \geq \frac{\varsigma(\log n)^{1/2}}{2(n h_n)^{1/2}}\Big\} =: p_n(x).$$

We shall bound $p_n(x)$ using Bernstein's inequality on the i.i.d. sum

$$f_n(x) - E f_n(x) = \frac{1}{n h_n}\sum_{i=1}^{n}\Big[K\Big(\frac{x - X_i}{h_n^{1/d}}\Big) - E K\Big(\frac{x - X_i}{h_n^{1/d}}\Big)\Big].$$

Notice that for each $i = 1, \ldots, n$,

$$\mathrm{Var}\Big(\frac{1}{n h_n} K\Big(\frac{x - X_i}{h_n^{1/d}}\Big)\Big) \leq \frac{1}{(n h_n)^2}\int_{\mathbb R^d} K^2\Big(\frac{x - y}{h_n^{1/d}}\Big) f(y)\, dy = \frac{1}{n^2 h_n}\int_{\mathbb R^d} K^2(u)\, f(x - h_n^{1/d} u)\, du \leq \frac{\|K\|_2^2\, M}{n^2 h_n},$$

and by (K.i),

$$\Big|\frac{1}{n h_n}\Big[K\Big(\frac{x - X_i}{h_n^{1/d}}\Big) - E K\Big(\frac{x - X_i}{h_n^{1/d}}\Big)\Big]\Big| \leq \frac{2\kappa}{n h_n}.$$

Therefore by Bernstein's inequality (i.e. page 855 of Shorack and Wellner (1986)),

$$p_n(x) \leq 2\exp\Bigg(-\frac{\frac{\varsigma^2\log n}{4 n h_n}}{\frac{\|K\|_2^2 M}{n h_n} + \frac{2\kappa}{3 n h_n}\cdot\frac{\varsigma(\log n)^{1/2}}{2(n h_n)^{1/2}}}\Bigg) = 2\exp\Bigg(-\frac{\frac{\varsigma^2\log n}{4}}{\|K\|_2^2 M + \frac{\kappa\varsigma(\log n)^{1/2}}{3(n h_n)^{1/2}}}\Bigg).$$

Hence, by (H) and keeping in mind that $\varsigma > \sqrt 2$ in (2.17), we get for some constant $a > 0$ that for all large enough $n$, uniformly in $x \in E_n^c$, we have the bound

$$p_n(x) \leq 2\exp(-\varsigma a\log n). \qquad (2.60)$$

We shall show below that $\lambda(C_n(c)\,\Delta\, C(c)) \leq m < \infty$ for some $0 < m < \infty$. Assuming this to be true, we have the following (similar lines of argument are used in Rigollet and Vert, 2008):

$$\int_{E_n^c} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx = E\int_{E_n^c\cap(C_n(c)\Delta C(c))} g(x)\, dx$$
$$\leq \sup_{A : \lambda(A) \leq m}\int_{E_n^c\cap A} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx$$
$$\leq m\,\sup_x g(x)\,\sup_{x \in E_n^c} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}| \leq m\,\sup_x g(x)\,\sup_{x \in E_n^c} p_n(x).$$

With $c_0 = m\sup_x g(x)$ and (2.60), this gives the bound

$$a_n\int_{E_n^c} E\,|I\{f_n(x) \geq c\} - I\{f(x) \geq c\}|\, g(x)\, dx \leq 2 c_0\, a_n\exp(-\varsigma a\log n).$$

Clearly by (H) we see that for large enough $\varsigma > 0$,

$$a_n\exp(-\varsigma a\log n) \to 0,$$

and thus (2.59) follows.
It remains to verify that there exists $0 < m < \infty$ with $\lambda(C_n(c)\,\Delta\, C(c)) \leq m$. Notice that

$$1 \geq \int_{C_n(c)} f_n(x)\, dx \geq c\,\lambda(C_n(c)) \quad\text{and}\quad 1 \geq \int_{C(c)} f(x)\, dx \geq c\,\lambda(C(c)).$$

Thus $\lambda(C_n(c)\,\Delta\, C(c)) \leq 2/c =: m$. We see now that the proof of the theorem in the case $k = 1$ and $d \geq 2$ is complete.

The proof for the case $k \geq 2$ goes through by an obvious extension of the argument used in the case $k = 1$. On account of (B.ii) we can write for large enough $n$,

$$\int_{E_n}\Delta_n(x)\, dG(x) = \sum_{j=1}^{k}\int_{E_{j,n}}\Delta_n(x)\, dG(x),$$

where the sets $E_{j,n}$, $j = 1, \ldots, k$, are disjoint and constructed from the $\beta_j$ just as $E_n$ was formed from the boundary set $\beta$ in the proof for the case $k = 1$. Therefore, by reason of the Poissonization, the summands are independent. Hence the asymptotic normality readily follows as before, where the limiting variance in (2.1) becomes

$$\sigma^2 = \sum_{j=1}^{k} c_j^2\,\sigma_j^2, \qquad (2.61)$$

where each $\sigma_j^2$ is formed just like (2.57).

2.6 Proof of the theorem in the case $d = 1$

The case $d = 1$ follows along very similar ideas as presented above in the case $d \geq 2$, and is in fact somewhat simpler. We therefore skip all the details and only point out that by assumption (B.ii) the boundary set $\beta = \{x \in \mathbb R : f(x) = c\}$ consists of $k$ points $z_i$, $i = 1, \ldots, k$. Therefore, the integral over $\theta$ in the definition of $\sigma^2$ in (2.57) has to be replaced by a sum, leading to

$$\sigma^2 := \sum_{i=1}^{k}\big(g^{(1)}(z_i)\big)^2\int_{-\infty}^{\infty}\int_{-1}^{1}\Gamma(i, s, t)\,|s|^{2/\gamma_g}\, dt\, ds, \qquad (2.62)$$

where

$$\Gamma(i, s, t) = \mathrm{cov}\Big(\Big|I\Big\{Z_1 \geq -\frac{s f'(z_i)}{\sqrt c\,\|K\|_2}\Big\} - I\Big\{0 \geq -\frac{s f'(z_i)}{\sqrt c\,\|K\|_2}\Big\}\Big|,\; \Big|I\Big\{\rho(t) Z_1 + \sqrt{1 - \rho^2(t)}\, Z_2 \geq -\frac{s f'(z_i)}{\sqrt c\,\|K\|_2}\Big\} - I\Big\{0 \geq -\frac{s f'(z_i)}{\sqrt c\,\|K\|_2}\Big\}\Big|\Big).$$

We can drop the absolute value sign on $f'(z_i)$ in our definition of $\Gamma(i, s, t)$ for $i = 1, \ldots, k$, and thus in $\sigma^2$, since $\rho(t) = \rho(-t)$.

2.7 Remarks on the variance and its estimation

Clearly the variance $\sigma_G^2$ that appears in our Theorem does not have a nice closed form, and in many situations it is not feasible to calculate. Therefore in applications $\sigma_G^2$ will very likely have to be estimated, either by simulation or from the data itself. In the latter case, an obvious suggestion is to apply the bootstrap, and another is to use the jackknife. A separate investigation is required to verify that these methods work in this setup. (Similarly, we may also need to estimate $E G(C_n(c)\,\Delta\, C(c))$.)

Here is a quick and dirty way to estimate $\sigma_G^2$. Let $X_1, \ldots, X_n$ be i.i.d. $f$. Choose a sequence of integers $1 \leq m_n \leq n$ such that $m_n \to \infty$ and $n/m_n \to \infty$. Set $\varsigma_n = [n/m_n]$, take a random sample of the data $X_1, \ldots, X_n$ of size $m_n\varsigma_n$, and then randomly divide this sample into $\varsigma_n$ disjoint samples of size $m_n$. Let

$$\xi_i = d_G\big(C_{m_n}^{(i)}(c),\, C(c)\big) \quad\text{for } i = 1, \ldots, \varsigma_n,$$

where $C_{m_n}^{(i)}(c)$ is formed from sample $i$. We propose as our estimator of $\sigma_G^2$ the sample variance of $\big(\frac{m_n}{h_{m_n}}\big)^{1/4}\xi_i$, $i = 1, \ldots, \varsigma_n$,

$$\Big(\frac{m_n}{h_{m_n}}\Big)^{1/2}\sum_{i=1}^{\varsigma_n}\big(\xi_i - \bar\xi\big)^2\big/(\varsigma_n - 1).$$

Under suitable regularity conditions it is routine to show that this is a consistent estimator of $\sigma_G^2$; again, the details are beyond the scope of this paper.
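For concreteness, here is a sketch (ours) of the subsampling estimator just proposed, in the simplest setting $d = 1$, $G = \lambda$ and $f$ standard normal, so that $C(c)$ is known and $d_\lambda$ can be evaluated on a grid. The choices $m_n = \lfloor n^{0.6}\rfloor$ and $h_m = m^{-1/3}$ are arbitrary admissible examples.

```python
# Sketch of the "quick and dirty" variance estimator: split the sample into
# varsigma_n disjoint subsamples of size m_n, compute xi_i on each, and take
# (m_n/h_{m_n})^{1/2} times the sample variance of the xi_i.
import numpy as np

def epan(t):
    return np.where(np.abs(t) <= 0.5, 1.5 * (1.0 - (2.0 * t) ** 2), 0.0)

def d_lambda(sample, c, grid, f_grid):
    m, h = len(sample), len(sample) ** (-1 / 3)
    fn = epan((grid[:, None] - sample[None, :]) / h).sum(axis=1) / (m * h)
    return np.abs((fn >= c) ^ (f_grid >= c)).sum() * (grid[1] - grid[0])

rng = np.random.default_rng(4)
n, c = 20000, 0.15
X = rng.standard_normal(n)
grid = np.linspace(-5.0, 5.0, 1001)
f_grid = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
m = int(n ** 0.6)                  # m_n -> infinity while n/m_n -> infinity
s = n // m                         # varsigma_n = [n/m_n] disjoint subsamples
idx = rng.permutation(n)[: m * s].reshape(s, m)
xi = np.array([d_lambda(X[rows], c, grid, f_grid) for rows in idx])
h_m = m ** (-1 / 3)
sigma2_hat = (m / h_m) ** 0.5 * xi.var(ddof=1)
print(f"estimate of sigma_lambda^2: {sigma2_hat:.4f}")
```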
The variance $\sigma_G^2$ under a bivariate normal model. In order to obtain a better understanding of the expression for the variance, we consider it in the following simple bivariate normal example. Assume

$$f(x, y) = \frac{1}{2\pi}\exp\Big(-\frac{x^2 + y^2}{2}\Big), \quad (x, y) \in \mathbb R^2.$$

A special case of our Theorem says that whenever

$$n h_n^2 \to 0 \quad\text{and}\quad n h_n/\log n \to \infty \qquad (2.63)$$

(here $\gamma = 0$), then

$$\Big(\frac{n}{h_n}\Big)^{1/4}\{\lambda(C_n(c)\,\Delta\, C(c)) - E\lambda(C_n(c)\,\Delta\, C(c))\} \to_d \sigma_\lambda Z. \qquad (2.64)$$

We shall calculate $\sigma_\lambda$ in this case. We get that

$$f'(x, y) = -(x, y)\, f(x, y).$$

Notice for any $0 < c < \frac{1}{2\pi}$,

$$\beta = \{(x, y) : x^2 + y^2 = -2\log(2\pi c)\}.$$

Setting $r(c) = \sqrt{-2\log(2\pi c)}$, we see that $\beta$ is the circle with center $0$ and radius $r(c)$. Choosing the obvious diffeomorphism

$$y(\theta) = (r(c)\cos\theta,\, r(c)\sin\theta), \quad \theta \in [0, 2\pi],$$

we get that for $\theta \in [0, 2\pi]$,

$$u(\theta) = (-\cos\theta, -\sin\theta), \quad y'(\theta) = (-r(c)\sin\theta,\, r(c)\cos\theta),$$

and

$$\iota(\theta) = \Big|\det\begin{pmatrix} -r(c)\sin\theta & r(c)\cos\theta\\ -\cos\theta & -\sin\theta\end{pmatrix}\Big| = r(c).$$

Here $g = 1$ and we are assuming $\gamma = 0$. We get that

$$c(s, \theta, \gamma t) = c(s, \theta, 0) = -\frac{s\,|f'(y(\theta))|}{\sqrt c\,\|K\|_2} = -\frac{s\, r(c)\sqrt c}{\|K\|_2}.$$

Thus

$$\Gamma(\theta, s, t) = \mathrm{cov}\Big(\Big|I\Big\{Z_1 \geq -\frac{s\, r(c)\sqrt c}{\|K\|_2}\Big\} - I\Big\{0 \geq -\frac{s\, r(c)\sqrt c}{\|K\|_2}\Big\}\Big|,\; \Big|I\Big\{\rho(t) Z_1 + \sqrt{1 - \rho^2(t)}\, Z_2 \geq -\frac{s\, r(c)\sqrt c}{\|K\|_2}\Big\} - I\Big\{0 \geq -\frac{s\, r(c)\sqrt c}{\|K\|_2}\Big\}\Big|\Big).$$

This gives

$$\sigma_\lambda^2 = r(c)\int_{-\infty}^{\infty}\int_0^{2\pi}\int_B \Gamma(\theta, s, t)\, dt\, d\theta\, ds.$$

Set

$$\Upsilon(\theta, u, t) = \mathrm{cov}\Big(|I\{Z_1 \geq -u\} - I\{0 \geq -u\}|,\; \big|I\big\{\rho(t) Z_1 + \sqrt{1 - \rho^2(t)}\, Z_2 \geq -u\big\} - I\{0 \geq -u\}\big|\Big).$$

We see then, by the change of variables $u = \frac{s\, r(c)\sqrt c}{\|K\|_2}$, that

$$\sigma_\lambda^2 = \frac{\|K\|_2}{\sqrt c}\int_{-\infty}^{\infty}\int_0^{2\pi}\int_B \Upsilon(\theta, u, t)\, dt\, d\theta\, du.$$

For comparison, Theorem 2.1 of Cadre (2006) says that if

$$n h_n/(\log n)^{16} \to \infty \quad\text{and}\quad n h_n^3(\log n)^2 \to 0, \qquad (2.65)$$

then

$$\sqrt{n h_n}\;\lambda(C_n(c)\,\Delta\, C(c)) \to_P \|K\|_2\sqrt{\frac{2 c}{\pi}}\int_\beta \frac{d\mathcal H}{\|\nabla f\|} = 2\,\|K\|_2\sqrt{\frac{2\pi}{c}}. \qquad (2.66)$$

The measure $d\mathcal H$ denotes the Hausdorff measure on $\beta$; in this case $\mathcal H(\beta)$ is the circumference of $\beta$.

Observe that since $\sqrt{n h_n}\,\big(\frac{h_n}{n}\big)^{1/4} = (n h_n^2)^{1/4} h_n^{1/4} \to 0$, (2.64) and (2.66) imply that whenever (2.63) and (2.65) hold, we get

$$\sqrt{n h_n}\; E\lambda(C_n(c)\,\Delta\, C(c)) \to 2\,\|K\|_2\sqrt{\frac{2\pi}{c}}.$$

(Notice that the choice $h_n = 1/\sqrt{n\log n}$ satisfies both (2.63) and (2.65).)

APPENDIX

In this appendix we supply a number of technical details that are needed in our proofs.

Detail 1. The proof of the identity in (2.4) is as follows:

$$\int_0^{\infty}\int_{C_n(c)\Delta C(c)} |f(x) - c|^{p-1}\, dx\, dc$$
$$= \int_{\mathbb R^d}\int_0^{\infty} I\{(x, c) : f(x) < c \leq f_n(x)\}\,(c - f(x))^{p-1}\, dc\, dx + \int_{\mathbb R^d}\int_0^{\infty} I\{(x, c) : f_n(x) < c \leq f(x)\}\,(f(x) - c)^{p-1}\, dc\, dx$$
$$= \int_{\{x : f(x) \leq f_n(x)\}}\int_{f(x)}^{f_n(x)}(c - f(x))^{p-1}\, dc\, dx + \int_{\{x : f_n(x) \leq f(x)\}}\int_{f_n(x)}^{f(x)}(f(x) - c)^{p-1}\, dc\, dx$$
$$= \int_{\{x : f(x) \leq f_n(x)\}}\int_0^{f_n(x) - f(x)} c^{p-1}\, dc\, dx + \int_{\{x : f_n(x) \leq f(x)\}}\int_0^{f(x) - f_n(x)} c^{p-1}\, dc\, dx$$
$$= \frac1p\int_{\{x : f(x) \leq f_n(x)\}}\big(f_n(x) - f(x)\big)^p\, dx + \frac1p\int_{\{x : f_n(x) \leq f(x)\}}\big(f(x) - f_n(x)\big)^p\, dx = \frac1p\int_{\mathbb R^d}|f_n(x) - f(x)|^p\, dx.$$

Detail 2. Notice that (K.i), (K.ii), (2.7) and (2.8) imply, after a change of variables and an application of Taylor's formula to $f(x + h_n^{1/d} v) - f(x)$, that for some constant $A_1 > 0$,

$$\sup_{n \geq 2}\, h_n^{-2/d}\,\sup_{x \in \mathbb R^d}|E f_n(x) - f(x)| \leq A_1. \qquad (A.1)$$

Thus we see by (H) that there exists a $\gamma_2 > 0$ such that

$$\sup_{n \geq 2}\,\sqrt{n h_n}\,\sup_{x \in \mathbb R^d}|E f_n(x) - f(x)| \leq A_1\sqrt{n h_n}\, h_n^{2/d} \leq \gamma_2\, h_n^{1/d}. \qquad (A.2)$$

Detail 3. By (2.7), for all $x, v \in \mathbb R^d$,

$$\Big|f(x + v) - f(x) - \sum_{i=1}^{d}\frac{\partial f(x)}{\partial x_i}\, v_i\Big| \leq 2 A\,|v|^2. \qquad (A.3)$$

Thus we see that for $1 \leq |s| \leq \varsigma\sqrt{\log n}$, with $n$ large enough,

$$\sqrt{n h_n}\,\Big|c - f\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}\Big)\Big| = \sqrt{n h_n}\,\Big|f(y(\theta)) - f\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}\Big)\Big|$$
$$\geq |s\, f'(y(\theta))\cdot u(\theta)| - \frac{2 A s^2}{\sqrt{n h_n}} \geq |s|\inf_{\theta \in I_d}|f'(y(\theta))| - \frac{2 A s^2}{\sqrt{n h_n}} \geq |s|\Big(\rho_0 - \frac{2 A\varsigma\sqrt{\log n}}{\sqrt{n h_n}}\Big),$$

where for the last inequality we use (2.11). Since (H) implies $\frac{2 A\varsigma\sqrt{\log n}}{\sqrt{n h_n}} \to 0$, we readily see that there is an $0 < \eta < 1$ such that for all large enough $n$ we have, for all $\theta \in I_d$ and $1 \leq |s| \leq \varsigma\sqrt{\log n}$,

$$\frac{\sqrt{n h_n}}{2}\,\Big|c - f\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}\Big)\Big| \geq \eta\,|s|. \qquad (A.4)$$

Detail 4. Let

$$\rho_n(x, x + t h_n^{1/d}) = \mathrm{Cov}\big(\bar\pi_n(x), \bar\pi_n(x + t h_n^{1/d})\big) = \frac{h_n^{-1}\, E\Big[K\big(\frac{x - X}{h_n^{1/d}}\big)\, K\big(\frac{x - X}{h_n^{1/d}} + t\big)\Big]}{\sqrt{h_n^{-1} E K^2\big(\frac{x - X}{h_n^{1/d}}\big)\; h_n^{-1} E K^2\big(\frac{x - X}{h_n^{1/d}} + t\big)}}.$$

Choose any $\theta \in I_d$, $|s| \leq \tau$, and $x = y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}}$.
We see that, with the change of variable $y = x - u h_n^{1/d}$,

$$\rho_n(x, x + t h_n^{1/d}) = \rho_n\Big(y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}},\; y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}} + t h_n^{1/d}\Big) = \frac{\int_{\mathbb R^d} K(u)\, K(u + t)\, f(y_n(s, \theta, u))\, du}{\sqrt{\int_{\mathbb R^d} K^2(u)\, f(y_n(s, \theta, u))\, du\;\int_{\mathbb R^d} K^2(u + t)\, f(y_n(s, \theta, u))\, du}},$$

where $y_n(s, \theta, u) = y(\theta) + \frac{s\, u(\theta)}{\sqrt{n h_n}} - u h_n^{1/d}$, which by the bounded convergence theorem converges, as $n \to \infty$, to

$$\frac{\int_{\mathbb R^d} K(u)\, K(u + t)\, f(y(\theta))\, du}{\int_{\mathbb R^d} K^2(u)\, f(y(\theta))\, du} = \frac{\int_{\mathbb R^d} K(u)\, K(u + t)\, du}{\int_{\mathbb R^d} K^2(u)\, du} =: \rho(t).$$

Detail 5. We shall show that for $|s| \leq \tau$, $u = s\, u(\theta)$, $x = y(\theta) + \frac{u}{\sqrt{n h_n}}$ and $\theta \in I_d$,

$$c_n(x) = \frac{\sqrt{n h_n}\,(c - E f_n(x))}{\sqrt{\frac1{h_n} E K^2\big(\frac{x - X}{h_n^{1/d}}\big)}} = c_n\Big(y(\theta) + \frac{u}{\sqrt{n h_n}}\Big) \to -\frac{f'(y(\theta))\cdot u}{\sqrt{f(y(\theta))}\,\|K\|_2} = -\frac{s\,|f'(y(\theta))|}{\sqrt c\,\|K\|_2}.$$

First, the argument in Detail 4 implies that

$$\sqrt{\frac1{h_n}\, E K^2\Bigg(\frac{y(\theta) + \frac{u}{\sqrt{n h_n}} - X}{h_n^{1/d}}\Bigg)} \to \sqrt c\,\|K\|_2.$$

Next, by (A.1) and (H),

$$\sqrt{n h_n}\Big(c - E f_n\Big(y(\theta) + \frac{u}{\sqrt{n h_n}}\Big)\Big) = \sqrt{n h_n}\Big(c - f\Big(y(\theta) + \frac{u}{\sqrt{n h_n}}\Big)\Big) + o(1),$$

and clearly by (A.3),

$$\sqrt{n h_n}\Big(c - f\Big(y(\theta) + \frac{u}{\sqrt{n h_n}}\Big)\Big) \to -s\,|f'(y(\theta))|.$$

Detail 6. Here we provide the proof of Lemma 2.2. Consider the characteristic function

$$\phi_n(t, u) := E\exp(i t S_n + i u N_n) = \sum_{k=0}^{\infty} e^{i u k}\, E(\exp(i t S_n)\mid N_n = k)\, P(N_n = k).$$

From this we see by Fourier inversion that the conditional characteristic function of $S_n$, given $N_n = n$, is

$$\psi_n(t) := E(\exp(i t S_n)\mid N_n = n) = \frac{1}{2\pi P(N_n = n)}\int_{-\pi}^{\pi} e^{-i u n}\,\phi_n(t, u)\, du.$$

Applying Stirling's formula, we obtain as $n \to \infty$,

$$2\pi P(N_n = n) = 2\pi e^{-n} n^n/n! \sim (2\pi/n)^{1/2}, \qquad (A.5)$$

which, after changing variables from $u$ to $v/\sqrt n$ and using assumption (i), gives

$$\psi_n(t) = (2\pi)^{-1/2}(1 + o(1))\int_{-\pi\sqrt n}^{\pi\sqrt n} E\exp(i t S_n + i v U_n)\, E\exp(i v V_n)\, dv.$$

We shall deduce our proof from this expression for the conditional characteristic function $\psi_n(t)$, after we have collected some facts about the asymptotic behavior of the components of $\psi_n(t)$. First, by assumption (iii) we have $U_n \to_P 0$, which in combination with (ii) implies

$$E\exp(i t S_n + i v U_n) \to \phi(t), \qquad (A.6)$$

where $\phi(t) = \exp(-\sigma^2 t^2/2)$. Next, by arguing as in the proof of Theorem 3 on pages 490-491 of Feller (1966), we see that as $n \to \infty$,

$$\int_{-\pi\sqrt n}^{\pi\sqrt n}|E\exp(i v V_n) - \exp(-v^2/2)|\, dv + \int_{|v| > \pi\sqrt n}\exp(-v^2/2)\, dv \to 0, \qquad (A.7)$$

which, since $|E\exp(i t S_n + i v U_n)| \leq 1$, implies that

$$\int_{-\pi\sqrt n}^{\pi\sqrt n}\big|E\exp(i t S_n + i v U_n)\big|\,\big|E\exp(i v V_n) - \exp(-v^2/2)\big|\, dv \to 0. \qquad (A.8)$$

Here is the argument that shows (A.7). Now

$$\tilde\phi_n(t) := E\exp(i t V_n) = \exp\Big(n(1 - \beta_n)\Big(\exp\Big(\frac{i t}{\sqrt n}\Big) - 1 - \frac{i t}{\sqrt n}\Big)\Big).$$

The elementary inequality $\cos x - 1 \leq -2 x^2/\pi^2$, valid for $|x| \leq \pi$, gives

$$|\tilde\phi_n(t)| = \exp\Big(n(1 - \beta_n)\Big(\cos\Big(\frac{t}{\sqrt n}\Big) - 1\Big)\Big) \leq \exp\Big(-\frac{2(1 - \beta_n)\, t^2}{\pi^2}\Big),$$

which for all large enough $n$ and $|t| \leq \pi\sqrt n$ is

$$\leq \exp\Big(-\frac{t^2}{6}\Big).$$

Now, since $\tilde\phi_n(t) \to \exp(-t^2/2)$ for all $t$, we infer by setting

$$g_n(t) = \tilde\phi_n(t)\, I\{|t| \leq \pi\sqrt n\}$$

and applying the Lebesgue dominated convergence theorem that

$$\int_{\mathbb R}|g_n(t) - \exp(-t^2/2)|\, dt \to 0, \quad \text{as } n \to \infty,$$

which yields (A.7). To complete the proof of the lemma, we get by putting (A.6) and (A.8) together with the Lebesgue dominated convergence theorem that

$$\psi_n(t) \to \frac{1}{\sqrt{2\pi}}\,\phi(t)\int_{\mathbb R}\exp(-v^2/2)\, dv = \exp(-\sigma^2 t^2/2). \qquad \square$$

Detail 7. First we show (2.51). For any measurable set $C \subset I_d$ define

$$D_n(\tau, C) = \bigcup_{\theta \in C}\Big(y(\theta) + \frac{\tau}{\sqrt{n h_n}}\, B\Big).$$

We see that, with $|A| = \lambda(A)$ for a measurable subset $A$ of $\mathbb R^d$,

$$|D_n(\tau, C)| = \int_{D_n(\tau, C)} dx,$$

which by the change of variables (2.29) is equal to

$$\int_{-\tau}^{\tau}\int_C |J_n(\theta, s)|\, d\theta\, ds,$$

where $J_n(\theta, s)$ is defined as in (2.30). Since, uniformly in $(\theta, s) \in I_d \times [-\tau, \tau]$,

$$\sqrt{n h_n}\,|J_n(\theta, s)| \to \iota(\theta),$$

and there exists a constant $0 < b/2 < \infty$ such that $0 \leq \iota(\theta) \leq b/2$, we see that for all large enough $n \geq 1$, for all $(\theta, s) \in I_d \times [-\tau, \tau]$,

$$0 \leq \sqrt{n h_n}\,|J_n(\theta, s)| \leq b.$$
This implies that

$$\sqrt{n h_n}\,|D_n(\tau, C)| \to 2\tau\int_C \iota(\theta)\, d\theta,$$

and for all large enough $n \geq 1$,

$$0 \leq \sqrt{n h_n}\,|D_n(\tau, C)| < 2\tau b\,|C|.$$

Since $D_n(\tau) = D_n(\tau, I_d)$, we get, in particular, that for all large enough $n \geq 1$,

$$0 \leq |D_n(\tau)| \leq \frac{2\tau b\,|I_d|}{\sqrt{n h_n}} = \frac{4\tau b\,\pi^{d-1}}{\sqrt{n h_n}}. \qquad (A.9)$$

This establishes (2.51).

Next, we construct the partition asserted after (2.54). Consider the regular grid given by

$$A_{\mathbf i} = (x_{i_1}, x_{i_1 + 1}] \times \cdots \times (x_{i_d}, x_{i_d + 1}],$$

where $\mathbf i = (i_1, \ldots, i_d)$, $i_1, \ldots, i_d \in \mathbb Z$, and $x_i = i\, h_n^{1/d}$ for $i \in \mathbb Z$. Define $R_{\mathbf i} = A_{\mathbf i}\cap D_n(\tau)$. With $J_n = \{\mathbf i : A_{\mathbf i}\cap D_n(\tau) \neq \emptyset\}$ we see that $\{R_{\mathbf i} : \mathbf i \in J_n\}$ constitutes a partition of $D_n(\tau)$ with $|R_{\mathbf i}| \leq h_n$. Further we have

$$\bigcup_{\mathbf i \in J_n} R_{\mathbf i} \subset \bigcup_{\mathbf i \in J_n} A_{\mathbf i} \subset D_n\big(\tau + \sqrt d\, h_n^{1/d}\sqrt{n h_n}\big).$$

The inequality $\big|\bigcup_{\mathbf i \in J_n} A_{\mathbf i}\big| \leq \big|D_n\big(\tau + \sqrt d\, h_n^{1/d}\sqrt{n h_n}\big)\big|$ implies via (A.9) that, with $m_n := |J_n|$ and for $n$ large enough,

$$m_n h_n \leq \frac{5\tau b\,\pi^{d-1}}{\sqrt{n h_n}},$$

giving

$$m_n \leq \frac{5\tau b\,\pi^{d-1}}{\sqrt{n h_n^3}},$$

which is (2.54). Observing the fact that if $B_1, \ldots, B_k$ are disjoint sets such that for $1 \leq i \neq j \leq k$,

$$\inf\{|x - y| : x \in B_i,\, y \in B_j\} > h_n^{1/d}, \qquad (A.10)$$

then

$$\int_{B_i}\Delta_n(x)\, g(x)\, dx, \quad i = 1, \ldots, k, \text{ are independent}, \qquad (A.11)$$

we can easily infer that

$$X_{\mathbf i, n} := \Big(\frac{n}{h_n}\Big)^{1/4}\big(\sqrt{n h_n}\big)^{1/\gamma_g}\,\frac{\int_{R_{\mathbf i}}\{\Delta_n(x) - E\Delta_n(x)\}\, g(x)\, dx}{\sigma_n(\tau)}$$

constitutes a 1-dependent random field on $\mathbb Z^d$.

Acknowledgement. The authors would like to thank Gerard Biau for pointing out to them the Tubular Neighborhood Theorem, and also the Associate Editor and the referee for useful suggestions and a careful reading of the manuscript.

References

Baillo, A., Cuevas, A. and Justel, A. (2000). Set estimation and nonparametric detection. Canad. J. Statist. 28, 765-782.

Baillo, A., Cuestas-Albertos, J.A. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Statist. Probab. Lett. 53, 27-35.

Baillo, A. (2003). Total error in a plug-in estimator of level sets. Statist. Probab. Lett. 65, 411-417.

Baillo, A. and Cuevas, A. (2006). Image estimators based on marked bins. Statistics 40(4), 277-288.

Beirlant, J. and Mason, D. M. (1995). On the asymptotic normality of Lp-norms of empirical functionals. Math. Methods Statist. 4, 1-19.

Biau, G., Cadre, B. and Pelletier, B. (2008). Exact rates in density support estimation. J. Multivariate Anal. In press.

Bredon, G. E. (1993). Topology and Geometry. Volume 139 of Graduate Texts in Mathematics. Springer-Verlag, New York.

Cadre, B. (2006). Kernel estimation of density level sets. J. Multivariate Anal. 97, 999-1023.

Cavalier, L. (1997). Nonparametric estimation of regression level sets. Statistics 29, 131-160.

Cuevas, A., Febrero, M. and Fraiman, R. (2000). Estimating the number of clusters. Canad. J. Statist. 28, 367-382.

Cuevas, A., Gonzalez-Manteiga, W. and Rodriguez-Casal, A. (2006). Plug-in estimation of general level sets. Aust. N. Z. J. Stat. 48(1), 7-19.

Desforges, M.J., Jacob, P.J. and Cooper, J.E. (1998). Application of probability density estimation to the detection of abnormal conditions in engineering. Proceedings of the Institute of Mechanical Engineering 212, 687-703.

Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33, 1380-1403.

Fan, W., Miller, M., Stolfo, S.J., Lee, W. and Chan, P.K. (2001). Using artificial anomalies to detect unknown and known network intrusions. In: IEEE International Conference on Data Mining (ICDM'01), IEEE Computer Society, 123-130.

Feller, W. (1966). An Introduction to Probability Theory and Its Applications, Volume II. Wiley, New York.
Acknowledgement

The authors would like to thank Gerard Biau for pointing out to them the Tubular Neighborhood Theorem, and also the Associate Editor and the referee for useful suggestions and a careful reading of the manuscript.

References

Baillo, A., Cuevas, A. and Justel, A. (2000). Set estimation and nonparametric detection. Canad. J. Statist. 28, 765-782.

Baillo, A., Cuesta-Albertos, J.A. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Stat. Probab. Lett. 53, 27-35.

Baillo, A. (2003). Total error in a plug-in estimator of level sets. Stat. Probab. Lett. 65, 411-417.

Baillo, A. and Cuevas, A. (2006). Image estimators based on marked bins. Statistics 40(4), 277-288.

Beirlant, J. and Mason, D. M. (1995). On the asymptotic normality of Lp-norms of empirical functionals. Math. Methods of Statistics 4, 1-19.

Biau, G., Cadre, B. and Pelletier, B. (2008). Exact rates in density support estimation. J. Multivariate Anal. In press.

Bredon, G. E. (1993). Topology and Geometry. Volume 139 of Graduate Texts in Mathematics. Springer-Verlag, New York.

Cadre, B. (2006). Kernel estimation of density level sets. J. Multivariate Anal. 97, 999-1023.

Cavalier, L. (1997). Nonparametric estimation of regression level sets. Statistics 29, 131-160.

Cuevas, A., Febrero, M. and Fraiman, R. (2000). Estimating the number of clusters. Canad. J. Statist. 28, 367-382.

Cuevas, A., Gonzalez-Manteiga, W. and Rodriguez-Casal, A. (2006). Plug-in estimation of general level sets. Australian and New Zealand J. Statist. 48(1), 7-19.

Desforges, M.J., Jacob, P.J. and Cooper, J.E. (1998). Application of probability density estimation to the detection of abnormal conditions in engineering. In: Proceedings of the Institution of Mechanical Engineers 212, 687-703.

Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33, 1380-1403.

Fan, W., Miller, M., Stolfo, S.J., Lee, W. and Chan, P.K. (2001). Using artificial anomalies to detect unknown and known network intrusions. In: IEEE International Conference on Data Mining (ICDM'01), IEEE Computer Society, 123-130.

Feller, W. (1966). An Introduction to Probability Theory and Its Applications, Volume II. Wiley, New York.

Gardner, A.B., Krieger, A.M., Vachtsevanos, G. and Litt, B. (2006). One-class novelty detection for seizure analysis from intracranial EEG. J. Machine Learning Research 7, 1025-1044.

Gayraud, G. and Rousseau, J. (2005). Rates of convergence for a Bayesian level set estimation. Scand. J. Statist. 32(4), 639-660.

Gerig, G., Jomier, M. and Chakos, M. (2001). VALMET: a new validation tool for assessing and improving 3D object segmentation. In: Niessen, W. and Viergever, M. (eds.), Medical Image Computing and Computer-Assisted Intervention, MICCAI 2001, LNCS 2208, Springer, New York, 516-523.

Giné, E., Mason, D. M. and Zaitsev, A. (2003). The L1-norm density estimator process. Ann. Probab. 31, 719-768.

Goldenshluger, A. and Zeevi, A. (2004). The Hough transform estimator. Ann. Statist. 32(5), 1908-1932.

Hall, P. and Kang, K.-H. (2005). Bandwidth choice for nonparametric classification. Ann. Statist. 33, 284-306.

Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.

Hartigan, J.A. (1987). Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc. 82, 267-270.

Horvath, L. (1991). On Lp-norms of multivariate density estimators. Ann. Statist. 19(4), 1933-1949.

Huo, X. and Lu, J.-C. (2004). A network flow approach in finding maximum likelihood estimate of high concentration regions. Comput. Statist. Data Anal. 46, 33-56.

Jang, W. (2006). Nonparametric density estimation and clustering in astronomical sky surveys. Comput. Statist. Data Anal. 50, 760-774.

Johnson, W. B., Schechtman, G. and Zinn, J. (1985). Best constants in moment inequalities for linear combinations of independent and exchangeable random variables. Ann. Probab. 13, 234-253.

King, S.P., King, D.M., Anuzis, P., Astley, K., Tarassenko, L., Hayton, P. and Utete, S. (2002). The use of novelty detection techniques for monitoring high-integrity plant. In: Proceedings of the 2002 International Conference on Control Applications, Vol. 1, Cancun, Mexico, 221-226.

Klemelä, J. (2004). Visualization of multivariate density estimates with level set trees. J. Comput. Graph. Statist. 13(3), 599-620.

Klemelä, J. (2006a). Visualization of multivariate density estimates with shape trees. J. Comput. Graph. Statist. 15(2), 372-397.

Klemelä, J. (2006b). Visualization of scales of multivariate density estimates. Unpublished manuscript.

Lang, S. (1997). Undergraduate Analysis. Second edition. Springer, New York.

Markou, M. and Singh, S. (2003). Novelty detection: a review - part 1: statistical approaches. Signal Processing 83, 2481-2497.

Molchanov, I.S. (1998). A limit theorem for solutions of inequalities. Scand. J. Statist. 25, 235-242.

Nairac, A., Townsend, N., Carr, R., King, S., Cowley, L. and Tarassenko, L. (1997). A system for the analysis of jet engine vibration data. Integrated Comput. Aided Eng. 6, 53-65.

Polonik, W. (1995). Measuring mass concentration and estimating density contour clusters: an excess mass approach. Ann. Statist. 23, 855-881.

Prastawa, M., Bullitt, E., Ho, S. and Gerig, G. (2003). Robust estimation for brain tumor segmentation. In: MICCAI Proceedings, LNCS 2879, 530-537.

Rigollet, P. and Vert, R. (2008). Fast rates for plug-in estimators of density level sets. Available at arxiv.org/pdf/math/0611473.

Roederer, M. and Hardy, R.R. (2001). Frequency difference gating: A multivariate method for identifying subsets that differ between samples. Cytometry 45, 56-64.

Rosenblatt, M. (1975). The quadratic measure of deviation of two-dimensional density estimates and a test of independence. Ann. Statist. 3(1), 1-14.

Scott, C.D. and Davenport, M. (2006). Regression level set estimation via cost-sensitive classification. To appear in IEEE Trans. Inf. Theory.
Scott, C.D. and Nowak, R.D. (2006). Learning minimum volume sets. J. Machine Learning Research 7, 665-704.

Shergin, V. V. (1990). The central limit theorem for finitely dependent random variables. In: Proc. 5th Vilnius Conf. Probability and Mathematical Statistics, Vol. II (Grigelionis, B., Prohorov, Y.V., Sazonov, V.V. and Statulevicius, V., eds.), 424-431.

Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley, New York.

Steinwart, I., Hush, D. and Scovel, C. (2004). Density level detection is classification. Technical Report, Los Alamos National Laboratory.

Steinwart, I., Hush, D. and Scovel, C. (2005). A classification framework for anomaly detection. J. Machine Learning Research 6, 211-232.

Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimum spanning tree of a sample. J. Classification 20(5), 25-47.

Theiler, J. and Cai, D.M. (2003). Resampling approach for anomaly detection in multispectral images. In: Proceedings of the SPIE 5093, 230-240.

Tsybakov, A.B. (1997). On nonparametric estimation of density level sets. Ann. Statist. 25, 948-969.

Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32, 135-166.

Vert, R. and Vert, J.-P. (2006). Consistency and convergence rates of one-class SVM and related algorithms. J. Machine Learning Research 7, 817-854.

Walther, G. (1997). Granulometric smoothing. Ann. Statist. 25, 2273-2299.

Wand, M. (2005). Statistical methods for flow cytometric data. Presentation; available at http://www.uow.edu.au/~mwand/talks.html.

Willett, R.M. and Nowak, R.D. (2005). Level set estimation in medical imaging. To appear in: Proceedings of the IEEE Statistical Signal Processing Workshop, June 17-20, 2005, Bordeaux, France.

Willett, R.M. and Nowak, R.D. (2006). Minimax optimal level set estimation. Submitted to IEEE Trans. Image Proc. (Available at http://www.ee.duke.edu/~willett/).

Yeung, D.W. and Chow, C. (2002). Parzen window network intrusion detectors. In: Proceedings of the 16th International Conference on Pattern Recognition, Quebec, Canada, Vol. 4, 385-388.

Statistics Program
University of Delaware
206 Townsend Hall
Newark, DE 19717
USA
davidm@udel.edu

Department of Statistics
University of California, Davis
One Shields Avenue
Davis, CA 95616-8705
USA
wpolonik@ucdavis.edu