Chapter 12
Bracketing methods

c David Pollard, Asymptopia: 23 March 2001

SECTION 1 explains how the concept of bracketing may be thought of as a scheme to partition an index set of functions into finitely many regions, on each of which there exists a single approximating function with a guaranteed error bound.
SECTION 2 describes a refined technique of truncation and successive approximation, for the simplest case of an empirical process built from independent variables. It derives an L1 maximal inequality as a recursive bound relating the errors of successive approximations.
SECTION 3 presents some examples to illustrate the uses of the results from Section 2.
SECTION 4 generalizes the method from Section 2 to an abstract setting that includes several sorts of dependence between the summands. The arguments are stated in an abstract form that covers both independent and dependent summands. All details of possible dependence are hidden in two Assumptions, one concerning the norm used to express approximations to classes of functions, the other specifying the behaviour of maxima of finite collections of centered sums.
SECTION 5 studies the special case of phi-mixing sequences.
SECTION 6 studies the special case of absolutely regular (beta-mixing) sequences.
SECTION 7 points out the difficulties involved in the study of strong-mixing sequences.
SECTION 8 develops a maximal inequality for tail probabilities, by means of a modification of the methods from Section 4.

[§] 1. What is bracketing?

Bracketing arguments have long been used in empirical process theory. A very simple form of bracketing is often used in textbooks to prove the Glivenko-Cantelli theorem—the most basic example of a uniform law of large numbers. Consider the empirical distribution function Fn for a sample ξ1, ..., ξn from a distribution function F on the real line. That is, Fn(t) denotes the proportion of the observations less than or equal to t,

    Fn(t) = (1/n) Σ_{i≤n} {ξi ≤ t}    for each t in R.

The Glivenko-Cantelli theorem asserts that sup_t |Fn(t) − F(t)| converges to zero almost surely. The bracketing argument controls the contributions from an interval t1 ≤ t ≤ t2 by means of bounds that hold throughout the interval. For such t we have

    Fn(t1) − F(t2) ≤ Fn(t) − F(t) ≤ Fn(t2) − F(t1).

The two bounds converge almost surely to F(t1) − F(t2) and F(t2) − F(t1). If t2 and t1 are close enough together—meaning that the probability measure of the interval (t1, t2] is small enough—then all the Fn(t) − F(t) values, for t1 ≤ t ≤ t2, get squeezed close to the origin. If we cover the whole real line by a union of finitely many such intervals, we are able to deduce that sup_t |Fn(t) − F(t)| is eventually small.

There is a more fruitful way to think of the increment F(t2) − F(t1). If P is the probability measure corresponding to the distribution function F, the increment equals the L1(P) distance between the indicator functions of (−∞, t1] and (−∞, t2]. The concept of bracketing then has an obvious extension to arbitrary classes of (measurable) functions on a measurable space (X, A).

bracket.def1 <1> Definition. A pair of P-integrable functions ℓ ≤ u on X defines a bracket, [ℓ, u] := {g : ℓ(x) ≤ g(x) ≤ u(x) for all x}. For 1 ≤ q ≤ ∞, the bracketing number N^(q)_[ ](δ, P) for a subclass of functions F ⊆ Lq(P) is defined as the smallest value of N for which there exist brackets [ℓi, ui] with P(ui − ℓi)^q ≤ δ^q for i = 1, ..., N and F ⊆ ∪i [ℓi, ui]. [Need bracketing functions in Lq also?]

The Definition allows the possibility that the bracketing numbers might be infinite, but they are useful only when finite. The quantity N^(q)_[ ](δ, P) is also called the metric entropy with bracketing. Uniform approximations correspond to bracketing numbers for q = ∞. For proofs of uniform laws of large numbers, the bracketing numbers for q = 1 are more natural.
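The interval argument above can be sanity-checked numerically. The sketch below is a hypothetical setup, not from the text: uniform observations on [0, 1], so that F(t) = t and the endpoints with F-probability δ between them form an equally spaced grid. It verifies that the uniform deviation of Fn is controlled by finitely many endpoint deviations plus δ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 2000, 0.05
xs = rng.uniform(size=n)                 # sample from F(t) = t on [0, 1]

# Endpoints t_1 < t_2 < ... with F-probability at most delta between
# consecutive ones: for the uniform distribution, an equally spaced grid.
ts = np.arange(delta, 1.0, delta)

def Fn(t):
    """Empirical distribution function: proportion of observations <= t."""
    return np.mean(xs <= t)

dev_at_ends = max(abs(Fn(t) - t) for t in ts)

# Bracketing: for t in (t1, t2],
#   Fn(t1) - F(t2) <= Fn(t) - F(t) <= Fn(t2) - F(t1),
# so the uniform deviation is controlled by the finitely many endpoint
# deviations, up to the interval probability delta.
grid = np.linspace(0.0, 1.0, 2001)
sup_dev = max(abs(Fn(t) - t) for t in grid)
assert sup_dev <= dev_at_ends + delta
```

The final assertion holds deterministically, whatever the sample, because the endpoint bounds hold throughout each interval.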
As you will see later in this Chapter, q = 2 is better suited for approximations related to functional central limit theorems. In the earlier empirical process literature, central limit theorems for bounded classes of functions were often proved using bracketing numbers with q = 1; but extensions to unbounded classes do seem to require q = 2.

For classes of functions the role of the empirical distribution function is taken over by the empirical measure Pn, which puts mass 1/n at each of the n observations,

    Pn g = (1/n) Σ_{i≤n} g(ξi).

On the real line, Pn(−∞, t] = Fn(t).

uslln <2> Example. Suppose the {ξi} are sampled independently from a fixed probability distribution P, and suppose F is a class of measurable functions with N^(1)_[ ](δ, F, P) finite for each δ > 0. Deduce a uniform strong law of large numbers, sup_F |Pn f − P f| → 0 almost surely. [Complete proof]

lipschitz <3> Example. Bracketing arguments often appear disguised as smoothness assumptions in the statistics literature. Typically F is a parametric class of functions {f_t : t ∈ T} indexed by a bounded subset of a Euclidean space R^k. Suppose the functions satisfy a Lipschitz condition,

    |f_t(x) − f_s(x)| ≤ M(x)|t − s|^α,

for some fixed α > 0 and some fixed M in Lq(P). Write C for the Lq(P) norm of M. For some constant C0 there exists a set of N ≤ C0(1/δ^{1/α})^k points ... [Get Euclidean-style bracketing numbers]

More delicate bracketing arguments will be easier to describe if we recast the definition into a slightly different form. Suppose F ⊆ ∪i [ℓi, ui], a covering for a finite δ-bracketing. Consider an f in F. As a tie-breaking rule, choose the smallest k for which f ∈ [ℓk, uk]. Write Aδ f for (ℓk + uk)/2 and Bδ f for uk − ℓk. Then |f − Aδ f| ≤ Bδ f and ‖Bδ f‖ ≤ δ, for whatever norm is used to define the bracketing.
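The recast construction, with its smallest-k tie-breaking, midpoint Aδ f, and width Bδ f, can be sketched directly. The toy covering below (all specifics are illustrative: the class {|x − t| : t ∈ [0, 1]} on a grid of x values, covered by three brackets) shows infinitely many functions being mapped to finitely many approximating functions.

```python
import numpy as np

# A toy bracketing of the class {f_t(x) = |x - t| : t in [0, 1]},
# evaluated on a grid of x values; the three brackets are illustrative.
xs = np.linspace(0.0, 1.0, 101)
brackets = [(np.abs(xs - c) - 0.25, np.abs(xs - c) + 0.25)
            for c in (0.2, 0.5, 0.8)]

def A_B(f):
    """Tie-breaking rule: the smallest k with f in [l_k, u_k];
    return the midpoint A_delta f and the width B_delta f."""
    for l, u in brackets:
        if np.all((l <= f) & (f <= u)):
            return (l + u) / 2, u - l
    raise ValueError("f not covered by the bracketing")

# A continuum of functions f_t, but at most three approximating
# functions A_delta f, with |f - A_delta f| <= B_delta f pointwise.
reps = set()
for t in np.linspace(0.0, 1.0, 41):
    f = np.abs(xs - t)
    A, B = A_B(f)
    assert np.all(np.abs(f - A) <= B)
    reps.add(A.tobytes())
assert len(reps) <= 3
```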
Even if F is infinite, there are only finitely many different approximating functions Aδ f and bounding functions Bδ f. Indeed, the bracketing serves to partition F into finitely many regions, π = {F1, ..., FN}: if f1 and f2 belong to the same Fk then they share the same approximating and bounding functions. Put another way, as maps from F into finite collections of functions, both Aδ and Bδ take constant values on each member of the partition π; they are simple functions, in their dependence on f. In general, let us call a function on F π-simple if it takes a constant value on each Fi in π. The definition of bracketing, for an arbitrary norm ‖·‖ on classes of functions, can be thought of as a scheme of approximation via simple functions.

bracket.defn <4> Definition. For given δ > 0, say that a finite partition πδ of F into disjoint regions supports a δ-bracketing (for the norm ‖·‖) if there exist functions Aδ(x, f) and Bδ(x, f) such that:
(i) |f(x) − Aδ(x, f)| ≤ Bδ(x, f) for all x;
(ii) ‖Bδ(·, f)‖ ≤ δ for every f;
(iii) each Aδ(x, ·) and Bδ(x, ·) is πδ-simple as a function on F.
The bracketing number N(δ) is defined as the smallest number of regions needed for a partition that supports a δ-bracketing.

The bracketing function N(·) is decreasing. Again, it is of use only when it is finite-valued. Typically N(δ) tends to infinity as δ tends to zero. For the finer applications of bracketing arguments, we will derive bounds expressed as integrals over δ involving the bracketing numbers. The bounds will be useful only when the integrals converge; the convergence corresponds to assumptions about the rate of increase of N(δ) as δ tends to zero.

The application of bracketing to prove uniform strong laws of large numbers is crude. More refined arguments are needed to get sharper bounds corresponding to the central limit theorem level of asymptotics.
Traditionally these bounds have been expressed in terms of the empirical process νn = √n(Pn − P), by which sums are standardized in a way appropriate to central limit theorems,

    νn g = n^{−1/2} Σ_{i≤n} ( g(ξi) − Pg(ξi) ).

The results in this Chapter are stated for empirical processes constructed from random elements ξ1, ..., ξn taking values in a space X, indexed by classes of integrable functions. The general problem is to develop uniform approximations to the empirical process {νn f : f ∈ F} indexed by a class of functions F on X. In my opinion, the most useful general solutions give probabilistic bounds on sup_F |νn f − νn(Aδ f)|, such as an inequality for tail probabilities or an Lq bound. The behaviour of the process indexed by F can then be related to the behaviour of the finite-dimensional process {νn fδ : fδ ∈ Fδ}.

In this Chapter, all the arguments for the various probabilistic bounds will make use of a recursive approximation scheme known as chaining. It is not hard to understand why useful bounds for the empirical process do not usually follow by means of a single bracketing approximation. The bracketing bound destroys the centering. For example, with the upper bound we have

    νn(f − Aδ f) = √n Pn(f − Aδ f) − √n P(f − Aδ f)
                 ≤ √n Pn(Bδ f) + √n P(Bδ f)
                 = νn(Bδ f) + 2√n P(Bδ f).

The lower bound reverses the sign on the 2√n P(Bδ f). If P(Bδ f) were small compared with n^{−1/2}, the change in centering would not be important. Unfortunately that level of approximation would usually require a decreasing value of δ as n gets larger; we would lose the benefits of an approximation by means of a fixed, finite collection of functions.

The solution to the problem of the recentering is to derive the approximation in several steps. Suppose A1 and B1 refer to the approximations and bounds for a δ1-bracketing, and A2 and B2 refer to the bracketing for a smaller value δ2.
Apply the empirical process to both sides of the equality f − A1 f = f − A2 f + A2 f − A1 f, then take a supremum over F to derive the inequality

recursion1 <5>    sup_F |νn(f − A1 f)| ≤ sup_F |νn(f − A2 f)| + max_F |νn(A2 f − A1 f)|.

The two suprema bound the errors of approximation for the two bracketings. The maximum runs over at most N(δ1)N(δ2) pairs of differences between approximating functions; I write it as a maximum to remind you that it involves only finitely many differences, even for an infinite F. If we can bound probabilistically the contribution from that last term then we arrive at a recursive inequality relating the errors of the two bracketing approximations. Repeated application of the same idea would give a bound for the crude bracketing approximation as a sum of a bound for a finer approximation plus a sum of terms coming from the maxima over differences of approximating functions.

It remains only to bound the contributions from those maxima. Each of the differences A2 f − A1 f is bounded by B1 f + B2 f, with norm bounded by δ1 + δ2. [Typically use L2 norm. Bernstein for tail probabilities of bounded functions. Chaining must stop when the constant term in the bound overwhelms the variance term. For uniformly bounded classes of functions, luckily the contributions at the end of the chain are easy to take care of. For unbounded classes, need to truncate. Cite Dudley 81 for first form. Moment condition on envelope. Best form due to Ossiander 1987 and the Seattle group—see comments about history in the Notes Section.]

Bernstein <6> Lemma. For independent ξ1, ..., ξn and a measurable function g bounded in absolute value by a constant β,

    P{|νn g| ≥ t} ≤ 2 exp( −(1/2)t² / ( ‖g‖₂² + (2/3)βt/√n ) ).

The chaining bounds will be derived by a recursive procedure based on successive finite approximations to F in the sense of a norm ‖·‖. For independent {ξi}, it will usually be the L2 norm, ‖g‖₂² = (1/n) Σ_{i≤n} Pg(ξi)². However, the argument is written so that it works for other norms, such as those introduced by Rio (1994) and Doukhan, Massart & Rio (1994) for mixing processes. Only two extra specific properties are required of the norm.

In the literature, the most familiar example of such a bound is the Bennett inequality for sums of independent random variables. Suppose ξ1, ..., ξn are independent and |g| is bounded by a constant β, that is, β(g) ≤ β. Then Bennett's inequality asserts that

Bennett <7>    P{|νn g| ≥ λ‖g‖₂} ≤ 2 exp( −(1/2)λ² B(βλ/‖g‖₂) )    for λ > 0,

where

    B(x) = ( (1 + x) log(1 + x) − x ) / (x²/2).

It is the presence of the nuisance factor, B(βλ/‖g‖₂), that complicates the chaining argument for tail probabilities. If β stays fixed while λ and ‖g‖₂ are made smaller, the nuisance factor begins to dominate the bound. It was for this reason that Bass (1985) and Ossiander (1987) needed to add an extra truncation step to the chaining argument. The truncation keeps βλ/‖g‖₂ close enough to zero that one can ignore the nuisance factor, and act as if the centered sum has sub-gaussian tails.

The possible dependence for the general chaining argument with L1 norms can also be hidden behind a single assumption. The key ideas are easiest to explain (and the proof is simplest) for the L1 bound, P sup_F |νn f − νn(Aδ f)|. The analogous arguments for tail bounds are better known in the literature.

[§] 2. Independent summands

Suppose ξ1, ..., ξn are independent random variables. Several simplifications are possible when dealing with independent summands, largely because sums of bounded independent variables behave almost like subgaussian variables (Chapter 9).
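The near-subgaussian behaviour is visible in the Bennett nuisance factor from Section 1: B(x) stays close to 1 for small x, which is exactly the regime the truncation step is designed to maintain. A minimal sketch (the function names are mine; the formula for B is the one displayed above):

```python
import math

def bennett_B(x):
    """Nuisance factor in Bennett's inequality:
    B(x) = ((1 + x) * log(1 + x) - x) / (x**2 / 2), with B(0) = 1."""
    if x == 0.0:
        return 1.0
    return ((1 + x) * math.log1p(x) - x) / (x * x / 2)

def bennett_tail(lam, norm2, beta):
    """Upper bound for P{|nu_n g| >= lam * ||g||_2} when beta(g) <= beta."""
    return 2 * math.exp(-0.5 * lam ** 2 * bennett_B(beta * lam / norm2))

# Sub-gaussian regime: B(x) is close to 1 for small x, so keeping
# beta * lam / ||g||_2 near zero makes the tail bound behave like
# the sub-gaussian 2 * exp(-lam**2 / 2).
assert abs(bennett_B(1e-6) - 1.0) < 1e-3

# The nuisance factor decays as its argument grows, weakening the bound.
xs = [0.5 * i for i in range(1, 40)]
assert all(bennett_B(a) > bennett_B(b) for a, b in zip(xs, xs[1:]))
```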
It is natural to use the L2 norm,

    ‖f‖₂² := (1/n) Σ_{i≤n} P f(ξi)²,

for two reasons: it enters as a measure of scale in inequalities for finite collections of functions; and it is easy to bound contributions discarded during truncation arguments, by means of the inequality

independent.trunc <8>    n^{−1/2} Σ_i Pg(ξi){g(ξi) > √n ‖g‖₂/t} ≤ Σ_i Pg(ξi)² t/(n‖g‖₂) = t‖g‖₂.

This Section will be devoted to an explanation of the ideas involved in the proof of the following maximal inequality.

indep.mean <9> Theorem. Suppose ξ1, ..., ξn are independent and let F be a class of functions with finite bracketing numbers (for the L2 norm) for which ∫₀¹ √(log N(x)) dx < ∞. Then, for some universal constant C,

    P sup_F |νn(f − Aδ(f))| ≤ C ∫₀^δ √(log(2N(x))) dx + Rδ,

where

    Rδ = n^{−1/2} Σ_{i≤n} PB(ξi){B(ξi) > √n β},

with B(x) = max_F Bδ(x, f) and β = δ/√(log 2N(δ)). Anyone who worries about questions of measurability could work throughout with the outer integral P*.

The independence is needed only to establish a maximal inequality for finitely many bounded functions. It depends upon the elementary fact (see Pollard (1996, Chapter 4?) or Chow & Teicher (1978, page 338)) that the function defined by ψ(x) = 2(e^x − 1 − x)/x² for x ≠ 0, and ψ(0) = 1, is positive and increasing over the whole real line.

indep.max <10> Lemma. Suppose ξ1, ..., ξn are independent under P. Let G be a finite class consisting of N functions, each bounded in absolute value by a constant β√n and with ‖g‖₂ ≤ δ. Then there exists a universal constant C0 such that

    P max_{g∈G} |νn g| ≤ C0 δ √(log(2N))    if β ≤ δ/√(log(2N)).

Proof. Consider first the bound for a single sum. Let Wi = (g(ξi) − Pg(ξi))/√n. Then |Wi| ≤ 2β and Σ_i PWi² ≤ δ². The moment generating function of νn g = Σ_i Wi equals

    Π_i ( 1 + P(tWi) + P( (1/2)t²Wi² ψ(tWi) ) )
        ≤ Π_i ( 1 + (1/2)t² PWi² ψ(2tβ) )        because ψ is increasing
        ≤ exp( (1/2)t² Σ_i PWi² ψ(2tβ) )         using 1 + x ≤ e^x
        ≤ exp( (1/2)t²δ² ψ(2tβ) ).
The same argument works for −g. Sum up 2N such moment generating functions to get the maximal inequality. For fixed t > 0,

    exp( t P max_G |νn g| ) ≤ P exp( t max_G |νn g| )        by Jensen's inequality
        ≤ P exp( t max_G νn g ) + P exp( t max_G νn(−g) )
        ≤ 2N exp( (1/2)t²δ² ψ(2tβ) ).

Take logarithms then put t = √(log(2N))/δ to get

    P max_G |νn g| ≤ δ √(log(2N)) ( 1 + (1/2)ψ(2β√(log(2N))/δ) ).

The stated inequality holds with C0 = 1 + ψ(2)/2. [Comment on how the factor of 2 could be absorbed into the constant, except when N = 1. It would be notationally convenient if we could dispense with the 2.]

The proof of the Theorem consists of repeated application of the Lemma to the finite maxima obtained from a sequence of approximations. For k = 0, 1, ... and δk = δ/2^k, let πk be a partition of F into at most Nk = N(δk) regions. Write f_k(·) = Ak(·, f) and Bk(·) = Bk(·, f) for the πk-simple functions that define the δk-bracketing. Define h_k = √(log Nk). Here, and subsequently, I omit the argument f when it can be inferred from context.

Integrals bound sums

Typically integrals that appear in bounds like the one in Theorem <9> are just tidier substitutes for sums of error terms, a substitution made possible by the geometric decrease in the {δk}: if h(·) is a decreasing function on R+, then

    δk h(δk) = 2(δk − δk+1) h(δk) ≤ 2 ∫ {δk+1 < x ≤ δk} h(x) dx.

Sum over k, to deduce that

    Σ_{k=0}^m δk h(δk) ≤ 2 ∫_{δm+1}^{δ0} h(x) dx ≤ 2 ∫₀^δ h(x) dx.

Put h(x) = √(log(2N(x))).

Nested partitions

The bracketing argument is much simpler if the πk partitions are nested—that is, each πk+1 is a refinement of the preceding πk—and if we can take the Bk as decreasing in k. The logarithmic dependence on the bracketing numbers lets us arrange, without loss of generality, for the partitions to be nested and the bounding functions to decrease as k increases. Once again write h(x) for √(log 2N(x)).
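The sum-to-integral substitution can be checked numerically for a typical decreasing h. In the sketch below the choice h(x) = √(log(2/x)) mirrors the entropy functions used in this Chapter; the grid and constants are illustrative.

```python
import numpy as np

delta = 1.0
h = lambda x: np.sqrt(np.log(2.0 / x))      # a decreasing function on (0, delta]
dks = delta / 2.0 ** np.arange(0, 30)       # delta_k = delta / 2**k

# Left side: the sum of error terms along the chain.
lhs = float(sum(dk * h(dk) for dk in dks))

# Right side: 2 * integral_0^delta h(x) dx (simple Riemann approximation;
# the tiny neighbourhood of 0 contributes negligibly for this h).
xs = np.linspace(1e-8, delta, 2_000_001)
rhs = 2.0 * float(np.sum(h(xs)) * (xs[1] - xs[0]))

assert lhs <= rhs
```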
Let πk* be the common refinement of the partitions π1, ..., πk and let Bk*(f) = min_{i≤k} Bi(f). Within each region of the πk* partition choose Ak*(f) to correspond to the Bi(f) that achieves the minimum defining Bk*(f). Notice that πk* partitions F into at most

    Mk = N(δ1)N(δ2)...N(δk) ≤ exp( Σ_{i≤k} h(δi)² ) ≤ exp( (Σ_{i≤k} h(δi))² )

regions. The integral term in Theorem <9> also bounds the sum corresponding to the finer partition:

    Σ_{k=0}^∞ δk √(log Mk) ≤ Σ_{k=0}^∞ δk Σ_j {j ≤ k} h(δj)
        = Σ_{j=0}^∞ h(δj) 2δj        because Σ_{k} {j ≤ k} 2^{−k} = 2^{−j+1}
        ≤ 8 ∫₀^δ h(x) dx.

Rather than carry the superscript for πk* and Bk*, let me just write πk and Bk, absorbing the extra constants due to the nesting into the C.

Truncation regions

Lemma <10> suggests that we truncate the functions at a level depending on their L2 norms. Let {βk} be a decreasing sequence of constants, to be specified. It will turn out that β0 equals the β specified in the statement of the Theorem. Split sup_F |νn(f − f_0)| into the contributions from the regions {B0 ≤ β0√n} and {B0 > β0√n}. For each f, bound |νn(f − f_0)| by

top.split <11>    |νn( (f − f_0){B0 ≤ β0√n} )| + n^{−1/2} Σ_{i≤n} B0(ξi, f){B0(ξi, f) > β0√n}
                      + n^{−1/2} Σ_{i≤n} PB0(ξi, f){B0(ξi, f) > β0√n}.

The third term contributes at most

    Σ_{i≤n} PB0(ξi, f)² / (β0 n) ≤ δ²/β0 = δ h(δ),

which gets absorbed into the integral contribution to the bound. The supremum of the second term over F is less than the Rδ error. The first term is the starting point for the recursive procedure known as chaining. At each succeeding step in the chaining argument we truncate the differences f − f_k more severely, based on the size of Bk. Write Tk(f), or just Tk, for the indicator function of the set ∩_{i≤k} {Bi(f) ≤ βi√n}. Because the partitions are nested, there is no harm in letting the truncations accumulate.
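The truncation levels are workable because the discarded expected contributions are controlled by inequality <8>, which rests on the pointwise bound g·{g > c} ≤ g²/c. A quick numerical sanity check (the exponential distribution is an illustrative choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.exponential(size=200_000)        # simulated values of g(xi), g >= 0

# The truncation inequality behind <8> rests on the pointwise bound
#   g * {g > c} <= g**2 / c,
# so the discarded expected contribution is at most E g**2 / c.
for c in [0.5, 1.0, 2.0, 5.0]:
    discarded = np.mean(g * (g > c))
    assert discarded <= np.mean(g ** 2) / c
```

Because the bound holds pointwise, the assertion holds for every sample, not just on average.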
I will argue recursively to bound the contribution from the truncated remainders,

    Δk = P sup_F |νn( (f − f_k)Tk(f) )|.

Notice that Δ0 equals the first term in <11>.

Recursive inequality

Start from the recursive equality,

rec.ineq.indep <12>    (f − f_k)Tk = (f − f_{k+1})T_{k+1} + (f_{k+1} − f_k)T_{k+1} + (f − f_k)Tk T^c_{k+1}.

Notice that the truncations are arranged so that each function on the right-hand side of <12> is bounded. Apply νn to both sides of <12>, take suprema of absolute values, then expectations, to get

mean.indep.rec <13>    Δk ≤ Δ_{k+1} + P max_F |νn( (f_{k+1} − f_k)T_{k+1} )|    [links]
                            + P sup_F |νn( (f − f_k)Tk T^c_{k+1} )|.    [trunc]

The differences f_{k+1} − f_k contribute an error due to moving down one level in the chain of approximations; they are contributed by the "links" of the chain. The indicators Tk T^c_{k+1} pick out the contribution to the error of approximation when we move from the kth level of truncation to the (k+1)st. The links term is already set up for an application of Lemma <10>. The trunc term can be bounded by a maximum over a finite class, by means of the property that defines a bracketing.

Bound for the links term

Notice that each of f_k = Ak(f), f_{k+1} = A_{k+1}(f), and T_{k+1} is π_{k+1}-simple; functions f that lie in the same region of the partition share the same values for these three quantities. The maximum need run over at most N_{k+1} ≤ exp(h²_{k+1}) representative f, one from each region. Bound |f_k − f_{k+1}| by |f_k − f| + |f_{k+1} − f| ≤ Bk + B_{k+1}. The indicator function T_{k+1} ensures that, for every f,

    sup_x |(f_k − f_{k+1})T_{k+1}| ≤ (βk + β_{k+1})√n ≤ 2βk√n,
    max_F ‖(f_k − f_{k+1})T_{k+1}‖₂ ≤ ‖Bk‖₂ + ‖B_{k+1}‖₂ ≤ δk + δ_{k+1} = 3δ_{k+1}.

Notice that the truncation plays no role in bounding the L2 norm. The constraint required by Lemma <10> is satisfied provided

indep.constraint1 <14>    2βk ≤ 3δ_{k+1}/h_{k+1}.

If we can choose the truncation levels {βk} to satisfy this constraint we will have

    P max_F |νn( (f_k − f_{k+1})T_{k+1} )| ≤ 3C0 δ_{k+1} h_{k+1}.

Bound for the trunc term

The truncated remainder process (f − f_{k+1})Tk T^c_{k+1} is not simple, because (f − f_{k+1}) depends on f through more than just the regions of a finite partition of F. However, it is bounded in absolute value by the π_{k+1}-simple function B_{k+1} Tk T^c_{k+1}. In general, if g and G are functions for which |g| ≤ G, then

    |νn g| ≤ n^{−1/2} Σ_{i≤n} G(ξi) + n^{−1/2} Σ_{i≤n} PG(ξi) = νn G + 2n^{−1/2} Σ_{i≤n} PG(ξi).

Invoke the inequality with g = (f − f_{k+1})Tk T^c_{k+1} and G = B_{k+1} Tk T^c_{k+1}:

bracket.bnd <15>    P sup_F |νn( (f − f_{k+1})Tk T^c_{k+1} )|
                        ≤ P max_F |νn( B_{k+1} Tk T^c_{k+1} )| + 2 max_F n^{−1/2} Σ_{i≤n} P B_{k+1} Tk T^c_{k+1}(ξi).

Again the maximum need run over at most N_{k+1} representative f. The functions B_{k+1} Tk T^c_{k+1} are bounded in absolute value by βk√n (because B_{k+1} ≤ Bk ≤ βk√n on Tk) and have L2 norms less than δ_{k+1}. Provided

indep.constraint2 <16>    βk ≤ δ_{k+1}/h_{k+1},

Lemma <10> will bound the first term on the right-hand side by C0 δ_{k+1} h_{k+1}. Inequality <8> bounds the second term by 2δ²_{k+1}/β_{k+1}.

In summary: provided constraints <14> and <16> hold, we have the recursive inequality

mean.indep.rec2 <17>    Δk ≤ Δ_{k+1} + 4C0 δ_{k+1} h_{k+1} + 2δ²_{k+1}/β_{k+1}.

We minimize Rδ and the error term in <17> by choosing the βk as large as possible, that is, βk = δ_{k+1}/h_{k+1}.

Summation

The recursive inequality then simplifies to

    Δk ≤ Δ_{k+1} + (2 + 4C0) δ_{k+1} h_{k+1}.

Repeated substitution, and replacement of the resulting sum by its bounding integral, leaves us with the inequality

    Δ0 ≤ Δk + (8 + 16C0) ∫_{δ_{k+1}}^{δ0} h(x) dx.

As k tends to infinity, Δk tends to zero (it is bounded by 2nβk), which leads to the integral bound as stated in the theorem.

[§] 3. A generic bracketing bound

In general the norm plays the role of a scaling, as suggested by some probabilistic bound for a single νn g.
Such bounds typically also depend on a second measure of size β(·). In this Chapter, unless otherwise specified, β(g) will denote the supremum norm sup_x |g(x)|.

Improvements over the argument from Section 2:
• Applies to general norms.
• Lower terminal of integral not quite zero—helpful for mixing applications.
• Add one more term to the recursive equality, to avoid the assumption that the partitions are nested.

mean.assumption <18> Assumption. Suppose the norm ‖·‖ satisfies
(i) ‖g1‖ ≤ ‖g2‖ if |g1| ≤ |g2|;
(ii) there exists a nonnegative increasing function D(·) on R+ such that Σ_i Pg(ξi){g(ξi) > ‖g‖/t} ≤ ‖g‖D(t) for each t > 0.

The form of the upper bound is suggested by the methods of Doukhan et al. (1994) for absolutely regular processes. For independent processes, with the L2 norm ‖·‖₂, we can take D(t) ≡ t.

max.mean.assumption <19> Assumption. Suppose there exist increasing functions G(·) and H(·) for which the following property holds. If G is a finite set of at most N functions on X for which β(g) ≤ β and ‖g‖ ≤ δ for each g ∈ G, then

    P max_{g∈G} |νn g| ≤ δ H(N)    if β ≤ δ/G(N).

For example, for independent summands with ‖·‖₂ as norm, both H(N) and G(N) can be taken as multiples of √(log(2N)), as was shown in Section 2.

The upper bounds as stated are sensible only if the various integrals converge. The detail of the proof shows that the lower terminal of the integrals can be replaced by a δ* > 0, with only a slight increase in the constant. As shown by Birgé & Massart (1993), such a refinement is important for applications to minimum contrast estimators for infinite-dimensional parameters. For Theorem <20>, the δ* is determined by the equality √n δ* = ∫_{δ*}^δ J(x) dx. For Theorem <39>, it is determined by the equality [what?].

main.mean <20> Theorem. Suppose Assumptions <18> and <19> hold.
Then, for a fixed δ > 0, and some universal constant C0,

    P sup_F |νn(f − Aδ(f))| ≤ C0 ∫_{δ*}^δ J(x) dx + n^{−1/2} Σ_{i≤n} PB(ξi){B(ξi) > ??},

where

    B(x) = max_F Bδ(x, f),
    J(x) = H(N(x)N(x/2)) + D( 2G(N(x)N(x/2)) ),
    ?? = ??,

and δ* is the largest value for which H(N(2δ*)) ≥ √n.

Proof. For k = 0, 1, ... and δk = δ/2^k, let πk be a partition of F into at most Nk = N(δk) regions. Write f_k = Ak(f) and Bk = Bk(·, f) for the πk-simple functions that define the δk-bracketing. Define γk = G(Nk N_{k+1}) and θk = H(Nk N_{k+1}).

The bound will be derived by a recursive argument, involving successive simple approximations to νn f. At the kth step, the approximating functions will be πk*-simple, where πk* denotes the common refinement of πk and π_{k+1}. Suprema over classes of πk*-simple functions reduce to maxima over at most Nk N_{k+1} representatives, one for each region of πk*. Assumption <19> therefore suggests that we need functions bounded in absolute value by βk = δk/γk, a property that will be achieved by truncation. Write Tk(f), or just Tk, for the indicator function of the set {Bk(f) ≤ βk}. Notice that β0 ≤ M, and hence T0(f) ≡ 1.

I will argue recursively to bound the contribution from the truncated remainders,

    Δk = P* sup_F |νn( (f − f_k)Tk(f) )|.

Notice that Δ0 is the quantity we seek to bound. The recursive inequality for Δk will be derived from the equality

recursive.equality <21>    (f − f_k)Tk = (f − f_{k+1})T_{k+1} − (f − f_{k+1})T^c_k T_{k+1}
                                + (f_{k+1} − f_k)Tk T_{k+1} + (f − f_k)Tk T^c_{k+1}.

Here and subsequently I omit the argument f when it can be inferred from context. To verify equality <21>, notice that the first two terms on the right-hand side sum to (f − f_{k+1})Tk T_{k+1}, the third term then replaces the factor (f − f_{k+1}) by (f − f_k), and the last term combines the two contributions from B_{k+1}.
The role of the second term is to undo the effect of the Tk truncation after it has done its work; the successive truncations do not accumulate as in Ossiander's (1987) argument. Without such a tidying up, products such as Nk N_{k+1} would be replaced by products N0 N1 ... Nk N_{k+1}, which might cause summability difficulties if H(N) grew faster than a slow logarithmic rate.

Notice that the truncations are arranged so that each summand on the right-hand side of <21> is bounded. Apply νn to both sides of <21>, take suprema of absolute values, then expectations, to get

mean.recursive <22>    Δk ≤ Δ_{k+1} + P sup_F |νn( (f − f_{k+1})T^c_k T_{k+1} )|
                            + P max_F |νn( (f_{k+1} − f_k)Tk T_{k+1} )|
                            + P sup_F |νn( (f − f_k)Tk T^c_{k+1} )|.

Notice that the maximum in the third term on the right-hand side runs over only finitely many distinct functions; Assumption <19> will handle this term directly. The bracketing will increase both the second and fourth terms to bounds on simple processes that will be handled by the same inequality. The two Assumptions lead to simple bounds for the last three terms on the right-hand side of <22>.

Bound for the third term

Bound |f_k − f_{k+1}| by |f_k − f| + |f_{k+1} − f| ≤ Bk + B_{k+1}. The indicator function Tk T_{k+1} ensures that

    max_F β( (f_k − f_{k+1})Tk T_{k+1} ) ≤ βk + β_{k+1},
    max_F ‖(f_k − f_{k+1})Tk T_{k+1}‖ ≤ δk + δ_{k+1}.

The constraint required by Assumption <19> is satisfied:

    (βk + β_{k+1}) / (δk + δ_{k+1}) = 2/(3γk) + 1/(3γ_{k+1}) ≤ 1/γk.

It follows that

    P max_F |νn( (f_k − f_{k+1})Tk T_{k+1} )| ≤ (δk + δ_{k+1}) θk.

Bound for the second term

The truncated remainder process (f − f_{k+1})T^c_k T_{k+1} is not simple, because (f − f_{k+1}) depends on f through more than just the regions of a finite partition of F. However, it is bounded in absolute value by B_{k+1} T^c_k T_{k+1}. In general, if g and h are functions for which |g| ≤ h, then

    |νn g| ≤ n^{−1/2} Σ_{i≤n} h(ξi) + n^{−1/2} Σ_{i≤n} Ph(ξi) = νn h + 2n^{−1/2} Σ_{i≤n} Ph(ξi).

Apply the bound with g = (f − f_{k+1})T^c_k T_{k+1} and h = B_{k+1} T^c_k T_{k+1}:

bracket.bnd <23>    P sup_F |νn( (f − f_{k+1})T^c_k T_{k+1} )|
                        ≤ P max_F |νn( B_{k+1} T^c_k T_{k+1} )| + 2 max_F Σ_{i≤n} P B_{k+1} T^c_k T_{k+1}(ξi).

From the bounds

    max_F β( B_{k+1} T^c_k T_{k+1} ) ≤ max_F β( B_{k+1} T_{k+1} ) ≤ β_{k+1},
    max_F ‖B_{k+1} T^c_k T_{k+1}‖ ≤ max_F ‖B_{k+1}‖ ≤ δ_{k+1},

Assumption <19> gives

    P max_F |νn( B_{k+1} T^c_k T_{k+1} )| ≤ δ_{k+1} θk,

because β_{k+1} = δ_{k+1}/γ_{k+1} ≤ δ_{k+1}/γk. For the contribution from the expected values, split according to which of Bk or B_{k+1} is larger, to bound the function B_{k+1} T^c_k T_{k+1} by

mean.bnd <24>    B_{k+1}{B_{k+1} > βk} + Bk{Bk > βk} ≤ B_{k+1}{B_{k+1} > 2δ_{k+1}/γk} + Bk{Bk > δk/γk}.

Thus, via Assumption <18>,

    Σ_{i≤n} P B_{k+1} T^c_k T_{k+1}(ξi) ≤ δ_{k+1} D(γk/2) + δk D(γk).

Bound for the fourth term

By the symmetry in k and k+1 in the argument for the second term, we can interchange their roles to get

bracket.bnd <25>    P sup_F |νn( (f − f_k)T^c_{k+1} Tk )|
                        ≤ P max_F |νn( Bk T^c_{k+1} Tk )| + 2 max_F Σ_{i≤n} P Bk T^c_{k+1} Tk(ξi).

Assumption <19> applies because βk = δk/γk, to bound the first contribution by δk θk. Arguing as for the second term, bound the contribution from the means by

    Σ_{i≤n} ( P Bk(ξi){Bk(ξi) > δk/(2γ_{k+1})} + P B_{k+1}(ξi){B_{k+1}(ξi) > δ_{k+1}/γ_{k+1}} )
        ≤ δ_{k+1} D(γ_{k+1}) + δk D(2γ_{k+1}).

Notice that D(2γ_{k+1}) is the largest of the four contributions from D to the second and fourth terms.

Recursive inequality

The recursive inequality <22> has now simplified to

    Δk ≤ Δ_{k+1} + 6δ_{k+1} θk + 6δ_{k+1} D(2γ_{k+1}).

Recursive substitution, and replacement of the resulting sum by its bounding integral, leaves us with the inequality

    Δ0 ≤ Δk + 12 ∫_{δk}^{δ0} J(x) dx,

with J(x) as defined in the statement of the Theorem. As k tends to infinity, Δk ≤ 2nβk tends to zero, which leads to the integral bound as stated in the theorem. A slightly better bound is obtained by choosing a k just large enough to make Δk comparable to the other terms in the bound.
Remember that δ ∗ is √ δ determined by the equality nδ ∗ = δ∗ J (x) d x = J. Choose the largest k for ∗ which δk+1 ≥ δ . Bound k using the bracketing inequality, k ≤ P max νn Bk {Bk ≤ βk } + 2 PBk {Bk ≤ βk }. F i≤n c Tk+1 Notice that the factor is missing. That has no effect on the first contribution, which, by virtue of Assumption <19>, is less than δk H (Nk ) ≤ 2 δk δk+1 J (x) d x. The sum of expected values is no longer in the form needed by Assumption <18>. Instead, it can be bounded by the corresponding second moment, 1 1/2 √ 2n PBk2 ≤ 2 nδk ≤ 8J. n i≤n [§] 4. The entire k contribution has been absorbed into integral terms, leaving a final upper bound of 20J for 0 . Phi mixing Let Bk denote the sigma-field generated by ξ1 , . . . , ξk and Ak denote the sigma-field generated by ξk , ξk+1 , . . .. Say that {ξi } has phi-mixing coefficients {φm } if, for all nonegative integers k and m, |PAB − PAPB| ≤ φm PB for all B ∈ Bk and A ∈ Ak+m . If X is Bk -measurable and integrable, and Y is Ak+m -measurable and bounded by a constant K , a simple approximation argument (see Billingsley 1968, page 170) shows that covar <26> |PX Y − PX PY | ≤ 2K φm P|X |. This inequality leads to a bound on the moment generating function of a sum, which will play the same role as Lemma <indep.mgf>. The argument is a slightly modified form of the proof of Collomb’s (1984) Lemma 1, with elimination of his first moment quantities from the bound. Once again, work with the L 2 norm, g22 = i≤n Pg(ξi )2 . c David Pollard Asymptopia: 23 March 2001 13 Section 12.4 phi.mgf <27> Phi mixing Lemma. Let W1 , . . . , Wn be random variables with phi-mixing coefficients {φm }. Suupose each |Wi | is bounded by a constant α and PWi = 0, and √ that i≤n PWi2 ≤ σ 2 . Then, with C = 4 + 16 i≤n φi , √ P exp(t X ) ≤ exp Ct 2 σ 2 + n eφm /m i≤n i for each t > 0 and each positive integer m with 1 ≤ m ≤ n and mtα ≤ 1/4. Proof. Break {1, . . . , n} into successive blocks B1 , B1 , . . . 
$, B_N, B_N'$ of length $m$ (except for $B_N$ and $B_N'$, which might be shorter). Notice that $n + 2m \ge 2Nm \ge n$, whence $n/2m \ge N - 1$. Define $B_i = \sum_{j\in B_i} W_j$ and $T_i = B_1 + \dots + B_i$. Define $B_i'$ and $T_i'$ in a similar fashion. Write $\sigma_j^2$ for $PW_j^2$, and $V_i$ for $\sum_{j\in B_i}\sigma_j^2$. By convexity,
$$P\exp(tX) = P\exp\big(\tfrac12(2tT_N + 2tT_N')\big) \le \tfrac12 P\exp(2tT_N) + \tfrac12 P\exp(2tT_N').$$
Consider the first term on the right-hand side. Peel off the contribution from the block $B_N$, using inequality <26> for a separation of at least $m$. Notice that $|2tB_N| \le 2tm\alpha \le \tfrac12$ by the constraint on $m$ and $t$, whence $\|\exp(2tB_N)\|_\infty \le \sqrt e$. From inequality <26>,
$$P\exp(2t(T_{N-1} + B_N)) \le P\exp(2tT_{N-1})\,P\exp(2tB_N) + 2\varphi_m\sqrt e\,P\exp(2tT_{N-1}).$$
Because $e^x \le 1 + x + x^2$ for $|x| \le \tfrac12$, we also have $P\exp(2tB_N) \le 1 + 4t^2 PB_N^2$. The mixing condition lets us bound $PB_N^2$ by $\tfrac14 CV_N$ (the argument is essentially the one on page 172 of Billingsley 1968), which leads to
$$P\exp(2tT_N) \le P\exp(2tT_{N-1})\big(1 + Ct^2V_N + 2\varphi_m\sqrt e\big) \le P\exp(2tT_{N-1})\exp\big(Ct^2V_N + 2\varphi_m\sqrt e\big).$$
Repeat the argument another $N-2$ times to get
$$P\exp(2tT_N) \le P\exp(2tT_1)\exp\big(Ct^2(V_2+\dots+V_N) + 2(N-1)\varphi_m\sqrt e\big) \le \exp(4t^2V_1)\exp\Big(Ct^2(V_2+\dots+V_N) + \frac{n}{m}\varphi_m\sqrt e\Big).$$
A similar argument gives a similar inequality for $P\exp(2tT_N')$. The asserted bound on the moment generating function follows.

The inequality, as stated, requires no convergence condition on the mixing coefficients; the bound is for a fixed $n$. However, in applications the inequality is needed for arbitrarily large $n$. In that case one needs the familiar condition
$$\sum_{k=1}^\infty \sqrt{\varphi_k} < \infty.$$
The dependence between the variables has contributed the $\varphi_m/m$ term, which indirectly provides the constraint appearing in the phi-mixing incarnation of Assumption <19>. The constraint is best expressed in terms of a decreasing function $\rho$ for which $\varphi_m/m \le \rho(m)$ for each $m$. The dependence throws an extra factor $\rho^{-1}((\log 2N)/n)$ into the definition of $G$.
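The block bookkeeping in the proof can be sanity-checked. A minimal sketch (the helper name is mine, not the text's): cut $\{1,\dots,n\}$ into alternating blocks of length $m$, the last pair possibly shorter, and confirm that the number $N$ of pairs satisfies $n + 2m \ge 2Nm \ge n$, the count used to turn $2(N-1)\varphi_m$ into $(n/m)\varphi_m$.

```python
# Sketch of the alternating-block decomposition: B_1, B_1', B_2, B_2', ...
# each of length m, covering {1, ..., n} without overlap.

def alternating_blocks(n, m):
    idx = list(range(1, n + 1))
    blocks, primed = [], []
    for pos in range(0, n, 2 * m):
        blocks.append(idx[pos:pos + m])          # B_i, summed into T_N
        primed.append(idx[pos + m:pos + 2 * m])  # B_i', summed into T_N'
    return blocks, primed

n, m = 100, 7
B, Bp = alternating_blocks(n, m)
N = len(B)

assert n + 2 * m >= 2 * N * m >= n               # whence n / (2m) >= N - 1
assert sorted(sum(B, []) + sum(Bp, [])) == list(range(1, n + 1))
```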
When $\log 2N > n\rho(1)$, this factor should be interpreted as 1.

<28> Corollary. Let $\{\xi_i\}$ have phi-mixing coefficients for which $\sum_{m=1}^\infty \sqrt{\varphi_m} < \infty$. Let $\mathcal G$ be a finite class consisting of $N$ functions for which $\beta(g) \le \beta$ and $\|g\|_2 \le \delta$. Then there exists a constant $C_4$, depending only on the mixing coefficients, such that
$$P\max_{g\in\mathcal G}|Sg| \le C_4\delta\sqrt{\log 2N} \quad\text{if } \beta \le \delta/G(N), \qquad\text{where } G(N) = 16\rho^{-1}\Big(\frac{\log 2N}{n}\Big)\sqrt{\log 2N}.$$
That is, Assumption <19> holds with $H(N) = C_1\sqrt{\log 2N}$ and $G(N)$ as given.

Proof. Argue as for Corollary <10> that, provided $0 < 8m\beta t < 1$,
$$P\max_{\mathcal G}|Sg| \le \log(2N)/t + C_1 t\delta^2 + \sqrt e\,\rho(m)n/t.$$
Choose $t = \sqrt{\log 2N}/\delta$, and let $m$ be the smallest integer greater than 1 for which $n\rho(m) \le \log 2N$. Then $m/2 \le m - 1 \le \rho^{-1}((\log 2N)/n)$, from which the constraint follows.

Derive the analogous tail bound?

The effect of the extra factor in the definition of $G$ is best understood through a specific example.

<29> Section incomplete from here on.

<30> Example. Suppose $\{\xi_i\}$ is a stationary sequence with marginal distribution $P$ and mixing coefficients such that $\rho(x) = O(x^{-p})$. Suppose the class $F$ has envelope $F$ such that $PF^v < \infty$ for some $v > 2$, and bracketing numbers with $\log N_2(x) = O(x^{-2w})$ for some $w < 1$. As for Example <mean.Ossiander>, a functional central limit theorem can be proved provided $p$, $v$, and $w$ are suitably related.

Let $\epsilon_n$ tend to zero so slowly that $\sqrt n\,P\{F > \epsilon_n n^{1/v}\} \to 0$. Then the truncated class $F^* = \{f\{F \le \epsilon_n n^{1/v}\}/\sqrt n : f \in F\}$ has elements bounded by $M_n = \epsilon_n n^{1/v - 1/2}$, which tends to zero faster than a small negative power of $n$.

Need to let $\delta$ tend to zero fast enough so that the $n^{1/p}$ does not blow up the entropy integral. This $\delta$ will be different from the fixed $\delta$ for the stochastic equicontinuity.

Example (à la DMR).
Same setup as for the previous example, but work with a new norm. DMR (following Rio 1994) realized that it was more elegant to absorb the required rate of convergence for a sequence of mixing coefficients into the definition of a norm,
$$\|X\|_{2,r}^2 = \int_0^1 r^{-1}(u)\,Q_X(u)^2\,du.$$
With a slight change of notation, their Lemma 5 corresponds precisely to my Assumption <18>. Make sure $r^{-1} \ge 1$, so that the new norm is larger than the $L^2$ norm.

For a nonnegative random variable $X$ define $\bar F(x) = P\{X > x\}$, with corresponding quantile function $Q_X(u) = \inf\{x : \bar F(x) \le u\}$ for $0 < u < 1$. The subtle choice of inequalities ensures that $\bar F(x) \le u$ if and only if $x \ge Q_X(u)$. Consequently, $X$ has the same distribution as $Q_X(U)$ where $U$ is distributed Uniform(0,1).

<31> Lemma. For each $\theta > 0$,
$$PX\big\{X > \|X\|_{2,r}/\sqrt{\theta\rho^{-1}(\theta)}\big\} \le \|X\|_{2,r}\sqrt\theta.$$

Proof. Write $\delta$ for $\|X\|_{2,r}$. Put $\theta = \rho(\ell)$. By definition of the norm,
$$\delta^2 \ge \int_0^{r(\ell)} r^{-1}(u)\,Q_X(u)^2\,du \ge \ell\,r(\ell)\,Q_X(r(\ell))^2,$$
from which it follows that $\delta \ge Q_X(r(\ell))\sqrt{\theta\rho^{-1}(\theta)}$. Use the quantile representation to rewrite the left-hand side of the asserted inequality as
$$\int_0^1 Q_X(u)\{Q_X(u) > Q_X(r(\ell))\}\,du = \int_0^1 Q_X(u)\{u < r(\ell)\}\,du \le \int_0^1 \sqrt{r^{-1}(u)/\ell}\;Q_X(u)\{u < r(\ell)\}\,du,$$
which, by the Cauchy-Schwarz inequality, is less than $\delta\sqrt{r(\ell)}/\sqrt\ell = \delta\sqrt\theta$, as asserted.

Let $g^* = g/\sqrt n$ and $\|g^*\| = \|g(\xi_1)\|_{2,r}$. Deduce that
$$\sum_{i\le n} P g^*(\xi_i)\big\{g^*(\xi_i) > \|g^*\|/\sqrt{n\theta\rho^{-1}(\theta)}\big\} = \sqrt n\,P g\big\{g > \|g\|_{2,r}/\sqrt{\theta\rho^{-1}(\theta)}\big\} \le \|g^*\|\sqrt{n\theta}.$$
In consequence, $D\big(\sqrt{n\theta\rho^{-1}(\theta)}\big) = \sqrt{n\theta}$. Apply with $\theta = (\log 2N)/n$ to get
$$\sum_{i\le n} P g^*(\xi_i)\{g^*(\xi_i) > \|g^*\|/G(N)\} \le \|g^*\|\sqrt{\log 2N}.$$
That should lead to an entropy integral like the one for independent variables, provided bracketing numbers are calculated using the $\|\cdot\|_{2,r}$ norm. The final central limit theorem should then look like the result for independent random variables.

[§] 5. Absolute regularity

DMR have established a functional central limit theorem for stationary, absolutely regular sequences, absolute regularity being a slightly weaker property than the phi-mixing assumption.
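The quantile function $Q_X$ and the $\|\cdot\|_{2,r}$ norm from the previous section can be illustrated numerically. This is a hedged sketch under assumptions of mine: $X \sim \text{Uniform}(0,1)$, so $\bar F(x) = 1 - x$ and $Q_X(u) = 1 - u$, and $r^{-1}(u) \equiv 1$ (the smallest choice allowed by $r^{-1} \ge 1$), which collapses the norm to the ordinary $L^2$ norm, here $PX^2 = 1/3$.

```python
# Hedged illustration: Q_X(u) = inf{x : P{X > x} <= u} via bisection, then the
# norm ||X||_{2,r}^2 = int_0^1 r^{-1}(u) Q_X(u)^2 du with r^{-1}(u) = 1.

def quantile(u, tail):
    # bisection for inf{x : tail(x) <= u}, for a decreasing tail on [0, 1]
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if tail(mid) <= u:
            hi = mid
        else:
            lo = mid
    return hi

tail = lambda x: 1.0 - x                   # P{X > x} for X ~ Uniform(0, 1)

assert abs(quantile(0.25, tail) - 0.75) < 1e-9   # Q_X(u) = 1 - u

M = 20000                                  # midpoint rule on (0, 1)
norm2 = sum(quantile((i + 0.5) / M, tail) ** 2 for i in range(M)) / M
assert abs(norm2 - 1.0 / 3.0) < 1e-3       # reduces to E X^2 = 1/3
```

A genuinely decreasing $r^{-1} > 1$ would inflate the contribution from small $u$, that is, from the upper tail of $X$, which is exactly how the mixing rate gets absorbed into the norm.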
The precise definition of absolute regularity is unimportant here. It matters only that it leads to a moment maximal inequality involving an interesting new norm. Using the Berbee coupling between absolutely regular sequences and independent sequences, DMR also established an inequality that corresponds to my Assumption <19>. Write $L(N)$ for the maximum of 1 and $\log N$. Let $\mathcal G$ be a finite class consisting of $N$ functions for which $\beta(g) \le \beta$ and $\|g\|_{2,r} \le \delta$. Put $G(N) = \rho^{-1}(L(N)/n)\,L(N)/8$.

<32> Then Lemma 4 of DMR implies the existence of a universal constant $C_3$ such that
$$P\max_{g\in\mathcal G}|Sg/\sqrt n| \le \delta C_3\sqrt{L(N)} \quad\text{if } \beta \le \sqrt n\,\delta/G(N).$$

How to rescale? Derive the DMR bound?

A saturation lemma

The integral condition in the next lemma imposes some constraint on how fast $D(x)$ can increase as $x$ tends to zero. Often one needs more precise control over the rate. DMR have shown how to replace $D$ by a slightly larger function for which one has such control.

<33> Lemma. Suppose $D(\cdot)$ is a decreasing nonnegative function on $(0,1]$ for which $\int_0^1 D(x)\,dx < \infty$. Then the function defined by
$$\overline D(x) = \sup_{0<t\le x}(t/x)^2 D(t)$$
has the following properties.

(i) $\overline D(x/2) \le 4\overline D(x)$, for each $x$.

(ii) $\overline D$ is decreasing.

(iii) $\overline D(x) \ge D(x)$.

(iv) For each nonnegative function $\gamma$ on $\mathbb R^+$ for which $\gamma(x)/x$ is increasing,
$$\int_0^\epsilon \gamma\big(\overline D(x)\big)\,dx \le 4\int_0^\epsilon \gamma(D(x))\,dx, \quad\text{for each } 0 < \epsilon \le 1.$$

Property (i) follows from the fact that the supremum defining $\overline D(x/2)$ runs over half the range, and it has an extra factor of $(1/2)^2$ in the denominator, as compared with the definition of $\overline D(x)$. Properties (ii) and (iii) follow from the equivalent definition
$$\overline D(x) = \sup_{0<s\le 1} s^2 D(sx).$$
Property (iv) is more delicate. Split the range for the supremum defining $\overline D(x)$ into $0 < t \le x/2$ and $x/2 < t \le x$, then bound $D(t)$ from above by $D(x/2)$ on the second subrange, to obtain
$$\overline D(x) \le \max\big(\tfrac14\overline D(x/2),\, D(x/2)\big).$$
Apply $\gamma$ to both sides, invoke the fact that $\gamma(t/4) \le \gamma(t)/4$, then bound the maximum by a sum:
$$\int_0^\epsilon \gamma\big(\overline D(x)\big)\,dx \le \int_0^\epsilon \gamma\big(\overline D(x/2)\big)/4\,dx + \int_0^\epsilon \gamma(D(x/2))\,dx.$$
With the change of variable $y = x/2$, the first integral on the right-hand side is seen to be less than half the integral on the left-hand side, and the second integral is seen to be less than $2\int_0^\epsilon \gamma(D(x))\,dx$. Property (iv) would then follow by a rearrangement of terms, provided all integrals are finite. To cover the case of infinite integrals, replace $D(x)$ by $D_C(x) = D(x)\wedge C$, for a constant $C$, derive the inequality from property (iv) with $D$ replaced by $D_C$, then invoke Monotone Convergence as $C$ tends to infinity.

[§] 6. Strong mixing

Let $\mathbf Y = (Y_1, \dots, Y_n)$. Fix a $\gamma > 0$. For each $\tau \ge 2$ define
$$N(\tau, \mathbf Y) = \Big(\sum_{i\le n}\big(P|Y_i|^{\tau+\gamma}\big)^{\tau/(\tau+\gamma)}\Big)^{1/\tau}.$$
Doukhan (or Andrews and Pollard) have shown, under the assumption that
$$\sum_{i=1}^\infty (1+i)^{\tau-2}\alpha_i^{\gamma/(\tau+\gamma)} < \infty,$$
that
<34>
$$P|W_1 + \dots + W_n|^\tau \le C_\tau^\tau \min\big(N(2,\mathbf W),\, N(\tau,\mathbf W)\big).$$

Let $\|g\| = N(2, \mathbf Y)$ where $Y_i = g(\xi_i)$.

Fact: If $\beta(g) \le \delta = \|g\|$ then $N(\tau, \mathbf Y) \le N(2, \mathbf Y)$.

Proof: Let $s = \tau(2+\gamma)/2$. Then
$$\Big(\sum_{i\le n}\big(P|Y_i|^{\tau+\gamma}\big)^{\tau/(\tau+\gamma)}\Big)^{1/\tau} \le \Big(\sum_{i\le n}\big(P|Y_i|^s\big)^{\tau/s}\Big)^{1/\tau} \le \Big(\sum_{i\le n}\big(P|Y_i|^{2+\gamma}\beta^{s-2-\gamma}\big)^{\tau/s}\Big)^{1/\tau} \le \beta^{(\tau-2)/\tau} N(2,\mathbf Y)^{2/\tau} \le N(2,\mathbf Y).$$

Fact:
$$\sum_{i\le n} P|Y_i|\{|Y_i| > \|\mathbf Y\|/t\} \le \|\mathbf Y\|\,t^{1+\gamma}.$$

Proof: Let $\delta = \|\mathbf Y\|$. Let $\alpha_i$ denote the $L^{2+\gamma}$ norm of $Y_i$. Note that $\delta \ge \max_i \alpha_i$. The left-hand side is less than
$$\sum_{i\le n} P|Y_i|^{2+\gamma}\,t^{1+\gamma}/\delta^{1+\gamma} = t^{1+\gamma}\sum_{i\le n}\alpha_i^{2+\gamma}/\delta^{1+\gamma} \le t^{1+\gamma}\sum_{i\le n}\alpha_i^2/\delta = \delta t^{1+\gamma}.$$

[§] 7. Tail probability bounds

A tail bound for a single $Sg$ easily implies a maximal inequality for a finite class of functions—one has merely to add together a finite number of bounds. For example, with independent summands, and a class $\mathcal G$ of at most $N$ functions each satisfying $\|g\|_2 \le \delta$ and $\beta(g) \le \beta$, the Bennett inequality would give
$$P\{\max_{\mathcal G}|Sg| > \lambda\delta\} \le \exp\big(\log(2N) - \tfrac12\lambda^2 B(\beta\lambda/\delta)\big).$$
In chaining arguments we need $\lambda$ large enough to absorb the $\log(2N)$, which suggests that we replace $\lambda$ by $\lambda + \sqrt{2\log(2N)}$.
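The absorption of $\log(2N)$ can be checked directly: with $a = \sqrt{2\log 2N}$, the elementary inequality $(\lambda + a)^2/2 \ge \lambda^2/2 + a^2/2$ cancels the $\log 2N$ contributed by the union bound. A small numeric sketch (my illustrative simplification: a pure sub-gaussian single-function tail, the Bennett factor $B(\cdot)$ ignored):

```python
import math

# Hedged sketch: union bound over 2N functions with a sub-gaussian tail.
# Shifting lam by a = sqrt(2 log 2N) leaves the clean tail exp(-lam**2 / 2),
# because (lam + a)**2 / 2 >= lam**2 / 2 + a**2 / 2 and a**2 / 2 = log 2N.

def union_tail(lam, N):
    a = math.sqrt(2 * math.log(2 * N))
    return math.exp(math.log(2 * N) - (lam + a) ** 2 / 2)

for N in (2, 100, 10 ** 6):
    for lam in (0.5, 1.0, 3.0):
        assert union_tail(lam, N) <= math.exp(-lam ** 2 / 2)
```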
If $\beta$ is constrained to be less than $\delta/(\lambda + \sqrt{2\log(2N)})$ then we get the sub-gaussian tail $\exp(-c\lambda^2)$, for some constant $c$. It turns out that the derivation of such a maximal inequality for finite classes is the only part of the argument where independence is needed. Similar inequalities (derived by slightly more involved arguments—see Sections 3 and 4) hold for various types of mixing sequences; they play the same role in the chaining argument. In general, for a chaining argument with tail probabilities, all details about possible dependence can be hidden behind a single assumption.

<35> Assumption. Suppose there exist functions $G(\cdot,\cdot)$ and $H(\cdot)$, increasing in each argument, and a decreasing function $\tau(\cdot)$, for which the following property holds. If $\mathcal G$ is a finite set of at most $N$ functions on $\mathcal X$ for which $\beta(g) \le \beta$ and $\|g\| \le \delta$ for each $g \in \mathcal G$, then
$$P\{\max_{g\in\mathcal G}|Sg| \ge \delta(H(N) + \lambda)\} \le \tau(\lambda) \quad\text{if } \beta \le \delta/G(N,\lambda),$$
for each $\lambda > 0$.

The argument is slightly more involved than for the proof of Theorem <20>, because there are two more sequences of constants to be chosen correctly.

<36> Theorem. Suppose Assumptions <18> and <35> hold. Suppose the functions in $F$ are bounded in absolute value by a constant $M \le \delta/G\big(N(\delta)N(\delta/2)\big)$, for a fixed $\delta > 0$. Let $\lambda(\cdot)$ be a decreasing, nonnegative function on $\mathbb R^+$. Then, for some universal constants $C_1$ and $C_2$,
$$P\Big\{\sup_F |S(f - A_\delta(f))| > C_2\int_0^\delta K(x)\,dx\Big\} \le C_1\int_0^\delta \frac{\tau(\lambda(x))}{x}\,dx,$$
where $K(x) = H(N(x)N(x/2)) + \lambda(x) + D\big(2G(N(x)N(x/2), \lambda(x))\big)$.

As before, for $k = 0, 1, \dots$ and $\delta_k = \delta/2^k$, let $\pi_k$ be a partition of $F$ into at most $N_k = N(\delta_k)$ regions, and write $f_k = A_k(f)$ and $B_k = B_k(\cdot, f)$ for the $\pi_k$-simple functions that define the $\delta_k$-bracketing. Define
$$\lambda_k = \lambda(\delta_k), \quad \gamma_k = G(N_kN_{k+1}, \lambda_k), \quad \beta_k = \delta_k/\gamma_k, \quad \theta_k = H(N_kN_{k+1}).$$
Start once more from the recursive equality
$$(f - f_k)T_k = (f - f_{k+1})T_{k+1} - (f - f_{k+1})T_k^cT_{k+1} + (f_{k+1} - f_k)T_kT_{k+1} + (f - f_k)T_kT_{k+1}^c.$$
Define a corresponding sequence of constants $\{R_k\}$, with
$$R_k = R_{k+1} + 6\delta_{k+1}(\theta_k + \lambda_{k+1} + \eta_k), \quad\text{where } \eta_k = D(2\gamma_k).$$
This time write $\Delta_k$ for $P\{\sup_F |S(f - f_k)T_k| > R_k\}$. Then we have the recursive tail bound
<37>
$$\Delta_k \le \Delta_{k+1} + P\{\sup_F |S(f - f_{k+1})T_k^cT_{k+1}| > \delta_{k+1}(\theta_k + \lambda_k + 3\eta_k)\}$$
$$\qquad + P\{\max_F |S(f_{k+1} - f_k)T_kT_{k+1}| > (\delta_k + \delta_{k+1})(\theta_k + \lambda_k)\}$$
$$\qquad + P\{\sup_F |S(f - f_k)T_kT_{k+1}^c| > \delta_k(\theta_k + \lambda_k + 3\eta_k)\}.$$
The argument for the second, third, and fourth terms on the right-hand side of <37> parallels the argument started from <22>. I omit most of the repeated detail.

The maximum in the third term again runs over at most $N_kN_{k+1}$ distinct representatives, each with a bound of at most $\beta_k + \beta_{k+1}$ and a norm of at most $\delta_k + \delta_{k+1}$. The $\beta_k$ were again chosen to ensure that the constraint required by Assumption <35> is satisfied. It follows that the third term contributes at most $\tau(\lambda_k)$.

As before,
$$|S(f - f_{k+1})T_k^cT_{k+1}| \le |SB_{k+1}T_k^cT_{k+1}| + 2\max_F \sum_{i\le n} P B_{k+1}T_k^cT_{k+1}(\xi_i),$$
and the sum of the means is less than $3\delta_{k+1}\eta_k$. Thus the second term is less than
$$P\{\max_F SB_{k+1}T_k^cT_{k+1} > \delta_{k+1}(H(N_kN_{k+1}) + \lambda_{k+1})\} \le \tau(\lambda_{k+1}).$$
Similarly, the fourth term is bounded by $\tau(\lambda_k)$.

With repeated substitution, the recursive inequality <37> leads to
$$\Delta_0 \le \Delta_k + 2\int_{\delta_{k+1}}^{\delta} \frac{\tau(\lambda(x))}{x}\,dx, \quad\text{with } R_0 = R_k + 12\int_{\delta_{k+1}}^{\delta} K(x)\,dx.$$
Argue as in the previous section, but with $J(\cdot)$ replaced by $K(\cdot)$, to bound the contributions from $\Delta_k$ and $R_k$ for an appropriately chosen $k$.

Not quite right. There is another contribution to the tail bound. Fix.

Old section on tail probabilities

<38> Assumption. Suppose there exist functions $G(\cdot,\cdot)$ and $H(\cdot)$, increasing in each argument, and a decreasing function $\tau(\cdot)$, for which the following property holds.
If $\mathcal G$ is a finite set of functions on $\mathcal X$ for which $\beta(g) \le \beta$ and $\|g\| \le \delta$ for each $g \in \mathcal G$,
$$P\{\max_{g\in\mathcal G}|Sg| \ge \delta(H(N) + \lambda)\} \le \tau(\lambda) \quad\text{if } \beta \le \delta/G(N,\lambda),$$
for each $\lambda > 0$.

<39> Theorem. Suppose Assumptions <18> and <38> hold. Suppose the functions in $F$ are bounded in absolute value by a constant $M$. Suppose $\delta > 0$ is such that $M \le \delta/G\big(N(\delta)N(\delta/2)\big)$. Then, for some universal constant $C_0$,
$$P\{\sup_F |S(f - A_\delta(f))| > R_0\} \le C_0 \cdots, \quad\text{where } R_0 = \int_0^\delta \Big(H(N(x)N(x/2)) + D\big(G(8N(x)N(x/2))\big)\Big)\,dx,$$
and …

Proof. The argument is slightly more involved than for the proof of Theorem <main>, because there is one more sequence of constants to be chosen correctly. As before, for $k = 0, 1, \dots$ and $\delta_k = \delta/2^k$, let $\pi_k$ be a partition of $F$ into at most $N_k = N(\delta_k)$ regions. Write $f_k = A_k(f)$ and $B_k(f)$ for the $\pi_k$-simple functions that define the $\delta_k$-bracketing. Define $\gamma_k = G(N_kN_{k+1}, \lambda_k)$ and $\beta_k = \delta_k/\gamma_k$ and $\theta_k = H(N_kN_{k+1})$, where the $\{\lambda_k\}$ remain to be chosen.

Start once more from the recursive equality
$$(f - f_k)T_k = (f - f_{k+1})T_{k+1} - (f - f_{k+1})T_k^cT_{k+1} + (A_{k+1} - A_k)T_kT_{k+1} + (f - f_k)T_kT_{k+1}^c.$$
Define a corresponding sequence of constants $\{R_k\}$, with
$$R_k = R_{k+1} + {???}\,\theta_k + {???}\,\delta_k\lambda_{k+1} + {???}\,\eta_k, \quad\text{where } \eta_k = \delta_k D(1/\gamma_k).$$
This time write $\Delta_k$ for $P\{\sup_F |S(f - f_k)T_k| > R_k\}$. Then we have the recursive tail bound
<40>
$$\Delta_k \le \Delta_{k+1} + P\{\sup_F |S(f - f_{k+1})T_k^cT_{k+1}| \ge (\delta_k + \delta_{k+1})(\theta_k + \lambda_k + \eta_k)\}$$
$$\qquad + P\{\max_F |S(f_{k+1} - f_k)T_kT_{k+1}| > \delta_k(\theta_k + \lambda_{k+1} + \eta_k)\} + P\{\sup_F |S(f - f_k)T_kT_{k+1}^c| > \delta_k(\theta_k + \lambda_{k+1} + \eta_k)\}.$$

Bound for the third term

Bound $|f_k - f_{k+1}|$ by $|f_k - f| + |f_{k+1} - f| \le B_{k+1} + B_k$. The indicator function $T_kT_{k+1}$ ensures that
$$\max_F \beta\big((f_k - f_{k+1})T_kT_{k+1}\big) \le \beta_k + \beta_{k+1}, \qquad \max_F \big\|(f_k - f_{k+1})T_kT_{k+1}\big\| \le \delta_k + \delta_{k+1}.$$
The constraint required by Assumption <38> is satisfied:
$$\frac{\beta_k + \beta_{k+1}}{\delta_k + \delta_{k+1}} = \frac{2}{3}\,\frac{1}{\gamma_k} + \frac{1}{3}\,\frac{1}{\gamma_{k+1}} \le \frac{1}{\gamma_k}.$$
It follows that $P\max_F |S(f_k - f_{k+1})T_kT_{k+1}| \le 3\delta_{k+1}\theta_k$.
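The constraint check for the third term reduces to elementary arithmetic once $\delta_{k+1} = \delta_k/2$ is fixed; a small sketch (the $\gamma$ values are hypothetical, chosen only to respect $\gamma_{k+1} \ge \gamma_k$) confirms both the identity and the bound:

```python
# Hedged arithmetic check: with delta_{k+1} = delta_k / 2,
# (beta_k + beta_{k+1}) / (delta_k + delta_{k+1})
#     = (2/3) / gamma_k + (1/3) / gamma_{k+1}  <=  1 / gamma_k
# whenever gamma_{k+1} >= gamma_k (G increasing in each argument).

def combined_ratio(delta_k, gamma_k, gamma_k1):
    beta_k = delta_k / gamma_k
    beta_k1 = (delta_k / 2) / gamma_k1
    return (beta_k + beta_k1) / (delta_k + delta_k / 2)

for gamma_k, gamma_k1 in ((1.0, 1.0), (1.0, 2.0), (0.5, 3.0)):
    r = combined_ratio(1.0, gamma_k, gamma_k1)
    assert abs(r - ((2 / 3) / gamma_k + (1 / 3) / gamma_k1)) < 1e-12
    assert r <= 1 / gamma_k + 1e-12
```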
Bound for the second term

The truncated remainder process $(f - f_{k+1})T_k^cT_{k+1}$ is not simple, because $f - f_{k+1}$ depends on $f$ through more than just the regions of a finite partition of $F$. However, it is bounded in absolute value by $B_{k+1}T_k^cT_{k+1}$. In general, if $g$ and $h$ are functions for which $|g| \le h$, then
$$|Sg| \le \sum_{i\le n} h(\xi_i) + \sum_{i\le n} Ph(\xi_i) = Sh + 2\sum_{i\le n} Ph(\xi_i).$$
Apply the bound with $g = (f - f_{k+1})T_k^cT_{k+1}$ and $h = B_{k+1}T_k^cT_{k+1}$:
<41>
$$P\sup_F |S(f - f_{k+1})T_k^cT_{k+1}| \le P\max_F |SB_{k+1}T_k^cT_{k+1}| + 2\max_F \sum_{i\le n} P B_{k+1}T_k^cT_{k+1}(\xi_i).$$
From the bounds
$$\max_F \beta(B_{k+1}T_k^cT_{k+1}) \le \max_F \beta(B_{k+1}T_{k+1}) \le \beta_{k+1}, \qquad \max_F \|B_{k+1}T_k^cT_{k+1}\| \le \max_F \|B_{k+1}\| \le \delta_{k+1},$$
Assumption <38> gives
$$P\max_F |SB_{k+1}T_k^cT_{k+1}| \le \delta_{k+1}\theta_k,$$
because $\beta_{k+1} = \delta_{k+1}/\gamma_{k+1} \le \delta_{k+1}/\gamma_k$. For the contribution from the expected values, split according to which of $B_k$ or $B_{k+1}$ is larger, to bound the function $B_{k+1}T_k^cT_{k+1}$ by
<42>
$$B_{k+1}\{B_{k+1} > \beta_k\} + B_k\{B_k > \beta_k\} \le B_{k+1}\{B_{k+1} > 2\delta_{k+1}/\gamma_k\} + B_k\{B_k > \delta_k/\gamma_k\}.$$
Thus, via Assumption <38>,
$$\sum_{i\le n} P B_{k+1}T_k^cT_{k+1}(\xi_i) \le \delta_{k+1}D(\gamma_k/2) + \delta_k D(\gamma_k).$$

Bound for the fourth term

By the symmetry between $k$ and $k+1$ in the argument for the second term, we can interchange their roles to get
<43>
$$P\sup_F |S(f - f_k)T_{k+1}^cT_k| \le P\max_F |SB_kT_{k+1}^cT_k| + 2\max_F \sum_{i\le n} P B_kT_{k+1}^cT_k(\xi_i).$$
Assumption <38> applies, because $\beta_k = \delta_k/\gamma_k$, to bound the first contribution by $\delta_k\theta_k$. Arguing as for the second term, bound the contribution from the means by
$$\sum_{i\le n} P B_k(\xi_i)\{B_k(\xi_i) > \delta_k/2\gamma_{k+1}\} + P B_{k+1}(\xi_i)\{B_{k+1}(\xi_i) > \delta_{k+1}/\gamma_{k+1}\} \le \delta_{k+1}D(\gamma_{k+1}) + \delta_k D(2\gamma_{k+1}).$$

Recursive inequality

The recursive inequality <recursive> has now simplified to
$$\Delta_k \le \Delta_{k+1} + 6\delta_{k+1}\theta_k + 6\delta_k D(2\gamma_{k+1}).$$
Because $\Delta_k \le 2n\beta_k \to 0$ as $k \to \infty$, recursive substitution leaves us with a bound for $\Delta_0$ less than
$$5\sum_{k=0}^\infty \theta_k + 2n\sum_{k=0}^\infty \lambda_k.$$
Better to stop at some finite $k$? The integral term in Theorem <main> is a cleaner-looking bound for the sum.

<44> Lemma. Suppose $\xi_i = (x_i, y_i)$ with all $x_i$ and $y_i$ mutually independent. Suppose $|f(x,y)| \le M(y)\varphi(x)$, with $P\exp(\alpha M(y_i)) \le \cdots$ for all $i$ and $|\varphi(x)| \le \beta$ for all $x$. Write $V$ for $\sum_{i\le n} P\varphi(x_i)^2$. Then
$$P\{Sf \ge \lambda\} \le \exp\Big(-\frac{\lambda^2}{4V}\Big) \quad\text{provided } \beta \le \frac{2V}{\lambda\alpha}.$$

The Bennett inequality <7> for $P\{\sum_i W_i \ge \lambda\}$ would follow from a minimization of $\exp\big(-\lambda t + \tfrac12 t^2\sigma^2\psi(t\alpha)\big)$ over $t > 0$.

[§] 8. Notes

Check with the comments from Pyke.

The two main results (Theorems <20> and <39>) give maximal inequalities for expected values ($L^1$ norms) and tail probabilities. These inequalities abstract and extend the arguments first developed by the Seattle group (Pyke 1983, Alexander & Pyke 1986, Alexander 1984, Bass 1985, Ossiander 1987) and subsequently generalized by the Paris group (Massart 1987, Birgé & Massart 1993, Doukhan et al. 1994, Birgé & Massart 1995). Some of the history of the main ideas is discussed in Section History.

The argument to establish the maximal inequality of Assumption <19> for finite classes is essentially Pisier's (1983) method combined with the first step in the derivation of the Bernstein/Bennett inequality.

Donsker? cf. Parthasarathy (1967). Révész and Donsker. (Dudley 1981) (Dudley 1978)

Dudley (1978) introduced the concept of metric entropy with bracketing in order to prove a functional central limit theorem for empirical processes indexed by classes of sets, later extending it to classes of functions in Dudley (1981).

History according to Ron Pyke

Ken Alexander saw the paper of Pyke (1983), and realized how to improve the truncation technique used there.
He applied the improvement in Alexander (1984?). They wrote another joint paper (Alexander & Pyke 1986)—see the note at the end of the paper. Bass (1985) applied the truncation to set-indexed partial-sum processes (the paper was not written up before December 84). Bass and Pyke (paper around 1983?) recognized the truncation problem; they didn't use the best form of truncation. Mina Ossiander worked on her dissertation during the spring and summer of 1984, producing her thesis (published as Ossiander 1987) and a technical report in November–December of that year. She started from the Alexander & Pyke paper, then developed a more general form of the truncation argument (?). There were many discussions between Ossiander and Bass. The final publication dates are not indicative of the true order in which the work was carried out, because of delays in refereeing.

AGOZ. DMR. Cite Dehardt. Birgé and Massart.

• mention where bracketing comes from: history of Seattle contributions, as related by Ron Pyke.

The method is based on the truncation/bracketing argument developed by the Seattle group. I have isolated the key ingredients into the following two assumptions. For an application to dependent variables, DMR introduced a new type of norm for a random variable $X$. The quantile transformation constructs a random variable $Q(U)$ with the same distribution as $X$, by means of a $U$ distributed uniformly on $(0,1)$. For a nonnegative decreasing function $L$ determined by mixing assumptions, they defined a norm by $\|X\|^2 = PL(U)Q(U)^2$. This new norm has properties analogous to those of an $L^2$ norm. They proved elegant maximal inequalities for mixing processes based on bracketings with the new norm.

References

Alexander, K. S. (1984), ‘Probability inequalities for empirical processes and a law of the iterated logarithm’, Annals of Probability 12, 1041–1067.

Alexander, K. S. & Pyke, R.
(1986), ‘A uniform central limit theorem for set-indexed partial-sum processes with finite variance’, Annals of Probability 14, 582–597.

Andrews, D. W. K. & Pollard, D. (1994), ‘An introduction to functional central limit theorems for dependent stochastic processes’, International Statistical Review 62, 119–132.

Bass, R. (1985), ‘Law of the iterated logarithm for set-indexed partial-sum processes with finite variance’, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 70, 591–608.

Billingsley, P. (1968), Convergence of Probability Measures, Wiley, New York.

Birgé, L. & Massart, P. (1993), ‘Rates of convergence for minimum contrast estimators’, Probability Theory and Related Fields 97, 113–150.

Birgé, L. & Massart, P. (1995), Minimum contrast estimators on sieves, in D. Pollard, E. Torgersen & G. L. Yang, eds, ‘A Festschrift for Lucien Le Cam’, Springer-Verlag, New York, pp. ???–???

Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Interchangeability, Martingales, Springer, New York.

Collomb, G. (1984), ‘Propriétés de convergence presque complète du prédicteur à noyau’, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 66, 441–460.

Doukhan, P. & Portal, F. (1983), ‘Moments de variables aléatoires mélangeantes’, Comptes Rendus de l’Académie des Sciences, Paris 297, 129–132.

Doukhan, P. & Portal, F. (1984), ‘Vitesse de convergence dans le théorème central limite pour des variables aléatoires mélangeantes à valeurs dans un espace de Hilbert’, Comptes Rendus de l’Académie des Sciences, Paris 298, 305–308.

Doukhan, P., Massart, P. & Rio, E. (1994), ‘The functional central limit theorem for strongly mixing processes’, Annales de l’Institut Henri Poincaré ??, ???–???

Dudley, R. M. (1978), ‘Central limit theorems for empirical measures’, Annals of Probability 6, 899–929.

Dudley, R. M. (1981), Donsker classes of functions, in M. Csörgő, D. A. Dawson, J. N. K.
Rao & A. K. M. E. Saleh, eds, ‘Statistics and Related Topics’, North-Holland, Amsterdam, pp. 341–352.

Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press, New York.

Massart, P. (1987), Quelques problèmes de vitesse de convergence pour des processus empiriques, PhD thesis, Université Paris Sud, Centre d’Orsay. (Chapter 1A = Massart 1986; Chapter 1B = ‘Invariance principles for empirical processes: the weakly dependent case’.)

Ossiander, M. (1987), ‘A central limit theorem under metric entropy with $L^2$ bracketing’, Annals of Probability 15, 897–919.

Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, New York.

Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.

Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2 of NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.

Pollard, D. (1996), An Explanation of Probability, ??? (Unpublished book manuscript.)

Pyke, R. (1983), A uniform central limit theorem for partial-sum processes indexed by sets, in J. F. C. Kingman & G. E. H. Reuter, eds, ‘Probability, Statistics and Analysis’, Cambridge University Press, Cambridge, pp. 219–240.

Rio, E. (1994), ‘Covariance inequalities for strongly mixing processes’, Annales de l’Institut Henri Poincaré ??, ???–???

Sen, P. (1974), ‘Weak convergence of multidimensional empirical processes for stationary φ-mixing processes’, Annals of Probability 2, 147–154.

Yukich, J. E. (1986), ‘Rates of convergence for classes of functions: the non-iid case’, Journal of Multivariate Analysis 20, 175–189.