Chapter 12
Bracketing methods
SECTION 1 explains how the concept of bracketing may be thought of as a scheme to
partition an index set of functions into finitely many regions, on each of which there
exists a single approximating function with a guaranteed error bound.
SECTION 2 describes a refined technique of truncation and successive approximation,
for the simplest case of an empirical process built from independent variables.
It derives an L1 maximal inequality as a recursive bound relating the errors of
successive approximations.
SECTION 3 presents some examples to illustrate the uses of the results from Section 2.
SECTION 4 generalizes the method from Section 2 to an abstract setting that includes
several sorts of dependence between the summands. The arguments are stated
in an abstract form that covers both independent and dependent summands. All
details of possible dependence are hidden in two Assumptions, one concerning the
norm used to express approximations to classes of functions, the other specifying
the behaviour of maxima of finite collections of centered sums.
SECTION 5 studies the special case of phi-mixing sequences.
SECTION 6 studies the special case of absolutely regular (beta-mixing) sequences.
SECTION 7 points out the difficulties involved in the study of strong-mixing sequences.
SECTION 8 develops a maximal inequality for tail probabilities, by means of a
modification of the methods from Section 4.
1. What is bracketing?
Bracketing arguments have long been used in empirical process theory. A very
simple form of bracketing is often used in textbooks to prove the Glivenko-Cantelli
theorem, the most basic example of a uniform law of large numbers.
Consider the empirical distribution function Fn for a sample ξ1, ..., ξn from a
distribution function F on the real line. That is, Fn(t) denotes the proportion
of the observations less than or equal to t,
Fn(t) = (1/n) Σi≤n {ξi ≤ t}        for each t in R.
The Glivenko-Cantelli theorem asserts that supt |Fn (t) − F(t)| converges to
zero almost surely.
The bracketing argument controls the contributions from an interval
t1 ≤ t ≤ t2 by means of bounds that hold throughout the interval. For such t
we have
Fn (t1 ) − F(t2 ) ≤ Fn (t) − F(t) ≤ Fn (t2 ) − F(t1 ).
The two bounds converge almost surely to F(t1 ) − F(t2 ) and F(t2 ) − F(t1 ).
If t2 and t1 are close enough together—meaning that the probability measure
c David Pollard
Asymptopia: 23 March 2001 1
of the interval (t1 , t2 ] is small enough—then all the Fn (t) − F(t) values, for
t1 ≤ t ≤ t2 , get squeezed close to the origin. If we cover the whole real
line by a union of finitely many such intervals, we are able to deduce that
supt |Fn (t) − F(t)| is eventually small.
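The squeeze is easy to see numerically. The following sketch (hypothetical code, not part of the text) uses a Uniform(0, 1) sample, so that F(t) = t, covers [0, 1] with intervals of F-probability 0.01, and confirms that the finitely many bracketing bounds trap every value of Fn(t) − F(t).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
xi = np.sort(rng.uniform(size=n))        # sample from F(t) = t on [0, 1]

def F_n(t):
    """Empirical distribution function: proportion of observations <= t."""
    return np.searchsorted(xi, t, side="right") / n

# Cover [0, 1] by intervals (t1, t2] of F-probability eps.
eps = 0.01
grid = np.linspace(0.0, 1.0, int(1 / eps) + 1)

# On each interval, Fn(t1) - F(t2) <= Fn(t) - F(t) <= Fn(t2) - F(t1).
lower = F_n(grid[:-1]) - grid[1:]        # Fn(t1) - F(t2)
upper = F_n(grid[1:]) - grid[:-1]        # Fn(t2) - F(t1)
bound = max(upper.max(), -lower.min())   # traps all Fn(t) - F(t) values

# Spot-check the squeeze at many points t.
sup_dev = np.abs([F_n(t) - t for t in rng.uniform(size=1000)]).max()
assert sup_dev <= bound + 1e-12
```

The bound is at most the supremum deviation plus the interval probability eps, so both shrink together as n grows and eps decreases.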
There is a more fruitful way to think of the increment F(t2) − F(t1). If P is
the probability measure corresponding to the distribution function F, the increment
equals the L1(P) distance between the indicator functions of the intervals (−∞, t1]
and (−∞, t2]. The concept of bracketing then has an obvious extension to
arbitrary classes of (measurable) functions on a measurable space (X, A).
bracket.def1
<1>
Definition. A pair of P-integrable functions ℓ ≤ u on X defines a bracket,
[ℓ, u] := {g : ℓ(x) ≤ g(x) ≤ u(x) for all x}. For 1 ≤ q ≤ ∞, the bracketing
number N^(q)_[ ](δ, P) for a subclass of functions F ⊆ Lq(P) is defined as the
smallest value of N for which there exist brackets [ℓi, ui] with P(ui − ℓi)^q ≤ δ^q
for i = 1, ..., N and F ⊆ ∪i [ℓi, ui].
Need bracketing functions in Lq also?
The Definition allows the possibility that the bracketing numbers might
be infinite, but they are useful only when finite. The quantity N^(q)_[ ](δ, P) is also
called the metric entropy with bracketing.
Uniform approximations correspond to bracketing numbers for q = ∞.
For proofs of uniform laws of large numbers, the bracketing numbers for q = 1
are more natural. As you will see later in this Chapter, q = 2 is better suited
for approximations related to functional central limit theorems. In the earlier
empirical process literature for central limit theorems for bounded classes of
functions, bracketing numbers with q = 1 were often used; but extensions
to unbounded classes do seem to require q = 2.
For classes of functions the role of the empirical distribution function is
taken over by the empirical measure Pn , which puts mass 1/n at each of
the n observations,
Pn g = (1/n) Σi≤n g(ξi).
On the real line, Pn(−∞, t] = Fn(t).
uslln
<2>
Example. Suppose the {ξi} are sampled independently from a fixed probability
distribution P, and suppose F is a class of measurable functions with
N^(1)_[ ](δ, F, P) finite for each δ > 0. Deduce a uniform strong law of large
numbers, supF |Pn f − P f| → 0 almost surely. Complete proof
lipschitz
<3>
Example. Bracketing arguments often appear disguised as smoothness
assumptions in the statistics literature. Typically F is a parametric class
of functions {ft : t ∈ T} indexed by a bounded subset T of a Euclidean
space R^k. Suppose the functions satisfy a Lipschitz condition, |ft(x) − fs(x)| ≤
M(x)|t − s|^α, for some fixed α > 0 and some fixed M in Lq(P). Write C for
the Lq(P) norm of M.
For some constant C0 there exists a set of N ≤ C0(1/δ^(1/α))^k points
Get Euclidean-style bracketing numbers
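A Euclidean-style count can be sketched numerically. The code below is a hypothetical concrete instance (not from the text): T = [0,1]^2 with k = 2 and α = 1, f_t(x) = cos(t1·x) + sin(t2·x) on X = [−1,1], which is Lipschitz with M(x) = 2|x|, so C = 2 in the supremum norm. A grid of spacing (δ/2C)^(1/α) supplies one bracket per grid point, giving N of order (1/δ^(1/α))^k.

```python
import numpy as np

k, alpha, C = 2, 1.0, 2.0        # dimension, Lipschitz exponent, norm of M
delta = 0.1

eps = (delta / (2 * C)) ** (1 / alpha)   # spacing that keeps bracket width <= delta
pts = np.arange(0.0, 1.0 + eps, eps)     # grid in each coordinate of T
N = len(pts) ** k                        # one bracket [f_s - C*eps, f_s + C*eps] per grid point s

# Euclidean-style bound: N <= C0 * (1/delta^{1/alpha})^k for a modest constant C0.
C0 = (2 * C + 1) ** k
assert N <= C0 * (1.0 / delta ** (1 / alpha)) ** k

# Check the bracket property for a random parameter: moving t to the nearest
# grid point s gives sup_x |f_t - f_s| <= C*|t - s| <= delta/2.
rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 201)
t = rng.uniform(size=2)
s = np.array([pts[np.abs(pts - ti).argmin()] for ti in t])
gap = np.abs((np.cos(t[0] * x) + np.sin(t[1] * x))
             - (np.cos(s[0] * x) + np.sin(s[1] * x))).max()
assert gap <= C * np.linalg.norm(t - s) + 1e-12
assert gap <= delta / 2
```

The specific class and the constant C0 are illustrative choices; only the order of growth N ≍ (1/δ^(1/α))^k matches the claim in the Example.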
More delicate bracketing arguments will be easier to describe if we recast
the definition into a slightly different form. Suppose F ⊆ ∪i [ℓi, ui], a covering
for a finite δ-bracketing. Consider an f in F. As a tie-breaking rule, choose
the smallest k for which f ∈ [ℓk, uk]. Write Aδ f for (ℓk + uk)/2 and Bδ f for
uk − ℓk. Then |f − Aδ f| ≤ Bδ f and ‖Bδ f‖ ≤ δ, for whatever norm is used
to define the bracketing. Even if F is infinite, there are only finitely many
different approximating functions Aδ f and bounding functions Bδ f . Indeed, the
bracketing serves to partition F into finitely many regions, π = {F1 , . . . , F N }:
if f 1 and f 2 belong to the same Fk then they share the same approximating and
bounding functions. Put another way, as maps from F into finite collections
of functions, both Aδ and Bδ take constant values on each member of the
partition π; they are simple functions, in their dependence on f . In general, let
us call a function on F π-simple if it takes a constant value on each Fi in π .
The definition of bracketing, for an arbitrary norm · on classes of functions,
can be thought of as a scheme of approximation via simple functions.
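The recasting is mechanical, as the following sketch shows (hypothetical code, not from the text: functions are represented by their values on a finite grid of x values, and the brackets are a made-up covering).

```python
import numpy as np

def bracketing_maps(brackets, f_values):
    """Smallest-k tie-breaking rule: find the first bracket [l_k, u_k]
    containing f pointwise; return A_delta f = (l_k + u_k)/2,
    B_delta f = u_k - l_k, and the region index k."""
    for k, (l, u) in enumerate(brackets):
        if np.all(l <= f_values) and np.all(f_values <= u):
            return (l + u) / 2, u - l, k
    raise ValueError("brackets do not cover f")

x = np.linspace(0, 1, 101)
# Overlapping constant brackets [a, a + 0.5] for a = 0, 0.25, ..., 1.75.
brackets = [(np.full_like(x, a), np.full_like(x, a + 0.5))
            for a in np.arange(0, 2.0, 0.25)]

f = 0.3 + 0.1 * np.sin(2 * np.pi * x)    # one member of F
A, B, region = bracketing_maps(brackets, f)

assert np.all(np.abs(f - A) <= B)        # |f - A_delta f| <= B_delta f
```

Any two functions assigned the same region index share the same A and B values, which is exactly the π-simplicity described above.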
bracket.defn
<4>
Definition. For given δ > 0, say that a finite partition πδ of F into disjoint
regions supports a δ-bracketing (for the norm ‖·‖) if there exist functions
Aδ(x, f) and Bδ(x, f) such that:
(i) |f(x) − Aδ(x, f)| ≤ Bδ(x, f) for all x;
(ii) ‖Bδ(·, f)‖ ≤ δ for every f;
(iii) each Aδ(x, ·) and Bδ(x, ·) is πδ-simple as a function on F.
The bracketing number N(δ) is defined as the smallest number of regions
needed for a partition that supports a δ-bracketing.
The bracketing function N(·) is decreasing. Again, it is of use only when
it is finite-valued. Typically N(δ) tends to infinity as δ tends to zero. For the
finer applications of bracketing arguments, we will derive bounds expressed as
integrals involving the bracketing numbers. The bounds will be useful
only when the integrals converge; the convergence corresponds to assumptions
about the rate of increase of N(δ) as δ tends to zero.
The application of bracketing to prove uniform strong laws of large
numbers is crude. More refined arguments are needed to get sharper bounds
corresponding to the central limit theorem level of asymptotics. Traditionally
these bounds have been expressed in terms of the empirical process
νn = √n(Pn − P), by which sums are standardized in a way appropriate to central
limit theorems,
νn g = n^(−1/2) Σi≤n (g(ξi) − Pg(ξi)).
The results in this Chapter are stated for empirical processes constructed from
random elements ξ1 , . . . , ξn taking values in a space X, indexed by classes of
integrable functions.
The general problem is to develop uniform approximations to the empirical
process {νn f : f ∈ F} indexed by a class of functions F on X. In my
opinion, the most useful general solutions give probabilistic bounds on
supF |νn f − νn (Aδ f )|, such as an inequality for tail probabilities or an L q
bound. The behaviour of the process indexed by F can then be related to the
behaviour of the finite-dimensional process {νn f δ : f δ ∈ Fδ }.
In this Chapter, all the arguments for the various probabilistic bounds will
make use of a recursive approximation scheme known as chaining. It is not
hard to understand why useful bounds for the empirical process do not usually
follow by means of a single bracketing approximation. The bracketing bound
destroys the centering. For example, with the upper bound we have
νn(f − Aδ f) = √n Pn(f − Aδ f) − √n P(f − Aδ f)
             ≤ √n Pn(Bδ f) + √n P(Bδ f)
             = νn(Bδ f) + 2√n P(Bδ f).
The lower bound reverses the sign on the 2√n P(Bδ f). If P(Bδ f) were
small compared with n −1/2 , the change in centering would not be important.
Unfortunately that level of approximation would usually require a decreasing
value of δ as n gets larger; we lose the benefits of an approximation by means
of a fixed, finite collection of functions.
The solution to the problem of the recentering is to derive the approximation
in several steps. Suppose A1 and B1 refer to the approximations and bounds
for a δ1 -bracketing, and A2 and B2 refer to the bracketing for a smaller
value δ2 . Apply the empirical process to both sides of the equality f − A1 f =
f − A2 f + A2 f − A1 f , then take a supremum over F to derive the inequality
recursion1
<5>
supF |νn(f − A1 f)| ≤ supF |νn(f − A2 f)| + maxF |νn(A2 f − A1 f)|.
The two suprema bound the errors of approximation for the two bracketings.
The maximum runs over at most N(δ1)N(δ2) pairs of differences between
approximating functions; I write it as a maximum to remind you that it
involves only finitely many differences, even for an infinite F. If we can bound
probabilistically the contribution from that last term then we arrive at a recursive
inequality relating the errors of two bracketing approximations.
Repeated application of the same idea would give a bound for the crude
bracketing approximation as a sum of a bound for a finer approximation plus
a sum of terms coming from the maxima over differences of approximating
functions. It remains only to bound the contributions from those maxima.
Each of the differences A2 f − A1 f is bounded by B1 f + B2 f , with norm
bounded by δ1 + δ2 .
Typically use L2 norm. Bernstein for tail probs of bounded functions.
Chaining must stop when the constant term in the bound overwhelms the variance
term. For uniformly bounded classes of functions, luckily the contributions
at the end of the chain are easy to take care of. For unbounded classes, need to
truncate. Cite Dudley 81 for first form. Moment condition on envelope. Best
form due to Ossiander 1987 and Seattle group; see comments about history
in the Notes Section.
Bernstein
<6>
Lemma. For independent ξ1, ..., ξn and a measurable function g bounded
in absolute value by a constant β,
P{|νn g| ≥ t} ≤ 2 exp( −(1/2) n t² / (‖g‖₂² + (2/3)βt√n) ).
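For orientation, the classical unstandardized form of Bernstein's inequality, P{|S| ≥ s} ≤ 2 exp(−s²/(2(V + bs/3))) for a sum S of independent centered variables bounded by b with V the sum of variances, can be checked by simulation. The code below is an illustrative sketch, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 20_000
W = rng.uniform(-1, 1, size=(reps, n))     # centered, |W_i| <= b = 1
S = W.sum(axis=1)
V, b = n * (1 / 3), 1.0                    # Var Uniform(-1,1) = 1/3

for s in (10.0, 20.0, 30.0):
    bound = 2 * np.exp(-s**2 / (2 * (V + b * s / 3)))
    freq = np.mean(np.abs(S) >= s)         # Monte Carlo tail frequency
    assert freq <= bound, (s, freq, bound)
```

At s = 30 the bound is about 0.09 while the simulated frequency is an order of magnitude smaller, illustrating the slack that the chaining argument later exploits.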
The chaining bounds will be derived by a recursive procedure based
on successive finite approximations to F in the sense of a norm ‖·‖. For
independent {ξi}, it will usually be the L2 norm,
‖g‖₂² = Σi≤n P g(ξi)².
However, the argument is written so that it works for other norms, such as
those introduced by Rio (1994) and Doukhan, Massart & Rio (1994) for mixing
processes. Only two extra specific properties are required of the norm.
Bennett
<7>
In the literature, the most familiar example of such a bound is the Bennett
inequality for sums of independent random variables. Suppose ξ1, ..., ξn are
independent and |g| is bounded by a constant β, that is, β(g) ≤ β. Then
Bennett's inequality asserts that
P{|νn g| ≥ λ‖g‖₂} ≤ 2 exp( −(1/2) λ² B(βλ/‖g‖₂) )        for λ > 0,
where
B(x) = ((1 + x) log(1 + x) − x)/(x²/2).
It is the presence of the nuisance factor, B(βλ/‖g‖₂), that complicates the
chaining argument for tail probabilities. If β stays fixed while λ and ‖g‖₂ are
made smaller, the nuisance factor begins to dominate the bound. It was for this
reason that Bass (1985) and Ossiander (1987) needed to add an extra truncation
step to the chaining argument. The truncation keeps βλ/‖g‖₂ close enough
to zero that one can ignore the nuisance factor, and act as if the centered sum
has sub-gaussian tails.
The possible dependence for the general chaining argument with L1 norms
can also be hidden behind a single assumption.
The key ideas are easiest to explain (and the proof is simplest) for the L1
bound, P supF |νn f − νn(Aδ f)|. The analogous arguments for tail bounds are
better known in the literature.
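The behaviour of the nuisance factor is easy to tabulate. The sketch below is hypothetical code (not from the text), assuming the Bennett function B(x) = ((1+x)log(1+x) − x)/(x²/2): B starts at 1 and decreases to zero, so the bound is effectively sub-gaussian only while βλ/‖g‖₂ stays small.

```python
import math

def B(x):
    """Bennett nuisance factor; B(0) = 1 by continuity."""
    if x == 0:
        return 1.0
    return ((1 + x) * math.log(1 + x) - x) / (x * x / 2)

vals = [B(x) for x in (0.0, 0.01, 0.1, 1.0, 10.0, 100.0)]

# B decreases from 1 toward 0: with a fixed beta, shrinking ||g||_2 along the
# chain eventually destroys the sub-gaussian form of the exponent.
assert all(a >= b for a, b in zip(vals, vals[1:]))
assert abs(B(0.01) - 1.0) < 0.01          # B(x) = 1 - x/3 + O(x^2) near 0
```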
2. Independent summands
Suppose ξ1 , . . . , ξn are independent random variables. Several simplifications
are possible when dealing with independent summands, largely because sums
of bounded independent variables behave almost like subgaussian variables
(Chapter 9).
It is natural to use the L2 norm,
‖f‖₂² := (1/n) Σi≤n P f(ξi)²,
independent.trunc
<8>
for two reasons: it enters as a measure of scale in inequalities for finite
collections of functions; and it is easy to bound contributions discarded during
truncation arguments, by means of the inequality
n^(−1/2) Σi P g(ξi){g(ξi) > √n ‖g‖₂/t} ≤ Σi P g(ξi)² t/(n ‖g‖₂) = t ‖g‖₂.
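The inequality is just Markov's inequality applied to g², as the following numerical sketch confirms (hypothetical code, not from the text; the expectations are taken under the empirical distribution of a simulated nonnegative g).

```python
import numpy as np

rng = np.random.default_rng(2)
support = rng.exponential(size=1000)       # values of a nonnegative g(xi)
n = 50
norm2 = np.sqrt(np.mean(support**2))       # ||g||_2 with the 1/n-averaged norm

for t in (0.1, 0.5, 2.0):
    level = np.sqrt(n) * norm2 / t         # truncation level sqrt(n)*||g||_2/t
    # n^{-1/2} * sum over i of P g{g > level}; the n summands are identical here
    lhs = n**-0.5 * n * np.mean(support * (support > level))
    # Markov: P g{g > c} <= P g^2 / c, which collapses to the bound t*||g||_2
    assert lhs <= t * norm2 + 1e-12
```

The assertion holds for any sample, because the inequality is deterministic once the expectations are read off the same distribution.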
This Section will be devoted to an explanation of the ideas involved in the
proof of the following maximal inequality.
indep.mean
<9>
Theorem. Suppose ξ1, ..., ξn are independent and let F be a class
of functions with finite bracketing numbers (for the L2 norm) for which
∫ from 0 to 1 of √(log N(x)) dx < ∞. Then, for some universal constant C,
P supF |νn(f − Aδ(f))| ≤ C ∫ from 0 to δ of √(log 2N(x)) dx + Rδ,
where
Rδ = n^(−1/2) Σi≤n P B(ξi){B(ξi) > √n β},
with B(x) = maxF Bδ(x, f) and β = δ/√(log 2N(δ)).
Anyone who worries about questions of measurability could work throughout
with the outer integral P*.
The independence is needed only to establish a maximal inequality for
finitely many bounded functions. It depends upon the elementary fact (see
Pollard (1996, Chapter 4?) or Chow & Teicher (1978, page 338)) that the
function defined by ψ(x) = 2(e^x − 1 − x)/x² for x ≠ 0, and ψ(0) = 1, is
positive and increasing over the whole real line.
indep.max
<10>
Lemma. Suppose ξ1, ..., ξn are independent under P. Let G be a finite class
consisting of N functions, each bounded in absolute value by a constant β√n
and with ‖g‖₂ ≤ δ. Then there exists a universal constant C0 such that
P maxg∈G |νn g| ≤ C0 δ √(log(2N))        if β ≤ δ/√(log(2N)).
Proof. Consider first the bound for a single sum. Let Wi = (g(ξi) − Pg(ξi))/√n.
Then |Wi| ≤ 2β and Σi PWi² ≤ δ². The moment generating function of
νn g = Σi Wi equals
Πi (1 + P(tWi) + P((1/2) t² Wi² ψ(tWi)))
  ≤ Πi (1 + P((1/2) t² Wi²) ψ(2tβ))        by the increasing property of ψ
  ≤ exp((1/2) t² Σi PWi² ψ(2tβ))           using 1 + x ≤ e^x
  ≤ exp((1/2) t² δ² ψ(2tβ)).
The same argument works for −g.
Sum up 2N such moment generating functions to get the maximal
inequality. For fixed t > 0,
exp(t P maxG |νn g|) ≤ P exp(t maxG |νn g|)        by Jensen's inequality
  ≤ P exp(t maxG νn g) + P exp(t maxG νn(−g))
  ≤ 2N exp((1/2) t² δ² ψ(2tβ)).
Take logarithms, then put t = √(log(2N))/δ to get
P maxG |νn g| ≤ δ √(log(2N)) (1 + (1/2) ψ(2β √(log(2N))/δ)).
The stated inequality holds with C0 = 1 + ψ(2)/2.
Comment on how the factor of 2 could be absorbed into the constant, except
when N = 1. It would be notationally convenient if we could dispense with
the 2.
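A quick Monte Carlo illustration of the Lemma (hypothetical code, not from the text, with a made-up class of cosine functions evaluated on Uniform(0, 1) observations) shows the bound C0 δ √(log 2N), with C0 = 1 + ψ(2)/2 from the proof, holding with room to spare.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, delta, reps = 500, 20, 1.0, 2000
xi = rng.uniform(size=(reps, n))

# g_j(x) = delta*cos(2*pi*j*x): P g_j = 0, P g_j^2 = delta^2/2 <= delta^2,
# and |g_j| <= delta <= beta*sqrt(n) for beta = delta/sqrt(log(2N)).
nu = np.empty((reps, N))
for j in range(1, N + 1):
    g = delta * np.cos(2 * np.pi * j * xi)
    nu[:, j - 1] = g.sum(axis=1) / np.sqrt(n)      # nu_n g, since P g = 0

psi2 = 2 * (np.e**2 - 1 - 2) / 4                   # psi(2)
C0 = 1 + psi2 / 2
expected_max = np.abs(nu).max(axis=1).mean()       # estimate of P max |nu_n g|
assert expected_max <= C0 * delta * np.sqrt(np.log(2 * N))
```

With these choices the simulated expectation is roughly 1.6 against a bound of about 4, consistent with the crude constants in the proof.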
The proof of the Theorem consists of repeated application of the Lemma to
the finite maxima obtained from a sequence of approximations. For k = 0, 1, ...
and δk = δ/2^k, let πk be a partition of F into at most Nk = N(δk) regions.
Write fk(·) = Ak(·, f) and Bk(·) = Bk(·, f) for the πk-simple functions that
define the δk-bracketing. Define hk = √(log Nk). Here, and subsequently, I omit
the argument f when it can be inferred from context.
Integrals bound sums
Typically integrals that appear in bounds like the one in Theorem <9> are
just tidier substitutes for sums of error terms, a substitution made possible by
the geometric decrease in the {δk}: if h(·) is a decreasing function on R+, then
δk h(δk) = 2(δk − δk+1) h(δk) ≤ 2 ∫ {δk+1 < x ≤ δk} h(x) dx.
Sum over k to deduce that
Σ from k=0 to m of δk h(δk) ≤ 2 ∫ from δm+1 to δ0 of h(x) dx ≤ 2 ∫ from 0 to δ of h(x) dx.
Put h(x) = √(log(2N(x))).
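The substitution can be checked numerically; the sketch below (hypothetical code, not from the text) takes h(x) = √(log(2/x)), which is decreasing with a finite integral near zero, and compares the geometric sum against twice the integral.

```python
import numpy as np

delta = 0.5
h = lambda x: np.sqrt(np.log(2.0 / x))    # a decreasing h with finite integral

# Geometric sum: sum_k delta_k * h(delta_k) with delta_k = delta / 2^k.
m = 40
dk = delta / 2.0 ** np.arange(m + 1)
lhs = float(np.sum(dk * h(dk)))

# Twice the integral over (0, delta], by the trapezoid rule.
xs = np.linspace(1e-12, delta, 200_001)
ys = h(xs)
rhs = 2 * float((((ys[1:] + ys[:-1]) / 2) * np.diff(xs)).sum())

assert lhs <= rhs     # sum_k delta_k h(delta_k) <= 2 * int_0^delta h(x) dx
```

For this h the sum is about 1.41 against a bound of about 1.52, so the integral substitute costs only a modest constant.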
Nested partitions using logs
The bracketing argument is much simpler if the πk partitions are nested (that
is, each πk+1 is a refinement of the preceding πk) and if we can take the Bk
as decreasing in k. The logarithmic dependence on the bracketing numbers lets
us arrange this: without loss of generality we may assume the partitions to be nested
and the bounding functions to decrease as k increases.
Once again write h(x) for √(log 2N(x)). Let πk* be the common refinement
of the partitions π1, ..., πk and let Bk*(f) = min over i≤k of Bi(f). Within each region
of the πk* partition choose Ak*(f) to correspond to the Bi(f) that achieves the
minimum defining Bk*(f). Notice that πk* partitions F into at most
Mk = N(δ1)N(δ2)...N(δk) ≤ exp(Σi≤k h(δi)²) ≤ exp((Σi≤k h(δi))²)
regions. The integral term in Theorem <9> also bounds the sum corresponding to the
finer partition:
Σ from k=0 to ∞ of δk √(log Mk) ≤ Σ from k=0 to ∞ of δk Σj≤k h(δj)
  = Σ from j=0 to ∞ of 2δj h(δj)        because Σk {j ≤ k} 2^(−k) = 2^(−j+1)
  ≤ 8 ∫ from 0 to δ of h(x) dx.
Rather than carry the superscript for πk∗ and Bk∗ , let me just write πk and
Bk , absorbing the extra constants due to the nesting into the C.
Truncation regions
Lemma <10> suggests that we truncate the functions at a level depending on
their L2 norms. Let {βk } be a decreasing sequence of constants, to be specified.
It will turn out that β0 equals the β specified in the statement of the Theorem.
Split supF |νn(f − f0)| into the contributions from the regions {B0 ≤ β0√n}
and {B0 > β0√n}. For each f, bound |νn(f − f0)| by
top.split
<11>
|νn((f − f0){B0 ≤ β0√n})| + n^(−1/2) Σi≤n B0(ξi, f){B0(ξi, f) > β0√n}
  + n^(−1/2) Σi≤n P B0(ξi, f){B0(ξi, f) > β0√n}.
The third term contributes at most
n^(−1/2) Σi≤n P B0(ξi, f)²/(β0 √n) ≤ δ²/β0 = δ h(δ),
which gets absorbed into the integral contribution to the bound. The supremum of
the second term over F is less than the Rδ error. The first term is the starting
point for the recursive procedure known as chaining.
At each succeeding step in the chaining argument we truncate the
differences f − fk more severely, based on the size of Bk. Write Tk(f), or
just Tk, for the indicator function of the set ∩i≤k {Bi(f) ≤ βi√n}. Because the
partitions are nested, there is no harm in letting the truncations accumulate. I
will argue recursively to bound the contribution from the truncated remainders,
Δk = P supF |νn((f − fk)Tk(f))|.
Notice that Δ0 equals the first term in <11>.
Recursive inequality
Start from the recursive equality,
rec.ineq.indep
<12>
(f − fk)Tk = (f − fk+1)Tk+1 + (fk+1 − fk)Tk+1 + (f − fk)Tk T^c_{k+1}.
Notice that the truncations are arranged so that each function on the
right-hand side of <12> is bounded.
Apply νn to both sides of <12>, take suprema of absolute values, then
expectations, to get
mean.indep.rec
<13>
Δk ≤ Δk+1 + P maxF |νn((fk+1 − fk)Tk+1)| + P supF |νn((f − fk)Tk T^c_{k+1})|.
links
trunc
The differences fk+1 − fk contribute an error due to moving down one level in
the chain of approximations; they are contributed by the "links" of the chain.
The indicators Tk T^c_{k+1} pick out the contribution to the error of approximation
when we move from the kth level of truncation to the (k + 1)st. The links
term is already set up for an application of Lemma <10>. The trunc term
can be bounded by a maximum over a finite class, by means of the property
that defines a bracketing.
Bound for the links term
Notice that each of fk = Ak(f), fk+1 = Ak+1(f), and Tk+1 is πk+1-simple;
functions f that lie in the same region of the partition share the same values for
these three quantities. The maximum need run over at most Nk+1 ≤ exp(h²_{k+1})
representative f, one from each region.
Bound |fk − fk+1| by |fk − f| + |fk+1 − f| ≤ Bk + Bk+1. The indicator
function Tk+1 ensures that, for every f,
supx |(fk − fk+1)Tk+1| ≤ (βk + βk+1)√n ≤ 2βk√n,
maxF ‖(fk − fk+1)Tk+1‖₂ ≤ ‖Bk‖₂ + ‖Bk+1‖₂ ≤ δk + δk+1 = 3δk+1.
Notice that the truncation plays no role in bounding the L2 norm. The constraint
required by Lemma <10> is satisfied provided
indep.constraint1
<14>
2βk ≤ 3δk+1/hk+1.
If we can choose the truncation levels {βk} to satisfy this constraint we will
have
P maxF |νn((fk − fk+1)Tk+1)| ≤ 3C0 δk+1 hk+1.
Bound for the trunc term
The truncated remainder process (f − fk+1)Tk T^c_{k+1} is not simple, because
(f − fk+1) depends on f through more than just the regions of a finite
partition of F. However, it is bounded in absolute value by the πk+1-simple
function Bk+1 Tk T^c_{k+1}. In general, if g and G are functions for which |g| ≤ G,
then
|νn g| ≤ n^(−1/2) Σi≤n G(ξi) + n^(−1/2) Σi≤n P G(ξi) = νn G + 2n^(−1/2) Σi≤n P G(ξi).
Invoke the inequality with g = (f − fk+1)Tk T^c_{k+1} and G = Bk+1 Tk T^c_{k+1}:
bracket.bnd
<15>
P supF |νn((f − fk+1)Tk T^c_{k+1})| ≤ P maxF |νn(Bk+1 Tk T^c_{k+1})|
  + 2 maxF n^(−1/2) Σi≤n P(Bk+1 Tk T^c_{k+1})(ξi).
Again the maximum need run over at most Nk+1 representative f. The
functions Bk+1 Tk T^c_{k+1} are bounded in absolute value by βk√n (because Bk+1 ≤
Bk ≤ βk√n on Tk) and have L2 norms less than δk+1. Provided
indep.constraint2
<16>
βk ≤ δk+1/hk+1,
Lemma <10> will bound the first term on the right-hand side by C0 δk+1 hk+1.
Inequality <8> bounds the second term by 2δ²_{k+1}/βk+1.
In summary: provided constraints <14> and <16> hold, we have the
recursive inequality
mean.indep.rec2
<17>
Δk ≤ Δk+1 + 4C0 δk+1 hk+1 + 2δ²_{k+1}/βk+1.
We minimize Rδ and the error term in <17> by choosing the βk as large as
possible, that is, βk = δk+1/hk+1.
Summation
The recursive inequality then simplifies to
Δk ≤ Δk+1 + (2 + 4C0) δk+1 hk+1.
Repeated substitution, and replacement of the resulting sum by its bounding
integral, leaves us with the inequality
Δ0 ≤ Δk + (8 + 16C0) ∫ from δk+1 to δ0 of h(x) dx.
As k tends to infinity, Δk tends to zero (it is bounded by 2nβk), which leads to
the integral bound as stated in the Theorem.

3. A generic bracketing bound
In general the norm ‖·‖ plays the role of a scaling, as suggested by some
probabilistic bound for a single νn g. Such bounds typically also depend on a
second measure of size β(·). In this Chapter, unless otherwise specified, β(g)
will denote the supremum norm supx |g(x)|.
Improvements
• Applies to general norms.
• Lower terminal of integral not quite zero—helpful for mixing
applications.
• Add one more term to the recursive equality, to avoid assumption that
partitions are nested.
mean.assumption
<18>
Assumption. Suppose the norm ‖·‖ satisfies
(i) ‖g1‖ ≤ ‖g2‖ if |g1| ≤ |g2|;
(ii) there exists a nonnegative increasing function D(·) on R+ such that
Σi P g(ξi){g(ξi) > ‖g‖/t} ≤ ‖g‖ D(t)        for each t > 0.
The form of the upper bound is suggested by the methods of Doukhan et
al. (1994) for absolutely regular processes. For independent processes, with the
L2 norm ‖·‖₂, we can take D(t) ≡ t.
max.mean.assumption
<19>
Assumption. Suppose there exist increasing functions G(·) and H(·) for
which the following properties hold. If G is a finite set of at most N functions
on X for which β(g) ≤ β and ‖g‖ ≤ δ for each g ∈ G, then
P maxg∈G |νn g| ≤ δ H(N)        if β ≤ δ/G(N).
For example, for independent summands with ‖·‖₂ as norm, both H(N)
and G(N) can be taken as multiples of √(log(2N)), as shown in Section 2.
The upper bounds as stated are sensible only if the various integrals
converge. The detail of the proof shows that the lower terminal of the integrals
can be replaced by a δ* > 0, with only a slight increase in the constant.
As shown by Birgé & Massart (1993), such a refinement is important for
applications to minimum contrast estimators for infinite-dimensional parameters.
For Theorem <20>, the δ* is determined by the equality
√n δ* = ∫ from δ* to δ of J(x) dx.
For Theorem <39>, it is determined by the equality
what?
main.mean
<20>
Theorem. Suppose Assumptions <18> and <19> hold. Then, for a
fixed δ > 0, for some universal constant C0,
P supF |νn(f − Aδ(f))| ≤ C0 ∫ from δ* to δ of J(x) dx + n^(−1/2) Σi≤n P B(ξi){B(ξi) > ??},
where
B(x) = maxF Bδ(x, f),
J(x) = H(N(x)N(x/2)) + D(2G(N(x)N(x/2))),
?? = ??
and δ* is the largest value for which H(N(2δ*)) ≥ √n.
Proof. For k = 0, 1, ... and δk = δ/2^k, let πk be a partition of F into at
most Nk = N(δk) regions. Write fk = Ak(f) and Bk = Bk(·, f) for the
πk-simple functions that define the δk-bracketing. Define γk = G(Nk Nk+1) and
θk = H(Nk Nk+1).
The bound will be derived by a recursive argument, involving successive
simple approximations to νn f. At the kth step, the approximating functions
will be πk*-simple, where πk* denotes the common refinement of πk and πk+1.
Suprema over classes of πk*-simple functions reduce to maxima over at most
Nk Nk+1 representatives, one for each region of πk*. Assumption <19> therefore
suggests that we need functions bounded in absolute value by βk = δk/γk,
a property that will be achieved by truncation. Write Tk(f), or just Tk, for
the indicator function of the set {Bk(f) ≤ βk}. Notice that β0 ≤ M, and
hence T0(f) ≡ 1. I will argue recursively to bound the contribution from the
truncated remainders,
Δk = P* supF |νn((f − fk)Tk(f))|.
Notice that Δ0 is the quantity we seek to bound.
The recursive inequality for Δk will be derived from the equality
recursive.equality
<21>
(f − fk)Tk = (f − fk+1)Tk+1
           − (f − fk+1)T^c_k Tk+1
           + (fk+1 − fk)Tk Tk+1
           + (f − fk)Tk T^c_{k+1}.
Here and subsequently I omit the argument f when it can be inferred from
context. To verify equality <21>, notice that the first two terms on the right-hand
side sum to (f − fk+1)Tk Tk+1, the third term then replaces the factor
(f − fk+1) by (f − fk), and the last term combines the two contributions from
the Bk+1 truncation. The role of the second term is to undo the effect of the Bk
truncation after it has done its work; the successive truncations do not accumulate as
in Ossiander's (1987) argument. Without such a tidying up, products such as
Nk Nk+1 would be replaced by products N0 N1 ... Nk Nk+1, which might cause
summability difficulties if H(N) grew faster than a slow logarithmic rate.
Notice that the truncations are arranged so that each summand on the right-hand
side of <21> is bounded.
Apply νn to both sides of <21>, take suprema of absolute values, then
expectations, to get
mean.recursive
<22>
Δk ≤ Δk+1 + P supF |νn((f − fk+1)T^c_k Tk+1)|
          + P maxF |νn((fk+1 − fk)Tk Tk+1)|
          + P supF |νn((f − fk)Tk T^c_{k+1})|.
Notice that the maximum in the third term on the right-hand side runs over
only finitely many distinct functions; Assumption <19> will handle this term
directly. The bracketing will increase both the second and fourth terms to
bounds on simple processes that will be handled by the same inequality.
The two Assumptions lead to simple bounds for the last three terms on the
right-hand side of <22>.
Bound for the third term
Bound |fk − fk+1| by |fk − f| + |fk+1 − f| ≤ Bk + Bk+1. The indicator function
Tk Tk+1 ensures that
maxF β((fk − fk+1)Tk Tk+1) ≤ βk + βk+1,
maxF ‖(fk − fk+1)Tk Tk+1‖ ≤ δk + δk+1.
The constraint required by Assumption <19> is satisfied:
(βk + βk+1)/(δk + δk+1) = 2/(3γk) + 1/(3γk+1) ≤ 1/γk.
It follows that
P maxF |νn((fk − fk+1)Tk Tk+1)| ≤ (δk + δk+1)θk.
Bound for the second term
The truncated remainder process (f − fk+1)T^c_k Tk+1 is not simple, because
(f − fk+1) depends on f through more than just the regions of a finite partition
of F. However, it is bounded in absolute value by Bk+1 T^c_k Tk+1. In general, if
g and h are functions for which |g| ≤ h, then
|νn g| ≤ Σi≤n h(ξi) + Σi≤n P h(ξi) = νn h + 2 Σi≤n P h(ξi).
Apply the bound with g = (f − fk+1)T^c_k Tk+1 and h = Bk+1 T^c_k Tk+1:
bracket.bnd
<23>
P supF |νn((f − fk+1)T^c_k Tk+1)| ≤ P maxF |νn(Bk+1 T^c_k Tk+1)|
  + 2 maxF Σi≤n P(Bk+1 T^c_k Tk+1)(ξi).
From the bounds
maxF β(Bk+1 T^c_k Tk+1) ≤ maxF β(Bk+1 Tk+1) ≤ βk+1,
maxF ‖Bk+1 T^c_k Tk+1‖ ≤ maxF ‖Bk+1‖ ≤ δk+1,
Assumption <19> gives
P maxF |νn(Bk+1 T^c_k Tk+1)| ≤ δk+1 θk,
because βk+1 = δk+1/γk+1 ≤ δk+1/γk.
For the contribution from the expected values, split according to which of
Bk or Bk+1 is larger, to bound the function Bk+1 T^c_k Tk+1 by
mean.bnd
<24>
Bk+1{Bk+1 > βk} + Bk{Bk > βk} ≤ Bk+1{Bk+1 > 2δk+1/γk} + Bk{Bk > δk/γk}.
Thus, via Assumption <18>,
Σi≤n P(Bk+1 T^c_k Tk+1)(ξi) ≤ δk+1 D(γk/2) + δk D(γk).
Bound for the fourth term
By the symmetry in k and k + 1 in the argument for the second term, we can
interchange their roles to get
bracket.bnd
<25>
P supF |νn((f − fk)T^c_{k+1} Tk)| ≤ P maxF |νn(Bk T^c_{k+1} Tk)|
  + 2 maxF Σi≤n P(Bk T^c_{k+1} Tk)(ξi).
Assumption <19> applies because βk = δk/γk, to bound the first contribution
by δk θk. Arguing as for the second term, bound the contribution from the means
by
Σi≤n P Bk(ξi){Bk(ξi) > δk/(2γk+1)} + Σi≤n P Bk+1(ξi){Bk+1(ξi) > δk+1/γk+1}
  ≤ δk+1 D(γk+1) + δk D(2γk+1).
Notice that D(2γk+1) is the largest of the four contributions from D to the
second and fourth terms.
Recursive inequality
The recursive inequality <22> has now simplified to
Δk ≤ Δk+1 + 6δk+1 θk + 6δk+1 D(2γk+1).
Recursive substitution, and replacement of the resulting sum by its bounding
integral, leaves us with the inequality
Δ0 ≤ Δk + 12 ∫ from δk to δ0 of J(x) dx,
with J(x) as in Theorem <20>. As k tends to infinity, Δk tends to zero (it is
bounded by 2nβk), which leads to the integral bound as stated in the theorem.
A slightly better bound is obtained by choosing a k just large enough to
make Δk comparable to the other terms in the bound. Remember that δ* is
determined by the equality √n δ* = ∫ from δ* to δ of J(x) dx = J. Choose the largest k for
which δk+1 ≥ δ*. Bound Δk using the bracketing inequality,
Δk ≤ P maxF |νn(Bk{Bk ≤ βk})| + 2 Σi≤n P Bk{Bk ≤ βk}.
Notice that the T^c_{k+1} factor is missing. That has no effect on the first contribution,
which, by virtue of Assumption <19>, is less than
δk H(Nk) ≤ 2 ∫ from δk+1 to δk of J(x) dx.
The sum of expected values is no longer in the form needed by Assumption <18>.
Instead, it can be bounded by the corresponding second moment,
2n^(1/2) ((1/n) Σi≤n P Bk²)^(1/2) ≤ 2√n δk ≤ 8J.
The entire Δk contribution has been absorbed into integral terms, leaving a final
upper bound of 20J for Δ0.

4. Phi mixing
Let Bk denote the sigma-field generated by ξ1, ..., ξk and Ak denote the
sigma-field generated by ξk, ξk+1, .... Say that {ξi} has phi-mixing coefficients
{φm} if, for all nonnegative integers k and m,
|P(AB) − (PA)(PB)| ≤ φm PB        for all B ∈ Bk and A ∈ Ak+m.
If X is Bk -measurable and integrable, and Y is Ak+m -measurable and bounded
by a constant K , a simple approximation argument (see Billingsley 1968,
page 170) shows that
covar
<26>
|P(XY) − (PX)(PY)| ≤ 2Kφm P|X|.
This inequality leads to a bound on the moment generating function of a sum,
which will play the same role as Lemma <indep.mgf>. The argument is
a slightly modified form of the proof of Collomb’s (1984) Lemma 1, with
elimination of his first moment quantities from the bound.
Once again, work with the L² norm, ‖g‖₂² = Σ_{i≤n} P g(ξ_i)².
phi.mgf <27>
Lemma. Let W_1, ..., W_n be random variables with phi-mixing coefficients {φ_m}. Suppose each |W_i| is bounded by a constant α and PW_i = 0, and that Σ_{i≤n} PW_i² ≤ σ². Then, with C = 4 + 16 Σ_{i≤n} √φ_i,
    P exp(t Σ_{i≤n} W_i) ≤ exp( Ct²σ² + n√e φ_m/m )
for each t > 0 and each positive integer m with 1 ≤ m ≤ n and mtα ≤ 1/4.
Proof. Break {1, ..., n} into successive blocks B_1, B'_1, ..., B_N, B'_N of length m (except for B_N and B'_N, which might be shorter). Notice that n + 2m ≥ 2Nm ≥ n, whence n/2m ≥ N − 1. Define B_i = Σ_{j∈B_i} W_j and T_i = B_1 + ... + B_i. Define B'_i and T'_i in a similar fashion. Write σ_j² for PW_j², and V_i for Σ_{j∈B_i} σ_j².
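The blocking scheme can be sketched concretely; the helper below is an illustration of the construction, not code from the text:

```python
def alternating_blocks(n, m):
    # split {1,...,n} into pairs (B_i, B'_i) of consecutive blocks of length m;
    # the last pair may be shorter (or have an empty B'_N)
    idx = list(range(1, n + 1))
    pairs = []
    pos = 0
    while pos < n:
        pairs.append((idx[pos:pos + m], idx[pos + m:pos + 2 * m]))
        pos += 2 * m
    return pairs

n, m = 23, 4
pairs = alternating_blocks(n, m)
N = len(pairs)
assert n + 2 * m >= 2 * N * m >= n           # as in the proof
assert n / (2 * m) >= N - 1                  # whence n/2m >= N - 1
for (B1, _), (B2, _) in zip(pairs, pairs[1:]):
    assert min(B2) - max(B1) >= m + 1        # odd blocks separated by >= m
flat = [i for B, Bp in pairs for i in B + Bp]
assert sorted(flat) == list(range(1, n + 1)) # every index used exactly once
```

The separation of at least m between consecutive odd blocks is what lets inequality <26> be applied with coefficient φ_m in the peeling argument below.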
By convexity,
    P exp(t Σ_{i≤n} W_i) = P exp( ½(2t T_N + 2t T'_N) ) ≤ ½ P exp(2t T_N) + ½ P exp(2t T'_N).
Consider the first term on the right-hand side. Peel off the contribution from the block B_N, using inequality <26> for a separation of at least m. Notice that |2t B_N| ≤ 2tmα ≤ ½ by the constraint on m and t, whence ‖exp(2t B_N)‖_∞ ≤ √e. From inequality <26>,
    P exp(2t(T_{N−1} + B_N)) ≤ P exp(2t T_{N−1}) P exp(2t B_N) + 2φ_m √e P exp(2t T_{N−1}).
Because e^x ≤ 1 + x + x² for |x| ≤ ½, we also have
    P exp(2t B_N) ≤ 1 + 4t² P B_N².
The mixing condition lets us bound P B_N² by ¼ C V_N (the argument is essentially the one on page 172 of Billingsley (1968)), which leads to
    P exp(2t T_N) ≤ P exp(2t T_{N−1}) (1 + Ct² V_N + 2φ_m √e) ≤ P exp(2t T_{N−1}) exp( Ct² V_N + 2φ_m √e ).
Repeat the argument another N − 2 times to get
    P exp(2t T_N) ≤ P exp(2t T_1) exp( Ct²(V_2 + ... + V_N) + 2(N − 1)φ_m √e )
                 ≤ exp(Ct² V_1) exp( Ct²(V_2 + ... + V_N) + (n/m)φ_m √e ).
A similar argument gives a similar inequality for P exp(2t TN ). The asserted
bound on the moment generating function follows.
The inequality, as stated, requires no convergence condition on the mixing
coefficients; the bound is for a fixed n. However, in applications the inequality
is needed for arbitrarily large n. In that case one needs the familiar condition
    Σ_{k=1}^∞ √φ_k < ∞.
The dependence between the variables has contributed the φm /m term,
which indirectly provides the constraint appearing in the phi-mixing incarnation
of Assumption <19>. The constraint is best expressed in terms of a decreasing
function ρ for which φm /m ≤ ρ(m) for each m. The dependence throws an
extra factor ρ −1 ((log 2N )/n) into the definition of G. When log 2N > nρ(1),
this factor should be interpreted as 1.
phi.max <28>
Corollary. Let {ξ_i} have phi-mixing coefficients for which Σ_{m=1}^∞ √φ_m < ∞. Let G be a finite class consisting of N functions for which β(g) ≤ β and ‖g‖₂ ≤ δ. Then there exists a constant C_4, depending only on the mixing coefficients, such that
    P max_{g∈G} |Sg| ≤ C_4 δ √(log 2N)    if β ≤ δ/G(N),
where
    G(N) = 16 ρ^{-1}((log 2N)/n) √(log 2N).
That is, Assumption <19> holds with H(N) = C_1 √(log 2N) and G(N) as given.
Proof.
Argue as for Corollary <10> that, provided 0 < 8mβt < 1,
    P max_G |Sg| ≤ (log 2N)/t + C_1 t δ² + √e ρ(m) n/t.
Choose t = √(log 2N)/δ, and let m be the smallest integer greater than 1 for which nρ(m) ≤ log 2N. Then m/2 ≤ m − 1 ≤ ρ^{-1}((log 2N)/n), from which the constraint follows.
[Margin note: Derive the analogous tail bound?]
[Margin note: Section incomplete from here on.]
The effect of the extra factor in the definition of G is best understood through a specific example.
phi.clt1 <29>
Example. Suppose {ξ_i} is a stationary sequence with marginal distribution P and mixing coefficients such that ρ(x) = O(x^{-p}). Suppose the class F has envelope F such that P F^v < ∞ for some v > 2, and bracketing numbers with log N_2(x) = O(x^{-2w}) for some w < 1. As for Example <mean.Ossiander>, a functional central limit theorem can be proved provided p, v, and w are suitably related.
Let ε_n tend to zero so slowly that n P{F > ε_n n^{1/v}} → 0. Then the truncated class F* = { f{F ≤ ε_n n^{1/v}}/√n : f ∈ F } has elements bounded by M_n = ε_n n^{1/v − 1/2}, which tends to zero faster than a small negative power of n.
[Margin note: Need to let δ tend to zero fast enough so that the n^{1/p} does not blow up the entropy integral. This δ will be different from the fixed δ for the stochastic equicontinuity.]
phi.clt2 <30>
Example (à la DMR). Same setup as for the previous example, but work with a new norm. DMR (following Rio 1994) realized that it was more elegant to absorb the required rate of convergence for a sequence of mixing coefficients into the definition of a norm,
    ‖X‖²_{2,r} = ∫₀¹ r^{-1}(u) Q_X(u)² du.
With a slight change of notation, their Lemma 5 corresponds precisely to my Assumption <18>.
[Margin note: Make sure r^{-1} ≥ 1, so that the new norm is larger than the L² norm.]
For a nonnegative random variable X define F̄(x) = P{X > x}, with corresponding quantile function Q_X(u) = inf{x : F̄(x) ≤ u} for 0 < u < 1. The subtle choice of inequalities ensures that
    F̄(x) > u    if and only if    x < Q_X(u).
Consequently, X has the same distribution as Q_X(U), where U is distributed Uniform(0, 1).
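The two displayed facts can be checked exactly for a small discrete distribution (the support and probabilities below are an assumed toy example); exact rational arithmetic avoids any boundary ambiguity in the inequalities:

```python
from fractions import Fraction as Fr

support = [0, 1, 3, 7]                       # an assumed toy distribution
probs = [Fr(1, 2), Fr(1, 4), Fr(1, 8), Fr(1, 8)]

def Fbar(x):
    # Fbar(x) = P{X > x}: decreasing, right-continuous step function
    return sum((p for v, p in zip(support, probs) if v > x), Fr(0))

def Q(u):
    # Q_X(u) = inf{x : Fbar(x) <= u}; the inf is attained at a support point
    return min(v for v in support if Fbar(v) <= u)

us = [Fr(k, 16) for k in range(1, 16)]
for x in range(-1, 9):
    for u in us:
        # the equivalence: Fbar(x) > u  iff  x < Q(u)
        assert (Fbar(x) > u) == (x < Q(u))
        # hence {u : Q(u) > x} = {u : u < Fbar(x)}, an interval of length
        # Fbar(x), which gives P{Q(U) > x} = Fbar(x) for U ~ Uniform(0,1)
        assert (Q(u) > x) == (u < Fbar(x))
```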
DMR5 <31>
Lemma. For each θ > 0,
    P X{X > ‖X‖_{2,r}/√(θ ρ^{-1}(θ))} ≤ ‖X‖_{2,r} √θ.
Proof. Write δ for ‖X‖_{2,r}. Put θ = ρ(ℓ). By definition of the norm,
    δ² ≥ ∫₀^{r(ℓ)} r^{-1}(u) Q_X(u)² du ≥ ℓ r(ℓ) Q_X(r(ℓ))²,
from which it follows that δ ≥ Q_X(r(ℓ)) √(θ ρ^{-1}(θ)). Use the quantile representation to rewrite the left-hand side of the asserted inequality as
    ∫₀¹ Q_X(u){Q_X(u) > Q_X(r(ℓ))} du = ∫₀¹ Q_X(u){u < r(ℓ)} du ≤ ∫₀¹ √(r^{-1}(u)/ℓ) Q_X(u){u < r(ℓ)} du,
which, by the Cauchy–Schwarz inequality, is less than δ √(r(ℓ)/ℓ) = δ√θ, as asserted.
Let g* = g/√n and ‖g*‖ = ‖g(ξ_1)‖_{2,r}. Deduce that
    Σ_{i≤n} P g*(ξ_i){g*(ξ_i) > ‖g*‖/(√n √(θρ^{-1}(θ)))} ≤ √n P g{g > ‖g‖_{2,r}/√(θρ^{-1}(θ))} ≤ √(nθ) ‖g*‖.
In consequence, D(√n √(θρ^{-1}(θ))) = √(nθ). Apply with θ = (log 2N)/n to get
    Σ_{i≤n} P g*(ξ_i){g*(ξ_i) > ‖g*‖/G(N)} ≤ ‖g*‖ √(log 2N).
That should lead to an entropy integral like the one for independent variables, provided bracketing numbers are calculated using the ‖·‖_{2,r} norm. The final CLT should then look like the result for independent random variables.
[§]
5.
Absolute regularity
DMR have established a functional central limit theorem for stationary, absolutely regular sequences; absolute regularity is a slightly weaker property than the phi-mixing assumption. The precise definition of absolute regularity is unimportant here. It matters only that it leads to a moment maximal inequality involving an interesting new norm.
Using the Berbee coupling between absolutely regular sequences and independent sequences, DMR also established an inequality that corresponds to my Assumption <19>. Write L(N) for the maximum of 1 and log N. Let G be a finite class consisting of N functions for which β(g) ≤ β and ‖g‖_{2,r} ≤ δ. Put
    G(N) = ρ^{-1}(L(N)/n) L(N)/8.
DMR4 <32>
Then Lemma 4 of DMR implies existence of a universal constant C_3 such that
    P max_{g∈G} |Sg/√n| ≤ δ C_3 √L(N)    if β ≤ √n δ/G(N).
[Margin note: How to rescale? Derive the DMR bound?]
A saturation lemma
<33>
Lemma. Suppose D(·) is a decreasing nonnegative function on (0, 1] for which ∫₀¹ D(x) dx < ∞. The integral condition imposes some constraint on how fast D(x) can increase as x tends to zero. Often one needs more precise control over the rate. DMR have shown how to replace D by a slightly larger function for which one has such control. More precisely, the function defined by
    D̄(x) = sup_{0<t≤x} (t/x)² D(t)
has the following properties.
(i) D̄(x/2) ≤ 4 D̄(x), for each x.
(ii) D̄ is decreasing.
(iii) D̄(x) ≥ D(x).
(iv) For each nonnegative function γ on R⁺ for which γ(x)/x is increasing,
    ∫₀^ε γ(D̄(x)) dx ≤ 4 ∫₀^ε γ(D(x)) dx,    for each 0 < ε ≤ 1.
Property (i) follows from the fact that the supremum defining D̄(x/2) runs over half the range, and it has an extra factor of (1/2)² in the denominator, as compared with the definition for D̄(x). Properties (ii) and (iii) follow from the equivalent definition for D̄,
    D̄(x) = sup_{0<s≤1} s² D(sx).
Property (iv) is more delicate. Split the range for the supremum defining D̄(x) into 0 < t ≤ x/2 and x/2 < t ≤ x, then bound D(t) from above by D(x/2) on the second subrange, to obtain
    D̄(x) ≤ max( ¼ D̄(x/2), D(x/2) ).
Apply γ to both sides, invoke the fact that γ(t/4) ≤ γ(t)/4, then bound the maximum by a sum:
    ∫₀^ε γ(D̄(x)) dx ≤ ∫₀^ε γ(D̄(x/2))/4 dx + ∫₀^ε γ(D(x/2)) dx.
With the change of variable y = x/2, the first integral on the right-hand side is seen to be less than half the integral on the left-hand side, and the second integral is seen to be less than 2 ∫₀^ε γ(D(x)) dx. Property (iv) would then follow by a rearrangement of terms, provided all integrals are finite. To cover the case of infinite integrals, replace D by D_C(x) = D(x) ∧ C, for a constant C, derive the inequality from property (iv) with D replaced by D_C, then invoke Monotone Convergence as C tends to infinity.
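A small numeric sketch of the saturation construction, with D(x) = log(1/x) as an assumed input (decreasing, nonnegative on (0,1], integrable near 0); properties (i)-(iii) are checked on a grid:

```python
import math

def D(x):
    # an assumed decreasing, integrable D on (0, 1]
    return math.log(1.0 / x)

grid = [j / 1000.0 for j in range(1, 1001)]

def Dbar(x):
    # Dbar(x) = sup_{0 < t <= x} (t/x)^2 D(t), approximated on the grid
    return max((t / x) ** 2 * D(t) for t in grid if t <= x)

xs = [0.01 * j for j in range(1, 101)]
vals = [Dbar(x) for x in xs]
assert all(Dbar(x) >= D(x) - 1e-12 for x in xs)             # property (iii)
assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))  # property (ii)
assert all(Dbar(x / 2) <= 4 * Dbar(x) + 1e-12 for x in xs)  # property (i)
```

For this D the supremum moves away from t = x near x = 1, so D̄ is strictly larger than D there: D(1) = 0 but D̄(1) ≈ 0.18.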
[§]
6.
Strong mixing
Let Y = (Y_1, ..., Y_n). Fix a γ > 0. For each τ ≥ 2 define
    N(τ, Y) = ( Σ_{i≤n} (P|Y_i|^{τ+γ})^{τ/(τ+γ)} )^{1/τ}.
Doukhan (or Andrews and Pollard) have shown, under the assumption that
    Σ_{i=1}^∞ (1 + i)^{τ−2} α_i^{γ/(τ+γ)} < ∞,
that
Doukhan.moment <34>
    P|W_1 + ... + W_n|^τ ≤ C_τ^τ min( N(2, W), N(τ, W) ).
Let ‖g‖ = N(2, Y), where Y_i = g(ξ_i).
Fact: If β(g) ≤ δ = ‖g‖ then N(Q, Y) ≤ N(2, Y).
Proof: Let s = Q(2 + γ)/2. Then
    N(Q, Y) = ( Σ_{i≤n} (P|Y_i|^{Q+γ})^{Q/(Q+γ)} )^{1/Q} ≤ ( Σ_{i≤n} (P|Y_i|^s)^{Q/s} )^{1/Q}
            ≤ ( Σ_{i≤n} (P|Y_i|^{2+γ} β^{s−2−γ})^{Q/s} )^{1/Q}
            ≤ β^{(Q−2)/Q} N(2, Y)^{2/Q} ≤ N(2, Y).
Fact:
    Σ_{i≤n} P|Y_i|{|Y_i| > ‖Y‖/t} ≤ ‖Y‖ t^{1+γ}.
Proof: Let δ = ‖Y‖. Let α_i denote the L^{2+γ} norm of Y_i. Note that δ ≥ max_i α_i. The left-hand side is less than
    Σ_{i≤n} P|Y_i|^{2+γ} t^{1+γ}/δ^{1+γ} = t^{1+γ} Σ_{i≤n} α_i^{2+γ}/δ^{1+γ} ≤ t^{1+γ} Σ_{i≤n} α_i²/δ = δ t^{1+γ}.
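Both Facts are finite deterministic inequalities about the norm N(·, Y), so they can be checked exactly on small discrete distributions (the three-point distributions below are this sketch's assumptions):

```python
import random

random.seed(3)
gamma = 1.0

def Lnorm(dist, p):
    # dist is a list of (value, probability) pairs; returns (P|Y|^p)^(1/p)
    return sum(pr * abs(v) ** p for v, pr in dist) ** (1.0 / p)

def N(tau, dists):
    # N(tau, Y) = (sum_i ||Y_i||_{tau+gamma}^tau)^(1/tau)
    return sum(Lnorm(d, tau + gamma) ** tau for d in dists) ** (1.0 / tau)

for trial in range(20):
    n = random.randint(2, 6)
    dists = [[(random.uniform(-1, 1), 1.0 / 3) for _ in range(3)]
             for _ in range(n)]
    norm2 = N(2.0, dists)
    beta = max(abs(v) for d in dists for v, _ in d)     # sup bound on the Y_i
    if beta <= norm2:                                   # hypothesis of Fact 1
        for Q in [2.5, 3.0, 4.0]:
            assert N(Q, dists) <= norm2 + 1e-10
    for t in [0.5, 1.0, 2.0]:                           # Fact 2 needs no hypothesis
        lhs = sum(pr * abs(v) for d in dists for v, pr in d
                  if abs(v) > norm2 / t)
        assert lhs <= norm2 * t ** (1 + gamma) + 1e-10
```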
[§]
7.
Tail probability bounds
A tail bound for a single Sg easily implies a maximal inequality for a finite class of functions: one has merely to add together a finite number of bounds. For example, with independent summands, and a class G of at most N functions each satisfying ‖g‖₂ ≤ δ and β(g) ≤ β, the Bennett inequality would give
    P{max_G |Sg| > λδ} ≤ exp( log(2N) − ½ λ² B(βλ/δ) ).
In chaining arguments we need λ large enough to absorb the log(2N), which suggests that we replace λ by λ + log(2N). If β is constrained to be less than δ/(λ + log(2N)) then we get the sub-gaussian tail exp(−cλ²), for some constant c. It turns out that the derivation of such a maximal inequality for finite classes is the only part of the argument where independence is needed. Similar inequalities (derived by slightly more involved arguments; see Sections 3 and 4) hold for various types of mixing sequence; they play the same role in the chaining argument. In general, for a chaining argument with tail probabilities, all details about possible dependence can be hidden behind a single assumption.
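A numeric sketch of the absorption step. The Bennett factor B(x) = 2((1+x)log(1+x) − x)/x² decreases from B(0+) = 1, so a constraint that keeps the argument of B at most 1 keeps the exponent sub-gaussian. The particular shift √(log(2N)/c) used below is this sketch's convenient choice (the text's λ + log(2N) also works, with more to spare):

```python
import math

def B(x):
    # the Bennett factor: B(0+) = 1, decreasing
    if x == 0.0:
        return 1.0
    return 2.0 * ((1.0 + x) * math.log(1.0 + x) - x) / x ** 2

xs = [i / 100.0 for i in range(0, 101)]
vals = [B(x) for x in xs]
assert abs(vals[0] - 1.0) < 1e-9
assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))  # decreasing on [0,1]

c = B(1.0) / 2.0                        # ~ 0.386
N, delta = 1000, 1.0
L = math.sqrt(math.log(2 * N) / c)      # this sketch's choice of shift
for lam in [1.0, 2.0, 5.0]:
    beta = delta / (lam + L)            # constraint on the sup bound
    x = beta * (lam + L) / delta        # = 1, so B(x) >= 2c
    exponent = math.log(2 * N) - 0.5 * (lam + L) ** 2 * B(x)
    # sub-gaussian: the log(2N) term has been absorbed
    assert exponent <= -c * lam ** 2 + 1e-9
```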
max.tail.assumption <35>
Assumption. Suppose there exist functions G(·, ·) and H(·), increasing in each argument, and a decreasing function τ(·), for which the following property holds. If G is a finite set of at most N functions on X for which β(g) ≤ β and ‖g‖ ≤ δ for each g ∈ G, then
    P{max_{g∈G} |Sg| ≥ δ(H(N) + λ)} ≤ τ(λ)    if β ≤ δ/G(N, λ),
for each λ > 0.
The argument is slightly more involved than for the proof of Theorem <20>, because there are two more sequences of constants to be chosen correctly.
main.tail <36>
Theorem. Suppose Assumptions <18> and <38> hold. Suppose the functions in F are bounded in absolute value by a constant M ≤ δ/G(N(δ)N(δ/2)), for a fixed δ > 0. Let λ(·) be a decreasing, nonnegative function on R⁺. Then, for some universal constants C_1 and C_2,
    P{ sup_F |S(f − A_δ(f))| > C_2 ∫₀^δ K(x) dx } ≤ C_1 ∫₀^δ τ(λ(x))/x dx,
where K(x) = H(N(x)N(x/2)) + λ(x) D( 2G(N(x)N(x/2), λ(x)) ).
As before, for k = 0, 1, ... and δ_k = δ/2^k, let π_k be a partition of F into at most N_k = N(δ_k) regions, and write f_k = A_k(f) and B_k = B_k(·, f) for the π_k-simple functions that define the δ_k-bracketing. Define γ_k = G(N_k N_{k+1}, λ_k) and β_k = δ_k/γ_k and θ_k = H(N_k N_{k+1}) and λ_k = λ(δ_k).
Start once more from the recursive equality
    (f − f_k)T_k = (f − f_{k+1})T_{k+1} − (f − f_{k+1})T_k^c T_{k+1} + (f_{k+1} − f_k)T_k T_{k+1} + (f − f_k)T_k T_{k+1}^c.
Define a corresponding sequence of constants {R_k}, with
    R_k = R_{k+1} + 6δ_{k+1}(θ_k + λ_{k+1} + η_k),    where η_k = D(2γ_k).
This time write Δ_k for P{sup_F |S(f − f_k)T_k| > R_k}. Then we have the recursive tail bound
tail.recursive <37>
    Δ_k ≤ Δ_{k+1} + P{ sup_F |S(f − f_{k+1})T_k^c T_{k+1}| > δ_{k+1}(θ_k + λ_k + 3η_k) }
        + P{ max_F |S(f_{k+1} − f_k)T_k T_{k+1}| > (δ_k + δ_{k+1})(θ_k + λ_k) }
        + P{ sup_F |S(f − f_k)T_k T_{k+1}^c| > δ_k(θ_k + λ_k + 3η_k) }.
The argument for the second, third, and fourth terms on the right-hand side of <37> parallels the argument started from <22>. I omit most of the repeated detail.
The maximum in the third term again runs over at most N_k N_{k+1} distinct representatives, each with a bound of at most β_k + β_{k+1} and a norm of at most δ_k + δ_{k+1}. The β_k were again chosen to ensure that the constraint required by Assumption <38> is satisfied. It follows that the third term contributes at most τ(λ_k).
As before,
    |S(f − f_{k+1})T_k^c T_{k+1}| ≤ S(B_{k+1}T_k^c T_{k+1}) + 2 max_F Σ_{i≤n} P B_{k+1}T_k^c T_{k+1}(ξ_i),
and the sum of the means is less than 3η_k. Thus the second term is less than
    P{ max_F S(B_{k+1}T_k^c T_{k+1}) > δ_{k+1}(H(N_k N_{k+1}) + λ_{k+1}) } ≤ τ(λ_{k+1}).
Similarly, the fourth term is bounded by τ(λ_k).
With repeated substitution, the recursive inequality <37> leads to
    Δ_0 ≤ Δ_k + 2 ∫_{δ_{k+1}}^{δ} τ(λ(x))/x dx,    with    R_0 = R_k + 12 ∫_{δ_{k+1}}^{δ} K(x) dx.
Argue as in the previous section, but with J(·) replaced by K(·), to bound the contributions from Δ_k and R_k for an appropriately chosen k.
[Margin note: Not quite right. There is another contribution to the tail bound. Fix.]
Old section on Tail probabilities
max.tail.assumption <38>
Assumption. Suppose there exist functions G(·, ·) and H(·), increasing in each argument, and a decreasing function τ(·), for which the following property holds. If G is a finite set of functions on X for which β(g) ≤ β and ‖g‖ ≤ δ for each g ∈ G, then
    P{max_{g∈G} |Sg| ≥ δ(H(N) + λ)} ≤ τ(λ)    if β ≤ δ/G(N, λ),
for each λ > 0.
main.tail <39>
Theorem. Suppose Assumptions <18> and <38> hold. Suppose the functions in F are bounded in absolute value by a constant M. Suppose δ > 0 is such that M ≤ δ/G(N(δ)N(δ/2)). Then, for some universal constant C_0,
    P{ sup_F |S(f − A_δ(f))| > R_0 } ≤ C_0 …,
where
    R_0 = ∫₀^δ [ H(N(x)N(x/2)) + D(G(8N(x)N(x/2))) ] dx,
and …
Proof. The argument is slightly more involved than for the proof of Theorem <main>, because there is one more sequence of constants to be chosen correctly.
As before, for k = 0, 1, ... and δ_k = δ/2^k, let π_k be a partition of F into at most N_k = N(δ_k) regions. Write f_k = A_k(f) and B_k(f) for the π_k-simple functions that define the δ_k-bracketing. Define γ_k = G(N_k N_{k+1}, λ_k) and β_k = δ_k/γ_k and θ_k = H(N_k N_{k+1}), where the {λ_k} remain to be chosen.
Start once more from the recursive equality
    (f − f_k)T_k = (f − f_{k+1})T_{k+1} − (f − f_{k+1})T_k^c T_{k+1} + (A_{k+1} − A_k)T_k T_{k+1} + (f − f_k)T_k T_{k+1}^c.
Define a corresponding sequence of constants {R_k}, with
    R_k = R_{k+1} + ???θ_k + ???δ_k λ_{k+1} + ???η_k,
where η_k = δ_k D(1/γ_k). This time write Δ_k for P{sup_F |S(f − f_k)T_k| > R_k}. Then we have the recursive tail bound
tail.recursive <40>
    Δ_k ≤ Δ_{k+1} + P{ sup_F |S(f − f_{k+1})T_k^c T_{k+1}| ≥ (δ_k + δ_{k+1})(θ_k + λ_{k+1} + η_k) }
        + P{ max_F |S(f_{k+1} − f_k)T_k T_{k+1}| > δ_k(θ_k + λ_{k+1} + η_k) }
        + P{ sup_F |S(f − f_k)T_k T_{k+1}^c| > δ_k(θ_k + λ_{k+1} + η_k) }.
Bound for the third term
Bound |f_k − f_{k+1}| by |f_k − f| + |f_{k+1} − f| ≤ B_{k+1} + B_k. The indicator function T_k T_{k+1} ensures that
    max_F β((f_k − f_{k+1})T_k T_{k+1}) ≤ β_k + β_{k+1},    max_F ‖(f_k − f_{k+1})T_k T_{k+1}‖ ≤ δ_k + δ_{k+1}.
The constraint required by Assumption <38> is satisfied:
    (β_k + β_{k+1})/(δ_k + δ_{k+1}) = (2/3)(1/γ_k) + (1/3)(1/γ_{k+1}) ≤ 1/γ_k.
It follows that
    P max_F |S(f_k − f_{k+1})T_k T_{k+1}| ≤ 3δ_{k+1} θ_k.
Bound for the second term
The truncated remainder process (f − f_{k+1})T_k^c T_{k+1} is not simple, because (f − f_{k+1}) depends on f through more than just the regions of a finite partition of F. However, it is bounded in absolute value by B_{k+1}T_k^c T_{k+1}. In general, if g and h are functions for which |g| ≤ h, then
    |Sg| ≤ Σ_{i≤n} h(ξ_i) + Σ_{i≤n} Ph(ξ_i) = Sh + 2 Σ_{i≤n} Ph(ξ_i).
Apply the bound with g = (f − f_{k+1})T_k^c T_{k+1} and h = B_{k+1}T_k^c T_{k+1}:
tail.bracket.bnd <41>
    P sup_F |S(f − f_{k+1})T_k^c T_{k+1}| ≤ P max_F |S(B_{k+1}T_k^c T_{k+1})| + 2 max_F Σ_{i≤n} P B_{k+1}T_k^c T_{k+1}(ξ_i).
From the bounds
    max_F β(B_{k+1}T_k^c T_{k+1}) ≤ max_F β(B_{k+1}T_{k+1}) ≤ β_{k+1},
    max_F ‖B_{k+1}T_k^c T_{k+1}‖ ≤ max_F ‖B_{k+1}‖ ≤ δ_{k+1},
Assumption <38> gives
    P max_F |S(B_{k+1}T_k^c T_{k+1})| ≤ δ_{k+1} θ_k,
because β_{k+1} = δ_{k+1}/γ_{k+1} ≤ δ_{k+1}/γ_k.
For the contribution from the expected values, split according to which of B_k or B_{k+1} is larger, to bound the function B_{k+1}T_k^c T_{k+1} by
    B_{k+1}{B_{k+1} ≥ B_k > β_k} + B_k{B_k > β_k} ≤ B_{k+1}{B_{k+1} > 2δ_{k+1}/γ_k} + B_k{B_k > δ_k/γ_k}.
Thus, via Assumption <38>,
tail.mean.bnd <42>
    Σ_{i≤n} P B_{k+1}T_k^c T_{k+1}(ξ_i) ≤ δ_{k+1} D(γ_k/2) + δ_k D(γ_k).
Bound for the fourth term
By the symmetry in k and k + 1 in the argument for the second term, we can interchange their roles to get
tail.bracket.bnd <43>
    P sup_F |S(f − f_k)T_{k+1}^c T_k| ≤ P max_F |S(B_k T_{k+1}^c T_k)| + 2 max_F Σ_{i≤n} P B_k T_{k+1}^c T_k(ξ_i).
Assumption <38> applies, because β_k = δ_k/γ_k, to bound the first contribution by δ_k θ_k. Arguing as for the second term, bound the contribution from the means by
    Σ_{i≤n} [ P B_k(ξ_i){B_k(ξ_i) > δ_k/2γ_{k+1}} + P B_{k+1}(ξ_i){B_{k+1}(ξ_i) > δ_{k+1}/γ_{k+1}} ] ≤ δ_{k+1} D(γ_{k+1}) + δ_k D(2γ_{k+1}).
Recursive inequality
The recursive inequality <recursive> has now simplified to
    Δ_k ≤ Δ_{k+1} + 6δ_{k+1} θ_k + 6δ_k D(2γ_{k+1}).
Because Δ_k ≤ 2nβ_k → 0 as k → ∞, recursive substitution leaves us with a bound for Δ_0 less than
    5 Σ_{k=0}^∞ θ_k + 2n Σ_{k=0}^∞ λ_k.
[Margin note: Better to stop at some finite k?]
The integral term in Theorem <main> is a cleaner-looking bound for the sum.
NewBennett <44>
Lemma. Suppose ξ_i = (x_i, y_i), with all the x_i and y_i mutually independent. Suppose |f(x, y)| ≤ M(y)φ(x), with P exp(αM(y_i)) ≤ … for all i and |φ(x)| ≤ β for all x. Write V for Σ_{i≤n} Pφ(x_i)². Then
    P{Sf ≥ λ} ≤ exp( −λ²α²/4V )    provided β ≤ 2V/(λα).
The Bennett inequality <7> for P{Σ_i W_i ≥ λ} would follow from a minimization of exp( −λt + ½t²σ²ψ(tα) ) over t > 0.
[§]
8.
Notes
[Margin note: Check with the comments from Pyke.]
The two main results (Theorems <20> and <39>) give maximal inequalities for
expected values (L 1 norms) and tail probabilities. These inequalities abstract and
extend the arguments first developed by the Seattle group (Pyke 1983, Alexander
& Pyke 1986, Alexander 1984, Bass 1985, Ossiander 1987) and subsequently
generalized by the Paris group (Massart 1987, Birgé & Massart 1993, Doukhan
et al. 1994, Birgé & Massart 1995). Some of the history of the main ideas is
discussed in Section History.
The argument to establish the maximal inequality of Assumption <19>
for finite classes is essentially Pisier’s (1983) method combined with the first
step in the derivation of the Bernstein/Bennett inequality.
Donsker? cf. Parthasarathy (1967).
Dudley (1978) introduced the concept of metric entropy with bracketing in order to prove a functional central limit theorem for empirical processes indexed by classes of sets, later extending it to classes of functions in Dudley (1981).
History according to Ron Pyke
Ken Alexander saw the paper of Pyke (1983), and realized how to improve the truncation technique used there. He applied the improvement in Alexander (1984?). They wrote another joint paper (Alexander & Pyke 1986); see the note at the end of the paper. Bass (1985) applied the truncation to set-indexed partial-sum processes (the paper was not written up before December 1984). Bass and Pyke (paper around 1983?) recognized the truncation problem; they didn't use the best form of truncation. Mina Ossiander worked on her dissertation during the spring and summer of 1984, producing her thesis (published as Ossiander 1987) and a technical report in November–December of that year. She started from the Alexander & Pyke paper, then developed a more general form of the truncation argument (?). There were many discussions between Ossiander and Bass. The final publication dates are not indicative of the true order in which the work was carried out, because of delays in refereeing.
[Margin notes: AGOZ; DMR; cite Dehardt; Birgé and Massart.]
• Mention where bracketing comes from: the history of the Seattle contributions, as related by Ron Pyke. The method is based on the truncation/bracketing argument developed by the Seattle group. I have isolated the key ingredients into the following two assumptions.
For an application to dependent variables, DMR introduced a new type of
norm for a random variable X . The quantile transformation constructs a random
variable Q(U ) with the same distribution as X , by means of a U distributed
uniformly on (0, 1). For a nonnegative decreasing function L determined by
mixing assumptions, they defined a norm by ‖X‖² = P L(U)Q(U)². This new
norm has properties analogous to those of an L 2 norm. They proved elegant
maximal inequalities for mixing processes based on bracketings with the new
norm.
References
Alexander, K. S. (1984), ‘Probability inequalities for empirical processes and a
law of the iterated logarithm’, Annals of Probability 12, 1041–1067.
Alexander, K. S. & Pyke, R. (1986), ‘A uniform central limit theorem for set-indexed partial-sum processes with finite variance’, Annals of Probability
14, 582–597.
Andrews, D. W. K. & Pollard, D. (1994), ‘An introduction to functional central
limit theorems for dependent stochastic processes’, International Statistical
Review 62, 119–132.
Bass, R. (1985), ‘Law of the iterated logarithm for set-indexed partial-sum
processes with finite variance’, Zeitschrift für Wahrscheinlichkeitstheorie
und Verwandte Gebiete 70, 591–608.
Billingsley, P. (1968), Convergence of Probability Measures, Wiley, New York.
Birgé, L. & Massart, P. (1993), ‘Rates of convergence for minimum contrast
estimators’, Probability Theory and Related Fields 97, 113–150.
Birgé, L. & Massart, P. (1995), Minimum contrast estimators on sieves, in
D. Pollard, E. Torgersen & G. L. Yang, eds, ‘A Festschrift for Lucien
Le Cam’, Springer-Verlag, New York, pp. ???–???
Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Interchangeability, Martingales, Springer, New York.
Collomb, G. (1984), ‘Propriétés de convergence presque complète du prédicteur
à noyau’, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete
66, 441–460.
Doukhan, P. & Portal, F. (1983), ‘Moments de variables aléatoires mélangeantes’,
Comptes Rendus de l’Academie des Sciences, Paris 297, 129–132.
Doukhan, P. & Portal, F. (1984), ‘Vitesse de convergence dans le théorème
central limite pour des variables aléatoires mélangeantes à valeurs dans
un espace de Hilbert’, Comptes Rendus de l’Academie des Sciences, Paris
298, 305–308.
Doukhan, P., Massart, P. & Rio, E. (1994), ‘The functional central limit
theorem for strongly mixing processes’, Annales de l’Institut Henri
Poincaré ??, ???–???
Dudley, R. M. (1978), ‘Central limit theorems for empirical measures’, Annals
of Probability 6, 899–929.
Dudley, R. M. (1981), Donsker classes of functions, in M. Csörgő, D. A.
Dawson, J. N. K. Rao & A. K. M. E. Saleh, eds, ‘Statistics and Related
Topics’, North-Holland, Amsterdam, pp. 341–352.
Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application,
Academic Press, New York, NY.
Massart, P. (1987), Quelques problemes de vitesse de convergence pour des
processus empiriques, PhD thesis, Université Paris Sud, Centre d’Orsay.
Chapter 1A = Massart (1986); Chapter 1B = “Invariance principles for
empirical processes: the weakly dependent case”.
Ossiander, M. (1987), ‘A central limit theorem under metric entropy with
L 2 bracketing’, Annals of Probability 15, 897–919.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic,
New York.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2
of NSF-CBMS Regional Conference Series in Probability and Statistics,
Institute of Mathematical Statistics, Hayward, CA.
Pollard, D. (1996), An Explanation of Probability, ??? (Unpublished book
manuscript.).
Pyke, R. (1983), A uniform central limit theorem for partial-sum processes
indexed by sets, in J. F. C. Kingman & G. E. H. Reuter, eds, ‘Probability,
Statistics and Analysis’, Cambridge University Press, Cambridge, pp. 219–
240.
Rio, E. (1994), ‘Covariance inequalities for strongly mixing processes’, Annales
de l’Institut Henri Poincaré ??, ???–???
Sen, P. (1974), ‘Weak convergence of multidimensional empirical processes for
stationary φ-mixing processes’, Annals of Probability 2, 147–154.
Yukich, J. E. (1986), ‘Rates of convergence for classes of functions: the non-iid
case’, Journal of Multivariate Analysis 20, 175–189.