Resampling Methods

Math 6070, Spring 2013

Davar Khoshnevisan
University of Utah

April 4, 2013

Contents

1 Introduction
2 Integration
  2.1 Stieltjes Integration
  2.2 Some Examples
3 The Bootstrap
  3.1 A Bootstrap Confidence Interval
  3.2 Subsampling
  3.3 The Bootstrap for Variance
  3.4 The Bootstrap for Quantiles
  3.5 The Bootstrap for Skewness
  3.6 The Bootstrap for Kurtosis
  3.7 The Bootstrap Can Fail
4 The Jackknife
  4.1 The Jackknife, and Bias Reduction
  4.2 The Jackknife Estimator of Variance
  4.3 The Jackknife Can Fail

1 Introduction
One of the classical questions in asymptotic statistics is the problem of finding a reasonable asymptotic confidence interval for p for a population that is distributed as Bernoulli(p). Let us recall the usual method of accomplishing this.
Let X1, X2, . . . , Xn be an i.i.d. sample from F, where F denotes the cumulative distribution function for Bernoulli(p) and p ∈ (0, 1) is unknown. That is,
\[
F(x) = \begin{cases} 0 & \text{if } x < 0,\\ q & \text{if } 0 \le x < 1,\\ 1 & \text{if } x \ge 1,\end{cases}
\]
where I am writing q := 1 − p to keep with the traditional notation of this subject. In other words, P{Xi = 1} = p and P{Xi = 0} = q. Note that $\bar X := n^{-1}\sum_{i=1}^{n} X_i$ is the sample proportion of ones. Since
\[
\mathrm{E}(\bar X) = p \qquad\text{and}\qquad \mathrm{Var}(\bar X) = \frac{pq}{n},
\]
the central limit theorem assures us that
\[
\sqrt{n}\left( \bar X - p \right) \xrightarrow{d} \mathrm{N}(0\,,pq) \qquad\text{as } n \to \infty.
\]
In particular, when n is large, the following has probability very nearly equal to (1 − α) × 100%:
\[
p \in \left( \bar X - z_\alpha \sqrt{\frac{pq}{n}}\;,\; \bar X + z_\alpha \sqrt{\frac{pq}{n}} \right),
\tag{1}
\]
where zα is a non-random number that is chosen so that
\[
\mathrm{P}\left\{ |\mathrm{N}(0\,,1)| \le z_\alpha \right\} = 1 - \alpha.
\]
However, despite first appearances, (1) is not an asymptotic level (1 − α) × 100% confidence interval for p, since the interval in question depends itself on p, which is sadly unknown. The most common remedy is to estimate further the unknown term pq—in (1)—by X̄(1 − X̄)—which is an asymptotically consistent estimator of pq thanks to the law of large numbers. This leads us to the following, which is asymptotically a level (1 − α) × 100% confidence interval for p, thanks to Slutsky’s theorem:
\[
p \in \left( \bar X - z_\alpha \sqrt{\frac{\bar X(1-\bar X)}{n}}\;,\; \bar X + z_\alpha \sqrt{\frac{\bar X(1-\bar X)}{n}} \right).
\tag{2}
\]
This is a typical example of the [parametric] bootstrap: At the second stage of constructing our confidence interval we use the sample proportion X̄ of ones in place of the population proportion p of ones. Magically, everything works more or less equally well.
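To make the recipe concrete, here is a minimal numerical sketch (mine, not part of the original notes) of the interval (2); the sample size n = 500, the true p = 0.3, and the use of z_{0.05} ≈ 1.96 for a 95% interval are arbitrary choices of the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_true, z_alpha = 500, 0.3, 1.96      # z_alpha chosen so that P{|N(0,1)| <= z_alpha} = 0.95

x = rng.binomial(1, p_true, size=n)      # i.i.d. Bernoulli(p) sample
xbar = x.mean()                          # sample proportion of ones
half_width = z_alpha * np.sqrt(xbar * (1 - xbar) / n)

# The interval (2): the unknown pq is replaced by its plug-in estimate X̄(1 - X̄).
print("95% CI for p: (%.3f, %.3f)" % (xbar - half_width, xbar + half_width))
```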
2 Integration

2.1 Stieltjes Integration
Let F denote a cumulative distribution function, and let us write f := the mass function of F in the discrete case and f := the pdf of F in the absolutely continuous case. That is,
\[
f(x) = \begin{cases} F(x) - F(x-) & \text{if discrete},\\ F'(x) & \text{if absolutely continuous}.\end{cases}
\]
Then we can define the “integral” ∫ g dF as
\[
\int g\,dF := \int_{-\infty}^{\infty} g(x)\,dF(x) :=
\begin{cases} \sum_x g(x)f(x) & \text{if discrete},\\[4pt] \int_{-\infty}^{\infty} g(x)f(x)\,dx & \text{if absolutely continuous},\end{cases}
\]
for every function g where the preceding definition makes sense and/or is finite. The integral ∫ g dF has all of the usual properties of an integral:
Whenever a1, . . . , an are real numbers and g1, . . . , gn are “integrable functions,”
\[
\int \sum_{i=1}^{n} a_i g_i \,dF = \sum_{i=1}^{n} a_i \int g_i\,dF.
\]
If h(x) ≥ 0 for all x then ∫ h dF ≥ 0, we have the triangle inequality,
\[
\left| \int g\,dF \right| \le \int |g|\,dF,
\]
and so forth. ∫ g dF is an example of what is called the Stieltjes integral.

2.2 Some Examples
Throughout, suppose X, X1 , · · · , Xn ∼ F is an i.i.d. sample. As before, F̂n
denotes the empirical distribution function of X1 , . . . , Xn .
Example 1 (Mean). The mean of X can always be thought of as a Stieltjes integral. For instance, in the continuous case,
\[
\int_{-\infty}^{\infty} x\,dF(x) = \int_{-\infty}^{\infty} x f(x)\,dx = \mathrm{E}(X),
\]
and
\[
\int_{-\infty}^{\infty} x\,dF(x) = \sum_x x f(x) = \mathrm{E}(X)
\]
in the discrete case. Similarly, the mean of the random cdf F̂n is
\[
\int_{-\infty}^{\infty} x\,d\hat F_n(x) = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar X.
\]
In somewhat loose words, the population mean is the mean of F and the sample mean is the mean of F̂n.
Example 2 (Variance). For our second elementary example, let us note that if µ := E(X), then
\[
\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x-\mu)^2\,dF(x).
\]
The proof is similar to that of the previous example. Likewise, the variance of F̂n is
\[
\int_{-\infty}^{\infty} (x-\bar X)^2\,d\hat F_n(x) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2,
\]
which is the [biased] sample variance.
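As a quick numerical check of Examples 1 and 2 (a sketch of mine, with an arbitrary simulated sample), the Stieltjes integrals against F̂n are just averages over the data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)            # any i.i.d. sample will do

mean_Fhat = x.mean()                     # ∫ x dF̂n(x) = X̄
var_Fhat = np.mean((x - x.mean()) ** 2)  # ∫ (x - X̄)² dF̂n(x), the [biased] sample variance

print(mean_Fhat, var_Fhat)               # var_Fhat agrees with np.var(x), which also divides by n
```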
Example 3 (Median). Recall that a median of X is any number m such that
\[
\mathrm{P}\{X \le m\} = F(m) \ge \tfrac12 \qquad\text{and}\qquad \mathrm{P}\{X \ge m\} = 1 - F(m-) \ge \tfrac12.
\]
Define the left-continuous inverse F^{-1} of F as follows:
\[
F^{-1}(z) := \min\left\{ x : F(x) \ge z \right\}.
\]
Then, the function F^{-1} is left-continuous, monotone increasing, and has the property that
\[
F(x) \ge z \quad\text{if and only if}\quad x \ge F^{-1}(z).
\]
Let us note that
\[
\mathrm{P}\left\{ X \le F^{-1}(\tfrac12) \right\} = \mathrm{P}\left\{ F(X) \le \tfrac12 \right\}.
\]
If X is absolutely continuous, then F(X) ∼ unif(0, 1) and hence
\[
\mathrm{P}\left\{ X \le F^{-1}(\tfrac12) \right\} = \mathrm{P}\left\{ \mathrm{unif}(0\,,1) \le \tfrac12 \right\} = \tfrac12
\quad\text{and}\quad
\mathrm{P}\left\{ X \ge F^{-1}(\tfrac12) \right\} = \mathrm{P}\left\{ \mathrm{unif}(0\,,1) \ge \tfrac12 \right\} = \tfrac12.
\]
And if X is discrete, then
\[
\mathrm{P}\left\{ X \le F^{-1}(\tfrac12) \right\} = \sum_{x \le F^{-1}(1/2)} f(x) = \sum_{x:\,F(x)\le 1/2} f(x) \ge \tfrac12,
\]
and P{X ≥ F^{-1}(1/2)} ≥ 1/2, similarly. In other words,
\[
F^{-1}(\tfrac12) := \text{the population median}.
\]
Similarly,
\[
\hat F_n^{-1}(\tfrac12) := \text{the sample median};
\]
that is, the smallest point such that at least half of the sample lies to its left. In practice, any point such that at least half of the sample lies to its left and at least half lies to its right can be used. For our particular choice, we have
\[
\hat F_n^{-1}(\tfrac12) = \begin{cases} X_{(n+1)/2:n} & \text{if } n \text{ is odd},\\ X_{n/2:n} & \text{if } n \text{ is even},\end{cases}
\]
written in terms of the order statistics. [Usually, people use ½{X_{n/2:n} + X_{(n/2)+1:n}} when n is even, in place of F̂n^{-1}(1/2), as we have done.]
Example 4 (Quantiles). Given a number p ∈ (0, 1), a pth quantile Qp of X [and/or F] is any number that satisfies
\[
\mathrm{P}\{X \le Q_p\} \ge p \qquad\text{and}\qquad \mathrm{P}\{X \ge Q_p\} \ge 1 - p.
\]
One argues, as we did in the previous example, in order to see that Qp = F^{-1}(p) is always a pth population quantile, and F̂n^{-1}(p) is the pth sample quantile.
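Concretely, F̂n^{-1}(p) = min{x : F̂n(x) ≥ p} is the ⌈np⌉th order statistic, which is easy to compute directly. The sketch below is mine; the data are arbitrary.

```python
import numpy as np

def ecdf_inverse(x, p):
    """Left-continuous inverse of the empirical cdf: the ceil(n*p)-th order statistic."""
    xs = np.sort(x)
    k = int(np.ceil(len(xs) * p))        # smallest k with k/n >= p
    return xs[k - 1]                     # order statistic X_{k:n} (0-based indexing)

rng = np.random.default_rng(2)
x = rng.standard_normal(101)
print(ecdf_inverse(x, 0.5))              # the sample median X_{51:101}, as in Example 3 (n odd)
```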
3 The Bootstrap

Let F denote the [typically unknown] cumulative distribution function of a given population of interest to us. A statistical functional is a real-valued function T(F) of F. You should think of F as the variable of the functional T. Thus, for instance,
\[
T(F) := \mu_F := \int_{-\infty}^{\infty} x\,dF(x)
\]
is the mean functional, and in particular, µ_{F̂n} = X̄ is the sample mean.
The mean is an example of a linear statistical functional. A general linear statistical functional has the form
\[
T(F) := \mathrm{E}\,g(X) = \int_{-\infty}^{\infty} g(x)\,dF(x).
\tag{3}
\]
Even though g need not be a linear function [e.g., as is the case with the variance functional $\int_{-\infty}^{\infty}(x - \mu_F)^2\,dF(x)$], such functionals are linear in the following sense: If λ ∈ (0, 1) and F and G are two cdf's, then λF + (1 − λ)G is also a cdf,¹ and
\[
T(\lambda F + (1-\lambda)G) = \lambda T(F) + (1-\lambda) T(G).
\]
¹The easiest way to understand the meaning of λF + (1 − λ)G is as follows: Suppose X ∼ F and Y ∼ G are independent. Set Z := X with probability λ and Z := Y with probability 1 − λ. Then, you should check that Z ∼ λF + (1 − λ)G.
3.1 A Bootstrap Confidence Interval

Let T be a statistical functional, and suppose that we wish to estimate T(F), where F denotes the unknown cdf for our population. The bootstrap estimate of T(F) is simply T(F̂n). We have seen already that F̂n → F uniformly in probability as n → ∞ [Glivenko–Cantelli theorem]. Therefore, it stands to reason that many statistical functionals T have the property that T(F̂n) → T(F). [Such are called continuous functionals.] Rather than study the functional analysis aspect of such problems, let us concentrate on a simple though telling special case.
If T is a linear statistical functional of the form (3), then
\[
T(\hat F_n) = \frac{1}{n}\sum_{i=1}^{n} g(X_i).
\]
The asymptotic theory of linear statistical functionals is elementary: Since T(F) = ∫ g dF = E g(X), the law of large numbers ensures that
\[
T(\hat F_n) \xrightarrow{P} T(F) \qquad\text{as } n \to \infty,
\]
and in accord with the central limit theorem,
\[
\sqrt{n}\left[ T(\hat F_n) - T(F) \right] \xrightarrow{d} \mathrm{N}\left( 0\,,\sigma^2(F) \right),
\]
where
\[
\sigma^2(F) := \mathrm{Var}(g(X)) = \int_{-\infty}^{\infty} \left( g(x) - T(F) \right)^2 dF(x)
\]
denotes the asymptotic variance. Therefore, if n is large then the following has probability ≈ 1 − α:
\[
T(F) \in \left( T(\hat F_n) - z_\alpha\frac{\sigma(F)}{\sqrt n}\;,\; T(\hat F_n) + z_\alpha\frac{\sigma(F)}{\sqrt n} \right).
\]
But this is not a proper confidence interval [with asymptotic level 1 − α], because the “limiting standard deviation” σ(F) is unknown. Therefore, we use the following bootstrap estimator for σ²(F):
\[
\sigma^2(\hat F_n) = \int_{-\infty}^{\infty} \left( g(x) - T(\hat F_n) \right)^2 d\hat F_n(x)
= \frac{1}{n}\sum_{i=1}^{n} \left( g(X_i) - \frac{1}{n}\sum_{j=1}^{n} g(X_j) \right)^2.
\]
The quantity σ(F̂n)/√n is the bootstrap estimate of the standard error of T(F̂n). And our bootstrap confidence interval for T(F) is
\[
\left( T(\hat F_n) - z_\alpha\frac{\sigma(\hat F_n)}{\sqrt n}\;,\; T(\hat F_n) + z_\alpha\frac{\sigma(\hat F_n)}{\sqrt n} \right).
\tag{4}
\]
By the law of large numbers, σ²(F̂n) converges in probability to σ²(F) as n → ∞, and therefore Slutsky’s theorem justifies the remainder of the following.

Theorem 5. If E(|g(X)|²) < ∞, then the preceding bootstrap confidence interval for T(F)—see (4)—has asymptotic level (1 − α) × 100% as n → ∞.
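As a worked illustration of (4) (my own sketch, not the author's code), take the linear functional T(F) = E g(X) with the arbitrary choice g(x) = x², simulated data, and z_{0.05} ≈ 1.96 for a nominal 95% level:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(400)                 # i.i.d. sample X_1, ..., X_n
g = lambda t: t ** 2                         # T(F) = E g(X); for N(0,1) data this is 1
z_alpha = 1.96                               # two-sided 5% normal critical value

gx = g(x)
T_hat = gx.mean()                            # T(F̂n) = n^{-1} Σ g(X_i)
sigma_hat = np.sqrt(np.mean((gx - T_hat) ** 2))   # σ(F̂n), the plug-in estimate above
half_width = z_alpha * sigma_hat / np.sqrt(len(x))

print("CI for E g(X): (%.3f, %.3f)" % (T_hat - half_width, T_hat + half_width))
```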
3.2 Subsampling

One can view bootstrapping quite abstractly as a resampling and/or subsampling scheme. Suppose F is the unknown cdf and we wish to know about some functional T(F) of F. In order to learn about T(F) we select a random sample X1, . . . , Xn from F and estimate T(F) with Tn := T(F̂n). We can also estimate statistical functionals such as
\[
S(F) := \mathrm{P}\{ T_n \le t \}.
\]
Let us write this quantity, somewhat more precisely, as
\[
S(F) = \mathrm{P}_F\{ T_n \le t \},
\]
in order to emphasize that we are assuming that F is indeed the true cdf. Then we can estimate S(F) by S(F̂n). This looks easy [and is, in some sense] but its meaning is quite subtle. Indeed, S(F̂n) is the [conditional] probability that, having sampled [and conditioned on] X1, . . . , Xn and obtained F̂n, we then subsample from F̂n a sample X1*, . . . , Xn* of size n, compute the corresponding Tn* from the subsample X1*, . . . , Xn*, and find the odds that Tn* is at most t. A question that arises is: What does it mean to sample from F̂n?

Since F̂n is a proper cdf on n points [albeit a random one], we sample from F̂n as follows: Independently from [and conditionally on] the original sample X1, . . . , Xn, we sample with replacement X1*, . . . , Xn* from the set {X1, . . . , Xn}, all points being equally likely. Then, the probability of interest is the conditional probability
\[
S(\hat F_n) = \mathrm{P}\left( T(F_n^*) \le t \mid X_1, \ldots, X_n \right),
\]
where Fn* denotes the empirical cdf of the subsamples; that is,
\[
F_n^*(x) := \frac{1}{n}\sum_{j=1}^{n} I\left\{ X_j^* \le x \right\}.
\]
In general, one cannot compute the random probability S(F̂n). Therefore, one simulates it by sampling N conditionally independent groups of n random variables $X^*_{1,1}, \ldots, X^*_{1,n}, \ldots, X^*_{N,1}, \ldots, X^*_{N,n}$, and then applying the law of large numbers [valid as N → ∞]:
\[
S(\hat F_n) \approx \frac{1}{N}\sum_{j=1}^{N} I\left\{ T(F^*_{j,n}) \le t \right\},
\]
where
\[
F^*_{j,n}(x) := \frac{1}{n}\sum_{k=1}^{n} I\left\{ X^*_{j,k} \le x \right\}
\]
denotes the empirical cdf of the jth group $X^*_{j,1}, \ldots, X^*_{j,n}$ of subsamples.
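The Monte Carlo recipe above takes only a few lines. In the sketch below (mine), the statistic is taken to be Tn = X̄ for concreteness, and the threshold t, the sample, and the number of resampling rounds N are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(200)             # the observed sample X_1, ..., X_n
n, N, t = len(x), 5000, 0.1              # N resampling rounds; t is an arbitrary threshold

# Draw N groups X*_{j,1}, ..., X*_{j,n}, each sampled with replacement from {X_1, ..., X_n}.
resamples = rng.choice(x, size=(N, n), replace=True)
T_star = resamples.mean(axis=1)          # T(F*_{j,n}) for j = 1, ..., N, here with T = the mean

S_hat = np.mean(T_star <= t)             # ≈ S(F̂n) = P{T(F_n^*) <= t | X_1, ..., X_n}
print(S_hat)
```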
3.3 The Bootstrap for Variance

In order to gel the ideas further, let us now consider a statistical functional T, and apply the bootstrap statistic Tn := T(F̂n) to estimate T(F). A question that arises is, how do we estimate the variance of Tn? The standard methods for doing this sort of thing, at this level of generality, typically involve resampling; one uses either the bootstrap [which we do now] or the jackknife [which we cover later]. Both methods work well under some regularity assumptions on the functional T, though I will not discuss this sort of optimality in great detail here because the discussion requires a great deal of functional analysis that is beyond the level of this course.
In order to estimate θ = Var(Tn) via the bootstrap, we perform the following algorithmically:

1. Sample with replacement and at random X1*, . . . , Xn* from the data, all data points being equally likely;

2. Compute Tn* := T(X1*, . . . , Xn*), using the same functional T as the one that led us here;

3. Repeat steps 1 and 2 N times—for a large N—in order to obtain N conditionally independent replicas $T^*_{1,n}, \ldots, T^*_{N,n}$;
4. The bootstrap estimate for θ := Var(Tn) is
\[
V_{\text{bootstrap}} := \frac{1}{N}\sum_{j=1}^{N} \left( T^*_{j,n} - \frac{1}{N}\sum_{k=1}^{N} T^*_{k,n} \right)^2.
\]
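Steps 1–4 translate directly into code. The following sketch is mine; it uses the sample median for T (a functional whose variance has no convenient closed form) and arbitrary data and N.

```python
import numpy as np

def bootstrap_variance(x, T, N=2000, rng=None):
    """Steps 1-4 above: resample with replacement, recompute T, and take the
    sample variance of the N bootstrap replicas."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    replicas = np.array([T(rng.choice(x, size=n, replace=True)) for _ in range(N)])
    return np.mean((replicas - replicas.mean()) ** 2)    # V_bootstrap

rng = np.random.default_rng(5)
x = rng.standard_normal(100)
print(bootstrap_variance(x, np.median, rng=rng))
```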
3.4 The Bootstrap for Quantiles
Let us say a few things about the bootstrap estimation of the pth quantile T(F) := F^{-1}(p) of the cdf F, where 0 < p < 1 is known. Of course, the bootstrap estimate is the pth sample quantile T(F̂n) = F̂n^{-1}(p), as we have seen earlier. The issue is to decide on the asymptotic behavior of F̂n^{-1}(p).
Theorem 6 (Efron, 1979). Suppose that f = F′ exists, is continuous, and f(F^{-1}(p)) > 0. Then,
\[
\sqrt{n}\left( \hat F_n^{-1}(p) - F^{-1}(p) \right) \xrightarrow{d} \mathrm{N}\left( 0\,,\frac{pq}{|f(F^{-1}(p))|^2} \right),
\]
as n → ∞, where q := 1 − p as before.
Sketch of the Proof. In order to understand this result, consider first the special case where the data comes from unif(0, 1). In that case, f(x) = 1, F(x) = x, and F^{-1}(p) = p. The theorem states that, in this case,
\[
\sqrt{n}\left( p - \hat F_n^{-1}(p) \right) \xrightarrow{d} \mathrm{N}(0\,,pq).
\]
But this is not hard to prove, given our existing knowledge of empirical process theory. Indeed, we know that uniformly for all 0 ≤ x ≤ 1,
\[
\sqrt{n}\left( \hat F_n(x) - x \right) \approx B^\circ(x),
\]
and hence the preceding is true even at the random point x = F̂n^{-1}(p). That is,
\[
\sqrt{n}\left( p - \hat F_n^{-1}(p) \right) \approx B^\circ\left( \hat F_n^{-1}(p) \right).
\]
One ready consequence of this is that p − F̂n^{-1}(p) is roughly of order n^{-1/2} and hence F̂n^{-1}(p) ≈ p with high probability when n ≫ 1. Since the Brownian bridge is continuous, it follows that
\[
\sqrt{n}\left( p - \hat F_n^{-1}(p) \right) \approx B^\circ(p) \sim \mathrm{N}(0\,,pq).
\]
This proves the result in the unif(0, 1) case. The general case can be reduced to this one, after a little more work. I will skip almost all of the remaining details, except for the fact that the key ingredient involves variance stabilization. What this means, in rough terms, is this: Suppose √n(Zn − µ) ≈ N(0, pq) in distribution, say thanks to some sort of central limit theorem argument. Then, Zn ≈ µ with high probability and so a Taylor expansion yields the following for every nice function g:
\[
\sqrt{n}\left( g(Z_n) - g(\mu) \right) \approx g'(\mu) \times \sqrt{n}\left( Z_n - \mu \right) \approx \mathrm{N}\left( 0\,,pq\,[g'(\mu)]^2 \right),
\]
in distribution. The connection to Efron's theorem is through the application g(x) := F^{-1}(x). If F^{-1} were the classical [continuous] inverse function to F, then it follows that g′(x) = 1/f(F^{-1}(x)), thanks to elementary calculus. But, as you have been warned, the remaining details are skipped.
Let us return to Theorem 6 and build an estimator for F^{-1}(p). According to Theorem 6, if f = F′ exists and is continuous with f(F^{-1}(p)) > 0, then F̂n^{-1}(p) is a consistent estimator for F^{-1}(p). Next we might wish to develop a confidence interval. But that task presents us with a serious challenge: We need to estimate the asymptotic variance
\[
\tau^2 = \frac{pq}{|f(F^{-1}(p))|^2}.
\]
Once we do this, we know how to proceed, using standard theory.
The quantities p and q = 1 − p are, of course, known. Therefore, one possible approach is to use the estimator
\[
\hat\tau^2 := \frac{pq}{\left| \hat f\left( \hat F_n^{-1}(p) \right) \right|^2},
\]
where f̂ is a reliable kernel density estimator of f. Theorem 6 ensures that F̂n^{-1}(p) → F^{-1}(p) in probability. Therefore, under the conditions of Parzen's uniform consistency theorem, τ̂² → τ² in probability as n → ∞ [Slutsky's theorem]. A more classical approach is to use a subsampling method [Hall, DiCiccio, and Romano, 1989; Falk and Kaufmann, 1991]. Recall that the real issue is to find zα such that
\[
\mathrm{P}_F\left\{ \sqrt{n}\left| \hat F_n^{-1}(p) - F^{-1}(p) \right| \le z_\alpha \right\} \approx 1 - \alpha.
\]
This can be done by resampling methods. That is, we can seek to find zα such that
\[
\mathrm{P}_{\hat F_n}\left\{ \sqrt{n}\left| (F_n^*)^{-1}(p) - \hat F_n^{-1}(p) \right| \le z_\alpha \right\} \approx 1 - \alpha,
\]
where (Fn*)^{-1} denotes the left-continuous inverse to the resampled empirical cdf Fn*.
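In code, the resampling recipe amounts to taking zα to be the (1 − α) quantile of the resampled values √n |(Fn*)^{-1}(p) − F̂n^{-1}(p)|. The sketch below is mine; p, α, N, the data, and the use of the ⌈np⌉th order statistic for the left-continuous inverse are all choices of the illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(300)
n, p, alpha, N = len(x), 0.25, 0.05, 4000

def q_hat(sample, p):
    """Left-continuous inverse of the empirical cdf: the ceil(n*p)-th order statistic."""
    xs = np.sort(sample)
    return xs[int(np.ceil(len(xs) * p)) - 1]

qn = q_hat(x, p)                                     # F̂n^{-1}(p)
boot = np.array([q_hat(rng.choice(x, size=n, replace=True), p) for _ in range(N)])
z_alpha = np.quantile(np.sqrt(n) * np.abs(boot - qn), 1 - alpha)

# The calibrated interval for F^{-1}(p):
print(qn - z_alpha / np.sqrt(n), qn + z_alpha / np.sqrt(n))
```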
3.5 The Bootstrap for Skewness

The skewness of a random variable X is defined as
\[
\gamma := \frac{\mathrm{E}\left[ (X - \mathrm{E}X)^3 \right]}{[\mathrm{Var}(X)]^{3/2}}
= \mathrm{E}\left[ \left( \frac{X - \mathrm{E}X}{\mathrm{SD}(X)} \right)^3 \right].
\]
If the cdf of X is F, then we can always write γ in terms of F [the “skewness of F”] as follows:
\[
\gamma = \frac{\int_{-\infty}^{\infty} (x - \mu_F)^3\,dF(x)}{\left[ \int_{-\infty}^{\infty} (x - \mu_F)^2\,dF(x) \right]^{3/2}},
\qquad\text{where } \mu_F := \int_{-\infty}^{\infty} x\,dF(x).
\]
The skewness γ of F is a “unit-free quantity” and a fairly good measure of how asymmetric the distribution of X is. Stated in terms of F, X is symmetric if and only if F(x) = 1 − F((−x)−) for all x ∈ R, where F(y−) := lim_{z↑y} F(z). Whenever X is symmetric, γ = 0. If γ > 0 then X is mostly “skewed to the right,” whereas γ < 0 implies that X is mostly “skewed
to the left.” We may think of γ as γ(F) and apply our bootstrap [plugin] estimator
\[
G_{\text{bootstrap}} := \frac{\int_{-\infty}^{\infty} (x - \mu_{\hat F_n})^3\,d\hat F_n(x)}{\left[ \int_{-\infty}^{\infty} (x - \mu_{\hat F_n})^2\,d\hat F_n(x) \right]^{3/2}}
= \frac{n^{-1}\sum_{j=1}^{n} (X_j - \bar X)^3}{\left[ n^{-1}\sum_{j=1}^{n} (X_j - \bar X)^2 \right]^{3/2}}
= \frac{\sqrt{n}\,\sum_{j=1}^{n} (X_j - \bar X)^3}{\left[ \sum_{j=1}^{n} (X_j - \bar X)^2 \right]^{3/2}}.
\]
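The last expression for G_bootstrap is ready to compute. Here is a small sketch of my own, checked against a simulated exponential sample, whose population skewness is 2.

```python
import numpy as np

def skewness_plugin(x):
    """Plug-in skewness: sqrt(n) * Σ(X_i - X̄)^3 / [Σ(X_i - X̄)^2]^(3/2)."""
    d = x - x.mean()
    return np.sqrt(len(x)) * np.sum(d ** 3) / np.sum(d ** 2) ** 1.5

rng = np.random.default_rng(7)
print(skewness_plugin(rng.exponential(size=10_000)))   # should be near 2 for Exp(1) data
```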
3.6 The Bootstrap for Kurtosis

The kurtosis of a random variable X is defined as
\[
\kappa := \frac{\mathrm{E}\left( |X - \mathrm{E}X|^4 \right)}{[\mathrm{Var}(X)]^2} - 3
= \mathrm{E}\left[ \left( \frac{X - \mathrm{E}X}{\mathrm{SD}(X)} \right)^4 \right] - 3.
\]
This is a unit-free parameter.
In order to understand this definition better, consider the case that X ∼ N(µ, σ²). In that case, we can let Z := (X − µ)/√Var(X) in order to see that κ = E(Z⁴) − 3, where Z ∼ N(0, 1). Therefore,
\[
\begin{aligned}
\kappa &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^4 e^{-x^2/2}\,dx - 3
= \sqrt{\frac{2}{\pi}}\int_{0}^{\infty} x^4 e^{-x^2/2}\,dx - 3\\
&= \frac{4}{\sqrt\pi}\int_{0}^{\infty} y^{3/2} e^{-y}\,dy - 3 \qquad (y := x^2/2)\\
&= \frac{4}{\sqrt\pi}\,\Gamma(5/2) - 3.
\end{aligned}
\]
But Γ(5/2) = (3/2)Γ(3/2) = (3/2)(1/2)Γ(1/2) = (3/4)√π. Therefore, κ = 3 − 3 = 0 for any normal distribution!
It turns out that κ is usually very sensitive to the choice of normality in the preceding computation. That is, κ ≠ 0 for most “natural” non-normal distributions. Therefore, we can fairly reliably know whether or not a population is normal by just checking [parametrically!] to see if κ = 0 or not. Because
\[
\kappa = \frac{\int_{-\infty}^{\infty} (x - \mu_F)^4\,dF(x)}{\left[ \int_{-\infty}^{\infty} (x - \mu_F)^2\,dF(x) \right]^2} - 3,
\]
we have the plugin [bootstrap] estimate K of kurtosis,
\[
K_{\text{bootstrap}} = \frac{n^{-1}\sum_{i=1}^{n} (X_i - \bar X)^4}{\left[ n^{-1}\sum_{i=1}^{n} (X_i - \bar X)^2 \right]^2} - 3
= \frac{n\sum_{i=1}^{n} (X_i - \bar X)^4}{\left[ \sum_{i=1}^{n} (X_i - \bar X)^2 \right]^2} - 3.
\]
We can understand κ better after we inspect a few more examples.
Example 7 (unif(−1, 1)). If X ∼ unif(−1, 1), then the pdf of X is f(x) = ½ I{−1 < x < 1}, and hence
\[
\mathrm{E}X = 0, \qquad
\mathrm{Var}(X) = \mathrm{E}(X^2) = \frac12\int_{-1}^{1} x^2\,dx = \frac13, \qquad
\mathrm{E}\,|X - \mathrm{E}X|^4 = \mathrm{E}(X^4) = \frac12\int_{-1}^{1} x^4\,dx = \frac15.
\]
Therefore,
\[
\kappa = \frac{1/5}{(1/3)^2} - 3 = \frac95 - 3 = -\frac65.
\]
This is strictly negative, which means that the uniform distribution on (−1, 1) has lighter tails [i.e., has greater concentration near the mean] than any normal distribution does.
Example 8 (DE). Consider the case that X has the double exponential distribution with pdf f(x) = ½ exp(−|x|). In that case,
\[
\mathrm{E}X = 0, \qquad
\mathrm{Var}(X) = \mathrm{E}(X^2) = \frac12\int_{-\infty}^{\infty} x^2 e^{-|x|}\,dx = \int_{0}^{\infty} x^2 e^{-x}\,dx = \Gamma(3) = 2,
\]
\[
\mathrm{E}\,|X - \mathrm{E}X|^4 = \mathrm{E}(X^4) = \frac12\int_{-\infty}^{\infty} x^4 e^{-|x|}\,dx = \int_{0}^{\infty} x^4 e^{-x}\,dx = \Gamma(5) = 4! = 24.
\]
Therefore,
\[
\kappa = \frac{24}{4} - 3 = 3.
\]
The interpretation of this formula is that, since κ > 0, the double exponential distribution has heavier tails [i.e., has smaller concentration near the mean] than any normal does.
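The plug-in estimate K_bootstrap is a one-liner, and the normal case together with Examples 7 and 8 gives three known values to check it against. The sketch below is mine; the sample sizes are arbitrary.

```python
import numpy as np

def kurtosis_plugin(x):
    """Plug-in kurtosis: n * Σ(X_i - X̄)^4 / [Σ(X_i - X̄)^2]^2 - 3."""
    d = x - x.mean()
    return len(x) * np.sum(d ** 4) / np.sum(d ** 2) ** 2 - 3

rng = np.random.default_rng(8)
m = 100_000
print(kurtosis_plugin(rng.standard_normal(m)))       # ≈ 0 for the normal
print(kurtosis_plugin(rng.uniform(-1, 1, size=m)))   # ≈ -6/5, as in Example 7
print(kurtosis_plugin(rng.laplace(size=m)))          # ≈ 3 for the double exponential of Example 8
```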
3.7 The Bootstrap Can Fail
The bootstrap is a powerful method of broad utility. However, there are
occasions where it fails demonstrably. Here is an example.
Example 9 (Efron, 1979). Let X1, . . . , Xn be i.i.d. unif(0, θ) where θ > 0 is unknown. The usual estimator for θ is the largest order statistic Xn:n; this is, for example, the MLE. In order to develop confidence bounds, we may study
\[
T_n := n\left( \theta - X_{n:n} \right).
\]
The scaling factor is n and not √n because in this way
\[
T_n \xrightarrow{d} \mathrm{Exp}(\theta).
\]
The proof is easy, and many of you have seen it in your Math. 5080–5090 sequence: For all t ≥ 0,
\[
\begin{aligned}
\mathrm{P}\{T_n < t\} &= 1 - \mathrm{P}\left\{ n(\theta - X_{n:n}) > t \right\} = 1 - \mathrm{P}\left\{ X_{n:n} < \theta - \tfrac{t}{n} \right\}\\
&= 1 - \left[ \mathrm{P}\left\{ X_1 < \theta - \tfrac{t}{n} \right\} \right]^n\\
&= 1 - \left( 1 - \tfrac{t}{n\theta} \right)^n \qquad \text{(if } n > t/\theta \text{ is large enough)}\\
&\to 1 - \exp\left( -\tfrac{t}{\theta} \right) \qquad \text{as } n \to \infty.
\end{aligned}
\]
But the bootstrap estimate of this probability is
\[
\mathrm{P}_{\hat F_n}\{ T_n^* < t \},
\]
where Tn* is obtained by taking n independent subsamples X1*, . . . , Xn* from X1, . . . , Xn and then computing the conditional probability
\[
\mathrm{P}_{\hat F_n}\left\{ n\left( X_{n:n} - X^*_{n:n} \right) < t \right\}.
\]
Since Xn:n ≥ X*n:n, the preceding probability is greater than or equal to the following:
\[
\begin{aligned}
\mathrm{P}_{\hat F_n}\left\{ n\left( X_{n:n} - X^*_{n:n} \right) = 0 \right\}
&= \mathrm{P}\left( X^*_{n:n} = X_{n:n} \mid X_1, \ldots, X_n \right)\\
&= 1 - \mathrm{P}\left( X^*_{n:n} < X_{n:n} \mid X_1, \ldots, X_n \right)\\
&= 1 - \left( \frac{n-1}{n} \right)^n \to 1 - e^{-1}.
\end{aligned}
\]
That is, the bootstrap probability estimate satisfies²
\[
\lim_{n\to\infty} \mathrm{P}_{\hat F_n}\{ T_n^* < t \} \ge 1 - e^{-1} \qquad\text{for all } t \ge 0.
\]
Therefore, it is easy to see that the preceding does not have the same limit as P{Tn < t} ≈ 1 − exp(−t/θ) when t < θ. In other words, the bootstrap probability estimate [for the confidence values] fails to converge to the correct value for many values of t.

²To be perfectly honest, the limit should be replaced by a lim inf.
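The failure in Example 9 is easy to see numerically. The sketch below (mine; θ = 1 and the sizes n, N are arbitrary) estimates the bootstrap atom P_{F̂n}{Tn* = 0}, i.e. the chance that the resampled maximum equals the observed maximum, and compares it with 1 − (1 − 1/n)^n and 1 − e^{-1}; the limiting law Exp(θ) of Tn places no mass at 0.

```python
import numpy as np

rng = np.random.default_rng(9)
theta, n, N = 1.0, 200, 10_000

x = rng.uniform(0, theta, size=n)
x_max = x.max()

# Fraction of bootstrap resamples whose maximum coincides with the observed maximum.
resamples = rng.choice(x, size=(N, n), replace=True)
atom = np.mean(resamples.max(axis=1) == x_max)

print(atom, 1 - (1 - 1 / n) ** n, 1 - np.exp(-1.0))   # all three are close
```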
4 The Jackknife

4.1 The Jackknife, and Bias Reduction
The jackknife refers to a loosely-connected family of resampling schemes
that are, for example, used in improving the asymptotic behavior of biased
estimators. Suppose T is a statistical functional, and we are interested in
estimating θ := T (F ) with T (F̂n ). Let us think of T (F̂n ) as a function of
X1 , . . . , Xn , instead of as a function of F̂n . In other words, our estimator
for θ has the form
Tn := Tn (X1 , . . . , Xn ).
Let T(−i) := T(−i),n denote the same estimator, but applied to the data after
we excise Xi from the data. Then the jackknife estimator of the bias of Tn
is
\[
b_{\text{jackknife}} := (n-1)\left( \frac{1}{n}\sum_{i=1}^{n} T_{(-i)} - T_n \right) := (n-1)\left( \bar T_n - T_n \right).
\]
And the bias-corrected jackknife estimate for θ is
Tjackknife := Tn − bjackknife .
The choice of these jackknife estimators has to do with some heuristics that are justified in various concrete settings. Indeed, it turns out that sometimes Tn has a bias which has the asymptotic form
\[
\mathrm{bias}(T_n) := \frac{\alpha_1}{n} + \frac{\alpha_2}{n^2} + \frac{\alpha_3}{n^3} + \cdots,
\tag{5}
\]
as n → ∞, for constants α1, α2, . . . with α1 ≠ 0. Since T_{(−i)} has the same structure as Tn but with the ith sample point removed, it follows from (5) that
\[
\mathrm{bias}(T_{(-i)}) = \frac{\alpha_1}{n-1} + \frac{\alpha_2}{(n-1)^2} + \cdots.
\]
Average over the n values of i to see that
\[
\mathrm{bias}(\bar T_n) = \frac{\alpha_1}{n-1} + \frac{\alpha_2}{(n-1)^2} + \cdots.
\]
Therefore, since b_jackknife = (n − 1)(T̄n − Tn),
\[
\begin{aligned}
\mathrm{E}(b_{\text{jackknife}}) &= (n-1)\left[ \frac{\alpha_1}{n-1} + \frac{\alpha_2}{(n-1)^2} + \cdots - \left( \frac{\alpha_1}{n} + \frac{\alpha_2}{n^2} + \cdots \right) \right]\\
&= (n-1)\left[ \frac{\alpha_1}{n(n-1)} + \alpha_2\,\frac{n^2-(n-1)^2}{n^2(n-1)^2} + \cdots \right]\\
&= \frac{\alpha_1}{n} + \frac{\alpha_2(2n-1)}{n^2(n-1)} + \cdots
= \frac{\alpha_1}{n} + \frac{2\alpha_2}{n^2} + \cdots.
\end{aligned}
\]
In particular, the bias of our jackknife estimator looks like
\[
\mathrm{bias}(T_{\text{jackknife}}) = \mathrm{bias}(T_n) - \mathrm{E}(b_{\text{jackknife}}) = -\frac{\alpha_2}{n^2} + \cdots,
\]
and this quantity is much smaller than (α1/n) + ⋯ = bias(Tn) when n is large. In other words, if our estimator Tn has a bias that satisfies (5), then the jackknifed version Tjackknife of Tn has a significantly smaller bias when the sample size is large. The following is a standard example of an estimator that satisfies (5).
Example 10. Suppose θ = Var(X1) = T(F) > 0 is unknown. The estimator that we will use is
\[
T(\hat F_n) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2.
\]
Clearly, this is biased. In fact, we know from Math. 5080 that E(Tn) = (n − 1)θ/n, whence
\[
\mathrm{bias}(T_n) = \frac{(n-1)\theta}{n} - \theta = -\frac{\theta}{n}.
\]
In this case, α1 = −θ ≠ 0 and α2 = α3 = ⋯ = 0. Hence the jackknife estimator is, in fact, unbiased.
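Here is a short sketch (mine) of the bias-corrected jackknife estimate, applied to the biased variance estimator of Example 10; for this particular Tn the jackknifed estimate coincides with the usual unbiased sample variance, which gives an easy check.

```python
import numpy as np

def jackknife_bias_correct(x, T):
    """T_jackknife = T_n - b_jackknife, with b_jackknife = (n-1) * (mean of the
    leave-one-out estimates minus T_n)."""
    n = len(x)
    Tn = T(x)
    T_loo = np.array([T(np.delete(x, i)) for i in range(n)])
    b_jack = (n - 1) * (T_loo.mean() - Tn)
    return Tn - b_jack

rng = np.random.default_rng(10)
x = rng.standard_normal(50)
T = lambda s: np.mean((s - s.mean()) ** 2)    # the biased variance estimator of Example 10

print(jackknife_bias_correct(x, T), np.var(x, ddof=1))   # the two numbers agree
```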
4.2 The Jackknife Estimator of Variance
Suppose, as before, that Tn = T(F̂n) is the estimator of some θ := T(F). The ith pseudo-value is the quantity
\[
\tilde T_i := n T_n - (n-1) T_{(-i)} \qquad (1 \le i \le n).
\]
The jackknife estimator of Var(Tn) is defined as
\[
V_{\text{jackknife}} := \frac{1}{n(n-1)}\sum_{i=1}^{n} \left( \tilde T_i - \frac{1}{n}\sum_{j=1}^{n} \tilde T_j \right)^2.
\]
That is, Vjackknife is a normalized version of the sample variance of all of the pseudo-values.
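The pseudo-values and V_jackknife take only a few lines to compute. The sketch below is mine; it uses Tn = X̄ so that the output can be checked against Example 11 below, where V_jackknife reduces to S²/n for linear functionals.

```python
import numpy as np

def jackknife_variance(x, T):
    """V_jackknife computed from the pseudo-values  ~T_i := n*T_n - (n-1)*T_(-i)."""
    n = len(x)
    Tn = T(x)
    T_loo = np.array([T(np.delete(x, i)) for i in range(n)])
    pseudo = n * Tn - (n - 1) * T_loo
    return np.sum((pseudo - pseudo.mean()) ** 2) / (n * (n - 1))

rng = np.random.default_rng(11)
x = rng.standard_normal(60)
print(jackknife_variance(x, np.mean), np.var(x, ddof=1) / len(x))   # these agree when Tn = X̄
```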
Example 11. Consider the linear case where $T_n = n^{-1}\sum_{i=1}^{n} g(X_i)$. In this case, $T_{(-i)} = (n-1)^{-1}\sum_{j\ne i} g(X_j)$, and hence the ith pseudo-value is $\tilde T_i = g(X_i)$. Consequently, the jackknife estimator of the variance of Tn is
\[
V_{\text{jackknife}} = \frac{1}{n(n-1)}\sum_{i=1}^{n} \left( g(X_i) - \frac{1}{n}\sum_{j=1}^{n} g(X_j) \right)^2,
\]
which is 1/n times the usual sample variance of the g(Xi)'s.³ Thanks to the law of large numbers, Vjackknife is a good estimator of Var(Tn) = Var(g(X1))/n in the following sense:
\[
\frac{V_{\text{jackknife}}}{\mathrm{Var}(T_n)}
= \frac{(n-1)^{-1}\sum_{i=1}^{n} \left( g(X_i) - n^{-1}\sum_{j=1}^{n} g(X_j) \right)^2}{\mathrm{Var}(g(X_1))}
\xrightarrow{P} 1, \qquad\text{as } n \to \infty.
\]
³In particular, note that if Tn = X̄ [that is, g(x) = x], then Vjackknife = S²/n.
4.3 The Jackknife Can Fail

The preceding suggests [quite appropriately] that the jackknife variance estimate is typically a very good one, for instance when applied to linear statistical functionals. Striking theorems of Bradley Efron [p = 1/2] and Michael Martin [0 < p < 1] show that there are natural settings in which the jackknife can fail completely.
Theorem 12 (Efron, 1979; Martin, 1990). If T(F) = F^{-1}(p) is the pth quantile of F, then
\[
\frac{V_{\text{jackknife}}}{\mathrm{Var}(T_n)} \xrightarrow{d} \mathrm{Exp}(1),
\]
as n → ∞.
I will not prove this beautiful fact. The proof is neither particularly long, nor hard. But it rests on a series of specialized facts about the order statistics, which I will omit. Instead, let me point out that the preceding implies that
\[
\lim_{\varepsilon \downarrow 0}\;\lim_{n\to\infty} \mathrm{P}\left\{ \left| \frac{V_{\text{jackknife}}}{\mathrm{Var}(T_n)} - 1 \right| > \varepsilon \right\} = 1.
\]
That is, the jackknife estimator of the variance of the pth sample quantile Tn is always inconsistent [0 < p < 1]!⁴
⁴This is a much stronger statement than the one which says that the jackknife estimator is never consistent. I will let you ponder over the logical details.
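A quick simulation (mine; it uses p = 1/2, standard normal data, and the asymptotic variance pq/(n f(0)²) = π/(2n) from Theorem 6 as the denominator) makes the inconsistency visible: over repeated samples the ratio V_jackknife/Var(Tn) keeps fluctuating instead of settling near 1.

```python
import numpy as np

def jackknife_variance(x, T):
    n = len(x)
    Tn = T(x)
    T_loo = np.array([T(np.delete(x, i)) for i in range(n)])
    pseudo = n * Tn - (n - 1) * T_loo
    return np.sum((pseudo - pseudo.mean()) ** 2) / (n * (n - 1))

rng = np.random.default_rng(12)
n = 400
var_Tn = np.pi / (2 * n)     # asymptotic Var of the sample median for N(0,1) data

ratios = [jackknife_variance(rng.standard_normal(n), np.median) / var_Tn for _ in range(20)]
print(np.round(ratios, 2))   # the ratios stay spread out rather than concentrating near 1
```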