Presentation_Apr_18

April 21, 2010
STAT 950
Chris Wichman
Motivation
• Every ten years, the U.S. government conducts a
population census, and every five years the U.S.
National Agricultural Statistics Service conducts an
Agriculture Census.
• Notice that at the “moment in time” a census is
taken, the total population, N, is known. In the
intervening years, the numbers from each census are
used to make inferences, for example about the mean
population in urban areas or farm output (average
bushels/acre).
Motivation
• Of interest is an intervening-year population average:
$$\theta = N^{-1} \sum_{j=1}^{N} x_j$$
• Two statistics commonly employed in these situations:
  ◦ The ratio estimator:
$$t_{rat} = \bar{u}_N \frac{\bar{x}_n}{\bar{u}_n}, \quad \text{with estimated variance} \quad v_{rat} = \frac{(1-f)}{n(n-1)} \sum_{j=1}^{n} \left( x_j - \frac{u_j \, t_{rat}}{\bar{u}_N} \right)^2$$
  ◦ The regression estimator:
$$t_{reg} = \hat{\beta}_0 + \hat{\beta}_1 \bar{u}_N, \quad \text{with estimated variance} \quad v_{reg} = \frac{(1-f)}{n(n-2)} \sum_{j=1}^{n} \left( x_j - \hat{\beta}_0 - \hat{\beta}_1 u_j \right)^2$$
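A minimal Python sketch of the two estimators may help make the formulas concrete. The function names and the argument `u_bar_N` (the known population mean of u) are illustrative choices, not names from the slides, and the coefficients are assumed to be ordinary least squares estimates from regressing x on u.

```python
def ratio_estimate(u, x, u_bar_N, N):
    """Return (t_rat, v_rat) for paired sample values u, x."""
    n = len(x)
    f = n / N                                  # sampling fraction
    u_bar = sum(u) / n
    x_bar = sum(x) / n
    t_rat = u_bar_N * x_bar / u_bar
    # Residuals x_j - u_j * t_rat / u_bar_N, squared and summed.
    ss = sum((xj - uj * t_rat / u_bar_N) ** 2 for uj, xj in zip(u, x))
    v_rat = (1 - f) / (n * (n - 1)) * ss
    return t_rat, v_rat


def regression_estimate(u, x, u_bar_N, N):
    """Return (t_reg, v_reg); assumes least squares fit of x on u."""
    n = len(x)
    f = n / N
    u_bar = sum(u) / n
    x_bar = sum(x) / n
    s_uu = sum((uj - u_bar) ** 2 for uj in u)
    s_ux = sum((uj - u_bar) * (xj - x_bar) for uj, xj in zip(u, x))
    b1 = s_ux / s_uu                           # slope estimate
    b0 = x_bar - b1 * u_bar                    # intercept estimate
    t_reg = b0 + b1 * u_bar_N
    ss = sum((xj - b0 - b1 * uj) ** 2 for uj, xj in zip(u, x))
    v_reg = (1 - f) / (n * (n - 2)) * ss
    return t_reg, v_reg
```

When x is exactly proportional to u, both estimators return the same value and both variance estimates are zero, which is a convenient sanity check.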
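Sample Average
Without Replacement Samples
• Population average $\mu = \sum_{j=1}^{N} y_j / N$, where the unbiased
estimator of $\mu$ is $\bar{Y} = \sum_{j=1}^{n} y_j / n$.
• When $\bar{Y}$ is based on a sample taken without
replacement, the true variance of $\bar{Y}$ is:
$$\mathrm{Var}(\bar{Y}) = (1-f) \frac{\sum_{j=1}^{N} (y_j - \mu)^2}{n(N-1)} = (1-f) \frac{\gamma^2}{n}; \quad \text{where } f = \frac{n}{N}$$
(here $\gamma^2$ is the population variance with divisor $N-1$),
the unbiased estimator of which is:
$$\widehat{\mathrm{Var}}(\bar{Y}) = (1-f) \frac{\sum_{j=1}^{n} (y_j - \bar{y})^2}{n(n-1)} = (1-f) \frac{s^2}{n}$$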
The Problem with the Ordinary Bootstrap
• Recall, when a resample $Y_1^*, \ldots, Y_n^*$ is taken with
replacement from the original sample $y_1, \ldots, y_n$, then:
$$\mathrm{Var}^*(\bar{Y}^* \mid \hat{F}) = \mathrm{Var}^*(\bar{Y}^*) = \frac{(n-1)s^2}{n^2} = \frac{\sum_{j=1}^{n} (y_j - \bar{y})^2}{n^2}$$
• Note that $\mathrm{Var}^*(\bar{Y}^*) = (1 - 1/n)\, s^2/n$ only matches the form
$(1-f)\, s^2/n$ of the estimated $\mathrm{Var}(\bar{Y})$ if the sampling fraction
$f = 1/n$.
• In other words, the ordinary bootstrap fails to realize
the “contraction” in $\mathrm{Var}(\bar{Y})$.
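The mismatch can be checked numerically. The sketch below compares the ordinary bootstrap variance of the mean with the finite-population estimate; the function names are illustrative, and `statistics.variance` supplies $s^2$ with divisor $n-1$.

```python
from statistics import variance


def ordinary_bootstrap_variance(y):
    """Variance of the with-replacement resample mean: (n - 1) s^2 / n^2."""
    n = len(y)
    return (n - 1) * variance(y) / n ** 2


def finite_population_variance(y, N):
    """Estimated variance of the sample mean: (1 - f) s^2 / n with f = n / N."""
    n = len(y)
    f = n / N
    return (1 - f) * variance(y) / n


y = [1, 2, 3, 4, 5]
boot = ordinary_bootstrap_variance(y)
matched = finite_population_variance(y, N=25)    # f = 1/n, so the two agree
contracted = finite_population_variance(y, N=10) # f = 1/2, smaller than boot
```

With N = 25 the sampling fraction equals 1/n and the two quantities coincide; with N = 10 the finite-population estimate is smaller, which is the contraction the ordinary bootstrap misses.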
Proposed Resampling Methods
• Modified Sample Size
  ◦ With replacement
  ◦ Without replacement
• Mirror Match
• Population
• Superpopulation
Modified Sample Size
• Find a resampling size $n'$ such that $\mathrm{Var}(\bar{Y})$ is
approximately matched by $\mathrm{Var}^*(\bar{Y}^*)$.
• Process:
  ◦ Find the form of $\mathrm{Var}^*(\bar{Y}^*)$
  ◦ Take the expected value of $\mathrm{Var}^*(\bar{Y}^*)$ and set it equal to $\mathrm{Var}(\bar{Y})$
  ◦ Solve for $n'$
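Modified Sample Size
With-Replacement
• For with-replacement resampling, the bootstrapped
variance of $\bar{Y}^*$ is:
$$\mathrm{Var}^*(\bar{Y}^*) = \left( \frac{1}{n'} \right) \frac{(n-1)s^2}{n}$$
• Taking expectations, with $\gamma^2$ the population variance (divisor $N-1$):
$$E\left[ \mathrm{Var}^*(\bar{Y}^*) \right] = \left( \frac{1}{n'} \right) \frac{(n-1) E[s^2]}{n} = \left( \frac{1}{n'} \right) \frac{(n-1)\gamma^2}{n}$$
• Setting this equal to $\mathrm{Var}(\bar{Y}) = (1-f)\gamma^2/n$ gives:
$$\left( \frac{1}{n'} \right) \frac{(n-1)}{n} = \frac{(1-f)}{n} \quad \Rightarrow \quad n' = \frac{(n-1)}{(1-f)}$$
• This leads to a modified sample size larger than $n$ whenever $nf > 1$.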
Modified Sample Size
Without-replacement
• For without-replacement resampling, notice that the
effective N for each resample is really n.
• The bootstrapped variance is
$$\mathrm{Var}^*(\bar{Y}^*) = (1 - f') \frac{s^2}{n'}, \quad \text{where } f' = \frac{n'}{n},$$
making the obvious choice for $n'$ one in which
$f' = f \Rightarrow n' = nf$, so that each resample has the same
sampling fraction as the original sample.
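As a small worked example of the two modified sizes, the sketch below evaluates $n' = (n-1)/(1-f)$ and $n' = nf$. The sizes n = 10 and N = 49 are an assumption (they are not stated on these slides); they are chosen because they reproduce the n' = 11 and n' = 2 that appear in Table 3.7.

```python
def modified_size_with_replacement(n, N):
    """n' = (n - 1) / (1 - f) for with-replacement resampling."""
    f = n / N
    return (n - 1) / (1 - f)


def modified_size_without_replacement(n, N):
    """n' = n f for without-replacement resampling."""
    return n * n / N


# Assumed sizes, not given in the slides: 10 sampled units out of 49.
n, N = 10, 49
n_with = round(modified_size_with_replacement(n, N))        # 11.31 rounds to 11
n_without = round(modified_size_without_replacement(n, N))  # 2.04 rounds to 2
print(n_with, n_without)
```

Rounding to the nearest whole number is a choice made here for illustration, since a resample size must be an integer.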
Mirror Match
• Goals:
  ◦ Capture the dependence due to sampling without replacement
  ◦ Minimize the instability of the resampled statistic by
matching the original sample size
• Process:
  ◦ Suppose $m = nf$ and $k = n/m$ are whole numbers
  ◦ Then simply concatenate k without-replacement resamples of
size m to form a resample of size $n' = n$
Mirror Match
• When m and k are not integers:
  ◦ Round $m = nf$ to the nearest whole number
  ◦ Choose k such that $km \le n < (k+1)m$
  ◦ Randomly select either k or (k+1) without-replacement
resamples of size m from $y_1, \ldots, y_n$.
  ◦ Sampling probabilities should be chosen to match f
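The mirror-match steps can be sketched in Python as follows. The slides do not say how the probability of choosing k rather than k+1 blocks is computed, so it is left here as a hypothetical parameter `p_k` with a placeholder default; the function name is also illustrative.

```python
import random


def mirror_match_resample(y, N, rng, p_k=0.5):
    """Concatenate k (or k + 1) without-replacement resamples of size m.

    p_k is a hypothetical parameter: the probability of using k blocks
    when k * m != n. Its value should be chosen to match f.
    """
    n = len(y)
    m = max(1, round(n * n / N))   # m = n f, rounded to a whole number
    k = n // m                     # largest k with k * m <= n
    if k * m != n and rng.random() >= p_k:
        k += 1                     # use k + 1 blocks with probability 1 - p_k
    resample = []
    for _ in range(k):
        resample.extend(rng.sample(y, m))  # one without-replacement block
    return resample


rng = random.Random(0)
y = list(range(10))
resample = mirror_match_resample(y, N=50, rng=rng)  # m = 2, k = 5, size 10
```

Values can repeat across blocks but never within a block, which is how the scheme retains the without-replacement dependence while keeping the resample size at (or near) n.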
Population Bootstrap
• If $k = N/n$ is an integer:
  ◦ Create a fake population Y* by repeating $y_1, \ldots, y_n$ k
times.
  ◦ Generate R replicate samples of size n by sampling
without replacement from Y*.
  ◦ Each resample will have the same sampling fraction as
the original sample.
Population Bootstrap
• If $N/n$ is not an integer:
  ◦ Find integers k and l such that N = nk + l, and $0 < l < n$.
  ◦ Create a fake population Y* by repeating $y_1, \ldots, y_n$ k
times and joining it with a without-replacement sample
of size l from $y_1, \ldots, y_n$. This step is repeated for each
of the R resamples.
  ◦ Generate the R replicate samples of size n by sampling
without replacement, one from each Y*.
  ◦ Each resample will have the same sampling fraction as
the original sample.
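The population bootstrap can be sketched as below, covering both the integer and non-integer cases through `divmod`. The function names are illustrative and not from the slides.

```python
import random


def build_fake_population(y, N, rng):
    """Repeat the sample k times, then add l values drawn without replacement."""
    n = len(y)
    k, l = divmod(N, n)                  # N = n k + l with 0 <= l < n
    return list(y) * k + rng.sample(y, l)


def population_bootstrap_resample(y, N, rng):
    """Draw one resample of size n without replacement from a fresh fake population."""
    fake_population = build_fake_population(y, N, rng)
    return rng.sample(fake_population, len(y))


rng = random.Random(1)
y = list(range(10))
fake = build_fake_population(y, N=49, rng=rng)          # k = 4, l = 9
resample = population_bootstrap_resample(y, N=49, rng=rng)
```

Calling `population_bootstrap_resample` R times rebuilds the fake population for every resample, as the non-integer case requires; when l = 0 the extra draw is empty and the fake population is the same each time.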
Superpopulation Bootstrap
• For each resample, 1, ..., R:
  ◦ Create a fake population, Y*, of size N by resampling
with replacement from $y_1, \ldots, y_n$, N times.
  ◦ From each fake population $Y_1^*, \ldots, Y_N^*$ take a
without-replacement sample of size n.
  ◦ Each resample will have the same sampling fraction as
the original sample.
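A minimal sketch of one superpopulation resample, assuming the sample is held in a Python list; the function name is illustrative.

```python
import random


def superpopulation_resample(y, N, rng):
    """Build a size-N fake population with replacement, then sample n without."""
    fake_population = rng.choices(y, k=N)        # N draws with replacement
    return rng.sample(fake_population, len(y))   # n draws without replacement


rng = random.Random(2)
y = [3.1, 4.7, 5.0, 6.2, 8.9]
resamples = [superpopulation_resample(y, N=20, rng=rng) for _ in range(199)]
```

Unlike the population bootstrap, the fake population here is random in its composition, so a sample value can be over- or under-represented by more than one copy.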
Example 3.15: City Population Data
A Comparison of Confidence Intervals
• In this example, the normal approximation C.I. refers
to the bias-corrected interval:
$$t - \widehat{bs} - v^{0.5} z_{(1-\alpha)} \le \theta \le t - \widehat{bs} - v^{0.5} z_{(\alpha)}$$
• The remaining intervals are Studentized confidence
intervals:
$$t - v^{0.5} z^*_{((R+1)(1-\alpha))} \le \theta \le t - v^{0.5} z^*_{((R+1)\alpha)}$$
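The Studentized limits can be computed from the ordered bootstrap values of z* as sketched below. It is assumed here that each z* is the usual Studentized quantity $(t^* - t)/v^{*0.5}$ from one resample; the function name and the error handling are illustrative.

```python
def studentized_interval(t, v, z_star, alpha=0.05):
    """Return (lower, upper) using order statistics of the bootstrap z* values."""
    R = len(z_star)
    z_sorted = sorted(z_star)
    lo_rank = round((R + 1) * alpha)          # 1-based rank of the lower quantile
    hi_rank = round((R + 1) * (1 - alpha))    # 1-based rank of the upper quantile
    if lo_rank < 1 or hi_rank > R:
        raise ValueError("R is too small for this alpha")
    se = v ** 0.5
    lower = t - se * z_sorted[hi_rank - 1]    # upper z* quantile gives the lower limit
    upper = t - se * z_sorted[lo_rank - 1]
    return lower, upper
```

With R = 199 and alpha = 0.05 the ranks are 10 and 190, which is why R is typically chosen so that (R + 1) alpha is a whole number.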
Example 3.15: City Population Data
Table 3.7

Resampling Scheme               Ratio                Regression
Normal                     132.65   175.18        128.48   161.09
Modified Size, n' = 2       46.55   298.93            NA       NA
Modified Size, n' = 11     109.14   209.42        111.31   283.13
Mirror Match, m = 2        118.42   174.79        117.06   245.09
Population                 116.72   199.18        113.56   267.37
Superpopulation             107.7   204.17        110.43   300.64

Resampling Scheme               Ratio                Regression
Normal                      137.8    174.7         123.7      152
Modified Size, n' = 2        58.9    298.6            NA       NA
Modified Size, n' = 11      111.9    196.2           114    258.2
Mirror Match, m = 2         115.6      196         112.8    258.7
Population                  118.9    193.3         116.1    240.7
Superpopulation             120.3    195.9           114    255.4
Example 3.15: City Population Data
Table 3.8

Recreated in R
                                  Coverage               Length
                           Lower   Upper   Overall   Average      SD
Normal                         6      88        82     22.35    7.62
Modified Size, n' = 2          0      96        96    164.48   143.2
Modified Size, n' = 11         1      94        93     38.09   20.97
Mirror Match, m = 2            1      86        85     26.77   14.87
Population                     1      91        90     34.75   19.61
Superpopulation                0      94        94     39.08   21.29

From BMA pg 96
                                  Coverage               Length
                           Lower   Upper   Overall   Average      SD
Normal                         7      89        82        23     8.2
Modified Size, n' = 2          1      98        98       151     142
Modified Size, n' = 11         2      91        89        34      19
Mirror Match, m = 2            3      91        88        33      19
Population                     2      91        89        36      21
Superpopulation                1      92        91        41      24
Example 3.15: City Population Data
Figure 3.6
How Well does the Normal Approximation
fit the Distribution of treg and trat?
Conclusions About trat and treg
• The normal approximation for the ratio and regression
estimators performs poorly.
• The estimated expected lengths of confidence intervals
based on the normal approximation are very short
relative to those of the resampling methods.
• The estimated variance of the regression estimator is
unstable, potentially causing huge swings in z* and
ultimately affecting the bounds of Studentized
confidence intervals.
Stratified Sampling
• Suppose the population of interest is divided into k
strata; then the population total is $N = N_1 + \cdots + N_k$.
• Each stratum now has its own sampling fraction, $f_i = n_i / N_i$.
• Each stratum represents a proportion $w_i = N_i / N$; $i = 1, \ldots, k$ of
the population.
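trat for a Stratified Sample
• Of interest is the overall mean:
$$\theta = N^{-1} \sum_{i=1}^{k} \sum_{j=1}^{N_i} x_{ij}$$
• The ratio estimator for a stratified population
becomes:
$$t_{rat} = \sum_{i=1}^{k} w_i \, \bar{u}_{N_i} \frac{\bar{x}_{i\cdot}}{\bar{u}_{i\cdot}}, \quad \text{with estimated variance} \quad v_{rat} = \sum_{i=1}^{k} \frac{w_i^2 (1 - f_i)}{n_i (n_i - 1)} \sum_{j=1}^{n_i} \left( x_{ij} - t_i u_{ij} \right)^2$$
(here $t_i$ is read as the within-stratum ratio $\bar{x}_{i\cdot} / \bar{u}_{i\cdot}$).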
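A minimal sketch of the stratified ratio estimator follows. It assumes that $t_i$ is the within-stratum ratio of sample means and that each stratum is supplied as a tuple `(u, x, N_i, u_bar_Ni)`; that layout and the function name are illustrative choices, not from the slides.

```python
def stratified_ratio_estimate(strata):
    """Return (t_rat, v_rat) for strata given as (u, x, N_i, u_bar_Ni) tuples."""
    N = sum(N_i for _, _, N_i, _ in strata)   # population total
    t_rat = 0.0
    v_rat = 0.0
    for u, x, N_i, u_bar_Ni in strata:
        n_i = len(x)
        w_i = N_i / N                         # stratum weight
        f_i = n_i / N_i                       # stratum sampling fraction
        t_i = sum(x) / sum(u)                 # assumed: ratio of stratum sample means
        t_rat += w_i * u_bar_Ni * t_i
        ss = sum((xj - t_i * uj) ** 2 for uj, xj in zip(u, x))
        v_rat += w_i ** 2 * (1 - f_i) / (n_i * (n_i - 1)) * ss
    return t_rat, v_rat


strata = [
    ([1, 2, 3], [2, 4, 6], 10, 3.0),   # x = 2u in this stratum
    ([1, 3], [3, 9], 30, 2.0),         # x = 3u in this stratum
]
t_rat, v_rat = stratified_ratio_estimate(strata)
```

Because x is exactly proportional to u within each stratum of this toy input, every residual is zero and the variance estimate vanishes.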
Example 3.17: Stratified Ratio
• Here, Davison and Hinkley drop the regression
estimator, due to the potential instability of the
variance affecting the bootstrapped confidence
intervals.
• They also drop the Modified Sample Size scheme with $n' = nf$
because they felt it was a “less promising” finite population
resampling scheme.
Example 3.17: Methodology
• Simulate N pairs (u, x) divided into k strata of sizes
$N_1, \ldots, N_k$:
  ◦ “small-k”: k = 3, $N_i$ = 18, $n_i$ = 6
  ◦ “small-k”: k = 5, $N_i$ = 72, $n_i$ = 24
  ◦ “large-k”: k = 20, $N_i$ = 18, $n_i$ = 6
• 1000 different samples of size $n = n_1 + \cdots + n_k$ were taken
from the dataset(s) produced above. For each sample,
R = 199 resamples were used to compute confidence
intervals for θ.
Example 3.17: Methodology
• All methods were applied to the sample as described in
Example 3.15, with the exception of superpopulation
resampling, which was conducted within each stratum.
BMA Table 3.9
(L = lower, U = upper, O = overall)

                        k=20, N_i=18     k=5, N_i=72     k=3, N_i=18
                          L   U   O        L   U   O       L   U   O
Normal                    5  93  88        4  94  90       7  93  86
Modified Sample Size      6  94  89        4  94  90       6  96  90
Mirror-match              9  92  83        8  90  82       6  94  88
Population                6  95  89        5  95  90       6  95  89
Superpopulation           3  97  95        2  98  96       3  98  96
Conclusions: Stratified Sample
• The estimated coverages for the Normal, Modified Sample Size,
and Population resampling methods are all close to the
nominal 90% desired. The “tail” probabilities are each
roughly 5%.
• Neither the Mirror-match (estimated coverage of 83%) nor
the Superpopulation (estimated coverage of 95%) method
performed very well.
• Taking ease of calculation into account, Davison and Hinkley
conclude that the Population and Modified Sample Size
methods perform the best.