April 21, 2010
STAT 950
Chris Wichman

Motivation

- Every ten years the U.S. government conducts a population census, and every five years the U.S. National Agricultural Statistics Service conducts an Agriculture Census.
- Notice that, for the given “moment in time” at which the census is taken, the total population, N, is known.
- In the intervening years, the numbers from each census are used to make inferences, for example about the mean population in urban areas or farm output (average bushels/acre).

Motivation

Of interest is an intervening-year population average:

\[ \theta = \frac{1}{N} \sum_{j=1}^{N} x_j \]

Two statistics are commonly employed in these situations. Here $u$ is an auxiliary variable whose population mean $\bar{u}_N$ is known (for example, from the census), $\bar{x}_n$ and $\bar{u}_n$ are sample means, and $f = n/N$.

- The ratio estimator:

\[ t_{rat} = \bar{u}_N \, \frac{\bar{x}_n}{\bar{u}_n}, \qquad \text{with estimated variance} \qquad v_{rat} = \frac{(1-f)}{n(n-1)} \sum_{j=1}^{n} \left( x_j - \frac{t_{rat}\, u_j}{\bar{u}_N} \right)^2 \]

- The regression estimator:

\[ t_{reg} = \hat{\beta}_0 + \hat{\beta}_1 \bar{u}_N, \qquad \text{with estimated variance} \qquad v_{reg} = \frac{(1-f)}{n(n-2)} \sum_{j=1}^{n} \left( x_j - \hat{\beta}_0 - \hat{\beta}_1 u_j \right)^2 \]

Sample Average: Without-Replacement Samples

The population average is

\[ \mu = \frac{1}{N} \sum_{j=1}^{N} y_j, \]

where the unbiased estimator of $\mu$ is

\[ \bar{Y} = \frac{1}{n} \sum_{j=1}^{n} y_j. \]

When $\bar{Y}$ is based on a sample taken without replacement, the true variance of $\bar{Y}$ is:

\[ \mathrm{Var}(\bar{Y}) = (1-f)\, \frac{\sum_{j=1}^{N} (y_j - \mu)^2}{n(N-1)} = (1-f)\, \frac{\sigma^2}{n}, \qquad \text{where } f = \frac{n}{N}, \]

the unbiased estimator of which is:

\[ \widehat{\mathrm{Var}}(\bar{Y}) = (1-f)\, \frac{\sum_{j=1}^{n} (y_j - \bar{y})^2}{n(n-1)} = (1-f)\, \frac{s^2}{n} \]

The Problem with the Ordinary Bootstrap

Recall that when a resample $Y_1^*, \ldots, Y_n^*$ is taken with replacement from the original sample $y_1, \ldots, y_n$, then:

\[ \mathrm{Var}^*(\bar{Y}^*) = \mathrm{Var}(\bar{Y}^* \mid \hat{F}) = \frac{(n-1)s^2}{n^2} = \frac{\sum_{j=1}^{n} (y_j - \bar{y})^2}{n^2} \]

Note that $\mathrm{Var}^*(\bar{Y}^*)$ only matches the form of $\widehat{\mathrm{Var}}(\bar{Y})$ if the sampling fraction is $f = 1/n$. In other words, the ordinary bootstrap fails to reproduce the “contraction” in $\mathrm{Var}(\bar{Y})$ caused by the factor $(1-f)$.

Proposed Resampling Methods

- Modified Sample Size
  - With replacement
  - Without replacement
- Mirror Match
- Population
- Superpopulation

Modified Sample Size

Find a resampling size $n'$ such that $\mathrm{Var}(\bar{Y})$ is approximately matched by $\mathrm{Var}^*(\bar{Y}^*)$.

Process:
- Find the form of $\mathrm{Var}^*(\bar{Y}^*)$.
- Take the expected value of $\mathrm{Var}^*(\bar{Y}^*)$ and set it equal to $\mathrm{Var}(\bar{Y})$.
- Solve for $n'$.

Modified Sample Size: With Replacement

For with-replacement resampling with resample size $n'$, the bootstrapped variance of $\bar{Y}^*$ is:

\[ \mathrm{Var}^*(\bar{Y}^*) = \frac{1}{n'} \cdot \frac{(n-1)s^2}{n} \]

\[ E\left[ \mathrm{Var}^*(\bar{Y}^*) \right] = \frac{1}{n'} \cdot \frac{(n-1)E(s^2)}{n} = \frac{1}{n'} \cdot \frac{(n-1)\sigma^2}{n} \]

Setting this equal to $\mathrm{Var}(\bar{Y}) = (1-f)\sigma^2/n$ gives

\[ \frac{1}{n'} \cdot \frac{(n-1)\sigma^2}{n} = \frac{(1-f)\sigma^2}{n} \quad \Longrightarrow \quad n' = \frac{n-1}{1-f} \]

This leads to a modified sample size greater than $n$ whenever $f > 1/n$.

Modified Sample Size: Without Replacement

For without-replacement resampling, notice that the effective $N$ for each resample is really $n$, so the resampling fraction is $n'/n$:

\[ \mathrm{Var}^*(\bar{Y}^*) = \left( 1 - \frac{n'}{n} \right) \frac{s^2}{n'} \]

This makes the obvious choice for $n'$ the one for which $n'/n = f$, that is, $n' = nf$. That choice reproduces the factor $(1-f)$, but the resulting variance is larger than $(1-f)s^2/n$ by a factor of $1/f$.

Mirror Match

Goals:
- Capture the dependence due to sampling without replacement.
- Minimize the instability of the resampled statistic by matching the original sample size.

Process:
- Suppose $m = nf$ and $k = n/m$ are whole numbers.
- Then simply concatenate $k$ without-replacement resamples of size $m$ to form a resample of size $n' = n$.

Mirror Match

When $m$ and $k$ are not integers:
- Round $m = nf$ to the nearest whole number.
- Choose $k$ such that $km \le n < (k+1)m$.
- Randomly select either $k$ or $(k+1)$ without-replacement resamples of size $m$ from $y_1, \ldots, y_n$.
- The selection probabilities should be chosen to match $f$.

Population Bootstrap

If $k = N/n$ is an integer:
- Create a fake population $Y^*$ by repeating $y_1, \ldots, y_n$ $k$ times.
- Generate R replicate samples of size $n$ by sampling without replacement from $Y^*$.
- Each resample will have the same sampling fraction as the original sample.

Population Bootstrap

If $k = N/n$ is not an integer:
- Find $k$ and $l$ such that $N = nk + l$ and $0 < l < n$.
- Create a fake population $Y^*$ by repeating $y_1, \ldots, y_n$ $k$ times and joining it with a without-replacement sample of size $l$ from $y_1, \ldots, y_n$. This step is repeated R times.
- Generate R replicate samples of size $n$ by sampling without replacement, one from each fake population $Y^*$.
- Each resample will have the same sampling fraction as the original sample.

Superpopulation Bootstrap

For each resample $r = 1, \ldots, R$:
- Create a fake population $Y^*$ of size $N$ by resampling with replacement from $y_1, \ldots, y_n$, $N$ times.
- From each $Y_1^*, \ldots, Y_N^*$ take a without-replacement sample of size $n$.
- Each resample will have the same sampling fraction as the original sample.

Example 3.15: City Population Data
A Comparison of Confidence Intervals

In this example, the normal approximation C.I. refers to the bias-corrected interval, where $\hat{b}$ is the estimated bias:

\[ t - \hat{b} - v^{0.5} z_{(1-\alpha)}, \qquad t - \hat{b} - v^{0.5} z_{(\alpha)} \]

The remaining intervals are Studentized confidence intervals:

\[ t - v^{0.5} z^*_{((R+1)(1-\alpha))}, \qquad t - v^{0.5} z^*_{((R+1)\alpha)} \]

Example 3.15: City Population Data
Table 3.7

Recreated in R:

                           Ratio                Regression
Resampling Scheme          Lower     Upper      Lower     Upper
Normal                     132.65    175.18     128.48    161.09
Modified Size, n' = 2      46.55     298.93     NA        NA
Modified Size, n' = 11     109.14    209.42     111.31    283.13
Mirror Match, m = 2        118.42    174.79     117.06    245.09
Population                 116.72    199.18     113.56    267.37
Superpopulation            107.7     204.17     110.43    300.64

From BMA:

                           Ratio                Regression
Resampling Scheme          Lower     Upper      Lower     Upper
Normal                     137.8     174.7      123.7     152
Modified Size, n' = 2      58.9      298.6      NA        NA
Modified Size, n' = 11     111.9     196.2      114       258.2
Mirror Match, m = 2        115.6     196        112.8     258.7
Population                 118.9     193.3      116.1     240.7
Superpopulation            120.3     195.9      114       255.4

Example 3.15: City Population Data
Table 3.8

Recreated in R:

                           Coverage                     Length
Resampling Scheme          Lower    Upper    Overall    Average    SD
Normal                     6        88       82         22.35      7.62
Modified Size, n' = 2      0        96       96         164.48     143.2
Modified Size, n' = 11     1        94       93         38.09      20.97
Mirror Match, m = 2        1        86       85         26.77      14.87
Population                 1        91       90         34.75      19.61
Superpopulation            0        94       94         39.08      21.29

From BMA pg 96:

                           Coverage                     Length
Resampling Scheme          Lower    Upper    Overall    Average    SD
Normal                     7        89       82         23         8.2
Modified Size, n' = 2      1        98       98         151        142
Modified Size, n' = 11     2        91       89         34         19
Mirror Match, m = 2        3        91       88         33         19
Population                 2        91       89         36         21
Superpopulation            1        92       91         41         24

Example 3.15: City Population Data
Figure 3.6: How Well Does the Normal Approximation Fit the Distribution of $t_{reg}$ and $t_{rat}$?

Conclusions About $t_{rat}$ and $t_{reg}$

- The normal approximation for the ratio and regression estimators performs poorly.
- The estimated expected length of confidence intervals based on the normal approximation is very short relative to that of the other resampling methods.
- The estimated variance of the regression estimator is unstable, potentially causing huge swings in $z^*$, which ultimately affect the bounds of the Studentized confidence intervals.

Stratified Sampling

- Suppose the population of interest is divided into $k$ strata; then the population total is $N = N_1 + \cdots + N_k$.
- Each stratum now has its own sampling fraction, $f_i = n_i / N_i$.
- Each stratum represents the proportion $w_i = N_i / N$, $i = 1, \ldots, k$, of the population.

$t_{rat}$ for a Stratified Sample

Of interest is the overall mean:

\[ \theta = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{N_i} x_{ij} \]

The ratio estimator for a stratified population becomes:

\[ t_{rat} = \sum_{i=1}^{k} w_i \, \bar{u}_{N_i} \, \frac{\bar{x}_{i\cdot}}{\bar{u}_{i\cdot}}, \qquad \text{with estimated variance} \qquad v_{rat} = \sum_{i=1}^{k} \frac{w_i^2 (1-f_i)}{n_i (n_i - 1)} \sum_{j=1}^{n_i} \left( x_{ij} - t_i u_{ij} \right)^2, \]

where $t_i = \bar{x}_{i\cdot} / \bar{u}_{i\cdot}$ is the sample ratio in stratum $i$.

Example 3.17: Stratified Ratio

- Here, Davison and Hinkley drop the regression estimator, because the potential instability of its variance affects the bootstrapped confidence intervals.
- They also drop the Modified Sample Size scheme with $n' = nf$, because they felt it was a “less promising” finite population resampling scheme.

Example 3.17: Methodology

- Simulate $N$ pairs $(u, x)$ divided into $k$ strata of sizes $N_1, \ldots, N_k$:
  - “small-k”: k = 3, Ni = 18, ni = 6
  - “small-k”: k = 5, Ni = 72, ni = 24
  - “large-k”: k = 20, Ni = 18, ni = 6
- 1000 different samples of size $n = n_1 + \cdots + n_k$ were taken from the dataset(s) produced above.
- For each sample, R = 199 resamples were used to compute confidence intervals for $\theta$.

Example 3.17: Methodology

- All methods were applied to the sample as described in Example 3.15, with the exception of superpopulation resampling, which was conducted separately for each stratum.
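As a rough illustration of the stratified ratio estimator used in Example 3.17, the following Python sketch computes $t_{rat}$ and its estimated variance from per-stratum samples. The function name `stratified_ratio` and the dictionary keys (`u`, `x`, `N`, `u_bar_N`) are hypothetical choices made for this example and do not come from Davison and Hinkley. The sketch assumes that the residual for each unit is $x_{ij}$ minus the within-stratum sample ratio times $u_{ij}$, which is one reading of the variance formula.

```python
from statistics import mean


def stratified_ratio(strata):
    """Return (t_rat, v_rat) for a stratified sample.

    Each element of `strata` is a dict (hypothetical layout) with:
      "u", "x"   : sampled values in the stratum (equal-length lists)
      "N"        : stratum population size N_i
      "u_bar_N"  : known stratum population mean of u
    """
    n_total_pop = sum(s["N"] for s in strata)
    t_rat = 0.0
    v_rat = 0.0
    for s in strata:
        u, x = s["u"], s["x"]
        n_i = len(x)
        w_i = s["N"] / n_total_pop          # stratum weight N_i / N
        f_i = n_i / s["N"]                  # stratum sampling fraction
        ratio_i = mean(x) / mean(u)         # within-stratum sample ratio
        t_rat += w_i * s["u_bar_N"] * ratio_i
        # Sum of squared residuals about the fitted ratio line.
        rss = sum((xj - ratio_i * uj) ** 2 for uj, xj in zip(u, x))
        v_rat += w_i ** 2 * (1 - f_i) * rss / (n_i * (n_i - 1))
    return t_rat, v_rat
```

Calling `stratified_ratio` with one dictionary per stratum returns the point estimate and the variance estimate that a Studentized interval would be built from; each stratum needs at least two sampled units for the variance term to be defined.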
BMA Table 3.9

L = lower, U = upper, O = overall coverage (%).

                         k = 20, Ni = 18      k = 5, Ni = 72       k = 3, Ni = 18
Resampling Scheme        L     U     O        L     U     O        L     U     O
Normal                   5     93    88       4     94    90       7     93    86
Modified Sample Size     6     94    89       4     94    90       6     96    90
Mirror-match             9     92    83       8     90    82       6     94    88
Population               6     95    89       5     95    90       6     95    89
Superpopulation          3     97    95       2     98    96       3     98    96

Conclusions: Stratified Sample

- The estimated coverages for the Normal, Modified Sample Size, and Population resampling methods are all close to the nominal 90% desired.
- The “tail” probabilities are each roughly 5%.
- Neither the Mirror-match (estimated coverage of about 83%) nor the Superpopulation (estimated coverage of about 95%) method performed very well.
- Taking their ease of calculation into account, Davison and Hinkley conclude that the Population and Modified Sample Size methods perform the best.
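The population bootstrap favoured in these conclusions can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used to produce the tables: the function names `fake_population` and `population_bootstrap`, the `seed` parameter, and the use of the standard-library `random` module are all choices made for this illustration. It follows the two cases described earlier: when $N/n$ is an integer the sample is simply repeated, and otherwise a without-replacement top-up of size $l$ is drawn afresh for every replicate.

```python
import random


def fake_population(sample, N, rng):
    """Build a fake population of size N from the sample.

    The sample is repeated k times and topped up with a
    without-replacement draw of size l, where N = n*k + l.
    """
    n = len(sample)
    if N < n:
        raise ValueError("population size N must be at least the sample size n")
    k, l = divmod(N, n)
    return list(sample) * k + rng.sample(list(sample), l)


def population_bootstrap(sample, N, R, statistic, seed=0):
    """Return R replicates of `statistic` under population resampling."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(R):
        # Rebuild the fake population each time so the random top-up varies.
        population = fake_population(sample, N, rng)
        # Sampling without replacement keeps the sampling fraction at n/N.
        resample = rng.sample(population, n)
        replicates.append(statistic(resample))
    return replicates
```

For a sample mean, for instance, `population_bootstrap(y, N, 199, lambda s: sum(s) / len(s))` would return 199 replicate means whose spread reflects the finite-population factor $(1-f)$, because every resample is drawn without replacement at the original sampling fraction.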