Double Sampling (Chapter 14)

To this point, we have considered a number of sampling and estimation methods (e.g., ratio and regression estimation, stratified sampling) in which auxiliary information was used to estimate a population mean or total. In all of these cases, it was assumed that we knew population information on the auxiliary variable. Double sampling is a sampling method which makes use of auxiliary data where the auxiliary information is itself obtained through sampling. More precisely, we first take a sample of units strictly to obtain auxiliary information, and then take a second sample on which the variable(s) of interest are observed. Often this second sample is a subsample of the preliminary sample used to acquire the auxiliary information.

Two common situations where double sampling is employed to use auxiliary information to improve the estimate of some response variable are outlined below.

1. If the variable of interest is “expensive” to measure, but a related variable is much “cheaper,” we might first sample many of the sampling units and measure the “cheaper” (auxiliary) variable, and only measure the response variable on a subsample (or smaller sample).

   • A common example of this scenario is any case where a visual estimate of some response variable can be made much more quickly (cheaply) than measuring the response outright. For example, if we want to estimate the number of leaves in some area, it could be quite time-consuming to count the leaves in a number of 18 × 18 inch quadrats, whereas a visual estimate of the number of leaves in such a quadrat is relatively simple to make. If a definite relationship between the visual and actual number of leaves in a quadrat can be established, we could make efficient use of the visual estimates (even if they are highly biased) to improve an estimate of the total number of leaves through double sampling.

2. A common problem in many surveys (as discussed in this class) is that of potential bias from nonresponse. Double sampling can be used with stratification principles to adjust for nonresponse in surveys by taking a second sample of the nonrespondents.

To illustrate the use of double sampling, consider the following example.

Example: Suppose it is desired to estimate the total biomass of vegetation and the average biomass (gm/m²) on a 1000 m² area. A systematic sample of twenty 1 m² plots is selected and a visual estimate of the total grams of biomass is recorded. Five of the twenty plots are then randomly selected and the total biomass is carefully determined on these 5 plots. The visual and actual measurements of total biomass are given below.

    Visual Only (15 plots):
      60  60  200  0  100  80  20  150  100  80  60  60  20  150  60

    Visual and Actual (5 plots):
      Visual:  20  80  150  40  80
      Actual:  14  62  155  36  71

Ratio Estimation in Double Sampling:

Let $n'$ = the total number of units observed (the first sample, on which x is observed), and $n$ = the number of units in the second sample, where both x and y are observed. In the biomass example, $n' = 20$ and $n = 5$. Suppose we want to estimate $\tau = \sum_{i=1}^{N} y_i$ (total biomass in the 1000 m² area).

• First, compute the sample ratio for the second (smaller) sample:
$$r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}.$$

• Recall that with ratio estimation, we assumed we knew the population total for x, namely $\tau_x$, and we then estimated the total for y by $\hat\tau_r = r\,\tau_x$. But here, we need to estimate $\tau_x$ (since we only have a sample of the x-values). How? Expand the total of the x-values in the full first sample:
$$\hat\tau_x = \frac{N}{n'}\sum_{i=1}^{n'} x_i.$$

• With this estimate for $\tau_x$, a ratio estimate of $\tau$ is given by $\hat\tau_r = r\cdot\hat\tau_x$, where via the Delta Method, the variance of $\hat\tau_r$ is:
$$\mathrm{Var}(\hat\tau_r) = N(N-n')\,\frac{\sigma^2}{n'} + N^2\left(\frac{n'-n}{n'}\right)\frac{\sigma_r^2}{n},$$
where:
$$\sigma_r^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - R x_i)^2, \qquad \sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \mu)^2,$$
and $R = \tau/\tau_x$ is the population ratio.

• With ratio estimation as before, we had $n' = N$, so that the first term of the variance expression above was just zero. And then:
$$\mathrm{Var}(\hat\tau_r) = N^2\left(\frac{N-n}{N}\right)\frac{\sigma_r^2}{n},$$
as before.
• The estimated variance is given by:
$$\widehat{\mathrm{Var}}(\hat\tau_r) = N(N-n')\,\frac{s^2}{n'} + N^2\left(\frac{n'-n}{n'}\right)\frac{s_r^2}{n},$$
where:
$$s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r x_i)^2, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar y)^2 \ \text{(usual sample variance)}.$$

• The value of $s_r^2$ will be small (as before) if the relationship between the visual and actual estimates is linear and goes through the origin.

• The ratio estimate of the mean response and corresponding standard error are given by:
$$\hat\mu_r = \frac{\hat\tau_r}{N}, \qquad \mathrm{SE}(\hat\mu_r) = \frac{\mathrm{SE}(\hat\tau_r)}{N}.$$

Back to the Biomass Example: To estimate the total biomass in the 1000 m² area or the mean biomass, the following R code was used.

    > x <- c(20,80,150,40,80,60,60,200,0,100,80,20,150,100,80,60,60,20,150,60)
    > y <- c(14,62,155,36,71)
    > N <- 1000       # Population size
    > np <- 20        # Initial sample size: n'
    > n <- 5          # Subsample size
    > x1 <- x[1:5]    # X-values for the subsample
    > r <- sum(y)/sum(x1)            # r = sum(y_i)/sum(x_i) over the subsample
    > r                              # Estimate of actual / visual
    [1] 0.9135135
    > tau.hat.x <- (N/np)*sum(x)     # tau.hat.x = (N/n') * sum of all n' x-values
    > tau.hat.x                      # Estimate of tau.x for all 1000 plots
    [1] 78500
    > tau.hat.r <- r*tau.hat.x       # tau.hat.r = r * tau.hat.x
    > tau.hat.r                      # Estimate of total biomass
    [1] 71710.81
    > sr2 <- (1/(n-1))*sum((y-r*x1)^2)   # s_r^2
    > var.tau.hat.r <- (N*(N-np)*var(y))/np + N^2*((np-n)/np)*sr2/n   # Estimated Var(tau.hat.r)
    > sqrt(var.tau.hat.r)            # SE of tau.hat.r
    [1] 12613.57
    > mu.hat.r <- tau.hat.r/N        # mu.hat.r = tau.hat.r / N
    > mu.hat.r                       # Estimate of biomass per plot (gm/m sq)
    [1] 71.71081
    > sqrt(var.tau.hat.r)/N          # SE of mu.hat.r
    [1] 12.61357

To investigate how much improvement double sampling provides, consider obtaining estimates of total biomass and biomass per plot based just on the y-values (i.e., an SRS of size n = 5).
    # Unbiased Estimates
    # ==================
    > N*mean(y)               # tau.hat = N*ybar: estimate of total biomass (SRS)
    [1] 67600
    > mean(y)                 # mu.hat = ybar: estimate of mean biomass (SRS)
    [1] 67.6
    > N*sqrt((1/n)*var(y))    # SE(tau.hat) = N*sqrt(s^2/n), ignoring the fpc
    [1] 24034.56
    > sqrt((1/n)*var(y))      # SE(mu.hat) = sqrt(s^2/n), ignoring the fpc
    [1] 24.03456

The double-sampling ratio estimate of the total has a standard error of 12613.57, roughly half the SRS standard error of 24034.56.

Allocation in Double Sampling for Ratio Estimation

It was mentioned at the outset that one major reason for using double sampling is that auxiliary information may be cheaper to measure than the main variable of interest. Hence, the cost of sampling at the first stage (auxiliary information only) as compared to the second stage (both auxiliary and response information) is very important in deciding how to allocate sample units to the two stages. Suppose the total cost of sampling is fixed at $C$, and let $C = c'n' + cn$, where:

    c' = cost of observing x on one unit (visual estimate),
    c  = cost of observing y on one unit (actual measurement).

For a fixed cost ($C$, $c'$, $c$), we want to find the optimal values of $n$ and $n'$, that is, those values that minimize the variance of the mean or total estimator. These optimal values can be shown to satisfy:
$$\frac{n}{n'} = \sqrt{\frac{c'}{c}\left(\frac{\sigma_r^2}{\sigma^2 - \sigma_r^2}\right)}.$$

• $s_r^2$ is approximately unbiased for $\sigma_r^2$ and $s^2$ is unbiased for $\sigma^2$. So we could use $s_r^2$ and $s^2$ from a preliminary or prior study as “guesses” of these variances to answer the allocation question.

• In the biomass example, we found $s_r^2 = 117.2$ and $s^2 = 2888.3$. Hence, there was much more variation among the y's than in the y's about the ratio line $r x$. This in turn makes $\dfrac{\sigma_r^2}{\sigma^2 - \sigma_r^2}$ small. Is this typical? When?
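The allocation rule above can be sketched in R using the biomass subsample. This is only an illustrative sketch: the function name `alloc_fraction` and the budget figures (`C_total`, `c_prime`, `c_y`) are hypothetical and do not come from the example. It plugs $s_r^2$ and $s^2$ in for $\sigma_r^2$ and $\sigma^2$, then solves $C = c'n' + cn$ with $n = f\,n'$ for the two sample sizes.

```r
# Biomass subsample from the notes: visual (x1) and actual (y) values.
x1 <- c(20, 80, 150, 40, 80)
y  <- c(14, 62, 155, 36, 71)
n  <- length(y)

r   <- sum(y) / sum(x1)                # sample ratio
sr2 <- sum((y - r * x1)^2) / (n - 1)   # s_r^2: variation about the ratio line
s2  <- var(y)                          # s^2: usual sample variance of y

# Optimal subsampling fraction n / n' for a given cost ratio c'/c,
# with s_r^2 and s^2 standing in for the unknown population variances.
alloc_fraction <- function(cost_ratio, sr2, s2) {
  stopifnot(s2 > sr2)                  # the formula requires sigma^2 > sigma_r^2
  sqrt(cost_ratio * sr2 / (s2 - sr2))
}

# Hypothetical budget (illustrative numbers, not from the notes).
C_total <- 100   # total budget C
c_prime <- 1     # cost c' of one visual estimate
c_y     <- 10    # cost c of one actual measurement

f       <- alloc_fraction(c_prime / c_y, sr2, s2)
n_prime <- C_total / (c_prime + c_y * f)   # from C = c' n' + c (f n')
n_sub   <- f * n_prime

round(c(fraction = f, n_prime = n_prime, n = n_sub), 3)
```

In practice the two sample sizes would be rounded to whole numbers, and the subsample kept large enough to estimate $s_r^2$ sensibly.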
• Using $\dfrac{s_r^2}{s^2 - s_r^2} = 0.0423$ as an estimate of the ratio $\dfrac{\sigma_r^2}{\sigma^2 - \sigma_r^2}$, we can compute allocations for a variety of cost ratios:

    c'/c    n/n'
    0.1     .065
    0.2     .092
    0.9     .195
    1.0     .206

• So, for example, if the cost ratio is $c'/c = 0.1$ (10 times more expensive to measure the response than the auxiliary variable), then the subsample (of the y's) should be about 6.5% the size of the visual sample (of the x's) (i.e., we would measure y on roughly every 15th unit).

Regression Estimation for Double Sampling:

Suppose there is an underlying relationship (linear or otherwise) between y and x which does not go through the origin. Then regression estimation will be more appropriate than ratio estimation. Consider the simple linear regression model $y = A + Bx$, fit to our sample of size $n$ to give the least-squares estimates $a$ and $b$.

• The regression estimate of the total in previous problems was given as $\hat\tau_L = Na + b\,\tau_x$, where $a = \bar y - b\bar x$. With double sampling, however, $\tau_x$ is unknown and must be estimated from the first sample by $\hat\tau_x = \frac{N}{n'}\sum_{i=1}^{n'} x_i$, giving $\hat\tau_L = Na + b\,\hat\tau_x$.

• The resulting estimated variance of $\hat\tau_L$ is:
$$\widehat{\mathrm{Var}}(\hat\tau_L) = N(N-n')\,\frac{s^2}{n'} + N^2\left(\frac{n'-n}{n'}\right)\frac{1}{n(n-2)}\sum_{i=1}^{n}(y_i - a - b x_i)^2.$$

Double Sampling for Stratification:

Consider now the use of auxiliary information for the purpose of stratifying the original population into groups which are homogeneous within and heterogeneous between for the variable of interest. Recall that to take a stratified random sample, we need to know which strata the individual sampling units belong to before the sampling is done, as it is with this information that the sampling is actually performed. In cases where the stratum identifications are not known, but the relative population stratum sizes ($N_h/N$) can be assumed known, poststratification techniques can be applied to obtain the desired estimators of the population mean or total. If, however, these relative population stratum sizes are not known, then they need to be estimated.
One way to accomplish this is to take an initial sample whose sole purpose is to estimate these relative stratum sizes, and then take a second subsample from this first sample where we stratify the initial sample according to the estimated relative stratum sizes.

Before describing how double sampling is used here, recall the following notation for stratified random sampling. Let:

    N   = population size,
    N_h = the size of stratum h in the population,
    W_h = N_h/N = the proportion of the population in stratum h.

Double sampling in this stratified framework is conducted in two steps:

1. Select an SRS of size $n'$. Here, we find $n'_h$, the number of observations in this “easy-to-take” sample from stratum $h$, $h = 1, \ldots, L$.

   • Let $w_h = \dfrac{n'_h}{n'}$ = the proportion of the sample in stratum $h$.

2. Take a stratified random sample from the $n'$ units in the initial sample.

   • Let $n_h$ = the number of units sampled from stratum $h$.

   • Normally, the stratified random sample estimates the population mean $\mu$ using $\bar y_{st} = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)\bar y_h$. The difference here is that we are estimating $N_h/N$ with $n'_h/n'$ based on the initial sample. This gives the double sampling stratified estimator of the mean:
$$\bar y_d = \sum_{h=1}^{L}\left(\frac{n'_h}{n'}\right)\bar y_h, \quad \text{where } \bar y_h = \text{the sample mean in the } h\text{th stratum}.$$

   • The variance of $\bar y_d$ is given by:
$$\mathrm{Var}(\bar y_d) = \left(\frac{N-n'}{N}\right)\frac{\sigma^2}{n'} + \sum_{h=1}^{L}\frac{W_h\sigma_h^2}{n'}\left(\frac{n'_h}{n_h} - 1\right),$$
where the second term is the extra source of variability from the secondary sample. It is estimated by:
$$\widehat{\mathrm{Var}}(\bar y_d) = \frac{N-1}{N}\sum_{h=1}^{L}\left(\frac{n'_h-1}{n'-1} - \frac{n_h-1}{N-1}\right)\frac{w_h s_h^2}{n_h} + \frac{N-n'}{N(n'-1)}\sum_{h=1}^{L} w_h(\bar y_h - \bar y_d)^2,$$
where $\sigma_h^2$ and $s_h^2$ are the population and sample variances within stratum $h$.

Notes:

• Suppose we take an SRS of 100 plots, where we plan to subsample 10 of these plots to measure the variable of interest. Normally, we would sample the “subsample plots” during the course of sampling the original 100 plots for efficiency reasons (i.e., the sampling would be done all at once).
However, for a stratified random sample in this setting, we don't know the strata identifications until we sample the 100 plots.

• An alternative to this double sampling for stratification procedure might be to take a systematic sample at the first stage and at the second stage. This might be better at guaranteeing coverage of the area of interest. At the second stage (to mimic stratified random sampling), we could systematically select sites in each of the strata proportionally.

• In general, we may want to sample more units than dictated by the proportional stratum sizes to guarantee that at least 2 units are sampled from each stratum. Why?

Example (A Way to Handle Nonresponse): Consider a population of $N = 400$ individuals on which we want to conduct a mail survey. The parameter of interest in this study is the proportion of people who respond with a “Yes” to a particular question. Call this population proportion $p$.

• Suppose an initial SRS of size $n' = 120$ people is taken where 30 people respond and the other 90 do not. On the basis of this sample, we will view the respondents as one stratum and the nonrespondents as a second stratum (call them strata 1 and 2 respectively). We view them as separate strata to allow for the (likely) possibility that the nonrespondents might have responded differently than the respondents, had they responded. So $L = 2$ here.

• In terms of the stratification notation, we have $n'_1 = 30$ respondents and $n'_2 = 90$ nonrespondents. We then estimate the stratum weights by $w_1 = \dfrac{n'_1}{n'} = 0.25$ and $w_2 = \dfrac{n'_2}{n'} = 0.75$.

• Suppose 20 of the 30 respondents answered “Yes” to the question. Our initial estimate of the proportion who responded “Yes” is thus $\bar y_1 = 20/30 = 0.667$.

• To try to obtain more information about the population of nonrespondents, suppose we now take an SRS of 25 nonrespondents from the 90 in our original sample.
As the preliminary survey was a mail survey (which often has a low response rate), we might try a phone survey or even a face-to-face interview at this second stage. Suppose in this intensive follow-up, 20 of the 25 nonrespondents sampled respond and 4 answer “Yes” to the original question.

• Viewing this two-stage sample as a double sample, we have the following:

    Stratum               Initial Sample Size    Secondary Sample Size    Stratum Mean
    Respondents (1)       n'_1 = 30              n_1 = 30                 ybar_1 = 0.667
    Nonrespondents (2)    n'_2 = 90              n_2 = 25                 ybar_2 = 4/20 = 0.2

• The estimated stratum mean for the nonrespondents stratum, namely $\bar y_2 = 0.2$, assumes that the 20 people who actually responded are representative of the 25 we attempted to sample.

• The resulting estimate of the proportion answering “Yes” to the question is:
$$\hat p_d = \bar y_d = \sum_{h=1}^{2} w_h \bar y_h = (0.25)(0.667) + (0.75)(0.2) = 0.317,$$
a substantial decrease from the estimate of 0.667 which ignored the nonrespondents.
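The nonresponse calculation, together with the estimated variance formula for $\bar y_d$ given earlier, can be sketched in R as follows. This is a minimal sketch under stated assumptions, not code from the original notes: the function name `ds_strat` and its argument names are hypothetical, the stratum variances for 0/1 responses are taken as $s_h^2 = \frac{n_h}{n_h-1}\bar y_h(1-\bar y_h)$, and $n_2 = 20$ (the follow-up respondents) is used in place of the 25 attempted, in line with the representativeness assumption noted above.

```r
# Double-sampling stratified estimate of a mean and its estimated variance.
#   N    : population size
#   nph  : first-phase stratum counts n'_h
#   nh   : second-phase stratum sample sizes n_h
#   ybar : second-phase stratum means
#   s2   : second-phase stratum sample variances s_h^2
ds_strat <- function(N, nph, nh, ybar, s2) {
  np <- sum(nph)             # first-phase sample size n'
  w  <- nph / np             # estimated stratum weights w_h = n'_h / n'
  yd <- sum(w * ybar)        # point estimate ybar_d
  term1 <- ((N - 1) / N) *
    sum(((nph - 1) / (np - 1) - (nh - 1) / (N - 1)) * w * s2 / nh)
  term2 <- ((N - np) / (N * (np - 1))) * sum(w * (ybar - yd)^2)
  v <- term1 + term2
  list(estimate = yd, var = v, se = sqrt(v))
}

# Nonresponse example: stratum 1 = respondents, stratum 2 = nonrespondents.
N    <- 400
nph  <- c(30, 90)            # n'_1, n'_2
nh   <- c(30, 20)            # assumption: use the 20 follow-up respondents as n_2
yes  <- c(20, 4)             # number answering "Yes" in each stratum
ybar <- yes / nh
s2   <- nh / (nh - 1) * ybar * (1 - ybar)   # sample variance of 0/1 data

res <- ds_strat(N, nph, nh, ybar, s2)
round(unlist(res), 4)
```

Under these assumptions the standard error comes out at roughly 0.07, with most of the estimated variance coming from the nonrespondent stratum, where the weight is large and the follow-up sample is small.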