Statistics 475 Notes 18 (Revised)

Reading: Lohr, Chapter 7.1-7.4

Schedule: On Wednesday, Professor Brown will finish his presentation on sampling issues in the Census. The final homework assignment is due Mon., Dec. 8th. The last two classes, December 1 and December 3, will be for short presentations on your final projects. I will be providing more details shortly. The final report on your project is due Wed., Dec. 17th, 5 p.m.

I. Complex Surveys: Example

We have discussed a number of ways in which surveys can use designs and analyses other than simple random sampling to be more feasible or efficient:
(1) Use of auxiliary variables (ratio estimation)
(2) Stratification
(3) Clustering
(4) Unequal probability sampling

Many large surveys use more than one of these elements. Such surveys are called complex surveys.

Example: Malaria is a serious health problem in The Gambia, a country in West Africa. Malaria morbidity can be reduced by using bed nets impregnated with insecticide, but this is only effective if the bed nets are in widespread use. In 1991, a nationwide survey was designed to estimate the prevalence of bed net use in rural areas. The survey is described and its results are reported in D’Alessandro et al. (1994, Bulletin of the World Health Organization).

[Photo caption: Veronica Njeri lost two children to malaria. (Evelyn Hockstein for The New York Times)]

The following is a sample design that is similar to the actual sampling design. The sampling frame consisted of all rural villages of fewer than 3000 people in The Gambia. The villages are divided into districts. In this population, there are 3 regions (eastern, central and western) with 20 districts per region. The villages in each district are stratified into two strata: those with a public health center (PHC) and those without a public health center. There are ten villages per stratum and 50 compounds per village.
The sampling design starts by sampling five districts in each region, selected with replacement with probability proportional to the population (# of beds) in the district. Within each sampled district, 2 PHC and 2 non-PHC villages were selected with replacement with probability proportional to the population in the village. Within each village, 6 compounds were sampled using a simple random sample. Within each sampled compound, we counted the number of beds and the number of bed nets in use. A total of 360 compounds were sampled.

In summary, the sampling design is the following:

Stage   Sampling Unit   Stratification
1       District        Region
2       Village         PHC/non-PHC
3       Compound

II. Sampling Weights

In complex surveys, formulas for estimating means, totals and especially standard errors can become horrendous. To deal with this, we will use sampling weights for computing estimates and resampling methods for computing standard errors (Chapter 9).

The sampling weight for a sampled unit is the number of units in the population that the sampled unit “represents.” For without-replacement sampling, the sampling weight for a unit is the reciprocal of the probability that the unit is selected in the sample. For with-replacement sampling, the sampling weight for a unit is the reciprocal of the number of observations sampled from the unit’s stratum times the reciprocal of the per-draw probability that the unit is selected on a draw from its stratum; that is, $w_i = \frac{1}{n} \cdot \frac{1}{\psi_i}$, where $n$ is the number of observations sampled from the stratum and $\psi_i$ is the per-draw selection probability.

Examples:

(1) For question 6.2 from Homework 4, recall the sampling design: The population schedules of the 1940 census are preserved on 4,576 microfilm reels. Each census page contains information on 40 individuals. Two lines on each page were designated as “sample lines” by the Census Bureau: the individuals falling on those lines (5 percent of the population) were asked a set of supplemental questions that appear at the bottom of the census page.
Two of every five census pages were systematically selected for examination. On each selected census page, one of the two designated sample lines was then randomly selected. Data-entry personnel then counted the size of the sample unit containing the targeted sample line. Units of size six or smaller were included in the sample in inverse proportion to their size. Thus, every one-person unit was included in the sample, every second two-person unit, every third three-person unit, and so on. Units with seven or more persons were included with a probability of 1-in-7: every seventh household of size seven or more was selected for the sample.

The probability that a person (say Mr. X) is selected into the sample is the following, where we let the number of persons in Mr. X’s household be $h$. The household enters the sample if any one of its $h$ members falls on the selected sample line, so we sum over the persons in Mr. X’s household:

Sum over persons in Mr. X’s household of P(person is on sampled page) * P(person is on sampled line) * P(sampled line is selected) * P(household is retained)

$$= h \cdot \frac{2}{5} \cdot \frac{2}{40} \cdot \frac{1}{2} \cdot \frac{1}{h} = \frac{1}{100}, \qquad h = 1, \ldots, 7$$

$$= h \cdot \frac{2}{5} \cdot \frac{2}{40} \cdot \frac{1}{2} \cdot \frac{1}{7} = \frac{h}{700}, \qquad h > 7$$

Thus, the sampling weight for a person in a household of size 7 or less is 100, and the sampling weight for a person in a household of size greater than 7 is $700/h$.

(2) Gambia survey. The Gambia survey is a stratified sample with the strata consisting of combinations of region and PHC/no PHC. So there are six strata: eastern PHC, eastern no PHC, central PHC, central no PHC, western PHC, western no PHC. There are 60 sampled units (compounds) in each stratum.
The per-draw probability of a compound from a given stratum, say central PHC, being selected into the sample is

P(district selected) * P(village selected | district selected) * P(compound selected | district and village selected)

$$= \frac{D_1}{R} \cdot \frac{V}{D_2} \cdot \frac{1}{C}$$

where
$C$ = number of compounds in the village ($C = 50$)
$V$ = number of beds in the village
$D_2$ = number of beds in the district in PHC villages
$D_1$ = number of beds in the compound’s district
$R$ = number of beds in the compound’s region

For each stratum, 60 compounds are sampled (five districts, two villages in each district, six compounds per village), so the sampling weights are

$$w = \frac{1}{60} \cdot \frac{1}{\frac{D_1}{R} \cdot \frac{V}{D_2} \cdot \frac{1}{C}}$$

The Gambia bed net survey was designed so that within each region, each compound would have almost the same probability of being included in the survey: probabilities varied only because different districts had different numbers of persons.

The data in the file gambiasample.out on the course web site are:
region = the region that the sampled compound was in (1, 2, 3)
district = the district that the sampled compound was in (1, 2, 3, 4, 5)
strata = the stratum (1 = non-PHC, 2 = PHC) of the sampled compound
village = the village of the sampled compound (1, 2, …, 10)
compound = the compound number (1, 2, …, 50)
beds = total number of beds in the compound
nets = total number of anti-malaria nets in the compound
dsize = total number of beds in the district [you don’t need to use this variable]
vsize = total number of beds in the village [you don’t need to use this variable]
psiv = single-draw probability $\psi_i$ = (# beds in village i)/(total # beds in the stratum) for selection of the villages
psid = single-draw probability $\psi_i$ = (# beds in district i)/(total # beds in the region) for selection of the districts

# Calculate Sample Weights for Gambia Data
gambiadata=read.table("gambiasample.out",header=TRUE,sep=",");
attach(gambiadata);
# weight = 1/(number sampled per stratum) * 1/(per-draw probability), with C=50
sample.weight=(1/60)*1/(psid*psiv*(1/50));
hist(sample.weight)

III. Estimation Using Sample Weights

The sample weight for a sampled unit is the number of units in the population that the sampled unit represents. Thus, our general estimate of the population total is

$$\hat{t} = \sum_{i \in S} w_i y_i$$

For estimating the population mean, we could use $\hat{t}/N$, where $N$ is the population size. But a generally more efficient estimator uses ratio estimation. Note that the population mean is a ratio of two population totals, $\bar{y}_U = t/N$, and that both totals can be estimated by weighted sample averages:

$$\frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i} = \frac{\frac{1}{n}\sum_{i \in S} w_i y_i}{\frac{1}{n}\sum_{i \in S} w_i}$$

We estimate $\bar{y}_U$ by the ratio estimator:

$$\hat{\bar{y}} = \frac{\hat{t}}{\sum_{i \in S} w_i} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i}$$

The denominator $\sum_{i \in S} w_i$ estimates the number of units $N$ in the population. It is more efficient to use $\sum_{i \in S} w_i$ than $N$ in the denominator, because if $\sum_{i \in S} w_i$ is smaller than $N$, we would expect $\sum_{i \in S} w_i y_i$ to be smaller than $t$. Conversely, if $\sum_{i \in S} w_i$ is larger than $N$, we would expect $\sum_{i \in S} w_i y_i$ to be larger than $t$.

R code:
# Weighted estimate of the total and ratio estimate of the mean
that=sum(sample.weight*nets);
ybarhat=that/sum(sample.weight);
> that
[1] 6029881
> ybarhat
[1] 89.19908

Thus, we estimate that there are an average of 89 nets per compound. We will discuss how to obtain a standard error and confidence interval later.

IV. Specifying the sample design in the survey package

By putting the data into the survey package in R, we will be able to use a number of nice built-in functions.

# Set up survey design for Gambia data
library(survey);
gambiadesign=svydesign(id=~village+compound,weights=~sample.weight, data=gambiadata);
svymean(nets,design=gambiadesign);
svytotal(nets,design=gambiadesign);
> svymean(nets,design=gambiadesign);
     mean     SE
[1,] 89.199 10.185
> svytotal(nets,design=gambiadesign);
     total     SE
[1,] 1004980 177023

IGNORE THESE STANDARD ERRORS: THEY DO NOT ACCOUNT FOR THE SURVEY DESIGN.

V. Estimating a distribution function

So far we have concentrated on estimating population means, totals and ratios. But statistics other than means or totals may be of interest. For example, we might be interested in a median or a histogram. We can estimate any of these quantities with sampling weights.
The sampling weights allow us to construct an empirical distribution for the population.

Suppose values for the entire population of $N$ units are known. Then any quantity of interest may be calculated from the probability mass function

$$f(y) = \frac{\text{number of units whose value is } y}{N}$$

or the distribution function

$$F(y) = \frac{\text{number of units whose value is} \le y}{N}$$

Any population quantity can be calculated from the probability mass function or distribution function. Sampling weights allow us to construct an estimated (empirical) probability mass function and distribution function:

$$\hat{f}(y) = \frac{\sum_{i \in S:\, y_i = y} w_i}{\sum_{i \in S} w_i}$$

Here, we weight each observation by its sampling weight. The estimated distribution function is

$$\hat{F}(y) = \sum_{x \le y} \hat{f}(x)$$

R Code:
# Plot estimated CDF
cdfest=svycdf(~nets,design=gambiadesign);
plot(cdfest);

The estimated CDF can also be used to plot histograms and boxplots that incorporate the sampling weights.

svyhist(~nets,gambiadesign);

# Compare boxplot that uses the sampling weights to boxplot that ignores
# the sampling weights
par(mfrow=c(1,2));
svyboxplot(nets~1,gambiadesign,ylab="Nets",main="Boxplot Using Weights");
boxplot(nets,ylab="Nets",main="Usual Boxplot Ignoring Weights");

One other useful plot is a bubble plot, which plots each point with a bubble that is proportional in size to the sampling weight.

svyplot(nets~region,gambiadesign,xlab="Region",ylab="Nets",main="Bubble Plot of Nets in Regions");
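To make the estimated probability mass function and distribution function concrete, here is a minimal base-R sketch that computes them by hand and reads off a weighted median. The values of y and w are hypothetical toy numbers (they are not taken from gambiasample.out), and defining the median as the smallest value whose estimated CDF reaches 0.5 is one common convention, assumed here for illustration.

```r
# Toy data: hypothetical values for illustration only
y <- c(2, 5, 5, 8, 10)       # observed values, e.g. nets per compound
w <- c(10, 20, 30, 25, 15)   # sampling weights of the sampled units

# Estimated pmf: total weight at each distinct value of y,
# divided by the total weight in the sample
fhat <- tapply(w, y, sum) / sum(w)

# Estimated CDF: running sum of the pmf over the sorted distinct values
Fhat <- cumsum(fhat)

# Weighted median (assumed convention): smallest y with Fhat(y) >= 0.5
yvals <- as.numeric(names(fhat))
wmedian <- yvals[min(which(Fhat >= 0.5))]

# Ratio estimate of the mean from the same weights, for comparison
ybarhat <- sum(w * y) / sum(w)

print(fhat)
print(Fhat)
print(wmedian)   # 5
print(ybarhat)   # 6.2
```

With these toy numbers the weighted median (5) sits below the weighted mean (6.2), which is the kind of difference a weighted boxplot or histogram would show. The survey package also provides svyquantile for quantiles from a design object; its interpolation rule may differ from the convention used in this sketch.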