Stat 475/920 Notes 2

Reading: Lohr, Chapter 2

Office hour and R session notes:
Prof. Small’s office hours: Tues., Thurs., 4:45-5:45; by appointment. Huntsman Hall 464.
Xin Fu’s office hours: Monday, 5-6.
Introductory tutorial on R: Monday, Sept. 15th, 5-6, Huntsman Hall 441.

The goal of survey sampling is to obtain information about a population by examining only a fraction of the population. To simplify the presentation, we assume for now that the sampled population is the target population, that the sampling frame is complete, that there is no nonresponse or missing data, and that there are no errors of observation. We will return to nonsampling errors later.

I. Population

In survey sampling, we are typically concerned with a finite population (universe) of $N$ units labeled $\{1, 2, \ldots, N\}$. Let $y_i$ be a characteristic associated with the $i$th unit, e.g., the income of the $i$th person in the population.

The population quantities we will most often be interested in estimating are

(1) The population mean $\bar{y}_U$:
$$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i .$$

(2) The population total $t$:
$$t = \sum_{i=1}^{N} y_i .$$

(3) The population variance $S^2$:
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2 .$$

(4) The population standard deviation $S$:
$$S = \sqrt{S^2} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2} .$$

The population standard deviation can be thought of as measuring the typical absolute deviation of a randomly chosen unit’s $y$ from the population mean.

The coefficient of variation (CV) is a measure of variability in the population relative to the mean:
$$CV(y) = \frac{S}{\bar{y}_U} .$$

It is sometimes helpful to have a special notation for proportions. The proportion of units having a characteristic is simply a special case of the mean, obtained by letting $y_i = 1$ if the $i$th unit has the characteristic of interest (e.g., has income above $100,000 per year) and $y_i = 0$ if the $i$th unit does not have the characteristic of interest. Let $p$ be the proportion of units with the characteristic of interest:
$$p = \frac{\sum_{i=1}^{N} y_i}{N} = \frac{\text{number of units with the characteristic in the population}}{N} .$$

II.
Simple Random Sampling

Probability sampling: Each unit in the population has a known probability of being selected in the sample, and a chance method, such as using numbers from a random number table, is used to choose the specific units to be included in the sample.

The simplest type of probability sampling is simple random sampling without replacement (SRS). A simple random sample without replacement of size $n$ can be drawn by putting the numbers $\{1, \ldots, N\}$ into a hat and randomly selecting $n$ of them. In a simple random sample with replacement, after each draw, we put the number drawn back into the hat. For short, we will refer to simple random sampling without replacement simply as simple random sampling, since sampling without replacement is preferable to sampling with replacement for a finite population.

In a simple random sample of size $n$, every possible subset of $n$ distinct units has the same probability of being selected into the sample. There are $\binom{N}{n}$ possible samples and each is equally likely, so the probability of selecting any individual sample $\mathcal{S}$ of $n$ units is
$$P(\mathcal{S}) = \frac{1}{\binom{N}{n}} = \frac{n!\,(N-n)!}{N!} .$$

Let $\pi_i$ denote the probability that the $i$th unit will appear in the sample. In simple random sampling, $\pi_i = n/N$ for all units (we’ll prove this later).

We can draw a simple random sample of size $n$ from a population of size $N$ in R by the following commands:

    > n=25
    > N=500
    > sample(N,n,replace=FALSE)
     [1] 274 218 160 249  50 451 271 496 378 178  84   3 328  19 141 190 279 248 221
    [20]  37 425 106  40 404  43

Example of simple random sampling: The Effect of Agent Orange on Troops in Vietnam

During the Vietnam War, the U.S. Army engaged in herbicidal warfare in which the objective was to destroy forests that provided cover for the Viet Cong army. The most widely used herbicide was known as Agent Orange.

[Photograph: A UH-1D helicopter from the 336th Aviation Company sprays a defoliation agent on a dense jungle area in the Mekong Delta, 07/26/1969. National Archives photograph.]
Many Vietnam veterans are concerned that their health may have been affected by exposure to Agent Orange. The particularly worrisome component of Agent Orange is dioxin, which in high doses is known to be associated with certain cancers. High levels of dioxin can be detected 20 or more years after heavy exposure to Agent Orange.

To examine the dioxin levels in Vietnam veterans, researchers from the Centers for Disease Control used military records to identify a sampling frame of 646 living U.S. Army combat personnel who served in Vietnam during 1967 and 1968 in the areas that were most heavily treated with Agent Orange (Centers for Disease Control Veterans Health Studies, “Serum 2,3,7,8-Tetrachlorodibenzo-p-dioxin Levels in U.S. Army Vietnam-era Veterans,” Journal of the American Medical Association 260, Sept. 2, 1988: 1249-1254). In the actual study, the dioxin levels in 1987 were obtained from all 646 veterans. The dioxin level is measured as a concentration in parts per trillion.

The data can be found on the class web site http://wwwstat.wharton.upenn.edu/~dsmall/stat475-f08/ under Data Sets. The data can be read into R by the following command:

    agent.orange.data=read.table("agent_orange_data.txt", header=TRUE)

The population mean is

    dioxin=agent.orange.data$dioxin
    mean(dioxin)
    [1] 4.260062

The population standard deviation is

    sd(dioxin)
    [1] 2.642617

Histogram of the population distribution of dioxin:

    hist(dioxin)

As a point of comparison, the mean dioxin level in 1987 in a sample of veterans who served between 1965 and 1971 in the United States and Germany was 4.19.

Suppose that instead of obtaining the dioxin levels of all 646 veterans, we would like to take a sample of size 50.

    n=50
    N=646
    sample1=sample(N,n,replace=FALSE)
    sample1
     [1] 132 527 542 110 475 354 458 586 370 389 544 626 538 324 159 359 272 233 213
    [20] 422 238 522 596 182 329  37  74 480 563 528 623  38  81 610 111 543 577 525
    [39] 344 335   4 198  83 452 145 249 455 418 540 175
    mean(dioxin[sample1])
    [1] 4.42

Another sample is

    sample2=sample(N,n,replace=FALSE)
    sample2
     [1] 198 501 254 392   8 417 237 221 363 146 102 390 600 388 453  55 233 211 292
    [20]  34 190 549 185 579 478 111 583 217 407 103 441 263  46 452 486 199 499  70
    [39] 401 529 219   1  68 227 477 415 618  77 341 362
    mean(dioxin[sample2])
    [1] 3.78

How accurate is the sample mean as an estimate of the population mean? The key to answering this question is to understand the sampling distribution of the sample mean.

III. Sampling Distribution

Let $\bar{y}$ denote the sample mean. The sampling distribution of $\bar{y}$ is the distribution of $\bar{y}$ over repeated samples.

Mean and variance of the sampling distribution of $\bar{y}$ (to be proved later):

Mean: $E(\bar{y}) = \bar{y}_U$. The sample mean is an unbiased estimator of the population mean.

Variance:
$$\mathrm{Var}(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right) .$$

Finite population correction: The usual formula for the variance of the sample mean of independent and identically distributed random variables is the variance divided by the sample size. The members of our sample are not independent because we are sampling without replacement. For simple random sampling without replacement, we multiply by an additional factor $1 - n/N$, which is called the finite population correction (fpc). Intuitively, we make this correction because, with small populations, the greater our sampling fraction $n/N$, the more information we have about the population and thus the smaller the variance. If we take a complete census, i.e., we sample the whole population so that $n = N$, then the fpc is 0 and $\mathrm{Var}(\bar{y}) = 0$, since $\bar{y}$ will always equal $\bar{y}_U$. For most samples that are taken from extremely large populations, the fpc is approximately 1.
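As a rough numerical illustration of the finite population correction, the sketch below evaluates $1 - n/N$ for the Agent Orange sampling fraction, for a complete census, and for a very large population. The helper names `fpc` and `var_ybar`, and the value $S^2 = 4$ used in the last call, are hypothetical and chosen only for illustration; they do not come from the notes.

```r
# Hypothetical helper (not from the notes): finite population correction 1 - n/N
fpc <- function(n, N) {
  stopifnot(n >= 0, n <= N)
  1 - n / N
}

# Hypothetical helper: Var(ybar) = (S^2 / n) * (1 - n/N) under simple random
# sampling without replacement, for a given population variance S2
var_ybar <- function(S2, n, N) {
  (S2 / n) * fpc(n, N)
}

fpc(50, 646)    # sampling fraction used in the Agent Orange example
fpc(646, 646)   # complete census: the fpc is 0
fpc(100, 1e8)   # very large population: the fpc is close to 1

# Illustrative only: S2 = 4 is an assumed value, not a quantity from the notes
var_ybar(S2 = 4, n = 100, N = 1e5)
```

Keeping the fpc in its own function makes it easy to see how little it changes the variance once $N$ is much larger than $n$.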
For large populations it is the size of the sample taken, not the sampling fraction, that determines the precision of the estimator. For example, a sample of size 100 from a population of 100,000 has almost the same precision as a sample of size 100 from a population of 100 million:
$$\mathrm{Var}(\bar{y}) = \frac{S^2}{100}\left(1 - \frac{100}{100{,}000}\right) = \frac{S^2}{100}(0.999) \quad \text{for } N = 100{,}000,$$
$$\mathrm{Var}(\bar{y}) = \frac{S^2}{100}\left(1 - \frac{100}{100{,}000{,}000}\right) = \frac{S^2}{100}(0.999999) \quad \text{for } N = 100{,}000{,}000 .$$

If your soup is well stirred, you need to taste only one or two spoonfuls to check the seasoning, whether you have made 1 liter or 20 liters of soup.

The variance of $\bar{y}$ involves the population variance $S^2$, which depends on the values for the entire population. We can estimate the population variance by the sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i \in \mathcal{S}} (y_i - \bar{y})^2 ,$$
where $\mathcal{S}$ denotes the sample. An unbiased estimator of the variance of $\bar{y}$ is
$$\widehat{\mathrm{Var}}(\bar{y}) = \frac{s^2}{n}\left(1 - \frac{n}{N}\right) .$$

We usually report not the estimated variance of $\bar{y}$ but its square root, the standard error (SE):
$$SE(\bar{y}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{N}\right)} .$$

We can simulate the sampling distribution of $\bar{y}$ for the Agent Orange data:

    nosims=2000; # number of samples to be simulated
    n=50; # size of sample
    N=646; # size of population
    samplemean=rep(0,nosims);
    for(i in 1:nosims){
      tempsample=sample(N,n,replace=FALSE);
      samplemean[i]=mean(dioxin[tempsample]);
    }
    mean(samplemean) # mean of sample means
    [1] 4.25794
    mean(dioxin) # population mean
    [1] 4.260062
    sd(samplemean) # SD of the simulated sample means
    [1] 0.3562426
    # true SD of sample mean: (S/sqrt(n))*sqrt(1-n/N), approximately 0.359
    truesd.samplemean=(sd(dioxin)/sqrt(n))*sqrt(1-n/N)
    hist(samplemean) # histogram of sample means

Estimating proportions: For proportions, the above formulas for the sampling distribution can be specialized. Let $p = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}_U$ denote the population proportion and $\hat{p} = \bar{y}$ denote the sample proportion.
Then, since $y_i^2 = y_i$ when $y_i$ is 0 or 1,
$$S^2 = \frac{\sum_{i=1}^{N} (y_i - p)^2}{N-1} = \frac{\sum_{i=1}^{N} y_i^2 - 2p\sum_{i=1}^{N} y_i + Np^2}{N-1} = \frac{N}{N-1}\,p(1-p),$$
$$\mathrm{Var}(\hat{p}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right) = \frac{p(1-p)}{n}\cdot\frac{N-n}{N-1} .$$

Also,
$$s^2 = \frac{1}{n-1}\sum_{i \in \mathcal{S}} (y_i - \hat{p})^2 = \frac{n}{n-1}\,\hat{p}(1-\hat{p}),$$
$$\widehat{\mathrm{Var}}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right),$$
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)} .$$

Example: Suppose we are interested in the proportion of Vietnam veterans with dioxin level greater than 5.

    n=50;
    N=646;
    tempsample=sample(N,n,replace=FALSE);
    phat=mean(dioxin[tempsample]>5);
    var.phat=(phat*(1-phat)/(n-1))*(1-n/N);
    se.phat=sqrt(var.phat);
    phat
    [1] 0.14
    se.phat
    [1] 0.04761262

Estimating the population total: The natural estimate of the population total is $\hat{t} = N\bar{y}$. (Note that $t = \sum_{i=1}^{N} y_i = N\bar{y}_U$.)

The mean and variance of the sampling distribution of $\hat{t}$ are
$$E(\hat{t}) = t,$$
$$\mathrm{Var}(\hat{t}) = N^2\,\mathrm{Var}(\bar{y}) = N^2\,\frac{S^2}{n}\left(1 - \frac{n}{N}\right),$$
and the variance is estimated by
$$\widehat{\mathrm{Var}}(\hat{t}) = N^2\,\frac{s^2}{n}\left(1 - \frac{n}{N}\right) .$$

IV. Confidence Intervals

Consider our first sample from the Vietnam veterans population:

    n=50
    N=646
    sample1=sample(N,n,replace=FALSE)
    mean(dioxin[sample1])
    [1] 4.42

It is not sufficient to just report the sample mean. We would like to give an idea of the accuracy of our estimate of the population mean and a range of plausible values for the population mean. This is done by means of a confidence interval (CI).

A 95% confidence interval for a population parameter should have the following property: If we take samples from our population over and over again and construct a confidence interval using our procedure for each of the samples, 95% of the resulting intervals should include the true value of the population parameter.

Another way of thinking about confidence intervals: if we take a survey on a different subject each day and come up with a valid 95% confidence interval for the population mean each time, then over our lifetime, about 95% of the intervals will contain the true population means.
In introductory statistics, for situations in which the population is infinite, a central limit theorem for infinite population sampling with replacement was introduced and used to form approximate confidence intervals: If $y_1, y_2, \ldots$ are iid random variables with mean $\mu$ and variance $\sigma^2$, then under regularity conditions, for $n$ sufficiently large,
$$\frac{\frac{1}{n}\sum_{i=1}^{n} y_i - \mu}{\sigma/\sqrt{n}}$$
has approximately a standard normal distribution.

For finite population sampling, the usual central limit theorem cannot be applied because the sample size cannot exceed the population size. However, there is a finite population central limit theorem in which we consider a sequence of populations and sample sizes such that the population sizes and sample sizes increase to infinity. Hájek (1960) proved that if certain technical conditions hold and if $n$, $N$, and $N - n$ are all “sufficiently large,” then
$$\frac{\bar{y} - \bar{y}_U}{\sqrt{1 - \frac{n}{N}}\;\frac{S}{\sqrt{n}}}$$
has approximately a standard normal distribution.

For $n$, $N$, and $N - n$ “sufficiently large,” an approximate 95% confidence interval for the population mean is
$$\left[\bar{y} - 1.96\sqrt{1 - \frac{n}{N}}\,\frac{s}{\sqrt{n}},\ \ \bar{y} + 1.96\sqrt{1 - \frac{n}{N}}\,\frac{s}{\sqrt{n}}\right] = \left[\bar{y} - 1.96\,SE(\bar{y}),\ \ \bar{y} + 1.96\,SE(\bar{y})\right] \qquad (1.1)$$

Note that if $n/N$ is small, then this confidence interval is approximately the same as the usual confidence interval from introductory statistics.

The imprecise term “sufficiently large” in the central limit theorem occurs because the adequacy of the normal approximation depends on $n$ and on how closely the population $\{y_i,\ i = 1, \ldots, N\}$ resembles a population generated from a normal distribution. The magic number of $n = 30$, often cited in introductory statistics books as a sample size that is “sufficiently large” for the central limit theorem to apply, often does not suffice in finite population sampling problems. Many populations are highly skewed.

Simulation study of performance of confidence intervals for the Vietnam veteran Agent Orange study:

Recall that the dioxin distribution is fairly skewed.
To simulate the coverage (the proportion of times the confidence interval contains the true population parameter) of the approximate 95% CI (1.1) for a sample size of n (n=50 in the code), we use the following R code:

    nosims=10000;
    n=50;
    N=646;
    samplemean=rep(0,nosims); # vector to store the sample mean
    lowerci=rep(0,nosims); # vector to store lower end point of confidence interval
    upperci=rep(0,nosims); # vector to store upper end point of confidence interval
    popmean=mean(dioxin); # population mean
    for(i in 1:nosims){
      tempsample=sample(N,n,replace=FALSE); # take sample
      samplemean[i]=mean(dioxin[tempsample]); # compute sample mean
      se=sqrt((1-n/N)*sd(dioxin[tempsample])^2/n); # compute standard error of sample mean
      lowerci[i]=samplemean[i]-1.96*se; # lower end point of CI
      upperci[i]=samplemean[i]+1.96*se; # upper end point of CI
    }
    hist(samplemean) # histogram of sample means
    # Proportion of times confidence interval contains the population mean
    ci.coverage=mean((lowerci<=popmean)*(upperci>=popmean));
    ci.coverage

    Sample size n    Coverage of large-sample 95% CI (1.1)
    15               0.903
    30               0.911
    50               0.919
    75               0.921
    100              0.922
    300              0.921

The confidence interval has only 91% coverage for $n = 30$ and around 92% coverage for $n = 50$.

For a more normally distributed population, the confidence intervals perform better. We simulate data from a Gamma distribution with shape=10 and rate=1 (equivalently, scale=1).

    # Simulation from Gamma distribution
    N=646
    gammapop=rgamma(N,shape=10,rate=1);
    hist(gammapop);

For this population, the simulated confidence interval coverage is

    Sample size n    Coverage of large-sample 95% CI (1.1)
    15               0.929
    30               0.937
    50               0.946
    75               0.946
    100              0.946
    300              0.949

For skewed populations, stratified sampling is useful for obtaining more accurate estimates and more accurate confidence intervals (stratified sampling will be studied in Chapter 4).
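Coverage figures like those tabulated above can be produced for several sample sizes at once by wrapping the simulation loop in a function. The following is a minimal sketch under stated assumptions, not the code used for the notes: the function name `simulate_coverage`, the seed, and the reduced number of simulations (2000) are hypothetical choices, and the population is a freshly simulated Gamma population (using the same `rgamma` arguments as the notes' code), so the resulting numbers will differ somewhat from the tabulated ones.

```r
# Hypothetical helper: estimate the coverage of the approximate 95% CI (1.1)
# for a finite population `pop` and sample size `n` by repeated SRS
simulate_coverage <- function(pop, n, nosims = 2000) {
  N <- length(pop)
  popmean <- mean(pop)
  covered <- logical(nosims)
  for (i in seq_len(nosims)) {
    idx <- sample(N, n, replace = FALSE)          # SRS without replacement
    ybar <- mean(pop[idx])                        # sample mean
    se <- sqrt((1 - n / N) * var(pop[idx]) / n)   # standard error with fpc
    covered[i] <- (ybar - 1.96 * se <= popmean) && (popmean <= ybar + 1.96 * se)
  }
  mean(covered)                                   # proportion of intervals covering
}

set.seed(475)                                     # arbitrary seed, for reproducibility
gammapop <- rgamma(646, shape = 10, rate = 1)     # simulated population (assumption)
sizes <- c(15, 30, 50, 75, 100, 300)
coverage <- sapply(sizes, function(n) simulate_coverage(gammapop, n))
data.frame(n = sizes, coverage = coverage)
```

Passing the population as an argument means the same function could be applied to the `dioxin` vector once the data set has been loaded.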
Confidence intervals for proportions:

For a proportion, the central limit theorem based confidence interval is
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)},$$
which, for $n/N$ small and $n$ large, is approximately
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} .$$

Brown, Cai and DasGupta (2001, Statistical Science) show that this confidence interval has erratic coverage properties. They recommend several alternatives, the easiest of which is to use
$$\tilde{p} = \frac{\sum_{i \in \mathcal{S}} y_i + 2}{n + 4}$$
in place of $\hat{p}$, i.e., the CI is
$$\tilde{p} \pm 1.96\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n}}$$
($\tilde{p}$ is the Bayes estimate of $p$ from a Beta(2,2) prior).
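The adjusted interval can be computed directly from a 0/1 sample vector. The sketch below is illustrative only: the function name `adjusted_prop_ci` and the toy sample (7 successes out of 50, matching a sample proportion of 0.14) are assumptions, and the formula is the one written above, with $n$ in the denominator and no finite population correction.

```r
# Hypothetical helper: adjusted 95% CI for a proportion, using
# p_tilde = (sum(y) + 2) / (n + 4) in place of the sample proportion
adjusted_prop_ci <- function(y, z = 1.96) {
  n <- length(y)
  p_tilde <- (sum(y) + 2) / (n + 4)                # shrinks the estimate toward 1/2
  half <- z * sqrt(p_tilde * (1 - p_tilde) / n)    # half-width of the interval
  c(estimate = p_tilde, lower = p_tilde - half, upper = p_tilde + half)
}

# Toy sample (assumed for illustration): 7 of 50 units have the characteristic
y <- c(rep(1, 7), rep(0, 43))
ci <- adjusted_prop_ci(y)
ci
```

Unlike the unadjusted interval, this one does not collapse to zero width when the sample contains no units (or only units) with the characteristic.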