IDA/Statistics
Annica Isaksson/Anders Nordgaard
October 17, 2011

732A28: Computer lab 1

Introduction

In this computer lab, we work with the software SAS. We use the procedure SURVEYSELECT to select samples from a finite population, and the procedure SURVEYMEANS to calculate survey estimates. For detailed help on the procedures, including syntax and examples, see http://v8doc.sas.com/sashtml/ (in the left margin, choose SAS/STAT>SAS/STAT User’s Guide and scroll down).

Data

We use data from the U.S. 1992 Census of Agriculture. For a description of the survey, see Example 2.5 on pp. 34-35 in Lohr’s book. We have access to two data sets:

agpop.dat, containing data on the whole population
agsrs.dat, containing data from a simple random sample from the population

The files are described on page 437 in Lohr’s book. They are available on the CD-ROM attached to the book, and also at the course homepage. Start by copying the files to your home area. Then, use the programs agpop.sas and agsrs.sas, respectively, to import the data into SAS. Both programs are available at the course homepage.

Exercise 1

Run agpop.sas and check the printout to see that the 3078 observations have been imported correctly. There is some nonresponse in the variables acres92 and acres87, which is coded -99. Change this to the missing-value code . (a period) used in SAS, for instance with the following data step (which changes the nonresponse code for the variable acres92):

    data agrarp;
      set agrarp;
      if acres92=-99 then acres92=.;
    run;

Check that the re-coding was successful. Also, calculate some simple descriptive statistics by running the procedure MEANS. For instance, to obtain the mean and standard deviation for the variable acres92, write

    proc means data=agrarp mean std;
      var acres92;
    run;

How do you obtain the population total for acres92 from the calculated mean?

Exercise 2

Run agsrs.sas and check the printout to see that the 300 observations have been imported correctly.
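As a complement to reading the printout, the import can also be checked with PROC MEANS. The following is only a sketch, and it assumes that agsrs.sas stores the sample in a data set called agrars (the name used later in this lab). The statistics n and nmiss count nonmissing and missing values, and min reveals any remaining -99 codes.

```sas
/* Sketch: count observations and look for the nonresponse code -99.  */
/* The data set name agrars is assumed to be the one created by       */
/* agsrs.sas.                                                         */
proc means data=agrars n nmiss min max;
  var acres92 acres87;
run;
```

If the import worked, n should be 300 for both variables and the minimum should not be negative.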
No re-coding is necessary since there is no nonresponse. Let us now use SURVEYMEANS to estimate the population mean for the variable acres92 from the sample. Try the following simple program:

    proc surveymeans data=agrars total=3078 all;
      var acres92;
    run;

Maybe you do not want all the resulting statistics. Then you can replace all with the statistics you want. If you simply remove all, you will get the number of observations (nobs), the estimated mean (mean), the standard error of the estimated mean (stderr), and the confidence limits for the mean (clm). Try it! Besides data= and total=, a common option is alpha=, which is used if you do not want a 95% confidence interval. For instance, if you want a 99% interval, write alpha=0.01.

The procedure SURVEYMEANS

The syntax for SURVEYMEANS is the following:

    proc surveymeans <options> <statistic keywords>;
      by variables;
      class variables;
      cluster variables;
      strata variables;
      var variables;
      weight variable;

We have already used the main command and var.

by is used if you want to produce estimates separately for different groups, where group membership is indicated by a variable.

class is used if a numeric variable is to be treated as categorical. A nonnumeric variable is always treated as categorical, so the class command is not needed for it.

cluster is only used if you have a cluster sample, and the variables indicate cluster membership.

strata is only used if you have a stratified sample, and the variables indicate stratum membership.

weight requires that there is a variable which gives the weight for each observation. Typically, the weight is the design weight, that is, the inverted inclusion probability. If you have SRS, stratified SRS or SRS of clusters, you do not need to use weight in order to estimate a mean (but you still need it to estimate a total).
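To illustrate how options and statistic keywords are combined in the main command, here is a minimal sketch that requests a 99% confidence interval and only a few statistics. It assumes that the sample is stored in the data set agrars, as in Exercise 2.

```sas
/* Sketch: 99% confidence interval for the mean of acres92.           */
/* alpha=0.01 replaces the default 95% level; the keywords after it   */
/* select which statistics are printed.                               */
proc surveymeans data=agrars total=3078 alpha=0.01 nobs mean stderr clm;
  var acres92;
run;
```

The 99% interval should be wider than the 95% interval obtained without alpha=.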
The procedure SURVEYSELECT

The syntax of SURVEYSELECT resembles the one for SURVEYMEANS:

    proc surveyselect <options>;
      strata variables;
      control variables;
      size variable;
      id variables;

If you select a pps sample, you need to use size, and the variable should be the size measure for the pps sampling. If you conduct stratified sampling, the strata variables indicate the strata. After id, list the variables that you want in your sample (if you do not use id, all variables from the population will be in the sample). If you want to sort the data, for instance for systematic sampling, use the command control.

There are many options in the main command. We will mainly use data=, out=, method=, sampsize= and rep=. The name of the resulting file is given in out=. The sample size is given in sampsize= and the number of samples in rep= (default is 1 sample). There are many possible sampling methods, indicated by method=. The default sampling method is srs.

Exercise 3

Let us select a new simple random sample of size 300 from the population agpop by use of SURVEYSELECT. We want to include all variables in the sample, and the sample shall be saved in the file agosu. Try the following program:

    proc surveyselect data=agrarp sampsize=300 out=agosu;
    run;

Now run SURVEYMEANS to estimate the population mean for the variable acres92 (as in Exercise 2). Does the confidence interval include the true value?

Exercise 4

Now, let us select 5 independent simple random samples from the population and save them in the file agosu5. This time, we only save the variable acres92. Try the following program:

    proc surveyselect data=agrarp sampsize=300 rep=5 out=agosu5;
      id acres92;
    run;

The resulting file should contain 5 samples, each of size 300. Note that the samples are identified by a new variable replicate. This variable is useful when you want to produce estimates separately for each sample.

Exercise 5

Select 10 simple random samples from agpop to study the variables acres92 and acres87.
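The selection step could look something like the sketch below, which follows the pattern of Exercise 4. The sample size 300 and the output name agosu10 are assumptions, since the exercise does not specify them.

```sas
/* Sketch: 10 independent simple random samples from the population.  */
/* sampsize=300 and the data set name agosu10 are assumed here.       */
proc surveyselect data=agrarp method=srs sampsize=300 rep=10 out=agosu10;
  id acres92 acres87;
run;
```

As in Exercise 4, the output file contains the variable replicate, which tells the samples apart.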
Use SURVEYMEANS (with a by command) to estimate the population means for the two variables, separately for each of the 10 samples. Do all intervals cover the true values?

Exercise 6

Once more, we want to produce estimates from the simple random sample agsrs.dat (imported as agrars). Now, we also want to estimate the population total for the variable acres92. This is possible with the commands sum and clsum (which produce estimates with 95% confidence intervals). Try the following program:

    proc surveymeans data=agrars total=3078 mean clm sum clsum;
      var acres92;
    run;

Note that the resulting confidence interval for the population mean is (260706,335088), which we know from before is correct. However, for the population total, the confidence interval is (78211877,100526351). This interval is 300 times larger than the interval for the mean, not 3078 times larger as it should be! Why did this happen? The problem is that SURVEYMEANS uses sampling weights, and since we have not specified any weight variable, it will use the default weight 1. In Exercise 7, we handle this by creating a weight variable which is the inverted inclusion probability. Since agrars is selected with simple random sampling, the inverted inclusion probability is N/n=3078/300 for all units in the population.

Exercise 7

Create a variable called samplingweight with the value 3078/300, for instance with the following program:

    data agrars;
      set agrars;
      if (acres92<0) then samplingweight=0;
      else samplingweight=3078/300;
    run;

Run the program in Exercise 6 again, but add the line

    weight samplingweight;

before run;. Check the result. The population total should be estimated by 916927110 with the standard deviation 58169381.

Exercise 8

If we use SURVEYSELECT for sampling, we get the sampling weights automatically for all sampling methods except simple random sampling (and other sampling methods where the first order inclusion probabilities are equal).
For methods such as simple random sampling, the sampling weights (and the inclusion probabilities) will be included if you add the option stats to the main command. Select a simple random sample of size 300 from agpop. Add stats and only include the variables county, state and acres92. You can try the following program:

    proc surveyselect data=agrarp method=srs sampsize=300 stats out=agosu;
      id county state acres92;
    run;

Check the contents of the file agosu. Aside from the variables county, state and acres92, the variables selectionprob and samplingweight have been added. Estimate population totals by running SURVEYMEANS. Compare with the results in Exercise 7.
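The estimation step can be sketched in the same way as in Exercise 7, now using the weight variable that SURVEYSELECT created. This is only one possible program. It assumes that the sample was saved as agosu and that the weight variable is called samplingweight, as described above.

```sas
/* Sketch: estimate the population total of acres92 from the new      */
/* sample, using the weights added by the stats option.               */
proc surveymeans data=agosu total=3078 sum clsum;
  var acres92;
  weight samplingweight;
run;
```

Since agosu is a different random sample than agrars, the estimates will not be identical to those in Exercise 7, but they should be of the same order of magnitude.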