LINKÖPINGS UNIVERSITET
Institutionen för datavetenskap
Statistik, ANd

732A28 SURVEY SAMPLING, 6 CDTS
Master’s program in Statistics, Data analysis and Knowledge discovery
Fall semester 2009

Home exam

Home exam for Student 2

Rules: The home exam is given under a “gentlemen’s agreement”. This implies that cooperation with other persons is not allowed. Solutions to the tasks should be submitted no later than the deadline set out. Late submissions will under normal circumstances be discarded. Except for cooperation with other persons, there is a free choice of means (textbooks, calculators, computer software, etc.).

Deadline for submission: November 4, 2009, at 12.00 PM (24.00)

Submission can be in the form of a hand-written manuscript, a printed document or an attachment to an e-mail. The e-mail address for attachments is Anders.Nordgaard@liu.se

If anything is unclear in the tasks, questions should be directed by e-mail to Anders.Nordgaard@liu.se

Grades: The grading on this home exam uses the whole ECTS scale, i.e. the grades may range from F to A. Grades are given on the basis of the total impression from the solution of the tasks; no specific points or intervals of points are used.

Good luck!

Task 1

Assume we have taken an SRS of 200 individuals from a population consisting of 55560 individuals. Various variables are recorded for the 200 selected individuals and among these is the total amount of hours spent watching television last week. The sum for the whole sample is 5540 hours and the standard deviation is 15 hours.

(a) Estimate the total amount of hours, t, spent watching television last week in the whole population and give a 99% confidence interval for this quantity.

(b) Based on the information obtained from the sample, how many individuals should at least be sampled if the standard error of a point estimate of t should be at most 280000 hours?

(c) Now assume that we identify 10 of the sampled individuals not to be part of the surveyed population.
The mean number of hours watching television for these 10 persons is 28.5 and the standard deviation is 10.9 hours. Calculate a new estimate of t and a 99% confidence interval. (Hint: Treat the group of sampled individuals actually belonging to the surveyed population as a domain of study.)

(d) Now assume that the sampling frame used for this survey was far from recently updated and that, besides the non-eligible respondents discovered in c), there are at least 200 individuals belonging to the population that do not exist in the sampling frame. What consequence(s) would there be for the estimation of t? Can any kind of adjustment be done? Explain.

Someone suggests that the sampling design used is not proper for assessing the savings of the individuals in the population. A parallel study showed that in a random sample of 150 individuals from those in the population having employment, the total amount of hours watching television was 3350 and the standard deviation was 10.3 hours. In a random sample of 50 individuals from the unemployed individuals in the population, the corresponding total amount of hours was 1800 and the standard deviation was 9.5 hours. Approximately 7.1% of the individuals in the population are unemployed.

(e) Use these “new” data to calculate a point estimate of t and a 99% confidence interval. Compare the results with those of a) (and/or c)) and explain.

(f) In both studies there were in total 200 individuals sampled. If we consider the second approach, i.e. splitting the sample between employed and unemployed, what would an optimal allocation of the 200 individuals over the two groups be?

Task 2

Let us assume that in a large city there are 55 schools with pupils aged 13-16. In each school the pupils are divided into classes with respect to the corresponding study year and also with respect to which school they attended before entering one of these schools.
The latter classification variable is more difficult to sort out and will at first be ignored. Regarding the former, we concentrate on pupils of study year 8. The total number of pupils of study year 8 in the 55 schools is 5954 and the total number of classes for all three study years is 631.

We would like to investigate the proportion of pupils of study year 8 that have been drinking alcoholic beverages during the last month. An SRS of 10 schools was taken from the 55 schools. In each school drawn, an SRS of 10% (number rounded downwards) of the pupils of study year 8 was taken and in each of these samples the number of pupils that had been drinking alcoholic beverages during the last month was recorded (obviously without any problems!). The result was as follows:

School   Total number of classes   Number of pupils    Number of pupils in sample that have
         (all study years)         in study year 8     been drinking alcoholic beverages
1        11                        92                  4
2        9                         83                  2
3        13                        149                 5
4        15                        140                 5
5        12                        88                  3
6        12                        119                 4
7        9                         86                  1
8        12                        118                 4
9        10                        120                 3
10       9                         80                  0

(a) Calculate an unbiased point estimate of the proportion, p, of pupils of study year 8 in all 55 schools that have been drinking alcoholic beverages during the last month. Calculate also a 95% confidence interval for p.

(b) As the number of pupils in study year 8 varies a lot between the schools, the unbiased estimate may be too uncertain. Calculate a ratio estimate of p with a 95% confidence interval and compare with your results from a).

(c) Is the sample self-weighting? Motivate your answer.

(d) Now assume that the schools in the sample have not been taken as an SRS but have been drawn with replacement and with probabilities proportional to the number of classes. Calculate again a point estimate and a 95% confidence interval for p.

(e) Consider the first 3 schools in the table above and assume that these constitute a sample on their own.
Further, assume that these 3 schools were drawn without replacement and with probabilities proportional to the number of classes. The inclusion probabilities for these 3 schools are assumed to be as in the table below.

                     School i
                 1        2        3        π_j
School j   1              0.0023   0.0034   0.052
           2     0.0023            0.0029   0.043
           3     0.0034   0.0029            0.062
           π_i   0.052    0.043    0.062

Calculate the Horvitz-Thompson estimate of p. Is an estimate of the standard error reliable? Motivate your answer.

Task 3

A study was made in a population of 6000 households, of which an SRS of 10% of the households was taken. A mail survey was used in the study and after the planned number of reminders there were 259 completed questionnaires returned. The questionnaire used for the study had a question about the number of mobile phones among the members of the household. The total number of mobile phones among the 259 households was 603 and the standard deviation was 1.6.

(a) Ignoring the non-respondents, estimate the mean number of mobile phones per household in the whole population and calculate a 95% confidence interval for this number.

To adjust for the potential bias in the estimated mean in a), a follow-up was done by taking an SRS of 60 households from the non-respondents and contacting these by telephone. All households contacted gave a response to the question in particular; the total number of mobile phones was 156 and the standard deviation was 1.4.

(b) Use these new data to compute a new estimate of the mean number of mobile phones per household in the population and a 95% confidence interval for this number. Compare the results with your results in a).

Now, suppose that the follow-ups were not done, but another adjustment is to be done. The households in the original sample could possibly be classified according to two variables: ownership and size. In the population 3200 of the households reside in rented apartments while the rest own either their apartments or the houses where they reside.
Further, 1030 of the households have one person only, 1850 have two persons, 980 have three persons, 1530 have four persons and the rest have five persons or more. In the following table the sample sizes and the respondents are divided according to these two classification variables:

Household size   Ownership   Sample size   Respondents   Total number of mobile phones
1                rent        90            36            17
1                own         10            10            9
2                rent        85            40            67
2                own         102           40            71
3                rent        58            31            70
3                own         33            17            43
4                rent        69            29            106
4                own         88            37            147
5 or more        rent        28            11            41
5 or more        own         37            8             32

(c) Use the table to derive a raking adjusted point estimate of the mean number of mobile phones per household in the population. Start with the initial weight 6000/259 for each respondent. Compare this estimate to your results in a) and b). What assumption(s) do you need in order to trust the raking adjusted estimate?

Let us look in particular at the subgroup of households with at least five members that rent their apartments. There were originally 28 households in the sample but questionnaires were apparently received from only 11. If we were only interested in the current question about the number of mobile phones we could try to impute the missing 17 values. Below are the answers from the 11 responding households:

3, 3, 6, 5, 3, 2, 4, 4, 2, 5, 4

(d) Use multiple imputation with the approximate Bayesian bootstrap (ABB) to impute the missing values. Construct 5 complete data sets and use repeated imputation inference to calculate a point estimate and a 95% confidence interval for the mean number of mobile phones per household among households with at least five members that rent their apartments.

(e) Compare your point estimate and confidence interval with the corresponding quantities obtained when the 11 responses are analysed as if they were a complete SRS. (Hint: When a subpopulation size is unknown, the finite population correction for statistics based on the entire sample may be used.)
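
The approximate Bayesian bootstrap and the repeated imputation inference named in (d) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a model solution: the function names (abb_impute, combine, abb_analysis), the fpc and seed parameters, and the use of s²/n as the variance of each completed-data mean are choices made here for illustration and do not come from the exam. The combining step follows the usual repeated-imputation rules (within-imputation plus inflated between-imputation variance).

```python
import random
import statistics


def abb_impute(observed, n_missing, rng):
    """Return one completed data set using the approximate Bayesian bootstrap."""
    # Step 1: resample the respondents with replacement to form a donor pool.
    donors = [rng.choice(observed) for _ in observed]
    # Step 2: draw the missing values with replacement from the donor pool.
    imputed = [rng.choice(donors) for _ in range(n_missing)]
    return list(observed) + imputed


def combine(estimates, variances):
    """Combine m completed-data estimates and their variances."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # overall point estimate
    w = statistics.mean(variances)          # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance
    total = w + (1 + 1 / m) * b             # total variance
    if b > 0:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    else:
        df = float("inf")                   # all imputations gave the same mean
    return q_bar, total, df


def abb_analysis(observed, n_missing, m=5, fpc=1.0, seed=0):
    """Run m ABB imputations; fpc is a hypothetical knob for a finite population correction."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = abb_impute(observed, n_missing, rng)
        estimates.append(statistics.mean(completed))
        variances.append(fpc * statistics.variance(completed) / len(completed))
    return combine(estimates, variances)


# The 11 responses listed in the task; 17 values are missing.
responses = [3, 3, 6, 5, 3, 2, 4, 4, 2, 5, 4]
q_bar, total_var, df = abb_analysis(responses, n_missing=17)
print(q_bar, total_var, df)
```

The printed numbers depend on the seed, so they are not reproduced here. A confidence interval would be formed as the point estimate plus or minus a t quantile (with the returned degrees of freedom) times the square root of the total variance; the quantile is left to a table or a statistics package.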

Task 4

An SRS of 700 individuals was taken from a newly updated register of members of a political party. The register contains 381 960 members and has a lot of background information about them, but the objective of this study was to investigate other properties of this population.

One question put to the respondents was how much money they spent on their holiday last year. In the register there is a record for each member of the taxable yearly income, although that figure can be as much as 5 years old, and the intention is to use that information in the estimation. The following statistics were computed for the responses from the sampled members (everyone responded):

$\sum (\text{money spent on holiday}) = 3655.89$ (× SEK 1000)
$\sum (\text{money spent on holiday})^2 = 23881.3$ (× (SEK 1000)$^2$)
$\sum (\text{yearly taxable income}) = 248721$ (× SEK 1000)
$\sum (\text{yearly taxable income})^2 = 92032277$ (× (SEK 1000)$^2$)
$\sum (\text{money spent on holiday}) \times (\text{yearly taxable income}) = 1400947$ (× (SEK 1000)$^2$)

Further, the summarized yearly taxable income for all members in the register is about 135730000 (× SEK 1000).

(a) Calculate a ratio estimate of the mean amount of money spent on holiday last year among the members, with yearly taxable income as the auxiliary variable. Calculate also a 95% confidence interval for this mean amount.

(b) Calculate a regression estimate of the mean amount of money spent on holiday last year among the members, with yearly taxable income as the auxiliary variable. Calculate also a 95% confidence interval.

(c) Compare the two point estimates in a) and b) and also compare each of them with the ordinary sample mean with respect to their estimated variances. Which of them would you prefer and why?

Task 5

Consider the following citation:

“...If we compare our voluntary web survey panel with the panel randomly sampled by Acme Inc we find the following benefits of our panel:

• Our panel has 43000 members compared to the 2300 randomly selected members of the Acme telephone survey panel.
This will give us much higher certainty in our estimates.

• The cost for conducting a survey among our 43000 respondents is approximately a tenth of the cost for conducting the analogous survey among the 2300 respondents of the Acme panel.

• Due to the almost 20 times larger number of respondents in our panel, the coverage factor for the population is superior to that of Acme.

• The web survey technique we use will remove almost all of the non-sampling errors that are still present in Acme surveys, e.g. there will be no interviewer effect, no respondent effect, no instrument effect and no partial non-response (as the questionnaire must be completed to be submittable).

• The potential problem with undercoverage can be resolved by efficient use of weights due to the large amount of background information among the members of the panel.
...”

Obviously there are some clear stupidities in the arguments above, but make a thorough discussion of these points and explain in more detail why some statements are erroneous. In addition, point out those parts of the arguments you believe are correct and explain why.
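
The first bullet of the citation equates panel size with certainty. A small simulation can serve as a starting point for the discussion. Everything in it is hypothetical: the population, its normal distribution and the joining probabilities are assumptions chosen for illustration, and only the sizes 43000 and 2300 echo the citation. The sketch compares the mean of a large self-selected panel with the mean of a much smaller simple random sample from the same population.

```python
import random
import statistics


def simulate(seed=1, pop_size=200_000, srs_size=2300):
    """Compare a self-selected panel with an SRS from a hypothetical population."""
    rng = random.Random(seed)
    # Hypothetical population: study variable y ~ Normal(50, 10).
    population = [rng.gauss(50, 10) for _ in range(pop_size)]
    pop_mean = statistics.fmean(population)
    # Assumed self-selection: units with large y are more likely to volunteer.
    # The probabilities 0.35 and 0.08 give an expected panel size near 43000.
    panel = [y for y in population if rng.random() < (0.35 if y > 50 else 0.08)]
    # Probability sample: simple random sampling without replacement.
    srs = rng.sample(population, srs_size)
    return pop_mean, len(panel), statistics.fmean(panel), statistics.fmean(srs)


pop_mean, panel_size, panel_mean, srs_mean = simulate()
print(f"population mean: {pop_mean:.2f}")
print(f"voluntary panel (n={panel_size}) mean: {panel_mean:.2f}")
print(f"SRS (n=2300) mean: {srs_mean:.2f}")
```

Under these assumptions the panel mean stays far from the population mean however many members join, while the SRS mean lands close to it. How strongly this carries over to a real panel depends on how participation relates to the study variable, which the simulation simply assumes.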