LINKÖPINGS UNIVERSITET
Institutionen för datavetenskap
Statistik, ANd

732A28 SURVEY SAMPLING, 6 CDTS
Master’s program in Statistics, Data analysis and Knowledge discovery
Fall semester 2009

Home exam

Home exam for Student 1

Rules: The home exam is given under a “gentlemen’s agreement”. This implies that cooperation with other persons is not allowed. Solutions to the tasks should be submitted no later than the deadline set out. Late submissions will under normal circumstances be discarded. Except for cooperation with other persons, there is a free choice of means (textbooks, calculators, computer software, etc.).

Deadline for submission: November 4, 2009, at 12.00 PM (24.00)

Submission can be in the form of a hand-written manuscript, a printed document or an attachment to an e-mail. The e-mail address for attachments is Anders.Nordgaard@liu.se

If anything is unclear in the tasks, questions should be directed by e-mail to Anders.Nordgaard@liu.se

Grades: The grading on this home exam uses the whole ECTS scale, i.e. the grades may range from F to A. Grades are given on the basis of the overall impression of the solutions to the tasks; no specific points or intervals of points are used.

Good luck!

Task 1

Assume we have taken an SRS of 300 individuals from a population consisting of 11232 individuals. Various variables are recorded for the 300 selected individuals, and among these is the total amount of money saved in bank accounts the last month (i.e. previous savings and newly saved amounts as a total). The sum for the whole sample is SEK 15.1 million and the standard deviation is SEK 55000.

(a) Estimate the total amount of savings, t, in the whole population and give a 99% confidence interval for this quantity.

(b) Based on the information obtained from the sample, at least how many individuals should be sampled if the width of a confidence interval corresponding to that requested in a) should not be larger than SEK 160 million?
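As an illustration of the arithmetic behind (a) and (b), the following is a minimal sketch, not a model solution. It assumes the usual SRS estimator of a total with finite population correction and a normal-approximation interval; the variable names are made up for this example.

```python
import math
from statistics import NormalDist

# Figures taken from the task text.
N = 11232            # population size
n = 300              # sample size
sample_sum = 15.1e6  # sum of savings in the sample (SEK)
s = 55_000.0         # sample standard deviation (SEK)

# Assumption: normal-approximation quantile for a two-sided 99% interval.
z = NormalDist().inv_cdf(0.995)

# (a) Estimated total and its standard error under SRS without replacement.
ybar = sample_sum / n
t_hat = N * ybar
se_t = N * s / math.sqrt(n) * math.sqrt(1 - n / N)
ci = (t_hat - z * se_t, t_hat + z * se_t)

# (b) Smallest n whose interval width is at most SEK 160 million,
# using s from the sample as a planning value for the population SD.
half_width = 160e6 / 2
n0 = (z * N * s / half_width) ** 2        # sample size ignoring the fpc
n_required = math.ceil(n0 / (1 + n0 / N))  # adjusted for the fpc

print(f"t_hat = {t_hat:.0f}, 99% CI = ({ci[0]:.0f}, {ci[1]:.0f})")
print(f"required sample size: {n_required}")
```

The sample size step follows from solving z·N·s·sqrt((1 − n/N)/n) ≤ e for n, where e is the half-width of the interval.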

(c) Now assume that we identify 5 of the sampled individuals as not being part of the surveyed population. The recorded savings for these 5 individuals are (SEK) 34000, 10900, 22800, 700 and 63000, respectively. Calculate a new estimate of t and a 99% confidence interval. (Hint: Treat the group of sampled individuals actually belonging to the surveyed population as a domain of study.)

(d) Without knowing anything in particular about the population, we might presume that the variable savings has a small number of very high amounts among the individuals. Discuss whether this could be a problem for the conclusions drawn from the point estimates and confidence intervals above.

Someone suggests that the sampling design used is not appropriate for assessing the savings of the individuals in the population. A parallel study showed that in a random sample of 195 individuals from those in the population having an employment, the sum of savings (as defined above) was SEK 11.55 million and the standard deviation was SEK 33000. In a random sample of 105 individuals from the unemployed individuals in the population, the corresponding sum of savings was SEK 2.62 million and the standard deviation was SEK 36000. Approximately 7.1% of the individuals in the population are unemployed.

(e) Use these “new” data to calculate a point estimate of t and a 99% confidence interval. Compare the results with those of a) (and/or c)) and explain.

(f) In both studies a total of 300 individuals were sampled. If we were to consider the second approach, i.e. splitting the sample between employed and unemployed, what would an optimal allocation of the 300 individuals over the two groups be?

Task 2

Let us assume that in a large city there are 55 schools with pupils aged 13-16 years. In each school the pupils are divided into classes with respect to the corresponding study year and also with respect to which school they attended prior to entering one of these schools.
The latter classification variable is more difficult to sort out and will at first be ignored. Regarding the former, we concentrate on pupils of study year 9, the highest study year in these schools. The total number of pupils of study year 9 in the 55 schools is 6140. These pupils will leave the school after the current year and are about to enter high school (or do something else if possible). We would like to investigate the proportion of pupils of study year 9 that have decided what to do after this year, i.e. which study program in high school they intend to apply for or, if they have decided not to go to high school, what they will do instead.

An SRS of 10 schools was taken from the 55 schools. In each school drawn, an SRS of 10% (number rounded downwards) of the pupils of study year 9 was taken, and in each of these samples the number of pupils that had decided what to do after this year was recorded. The result was as follows:

School                                         1    2    3    4    5    6    7    8    9   10
Number of pupils in study year 9             113   81  149  142   90  116   85  120  116   88
Number of pupils in sample that have
decided what to do after this year             7    4    8   13    7    7    7    5    7    6

(a) Calculate an unbiased point estimate of the proportion, p, of pupils of study year 9 in all 55 schools that have decided what to do after this year. Calculate also a 95% confidence interval for p.

(b) As the number of pupils in study year 9 varies a lot between the schools, the unbiased estimate may be too uncertain. Calculate a ratio estimate of p with a 95% confidence interval and compare with your results from a).

(c) Approximately how many pupils of study year 9 in the 55 schools does each pupil in the sample represent?

(d) Now assume that the schools in the sample have not been taken as an SRS but have been drawn with replacement and with probabilities proportional to their numbers of pupils of study year 9. Calculate again a point estimate and a 95% confidence interval for p.

(e) Consider the first 3 schools in the table above and assume that these constitute a sample on their own. Further, assume that these 3 schools were drawn without replacement and with probabilities proportional to their numbers of pupils of study year 9. The inclusion probabilities for these 3 schools are assumed to be as in the table below (joint inclusion probabilities for schools i and j in the body of the table, single inclusion probabilities in the margins).

                      School i
School j        1         2         3         πj
1               -       0.0027    0.0043    0.056
2             0.0027      -       0.0032    0.040
3             0.0043    0.0032      -       0.074
πi            0.056     0.040     0.074

Calculate the Horvitz-Thompson estimate of p. Is an estimate of the standard error reliable? Justify your answer.

Task 3

A stratified sampling design was applied to a population of households, where stratum 1 consists of households that rent their apartment and stratum 2 consists of households that own either their apartments or the houses where they reside. The sizes of the two strata in the population were 3300 and 2700 respectively, and a proportionally allocated sample of 600 households was taken from the population. A mail survey was used in this study, and after the planned number of reminders there were 147 completed questionnaires returned from stratum 1 and 112 completed questionnaires returned from stratum 2.

The questionnaire used for the study had a question about the number of mobile phones among the members of the household. The total number of mobile phones among the 147 households from stratum 1 was 301 and the standard deviation was 1.2. The total number of mobile phones among the 112 households from stratum 2 was 302 and the standard deviation was 2.0.

(a) Ignoring the non-respondents, estimate the mean number of mobile phones per household in the whole population and calculate a 95% confidence interval for this number.

To adjust for the potential bias in the estimated mean in a), a follow-up was done among the non-respondents in both strata by taking SRSs from the sets of non-responding households in the two samples and contacting these by telephone.
All households contacted responded to this particular question, and details are found in the table below.

Sample    Number of contacted    Total number of mobile phones    Standard deviation
1                  30                          77                        1.35
2                  25                          67                        1.9

(b) Use these new data to compute a new estimate of the mean number of mobile phones per household in the population and a 95% confidence interval for this number. Compare the results with your results in a).

Now, suppose that the follow-ups were not done, but another adjustment is to be made. The households in the original sample were of different sizes (i.e. had different numbers of household members) and these sizes were known prior to sampling. The following table gives the details:

Household size    Stratum    Sample size    Respondents    Total number of mobile phones
1                    1            90             36                     17
1                    2            10             10                      6
2                    1            85             40                     67
2                    2           102             40                     74
3                    1            58             31                     68
3                    2            33             17                     43
4                    1            69             29                    108
4                    2            88             37                    143
5 or more            1            28             11                     41
5 or more            2            37              8                     36

(c) Calculate a weighting-class adjusted point estimate of the mean number of mobile phones per household. Compare this estimate to your results in a) and b). What assumption(s) do you need in order to trust the weighting-class adjusted estimate?

Let us look in particular at the subgroup of households with 3 members in stratum 2. There were originally 33 households in the sample, but questionnaires were received from only 17. If we were only interested in the current question about the number of mobile phones, we could try to impute the missing 16 values. Below are the answers from the 17 responding households:

2, 3, 2, 2, 5, 2, 2, 2, 3, 2, 2, 3, 1, 3, 3, 3, 3

(d) Use multiple imputation with the approximate Bayesian bootstrap (ABB) to impute the missing values. Construct 5 complete data sets and use repeated imputation inference to calculate a point estimate and a 95% confidence interval for the mean number of mobile phones per household among households with 3 members belonging to stratum 2.
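The mechanics of ABB and of repeated imputation inference can be sketched as follows. This is a hedged illustration, not a model solution: the function names `abb_impute` and `rubin_combine` and the seed are invented for this example, each completed data set is treated as an SRS of 33 households, and the finite population correction is ignored for simplicity.

```python
import math
import random
import statistics

# The 17 observed answers from the task text; 16 values are missing.
observed = [2, 3, 2, 2, 5, 2, 2, 2, 3, 2, 2, 3, 1, 3, 3, 3, 3]
n_missing = 16
m = 5  # number of completed data sets

def abb_impute(observed, n_missing, rng):
    """One ABB imputation: resample a donor pool, then impute from it."""
    # Step 1: draw len(observed) values with replacement from the respondents.
    donors = rng.choices(observed, k=len(observed))
    # Step 2: draw the missing values with replacement from that donor pool.
    imputed = rng.choices(donors, k=n_missing)
    return observed + imputed

def rubin_combine(estimates, variances):
    """Combine m completed-data results (repeated imputation inference)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # combined point estimate
    w = statistics.fmean(variances)       # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance
    total = w + (1 + 1 / m) * b           # total variance
    if b == 0:
        df = math.inf                     # no between-imputation variation
    else:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df

rng = random.Random(2009)  # arbitrary seed, only for reproducibility
completed = [abb_impute(observed, n_missing, rng) for _ in range(m)]
estimates = [statistics.fmean(d) for d in completed]
# Assumption: variance of the mean as for an SRS, fpc ignored.
variances = [statistics.variance(d) / len(d) for d in completed]

q_bar, total_var, df = rubin_combine(estimates, variances)
print(f"estimate = {q_bar:.3f}, SE = {math.sqrt(total_var):.3f}, df = {df:.1f}")
```

The 95% interval is then the estimate plus or minus the 0.975 quantile of a t distribution with `df` degrees of freedom times the standard error; that quantile is left to a table or a statistics package since the Python standard library does not provide it.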

(e) Compare your point estimate and confidence interval with the corresponding quantities obtained when the 17 responses are analysed as if they were a complete SRS. (Hint: When a subpopulation size is unknown, the finite population correction for statistics based on the entire sample may be used.)

Task 4

An SRS of 400 individuals was taken from a newly updated register of members of a political party. The register contains 365 080 members and has a lot of background information about them, but the objective of this study was to investigate other properties of this population. One question put to the respondents was how much money they spent on their holiday last year. In the register there is a record for each member of the taxable yearly income, although that figure can be as much as 5 years old, and the intention is to use that information in the estimation.

The following statistics were computed for the responses from the sampled members (everyone responded):

Σ (money spent on holiday) = 2054.44 (× SEK 1000)
Σ (money spent on holiday)² = 13280.3 (× (SEK 1000)²)
Σ (yearly taxable income) = 141283 (× SEK 1000)
Σ (yearly taxable income)² = 51805731 (× (SEK 1000)²)
Σ (money spent on holiday) × (yearly taxable income) = 781083 (× (SEK 1000)²)

Further, the total yearly taxable income for all members in the register is about 128910000 (× SEK 1000).

(a) Calculate a ratio estimate of the mean amount of money spent on holiday last year among the members, with yearly taxable income as the auxiliary variable. Calculate also a 95% confidence interval for this mean amount.

(b) Calculate a regression estimate of the mean amount of money spent on holiday last year among the members, with yearly taxable income as the auxiliary variable. Calculate also a 95% confidence interval.

(c) Compare the two point estimates in a) and b), and also compare each of them with the ordinary sample mean with respect to their estimated variances. Which of them would you prefer and why?
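
The summary statistics in Task 4 are sufficient to evaluate the ratio and regression estimators without the raw data. The sketch below shows one way to do this; it is an illustration under stated assumptions rather than a model solution. It assumes the common linearization variance formulas, a normal-approximation interval, and the divisors n − 1 (ratio) and n − 2 (regression) for the residual variance, which vary between textbooks. All variable names are made up for this example.

```python
import math
from statistics import NormalDist

# Figures from the task text (money in units of SEK 1000).
N, n = 365_080, 400
sum_y, sum_y2 = 2054.44, 13280.3       # holiday spending
sum_x, sum_x2 = 141283.0, 51805731.0   # yearly taxable income
sum_xy = 781083.0
t_x = 128_910_000.0                    # register total of taxable income

z = NormalDist().inv_cdf(0.975)        # assumption: normal approximation
fpc = 1 - n / N
xbar, ybar = sum_x / n, sum_y / n
xbar_U = t_x / N                       # population mean of the auxiliary

# Ordinary sample mean, for comparison.
s2_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)
var_srs = fpc * s2_y / n

# Ratio estimator: residuals e_i = y_i - B * x_i.
B = sum_y / sum_x
ybar_ratio = B * xbar_U
sse_ratio = sum_y2 - 2 * B * sum_xy + B ** 2 * sum_x2
var_ratio = fpc * (xbar_U / xbar) ** 2 * (sse_ratio / (n - 1)) / n

# Regression estimator: residuals from the least-squares line.
Sxx = sum_x2 - sum_x ** 2 / n
Sxy = sum_xy - sum_x * sum_y / n
Syy = sum_y2 - sum_y ** 2 / n
B1 = Sxy / Sxx
ybar_reg = ybar + B1 * (xbar_U - xbar)
sse_reg = Syy - B1 * Sxy
var_reg = fpc * (sse_reg / (n - 2)) / n   # divisor n - 2 is a convention

for name, est, var in [("sample mean", ybar, var_srs),
                       ("ratio", ybar_ratio, var_ratio),
                       ("regression", ybar_reg, var_reg)]:
    half = z * math.sqrt(var)
    print(f"{name}: {est:.4f}  95% CI ({est - half:.4f}, {est + half:.4f})")
```

Expanding the residual sums of squares in terms of Σy², Σxy and Σx² is what makes the raw data unnecessary here.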

Task 5

Consider the following quotation:

“...If we compare our voluntary web survey panel with the panel randomly sampled by Acme Inc we find the following benefits of our panel:

• Our panel has 43000 members compared to the 2300 randomly selected members of the Acme telephone survey panel. This will give us much higher certainty in our estimates.

• The cost for conducting a survey among our 43000 respondents is approximately a tenth of the cost for conducting the analogous survey among the 2300 respondents of the Acme panel.

• Due to the almost 20 times larger number of respondents in our panel, the coverage factor for the population is superior to that of Acme.

• The web survey technique we use will remove almost all of the non-sampling errors that are still present in Acme surveys, e.g. there will be no interviewer effect, no respondent effect, no instrument effect and no partial non-response (as the questionnaire must be completed to be submittable).

• The potential problem with undercoverage can be resolved by efficient use of weights due to the large amount of background information among the members of the panel.
...”

Obviously there are some clear stupidities in the arguments above, but discuss these points thoroughly and explain in more detail why some statements are erroneous. In addition, point out those parts of the arguments you believe are correct and explain why.