LINKÖPINGS UNIVERSITET
Institutionen för datavetenskap
Statistik, ANd
732A28 SURVEY SAMPLING, 6 CDTS
Master’s program in Statistics, Data analysis
and Knowledge discovery
Fall semester 2009
Home exam
Home exam for Student 2
Rules:
The home exam is given under “gentlemen’s agreement”. This implies that cooperation with other persons
is not allowed.
Solutions to the tasks should be submitted no later than the deadline set out. Late submissions will under
normal circumstances be discarded.
Except for cooperation with other persons, there is a free choice of means (textbooks, calculators, computer
software, etc.).
Deadline for submission: November 4, 2009, at 24.00 (midnight)
Submission can be in the form of a hand-written manuscript, a printed document or an attachment
to an e-mail. The e-mail address for attachments is Anders.Nordgaard@liu.se
If anything is unclear in the tasks, questions should be directed by e-mail to Anders.Nordgaard@liu.se
Grades: The grading on this home exam uses the whole ECTS scale, i.e. the grades may range from F to
A. Grades are given on the basis of the overall impression of the solutions to the tasks; no specific points
or intervals of points are used.
Good luck!
Task 1
Assume we have taken an SRS of 200 individuals from a population consisting of 55560
individuals. Various variables are recorded for the 200 selected individuals and among these
is the total amount of hours watching television last week. The sum for the whole sample is
5540 hours and the standard deviation is 15 hours.
(a) Estimate the total number of hours, t, spent watching television last week in the
whole population and give a 99% confidence interval for this quantity.
(b) Based on the information obtained from the sample, how many individuals should at
least be sampled, if the standard error of a point estimate of t should be at most 280000
hours?
(c) Now assume that we identify 10 of the sampled individuals not to be part of the
surveyed population. The mean number of hours watching television for these 10
persons is 28.5 and the standard deviation is 10.9 hours. Calculate a new estimate of t
and a 99% confidence interval. (Hint: Treat the group of sampled individuals actually
belonging to the surveyed population as a domain of study.)
(d) Now assume that the sampling frame used for this survey was far from recently
updated and, besides the non-eligible respondents discovered in c), there are at least 200
individuals belonging to the population that do not exist in the sampling frame. What
consequence(s) would there be for the estimation of t? Can any kind of adjustment be
done? Explain.
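The calculations asked for in (a) and (b) can be sketched as follows. This is only a minimal sketch, assuming the usual SRS estimator of a total with finite population correction and a normal-approximation interval; the function names `srs_total` and `srs_size_for_se` are illustrative and not part of the exam.

```python
from math import ceil, sqrt
from statistics import NormalDist


def srs_total(N, n, sample_sum, s, level=0.99):
    """Estimate a population total from an SRS, with a normal-approximation CI."""
    y_bar = sample_sum / n
    t_hat = N * y_bar
    fpc = 1 - n / N  # finite population correction
    se = N * sqrt(fpc * s**2 / n)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return t_hat, se, (t_hat - z * se, t_hat + z * se)


def srs_size_for_se(N, s, max_se):
    """Smallest n for which N * sqrt((1 - n/N) * s^2 / n) <= max_se."""
    n0 = (N * s / max_se) ** 2  # required size when the fpc is ignored
    return ceil(n0 / (1 + n0 / N))


# Figures from Task 1: N = 55560, n = 200, sample sum 5540 hours, s = 15 hours.
t_hat, se, ci = srs_total(N=55560, n=200, sample_sum=5540, s=15)
n_min = srs_size_for_se(N=55560, s=15, max_se=280000)
print(f"t_hat = {t_hat:.0f}, SE = {se:.0f}, 99% CI = ({ci[0]:.0f}, {ci[1]:.0f})")
print(f"minimum n = {n_min}")
```

The sample size formula follows from solving N^2 s^2 (1/n - 1/N) <= max_se^2 for n, treating the sample standard deviation as if it were the population value.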
Someone suggests that the sampling design used is not appropriate for assessing the television viewing of
the individuals in the population. A parallel study showed that in a random sample of
150 individuals from those in the population who are employed, the total number of
hours watching television was 3350 and the standard deviation was 10.3 hours. In a random
sample of 50 individuals from unemployed individuals in the population the corresponding
total amount of hours was 1800 and the standard deviation was 9.5 hours. Approximately
7.1% of the individuals in the population are unemployed.
(e) Use these “new” data to calculate a point estimate of t and a 99% confidence interval.
Compare the results with those of a) (and/or c)) and explain.
(f) In both studies there were in total 200 individuals sampled. If we would consider the
second approach, i.e. split the sampling between employed and unemployed, what
would an optimal allocation of 200 individuals over the two groups be?
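A stratified estimate as in (e) and an allocation as in (f) could be sketched like this. The sketch assumes that the population size 55560 from the first study still applies, that the stratum sizes are obtained from the approximate 7.1% share, and that "optimal" means Neyman allocation with equal cost per sampled individual; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def stratified_total(N_h, n_h, sums, s_h, level=0.99):
    """Stratified estimate of a total with a normal-approximation CI."""
    t_hat = sum(N * tot / n for N, n, tot in zip(N_h, n_h, sums))
    var = sum(N**2 * (1 - n / N) * s**2 / n for N, n, s in zip(N_h, n_h, s_h))
    se = sqrt(var)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return t_hat, se, (t_hat - z * se, t_hat + z * se)


def neyman_allocation(N_h, s_h, n_total):
    """Allocate n_total proportionally to N_h * S_h (not rounded)."""
    weights = [N * s for N, s in zip(N_h, s_h)]
    return [n_total * w / sum(weights) for w in weights]


N = 55560                          # assumed: same population as in the first study
N_h = [0.929 * N, 0.071 * N]       # employed, unemployed (approximate shares)
t_strat, se_strat, ci_strat = stratified_total(
    N_h, n_h=[150, 50], sums=[3350, 1800], s_h=[10.3, 9.5]
)
alloc = neyman_allocation(N_h, s_h=[10.3, 9.5], n_total=200)
print(f"t_hat = {t_strat:.0f}, SE = {se_strat:.0f}")
print("allocation (employed, unemployed):", [round(a, 1) for a in alloc])
```

The allocation is returned unrounded; in practice the sizes are rounded to integers that still add up to 200.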
Task 2
Let us assume that in a large city there are 55 schools with pupils in the ages 13-16 years
old. In each school the pupils are divided into classes with respect to the corresponding study
year and also with respect to which school they belonged to prior to entering one of
these schools. The latter classification variable is more difficult to sort out and will at first be
ignored. Regarding the former we concentrate on pupils of study year 8. The total number
of pupils of study year 8 in the 55 schools is 5954 and the total number of classes for all
three study years is 631. We would like to investigate the proportion of pupils of study year
8 that have been drinking alcoholic beverages during the last month.
An SRS of 10 schools was taken from the 55 schools. In each school drawn an SRS of 10%
(number rounded downwards) of the pupils of study year 8 was taken and in each of these
samples the number of pupils that had been drinking alcoholic beverages during the last
month was recorded (obviously without any problems!). The result was as follows:
School   Total number of classes   Number of pupils    Number of pupils in sample
         (all study years)         in study year 8     that have been drinking
                                                       alcoholic beverages
1        11                        92                  4
2        9                         83                  2
3        13                        149                 5
4        15                        140                 5
5        12                        88                  3
6        12                        119                 4
7        9                         86                  1
8        12                        118                 4
9        10                        120                 3
10       9                         80                  0
(a) Calculate an unbiased point estimate of the proportion, p, of pupils of study year 8
in all 55 schools, that have been drinking alcoholic beverages during the last month.
Calculate also a 95% confidence interval for p.
(b) As the number of pupils in study year 8 varies a lot between the schools the unbiased
estimate may be too uncertain. Calculate a ratio estimate of p with a 95% confidence
interval and compare with your results from a).
(c) Is the sample self-weighting? Motivate your answer.
(d) Now assume that the schools in the sample have not been taken as an SRS but have been
drawn with replacement and with probabilities proportional to the number of classes.
Calculate again a point estimate and a 95% confidence interval for p.
(e) Consider the first 3 schools in the table above and assume that these constitute a sample
on their own. Further, assume that these 3 schools were drawn without replacement
and with probabilities proportional to number of classes. The inclusion probabilities
for these 3 schools are assumed as in the table below.
            School j
School i    1         2         3         πi
1           -         0.0023    0.0034    0.052
2           0.0023    -         0.0029    0.043
3           0.0034    0.0029    -         0.062
πj          0.052     0.043     0.062
Calculate the Horvitz-Thompson estimate of p. Is an estimate of the standard error
reliable? Motivate your answer.
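The point estimates in (a), (b) and (e) can be sketched as below (the confidence intervals are left out). The sketch assumes that "10% rounded downwards" means a second-stage sample of floor(M_i / 10) pupils in a school with M_i pupils in study year 8, and that each school's number of drinkers is estimated by M_i times the sample proportion; the variable names are illustrative.

```python
N_SCHOOLS = 55   # schools in the city
M0 = 5954        # pupils of study year 8 in all 55 schools

pupils = [92, 83, 149, 140, 88, 119, 86, 118, 120, 80]   # M_i
drinkers = [4, 2, 5, 5, 3, 4, 1, 4, 3, 0]                # drinkers in each school sample

# Assumed second-stage sample sizes: 10% of M_i, rounded downwards.
sampled = [M // 10 for M in pupils]

# Estimated number of drinkers in each sampled school.
school_totals = [M * a / m for M, a, m in zip(pupils, drinkers, sampled)]

# (a) Unbiased estimate: expand the sampled schools to all 55, divide by M0.
p_unbiased = (N_SCHOOLS / len(pupils)) * sum(school_totals) / M0

# (b) Ratio estimate: divide by the pupils in the sampled schools instead.
p_ratio = sum(school_totals) / sum(pupils)

# (e) Horvitz-Thompson estimate from schools 1-3 and their inclusion probabilities.
pi = [0.052, 0.043, 0.062]
p_ht = sum(t / p for t, p in zip(school_totals[:3], pi)) / M0

print(f"unbiased: {p_unbiased:.4f}, ratio: {p_ratio:.4f}, HT: {p_ht:.4f}")
```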
Task 3
A study was made in a population of 6000 households of which an SRS of 10% of the
households were taken. A mail survey was used in the study and after the planned numbers
of reminders there were 259 completed questionnaires returned.
The questionnaire used for the study had a question about the number of mobile phones
among the members of the household. The total number of mobile phones among the 259
households was 603 and the standard deviation was 1.6.
(a) Ignoring the non-respondents, estimate the mean number of mobile phones per household in the whole population and calculate a 95% confidence interval for this number.
To adjust for the potential bias in the estimated mean in a) a follow-up was done by taking
an SRS of 60 households from the non-respondents, and contacting these by telephone. All
households contacted gave a response to the question in particular and the total number of
mobile phones was 156 and the standard deviation was 1.4.
(b) Use these new data to compute a new estimate of the mean number of mobile phones
per household in the population and a 95% confidence interval for this number. Compare the results with your results in a).
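The point estimate in (b) can be sketched as a weighted combination of the respondent mean and the follow-up mean. This is a minimal sketch, assuming the 60 followed-up households represent all 341 non-respondents in the original sample of 600; the variance needed for the confidence interval is not covered here.

```python
n_sample = 600                       # 10% of 6000 households
n_resp, sum_resp = 259, 603          # mail respondents and their total number of phones
n_nonresp = n_sample - n_resp        # households that did not return the questionnaire
n_follow, sum_follow = 60, 156       # telephone follow-up among the non-respondents

mean_resp = sum_resp / n_resp
mean_follow = sum_follow / n_follow

# Weight each mean by the share of the original sample it represents.
mean_adjusted = (n_resp / n_sample) * mean_resp + (n_nonresp / n_sample) * mean_follow
print(f"respondents: {mean_resp:.3f}, follow-up: {mean_follow:.3f}, adjusted: {mean_adjusted:.3f}")
```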
Now, suppose that the follow-ups were not done, but another adjustment is to be done. The
households in the original sample could possibly be classified according to two variables:
ownership and size. In the population 3200 of the households reside in rented apartments
while the rest own either their apartments or the houses where they reside. Further, 1030 of
the households have one person only, 1850 have two persons, 980 have three persons, 1530
have four persons and the rest have five persons or more. In the following table the sample
sizes and the respondents are divided according to these two classification variables:
Household size   Ownership   Sample size   Respondents   Total number of mobile phones
1                rent        90            36            17
1                own         10            10            9
2                rent        85            40            67
2                own         102           40            71
3                rent        58            31            70
3                own         33            17            43
4                rent        69            29            106
4                own         88            37            147
5 or more        rent        28            11            41
5 or more        own         37            8             32
(c) Use the table to derive a raking adjusted point estimate of the mean number of mobile
phones per household in the population. Start with the initial weight 6000/259 for each
respondent. Compare this estimate to your results in a) and b). What assumption(s)
do you need to trust the raking adjusted estimate?
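The raking in (c) can be sketched as iterative proportional fitting of the weighted respondent counts to the two sets of population margins. This is only a sketch: the function name `rake`, the stopping rule and the tolerance are choices made here, and the size margin "five persons or more" is taken as 6000 minus the other four sizes.

```python
def rake(table, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Scale rows and columns in turn until both sets of margins are matched."""
    table = [row[:] for row in table]
    for _ in range(max_iter):
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [v * target / row_sum for v in table[i]]
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        if all(abs(sum(row) - t) < tol for row, t in zip(table, row_targets)):
            break
    return table


# Rows: household size 1, 2, 3, 4, 5 or more. Columns: rent, own.
respondents = [[36, 10], [40, 40], [31, 17], [29, 37], [11, 8]]
phones = [[17, 9], [67, 71], [70, 43], [106, 147], [41, 32]]
row_targets = [1030, 1850, 980, 1530, 6000 - (1030 + 1850 + 980 + 1530)]
col_targets = [3200, 6000 - 3200]

# Start from the initial weight 6000/259 for every respondent.
start = [[r * 6000 / 259 for r in row] for row in respondents]
raked = rake(start, row_targets, col_targets)

# Final weight per respondent in a cell is the raked count over the respondent count.
total_phones = sum(
    raked[i][j] / respondents[i][j] * phones[i][j]
    for i in range(5) for j in range(2)
)
mean_phones = total_phones / 6000
print(f"raking adjusted mean: {mean_phones:.3f}")
```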
Let us look in particular at the subgroup of households with at least five members that rent
their apartments. There were originally 28 households in the sample but questionnaires were
apparently received from only 11. If we were only interested in the current question about
the number of mobile phones we could try to impute the missing 17 values. Below are the
answers from the 11 responding households:
3, 3, 6, 5, 3, 2, 4, 4, 2, 5, 4
(d) Use multiple imputation with the approximate Bayesian bootstrap (ABB) to impute
the missing values. Construct 5 complete data sets and use repeated imputation
inference to calculate a point estimate and a 95% confidence interval for the mean number
of mobile phones per household among households with at least five members that rent
their apartments.
(e) Compare your point estimate and confidence interval with the corresponding quantities
obtained when the 11 responses are analysed as if they were a complete SRS. (Hint:
When a subpopulation size is unknown, the finite population correction for statistics
based on the entire sample may be used.)
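The ABB procedure in (d) can be sketched as follows. This is a minimal sketch under stated assumptions: the seed is arbitrary, the within-imputation variance is taken as s²/n for each completed data set without a finite population correction, and the helper names `abb_impute` and `rubin_combine` are illustrative. The standard library has no t quantile, so the sketch stops at the combined estimate, its total variance and the degrees of freedom; the interval is then the estimate plus or minus the 0.975 t quantile (with those degrees of freedom) times the square root of the total variance.

```python
import random
from statistics import mean, variance


def abb_impute(observed, n_missing, rng):
    """One ABB draw: bootstrap the respondents, then impute from that bootstrap sample."""
    donors = rng.choices(observed, k=len(observed))
    return rng.choices(donors, k=n_missing)


def rubin_combine(estimates, within_vars):
    """Combine m completed-data estimates with the repeated-imputation rules."""
    m = len(estimates)
    q_bar = mean(estimates)
    u_bar = mean(within_vars)
    b = variance(estimates)  # between-imputation variance
    total = u_bar + (1 + 1 / m) * b
    if b == 0:
        df = float("inf")
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


observed = [3, 3, 6, 5, 3, 2, 4, 4, 2, 5, 4]
rng = random.Random(2009)  # arbitrary seed, only for reproducibility

estimates, within_vars = [], []
for _ in range(5):
    completed = observed + abb_impute(observed, 17, rng)
    estimates.append(mean(completed))
    within_vars.append(variance(completed) / len(completed))

q_bar, total_var, df = rubin_combine(estimates, within_vars)
print(f"estimate = {q_bar:.3f}, total variance = {total_var:.4f}, df = {df:.1f}")
```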
Task 4
An SRS of 700 individuals was taken from a newly updated register of members of a political
party. The register contains 381 960 members and has a lot of background information about
them, but the objective of this study was to investigate other properties of this population.
One question put to the respondents was how much money they spent on their holiday last
year. In the register there is a record for each member about the taxable yearly income,
although that figure can be as old as 5 years, and the intention is to use that information in
the estimation. The following statistics were computed for the responses from the sampled
members (everyone responded):
Σ(money spent on holiday) = 3655.89 (× SEK 1000)
Σ(money spent on holiday)² = 23881.3 (× (SEK 1000)²)
Σ(yearly taxable income) = 248721 (× SEK 1000)
Σ(yearly taxable income)² = 92032277 (× (SEK 1000)²)
Σ(money spent on holiday)×(yearly taxable income) = 1400947 (× (SEK 1000)²)
Further, the summarized yearly taxable income for all members in the register is about
135730000 (× SEK 1000).
(a) Calculate a ratio estimate of the mean amount of money spent on holiday last year
among the members with yearly taxable income as the auxiliary variable. Calculate
also a 95% confidence interval for this mean amount.
(b) Calculate a regression estimate of the mean amount of money spent on holiday last year
among the members with yearly taxable income as the auxiliary variable. Calculate
also a 95% confidence interval.
(c) Compare the two point estimates in a) and b) and also make a comparison of each
with the ordinary sample mean with respect to their estimated variances. Which of
them would you prefer and why?
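The three estimators compared in (a) to (c) can all be computed from the sums given above. The sketch below assumes the standard textbook variance approximations: residual variance with divisor n - 1 for the ratio estimator and n - 2 for the regression estimator, and normal-approximation intervals. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, N = 700, 381960
sum_y, sum_y2 = 3655.89, 23881.3      # holiday spending (x SEK 1000)
sum_x, sum_x2 = 248721, 92032277      # yearly taxable income (x SEK 1000)
sum_xy = 1400947
x_bar_pop = 135730000 / N             # register mean of taxable income

y_bar, x_bar = sum_y / n, sum_x / n
fpc = 1 - n / N
z = NormalDist().inv_cdf(0.975)

# Ordinary sample mean.
s2_y = (sum_y2 - sum_y**2 / n) / (n - 1)
var_srs = fpc * s2_y / n

# Ratio estimate: residuals e = y - B*x, with B = sum_y / sum_x.
b_ratio = sum_y / sum_x
y_ratio = b_ratio * x_bar_pop
s2_e_ratio = (sum_y2 - 2 * b_ratio * sum_xy + b_ratio**2 * sum_x2) / (n - 1)
var_ratio = fpc * (x_bar_pop / x_bar) ** 2 * s2_e_ratio / n

# Regression estimate: least-squares slope from the corrected sums.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x**2 / n
s_yy = sum_y2 - sum_y**2 / n
b1 = s_xy / s_xx
y_reg = y_bar + b1 * (x_bar_pop - x_bar)
s2_e_reg = (s_yy - b1 * s_xy) / (n - 2)
var_reg = fpc * s2_e_reg / n

for name, est, var in [("mean", y_bar, var_srs), ("ratio", y_ratio, var_ratio),
                       ("regression", y_reg, var_reg)]:
    half = z * sqrt(var)
    print(f"{name}: {est:.4f}  95% CI ({est - half:.4f}, {est + half:.4f})")
```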
Task 5
Consider the following citation:
“...If we compare our voluntary web survey panel with the panel randomly sampled by Acme
Inc we find the following benefits of our panel:
• Our panel has 43000 members compared to the 2300 randomly selected members of the
Acme telephone survey panel. This will give us much higher certainty in our estimates.
• The cost for conducting a survey among our 43000 respondents is approximately a
tenth of the cost for conducting the analogous survey among the 2300 respondents of
the Acme panel.
• Due to the almost 20 times larger number of respondents in our panel, the coverage
factor for the population is superior to that of Acme.
• The web survey technique we use will remove almost all of the non-sampling errors that
are still present in Acme surveys, e.g. there will be no interviewer effect, no respondent
effect, no instrument effect and no partial non-response (as the questionnaire must be
completed to be submittable).
• The potential problem with undercoverage can be resolved by efficient use of weights
due to the large amount of background information among the members of the panel.
...”
Obviously there are some clear stupidities in the arguments above, but make a thorough
discussion of these points and explain in more detail why some statements are erroneous. In
addition point out those parts of the arguments you believe are correct and explain why.