LINKÖPINGS UNIVERSITET
Institutionen för datavetenskap
Statistik, ANd
732A28 SURVEY SAMPLING, 6 CDTS
Master’s program in Statistics, Data analysis
and Knowledge discovery
Fall semester 2009
Home exam
Home exam for Student 1
Rules:
The home exam is given under “gentlemen’s agreement”. This implies that cooperation with other persons
is not allowed.
Solutions to the tasks should be submitted no later than the deadline set out. Late submissions will under
normal circumstances be discarded.
Except for cooperation with other persons, there is a free choice of means (textbooks, calculators, computer
software, etc.).
Deadline for submission: November 4, 2009, at 12.00 midnight (24.00)
Submission can be in the form of a hand-written manuscript, a printed document or an attachment
to an e-mail. The e-mail address for attachments is Anders.Nordgaard@liu.se
If anything is unclear in the tasks, questions should be directed by e-mail to Anders.Nordgaard@liu.se
Grades: The grading on this home exam uses the whole ECTS-scale, i.e. the grades may range from F to
A. Grades are given on the basis of the total impression from the solution of tasks, no specific points or intervals
of points are used.
Good luck!
Task 1
Assume we have taken an SRS of 300 individuals from a population consisting of 11232
individuals. Various variables are recorded for the 300 selected individuals and among these
is the total amount of money saved in bank accounts the last month (i.e. previous savings
and newly saved amounts as a total). The sum for the whole sample is SEK 15.1 million
and the standard deviation is SEK 55000.
(a) Estimate the total amount of savings, t, in the whole population and give a 99%
confidence interval for this quantity.
(b) Based on the information obtained from the sample, how many individuals should at
least be sampled, if the width of a confidence interval corresponding to that requested
in a) should not be larger than SEK 160 million?
(c) Now assume that we identify 5 of the sampled individuals not to be part of the surveyed
population. The recorded savings for these 5 individuals are respectively (SEK) 34000,
10900, 22800, 700 and 63000. Calculate a new estimate of t and a 99% confidence
interval. (Hint: Treat the group of sampled individuals actually belonging to the
surveyed population as a domain of study.)
(d) Without knowing anything in particular about the population we might presume that
the variable savings can have a minor number of very high amounts of savings among
the individuals. Discuss whether this could be a problem for the conclusions drawn
from the point estimates and confidence intervals above.
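The calculations asked for in (a) and (b) can be sketched as follows. This is only an illustration under stated assumptions, not a model solution: it uses the standard SRS estimator of a total with finite population correction and a normal quantile (a t quantile could be used instead). The function names srs_total_ci and srs_sample_size are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def srs_total_ci(N, n, sample_sum, s, level=0.99):
    """Return (estimate, lower, upper) for a population total under SRS."""
    ybar = sample_sum / n                    # sample mean
    t_hat = N * ybar                         # estimated population total
    fpc = 1 - n / N                          # finite population correction
    se = N * sqrt(fpc) * s / sqrt(n)         # standard error of the total
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # assumption: normal quantile
    return t_hat, t_hat - z * se, t_hat + z * se


def srs_sample_size(N, s, half_width, level=0.99):
    """Smallest n whose interval half-width for the total is at most half_width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n0 = (z * N * s / half_width) ** 2       # sample size ignoring the fpc
    return ceil(n0 / (1 + n0 / N))           # adjust for the finite population


# Figures from Task 1: N = 11232, n = 300, sample sum SEK 15.1 million, s = SEK 55000.
t_hat, lower, upper = srs_total_ci(11232, 300, 15.1e6, 55000)
# An interval width of SEK 160 million corresponds to a half-width of SEK 80 million.
n_required = srs_sample_size(11232, 55000, 80e6)
print(t_hat, lower, upper, n_required)
```

The sample size function treats the sample standard deviation as a planning value for the population standard deviation, which is itself an assumption.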
Someone suggests that the sampling design used is not suitable for assessing the savings of
the individuals in the population. A parallel study showed that in a random sample of
195 individuals from those in the population having an employment the sum of savings
(as defined above) was SEK 11.55 million and the standard deviation was SEK 33000. In
a random sample of 105 individuals from unemployed individuals in the population the
corresponding sum of savings was SEK 2.62 million and the standard deviation was SEK
36000. Approximately 7.1% of the individuals in the population are unemployed.
(e) Use these “new” data to calculate a point estimate of t and a 99% confidence interval.
Compare the results with those of a) (and/or c)) and explain.
(f) In both studies there were in total 300 individuals sampled. If we would consider the
second approach, i.e. split the sampling between employed and unemployed, what
would an optimal allocation of 300 individuals over the two groups be?
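A minimal sketch of the stratified calculations behind (e) and (f), assuming the stratum sizes can be approximated from the stated 7.1% unemployment share and that the sample standard deviations serve as planning values for Neyman allocation. The dictionary layout and function names are hypothetical, and a normal quantile is assumed for the interval.

```python
from math import sqrt
from statistics import NormalDist

N = 11232
SHARE_UNEMPLOYED = 0.071  # approximate share given in the text

# Stratum sizes are approximations derived from the 7.1% figure.
strata = {
    "employed": {"N": N * (1 - SHARE_UNEMPLOYED), "n": 195, "sum": 11.55e6, "s": 33000},
    "unemployed": {"N": N * SHARE_UNEMPLOYED, "n": 105, "sum": 2.62e6, "s": 36000},
}


def stratified_total_ci(strata, level=0.99):
    """Return (estimate, lower, upper) for the population total."""
    t_hat = sum(h["N"] * h["sum"] / h["n"] for h in strata.values())
    var = sum(
        h["N"] ** 2 * (1 - h["n"] / h["N"]) * h["s"] ** 2 / h["n"]
        for h in strata.values()
    )
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # assumption: normal quantile
    return t_hat, t_hat - z * sqrt(var), t_hat + z * sqrt(var)


def neyman_allocation(strata, n_total):
    """Allocate n_total proportionally to N_h * S_h (not rounded)."""
    weights = {name: h["N"] * h["s"] for name, h in strata.items()}
    total = sum(weights.values())
    return {name: n_total * w / total for name, w in weights.items()}


t_hat, lower, upper = stratified_total_ci(strata)
allocation = neyman_allocation(strata, 300)
print(t_hat, lower, upper, allocation)
```

The allocation is returned unrounded; rounding to whole individuals is left to the reader.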
Task 2
Let us assume that in a large city there are 55 schools with pupils in the ages 13-16 years
old. In each school the pupils are divided into classes with respect to the corresponding study
year and also with respect to which school they attended before entering one of
these schools. The latter classification variable is more difficult to sort out and will at first
be ignored. Regarding the former we concentrate on pupils of study year 9, the highest
study year in these schools. The total number of pupils of study year 9 in the 55 schools is
6140. These pupils will leave the school after the current year and are about to enter the
high school (or do something else if possible). We would like to investigate the proportion of
pupils of study year 9 that have decided what to do after this year, i.e. which study program
in high school they intend to apply for, or if they have decided not to go to high school what
they will do instead.
An SRS of 10 schools was taken from the 55 schools. In each school drawn an SRS of 10%
(number rounded downwards) of the pupils of study year 9 was taken and in each of these
samples the number of pupils that had decided what to do after this year was recorded. The
result was as follows:
School   Number of pupils in   Number of pupils in sample that have
         study year 9          decided what to do after this year
1        113                   7
2        81                    4
3        149                   8
4        142                   13
5        90                    7
6        116                   7
7        85                    7
8        120                   5
9        116                   7
10       88                    6
(a) Calculate an unbiased point estimate of the proportion, p, of pupils of study year 9
in all 55 schools, that have decided what to do after this year. Calculate also a 95%
confidence interval for p.
(b) As the number of pupils in study year 9 varies a lot between the schools the unbiased
estimate may be too uncertain. Calculate a ratio estimate of p with a 95% confidence
interval and compare with your results from a).
(c) Approximately, how many pupils of study year 9 in the 55 schools does each pupil in
the sample represent?
(d) Now assume that the schools in the sample have not been taken as an SRS but been
drawn with replacement and with probabilities proportional to their numbers of pupils
of study year 9. Calculate again a point estimate and a 95% confidence interval for p.
(e) Consider the first 3 schools in the table above and assume that these constitute a sample
on their own. Further, assume that these 3 schools were drawn without replacement
and with probabilities proportional to their numbers of pupils of study year 9. The
inclusion probabilities for these 3 schools are assumed as in the table below.
            School i
School j    1         2         3         πj
1           -         0.0027    0.0043    0.056
2           0.0027    -         0.0032    0.040
3           0.0043    0.0032    -         0.074
πi          0.056     0.040     0.074
Calculate the Horvitz-Thompson estimate of p. Is an estimate of standard error reliable? Motivate your answer.
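The point estimates in Task 2 can be sketched as below. This is a hedged illustration, not a model solution: it assumes the usual two-stage estimators (school totals estimated as M_i times the sample proportion), and it reads the inclusion probabilities for (e) as π1 = 0.056, π2 = 0.040, π3 = 0.074 with joint probabilities π12 = 0.0027, π13 = 0.0043, π23 = 0.0032. All variable names are hypothetical. Variance estimation is not included.

```python
N_SCHOOLS = 55
M_TOTAL = 6140
pupils = [113, 81, 149, 142, 90, 116, 85, 120, 116, 88]   # M_i
decided = [7, 4, 8, 13, 7, 7, 7, 5, 7, 6]                 # decided pupils in sample

n = len(pupils)
sampled = [M // 10 for M in pupils]                       # 10% of M_i, rounded down
# Estimated number of decided pupils in each sampled school.
t_hat = [M * a / m for M, a, m in zip(pupils, decided, sampled)]

# Unbiased estimate: expand the school totals by N/n, divide by the known 6140.
p_unbiased = (N_SCHOOLS / n) * sum(t_hat) / M_TOTAL
# Ratio estimate: divide by the number of year 9 pupils in the sampled schools.
p_ratio = sum(t_hat) / sum(pupils)
# Sampling weight of a pupil in school i: (N/n) * (M_i/m_i).
weights = [(N_SCHOOLS / n) * M / m for M, m in zip(pupils, sampled)]

# Horvitz-Thompson estimate from the first 3 schools (probabilities as read above).
pi = [0.056, 0.040, 0.074]
pi_joint = {(0, 1): 0.0027, (0, 2): 0.0043, (1, 2): 0.0032}
p_ht = sum(t / p for t, p in zip(t_hat[:3], pi)) / M_TOTAL
# Terms pi_i * pi_j - pi_ij, which enter the Sen-Yates-Grundy variance estimator.
syg_terms = {ij: pi[ij[0]] * pi[ij[1]] - pij for ij, pij in pi_joint.items()}

print(p_unbiased, p_ratio, p_ht)
print(weights)
print(syg_terms)
```

Printing syg_terms lets the sign of each term be inspected, which is relevant when judging whether a standard error estimate can be trusted.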
Task 3
A stratified sampling design was applied to a population of households, where stratum 1
consists of households that rent their apartment and stratum 2 consists of households that
own either their apartments or the houses where they reside. The sizes of the two strata
in the population were 3300 and 2700 respectively and a proportionally allocated sample of
600 households was taken from the population. A mail survey was used in this study and
after the planned numbers of reminders there were 147 completed questionnaires returned
from stratum 1 and 112 completed questionnaires returned from stratum 2.
The questionnaire used for the study had a question about the number of mobile phones
among the members of the household. The total number of mobile phones among the 147
households from stratum 1 was 301 and the standard deviation was 1.2. The total number of
mobile phones among the 112 households from stratum 2 was 302 and the standard deviation
was 2.0.
(a) Ignoring the non-respondents, estimate the mean number of mobile phones per household in the whole population and calculate a 95% confidence interval for this number.
To adjust for the potential bias in the estimated mean in a) a follow-up was done among the
non-respondents in both strata, by taking SRSs from the sets of non-responding households
in the two samples, and contacting these by telephone. All households contacted gave a
response to the question in particular and details are found in the table below.
Sample   Number contacted   Total number of mobile phones   Standard deviation
1        30                 77                              1.35
2        25                 67                              1.9
(b) Use these new data to compute a new estimate of the mean number of mobile phones
per household in the population and a 95% confidence interval for this number. Compare the results with your results in a).
Now, suppose that the follow-ups were not done, but another adjustment is to be done.
The households in the original sample were of different sizes (i.e. had different numbers of
household members) and these sizes were known previous to sampling. The following table
gives the details:
Household size   Stratum   Sample size   Respondents   Total number of mobile phones
1                1         90            36            17
1                2         10            10            6
2                1         85            40            67
2                2         102           40            74
3                1         58            31            68
3                2         33            17            43
4                1         69            29            108
4                2         88            37            143
5 or more        1         28            11            41
5 or more        2         37            8             36
(c) Calculate a weighting-class adjusted point estimate of the mean number of mobile
phones per household. Compare this estimate to your results in a) and b). What
assumption(s) do you need to trust the weighting-class adjusted estimate?
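A minimal sketch of a weighting-class adjustment for (c), assuming the classes are the household sizes within each stratum and that each respondent's design weight N_h/n_h is inflated by the inverse of the class response rate. The data layout and names are hypothetical.

```python
STRATUM_SIZE = {1: 3300, 2: 2700}

# (household size class, stratum): (sample size, respondents, total mobile phones)
classes = {
    ("1", 1): (90, 36, 17),   ("1", 2): (10, 10, 6),
    ("2", 1): (85, 40, 67),   ("2", 2): (102, 40, 74),
    ("3", 1): (58, 31, 68),   ("3", 2): (33, 17, 43),
    ("4", 1): (69, 29, 108),  ("4", 2): (88, 37, 143),
    ("5+", 1): (28, 11, 41),  ("5+", 2): (37, 8, 36),
}

# Sample size per stratum, summed over the household size classes.
n_stratum = {h: sum(v[0] for (_, s), v in classes.items() if s == h) for h in STRATUM_SIZE}

total_hat = 0.0
for (_, h), (n_c, r_c, phones) in classes.items():
    design_weight = STRATUM_SIZE[h] / n_stratum[h]   # N_h / n_h
    adjustment = n_c / r_c                           # inverse class response rate
    total_hat += design_weight * adjustment * phones

mean_hat = total_hat / sum(STRATUM_SIZE.values())
print(n_stratum, mean_hat)
```

Because the sample is proportionally allocated, the design weight is the same in both strata, so the adjustment works entirely through the class response rates.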
Let us look in particular at the subgroup of households with 3 members in stratum 2. There
were originally 33 households in the sample but questionnaires were apparently received
from only 17. If we were only interested in the current question about the number of mobile
phones we could try to impute the missing 16 values. Below are the answers from the 17
responding households:
2, 3, 2, 2, 5, 2, 2, 2, 3, 2, 2, 3, 1, 3, 3, 3, 3
(d) Use multiple imputation with the approximate Bayesian bootstrap (ABB) to impute
the missing values. Construct 5 complete data sets and use repeated imputation inference to calculate a point estimate and a 95% confidence interval for the mean number
of mobile phones per household among households with 3 members belonging to stratum
2.
(e) Compare your point estimate and confidence interval with the corresponding quantities
obtained when the 17 responses are analysed as if they were a complete SRS. (Hint:
When a subpopulation size is unknown, the finite population correction for statistics
based on the entire sample may be used.)
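The approximate Bayesian bootstrap in (d) can be sketched as follows, under these assumptions: each imputation first resamples the 17 observed values with replacement to form a donor pool and then draws the 16 missing values from that pool; the within-imputation variance is taken as s²/n without a finite population correction; and the results are combined with Rubin's rules. The seed and the function names are arbitrary choices for the illustration.

```python
import random
from math import inf
from statistics import mean, variance

observed = [2, 3, 2, 2, 5, 2, 2, 2, 3, 2, 2, 3, 1, 3, 3, 3, 3]
N_MISSING = 16


def abb_impute(observed, n_missing, rng):
    """One ABB completed data set: observed values plus imputed values."""
    pool = rng.choices(observed, k=len(observed))   # step 1: resample the donors
    imputed = rng.choices(pool, k=n_missing)        # step 2: draw the missing values
    return observed + imputed


def rubin_combine(datasets):
    """Combine completed-data means; return (estimate, total variance, df)."""
    m = len(datasets)
    estimates = [mean(d) for d in datasets]
    within = [variance(d) / len(d) for d in datasets]   # assumption: no fpc
    q_bar = mean(estimates)
    u_bar = mean(within)
    b = variance(estimates)                             # between-imputation variance
    total = u_bar + (1 + 1 / m) * b
    df = inf if b == 0 else (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


rng = random.Random(2009)   # arbitrary seed, for reproducibility only
datasets = [abb_impute(observed, N_MISSING, rng) for _ in range(5)]
q_bar, total_var, df = rubin_combine(datasets)
print(q_bar, total_var, df)
```

A 95% interval would then be q_bar plus or minus a t quantile with df degrees of freedom times the square root of total_var; the quantile is not computed here because the standard library has no t distribution.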
Task 4
An SRS of 400 individuals was taken from a newly updated register of members of a political
party. The register contains 365 080 members and has a lot of background information about
them, but the objective of this study was to investigate other properties of this population.
One question put to the respondents was how much money they spent on their holiday last
year. In the register there is a record for each member about the taxable yearly income,
although that figure can be as old as 5 years, and the intention is to use that information in
the estimation. The following statistics were computed for the responses from the sampled
members (everyone responded):
Σ (money spent on holiday) = 2054.44 (× SEK 1000)
Σ (money spent on holiday)² = 13280.3 (× (SEK 1000)²)
Σ (yearly taxable income) = 141283 (× SEK 1000)
Σ (yearly taxable income)² = 51805731 (× (SEK 1000)²)
Σ (money spent on holiday)×(yearly taxable income) = 781083 (× (SEK 1000)²)
Further, the summarized yearly taxable income for all members in the register is about
128910000 (× SEK 1000).
(a) Calculate a ratio estimate of the mean amount of money spent on holiday last year
among the members with yearly taxable income as the auxiliary variable. Calculate
also a 95% confidence interval for this mean amount.
(b) Calculate a regression estimate of the mean amount of money spent on holiday last year
among the members with yearly taxable income as the auxiliary variable. Calculate
also a 95% confidence interval.
(c) Compare the two point estimates in a) and b) and also make a comparison of each
with the ordinary sample mean with respect to their estimated variances. Which of
them would you prefer and why?
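A hedged sketch of the Task 4 estimators computed from the summary statistics alone. It assumes the five figures are sample sums over the 400 respondents (sums, sums of squares and the sum of cross products), uses the common linearization variance for the ratio estimator and a residual variance with n - 2 degrees of freedom for the regression estimator, and takes 1.96 as the 95% normal quantile. All variable names are hypothetical.

```python
from math import sqrt

N, n = 365080, 400
sum_y, sum_y2 = 2054.44, 13280.3        # holiday spending (x SEK 1000)
sum_x, sum_x2 = 141283, 51805731        # yearly taxable income (x SEK 1000)
sum_xy = 781083
total_x = 128910000                     # register total of taxable income

ybar, xbar = sum_y / n, sum_x / n
xbar_pop = total_x / N
fpc = 1 - n / N

# Corrected sums of squares and cross products.
s_yy = sum_y2 - sum_y ** 2 / n
s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Ordinary sample mean.
var_srs = fpc * s_yy / (n - 1) / n

# Ratio estimator of the mean.
ratio = sum_y / sum_x
ybar_ratio = ratio * xbar_pop
sse_ratio = sum_y2 - 2 * ratio * sum_xy + ratio ** 2 * sum_x2
var_ratio = fpc * (xbar_pop / xbar) ** 2 * sse_ratio / (n - 1) / n

# Regression estimator of the mean.
slope = s_xy / s_xx
ybar_reg = ybar + slope * (xbar_pop - xbar)
sse_reg = s_yy - slope * s_xy
var_reg = fpc * sse_reg / (n - 2) / n


def ci95(estimate, var):
    """Normal-approximation 95% interval."""
    half = 1.96 * sqrt(var)
    return estimate - half, estimate + half


print(ybar, ci95(ybar, var_srs))
print(ybar_ratio, ci95(ybar_ratio, var_ratio))
print(ybar_reg, ci95(ybar_reg, var_reg))
```

Comparing var_srs, var_ratio and var_reg gives the estimated variances that (c) asks about.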
Task 5
Consider the following citation:
“...If we compare our voluntary web survey panel with the panel randomly sampled by Acme
Inc we find the following benefits of our panel:
• Our panel has 43000 members compared to the 2300 randomly selected members of the
Acme telephone survey panel. This will give us much higher certainty in our estimates.
• The cost for conducting a survey among our 43000 respondents is approximately a
tenth of the cost for conducting the analogous survey among the 2300 respondents of
the Acme panel.
• Due to the almost 20 times larger number of respondents in our panel, the coverage
factor for the population is superior to that of Acme.
• The web survey technique we use will remove almost all of the non-sampling errors that
are still present in Acme surveys, e.g. there will be no interviewer effect, no respondent
effect, no instrument effect and no partial non-response (as the questionnaire must be
completed to be submittable)
• The potential problem with undercoverage can be resolved by efficient use of weights
due to the large amount of background information among the members of the panel.
...”
Obviously there are some clear stupidities in the arguments above, but make a thorough
discussion of these points and explain in more detail why some statements are erroneous. In
addition point out those parts of the arguments you believe are correct and explain why.