Examination
Date: 2007-04-26
Statistics for Business and Economics
Module 3: Statistical survey methodology
Name: ..........................................
Personal code number: ..........................................
Time of examination: 9.00-13.00, ÖP sal (hall) 5
Aid: Pocket calculator. Formulae are handed out.
Write your answers on loose paper except for the multiple choice questions
where you mark your answer. Evaluation of the exercises is done by the
teacher. For the grade pass, 50% of maximal mark is required. To pass with
distinction, 75% is required.
Note that omitted or imperfect explanation leads to reduction of marks.
Exercise   1    2    3    4    5    6    Sum
Points     10   9    10   6    5    10   50
Break a leg! /J
Note: Examination with plausible solutions
Author: Joakim Malmdin, 2007-04-30
Exercise 1, 10p
These statements are either true or false.
1) Stratification may produce a smaller bound on the error of estimation,
B. This is especially true when the strata are heterogeneous.
True
False
2) A drawback with systematic sampling is the risk of periodicity.
True
False
3) If our findings are statistically significant they are too unusual to often
occur just by chance.
True
False
4) The margin of error includes only random sampling error.
True
False
5) A census is a sample survey that attempts to include the entire population in the sample.
True
False
6) The use of a control group in an experiment allows us to control the
effects of lurking variables.
True
False
7) Clusters should be as heterogeneous as possible within, and one cluster
should look very much like another, in order for the economic advantages of
cluster sampling to pay off.
True
False
8) The finite population correction (or correction factor) takes into account the fact that an estimate based on a sample n = 10 from a
population of N = 20 000 items contains more information about the
population than a sample of n = 10 from a population of N = 20.
True
False
9) Reliability has to do with the quality of measurement. In its everyday
sense, reliability is the “repeatability” of your measures.
True
False
10) We usually favour stem-plots when we have a small number of observations, and histograms for larger amounts of data.
True
False
Exercise 2, 9p
For each of the following sample situations, identify the target population
and the frame used. Comment upon the coverage, and also identify the sampling
technique used and any shortcomings. Finally, suggest another
way of doing the investigation with the same target population.
1. A sociologist is interested in determining the extent to which tenth-graders
in the USA are self-motivated. A sample of four high schools in
Large City is taken and all tenth-graders in each school are interviewed.
3p
Answer [very short]:
Target population: Tenth-graders in the USA
Frame: High schools in Large City
Coverage: Undercoverage
Technique: Cluster sampling of high schools
Shortcomings: Bias due to nonsampling errors (undercoverage)
Suggestion: The risk of bias arises because only high schools
from one city are selected. If the target population is tenth-graders
in the USA, the frame has to include all tenth-graders. One
way of doing this is to randomly select cities and then schools
within each city.
2. The host of a local radio talk show in London wonders if people who
are actively religious are happier than those who are not. He asks the
listeners to call in and the station receives calls from 48 listeners who
voice their opinions.
3p
Answer [very short]:
Target population: London citizens (possibly others too, but unspecified)
Frame: Listeners to the show
Coverage: Undercoverage (possibly overcoverage)
Technique: Voluntary response
Shortcomings: Bias due to sampling errors (voluntary response) and
nonsampling errors (undercoverage)
Suggestion: Basing an investigation on voluntary answers (self-selection)
is not a good idea. A more serious attempt to reach a trustworthy result
would be to randomly select London citizens, for example from the
telephone book, i.e. simple random sampling.
3. Every tenth person passing outside the library at Umeå University
between 2 pm and 4 pm on the first day of a term is asked whether he or she
prefers written or oral exams this term.
3p
Answer [very short]:
Target population: Students at Umeå University in a specific term
Frame: People passing by the library between 2 pm and 4 pm on the first
day of a specific term
Coverage: Over- and undercoverage
Technique: Systematic sampling
Shortcomings: Bias due to nonsampling errors (undercoverage)
Suggestion: The risk of bias in this survey is connected to the choice of
time and place. We could instead use lists of all students admitted
to the University for the term in question, waiting until
registration is complete. Systematic or simple random sampling
could then be used.
Exercise 3, 10p
Figure 1: Newspapers sold displayed in two graphs
[Two graphs of newspapers sold for News for you, Daily words, World in words, and Time for news; the count axis runs from 0 to 50000.]
a) In these two graphs (Figure 1) the sale numbers of the four biggest
newspapers in News Island are displayed. Choose the most proper
graph and give two reasons for your choice.
3p
Answer: The bar chart is the most proper graph. 1) The four newspapers
do not sum to a whole, whereas the pie chart gives the impression that the whole
population is represented. 2) It is easier to compare
the bars in order to see the differences in the number of papers sold.
b) Can we use the graph in Figure 2? Explain.
2p
Answer: No. With nominal data (the four newspapers) it does not
make sense to connect the categories with a line.
c) Explain why a pictogram would have been improper to use.
2p
Answer: A pictogram (of newspapers) would be misleading since both
the height and the width must be increased in order to avoid distortion, i.e.
it is not only the height of the picture that gets larger but also the width,
and thereby the area.
[An alternative to this is to keep the same width regardless of height,
but then the pictures are distorted.]
d) Describe the data set displayed in Figure 3.
3p
Figure 2: Newspapers sold displayed with combining line
[Line graph connecting the counts (axis from 0 to 50000) of newspapers sold for News for you, Daily words, World in words, and Time for news.]
Answer: This is a boxplot, which is a graph of the five-number summary.
All observations (except one outlier) lie between approximately 18 and 50
(the minimum and the maximum), the range covered by the whiskers.
The box stretches from approximately 30 (first quartile) to 45 (third quartile);
this interquartile range contains the middle 50% of the observations.
The median, a measure of central location and also of relative standing,
is approximately 37, meaning that half of the observations are larger than
37 and half are smaller.
Since the difference between the first and second quartiles is approximately
equal to the difference between the second and third quartiles,
a good guess is that the distribution is approximately symmetric.
Figure 3: Home run counts
[Boxplot of Barry Bonds's 19 home run counts; axis from 0 to 80.]
Exercise 4, 6p
A forester wants to estimate the total number of farm acres planted in trees
for a state. Since the number of acres of trees varies considerably with the
size of the farm, she decides to stratify on farm sizes. The 240 farms in
the state are placed in one of four categories according to size. A stratified
random sample of 40 farms, selected by using proportional allocation, yields
the results shown in Table 1 on number of acres planted in trees.
Table 1: Acres of trees on farms
Stratum   Farm size        Ni   ni   Acres of trees on the sampled farms
I         0-200 acres      86   14   97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21
II        201-400 acres    72   12   125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190
III       401-600 acres    52   9    142, 256, 310, 440, 495, 510, 320, 396, 196
IV        Over 600 acres   30   5    167, 655, 220, 540, 780
Figure 4: Acres of trees
[Plot of acres of trees (axis from 0 to 800) by stratum, Stratum I to Stratum IV, for the stratified random sample of 40 farms.]
a) Estimate the total number of acres of trees on farms in the state by
using the information given in Table 1.
3p
Answer: Use the formulae in A.2.1 and A.5 to find the answer.
First calculate the mean value of the random variable of interest (Y = "number
of farm acres planted in trees") for each stratum.
\[
\bar{Y}_1 = 63.3571, \quad \bar{Y}_2 = 183, \quad \bar{Y}_3 = 340.5555, \quad \bar{Y}_4 = 472.4
\]
Now we have that
\[
\begin{aligned}
\bar{Y}_{st} &= \frac{1}{N} \sum_{i=1}^{L} N_i \bar{Y}_i
  = \frac{1}{240} \sum_{i=1}^{4} N_i \bar{Y}_i \\
 &= \frac{1}{240} (86 \times 63.3571 + 72 \times 183 + 52 \times 340.5555 + 30 \times 472.4) \\
 &= 210.43999
\end{aligned}
\]
and the total is estimated as
\[
\hat{\tau} = N \bar{Y}_{st} = 240 \times 210.43999 = 50\,505.6 \approx 50\,506
\]
∴ The total number of acres of trees on farms in the state, τ̂, is estimated to be 50 506.
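The arithmetic in part a) can be checked with a short script. This is only a sketch: the variable names (`strata`, `y_bar_st`, `tau_hat`) are illustrative choices, and the data are copied from Table 1. Because it works with unrounded stratum means, the last decimals may differ slightly from the hand calculation.

```python
from statistics import mean

# Sample data from Table 1: stratum -> (N_i, sampled acres of trees)
strata = {
    "I": (86, [97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21]),
    "II": (72, [125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190]),
    "III": (52, [142, 256, 310, 440, 495, 510, 320, 396, 196]),
    "IV": (30, [167, 655, 220, 540, 780]),
}

# Population size: the 240 farms in the state
N = sum(N_i for N_i, _ in strata.values())

# Stratified mean (A.2.1): stratum means weighted by stratum size
y_bar_st = sum(N_i * mean(ys) for N_i, ys in strata.values()) / N

# Estimated total (A.5): tau_hat = N * y_bar_st
tau_hat = N * y_bar_st

print(f"stratified mean = {y_bar_st:.4f}")
print(f"estimated total = {tau_hat:.1f}")
```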
b) Place a bound on the error of estimation.
3p
Answer:
The margin of error is given by the formula in A.6
\[
B = 2 \sqrt{\hat{V}(\hat{\theta})}
\]
where θ̂ (the estimate of interest) here is τ̂ (i.e. N Ȳst), so we have to
estimate the variance of Ȳst in order to estimate the margin of error
of the total. We use the formula in A.2.1 to estimate the variance of Ȳst:
\[
\hat{V}(\bar{Y}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \left( \frac{N_i - n_i}{N_i} \right) \frac{s_i^2}{n_i}
\]
where the estimated variance in each stratum is
\[
s_1^2 = 1071.786, \quad s_2^2 = 9054.182, \quad s_3^2 = 16794.28, \quad s_4^2 = 72376.3
\]
The estimated variance of Ȳst is then
\[
\begin{aligned}
\hat{V}(\bar{Y}_{st}) &= \frac{1}{240^2} (474\,035.6366 + 3\,259\,505.52 + 4\,172\,445.564 + 10\,856\,445) \\
 &= 325.7366618
\end{aligned}
\]
The variance of the estimated total is now estimated according to the
formula in A.5
\[
\hat{V}(N \bar{Y}_{st}) = N^2 \hat{V}(\bar{Y}_{st}) = 240^2 \times 325.7366618 = 18\,762\,431.72
\]
and, finally, the margin of error of τ̂ is calculated as
\[
B = 2 \sqrt{18\,762\,431.72} = 8663.1245 \approx 8663
\]
∴ The margin of error of τ̂ is 8663.
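The bound in part b) can be reproduced the same way. The sketch below assumes the Table 1 data and applies the formulae in A.2.1, A.5 and A.6; the names `var_mean`, `var_total` and `B` are illustrative. `statistics.variance` computes the sample variance with divisor n - 1, which matches the definition of s² in the appendix.

```python
from math import sqrt
from statistics import variance

# Sample data from Table 1: stratum -> (N_i, sampled acres of trees)
strata = {
    "I": (86, [97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21]),
    "II": (72, [125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190]),
    "III": (52, [142, 256, 310, 440, 495, 510, 320, 396, 196]),
    "IV": (30, [167, 655, 220, 540, 780]),
}
N = sum(N_i for N_i, _ in strata.values())

# A.2.1: estimated variance of the stratified mean, including the
# finite population correction (N_i - n_i) / N_i for each stratum
var_mean = sum(
    N_i**2 * ((N_i - len(ys)) / N_i) * variance(ys) / len(ys)
    for N_i, ys in strata.values()
) / N**2

# A.5: variance of the estimated total
var_total = N**2 * var_mean

# A.6: bound on the error of estimation
B = 2 * sqrt(var_total)

print(f"variance of stratified mean = {var_mean:.4f}")
print(f"bound on the total          = {B:.1f}")
```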
Exercise 5, 5p
Suppose we are interested in determining the average daily sales (income)
for a chain of grocery stores. In Figure 5 we see the true sale numbers for
the last 12 days.
Figure 5: Daily sales
[Plot of income (axis ticks at 50, 100 and 150) against days (axis ticks from 2 to 12).]
a) Suppose we want to sample days in order to estimate the average daily
sales. Comment upon the use of systematic sampling.
2p
Answer: There is periodicity, since sales seem to peak every
second or every third day. The effectiveness of a 1-in-k sample therefore
depends on the value we choose for k. The risk is that we over- or
underestimate the parameter of interest.
[We could change the random starting point several times in order
to reduce the possibility of choosing observations from the same relative
position in a periodic population. Note that the corresponding
terminology in time series analysis is seasonal variation, which refers
to systematic patterns that occur over short repetitive calendar periods
(with a duration of less than one year).]
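The effect of periodicity can be illustrated with a small simulation. The series below is hypothetical (it is not the data behind Figure 5) and is constructed to peak every third day; `systematic_sample` is an illustrative helper. When k equals the period, every starting point picks days from the same relative position, so the estimate depends heavily on the start; with k = 4 the sample cuts across the cycle.

```python
from statistics import mean

# Hypothetical daily sales with a peak every third day
# (assumed values for illustration only, not read from Figure 5)
sales = [50, 60, 150, 55, 65, 145, 50, 60, 150, 55, 65, 145]

def systematic_sample(values, k, start):
    """Return a 1-in-k sample: every k-th value, beginning at index `start`."""
    return values[start::k]

true_mean = mean(sales)

for k in (3, 4):
    # One estimate of the mean for each possible starting point 0..k-1
    estimates = [round(mean(systematic_sample(sales, k, s)), 1) for s in range(k)]
    print(f"k={k}: true mean={true_mean}, estimates by start={estimates}")
```

With these assumed values, k = 3 gives estimates far below or far above the true mean depending on the starting point, while k = 4 gives estimates close to it.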
b) Another task is to estimate the average number of customers per grocery
store for the chain. The 300 stores are listed in 50 geographical
clusters of 6 each, and a simple random sample of three clusters is
selected (see Table 2).
3p
Table 2: Number of customers
Cluster   Number of customers
1         34, 56, 78, 56, 100, 87
2         47, 212, 220, 34, 68, 90
3         98, 67, 88, 99, 29, 58
Answer: Use the formula in A.4.1 to estimate the population mean,
µ, with the sample mean Ȳ. The random variable of interest is defined
as Y = "number of customers per grocery store".
\[
\bar{Y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}
\]
where m1 = m2 = m3 = m (the number of elements in each cluster)
and yi is the total of all observations in the ith cluster. Here the total
sample size is equal to nm elements. We get
\[
\bar{Y} = \frac{\sum_{i=1}^{n} y_i}{nm} = \frac{411 + 671 + 439}{3(6)} = 84.5
\]
∴ The average number of customers per grocery store is estimated to be 84.5.
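The cluster estimate can be verified with the Table 2 data. The sketch below also evaluates the variance formula in A.4.1 and the corresponding bound from A.6; that second step is an extension that the exam answer does not ask for, and the names `var_hat` and `bound` are illustrative.

```python
from math import sqrt

# Table 2: customers per store in the three sampled clusters
clusters = [
    [34, 56, 78, 56, 100, 87],
    [47, 212, 220, 34, 68, 90],
    [98, 67, 88, 99, 29, 58],
]

N = 50              # clusters in the population
n = len(clusters)   # clusters in the sample
M_bar = 300 / N     # average cluster size in the population (6 stores)

totals = [sum(c) for c in clusters]  # y_i, the cluster totals
sizes = [len(c) for c in clusters]   # m_i, the cluster sizes

# A.4.1: ratio of summed cluster totals to summed cluster sizes
y_bar = sum(totals) / sum(sizes)

# Extension (not part of the exam answer): variance from A.4.1, bound from A.6
ss = sum((y - y_bar * m) ** 2 for y, m in zip(totals, sizes))
var_hat = (N - n) / (N * n * M_bar**2) * ss / (n - 1)
bound = 2 * sqrt(var_hat)

print(f"cluster totals = {totals}")
print(f"estimated mean = {y_bar}")
print(f"estimated variance = {var_hat:.3f}, bound = {bound:.2f}")
```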
Exercise 6, 10p
Fill in the right answer.
a) The drawback of a web-survey with voluntary answers is that it. . .
2p
may be biased.
costs too much.
is a very simple random sample.
b) What do we call the distribution in Figure 6?
2p
unimodal.
bimodal.
multimodal.
Figure 6: Distribution of X
[Histogram of X (axis from 0 to 100) with frequency on the vertical axis (0 to 20).]
c) When performing a significance test we would like to know about the
sample size. Why?
2p
The p-value depends on the sample size.
The sample size depends on the p-value.
The true value of the parameter depends on the sample size.
d) We can describe the overall pattern of a histogram or stem-plot by
giving its shape, centre, and . . .
2p
spread.
height.
stem.
e) When the respondent has not responded to any of the questions we
call this. . .
2p
random sampling error.
nonresponse error.
undercoverage.
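A Formulae for estimation
A.1 Srs
A.1.1 The mean and the variance of the mean
\[
\bar{\theta} = \frac{\sum_{i=1}^{n} \theta_i}{n}
\]
\[
\hat{V}(\bar{\theta}) = \frac{s^2}{n} \left( \frac{N - n}{N} \right)
\]
where N is the size of the population, n the size of the sample, and
\[
s^2 = \frac{\sum_{i=1}^{n} (\theta_i - \bar{\theta})^2}{n - 1}
\]
A.1.2 The proportion and the variance of the proportion
\[
\hat{p} = \frac{\sum_{i=1}^{n} \theta_i}{n}
\]
\[
\hat{V}(\hat{p}) = \frac{\hat{p}\hat{q}}{n - 1} \left( \frac{N - n}{N} \right)
\]
where q = 1 − p.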
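A.2 Strs
L = number of strata, Ni = number of sampling units in stratum i, and ni =
the number of sampled units.
A.2.1 The mean and the variance of the mean
\[
\bar{\theta}_{st} = \frac{1}{N} \sum_{i=1}^{L} N_i \bar{\theta}_i
\]
\[
\hat{V}(\bar{\theta}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \left( \frac{N_i - n_i}{N_i} \right) \frac{s_i^2}{n_i}
\]
A.2.2 The proportion and the variance of the proportion
\[
\hat{p}_{st} = \frac{1}{N} \sum_{i=1}^{L} N_i \hat{p}_i
\]
\[
\hat{V}(\hat{p}_{st}) = \frac{1}{N^2} \sum_{i=1}^{L} N_i^2 \hat{V}(\hat{p}_i)
\]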
A.2.3 Neyman allocation
\[
n_i = n \frac{N_i \sigma_i}{\sum_{k=1}^{L} N_k \sigma_k}
\]
assuming that the cost per observation is equal for all strata. σi is the
standard deviation in stratum i (often estimated with si).
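As an illustration of Neyman allocation, the sketch below uses the stratum sizes from Exercise 4 and takes the square roots of the sample variances reported in that exercise as stand-ins for σi. Using those estimates here is an assumption made for the example, and `neyman_allocation` is an illustrative helper name.

```python
from math import sqrt

def neyman_allocation(n, sizes, sigmas):
    """Split a total sample size n across strata in proportion to N_i * sigma_i."""
    weights = [N_i * sigma_i for N_i, sigma_i in zip(sizes, sigmas)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Stratum sizes N_i from Exercise 4
sizes = [86, 72, 52, 30]

# sigma_i estimated by s_i, using the sample variances from Exercise 4 (assumption)
sigmas = [sqrt(v) for v in (1071.786, 9054.182, 16794.28, 72376.3)]

alloc = neyman_allocation(40, sizes, sigmas)
print([round(a, 2) for a in alloc])
```

Compared with proportional allocation, the most variable stratum (IV) receives a larger share of the sample even though it contains the fewest farms. The results are not integers, so in practice they have to be rounded.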
A.2.4 Proportional allocation
\[
n_i = n \frac{N_i}{\sum_{k=1}^{L} N_k} = n \frac{N_i}{N}
\]
assuming that the cost per observation and the standard deviation are equal
for all strata.
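Proportional allocation can be checked against Exercise 4, where 40 farms are allocated across strata of sizes 86, 72, 52 and 30. The helper name `proportional_allocation` is illustrative, and rounding to the nearest integer is an assumption about how the sample sizes in Table 1 were obtained.

```python
def proportional_allocation(n, sizes):
    """Split a total sample size n across strata in proportion to N_i / N."""
    N = sum(sizes)
    return [n * N_i / N for N_i in sizes]

# Stratum sizes N_i from Exercise 4
sizes = [86, 72, 52, 30]

exact = proportional_allocation(40, sizes)
rounded = [round(x) for x in exact]

print(exact)
print(rounded)
```

Here the rounded values agree with n1 = 14, n2 = 12, n3 = 9 and n4 = 5 in Table 1. In general, rounded allocations need not sum exactly to n and may need a manual adjustment.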
A.3 Sys
A.3.1 The mean and the variance of the mean
\[
\bar{\theta}_{sy} = \frac{\sum_{i=1}^{n} \theta_i}{n}
\]
assuming a randomly ordered population we have
\[
\hat{V}(\bar{\theta}_{sy}) = \frac{s^2}{n} \left( \frac{N - n}{N} \right)
\]
A.3.2 The proportion and the variance for proportion
\[
\hat{p}_{sy} = \frac{\sum_{i=1}^{n} \theta_i}{n}
\]
\[
\hat{V}(\hat{p}_{sy}) = \frac{\hat{p}_{sy} \hat{q}_{sy}}{n - 1} \left( \frac{N - n}{N} \right)
\]
A.4 Clus
N = the number of clusters in the population, n = the number of clusters
selected in a simple random sample, mi = the number of elements in cluster
i, m̄ = the average cluster size for the sample, M = the number of elements in
the population, M̄ = the average cluster size for the population (M/N), θi = the
total of all observations in the ith cluster
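A.4.1 The mean and the variance of the mean
\[
\bar{\theta} = \frac{\sum_{i=1}^{n} \theta_i}{\sum_{i=1}^{n} m_i}
\]
\[
\hat{V}(\bar{\theta}) = \left( \frac{N - n}{N n \bar{M}^2} \right) \frac{\sum_{i=1}^{n} (\theta_i - \bar{\theta} m_i)^2}{n - 1}
\]
A.4.2 The proportion and the variance for proportion
\[
\hat{p} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i}
\]
where ai denotes the total number of elements in cluster i that possess the
characteristic of interest.
\[
\hat{V}(\hat{p}) = \left( \frac{N - n}{N n \bar{M}^2} \right) \frac{\sum_{i=1}^{n} (a_i - \hat{p} m_i)^2}{n - 1}
\]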
A.5 The total and the variance of the total
\[
\hat{\tau} = N \bar{\theta}
\]
\[
\hat{V}(N \bar{\theta}) = N^2 \hat{V}(\bar{\theta})
\]
A.6 Sample size and the margin of error
Sample size required to estimate µ with margin of error B when using Srs
or Sys
\[
n = \frac{N \sigma^2}{(N - 1) D + \sigma^2}
\]
where D = B^2 / 4.
The margin of error is
\[
B = 2 \sqrt{\hat{V}(\hat{\theta})}
\]
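A minimal sketch of the sample size formula follows. The inputs (N = 1000, σ² = 100, B = 2) are hypothetical values chosen for illustration and do not come from the exam; `srs_sample_size` is an illustrative helper name. The result is rounded up because a sample size must be a whole number that still meets the bound.

```python
from math import ceil

def srs_sample_size(N, sigma2, B):
    """Sample size needed to estimate the mean with bound B (formula in A.6)."""
    D = B**2 / 4
    return N * sigma2 / ((N - 1) * D + sigma2)

# Hypothetical inputs: population of 1000, variance 100, desired bound 2
n_exact = srs_sample_size(N=1000, sigma2=100.0, B=2.0)
n_required = ceil(n_exact)

print(f"n = {n_exact:.2f}, rounded up to {n_required}")
```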