Statistics 475 Notes 10 - Wharton Statistics Department

Statistics 475 Notes 18 Revised
Reading: Lohr, Chapter 7.1-7.4
Schedule:
On Wednesday, Professor Brown will finish up his
presentation on sampling issues in the Census.
Final homework assignment due Mon., Dec. 8th.
The last two classes, December 1 and December 3, will be
for short presentations on your final projects. I will be
providing more details shortly.
Final report on your project due Wed., Dec. 17th, 5 p.m.
I. Complex Surveys: Example
We have discussed a number of ways in which surveys can
use designs/analyses other than simple random sampling to
be more feasible or efficient:
(1) Use of auxiliary variables (ratio estimation)
(2) Stratification
(3) Clustering
(4) Unequal probability sampling
Many large surveys use more than one of these elements.
Such surveys are called complex surveys.
Example: Malaria is a serious health problem in The
Gambia, a country in West Africa. Malaria morbidity can
be reduced by using bed nets impregnated with insecticide,
but this is only effective if the bed nets are in widespread
use. In 1991, a nationwide survey was designed to estimate
the prevalence of bed net use in rural areas. The survey is
described and results reported in D’Alessandro et al. (1994,
Bulletin of the World Health Organization).
Veronica Njeri lost two children to malaria. (Evelyn Hockstein for The New York Times)
The following is a sample design that is similar to the
actual sampling design. The sampling frame consisted of
all rural villages of fewer than 3000 people in The Gambia.
The villages are divided into districts. In this population,
there are 3 regions (eastern, central and western) with 20
districts per region. The villages in each district are
stratified into two strata, those with a public health center
(PHC) and those without a public health center. There are
ten villages per stratum and 50 compounds per village. The
sampling design starts by sampling five districts in each
region selected with replacement with probability
proportional to the population (# of beds) in the district.
Within each sampled district, 2 PHC and 2 non-PHC
villages were selected with replacement with probability
proportional to the population in the village. Within each
village, 6 compounds were sampled using a simple random
sample. Within each sampled compound, we counted the
number of beds and the number of bed nets in use. A total
of 360 compounds were sampled.
In summary, the sampling design is the following:
Stage   Sampling Unit   Stratification
1       District        Region
2       Village         PHC/non-PHC
3       Compound
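The sample size implied by the stage counts above can be checked with a few lines of R. This is only an arithmetic sketch; the variable names are illustrative and are not part of the course files.

```r
# Counts taken from the design description above
regions <- 3                 # eastern, central, western
districts_per_region <- 5    # drawn with replacement, probability proportional to size
villages_per_district <- 4   # 2 PHC + 2 non-PHC
compounds_per_village <- 6   # simple random sample within each village

# Total number of sampled compounds
n_compounds <- regions * districts_per_region *
  villages_per_district * compounds_per_village

# Sampled compounds in one region-by-PHC-status combination
n_per_stratum <- districts_per_region * 2 * compounds_per_village

print(n_compounds)
print(n_per_stratum)
```

The first value reproduces the 360 sampled compounds stated in the text; the second is the per-stratum count that enters the sampling weights.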
II. Sampling Weights
In complex surveys, formulas for estimating means, totals
and especially standard errors can become horrendous. To
deal with this, we will use sampling weights for computing
estimates and resampling methods for computing standard
errors (Chapter 9).
The sampling weight for a sampled unit is the number of
units in the population that the sampled unit “represents.”
For without replacement sampling, the sampling weight for
a unit is the reciprocal of the probability that the unit is
selected in the sample. For with replacement sampling, the
sampling weight for a unit is the reciprocal of the number
of observations drawn from the sampled unit’s stratum times
the reciprocal of the per-draw probability that the observation
unit is selected into the sample on a draw from the sampled
unit’s stratum.
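As a minimal sketch of the two definitions, the following R lines compute a weight under each scheme. All numbers (N, n, the number of draws, and the per-draw probability) are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Without replacement: weight = 1 / (probability the unit is in the sample).
# Hypothetical example: simple random sample of n = 50 from N = 1000 units.
N <- 1000
n <- 50
w_wor <- 1 / (n / N)   # each sampled unit represents N/n units

# With replacement: weight = (1 / number of draws in the stratum) *
#                            (1 / per-draw selection probability).
# Hypothetical example: 5 draws, per-draw probability 0.02 for this unit.
n_draws <- 5
psi <- 0.02
w_wr <- (1 / n_draws) * (1 / psi)

print(c(without_replacement = w_wor, with_replacement = w_wr))
```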
Examples:
(1) For question 6.2 from Homework 4, recall the sampling
design:
The population schedules of the 1940 census are preserved
on 4,576 microfilm reels. Each census page contains
information on 40 individuals. Two lines on each page
were designated as “sample lines” by the Census Bureau: the
individuals falling on those lines – 5 percent of the
population – were asked a set of supplemental questions
that appear at the bottom of the census page.
Two of every five census pages were systematically
selected for examination. On each selected census page,
one of the two designated sample lines was then randomly
selected. Data-entry personnel then counted the size of the
sample unit containing the targeted sample line. Units of size
six or smaller were included in the sample in inverse
proportion to their size. Thus, every one-person unit was
included in the sample, every second two-person unit,
every third three-person unit, and so on. Units with seven
or more persons were included with a probability of 1-in-7:
every seventh household of size seven or more was selected
for the sample.
The probability that a person (say Mr. X) is selected into
the sample is the following, where we let the number of
persons in Mr. X’s household be h:
Sum over persons in Mr. X’s household of
P(person is on sampled page) *
P(person is on sampled line) *
P(sampled line is selected) *
P(household is retained) =
\[
\begin{cases}
h \cdot \dfrac{2}{5} \cdot \dfrac{2}{40} \cdot \dfrac{1}{2} \cdot \dfrac{1}{h} = \dfrac{1}{100}, & h = 1, \ldots, 7 \\[2ex]
h \cdot \dfrac{2}{5} \cdot \dfrac{2}{40} \cdot \dfrac{1}{2} \cdot \dfrac{1}{7} = \dfrac{h}{700}, & h > 7
\end{cases}
\]
Thus, the sampling weight for a person in a household of
size 7 or less is 100 and the sampling weight for a person in
a household of size h greater than 7 is 700/h.
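The weight calculation for the 1940 census sample can be sketched as a small R function. The function name `census_weight` is hypothetical; the probabilities are the ones in the product above.

```r
# Sampling weight for a person in a household of size h,
# built from the four-step selection probability described above.
census_weight <- function(h) {
  p_retain <- ifelse(h <= 7, 1 / h, 1 / 7)   # household retention step
  # h persons, each with the same chance of being on the chosen sample line
  p_select <- h * (2 / 5) * (2 / 40) * (1 / 2) * p_retain
  1 / p_select                               # weight = reciprocal of selection probability
}

weights <- census_weight(c(1, 4, 7, 10, 14))
print(weights)
```

Households of size 7 or less all receive the same weight, while larger households receive the smaller weight 700/h.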
(2) Gambia survey.
The Gambia survey is a stratified sample with the strata
consisting of combinations of region and PHC/no PHC. So
there are six strata – eastern PHC, eastern no PHC, central
PHC, central no PHC, western PHC, western no PHC.
There are 60 sampled compounds in each stratum.
The per-draw probability of a compound from a given
stratum, say central PHC, being selected into the sample is
P(district selected) * P(village selected | district selected) *
P(compound selected | district and village selected)
\[ = \frac{D1}{R} \cdot \frac{V}{D2} \cdot \frac{1}{C} \]
where
C=number of compounds in the village (C=50)
V=number of beds in the village
D2=number of beds in district in PHC villages
D1=number of beds in the compound’s district
R = number of beds in the compound’s region

For each stratum, 60 compounds are sampled (five districts,
two villages in each district, six compounds per village), so
the sampling weights are
\[ \frac{1}{60} \cdot \frac{1}{\frac{D1}{R} \cdot \frac{V}{D2} \cdot \frac{1}{C}} . \]
The Gambia bed net survey was designed so that within
each region, each compound would have almost the same
probability of being included in the survey: probabilities
varied only because different districts had different
numbers of persons.
The data in the file gambiasample.out on the course web
site are
Region = the region that the sampled compound was in (1,2,3)
District = the district that the sampled compound was in (1,2,3,4,5)
Strata = the stratum (1=non-PHC, 2=PHC) of the sampled compound
Village = the village of the sampled compound (1,2,…,10)
Compound = the compound number (1,2,…,50)
Beds = total number of beds in the compound
Nets = total number of anti-malaria nets in the compound
Dsize = total number of beds in the district [you don’t need to use this variable]
Vsize = total number of beds in the village [you don’t need to use this variable]
Psiv = single-draw probability $\psi_i$ = (# beds in village i)/(total # beds in the stratum) for selection of the villages
Psid = single-draw probability $\psi_i$ = (# beds in district i)/(total # beds in the region) for selection of the districts
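# Calculate sample weights for Gambia data
gambiadata=read.table("gambiasample.out",header=TRUE,sep=",");
attach(gambiadata);
# weight = (1/60 draws per stratum) * 1/(P(district) * P(village) * P(compound)),
# where P(compound) = 1/50 because each village has 50 compounds
sample.weight=(1/60)*1/(psid*psiv*(1/50));
hist(sample.weight)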
III. Estimation Using Sample Weights
The sample weight for a sampled unit is the number of
units in the population that the sampled unit represents.
Thus, our general estimate of the population total is
\[ \hat{t} = \sum_{i \in S} w_i y_i \]
For estimating the population mean, we could use $\hat{t}/N$,
where N is the population size. But a generally more
efficient estimator uses ratio estimation. Note that
\[
\bar{y} = \frac{t}{N}
= \frac{\sum_{i=1}^{N} w_i y_i}{\sum_{i=1}^{N} w_i}
= \frac{\frac{1}{n} \sum_{i=1}^{N} w_i y_i}{\frac{1}{n} \sum_{i=1}^{N} w_i}
\]
We estimate $\bar{y}$ by the ratio estimator:
\[
\hat{\bar{y}} = \frac{\hat{t}}{\sum_{i \in S} w_i}
= \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i}
\]
The denominator $\sum_{i \in S} w_i$ estimates the number of units N
in the population. It is more efficient to use $\sum_{i \in S} w_i$
than N in the denominator, because if $\sum_{i \in S} w_i$ is smaller
than N, we would expect $\sum_{i \in S} w_i y_i$ to be smaller than t.
Conversely, if $\sum_{i \in S} w_i$ is larger than N, we would expect
$\sum_{i \in S} w_i y_i$ to be larger than t.
R code:
that=sum(sample.weight*nets);
ybarhat=that/sum(sample.weight);
> that
[1] 6029881
> ybarhat
[1] 89.19908
Thus, we estimate that there are an average of 89 nets per
compound.
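To see the two mean estimators side by side without the course data file, here is a self-contained toy sketch. The values of y, w, and N are hypothetical and are not taken from the Gambia survey.

```r
y <- c(1, 2, 3)      # hypothetical observed values
w <- c(10, 20, 30)   # hypothetical sampling weights
N <- 70              # hypothetical known population size

t_hat <- sum(w * y)            # weighted estimate of the population total
ybar_ratio <- t_hat / sum(w)   # ratio estimator: divide by the estimated N
ybar_N <- t_hat / N            # alternative: divide by the known N

print(c(t_hat = t_hat, ybar_ratio = ybar_ratio, ybar_N = ybar_N))
```

In this toy case the weights sum to 60, which is less than N = 70, so dividing the total by N gives a smaller value than the ratio estimator.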
We will discuss how to obtain a standard error and
confidence interval later.
IV. Specifying the sample design in the Survey package
By putting the data into the survey package in R, we will
be able to use a number of convenient built-in functions.
# Set up survey design for Gambia data
library(survey);
gambiadesign=svydesign(id=~village+compound,weights=~sample.weight,
data=gambiadata);
svymean(nets,design=gambiadesign);
svytotal(nets,design=gambiadesign);
> svymean(nets,design=gambiadesign);
mean SE
[1,] 89.199 10.185
> svytotal(nets,design=gambiadesign);
total SE
[1,] 1004980 177023
IGNORE THESE STANDARD ERRORS – THEY DO
NOT ACCOUNT FOR THE SURVEY DESIGN
V. Estimating a distribution function
So far we have concentrated on estimating population
means, totals and ratios. But statistics other than means or
totals may be of interest. For example, we might be
interested in a median or a histogram. We can estimate any
of these quantities with sampling weights. The sampling
weights allow us to construct an empirical distribution for
the population.
Suppose values for the entire population of N units are
known. Then any quantity of interest may be calculated
from the probability mass function:
\[ f(y) = \frac{\text{number of units whose value is } y}{N} \]
or the distribution function
\[ F(y) = \frac{\text{number of units whose value is} \le y}{N} \]
Any population quantity can be calculated from the
probability mass function or distribution function.
Sampling weights allow us to construct an estimated
(empirical) probability mass function and distribution
function:
\[ \hat{f}(y) = \frac{\sum_{i \in S,\, y_i = y} w_i}{\sum_{i \in S} w_i} \]
Here, we weight each observation by its sampling weight.
The estimated distribution function is
\[ \hat{F}(y) = \sum_{x \le y} \hat{f}(x) . \]
R Code:
# Plot estimated CDF
cdfest=svycdf(~nets,design=gambiadesign);
plot(cdfest);
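The weighted probability mass function and distribution function can also be computed by hand in base R, which may help in seeing what the estimated CDF represents. The following is a sketch on hypothetical toy values, not the Gambia data; the weighted median shown at the end is one simple convention (the smallest y at which the estimated CDF reaches 0.5).

```r
y <- c(0, 2, 2, 5)       # hypothetical observed values
w <- c(10, 30, 20, 40)   # hypothetical sampling weights

# Estimated pmf: total weight at each distinct y, divided by total weight
f_hat <- tapply(w, y, sum) / sum(w)

# Estimated CDF at each distinct y: running sum of the pmf
F_hat <- cumsum(f_hat)

# Weighted median: smallest y whose estimated CDF is at least 0.5
med_hat <- as.numeric(names(F_hat)[which(F_hat >= 0.5)[1]])

print(f_hat)
print(F_hat)
print(med_hat)
```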
The estimated CDF can also be used to plot histograms and
boxplots that incorporate the sampling weights.
svyhist(~nets,gambiadesign);
# Compare a boxplot that uses the sampling weights to a boxplot that ignores
# the sampling weights
par(mfrow=c(1,2));
svyboxplot(nets~1,gambiadesign,ylab="Nets",main="Boxplot Using Weights");
boxplot(nets,ylab="Nets",main="Usual Boxplot Ignoring Weights");
One other useful plot is a bubble plot that plots each point
with a bubble that is proportional in size to the sampling
weight.
svyplot(nets~region,gambiadesign,xlab="Region",ylab="Nets",main="Bubble Plot of Nets in Regions");