Survey sampling

Survey sampling
Sampling & non-sampling error
 Bias
 Simple sampling methods
 Sampling terminology
 Cluster sampling
 Design effect
 Stratified sampling
 Sampling weights

Why sample?
To make an inference about a
population
Studying the entire population is
impractical or impossible

Example of sampling
Estimate the proportion of adults,
ages 18-65, in Port Elizabeth that
have type 2 diabetes
 Select a sample from which to
estimate the proportion
 Population: adults aged 18-65 living
in Port Elizabeth
 Inference: proportion with type 2
diabetes

Probability sampling
Each individual has known (nonzero) probability of selection
 Precision of estimates can be
quantified

Non-probability sampling
Cheaper, more convenient
 Quality of estimates cannot be
assessed
 May not be representative of
population

Sampling error
v.
Non-sampling error
Sampling error
Random variability in sample
estimates that arises out of the
randomness of the sample selection
process
 Precision can be quantified
(estimation of standard errors,
confidence intervals)

Non-sampling error

Estimation error that arises from
sources other than random variation
– non-response
– undercoverage of survey
– poorly-trained interviewers
– non-truthful answers
– non-probability sampling

This type of error is a bias
What is bias?
We want to estimate the mean weight of all
women aged 15-44 living in Coopersville.
Suppose there are 50,000 such women and
the true mean weight is 61.7 kg.
We select a sample of 200 such women and
interview them, asking each woman what
her weight is.
The sample mean weight is 59.4 kg.
Is our estimate biased?
Bias
Suppose we could repeat the survey
many, many times.
Then we compute the mean of all the
sample means.
Say the mean of the means = 62.9
Bias = (mean of means) - (true mean)
= 62.9 - 61.7 = 1.2 kg
Unbiased estimation
If . . .
(mean of the means) = (true mean)
then the bias is zero, and we say that
the estimator is unbiased.
The “mean of the means” is called the
“expected value” of the estimator.
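
The "repeat the survey many times" idea can be imitated by simulation. The sketch below is only an illustration: the population is randomly generated (it is not the Coopersville data), and the assumption that every respondent reports 1.2 kg above her true weight is invented to produce a bias of the size shown in the slide. The function name `mean_of_sample_means` and the `report` parameter are hypothetical.

```python
import random
import statistics

def mean_of_sample_means(population, n, report, repeats, seed=0):
    """Average the sample mean over many repeated simple random samples.

    `report` maps a person's true value to the value recorded in the survey.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = rng.sample(population, n)
        means.append(statistics.fmean([report(w) for w in sample]))
    return statistics.fmean(means)

# Hypothetical population of 50,000 weights in kg (assumed, not real data).
gen = random.Random(1)
population = [gen.gauss(61.7, 9.0) for _ in range(50_000)]
true_mean = statistics.fmean(population)

# Accurate reporting versus an assumed +1.2 kg reporting shift.
accurate = mean_of_sample_means(population, 200, lambda w: w, repeats=2000)
shifted = mean_of_sample_means(population, 200, lambda w: w + 1.2, repeats=2000)

bias_accurate = accurate - true_mean   # close to 0: the estimator is unbiased
bias_shifted = shifted - true_mean     # close to +1.2 kg: the estimator is biased

print(f"bias with accurate reporting: {bias_accurate:+.3f} kg")
print(f"bias with shifted reporting:  {bias_shifted:+.3f} kg")
```

A single sample mean can fall well away from the true mean even when the estimator is unbiased; bias is a property of the average over repeated surveys, not of one result.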
Simple sampling methods
Task: Select a sample of n
individuals or items from a
population of N individuals or
items
 Common methods

– simple random sampling
– systematic sampling
Simple sampling methods

Simple random sampling (SRS)
– each item in population is equally likely
to be selected
– each combination of n items is equally
likely to be selected
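
A minimal sketch of simple random sampling, assuming the population items are numbered 1 to N. The function name `simple_random_sample` and the values N = 50, n = 5 are illustrative only. Python's `random.sample` draws without replacement, so every combination of n items is equally likely.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Return n distinct item numbers drawn from 1..N without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))

# Illustrative values only: 5 items out of a population of 50.
chosen = simple_random_sample(N=50, n=5, seed=42)
print(chosen)
```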

Systematic sampling (typical method)
– randomly select a starting point
– select every kth item thereafter
Systematic sampling example
Stack of 213 hospital admission forms; select a
sample of 15
213/15 = 14.2, so select every 14th form
Starting point: random number between 1 and 14
(we choose 11)
First form selected is 11th from top
Second form selected is 25th from top (11 + 14 = 25)
Third form selected is 39th from top (11 + 2x14 = 39)
And so forth . . .
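
The quick method in this example can be sketched as follows. The function name `systematic_sample` is hypothetical, and the starting point is passed in explicitly so the slide's choice of 11 can be reproduced; in the field it would be drawn at random, for example with `random.randint(1, k)`. This sketch keeps every position up to N, which is an assumption about how the end of the stack is handled.

```python
def systematic_sample(N, n, start):
    """Quick field method: every k-th item from `start`, with k = floor(N / n)."""
    k = N // n
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    # Positions are counted from the top of the stack, beginning at 1.
    return list(range(start, N + 1, k))

forms = systematic_sample(N=213, n=15, start=11)
print(forms[:3], "...", forms[-1], f"({len(forms)} forms)")
```

Because 14 x 15 = 210 is less than 213, starting points 1, 2 and 3 yield 16 forms rather than 15 under this rule, which is one reason the slides call the quick method an approximation.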
Systematic sampling, continued
What is the probability that the 146th
form will be selected? The 195th?
 Does this qualify as a simple random
sample? Why or why not?
 Is there any potential problem arising
from the use of systematic sampling
in this situation?

The example used a typical
quick method
In the preceding example, we
selected every 14th form
 Ideally, we would select every 14.2th
form (see later example on 2-stage
sample of nurses)
 Example is a quick and easy method,
commonly used in the field; it is a
good approximation to the more
rigorous procedure

Systematic sampling: + and –
Advantages of systematic sampling
– typically simpler to implement than SRS
– can provide a more uniform coverage

Potential disadvantage of systematic
sampling
– can produce a bias if there is a
systematic pattern in the sequence of
items from which the sample is selected
Role of simple sampling methods

These simple sampling methods are
necessary components of more
complex sampling methods:
– cluster sampling
– stratified sampling

We’ll discuss these more complex
methods next (following some
definitions)
Definitions

Listing units (or enumeration units)
– the lowest level sampled units (e.g.,
households or individuals)

PSUs (primary sampling units)
– the first units sampled (e.g., states or
regions)

Sampling probability
– for any unit eligible to be sampled, the
probability that the unit is selected in
the sample
More definitions

EPSEM sampling
– “equal probability of selection method”,
thus a method in which each listing unit
has the same sampling probability

Sampling frame
– the set of items from which sampling is
done--often a list of items.
More definitions

Undercoverage: the degree to which
we fail to identify all eligible units in
the population
– incomplete lists
– incomplete or incorrect eligibility
information
Still more definitions

Non-response: failure to interview
sampled listing units (study subjects)
– refusal
– death
– physician refusal
– inability to locate subject
– unavailability
Still more definitions

Precision: the amount of random
error in an estimate
– often measured by the width or half-width of the confidence interval
– standard error is another measure of
precision
– estimates with smaller standard error or
narrower CI are said to be more precise
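
As a rough illustration of these two measures, the sketch below computes the standard error and the 95% confidence-interval half-width for an estimated proportion. It assumes simple random sampling, the usual normal approximation, and no finite population correction. The estimate of 0.10 and the sample sizes are hypothetical, as is the function name `proportion_precision`.

```python
import math

def proportion_precision(p_hat, n, z=1.96):
    """Standard error and CI half-width for a proportion under SRS."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return se, z * se

# Hypothetical estimate of 10% from two different sample sizes.
se_small, hw_small = proportion_precision(0.10, 400)
se_large, hw_large = proportion_precision(0.10, 1600)

print(f"n = 400:  SE = {se_small:.4f}, half-width = {hw_small:.4f}")
print(f"n = 1600: SE = {se_large:.4f}, half-width = {hw_large:.4f}")
```

Quadrupling the sample size halves the standard error, so the larger sample gives the more precise estimate.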
CLUSTER SAMPLING
single stage
Clusters
Subsets of the listing units in the
population
 Set of clusters must be mutually
exclusive and collectively exhaustive

– counties
– townships
– regions
– institutions
Example
Single-stage cluster sampling
There are 361 nurses working at the
31 hospitals and clinics in Region 4
 We wish to interview a sample of
these nurses

– select a simple random sample of 5
hospitals/clinics
– interview all nurses employed at the 5
selected institutions
Assessing the example
Hospitals/clinics are the PSUs
 Nurses are the listing units
 Sampling probability for each nurse
is 5/31
 Thus, this is an EPSEM sample
 Sampling frame is the list of 31
hospitals and clinics

CLUSTER SAMPLING
two stage
Cluster sampling -- two stage
Select a sample of clusters, as in the
single-stage method
 From each selected cluster, select a
subsample of listing units

Cluster sampling -- two stage

It is always nice to do EPSEM
sampling because such samples are
self-weighting
– don’t need sampling weights in analysis

A common EPSEM method for two-stage
sampling is PPS (probability
proportional to size)
PPS sampling

The key to the method is that the
sampling probabilities of clusters in
the first stage are proportional to the
“sizes” of the clusters
– size = number of listing units in cluster

At stage 2, select the same number
of listing units from each selected
cluster
Nurse example revisited
Two-stage sampling
We want to interview a sample of 36
nurses
 We can afford to visit 9 different
hospitals/clinics
 Thus, we need to interview 36/9 = 4
nurses at each institution

Nurse example revisited
Two-stage sampling

Stage 1: select a sample of 9
hospitals/clinics
– Selection prob. proportional to “size”
Stage 2: select a sample of 4 nurses
from each selected institution
 At each stage, use one of the simple
sampling methods

Nurse example revisited
Two-stage sampling
PSUs are the hospitals/clinics
 Listing units are the nurses
 Sampling frames

– Stage 1: List of 31 hospitals/clinics
– Stage 2: Lists of nurses at each
selected hospital/clinic
Selecting 2-stage nurse sample
Sampling interval, I = 361/9 = 40.1
Starting point, random number between 1
and 40; we choose R = 14
First sampling number = R = 14
2nd sampling number = 14 + 1x40.1 = 54.1
3rd sampling number = 14 + 2x40.1 = 94.2
We have selected institutions 2, 5, 9, . . .
Two-stage nurse sample
Institution   No. of   Cumulative   Sampling
Number        Nurses   Nurses       Number
1             12       12
2             7        19           14
3             9        28
4             18       46
5             11       57           54.1
6             7        64
7             10       74
8             14       88
9             8        96           94.2
.             .        .
.             .        .
31            9        361
Total         361
Applying the sampling numbers
For each sampling number, choose
the first unit with cumulative “size”
equal to or greater than the sampling
number
 Example: sampling number 54.1
– first unit with cumulative size ≥ 54.1
is unit 5 (cum. no. of nurses = 57)
– so we select unit 5 for the sample
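
The rule above can be sketched in a few lines. The function name `pps_select` is hypothetical. Only institutions 1 to 9 are listed in the slides, so the sketch uses just those sizes and the first three sampling numbers; the remaining institutions are not filled in.

```python
from bisect import bisect_left
from itertools import accumulate

def pps_select(sizes, sampling_numbers):
    """For each sampling number, return the 1-based number of the first unit
    whose cumulative size is equal to or greater than that number."""
    cumulative = list(accumulate(sizes))
    selected = []
    for s in sampling_numbers:
        if not 0 < s <= cumulative[-1]:
            raise ValueError(f"sampling number {s} is outside the listed units")
        selected.append(bisect_left(cumulative, s) + 1)
    return selected

# Nurses at institutions 1-9, as listed; institutions 10-30 are not shown.
nurses = [12, 7, 9, 18, 11, 7, 10, 14, 8]
selected = pps_select(nurses, [14, 54.1, 94.2])
print(selected)
```

`bisect_left` returns the position of the first cumulative total that is not smaller than the sampling number, which matches the "equal to or greater than" rule.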

Optional challenge
What is the selection probability for institution 1?
12/40.1 = 0.299
What is the selection probability for a nurse in
institution 1?
(12/40.1) x (4/12) = 0.0998 ≈ 36/361
What is the selection probability for a nurse in
institution 2?
(7/40.1) x (4/7) = 0.0998 ≈ 36/361
All nurses have the same selection probability.
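
The cancellation behind this result can be checked with exact fractions. In the sketch below the sampling interval is taken as exactly 361/9 rather than the rounded 40.1; the function name `nurse_selection_probability` is hypothetical.

```python
from fractions import Fraction

TOTAL_NURSES = 361
CLUSTERS_SELECTED = 9
TAKE_PER_CLUSTER = 4
INTERVAL = Fraction(TOTAL_NURSES, CLUSTERS_SELECTED)  # exact value, about 40.1

def nurse_selection_probability(cluster_size):
    """Stage 1 probability (size / interval) times stage 2 probability (4 / size)."""
    p_cluster = Fraction(cluster_size) / INTERVAL
    p_within = Fraction(TAKE_PER_CLUSTER, cluster_size)
    return p_cluster * p_within

# Institutions 1, 2 and 4 have 12, 7 and 18 nurses respectively.
probs = {size: nurse_selection_probability(size) for size in (12, 7, 18)}
for size, p in probs.items():
    print(f"cluster of {size}: {p} = {float(p):.4f}")
```

The cluster size cancels between the two stages, leaving 4 divided by the interval, or 36/361, whatever the size of the institution.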
Why do cluster sampling instead
of a simple sampling method?

Advantages
– reduced logistical costs (e.g., travel)
– list of all 361 nurses may not be available
(reduces listing labor)

Disadvantages
– estimates are less precise
– analysis is more complicated (requires
special software)
Design effect

Relative increase in variance of an
estimate due to the sampling design
– “variance” = (standard error)^2

Formula
– s1 = standard error under simple
random sampling
– s2 = standard error under complex
sampling design (e.g., cluster sampling)
– design effect = (s2/s1)^2
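
A small numerical illustration of the formula. The two standard errors below are hypothetical values chosen for the example, not results from any survey, and the function name `design_effect` is an assumption.

```python
def design_effect(se_complex, se_srs):
    """Ratio of variances: (SE under the complex design / SE under SRS) squared."""
    return (se_complex / se_srs) ** 2

# Hypothetical standard errors for the same estimate and the same sample size.
deff = design_effect(se_complex=0.030, se_srs=0.020)
print(f"design effect = {deff:.2f}")
```

A standard error 1.5 times as large as under simple random sampling corresponds to a design effect of about 2.25, because the design effect compares variances rather than standard errors.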
Design effect for cluster sampling
For cluster sampling designs, the
design effect is almost always > 1
 This means that estimates from a
survey done with cluster sampling
are less precise than corresponding
estimates obtained from a survey
having the same sample size done
with simple random sampling

Cluster sizes
Recommended “take” per cluster is
20-40 for multi-purpose surveys
 Time and resource limitations will
often dictate the maximum number of
clusters you can include in the study
 Including more clusters improves the
precision of your estimates more
than a corresponding increase in
sample size within the clusters
already in the sample

STRATIFIED
SAMPLING
Strata
Subsets of the listing units in the
population
 Set of strata must be mutually
exclusive and collectively exhaustive
 Strata are often based on
demographic variables

– age
– sex
– race
Stratified sampling
Sample from each stratum
 Often, sampling probabilities vary
across strata

Stratified sampling

Advantages
– guarantees coverage across strata
– can over-sample some strata in order to obtain
precise within-stratum estimates
– typically, design effect < 1

Disadvantages
– with unequal sampling probabilities, sampling
weights must be included in analysis
• more complicated
• requires special software
Example: sampling breast cancer
cases for the Women’s CARE Study

Stratification variables
– geographic site
– race (2 races)
– five-year age group
Over-sampled younger women
 Over-sampled black women

Example: Sampling households
for a reproductive health survey
in 11 refugee camps in Pakistan
Selected simple random sample of
households from within each of the
11 camps
 All households were selected with
the same probability

Refugee camp sampling
Camp           Population   Sample Size   Completed Interviews
Lakhte Banda   12,943       64            61
Kotki 1        7,262        36            29
Kotki 2        5,781        29            21
Kata Kanra     8,437        42            38
Mohd Khoja     12,791       63            45
Doaba          13,584       67            25
Darsamand      17,797       88            53
Kahi           11,061       55            32
Naryab         5,543        28            19
Thal 1         11,087       55            44
Thal 2         17,130       85            60
Dallan         10,990       55            45
Total          134,406      667           472
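
The sample sizes and completed interviews in this table can be turned into response rates per camp. The sketch below only does that arithmetic on the figures shown; it makes no assumption about why interviews were not completed.

```python
# (sample size, completed interviews) per camp, copied from the table.
camps = {
    "Lakhte Banda": (64, 61),
    "Kotki 1": (36, 29),
    "Kotki 2": (29, 21),
    "Kata Kanra": (42, 38),
    "Mohd Khoja": (63, 45),
    "Doaba": (67, 25),
    "Darsamand": (88, 53),
    "Kahi": (55, 32),
    "Naryab": (28, 19),
    "Thal 1": (55, 44),
    "Thal 2": (85, 60),
    "Dallan": (55, 45),
}

response_rate = {camp: done / sampled for camp, (sampled, done) in camps.items()}
total_sampled = sum(sampled for sampled, _ in camps.values())
total_done = sum(done for _, done in camps.values())
overall_rate = total_done / total_sampled

# List camps from lowest to highest response rate.
for camp, rate in sorted(response_rate.items(), key=lambda item: item[1]):
    print(f"{camp:<13} {rate:.0%}")
print(f"{'Overall':<13} {overall_rate:.0%}")
```

Response rates differ considerably between camps, so although households were selected with equal probability, the completed interviews do not represent each camp in the same proportion.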
The sampling operation

Must be carefully controlled
– don’t leave to discretion in the field
– use a carefully defined procedure

Document what you did
– for reference during analysis
– to defend your study
Sampling frames

A list containing all listing units is
great if you can get it
– ok if it includes some ineligibles

Problems associated with geographic
location-based sampling
– map-based sampling
– EPI sampling
Sampling weights
Inverse of the net sampling
probability
 Interpretation: the sampling weight
for a sampled individual is the
number of individuals his/her data
“represent”

Example--sampling weights

There are 150 employees in a firm
– stratum 1: 50 employees aged 18-29
– stratum 2: 100 employees aged 30-69
We sample 10 from each stratum
 Sampling probabilities are

– stratum 1: 10/50 = 0.20
– stratum 2: 10/100 = 0.10
Example: sampling weights
(continued)

Sampling weights
– stratum 1: 1/0.20 = 5
– stratum 2: 1/0.10 = 10

Interpretation:
– Each sampled employee in stratum 1
represents 5 employees
– Each sampled employee in stratum 2
represents 10 employees
What about non-response?
1 employee in the stratum 1 sample
and 3 employees in the stratum 2
sample refuse to participate in the
survey
 Net sampling probabilities

– stratum 1: 9/50 = 0.18
– stratum 2: 7/100 = 0.07
Revised sampling weights

Sampling weights revised for non-response
– stratum 1: 1/0.18 = 5.56
– stratum 2: 1/0.07 = 14.29

This computation is often done by
multiplying the original sampling
weights by adjustment factors to
account for non-response rates
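
The adjustment-factor route gives the same revised weights as the net sampling probabilities. The sketch below uses the firm example's numbers; the function name `sampling_weights` and the dictionary layout are illustrative choices.

```python
strata = {
    "stratum 1": {"N": 50, "sampled": 10, "responded": 9},
    "stratum 2": {"N": 100, "sampled": 10, "responded": 7},
}

def sampling_weights(N, sampled, responded):
    """Return (original weight, non-response adjustment factor, revised weight)."""
    base = N / sampled                 # inverse of the sampling probability
    adjustment = sampled / responded   # inverse of the response rate
    return base, adjustment, base * adjustment

results = {name: sampling_weights(**s) for name, s in strata.items()}
for name, (base, adjustment, revised) in results.items():
    print(f"{name}: {base:.0f} x {adjustment:.3f} = {revised:.2f}")
```

Within each stratum the revised weights of the respondents add up to the stratum size (9 x 5.56 ≈ 50 and 7 x 14.29 ≈ 100), so the respondents together still represent all 150 employees.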
Post-stratification weighting
Define strata, which may or may not have
been used as strata in the sampling design
Compute sampling probabilities = proportion
of each stratum that was actually sampled
Compute sampling weights from these
sampling probabilities
Allows post-hoc treatment of unequal
representation of population segments in
the sample
Discussion topics
What is the population of interest?
 Infinite populations
 Selecting random numbers
 Selecting simple random samples

– from finite populations
– from infinite populations

Analysis software for complex
surveys