What is sampling? - VAM Resource Center

Sampling
Methodology
Intermediate Training in
Quantitative Analysis
Bangkok 19-23 November 2007
Some materials are modified from the presentation
‘Comprehensive Survey Design’, Bradley A. Woodruff, CDC
LEARNING PROGRAMME
Topics to be covered in this
presentation
1. Basic Introduction
2. Bias and error, accuracy and precision
3. Calculating sample size
4. Sampling Methodologies
5. Final Exercise
LEARNING PROGRAMME - 2
Learning objectives
By the end of this session, the participant
should be able to:
 Differentiate between precision and
accuracy, bias and error
 Calculate sample size
 Understand different sampling
methodologies
LEARNING PROGRAMME - 3
Starting point
 Define objectives of the survey:
 Specific indicators to be measured (food sec, nutr)
 Target groups (displaced hhs)
 Population groups or geographic areas to be
included/studied in survey
 Must also determine the level(s) at which to survey
(the unit of analysis)
 Community
 Household (most common for CFSVAs)
 Children under 5 years of age
LEARNING PROGRAMME - 4
Survey Starting point cont.
 Must clearly define geographic area to be surveyed
 Defines population to which results can be generalized
 May be defined by:
 Area in which a programme has been implemented or is
planned
 An easily defined political unit: district, province, country
 Combination of units: rural areas in a province,
 Other units, such as livelihood zones, agro-ecological
zones, etc.
LEARNING PROGRAMME - 5
What is a cross-sectional survey?
A cross-sectional survey is a collection of data
from a specific population at a single point in time.
CFSVAs and EFSAs are typically cross-sectional
surveys
Often referred to as a ‘snapshot in time’
Sometimes referred to as a population survey.
(FSMS is typically a longitudinal survey)
LEARNING PROGRAMME - 6
What is sampling?
Sampling is the process of selecting a number of
subjects (a “sample population”) from all the subjects
in a “target population” or “universe.”
Source: Last. A Dictionary of Epidemiology
LEARNING PROGRAMME - 7
Two sampling methods
Probability: random methods decide who is selected,
and the chance of a person being selected is known
Non-probability: subjective judgment is used to select the
sample, and you do not know the chance of a
person being selected
LEARNING PROGRAMME - 8
Why use probability
sampling??
 To estimate/ measure certain outcomes (prevalence
of child malnutrition, food insecurity, etc) for a larger
population by measuring only a sub-set of that
population
 Without probability sampling, a correct estimate for
the larger population could only be attained by
measuring the entire population
We will focus exclusively on probability sampling methods
LEARNING PROGRAMME - 9
Bias and Error,
Accuracy and Precision
LEARNING PROGRAMME - 10
Bias and error
[Diagram relating non-sampling bias, sampling bias, and sampling error]
LEARNING PROGRAMME - 11
Non-sampling bias
Bias introduced into the survey that is not related to
your sampling methodologies/ sample schemes
Always present to some extent and immeasurable
Examples:
Sampling frame out of date / no accurate
population numbers or household locations;
non-response to certain modules of the
questionnaire, for whatever reason;
measurement error: child ages and weights not
recorded correctly
LEARNING PROGRAMME - 12
Sampling bias
Bias that is introduced by inadequate sampling
methodologies
 Almost impossible to measure
 Examples:
 Non representative sampling
 Failure to weight
LEARNING PROGRAMME - 13
Sampling error
Difference between survey result and population value
due to random selection of sample
Measurable and can be accounted for
Example:
15% GAM rate in the survey population but 10% GAM rate in the
overall population (error of 5 percentage points)
Sampling error is influenced by:
 Sample size
 Sampling scheme
 The spread of the indicator we want to measure
LEARNING PROGRAMME - 14
Sampling error
Measures of sampling error:
 Confidence limits
 Standard error
 Coefficient of variation
 Probability values (P values)
 Others
Use these measures to:
 Calculate sample size prior to sampling
 Determine how sure we are of result after analysis
LEARNING PROGRAMME - 15
Bias and error need to be understood
within the context of two other terms…
Accuracy:
The degree to which a measurement, or an estimate based on
measurements, represents the true value of the attribute that is being
measured
How close the sample pop estimate is to true pop value
Precision:
Precision corresponds to the reduction of random error.
How close are the sample pop estimates if the survey is repeated
A measurement can be precise (low random error) but still inaccurate
(because of a systematic bias): give examples
LEARNING PROGRAMME - 17
Accuracy: obtaining results
close to truth
Driven by whether the
instrument accurately
measures what is intended;
whether the pop measured is
representative of true pop; etc
(whether there is bias)
[Figure: results of Survey 1, Survey 2 and Survey 3 shown against the real population value]
LEARNING PROGRAMME - 18
Precision: obtaining similar
results with repeated
measurement
Driven by sample size
(error in the sample)
LEARNING PROGRAMME - 19
How do bias and error relate to
these terms?
 Bias (both sampling and non sampling)
affects accuracy
 Sampling Error affects precision and
precision can be controlled through sample
size
LEARNING PROGRAMME - 20
Explain survey estimates in terms
of each of these terms
LEARNING PROGRAMME - 21
And this??
LEARNING PROGRAMME - 22
And???
LEARNING PROGRAMME - 23
Finally, this???
LEARNING PROGRAMME - 24
Calculating sample size
LEARNING PROGRAMME - 25
Calculate sample size
Sample size calculation determines the number of
individuals that need to be interviewed in order to
properly estimate information for a larger population
Why calculate sample size?
 Collecting data is expensive
 Collecting data and specimens is inconvenient for
subjects
 Collecting data takes time.
LEARNING PROGRAMME - 26
Calculate sample size
To estimate sample size for a single survey, we need to
know:
 Estimate of the prevalence of the outcome (% food
insecure hhs, % of wasted children, etc.)
 Precision desired
 Size of total population
 Level of confidence (always use 95%)
LEARNING PROGRAMME - 27
Calculate sample size
To calculate sample size for estimate of
prevalence with 95% confidence limit:
$$N = \frac{1.96^2 \times P(1-P)}{d^2}$$
1.96 = Z value for p = 0.05 or 95% confidence
intervals
(1.64= Z value for p=0.10 or 90% confidence
intervals)
P = Estimated prevalence
d = Desired precision (for example, 0.08 for ± 8%)
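The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `srs_sample_size` is hypothetical, and the result is rounded up to a whole sampling unit, which is a convention assumed here rather than stated on the slide. As on the slide, the formula assumes a large population, so total population size does not enter the calculation.

```python
import math

def srs_sample_size(prevalence, precision, z=1.96):
    """Sample size for estimating a prevalence under simple random sampling.

    Implements N = z^2 * P * (1 - P) / d^2, rounded up to a whole unit
    (the rounding is an assumption of this sketch).
    """
    n = (z ** 2) * prevalence * (1 - prevalence) / (precision ** 2)
    return math.ceil(n)

# Estimated prevalence of 50%, desired precision of +/- 8 percentage points
print(srs_sample_size(0.5, 0.08))  # 151

# Same prevalence, tighter precision of +/- 5 percentage points
print(srs_sample_size(0.5, 0.05))  # 385
```

Tightening the precision from ± 8% to ± 5% more than doubles the required sample, because the precision enters the formula squared.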
LEARNING PROGRAMME - 28
Precision and sample size
Effect of Changing the Desired Precision
(assume 95% CI, +/- .05, large population)
[Chart: required sample size (axis from 0 to 3000) plotted against the width of the confidence interval (axis from 0.00 to 0.20)]
LEARNING PROGRAMME - 29
Calculate sample size
Where to get information to make assumption
about prevalence?
Prior surveys
Qualitative estimates
Wild guesses
Err toward an assumed prevalence of 50%
when calculating sample size.
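One way to see why 50% is the cautious choice: the term P(1-P) in the sample size formula is largest at P = 0.5, so assuming 50% never gives a sample that is too small. The sketch below tabulates this for a precision of ± 5%; the function name and the list of prevalences are illustrative assumptions.

```python
import math

def srs_sample_size(prevalence, precision=0.05, z=1.96):
    """N = z^2 * P * (1 - P) / d^2, rounded up (rounding assumed for this sketch)."""
    return math.ceil((z ** 2) * prevalence * (1 - prevalence) / (precision ** 2))

prevalences = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
sizes = {p: srs_sample_size(p) for p in prevalences}

for p, n in sizes.items():
    print(f"P = {p:.1f}: n = {n}")

# The prevalence that demands the largest sample
worst_case = max(sizes, key=sizes.get)
print("Largest sample needed at P =", worst_case)  # 0.5
```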
LEARNING PROGRAMME - 30
Estimated prevalence and
sample size
Effect of Changing the Estimated Prevalence
(assume 95% CI, +/- .05, large population)
[Chart: required sample size (axis from 0 to 500) plotted against the estimated prevalence (axis from 0.00 to 1.00)]
LEARNING PROGRAMME - 31
What about sample size for a
Cluster survey?
$$\text{Design effect} = \frac{\text{Variance with cluster sampling}}{\text{Variance with SRS}} = \frac{\text{Sample size of a cluster sample that gives the same CI as SRS}}{\text{Sample size with SRS}}$$
SRS: Simple Random Sampling
LEARNING PROGRAMME - 32
Design effect
Generally speaking, households in the
same village are often similar to each other (there is
an intra-cluster correlation).
Twenty households from two villages will not tell us as
much about the entire population as twenty
households all coming from different villages.
The higher the intra-cluster correlation and the more
households come from the same cluster, the higher
the design effect. Example: by chance, the 2 villages
selected are both predominantly fisherfolk.
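The relationship described above is often summarised with a standard approximation that is not given on the slide: DEFF ≈ 1 + (m - 1) × ρ, where m is the number of households per cluster and ρ is the intra-cluster correlation. The sketch below assumes equal-sized clusters, and the value ρ = 0.1 is purely hypothetical.

```python
def approximate_design_effect(households_per_cluster, intra_cluster_correlation):
    """Common approximation DEFF = 1 + (m - 1) * rho, assuming equal-sized clusters."""
    return 1 + (households_per_cluster - 1) * intra_cluster_correlation

rho = 0.1  # hypothetical intra-cluster correlation, for illustration only

# Twenty households from two villages (10 per village)
two_villages = approximate_design_effect(10, rho)

# Twenty households, each from a different village (1 per village)
twenty_villages = approximate_design_effect(1, rho)

print(round(two_villages, 2), round(twenty_villages, 2))  # 1.9 1.0
```

With one household per village there is no clustering penalty, while concentrating the same twenty households in two villages nearly doubles the variance under these assumptions.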
LEARNING PROGRAMME - 33
It is good to be familiar with these formulas, but
we have computers to help us with the
calculations….
LEARNING PROGRAMME - 34
Sample size calculators
 Epi Info (www.cdc.gov)
 ODAN stat calculator Excel worksheet
LEARNING PROGRAMME - 35
Sampling methodologies
 Simple random sample
 Systematic random sample
 Cluster sample
 Stratified sample
 Complex sampling designs
LEARNING PROGRAMME - 36
Simple random sampling
(SRS)
 Most basic type of sampling
 Statistical theory based on SRS
 Calculate p values and confidence limits
 Output from most statistical computer
programs assumes an SRS
 Selection of people is independent and random
LEARNING PROGRAMME - 37
Advantages and
disadvantages of SRS
Advantages
 No selection bias
 Self-weighting
Disadvantages
 Requires knowledge of population
 Costly to survey when population is
spread out
 Sampling frame may not be available
or complete
LEARNING PROGRAMME - 38
Steps for conducting a simple
random sample
 Create list of all the sampling units
 Number each unit consecutively
 Randomly select numbers between 1 and the total
number of sampling units
 Random number table
 Computer-generated random numbers (RAND in Excel): assign one to
each unit and pick the units with the highest numbers
 “Pick a number from a hat”
 Birth day or serial number on paper money
 Flip a coin, roll a die, pick a card, pull a straw
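The steps above can be sketched with Python's standard `random` module in place of a random number table. The function name `simple_random_sample` and the seed value are illustrative assumptions; the household names are those used in the example list in this presentation.

```python
import random

def simple_random_sample(sampling_units, sample_size, seed=None):
    """Draw sample_size units without replacement; every unit has an equal chance."""
    rng = random.Random(seed)  # a fixed seed only makes the draw repeatable
    return rng.sample(sampling_units, sample_size)

# Step 1 and 2: a complete, ordered list of sampling units
households = ["Smith", "Pfeiffer", "Anderson", "Timmer", "Huff",
              "Hunt", "Parvanta", "Grummer-Strawn", "Bobrow", "Cooper"]

# Step 3: randomly select 3 of the 10 households
selected = simple_random_sample(households, 3, seed=2007)
print(selected)
```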
LEARNING PROGRAMME - 39
Simple random sampling
Number  Household
1       Smith
2       Pfeiffer
3       Anderson
4       Timmer
5       Huff
6       Hunt
7       Parvanta
8       Grummer-Strawn
9       Bobrow
0       Cooper

Random number table
7648 2352 6959 1937
2554 6804 9098 4316
4318 2346 7276 1880
7136 9603 0163 3152
7000 2865 8357 4475
9804 0042 1106 7949
2932 9958 9582 2235
1140 1164 7841 1688
4097 8995 5030 1785
5420 0125 4953 1332
5540 6278 1584 4392
3258 1374 1617 7427
LEARNING PROGRAMME - 41
Systematic random sampling
 Similar to simple random sampling
 First person chosen randomly
 Systematic selection of subsequent people
 Statistics same as simple random sampling
LEARNING PROGRAMME - 42
Steps for systematic random
sample
 List the sampling units
 Divide the number of sampling units by the sample size to determine
the sampling interval
 Choose a random number between 1 and the sampling interval
 Identifies the first selected sampling unit
 Add the sampling interval to the random number to identify the second
selected sampling unit
 Continue to add the sampling interval until the end of the list
LEARNING PROGRAMME - 43
Systematic sample example
• Example: A survey was undertaken to assess household livelihoods in one
community of 480 houses. Sample size calculations revealed that 40
households would need to be sampled systematically to be representative of the
larger community.
• Sampling interval = 480/40 = 12
• A random number between 1 and 12 was chosen (7)
• First house sampled = 7
• Subsequent households sampled:
  • 7 + 12 = 19
  • 19 + 12 = 31
  • 31 + 12 = 43
  • Etc.
 Danger: unknown, hidden patterns in the population could bias the sample
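The example above can be reproduced with a short sketch. The function name `systematic_sample` is hypothetical, and the sketch assumes the interval is taken as the whole-number part of (number of units / sample size), which matches the example where 480/40 divides exactly.

```python
import random

def systematic_sample(number_of_units, sample_size, start=None, seed=None):
    """Return the list positions (1-based) selected by systematic sampling."""
    interval = number_of_units // sample_size  # e.g. 480 // 40 = 12
    if start is None:
        # Random start between 1 and the interval, inclusive
        start = random.Random(seed).randint(1, interval)
    # Keep adding the interval to the starting position
    return [start + k * interval for k in range(sample_size)]

# 480 houses, 40 to be sampled, random start of 7 as in the example
selected = systematic_sample(480, 40, start=7)
print(selected[:4])   # [7, 19, 31, 43]
print(len(selected))  # 40
```

Passing `start=None` draws the starting house at random, which is what makes the procedure a probability sample.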
LEARNING PROGRAMME - 44
Household listing
LEARNING PROGRAMME - 45
Systematic random sampling
For instance, every third unit in the sampling frame.
Danger: if the sampling frame has a built-in structure or order, a fixed interval can repeatedly pick the same kind of unit.
LEARNING PROGRAMME - 46
What is required for both
simple and systematic random
sampling?
Both require a complete list of all basic sampling units
arranged in some order.
Resources have to be adequate to sample throughout the
target population
LEARNING PROGRAMME - 47
What if there is no household listing??
What if the area of the target population
is too widespread for available
resources??
LEARNING PROGRAMME - 48
Cluster
Sampling!!
LEARNING PROGRAMME
Cluster sampling
Definition: Probability sampling in which
sampling units at some point in the
selection process are collections, or
clusters, of population elements
Source: Kalsbeek, Introduction to survey sampling
LEARNING PROGRAMME - 50
Cluster sampling
Objective: To choose smaller geographic
areas in which simple or systematic
random sampling can be done
Cluster samples, for our purposes, are
multistage (usually 2 or 3 stages)
LEARNING PROGRAMME - 51
Cluster Sampling- Illustration
[Figure: simple random sampling of 30 households across the sampling universe; the legend distinguishes selected and non-selected households]
LEARNING PROGRAMME - 52
Cluster Sampling- Illustration
[Figure: cluster sampling of 30 households in 3 clusters (numbered 1 to 3) of 10 each]
LEARNING PROGRAMME - 53
Cluster sampling
Advantages
 Cheaper - basic
sampling units
closer together
 Does not need
complete list of
basic sampling units
(usually hhs)
Disadvantages
 Decreased precision
of estimate
 Calculation of p
values and
confidence limits
more complicated
LEARNING PROGRAMME - 54
Cluster sampling
 EFSAs, CFSVAs almost always use cluster
sampling techniques.
 Also used in UNICEF MICS surveys (Multiple
Indicator Cluster Surveys), DHS surveys, and
almost all other large scale surveys.
LEARNING PROGRAMME - 55
What about sample size for a
Cluster survey?
 Cluster samples, as we just saw, have the
disadvantage of decreased precision of an estimate.
 To counteract this effect, we must increase the
sample size by a certain factor.
 This factor is referred to as the Design Effect, and is simply
a number to multiply your calculated SRS sample size by,
to give you the required sample size in a cluster sample
that will have the same precision.
LEARNING PROGRAMME - 56
Cluster surveys
To calculate sample size for estimate of
prevalence with 95% confidence limit
taking into account cluster sampling
$$N = \text{DEFF} \times \frac{1.96^2 \times P(1-P)}{d^2}$$
DEFF = Design effect
1.96 = Z value for p = 0.05 or 95% confidence limits
P = Estimated prevalence
d = Desired precision (for example, 0.05 for ± 5%)
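As a rough sketch, the cluster formula is the simple random sampling formula multiplied by the design effect. The function name `cluster_sample_size` is hypothetical, the default design effect of 2 follows the rule of thumb mentioned elsewhere in this presentation, and rounding up is an assumption of the sketch.

```python
import math

def cluster_sample_size(prevalence, precision, design_effect=2.0, z=1.96):
    """N = DEFF * z^2 * P * (1 - P) / d^2, rounded up to a whole household."""
    n_srs = (z ** 2) * prevalence * (1 - prevalence) / (precision ** 2)
    return math.ceil(design_effect * n_srs)

# 50% prevalence, +/- 5% precision, design effect of 2
print(cluster_sample_size(0.5, 0.05))                     # 769

# Same inputs with no clustering penalty (design effect of 1)
print(cluster_sample_size(0.5, 0.05, design_effect=1.0))  # 385
```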
LEARNING PROGRAMME - 57
Where do you get design effect to
calculate sample size?
• Prior surveys
• Assessment of clustering of outcome in
population
• Wild guess:
  • 2 is usually used as the default design effect.
  • Remember that this is a best guess of design effect
  • Overestimate design effect if uncertain
LEARNING PROGRAMME - 58
How design effects behave
 Design effect increases with
 Clustering of outcome in population
 Increasing size of clusters (fewer clusters with
more households)
 To minimize design effect:
 Include more clusters and decrease the size of
clusters
LEARNING PROGRAMME - 59
Steps to conduct a cluster
survey
1. Determine the number of clusters needed and the total number
of households in each cluster
 Calculate total sample size (with design effect)
 Decide how many clusters and of which size
 More clusters with fewer households each result in a
smaller design effect
 More clusters increase cost and time required
 Fewer than 30 clusters with 12 or more households each
results in a high design effect
 But more than 30 clusters (reducing the number of households to
less than 10 per cluster) doesn’t usually decrease design
effect much
 Divide total sample size by number of clusters
 Revisit logistic constraints given cluster size
LEARNING PROGRAMME - 60
Cluster sample- steps
2. Determine what grouping of sampling units will be used as the
primary sampling unit (PSUs)
 Population groups, communities, villages, etc.
3. Select required number of PSUs using probability proportional
to size sampling
LEARNING PROGRAMME - 61
Probability proportional to size
(PPS) sampling
 Ensures that probability of any single household or person in the
target population being chosen is the same
 Process is the same as systematic random sampling, BUT
Larger PSUs (communities/ villages) have a higher probability
to be chosen than do smaller ones
For example: a village of 2000 people gets 10 numbers assigned, while
a village of 200 people only gets 1 number....
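A minimal sketch of PPS selection, assuming the usual cumulative-size approach: list the PSUs with their cumulative sizes, compute a sampling interval, pick a random start, and select the PSU whose cumulative range contains each selection point. The function name and the four village sizes are hypothetical (two of them echo the 2000 and 200 person villages mentioned above).

```python
import bisect
import itertools
import random

def pps_systematic_selection(psu_sizes, number_of_clusters, start=None, seed=None):
    """Return the indices of PSUs selected with probability proportional to size."""
    cumulative = list(itertools.accumulate(psu_sizes))
    interval = cumulative[-1] / number_of_clusters
    if start is None:
        # Random start between 1 and the (whole-number part of the) interval
        start = random.Random(seed).randint(1, int(interval))
    points = [start + k * interval for k in range(number_of_clusters)]
    # A point falls in the first PSU whose cumulative size reaches it
    return [bisect.bisect_left(cumulative, point) for point in points]

# Hypothetical villages and sizes, for illustration only
village_names = ["A", "B", "C", "D"]
village_sizes = [2000, 200, 800, 1000]  # cumulative: 2000, 2200, 3000, 4000

chosen = pps_systematic_selection(village_sizes, 2, start=500)
print([village_names[i] for i in chosen])  # ['A', 'C']
```

Note that a PSU larger than the sampling interval can be hit by more than one selection point, in which case it contributes more than one cluster.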
LEARNING PROGRAMME - 62
Cluster sample- steps
4. Visit each selected PSU and select required number of
households
 SRS or systematic sampling is preferred 2nd stage sampling
method
LEARNING PROGRAMME - 63
Cluster sample example
 We are doing a national household survey in
Mongolia. Sample size calculations (taking into
account a design effect of 2) indicated that 450
households had to be sampled to be representative of
the target population. We have decided to include 12
households per cluster which means we would need to
sample 38 clusters.
 Sampling interval calculation:
 We want 38 clusters, so divide the total number of
households in the sampling frame by the number of
clusters
 129,177 / 38 = 3399.39 ≈ 3399
LEARNING PROGRAMME - 64
Cluster sample example

no.   Province      village       number of HHs   cumulative number of HHs
1     Bayan-Ulgii   tsagaannuur   231             231
2     Bayan-Ulgii   nogoonnuur    912             1143
3     Bayan-Ulgii   ulgii         3099            4242
4     Bayan-Ulgii   altansogts    376             4618
5     Bayan-Ulgii   bugat         484             5102
6     Bayan-Ulgii   bayannuur     763             5865
7     Bayan-Ulgii   tolbo         672             6537
8     Bayan-Ulgii   deluun        1251            7788
9     Bayan-Ulgii   bulgan        816             8604
10    Uvs           bukhmoron     326             8930
11    Uvs           davst         248             9178
12    Uvs           ulaangom      2455            11633
13    Uvs           khovd         389             12022
14    Uvs           umnogobi      512             12534
15    Uvs           ulgii         438             12972
16    Uvs           erdeneburen   457             13429
17    khovd         khovd         464             13893
18    khovd         myangad       455             14348
19    khovd         buyat         425             14773
20    khovd         jargalant     2837            17610
21    khovd         dorgon        412             18022
22    khovd         chandmana     545             18567
23    khovd         darvi         356             18923
etc…
267   dornod        matad         184             129177

Random number table
7678 2352 6959 1937
2554 6804 9098 4316
4318 2346 7276 1880
7136 9603 0163 3152
7000 2865 8357 4475
9804 0042 1106 7949
2932 9958 9582 2235
1140 1164 7841 1688
4097 8995 5030 1785
5420 0125 4953 1332
5540 6278 1584 4392
3258 1374 1617 7427
LEARNING PROGRAMME - 65
Taking cluster sampling into
account during analysis
 As we learned when discussing cluster
design, cluster sampling usually
DECREASES the precision of your
estimates.
 BUT, SPSS assumes a SRS.
 So, when calculating statistical tests,
standard deviations, variations, etc. (as you’ll
talk about in the next sessions), the error will
likely be UNDERestimated.
LEARNING PROGRAMME - 66
Accounting for Cluster sampling
in the analysis

How do we take this increased error into account?
1. Use Complex Samples option in SPSS (only available
on some versions, highly complicated)
2. Aggregate values by cluster, and analyze the cluster
level database.
3. Use as many clusters as possible, and thus smaller
clusters in each stratum, to decrease the amount of this
error, and account for it in analysis by remembering that
your CIs are underestimated (too narrow).
Option 3 is usually used by WFP analysts,
although option 1 may be utilized where strict
methodologies exist (wasting prevalence, vitamin
deficiencies, etc.).
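To make the second option more concrete, here is a minimal sketch of a cluster-level analysis: compute the proportion in each cluster, then treat the cluster proportions as the observations. It assumes equal-sized clusters and no weighting, uses 1.96 even though a t value would be more appropriate with few clusters, and runs on hypothetical data. It is not the SPSS Complex Samples procedure.

```python
import math
import statistics
from collections import defaultdict

def cluster_level_estimate(records, z=1.96):
    """records: (cluster_id, value) pairs, value 1 if the household has the outcome.

    Returns (estimate, lower limit, upper limit) from cluster-level proportions.
    """
    by_cluster = defaultdict(list)
    for cluster_id, value in records:
        by_cluster[cluster_id].append(value)
    # One proportion per cluster: this is the aggregated, cluster-level dataset
    cluster_means = [statistics.mean(values) for values in by_cluster.values()]
    estimate = statistics.mean(cluster_means)
    standard_error = statistics.stdev(cluster_means) / math.sqrt(len(cluster_means))
    return estimate, estimate - z * standard_error, estimate + z * standard_error

# Hypothetical data: 3 clusters of 4 households each
records = [(1, 1), (1, 1), (1, 1), (1, 0),
           (2, 0), (2, 0), (2, 1), (2, 0),
           (3, 1), (3, 0), (3, 1), (3, 0)]

estimate, lower, upper = cluster_level_estimate(records)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

Because the variation is measured between clusters rather than between households, the resulting interval reflects the clustering instead of assuming an SRS.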
LEARNING PROGRAMME - 67
Overview of Stratified Sampling
 Members of a target population are put into similar
groups: geography, sex, livelihood
 Each group is called a stratum
 Random, systematic, or cluster sampling is used to
select a sufficient number of subjects in each stratum
 Must know approximate population size in each
stratum
LEARNING PROGRAMME - 68
Goals of stratified sampling
 Overall objective: to define the target population by a certain
characteristic (usually urban/ rural status, gender, admin units,
etc)
 There are three primary reasons for stratification
1. Stratifying increases precision of the survey by selecting homogeneous sub-groups (a
priori or post hoc) and accounting for these groups in the analysis of results for the whole
group.
2. Stratification ensures a better sample
3. A more common goal is to split a sampling universe into smaller groups or universes, with
the goal of providing results for each of these groups, or strata.
 In CFSVAs, this is commonly done when creating a sampling design.
LEARNING PROGRAMME - 69
Types of stratification
 Proportionate stratification
 If the number of people/hhs sampled per stratum
is proportional to the population in each stratum
 Disproportionate stratification
 If the number of people/ hhs sampled per stratum
is independent of the population in each stratum
LEARNING PROGRAMME - 70
Stratified Sampling
3 distinct livelihood zones
Country X
Pastoral zone; N= 13
Urban area; N= 20
Agricultural zone; N= 18
LEARNING PROGRAMME - 71
Proportionate stratification
12 sampled
Country X
8 sampled
10 sampled
LEARNING PROGRAMME - 72
Disproportionate stratification
10 sampled
Country X
10 sampled
10 sampled
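The two allocations illustrated above can be sketched as follows, taking the N values of the three zones (13, 20 and 18) as stratum sizes and a total sample of 30. The function name is hypothetical, and the largest-remainder rounding is an assumption of this sketch. The weights at the end (stratum size divided by number sampled) show why a disproportionate sample must be weighted when strata are combined.

```python
import math

def proportionate_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to size (largest-remainder rounding)."""
    population = sum(stratum_sizes.values())
    quotas = {name: total_sample * size / population
              for name, size in stratum_sizes.items()}
    allocation = {name: math.floor(quota) for name, quota in quotas.items()}
    leftover = total_sample - sum(allocation.values())
    # Hand the remaining units to the strata with the largest fractional parts
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

zones = {"Pastoral": 13, "Urban": 20, "Agricultural": 18}

proportionate = proportionate_allocation(zones, 30)
print(proportionate)  # {'Pastoral': 8, 'Urban': 12, 'Agricultural': 10}

# Disproportionate: the same number sampled in every stratum
disproportionate = {name: 10 for name in zones}
weights = {name: zones[name] / disproportionate[name] for name in zones}
print(weights)  # {'Pastoral': 1.3, 'Urban': 2.0, 'Agricultural': 1.8}
```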
LEARNING PROGRAMME - 73
Disproportionate sampling
 When several strata are identified during the
survey design phase, with the goal of
providing final estimates for these strata, this
amounts to doing several individual surveys,
one for each stratum.
 Sample size increases in direct proportion to the
number of strata that will be analyzed and
reported on!!!
LEARNING PROGRAMME - 74
Complex sample designs
 Complex samples simply use a combination
of techniques. For example:
 A country is stratified into 5 provinces.
 A cluster sample is used to select 30 villages in
each of the strata.
 A systematic random sample is taken in each
village to select 15 households.
LEARNING PROGRAMME - 75
Recommendations and rules
of thumb
 Disproportionate sampling is usually used for strata you want to
report on (provinces); proportionate sampling for strata that are
there to have a better sample (i.e. districts - we will not report on
them)
 CFSVAs usually have a complex sample design (cluster and
stratified)
 Sampling is usually done as a 2 stage cluster (village,
household).
 In any area of estimation, about 250 to 300 households are
needed.
 Clusters are typically 10 households each, giving 25 to 30
clusters per area of estimation.
 In reporting an assumed design effect of 2 is used (this is
actually based on research of key food security indicators).
LEARNING PROGRAMME - 76
Recommendations and rules
of thumb
 Where logistics allows, more households per cluster may
be sampled (maintaining the 25 to 30 clusters, and thus
increasing the overall sample size).
 For example, when teams visit one cluster per day, and can
accomplish 10 or 12 households without considerable extra
effort/cost, the number of households per cluster is increased.
 95% confidence intervals are always used, with an
estimated 50% prevalence of the indicator. With a design effect
of 2 and about 300 households per area of estimation, this gives
a 95% CI of about ± 8 percentage points around an estimate of
50% prevalence.
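The figure of about 8 percentage points can be checked by rearranging the cluster sample size formula to solve for the precision: d = 1.96 × sqrt(DEFF × P(1-P) / N). The function name below is hypothetical; the inputs are the rule-of-thumb values of 250 to 300 households, a design effect of 2 and 50% prevalence.

```python
import math

def margin_of_error(sample_size, prevalence=0.5, design_effect=2.0, z=1.96):
    """Half-width of the confidence interval: d = z * sqrt(DEFF * P * (1 - P) / N)."""
    return z * math.sqrt(design_effect * prevalence * (1 - prevalence) / sample_size)

for n in (250, 300):
    print(n, "households: +/-", round(100 * margin_of_error(n), 1), "percentage points")
# 250 households: +/- 8.8 percentage points
# 300 households: +/- 8.0 percentage points
```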
LEARNING PROGRAMME - 77