Respondent-Driven Sampling

Carl Kendall, Ph.D.
Professor of International Health and
Development
Tulane University
New Orleans, LA, USA
FIOCRUZ, July 29, 2006
RDS was developed by:
Douglas D. Heckathorn, Ph.D.
Professor of Sociology, Cornell University
Web Page:
www.soc.cornell.edu/faculty/heckathorn.shtml
Through support from the Centers for Disease
Control and Prevention, the National Institute on
Drug Abuse, and the National Endowment for the
Arts.
Lecture in two parts
- Introduction, theory behind RDS
- Brief description of what the PN has been doing in Brazil with RDS
Why?




- The HIV epidemic is really multiple epidemics, taking place in multiple sub-populations.
- Many of these sub-populations are “hidden.”
- To understand epidemic dynamics, we need to know what is going on in these networks of individuals who are at enhanced risk because of their behavior and their environment.
- This understanding is needed in order to develop effective interventions for specific populations.
Several basic methods have dominated the study of
hidden populations.





- Institutional sampling: May provide easy access to numerous high-risk individuals, but institutions draw the sample.
- Time/space sampling: The sampling frame combines places and times where the population gathers, and samples are weighted to compensate for variations in population density. Best suited for populations clustered in large public venues.
- Targeted sampling: A simplified form of time/space sampling where the space is the street, no weights are used, and in most implementations snowball methods are also used.
- Chain-referral sampling: Recruitment through networks reaches respondents who avoid public venues and institutions. Until recently, these have been considered convenience samples.
For a comparison and assessment see: “Street and Network Sampling in Evaluation Studies of HIV Risk-Reduction Interventions” by Salaam Semaan, Jennifer Lauby and Jon Liebman, AIDS Reviews, 2002.
Respondent Driven Sampling


- A form of chain-referral sampling characterized by long referral chains and a statistical theory of the sampling process that controls for bias, including the effects of the choice of seeds and differences in network size (Heckathorn 1997, 2002).
- RDS also serves as the recruitment mechanism for a form of HIV-prevention intervention termed a Peer-Driven Intervention (PDI), also known as the “ECHO Model” (Broadhead and Heckathorn 1994; Heckathorn et al. 1999).
Classic Statement on
Probability versus Convenience Samples
“The major strength of probability sampling is that the
probability selection mechanism permits the development of
statistical theory to examine the properties of sample estimators.
Thus, estimators with little or no bias can be used, and estimates
of the precision of sample estimates can be made.”
In contrast, the precision of estimators from nonprobability
samples can be assessed only by “subjective evaluation.”
Kalton (1983)
Implication: Making chain-referral sampling a form of
probability sampling requires a statistical theory of the sampling
process.
This is part of a new class of sampling methods termed
adaptive/link-tracing designs (Thompson and Frank 2000)
Can chain-referral sampling be a reliable method
even though seeds from a hidden population
cannot be selected randomly?
Referral patterns reflect a self-affiliation bias:

Race/ethnicity of recruit, by race/ethnicity of the person who recruited (IDUs in Meriden)

Recruiter                                White  Hispanic  Black  Other  Total
Non-Hispanic White (70% of population)    81%      8%       6%     5%    100%
Hispanic (20% of population)              43%     45%      10%     2%    100%
Non-Hispanic Black (8% of population)     50%     14%      36%     0%    100%
Other (2% of population)                  38%     63%       0%     0%    100%
A Statistical Theory: Recruitment as a Markov Chain,
W=white, B=black, H=Hispanic, O=other
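[Figure: state-transition diagram with nodes W, B, H, and O. Each arrow is labeled with the corresponding recruitment probability, e.g., W to W 81%, H to H 45%, B to B 36%.]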
Recruitment can be seen as a stochastic process that moves
from node to node governed by the probabilities associated
with the arrows.
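The stochastic process described above can be sketched in a few lines of Python. This is a minimal illustration, not the RDS software: it follows a single referral chain (real RDS recruitment branches, since each respondent can recruit several peers), the percentages are taken from the Meriden recruitment table, and the names `RECRUITMENT` and `referral_chain` are hypothetical.

```python
import random

STATES = ["W", "H", "B", "O"]  # white, Hispanic, black, other

# Rows: recruiter's group. Values: % of recruits in each group, in STATES order.
# Taken from the Meriden table; the "O" row sums to 101 because of rounding,
# which is harmless here because random.choices only needs relative weights.
RECRUITMENT = {
    "W": [81, 8, 6, 5],
    "H": [43, 45, 10, 2],
    "B": [50, 14, 36, 0],
    "O": [38, 63, 0, 0],
}

def referral_chain(seed_group, length, rng):
    """Follow one recruiter -> recruit chain for `length` recruits."""
    chain = [seed_group]
    for _ in range(length):
        weights = RECRUITMENT[chain[-1]]  # arrows leaving the current node
        chain.append(rng.choices(STATES, weights=weights, k=1)[0])
    return chain

rng = random.Random(2006)  # fixed seed so the run is repeatable
chain = referral_chain("H", 20000, rng)
share_white = chain.count("W") / len(chain)
print(chain[:10], round(share_white, 2))
```

Even though the chain starts from a Hispanic seed, a very long chain spends most of its steps in the W node, which is the behavior the two theorems below describe.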
Two theorems regarding regular Markov chains are
relevant to an understanding of RDS:


THEOREM ONE: The "law of large numbers for regular
Markov chains" (Kemeny and Snell 1960) states that the
probability that a system will be in any given state over the course
of a large number of steps is independent of its starting state.
Implication: As the sample expands wave by wave the
composition of the sample becomes stable, reaching what is
termed an “equilibrium,” so bias from the seeds disappears if the
number of waves is large enough.
THEOREM TWO: The equilibrium is attained at a geometric
(i.e., rapid) rate.
Implication: Only a moderate number of waves of recruitment
are required for the subject composition to reach equilibrium
(usually only 4 to 6).
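Both implications can be checked numerically. The sketch below is an illustration under stated assumptions: it reuses the Meriden recruitment percentages, normalizes each row so it sums to 1 (the "Other" row sums to 101% from rounding), and measures how far apart two samples are, one started from all-Hispanic seeds and one from all-white seeds. The function names and the use of total variation distance as the gap measure are choices made here, not part of the original lecture.

```python
STATES = ["W", "H", "B", "O"]

# Recruitment percentages from the Meriden table (rows: recruiter's group).
RAW = [
    [81, 8, 6, 5],    # W
    [43, 45, 10, 2],  # H
    [50, 14, 36, 0],  # B
    [38, 63, 0, 0],   # O
]
# Normalize rows so each is a proper probability distribution.
P = [[value / sum(row) for value in row] for row in RAW]

def next_wave(dist):
    """Expected composition of the next wave, given the current one."""
    return [sum(dist[i] * P[i][j] for i in range(4)) for j in range(4)]

def waves(seed_dist, n_waves):
    """Composition of waves 0..n_waves, starting from the seeds."""
    out = [seed_dist]
    for _ in range(n_waves):
        out.append(next_wave(out[-1]))
    return out

from_hispanic = waves([0.0, 1.0, 0.0, 0.0], 10)
from_white = waves([1.0, 0.0, 0.0, 0.0], 10)

# Total variation distance between the two samples at each wave.
gap = [
    0.5 * sum(abs(a - b) for a, b in zip(h, w))
    for h, w in zip(from_hispanic, from_white)
]
for wave, g in enumerate(gap):
    print(wave, round(g, 4))
```

The gap starts at 1 (the two seed sets share nothing) and shrinks by a roughly constant factor each wave, so after about six waves the two samples are nearly indistinguishable.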
Simulations of Recruitment in a Respondent-Driven
Sample Based on the Recruitment Matrix: Race and
ethnicity of recruits in each wave, beginning with all
Hispanic or all non-Hispanic white seeds.
[Figure: two line charts, "Hispanic Seeds" and "White Seeds". x-axis: recruitment wave (0 to 10); y-axis: % of population (0% to 100%). Series: non-Hispanic White, non-Hispanic Black, Hispanic, Other. Annotated levels: 70% and 17%.]
Therefore, referral chains should be long
The operational details are:
1) Recruiters are rewarded.
2) Coupons with unique serial numbers document and ration recruitment.
3) Coupons are dollar-bill sized and printed on medium card stock using PowerPoint.
4) Respondents call for appointments, bring their coupons to the interview, and are later given three more for recruiting peers.
Implementation
Stages in the Operation of RDS
- Recruitment of Seeds
- Recruitment by a Peer in the Community
- Interview/HIV Testing and Counseling in the Interview Site
- Recruitment of Peers in the Community
- Debriefing and Reward for Peer Recruitment
After the seeds begin recruiting, the recruitment cycle continues until
the sampling goal is reached
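The coupon mechanics can be sketched as a small ledger. This is a hypothetical illustration, not the IRIS coupon manager: the class name `CouponLedger`, its methods, and the respondent IDs are all assumptions. The only detail taken from the slides is that coupons carry unique serial numbers and that each respondent receives three.

```python
from dataclasses import dataclass, field

COUPONS_PER_RESPONDENT = 3  # per the slides: respondents are given three coupons

@dataclass
class CouponLedger:
    next_serial: int = 1
    issued_to: dict = field(default_factory=dict)     # open coupon serial -> recruiter id
    recruited_by: dict = field(default_factory=dict)  # respondent id -> recruiter id (None for seeds)

    def _issue(self, respondent_id):
        """Hand out a fixed number of uniquely numbered coupons (the ration)."""
        serials = list(range(self.next_serial, self.next_serial + COUPONS_PER_RESPONDENT))
        self.next_serial += COUPONS_PER_RESPONDENT
        for serial in serials:
            self.issued_to[serial] = respondent_id
        return serials

    def enroll_seed(self, respondent_id):
        """Seeds have no recruiter; they only receive coupons."""
        self.recruited_by[respondent_id] = None
        return self._issue(respondent_id)

    def redeem(self, serial, respondent_id):
        """Record who recruited whom, retire the coupon, and issue new ones."""
        if serial not in self.issued_to:
            raise ValueError("unknown or already redeemed coupon")
        if respondent_id in self.recruited_by:
            raise ValueError("respondent already enrolled")
        self.recruited_by[respondent_id] = self.issued_to.pop(serial)
        return self._issue(respondent_id)

ledger = CouponLedger()
seed_coupons = ledger.enroll_seed("seed-4")
new_coupons = ledger.redeem(seed_coupons[0], "r-001")
print(ledger.recruited_by)
```

Because a redeemed serial is removed from the open set, one coupon can bring in only one recruit, and a respondent can never recruit more peers than the coupons issued to them.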
Requirements

Four absolute requirements:
- Record who recruited whom
- Recruiters and recruits must know one another
- Ration recruitment so that a few cannot do all the recruiting
- Ask about personal network sizes

Using Respondent-Driven Sampling to Study
Spatial Networks (Using Zip Code Level Data)
Network structure is revealed by successive waves of peer recruitment. The beginning
point for one recruitment network, Seed #4, a black female bass player, is marked by the
red pin near Times Square.
[Map sequence: Wave 1, Wave 2, Wave 3, and Wave 4 for Seed 4; All Waves, All Seeds; the same sequence again after panning out; and All Waves, All Seeds for the San Francisco RDS. Source: Douglas Heckathorn, Cornell University, 2003.]
Can RDS be a valid method despite:
- Differences in network sizes (more recruitment paths lead to those with large networks, so they are over-sampled);
- Self-affiliation bias (as seen above, people tend to recruit those like themselves);
- Differential recruitment (some groups recruit more than others, so their recruitment patterns are over-represented).
The sample composition may therefore not mirror that of the population from which it was drawn, so a valid population estimator must take these factors into account.
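The first of these effects can be made concrete with a toy network. The sketch below uses a hypothetical four-person network with reciprocal ties (the names in `TIES` are invented) and assumes each person passes a coupon to one of their ties chosen uniformly at random. It checks that being sampled in proportion to network size is a self-sustaining pattern under that process.

```python
# Hypothetical reciprocal network: person -> set of people they know.
TIES = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def step(dist):
    """One recruitment step: each person recruits a uniformly random tie."""
    out = {person: 0.0 for person in TIES}
    for person, prob in dist.items():
        for peer in TIES[person]:
            out[peer] += prob / len(TIES[person])
    return out

# Sampling probability proportional to network size (degree).
total_ties = sum(len(peers) for peers in TIES.values())
by_degree = {person: len(peers) / total_ties for person, peers in TIES.items()}

after_step = step(by_degree)
print(by_degree)
print(after_step)
```

The two printed distributions match: "a", with three ties, is three times as likely to be in the sample as "d", with one. This is why respondents are asked about their personal network sizes.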
Population Estimates from RDS:
Population Estimates are Derived from Network
Indicators
Data from Respondent-Driven Sample -> Network Indicators (proportion of cross-cutting ties and network size) -> Population Estimates (proportional group sizes, affiliation indices)
Aim: To compensate for the effects of network
structure on recruitment into the sample, including
differences in network sizes, and clustering.
Biased Network Theory (1)





- Developed in the early 1950s by Anatol Rapoport, later elaborated by Fararo and Sunshine (1968).
- Network ties are formed randomly, through a stochastic process.
- In an unstructured system, ties are formed through random mixing, e.g., if a group makes up 75% of the population, it will have 75% in-group ties.
- More than a century ago, Galton recognized that friendships tend to form among those who are similar, a tendency called homophily.
- Ties can also form based on complementarity, e.g., sexual relations among heterosexuals. This is negative homophily, also known as heterophily.
Biased Network Theory and Affiliation Patterns

In Biased Network Theory structure can be defined
using an Index of Network Clustering termed
Homophily (Fararo and Sunshine, 1968, Heckathorn
2002)

Homophily = 1 if all ties are formed to the in-group;
Homophily = 0 if all ties are formed randomly;
Homophily = -1 if all ties are formed to the out-group

Intermediate values are defined similarly, e.g.,
homophily = .32 if ties are formed as though 32% of the
time an in-group tie is formed, and the rest of the time
ties are formed by random mixing.

This clustering index is used because of its fit with
RDS sampling theory: If homophily sums correctly the
equilibrium sample composition mirrors the population
from which it was drawn.
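The definition above can be turned into a short calculation. This is a sketch under stated assumptions: if a fraction H of ties are forced in-group and the rest follow random mixing, the in-group share of ties is S = H + (1 - H) * P, where P is the group's population share, which gives H = (S - P) / (1 - P). For groups with fewer in-group ties than random mixing would give, the sketch assumes the usual convention H = (S - P) / P, so that H = -1 when there are no in-group ties. The function name `homophily` is hypothetical, and the inputs are the Meriden figures from the recruitment table, with recruitment shares standing in for tie shares.

```python
def homophily(in_group_share, population_share):
    """Homophily index in [-1, 1]; requires 0 < population_share < 1."""
    s, p = in_group_share, population_share
    if s >= p:
        # More in-group ties than random mixing would produce.
        return (s - p) / (1.0 - p)
    # Fewer in-group ties than random mixing (assumed convention).
    return (s - p) / p

# (share of recruits from own group, share of population), Meriden IDUs.
MERIDEN = {
    "Non-Hispanic White": (0.81, 0.70),
    "Hispanic": (0.45, 0.20),
    "Non-Hispanic Black": (0.36, 0.08),
    "Other": (0.00, 0.02),
}

for group, (s, p) in MERIDEN.items():
    print(group, round(homophily(s, p), 2))
```

Under these assumptions the three larger groups all show moderate positive homophily, while the "Other" group, which recruited no one from its own group, sits at -1.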
The Reciprocity Model:
How to estimate population size based on network indicators
When ties are reciprocal, the number of ties from any group A to B, Tab, is equal to the number of ties from B to A, Tba, i.e.,
Tab = Tba, e.g., 2 = 2
The number of ties from A to B is the product of four terms: (1) the
number of nodes in the system, X, (2) the proportional size of A, Pa,
(3) the average network size of A, Na, and (4) the proportion of ties
from A to B, Sab.
Tab = X * Pa * Na * Sab e.g., 5 * .4 * 2 * .5 = 2
The Reciprocity Model (2):
How to estimate population size based on network indicators
Given that:
Tab = Tba, by expansion
X*Pa*Na*Sab=X*Pb*Nb*Sba
When (1-Pa) is substituted for Pb, this reduces to:
Pa = (Sba * Nb) / (Sab * Na + Sba * Nb)
Note that the term for total population drops out, so this model
yields population proportions but not absolute sizes.
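A worked check of the reciprocity model, as a minimal sketch. Setting X*Pa*Na*Sab = X*(1-Pa)*Nb*Sba and solving for Pa gives Pa = (Sba * Nb) / (Sab * Na + Sba * Nb). The values for group A (X = 5, Pa = .4, Na = 2, Sab = .5) come from the example above. The values for group B (Nb = 2, Sba = 1/3) are assumptions chosen here so that Tba also equals 2. The function names are hypothetical.

```python
def ties(x, p, n, s):
    """Ties from one group to the other: nodes * group share * net size * cross-tie share."""
    return x * p * n * s

def estimate_p_a(n_a, s_ab, n_b, s_ba):
    """Proportional size of group A from network indicators only."""
    return (s_ba * n_b) / (s_ab * n_a + s_ba * n_b)

X, P_A, N_A, S_AB = 5, 0.4, 2, 0.5  # example values from the slide
N_B, S_BA = 2, 1 / 3                # assumed values for group B

t_ab = ties(X, P_A, N_A, S_AB)
t_ba = ties(X, 1 - P_A, N_B, S_BA)
p_a_hat = estimate_p_a(N_A, S_AB, N_B, S_BA)
print(t_ab, round(t_ba, 6), round(p_a_hat, 6))
```

The estimator recovers Pa = 0.4 without ever using X, which is the point of the note above: the model gives proportions, not absolute sizes.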
RDS and IRB issues






- Unlike snowball sampling, in which Rs provide peers’ contact information and investigators make contact, RDS does not ask Rs to violate the privacy of peers.
- Incentives should be kept modest enough to prevent coercion, and questions should be asked about it to ensure it does not happen.
- Very large incentives could be viewed as in themselves coercive.
- IRBs have sometimes resisted raising incentive amounts, so do not start with too small an amount. Focus groups and pilot studies are useful.
- IRBs have also been concerned about giving money to IDUs. Recruitment quotas limit the annual amount that can be earned, so incentives will not affect drug habits (e.g., in ECHO studies, 99.7% of Rs earned less than $100/year).
- Because of tracking by serial numbers, coupons cannot become an alternative source of currency (unlike store coupons, food stamps, etc.).
Limitations of Respondent-Driven
Sampling


Limitations inherent in all sampling methods apply to RDS, e.g., the interview site must be readily accessible, interviewers must be culturally sensitive, and no sampling method can completely eliminate non-response bias.
In addition, there are limitations specific to RDS:
1) Population members must know one another as members of the target population. This can occur, for example, through the contact patterns created by sexual contact or drug sharing.
2) Network ties must be dense enough to sustain the chain-referral process.
3) Means must exist to motivate population members to recruit their peers.
4) Means must exist for verifying membership in the target population, lest others seek entry into the study to gain respondent fees.
5) Statistical power decreases when homophily is high.
Advantages of RDS






- Controls for the biases associated with chain-referral methods, providing both population estimates and estimates of variability for those estimates.
- Requires little formative research, and therefore sampling can begin quickly. In contrast, time/space and targeted sampling require detailed prior mapping of the target population.
- Accesses persons through their social networks, even reaching those who shun large public venues and avoid the street.
- Recruitment is carried out by respondents at minimal cost. No field staff is required, so training requirements and costs are reduced.
- The number of additional questions that must be added to the instrument is small. Therefore, the method’s overhead is minor.
- The problem of non-response bias is reduced by a dual incentive system (respondent fees and peer pressure).
RDS Software

IRIS coupon manager
RDSat

http://www.respondentdrivensampling.org/main.htm

Software: RDSat

Calculates population estimates based on
- Linear Least Squares or Data Smoothing (normal or enhanced)
- Arithmetic or Weighted Net Sizes
- Net size outliers can be pulled in (large outliers make the arithmetic mean unstable; small ones make the weighted mean unstable)
- Equilibrium Sample Composition
- Weights
- Reciprocity Index
- Homophily (useful for calculating design effects)
- Standard Errors
- Accepts data files created by IRIS 3.0, so calculations can be made in the field
- Creates a data file useful for studying recruitment networks using UCINET or Pajek
- Limits: sample size ≤ 2,500, coupons ≤ 40 per respondent
Selected Bibliography on RDS

- "Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations." By Douglas D. Heckathorn. Social Problems, 1997.
- "Respondent-Driven Sampling II: Deriving Valid Population Estimates from Chain-Referral Samples of Hidden Populations." By Douglas D. Heckathorn. Social Problems, 2002.
- "Extensions of Respondent-Driven Sampling: A New Approach to the Study of Injection Drug Users Aged 18-25." By Douglas D. Heckathorn, Salaam Semaan, Robert S. Broadhead, and James J. Hughes. AIDS and Behavior, 2002.
- "Group Solidarity as the Product of Collective Action: Creation of Solidarity in a Population of Injection Drug Users." By Douglas D. Heckathorn and Judith E. Rosenstein. Advances in Group Processes, 2002.
- "Development of a Theory of Collective Action: From the Emergence of Norms to AIDS Prevention and the Analysis of Social Structure." By Douglas D. Heckathorn. In New Directions in Sociological Theory: Growth of Contemporary Theories (Joseph Berger and Morris Zelditch, editors). Rowman and Littlefield, 2002.
- "Finding the Beat: Using Respondent-Driven Sampling to Study Jazz Musicians." By Douglas D. Heckathorn and Joan Jeffri. Poetics, 2001.
- "Making Unbiased Estimates from Hidden Populations Using Respondent-Driven Sampling." By Matthew J. Salganik and Douglas D. Heckathorn. Paper presented at the International Social Network Conference, February 2003, Cancun, Mexico.
- "Jazz Networks: Using Respondent-Driven Sampling to Study Stratification in Two Jazz Musician Communities." By Douglas D. Heckathorn and Joan Jeffri. Paper to be presented at the American Sociological Association meetings, August 2003, Atlanta, GA.
RDS in Brazil

- Planning meeting, introduction to RDS, November 2004: Carl Kendall, Tulane; Keith Sabin, CDC
- Protocol development, May 2005: Lisa Johnston, Tulane
- Data collection, August-October 2005
- Analysis workshop, May 2006
- Writing workshop, July 2006, UCSF: 15 papers
Site           Target Population   Participants
Campinas 1     MSM                 Maeve Mello
Curitiba       Female CSW          Clea Ribiero, Augusto Evangelista
Fortaleza      MSM                 Ligia Sansigolo Kerr, Linda Maia
Recife         Drug Users (DU)     Luiz Oscar Ferreria, Eniel Oliveira
Porto Alegre   CSWs                Cintia Germany, Mauro Ramos
Santos         CSWs                Neide Silva, Regina Maria Lacerda
Campinas 2     IDU                 Elvira Maria Filipe, Marcia Moreira Holcman
Manaus 1       Female CSW          Joao Catarino Dutra, Felicien Vasques
Manaus 2       DU                  Marcos Santos, Roberio Reboucas
An empirical comparison of RDS, targeted
TLS and snowball sampling methodologies in
a hidden population in Fortaleza, Brazil
Ligia Kerr†, Carl Kendall‡, Rogério Gondim•, Guillerme
Werneck◊, Lisa Johnston‡, Keith SabinΩ
Study design


Cross-sectional study in Fortaleza/Ce
- 2002 (401): 32% “Snow Ball”, 68% TLS
- 2005 (406): 100% RDS
Measures

Questionnaire based on BSS

Socio-economic status (education/social class)
Main results
Table 1. Education and social class in two survey rounds using three different methods, Fortaleza/Ce, 2006. Cells show prevalence rate (CI).

Variable                         2002 (Snowball)     2002 (TLS)          2005 (RDS)
Social class                     N=126               N=254               N=406
  A                              15.1 (9.3-22.5)     25.6 (20.3-31.4)    0.8 (0.1-1.2)
  B                              39.7 (31.1-48.8)    37.4 (31.4-43.7)    2.6 (1.4-3.8)
  C                              33.3 (25.2-42.3)    24.8 (19.6-30.6)    24.0 (19.6-30.7)
  D                              11.9 (6.8-18.9)     11.0 (7.5-15.5)     44.5 (39.7-50.9)
  E                              0.0                 1.2 (0.2-3.4)       27.9 (20.7-32.4)
Education                        N=126               N=254               N=388
  Illiterate or 1º Incomplete    5.6 (2.3-11.1)      8.5 (5.5-12.5)      29.0 (22.1-35.0)
  1º Complete or 2º Incomplete   61.9 (52.8-70.4)    54.4 (48.3-60.5)    24.0 (18.6-29.4)
  2º Complete or higher          32.5 (24.5-41.5)    37.0 (31.3-43.1)    46.8 (41.1-54.1)
Secondary results
Table 2. Education of AIDS cases among MSM in Ceará. Fortaleza/Ce, 2000-2005.

Education                        AIDS cases among MSM in Ceará (%)
Illiterate or 1º Incomplete      52.9
1º Complete or 2º Incomplete     30.9
2º Complete or higher            16.2
Public Health implications



RDS reaches lower social class respondents than
TLS in this example in Fortaleza.
Social classes D and E have a higher proportion
of AIDS cases.
RDS would appear to be the sampling method
of choice in Fortaleza.
THE EFFECTIVENESS OF USING THE
RESPONDENT DRIVEN SAMPLING METHODOLOGY
FOR HIV BEHAVIORAL SURVEILLANCE
AMONG FEMALE SEX WORKERS IN SANTOS
JULY 2006
IMPLEMENTING ORGANIZATION: ASPPE
SANTOS
WWW.ASPPE.ORG
SOCIAL NETWORK – HIV STATUS
HIV HIV +
Thank you!