A Stata program for
Respondent Driven Sampling
Matthias Schonlau, DIW, RAND (USA)
Elisabeth Liebau, DIW
Stata User Conference
Berlin, June 25, 2010
What is RDS?
• RDS = Respondent Driven Sampling
• Invented by a sociologist (Heckathorn, 1997)
• RDS is a chain referral sampling procedure
– Sampling probabilities can be calculated
• It is the only alternative yielding a probability
sample when traditional methods do not
work.
Typical RDS populations
RDS is employed where traditional probabilistic
sampling methods do not work well:
• Sampling frame cannot easily be constructed
– e.g. no registry available
• Low prevalence
– screening is ineffective/expensive
– E.g. jazz musicians
• Anonymity is an issue
– E.g. Questions about illegal drugs
RDS Sampling Procedure
• Approach several seed respondents
• Each respondent approaches 3 further
respondents from their social network
– Payments to respondents and for each referral
who contacts interviewers
• Stop when desired sample size reached
Differences from snowball sampling
• Respondents recruit directly and do not give contact
information to the interviewer
• Length of referral chain is crucial to reach equilibrium
• Formal theory requires keeping track of who recruits
whom
– No theory in snowball sampling
• Theory attaches different sampling weights to
recruits depending on their network size and the
transition matrix
– Snowball sampling does not use sampling weights
Red/blue Example
[Figure: referral network of 20 numbered respondents]
• Single seed
• 3 recruits
• Max chain length = 3 (not counting seed)
• Example data from Heckathorn et al. 2002
The name “red/blue” is explained later.
Motivation for Theory
• If the referral chains are sufficiently long,
characteristics of the eventual sample will be
independent of the seeds
• The recruitment distribution reaches an
equilibrium
• The probability of recruiting someone from a
certain group (e.g. „white female“) can be
derived.
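The independence from the seeds can be illustrated with a minimal Mata sketch. It uses the red/blue transition probabilities shown later in these slides and follows one chain started by a red seed and one started by a blue seed. The variable names are illustrative and not part of the rds program.

```stata
mata:
// Transition probabilities from the red/blue example
// rows = recruiter (red, blue), columns = recruit (red, blue)
S = (0.5, 0.5 \ 0.2, 0.8)

from_red  = (1, 0)    // chain started by a red seed
from_blue = (0, 1)    // chain started by a blue seed

// Apply the transition matrix once per recruitment wave
for (w = 1; w <= 10; w++) {
    from_red  = from_red  * S
    from_blue = from_blue * S
}

// After 10 waves the two compositions are nearly identical
from_red \ from_blue
end
```

Both rows approach the same composition, which is the equilibrium of the recruitment process.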
Example: 2 groups (red/blue)
Transition count
                Recruit
Recruiter     red    blue
red             7       7
blue            1       4
Transition probability
                Recruit
Recruiter     red    blue
red           0.5     0.5
blue          0.2     0.8
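The two tables are linked by row normalization: each count is divided by the total for its recruiter group. The Mata sketch below uses only the numbers on this slide and also derives the equilibrium composition in closed form for the two-group case. The names C, S and E are illustrative and not part of the rds program.

```stata
mata:
// Transition counts: rows = recruiter (red, blue), columns = recruit (red, blue)
C = (7, 7 \ 1, 4)

// Row-normalize so that each row of S sums to 1
S = C :/ rowsum(C)
S

// Equilibrium of a two-group chain, in closed form:
// E_red = S[blue,red] / (S[blue,red] + S[red,blue])
E_red = S[2,1] / (S[2,1] + S[1,2])
E = (E_red, 1 - E_red)
E

// Check: one more wave of recruitment leaves the equilibrium unchanged
E * S
end
```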
Data required
• id: respondent coupon
• ref1, ref2, ref3: referral
coupons
• degree: network size
• key: analysis variable
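To make the layout concrete, here is a small sketch with made-up data in this format. It links each respondent to a recruiter by matching the respondent's coupon (id) against the referral coupons handed out by others. The toy values, the coding of key (1 = red, 2 = blue) and the matching logic are assumptions for illustration. The names p_id and p_key are borrowed from the rds call shown later in these slides, and this is not necessarily how rds_network builds them.

```stata
clear
input id ref1 ref2 ref3 degree key
1 2 3 4 10 1
2 5 . .  8 2
3 . . .  4 1
4 . . .  6 2
5 . . .  7 2
end

// Build a lookup table: one row per referral coupon that was handed out
preserve
keep id key ref1 ref2 ref3
rename id p_id
rename key p_key
reshape long ref, i(p_id) j(slot)
drop if missing(ref)
rename ref id
keep id p_id p_key
tempfile recruiters
save `recruiters'
restore

// Attach the recruiter to each respondent; seeds have no match
merge 1:1 id using `recruiters', keep(master match) nogenerate
sort id
list id p_id key p_key degree
```

Respondent 1 is the seed here, so p_id and p_key stay missing for that row.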
rds syntax
Two steps:
rds_network    analyzes the network
rds            does the estimation
Example: Iguchi et al. study
• Large US study of men who have sex with
men, drug users, and their sex partners.
• Innovative design, multiple sites
• For illustration, we look at data from Los
Angeles (Phase II)
• Iguchi, M., Ober, A., Berry, S., Fain, T., Heckathorn, D., Gorbach, P., et al.
(2009). Simultaneous Recruitment of Drug Users and Men Who Have Sex
with Men in the United States and Russia Using Respondent-Driven
Sampling: Sampling Methods and Implications. Journal of Urban Health,
86, 5-31.
Large number of seed respondents.
The largest referral length is 18.
Required referral length (5) is smaller than largest chain (18, previous slide).
Convergence has been reached.
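One way to read a "required referral length" is as the number of waves needed before the expected composition of a chain is within a small tolerance of the equilibrium. The Mata sketch below counts those waves for the red/blue example, since the Los Angeles transition matrix is not reproduced in these slides. The 2% tolerance and the stopping rule are assumptions for illustration; the slides do not state how rds_network computes the value.

```stata
mata:
S   = (0.5, 0.5 \ 0.2, 0.8)     // red/blue transition probabilities
tol = 0.02                      // ASSUMED tolerance, not taken from the slides

// Two-group equilibrium in closed form
E_red = S[2,1] / (S[2,1] + S[1,2])
E = (E_red, 1 - E_red)

// Start from a single red seed and count waves until close to equilibrium
p = (1, 0)
waves = 0
while (max(abs(p - E)) > tol) {
    p = p * S
    waves++
}
waves
end
```

For this small example the loop stops after 3 waves.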
If there were only two categories (here 4), both transition matrices would be identical.
Cumulative sample proportions for increasing number of waves
[Figure: Los Angeles; x-axis: Wave (0 to 20); y-axis: Percent (.1 to .5); lines: hispanic, black, white, other]
Theoretically, sample proportions should stabilize after 5 waves (see program output).
In practice, cumulative sample proportions stabilize later, perhaps after 13 waves.
(In practice, assumptions are never perfectly met.)
Population + Sample proportion
The estimated population proportions are the main result.
The sample proportions are surprisingly similar here. This is because
the multiplicity degree does not vary a lot by group.
Equilibrium
• If all assumptions are met, the sample proportions will eventually
converge to the equilibrium.
• The equilibrium does not equal the population proportion, because
groups that are better networked (larger degree) are sampled more
often.
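The gap between equilibrium and population proportion can be shown with a short Mata sketch for two groups. It uses the red/blue transition probabilities from the earlier slide together with hypothetical mean degrees, and applies the reciprocity-based two-group estimator from the RDS literature. The degrees are invented for illustration, and whether the rds program uses exactly this formula is an assumption.

```stata
mata:
S = (0.5, 0.5 \ 0.2, 0.8)     // red/blue transition probabilities
D = (10, 20)                  // HYPOTHETICAL mean degrees for red, blue

// Equilibrium sample composition (two-group closed form)
E_red = S[2,1] / (S[2,1] + S[1,2])

// Reciprocity-based population estimate for two groups:
// P_red = S[blue,red]*D_blue / (S[blue,red]*D_blue + S[red,blue]*D_red)
P_red = S[2,1]*D[2] / (S[2,1]*D[2] + S[1,2]*D[1])

// Equilibrium share vs. estimated population share of red
(E_red, P_red)
end
```

With these numbers the less connected red group makes up a smaller share of the equilibrium sample than of the estimated population, as the second bullet describes.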
Degree
• In the sample, Hispanic respondents report an average of 15
connections in the target population.
• By design, average degree is always greater than the multiplicity
degree.
Homophily
• Race “other” recruits at random 96% of the time.
• Race “black” recruits other blacks 47% of the time and recruits at
random 53% of the time.
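This reading matches the homophily index commonly used in the RDS literature. It is written out below on the assumption that the program reports this index. The symbols are illustrative notation: S_aa is the probability that group a recruits from its own group, and P_a is the estimated population proportion of group a.

```latex
H_a =
\begin{cases}
\dfrac{S_{aa} - P_a}{1 - P_a} & \text{if } S_{aa} \ge P_a, \\[2ex]
\dfrac{S_{aa} - P_a}{P_a}     & \text{if } S_{aa} < P_a.
\end{cases}

% For H_a >= 0 the first case rearranges to a mixture:
S_{aa} = H_a \cdot 1 + (1 - H_a)\, P_a
```

With H_a = 0.47, group a recruits from its own group with certainty 47% of the time and otherwise recruits in proportion to the population shares.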
Weight
• For example, each Hispanic respondent receives the weight 1.0954048.
• These weights can be exported using the wgt option.
Weights
Weights reproduce the estimated
proportions
rds ethnic, id(id) degree(netsize) recruiter_id(p_id) recruiter_var(p_key) wgt(wgt)
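A toy sketch of why such weights reproduce the estimated proportions, assuming each group's weight equals its estimated population share divided by its sample share. The slides do not state the formula, so this is an assumption, and the group sizes and estimates below are invented.

```stata
clear
input byte group int n double phat
1 30 0.45
2 70 0.55
end

// One row per respondent: 30 + 70 = 100 toy respondents
expand n

// ASSUMED weight: estimated population share / sample share
generate double wgt = phat / (n / _N)

tabulate group                    // sample proportions: 30% and 70%
tabulate group [aweight = wgt]    // weighted proportions: 45% and 55%
```

On real data the same check would be a weighted tabulate of the analysis variable using the variable created by the wgt() option.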
Bootstrap results
Bootstrapping is a method for obtaining confidence intervals.
bootstrap _b , reps(1000) : ///
rds ethnic, id(id) degree(netsize) recruiter_id(p_id) recruiter_var(p_key)
estat bootstrap, percentile
Outlook
• Currently working on a paper
• Software will be downloadable in about a
month from within Stata by typing
net search rds
and following the link.
For now please email me and I will send the code.
THE END
Contact :
Matt Schonlau: mschonlau@diw.de (until August)
matt@rand.org
Elisabeth Liebau: eliebau@diw.de
Acknowledgement:
We are grateful to Martin Iguchi, Sandy Berry, Allison Ober, Terry Fain for giving
us access to the data for the example. The group is preparing a public release
version of the data after additional publications are written.