A Stata program for Respondent Driven Sampling
Matthias Schonlau, DIW, RAND (USA)
Elisabeth Liebau, DIW
Stata User Conference Berlin, June 25, 2010

What is RDS?
• RDS = Respondent Driven Sampling
• Invented by a sociologist (Heckathorn, 1997)
• RDS is a chain referral sampling procedure
  – Sampling probabilities can be calculated
• It is the only alternative yielding a probability sample when traditional methods do not work.

Typical RDS populations
RDS is employed where traditional probabilistic sampling methods do not work well:
• Sampling frame cannot easily be constructed
  – E.g. no registry available
• Low prevalence: screening is ineffective or expensive
  – E.g. jazz musicians
• Anonymity is an issue
  – E.g. questions about illegal drugs

RDS Sampling Procedure
• Approach several seed respondents
• Each respondent approaches 3 further respondents from their social network
  – Respondents are paid for participating and for each referral who contacts the interviewers
• Stop when the desired sample size is reached

Differences from snowball sampling
• Respondents recruit directly and do not give contact information to the interviewer
• Length of the referral chain is crucial for reaching equilibrium
• Formal theory requires keeping track of who recruits whom
  – No theory in snowball sampling
• Theory attaches different sampling weights to recruits depending on their network size and the transition matrix
  – Snowball sampling does not use sampling weights

Red/blue Example
[Figure: recruitment network of the example, nodes numbered 1 to 20]
• Single seed
• 3 recruits
• Max chain length = 3 (not counting seed)
• Example data from Heckathorn et al. 2002
The name “red/blue” is explained later.

Motivation for Theory
• If the referral chains are sufficiently long, the characteristics of the eventual sample will be independent of the seeds
• The recruitment distribution reaches an equilibrium
• The probability of recruiting someone from a certain group (e.g. “white female”) can be derived.

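
The convergence claim can be illustrated with a small Markov chain. The sketch below is not part of the rds program. It assumes a two-group transition matrix with rows (0.5, 0.5) and (0.2, 0.8), the values of the red/blue example, and the matrix names P, p_red and p_blue are chosen here for illustration only. It starts one chain from an all-red seed and one from an all-blue seed and applies the transition matrix for ten waves.

```stata
* Illustrative sketch only: names P, p_red, p_blue are assumptions, not rds output.
* Transition matrix: rows are recruiter groups, columns are recruit groups.
matrix P = (0.5, 0.5 \ 0.2, 0.8)
matrix rownames P = red blue
matrix colnames P = red blue

* Group composition of recruits, starting from an all-red or an all-blue seed.
matrix p_red  = (1, 0)
matrix p_blue = (0, 1)

* Each wave multiplies the current composition by the transition matrix.
forvalues wave = 1/10 {
    matrix p_red  = p_red  * P
    matrix p_blue = p_blue * P
}

* Both starting points end up at (nearly) the same composition.
matrix list p_red
matrix list p_blue
```

Under these assumptions both rows settle near 2/7 red and 5/7 blue (about 0.286 and 0.714), whichever group the seed belonged to. This is the equilibrium of the chain, obtained by solving π = πP.
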
Example: 2 groups (red/blue)
[Figure: recruitment network of the example, nodes numbered 1 to 20, colored red or blue]

Transition count
                 Recruit
Recruiter     red    blue
red             7       7
blue            1       4

Transition probability
                 Recruit
Recruiter     red    blue
red           0.5     0.5
blue          0.2     0.8

Data required
• id: respondent coupon
• ref1, ref2, ref3: referral coupons
• degree: network size
• key: analysis variable

rds syntax
Two steps:
• rds_network analyzes the network
• rds does the estimation

Example: Iguchi et al. study
• Large US study of men who have sex with men, drug users, and their sex partners.
• Innovative design, multiple sites
• For illustration, we look at data from Los Angeles (Phase II)
• Iguchi, M., Ober, A., Berry, S., Fain, T., Heckathorn, D., Gorbach, P., et al. (2009). Simultaneous Recruitment of Drug Users and Men Who Have Sex with Men in the United States and Russia Using Respondent-Driven Sampling: Sampling Methods and Implications. Journal of Urban Health, 86, 5-31.

Large number of seed respondents. The longest referral chain has length 18.

The required referral length (5) is smaller than the longest chain (18, previous slide). Convergence has been reached.

If there were only two categories (here there are 4), both transition matrices would be identical.

Cumulative sample proportions for increasing number of waves
[Figure: cumulative sample proportions (y-axis: Percent) by wave (x-axis: Wave, 0 to 20) for hispanic, black, white and other; Los Angeles]
• Theoretically, sample proportions should stabilize after 5 waves (see program output).
• In practice, cumulative sample proportions stabilize later, perhaps after 13 waves. (In practice, assumptions are never perfectly met.)

Population + Sample proportion
The estimated population proportions are the main result. The sample proportions are surprisingly similar here. This is because the multiplicity degree does not vary a lot by group.

Equilibrium
• If all assumptions are met, the sample proportions will eventually converge to the equilibrium.
• The equilibrium does not equal the population proportion, because groups that are better networked (larger degree) are sampled more often.

Degree
• In the sample, each Hispanic reports an average of 15 connections in the target population.
• By design, the average degree is always greater than the multiplicity degree.

Homophily
• Race “other” recruits at random 96% of the time.
• Race “black” recruits other blacks 47% of the time and recruits at random 53% of the time.

Weight
• For example, each Hispanic receives the weight 1.0954048.
• These weights can be exported using the wgt option.

Weights
Weights reproduce the estimated proportions.

    rds ethnic, id(id) degree(netsize) recruiter_id(p_id) recruiter_var(p_key) wgt(wgt)

Bootstrap results
Bootstrapping is a method for obtaining confidence intervals.

    bootstrap _b, reps(1000): ///
        rds ethnic, id(id) degree(netsize) recruiter_id(p_id) recruiter_var(p_key)
    estat bootstrap, percentile

Outlook
• Currently working on a paper
• Software will be downloadable in about a month from within Stata by typing net search rds and following the link. For now, please email me and I will send the code.

THE END
Contact:
Matt Schonlau: mschonlau@diw.de (until August), matt@rand.org
Elisabeth Liebau: eliebau@diw.de

Acknowledgement: We are grateful to Martin Iguchi, Sandy Berry, Allison Ober and Terry Fain for giving us access to the data for the example. The group is preparing a public release version of the data after additional publications are written.