Respondent-Driven Sampling
Carl Kendall, Ph.D.
Professor of International Health and Development
Tulane University, New Orleans, LA, USA
FIOCRUZ, July 29, 2006

RDS was developed by:
Douglas D. Heckathorn, Ph.D.
Professor of Sociology, Cornell University
Web Page: www.soc.cornell.edu/faculty/heckathorn.shtml
Through support from the Centers for Disease Control and Prevention, the National Institute on Drug Abuse, and the National Endowment for the Arts.

Lecture in two parts
Introduction, theory behind RDS
Brief description of what the PN has been doing in Brazil with RDS

Why?
HIV epidemic is multiple epidemics, taking place in multiple sub-populations
Many of these sub-populations are “hidden”
To understand epidemic dynamics, need to know what is going on in these networks of individuals at enhanced risk because of their behavior and their environment
In order to develop effective interventions for specific populations

Several basic methods have dominated the study of hidden populations.
Institutional sampling: May provide easy access to numerous high-risk individuals, but institutions draw the sample.
Time/Space sampling: Sampling frame combines places and times where the population gathers, and samples are weighted to compensate for variations in population density. Best suited for populations clustered in large public venues.
Targeted sampling: Simplified form of time/space sampling where the space is the street, no weights are used, and in most implementations snowball methods are also used.
Chain-referral sampling: Recruitment through networks reaches respondents who avoid public venues and institutions. Until recently, these have been considered convenience samples.
For a comparison and assessment see: “Street and Network Sampling in Evaluation Studies of HIV Risk-Reduction Interventions” by Salaam Semaan, Jennifer Lauby and Jon Liebman, AIDS Reviews, 2002.

Respondent-Driven Sampling
Chain-referral sampling characterized by long referral chains and a statistical theory of the sampling process that controls for bias, including the effects of the choice of seeds and differences in network size (Heckathorn 1997, 2002).
RDS also serves as the recruitment mechanism for a form of HIV-prevention intervention termed a Peer-Driven Intervention (PDI), also known as the “ECHO Model” (Broadhead and Heckathorn 1994; Heckathorn et al. 1999).

Classic Statement on Probability versus Convenience Samples
“The major strength of probability sampling is that the probability selection mechanism permits the development of statistical theory to examine the properties of sample estimators. Thus, estimators with little or no bias can be used, and estimates of the precision of sample estimates can be made.”
In contrast, the precision of estimators from nonprobability samples can be assessed only by “subjective evaluation.” Kalton (1983)
Implication: Making chain-referral sampling a form of probability sampling requires a statistical theory of the sampling process. This is part of a new class of sampling methods termed adaptive/link-tracing designs (Thompson and Frank 2000).

Can chain-referral sampling be a reliable method even though seeds from a hidden population cannot be selected randomly?
Referral patterns reflect a self-affiliation bias:
Race/Ethnicity of Recruit, by Race/Ethnicity of Person who Recruited, IDUs in Meriden

Race/Ethnicity of Person who Recruited | White | Hispanic | Black | Other | Total
Non-Hispanic White (70% of Population) | 81% | 8% | 6% | 5% | 100%
Hispanic (20% of Population) | 43% | 45% | 10% | 2% | 100%
Non-Hispanic Black (8% of Population) | 50% | 14% | 36% | 0% | 100%
Other (2% of Population) | 38% | 63% | 0% | 0% | 100%

A Statistical Theory: Recruitment as a Markov Chain
W = white, B = black, H = Hispanic, O = other
[Figure: transition diagram with nodes W, H, B, and O; each arrow carries the recruitment percentage from the table, e.g., self-loops of 81% (W), 45% (H), and 36% (B).]
Recruitment can be seen as a stochastic process that moves from node to node governed by the probabilities associated with the arrows.

Two theorems regarding regular Markov chains are relevant to an understanding of RDS:
THEOREM ONE: The "law of large numbers for regular Markov chains" (Kemeny and Snell 1960) states that the probability that a system will be in any given state over the course of a large number of steps is independent of its starting state.
Implication: As the sample expands wave by wave the composition of the sample becomes stable, reaching what is termed an “equilibrium,” so bias from the seeds disappears if the number of waves is large enough.
THEOREM TWO: The equilibrium is attained at a geometric (i.e., rapid) rate.
Implication: Only a moderate number of waves of recruitment are required for the subject composition to reach equilibrium (usually only 4 to 6).

Simulations of Recruitment in a Respondent-Driven Sample Based on the Recruitment Matrix: Race and ethnicity of recruits in each wave, beginning with all Hispanic or all non-Hispanic white seeds.
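The wave-by-wave simulation described here can be sketched in a few lines of Python. This is a minimal illustration, not the original analysis: it uses the Meriden recruitment percentages as transition probabilities, assumes every recruiter contributes equally to the next wave (so each wave's expected composition is the previous composition multiplied by the matrix), and rescales each row to sum to 1 because the rounded "Other" row sums to 101%. The function names are hypothetical.

```python
# Recruitment matrix from the Meriden table (rows: recruiter, columns: recruit).
GROUPS = ["White", "Hispanic", "Black", "Other"]
RAW = [
    [0.81, 0.08, 0.06, 0.05],  # recruiter: non-Hispanic White
    [0.43, 0.45, 0.10, 0.02],  # recruiter: Hispanic
    [0.50, 0.14, 0.36, 0.00],  # recruiter: non-Hispanic Black
    [0.38, 0.63, 0.00, 0.00],  # recruiter: Other (rounded values sum to 1.01)
]


def normalize_rows(matrix):
    """Rescale each row to sum to 1 (assumption: absorbs rounding error)."""
    return [[value / sum(row) for value in row] for row in matrix]


def next_wave(composition, matrix):
    """Expected composition of the next wave: composition times matrix."""
    size = len(matrix)
    return [
        sum(composition[i] * matrix[i][j] for i in range(size))
        for j in range(size)
    ]


def simulate(seed_composition, matrix, waves=10):
    """Return the expected composition at wave 0 (the seeds) through `waves`."""
    history = [list(seed_composition)]
    for _ in range(waves):
        history.append(next_wave(history[-1], matrix))
    return history


P = normalize_rows(RAW)
from_hispanic = simulate([0.0, 1.0, 0.0, 0.0], P)  # all-Hispanic seeds
from_white = simulate([1.0, 0.0, 0.0, 0.0], P)     # all-White seeds

for wave in (0, 1, 2, 4, 6, 10):
    h = ", ".join(f"{g} {x:.0%}" for g, x in zip(GROUPS, from_hispanic[wave]))
    w = ", ".join(f"{g} {x:.0%}" for g, x in zip(GROUPS, from_white[wave]))
    print(f"wave {wave:2d} | Hispanic seeds: {h} | White seeds: {w}")
```

Under these assumptions the two runs start from opposite extremes and settle on nearly the same composition, roughly 70% non-Hispanic White and 17% Hispanic, which is the behavior the two theorems predict.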
[Figure: two panels, "Hispanic Seeds" and "White Seeds". X-axis: recruitment wave (0 to 10); y-axis: % of population (0% to 100%). Series: non-Hispanic White, non-Hispanic Black, Hispanic, Other. Labeled values: 70% and 17%.]
Therefore, referral chains should be long

The operational details are:
1) Recruiters are rewarded
2) Coupons with unique serial numbers document and ration recruitment
3) Coupons are dollar-bill sized, printed on medium card stock using PowerPoint
4) Respondents call for appointments, bring their coupons to the interview, and are later given three more for recruiting peers

Implementation Stages in the Operation of RDS
Recruitment of Seeds
Recruitment by a Peer in the Community
Interview/HIV Testing and Counseling in the Interview Site
Recruitment of Peers in the Community
Debriefing and Reward for Peer Recruitment
After the seeds begin recruiting, the recruitment cycle continues until the sampling goal is reached

Requirements
Four absolute requirements:
Document who recruited whom
Recruiters and recruits must know one another
Ration recruitment so a few cannot do all the recruiting
Ask about personal network sizes

Using Respondent-Driven Sampling to Study Spatial Networks (Using Zip Code Level Data)
Network structure is revealed by successive waves of peer recruitment. The beginning point for one recruitment network, Seed #4, a black female bass player, is marked by the red pin near Times Square.
[Maps: Wave 1, Seed 4; Wave 2, Seed 4; Wave 3, Seed 4; Wave 4, Seed 4; All Waves, All Seeds. Douglas Heckathorn, Cornell University, 2003]
Panning Out….
[Maps, wider view: Wave 1, Seed 4; Wave 2, Seed 4; Wave 3, Seed 4; Wave 4, Seed 4; All Waves, All Seeds; All Waves, All Seeds San Francisco RDS]

Can RDS be a valid method despite:
Differences in network sizes (more recruitment paths lead to those with large networks, so they are over-sampled);
Self-affiliation bias (as seen above, people tend to recruit those like themselves);
Differential recruitment (some groups recruit more than others, so their recruitment patterns are over-represented).
Therefore, the sample composition may not mirror that of the population from which it was drawn, so a valid population estimator must take these factors into account.

Population Estimates from RDS: Population Estimates are Derived from Network Indicators
Data from Respondent-Driven Sample -> Network Indicators (proportion of cross-cutting ties and network size) -> Population Estimates (proportional group sizes, affiliation indices)
Aim: To compensate for the effects of network structure on recruitment into the sample, including differences in network sizes and clustering.

Biased Network Theory (1)
Developed in the early 1950s by Anatol Rapoport, later elaborated by Fararo and Sunshine (1968)
Network ties are formed randomly, through a stochastic process
In an unstructured system, ties are formed through random mixing, e.g., if a group makes up 75% of the population, it will have 75% in-group ties.
More than a century ago, Galton recognized that friendships tend to form among those who are similar, a tendency called homophily.
Ties can also form based on complementarity, e.g., sexual relations among heterosexuals; this is negative homophily, also known as heterophily.
Biased Network Theory and Affiliation Patterns
In Biased Network Theory, structure can be defined using an Index of Network Clustering termed Homophily (Fararo and Sunshine 1968; Heckathorn 2002)
Homophily = 1 if all ties are formed to the in-group
Homophily = 0 if all ties are formed randomly
Homophily = -1 if all ties are formed to the out-group
Intermediate values are defined similarly, e.g., homophily = .32 if ties are formed as though 32% of the time an in-group tie is formed, and the rest of the time ties are formed by random mixing.
This clustering index is used because of its fit with RDS sampling theory: If homophily sums correctly, the equilibrium sample composition mirrors the population from which it was drawn.

The Reciprocity Model: How to estimate population size based on network indicators
When ties are reciprocal, the number of ties from any group A to B, Tab, is equal to the number of ties from B to A, Tba, i.e.,
Tab = Tba, e.g., 2 = 2
The number of ties from A to B is the product of four terms: (1) the number of nodes in the system, X, (2) the proportional size of A, Pa, (3) the average network size of A, Na, and (4) the proportion of ties from A to B, Sab.
Tab = X * Pa * Na * Sab, e.g., 5 * .4 * 2 * .5 = 2

The Reciprocity Model (2): How to estimate population size based on network indicators
Given that Tab = Tba, by expansion:
X * Pa * Na * Sab = X * Pb * Nb * Sba
When (1 - Pa) is substituted for Pb, this reduces to:
Pa = (Sba * Nb) / (Sba * Nb + Sab * Na)
Note that the term for total population drops out, so this model yields population proportions but not absolute sizes.
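The reciprocity model and the homophily index can be illustrated with a short Python sketch. It is a hedged example rather than the RDSat implementation. The group A values (X = 5, Pa = .4, Na = 2, Sab = .5) come from the worked example; the group B values (Nb = 2, Sba = 1/3) are hypothetical, chosen only so that Tba also equals 2. The proportion estimator is obtained by solving Pa * Na * Sab = (1 - Pa) * Nb * Sba for Pa. The homophily function encodes one reading of the definition above: in-group ties are a mixture of forced in-group ties and random mixing, with the negative case scaled so that zero in-group ties gives -1. All function names are assumptions.

```python
def ties(x, p, n, s):
    """Number of ties from one group to another: T = X * P * N * S."""
    return x * p * n * s


def estimate_proportion_a(n_a, s_ab, n_b, s_ba):
    """Solve Pa * Na * Sab = (1 - Pa) * Nb * Sba for Pa.

    The total population X cancels, so only a proportion is recovered.
    """
    return (s_ba * n_b) / (s_ba * n_b + s_ab * n_a)


def homophily(s_aa, p_a):
    """Homophily index under the mixture reading (an assumption).

    s_aa: observed share of group A's ties that stay in-group.
    p_a:  group A's share of the population (the random-mixing baseline).
    """
    if not 0.0 < p_a < 1.0:
        raise ValueError("p_a must lie strictly between 0 and 1")
    if s_aa >= p_a:
        # Positive homophily: s_aa = H + (1 - H) * p_a
        return (s_aa - p_a) / (1.0 - p_a)
    # Negative homophily (heterophily): scaled so s_aa = 0 gives -1
    return (s_aa - p_a) / p_a


# Worked example from the slide: 5 * .4 * 2 * .5 = 2 ties from A to B.
t_ab = ties(5, 0.4, 2, 0.5)
# Hypothetical group B values chosen so that reciprocity (Tab = Tba) holds.
t_ba = ties(5, 0.6, 2, 1 / 3)
p_a = estimate_proportion_a(n_a=2, s_ab=0.5, n_b=2, s_ba=1 / 3)

print(f"Tab = {t_ab:.2f}, Tba = {t_ba:.2f}, estimated Pa = {p_a:.2f}")
print(f"homophily of A = {homophily(1 - 0.5, p_a):.2f}")
```

The estimator recovers Pa without ever using X, which is the point of the final note on the slide: network indicators identify proportions, not absolute sizes.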
RDS and IRB issues
Unlike snowball sampling, in which respondents (Rs) provide peers’ contact information and investigators make contact, RDS does not ask Rs to violate the privacy of peers
Incentives should be kept modest enough to prevent coercion, but questions should be asked about it to ensure it does not happen
Very large incentives could be viewed as in themselves coercive
IRBs have sometimes resisted raising incentive amounts, so do not start with too small an amount; focus groups and pilot studies are useful
IRBs have also been concerned about giving money to IDUs; recruitment quotas limit the annual amount that can be earned, so incentives will not affect drug habits (e.g., in ECHO studies, 99.7% of Rs earned less than $100/year)
Because of tracking by serial numbers, coupons cannot become an alternative source of currency (unlike store coupons, food stamps, etc.)

Limitations of Respondent-Driven Sampling
Limitations inherent in all sampling methods apply to RDS, e.g., the interview site must be readily accessible, interviewers must be culturally sensitive, and no sampling method can completely eliminate non-response bias.
In addition, there are limitations specific to RDS:
1) Population members must know one another as members of the target population. This can occur, for example, through the contact patterns created by sexual contact or drug sharing.
2) Network ties must be dense enough to sustain the chain-referral process.
3) Means must exist to motivate population members to recruit their peers.
4) Means must exist for verifying membership in the target population, lest others seek entry into the study to gain respondent fees.
5) Statistical power decreases when homophily is high.

Advantages of RDS
Controls for the biases associated with chain-referral methods, providing both population estimates and estimates of variability for those estimates.
Requires little formative research, and therefore sampling can begin quickly.
In contrast, time/space and targeted sampling require detailed prior mapping of the target population.
Accesses persons through their social networks, even reaching those who shun large public venues and avoid the street.
Recruitment is carried out by respondents at minimal cost; no field staff is required, so training requirements and costs are reduced.
The number of additional questions that must be added to the instrument is small, so the method’s overhead is minor.
The problem of non-response bias is reduced by the dual incentive system (respondent fees and peer pressure).

RDS Software
IRIS coupon manager
RDSat
http://www.respondentdrivensampling.org/main.htm

Software: RDSat
Calculates population estimates based on
  Linear Least Squares or Data Smoothing (normal or enhanced)
  Arithmetic or Weighted Net Sizes
  Net size outliers can be pulled in (large outliers make the arithmetic mean unstable; small ones make the weighted mean unstable)
Equilibrium Sample Composition Weights
Reciprocity Index
Homophily (useful for calculating design effects)
Standard Errors
Accepts data files created by IRIS 3.0, so calculations can be made in the field
Creates a data file useful for studying recruitment networks using UCINET or Pajek
Limits: sample size 2,500, coupons 40 per respondent

Selected Bibliography on RDS
"Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations." By Douglas D. Heckathorn. Social Problems, 1997.
"Respondent-Driven Sampling II: Deriving Valid Population Estimates from Chain-Referral Samples of Hidden Populations." By Douglas D. Heckathorn. Social Problems, 2002.
"Extensions of Respondent-Driven Sampling: A New Approach to the Study of Injection Drug Users Aged 18-25." By Douglas D. Heckathorn, Salaam Semaan, Robert S. Broadhead, and James J. Hughes. AIDS and Behavior, 2002.
"Group Solidarity as the Product of Collective Action: Creation of Solidarity in a Population of Injection Drug Users." By Douglas D. Heckathorn and Judith E.
Rosenstein. Advances in Group Processes, 2002.
"Development of a Theory of Collective Action: From the Emergence of Norms to AIDS Prevention and the Analysis of Social Structure." By Douglas D. Heckathorn. In New Directions in Sociological Theory: Growth of Contemporary Theories (Joseph Berger and Morris Zelditch, editors). Rowman and Littlefield, 2002.
"Finding the Beat: Using Respondent-Driven Sampling to Study Jazz Musicians." By Douglas D. Heckathorn and Joan Jeffri. Poetics, 2001.
"Making Unbiased Estimates from Hidden Populations Using Respondent-Driven Sampling." By Matthew J. Salganik and Douglas D. Heckathorn. Paper presented at the International Social Network Conference, February 2003, Cancun, Mexico.
"Jazz Networks: Using Respondent-Driven Sampling to Study Stratification in Two Jazz Musician Communities." By Douglas D. Heckathorn and Joan Jeffri. Paper to be presented at the American Sociological Association meetings, August 2003, Atlanta, GA.

RDS in Brazil
Planning meeting, introduction to RDS, November 2004: Carl Kendall, Tulane; Keith Sabin, CDC
Protocol development, May 2005: Lisa Johnston, Tulane
Data collection, August-October 2005
Analysis workshop, May 2006
Writing workshop, July 2006: UCSF - 15 papers

Site | Target Population | Participants
Campinas 1 | MSM | Maeve Mello
Curitiba | Female CSW | Clea Ribiero, Augusto Evangelista
Fortaleza | MSM | Ligia Sansigolo Kerr, Linda Maia
Recife | Drug Users (DU) | Luiz Oscar Ferreria, Eniel Oliveira
Porto Alegre | CSWs | Cintia Germany, Mauro Ramos
Santos | CSWs | Neide Silva, Regina Maria Lacerda
Campinas 2 | IDU | Elvira Maria Filipe, Marcia Moreira Holcman
Manaus 1 | Female CSW | Joao Catarino Dutra, Felicien Vasques
Manaus 2 | DU | Marcos Santos, Roberio Reboucas

An empirical comparison of RDS, targeted TLS and snowball sampling methodologies in a hidden population in Fortaleza, Brazil
Ligia Kerr, Carl Kendall, Rogério Gondim, Guillerme Werneck, Lisa Johnston, Keith Sabin

Study design
Cross-sectional study in Fortaleza/Ce
2002 (401): 32% “Snow Ball”, 68% TLS
2005 (406): 100% RDS

Measures
Questionnaire based on BSS
Socio-economic status (education/social class)

Main results
Table 1. Education and social class of two survey rounds using three different methods, Fortaleza/Ce, 2006. Cells show prevalence rate (CI).

Variable | 2002 (Snowball) | 2002 (TLS) | 2005 (RDS)
Social class | N=126 | N=254 | N=406
A | 15.1 (9.3-22.5) | 25.6 (20.3-31.4) | 0.8 (0.1-1.2)
B | 39.7 (31.1-48.8) | 37.4 (31.4-43.7) | 2.6 (1.4-3.8)
C | 33.3 (25.2-42.3) | 24.8 (19.6-30.6) | 24.0 (19.6-30.7)
D | 11.9 (6.8-18.9) | 11.0 (7.5-15.5) | 44.5 (39.7-50.9)
E | 0.0 | 1.2 (0.2-3.4) | 27.9 (20.7-32.4)
Education | N=126 | N=254 | N=388
Illiterate or 1º Incomplete | 5.6 (2.3-11.1) | 8.5 (5.5-12.5) | 29.0 (22.1-35.0)
1º Complete or 2º Incomplete | 61.9 (52.8-70.4) | 54.4 (48.3-60.5) | 24.0 (18.6-29.4)
2º Complete or higher | 32.5 (24.5-41.5) | 37.0 (31.3-43.1) | 46.8 (41.1-54.1)

Secondary results
Table 2. Education of AIDS cases among MSM in Ceará. Fortaleza/Ce, 2000-2005.

Education | AIDS cases among MSM in Ceará
Illiterate or 1º Incomplete | 52.9
1º Complete or 2º Incomplete | 30.9
2º Complete or higher | 16.2

Public Health implications
RDS reaches lower social class respondents than TLS in this example in Fortaleza.
Social classes D and E have a higher proportion of AIDS cases.
RDS would appear to be the sampling method of choice in Fortaleza.

The Effectiveness of the Respondent-Driven Sampling Methodology for HIV Behavioral Surveillance among Female Sex Workers in Santos
July 2006
Implementing organization: ASPPE Santos, WWW.ASPPE.ORG
[Figure: social network by HIV status; legend: HIV, HIV +]

Thank you!