Estimating the size and characteristics of MARPs using Network Scale-up
Chris McCarty
PHC6716
July 20, 2011

The Problem
• Certain populations are at high risk for contracting and spreading HIV
• Most At Risk Populations (MARPs) typically fall into one of three categories
  – Female Sex Workers
  – Men Who Have Sex With Men
  – IV Drug Users
• Members of all three populations engage in behavior that increases the chance of contracting HIV
• All three populations are difficult to measure directly

What is known about MARPs?
• Many studies have been done to estimate the prevalence of HIV among these populations, and to measure characteristics of the populations
• Representative samples are drawn to estimate the proportion of the population with HIV
• Unlike other sample surveys there is no known population size to which these proportions can be applied
• This means that the size of the problem remains largely unknown, particularly on a regional or local level

Methods to Estimate the Size of MARPs
(http://data.unaids.org/pub/Manual/2003/20030701_gs_estpopulationsize_en.pdf)
• Methods that require a sample frame
  – Census
    • Counting all members
  – Enumeration:
    • Counting members in a sample frame then scaling up
  – Population Survey:
    • Draw a representative sample (similar to enumeration)
• Methods that do not require a sample frame
  – Capture-Recapture
  – Multiplier

Capture-Recapture
• Method originated in biology to estimate the size of fish and wildlife populations
• The method involves five steps:
1. “Capture” a sample of subjects
2. Tag them
3. Release them back into the population
4. “Re-capture” a sample at a later time
5. Estimate the population size based on the proportions
• With humans “tagging” is sometimes done by providing a unique object
• Otherwise tagging is done by using information about the respondent (e.g. Social Security Number or other identifying characteristic)

Capture-Recapture (cont.)
• N=MC/R where:
  – N = Estimate of total population size
  – M = Total number captured and marked on the first visit
  – C = Total number captured on the second visit
  – R = Number captured on the first visit that were recaptured on the second visit
• Example: M=200, C=200, R=10 then N=200x200/10=4,000
• Assumes a closed system without in or out migration
• More complex models allow for multiple sites

Multiplier
• Relies on overlap of information between two sources:
1. Data on attendance by target population at an institution that serves them (e.g. a clinic)
2. Data from target population about their attendance
• Example
  – Clinic screened 3,500 sex workers in a two week period
  – A survey of 600 sex workers yielded 404 who said they had been screened
  – The multiplier = 600/404 = 1.49
  – The population estimate = 3,500 x 1.49 = 5,215

Problems with these approaches
• All these methods require interviews with members of the target population
• The Census, Enumeration and Population Surveys require sample frames which are lacking for hidden or elusive populations
• The Capture-Recapture and Multiplier methods are difficult to do across large geographies

A note on RDS
• Respondent Driven Sampling (RDS) is a method to measure the characteristics of an elusive population (http://www.respondentdrivensampling.org/)
• This starts as a snowball sample where respondents recruit other respondents using coupons
• Each respondent must report the names of others they know in the population
• RDS requires completion of minimal chains without breaks (RDS does not work in disconnected populations)
• RDS is a weighting procedure that adjusts for the non-random procedure for collecting the data
• RDS will NOT give estimates of the size of a population

An Alternative: Network Scale-up
• This is a population-based survey approach that does not require a sample frame of the target population
• The method relies on asking respondents (not necessarily in the target population) about people they
know in the target population
  – Not talking to the target population was politically unpopular in Ukraine
• This method was developed over the past 20 years but has only recently gained recognition

Background on Network Scale-up
• The idea came from Russ Bernard after the 1985 earthquake in Mexico City
• Official reports estimated deaths at around 7,000
• These estimates did not jibe with anecdotal reports from residents, many of whom knew someone who died in the earthquake, or with opposition newspapers that put the death toll as high as 22,000
• A study was conducted asking a random selection of 400 residents how many people they knew who had died; 23 percent reported knowing at least one
• A model was created to estimate what their personal network size would have to be to account for this high percentage; this suggested a much higher death rate
• Later reports by the Red Cross established the deaths at more than 25,000

This suggests a primitive model
• t = the size of a population (e.g. Mexico City)
• e = the size of some subpopulation within it (e.g. all those who died in the earthquake)
• m = the number of people a respondent knows in e
• c = personal network size
• Assumption: Everyone’s network in a society reflects the distribution of subpopulations in that society
$$\frac{m}{c} = \frac{e}{t}$$

New model
• We can use reports about many populations of known size to back-estimate personal network size c
• Given an individual c and reports of an unknown population m we can then back-estimate e
$$c_i = t \cdot \frac{\sum_{j=1}^{L} m_{ij}}{\sum_{j=1}^{L} e_j}$$
  – Here m_{ij} is the number of people respondent i reports knowing in known population j, e_j is the size of known population j, and L is the number of known populations

What we did
• We conducted a series of telephone surveys
• For each respondent we asked how many people they knew in populations of known size
• We also asked how many people they knew in populations of unknown size with estimates from other sources
• We estimated the distribution of c and back-estimated e for each unknown subpopulation
• 1998 Killworth, P.D., E.C. Johnsen, C. McCarty, G.A. Shelley, and H.R. Bernard.
A Social Network Approach to Estimating Seroprevalence in the United States. Social Networks 20:23-50.
• 1998 P. D. Killworth, C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen. Estimation of Seroprevalence, Rape and Homelessness in the U.S. Using a Social Network Approach. Evaluation Review 22:289–308.

Populations in survey

Known population                               Average known
Native Americans                               3.5
Gave birth in past 12 months                   3.6
Women who adopted a child in past 12 months    0.3
Widow(er) under 65 years old                   3.2
On kidney dialysis                             0.6
Postal worker                                  2.2
Commercial pilot                               0.7
Member of Jaycees                              1.1
Diabetic                                       3.3
Opened a business in past 12 months            1.1
Have a twin brother or sister                  2.0
Licensed gun dealer                            0.5

Known population (first names)                 Average known
Michael                                        4.8
Christina                                      1.3
Christopher                                    1.8
Jacqueline                                     0.7
James                                          3.4
Jennifer                                       2.3
Anthony                                        1.7
Kimberly                                       1.4
Robert                                         4.1
Stephanie                                      1.3
David                                          3.5
Nicole                                         1.1

Unknown population                             Average known
HIV positive                                   0.7
Women raped in past 12 months                  0.2
Homeless                                       0.7

Estimates of unknowns
• RDD telephone survey of 1554 adults in the U.S. in 1994.
  – Seroprevalence: 800,000 ± 43,000
  – Homeless: 526,000 ± 35,000
  – Women raped in the last 12 months: 194,000 ± 21,000
• These were all close to other estimates made with various enumeration or surveillance methods.
• 1998 P. D. Killworth, C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen. Estimation of Seroprevalence, Rape and Homelessness in the U.S. Using a Social Network Approach. Evaluation Review 22:289–308.

Estimates of c are reliable across multiple surveys
• Across seven surveys, we consistently find an average network size (c) of 290 (sd 232, median 231).
• And 290 is not an average of averages. It’s a repeated finding.

Is 290 an artifact of the method?
• We tested this in three ways:
1.
Make the estimates using a different method.
2. Experiment with parameters and see if the outcome varies in expected ways.
3. Compare values of c across populations of known relative sizes.

Reliability I: Compare to a different method
• In one survey, we estimated c by asking people how many people they know in each of 16 relation categories and summing.
• We also used the known populations
• The summation method produced a mean c of 290.7, while the known population method produced a mean c of 290.8
• McCarty, C., P. D. Killworth, H. R. Bernard, E. Johnsen, and G. A. Shelley. Comparing Two Methods for Estimating Network Size. Human Organization 60:38–39

Category                                       Average known
Immediate family                               3.5
Other birth family                             24.0
Family of spouse or significant other          12.3
Co-workers                                     35.6
People at work but don’t work with directly
Best friends/confidantes                       62.1
People known through hobbies/recreation
People from religious organization             12.3
People from other organization                 17.1
School relations                               18.3
Neighbors                                      12.8
Just friends                                   22.6
People known through others                    22.6
Childhood relations                            6.8
People who provide a service                   7.7
Other                                          3.9
Unplaced values: 4.3, 43.4

Reliability II: Change the data
• We changed reported values at or above 5 to a value of 5 precisely.
  – The mean dropped to 206, a change of 29%.
• We set values of at least 5 to a uniformly distributed random value between 5 and 15. We repeated the random change (5 to 15), but only for large subpopulations (with >1 million).
  – The mean increased to 402, a change of 38%, in the opposite direction.

Reliability III: Survey a population with an expected large network size
• We surveyed a national sample of 159 members of the clergy, people who are widely thought to have large networks.
• Mean c = 598 for the scale-up method
• Mean c = 948 for the summation method

So, 290 is not a coincidence
1. Two different methods of counting produce the same result.
2. Changing the data produces large changes in the results, and in the expected directions.
3.
People who are widely thought to have large networks do have large networks.

Can we predict what we do know?
• We can test our model by seeing how well we do on the 29 populations of known size
• The overall result is encouraging, but we don’t estimate some populations well
• There is a tendency for people to overestimate small populations (<2 million) and to underestimate large ones (>3 million).
• The two largest populations are people who have a twin and diabetics, the two outliers in the upper left
• Without these two outliers, the correlation rises from r = .79 to r = .94

Another encouraging result
• Charles Kadushin ran a national survey to estimate the prevalence of crimes in 14 cities, large and small, in the U.S.
• He asked 17,000 people to report the number of people they knew who had been victims of six kinds of crime and the number of people they knew who used heroin regularly.
• 2006 C. Kadushin, P. D. Killworth, H. Russell Bernard, and A. Beveridge. Scale-up methods as applied to estimates of heroin use. Journal of Drug Issues 36:417-440.

Compromising assumptions
• Barrier Error
  – The model assumes everyone in t has an equal chance of knowing someone in e; social and geographic barriers violate this.
• Transmission Error
  – The model assumes everyone knows everything about everyone they know; in practice respondents are not always aware of an alter’s characteristics.
• Inaccurate recall
  – People don’t recall accurately the number of people they know in the subpopulations we ask them about.

Barrier Error
Correlation between the mean number of Native Americans known and the percent of the state population that is Native American is 0.58, p = 0.0001.

Network social barriers
• Race (African Americans may know more diabetics than White people do.)
• Gender (Men may know more gun dealers than women do.)
• Even first names are associated with the barrier effect.
• We address the barrier effect by using a random, nationally representative sample of respondents

Transmission Error Study
• We recruited 30 members of one of the known populations used in the network scale-up method.
• We randomly selected male and female first names proportionate to the 1990 U.S. Census.
• For each of 25 hits, the respondent provided some information about the alter, including the alter’s phone number. Total=30x25=750.
• We contacted 220 of 750 named alters and asked them things about themselves and about ego.

Findings from the study
• We see from this table that it is much easier to know people in some populations than in others.
• It is much easier to know that someone is a kidney dialysis patient than it is to know that they are a diabetic.
• Diabetes is much less visible.

Population                 % who knew   % who did not know   Respondents   # of alters
Am. Ind.                   100          0                    2             12
Diabetic                   55           45                   6             44
Birth in last 12 mos.      93           7                    3             27
Gun dealer                 92           8                    1             12
Member of JC’s             58           42                   1             12
Dialysis                   88           12                   5             26
Business in last 12 mos.   75           25                   4             16
Postal worker              100          0                    1             10
Has twin                   88           12                   2             24
Widowed <65                97           3                    4             38

Can we account for these errors?
• Can we use this kind of information to tweak the model?
• We tried to develop weightings for classes of characteristics about subpopulations … classes like “things that carry a strong stigma” and “things that carry a moderate stigma” and “things that just don’t come up in conversation.”
• While we found some signals like these, we don’t know how to determine whether two populations require the same weighting.
• Matt Salganik has recently completed a study in Brazil attempting to refine these weights

Informant Inaccuracy
• We tried procedures to improve accuracy
1. Asking respondents to provide names for all the knowns they nominated
2. Asking respondents to report on knowns twice
3.
Asking respondents on a scale of 1 to 5 how confident they were in their answers
• None of these procedures changed the results much

Countries where Network Scale-up has been implemented
• United States
• Mexico
• Ukraine
• Moldova
• Peru
• Brazil
• Thailand

How to conduct a network scale-up survey

Network scale-up begins like most surveys
• Define respondent population
• Choose sample frame
• Choose survey mode
• Choose sample size
• Design questionnaire (This is the part that’s different)

Selecting respondent population
• Respondent population is not the same as the population to be estimated (target population)
  – U.S. respondents to estimate homeless population
  – Urban population to estimate heroin users
• You must know the size of the respondent population
• Do transmission and barrier errors suggest using a respondent population with more ties to the target population?
• The opportunity to do this research in multiple countries could help solve this problem

Choose sample frame
• The sample frame represents the respondent population
• For our work we used random digit dial telephone numbers
• For face-to-face interviews, a general population survey may rely on census or voter registration data

Choosing mode
• There are five survey modes
  – Face-to-face
  – Telephone (this is what we used)
  – Mail
  – Drop and collect
  – Web
• There is a large literature on mode effects in surveys
• For the populations of interest to UNAIDS a face-to-face or mixed mode makes sense

Choose sample size
• As with any survey, the sample size should be based on expected margins of error
• For this survey we have margins of error associated with network size
• Although estimates of network size are remarkably reliable, they have large standard deviations
• Our data suggest that a survey of 400 respondents would generate a margin of error of ±26 alters
• A survey of 1,000 would generate a margin of error of ±16 alters
• Keep in mind these are based on variance for U.S.
respondents

Design questionnaire
• Network scale-up questionnaire has three parts
1. Demographics used to estimate bias
2. Questions to estimate the number of alters the respondent knows in the target population
3. Questions to estimate network size (c)
• Parts 2 and 3 require a boundary definition of who is counted as a network alter

Alter boundary
• Definition of who is an alter can have enormous effects on the estimate
• Defining the alter boundary as 12 months will generate different network sizes than a boundary of two years
• Our definition:
  – You know them and they know you by sight or by name. You have had some form of contact with them in the past two years and you could contact them if you had to
  – Question: Should respondents be instructed to exclude those met on networking sites such as Facebook?

There are two ways to estimate c
• Scaling from known populations
• The summation method

Using known populations
• Select a set of known populations, the more the better
• Populations should vary in size and type
  – Limiting the study to populations related to health conditions, although plentiful, may introduce barrier error
  – Using only large populations (such as men or people over age 65) introduces a lot of estimation error
  – Using only small populations introduces error from very few hits
  – Known populations should be between 0.1% and 4% of the population (this may change as we learn more)
• The demographic characteristics of the known populations should match as closely as possible the demographic characteristics of the population upon which the known estimates are based
• Populations are often subject to transmission and barrier effects
• In the past we assumed that by using populations of multiple sizes and types these effects are cancelled out

Examples of populations we used
• In the U.S. there are a variety of sources for known populations:
  – The U.S. Statistical Abstract
  – The U.S.
Census
  – The FBI Crime Statistics
• Ideally collection of sub-population data will be recurring so that they can be used in subsequent years
• It is important that the data all reflect the same year (be aware that some population data lags)
• Known populations are very susceptible to transmission and barrier error

Relationship between number known and demographic characteristics
[Table: rows = Native Americans; Gave birth in past 12 months; Adopted a child in past year; Widow(er) under 65 years; On kidney dialysis; Postal worker; Commercial pilot; Member of Jaycees; Diabetic; Opened a business in year; Have a twin brother or sister; Licensed gun dealer; Came down with AIDS; Males in prison; Homicide victim in past year; Suicide in past year; Died in wreck in past year; Women raped in past year; Homeless; HIV positive. Columns = State; Sex; Race; Age; Education; Marital status; Work status; Religion; Political Party.]

We experimented with names
• Census provides estimates of both first names and last names
• We experimented with both types and found problems with each
• The advantage of names is that they vary in size and are typically ascribed
• Countries and cultures vary in the way they use names
• They are prone to barrier error

Relationship between number known and demographic characteristics
[Table: rows = Michael; Christina; Christopher; Jacqueline; James; Jennifer; Anthony; Kimberly; Robert; Stephanie; David; Nicole. Columns = State; Sex; Race; Age; Education; Marital status; Work status; Religion; Political Party.]

Summation method
• We can estimate network size (c) directly by asking respondents to tell us how many people they know
• This is an unreasonable task unless it is broken into reasonable subtasks
• We use culturally relevant categories of relation types that are mutually exclusive and exhaustive
• These are small enough that respondents can estimate them reliably

Relation categories we used
• Immediate family
• Other birth family
• Family of spouse or significant other
• Co-workers
• People at work but
don't work with directly
• Best friends/confidantes
• People known through hobbies/recreation
• People from religious organization
• People from other organization
• School relations
• Neighbors
• Just friends
• People known through others
• Childhood relations
• People who provide a service
• Other

Developing a protocol for discovering summation categories
• We assume that relation categories used to elicit estimates will be culturally relative
  – Different languages will require their own category names
  – The way people keep track of others in their minds will almost certainly vary by culture
• Further research is needed to determine the best protocol for discovering these categories
• Summation categories must be mutually exclusive, exhaustive and small enough that respondents count rather than estimate

Approaches we are studying
• Our current categories emerged from a previous study about the ways people know each other
• This is not ideally suited to this study
• We are exploring using cultural consensus analysis or personal network structure to quickly develop these categories
• An empirical approach is to start with very large culturally relevant categories and use alter characteristics to split them when they are too large

Estimates of network size from two methods (scaling from known and summation) are very close
• Scaling from known populations: 290.8 (SD 264.4)
• Summation method: 290.7 (SD 258.8)
• We checked in multiple ways to see whether this was an artifact of the method
• It wasn’t

Advantages of the summation method
• It is quicker, taking about half the time or less than estimating from known sub-populations
• It should not be subject to transmission or barrier error
• It does not require finding known populations, which could be a problem in some countries

Disadvantages of summation method
• It cannot be verified statistically
• It may be easy for respondents to double count network alters with whom they have multiplex relations (such as co-worker and social contact)
• Network
size calculated from scaling known populations can be checked by back-estimating each known with the other knowns

Modeling issues
• At this point in our work we are convinced that our estimates of network size are relatively reliable, but not absolutely reliable
• If my network is 300 then I am confident it is half as large as that of someone with a network of size 600
• I am not confident that the network size is actually 300
• This compromises our ability to estimate the absolute size of a population
• Again, the opportunity to replicate this method may yield solutions

How to generate scale-up estimates
• There are two steps
  – Estimate network size c
  – Use c with respondents’ estimates of unknown populations to scale up to the size of the unknown in the population
• We will look at these steps separately

Step 1: Estimating c using summation method
• With the summation method you add up the estimates from each relation category to get a c value for each respondent
• The c used in the formula will be the average of all those c values from each respondent

Step 1: Estimating c using known populations
• This procedure requires three parameters
• t=the size of the population to which you are scaling up (this is the same for each respondent)
• e=the sum of all the known populations you are using in the survey (this is the same for each respondent)
• m=the sum, for each respondent, of the numbers they report knowing in each known subpopulation
• c for each respondent is (m*t)/e
• The c used in the formula will be the average of all those c values from each respondent

Step 2: Applying c
• This step also requires three parameters
• t=the size of the population to which you are scaling up (this is the same for each respondent)
• c=the average c value, either from the known-population (scale-up) method or the summation method
• m=the average of all respondents’ estimates of the number of people they know in the unknown subpopulation
• The formula to estimate the size of the unknown subpopulation is e=(m/c)*t
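
Worked sketch of Steps 1 and 2
• The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the surveys: the function names (c_from_known, c_from_summation, scale_up) are hypothetical, and the toy numbers are invented so the arithmetic is easy to follow.

```python
from statistics import mean


def c_from_known(reports, known_sizes, t):
    """Step 1 (known populations): c = (m * t) / e for one respondent.

    reports     -- number the respondent reports knowing in each known population
    known_sizes -- size of each known population, in the same order
    t           -- size of the population being scaled up to
    """
    m = sum(reports)       # total known across all known populations
    e = sum(known_sizes)   # combined size of the known populations
    return m * t / e


def c_from_summation(category_counts):
    """Step 1 (summation): add the counts given for each relation category."""
    return sum(category_counts)


def scale_up(unknown_reports, c_values, t):
    """Step 2: e = (m / c) * t, using averages of m and c over respondents."""
    m_bar = mean(unknown_reports)
    c_bar = mean(c_values)
    # Written as m * t / c, which is algebraically the same as (m / c) * t.
    return m_bar * t / c_bar


# Toy inputs, invented for illustration only (not survey data).
t = 1_000_000
known_sizes = [10_000, 20_000]
known_reports = [[3, 6], [1, 2]]   # one row per respondent
unknown_reports = [2, 0]           # number known in the unknown population

c_values = [c_from_known(r, known_sizes, t) for r in known_reports]
estimate = scale_up(unknown_reports, c_values, t)
print(c_values, estimate)
```

• With these toy inputs the two respondents get c = 300 and c = 100, so the average c is 200, the average m is 1, and the estimate is (1/200) x 1,000,000 = 5,000. Swapping in c_from_summation for c_from_known changes only how each respondent's c is obtained; Step 2 is the same either way.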