Fundamental Building Blocks of Social Structure
Honoring Peter Killworth's contribution to social network theory
Southampton, Sept. 28, 2006

The network scale-up team: Peter D. Killworth (SOC), Christopher McCarty (U Florida), Gene A. Shelley (Georgia State U), Eugene Johnsen (UC-Santa Barbara), H. Russell Bernard (U Florida)

Some background: "I'll have a go at that" (Scripps, 1972)
I asked everyone on a ship to rank order their interactions with all the others. I came to the physics department coffee break and asked, "Anybody here want to know the social structure of a vessel that gets all your data?" The ocean-going physicists in the room knew they weren't supposed to talk to people like me and didn't even look up. Peter hadn't gotten the memo about social scientists and said he thought it might be fun. And that's what it's been, for 34 years and 40-odd papers ...

How to get at the structure of these data? "Let's try this ..."
Peter applied an algorithm from F. S. Acton's (then) recent book, "Numerical Methods that (Usually) Work." The algorithm had been developed to solve a traffic problem: how to get from point A to point B fastest, irrespective of the number of red lights on the path.

Visualizing the messy result.

The prison studies
We combined numerical methods with ethnography. The cliques always made sense, until one day ... three numerically tied inmates whose connections made no apparent sense: different crimes, North and South, rural and urban, Black and White. Finally, finally, an artifact. Peter: "This is too easy."

We discovered that physicists don't apply their models to social structure and anthropologists don't test the error bounds of their instruments. We were half-way on this one, so we started the accuracy studies.

How to study accuracy?
We studied groups whose real communication could be unobtrusively monitored and whose members we could ask questions like: "So, in the last [day], [week], [month], who did you talk to in this group?"
- Deaf people on TTYs
- Ham radio operators in a local network
- An early e-mail group
- An office
- A fraternity

Half of what people tell you is incorrect
People don't recall behaviors that did occur and recall behaviors that didn't occur. People aren't lying. They're just terrible behaviorscopes.

Extending (or redefining) the problem
We asked: are the instruments for gathering data about human behavior producing accurate measurements of human behavior? Others used our data and asked: what do those instruments produce a valid measurement of? Answer: if you ask people who they interact with, people retrieve who they usually interact with and report who they ought to interact with, given everything they already know about their place in the social structure.

Next, the small world ...
Milgram's famous small-world experiment told us that there are 5.5 links between any two white people in the U.S. and exactly one more link between any white and any black person in the U.S. But these numbers do not tell us anything about the structure of the society.

Peter: "Let's find out how the SW actually operates"
Show people a list of small-world targets, complete with information about name, location, occupation, hobbies, and organizations. Ask people to tell us their first link in a small-world experiment. Repeat 500 times and analyze the information people need to make their choice of a first link.

The reverse small-world experiments
We ran six of these experiments in the U.S., in Micronesia, and in Mexico. The things that people in the U.S. find useful to the task (name, location, occupation, hobbies, organizations) are the same things that people in other cultures need to know to place someone in their network. For both of us, the cross-cultural regularity discovered in this series of experiments is among the most exciting results of our work.

We created a similarity matrix between targets: how many people used the same choice for a given pair of targets? A 2-D MDS shows the enduring influence of Gerhard Mercator on schooling.
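To make that last step concrete, here is a minimal sketch in Python of the similarity-matrix-plus-MDS idea. The response data, array names, and sizes below are invented for illustration; the only thing taken from the talk is the counting rule (how many respondents used the same first-link choice for a given pair of targets).

```python
# A minimal sketch, with invented data, of the similarity-matrix + MDS step:
# similarity[j, k] counts how many respondents picked the same first link
# for targets j and k; 2-D MDS then lays the targets out on a plane.
import numpy as np
from sklearn.manifold import MDS  # assumes scikit-learn is available

rng = np.random.default_rng(0)
n_respondents, n_targets = 100, 8   # the real experiments used ~500 trials

# choices[i, j] = ID (within respondent i's own network) of the first link
# respondent i chose for target j -- hypothetical data
choices = rng.integers(0, 5, size=(n_respondents, n_targets))

# similarity[j, k] = number of respondents who used the same choice for j and k
similarity = (choices[:, :, None] == choices[:, None, :]).sum(axis=0)

# MDS expects dissimilarities; the diagonal (a target vs. itself) is zero
dissimilarity = similarity.max() - similarity
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissimilarity)
print(coords)   # one 2-D point per target
```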
Finding the distribution of c
Our real objective, though, is to understand the basic components of social structure. One quantity that seems important is the number of people whom people know. We call this c.

Network size ... "It's just one number"
From the first, Peter pushed us all to learn more about the basic quanta: How does network size vary, within and across cultures? What does the distribution look like? Our first estimate, in 1978, for average network size in the U.S. was 250. Peter: "You have to start somewhere." And what was that 250? It was the number of people on whom the people of Morgantown, West Virginia, who sat through that grueling, 8-hour experiment, could have called on to be first links if Milgram had shown up and asked them to participate in a small-world experiment.

Deriving c from an assumption
Let t be the size of a population, and let e be the size of some subpopulation within it. We assume that the fractional size p = e/t of that subpopulation also applies to any individual's network, other things being equal. That is, everyone's network in a society reflects the distribution of subpopulations in that society.

The scale-up method to estimate c
To test this, we ask a representative sample of people to tell us how many people they know in many subpopulations whose sizes are known: e.g., diabetics, gun dealers, postal workers, women named Nicole, men named Michael.

People answer accurately
Now, assuming that people can and do answer our question accurately, a maximum likelihood estimate of individual i's network size is

$c_i = t \,\frac{\sum_{j=1}^{L} m_{ij}}{\sum_{j=1}^{L} e_j}$

where there are L known subpopulations and $m_{ij}$ is the number of people individual i knows in subpopulation j. In words: network size is (the sum of all the people you say you know in some subpopulations of known size, divided by the total size of those subpopulations) times the population within which the subpopulations are embedded.
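To make the arithmetic concrete, here is a minimal sketch in Python with invented counts and subpopulation sizes; only the formula itself comes from the talk.

```python
# Scale-up estimate of one respondent's network size:
# c_i = t * (sum of people reported known in known subpopulations)
#         / (total size of those subpopulations).
# All counts and sizes below are invented for illustration.

t = 250_000_000  # size of the embedding population (rough U.S. figure)

known_sizes = {           # e_j: known subpopulation sizes (made up)
    "diabetics": 6_000_000,
    "postal workers": 700_000,
    "women named Nicole": 300_000,
}
reported = {              # m_ij: respondent i's reported counts (made up)
    "diabetics": 2,
    "postal workers": 1,
    "women named Nicole": 1,
}

c_i = t * sum(reported.values()) / sum(known_sizes.values())
print(round(c_i))  # this respondent's estimated network size
```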
The estimates of c are reliable
This doesn't deal with the big IF, but across 7 surveys in the U.S., average network size = 290 (sd 232, median 231). The 290 is not an average of averages. It's a repeated finding. And it's almost certainly not an artifact of the method.

Reliability I: In one survey, we estimated c by asking people how many people they know in each of 17 relation categories (people who are in their immediate family, people who are coworkers, people who provide a service, and so on) and summing. This summation method (due to Chris McCarty) produced a mean for c of 290.

Reliability II: Change the data. We changed reported values at or above 5 to a value of exactly 5. The mean dropped to 206, a change of 29%. We then set values of at least 5 to a uniformly distributed random value between 5 and 15, applying this random change only to large subpopulations (those with more than 1 million members). The mean increased to 402, a change of 38% in the opposite direction. (A small simulation of these perturbations is sketched at the end of this section.)

Reliability III: Survey clergy. We surveyed a national sample of 159 members of the clergy, people who are widely thought to have large networks. Mean c = 598 for the scale-up method; mean c = 948 for the summation method.

290 is not a coincidence
1. Two different methods of counting produce the same result.
2. Changing the data produces large changes in the results.
3. People who are widely thought to have large networks do have large networks.

Something is going on
This next slide shows the probability, for two of our surveys, of knowing no one in each of 29 populations of known size, by the actual size of those populations. (Under the model's random-mixing assumption, that probability should fall off with subpopulation size, roughly as $(1 - e/t)^c \approx e^{-ce/t}$.) The two distributions track, except for the expected offset.

The distribution of c
Here is the graph of the distribution of network size:

Reliability vs. validity
OK, it's reliable. But if the model works, we ought to be able to use it to estimate the size of populations whose sizes are not known. Create a maximum likelihood estimate for the size of an unknown subpopulation based on what all respondents tell us and our estimates of their network sizes. "Roughly speaking, inverting the previous formula": for an unknown subpopulation u, something like $e_u = t \,\frac{\sum_i m_{iu}}{\sum_i c_i}$, summing over respondents i (a sketch follows).
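Here is a minimal sketch of that inversion in Python, with invented numbers: total the reports of members of the unknown subpopulation across respondents, divide by the total of the respondents' estimated network sizes, and scale up by the embedding population.

```python
# Inverting the scale-up formula to estimate an unknown subpopulation's size:
# e_u = t * (sum over respondents of reported members of u)
#         / (sum over respondents of estimated network sizes c_i).
# All numbers are invented for illustration.

t = 250_000_000                 # embedding population
c = [290, 150, 410, 260]        # respondents' estimated network sizes
m_u = [1, 0, 2, 1]              # members of u each respondent reports knowing

e_u = t * sum(m_u) / sum(c)
print(round(e_u))               # estimated size of the unknown subpopulation
```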
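And here is the perturbation promised under Reliability II, again with invented data: cap every report of 5 or more at exactly 5, and, separately, replace reports of 5 or more with a uniform random draw between 5 and 15, but only for subpopulations with more than 1 million members. With the real survey data, the first change pushed the mean c down to 206 and the second pushed it up to 402.

```python
# The two Reliability II perturbations, applied to one respondent's
# (invented) reports; 'sizes' gives the matching subpopulation sizes.
import random

random.seed(0)
reports = [0, 1, 7, 3, 12, 5, 0, 9]
sizes = [2e6, 4e5, 3e6, 8e5, 6e6, 5e5, 1e6, 2e6]

# Perturbation 1: every value of 5 or more becomes exactly 5
capped = [min(m, 5) for m in reports]

# Perturbation 2: values of 5 or more become uniform on [5, 15],
# but only for subpopulations with more than 1 million members
randomized = [random.randint(5, 15) if m >= 5 and e > 1e6 else m
              for m, e in zip(reports, sizes)]

print(capped)      # feeds a smaller c into the scale-up formula
print(randomized)  # feeds a larger c for the big subpopulations
```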
Can we predict what we know?
Test this by "predicting" the size of 29 populations of known size. The overall result is encouraging: r = .79 ... but note the outliers.

Over- and under-estimation
The two largest populations, people who have a twin brother or sister and diabetics, are highly underestimated. Without these two outliers, the correlation rises from r = .79 to r = .94. "No cheating ..."

Stigma vs. not newsworthy
Being a twin or a diabetic is neither stigmatizing nor newsworthy. From Gene Shelley's work, we know that personal information about close co-workers or business associates can take a decade or more to be transmitted, and in the case of being a twin or a diabetic, may never be transmitted.

Another encouraging result
Charles Kadushin ran a national survey to estimate the prevalence of crimes in 14 cities, large and small, across the U.S. He asked 17,000 people to report the number of people they knew who had been victims of six kinds of crime and the number of people they knew who used heroin regularly. Here are the estimates for the number of heroin users in each of the 14 cities, along with the estimates from the Uniform Crime Reports (UCR). The fact that we track well with official estimates means only that we have a much, much less expensive way to get at these estimates, not that the estimates are correct. And estimates of other crimes in those 14 cities did not track so well.

Reliability, validity, and accuracy
So, while definitely reliable and perhaps valid, our estimate of network size (and its distribution) is not sufficiently accurate.

Compromising assumptions
1. Transmission effects: Everyone knows everything about everyone they know.
2. Barrier effects: Everyone in the population has an equal chance of knowing someone in any subpopulation.
For example, the correlation between the mean number of Native Americans known and the percent of the state population that is Native American is 0.58 (p = 0.0001).

Network social barriers
- Race (Blacks may know more diabetics than Whites do.)
- Gender (men may know more gun dealers than women do.)
- Even first names are associated with the barrier effect.
We address the barrier effect by using a random, nationally representative sample of respondents. However, using the method on specific populations may still lead to incorrect estimates.

The transmission effect
We asked people things about the people they knew ... and then called up those people to see how much people really do know about their network members.

Some things are easy to get right
- 99% know their alters' marital status.
- People correctly report the number of children for 89% of their alters.
- 98% know the employment status of their alters.

Some things are harder to know
- People say they know the state in which 70% of their alters were born, but only 57% of the reports (ego's and alter's) agree on this.
- People don't know the number of siblings their alters have 52% of the time.

Some people withdraw
Gene Shelley found that people who are HIV+ withdraw from their networks in order to limit the number of people who know their HIV status. Eugene Johnsen confirmed this by showing that HIV+ people have, on average, networks one-third the size of the global average.

A theory of transmission bias
Take another look at the comparison of the data from clergy and others: it's likely that you know at least one Christopher (the probability of knowing NO Christophers is close to zero), but twins are likely to be underreported. Peter said: assume that people report correctly what they know, but that what they know is incorrect. What would happen to the jaggedy curve if people responded honestly to correct information instead of honestly to incorrect information? The trick is to adjust the x-axis rather than the y-axis in the diagram. Suppose that widows don't tell half the people they know about their being a widow. The .013 on the x-axis remains the same, but the fraction that people would effectively be responding to would be .013/2 = .0065. To make the x-axis the effective size of that population, we slide the point to the left while its y-value remains the same.

"The jaggedy line would go"
Of course, we have no idea what the transmission error might be. We do know that if the numbers remain the same on the y-axis and we make up the effective sizes on the x-axis, the jaggedy line would go. Peter did this analytically and computed the predicted distribution of c. The next slide shows that we may be on the right track:

Peter's (highly) unusual place in the social sciences
Number of articles: 154, of which 43 are in social science journals. Total citations: 3194, of which 456 (14%) are in social science journals and 2738 (86%) are in non-social-science journals.

Overall standard category baselines, 1981-2005:

Code  Category description            Citations    Sources    Uncited   Mean
AGD   Agricultural Sciences           3026292      405216     110302    7.47
BID   Biology & Biochemistry          30591140     1289467    144133    23.72
CHD   Chemistry                       25653067     2160543    406171    11.87
CLD   Clinical Medicine               55909778     3637855    675163    15.37
CSD   Computer Science                881307       162037     62040     5.44
EVD   Ecology/Environment             4375528      369984     71045     11.83
ECD   Economics & Business            2037455      230987     70664     8.82
EDD   Education                       271748       64763      24699     4.2
EGD   Engineering                     6389607      1155040    405542    5.53
GED   Geosciences                     5532230      419742     90315     13.18
IMD   Immunology                      8335481      263920     19389     31.58
LAD   Law                             305586       46858      16145     6.52
MSD   Materials Science               3705565      537936     166000    6.89
MTD   Mathematics                     1820446      303592     90453     6
MCD   Microbiology                    7234059      362533     37267     19.95
MBD   Molecular Biology & Genetics    15207323     418987     37618     36.3
OTD   Multidisciplinary               2557670      279654     80925     9.15
NED   Neurosciences & Behavior        15246539     581369     51923     26.23
PMD   Pharmacology                    5444675      384245     50634     14.17
PHD   Physics                         20291158     1820458    407636    11.15
PLD   Plant & Animal Science          10260550     1042930    212907    9.84
PSD   Psychology/Psychiatry           6277431      439095     83162     14.3
SSD   Social Sciences, general        3015451      484739     152641    6.22
ASD   Space Science                   3469032      181566     23561     19.11

• http://garfield.library.upenn.edu/histcomp/killworth-pd_citing/ (http://tinyurl.com/nmhdc)
• http://garfield.library.upenn.edu/histcomp/killworth-pd_auth/ (http://tinyurl.com/ppr82)