Commentary

In defense of forensic social science

Amir Goldberg
Graduate School of Business, Stanford University, Stanford, CA, USA
Corresponding author: Amir Goldberg, Graduate School of Business, Stanford University, 655 Knight Way, Stanford, CA 94305, USA. Email: amirgo@stanford.edu

Big Data & Society, July–December 2015: 1–3. © The Author(s) 2015. DOI: 10.1177/2053951715601145

Abstract
Like the navigation tools that freed ancient sailors from the need to stay close to the shoreline—eventually affording the discovery of new worlds—Big Data might open us up to new sociological possibilities by freeing us from the shackles of hypothesis testing. But for that to happen we need forensic social science: the careful compilation of evidence from unstructured digital traces as a means to generate new theories.

Keywords
Forensic social science, hypothesis testing, Big Data, machine learning, abductive reasoning, statistics

In one of the most celebrated scenes of HBO's The Wire—American television's most sociologically perceptive masterpiece, camouflaged as a crime drama—detectives McNulty and Moreland revisit the scene of an old, unsolved homicide. The detectives are reluctantly forced by their sergeant to reexamine the crime scene, which they believe to be unrelated to the organized crime syndicate that their secret unit is trying to expose. The old police report suggests that the victim, a 20-year-old college girl, was shot during a romantic or drug-related dispute gone wrong. Despite their preconceptions, McNulty and Moreland—in a beautifully syncopated exchange of expletives—let the data speak for themselves. Tracing the gunshot's trajectory, they discover that the girl was, in fact, premeditatedly executed. They leave the cordoned-off apartment with a new theory about who killed Deirdre Kresson.

The idea that a detective would investigate a crime scene only to confirm a predetermined hunch about the identity of the murderer would seem abhorrent to many; it is dangerously susceptible to biased identification. In fact, the best crime thrillers enthrall us precisely because the murderer is never who we initially suspected her to be. But sociologists often round up the usual suspects. They enter metaphorical crime scenes every day, armed with strong and well-theorized hypotheses about who the murderer should, or at least plausibly might, be. It is not unlikely that many perpetrators are still walking free as a consequence. Consider them the sociological fugitives of hypothesis testing.

As anyone graduating from a sociology program in the last 50 years knows, hypothesis testing dominates the social sciences. This scientific paradigm has led to numerous achievements, and my purpose here is not to challenge or belittle them. Some have attacked, or defended, hypothesis testing on philosophical grounds—debates that were particularly vibrant a century ago but that have since mostly been forgotten. What remains today is a set of institutionalized practices that, beyond the cautionary "correlation is not causation" adage, are all too often followed on autopilot. The problem with hypothesis testing is not its epistemological foundations or its ontological validity; it is rather that, as a practice, it has become entirely taken for granted. In fact, even those of us who enter the proverbial crime scene without an idea of the likely killer often retrospectively tell the story of her capture as if we knew her identity all along. The normative pressure to pay homage to the objective authority of statistical significance testing is difficult to resist.
In our published articles we lay out a set of hypotheses, and then marshal the empirical evidence—mostly fetishized as stars (with any luck three!) hovering to the right of our coefficients—that, lo and behold, supports the hypothetical story we had just told. Rarely does one find a paper conceding that it stumbled on a finding while looking for a different answer. Even more difficult to publish is one reporting a null finding. Letting the data speak for themselves has almost become a sociological taboo.

Allow me to hazard a hypothesis—one that I will be unable to test or even meaningfully support, and that others have stated before with greater clarity, more nuance and depth, and better evidence (e.g. Latour and Woolgar, 1979)—as to why data-driven research is normatively frowned upon in sociology. Hypothesis testing's ascendance in the social sciences coincided with the emergence and establishment of statistical modeling, and the tools that it entails. Two methods of research became particularly prominent: surveys and laboratory experiments. Both are costly and time-consuming, and require a significant investment in infrastructure and personnel. Such an upfront investment makes exploratory research potentially wasteful and therefore highly risky. In other words, practical considerations made it necessary for researchers in the social sciences—especially sociology, which was never among the best-funded disciplines, even compared to its impoverished social science siblings—to come into their crime scenes knowing exactly who they are looking for. Theoretical development precedes data collection because researchers can only afford to collect the data that they know, with a high level of certainty, will inform their theories. This material necessity required ideological justification. And it comes at the price of overlooking unusual suspects (Evans, 2010). Researchers in leading universities with access to resources and funding were particularly invested in the ideological project of celebrating hypothesis testing, because it reinforced their material advantages, and also provided the coveted scientific legitimacy that distanced them from the humanities.

By the 1980s, this practice was under concerted attack by critical theorists, mostly outside leading Anglo-Saxon universities. But while the postmodern onslaught ravaged anthropology, sociological positivism emerged as healthy as ever. If anything, encouraged by the publicized successes of the endogeneity police in economics and other influential parts of the social scientific cosmos, recent sociological work published in leading American journals has become ever more empirical in focus, and more attuned to problems of causal identification, measurement, and robustness.
In the process, sociological grand theorizing of the kind that characterized American sociology in the postwar era has waned, and seems to be in danger of extinction (Lizardo, 2014). Ironically, it appears that theoretically informed hypothesis testing can lead to the narrowing of one's sociological imagination.

Might Big Data harbor the potential for a theoretical revival? This suggestion may seem peculiar at first. If anything, Big Data invokes an image of mechanistic and atheoretical pattern analysis. But with the pain of drowning in an ocean of amorphous data also comes the liberation of unshackling oneself from the blinders of one's limited imagination: we no longer need to come to the crime scene with an idea about the identity of the killer. She is hiding somewhere in between the heaps of unstructured data. At last, we are free to find her. Thus, the analytical focus shifts from thinking about the most cost-effective data that one needs to collect in order to support, or refute, a hypothesis, to figuring out how to structure a mountain of data into meaningful categories of knowledge.

Big Data is both a blessing and a curse. In fact, Big Data is nothing more than big data without the conceptual and practical tools for carving meaningful structures out of it. One way of doing that is to imagine, and consequently parse, the variables worth exploring in the data. But such an endeavor falls into the same trap created by hypothesis testing: it necessitates assuming in advance the possible dimensions of variability. The crime scene conundrum, in other words, is not so much a result of having to assume the identity of the killer in advance, but rather of having to imagine all possible candidates. Such an approach necessarily precludes the possibility of identifying a suspect that we could not have imagined a priori.

We all suffer from myopia: we cannot see what is beyond our field of vision. Categorical myopia is perhaps the biggest problem plaguing traditional survey-based research. We can only collect information about the categories of data that we assume, ahead of time, are meaningful to the domain of study in question. By constructing a survey, the researcher necessarily imposes her schematic understanding of the world on her subjects. The General Social Survey (GSS), for example, occasionally asks respondents, among other things, how much they like musical genres such as rock or classical music. But what if "rock" implies different things to different respondents? Or even worse, what if different respondents use different categorical prisms to structure music into distinct buckets of meaning?

That was precisely what I discovered when I applied Relational Class Analysis (RCA) to GSS data on Americans' musical tastes (Goldberg, 2011). RCA induces groups of respondents whose opinions follow similar patterns, as a means to infer their underlying schematic organization of the domain. What I found was that whereas some Americans think of rock through the distinction between highbrow and lowbrow music, others interpret it as a symbol of cosmopolitanism, standing in opposition to traditional Americana. This implies not only that people have competing understandings of the social symbolism of this musical genre, but also that they probably draw categorical boundaries in different ways altogether.
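To make the idea of inducing such classes concrete, here is a minimal, illustrative sketch in Python. It is not the published RCA implementation; it merely approximates the intuition by measuring how similarly two respondents pattern their answers (treating mirror-image opinions as schematically alike, roughly in the spirit of RCA) and then cutting the resulting similarity structure into candidate classes. The toy ratings matrix is a hypothetical stand-in for GSS-style taste items.

# Minimal, illustrative sketch (not the published RCA algorithm).
# `ratings` is a respondents-by-genres matrix of taste scores; a random
# toy matrix stands in for GSS-style items.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(200, 15)).astype(float)

# Schematic similarity: two respondents are alike if their answers pattern
# together, whether the association is positive or negative (mirror-image
# opinion profiles are treated as sharing the same underlying schema).
z = (ratings - ratings.mean(axis=1, keepdims=True)) / (ratings.std(axis=1, keepdims=True) + 1e-9)
similarity = np.abs(z @ z.T) / ratings.shape[1]

# Turn similarities into distances and cut the hierarchy into candidate
# classes; interpreting what each class means remains the researcher's job.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)
condensed = distance[np.triu_indices_from(distance, k=1)]
classes = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(np.bincount(classes)[1:])  # number of respondents assigned to each class

With real survey data, the induced classes, rather than categories the analyst imagined in advance, become the objects of interpretation.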
This kind of inductive exploration is what computer scientists refer to as machine learning: the algorithmic process of finding patterns of relationships in unstructured data. Classification is central to machine learning. It is the iterative procedure by which categories of objects are inductively discovered, and objects are subsequently assigned to them. Machine learning classifiers, in essence, let the data "speak for themselves." They parse the data into categories that the researcher, presumably, could not have imagined. It then becomes the researcher's responsibility to assign meaningful interpretations to them.

Of course, data cannot really speak for themselves. The statistical tools with which they are delineated into separate entities are ultimately rhetorical devices disguised as objective representations of reality. Data mining techniques are never purely objective. They embody assumptions about the world, and necessarily privilege certain vantage points over others. Data do not speak for themselves because their story invariably needs to be narrated. It would therefore be naïve to presume that Big Data would necessarily free us from the tyranny of our shortsighted sociological imagination. But the painstaking search for a needle in a haystack is better than having to know exactly where to look for it in advance.

As an emergent field, "Big Data" is still a cacophonous set of data processing, visualization, and pattern recognition techniques. This methodological pluralism affords the possibility of stumbling on exciting and groundbreaking discoveries. It is precisely this potential for surprise that makes Big Data, and the machine learning tools used to excavate and refine it, an opportunity for theoretical discovery. Surprise is the cornerstone of abductive reasoning: the catalysis of new theoretical insights through the discovery of unexpected empirical evidence. Because abduction transcends the simplistic dichotomy between theory-building and empirical research, many have hailed it as a promising pathway for conceptual innovation in the social sciences (e.g. Timmermans and Tavory, 2014). Novel discovery necessitates the availability of rich, multidimensional and granular data; Big Data is uniquely positioned to facilitate such exploration.

This is not to suggest that "Big Data"—as a research strategy—constitutes a paradigmatic shift, or that it makes hypothesis testing useless or obsolete. If anything, many researchers operating squarely within the tradition of hypothesis testing probably follow an abductive process of theoretical recalibration through empirical discovery; but this is mostly done out of sight (Van Maanen et al., 2007). Big Data might help legitimate making this practice visible, and open to scrutiny.

Like the navigation tools that freed ancient sailors from the need to stay close to the shoreline—eventually affording the discovery of new worlds—Big Data might open us up to new sociological possibilities. But for that to happen we need forensic social science: the careful compilation of evidence from unstructured digital traces. Like McNulty and Moreland, we might find ourselves leaving the crime scene not only with new ideas about what crimes had been committed, but also with theories about who committed them.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

References
Evans JA (2010) Industry induces academic science to know less about more. American Journal of Sociology 116(2): 389–425.
Goldberg A (2011) Mapping shared understandings using relational class analysis: The case of the cultural omnivore reexamined. American Journal of Sociology 116(5): 1397–1436.
Latour B and Woolgar S (1979) Laboratory Life: The Construction of Scientific Facts. Beverly Hills, CA: Sage Publications.
Lizardo O (2014) The end of theorists: The relevance, opportunities, and pitfalls of theorizing in sociology today. Pamphlet based on the Lewis Coser Memorial Lecture, delivered at the 2014 Annual Meeting of the American Sociological Association in San Francisco.
Timmermans S and Tavory I (2014) Theory construction in qualitative research: From grounded theory to abductive analysis. Sociological Theory 30(3): 167–186.
Van Maanen J, Sørensen J and Mitchell TR (2007) Introduction to special topic forum: The interplay between theory and method. The Academy of Management Review 32(4): 1145–1154.

This article is part of a special theme on Colloquium: Assumptions of Sociality. A full list of all articles in this special theme is available at: http://bds.sagepub.com/content/colloquium-assumptions-sociality.