Commentary
In defense of forensic social science
Big Data & Society
July–December 2015: 1–3
© The Author(s) 2015
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/2053951715601145
bds.sagepub.com
Amir Goldberg
Graduate School of Business, Stanford University, Stanford, CA, USA
Corresponding author: Amir Goldberg, Graduate School of Business, Stanford University, 655 Knight Way, Stanford, CA 94305, USA. Email: amirgo@stanford.edu
Abstract
Like the navigation tools that freed ancient sailors from the need to stay close to the shoreline—eventually affording the
discovery of new worlds—Big Data might open us up to new sociological possibilities by freeing us from the shackles of
hypothesis testing. But for that to happen we need forensic social science: the careful compilation of evidence from
unstructured digital traces as a means to generate new theories.
Keywords
Forensic social science, hypothesis testing, Big Data, machine learning, abductive reasoning, statistics
In one of the most celebrated scenes of HBO’s
The Wire—American television’s most sociologically
perceptive masterpiece, camouflaged as a crime
drama—detectives McNulty and Moreland revisit the
scene of an old, unsolved homicide. The detectives are
reluctantly forced by their sergeant to reexamine the
crime scene, which they believe to be unrelated to the
organized crime syndicate that their secret unit is trying
to expose. The old police report suggests that the victim,
a 20-year-old college girl, was shot during a romantic or
drug-related dispute gone wrong. Despite their preconceptions, McNulty and Moreland—in a beautifully syncopated exchange of expletives—let the data speak for
themselves. Tracing the gunshot’s trajectory, they discover that the girl was, in fact, premeditatedly executed.
They leave the cordoned-off apartment with a new theory
about who killed Deirdre Kresson.
The idea that a detective would investigate a crime
scene only to confirm a predetermined hunch about the
identity of the murderer would seem abhorrent to many;
it is dangerously susceptible to biased identification. In
fact, the best crime thrillers enthrall us precisely because
the murderer is never who we initially suspected her to be.
But sociologists often round up the usual suspects. They
enter metaphorical crime scenes every day, armed with
strong and well-theorized hypotheses about who the
murderer should, or at least plausibly might be. It is
not unlikely that many perpetrators are still walking
free as a consequence. Consider them the sociological
fugitives of hypothesis testing.
As anyone graduating from a sociology program in
the last 50 years knows, hypothesis testing dominates the
social sciences. This scientific paradigm has led to numerous achievements, and my purpose here is not to challenge or belittle them. Some have attacked, or defended,
hypothesis testing on philosophical grounds—debates
that were particularly vibrant a century ago but that
have since mostly been forgotten. What remains today
are a set of institutionalized practices that, beyond the
cautionary "correlation is not causation" adage, are all
too often followed on autopilot mode.
The problem with hypothesis testing is not its epistemological foundations, or its ontological validity, but
rather that, as a practice, it has become entirely taken
for granted. In fact, even those of us who enter the proverbial crime scene without an idea of the likely killer
often retrospectively tell the story of her capture as if
we knew her identity all along. The normative pressure
to pay homage to the objective authority of statistical
significance testing is difficult to resist. In our published
articles we lay out a set of hypotheses, and then marshal
the empirical evidence—mostly fetishized as stars (with
any luck three!) hovering to the right of our
coefficients—that, lo and behold, supports the hypothetical story we had just told. Rarely does one find a
paper conceding that it stumbled on a finding while looking for a different answer. Even more difficult to publish
is one reporting a null finding. Letting the data speak for
themselves has almost become a sociological taboo.
Allow me to hazard a hypothesis—one that I will be
unable to test or even meaningfully support, and that
others have stated before with greater clarity, more
nuance and depth, and better evidence (e.g. Latour and
Woolgar, 1979)—as to why data-driven research is normatively frowned upon in sociology. Hypothesis testing’s ascendance in the social sciences coincided with
the emergence and establishment of statistical modeling,
and the tools that it entails. Two methods of research
became particularly prominent: surveys and laboratory
experiments. Both are costly and time-consuming, and
require a significant investment in infrastructure and personnel. Such an upfront investment makes exploratory
research potentially wasteful and therefore highly risky.
In other words, practical considerations made it
necessary for researchers in the social sciences—especially sociology, which was never among the best funded
disciplines, even compared to its impoverished social
science siblings—to come into their crime scenes
knowing exactly who they are looking for.
Theoretical development precedes data collection
because researchers can only afford to collect the
data that they know, with a high level of certainty,
will inform their theories. This material necessity
required ideological justification. And it comes at the
price of overlooking unusual suspects (Evans, 2010).
Researchers in leading universities with access to
resources and funding were particularly invested in the
ideological project of celebrating hypothesis testing,
because it reinforced their material advantages, and also
provided the coveted scientific legitimacy that distanced
them from the humanities. By the 1980s, this practice was
under concerted attack by critical theorists, mostly outside
leading Anglo-Saxon universities. But while the postmodern onslaught ravaged anthropology, sociological positivism emerged as healthy as ever. If anything, encouraged by
the publicized successes of the endogeneity police in economics and other influential parts of the social scientific
cosmos, recent sociological work published in leading
American journals has become ever more empirical in
focus, and more attuned to problems of causal identification, measurement, and robustness. In the process, sociological grand theorizing of the kind that characterized
American sociology in the postwar era has waned, and
seems to be in danger of extinction (Lizardo, 2014).
Ironically, it appears that theoretically informed
hypothesis testing can lead to the narrowing of one’s
sociological imagination. Might Big Data harbor the
potential for a theoretical revival?
This suggestion may seem peculiar at first. If
anything, Big Data invokes an image of mechanistic
and a-theoretical pattern analysis. But with the pain of
drowning in an ocean of amorphous data also comes the
liberation of unshackling oneself from the blinders of
one’s limited imagination: we no longer need to come
to the crime scene with an idea about the identity of the
killer. She is hiding somewhere in between the heaps of
unstructured data. At last, we are free to find her. Thus,
the analytical focus shifts from thinking about the most
cost-effective data that one needs to collect in order to
support, or refute, a hypothesis, to figuring out how to
structure a mountain of data into meaningful categories
of knowledge. Big Data is both a blessing and a curse.
In fact, Big Data is nothing more than big data without
the conceptual and practical tools for carving meaningful
structures out of it. One way of doing that is to imagine,
and consequently parse, the variables worth exploring in
the data. But such an endeavor falls into the same trap
created by hypothesis testing: it necessitates assuming in
advance the possible dimensions of variability. The crime
scene conundrum, in other words, is not so much a result
of having to assume the identity of the killer in advance,
but rather of having to imagine all possible candidates.
Such an approach necessarily precludes the possibility of
identifying a suspect that we could not have imagined a priori. We all suffer from myopia: we cannot see what is
beyond our field of vision.
Categorical myopia is perhaps the biggest problem
plaguing traditional survey-based research. We can only
collect information about the categories of data that we
assume, ahead of time, are meaningful to the domain of
study in question. By constructing a survey, the researcher
necessarily imposes her schematic understanding of the
world on her subjects. The General Social Survey (GSS),
for example, occasionally asks respondents, among other
things, how much they like musical genres such as rock or
classical music. But what if "rock" implies different things
to different respondents? Or even worse, what if different
respondents use different categorical prisms to structure
music into distinct buckets of meaning?
That was precisely what I discovered when I applied
Relational Class Analysis (RCA) to GSS data on
Americans’ musical tastes (Goldberg, 2011). RCA
induces groups of respondents whose opinions follow
similar patterns, as a means to infer their underlying
schematic organization of the domain. What I found
was that whereas some Americans think of rock
through the distinction between high and low brow
music, others interpret it as a symbol of cosmopolitanism, standing in opposition to traditional Americana.
This implies not only that people have competing
understandings of the social symbolism of this musical
genre, but also, in all likelihood, that they draw categorical boundaries in different ways altogether.
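For readers curious what this kind of induction looks like in practice, the sketch below conveys the core intuition behind RCA: score how similarly two respondents pattern their answers relative to one another, regardless of whether they actually like the same genres, and then group respondents whose patterns align. This is a simplified illustration rather than the published algorithm; the relationality formula is abbreviated, the function names are mine, and spectral clustering stands in for RCA's graph-partitioning step.

```python
# Illustrative sketch of the relationality idea behind RCA (Goldberg, 2011).
# Function names and the final clustering step are simplifications,
# not the published algorithm.
import numpy as np
from itertools import combinations
from sklearn.cluster import SpectralClustering

def relationality(x, y, scale_range=4):
    """Schematic similarity of two respondents' rating vectors.

    For every pair of survey items, compare the (normalized) difference each
    respondent draws between the two items: same sign counts as agreement,
    opposite sign as disagreement, weighted by how close the magnitudes are.
    """
    total, n_pairs = 0.0, 0
    for k, l in combinations(range(len(x)), 2):
        d_x = (x[k] - x[l]) / scale_range
        d_y = (y[k] - y[l]) / scale_range
        sign = 1.0 if d_x * d_y >= 0 else -1.0
        total += sign * (1.0 - abs(abs(d_x) - abs(d_y)))
        n_pairs += 1
    return total / n_pairs

def rca_classes(ratings, scale_range=4, n_classes=2):
    """Group respondents by shared schematic structure (simplified RCA)."""
    n = ratings.shape[0]
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            R[i, j] = R[j, i] = relationality(ratings[i], ratings[j], scale_range)
    affinity = np.abs(R)      # mirror-image patterns share the same schema
    np.fill_diagonal(affinity, 1.0)
    model = SpectralClustering(n_clusters=n_classes, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(affinity)

# Toy usage: rows are respondents, columns are genres rated on a 1-5 scale.
ratings = np.array([[5, 1, 4, 2],
                    [4, 2, 5, 1],
                    [1, 5, 2, 4],
                    [2, 5, 1, 5],
                    [3, 3, 3, 3]])
print(rca_classes(ratings))
```

Respondents with mirror-image answer patterns receive a strongly negative relationality; taking absolute values treats them as organizing the domain through the same schema, so the induced classes differ in how they structure the genres rather than in how much they like them.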
This kind of inductive exploration is what computer
scientists refer to as machine learning: the algorithmic
process of finding patterns of relationships in unstructured data. Unsupervised classification, or clustering, is central to machine learning.
It is the iterative procedure by which categories of
objects are inductively discovered, and objects are subsequently assigned to them. Such clustering algorithms, in essence, let the data "speak for themselves."
They parse the data into categories that the researcher,
presumably, could not have imagined. It then becomes
the researcher’s responsibility to assign meaningful
interpretations to them.
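To make this workflow concrete, the sketch below induces categories from a handful of unstructured text snippets using off-the-shelf tools, TF-IDF features and k-means clustering. The toy corpus, the choice of two clusters, and the specific libraries are illustrative assumptions on my part rather than a method prescribed here; the point is simply that the categories are induced from the data and still have to be named by the researcher.

```python
# Minimal sketch of inductive category discovery on unstructured text.
# The corpus and cluster count are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "loud electric guitar and heavy drums",
    "electric guitar solo over pounding drums",
    "string quartet playing violin and cello",
    "orchestral violin and cello concerto",
]

# Turn raw text into a document-term matrix; no categories are imposed up front.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Inductively discover groupings and assign each document to one of them.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The algorithm only returns numbered clusters; naming them is interpretive work.
terms = vectorizer.get_feature_names_out()
for c in range(2):
    top = [terms[i] for i in kmeans.cluster_centers_[c].argsort()[::-1][:3]]
    print(f"cluster {c}: top terms {top}")
```

On this toy corpus the guitar-and-drums snippets should separate from the strings-and-concerto ones, but deciding to call one cluster "rock" and the other "classical" remains the researcher's interpretive act.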
Of course, data cannot really speak for themselves.
The statistical tools with which they are delineated into
separate entities are ultimately rhetorical devices disguised as objective representations of reality. Data
mining techniques are never purely objective. They
embody assumptions about the world, and necessarily
privilege certain vantage points over others. Data do
not speak for themselves because their story invariably
needs to be narrated. It would therefore be naïve to
presume that Big Data would necessarily free us from
the tyranny of our shortsighted sociological imagination. But the painstaking search for a needle in a haystack is better than having to know exactly where to
look for it in advance. As an emergent field, "Big Data"
is still a cacophonous set of data processing, visualization, and pattern recognition techniques. This methodological pluralism affords the possibility of
stumbling on exciting and groundbreaking discoveries.
It is precisely this potential for surprise that makes
Big Data, and the machine learning tools used to excavate and refine it, an opportunity for theoretical discovery. Surprise is the cornerstone of abductive reasoning:
the catalysis of new theoretical insights through discovery of unexpected empirical evidence. Because abduction transcends the simplistic dichotomy between
theory-building and empirical research, many have
hailed it as a promising pathway for conceptual innovation in the social sciences (e.g. Timmermans and
Tavory, 2014). Novel discovery necessitates the availability of rich, multidimensional and granular data; Big
Data is uniquely positioned to facilitate such exploration. This is not to suggest that "Big Data"—as a
research strategy—constitutes a paradigmatic shift, or
that it makes hypothesis testing useless or obsolete. If
anything, many researchers operating squarely within
the tradition of hypothesis testing probably follow an
abductive process of theoretical recalibration through
empirical discovery; but this is mostly done out of sight
(Van Maanen et al., 2007). Big Data might help legitimate making this practice visible, and open to scrutiny.
Like the navigation tools that freed ancient sailors
from the need to stay close to the shoreline—eventually
affording the discovery of new worlds—Big Data might
open us up to new sociological possibilities.
But for that to happen we need forensic social science:
the careful compilation of evidence from unstructured
digital traces. Like McNulty and Moreland, we might
find ourselves leaving the crime scene not only with new
ideas about what crimes had been committed, but also
with theories about who committed them.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with
respect to the research, authorship, and/or publication of this
article.
Funding
The author(s) received no financial support for the research,
authorship, and/or publication of this article.
References
Evans JA (2010) Industry induces academic science to know
less about more. American Journal of Sociology 116(2):
389–425.
Goldberg A (2011) Mapping shared understandings using
relational class analysis: The case of the cultural omnivore
reexamined. American Journal of Sociology 116(5):
1397–1436.
Latour B and Woolgar S (1979) Laboratory Life: The
Construction of Scientific Facts. Beverly Hills, CA: Sage
Publications.
Lizardo O (2014) The end of theorists: The relevance, opportunities, and pitfalls of theorizing in sociology today.
Pamphlet based on the Lewis Coser Memorial Lecture,
delivered at the 2014 Annual Meeting of the American
Sociological Association in San Francisco.
Timmermans S and Tavory I (2014) Theory construction in qualitative research: From grounded theory to abductive analysis. Sociological Theory 30(3): 167–186.
Van Maanen J, Sørensen J and Mitchell TR (2007)
Introduction to special topic forum: The interplay between
theory and method. The Academy of Management Review
32(4): 1145–1154.
This article is part of a special theme on Colloquium: Assumptions of Sociality. To see a full list of all articles in
this special theme, please click here: http://bds.sagepub.com/content/colloquium-assumptions-sociality.