Article
Administration & Society
2017, Vol. 49(7) 1043–1064
© The Author(s) 2014
DOI: 10.1177/0095399714555751
journals.sagepub.com/home/aas
Big Data in the Public
Sector: Lessons for
Practitioners and
Scholars
Kevin C. Desouza1 and Benoy Jacob2
Abstract
In this essay, we consider the role of Big Data in the public sector. Motivating
our work is the recognition that Big Data is still in its infancy and many
important questions regarding the true value of Big Data remain unanswered.
The question we consider is as follows: What are the limits, or potential, of
Big Data in the public sector? By reviewing the literature and summarizing
insights from a series of interviews from public sector Chief Information
Officers (CIOs), we offer a scholarly foundation for both practitioners and
researchers interested in understanding Big Data in the public sector.
Keywords
big data, public organizations, public management, policy analysis
The amount of data in our world has been exploding, and analyzing large
datasets—so-called Big Data—will become a key basis of competition,
underpinning new waves of productivity growth, innovation, and consumer
surplus.
—McKinsey Global Institute (2010)
1Arizona State University, Phoenix, AZ, USA
2University of Colorado, Denver, CO, USA
Corresponding Author:
Kevin C. Desouza, Arizona State University, 411 N. Central Ave., M/C 3520, Suite #750,
Phoenix, AZ 85004-0685, USA.
Email: kev.desouza@gmail.com
The era of Big Data has begun. Computer scientists, physicists, economists,
mathematicians, political scientists, bit-informaticists, sociologists, and many
others are clamoring for access to the massive quantities of information
produced by and about people, things and their interactions.
—Boyd and Crawford (2012)
Big Data is indeed a Big Deal.
—Dr. John Holdren (2012; Director of the White House Office of Science and
Technology Policy)
Introduction
As suggested in the introductory quotes, there is an increasingly popular perception that Big Data holds vast potential for improving the decision-making
processes of both public and private organizations. In the hopes of solving
previously intractable problems, scholars, analysts, and entrepreneurs, from a
wide range of fields, are actively pursuing novel approaches to mining the
digital traces and deposits of data that comprise Big Data (Boyd & Crawford,
2012). Against this emerging backdrop, policymakers, public managers, and
citizens have started to consider the ways in which Big Data can be used to
improve public sector outcomes, that is, public policies, programs, and democratic processes.
Several Big Data initiatives have recently emerged in the public sector.
For example, in March of 2012, the Obama Administration put forward the
Big Data Research and Development Initiative. The objective of this initiative was to understand the “technologies needed to manipulate and mine massive amounts of information; apply that knowledge to other scientific fields
as well as address the national goals in the areas of health, energy, defense,
education and research” (Mervis, 2012, p. 22). Big Data efforts have also
been initiated at other levels of government. For example, a host of municipalities have created “open data platforms” and held “civic hackathons” to
engage citizens with public data. Several novel applications have emerged
from these efforts addressing a wide range of local issues such as providing
information on blighted properties, identifying local resources for underserved citizens, and helping parents access information on local schools.1
Simply stated, it appears that Big Data can, indeed, provide the public sector with “a powerful arsenal of strategies and techniques for boosting productivity and achieving higher levels of efficiency and effectiveness” (Manyika
et al., 2011, p. 54). That said, Big Data is still in its infancy and many important questions regarding the true value of Big Data remain unanswered (Boyd
& Crawford, 2012; Desouza, 2014). Indeed, observers have noted that Big
Data solutions are being promoted as a way to address public issues but with
little consideration of how, where, and when they are most likely to be successful.2 Thus, it appears that there is at least some tension between the promise
of Big Data and the reality.3 The question at hand then is, “What are the limits, or potential, of Big Data in the public sector?”
In this article, we begin to address this question by reviewing the nascent
Big Data literature as it pertains to the management of public organizations.
The insights we draw are further informed by findings from a recent survey
of Chief Information Officers (CIOs) in different public organizations.4 As
such, we provide a scholarly foundation for both practitioners and researchers interested in understanding Big Data in the public sector.
Following this introduction, our article is organized into five sections. The
next four sections summarize key themes from the Big Data literature and
consider the implications of each for public organizations; in particular, the
bounds of Big Data, governance and privacy, decision-making, and “the end
of theory.” The final section offers a short summary of lessons for practitioners and some thoughts on potential research directions for scholars.
The Bounds of Big Data
As a relatively new phenomenon, much of the literature on Big Data focuses
on defining the “bounds” of Big Data. That is, what is Big Data and what
does it mean to operate in a Big Data environment. Despite the ubiquity of
the term, Big Data is a difficult term to define (Franks, 2012; Laney, 2001;
Manyika et al., 2011). There is, however, some consensus among scholars
and practitioners that four factors characterize Big Data—volume, velocity,
variety, and complexity.5 More than just simple semantics, these characteristics have potentially important implications for management practices.
Big Data is just too big for us. Where . . . big data begin[s] and end[s] is not
known. I have been struggling to identify digestible bites for us to take to move
on big data . . ..
First, at its core, Big Data must clearly be “big.” Big Data datasets are
“beyond the ability of typical database software tools to capture, store, manage and analyze” (Franks, 2012, p. 4).6 Thus, Big Data, in terms of volume,
is a function of the underlying and pre-existing capacity of an organization to
collect, store, and analyze its data. This definition suggests that Big Data is,
in terms of volume, a moving target. For example, household demographics
that were once difficult to manage now “fit on a thumb drive and can be analyzed by a low-end laptop” (Franks, 2012, p. 24).7
The second defining characteristic of Big Data is its velocity. This refers
to the speed at which data are being created and stored, and their associated
rates of retrieval (Kaisler, Armour, Espinosa, & Money, 2013). Much like the
volume of data, however, there is no established benchmark by which to consider when data velocity meets a Big Data threshold. Rather, the salient issue
is that the data are being created at historically fast rates. An example of the
current velocity of data is provided by Mayer-Schonberger and Cukier
(2013):
Google processes more than 24 petabytes of data per day, a volume that is
thousands of times the quantity of all printed material in the U.S. Library of
Congress. Facebook, a company that didn’t exist a decade ago, gets more than
10 million new photos uploaded every hour. Facebook members click a “like”
button or leave a comment nearly three billion times a day, creating a digital
trail that the company can mine to learn about users’ preferences. Meanwhile,
the 800 million monthly users of Google’s YouTube service upload over an
hour of video every second. The number of messages on Twitter grows at
around 200 percent a year and by 2012 exceeded 400 million tweets a day.
(p. 8)
A third defining characteristic of Big Data is its variety. Big Data comprises data in a wide range of forms, including text, images, and videos.
Generally speaking, then, it will include data that are structured, semi-structured, or unstructured. Structured data refer to data that have an organized
structure and are, thus, clearly identifiable. A simple example would be a
database with specific information that is stored in columns and rows. Semi-structured data do not conform to a formal structure per se; however, they contain "tags" that help separate the data records or fields. For example, data in
many bibliographic software programs reflect semi-structured data. That is,
the file is composed of records, but the structure is not regular in the sense
that fields may be missing or consist of more "open" formats, such as
a "Notes" section. Finally, unstructured data, as the name implies, have no
identifiable structure. Examples of unstructured data include text messages, photos, videos, and audio files.
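To make the distinction concrete, the following sketch contrasts the three forms using hypothetical records; the field names and values are purely illustrative and are not drawn from any particular public dataset.

```python
# Illustrative only: hypothetical records showing the three forms of data
# discussed above (structured, semi-structured, and unstructured).
import csv
import io
import json

# Structured: rows and columns with a fixed schema, as in a relational table.
structured = io.StringIO("parcel_id,zoned_occupancy,year_built\n1001,4,1962\n1002,6,1987\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged records (here, JSON) in which fields may be missing
# or hold free-form content such as a "notes" section.
semi_structured = json.loads(
    '{"record_id": "R-17", "author": "Smith, J.", "notes": "See also the 2011 appendix."}'
)

# Unstructured: raw text (or images, audio, video) with no identifiable schema.
unstructured = "Caller reports overcrowding on the third floor; follow up next week."

print(rows[0]["zoned_occupancy"])    # structured fields are directly addressable
print(semi_structured.get("title"))  # a semi-structured field may simply be absent -> None
print(len(unstructured.split()))     # unstructured data must be parsed or mined first
```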
As datasets become increasingly "complex"—from structured to unstructured—the processing and analytical capabilities required to collect, manage,
and analyze the data increase significantly. Thus, a better understanding of
the defining characteristics of an organization's real, or potential, data repositories offers insights into the level and types of investments needed.
The final defining characteristic of Big Data is its complexity—the degree
to which the data are interconnected. Many of the novel applications and/or
insights that have emerged from Big Data applications are a result of
connecting otherwise unrelated datasets.

Figure 1. Data continuum: volume, velocity, variety, and complexity. (The figure arrays illustrative datasets along the continuum: "small data" that is low on all four dimensions, such as land-use data for a small city; high-volume data such as census data; high-volume, high-velocity data such as Twitter data or video feeds; high-volume, high-velocity, high-variety datasets with different structures; and, finally, "Big Data" that is high on all four dimensions, such as linked datasets with different structures.)

One oft-cited example is the joint
effort between Google and the Center for Disease Control (CDC). In this
case, Google was able to connect their database of “search terms”—entered
in their search engine—such as “cold medicine” and “flu symptoms,” with
the CDC’s data on the H1N1 virus. In doing so, analysts were able to predict
the spread of the H1N1 virus by “connecting” two previously disconnected
datasets.8
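A minimal sketch of this kind of dataset "connection" is shown below, using invented weekly figures; it is not Google's or the CDC's actual method, only an illustration of joining two previously separate datasets on a shared key and checking for an association.

```python
# A minimal sketch of "connecting" two previously separate datasets, in the
# spirit of the search-term/flu example above. All figures are invented.
import pandas as pd

searches = pd.DataFrame({
    "week": ["2013-01", "2013-02", "2013-03", "2013-04"],
    "flu_search_volume": [120, 340, 510, 280],
})
cdc_cases = pd.DataFrame({
    "week": ["2013-01", "2013-02", "2013-03", "2013-04"],
    "reported_flu_cases": [900, 2400, 3900, 2100],
})

# Join the two datasets on a shared key, then look for a simple association.
combined = searches.merge(cdc_cases, on="week")
print(combined["flu_search_volume"].corr(combined["reported_flu_cases"]))
```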
Given that many public organizations are in the nascent stages of implementing data-based decision processes, the characteristics of an organization’s data provide a simple framework for assessing the potential data needs
and subsequent investments of the organization. More precisely, if we consider the data environment as varying along a data continuum (Figure 1)9—
whereby Big Data reflects one extreme—in which the volume, velocity,
variety, and complexity of the data are very high—we can better appreciate
the potential data infrastructure and subsequent investment that an organization requires.
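As a rough illustration of such an assessment, the sketch below profiles a dataset along the four dimensions and places it on the continuum; the thresholds and field names are hypothetical and would need to be set by each organization.

```python
# A rough sketch of profiling a dataset along the four dimensions discussed
# above. The thresholds are hypothetical; each organization would set its own.
from dataclasses import dataclass

@dataclass
class DataProfile:
    volume_gb: float        # how much data is held
    records_per_day: int    # how quickly new data arrive (velocity)
    n_formats: int          # structured, semi-structured, unstructured (variety)
    n_linked_sources: int   # how interconnected the data are (complexity)

def continuum_position(p: DataProfile) -> str:
    """Place a profile on the small-data-to-Big-Data continuum using illustrative cut-offs."""
    high = [p.volume_gb > 1_000, p.records_per_day > 100_000,
            p.n_formats > 1, p.n_linked_sources > 1]
    labels = ["small data", "bigger data", "bigger data", "approaching Big Data", "Big Data"]
    return labels[sum(high)]

print(continuum_position(DataProfile(5, 200, 1, 0)))          # e.g., land-use data for a small city
print(continuum_position(DataProfile(5_000, 500_000, 3, 4)))  # e.g., linked, multi-format live feeds
```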
First, consider organizations at the “small data” end of the data spectrum—where data are characterized as “low” in terms of volume, velocity,
variety, and complexity. Such organizations must think about data differently
than their “bigger data” counterparts. More precisely, at the low-end of the
data spectrum—by definition—the organization is unlikely to have a great
deal of data to begin with (low volume and low velocity). Moreover, relevant
data are unlikely to be accessible through complementary organizations (low
complexity). Such an environment is not unusual for contemporary public
organizations. First, many public organizations have yet to adopt—for good
reasons—data-based decision processes. In addition, the nature of many public programs does not lend itself to the creation or collection of reliable data. In
this context, the issue at hand will be to first determine what data, if any, will
improve programmatic outcomes. To the degree that outcomes could be
improved, the primary investment will be to increase the volume and velocity
of data. That is, to generate appropriate data (increased volume) and then to
ensure that the data are collected in an ongoing fashion (increased velocity).
In contrast to small data organizations, many public organizations have
spent a great deal of time generating, collecting, and storing data. These public organizations are likely to be characterized by higher volumes of data,
often spread across multiple departments. In these contexts, the primary
question will be as follows: How can existing data resources be better
employed to improve programmatic outcomes? For example, one high visibility Big Data application in the public sector is the Office of Policy and
Strategic Planning in the Office of the Mayor of New York City. This office
employs a small group of analysts that mine data from approximately 60 different city agencies to address a host of issues, including building and development issues, infrastructure problems, the selling of bootleg cigarettes, and
the “flipping” of business licenses (Howard, 2012). This office “connects,”
and then “mines,” a host of existing datasets from different agencies within
the city. Simply stated, this office leverages existing analytical capabilities by
bringing together otherwise unconnected datasets to find correlations that
help them refocus their programmatic efforts in more efficient ways. A tangible example offers some additional clarity.
Every year, New York City receives approximately 20,000 complaints about
"illegal conversions," that is, cases where an apartment or house is zoned to
accommodate a certain number of people but is accommodating many more.
Given the potential safety problems associated with this issue, it is critical
that the City be able to identify illegal conversions before a problem arises.
Historically, the only way to address this problem was to investigate complaints. Such a process is idiosyncratic at best; with only about 200 inspectors
to address the complaints, the City had been unable to “get ahead” of the
problem. By compiling a wide range of information such as property tax,
building structure, and age of the building, the data analytics team found a
curious correlation between illegal conversions and a particular building
characteristic. This allowed inspectors to proactively inspect buildings with
this characteristic. This led to a sharp increase in the number of illegal conversions that the City was able to “fix.”10
In this example, the data being used are—from the perspective of Figure
1—not quite "Big Data." They do, however, reflect "bigger data": high volume
(large datasets)11 and high complexity (high interconnectedness) but relatively low velocity and variety. In this context, which characterizes many
public organizations, the practical issue is to leverage existing resources to
take full advantage of existing data. This idea of leveraging existing data was
evident throughout the CIO interviews. Indeed, many of the CIOs were cognizant of the fact that they actually collect and maintain large stores of data.
But the analytical potential of these data had yet to be fully realized. As noted
by one CIO:
We need to focus on analyzing the data we currently have stored [in our
systems]. My guess is that we only analyze about 30% of it . . . there is a huge
opportunity for us to work on the rest [of the data] and create value . . ..
Finally, at the Big Data end of the spectrum are those organizations that
use data characterized by high volume, velocity, variety, and complexity. While there are likely few examples of truly "Big Data" in the public sector, there are some. For example, the Los Angeles Police Department's
Real-Time Analysis and Critical Response Division, in collaboration with
researchers from the University of California, Los Angeles (UCLA), uses both
historical and real-time data (including live feeds of city and traffic cameras) to "predict" where future crime might occur. This division provides
“real time investigative information to officers and detectives throughout the
city and region.”12 These data allow the Los Angeles Police Department
(LAPD) to concentrate resources in geographically defined areas. The data
used in this case provide a rare, but important, example of how “Big Data”—
high volume, velocity, variety, and complexity—can support public efforts.
The issue at hand, however, is that the investment and management issues involved
with this type of data differ from those involved with "smaller"
types of data; in particular, an organization must recognize data in its unstructured form and then
understand how to "connect" it to more conventional forms of data. In this
particular case, the LAPD had to first recognize that "live feeds" and "video streams" are best thought of as data—and that these "data" could be connected, through geocoding, to other data.
This discussion, of the characteristics of data, suggests insights for both
practitioners and scholars. First, the primary insight is that large efficiencies
can be achieved through analytics by simply recognizing the types of data that
do, or are likely to, characterize an organization. Indeed, many CIOs have
already recognized the importance of appreciating their organization’s existing data resources. This idea—that public investments for data should begin
with an assessment of existing data reservoirs—suggests a first step for
scholarship on Big Data in the public sector. More precisely, an important
first step in this research agenda should be to survey and assess the types of
data that public organizations are collecting. This simple, albeit difficult, task
would provide the foundation to consider several important questions. For
example, what are the characteristics of organizations that are most likely to
have and benefit from Big Data? What types of organizations are mostly
characterized by “small data?” And, are different types of data being used to
support different types of decision processes?
A second insight to draw from this discussion, particularly when considering organizations that have more than “small data,” is that maximizing the
analytical power of existing data will likely require that they be considered in
relation to data in other parts of the organization. Increasing the data complexity in this way—by connecting across departments—poses unique challenges. This issue of collaboration, and its importance, is further developed in
the next section on governance and privacy.
Governance and Privacy
As described above, one of the defining characteristics of Big Data is its
complexity. That is, the degree to which an organization’s data are drawn
from, and connected to, data in other departments and other organizations. As
such, a large portion of the Big Data literature is focused on understanding
the “networked” nature of Big Data and the challenges it presents. Two key
issues emerge from this literature—governance and personal privacy.
First, while an extensive body of work has developed that explores the
issue of collaboration, particularly in the public sector, very little of this work
has focused on the issue of collaboration with respect to data. This omission
is non-trivial. Data present unique challenges for collaborative forms of governance. For example, even in cases where public organizations collect significant amounts of data, it tends to be fragmented. More precisely, public
agencies often operate in silos when it comes to their information technologies—there is limited, if any, interoperability among information systems
used across agencies. In and of itself, this data fragmentation is not problematic. However, as it relates to the idea of creating and leveraging big(ger) data,
fragmentation inhibits the integration of individual datasets into Big Data.
Coordinating between these different “silos” is difficult and certainly not
costless. Relative to most collaborative efforts, the coordination costs
associated with data-sharing may be greater than those of other types of collaborative
endeavors in the public sector.13
The importance and difficulty of “governing data” is a major theme to
emerge in the CIO interviews. Indeed, many CIOs stated that poor data governance was a significant factor in limiting their efforts to pursue Big Data.
For example, the CIOs noted that because data are “localized” within particular departments and agencies, reconciling data across systems is challenging.
In addition, it can be difficult to initiate collaborative efforts that involve
sharing data, because there is limited guidance in terms of policy and legal
frameworks (Desouza, 2014). This second issue reflects an important theme
in Big Data literature, more generally, that is, the implications of Big Data for
individual privacy.
Much of the power of Big Data arises because of how it connects and finds
correlations between previously disparate sets of data. This dimension of Big
Data has led to concerns about privacy, because these connections can lead to
insights about an individual that (s)he did not consent to. Consider the following case: public agencies in New York came under critical scrutiny when public disclosure led to an uncalculated, and some might argue emotional, response to a real-time situation. In the wake of the Connecticut shooting incident, a group of researchers used the Freedom of Information Act to obtain information about gun owners living in Westchester, Rockland, and Putnam counties in suburban New York. In addition to publishing an article about the licensed gun owners in the area, the authors also published an interactive map showing gun owners' names and addresses (Worley, 2012). The information was published with the intention of providing "open knowledge" about individuals' possession of arms, but, at the same time, the information presented in the article can assist criminals, who could use it to target homeowners who do not own guns or to steal guns for sale on the illegal market (Mackey, 2013). The issue of privacy is particularly acute when we consider Big Data efforts in the public sector.
A citizen’s right to information about the government and how its decisions and processes might affect his or her personal interest is considered an
essential value of democratic societies (Galnoor, 1975; Piotrowski &
Rosenbloom, 2002). Responding to specific instruction from President
Obama, the Office of Management and Budget issued an Open Government
Directive in December 2009. The stated objective of this directive was to
“direct executive departments and agencies to take specific actions to implement the principles of transparency, participation, and collaboration,” which
are argued to be the “cornerstone of an open government.” Fundamental to
this directive is the development, maintenance, and accessibility of public
data. For example, the directive makes clear that data should be made available “online” and in “open formats.” This directive sets the stage for much of
the current interest in public sector Big Data. That is, as more public sector
data become openly available, it seems to invite the use of Big Data analytical tools.14 By initiating open data programs, government officials are hoping that Big Data techniques can be used to improve government transparency and accountability. That said, in both the public and private sectors, these novel data connections are increasingly concerning because it is unclear what might be revealed about particular individuals. This concern is particularly sharp in the public sector because many of the policy domains that generate data are also governed by a host of privacy regulations (e.g., the Health Insurance Portability and Accountability Act [HIPAA] in the health care domain or rules governing micro-level education data).
In sum, the complexity of Big Data datasets introduces two management
challenges—governance and privacy. That said, there is a lengthy literature
on public sector collaboration and governance. While this literature points to
several management issues that will support efforts to effectively develop
collaborative systems, data—Big Data in particular—poses unique challenges that public officials need to consider. More precisely, the governance
issues associated with collaborating across agencies, departments, and even
working groups are compounded by potential security and privacy issues associated with data. Unfortunately, the literature offers few insights for public
managers and policymakers on how to mitigate the potential privacy issues
around Big Data efforts. However, the CIOs interviewed suggest that—in the
absence of clearly defined protocols—leadership and transparency are critical factors for overcoming the governance and privacy issues associated with
Big Data initiatives (Desouza, 2014). For example, many CIOs described a
similar process of creating interdepartmental or interagency working groups
as the initial step in an effective Big Data strategy.
Decision-Making in a Big Data Environment
Not surprisingly, a large portion of the Big Data literature is focused on
understanding the ways in which Big Data has improved, or can improve, decision-making. As it relates to the public sector, the value of Big Data is often at the
programmatic level. For example, in the cases of New York and Los Angeles
offered above, Big Data was used to improve programmatic outcomes. One
stream of the literature suggests, however, that Big Data can enhance “higher
order” decision processes. That is, not only can Big Data enhance particular
programs, but it can also support the creation and development of public
policy, more generally. The underlying logic of this argument is that Big Data
technologies can engage citizens in novel ways and, thus, improve the aggregation and revelation of citizen preferences, with respect to public policies.
In a democratically governed society, a requisite objective of the government is to ensure that its policies—and the subsequent provision of public
goods and services—reflect the preferences of its citizens. A well-established
finding in political economy, however, is that—because of the heterogeneity
of preferences found in large groups—the provision of collective goods will
always be suboptimal; society will be provided with a level of public goods
that does not account for the heterogeneous preferences of citizens (see, for
example, Alesina, Baqir, & Easterly, 1999; Olson, 1965; Samuelson, 1954).
One explanation for this problem is the inability of traditional democratic
mechanisms to adequately aggregate preferences (Arrow, 1950). That is, the
outcomes of traditional voting mechanisms—majority voting and representative democracy—do not necessarily lead to preferred outcomes.15 In this context, policy outcomes are highly reliant on information offered by
policy experts and/or agenda-setters, neither of which necessarily represents
the "will of the people." An important component of the argument for Big
Data applications in the public sector is the promise that it can “solve” this
problem. That is, it can provide information about the preferences of citizens
regarding public policies without relying on policy-experts. Big data proponents point to two different applications that will allow policymakers to
undertake better assessments of “the will of the people”: prediction markets
and sentiment analysis. From an “investment” point of view, these efforts
require new forms of data—whereas the previous examples we discussed leveraged existing data through novel analytics. Thus, the level of
investment is likely greater, or at the very least quite different, than other
forms of Big Data investment. Despite some of the enthusiasm around the
potential for Big Data in the policy process, a close reading of the literature
suggests to us that the Big Data mechanisms proposed to improve public
policies—particularly prediction markets and sentiment analysis—face
important limitations.
First, scholars and practitioners have considered how Big Data can take
advantage of the “wisdom of crowds”—through prediction markets—to predict potential outcomes. These markets, also known as “information markets” or “event futures,” are markets where participants trade “contracts”
whose payoff depends on unknown future events (Wolfers & Zitzewitz,
2006). In traditional markets, the equilibrium outcome reflects the market
price. In prediction markets, the equilibrium outcome reflects the market's
“expectation” of an outcome. For example, if a contract pays US$1 if an
event occurs (and nothing otherwise) and the contract last trades at 30 cents,
then the market’s expectation of that event occurring is 0.30. Thus, prediction
markets provide a mechanism to access the “wisdom of crowds” (Surowiecki,
2004) to determine the likelihood of an event occurring. These types of markets have been shown to be extremely accurate predictors of events, such as
Oscar winners, sales of new products, and presidential elections.16
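The pricing logic can be made explicit with a short worked example; the contract prices below are illustrative.

```python
# A worked version of the pricing logic described above: a contract that pays
# $1 if an event occurs and trades at p cents implies a probability of p/100.
def implied_probability(last_price_cents: float, payoff_cents: float = 100.0) -> float:
    """Return the market's implied probability of the event occurring."""
    return last_price_cents / payoff_cents

print(implied_probability(30))  # a contract trading at 30 cents -> 0.30
print(implied_probability(72))  # a (hypothetical) contract at 72 cents -> 0.72
```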
As suggested by the title of his recent book, Accelerating Democracy:
Transforming Governance Through Technology, John O. McGinnis (2012)
argues that prediction markets—and other contemporary forms of technology—can be used to foster better public policies. He argues that policymakers and citizens need to take advantage of the vast stores of data and
information currently available, in particular, through the use of policy prediction markets. As platforms for “the public to speculate on election and
policy outcomes,” McGinnis argues that prediction markets can aggregate
vast sums of information from an array of individuals and can, thus, assess
“the likely effects of policies before they are implemented” (McGinnis, 2012,
p. 60). Despite McGinnis’s optimism, the broader literature seems to suggest
a more cautious approach with respect to the potential for Big Data to support
public policy.
First, it might not be politically feasible to benefit from trading on the
outcome of critical issues (Green, Armstrong, & Graefe, 2007). For example,
in the wake of the 9/11 terrorist attacks, the Defense Advanced Research Projects
Agency (DARPA) established a prediction market to predict events related to
national security, such as regime changes in the Middle East or the likelihood
of terrorist attacks. This market was immediately criticized by a host of politicians and citizens, and was, subsequently, canceled only 1 day after it was
announced (Wolfers & Zitzewitz, 2006). Second, many public policy issues
are difficult to translate into contracts for prediction markets. Most public
policies—even at the local level—are sufficiently complex that characterizing
them in a single “contract” to be traded will be difficult. Finally, a requisite
component of effective markets—prediction or otherwise—is a large number
of participants. “Thin markets” are less likely to reflect true equilibrium outcomes. In the case of prediction markets, fewer participants will lead to less
reliable predictions. Along this vein, there is limited evidence to suggest that
prediction markets—particularly for public policy matters—are likely to generate the requisite levels of participation that would lead to efficient or true
predictions. So while prediction markets provide, in principle, a tool that
could support public policy, there are important limits that, we feel, constrain
their potential.
Another Big Data application that proponents argue will improve policy
outcomes is sentiment analysis.17 Sentiment analysis draws on recent efforts
by various governments to assess the subjective well-being of citizens, relative to the policy environment.18 Such analytics can provide assessments of
how citizens are responding to proposed changes to legislation. This can take
different forms. For example, analysts could assess “real-time” data that
emerge during a policy speech or they could consider archived citizen
responses over the evolution of a particular piece of legislation. Either way,
sentiment analysis for public policy draws on the increasing acceptance of
social media as a “platform for real-time public communication.” For example, a group of researchers at Northeastern University and Harvard have
established a project titled Pulse of a Nation, which uses Twitter data to assess
the “mood” throughout each day.19 Another example is the United Nations
(UN) Global Pulse initiative in collaboration with Crimson Hexagon (a social
media analysis and analytics platform developed at Harvard University). This
effort launched a research project to analyze tweets to understand the sentiments, choices, and socioeconomic conditions of people (Lopez & Amand,
2012). These data could be used to correlate changes in sentiment related to
specific policies. As such, sentiment analysis, particularly when undertaken
using Big Data (such as Twitter), can offer a unique understanding of the
degree to which citizen preferences have been met, or are likely to be met,
through public policies and programs.
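To illustrate the basic mechanics, the sketch below scores a few invented tweets with a tiny hand-built lexicon; real systems such as those mentioned above rely on trained statistical models, so this is only a conceptual illustration.

```python
# A deliberately simple, lexicon-based sketch of sentiment scoring. The word
# lists and tweets are invented; production systems use trained models, but the
# idea of turning text into a policy-relevant sentiment signal is the same.
POSITIVE = {"support", "great", "approve", "good", "hopeful"}
NEGATIVE = {"oppose", "bad", "angry", "unfair", "reject"}

def sentiment_score(text: str) -> int:
    """Positive score = net favorable wording; negative = net unfavorable."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "I support the new transit plan, great for the city",
    "This proposal is unfair and I oppose it",
]
print([sentiment_score(t) for t in tweets])  # e.g., [2, -2]
```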
Sentiment analysis, however, also faces important limitations. Not surprisingly, some of these limitations reflect existing limitations in
the democratic process, such as engaging otherwise marginalized groups,
limiting the influence of potentially biased agenda-setters, and validating the
sources of information. More precisely, drawing on existing critiques of
social media data in science and research, we point to three key limitations of
this application in the public sector—the unequal distribution of real-time
data, deliberate data manipulation by key stakeholders, and the limited ability to engage citizens in policy matters.
First, real-time data, such as Twitter, are not necessarily representative of the
population. Moreover, Twitter accounts and users are not “equivalent.” This
idea and its importance are clearly articulated by Boyd and Crawford (2012):
Some users have multiple accounts. Some accounts are used by multiple
people. Some people never establish an account, and simply access Twitter
using the web. Some accounts are “bots” that produce automated content
without involving a person. Furthermore, the notion of an active account is
problematic. While some users post content frequently through Twitter, others
participate as “listeners” . . . Due to uncertainties about what an account
represents and what engagement looks like, it is standing on precarious ground
to sample Twitter and make claims about people and users. (p. 6)
Relatedly, there is the issue of the “digital divide”—the unequal access
between socioeconomic groups, with respect to information and communication technology. For example, groups at the lower-end of the socioeconomic
ladder tend to have less access to computers and information technology.
Thus, even if the online platforms, such as Twitter, reflected true usage, the
data generated from these sources will be biased in the normal ways, that is,
favoring the upper-middle class groups of American society. Not everyone is
online. A focus on simply mining data created by those who have digital
footprints could lead to spurious results, given the fact that we are only sampling a section of society.
Building upon the idea that Twitter accounts are “not equivalent,” a second, and important, critique of using Twitter (or other real-time data) for
sentiment analysis is that it can be subject to manipulation. For example, in
2011, the Obama administration was weighing the proposed Keystone XL pipeline project
to carry tar sands oil from Alberta down to Texas (Olson & Efstathiou, 2011;
Sheppard, 2011). To counter the concerns expressed by stakeholders, the
American Petroleum Institute and other lobbyists were able to manipulate
social media sentiment in favor of the project. By using "fake"
Twitter accounts, they sent an inordinate number of tweets supporting the project, which did not accurately represent the sentiment on the ground. While this example is perhaps more nefarious than typical, it does reflect a real concern regarding the potential for manipulating data
to sway public policy.
An additional critique of both prediction markets and sentiment analysis is
based on the underlying assumption of engagement. That is, proponents of
these approaches seem to be assuming that the current lack of engagement—
and subsequent loss of information in the policy process—is due to limits in
technology. As it relates to serious policy deliberations and discussions, we
disagree. Much of the limited engagement is more likely a function of the
cumbersome nature of the democratic process more generally. Thus, if we
consider the length of time it takes to enact laws, change codes, revise policies, and so on, it seems unlikely that—regardless of the technology—citizens are going to stay engaged throughout the process, such that policymakers
can take advantage of “real-time” data. It is more likely that citizens would
participate in a priori or post hoc opinion surveys. To be clear, this critique
does not imply that Big Data efforts should be avoided but only that they are
unlikely to replace the existing data efforts and should subsequently be
viewed, more reasonably, as another input into the policy-making process.
In sum, Big Data approaches to informing the policy process are limited in
important ways. Both prediction markets and sentiment analysis—using real-time data—seem unlikely to better represent the populace than traditional
democratic mechanisms. In addition, prediction markets are difficult to
implement for complex or controversial policies. Thus, the claim that Big
Data—and its related technology and analytical approaches—offer novel
opportunities for policymakers to assess the true preferences of the populace,
and subsequently make better policies, seems overstated. Both these Big
Data tools suffer from important limitations that constrain their ability to
broadly enhance public policy. That said, this circumspect view suggests an
important research agenda. Scholars need to explore the policy areas where
different Big Data efforts—particularly those that use novel data collection
and aggregation tools—can best support discourse around public policy.
The “End” of Theory
In the first section of our article—the bounds of data—we defined Big Data
in terms of its underlying structure—volume, velocity, variety, and complexity. An important stream of the literature, however, offers a different tack on
the issue of defining Big Data. Many argue that Big Data is less about a
change in the structure of data than about a change in the way we think
about research and analytics (Boyd & Crawford, 2012; Mayer-Schonberger
& Cukier, 2013). The nature of this shift is defined in terms of a move away
from “causal theories” and toward “simple” correlations. The logic underlying this shift is that in a Big Data environment, correlations provide clear
evidence that an event is occurring. And this is enough information upon
which to base decisions. For example, “if we can save money by knowing the
best time to buy a plane ticket without understanding the method behind airfare madness, that’s good enough” (Mayer-Schonberger & Cukier, 2013, p.
55). This idea is summarized nicely in the following quote from Chris
Anderson’s (2008) controversial essay about Big Data titled The end of
theory:
This is a world where massive amounts of data and applied mathematics replace
every other tool that might be brought to bear. Out with every theory of human
behavior from linguistics to sociology. Forget taxonomy, ontology, and
psychology. Who knows why people do what they do? The point is that they do
it, and we can track and measure it with unprecedented fidelity. With enough
data, the numbers speak for themselves.
That said, simple correlations may not serve the “longer-term” goals of
public organizations. More precisely, because of the “wicked” nature of the
problems that characterize the work of public sector organizations, it seems
that correlations may be necessary but certainly not sufficient.
Wicked problems are defined by uncertainty. The problems themselves
are often ill-defined, and the solution-set is often ambiguous. Moreover,
many of the solutions or programmatic options involve important tradeoffs.
Examples of wicked problems in the public sector include poverty, homelessness, homeland security, and sustainability. In this context, analytics—Big
Data or otherwise—are necessary but not sufficient conditions for effective
programmatic decision-making (see, for example, deLeon & Denhardt, 2000;
Durant & Legge, 2006; Roberts, 2002).
While Big Data efforts may allow analysts novel insights into problems, they
are equally likely, given the complexity of public problems, to identify a host
of spurious relationships. This concern, to some degree, underlies some of the
current debate on police tactics such as “stop and frisk” and the National
Security Agency’s monitoring of phone records and conversations. Stated
differently, public programs driven by findings on correlations are unlikely to
address underlying social issues and could lead to a misallocation of
resources.
In the two cases described in the second section of this article, the LAPD
and the Office of Policy and Strategic Planning in New York City used Big
Data to achieve programmatic outcomes, reducing crime and improving public safety. To some degree, applying this correlational approach to public sector organizations makes sense. That is, from the data-centered perspective described in this
section, a critique that such efforts lack causal understanding would be perceived as "missing the
point": the salient issue is that analysts
were able to discern previously unperceived correlations, and these correlations
helped address key social problems. If programmatic outcomes are improved
through a novel (Big Data) analytical approach, the definitional issue (and
the related resources it requires) is less problematic. That said, like others, we
caution against letting the pendulum swing too far in this direction. For
example, scholars have pointed out that, while for some stakeholders, the
correlation is all that will matter, for policymakers such correlative insights
may be less valuable. From a policymaker's point of view, understanding
why and how correlations matter—that is, what are the causal issues—will be
critical for making long-term decisions. To appreciate this insight, consider
again the example of the New York Office of Policy and Strategic Planning.
From a programmatic—data-centered—point of view, this example is clearly
a Big Data success story. The data helped focus resources to better address a
critical public issue. That said, the correlation said nothing about the underlying
causes of illegal conversions—such as the low supply of housing for lower-income residents.
As such, it seems plausible (if not likely) that illegal conversions will just
shift to different types of buildings. The inspectors and analytical department
will continue to show success in terms of the numbers of conversions fixed,
but the underlying social issue will remain. The point, then, is that even
if Big Data offers valuable insights that support the day-to-day operations of
a public organization, public managers and policymakers need to ensure that
they do not lose sight of the broader issues the public sector needs to
address.
Consider this case: The city of Boston introduced the "Street Bump" app,
which automatically detects potholes and sends reports to city administrators.
As residents drive, the app collects data on the smoothness of the ride, which could
potentially aid in planning investments to fix roads in the city. However, after
the launch of the app, it was found that the program directed crews disproportionately to wealthy
neighborhoods, because residents there were more likely to have access to smartphones (Rampton, 2014). This example illustrates that public agencies cannot simply leverage technologies to address urban challenges; they need to
think through several dimensions. Who are the users of the application? Are they
representative of all sections of society? Data ownership also becomes a crucial issue. Who owns the data? How can people be used as sensors without
violating their privacy? Incidents like these will affect cities’ resilience and
livability.
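A small simulation makes the reporting bias concrete; the adoption rates below are assumed for illustration and are not drawn from the Street Bump data.

```python
# A small, purely illustrative simulation of the reporting bias described above:
# equally bad roads generate very different report counts when app adoption differs.
import random

random.seed(1)
ADOPTION_RATE = {"wealthy": 0.80, "lower_income": 0.30}  # assumed, not measured
TRUE_POTHOLES_PER_AREA = 100

for area, adoption in ADOPTION_RATE.items():
    # Each pothole is reported only if a passing driver happens to run the app.
    reports = sum(random.random() < adoption for _ in range(TRUE_POTHOLES_PER_AREA))
    print(f"{area}: {TRUE_POTHOLES_PER_AREA} actual potholes, {reports} reported")
```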
While this is an important “stream” of the Big Data literature, the idea of
true causal relationships seems far from the minds of public sector CIOs.
Indeed, the interview data suggest that public organizations are simply “so
early” in their data efforts that they have not been able to consider the “full
potential” of Big Data analytics. As noted in Desouza’s (2014) report:
CIOs overwhelmingly report that they are just getting started with big data
efforts. While big data as a concept has been discussed in the popular press and
the academic literature for years, public agencies have not yet fully embraced
the concept.
For scholars, this suggests an opportunity to examine and assess the analytical approaches currently being used by public agencies. In doing so, we can
better understand the causal connections between data, policy, public programs, and the social issues they are supposed to address.
Conclusion
We were motivated to write this article by what we saw as a dearth of scholarship on Big Data in the public sector. That is, the extant literature offers few,
explicit insights about the limits and potential of Big Data in the public sector. Thus, drawing on the broad literature on Big Data, as well as data from
interviews with public sector CIOs, we identified some important “bounds”
to the potential for Big Data. Throughout the article, we offered “lessons” for
practitioners with respect to developing a Big Data program. That said, we
also hinted at some ways that scholars can support Big Data programs in the
public sector. The primary insight we offer for scholars, however, is that there
is room—indeed the need—for the development of a systematic research
agenda. In this concluding section, we highlight the components of this proposed research agenda.
First, what types of data characterize public organizations? This question
could lead to an important typology of public organizations and how they are,
or could, use different types of data. Second, how are public officials overcoming the issues of privacy associated with the data-sharing that is fundamental to Big Data programs. Third, given the nascent nature of Big Data in
the public sector, most of the related efforts have been targeted at improving
the “low-lying fruit” found at the programmatic level. That said, some recent
scholarship suggests that the true value of Big Data lies in its ability to
enhance public policy-making more generally. This literature is unsettled, at
best, and subsequently, there is a great deal of work that should explore (a)
the degree to which this is true (i.e., Can Big Data enhance public policymaking?) and (b) what public policy domains are best suited for Big Data
analytics? Finally, we described a growing body of work that “pushes”
against the Big Data narrative—that analytics need to focus exclusively, or at
least primarily, on correlations. Scholars and analysts writing in this area
have noted that an emphasis on correlations comes at the expense of understanding the underlying causal relationships. In some organizational contexts, this might be a trivial omission. In the public sector, however, where
many of the problems being addressed are “wicked” in nature, understanding
the causal connection between factors is critical for developing policies and
programs that address the longer-term components of the problem. From the
point of view of future research, we argue that scholars should look closely at
“how” data are currently being used and assess the degree to which it is being
“underutilized.”
Big Data offers the potential to address many public sector problems.
There is, however, some tension between the promise of Big Data and reality.
In this article, we have looked at the extant literature for lessons that will support public officials in their efforts to leverage Big Data. That said, for the
potential of Big Data to be realized we argue that scholars have an important
role to play. With this in mind, we have also set forth the beginnings of a
research agenda that considers the limits and potential of Big Data in the
public sector.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research,
authorship, and/or publication of this article: Kevin C. Desouza gratefully acknowledges funding received from the IBM Center for the Business of Government.
Notes
1. See Code for America website (http://codeforamerica.org/) for more information
on these types of “apps.” That said, we will demonstrate that these types of open
data exercises are, in most instances, not truly representative of Big Data. They
are novel and interesting, but not Big Data.
2. As described in a recent article in the Atlantic Cities: “We are hearing from the
mayor’s office like, a cat got stuck in a tree, can we have a hackathon to get it
down?” (Badger, 2013).
3. Most recently, in a series of articles in The New York Times, Paul Krugman
and James Glanz offer competing views on the potential benefits of Big Data
for economic productivity (see http://krugman.blogs.nytimes.com/2013/08/18/
the-dynamo-and-big-data/?_r=0).
4. The full findings from the study are reported in Desouza (2014).
5. The list of characteristics seems to be growing. The 3V’s were popularized in
Laney (2001).
6. This of course suggests that, as storage becomes cheaper, and analytical tools
become more powerful, Big Data—defined purely in terms of volume—will be
a moving target (Manyika et al., 2011).
7. For more insights on the volume of data being created, see Bohn and Short
(2010); Bohn, Short, and Baru (2011); and Shapiro and Varian (1998).
8. A recent study has revealed that the Google team has been overestimating flu
outbreaks since 2011—their prediction is two points off compared with the
Center for Disease Control (CDC) estimates. In addition, by using its traditional
methods of estimation, the CDC has been accurately predicting flu outbreaks
(Lazer, Kennedy, King, & Vespignani, 2014).
9. A similar framework has been put forward by Birnhack (2014).
10. For more details on this particular case, see Franks (2012).
11. These datasets are not “big” but just large. These datasets are still analyzed using
traditional analytical tools (e.g., SPSS, Excel) and hence do not meet the standard definition for “Big Data.”
12. www.lapdonline.org/home/pdf_view/39375
13. John Bryson has written (with a variety of coauthors) extensively on this issue,
particularly as it relates to the sharing of information and resources (see, for
example, Bryson, Ackermann, & Eden, 2007; Bryson, Crosby, & Stone, 2006).
14. Similar “open data” efforts have been initiated at the local level. For example, a
recent newsletter from Alliance for Innovation, titled The Digital Future: Open
Data to Open Doors highlights open data initiatives in Hawaii, Austin, Texas,
and Palo Alto, California.
15. See almost any introductory textbook on political economy. For a simple exposition of this issue, see Jonathan Gruber's (2014) textbook.
16. See Wolfers and Zitzewitz (2006).
17. McGinnis refers to this as dispersed media.
18. For example, the City of Santa Monica received one of five US$1 million awards
granted by the Bloomberg Philanthropies' Mayors Challenge. This effort is
focused on the development of a Local Well-Being Index, “a dynamic measurement tool that will provide a multidimensional picture of our community’s
strengths and challenges across key elements of well-being (economics, social
connections, health, education & care, community engagement, and physical
environment).” The goal of this index is to be used by city officials to “make
data-driven decisions and targeted resource allocation.” http://www.smgov.net/
uploadedFiles/Wellbeing/Project-Summary.pdf
19. http://www.ccs.neu.edu/home/amislove/twittermood/
References
Alesina, A., Baqir, R., & Easterly, W. (1999). Public goods and ethnic divisions.
Quarterly Journal of Economics, 114, 1243-1284.
Anderson, C. (2008, June 23). The end of theory: The data deluge makes the scientific method obsolete. Wired. Retrieved from http://archive.wired.com/science/
discoveries/magazine/16-07/pb_theory
Arrow, K. J. (1950). A difficulty in the concept of social welfare. Journal of Political
Economy, 58, 328-346.
Badger, E. (2013). Are civic hackathons stupid? The Atlantic cities: Place matters.
Retrieved from http://www.theatlanticcities.com/technology/2013/07/are-hackathons-stupid/6111/
Birnhack, M. (2014). S-M-L-XL data: Big data as a new informational privacy paradigm. In Big data and privacy: Making ends meet (Future of Privacy Forum).
The Center for Internet and Society, Stanford Law School. Retrieved from http://
www.futureofprivacy.org/big-data-privacy-workshop-paper-collection/
Bohn, R., & Short, J. (2010). How much information 2009: Report on American consumers. San Diego: Global Information Industry Center, University of California,
San Diego.
Bohn, R., Short, J., & Baru, C. (2011). How much information 2010: Report on enterprise server information consumers. San Diego: Global Information Industry
Center, University of California, San Diego.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information,
Communication & Society, 15, 662-679.
Bryson, J. M., Ackermann, F., & Eden, C. (2007). Putting the resource-based view
of strategy and distinctive competencies to work in public organizations. Public
Administration Review, 67, 702-718.
Bryson, J. M., Crosby, B. C., & Stone, M. M. (2006). The design and implementation of cross-sector collaborations: Propositions from the literature. Public
Administration Review, 66, 44-56.
Chow, B. (2012, February 8). LAPD pioneers high tech crime fighting war room.
Retrieved from: http://losangeles.cbslocal.com/2012/02/08/lapd-pioneers-hightech-crime-fighting-war-room/
deLeon, L., & Denhardt, R. B. (2000). The political theory of reinvention. Public
Administration Review, 60, 89-97.
Desouza, K. C. (2014). Realizing the promise of big data. Washington, DC: IBM
Center for the Business of Government.
Durant, R. F., & Legge, J. S. (2006). Wicked problems, public policy, and administrative theory: Lessons from the GM food regulatory arena. Administration &
Society, 38, 309-334.
Franks, B. (2012). Taming the big data tidal wave. Hoboken, NJ: John Wiley.
Galnoor, I. (1975). Government secrecy: Exchanges, intermediaries and middlemen.
Public Administration Review, 35, 32-42.
Green, K. C., Armstrong, J. S., & Graefe, A. (2007). Methods to elicit forecasts from
groups: Delphi and prediction markets compared. Foresight: The International
Journal of Applied Forecasting, 8, 17-20.
Gruber, J. (2014). Public finance and public policy (4th ed.). New York, NY: Worth
Publishers.
Howard, A. (2012). Predictive data analytics is saving lives and taxpayer dollars in
New York City. Retrieved from http://strata.oreilly.com/2012/06/predictive-dataanalytics-big-data-nyc.html
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data issues and challenges moving forward. Proceedings from the 46th Hawaii International Conference
on System Sciences. Retrieved from http://www.cse.hcmut.edu.vn/~ttqnguyet/
Downloads/SIS/References/Big%20Data/(2)%20Kaisler2013%20-%20Big%20
Data-%20Issues%20and%20Challenges%20Moving%20Forward.pdf
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and
variety. META Group. Retrieved from http://blogs.gartner.com/doug-laney/
files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocityand-Variety.pdf
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google
flu: Traps in big data analysis. Science, 343, 1203-1205.
Lopez, G., & Amand, W. S. (2012). Discovering global socio economic trends hidden in big data. Retrieved from http://www.unglobalpulse.org/discoveringtrend
sinbigdata-CHguestpost
Mackey, A. (2013). In wake of Journal News publishing gun permit holder maps,
nation sees push to limit access to gun records. The News Media and The Law,
Winter 2013, 37(1). Retrieved from http://www.rcfp.org/browse-media-lawresources/news-media-law/news-media-and-law-winter-2013/wake-journalnews-publishin
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Hung
Byers, A. (2011). Big data: The next frontier for innovation, competition, and
productivity. McKinsey Global Institute.
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. New York, NY: Houghton Mifflin Harcourt.
McGinnis, J. O. (2012). Accelerating democracy: Transforming governance through
technology. Princeton, NJ: Princeton University Press.
Mervis, J. (2012). Agencies rally to tackle Big Data. Science, 336, 22.
Olson, M. (1965). The logic of collective action: Public goods and the theory of
groups. Cambridge, MA: Harvard University Press.
Olson, B., & Efstathiou, Jr. J. (2011, November 16). Enbridge’s pipeline threatens
Trans Canada’s Keystone XL Plan. Bloombrg BussinessWeek. Available at:
http://www.businessweek.com/news/2011-11-16/enbridge-s-pipeline-threatenstranscanada-s-keystone-xl-plan.html
Piotrowski, S. J., & Rosenbloom, D. H. (2002). Nonmission-based values in results-oriented public management: The case of freedom of information. Public
Administration Review, 62, 643–657.
Rampton, R. (2014, April 27). White House looks at how “big data” can discriminate. Retrieved from http://uk.reuters.com/article/2014/04/27/uk-usa-obamaprivacy-idUKBREA3Q00S20140427
Roberts, N. C. (2002). Keeping public officials accountable through dialogue:
Resolving the accountability paradox. Public Administration Review, 62,
658-669.
Samuelson, P. (1954). The pure theory of public expenditures. Review of Economics
and Statistics, 36, 386-389.
Shapiro, C., & Varian, H. R. (1998). Information rules: A strategic guide to the network economy. Cambridge, MA: Harvard Business Press.
Sheppard, K. (2011, August 24). What's all the fuss about the Keystone XL pipeline?
Mother Jones. Available at: http://www.motherjones.com/blue-marble/2011/08/
pipeline-protesters-keystone-xl-tar-sands
Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few
and how collective wisdom shapes business, economies, societies and nations. New
York, NY: Little, Brown.
Wolfers, J., & Zitzewitz, E. (2006). Prediction markets in theory and practice (NBER
Working Paper 12083). Retrieved from http://www.nber.org/papers/w12083.pdf
Worley, D. R. (2012). The gun owner next door: What you don’t know about the
weapons in your neighborhood. Retrieved from http://www.lohud.com/apps/pbcs.
dll/article?AID=2012312230056&nclick_check=1
Author Biographies
Kevin C. Desouza is the associate dean for research in the College of Public Programs,
a professor in the School of Public Affairs, and the interim director for the Decision
Theater in the Office of Knowledge Enterprise Development at Arizona State
University.
Benoy Jacob is the director of the Center for Local Government Research and
Training, and is an assistant professor in the School of Public Affairs at the University
of Colorado, Denver.