Big Data in the Public Sector: Lessons for Practitioners and Scholars

Kevin C. Desouza1 and Benoy Jacob2

Administration & Society, 2017, Vol. 49(7), 1043–1064. © The Author(s) 2014. DOI: 10.1177/0095399714555751. journals.sagepub.com/home/aas

1Arizona State University, Phoenix, AZ, USA
2University of Colorado, Denver, CO, USA

Corresponding Author: Kevin C. Desouza, Arizona State University, 411 N. Central Ave., M/C 3520, Suite #750, Phoenix, AZ 85004-0685, USA. Email: kev.desouza@gmail.com

Abstract

In this essay, we consider the role of Big Data in the public sector. Motivating our work is the recognition that Big Data is still in its infancy and that many important questions regarding its true value remain unanswered. The question we consider is as follows: What are the limits, or potential, of Big Data in the public sector? By reviewing the literature and summarizing insights from a series of interviews with public sector Chief Information Officers (CIOs), we offer a scholarly foundation for both practitioners and researchers interested in understanding Big Data in the public sector.

Keywords

big data, public organizations, public management, policy analysis

The amount of data in our world has been exploding, and analyzing large datasets—so-called Big Data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus.
—McKinsey Global Institute (2010)

The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and many others are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions.
—Boyd and Crawford (2012)

Big Data is indeed a Big Deal.
—Dr. John Holdren (2012; Director of the White House Office of Science and Technology Policy)

Introduction

As suggested in the introductory quotes, there is an increasingly popular perception that Big Data holds vast potential for improving the decision-making processes of both public and private organizations. In the hopes of solving previously intractable problems, scholars, analysts, and entrepreneurs from a wide range of fields are actively pursuing novel approaches to mining the digital traces and deposits of data that comprise Big Data (Boyd & Crawford, 2012). Against this emerging backdrop, policymakers, public managers, and citizens have started to consider the ways in which Big Data can be used to improve public sector outcomes, that is, public policies, programs, and democratic processes.

Several Big Data initiatives have recently emerged in the public sector. For example, in March of 2012, the Obama Administration put forward the Big Data Research and Development Initiative. The objective of this initiative was to understand the "technologies needed to manipulate and mine massive amounts of information; apply that knowledge to other scientific fields as well as address the national goals in the areas of health, energy, defense, education and research" (Mervis, 2012, p. 22). Big Data efforts have also been initiated at other levels of government. For example, a host of municipalities have created "open data platforms" and held "civic hackathons" to engage citizens with public data.
Several novel applications have emerged from these efforts, addressing a wide range of local issues such as providing information on blighted properties, identifying local resources for underserved citizens, and helping parents access information on local schools.1 Simply stated, it appears that Big Data can, indeed, provide the public sector with "a powerful arsenal of strategies and techniques for boosting productivity and achieving higher levels of efficiency and effectiveness" (Manyika et al., 2011, p. 54).

That said, Big Data is still in its infancy, and many important questions regarding its true value remain unanswered (Boyd & Crawford, 2012; Desouza, 2014). Indeed, observers have noted that Big Data solutions are being promoted as a way to address public issues, but with little consideration of how, where, and when they are most likely to be successful.2 Thus, it appears that there is at least some tension between the promise of Big Data and the reality.3 The question at hand, then, is "What are the limits, or potential, of Big Data in the public sector?"

In this article, we begin to address this question by reviewing the nascent Big Data literature as it pertains to the management of public organizations. The insights we draw are further informed by findings from a recent survey of Chief Information Officers (CIOs) in different public organizations.4 As such, we provide a scholarly foundation for both practitioners and researchers interested in understanding Big Data in the public sector.

Following this introduction, our article is organized into five sections. The next four sections summarize key themes from the Big Data literature and consider the implications of each for public organizations; in particular, the bounds of Big Data, governance and privacy, decision-making, and "the end of theory." The final section offers a short summary of lessons for practitioners and some thoughts on potential research directions for scholars.

The Bounds of Big Data

As a relatively new phenomenon, much of the literature on Big Data focuses on defining the "bounds" of Big Data: that is, what Big Data is and what it means to operate in a Big Data environment. Despite the ubiquity of the term, Big Data is difficult to define (Franks, 2012; Laney, 2001; Manyika et al., 2011). There is, however, some consensus among scholars and practitioners that four factors characterize Big Data—volume, velocity, variety, and complexity.5 More than just simple semantics, these characteristics have potentially important implications for management practices. As one CIO put it:

Big Data is just too big for us. Where . . . big data begin[s] and end[s] is not known. I have been struggling to identify digestible bites for us to take to move on big data . . ..

First, at its core, Big Data must clearly be "big." Big Data datasets are "beyond the ability of typical database software tools to capture, store, manage and analyze" (Franks, 2012, p. 4).6 Thus, Big Data, in terms of volume, is a function of the underlying and pre-existing capacity of an organization to collect, store, and analyze its data. This definition suggests that Big Data is, in terms of volume, a moving target. For example, household demographics that were once difficult to manage now "fit on a thumb drive and can be analyzed by a low-end laptop" (Franks, 2012, p. 24).7

The second defining characteristic of Big Data is its velocity.
This refers to the speed at which data are being created and stored, and their associated rates of retrieval (Kaisler, Armour, Espinosa, & Money, 2013). Much like the volume of data, however, there is no established benchmark by which to determine when data velocity meets a Big Data threshold. Rather, the salient issue is that data are being created at historically fast rates. An example of the current velocity of data is provided by Mayer-Schonberger and Cukier (2013):

Google processes more than 24 petabytes of data per day, a volume that is thousands of times the quantity of all printed material in the U.S. Library of Congress. Facebook, a company that didn't exist a decade ago, gets more than 10 million new photos uploaded every hour. Facebook members click a "like" button or leave a comment nearly three billion times a day, creating a digital trail that the company can mine to learn about users' preferences. Meanwhile, the 800 million monthly users of Google's YouTube service upload over an hour of video every second. The number of messages on Twitter grows at around 200 percent a year and by 2012 exceeded 400 million tweets a day. (p. 8)

A third defining characteristic of Big Data is its variety. Big Data comprises data in a wide range of forms, including text, images, and videos. Generally speaking, then, it will include data that are structured, semi-structured, or unstructured. Structured data refer to data that have an organized structure and are, thus, clearly identifiable; a simple example would be a database in which specific information is stored in columns and rows. Semi-structured data do not conform to a formal structure per se; however, they contain "tags" that help separate the data records or fields. For example, data in many bibliographical software programs are semi-structured: the file is composed of records, but the structure is not regular, in the sense that fields may be missing or may take more "open" formats, such as a "Notes" section. Finally, unstructured data, as the name implies, have no identifiable structure; examples include text messages, photos, videos, and audio files. As datasets become increasingly "complex"—moving from structured to unstructured—the processing and analytical capabilities required to collect, manage, and analyze the data increase significantly. Thus, a better understanding of the defining characteristics of an organization's real, or potential, data repositories offers insights into the level and types of investments needed.
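To make the distinction concrete, consider the following sketch. It is an illustration of ours, with invented records rather than data from any particular agency's systems; it shows the same hypothetical permit inspection held as structured, semi-structured, and unstructured data.

```python
# Illustrative only: three hypothetical records describing the same
# (invented) permit inspection, at different points on the structure spectrum.
import csv
import io
import json

# Structured: fixed columns, every field present and queryable by name.
structured = io.StringIO("permit_id,address,inspected\n4711,12 Elm St,true\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged fields, but an irregular schema; the optional
# "notes" field may be missing from other records or hold free-form text.
semi_structured = json.loads(
    '{"permit_id": 4711, "address": "12 Elm St", "notes": "owner absent"}'
)

# Unstructured: no identifiable structure; extracting the permit number
# requires text processing rather than a simple field lookup.
unstructured = "Visited 12 Elm St re: permit 4711. Owner absent; follow up."

print(rows[0]["permit_id"], semi_structured["permit_id"], "4711" in unstructured)
```

Only the first two forms can be queried by field name; the third requires text processing, which is one reason unstructured data demand greater analytical investment.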
The final defining characteristic of Big Data is its complexity—the degree to which the data are interconnected. Many of the novel applications and/or insights that have emerged from Big Data are a result of connecting otherwise unrelated datasets. One oft-cited example is the joint effort between Google and the Centers for Disease Control and Prevention (CDC). In this case, Google was able to connect its database of "search terms"—entered in its search engine—such as "cold medicine" and "flu symptoms," with the CDC's data on the H1N1 virus. In doing so, analysts were able to predict the spread of the H1N1 virus by "connecting" two previously disconnected datasets.8

Given that many public organizations are in the nascent stages of implementing data-based decision processes, the characteristics of an organization's data provide a simple framework for assessing the potential data needs and subsequent investments of the organization. More precisely, if we consider the data environment as varying along a data continuum (Figure 1)9—whereby Big Data, in which the volume, velocity, variety, and complexity of the data are very high, reflects one extreme—we can better appreciate the potential data infrastructure and subsequent investment that an organization requires.

Figure 1. Data continuum: volume, velocity, variety, and complexity. The continuum runs from "small data" (low on all four dimensions, e.g., land use data for a small city), through census data (high volume only), Twitter data or video feeds (high volume and velocity), and datasets with different structures (high volume, velocity, and variety), to "Big Data" (linked datasets with different structures, high on all four dimensions).

First, consider organizations at the "small data" end of the data spectrum, where data are characterized as "low" in terms of volume, velocity, variety, and complexity. Such organizations must think about data differently than their "bigger data" counterparts. More precisely, at the low end of the data spectrum—by definition—the organization is unlikely to have a great deal of data to begin with (low volume and low velocity). Moreover, relevant data are unlikely to be accessible through complementary organizations (low complexity). Such an environment is not unusual for contemporary public organizations. First, many public organizations have yet to adopt—for good reasons—data-based decision processes. In addition, the nature of many public programs does not lend itself to the creation/collection of reliable data. In this context, the issue at hand will be to first determine what data, if any, will improve programmatic outcomes. To the degree that outcomes could be improved, the primary investment will be to increase the volume and velocity of data: that is, to generate appropriate data (increased volume) and then to ensure that the data are collected in an ongoing fashion (increased velocity).

In contrast to small data organizations, many public organizations have spent a great deal of time generating, collecting, and storing data. These organizations are likely to be characterized by higher volumes of data, often spread across multiple departments. In these contexts, the primary question will be as follows: How can existing data resources be better employed to improve programmatic outcomes? For example, one high-visibility Big Data application in the public sector is the Office of Policy and Strategic Planning in the Office of the Mayor of New York City. This office employs a small group of analysts who mine data from approximately 60 different city agencies to address a host of issues, including building and development issues, infrastructure problems, the selling of bootleg cigarettes, and the "flipping" of business licenses (Howard, 2012). This office "connects," and then "mines," a host of existing datasets from different agencies within the city. Simply stated, it leverages existing analytical capabilities by bringing together otherwise unconnected datasets to find correlations that help it refocus programmatic efforts in more efficient ways.
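The mechanics behind such "connect and mine" efforts can be shown with a minimal sketch. The two series below are fabricated for illustration; the point is only that two previously disconnected datasets, once aligned on a shared key (here, the week number), can be tested for co-movement.

```python
# A minimal sketch of the join-then-correlate step behind examples like
# Google and the CDC. All numbers are fabricated for illustration.
from math import sqrt

searches = {1: 120, 2: 180, 3: 260, 4: 310}  # hypothetical query counts by week
cases = {1: 14, 2: 21, 3: 30, 4: 38}         # hypothetical reported cases by week

weeks = sorted(searches.keys() & cases.keys())  # the "connection": a shared key
x = [searches[w] for w in weeks]
y = [cases[w] for w in weeks]

# Pearson correlation, computed directly from the aligned series.
mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
print(f"Pearson r = {r:.3f}")  # a high r flags an association, not a cause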
A tangible example offers some additional clarity. Every year, New York City receives approximately 20,000 complaints about "illegal conversions": cases where an apartment or house is zoned to accommodate a certain number of people but is accommodating many more. Given the potential safety problems associated with this issue, it is critical that the City be able to identify illegal conversions before a problem arises. Historically, the only way to address this problem was to investigate complaints. Such a process is idiosyncratic at best; with only about 200 inspectors to address the complaints, the City had been unable to "get ahead" of the problem. By compiling a wide range of information, such as property tax records, building structure, and the age of the building, the data analytics team found a curious correlation between illegal conversions and a particular building characteristic. This allowed inspectors to proactively inspect buildings with this characteristic, which led to a sharp increase in the number of illegal conversions that the City was able to "fix."10

In this example, the data being used are—from the perspective of Figure 1—not quite "Big Data." They do, however, reflect "bigger data": high volume (large datasets)11 and high complexity (high interconnectedness) but relatively low velocity and variety. In this context, which characterizes many public organizations, the practical issue is to leverage existing resources to take full advantage of existing data. This idea of leveraging existing data was evident throughout the CIO interviews. Indeed, many of the CIOs were cognizant of the fact that they collect and maintain large stores of data but that the analytical potential of these data had yet to be fully realized. As noted by one CIO:

We need to focus on analyzing the data we currently have stored [in our systems]. My guess is that we only analyze about 30% of it . . . there is a huge opportunity for us to work on the rest [of the data] and create value . . ..

Finally, at the other end of the spectrum are those organizations that use data characterized by high volume, velocity, variety, and complexity. While there are likely few examples of truly "Big Data" in the public sector, there are some. For example, the Los Angeles Police Department's Real-Time Analysis and Critical Response Division, in collaboration with researchers from the University of California, Los Angeles (UCLA), uses both historical and real-time data (including live feeds from city and traffic cameras) to "predict" where future crime might occur. This division provides "real time investigative information to officers and detectives throughout the city and region."12 These data allow the Los Angeles Police Department (LAPD) to concentrate resources in geographically defined areas. This case provides a rare, but important, example of how "Big Data"—high volume, velocity, variety, and complexity—can support public efforts. The investment and management issues involved with this type of data, however, are different from those involved in dealing with "smaller" types of data. In particular, the organization must recognize data in unstructured forms and then understand how to "connect" them to more conventional forms of data. In this case, the LAPD had to first recognize that "live feeds" and "video streams" are best thought of as data—and that these "data" could be connected, through geocoding, to other data.
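Situating examples like these on the continuum in Figure 1 can be approximated almost mechanically. The sketch below is a toy illustration of ours: the four dimensions come from the literature reviewed above, but the thresholds and example values are invented and would need to be calibrated to an organization's actual capacity.

```python
# A toy illustration of placing a dataset on the Figure 1 continuum.
# The cutoffs and example profiles are invented, not drawn from any standard.
from dataclasses import dataclass

@dataclass
class DataProfile:
    gigabytes: float        # volume
    records_per_day: float  # velocity
    formats: int            # variety: distinct forms (text, image, video, ...)
    linked_sources: int     # complexity: connected external datasets

def continuum_position(p: DataProfile) -> str:
    highs = sum([
        p.gigabytes > 1_000,       # arbitrary cutoffs for the sketch
        p.records_per_day > 1e6,
        p.formats > 2,
        p.linked_sources > 3,
    ])
    return ["small data", "bigger data", "bigger data",
            "approaching Big Data", "Big Data"][highs]

land_use = DataProfile(gigabytes=2, records_per_day=50, formats=1, linked_sources=0)
lapd_feeds = DataProfile(gigabytes=40_000, records_per_day=5e7, formats=4,
                         linked_sources=12)
print(continuum_position(land_use), "|", continuum_position(lapd_feeds))
```

However crude, an exercise of this kind forces an organization to take stock of what it actually holds before deciding what to buy.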
This discussion of the characteristics of data suggests insights for both practitioners and scholars. The primary insight is that large efficiencies can be achieved through analytics by simply recognizing the type of data that do, or are likely to, characterize an organization. Indeed, many CIOs have already recognized the importance of appreciating their organization's existing data resources. This idea—that public investments for data should begin with an assessment of existing data reservoirs—suggests a first step for scholarship on Big Data in the public sector. More precisely, an important first step in this research agenda should be to survey and assess the types of data that public organizations are collecting. This simple, albeit difficult, task would provide the foundation for considering several important questions. For example, what are the characteristics of organizations that are most likely to have, and benefit from, Big Data? What types of organizations are mostly characterized by "small data"? And are different types of data being used to support different types of decision processes?

A second insight to draw from this discussion, particularly when considering organizations that have more than "small data," is that maximizing the analytical power of existing data will likely require that those data be considered in relation to data in other parts of the organization. Increasing data complexity in this way—by connecting across departments—poses unique challenges. This issue of collaboration, and its importance, is further developed in the next section.

Governance and Privacy

As described above, one of the defining characteristics of Big Data is its complexity: the degree to which an organization's data are drawn from, and connected to, data in other departments and other organizations. As such, a large portion of the Big Data literature is focused on understanding the "networked" nature of Big Data and the challenges it presents. Two key issues emerge from this literature—governance and personal privacy.

First, while an extensive body of work explores the issue of collaboration, particularly in the public sector, very little of this work has focused on collaboration with respect to data. This omission is non-trivial; data present unique challenges for collaborative forms of governance. For example, even in cases where public organizations collect significant amounts of data, those data tend to be fragmented. More precisely, public agencies often operate in silos when it comes to their information technologies—there is limited, if any, interoperability among the information systems used across agencies. In and of itself, this data fragmentation is not problematic. As it relates to creating and leveraging big(ger) data, however, fragmentation inhibits the integration of individual datasets into Big Data. Coordinating between these different "silos" is difficult and certainly not costless; indeed, the coordination costs associated with data-sharing may be greater than those of other types of collaborative endeavors in the public sector.13

The importance and difficulty of "governing data" was a major theme to emerge from the CIO interviews. Indeed, many CIOs stated that poor data governance was a significant factor limiting their efforts to pursue Big Data.
For example, the CIOs noted that because data are "localized" within particular departments and agencies, reconciling data across systems is challenging. In addition, it can be difficult to initiate collaborative efforts that involve sharing data, because there is limited guidance in terms of policy and legal frameworks (Desouza, 2014).

This second issue reflects an important theme in the Big Data literature more generally: the implications of Big Data for individual privacy. Much of the power of Big Data arises from how it connects, and finds correlations between, previously disparate sets of data. This dimension of Big Data has led to concerns about privacy, because these connections can lead to insights about individuals to which they did not consent. Consider the following case, in which public agencies in New York came under critical scrutiny for a disclosure that enabled an uncalculated, and some might argue emotional, response to a real-time situation. In the wake of the Connecticut shooting incident, a group of researchers obtained, through the Freedom of Information Act, information regarding gun owners living in the suburban counties of Westchester, Rockland, and Putnam in New York. In addition to publishing an article about the licensed gun owners in the neighborhood, the authors published an interactive visual map displaying gun owners' names and addresses (Worley, 2012). The information was published with the intention of providing "open knowledge" about individuals' possession of arms; at the same time, however, the information presented in the article can assist criminals, who can use it to target homeowners who do not own guns or to target homes from which to steal guns and profit from the sale of stolen weapons (Mackey, 2013).

The issue of privacy is particularly acute when we consider Big Data efforts in the public sector. A citizen's right to information about the government, and about how its decisions and processes might affect his or her personal interests, is considered an essential value of democratic societies (Galnoor, 1975; Piotrowski & Rosenbloom, 2002). Responding to specific instruction from President Obama, the Office of Management and Budget issued an Open Government Directive in December 2009. The stated objective of this directive was to "direct executive departments and agencies to take specific actions to implement the principles of transparency, participation, and collaboration," which are argued to be the "cornerstone of an open government." Fundamental to this directive is the development, maintenance, and accessibility of public data; for example, the directive makes clear that data should be made available "online" and in "open formats." This directive sets the stage for much of the current interest in public sector Big Data. That is, as more public sector data become openly available, it seems to invite the use of Big Data analytical tools.14 By initiating open data programs, government officials are hoping that Big Data techniques can be used to improve accountability. That said, in both the public and private sectors, these novel data connections are increasingly concerning because it is unclear what might be revealed about particular individuals.
This concern is heightened because many of the policy domains that generate data in the public sector are also governed by a host of privacy regulations (e.g., the Health Insurance Portability and Accountability Act [HIPAA] in the health care domain, or the regulations covering micro-level education data).

In sum, the complexity of Big Data datasets introduces two management challenges—governance and privacy. There is, to be sure, a lengthy literature on public sector collaboration and governance, and it points to several management issues that will support efforts to effectively develop collaborative systems. Data—Big Data in particular—nonetheless pose unique challenges that public officials need to consider. More precisely, the governance issues associated with collaborating across agencies, departments, and even working groups are compounded by the potential security and privacy issues associated with data. Unfortunately, the literature offers few insights for public managers and policymakers on how to mitigate the potential privacy issues around Big Data efforts. However, the CIOs interviewed suggest that—in the absence of clearly defined protocols—leadership and transparency are critical factors for overcoming the governance and privacy issues associated with Big Data initiatives (Desouza, 2014). For example, many CIOs described a similar process of creating interdepartmental or interagency working groups as the initial step in an effective Big Data strategy.

Decision-Making in a Big Data Environment

Not surprisingly, a large portion of the Big Data literature is focused on understanding the ways in which Big Data has improved, or can improve, decision-making. As it relates to the public sector, the value of Big Data is often realized at the programmatic level; in the cases of New York and Los Angeles offered above, for example, Big Data was used to improve programmatic outcomes. One stream of the literature suggests, however, that Big Data can enhance "higher order" decision processes. That is, not only can Big Data enhance particular programs, but it can also support the creation and development of public policy more generally. The underlying logic of this argument is that Big Data technologies can engage citizens in novel ways and, thus, improve the aggregation and revelation of citizen preferences with respect to public policies.

In a democratically governed society, a requisite objective of the government is to ensure that its policies—and the subsequent provision of public goods and services—reflect the preferences of its citizens. A well-established finding in political economy, however, is that—because of the heterogeneity of preferences found in large groups—the provision of collective goods will always be suboptimal; society will be provided with a level of public goods that does not account for the heterogeneous preferences of citizens (see, for example, Alesina, Baqir, & Easterly, 1999; Olson, 1965; Samuelson, 1954). One explanation for this problem is the inability of traditional democratic mechanisms to adequately aggregate preferences (Arrow, 1950). That is, the outcomes of traditional voting mechanisms—majority voting and representative democracy—do not necessarily lead to preferred outcomes.15 In this context, policy outcomes are highly reliant on information offered by policy experts and/or agenda-setters.
Neither of these, however, necessarily represents the "will of the people." An important component of the argument for Big Data applications in the public sector is the promise that they can "solve" this problem: that is, provide information about the preferences of citizens regarding public policies without relying on policy experts. Big Data proponents point to two applications that will allow policymakers to undertake better assessments of "the will of the people": prediction markets and sentiment analysis. From an "investment" point of view, these efforts require new forms of data, whereas the previous examples we discussed involved leveraging existing data through novel analytics. Thus, the level of investment is likely greater than, or at the very least quite different from, other forms of Big Data investment. Despite some of the enthusiasm around the potential for Big Data in the policy process, a close reading of the literature suggests to us that the Big Data mechanisms proposed to improve public policies—particularly prediction markets and sentiment analysis—face important limitations.

First, scholars and practitioners have considered how Big Data can take advantage of the "wisdom of crowds"—through prediction markets—to predict potential outcomes. These markets, also known as "information markets" or "event futures," are markets in which participants trade "contracts" whose payoff depends on unknown future events (Wolfers & Zitzewitz, 2006). In traditional markets, the equilibrium outcome reflects the market price. In prediction markets, the equilibrium outcome reflects the market's "expectation" of an outcome. For example, if a contract pays US$1 if an event occurs (and nothing otherwise) and the contract last trades at 30 cents, then the market's expectation of that event occurring is 0.30. Thus, prediction markets provide a mechanism for accessing the "wisdom of crowds" (Surowiecki, 2004) to determine the likelihood of an event occurring. These types of markets have been shown to be extremely accurate predictors of events such as Oscar winners, sales of new products, and presidential elections.16
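The arithmetic of reading such a market is straightforward. The short sketch below restates the US$1-contract example from the preceding paragraph; the function names are ours, for illustration only.

```python
# The arithmetic behind reading a prediction market: a contract that pays
# US$1 if an event occurs and last trades at 30 cents implies a market
# expectation of 0.30 for that event.

def implied_probability(price_cents: float, payoff_cents: float = 100.0) -> float:
    """Last trade price as a fraction of the winning payoff."""
    return price_cents / payoff_cents

def expected_value(price_cents: float, believed_probability: float,
                   payoff_cents: float = 100.0) -> float:
    """A trader's expected profit per contract, given a private belief."""
    return believed_probability * payoff_cents - price_cents

print(implied_probability(30))   # 0.30, the market's expectation
print(expected_value(30, 0.45))  # 15.0: a trader who believes 45% would buy
```

The price aggregates dispersed beliefs precisely because traders with divergent private estimates have a financial incentive to trade until the price reflects them.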
As suggested by the title of his recent book, Accelerating Democracy: Transforming Governance Through Technology, John O. McGinnis (2012) argues that prediction markets—and other contemporary forms of technology—can be used to foster better public policies. He argues that policymakers and citizens need to take advantage of the vast stores of data and information currently available, in particular through the use of policy prediction markets. As platforms for "the public to speculate on election and policy outcomes," McGinnis argues that prediction markets can aggregate vast sums of information from an array of individuals and can, thus, assess "the likely effects of policies before they are implemented" (McGinnis, 2012, p. 60).

Despite McGinnis's optimism, the broader literature seems to suggest a more cautious approach with respect to the potential for Big Data to support public policy. First, it might not be politically feasible to benefit from trading on the outcomes of critical issues (Green, Armstrong, & Graefe, 2007). For example, in the wake of the 9/11 terrorist attacks, the Defense Advanced Research Projects Agency (DARPA) established a prediction market to predict events related to national security, such as regime changes in the Middle East or the likelihood of terrorist attacks. This market was immediately criticized by a host of politicians and citizens and was subsequently canceled only 1 day after it was announced (Wolfers & Zitzewitz, 2006). Second, many public policy issues are difficult to translate into contracts for prediction markets. Most public policies—even at the local level—are sufficiently complex that characterizing them in a single "contract" to be traded will be difficult. Finally, a requisite component of effective markets—prediction or otherwise—is a large number of participants. "Thin markets" are less likely to reflect true equilibrium outcomes; in the case of prediction markets, fewer participants will lead to less reliable predictions. Along this vein, there is limited evidence to suggest that prediction markets—particularly for public policy matters—are likely to generate the requisite levels of participation that would lead to efficient or true predictions. So while prediction markets provide, in principle, a tool that could support public policy, there are important limits that, we feel, constrain their potential.

Another Big Data application that proponents argue will improve policy outcomes is sentiment analysis.17 Sentiment analysis draws on recent efforts by various governments to assess the subjective well-being of citizens relative to the policy environment.18 Such analytics can provide assessments of how citizens are responding to proposed changes to legislation. This can take different forms: for example, analysts could assess "real-time" data that emerge during a policy speech, or they could consider archived citizen responses over the evolution of a particular piece of legislation. Either way, sentiment analysis for public policy draws on the increasing acceptance of social media as a "platform for real-time public communication." For example, a group of researchers at Northeastern University and Harvard have established a project titled Pulse of a Nation, which uses Twitter data to assess the "mood" of the nation throughout each day.19 Another example is the United Nations (UN) Global Pulse initiative, which, in collaboration with Crimson Hexagon (a social media analysis and analytics platform developed at Harvard University), launched a research project to analyze tweets to understand the sentiments, choices, and socioeconomic conditions of people (Lopez & Amand, 2012). These data could be used to correlate changes in sentiment with specific policies. As such, sentiment analysis—particularly when undertaken using Big Data (such as Twitter data)—can offer a unique understanding of the degree to which citizen preferences have been met, or are likely to be met, through public policies and programs.
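At its simplest, sentiment analysis reduces to scoring text against a lexicon and aggregating. The sketch below is a deliberately minimal illustration with an invented word list and invented messages; platforms such as Crimson Hexagon rely on far richer models, so this should be read as a cartoon of the technique, not a description of any particular system.

```python
# A minimal, lexicon-based sketch of sentiment analysis over tweets.
# Word lists and messages are invented for illustration.
POSITIVE = {"support", "great", "good", "approve", "benefit"}
NEGATIVE = {"oppose", "bad", "harmful", "reject", "waste"}

def sentiment(text: str) -> int:
    """Count positive words minus negative words in a message."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "I support the new transit plan, great for the city",
    "This policy is a waste, I oppose it",
    "Council meets Tuesday to discuss the plan",
]
scores = [sentiment(t) for t in tweets]
print(scores, "net mood:", sum(scores))  # [2, -2, 0] net mood: 0
```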
Sentiment analysis, however, is also subject to important limitations. Not surprisingly, some of these limitations reflect existing limitations in the democratic process, such as engaging otherwise marginalized groups, limiting the influence of potentially biased agenda-setters, and validating sources of information. More precisely, drawing on existing critiques of social media data in science and research, we point to three key limitations of this application in the public sector: the unequal distribution of real-time data, deliberate data manipulation by key stakeholders, and the limited ability to engage citizens in policy matters.

First, real-time data, such as Twitter data, are not necessarily representative of the population. Moreover, Twitter accounts and users are not "equivalent." This idea and its importance are clearly articulated by Boyd and Crawford (2012):

Some users have multiple accounts. Some accounts are used by multiple people. Some people never establish an account, and simply access Twitter using the web. Some accounts are "bots" that produce automated content without involving a person. Furthermore, the notion of an active account is problematic. While some users post content frequently through Twitter, others participate as "listeners" . . . Due to uncertainties about what an account represents and what engagement looks like, it is standing on precarious ground to sample Twitter and make claims about people and users. (p. 6)

Relatedly, there is the issue of the "digital divide": the unequal access to information and communication technology across socioeconomic groups. For example, groups at the lower end of the socioeconomic ladder tend to have less access to computers and information technology. Thus, even if online platforms such as Twitter reflected true usage, the data generated from these sources would be biased in the usual ways, that is, favoring the upper-middle-class groups of American society. Not everyone is online; a focus on simply mining data created by those who have digital footprints could lead to spurious results, given that we are sampling only a section of society.
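This critique can be made concrete with a small simulation. The population shares, online-access rates, and support levels below are invented for illustration; the point is only that when access to a platform is unequal across groups, the platform's aggregate "mood" can diverge from the population's even with thousands of observations.

```python
# A small simulation of the digital-divide critique, with invented parameters.
import random

random.seed(7)

# Hypothetical population: each group's share, probability of being online,
# and probability of supporting a policy.
groups = [
    {"share": 0.4, "online": 0.9, "support": 0.30},  # higher-income group
    {"share": 0.6, "online": 0.3, "support": 0.70},  # lower-income group
]

true_support = sum(g["share"] * g["support"] for g in groups)

online_sample = []
for g in groups:
    for _ in range(int(10_000 * g["share"])):
        if random.random() < g["online"]:  # only online citizens are observed
            online_sample.append(random.random() < g["support"])

print(f"population support: {true_support:.2f}")  # 0.54
print(f"online estimate:    {sum(online_sample) / len(online_sample):.2f}")  # ~0.43
```

Despite thousands of "data points," the online estimate understates support simply because the group that favors the policy is less likely to be observed.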
Building upon the idea that Twitter accounts are "not equivalent," a second, and important, critique of using Twitter (or other real-time data) for sentiment analysis is that such data can be subject to manipulation. For example, in 2011, the Obama administration was weighing approval of the Keystone XL pipeline project, proposed to carry tar sands oil from Alberta down to Texas (Olson & Efstathiou, 2011; Sheppard, 2011). To counter the concerns expressed among stakeholders, the American Petroleum Institute and other lobbyists were able to manipulate social media sentiment to show support for the project. By using "fake" Twitter accounts, they were able to send an inordinate number of tweets supporting the project, which did not accurately represent the sentiment on the ground. While this example is perhaps more nefarious than typical, it does reflect a real concern regarding the potential for manipulating data to sway public policy.

An additional critique of both prediction markets and sentiment analysis concerns their underlying assumption of engagement. That is, proponents of these approaches seem to assume that the current lack of engagement—and the subsequent loss of information in the policy process—is due to limits in technology. As it relates to serious policy deliberations and discussions, we disagree. Much of the limited engagement is more likely a function of the cumbersome nature of the democratic process more generally. If we consider the length of time it takes to enact laws, change codes, revise policies, and so on, it seems unlikely that—regardless of the technology—citizens are going to stay engaged throughout the process such that policymakers can take advantage of "real-time" data. It is more likely that citizens would participate in a priori or post hoc opinion surveys. To be clear, this critique does not imply that Big Data efforts should be avoided, but only that they are unlikely to replace existing data efforts and should, more reasonably, be viewed as another input into the policy-making process.

In sum, Big Data approaches to informing the policy process are limited in important ways. Both prediction markets and sentiment analysis—using real-time data—seem unlikely to better represent the populace than traditional democratic mechanisms. In addition, prediction markets are difficult to implement for complex or controversial policies. Thus, the claim that Big Data—and its related technologies and analytical approaches—offers novel opportunities for policymakers to assess the true preferences of the populace, and subsequently make better policies, seems overstated. Both of these Big Data tools suffer from important limitations that constrain their ability to broadly enhance public policy. That said, this circumspect view suggests an important research agenda: scholars need to explore the policy areas where different Big Data efforts—particularly those that use novel data collection and aggregation tools—can best support discourse around public policy.

The "End" of Theory

In the first section of our article—the bounds of Big Data—we defined Big Data in terms of its underlying structure: volume, velocity, variety, and complexity. An important stream of the literature, however, takes a different tack on the issue of defining Big Data. Many argue that Big Data is less about a change in the structure of data than a change in the way we think about research and analytics (Boyd & Crawford, 2012; Mayer-Schonberger & Cukier, 2013). The nature of this shift is defined in terms of a move away from "causal theories" and toward "simple" correlations. The logic underlying this shift is that, in a Big Data environment, correlations provide clear evidence that an event is occurring, and this is enough information on which to base decisions. For example, "if we can save money by knowing the best time to buy a plane ticket without understanding the method behind airfare madness, that's good enough" (Mayer-Schonberger & Cukier, 2013, p. 55). This idea is summarized nicely in the following quote from Chris Anderson's (2008) controversial essay about Big Data, "The End of Theory":

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is that they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

That said, simple correlations may not serve the longer-term goals of public organizations. More precisely, because of the "wicked" nature of the problems that characterize the work of public sector organizations, correlations may be necessary but are certainly not sufficient. Wicked problems are defined by uncertainty: the problems themselves are often ill-defined, the solution set is often ambiguous, and many of the solutions or programmatic options involve important tradeoffs. Examples of wicked problems in the public sector include poverty, homelessness, homeland security, and sustainability. In this context, analytics—Big Data or otherwise—are a necessary but not sufficient condition for effective programmatic decision-making (see, for example, deLeon & Denhardt, 2000; Durant & Legge, 2006; Roberts, 2002).
While Big Data efforts may allow analysts novel insights into problems, those efforts are equally likely—given the complexity of public problems—to identify a host of spurious relationships. This concern, to some degree, underlies some of the current debate on police tactics such as "stop and frisk" and on the National Security Agency's monitoring of phone records and conversations. Stated differently, public programs driven by findings about correlations are unlikely to address underlying social issues and could lead to a misallocation of resources.

In the two cases described in the second section of this article, the LAPD and the Office of Policy and Strategic Planning in New York City used Big Data to achieve programmatic outcomes: reducing crime and improving public safety. To some degree, applying this approach to public sector organizations makes sense. That is, from the analytical perspective described in this section, a data-centric critique would be perceived as "missing the point." From the data-centered point of view, the salient issue is that analysts were able to discern previously unperceived correlations, and these correlations helped address key social problems. If programmatic outcomes are improved through a novel (Big Data) analytical approach, the definitional issue (and the related resources it requires) is less problematic.

That said, like others, we caution against letting the pendulum swing too far in this direction. For example, scholars have pointed out that, while for some stakeholders the correlation is all that will matter, for policymakers such correlative insights may be less valuable. From a policymaker's point of view, understanding why and how correlations matter—that is, what the causal issues are—will be critical for making long-term decisions. To appreciate this insight, consider again the example of the New York Office of Policy and Strategic Planning. From a programmatic, data-centered point of view, this example is clearly a Big Data success story: the data helped focus resources to better address a critical public issue. That said, the correlation said nothing of the underlying causes of illegal conversions—such as the low supply of housing for lower-income residents. As such, it seems plausible (if not likely) that illegal conversions will simply shift to different types of buildings. The inspectors and the analytics department will continue to show success in terms of the number of conversions fixed, but the underlying social issue will remain. The point, then, is that even if Big Data offers valuable insights that support the day-to-day operations of a public organization, public managers and policymakers need to ensure that they do not lose sight of the broader issues the public sector needs to address.

Consider another case: The city of Boston introduced the "Street Bump" app, which automatically detects potholes and sends reports to city administrators. As residents drive, the app collects data on the smoothness of the ride, which could potentially aid in planning investments to fix roads in the city. After the launch of the app, however, it was found that the program directed crews disproportionately to wealthy neighborhoods, because residents there were more likely to have access to smartphones (Rampton, 2014). This example illustrates that public agencies cannot simply leverage technologies to address urban challenges; they need to think through several dimensions. Who are the users of the application? Are they representative of all sections of society?
Data ownership also becomes a crucial issue: Who owns the data? How can people be used as sensors without violating their privacy? Incidents like these will affect cities' resilience and livability.

While this is an important "stream" of the Big Data literature, the idea of true causal relationships seems far from the minds of public sector CIOs. Indeed, the interview data suggest that public organizations are simply "so early" in their data efforts that they have not been able to consider the "full potential" of Big Data analytics. As noted in Desouza's (2014) report:

CIOs overwhelmingly report that they are just getting started with big data efforts. While big data as a concept has been discussed in the popular press and the academic literature for years, public agencies have not yet fully embraced the concept.

For scholars, this suggests an opportunity to examine and assess the analytical approaches currently being used by public agencies. In doing so, we can better understand the causal connections between data, policy, public programs, and the social issues they are supposed to address.

Conclusion

We were motivated to write this article by what we saw as a dearth of scholarship on Big Data in the public sector. That is, the extant literature offers few explicit insights about the limits and potential of Big Data in the public sector. Thus, drawing on the broad literature on Big Data, as well as data from interviews with public sector CIOs, we identified some important "bounds" to the potential of Big Data. Throughout the article, we offered "lessons" for practitioners with respect to developing a Big Data program. We also hinted at some ways that scholars can support Big Data programs in the public sector. The primary insight we offer for scholars, however, is that there is room—indeed the need—for the development of a systematic research agenda. In this concluding section, we highlight the components of this proposed agenda.

First, what types of data characterize public organizations? This question could lead to an important typology of public organizations and how they are, or could be, using different types of data. Second, how are public officials overcoming the privacy issues associated with the data-sharing that is fundamental to Big Data programs? Third, given the nascent nature of Big Data in the public sector, most of the related efforts have been targeted at the "low-hanging fruit" found at the programmatic level. That said, some recent scholarship suggests that the true value of Big Data lies in its ability to enhance public policy-making more generally. This literature is unsettled at best, and, subsequently, there is a great deal of work that should explore (a) the degree to which this is true (i.e., can Big Data enhance public policy-making?) and (b) which public policy domains are best suited for Big Data analytics. Finally, we described a growing body of work that "pushes" against the Big Data narrative—that analytics need to focus exclusively, or at least primarily, on correlations. Scholars and analysts writing in this area have noted that an emphasis on correlations comes at the expense of understanding the underlying causal relationships. In some organizational contexts, this might be a trivial omission.
In the public sector, however, where many of the problems being addressed are "wicked" in nature, understanding the causal connections between factors is critical for developing policies and programs that address the longer-term components of the problem. From the point of view of future research, we argue that scholars should look closely at how data are currently being used and assess the degree to which they are being underutilized.

Big Data offers the potential to address many public sector problems. There is, however, some tension between the promise of Big Data and the reality. In this article, we have looked to the extant literature for lessons that will support public officials in their efforts to leverage Big Data. For the potential of Big Data to be realized, however, we argue that scholars have an important role to play. With this in mind, we have also set forth the beginnings of a research agenda that considers the limits and potential of Big Data in the public sector.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Kevin C. Desouza gratefully acknowledges funding received from the IBM Center for the Business of Government.

Notes

1. See the Code for America website (http://codeforamerica.org/) for more information on these types of "apps." That said, we will demonstrate that these types of open data exercises are, in most instances, not truly representative of Big Data. They are novel and interesting, but not Big Data.
2. As described in a recent article in The Atlantic Cities: "We are hearing from the mayor's office like, a cat got stuck in a tree, can we have a hackathon to get it down?" (Badger, 2013).
3. Most recently, in a series of articles in The New York Times, Paul Krugman and James Glanz offer competing views on the potential benefits of Big Data for economic productivity (see http://krugman.blogs.nytimes.com/2013/08/18/the-dynamo-and-big-data/?_r=0).
4. The full findings from the study are reported in Desouza (2014).
5. The list of characteristics seems to be growing. The 3Vs were popularized in Laney (2001).
6. This, of course, suggests that, as storage becomes cheaper and analytical tools become more powerful, Big Data—defined purely in terms of volume—will be a moving target (Manyika et al., 2011).
7. For more insights on the volume of data being created, see Bohn and Short (2010); Bohn, Short, and Baru (2011); and Shapiro and Varian (1998).
8. A recent study has revealed that the Google team has been overestimating flu outbreaks since 2011—its predictions are two points off compared with the CDC's estimates. In addition, using its traditional methods of estimation, the CDC has been accurately predicting flu outbreaks (Lazer, Kennedy, King, & Vespignani, 2014).
9. A similar framework has been put forward by Birnhack (2014).
10. For more details on this particular case, see Franks (2012).
11. These datasets are not "big," just large. They are still analyzed using traditional analytical tools (e.g., SPSS, Excel) and hence do not meet the standard definition of "Big Data."
12. www.lapdonline.org/home/pdf_view/39375
13. John Bryson has written (with a variety of coauthors) extensively on this issue, particularly as it relates to the sharing of information and resources (see, for example, Bryson, Ackermann, & Eden, 2007; Bryson, Crosby, & Stone, 2006).
14. Similar "open data" efforts have been initiated at the local level. For example, a recent newsletter from the Alliance for Innovation, titled The Digital Future: Open Data to Open Doors, highlights open data initiatives in Hawaii; Austin, Texas; and Palo Alto, California.
15. See almost any introductory textbook on political economy. For a simple exposition of this issue, see Jonathan Gruber's (2014) textbook.
16. See Wolfers and Zitzewitz (2006).
17. McGinnis refers to this as dispersed media.
18. For example, the City of Santa Monica received one of five US$1 million awards granted by the Bloomberg Philanthropies' Mayors Challenge. This effort is focused on the development of a Local Well-Being Index, "a dynamic measurement tool that will provide a multidimensional picture of our community's strengths and challenges across key elements of well-being (economics, social connections, health, education & care, community engagement, and physical environment)." The goal is for the index to be used by city officials to "make data-driven decisions and targeted resource allocation." http://www.smgov.net/uploadedFiles/Wellbeing/Project-Summary.pdf
19. http://www.ccs.neu.edu/home/amislove/twittermood/

References

Alesina, A., Baqir, R., & Easterly, W. (1999). Public goods and ethnic divisions. Quarterly Journal of Economics, 114, 1243-1284.
Anderson, C. (2008, June 23). The end of theory: The data deluge makes the scientific method obsolete. Wired. Retrieved from http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory
Arrow, K. J. (1950). A difficulty in the concept of social welfare. Journal of Political Economy, 58, 328-346.
Badger, E. (2013). Are civic hackathons stupid? The Atlantic Cities: Place Matters. Retrieved from http://www.theatlanticcities.com/technology/2013/07/are-hackathons-stupid/6111/
Birnhack, M. (2014). S-M-L-XL data: Big data as a new informational privacy paradigm. In Big data and privacy: Making ends meet (Future of Privacy Forum). The Center for Internet and Society, Stanford Law School. Retrieved from http://www.futureofprivacy.org/big-data-privacy-workshop-paper-collection/
Bohn, R., & Short, J. (2010). How much information 2009: Report on American consumers. San Diego: Global Information Industry Center, University of California, San Diego.
Bohn, R., Short, J., & Baru, C. (2011). How much information 2010: Report on enterprise server information consumers. San Diego: Global Information Industry Center, University of California, San Diego.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15, 662-679.
Bryson, J. M., Ackermann, F., & Eden, C. (2007). Putting the resource-based view of strategy and distinctive competencies to work in public organizations. Public Administration Review, 67, 702-718.
Bryson, J. M., Crosby, B. C., & Stone, M. M. (2006). The design and implementation of cross-sector collaborations: Propositions from the literature. Public Administration Review, 66, 44-56.
Chow, B. (2012, February 8). LAPD pioneers high tech crime fighting war room. Retrieved from http://losangeles.cbslocal.com/2012/02/08/lapd-pioneers-hightech-crime-fighting-war-room/
deLeon, L., & Denhardt, R. B. (2000). The political theory of reinvention. Public Administration Review, 60, 89-97.
Desouza, K. C. (2014). Realizing the promise of big data. Washington, DC: IBM Center for the Business of Government.
Durant, R. F., & Legge, J. S. (2006). Wicked problems, public policy, and administrative theory: Lessons from the GM food regulatory arena. Administration & Society, 38, 309-334.
Franks, B. (2012). Taming the big data tidal wave. Hoboken, NJ: John Wiley.
Galnoor, I. (1975). Government secrecy: Exchanges, intermediaries, and middlemen. Public Administration Review, 35, 32-42.
Green, K. C., Armstrong, J. S., & Graefe, A. (2007). Methods to elicit forecasts from groups: Delphi and prediction markets compared. Foresight: The International Journal of Applied Forecasting, 8, 17-20.
Gruber, J. (2014). Public finance and public policy (4th ed.). New York, NY: Worth Publishers.
Howard, A. (2012). Predictive data analytics is saving lives and taxpayer dollars in New York City. Retrieved from http://strata.oreilly.com/2012/06/predictive-dataanalytics-big-data-nyc.html
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: Issues and challenges moving forward. Proceedings of the 46th Hawaii International Conference on System Sciences. Retrieved from http://www.cse.hcmut.edu.vn/~ttqnguyet/Downloads/SIS/References/Big%20Data/(2)%20Kaisler2013%20-%20Big%20Data-%20Issues%20and%20Challenges%20Moving%20Forward.pdf
Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. META Group. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science, 343, 1203-1205.
Lopez, G., & Amand, W. S. (2012). Discovering global socio-economic trends hidden in big data. Retrieved from http://www.unglobalpulse.org/discoveringtrendsinbigdata-CHguestpost
Mackey, A. (2013). In wake of Journal News publishing gun permit holder maps, nation sees push to limit access to gun records. The News Media and the Law, Winter 2013, 37(1). Retrieved from http://www.rcfp.org/browse-media-lawresources/news-media-law/news-media-and-law-winter-2013/wake-journalnews-publishin
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Hung Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. New York, NY: Houghton Mifflin Harcourt.
McGinnis, J. O. (2012). Accelerating democracy: Transforming governance through technology. Princeton, NJ: Princeton University Press.
Mervis, J. (2012). Agencies rally to tackle big data. Science, 336, 22.
Olson, M. (1965). The logic of collective action: Public goods and the theory of groups. Cambridge, MA: Harvard University Press.
Olson, B., & Efstathiou, J., Jr. (2011, November 16). Enbridge's pipeline threatens TransCanada's Keystone XL plan. Bloomberg Businessweek. Retrieved from http://www.businessweek.com/news/2011-11-16/enbridge-s-pipeline-threatens-transcanada-s-keystone-xl-plan.html
Piotrowski, S. J., & Rosenbloom, D. H. (2002). Nonmission-based values in results-oriented public management: The case of freedom of information. Public Administration Review, 62, 643-657.
Rampton, R. (2014, April 27). White House looks at how "big data" can discriminate. Reuters. Retrieved from http://uk.reuters.com/article/2014/04/27/uk-usa-obamaprivacy-idUKBREA3Q00S20140427
Roberts, N. C. (2002). Keeping public officials accountable through dialogue: Resolving the accountability paradox. Public Administration Review, 62, 658-669.
Samuelson, P. (1954). The pure theory of public expenditure. Review of Economics and Statistics, 36, 386-389.
Shapiro, C., & Varian, H. R. (1998). Information rules: A strategic guide to the network economy. Cambridge, MA: Harvard Business Press.
Sheppard, K. (2011, August 24). What's all the fuss about the Keystone XL pipeline? Mother Jones. Retrieved from http://www.motherjones.com/blue-marble/2011/08/pipeline-protesters-keystone-xl-tar-sands
Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. New York, NY: Little, Brown.
Wolfers, J., & Zitzewitz, E. (2006). Prediction markets in theory and practice (NBER Working Paper 12083). Retrieved from http://www.nber.org/papers/w12083.pdf
Worley, D. R. (2012). The gun owner next door: What you don't know about the weapons in your neighborhood. Retrieved from http://www.lohud.com/apps/pbcs.dll/article?AID=2012312230056&nclick_check=1

Author Biographies

Kevin C. Desouza is the associate dean for research in the College of Public Programs, a professor in the School of Public Affairs, and the interim director of the Decision Theater in the Office of Knowledge Enterprise Development at Arizona State University.

Benoy Jacob is the director of the Center for Local Government Research and Training and an assistant professor in the School of Public Affairs at the University of Colorado, Denver.