Local Big Data: The Role of Libraries in Building
Community Data Infrastructures
ABSTRACT
Communities face opportunities and challenges in many
areas, including education, health and wellness, workforce
and economic development, housing, and the environment
[21]. At the same time, governments have significant fiscal
constraints on their ability to address these challenges and
opportunities. Through a combination of open government,
open data, and civic engagement, however, governments,
citizens, civil society groups, and others are reinventing the
relationship between governments and the governed by
developing crowdsourced and other innovative solutions for
community advancement. Underlying this reinvention and
innovation is data – particularly local data about housing, air
quality, graduation rates, literacy rates, poverty, disease, and
more. And yet, not all communities have the capacity to
create, work with, or leverage data at the local level. Using
a case study approach in a medium-sized U.S. city, this paper
focuses on the issues that smaller communities face when
seeking to create local data infrastructures and the extent to
which libraries can develop their capabilities, capacity, and
abilities to work with community information and data to
facilitate community engagement and high-impact, locally
relevant analytics.
General Terms
Data management, communities, libraries.
Keywords
Big Data, Community engagement, Data infrastructure,
Data curation.
1. INTRODUCTION
Communities face opportunities and challenges in many
areas, including education, health and wellness, workforce
and economic development, housing, and the environment
[21]. At the same time, governments
have fiscal constraints which limit their ability to directly
address these challenges and opportunities. Through a
combination of open government, open data, and civic
engagement, however, governments, citizens, civil society
groups, and others are reinventing the relationship between
governments and the governed by developing civic
crowdsourcing initiatives and other innovative solutions for
community advancement. Underlying this reinvention and
innovation is data – particularly local data about housing,
employment, air quality, graduation rates, literacy rates,
poverty, business activity, and disease.
Data have existed in many key domain areas for some time,
often in the form of large-scale national datasets, such as
those created by the U.S. Census Bureau, Bureau of Labor
Statistics, Environmental Protection Agency, and the
Centers for Disease Control and Prevention. All of the data from these
agencies have varying levels of local granularity, and often
have more localized (e.g., block, neighborhood, city, county,
region) components. Emerging data integration capabilities
and analytic techniques, however, enable novel ways of
viewing and analyzing data. This, in turn, has supported new
strategies for informing policy-makers, decision-makers,
stakeholders, and citizens about their communities. This
ability to harness geo-spatial data, chronic disease data,
literacy data, and other sources to create data visualizations,
interactive map-based analyses, and more, often referred to
as Big Data, can shed light on critical community needs,
gaps, and solutions [1, 5, 7].
But in order to engage in these data science efforts, create
analytic tools, and foster civic engagement, there are
underlying infrastructure needs which must be met. Critical
elements of community data infrastructures include, but are
not limited to [8, 10, 14]:
• Central data repositories, where data are stored,
  maintained, and catalogued;
• Data standards, to which collected data adhere;
• Data communities, which will collect, maintain,
  and curate data;
• Effective information structure/ecology, through
  which to foster data communities, engagement,
  and use; and
• Awareness, at the organizational, neighborhood,
  and individual levels, that data affect their daily
  well-being and functioning.
In short, while data – and their analyses – are increasingly
central to better understanding and improving the
communities in which we live, realizing this potential
requires infrastructure, organization, and skills that many
communities are just now developing.
Over the past several years there has been a steady increase
in media and scholarly attention to the application of data
analytics to strengthen communities. However, much of this
research and media coverage of Big Data and Smart Cities
has focused on the efforts of large metropolitan areas, the
use of vast data sets, and large-scale open data
initiatives [20]. While important, this work overlooks the
fact that many communities operate on a much smaller,
“local”, scale. In the US alone, there are over 18,000 cities,
towns, and villages [22], many of which lack the population
or capacity to engage in data initiatives using the strategies
used by larger cities, national governments, and international
NGOs. For every San Francisco there are thousands of
smaller cities and towns, each of which has a range of local
data (what we might call Local Big Data) – agricultural,
cultural, community, historical – or the need for localized
data drawn from larger national and international datasets
[15].
While smaller communities often lack the resources,
personnel, and infrastructure to fully realize the potential of
Local Big Data using the same strategies employed by larger
cities, it would be incorrect to assume that they have no
information institutions that could facilitate local data
initiatives. There are over 16,700 library buildings across the
U.S., many of which are in small and rural communities [23].
Although libraries are not the first organizations that come
to mind in discussions of Big Data, they have a long history
of working with community members to make use of
information resources to meet their individual and
community needs. This, coupled with the growing role for
libraries in the dissemination of government data and
provision of public services, raises critical questions about
how (and if) libraries might help their communities realize
the potential of Local Big Data.
To explore the local aspects of Big Data, and reported gaps
between need and capacity within smaller communities,
this paper presents preliminary findings from a case study
conducted in a medium-sized U.S. city that focused on the
ways in which community data can be leveraged through
public libraries. In particular, the study explored the ways
in which libraries can co-develop their capabilities,
capacity, and abilities – and those of their small/mid-sized
communities – to work with community information and
data to better meet community challenges and
opportunities. The paper concludes with a call for
additional research that explores the challenges of Local
Big Data and how smaller communities might leverage
existing institutions and capacities to create robust local
data ecosystems.
2. LITERATURE REVIEW
When Barack Obama first ran for the presidency in 2008, his
campaign focused on issues related to information and
technology. The Obama campaign not only used technology
– particularly social media – in new ways to raise money,
target and contact voters, and get out the vote; it also devoted
considerable attention to the ways in which an Obama
presidency would use technology in governance and the
policies it would support regarding information and
technology [13]. The
promise and challenges of harnessing the potential of large
amounts of data were an area of emphasis in the campaign
literature, an intended focus of his administration, and a
critical factor in the success of his campaign’s fundraising
and voter organization efforts [13]. While Big Data is not a
new idea, candidate Obama was the first major presidential
contender to express real interest in its potential at a time
when technological advances made working with large
amounts of data increasingly practicable.
After initial efforts to promote open and transparent
government through Executive Orders, the Obama
Administration advanced openness through the Open
Government Partnership
(http://www.opengovpartnership.org/) [17] and in policies
that required the release of machine-readable datasets [16].
The overarching technology focus of the Obama
Administration has been on the use of technology to
increase government transparency, or at least increase the
volume of government information that is generally
available [5, 11]. This follows an overall trend in recent
years toward using e-government for greater access to
government records and increased focus on proactive
release [9]. The efforts of the Obama Administration to
promote access, openness, and transparency have centered
on two main technologies – social media and open data
[4, 6, 11, 12, 18]. These approaches both promote access,
openness, and transparency and allow for members of the
public, organizations, communities, and others to generate
new and innovative insights from preexisting, but
previously inaccessible, data.
At their most advanced, Big Data initiatives from the
government are based on open data and offer the potential
to democratize work with data, providing innumerable
opportunities to generate new value and insights from large
amounts of existing information. For researchers, the release
of large-scale government data has prompted new scientific
breakthroughs across disciplines.
For companies, greater availability of large amounts of
government data can make their products more effective,
such as commercial weather forecasting services that rely on
access to National Weather Service data. For community
members, access to large amounts of government data offers
opportunities to create new ways to understand, navigate,
and develop their communities. For community institutions,
like libraries, this type of data can provide insights into
community needs and community composition at a
previously impossible level of detail.
However, while there is significant potential benefit created
by making government data accessible, none of the
opportunities described above can be realized unless
communities, organizations, and individuals have the
necessary infrastructure, skills, and knowledge.
As a result, to date, most Local Big Data initiatives have
been the province of the largest cities such as New York
(https://data.cityofnewyork.us/), San Francisco
(https://data.sfgov.org/), and Chicago
(https://data.cityofchicago.org/). As reported by MeriTalk
[15], there is often a range of capacity and skills gaps that
smaller municipalities face when trying to promote
engagement with community data. Thus, although there is a
recognition that managing and leveraging local data sets is
important, there is often an inability to do so in smaller
jurisdictions.
Often ignored in this discussion, however, is the role and
centrality of public libraries in the local data infrastructure
domain. With over 16,700 public library buildings in the
US, libraries are in almost every community – small and
large – and bring information management, data curation,
public access technology infrastructure, and digital literacy
skills that are essential to working with community data [3,
19]. Libraries and librarians can [2]:
• Provide data curation and management expertise.
  Big Data require management, curation,
  preservation, metadata schemes, and structures
  for access and availability. Libraries are well
  positioned to provide this expertise to the
  communities that they serve.
• Develop data analytics skills within libraries to
  foster and promote the use of data within the
  communities that the libraries serve, which can
  facilitate policy development and decision
  making.
• Serve as facilitators of open data in order to
  enhance transparency and openness of
  government. By working with the open and big
  data communities, libraries have an opportunity
  to promote democratic processes within the
  communities that they serve.
• Host a range of data events such as hackathons
  that promote the use of data for community
  engagement.
While the involvement of public libraries in community
efforts to realize the potential of Local Big Data is in its
infancy, libraries such as Chattanooga Public Library
(http://opendata.chattlibrary.org/) and Hartford Public
Library (http://hartfordinfo.org/) have begun leading the
way. Yet, at this time the best strategies for blending Big
Data at the local level, public libraries, and communities
remain unclear. The potential, however, is substantial. Thus,
this study sought to explore the topic through a case study
approach as described below.
3. METHODOLOGY
The study used a case study methodology [24] to explore the
data needs of community organizations and the roles of
public libraries in meeting those needs within a medium-sized
US city between August and October 2013. Site selection was
purposeful, as the researchers had knowledge of the
information and data landscape of the city, the civil society
community, and the public library community (in and around
the city), and had support from the state library agency that
coordinates public library initiatives throughout the state in
which the city is located. In addition, the state library
agency was willing to fund the initial research to promote a
discussion regarding the roles that public libraries might play
in developing community data infrastructures. A critical
focus of the study, therefore, was to identify current data
needs, efforts, uses, and activities; to focus on existing
gaps, future directions, and potential realization; and to
determine the extent to which libraries might be able to help
communities develop their data infrastructure and facilitate
its use, with the overall goal of exploring how (and if) to
position libraries at the center of these efforts.
The following research questions guided the study:
• What are the local data needs of community
  organizations, libraries, and other community
  stakeholders?
• How do these stakeholders identify and select
  data of interest?
• How do these stakeholders currently manage the
  data that they use?
• Are there data that would be of use but are
  currently out of the reach of these stakeholders?
• How are these stakeholders using community
  data, and what are the gaps in skills regarding
  data use?
• What roles can libraries play in the collection,
  management, and use of data within local
  communities?
• What challenges do libraries face in assuming
  data infrastructure roles in their communities?
The study focused on central issues of building critical data
capabilities within communities, such as data
infrastructure; organization of data and data communities;
identification of key data sources and resources; assessing
and improving data frequency and quality; data curation;
facilitating data use; and measuring the impact of
data-related investments.
The case study involved preliminary interviews with civil
society and community groups, discussions with state library
agency staff, an analysis of the city’s neighborhoods, an
analysis of the library branches in the communities and
surrounding areas, and a culminating workshop with a range
of stakeholders (libraries, community organizations,
researchers) intended to discuss and further identify issues
associated with Local Big Data. More specifically, the study
team:
• Documented the current practice and need
  regarding activities, events, and services of the
  community to better assess the community
  infrastructure for disseminating information.
• Identified current practices in libraries and
  community organizations for collecting and using
  activities, events, and services data.
• Reviewed how the community organizations,
  including libraries, create, collect, manage, and
  use the information and data about activities,
  events, and services.
• Explored practices used by community
  organizations to disseminate data and information
  services and resources to community members.
These efforts provided baseline data that informed the
interviews conducted with community organizations, which
then informed the culminating workshop.
The study team identified and contacted 44 community
organizations (e.g., civil society, non-profit organizations)
for interviews and succeeded in interviewing
representatives from 14 of the organizations.
Phone interviews were recorded, and interviewer notes and
subject responses were captured using a Qualtrics survey to
create an interview record. The interviewees were asked to
describe how their organization engaged in annual planning
of its activities; the types of information that they typically
used in the planning process; the types of data that they used
to communicate with funding agencies; additional data that
they believed would be beneficial to their organizations; and
challenges that they face regarding the gathering, analysis,
and use of data.
The study team also conducted an exploratory analysis of
library websites informed by a series of searches via Google.
The searches were designed to identify libraries that might
be engaging in local data initiatives, with a particular
emphasis on leveraging local data, building a community of
practice around data, and data engagement events. The
searches yielded several libraries with some data practices,
two of which stood out: Chattanooga Public Library
(http://opendata.chattlibrary.org/) and Hartford Public
Library (http://hartfordinfo.org/). The study team analyzed
the websites of these libraries to better understand the roles
that these libraries were playing in the local data ecosystem.
Both libraries sought to be a community data platform, with
Chattanooga engaging in a number of data events (i.e.,
hackathons) intended to foster community data use and
development. Hartford Public Library has to date focused
more on the gathering and repository aspects of community
data.
The data from the interviews and library website analysis
informed the study team’s September 2013 workshop
entitled “All Data is Local: The Role of Libraries in Local
Data Ecosystems.” The event brought together 12
representatives of civil society, community organizations,
researchers from the study team and local university, and the
library community. The study team presented its findings
from study activities, and facilitated discussion around
community data needs, data usage, data challenges, data
literacy skills, possible library roles in facilitating and
meeting community data needs, and proposed strategies to
address ways in which communities can leverage existing
resources to build data infrastructure capacities.
The next section presents key findings from the study. It is
important to note, however, that this exploratory study has
several limitations, including a focus on a single community,
the limited number of participants, and the challenges
experienced in gaining access to a broader cross section of
community organizations. These limitations constrain the
generalizability of the findings. However, while subject to
significant caveats, the findings of this preliminary study
offer important insights into the challenges small and mid-
sized communities face in developing the capacity to engage
with and use Local Big Data.
4. FINDINGS
Based on the background data gathering, interviews, website
assessment, and workshop, findings emerged in four key
areas: 1) Data needs; 2) Building capacity; 3)
Demonstration; and 4) Building community.
4.1 Data Needs
The study found that non-profit institutions not only need
more data, they need more meaningful data. Although the
organizations were often aware of general data sources that
were available, such as U.S. Census Bureau data, their use
of them was limited. Because many of the free sources of
data are broad in scope and vary in granularity, it is hard for
these institutions to get targeted information that is relevant
to their institutional goals. The types of data these
organizations reported needing were:
• Targeted demographic data;
• Neighborhood-level data;
• Service supply and demand assessments; and
• Information on individuals who would be likely
  to donate funds.
Institutions with a narrower scope, such as housing support
and early literacy programs, need demographic information
specific to their audience, such as youth or people with
disabilities. Organizations focused on addressing local
issues at the micro-level need data for specific, often
idiosyncratically defined, regions. Many respondents
recognized that while access to data was important, having
the time, knowledge, and skills necessary to map the
available data to specific decisions, actions, or needs was a
critical gap.
Some of the challenges inherent with obtaining data for
small, locally focused organizations include obtaining the
initial data and keeping it updated, reaching the right
audience, selecting and applying appropriate analytic
methods, and making the best use of limited time and staff
available to process and interpret the available data.
Although these institutions reported having access to pools
of public data, these sources were often not targeted to the
institution’s needs. Therefore, many organizations collected
their own data through community meetings, budget or
strategic plans, their own historical data and demographic
studies, visits to similar centers, or benchmark studies. As a
result, there is a need to add value and relevance to existing
available datasets to better meet the data needs of the
institutions.
4.2 Building Capacity
A common theme that emerged was the need to develop the
ability of communities in general, and libraries in particular,
to build, use, and maintain their capability to make use of
local data. Specifically, respondents noted that their
organizations and communities would benefit from efforts to
enhance their:
• Data infrastructure. There is a need to create,
  curate, and manage local datasets. These can be
  subsets of national datasets (e.g., Census, Centers
  for Disease Control, Education, etc.) that are
  disaggregated at local levels, local datasets that
  focus on domain areas (e.g., housing, health), or
  regional/state datasets. Data infrastructure needs
  to include coordinating mechanisms, metadata
  standards, and other features that facilitate access
  and use of these datasets.
• Data portals. These can take a number of forms,
  but there is a need to create coordinated and
  centralized data portals that provide stakeholders
  with ready access to datasets, data dictionaries,
  information on metadata standards, and other
  features that ensure a place to store data, access
  to data, and data management techniques to
  ensure currency and reliability of datasets.
• Workshops. There is a clear need to host a range
  of workshops to bring stakeholder communities
  together and develop a range of skills such as
  data management and curation skills, data
  development and collection, data use, data
  analytics, visualizations, and the like. These may
  evolve into hackathons and connect to larger
  hacking events such as the National Day of Civic
  Hacking (http://hackforchange.org/).
The presence of these capabilities was also influenced by the
size of the organization, the size of their network, and the
numbers of different programs/services they offered.
However, no matter what their size and position in the
community, respondents consistently identified the need for
additional capacity in these areas.
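To make the "Data infrastructure" need above more concrete, the following Python sketch illustrates one way a local subset might be drawn from a national, tract-level table and paired with a simple metadata record of the kind a data portal could publish. It is an illustration only and was not part of the study: the table, its column names, and all values are hypothetical, and the only convention relied upon is that an 11-digit Census tract identifier begins with the 2-digit state and 3-digit county codes.

```python
import csv
import io

# Hypothetical extract of a national, tract-level table. The column names
# (geoid, population, households) and all values are invented for illustration.
NATIONAL_CSV = """geoid,population,households
24033800101,4100,1500
24033800102,3900,1400
24031700101,5200,2100
06075010100,3800,1900
"""


def local_subset(rows, county_fips):
    """Keep only rows whose tract identifier starts with the county code."""
    return [row for row in rows if row["geoid"].startswith(county_fips)]


def describe(rows, source, county_fips):
    """Build a minimal metadata record to publish alongside the subset."""
    return {
        "source": source,
        "geography": f"county {county_fips}",
        "row_count": len(rows),
        "fields": sorted(rows[0].keys()) if rows else [],
    }


rows = list(csv.DictReader(io.StringIO(NATIONAL_CSV)))
subset = local_subset(rows, "24033")  # hypothetical county of interest
metadata = describe(subset, "hypothetical national tract table", "24033")
total_population = sum(int(row["population"]) for row in subset)

print(metadata)
print(total_population)
```

Even in this small form, the sketch suggests why respondents paired datasets with data dictionaries and metadata standards: the subset is only reusable by other organizations if its source, geography, and fields travel with it.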
4.3 Demonstration
Participants and interviewees identified the need for data,
but in some cases were unsure of what data, how it could be
used effectively, and ways to demonstrate impact of the data
collected. From these responses, it was clear that there is a
need for:
• Identification of best practices. The collection of
  examples and best practices of community
  organizations’, libraries’, and other stakeholders’
  use of data, data tools, and the building of local
  data infrastructures would provide clear paths for
  libraries and communities to follow as they
  engage in local data infrastructure development.
• Demonstration projects. Pilot projects in differing
  communities, combined with workshops, can
  facilitate and promote dialog among key
  constituencies and further develop libraries as
  central figures in the building of local data
  infrastructures.
• Seed funding. There is an opportunity to create
  small amounts of funds for which community
  organizations and libraries can qualify in order to
  focus on local data infrastructure projects.
In combination, these efforts can foster innovation while
simultaneously building capacity and community data
infrastructures.
4.4 Building Community
The study identified the need for community building. More
specifically, the study brought to light the need to:
• Bring together stakeholder communities. Civil
  society, researchers, non-profit organizations,
  and libraries often intersect and collaborate. They
  often do not, however, in the area of data
  infrastructures. Each of these constituencies can
  play important roles in data development, use,
  and impact, but each often operates
  independently. Creating a holistic community
  around data can create a much more robust local
  data infrastructure, and libraries can play a
  pivotal role as facilitator and convener.
• Identify critical roles and capabilities. By
  bringing together key data and community
  stakeholders, libraries can help map local data
  infrastructures, capabilities, and needs. Doing so
  can help address and develop community needs
  for coordinated data collection, management,
  storage, availability, and use.
• Create a central coordinating mechanism. It was
  clear that none of the institutions that participated
  in the study had the capacity to “do it all” – that
  is, create foundational data infrastructure, build
  data curation capacity, and develop the skills
  required for high-impact analytics. Having a
  central and neutral party for housing and
  coordinating local data was seen as an important
  community need – and one that at least some
  participants thought that libraries could fulfill.
Thus the findings suggest that collaboration and
coordination can benefit individual stakeholders and the
community at large.
5. CONCLUSION
Although much attention is paid to Big Data initiatives at
the national level and in large cities, there are many
questions regarding how to create Big Data activities in
smaller jurisdictions. Although subject to important caveats
and limitations, these findings show the need to build a
community of practice and advocacy focused on local data,
and they highlight important challenges such communities
face in engaging in data initiatives.
However, additional research is necessary to explore the
topic of local data infrastructures and the roles of libraries in
community data. Specifically, there is a need to gather more
data from a wider variety of organizations in different
geographic areas in order to see if the results can be
generalized. This would also include developing a standard
lexicon that everyone could use to refer to this type of
information. This research can be used to help create a
framework of events that libraries could host to act more like
data platforms and, in essence, be their community’s
one-stop shop. In other words, the ultimate goal is to create a
working theoretical framework for these communities to use
data, and also to create a model for data infrastructure that
could be universally applied.
References
[1] Bertot, J.C., & Choi, H. (2013). Big data and
e-government: issues, policies, and recommendations. In
Proceedings of the 14th Annual International Conference on
Digital Government Research (dg.o '13). ACM, New York,
NY, USA, 1-10.
[2] Bertot, J. C., Gorham, U., Jaeger, P. T. & Sarin, L.C.
(forthcoming). Big Data, Libraries, and the Information
Policies of the Obama Administration. The Bowker Annual.
[3] Bertot, J. C., Jaeger, P. T., Gorham, U., Taylor, N. G., &
Lincoln, R. (2013). Delivering e-government services and
transforming communities through innovative partnerships:
Public libraries, government agencies, and community
organizations. Information Polity, 18, 127-138.
[4] Bertot, J. C., Jaeger, P. T., & Grimes, J. M. (2012).
Promoting transparency and accountability through ICTs,
social media, and collaborative e-government.
Transforming Government: People, Process and Policy,
6(1), 78-91.
[5] Bertot, J. C., Jaeger, P. T., & Grimes, J. M. (2010).
Using ICTs to create a culture of transparency?:
E-government and social media as openness and
anti-corruption tools for societies. Government Information
Quarterly, 27(3), 264-271.
[6] Bertot, J. C., Jaeger, P. T., Munson, S., & Glaisyer, T.
(2010). Engaging the public in open government: The
policy and government application of social media
technology for government transparency. IEEE Computer,
43(11), 53-59.
[7] Bollier, D. (2010). The Promise and Peril of Big Data.
Washington, DC: Aspen Institute. Available at
http://ilmresource.com.
[8] Boyd, D., & Crawford, K. (2012). Critical questions for
Big Data. Information, Communication & Society, 15(5),
662-679.
[9] Cuillier, D., & Piotrowski, S. J. (2009). Internet
information-seeking and its relation to support for access to
government records. Government Information Quarterly,
26(3), 441-449.
[10] Frederikson, L. (2012). Big Data. Public Services
Quarterly, 8(4), 345-349.
[11] Jaeger, P. T., & Bertot, J. C. (2010). Transparency and
technological change: Ensuring equal and sustained public
access to government information. Government
Information Quarterly, 27(4), 371-376.
[12] Jaeger, P. T., Bertot, J. C., & Shilton, K. (2012).
Information policy and social media: Framing
government-citizen Web 2.0 interactions. In C. G. Reddick & S. K.
Aikins (Eds.), Web 2.0 Technologies and Democratic
Governance: Political, Policy and Management
Implications (pp. 11-25). London: Springer.
[13] Jaeger, P. T., Paquette, S., & Simmons, S. N. (2010).
Information policy in national political campaigns: A
comparison of the 2008 campaigns for President of the
United States and Prime Minister of Canada. Journal of
Information Technology & Politics, 7(1), 1-16.
[14] Little, G. (2012). Managing the data deluge. Journal of
Academic Librarianship, 38, 263-264.
[15] MeriTalk. (2013). The State and Local Big Data Gap.
Alexandria, VA: MeriTalk. Available at:
http://www.meritalk.com/state-and-local-big-data.php.
[16] Obama, B.H. (2013, May 9). Executive Order 13642:
Making Open and Machine Readable the New Default for
Government Information. Washington, DC: Office of the
Executive. Available at:
http://www.gpo.gov/fdsys/pkg/FR-2013-05-14/pdf/2013-11533.pdf.
[17] Office of Science and Technology Policy. (2011). The
Open Government Partnership: National action plan for the
United States of America. Washington, DC: Office of
Science and Technology Policy. Available at:
http://www.whitehouse.gov/sites/default/files/us_national_action_plan_final_2.pdf.
[18] Paquette, S., Jaeger, P. T., & Wilson, S. C. (2010).
Identifying the risks associated with governmental use of
cloud computing. Government Information Quarterly,
27(3), 245-253.
[19] Shuler, J.A., Jaeger, P.T., & Bertot, J.C. (2014).
E-government without government. Government Information
Quarterly, 31(1), 1-3.
[20] The Economist. (2012, October 27). Special report on
technology and geography: A sense of place. The Economist,
405(8808): 1-22.
[21] The Seattle Foundation. (2006). A Healthy Community:
What You Need to Know to Give Strategically. Seattle, WA:
The Seattle Foundation. Available at:
http://www.seattlefoundation.org/aboutus/Documents/10029170_HCReport_web.pdf.
[22] U.S. Census Bureau. (2011). City and Town Totals:
Vintage 2011. Available at:
http://www.census.gov/popest/data/cities/totals/2011/index.html.
[23] U.S. Institute of Museum and Library Services. (2013).
Public Libraries in the United States Survey: Fiscal Year
2010. Washington, DC: U.S. Institute of Museum and
Library Services. Available at:
http://www.imls.gov/assets/1/AssetManager/PLS2010.pdf.
[24] Yin, R. (2013). Case Study Research: Design and
Methods (5th Ed). Thousand Oaks, CA: Sage Publications,
Inc.