Cultural Heritage Institutions
and Big Data Collections
Leslie Johnston
Chief of Repository Development
Library of Congress
Cultural Heritage organizations
have, until recently, spoken of
“collections” and “content” and
“records” and even “files.”
Now it’s also data.
Data is not just generated by satellites,
identified during experiments, or collected
during surveys.
Datasets are not just scientific and business
tables and spreadsheets.
We have Big Data in our Libraries, Archives
and Museums.
Like other cultural heritage
organizations, the Library of
Congress has as one of its
mandates that it make its
collections freely available,
whether that is in person or on
the web.
What are some Library of
Congress examples of
collecting and preserving large
scale collections in many
formats, and making them
usable as collections and as
National Digital
Newspaper Program
This collection was transformative for the Library of Congress:
it was the first to be made available as a bulk download
and exposed as a text and image dataset.
Some researchers want to search for stories in historic
newspapers. Some researchers want to mine newspaper OCR
for trends across time periods and geographic areas.
Requests have come in to analyze the full collection.
The program has:
• Multiple producers (36 now, ultimately 54)
• Free and open public access
• APIs for machine access and automated processes,
including access to RDF linked data.
Over 6.7 million newspaper pages ingested to date
Over 250 TB of data
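As a rough illustration of the kind of trend mining researchers ask for, the sketch below counts a search term in newspaper OCR text, grouped by year and by state. The sample records and the field names (`date`, `state`, `ocr`) are assumptions made up for this example; they are not the program's actual schema or data.

```python
import re
from collections import Counter

# Hypothetical sample records. The field names and the text are
# illustrative assumptions, not real data from the newspaper program.
PAGES = [
    {"date": "1898-04-22", "state": "New York",
     "ocr": "War declared. The fleet sails; war news floods the city."},
    {"date": "1898-05-02", "state": "Ohio",
     "ocr": "Crops are late this year. War bulletin on page two."},
    {"date": "1912-04-16", "state": "New York",
     "ocr": "Titanic sinks. Survivors reach port."},
]


def term_counts(pages, term, key):
    """Count whole-word, case-insensitive matches of term, grouped by key(page)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for page in pages:
        counts[key(page)] += len(pattern.findall(page["ocr"]))
    return counts


def by_year(page):
    """Group by the four-digit year at the start of an ISO date string."""
    return page["date"][:4]


def by_state(page):
    """Group by the state of publication."""
    return page["state"]


war_by_year = term_counts(PAGES, "war", by_year)
war_by_state = term_counts(PAGES, "war", by_state)
print(dict(war_by_year))
print(dict(war_by_state))
```

Real OCR is noisy, so an actual study would likely need fuzzy matching and normalization by the number of pages per year; this sketch skips both.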
Web Archives
The Library has been archiving the web since 2000. Subject area
specialists curate the collections, and Library catalogers create
collection-level metadata records.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the
Philippines, the Iraq war, and the appointment of Supreme
Court Justices.
• Collections around an area of study, such as Legal
We frequently receive requests for access to full collections for
full-text data mining.
Every format possible on the web
Almost 8 billion files
Over 425 TB
is still in its beta phase,
transforming congressional
information discovery.
Legislation from 1993 to the present,
The Congressional Record from
1995 to the present, Committee
Reports from 1995 to the present,
and Member profiles from 1973 to
the present (with some from 1947 to
The Twitter Archive
Every public tweet since Twitter’s launch in March
Research requests have included users looking for
their own Twitter history, the study of the
geographic spread of news, the study of the
spread of epidemics, and the study of the
transmission of new uses of language.
The collection comprises only a few TB, but hundreds of
billions of tweets.
A White Paper is available online at:
social media
Research Datasets
Research datasets are created by
faculty, curators, researchers, and
federal and state agencies.
It is not enough to be collecting
publications; we must collect the
datasets that support the published
work, to allow for replicability and reuse in research.
We are now planning to expand our
collections to preserve research
data, in addition to recognizing that
the collections we already have are
Big Data to be mined.
And the full breadth of the
Library’s Collections
The American Memory collection, one of the oldest
and most used digital collections on the web.
The oral histories of the Veterans History Project.
The audio and video collections of the American
Folklife Center.
More than 1.2 million images from Prints and
Digitized maps and GIS data from Geography and
More than 300,000 digitized audio and video files
comprising over 5 PB at the Packard Campus.
And many, many, many more.
The Library of Congress is, in part, a
standards agency for rules used to
create metadata records and for the
controlled vocabularies (authorities)
used to describe items.
The Library is gradually making its
vocabularies available as serialized
RDF datasets (SKOS and JSON).
In the library community, the LC
authorities are one of the most
common tools for building linked
data relationships.
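To make the idea of vocabularies as linked data concrete, here is a minimal sketch that turns a few concept records into subject-predicate-object triples using SKOS property names. The `example.org` URIs and the flat JSON layout are assumptions for illustration only; the Library's published serializations are structured differently. Only the SKOS namespace and the property names (`prefLabel`, `altLabel`, `broader`) come from the SKOS standard.

```python
import json

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical, simplified concept records. The example.org URIs and this
# flat JSON shape are assumptions, not the Library's actual data format.
SAMPLE = json.loads("""
[
  {"@id": "http://example.org/vocab/newspapers",
   "prefLabel": "Newspapers",
   "altLabel": ["Daily papers"],
   "broader": ["http://example.org/vocab/periodicals"]},
  {"@id": "http://example.org/vocab/periodicals",
   "prefLabel": "Periodicals",
   "altLabel": [],
   "broader": []}
]
""")


def to_triples(concepts):
    """Flatten concept records into (subject, predicate, object) triples."""
    triples = []
    for concept in concepts:
        subject = concept["@id"]
        triples.append((subject, SKOS + "prefLabel", concept["prefLabel"]))
        for alt in concept["altLabel"]:
            triples.append((subject, SKOS + "altLabel", alt))
        for parent in concept["broader"]:
            triples.append((subject, SKOS + "broader", parent))
    return triples


def broader_labels(concepts, uri):
    """Return preferred labels of the concepts one step broader than uri."""
    labels = {c["@id"]: c["prefLabel"] for c in concepts}
    concept = next(c for c in concepts if c["@id"] == uri)
    # Fall back to the raw URI when the broader concept is not in the set.
    return [labels.get(parent, parent) for parent in concept["broader"]]


triples = to_triples(SAMPLE)
parents = broader_labels(SAMPLE, "http://example.org/vocab/newspapers")
print(len(triples), parents)
```

A catalog record that stores the concept URI rather than the label string can follow `broader` links like these to related records, which is the relationship-building role described above.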
What are some of the
technological challenges of
managing and preserving
large digital collections in
many formats, and making
them available for use?
Sheer amount.
Huge variation in file formats.
Unclear and undocumented rights.
Missing metadata.
Data citation and identifier issues.
Discovery expectations: discovery across collections and
institutions together.
I will mention infrastructure only in passing.
There are scale issues related to:
Software development
Staffing for processing
This Requires a Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
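A minimal sketch of what a BagIt bag looks like on disk, using only the Python standard library: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. The helper names `make_bag` and `validate_bag` and the sample payload are hypothetical, and the sketch omits optional parts of the specification such as `bag-info.txt` and tag manifests. It is not the Library's own tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write payload (relative path -> bytes) as a minimal BagIt bag."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel, content in sorted(payload.items()):
        target = data / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel}")
    # Bag declaration required by the specification.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload manifest: one "checksum path" line per payload file.
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def validate_bag(bag_dir):
    """Return a list of problems found when re-checking the manifest."""
    bag = Path(bag_dir)
    problems = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel = line.split(None, 1)
        path = bag / rel
        if not path.is_file():
            problems.append(f"missing file: {rel}")
        elif hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            problems.append(f"checksum mismatch: {rel}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(
        Path(tmp) / "sample_bag",
        {"page1.txt": b"OCR text of page one\n",
         "images/page1.tif": b"\x49\x49\x2a\x00"},
    )
    problems_before = validate_bag(bag)
    # Simulate damage during transfer.
    (bag / "data" / "page1.txt").write_bytes(b"corrupted in transit\n")
    problems_after = validate_bag(bag)

print(problems_before)
print(problems_after)
```

The receiving organization can rerun the validation step after transfer, so damage or loss in transit is detected without contacting the sender.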
The Library inventories incoming files, and is gradually inventorying all
digital content.
The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.
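Inventories and multiple copies work together: if each storage location records a checksum per file, the copies can be audited against one another. The sketch below assumes each location's inventory is a simple mapping of file path to checksum. The location names, paths, and checksum values are invented for illustration and do not describe the Library's actual systems.

```python
from collections import defaultdict


def audit_copies(inventories):
    """Compare inventories (location -> {path: checksum}) across locations.

    Returns only the paths that are missing somewhere or whose copies
    disagree on the checksum.
    """
    all_paths = set()
    for inventory in inventories.values():
        all_paths.update(inventory)

    report = {}
    for path in sorted(all_paths):
        missing = sorted(
            loc for loc, inventory in inventories.items() if path not in inventory
        )
        by_digest = defaultdict(list)
        for loc, inventory in inventories.items():
            if path in inventory:
                by_digest[inventory[path]].append(loc)
        if missing or len(by_digest) > 1:
            report[path] = {
                "missing": missing,
                "digests": {d: sorted(locs) for d, locs in by_digest.items()},
            }
    return report


# Hypothetical inventories for three storage locations.
inventories = {
    "server_a": {"batch1/page1.tif": "aa11", "batch1/page2.tif": "bb22"},
    "server_b": {"batch1/page1.tif": "aa11", "batch1/page2.tif": "ff99"},
    "tape": {"batch1/page1.tif": "aa11"},
}

report = audit_copies(inventories)
print(report)
```

When copies disagree, a third copy or the checksum recorded at ingest is needed to decide which one is intact, which is one reason to keep more than two copies.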
The Library has documented sustainability factors for file formats.
For cases where we do have control over content we receive, we have
a “Best Edition” Preferred Formats statement, which is currently being
There are many new
activities to be planned for
with new researcher uses
and expectations.
We still have collections. But what we also have is Big Data,
which requires us to rethink the infrastructure that is needed
to support Big Data services. Our community used to
expect researchers to come to us, ask us questions about
our collections, and use our digital collections in our
Now our collections are, more often than not, self-serve.
Researchers are taking collections as data away to work
with in their own computational environments. This is a shift
away from recent service models where libraries built out
and housed lab spaces for specialized activities such as text
mining and geospatial modeling and provided staff to assist
in acquiring and manipulating data.
More and more researchers want to use one
or more collections as a whole, mining and
organizing the information in novel ways.
Researchers use what used to be
unimaginable computing power on a desktop
to mine rich information, and use tools to
create pictures that translate that information
into knowledge.
Should collections be pre-processed before ingest to
create a variety of derivatives that might be used in
various forms of analysis? Or do we
limit access to the native format? Or put on-the-fly
format transformation services for downloads in
We are beginning to put into place the infrastructure
needed to create full-text indexes for millions or billions
of items to support full discovery for researchers.
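The core data structure behind full-text discovery is an inverted index, which maps each term to the items that contain it. The toy sketch below shows the idea on three made-up documents. The document identifiers and text are assumptions, and indexing at the scale of billions of items would need a distributed search platform rather than an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


# Hypothetical items drawn from different collections.
DOCS = {
    "page-1": "The fleet sails at dawn",
    "page-2": "Fleet week begins in the city",
    "tweet-1": "City council votes at dawn",
}

index = build_index(DOCS)
print(search(index, "fleet"))
print(search(index, "city dawn"))
```

Because the index is keyed by term rather than by collection, one query can span newspapers, web archives, and tweets together, which matches the cross-collection discovery expectation noted earlier.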
We are only just starting the process of generating
linked data representations of billions of items.
Cultural heritage institutions are increasingly looking
towards self-service – researchers need not ask to
download or tell us that they have. We may never
BUT … we do have collections that are limited to onsite-only access due to licenses or gift agreements. In
that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.
Both have policy implications and implications for
public service staffing.
But the benefits outweigh
the challenges.
Cultural heritage institutions are managing
and preserving the datasets and big data
necessary for re-use and replicability.
We are working to make the deposit and
management of such data easier to
This is an important new role for our
organizations in enabling new research.
