Cultural Heritage Institutions and Big Data Collections

Leslie Johnston
Chief of Repository Development
Library of Congress

Cultural heritage organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.” Now it’s also data.

Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. We have Big Data in our Libraries, Archives and Museums.

Like other cultural heritage organizations, the Library of Congress has as one of its mandates that it make its collections freely available, whether in person or on the web.

What are some Library of Congress examples of collecting and preserving large-scale collections in many formats, and making them usable as collections and as data?

National Digital Newspaper Program
chroniclingamerica.loc.gov/

This collection was transformative for the Library of Congress: it was the first to be made available as a bulk download and exposed as a text and image dataset. Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze the full collection.

The program has:
• Multiple producers (36 now, ultimately 54)
• Free and open public access APIs for machine access and automated processes, including access to RDF linked data
• Over 6.7 million newspaper pages ingested to date
• Over 250 TB of data

Web Archives
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices
• Collections around an area of study, such as Legal “Blawgs”

We frequently receive requests for access to full collections for full-text data mining.
• Every format possible on the web
• Almost 8 billion files
• Over 425 TB

congress.gov

Congress.gov is still in its beta phase, transforming congressional information discovery. It includes:
• Legislation from 1993 to the present
• The Congressional Record from 1995 to the present
• Committee Reports from 1995 to the present
• Member profiles from 1973 to the present (with some from 1947 to 1972)

The Twitter Archive

Every public tweet since Twitter’s launch in March 2006. Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. The collection comprises only a few TB, but hundreds of billions of tweets.

A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/

social science visualization social media status events personal commercial privacy

Research Datasets

Research datasets are created by faculty, curators, researchers, and federal and state agencies. It is not enough to collect publications; we must collect the datasets that support the published work, to allow for replicability and reuse in research. The Library is now planning to expand its collections to preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.

And the full breadth of the Library’s Collections
• The American Memory collection, one of the oldest and most used digital collections on the web.
• The oral histories of the Veterans History Project.
• The audio and video collections of the American Folklife Center.
• More than 1.2 million images from Prints and Photographs.
• Digitized maps and GIS data from Geography and Maps.
• More than 300,000 digitized audio and video files comprising over 5 PB at the Packard Campus.
• And many, many, many more.

id.loc.gov

The Library of Congress is, in part, a standards agency for the rules used to create metadata records and for the controlled vocabularies (authorities) used to describe items. The Library is gradually making its vocabularies available as serialized RDF datasets (SKOS and JSON). In the library community, the LC authorities are one of the most common tools for building linked data relationships.

What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for use?
• Sheer amount.
• Huge variation in file formats.
• Unclear and undocumented rights.
• Security.
• Missing metadata.
• Data citation and identifier issues.
• Discovery expectations: discovery across collections and institutions together.
• Cost.

I will mention infrastructure only in passing. There are scale issues related to:
• Storage
• Archiving
• Bandwidth
• Software development
• Staffing for processing

This Requires a Preservation Infrastructure
• The Library developed the BagIt transfer specification for the movement of files between and within organizations. http://www.digitalpreservation.gov/documents/bagitspec.pdf
• The Library inventories incoming files, and is gradually inventorying all digital content.
• The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations.
• The Library has documented sustainability factors for file formats. http://www.digitalpreservation.gov/formats/
• For cases where we do have control over content we receive, we have a “Best Edition” Preferred Formats statement, which is currently being updated.
• http://www.copyright.gov/circs/circ07b.pdf

New researcher uses and expectations bring many new activities to plan for.

We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure needed to support Big Data services.

Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve. Researchers are taking collections as data away to work with in their own computational environments.

This is a shift away from recent service models where libraries built out and housed lab spaces for specialized activities such as text mining and geospatial modeling and provided staff to assist in acquiring and manipulating data.

More and more researchers want to use one or more collections as a whole, mining and organizing the information in novel ways. Researchers now use desktop computing power that was once unimaginable to mine this rich information, and use tools to create visualizations that translate it into knowledge.

Should collections be pre-processed, before they are ingested, to create a variety of derivatives that might be used in various forms of analysis? Or do we limit access to the native format? Or do we put on-the-fly format transformation services for downloads in place?

We are beginning to put into place the infrastructure needed to create full-text indexes for millions or billions of items to support full discovery for researchers. We are only just starting the process of generating linked data representations of billions of items.

Cultural heritage institutions are increasingly looking towards self-service: researchers need not ask to download or tell us that they have. We may never know.

BUT … we do have collections that are limited to onsite-only access due to licenses or gift agreements.
In that case, libraries may have to consider providing high-powered workstations with analytical tools so that researchers can work with these collections onsite and take analysis outputs away with them.

Both service models have policy implications and implications for public service staffing. But the benefits outweigh the challenges.

Cultural heritage institutions are managing and preserving the datasets and big data necessary for reuse and replicability. We are working to make the deposit and management of such data easier to accomplish. This is an important new role for our organizations in enabling new research.

Discussion…

Leslie Johnston
lesliej@loc.gov
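
Supplementary sketch: the BagIt-style packaging and file inventorying mentioned under the preservation infrastructure can be illustrated roughly as follows. This assumes a SHA-256 payload manifest and uses only the Python standard library. The function names make_bag and verify_bag are hypothetical, the version string written to bagit.txt is an assumption, and this illustrates the general bag layout (a data directory plus a declaration and a checksum manifest) rather than the Library's own tooling.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into bag_dir/data and write a declaration and manifest."""
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)  # also creates bag_dir if missing

    # Bag declaration; the version number used here is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<checksum>  <path relative to bag>".
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")

    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("".join(f"{line}\n" for line in lines), encoding="utf-8")
    return manifest


def verify_bag(bag_dir: Path) -> list[str]:
    """Return payload paths that are missing or no longer match the manifest."""
    failures = []
    manifest_text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            failures.append(rel_path)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            failures.append(rel_path)
    return failures
```

A receiving organization could call verify_bag after a transfer; an empty list means every payload file is present and still matches its recorded checksum, which is the kind of check an inventory of incoming files relies on.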