Understanding and Implementing the PREMIS Data Dictionary for Preservation Metadata
Rebecca Guenther, Network Development & MARC Standards Office, Library of Congress

Preservation Metadata
Preservation metadata includes:
• Provenance: Who has had custody/ownership of the digital object?
• Authenticity: Is the digital object what it purports to be?
• Preservation Activity: What has been done to preserve it?
• Technical Environment: What is needed to render and use it?
• Rights Management: What IPR must be observed?
Makes digital objects self-documenting across time
(Diagram labels: Content; 10 years on; 50 years on; Forever!)

PREMIS Data Dictionary
• May 2005: Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group
• March 2008: PREMIS Data Dictionary for Preservation Metadata, version 2.0 (version 2.1, Jan. 2011)
  • Includes PREMIS Data Dictionary, context/assumptions, data model, usage examples
  • XML schema to support implementation
• Data Dictionary:
  • Comprehensive view of information needed to support digital preservation
  • Guidelines/recommendations to support creation, use, management
  • Based on deep pool of institutional experiences in setting up and managing operational capacity for digital preservation
• http://www.loc.gov/standards/premis/v2/premis-2-0.pdf

What does PREMIS cover?
• Administrative metadata that supports the digital preservation process
• Provides information to help manage a resource for preservation purposes:
  • Technical characteristics
  • Information about actions on an object
  • Relationships (structural and derivative)
    • Structural: indicates how compound objects are put together
    • Derivative: results of common preservation actions
  • Rights metadata associated with preservation
• In OAIS terms:
  • Metadata as part of SIP, AIP or DIP
  • Fits into Preservation Description Information (Reference, Context, Provenance, Fixity)

What PREMIS is and is not
What PREMIS is:
• Common data model for organizing/thinking about preservation metadata
• A checklist for core metadata in a repository
• Guidance for local implementations
• Standard for exchanging information packages between repositories
What PREMIS is not:
• Out-of-the-box solution: need to instantiate as metadata elements in repository system
• All needed metadata: excludes business rules, format-specific technical metadata, descriptive metadata for access, non-core preservation metadata
• Lifecycle management of objects outside repository
• Rights management: limited to permissions regarding actions taken within repository

PREMIS Data Model
(Diagram of the five entities: Intellectual Entities, Rights Statements, Agents, Objects, Events)

Intellectual Entities
Examples:
• Rabbit Run by John Updike (a book)
• “Maggie at the beach” (a photograph)
• The Library of Congress Website (a website)
• The Library of Congress: American Memory Home page (a web page)
Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)
• May include other Intellectual Entities (e.g.
a website that includes a web page)
• **Has one or more digital representations**
• Previously not fully described in PREMIS DD, but will be in scope in version 3.0

Objects
Discrete unit of information in digital form
**Objects are what repository actually preserves**
Three types of Object:
• FILE: named and ordered sequence of bytes that is known by an operating system
• REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity
• BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be stand-alone file)
• Intellectual entity will become another level of object
Examples:
• chapter1.pdf (a file)
• chapter1.pdf + chapter2.pdf + chapter3.pdf (representation of a book with 3 chapters)
• TIFF file containing header and 2 images (2 bitstreams (images), each with own set of properties (semantic units): e.g., identifiers, technical metadata, inhibitors, …)

Object Example: book in two versions
Intellectual Entity: Da Vinci Code by Dan Brown
• Representation 1: Page image version
  • File 1: page1.tiff
  • File 2: page2.tiff
  • File N: pageN.tiff
  • File N+1: METS.xml
• Representation 2: ebook version
  • File 1: book.lit

Events
Examples:
• Validation Event: use JHOVE tool to verify that chapter1.pdf is a valid PDF file
• Ingest Event: transform an OAIS SIP into an AIP
• Migration Event: create a new version of an Object in an up-to-date format
An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository
• Helps document digital provenance.
• Can track history of Object through the chain of Events that occur during the Object's lifecycle
• Determining which Events are in scope is up to the repository (e.g., Events which occur before ingest, or after de-accession)
• Determining which Events should be recorded, and at what level of granularity, is up to the repository

Agents
Examples:
• Martha Anderson (a person)
• Library of Congress (an organization)
• Dark Archive in the Sunshine State implementation (a system)
• JHOVE version 1.0 (a software program)
Person, organization, or software program/system associated with an Event or a Right (permission statement)
• Agents are associated with Objects only indirectly, through Events or Rights
• Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification

Rights Statements
Example:
• Priscilla Caplan grants FCLA digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.
An agreement with a rights holder that grants permission for the repository to undertake one or more actions associated with one or more Objects in the repository.
• Not a full rights expression language; focuses exclusively on permissions that take the form:
  • Agent X grants Permission Y to the repository in regard to Object Z.

Technical metadata pertaining to objects
• Object identifier
• Preservation level
• Significant characteristics
• Object characteristics
  • fixity
  • format
  • size
  • creating application
  • inhibitors
  • object characteristics extension
• Creating application
• Original name
• Storage
• Environment
  • software
  • hardware
• Digital signatures
• Relationships
• Linking event identifier
• Linking permission statement identifier

Semantic units pertaining to Events: provenance and preservation activity
• Event identifier
• Event type (e.g.
capture, creation, validation, migration, fixity check)
• Event dateTime
• Event detail
• Event outcome
• Event outcome detail
• Linking agent identifier
• Linking object identifier

Semantic units pertaining to Rights
• Rights Statement
  • Rights Statement Identifier
  • Rights Basis
  • Copyright Information
  • License Information
  • Statute Information
  • Rights Granted
    • act
    • restriction
    • termOfGrant
    • rightsGranted
  • Linking Object Identifier
  • Linking Agent Identifier
• rightsExtension

Semantic units pertaining to Agents
• Agent Identifier
• Agent Name
• Agent Type
• Agent Note
• Agent Extension
• Linking Event Identifier
• Linking Rights Identifier

PREMIS timeline
(Timeline graphic, 2002 to 2011. Recoverable milestones: Metadata Framework for Digital Preservation, 2002; PREMIS Working Group formed, 2003; PREMIS Data Dictionary released; UK Digital Preservation Award; Maintenance Activity formed; PREMIS Editorial Committee formed; PREMIS 2.0 released; PREMIS Implementation Fairs; PREMIS 2.1 released)

The State of PREMIS
• De facto standard for preservation metadata; in some countries mandated for cultural heritage repositories
• PREMIS implementations are appearing in many places, many contexts, many forms
• Some experimentation is leading to changes in the data dictionary and schema
• PREMIS Implementation Fairs: attempts to consolidate implementation experiences, issues, best practices

PREMIS Maintenance Activity
Web site:
• Permanent Web presence, hosted by Library of Congress
• Central destination for PREMIS-related info, announcements, resources
• Home of the PREMIS Implementers’ Group (PIG) discussion list
PREMIS Editorial Committee:
• Set directions/priorities for PREMIS development
• Coordinate future revisions of Data Dictionary and XML schema
• Promote implementation
http://www.loc.gov/standards/premis/

PREMIS Editorial Committee membership
• Rebecca Guenther, Chair (Library of Congress)
• Yair Brama (ExLibris)
• Karin Bredenberg (Riksarkivet, Swedish National Archives)
• Priscilla Caplan (Florida Center for Library Automation)
• Angela Dappert (British
Library)
• Angela Di Iorio (Fondazione Rinascimento Digitale)
• Markus Enders (British Library)
• Karsten Huth (Sächsisches Staatsarchiv)
• David Lake (US National Archives and Records Administration)
• Brian Lavoie (OCLC)
• Sébastien Peyrard (Bibliothèque nationale de France)
• Robert Sharpe (Tessella)
• Sally Vermaaten (Statistics New Zealand)
• Robert Wolfe (MIT/DSpace)
• Kate Zwaard (US Government Printing Office)

PREMIS activities
Integration with other standards and efforts
• Survey of PREMIS in METS profiles (D-Lib Magazine, Sept 2010) http://www.dlib.org/dlib/september10/vermaaten/09vermaaten.html
• Extensibility: Add elements about extensions as in METS
• US intelligence community extending for security classification
PREMIS Documentation
• Understanding PREMIS: Priscilla Caplan (2009)
  • Gentle introduction to the PREMIS standard
  • Spanish, German and Italian translations
• PREMIS Data Dictionary for Preservation Metadata version 2.0: translation in Japanese and Spanish
Workflows and registries
• PREMIS Tools to facilitate automated workflows: PREMIS in METS toolkit made available as open source
• PREMIS controlled vocabularies in id.loc.gov
• PREMIS OWL Ontology in development, soon to be released

Some implementers …
• DAITSS (Florida): a preservation repository for the use of the libraries of the public universities of Florida.
• Ex Libris Rosetta: a commercial digital preservation system supporting acquisition, validation, ingest, storage, management, preservation and dissemination of different types of digital objects
• National Digital Newspaper Program
• Archivematica: comprehensive open-source digital preservation system
• National Archives of Sweden, National Archives of Scotland
• Carolina Digital Repository: repository for material in electronic formats produced by members of the University of North Carolina at Chapel Hill community.
• British Library electronic journal archiving project
For more information see:
• http://www.loc.gov/premis/premis-registry.html

What does it mean to implement PREMIS?
• You are keeping preservation metadata that is defined in the PREMIS data dictionary as information you need to know to preserve digital objects
• There can be a phased approach to implementation in terms of which PREMIS entities to implement
• Most values can be extracted from the object or generated by a repository
• You don’t have to control all levels of objects; some may only manage files, not representations or bitstreams
• If you aren’t already, you should be planning to track actions on objects for future preservation activities (PREMIS events)
• You may or may not store data using METS as a container, but it is useful as a standard exchange package (SIP or DIP)
• PREMIS conformance statement was developed and is available

PREMIS in METS toolbox
• Developed by Florida Center for Library Automation under contract with LC
• Uses PREMIS in METS guidelines
• A set of open-source tools to support the implementation of PREMIS, especially in the METS container format
• 3 components: validate, convert, describe
• Source code available: http://pimtoolbox.sourceforge.net
Describe: uses the DAITSS description service
(Diagram labels: /a/real/file; droid/jhove; <premis> <ext> </premis>)
Convert: PREMIS to PREMIS in METS, or PREMIS in METS to PREMIS
(Diagram labels: <premis/>; xslt; <mets> <premis> </mets>)
Validate: PREMIS in METS document
(Diagram labels: <mets> <premis/> </mets>; Schematron; confirmation or errors)

Tools continued
Id.loc.gov
• Preservation events
• Preservation level role
• Cryptographic hash functions
• Additional vocabularies to be included soon

Conclusions
• PREMIS Data Dictionary provides a critical piece of a reliable digital preservation infrastructure composed of technology, standards, and best practice
• PREMIS was produced from an international, cross-domain, consensus-building process and is applicable to any preservation effort
• PREMIS Data Dictionary is a
building block with which effective, sustainable digital preservation strategies can be implemented
• PREMIS Data Dictionary and the Maintenance Activity are tightly focused on implementation
• Preservation metadata will be crucial for the future even if it doesn’t enhance current access

URLs, etc.
• PREMIS Maintenance Activity: http://www.loc.gov/standards/premis/
• PREMIS Data Dictionary for Preservation Metadata: http://www.loc.gov/standards/premis/v2/premis-2-1.pdf
• PREMIS Implementation Registry: http://www.loc.gov/standards/premis/premis-registry.php
• PREMIS Implementers Group list: http://listserv.loc.gov/listarch/pig.html
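
Worked sketch: linking Objects, Events and Agents
The data model described earlier (Objects linked to Events, and Events linked to Agents, by identifiers) can be illustrated with a small Python sketch. This is not the PREMIS XML schema or any PREMIS tool: the class names, field names (loosely modeled on the semantic units listed in the slides), the identifiers, and the choice of SHA-256 for fixity are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Agent:
    # Hypothetical fields modeled on: Agent Identifier, Agent Name, Agent Type
    agent_identifier: str
    agent_name: str
    agent_type: str


@dataclass
class FileObject:
    # Hypothetical fields modeled on: Object identifier, Original name,
    # and the size and fixity object characteristics
    object_identifier: str
    original_name: str
    size: int
    fixity_algorithm: str
    fixity_digest: str
    linking_event_identifiers: List[str] = field(default_factory=list)


@dataclass
class Event:
    # Hypothetical fields modeled on the Event semantic units
    event_identifier: str
    event_type: str
    event_date_time: str
    event_outcome: str
    linking_agent_identifier: str
    linking_object_identifier: str


def describe_file(object_id: str, name: str, content: bytes) -> FileObject:
    """Record size and a SHA-256 digest for a file's bytes (assumed algorithm)."""
    digest = hashlib.sha256(content).hexdigest()
    return FileObject(object_id, name, len(content), "SHA-256", digest)


def fixity_check(event_id: str, obj: FileObject, content: bytes, agent: Agent) -> Event:
    """Recompute the digest, record the outcome as an Event, and link it to the Object."""
    actual = hashlib.sha256(content).hexdigest()
    outcome = "pass" if actual == obj.fixity_digest else "fail"
    event = Event(
        event_identifier=event_id,
        event_type="fixity check",
        event_date_time=datetime.now(timezone.utc).isoformat(),
        event_outcome=outcome,
        linking_agent_identifier=agent.agent_identifier,
        linking_object_identifier=obj.object_identifier,
    )
    # The Object keeps a list of Event identifiers, so its history can be traced.
    obj.linking_event_identifiers.append(event.event_identifier)
    return event


if __name__ == "__main__":
    checker = Agent("agent-1", "example fixity checker", "software")
    data = b"example bytes standing in for chapter1.pdf"
    obj = describe_file("obj-1", "chapter1.pdf", data)
    event = fixity_check("evt-1", obj, data, checker)
    print(event.event_type, event.event_outcome, obj.linking_event_identifiers)
```

Note that the Agent is reached only through the Event, matching the point that Agents are associated with Objects indirectly. A real repository would serialize these records according to the PREMIS schema rather than keep them as in-memory objects.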